<a href="https://www.coursera.org/learn/applied-data-science-capstone/"><img src = "https://seetorontonow.mx/uploads/2018/01/kayak-on-lake-ontario-looking-at-toronto-skyline.jpg" width = 500> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>
<h1 align=center><font size = 4>By Miguel Lira Vidrio</font></h1>
<h1 align=center><font size = 2.5>For Coursera Applied Data Science Capstone</font></h1>

## Introduction
This notebook explores Toronto neighborhoods in order to understand their characteristics and segment them into clusters.
The neighborhood data was obtained from Wikipedia (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and grouped by postal code, and geographic coordinates were added to each postal code. The Foursquare API was used to retrieve venues around each postal code's coordinates. From the collected information, a dataframe was built containing the top 10 venue categories for each postal code and the venue density per square meter. Finally, a k-means algorithm was applied.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Create Toronto Neighborhoods Dataframe</a>

2. <a href="#item2">Get PostalCode Coordinates</a>

3. <a href="#item3">Get Venues from Foursquare API</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Results</a>    
</font>
</div>

##### Install and import libraries:

In [1]:

!pip install tensorflow
!pip install geopy
!pip install folium
!pip install bs4
!pip install googlemaps
!pip install plotly
!pip install cufflinks



In [2]:
import requests # library to handle HTTP requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# libraries for displaying images
from IPython.display import Image
from IPython.display import HTML

# transforming a JSON file into a pandas dataframe
import json
from pandas import json_normalize # requires pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
import folium # plotting library
from bs4 import BeautifulSoup # library for scraping information from a web page such as Wikipedia


print('Libraries are correctly imported')

Libraries are correctly imported


## 1. Create Toronto Neighborhoods DataFrame

The Toronto neighborhood information is available on the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
BeautifulSoup was used to scrape the table content. A pandas dataframe was built from that content to show which neighborhoods belong to each postal code and borough, with the following structure:
<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1559779200000&hmac=dyAVJgKU6e2BrXIIHUTkjX3L4kgDdkzjFF2IEV6tmP4" width = 400>

In the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. For our purposes, these two rows were combined into one row with the neighborhoods separated by a comma, as shown in row 11 of the table above.

The Wikipedia table also contains postal codes whose borough and neighborhood are "Not assigned". Only rows with an assigned borough were processed. When a postal code had a borough but a "Not assigned" neighborhood, the neighborhood was replaced with the borough name.

The following cells contain the code that creates the desired dataframe.


1) The Toronto neighborhood data from Wikipedia was loaded into a dataframe.

In [3]:
# Scrape the content of the Wikipedia table:
html = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BeautifulSoup(html, "html.parser")
tb = soup.find('table', class_='wikitable')
rows = tb.find_all('tr')
l = []
for tr in rows:
    # The header row uses <th> cells, so it yields an empty list here.
    row = [td.text for td in tr.find_all('td')]
    l.append(row)
# Create a pandas dataframe:
toronto_df = pd.DataFrame(l, columns=["PostalCode", "Borough", "Neighborhood"])
# Drop the empty first row produced by the table header:
toronto_df = toronto_df.drop([0])
toronto_df = toronto_df.reset_index(drop=True)
# Remove the newline characters left in the Neighborhood cells:
toronto_df["Neighborhood"] = toronto_df["Neighborhood"].str.replace('\n', '')
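The row-extraction loop above can be tried offline on a small inline table. The sketch below is only an illustration: `SAMPLE_HTML` is a hand-written stand-in for the Wikipedia markup and `parse_table` is a hypothetical helper, neither of which is part of the original notebook. It shows why the header row contributes no data (it contains `<th>` cells rather than `<td>` cells) and how cell text can be stripped while parsing.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (illustrative, not scraped data).
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park</td></tr>
</table>
"""


def parse_table(html):
    """Return one list of stripped cell texts per data row of the first wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    parsed = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows hold <th> cells, so they yield no <td> text
            parsed.append(cells)
    return parsed


rows = parse_table(SAMPLE_HTML)
print(rows)
```

Skipping empty rows during parsing is an alternative to dropping row 0 after the dataframe has been built; both approaches lead to the same table.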

2) Only rows with an assigned borough were kept. Every "Not assigned" neighborhood was replaced with the corresponding borough name.

In [4]:
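# Count the postal codes before filtering.
totalPost = toronto_df["PostalCode"].nunique()
# Keep only rows with an assigned borough (copy to avoid chained-assignment warnings).
toronto_df = toronto_df.loc[toronto_df["Borough"] != "Not assigned"].copy()
# Rows with a missing neighborhood take the borough name instead.
index = toronto_df.index[toronto_df["Neighborhood"] == "Not assigned"]
toronto_df.loc[index, "Neighborhood"] = toronto_df.loc[index, "Borough"]
# Count how many postal codes were discarded by the filter.
PC = toronto_df["PostalCode"].nunique()
PC_na_Boroughs = totalPost - PC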

3) The dataframe was adjusted to contain the borough, the postal code, and a comma-separated list of the related neighborhoods.

In [5]:
# Join the neighborhoods of each (PostalCode, Borough) pair into one comma-separated string.
toronto_df = toronto_df.groupby(["PostalCode", "Borough"])["Neighborhood"].agg(", ".join).reset_index()

4) The following cell presents the results of the code:

In [6]:
print("At the beginning we had {} postal codes.".format(totalPost))
print("A total of {} postal codes were discarded.".format(PC_na_Boroughs))
print("{} neighborhood(s) were replaced with the borough name.".format(index.shape[0]))
print("The final shape of our dataframe is {}".format(toronto_df.shape))

At the beginning we had 180 postal codes.
A total of 77 postal codes were discarded.
1 neighborhood(s) were replaced with the borough name.
The final shape of our dataframe is (103, 3)


In [7]:
toronto_df.sample(n=10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 93 | M9A | Etobicoke | Islington Avenue |
| 10 | M1P | Scarborough | Dorset Park, Scarborough Town Centre, Wexford ... |
| 79 | M6L | North York | Downsview, North Park, Upwood Park |
| 35 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 68 | M5V | Downtown Toronto | CN Tower, Bathurst Quay, Island airport, Harbo... |
| 75 | M6G | Downtown Toronto | Christie |
| 76 | M6H | West Toronto | Dovercourt Village, Dufferin |
| 36 | M4C | East York | Woodbine Heights |
| 49 | M4V | Central Toronto | Deer Park, Forest Hill SE, Rathnelly, South Hi... |
| 11 | M1R | Scarborough | Maryvale, Wexford |
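To see the cleaning rules of steps 2 and 3 in isolation, they can be replayed on a small hand-made table. The sketch below is illustrative only: the rows in `sample` are typed in by hand rather than scraped, and the names `sample`, `clean`, `missing`, and `grouped` are assumptions that do not appear in the notebook.

```python
import pandas as pd

# Hand-made sample rows (illustrative only, not the scraped Wikipedia table).
sample = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Step 2a: drop postal codes that have no assigned borough.
clean = sample[sample["Borough"] != "Not assigned"].copy()

# Step 2b: where the neighborhood is missing, fall back to the borough name.
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# Step 3: one row per postal code, with its neighborhoods joined by ", ".
grouped = (
    clean.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

In this sample the two M5A rows collapse into a single row listing both neighborhoods, the M1A row disappears, and the M7A row takes its borough name as the neighborhood.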


This work continues in Part II.