<center><h1>Capstone Project: Clustering Neighborhoods in Toronto</h1></center>
<h2>Introduction</h2>
<p>In this notebook, we'll explore, segment, and cluster the neighborhoods of the city of Toronto in order to discover insights about its different places. Questions worth keeping in mind include: how are the places in this city grouped? How many main clusters of places are there, what are they, and which characteristics do they share?</p>
<p>As we go through the stages of this project, more questions will come up.</p>

<p>This notebook will consist of two stages: </p>
<ul>
    <li>Data collection and wrangling.</li>
    <li>Geolocalization and clustering.</li>
</ul>

<h3>Table of contents:</h3>
<ul>
    <li>Data collection and wrangling.</li>
    <li>Geolocalization and clustering.</li>
</ul>

<h2>Data collection and wrangling</h2>

<p>This stage will require us to: </p>
<p>First, extract the data. In our case, the information on Toronto boroughs and neighborhoods isn't readily available as a CSV or other structured file. However, we can use some libraries to scrape an HTML page and extract that data.</p>
<p>Second, once we have that data, we'll need to clean it by dropping the rows that don't help us and arranging it in a convenient way.</p>

In [121]:
#Install required packages to fetch HTML files and make calls to APIs.
#!conda install -c conda-forge requests --yes 

In [122]:
#Import packages.
import pandas as pd
import numpy as np
import requests

<p>To retrieve the data, we need to provide the website URL and pass it to the requests function <i>"get"</i>.</p>

In [123]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
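
A request can fail or return an error page, so it may be worth checking the response before parsing it. The following is a minimal sketch, not part of the original notebook: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page body as text, raising on HTTP error statuses (4xx/5xx)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a bad status
    return response.text


# Usage (needs network access):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
```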

Let's check the length of the content returned by the get request, which we stored in the response object.

In [124]:
len(response.content)

79293

The raw content is fairly messy: it contains many HTML tags whose content we aren't interested in. However, the pandas 'read_html' function can parse the tables in the HTML into DataFrames, and we take the first one. I've preferred to convert the data this way, without having to install other packages.

In [125]:
df = pd.read_html(response.text)[0] # Read the first table in the HTML returned by the request.
df.columns = ['Postal Code', 'Borough', 'Neighborhood'] # Name the DataFrame columns.
df.drop([0], axis=0, inplace=True) # Drop the first row (index label 0).
df.reset_index(inplace=True, drop=True) # Reset the index without adding a new "index" column.
df.head() # Show the first 5 rows.

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [126]:
# Check the data frame size.
df.shape 

(288, 3)

As we can see above, many rows have 'Not assigned' in the Borough and Neighborhood columns. Rows without a borough are useless, so we'll get rid of them. Rows that have a borough but lack a neighborhood are kept, because we can recover the neighborhood by copying the borough value, but not vice versa.

In [127]:
useless_rows = df[df['Borough'] == 'Not assigned']
print('Useless rows: {}'.format(useless_rows.shape[0]))
useless_rows.head()

Useless rows: 77


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
9,M8A,Not assigned,Not assigned
13,M2B,Not assigned,Not assigned
20,M7B,Not assigned,Not assigned


Now, let's see the number of rows that contain borough values, i.e. the useful rows.

In [128]:
df = df[df['Borough'] != 'Not assigned']
print('Useful rows: {}'.format(df.shape[0]))
df.head()

Useful rows: 211


Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Now, to recover the rows that have a Borough value but lack a Neighborhood value, we can replace 'Not assigned' with the row's Borough value. We'll iterate through the dataset to do this.

In [129]:
# Index labels of rows that have a Borough but no Neighborhood.
index_neighs_to_replace_for_boroughs = df[(df['Borough'] != 'Not assigned') & (df['Neighborhood'] == 'Not assigned')].index.tolist()

for i in index_neighs_to_replace_for_boroughs:
    print('Neighborhood with this value: "'+ df.loc[i, 'Neighborhood'] + '" will be replaced with this: "' + df.loc[i, 'Borough'] + '" value')
    df.loc[i, 'Neighborhood'] = df.loc[i, 'Borough'] # .loc writes to df itself; chained indexing may write to a copy.

Neighborhood with this value: "Not assigned" will be replaced with this: "Queen's Park" value
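
The same replacement can also be done without an explicit loop. The sketch below is one possible alternative, not what this notebook ran: `toy` is a made-up stand-in for `df`, and `Series.mask` replaces values wherever the condition is true.

```python
import pandas as pd

# Toy stand-in for df; the rows are illustrative, not the scraped data.
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows that have a borough; copy so later writes don't target a view.
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Where the neighborhood is missing, take the borough value instead.
missing = toy['Neighborhood'] == 'Not assigned'
toy['Neighborhood'] = toy['Neighborhood'].mask(missing, toy['Borough'])

print(toy)
```

On a frame of a few hundred rows the loop is fast enough; the vectorized form mainly avoids row-by-row assignment.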


In [130]:
#Check that the 'Borough' and 'Neighborhood' columns contain no 'Not assigned' values.
if not df[['Borough', 'Neighborhood']].eq('Not assigned').any().any():
    print('All Borough and Neighbourhood have assigned values.')

All Borough and Neighbourhood have assigned values.


Now let's look at the data frame's shape at this point.

In [131]:
df.shape

(211, 3)

In [132]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Notice above that the 'Postal Code' column contains repeated postal codes. We can consolidate them by grouping the corresponding neighborhoods into one comma-separated value per postal code, instead of having repeated rows.

In [133]:
df = df.astype(str).set_index(['Postal Code', 'Borough'])
df_merged = df.groupby(level=['Postal Code', 'Borough'], sort=False).agg(', '.join)
df_merged.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Neighborhood
Postal Code,Borough,Unnamed: 2_level_1
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Harbourfront, Regent Park"
M6A,North York,"Lawrence Heights, Lawrence Manor"
M7A,Queen's Park,Queen's Park


Now our dataset has unique postal codes, and the neighborhoods belonging to each one are separated by commas. However, we still need to reset the index to bring 'Postal Code' and 'Borough' back as regular columns.

In [134]:
df_merged.reset_index(inplace=True)
df_merged.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
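
The set_index, groupby, and reset_index steps above can also be collapsed into a single call. This is a sketch on a made-up three-row frame (`toy` and `merged` are hypothetical names), assuming that grouping directly on the columns with `as_index=False` is acceptable for the use case:

```python
import pandas as pd

# Toy rows shaped like the cleaned data frame above.
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# Group on the columns directly; as_index=False keeps them as regular columns,
# and sort=False preserves the order in which postal codes first appear.
merged = (
    toy.groupby(['Postal Code', 'Borough'], sort=False, as_index=False)
       .agg({'Neighborhood': ', '.join})
)

print(merged)
```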


Finally, let's look at its shape. It now has over a hundred fewer rows, because we grouped the neighborhoods to avoid duplicate postal codes.

In [135]:
df_merged.shape

(103, 3)

<h2>Geolocalization stage.</h2>

So far, our data frame is clean, but in order to plot maps and analyze locations, we need the geolocation data for every postal code. We could get it through geocoding packages; however, for the sake of simplicity, we found a CSV file online with the geolocation data for each of our postal codes. If our case study required a greater variety of locations, we could use one of those packages instead.


In [136]:
#Import csv file
df_latlong = pd.read_csv('http://cocl.us/Geospatial_data')
df_latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now, to check whether the location data matches our dataframe row by row, let's sort both by Postal Code.

In [137]:
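df_latlong = df_latlong.sort_values(by='Postal Code').reset_index(drop=True) # Keep the sorted result; sort_values returns a new frame.
df_latlong.head()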

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [138]:
df_merged.sort_values(by='Postal Code', inplace=True)
df_merged.reset_index(inplace=True, drop=True)
df_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
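
Comparing the first five rows by eye does not cover the whole table. A minimal sketch of a programmatic check, using two made-up frames (`left` and `right` are hypothetical stand-ins for `df_merged` and `df_latlong`) and `Series.equals`:

```python
import pandas as pd

# Two toy frames holding the same postal codes in different orders.
left = pd.DataFrame({'Postal Code': ['M1C', 'M1B'],
                     'Borough': ['Scarborough', 'Scarborough']})
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535]})

# Sort both on the key and rebuild a clean 0..n-1 index.
left = left.sort_values('Postal Code').reset_index(drop=True)
right = right.sort_values('Postal Code').reset_index(drop=True)

# True only if both key columns hold the same codes in the same order.
aligned = left['Postal Code'].equals(right['Postal Code'])
print(aligned)
```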


As you can see above, after sorting both dataframes by postal code, concatenating them column-wise pairs each row with its corresponding location.

In [140]:
df_places = pd.concat([df_merged, df_latlong[['Latitude', 'Longitude']]], axis=1)


In [143]:
df_places.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
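
Column-wise concatenation relies on both frames having the same row order. A join on the key column does not. The sketch below shows that alternative on made-up frames (`places`, `coords`, and `joined` are hypothetical names); it is not the approach used above.

```python
import pandas as pd

# Toy frames; note that the postal codes are in different orders.
places = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Join on the key; validate raises MergeError if a postal code repeats on either side.
joined = places.merge(coords, on='Postal Code', how='left', validate='one_to_one')

print(joined)
```

With `how='left'`, any postal code missing from the coordinates file shows up as NaN in Latitude and Longitude rather than silently shifting rows.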
