<h1 align=center><font size = 8>Segmenting and Clustering Neighborhoods in Toronto, Canada</font></h1>

First of all, we have to import the libraries and download the dataset.

## Part 1: Downloading and cleaning the data

### Step 1: Import libraries

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install folium
import folium # map rendering library

print('Libraries imported.')

Collecting folium
  Downloading https://files.pythonhosted.org/packages/fd/a0/ccb3094026649cda4acd55bf2c3822bb8c277eb11446d13d384e5be35257/folium-0.10.1-py2.py3-none-any.whl (91kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/81/6d/31c83485189a2521a75b4130f1fee5364f772a0375f81afff619004e5237/branca-0.4.0-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.0 folium-0.10.1
Libraries imported.


### Step 2: Download dataset and clean the data

In this case, the dataset comes from Wikipedia: a table of postal codes with their boroughs and neighborhoods. The table is published as HTML, so we have to convert it to a pandas DataFrame.

To do that, I download the page with requests, pick out the first table with BeautifulSoup, and pass its HTML string to pd.read_html.

In [93]:
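from bs4 import BeautifulSoup

# download the Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

# parse the HTML and keep the first table on the page
soup = BeautifulSoup(req.content,'lxml')
table = soup.find_all('table')[0]

# read_html returns a list of dataframes, one per table it finds
df_can = pd.read_html(str(table))

neighborhood=pd.DataFrame(df_can[0])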

In [94]:
df_can = df_can[0].sort_values(by = ["Postal code"])
df_can.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
9,M1B,Scarborough,Malvern / Rouge
18,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
27,M1E,Scarborough,Guildwood / Morningside / West Hill
36,M1G,Scarborough,Woburn


In [95]:
df_can.dropna(inplace = True)
df_can.drop(df_can.loc[df_can['Borough']=='Not assigned'].index, inplace=True)
df_can.head()

Unnamed: 0,Postal code,Borough,Neighborhood
9,M1B,Scarborough,Malvern / Rouge
18,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
27,M1E,Scarborough,Guildwood / Morningside / West Hill
36,M1G,Scarborough,Woburn
45,M1H,Scarborough,Cedarbrae


In [96]:
df_can = df_can.sort_values(by=["Postal code"], axis=0) # sort_values returns a new dataframe, so assign it back
df_can = df_can.reset_index(drop=True)

Another requirement is to group the neighborhoods by postal code, which the Wikipedia table already does. The problem is that the neighborhoods in each row are separated by a slash (" /"), so we need to replace it with a comma (","). Then the index is reset.

In [97]:
df_can['Neighborhood'] = df_can['Neighborhood'].str.replace(" /",",")
df_can.reset_index(drop=True, inplace = True)
df_can.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


And there we have it. The dataframe is sorted by postal code, and each row lists every neighborhood that belongs to that postal code. Let's see the shape of the dataframe:

In [98]:
df_can.shape

(103, 3)
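
The Wikipedia table used above already has one row per postal code, so no grouping was needed. As a hedged aside, if the table instead listed one neighborhood per row, the grouping could be sketched roughly as follows. The `raw` dataframe is hypothetical toy data in the same three-column layout, not the real table.

```python
import pandas as pd

# Hypothetical toy data: one row per neighborhood (NOT the layout of the real table)
raw = pd.DataFrame({
    "Postal code": ["M1B", "M1B", "M1C", "M1G"],
    "Borough": ["Scarborough"] * 4,
    "Neighborhood": ["Malvern", "Rouge", "Rouge Hill", "Woburn"],
})

# Collapse to one row per postal code, joining the neighborhood names with ", "
grouped = (
    raw.groupby(["Postal code", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)

print(grouped)
```

Grouping on both `Postal code` and `Borough` keeps the borough column in the result; this assumes each postal code belongs to exactly one borough.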

## Part 2: Getting the geolocation data and merging it into the existing dataframe

### Step 1: Downloading the data available in Coursera

I first tried to retrieve the geospatial data another way, but it didn't work. So here I use the data available in the .csv file provided by Coursera.

In [71]:
df_loc = pd.read_csv("https://cocl.us/Geospatial_data")
df_loc.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Before we merge this data into the Part 1 dataframe, we need to rename the "Postal Code" column to "Postal code" (with a lowercase "c") so that the key column has the same name in both dataframes.

In [86]:
df_loc = df_loc.sort_values(by="Postal Code")
df_loc.columns = ["Postal code", "Latitude", "Longitude"]
df_loc.head()

Unnamed: 0,Postal code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
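
Overwriting `columns` with a full list works, but it depends on the column order. A minimal sketch of two alternatives, assuming small stand-in dataframes (`left` and `right` are illustrative names, with values copied from the outputs above): rename only the mismatched column, or keep both names and tell `merge` which column to use on each side.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (illustrative names, values from the outputs above)
left = pd.DataFrame({
    "Postal code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Option 1: rename only the mismatched column, then merge on the shared name
right_renamed = right.rename(columns={"Postal Code": "Postal code"})
merged_a = left.merge(right_renamed, on="Postal code", how="left")

# Option 2: keep both names, point merge at each key, then drop the duplicate key column
merged_b = (
    left.merge(right, left_on="Postal code", right_on="Postal Code", how="left")
        .drop(columns="Postal Code")
)

print(merged_a)
```

Both options give the same result here; the first is closer to what this notebook does.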


In [91]:
df_canloc = df_can.merge(df_loc, how = 'left', on = 'Postal code' )
df_canloc = df_canloc.dropna()
df_canloc.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
1,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
2,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
3,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
4,M1G,Scarborough,Woburn,43.770992,-79.216917
5,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [100]:
df_canloc.shape

(103, 5)
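
The shape shows that no rows were lost after `dropna()`, which suggests every postal code found its coordinates. A more explicit check can be sketched with the `indicator` and `validate` parameters of `merge`. The dataframes below are toy stand-ins, and "M9Z" is a made-up code added only to show what an unmatched key looks like.

```python
import pandas as pd

# Toy stand-ins; "M9Z" is a made-up postal code with no coordinates
left = pd.DataFrame({"Postal code": ["M1B", "M1C", "M9Z"]})
right = pd.DataFrame({
    "Postal code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises an error if a postal code is duplicated on either side;
# indicator adds a "_merge" column saying where each row came from
checked = left.merge(
    right, on="Postal code", how="left",
    validate="one_to_one", indicator=True,
)

# Postal codes that found no match in the coordinates table
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal code"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real dataframes would confirm that the `dropna()` after the merge removes nothing.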