<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Introduction

In this assignment, the Toronto neighborhood data is scraped from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. The goal is to extract the table of postal codes and transform it into a pandas dataframe like the one shown in the assignment description.

In [83]:
# importing necessary libraries
#!conda install -c conda-forge bs4 --yes # Install bs4
#!conda install lxml --yes # Install lxml
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests


Using BeautifulSoup to scrape the table from Wikipedia

In [84]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
raw_url = requests.get(url).text # request the raw HTML of the page
soup = BeautifulSoup(raw_url,'html5lib') # parse the HTML into a BeautifulSoup tree (requires the html5lib package)
table = soup.find('table') # find the first <table> element in the parsed page



### Converting the table to a dataframe and removing "Not assigned" boroughs

In [85]:
data = []
columns = []

for index, tr in enumerate(table.find_all('tr')):
    section = []
    for td in tr.find_all(['th','td']):
        section.append(td.text.rstrip())
    
    #First row of data is the header
    if (index == 0):
        columns = section
    else:
        data.append(section)

#convert list into Pandas DataFrame
canada_df = pd.DataFrame(data = data,columns = columns)
canada_df.head()

#Remove Boroughs that are 'Not assigned'
canada_df = canada_df[canada_df['Borough'] != 'Not assigned']
canada_df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


## Merging neighborhoods that share a postal code

In [86]:
canada_df["Neighborhood"] = canada_df.groupby("Postal Code")["Neighborhood"].transform(lambda neigh: ', '.join(neigh))
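Note that `transform` returns one value per original row, so if a postal code really did appear on several rows, each of those rows would stay in the dataframe (now carrying the same joined string). One possible way to collapse them into a single row is `groupby(...).agg(...)`. The sketch below uses a small hypothetical dataframe (`toy_df`) with made-up duplicate rows; it is an illustration, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical rows: postal code M5A appears twice with different neighborhoods.
toy_df = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Collapse to one row per postal code, joining neighborhood names with ", ".
# sort=False keeps the groups in order of first appearance.
merged_df = (
    toy_df.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged_df)
```

With `agg`, the result has one row per postal code, whereas `transform` preserves the original row count.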

In [87]:
#update index to be postcode if it isn't already
if(canada_df.index.name != 'Postal Code'):
    canada_df = canada_df.set_index('Postal Code')
    
canada_df.head()

| Postal Code | Borough | Neighborhood |
|---|---|---|
| M3A | North York | Parkwoods |
| M4A | North York | Victoria Village |
| M5A | Downtown Toronto | Regent Park, Harbourfront |
| M6A | North York | Lawrence Manor, Lawrence Heights |
| M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [88]:
# If a cell has a borough but a "Not assigned" neighborhood, use the borough as the neighborhood
canada_df['Neighborhood'] = canada_df['Neighborhood'].mask(canada_df['Neighborhood'] == 'Not assigned', canada_df['Borough'])
canada_df.head()

canada_df.shape


(103, 2)

## Getting latitude and longitude coordinates from the postal code

In [134]:
# !conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
# !conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab





In [90]:
df_coordinates = pd.read_csv("Geospatial_Coordinates.csv")
df_coordinates.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [91]:
df_coordinates["Postal Code"][2]

len(canada_df)

103

In [92]:
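# Collect the coordinates for each postal code, in the row order of canada_df
Latitude = []
Longitude = []

for i in range(len(canada_df)): 
    for j in range(len(df_coordinates)):
        # a match appends exactly one latitude/longitude pair for row i
        if canada_df.index[i] == df_coordinates["Postal Code"][j]:
            Latitude.append(df_coordinates["Latitude"][j])
            Longitude.append(df_coordinates["Longitude"][j])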



In [93]:
canada_df['Latitude'] = Latitude
canada_df['Longitude'] = Longitude
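
The nested loop above works when every postal code has exactly one match, but it compares every pair of rows, and the column assignment would fail with a length mismatch if a code were missing from the coordinates file. A join on the postal code is one possible alternative. The sketch below uses small hypothetical frames (`neighborhoods`, `coordinates`) standing in for `canada_df` and `df_coordinates`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for canada_df (indexed by postal code) and df_coordinates.
neighborhoods = pd.DataFrame(
    {"Borough": ["North York", "Downtown Toronto"],
     "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront"]},
    index=pd.Index(["M3A", "M5A"], name="Postal Code"),
)
coordinates = pd.DataFrame({
    "Postal Code": ["M5A", "M1B", "M3A"],
    "Latitude": [43.654260, 43.806686, 43.753259],
    "Longitude": [-79.360636, -79.194353, -79.329656],
})

# Left join on the postal code: each row keeps its place, and a code without
# a match would get NaN coordinates instead of breaking the assignment.
with_coords = neighborhoods.join(coordinates.set_index("Postal Code"), how="left")
print(with_coords)
```

The join matches on the index, so the row order of the left frame is kept and the extra code `M1B` is simply ignored.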



In [124]:
boolean_series = canada_df.Borough.str.contains("Toronto")
Toronto_df = canada_df[boolean_series]


In [172]:
Toronto_df

| Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
| M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
| M6G | Downtown Toronto | Christie | 43.669542 | -79.422564 |
| M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650571 | -79.384568 |
| M6H | West Toronto | Dufferin, Dovercourt Village | 43.669005 | -79.442259 |

### Exploring the map, focusing on Toronto

In [136]:
# use the coordinates of one Central Toronto postal code (M4N) as the map center
import folium # map rendering library


latitude = Toronto_df.loc["M4N"].Latitude
longitude = Toronto_df.loc["M4N"].Longitude

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(Toronto_df['Latitude'], Toronto_df['Longitude'], Toronto_df['Borough'], Toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

In [165]:
# Importing KMeans from ML package sklearn and, cm & colors from matplotlib 
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors





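# group the Toronto data by neighborhood and average the coordinates
# (select the numeric columns explicitly; recent pandas versions raise on the text column 'Borough')

Toronto_grouped = Toronto_df.groupby('Neighborhood')[['Latitude', 'Longitude']].mean().reset_index()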

In [168]:
Cluster_Toronto = Toronto_grouped.drop(columns='Neighborhood')
Cluster_Toronto

| | Latitude | Longitude |
|---|---|---|
| 0 | 43.644771 | -79.373306 |
| 1 | 43.636847 | -79.428191 |
| 2 | 43.662744 | -79.321558 |
| 3 | 43.628947 | -79.39442 |
| 4 | 43.657952 | -79.387383 |
| 5 | 43.669542 | -79.422564 |
| 6 | 43.66586 | -79.38316 |
| 7 | 43.648198 | -79.379817 |
| 8 | 43.704324 | -79.38879 |
| 9 | 43.712751 | -79.390197 |


In [170]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Cluster_Toronto)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

Toronto_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_grouped

| | Cluster Labels | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 0 | Berczy Park | 43.644771 | -79.373306 |
| 1 | 1 | Brockton, Parkdale Village, Exhibition Place | 43.636847 | -79.428191 |
| 2 | 4 | Business reply mail Processing Centre, South C... | 43.662744 | -79.321558 |
| 3 | 0 | CN Tower, King and Spadina, Railway Lands, Har... | 43.628947 | -79.39442 |
| 4 | 0 | Central Bay Street | 43.657952 | -79.387383 |
| 5 | 1 | Christie | 43.669542 | -79.422564 |
| 6 | 0 | Church and Wellesley | 43.66586 | -79.38316 |
| 7 | 0 | Commerce Court, Victoria Hotel | 43.648198 | -79.379817 |
| 8 | 2 | Davisville | 43.704324 | -79.38879 |
| 9 | 2 | Davisville North | 43.712751 | -79.390197 |
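
For readers curious what `KMeans` does with the two coordinate columns, here is a minimal NumPy sketch of the basic assign-and-update loop (Lloyd's algorithm). It is not scikit-learn's implementation: `kmeans_sketch` and its parameters are hypothetical names, the points are made up, and the starting centers are supplied by hand to keep the run deterministic, whereas scikit-learn picks them with k-means++ by default.

```python
import numpy as np

def kmeans_sketch(points, initial_centers, n_iter=20):
    """Toy Lloyd's algorithm (hypothetical helper): returns (labels, centers)."""
    centers = np.asarray(initial_centers, dtype=float).copy()
    k = len(centers)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (keep the old center if a cluster ends up empty).
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

# Hypothetical (latitude, longitude) pairs forming two obvious groups.
points = np.array([
    [43.64, -79.37], [43.65, -79.38], [43.66, -79.38],
    [43.70, -79.29], [43.71, -79.30], [43.72, -79.29],
])

# Start with one center in each group (an assumption made for determinism).
labels, centers = kmeans_sketch(points, initial_centers=points[[0, 3]])
print(labels)
print(centers)
```

Because the only features are latitude and longitude, the resulting clusters are purely geographic groupings.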


In [171]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))  # one evenly spaced color per cluster
rainbow = [colors.rgb2hex(c) for c in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_grouped['Latitude'], Toronto_grouped['Longitude'], Toronto_grouped['Neighborhood'], Toronto_grouped['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The clusters look plausible: they line up well with the Downtown, East, West, and Central Toronto boroughs. This is expected, since k-means was run on latitude and longitude alone.
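
One way to check this impression with numbers rather than by eye is a cross-tabulation of borough against cluster label. The sketch below uses a small hypothetical dataframe (`check_df`) with made-up labels, not the output of the run above; `Toronto_grouped` does not carry a `Borough` column, so in practice the borough would first have to be joined back in.

```python
import pandas as pd

# Hypothetical boroughs and labels for illustration only.
check_df = pd.DataFrame({
    "Borough": ["Downtown Toronto", "Downtown Toronto", "West Toronto",
                "West Toronto", "Central Toronto", "East Toronto"],
    "Cluster Labels": [0, 0, 1, 1, 2, 4],
})

# Rows are boroughs, columns are clusters, cells count neighborhoods.
overlap = pd.crosstab(check_df["Borough"], check_df["Cluster Labels"])
print(overlap)
```

If each borough's counts concentrate in a single column, the clusters and boroughs agree; counts spread across several columns would indicate a weaker match.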

## Thank you for reading