# Capstone Week3 Assignment - Segmenting and Clustering the Neighborhoods in the City of Toronto, Canada

### This assignment consists of 3 parts (see below), all of which are implemented in a single Jupyter Notebook (i.e., this Notebook) for clarity and convenience of peer review.

  ####    *1. Obtain data by scraping the Wikipedia page containing the postal codes, boroughs, and neighborhoods of Toronto;*
  ####    *2. Process and clean the data for clustering;*
  ####    *3. Cluster the neighborhoods with K-Means, then plot the clusters using the Folium library.*

### Note that only boroughs whose names contain the word 'Toronto' are included in the clustering and final plotting.

###### -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## 1. Obtain data by scraping the Wikipedia page containing the postal codes, boroughs, and neighborhoods of Toronto

### 1.1. Import & Install the necessary Libraries:

In [1]:
!pip install beautifulsoup4
!pip install lxml
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes #this is necessary if geopy hasn't been installed yet.
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
print('Geocoders has been installed')

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
from IPython.display import display_html

# transforming a json file into a pandas dataframe
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes  #this is necessary if Folium hasn't been installed yet.
import folium # plotting library
from bs4 import BeautifulSoup
print('Folium has been installed')

from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
print('All necessary libraries have been imported.')



### 1.2. Fetch the Toronto neighborhood data from Wikipedia and scrape it

In [2]:
# Get data from wikipedia
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup=BeautifulSoup(source,'lxml')

# Assemble and clean up a dataframe from the Toronto Wikipedia page; it will contain non-null PostalCode, Borough, and Neighborhood columns.
table_contents=[]
table=soup.find('table')  # BeautifulSoup Library is used for scraping tables from Wikipedia.
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

#print(table_contents)
df=pd.DataFrame(table_contents)

print("The shape of df is:", df.shape)
df.head()

The shape of df is: (103, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |

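The string handling inside the scraping loop can be read in isolation. The following is a minimal sketch rather than the notebook's own code: `parse_cell` is a hypothetical helper, and the sample strings assume the cell layout implied by the loop above (a borough name followed by neighborhoods in parentheses, separated by ` / `).

```python
def parse_cell(postal_text, span_text):
    """Split one table cell into PostalCode, Borough and Neighborhood."""
    # Everything before the first '(' is the borough; the rest lists neighborhoods.
    borough, _, rest = span_text.partition('(')
    # Drop closing parentheses, then turn the ' / ' separators into ', '.
    neighborhood = rest.replace(')', ' ').replace(' /', ',').strip()
    return {
        'PostalCode': postal_text.strip()[:3],
        'Borough': borough.strip(),
        # Assumption: fall back to the borough name when no neighborhood is listed.
        'Neighborhood': neighborhood or borough.strip(),
    }

# Hypothetical input mimicking one cell of the Wikipedia table.
cell = parse_cell('M5A', 'Downtown Toronto(Regent Park / Harbourfront)')
print(cell)
```

Unlike `split('(')[1]`, which raises an `IndexError` when a cell has no parentheses, this variant falls back to the borough name in that case.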

In [4]:
# replace messed up names with clean ones 
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
df

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto Business | Enclave of M4L |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |

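A dictionary passed to `Series.replace` matches whole values, not substrings, which is why each garbled borough name has to be listed in full. A small sketch with a hypothetical three-row Series illustrates this:

```python
import pandas as pd

# Hypothetical sample: one garbled value and two clean ones.
boroughs = pd.Series(['EtobicokeNorthwest', 'Etobicoke', 'North York'])

# Only values equal to a dictionary key are replaced; 'Etobicoke' is left alone.
cleaned = boroughs.replace({'EtobicokeNorthwest': 'Etobicoke Northwest'})
print(cleaned.tolist())
```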

## 2. Process and clean the data for clustering

#### 2.1. Data preprocessing and cleaning

In [5]:
# Combine the neighbourhoods with same Postal Code
df_a = df.groupby(['PostalCode','Borough'], sort=False).agg(', '.join)
df_a.reset_index(inplace=True)

# Replacing the name of the neighbourhoods which are 'Not assigned' with names of Borough
df_a['Neighborhood'] = np.where(df_a['Neighborhood'] == 'Not assigned',df_a['Borough'], df_a['Neighborhood'])

print("The shape of df_a is:", df_a.shape)
df_a.head()

The shape of df_a is: (103, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |

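In the run above the shape stays at (103, 3), which suggests the scraped table already holds one row per postal code, so the grouping changes nothing here. To show what the `groupby` does when a postal code is repeated, here is a minimal sketch on a hypothetical three-row dataframe:

```python
import pandas as pd

# Hypothetical rows in which postal code M5A appears twice.
rows = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M4E'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'East Toronto'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'The Beaches'],
})

# Join the neighborhoods of each (PostalCode, Borough) pair into one string.
combined = (
    rows.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
        .agg(', '.join)
        .reset_index()
)
print(combined)
```

`sort=False` keeps the groups in order of first appearance, so the row order of the scraped table is preserved.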

#### 2.2. Import the CSV file that contains the latitude and longitude of each postal code

In [6]:
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


#### 2.3. Merge the two tables to attach a latitude and longitude to each postal code

In [7]:
lat_lon.rename(columns={'Postal Code':'PostalCode'},inplace=True)
df_b = pd.merge(df_a,lat_lon,on='PostalCode')

print("The shape of df_b is:", df_b.shape)
df_b.head()

The shape of df_b is: (103, 5)


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |

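`pd.merge` defaults to an inner join, so a postal code missing from the coordinates file would be dropped silently. The shape (103, 5) above indicates that every row found a match in this run. As a hedged sketch of how such a check could be made explicit, the example below uses small hypothetical frames; the code `M9Z` and the borough `Example Borough` are invented for illustration.

```python
import pandas as pd

# Hypothetical inputs: 'M9Z' has no entry in the coordinates table.
left = pd.DataFrame({'PostalCode': ['M5A', 'M4E', 'M9Z'],
                     'Borough': ['Downtown Toronto', 'East Toronto', 'Example Borough']})
coords = pd.DataFrame({'PostalCode': ['M5A', 'M4E'],
                       'Latitude': [43.65426, 43.676357],
                       'Longitude': [-79.360636, -79.293031]})

# A left join keeps every scraped row; indicator=True adds a '_merge' column
# recording whether a match was found.
checked = pd.merge(left, coords, on='PostalCode', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```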

#### 2.4. Keep only the rows whose Borough contains the word "Toronto"

In [8]:
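# .copy() makes df_c an independent dataframe, so columns can be added to it later without a SettingWithCopyWarning
df_c = df_b[df_b['Borough'].str.contains('Toronto',regex=False)].copy()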

print("The shape of df_c is:", df_c.shape)
df_c

The shape of df_c is: (39, 5)


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 20 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
| 24 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
| 25 | M6G | Downtown Toronto | Christie | 43.669542 | -79.422564 |
| 30 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650571 | -79.384568 |
| 31 | M6H | West Toronto | Dufferin, Dovercourt Village | 43.669005 | -79.442259 |
| 35 | M4J | East York/East Toronto | The Danforth East | 43.685347 | -79.338106 |


#### 2.5. Visualize all neighbourhoods of the above dataframe using Folium

In [9]:
map_toronto = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

for lat,lng,borough,neighborhood in zip(df_c['Latitude'],df_c['Longitude'],df_c['Borough'],df_c['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto

#### Image_1 Toronto Neighborhoods Visualization

![Image 1: Toronto Neighborhoods Visualization](C:\Users\LTA2017\Documents\Kui\Learning\Image_1_Toronto_Neighborhoods_Visualization.jpg "Toronto Neighborhoods Visualization")
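The map above is centred on a hard-coded coordinate pair. As an alternative, the centre and a bounding box can be derived from the data itself. The sketch below is an assumption-laden illustration: `map_view` is a hypothetical helper, and the three coordinate pairs are taken from the table in section 2.4.

```python
def map_view(coords):
    """Return the mean centre and the bounding box of (lat, lon) pairs."""
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    centre = [sum(lats) / len(lats), sum(lons) / len(lons)]
    # South-west and north-east corners of the box enclosing all points.
    bounds = [[min(lats), min(lons)], [max(lats), max(lons)]]
    return centre, bounds

# Three rows from the filtered dataframe (M5A, M4E, M6H).
sample = [(43.65426, -79.360636), (43.676357, -79.293031), (43.669005, -79.442259)]
centre, bounds = map_view(sample)
print(centre, bounds)
```

The resulting `centre` could be passed as `location` to `folium.Map`, and `bounds` to the map's `fit_bounds` method, so that the view adapts if the set of boroughs changes.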

## 3. Clustering and plotting the Toronto Neighborhoods

#### 3.1. Use K-Means clustering to group the neighborhoods by location

In [10]:
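k=5
# Cluster on the coordinates only, so drop the non-numeric columns
toronto_clustering = df_c.drop(columns=['PostalCode','Borough','Neighborhood'])
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_clustering)
# Attach each row's cluster label as the first column
df_c.insert(0, 'Cluster Labels', kmeans.labels_)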

In [13]:
#check the dataframe
df_c.head()

| | Cluster Labels | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 2 | 4 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 9 | 4 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | 4 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | 1 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 20 | 4 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |

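Because only latitude and longitude are used as features, the clusters here are purely geographic groupings. To illustrate what K-Means does with such data, the following is a bare-bones sketch of Lloyd's algorithm in NumPy. It is not scikit-learn's implementation (which, among other things, uses a smarter initialisation); `lloyd_kmeans` is a hypothetical helper and the six coordinate pairs are made up.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Hypothetical (latitude, longitude) pairs forming two tight groups.
pts = np.array([[43.65, -79.38], [43.66, -79.39], [43.65, -79.37],
                [43.68, -79.29], [43.67, -79.30], [43.68, -79.30]])
labels, centroids = lloyd_kmeans(pts, k=2)
print(labels)
```

The numeric label assigned to each group is arbitrary; only the grouping itself is meaningful, which also holds for the `Cluster Labels` column above.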

#### 3.2. Plot the neighborhoods, coloured by their cluster labels

In [12]:
# create map
map_clusters = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighborhood, cluster in zip(df_c['Latitude'], df_c['Longitude'], df_c['Neighborhood'], df_c['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
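The colour scheme only needs one distinct hex colour per cluster label. As a simplified sketch, the example below swaps matplotlib's `rainbow` colormap for the standard-library `colorsys` module; `cluster_palette` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys

def cluster_palette(k):
    """Return k evenly spaced hex colours, one per cluster label 0..k-1."""
    palette = []
    for i in range(k):
        # Spread the hues evenly around the colour wheel.
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(5)
print(palette)
```

K-Means labels run from 0 to k-1, so a palette like this can be indexed directly with `palette[cluster]`. The cell above uses `rainbow[cluster-1]`, which still gives one colour per cluster, but only because index -1 for label 0 wraps around to the last colour.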

In [14]:
print("The shape of df_c is:", df_c.shape)

The shape of df_c is: (39, 6)


#### Image_2 Toronto Neighborhoods Clustering

![Image 2: Toronto Neighborhoods Clustering](C:\Users\LTA2017\Documents\Kui\Learning\Image_2_Toronto_Neighborhoods_Clustering.jpg "Toronto Neighborhoods Clustering")