## Coursera Capstone Project

Import essential libraries.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

Get the Wikipedia page using the requests and BeautifulSoup libraries.

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

Grab the wanted table by its class, then find all rows in this table.

In [3]:
table = soup.find('table', class_="wikitable sortable")
table_rows = table.find_all('tr')

1. Define an empty list that will hold all rows.
2. Loop through the rows and find all cells in each row using the _td_ tag.
3. Skip the header row: it contains only _th_ cells, so it has no _td_ cells.
4. For each remaining row:
    - Remove "\n" from every cell.
    - If the "Neighbourhood" cell is "Not assigned", fill it with the value of the row's Borough.
    - Append the row to the list **l**.
5. Create a dataframe _df_ from **l**.

In [4]:
l = []
for tr in table_rows:
    td = tr.find_all('td')
    if len(td) != 0:
        row = [cell.text.replace("\n", "") for cell in td]  # "cell" avoids shadowing the outer loop variable tr
        if row[2] == 'Not assigned':
            row[2] = row[1]
        l.append(row)
    
df = pd.DataFrame(l)
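
The row-parsing steps above can be tried on a small inline table without fetching the page. This is only a minimal sketch: the HTML snippet and the names `sample_html`, `sample_table`, `sample_rows` and `sample_df` are made up for illustration, and it assumes the built-in `html.parser` backend instead of `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table that mimics the layout of the Wikipedia page.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

sample_rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        # The header row has only <th> cells, so it is skipped here.
        continue
    row = [cell.text.replace("\n", "") for cell in cells]
    if row[2] == "Not assigned":
        row[2] = row[1]  # fall back to the borough name
    sample_rows.append(row)

sample_df = pd.DataFrame(sample_rows, columns=["Postcode", "Borough", "Neighbourhood"])
print(sample_df)
```

Note that a row whose Borough is also "Not assigned" keeps that value in both columns, which is why such rows are dropped in a later step.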

1. Loop through the _th_ headers of the table and use them as column names.
2. Remove "\n" from every header.
3. Drop all rows that have the value "Not assigned" in the "Borough" column.

In [5]:
df.columns = [th.text.replace("\n", "") for th in table.find_all('th')]
df.drop(df[df.Borough == 'Not assigned'].index, inplace=True)

Combine the neighbourhoods that share the same Postcode and Borough into one comma-separated cell, then print the dimensions of the resulting dataframe.

In [6]:
df = df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
df.shape

(103, 3)
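
To see what the group-and-join step does, here is a small sketch on made-up data. The frame `sample` and the result name `grouped` are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# Toy data (made up): two rows share the postcode M5A.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M4E"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "East Toronto"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "The Beaches"],
})

# One row per (Postcode, Borough) pair; neighbourhood names are joined with commas.
grouped = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(",".join)
    .reset_index()
)
print(grouped)
```

The two M5A rows collapse into a single row, so the row count equals the number of distinct postcode and borough pairs.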

1. Download the Geospatial_coordinates.csv file.
2. Save it in coordinates dataframe.
3. Rename the Postal Code column to match the Postcode column in our dataframe _df_.

In [7]:
!wget -q -O 'Geospatial_coordinates.csv' http://cocl.us/Geospatial_data
    
coordinates = pd.read_csv('Geospatial_coordinates.csv')
coordinates.rename(columns={'Postal Code': 'Postcode'}, inplace=True)

Merge the two dataframes on the "Postcode" column, keeping every row of _df_ (left join).

In [8]:
df = pd.merge(df, coordinates, on='Postcode', how='left')
df.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
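
A left join silently leaves `NaN` coordinates for any postcode missing from the coordinates file. One way to check for that, sketched here on made-up frames (`left`, `right` and the postcode `M9Z` are hypothetical), is the `indicator` parameter of `pd.merge`:

```python
import pandas as pd

left = pd.DataFrame({"Postcode": ["M1B", "M1C", "M9Z"]})  # M9Z is a made-up code
right = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# indicator=True adds a "_merge" column telling where each row came from.
checked = pd.merge(left, right, on="Postcode", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["Postcode"].tolist())
```

An empty list would mean every postcode received coordinates.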


### Toronto data

Keep only the rows whose Borough contains the word "Toronto".

In [9]:
toronto_data = df[df['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | M4K | East Toronto | The Danforth West,Riverdale | 43.679557 | -79.352188 |
| 2 | M4L | East Toronto | The Beaches West,India Bazaar | 43.668999 | -79.315572 |
| 3 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
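
The substring filter can be sketched on a toy column as follows. The frame `boroughs` is made up, and the `regex=False` and `na=False` arguments are additions for illustration: they treat the pattern as plain text and count missing values as non-matches.

```python
import pandas as pd

# Toy data (made up), including one missing value.
boroughs = pd.DataFrame({
    "Borough": ["East Toronto", "Scarborough", None, "Downtown Toronto"],
})

# Boolean mask: True where the borough name contains "Toronto".
mask = boroughs["Borough"].str.contains("Toronto", regex=False, na=False)
toronto_only = boroughs[mask].reset_index(drop=True)
print(toronto_only)
```

`reset_index(drop=True)` renumbers the surviving rows from 0, which keeps later positional assignments aligned.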


Import the essential libraries for clustering and drawing the map.

In [10]:
!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geopy                     1.18.1                     py_0    conda-forge
Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Libraries imported.


Find the geospatial coordinates of Toronto City.

In [11]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")  # arbitrary example value; recent geopy versions require a user_agent
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


1. Create a new dataframe by one-hot encoding the Borough column.
2. Add the original Borough column to the new dataframe.
3. Move that column to the first position.

In [14]:
toronto_onehot = pd.get_dummies(toronto_data[['Borough']], prefix="", prefix_sep="")
toronto_onehot['Borough'] = toronto_data['Borough']

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot

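| | Borough | Central Toronto | Downtown Toronto | East Toronto | West Toronto |
|---|---|---|---|---|---|
| 0 | East Toronto | 0 | 0 | 1 | 0 |
| 1 | East Toronto | 0 | 0 | 1 | 0 |
| 2 | East Toronto | 0 | 0 | 1 | 0 |
| 3 | East Toronto | 0 | 0 | 1 | 0 |
| 4 | Central Toronto | 1 | 0 | 0 | 0 |
| 5 | Central Toronto | 1 | 0 | 0 | 0 |
| 6 | Central Toronto | 1 | 0 | 0 | 0 |
| 7 | Central Toronto | 1 | 0 | 0 | 0 |
| 8 | Central Toronto | 1 | 0 | 0 | 0 |
| 9 | Central Toronto | 1 | 0 | 0 | 0 |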
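
The one-hot encoding step can be reproduced on a few made-up rows. This is a sketch under assumptions: `sample` and `onehot` are illustrative names, `dtype=int` is added so the indicator columns show 0/1 (newer pandas versions default to booleans), and `DataFrame.insert` is used as a shorter alternative to reordering the column list.

```python
import pandas as pd

# Toy data (made up): three rows, two distinct boroughs.
sample = pd.DataFrame({"Borough": ["East Toronto", "Central Toronto", "East Toronto"]})

# One indicator column per distinct borough; empty prefix keeps the plain names.
onehot = pd.get_dummies(sample[["Borough"]], prefix="", prefix_sep="", dtype=int)

# Put the original label back as the first column.
onehot.insert(0, "Borough", sample["Borough"])
print(onehot)
```

Each row has exactly one 1, in the column named after its borough.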


## Clustering

The data contains four distinct boroughs, so we use 4 clusters. We drop the Borough column so that the dataframe holds only numeric columns and can be passed to KMeans.

In [15]:
kclusters = 4

toronto_onehot_clustering = toronto_onehot.drop('Borough', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_onehot_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
       2, 2, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 0], dtype=int32)
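
Because the only features are the one-hot borough columns, rows from the same borough are identical points and should always land in the same cluster. The sketch below checks this on made-up data; `sample`, `features`, `model` and `labels_per_borough` are illustrative names, and `n_init=10` is set explicitly only to avoid a version-dependent default.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data (made up): four distinct boroughs, some repeated.
sample = pd.DataFrame({
    "Borough": [
        "East Toronto", "Central Toronto", "Downtown Toronto",
        "West Toronto", "East Toronto", "Downtown Toronto",
    ],
})
features = pd.get_dummies(sample["Borough"], dtype=int)

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
sample["Cluster label"] = model.labels_

# Count how many different cluster labels each borough received.
labels_per_borough = sample.groupby("Borough")["Cluster label"].nunique()
print(labels_per_borough)
```

A count of 1 for every borough means the clusters coincide with the boroughs; the numeric label assigned to each borough is arbitrary.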

Add a "Cluster label" column to toronto_data.

In [16]:
toronto_data['Cluster label'] = kmeans.labels_

Draw all clusters on the map with different colors.

In [17]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood'], toronto_data['Cluster label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
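
In the cell above, `x` and `ys` are only used for their length, so the palette depends solely on the number of clusters. A shorter sketch of the same idea follows; `n_clusters` and `palette` are illustrative names, and `colors.to_hex` is used here in place of `rgb2hex`.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

n_clusters = 4

# Sample the rainbow colormap at evenly spaced points, one per cluster,
# and convert each RGBA tuple to a hex string that folium accepts.
palette = [colors.to_hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, n_clusters))]
print(palette)
```

Indexing `palette` with the cluster label then gives each cluster its own marker colour.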