# Applied data science capstone, week 3

## Part 1
### Reading the wiki page with postal codes and neighborhoods into a pandas dataframe

In [1]:
!conda install -c anaconda lxml -y -q

import pandas as pd
import lxml

print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


---
I parsed the wiki page ([List_of_postal_codes_of_Canada](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)) and removed the rows whose borough is 'Not assigned'.

As of today (2021-01-09), the wiki page contains no rows that have an assigned borough together with a neighborhood that is 'Not assigned'.

Likewise, the wiki page contains no rows with duplicate postal codes.

The resulting dataframe has 103 rows.

In [2]:
# read table from html into dataframe
df_raw = pd.read_html(io='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', attrs={'class': 'wikitable sortable'})[0]
df = df_raw.loc[df_raw['Borough'] != 'Not assigned']
df.shape

(103, 3)
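
The observations above can also be checked in code rather than by eye. This is a minimal sketch, assuming the column names `Postal Code`, `Borough` and `Neighbourhood` used later in this notebook. The small `sample` frame is illustrative stand-in data so the block runs without network access; it is not the real scrape.

```python
import pandas as pd

# Illustrative stand-in for df_raw (not the real wiki table).
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village',
                      'Regent Park, Harbourfront'],
})

# Drop rows whose borough is not assigned.
clean = sample.loc[sample['Borough'] != 'Not assigned'].reset_index(drop=True)

# Rows with an assigned borough but an unassigned neighbourhood.
unnamed = clean[clean['Neighbourhood'] == 'Not assigned']

# Postal codes that occur more than once.
dupes = clean[clean['Postal Code'].duplicated(keep=False)]

print(len(clean), len(unnamed), len(dupes))
```

Running the same three checks on `df` should report empty `unnamed` and `dupes` frames if the page still looks the way it did on the date noted above.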

---
## Part 2
### Reading the CSV with coordinates and merging it with the main dataframe.

In [4]:
#!conda install -c conda-forge geocoder -y
#import geocoder # import geocoder

#print('Libraries imported.')

It seems geocoder hit Google's daily request limit,
so I decided to use IBM's CSV instead.

In [3]:
# read table from csv into additional dataframe
df_latlng = pd.read_csv('http://cocl.us/Geospatial_data')

In [6]:
# merge main and additional dataframes
df = df.merge(df_latlng, on='Postal Code')
pd.set_option('display.max_rows', df.shape[0]+1) # check
df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude_x,Longitude_x,Latitude_y,Longitude_y,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656,43.753259,-79.329656,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572,43.725882,-79.315572,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,43.65426,-79.360636,43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,43.718518,-79.464763,43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,43.662301,-79.389494,43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242,43.667856,-79.532242,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,43.806686,-79.194353,43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188,43.745906,-79.352188,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,43.706397,-79.309937,43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,43.657162,-79.378937,43.657162,-79.378937
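
A merge can silently drop or duplicate rows when keys are missing or repeated. The sketch below shows one way to guard against that with the `validate` and `indicator` options of `DataFrame.merge`. The frame names `left` and `right` are illustrative, the first three rows are taken from the output above, and the `M9Z` row is made up to show an unmatched key.

```python
import pandas as pd

left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A', 'M9Z'],  # 'M9Z' is a made-up key
    'Latitude': [43.753259, 43.725882, 43.65426, 0.0],
    'Longitude': [-79.329656, -79.315572, -79.360636, 0.0],
})

# validate='one_to_one' raises MergeError if either side has duplicate keys;
# indicator=True adds a '_merge' column recording where each row came from.
merged = left.merge(right, on='Postal Code', how='left',
                    validate='one_to_one', indicator=True)

# Rows of the main frame that found no coordinates.
missing = merged[merged['_merge'] != 'both']
print(merged.shape, len(missing))
```

The `Latitude_x`/`Latitude_y` columns in the output above are what pandas produces when both frames already hold a column with the same name, which would be consistent with the merge cell having been run more than once on the same `df`.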


---
## Part 3
### Exploring, clustering and visualization of neighbourhoods.

I keep the boroughs whose name contains 'Toronto' and reset the index to avoid gaps.

In [7]:
df_toronto = df[df['Borough'].str.contains('Toronto', regex=False)].reset_index(drop=True)
df_toronto

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude_x,Longitude_x,Latitude_y,Longitude_y,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,43.65426,-79.360636,43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,43.662301,-79.389494,43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,43.657162,-79.378937,43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,43.651494,-79.375418,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,43.676357,-79.293031,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,43.644771,-79.373306,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,43.657952,-79.387383,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564,43.669542,-79.422564,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568,43.650571,-79.384568,43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259,43.669005,-79.442259,43.669005,-79.442259


In [8]:
#!conda install -c conda-forge folium=0.5.0 -y -q # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np

print('Libraries imported.')

Libraries imported.


I create a map of Toronto and mark the neighborhoods on it; the centre coordinates are hard-coded values taken from maps.

In [9]:
lat, lng = 43.681498, -79.335725

map_toronto = folium.Map(location=[lat, lng], zoom_start=12)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.5,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Most neighbourhoods are densely packed in one area, and the rest fall into three zones, so I use 4 clusters.
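
The choice of 4 clusters here is made by eye. One common cross-check is the elbow heuristic: fit k-means for several values of k and look at where the inertia (within-cluster sum of squares) stops dropping sharply. The sketch below is an illustration, not part of the original analysis. It uses only the ten coordinate pairs shown in the table above, and the range of k is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude of the ten Toronto rows shown above.
coords = np.array([
    [43.65426, -79.360636],
    [43.662301, -79.389494],
    [43.657162, -79.378937],
    [43.651494, -79.375418],
    [43.676357, -79.293031],
    [43.644771, -79.373306],
    [43.657952, -79.387383],
    [43.669542, -79.422564],
    [43.650571, -79.384568],
    [43.669005, -79.442259],
])

# Fit k-means for k = 1..5 and record the inertia of each fit.
inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 6), inertias):
    print(k, round(inertia, 6))
```

With the full `df_toronto` frame the same loop would take `df_toronto[['Latitude', 'Longitude']]` in place of `coords`.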

In [10]:
# set number of clusters
kclusters = 4

# keep only the numeric coordinate columns as features
toronto_clustering = df_toronto.drop(columns=['Postal Code', 'Borough', 'Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:11] 

array([1, 1, 1, 1, 3, 1, 1, 2, 1, 2, 1], dtype=int32)

In [11]:
# add clustering labels; kmeans.labels_ is in the same row order as df_toronto
df_toronto['Cluster Labels'] = kmeans.labels_

In [12]:
# create map
map_clusters = folium.Map(location=[lat, lng], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to map
for lat, lng, cluster, neighbourhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], 
                                            df_toronto['Cluster Labels'], df_toronto['Neighbourhood']):
    label = folium.Popup(str(neighbourhood) + ' Cluster ' + str(cluster), parse_html=True)
    # replacement of hard-to-see yellow
    if rainbow[cluster-1] == '#d4dd80':
        rainbow[cluster-1] = 'orange'
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.5,
        parse_html=False).add_to(map_clusters)  
    
map_clusters
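
The colour handling in the loop above can also be done once, before any markers are drawn. This is a sketch under the same assumptions as the cell above (4 clusters, matplotlib's `rainbow` colormap). `palette` is an illustrative name. Indexing by `cluster` directly instead of `cluster-1` is a simplification that changes which colour each cluster gets, not how many colours there are.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# One hex colour per cluster, evenly spaced along the rainbow colormap.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Replace the hard-to-see yellow once, up front.
palette = ['orange' if c == '#d4dd80' else c for c in palette]

# KMeans labels run from 0 to kclusters-1, so they index the list directly.
for cluster in range(kclusters):
    print(cluster, palette[cluster])
```

Inside the marker loop, `color=palette[cluster]` and `fill_color=palette[cluster]` would then replace the `rainbow[cluster-1]` lookups.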