# Toronto neighbourhoods assignment
IMPORTANT: Please paste the GitHub link for this .ipynb file into nbviewer.org to view the fully rendered maps. GitHub does not render Folium maps.

### created by Edward Jackson

<br/>

## Scraping and cleaning the data

First, import Pandas and the other libraries used in this notebook, and install Folium.

In [297]:
# import pandas and numpy
import pandas as pd
import numpy as np

import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from scikit-learn for the clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



<br/>
Next, locate and read the data from the identified Wikipedia page. Create a dataframe containing the first table in the webpage. Display the first 5 rows of the new dataframe.

In [298]:
# read data
table = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
# create dataframe from first table in the data page
dfToronto = pd.DataFrame(table[0])
# display first 5 rows of dataframe
dfToronto.head()


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


<br/>
Create a variable to identify the unassigned postal codes. Remove these rows from the dataframe and display the first 5 rows of the updated dataframe.

In [299]:
# create variable
noassign = dfToronto[ dfToronto['Borough'] == 'Not assigned' ].index
# remove redundant rows
dfToronto.drop(noassign , inplace=True)
# display top of updated dataframe
dfToronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [300]:
# check for any unassigned Neighbourhoods. None found (the Wikipedia page may have been updated since the task was set), so no further rows to drop.
dfToronto.loc[dfToronto['Neighbourhood'] == "Not assigned"]

Unnamed: 0,Postal Code,Borough,Neighbourhood
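
No rows needed fixing here, but had any assigned borough still carried an unassigned neighbourhood, one option would be to fall back to the borough name. The following is a minimal sketch under that assumption; the toy dataframe and its second row ("M9Z", "Example Borough") are invented for illustration and are not part of the Wikipedia data.

```python
import pandas as pd

# Toy rows for illustration only; the second row is a made-up case of a
# borough whose neighbourhood is listed as "Not assigned".
df = pd.DataFrame({
    "Postal Code": ["M3A", "M9Z"],
    "Borough": ["North York", "Example Borough"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Where the neighbourhood is unassigned, reuse the borough name.
missing = df["Neighbourhood"] == "Not assigned"
df.loc[missing, "Neighbourhood"] = df.loc[missing, "Borough"]

print(df)
```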


<br/>
Display the dimensions of the dataframe.

In [301]:
# dataframe rows and columns
dfToronto.shape

(103, 3)

<br/>

## Combining new data

To map the postal codes, the latitude and longitude of each one are needed. As the Geocoder was not working for me, I used the .csv file provided instead (as permitted within the task outline).

In [302]:
# Access latitude and longitude data for all postal codes in Toronto and convert to new dataframe.
geo = pd.read_csv('http://cocl.us/Geospatial_data', index_col=0)
print('Data downloaded and read into a dataframe!')

Data downloaded and read into a dataframe!


In [303]:
# Check the top of the new dataframe.
geo.head()

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


<br/>
Next, merge the two dataframes.

In [304]:
# merge the 2 dataframes by the common key.
dfTorontogeo = pd.merge(dfToronto, geo, on='Postal Code')
dfTorontogeo.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
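
By default `pd.merge` performs an inner join, so any postal code missing from the coordinates file would be dropped without warning. A minimal sketch of one way to check for that, using toy stand-ins for the two dataframes ("M0X" and "Example Borough" are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for dfToronto and geo; "M0X" is a made-up code with no coordinates.
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Example Borough"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every row from `left`; the indicator column records
# whether a match was found in `right`.
merged = pd.merge(left, right, on="Postal Code", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every postal code found its coordinates.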


## Mapping, clustering and labelling

Create a map displaying all the neighbourhoods in Toronto.

In [305]:
# co-ordinates for the centre of Toronto
latitude = 43.70
longitude = -79.385

In [306]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(dfTorontogeo['Latitude'], dfTorontogeo['Longitude'], dfTorontogeo['Borough'], dfTorontogeo['Neighbourhood']):
    label = '{} ({})'.format(neighborhood, borough) # added () brackets to show Borough name and removed final comma
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Wrangle the data to display only boroughs which feature the name 'Toronto'.

In [307]:
# reduce number of data points
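# .copy() gives an independent dataframe, so columns can be added to it later without affecting dfTorontogeo
dfTorontofocus = dfTorontogeo[dfTorontogeo['Borough'].str.contains(pat='Toronto')].copy()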
dfTorontofocus.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
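
The filter above relies on `str.contains` doing a substring match. A minimal sketch of that behaviour on a toy series follows; the `None` entry is added only to show how missing values can be handled, and `na=False` and `regex=False` are optional arguments that the cell above does not use.

```python
import pandas as pd

boroughs = pd.Series(["Downtown Toronto", "North York", "East Toronto", None])

# str.contains does a substring match; na=False treats missing values as
# non-matches, and regex=False matches the text literally.
mask = boroughs.str.contains("Toronto", na=False, regex=False)
print(boroughs[mask].tolist())
```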


Use K-Means clustering, a machine learning technique, to split the neighbourhoods into zones (clusters) of Toronto.
I experimented with different values of 'k' and different random seeds (random_state), finally settling on the following values, which gave the most pleasing visual outcome.

In [308]:
# set number of clusters
kclusters = 4

# keep only the Latitude and Longitude columns for clustering
toronto_grouped_clustering = dfTorontofocus.drop(['Postal Code','Borough','Neighbourhood'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=12).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 3, 3, 3, 0, 3, 3, 2, 3, 2])
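
I chose 'k' by eye, but a common alternative is to compare the within-cluster sum of squares (`inertia_`) across several values of 'k' and look for the point where the improvement levels off. A minimal sketch of that idea, assuming synthetic points rather than the Toronto data; the centres, spread and range of 'k' are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic latitude/longitude-like points in three loose groups (not the Toronto data).
rng = np.random.default_rng(0)
centres = np.array([[43.65, -79.38], [43.68, -79.30], [43.70, -79.45]])
points = np.vstack([c + rng.normal(scale=0.005, size=(20, 2)) for c in centres])

# Fit k-means for several values of k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=12, n_init=10).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 6))
```

Fixing `random_state` makes each run repeatable; it seeds the centroid initialisation and does not control the number of iterations.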

*At this point, I had some difficulty with errors which needed me to check values matched in two different dataframes. In the end, this information was not required as I had made a simple mistake: fitting k-means to the wrong dataframe!*

In [309]:
# check how many rows were passed to k-means
toronto_grouped_clustering.count()

Latitude     39
Longitude    39
dtype: int64

Add cluster labels to the dataframe and visualise them in an updated map.

In [310]:
# add clustering labels
dfTorontofocus.insert(0, 'Cluster Labels', kmeans.labels_)
dfTorontofocus.head()

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,3,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,3,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,0,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [311]:
map_torontofocus = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lng, borough, neighborhood, cluster in zip(dfTorontofocus['Latitude'], dfTorontofocus['Longitude'], dfTorontofocus['Borough'], dfTorontofocus['Neighbourhood'], dfTorontofocus['Cluster Labels']):
    label = '{}: {} ({})'.format(cluster, neighborhood, borough) # added () brackets to show Borough name and removed final comma
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7,
        parse_html=False).add_to(map_torontofocus)
       
    
map_torontofocus
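
In the cell above, `rainbow[cluster-1]` sends cluster 0 to the last colour through Python's negative indexing, so every cluster still gets its own colour. A simpler way to build the palette might look like the sketch below, which indexes by the cluster label directly; `palette` and `colour_for` are names introduced here for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# One evenly spaced rainbow colour per cluster, converted to hex strings.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map each cluster label (0 .. kclusters-1) to its colour.
colour_for = {label: palette[label] for label in range(kclusters)}
print(colour_for)
```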

A big frustration here: the original map is displayed in full, but this second one has scroll bars which I cannot get rid of!