# Toronto Neighbourhoods Notebook
## Scraping data from Wikipedia
This Jupyter notebook explores and clusters Toronto neighbourhoods.
We first import all the tools we need.

In [54]:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import requests

from geopy.geocoders import Nominatim 

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium 

Solving environment: done

# All requested packages already installed.



Next we scrape the Wikipedia page that lists the neighbourhood names and put them in a readable format.

In [55]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
page = BeautifulSoup(source,'lxml')

Viewing the page HTML, we find the part of the source we want (the table with class 'wikitable sortable') and use BeautifulSoup to get a list of the HTML fragments that represent the table's rows. We then remove the header row.

In [56]:
table = page.find('table', class_='wikitable sortable')
tableLines = table.find_all('tr')
tableLines.pop(0)

<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>

We create an empty list and, for each HTML fragment representing a row, extract the row's cells and append their text to the list.

In [57]:
listOfTableEntries = list()
for line in tableLines:
    lineElements = line.find_all('td')
    for element in lineElements:
        elementText = element.text
        listOfTableEntries.append(elementText)
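
Flattening every cell into one list means the rows have to be rebuilt in groups of three afterwards. As an alternative sketch, the cells can be kept grouped per row while parsing. The example below swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages; `TableRowParser` and `sample_html` are illustrative names, not part of this notebook.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:   # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None


# A tiny hand-written stand-in for the Wikipedia table (assumption).
sample_html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
print(parser.rows)
```

Because header rows contain only `th` cells, they produce no entries and do not need to be popped separately.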
    
    

Going through the list of all elements in the table, we reconstruct it into a list of lists, each inner list representing one row of the table. We drop any row whose borough is 'Not assigned', and where the neighbourhood is 'Not assigned' we use the borough name instead. We also trim the trailing \n from the neighbourhood names.

In [58]:
listOfRows=list()
for i in range(0, len(listOfTableEntries) - 2, 3):  # step through the flat list three cells (one table row) at a time
    if (listOfTableEntries[i+1] != 'Not assigned'):
        if (listOfTableEntries[i+2] == 'Not assigned\n'):
            listOfRows.append([listOfTableEntries[i],listOfTableEntries[i+1],listOfTableEntries[i+1]])
        else:
            listOfRows.append([listOfTableEntries[i],listOfTableEntries[i+1],listOfTableEntries[i+2][:-1]])
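
The same cleaning rules can be written as a small function that does not depend on the exact length of the list. This is only a sketch: `build_rows` and `sample_entries` are hypothetical names, and it strips whitespace from every cell rather than slicing off the last character.

```python
def build_rows(entries, width=3):
    """Rebuild flat cell text into cleaned [postal_code, borough, neighbourhood] rows."""
    rows = []
    for start in range(0, len(entries) - width + 1, width):
        postal_code, borough, neighbourhood = (
            cell.strip() for cell in entries[start:start + width]
        )
        if borough == 'Not assigned':
            continue                   # drop rows without a borough
        if neighbourhood == 'Not assigned':
            neighbourhood = borough    # fall back to the borough name
        rows.append([postal_code, borough, neighbourhood])
    return rows


# Hand-written sample in the same shape as listOfTableEntries (assumption).
sample_entries = ['M1A', 'Not assigned', 'Not assigned\n',
                  'M3A', 'North York', 'Parkwoods\n',
                  'M7A', "Queen's Park", 'Not assigned\n']

print(build_rows(sample_entries))
# prints [['M3A', 'North York', 'Parkwoods'], ['M7A', "Queen's Park", "Queen's Park"]]
```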


Using a while loop, we merge adjacent rows that share a PostalCode, so each PostalCode ends up with a single row listing all of its neighbourhoods.

In [59]:
i = 0
while (i < len(listOfRows)-1):
    row = listOfRows[i]
    belowRow = listOfRows[i+1]
    if (row[0]==belowRow[0]):
        row[2]= row[2] + ', ' + belowRow[2]
        listOfRows.pop(i+1)
    else:
        i=i+1
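
The loop above only compares each row with the one directly below it, so it relies on rows with the same postal code sitting next to each other. A sketch of a variant that merges matching rows wherever they appear, using a dictionary keyed by postal code (`merge_by_postal_code` and `sample_rows` are hypothetical names):

```python
def merge_by_postal_code(rows):
    """Merge rows that share a postal code, joining their neighbourhood names."""
    merged = {}   # dictionaries keep insertion order in Python 3.7+
    for postal_code, borough, neighbourhood in rows:
        if postal_code in merged:
            merged[postal_code][2] += ', ' + neighbourhood
        else:
            merged[postal_code] = [postal_code, borough, neighbourhood]
    return list(merged.values())


# The two M5A rows are deliberately not adjacent in this sample (assumption).
sample_rows = [['M5A', 'Downtown Toronto', 'Harbourfront'],
               ['M3A', 'North York', 'Parkwoods'],
               ['M5A', 'Downtown Toronto', 'Regent Park']]

print(merge_by_postal_code(sample_rows))
# prints [['M5A', 'Downtown Toronto', 'Harbourfront, Regent Park'], ['M3A', 'North York', 'Parkwoods']]
```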
        


We then place the whole thing into a dataframe, now that it has been processed as required.

In [60]:
neighbourhoodTable = pd.DataFrame(listOfRows,columns = ['PostalCode','Borough','Neighbourhood'])

Looking at the head, we can see examples of the merging of PostalCodes in rows 2 and 3, and an example of 'Not assigned' being replaced by the borough name as the neighbourhood in row 4.

In [61]:
neighbourhoodTable.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


In [62]:
neighbourhoodTable.shape

(103, 3)

## Adding in location data
First we shall get the lat/long data from the provided csv file.

In [63]:
latLongData = pd.read_csv('https://cocl.us/Geospatial_data', index_col=0)  # DataFrame.from_csv is deprecated; index_col=0 keeps the postal code as the index


Let's check this data.

In [64]:
latLongData.head()

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


We use a table join, similar to those used in SQL, to combine the postal code lat/long data with the neighbourhood table we prepared above.

In [65]:
neighbourhoodLocationTable = neighbourhoodTable.join(latLongData, on='PostalCode')

In [66]:
neighbourhoodLocationTable.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
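
By default, `DataFrame.join` keeps every row of the left table and fills the new columns with missing values when no match is found. The plain-Python sketch below illustrates that behaviour and shows one way to spot postal codes that received no coordinates. The names `left_join`, `sample_rows` and `sample_coordinates` are hypothetical, and `M9Z` is a made-up code included only to show the unmatched case.

```python
def left_join(rows, coordinates):
    """Attach (lat, lng) to each row; unmatched rows get None and are reported."""
    joined, missing = [], []
    for postal_code, borough, neighbourhood in rows:
        lat, lng = coordinates.get(postal_code, (None, None))
        if lat is None:
            missing.append(postal_code)
        joined.append([postal_code, borough, neighbourhood, lat, lng])
    return joined, missing


sample_rows = [['M3A', 'North York', 'Parkwoods'],
               ['M9Z', 'Example Borough', 'Example Neighbourhood']]  # made-up row
sample_coordinates = {'M3A': (43.753259, -79.329656)}

joined, missing = left_join(sample_rows, sample_coordinates)
print(joined[0])
print('Postal codes without coordinates:', missing)
```

In the notebook itself, a comparable check would be to count missing values in the Latitude column after the join.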


Let's use the geopy geolocator to find the latitude and longitude of Toronto.

In [67]:
address = 'Toronto, Ontario'

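geolocator = Nominatim(user_agent='toronto_neighbourhoods_notebook')  # recent geopy versions require a user_agent; this name is an arbitrary example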
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Now we will use the Folium package to display the locations of the given neighbourhoods on a map of Toronto.

In [68]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighbourhoodLocationTable['Latitude'], neighbourhoodLocationTable['Longitude'], neighbourhoodLocationTable['Borough'], neighbourhoodLocationTable['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Now let's explore these neighbourhoods with Foursquare. We will first collect all of the nearby venues for each postal code. Note: you will need to supply your own Foursquare credentials.

In [69]:
CLIENT_ID = 'KXMEYZ5FNM24P0FFVBOZ2Q0O2B0GDB1TVFKITKMYKSTLFRN5' # your Foursquare ID
CLIENT_SECRET = '54XYMKJPZ2XDIJAVVXBMQJJGRQIVRLZCSPX423LPHZRN5OIW' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: KXMEYZ5FNM24P0FFVBOZ2Q0O2B0GDB1TVFKITKMYKSTLFRN5
CLIENT_SECRET:54XYMKJPZ2XDIJAVVXBMQJJGRQIVRLZCSPX423LPHZRN5OIW
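
Hard-coding credentials and printing them in full makes them easy to leak when a notebook is shared. A minimal sketch of an alternative is to read them from environment variables and print only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, the helpers `load_credentials` and `mask`, and the demo values are all assumptions for illustration.

```python
import os


def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping such as os.environ."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first.')
    return client_id, client_secret


def mask(secret, visible=4):
    """Show only the first few characters of a secret."""
    return secret[:visible] + '*' * (len(secret) - visible)


# Demo mapping standing in for the real environment (made-up values).
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id-1234',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret-5678'}

client_id, client_secret = load_credentials(demo_env)
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```

Calling `load_credentials()` with no argument would read from the real environment instead of the demo mapping.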


We write a function that returns all of the venues near each PostalCode in our neighbourhoodLocationTable.

In [70]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'PostalCode Latitude', 
                  'PostalCode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
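
The request URL in `getNearbyVenues` is assembled with string formatting. As a sketch, the same query string can be built from a dictionary with the standard library's `urlencode`, which also takes care of escaping. The endpoint and parameter names are the ones used above; `build_explore_url` and the placeholder credentials are hypothetical. No request is sent here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL for one location without sending a request."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Placeholder credentials; coordinates are those of M3A from the table above.
url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 43.753259, -79.329656, radius=250)
query = parse_qs(urlsplit(url).query)
print(url)
print(query['radius'], query['ll'])
```

Keeping the parameters in a dictionary makes it harder to pass a hard-coded value where the function argument was intended.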

Let's retrieve those venues!

In [71]:
torontoVenues=getNearbyVenues(neighbourhoodLocationTable['PostalCode'],neighbourhoodLocationTable['Latitude'],neighbourhoodLocationTable['Longitude'])
torontoVenues.head()

Unnamed: 0,PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,M3A,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M4A,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,M4A,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


We'll use one-hot encoding to give each venue category its own column, with a 1 marking the category each venue belongs to.

In [72]:
# one hot encoding
torontoOneHot = pd.get_dummies(torontoVenues[['Venue Category']], prefix="", prefix_sep="")

# add PostalCode column back to dataframe
torontoOneHot['PostalCode'] = torontoVenues['PostalCode'] 

# move PostalCode column to the first column
fixed_columns = [torontoOneHot.columns[-1]] + list(torontoOneHot.columns[:-1])
torontoOneHot = torontoOneHot[fixed_columns]

torontoOneHot.head()

Unnamed: 0,PostalCode,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We want to group these rows by postal code and take the mean of each column, which gives the proportion of nearby venues in each category for each postal code.

In [73]:
torontoGrouped = torontoOneHot.groupby('PostalCode').mean().reset_index()
torontoGrouped.head()

Unnamed: 0,PostalCode,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
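
Taking the mean of a one-hot column within a group is the same as dividing the number of venues in that category by the total number of venues in the group. A plain-Python sketch of that calculation, using the five venues shown in the `torontoVenues.head()` output as sample data (`category_proportions` and `sample_venues` are hypothetical names):

```python
from collections import Counter, defaultdict


def category_proportions(venues):
    """Map each postal code to {category: share of that code's venues}."""
    counts = defaultdict(Counter)
    for postal_code, category in venues:
        counts[postal_code][category] += 1
    return {
        postal_code: {category: n / sum(counter.values()) for category, n in counter.items()}
        for postal_code, counter in counts.items()
    }


sample_venues = [('M3A', 'Park'),
                 ('M3A', 'Fast Food Restaurant'),
                 ('M3A', 'Food & Drink Shop'),
                 ('M4A', 'Hockey Arena'),
                 ('M4A', 'Portuguese Restaurant')]

proportions = category_proportions(sample_venues)
print(proportions['M4A'])
# prints {'Hockey Arena': 0.5, 'Portuguese Restaurant': 0.5}
```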


We will use the following method to show what type of venue is most frequently found near each Postal Code. This will help us find a nice area to live in!

In [74]:
num_top_venues = 3

for pstCde in torontoGrouped['PostalCode']:
    print("----"+pstCde+"----")
    temp = torontoGrouped[torontoGrouped['PostalCode'] == pstCde].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M1B----
                        venue  freq
0        Fast Food Restaurant   1.0
1           Accessories Store   0.0
2  Modern European Restaurant   0.0


----M1C----
               venue  freq
0      Moving Target   0.5
1                Bar   0.5
2  Accessories Store   0.0


----M1E----
                 venue  freq
0       Medical Center  0.17
1       Breakfast Spot  0.17
2  Rental Car Location  0.17


----M1G----
               venue  freq
0        Coffee Shop  0.50
1           Pharmacy  0.25
2  Korean Restaurant  0.25


----M1H----
                  venue  freq
0  Caribbean Restaurant  0.12
1                Bakery  0.12
2                  Bank  0.12


----M1J----
                        venue  freq
0                  Playground   0.5
1  Construction & Landscaping   0.5
2  Modern European Restaurant   0.0


----M1K----
                venue  freq
0      Discount Store  0.33
1         Coffee Shop  0.17
2  Chinese Restaurant  0.17


----M1L----
           venue  freq
0         Baker
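
Several postal codes above have categories tied at the same frequency, and the order among tied entries depends on the sort. A sketch of a ranking with an explicit tie-break (alphabetical by category name), so the top entries come out the same on every run. `top_venues` is a hypothetical helper, and the sample frequencies are the M1G values printed above plus one zero-frequency category.

```python
def top_venues(frequencies, n=3):
    """Return the n most frequent categories, breaking ties alphabetically."""
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


sample_frequencies = {'Pharmacy': 0.25,
                      'Coffee Shop': 0.50,
                      'Korean Restaurant': 0.25,
                      'Accessories Store': 0.0}

print(top_venues(sample_frequencies))
# prints [('Coffee Shop', 0.5), ('Korean Restaurant', 0.25), ('Pharmacy', 0.25)]
```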

Looking through these, we find M2P has a bank, park and electronics store nearby. A perfect place for any up-and-coming data scientist who enjoys the outdoors! Let's find out which neighbourhoods have this postal code.

In [75]:
neighbourhoodLocationTable.loc[neighbourhoodLocationTable['PostalCode']=='M2P']

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
66,M2P,North York,York Mills West,43.752758,-79.400049


So it looks like York Mills West is the place to be!

## Clustering neighbourhoods
We'll cluster the neighbourhoods based upon their venues.

In [76]:
# set number of clusters
kclusters = 5

torontoGroupedClustering = torontoGrouped.drop(columns='PostalCode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(torontoGroupedClustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
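
To give a rough idea of what k-means does with these frequency vectors, here is a minimal plain-Python sketch of the basic assign-and-update loop (Lloyd's algorithm). It is an illustration only, not scikit-learn's implementation, which by default uses a more careful initialisation. `k_means`, `squared_distance` and `sample_points` are hypothetical names, and the sample points are made up.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Cluster points by alternating assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the label of its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Two made-up groups of "venue frequency" vectors with two categories each.
sample_points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

labels, centroids = k_means(sample_points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only which points share a label is meaningful.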

Now we'll create a new dataframe that includes the clustering with the neighbourhood location data.

In [77]:
torontoMerged = torontoGrouped.copy()  # copy so the cluster labels are not added to torontoGrouped itself

# add clustering labels
torontoMerged['Cluster Labels'] = kmeans.labels_

#Make it only contain the columns we want
torontoMerged = pd.concat([torontoMerged['PostalCode'],torontoMerged['Cluster Labels']],axis = 1)

#Merge in location data
torontoMerged = torontoMerged.join(latLongData, on='PostalCode')

torontoMerged.head() # check the last columns!

Unnamed: 0,PostalCode,Cluster Labels,Latitude,Longitude
0,M1B,3,43.806686,-79.194353
1,M1C,2,43.784535,-79.160497
2,M1E,2,43.763573,-79.188711
3,M1G,2,43.770992,-79.216917
4,M1H,2,43.773136,-79.239476


Finally, we can place the postal codes on a map of Toronto, coloured by cluster.

In [78]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(torontoMerged['Latitude'], torontoMerged['Longitude'], torontoMerged['PostalCode'], torontoMerged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
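
The cell above looks colours up with `rainbow[cluster-1]`, which works for label 0 only because a negative index wraps around to the end of the list. A sketch of a more direct mapping builds one colour per cluster and indexes it by the label itself. It uses the standard library's `colorsys` rather than matplotlib; `cluster_palette` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys


def cluster_palette(n_clusters):
    """Return n_clusters hex colours spaced evenly around the hue wheel."""
    palette = []
    for index in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(index / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_palette(5)
sample_labels = [3, 2, 2, 0, 4]          # made-up cluster labels
marker_colours = [palette[label] for label in sample_labels]
print(marker_colours)
```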

So we see the majority of neighbourhoods are in one cluster, with the remaining 4 clusters picking up a few others. Thanks for reading!