<h1>Week 3 Assignment Notebook</h1>

This notebook will be used to record the workings and the outcomes of the assignment for Week 3 of the Applied Data Science Capstone on Coursera.

<h2>Question 1 - Importing and Cleaning Data</h2>

In [1]:
import pandas as pd
import numpy as np

<h4>Import Data from Wikipedia</h4>

In [2]:
raw_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0] #first table is the one we want
raw_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


<h4>Process the Data</h4>

Let's drop all the rows where the Borough is not assigned.

In [3]:
assigned_borough = raw_df['Borough'] != 'Not assigned'
clean_df = raw_df[assigned_borough]
clean_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Then we can combine all the rows with the same postcode and borough, but different neighbourhoods.

In [4]:
clean_df = clean_df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
clean_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
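To make the grouping step concrete, here is a minimal sketch on a made-up three-row frame (the toy values are copied from the tables above, and the names `toy` and `combined` are illustrative only). It applies the same `groupby` / `join` pattern, so rows sharing a postcode and borough collapse into one row with a comma-separated neighbourhood list.

```python
import pandas as pd

# Toy data for illustration: two rows share the postcode M5A
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Same pattern as the cell above: join the neighbourhood strings within each group
combined = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .apply(', '.join)
       .reset_index()
)
print(combined)
```

Note that `groupby` sorts by the grouping keys by default, which is why the real output above starts at M1B rather than M3A.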


And finally for this section, we will assign any <code>'Not assigned'</code> Neighbourhoods to their Borough value instead.

In [5]:
unassigned_neighbourhood = (clean_df['Neighbourhood'] == 'Not assigned')
clean_df.loc[unassigned_neighbourhood,'Neighbourhood'] = clean_df.loc[unassigned_neighbourhood,'Borough']

In [6]:
#this should now show an empty dataframe if we have removed all the 'Not assigned' neighbourhoods
clean_df[clean_df['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood


Question complete!

<h2>Question 2 - Adding Geospatial Location Data</h2>

To make our lives easier, we are going to use the <code>Geospatial_Coordinates.csv</code> file supplied by Coursera to find the correct latitude and longitude for each postcode.

In [7]:
latlong = pd.read_csv('Geospatial_Coordinates.csv')
latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We'll create a new dataframe called <code>geo_df</code>, starting from the cleaned dataframe we had before.  We temporarily set the postal code as the index, use that index to join the geospatial dataframe onto <code>geo_df</code>, and finally reset the index to turn the postal code back into a normal column.

In [35]:
geo_df = clean_df
geo_df = geo_df.set_index('Postcode')
geo_df = geo_df.join(latlong.set_index('Postal Code'))
geo_df = geo_df.reset_index()
geo_df

#this could also be achieved in one line, but it is a little harder to parse!
# geo_df = clean_df.set_index('Postcode').join(latlong.set_index('Postal Code')).reset_index()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
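As an aside, the same result can probably be reached with `merge`, which avoids the temporary index. The sketch below is an alternative, not the approach used in this notebook; it uses small stand-in frames (`toy_clean` and `toy_latlong` are hypothetical names, with values copied from the tables above) and compares the two approaches.

```python
import pandas as pd

# Toy stand-ins for clean_df and latlong
toy_clean = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})
toy_latlong = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],  # deliberately in a different order
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# Approach used in the notebook: index both frames on the postcode, then join
joined = (toy_clean.set_index('Postcode')
                   .join(toy_latlong.set_index('Postal Code'))
                   .reset_index())

# Alternative: a left merge on the differently named key columns
merged = (toy_clean.merge(toy_latlong, how='left',
                          left_on='Postcode', right_on='Postal Code')
                   .drop(columns='Postal Code'))

print(joined)
print(merged)
```

Both keep every row of the left frame, so a postcode missing from the coordinates file would show up with `NaN` latitude and longitude rather than being dropped.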


Question complete!

<h2>Question 3 - Clustering Neighbourhoods</h2>

For this question we will use all boroughs, not just those within the main area of Toronto itself.

<h4>Import Necessary Libraries etc.</h4>

In [9]:
from sklearn.cluster import KMeans
import folium
import json
import requests

<h4>Make Calls to Foursquare</h4>

The cell below contains my Foursquare client ID and client secret and is hidden in the rendered notebook.  These credentials are not shared, so you should only see the output confirming that they have been entered.

In [4]:
from IPython.display import HTML
from IPython.display import display

# Taken from https://stackoverflow.com/questions/31517194/how-to-hide-one-specific-cell-input-or-output-in-ipython-notebook
tag = HTML('''<script>
code_show=true; 
function code_toggle() {
    if (code_show){
        $('div.cell.code_cell.rendered.selected div.input').hide();
    } else {
        $('div.cell.code_cell.rendered.selected div.input').show();
    }
    code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>''')

display(tag)

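import os

# Read the credentials from environment variables so they never appear in the notebook.
# The variable names below are assumptions; use whatever names you have exported.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret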
VERSION = '20190903' # Foursquare API version
print('Client ID and Client Secret have been entered.  The Version is {}.'.format(VERSION))

Client ID and Client Secret have been entered.  The Version is 20190903.


Now we need to generate the URLs for the API calls we will be making.  We can do this with a simple loop through all the latitude and longitude values in <code>geo_df</code>.  Since the assignment permits it, we will reuse the function for this from the Jupyter lab in the module.

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                  'Postcode Latitude', 
                  'Postcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

toronto_venues = getNearbyVenues(names = geo_df['Postcode'],
                                 latitudes = geo_df['Latitude'],
                                 longitudes = geo_df['Longitude'])
print('API calls complete')

API calls complete
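The request URL in the function above is assembled by string formatting. A possible alternative, sketched here under the assumption that the endpoint and parameter names are exactly those used in the cell above, is to let the standard library encode the query string. `build_explore_url` is a hypothetical helper, and the credentials shown are placeholders.

```python
from urllib.parse import urlencode

# Endpoint copied from the cell above
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=1000, limit=100):
    """Hypothetical helper: assemble the explore URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude pair
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes each value and joins the pairs with '&'
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20190903',
                        43.806686, -79.194353)
print(url)
```

Keeping the parameters in a dictionary makes it harder to mix up their order, which is easy to do with a long positional `format` call.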


In [12]:
toronto_venues.head()

Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Images Salon & Spa,43.802283,-79.198565,Spa
1,M1B,43.806686,-79.194353,Staples Morningside,43.800285,-79.196607,Paper / Office Supplies Store
2,M1B,43.806686,-79.194353,Wendy's,43.802008,-79.19808,Fast Food Restaurant
3,M1B,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
4,M1B,43.806686,-79.194353,Harvey's,43.800106,-79.198258,Fast Food Restaurant


<h4>Generate Dummy Variables</h4>

In [13]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']],prefix="",prefix_sep="")
toronto_onehot.head()

Unnamed: 0,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,...,Vietnamese Restaurant,Warehouse Store,Waste Facility,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we need to add the postcode data back into this!

In [14]:
toronto_onehot['Postcode'] = toronto_venues['Postcode']
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_onehot.head()

Unnamed: 0,Postcode,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,...,Vietnamese Restaurant,Warehouse Store,Waste Facility,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


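Next, let's take the mean of each column per postcode.  Because every column is a 0/1 indicator, each mean is the share of that postcode's venues that fall into the category.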

In [15]:
toronto_grouped = toronto_onehot.groupby('Postcode').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Postcode,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,...,Vietnamese Restaurant,Warehouse Store,Waste Facility,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.037037,0.0
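The following sketch walks through the one-hot and group-mean steps on a tiny made-up venue list (`toy_venues`, `onehot`, and `grouped` are illustrative names, not part of the notebook). It shows that the resulting values are per-postcode proportions that sum to 1 across the categories.

```python
import pandas as pd

# Made-up venues: four in M1B, one in M1C
toy_venues = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1B', 'M1B', 'M1C'],
    'Venue Category': ['Spa', 'Fast Food Restaurant', 'Fast Food Restaurant',
                       'Fast Food Restaurant', 'Bar'],
})

# One indicator column per category (dtype=int gives 0/1 rather than booleans)
onehot = pd.get_dummies(toy_venues[['Venue Category']],
                        prefix='', prefix_sep='', dtype=int)

# Put the postcode back as the first column, then average within each postcode
onehot.insert(0, 'Postcode', toy_venues['Postcode'])
grouped = onehot.groupby('Postcode').mean().reset_index()
print(grouped)
```

In this toy case three of the four M1B venues are fast food restaurants, so that column holds 0.75 for M1B.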


And lastly for this section, we will add the latitude and longitude data back in.  We can do this by using the pandas <code>merge</code> function.

In [28]:
toronto_df = pd.merge(geo_df,toronto_grouped,how="inner",left_on="Postcode",right_on="Postcode")
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,...,Vietnamese Restaurant,Warehouse Store,Waste Facility,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0.0,0.0,0.052632,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.037037,0.0


<h4>Generate K-Means Analysis</h4>

In [17]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

We will set k to 5, so the model will produce five clusters.

In [18]:
tor_data = toronto_df.iloc[:,5:-1] # venue columns only; note that the -1 end index also leaves out the final column (Zoo)

In [26]:
k_value = 5
km_model = KMeans(n_clusters=k_value)
km_model.fit(tor_data)
cluster_labels = km_model.labels_
cluster_labels

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0, 1, 2, 2,
       2, 0, 0, 2, 2, 0, 2, 2, 0, 3, 0, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0,
       0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 4])
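K-means starts from randomly chosen centroids, so the labels above may come out differently (or with the cluster numbers permuted) each time the cell is re-run. A minimal sketch of one way to make the run repeatable, assuming scikit-learn's `random_state` and `n_init` parameters and using made-up data in place of `tor_data`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up feature matrix standing in for tor_data
rng = np.random.default_rng(0)
toy_data = rng.random((30, 4))

# Fixing random_state makes the centroid initialisation repeatable
labels_a = KMeans(n_clusters=5, random_state=0, n_init=10).fit(toy_data).labels_
labels_b = KMeans(n_clusters=5, random_state=0, n_init=10).fit(toy_data).labels_

# The two runs should agree label for label
print(np.array_equal(labels_a, labels_b))
```

The seed value 0 is arbitrary; any fixed integer serves the same purpose.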

Let's have a look at what makes these clusters different from each other.

In [29]:
toronto_df.insert(5,'Cluster Label',cluster_labels)
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Label,Accessories Store,Afghan Restaurant,African Restaurant,Airport,...,Vietnamese Restaurant,Warehouse Store,Waste Facility,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0,0.0,0.0,0.052632,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.037037,0.0


In [30]:
toronto_bycluster = toronto_df.iloc[:,5:-1].groupby(by = "Cluster Label").mean() # columns from Cluster Label onward; the -1 end index leaves out the final column (Zoo)
toronto_bycluster

Unnamed: 0_level_0,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,...,Video Store,Vietnamese Restaurant,Warehouse Store,Waste Facility,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
Cluster Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,0.0,0.001284,0.0,0.0,0.005553,0.0,0.0,0.0,0.0,...,0.004381,0.006587,0.0,0.0,0.0,0.0,0.001787,0.001471,0.003038,0.000903
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000556,0.000876,0.0,0.002107,0.001149,0.01031,0.000345,0.000172,0.000345,0.00069,...,0.000517,0.006353,0.000352,0.000261,0.000172,0.002159,0.000424,0.001263,0.000614,0.005129
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we can have a look at what the ten most common places are in each cluster.

In [31]:
toronto_bycluster = toronto_bycluster.transpose()

In [32]:
toronto_topten = pd.DataFrame() # empty dataframe for storage
for c in np.arange(toronto_bycluster.shape[1]): # for each column (cluster) of the transposed frame
    col_top = toronto_bycluster.iloc[:,c].sort_values(ascending=False).iloc[0:10,] # ten highest mean frequencies
    col_top = pd.DataFrame(col_top).reset_index().rename({'index':str(c),2:'ignore'},axis='columns') # convert to a dataframe whose category column is named after the cluster
    toronto_topten[str(c)] = col_top[str(c)] # add to the top ten
toronto_topten

Unnamed: 0,0,1,2,3,4
0,Park,Park,Coffee Shop,Vietnamese Restaurant,Hotel
1,Coffee Shop,Pool,Café,Baseball Field,Dog Run
2,Pizza Place,Farmers Market,Park,Restaurant,Coffee Shop
3,Fast Food Restaurant,Eastern European Restaurant,Pizza Place,Yoga Studio,Yoga Studio
4,Pharmacy,Electronics Store,Restaurant,Fast Food Restaurant,Farmers Market
5,Grocery Store,Elementary School,Italian Restaurant,Elementary School,Elementary School
6,Bank,Empanada Restaurant,Bakery,Empanada Restaurant,Empanada Restaurant
7,Chinese Restaurant,Ethiopian Restaurant,Sushi Restaurant,Ethiopian Restaurant,Ethiopian Restaurant
8,Sandwich Place,Event Space,Japanese Restaurant,Event Space,Event Space
9,Convenience Store,Falafel Restaurant,Sandwich Place,Falafel Restaurant,Falafel Restaurant
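The same alphabetical run of categories (Elementary School, Empanada Restaurant, and so on) appears under clusters 1, 3, and 4, which likely means those entries are ties at a mean of zero rather than venues that are really present. The sketch below, on a made-up frame shaped like the transposed `toronto_bycluster`, shows one way to keep only categories with a non-zero mean; `toy_bycluster`, `top_n`, and `top` are illustrative names.

```python
import pandas as pd

# Rows are venue categories, columns are cluster labels (values are made up)
toy_bycluster = pd.DataFrame(
    {0: [0.30, 0.20, 0.00, 0.10], 3: [0.00, 0.50, 0.00, 0.50]},
    index=['Park', 'Vietnamese Restaurant', 'Zoo', 'Baseball Field'],
)

top_n = 3
top = {}
for cluster in toy_bycluster.columns:
    ranked = toy_bycluster[cluster].sort_values(ascending=False)
    ranked = ranked[ranked > 0]  # drop categories absent from this cluster
    top[cluster] = ranked.head(top_n).index.tolist()

print(top)
```

With this filter a sparse cluster simply yields a shorter list, instead of being padded with categories it does not contain.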


<h4>Visualise Clusters</h4>

Let's create a map of Toronto to start with, and we can add labels to denote each postcode.

In [33]:
map_toronto = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

Then we can use the following code to iterate through all the postcodes and give a different coloured marker depending on what cluster it belongs to.

In [34]:
x = np.arange(k_value)
ys = [i + x + (i*x)**2 for i in range(k_value)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Postcode'], cluster_labels):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_toronto)
       
map_toronto
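One detail of the colouring code is worth spelling out: the palette is indexed with `cluster-1`, so cluster 0 wraps around to the last colour in the list. The sketch below illustrates this with plain Python indexing; the colour names are approximate placeholders standing in for the hex codes in `rainbow`, and `shifted` and `direct` are illustrative names.

```python
# Approximate placeholder names for the five colours in `rainbow`
palette = ['purple', 'blue', 'green', 'orange', 'red']

# Indexing used in the cell above: cluster 0 maps to palette[-1], the last entry
shifted = {cluster: palette[cluster - 1] for cluster in range(5)}

# Direct indexing, shown for comparison
direct = {cluster: palette[cluster] for cluster in range(5)}

for cluster in range(5):
    print(cluster, shifted[cluster], direct[cluster])
```

Every cluster still receives a distinct colour either way; the shift only changes which colour goes with which label.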

We can see on the map a clear grouping of Cluster 2 (blue) postcodes around the inner city of Toronto, which is consistent with the coffee shops, cafés, and restaurants that dominate this cluster.  Cluster 2 postcodes are also spread along the waterfront and dotted elsewhere, which may represent other urban pockets or tourist-centric areas.

The Cluster 0 (red) postcodes look to be more residential, with grocery stores, banks, and pharmacies among their most common venues.

Finally, Clusters 1, 3, and 4 each contain a single postcode.  Looking at what they contain, they stand out for only a couple of features: for example, Cluster 3 has a baseball field and Cluster 4 has a dog run.  These postcodes may be sufficiently different from the surrounding areas that the clustering algorithm places each one in its own cluster.