# Applied Data Science Capstone Project
## Final Project of [IBM's Data Science Professional Certificate Course](https://www.coursera.org/professional-certificates/ibm-data-science)
## Part 1:
## Clustering neighborhoods of Toronto (CA) based on the similarities of their venues 

First, let's import all the libraries we need.

In [1]:
import pandas as pd
import numpy as np 
import geocoder
import requests 
import folium

import matplotlib.cm as cm
import matplotlib.colors as colors

from pandas.io.html import read_html
from sklearn.cluster import KMeans


### Downloading Data
Let's scrape the table of Toronto neighborhoods from Wikipedia into a dataframe.

In [2]:
# Get a list of wiki tables from the following link 
page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikitable = read_html(page,  attrs = {"class":"wikitable"})

# Get the dataframe for the first table 
df_toronto = wikitable[0]

### Pre-processing

In [3]:
# Drop all rows where borough is not assigned
df_toronto.drop(df_toronto[df_toronto['Borough'] == 'Not assigned'].index, inplace = True)

# Drop every row whose neighborhood name appears more than once
# (keep = False removes all copies, not only the repeats)
df_toronto.drop_duplicates(subset = 'Neighborhood', keep = False, inplace = True)

# Neighborhoods are already grouped by postal code, so only replace the slashes with commas
df_toronto['Neighborhood'] = df_toronto['Neighborhood'].str.replace(" /", ",", regex=False)

# No neighborhood is "Not assigned", so nothing needs to be corrected there

# Reset the index, as some rows were dropped 
df_toronto.reset_index(drop = True, inplace = True)

# Print the dataframe
df_toronto.head(20)

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 8 | M5B | Downtown Toronto | Garden District, Ryerson |
| 9 | M6B | North York | Glencairn |
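
To make the pre-processing steps easier to follow, here is a minimal sketch on a small hand-made frame. The sample rows are invented for illustration and only mimic the layout of the Wikipedia table; the separator is replaced with a vectorized `str.replace`.

```python
import pandas as pd

# Hypothetical sample rows that mimic the layout of the Wikipedia table
sample = pd.DataFrame({
    "Postalcode": ["M1A", "M3A", "M5A", "M6A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods",
                     "Regent Park / Harbourfront",
                     "Lawrence Manor / Lawrence Heights"],
})

# Keep only rows with an assigned borough
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# Replace the " /" separator with a comma (literal match, not a regex)
cleaned["Neighborhood"] = cleaned["Neighborhood"].str.replace(" /", ",", regex=False)

# Renumber the rows from 0, since one row was dropped
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```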


Let's create a subset containing only the boroughs with "Toronto" in their names, to keep the analysis simple.

In [4]:
boroughs_containing_toronto = df_toronto[df_toronto['Borough'].str.contains("Toronto")].reset_index(drop=True)

Now, let's add the latitude and longitude of each location.

In [5]:
latitude = []
longitude = []

# For each postal code, find its coordinates and append them to the latitude and longitude lists
for postal_code in boroughs_containing_toronto['Postalcode']:
    lat_lng_coords = None

    # The geocoder can return None, so repeat the request until it answers
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    latitude.append(lat_lng_coords[0])
    longitude.append(lat_lng_coords[1])

# Create new columns with the latitude and longitude lists
boroughs_containing_toronto['Latitude'] = latitude
boroughs_containing_toronto['Longitude'] = longitude

# Print the dataframe 
boroughs_containing_toronto.head(20)

| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.650964 | -79.353041 |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.66179 | -79.38939 |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657491 | -79.377529 |
| 3 | M5C | Downtown Toronto | St. James Town | 43.651734 | -79.375554 |
| 4 | M4E | East Toronto | The Beaches | 43.678148 | -79.295349 |
| 5 | M5E | Downtown Toronto | Berczy Park | 43.645196 | -79.373855 |
| 6 | M5G | Downtown Toronto | Central Bay Street | 43.656072 | -79.385653 |
| 7 | M6G | Downtown Toronto | Christie | 43.668602 | -79.420387 |
| 8 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650542 | -79.384116 |
| 9 | M6H | West Toronto | Dufferin, Dovercourt Village | 43.66491 | -79.438664 |

### Explore and analyze neighborhoods in Toronto
First, we declare the Foursquare credentials.


In [6]:
CLIENT_ID = 'CSB5CUHREMDRX4YDSCQICEVW0VVYSWWZOCGLOKW4NTAYQFG0'
CLIENT_SECRET = 'S5PTZPM2UWJ4JWYQVXVIHYWO3EA3HGGABO5FCKUIVMFW1M3A' 
VERSION = '20180605' # Foursquare API version

Then let's define a function that gets the venues near each neighborhood from the Foursquare API.

In [7]:
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])            

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
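
The list comprehension in `getNearbyVenues` walks a nested JSON structure: `response`, then `groups[0]`, then `items`, then each item's `venue`. The sketch below applies the same traversal to a hand-written dictionary. The payload is invented and reproduces only the keys the function reads; the helper `extract_venues` is a hypothetical name, and skipping venues with an empty category list is an assumption about how one might guard the `categories[0]` lookup.

```python
# Hand-written payload with the same nesting that getNearbyVenues reads
mock_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Cafe",
                           "location": {"lat": 43.651, "lng": -79.353},
                           "categories": [{"name": "Coffee Shop"}]}},
                {"venue": {"name": "Example Park",
                           "location": {"lat": 43.652, "lng": -79.354},
                           "categories": []}},
            ]
        }]
    }
}

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples; skip venues with no category."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        if not venue["categories"]:
            continue  # categories[0] would raise IndexError on an empty list
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     venue["categories"][0]["name"]))
    return rows

rows = extract_venues(mock_payload)
print(rows)
```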

Then we can get a dataframe containing the venues within the given radius of each neighborhood.

In [8]:
toronto_venues = getNearbyVenues(names = boroughs_containing_toronto['Neighborhood'],
                                   latitudes = boroughs_containing_toronto['Latitude'],
                                   longitudes = boroughs_containing_toronto['Longitude'])

We need to process the dataset again to prepare it for the clustering algorithm  

In [9]:
# Apply one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# Move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# Group rows by neighborhood and take the mean frequency of occurrence of each category
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
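
To see why the mean of the one-hot columns gives a category frequency, here is a minimal sketch on invented data: neighborhood "A" has two cafes and one park, and "B" has a single park.

```python
import pandas as pd

# Hypothetical venues: A has 2 cafes and 1 park, B has 1 park
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Park"],
})

# One indicator column per category
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the indicators is the share of each category per neighborhood
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, so every neighborhood is described by the same kind of profile regardless of how many venues it has.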

### Cluster neighborhoods

Let's group the neighborhoods into 8 clusters based on the similarity of their venues.

In [10]:
# Set number of clusters
kclusters = 8

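# Keep only the numeric category frequencies (the positional axis argument was removed in pandas 2.0)
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')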

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
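
As a small illustration of what k-means returns, the sketch below clusters four invented frequency profiles into two groups. The numbers are hypothetical, and `n_init=10` is set explicitly because the default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical category-frequency rows: two cafe-heavy, two park-heavy
profiles = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)

# One integer label per row; which group is called 0 or 1 is arbitrary
labels = model.labels_
print(labels)
```

The label values carry no meaning of their own; only the grouping matters, which is why the map later assigns each label a color rather than interpreting the number.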

Then, we merge the cluster labels with the neighborhood location dataframe, creating the 'toronto_merged' dataframe

In [11]:
# Insert labels on the grouped dataframe
toronto_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

# Join toronto_grouped and boroughs_containing_toronto on the Neighborhood column
toronto_merged = boroughs_containing_toronto.join(toronto_grouped[['Neighborhood','Cluster Labels']].set_index('Neighborhood'), on='Neighborhood')

# Drop NaN values in case they exist and convert the labels to int
toronto_merged.dropna(inplace = True)
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)
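
The sketch below shows, on invented data, why the `dropna` and `astype(int)` steps are needed: a neighborhood with no venues has no row in the grouped frame, so the join leaves its label as NaN and the column becomes float.

```python
import pandas as pd

# Hypothetical location table: "C" has no venues, so it gets no cluster label
locations = pd.DataFrame({"Neighborhood": ["A", "B", "C"],
                          "Latitude": [43.65, 43.66, 43.67]})
labels = pd.DataFrame({"Neighborhood": ["A", "B"],
                       "Cluster Labels": [0, 1]})

# Left join on the neighborhood name; unmatched rows get NaN
merged = locations.join(labels.set_index("Neighborhood"), on="Neighborhood")
print(merged["Cluster Labels"].dtype)  # float, because of the NaN

# Drop the gaps first, then cast the labels back to int
merged = merged.dropna(subset=["Cluster Labels"]).copy()
merged["Cluster Labels"] = merged["Cluster Labels"].astype(int)
print(merged)
```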


Finally, we can display the map with the neighborhoods colored by their cluster labels.

In [12]:
# Get Toronto city coordinates
g = geocoder.arcgis('Toronto, Ontario')
lat_lng_coords = g.latlng
toronto_latitude = lat_lng_coords[0]
toronto_longitude = lat_lng_coords[1]

# Create a map centered on Toronto
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=12)

# Set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [13]:
# Save the map to an HTML file
map_clusters.save('result_map.html')

We can see that many of the neighborhoods in the boroughs whose names contain "Toronto" are very much alike: almost all of them fall into the cluster labeled 0.
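
One way to check this observation is to count the neighborhoods in each cluster. The sketch below uses an invented series of labels as a stand-in for `toronto_merged['Cluster Labels']`; the numbers are hypothetical and are not results from this notebook.

```python
import pandas as pd

# Hypothetical labels standing in for toronto_merged['Cluster Labels']
cluster_labels = pd.Series([0, 0, 0, 2, 0, 1, 0], name="Cluster Labels")

# Number of neighborhoods per cluster, largest first
counts = cluster_labels.value_counts()
share_of_largest = counts.iloc[0] / counts.sum()

print(counts)
print(f"Largest cluster holds {share_of_largest:.0%} of the neighborhoods")
```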