# IBM Applied Data Science Capstone Week 3 Project
### By Elliot Taylor

#### This notebook will explore and cluster the neighborhoods in Toronto.

## Section 1 - Data Acquisition
#### In this section I will scrape the Toronto list of postal codes from Wikipedia to create a usable data set.

link to page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [3]:
# Import needed libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [4]:
# Fetch the page's HTML and parse it with BeautifulSoup using the lxml parser
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_url,'lxml')

In [5]:
# Retrieve the table from the website's HTML
My_table = soup.find('table',{'class':'wikitable sortable'})
# My_table

In [6]:
# Retrieve the column headers from the table
headers_tagged = My_table.find_all('th')
headers = []
for header in headers_tagged:
    # get_text(strip=True) returns the cell text without surrounding whitespace
    headers.append(header.get_text(strip=True))

headers

['Postal Code', 'Borough', 'Neighbourhood']
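
A note on cleaning scraped header text: `str.strip` with an argument removes any of the listed characters from both ends of the string, not a literal substring. The sketch below uses made-up sample strings (not the scraped page) to show how stripping `'<th>'` can eat into a label, while a bare `strip()` only removes surrounding whitespace.

```python
# Hypothetical header strings, standing in for the text of <th> cells
raw_headers = ['Postal Code\n', 'Borough\n', 'Neighbourhood\n']

# strip() with no argument removes surrounding whitespace, including '\n'
clean_headers = [h.strip() for h in raw_headers]
print(clean_headers)

# strip('<th>') treats its argument as the character set {'<', 't', 'h', '>'},
# so a label that ends in 'h' loses that letter
damaged = 'Borough'.strip('<th>')
print(damaged)
```

The labels in this table only survive `strip('<th>')` when they still carry a trailing newline, so relying on it is fragile.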

In [5]:
# Scrape each rows content
c1=[]
c2=[]
c3=[]

for row in My_table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells)==3: 
        c1.append(cells[0].find(text=True).string.strip('\n'))
        c2.append(cells[1].find(text=True).string.strip('\n'))
        c3.append(cells[2].find(text=True).string.strip('\n'))


In [6]:
# Create a dictionary to store the scraped data and assign each column's data to its key
d = dict([x,0] for x in headers)
d['Postal Code'] = c1
d['Borough'] = c2
d['Neighbourhood'] = c3

In [7]:
# Create our Pandas dataframe using the dictionary
df = pd.DataFrame(d)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [8]:
df.shape

(180, 3)

In [9]:
# Drop all rows with no borough assigned
df.drop(df[df.Borough == 'Not assigned'].index, inplace=True)

In [10]:
df.shape

(103, 3)

In [11]:
# Check whether any neighbourhoods are still listed as 'Not assigned'
df.loc[df['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


In [12]:
df.shape

(103, 3)
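
The check above returned no rows, so no neighbourhood needed replacing in this data. If the table did contain such rows, the cleaning could look roughly like the following. This is a minimal sketch on a small made-up DataFrame; `sample`, `cleaned`, and the row values are illustrative assumptions, not the scraped data.

```python
import pandas as pd

# Hypothetical sample rows in the same three-column layout as the scraped table
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M9Z'],
    'Borough': ['Not assigned', 'North York', 'Example Borough'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows that have a borough
cleaned = sample[sample['Borough'] != 'Not assigned'].copy()

# Where the neighbourhood is missing, fall back to the borough name
missing = cleaned['Neighbourhood'] == 'Not assigned'
cleaned.loc[missing, 'Neighbourhood'] = cleaned.loc[missing, 'Borough']

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```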

## Section 2 - Assign Coordinates for each Postal Code
#### In this section I will add the latitude and longitude to each entry in the pandas dataframe

As the geocoder method was not working, I will use the coordinate data provided at http://cocl.us/Geospatial_data instead.


In [17]:
# Import needed library and download data into a new dataframe
import io
url="http://cocl.us/Geospatial_data"
s=requests.get(url).content
df_coords = pd.read_csv(io.StringIO(s.decode('utf-8')))

In [19]:
# Merge dataframes on the shared 'Postal Code' column
df = df.merge(df_coords, on='Postal Code')

In [20]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
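
The default merge is an inner join, which silently drops any postal code that has no match in the coordinate file. One way to make such gaps visible is a left join with pandas' `indicator` and `validate` options. The sketch below uses small made-up frames (`left`, `coords`, and the postal code `M0X` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical stand-ins for the postal code table and the coordinate file
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M0X'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every postal code; indicator=True adds a '_merge' column;
# validate raises if a postal code appears more than once on either side
merged = left.merge(coords, on='Postal Code', how='left',
                    validate='one_to_one', indicator=True)

# Postal codes that found no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```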


## Section 3 - Clustering Neighbourhoods
#### In this section I will use Foursquare venue data to explore the neighbourhoods in Toronto and group them into clusters.



In [94]:
# Import needed libraries
from sklearn.cluster import KMeans
import folium
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

In [93]:
# create map of Toronto using latitude and longitude values
latitude = 43.6532
longitude = -79.3832
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

In [28]:
# Load Foursquare credentials from environment variables rather than hard-coding them
# (the variable names below are assumptions; use whatever names you export)
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
ACCESS_TOKEN = os.environ.get('FOURSQUARE_ACCESS_TOKEN') # your Foursquare Access Token (not used below)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value


In [39]:
# Function to get the nearby venues for each neighbourhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request; requests URL-encodes the query parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT}

        # make the GET request and fail early on an HTTP error
        response = requests.get(url, params=params)
        response.raise_for_status()
        results = response.json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
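
The function above indexes straight into the JSON response, so a neighbourhood with no result groups or a venue with no categories would raise an error. A more defensive parser might look like the sketch below. The helper name `extract_venues` and the sample `payload` are hypothetical; the response shape is assumed from the keys used in the function above, not taken from the Foursquare documentation.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, tolerating missing keys."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of raising IndexError
        category = categories[0].get('name') if categories else None
        rows.append((venue.get('name'), location.get('lat'),
                     location.get('lng'), category))
    return rows


# Made-up payload mirroring the assumed response structure
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Park',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unlabelled Spot',
               'location': {'lat': 43.76, 'lng': -79.34},
               'categories': []}},
]}]}}

rows = extract_venues(payload)
print(rows)
```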

In [40]:
# Create a dataframe containing all of the venues by neighbourhood
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [41]:
print(toronto_venues.shape)
toronto_venues.head()

(2125, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [64]:
# Group venues by neighbourhood to see how many venues each has
toronto_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",9,9,9,9,9,9
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",24,24,24,24,24,24
...,...,...,...,...,...,...
"Willowdale, Willowdale East",34,34,34,34,34,34
"Willowdale, Willowdale West",5,5,5,5,5,5
Woburn,3,3,3,3,3,3
Woodbine Heights,7,7,7,7,7,7


In [43]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 264 unique categories.


In [101]:
# One-hot encode each venue category so the data can be used by the k-means clustering algorithm
df_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix='',prefix_sep='')
df_onehot['Neighbourhood'] =  toronto_venues['Neighbourhood']
fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
df_onehot = df_onehot[fixed_columns]
df_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [177]:
# Group by neighbourhood and take the mean of each one-hot column, giving the relative frequency of each venue category per neighbourhood
df_grouped = df_onehot.groupby('Neighbourhood').mean().reset_index()
df_grouped.shape

(95, 265)
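
To see why the mean of the one-hot columns is a frequency, here is a tiny worked example on made-up data (the neighbourhoods `A` and `B` and their venues are illustrative assumptions). Each row of the result sums to 1, and each value is the share of that neighbourhood's venues in a given category.

```python
import pandas as pd

# Hypothetical venues: neighbourhood A has two parks and a cafe, B has one cafe
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Park', 'Cafe', 'Cafe'],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of a 0/1 column is the fraction of rows equal to 1
freq = onehot.groupby('Neighbourhood').mean()
print(freq)
```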

In [178]:
# Now we will run the k-means clustering on our grouped one hot encoded data set
# set number of clusters
kclusters = 5

df_grouped_clustering = df_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
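
The number of clusters is fixed at 5 here without a selection step. One common way to compare candidate values of k is the silhouette score; this is not what the notebook did, only a sketch of the idea on synthetic points (the three blob centres, noise scale, and range of k are all assumptions for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score gives the most compact, well-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

On the real data the same loop could be run over `df_grouped_clustering`, though sparse one-hot frequencies may not produce a clear winner.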

In [179]:
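# Add the cluster labels to the grouped data
df_grouped.insert(0,'Cluster Labels', kmeans.labels_)

# Attach each neighbourhood's cluster label to its coordinates
# (the default inner merge keeps only neighbourhoods that have venue data)
df_merged = df.merge(df_grouped[['Neighbourhood', 'Cluster Labels']], on='Neighbourhood')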

In [180]:
# create map
latitude = 43.6532
longitude = -79.3832
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Neighbourhood'], df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [181]:

# Average the venue-category frequencies within each cluster, then print each cluster's top venues
df_grouped_clusters = df_grouped.drop(columns=['Neighbourhood']).groupby(by='Cluster Labels').mean()
df_grouped_clusters.reset_index(inplace=True)

num_top_venues = 5

for cluster in df_grouped_clusters['Cluster Labels']:
    print("---- Cluster: "+str(cluster)+"----")
    temp = df_grouped_clusters[df_grouped_clusters['Cluster Labels'] == cluster].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- Cluster: 0----
               venue  freq
0               Park  0.40
1         Playground  0.16
2  Convenience Store  0.13
3       Intersection  0.07
4              River  0.05


---- Cluster: 1----
            venue  freq
0     Coffee Shop  0.07
1     Pizza Place  0.04
2            Park  0.03
3            Café  0.03
4  Sandwich Place  0.03


---- Cluster: 2----
               venue  freq
0     Baseball Field   1.0
1  Accessories Store   0.0
2              Motel   0.0
3     Massage Studio   0.0
4     Medical Center   0.0


---- Cluster: 3----
                 venue  freq
0                  Bar  0.62
1        Garden Center  0.12
2  Rental Car Location  0.12
3            Drugstore  0.12
4    Accessories Store  0.00


---- Cluster: 4----
                       venue  freq
0  Middle Eastern Restaurant   1.0
1          Accessories Store   0.0
2                      Motel   0.0
3        Martial Arts School   0.0
4             Massage Studio   0.0




From this analysis we can see that there are three distinct clusters in the data:

- Cluster 0, which appears to be more residential areas with lots of parks, playgrounds and stores.
- Cluster 1, which appears to be diverse inner-city areas with a large spread of different food and drink venues.
- Cluster 3, which appears to be a nightlife area with a high number of bars.

However, as clusters 2 and 4 demonstrate, this data may not be enough to go on for Toronto. Each of those clusters consists of a single venue category (Baseball Field and Middle Eastern Restaurant respectively, both with a frequency of 1.0), which suggests that some neighbourhoods have very few venues listed in Foursquare. Valuable insights may therefore be difficult to find without further data acquisition and exploration.
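
The concern about sparse neighbourhoods can be checked directly by counting venues per neighbourhood and neighbourhoods per cluster. The following is a minimal sketch on made-up frames; `venues`, `clusters`, and the `min_venues` threshold are assumptions standing in for `toronto_venues` and the cluster labels.

```python
import pandas as pd

# Hypothetical stand-ins for the venue table and the cluster assignments
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
    'Venue Category': ['Park', 'Cafe', 'Bar', 'Park',
                       'Baseball Field', 'Bar', 'Cafe'],
})
clusters = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Cluster Labels': [1, 2, 1],
})

min_venues = 2  # assumed cut-off for "too few venues to characterise"

# Neighbourhoods with fewer venues than the cut-off
venue_counts = venues.groupby('Neighbourhood').size()
sparse = venue_counts[venue_counts < min_venues].index.tolist()

# Number of neighbourhoods assigned to each cluster
cluster_sizes = clusters['Cluster Labels'].value_counts().sort_index()

print(sparse)
print(cluster_sizes)
```

A cluster containing a single sparse neighbourhood is more likely an artifact of missing data than a meaningful grouping, so such neighbourhoods could be filtered out before clustering.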
