# Segmenting and Clustering Neighborhoods in Toronto

This is the peer-graded assignment for Week 3 of the Applied Data Science Capstone course.



It contains three parts, one for each of the three submissions. Use the links below to jump to a part:

- [Scrape neighborhoods in Toronto](#0)<br>
- [Fetching Location data of each neighborhood](#2)<br>
- [Neighbourhoods Clustering Analysis](#5)<br>

In [1]:
import pandas as pd
import numpy as np
import requests


## Scrape neighborhoods in Toronto <a id="0"></a>

**Step (1)** We read every table on the Wikipedia page with the pandas function **read_html**:

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
res = requests.get(url)
dfs = pd.read_html(url)

Let's have a general idea of what we got:

In [3]:
for idx, df in enumerate(dfs):
    print('DataFrame[{}]:{}'.format(idx, df.shape))
    

DataFrame[0]:(180, 3)
DataFrame[1]:(4, 18)
DataFrame[2]:(2, 18)


Only the first DataFrame, with 180 rows and 3 columns, looks like the postal code table we need.
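
Picking the table by position works here, but it can break if the page layout changes. The sketch below selects the table by its column names instead. The helper `pick_table` is hypothetical, and the toy frames stand in for the list returned by `read_html`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name (hypothetical helper)."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError("no table contains columns {}".format(sorted(required)))

# Toy stand-ins for the list of DataFrames returned by pd.read_html
toy_tables = [
    pd.DataFrame({"A": [1], "B": [2]}),
    pd.DataFrame({"Postal code": ["M3A"],
                  "Borough": ["North York"],
                  "Neighborhood": ["Parkwoods"]}),
]

postal_table = pick_table(toy_tables, ["Postal code", "Borough", "Neighborhood"])
print(postal_table.shape)
```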

Let's verify this by reviewing the first 5 rows:

In [4]:
dfs[0].head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


Great! 

Let's rename the column 'Postal code' to 'PostalCode' and save the result in a new variable. That completes our first step.

In [5]:
nb_toronto = dfs[0].rename({'Postal code':'PostalCode'}, axis='columns')
nb_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


Now let's move on to the data cleaning steps described below.

**Step (2)** Ignore cells with a borough that is **Not assigned.**

In [6]:
nb_toronto.shape

(180, 3)

In [7]:
nb_toronto[nb_toronto['Borough']=='Not assigned'].shape

(77, 3)

In [8]:
nb_toronto = nb_toronto[nb_toronto['Borough']!='Not assigned']
nb_toronto.shape

(103, 3)

In [9]:
nb_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


**Step (3)** Check for duplicate postal codes

In [10]:
nb_toronto['PostalCode'].unique().shape

(103,)

Since the number of unique postal codes equals the number of rows, there are no duplicates in the PostalCode column.
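
The same check can be written more directly with `is_unique`, and `duplicated` shows the offending rows when there are any. This is a minimal sketch on invented toy data, not the notebook's real table.

```python
import pandas as pd

# Toy data with one repeated postal code (invented for illustration)
toy = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M5A", "M4A"]})

# True only when every value in the column is distinct
has_no_duplicates = toy["PostalCode"].is_unique

# keep=False marks every occurrence of a repeated value, not just the later ones
duplicated_rows = toy[toy["PostalCode"].duplicated(keep=False)]

print(has_no_duplicates)
print(duplicated_rows)
```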


**Step (4)** Replace ' / ' with ',' in combined neighbourhood names


In [11]:
# vectorized string replacement; assign() returns a new frame and avoids SettingWithCopyWarning
nb_toronto = nb_toronto.assign(Neighborhood=nb_toronto['Neighborhood'].str.replace(' / ', ',', regex=False))
nb_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park,Harbourfront"
5,M6A,North York,"Lawrence Manor,Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government"


**Step (5)** Copy the borough to the neighborhood if the neighborhood is missing


In [12]:
nb_toronto[ nb_toronto['Neighborhood'] == 'Not assigned' ]

Unnamed: 0,PostalCode,Borough,Neighborhood


It looks like we do not have any rows with a **Not assigned** neighbourhood.

To be safe, we also check the column for missing values. Comparing with `== None` never matches in pandas, so we use **isnull()**:

In [13]:
nb_toronto[ nb_toronto['Neighborhood'].isnull() ]

Unnamed: 0,PostalCode,Borough,Neighborhood


In [14]:
nb_toronto.shape

(103, 3)

We are good for this part.
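
No rows needed the fix this time, so the copy itself was never run. If the source table changes and such rows appear, the fill could look like the sketch below. The toy rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows: the second one has no neighborhood (invented for illustration)
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", np.nan],
})

# Treat both real missing values and the literal text 'Not assigned' as missing
missing = toy["Neighborhood"].isnull() | (toy["Neighborhood"] == "Not assigned")

# Copy the borough into the neighborhood only for those rows
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]
print(toy)
```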

**This is the end of the submission for part 1.**

---

## Fetching Location data of each neighborhood<a id="2"></a>

Since the Geocoder package can be very unreliable, we use the provided csv file as our data source of the geographical coordinates of each postal code.

**Step (1)** Read the data from the given URL.

In [15]:
url = 'http://cocl.us/Geospatial_data'
geo_df = pd.read_csv(url)
print(geo_df.shape)
geo_df.head()

(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [16]:
geo_df.dtypes

Postal Code     object
Latitude       float64
Longitude      float64
dtype: object

To keep the column names consistent, we remove the space from 'Postal Code'.

In [17]:
geo_df.rename({'Postal Code':'PostalCode'}, axis=1, inplace=True)
geo_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


**Step (2)** Merge it into the neighbourhood data frame **nb_toronto** that we populated in the previous part.

In [18]:
nb_toronto_geo = pd.merge(nb_toronto, geo_df, on = ['PostalCode'])
print(nb_toronto_geo.shape)
nb_toronto_geo.head()

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park,Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor,Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government",43.662301,-79.389494
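
The default inner merge silently drops postal codes that have no coordinates. Here the row count stayed at 103, so nothing was lost. A more defensive variant, sketched on toy data with one invented unmatched code, uses the `validate` and `indicator` options of `pd.merge` to surface such rows.

```python
import pandas as pd

# Toy frames; 'M9Z' is an invented code with no coordinates
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Toy Borough"],
})
right = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises if a postal code repeats on either side;
# indicator adds a '_merge' column telling where each row came from
checked = pd.merge(left, right, on="PostalCode", how="left",
                   validate="one_to_one", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```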



**This is the end of the submission for part 2.**

***


## Neighbourhoods Clustering Analysis<a id="5"></a>

We perform the clustering analysis in the following steps:
- Step (1) Get a general idea of all Toronto neighbourhoods
- Step (2) Narrow down to Downtown Toronto for further analysis
- Step (3) Define Foursquare credentials and version, and explore the neighborhoods in Downtown Toronto
- Step (4) Prepare venue categories for clustering
- Step (5) Cluster the neighborhoods according to venue category counts
- Step (6) Visualize the result on a map


In [19]:
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans
import folium # map rendering library


**Step (1)** Let's first get a general idea of the neighbourhoods in Toronto by plotting all of them on a map.

In [20]:
lat, lng = geo_df[['Latitude','Longitude']].max() + geo_df[['Latitude','Longitude']].min() 

lat, lng  = lat /2, lng /2 
lat, lng

(43.71926920000001, -79.38815804999999)
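
The map center used here is the midpoint of the bounding box, (max + min) / 2 for each coordinate, rather than the mean of all points. A small sketch of the same calculation as a reusable function follows. The name `bounding_box_center` and the toy coordinates are assumptions for illustration.

```python
import pandas as pd

def bounding_box_center(df, lat_col="Latitude", lng_col="Longitude"):
    """Midpoint of the bounding box that encloses all points (hypothetical helper)."""
    lat = (df[lat_col].max() + df[lat_col].min()) / 2
    lng = (df[lng_col].max() + df[lng_col].min()) / 2
    return lat, lng

# Toy coordinates, invented for illustration
toy = pd.DataFrame({
    "Latitude": [43.0, 44.0, 43.5],
    "Longitude": [-79.0, -80.0, -79.25],
})

center = bounding_box_center(toy)
print(center)
```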

In [42]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[lat, lng], zoom_start=11)

# add markers to map
for idx, r  in nb_toronto_geo.iterrows():
    lat, lng, bor, nb  =  r['Latitude'], r['Longitude'],r['Borough'], r['Neighborhood']
    label = '{}, {}'.format(nb, bor)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Step (2)** Let's focus on the borough **Downtown Toronto** only to demonstrate the analysis.

First we slice the original data frame nb_toronto_geo into dt_toronto. We omit "geo" from the name because every data frame carries geographic information from now on.


In [22]:
dt_toronto = nb_toronto_geo[nb_toronto_geo['Borough']=='Downtown Toronto']
print(dt_toronto.shape)
dt_toronto.head()

(19, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park,Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


Now let's show it on a map.

In [23]:
## first we find out the center of the map
lat, lng = dt_toronto[['Latitude','Longitude']].max() + dt_toronto[['Latitude','Longitude']].min() 

lat, lng  = lat /2, lng /2 

lat, lng

(43.65425465, -79.3915998)

In [41]:
# create map of Downtown Toronto using latitude and longitude values
map_dt_toronto = folium.Map(location=[lat, lng], zoom_start=13)

# add markers to map
for idx, r  in dt_toronto.iterrows():
    lat, lng, bor, nb  =  r['Latitude'], r['Longitude'],r['Borough'], r['Neighborhood']
    label = '{}, {}'.format(nb, bor)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dt_toronto)  
    
map_dt_toronto

**Step (3)** Define Foursquare credentials and version, and explore the neighborhoods in **Downtown Toronto**


In [25]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'  # placeholder: supply your own Foursquare client ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'  # placeholder: never publish real credentials
VERSION = '20180605'
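
One way to keep credentials out of a shared notebook is to read them from environment variables. The sketch below assumes the hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, and it uses a plain dict in place of the real environment for the demonstration.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping (hypothetical variable names)."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first")
    return client_id, client_secret

# Demonstration with a dict standing in for os.environ
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print(demo_id)
```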


We borrow the function **getNearbyVenues** from the course lab, but keep only the venue category for further analysis.



In [26]:
def getNearbyVenues(df, radius = 700, LIMIT = 70):
    
    venues_list=[]
    for idx, row in df.iterrows():
        name, lat, lng = row['Neighborhood'], row['Latitude'], row['Longitude']    
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood Name', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
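
The function above indexes straight into the JSON response, so a venue with an empty `categories` list, or a response without `groups`, would raise an error. The sketch below shows a more tolerant way to pull out the category names. It assumes only the response structure that the code above already relies on. The helper name `extract_categories` and the fake payload are invented for illustration.

```python
def extract_categories(payload):
    """Return the first category name of each venue, skipping venues without one."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    names = []
    for item in items:
        categories = item.get("venue", {}).get("categories", [])
        if categories:  # skip venues that carry no category
            names.append(categories[0]["name"])
    return names

# Fake payload that mimics the structure used by getNearbyVenues
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"categories": [{"name": "Bakery"}]}},
    {"venue": {"categories": []}},
    {"venue": {"categories": [{"name": "Coffee Shop"}]}},
]}]}}

category_names = extract_categories(fake_payload)
print(category_names)
```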


In [27]:
venues_dt_toronto = getNearbyVenues( dt_toronto )
print(venues_dt_toronto.shape)

(1174, 6)


venues_dt_toronto.rename({'Neighborhood':'Neighborhood Name'}, axis = 1, inplace = True)
venues_dt_toronto.columns

In [28]:
venues_dt_toronto.head()

Unnamed: 0,Neighborhood Name,Neighborhood Latitude,Neighborhood Longitude,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park,Harbourfront",43.65426,-79.360636,43.653447,-79.362017,Bakery
1,"Regent Park,Harbourfront",43.65426,-79.360636,43.653559,-79.361809,Coffee Shop
2,"Regent Park,Harbourfront",43.65426,-79.360636,43.653249,-79.358008,Distribution Center
3,"Regent Park,Harbourfront",43.65426,-79.360636,43.654735,-79.359874,Spa
4,"Regent Park,Harbourfront",43.65426,-79.360636,43.656369,-79.35698,Restaurant


Let's get the whole picture by counting the venues in each neighborhood and the venues in each category.

In [29]:
venues_dt_toronto[['Neighborhood Name','Venue Category']].groupby('Neighborhood Name').count()

Neighborhood Name,Venue Category
Berczy Park,70
"CN Tower,King and Spadina,Railway Lands,Harbourfront West,Bathurst Quay,South Niagara,Island airport",25
Central Bay Street,70
Christie,29
Church and Wellesley,70
"Commerce Court,Victoria Hotel",70
"First Canadian Place,Underground city",70
"Garden District, Ryerson",70
"Harbourfront East,Union Station,Toronto Islands",70
"Kensington Market,Chinatown,Grange Park",70


In [30]:
venues_dt_toronto[['Neighborhood Name','Venue Category']].groupby('Venue Category').count()\
               .sort_values(by='Neighborhood Name', ascending = False).head(10)

Venue Category,Neighborhood Name
Coffee Shop,106
Café,75
Restaurant,40
Hotel,38
Japanese Restaurant,29
Park,29
Gastropub,26
Italian Restaurant,24
Bakery,21
Seafood Restaurant,20


We can tell that the most common venue categories are **Coffee Shop, Café, Restaurant,** and **Hotel**, followed by **Japanese Restaurant** and **Park** with 29 venues each.
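
The same ranking can be obtained with `value_counts`, which counts and sorts in one call. This is a minimal sketch on invented toy data.

```python
import pandas as pd

# Toy venue list, invented for illustration
toy_venues = pd.DataFrame({
    "Venue Category": ["Coffee Shop", "Hotel", "Coffee Shop",
                       "Park", "Coffee Shop", "Hotel"],
})

# value_counts sorts by frequency in descending order
top_categories = toy_venues["Venue Category"].value_counts().head(2)
print(top_categories)
```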

**Step (4)** Prepare venue categories for clustering

We need to apply one-hot encoding to the category column first.

In [31]:
dt_toronto_onehot = pd.get_dummies(venues_dt_toronto[['Venue Category']], prefix = "", prefix_sep="")
print(dt_toronto_onehot.shape)
dt_toronto_clustering  = dt_toronto_onehot.copy()

(1174, 200)


Now we put the neighborhood name back into the one-hot data frame as the first column. It keeps the name **Neighborhood Name**.

In [32]:
dt_toronto_onehot = pd.concat( [venues_dt_toronto[['Neighborhood Name']], dt_toronto_onehot], axis= 1)
print(dt_toronto_onehot.shape)

(1174, 201)


In [33]:
dt_toronto_onehot.head()

Unnamed: 0,Neighborhood Name,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Animal Shelter,Aquarium,...,Tunnel,Udon Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,"Regent Park,Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park,Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park,Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park,Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park,Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Since we are clustering neighborhoods, we group the one-hot rows by neighborhood and sum them before applying k-means clustering. This gives the count of each venue category per neighborhood.


In [34]:
dt_toronto_groups = dt_toronto_onehot.groupby('Neighborhood Name').sum()
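
To make the one-hot and group-by steps concrete, here is a small sketch on invented toy data. It also shows that `pd.crosstab` gives the same counts in one call, as a possible shortcut.

```python
import pandas as pd

# Toy venues: neighborhood A has two bakeries and a park, B has one park
toy_venues = pd.DataFrame({
    "Neighborhood Name": ["A", "A", "A", "B"],
    "Venue Category": ["Bakery", "Park", "Bakery", "Park"],
})

# One row per venue, one 0/1 column per category
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood Name", toy_venues["Neighborhood Name"])

# Summing the 0/1 columns per neighborhood gives category counts
counts = onehot.groupby("Neighborhood Name").sum()

# crosstab counts the same pairs directly
counts_crosstab = pd.crosstab(toy_venues["Neighborhood Name"],
                              toy_venues["Venue Category"])
print(counts)
```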

**Step (5)** Cluster the neighborhoods according to venue category counts

We will run the k-means to cluster the neighborhood into 5 clusters.

In [35]:
kclusters = 5 

kmeans = KMeans(n_clusters= kclusters, random_state = 0).fit(dt_toronto_groups)

kmeans.labels_

array([1, 0, 2, 3, 2, 1, 1, 2, 2, 4, 2, 2, 1, 0, 1, 3, 1, 1, 4],
      dtype=int32)
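
Because k-means works on distances between raw counts, neighborhoods with far fewer venues (for example 25 or 29 instead of 70) may end up grouped by volume rather than by their mix of categories. One possible refinement, shown here only as a sketch on invented toy counts, is to convert each row to proportions before clustering.

```python
import pandas as pd

# Toy counts: same category mix, very different volume (invented for illustration)
toy_counts = pd.DataFrame(
    {"Coffee Shop": [10, 2], "Park": [10, 2]},
    index=["Busy area", "Quiet area"],
)

# Divide each row by its total so every row sums to 1
toy_freq = toy_counts.div(toy_counts.sum(axis=1), axis=0)
print(toy_freq)
```

After this step both toy rows are identical, so a distance-based method would treat them as the same kind of neighborhood.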

Now it's time to put the labels and neighborhood names together.


In [36]:
dt_toronto_groups['Label'] = kmeans.labels_

In [37]:
tmp =  dt_toronto_groups['Label'].reset_index().rename({'Neighborhood Name':'Neighborhood'}, axis = 1)
cluster_result = pd.merge(tmp, dt_toronto) 
print(cluster_result.shape)
cluster_result


(19, 6)


Unnamed: 0,Neighborhood,Label,PostalCode,Borough,Latitude,Longitude
0,Berczy Park,1,M5E,Downtown Toronto,43.644771,-79.373306
1,"CN Tower,King and Spadina,Railway Lands,Harbou...",0,M5V,Downtown Toronto,43.628947,-79.39442
2,Central Bay Street,2,M5G,Downtown Toronto,43.657952,-79.387383
3,Christie,3,M6G,Downtown Toronto,43.669542,-79.422564
4,Church and Wellesley,2,M4Y,Downtown Toronto,43.66586,-79.38316
5,"Commerce Court,Victoria Hotel",1,M5L,Downtown Toronto,43.648198,-79.379817
6,"First Canadian Place,Underground city",1,M5X,Downtown Toronto,43.648429,-79.38228
7,"Garden District, Ryerson",2,M5B,Downtown Toronto,43.657162,-79.378937
8,"Harbourfront East,Union Station,Toronto Islands",2,M5J,Downtown Toronto,43.640816,-79.381752
9,"Kensington Market,Chinatown,Grange Park",4,M5T,Downtown Toronto,43.653206,-79.400049


**Step (6)** Visualize the result on a map


In [43]:
#find out the center point of the map 
lat, lng = cluster_result[['Latitude','Longitude']].max() + cluster_result[['Latitude','Longitude']].min() 
lat, lng  = lat /2, lng /2 
print(lat, lng)

# create map
map_clusters = folium.Map(location=[lat, lng], zoom_start=13)

# set one color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by the cluster label stored in cluster_result
for lat, lon, poi, cluster in zip(cluster_result['Latitude'], 
                                  cluster_result['Longitude'], 
                                  cluster_result['Neighborhood'], 
                                  cluster_result['Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels are 0-based, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


43.65425465 -79.3915998
