# Segmenting and Clustering Neighborhoods in Toronto

## Part 1
### Installing dependencies

In [1]:
import sys
!conda install --yes --prefix {sys.prefix} numpy
!conda install --yes --prefix {sys.prefix} pandas
!conda install --yes --prefix {sys.prefix} lxml
!conda install --yes --prefix {sys.prefix} requests
!conda install --yes --prefix {sys.prefix} -c conda-forge geopy
!conda install --yes --prefix {sys.prefix} -c conda-forge folium=0.5.0
!conda install --yes --prefix {sys.prefix} -c conda-forge scikit-learn

print('Dependencies installed')

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/ivan/anaconda3/envs/Coursera_Capstone

  added / updated specs:
    - numpy


The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2019.11.~ --> pkgs/main::ca-certificates-2020.1.1-0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge::certifi-2019.11.28-py37h~ --> pkgs/main::certifi-2019.11.28-py37_1
  openssl            conda-forge::openssl-1.1.1f-h516909a_0 --> pkgs/main::openssl-1.1.1f-h7b6447c_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package met

### Importing libraries

In [2]:
import numpy as np
import pandas as pd
import folium 
import requests 
from pandas import json_normalize
from sklearn.cluster import KMeans

print('Libraries imported')

Libraries imported


### Reading HTML with pandas

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)[0]
print(df.shape)
df.head()

(180, 3)


Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


### Only process the cells that have an assigned borough.
Ignore cells with a borough that is Not assigned.

In [4]:
df_cleared = df[df['Borough'] != 'Not assigned']
df_cleared = df_cleared.reset_index(drop=True)
print(df_cleared.shape)
df_cleared.head()

(103, 3)


Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


### If a cell has a borough but a Not assigned neighborhood.
In that case, the neighborhood will be the same as the borough.  
Let's check whether any such rows exist.

In [5]:
df_cleared[df_cleared['Neighborhood' ] == 'Not assigned']

Unnamed: 0,Postal code,Borough,Neighborhood
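
The check above returns no rows, so no fallback is needed for this table. If such rows did exist, the rule could be applied roughly as in the following sketch. The sample rows and the postal code M9Z are hypothetical, used only for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the second one has no neighborhood assigned.
sample = pd.DataFrame({
    'Postal code': ['M3A', 'M9Z'],
    'Borough': ['North York', 'Etobicoke'],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Treat both NaN and the literal 'Not assigned' as a missing neighborhood.
missing = sample['Neighborhood'].isna() | (sample['Neighborhood'] == 'Not assigned')

# Where the neighborhood is missing, fall back to the borough name.
sample.loc[missing, 'Neighborhood'] = sample.loc[missing, 'Borough']

print(sample)
```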


### More than one neighborhood can exist in one postal code area.
Such rows will be combined into one row, with the neighborhoods separated by a comma.  
In this version of the table the neighborhoods already appear in a single cell, separated by " / ".

Let's check whether any postal code appears in more than one row.

In [6]:
print(len(df_cleared['Postal code']) == len(np.unique(df_cleared['Postal code'])))

True


In [7]:
df_cleared['Postal code'].value_counts()

M5T    1
M2H    1
M1S    1
M2R    1
M2L    1
      ..
M5N    1
M6S    1
M6R    1
M7A    1
M5J    1
Name: Postal code, Length: 103, dtype: int64
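
Every postal code appears exactly once, so nothing has to be combined here. For a table that did list one neighborhood per row, the combination step might look like this sketch. The sample rows are hypothetical; the grouping keys are an assumption about how such a table would be laid out.

```python
import pandas as pd

# Hypothetical sample in which one postal code spans two rows.
sample = pd.DataFrame({
    'Postal code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Group by postal code and borough, then join the neighborhood names with a comma.
combined = (
    sample
    .groupby(['Postal code', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

print(combined)
```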

In [8]:
# Sort by postal code
df_cleared = df_cleared.sort_values(by='Postal code').reset_index(drop=True)
df_cleared.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [9]:
df_cleared.shape

(103, 3)

<hr>

## Part 2
### Getting Latitude and Longitude
Now that we have built a dataframe with the postal code, borough name, and neighborhood name of each area, we need the latitude and longitude of each postal code in order to use the Foursquare location data.

Let's read the coordinates from the CSV file.

In [10]:
df_coors = pd.read_csv('Geospatial_Coordinates.csv')

Let's sort this dataframe by postal code so that its rows line up with the main dataframe (df_cleared).

In [11]:
# Sort by postal code
df_coors = df_coors.sort_values(by='Postal Code').reset_index(drop=True)
df_coors.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [12]:
df_coors.shape

(103, 3)

Dataframes to be used:

1. df_cleared
2. df_coors

Drop the postal code column of the second dataframe.

In [13]:
df_lat_lng = df_coors.drop(columns=['Postal Code'])
df_lat_lng.head()

Unnamed: 0,Latitude,Longitude
0,43.806686,-79.194353
1,43.784535,-79.160497
2,43.763573,-79.188711
3,43.770992,-79.216917
4,43.773136,-79.239476


In [14]:
df_canada = pd.concat([df_cleared, df_lat_lng], axis=1 )
df_canada.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
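
Concatenating side by side only gives correct rows when both dataframes are in exactly the same order. A sketch of an alternative that matches rows by postal code instead is shown below. The two small dataframes are hypothetical stand-ins for df_cleared and df_coors, deliberately given in different row order.

```python
import pandas as pd

# Hypothetical miniature versions of df_cleared and df_coors, in different row order.
neighborhoods = pd.DataFrame({
    'Postal code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# Match rows on the postal code; validate raises if a code is duplicated on either side.
merged = neighborhoods.merge(
    coords,
    left_on='Postal code',
    right_on='Postal Code',
    how='left',
    validate='one_to_one',
).drop(columns='Postal Code')

print(merged)
```

With a left merge, a postal code that has no coordinates ends up with NaN values rather than silently taking another row's coordinates.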


<hr>

## Part 3  
### Cluster Neighborhoods
Explore and cluster the neighborhoods in Toronto.
The assignment allows working only with boroughs that contain the word Toronto and replicating the analysis done for the New York City data; here all boroughs are used.

First, a look at the combined dataframe.

In [15]:
df_canada.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


Let's plot the Neighborhoods to explore them

In [16]:
lat_canada = 43.7001100
lng_canada = -79.4163000

In [17]:
map_canada = folium.Map(location=[lat_canada, lng_canada], zoom_start=10)  
 
# Add markers to the map
for lat, lng, borough, neighborhood in zip(df_canada['Latitude'], df_canada['Longitude'], df_canada['Borough'], df_canada['Neighborhood']):  
    label = '{}, {}'.format(neighborhood, borough)  
    label = folium.Popup(label, parse_html=True)  
    folium.CircleMarker(
        [lat, lng],  
        radius=5,  
        popup=label,  
        color='blue',  
        fill=True,  
        fill_color='#3186cc',  
        fill_opacity=0.7,  
        parse_html=False).add_to(map_canada)
    
map_canada

Let's assign a color for each borough

In [18]:
# How many boroughs are there?
df_canada['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
East Toronto         5
York                 5
Mississauga          1
Name: Borough, dtype: int64

In [19]:
# Assign a color
borough_colors = {
    'North York': '#EA2027',
    'Downtown Toronto': '#006266',
    'Scarborough': '#1B1464',
    'Etobicoke': '#5758BB',
    'Central Toronto': '#6F1E51',
    'West Toronto': '#EE5A24',
    'York': '#009432',
    'East York': '#0652DD',
    'East Toronto': '#9980FA',
    'Mississauga': '#833471',
}

In [20]:
# Plot
map_canada = folium.Map(location=[lat_canada, lng_canada], zoom_start=10)  
 
# Add markers to the map
for lat, lng, borough, neighborhood in zip(df_canada['Latitude'], df_canada['Longitude'], df_canada['Borough'], df_canada['Neighborhood']):  
    label = '{}, {}'.format(neighborhood, borough)  
    label = folium.Popup(label, parse_html=True)  
    folium.CircleMarker(
        [lat, lng],  
        radius=5,  
        popup=label,  
        color=borough_colors[borough],  
        fill=True,  
        fill_color=borough_colors[borough],  
        fill_opacity=0.7,  
        parse_html=False).add_to(map_canada)
    
map_canada


Now let's get venues for each Neighborhood


In [21]:
# Foursquare config parameters
CLIENT_ID = 'WSBDS3PHA2ZA2QRF1K2PFSPE1G2DOMXDFX5LTEJ2NCC5OUG1' # your Foursquare ID
CLIENT_SECRET = 'LOHFOAR0DHZK5WYJOU1N0FRMLVYOUKNYK3KBRCTT33YSEQBH' # your Foursquare Secret
VERSION = '20200404'
LIMIT = 100
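
The keys above are written directly into the notebook, so they travel with every shared copy. One possible alternative, sketched here, reads them from environment variables. The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are hypothetical and would have to be set before starting Jupyter.

```python
import os

# Hypothetical variable names; set them in the shell before starting Jupyter.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

# Report missing credentials early instead of failing later inside a request.
if not CLIENT_ID or not CLIENT_SECRET:
    print('Foursquare credentials are not set')
```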

Define the URL

In [22]:
# define URL with a sample latitude and longitude
latitude = df_canada['Latitude'][0]
longitude = df_canada['Longitude'][0]

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}'.format(CLIENT_ID,
                                                                                                                 CLIENT_SECRET,
                                                                                                                 latitude,
                                                                                                                 longitude,
                                                                                                                 VERSION)
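
The same URL can also be assembled from a dictionary of parameters, which keeps the query string readable and handles escaping. This is a minimal sketch using only the standard library; the helper name build_search_url is hypothetical, and the placeholder credentials 'ID' and 'SECRET' are not real keys.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, lat, lng, version):
    """Return a venues/search URL with URL-encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
    }
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

# Placeholder credentials, with the coordinates of the first row (M1B).
url = build_search_url('ID', 'SECRET', 43.806686, -79.194353, '20200404')
print(url)
```

Note that urlencode escapes the comma in the ll value as %2C, which is an equivalent encoding of the same parameter.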


### For each neighborhood we do the following steps

Get the venues near the neighborhood location

In [23]:

# Send a GET request to the venues search endpoint
venues_json_dirty = requests.get(url).json()
print('Request sent')


Request sent


Process each venue and find out its category

In [24]:
if len(venues_json_dirty['response']['venues']) == 0:
    print('No venues are available at the moment!')

else:
    # assign relevant part of JSON to venues
    venues_json = venues_json_dirty['response']['venues']
    # Getting the name of the primary category
    for v in venues_json:
        if isinstance(v['categories'], list):
            if len( v['categories'] ) > 0:
                v['categories'] = v['categories'][0]['name']
            else:
                v['categories'] = 'Not assigned'
    # transform venues into a dataframe
    venues_df_dirty = json_normalize(venues_json)
    ## Preprocessing 
    venues_df = pd.DataFrame({
        'category': venues_df_dirty['categories'],
        'distance': venues_df_dirty['location.distance']
    })
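
The category step above replaces each venue's list of categories with the name of its first category. The same rule can be written as a small function, sketched here. The function name primary_category and the sample venue records are hypothetical; the records only mimic the fields of the response that the cell above relies on.

```python
def primary_category(venue):
    """Return the name of a venue's first category, or 'Not assigned'."""
    categories = venue.get('categories') or []
    if categories:
        return categories[0]['name']
    return 'Not assigned'

# Hypothetical venue records shaped like the parts of the response used above.
venues = [
    {'name': 'A', 'categories': [{'name': 'Coffee Shop'}, {'name': 'Bakery'}]},
    {'name': 'B', 'categories': []},
]

names = [primary_category(v) for v in venues]
print(names)
```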
    


Venue categories for the Neighborhood location

In [25]:
venues_df['category'].value_counts()

Automotive Shop                  4
Not assigned                     3
Fast Food Restaurant             2
Restaurant                       2
Coffee Shop                      2
Business Service                 1
Trail                            1
Filipino Restaurant              1
Fried Chicken Joint              1
Convenience Store                1
Caribbean Restaurant             1
Gas Station                      1
Print Shop                       1
African Restaurant               1
Auto Garage                      1
Sandwich Place                   1
Paper / Office Supplies Store    1
Greek Restaurant                 1
Construction & Landscaping       1
Building                         1
Pet Store                        1
Shopping Mall                    1
Name: category, dtype: int64

One-hot encode the categories

In [26]:
one_hot_categories_df = pd.get_dummies(venues_df['category'])
one_hot_categories_df.head()

Unnamed: 0,African Restaurant,Auto Garage,Automotive Shop,Building,Business Service,Caribbean Restaurant,Coffee Shop,Construction & Landscaping,Convenience Store,Fast Food Restaurant,...,Gas Station,Greek Restaurant,Not assigned,Paper / Office Supplies Store,Pet Store,Print Shop,Restaurant,Sandwich Place,Shopping Mall,Trail
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


Get the mean value for each category

In [27]:
one_hot_categories_sers = one_hot_categories_df.mean()
one_hot_categories_sers

African Restaurant               0.033333
Auto Garage                      0.033333
Automotive Shop                  0.133333
Building                         0.033333
Business Service                 0.033333
Caribbean Restaurant             0.033333
Coffee Shop                      0.066667
Construction & Landscaping       0.033333
Convenience Store                0.033333
Fast Food Restaurant             0.066667
Filipino Restaurant              0.033333
Fried Chicken Joint              0.033333
Gas Station                      0.033333
Greek Restaurant                 0.033333
Not assigned                     0.100000
Paper / Office Supplies Store    0.033333
Pet Store                        0.033333
Print Shop                       0.033333
Restaurant                       0.066667
Sandwich Place                   0.033333
Shopping Mall                    0.033333
Trail                            0.033333
dtype: float64
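
Averaging the one-hot columns gives each category's share of the venues in the neighborhood. As a cross-check, the same shares can be computed directly with value_counts(normalize=True). The sketch below uses a hypothetical list of four categories.

```python
import pandas as pd

# Hypothetical categories for one neighborhood.
categories = pd.Series(['Coffee Shop', 'Trail', 'Coffee Shop', 'Pet Store'])

# Route 1: one-hot encode, then average each column.
one_hot_mean = pd.get_dummies(categories).mean()

# Route 2: relative frequency of each category.
shares = categories.value_counts(normalize=True)

# Both give the share of venues per category; only the ordering differs.
print(one_hot_mean.sort_index())
print(shares.sort_index())
```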

Create a dataframe with one row per neighborhood

In [28]:
one_hot_categories_mean_df = pd.DataFrame( [one_hot_categories_sers.values], columns = one_hot_categories_sers.index)

one_hot_categories_mean_df.head()

Unnamed: 0,African Restaurant,Auto Garage,Automotive Shop,Building,Business Service,Caribbean Restaurant,Coffee Shop,Construction & Landscaping,Convenience Store,Fast Food Restaurant,...,Gas Station,Greek Restaurant,Not assigned,Paper / Office Supplies Store,Pet Store,Print Shop,Restaurant,Sandwich Place,Shopping Mall,Trail
0,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.066667,...,0.033333,0.033333,0.1,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333


Add mean distance

In [29]:
one_hot_categories_mean_df['distance'] = venues_df['distance'].mean()
one_hot_categories_mean_df.head()

Unnamed: 0,African Restaurant,Auto Garage,Automotive Shop,Building,Business Service,Caribbean Restaurant,Coffee Shop,Construction & Landscaping,Convenience Store,Fast Food Restaurant,...,Greek Restaurant,Not assigned,Paper / Office Supplies Store,Pet Store,Print Shop,Restaurant,Sandwich Place,Shopping Mall,Trail,distance
0,0.033333,0.033333,0.133333,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.066667,...,0.033333,0.1,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.033333,639.8


### Let's do the above process for all of the neighborhoods 

In [42]:
# Create a dataframe to store the result
canada_categories_mean_df = pd.DataFrame()

# for each neighborhood
print('Processing data ...')
for lat, lng, neighborhood in zip(df_canada['Latitude'], df_canada['Longitude'], df_canada['Neighborhood']):
    # Define URL 
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}'.format(
                                                                                                    CLIENT_ID,
                                                                                                    CLIENT_SECRET,
                                                                                                    lat,
                                                                                                    lng,
                                                                                                    VERSION)
    # Make the request
    venues_json_dirty = requests.get(url).json()
    # Preprocess the categories for each venue in this neighborhood
    if len(venues_json_dirty['response']['venues']) == 0:
        # No row is added in this case, so row positions would stop matching df_canada
        print('No venues are available at the moment!')

    else:
        # assign relevant part of JSON to venues
        venues_json = venues_json_dirty['response']['venues']
        # Getting the name of the primary category
        for v in venues_json:
            if isinstance(v['categories'], list):
                if len( v['categories'] ) > 0:
                    v['categories'] = v['categories'][0]['name']
                else:
                    v['categories'] = 'Not assigned'
        # transform venues into a dataframe
        venues_df_dirty = json_normalize(venues_json)
        ## Preprocessing 
        venues_df = pd.DataFrame({
            'category': venues_df_dirty['categories'],
            'distance': venues_df_dirty['location.distance']
        })
        # One-hot encoding
        one_hot_categories_df = pd.get_dummies(venues_df['category'])
        # Mean value for each category
        one_hot_categories_sers = one_hot_categories_df.mean()
        one_hot_categories_mean_df = pd.DataFrame( [one_hot_categories_sers.values], columns = one_hot_categories_sers.index)
        # Append mean distance
        one_hot_categories_mean_df['distance'] = venues_df['distance'].mean()
        # one_hot_categories_mean_df['neighborhood'] = neighborhood
        canada_categories_mean_df = pd.concat([canada_categories_mean_df, one_hot_categories_mean_df], axis=0, ignore_index=True)

    
print('... Data processed')


Processing data ...
... Data processed
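
The loop above adds no row for a neighborhood without venues and grows the result with one concat per iteration. A sketch of an alternative collects one dictionary per neighborhood and builds the dataframe once, so every neighborhood keeps a row. The helper category_shares and the per-neighborhood category lists are hypothetical and stand in for the API responses.

```python
from collections import Counter

import pandas as pd

def category_shares(categories):
    """Return a {category: share} dict for one neighborhood's venue categories."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

# Hypothetical category lists; the second neighborhood has no venues.
venues_by_neighborhood = {
    'Woburn': ['Coffee Shop', 'Coffee Shop', 'Trail', 'Pet Store'],
    'Cedarbrae': [],
}

rows = [category_shares(cats) for cats in venues_by_neighborhood.values()]

# One row per neighborhood, including empty ones, with missing categories set to 0.
features = pd.DataFrame(rows, index=list(venues_by_neighborhood.keys())).fillna(0)
print(features)
```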


In [43]:
canada_categories_mean_df.head()

Unnamed: 0,African Restaurant,Auto Garage,Automotive Shop,Building,Business Service,Caribbean Restaurant,Coffee Shop,Construction & Landscaping,Convenience Store,Fast Food Restaurant,...,Burmese Restaurant,Hungarian Restaurant,Shop & Service,State / Provincial Park,River,Poutine Place,Social Club,Drugstore,Kingdom Hall,Swiss Restaurant
0,0.033333,0.033333,0.1,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.066667,...,,,,,,,,,,
1,,0.033333,,,,,,,,,...,,,,,,,,,,
2,,,0.033333,,,,,,,,...,,,,,,,,,,
3,,,,0.033333,,,0.033333,,0.033333,,...,,,,,,,,,,
4,,,0.033333,0.066667,,0.033333,,,,,...,,,,,,,,,,


Verify that all neighborhoods are in the dataframe

In [44]:
canada_categories_mean_df.shape

(103, 395)

In [45]:
df_canada.shape

(103, 5)

Deal with NaN values

In [46]:
canada_categories_mean_df.replace(np.nan, 0, inplace=True)
canada_categories_mean_df.head()

Unnamed: 0,African Restaurant,Auto Garage,Automotive Shop,Building,Business Service,Caribbean Restaurant,Coffee Shop,Construction & Landscaping,Convenience Store,Fast Food Restaurant,...,Burmese Restaurant,Hungarian Restaurant,Shop & Service,State / Provincial Park,River,Poutine Place,Social Club,Drugstore,Kingdom Hall,Swiss Restaurant
0,0.033333,0.033333,0.1,0.033333,0.033333,0.033333,0.066667,0.033333,0.033333,0.066667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.033333,0.0,0.0,0.033333,0.0,0.033333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.033333,0.066667,0.0,0.033333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


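Scaling was considered because the distance column has a much larger range than the category means, but it did not improve the result, so the unscaled values are used.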

In [47]:
# Scaling the features didn't give me a better result
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X = scaler.fit_transform(canada_categories_mean_df.values)
# 
X = canada_categories_mean_df.values
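
Without scaling, a column measured in hundreds of meters can outweigh category shares that are all below 1 when Euclidean distances are computed. The sketch below illustrates this on three hypothetical feature rows. It standardizes with plain NumPy, using the same per-column centering and division by the population standard deviation that StandardScaler applies by default.

```python
import numpy as np

# Hypothetical feature rows: two category shares and a mean distance in meters.
X = np.array([
    [0.10, 0.00, 640.0],
    [0.00, 0.10, 650.0],
    [0.10, 0.00, 900.0],
])

# Unscaled, row 0 is closer to row 1 because the distance column dominates.
raw_01 = np.linalg.norm(X[0] - X[1])
raw_02 = np.linalg.norm(X[0] - X[2])

# Standardize each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_01 = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_02 = np.linalg.norm(X_scaled[0] - X_scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

After standardization, row 0 is closer to row 2, which has the same category shares, than to row 1.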

Let's cluster neighborhoods

In [48]:
# set number of clusters
num_clusters = 5

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=200)
k_means.fit(X)
labels = k_means.predict(X)
print('Labels')
print(labels.shape)
labels


Labels
(103,)


array([0, 2, 3, 2, 3, 0, 2, 0, 2, 0, 3, 0, 0, 3, 0, 1, 4, 0, 1, 0, 2, 0,
       3, 0, 0, 0, 3, 1, 1, 0, 0, 0, 2, 0, 3, 3, 3, 0, 1, 1, 0, 3, 3, 3,
       2, 1, 0, 1, 0, 1, 0, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 0, 1,
       1, 1, 3, 1, 1, 1, 3, 0, 0, 3, 1, 1, 1, 2, 1, 0, 3, 3, 3, 1, 3, 3,
       3, 3, 0, 0, 3, 2, 2, 3, 0, 2, 0, 0, 0, 1, 2], dtype=int32)
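
Because no random seed is set above, the cluster numbers can change from one run to the next. A minimal sketch of a reproducible run, on hypothetical 2-D points rather than the neighborhood features, passes random_state and uses fit_predict to fit and label in one call.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two well-separated groups.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

# random_state fixes the initialization so repeated runs give the same labels.
model = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
labels = model.fit_predict(points)

print(labels)
```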

Assign the predicted label to each neighborhood

In [49]:
# Copy so that adding the label column leaves df_canada unchanged
canada_labeled_df = df_canada.copy()
canada_labeled_df['Label'] = labels
canada_labeled_df.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Label
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353,0
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497,2
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711,3
3,M1G,Scarborough,Woburn,43.770992,-79.216917,2
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,3
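
One way to inspect the result before plotting is to count how many neighborhoods of each borough fall into each cluster. The sketch below does this with a cross-tabulation on a few hypothetical labeled rows shaped like canada_labeled_df.

```python
import pandas as pd

# Hypothetical labeled rows in the shape of canada_labeled_df.
labeled = pd.DataFrame({
    'Borough': ['Scarborough', 'Scarborough', 'North York', 'North York'],
    'Label': [0, 2, 0, 0],
})

# Rows are boroughs, columns are cluster labels, cells are neighborhood counts.
summary = pd.crosstab(labeled['Borough'], labeled['Label'])

print(summary)
```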


Visualize the result

In [50]:
# Assign a color to labels
label_colors = {
    9: '#EA2027',
    8: '#006266',
    7: '#1B1464',
    6: '#5758BB',
    5: '#6F1E51',
    4: '#EE5A24',
    3: '#009432',
    2: '#0652DD',
    1: '#9980FA',
    0: '#833471',
}

In [51]:
# Plot
map_canada = folium.Map(location=[lat_canada, lng_canada], zoom_start=10)  
 
# Add markers to the map
for lat, lng, borough, neighborhood, cluster_label in zip(canada_labeled_df['Latitude'], canada_labeled_df['Longitude'], canada_labeled_df['Borough'], canada_labeled_df['Neighborhood'], canada_labeled_df['Label']):  
    label = '{}, {}'.format(neighborhood, borough)  
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],  
        radius=5,  
        popup=label,  
        color=label_colors[cluster_label],  
        fill=True,  
        fill_color=label_colors[cluster_label],  
        fill_opacity=0.7,  
        parse_html=False).add_to(map_canada)
    
map_canada


### The End