# Segmenting and Clustering Neighborhoods in Toronto

## 1. Read Postal Codes and Neighborhood Data
In this section, we read the Toronto postal code data from the Wikipedia page and create a dataframe. We exclude the entries whose 'Borough' is 'Not assigned'. We also fill any missing or not assigned 'Neighborhood' value with the corresponding 'Borough' value. Then we combine all entries for a postal code into one row, with the Neighborhood column holding comma-separated values.  
I used pandas to read the HTML page. Examining the page source gives the class attribute of the HTML table, which can be used to filter out unwanted tables.
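
The three cleaning steps can be tried offline on a few rows before running them on the full table. This is a minimal sketch, assuming a toy dataframe whose values are copied from the output printed in this notebook; the names `raw` and `clean` are illustrative only.

```python
import pandas as pd

# Toy rows shaped like the Wikipedia table (values copied from the notebook output).
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows that have no borough.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Fall back to the borough name where the neighborhood is not assigned.
clean['Neighborhood'] = clean['Neighborhood'].where(
    clean['Neighborhood'] != 'Not assigned', clean['Borough'])

# 3. One row per postal code, with the neighborhoods joined by commas.
clean = (clean.groupby(['Postal Code', 'Borough'])['Neighborhood']
              .apply(', '.join)
              .reset_index())
print(clean)
```

`Series.where` keeps a value where the condition holds and takes the replacement otherwise, so it does the same job as the row-wise `apply` without a Python-level loop.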

In [146]:
#!conda install -c conda-forge geocoder
#!conda install -c conda-forge folium --yes
import pandas as pd
import folium
import requests

html_df_list = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',attrs={'class' : "wikitable sortable"})
print('read table from html page')
folium.__version__

read table from html page


'0.10.0'

In [147]:
df = html_df_list[0]
df.columns=['Postal Code','Borough','Neighborhood']
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


In [148]:
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)
df['Neighborhood'] = df.apply(lambda x: x['Borough'] if x['Neighborhood'] == 'Not assigned' else x['Neighborhood'],axis=1)
df = df.groupby(['Postal Code','Borough'])['Neighborhood'].apply(', '.join).reset_index()
print(df.shape)
df.head(10)

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


## 2. Read geo co-ordinates for postal codes
For this task, I tried the geocoder library as suggested, but it returned None most of the time. So I used the CSV file shared in the assignment instead.
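
A lookup that often returns None is safer to call with a cap on the number of attempts, so the loop cannot run forever. The sketch below assumes a hypothetical `lookup` callable standing in for the geocoding call; `lookup_with_retries`, `flaky_lookup` and the coordinates are made up for illustration and no real geocoding service is contacted.

```python
import time

def lookup_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out.

    `lookup` is a hypothetical stand-in for a geocoding call; it is expected
    to return a [lat, lng] pair or None.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before trying again
    return None  # give up instead of looping forever

# Fake geocoder that fails twice before answering, to exercise the retry path.
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.0, -79.0]  # made-up coordinates

coords = lookup_with_retries(flaky_lookup, 'M8Z, Toronto, Ontario')
missing = lookup_with_retries(lambda q: None, 'M8Z, Toronto, Ontario', max_attempts=2)
print(coords, missing)
```

When every attempt fails the function returns None, which lets the caller decide to fall back to another source such as the CSV file.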

In [149]:
#import geocoder
#lat_lng_coords = None
#while(lat_lng_coords is None):
# g = geocoder.google('{}, Toronto, Ontario'.format('M8Z'))
# lat_lng_coords = g.latlng

In [150]:
geo_df = pd.read_csv('http://cocl.us/Geospatial_data')
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


## 3. Prepare required dataset
We need to combine the geo and neighborhood data. To reduce the number of calls to the Foursquare API, I kept only the entries whose Borough name contains 'Toronto', as suggested in the assignment description.
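
Before filtering, it can be worth checking that every postal code actually received coordinates from the join. A minimal sketch, assuming two toy dataframes: the first two postal codes and their coordinates are copied from the outputs in this notebook, while 'M9Z' and 'Example Borough' are made-up values added to show a failed match.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'Postal Code': ['M4E', 'M1B', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['East Toronto', 'Scarborough', 'Example Borough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M4E', 'M1B'],
    'Latitude': [43.676357, 43.806686],
    'Longitude': [-79.293031, -79.194353],
})

# Left join keeps every neighborhood row; validate raises if a postal code repeats.
joined = neighborhoods.merge(coords, on='Postal Code', how='left', validate='one_to_one')

# Postal codes that found no coordinates show up as NaN.
missing = joined.loc[joined['Latitude'].isna(), 'Postal Code'].tolist()

# Keep only the boroughs whose name contains 'Toronto'.
toronto_only = joined[joined['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(missing)
print(toronto_only)
```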

In [151]:
df_joined = df.join(geo_df.set_index('Postal Code'),on = 'Postal Code')

In [152]:
df_toronto = df_joined[df_joined['Borough'].str.contains('Toronto')].reset_index(drop=True)
df_toronto

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049


## 4. Explore all the postal codes with the Foursquare API

In [153]:
#Copied from the ungraded assignment.
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Postal Code Latitude', 
                  'Postal Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
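
The parsing step of the function can be checked without calling the API. The sketch below assumes a hand-written `sample_items` list in the shape that the function above reads (it is not real API output), and a hypothetical helper `flatten_items` that also guards against a venue with an empty category list, which would otherwise raise an IndexError.

```python
import pandas as pd

def flatten_items(postal_code, lat, lng, items):
    """Turn the 'items' list of an explore response into flat rows."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue has no category instead of failing on [0].
        category = categories[0]['name'] if categories else None
        rows.append((postal_code, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written sample; the second venue is invented to show the empty-category case.
sample_items = [
    {'venue': {'name': 'Glen Manor Ravine',
               'location': {'lat': 43.676821, 'lng': -79.293942},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Unnamed venue',
               'location': {'lat': 43.68, 'lng': -79.29},
               'categories': []}},
]

columns = ['Postal Code', 'Postal Code Latitude', 'Postal Code Longitude',
           'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
venues = pd.DataFrame(flatten_items('M4E', 43.676357, -79.293031, sample_items),
                      columns=columns)
print(venues)
```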

In [165]:
# The code was removed by Watson Studio for sharing.

In [155]:
VERSION = '20180605' # Foursquare API version
radius = 500
LIMIT = 30


In [156]:
toronto_venues = getNearbyVenues(names=df_toronto['Postal Code'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude']
                                  )
toronto_venues

M4E
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6G
M6H
M6J
M6K
M6P
M6R
M6S
M7Y


Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4E,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,M4E,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,M4E,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,M4E,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,M4K,43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
5,M4K,43.679557,-79.352188,MenEssentials,43.677820,-79.351265,Cosmetics Shop
6,M4K,43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
7,M4K,43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
8,M4K,43.679557,-79.352188,La Diperie,43.677530,-79.352295,Ice Cream Shop
9,M4K,43.679557,-79.352188,Louis Cifer Brew Works,43.677663,-79.351313,Brewery


## 5. Prepare data for Clustering

### 5.1 Classification of venue categories
While working with only the venue categories returned by the Foursquare API, I had trouble drawing conclusions from the clustered data the way the ungraded assignment does. This is because of the high number of categories (185+) and because the clustering was based on textual values rather than numeric data.

To address these issues, I created a smaller set of classes (18-20) and mapped the venue categories to them. Each venue class consolidates one or more of the venue categories returned by the Foursquare API. For example, Eatery includes all the cafes, coffee shops, breakfast joints, fast food places, ice cream places and so on. These classes are the features used in the K-Means clustering.

The _value_cat_ dataframe holds the mapping of the venue categories to classes.  
__I have printed the whole dataframe for reference.__
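
Since the cell that builds the mapping is hidden, here is a minimal sketch of how such a lookup can be applied and checked. The `mapping` rows for Bakery and Art Gallery match the printed table, the Ice Cream Shop row follows the Eatery description in the text, and the `venues` rows are a toy sample; none of this is the author's hidden code.

```python
import pandas as pd

venues = pd.DataFrame({
    'Postal Code': ['M4E', 'M4K', 'M4K', 'M4K'],
    'Venue Category': ['Trail', 'Ice Cream Shop', 'Bakery', 'Art Gallery'],
})

# Category-to-class lookup (a three-row stand-in for the full mapping).
mapping = pd.DataFrame({
    'Value Category': ['Ice Cream Shop', 'Bakery', 'Art Gallery'],
    'Classification': ['Eatery', 'Grocery', 'Exhibitions'],
})

# Attach the class to every venue by matching on the category name.
classified = venues.join(mapping.set_index('Value Category'), on='Venue Category')

# Categories without a class come back as NaN; listing them shows what the mapping lacks.
unmapped = classified.loc[classified['Classification'].isna(), 'Venue Category'].unique()
print(classified)
print('Unmapped categories:', list(unmapped))
```

Venues with an unmapped category would silently drop out of a later group-by on the class column, so listing them first is a cheap sanity check.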

In [157]:
# The code was removed by Watson Studio for sharing.

In [158]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(value_cat)

                    Value Category         Classification
0                          Airport                Airport
1               Airport Food Court                Airport
2                     Airport Gate                Airport
3                   Airport Lounge                Airport
4                  Airport Service                Airport
5                 Airport Terminal                Airport
6              American Restaurant  Restaurant - American
7                         Aquarium            Exhibitions
8                      Art Gallery            Exhibitions
9              Arts & Crafts Store           Leisure Shop
10                Asian Restaurant     Restaurant - Asian
11                   Auto Workshop             Commercial
12                      Baby Store             Commercial
13                      Bagel Shop                Grocery
14                          Bakery                Grocery
15                            Bank               Business
16            

__Next, I created a dataframe with one row per postal code and one column per venue class, holding the number of venues. I used groupby to count the venues for each postal code and class, then pivot_table to turn the classes into columns with the postal codes as rows. This is the dataframe used for clustering and segmentation.__
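
The count-then-pivot step is easy to follow on a handful of rows. This is a small sketch on a made-up `classified` sample (the row values are illustrative); it also shows `pd.crosstab`, which builds the same count matrix in one call.

```python
import pandas as pd

# Made-up sample: one row per venue, already mapped to a class.
classified = pd.DataFrame({
    'Postal Code': ['M4E', 'M4K', 'M4K', 'M4K', 'M4L'],
    'Classification': ['Recreational', 'Eatery', 'Eatery', 'Grocery', 'Eatery'],
})

# Count venues per (postal code, class) pair.
counts = (classified.groupby(['Postal Code', 'Classification'])
                    .size()
                    .reset_index(name='Count'))

# Spread the classes into columns; pairs that never occur become 0.
features = counts.pivot_table(values='Count', index='Postal Code',
                              columns='Classification', aggfunc='sum', fill_value=0)

# pd.crosstab produces the same matrix directly from the two columns.
features_alt = pd.crosstab(classified['Postal Code'], classified['Classification'])
print(features)
```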

In [159]:
import numpy as np
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))
toronto_venues1 = toronto_venues.join(value_cat.set_index('Value Category'),on='Venue Category')
toronto_venues1
t1 = toronto_venues1.groupby(['Postal Code','Classification']).size()
t2 = pd.DataFrame(t1).reset_index()
t2.columns = ['Postal Code','Venue Classification','Count']
t3 = pd.pivot_table(t2, values='Count', index=['Postal Code'],columns=['Venue Classification'], aggfunc=np.sum, fill_value=0)
x=pd.DataFrame(t3)
x

There are 188 uniques categories.


Postal Code,Academic,Airport,Business,Commercial,Eatery,Entertainment,Exhibitions,Fitness,Grocery,Hangout,Leisure Shop,Recreational,Residence,Restaurant - American,Restaurant - Asian,Restaurant - Europian,Restaurant - General,Sport,Transit
M4E,0,0,0,0,0,0,0,0,1,1,0,1,1,0,0,0,0,0,0
M4K,0,0,0,3,4,0,0,2,5,2,0,1,0,0,1,10,2,0,0
M4L,0,0,0,1,8,1,0,1,2,2,0,1,0,1,1,1,1,0,0
M4M,0,0,1,2,8,0,0,1,5,2,0,1,1,2,2,2,3,0,0
M4N,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1
M4P,0,0,1,1,3,0,0,1,0,0,0,1,0,0,0,0,0,0,0
M4R,0,0,1,4,4,0,0,2,2,0,2,1,0,1,1,0,2,0,0
M4S,0,0,0,1,8,1,0,2,5,1,1,1,0,0,4,3,3,0,0
M4T,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0
M4V,0,0,0,1,4,0,0,0,3,3,0,0,0,1,2,0,1,0,1


__classes_data__ holds the total number of venues in each class. It is used to calculate the class weights when representing the clusters.

In [160]:
classes_data = pd.DataFrame(toronto_venues1.groupby('Classification').size()).reset_index()
classes_data.columns = ['Classification','Count']
print(classes_data)

           Classification  Count
0                Academic      2
1                 Airport      8
2                Business     24
3              Commercial     52
4                  Eatery    215
5           Entertainment     28
6             Exhibitions     19
7                 Fitness     33
8                 Grocery     90
9                 Hangout     83
10           Leisure Shop     10
11           Recreational     35
12              Residence      5
13  Restaurant - American     37
14     Restaurant - Asian     57
15  Restaurant - Europian     49
16   Restaurant - General     65
17                  Sport      8
18                Transit      7


## 6. Perform segmentation with K-Means clustering

I found that clustering on the above dataframe gives better output than working with textual data alone. Scaling also helped to reduce the dominance of one class over the others. For example, the city has 200+ eateries (215 in the table above) while most other classes have far fewer venues, so the eatery counts would otherwise outweigh the other classes in each area. This and similar discrepancies were reduced by the scaling.
I ran the code below with different numbers of clusters and found the most conclusive results with either 3 or 5 clusters.
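
The cluster counts in this notebook were compared by inspecting the results. One common numeric aid, which is not the method used here, is the silhouette score: it is computed for each candidate number of clusters and the highest value suggests a reasonable choice. The sketch below assumes synthetic data with three well-separated groups rather than the Toronto feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: three well-separated groups of points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Standardize each column so that no single feature dominates the distances.
scaled = StandardScaler().fit_transform(data)

# Fit K-Means for several cluster counts and record the silhouette score of each.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('Best k by silhouette score:', best_k)
```

On real venue counts the scores are usually much closer together than on this synthetic example, so the number is a hint to weigh against how interpretable the clusters are, not a replacement for that judgment.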

In [161]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn import preprocessing
# import k-means from clustering stage
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5
x.reset_index(inplace=True)
x_data = x.drop(columns=['Postal Code'])
x_scaled = preprocessing.StandardScaler().fit_transform(x_data)
kmeans = KMeans(init='k-means++',n_clusters=kclusters, random_state=0).fit(x_scaled)

# add clustering labels
x.insert(0, 'Cluster Labels', kmeans.labels_)
kmeans.labels_[0:10] 



array([4, 0, 0, 0, 4, 4, 0, 0, 4, 0], dtype=int32)

In [162]:
# join the clustered counts with df_toronto to add borough, neighborhood and latitude/longitude for each postal code
result = x.join(df_toronto.set_index('Postal Code'), on='Postal Code')
result.head() # check the last columns!

Unnamed: 0,Cluster Labels,Postal Code,Academic,Airport,Business,Commercial,Eatery,Entertainment,Exhibitions,Fitness,...,Restaurant - American,Restaurant - Asian,Restaurant - Europian,Restaurant - General,Sport,Transit,Borough,Neighborhood,Latitude,Longitude
0,4,M4E,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,East Toronto,The Beaches,43.676357,-79.293031
1,0,M4K,0,0,0,3,4,0,0,2,...,0,1,10,2,0,0,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,0,M4L,0,0,0,1,8,1,0,1,...,1,1,1,1,0,0,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,0,M4M,0,0,1,2,8,0,0,1,...,2,2,2,3,0,0,East Toronto,Studio District,43.659526,-79.340923
4,4,M4N,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,Central Toronto,Lawrence Park,43.72802,-79.38879


### Toronto Map with Cluster Markers

In [163]:
# create map
latitude = 43.6532
longitude = -79.3832
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
# use xs here so that the feature dataframe x is not overwritten
xs = np.arange(kclusters)
ys = [i + xs + (i*xs)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(result['Latitude'], result['Longitude'], result['Postal Code'], result['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Representing Clustered Data
To represent the clustered data, I have included class weights: the ratio of the number of venues of a particular class within the cluster to the total number of venues of that class. I sort the data by these weights to draw conclusions.
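
As a worked example of the weight, the sketch below divides three cluster 0 counts by the matching city-wide totals. The numbers are copied from the outputs printed in this notebook; the Series names `totals` and `cluster_counts` are illustrative, and the vectorized division is an alternative to a row-wise helper.

```python
import pandas as pd

# City-wide totals per class (from the classes_data output).
totals = pd.Series({'Eatery': 215, 'Grocery': 90, 'Leisure Shop': 10})

# Venue counts inside cluster 0 (from the cluster 0 output).
cluster_counts = pd.Series({'Eatery': 133, 'Grocery': 61, 'Leisure Shop': 8})

# Weight = share of the city's venues of that class that fall inside this cluster.
# Division aligns the two Series on their class names.
weights = (cluster_counts / totals).sort_values(ascending=False)
print(weights.round(6))
```

Leisure Shop ranks first with 8 of 10 venues (0.8) even though Eatery has far more venues in absolute terms, which is the effect the weights are meant to expose.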

In [164]:
def get_weight(cls,count):
    row = classes_data[classes_data['Classification']==cls]
    wt = count/row.iloc[0]['Count']
    return wt

pd.set_option('display.width', 250)
for i in range(0,kclusters):
    df = result.loc[result['Cluster Labels'] == i, :]
    df = df.drop(columns=['Latitude','Longitude','Cluster Labels'])
    #remove all the classes having zero value for all the postal codes in the cluster.
    df = df.loc[:, (df != 0).any(axis=0)]
    df_sum = df.sum(numeric_only=True) 
    df_pd = pd.DataFrame(df_sum)
    df_pd.reset_index(inplace=True)
    df_pd.columns=['Classification','Count']
    df_pd['Class Wt'] = df_pd.apply(lambda x:get_weight(x['Classification'],x['Count']),axis=1)
    df_pd.sort_values(by=['Class Wt'],ascending=False,inplace=True)
    print('-------Start cluster: '+str(i) +'---------')
    print(df_pd)
    print('-------End cluster: '+str(i)+ '\ttotal postal codes: ' + str(len(df.index))+'---------')

-------Start cluster: 0---------
           Classification  Count  Class Wt
8            Leisure Shop      8  0.800000
13  Restaurant - Europian     38  0.775510
1              Commercial     38  0.730769
12     Restaurant - Asian     39  0.684211
6                 Grocery     61  0.677778
3           Entertainment     18  0.642857
2                  Eatery    133  0.618605
11  Restaurant - American     22  0.594595
5                 Fitness     18  0.545455
14   Restaurant - General     34  0.523077
7                 Hangout     42  0.506024
9            Recreational     14  0.400000
0                Business      7  0.291667
16                Transit      2  0.285714
15                  Sport      2  0.250000
4             Exhibitions      4  0.210526
10              Residence      1  0.200000
-------End cluster: 0	total postal codes: 19---------
-------Start cluster: 1---------
           Classification  Count  Class Wt
4             Exhibitions     12  0.631579
13   Restaurant - Ge

## 7. Deductions from the clustered data
### Cluster 0: Social and Commercial
This cluster represents social and commercial locations dense with restaurants, shops and eateries/cafes. It includes 19 of the 37 postal codes, far more than any other cluster.

### Cluster 1: Exhibits and Hangouts
This cluster has a higher number of exhibits, such as museums and art galleries. It also has a good concentration of hangout places and restaurants.

### Cluster 2: Outlier
At a cursory look this cluster seems to be an outlier, as it includes only one postal code. No particular category stands out, though it has a couple of sports venues and a few business places within a single postal code.

### Cluster 3: Academic
Though this cluster includes only one postal code, it stands out because it contains all the academic places, such as colleges and related venues.

### Cluster 4: Transportation and Recreation
This cluster includes major transport areas such as the airport and transit centers. It also includes a good number of recreational places such as parks and gardens.

  