<h1 align=center><font size = 6>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

#### Installing and importing packages

In [28]:
#pip install geocoder

In [29]:
import numpy as np
import json
import pandas as pd
import requests
from bs4 import BeautifulSoup
import geocoder


In [30]:
!conda install -c conda-forge geopy --yes 

Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.



In [31]:
from geopy.geocoders import Nominatim

In [32]:
import requests
from pandas.io.json import json_normalize

In [33]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [34]:
from sklearn.cluster import KMeans

In [35]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.



# Part 1. Obtaining the data from Wikipedia and shaping it as required

## 1.1 Obtaining data from Wikipedia 

Requesting the HTML of the website:

In [36]:
#wiki_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
wiki_url = requests.get('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=861324217').text

Parsing the HTML source so that it can be viewed in a nested format:

In [37]:
soup = BeautifulSoup(wiki_url,'lxml')
#print(soup.prettify())

After inspecting the HTML, we see that our table belongs to the 'wikitable sortable' class, so we find the tables with this attribute:

In [38]:
table = soup.find_all('table',{'class':'wikitable sortable'})[0]
#table

We read the html table and convert it to a dataframe:

In [39]:
df = pd.read_html(str(table))
neighborhoods=pd.DataFrame(df[0])
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## 1.2 Formatting the table as required

Renaming the columns as required:

In [40]:
neighborhoods.rename(columns={'Postcode':'PostalCode','Neighbourhood':'Neighborhood'},inplace=True)
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Ignoring cells with a borough that is 'Not assigned':

In [41]:
neighborhoods.drop(neighborhoods[neighborhoods.Borough=='Not assigned'].index,inplace=True)
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Combining neighborhoods that are in the same postal code:

In [42]:
neighborhoods.groupby(['PostalCode','Borough','Neighborhood']).sum().head(15)

PostalCode,Borough,Neighborhood
M1B,Scarborough,Malvern
M1B,Scarborough,Rouge
M1C,Scarborough,Highland Creek
M1C,Scarborough,Port Union
M1C,Scarborough,Rouge Hill
M1E,Scarborough,Guildwood
M1E,Scarborough,Morningside
M1E,Scarborough,West Hill
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae


In [43]:
neighborhoods=neighborhoods.groupby(['PostalCode','Borough']).agg(lambda x: ', '.join(x)).reset_index()
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
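
As a minimal sketch of what the aggregation above does, the toy frame below (made-up rows in the same shape as `neighborhoods`) shows how `groupby` combined with `', '.join` collapses several rows that share a postal code into one. Selecting the `Neighborhood` column explicitly is an assumption made here for illustration, so that no other non-key column takes part in the join.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative)
toy = pd.DataFrame({
    'PostalCode':   ['M1B', 'M1B', 'M1G'],
    'Borough':      ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One row per (PostalCode, Borough); neighborhood names joined with commas
combined = (toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
               .agg(', '.join)
               .reset_index())
print(combined)
```

Rows within a group keep their original order, so the joined string follows the order of the source table.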


Filling 'Not assigned' neighborhoods with the name of their borough:

In [44]:
neighborhoods[neighborhoods['Neighborhood']=='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Not assigned


In [45]:
neighborhoods.loc[neighborhoods.Neighborhood == 'Not assigned', 'Neighborhood'] = neighborhoods.loc[neighborhoods.Neighborhood == 'Not assigned', 'Borough']

In [46]:
neighborhoods[neighborhoods['Neighborhood']=='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood


In [47]:
neighborhoods.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


Obtaining the number of rows of the dataframe:

In [48]:
neighborhoods.shape[0]

103

# Part 2. Getting the geographical coordinates for each postal code

## 2.1 Trying Geocoder 

In [49]:
postal_code='M1A'

In [None]:
# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

I got tired of waiting for a response from Google, so I decided to use the link provided instead.
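
The `while` loop above never ends if the service keeps returning nothing. A bounded retry is one way to avoid that. The sketch below is only an illustration: `geocode_with_retry`, `lookup`, `max_attempts` and `delay_seconds` are hypothetical names, and the stand-in lookup simply mimics a service that never answers.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable that returns [lat, lng] or None,
    for example: lambda q: geocoder.google(q).latlng
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # wait before trying again
    return None  # the caller decides on a fallback, e.g. the CSV file


# Stand-in lookup that never answers, mimicking the behaviour seen above
def never_answers(query):
    return None

result = geocode_with_retry(never_answers, 'M1A, Toronto, Ontario',
                            max_attempts=3, delay_seconds=0.0)
print(result)  # None
```

Returning `None` after a fixed number of attempts lets the notebook move on to another data source without interrupting the kernel by hand.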

## 2.2 Using the csv file 

In [50]:
postal_codes = pd.read_csv('http://cocl.us/Geospatial_data')
postal_codes.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [51]:
postal_codes.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

In [52]:
neighborhoods=pd.merge(neighborhoods, postal_codes, on='PostalCode')

In [53]:
neighborhoods.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
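
`pd.merge` performs an inner join by default, so a postal code missing from the coordinates file would be dropped without any warning. As a hedged sketch, the toy example below uses `how='left'` with `indicator=True` to list such codes; the frames and the code `M9Z` are made up for illustration.

```python
import pandas as pd

# Toy frames: 'M9Z' is a made-up code with no coordinates
left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
coords = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# A left join keeps every postal code; '_merge' records where each row came from
checked = pd.merge(left, coords, on='PostalCode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)  # ['M9Z']
```

An empty list would confirm that every postal code received coordinates.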


# Part 3: Exploring and clustering the postal codes in Toronto

In the dataframe neighborhoods obtained in 2.2, a borough in column 'Borough' can comprise several sets of neighborhoods in column 'Neighborhood', but every set of neighborhoods is associated with exactly one postal code in column 'PostalCode', so each set is represented by its postal code. For this reason, the analysis that follows is done by postal code.

## 3.1 Visualizing postal codes

### 3.1.1 Visualizing all postal codes

Getting the longitude and latitude values of Toronto

In [54]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Creating a map of Toronto with neighborhoods

In [55]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for code,lat, lng, borough, neighborhood in zip(neighborhoods['PostalCode'],neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{} {} ({})'.format(code,borough,neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

The dataframe has 103 postal codes, which makes them a little hard to distinguish on the map above.

### 3.1.2 Visualizing only postal codes whose boroughs contain the word Toronto

Extracting the rows whose borough contains the word Toronto

In [56]:
toronto_data=neighborhoods[neighborhoods['Borough'].str.contains("Toronto")].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [57]:
print('The data extracted has {} postal codes'.format(toronto_data.shape[0]))

The data extracted has 38 postal codes


In [58]:
latitude=toronto_data['Latitude'].mean()
longitude=toronto_data['Longitude'].mean()
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11.4)

for code,lat, lng, borough, neighborhood in zip(toronto_data['PostalCode'],toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{} {} ({})'.format(code,borough,neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

The 38 postal codes in the above map are easier to see, but I consider this number too small to obtain a good clustering. So, from here on, the analysis covers all postal codes.

## 3.2 Exploring postal codes

Defining Foursquare Credentials and Version

In [59]:
CLIENT_ID = '13XYAJUS4H2UNBY3TV35FZ5WIWFQ2LH2Z4G21PSGJHUWZYIH' 
CLIENT_SECRET = 'TR4NDYBCZERLHHGB0KEBZ5CC3ITKOLT3RO3TVCJD1F4PI1ZV'
VERSION = '20180605'

Setting the latitude and longitude of Toronto

In [60]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

Specifying the data to analyze

In [61]:
toronto_data=neighborhoods

### 3.2.1 Exploring the first postal code

Getting the first postal code with its corresponding borough and its latitude and longitude values

In [62]:
borough_postcode = toronto_data.loc[0,'PostalCode']+' '+toronto_data.loc[0,'Borough']
borough_latitude=toronto_data.loc[0,'Latitude']
borough_longitude=toronto_data.loc[0,'Longitude']
print('Latitude and longitude values of {} are ({}, {}).'.format(borough_postcode,
                                                               borough_latitude, 
                                                               borough_longitude))

Latitude and longitude values of M1B Scarborough are (43.806686299999996, -79.19435340000001).


Creating the GET request url

In [63]:
LIMIT = 100
radius = 500 

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    borough_latitude, 
    borough_longitude, 
    radius, 
    LIMIT)

url

'https://api.foursquare.com/v2/venues/explore?&client_id=13XYAJUS4H2UNBY3TV35FZ5WIWFQ2LH2Z4G21PSGJHUWZYIH&client_secret=TR4NDYBCZERLHHGB0KEBZ5CC3ITKOLT3RO3TVCJD1F4PI1ZV&v=20180605&ll=43.806686299999996,-79.19435340000001&radius=500&limit=100'

Sending the GET request and examining the results

In [64]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e89321e40a7ea001c4c87bc'},
  'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 1,
  'suggestedBounds': {'ne': {'lat': 43.8111863045, 'lng': -79.18812958073042},
   'sw': {'lat': 43.80218629549999, 'lng': -79.2005772192696}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bb6b9446edc76b0d771311c',
       'name': 'Wendy’s',
       'location': {'crossStreet': 'Morningside & Sheppard',
        'lat': 43.80744841934756,
        'lng': -79.19905558052072,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'lng': -79.19905558052072}],
        'distance': 387,
        'cc': 'CA',
        'city': 'Toronto',
    

Defining a function to extract the category of the venue

In [65]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Cleaning the json and structuring it into a dataframe.

In [66]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues)

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy’s,Fast Food Restaurant,43.807448,-79.199056


In [67]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

1 venues were returned by Foursquare.


### 3.2.2 Exploring all postal codes

The function below, based on one from the "Segmenting and Clustering Neighborhoods in New York City" lab, repeats the same process for all postal codes.

In [68]:
def getNearbyVenues(postcodes, names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for postcode, name, lat, lng in zip(postcodes, names, latitudes, longitudes):
        print(postcode + ' ' + name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            postcode,
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode',
                             'Borough', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
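
The function above indexes straight into the response, so a payload without a `response` key or a venue without categories would raise an exception halfway through the loop. The following is a rough sketch of a more defensive parser, assuming the same payload shape as above; `extract_venues` is a hypothetical helper and the sample payload is hand-written, not real API output.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Assumes the shape used above: payload['response']['groups'][0]['items'].
    A venue with no categories gets None instead of raising IndexError.
    """
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []  # e.g. an error payload that carries only 'meta'
    rows = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written payloads for illustration only
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 43.8, 'lng': -79.2},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.7, 'lng': -79.3},
               'categories': []}},
]}]}}

rows = extract_venues(sample)
empty = extract_venues({'meta': {'code': 429}})
print(rows)
print(empty)  # []
```

Keeping the parsing separate from the HTTP call also makes it possible to check the logic without spending API quota.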

In the maps above, some postal-code areas are very close together, so reducing the radius would reduce overlaps. However, a smaller radius could miss venues in small, dense areas, so I keep the radius of 500.
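
Whether two search circles overlap can be checked directly: two circles of radius 500 meters overlap only when their centers are less than 1000 meters apart. The sketch below applies the haversine formula to the coordinates of M1B and M1C from the table in 2.2; `haversine_m` is a helper written for this illustration and the Earth radius is the usual mean value.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Coordinates of the first two postal codes in the dataframe
m1b = (43.806686, -79.194353)
m1c = (43.784535, -79.160497)

radius = 500
distance = haversine_m(*m1b, *m1c)
overlap = distance < 2 * radius  # circles overlap if centers are closer than 2 * radius
print(round(distance), overlap)
```

Running the same check over all pairs of postal codes would show how many search areas actually overlap at a given radius.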

In [69]:
toronto_venues = getNearbyVenues(postcodes=toronto_data['PostalCode'],
                                    names=toronto_data['Borough'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude'],
                                  )

M1B Scarborough
M1C Scarborough
M1E Scarborough
M1G Scarborough
M1H Scarborough
M1J Scarborough
M1K Scarborough
M1L Scarborough
M1M Scarborough
M1N Scarborough
M1P Scarborough
M1R Scarborough
M1S Scarborough
M1T Scarborough
M1V Scarborough
M1W Scarborough
M1X Scarborough
M2H North York
M2J North York
M2K North York
M2L North York
M2M North York
M2N North York
M2P North York
M2R North York
M3A North York
M3B North York
M3C North York
M3H North York
M3J North York
M3K North York
M3L North York
M3M North York
M3N North York
M4A North York
M4B East York
M4C East York
M4E East Toronto
M4G East York
M4H East York
M4J East York
M4K East Toronto
M4L East Toronto
M4M East Toronto
M4N Central Toronto
M4P Central Toronto
M4R Central Toronto
M4S Central Toronto
M4T Central Toronto
M4V Central Toronto
M4W Downtown Toronto
M4X Downtown Toronto
M4Y Downtown Toronto
M5A Downtown Toronto
M5B Downtown Toronto
M5C Downtown Toronto
M5E Downtown Toronto
M5G Downtown Toronto
M5H Downtown Toronto
M5J Downtow

#### Checking the size of the resulting dataframe

In [70]:
print(toronto_venues.shape)
toronto_venues.head()

(2201, 8)


Unnamed: 0,PostalCode,Borough,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,Scarborough,43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,M1C,Scarborough,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,M1C,Scarborough,43.784535,-79.160497,Scarborough Historical Society,43.788755,-79.162438,History Museum
3,M1E,Scarborough,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,M1E,Scarborough,43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant


#### Venues returned for each postal code within a radius of 500

In [71]:
toronto_venues.groupby(['PostalCode','Borough']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
PostalCode,Borough,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
M1B,Scarborough,1,1,1,1,1,1
M1C,Scarborough,2,2,2,2,2,2
M1E,Scarborough,7,7,7,7,7,7
M1G,Scarborough,4,4,4,4,4,4
M1H,Scarborough,9,9,9,9,9,9
M1J,Scarborough,1,1,1,1,1,1
M1K,Scarborough,6,6,6,6,6,6
M1L,Scarborough,9,9,9,9,9,9
M1M,Scarborough,3,3,3,3,3,3
M1N,Scarborough,4,4,4,4,4,4


#### Unique venues and categories found in dataframe 

In [72]:
print('There are {} unique venues.'.format(len(toronto_venues['Venue'].unique())))

There are 1431 unique venues.


In [73]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 268 unique categories.


## 3.3 Analyzing each postal code by venue category

In [74]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add the postal code column back to the dataframe
toronto_onehot['PostalCode'] = toronto_venues['PostalCode'] 
#toronto_onehot['Borough'] = toronto_venues['Borough'] 
# move the postal code column to the first position
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [75]:
toronto_onehot.shape

(2201, 269)

#### Grouping by postal code taking the mean of the frequency of occurrence of each category

In [76]:
toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00000,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.00000
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00000,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.00000
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00000,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.00000
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00000,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.00000
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00000,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.00000
5,M1J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00000,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.00000
6,M1K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00000,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.00000
7,M1L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00000,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.00000
8,M1M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,...,0.0,0.00000,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.00000
9,M1N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00000,0.000000,0.0,0.000000,0.0,0.00,0.000000,0.000000,0.00000


In [77]:
toronto_grouped.shape

(99, 269)
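
Because each venue contributes a single 1 in its category column, the mean of a one-hot column within a postal code is the share of that postal code's venues falling in that category. A toy sketch with made-up venues (the postal codes `A` and `B` are placeholders) makes this concrete; `dtype=int` is passed so that the dummies are numeric regardless of the pandas version.

```python
import pandas as pd

# Made-up venues: postal code 'A' has two coffee shops and one park
toy_venues = pd.DataFrame({
    'PostalCode': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Park'],
})

onehot = pd.get_dummies(toy_venues[['Venue Category']],
                        prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'PostalCode', toy_venues['PostalCode'])

# Mean of the 0/1 columns = fraction of venues in each category
freq = onehot.groupby('PostalCode').mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1, which is why a postal code with a single venue shows a frequency of 1.0 for that venue's category.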

### 3.3.2 Getting the most common venue categories for each postal code 

#### Printing each postal code with its top 5 venue categories 

In [78]:
num_top_categories = 5

for postcode in toronto_grouped['PostalCode']:
    print("----"+postcode+"----")
    temp = toronto_grouped[toronto_grouped['PostalCode'] == postcode].T.reset_index()
    temp.columns = ['Venue category','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_categories))
    print('\n')

----M1B----
         Venue category  freq
0  Fast Food Restaurant   1.0
1     Accessories Store   0.0
2   Monument / Landmark   0.0
3    Mac & Cheese Joint   0.0
4                Market   0.0


----M1C----
               Venue category  freq
0              History Museum   0.5
1                         Bar   0.5
2           Accessories Store   0.0
3                 Men's Store   0.0
4  Modern European Restaurant   0.0


----M1E----
        Venue category  freq
0         Intersection  0.14
1  Rental Car Location  0.14
2       Medical Center  0.14
3   Mexican Restaurant  0.14
4    Electronics Store  0.14


----M1G----
      Venue category  freq
0        Coffee Shop  0.50
1           Pharmacy  0.25
2  Korean Restaurant  0.25
3             Market  0.00
4  Martial Arts Dojo  0.00


----M1H----
         Venue category  freq
0                Bakery  0.11
1    Athletics & Sports  0.11
2   Fried Chicken Joint  0.11
3  Caribbean Restaurant  0.11
4           Gas Station  0.11


----M1J----
      

   Venue category  freq
0     Coffee Shop  0.12
1  Clothing Store  0.12
2     Yoga Studio  0.06
3            Park  0.06
4             Spa  0.06


----M4S----
   Venue category  freq
0     Pizza Place  0.08
1  Sandwich Place  0.08
2    Dessert Shop  0.08
3            Café  0.06
4     Coffee Shop  0.06


----M4T----
       Venue category  freq
0         Summer Camp   0.5
1          Playground   0.5
2      Medical Center   0.0
3   Mobile Phone Shop   0.0
4  Miscellaneous Shop   0.0


----M4V----
       Venue category  freq
0         Coffee Shop  0.13
1                 Pub  0.13
2         Pizza Place  0.07
3  Light Rail Station  0.07
4          Sports Bar  0.07


----M4W----
               Venue category  freq
0                        Park  0.50
1                  Playground  0.25
2                       Trail  0.25
3                 Men's Store  0.00
4  Modern European Restaurant  0.00


----M4X----
  Venue category  freq
0    Coffee Shop  0.09
1           Park  0.04
2            Pub  0.0

               Venue category  freq
0                 Golf Course   1.0
1    Mediterranean Restaurant   0.0
2  Modern European Restaurant   0.0
3           Mobile Phone Shop   0.0
4          Miscellaneous Shop   0.0


----M9C----
   Venue category  freq
0     Pizza Place   0.1
1      Beer Store   0.1
2  Cosmetics Shop   0.1
3            Park   0.1
4     Coffee Shop   0.1


----M9L----
        Venue category  freq
0  Empanada Restaurant   1.0
1    Accessories Store   0.0
2  Monument / Landmark   0.0
3   Mac & Cheese Joint   0.0
4               Market   0.0


----M9M----
               Venue category  freq
0              Baseball Field   1.0
1           Accessories Store   0.0
2                 Men's Store   0.0
3  Modern European Restaurant   0.0
4           Mobile Phone Shop   0.0


----M9N----
               Venue category  freq
0                        Park  0.75
1           Convenience Store  0.25
2           Accessories Store  0.00
3                 Men's Store  0.00
4  Modern Euro

#### Putting the above results in a dataframe 

Function to sort the categories in descending order:

In [79]:
def return_most_common_categories(row, num_top_categories):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_categories]

Creating the new dataframe and displaying the top 10 venue categories for each postal code

In [80]:
num_top_categories = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_categories):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
postcodes_categories_sorted = pd.DataFrame(columns=columns)
postcodes_categories_sorted['PostalCode'] = toronto_grouped['PostalCode']

for ind in np.arange(toronto_grouped.shape[0]):
    postcodes_categories_sorted.iloc[ind, 1:] = return_most_common_categories(toronto_grouped.iloc[ind, :], num_top_categories)

postcodes_categories_sorted

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Diner,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
1,M1C,Bar,History Museum,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
2,M1E,Medical Center,Breakfast Spot,Rental Car Location,Mexican Restaurant,Bank,Electronics Store,Intersection,Yoga Studio,Doner Restaurant,Donut Shop
3,M1G,Coffee Shop,Pharmacy,Korean Restaurant,Yoga Studio,Dumpling Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,M1H,Caribbean Restaurant,Bakery,Fried Chicken Joint,Thai Restaurant,Athletics & Sports,Gas Station,Bank,Lounge,Hakka Restaurant,Empanada Restaurant
5,M1J,Playground,Yoga Studio,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
6,M1K,Department Store,Convenience Store,Hobby Shop,Chinese Restaurant,Coffee Shop,Discount Store,Yoga Studio,Drugstore,Dog Run,Doner Restaurant
7,M1L,Bus Line,Bakery,Park,Ice Cream Shop,Bus Station,Intersection,Soccer Field,Gift Shop,Diner,Ethiopian Restaurant
8,M1M,American Restaurant,Intersection,Motel,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Yoga Studio
9,M1N,College Stadium,Skating Rink,General Entertainment,Café,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Diner,Drugstore
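
In the table above, M1B lists ten categories although its only venue is a fast food restaurant; the other nine have frequency 0 and appear only because the sort has to place them somewhere. One possible variation, sketched below on a toy frame, keeps only categories that actually occur and pads the rest with `None`. The name `top_nonzero_categories` is hypothetical and this is not the function used in the notebook.

```python
import pandas as pd

def top_nonzero_categories(row, num_top_categories):
    """Top categories by frequency, ignoring categories that never occur."""
    freqs = row.iloc[1:].astype(float)   # skip the PostalCode cell
    freqs = freqs[freqs > 0]             # drop zero-frequency categories
    # a stable sort keeps the result reproducible between runs
    ordered = freqs.sort_values(ascending=False, kind='stable')
    names = ordered.index.tolist()[:num_top_categories]
    # pad so that every row has the same number of entries
    return names + [None] * (num_top_categories - len(names))

# Toy frequencies in the same layout as toronto_grouped
toy_grouped = pd.DataFrame({
    'PostalCode': ['A', 'B'],
    'Bar': [0.0, 0.5],
    'Fast Food Restaurant': [1.0, 0.0],
    'History Museum': [0.0, 0.5],
})

top_a = top_nonzero_categories(toy_grouped.iloc[0], 3)
top_b = top_nonzero_categories(toy_grouped.iloc[1], 3)
print(top_a)
print(top_b)
```

With this variation a postal code with one venue shows one category and empty cells, which makes sparse postal codes easy to spot.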


## 3.4 Clustering postal codes 

#### Running K-means for 5 clusters 

In [81]:
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('PostalCode', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

#### Creating a new dataframe that includes the cluster as well as the top 10 venue categories for each postal code

In [82]:
# add clustering labels
postcodes_categories_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(postcodes_categories_sorted.set_index('PostalCode'), on='PostalCode',how='inner')

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,2,Fast Food Restaurant,Diner,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0,Bar,History Museum,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0,Medical Center,Breakfast Spot,Rental Car Location,Mexican Restaurant,Bank,Electronics Store,Intersection,Yoga Studio,Doner Restaurant,Donut Shop
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0,Coffee Shop,Pharmacy,Korean Restaurant,Yoga Studio,Dumpling Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0,Caribbean Restaurant,Bakery,Fried Chicken Joint,Thai Restaurant,Athletics & Sports,Gas Station,Bank,Lounge,Hakka Restaurant,Empanada Restaurant


#### Visualizing the resulting clusters in a map 

In [83]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, pc, bo, nb, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Borough'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(pc) + ' ' + str(bo) +  " (" + str(nb) +  ") " + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 3.5 Examining clusters 

In [84]:
for cluster in np.sort(toronto_merged['Cluster Labels'].unique()):
    print('Cluster {}: {} postal codes'.format(cluster,toronto_merged['Cluster Labels'].value_counts()[cluster]))

Cluster 0: 86 postal codes
Cluster 1: 3 postal codes
Cluster 2: 1 postal codes
Cluster 3: 1 postal codes
Cluster 4: 8 postal codes
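
With 86 of the 99 postal codes in cluster 0, the split for 5 clusters is very uneven. One common way to compare values of k is the silhouette score. This technique is not used in the notebook, and the sketch below runs on synthetic data standing in for `toronto_grouped_clustering`, so it illustrates only the procedure, not a result for Toronto.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)

# Synthetic stand-in for the feature matrix: three loose blobs in 2-D
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)    # in [-1, 1], higher is better
    sizes = np.bincount(labels, minlength=k)   # number of points per cluster
    print(k, round(scores[k], 3), sizes.tolist())

best_k = max(scores, key=scores.get)
print('k with the highest silhouette score:', best_k)
```

Printing the cluster sizes next to the score helps to notice cases where one cluster absorbs almost every point, as happens above.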


### Cluster 0 

In [85]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0,1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,M1C,Scarborough,0,Bar,History Museum,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
2,M1E,Scarborough,0,Medical Center,Breakfast Spot,Rental Car Location,Mexican Restaurant,Bank,Electronics Store,Intersection,Yoga Studio,Doner Restaurant,Donut Shop
3,M1G,Scarborough,0,Coffee Shop,Pharmacy,Korean Restaurant,Yoga Studio,Dumpling Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,M1H,Scarborough,0,Caribbean Restaurant,Bakery,Fried Chicken Joint,Thai Restaurant,Athletics & Sports,Gas Station,Bank,Lounge,Hakka Restaurant,Empanada Restaurant
5,M1J,Scarborough,0,Playground,Yoga Studio,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
6,M1K,Scarborough,0,Department Store,Convenience Store,Hobby Shop,Chinese Restaurant,Coffee Shop,Discount Store,Yoga Studio,Drugstore,Dog Run,Doner Restaurant
7,M1L,Scarborough,0,Bus Line,Bakery,Park,Ice Cream Shop,Bus Station,Intersection,Soccer Field,Gift Shop,Diner,Ethiopian Restaurant
8,M1M,Scarborough,0,American Restaurant,Intersection,Motel,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Yoga Studio
9,M1N,Scarborough,0,College Stadium,Skating Rink,General Entertainment,Café,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Diner,Drugstore
10,M1P,Scarborough,0,Indian Restaurant,Vietnamese Restaurant,Gaming Cafe,Chinese Restaurant,Furniture / Home Store,Pet Store,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


This cluster contains venues of every type, so I would call it "neighborhoods that have everything".

### Cluster 1

In [86]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[0,1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,M3M,North York,1,Food Truck,Baseball Field,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Yoga Studio,Field
91,M8Y,Etobicoke,1,Construction & Landscaping,Baseball Field,Yoga Studio,Electronics Store,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant,Empanada Restaurant
97,M9M,North York,1,Baseball Field,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store,Filipino Restaurant


The venues in this cluster are quite varied. Sports venues related to baseball and yoga stand out, and the other venues complement these activities. I would call this cluster "neighborhoods for athletes".

### Cluster 2 

In [87]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[0,1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,2,Fast Food Restaurant,Diner,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


Fast food restaurants stand out here, so I would call this cluster "fast food neighborhoods".

### Cluster 3

In [88]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[0,1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
94,M9B,Etobicoke,3,Golf Course,Yoga Studio,Eastern European Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store


Here, golf and yoga stand out, so I would call this cluster "neighborhoods with boring sports venues".

### Cluster 4

In [89]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[0,1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,M1V,Scarborough,4,Park,Playground,Coffee Shop,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
23,M2P,North York,4,Park,Bank,Convenience Store,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
25,M3A,North York,4,Park,Food & Drink Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Fast Food Restaurant,Dumpling Restaurant
30,M3K,North York,4,Park,Airport,Eastern European Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
40,M4J,East York,4,Park,Convenience Store,Coffee Shop,Eastern European Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
50,M4W,Downtown Toronto,4,Park,Playground,Trail,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
74,M6E,York,4,Park,Market,Women's Store,Concert Hall,Distribution Center,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant
98,M9N,York,4,Park,Convenience Store,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant


Parks stand out here, so I would name this cluster "neighborhoods with green areas".
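
The cluster names in this section come from reading the venue tables by eye. As a rough cross-check, the venue categories can be tallied per cluster. The sketch below is only an illustration: `sample` holds a few rows copied from the Cluster 1 and Cluster 4 tables (first two venue columns only) and stands in for `toronto_merged`. The variable names are hypothetical.

```python
import pandas as pd

# Rows copied from the Cluster 1 and Cluster 4 tables above (three rows each,
# two venue columns only); a stand-in for toronto_merged.
sample = pd.DataFrame({
    "PostalCode": ["M3M", "M8Y", "M9M", "M1V", "M2P", "M3A"],
    "Cluster Labels": [1, 1, 1, 4, 4, 4],
    "1st Most Common Venue": ["Food Truck", "Construction & Landscaping",
                              "Baseball Field", "Park", "Park", "Park"],
    "2nd Most Common Venue": ["Baseball Field", "Baseball Field",
                              "Yoga Studio", "Playground", "Bank",
                              "Food & Drink Shop"],
})

venue_cols = ["1st Most Common Venue", "2nd Most Common Venue"]

# Long format: one row per (postal code, venue rank) pair.
long_form = sample.melt(id_vars=["Cluster Labels"], value_vars=venue_cols,
                        value_name="Venue")

# Count how often each venue category appears in each cluster.
venue_counts = pd.crosstab(long_form["Cluster Labels"], long_form["Venue"])

# Most frequent venue category per cluster.
dominant = venue_counts.idxmax(axis=1)
print(dominant)
```

On this sample the tally agrees with the visual reading: baseball fields dominate cluster 1 and parks dominate cluster 4. Running the same steps on the full `toronto_merged` frame with all ten venue columns would be the natural extension.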