# The Toronto project - Data Science Capstone

This repository will mainly be used for my first data science project.

In [1]:
pip install geocoder

Note: you may need to restart the kernel to use updated packages.


In [92]:
import numpy as np
import pandas as pd
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import time 
import matplotlib.pyplot as plt 
from matplotlib.ticker import MaxNLocator
import geocoder
import requests # library to handle requests
from pandas import json_normalize # transform a JSON file into a pandas dataframe (pandas.io.json.json_normalize is deprecated since pandas 1.0)

!conda install -c conda-forge geopy --yes # install geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # install folium; comment this line out if it is already installed
import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [3]:
print("Hello Coursera Capstone")

Hello Coursera Capstone


## Scraping

1 - In this section, I start by creating the lists in which I will store the postal codes, the boroughs and the neighborhoods.

2 - Then I open the Wikipedia page and turn it into a BeautifulSoup object.

3 - In this object, I look for the Wikipedia table, then loop through every row of the table to get the data, knowing that every row starts with a `<tr>` tag.

4 - In every row, I loop over every cell, knowing that every cell starts with a `<td>` tag. To avoid empty data, I only retrieve values from rows that have exactly 3 cells.

5 - Finally, I append the retrieved values to the lists.
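
The row filter in steps 3 to 5 can be sketched without any third-party library. This is only an illustration of the idea, not the code used in this notebook: it relies on the standard library's `html.parser` instead of BeautifulSoup, the class name `ThreeCellRows` is hypothetical, and the HTML fragment is made up to resemble the Wikipedia table. Unlike the notebook, it strips whitespace from each cell straight away.

```python
from html.parser import HTMLParser


class ThreeCellRows(HTMLParser):
    """Collect the text of every table row that has exactly three <td> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []          # accepted rows, each a list of three strings
        self._cells = None      # cells of the row being read, None outside a row
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cells = []
        elif tag == 'td' and self._cells is not None:
            self._cells.append('')
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False
        elif tag == 'tr' and self._cells is not None:
            if len(self._cells) == 3:   # skips the header row and malformed rows
                self.rows.append([cell.strip() for cell in self._cells])
            self._cells = None


# Made-up fragment shaped like the Wikipedia table (an assumption for illustration)
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A\n</td><td>Not assigned\n</td><td>\n</td></tr>
<tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>
<tr><td>broken row</td></tr>
</table>
"""

parser = ThreeCellRows()
parser.feed(sample_html)
print(parser.rows)
```

The header row contains only `<th>` cells and the last row has a single cell, so the length check drops both.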

In [4]:
#1
postal_code = []
borough = []
neighborhood = []

#2
my_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,"html.parser")

#3
right_table=page_soup.find('table', class_='wikitable sortable')

for row in right_table.findAll('tr'):

#4
    cells = row.findAll('td')
    if len(cells) ==3:
        my_code = cells[0].find(text=True)
        my_borough = cells[1].find(text=True)
        my_neighborhood = cells[2].find(text=True)
#5        
        postal_code.append(my_code)
        borough.append(my_borough)
        neighborhood.append(str(my_neighborhood))


## Making a dataframe

1 - I wrote a function that creates a dataframe out of the scraped data. Each column comes from one of the lists holding the postal codes, the boroughs and the neighborhoods.

2 - To get a clean dataframe, I replace stray text such as '\n' and rename the columns.

In [5]:
#1 - make a function that stores values into a dataframe

def scrap_to_data(my_code, my_borough, my_neighborhood):
    # note: the columns are filled from the module-level lists built while
    # scraping (postal_code, borough, neighborhood); the arguments are not used
    final_csv = pd.DataFrame()
    final_csv['my_code'] = postal_code
    final_csv['my_borough'] = borough
    final_csv['my_neighborhood'] = neighborhood

    #2 - clean the dataset: replace newline characters with a space
    final_csv['my_code'] = final_csv['my_code'].replace('\n',' ', regex=True)
    final_csv['my_borough'] = final_csv['my_borough'].replace('\n',' ', regex=True)
    final_csv['my_neighborhood'] = final_csv['my_neighborhood'].replace('\n',' ', regex=True)

    # rename the columns
    final_csv.rename(columns={'my_code':'PostalCode','my_borough':'Borough','my_neighborhood':'Neighborhood'}, inplace=True)

    return final_csv

scrap_to_data(my_code, my_borough, my_neighborhood)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


## Data wrangling

To follow the assignment instructions, I replaced the 'Not assigned' values in the Borough column of the dataframe with NaN. I could then drop the rows containing NaN values to get a clean dataframe.

N.B: there was no need to remove duplicates since the Wikipedia table already lists each postal code only once.
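
Both statements above (unassigned boroughs are dropped, and no postal code appears twice) can be checked explicitly. The sketch below uses plain tuples and made-up rows rather than the notebook's dataframe, so it only illustrates the check itself.

```python
from collections import Counter

# Made-up rows in the (postal code, borough, neighborhood) layout of the scraped table
rows = [
    ('M1A', 'Not assigned', ''),
    ('M3A', 'North York', 'Parkwoods'),
    ('M4A', 'North York', 'Victoria Village'),
    ('M8A', 'Not assigned', ''),
]

# Keep only the rows whose borough is assigned
assigned = [row for row in rows if row[1] != 'Not assigned']

# Count how often each postal code occurs; a count above 1 means a duplicate
code_counts = Counter(code for code, _, _ in assigned)
duplicates = [code for code, n in code_counts.items() if n > 1]

print(len(assigned), duplicates)
```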

In [6]:
df = scrap_to_data(my_code, my_borough, my_neighborhood)
df.head(20)  

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"


In [7]:
#remove rows with not assigned borough
df = df.replace('Not assigned',np.nan, regex=True)
df = df.dropna()
df

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,Business reply mail Processing Centre
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [8]:
df['Neighborhood'].isnull().sum()

0

In [9]:
df.shape

(103, 3)

## Geocoding data


1- To geocode the data, I created two lists in which to store the coordinates of every postal code in Toronto.

2- To get the coordinates, I used the ArcGIS geocoder rather than the Google geocoder, which was not working properly. I looped over every postal code and appended the results to the lists.

3- I added 2 columns to the dataframe, holding the Latitude and Longitude of every postal code.

4- Finally, I visualized the data using Folium.

N.B: to make the steps in the code easier to follow, I show the dataframe after every transformation.
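
Geocoding services can fail intermittently, so a retry wrapper is one way to make step 2 more robust. The helper below is a sketch under assumptions: `geocode_with_retry` and `fake_lookup` are hypothetical names, and `lookup` stands for any callable that returns a latitude/longitude pair on success and an empty result otherwise. With the `geocoder` package it could wrap the `geocoder.arcgis(...).latlng` call used in this notebook.

```python
import time


def geocode_with_retry(lookup, query, attempts=3, pause=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out.

    `lookup` is a hypothetical callable returning a two-item sequence
    (latitude, longitude) on success and None or an empty sequence on failure.
    """
    for _ in range(attempts):
        coords = lookup(query)
        if coords:                      # non-empty result: stop retrying
            return coords[0], coords[1]
        time.sleep(pause)               # wait before the next attempt
    return None, None                   # the caller decides how to handle a miss


# Stand-in for a real geocoder: fails once, then returns fixed coordinates
calls = {'n': 0}

def fake_lookup(query):
    calls['n'] += 1
    return None if calls['n'] == 1 else [43.752935, -79.335641]


lat, lng = geocode_with_retry(fake_lookup, 'M3A, Toronto, Ontario')
print(lat, lng)
```

Returning `(None, None)` instead of raising keeps the two coordinate lists the same length as the dataframe, at the cost of having to drop or re-query the missing rows afterwards.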

In [10]:
df['PostalCode']

2      M3A 
3      M4A 
4      M5A 
5      M6A 
6      M7A 
       ... 
160    M8X 
165    M4Y 
168    M7Y 
169    M8Y 
178    M8Z 
Name: PostalCode, Length: 103, dtype: object

In [11]:
#1
Latitude = []
Longitude = []

In [12]:
#2
# geocode every postal code with the ArcGIS provider
for i in df['PostalCode']:
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(i))
    lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]

    Latitude.append(latitude)
    Longitude.append(longitude)

In [13]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,Business reply mail Processing Centre
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [14]:
#3 
df['Latitude'] = Latitude
df['Longitude'] = Longitude

In [15]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.752935,-79.335641
3,M4A,North York,Victoria Village,43.728102,-79.311890
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.661790,-79.389390
...,...,...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653340,-79.509766
165,M4Y,Downtown Toronto,Church and Wellesley,43.666659,-79.381472
168,M7Y,East Toronto,Business reply mail Processing Centre,43.648700,-79.385450
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.632798,-79.493017


In [71]:
#4
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [75]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map (loop names differ from the scraping lists to avoid overwriting them)
for lat, lng, boro, hood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(hood, boro)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto

In [89]:
toronto_data = df[df['Borough'].astype(str).str.contains('Toronto')]
toronto_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939
13,M5B,Downtown Toronto,"Garden District, Ryerson",43.657491,-79.377529
22,M5C,Downtown Toronto,St. James Town,43.651734,-79.375554
30,M4E,East Toronto,The Beaches,43.678148,-79.295349
31,M5E,Downtown Toronto,Berczy Park,43.645196,-79.373855
40,M5G,Downtown Toronto,Central Bay Street,43.656072,-79.385653
41,M6G,Downtown Toronto,Christie,43.668602,-79.420387
49,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650542,-79.384116
50,M6H,West Toronto,"Dufferin, Dovercourt Village",43.66491,-79.438664


## Clustering data



In [16]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.752935,-79.335641
3,M4A,North York,Victoria Village,43.728102,-79.311890
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.661790,-79.389390
...,...,...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653340,-79.509766
165,M4Y,Downtown Toronto,Church and Wellesley,43.666659,-79.381472
168,M7Y,East Toronto,Business reply mail Processing Centre,43.648700,-79.385450
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.632798,-79.493017


In [17]:
CLIENT_ID = 'YOUR_CLIENT_ID' # Foursquare client ID (redacted)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # Foursquare client secret (redacted)
VERSION = '20180605'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET: YOUR_CLIENT_SECRET


### Identify venues in Parkwoods

1- I used the Foursquare API to get the venue data. Specifically, I requested the venues within 500 meters of the coordinates of the first neighborhood, Parkwoods.

2- I created a function to retrieve the category of each venue.

3- I created a dataframe with the retrieved data.
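
The request URL for step 1 can also be assembled from a dictionary of query parameters, which handles the escaping. This is a sketch, assuming the same `v2/venues/explore` endpoint and parameter names as in this notebook; `build_explore_url` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_explore_url(client_id, client_secret, lat, lng, version, radius=500, limit=100):
    """Assemble the explore URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),   # latitude,longitude as one value
        'radius': radius,                 # search radius in meters
        'limit': limit,                   # maximum number of venues returned
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Placeholder credentials; the coordinates are those of Parkwoods in the dataframe
url = build_explore_url('MY_ID', 'MY_SECRET', 43.752935, -79.335641, '20180605')
print(url)

# Parsing the URL back shows the parameters survive the encoding
print(parse_qs(urlsplit(url).query)['ll'])
```

`urlencode` percent-encodes the comma in `ll`, and decoding the query string restores the original value.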

In [19]:
#1
neighborhood_latitude = df.loc[2,'Latitude']
neighborhood_longitude = df.loc[2,'Longitude']

In [20]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)


In [21]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ec40771b57e88001c15498d'},
 'response': {'headerLocation': 'Sunnybrook - York Mills',
  'headerFullLocation': 'Sunnybrook - York Mills, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.75743455950008,
    'lng': -79.32942319651914},
   'sw': {'lat': 43.74843455050008, 'lng': -79.3418596494808}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 301,
        

In [22]:
#2
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

get_category_type

<function __main__.get_category_type(row)>

In [23]:
#3
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Variety Store,Food & Drink Shop,43.751974,-79.333114
2,Sun Life,Construction & Landscaping,43.75476,-79.332783
3,MacLeod Exteriors Inc.,Construction & Landscaping,43.755014,-79.338688


### Get the venues in every neighborhood of Toronto

I followed the same logic as above and created a function to get the venues around every neighborhood in Toronto.
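
The core of that function is flattening the nested API response into one row per venue. The sketch below shows this step on its own, with a guard for venues that have no category. `flatten_items` is a hypothetical helper, and the sample payload is made up to mirror the shape of `results['response']['groups'][0]['items']`.

```python
def flatten_items(neighborhood, items):
    """Turn Foursquare-style items into flat (neighborhood, venue, lat, lng, category) rows."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Take the first category name, or None when the venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((
            neighborhood,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# Made-up payload; the second venue is hypothetical and has no category
sample_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]

flat_rows = flatten_items('Parkwoods', sample_items)
for row in flat_rows:
    print(row)
```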

In [94]:
#1
def getNearbyVenus(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
        
           
        
        

In [25]:
toronto_venues = getNearbyVenus(names=df['Neighborhood'],
                                latitudes=df['Latitude'], 
                                longitudes=df['Longitude'])






Parkwoods 
Victoria Village 
Regent Park, Harbourfront 
Lawrence Manor, Lawrence Heights 
Queen's Park, Ontario Provincial Government 
Islington Avenue 
Malvern, Rouge 
Don Mills 
Parkview Hill, Woodbine Gardens 
Garden District, Ryerson 
Glencairn 
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale 
Rouge Hill, Port Union, Highland Creek 
Don Mills 
Woodbine Heights 
St. James Town 
Humewood-Cedarvale 
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood 
Guildwood, Morningside, West Hill 
The Beaches 
Berczy Park 
Caledonia-Fairbanks 
Woburn 
Leaside 
Central Bay Street 
Christie 
Cedarbrae 
Hillcrest Village 
Bathurst Manor, Wilson Heights, Downsview North 
Thorncliffe Park 
Richmond, Adelaide, King 
Dufferin, Dovercourt Village 
Scarborough Village 
Fairview, Henry Farm, Oriole 
Northwood Park, York University 
East Toronto 
Harbourfront East, Union Station, Toronto Islands 
Little Portugal, Trinity 
Kennedy Park, Ionview, East Birchmount Park 
Bayview 

In [26]:
toronto_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.752935,-79.335641,Brookbanks Park,43.751976,-79.332140,Park
1,Parkwoods,43.752935,-79.335641,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.752935,-79.335641,Sun Life,43.754760,-79.332783,Construction & Landscaping
3,Parkwoods,43.752935,-79.335641,MacLeod Exteriors Inc.,43.755014,-79.338688,Construction & Landscaping
4,Victoria Village,43.728102,-79.311890,Tim Hortons,43.725517,-79.313103,Coffee Shop
...,...,...,...,...,...,...,...
2248,"Mimico NW, The Queensway West, South of Bloor,...",43.625490,-79.526000,Tactical Products Canada,43.626801,-79.529388,Miscellaneous Shop
2249,"Mimico NW, The Queensway West, South of Bloor,...",43.625490,-79.526000,Queensway Fish & Chips,43.621720,-79.524588,Fish & Chips Shop
2250,"Mimico NW, The Queensway West, South of Bloor,...",43.625490,-79.526000,Sleep Country,43.621340,-79.526708,Mattress Store
2251,"Mimico NW, The Queensway West, South of Bloor,...",43.625490,-79.526000,Global Pet Foods,43.621304,-79.526146,Pet Store


In [27]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,5,5,5,5,5,5
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Wilson Heights, Downsview North",20,20,20,20,20,20
Bayview Village,2,2,2,2,2,2
"Bedford Park, Lawrence Manor East",20,20,20,20,20,20
...,...,...,...,...,...,...
"Willowdale, Newtonbrook",3,3,3,3,3,3
Woburn,3,3,3,3,3,3
Woodbine Heights,11,11,11,11,11,11
York Mills West,3,3,3,3,3,3


## Analyze Each Neighborhood

1- I started by creating dummy variables that indicate, for each venue, which category it belongs to.

2- I moved the Neighborhood column back to the first position of the dataframe.

3- I grouped the dataframe by Neighborhood and computed the mean of every venue category column. This gives the relative frequency of every venue category in the neighborhood.

4- I computed the top 5 and the top 10 venue categories, that is to say the most frequent venue categories in every neighborhood. The top 10 is used to describe the clusters; the k-means step itself runs on the full table of category frequencies.
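
Taking the mean of a 0/1 dummy column within a neighborhood is the same as dividing the number of venues of that category by the total number of venues there. The sketch below reproduces that arithmetic with plain counters. The helper names are hypothetical, and the sample rows mirror the Parkwoods and Victoria Village venues shown earlier.

```python
from collections import Counter, defaultdict

# (neighborhood, venue category) pairs
venue_rows = [
    ('Parkwoods', 'Park'),
    ('Parkwoods', 'Food & Drink Shop'),
    ('Parkwoods', 'Construction & Landscaping'),
    ('Parkwoods', 'Construction & Landscaping'),
    ('Victoria Village', 'Coffee Shop'),
]


def category_frequencies(rows):
    """Return {neighborhood: {category: share of that neighborhood's venues}}."""
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(c.values()) for cat, n in c.items()}
        for hood, c in counts.items()
    }


def top_categories(freqs, n):
    """Return the n most frequent categories, highest share first."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [cat for cat, _ in ranked[:n]]


freqs = category_frequencies(venue_rows)
print(freqs['Parkwoods'])
print(top_categories(freqs['Parkwoods'], 2))
```

Two of the four Parkwoods venues are in Construction & Landscaping, so that category gets a share of 0.5 and ranks first.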

In [95]:
#1
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

#2
# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [56]:
toronto_onehot.set_index('Neighborhood', inplace=True)

In [57]:
toronto_onehot.shape

(2253, 259)

In [58]:
toronto_onehot.head(30)

Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
Parkwoods,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Parkwoods,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Parkwoods,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Parkwoods,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Victoria Village,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Victoria Village,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Victoria Village,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Victoria Village,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Victoria Village,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Victoria Village,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [50]:
#3
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Art Gallery,Art Museum,Arts & Crafts Store,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [59]:
toronto_grouped.shape

(97, 260)

In [60]:
#4
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt ----
              venue  freq
0    Breakfast Spot   0.2
1   Badminton Court   0.2
2       Supermarket   0.2
3  Sushi Restaurant   0.2
4      Skating Rink   0.2


----Alderwood, Long Branch ----
               venue  freq
0  Convenience Store  0.12
1                Pub  0.12
2        Gas Station  0.12
3                Gym  0.12
4           Pharmacy  0.12


----Bathurst Manor, Wilson Heights, Downsview North ----
                 venue  freq
0                 Bank  0.10
1          Coffee Shop  0.10
2  Fried Chicken Joint  0.05
3             Pharmacy  0.05
4        Deli / Bodega  0.05


----Bayview Village ----
                        venue  freq
0  Construction & Landscaping   0.5
1                       Trail   0.5
2         Moroccan Restaurant   0.0
3                Night Market   0.0
4     New American Restaurant   0.0


----Bedford Park, Lawrence Manor East ----
                venue  freq
0         Coffee Shop  0.10
1  Italian Restaurant  0.10
2      Sandwich Place  0

In [61]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [62]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Breakfast Spot,Skating Rink,Badminton Court,Sushi Restaurant,Supermarket,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant,Doctor's Office
1,"Alderwood, Long Branch",Pizza Place,Gym,Sandwich Place,Pub,Coffee Shop,Convenience Store,Gas Station,Pharmacy,Ethiopian Restaurant,Falafel Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Sandwich Place,Sushi Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Restaurant,Diner,Bridal Shop,Gas Station
3,Bayview Village,Trail,Construction & Landscaping,Women's Store,Donut Shop,Eastern European Restaurant,Electronics Store,Elementary School,Ethiopian Restaurant,Falafel Restaurant,Farm
4,"Bedford Park, Lawrence Manor East",Italian Restaurant,Sandwich Place,Coffee Shop,Sushi Restaurant,Restaurant,Café,Butcher,Sports Club,Pub,Thai Restaurant


## Cluster Neighborhoods

1- I created five clusters with k-means, fitted on the venue category frequencies of every neighborhood, and attached the cluster labels to the table of the 10 most frequent venue categories.

2- I added the coordinates to the clustered neighborhoods.

3- I mapped the data using Folium.
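
To show what the clustering step does with the frequency vectors, here is a minimal pure-Python version of Lloyd's algorithm, the basic k-means iteration. It is a sketch rather than scikit-learn's implementation: it uses the first `k` points as initial centroids, whereas `KMeans` defaults to k-means++ initialisation, so the labels it produces need not match. The two-dimensional vectors are made up.

```python
def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Made-up frequency vectors: [coffee shop share, park share]
vectors = [
    [0.9, 0.1],
    [0.1, 0.9],
    [0.8, 0.2],
    [0.2, 0.8],
]

labels, centroids = kmeans(vectors, k=2)
print(labels)
```

Neighborhoods with a similar mix of venue categories end up with the same label, which is what the map colors encode.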

In [67]:
#1
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 4, 4, 0, 4, 4, 4, 4, 4, 4], dtype=int32)

In [90]:
#2
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041,4,Pub,Café,Athletics & Sports,Coffee Shop,Performing Arts Venue,Tech Startup,Seafood Restaurant,Mediterranean Restaurant,Boutique,Mexican Restaurant
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939,4,Coffee Shop,Café,Sushi Restaurant,Hobby Shop,Fried Chicken Joint,Sandwich Place,Pharmacy,Park,Middle Eastern Restaurant,Italian Restaurant
13,M5B,Downtown Toronto,"Garden District, Ryerson",43.657491,-79.377529,4,Coffee Shop,Clothing Store,Sandwich Place,Middle Eastern Restaurant,Hotel,Café,Italian Restaurant,Bar,Cosmetics Shop,Restaurant
22,M5C,Downtown Toronto,St. James Town,43.651734,-79.375554,4,Café,Coffee Shop,Cocktail Bar,Seafood Restaurant,Gastropub,American Restaurant,Cosmetics Shop,Italian Restaurant,Creperie,Lingerie Store
30,M4E,East Toronto,The Beaches,43.678148,-79.295349,4,Health Food Store,Pub,Church,Trail,Farmers Market,Farm,Falafel Restaurant,Fast Food Restaurant,Ethiopian Restaurant,Discount Store


In [93]:
#3
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters