# Neighborhood Toronto 

This file is used to analyze the Toronto neighborhood area as part of the Applied Capstone Data Science Project from Coursera. 

In the first file, I am going to transform the postal code information of Toronto into a DataFrame, in which I am going to proceed for the subsequent exercise. 

This markdown file serves as documentation tool in which I am going to describe my steps and ideas leading to the respective steps. 

I will constantly update the markdown file according to the individual weeks in which each assignment process takes place. 

## Week 3: Segmenting and Clustering Neighborhoods in the city of Toronto, Canada

#### First, we are going to install the most common packages in Python: 

In [2]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import requests
import requests # library to handle requests
import random # library for random number generation
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize
import folium # plotting library

#### In a next step, we are going to look for a package or library that can transform the respective Postal Code Wikipedia page into a different data format. Apparently, Beautiful soup is a Python library for pulling data out of HTML and XML files.

Let's install the package: 

In [3]:
from bs4 import BeautifulSoup

### Getting the dataset

#### To use th document, we need to retrieve the URL of our postal code file and assign it to a name, request it as an HTML document and parse it into a text format, which we can access like a tuple:

In [4]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

html_doc = requests.get(url)

soup = BeautifulSoup(html_doc.text, "html.parser")
print(soup.title.text)

List of postal codes of Canada: M - Wikipedia


#### We can extract the html table by going on the webpage and open the right-click menu bar. 
Once we click on "Inspect", it opens up the a console in which the elements of the page are visible. This 
document lets us analyze the structure of the file. 

Once there, we can just look for the respective line(s) of code that represent the table we want to retrieve. 
In our case, it "table class = wikitable...". As a consequence, we can use the command below. 

This will give us the html documentation of the table (instead of the entire page)

In [5]:
toronto_table = soup.find("table", attrs={"class": "wikitable sortable"})

#### In the first few rows it is applicable that the headers use the prefix "th" as identification. Consequently, we want to find all "th" values. Doing so, we use a for loop that loops over all three header names. Also, we replace the "th" values with blanks, to get the list we want

In [6]:
headers = toronto_table.find_all("th")
for i, head in enumerate(headers): headers[i]=str(headers[i]).replace("<th>","").replace("</th>","").replace("\n","")
headers

['Postal Code', 'Borough', 'Neighbourhood']

#### We do the same for our body values. In order to get rid of the row tabs, we can use "\n". 

Importantly, however, we still want to keep one indent, mainly because we require a similar structure for each row
such that we can convert the created one-column dataframe into a multiple columns dataframe later. 

In [8]:
body = toronto_table.find_all("tr")
body = body[1:len(body)]
body
for i, bod in enumerate(body): body[i] = str(body[i]).replace("<tr>\n<td>", "").replace("\n</td></tr>", "")

#### Now, we create the dataframe: 

In this step, we: 

    - Divide the dataframe into its respective columns 
    - Drop Not assigned boroughs 
    - Fill the not assigned neighborhood with the name of its respective borough

In [9]:
toronto_df = pd.DataFrame(body)

toronto_df.columns = ["AB"]
toronto_df[headers] = toronto_df['AB'].str.split('\n</td>\n<td>', n = 2, expand = True)
toronto_df.drop("AB", axis = 1, inplace = True)
toronto_df = toronto_df[toronto_df.Borough != "Not assigned"]
toronto_df.Neighbourhood.fillna(toronto_df.Borough, inplace=True)

toronto_df.update(
    toronto_df.Neighbourhood.loc[
        lambda x: x.str.contains('title')
    ].str.extract('title=\"([^\"]*)',expand=False))

toronto_df.update(
    toronto_df.Borough.loc[
        lambda x: x.str.contains('title')
    ].str.extract('title=\"([^\"]*)',expand=False))

toronto_df.update(
    toronto_df.Neighbourhood.loc[
        lambda x: x.str.contains('Toronto')
    ].str.replace(", Toronto",""))
toronto_df.update(
    toronto_df.Neighbourhood.loc[
        lambda x: x.str.contains('Toronto')
    ].str.replace("\(Toronto\)",""))

toronto_df.head()


Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [10]:
toronto_df.to_csv('out.zip', index=False)

Which leaves us with a dataframe consisting of 103 individual rows and 3 columns.

In [11]:
toronto_df.shape

(103, 3)

#### Next, we will merge the location dataset with the one we just created to retrieve the latitude and longitude 

In [12]:
postal = pd.read_csv("http://cocl.us/Geospatial_data")

In [13]:
toronto_df["Postal Code"].nunique()

103

In [14]:
toronto_postal = pd.merge(toronto_df, postal, on='Postal Code')

#### Now, we can play around and analyze some neighborhoods. 

To do so, we again require the geolocation package and access to FourSquare. In the first case, we only analyze one borough. 

In [15]:
CLIENT_ID = 'JBREGZ4UNA53HX43WMAD4TQ2X2XJWMX5DPHEZEIZHQA0ACNP' # your Foursquare ID
CLIENT_SECRET = 'VNS40KF3V4MGSWWAV0IGQINZIGIT1EQKNCWBFPOS3QF1JMOJ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 10
radius = 500
 
neighborhood_latitude = toronto_postal.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_postal.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_postal.loc[0, 'Neighbourhood'] # neighborhood name

In [16]:
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)

#### We can now call the url and get the respective JSON document and clean the dataframe as discussed in the sessions

In [17]:
results = requests.get(url).json()
venues = results["response"]["venues"]
dataframe = pd.json_normalize(venues)
filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns]

def get_category_type(row):
    try:
        categories = row['categories']
    except:
        categories = row['venue.categories']
        
    if len(categories) == 0:
        return None
    else:
        return categories[0]['name']
    
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

dataframe_filtered.columns = [col.split('.')[-1] for col in dataframe_filtered.columns]

dataframe_filtered["neighborhood_name"] = neighborhood_name

dataframe_filtered


Unnamed: 0,name,categories,address,crossStreet,lat,lng,labeledLatLngs,distance,cc,city,state,country,formattedAddress,postalCode,neighborhood,id,neighborhood_name
0,TTC stop #8380,Bus Stop,Underhill Dr,At Cassandra N,43.752672,-79.326351,"[{'label': 'display', 'lat': 43.752672, 'lng':...",273,CA,Toronto,ON,Canada,"[Underhill Dr (At Cassandra N), Toronto ON, Ca...",,,4e42684718a8627fce453c01,Parkwoods
1,Brookbanks Park,Park,Toronto,,43.751976,-79.33214,"[{'label': 'display', 'lat': 43.75197604605557...",245,CA,Toronto,ON,Canada,"[Toronto, Toronto ON, Canada]",,,4e8d9dcdd5fbbbb6b3003c7b,Parkwoods
2,Dollarama,Discount Store,"1277 York Mills Rd, Parkwood Village",,43.760341,-79.325519,"[{'label': 'display', 'lat': 43.760341, 'lng':...",855,CA,North York,ON,Canada,"[1277 York Mills Rd, Parkwood Village, North Y...",M3A 1Z5,Parkwoods - Donalda,55bbdfb9498e5996dd9d4038,Parkwoods
3,GTA Restoration | Emergency Water Damage Plumb...,Construction & Landscaping,250 Yonge St,401 & DVP,43.753567,-79.351308,"[{'label': 'display', 'lat': 43.7535666482373,...",1741,CA,Toronto,ON,Canada,"[250 Yonge St (401 & DVP), Toronto ON M5B 2L7,...",M5B 2L7,,535fddb1498e03814e03968f,Parkwoods
4,Yorkmills Wellness & Spa,Spa,25 Lesmill Road Suite 200,,43.7568,-79.325346,"[{'label': 'display', 'lat': 43.75680029671985...",524,CA,North York,ON,Canada,"[25 Lesmill Road Suite 200, North York ON, Can...",,,54ee51de498e7a6fbe4f00a7,Parkwoods
5,Allwyn's Bakery,Caribbean Restaurant,81 Underhill drive,,43.75984,-79.324719,"[{'label': 'display', 'lat': 43.75984035203157...",833,CA,Toronto,ON,Canada,"[81 Underhill drive, Toronto ON M3A 1Z5, Canada]",M3A 1Z5,Parkwoods - Donalda,4b8991cbf964a520814232e3,Parkwoods
6,Subway,Sandwich Place,"1277 York Mills Road, Unit F1-2, Bldg F",,43.760334,-79.326906,"[{'label': 'display', 'lat': 43.76033437476135...",818,CA,Toronto,ON,Canada,"[1277 York Mills Road, Unit F1-2, Bldg F, Toro...",M3A 1Z5,,5e111e7e9316a70007fb9653,Parkwoods
7,Toronto Police Service - 33 Division,Police Station,50 Upjohn Rd,,43.748067,-79.336699,"[{'label': 'display', 'lat': 43.7480666251842,...",809,CA,Toronto,ON,Canada,"[50 Upjohn Rd, Toronto ON, Canada]",,,4ccb2b542dc43704b8d0bd08,Parkwoods
8,5 Brookbanks Drive,Residential Building (Apartment / Condo),5 Brookbanks Dr,,43.748044,-79.336736,"[{'label': 'display', 'lat': 43.74804359191481...",813,CA,Toronto,ON,Canada,"[5 Brookbanks Dr, Toronto ON M3A 2S8, Canada]",M3A 2S8,,4f470ba5e4b01863518856c7,Parkwoods
9,Bruno's valu-mart,Grocery Store,83 Underhill,at Donwood Plaza,43.746143,-79.32463,"[{'label': 'display', 'lat': 43.746143, 'lng':...",889,CA,Don Mills,ON,Canada,"[83 Underhill (at Donwood Plaza), Don Mills ON...",M3A 2P5,,4bafa285f964a5203a123ce3,Parkwoods


#### We can also visualize the venues for the respective borough:

In [19]:
toronto_borough_map = folium.Map(location = [neighborhood_latitude, neighborhood_longitude], zoom_start = 13)

folium.features.CircleMarker(
    [neighborhood_latitude, neighborhood_longitude],
    radius=10,
    color='red',
    popup='District Center',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(toronto_borough_map)

for lat, lng, label in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.categories):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(toronto_borough_map)

toronto_borough_map

#### However, we want to analyze more than just one borough. It would be convenient to retrieve information of this kind for each neighborhood in each borough of Toronto.

Doing so will require us to formulate a function:

In [561]:
def VenuesToronto(latitudes, longitudes, names, postal):  

    CLIENT_ID = 'JBREGZ4UNA53HX43WMAD4TQ2X2XJWMX5DPHEZEIZHQA0ACNP' # your Foursquare ID
    CLIENT_SECRET = 'VNS40KF3V4MGSWWAV0IGQINZIGIT1EQKNCWBFPOS3QF1JMOJ' # your Foursquare Secret
    VERSION = '20180605'
    LIMIT = 40
    radius = 500

    venues_list =  [] 

    for latitude, longitude, name, post in zip(latitudes, longitudes, names, postal):

        # In order to loook for specific venues nearby, we can use the FourSquare API call: 

        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            latitude, 
            longitude, 
            radius, 
            LIMIT)
        
        venue = requests.get(url).json()["response"]['groups'][0]['items']

        venues_list.append([(name, 
                          latitude, 
                          longitude,
                          post,
                          v["venue"]["name"], 
                          v["venue"]["categories"][0]["name"],
                          v["venue"]["location"]["lat"],
                          v["venue"]["location"]["lng"]) for v in venue])

        pd_v = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
        pd_v.columns = ['Neighborhood', 
                        'Neighborhood-Latitude', 
                        'Neighborhood-Longitude', 
                        'Postal Code',
                        'Venue', 
                        'Venue_Category',
                        'Venue_Latitude', 
                        'Venue_Longitude', 
                        ]
    return(pd_v)

In [562]:
toronto_venues = VenuesToronto(latitudes = toronto_postal["Latitude"],
                longitudes = toronto_postal["Longitude"],
                names = toronto_postal["Neighbourhood"], 
                postal = toronto_postal['Postal Code'])

#### As the function is created, we have now many venues for each Neighborhood. 
#### Also, we created a count value, which we will use later.

In [563]:
tor_venu = toronto_venues
tor_venu["Count"] = 1
tor_venu

Unnamed: 0,Neighborhood,Neighborhood-Latitude,Neighborhood-Longitude,Postal Code,Venue,Venue_Category,Venue_Latitude,Venue_Longitude,Count
0,Parkwoods,43.753259,-79.329656,M3A,Brookbanks Park,Park,43.751976,-79.332140,1
1,Parkwoods,43.753259,-79.329656,M3A,Variety Store,Food & Drink Shop,43.751974,-79.333114,1
2,Victoria Village,43.725882,-79.315572,M4A,Victoria Village Arena,Hockey Arena,43.723481,-79.315635,1
3,Victoria Village,43.725882,-79.315572,M4A,Portugril,Portuguese Restaurant,43.725819,-79.312785,1
4,Victoria Village,43.725882,-79.315572,M4A,Tim Hortons,Coffee Shop,43.725517,-79.313103,1
...,...,...,...,...,...,...,...,...,...
1538,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,M8Z,RONA,Hardware Store,43.629393,-79.518320,1
1539,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,M8Z,Royal Canadian Legion #210,Social Club,43.628855,-79.518903,1
1540,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,M8Z,Koala Tan Tanning Salon & Sunless Spa,Tanning Salon,43.631370,-79.519006,1
1541,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,M8Z,Kingsway Boxing Club,Gym,43.627254,-79.526684,1


#### Finally, we can visualize the venues with folium: 

In [564]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="foursquare_agent") # call the geolocator 

location = geolocator.geocode(address)
latitude_tor = location.latitude
longitude_tor = location.longitude

toronto_map = folium.Map(location = [latitude_tor, longitude_tor], zoom_start = 10)

folium.features.CircleMarker(
    [latitude_tor, longitude_tor],
    radius=10,
    color='red',
    popup='District Center',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6,
    
).add_to(toronto_map)

for lat, lng, label in zip(toronto_venues.Venue_Latitude, toronto_venues.Venue_Longitude, toronto_venues.Venue_Category):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        fill = True,
        fill_color='blue',
        fill_opacity=0.6,
        popup=folium.Popup(label, parse_html=True)
    ).add_to(toronto_map)


toronto_map

### Configuring the dataset

#### We can play around with this dataset now. If we want to compare neighborhoods with each other, it would be handy to understand which venue categories are represented to what amount in each neighborhood. Also, assumming that neighborhoods within the same borough are more similar to each other based on demo-, socio- as well as geographic characteristics, we might be interested to know to which borough the respective neighborhood belongs.  

Let's start by making an index.

In [565]:
toronto_venues.set_index("Neighborhood", inplace = True)

In [566]:
tor_venu.to_csv("tor.csv", index = False)

#### To get the Boroughs for each neighborhood, we can just merge the dataframe with the one we used earlier

In [568]:
tor_venu = pd.merge(toronto_df, tor_venu, on='Postal Code')
tor_venu

Unnamed: 0,Postal Code,Borough,Neighbourhood,Neighborhood-Latitude,Neighborhood-Longitude,Venue,Venue_Category,Venue_Latitude,Venue_Longitude,Count
0,M3A,North York,Parkwoods,43.753259,-79.329656,Brookbanks Park,Park,43.751976,-79.332140,1
1,M3A,North York,Parkwoods,43.753259,-79.329656,Variety Store,Food & Drink Shop,43.751974,-79.333114,1
2,M4A,North York,Victoria Village,43.725882,-79.315572,Victoria Village Arena,Hockey Arena,43.723481,-79.315635,1
3,M4A,North York,Victoria Village,43.725882,-79.315572,Portugril,Portuguese Restaurant,43.725819,-79.312785,1
4,M4A,North York,Victoria Village,43.725882,-79.315572,Tim Hortons,Coffee Shop,43.725517,-79.313103,1
...,...,...,...,...,...,...,...,...,...,...
1538,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,RONA,Hardware Store,43.629393,-79.518320,1
1539,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Royal Canadian Legion #210,Social Club,43.628855,-79.518903,1
1540,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Koala Tan Tanning Salon & Sunless Spa,Tanning Salon,43.631370,-79.519006,1
1541,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Kingsway Boxing Club,Gym,43.627254,-79.526684,1


#### Now, we can have a look at the Borough-Neighborhood-Venue Category-Count combinations off our dataset with the following command. 

This one will show us which Venue Category how often shows up in the respective neighborhood and enables us to gain a better overview

In [569]:
venu_tor = tor_venu.groupby(['Borough','Neighbourhood','Venue_Category']).sum()
venu_tor

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Neighborhood-Latitude,Neighborhood-Longitude,Venue_Latitude,Venue_Longitude,Count
Borough,Neighbourhood,Venue_Category,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Central Toronto,Davisville,Brewery,43.704324,-79.388790,43.707991,-79.389943,1
Central Toronto,Davisville,Café,87.408649,-158.777580,87.407441,-158.776806,2
Central Toronto,Davisville,Coffee Shop,87.408649,-158.777580,87.410267,-158.778207,2
Central Toronto,Davisville,Dessert Shop,131.112973,-238.166370,131.115853,-238.166817,3
Central Toronto,Davisville,Diner,43.704324,-79.388790,43.702103,-79.387618,1
...,...,...,...,...,...,...,...
York,"Runnymede, The Junction North",Brewery,43.673185,-79.487262,43.669903,-79.483430,1
York,"Runnymede, The Junction North",Caribbean Restaurant,43.673185,-79.487262,43.672725,-79.482508,1
York,"Runnymede, The Junction North",Convenience Store,43.673185,-79.487262,43.672352,-79.492571,1
York,Weston,Convenience Store,43.706876,-79.518188,43.704486,-79.515789,1


#### Now, we can create a dummy to account for unobserved differences between Toronto's boroughs. Although this does solely control for a small portion of the existing heterogeneity, it still can give us a better predictability. 

In [570]:
venu_tor.reset_index(inplace = True)
dummy_bor = pd.get_dummies(venu_tor["Borough"])
venu_tor = pd.concat([venu_tor, dummy_bor], axis=1)
venu_tor.drop("Borough", axis=1, inplace=True)

In [571]:
venu_tor.tail(50)

Unnamed: 0,Neighbourhood,Venue_Category,Neighborhood-Latitude,Neighborhood-Longitude,Venue_Latitude,Venue_Longitude,Count,Central Toronto,Downtown Toronto,East Toronto,East York,Etobicoke,Mississauga,North York,Scarborough,West Toronto,York
1194,"Parkdale, Roncesvalles",Eastern European Restaurant,43.64896,-79.456325,43.649796,-79.45031,1,0,0,0,0,0,0,0,0,1,0
1195,"Parkdale, Roncesvalles",Gift Shop,87.297919,-158.91265,87.301592,-158.901501,2,0,0,0,0,0,0,0,0,1,0
1196,"Parkdale, Roncesvalles",Italian Restaurant,43.64896,-79.456325,43.649235,-79.450229,1,0,0,0,0,0,0,0,0,1,0
1197,"Parkdale, Roncesvalles",Movie Theater,43.64896,-79.456325,43.651112,-79.450961,1,0,0,0,0,0,0,0,0,1,0
1198,"Parkdale, Roncesvalles",Restaurant,43.64896,-79.456325,43.650688,-79.450685,1,0,0,0,0,0,0,0,0,1,0
1199,"Runnymede, Swansea",Bank,43.651571,-79.48445,43.650142,-79.480274,1,0,0,0,0,0,0,0,0,1,0
1200,"Runnymede, Swansea",Bar,43.651571,-79.48445,43.649533,-79.483056,1,0,0,0,0,0,0,0,0,1,0
1201,"Runnymede, Swansea",Bookstore,43.651571,-79.48445,43.650211,-79.48122,1,0,0,0,0,0,0,0,0,1,0
1202,"Runnymede, Swansea",Boutique,43.651571,-79.48445,43.650398,-79.479931,1,0,0,0,0,0,0,0,0,1,0
1203,"Runnymede, Swansea",Burrito Place,43.651571,-79.48445,43.649779,-79.482894,1,0,0,0,0,0,0,0,0,1,0


Now that we have our borough indicators, it's time to see which neighborhoods offer what kind of venue. This requires some formal steps, which are administered below. 

#### What we want is a column for each neighborhood that indicates all available venues! 

Consequently, we are required to create a column for each venue category and then sum each row per respective neighborhood up to retrieve the indicator number for each neighborhood that shows us which categories are offered and which are not, according to our search request we made earlier. 

#### We first need to define a group index number. This is because Python is sometimes not eager to make a groupby-sum combination where we group by a string. 

In [572]:
venu_tor['GrpIdx'] = venu_tor['Neighbourhood'].rank(method='dense').astype(int)
venu_tor.sort_values("Neighbourhood", inplace = True)

#### Next, we create a dummy variable for each venue category and combine both dataframes

In [573]:
dummy_cat = pd.get_dummies(venu_tor["Venue_Category"])
v_t = pd.concat([venu_tor, dummy_cat], axis=1)

#### As we are interested in the indication for each category per neighborhood, we cannot include factors such as longitude / latitude etc. since they cannot be displayed in a summarized dataframe. As a consequence, we slice the dataframe to obtain the columns we want. Further, we define the total_venues column which displays the total number of indicated venues per neighborhood. This has two purposes. Initially, we can check if the number of neighborhoods matches. Secondly, we can see how many venues the neighborhood indicates, which was previously not possible without further calculation. 

In [574]:
t = v_t.iloc[:,0]
u = v_t.iloc[:,17:]
v = pd.concat([t, u], axis=1)
w = v.groupby(["GrpIdx"]).sum()
w["total_venues"] = w.sum(axis = 1)
w = w.groupby('GrpIdx').mean().reset_index()
w.tail(50)




Unnamed: 0,GrpIdx,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio,total_venues
44,45,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
45,46,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,8
46,47,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
47,48,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,27
48,49,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,1,0,0,1,31
49,50,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
50,51,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
51,52,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,14
52,53,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
53,54,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,14


#### These lines of code serve such that we can add the neighborhood names back to the dataframe

In [575]:
t.drop_duplicates(inplace=True)
t = pd.DataFrame(t)
t['C'] = t.reset_index().index
t["GrpIdx"] = t["C"] + 1
t.drop("C", inplace = True, axis = 1)
venue_count_tor = pd.merge(w,t, on = "GrpIdx")
venue_count_tor.drop(["total_venues", "GrpIdx"], axis = 1, inplace = True)

In [576]:
venue_count_tor
fixed_columns = [venue_count_tor.columns[-1]] + list(venue_count_tor.columns[:-1])
venue_count_tor = venue_count_tor[fixed_columns]
venue_count_tor

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Alderwood, Long Branch",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Bathurst Manor, Wilson Heights, Downsview North",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Bedford Park, Lawrence Manor East",0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,"Willowdale, Willowdale East",0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
90,"Willowdale, Willowdale West",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
91,Woburn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
92,Woodbine Heights,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Next, it would be interesting to understand which venues are most commonly represented in which neighborhood: 

To do so, we define a function that extracts the 5 most common categories and display it:

In [577]:
num_top_venues = 5

for hood in venue_count_tor['Neighbourhood']:
    print("----"+hood+"----")
    temp = venue_count_tor[venue_count_tor['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0               Skating Rink   1.0
1  Latin American Restaurant   1.0
2             Clothing Store   1.0
3             Breakfast Spot   1.0
4                     Lounge   1.0


----Alderwood, Long Branch----
          venue  freq
0      Pharmacy   1.0
1           Gym   1.0
2   Pizza Place   1.0
3   Coffee Shop   1.0
4  Skating Rink   1.0


----Bathurst Manor, Wilson Heights, Downsview North----
                 venue  freq
0          Pizza Place   1.0
1  Fried Chicken Joint   1.0
2             Pharmacy   1.0
3          Coffee Shop   1.0
4        Deli / Bodega   1.0


----Bayview Village----
                 venue  freq
0   Chinese Restaurant   1.0
1                 Café   1.0
2                 Bank   1.0
3  Japanese Restaurant   1.0
4    Accessories Store   0.0


----Bedford Park, Lawrence Manor East----
               venue  freq
0        Pizza Place   1.0
1  Indian Restaurant   1.0
2           Pharmacy   1.0
3     Cosmetics Shop   

Then, we define a function that puts these venue categories back into a new dataframe. The steps are the following:


- First, we define a function that takes the first row and sorts the rows according to their neighborhood name. 
- Then, we define again the five most common venues and create an empty dataframe with the columns described and 
  the pre-defined structure of all inidividual neighborhoods. 
- Third, we take a for loop and iterate through each row, giving us the five most common values until we reached 
  five (applicable in the for function). Once we reached five, we tell the program to jump to the next line, untill 
  we reached all 93 neighborhoods

In [578]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [579]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = venue_count_tor['Neighbourhood']

for ind in np.arange(venue_count_tor.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(venue_count_tor.iloc[ind, 1:], num_top_venues)

neighborhoods_venues_sorted.head()

toronto_venues = neighborhoods_venues_sorted


### k-means clustering

#### Now, with the dataset in the shape we want it to be, we can finally start clustering 

In [580]:
from sklearn.cluster import KMeans 
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler

#### We first need to normalize the dataset for identical portability and comparability:

In [581]:
x = venue_count_tor.values[:,3:]
x = np.nan_to_num(x)
clustered = StandardScaler().fit_transform(x)
clustered

array([[-0.10369517, -0.10369517, -0.10369517, ..., -0.10369517,
        -0.10369517, -0.40061681],
       [-0.10369517, -0.10369517, -0.10369517, ..., -0.10369517,
        -0.10369517, -0.40061681],
       [-0.10369517, -0.10369517, -0.10369517, ..., -0.10369517,
        -0.10369517, -0.40061681],
       ...,
       [-0.10369517, -0.10369517, -0.10369517, ..., -0.10369517,
        -0.10369517, -0.40061681],
       [-0.10369517, -0.10369517, -0.10369517, ..., -0.10369517,
        -0.10369517, -0.40061681],
       [-0.10369517, -0.10369517, -0.10369517, ..., -0.10369517,
        -0.10369517, -0.40061681]])

Next, we start modeling: 

Here we define: 

- 4 clusters
- 12 iterations (12 times the mean value is researched)
    

In [582]:
num_of_clstr = 4

Then, we assign the mean values of each cluster according to the distance of each data point to its anchor point and iterate this process 12 times for each data point combination of the customers. 

This will return us the labels, which indicate to which cluster each customer belongs, minimising the SSE. 

In [583]:
k_means = KMeans(init = "k-means++", n_clusters = num_of_clstr, n_init = 12)
k_means.fit(clustered)
labels = k_means.labels_

toronto_venues["cluster"] = labels
labels

array([1, 1, 1, 1, 0, 3, 1, 1, 1, 2, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 1, 0, 1, 1, 1, 3, 0, 1, 3, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1], dtype=int32)

In [584]:
toronto_venues.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,cluster
0,Agincourt,Latin American Restaurant,Skating Rink,Clothing Store,Lounge,Breakfast Spot,1
1,"Alderwood, Long Branch",Coffee Shop,Pharmacy,Sandwich Place,Pub,Pizza Place,1
2,"Bathurst Manor, Wilson Heights, Downsview North",Gas Station,Sushi Restaurant,Park,Pharmacy,Pizza Place,1
3,Bayview Village,Chinese Restaurant,Japanese Restaurant,Café,Bank,Deli / Bodega,1
4,"Bedford Park, Lawrence Manor East",Thai Restaurant,Italian Restaurant,Pub,Butcher,Fast Food Restaurant,0


#### Now, we finally have the dataset clustered and portraying the five most common values

What we can still do is to graphically visualize the individual clusters. 

To do so, I decided to take publicly available data from the Canadian public web portal. It allows one to download the area codes as well as geographic indicators (lat, long) for each neighborhood in Canada. 

However, one is required to construct a geo-json file from the output. The exact steps for the usage can be found here: 

    https://medium.com/dataexplorations/generating-geojson-file-for-toronto-fsas-9b478a059f04

Basically, one has to download the QGIS application (a geographic tool which allows you to transform geographic indicators into figures) and then, by following the steps above, transform it into a GEOJSON file and upload said file into the notebook. 

Then, we are required to merge the above dataframe with the postal code dataframe, in order to receive a column in which all Postal Codes of Toronto's neighborhoods are indicated. With the respective column we can then link the dataframe with our GEOJSON file and, finally, draw a map which shows the different layers for our clusters. 

However, using the QGIS application requires some workaround, since we need to select each particular Postal Code in Toronto either individually in the program (which amounts for nearly 100 searches as not only the Postal Codes of Toronto, but entire Ontario are given) or defining a workaround. 

My workaround consisted of exporting the postal code csv file constructed earlier and then modifying the Postal Codes such that the program is able to read it accordingly. 

Although in the link above each step is visualized, I am providing the exact Postal Code assembly below: 

"CFSAUID" IN ('M3A',	'M4A',	'M5A',	'M6A',	'M7A',	'M1B',	'M3B',	'M4B',	'M5B',	'M6B',	'M1C',	'M3C',	'M4C',	'M5C',	'M6C',	'M9C',	'M1E',	'M4E',	'M5E',	'M6E',	'M1G',	'M4G',	'M5G',	'M6G',	'M1H',	'M2H',	'M3H',	'M4H',	'M5H',	'M6H',	'M1J',	'M2J',	'M3J',	'M4J',	'M5J',	'M6J',	'M1K',	'M2K',	'M3K',	'M4K',	'M5K',	'M6K',	'M1L',	'M3L',	'M4L',	'M5L',	'M6L',	'M9L',	'M1M',	'M3M',	'M4M',	'M5M',	'M6M',	'M9M',	'M1N',	'M2N',	'M3N',	'M4N',	'M5N',	'M6N',	'M9N',	'M1P',	'M2P',	'M4P',	'M5P',	'M6P',	'M9P',	'M1R',	'M2R',	'M4R',	'M5R',	'M6R',	'M7R',	'M9R',	'M1S',	'M4S',	'M5S',	'M6S',	'M1T',	'M4T',	'M5T',	'M1V',	'M4V',	'M5V',	'M8V',	'M9V',	'M1W',	'M4W',	'M5W',	'M8W',	'M9W',	'M4X',	'M5X',	'M8X',	'M4Y',	'M7Y',	'M8Y',	'M8Z')

Basically, just copy-paste it into the respective tab in the program and it will automatically deliver the appropriate map of Toronto with each neighborhood indicated. 

In [585]:
ontario_geo = "/Users/nikolas.anic/Desktop/ML/json_neighbor.geojson"

toronto_venues = pd.merge(toronto_venues, toronto_df,  on = "Neighbourhood")

In [586]:
toronto_venues

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,cluster,Postal Code,Borough
0,Agincourt,Latin American Restaurant,Skating Rink,Clothing Store,Lounge,Breakfast Spot,1,M1S,Scarborough
1,"Alderwood, Long Branch",Coffee Shop,Pharmacy,Sandwich Place,Pub,Pizza Place,1,M8W,Etobicoke
2,"Bathurst Manor, Wilson Heights, Downsview North",Gas Station,Sushi Restaurant,Park,Pharmacy,Pizza Place,1,M3H,North York
3,Bayview Village,Chinese Restaurant,Japanese Restaurant,Café,Bank,Deli / Bodega,1,M2K,North York
4,"Bedford Park, Lawrence Manor East",Thai Restaurant,Italian Restaurant,Pub,Butcher,Fast Food Restaurant,0,M5M,North York
...,...,...,...,...,...,...,...,...,...
93,"Willowdale, Willowdale East",Japanese Restaurant,Ramen Restaurant,Plaza,Steakhouse,Electronics Store,0,M2N,North York
94,"Willowdale, Willowdale West",Coffee Shop,Pharmacy,Bank,Pizza Place,Butcher,1,M2R,North York
95,Woburn,Coffee Shop,Korean Restaurant,Mexican Restaurant,Yoga Studio,Curling Ice,1,M1G,Scarborough
96,Woodbine Heights,Skating Rink,Pharmacy,Curling Ice,Athletics & Sports,Beer Store,1,M4C,East York


In [587]:
toronto_map = folium.Map(location = [latitude_tor, longitude_tor], zoom_start = 11)

toronto_map.choropleth(
    geo_data=ontario_geo,
    data=toronto_venues,
    columns=['Postal Code', 'cluster'],
    key_on='feature.properties.CFSAUID',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Clusters',
    reset=True
)

toronto_map

In [590]:
! pwd

/Users/nikolas.anic


#### Lastly, we can analyze the individual clusters to get an overview which cluster owns which neighborhood. Let's do this just for cluster 3: 

In [588]:
toronto_venues.loc[toronto_venues["cluster"] == 0,:]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,cluster,Postal Code,Borough
4,"Bedford Park, Lawrence Manor East",Thai Restaurant,Italian Restaurant,Pub,Butcher,Fast Food Restaurant,0,M5M,North York
13,Central Bay Street,Yoga Studio,Bar,Spa,Discount Store,Japanese Restaurant,0,M5G,Downtown Toronto
15,Church and Wellesley,Yoga Studio,Burrito Place,Smoke Shop,Ice Cream Shop,Indian Restaurant,0,M4Y,Downtown Toronto
18,"Commerce Court, Victoria Hotel",Bakery,Tea Room,Sandwich Place,Japanese Restaurant,Seafood Restaurant,0,M5L,Downtown Toronto
19,Davisville,Gas Station,Brewery,Indian Restaurant,Italian Restaurant,Seafood Restaurant,0,M4S,Central Toronto
22,Don Mills,Sandwich Place,Baseball Field,Supermarket,Café,Sporting Goods Shop,0,M3B,North York
23,Don Mills,Sandwich Place,Baseball Field,Supermarket,Café,Sporting Goods Shop,0,M3C,North York
32,"Fairview, Henry Farm, Oriole",Movie Theater,Liquor Store,Fast Food Restaurant,Bakery,Bank,0,M2J,North York
33,"First Canadian Place, Underground city",Steakhouse,Tea Room,Deli / Bodega,Pizza Place,Speakeasy,0,M5X,Downtown Toronto
35,"Garden District, Ryerson",Diner,Burrito Place,Spa,Hotel,Shopping Mall,0,M5B,Downtown Toronto
