<h1>Segmenting and clustering neighborhoods in Toronto </h1>

<h2>Part 1 - Scraping a Wikipedia page</h2>

<p>In Canada, postal codes beginning with "M" are located within the city of Toronto in the province of Ontario. We will scrape the Wikipedia page <a href= "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a> for data in the table of postal codes. We will create a <i>pandas</i> dataframe of this data.</p>
<p>
<ul>
    <li>This dataframe will consist of three columns, namely <b>PostalCode</b>, <b>Borough</b>, and <b>Neighborhood</b>.</li>
    <li>The rows containing "Not assigned" in the borough field will be ignored.</li>
    <li>When more than one neighborhood exists for a postal code area, the neighborhoods should be listed together in the <b>Neighborhood</b> column.</li>
    <li>If a row contains an assigned borough, but no assigned neighborhood, the neighborhood should be the same as the borough.</li>
</ul>
</p>

<h3>Step 1: Downloading required dependencies</h3>
<p>This will install all libraries we will use for Part 1.</p>

In [1]:
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes
!pip install beautifulsoup4 #Beautiful Soup 4
!pip install lxml #lmxl parser
!pip install request #request library
from bs4 import BeautifulSoup
import requests
import csv
print('Libraries imported.')

Libraries imported.


<h3>Step 2: Creating the raw dataframe</h3>
<p>Although Beautiful Soup was attempted to solve this problem, it did not work to iterate through the table. The pandas <i>read_html</i> method was used instead.</p>
<p>
<ul>
    <li>
    Using the pandas <i>read_html</i> method and the url of the Wikipedia page, a list of dataframes is created. These dataframes represent the tables in the html file. As the first table is the one of interest, list[0] will return the dataframe we would like to work with.</li>
    <li>The column heading <b>Postal code</b> is renamed to <b>PostalCode</b>.</li>
</ul>
</p>

In [2]:
#creating a new dataframe from a list
list_ = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',header=0)

#first value of the list
df1 = list_[0]

#Postal code is renamed to PostalCode
df1.rename(columns={"Postal code": "PostalCode"},inplace = True)
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


<h3>Step 3: Assigning blank neighborhood cells to borough cell values</h3>
<p>It is necessary to assign any blank <b>Neighborhood </b>cell values to the value of the <b>Borough</b> cell in the same row.</p>
<ul>
    <li>Firstly, the NaN values in the table are changed to a blank string value in a new dataframe, df2.</li>
    <li>Secondly, the <b>Neighborhood</b> cells containing a blank string are replaced with the <b>Borough</b> cell value in the same row using a function. This function iterates along the length of the index of the df2.</li>
    </ul>
    </p>

In [3]:
#Changing NaN cells to blank cells
df2 = df1.fillna(" ")

#Assigning blank neighbourhoods to borough names
for i in range(len(df2)):
    if df2.loc[i,'Neighborhood'] == " ":
        df2.loc[i,'Neighborhood'] = df2.loc[i,'Borough']
    else:
        pass
df2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


<h3>Step 4: Removing rows with unassigned boroughs</h3>
<p>
    Now that there are no blank <b>Neighbourhood</b> cells, the entries in the table with unassigned <b>Borough</b> cells must be removed. An unintended consequence of step 3 was that rows with blank <b>Neighborhood</b> values and unassigned <b>Borough</b> values now have "Not assigned" as both the <b>Borough</b> and <b>Neighborhood</b> value. Luckily, this can be fixed in the next step of the data clean up.
</p>
<p>
    <ul>
        <li>First we create a new dataframe, df3, and we populate it with only the rows in df2 that <u>do not</u> contain "Not assigned" in the <b>Borough</b> field.</li>
        <li>The index values from df2 remain and a new index for df3 must be created. A new dataframe, df4, is created and assigned as df3 with the index reset and the old index dropped.</li>
        </ul>
        </p>

In [4]:
#Removing rows where borough = "Not assigned"
df3 = df2[df2['Borough']!="Not assigned"]

#reset index
df4 = df3.reset_index(drop = True)
df4.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


<h3>Step 5: Changing forward slashes to commas in the neighborhoods field</h3>
<p>Now that this is an almost completely cleaned dataframe, some finishing touches must be added. Some postal codes contain multiple neighborhoods, which are listed in the dataframe and separated with forward slashes. The forward slashes in the <b>Neighborhood</b> field must now be changed to commas.</p>
<p>
    <ul>
        <li>First we create a new dataframe, df5, a copy of df4.</li>
        <li>Next, using a function iterating along the length of the index of df4, the <b>Neighborhood</b> field cells are reassigned to be the <b>Neighborhood</b> cells in df4 with the forward slash replaced with a comma. This is acheived using the .replace() method.</li>
        </ul>
        </p>

In [5]:
df5 = df4


#Replacing / with ,
for i in range(len(df4)):
    df5.loc[i,'Neighborhood'] = df4.loc[i,'Neighborhood'].replace(" /",",")
    
df5.head()


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


<h3>Step 6: Counting the number of rows</h3>
<p>The number of rows in the dataframe, df5, can be counted using the .shape method and requesting the value of the first index, 0.</p>

In [6]:
shape = df5.shape
print("Number of rows =",shape[0])

Number of rows = 103


<h2>Part 2: Obtaining geographical coordinates for postal codes</h2>
<p>Now that the dataframe contains the postal code of each neighborhood in Toronto, as well as the borough and neighborhood name, we would like the geographical coordinates for each postal code. This will allow us to make use of Foursquare location data to cluster the neighborhoods based on venues in the area.</p>

<h3>Step 1: Downloading required CSV file</h3>
<p>This will import the CSV file of latitudes and longitudes for each postal code.</p>

In [7]:
df_latlong = pd.read_csv("https://cocl.us/Geospatial_data")
df_latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h3>Step 2: Sorting the main dataframe by PostalCode</h3>
<p>
    <ul>
        <li>The main dataframe, df5, is sorted by <b>PostalCode</b> with inplace = True to retain the values in this order.</li>
        <li>A new dataframe, df6, is copied from df5. The index of df6 is the reset index of df5.</li>
    </ul>
</p>

In [8]:
#Sort df5 dataframe
df5.sort_values(by = "PostalCode", inplace = True)

#reset index
df6 = df5.reset_index(drop = True)
df6.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


<h3>Step 3: Sorting the coordinated dataframe by Postal Code</h3>
<p>
    <ul>
        <li>The df_latlong dataframe is sorted by <b>Postal Code</b> with inplace = True to retain the values in this order.</li>
        <li>A new dataframe, df_latlong2, is copied from df_latlong. The index of df_latlong2 is the reset index of df_latlong.</li>
    </ul>
</p>

In [9]:
#Sort df_latlong dataframe
df_latlong.sort_values(by = "Postal Code",inplace = True)

#reset index
df_latlong2 = df_latlong.reset_index(drop = True)
df_latlong2.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h3>Step 4: Merging the two dataframes</h3>
<p>
    <ul>
        <li>First, <b>Postal Code</b> in the df_latlong2 dataframe is renamed to <b>PostalCode</b>. This is done because, to use the pandas.merge() function, the column names are used as keys and must be identical.</li>
        <li>Next, the pandas.merge() function is used to merge the two dataframes based on <b>PostalCode</b> as a key.</li>
    </ul>
</p>

In [10]:
#First rename Postal Code to PostalCode to use as same key
df_latlong2.rename(columns={"Postal Code": "PostalCode"},inplace = True)
#Merging df6 and df_latlong3 on PostalCode as key
df_data = pd.merge(df6,df_latlong2,on="PostalCode")
df_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


<h2>Part 3: Exploring and clustering neighbourhoods in Toronto</h2>
<p>We would like to explore and cluster the neighborhoods of Toronto. This will be done by using the Foursquare API to find the most popular venues in each neighborhood. These venues will be categorised on type and ranked by the ten most common venue types in a neighborhood. This data will then be used with <i>k-means</i> analysis to cluster the neighborhoods into 5 groups.</p>

<h3>Step 1: Downloading required dependencies</h3>
<p>This will install all libraries we will use for Part 3.</p>

In [11]:
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


<h3>Step 2: Checking the number of boroughs and neighborhoods</h3>
<p>We would like to find out the number of boroughs and neighborhoods in Toronto in our dataset.
    <ul>
        <li>The .unique() function is passsed over the <b>Borough</b> column. The length of this list is calculated to find the number of boroughs in our dataset. </li>
        <li>The same set of functions is passed over the <b>Neighborhood</b> column to find the number of neighborhoods in the dataset. A caveat of this is that the number of unique lists of neighborhoods in the dataset is calculated, as each postal code can be associated with multiple neighborhoods.</li>
        </ul>
        </p>

In [12]:
#How many boroughs and neighbourhoods are there?
print("The dataframe has {} boroughs and {} lists of neighbourhoods.".format(
    len(df_data["Borough"].unique()),
    len(df_data["Neighborhood"].unique())
    )
)

The dataframe has 10 boroughs and 98 lists of neighbourhoods.


<h3>Step 3: Finding the latitude and longitude of Toronto for a map</h3>
    <p>To generate a map of Toronto using the Folium package, we need the geographical co-ordinates for the centre of the city. We use the GeoPy package to request the latitude and longitude of Toronto.</p>

In [13]:
#Use geopy library to get coordinates of Toronto for a map
address = "Toronto, Ontario"
geolocator = Nominatim(user_agent = "toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("The geographical coordinates of Toronto are {}, {}.".format(latitude,longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


<h3>Step 4: Create a map of Toronto and the locations of postal codes</h3>
    <p>We would like to generate a map centred on Toronto that shows the location of the postal codes of the city.
    <ul>
        <li>Using the Folium package, we centre a map on Toronto using the latitude and longitude we previously generated in step 3. We set the zoom_start level to 10.</li>
        <li>Using the latitude and longitude of the postal codes, fetched from a CSV file in part 2 of the notebook, we generate markers on the Folium map of the locations of Toronto postal codes. By clicking on the markers, you can see the name of the neighborhood(s) associated with the postal code.</li>
        </ul>
    </p>

In [14]:
#Create a map of Toronto with postal codes
map_toronto = folium.Map(location = [latitude,longitude], zoom_start = 10)

for lat, lng, borough, neighborhood in zip(df_data["Latitude"],df_data["Longitude"],df_data["Borough"],df_data["Neighborhood"]):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 3,
        popup = label,
        color = "#8B0000",
        fill = True,
        fill_color = "#DC143C",
        foll_opacity = 0.7,
        parse_html = False).add_to(map_toronto)
map_toronto

<h3>Step 5: Defining Foursquare credentials and version</h3>
    <p>As we will be using the Foursquare Places API to generate a list of the most popular venues in a neighborhood and the type of the venue, we will need to input our Foursquare credentials for the GET explore request.</p>

In [15]:
CLIENT_ID = '0OYWUO1KIOBY4TMTOIAHFQMQPXKWH05S52HMPMZ5E2YB2FV4'
CLIENT_SECRET = 'YVZW53FC1D2AKGOBVXBMKKSXKDH4NEUOAI0JBAUIVTSTNEJY'
VERSION = '20180605'

<h3>Step 6: Testing Foursquare GET request for popular venues</h3>
<p>
    We will test requesting popular venues in a neighborhood using the Foursquare Places API. We will use the first entry in our dataset, Malvern and Rouge in M1B, to do this.
    <ul>
        <li>First we define the latitude and longitude of the neighborhood from our dataframe. This is then printed out.</li>
        <li>We define the radius for our Foursqare GET explore request as 500m and the limit of venues to fetch as 100 venues.</li>
        <li>We define the url of the explore GET request as the variable url.</li>
        <li>We send the GET request and print out the json file results.</li>
        </ul>
        </p>

In [16]:
#Test latitude and longitude of postal code - Step 6, part 1
neighborhood_latitude = df_data.loc[0,"Latitude"]
neighborhood_longitude = df_data.loc[0,"Longitude"]
print("For ({}), Latitude = {}, Longitude = {}".format(df_data.loc[0,"Neighborhood"],neighborhood_latitude,neighborhood_longitude))

#url to get top 10 venues in a postal code within a radius of 500m
radius = 500
LIMIT = 100

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)




For (Malvern, Rouge), Latitude = 43.806686299999996, Longitude = -79.19435340000001


In [17]:
#Test url on test postal code - Step 6, part 2
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5eae9c3b216785001becde23'},
  'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 2,
  'suggestedBounds': {'ne': {'lat': 43.8111863045, 'lng': -79.18812958073042},
   'sw': {'lat': 43.80218629549999, 'lng': -79.2005772192696}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bb6b9446edc76b0d771311c',
       'name': 'Wendy’s',
       'location': {'crossStreet': 'Morningside & Sheppard',
        'lat': 43.80744841934756,
        'lng': -79.19905558052072,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'lng': -79.19905558052072}],
        'distance': 387,
        'cc': 'CA',
        'city': 'Toronto',
    

<h3>Step 7: Extracting information from the json file and structuring a dataframe</h3>
<p><ul>
    <li>We know that the information we are looking for is in the <i>items</i> key of the json file. We will use a function <b>get_category_type</b>, to extract the category of the venue. </li>
    <li>We clean the json file and structure it into a pandas dataframe.</li>
    </ul></p>

In [18]:
#function to extract the category of a venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
#Clean json and structure it into a dataframe
venues = results["response"]["groups"][0]["items"]

#flatten json
nearby_venues = pd.json_normalize(venues)
nearby_venues
#filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

#filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

#clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy’s,Fast Food Restaurant,43.807448,-79.199056
1,Interprovincial Group,Print Shop,43.80563,-79.200378


<h3>Step 8: Repeat the get_category_type function for all neighborhoods in the dataset</h3>
<p>First we define the function, <b>getNearbyVenues</b> that uses the latitude, longitude and a radius to find the top 100 venues for a given neighborhood.
    <ul>
        <li>We make use of the same Foursquare GET request we used previously. We also extract the venues information from the resulting json file and create a dataframe containing this information.</li>
        <li>We then run the function on the neighborhoods in our dataset. This returns the top 100 venues for each neighbourhood as well as their venue category.</li>
        </ul>
        </p>
    

In [19]:
#Function to repeat process for all postal codes in Toronto - Step 8, part 1
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [20]:
#Run the above function on each postal code and create a new dataframe called toronto_venues - Step 8, part 2
toronto_venues = getNearbyVenues(names = df_data["Neighborhood"], latitudes = df_data["Latitude"], longitudes = df_data["Longitude"])

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
York Mills, Silver Hills
Willowdale, Newtonbrook
Willowdale
York Mills West
Willowdale
Parkwoods
Don Mills
Don Mills
Bathurst Manor, Wilson Heights, Downsview North
Northwood Park, York University
Downsview
Downsview
Downsview
Downsview
Victoria Village
Parkview Hill, Woodbine Gardens
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence P

<h3>Step 9: How large is the resulting dataframe?</h3>
<p>We use the .shape function to see that the resulting dataframe size is 2151 rows and 7 columns. We display the first 5 rows of the dataframe, toronto_venues, here.

In [21]:
print(toronto_venues.shape)
toronto_venues.head()

(2151, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Malvern, Rouge",43.806686,-79.194353,Interprovincial Group,43.80563,-79.200378,Print Shop
2,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,SEBS Engineering Inc. (Sustainable Energy and ...,43.782371,-79.15682,Construction & Landscaping
4,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Scarborough Historical Society,43.788755,-79.162438,History Museum


<h3>Step 10: One hot encoding of venues types for each neighborhood</h3>
<p>In order to analyse and cluster the neighborhoods by most popular venue types, we one hot encode the data. This assigns a binary variable (0 or 1) for each unique venue category.
    <ul>
        <li>We apply one hot encoding using the pandas .get_dummies() function on the toronto_venues dataframe. This creates the toronto_onehot dataframe.</li>
        <li>We then add the <b>Neighborhood</b> column back to the dataframe using the .insert() function.</li>
        <li>Unfortunately, one of the venue types is "neighborhood", which will wreak havoc on the rest of our analysis. We will therefore remove the "neighborhood" venues type from the toronto_onehot dataframe.</li>
        </ul>
        </p>
    

In [22]:
#one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

#add neighborhood column back to dataframe
toronto_onehot.insert(0, "Neighborhood", toronto_venues['Neighborhood'] , True) 

#Remove "Neighborhood" venue type
toronto_onehot = toronto_onehot.loc[:,~toronto_onehot.columns.duplicated()]
toronto_onehot.head(5)

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<h3>Step 11: Group rows by neighborhood and calculate frequency of venue category occurance</h3>
<p>We group the toronto_onehot dataset by <b>Neighborhood</b> and calculate the mean of the frequency of each venue category's occurence. We create a new dataframe, toronto_grouped, from this.</p>

In [23]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<h3> Step 12: Sort venues in descending order</h3>
<p>We sort the venues categories in each neighborhood to top 10 venue types. We then create a new dataframe, neighborhoods_venues_sorted, from this. This dataframe contains the 10 most common venue types in each neighborhood, listed by name in each column. This is printed out at the end.</p>

In [24]:
#Function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#Create new dataframe to display top ten venues for each neighborhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe neighborhoods_venues_sorted
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] =toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Skating Rink,Latin American Restaurant,Breakfast Spot,Clothing Store,Yoga Studio,Drugstore,Distribution Center,Dog Run,Doner Restaurant
1,"Alderwood, Long Branch",Pizza Place,Sandwich Place,Skating Rink,Pharmacy,Gym,Dance Studio,Coffee Shop,Athletics & Sports,Pub,Dim Sum Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Fried Chicken Joint,Bridal Shop,Sandwich Place,Diner,Restaurant,Middle Eastern Restaurant,Supermarket,Sushi Restaurant
3,Bayview Village,Café,Bank,Chinese Restaurant,Japanese Restaurant,Yoga Studio,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Restaurant,Coffee Shop,Sushi Restaurant,Greek Restaurant,Thai Restaurant,Liquor Store,Comfort Food Restaurant,Juice Bar


<h3>Step 13: Cluster the neighborhoods into 5 clusters using k-means analysis</h3>
<p>
    <ul>
        <li>First we define 5 clusters to begin with.</li>
        <li>Next, we drop the <b>Neighborhood</b> column from the toronto_grouped dataframe. We create a new dataframe, toronto_grouped_clustering, from this.</li>
        <li>We run a Kmeans() clustering function on the toronto_grouped_clustering dataframe.</li>
        <li>We create a new dataframe, neighborhoods_venues_sorted, that inserts the kmeans labels as <b>Cluster Labels</b>. This dataframe contains the clusters as well as the top 10 venue categories for each neighborhood.</li>
        <li>We display this new dataframe.</li>
        </ul>
        </p>

In [25]:
#Run k-means to cluster the neighborhood into 5 clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

#Show neighborhoods_venues_sorted
neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Agincourt,Lounge,Skating Rink,Latin American Restaurant,Breakfast Spot,Clothing Store,Yoga Studio,Drugstore,Distribution Center,Dog Run,Doner Restaurant
1,0,"Alderwood, Long Branch",Pizza Place,Sandwich Place,Skating Rink,Pharmacy,Gym,Dance Studio,Coffee Shop,Athletics & Sports,Pub,Dim Sum Restaurant
2,0,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Fried Chicken Joint,Bridal Shop,Sandwich Place,Diner,Restaurant,Middle Eastern Restaurant,Supermarket,Sushi Restaurant
3,0,Bayview Village,Café,Bank,Chinese Restaurant,Japanese Restaurant,Yoga Studio,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
4,0,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Restaurant,Coffee Shop,Sushi Restaurant,Greek Restaurant,Thai Restaurant,Liquor Store,Comfort Food Restaurant,Juice Bar


<h3>Step 14: Merge the df_data and neighborhoods_venues_sorted dataframes</h3>
<p>The df_data and neighborhoods_venues_sorted dataframes are merged into the toronto_merged dataframe. This dataframe will be used to show the clustering on a map of Toronto. We remove all the rows that contain NaN as the cluster label and venue categories. These are as a result of some neighborhoods in the original dataframe not returning venues in the GET request. A sample of the resulting dataframe is printed.</p>

In [26]:
toronto_merged = df_data

#merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

#Drop NaN cluster labels - some neighborhoods didn't have venues
toronto_merged = toronto_merged.dropna(subset = ["Cluster Labels"], axis = 0,inplace = False)
toronto_merged.head()


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,1.0,Fast Food Restaurant,Print Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Yoga Studio,Dessert Shop
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,1.0,Construction & Landscaping,History Museum,Bar,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1.0,Mexican Restaurant,Rental Car Location,Breakfast Spot,Medical Center,Electronics Store,Intersection,Bank,Doner Restaurant,Distribution Center,Dog Run
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,Coffee Shop,Korean Restaurant,Convenience Store,Yoga Studio,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1.0,Caribbean Restaurant,Fried Chicken Joint,Bank,Thai Restaurant,Athletics & Sports,Gas Station,Bakery,Hakka Restaurant,Eastern European Restaurant,Drugstore


<h3>Step 15: Visualing the neighborhood clusters</h3>
<p>We visualise the clusters as different coloured markers on a map of Toronto.
    <ul>
        <li>We create a Folium map centred on Toronto with a zoom_start of 10.</li>
        <li>We set the colour scheme for the clusters as a range of 5 colours in colour_array. This uses the rainbow() function of the cm library.</li>
        <li>We cast the <b>Cluster Labels</b> column type to integer so that the color assignment for our markers will work. This requires an integer or slice input.</li>
        <li>We add markers onto the map defined from the cluster label, the name of the neighborhood and its associated latitude and longitude.</li>
        <li>The clusters are grouped according to 5 colours. The clusters were determined based on the most common venue type in the neighborhoods.</li>
        </ul>
        </p>

In [27]:
#Visualise the cluster
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)
toronto_merged = toronto_merged.astype({"Cluster Labels": int})
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<h3>(Optional) Step 16: Examining the clusters</h3>
<p>Optionally, we can examine each cluster and determine the discriminating venue categories that lead to the resulting clustering. Change the cluster_number variable to examine each cluster.</p>
<p>From cluster 2, we can see the most popular venue types were playgrounds, parks and yoga studios.</p>

In [28]:
cluster_number = 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == cluster_number, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Scarborough,2,Playground,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
14,Scarborough,2,Playground,Park,Coffee Shop,Yoga Studio,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant
23,North York,2,Park,Convenience Store,Bank,Bar,Yoga Studio,Drugstore,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
25,North York,2,Food & Drink Shop,Park,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio
40,East York,2,Park,Coffee Shop,Convenience Store,Yoga Studio,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant
48,Central Toronto,2,Park,Yoga Studio,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant
50,Downtown Toronto,2,Park,Playground,Trail,Yoga Studio,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
74,York,2,Park,Women's Store,Spa,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Electronics Store
98,York,2,Park,Yoga Studio,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant
