<h1 align=center><font color=royalblue> Segmenting and Clustering Neighborhoods in Toronto </font></h1>

## Intro

In this notebook, we explore, segment, and cluster the neighborhoods of Toronto. The neighborhood data is available on [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). We scrape the postal code table from that page with the **BeautifulSoup** package and convert it to a pandas DataFrame. We then obtain the latitude and longitude of each postal code with the **pgeocode** Python package. After that, we use the **Foursquare API** to explore the neighborhoods and find the most common venue categories in each one, and we group the neighborhoods with the **k-means** clustering algorithm. Finally, we visualize the clusters on a map of Toronto with **folium**.

## Direct Links to Solutions

<div class="alert alert-block alert-success" style="margin-top: 20px"> 

<font size = 4>
    
1. <a href="#item1">Part 1</a><br><br>
    
2. <a href="#item2">Part 2</a><br><br>
    
3. <a href="#item3">Part 3</a><br><br>

</font>   

</div>

<a id='item1'></a>

## 1. Part 1

In this part, we extract the table from the Wikipedia page **List of postal codes of Canada: M** and convert it to a pandas DataFrame using the _beautifulsoup4_ package. Then we clean the data: replace 'Not assigned' entries with NaN, remove rows with NaN values in the Borough column, check for duplicate postal codes (merging them if any are found), and reset the index once the data wrangling is complete.

### Install all the important library packages

<div class="alert alert-block alert-info" style="margin-top: 20px">In interactive computing you often need to access the underlying shell, which Jupyter allows by prefixing a command with an exclamation mark (!, or "bang"). Pass "-y" to the conda install commands to answer yes to the install prompt, because you cannot submit input to a command while it is running.
</div>

In [1]:
# uncomment if installation required

#!conda install -c anaconda beautifulsoup4 -y
#!conda install -c anaconda lxml -y
#!conda install -c conda-forge geopy -y
#!pip install folium
#!pip install pgeocode
print('All libraries installed successfully.')

All libraries installed successfully.


### Now we will import required library packages

In [2]:
import pandas as pd # library for data manipulation and analysis
import numpy as np # library for multi-dimensional arrays and matrices
import urllib.request # the library we use to open URLs
from bs4 import BeautifulSoup # the BeautifulSoup library so we can parse HTML and XML documents
import requests # library to handle requests
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import pgeocode # library for high performance off-line querying of GPS coordinates
import folium #Map rendering library
from sklearn.cluster import KMeans #import k-means from clustering stage
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

Now we specify the URL of the Wikipedia page we want to scrape and use the urllib.request library to fetch it, storing the response in a variable called ‘page’. We then use Beautiful Soup to parse that HTML with the lxml parser and store the resulting parse tree in a new variable called ‘soup’.

In [3]:
# specify which URL/web page we are going to be scraping
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)
# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")
#print(soup.prettify()) #uncomment this line to view the webpage in HTML

Inspecting the elements of the Wikipedia page in the browser (F12 key) shows that the required data resides within an HTML table tag with class = 'wikitable sortable'.

In [4]:
table=soup.find('table', class_='wikitable sortable')
#table #uncomment this to see the table data

We know that the table is set up in rows (starting with <*tr*> tags) with the data sitting within <*td*> tags in each row. We aren’t too worried about the header row with the <*th*> elements as we know what each of the columns represent by looking at the table. There are three columns in our table that we want to scrape the data from so we will set up three empty lists (postal_code, borough & neighborhood) to store our data in.

To start with, we use the Beautiful Soup ‘find_all’ function and set it to look for the string ‘tr’. We then set up a FOR loop over the returned rows so that Python processes them one by one.

Within the loop we are going to use find_all again to search each row for <*td*> tags with the ‘td’ string. We will add all of these to a variable called ‘cells’ and then check to make sure that there are 3 items in our ‘cells’ array (i.e. one for each column).

If there are, we use the find(text=True) option to extract the content string from each <*td*> element in that row and append it to the lists we created at the start of this step.

In [5]:
postal_code=[]
borough=[]
neighborhood=[]

for row in table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==3:
        postal_code.append(cells[0].find(text=True))
        borough.append(cells[1].find(text=True))
        neighborhood.append(cells[2].find(text=True))
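
To try the row-parsing loop without a network connection, the same logic can be run against a small hand-written table. This is only a sketch: the HTML string below is a made-up sample rather than the real Wikipedia markup, it uses Python's built-in `html.parser` so that lxml is not required, and it calls `get_text(strip=True)` instead of `find(text=True)` so that whitespace is trimmed in the same step.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table (not the real Wikipedia markup).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A </td><td>Not assigned </td><td>Not assigned </td></tr>
  <tr><td>M3A </td><td>North York </td><td>Parkwoods </td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row only has <th> cells, so it is skipped
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

The header row drops out because it contains no `td` cells, which is the same reason the notebook's loop can ignore it.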

We will create a dataframe with pandas, assigning each of the lists into a column with the name of our source table columns i.e. PostalCode, Borough, and Neighborhood.

In [6]:
headers = [postal_code, borough,neighborhood]
columns=['PostalCode','Borough','Neighborhood']
df_postal_m=pd.DataFrame(headers).transpose() 
df_postal_m.columns = columns
df_postal_m.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Now we perform the data wrangling steps: removing newline characters and surrounding whitespace, replacing 'Not assigned' with NaN, dropping rows without a borough, and checking for duplicates.

In [7]:
df_postal_m=df_postal_m.replace('\n',' ', regex=True) # replace newline characters in the dataframe cells with a space
df_postal_m=df_postal_m.apply(lambda x: x.str.strip()) # strip leading and trailing whitespace from the dataframe cells
df_postal_m.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [8]:
df_postal_m=df_postal_m.replace('Not assigned',np.nan) # Replace 'Not assigned' with NaN (np.NaN was removed in NumPy 2.0)
df_postal_m.isna().sum() #Total Nan values for each column in DataFrame

PostalCode       0
Borough         77
Neighborhood    77
dtype: int64

In [9]:
df_postal_m.dropna(subset=["Borough"],inplace=True) #Removing all rows having 'Borough' value as NaN.

In [10]:
df_postal_m.reset_index(inplace=True,drop=True)
df_postal_m.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [11]:
df_postal_m.duplicated(['PostalCode']).sum() #Checking for total duplicate values in column PostalCode

0
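
No duplicate postal codes were found, so no merge is needed here. If the table had listed one postal code on several rows, one possible way to combine them would be a `groupby` followed by a string join. The sketch below uses invented sample rows, not data from the scraped table.

```python
import pandas as pd

# Invented sample in which M5A appears on two rows.
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Collapse rows that share a postal code and borough, joining the
# neighborhood names with a comma; sort=False keeps first-seen order.
merged = (
    sample.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```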

In [12]:
df_postal_m['Neighborhood'].isna().sum() # 'Not assigned' was already replaced with NaN, so count the remaining NaN values in the Neighborhood column

0

In [13]:
df_postal_m.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [14]:
print("There are {} rows in this dataframe".format(df_postal_m.shape[0]))

There are 103 rows in this dataframe


<a id='item2'></a>

## 2. Part 2

Now we will get the latitude and longitude coordinates of each postal code, which the Foursquare API calls require. We will assign the Foursquare ID and secret to the variables CLIENT_ID and CLIENT_SECRET, and the API version date to VERSION. The version we will use is 20200501.

In [15]:
# The code was removed by Watson Studio for sharing.
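
Because the credentials cell was removed, readers need to define the three variables themselves before running the later cells. A minimal sketch, assuming the keys are kept in environment variables: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical, and the fallback strings are placeholders, not working credentials.

```python
import os

# Hypothetical environment variable names; set them in the shell before starting Jupyter.
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "your-client-id")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "your-client-secret")
VERSION = "20200501"  # API version date stated in the text above

print("Credentials loaded for API version", VERSION)
```

Reading the keys from the environment keeps them out of the notebook file, so it can be shared without another removal step.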

In [16]:
postal_list = df_postal_m['PostalCode'].tolist() #convert the PostalCode Column to list and save it in postal_list

In [17]:
nomi = pgeocode.Nominatim('ca') #ca is country code for canada
df_postal_ll= nomi.query_postal_code(postal_list) #querying all the postal codes in postal_list and saving the result DataFrame in df_postal_ll.
df_postal_ll.head()

Unnamed: 0,postal_code,country code,place_name,state_name,state_code,county_name,county_code,community_name,community_code,latitude,longitude,accuracy
0,M3A,CA,North York (York Heights / Victoria Village / ...,Ontario,ON,North York,,,,43.7545,-79.33,1.0
1,M4A,CA,North York (Sweeney Park / Wigmore Park),Ontario,ON,,,,,43.7276,-79.3148,6.0
2,M5A,CA,Downtown Toronto (Regent Park / Port of Toronto),Ontario,ON,Toronto,8133394.0,,,43.6555,-79.3626,6.0
3,M6A,CA,North York (Lawrence Manor / Lawrence Heights),Ontario,ON,North York,,,,43.7223,-79.4504,6.0
4,M7A,CA,Queen's Park Ontario Provincial Government,Ontario,ON,,,,,43.6641,-79.3889,


Now we will concatenate the two dataframes side by side, keeping only the required columns.

In [18]:
df1 = df_postal_m
df2 = df_postal_ll[['latitude','longitude']]
df_list = [df1,df2]
df_postal_join = pd.concat(df_list,axis=1)

In [19]:
df_postal_join.rename(columns={"latitude": "Latitude", "longitude": "Longitude"}, inplace=True)
df_postal_join.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.6662,-79.5282
6,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193
7,M3B,North York,Don Mills,43.745,-79.359
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.7063,-79.3094
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783
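
`pd.concat(..., axis=1)` pairs rows by index label, so the step above relies on both frames listing the postal codes in the same order with the same index. That appears to hold in the outputs shown, but it is not checked. As a sketch of an alternative that matches on the postal code explicitly, the toy frames below stand in for `df_postal_m` and `df_postal_ll`; only a few values are copied from the outputs above, and the second frame is deliberately shuffled.

```python
import pandas as pd

# Toy stand-ins for df_postal_m and df_postal_ll.
left = pd.DataFrame({"PostalCode": ["M3A", "M4A"],
                     "Borough": ["North York", "North York"]})
right = pd.DataFrame({"postal_code": ["M4A", "M3A"],  # deliberately in a different order
                      "latitude": [43.7276, 43.7545],
                      "longitude": [-79.3148, -79.33]})

# Match rows on the postal code instead of on the index position.
joined = left.merge(right, left_on="PostalCode", right_on="postal_code", how="left")
joined = joined.drop(columns="postal_code")
print(joined)
```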


<a id='item3'></a>

## 3. Part 3

In this part, we will explore and cluster only the neighborhoods whose borough name contains the word Toronto, i.e. East Toronto, Downtown Toronto, Central Toronto, and West Toronto.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>foursquare_agent</em>, as shown below.

In [20]:
address = 'Toronto,ON'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are: Latitude {}, Longitude {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are: Latitude 43.6534817, Longitude -79.3839347.


We will make a dataframe of the boroughs that have Toronto in their name.

In [21]:
df_borough_toronto = df_postal_join[df_postal_join['Borough'].str.contains('Toronto')]
df_borough_toronto.reset_index(drop=True,inplace=True)
df_borough_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756
4,M4E,East Toronto,The Beaches,43.6784,-79.2941


In [22]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(len(df_borough_toronto['Borough'].unique()),df_borough_toronto.shape[0]))

The dataframe has 4 boroughs and 39 neighborhoods.


#### Create a map of Toronto with the neighborhoods of the selected boroughs superimposed on top.

In [23]:
# create map of Toronto using latitude and longitude values and showing all 39 neighborhoods of df_borough_toronto DataFrame.
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for borough, neighborhood, lat, lng in zip(df_borough_toronto['Borough'], df_borough_toronto['Neighborhood'],
                                           df_borough_toronto['Latitude'], df_borough_toronto['Longitude']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#FFA500',
        fill_opacity=0.8,
    ).add_to(map_toronto)
    
map_toronto

Now we will define some helper functions to simplify the rest of the work.

In [24]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [25]:
# function to get up to 100 venues within a radius of 300 meters of each neighborhood and save the result in a DataFrame
def getNearbyVenues(names, latitudes, longitudes, radius=300):
    venues_list=[]
    limit=100
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
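
The list comprehension in `getNearbyVenues` assumes that every venue has at least one category; a venue with an empty `categories` list would raise an `IndexError`. The sketch below extracts the same seven fields but tolerates a missing category. The helper name `extract_venue_rows` is hypothetical, and the sample items are hand-written to mimic the structure the function indexes into, not real API output.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style 'items' into rows, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when the venue carries no category instead of failing.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written sample shaped like response['groups'][0]['items'] (not real API output).
sample_items = [
    {"venue": {"name": "Roselle Desserts",
               "location": {"lat": 43.653447, "lng": -79.362017},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": []}},
]

rows = extract_venue_rows("Regent Park, Harbourfront", 43.6555, -79.3626, sample_items)
for row in rows:
    print(row)
```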

Now we will run the above function on each neighborhood and create a new dataframe called ewcd_toronto_venues, where ewcd stands for east, west, central, and downtown.

In [26]:
ewcd_toronto_venues = getNearbyVenues(names=df_borough_toronto['Neighborhood'],
                                   latitudes=df_borough_toronto['Latitude'],
                                   longitudes=df_borough_toronto['Longitude']
                                  )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West,  Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport


In [27]:
ewcd_toronto_venues.shape # To get the number of rows and columns of ewcd_toronto_venues DataFrame.

(713, 7)

In [28]:
ewcd_toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.6555,-79.3626,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.6555,-79.3626,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.6555,-79.3626,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.6555,-79.3626,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
4,"Regent Park, Harbourfront",43.6555,-79.3626,The Yoga Lounge,43.655515,-79.364955,Yoga Studio


We will now check how many venues were returned for each neighborhood.

In [29]:
ewcd_toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,16,16,16,16,16,16
"Brockton, Parkdale Village, Exhibition Place",5,5,5,5,5,5
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",3,3,3,3,3,3
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",15,15,15,15,15,15
Central Bay Street,16,16,16,16,16,16
Christie,2,2,2,2,2,2
Church and Wellesley,55,55,55,55,55,55
"Commerce Court, Victoria Hotel",70,70,70,70,70,70
Davisville,10,10,10,10,10,10
Davisville North,4,4,4,4,4,4


Now we will find out how many unique categories appear among all the returned venues.

In [30]:
print('There are {} unique categories.'.format(len(ewcd_toronto_venues['Venue Category'].unique())))

There are 172 unique categories.


#### Analyze each neighborhood

In [31]:
# Analyze each neighborhood
# one hot encoding
ewcd_toronto_onehot = pd.get_dummies(ewcd_toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
ewcd_toronto_onehot['Neighborhood'] = ewcd_toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [ewcd_toronto_onehot.columns[-1]] + list(ewcd_toronto_onehot.columns[:-1])
ewcd_toronto_onehot = ewcd_toronto_onehot[fixed_columns]

ewcd_toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,American Restaurant,Art Gallery,Art Museum,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,...,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint
0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [32]:
ewcd_toronto_grouped = ewcd_toronto_onehot.groupby('Neighborhood').mean().reset_index()
ewcd_toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,American Restaurant,Art Gallery,Art Museum,Asian Restaurant,Athletics & Sports,BBQ Joint,...,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,...,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018182,0.018182
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.028571,0.014286,0.0,0.042857,0.0,0.0,...,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.014286,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
ewcd_toronto_grouped.shape

(36, 172)
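
In the one-hot output of cell 31 the first column is Yoga Studio rather than Neighborhood, and the grouped frame has 172 columns for 172 unique categories. This suggests that one venue category is itself named "Neighborhood", so assigning the `Neighborhood` column overwrote that category instead of appending a new column. The sketch below shows one possible way to avoid the collision by renaming the category column first. The data is invented, and the replacement name `Neighborhood (category)` is an arbitrary choice.

```python
import pandas as pd

# Invented venues; one category is literally called "Neighborhood" to show the collision.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "Venue Category": ["Bakery", "Park", "Park", "Neighborhood"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
# Rename the colliding category column before adding the grouping key.
onehot = onehot.rename(columns={"Neighborhood": "Neighborhood (category)"})
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the 0/1 indicators = share of each category within a neighborhood.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, because the means are the shares of each category among that neighborhood's venues.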

Now let's print each neighborhood along with the top 5 most common venues

In [34]:
num_top_venues = 5

for hood in ewcd_toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = ewcd_toronto_grouped[ewcd_toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                 venue  freq
0  Sporting Goods Shop  0.06
1             Beer Bar  0.06
2                 Park  0.06
3           Restaurant  0.06
4         Concert Hall  0.06


----Brockton, Parkdale Village, Exhibition Place----
         venue  freq
0          Bar   0.2
1    Pet Store   0.2
2         Café   0.2
3  Coffee Shop   0.2
4       Museum   0.0


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                  venue  freq
0  Gym / Fitness Center  0.33
1                   Gym  0.33
2      Sushi Restaurant  0.33
3   Monument / Landmark  0.00
4                Market  0.00


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
               venue  freq
0               Park  0.13
1       Intersection  0.13
2  French Restaurant  0.07
3              Diner  0.07
4         Donut Shop  0.07


----Central Bay Street----
                 venue  freq
0          Co

Next, we define a function that sorts a neighborhood's venue categories by frequency in descending order and returns the most common ones.

In [35]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [36]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ewcd_toronto_grouped['Neighborhood']

for ind in np.arange(ewcd_toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ewcd_toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Berczy Park,Concert Hall,French Restaurant,Restaurant,Japanese Restaurant,Sporting Goods Shop
1,"Brockton, Parkdale Village, Exhibition Place",Pet Store,Bar,Café,Coffee Shop,Farmers Market
2,"Business reply mail Processing Centre, South C...",Gym / Fitness Center,Sushi Restaurant,Gym,Gourmet Shop,Fast Food Restaurant
3,"CN Tower, King and Spadina, Railway Lands, Har...",Park,Intersection,Coffee Shop,Diner,Caribbean Restaurant
4,Central Bay Street,Coffee Shop,Hotel,Spa,Japanese Restaurant,Chinese Restaurant
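
To see what `return_most_common_venues` does on a single row, the sketch below repeats the function and applies it to one invented neighborhood. The frequencies are made up and chosen without ties, since the relative order of equally frequent categories should not be relied on.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as the function above: skip the first entry (the neighborhood name),
    # sort the remaining frequencies in descending order, and return the top labels.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Invented frequencies for one neighborhood.
row = pd.Series({"Neighborhood": "Example", "Bakery": 0.1, "Café": 0.5,
                 "Park": 0.3, "Pub": 0.05})

top3 = list(return_most_common_venues(row, 3))
print(top3)
```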


#### Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 4 clusters.

In [37]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 4

ewcd_toronto_grouped_clustering = ewcd_toronto_grouped.drop(columns='Neighborhood') # positional axis argument was removed in pandas 2.0

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=7).fit(ewcd_toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 3, 1, 1, 1, 1], dtype=int32)
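
The number of clusters is set to 4 by hand. One common heuristic for choosing it, not used in this notebook, is to compare the k-means inertia across several values of k and look for the point where the improvement levels off (the "elbow"). The sketch below runs on synthetic data; in the notebook, `X` would be `ewcd_toronto_grouped_clustering`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the frequency matrix: three well-separated blobs.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0.0, 1.0, 2.0)])

# Fit k-means for several k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

On this toy data the inertia drops sharply up to k = 3 and only slowly afterwards, which matches the three blobs it was generated from.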

Now let's create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.

In [38]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'clusterlabel', kmeans.labels_)

toronto_merged = df_borough_toronto

# join df_borough_toronto with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,clusterlabel,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,1.0,Breakfast Spot,Furniture / Home Store,Sandwich Place,Yoga Studio,Light Rail Station
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,1.0,Sushi Restaurant,Park,Beer Bar,Dog Run,Farmers Market
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783,1.0,Coffee Shop,Middle Eastern Restaurant,Café,Bar,Bookstore
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756,1.0,Gastropub,Coffee Shop,Cosmetics Shop,Middle Eastern Restaurant,Camera Store
4,M4E,East Toronto,The Beaches,43.6784,-79.2941,1.0,Health Food Store,Pizza Place,Trail,Pub,Coffee Shop
5,M5E,Downtown Toronto,Berczy Park,43.6456,-79.3754,1.0,Concert Hall,French Restaurant,Restaurant,Japanese Restaurant,Sporting Goods Shop
6,M5G,Downtown Toronto,Central Bay Street,43.6564,-79.386,1.0,Coffee Shop,Hotel,Spa,Japanese Restaurant,Chinese Restaurant
7,M6G,Downtown Toronto,Christie,43.6683,-79.4205,3.0,Grocery Store,Café,Dog Run,Farmers Market,Falafel Restaurant
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.6496,-79.3833,1.0,Coffee Shop,Asian Restaurant,Salad Place,Hotel,Restaurant
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.6655,-79.4378,1.0,Bus Line,Skating Rink,Music Venue,Park,Dog Run


In [39]:
toronto_merged['clusterlabel'].isna().sum() #check for number of NaN values in the clusterlabel column

3
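
Three of the 39 rows have no cluster label. The most likely reason is that no venues were returned for those neighborhoods, so they never appear in the grouped frame and the join finds nothing to attach. As a sketch of how the unmatched neighborhoods could be listed before they are dropped, the example below uses `merge` with `indicator=True` on invented stand-in frames.

```python
import pandas as pd

# Invented stand-ins: "C" has no venues, so it is absent from the labeled frame.
all_hoods = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
labeled = pd.DataFrame({"Neighborhood": ["A", "B"], "clusterlabel": [1, 0]})

# indicator=True adds a _merge column that says where each row was found.
check = all_hoods.merge(labeled, on="Neighborhood", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "Neighborhood"].tolist()
print(unmatched)
```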

Now we will drop the rows having NaN values in clusterlabel column.

In [40]:
toronto_merged.dropna(subset=["clusterlabel"],inplace=True)

We will now change the data type of column 'clusterlabel' from float to int.

In [41]:
toronto_merged.clusterlabel=toronto_merged.clusterlabel.astype(int)

Finally, let's visualize the resulting clusters

In [42]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'],
                                  toronto_merged['clusterlabel']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
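
In the cell above, `rainbow[cluster-1]` maps cluster 0 to index -1, which is the last color in the list. Each cluster still gets its own color, but the mapping is shifted by one. A sketch of a more direct mapping, with one evenly spaced color per label, might look like this; the names `palette` and `cluster_color` are introduced here for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4  # same value as in the clustering cell

# One evenly spaced rainbow color per cluster label 0..kclusters-1.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]
cluster_color = {label: palette[label] for label in range(kclusters)}
print(cluster_color)
```

With this mapping, a marker for a given cluster would use `cluster_color[cluster]` for both `color` and `fill_color`.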

Now let's examine each cluster.

In [43]:
#cluster0 list
toronto_merged.loc[toronto_merged['clusterlabel'] == 0, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
22,"High Park, The Junction South",Park,Wings Joint,Discount Store,Farmers Market,Falafel Restaurant
29,"Moore Park, Summerhill East",Park,Wings Joint,Discount Store,Farmers Market,Falafel Restaurant
33,Rosedale,Park,Wings Joint,Discount Store,Farmers Market,Falafel Restaurant


In [44]:
#cluster1 list
toronto_merged.loc[toronto_merged['clusterlabel'] == 1, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Regent Park, Harbourfront",Breakfast Spot,Furniture / Home Store,Sandwich Place,Yoga Studio,Light Rail Station
1,"Queen's Park, Ontario Provincial Government",Sushi Restaurant,Park,Beer Bar,Dog Run,Farmers Market
2,"Garden District, Ryerson",Coffee Shop,Middle Eastern Restaurant,Café,Bar,Bookstore
3,St. James Town,Gastropub,Coffee Shop,Cosmetics Shop,Middle Eastern Restaurant,Camera Store
4,The Beaches,Health Food Store,Pizza Place,Trail,Pub,Coffee Shop
5,Berczy Park,Concert Hall,French Restaurant,Restaurant,Japanese Restaurant,Sporting Goods Shop
6,Central Bay Street,Coffee Shop,Hotel,Spa,Japanese Restaurant,Chinese Restaurant
8,"Richmond, Adelaide, King",Coffee Shop,Asian Restaurant,Salad Place,Hotel,Restaurant
9,"Dufferin, Dovercourt Village",Bus Line,Skating Rink,Music Venue,Park,Dog Run
10,"Harbourfront East, Union Station, Toronto Islands",Park,Music Venue,Athletics & Sports,Discount Store,Farmers Market


In [45]:
#cluster2 list
toronto_merged.loc[toronto_merged['clusterlabel'] == 2, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
21,"Forest Hill North & West, Forest Hill Road Park",Accessories Store,Wings Joint,Dog Run,Fast Food Restaurant,Farmers Market


In [46]:
#cluster3 list
toronto_merged.loc[toronto_merged['clusterlabel'] == 3, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
7,Christie,Grocery Store,Café,Dog Run,Farmers Market,Falafel Restaurant
27,"University of Toronto, Harbord",Café,College Gym,College Arts Building,Wings Joint,Dog Run


#### Thank you for reviewing this assignment.

<div class="alert alert-block alert-success"style="margin-top: 20px"> 
This assignment is done by Mayank Panwar.

Have a Good Day! :) </div>