# IBM Data Science Specialization Capstone
This notebook is part of the IBM Data Science capstone project.

## Segmenting and Clustering Neighborhoods in Toronto - Part 3

In Part 1 of this project, we collected the Toronto neighborhoods data from a web page, converted it to a pandas dataframe, and cleaned and pre-processed it.

In Part 2, we collected the latitude and longitude of each neighborhood and added them to our dataframe so that we can use the Foursquare location data.

In this final part, we will explore and cluster the neighborhoods in Toronto.

### Install required packages

In [1]:
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install beautifulsoup4
!{sys.executable} -m pip install requests
!{sys.executable} -m pip install geopy
!{sys.executable} -m pip install scikit-learn
!{sys.executable} -m pip install folium
!{sys.executable} -m pip install matplotlib



### Import dependencies

In [2]:
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans
import requests
import pandas as pd
# pandas.io.json.json_normalize is deprecated since pandas 1.0
from pandas import json_normalize
import numpy as np
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

## Part 1 Revisited - Collect neighborhood data for Toronto and convert it into a pandas dataframe

The data will be collected from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

This page contains a list of postal codes in Canada where the first letter is M.

In [3]:
# Get the HTML text from the wiki page
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_text = requests.get(wiki_url).text

# Extract the table data using BeautifulSoup
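# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_text, 'html.parser')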
table = soup.find('table', attrs={'class':'wikitable sortable'})
trs = table.find_all('tr')

# Extract the text from all the table cells and add all rows
# to a list of rows.
rows = list()
for tr in trs:
    td = tr.find_all('td')
    row = [ele.text.strip() for ele in td]
    if row:
        # Ignore empty rows with no 'td',
        # applicable for the column headers row.
        rows.append(row)
        
# Convert the data to a pandas dataframe with 
# columns 'PostalCode', 'Borough', and 'Neighborhood'.
df = pd.DataFrame(rows,
                  columns=['PostalCode', 'Borough', 'Neighborhood'])

# Remove rows whose 'Borough' is 'Not assigned'
df = df[df.Borough != 'Not assigned']
df.reset_index(inplace=True, drop=True)

# If a cell has a borough but a 'Not assigned' neighborhood,
# then the neighborhood will be the same as the borough.
# So, replace every 'Not assigned' value in the 'Neighborhood'
# column with the value of that row's 'Borough'.
df['Neighborhood'] = df.apply(
    lambda row: 
    row['Borough'] if row['Neighborhood'] == 'Not assigned' 
    else row['Neighborhood'],
    axis=1)

# More than one neighborhood can exist in one postal code area.
# Combine rows with the same postal code into a single row.
# For example, in the table on the Wikipedia page, M5A is 
# listed twice and has two neighborhoods: Harbourfront and
# Regent Park. These two rows will be combined into one row
# with the neighborhoods separated with a comma.
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].\
    apply(', '.join).to_frame()
df.reset_index(inplace=True)


In [4]:
df.head(12)

,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [5]:
print("Dataframe shape: ", df.shape)

Dataframe shape:  (103, 3)
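
The three cleaning rules from Part 1 can be checked on a small hand-made table before trusting them on the scraped data. The sketch below is only an illustration: the M5A rows come from the example in the comments above, while the M1A and M7A rows are assumed sample rows. It also swaps the row-wise `apply` for the vectorized `Series.where`, which should give the same result.

```python
import pandas as pd

# Illustrative rows that mimic the scraped table (M1A and M7A are assumed samples).
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 1: drop rows whose borough is not assigned.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: a 'Not assigned' neighborhood falls back to the borough name.
clean["Neighborhood"] = clean["Neighborhood"].where(
    clean["Neighborhood"] != "Not assigned", clean["Borough"]
)

# Rule 3: one row per postal code, neighborhoods joined with commas.
clean = (
    clean.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(clean)
```

If the rules behave as intended, M1A disappears, M5A collapses into a single row, and M7A takes its borough name as its neighborhood.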


## Part 2 Revisited - Add latitude and longitude data to the dataframe

We will be using the CSV data at https://cocl.us/Geospatial_data to get the latitude and longitude information for all the postal codes.

Once we have the data, it will be concatenated with our original dataframe.

In [6]:
# Read the CSV data to a pandas dataframe.
geospatial_data = pd.read_csv('https://cocl.us/Geospatial_data')

## Combine the two dataframes.

# Concatenate the two dataframes using column "PostalCode" from 'df' and
# "Postal Code" from 'geospatial_data'.
df = pd.concat(
    [df.set_index('PostalCode'), geospatial_data.set_index('Postal Code')],
    axis=1, join='inner')

# Postal code will now be the index, change it to a non-index 
# column and reset index.
df.reset_index(inplace=True)

# Rename the 'index' column to 'PostalCode'
df.rename(columns={'index': 'PostalCode'}, inplace=True)


In [7]:
df.head(12)

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [8]:
print("Dataframe shape: ", df.shape)

Dataframe shape:  (103, 5)
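
An index-aligned `concat` works here, but `DataFrame.merge` is a common alternative that states the join keys explicitly and can verify that each postal code appears only once on each side. The following is a minimal sketch on two rows taken from the output above; the extra "M9Z" row is made up to show that the inner join drops unmatched codes.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M1C", "M1B", "M9Z"],  # "M9Z" is a made-up, unmatched code
    "Latitude": [43.784535, 43.806686, 0.0],
    "Longitude": [-79.160497, -79.194353, 0.0],
})

merged = neighborhoods.merge(
    coordinates,
    left_on="PostalCode",
    right_on="Postal Code",
    how="inner",
    validate="one_to_one",  # raises MergeError if a postal code is duplicated
).drop(columns="Postal Code")
print(merged)
```

With this approach there is no need to reset the index or rename an 'index' column afterwards.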


# Part 3 - Segment and Cluster Neighborhoods in Toronto

## 1. Explore the data

### a. Get Toronto's latitude and longitude using geopy

In [9]:
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
print('The geographical coordinates of Toronto are {}, {}.'.format(
    location.latitude, location.longitude))


The geographical coordinates of Toronto are 43.6534817, -79.3839347.


### b. Create a map of Toronto using folium

In [10]:
map_toronto = folium.Map(
    location=[location.latitude, location.longitude],
    zoom_start=10)
map_toronto

### c. Superimpose the neighborhoods on Toronto's map

Markers are added to the map for each of the neighborhoods and each marker is labeled in the format: "PostalCode (Neighborhood), Borough"

In [11]:
for _, row in df.iterrows():
    label = '{} ({}), {}'.format(
        row.PostalCode, row.Neighborhood, row.Borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row.Latitude, row.Longitude],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) 
    
map_toronto   

### d. Limit data set to boroughs with term 'Toronto'

**For this lab, we will just pick neighborhoods in boroughs that contain the word "Toronto" in them.**

In [12]:
# Create dataframe with boroughs containing the term 'Toronto'
# Chain reset_index so that we work on a new dataframe rather than a slice of df
toronto_data = df[df.Borough.str.contains('Toronto')].reset_index(drop=True)
toronto_data

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


In [13]:
toronto_data.shape

(39, 5)

So, there are 39 rows (postal code areas) in boroughs whose name contains the term 'Toronto'.
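
The borough filter can be tried out on a few sample values first. This is a small sketch with assumed sample boroughs; `na=False` is added so that a missing borough counts as a non-match instead of producing a missing value in the mask.

```python
import pandas as pd

# Assumed sample values, including one missing borough.
boroughs = pd.Series(
    ["East Toronto", "Scarborough", "Central Toronto", "Scarborough", None],
    name="Borough",
)

# True where the borough name contains the term 'Toronto'.
mask = boroughs.str.contains("Toronto", na=False)

# Number of matching rows per borough.
print(boroughs[mask].value_counts())
```

Summing the mask (`mask.sum()`) gives the number of matching rows, which is the same figure that the first entry of `toronto_data.shape` reports.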

Let us visualize these on Toronto's map.

In [14]:
# Create a map instance
map_toronto = folium.Map(
    location=[location.latitude, location.longitude],
    zoom_start=11)

# Add a marker for every row in toronto_data
for _, row in toronto_data.iterrows():
    label = '{} ({}), {}'.format(
        row.PostalCode, row.Neighborhood, row.Borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row.Latitude, row.Longitude],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) 
    
map_toronto

### e. Define Foursquare credentials

To avoid exposing the credentials, I have saved the client ID and secret in file "config.py" and placed it alongside this notebook.

In [15]:
from config import config

CLIENT_ID = config['CLIENT_ID']
CLIENT_SECRET = config['CLIENT_SECRET']
VERSION = '20190601'

ModuleNotFoundError: No module named 'config'
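
The error above appears whenever "config.py" is not next to the notebook. One possible way to make the cell more forgiving is to fall back to environment variables. The sketch below assumes the same `config` dictionary layout as the cell above; the environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical and can be replaced with anything.

```python
import importlib
import os


def load_foursquare_credentials():
    """Return (client_id, client_secret) from config.py or the environment."""
    try:
        # Same layout as the cell above: config.py exposes a dict named `config`.
        config = importlib.import_module("config").config
        return config["CLIENT_ID"], config["CLIENT_SECRET"]
    except (ModuleNotFoundError, AttributeError, KeyError):
        # Hypothetical variable names, used only when config.py is unavailable.
        client_id = os.environ.get("FOURSQUARE_CLIENT_ID")
        client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET")
        if not client_id or not client_secret:
            raise RuntimeError(
                "Foursquare credentials not found in config.py or the environment."
            )
        return client_id, client_secret
```

It would be used as `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`, which keeps the secrets out of the notebook either way.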

### f. Explore a neighborhood in Toronto

Now let us explore the first neighborhood in our dataframe.

Get the first neighborhood's name, latitude and longitude.

In [None]:
neighborhood_name = toronto_data.loc[0, 'Neighborhood']
neighborhood_latitude = toronto_data.loc[0, 'Latitude']
neighborhood_longitude = toronto_data.loc[0, 'Longitude']

print("Latitude and longitude of neighborhood '{}' are [{}, {}]".format(
    neighborhood_name, neighborhood_latitude, neighborhood_longitude))

**Get the top 100 venues in 'The Beaches' within a radius of 500 meters.**

In [None]:
# Construct the URL
limit = 100
radius = 500
explore_url_prefix = 'https://api.foursquare.com/v2/venues/explore'
url = '{}?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    explore_url_prefix, CLIENT_ID, CLIENT_SECRET, VERSION, 
    neighborhood_latitude, neighborhood_longitude, radius, limit)
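
As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is a sketch only: `build_explore_url` is a hypothetical helper name, and the endpoint and parameter names are the ones already used in the cell above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL from keyword-style parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes special characters (the comma in `ll` becomes %2C).
    return EXPLORE_URL + "?" + urlencode(params)


# Placeholder credentials; the coordinates are those of 'The Beaches' above.
url = build_explore_url("ID", "SECRET", "20190601", 43.676357, -79.293031)

# Decode the query string again to confirm the parameters survive the round trip.
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"], query["limit"])
```

The same dictionary could also be passed to `requests.get(EXPLORE_URL, params=params)`, which performs the encoding itself.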

In [None]:
# Get the venues.
results = requests.get(url).json()
results

In [None]:
# Explore the venues
venues = results['response']['groups'][0]['items']
venues

In [None]:
# Normalize the JSON response
neighborhood_venues = json_normalize(venues)
neighborhood_venues

This neighborhood only seems to have 4 venues. Let us go ahead and explore it further.

In [None]:
# Keep only the venue name, category, latitude and longitude.
venue_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
neighborhood_venues = neighborhood_venues.loc[:, venue_columns]
neighborhood_venues


In [None]:
# Change the column names to just the last part after the '.'.
neighborhood_venues.columns = [column.split(".")[-1] for column in neighborhood_venues.columns]
neighborhood_venues


In [None]:
# Extract the category names from a row.
# We will get the first item in the categories list and then get its name.

def get_category(row):
    categories_list = row['categories']
    if categories_list:
        return categories_list[0]['name']
    return None

In [None]:
# Replace the values in the categories column with the first category name.
neighborhood_venues['categories'] = neighborhood_venues.apply(get_category, axis=1)
neighborhood_venues
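
The flattening steps above can be reproduced without calling the API. The items below are made up and contain only the fields this notebook reads (`venue.name`, `venue.categories`, `venue.location.lat`, `venue.location.lng`); a real response carries many more fields.

```python
import pandas as pd

# Made-up items shaped like the fields read in the cells above.
items = [
    {"venue": {"name": "Example Trail",
               "categories": [{"name": "Trail"}],
               "location": {"lat": 43.6763, "lng": -79.2930}}},
    {"venue": {"name": "Example Spot",
               "categories": [],  # a venue without any category
               "location": {"lat": 43.6770, "lng": -79.2940}}},
]

# Nested dictionaries become dotted column names; lists stay as they are.
flat = pd.json_normalize(items)
flat = flat[["venue.name", "venue.categories",
             "venue.location.lat", "venue.location.lng"]].copy()

# Keep only the last part of each dotted column name.
flat.columns = [column.split(".")[-1] for column in flat.columns]

# Replace each categories list with its first category name, or None if empty.
flat["categories"] = flat["categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)
print(flat)
```

The second item shows why `get_category` checks for an empty list before indexing into it.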


### g. Explore neighborhoods in Toronto

Now we will explore all the neighborhoods in our dataframe. We will repeat the process that we did for 'The Beaches' neighborhood in the previous section.

In [None]:
venues_list = list()

for name, lat, lng in zip(toronto_data['Neighborhood'], toronto_data['Latitude'], toronto_data['Longitude']):
    print("Collecting venues for neighborhood:", name)
    
    # Create API request URL
    limit = 100
    radius = 500
    explore_url_prefix = 'https://api.foursquare.com/v2/venues/explore'
    url = '{}?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        explore_url_prefix, CLIENT_ID, CLIENT_SECRET, VERSION, 
        lat, lng, radius, limit)
    
    # Make the request
    neighborhood_venues = requests.get(url).json()["response"]['groups'][0]['items']
    
    # Add relevant info to venues_list; a venue without categories gets None
    venues_list.extend([(
        name, lat, lng,
        v['venue']['name'], 
        v['venue']['location']['lat'], 
        v['venue']['location']['lng'],  
        v['venue']['categories'][0]['name'] if v['venue']['categories'] else None)
        for v in neighborhood_venues])

print("Done")   
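
The loop above indexes straight into the response, so a single unexpected payload would stop the whole collection. A more defensive variant could move the extraction into a small function. This is a sketch under the assumption that a failed or empty request may return a payload without `response.groups`; `venue_rows` is a hypothetical helper name.

```python
def venue_rows(name, lat, lng, payload):
    """Flatten one explore payload into the tuples stored in venues_list."""
    groups = payload.get("response", {}).get("groups") or []
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            # Use None when the venue has no category at all.
            categories[0]["name"] if categories else None,
        ))
    return rows


# Made-up payload with a single venue, followed by an empty payload.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "categories": [{"name": "Café"}],
               "location": {"lat": 43.67, "lng": -79.29}}},
]}]}}
print(venue_rows("The Beaches", 43.676357, -79.293031, sample))
print(venue_rows("The Beaches", 43.676357, -79.293031, {}))
```

Inside the loop, the call would become `venues_list.extend(venue_rows(name, lat, lng, requests.get(url).json()))`.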

In [None]:
# Let us see how many venues we have got.
len(venues_list)

In [None]:
# Convert the venues_list to a dataframe.
toronto_venues = pd.DataFrame(venues_list)
toronto_venues.columns = [
    'Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 
    'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

Let us check the number of venues in each neighborhood.

In [None]:
toronto_venues.groupby('Neighborhood').count()

Let us find how many unique venue categories we have.

In [None]:
print('There are {} unique categories'.format(
    len(toronto_venues['Venue Category'].unique())))


## 2. Analyze each neighborhood

In [None]:
# One hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot.head()


In [None]:
# Add neighborhood column back to dataframe as NeighborhoodName
toronto_onehot.insert(0, 'NeighborhoodName', toronto_venues['Neighborhood']) 
toronto_onehot.head()


In [None]:
toronto_onehot.shape

In [None]:
# Check the size of this dataframe.
# The first column holds the neighborhood name, so it is not a category.
venues_count, columns_count = toronto_onehot.shape
print("So we have {} different venues and {} venue categories.".format(
    venues_count, columns_count - 1))

#### Now we will group the rows by neighborhood name and take the mean of the frequency of occurrence of each category.

In [None]:
toronto_grouped = toronto_onehot.groupby('NeighborhoodName').mean().reset_index()
toronto_grouped
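
To see what the grouped means represent, here is a toy example with two made-up neighborhoods. Each value in the result is the share of a neighborhood's venues that fall into a category, so every row sums to 1.

```python
import pandas as pd

# Made-up venues for two neighborhoods, "A" and "B".
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Pub", "Park", "Park"],
})

# One indicator column per category (floats so that the mean is a share).
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "NeighborhoodName", venues["Neighborhood"])

# Mean of the indicators per neighborhood = relative frequency per category.
grouped = onehot.groupby("NeighborhoodName").mean().reset_index()
print(grouped)
```

Neighborhood "A" has four venues, two of which are cafes, so its 'Cafe' value is 0.5; all of the venues in "B" are parks, so its 'Park' value is 1.0.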


The dataframe has 38 rows and 238 columns, which is as expected, i.e., 38 neighborhoods and 238 categories.

Now we will print each neighborhood with its top 5 most common venues.

In [None]:
num_top_venues = 5
for neighborhood in toronto_grouped['NeighborhoodName']:
    print("------{}------".format(neighborhood))
    temp = toronto_grouped[toronto_grouped['NeighborhoodName'] == neighborhood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

#### Next we will create a pandas dataframe with the top 10 most common venues in each neighborhood

Function to sort venues in descending order of their frequency.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now we will create the dataframe.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['NeighborhoodName']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(
        toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted
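
The suffix logic above works for the 10 columns used here, but it would label a 21st column as '21th'. If more columns are ever needed, a small helper can produce the suffix for any positive integer. `ordinal` is a hypothetical helper name and is not part of the original lab code.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th (and the rest of the teens) always take 'th'.
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Same column names as in the cell above, built with the helper.
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```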

## 3. Cluster the Toronto Neighborhoods

We will run k-means and cluster the neighborhoods into 5 clusters.

In [None]:
# Set number of clusters
k = 5

# Drop the neighborhood name column so that each column contains only the feature set.
toronto_grouped_clustering = toronto_grouped.drop(columns='NeighborhoodName')

# Run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_grouped_clustering)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
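
The value k = 5 is fixed in advance in this notebook. One common way to sanity-check such a choice is to compare silhouette scores for several values of k. The sketch below uses synthetic points around three assumed centers, not the Toronto data; on the real data, `toronto_grouped_clustering` would take the place of `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: 30 points around each of three well-separated centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score gives the most compact, best-separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print("Best k:", best_k)
```

A higher score only indicates better-separated clusters; whether those clusters are meaningful for the neighborhoods still has to be judged by inspecting them.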


Now we will create a dataframe with the neighborhood information, top 10 common venues as well as the cluster labels.

In [None]:
# Add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

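toronto_merged = toronto_data

# Join the sorted venues (with cluster labels) onto toronto_data, which holds the latitude/longitude
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# A neighborhood without any venues has no cluster label; drop it and keep the labels as integers
toronto_merged = toronto_merged.dropna(subset=['Cluster Labels'])
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)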

print(toronto_merged.shape)
toronto_merged.head()

#### Visualize the clusters

In [None]:
# Create a map instance
map_toronto = folium.Map(
    location=[location.latitude, location.longitude],
    zoom_start=11)

# Set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
for lat, lon, poi, cluster in zip(
        toronto_merged['Latitude'], toronto_merged['Longitude'],
        toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_toronto)
       
map_toronto

## 4. Examine the clusters

Now let us examine the clusters and see how they differ from each other in terms of popular venues.

#### Cluster 0

In [None]:
cluster_0 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0,
                               toronto_merged.columns[
                                   [2] + list(range(
                                       5, toronto_merged.shape[1]))]]
cluster_0

In [None]:
print(cluster_0.shape)

#### Cluster 1

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1,
                   toronto_merged.columns[
                       [2] + list(range(5, toronto_merged.shape[1]))]]


#### Cluster 2

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2,
                   toronto_merged.columns[
                       [2] + list(range(5, toronto_merged.shape[1]))]]


#### Cluster 3

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3,
                   toronto_merged.columns[
                       [2] + list(range(5, toronto_merged.shape[1]))]]


#### Cluster 4

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4,
                   toronto_merged.columns[
                       [2] + list(range(5, toronto_merged.shape[1]))]]


**Insights**: 
- There are 32 neighborhoods in the first cluster. Coffee shops and cafes appear to be very popular in these neighborhoods, and in Toronto in general.
- The remaining clusters are much smaller, and coffee shops are less common in them.
- Of the 5 clusters, 3 contain only 1 neighborhood each. These are the most distinct neighborhoods of all.
- Cluster 4, the last cluster, seems more suitable for outdoor activities, with parks, trails, playgrounds, etc. among its top 10 venues.
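
The cluster sizes mentioned above can be read directly from the label column instead of counting rows in each table. The labels below are made up for illustration; in the notebook, `toronto_merged['Cluster Labels']` would take their place.

```python
import pandas as pd

# Made-up cluster labels for ten neighborhoods.
labels = pd.Series([0, 0, 0, 1, 0, 2, 0, 4, 3, 4], name="Cluster Labels")

# Number of neighborhoods per cluster, ordered by cluster label.
sizes = labels.value_counts().sort_index()

# Clusters that contain exactly one neighborhood.
singletons = sizes[sizes == 1].index.tolist()

print(sizes)
print("Single-neighborhood clusters:", singletons)
```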

# Thank you!