<h1> Coursera Capstone Assignment: <br>Segmenting and Clustering Neighborhoods in Toronto </h1>

<h2> 1) Create dataframe with Toronto neighborhoods </h2>

In [2]:
import requests # library to handle requests to websites

from bs4 import BeautifulSoup #library to get wikipedia content

import numpy as np

In [3]:
#define variable with the url of the postal codes in Toronto
website_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#download the page and keep its HTML as text
result = requests.get(website_url).text

In [4]:
#create Beautifulsoup object
soup = BeautifulSoup(result, 'lxml')
#print(soup.prettify()) #prettify enables to view how the tags are nested in the document
#I do not print the result to reduce space

The Wikipedia page is downloaded as HTML. Now I need to extract the table with the postal codes. <br>
The table is a 'table' element with the class 'wikitable sortable'.

In [5]:
#create a variable which contains only the table with postal codes
toronto_table = soup.find('table',{'class':'wikitable sortable'})

#print(toronto_table)
#I do not print the result to reduce space

Only the table is extracted from the HTML. Now I will get the data from the table. <br> The <i>postcode, borough, neighborhood</i> cells always appear in this order: <br>
<td>Postcode</td> <br>
<td>Borough</td> <br>
<td>Neighbourhood</td> <br>

In [6]:
#extract the postcode, borough, neighborhood
toronto_table_reduced = toronto_table.find_all('td')

#view sample
toronto_table_reduced[0:10]

[<td>M1A</td>, <td>Not assigned</td>, <td>Not assigned
 </td>, <td>M2A</td>, <td>Not assigned</td>, <td>Not assigned
 </td>, <td>M3A</td>, <td><a href="/wiki/North_York" title="North York">North York</a></td>, <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td>, <td>M4A</td>]

toronto_table_reduced contains the postcode, borough and neighborhood cells as one flat list. Now I will create a pandas dataframe from it.
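The flat list has to be read three cells at a time to get table rows back. A minimal sketch of that grouping, assuming the cell texts are already extracted; `cells` is hypothetical sample data and `cells_to_rows` is a helper name made up for illustration:

```python
# Hypothetical sample: cell texts in the order they appear in the table.
cells = ['M1A', 'Not assigned', 'Not assigned',
         'M3A', 'North York', 'Parkwoods',
         'M7A', "Queen's Park", 'Not assigned']


def cells_to_rows(cells, width=3):
    """Group a flat list of cell texts into rows of `width` cells."""
    return [tuple(cells[i:i + width]) for i in range(0, len(cells), width)]


rows = cells_to_rows(cells)
for postcode, borough, neighborhood in rows:
    print(postcode, borough, neighborhood, sep=' | ')
```

This only works if the number of cells is a multiple of three, which holds for a table with exactly three columns.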

In [7]:
import pandas as pd # import pandas library

In [8]:
# define the dataframe columns
column_names = ['Postcode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
toronto_postcodes = pd.DataFrame(columns=column_names)

In [9]:
#iterate through toronto_table_reduced three cells at a time and collect the rows
rows = []
for item in range(0, len(toronto_table_reduced), 3):
    postcode = toronto_table_reduced[item].get_text(strip=True) #get just the text and ignore html
    borough = toronto_table_reduced[item+1].get_text(strip=True) #get just the text and ignore html
    neighborhood = toronto_table_reduced[item+2].get_text(strip=True) #get just the text and ignore html

    if neighborhood == 'Not assigned': #if the neighborhood has no name, then use the borough name
        neighborhood = borough

    if borough != 'Not assigned': #only keep the boroughs that have a name
        rows.append({'Postcode': postcode, 'Borough': borough, 'Neighborhood': neighborhood})

#build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
toronto_postcodes = pd.DataFrame(rows, columns=column_names)

#print dataframe to check
toronto_postcodes.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


The dataframe still contains several rows for the same postcode, one per neighborhood. I will combine them now. <br>
I will iterate through the table and place the neighborhoods with the same postcode in the same row.

In [10]:
#iterate through dataframe
for i in range(toronto_postcodes.shape[0]-1, 0 , -1): #iterate from the bottom of the table
    if toronto_postcodes.iloc[i, 0] == toronto_postcodes.iloc[i-1, 0]: #if the postcode is the same as in the previous row
        toronto_postcodes.iloc[i-1, 2] += ', ' + toronto_postcodes.iloc[i, 2] #combine the neighborhoods
    
#now I remove the rows with duplicate postcodes
toronto_postcodes.drop_duplicates(subset= 'Postcode', keep= 'first', inplace = True)
toronto_postcodes.reset_index(inplace=True)

#print the dataframe sample
toronto_postcodes.head(10)

Unnamed: 0,index,Postcode,Borough,Neighborhood
0,0,M3A,North York,Parkwoods
1,1,M4A,North York,Victoria Village
2,2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,4,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,6,M7A,Queen's Park,Queen's Park
5,7,M9A,Etobicoke,Islington Avenue
6,8,M1B,Scarborough,"Rouge, Malvern"
7,10,M3B,North York,Don Mills North
8,11,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,13,M5B,Downtown Toronto,"Ryerson, Garden District"
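The same combination can also be done without an explicit loop. A minimal sketch using `groupby` with a string join, on hypothetical sample rows; note that this variant does not produce the extra `index` column seen in the output above:

```python
import pandas as pd

# Hypothetical sample with one row per (postcode, neighborhood) pair.
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A', 'M6A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Regent Park',
                     'Lawrence Heights', 'Lawrence Manor', "Queen's Park"],
})

# Join all neighborhoods that share a postcode; sort=False keeps the original order.
combined = (
    sample.groupby(['Postcode', 'Borough'], sort=False)['Neighborhood']
          .agg(', '.join)
          .reset_index()
)
print(combined)
```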


The pandas dataframe with Toronto postcodes is now ready. I check its shape.

In [11]:
toronto_postcodes.shape

(103, 4)

<h2> 2) Get geo coordinates for each postcode </h2>

I tried to use Geocoder as indicated in the assignment description, but I could not get any data.

In [12]:
#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

In [13]:
#disabled attempt to geocode each postcode with Nominatim
#latitude = []
#longitude = []
#geolocator = Nominatim(user_agent="foursquare_agent")
#for index, row in toronto_postcodes.iterrows():
#    address = '{}, Toronto, Ontario'.format(row['Postcode']) #row[0] would be the old index column, not the postcode
#    location = geolocator.geocode(address) #returns None when nothing is found
#    latitude.append(location.latitude)
#    longitude.append(location.longitude)


I also tried to use Geopy, but it could not recognize the postal codes. <br>
Finally, I use the csv file provided.
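One detail worth guarding against when geocoding in a loop: geopy's `geocode` returns `None` when it finds nothing, so reading `.latitude` on the result fails. A minimal sketch of a guarded loop follows. `geocode_postcodes`, `FakeLocation` and `fake_geocode` are hypothetical names, and the fake geocoder only stands in for a real one so the example runs offline; with geopy one would pass `geolocator.geocode` instead.

```python
def geocode_postcodes(postcodes, geocode):
    """Map each postcode to (latitude, longitude), or None when the lookup fails."""
    coordinates = {}
    for postcode in postcodes:
        location = geocode('{}, Toronto, Ontario'.format(postcode))
        if location is None:
            coordinates[postcode] = None  # keep the gap visible instead of crashing
        else:
            coordinates[postcode] = (location.latitude, location.longitude)
    return coordinates


class FakeLocation:
    """Stand-in for a geocoder result (hypothetical, for offline use)."""
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    known = {'M1B, Toronto, Ontario': FakeLocation(43.806686, -79.194353)}
    return known.get(address)  # None for unknown addresses


coords = geocode_postcodes(['M1B', 'M9Z'], fake_geocode)
print(coords)
```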

In [14]:
#read the csv file with coordinates
df_coordinates = pd.read_csv('Geospatial_Coordinates.csv')
df_coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [15]:
#rename "Postal Code" to 'Postcode' to match with my dataframe
df_coordinates.rename(columns={'Postal Code':'Postcode'}, inplace=True)

In [16]:
#merge the dataframes
toronto_postcodes = toronto_postcodes.merge(df_coordinates,on='Postcode')
toronto_postcodes.head()

Unnamed: 0,index,Postcode,Borough,Neighborhood,Latitude,Longitude
0,0,M3A,North York,Parkwoods,43.753259,-79.329656
1,1,M4A,North York,Victoria Village,43.725882,-79.315572
2,2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,4,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,6,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


Now the table with Toronto postcodes and their GPS coordinates is ready.
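The default inner merge silently drops postcodes that have no coordinates. A minimal sketch of how to check for that, using a left merge with `indicator=True` on hypothetical sample data (`M9Z` is a made-up postcode without coordinates):

```python
import pandas as pd

postcodes = pd.DataFrame({'Postcode': ['M3A', 'M4A', 'M9Z'],
                          'Borough': ['North York', 'North York', 'Unknown']})
coordinates = pd.DataFrame({'Postcode': ['M3A', 'M4A'],
                            'Latitude': [43.753259, 43.725882],
                            'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every postcode; the _merge column records where each row was found.
checked = postcodes.merge(coordinates, on='Postcode', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```

An empty `missing` list would confirm that every postcode received coordinates.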

<h2> 3) Explore and cluster the neighborhoods in Toronto </h2>

<h3> Explore the neighborhoods with Foursquare API </h3>

I will analyze Toronto by borough. <br>
Because several postal codes share the same borough name, I will use the combination of borough name and postal code as the reference.

In [17]:
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

In [18]:
CLIENT_ID = 'ID' # your Foursquare ID
CLIENT_SECRET = 'SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
radius = 500
LIMIT = 100

A function that checks nearby venues for each borough in Toronto. It is borrowed from the class.

In [19]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough code', #combination of borough name and postal code
                  'Borough Latitude', 
                  'Borough Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
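Two parts of the function above can be isolated and tried without calling the API: building the request URL and flattening the `items` list. A minimal sketch under stated assumptions: `build_explore_url` and `flatten_items` are hypothetical helper names, `sample_items` is made-up data that only mimics the nesting the function reads, and skipping venues with an empty `categories` list is a design choice of this sketch, not something the original function does.

```python
from urllib.parse import urlencode


def build_explore_url(lat, lng, client_id, client_secret, version, radius=500, limit=100):
    """Build the explore request URL; urlencode escapes every parameter value."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


def flatten_items(name, lat, lng, items):
    """Turn the 'items' list of one response into flat rows, skipping uncategorized venues."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        if not categories:
            continue  # categories[0] would raise IndexError here
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     categories[0]['name']))
    return rows


# Hypothetical response fragment with the same nesting as the real response.
sample_items = [
    {'venue': {'name': 'Some Park', 'location': {'lat': 43.7519, 'lng': -79.3326},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unlabelled spot', 'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]

url = build_explore_url(43.753259, -79.329656, 'ID', 'SECRET', '20180605')
venue_rows = flatten_items('North York M3A', 43.753259, -79.329656, sample_items)
print(url)
print(venue_rows)
```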

Run the function for each borough.

In [20]:
toronto_venues = getNearbyVenues(names=toronto_postcodes['Borough'] + ' ' + toronto_postcodes['Postcode'], #Create combination of borough name and postal code
                                   latitudes=toronto_postcodes['Latitude'],
                                   longitudes=toronto_postcodes['Longitude']
                                  )

North York M3A
North York M4A
Downtown Toronto M5A
North York M6A
Queen's Park M7A
Etobicoke M9A
Scarborough M1B
North York M3B
East York M4B
Downtown Toronto M5B
North York M6B
Etobicoke M9B
Scarborough M1C
North York M3C
East York M4C
Downtown Toronto M5C
York M6C
Etobicoke M9C
Scarborough M1E
East Toronto M4E
Downtown Toronto M5E
York M6E
Scarborough M1G
East York M4G
Downtown Toronto M5G
Downtown Toronto M6G
Scarborough M1H
North York M2H
North York M3H
East York M4H
Downtown Toronto M5H
West Toronto M6H
Scarborough M1J
North York M2J
North York M3J
East York M4J
Downtown Toronto M5J
West Toronto M6J
Scarborough M1K
North York M2K
North York M3K
East Toronto M4K
Downtown Toronto M5K
West Toronto M6K
Scarborough M1L
North York M2L
North York M3L
East Toronto M4L
Downtown Toronto M5L
North York M6L
North York M9L
Scarborough M1M
North York M2M
North York M3M
East Toronto M4M
North York M5M
York M6M
North York M9M
Scarborough M1N
North York M2N
North York M3N
Central Toronto M4N
Centr

The result is a dataframe:

In [21]:
print(toronto_venues.shape)

(2258, 7)


Let's check how many venues were returned for each borough

In [23]:
toronto_venues.groupby('Borough code').Venue.count().head(10)

Borough code
Central Toronto M4N      3
Central Toronto M4P      7
Central Toronto M4R     24
Central Toronto M4S     34
Central Toronto M4T      1
Central Toronto M4V     15
Central Toronto M5N      1
Central Toronto M5P      5
Central Toronto M5R     24
Downtown Toronto M4W     5
Name: Venue, dtype: int64

The dataframe has 103 postcodes in total, but only 99 borough codes appear here, so four of them returned no venues.
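The borough codes without venues can be listed by comparing the full set of codes with the codes present in the venues dataframe. A minimal sketch on hypothetical sample data; in the notebook the two inputs would be the combined borough and postcode strings and `toronto_venues['Borough code']`:

```python
import pandas as pd

# Hypothetical sample: three borough codes, only two of which returned venues.
all_codes = pd.Series(['North York M3A', 'Etobicoke M9A', 'Scarborough M1B'])
venues = pd.DataFrame({
    'Borough code': ['North York M3A', 'North York M3A', 'Scarborough M1B'],
    'Venue': ['Park A', 'Shop B', 'Cafe C'],
})

# Keep the codes that never occur in the venues table.
without_venues = all_codes[~all_codes.isin(venues['Borough code'])].tolist()
print(without_venues)
```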

<h3> Analyze Each Neighborhood </h3>

In [26]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add the borough code column back to the dataframe
toronto_onehot['Borough code'] = toronto_venues['Borough code']

# move the borough code column to the first position
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Borough code,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,North York M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,North York M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,North York M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,North York M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,North York M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group the rows by borough code and take the mean of the frequency of occurrence of each category.

In [27]:
toronto_grouped = toronto_onehot.groupby('Borough code').mean().reset_index()
toronto_grouped

Unnamed: 0,Borough code,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Central Toronto M4N,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000
1,Central Toronto M4P,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000
2,Central Toronto M4R,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.041667
3,Central Toronto M4S,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000
4,Central Toronto M4T,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000
5,Central Toronto M4V,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.066667,...,0.00,0.000000,0.000000,0.0,0.066667,0.0,0.000000,0.000000,0.0,0.000000
6,Central Toronto M5N,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000
7,Central Toronto M5P,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000
8,Central Toronto M5R,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.041667,...,0.00,0.041667,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000
9,Downtown Toronto M4W,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000
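To see why the mean of the one-hot columns is a frequency, here is a minimal sketch on a hypothetical toy table with two borough codes. Borough `A` has four venues, two of which are parks, so its `Park` value becomes 2/4 = 0.5.

```python
import pandas as pd

# Hypothetical toy data: four venues in borough A, two in borough B.
venues = pd.DataFrame({
    'Borough code': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Cafe', 'Gym', 'Cafe', 'Cafe'],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Borough code', venues['Borough code'])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby('Borough code').mean().reset_index()
print(grouped)
```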


First, let's write a function to sort the venues in descending order.

In [28]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each borough code.

In [29]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no special suffix beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
boroughs_venues_sorted = pd.DataFrame(columns=columns)
boroughs_venues_sorted['Borough code'] = toronto_grouped['Borough code']

for ind in np.arange(toronto_grouped.shape[0]):
    boroughs_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

boroughs_venues_sorted.head()

Unnamed: 0,Borough code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto M4N,Bus Line,Park,Swim School,Yoga Studio,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
1,Central Toronto M4P,Hotel,Sandwich Place,Park,Gym,Breakfast Spot,Clothing Store,Food & Drink Shop,Eastern European Restaurant,Doner Restaurant,Donut Shop
2,Central Toronto M4R,Sporting Goods Shop,Clothing Store,Coffee Shop,Mexican Restaurant,Diner,Dessert Shop,Cosmetics Shop,Park,Pet Store,Bagel Shop
3,Central Toronto M4S,Sandwich Place,Dessert Shop,Coffee Shop,Pharmacy,Sushi Restaurant,Italian Restaurant,Café,Pizza Place,Gym,Chinese Restaurant
4,Central Toronto M4T,Restaurant,Yoga Studio,Dim Sum Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant


The result shows the top 10 venue categories for each borough code.
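The column names above rely on a list of three suffixes plus a fallback, which is fine for 10 columns but would give '21th' for larger numbers. A minimal sketch of a more general suffix helper; `ordinal` is a hypothetical name, not part of the course code:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Borough code'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```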

<h3> Cluster boroughs </h3>

In [30]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [31]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Borough code')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 2, 2, 2, 2, 1, 0, 2, 0])

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each borough code.

In [32]:
# add clustering labels
boroughs_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_postcodes.copy() # copy so that toronto_postcodes stays unchanged
toronto_merged['Borough code'] = toronto_merged['Borough'] + ' ' + toronto_merged['Postcode']

# join the cluster labels and top venues onto the postcode table, which holds latitude/longitude
toronto_merged = toronto_merged.join(boroughs_venues_sorted.set_index('Borough code'), on='Borough code')

toronto_merged.head(6) # check the last columns!

Unnamed: 0,index,Postcode,Borough,Neighborhood,Latitude,Longitude,Borough code,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,M3A,North York,Parkwoods,43.753259,-79.329656,North York M3A,0.0,Food & Drink Shop,Park,Pool,Yoga Studio,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
1,1,M4A,North York,Victoria Village,43.725882,-79.315572,North York M4A,2.0,Intersection,Hockey Arena,French Restaurant,Coffee Shop,Portuguese Restaurant,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop
2,2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636,Downtown Toronto M5A,2.0,Coffee Shop,Pub,Bakery,Park,Café,Breakfast Spot,Restaurant,Gym / Fitness Center,Mexican Restaurant,Health Food Store
3,4,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,North York M6A,2.0,Clothing Store,Accessories Store,Coffee Shop,Shoe Store,Miscellaneous Shop,Furniture / Home Store,Boutique,Vietnamese Restaurant,Women's Store,Cosmetics Shop
4,6,M7A,Queen's Park,Queen's Park,43.662301,-79.389494,Queen's Park M7A,2.0,Coffee Shop,Gym,Diner,Park,Japanese Restaurant,Chinese Restaurant,Smoothie Shop,Seafood Restaurant,Sandwich Place,Burger Joint
5,7,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242,Etobicoke M9A,,,,,,,,,,,


Not all boroughs have venues, so some rows contain NaNs. As a result, the type of Cluster Labels changed to float. <br>
I remove the NaN rows and change the type to int.

In [33]:
toronto_merged.dropna(axis=0, how='any', inplace = True)
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)
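If the rows without venues should stay in the table, pandas' nullable `Int64` type is an alternative to dropping them. A minimal sketch on a hypothetical three-row series, comparing it with the drop-then-cast approach used above:

```python
import pandas as pd

# Hypothetical labels: two clustered rows and one row without venues.
labels = pd.Series([0.0, 2.0, float('nan')], name='Cluster Labels')

as_nullable = labels.astype('Int64')        # keeps the missing value as <NA>
as_plain_int = labels.dropna().astype(int)  # the approach used above

print(as_nullable)
print(as_plain_int)
```

Rows with a missing label would still need to be skipped when drawing the map, since they have no cluster color.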

This is the final table. Now I will visualize the results on the map.

In [42]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import folium # map rendering library

In [46]:
# create map
latitude = toronto_merged['Latitude'].iloc[0] # first remaining row, whatever its index label is
longitude = toronto_merged['Longitude'].iloc[0]
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Borough code'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The map shows a clear imbalance: most of the boroughs are light blue. I will investigate the numbers.

In [47]:
toronto_merged.groupby(['Cluster Labels']).Postcode.count()

Cluster Labels
0    15
1     1
2    81
3     1
4     1
Name: Postcode, dtype: int64

There are 81 boroughs in cluster 2. On the other hand, clusters 1, 3 and 4 contain only one borough each.
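Expressed as shares, the imbalance is easier to read: 81 of 99 is about 0.818, so roughly 82% of the borough codes fall into cluster 2. A minimal sketch of the arithmetic, with the counts copied from the output above:

```python
import pandas as pd

# Cluster sizes copied from the groupby output above.
counts = pd.Series({0: 15, 1: 1, 2: 81, 3: 1, 4: 1}, name='Postcodes')

# Share of all clustered borough codes that falls into each cluster.
shares = (counts / counts.sum()).round(3)
print(shares)
```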