# IBM Certificate: Capstone Project

_This notebook contains the code for the Capstone project (IBM Certificate)_

## 1. Segmenting and Clustering Neighborhoods in Toronto

### 1.1 Importing the data

We import the following libraries:
- pandas
- NumPy
- Requests


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import requests

The data we want are the **postal codes** in Canada, in particular the ones corresponding to **Toronto** in the province of Ontario. These data are available on Wikipedia, so the aim is to scrape them from the webpage.

This will be done using the library **_requests_**.

In [2]:
# URL of the wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# Extract the content
r = requests.get(url)

Let's check what `r` contains so we know how to retrieve, clean and organise the data.

In [3]:
# Check the status code
print('Status code: ', r.status_code)
# Check the encoding
print('Encoding: ', r.encoding)
# Check the data type
print('Data type: ', type(r))
# Check the header
print('Headers: ', r.headers['content-type'])

Status code:  200
Encoding:  UTF-8
Data type:  <class 'requests.models.Response'>
Headers:  text/html; charset=UTF-8


The response contains **HTML** content. Let's use _'read_html'_ from the **Pandas** library to extract the tables it contains.

In [4]:
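from io import StringIO

# Wrap the HTML text in StringIO: recent pandas versions deprecate passing a literal HTML string
raw_data = pd.read_html(StringIO(r.text))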

In [5]:
postal_codes = raw_data[0]

print('\n\n Dimension of the dataframe: ', postal_codes.shape)
postal_codes.head()



 Dimension of the dataframe:  (180, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### 1.2 Filtering the data

We need to filter the data in two steps:
- remove the rows whose borough is _'Not assigned'_
- where a borough has no assigned neighborhood, use the borough name as the neighborhood
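
As a minimal sketch of these two rules, the toy example below applies them to a small made-up table. The postal codes and rows are hypothetical and only illustrate the logic; they are not taken from the Wikipedia data.

```python
import pandas as pd

# Toy data with hypothetical postal codes, not the real Wikipedia table
toy = pd.DataFrame({
    'Postal Code': ['X1A', 'X2A', 'X3A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: drop the rows whose borough is not assigned
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Rule 2: where the neighborhood is not assigned, reuse the borough name
missing = toy['Neighborhood'] == 'Not assigned'
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

print(toy)
```

The `.copy()` after the first filter avoids modifying a slice of the original frame when the second rule writes to it.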

In [6]:
filter_1 = (postal_codes['Borough'] != 'Not assigned')
print('{} rows have a borough not assigned and need to be removed.'.format((~filter_1).sum()))
postal_codes = postal_codes[filter_1]

77 rows have a borough not assigned and need to be removed.


In [7]:
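filter_2 = (postal_codes['Neighborhood'] != 'Not assigned')
print('After the first filter is applied {} boroughs have no corresponding neighborhood.'.format((~filter_2).sum()))
# Use the borough name as the neighborhood where none is assigned (no row is affected here)
postal_codes = postal_codes.assign(Neighborhood=postal_codes['Neighborhood'].where(filter_2, postal_codes['Borough']))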

After the first filter is applied 0 boroughs have no corresponding neighborhood.


In [8]:
print(' The dataframe is now of dimension {}.'.format(postal_codes.shape))
postal_codes.head()

 The dataframe is now of dimension (103, 3).


Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


### 1.3 Adding latitude and longitude for each postal code

The latitude and longitude of each postal code are provided in a CSV file available at https://cocl.us/Geospatial_data

In [9]:
url_lat_lng = 'https://cocl.us/Geospatial_data'
lat_lng = pd.read_csv(url_lat_lng)

Let's preview the imported data.

In [10]:
lat_lng.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now let's **merge** the dataframe **postal_codes** with the table **lat_lng**, which contains the latitude and longitude of each postal code.

In [11]:
df = pd.merge(postal_codes, lat_lng, how='left', on='Postal Code', sort=True,validate='1:1')
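
The `validate='1:1'` argument in the merge above asks pandas to check that the merge key is unique on both sides. The sketch below illustrates that behaviour on two toy frames; the values are hypothetical and only the `pd.merge` call mirrors the notebook.

```python
import pandas as pd

# Toy frames with hypothetical values
left = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                     'Borough': ['Scarborough', 'Scarborough']})
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                      'Latitude': [43.8, 43.7]})

# Keys are unique on both sides, so the 1:1 check passes
merged = pd.merge(left, right, how='left', on='Postal Code', validate='1:1')

# Duplicating a key on the right side breaks the 1:1 assumption
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, right_dup, how='left', on='Postal Code', validate='1:1')
    raised = False
except pd.errors.MergeError:
    raised = True

print(merged.shape, raised)
```

Failing early on duplicated keys is useful here because a silent many-to-one match would change the number of rows and break the expected dimension check.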

Let's verify that the latitude and longitude were added for each postal code and that the resulting dataframe has the expected dimension of 103x5.

In [12]:
print('After merging the data the new dataframe is of dimension {} .'.format(df.shape))
print('\nFirst 10 rows sorted by \'Postal Code\' in descending order ')
df.sort_values(['Postal Code'], ascending = [False]).head(10)

After merging the data the new dataframe is of dimension (103, 5) .

First 10 rows sorted by 'Postal Code' in descending order 


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
102,M9W,Etobicoke,"Northwest, West Humber - Clairville",43.706748,-79.594054
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
98,M9N,York,Weston,43.706876,-79.518188
97,M9M,North York,"Humberlea, Emery",43.724766,-79.532242
96,M9L,North York,Humber Summit,43.756303,-79.565963
95,M9C,Etobicoke,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",43.643515,-79.577201
94,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.650943,-79.554724
93,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242


### 1.4 Segmenting the neighborhood

Let's import the library **folium**, which will allow us to show data on a map.

In [13]:
!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

Solving environment: done

# All requested packages already installed.



Let's set the **geographical coordinates of Toronto**, which we will use to centre the map of the neighborhoods.

In [36]:
toronto_latitude = 43.7
toronto_longitude = -79.45

Let's show a map of Toronto with one marker per neighborhood.

In [62]:
# create a map of Toronto centred on the coordinates defined above
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=10.5)

# add one marker per row of df, labelled with the neighborhood name
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

Declare the **Foursquare** credentials and API version. _(https://foursquare.com offers APIs that return, for a given set of coordinates, a list of venues and their features.)_

In [43]:
import os

# Read the credentials from environment variables (the variable names are illustrative) rather than hard-coding them
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [41]:
LIMIT = 100
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
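
The request URL in the function above is assembled with `str.format`. As a hedged alternative sketch, the query string can be built from a dictionary so that each value is URL-encoded automatically. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used in the notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Hypothetical helper: build the same explore URL as above from a dict of parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude as a single value
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes each value, so special characters cannot break the URL
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials, for illustration only
url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 43.7, -79.45)

# Parse the query string back to check what was encoded
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['radius'])
```

With the Requests library the same dictionary can also be passed directly as `requests.get(base_url, params=params)`, which performs the encoding itself.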

In [None]:
toronto_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

In [51]:
print('The dataframe with all venues for each neighborhood has {} rows and {} columns'.format(toronto_venues.shape[0],toronto_venues.shape[1]))
toronto_venues.head(5)

The dataframe with all venues for each neighborhood has 2132 rows and 7 columns


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood, Morningside, West Hill",43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant


Let's build a one-hot encoded matrix with one row per venue and one column per venue category.

In [54]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


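Let's group the rows by neighborhood and take the mean of each column, which gives the share of each venue category among the venues of a neighborhood.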

In [66]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
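
To make the one-hot encoding and the per-neighborhood averaging above concrete, here is a small sketch on made-up data. The neighborhoods `A` and `B` and their venues are hypothetical; the sketch inserts the `Neighborhood` column at position 0 explicitly, which is a slightly different way of ordering the columns than the cell above.

```python
import pandas as pd

# Toy venues: neighborhood A has four venues, B has one (hypothetical data)
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Bank', 'Park', 'Bank'],
})

# One column per category, one row per venue, with 0/1 indicators
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

In neighborhood `A`, two of the four venues are coffee shops, so the `Coffee Shop` column averages to 0.5.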


We can check, for each neighborhood, which venue categories are the most frequent, measured as a share of the venues returned for that neighborhood.

In [61]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0               Skating Rink  0.25
1                     Lounge  0.25
2             Breakfast Spot  0.25
3  Latin American Restaurant  0.25
4                Men's Store  0.00


----Alderwood, Long Branch----
            venue  freq
0     Pizza Place   0.2
1    Skating Rink   0.1
2  Sandwich Place   0.1
3    Dance Studio   0.1
4             Pub   0.1


----Bathurst Manor, Wilson Heights, Downsview North----
                       venue  freq
0                       Bank  0.09
1                Coffee Shop  0.09
2                Bridal Shop  0.04
3              Shopping Mall  0.04
4  Middle Eastern Restaurant  0.04


----Bayview Village----
                 venue  freq
0                 Café  0.25
1  Japanese Restaurant  0.25
2                 Bank  0.25
3   Chinese Restaurant  0.25
4  Moroccan Restaurant  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0          Restaurant  0.09
1  Italian Restaurant  0.09

In [63]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
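
As a quick illustration of how this function behaves, the sketch below repeats its definition and applies it to a single hand-made row. The row values are hypothetical; the first entry plays the role of the `Neighborhood` column, which the function skips.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighborhood name) and sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# One toy row shaped like a row of toronto_grouped (hypothetical values)
row = pd.Series({'Neighborhood': 'Toy Hood',
                 'Bank': 0.1,
                 'Coffee Shop': 0.6,
                 'Park': 0.3})

top = return_most_common_venues(row, 2)
print(list(top))
```

Note that when a neighborhood has fewer distinct categories than `num_top_venues`, the remaining slots are filled with categories whose frequency is 0, so the lower-ranked columns should be read with care.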

In [65]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
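for ind in np.arange(num_top_venues):
    try:
        # 1st, 2nd and 3rd take their suffix from the indicators list
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # every later rank uses the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))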

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Latin American Restaurant,Skating Rink,Breakfast Spot,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Skating Rink,Dance Studio,Pharmacy,Pool,Pub,Sandwich Place,Gym,American Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Park,Fried Chicken Joint,Bridal Shop,Sandwich Place,Diner,Deli / Bodega,Restaurant,Ice Cream Shop
3,Bayview Village,Café,Bank,Chinese Restaurant,Japanese Restaurant,Women's Store,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Sandwich Place,Italian Restaurant,Restaurant,Thai Restaurant,Juice Bar,Indian Restaurant,Butcher,Café,Pub


### 1.5 Clustering the neighborhoods

In [71]:
# import k-means from scikit-learn's clustering module
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

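# keep only the numeric frequency columns (the positional axis argument is not accepted by recent pandas)
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')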

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 0, 4, 4, 4, 4, 4, 4, 4, 4], dtype=int32)
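
To give an intuition of what the clustering step does, here is a minimal NumPy sketch of the basic k-means loop (Lloyd's algorithm) on toy points. The function `lloyd_kmeans` is a hypothetical helper written for illustration; it is not scikit-learn's implementation, which among other things uses a more careful initialisation.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Toy k-means (Lloyd's algorithm); illustrative only, not scikit-learn's code."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (fancy indexing returns a copy)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every row to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its rows; leave it unchanged if it has none
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated toy blobs of three points each
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centers = lloyd_kmeans(X, k=2)
print(labels)
```

In the notebook, each row of `toronto_grouped_clustering` plays the role of a point, and its coordinates are the category frequencies of one neighborhood.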

In [148]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

# join df with neighborhoods_venues_sorted to add the cluster label and top venues of each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# drop the neighborhoods that received no cluster label (those absent from toronto_venues)
toronto_merged = toronto_merged.dropna()
toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,1.0,Fast Food Restaurant,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Women's Store,Department Store
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,3.0,Bar,Women's Store,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Donut Shop,Fast Food Restaurant
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,4.0,Medical Center,Rental Car Location,Breakfast Spot,Electronics Store,Mexican Restaurant,Bank,Intersection,Dog Run,Discount Store,Distribution Center
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4.0,Coffee Shop,Indian Restaurant,Korean Restaurant,Soccer Field,Comic Shop,Comfort Food Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,4.0,Hakka Restaurant,Thai Restaurant,Bank,Fried Chicken Joint,Athletics & Sports,Caribbean Restaurant,Gas Station,Bakery,Doner Restaurant,Dog Run


In [78]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [162]:
toronto_merged.dtypes

Postal Code                object
Borough                    object
Neighborhood               object
Latitude                  float64
Longitude                 float64
Cluster Labels              int32
1st Most Common Venue      object
2nd Most Common Venue      object
3rd Most Common Venue      object
4th Most Common Venue      object
5th Most Common Venue      object
6th Most Common Venue      object
7th Most Common Venue      object
8th Most Common Venue      object
9th Most Common Venue      object
10th Most Common Venue     object
dtype: object

Before we can proceed to the next step, we need to change the data type of the column **'Cluster Labels'** from _float64_ to _int32_, because the labels are used below as list indices to pick the color of each marker.

In [163]:
# astype returns a new dataframe and has no inplace argument
toronto_merged = toronto_merged.astype({'Cluster Labels': 'int32'})
toronto_merged['Cluster Labels'].dtype

dtype('int32')
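
The conversion above is needed most likely because the left join introduced missing labels: an integer column that receives `NaN` is stored as _float64_ by pandas. The sketch below reproduces this on toy data; the neighborhoods `A`, `B` and `C` are hypothetical.

```python
import pandas as pd

# Three neighborhoods, but only two of them received a cluster label
left = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Neighborhood': ['A', 'B'], 'Cluster Labels': [1, 0]})

# The left join keeps C and fills its label with NaN, which forces float64
joined = left.join(labels.set_index('Neighborhood'), on='Neighborhood')
print(joined['Cluster Labels'].dtype)

# Once the incomplete rows are dropped, the labels can be cast back to integers
cleaned = joined.dropna().astype({'Cluster Labels': 'int32'})
print(cleaned['Cluster Labels'].dtype)
```

Casting before dropping the missing rows would fail, since `NaN` has no integer representation in an `int32` column.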

In [165]:
# create map
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters