# Coursera Capstone Project Week 3 Assignment

## Alejandro González Casal 29/02/2021

## 1 - Scraping the web and preparing the dataframe

### 1.1 - The first step is to install the BeautifulSoup library

In [1]:
!pip install bs4



### 1.2 - Import libraries

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np

### 1.3 - Download the web page, save the text response, and parse it with BeautifulSoup

In [3]:
html_data = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
html_data_text = html_data.text
html_data.headers

{'Date': 'Mon, 01 Mar 2021 15:43:36 GMT', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'Server': 'ATS/8.0.8', 'X-Content-Type-Options': 'nosniff', 'P3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'Content-Language': 'en', 'X-Request-Id': 'YDeAj8sayba4-hwRD6ixBwAAAIc', 'Last-Modified': 'Wed, 24 Feb 2021 11:44:32 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Encoding': 'gzip', 'Age': '71586', 'X-Cache': 'cp1081 miss, cp1079 hit/52', 'X-Cache-Status': 'hit-front', 'Server-Timing': 'cache;desc="hit-front"', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'Report-To': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'NEL': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'Set-Cookie': 'WMF-Last-Access

In [4]:
soup = BeautifulSoup(html_data_text, 'html.parser')

In [5]:
soup.title

<title>List of postal codes of Canada: M - Wikipedia</title>

### 1.4 - Extracting the table row by row using BeautifulSoup

* First, the canada_data dataframe is created with the required columns.
* Then a loop iterates through the table's rows. The header row is skipped because its cells use a different tag (th), so it returns an empty list in: col = row.find_all("td")
* The text of each column is stored in its own variable, with the line-break characters removed, and the row is then added to the dataframe.
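
The row-by-row extraction described above can be tried offline on a small table. This is a minimal sketch, assuming a hypothetical `sample_html` string that only mimics the layout of the Wikipedia table (it is not the real page):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
sample_html = """
<table><tbody>
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A\n</td><td>Not assigned\n</td><td>Not assigned\n</td></tr>
<tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for row in sample_soup.find("tbody").find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        # The header row only contains <th> cells, so it is skipped here.
        continue
    # Strip the line-break character that ends each cell's text.
    records.append([cell.text.replace("\n", "") for cell in cells[:3]])

sample_df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighbourhood"])
print(sample_df)
```

Collecting the rows in a list and building the dataframe once at the end is an alternative to adding rows one at a time; both give the same table.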

In [6]:
#soup.find("tbody").find_all("tr")

In [7]:
canada_data = pd.DataFrame(columns=["PostalCode", "Borough", "Neighbourhood"])

for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    if len(col) != 0:
        postalCode = col[0].text.replace("\n","")
        borough = col[1].text.replace("\n","")
        neighbourhood = col[2].text.replace("\n","")
        # DataFrame.append was removed in pandas 2.0; add the row at the next integer label instead
        canada_data.loc[len(canada_data)] = [postalCode, borough, neighbourhood]

canada_data

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


### 1.5 - Turn the 'Not assigned' values into NaN and drop them

In [8]:
canada_data['Borough'] = canada_data['Borough'].replace("Not assigned", np.nan)
canada_data = canada_data.dropna().reset_index(drop=True)
canada_data

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


### 1.6 - Display the result dataframe shape

#### IMPORTANT NOTE: It isn't necessary to merge rows that share a postal code, since the current version of the Wikipedia page already does this.

In [9]:
canada_data.shape

(103, 3)
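
If a page revision listed one neighbourhood per row instead, the merge mentioned in the note could be done with a groupby. The following is only a sketch with hypothetical rows in that older layout; the `raw` and `merged` names are illustrative:

```python
import pandas as pd

# Hypothetical rows in a one-neighbourhood-per-row layout.
raw = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Join the neighbourhoods that share a postal code into one comma-separated cell.
merged = (
    raw.groupby(["PostalCode", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(merged)
```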

## 2 - Getting geographical coordinates using the Pgeocode Python package

### 2.1 - Install and import Pgeocode library

In [10]:
!pip install pgeocode



In [11]:
import pgeocode

### 2.2 - Using the Pgeocode library to extract the geographical coordinates for each postal code

In [12]:
# create the geolocator for Canada:
geolocator = pgeocode.Nominatim('ca')
# iterate over the rows of the dataframe:
for i in canada_data.index:
    # get the postal code:
    postal_code = canada_data.at[i,'PostalCode']
    # query_postal_code returns a pandas Series (with NaN fields for unknown
    # codes) rather than None, so a single call per row is enough:
    g = geolocator.query_postal_code(postal_code)
    # add the coordinates to the dataframe (2 new columns)
    canada_data.at[i,'Latitude'] = g.latitude
    canada_data.at[i,'Longitude'] = g.longitude
    #print(f'Row {i} done, coordinates {g.latitude},{g.longitude}')
canada_data

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.3300
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.6518,-79.5076
99,M4Y,Downtown Toronto,Church and Wellesley,43.6656,-79.3830
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.7804,-79.2505
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.6325,-79.4939
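
Before moving on, it can be worth checking that every postal code actually received coordinates. This is a small sketch on hypothetical data, assuming that an unknown code ends up with NaN in both coordinate columns; the `M0X` row is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical result: the second postal code was not found, so its coordinates are NaN.
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M0X"],
    "Latitude": [43.7545, np.nan],
    "Longitude": [-79.3300, np.nan],
})

# Keep the rows where at least one coordinate is missing.
missing = coords[coords[["Latitude", "Longitude"]].isna().any(axis=1)]
print(missing["PostalCode"].tolist())
```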


## 3 - Explore and cluster the neighborhoods in Toronto

### 3.1 - Filter the canada dataframe to the boroughs that contain Toronto in their names.

In [13]:
toronto_data = canada_data[canada_data['Borough'].str.contains("Toronto", case=False)].reset_index(drop=True)

In [14]:
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756
4,M4E,East Toronto,The Beaches,43.6784,-79.2941


### 3.2 - Define Foursquare Credentials and Version

In [39]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 
CLIENT_SECRET:


### 3.2 - Define getNearbyVenues function

In [16]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        print(name,len(results))
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    print(venues_list)
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
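
The parsing step of the function can be exercised without calling the API. The sketch below uses a hypothetical payload that only reproduces the fields the function reads (response, groups, items, venue), plus a hypothetical helper `extract_venue_rows` that also tolerates a venue without categories; neither is part of the notebook or of the Foursquare client:

```python
# Hypothetical payload shaped like the fields getNearbyVenues reads;
# it is not real API output.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Tandem Coffee",
                   "location": {"lat": 43.653559, "lng": -79.361809},
                   "categories": [{"name": "Coffee Shop"}]}},
        {"venue": {"name": "Unnamed Spot",
                   "location": {"lat": 43.65, "lng": -79.36},
                   "categories": []}},
    ]}]}
}


def extract_venue_rows(payload, name, lat, lng):
    """Hypothetical helper: flatten one explore response into row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0]["items"] if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


rows = extract_venue_rows(sample_payload, "Regent Park, Harbourfront", 43.6555, -79.3626)
for row in rows:
    print(row)
```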

In [17]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude'])

Regent Park, Harbourfront 100
Queen's Park, Ontario Provincial Government 100
Garden District, Ryerson 100
St. James Town 100
The Beaches 67
Berczy Park 100
Central Bay Street 100
Christie 100
Richmond, Adelaide, King 100
Dufferin, Dovercourt Village 100
Harbourfront East, Union Station, Toronto Islands 24
Little Portugal, Trinity 100
The Danforth West, Riverdale 100
Toronto Dominion Centre, Design Exchange 100
Brockton, Parkdale Village, Exhibition Place 100
India Bazaar, The Beaches West 76
Commerce Court, Victoria Hotel 100
Studio District 100
Lawrence Park 59
Roselawn 17
Davisville North 91
Forest Hill North & West, Forest Hill Road Park 48
High Park, The Junction South 100
North Toronto West,  Lawrence Park 66
The Annex, North Midtown, Yorkville 100
Parkdale, Roncesvalles 90
Davisville 100
University of Toronto, Harbord 100
Runnymede, Swansea 77
Moore Park, Summerhill East 78
Kensington Market, Chinatown, Grange Park 100
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer

In [18]:
print(toronto_venues.shape)
toronto_venues.head()

(3315, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.6555,-79.3626,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.6555,-79.3626,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.6555,-79.3626,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.6555,-79.3626,Sumach Espresso,43.658135,-79.359515,Coffee Shop
4,"Regent Park, Harbourfront",43.6555,-79.3626,Impact Kitchen,43.656369,-79.35698,Restaurant


The cell below is only needed when the radius is 500 or less; in that case Roselawn has no venues, which causes an error further on.

In [19]:
#Drop Roselawn from toronto_data to avoid future problems:
print(toronto_data.shape)
#toronto_data = toronto_data.drop((toronto_data[toronto_data['Neighbourhood'] == 'Roselawn']).index)
toronto_data.shape

(39, 5)


(39, 5)
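
Rather than naming Roselawn by hand, the neighbourhoods without venues could be detected from the data. This is a minimal sketch on hypothetical miniature frames whose column names mirror the ones used in this notebook:

```python
import pandas as pd

# Hypothetical miniature frames; only the column names come from the notebook.
neighbourhoods = pd.DataFrame({"Neighbourhood": ["Roselawn", "Christie", "Davisville"]})
venues = pd.DataFrame({"Neighborhood": ["Christie", "Christie", "Davisville"]})

# True for every neighbourhood that appears at least once in the venues frame.
has_venues = neighbourhoods["Neighbourhood"].isin(venues["Neighborhood"])

without_venues = neighbourhoods.loc[~has_venues, "Neighbourhood"].tolist()
kept = neighbourhoods[has_venues].reset_index(drop=True)
print(without_venues)
```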

In [20]:
#Count the number of venues per neighbourhood
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,100,100,100,100,100,100
"Brockton, Parkdale Village, Exhibition Place",100,100,100,100,100,100
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",83,83,83,83,83,83
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",100,100,100,100,100,100
Central Bay Street,100,100,100,100,100,100
Christie,100,100,100,100,100,100
Church and Wellesley,100,100,100,100,100,100
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,100,100,100,100,100,100
Davisville North,91,91,91,91,91,91


In [21]:
#Count the unique types of venues:
print('There are {} unique categories in the Toronto venues dataframe.'.format(len(toronto_venues['Venue Category'].unique())))

There are 290 unique categories in the Toronto venues dataframe.


### 3.3 - Create a one-hot encoding dataframe from the toronto_venues dataframe

In [22]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

### 3.4 - Group by neighbourhood and calculate the frequency of each type of venue

In [23]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
print(toronto_grouped.shape)
toronto_grouped

(39, 291)


Unnamed: 0,Neighbourhood,Accessories Store,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,...,0.0,0.012048,0.0,0.0,0.0,0.0,0.012048,0.024096,0.012048,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.02,0.0,0.02,0.01,0.0,0.01,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.0
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.010989,0.0,0.0,0.010989,0.0,0.010989,0.0,0.0,0.021978,0.0
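
To see why the mean of the one-hot columns is a frequency, here is a small worked sketch with four hypothetical venues in two made-up neighbourhoods, A and B:

```python
import pandas as pd

# Hypothetical venues: three in neighbourhood A, one in neighbourhood B.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

In A, two of the three venues are coffee shops, so the Coffee Shop frequency is 2/3, and each row of `freq` sums to 1.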


### 3.5 - Create a dataframe with the top 10 venues of each neighbourhood

In [24]:
#defining function to extract the top n venues of each neighbourhood:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [25]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # from the 4th position on there is no special suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted

(39, 11)


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Café,Hotel,Japanese Restaurant,Seafood Restaurant,Restaurant,Bakery,Park,Art Gallery,Sporting Goods Shop
1,"Brockton, Parkdale Village, Exhibition Place",Restaurant,Coffee Shop,Bar,Café,Furniture / Home Store,Bakery,Gift Shop,Tibetan Restaurant,Park,Arts & Crafts Store
2,"Business reply mail Processing Centre, South C...",Clothing Store,Restaurant,Bakery,Coffee Shop,Gym / Fitness Center,Bank,Sporting Goods Shop,Toy / Game Store,Pharmacy,Department Store
3,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Yoga Studio,Park,Café,Gym,Dessert Shop,Bakery,Spa,French Restaurant,Italian Restaurant
4,Central Bay Street,Coffee Shop,Café,Hotel,Seafood Restaurant,Juice Bar,Yoga Studio,Electronics Store,Steakhouse,Asian Restaurant,Ramen Restaurant
5,Christie,Korean Restaurant,Café,Coffee Shop,Grocery Store,Mexican Restaurant,Ice Cream Shop,Cocktail Bar,Caribbean Restaurant,Eastern European Restaurant,Pizza Place
6,Church and Wellesley,Coffee Shop,Sushi Restaurant,Italian Restaurant,Diner,Park,Japanese Restaurant,Men's Store,Café,Dance Studio,Pizza Place
7,"Commerce Court, Victoria Hotel",Coffee Shop,Café,Hotel,Theater,Concert Hall,Restaurant,Japanese Restaurant,Seafood Restaurant,Sushi Restaurant,Cosmetics Shop
8,Davisville,Italian Restaurant,Pizza Place,Indian Restaurant,Sushi Restaurant,Coffee Shop,Café,Bakery,Sandwich Place,Restaurant,Dessert Shop
9,Davisville North,Coffee Shop,Italian Restaurant,Pizza Place,Café,Park,Sushi Restaurant,Restaurant,Pharmacy,Pub,Supermarket
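
The ranking function can be checked on a single row. In this sketch the function is the one defined in this section, while `sample_row` and its frequencies are hypothetical:

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the neighbourhood name
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Hypothetical frequency row: the first entry is the name, the rest are frequencies.
sample_row = pd.Series(
    ["Berczy Park", 0.10, 0.50, 0.40],
    index=["Neighbourhood", "Bakery", "Coffee Shop", "Café"],
)

top_two = list(return_most_common_venues(sample_row, 2))
print(top_two)
```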


### 3.6 - Using the k-means method to cluster the neighbourhoods based on the frequency dataframe (3.4)

In [26]:
# import k-means
from sklearn.cluster import KMeans

In [27]:
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 1, 1, 1, 0, 1, 0, 0, 4, 4, 1, 0, 4, 0, 3, 1, 1, 1, 4, 1, 4, 4,
       1, 0, 0, 0, 2, 4, 1, 0, 0, 0, 1, 4, 1, 1, 1, 0, 1], dtype=int32)
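
A quick way to read this array is to count how many neighbourhoods fall in each cluster. The sketch below copies the labels printed above and counts them with NumPy:

```python
import numpy as np

# Labels copied from the kmeans.labels_ output above.
labels = np.array([0, 1, 1, 1, 0, 1, 0, 0, 4, 4, 1, 0, 4, 0, 3, 1, 1, 1, 4, 1, 4, 4,
                   1, 0, 0, 0, 2, 4, 1, 0, 0, 0, 1, 4, 1, 1, 1, 0, 1])

# sizes[c] is the number of neighbourhoods assigned to cluster c.
sizes = np.bincount(labels)
print(dict(enumerate(sizes.tolist())))
```

For these labels the counts are 13, 16, 1, 1 and 8 for clusters 0 to 4, so clusters 2 and 3 contain a single neighbourhood each.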

### 3.7 - Insert the corresponding cluster in the most common venues dataframe (3.5) then merge it with the initial dataframe.

In [28]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge neighborhoods_venues_sorted into toronto_data to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood') # 'Neighborhood' is the key of the joined frame, 'Neighbourhood' the key of the base frame.

#Transform the clusters to int:
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)

toronto_merged # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,0,Coffee Shop,Café,Theater,Park,Restaurant,Italian Restaurant,Breakfast Spot,Bakery,Gastropub,Diner
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,0,Coffee Shop,Park,Sushi Restaurant,Café,Boutique,Hotel,Italian Restaurant,Restaurant,Pizza Place,Bookstore
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783,0,Coffee Shop,Japanese Restaurant,Gastropub,Café,Italian Restaurant,Theater,Hotel,Seafood Restaurant,Ramen Restaurant,Cosmetics Shop
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756,0,Coffee Shop,Café,Restaurant,Seafood Restaurant,Theater,Italian Restaurant,Gastropub,Art Gallery,Cosmetics Shop,Concert Hall
4,M4E,East Toronto,The Beaches,43.6784,-79.2941,1,Pub,Coffee Shop,Pizza Place,Breakfast Spot,Bar,Health Food Store,Caribbean Restaurant,Nail Salon,Burger Joint,Sandwich Place
5,M5E,Downtown Toronto,Berczy Park,43.6456,-79.3754,0,Coffee Shop,Café,Hotel,Japanese Restaurant,Seafood Restaurant,Restaurant,Bakery,Park,Art Gallery,Sporting Goods Shop
6,M5G,Downtown Toronto,Central Bay Street,43.6564,-79.386,0,Coffee Shop,Café,Hotel,Seafood Restaurant,Juice Bar,Yoga Studio,Electronics Store,Steakhouse,Asian Restaurant,Ramen Restaurant
7,M6G,Downtown Toronto,Christie,43.6683,-79.4205,1,Korean Restaurant,Café,Coffee Shop,Grocery Store,Mexican Restaurant,Ice Cream Shop,Cocktail Bar,Caribbean Restaurant,Eastern European Restaurant,Pizza Place
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.6496,-79.3833,0,Coffee Shop,Café,Hotel,Theater,Plaza,Restaurant,Sushi Restaurant,Gym,Japanese Restaurant,Italian Restaurant
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.6655,-79.4378,1,Bar,Café,Bakery,Coffee Shop,Italian Restaurant,Pizza Place,Caribbean Restaurant,Park,Beer Store,Bank


### 3.8 - Display the different clusters in a folium map

In [29]:
!pip install geopy



In [30]:
#Import some useful libraries
import folium
from geopy.geocoders import Nominatim
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [31]:
#get the latitude and longitude coordinates of Toronto to center the view in it
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [32]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
colors_array = cm.rainbow(np.linspace(0, 1, len(x)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 3.9 - Create a dataframe with the top 5 venue frequencies in each cluster

In [33]:
toronto_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

In [34]:
# drop the text column first: recent pandas versions no longer discard it silently when summing
toronto_venues_cluster = toronto_grouped.drop(columns='Neighbourhood').groupby('Cluster Labels').sum()
toronto_venues_cluster

Cluster Labels,Accessories Store,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio,Zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18,0.0,...,0.15,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.11,0.0
1,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.147257,0.011111,...,0.247912,0.012048,0.041111,0.111111,0.01,0.051111,0.012048,0.054096,0.218193,0.011111
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.041667,0.041667,0.041667,0.083333,0.083333,0.083333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.044585,0.0,...,0.068172,0.0,0.0,0.050726,0.0,0.026141,0.02,0.0,0.078664,0.0


In [35]:
toronto_venues_cluster=toronto_venues_cluster.T

In [36]:
toronto_venues_cluster

Cluster Labels,0,1,2,3,4
Accessories Store,0.00,0.010000,0.0,0.000000,0.000000
African Restaurant,0.00,0.020000,0.0,0.000000,0.000000
Airport,0.00,0.000000,0.0,0.041667,0.000000
Airport Food Court,0.00,0.000000,0.0,0.041667,0.000000
Airport Gate,0.00,0.000000,0.0,0.041667,0.000000
...,...,...,...,...,...
Wine Bar,0.00,0.051111,0.0,0.000000,0.026141
Wings Joint,0.00,0.012048,0.0,0.000000,0.020000
Women's Store,0.01,0.054096,0.0,0.000000,0.000000
Yoga Studio,0.11,0.218193,0.0,0.000000,0.078664


In [37]:
venues_cluster_df = pd.DataFrame()
for c in range(kclusters):
    venues_cluster = toronto_venues_cluster[c].sort_values(ascending = False)
    # print(f'The top 5 venues of cluster {c} are:')
    # print(venues_cluster[0:5])
    for v in range(5):
        # the index holds the venue category, the data holds the summed frequency
        venues_cluster_df.at[v,f'Cluster {c} venue'] = venues_cluster.index[v]
        venues_cluster_df.at[v,f'Cluster {c} value'] = venues_cluster.iloc[v]

In [38]:
venues_cluster_df

Unnamed: 0,Cluster 0 venue,Cluster 0 value,Cluster 1 venue,Cluster 1 value,Cluster 2 venue,Cluster 2 value,Cluster 3 venue,Cluster 3 value,Cluster 4 venue,Cluster 4 value
0,Coffee Shop,1.208947,Café,0.961975,Park,0.3125,Harbor / Marina,0.166667,Coffee Shop,0.556106
1,Café,0.702632,Coffee Shop,0.959172,Trail,0.125,Burger Joint,0.083333,Italian Restaurant,0.539438
2,Hotel,0.47,Bakery,0.585826,Historic Site,0.0625,Bar,0.083333,Café,0.411004
3,Restaurant,0.442632,Bar,0.469155,Other Great Outdoors,0.0625,Airport Lounge,0.083333,Sushi Restaurant,0.393939
4,Japanese Restaurant,0.362632,Restaurant,0.462181,Candy Store,0.0625,Airport Service,0.083333,Bank,0.342886


### 3.10 - Conclusions of the clustering

#### From the analyses above, the following conclusions can be drawn:
* Three main clusters contain most of the neighbourhoods.
* Geographically, one covers the northern side (4), another the southern side (0), and the last one is split between the east and the west (1).
* The two remaining clusters (2 and 3) contain only one neighbourhood each, so they don't allow further analysis. The only notable point about them is that they keep appearing even if we reduce the number of clusters to 4 or even to 3.
* As for their venues, the most common ones in cluster 0 are cafés, hotels and restaurants, with cafés being the most frequent. Cluster 1 is similar but also includes venues such as bakeries and bars. Finally, cluster 3 presents very low frequencies in its most common venues, so we can conclude that it has a more evenly distributed range of venues.
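
One caveat when reading the summed frequencies of 3.9: a sum grows with the number of neighbourhoods in a cluster, so clusters of different sizes are not directly comparable. A possible alternative, sketched here with hypothetical values and not part of the analysis above, is to average the frequencies per cluster instead:

```python
import pandas as pd

# Hypothetical frequencies: cluster 0 has two neighbourhoods, cluster 1 has one.
freq = pd.DataFrame({
    "Cluster Labels": [0, 0, 1],
    "Coffee Shop": [0.10, 0.20, 0.12],
    "Park": [0.00, 0.02, 0.30],
})

summed = freq.groupby("Cluster Labels").sum()     # depends on the cluster size
averaged = freq.groupby("Cluster Labels").mean()  # typical frequency per neighbourhood
print(summed)
print(averaged)
```

For a cluster with a single neighbourhood both tables give the same row, while for larger clusters only the averaged values stay on the 0 to 1 scale of the original frequencies.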