<a href="https://github.com/PhinanceScientist"><img src = "https://i.ibb.co/NLfc0SV/Deveaner.png" width = 100> </a>
<h1 align=center><font size = 5>Merida's Downtown Neighborhoods Clustered by Vulnerability due to the COVID-19 Outbreak</font></h1>

## Introduction

For this project I use prepared data from a public postal-code web page, because postal and geospatial data for Mérida, Yucatán, México are scarce. The goal is to estimate how vulnerable the neighborhoods of Mérida's downtown are, based on venue information retrieved from the Foursquare API. k-means is used to group the neighborhoods, and the Folium library is used to visualize the results.
This approach visualizes the main neighborhoods inside Mérida's downtown and clusters the most vulnerable places, since COVID-19 is more likely to spread in crowded places such as fitness centers, cinemas and malls.

Please note that to render this Jupyter notebook with the Folium maps shown, you can use https://nbviewer.jupyter.org/

## Data

The data needed for this project can be found on a local postal-services web page, <a href="https://www.heraldo.com.mx/"> Heraldo.com.mx </a>, which lists postal codes from México. Here we focus on <a href="https://www.heraldo.com.mx/yucatan/merida/merida/">Mérida's postal codes</a>.<br>
The CSV file is based on the first 100 postal codes from Mérida (in ascending order; it is common knowledge locally that the lowest codes belong to the downtown area), each linked to a latitude and longitude obtained from a Google Maps search.<br>
The Foursquare API is used to retrieve the venues in each neighborhood. The category of each venue is used to judge how crowded it is likely to be, and therefore the overall vulnerability of the surrounding area.

## Bibliography
https://molekule.science/places-to-avoid-flu-virus/ <br>
https://www.babymed.com/health-news/8-public-places-avoid-during-cold-and-flu-season <br>
https://www.nhs.uk/conditions/coronavirus-covid-19/ <br>
https://www.health.gov.au/news/health-alerts/novel-coronavirus-2019-ncov-health-alert/what-you-need-to-know-about-coronavirus-covid-19 <br>
https://www.who.int/emergencies/diseases/novel-coronavirus-2019/advice-for-public <br>
https://www.healthline.com/health-news/public-places-and-the-coronavirus-what-to-know#Coronavirus-can-spread-through-contact-with-contaminated-surfaces,-too <br>
https://www.cdc.gov/coronavirus/2019-ncov/prepare/transmission.html <br>
https://www.bbc.com/future/article/20200317-covid-19-how-long-does-the-coronavirus-last-on-surfaces <br>


# <p style =" text-align: center">PART 1</p>


## Scraping postal-code data with BeautifulSoup

In [1]:
#Import requests for web scraping
import pandas as pd
import requests as rq
import numpy as np
import io

import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level function since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries installed')

Libraries installed


In [8]:
website_url= rq.get('https://micodigopostal.org/yucatan/merida/').text #Bring the data from the target URL

### Now we use the BeautifulSoup library

In [9]:
#Import BeautifulSoup for html structure information from our request
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')

table = soup.find_all('table')[0] #Find the table

df = pd.read_html(io.StringIO(str(table))) #Read the HTML table; read_html returns a list of DataFrames

neighborhood=pd.DataFrame(df[0]) #Turn the table to a DataFrame
neighborhood

Unnamed: 0,Asentamiento▼,Tipo de Asentamiento,Código Postal,Municipio,Ciudad,Zona,Mapa
0,15 de Mayo,Fraccionamiento,97229,Mérida,Mérida,Urbana,Mapa
1,5 Colonias,Colonia,97280,Mérida,Mérida,Urbana,Mapa
2,Águilas Chuburna,Fraccionamiento,97215,Mérida,Mérida,Urbana,Mapa
3,Álamos del Sur,Fraccionamiento,97285,Mérida,Mérida,Urbana,Mapa
4,(adsbygoogle = window.adsbygoogle || []).push(...,(adsbygoogle = window.adsbygoogle || []).push(...,(adsbygoogle = window.adsbygoogle || []).push(...,(adsbygoogle = window.adsbygoogle || []).push(...,(adsbygoogle = window.adsbygoogle || []).push(...,(adsbygoogle = window.adsbygoogle || []).push(...,(adsbygoogle = window.adsbygoogle || []).push(...
5,Alcalá Martín,Colonia,97050,Mérida,Mérida,Urbana,Mapa
6,Algarrobos Desarrollo Residencial,Fraccionamiento,97305,Mérida,-,Semiurbana,Mapa
7,Almendros,Colonia,97203,Mérida,Mérida,Urbana,Mapa
8,Altabrisa,Fraccionamiento,97130,Mérida,Mérida,Urbana,Mapa
9,Altavista,Fraccionamiento,97305,Mérida,-,Urbana,Mapa


### Drop the advertisement rows (adsbygoogle script text) and the columns we do not need

In [10]:
noNeighborhood = neighborhood[neighborhood['Tipo de Asentamiento'] == '(adsbygoogle = window.adsbygoogle || []).push({});'].index
neighborhood.drop(noNeighborhood, inplace = True)
neighborhood.drop(['Municipio','Ciudad','Zona','Mapa','Tipo de Asentamiento'], axis=1, inplace=True)

neighborhood

Unnamed: 0,Asentamiento▼,Código Postal
0,15 de Mayo,97229
1,5 Colonias,97280
2,Águilas Chuburna,97215
3,Álamos del Sur,97285
5,Alcalá Martín,97050
6,Algarrobos Desarrollo Residencial,97305
7,Almendros,97203
8,Altabrisa,97130
9,Altavista,97305
10,Alura,97305
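
The filter above matches the exact script text that the advertisement row leaves in the table. As a hedged alternative, the sketch below keeps only rows whose postal code is a five-digit string, which would also drop an ad row whose text changes. The toy frame and the helper name `keep_postcode_rows` are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

def keep_postcode_rows(table, column='Código Postal'):
    """Keep rows whose value in `column` looks like a 5-digit postal code."""
    is_postcode = table[column].astype(str).str.fullmatch(r'\d{5}')
    return table[is_postcode].reset_index(drop=True)

# Toy frame imitating the scraped table, including one ad-script row
toy = pd.DataFrame({
    'Asentamiento': ['15 de Mayo',
                     '(adsbygoogle = window.adsbygoogle || []).push({});',
                     'Alcalá Martín'],
    'Código Postal': ['97229',
                      '(adsbygoogle = window.adsbygoogle || []).push({});',
                      '97050'],
})

clean = keep_postcode_rows(toy)
print(clean)
```

Filtering on the shape of the value rather than on the ad text is a design choice: it trades a very specific match for a rule that describes what a valid row looks like.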


### Now we group the DataFrame by postal code, joining the neighborhood names with a ","

In [11]:
grpdf =neighborhood.groupby(['Código Postal'], as_index=False, sort=False).agg(','.join)
grpdf #Dataframe Grouped by Postcode and joined with ","

Unnamed: 0,Código Postal,Asentamiento▼
0,97229,"15 de Mayo,Ampliación Roma (Luis Echeverría),H..."
1,97280,"5 Colonias,Santa Rita"
2,97215,"Águilas Chuburna,Colonial Buenavista,Colonial ..."
3,97285,"Álamos del Sur,Ampliación Plan de Ayala,Bellav..."
4,97050,"Alcalá Martín,Yucatán"
5,97305,"Algarrobos Desarrollo Residencial,Altavista,Al..."
6,97203,"Almendros,Ampliación Francisco de Montejo,Arbo..."
7,97130,"Altabrisa,Diaz Ordaz,Missan II,Montecarlo,Resi..."
8,97256,"Álvaro Torres,Graciano Ricalde,Industrial Brid..."
9,97175,Amalia Solorzano


In [22]:
grpdf = grpdf.rename(columns ={'Código Postal':'Postcode','Asentamiento▼':'Neighborhood'})
grpdf

Unnamed: 0,Postcode,Neighborhood
0,97229,"15 de Mayo,Ampliación Roma (Luis Echeverría),H..."
1,97280,"5 Colonias,Santa Rita"
2,97215,"Águilas Chuburna,Colonial Buenavista,Colonial ..."
3,97285,"Álamos del Sur,Ampliación Plan de Ayala,Bellav..."
4,97050,"Alcalá Martín,Yucatán"
5,97305,"Algarrobos Desarrollo Residencial,Altavista,Al..."
6,97203,"Almendros,Ampliación Francisco de Montejo,Arbo..."
7,97130,"Altabrisa,Diaz Ordaz,Missan II,Montecarlo,Resi..."
8,97256,"Álvaro Torres,Graciano Ricalde,Industrial Brid..."
9,97175,Amalia Solorzano


### Finally, we verify the shape of our DataFrame

In [12]:
grpdf.shape

(141, 2)

## Final Thoughts <br>

<li>There were no Borough names for the Not assigned Neighbourhoods, so we skipped the instruction of using the Borough name for a Neighbourhood with a value of "Not assigned" (March 2020).</li>
<li>The original table from Wikipedia (March 2020) has fewer rows than the example image provided with the instructions.</li>
<li>The example image showed a duplicate Neighbourhood value for the M5A postal code, but it was not found in the Wikipedia table (March 2020).</li>

### References <br>
Medium post: <br>
https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722 (How BeautifulSoup Works)<br><br>

Coursera threads:<br>
https://www.coursera.org/learn/applied-data-science-capstone/discussions/all/threads/WwZwTZcmQJuGcE2XJuCb4g  (Scrape and turn into a DataFrame) <br><br>
https://www.coursera.org/learn/applied-data-science-capstone/discussions/all/threads/czrpnE_gEemX6BLS8CLb5g (Group by, merge postal codes) <br><br>

thispointer.com:<br>
https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/ (How to drop rows)



This notebook was <b>Part 1</b> of the Final assignment from the week 3 of the Applied Data Science Capstone from IBM Professional Certificate made by <a href='https://www.linkedin.com/in/novelo-luis/'> Luis Novelo </a>

***

***

# <p style =" text-align: center">PART 2</p>

## First we need to retrieve our data; in this case we use the prepared CSV file <br>
CSV URL File: https://raw.githubusercontent.com/PhinanceScientist/Coursera_Capstone/master/merida_cp_prepared.csv

In [2]:
urlCSV = 'https://raw.githubusercontent.com/PhinanceScientist/Coursera_Capstone/master/merida_downtowncp.csv' # Retrieve the data
geoSpatial = pd.read_csv(urlCSV) # Read the CSV into a DataFrame
newdf = geoSpatial.rename(columns ={'cp':'Postcode'}) # Rename the column so it has the same title as in our previous DataFrame
newdf




Unnamed: 0,Postcode,colonia,lat,lon
0,97000,La Quinta,20.976392,-89.636806
1,97000,Jardines de San Sebastian,20.95905,-89.634774
2,97000,Privada Del Maestro,20.982308,-89.626156
3,97000,Merida Centro,20.968927,-89.645942
4,97000,Los Cocos,20.948595,-89.630134
5,97000,Privada Garcia Gineres C - 29,20.989226,-89.638116


In [19]:
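# drop_duplicates returns a new DataFrame; the result is not assigned, so newdf keeps all of its rows.
# Every row shown above has postcode 97000, so keeping only the first postcode would leave a single neighborhood.
newdf.drop_duplicates(subset='Postcode', keep="first")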
#newdf.drop(['colonia','estado','municipio'], axis=1, inplace=True)
newdf.head()

Unnamed: 0,Postcode,colonia,lat,lon
0,97000,La Quinta,20.976392,-89.636806
1,97000,Jardines de San Sebastian,20.95905,-89.634774
2,97000,Privada Del Maestro,20.982308,-89.626156
3,97000,Merida Centro,20.968927,-89.645942
4,97000,Los Cocos,20.948595,-89.630134


In [21]:
#newdf.dtypes
#grpdf["Postcode"]= grpdf["Postcode"].astype(int) 

#grpdf.astype({'Postcode': 'int64'}).dtypes
grpdf.dtypes

Código Postal    object
Asentamiento▼    object
dtype: object
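
Both columns of the scraped frame have dtype `object`, so the postal codes are stored as strings. If the CSV side stores them as integers (an assumption; its dtypes are not shown here), the key columns have to be brought to a common type before calling `pd.merge`. A minimal sketch with toy frames, whose names and the 'Centro' value are hypothetical:

```python
import pandas as pd

# Toy stand-ins: scraped codes as strings, CSV codes as integers (assumed)
scraped = pd.DataFrame({'Postcode': ['97000', '97050'],
                        'Neighborhood': ['Centro', 'Alcalá Martín,Yucatán']})  # 'Centro' is a toy value
coords = pd.DataFrame({'Postcode': [97000, 97000],
                       'colonia': ['La Quinta', 'Los Cocos'],
                       'lat': [20.976392, 20.948595],
                       'lon': [-89.636806, -89.630134]})

# Cast the string codes to integers so both key columns share a dtype
scraped['Postcode'] = scraped['Postcode'].astype(int)

merged = pd.merge(scraped, coords, on='Postcode')  # inner join on the shared key
print(merged)
```

Only postal codes present in both frames survive the default inner join, so here the 97050 row is dropped.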

In [22]:
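# The scraped postal codes are strings; cast them to integers so they can match the CSV's Postcode column
# (this assumes every remaining scraped code is numeric)
grpdf['Postcode'] = grpdf['Postcode'].astype(int)

mergedf=pd.merge(grpdf, newdf, on='Postcode') #Merge by column name and build new dataframe

mergedf.drop_duplicates(subset='Postcode', keep="first")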

## Final Thoughts <br>

<li>The data set provided by the instructions was used in order to simplify the exercise (March 2020).</li>


### References <br>
note.nkmk.me: <br>
https://note.nkmk.me/en/python-pandas-dataframe-rename/ (How to rename dataframe's columns )<br><br>

Stack overflow:<br>
https://stackoverflow.com/questions/43297589/merge-two-data-frames-based-on-common-column-values-in-pandas  (How to merge columns by value in pandas) <br><br>
https://stackoverflow.com/questions/32400867/pandas-read-csv-from-url (How to read CSV from URL) <br><br>

pandas.pydata.org:<br>
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html (How to read CSV with pandas)

This notebook was <b>Part 2</b> of the Final assignment from the week 3 of the Applied Data Science Capstone from IBM Professional Certificate made by <a href='https://www.linkedin.com/in/novelo-luis/'> Luis Novelo </a>

***

***

# <p style =" text-align: center">PART 3</p>

### First we need to import our libraries

In [14]:
!pip -q install folium
import folium
print('Folium imported')

Folium imported


## 1. Exploring the dataset


In [25]:
map_merida = folium.Map(location=[20.97537, -89.61696], zoom_start=11) # Create Map

# add markers to map
for lat, lng, borough, neighborhood in zip(newdf['lat'], newdf['lon'], newdf['Postcode'], newdf['colonia']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_merida) 
    
map_merida


### Adding the Foursquare credentials

In [16]:
#@hidden_cell
CLIENT_ID = 'DGBSOBI1JYHOTEEC5WQBC41VJNTTUGDB0IJH4U4GI5HITY4D' # your Foursquare ID
CLIENT_SECRET = 'NDTXJZISJVIJX0J5V5RSDHXJULPWBBI2ND2EN3JH11ULSJQO' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: DGBSOBI1JYHOTEEC5WQBC41VJNTTUGDB0IJH4U4GI5HITY4D
CLIENT_SECRET:NDTXJZISJVIJX0J5V5RSDHXJULPWBBI2ND2EN3JH11ULSJQO
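
Because the cell above writes the client secret into the notebook output, one option is to load the credentials from environment variables instead of hard-coding them. A minimal sketch, assuming the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (hypothetical names chosen here) and two helpers, `load_credentials` and `mask`, that are not part of the notebook:

```python
import os

def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables (names are assumptions)."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret

def mask(value, visible=4):
    """Show only the first characters of a secret when printing."""
    return value[:visible] + '*' * max(len(value) - visible, 0)

# Example with a fake environment so the sketch runs anywhere
fake_env = {'FOURSQUARE_CLIENT_ID': 'EXAMPLEID123',
            'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET456'}
CLIENT_ID, CLIENT_SECRET = load_credentials(fake_env)
print('CLIENT_ID: ' + mask(CLIENT_ID))
```

In a real session `load_credentials()` would be called without arguments so that it reads `os.environ`.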


### For this exercise we use only the neighborhoods from Mérida's downtown, as it is the area of interest

In [17]:
newdf.head()

Unnamed: 0,Postcode,colonia,lat,lon
0,97000,La Quinta,20.976392,-89.636806
1,97000,Jardines de San Sebastian,20.95905,-89.634774
2,97000,Privada Del Maestro,20.982308,-89.626156
3,97000,Merida Centro,20.968927,-89.645942
4,97000,Los Cocos,20.948595,-89.630134


In [19]:
neighborhood_latitude = newdf.loc[0, 'lat'] # neighborhood latitude value
neighborhood_longitude = newdf.loc[0, 'lon'] # neighborhood longitude value

neighborhood_name = newdf.loc[0, 'colonia'] # neighbourhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of La Quinta are 20.976392, -89.6368064.


### Let's create the GET request URL. 

In [20]:


LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=DGBSOBI1JYHOTEEC5WQBC41VJNTTUGDB0IJH4U4GI5HITY4D&client_secret=NDTXJZISJVIJX0J5V5RSDHXJULPWBBI2ND2EN3JH11ULSJQO&v=20180605&ll=20.976392,-89.6368064&radius=500&limit=100'
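
The same URL can be assembled from a dictionary of named parameters, which avoids positional mistakes in a long `format` call. A minimal sketch using the standard library's `urlencode`; the credential values are placeholders and `build_explore_url` is a hypothetical helper name:

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL used above from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in "lat,lng" readable instead of encoding it as %2C
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')

url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 20.976392, -89.6368064)
print(url)
```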

### Send the GET request and examine the results

In [21]:
results = rq.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e6ac27a6001fe001b8bb1ca'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Mérida',
  'headerFullLocation': 'Mérida',
  'headerLocationGranularity': 'city',
  'totalResults': 10,
  'suggestedBounds': {'ne': {'lat': 20.980892004500006,
    'lng': -89.63199600292809},
   'sw': {'lat': 20.971891995499995, 'lng': -89.6416167970719}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '56b5ef8c498e2be2af8c0a3d',
       'name': "Padrino's Fitness & Muscle",
       'location': {'address': 'Calle 50 #332-D',
        'crossStreet': 'Calle 11',
        'lat': 20.977171654356308,
        'lng': -89.63987790833097,
        'labeledLatLngs': [{'label': 'display

### Function that extracts the category of the venue

In [22]:
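def get_category_type(row):
    """Return the name of the first category of a venue row, or None if it has no categories."""
    try:
        categories_list = row['categories']
    except KeyError:
        # rows flattened by json_normalize use the 'venue.' prefix
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']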

### Now we clean the json and structure it into a pandas dataframe.

In [23]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Padrino's Fitness & Muscle,Gymnastics Gym,20.977172,-89.639878
1,Hojaldras Rossana,Bakery,20.977154,-89.64032
2,Cantina El Poniente,Bar,20.979084,-89.640137
3,Oxxo,Convenience Store,20.977383,-89.633286
4,VillaLobos Madereria,Arts & Crafts Store,20.975803,-89.63617


In [24]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

10 venues were returned by Foursquare.


## 2. Exploring Neighbourhoods in Merida Downtown 

### Function to repeat the same process for all the neighborhoods in Mérida's downtown

In [61]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = rq.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
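
The list comprehension above indexes `categories[0]` directly, so a venue with an empty category list would raise an `IndexError`. A hedged sketch of a parsing step that tolerates that case; `venue_rows` is a hypothetical helper, and the payload below is a hand-made stand-in shaped like the `items` list shown earlier, not real API output:

```python
def venue_rows(name, lat, lng, items):
    """Turn Foursquare-style `items` into flat tuples, using None for a missing category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hand-made items: one venue with a category, one without
fake_items = [
    {'venue': {'name': 'Oxxo',
               'location': {'lat': 20.977383, 'lng': -89.633286},
               'categories': [{'name': 'Convenience Store'}]}},
    {'venue': {'name': 'Unnamed spot',
               'location': {'lat': 20.9770, 'lng': -89.6360},
               'categories': []}},
]

rows = venue_rows('La Quinta', 20.976392, -89.636806, fake_items)
for row in rows:
    print(row)
```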

### The code to run the above function on each neighborhood and create a new dataframe called *dt_Merida_venues*.

In [62]:

dt_Merida_venues = getNearbyVenues(names=newdf['colonia'],
                                   latitudes=newdf['lat'],
                                   longitudes=newdf['lon']
                                  )

La Quinta
Jardines de San Sebastian
Privada Del Maestro
Merida Centro
Los Cocos
Privada Garcia Gineres C - 29


### Check the size of the new dataFrame


In [64]:
print(dt_Merida_venues.shape)
dt_Merida_venues.head()

(79, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,La Quinta,20.976392,-89.636806,Padrino's Fitness & Muscle,20.977172,-89.639878,Gymnastics Gym
1,La Quinta,20.976392,-89.636806,Hojaldras Rossana,20.977154,-89.64032,Bakery
2,La Quinta,20.976392,-89.636806,Cantina El Poniente,20.979084,-89.640137,Bar
3,La Quinta,20.976392,-89.636806,Oxxo,20.977383,-89.633286,Convenience Store
4,La Quinta,20.976392,-89.636806,VillaLobos Madereria,20.975803,-89.63617,Arts & Crafts Store


### Let's group our dataframe by Neighbourhood and count how many venues each one has

In [65]:
dt_Merida_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Jardines de San Sebastian,10,10,10,10,10,10
La Quinta,10,10,10,10,10,10
Los Cocos,11,11,11,11,11,11
Merida Centro,4,4,4,4,4,4
Privada Del Maestro,22,22,22,22,22,22
Privada Garcia Gineres C - 29,22,22,22,22,22,22


In [66]:
# Unique venues categories
print('There are {} unique categories.'.format(len(dt_Merida_venues['Venue Category'].unique())))

There are 45 unique categories.


## 3. Analyze Each Neighbourhood

In [67]:
# one hot encoding
dt_Merida_onehot = pd.get_dummies(dt_Merida_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dt_Merida_onehot['Neighbourhood'] = dt_Merida_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [dt_Merida_onehot.columns[-1]] + list(dt_Merida_onehot.columns[:-1])
dt_Merida_onehot = dt_Merida_onehot[fixed_columns]

dt_Merida_onehot.head()

Unnamed: 0,Neighbourhood,Arts & Crafts Store,Athletics & Sports,Auditorium,Bakery,Bar,Beach,Bed & Breakfast,Breakfast Spot,Café,...,Public Art,Restaurant,Sandwich Place,Seafood Restaurant,Snack Place,Steakhouse,Taco Place,Theater,Vegetarian / Vegan Restaurant,Yoga Studio
0,La Quinta,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,La Quinta,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,La Quinta,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,La Quinta,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,La Quinta,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [68]:
dt_Merida_onehot.shape

(79, 46)

### Next, let's group rows by neighbourhood, taking the mean of the frequency of occurrence of each category

In [69]:
dt_Merida_grouped = dt_Merida_onehot.groupby('Neighbourhood').mean().reset_index()
dt_Merida_grouped

Unnamed: 0,Neighbourhood,Arts & Crafts Store,Athletics & Sports,Auditorium,Bakery,Bar,Beach,Bed & Breakfast,Breakfast Spot,Café,...,Public Art,Restaurant,Sandwich Place,Seafood Restaurant,Snack Place,Steakhouse,Taco Place,Theater,Vegetarian / Vegan Restaurant,Yoga Studio
0,Jardines de San Sebastian,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,...,0.0,0.1,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,La Quinta,0.1,0.0,0.0,0.1,0.1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0
2,Los Cocos,0.0,0.090909,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.090909,0.090909,0.0,0.0,0.0
3,Merida Centro,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Privada Del Maestro,0.0,0.0,0.045455,0.0,0.045455,0.0,0.045455,0.045455,0.045455,...,0.045455,0.045455,0.0,0.0,0.0,0.0,0.045455,0.045455,0.090909,0.0
5,Privada Garcia Gineres C - 29,0.0,0.0,0.0,0.045455,0.045455,0.0,0.0,0.0,0.0,...,0.0,0.090909,0.045455,0.045455,0.045455,0.0,0.0,0.0,0.0,0.045455


In [70]:
dt_Merida_grouped.shape

(6, 46)
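
To make the grouped table concrete: each value is the share of a neighbourhood's venues that fall in a given category. A small sketch on a toy venue list (the rows are made up for illustration):

```python
import pandas as pd

# Toy venue list: neighbourhood A has 2 venues, B has 4
toy_venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'B', 'B', 'B', 'B'],
    'Venue Category': ['Bar', 'Bakery', 'Bar', 'Bar', 'Gym', 'Park'],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', toy_venues['Neighbourhood'])

# The mean of the 0/1 indicators is the relative frequency of each category
freq = onehot.groupby('Neighbourhood').mean().reset_index()
print(freq)
```

Because the values are shares rather than counts, a neighbourhood with few venues (such as Merida Centro with 4) gets large frequencies from a single venue.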

### Let's print each neighbourhood along with the top 5 most common venues

In [71]:
num_top_venues = 5

for hood in dt_Merida_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = dt_Merida_grouped[dt_Merida_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Jardines de San Sebastian----
               venue  freq
0               Park   0.2
1     Sandwich Place   0.2
2  Convenience Store   0.2
3        Pizza Place   0.1
4    Bed & Breakfast   0.1


----La Quinta----
                 venue  freq
0  Arts & Crafts Store   0.1
1    Convenience Store   0.1
2          Snack Place   0.1
3             Pharmacy   0.1
4      Motorcycle Shop   0.1


----Los Cocos----
                venue  freq
0   Convenience Store  0.09
1         Pizza Place  0.09
2  Athletics & Sports  0.09
3                 Bar  0.09
4          Taco Place  0.09


----Merida Centro----
                   venue  freq
0                 Lounge  0.25
1                  Beach  0.25
2                    Gym  0.25
3  Performing Arts Venue  0.25
4            Pizza Place  0.00


----Privada Del Maestro----
                           venue  freq
0  Paper / Office Supplies Store  0.09
1                          Hotel  0.09
2  Vegetarian / Vegan Restaurant  0.09
3             Mexican Rest

### Let's put that into a *pandas* dataframe

Sorting venues in descending order

In [72]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

### Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [77]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = dt_Merida_grouped['Neighbourhood']

for ind in np.arange(dt_Merida_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dt_Merida_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head(10)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Jardines de San Sebastian,Sandwich Place,Convenience Store,Park,Market,Restaurant,Bed & Breakfast,Pizza Place,Yoga Studio,Comedy Club,Fried Chicken Joint
1,La Quinta,Lawyer,Pharmacy,Bakery,Bar,Convenience Store,Diner,Gymnastics Gym,Motorcycle Shop,Arts & Crafts Store,Snack Place
2,Los Cocos,Athletics & Sports,Taco Place,Steakhouse,Laundromat,Food Court,Bar,Dessert Shop,Pizza Place,Mexican Restaurant,Convenience Store
3,Merida Centro,Lounge,Gym,Beach,Performing Arts Venue,Yoga Studio,Concert Hall,Fried Chicken Joint,Food Court,Fast Food Restaurant,Diner
4,Privada Del Maestro,Hotel,Vegetarian / Vegan Restaurant,Paper / Office Supplies Store,Breakfast Spot,Pizza Place,Convenience Store,Concert Hall,Men's Store,Mexican Restaurant,Café
5,Privada Garcia Gineres C - 29,Convenience Store,Mexican Restaurant,Restaurant,Pharmacy,Yoga Studio,Pizza Place,Hotel,Comedy Club,Coffee Shop,Motorcycle Shop


## 4. Cluster Neighbourhoods

Run *k*-means to cluster the neighborhoods into 5 clusters. We use k=5 because this is only a demonstration of the Foursquare API and of clustering; we are not analyzing the optimal k.

In [81]:
# set number of clusters
kclusters = 5

dt_Merida_grouped_clustering = dt_Merida_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dt_Merida_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 3, 4, 2, 1, 1], dtype=int32)
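
Since k was fixed at 5 without analysis, one common way to compare candidate values is to look at the inertia (within-cluster sum of squares) over a range of k. A minimal sketch on random toy data standing in for `dt_Merida_grouped_clustering`; the data, its shape and the range of k are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in: 6 "neighbourhoods" described by 8 category frequencies
toy_features = rng.random((6, 8))

inertias = {}
for k in range(1, 6):  # k cannot exceed the number of rows
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(toy_features)
    inertias[k] = model.inertia_  # sum of squared distances to the closest centroid

for k, value in inertias.items():
    print(k, round(value, 4))
```

With only six neighbourhoods, k=5 leaves four of them alone in their own cluster (see the labels above), so a smaller k may be easier to interpret.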

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [86]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dt_Merida_merged = newdf

# join the sorted venues onto newdf to add latitude/longitude for each neighborhood
# newdf names its neighborhood column 'colonia', so that is the join key
dt_Merida_merged = dt_Merida_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='colonia')

dt_Merida_merged.head() # check the last columns!

Finally, let's visualize the resulting clusters

In [84]:
# create map centered on Mérida (same coordinates as the first map)
map_clusters = folium.Map(location=[20.97537, -89.61696], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dt_Merida_merged['lat'], dt_Merida_merged['lon'], dt_Merida_merged['colonia'], dt_Merida_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Examining the Clusters

#### Cluster 1

In [76]:
dt_Toronto_merged.loc[dt_Toronto_merged['Cluster Labels'] == 0, dt_Toronto_merged.columns[[1] + list(range(5, dt_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Pub,Park,Café,Bakery,Restaurant,Theater,Breakfast Spot,Mexican Restaurant,Dessert Shop
2,Downtown Toronto,0,Coffee Shop,Clothing Store,Café,Middle Eastern Restaurant,Japanese Restaurant,Cosmetics Shop,Plaza,Restaurant,Pizza Place,Bookstore
3,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Italian Restaurant,Hotel,Bakery,Cosmetics Shop,Clothing Store,Beer Bar,Breakfast Spot
4,Downtown Toronto,0,Coffee Shop,Cocktail Bar,Farmers Market,Restaurant,Cheese Shop,Beer Bar,Seafood Restaurant,Bakery,Café,Fountain
7,Downtown Toronto,0,Coffee Shop,Restaurant,Café,Thai Restaurant,Bar,Steakhouse,Sushi Restaurant,Gym,Asian Restaurant,Pizza Place
9,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Italian Restaurant,Japanese Restaurant,Gastropub,Bar,Seafood Restaurant,American Restaurant
10,Downtown Toronto,0,Coffee Shop,Restaurant,Café,Hotel,Gym,American Restaurant,Gastropub,Deli / Bodega,Japanese Restaurant,Italian Restaurant
11,Downtown Toronto,0,Café,Restaurant,Bakery,Bar,Bookstore,Japanese Restaurant,Italian Restaurant,Dessert Shop,Pub,Noodle House
12,Downtown Toronto,0,Bar,Café,Vietnamese Restaurant,Chinese Restaurant,Coffee Shop,Bakery,Dumpling Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Pizza Place
15,Downtown Toronto,0,Coffee Shop,Restaurant,Café,Cocktail Bar,Beer Bar,Seafood Restaurant,Japanese Restaurant,Hotel,Creperie,Lounge


In [81]:
dt_Toronto_merged.loc[dt_Toronto_merged['Cluster Labels'] == 1, dt_Toronto_merged.columns[[1] + list(range(5, dt_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Downtown Toronto,1,Park,Playground,Trail,Department Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center


In [78]:
dt_Toronto_merged.loc[dt_Toronto_merged['Cluster Labels'] == 2, dt_Toronto_merged.columns[[1] + list(range(5, dt_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Downtown Toronto,2,Grocery Store,Café,Park,Gas Station,Diner,Candy Store,Baby Store,Coffee Shop,Nightclub,Italian Restaurant


In [79]:
dt_Toronto_merged.loc[dt_Toronto_merged['Cluster Labels'] == 3, dt_Toronto_merged.columns[[1] + list(range(5, dt_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,Downtown Toronto,3,Airport Service,Airport Lounge,Airport Terminal,Coffee Shop,Sculpture Garden,Rental Car Location,Boat or Ferry,Boutique,Harbor / Marina,Airport Gate


In [82]:
dt_Toronto_merged.loc[dt_Toronto_merged['Cluster Labels'] == 4, dt_Toronto_merged.columns[[1] + list(range(5, dt_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Downtown Toronto,4,Coffee Shop,Park,Yoga Studio,Discount Store,Portuguese Restaurant,Nightclub,Mexican Restaurant,Juice Bar,Japanese Restaurant,Italian Restaurant
5,Downtown Toronto,4,Coffee Shop,Italian Restaurant,Burger Joint,Chinese Restaurant,Juice Bar,Japanese Restaurant,Café,Ice Cream Shop,Sandwich Place,Bubble Tea Shop
8,Downtown Toronto,4,Coffee Shop,Aquarium,Hotel,Café,Italian Restaurant,Scenic Lookout,Brewery,Sporting Goods Shop,Restaurant,Fried Chicken Joint


# Report <br>

### I decided to use the Downtown Toronto Borough for this exercise due to its great economic impact and because it has most of the well-known neighbourhoods, including some of the "Top Ten Best Toronto Neighbourhoods To Live In 2019" according to TorontoRentals.com.

### The clusters were defined by the most common venues: <br>
   <li> Cluster 0: A lot of coffee shops and restaurants<br></li>
   <li> Cluster 1: Public and recreational places like parks and playgrounds<br></li>
   <li> Cluster 2: Self-service stores, grocery stores and cafés<br></li>
    <li>Cluster 3: Airport services<br></li>
    <li>Cluster 4: A lot of coffee shops and recreational places like parks and aquariums, <b>excellent for tourism purposes!</b> <br></li>



This notebook was <b>Part 3</b> of the Final assignment from the week 3 of the Applied Data Science Capstone from IBM Professional Certificate made by <a href='https://www.linkedin.com/in/novelo-luis/'> Luis Novelo </a>

***


***