<a href="https://github.com/PhinanceScientist"><img src = "https://i.ibb.co/NLfc0SV/Deveaner.png" width = 100> </a>
<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Introduction

In this project I test web scraping methods with Python to obtain data from a public web page. After cleaning and exploring the data, the goal is to extract relevant information about the neighbourhoods of Toronto, Canada, from venue data retrieved through the Foursquare API. k-means is used to group the neighbourhoods, and the Folium library is used to visualize the results.

Note: to render this Jupyter notebook with the Folium maps displayed, you can open it through https://nbviewer.jupyter.org/

# <p style =" text-align: center">PART 1</p> 


## Scraping data from Wikipedia using BeautifulSoup

In [69]:
#Import requests for web scraping
import pandas as pd
import requests as rq
import numpy as np

import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries installed')

Libraries installed


In [14]:
website_url= rq.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text #Bring the data from the target URL

### Now we use the BeautifulSoup library

In [18]:
#Import BeautifulSoup for html structure information from our request
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')

table = soup.find_all('table')[0] #Find the table

df = pd.read_html(str(table)) #Read the table in HTML

neighbourhood=pd.DataFrame(df[0]) #Turn the table to a DataFrame
neighbourhood

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Etobicoke,Islington Avenue


### Drop the "Not assigned" Neighbourhoods, as they have no Borough assigned either

In [19]:
noNeighbourhood = neighbourhood[neighbourhood['Neighbourhood'] == 'Not assigned'].index
neighbourhood.drop(noNeighbourhood, inplace = True)
neighbourhood

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Etobicoke,Islington Avenue
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
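
The drop step above can also be written as a boolean filter that leaves the original frame untouched. This is a minimal sketch on a toy frame; the name `assigned` is an assumption introduced here, and the rows are copied from the outputs shown earlier.

```python
import pandas as pd

# Toy rows mirroring the scraped table (values copied from the output above)
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows with an assigned Neighbourhood; `toy` itself is not modified
assigned = toy[toy['Neighbourhood'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

Filtering into a new variable avoids `inplace=True`, so the unfiltered table stays available if the cell is re-run.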


### Now we group our DataFrame by Postcode, joining the Neighbourhood names with ","

In [20]:
grpdf =neighbourhood.groupby(['Postcode','Borough'], as_index=False, sort=False).agg(','.join)
grpdf #Dataframe Grouped by Postcode and joined with ","

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"
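
To make the grouping step concrete, here is a small sketch on toy rows taken from the table above. It names the aggregated column explicitly through a dict, which is a stylistic variation and not the exact call used in the notebook.

```python
import pandas as pd

# Toy rows copied from the scraped table; M6A appears twice
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'North York', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor'],
})

# One row per (Postcode, Borough); neighbourhood names are joined with ","
grouped = (
    toy.groupby(['Postcode', 'Borough'], as_index=False, sort=False)
       .agg({'Neighbourhood': ','.join})
)
print(grouped)
```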


### Our last requirement is to verify our Dataframe shape

In [21]:
grpdf.shape

(103, 3)

## Final Thoughts <br>

<li>The "Not assigned" Neighbourhoods had no Borough names either, so we skipped the instruction to use the Borough name as the Neighbourhood when the Neighbourhood is "Not assigned" (March 2020).</li>
<li>The original table on Wikipedia (March 2020) has fewer rows than the example image provided with the instructions.</li>
<li>The example image showed a duplicate Neighbourhood value for the M5A postal code, but it was not found in the Wikipedia table (March 2020).</li>

### References <br>
Medium post: <br>
https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722 (How BeautifulSoup Works)<br><br>

Coursera threads:<br>
https://www.coursera.org/learn/applied-data-science-capstone/discussions/all/threads/WwZwTZcmQJuGcE2XJuCb4g  (Scrape and turn into a dataframe) <br><br>
https://www.coursera.org/learn/applied-data-science-capstone/discussions/all/threads/czrpnE_gEemX6BLS8CLb5g (Group by, merge Postcode) <br><br>

thispointer.com:<br>
https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/ (How to drop rows)



This notebook is <b>Part 1</b> of the final assignment for week 3 of the Applied Data Science Capstone (IBM Professional Certificate), made by <a href='https://www.linkedin.com/in/novelo-luis/'> Luis Novelo </a>

***

***

# <p style =" text-align: center">PART 2</p> 

## First we need to retrieve our data; in this case we will use the file provided with the instructions <br>
CSV file URL: https://cocl.us/Geospatial_data

In [22]:
urlCSV = 'https://cocl.us/Geospatial_data' # Retrieve the data
geoSpatial = pd.read_csv(urlCSV) # Read the CSV into a DataFrame
newdf = geoSpatial.rename(columns ={'Postal Code':'Postcode'}) # Rename the column so it matches the key column of our previous DataFrame
newdf.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [23]:
mergedf=pd.merge(grpdf, newdf, on='Postcode') #Merge by column name and build new dataframe
mergedf.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
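
The merge above silently drops postcodes that have no coordinates. As a hedged sketch, the `how`, `validate` and `indicator` parameters of `pd.merge` can surface such rows. The frames below are toy data, and the postcode `M0X` is made up purely for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M0X'],  # 'M0X' is a hypothetical code with no match
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Hypothetical'],
})
right = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every neighbourhood row; validate raises if a Postcode is duplicated
checked = pd.merge(left, right, on='Postcode', how='left',
                   validate='one_to_one', indicator=True)

# Postcodes that found no coordinates in the right-hand table
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```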


## Final Thoughts <br>

<li>The data set provided with the instructions was used in order to simplify the exercise (March 2020).</li>


### References <br>
note.nkmk.me: <br>
https://note.nkmk.me/en/python-pandas-dataframe-rename/ (How to rename dataframe's columns )<br><br>

Stack overflow:<br>
https://stackoverflow.com/questions/43297589/merge-two-data-frames-based-on-common-column-values-in-pandas  (How to merge columns by value in pandas) <br><br>
https://stackoverflow.com/questions/32400867/pandas-read-csv-from-url (How to read CSV from URL) <br><br>

pandas.pydata.org:<br>
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html (How to read CSV with pandas)

This notebook is <b>Part 2</b> of the final assignment for week 3 of the Applied Data Science Capstone (IBM Professional Certificate), made by <a href='https://www.linkedin.com/in/novelo-luis/'> Luis Novelo </a>

***

***

# <p style =" text-align: center">PART 3</p> 

### First we need to import our libraries

In [24]:
!pip -q install folium
import folium
print('Folium imported')

Folium imported


## 1. Exploring the dataset


In [49]:
map_toronto = folium.Map(location=[43.651070, -79.347015], zoom_start=11) # Create Map

# add markers to map
for lat, lng, borough, neighborhood in zip(mergedf['Latitude'], mergedf['Longitude'], mergedf['Borough'], mergedf['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  # parse_html belongs to folium.Popup, not to CircleMarker
    
map_toronto


### Adding the Foursquare credentials

In [83]:
#@hidden_cell
CLIENT_ID = ' ' # your Foursquare ID
CLIENT_SECRET = ' ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID:  
CLIENT_SECRET: 


### For this exercise we will use only the Neighbourhoods in the 'Downtown Toronto' Borough, as it is quite an important area

In [35]:
dt_Toronto_data = mergedf[mergedf['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
dt_Toronto_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
2,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


In [50]:
neighborhood_latitude = dt_Toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = dt_Toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = dt_Toronto_data.loc[0, 'Neighbourhood'] # neighbourhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Harbourfront are 43.6542599, -79.3606359.


### Let's create the GET request URL. 

In [84]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id= &client_secret= &v=20180605&ll=43.6542599,-79.3606359&radius=500&limit=100'
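
As an alternative to string formatting, the query string can be built from a dictionary so that each parameter is URL-encoded. This is only a sketch: the environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions, not something the notebook defines, and no request is sent.

```python
import os
from urllib.parse import urlencode

# Assumed environment variable names; reading them avoids hard-coding secrets in a cell
client_id = os.environ.get('FOURSQUARE_CLIENT_ID', '')
client_secret = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

params = {
    'client_id': client_id,
    'client_secret': client_secret,
    'v': '20180605',                  # API version used in the notebook
    'll': '43.6542599,-79.3606359',   # Harbourfront coordinates from the output above
    'radius': 500,
    'limit': 100,
}

# urlencode percent-encodes reserved characters, so the comma in `ll` becomes %2C
explore_url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)
print(explore_url)
```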

### Send the GET request and examine the results

In [41]:
results = rq.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e69b5a20f5968002887ba39'},
 'response': {'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 47,
  'suggestedBounds': {'ne': {'lat': 43.6587599045, 'lng': -79.3544279001486},
   'sw': {'lat': 43.6497598955, 'lng': -79.36684389985142}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54ea41ad498e9a11e9e13308',
       'name': 'Roselle Desserts',
       'location': {'address': '362 King St E',
        'crossStreet': 'Trinity St',
        'lat': 43.653446723052674,
        'lng': -79.3620167174383,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.653446723052674,
          'lng': -79.3620167174383}],
        'distance': 143,
       

### Function that extracts the category of the venue

In [43]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # the flattened frame uses the 'venue.categories' column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
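
A quick way to see what the function does is to call it on hand-made rows. The sketch below repeats the function so it runs on its own, and uses plain dicts as stand-ins for DataFrame rows; the sample rows are illustrative.

```python
def get_category_type(row):
    """Return the name of the first category of a venue row, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Plain dicts standing in for rows of the flattened venues frame
with_category = {'venue.categories': [{'name': 'Bakery'}]}
without_category = {'categories': []}

print(get_category_type(with_category))
print(get_category_type(without_category))
```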

### Now we clean the json and structure it into a pandas dataframe.

In [47]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Morning Glory Cafe,Breakfast Spot,43.653947,-79.361149


In [48]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

47 venues were returned by Foursquare.
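
To illustrate the flattening step in isolation, the following sketch runs `pd.json_normalize` on a hand-written payload shaped like the `items` list of the response shown earlier. The two venues are copied from the output above; everything else in a real response is omitted.

```python
import pandas as pd

# Minimal stand-in for results['response']['groups'][0]['items']
items = [
    {'venue': {'name': 'Roselle Desserts',
               'categories': [{'name': 'Bakery'}],
               'location': {'lat': 43.653447, 'lng': -79.362017}}},
    {'venue': {'name': 'Tandem Coffee',
               'categories': [{'name': 'Coffee Shop'}],
               'location': {'lat': 43.653559, 'lng': -79.361809}}},
]

# Nested dicts become dotted column names; lists such as 'categories' stay as objects
flat = pd.json_normalize(items)

# Pull the first category name out of each list, tolerating empty lists
flat['category'] = flat['venue.categories'].apply(lambda c: c[0]['name'] if c else None)
print(flat[['venue.name', 'category', 'venue.location.lat', 'venue.location.lng']])
```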


## 2. Exploring Neighbourhoods in Downtown Toronto

### Function to repeat the same process to all the neighborhoods in Downtown Toronto

In [54]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = rq.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
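
The function above assumes every venue has at least one category. A more defensive row extraction might look like the sketch below; `venue_rows` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
def venue_rows(name, lat, lng, items):
    """Hypothetical helper: turn Foursquare 'items' into row tuples, tolerating missing fields."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Use None when a venue has no category, where the original would raise an IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows


# Toy items: the second venue is made up and has an empty category list
sample_items = [
    {'venue': {'name': 'Roselle Desserts',
               'location': {'lat': 43.653447, 'lng': -79.362017},
               'categories': [{'name': 'Bakery'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]
rows = venue_rows('Harbourfront', 43.65426, -79.360636, sample_items)
print(rows)
```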

### Run the above function on each neighbourhood and create a new dataframe called *dt_Toronto_venues*.

In [55]:

dt_Toronto_venues = getNearbyVenues(names=dt_Toronto_data['Neighbourhood'],
                                   latitudes=dt_Toronto_data['Latitude'],
                                   longitudes=dt_Toronto_data['Longitude']
                                  )

Harbourfront
Queen's Park
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Christie
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown,St. James Town
First Canadian Place,Underground city
Church and Wellesley


### Check the size of the new dataFrame


In [56]:
print(dt_Toronto_venues.shape)
dt_Toronto_venues.head()

(1304, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot


### Let's group our dataframe by Neighbourhood and count how many venues each one has

In [58]:
dt_Toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide,King,Richmond",100,100,100,100,100,100
Berczy Park,55,55,55,55,55,55
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",17,17,17,17,17,17
"Cabbagetown,St. James Town",42,42,42,42,42,42
Central Bay Street,79,79,79,79,79,79
"Chinatown,Grange Park,Kensington Market",88,88,88,88,88,88
Christie,18,18,18,18,18,18
Church and Wellesley,86,86,86,86,86,86
"Commerce Court,Victoria Hotel",100,100,100,100,100,100
"Design Exchange,Toronto Dominion Centre",100,100,100,100,100,100


In [59]:
# Unique venue categories
print('There are {} unique categories.'.format(len(dt_Toronto_venues['Venue Category'].unique())))

There are 205 unique categories.


## 3. Analyze Each Neighbourhood

In [61]:
# one hot encoding
dt_Toronto_onehot = pd.get_dummies(dt_Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dt_Toronto_onehot['Neighbourhood'] = dt_Toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [dt_Toronto_onehot.columns[-1]] + list(dt_Toronto_onehot.columns[:-1])
dt_Toronto_onehot = dt_Toronto_onehot[fixed_columns]

dt_Toronto_onehot.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
dt_Toronto_onehot.shape

(1304, 206)

### Next, let's group rows by neighbourhood and take the mean of each one-hot column, which gives the frequency of occurrence of each category

In [63]:
dt_Toronto_grouped = dt_Toronto_onehot.groupby('Neighbourhood').mean().reset_index()
dt_Toronto_grouped

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Cabbagetown,St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,...,0.0,0.0,0.012658,0.0,0.0,0.012658,0.0,0.0,0.0,0.012658
5,"Chinatown,Grange Park,Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.034091,0.0,0.056818,0.011364,0.0,0.0,0.0,0.0
6,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Church and Wellesley,0.011628,0.0,0.0,0.0,0.0,0.0,0.0,0.011628,0.0,...,0.0,0.0,0.0,0.0,0.011628,0.0,0.011628,0.011628,0.0,0.011628
8,"Commerce Court,Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0
9,"Design Exchange,Toronto Dominion Centre",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0


In [65]:
dt_Toronto_grouped.shape

(19, 206)
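
Why the mean of one-hot columns is a frequency can be seen on a toy example. The sketch below uses made-up neighbourhoods `A` and `B`, and passes `dtype=int` to `get_dummies` so the indicator columns are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Toy venue list: neighbourhood 'A' has 4 venues, 'B' has 2 (made-up data)
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Bakery', 'Park', 'Park', 'Park'],
})

# One indicator column per category, one row per venue
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of a 0/1 column within a group is the share of venues in that category
freq = onehot.groupby('Neighbourhood').mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1, so neighbourhoods with very different venue counts become comparable.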

### Let's print each neighbourhood along with the top 5 most common venues

In [66]:
num_top_venues = 5

for hood in dt_Toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = dt_Toronto_grouped[dt_Toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,King,Richmond----
             venue  freq
0      Coffee Shop  0.07
1       Restaurant  0.05
2             Café  0.04
3  Thai Restaurant  0.04
4       Steakhouse  0.03


----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.05
2            Beer Bar  0.04
3  Seafood Restaurant  0.04
4         Cheese Shop  0.04


----CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara----
              venue  freq
0   Airport Service  0.18
1    Airport Lounge  0.12
2  Airport Terminal  0.12
3       Coffee Shop  0.06
4           Airport  0.06


----Cabbagetown,St. James Town----
                venue  freq
0         Coffee Shop  0.07
1              Bakery  0.05
2                Café  0.05
3          Restaurant  0.05
4  Italian Restaurant  0.05


----Central Bay Street----
                 venue  freq
0          Coffee Shop  0.16
1   Italian Restaurant  0.05
2         Burger Joint  0.04
3  Japanese 

### Let's put that into a *pandas* dataframe

Sorting venues in descending order

In [67]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
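
As a usage sketch, the function can be applied to a single hand-made row. The frequencies below are invented for illustration and are deliberately free of ties, since the order of tied categories is not something this example tries to pin down.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighbourhood name) and sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# One row shaped like a row of dt_Toronto_grouped (made-up frequencies)
row = pd.Series({'Neighbourhood': 'A', 'Bakery': 0.2, 'Café': 0.5, 'Park': 0.3})
top_two = list(return_most_common_venues(row, 2))
print(top_two)
```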

### Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [70]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = dt_Toronto_grouped['Neighbourhood']

for ind in np.arange(dt_Toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dt_Toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head(10)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Restaurant,Café,Thai Restaurant,Bar,Steakhouse,Sushi Restaurant,Gym,Asian Restaurant,Pizza Place
1,Berczy Park,Coffee Shop,Cocktail Bar,Farmers Market,Restaurant,Cheese Shop,Beer Bar,Seafood Restaurant,Bakery,Café,Fountain
2,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport Service,Airport Lounge,Airport Terminal,Coffee Shop,Sculpture Garden,Rental Car Location,Boat or Ferry,Boutique,Harbor / Marina,Airport Gate
3,"Cabbagetown,St. James Town",Coffee Shop,Italian Restaurant,Café,Bakery,Restaurant,Pizza Place,Pub,Indian Restaurant,Sandwich Place,Japanese Restaurant
4,Central Bay Street,Coffee Shop,Italian Restaurant,Burger Joint,Chinese Restaurant,Juice Bar,Japanese Restaurant,Café,Ice Cream Shop,Sandwich Place,Bubble Tea Shop
5,"Chinatown,Grange Park,Kensington Market",Bar,Café,Vietnamese Restaurant,Chinese Restaurant,Coffee Shop,Bakery,Dumpling Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Pizza Place
6,Christie,Grocery Store,Café,Park,Gas Station,Diner,Candy Store,Baby Store,Coffee Shop,Nightclub,Italian Restaurant
7,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Men's Store,Bubble Tea Shop,Burger Joint,Mediterranean Restaurant,Café
8,"Commerce Court,Victoria Hotel",Coffee Shop,Restaurant,Café,Hotel,Gym,American Restaurant,Gastropub,Deli / Bodega,Japanese Restaurant,Italian Restaurant
9,"Design Exchange,Toronto Dominion Centre",Coffee Shop,Café,Restaurant,Hotel,Italian Restaurant,Japanese Restaurant,Gastropub,Bar,Seafood Restaurant,American Restaurant
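
The column names above rely on a `try`/`except` around a three-element suffix list. An explicit helper is one alternative; `ordinal` below is a hypothetical function introduced for this sketch, and it also handles cases such as 11th and 22nd that the notebook never needs.

```python
def ordinal(n):
    """Hypothetical helper: return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


num_top_venues = 10
columns = ['Neighbourhood'] + [
    f'{ordinal(i)} Most Common Venue' for i in range(1, num_top_venues + 1)
]
print(columns)
```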


## 4. Cluster Neighbourhoods

Run *k*-means to cluster the neighbourhoods into 5 clusters. We use k=5 because this is only a demonstration of the Foursquare API and clustering; we are not analyzing the optimal k.

In [72]:
# set number of clusters
kclusters = 5

dt_Toronto_grouped_clustering = dt_Toronto_grouped.drop(columns='Neighbourhood')  # keep only the numeric frequency columns

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dt_Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 3, 0, 4, 0, 2, 0, 0, 0], dtype=int32)
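
The notebook fixes k=5 without tuning. If one did want to compare values of k, the silhouette score is one common option. The sketch below runs on synthetic 2-D points with three well-separated groups, not on the Toronto data, so it only shows the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight groups of 6 points each (an assumption for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(6, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each fit
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With only 19 neighbourhoods, any such score on the real data would be noisy, so it is a rough guide at best.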

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [73]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dt_Toronto_merged = dt_Toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
dt_Toronto_merged = dt_Toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

dt_Toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,0,Coffee Shop,Pub,Park,Café,Bakery,Restaurant,Theater,Breakfast Spot,Mexican Restaurant,Dessert Shop
1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,4,Coffee Shop,Park,Yoga Studio,Discount Store,Portuguese Restaurant,Nightclub,Mexican Restaurant,Juice Bar,Japanese Restaurant,Italian Restaurant
2,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937,0,Coffee Shop,Clothing Store,Café,Middle Eastern Restaurant,Japanese Restaurant,Cosmetics Shop,Plaza,Restaurant,Pizza Place,Bookstore
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Coffee Shop,Café,Restaurant,Italian Restaurant,Hotel,Bakery,Cosmetics Shop,Clothing Store,Beer Bar,Breakfast Spot
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Cocktail Bar,Farmers Market,Restaurant,Cheese Shop,Beer Bar,Seafood Restaurant,Bakery,Café,Fountain


Finally, let's visualize the resulting clusters

In [75]:
# create map
map_clusters = folium.Map(location=[43.651070, -79.347015], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dt_Toronto_merged['Latitude'], dt_Toronto_merged['Longitude'], dt_Toronto_merged['Neighbourhood'], dt_Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
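
The color lookup in the cell above can be checked on its own, without drawing a map. This sketch builds one hex color per cluster label and indexes the palette directly by label; the notebook itself indexes with `cluster-1`, which shifts every label by one position but still gives each cluster its own color.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced rainbow color per cluster, converted to hex strings for folium
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Direct label-to-color mapping (a variation on the notebook's cluster-1 indexing)
color_of = {label: palette[label] for label in range(kclusters)}
print(color_of)
```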

## 5. Examining the Clusters

#### Cluster 0

In [76]:
dt_Toronto_merged.loc[dt_Toronto_merged['Cluster Labels'] == 0, dt_Toronto_merged.columns[[1] + list(range(5, dt_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Pub,Park,Café,Bakery,Restaurant,Theater,Breakfast Spot,Mexican Restaurant,Dessert Shop
2,Downtown Toronto,0,Coffee Shop,Clothing Store,Café,Middle Eastern Restaurant,Japanese Restaurant,Cosmetics Shop,Plaza,Restaurant,Pizza Place,Bookstore
3,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Italian Restaurant,Hotel,Bakery,Cosmetics Shop,Clothing Store,Beer Bar,Breakfast Spot
4,Downtown Toronto,0,Coffee Shop,Cocktail Bar,Farmers Market,Restaurant,Cheese Shop,Beer Bar,Seafood Restaurant,Bakery,Café,Fountain
7,Downtown Toronto,0,Coffee Shop,Restaurant,Café,Thai Restaurant,Bar,Steakhouse,Sushi Restaurant,Gym,Asian Restaurant,Pizza Place
9,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Hotel,Italian Restaurant,Japanese Restaurant,Gastropub,Bar,Seafood Restaurant,American Restaurant
10,Downtown Toronto,0,Coffee Shop,Restaurant,Café,Hotel,Gym,American Restaurant,Gastropub,Deli / Bodega,Japanese Restaurant,Italian Restaurant
11,Downtown Toronto,0,Café,Restaurant,Bakery,Bar,Bookstore,Japanese Restaurant,Italian Restaurant,Dessert Shop,Pub,Noodle House
12,Downtown Toronto,0,Bar,Café,Vietnamese Restaurant,Chinese Restaurant,Coffee Shop,Bakery,Dumpling Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Pizza Place
15,Downtown Toronto,0,Coffee Shop,Restaurant,Café,Cocktail Bar,Beer Bar,Seafood Restaurant,Japanese Restaurant,Hotel,Creperie,Lounge


In [81]:
dt_Toronto_merged.loc[dt_Toronto_merged['Cluster Labels'] == 1, dt_Toronto_merged.columns[[1] + list(range(5, dt_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Downtown Toronto,1,Park,Playground,Trail,Department Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center


In [78]:
dt_Toronto_merged.loc[dt_Toronto_merged['Cluster Labels'] == 2, dt_Toronto_merged.columns[[1] + list(range(5, dt_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Downtown Toronto,2,Grocery Store,Café,Park,Gas Station,Diner,Candy Store,Baby Store,Coffee Shop,Nightclub,Italian Restaurant


In [79]:
dt_Toronto_merged.loc[dt_Toronto_merged['Cluster Labels'] == 3, dt_Toronto_merged.columns[[1] + list(range(5, dt_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,Downtown Toronto,3,Airport Service,Airport Lounge,Airport Terminal,Coffee Shop,Sculpture Garden,Rental Car Location,Boat or Ferry,Boutique,Harbor / Marina,Airport Gate


In [82]:
dt_Toronto_merged.loc[dt_Toronto_merged['Cluster Labels'] == 4, dt_Toronto_merged.columns[[1] + list(range(5, dt_Toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Downtown Toronto,4,Coffee Shop,Park,Yoga Studio,Discount Store,Portuguese Restaurant,Nightclub,Mexican Restaurant,Juice Bar,Japanese Restaurant,Italian Restaurant
5,Downtown Toronto,4,Coffee Shop,Italian Restaurant,Burger Joint,Chinese Restaurant,Juice Bar,Japanese Restaurant,Café,Ice Cream Shop,Sandwich Place,Bubble Tea Shop
8,Downtown Toronto,4,Coffee Shop,Aquarium,Hotel,Café,Italian Restaurant,Scenic Lookout,Brewery,Sporting Goods Shop,Restaurant,Fried Chicken Joint


# Report <br>

### I decided to use the Downtown Toronto Borough for this exercise because of its large economic impact and because it contains most of the well-known neighbourhoods, including some of the "Top Ten Best Toronto Neighbourhoods To Live In 2019" according to TorontoRentals.com.

### The clusters were characterized by their most common venues: <br>
   <li> Cluster 0: Many coffee shops and restaurants<br></li>
   <li> Cluster 1: Public and recreational places like parks and playgrounds<br></li>
   <li> Cluster 2: Self-service stores, grocery stores and cafés<br></li>
    <li>Cluster 3: Airport services<br></li>
    <li>Cluster 4: Many coffee shops and recreational places like parks and aquariums, <b>excellent for tourism!</b> <br></li>



This notebook is <b>Part 3</b> of the final assignment for week 3 of the Applied Data Science Capstone (IBM Professional Certificate), made by <a href='https://www.linkedin.com/in/novelo-luis/'> Luis Novelo </a>

***


***