# Capstone Project
## Created by Furqan Tariq

This notebook contains all the code and outputs for the Capstone Project course of the IBM Data Science Specialization.

In [1]:
#importing pandas and numpy library
import pandas as pd
import numpy as np

In [2]:
print("Hello Capstone Project Course")

Hello Capstone Project Course


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Part 1: Data Scraping</a>

2. <a href="#item2">Part 2: Adding Latitudes and Longitudes</a>

3. <a href="#item3">Part 3: Clustering and Visualization</a>
  
</font>
</div>

## Note: The assignment has asked for 3 links. I have used the same notebook for all parts. Hence, the same link.

## Part 1: Data Scraping

In this part, I will attempt to find the data about the boroughs and neighborhoods of Toronto. Each step is explained by comments in the code.

In [3]:
#install the beautifulsoup library (the comment sits on its own line so that pip does not read it as a requirement)
! pip install beautifulsoup4


In [4]:
#upgrade pip because it's always good to keep this updated :)
! python -m pip install --upgrade pip


In [6]:
#Setting up the url that will be used for data scraping. Then a request is sent and the response's text is stored

import requests
website_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(website_url).text

In [7]:
#using BeautifulSoup with the lxml parser to parse the HTML of the page
from bs4 import BeautifulSoup
soup = BeautifulSoup(response,'lxml')

In [8]:
#finding the table that has the list of the boroughs and neighborhoods
html_table = soup.find('table',{'class':'wikitable sortable'})

In [9]:
#finding the header columns so that I can set up the dataframe's headers

table_header = html_table.find_all('th')
df_columns = []

#getting the column names
for th in table_header:
    column_name = th.text.rstrip()
    df_columns.append(column_name)

df_columns

['Postcode', 'Borough', 'Neighbourhood']

In [10]:
#finding the rows of the table so that I can fill the dataframe rows/body

table_body = html_table.find('tbody')
table_rows = table_body.find_all('tr')[1:]

df_body = []

for tr in table_rows:
    row = []
    table_cells = tr.find_all('td')
    for td in table_cells:
        value = td.text.rstrip()
        row.append(value)
    df_body.append(row)        

In [11]:
#this is the raw version of the data scraped from Wikipedia
toronto_df = pd.DataFrame(df_body, columns = df_columns)
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [12]:
#Dropping the rows where the Borough is not assigned

indexes_to_drop = toronto_df[toronto_df['Borough'] == 'Not assigned'].index
toronto_df.drop(indexes_to_drop, inplace=True)
toronto_df.reset_index(inplace=True, drop=True)
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


In [13]:
#Since a postcode can have more than 1 neighborhood, this step groups those neighborhoods into a single row per postcode, separated by commas

toronto_df = toronto_df.groupby('Postcode').agg(lambda x : ','.join(set(x)))
toronto_df.reset_index(inplace=True)
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
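
Because `set` does not preserve insertion order, the order of the neighbourhoods inside each joined string can change between runs. Below is a minimal sketch of an order-preserving alternative. The helper name `join_unique` is an assumption for illustration and is not part of the original notebook; it could be passed to `.agg` in place of the lambda.

```python
def join_unique(values, sep=","):
    """Join values with `sep`, dropping duplicates but keeping first-seen order."""
    # dicts keep insertion order (Python 3.7+), so dict.fromkeys de-duplicates stably
    return sep.join(dict.fromkeys(values))


# Hypothetical usage with the notebook's dataframe:
# toronto_df = toronto_df.groupby('Postcode').agg(join_unique).reset_index()

print(join_unique(["Rouge", "Malvern", "Rouge"]))  # Rouge,Malvern
```

With this helper the Borough column still collapses to a single value per postcode, because repeated borough names are de-duplicated in the same way.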


In [14]:
#For the case where there is a borough but no neighborhood, the borough is assigned to the neighborhood field

indexes_to_duplicate_borough = toronto_df[toronto_df['Neighbourhood'] == 'Not assigned'].index
toronto_df.loc[indexes_to_duplicate_borough, 'Neighbourhood'] = toronto_df.loc[indexes_to_duplicate_borough, 'Borough']
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [15]:
toronto_df.shape

(103, 3)

### Note: The example data shown in the assignment prompt differs from the data I saw on Wikipedia. Therefore, my answers may be different in some rows. Please note this before marking my assignment wrong :)

## Part 2: Adding Latitudes and Longitudes

In [16]:
! pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting click
  Downloading Click-7.0-py2.py3-none-any.whl (81 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
Installing collected packages: click, ratelim, future, geocoder
    Running setup.py install for future: started
    Running setup.py install for future: finished with status 'done'
Successfully installed click-7.0 future-0.18.2 geocoder-1.38.1 ratelim-0.1.6


In [17]:
import geocoder

The following code doesn't work because the geocoder package returns `None` for all the postcodes, so the `while` loop never exits and the cell has to be interrupted.

### Therefore, I am resorting *to using the CSV file* that has been shared in the assignment prompt. In theory, the code below should work.

In [23]:
for index, row in toronto_df.iterrows():
    postcode = row['Postcode']
    lat_lng_coords = None
    
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(postcode))
        lat_lng_coords = g.latlng
        
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    
    row['latitude'] = latitude
    row['longitude'] = longitude
    
    break
    
toronto_df.head()

KeyboardInterrupt: 
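
One way to avoid the endless loop is to cap the number of attempts. The sketch below is only an illustration: `lookup_with_retries` and its parameters are hypothetical names, and the lookup function is passed in so the retry logic can be tried without any geocoding service. Note also that assigning to `row` inside `iterrows()` does not write back to the dataframe, so the commented usage writes through `toronto_df.loc` instead.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None result or attempts run out."""
    for attempt in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # brief pause before trying again
    return None  # the caller decides how to handle a missing result


# Hypothetical usage with the notebook's dataframe and geocoder:
# for index, row in toronto_df.iterrows():
#     coords = lookup_with_retries(
#         lambda q: geocoder.google(q).latlng,
#         '{}, Toronto, Ontario'.format(row['Postcode']))
#     if coords is not None:
#         toronto_df.loc[index, 'Latitude'] = coords[0]
#         toronto_df.loc[index, 'Longitude'] = coords[1]
```

If every attempt fails, the function returns `None` instead of looping forever, so the notebook can fall back to the CSV file.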

In [31]:
toronto_coordinates_df = pd.read_csv('Geospatial_Coordinates.csv')
toronto_coordinates_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [36]:
toronto_merged_df = pd.merge(left=toronto_df, right=toronto_coordinates_df, left_on='Postcode', right_on='Postal Code')
toronto_merged_df.drop(columns=['Postal Code'], inplace= True)
toronto_merged_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern,Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
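
The default inner merge silently drops postcodes that have no match in the coordinates file. Here is a small sketch of how this could be checked with a left merge and `indicator=True`. The two frames are tiny stand-ins, and the postcode `M9Z` and borough `Example` are made up for illustration.

```python
import pandas as pd

# Tiny stand-in frames that mimic the notebook's column names
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Example']})
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# how='left' keeps every row of `left`; indicator=True adds a `_merge` column
checked = pd.merge(left, right, how='left',
                   left_on='Postcode', right_on='Postal Code', indicator=True)

# Rows that found no coordinates are marked 'left_only'
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)  # ['M9Z']
```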


## Part 3: Clustering and Visualization

In [38]:
! pip install geopy
! pip install folium
import json
import requests
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

Collecting folium
  Downloading folium-0.10.1-py2.py3-none-any.whl (91 kB)
Collecting branca>=0.3.0
  Downloading branca-0.3.1-py3-none-any.whl (25 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.10.1


In [40]:
toronto_merged_df['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           11
Central Toronto      9
West Toronto         6
East Toronto         5
East York            5
York                 5
Queen's Park         1
Mississauga          1
Name: Borough, dtype: int64

### I want to first visualize the neighborhoods in Toronto, Ontario

In [42]:
from geopy.geocoders import Nominatim
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Ontario are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Ontario are 43.653963, -79.387207.


In [57]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_merged_df['Latitude'], toronto_merged_df['Longitude'], toronto_merged_df['Borough'], toronto_merged_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Since there is a lot of data here, I will only get the data for boroughs that have 'Toronto' in them

i.e. Downtown Toronto, Central Toronto, West Toronto and East Toronto

In [56]:
toronto_boroughs_df = toronto_merged_df[toronto_merged_df['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_boroughs_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar,The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


### Let's Visualize this data i.e. only the boroughs that have 'Toronto' in their name

In [59]:
# create map of Toronto using latitude and longitude values
map_toronto_only = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_boroughs_df['Latitude'], toronto_boroughs_df['Longitude'], toronto_boroughs_df['Borough'], toronto_boroughs_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_only)  
    
map_toronto_only

### Now it's time to start analyzing the venues for this subset dataframe. Again, note that I am using only data for the boroughs that had 'Toronto' in them

In [None]:
CLIENT_ID = 'HIDDEN' # your Foursquare ID
CLIENT_SECRET = 'HIDDEN' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
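
To keep the credentials out of the notebook altogether, they could be read from environment variables. This is a sketch only: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `read_credential` are assumptions, not something the assignment prescribes.

```python
import os


def read_credential(name, default='HIDDEN'):
    """Return the environment variable `name`, or `default` if it is unset."""
    return os.environ.get(name, default)


# Hypothetical variable names; set them in the shell before starting Jupyter
CLIENT_ID = read_credential('FOURSQUARE_CLIENT_ID')
CLIENT_SECRET = read_credential('FOURSQUARE_CLIENT_SECRET')
```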

In [63]:
LIMIT = 100
RADIUS = 500

In [64]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
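
Two details of the function above can be made more robust: the query string can be built from a dictionary so that every value is URL-encoded, and a venue with an empty `categories` list would otherwise raise an `IndexError`. The sketch below uses only the standard library. The helper names `build_explore_url` and `venue_category` are hypothetical, and the `'Unknown'` fallback is an assumption.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


def venue_category(venue, default='Unknown'):
    """Return the first category name, or `default` when the list is empty."""
    categories = venue.get('categories') or []
    return categories[0]['name'] if categories else default
```

Inside `getNearbyVenues`, `venue_category(v['venue'])` could then replace `v['venue']['categories'][0]['name']`.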

In [65]:
toronto_venues = getNearbyVenues(names=toronto_boroughs_df['Neighbourhood'],
                                   latitudes=toronto_boroughs_df['Latitude'],
                                   longitudes=toronto_boroughs_df['Longitude']
                                  )

The Beaches
The Danforth West,Riverdale
India Bazaar,The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Summerhill East,Moore Park
Rathnelly,Forest Hill SE,South Hill,Summerhill West,Deer Park
Rosedale
Cabbagetown,St. James Town
Church and Wellesley
Harbourfront
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide,King,Richmond
Toronto Islands,Harbourfront East,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Roselawn
Forest Hill North,Forest Hill West
The Annex,North Midtown,Yorkville
Harbord,University of Toronto
Grange Park,Chinatown,Kensington Market
Bathurst Quay,King and Spadina,Railway Lands,Island airport,South Niagara,Harbourfront West,CN Tower
Stn A PO Boxes 25 The Esplanade
First Canadian Place,Underground city
Christie
Dovercourt Village,Dufferin
Little Portugal,Trinity
Exhibition Place,Brockton,Parkdale Village
High Park,The Junction South
Parkdale,Roncesvalles
Swansea,R

In [66]:
toronto_venues.shape

(1702, 7)

In [78]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Glen Stewart Park,43.675278,-79.294647,Park
3,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
4,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood


### Now, I want to analyze all of these neighborhoods

In [83]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
cols = list(toronto_onehot.columns.values) # list all of the columns in the dataframe
temp1 = cols.pop(cols.index('Neighborhood')) # remove 'Neighborhood' from the list
toronto_onehot =  toronto_onehot[[temp1] + cols] # rebuild the dataframe with 'Neighborhood' as the first column

toronto_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Now we will group all the neighborhoods and find the frequency of the different categories

In [85]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.01,0.0
1,"Bathurst Quay,King and Spadina,Railway Lands,I...",0.0,0.0625,0.0625,0.125,0.125,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
4,"Cabbagetown,St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
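
Taking the mean of the one-hot columns per neighborhood gives the share of that neighborhood's venues that fall in each category. The same quantity can be computed directly from counts, as this small standard-library sketch shows. The rows are made up for illustration and `category_frequencies` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def category_frequencies(rows):
    """Map each neighborhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(c.values()) for cat, n in c.items()}
        for hood, c in counts.items()
    }


# Made-up rows in (Neighborhood, Venue Category) form
rows = [('A', 'Park'), ('A', 'Park'), ('A', 'Pub'), ('A', 'Trail'), ('B', 'Gym')]
print(category_frequencies(rows))
# {'A': {'Park': 0.5, 'Pub': 0.25, 'Trail': 0.25}, 'B': {'Gym': 1.0}}
```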


### Now we will see the top 5 categories per neighborhood

In [87]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,King,Richmond----
         venue  freq
0  Coffee Shop  0.07
1         Café  0.04
2   Steakhouse  0.04
3          Bar  0.04
4   Restaurant  0.03


----Bathurst Quay,King and Spadina,Railway Lands,Island airport,South Niagara,Harbourfront West,CN Tower----
              venue  freq
0    Airport Lounge  0.12
1   Airport Service  0.12
2  Airport Terminal  0.12
3   Harbor / Marina  0.06
4             Plane  0.06


----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.05
2            Beer Bar  0.04
3  Seafood Restaurant  0.04
4      Farmers Market  0.04


----Business Reply Mail Processing Centre 969 Eastern----
              venue  freq
0       Yoga Studio  0.06
1     Garden Center  0.06
2       Pizza Place  0.06
3        Comic Shop  0.06
4  Recording Studio  0.06


----Cabbagetown,St. James Town----
         venue  freq
0  Coffee Shop  0.07
1  Pizza Place  0.04
2         Park  0.04
3         Café  0.04
4          Pub  0.04


----C

In [88]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [119]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Bar,Café,Steakhouse,Restaurant,Asian Restaurant,Breakfast Spot,Hotel,Thai Restaurant,Seafood Restaurant
1,"Bathurst Quay,King and Spadina,Railway Lands,I...",Airport Lounge,Airport Service,Airport Terminal,Boutique,Boat or Ferry,Bar,Rental Car Location,Plane,Coffee Shop,Harbor / Marina
2,Berczy Park,Coffee Shop,Cocktail Bar,Cheese Shop,Bakery,Beer Bar,Farmers Market,Seafood Restaurant,Steakhouse,Café,Butcher
3,Business Reply Mail Processing Centre 969 Eastern,Yoga Studio,Auto Workshop,Park,Comic Shop,Pizza Place,Recording Studio,Restaurant,Burrito Place,Brewery,Light Rail Station
4,"Cabbagetown,St. James Town",Coffee Shop,Italian Restaurant,Park,Pizza Place,Bakery,Restaurant,Café,Pub,Grocery Store,Diner


### Now it's time to cluster these neighborhoods

In [120]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

unique, counts = np.unique(kmeans.labels_, return_counts=True)
dict(zip(unique, counts))

{0: 34, 1: 1, 2: 2, 3: 1, 4: 1}

In [121]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_boroughs_df_with_labels = toronto_boroughs_df

# join the sorted venues onto toronto_boroughs_df so each neighborhood keeps its latitude/longitude
toronto_boroughs_df_with_labels = toronto_boroughs_df_with_labels.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_boroughs_df_with_labels.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Park,Trail,Health Food Store,Pub,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Yoga Studio,Fruit & Vegetable Store,Restaurant,Pub,Pizza Place
2,M4L,East Toronto,"India Bazaar,The Beaches West",43.668999,-79.315572,0,Park,Board Shop,Steakhouse,Sushi Restaurant,Ice Cream Shop,Brewery,Pub,Liquor Store,Fast Food Restaurant,Italian Restaurant
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,Gastropub,Brewery,Bakery,Italian Restaurant,American Restaurant,Park,Seafood Restaurant,Sandwich Place
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,3,Park,Swim School,Bus Line,Yoga Studio,Dim Sum Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


### Time to visualize the clusters

In [122]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_boroughs_df_with_labels['Latitude'], toronto_boroughs_df_with_labels['Longitude'], toronto_boroughs_df_with_labels['Neighbourhood'], toronto_boroughs_df_with_labels['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### So, we have 5 clusters. Now we will try to name these clusters

#### Cluster 1 (label 0) has tonnes of coffee shops, restaurants etc. I feel this cluster can be named the Foodie Cluster, which is most of Toronto :D

In [105]:
toronto_boroughs_df_with_labels.loc[toronto_boroughs_df_with_labels['Cluster Labels'] == 0, toronto_boroughs_df_with_labels.columns[[2] + list(range(6, toronto_boroughs_df_with_labels.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"The Danforth West,Riverdale",Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Yoga Studio,Fruit & Vegetable Store,Restaurant,Pub,Pizza Place
2,"India Bazaar,The Beaches West",Park,Board Shop,Steakhouse,Sushi Restaurant,Ice Cream Shop,Brewery,Pub,Liquor Store,Fast Food Restaurant,Italian Restaurant
3,Studio District,Café,Coffee Shop,Gastropub,Brewery,Bakery,Italian Restaurant,American Restaurant,Park,Seafood Restaurant,Sandwich Place
5,Davisville North,Park,Sandwich Place,Department Store,Food & Drink Shop,Hotel,Breakfast Spot,Gym,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
6,North Toronto West,Coffee Shop,Sporting Goods Shop,Fast Food Restaurant,Diner,Dessert Shop,Mexican Restaurant,Park,Clothing Store,Chinese Restaurant,Café
7,Davisville,Sandwich Place,Dessert Shop,Pizza Place,Italian Restaurant,Coffee Shop,Gym,Café,Sushi Restaurant,Park,Farmers Market
9,"Rathnelly,Forest Hill SE,South Hill,Summerhill...",Pub,Coffee Shop,American Restaurant,Restaurant,Fried Chicken Joint,Sports Bar,Sushi Restaurant,Pizza Place,Liquor Store,Light Rail Station
11,"Cabbagetown,St. James Town",Coffee Shop,Italian Restaurant,Park,Pizza Place,Bakery,Restaurant,Café,Pub,Grocery Store,Diner
12,Church and Wellesley,Coffee Shop,Japanese Restaurant,Restaurant,Sushi Restaurant,Gay Bar,Café,Gym,Hotel,Mediterranean Restaurant,Fast Food Restaurant
13,Harbourfront,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Restaurant,Café,Mexican Restaurant,Farmers Market,Event Space
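
The same long selection expression is repeated for every cluster label. It could be wrapped in a small helper, sketched here on a made-up two-row frame with the same column layout. The name `cluster_rows` and its parameters are assumptions for illustration.

```python
import pandas as pd


def cluster_rows(df, label, name_col=2, first_venue_col=6):
    """Return the name column plus all venue columns for one cluster label."""
    cols = df.columns[[name_col] + list(range(first_venue_col, df.shape[1]))]
    return df.loc[df['Cluster Labels'] == label, cols]


# Made-up two-row frame with the same column layout as the notebook's dataframe
demo = pd.DataFrame({
    'Postcode': ['X1A', 'X1B'],
    'Borough': ['Example', 'Example'],
    'Neighbourhood': ['North', 'South'],
    'Latitude': [0.0, 0.0],
    'Longitude': [0.0, 0.0],
    'Cluster Labels': [0, 1],
    '1st Most Common Venue': ['Park', 'Gym'],
})
print(cluster_rows(demo, 1))
```

With the notebook's dataframe, `cluster_rows(toronto_boroughs_df_with_labels, 0)` would select the same rows and columns as the cell above.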


#### Cluster 2 (label 1) has only 1 neighborhood, which is a bit strange; this is more like a Fitness Cluster

In [106]:
toronto_boroughs_df_with_labels.loc[toronto_boroughs_df_with_labels['Cluster Labels'] == 1, toronto_boroughs_df_with_labels.columns[[2] + list(range(6, toronto_boroughs_df_with_labels.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,"Summerhill East,Moore Park",Gym,Playground,Tennis Court,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


#### Cluster 3 (label 2) has only 2 neighborhoods; this is more like a Picnic/Beach Cluster

In [123]:
toronto_boroughs_df_with_labels.loc[toronto_boroughs_df_with_labels['Cluster Labels'] == 2, toronto_boroughs_df_with_labels.columns[[2] + list(range(6, toronto_boroughs_df_with_labels.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,The Beaches,Park,Trail,Health Food Store,Pub,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
10,Rosedale,Park,Playground,Trail,Department Store,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


#### Cluster 4 (label 3) has only 1 neighborhood, which is a bit strange; it is kind of like a hybrid of Clusters 2 and 3.

In [124]:
toronto_boroughs_df_with_labels.loc[toronto_boroughs_df_with_labels['Cluster Labels'] == 3, toronto_boroughs_df_with_labels.columns[[2] + list(range(6, toronto_boroughs_df_with_labels.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Lawrence Park,Park,Swim School,Bus Line,Yoga Studio,Dim Sum Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


#### Cluster 5 (label 4) has only 1 neighborhood, which is a bit strange; it is kind of a hybrid of Clusters 1, 2 and 3.

In [126]:
toronto_boroughs_df_with_labels.loc[toronto_boroughs_df_with_labels['Cluster Labels'] == 4, toronto_boroughs_df_with_labels.columns[[2] + list(range(6, toronto_boroughs_df_with_labels.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Roselawn,Garden,Yoga Studio,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


#### The clustering kind of makes sense. The majority of the neighborhoods are in the 'city' areas and have similar vibes, i.e. coffee shops, food places etc. The other clusters are a bit off the main 'city' and therefore offer different activities such as beaches, yoga, swimming etc.

# That's it folks!