# Part 1 of the Assignment 

This part scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes and transforms it into a pandas dataframe.

### Importing all necessary libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import requests # library to handle requests
print('Libraries imported.')

Libraries imported.


In [2]:
!pip install beautifulsoup4 



In [3]:
!pip install lxml



In [4]:
#!pip install html5lib

In [5]:
from bs4 import BeautifulSoup
import requests

### Download the Wikipedia page

In [6]:
!wget -O postal_codes_canada.html https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

--2020-05-10 00:56:29--  https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Resolving en.wikipedia.org (en.wikipedia.org)... 208.80.154.224, 2620:0:861:ed1a::1
Connecting to en.wikipedia.org (en.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52153 (51K) [text/html]
Saving to: ‘postal_codes_canada.html’


2020-05-10 00:56:29 (839 KB/s) - ‘postal_codes_canada.html’ saved [52153/52153]



### Open the file as an html file 

In [7]:
with open("postal_codes_canada.html") as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
#print(soup.prettify())

In [8]:
match = soup.title.text
print(match)

List of postal codes of Canada: M - Wikipedia


### Convert the table in the file into a pandas dataframe and name the columns

In [9]:
all_tables=soup.find_all("table")
#all_tables

In [10]:
right_table=soup.find('table', class_='wikitable sortable')
#right_table

In [11]:
A=[]
B=[]
C=[]

for row in right_table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==3:
        # keep the first text node of each of the three cells
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))

In [12]:
df=pd.DataFrame(A,columns=['PostalCode'])
df['Borough']=B
df['Neighborhood']=C
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,\n
1,M2A\n,Not assigned\n,\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


The first five rows show that the table still needs cleaning, so we remove the unwanted elements:

1. Remove all '\n' characters
2. Apply the strip function to remove any leading or trailing whitespace
3. Drop all rows whose *Borough* is **Not assigned**
4. Replace the '/' in the column *Neighborhood* with a comma ','
5. Reset the index
6. Print the first 20 rows of our dataframe
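
The steps above can be sketched on a small, made-up frame. The sample rows are hypothetical stand-ins for the scraped table, the `' / '` spacing is an assumption, and the loop over columns is one possible arrangement rather than the notebook's exact code.

```python
import pandas as pd

# Hypothetical sample rows that mimic the raw scraped values (not the real table).
raw = pd.DataFrame({
    "PostalCode": ["M1A\n", "M3A\n", "M5A\n"],
    "Borough": ["Not assigned\n", "North York\n", "Downtown Toronto\n"],
    "Neighborhood": ["\n", "Parkwoods\n", "Regent Park / Harbourfront\n"],
})

clean = raw.copy()
for col in clean.columns:
    # Steps 1 and 2: drop newlines, then trim surrounding whitespace.
    clean[col] = clean[col].str.replace("\n", "", regex=False).str.strip()

# Step 3: discard rows whose borough is not assigned (copy to get an independent frame).
clean = clean[clean["Borough"] != "Not assigned"].copy()

# Step 4: use commas instead of slashes between neighborhoods.
clean["Neighborhood"] = clean["Neighborhood"].str.replace(" / ", ", ", regex=False)

# Step 5: renumber the rows from zero without keeping the old index.
clean = clean.reset_index(drop=True)

# Step 6: show the first rows.
print(clean.head(20))
```

Using `reset_index(drop=True)` avoids the extra `index` column that otherwise has to be dropped afterwards.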

In [13]:
df["PostalCode"] = df.PostalCode.str.replace('\n', '')
df["Borough"] = df.Borough.str.replace('\n', '')
df["Neighborhood"] = df.Neighborhood.str.replace('\n', '')

df["PostalCode"] = df["PostalCode"].apply(lambda x: x.strip())
df["Borough"] = df["Borough"].apply(lambda x: x.strip())
df["Neighborhood"] = df["Neighborhood"].apply(lambda x: x.strip())
#df.head()

In [14]:
rows_notassigned = df[ df['Borough'] == 'Not assigned' ].index
df.drop(rows_notassigned , inplace=True)
df.reset_index(inplace = True)
#df.head()

In [15]:
df["Neighborhood"] = df.Neighborhood.str.replace('/', ',')
df.drop("index", axis=1, inplace=True)  # remove the column that holds the old index
df.head(20)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Looking through the rows in the column *Neighborhood*, we see that no row has a **Not assigned** value,
so no further cleaning is needed.
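
The visual check can also be written as an explicit test. The sketch below uses a tiny hypothetical frame; on the real data the same expression would presumably be applied to `df`.

```python
import pandas as pd

# Hypothetical cleaned rows standing in for df.
sample = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront"],
})

# Count rows whose neighborhood is still the placeholder or an empty string.
unassigned = sample["Neighborhood"].isin(["Not assigned", ""]).sum()
print("Rows still needing cleanup:", unassigned)
```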

In [16]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing Centre
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [17]:
print("The dataframe has", df.shape[0], "rows and", df.shape[1], "columns")

The dataframe has 103 rows and 3 columns


# Part 2: Inserting the Latitudes and Longitudes for each Postal Code

Set the pandas display options so that all rows and columns are shown

In [18]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


In [19]:
!wget -O Geospatial_data.csv https://cocl.us/Geospatial_data

--2020-05-10 00:56:30--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 158.85.108.83, 158.85.108.86, 169.48.113.194
Connecting to cocl.us (cocl.us)|158.85.108.83|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-05-10 00:56:31--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-05-10 00:56:31--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.

In [20]:
df_geo = pd.read_csv("Geospatial_data.csv")
# rows are matched on the index later, so no sorting is needed here
df_geo.set_index("Postal Code", inplace=True)
df_geo.head()

Unnamed: 0_level_0,Latitude,Longitude
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [21]:
# rows are matched on the index in the next cell, so no sorting is needed here
df.set_index("PostalCode", inplace=True)
df.head()

Unnamed: 0_level_0,Borough,Neighborhood
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [22]:
result = pd.concat([df, df_geo], axis=1, join='inner')
result.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
M3A,North York,Parkwoods,43.753259,-79.329656
M4A,North York,Victoria Village,43.725882,-79.315572
M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [23]:
result.shape

(103, 4)

In [24]:
result.reset_index(inplace=True)
result.rename(columns={"index":"PostalCode"}, inplace=True)
result.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
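
As an alternative to aligning both frames on their index, the same join can be sketched as a key-based merge. The rows below are a small sample copied from the tables above, and `validate` is used here only as an extra safety check, not something the notebook does.

```python
import pandas as pd

codes = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})
geo = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# Inner merge on the postal code; validate raises if a code appears twice on either side.
merged = codes.merge(
    geo, left_on="PostalCode", right_on="Postal Code",
    how="inner", validate="one_to_one",
).drop(columns="Postal Code")
print(merged)
```

With a merge, `PostalCode` stays an ordinary column, so the later `reset_index` and `rename` steps are not needed.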


# Part 3: Clustering and Visualization of Toronto Neighborhoods

Import necessary libraries

In [25]:
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas.io.json import json_normalize 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

import json

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.1.0               |             py_1         614 KB  conda-forge
    branca-0.4.1               |             py_0          26 KB  conda-forge
    brotlipy-0.7.0             |py36h8c4c3a4_1000         346 KB  conda-forge
    chardet-3.0.4              |py36h9f0ad1d_1006         188 KB  conda-forge
    cryptography-2.9.2         |   py36h45558ae_0         613 KB  co

Since we are working with Toronto neighborhoods, I want to check which boroughs belong to Toronto proper, that is,
boroughs whose names contain "Toronto". They are Central, Downtown, East and West Toronto. I will create a new dataframe with
these boroughs and ignore the others.

In [26]:
df.groupby("Borough").count()

Unnamed: 0_level_0,Neighborhood
Borough,Unnamed: 1_level_1
Central Toronto,9
Downtown Toronto,19
East Toronto,5
East York,5
Etobicoke,12
Mississauga,1
North York,24
Scarborough,17
West Toronto,6
York,5


There are 4 boroughs of Toronto, namely Downtown, Central, East and West Toronto. I will copy the data from result into a new dataframe, keeping only the rows that belong to these four boroughs:

In [27]:
toronto_boroughs = ['Downtown Toronto', 'Central Toronto', 'East Toronto', 'West Toronto']

# keep only the rows of the four Toronto boroughs; .copy() gives an independent frame
# so that later in-place changes do not trigger a SettingWithCopyWarning
df_toronto = result.loc[result['Borough'].isin(toronto_boroughs),
                        ["PostalCode", "Borough", "Neighborhood", "Latitude", "Longitude"]].copy()
df_toronto.reset_index(drop=True, inplace=True)
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


Check the number of postal codes in the Toronto boroughs:

In [28]:
df_toronto.shape

(39, 5)

Get the coordinates of Toronto:

In [29]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
T_latitude = location.latitude
T_longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(T_latitude, T_longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Visualization of Toronto and the boroughs in it.

In [30]:
# create map of Toronto using latitude and longitude values
from IPython.display import HTML

map_toronto = folium.Map(location=[T_latitude, T_longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) 
    
map_toronto
HTML(map_toronto._repr_html_())

#### Define Foursquare Credentials and Version

In [31]:
CLIENT_ID = '<REDACTED>' # my Foursquare ID
CLIENT_SECRET = '<REDACTED>' # my Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('My credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

My credentials:
CLIENT_ID: <REDACTED>
CLIENT_SECRET:<REDACTED>


Build the Foursquare explore URL for the coordinates of Toronto:

In [32]:
limit = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, T_latitude, T_longitude, VERSION, radius, limit)
url

'https://api.foursquare.com/v2/venues/explore?client_id=<REDACTED>&client_secret=<REDACTED>&ll=43.6534817,-79.3839347&v=20180605&radius=500&limit=100'

The next cells separate the neighborhoods, since some postal codes cover several neighborhoods.

In [33]:
neigh_toronto = pd.DataFrame()
neigh_toronto["Borough"]= df_toronto["Borough"]
neigh_toronto["Neighborhood"] = df_toronto["Neighborhood"]
neigh_toronto.head()

Unnamed: 0,Borough,Neighborhood
0,Downtown Toronto,"Regent Park, Harbourfront"
1,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
2,Downtown Toronto,"Garden District, Ryerson"
3,Downtown Toronto,St. James Town
4,East Toronto,The Beaches


In [34]:
neighborhood_list = []
neighborhood_list.append(neigh_toronto['Neighborhood'].str.split(',').tolist())

In [35]:
# split each comma-separated entry into one column per neighborhood (neighborhood1, neighborhood2, ...)
tryneigh_toronto = neigh_toronto['Neighborhood'].str.split(',', expand=True).rename(columns = lambda x: "neighborhood"+str(x+1))
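
A more compact way to get one row per neighborhood is `str.split` followed by `explode`. This is a sketch on hypothetical sample rows, not the approach the notebook takes; it also strips the space that is left after each comma.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as neigh_toronto.
sample = pd.DataFrame({
    "Borough": ["Downtown Toronto", "East Toronto"],
    "Neighborhood": ["Regent Park, Harbourfront", "The Beaches"],
})

# Split on commas, give each piece its own row, then trim stray spaces.
long_form = (
    sample.assign(Neighborhood=sample["Neighborhood"].str.split(","))
          .explode("Neighborhood")
          .reset_index(drop=True)
)
long_form["Neighborhood"] = long_form["Neighborhood"].str.strip()
print(long_form)
```

Unlike the column-wise split, this keeps the borough next to every neighborhood.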

In [36]:
column_names = ['Latitude', 'Longitude'] 

# instantiate the dataframe
coordinates = pd.DataFrame(columns = column_names)

# stack the split columns into a single Neighborhood column, skipping empty cells
# (DataFrame.append was removed in pandas 2.0, so the pieces are concatenated instead)
pieces = []
for column in tryneigh_toronto:
    liste = list(filter(None, tryneigh_toronto[column].tolist()))
    pieces.append(pd.DataFrame(liste))

toronto_data = pd.concat(pieces, ignore_index=True)
toronto_data.rename(columns={0:'Neighborhood'}, inplace=True)


In [37]:
print(toronto_data.shape)
print("We have a total of ",toronto_data.shape[0], "neighborhoods\n" )
print("Retrieving coordinates...")

geolocator = Nominatim(user_agent="ny_explorer")  # create the geocoder once, not per address
rows = []

for name in toronto_data["Neighborhood"]:
    address = name + ', Toronto'
    location = geolocator.geocode(address)
    if location is None:
        # geocode returns None when the address cannot be resolved
        print ("Cannot get coordinates of ", address, '\n')
        rows.append({'Latitude': None, 'Longitude': None})
    else:
        rows.append({'Latitude': location.latitude, 'Longitude': location.longitude})

coordinates = pd.DataFrame(rows, columns=column_names)
toronto_data["Latitude" ] = coordinates["Latitude"]
toronto_data["Longitude"] = coordinates["Longitude"]

toronto_data.head()


(75, 1)
We have a total of  75 neighborhoods

Retrieving coordinates...
Cannot get coordinates of  Stn A PO Boxes, Toronto 

Cannot get coordinates of  Business reply mail Processing Centre, Toronto 

Cannot get coordinates of   Ontario Provincial Government, Toronto 



Unnamed: 0,Neighborhood,Latitude,Longitude
0,Regent Park,43.660706,-79.360457
1,Queen's Park,43.659659,-79.39034
2,Garden District,43.6565,-79.377114
3,St. James Town,43.669403,-79.372704
4,The Beaches,43.671024,-79.296712


Since the coordinates of 3 of the neighborhoods could not be obtained, I have chosen to delete them.

In [38]:
toronto_data = toronto_data.dropna(how='any',axis=0)
toronto_data.reset_index(inplace=True)
toronto_data.drop("index", inplace=True, axis=1)

print("We have a total of ", toronto_data.shape[0], 'neighborhoods')

We have a total of  72 neighborhoods


In [39]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)  # LIMIT is a global variable that must be set before the function is called
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
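
The nested lookups in `getNearbyVenues` assume every venue has at least one category. The sketch below walks a hand-written payload that only mimics the keys the function reads (`response`, `groups`, `items`, `venue`, `categories`). The values and the `extract_venues` helper are hypothetical, and a missing category is replaced by `None` instead of raising an error.

```python
# Hypothetical payload with the same nesting the function above indexes into.
payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Cafe A",
                   "location": {"lat": 43.66, "lng": -79.36},
                   "categories": [{"name": "Coffee Shop"}]}},
        {"venue": {"name": "Unlabelled Spot",
                   "location": {"lat": 43.67, "lng": -79.37},
                   "categories": []}},
    ]}]}
}

def extract_venues(data):
    """Return (name, lat, lng, category) tuples; category is None when absent."""
    rows = []
    for item in data["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows

venues = extract_venues(payload)
print(venues)
```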

#### Listing the neighborhoods

In [40]:
LIMIT = 75
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Regent Park
Queen's Park
Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond
Dufferin
Harbourfront East
Little Portugal
The Danforth West
Toronto Dominion Centre
Brockton
India Bazaar
Commerce Court
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park
North Toronto West
The Annex
Parkdale
Davisville
University of Toronto
Runnymede
Moore Park
Kensington Market
Summerhill West
CN Tower
Rosedale
St. James Town
First Canadian Place
Church and Wellesley
 Harbourfront
 Ryerson
 Adelaide
 Dovercourt Village
 Union Station
 Trinity
 Riverdale
 Design Exchange
 Parkdale Village
 The Beaches West
 Victoria Hotel
 The Junction South
 North Midtown
 Roncesvalles
 Harbord
 Swansea
 Summerhill East
 Chinatown
 Rathnelly
 King and Spadina
 Cabbagetown
 Underground city
 King
 Toronto Islands
 Exhibition Place
 Yorkville
 Grange Park
 South Hill
 Railway Lands
 Forest Hill SE
 Harbourfront West
 Deer Park
 Bathurst Quay
 Sou

The first five venues, with their coordinates and categories, displayed in a dataframe:

In [41]:
print("There are :", toronto_venues.shape[0], "venues retrieved\n\n")
toronto_venues.head()

There are : 3275 venues retrieved




Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Regent Park,43.660706,-79.360457,Regent Park Aquatic Centre,43.6606,-79.361392,Pool
1,Regent Park,43.660706,-79.360457,Daniels Spectrum,43.660137,-79.361808,Performing Arts Venue
2,Regent Park,43.660706,-79.360457,Thai To Go,43.663418,-79.36071,Thai Restaurant
3,Regent Park,43.660706,-79.360457,Sumach Espresso,43.658135,-79.359515,Coffee Shop
4,Regent Park,43.660706,-79.360457,Paintbox Bistro,43.66005,-79.362855,Restaurant


#### Number of venues retrieved for each neighborhood

In [42]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adelaide,75,75,75,75,75,75
Bathurst Quay,26,26,26,26,26,26
Cabbagetown,51,51,51,51,51,51
Chinatown,58,58,58,58,58,58
Deer Park,59,59,59,59,59,59
Design Exchange,75,75,75,75,75,75
Dovercourt Village,9,9,9,9,9,9
Exhibition Place,34,34,34,34,34,34
Forest Hill SE,3,3,3,3,3,3
Grange Park,75,75,75,75,75,75


#### Let's find out how many unique categories can be curated from all the returned venues

In [43]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 297 unique categories.


### Analyze each neighborhood

In [44]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

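# add neighborhood column back to dataframe
# (caution: one of the venue categories is itself called 'Neighborhood', so this
#  assignment overwrites that dummy column in place instead of appending a new column)
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move the last column to the front; because 'Neighborhood' was overwritten in place,
# the column that ends up first is the last category, not 'Neighborhood'
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]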

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Animal Shelter,Antique Shop,Aquarium,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Workshop,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Beach,Beach Bar,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bike Rental / Bike Share,Bike Shop,Bike Trail,Bistro,Boat or Ferry,Bookstore,Botanical Garden,Boutique,Bowling Alley,Brazilian Restaurant,Breakfast Spot,Brewery,Bridge,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Line,Bus Station,Bus Stop,Butcher,Café,Cantonese Restaurant,Caribbean Restaurant,Castle,Cheese Shop,Chinese Restaurant,Chiropractor,Chocolate Shop,Church,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Auditorium,College Cafeteria,College Gym,College Rec Center,College Theater,Colombian Restaurant,Comedy Club,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Costume Shop,Creperie,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Gaming Cafe,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,General Entertainment,General Travel,German Restaurant,Gift Shop,Gluten-free Restaurant,Golf Course,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Halal Restaurant,Harbor / 
Marina,Hardware Store,Hawaiian Restaurant,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Hockey Arena,Hospital,Hostel,Hotel,Hotel Bar,Hotpot Restaurant,IT Services,Ice Cream Shop,Indian Chinese Restaurant,Indian Restaurant,Indie Theater,Indonesian Restaurant,Irish Pub,Israeli Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Karaoke Bar,Korean Restaurant,Lake,Latin American Restaurant,Lawyer,Library,Light Rail Station,Lingerie Store,Liquor Store,Lounge,Malay Restaurant,Market,Martial Arts Dojo,Massage Studio,Mattress Store,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Movie Theater,Moving Target,Museum,Music School,Music Store,Music Venue,Nail Salon,Neighborhood,New American Restaurant,Night Market,Nightclub,Nightlife Spot,Non-Profit,Noodle House,North Indian Restaurant,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Sculpture,Outdoor Supply Store,Paintball Field,Pakistani Restaurant,Paper / Office Supplies Store,Park,Pastry Shop,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Pie Shop,Pilates Studio,Pizza Place,Platform,Playground,Plaza,Poke Place,Pool,Portuguese Restaurant,Poutine Place,Print Shop,Pub,Racetrack,Ramen Restaurant,Record Shop,Rental Car Location,Restaurant,Rock Climbing Spot,Roof Deck,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,School,Sculpture Garden,Seafood Restaurant,Shoe Repair,Shoe Store,Shopping Mall,Shopping Plaza,Skating Rink,Smoke Shop,Snack Place,Soccer Field,Soccer Stadium,Soup Place,Souvenir Shop,Souvlaki Shop,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sri Lankan Restaurant,Steakhouse,Storage Facility,Street Art,Strip Club,Supermarket,Sushi Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tea 
Room,Tennis Court,Thai Restaurant,Theater,Theme Park,Theme Park Ride / Attraction,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Tree,Tunnel,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Regent Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Regent Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Regent Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Regent Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Regent Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [45]:
toronto_onehot.shape

(3275, 297)
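
The header printed above contains a `Neighborhood` entry among the categories, so assigning `toronto_onehot['Neighborhood']` replaces that dummy column instead of adding a new one. A possible workaround, sketched on hypothetical rows, is to encode first and then insert the label under a name that cannot collide. The column name `Area` is an assumption made for this example.

```python
import pandas as pd

# Hypothetical venues; one of them has the category 'Neighborhood'.
venues = pd.DataFrame({
    "Neighborhood": ["Regent Park", "Regent Park", "Rosedale"],
    "Venue Category": ["Pool", "Neighborhood", "Park"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")
n_categories = onehot.shape[1]

# Insert the label as the first column under a distinct name so no dummy is overwritten.
onehot.insert(0, "Area", venues["Neighborhood"])
print(onehot.columns.tolist())
```

Later steps that group by `Neighborhood` would then group by the new column instead.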

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped
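
Because each row of the one-hot frame marks a single category, the per-neighborhood mean is simply the share of venues in each category. A small hypothetical example:

```python
import pandas as pd

# Hypothetical one-hot rows: three venues in neighborhood A, one in B.
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Cafe": [1, 1, 0, 0],
    "Park": [0, 0, 1, 1],
})

# Mean of the 0/1 columns = fraction of venues per category in each neighborhood.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```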

#### Let's confirm the new size

In [None]:
toronto_grouped.shape

#### Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
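
To see what the function returns, it can be applied to a single hand-made row. The row below is hypothetical; the function body is repeated so the snippet runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the Neighborhood label in position 0
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical frequency row for one neighborhood.
row = pd.Series({"Neighborhood": "A", "Cafe": 0.5, "Park": 0.3, "Bar": 0.2})
top_two = return_most_common_venues(row, 2)
print(list(top_two))
```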

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
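
The try/except above gives correct labels for the ten columns used here, but it would produce `21th` if the list ever grew that far. A general helper, sketched under the hypothetical name `ordinal`, handles those cases as well.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # Numbers ending in 11, 12 or 13 always take 'th', regardless of the last digit.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Build the same column names as the cell above for the top ten venues.
columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[:4])
```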

## Cluster Neighborhoods

Run *k*-means to cluster the neighborhood into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
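
A toy run makes the shape of the result easier to picture. The frequency matrix below is hypothetical, and `n_init=10` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical category frequencies: two cafe-heavy rows, two park-heavy rows.
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# One label per row; rows with similar frequency profiles share a label.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = kmeans.labels_
print(labels)
```

The label numbers themselves are arbitrary; only the grouping of rows carries meaning.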

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()
#toronto_merged.tail()

In [None]:
# create map
map_clusters = folium.Map(location=[T_latitude, T_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
HTML(map_clusters._repr_html_())

## Cluster 0

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0] + list(range(4, toronto_merged.shape[1]))]]

## Cluster 1

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[0] + list(range(4, toronto_merged.shape[1]))]]

## Cluster 2

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[0] + list(range(4, toronto_merged.shape[1]))]]

## Cluster 3

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[0] + list(range(4, toronto_merged.shape[1]))]]

## Cluster 4

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[0] + list(range(4, toronto_merged.shape[1]))]]

### Conclusion and naming of clusters

According to the analysis, this is what I came up with in relation to every cluster:

**Cluster 0**: These are neighborhoods popular for the best bars, eateries and restaurants.

**Cluster 1**: Neighborhoods popular for the best coffee shops, eateries and hotels.

**Cluster 2**: Neighborhoods popular for their parks.

**Cluster 3**: Neighborhood popular for its airport.

**Cluster 4**: Neighborhood probably popular for its playground, but probably more for its sporting areas like the gym, the trail and the dog run.
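
One way to back up such labels is to average the category frequencies within each cluster and read off the strongest category. The sketch uses hypothetical numbers; on the real data the inputs would presumably come from `toronto_grouped` and `kmeans.labels_`.

```python
import pandas as pd

# Hypothetical per-neighborhood frequencies with their cluster labels.
freq = pd.DataFrame({
    "Cluster Labels": [0, 0, 1],
    "Bar": [0.4, 0.6, 0.0],
    "Coffee Shop": [0.2, 0.1, 0.1],
    "Park": [0.0, 0.1, 0.9],
})

# Mean frequency of every category inside each cluster.
profile = freq.groupby("Cluster Labels").mean()

# Name of the most frequent category per cluster.
top_category = profile.idxmax(axis=1)
print(top_category)
```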