 To create the above dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

- **Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.**

- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. 

- **These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.**

- **If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.**

- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
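
The three cleaning rules above can be sketched on a small hand-made frame. The rows below are invented for illustration (they are not the scraped table), and `clean_postal_codes` is a hypothetical helper name rather than anything the assignment prescribes.

```python
import pandas as pd

def clean_postal_codes(raw):
    """Drop unassigned boroughs, fill unassigned neighborhoods from the
    borough, and merge rows that share a postal code."""
    # Keep only rows whose borough is assigned.
    out = raw[raw['Borough'] != 'Not assigned'].copy()
    # An unassigned neighborhood takes the borough's name.
    unassigned = out['Neighborhood'] == 'Not assigned'
    out.loc[unassigned, 'Neighborhood'] = out.loc[unassigned, 'Borough']
    # One row per postal code, neighborhoods joined with a comma.
    return (out.groupby(['PostalCode', 'Borough'])['Neighborhood']
               .agg(','.join)
               .reset_index())

# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

cleaned = clean_postal_codes(sample)
print(cleaned)
print(cleaned.shape[0])  # number of rows
```

Filling the unassigned neighborhoods before grouping keeps the literal string "Not assigned" out of any joined value.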

4. Submit a link to your Notebook on your GitHub repository. (10 marks)

Note: There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas to read the table into a pandas dataframe.

Another way, which is useful practice for more complicated web scraping cases, is to use the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

The package is so popular that there is a plethora of tutorials and examples on how to use it. Here is a very good YouTube video on how to use the BeautifulSoup package: https://www.youtube.com/watch?v=ng2o98k983k

Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.

In [1]:
import numpy as np 
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium # map rendering library
from bs4 import BeautifulSoup
# libraries for displaying images
from IPython.display import Image 
from IPython.display import HTML 
from IPython.display import display_html

# 1. Scraping and creating DataFrame

In [2]:
# Download and parse the data using BeautifulSoup.
# The current Wikipedia page caused problems, so this uses the URL of an older revision (Feb 2020).
url='https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&diff=942851379&oldid=942655599'

# take text
source = requests.get(url).text

soup = BeautifulSoup(source,'lxml')
#print(soup.prettify())


In [3]:
# Let's look at one row of the table
soup.tbody.tr.text

'\nPostcode\nBorough\nNeighbourhood\n'

**This shows that each row's text must be split on \n, and that the empty first and last elements must be dropped with [1:-1].**
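
As a quick check of that slicing, here is a minimal sketch that uses the header string shown above. The variable names are illustrative only.

```python
# Text of the header row, as returned by soup.tbody.tr.text above.
row_text = '\nPostcode\nBorough\nNeighbourhood\n'

# Splitting on '\n' leaves an empty string at each end,
# because the text both starts and ends with a newline.
parts = row_text.split('\n')

# [1:-1] drops those two empty strings and keeps the cell values.
cells = parts[1:-1]

print(parts)
print(cells)
```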

In [4]:
table = soup.find('tbody')
rows = table.find_all('tr')

#Extracting details
data = []
for row in rows:
    details = row.text.split('\n')[1:-1] 
    data.append(details)

In [5]:
# Convert the list to a DataFrame; the first row holds the column names
df = pd.DataFrame(data[1:], columns=data[0])
df = df.rename(columns={'Postcode': 'PostalCode', 'Neighbourhood': 'Neighborhood'}) # rename columns

df.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## 2. Only process the cells that have an assigned borough: drop the rows where the borough is "Not assigned"

In [6]:
df=df.drop(df[df.Borough == 'Not assigned'].index)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


## 3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [7]:
# let's see any postal code with 2 or more neighborhoods
df[df['PostalCode']=='M6A']

Unnamed: 0,PostalCode,Borough,Neighborhood
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [8]:
# Combine them into one row, separated by ','
df=df.groupby(['PostalCode','Borough']).agg(','.join).reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


## 4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [9]:
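# Where the neighborhood is 'Not assigned', use the borough name instead
df['Neighborhood'] = df['Neighborhood'].mask(df['Neighborhood'] == 'Not assigned', df['Borough'])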

## In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe

In [10]:
df.shape

(103, 3)

# Part 2

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:
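
A minimal sketch of such a retry loop, under stated assumptions: `get_coordinates` and `flaky_lookup` are hypothetical names introduced here for illustration, `lookup` can be any callable that returns a `[latitude, longitude]` pair or `None`, and the coordinates returned by the fake lookup are placeholders, not real geocoding results. Unlike a bare `while` loop, this version caps the number of attempts so that it cannot spin forever.

```python
def get_coordinates(lookup, query, max_attempts=10):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
    raise RuntimeError(
        'no coordinates for {!r} after {} attempts'.format(query, max_attempts))

# Fake lookup (illustration only): returns None twice, then placeholder values.
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [0.0, 0.0]  # placeholder, not a real location

latitude, longitude = get_coordinates(flaky_lookup, 'M5G, Toronto, Ontario')
print(latitude, longitude, calls['count'])
```

With the Geocoder package installed, `lookup` could be something like `lambda q: geocoder.arcgis(q).latlng`, assuming that provider is available; that call is not run in the sketch.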


Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use the Geocoder package or the csv file to create the following dataframe:

In [11]:
long_lat=pd.read_csv("http://cocl.us/Geospatial_data")
# rename the first column to allow merging dataframes on PostalCode
long_lat = long_lat.rename(columns={'Postal Code': 'PostalCode'})
df = pd.merge(long_lat, df, on='PostalCode')

df.head()

Unnamed: 0,PostalCode,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,"Rouge,Malvern"
1,M1C,43.784535,-79.160497,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae
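
Because `pd.merge` performs an inner join by default, a postal code that is missing from the CSV would be dropped without any warning. A small sketch of how that could be checked, using made-up rows and placeholder coordinates (not values from the real file; `M9Z` is an invented code):

```python
import pandas as pd

# Made-up postal codes, for illustration only.
codes = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough':    ['Scarborough', 'Scarborough', 'Etobicoke'],
})

# Placeholder coordinates; 'M9Z' is deliberately absent.
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude':    [1.0, 2.0],
    'Longitude':   [-1.0, -2.0],
}).rename(columns={'Postal Code': 'PostalCode'})

# A left join keeps every postal code; indicator=True adds a '_merge'
# column that says whether a match was found.
merged = pd.merge(codes, coords, on='PostalCode', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()

print(merged)
print('Postal codes without coordinates:', unmatched)
```

If `unmatched` is empty, the inner join used in the notebook loses no rows.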


In [12]:
# Reorder the columns to match the example
df = df[['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']]
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


# Part 3

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

- to add enough Markdown cells to explain what you decided to do and to report any observations you make.

- to generate maps to visualize your neighborhoods and how they cluster together.

#### Define Foursquare Credentials and Version

In [13]:
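import os
# Read the credentials from environment variables (names chosen here) instead of hard-coding secrets
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605' # Foursquare API version
LIMIT = 5 # maximum number of venues per request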

In [14]:
# Boroughs that contain the word Toronto
df_final = df[df['Borough'].str.contains('Toronto',regex=False)]
df_final.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


## Explore neighborhoods

In [15]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [16]:
toronto_venues = getNearbyVenues(names=df_final['Neighborhood'],
                                   latitudes=df_final['Latitude'],
                                   longitudes=df_final['Longitude']
                                  )

The Beaches
The Danforth West,Riverdale
The Beaches West,India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park,Summerhill East
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
Rosedale
Cabbagetown,St. James Town
Church and Wellesley
Harbourfront
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Roselawn
Forest Hill North,Forest Hill West
The Annex,North Midtown,Yorkville
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place,Underground city
Christie
Dovercourt Village,Dufferin
Little Portugal,Trinity
Brockton,Exhibition Place,Parkdale Village
High Park,The Junction South
Parkdale,Roncesvalles
Runnymede

In [17]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West,Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


In [18]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 80 unique categories.


## Analyze neighborhoods

In [19]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Airport,Airport Food Court,Airport Lounge,Airport Terminal,American Restaurant,Arts & Crafts Store,Asian Restaurant,Bakery,Bar,Bookstore,Breakfast Spot,Brewery,Bubble Tea Shop,Burrito Place,Bus Line,Café,Chinese Restaurant,Clothing Store,Coffee Shop,Comic Shop,Concert Hall,Cosmetics Shop,Cuban Restaurant,Dance Studio,Department Store,Dessert Shop,Diner,Distribution Center,Dog Run,Eastern European Restaurant,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Food,Food & Drink Shop,Garden,Gastropub,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Health Food Store,Hotel,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Korean Restaurant,Liquor Store,Mexican Restaurant,Middle Eastern Restaurant,Movie Theater,Museum,Neighborhood,Organic Grocery,Park,Pet Store,Pizza Place,Playground,Plaza,Pub,Ramen Restaurant,Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Spa,Speakeasy,Sporting Goods Shop,Supermarket,Sushi Restaurant,Swim School,Tea Room,Thai Restaurant,Theme Restaurant,Trail,Vegetarian / Vegan Restaurant
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,The Beaches,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"The Danforth West,Riverdale",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [20]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Lounge,Airport Terminal,American Restaurant,Arts & Crafts Store,Asian Restaurant,Bakery,Bar,Bookstore,Breakfast Spot,Brewery,Bubble Tea Shop,Burrito Place,Bus Line,Café,Chinese Restaurant,Clothing Store,Coffee Shop,Comic Shop,Concert Hall,Cosmetics Shop,Cuban Restaurant,Dance Studio,Department Store,Dessert Shop,Diner,Distribution Center,Dog Run,Eastern European Restaurant,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Food,Food & Drink Shop,Garden,Gastropub,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Health Food Store,Hotel,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Korean Restaurant,Liquor Store,Mexican Restaurant,Middle Eastern Restaurant,Movie Theater,Museum,Organic Grocery,Park,Pet Store,Pizza Place,Playground,Plaza,Pub,Ramen Restaurant,Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Spa,Speakeasy,Sporting Goods Shop,Supermarket,Sushi Restaurant,Swim School,Tea Room,Thai Restaurant,Theme Restaurant,Trail,Vegetarian / Vegan Restaurant
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2
2,"Brockton,Exhibition Place,Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.2,0.2,0.2,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
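
To make the one-hot and group-mean steps concrete, here is a toy sketch with made-up venues. It also guards against one pitfall: Foursquare has a venue category that is itself called `Neighborhood`, so the sketch renames that category before adding the neighborhood names under the same label. The name `Neighborhood (category)` is an assumption chosen for illustration.

```python
import pandas as pd

# Made-up venues for two neighborhoods, for illustration only.
venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Pub', 'Pub', 'Trail', 'Neighborhood', 'Pub'],
})

# One 0/1 indicator column per venue category.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# Rename the category that clashes with the key column, then add the key.
onehot = onehot.rename(columns={'Neighborhood': 'Neighborhood (category)'})
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of 0/1 indicators per group is each category's share of venues.
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, which is why the values can be read as frequencies.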


### Most common venues

In [21]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,King,Richmond----
                           venue  freq
0  Vegetarian / Vegan Restaurant   0.2
1                          Plaza   0.2
2                          Hotel   0.2
3                     Restaurant   0.2
4                   Concert Hall   0.2


----Berczy Park----
                           venue  freq
0  Vegetarian / Vegan Restaurant   0.2
1                   Concert Hall   0.2
2                         Museum   0.2
3                     Restaurant   0.2
4                   Liquor Store   0.2


----Brockton,Exhibition Place,Parkdale Village----
                venue  freq
0         Coffee Shop   0.4
1                 Gym   0.2
2  Italian Restaurant   0.2
3                 Bar   0.2
4        Liquor Store   0.0


----Business Reply Mail Processing Centre 969 Eastern----
            venue  freq
0      Comic Shop   0.2
1         Brewery   0.2
2   Burrito Place   0.2
3  Farmers Market   0.2
4     Pizza Place   0.2


----CN Tower,Bathurst Quay,Island airport,Harbourfro

In [22]:
# Return the num_top_venues most frequent venue categories for one neighborhood row, in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [23]:
# Now let's create the new dataframe and display the top 5 venues for each neighborhood.
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Adelaide,King,Richmond",Vegetarian / Vegan Restaurant,Concert Hall,Plaza,Hotel,Restaurant
1,Berczy Park,Vegetarian / Vegan Restaurant,Museum,Concert Hall,Liquor Store,Restaurant
2,"Brockton,Exhibition Place,Parkdale Village",Coffee Shop,Italian Restaurant,Bar,Gym,Food & Drink Shop
3,Business Reply Mail Processing Centre 969 Eastern,Pizza Place,Burrito Place,Farmers Market,Brewery,Comic Shop
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport,Airport Food Court,Airport Lounge,Airport Terminal,Harbor / Marina


## Clustering

In [24]:
# k-means to cluster the neighborhoods into 5 clusters.
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=5, random_state=5).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:5] 

array([0, 0, 1, 4, 0], dtype=int32)

##### Create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.

In [25]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# join df_final with neighborhoods_venues_sorted to add the cluster label and top venues to each neighborhood
toronto_merged = df_final.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Trail,Health Food Store,Pub,Vegetarian / Vegan Restaurant,Diner
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,1,Greek Restaurant,Ice Cream Shop,Cosmetics Shop,Italian Restaurant,Food
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,4,Gym,Fish & Chips Shop,Fast Food Restaurant,Ice Cream Shop,Brewery
43,M4M,East Toronto,Studio District,43.659526,-79.340923,1,Pet Store,Bookstore,Ice Cream Shop,Coffee Shop,Sandwich Place
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,3,Bus Line,Swim School,Park,Vegetarian / Vegan Restaurant,Dessert Shop


## Cluster 0

In [26]:
Cluster_0=toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
Cluster_0

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
37,East Toronto,0,Trail,Health Food Store,Pub,Vegetarian / Vegan Restaurant,Diner
46,Central Toronto,0,Yoga Studio,Spa,Diner,Salon / Barbershop,Chinese Restaurant
49,Central Toronto,0,Liquor Store,Sushi Restaurant,Supermarket,American Restaurant,Restaurant
51,Downtown Toronto,0,Italian Restaurant,Japanese Restaurant,Café,Diner,Indian Restaurant
52,Downtown Toronto,0,Theme Restaurant,Dance Studio,Bubble Tea Shop,Ramen Restaurant,Breakfast Spot
55,Downtown Toronto,0,Italian Restaurant,Gym,Restaurant,Coffee Shop,Japanese Restaurant
56,Downtown Toronto,0,Vegetarian / Vegan Restaurant,Museum,Concert Hall,Liquor Store,Restaurant
58,Downtown Toronto,0,Vegetarian / Vegan Restaurant,Concert Hall,Plaza,Hotel,Restaurant
60,Downtown Toronto,0,Gym,Pub,Hotel,Restaurant,Coffee Shop
61,Downtown Toronto,0,Gym,Café,Pub,Restaurant,Coffee Shop


In [27]:
print(Cluster_0.shape)

(17, 7)


## Cluster 1

In [28]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1,
                   toronto_merged.columns[
                       [2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
41,"The Danforth West,Riverdale",1,Greek Restaurant,Ice Cream Shop,Cosmetics Shop,Italian Restaurant,Food
43,Studio District,1,Pet Store,Bookstore,Ice Cream Shop,Coffee Shop,Sandwich Place
53,Harbourfront,1,Coffee Shop,Distribution Center,Bakery,Spa,Breakfast Spot
57,Central Bay Street,1,Coffee Shop,Gastropub,Distribution Center,Concert Hall,Cosmetics Shop
75,Christie,1,Café,Grocery Store,Italian Restaurant,Coffee Shop,Airport Lounge
78,"Brockton,Exhibition Place,Parkdale Village",1,Coffee Shop,Italian Restaurant,Bar,Gym,Food & Drink Shop
84,"Runnymede,Swansea",1,Café,Food,Fish & Chips Shop,Pub,Coffee Shop
85,Queen's Park,1,Coffee Shop,Italian Restaurant,Distribution Center,Park,Dessert Shop


## Cluster 2

In [29]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2,
                   toronto_merged.columns[
                       [2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
63,Roselawn,2,Garden,Vegetarian / Vegan Restaurant,Distribution Center,Concert Hall,Cosmetics Shop


## Cluster 3

In [30]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3,
                   toronto_merged.columns[
                       [2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
44,Lawrence Park,3,Bus Line,Swim School,Park,Vegetarian / Vegan Restaurant,Dessert Shop
45,Davisville North,3,Breakfast Spot,Food & Drink Shop,Hotel,Department Store,Park
48,"Moore Park,Summerhill East",3,Playground,Vegetarian / Vegan Restaurant,Coffee Shop,Concert Hall,Cosmetics Shop
50,Rosedale,3,Park,Trail,Playground,Vegetarian / Vegan Restaurant,Dessert Shop
59,"Harbourfront East,Toronto Islands,Union Station",3,Park,Dessert Shop,Sporting Goods Shop,Salad Place,Coffee Shop
64,"Forest Hill North,Forest Hill West",3,Trail,Sushi Restaurant,Park,Jewelry Store,Vegetarian / Vegan Restaurant
82,"High Park,The Junction South",3,Bar,Speakeasy,Gastropub,Italian Restaurant,Park


## Cluster 4

In [31]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4,
                   toronto_merged.columns[
                       [2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
42,"The Beaches West,India Bazaar",4,Gym,Fish & Chips Shop,Fast Food Restaurant,Ice Cream Shop,Brewery
47,Davisville,4,Dessert Shop,Pizza Place,Café,Indian Restaurant,Vegetarian / Vegan Restaurant
54,"Ryerson,Garden District",4,Comic Shop,Pizza Place,Tea Room,Plaza,Clothing Store
76,"Dovercourt Village,Dufferin",4,Grocery Store,Brewery,Middle Eastern Restaurant,Bar,Gym / Fitness Center
77,"Little Portugal,Trinity",4,Pizza Place,Korean Restaurant,Asian Restaurant,Ice Cream Shop,Brewery
87,Business Reply Mail Processing Centre 969 Eastern,4,Pizza Place,Burrito Place,Farmers Market,Brewery,Comic Shop


# VISUALISATION


In [33]:
# create map
latitude = 43.651070
longitude = -79.347015
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one color per cluster label
kclusters = 5
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'],
                                  toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],  # index colors directly by cluster label (0 to 4)
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters