# 1. Create the dataframe

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [1]:
# import needed libraries

from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

In [2]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
index = r.text
soup_index = BeautifulSoup(index, 'html.parser')

In [3]:
# get columns of the target dataframe

list_headers = []
for a in soup_index.select('th'):
    list_headers.append(a.get_text().strip())
list_headers = list_headers[:-1]
list_headers

['Postal Code', 'Borough', 'Neighborhood']

In [4]:
# get all the needed info

list_info = []
for b in soup_index.select('td'):
    list_info.append(b.get_text().strip())
# list_info.index('M9Z')
list_info = list_info[:540]
x = np.array(list_info).reshape(180,3)
Nei_To = pd.DataFrame(x)
Nei_To.columns = list_headers
Nei_To

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [5]:
#  Ignore cells with a borough that is Not assigned

Nei_To = Nei_To[~Nei_To['Borough'].isin(['Not assigned'])]
Nei_To.reset_index(drop=True)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [6]:
Nei_To.shape

(103, 3)

#  2. Get the latitude and the longitude coordinates of each neighborhood

In [11]:
# import the dataset of geographical coordinates

Geo_Coordinates = pd.read_csv('Geospatial_Coordinates.csv')

# merge two dataframes

Toronto_data = pd.merge(Nei_To, Geo_Coordinates, how='left', on='Postal Code')
Toronto_data

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


# 3. Explore and cluster the neighborhoods in Toronto.

In [12]:
# import the needed libraries 

import folium
import geocoder
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import json
import requests

### Define Foursquare Credentials and Version

In [13]:
client_id = 'S4LDZJC1LZPDP4WC4O31JJGGEGWEE43W110DNNGTMZI53MGY'
client_secret = 'I01AHG5RWS1VNCDCVU5FCFNKBKIZ4CCIBSRSN4TH1H5Q3TCL'
version = '20200606'

In [14]:
Toronto_data.loc[0, 'Postal Code']
neighborhood_latitude = Toronto_data.loc[0, 'Latitude']
neighborhood_longitude = Toronto_data.loc[0, 'Longitude']
neighborhood_name = Toronto_data.loc[0, 'Neighborhood']

print('Latitude and Longitude values of ({}) are {},{}.'.format(neighborhood_name,
                                                             neighborhood_latitude,
                                                             neighborhood_longitude))

Latitude and Longitude values of (Parkwoods) are 43.7532586,-79.3296565.


In [21]:
# get the top 50 venues that are in Parkwoods within a radius of 1000 meters

limit = 50
radius = 1000
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    client_id, 
    client_secret, 
    version, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    limit)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=S4LDZJC1LZPDP4WC4O31JJGGEGWEE43W110DNNGTMZI53MGY&client_secret=I01AHG5RWS1VNCDCVU5FCFNKBKIZ4CCIBSRSN4TH1H5Q3TCL&v=20200606&ll=43.7532586,-79.3296565&radius=1000&limit=50'

In [22]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5edbb396211536001bf0f1ec'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 29,
  'suggestedBounds': {'ne': {'lat': 43.762258609000014,
    'lng': -79.31721997969855},
   'sw': {'lat': 43.74425859099999, 'lng': -79.34209302030145}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b8991cbf964a520814232e3',
       'name': "Allwyn's Bakery",
       'location': {'address': '81 Underhill drive',
        'lat': 43.75984035203157,
        'lng': -79.32471879917513,
        'labeledLatLngs': [{'label': 'display'

In [23]:
# function that extracts the category of the venue

def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [24]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Allwyn's Bakery,Caribbean Restaurant,43.75984,-79.324719
1,Brookbanks Park,Park,43.751976,-79.33214
2,Tim Hortons,Café,43.760668,-79.326368
3,A&W,Fast Food Restaurant,43.760643,-79.326865
4,Bruno's valu-mart,Grocery Store,43.746143,-79.32463


In [25]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

29 venues were returned by Foursquare.


### Explore Neighborhoods in Toronto

In [26]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            client_id, 
            client_secret, 
            version, 
            neighborhood_latitude, 
            neighborhood_longitude, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Toronto_venues = getNearbyVenues(names=df['Postal Code'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

print(Toronto_venues.shape)
Toronto_venues.head()

M3A
M4A
M5A
M6A
M7A
M9A
M1B
M3B
M4B
M5B
M6B
M9B
M1C
M3C
M4C
M5C
M6C
M9C
M1E
M4E
M5E
M6E
M1G
M4G
M5G
M6G
M1H
M2H
M3H
M4H
M5H
M6H
M1J
M2J
M3J
M4J
M5J
M6J
M1K
M2K
M3K
M4K
M5K
M6K
M1L
M2L
M3L
M4L
M5L
M6L
M9L
M1M
M2M
M3M
M4M
M5M
M6M
M9M
M1N
M2N
M3N
M4N
M5N
M6N
M9N
M1P
M2P
M4P
M5P
M6P
M9P
M1R
M2R
M4R
M5R
M6R
M7R
M9R
M1S
M4S
M5S
M6S
M1T
M4T
M5T
M1V
M4V
M5V
M8V
M9V
M1W
M4W
M5W
M8W
M9W
M1X
M4X
M5X
M8X
M4Y
M7Y
M8Y
M8Z
(2987, 7)


Unnamed: 0,Postal Code,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.753259,-79.329656,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
2,M3A,43.753259,-79.329656,Tim Hortons,43.760668,-79.326368,Café
3,M3A,43.753259,-79.329656,A&W,43.760643,-79.326865,Fast Food Restaurant
4,M3A,43.753259,-79.329656,Bruno's valu-mart,43.746143,-79.32463,Grocery Store


In [27]:
Toronto_venues.groupby('Postal Code').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
M1B,29,29,29,29,29,29
M1C,29,29,29,29,29,29
M1E,29,29,29,29,29,29
M1G,29,29,29,29,29,29
M1H,29,29,29,29,29,29
...,...,...,...,...,...,...
M9N,29,29,29,29,29,29
M9P,29,29,29,29,29,29
M9R,29,29,29,29,29,29
M9V,29,29,29,29,29,29


In [28]:
print('There are {} uniques categories.'.format(len(Toronto_venues['Venue Category'].unique())))

There are 23 uniques categories.


### Analyze Each Neighborhood

In [29]:
# one hot encoding
To_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
To_onehot['Neighborhood'] = Toronto_venues['Postal Code'] 

# move neighborhood column to the first column
fixed_columns = [To_onehot.columns[-1]] + list(To_onehot.columns[:-1])
To_onehot = To_onehot[fixed_columns]

To_onehot.head()

Unnamed: 0,Neighborhood,Bus Stop,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,Convenience Store,Cosmetics Shop,Discount Store,Fast Food Restaurant,...,Park,Pharmacy,Pizza Place,Road,Shop & Service,Shopping Mall,Skating Rink,Supermarket,Tennis Court,Train Station
0,M3A,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,M3A,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M3A,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
To_onehot.shape

(2987, 24)

###     group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [32]:
To_grouped = To_onehot.groupby('Neighborhood').mean().reset_index()
To_grouped.head()

Unnamed: 0,Neighborhood,Bus Stop,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,Convenience Store,Cosmetics Shop,Discount Store,Fast Food Restaurant,...,Park,Pharmacy,Pizza Place,Road,Shop & Service,Shopping Mall,Skating Rink,Supermarket,Tennis Court,Train Station
0,M1B,0.068966,0.034483,0.034483,0.034483,0.034483,0.068966,0.034483,0.034483,0.034483,...,0.103448,0.068966,0.034483,0.034483,0.034483,0.068966,0.034483,0.034483,0.034483,0.034483
1,M1C,0.068966,0.034483,0.034483,0.034483,0.034483,0.068966,0.034483,0.034483,0.034483,...,0.103448,0.068966,0.034483,0.034483,0.034483,0.068966,0.034483,0.034483,0.034483,0.034483
2,M1E,0.068966,0.034483,0.034483,0.034483,0.034483,0.068966,0.034483,0.034483,0.034483,...,0.103448,0.068966,0.034483,0.034483,0.034483,0.068966,0.034483,0.034483,0.034483,0.034483
3,M1G,0.068966,0.034483,0.034483,0.034483,0.034483,0.068966,0.034483,0.034483,0.034483,...,0.103448,0.068966,0.034483,0.034483,0.034483,0.068966,0.034483,0.034483,0.034483,0.034483
4,M1H,0.068966,0.034483,0.034483,0.034483,0.034483,0.068966,0.034483,0.034483,0.034483,...,0.103448,0.068966,0.034483,0.034483,0.034483,0.068966,0.034483,0.034483,0.034483,0.034483


In [33]:
To_grouped.shape

(103, 24)

In [36]:
for hood in To_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = To_grouped[To_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    num_top_venues = temp.sort_values('freq', ascending=False).reset_index(drop=True).head()
    print(num_top_venues)
    print('\n')

----M1B----
               venue  freq
0               Park  0.10
1           Bus Stop  0.07
2  Convenience Store  0.07
3      Shopping Mall  0.07
4           Pharmacy  0.07


----M1C----
               venue  freq
0               Park  0.10
1           Bus Stop  0.07
2  Convenience Store  0.07
3      Shopping Mall  0.07
4           Pharmacy  0.07


----M1E----
               venue  freq
0               Park  0.10
1           Bus Stop  0.07
2  Convenience Store  0.07
3      Shopping Mall  0.07
4           Pharmacy  0.07


----M1G----
               venue  freq
0               Park  0.10
1           Bus Stop  0.07
2  Convenience Store  0.07
3      Shopping Mall  0.07
4           Pharmacy  0.07


----M1H----
               venue  freq
0               Park  0.10
1           Bus Stop  0.07
2  Convenience Store  0.07
3      Shopping Mall  0.07
4           Pharmacy  0.07


----M1J----
               venue  freq
0               Park  0.10
1           Bus Stop  0.07
2  Convenience Store  0.07


In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [38]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = To_grouped['Neighborhood']

for ind in np.arange(To_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(To_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Park,Bus Stop,Shopping Mall,Convenience Store,Pharmacy,Fish & Chips Shop,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop
1,M1C,Park,Bus Stop,Shopping Mall,Convenience Store,Pharmacy,Fish & Chips Shop,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop
2,M1E,Park,Bus Stop,Shopping Mall,Convenience Store,Pharmacy,Fish & Chips Shop,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop
3,M1G,Park,Bus Stop,Shopping Mall,Convenience Store,Pharmacy,Fish & Chips Shop,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop
4,M1H,Park,Bus Stop,Shopping Mall,Convenience Store,Pharmacy,Fish & Chips Shop,Café,Caribbean Restaurant,Chinese Restaurant,Coffee Shop


### Cluster Neighborhood

In [39]:
# set number of clusters
kclusters = 5

Toronto_grouped_clustering = To_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

  import sys


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## Based on the kmeans.labels showed above, there is only one cluster