Import the standard libraries to work with dataframes

In [1]:
import numpy as np
import pandas as pd

Use the read_html function from the pandas library and pass it the URL of the Wikipedia page. It returns a list of dataframes, one per HTML table found on the page, which is assigned to the variable 'dfs'.

In [2]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

Display the first five dataframes in the list.

In [3]:
dfs[0:5]

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

Assign the first dataframe from dfs to the dataframe: df_toronto

In [4]:
df_toronto = dfs[0]

Display first 8 rows in this dataframe

In [5]:
df_toronto.head(8)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned


Create a copy of this dataframe for data wrangling.

In [6]:
df_toronto_w = df_toronto.copy()

Remove every row whose borough is listed as 'Not assigned'. A boolean mask does this in one step, without looping over the rows.

In [7]:
# keep only the rows that have an assigned borough
df_toronto_w = df_toronto_w[df_toronto_w['Borough'] != 'Not assigned'].copy()

Search for any rows where the Neighbourhood is listed as 'Not assigned'. The search returns no records.

In [8]:
df_toronto_w.loc[df_toronto_w['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


Display the resulting dataframe. The index now has gaps where rows were removed; it is reset in the following cell.

In [9]:
df_toronto_w.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [10]:
df_toronto_w.reset_index(drop=True, inplace=True)
df_toronto_w.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Using the shape attribute, report the number of rows in the resulting dataframe.

In [10]:
print('There are ', df_toronto_w.shape[0], ' rows in this dataframe')

There are  103  rows in this dataframe


Read the postal-code CSV from the site below to obtain the latitude and longitude coordinates of each postal code above.

In [11]:
coords = pd.read_csv('https://cocl.us/Geospatial_data')

Add the coordinates found above to the dataframe created earlier by merging on the postal code.

In [12]:
df_toronto_w = df_toronto_w.merge(coords, on='Postal Code')
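
The merge above uses the default inner join, so any postal code missing from the coordinates file would be dropped without warning. A minimal sketch of how to make such rows visible, using two small made-up frames (the coordinate values are illustrative, not the real data) and the `how`, `indicator` and `validate` options of `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values are illustrative only)
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.75, 43.73],
    'Longitude': [-79.33, -79.32],
})

# how='left' keeps every postal code from the left table,
# indicator=True adds a '_merge' column describing where each row came from,
# validate='one_to_one' raises if a postal code is duplicated on either side
merged = left.merge(right, on='Postal Code', how='left',
                    indicator=True, validate='one_to_one')

# Postal codes that found no coordinates
missing = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

If `missing` is empty, the inner join and the left join give the same rows, and the simpler call used in the notebook is sufficient.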

Use the groupby and count functions to review the number of postal-code rows in each borough (a single row may list several neighbourhoods).

In [13]:
df_toronto_w.groupby('Borough').count()

Unnamed: 0_level_0,Postal Code,Neighbourhood,Latitude,Longitude
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Central Toronto,9,9,9,9
Downtown Toronto,19,19,19,19
East Toronto,5,5,5,5
East York,5,5,5,5
Etobicoke,12,12,12,12
Mississauga,1,1,1,1
North York,24,24,24,24
Scarborough,17,17,17,17
West Toronto,6,6,6,6
York,5,5,5,5
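
`count()` reports the number of non-null values in every column, which is why each row above repeats the same figure. When only the number of rows per borough is needed, `size()` returns a single Series instead. A small sketch on four rows taken from the table earlier in this notebook:

```python
import pandas as pd

# Four rows copied from the head of the Toronto table shown earlier
toy = pd.DataFrame({
    'Borough': ['North York', 'North York', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village',
                      'Regent Park, Harbourfront',
                      'Lawrence Manor, Lawrence Heights'],
})

# One number per borough, largest first
per_borough = toy.groupby('Borough').size().sort_values(ascending=False)
print(per_borough)
```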


Based on the results above, the rest of the analysis focuses on neighbourhoods within the North York borough, which has the most rows. Using geopy, obtain the coordinates of North York.

In [14]:
from geopy.geocoders import Nominatim

In [15]:
address = 'North York Ontario'

geolocator = Nominatim(user_agent="nyork_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York borough are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York borough are 43.7543263, -79.44911696639593.


Install and import folium to create maps.

In [18]:
!pip install folium==0.5.0

Collecting folium==0.5.0
  Downloading folium-0.5.0.tar.gz (79 kB)
Collecting branca
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Building wheels for collected packages: folium
  Building wheel for folium (setup.py) ... done
  Created wheel for folium: filename=folium-0.5.0-py3-none-any.whl size=76240 sha256=3d6e3326c5510983141091d81b1070cb7d724c95c305993d920667b8d7b0f794
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/b2/2f/2c/109e446b990d663ea5ce9b078b5e7c1a9c45cca91f377080f8
Successfully built folium
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.5.0


In [19]:
import folium

In [20]:
# create map of North York using latitude and longitude values
map_nyork = folium.Map(location=[latitude, longitude], zoom_start=12)

In [21]:
map_nyork

Create a dataframe containing only the North York data.

In [22]:
nyork_data = df_toronto_w[df_toronto_w['Borough'] == 'North York'].reset_index(drop=True)

Using the new dataframe, add markers to map focused on North York

In [23]:
for lat, lng, label in zip(nyork_data['Latitude'], nyork_data['Longitude'], nyork_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_nyork)
    
map_nyork

Enter Foursquare API credentials

In [24]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [24]:
nyork_data.loc[0, 'Neighbourhood']

'Parkwoods'

In [25]:
neighbourhood_latitude = nyork_data.loc[0, 'Latitude'] # neighborhood latitude value
neighbourhood_longitude = nyork_data.loc[0, 'Longitude'] # neighborhood longitude value

neighbourhood_name = nyork_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


In [26]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbourhood_latitude, 
    neighbourhood_longitude, 
    radius, 
    LIMIT)
url # display the request URL

'https://api.foursquare.com/v2/venues/explore?&client_id=YOUR_FOURSQUARE_CLIENT_ID&client_secret=YOUR_FOURSQUARE_CLIENT_SECRET&v=20180605&ll=43.7532586,-79.3296565&radius=500&limit=100'
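
Building the query string with `str.format` works here, but it performs no escaping of the values. A hedged alternative uses `urlencode` from the standard library. The endpoint and parameter names are the ones used in the cell above; the helper name `build_explore_url` and the credential strings are placeholders introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Return the explore URL for one location (illustrative helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # "lat,lng" as a single parameter
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes each value and joins the pairs with '&'
    return BASE_URL + '?' + urlencode(params)

example_url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET',
                                '20180605', 43.7532586, -79.3296565)
print(example_url)
```

Note that the comma inside `ll` comes out percent-encoded as `%2C`; this is the standard encoding of a comma in a query string.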

Import the libraries below to work with the JSON responses from Foursquare.

In [27]:
import json # library to handle JSON files
import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0)

The function getNearbyVenues, as shown in the lab, collects the venues within a given radius of each neighbourhood.

In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
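
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so a failed request or a venue without a category raises an exception partway through the loop. Below is a sketch of a more defensive parser. It runs on a hand-written dictionary that mimics the response shape used above; the sample payload and the helper name `extract_venues` are made up for illustration and are not part of the lab or of the Foursquare API.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((venue.get('name'), location.get('lat'),
                     location.get('lng'), category))
    return rows

# Hypothetical payload with the same nesting as the response parsed above
sample = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Venue One',
                   'location': {'lat': 43.75, 'lng': -79.33},
                   'categories': [{'name': 'Park'}]}},
        {'venue': {'name': 'Venue Two',
                   'location': {'lat': 43.76, 'lng': -79.34},
                   'categories': []}},
    ]}]}
}

rows = extract_venues(sample)
print(rows)
print(extract_venues({}))  # an empty or error payload yields an empty list
```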

Using the above function, get venues from the North York specific dataframe.

In [30]:
nyork_venues = getNearbyVenues(names=nyork_data['Neighbourhood'],
                                   latitudes=nyork_data['Latitude'],
                                   longitudes=nyork_data['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills
Glencairn
Don Mills
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
Downsview
York Mills, Silver Hills
Downsview
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview
Bedford Park, Lawrence Manor East
Humberlea, Emery
Willowdale, Willowdale East
Downsview
York Mills West
Willowdale, Willowdale West


Group the venues by Neighbourhood

In [32]:
nyork_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Bathurst Manor, Wilson Heights, Downsview North",24,24,24,24,24,24
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
Don Mills,26,26,26,26,26,26
Downsview,16,16,16,16,16,16
"Fairview, Henry Farm, Oriole",66,66,66,66,66,66
Glencairn,5,5,5,5,5,5
Hillcrest Village,5,5,5,5,5,5
Humber Summit,3,3,3,3,3,3
"Humberlea, Emery",2,2,2,2,2,2


Analyze each neighbourhood

In [34]:
# one hot encoding
nyork_onehot = pd.get_dummies(nyork_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
nyork_onehot['Neighbourhood'] = nyork_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [nyork_onehot.columns[-1]] + list(nyork_onehot.columns[:-1])
nyork_onehot = nyork_onehot[fixed_columns]

nyork_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Steakhouse,Supermarket,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Trail,Video Game Store,Vietnamese Restaurant,Women's Store
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group rows by neighbourhood, taking the mean of the frequency of occurrence of each venue category.

In [35]:
nyork_grouped = nyork_onehot.groupby('Neighbourhood').mean().reset_index()
nyork_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Steakhouse,Supermarket,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Trail,Video Game Store,Vietnamese Restaurant,Women's Store
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,...,0.0,0.041667,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.043478,0.043478,0.0,0.0,0.0,0.0,0.0,0.043478
3,Don Mills,0.0,0.0,0.0,0.038462,0.0,0.038462,0.038462,0.0,0.0,...,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Downsview,0.0,0.0625,0.0,0.0,0.0,0.0,0.0625,0.0,0.0625,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Fairview, Henry Farm, Oriole",0.0,0.0,0.015152,0.0,0.0,0.015152,0.0,0.030303,0.030303,...,0.0,0.0,0.0,0.0,0.015152,0.015152,0.0,0.015152,0.0,0.015152
6,Glencairn,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Hillcrest Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Humber Summit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Humberlea, Emery",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
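
To see why the mean of the one-hot columns is a frequency, consider a toy table with made-up venues. Each venue contributes a 1 in exactly one category column, so the column mean within a neighbourhood is the share of that neighbourhood's venues falling in the category. The neighbourhood names `A` and `B` and their venues are illustrative only.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'B', 'B', 'B', 'B'],
    'Venue Category': ['Park', 'Bank', 'Park', 'Gym', 'Gym', 'Gym'],
})

# One column per category; dtype=float keeps the later mean numeric
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', toy_venues['Neighbourhood'])

# Mean of 0/1 indicators = fraction of venues in each category
freq = onehot.groupby('Neighbourhood').mean()
print(freq)
```

Neighbourhood `A` has two venues, one of them a park, so its `Park` frequency is 0.5; `B` has three gyms out of four venues, giving 0.75. Every row of `freq` sums to 1.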


Find the top venue categories in each neighbourhood

In [37]:
num_top_venues = 5

for hood in nyork_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = nyork_grouped[nyork_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Wilson Heights, Downsview North----
           venue  freq
0    Coffee Shop  0.08
1           Bank  0.08
2  Grocery Store  0.04
3   Intersection  0.04
4    Gas Station  0.04


----Bayview Village----
                 venue  freq
0   Chinese Restaurant  0.25
1                 Café  0.25
2                 Bank  0.25
3  Japanese Restaurant  0.25
4    Accessories Store  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0  Italian Restaurant  0.13
1         Coffee Shop  0.09
2      Sandwich Place  0.09
3    Greek Restaurant  0.04
4        Liquor Store  0.04


----Don Mills----
                 venue  freq
0                  Gym  0.12
1           Restaurant  0.08
2  Japanese Restaurant  0.08
3           Beer Store  0.08
4          Coffee Shop  0.08


----Downsview----
            venue  freq
0   Grocery Store  0.19
1            Park  0.12
2  Baseball Field  0.06
3           Hotel  0.06
4  Discount Store  0.06


----Fairview, Henry Farm, Oriole--

Function from the lab to find the most common venue categories in a row.

In [38]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
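
A quick check of what the function returns, on a hand-made row with distinct frequencies (the values are illustrative). The first element of the row is skipped because it holds the neighbourhood name rather than a frequency.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # drop the leading 'Neighbourhood' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Illustrative row: a name followed by category frequencies
row = pd.Series({'Neighbourhood': 'Parkwoods',
                 'Bank': 0.1, 'Park': 0.6, 'Diner': 0.3})

top_two = list(return_most_common_venues(row, 2))
print(top_two)
```

Two caveats follow from how the sort works. Categories with a frequency of 0 still fill the remaining slots when a neighbourhood has fewer distinct categories than `num_top_venues`, and the relative order of tied categories is not guaranteed by the default sort.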

In [39]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = nyork_grouped['Neighbourhood']

for ind in np.arange(nyork_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(nyork_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Middle Eastern Restaurant,Sandwich Place,Pet Store,Pharmacy,Pizza Place,Bridal Shop,Mobile Phone Shop,Deli / Bodega
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Discount Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
2,"Bedford Park, Lawrence Manor East",Italian Restaurant,Coffee Shop,Sandwich Place,Women's Store,Restaurant,Greek Restaurant,Grocery Store,Indian Restaurant,Comfort Food Restaurant,Juice Bar
3,Don Mills,Gym,Restaurant,Japanese Restaurant,Beer Store,Coffee Shop,Café,Baseball Field,Dim Sum Restaurant,Discount Store,Asian Restaurant
4,Downsview,Grocery Store,Park,Hotel,Bank,Business Service,Baseball Field,Shopping Mall,Snack Place,Gym / Fitness Center,Athletics & Sports


Import KMeans to perform the clustering.

In [40]:
from sklearn.cluster import KMeans

In [41]:
# set number of clusters
kclusters = 5

nyork_grouped_clustering = nyork_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(nyork_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 4, 4, 4, 4, 4, 4, 4, 3, 1], dtype=int32)
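
For intuition about what `KMeans` does with the frequency vectors, here is a minimal NumPy sketch of Lloyd's algorithm on toy 2-D points. It is not scikit-learn's implementation, which adds k-means++ initialisation and convergence checks; the function name `lloyd_kmeans`, the fixed starting centroids and the data are assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(points, init_centroids, n_iter=10):
    """Alternate between assigning points and moving centroids."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # nearest centroid for each point
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members):  # leave an empty cluster's centroid in place
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Two well-separated toy groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Start from one point in each group to keep the example deterministic
labels, centroids = lloyd_kmeans(points, init_centroids=points[[0, 3]])
print(labels)
print(centroids)
```

In the notebook's output most neighbourhoods share label 4. That is a common outcome when many rows have similar, sparse frequency vectors, so comparing a few values of `kclusters` may be worthwhile.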

In [42]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

nyork_merged = nyork_data

# merge nyork_grouped with nyork_data to add latitude/longitude for each neighborhood
nyork_merged = nyork_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

nyork_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0.0,Park,Food & Drink Shop,Diner,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
1,M4A,North York,Victoria Village,43.725882,-79.315572,4.0,Coffee Shop,Pizza Place,Hockey Arena,Portuguese Restaurant,Intersection,Women's Store,Dim Sum Restaurant,Clothing Store,Comfort Food Restaurant,Construction & Landscaping
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,4.0,Clothing Store,Women's Store,Sporting Goods Shop,Boutique,Carpet Store,Coffee Shop,Event Space,Furniture / Home Store,Vietnamese Restaurant,Miscellaneous Shop
3,M3B,North York,Don Mills,43.745906,-79.352188,4.0,Gym,Restaurant,Japanese Restaurant,Beer Store,Coffee Shop,Café,Baseball Field,Dim Sum Restaurant,Discount Store,Asian Restaurant
4,M6B,North York,Glencairn,43.709577,-79.445073,4.0,Pizza Place,Asian Restaurant,Bakery,Pub,Japanese Restaurant,Women's Store,Diner,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping


A number of fields in the dataframe above contain NaN values, so dropna is used to remove those rows and ensure the map markers can be placed in the following cells. The necessary modules from matplotlib are also imported to help colour the map. Finally, the map of North York with the computed clusters is displayed.

In [48]:
nyork_merged = nyork_merged.dropna()
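
The NaN values most likely come from the join: a neighbourhood for which no venues were returned has no row in `neighbourhoods_venues_sorted`, so its cluster label is missing. The presence of NaN also turns the integer labels into floats, which would explain the `0.0` and `4.0` shown above. A sketch on made-up rows that drops only the rows lacking a label and restores the integer type (the frames and their values are illustrative):

```python
import pandas as pd

# Illustrative stand-ins for nyork_data and neighbourhoods_venues_sorted
neighbourhoods = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C']})
clusters = pd.DataFrame({'Neighbourhood': ['A', 'C'],
                         'Cluster Labels': [0, 2]})

# 'B' has no match, so its label is NaN and the column becomes float
joined = neighbourhoods.join(clusters.set_index('Neighbourhood'),
                             on='Neighbourhood')

# Drop only rows without a cluster label, then cast the labels back to int
cleaned = joined.dropna(subset=['Cluster Labels']).copy()
cleaned['Cluster Labels'] = cleaned['Cluster Labels'].astype(int)
print(cleaned)
```

Restricting `dropna` to the label column avoids discarding rows that are missing some unrelated field.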

In [49]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [50]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(nyork_merged['Latitude'], nyork_merged['Longitude'], nyork_merged['Neighbourhood'], nyork_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters