# Applied Data Science Capstone

## Part 1: Create Dataframe of Toronto Neighborhoods

In [1]:
import pandas as pd

We need to start by scraping the tables from the Wikipedia page showing all postal codes of Toronto.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
dfs = pd.read_html(url)

Let's extract the table we need and give the columns the names indicated in the assignment instructions.

In [3]:
df = dfs[0]
df.columns = ["PostalCode", "Borough", "Neighborhood"]
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


We drop all rows that have no borough assigned to them.

In [4]:
df = df[df.Borough != "Not assigned"]
df.reset_index(drop = True, inplace = True)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Let's check whether there are any duplicates in the PostalCode column.

In [5]:
len(df.PostalCode.unique())

103

We can see that the number of unique postal codes equals the number of rows in the df. Therefore, the df only contains unique postal codes, i.e. already groups neighborhoods by postal code. There is no need to adjust the df as requested in the assignment instructions. 
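The same check can be phrased more directly with pandas' built-in duplicate helpers. This is a minimal sketch on a small made-up frame (`sample` is illustrative only, not the scraped table):

```python
import pandas as pd

# Illustrative data only; the real df comes from the Wikipedia table.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto", "Downtown Toronto"],
})

# True only if no postal code appears more than once
all_unique = sample["PostalCode"].is_unique

# Rows whose postal code occurs more than once (keep=False marks every copy)
repeated = sample[sample["PostalCode"].duplicated(keep=False)]

print(all_unique)
print(repeated)
```

If `is_unique` is `True` on the real dataframe, no grouping by postal code is needed.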

Let's also check whether there are any rows in the df that have no neighborhood assigned to them.

In [6]:
"Not assigned" in list(df.Neighborhood)

False

So the neighborhood column does not contain any more "Not assigned" values. As above, there is thus no need to adjust the df as requested in the assignment instructions.

In [7]:
# Following code is not needed, as shown above.

#neighborhoods_temp = []

#for borough, neighborhood in zip(df.Borough, df.Neighborhood):
#    if neighborhood == "Not assigned":
#        neighborhoods_temp.append(borough)
#    else:
#        neighborhoods_temp.append(neighborhood)

#df.Neighborhood = neighborhoods_temp
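Had the fallback been necessary, the loop above could also be written without explicit iteration. A sketch under that assumption, using `Series.where` on made-up rows (the second row is invented for illustration):

```python
import pandas as pd

# Illustrative rows; the second one imitates a missing neighborhood.
sample = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Keep the neighborhood where it is assigned, otherwise fall back to the borough
sample["Neighborhood"] = sample["Neighborhood"].where(
    sample["Neighborhood"] != "Not assigned", sample["Borough"]
)

print(sample)
```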

In [8]:
df.shape

(103, 3)

Our final df shows the three columns we want and contains 103 rows (i.e. unique postal codes).

## Part 2: Add Geographic Coordinates of Neighborhoods

We will first try the solution suggested in the assignment instructions.

In [9]:
#import geocoder

In [10]:
# Initialize your variable to None
#lat_lng_coords = None

# Loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format("M5G"))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]

This did not work, so we use the CSV file provided instead.

In [11]:
latlng = pd.read_csv("Geo_Coordinates.csv")
latlng

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [12]:
df = df.join(latlng.set_index("Postal Code"), on = "PostalCode")
df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
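
A join like this silently leaves `NaN` coordinates for any postal code missing from the CSV. One way to check for that, sketched here on toy data (the code `M0X` is made up for illustration), is a left merge with `indicator=True`:

```python
import pandas as pd

codes = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M0X"]})  # M0X is hypothetical
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a "_merge" column telling where each row was found
merged = codes.merge(
    coords, how="left", left_on="PostalCode", right_on="Postal Code", indicator=True
)

# Postal codes that found no coordinates in the CSV
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postal code received coordinates.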


## Part 3: Exploring and Clustering Neighborhoods in Toronto

We only want to work with boroughs that contain the word Toronto.

In [13]:
Toronto = df[df["Borough"].str.contains("Toronto")]
Toronto.reset_index(drop = True, inplace = True)
Toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [14]:
Toronto.shape

(39, 5)

This cuts our dataframe down to 39 rows (postal codes).

Now we can start exploring the neighborhoods. Let's first create a map of all the neighborhoods in Toronto with their postal codes.

In [15]:
from geopy.geocoders import Nominatim

In [16]:
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode("Toronto, Ontario")
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [17]:
import folium

In [18]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, borough, postal in zip(Toronto['Latitude'], Toronto['Longitude'], Toronto['Borough'], Toronto['PostalCode']):
    label = '{}, {}'.format(postal, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

Next we use the Foursquare API, which requires credentials. Personal account details have been removed from the following cell, so enter your own access information if you want to run the analysis yourself.

In [19]:
CLIENT_ID = 'xxx'
CLIENT_SECRET = 'xxx'
ACCESS_TOKEN = 'xxx'
VERSION = '20180605'

We want to find up to 100 venues for each postal code within a radius of 500 meters. Let's test this for the first postal code in our dataframe: M5A.

In [20]:
latitude_M5A = Toronto.loc[0, "Latitude"]
longitude_M5A = Toronto.loc[0, "Longitude"]

print('Latitude and longitude values of M5A are {}, {}.'.format(latitude_M5A, longitude_M5A))

Latitude and longitude values of M5A are 43.6542599, -79.3606359.


In [21]:
import requests

limit = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude_M5A, longitude_M5A, ACCESS_TOKEN, VERSION, radius, limit)

results = requests.get(url).json()
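
Formatting the query string by hand works, but it is easy to misplace a parameter. As an alternative sketch, the URL can be assembled from a dictionary with the standard library's `urlencode`. The helper name `build_explore_url` is hypothetical; the endpoint and parameter names are the ones used in the cell above:

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, client_id, client_secret, token, version, radius, limit):
    """Return a venues/explore URL with encoded query parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "oauth_token": token,
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, as in the notebook
url = build_explore_url(43.6542599, -79.3606359, "xxx", "xxx", "xxx", "20180605", 500, 100)
print(url)
```

The resulting string can be passed to `requests.get` exactly like the hand-formatted one.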

Let's clean the JSON file and convert it to a dataframe.

In [22]:
venues = results["response"]["groups"][0]["items"]

# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize
recommended_venues = pd.json_normalize(venues)
recommended_venues = recommended_venues.loc[:,["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]]

recommended_venues.columns = ["Name", "Category", "Latitude", "Longitude"]

recommended_venues.head()


Unnamed: 0,Name,Category,Latitude,Longitude
0,Tandem Coffee,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",43.653559,-79.361809
1,Roselle Desserts,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",43.653447,-79.362017
2,Cooper Koo Family YMCA,"[{'id': '52e81612bcbc57f1066b7a37', 'name': 'D...",43.653249,-79.358008
3,Body Blitz Spa East,"[{'id': '4bf58dd8d48988d1ed941735', 'name': 'S...",43.654735,-79.359874
4,Impact Kitchen,"[{'id': '4bf58dd8d48988d1c4941735', 'name': 'R...",43.656369,-79.35698


We need to create a function that extracts the category name of each venue.

In [23]:
def get_category_type(row):
    categories_list = row["Category"]   
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]["name"]

recommended_venues["Category"] = recommended_venues.apply(get_category_type, axis = 1)

recommended_venues.head()

Unnamed: 0,Name,Category,Latitude,Longitude
0,Tandem Coffee,Coffee Shop,43.653559,-79.361809
1,Roselle Desserts,Bakery,43.653447,-79.362017
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Impact Kitchen,Restaurant,43.656369,-79.35698
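
The same extraction can also be applied to the column directly instead of row by row. A sketch with a hypothetical helper `first_category_name` and made-up category lists shaped like the Foursquare response above:

```python
import pandas as pd

def first_category_name(categories):
    """Return the name of the first category, or None for an empty list."""
    return categories[0]["name"] if categories else None

# Made-up entries in the same shape as the "Category" column
sample = pd.Series([
    [{"id": "a", "name": "Coffee Shop"}],
    [],  # a venue without any category
])

names = sample.map(first_category_name)
print(names)
```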


In [24]:
recommended_venues.shape

(75, 4)

Foursquare returned 75 venues for this postal code.

Now that we see that our test has worked, we can run the same analysis for all postal codes in Toronto.

In [25]:
venues_list=[]

for postal, lat, lng in zip(Toronto["PostalCode"], Toronto["Latitude"], Toronto["Longitude"]):
    print(postal)
    
    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&radius={}&limit={}'.format(
        CLIENT_ID,
        CLIENT_SECRET,
        lat,
        lng,
        ACCESS_TOKEN,
        VERSION,
        radius,
        limit)
            
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
        
    # return only relevant information for each nearby venue
    venues_list.append([(
        postal, 
        lat, 
        lng, 
        v['venue']['name'], 
        v['venue']['location']['lat'], 
        v['venue']['location']['lng'],  
        # guard against venues that come back without any category
        v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])

Toronto_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
Toronto_venues.columns = ['Postal Code', 
                  'Postal Code Latitude', 
                  'Postal Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

M5A
M7A
M5B
M5C
M4E
M5E
M5G
M6G
M5H
M6H
M5J
M6J
M4K
M5K
M6K
M4L
M5L
M4M
M4N
M5N
M4P
M5P
M6P
M4R
M5R
M6R
M4S
M5S
M6S
M4T
M5T
M4V
M5V
M4W
M5W
M4X
M5X
M4Y
M7Y


In [26]:
print(Toronto_venues.shape)
Toronto_venues.head()

(2165, 7)


Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M5A,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,M5A,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
2,M5A,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,M5A,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,M5A,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


Great! We now have a list of all the recommended venues for all postal codes in Toronto.

Let's check how many venues were returned for each postal code.

In [27]:
Toronto_venues.groupby("Postal Code").count()

Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M4E,9,9,9,9,9,9
M4K,84,84,84,84,84,84
M4L,28,28,28,28,28,28
M4M,54,54,54,54,54,54
M4N,8,8,8,8,8,8
M4P,13,13,13,13,13,13
M4R,38,38,38,38,38,38
M4S,44,44,44,44,44,44
M4T,5,5,5,5,5,5
M4V,23,23,23,23,23,23


We can see that we did not reach the limit of 100 for many postal codes.
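
To see this at a glance rather than by reading the table, the counts can be compared against the limit directly. A minimal sketch on toy data, with a deliberately small illustrative limit (the notebook uses 100):

```python
import pandas as pd

limit = 3  # small illustrative limit; the notebook uses 100
venues = pd.DataFrame({
    "Postal Code": ["M4E", "M4E", "M4K", "M4K", "M4K"],
    "Venue": ["a", "b", "c", "d", "e"],  # placeholder venue names
})

# One count per postal code
counts = venues.groupby("Postal Code").size()

# Postal codes whose result list was capped by the limit
at_limit = counts[counts >= limit].index.tolist()
print(counts)
print(at_limit)
```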

Let's see how many unique venue categories there are.

In [28]:
print('There are {} unique categories.'.format(len(Toronto_venues['Venue Category'].unique())))

There are 271 unique categories.


Now we can use one-hot encoding to measure the frequency of each venue category for each postal code.

In [29]:
Toronto_onehot = pd.get_dummies(Toronto_venues[["Venue Category"]], prefix="", prefix_sep="")
Toronto_onehot["Postal Code"] = Toronto_venues["Postal Code"]
Toronto_onehot = Toronto_onehot[[Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])]
Toronto_onehot.head()

Unnamed: 0,Postal Code,ATM,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We need to group the dataframe by Postal Code to calculate the frequency of each venue category per postal code.

In [30]:
Toronto_grouped = Toronto_onehot.groupby("Postal Code").mean().reset_index()
Toronto_grouped

Unnamed: 0,Postal Code,ATM,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,...,0.011905,0.011905,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,0.011905
2,M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,...,0.018519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037
4,M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M4P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M4R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316
7,M4S,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022727,...,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M4T,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,M4V,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0
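
Why the mean of the one-hot columns is a frequency may be easier to see on a tiny example. In this sketch (toy data, not Foursquare results) each row is one venue, so the per-group mean of a 0/1 column is the share of venues in that category:

```python
import pandas as pd

venues = pd.DataFrame({
    "Postal Code": ["M4E", "M4E", "M4K", "M4K"],
    "Venue Category": ["Park", "Trail", "Park", "Park"],
})

# One 0/1 column per category, one row per venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Postal Code", venues["Postal Code"])

# Mean of a 0/1 column = fraction of venues in that category
freq = onehot.groupby("Postal Code").mean()
print(freq)
```

Each row of `freq` sums to 1, which is what makes the rows comparable across postal codes with different venue counts.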


We can also print each postal code with its top 5 venue categories.

In [31]:
num_top_venues = 5

for postal in Toronto_grouped["Postal Code"]:
    print("----"+postal+"----")
    temp = Toronto_grouped[Toronto_grouped["Postal Code"] == postal].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M4E----
               venue  freq
0  Health Food Store  0.11
1              Trail  0.11
2       Neighborhood  0.11
3   Asian Restaurant  0.11
4               Park  0.11


----M4K----
              venue  freq
0  Greek Restaurant  0.15
1       Coffee Shop  0.06
2               Spa  0.05
3  Sushi Restaurant  0.05
4    Ice Cream Shop  0.04


----M4L----
                  venue  freq
0  Fast Food Restaurant  0.07
1           Pizza Place  0.07
2                  Park  0.07
3        Sandwich Place  0.07
4            Restaurant  0.04


----M4M----
                 venue  freq
0          Coffee Shop  0.06
1               Bakery  0.06
2          Yoga Studio  0.04
3  American Restaurant  0.04
4            Gastropub  0.04


----M4N----
                        venue  freq
0                    Bus Line  0.12
1  Construction & Landscaping  0.12
2                      Lawyer  0.12
3          Photography Studio  0.12
4        Gym / Fitness Center  0.12


----M4P----
                        venue 
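
If the top categories are needed as data rather than printed text, `Series.nlargest` gives a compact alternative to the transpose-and-sort loop. A sketch on a made-up frequency table indexed by postal code:

```python
import pandas as pd

# Made-up frequencies; the real table is Toronto_grouped
freq = pd.DataFrame(
    {"Park": [0.5, 0.1], "Trail": [0.3, 0.2], "Spa": [0.2, 0.7]},
    index=pd.Index(["M4E", "M4K"], name="Postal Code"),
)

num_top_venues = 2

# For each postal code, the category names with the highest frequencies
top = {
    postal: row.nlargest(num_top_venues).index.tolist()
    for postal, row in freq.iterrows()
}
print(top)
```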

Now we can cluster the postal codes according to the venue categories' frequency by using k-means.

In [32]:
from sklearn.cluster import KMeans

Toronto_grouped_clustering = Toronto_grouped.drop("Postal Code", axis = 1)

kclusters = 5
kmeans = KMeans(n_clusters = kclusters, random_state=0).fit(Toronto_grouped_clustering)

We can check the cluster labels generated for each postal code.

In [33]:
kmeans.labels_[0:20]

array([0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0])
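
A quick way to judge how balanced the clusters are is to count the members of each label. This sketch uses only the 20 labels printed above, so it covers part of the 39 postal codes:

```python
import numpy as np

# The first 20 labels as printed above
labels = np.array([0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0])
kclusters = 5

# Number of members per cluster label; minlength keeps empty clusters visible
sizes = np.bincount(labels, minlength=kclusters)
print(sizes.tolist())
```

Among these 20 labels, 17 fall into cluster 0.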

Now we can add the cluster label to our dataframe of Toronto postal codes.

In [34]:
Toronto.insert(5, "Cluster Label", kmeans.labels_)
Toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Label
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,2
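
One caveat: `kmeans.labels_` has one entry per row of `Toronto_grouped`, which `groupby` sorts by postal code, whereas `Toronto` keeps the order of the Wikipedia table. Inserting the labels by position therefore only lines up if both frames share the same order and length. A sketch of attaching the labels by postal code instead, on toy data with stand-in labels:

```python
import pandas as pd

# Stand-in for Toronto_grouped: sorted by postal code
grouped = pd.DataFrame({"Postal Code": ["M4E", "M5A", "M7A"]})
labels = [2, 0, 1]  # stand-in for kmeans.labels_, one per row of grouped

# Stand-in for Toronto: original table order
toronto = pd.DataFrame({"PostalCode": ["M5A", "M7A", "M4E"]})

# Look up each label by postal code rather than by row position
label_by_code = pd.Series(labels, index=grouped["Postal Code"])
toronto["Cluster Label"] = toronto["PostalCode"].map(label_by_code)
print(toronto)
```

Postal codes for which Foursquare returned no venues would show up as `NaN` here instead of shifting every later label.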


And we can visualize the clustering on a map.

In [35]:
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lon, postal, cluster in zip(Toronto['Latitude'], Toronto['Longitude'], Toronto['PostalCode'], Toronto['Cluster Label']):
    label = folium.Popup(str(postal) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

We can see that the clustering is unbalanced: the algorithm returns one dominant cluster and a few outlier clusters with only one or two members each. This could be analyzed further.