# IBM Data Science on Coursera - Applied Data Science Capstone
## Week 3 - Part 1: Scrape neighborhood information from Wikipedia

Build a pandas dataframe with the postal codes of each Toronto Neighborhood / Borough. The postal code information is scraped from Wikipedia and then cleaned up.

### Let's start by installing and importing all the libraries we need.

In [1]:
!pip install pandas -U
import pandas as pd
from pandas import json_normalize
print("\n*** Pandas Installed, Updated, & Imported\n")
print("\n*** JSON_normalize Imported\n")

!pip install numpy -U
import numpy as np
print("\n*** NumPy Installed, Updated, & Imported\n")

import requests
import urllib.request
print("\n*** Requests Imported\n")

import random
print("\n*** Random Imported\n")

!pip install geopy -U
from geopy.geocoders import Nominatim
print("\n*** Geopy Installed, Updated, & Imported\n")
print("\n*** Nominatim Imported\n")

!pip install ipython -U
from IPython.display import Image
from IPython.core.display import HTML
print("\n*** IPython Installed, Updated, & Imported\n")
print("\n*** Image & HTML Imported\n")

!pip install folium -U
import folium
print("\n*** Folium Installed, Updated, & Imported\n")

!pip install BeautifulSoup4 -U
from bs4 import BeautifulSoup
print("\n*** BeautifulSoup Installed, Updated, & Imported\n")

Requirement already up-to-date: pandas in c:\users\alexi\anaconda3\lib\site-packages (1.0.3)

*** Pandas Installed, Updated, & Imported


*** JSON_normalize Imported

Requirement already up-to-date: numpy in c:\users\alexi\anaconda3\lib\site-packages (1.18.3)

*** NumPy Installed, Updated, & Imported


*** Requests Imported


*** Random Imported

Requirement already up-to-date: geopy in c:\users\alexi\anaconda3\lib\site-packages (1.21.0)

*** Geopy Installed, Updated, & Imported


*** Nominatim Imported

Requirement already up-to-date: ipython in c:\users\alexi\anaconda3\lib\site-packages (7.13.0)

*** IPython Installed, Updated, & Imported


*** Image & HTML Imported

Requirement already up-to-date: folium in c:\users\alexi\anaconda3\lib\site-packages (0.10.1)

*** Folium Installed, Updated, & Imported

Requirement already up-to-date: BeautifulSoup4 in c:\users\alexi\anaconda3\lib\site-packages (4.9.0)

*** BeautifulSoup Installed, Updated, & Imported



### Scrape data from Wikipedia w/ BeautifulSoup

Using BeautifulSoup, we scrape the Wikipedia page and look through its HTML for the table containing our data. There are several tables on the page, but the one containing our Postal Code / Borough / Neighborhood information has the class "wikitable sortable". Let's isolate it.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")

In [3]:
table = soup.find('table', class_='wikitable sortable')

### Parse table, & append to Pandas DF

We parse the "wikitable sortable" table and extract each cell of each row. The cells are appended to one list per column, and a pandas dataframe is then initialized from those lists, one list per column.

In [4]:
PostalCode = []
Borough = []
Neighborhood = []

for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        PostalCode.append(cells[0].find(text = True))
        Borough.append(cells[1].find(text = True))
        Neighborhood.append(cells[2].find(text = True))
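
As a self-contained illustration of the row-by-row parsing above, the sketch below runs the same kind of loop on a small inline HTML table instead of the live Wikipedia page. The sample HTML and the helper name `parse_rows` are made up for this example. It uses `find_all` (the current spelling of `findAll`) and `get_text(strip=True)`, which trims the trailing newline at extraction time, and the built-in `html.parser` so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A\n</td><td>Not assigned\n</td><td>\n</td></tr>
  <tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>
</table>
"""


def parse_rows(html):
    """Return one (postal_code, borough, neighborhood) tuple per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable sortable")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) == 3:  # the header row uses <th>, so it is skipped
            rows.append(tuple(cell.get_text(strip=True) for cell in cells))
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Stripping at extraction time is one option; the notebook instead keeps the raw text and cleans it in the dataframe afterwards.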

In [5]:
df = pd.DataFrame(PostalCode, columns=["PostalCode"])
df["Borough"] = Borough
df["Neighborhood"] = Neighborhood

### Cleanup & Formatting

Let's see what we've scraped. We notice there are a lot of **\n (newline)** characters, **empty cells** and **Not assigned** cells, all of which need to be cleaned up. Otherwise, it looks like it picked up all of the postal codes (from M1A to M9Z) and the corresponding Boroughs and Neighborhoods.

In [6]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,\n
1,M2A\n,Not assigned\n,\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,Regent Park / Harbourfront\n
...,...,...,...
175,M5Z\n,Not assigned\n,\n
176,M6Z\n,Not assigned\n,\n
177,M7Z\n,Not assigned\n,\n
178,M8Z\n,Etobicoke\n,Mimico NW / The Queensway West / South of Bloo...


In [7]:
df.shape

(180, 3)

In [8]:
df.dtypes

PostalCode      object
Borough         object
Neighborhood    object
dtype: object

In [9]:
# Remove \n (newline) characters, & replace "Not assigned" with NaN
df = df.replace(r'\n', '', regex=True)
df = df.replace(r'Not assigned', np.nan, regex=True)

In [10]:
# Drop NAN rows
df.dropna(axis = 0, how = "any", inplace = True)
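
The cleanup steps above can be sketched on a tiny made-up frame. This is a variation rather than the notebook's exact code: it strips whitespace with `str.strip`, treats both "Not assigned" and empty strings as missing, and drops only rows without a Borough (`subset=["Borough"]`). The sample rows and the names `raw` and `clean` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the raw scrape (trailing newlines, "Not assigned").
raw = pd.DataFrame({
    "PostalCode": ["M1A\n", "M3A\n", "M5A\n"],
    "Borough": ["Not assigned\n", "North York\n", "Downtown Toronto\n"],
    "Neighborhood": ["\n", "Parkwoods\n", "Regent Park / Harbourfront\n"],
})

# Strip leading/trailing whitespace (including "\n") from every column.
clean = raw.apply(lambda col: col.str.strip())
# Turn the placeholder text and empty strings into NaN.
clean = clean.replace(["Not assigned", ""], np.nan)
# Keep only rows that actually have a borough.
clean = clean.dropna(subset=["Borough"]).reset_index(drop=True)

print(clean)
```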

In [11]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
160,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,Business reply mail Processing CentrE
169,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...


### Grouping the dataframe

Now, let's group the dataframe by Postal Code and Borough. At the same time, we will replace the " / " separators (forward slashes) with commas.

In [12]:
df_grouped = df.groupby(["PostalCode", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
df_grouped = df_grouped.replace(r' / ',  ', ', regex=True)

In [13]:
df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [14]:
df_grouped.shape

(103, 3)
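
To make the grouping step concrete, here is a minimal sketch on made-up rows in which one postal code appears twice and another uses the " / " separator. The frame `df_demo` is hypothetical; the grouping and replacement mirror the cell above, except that the separator is replaced as a literal string (`regex=False`).

```python
import pandas as pd

# Made-up rows: M1B appears twice, M1C uses " / " between neighborhoods.
df_demo = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern", "Rouge", "Rouge Hill / Port Union"],
})

# Join all neighborhood names that share a postal code and borough.
grouped = (
    df_demo.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)
# regex=False treats " / " as a plain substring rather than a pattern.
grouped["Neighborhood"] = grouped["Neighborhood"].str.replace(" / ", ", ", regex=False)

print(grouped)
```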

## Week 3 - Part 2: Adding Geospatial Data to our dataframe

Let's start by getting the latitude and longitude for these postal codes. We're going to use the .csv file instead of fiddling about with geocoder.

In [15]:
coords = pd.read_csv("https://cocl.us/Geospatial_data")
coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Concatenate the Coordinates and the Postal Codes dataframes

In [16]:
df_grp_coords = pd.concat([df_grouped, coords], axis=1, sort=False)
df_grp_coords = df_grp_coords.drop(columns = ["Postal Code"])
df_grp_coords.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
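
Note that `pd.concat(..., axis=1)` lines rows up by index position, so it only pairs the right coordinates with the right postal code when both frames happen to be in the same order. A key-based merge is a possible alternative that does not depend on row order. The sketch below uses two small made-up frames (`left`, `right`) with the coordinates deliberately reversed; the column names follow the ones shown above.

```python
import pandas as pd

# Made-up frames; the coordinate rows are in a different order on purpose.
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

merged = left.merge(
    right,
    left_on="PostalCode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",  # raise if a postal code is duplicated on either side
).drop(columns=["Postal Code"])

print(merged)
```

A merge like this requires the keys to match exactly, so any stray whitespace left in the postal codes would need to be stripped first.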


## Week 3 - Part 3: Map & Explore with Foursquare

Let's start by visualizing our data and working from there. Eventually we are going to use Foursquare to get more information on these neighborhoods, so let's see what we have to work with.

In [17]:
tdot = folium.Map(location=[43.7, -79.4], zoom_start=11)


for lat, lng, borough, neighborhood in zip(df_grp_coords['Latitude'], df_grp_coords['Longitude'], df_grp_coords['Borough'], df_grp_coords['Neighborhood']):
    label = '{}- {}'.format(neighborhood, borough)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='blue', fill=True, fill_color='red').add_to(tdot)

tdot

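Let's drop some of the peripheral Boroughs / Neighborhoods, and focus only on the more central areas. We're going to drop the data pertaining to Etobicoke, Mississauga, York, North York, and Scarborough, as these are not really part of Toronto proper. Note that the filter below is a substring match, so any Borough whose name contains one of these strings (East York, for example) is dropped as well.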

In [18]:
out_there = ['Etobicoke', 'Mississauga', 'Scarborough', 'York', 'North York']
df_central = df_grp_coords[~df_grp_coords.Borough.str.contains('|'.join(out_there))]
df_central.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
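
The difference between a substring filter and an exact-match filter can be seen on a short made-up list of borough names. This is only a sketch; `boroughs`, `kept_contains` and `kept_isin` are illustrative names, and which behavior is wanted depends on whether East York should count as central.

```python
import pandas as pd

boroughs = pd.Series(
    ["East York", "North York", "York", "East Toronto", "Etobicoke"],
    name="Borough",
)
out_there = ["Etobicoke", "Mississauga", "Scarborough", "York", "North York"]

# Substring match: the pattern "York" also matches "East York".
kept_contains = boroughs[~boroughs.str.contains("|".join(out_there))]
# Exact match: only the names listed in out_there are removed.
kept_isin = boroughs[~boroughs.isin(out_there)]

print(list(kept_contains))
print(list(kept_isin))
```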


In [19]:
tdot_center = folium.Map(location=[43.7, -79.4], zoom_start=12)


for lat, lng, borough, neighborhood in zip(df_central['Latitude'], df_central['Longitude'], df_central['Borough'], df_central['Neighborhood']):
    label = '{}- {}'.format(neighborhood, borough)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='blue', fill=True, fill_color='red').add_to(tdot_center)

tdot_center

### Using the Foursquare API

Credentials are kept in a separate file that is not pushed to GitHub.

In [20]:
import config as cfg

CLIENT_ID = cfg.client_id
CLIENT_SECRET = cfg.client_secret
VERSION = "20200426"

Now, let's make an API request for the venues, and save the results in a list. We've gone with all venues within 500 meters of the geospatial coordinates.

In [21]:
radius = 500  # search radius in meters
LIMIT = 100   # maximum number of venues requested per location

venues = []

for lat, long, post, borough, neighborhood in zip(df_central['Latitude'], df_central['Longitude'], df_central['PostalCode'], df_central['Borough'], df_central['Neighborhood']):
    # Build the request URL for this postal code's coordinates
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius,
        LIMIT)

    # Items of the first result group; each item wraps a single venue
    results = requests.get(url).json()["response"]['groups'][0]['items']

    # Store one tuple per venue, together with the area it was found in
    for venue in results:
        venues.append((
            post,
            borough,
            neighborhood,
            lat,
            long,
            venue['venue']['name'],
            venue['venue']['location']['lat'],
            venue['venue']['location']['lng'],
            venue['venue']['categories'][0]['name']))
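
The request-and-flatten logic above can also be split into two small helpers, which makes the parsing testable without network access. This is a sketch under assumptions: `build_explore_params` and `extract_venues` are hypothetical helper names, the payload is hand-written for illustration (not a real API response), and the only response fields used are the ones the loop above already reads. The extra guards for missing groups or categories are additions, not part of the original cell.

```python
API_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_params(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Query parameters matching the ones formatted into the URL above."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }


def extract_venues(payload):
    """Flatten a decoded JSON payload into (name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of raising when a venue has no category.
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written payload for illustration only.
fake_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Glen Manor Ravine",
                           "location": {"lat": 43.676821, "lng": -79.293942},
                           "categories": [{"name": "Trail"}]}},
                {"venue": {"name": "Uncategorized Place",
                           "location": {"lat": 43.0, "lng": -79.0},
                           "categories": []}},
            ]
        }]
    }
}

params = build_explore_params("my-id", "my-secret", "20200426", 43.7, -79.4)
print(extract_venues(fake_payload))
# With network access, the call could look like:
# payload = requests.get(API_URL, params=params, timeout=10).json()
```

Passing the values through `params` lets `requests` handle URL encoding, and keeps the credentials out of a hand-built URL string.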

Create a pandas DataFrame from the list of venues, & add the column headers.

In [22]:
venues_df = pd.DataFrame(venues)
venues_df.columns = ['PostalCode', 'Borough', 'Neighborhood', 'BoroughLatitude', 'BoroughLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

venues_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,M4E,East Toronto,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,M4E,East Toronto,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,Glen Stewart Park,43.675278,-79.294647,Park
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood


Now let's one-hot encode the venues to a new dataframe.

In [23]:
toh = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

toh.head()

Unnamed: 0,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
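
For readers new to one-hot encoding, a three-row made-up example shows what `pd.get_dummies` produces: one column per category and a single "hot" entry per row. The frame `venues_demo` is hypothetical. Depending on the pandas version, the indicator columns may come back as booleans rather than 0/1 integers, so the sketch casts them explicitly.

```python
import pandas as pd

venues_demo = pd.DataFrame({
    "PostalCode": ["M4E", "M4E", "M4K"],
    "VenueCategory": ["Trail", "Pub", "Pub"],
})

# Empty prefix and separator keep the bare category names as column names.
onehot = pd.get_dummies(venues_demo[["VenueCategory"]], prefix="", prefix_sep="")
# Cast to int so the output is 0/1 regardless of the default dtype.
onehot = onehot.astype(int)

print(onehot)
```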


And add back the neighborhood information from the venues dataframe.

In [24]:
toh.insert(0, 'PostalCode', venues_df['PostalCode'])
toh.insert(1, 'Borough', venues_df['Borough'])
toh.insert(2, 'Neighborhoods', venues_df['Neighborhood'])

toh.head()

Unnamed: 0,PostalCode,Borough,Neighborhoods,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group the rows by neighborhood & take the mean of each venue category.

In [25]:
central_venues = toh.groupby(["PostalCode", "Borough", "Neighborhoods"]).mean().reset_index()
central_venues.head()

Unnamed: 0,PostalCode,Borough,Neighborhoods,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,M4E,East Toronto,The Beaches,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,East Toronto,"The Danforth West, Riverdale",0.0,0.0,0.0,0.0,0.0,0.0,0.023256,...,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,0.0,0.0,0.023256
2,M4L,East Toronto,"India Bazaar, The Beaches West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,East Toronto,Studio District,0.0,0.0,0.0,0.0,0.0,0.0,0.05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.025
4,M4N,Central Toronto,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
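
Taking the mean of one-hot columns within a group gives the share of venues in each category for that group. The sketch below checks this on made-up data and compares it with `pd.crosstab(..., normalize="index")`, which is one possible shortcut to the same table. The frame `venues_demo` and the variable names are assumptions for illustration.

```python
import pandas as pd

venues_demo = pd.DataFrame({
    "PostalCode": ["M4E", "M4E", "M4E", "M4E", "M4K"],
    "VenueCategory": ["Trail", "Pub", "Park", "Pub", "Pub"],
})

# Route 1: one-hot encode, then average within each postal code.
onehot = pd.get_dummies(venues_demo["VenueCategory"]).astype(float)
freq_groupby = onehot.groupby(venues_demo["PostalCode"]).mean()

# Route 2: count category occurrences per postal code and normalize each row.
freq_crosstab = pd.crosstab(
    venues_demo["PostalCode"], venues_demo["VenueCategory"], normalize="index"
)

print(freq_groupby)
```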


Find the top 5 venue categories in each Neighborhood

In [26]:
areaColumns = ['PostalCode', 'Borough', 'Neighborhoods']
freqColumns = []
for ind in np.arange(5):
    freqColumns.append('Top {}'.format(ind+1))

columns = areaColumns+freqColumns

top5_venues = pd.DataFrame(columns=columns)
top5_venues['PostalCode'] = central_venues['PostalCode']
top5_venues['Borough'] = central_venues['Borough']
top5_venues['Neighborhoods'] = central_venues['Neighborhoods']

In [27]:
for ind in np.arange(central_venues.shape[0]):
    row_categories = central_venues.iloc[ind, :].iloc[3:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    top5_venues.iloc[ind, 3:] = row_categories_sorted.index.values[0:5]

In [28]:
top5_venues.head()

Unnamed: 0,PostalCode,Borough,Neighborhoods,Top 1,Top 2,Top 3,Top 4,Top 5
0,M4E,East Toronto,The Beaches,Pub,Trail,Health Food Store,Neighborhood,Park
1,M4K,East Toronto,"The Danforth West, Riverdale",Greek Restaurant,Italian Restaurant,Coffee Shop,Furniture / Home Store,Bookstore
2,M4L,East Toronto,"India Bazaar, The Beaches West",Fast Food Restaurant,Sandwich Place,Pizza Place,Brewery,Ice Cream Shop
3,M4M,East Toronto,Studio District,Café,Coffee Shop,Gastropub,Brewery,Bakery
4,M4N,Central Toronto,Lawrence Park,Park,Bus Line,Swim School,Yoga Studio,Deli / Bodega
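
The ranking step boils down to sorting each row of frequencies and keeping the leading column names. A compact way to sketch this is `Series.nlargest` applied row by row. The frequency table `freq` and the helper `top_categories` are made up for the example, which keeps the top 2 rather than the top 5 to stay small.

```python
import pandas as pd

# Made-up category frequencies, one row per postal code.
freq = pd.DataFrame(
    {"Bakery": [0.1, 0.5], "Park": [0.6, 0.2], "Pub": [0.3, 0.3]},
    index=pd.Index(["M4E", "M4K"], name="PostalCode"),
)


def top_categories(row, n=2):
    """Names of the n most frequent categories in one row, highest first."""
    return list(row.nlargest(n).index)


# result_type="expand" spreads each returned list across separate columns.
top = freq.apply(top_categories, axis=1, result_type="expand")
top.columns = ["Top {}".format(i + 1) for i in range(top.shape[1])]

print(top)
```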


Set up for K-Means clustering

In [29]:
from sklearn.cluster import KMeans 

In [30]:
cluster_df = central_venues.drop(["PostalCode", "Borough", "Neighborhoods"], axis = 1)

kmeans = KMeans(n_clusters=5).fit(cluster_df)
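
K-Means starts from random centroids, so the cluster numbers (and sometimes the clusters themselves) can differ between runs unless a seed is fixed. The sketch below, on a made-up 4 x 2 array, shows one way to make a run repeatable with `random_state`; the data and the choice of two clusters are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency vectors: two rows dominated by the first category,
# two rows dominated by the second.
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# random_state fixes the initialization; n_init is set explicitly because
# its default has differed between scikit-learn releases.
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = model.labels_

print(labels)
```

The label numbers themselves carry no meaning; only which rows share a label matters.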

In [31]:
cluster_df = df_central.copy()
cluster_df = pd.merge(cluster_df, central_venues, on=['PostalCode'], how='inner')

Append the label information to the dataframe we're working with, and then do some dataframe cleanup. Remove duplicated columns, move the cluster label to the front of the dataframe.

In [32]:
cluster_df["Cluster_labels"] = kmeans.labels_

cluster_df = cluster_df.join(top5_venues.drop(["Borough", "Neighborhoods"], axis=1).set_index("PostalCode"), on="PostalCode")

cluster_df.head()

Unnamed: 0,PostalCode,Borough_x,Neighborhood_x,Latitude,Longitude,Borough_y,Neighborhoods,Airport,Airport Food Court,Airport Gate,...,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio,Cluster_labels,Top 1,Top 2,Top 3,Top 4,Top 5
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,East Toronto,The Beaches,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,Pub,Trail,Health Food Store,Neighborhood,Park
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,East Toronto,"The Danforth West, Riverdale",0.0,0.0,0.0,...,0.0,0.0,0.0,0.023256,0,Greek Restaurant,Italian Restaurant,Coffee Shop,Furniture / Home Store,Bookstore
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,East Toronto,"India Bazaar, The Beaches West",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,Fast Food Restaurant,Sandwich Place,Pizza Place,Brewery,Ice Cream Shop
3,M4M,East Toronto,Studio District,43.659526,-79.340923,East Toronto,Studio District,0.0,0.0,0.0,...,0.0,0.025,0.0,0.025,0,Café,Coffee Shop,Gastropub,Brewery,Bakery
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,Central Toronto,Lawrence Park,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3,Park,Bus Line,Swim School,Yoga Studio,Deli / Bodega


In [33]:
cluster_df.sort_values(["Cluster_labels"], inplace=True)

# cluster_df.drop(['Borough_y', 'Neighborhoods'], axis=1, inplace = True)
cluster_df.rename(columns = {'Borough_x':'Borough', 'Neighborhood_x':'Neighborhood'}, inplace = True)

mid = cluster_df['Cluster_labels']
cluster_df.drop(labels=['Cluster_labels'], axis=1,inplace = True)
cluster_df.insert(0, 'Cluster_labels', mid)
cluster_df = cluster_df[['Cluster_labels', 'PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude', 'Top 1', 'Top 2', 'Top 3', 'Top 4', 'Top 5']]

cluster_df = cluster_df.reset_index(drop=True)
print(cluster_df.shape)
cluster_df.head()

(39, 11)


Unnamed: 0,Cluster_labels,PostalCode,Borough,Neighborhood,Latitude,Longitude,Top 1,Top 2,Top 3,Top 4,Top 5
0,0,M4E,East Toronto,The Beaches,43.676357,-79.293031,Pub,Trail,Health Food Store,Neighborhood,Park
1,0,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817,Coffee Shop,Restaurant,Café,Hotel,Gym
2,0,M5P,Central Toronto,Forest Hill North & West,43.696948,-79.411307,Jewelry Store,Trail,Mexican Restaurant,Sushi Restaurant,Department Store
3,0,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Café,Sandwich Place,Coffee Shop,Liquor Store,Flower Shop
4,0,M5S,Downtown Toronto,"University of Toronto, Harbord",43.662696,-79.400049,Café,Bookstore,Japanese Restaurant,Bar,Italian Restaurant


Add a new column to the dataframe that maps each cluster label to a color for the folium map at the end.

In [34]:
color_list = cluster_df["Cluster_labels"]
color_df = pd.DataFrame(color_list)
color_df.rename(columns = {'Cluster_labels':'colors'}, inplace = True)

In [35]:
color_df["colors"] = color_df["colors"].replace(0, 'yellow')
color_df["colors"] = color_df["colors"].replace(1, 'red')
color_df["colors"] = color_df["colors"].replace(2, 'blue')
color_df["colors"] = color_df["colors"].replace(3, 'green')
color_df["colors"] = color_df["colors"].replace(4, 'purple')
cluster_df.insert(0, 'colors', color_df)
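
The same label-to-color mapping can be written as a single dictionary lookup with `Series.map`. This is an equivalent sketch on a made-up label series, using the colors from the cell above; `CLUSTER_COLORS` is a hypothetical name.

```python
import pandas as pd

# Same label-to-color pairs as the replace() calls above.
CLUSTER_COLORS = {0: "yellow", 1: "red", 2: "blue", 3: "green", 4: "purple"}

labels = pd.Series([0, 3, 1, 4, 2], name="Cluster_labels")
colors = labels.map(CLUSTER_COLORS)

print(list(colors))
```

Keeping the pairs in one dictionary makes it easier to change the palette or the number of clusters later.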

In [36]:
cluster_df.head()

Unnamed: 0,colors,Cluster_labels,PostalCode,Borough,Neighborhood,Latitude,Longitude,Top 1,Top 2,Top 3,Top 4,Top 5
0,yellow,0,M4E,East Toronto,The Beaches,43.676357,-79.293031,Pub,Trail,Health Food Store,Neighborhood,Park
1,yellow,0,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817,Coffee Shop,Restaurant,Café,Hotel,Gym
2,yellow,0,M5P,Central Toronto,Forest Hill North & West,43.696948,-79.411307,Jewelry Store,Trail,Mexican Restaurant,Sushi Restaurant,Department Store
3,yellow,0,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Café,Sandwich Place,Coffee Shop,Liquor Store,Flower Shop
4,yellow,0,M5S,Downtown Toronto,"University of Toronto, Harbord",43.662696,-79.400049,Café,Bookstore,Japanese Restaurant,Bar,Italian Restaurant


In [37]:
t_map = folium.Map(location = [43.7, -79.4], zoom_start = 12)

for row in cluster_df.itertuples():
    t_map.add_child(folium.CircleMarker(location = [row.Latitude, row.Longitude],
                                  color = row.colors,
                                  fill = True,
                                  fill_color = row.colors,
                                  fill_opacity = 0.5,
                                  popup = "Borough: {} | Cluster: {}".format(row.Borough, row.Cluster_labels)))

t_map