# Data Science Capstone Assignment Week 3

### Student name: Laura Bongaardt

In this assignment we scrape the PostalCodes, Boroughs and Neighborhoods of Toronto from Wikipedia and load them into a dataframe. We then add geographical coordinates for each PostalCode. In the third part we use Foursquare to collect the most common venue categories in each PostalCode. Finally, we use k-means to cluster the PostalCodes into categories.

# Part 1: Scrape data from Wikipedia and load into Pandas dataframe

In [1]:
# Start by importing required libraries
import pandas as pd  #Pandas for handling dataframes
import numpy as np   #Numpy for arithmetic with matrices

print('done')

done


In [2]:
#Import BeautifulSoup and requests
from bs4 import BeautifulSoup as bs4
import requests
print('done')

done


In [3]:
#Set the URL of the Wikipedia page
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
URL

'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [4]:
#Use requests to get the HTML data and print it using BeautifulSoup
source = requests.get(URL).text
soup = bs4(source, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"947c872a-f436-48fb-a08b-6a7b5c283ea3","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":1032600019,"wgRevisionId":1032600019,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Communica

Going through the output, we can see that the relevant section, the main table, is a *tbody* element whose entries are marked by *td* tags.
We can select this table using the `find` method. After that, we can loop over the *td* cells and extract the data from each one.
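As a rough sketch of the filtering step, the helper below splits the text of one cell into its three fields. The function name `parse_cell` and the sample strings are illustrative assumptions, not part of the notebook; the real cell text comes from the *td* elements of the table.

```python
def parse_cell(code_text, span_text):
    """Split one table cell into PostalCode, Borough and Neighborhood.

    Returns None for cells marked 'Not assigned'.
    """
    if span_text.strip() == 'Not assigned':
        return None
    # Text before the first '(' is the borough, the rest lists the neighborhoods
    borough, _, rest = span_text.partition('(')
    neighborhood = rest.rstrip(')').replace(' /', ',').strip()
    return {
        'PostalCode': code_text.strip()[:3],
        'Borough': borough.strip(),
        'Neighborhood': neighborhood,
    }


# Hypothetical sample strings shaped like the cell text on the Wikipedia page
print(parse_cell('M3A', 'North York(Parkwoods)'))
print(parse_cell('M5A', 'Downtown Toronto(Regent Park / Harbourfront)'))
print(parse_cell('M1A', 'Not assigned'))
```

Returning `None` for unassigned cells keeps the filtering decision in one place, so the calling loop only has to skip empty results.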

In [5]:
#Select table from soup object and load into new object.
WikiTable = soup.find('tbody')
print(WikiTable.prettify())

<tbody>
 <tr>
  <td style="width:11%;">
   <p>
    M1A
    <br/>
    <span style="font-size:85%;">
     Not assigned
    </span>
   </p>
  </td>
  <td style="width:11%;">
   <p>
    M2A
    <br/>
    <span style="font-size:85%;">
     Not assigned
    </span>
   </p>
  </td>
  <td style="width:11%;">
   <p>
    M3A
    <br/>
    <span style="font-size:85%;">
     <a href="/wiki/North_York" title="North York">
      North York
     </a>
     <br/>
     (
     <a href="/wiki/Parkwoods" title="Parkwoods">
      Parkwoods
     </a>
     )
    </span>
   </p>
  </td>
  <td style="width:11%;">
   <p>
    M4A
    <br/>
    <span style="font-size:85%;">
     <a href="/wiki/North_York" title="North York">
      North York
     </a>
     <br/>
     (
     <a href="/wiki/Victoria_Village" title="Victoria Village">
      Victoria Village
     </a>
     )
    </span>
   </p>
  </td>
  <td style="width:11%;">
   <p>
    M5A
    <br/>
    <span style="font-size:85%;">
     <a href="/wiki/Downtown_Tor

In [6]:
#Filter information, load into a list of dictionaries
Toronto_list=[]
for row in WikiTable.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':  #This line will filter out cells without neighborhood/borough information
        pass
    else:
        cell['PostalCode'] = row.p.text[:3] #Separate PostalCode in each cell
        cell['Borough'] = (row.span.text).split('(')[0] #Get Borough names in each cell
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ') #Get Neighborhoods in each cell
        Toronto_list.append(cell) #add dictionary to list

#Convert list to DataFrame
Toronto_df = pd.DataFrame(Toronto_list)
#Clean Dataframe
Toronto_df['Borough'] = Toronto_df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})


In [7]:
#Inspect dataframe and check the number of rows
print('Toronto_df has {} rows'.format(Toronto_df.shape[0]))
Toronto_df.head()

Toronto_df has 103 rows


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


The dataframe has **103** rows, one for each Forward Sortation Area (FSA) on the Wikipedia page that has a borough assigned.

# Part 2: Add Geographical coordinates using *Geospatial_Coordinates.csv*

In [8]:
#Read csv file into new DataFrame Geospatial_df
Geospatial_df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv')
Geospatial_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
#Create new list for LatLong coordinates
LatLong = []

#Find Latitude and Longitude for each row in Toronto_df.
for postal_code in Toronto_df['PostalCode']:
    match = Geospatial_df.loc[Geospatial_df['Postal Code'] == postal_code]
    latitude = match['Latitude'].values[0]
    longitude = match['Longitude'].values[0]
    LatLong.append((latitude, longitude))  #A tuple keeps the (latitude, longitude) order; a set would not

#Convert list to new dataframe with named columns, then add the columns to Toronto_df
LatLong_df = pd.DataFrame(LatLong, columns=['Latitude', 'Longitude'])
Toronto_df['Latitude'] = LatLong_df['Latitude']
Toronto_df['Longitude'] = LatLong_df['Longitude']
Toronto_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
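The row-by-row lookup above can also be written as a single join. The sketch below is a minimal illustration on toy frames that reuse the column names of `Toronto_df` and `Geospatial_df`; the frame names `toronto` and `geo` are assumptions made for this example.

```python
import pandas as pd

# Toy frames with the same column names as Toronto_df and Geospatial_df
toronto = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M1B'],
    'Borough': ['North York', 'North York', 'Scarborough'],
})
geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M3A', 'M4A'],
    'Latitude': [43.806686, 43.753259, 43.725882],
    'Longitude': [-79.194353, -79.329656, -79.315572],
})

# A left join keeps every row of the left frame, in its original order
merged = toronto.merge(geo, how='left', left_on='PostalCode', right_on='Postal Code')
# The key column from the right frame is redundant after the join
merged = merged.drop(columns='Postal Code')
print(merged)
```

With a left join, a PostalCode that is missing from the coordinate table would get `NaN` coordinates instead of raising an error, which makes gaps easy to spot.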


# Part 3: Explore Toronto venues using Foursquare. Collect the top 10 venue categories for each PostalCode into a dataframe.

In [10]:
#Start by entering Foursquare credentials
CLIENT_ID = 'CMAYJG13B0SA33ZFXBYP0OJMVRXILDLZDU0ZF40HV40NSAPX' # your Foursquare ID
CLIENT_SECRET = '0IIVELU1DFXAOQYHDK5QLQH4XX0VHNHJEQG33CLOAYTQ1XPW' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: CMAYJG13B0SA33ZFXBYP0OJMVRXILDLZDU0ZF40HV40NSAPX
CLIENT_SECRET:0IIVELU1DFXAOQYHDK5QLQH4XX0VHNHJEQG33CLOAYTQ1XPW


In [11]:
#get the venues for each PostalCode
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'PostalCode Latitude', 
                  'PostalCode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

Toronto_venues = getNearbyVenues(Toronto_df['PostalCode'], Toronto_df['Latitude'], Toronto_df['Longitude'])
Toronto_venues.head()

Unnamed: 0,PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,M3A,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M4A,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,M4A,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
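The request URL assembled in `getNearbyVenues` can also be built from a dictionary of query parameters, which handles the encoding of special characters. The sketch below is one possible way to do this; the helper name `build_explore_url` and the environment variable names are assumptions for illustration, while the endpoint and parameter names are the ones used in the cell above.

```python
import os
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, lng, radius=500, limit=100, version='20180605'):
    """Return the explore URL for one coordinate pair."""
    params = {
        # Environment variable names are assumptions made for this sketch
        'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes the values, e.g. the comma in 'll'
    return EXPLORE_URL + '?' + urlencode(params)


url = build_explore_url(43.753259, -79.329656)
```

Passing the same dictionary to `requests.get(EXPLORE_URL, params=params)` would have a similar effect, since `requests` encodes the parameters itself.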


In [12]:
print(Toronto_venues.shape)
print('There are {} unique categories.'.format(len(Toronto_venues['Venue Category'].unique())))
Toronto_venues.groupby('PostalCode').count()

(2141, 7)
There are 271 unique categories.


PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1B,1,1,1,1,1,1
M1C,1,1,1,1,1,1
M1E,8,8,8,8,8,8
M1G,5,5,5,5,5,5
M1H,8,8,8,8,8,8
...,...,...,...,...,...,...
M9N,1,1,1,1,1,1
M9P,8,8,8,8,8,8
M9R,4,4,4,4,4,4
M9V,8,8,8,8,8,8


In [13]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add PostalCode column back to dataframe
Toronto_onehot['PostalCode'] = Toronto_venues['PostalCode'] 

# move PostalCode column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]


In [14]:
print(Toronto_onehot.shape)
Toronto_onehot.head()

(2141, 272)


Unnamed: 0,PostalCode,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
Toronto_grouped = Toronto_onehot.groupby('PostalCode').mean().reset_index()
print(Toronto_grouped.shape)
Toronto_grouped.head()

(101, 272)


Unnamed: 0,PostalCode,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
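To make the meaning of the grouped values concrete, the sketch below repeats the one-hot encoding and the group mean on a tiny made-up venue table. The postal codes `A` and `B` and the categories are hypothetical; the point is that the mean of a one-hot column is the share of venues in that category.

```python
import pandas as pd

# Hypothetical venues: four in postal code 'A', two in postal code 'B'
venues = pd.DataFrame({
    'PostalCode': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Cafe', 'Bar', 'Cafe', 'Cafe'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'PostalCode', venues['PostalCode'])

# Averaging the indicators per postal code gives category frequencies
grouped = onehot.groupby('PostalCode').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, so postal codes with many venues and postal codes with few venues are compared on the same scale.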


In [16]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix left, fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['PostalCode'] = Toronto_grouped['PostalCode']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted.head()

(101, 11)


Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio,Dim Sum Restaurant
1,M1C,Bar,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant
2,M1E,Donut Shop,Mexican Restaurant,Restaurant,Electronics Store,Rental Car Location,Intersection,Medical Center,Bank,Distribution Center,Dog Run
3,M1G,Coffee Shop,Insurance Office,Pharmacy,Korean BBQ Restaurant,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
4,M1H,Bakery,Bank,Hakka Restaurant,Caribbean Restaurant,Thai Restaurant,Athletics & Sports,Gas Station,Fried Chicken Joint,Doner Restaurant,Distribution Center
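The column labels above come from an index lookup with a fallback to `th`. A small helper can make the suffix rule explicit; the function name `ordinal` is an assumption for this sketch, and it also covers 11, 12 and 13, which the notebook does not need for ten columns.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Build the same column labels as in the cell above
columns = ['PostalCode'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```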


# Part 4: Cluster Postal Codes into Categories using k-means. Plot using Folium

In [17]:
#Import libraries
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


In [18]:
# set number of clusters
kclusters = 5

Toronto_grouped_clustering = Toronto_grouped.drop(columns='PostalCode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 2, 1, 1, 1, 4, 1, 1, 1, 1], dtype=int32)

In [19]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,M1B,Fast Food Restaurant,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio,Dim Sum Restaurant
1,2,M1C,Bar,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant
2,1,M1E,Donut Shop,Mexican Restaurant,Restaurant,Electronics Store,Rental Car Location,Intersection,Medical Center,Bank,Distribution Center,Dog Run
3,1,M1G,Coffee Shop,Insurance Office,Pharmacy,Korean BBQ Restaurant,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
4,1,M1H,Bakery,Bank,Hakka Restaurant,Caribbean Restaurant,Thai Restaurant,Athletics & Sports,Gas Station,Fried Chicken Joint,Doner Restaurant,Distribution Center


In [20]:
Toronto_merged = Toronto_df

# join neighborhoods_venues_sorted onto Toronto_df to add the cluster label and top venues for each PostalCode
Toronto_merged = Toronto_merged.join(neighborhoods_venues_sorted.set_index('PostalCode'), on='PostalCode')
Toronto_merged.dropna(axis=0, inplace=True) #drop PostalCodes without a cluster label (no venues were returned for them)

Toronto_merged = Toronto_merged.astype({"Cluster Labels": int})
print(Toronto_merged.shape) 
Toronto_merged.head() # check the last columns!

(101, 16)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,4,Fast Food Restaurant,Food & Drink Shop,Park,Gym,Falafel Restaurant,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,1,Hockey Arena,Portuguese Restaurant,Intersection,Coffee Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Bakery,Pub,Park,Theater,Café,Breakfast Spot,Restaurant,Distribution Center,Beer Store
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1,Clothing Store,Accessories Store,Boutique,Furniture / Home Store,Event Space,Coffee Shop,Women's Store,Vietnamese Restaurant,Filipino Restaurant,Fast Food Restaurant
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494,1,Coffee Shop,Sushi Restaurant,Yoga Studio,Nightclub,Bar,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café


In [21]:
#Install and import Folium
!pip install folium
import folium # map rendering library

print('Folium imported')

Folium imported


In [23]:
# create map
latitude = Toronto_merged.iloc[2]['Latitude']
longitude = Toronto_merged.iloc[2]['Longitude']
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighborhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The map shows two dominant clusters, cluster 1 (purple) and cluster 4 (yellow), plus three clusters that each contain only one PostalCode. It appears that three PostalCodes are very distinct from the rest: k-means gives each of them its own cluster and divides the remaining PostalCodes into two (too) generic clusters.
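A quick way to back up an observation like this with numbers is to count how many PostalCodes fall into each cluster. The sketch below uses only the ten labels printed by `kmeans.labels_[0:10]` earlier as sample input; in the notebook one would pass the full `kmeans.labels_` array instead.

```python
from collections import Counter

# First ten labels as printed by kmeans.labels_[0:10] (sample input only)
sample_labels = [1, 2, 1, 1, 1, 4, 1, 1, 1, 1]

# Count how many PostalCodes were assigned to each cluster
cluster_sizes = Counter(sample_labels)
for cluster, size in sorted(cluster_sizes.items()):
    print('Cluster {}: {} PostalCodes'.format(cluster, size))
```

If most PostalCodes end up in one or two clusters, trying a different number of clusters or different features may give a more informative split.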