# Web Scraping and Clustering Toronto Area Data from Wikipedia

In this notebook we scrape Toronto area data from a Wikipedia table using BeautifulSoup and then cluster it using K-means. Combining the clustered neighbourhood data with data available from the Foursquare API will allow us to analyse neighbourhoods in Toronto.

In [1]:
#import required packages
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from bs4 import BeautifulSoup
import geocoder
import requests
import csv

Now we fetch the Wikipedia page and pass it to Beautiful Soup. From there we process and clean the data into a pandas dataframe, which gives us access to the table content in a form we can pass to scikit-learn.

In [2]:
#data acquisition:

#store the target page in a variable as text
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

#use the 'lxml' parser to organise the data correctly
soup = BeautifulSoup(source, 'lxml')

#the postal code table is the first table on the page, so a simple first find will work
table = soup.find('table')
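
The parsing step above assumes the page actually contains a table. A minimal sketch of a more defensive lookup follows. The helper name `first_table` and the inline sample HTML are illustrative assumptions rather than part of the original notebook, and Python's built-in `'html.parser'` is used only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

def first_table(html):
    """Return the first <table> element in html, or raise if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        raise ValueError('no <table> element found in the page')
    return table

# hypothetical sample markup standing in for the downloaded page
sample_html = """
<html><body>
<table>
  <tr><th>Postal Code</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
</body></html>
"""

table = first_table(sample_html)
# collect the text of every data cell in the table
cells = [td.get_text(strip=True) for td in table.find_all('td')]
print(cells)
```

For the live page, calling `raise_for_status()` on the `requests` response before reading `.text` would likewise surface HTTP errors early instead of parsing an error page.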

In [3]:
#data processing:
#read_html returns a list of dataframes, one per table it finds
#(wrapping the HTML in StringIO avoids the deprecation warning in pandas 2.1+)
from io import StringIO
df = pd.read_html(StringIO(str(table)))

#take the first (and only) dataframe and drop rows with missing data
df = df[0].dropna()

#dropna leaves gaps in the index, so reset it
df.reset_index(drop=True, inplace=True)

#replace fwd slashes with commas in the Neighborhood col
df['Neighborhood'] = df['Neighborhood'].str.replace(" /", ",", regex=False)
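
To make the cleaning steps above concrete, here is a small sketch that applies the same three operations (drop missing rows, reset the index, replace the slash separators) to a hand-made dataframe. The rows are hypothetical stand-ins for the raw table, not data taken from the page.

```python
import numpy as np
import pandas as pd

# hypothetical rows imitating the raw table: the first row has missing values
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M6A'],
    'Borough': [np.nan, 'Downtown Toronto', 'North York'],
    'Neighborhood': [np.nan, 'Regent Park / Harbourfront',
                     'Lawrence Manor / Lawrence Heights'],
})

# drop incomplete rows, then renumber the index from zero
clean = raw.dropna().reset_index(drop=True)

# turn "A / B" into "A, B" using a literal (non-regex) replacement
clean['Neighborhood'] = clean['Neighborhood'].str.replace(' /', ',', regex=False)

print(clean['Neighborhood'].tolist())
```

After these steps the index runs 0, 1 with no gap where the incomplete row used to be.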

In [4]:
#check the data
print("The dataframe shape is: {} \n".format(df.shape))
print(df.head())

The dataframe shape is: (103, 3) 

  Postal Code           Borough                                 Neighborhood
0         M3A        North York                                    Parkwoods
1         M4A        North York                             Victoria Village
2         M5A  Downtown Toronto                    Regent Park, Harbourfront
3         M6A        North York             Lawrence Manor, Lawrence Heights
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government


In [5]:
import os
MY_API = os.environ['BING_MAPS_KEY']  #read the Bing Maps key from the environment instead of hard-coding it; BING_MAPS_KEY is a placeholder name

In [6]:
#initialise empty lists to collect the coordinates
lat = []
lng = []

#loop over the postal codes and query the Bing API through geocoder for lat/long positions
for postal_code in df['Postal Code']:
    g = geocoder.bing('{}, Toronto, Ontario'.format(postal_code), key=MY_API)
    lat.append(g.lat)
    lng.append(g.lng)
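
A geocoding request can occasionally come back without coordinates, in which case the loop above would append `None` to the lists. One way to guard against that is a small retry wrapper, sketched below. The function name `geocode_with_retry` and its `attempts` and `delay` parameters are hypothetical, and the lookup is passed in as a callable so the sketch does not depend on any particular provider.

```python
import time

def geocode_with_retry(lookup, query, attempts=3, delay=1.0):
    """Call lookup(query) until it yields coordinates or attempts run out.

    lookup is any callable returning an object with .lat and .lng
    attributes (for example a thin wrapper around geocoder.bing).
    Returns (lat, lng), or (None, None) if every attempt fails.
    """
    for attempt in range(attempts):
        result = lookup(query)
        if result.lat is not None and result.lng is not None:
            return result.lat, result.lng
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before trying again
    return None, None
```

In the loop above this could be used as `geocode_with_retry(lambda q: geocoder.bing(q, key=MY_API), '{}, Toronto, Ontario'.format(postal_code))`, appending the two returned values to `lat` and `lng`.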

In [9]:
#add the lat/long data acquired in the cell above to the main data frame
df['Latitude'] = lat
df['Longitude'] = lng

In [10]:
#check the data
df.head()

  Postal Code           Borough                                 Neighborhood   Latitude   Longitude
0         M3A        North York                                    Parkwoods  43.751881   -79.33036
1         M4A        North York                             Victoria Village  43.730419   -79.31282
2         M5A  Downtown Toronto                    Regent Park, Harbourfront   43.65514  -79.362648
3         M6A        North York             Lawrence Manor, Lawrence Heights  43.723209  -79.451408
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government   43.66449  -79.393021
