# Segmenting and Clustering Neighborhoods in Toronto: Data Scraping

This notebook scrapes Toronto postal code data from Wikipedia, cleans it, and loads it into a pandas dataframe for analysis.

Importing the relevant packages:

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd 

Next, we scrape the relevant data from the table on Wikipedia.

In [2]:
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" # Wikipedia link 

response = requests.get(wiki_url) 
soup = BeautifulSoup(response.text, 'html.parser') # Converting HTML to BeautifulSoup object

table = soup.find('table')
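The cell above takes the first `<table>` on the page, and the following cells flatten its cells into one list. As a hedged alternative, the sketch below walks a table row by row, which keeps each postal code together with its borough and neighbourhood. The inline `sample_html` string and the `parse_rows` helper are illustrative assumptions made so the example runs offline; they are not part of the original notebook or a copy of the live Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page, used so the sketch runs without a network call.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def parse_rows(html):
    """Return (headers, rows) from the first table found in `html`."""
    table = BeautifulSoup(html, "html.parser").find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return headers, rows

headers, rows = parse_rows(sample_html)
```

With the real page, `response.text` would take the place of `sample_html`, assuming the postal code table is still the first table on the page.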

In [3]:
table_headers = table.find_all('th') # Isolating table headers
table_headers = [i.get_text().rstrip("\n") for i in table_headers] # Formatting table headers to a list
table_headers

['Postal Code', 'Borough', 'Neighbourhood']

In [4]:
table_data = table.find_all('td') # Isolating table data
table_data = [i.get_text().rstrip("\n") for i in table_data] # Formatting data to a list 

In [5]:
table_data[:20]

['M1A',
 'Not assigned',
 'Not assigned',
 'M2A',
 'Not assigned',
 'Not assigned',
 'M3A',
 'North York',
 'Parkwoods',
 'M4A',
 'North York',
 'Victoria Village',
 'M5A',
 'Downtown Toronto',
 'Regent Park, Harbourfront',
 'M6A',
 'North York',
 'Lawrence Manor, Lawrence Heights',
 'M7A',
 'Downtown Toronto']

Now that we have the data loaded into our notebook, let's perform some preliminary cleaning and load it into a pandas dataframe.

In [6]:
postal_code = [table_data[i] for i in range(0,len(table_data),3)] # Separating data into columns 
borough = [table_data[i+1] for i in range(0,len(table_data),3)] 
neighbourhood = [table_data[i+2] for i in range(0,len(table_data),3)] 
table_values2 = [postal_code, borough, neighbourhood] 
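The slicing above relies on the flat list holding exactly three values per table row. A minimal sketch of the same idea, with an explicit check on that assumption, is shown below. The `chunk_rows` helper and the `sample` list are hypothetical names introduced for illustration.

```python
def chunk_rows(flat, width=3):
    """Split a flat list of cell values into rows of `width` items."""
    if len(flat) % width != 0:
        # Fail early instead of silently misaligning columns.
        raise ValueError(f"expected a multiple of {width} cells, got {len(flat)}")
    return [tuple(flat[i:i + width]) for i in range(0, len(flat), width)]

# Hypothetical sample taken from the first rows shown earlier.
sample = ['M3A', 'North York', 'Parkwoods',
          'M4A', 'North York', 'Victoria Village']

rows = chunk_rows(sample)
# Transpose the rows back into one list per column.
postal_code, borough, neighbourhood = (list(col) for col in zip(*rows))
```

Grouping into rows first makes it easier to spot a missing cell, since a row count that does not divide evenly is reported immediately.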

In [7]:
headers_data = dict(zip(table_headers, table_values2)) # Creating a dictionary object with headers and data together
df = pd.DataFrame(headers_data) # Creating dataframe 
df

|     | Postal Code | Borough | Neighbourhood |
| --- | --- | --- | --- |
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


Now that our data has been loaded into a dataframe, we need to perform some more cleaning.

In [8]:
df = df.drop(df[df['Borough'] == 'Not assigned'].index).reset_index(drop = True) 
# Drops rows to which a Borough is not assigned 
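The same filtering step can also be written as a boolean mask that keeps the rows we want rather than dropping the ones we do not. The sketch below is a small, self-contained illustration; the `sample` dataframe is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for the scraped dataframe.
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0.
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
```

Both forms should give the same result here; the mask version simply reads as "keep assigned rows".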

In [9]:
# Combining neighbourhoods that share the same postal code, separated by commas
df = df.groupby('Postal Code', as_index=False).agg({'Borough': 'first', 'Neighbourhood': ', '.join})
df

|     | Postal Code | Borough | Neighbourhood |
| --- | --- | --- | --- |
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |
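When a postal code is spread over several rows, the neighbourhood names need an explicit separator as they are combined, because summing object-dtype string columns concatenates them with nothing in between. A minimal sketch of one way to do this follows. The `sample` dataframe, in which M5A is split across two rows, is a hypothetical example and does not reflect the table as scraped above.

```python
import pandas as pd

# Hypothetical data: M5A appears twice, once per neighbourhood.
sample = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Keep one Borough per postal code and join the neighbourhood names with commas.
merged = (
    sample.groupby('Postal Code', as_index=False)
          .agg({'Borough': 'first', 'Neighbourhood': ', '.join})
)
```

Taking the first Borough assumes every row for a postal code names the same borough, which holds in the sample but is worth checking on real data.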


In [10]:
df.shape

(103, 3)