# Applied Data Science Capstone Project

<h2>Web scraping information about Toronto and its Neighborhoods</h2>

Below, we scrape the Toronto postal codes Wikipedia page with the BeautifulSoup package and compile the results into a list of dictionaries. Each dictionary holds a postal code (called a zipcode throughout this notebook), the borough that postal code belongs to, and the neighborhoods associated with it. We then convert the list of dictionaries into a pandas dataframe, clean it up a little, and confirm the rows and columns of the final dataframe.

Let's begin!

First, let's import the packages we will need to complete this process.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Now, let's request the Toronto postal codes Wikipedia page and store its HTML as text.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_data = requests.get(url).text
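
A bare `requests.get(url).text` returns whatever the server sends back, including the body of an error page. The sketch below is one hedged way to make the download step fail loudly instead. The `fetch_html` name, the 10-second timeout, and the `User-Agent` string are illustrative assumptions, not part of the original notebook.

```python
import requests

# Placeholder identifier for this script; an assumption, replace with your own.
USER_AGENT = 'toronto-neighborhoods-notebook/0.1 (educational project)'


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, headers={'User-Agent': USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, `html_data = fetch_html(url)` could stand in for the one-line request above.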

Here, we begin using BeautifulSoup to parse the HTML text we downloaded.

In [3]:
soup = BeautifulSoup(html_data, 'html.parser')

In [4]:
# Select the first table in the HTML
table = soup.find('table')

# Start with an empty list, loop over the table cells, build one dictionary per cell, and append it to the list
zip_table = []

for td in table.find_all('td'):
    # Each cell holds a zipcode, a borough, and the associated neighborhoods
    cell = {}
    if td.span.text == 'Not assigned':
        continue  # skip postal codes that have no borough
    span_text = td.span.text
    cell['Zipcode'] = td.p.text[:3]
    cell['Borough'] = span_text.split('(')[0]
    cell['Neighborhood'] = span_text.split('(')[1].replace(')', ' ').replace(' /', ',').strip(' ')
    zip_table.append(cell)
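
The string handling in the loop is easier to follow on a single value. The sketch below repeats the same chain of `split`, `replace`, and `strip` calls in a small helper. The name `parse_cell` and the sample strings are assumptions modeled on the rows shown later in this notebook, not text copied from the live page.

```python
def parse_cell(span_text):
    """Split 'Borough(Neighborhood / Neighborhood)' into its two parts."""
    parts = span_text.split('(')
    borough = parts[0]
    # Drop the closing parenthesis, turn ' /' separators into commas, trim spaces
    neighborhood = parts[1].replace(')', ' ').replace(' /', ',').strip(' ')
    return borough, neighborhood


# Hypothetical inputs in the shape the loop expects
single = parse_cell('North York(Parkwoods)')
multiple = parse_cell('Downtown Toronto(Regent Park / Harbourfront)')
print(single)
print(multiple)
```

A value with no opening parenthesis would raise an `IndexError` here, just as it would in the loop, so this approach depends on every assigned cell following the same pattern.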

Now that we have acquired and organized our data, let's put it into a dataframe for ease of use.

In [16]:
df = pd.DataFrame(zip_table)
df.shape #How many rows and columns do we have?

(103, 3)

In [27]:
display(df) #Display the whole dataframe. Note: Don't do this with really large datasets!

| | Zipcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


Oops! Looks like some values in the Borough column have text running together (for example, 'East YorkEast Toronto'). Let's fix that!

In [20]:
df['Borough'] = df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
                                       'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
                                       'East YorkEast Toronto': 'East York/East Toronto',
                                       'MississaugaCanada Post Gateway Processing Centre': 'Mississauga'})
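
Hard-coding each bad value works for a handful of rows, but it helps to know which values need attention in the first place. As a rough heuristic, a lowercase letter immediately followed by an uppercase letter or a digit usually marks two pieces of text that were glued together. The sketch below applies that idea to plain strings. `find_run_together` is a hypothetical helper, and the pattern is an assumption that would also flag legitimate names such as 'McDonald'.

```python
import re

# Lowercase letter directly followed by an uppercase letter or a digit
RUN_TOGETHER = re.compile(r'[a-z][A-Z0-9]')


def find_run_together(values):
    """Return the values that look like two strings joined without a space."""
    return [value for value in values if RUN_TOGETHER.search(value)]


boroughs = [
    'North York',
    "Queen's Park",
    'East YorkEast Toronto',
    'MississaugaCanada Post Gateway Processing Centre',
]
flagged = find_run_together(boroughs)
print(flagged)
```

On the dataframe, passing `df['Borough'].unique()` to the helper would list the distinct values worth inspecting by hand.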

Let's take a peek at the dataframe to see if our text issues are resolved.

In [28]:
display(df)

| | Zipcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


Now let's confirm the rows and columns once more.

In [9]:
df.shape

(103, 3)
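
The shape confirms the row and column counts but says nothing about the values themselves. A few extra checks could be run on the list of dictionaries before it becomes a dataframe, and the sketch below is one possibility. The `check_records` helper is hypothetical, the sample records reuse two rows displayed above plus a deliberate duplicate, and the `M` + digit + letter pattern is an assumption based on the postal codes shown in this notebook.

```python
import re

EXPECTED_KEYS = {'Zipcode', 'Borough', 'Neighborhood'}
ZIPCODE_PATTERN = re.compile(r'M\d[A-Z]')  # assumed shape, e.g. 'M3A'


def check_records(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    seen = set()
    for i, record in enumerate(records):
        if set(record) != EXPECTED_KEYS:
            problems.append(f'record {i}: unexpected keys {sorted(record)}')
            continue
        zipcode = record['Zipcode']
        if not ZIPCODE_PATTERN.fullmatch(zipcode):
            problems.append(f'record {i}: malformed zipcode {zipcode!r}')
        if zipcode in seen:
            problems.append(f'record {i}: duplicate zipcode {zipcode!r}')
        seen.add(zipcode)
    return problems


sample = [
    {'Zipcode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'},
    {'Zipcode': 'M4A', 'Borough': 'North York', 'Neighborhood': 'Victoria Village'},
    {'Zipcode': 'M4A', 'Borough': 'North York', 'Neighborhood': 'Victoria Village'},
]
print(check_records(sample))
```

An empty list returned by `check_records(zip_table)` would mean that every record passed these checks.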