# Week 3 Assignment


***

## Section 1: Scraping and cleaning the data.

In this first section, we use the requests and BeautifulSoup libraries to scrape the list of Canadian postal codes beginning with M from its Wikipedia article.

First, the standard imports:

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

Next, the requests library is used to download the Wikipedia article, and the article is then parsed with the BeautifulSoup library.

In [13]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
r.raise_for_status()

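# Name the parser explicitly so the result does not depend on which parsers are installed.
wiki_soup = BeautifulSoup(r.content, "html.parser")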

Next, we use BeautifulSoup to locate the table that lists each postal code together with its borough and neighborhood.
Note that this assumes the table is the first one in the HTML document.

In [9]:
table = wiki_soup.find("tbody")
rowList = []
columns = ["Postal Code", "Borough", "Neighborhood"]

# We want to skip the first row of the table, as it only contains the column headers.
for row in table.find_all("tr")[1:]:
    rowList.append(dict(zip(columns, row.stripped_strings)))
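
To see what the loop produces, here is a minimal sketch that applies the same extraction to a small hand-written HTML table. The `sample_html` string and its rows are illustrative assumptions rather than content fetched from the article, and `html.parser` is chosen only so the sketch runs without extra parser packages.

```python
from bs4 import BeautifulSoup

# Hand-written sample table (an assumption for illustration, not scraped data).
sample_html = """
<table>
  <tbody>
    <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

columns = ["Postal Code", "Borough", "Neighborhood"]
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Skip the header row, then pair each column name with the row's text fragments.
sample_rows = [
    dict(zip(columns, tr.stripped_strings))
    for tr in sample_soup.find("tbody").find_all("tr")[1:]
]
print(sample_rows)
```

Keep in mind that `zip` stops at the shorter of its inputs, so a row with fewer than three text fragments yields a dictionary with fewer keys, and any fragments beyond the third are dropped.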

The table is then converted to a pandas dataframe.

In [10]:
df = pd.DataFrame(rowList, columns=columns)
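
As a small illustration of this conversion, the sketch below builds a DataFrame from a hand-written list of dictionaries (the `example_rows` values are made up for the example). Passing `columns` fixes the column order, and a key that is missing from a dictionary shows up as `NaN`.

```python
import pandas as pd

columns = ["Postal Code", "Borough", "Neighborhood"]

# Hypothetical rows; the second one is deliberately missing its "Neighborhood" key.
example_rows = [
    {"Postal Code": "M3A", "Borough": "North York", "Neighborhood": "Parkwoods"},
    {"Postal Code": "M4A", "Borough": "North York"},
]

example_df = pd.DataFrame(example_rows, columns=columns)
print(example_df)
```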

Now, we need to filter out the rows of the table in which the Borough value is "Not assigned".

In [11]:
boroughFilter = df['Borough'] != "Not assigned"
df = df[boroughFilter].reset_index(drop=True)
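
The filtering step can be tried on a toy DataFrame. In this sketch the `toy` frame and its values are assumptions made up for illustration; it shows that the boolean mask keeps only the assigned rows and that `reset_index(drop=True)` renumbers them from 0.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M2A", "M4A"],
    "Borough": ["Not assigned", "North York", "Not assigned", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned", "Victoria Village"],
})

# True for rows whose borough is assigned.
mask = toy["Borough"] != "Not assigned"

# Without reset_index, the kept rows would retain their old labels 1 and 3.
kept = toy[mask].reset_index(drop=True)
print(kept)
```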

We also need to check whether any rows of the dataframe have an assigned borough but no assigned neighborhood. For any row whose Neighborhood is "Not assigned", we set the neighborhood to the same value as the borough.

In [23]:
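neighborhoodFilter = df['Neighborhood'] != "Not assigned"
if neighborhoodFilter.all():
    print("There are NO \"Not assigned\" neighborhoods in the data!")
else:
    # Count the rows that ARE "Not assigned" (the inverse of the filter).
    print("There are {} \"Not assigned\" neighborhoods in the data".format((~neighborhoodFilter).sum()))
    # Use the borough name wherever the neighborhood is missing.
    df.loc[~neighborhoodFilter, 'Neighborhood'] = df.loc[~neighborhoodFilter, 'Borough']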


There are NO "Not assigned" neighborhoods in the data!
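
Because the scraped data contained no such rows, the replacement has nothing to do here. The sketch below shows how it could look on a made-up frame in which one row has a borough but no neighborhood; the `demo` data is hypothetical and exists only to exercise that case.

```python
import pandas as pd

# Hypothetical rows: the second has a borough but a "Not assigned" neighborhood.
demo = pd.DataFrame({
    "Postal Code": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Regent Park", "Not assigned"],
})

# True for rows whose neighborhood is missing.
missing = demo["Neighborhood"] == "Not assigned"
print("Rows to fix:", missing.sum())

# Copy the borough name into the neighborhood column for the flagged rows only.
demo.loc[missing, "Neighborhood"] = demo.loc[missing, "Borough"]
print(demo)
```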


Finally, let's check the shape of our resulting data set:

In [28]:
print("The data set has", df.shape[0], "rows and", df.shape[1], "columns.")

The data set has 103 rows and 3 columns.
