# Web-Scraping a Wikipedia Article

Created by: Sangwook Cheon

I will use the Beautiful Soup library to scrape a Wikipedia article.

> Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

## Import Libraries
First we import Beautiful Soup and the other libraries used alongside it.



In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Loading HTML and finding the table section
Now load the HTML of the Wikipedia page that lists the Canadian postal codes beginning with M. Beautiful Soup will then parse it.



In [3]:
# .text returns the response body as a string of HTML.
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

# To quickly see the html
# print(soup.prettify()) # This prints the entire HTML file, which takes up space :) 
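
The request above assumes the page always loads. As a minimal sketch (not part of the original notebook), the download could be wrapped in a helper that sets a timeout and fails loudly on an HTTP error. The function name `fetch_html` and the User-Agent string are hypothetical; `requests.get`, its `headers` and `timeout` parameters, and `raise_for_status` are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of `url`, raising an exception on HTTP errors."""
    # Hypothetical User-Agent value; replace it with one that describes your script.
    headers = {"User-Agent": "postal-code-notebook/0.1 (educational example)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (requires network access):
# source = fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
```

With this helper, a failed download stops the notebook at the request step rather than later, when `soup.find` returns `None`.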

Now let's find the table section from the HTML file:

In [5]:
# From the HTML, find the portion that holds the table.
# class_ matches the table's class attribute, so we pick the right table even if the page has several.
raw_table = soup.find('table', class_='wikitable sortable')
# print(raw_table.prettify())
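
To illustrate how the class lookup behaves, here is a small sketch on made-up HTML (the two tables below are assumptions, not the real page). Passing a multi-word string to `class_` matches the exact value of the `class` attribute, while a CSS selector via `select_one` matches tables that carry both classes in any order. `find` returns `None` when nothing matches, which is worth checking before processing.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page with two tables.
html = """
<table class="infobox"><tr><td>not this one</td></tr></table>
<table class="wikitable sortable"><tr><td>M3A</td></tr></table>
"""
# html.parser ships with Python, so this sketch does not need lxml.
demo_soup = BeautifulSoup(html, "html.parser")

# Exact match on the full class attribute string.
by_exact = demo_soup.find("table", class_="wikitable sortable")
# CSS selector: any table carrying both classes.
by_css = demo_soup.select_one("table.wikitable.sortable")
# No such table: find returns None rather than raising.
missing = demo_soup.find("table", class_="no-such-class")

print(by_exact.td.text)
print(by_css.td.text)
print(missing)
```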

## Processing
Now let’s process the HTML of the table to convert it into a DataFrame. Many rows have no assigned borough; we skip them because the missing information cannot be filled in. If a row has a borough but no **Neighborhood**, the neighborhood is set to the borough name.

In [6]:
#Get all rows from the table. <tr> tag refers to a row.
rows_all = raw_table.find_all('tr')

#Initialize empty list.
row_data = []

#For each row, collect its cell values and append them to the list.
for row in rows_all:
    td = row.find_all('td')

    #A list of the cell values in one row (empty for the header row, which has no <td>)
    row = [i.text for i in td]

    #Only keep non-empty rows that have an assigned borough
    if len(row) != 0 and row[1] != 'Not assigned':
        row[2] = row[2].rstrip() #to remove the trailing \n
        #If Neighborhood is not assigned, use the borough name instead
        if row[2] == 'Not assigned':
            row[2] = row[1]
        #Append the row to the big list.
        row_data.append(row)
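
The filtering rules can be checked in isolation on a few hand-written rows. This is a sketch only: the function name `clean_rows` and the sample rows are made up for illustration, and the logic mirrors the loop above without touching the scraped HTML.

```python
def clean_rows(rows):
    """Apply the notebook's rules to lists of [postal code, borough, neighborhood]."""
    cleaned = []
    for code, borough, neighborhood in rows:
        # Strip the trailing newline that the last cell of each row carries.
        neighborhood = neighborhood.rstrip()
        # Rule 1: skip rows without an assigned borough.
        if borough == 'Not assigned':
            continue
        # Rule 2: an unassigned neighborhood takes the borough name.
        if neighborhood == 'Not assigned':
            neighborhood = borough
        cleaned.append([code, borough, neighborhood])
    return cleaned


# Hypothetical sample rows in the same shape as the scraped cells.
sample_rows = [
    ['M1A', 'Not assigned', 'Not assigned\n'],
    ['M3A', 'North York', 'Parkwoods\n'],
    ['M7A', "Queen's Park", 'Not assigned\n'],
]
print(clean_rows(sample_rows))
# [['M3A', 'North York', 'Parkwoods'], ['M7A', "Queen's Park", "Queen's Park"]]
```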

In [7]:
data = pd.DataFrame(row_data, columns=['PostalCode', 'Borough', 'Neighborhood'])

#preview ten rows of DataFrame
data.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |


The table loaded correctly. One more step remains: combine neighborhoods that share a postal code into a single row, so that a postal code with several neighborhoods lists them separated by commas.

In [8]:
postal_codes = data.PostalCode.unique()
print(postal_codes)

['M3A' 'M4A' 'M5A' 'M6A' 'M7A' 'M9A' 'M1B' 'M3B' 'M4B' 'M5B' 'M6B' 'M9B'
 'M1C' 'M3C' 'M4C' 'M5C' 'M6C' 'M9C' 'M1E' 'M4E' 'M5E' 'M6E' 'M1G' 'M4G'
 'M5G' 'M6G' 'M1H' 'M2H' 'M3H' 'M4H' 'M5H' 'M6H' 'M1J' 'M2J' 'M3J' 'M4J'
 'M5J' 'M6J' 'M1K' 'M2K' 'M3K' 'M4K' 'M5K' 'M6K' 'M1L' 'M2L' 'M3L' 'M4L'
 'M5L' 'M6L' 'M9L' 'M1M' 'M2M' 'M3M' 'M4M' 'M5M' 'M6M' 'M9M' 'M1N' 'M2N'
 'M3N' 'M4N' 'M5N' 'M6N' 'M9N' 'M1P' 'M2P' 'M4P' 'M5P' 'M6P' 'M9P' 'M1R'
 'M2R' 'M4R' 'M5R' 'M6R' 'M7R' 'M9R' 'M1S' 'M4S' 'M5S' 'M6S' 'M1T' 'M4T'
 'M5T' 'M1V' 'M4V' 'M5V' 'M8V' 'M9V' 'M1W' 'M4W' 'M5W' 'M8W' 'M9W' 'M1X'
 'M4X' 'M5X' 'M8X' 'M4Y' 'M7Y' 'M8Y' 'M8Z']


Above are the unique postal codes.

In [9]:
#Collect one record per unique postal code.
records = []
for code in postal_codes:
    # Get the rows that share this postal code.
    df = data.loc[data['PostalCode'] == code]
    borough = df.iloc[0, 1]
    # Join all neighborhoods for this code into one comma-separated string.
    neighborhoods = ', '.join(df['Neighborhood'].tolist())
    records.append({'PostalCode': code, 'Borough': borough, 'Neighborhood': neighborhoods})

#DataFrame.append was removed in pandas 2.0, so build the new DataFrame from the list in one step.
clean_data = pd.DataFrame(records, columns=['PostalCode', 'Borough', 'Neighborhood'])
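
The same merge can also be expressed with `groupby`, which avoids the explicit loop. The sketch below runs on a small hand-made frame (the `demo` rows are illustrative, taken from the preview above) and is offered as an alternative, not as the method this notebook uses. `sort=False` keeps postal codes in order of first appearance, `'first'` takes the first borough in each group, and `', '.join` concatenates the neighborhoods.

```python
import pandas as pd

# Small stand-in for the scraped `data` frame.
demo = pd.DataFrame(
    [
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M5A', 'Downtown Toronto', 'Regent Park'],
        ['M7A', "Queen's Park", "Queen's Park"],
    ],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

grouped = (
    demo.groupby('PostalCode', sort=False)
        .agg({'Borough': 'first', 'Neighborhood': ', '.join})
        .reset_index()
)
print(grouped)
```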

In [10]:
clean_data.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


Some postal codes now list several neighborhoods in the Neighborhood column. The table is fully cleaned and ready to use. It can be useful to save the cleaned table as a CSV file on the local machine.
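
A minimal sketch of saving and reloading the table follows. The frame `demo_clean`, the temporary directory, and the file name `toronto_postal_codes.csv` are assumptions for illustration. Passing `index=False` to `to_csv` leaves the row index out of the file, so reading it back does not produce an extra unnamed index column.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for `clean_data`.
demo_clean = pd.DataFrame(
    {
        'PostalCode': ['M5A', 'M7A'],
        'Borough': ['Downtown Toronto', "Queen's Park"],
        'Neighborhood': ['Harbourfront, Regent Park', "Queen's Park"],
    }
)

# Hypothetical output location; any writable path works.
csv_path = os.path.join(tempfile.mkdtemp(), 'toronto_postal_codes.csv')

# index=False keeps the row index out of the file.
demo_clean.to_csv(csv_path, index=False)

# Values containing commas are quoted on write and restored on read.
reloaded = pd.read_csv(csv_path)
print(reloaded)
```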

Note that the order of rows differs from the sample table shown in the Submission section on Coursera. I compared only some of the rows against the sample, not every one, so I am assuming the table was processed correctly.
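
If the sample table were available as a DataFrame, the comparison could be made independent of row order. The sketch below is one way to do that under stated assumptions: the helper name `same_rows` and both demo frames are hypothetical, and the two frames are assumed to share the same columns with a unique `PostalCode` per row.

```python
import pandas as pd


def same_rows(left, right, key='PostalCode'):
    """Return True if both frames hold the same rows, ignoring row order."""
    # Sort both frames by the key and discard the old index before comparing.
    left_sorted = left.sort_values(key).reset_index(drop=True)
    right_sorted = right.sort_values(key).reset_index(drop=True)
    return left_sorted.equals(right_sorted)


columns = ['PostalCode', 'Borough', 'Neighborhood']
# Hypothetical frames: the same two rows in a different order.
ours = pd.DataFrame(
    [['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village']],
    columns=columns,
)
sample = pd.DataFrame(
    [['M4A', 'North York', 'Victoria Village'], ['M3A', 'North York', 'Parkwoods']],
    columns=columns,
)
print(same_rows(ours, sample))
```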

In [11]:
clean_data.shape

(103, 3)

After cleaning, the table has 103 rows and 3 columns, one row per unique postal code. Many neighborhoods share a postal code, which is why several rows list more than one neighborhood. Thank you for reading this notebook!