# Wikipedia Webscraping

For this exercise I chose the BeautifulSoup library to scrape the page. The first step is therefore to install the package.

In [2]:
! pip install beautifulsoup4



The next step is to parse the HTML of the wiki page with bs4. Using the inspect function in my web browser, I located the class of the table, then stored references to the table's rows and column headers in similarly named variables.

In [1]:
from bs4 import BeautifulSoup
import requests

source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

soup = BeautifulSoup(source, 'html.parser')

table = soup.find('table', {'class':"wikitable sortable"}).tbody
rows = table.find_all('tr')
columns = [v.text.replace('\n', '') for v in rows[0].find_all('th')]
print(columns)

['Postal Code', 'Borough', 'Neighborhood']
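
The same lookup can be tried offline on a small inline table, which makes it easier to see what each call returns. This is a minimal sketch, not the code used above: the `html` string is a hypothetical stand-in for the wiki page, and `get_text(strip=True)` is swapped in for `.text.replace('\n','')` as an alternative way to trim the cell text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the wiki page: a tiny table with the same layout
html = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# class_="wikitable" matches any table whose class list contains "wikitable"
table = soup.find("table", class_="wikitable")
rows = table.find_all("tr")

# The first row holds the headers; the remaining rows hold the data cells
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]

print(columns)
print(records)
```

Matching on a single class name is slightly more forgiving than matching the full `"wikitable sortable"` string, which may matter if the page's class list changes.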


Now that we have the data we want, we can pass it to a pandas dataframe, populating it with the rows of the wikitable. I also exported a CSV to check the full result list.

In [17]:
import pandas as pds

# Collect the cell text of every data row (rows[0] is the header row)
records = []
for row in rows[1:]:
    tds = row.find_all('td')
    records.append([td.text.replace('\n', '') for td in tds[:3]])

# Build the dataframe in one step; DataFrame.append was removed in pandas 2.0
df = pds.DataFrame(records, columns=columns)

# Export once, after the loop, to check the full result list
df.to_csv(r'C:\Users\Docherty.ATGCH\Desktop\Coding101\hello.csv', index=False)
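
To check an export without relying on a machine-specific path, the file can be written to a temporary folder and read straight back. The sketch below assumes a two-row toy dataframe and a temporary directory, both of which are illustrative and not part of the original run. It shows that `index=False` keeps the row index out of the file, so the CSV comes back with only the three data columns.

```python
import tempfile
from pathlib import Path

import pandas as pds

# Toy dataframe (illustrative values) with a non-default index, as after filtering
df = pds.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighborhood": ["Parkwoods", "Victoria Village"],
    },
    index=[2, 3],
)

with tempfile.TemporaryDirectory() as tmp:
    # Path's "/" operator inserts the separator between folder and file name
    out_path = Path(tmp) / "hello.csv"
    df.to_csv(out_path, index=False)
    check = pds.read_csv(out_path)

print(check.columns.tolist())
print(check.shape)
```

Building the path with `pathlib` avoids the easy mistake of concatenating a folder and a file name without a separator between them.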


I then cleaned the dataframe by removing every row whose Borough is 'Not assigned'.

In [18]:
df = df[df['Borough']!='Not assigned']
print(df)

    Postal Code           Borough  \
2           M3A        North York   
3           M4A        North York   
4           M5A  Downtown Toronto   
5           M6A        North York   
6           M7A  Downtown Toronto   
..          ...               ...   
160         M8X         Etobicoke   
165         M4Y  Downtown Toronto   
168         M7Y      East Toronto   
169         M8Y         Etobicoke   
178         M8Z         Etobicoke   

                                          Neighborhood  
2                                            Parkwoods  
3                                     Victoria Village  
4                            Regent Park, Harbourfront  
5                     Lawrence Manor, Lawrence Heights  
6          Queen's Park, Ontario Provincial Government  
..                                                 ...  
160      The Kingsway, Montgomery Road, Old Mill North  
165                               Church and Wellesley  
168  Business reply mail Processing Centre
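
The filter works by building a boolean mask and keeping only the rows where it is `True`. Here is a minimal sketch on a toy dataframe (the three rows are illustrative). The `reset_index(drop=True)` call is an optional extra that the cell above does not use, which is why the original output keeps the gaps in its index.

```python
import pandas as pds

# Toy data with one unassigned borough (illustrative values)
toy = pds.DataFrame(
    {
        "Postal Code": ["M1A", "M3A", "M5A"],
        "Borough": ["Not assigned", "North York", "Downtown Toronto"],
        "Neighborhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront"],
    }
)

# True for every row that has a real borough
mask = toy["Borough"] != "Not assigned"

# Keep the matching rows and renumber them from 0
kept = toy[mask].reset_index(drop=True)
print(kept)
```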

I then filled in every neighborhood marked 'Not assigned' with the value of its borough.

In [31]:
df.loc[df['Neighborhood']=='Not assigned','Neighborhood']=df['Borough']

print(df)

df.to_csv(r'C:\Users\Docherty.ATGCH\Desktop\Coding101\hello.csv', index=False)

    Postal Code           Borough  \
2           M3A        North York   
3           M4A        North York   
4           M5A  Downtown Toronto   
5           M6A        North York   
6           M7A  Downtown Toronto   
..          ...               ...   
160         M8X         Etobicoke   
165         M4Y  Downtown Toronto   
168         M7Y      East Toronto   
169         M8Y         Etobicoke   
178         M8Z         Etobicoke   

                                          Neighborhood  
2                                            Parkwoods  
3                                     Victoria Village  
4                            Regent Park, Harbourfront  
5                     Lawrence Manor, Lawrence Heights  
6          Queen's Park, Ontario Provincial Government  
..                                                 ...  
160      The Kingsway, Montgomery Road, Old Mill North  
165                               Church and Wellesley  
168  Business reply mail Processing Centre
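
Because the displayed rows contain no unassigned neighborhoods, the effect of this step is easier to see on a toy example. The sketch below uses made-up rows (the `M9Z` entry and "Example Borough" are hypothetical) and selects the same rows on both sides of the assignment, a slightly more explicit variant of the statement above.

```python
import pandas as pds

# Toy data: the second row has a borough but no neighborhood (hypothetical values)
toy = pds.DataFrame(
    {
        "Postal Code": ["M5A", "M9Z"],
        "Borough": ["Downtown Toronto", "Example Borough"],
        "Neighborhood": ["Regent Park, Harbourfront", "Not assigned"],
    }
)

# Rows whose neighborhood is missing
missing = toy["Neighborhood"] == "Not assigned"

# Copy the borough into the neighborhood column for those rows only
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]
print(toy)
```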

## Here is a snapshot of the table

In [34]:
df

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 165 | M4Y | Downtown Toronto | Church and Wellesley |
| 168 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 169 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


## And here is the shape of the dataframe

In [35]:
df.shape

(103, 3)