## 1. Scraping data from Wikipedia and creating a dataframe of neighborhoods in Toronto

##### Using the BeautifulSoup package to scrape the table of postal codes for Toronto, Canada from Wikipedia, then using pandas to read the table into a dataframe (df)

#### Importing libraries

In [2]:
# Importing libraries for webscraping (BeautifulSoup) and dataframe (Pandas)

import requests
from bs4 import BeautifulSoup
import pandas as pd

print('All done! Needed libraries imported!')

All done! Needed libraries imported!


#### Scraping from Wikipedia

In [3]:
# Download the Wikipedia page and stop early if the request failed
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
page.raise_for_status()
soup = BeautifulSoup(page.content, 'html.parser')

In [4]:
# Take the first table body on the page and collect the text of every row
table = soup.find('tbody')
rows = table.select('tr')
row = [r.get_text() for r in rows]
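
Splitting each row's text on newlines later depends on how the page markup happens to be laid out. As a minimal sketch of an alternative, each cell can be read explicitly. The HTML below is a made-up miniature of the Wikipedia table (not the live page), and `parse_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal-code table; the real page is much larger.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

def parse_rows(html):
    """Return one list of stripped cell strings per table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]

rows_as_cells = parse_rows(SAMPLE_HTML)
header, records = rows_as_cells[0], rows_as_cells[1:]
print(header)
print(records)
```

Because every cell is already a separate string, a dataframe could be built directly with `pd.DataFrame(records, columns=header)`, which avoids the empty columns that newline splitting produces.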

#### Data preprocessing

In [6]:
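# One string per table row, split on newlines into separate columns
df = pd.DataFrame(row)
df1 = df[0].str.split('\n', expand=True)
# Use the first row as the header, then drop it from the data
df2 = df1.rename(columns=df1.iloc[0])
df3 = df2.drop(df2.index[0])
df3.head()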

Unnamed: 0,Unnamed: 1,Postcode,Borough,Neighbourhood,Unnamed: 5
1,,M1A,Not assigned,Not assigned,
2,,M2A,Not assigned,Not assigned,
3,,M3A,North York,Parkwoods,
4,,M4A,North York,Victoria Village,
5,,M5A,Downtown Toronto,Harbourfront,
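
The two unnamed, empty columns in the output above come from the leading and trailing newline in each row's text. A small sketch of one way to drop them is shown here. The `raw_rows` strings are hypothetical stand-ins shaped like the scraped row text, not the actual scraped data.

```python
import pandas as pd

# Hypothetical row strings shaped like the text returned for each <tr>.
raw_rows = [
    "\nPostcode\nBorough\nNeighbourhood\n",
    "\nM1A\nNot assigned\nNot assigned\n",
    "\nM3A\nNorth York\nParkwoods\n",
]

# Splitting on newlines yields an empty first and last column per row.
split = pd.Series(raw_rows).str.split("\n", expand=True)

# Promote the first row to the header and keep the remaining rows as data.
table = split.iloc[1:].copy()
table.columns = split.iloc[0].tolist()

# Keep only the columns whose header is a non-empty string.
table = table.loc[:, table.columns != ""]
print(table)
```

A boolean mask is used instead of dropping by name because both empty columns share the same empty header.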


#### Rename Postcode to PostalCode

In [9]:
df4 = df3.rename(columns={'Postcode': 'PostalCode'})
df4.head()

Unnamed: 0,Unnamed: 1,PostalCode,Borough,Neighbourhood,Unnamed: 5
1,,M1A,Not assigned,Not assigned,
2,,M2A,Not assigned,Not assigned,
3,,M3A,North York,Parkwoods,
4,,M4A,North York,Victoria Village,
5,,M5A,Downtown Toronto,Harbourfront,


#### Processing only the cells that have an assigned borough

In [10]:
df5 = df4[df4.Borough != 'Not assigned']
df5.head()

Unnamed: 0,Unnamed: 1,PostalCode,Borough,Neighbourhood,Unnamed: 5
3,,M3A,North York,Parkwoods,
4,,M4A,North York,Victoria Village,
5,,M5A,Downtown Toronto,Harbourfront,
6,,M6A,North York,Lawrence Heights,
7,,M6A,North York,Lawrence Manor,


#### Combining neighborhoods that share the same postal code

In [11]:
df6 = df5.groupby(['PostalCode', 'Borough'], sort = False).agg(','.join)
df6.reset_index(inplace = True)
df6.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
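
The filter and grouping steps can be illustrated on a small hand-built frame. This is only a sketch: the `sample` rows reuse a few values from the outputs above, and selecting the `Neighbourhood` column before aggregating is an assumption made here so that no other leftover column gets joined as well.

```python
import pandas as pd

# Small hand-built frame reusing a few rows from the outputs above.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M6A", "M6A"],
    "Borough": ["Not assigned", "North York", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods",
                      "Lawrence Heights", "Lawrence Manor"],
})

# Keep only the rows that have an assigned borough.
assigned = sample[sample["Borough"] != "Not assigned"]

# Join the neighbourhoods of each postal code into one comma-separated string.
combined = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighbourhood"]
    .agg(",".join)
    .reset_index()
)
print(combined)
```

Passing `sort=False` keeps the postal codes in their order of first appearance, as in the notebook cell.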


#### Giving Neighborhood the same value as Borough when no neighborhood is assigned

In [12]:
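# Where the neighbourhood is 'Not assigned', reuse the borough name instead of a hard-coded value
df7 = df6.copy()
df7['Neighbourhood'] = df7['Neighbourhood'].mask(df7['Neighbourhood'] == 'Not assigned', df7['Borough'])
df7.head()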

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


## 2. Latitude and Longitude of Neighborhoods

#### Load the CSV file from http://cocl.us/Geospatial_data and rename 'Postal Code' to 'PostalCode' to match the first dataframe

In [18]:
data = "http://cocl.us/Geospatial_data"
df8 = pd.read_csv(data)
df8.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
df8.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### Merge the first and second dataframes into one

In [20]:
df9 = pd.merge(df7, df8, on='PostalCode')
df9.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
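
By default `pd.merge` performs an inner join, so a postal code with no matching coordinates would be dropped without warning. The sketch below shows one way to check for that, using a left join with `indicator=True`. The frames are tiny hand-built examples, and the postal code "M9Z" with its borough and neighbourhood is invented purely to demonstrate an unmatched row.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    # "M9Z" is a made-up code with no coordinates, for illustration only.
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Nowhere"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every neighbourhood row; "_merge" records where each row matched.
checked = pd.merge(
    neighbourhoods, coords,
    on="PostalCode", how="left",
    validate="one_to_one", indicator=True,
)

# Postal codes that found no coordinates in the second frame.
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

If `missing` is empty, the inner join used in the notebook loses no rows. The `validate="one_to_one"` argument additionally raises an error if a postal code appears more than once in either frame.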
