# Toronto Neighborhood Data Scraping

### Importing the required libraries for web scraping and cleaning

In [10]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

## Scraping with Pandas

Reading the HTML tables from the Wikipedia URL and storing them in the variable *html* (`pd.read_html` returns a list of DataFrames, one per table found on the page)

In [46]:
html= pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

Using nested for loops to collect the required fields into a list of dictionaries, *table*, which is then converted into a dataframe named *toronto_data*

In [166]:
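table = []

# Each cell of the first table holds the postal code, borough and neighborhoods as one string
for row in html[0].values:
    for cell_text in row:
        # Skip unassigned postal codes
        if 'Not' in cell_text:
            continue
        cell = {}
        cell['Postal Code'] = cell_text[:3]
        cell['Borough'] = cell_text[3:cell_text.index('(')]
        cell['Neighborhood'] = cell_text[cell_text.index('(')+1:cell_text.index(')')]
        table.append(cell)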
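
The slicing above relies on every cell following the pattern `<3-character code><borough>(<neighborhoods>)`. A minimal sketch of the same parsing as a standalone function, run on hypothetical sample strings modeled on the table cells. The helper name `parse_cell` is an assumption and not part of the original notebook:

```python
def parse_cell(text):
    """Split a cell such as 'M3ANorth York(Parkwoods)' into its three fields.

    Returns None for unassigned postal codes.
    """
    if 'Not assigned' in text:
        return None
    open_paren = text.index('(')
    close_paren = text.index(')', open_paren)
    return {
        'Postal Code': text[:3],
        'Borough': text[3:open_paren],
        'Neighborhood': text[open_paren + 1:close_paren],
    }


# Hypothetical sample cells, not scraped data
samples = [
    'M1ANot assigned',
    'M3ANorth York(Parkwoods)',
    'M5ADowntown Toronto(Regent Park / Harbourfront)',
]
parsed = [cell for cell in map(parse_cell, samples) if cell is not None]
print(parsed)
```

Keeping the parsing in one function makes it possible to test the string handling without fetching the page.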

A view of the dataframe *toronto_data*

In [167]:
toronto_data= pd.DataFrame(table)
toronto_data

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...


### Cleaning the data 

In [146]:
# Split each entry on '/' and keep the list's string form without the surrounding brackets
toronto_data['Neighborhood'] = [str(i.split('/')).strip('[').strip(']') for i in toronto_data['Neighborhood']]
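
The expression above keeps the quotes and padding that come from the list's string form. If plain comma-separated names are preferred, one possible alternative is to strip each part and join the parts again. A sketch on hypothetical sample values; `tidy_neighborhoods` is an illustrative name, not part of the original notebook:

```python
def tidy_neighborhoods(value):
    """Turn 'A / B / C' into 'A, B, C' with surrounding whitespace removed."""
    return ', '.join(part.strip() for part in value.split('/'))


# Hypothetical sample values in the same shape as the Neighborhood column
samples = ['Parkwoods', 'Regent Park / Harbourfront']
cleaned = [tidy_neighborhoods(s) for s in samples]
print(cleaned)
# ['Parkwoods', 'Regent Park, Harbourfront']
```

On the dataframe this could be applied with `toronto_data['Neighborhood'].map(tidy_neighborhoods)`.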

In [148]:
# Separate borough names that were fused with adjacent text during extraction
toronto_data['Borough'] = toronto_data['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})
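
The replacement mapping above lists the fused borough names by hand. A rough sketch of how such candidates might be spotted automatically, assuming a fused name shows up as a lowercase letter directly followed by an uppercase one. The helper `looks_fused` and the sample list are hypothetical, and the heuristic would also flag legitimate names such as "McDonald":

```python
import re


def looks_fused(name):
    """Heuristic: a lowercase letter directly followed by an uppercase one."""
    return re.search(r'[a-z][A-Z]', name) is not None


# Hypothetical sample of borough values
boroughs = ['North York', "Queen's Park", 'EtobicokeNorthwest', 'East YorkEast Toronto']
suspects = [name for name in boroughs if looks_fused(name)]
print(suspects)
# ['EtobicokeNorthwest', 'East YorkEast Toronto']
```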

Final *toronto_data* dataframe

In [149]:
toronto_data

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,'Parkwoods'
1,M4A,North York,'Victoria Village'
2,M5A,Downtown Toronto,"'Regent Park ', ' Harbourfront'"
3,M6A,North York,"'Lawrence Manor ', ' Lawrence Heights'"
4,M7A,Queen's Park,'Ontario Provincial Government'
...,...,...,...
98,M8X,Etobicoke,"'The Kingsway ', ' Montgomery Road ', ' Old Mi..."
99,M4Y,Downtown Toronto,'Church and Wellesley'
100,M7Y,East Toronto Business,'Enclave of M4L'
101,M8Y,Etobicoke,"'Old Mill South ', "" King's Mill Park "", ' Sun..."


### Shape of *toronto_data* dataframe

In [169]:
toronto_data.shape

(103, 3)

## Scraping using Beautiful Soup library

### Reading the HTML text with the *Requests* library into the *url* variable and building a BeautifulSoup object named *soup*

In [18]:
url= requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [19]:
soup= BeautifulSoup(url,'lxml')

Feeding the data into a list of dictionaries named *table_contents* and converting it into a dataframe named *df*

In [24]:
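table_contents = []
table = soup.find('table')
# For each <td>, the <p> text starts with the postal code and the <span> text reads "Borough(Neighborhoods)"
for row in table.find_all('td'):
    if row.span.text == 'Not assigned':
        continue  # skip unassigned postal codes
    cell = {}
    cell['Postal Code'] = row.p.text[:3]
    cell['Borough'] = row.span.text.split('(')[0]
    # Drop the closing parenthesis and turn ' /' separators into commas
    neighborhood = row.span.text.split('(')[1].strip(')')
    cell['Neighborhood'] = neighborhood.replace(' /', ',').replace(')', ' ').strip(' ')
    table_contents.append(cell)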


df = pd.DataFrame(table_contents)
# Separate borough names that were fused with adjacent text during extraction
df['Borough'] = df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})
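
The loop above can be tried without a network request on a small inline HTML fragment. The markup below is a hypothetical stand-in that mimics the structure the loop expects (a `<p>` whose text starts with the code, and a `<span>` holding borough and neighborhoods). The real Wikipedia markup may differ or change over time. The sketch uses Python's built-in `html.parser` rather than `lxml`:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, not copied from the real page
sample_html = """
<table>
  <tr>
    <td><p><b>M1A</b><span>Not assigned</span></p></td>
    <td><p><b>M3A</b><span>North York(Parkwoods)</span></p></td>
    <td><p><b>M5A</b><span>Downtown Toronto(Regent Park / Harbourfront)</span></p></td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
rows = []
for td in soup.find('table').find_all('td'):
    label = td.span.text
    if label == 'Not assigned':
        continue  # skip unassigned postal codes
    # Split "Borough(Neighborhoods)" at the first opening parenthesis
    borough, _, rest = label.partition('(')
    neighborhoods = rest.rstrip(')').replace(' /', ',').strip()
    rows.append({
        'Postal Code': td.p.text[:3],
        'Borough': borough,
        'Neighborhood': neighborhoods,
    })
print(rows)
```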

A view of the dataframe *df*

In [25]:
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


### Shape of *df* dataframe

In [22]:
df.shape

(103, 3)

# Location data for Toronto Neighborhoods

## Reading Geospatial data from the provided csv file

In [15]:
geospatial_data= pd.read_csv('..//msiva//Downloads//Geospatial_Coordinates.csv')

In [16]:
geospatial_data

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


## Merging the geospatial data into the neighborhood dataframe

In [26]:
# Inner join on the shared 'Postal Code' column
toronto = pd.merge(df, geospatial_data, on='Postal Code')

In [27]:
toronto

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto Business,Enclave of M4L,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
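
`pd.merge` defaults to an inner join on the columns the two frames share, so a postal code missing from either side is dropped without warning. A sketch of one way to surface such rows with `how='left'` and `indicator=True`, using small hypothetical frames rather than the real data:

```python
import pandas as pd

# Hypothetical frames; 'M9Z' is a made-up code with no coordinates
neighborhoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# Keep every neighborhood row and record where each row found a match;
# validate raises if a postal code is duplicated on either side
merged = pd.merge(neighborhoods, coordinates, on='Postal Code',
                  how='left', indicator=True, validate='one_to_one')
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
# ['M9Z']
```

An empty `unmatched` list would suggest that every scraped postal code found coordinates.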
