# Python code for extracting Toronto neighborhood data from a Wikipedia page

Importing required libraries

In [1]:
import pandas as pd # importing pandas
import requests
from bs4 import BeautifulSoup # importing BeautifulSoup, a web scraping library
import numpy as np # importing numpy

## 1. Scraping and Exploring Dataset
 
 
Extracting tabular data from the Wikipedia page and storing it in df. Note that `pd.read_html` returns a list of dataframes, so the table itself is `df[0]`.

Python has several web scraping libraries and packages. BeautifulSoup is one of the most common, and we'll be using it for this assignment.

In [2]:
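res = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
res.raise_for_status() # stop early if the page could not be fetched
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0] # the first table on the page holds the postal codes
df = pd.read_html(str(table)) # read_html returns a list of dataframes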

The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood. 
The cells with a borough that is Not assigned will be ignored.
If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
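
The two cleaning rules above can be tried on a few hand-written rows before running them on the scraped table. This is a minimal sketch: the frame `toy` and its three rows are illustrative stand-ins, not the scraped data.

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative values only).
toy = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M7A"],
        "Borough": ["Not assigned", "North York", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
    }
)

# Rule 1: ignore rows whose borough is "Not assigned".
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 2: where the neighborhood is "Not assigned", reuse the borough name.
missing = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

The order matters: filtering on Borough first guarantees that every remaining row has a real borough name to copy into Neighborhood.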

Providing the column headers for the dataframe

In [115]:
df1=df[0][1:].copy() # first dataframe in the list, skipping row 0
df1.columns=['PostalCode','Borough','Neighborhood'] # giving the column headers to the df1 dataframe
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


The dataframe above contains rows whose Borough is 'Not assigned'; these rows have to be removed.

Removing the rows with 'Not assigned' Borough

In [4]:
df2=df1[df1['Borough']!='Not assigned'].reset_index(drop=True) # Removing the rows with 'Not assigned' Borough
df2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


The dataframe df2 lists some postal codes more than once, each time with a different neighborhood. Rows that share a postal code will be grouped into a single row, with the neighborhoods separated by commas.

In [113]:
df3=df2.groupby(['PostalCode','Borough'], as_index=False).agg(lambda x: ','.join(x)) #grouping the rows for same postal code
df3.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
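
To see the grouping step in isolation, here is a small sketch on made-up rows. It selects the Neighborhood column before aggregating, which is an alternative spelling of the lambda used above; the frame `toy` is an assumption for illustration.

```python
import pandas as pd

# Two rows share postal code M5A; one row has its own code.
toy = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M6A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
    }
)

# Group on the key columns and join the neighborhood names of each group.
grouped = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(",".join)
)

print(grouped)
```

Selecting the column explicitly makes it clear that only Neighborhood is being joined, and it avoids applying the join to any other column that might be added later.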


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

Assigning the borough to the neighborhood where neighborhood is 'Not assigned'

In [64]:
not_assigned_neighborhood = df3.Neighborhood == 'Not assigned'
df3.loc[not_assigned_neighborhood, 'Neighborhood'] = df3.loc[not_assigned_neighborhood, 'Borough']#Assigning the borough to the neighborhood where neighborhood is 'Not assigned'

In [65]:
df3[not_assigned_neighborhood]

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park


In [66]:
df3.shape # checking the dimension of df3 dataframe

(103, 3)
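
Besides checking the shape, it can be useful to confirm that the cleaning rules actually hold. The helper below is a hypothetical sketch (the name `find_problems` is not part of the notebook) that lists any rule a frame violates; it is shown on toy rows rather than on df3.

```python
import pandas as pd


def find_problems(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems; empty if the frame looks clean."""
    problems = []
    if list(frame.columns) != ["PostalCode", "Borough", "Neighborhood"]:
        problems.append("unexpected columns")
        return problems
    if not frame["PostalCode"].is_unique:
        problems.append("duplicate postal codes")
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned borough")
    if (frame["Neighborhood"] == "Not assigned").any():
        problems.append("unassigned neighborhood")
    return problems


# Toy rows for illustration only.
toy = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M7A"],
        "Borough": ["North York", "Queen's Park"],
        "Neighborhood": ["Parkwoods", "Queen's Park"],
    }
)

print(find_problems(toy))
```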

In [67]:
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

We have now built a dataframe holding the postal code of each neighborhood along with its borough name and neighborhood name. To use the Foursquare location data, we also need the latitude and longitude coordinates of each neighborhood.
In this module we will use the Geocoder Python package. The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code: a call for the coordinates of a postal code can return None, and the same call made again can return the coordinates. To make sure that you get the coordinates for all of the neighborhoods, you can run a while loop for each postal code. To avoid complications, we will move forward with the provided coordinate file instead.
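
The retry idea described above can be sketched as follows. This is not the notebook's method, which uses the coordinate file: it is a bounded variant of the while loop, and `geocode_with_retry`, its parameters, and the stand-in `flaky_lookup` are all hypothetical names. The real geocoding call would be passed in as `lookup`; the address format is also an assumption.

```python
import time
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]


def geocode_with_retry(
    lookup: Callable[[str], Optional[Coordinates]],
    postal_code: str,
    max_attempts: int = 5,
    delay_seconds: float = 1.0,
) -> Optional[Coordinates]:
    """Call lookup until it returns coordinates or max_attempts is reached."""
    for attempt in range(max_attempts):
        # Assumed address format; adjust to whatever the geocoder expects.
        result = lookup(f"{postal_code}, Toronto, Ontario")
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # pause before asking again
    return None  # give up instead of looping forever


# Stand-in for a real geocoder: returns None twice, then succeeds.
calls = {"count": 0}


def flaky_lookup(address: str) -> Optional[Coordinates]:
    calls["count"] += 1
    return None if calls["count"] < 3 else (43.806686, -79.194353)


coords = geocode_with_retry(flaky_lookup, "M1B", delay_seconds=0.0)
print(coords)
```

Capping the number of attempts keeps a postal code that never resolves from blocking the whole run; the caller can then decide how to handle a None result.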

Importing the file containing Latitude and Longitude

In [94]:
df_geo_cord=pd.read_csv('G:/Geospatial_Coordinates.csv') # Importing the coordinate file containing Latitude and Longitude
df_geo_cord.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [114]:
df4=df3.merge(df_geo_cord, on='PostalCode', how='left') # Merging the two dataframes on PostalCode rather than on row position
df5=df4[['PostalCode','Borough','Neighborhood','Latitude','Longitude']]
df5.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848
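
When combining the neighborhood table with the coordinates, matching on PostalCode is safer than relying on both tables being in the same row order. The sketch below uses toy frames (assumed names `neighborhoods` and `coords`, with the coordinate rows deliberately shuffled) and asks pandas to verify that the match is one-to-one and to flag any postal code without coordinates.

```python
import pandas as pd

# Toy frames for illustration; the coordinate rows are in a different order.
neighborhoods = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighborhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
    }
)
coords = pd.DataFrame(
    {
        "PostalCode": ["M1C", "M1B"],
        "Latitude": [43.784535, 43.806686],
        "Longitude": [-79.160497, -79.194353],
    }
)

# validate raises if a postal code repeats on either side;
# indicator adds a _merge column showing where each row came from.
merged = neighborhoods.merge(
    coords, on="PostalCode", how="left", validate="one_to_one", indicator=True
)

# Rows that found no coordinates would be marked "left_only".
unmatched = merged[merged["_merge"] != "both"]
merged = merged.drop(columns="_merge")

print(merged)
print("unmatched rows:", len(unmatched))
```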
