<h3> Libraries </h3>

In [2]:
# Data Analysis Libraries
import pandas as pd
import numpy as np 

#API libraries
import requests
from bs4 import BeautifulSoup

# Library for flattening json files
# (json_normalize is exposed at the top level of pandas since version 1.0)
from pandas import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#K-means clustering from Sklearn
from sklearn.cluster import KMeans

#Library to construct and visualize maps
import folium

#Time libraries that will be used throughout the code to measure
#the running time of each block of code
import datetime 
import time

<h3> Importing Data </h3> 

<h5> We first need a table that lists the postal code and borough name of every neighborhood in Toronto. Such a table is available <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"> here</a>.
We will use the <a href="http://beautiful-soup-4.readthedocs.io/en/latest/"> BeautifulSoup </a> library to scrape the web page, parsing it with the <em> lxml </em> parser. </h5>



In [3]:
start_time = time.time()
################################


URL='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
web=requests.get(URL).text
soup= BeautifulSoup(web, 'lxml')


################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

Time --- Minutes --- 0.03
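
<h5> Every cell in this notebook repeats the same start/stop timing lines. As a minimal sketch, the pattern could be wrapped in a small context manager. The name <em> timed </em> is a hypothetical helper, not part of the notebook, and <em> time.perf_counter </em> is swapped in for <em> time.time </em> because it is intended for measuring short intervals. </h5>

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label="Time"):
    """Print the wall-clock time spent inside the with-block, in minutes."""
    start = time.perf_counter()
    result = {}
    try:
        # The dictionary lets the caller read the elapsed time afterwards
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print("{} --- Minutes --- {:.2f}".format(label, result["seconds"] / 60))

# Usage: wrap any block of code that should be timed
with timed() as t:
    total = sum(range(1_000_000))
```

<h5> The elapsed time is printed when the block ends, even if the block raises an exception. </h5>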


<h5> We will now search the parsed document for the table, and then separate its cells into three different lists, one for each column of the table on the Wikipedia page. </h5> 

In [10]:
start_time = time.time()
################################

#Search for the table in the webpage
my_table=soup.find('table')

#Convert the table into a list
entries = list(my_table.find_all('td'))

print("The first five elements of the list are: {}".format(entries[:5]))

################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

The first five elements of the list are: [<td>M1A
</td>, <td>Not assigned
</td>, <td>Not assigned
</td>, <td>M2A
</td>, <td>Not assigned
</td>]
Time --- Minutes --- 0.0


<h5> Notice how the items in the list are surrounded by tags, indicating that we need further cleaning. </h5> 

In [44]:
start_time = time.time()
################################

#Convert the elements in the list into strings
entries=[str(i) for i in entries]

#Removing the tags before and after the texts: drop the leading '<td>'
#(4 characters) and the trailing newline plus '</td>' (6 characters)
proper_entries=[k[4:len(k)-6] for k in entries]

#Separating the entries into 3 lists; every row of the table has 3 cells,
#so each column is every third entry
PostalCode=proper_entries[0::3]
Borough=proper_entries[1::3]
Neighborhood=proper_entries[2::3]
    

toronto_dataset=pd.DataFrame({'PostalCode':PostalCode, 'Borough':Borough, 'Neighborhood': Neighborhood})

################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

Time --- Minutes --- 0.0
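
<h5> Slicing a fixed number of characters off each string only works while every cell is written exactly as a td tag followed by a newline. A more tolerant sketch, under the assumption that each data row holds three td cells, walks the table row by row and lets BeautifulSoup strip the tags with <em> get_text(strip=True) </em>. The HTML fragment below is a hypothetical stand-in for the Wikipedia page, and <em> html.parser </em> is used only so the example needs no extra parser. </h5>

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating the layout of the Wikipedia table
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'
sample_soup = BeautifulSoup(html, "html.parser")
sample_table = sample_soup.find("table")

rows = []
for tr in sample_table.find_all("tr"):
    # get_text(strip=True) returns the cell text without tags or surrounding whitespace
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # skips the header row, which has no <td> cells
        rows.append(cells)

print(rows)
```

<h5> Each element of <em> rows </em> is one table row, so the list can be passed to <em> pd.DataFrame </em> together with a list of column names. </h5>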


In [45]:
toronto_dataset.head()

,Borough,Neighborhood,PostalCode
0,Not assigned,Not assigned,M1A
1,Not assigned,Not assigned,M2A
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,"Regent Park, Harbourfront",M5A


<h5> We are only interested in the postal codes that have been assigned to a borough, so we drop the rows whose borough is "Not assigned". </h5> 

In [46]:
start_time = time.time()
################################

#Rearranging columns 
toronto_dataset=toronto_dataset[['PostalCode', 'Borough', 'Neighborhood']]

#Removing entries in which Borough is Not assigned
toronto_dataset=toronto_dataset[toronto_dataset['Borough']!='Not assigned']
#drop=True discards the old index instead of keeping it as a column
toronto_dataset=toronto_dataset.reset_index(drop=True)


################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

Time --- Minutes --- 0.0


In [53]:
toronto_dataset.head()


,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [52]:
toronto_dataset.shape

(103, 3)
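
<h5> The filtering step can be illustrated on a toy table. The sketch below assumes nothing beyond pandas; the four rows are copied from the outputs above, and the names <em> toy </em>, <em> mask </em>, <em> filtered </em> and <em> cleaned </em> are made up for the illustration. It shows why the index needs resetting after the filter. </h5>

```python
import pandas as pd

# Toy data in the same shape as toronto_dataset
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for the rows to keep
mask = toy["Borough"] != "Not assigned"
filtered = toy[mask]

# The original row labels survive the filter, so the index now starts at 2
print(filtered.index.tolist())

# drop=True discards the old labels instead of adding them back as an 'index' column
cleaned = filtered.reset_index(drop=True)
print(cleaned.index.tolist())
```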

<h5> We will download a dataset that gives the latitude/longitude coordinates of each postal code in Toronto, and then merge it with the original dataset on the postal code. </h5> 

In [48]:
start_time = time.time()
################################


#Downloading coordinates data
coord=pd.read_csv('http://cocl.us/Geospatial_data')
print(coord.head())




################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

  Postal Code   Latitude  Longitude
0         M1B  43.806686 -79.194353
1         M1C  43.784535 -79.160497
2         M1E  43.763573 -79.188711
3         M1G  43.770992 -79.216917
4         M1H  43.773136 -79.239476
Time --- Minutes --- 0.11


In [50]:
start_time = time.time()
################################


#changing the name of the Postal Code column in coord to PostalCode
coord.columns=['PostalCode', 'Latitude', 'Longitude']


#merging coordinates data with toronto dataset
tor_df=pd.merge(toronto_dataset,coord,on='PostalCode')


################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

Time --- Minutes --- 0.0
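
<h5> By default <em> pd.merge </em> performs an inner join, so a postal code that is missing from the coordinates file would be dropped without any warning. As a hedged sketch on toy data, the <em> how </em>, <em> indicator </em> and <em> validate </em> parameters of <em> pd.merge </em> can be used to check for that. The postal code M9Z and the borough "Hypothetical" are invented for the example. </h5>

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],   # 'M9Z' is a made-up code with no coordinates
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how='left' keeps every row of the left table; indicator=True adds a '_merge' column.
# validate='one_to_one' raises an error if a postal code is repeated on either side.
checked = pd.merge(left, right, on="PostalCode", how="left",
                   indicator=True, validate="one_to_one")

# Postal codes that found no match in the coordinates table
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)

# The default inner join keeps only the matched rows
inner = pd.merge(left, right, on="PostalCode")
print(len(inner))
```

<h5> An empty <em> unmatched </em> list means the inner join loses no rows. </h5>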


In [51]:
tor_df.head()

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
