<h1 align=center><font size = 5>Neighborhoods & Geospatial data in Toronto</font></h1>

## Introduction

In this notebook, we demonstrate how to read data from an HTML table and extract the boroughs and neighborhoods of Toronto, Canada. We then combine them with the latitude and longitude of each postal code area.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#item1">Download and explore the dataset</a>
    
2.  <a href="#item2">Combine the neighbourhoods with geospatial data of each Postal Code area</a>

</font>
</div>


Before we get the data and start exploring it, let's install and import the dependencies that we will need.


In [72]:
!pip install lxml
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

print('Necessary Libraries imported...')

Necessary Libraries imported...


<a id='item1'></a>


## 1. Download and explore the dataset

Toronto has a total of 10 boroughs and 99 neighborhoods. To segment the neighborhoods and explore them, we need a dataset that contains the postal codes, the boroughs, and the neighborhoods in each borough. Luckily, this dataset is freely available on the web. Feel free to try to find it on your own, but here is the link:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
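
As a small offline illustration of how `pd.read_html` behaves, the sketch below parses a hand-written HTML snippet instead of the live page. The snippet and its two rows are made up for the example, and the names `html`, `tables`, and `sample` are not part of the notebook. The call returns a list with one DataFrame per `<table>` element, which is why an index such as `[0]` is needed to pick a table. It assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Hand-written stand-in for a web page with a single table (illustrative rows only)
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> element it finds
tables = pd.read_html(StringIO(html))

# Pick the first (and here only) table
sample = tables[0]
print(len(tables))
print(sample)
```

On a real page the list may hold several tables, so it is worth checking `len(tables)` and the column names before relying on a fixed index.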

In [73]:
#read the first table from the Wikipedia page
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

#removing rows where the Borough is not assigned
na_row = df[df["Borough"]=='Not assigned'].index
df = df.drop(na_row)

#Replace the Not Assigned entries in Neighbourhood with the Borough
df['Neighbourhood']=df['Neighbourhood'].mask(df['Neighbourhood'].eq('Not assigned'),df['Borough'])

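#combine Neighbourhoods that share a Postal Code into one comma-separated entry;
#transform + drop_duplicates writes the result back to df and keeps the original row labels
df['Neighbourhood'] = df.groupby(['Postal Code','Borough'])['Neighbourhood'].transform(', '.join)
df = df.drop_duplicates(subset=['Postal Code','Borough'])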


#Show the dataframe after preprocessing
df

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 165 | M4Y | Downtown Toronto | Church and Wellesley |
| 168 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 169 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
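
The three cleaning steps (drop unassigned boroughs, fill missing neighbourhood names, combine rows per postal code) can be tried on a tiny hand-made frame. This is a minimal sketch: the rows in `toy` are invented so that each step has something to do, and `toy` and `combined` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy rows (made up for illustration): M5A is split across two rows,
# M7A has no neighbourhood name, and M1A has no borough at all
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})

# 1. Keep only rows with an assigned borough
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. Fall back to the borough name where the neighbourhood is missing
toy['Neighbourhood'] = toy['Neighbourhood'].mask(
    toy['Neighbourhood'].eq('Not assigned'), toy['Borough']
)

# 3. One row per postal code; groupby returns a new frame, so the result is kept
combined = (
    toy.groupby(['Postal Code', 'Borough'], as_index=False)
       .agg({'Neighbourhood': ', '.join})
)
print(combined)
```

Because `groupby(...).agg(...)` does not modify the frame in place, the combined rows only exist in whatever variable the result is assigned to.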


In [74]:
#Show the size of the dataframe

print(f"Toronto has {len(df['Borough'].unique())} Boroughs and consists of {len(df['Neighbourhood'].unique())} Neighbourhoods!")
df.shape

Toronto has 10 Boroughs and consists of 99 Neighbourhoods!


(103, 3)
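
Note that `unique()` on the `Neighbourhood` column counts distinct cell values, and a single cell can hold several comma-separated names. The sketch below shows one possible way to count individual names instead. It uses three rows copied from the table above in a stand-alone frame; `toy`, `n_cells`, and `n_names` are illustrative names, and splitting on `', '` assumes no neighbourhood name contains that separator itself.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M6A', 'M3A'],
    'Neighbourhood': ['Regent Park, Harbourfront',
                      'Lawrence Manor, Lawrence Heights',
                      'Parkwoods'],
})

# Distinct cell values: each comma-separated string counts once
n_cells = toy['Neighbourhood'].nunique()

# Distinct individual names: split on the separator, then one name per row
names = toy['Neighbourhood'].str.split(', ').explode().str.strip()
n_names = names.nunique()

print(n_cells, n_names)
```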

<a id='item2'></a>

## 2. Combine the neighbourhoods with geospatial data of each Postal Code area

In [75]:
# define the dataframe columns
column_names = ['Postal Code','Borough', 'Neighbourhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighbourhoods = pd.DataFrame(columns=column_names)
neighbourhoods

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|


In [76]:
neighbourhood_data=pd.read_csv('Geospatial_Coordinates.csv')
neighbourhood_data


| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| ... | ... | ... | ... |
| 98 | M9N | 43.706876 | -79.518188 |
| 99 | M9P | 43.696319 | -79.532242 |
| 100 | M9R | 43.688905 | -79.554724 |
| 101 | M9V | 43.739416 | -79.588437 |


In [77]:
#Sort the coordinate values by Postal Code area
sorted_df = neighbourhood_data.sort_values(by='Postal Code')
sorted_df

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| ... | ... | ... | ... |
| 98 | M9N | 43.706876 | -79.518188 |
| 99 | M9P | 43.696319 | -79.532242 |
| 100 | M9R | 43.688905 | -79.554724 |
| 101 | M9V | 43.739416 | -79.588437 |


In [78]:
#Sort the dataframe values by Postal Code area
sorta_df = df.sort_values(by='Postal Code')
sorta_df


In [79]:
#Merge the dataframes on the common sorted column which is the Postal Code
neighbourhoods=pd.merge(sorta_df,sorted_df,on='Postal Code')
neighbourhoods

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| ... | ... | ... | ... | ... | ... |
| 98 | M9N | York | Weston | 43.706876 | -79.518188 |
| 99 | M9P | Etobicoke | Westmount | 43.696319 | -79.532242 |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... | 43.688905 | -79.554724 |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... | 43.739416 | -79.588437 |
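
`pd.merge` matches rows by key value, so the inputs do not have to be sorted first. When joining on `Postal Code` it can also be useful to check that every code found its coordinates. The sketch below is one possible way to do that with the `validate` and `indicator` parameters of `pd.merge`. The frames `left` and `right` are small stand-ins built for the example, and `M9Z` with borough `Hypothetical` is a made-up row that deliberately has no coordinates.

```python
import pandas as pd

# Stand-in for the cleaned borough table (deliberately unsorted)
left = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B', 'M9Z'],  # M9Z is a made-up code for illustration
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})

# Stand-in for the coordinates table
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# how='left' keeps every borough row, validate raises if a postal code repeats
# on either side, and indicator adds a '_merge' column describing each match
merged = pd.merge(left, right, on='Postal Code', how='left',
                  validate='one_to_one', indicator=True)

# Rows that found no coordinates
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
print(unmatched['Postal Code'].tolist())
```

With the default inner join, a row like `M9Z` would be dropped silently; the left join plus indicator makes such gaps visible.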


### Thank you for reviewing this assignment!

This notebook was created by [Panagiotis Karfakis](https://www.linkedin.com/panagiotis-karfakis)


This notebook is part of a course on **Coursera** called _Data Science Capstone_. 