# Capstone Project: Toronto Neighbourhood Data Analysis

Goal: Analyse the Toronto neighbourhood data by applying segmentation and clustering, become familiar with the location data provider Foursquare, gain experience using RESTful APIs to retrieve data, and use the Folium library to generate maps of geospatial data.

In [4]:
import numpy as np
import pandas as pd
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm

import matplotlib.colors as colors

In [3]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Retrieve data from Wikipedia

Read data from Wikipedia:

In [41]:
import requests
neighbour_url = requests.get('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=942851379')


Scrape the Wikipedia page using BeautifulSoup:

In [42]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(neighbour_url.text,'lxml')


In [43]:
# the first table on the page holds the postal code list
neighbour_table = soup.find_all('table')[0]

Create a data frame from the HTML table:

In [50]:
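from io import StringIO

# pd.read_html returns a list of DataFrames; keep the first one.
# Wrapping the markup in StringIO avoids the deprecation warning that
# recent pandas versions raise for literal HTML strings.
df = pd.read_html(StringIO(str(neighbour_table)))[0]
df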

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| ... | ... | ... | ... |
| 282 | M8Z | Etobicoke | Mimico NW |
| 283 | M8Z | Etobicoke | The Queensway West |
| 284 | M8Z | Etobicoke | Royal York South West |
| 285 | M8Z | Etobicoke | South of Bloor |


## Data Wrangling

Drop the rows where Borough is not assigned:

In [51]:
# treat 'Not assigned' as missing, then drop rows without a borough
df = df.replace('Not assigned', np.nan)
df = df.dropna(subset=['Borough'])

# reset the index because rows were dropped
df = df.reset_index(drop=True)
df

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
| ... | ... | ... | ... |
| 205 | M8Z | Etobicoke | Kingsway Park South West |
| 206 | M8Z | Etobicoke | Mimico NW |
| 207 | M8Z | Etobicoke | The Queensway West |
| 208 | M8Z | Etobicoke | Royal York South West |
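
To make the effect of the wrangling step easier to see, here is a minimal sketch on a toy frame. The three rows are made up for illustration (in particular, the `M5A` row with a missing neighbourhood is hypothetical). It shows that only rows without a borough are dropped, while a row with a borough but no neighbourhood is kept with `NaN` in that column.

```python
import numpy as np
import pandas as pd

# Toy data in the same shape as the scraped table (values are illustrative only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Same three steps as in the notebook, chained without inplace=True.
cleaned = (
    toy.replace("Not assigned", np.nan)   # mark placeholders as missing
       .dropna(subset=["Borough"])        # drop rows that have no borough
       .reset_index(drop=True)            # renumber the remaining rows
)
print(cleaned)
```

If a neighbourhood were missing in the real data, it could then be filled in separately, for example from the borough name.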


Check for Missing Values:

In [52]:
# count missing (True) and present (False) values in each column
missing_data = df.isnull()
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")

Postcode
False    210
Name: Postcode, dtype: int64

Borough
False    210
Name: Borough, dtype: int64

Neighbourhood
False    210
Name: Neighbourhood, dtype: int64
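
The same check can be written more compactly. The sketch below, on a made-up toy frame, uses `isnull().sum()` to get one missing-value count per column instead of looping over the columns.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing borough (illustrative values only).
toy = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", np.nan, "Downtown Toronto"],
})

# isnull() gives a boolean frame; summing counts the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

A column with a count of zero has no missing values, which matches the all-`False` result printed above for the real data.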



Size of the new dataset:

In [58]:
df.shape

(210, 3)

### Adding geographical information

Insert columns for latitude and longitude:

In [65]:
import pgeocode

# retrieve the latitude/longitude from a postal code in Canada 'ca'
nomi_ca = pgeocode.Nominatim('ca')

latitude = []
longitude = []

for index, row in df.iterrows():
    # look up by column name; positional access such as row[0] is deprecated in recent pandas
    location = nomi_ca.query_postal_code(row['Postcode'])
    latitude.append(location.latitude)
    longitude.append(location.longitude)

# store the results of the loop in new columns 'Latitude' and 'Longitude'
df['Latitude'] = latitude
df['Longitude'] = longitude


# the lookup does not work for the Canada Post Gateway Processing Centre, so its coordinates are set manually
df.loc[df['Neighbourhood'] == "Canada Post Gateway Processing Centre", ['Latitude', 'Longitude']] = [43.636966, -79.615819]


In [66]:
df

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.7545 | -79.3300 |
| 1 | M4A | North York | Victoria Village | 43.7276 | -79.3148 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.6555 | -79.3626 |
| 3 | M6A | North York | Lawrence Heights | 43.7223 | -79.4504 |
| 4 | M6A | North York | Lawrence Manor | 43.7223 | -79.4504 |
| ... | ... | ... | ... | ... | ... |
| 205 | M8Z | Etobicoke | Kingsway Park South West | 43.6256 | -79.5231 |
| 206 | M8Z | Etobicoke | Mimico NW | 43.6256 | -79.5231 |
| 207 | M8Z | Etobicoke | The Queensway West | 43.6256 | -79.5231 |
| 208 | M8Z | Etobicoke | Royal York South West | 43.6256 | -79.5231 |
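
Because several neighbourhoods share one postal code, the row-by-row loop repeats the same lookup many times. The sketch below shows one possible alternative: query each distinct postal code once and map the result back onto every row. The `lookup` function and `FAKE_COORDS` dictionary are hypothetical stand-ins for the geocoder so that the example runs offline; the coordinates are copied from the table above.

```python
import pandas as pd

# Toy rows in the same shape as the notebook's data frame.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# Hypothetical stand-in for a geocoder call (not part of pgeocode).
FAKE_COORDS = {"M6A": (43.7223, -79.4504), "M3A": (43.7545, -79.3300)}
calls = []  # records every lookup so the saving is visible

def lookup(postcode):
    calls.append(postcode)
    return FAKE_COORDS[postcode]

# One lookup per distinct postal code, then map the results onto all rows.
coords = {code: lookup(code) for code in toy["Postcode"].unique()}
toy["Latitude"] = toy["Postcode"].map(lambda code: coords[code][0])
toy["Longitude"] = toy["Postcode"].map(lambda code: coords[code][1])

print(toy)
print("lookups performed:", len(calls))
```

In the notebook itself, `lookup` would be replaced by the real geocoder query; the de-duplication pattern stays the same.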
