# IBM Data Science Capstone Project
### Toronto Neighbourhood k-means clustering using Foursquare Venues
#### Subhan Satopay - 27 Dec 2020

This notebook is for Week 3 of the IBM Data Science Capstone Project.

---
## Week 3 project tasks
Showcase k-means clustering of Toronto neighbourhoods using venue data from the Foursquare API.

##### Part 1 - Scrape the Toronto Neighbourhood data from the Wikipedia page
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [20]:
# Import libraries
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

import requests # library to handle requests


**Scrape the Wikipedia page and convert it to a pandas dataframe**

In [21]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_page = requests.get(wiki_url) # fetch the wiki page
wiki_html = pd.read_html(wiki_page.content, header = 0)[0] # parse the first HTML table on the page into a dataframe
df_canada = wiki_html[wiki_html.Borough != 'Not assigned'] # keep only rows whose Borough is assigned
df_canada = df_canada.reset_index(drop=True) # renumber the remaining rows from 0
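
The filtering step can be tried offline on a small hand-built frame. This is only a sketch: `sample` and `cleaned` are hypothetical names, and the 'Not assigned' row is a made-up placeholder rather than data copied from the page.

```python
import pandas as pd

# Hypothetical stand-in for the table returned by pd.read_html
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep rows with an assigned Borough, then renumber the rows from 0.
# Reassigning the result leaves the original frame untouched.
cleaned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(cleaned)
```

Because the result is assigned to a new name, `sample` still holds all three rows and can be inspected again if the filter needs adjusting.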

**Initial data exploration**

In [22]:
df_canada.head(10)

,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [23]:
# check if there is any Borough 'Not assigned' and Neighbourhood 'Not assigned'
print('Borough with Not assigned values: {}'.format(len(df_canada[df_canada['Borough'] == 'Not assigned'])))
print('Neighbourhood with Not assigned values: {}'.format(len(df_canada[df_canada['Neighbourhood'] == 'Not assigned'])))

Borough with Not assigned values: 0
Neighbourhood with Not assigned values: 0


**Group by Postal Code to check how neighbourhoods are listed for each postal code**

In [24]:
df_canada.groupby(['Postal Code']).first()

Postal Code,Borough,Neighbourhood
M1B,Scarborough,"Malvern, Rouge"
M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
M1E,Scarborough,"Guildwood, Morningside, West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
...,...,...
M9N,York,Weston
M9P,Etobicoke,Westmount
M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


**The Wikipedia page already lists the neighbourhoods merged per postal code, and no Neighbourhood is left as 'Not assigned'**

In [25]:
#shape and unique postal codes
print('Dataframe Shape: {} and Total unique Postal codes: {}'.format(df_canada.shape, len(df_canada['Postal Code'].unique())))

Dataframe Shape: (103, 3) and Total unique Postal codes: 103
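
If the source table had instead listed one row per neighbourhood, rows sharing a postal code could be combined before the uniqueness check. A minimal sketch, assuming a hypothetical `raw` frame whose rows are split up for illustration:

```python
import pandas as pd

# Hypothetical un-merged input: one row per neighbourhood
raw = pd.DataFrame({
    'Postal Code': ['M1B', 'M1B', 'M3A'],
    'Borough': ['Scarborough', 'Scarborough', 'North York'],
    'Neighbourhood': ['Malvern', 'Rouge', 'Parkwoods'],
})

# Combine neighbourhoods that share a postal code into one comma-separated row
merged = (
    raw.groupby(['Postal Code', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)

# After merging, each postal code should appear exactly once
print(merged)
print('Postal codes unique:', merged['Postal Code'].is_unique)
```

This step is not needed for the current page, since the shape and the unique count above already agree at 103.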


### End of Part 1

---
### Part 2

#### In this part we will merge the neighbourhood dataframe with the Latitude and Longitude of each Neighbourhood
Because the Geocoder package is not reliable, we will use the CSV data from http://cocl.us/Geospatial_data instead.

In [26]:
# read the csv and prepare dataframe with geo details
geo_url = 'http://cocl.us/Geospatial_data'
df_geo = pd.read_csv(geo_url)
df_geo.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


**Initial exploration of both dataframes**

In [28]:
print('-----df_canada-----')
print(df_canada.dtypes)
print('-----df_geo-----')
print(df_geo.dtypes)

-----df_canada-----
Postal Code      object
Borough          object
Neighbourhood    object
dtype: object
-----df_geo-----
Postal Code     object
Latitude       float64
Longitude      float64
dtype: object


In [29]:
print('Shape of df_canada: {} '.format(df_canada.shape))
print('Shape of df_geo: {} '.format(df_geo.shape))

Shape of df_canada: (103, 3) 
Shape of df_geo: (103, 3) 
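
Equal shapes do not by themselves show that the two frames hold the same postal codes. One possible check, sketched here on small hypothetical frames `left` and `right` (deliberately mismatched for the example), is to compare the key sets directly:

```python
import pandas as pd

# Hypothetical stand-ins for df_canada and df_geo
left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M5A']})
right = pd.DataFrame({
    'Postal Code': ['M4A', 'M5A', 'M1B'],
    'Latitude': [43.725882, 43.654260, 43.806686],
    'Longitude': [-79.315572, -79.360636, -79.194353],
})

left_keys = set(left['Postal Code'])
right_keys = set(right['Postal Code'])

# Codes present in only one of the two frames
only_left = sorted(left_keys - right_keys)
only_right = sorted(right_keys - left_keys)
print('Only in left:', only_left)
print('Only in right:', only_right)
```

Two empty lists would indicate that every postal code has a counterpart in the other frame.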


**Next we will merge the two dataframes and do some cleanup**

In [31]:
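# left-join the coordinates onto the neighbourhood rows, matching on Postal Code
df = df_canada.join(df_geo.set_index('Postal Code'), on='Postal Code')
df.reset_index(drop=True, inplace=True)
df.index.name = 'Index' # label the index column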
df

Index,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
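
The same kind of join can also be written with `DataFrame.merge`, whose `validate` argument raises an error if a postal code is repeated on either side. A minimal sketch on hypothetical two-row frames (`neigh`, `geo` and `combined` are placeholder names, not the notebook's):

```python
import pandas as pd

neigh = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})
geo = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A'],
    'Latitude': [43.725882, 43.753259],
    'Longitude': [-79.315572, -79.329656],
})

# Left join keeps every neighbourhood row; validate checks key uniqueness
combined = neigh.merge(geo, on='Postal Code', how='left', validate='one_to_one')

# Rows with no matching coordinates would show up as NaN in Latitude
missing = combined['Latitude'].isna().sum()
print(combined)
print('Rows without coordinates:', missing)
```

A left join preserves the row order of the neighbourhood frame, so the result lines up with the table scraped in Part 1.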


### End of Part 2
---