# Segmenting and Clustering Neighborhoods in Toronto
## Week 3 Peer-graded Assignment

## Table of Contents
- [Part 1](#part-1)
- [Part 2](#part-2)
- [Part 3](#part-3)

## Part 1
### Load the data from the Wikipedia page

In [18]:
import pandas as pd
import numpy as np

In [19]:
# in case lxml is not installed
# !pip install lxml

In [20]:
# load first table into dataframe
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [21]:
df.shape

(180, 3)

### Preprocess

In [22]:
df['Borough'].value_counts()

Not assigned        77
North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
York                 5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64

In [23]:
df['Neighbourhood'].value_counts()

Not assigned                                     77
Downsview                                         4
Don Mills                                         2
Humberlea, Emery                                  1
Agincourt                                         1
                                                 ..
Parkview Hill, Woodbine Gardens                   1
Thorncliffe Park                                  1
North Toronto West, Lawrence Park                 1
East Toronto, Broadview North (Old East York)     1
Parkdale, Roncesvalles                            1
Name: Neighbourhood, Length: 100, dtype: int64

Both the 'Borough' and 'Neighbourhood' columns contain 77 'Not assigned' values.

In [24]:
# replace 'Not assigned' with NaN (np.nan, since the np.NaN alias was removed in NumPy 2.0)
df = df.replace(to_replace='Not assigned', value=np.nan)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [25]:
df.isna().value_counts()

Postal Code  Borough  Neighbourhood
False        False    False            103
             True     True              77
dtype: int64

In [26]:
# drop records with NaN in 'Borough'
df = df.dropna(subset=['Borough'])
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [27]:
df.isna().value_counts()

Postal Code  Borough  Neighbourhood
False        False    False            103
dtype: int64

Now there are no NaN values left in either 'Borough' or 'Neighbourhood'.
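
The cleaning steps above can be condensed into one chain. This is a minimal sketch on a toy dataframe, not the scraped table; the three rows and the names `toy`, `cleaned`, and `missing_per_column` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped table (illustrative rows only)
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Mark the placeholder as missing, drop rows without a borough,
# and renumber the index from zero
cleaned = (
    toy.replace('Not assigned', np.nan)
       .dropna(subset=['Borough'])
       .reset_index(drop=True)
)

# Count the missing values that remain in each column
missing_per_column = cleaned.isna().sum()
```

Using `isna().sum()` gives one count per column, which is a little easier to read than `isna().value_counts()` when only the totals matter.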

In [28]:
# check for duplicated postal codes
df['Postal Code'].value_counts().sort_values(ascending=False)

M4H    1
M1V    1
M4X    1
M6H    1
M9R    1
      ..
M4C    1
M1T    1
M1B    1
M7R    1
M1X    1
Name: Postal Code, Length: 103, dtype: int64

No duplicated postal codes found.
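
The check above relies on reading the counts by eye. A more direct test is possible with `Series.duplicated`; the sketch below uses a small made-up series (`codes`) with one deliberate repeat, so the names and values are illustrative only.

```python
import pandas as pd

# Illustrative postal codes with one deliberate repeat
codes = pd.Series(['M3A', 'M4A', 'M5A', 'M4A'], name='Postal Code')

# True if any code appears more than once
has_duplicates = codes.duplicated().any()

# List each repeated code once (keep=False flags every occurrence)
repeated = codes[codes.duplicated(keep=False)].unique().tolist()
```

On the real dataframe, `df['Postal Code'].is_unique` would give the same yes/no answer in one expression.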

In [29]:
df = df.reset_index(drop=True)
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [30]:
df.shape

(103, 3)

got the dataframe of 103 rows with 'Postal Code', 'Borough', 'Neighbourhood'

## Part 2

### Load geo data via geocoder
I tried running the geocoder lookup below, but it never finished, so I skipped it and merged the coordinates from a .csv file instead.

In [31]:
# install geocoder
# !pip install geocoder

In [32]:
# import geocoder

# # initialize your variable to None
# lat_lng_coords = None
# postal_code = 'M3A'
# # loop until you get the coordinates
# while(lat_lng_coords is None):
#   g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#   lat_lng_coords = g.latlng

# latitude = lat_lng_coords[0]
# longitude = lat_lng_coords[1]
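
The loop above has no exit condition, so it runs forever if the service never returns coordinates. One possible way around that is to cap the number of attempts. The helper below is a sketch under assumptions: `latlng_with_retries`, `lookup`, `max_attempts`, and `delay_seconds` are hypothetical names, and `lookup` is any function that returns a `[lat, lng]` pair or `None`.

```python
import time


def latlng_with_retries(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    Hypothetical helper: `lookup` is any callable returning [lat, lng] or None.
    Returns the coordinates, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        # Pause between attempts, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None


# With the geocoder package from the cell above, the callable could be:
# lookup = lambda q: geocoder.google(q).latlng
# coords = latlng_with_retries(lookup, 'M3A, Toronto, Ontario')
```

If the helper returns `None`, the caller can fall back to the .csv file instead of waiting indefinitely.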

### Merge lat, lng data from Geospatial_Coordinates.csv

In [33]:
df_geo = pd.read_csv('Geospatial_Coordinates.csv')
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [38]:
df_full = pd.merge(df, df_geo, on='Postal Code', how='left')
df_full.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [39]:
df_full.shape

(103, 5)

The result is a dataframe of 103 rows with the additional Latitude and Longitude columns.
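
A left merge keeps every row even when no coordinates are found, so the row count alone does not prove that each postal code was matched. The sketch below shows one way to check this with the `validate` and `indicator` options of `pd.merge`. The frames `left` and `right` are toy data, and 'M9Z' is a made-up code included only to produce an unmatched row.

```python
import pandas as pd

# Toy frames; 'M9Z' is a made-up code with no coordinates on the right side
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate raises if a postal code repeats on either side;
# indicator adds a '_merge' column showing where each row came from
merged = pd.merge(
    left, right,
    on='Postal Code', how='left',
    validate='one_to_one', indicator=True,
)

# Postal codes that received no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
```

An empty `unmatched` list on the real data would confirm that every one of the 103 postal codes found its latitude and longitude.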

## Part 3