# Segment and Cluster Neighbourhoods in Toronto

### Introduction
#### This notebook is part 1 of 3, which covers phase 1 planned activities for Segment and Cluster Neighbourhoods within Toronto.
#### The overall project consists of assignment deliverables in 3 parts:
1. ***Segment and Cluster Neighbourhoods in Toronto.*** [This notebook]
2. Use Geocoder Python package to derive latitude and longitude coordinates of each neighborhood to expand dataframe detail.
3. Explore and cluster the neighborhoods of Toronto. Decisions and observations are shared and maps to visualize neighbourhoods and their clustering are provided.



### Requirement 1: The dataframe will consist of three columns: Postal Code, Borough, and Neighborhood
#### Step 1: Scrape the Toronto postal code page on Wikipedia with pandas read_html, which reads every HTML table at the URL into a list of dataframe objects.
Here we build our code to scrape https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 

In [45]:
import pandas as pd
data = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
print("Pandas read_html found [", len(data),"] dataframe object tables within wiki html for us to explore.")

Pandas read_html found [ 3 ] dataframe object tables within wiki html for us to explore.


#### Step 2: Let's inspect the first of 3 dataframe object tables that were created


In [46]:
data[0].head(5)

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


####  Observation: 
The cell above shows that our first dataframe object, data[0], has the 3 required column names. Step 3 below confirms that it holds 180 rows and that all three columns are of object data type.

#### Step 3: Let's examine the data types and dimensions of our initial table

In [47]:
print(data[0].dtypes)
print(data[0].shape)
data[0].describe()

Postal Code      object
Borough          object
Neighbourhood    object
dtype: object
(180, 3)


| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| count | 180 | 180 | 180 |
| unique | 180 | 11 | 100 |
| top | M4S | Not assigned | Not assigned |
| freq | 1 | 77 | 77 |


#### Observations: 
1. We see 180 rows, one per Postal Code (all 180 Postal Code values are unique). The Neighbourhood column holds 100 distinct values, one of which is 'Not assigned'. These counts are not directly comparable with the Toronto neighbourhood list at https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto, which reports 140 officially recognized neighbourhoods and upwards of 240 official and unofficial ones: a single Neighbourhood cell in our postal code data can name several neighbourhoods, and some of them may come from the unofficial list. This is not an issue, just an observation for awareness.
2. While the above data[0] dataframe object looks good, pandas also created data[1] and data[2], which we still need to examine.
3. We have a high frequency of 'Not assigned' cases (77 rows), which we will review and process under Requirement 2.
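
The `unique` row of `describe()` is a per-column count, so it helps to read each column separately. As a minimal sketch, `DataFrame.nunique()` gives the same per-column figures directly, and a comparison against the placeholder string counts the 'Not assigned' cells. The `sample` frame below is a hypothetical five-row stand-in for data[0], not the scraped table itself.

```python
import pandas as pd

# Hypothetical stand-in for data[0]; the real table has 180 rows.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Regent Park, Harbourfront"],
})

# Number of distinct values in each column.
unique_counts = sample.nunique()

# Number of cells in each column that hold the placeholder value.
not_assigned_counts = (sample == "Not assigned").sum()

print(unique_counts)
print(not_assigned_counts)
```

Run against the real data[0], the same two calls should reproduce the `unique` row and the 'Not assigned' frequencies reported by `describe()` above.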

#### Step 4: Let's examine the data[1] dataframe object.

In [48]:
print(data[1].shape, "rows by cols")
data[1].head(5)

(4, 18) rows by cols


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,,Canadian postal codes,,,,,,,,,,,,,,,,
1,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,,,,,,,,,,,,,,,
2,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
3,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


#### Observation: 
The above table, data[1], contains Canada-wide province and territory postal code information, which is not in scope for our Toronto city-level model.
#### Step 5: Let's examine data[2], the third and last dataframe object returned by the html reader.

In [49]:
# Review third data[2] object table
print(data[2].shape, "rows by cols")
data[2].head(5)

(2, 18) rows by cols


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
1,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


#### Observation: 
The above table, data[2], also contains Canada-wide province and territory postal code information, which is likewise not in scope for our Toronto city-level model.
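
The manual inspection above could also be expressed in code, so the notebook does not depend on the postal code table staying at position 0. The following is only a sketch: `pick_postal_table` and `REQUIRED_COLUMNS` are hypothetical names introduced for illustration, and the two small frames stand in for the list returned by read_html.

```python
import pandas as pd

# Column names required by Requirement 1 (spelling as used on the wiki page).
REQUIRED_COLUMNS = {"Postal Code", "Borough", "Neighbourhood"}


def pick_postal_table(tables):
    """Return the first dataframe whose columns include all required names."""
    for table in tables:
        # Header-less tables get integer labels (0, 1, ...), so compare as strings.
        if REQUIRED_COLUMNS.issubset({str(col) for col in table.columns}):
            return table
    raise ValueError("No table with the expected postal code columns was found")


# Hypothetical stand-ins for the list of tables returned by read_html.
tables = [
    pd.DataFrame([["NL", "NS"], ["A", "B"]]),  # navigation-style table
    pd.DataFrame({
        "Postal Code": ["M3A"],
        "Borough": ["North York"],
        "Neighbourhood": ["Parkwoods"],
    }),
]

postal = pick_postal_table(tables)
print(postal.shape)
```

Selecting by column names rather than by list position is a design choice that makes the notebook a little more robust if the page layout changes.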
#### Step 6: Create our dataframe using data[0].

In [83]:
# Assign first, so the messages below describe an object that already exists.
df = data[0]
print('\033[1m Requirement 1 completed:\033[0m',u'\N{check mark}')
print("New dataframe, df, created using our transformed data[0] dataframe.")

Requirement 1 completed: ✓
New dataframe, df, created using our transformed data[0] dataframe.


### Requirement 2:  Ensure Borough cells with an assigned borough are processed. Cells where borough = 'Not assigned' are ignored.
#### Step 7: Review the count of 'Not assigned' values in the Borough column


In [84]:
print(df.Borough.value_counts()) 

Not assigned        77
North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East York            5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64


#### Observation: 
77 rows were found (see first line of above printout) where the Borough column has the value 'Not assigned'.
#### Step 8: Remove all rows from our dataframe where the Borough column has the value 'Not assigned', as required.

In [85]:
df = df[df['Borough'] != 'Not assigned']
print(df.Borough.value_counts())

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East Toronto         5
York                 5
East York            5
Mississauga          1
Name: Borough, dtype: int64


#### Observation: Recheck confirms 77 'Not assigned' rows for Borough column have been successfully removed from frame
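
The recheck can also be made explicit with assertions instead of reading the printout. This is a minimal sketch on a hypothetical three-row `sample` frame; the `reset_index(drop=True)` call is an optional extra that the notebook itself does not perform (the notebook keeps the original index labels).

```python
import pandas as pd

# Hypothetical stand-in for df before filtering.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

rows_before = len(sample)
n_unassigned = int((sample["Borough"] == "Not assigned").sum())

# Keep only rows with an assigned borough; renumber the index from 0 (optional).
filtered = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

# The number of rows removed must equal the number of 'Not assigned' boroughs.
assert len(filtered) == rows_before - n_unassigned
assert not (filtered["Borough"] == "Not assigned").any()

print(len(filtered), "rows remain")
```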

In [86]:
print('\033[1m Requirement 2 completed:\033[0m',u'\N{check mark}')

Requirement 2 completed: ✓


### Requirement 3: When two or more neighbourhoods share the same postal code, combine them into a single row with the neighbourhood names comma delimited.

#### Step 9: Requirement 3 applies when two separate rows exist with the same Postal Code. We perform a simple check for Postal Code duplicates as follows:

In [88]:
dup_chk_req3 = df.apply(lambda x: x.duplicated()).sum()
print(dup_chk_req3)
print(df.head(5))

Postal Code       0
Borough          93
Neighbourhood     4
dtype: int64
  Postal Code           Borough                                Neighbourhood
2         M3A        North York                                    Parkwoods
3         M4A        North York                             Victoria Village
4         M5A  Downtown Toronto                    Regent Park, Harbourfront
5         M6A        North York             Lawrence Manor, Lawrence Heights
6         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government


#### Observations: 
1. The Postal Code value of 0 above means no duplicate Postal Codes were found. Therefore, the problem condition of two or more rows sharing the same Postal Code does not exist in our dataframe, and there is nothing to combine.
2. Additionally, a number of rows in the original inbound data (shown above) already list multiple neighbourhoods, comma delimited, under one shared Postal Code. The data therefore already has the combined form that Requirement 3 asks for.
3. The row-per-neighbourhood duplication this requirement anticipates appears to have been resolved in the current version of the wiki table itself. pandas read_html parses each table as published and does not merge rows, so the combination most likely happened at the source.
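
Had duplicate Postal Codes been present, one common way to combine them is a groupby followed by a string join. The sketch below assumes a hypothetical `sample` frame laid out one neighbourhood per row; it is not needed for the current data and is shown only to illustrate the technique.

```python
import pandas as pd

# Hypothetical layout with one neighbourhood per row and a repeated Postal Code.
sample = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group rows that share a Postal Code and Borough, then join the neighbourhood
# names into one comma-delimited string per group. sort=False keeps the groups
# in order of first appearance.
combined = (
    sample.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```

On data with no duplicate Postal Codes, the same call leaves every row unchanged, so it is safe to run even when no combination is needed.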

In [89]:
print('\033[1m Requirement 3 met:\033[0m',u'\N{check mark}')

Requirement 3 met: ✓


### Requirement 4 - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

#### Step 10: List all rows where Neighbourhood = 'Not assigned'. If we find any cases, we populate Neighbourhood value using Borough value.

In [90]:
# Rows that have a borough but whose Neighbourhood is still 'Not assigned'
neighbourhood_not_assigned = df[df['Neighbourhood'] == 'Not assigned']
neighbourhood_not_assigned.head(5)

Unnamed: 0,Postal Code,Borough,Neighbourhood


#### Observation: 
No rows were found where Neighbourhood is 'Not assigned', so no Neighbourhood value needed to be filled from its Borough. The requirement is already satisfied by the data as it stands.
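
For completeness, if such rows had existed, the fill could be done with a boolean mask and `.loc`. This is a sketch on a hypothetical two-row `sample` frame; the first row is invented to trigger the fill and is not taken from the scraped data.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no neighbourhood assigned.
sample = pd.DataFrame({
    "Postal Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Mask of rows whose Neighbourhood still holds the placeholder value.
needs_fill = sample["Neighbourhood"] == "Not assigned"

# Copy the Borough value into Neighbourhood for those rows only.
sample.loc[needs_fill, "Neighbourhood"] = sample.loc[needs_fill, "Borough"]

print(sample)
```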

In [91]:
print('\033[1m Requirement 4 - Met requirement \033[0m',u'\N{check mark}')

Requirement 4 - Met requirement ✓


### Requirement 5 - All required explanations and assumptions made have been noted.
#### Throughout this Notebook, at each step above, the requirement objective, planned activity, thought process, and observations are shared.

In [92]:
print('\033[1m Requirement 5 met:\033[0m',u'\N{check mark}')

Requirement 5 met: ✓


### Requirement 6: Use the .shape method to print the number of rows of your dataframe.

#### Step 11: Read the shape attribute of our final dataframe

In [95]:
print(df.shape, 'rows x cols')

(103, 3) rows x cols


#### Observation: 
103 rows remain after the 77 'Not assigned' rows were removed from the original 180 (180 - 77 = 103), per Requirement 2, Step 8 above.