# Segmenting and Clustering Neighborhoods in Toronto
### Week 3. Graded Lab

In this assignment, you will explore, segment, and cluster the neighborhoods of Toronto. Unlike New York, however, Toronto's neighborhood data is not readily available online. Each data science project is challenging in its own way, so you need to be agile and learn new libraries and tools quickly as each project demands.

For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is structured and ready to work with.

                                                                                         Student: Norma López-Sancho

In [7]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!pip install geopy # using this to install geopy instead of !conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0
import folium # map rendering library

print('You are good to go')

You are good to go


In [9]:
# Installing the BeautifulSoup functionality for web scraping in case it is needed
!pip install lxml html5lib beautifulsoup4

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/64/28/0b761b64ecbd63d272ed0e7a6ae6e4402fc37886b59181bfdf274424d693/lxml-4.6.1-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Collecting html5lib
  Downloading https://files.pythonhosted.org/packages/6c/dd/a834df6482147d48e225a49515aabc28974ad5a4ca3215c18a882565b028/html5lib-1.1-py2.py3-none-any.whl (112kB)
Installing collected packages: lxml, html5lib
Successfully installed html5lib-1.1 lxml-4.6.1


In [97]:
# Reading URL through pandas
tnt = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [98]:
# Checking how many tables are within the specified URL
print(len(tnt))

3


In [99]:
# Checking that the table I want is the first one on the web page
print(tnt[0])

               0                 1  \
0    Postal Code           Borough   
1            M1A      Not assigned   
2            M2A      Not assigned   
3            M3A        North York   
4            M4A        North York   
5            M5A  Downtown Toronto   
6            M6A        North York   
7            M7A  Downtown Toronto   
8            M8A      Not assigned   
9            M9A         Etobicoke   
10           M1B       Scarborough   
11           M2B      Not assigned   
12           M3B        North York   
13           M4B         East York   
14           M5B  Downtown Toronto   
15           M6B        North York   
16           M7B      Not assigned   
17           M8B      Not assigned   
18           M9B         Etobicoke   
19           M1C       Scarborough   
20           M2C      Not assigned   
21           M3C        North York   
22           M4C         East York   
23           M5C  Downtown Toronto   
24           M6C              York   
25          

In [89]:
# Having confirmed that the first table, tnt[0], is the one I want, copy it into a new dataframe
tnt_df = pd.DataFrame(data=tnt[0])
tnt_df.head()

Unnamed: 0,0,1,2
0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [90]:
# Using the first row as the column names of my dataframe, since that is where they are stored
tnt_df.columns = tnt_df.iloc[0]
tnt_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [91]:
# And now dropping row 0, as it only contains the column names
tnt_df.drop([0], inplace = True)
tnt_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
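
The two steps above (promoting row 0 to the header, then dropping it) can also be done in one pass. The following is a minimal sketch on a hypothetical three-row stand-in for the scraped table, not the real Wikipedia data; the names `raw` and `promoted` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: the real header sits in row 0.
raw = pd.DataFrame([
    ["Postal Code", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

# Keep every row after the first and renumber the index from 0 ...
promoted = raw.iloc[1:].reset_index(drop=True)
# ... then use the values of the original row 0 as the column names.
promoted.columns = raw.iloc[0].tolist()

print(promoted)
```

Resetting the index is optional; the notebook keeps the original row labels, which is why its later outputs start at 3 rather than 0. `pd.read_html` also accepts a `header` argument, which may avoid this step altogether when the page layout allows it.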


In [92]:
# Removing rows whose Borough is 'Not assigned'
tnt_df = tnt_df[~tnt_df.Borough.str.contains('Not assigned')]
tnt_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [93]:
# Checking if the code has indeed worked by searching for "Not assigned" string in the column Borough
tnt_df[tnt_df['Borough'].str.match('Not assigned')]

Unnamed: 0,Postal Code,Borough,Neighbourhood
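
Note that `str.contains` matches substrings, so it would also drop any borough whose name merely includes the phrase. As a stricter alternative, the sketch below filters by exact equality and counts what is left. The `sample` frame is hypothetical data made up for illustration.

```python
import pandas as pd

# Hypothetical sample with one unassigned borough.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose Borough is not exactly 'Not assigned'.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

# Count any leftover unassigned rows; zero means the filter worked.
remaining = int((assigned["Borough"] == "Not assigned").sum())
print(remaining)
```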


#### The lab mentions that some postal codes are repeated with different neighbourhoods assigned, using M5A as an example

#### Let's first check whether that statement is true

In [94]:
# Getting the rows that contain M5A
tnt_df[tnt_df['Postal Code'].str.match('M5A')]

Unnamed: 0,Postal Code,Borough,Neighbourhood
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"


#### There seem to be no repeated postal codes, at least for the example proposed in the exercise

#### So let's double-check the whole set

In [95]:
# Counting the values in the Postal Code column to see whether any appears more than once

check = tnt_df['Postal Code'].value_counts()
check[check>1]


Series([], Name: Postal Code, dtype: int64)
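
The same check can be written with `is_unique` and `duplicated`, which also return the offending rows when there are any. This is a minimal sketch on hypothetical data that deliberately repeats M5A; the real table has no repeats.

```python
import pandas as pd

# Hypothetical data in which M5A appears twice.
codes = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M6A"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Lawrence Manor"],
})

# True when at least one postal code occurs more than once.
has_repeats = not codes["Postal Code"].is_unique

# keep=False flags every occurrence of a repeated code, not just the later ones.
repeated_rows = codes[codes["Postal Code"].duplicated(keep=False)]
print(has_repeats)
print(repeated_rows)
```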

#### There is definitely nothing repeated, but if there were, we could use the code below to merge the data in the Neighbourhood column:

<code> tnt_df = tnt_df.groupby(['Postal Code','Borough'])['Neighbourhood'].apply(', '.join).reset_index()</code>
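
To show what that merge would do, here is a small sketch on hypothetical data in which M5A is split across two rows. The `dupes` frame is invented for illustration, and `sort=False` is an added assumption to keep the groups in their original order.

```python
import pandas as pd

# Hypothetical data: M5A is listed once per neighbourhood.
dupes = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Lawrence Manor"],
})

# Group by postal code and borough, then join each group's
# neighbourhoods into a single comma-separated string.
merged = (
    dupes.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)
print(merged)
```

After the merge, each postal code occupies exactly one row, which matches the layout the Wikipedia table already has.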

In [102]:
tnt_df.shape

(103, 3)