# Segmenting and Clustering Neighborhoods in Toronto

## PART 1: Scraping the Wikipedia page to obtain the table of postal codes with <i>Pandas</i>

### Installing the lxml library, which Pandas uses to parse HTML

In [1]:
!pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/dd/ba/a0e6866057fc0bbd17192925c1d63a3b85cf522965de9bc02364d08e5b84/lxml-4.5.0-cp36-cp36m-manylinux1_x86_64.whl (5.8MB)
Installing collected packages: lxml
Successfully installed lxml-4.5.0


### Importing the Pandas library

In [2]:
import pandas as pd

### Reading every HTML table at the URL into a list of dataframes

In [3]:
df_html = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

### The postal code table is the first table on the page, so we take the first dataframe in the list

In [4]:
df = df_html[0]
df.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
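
Taking element `0` of the list depends on where the table sits on the page, which can change when the article is edited. A minimal sketch of a more defensive selection, assuming the `match` argument of `pd.read_html` is acceptable here: it keeps only tables whose text contains the given string. The HTML snippet below is a made-up stand-in for the downloaded page, not the real Wikipedia markup.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page: two tables, only one holds postal codes.
html = """
<table><tr><th>Legend</th></tr><tr><td>Some note</td></tr></table>
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# match= filters the returned list to tables containing this text.
tables = pd.read_html(StringIO(html), match="Postal code")
postal_df = tables[0]
print(postal_df)
```

With the real URL, the same `match="Postal code"` argument could be passed in place of relying on the table's position.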


In [5]:
# Get dataframe shape before cleaning it
df.shape

(180, 3)

### Keep only the rows that have an assigned borough and drop the rows whose borough is <i>Not assigned</i>

In [6]:
# Get the index labels of the rows whose 'Borough' is 'Not assigned'
indexNames = df[df['Borough'] == 'Not assigned'].index

# Drop those rows
df.drop(indexNames, inplace=True)
df.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
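
The same filtering can also be written with a boolean mask instead of collecting index labels and dropping them. The sketch below runs on a small made-up sample with the same column names as the scraped table; `sample` and `assigned` are illustrative names only.

```python
import pandas as pd

# Made-up sample mirroring the columns of the scraped table.
sample = pd.DataFrame({
    "Postal code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": [None, None, "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose borough is assigned.
# .copy() returns an independent dataframe rather than a view of `sample`.
assigned = sample[sample["Borough"] != "Not assigned"].copy()
print(assigned)
```

As with `drop`, the surviving rows keep their original index labels (2 and 3 here), so the index still needs resetting afterwards.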


### Separate the neighborhoods that share a postal code with commas instead of slashes

In [7]:
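# Assign the result back instead of calling replace(..., inplace=True) on the
# selected column, which recent Pandas versions warn about and may not apply to df
df['Neighborhood'] = df['Neighborhood'].str.replace(' / ', ', ', regex=False)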
df.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
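
The table shown here already has one row per postal code, so replacing the separator is enough. If the source instead listed one neighborhood per row and repeated the postal code (an assumption, not what the output above shows), the rows could be combined with `groupby`. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up sample: postal code M5A appears on two rows.
sample = pd.DataFrame({
    "Postal code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Join the neighborhoods of each (postal code, borough) pair with ", ".
# sort=False keeps the groups in order of first appearance.
grouped = (
    sample.groupby(["Postal code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```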


### Reset the dataframe index

In [8]:
# Renumber the rows from 0; drop=True discards the old index instead of keeping it as a column
df.reset_index(drop=True, inplace=True)
df.head(10)

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
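
To illustrate what `drop=True` changes, the sketch below resets the index of a small made-up dataframe both ways. Without `drop=True`, the old labels are kept as a new column named `index`.

```python
import pandas as pd

# Made-up dataframe whose index has gaps, as it would after dropping rows.
sample = pd.DataFrame({"Borough": ["North York", "Etobicoke"]}, index=[2, 5])

kept = sample.reset_index()              # old labels become an "index" column
dropped = sample.reset_index(drop=True)  # old labels are discarded

print(kept)
print(dropped)
```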


### Using the .shape attribute to get the number of rows and columns of the dataframe

In [9]:
# Get dataframe shape after the cleaning:
df.shape

(103, 3)
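
Beyond checking the shape, a few quick assertions can confirm that the cleaning did what was intended. The helper below is a hypothetical sketch (`check_clean` is not part of Pandas or of this notebook) and is shown on made-up data. It assumes the column names `Postal code` and `Borough` used above.

```python
import pandas as pd


def check_clean(frame):
    """Return simple pass/fail checks for a cleaned postal code table."""
    return {
        # No row should still carry the placeholder borough.
        "no_unassigned_borough": not (frame["Borough"] == "Not assigned").any(),
        # Each postal code should appear on exactly one row.
        "unique_postal_codes": bool(frame["Postal code"].is_unique),
        # After reset_index(drop=True) the labels run 0..n-1.
        "default_index": list(frame.index) == list(range(len(frame))),
    }


# Made-up sample standing in for the cleaned dataframe.
sample = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

checks = check_clean(sample)
print(checks)
```

Calling `check_clean(df)` on the notebook's dataframe would run the same checks on the scraped data.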