# Segmenting and Clustering Neighborhoods in Toronto

## PART 1: Scraping the postal code table from Wikipedia with <i>Pandas</i>

### Installing the lxml library, which Pandas uses to parse HTML tables

In [1]:
!pip install lxml



### Importing Pandas library

In [2]:
import pandas as pd

### Reading the HTML tables on the page into a list of DataFrames, given the URL

In [3]:
df_html = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
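
`pd.read_html` returns a list with one DataFrame per `<table>` element it finds, which is why a single table has to be picked out of the result afterwards. The sketch below illustrates this on a small, made-up HTML string that stands in for the Wikipedia page; it assumes lxml (or another parser supported by Pandas) is installed. The `match` parameter is an optional way to keep only tables containing a given text, in case the position of the table on the page changes.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for the real page: two tables, only one of interest.
html = """
<table>
  <tr><th>Other</th></tr>
  <tr><td>x</td></tr>
</table>
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# Every <table> becomes one DataFrame in the returned list.
tables = pd.read_html(StringIO(html))

# Keep only the tables whose text matches "Postal code".
matched = pd.read_html(StringIO(html), match="Postal code")
postal = matched[0]
```

Selecting by `match` is only an alternative to indexing with `[0]`; for the page used here, the first table is the one we need.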

### The postal code table is the first table on the page, so we take the first DataFrame in the list

In [4]:
df = df_html[0]
df.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


In [5]:
# Get dataframe shape before cleaning it
df.shape

(180, 3)

### Keep only the rows with an assigned borough, dropping rows whose borough is <i>Not assigned</i>

In [6]:
# Get the index labels of the rows whose 'Borough' is 'Not assigned'
indexNames = df[ df['Borough'] == 'Not assigned' ].index

# Drop such rows
df.drop(indexNames, inplace = True)
df.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
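
The same filtering can also be written with a boolean mask instead of collecting index labels and calling `drop`. This is a minimal sketch on a hypothetical three-row sample, not the scraped table; the names `sample` and `assigned` are illustrative only.

```python
import pandas as pd

# Hypothetical sample mirroring the columns of the scraped table.
sample = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village"],
})

# Keep the rows whose borough is assigned; .copy() gives an independent
# DataFrame, so later edits do not touch `sample`.
assigned = sample[sample["Borough"] != "Not assigned"].copy()
```

As with `drop`, the surviving rows keep their original index labels (here 1 and 2), so the index still has to be reset if consecutive labels are wanted.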


### Separate the neighborhoods that share a postal code with commas instead of slashes

In [7]:
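# Assign the result back to the column; an in-place replace on a selected column may not update df in recent Pandas versions
df['Neighborhood'] = df['Neighborhood'].str.replace(' / ', ', ', regex=False)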
df.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
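
To see the separator replacement in isolation, here is a small sketch on a hypothetical Series (the values are samples, and `neighborhoods` and `cleaned` are illustrative names). It uses the `.str.replace` accessor with `regex=False`, so `' / '` is treated as literal text and needs no escaping.

```python
import pandas as pd

# Hypothetical sample values, including a missing entry.
neighborhoods = pd.Series(["Parkwoods", "Regent Park / Harbourfront", None])

# Literal (non-regex) replacement; missing values are left as missing.
cleaned = neighborhoods.str.replace(" / ", ", ", regex=False)
```

Values without a separator pass through unchanged, and the result is a new Series that has to be assigned back to the column to take effect.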


### Reset the DataFrame index so that the row labels start from 0 again

In [None]:
# Discard the old index labels and renumber the rows from 0
df.reset_index(drop=True, inplace=True)
df.head()
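
The effect of `drop=True` is easiest to see side by side. In this sketch on a hypothetical two-row frame with gaps in its index, the default call keeps the old labels as a new `index` column, while `drop=True` discards them.

```python
import pandas as pd

# Hypothetical frame whose index has gaps, as after dropping rows.
filtered = pd.DataFrame(
    {"Borough": ["North York", "Downtown Toronto"]},
    index=[2, 4],
)

# Default: the old labels are preserved in a new column named "index".
kept = filtered.reset_index()

# drop=True: the old labels are discarded and rows are renumbered from 0.
dropped = filtered.reset_index(drop=True)
```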

### Using the .shape attribute to print the number of rows and columns of the dataframe

In [8]:
# Get dataframe shape after the cleaning:
df.shape

(103, 3)
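
Because `.shape` is a `(rows, columns)` tuple, it can be unpacked to compare row counts before and after cleaning. The sketch below uses a hypothetical four-row frame rather than the scraped data; all variable names are illustrative.

```python
import pandas as pd

# Hypothetical frame with two unassigned rows.
before = pd.DataFrame(
    {"Borough": ["Not assigned", "North York", "Not assigned", "Scarborough"]}
)
after = before[before["Borough"] != "Not assigned"]

# .shape is an attribute holding (number of rows, number of columns).
rows_before, cols_before = before.shape
rows_after, _ = after.shape
removed = rows_before - rows_after
```

Applied to the shapes reported in this notebook, the same subtraction gives 180 - 103 = 77 rows removed by the cleaning.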