# Segmenting and Clustering Neighborhoods in Toronto - Part 1

## Importing relevant libraries and packages

In [1]:
!pip install lxml
import pandas as pd
print("Import completed")

Import completed


## Extracting data from the Wikipedia page

I used read_html() to import the table of postal codes into a <em>pandas</em> dataframe. It returns a list with one dataframe per table found on the page.

In [2]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df=pd.read_html(url)

# The postal code table is the first table on the page
df_table=df[0]

# Rename columns (inplace expects the boolean True, not the string 'True')
df_table.rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"}, inplace=True)
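
Selecting the table by position works only while the postal code table stays first on the page. As a hedged alternative, the sketch below picks the table by the column names it must contain. `pick_table` is a hypothetical helper, not part of pandas, and the `tables` list is toy data standing in for what read_html() returns.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names.

    Hypothetical helper for illustration; it is not a pandas function.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table contains columns {sorted(required)}")


# Toy stand-in for the list returned by pd.read_html(url) (not the real page)
tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postal = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
postal = postal.rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"})
print(list(postal.columns))
```

With the real page, `tables` would simply be the `df` list from the cell above.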

## Cleaning the dataframe

Rows with a <strong>'Not assigned'</strong> value in the Borough column are removed

In [3]:
print("Number of rows in df_table: ", len(df_table.index))
print("Number of rows with a 'Not assigned' value in the 'Borough' column:", df_table.Borough.value_counts()['Not assigned'])

# Drop rows whose Borough is 'Not assigned'
df_table.drop(df_table.index[df_table['Borough'] == 'Not assigned'], inplace = True)
df_table.reset_index(drop=True, inplace=True)

print("Number of rows in df_table after removal: ", len(df_table.index))

Number of rows in df_table:  180
Number of rows with a 'Not assigned' value in the 'Borough' column: 77
Number of rows in df_table after removal:  103


77 rows were deleted
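
The same removal can also be written as a boolean-mask filter that returns a new dataframe instead of modifying the original in place. This is a minimal sketch on toy data (the rows below are made up for illustration), not the notebook's actual table.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# True for rows that have a real borough
assigned = toy["Borough"] != "Not assigned"

# Count the rows that will be dropped, then keep only the assigned ones
dropped = int((~assigned).sum())
cleaned = toy[assigned].reset_index(drop=True)

# Rows before, rows dropped, rows after
print(len(toy), dropped, len(cleaned))
```

The three printed counts should always satisfy before = dropped + after, which is the same check as 180 = 77 + 103 in the output above.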

In [4]:
df_table.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


Any <strong>'Not assigned'</strong> value in the Neighborhood column should be replaced with the Borough of the same row. First, count how many such values there are

In [5]:
print("Number of 'Not assigned' values in column Neighborhood:", len(df_table[df_table['Neighborhood'] == 'Not assigned']))

Number of 'Not assigned' values in column Neighborhood: 0


<p>There are no <em>'Not assigned'</em> values to replace in column 'Neighborhood'</p>
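
Had the count been non-zero, the replacement could be done with a boolean mask and .loc. The sketch below uses made-up rows (the 'Not assigned' neighborhood is invented for illustration) to show the idea.

```python
import pandas as pd

# Toy data for illustration only; the first row is invented
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M5A"],
    "Borough": ["Queen's Park", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Regent Park, Harbourfront"],
})

# Rows whose neighborhood is missing
unassigned = toy["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood for those rows only
toy.loc[unassigned, "Neighborhood"] = toy.loc[unassigned, "Borough"]

print(toy["Neighborhood"].tolist())
```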

Grouping rows by PostalCode and Borough, and combining the Neighborhoods of each group into one row, separated by commas

In [6]:
df_table = df_table.groupby(['PostalCode','Borough'])['Neighborhood'].apply(','.join).reset_index()
print("Number of rows in df_table: ", len(df_table.index))
print("Number of unique rows in the dataframe: ", len(df_table.groupby(['PostalCode', 'Borough','Neighborhood'])))

Number of rows in df_table:  103
Number of unique rows in the dataframe:  103


The two counts match, so there are no duplicate rows in the dataframe
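
To make the effect of the grouping visible, here is a small sketch on toy data in which one postal code appears on two rows. The rows are made up for illustration, and the duplicate check uses duplicated(), a swapped-in alternative to the second groupby used above.

```python
import pandas as pd

# Toy data for illustration only: M5A is split across two rows
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M1B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Malvern"],
})

# One row per (PostalCode, Borough); neighborhoods joined with ", "
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

# Number of fully duplicated rows after grouping
n_duplicates = int(grouped.duplicated().sum())

print(grouped["Neighborhood"].tolist(), n_duplicates)
```

Note that groupby sorts by the group keys by default, which is why the real table starts at M1B after this step.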

In [7]:
df_table.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [8]:
# Export df_table into csv file
df_table.to_csv('Toronto.csv', index = False)

print("Dataframe shape: ",df_table.shape)

Dataframe shape:  (103, 3)
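
Because several Neighborhood values contain commas, it can be worth checking that the CSV export reads back unchanged. The sketch below does a round trip through an in-memory buffer on toy rows (an assumption made to avoid touching the file system; the notebook itself writes Toronto.csv).

```python
import io

import pandas as pd

# Toy rows for illustration; the first neighborhood value contains a comma
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Woburn"],
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Fields containing commas are quoted on export, so the columns survive intact
print(restored.shape)
```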
