## **Segmenting and Clustering Neighborhoods in Toronto**

***Before we get the data and start exploring it, let's import all the dependencies that we will need.***

In [25]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests # library to handle requests

# conda install -c anaconda beautiful-soup --yes
from bs4 import BeautifulSoup # package for parsing HTML and XML documents

import csv # implements classes to read and write tabular data in CSV form

print('Libraries imported.')

Libraries imported.


***Scrape the postal code table from the following Wikipedia page***

In [26]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_url,'lxml')
table = soup.find('table',{'class':'wikitable sortable'})
#print(soup.prettify())

***Extract the column headers and the rows***

In [27]:
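# Header cells (<th>) hold the column names
headers = [header.text for header in table.find_all('th')]

# Collect the text of every data cell (<td>), one list per table row;
# the header row has no <td> cells, so it yields an empty list
table_rows = table.find_all('tr')
rows = []
for table_row in table_rows:
    cells = table_row.find_all('td')
    rows.append([cell.text for cell in cells])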
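
The extraction step can be tried offline on a small hand-written table. The following is a minimal sketch rather than the notebook's own code: the HTML snippet is made up for illustration, and it assumes Python's built-in `html.parser` instead of `lxml`. It also uses `get_text(strip=True)`, one possible way to keep the trailing newline (visible later in `Neighbourhood\n`) out of the data.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the layout of the Wikipedia table (assumption)
html = """<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

# strip=True removes leading and trailing whitespace, including "\n"
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# One list of cell texts per <tr>; the header row has no <td> and gives []
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
rows = [r for r in rows if r]  # drop the empty list from the header row

print(headers)
print(rows)
```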

In [28]:
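# newline='' lets the csv module control line endings (avoids blank lines on Windows)
with open('Toronto-1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    # Skip the empty list produced by the header row
    writer.writerows(row for row in rows if row)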
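
As a quick check of the CSV step, here is a minimal sketch using made-up rows and a temporary file (both are assumptions for illustration). It passes `newline=''` to `open`, which the `csv` module documentation recommends, and reads the file back to confirm that empty rows were filtered out.

```python
import csv
import os
import tempfile

headers = ["Postcode", "Borough", "Neighbourhood"]
# The empty list stands in for the header row, which has no <td> cells
rows = [[],
        ["M3A", "North York", "Parkwoods"],
        ["M4A", "North York", "Victoria Village"]]

# Hypothetical file name inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "sample.csv")

with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(r for r in rows if r)  # skip empty rows

with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back)
```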

***Load data from Toronto-1.csv file***

In [29]:
# Load data from CSV
df=pd.read_csv('Toronto-1.csv')
print('Data downloaded!')

Data downloaded!


In [30]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n


In [31]:
df.shape

(289, 3)

In [32]:
df.columns

Index(['Postcode', 'Borough', 'Neighbourhood\n'], dtype='object')

***Data Cleaning***

In [33]:
#Rename Columns
df.rename(columns={'Neighbourhood\n':'Neighborhood','Postcode':'PostalCode'}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n


In [34]:
df['Neighborhood']=df['Neighborhood'].replace(to_replace='\n', value='', regex=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
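
An alternative way to remove the trailing newline is the `.str.strip()` accessor. This is a minimal sketch on a made-up three-row frame, not the notebook's data. Unlike the regex replace above, `strip` only touches whitespace at the start and end of each value.

```python
import pandas as pd

# Made-up sample with the same trailing "\n" as the scraped column
sample = pd.DataFrame(
    {"Neighborhood": ["Parkwoods\n", "Victoria Village\n", "Harbourfront\n"]}
)

# Remove leading and trailing whitespace from every value
sample["Neighborhood"] = sample["Neighborhood"].str.strip()

print(sample["Neighborhood"].tolist())
```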


***Drop rows whose borough is Not assigned***

In [35]:
#df = df[(~df['Borough'].str.contains("Not assigned"))]
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [36]:
df.shape

(212, 3)
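
The same filter can also be written as a boolean mask that returns a new frame instead of modifying `df` in place. This is a minimal sketch on made-up rows; the name `kept` is hypothetical.

```python
import pandas as pd

# Made-up sample with one unassigned borough
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Harbourfront"],
})

# Keep only the rows whose borough is assigned; the original index labels are preserved
kept = sample[sample["Borough"] != "Not assigned"]

print(kept)
```

Note that the surviving rows keep their original index labels, which is why the output above starts at index 2.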

***More than one neighborhood can exist in one postal code area.***  
***Such rows will be combined into one row, with the neighborhoods separated by a comma.***

For example, ***M5A*** is listed twice and has two neighborhoods: ***Harbourfront and Regent Park***

In [37]:
df.query('PostalCode == "M5A"')

Unnamed: 0,PostalCode,Borough,Neighborhood
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park


In [38]:
# Group by PostalCode and Borough, joining the neighborhood names with ', '
df2=df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()

In [39]:
df2.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
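
To see what the grouping does, here is a minimal sketch on three made-up rows that mirror the M5A example. It uses `.agg(', '.join)`, which for this case should behave like the `.apply(', '.join)` call above; `merged` is a hypothetical name.

```python
import pandas as pd

# Made-up sample: M5A appears twice, M3A once
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per (PostalCode, Borough); neighborhood names joined with ", "
merged = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(merged)
```

Because `groupby` sorts the group keys by default, the result is ordered by postal code rather than by first appearance.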


***If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough***

So for the ***9th*** cell in the table on the Wikipedia page, the value of both the Borough and the Neighborhood columns will be ***Queen's Park***

In [40]:
df2.query('Neighborhood == "Not assigned"')

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Not assigned


In [41]:
df2.loc[df2.Neighborhood == 'Not assigned', "Neighborhood"] = df2.Borough

In [42]:
df2.query('PostalCode == "M7A"')

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park
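
The same substitution can be expressed with `Series.mask`, which replaces values where a condition holds. This is a minimal sketch on two made-up rows, offered as an alternative to the `.loc` assignment above rather than a replacement for it.

```python
import pandas as pd

# Made-up sample: one row with an unassigned neighborhood
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M1B"],
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Rouge, Malvern"],
})

# Where the neighborhood is "Not assigned", take the borough name instead
missing = sample["Neighborhood"] == "Not assigned"
sample["Neighborhood"] = sample["Neighborhood"].mask(missing, sample["Borough"])

print(sample)
```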


***Use the .shape attribute to print the number of rows of the dataframe***

In [43]:
# number of rows of dataframe
df2.shape

(103, 3)
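
Since `.shape` is a `(rows, columns)` tuple, it can be unpacked directly. The sketch below, on a made-up three-row frame, also checks that each postal code appears only once after grouping; that check is an extra assumption about what one might want to verify, not a step from the notebook.

```python
import pandas as pd

# Made-up sample shaped like the grouped dataframe
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
    "Neighborhood": [
        "Rouge, Malvern",
        "Highland Creek, Rouge Hill, Port Union",
        "Guildwood, Morningside, West Hill",
    ],
})

n_rows, n_cols = sample.shape  # shape is an attribute, not a method call
codes_unique = bool(sample["PostalCode"].is_unique)

print(n_rows, n_cols, codes_unique)
```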

In [44]:
df2.to_csv('Toronto-2.csv')