# Segmenting and Clustering Neighborhoods in Toronto


In this notebook, I scrape a Wikipedia page listing Toronto postal codes and clean the resulting table so that the neighborhoods can later be segmented and clustered.

### Import Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import requests
import lxml.html as lh

print('Libraries imported.')

Libraries imported.


### Import Wikipedia Page with List of Postal Codes of Canada

In [2]:
wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
raw_wikipedia_page=requests.get(wikipedia_link)
page=raw_wikipedia_page.text

#Store the contents of the website under doc using lxml library
doc = lh.fromstring(raw_wikipedia_page.content)



### Parse Table

In [3]:
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

#Check the number of cells in each of the first 12 rows; rows of our table have 3 cells (used below when populating the data frame)
[len(T) for T in tr_elements[:12]]

#Create empty list
col=[]
i=0
#For each cell of the header row, store the column name and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print (i,name)
    col.append((name,[]))

#Since the first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3 (updated from previous length check), the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Only columns after the first (postal code) are tried as integers
        if i>0:
            #Convert any numerical value to integers
            try:
                data=int(data)
            except ValueError:
                #Leave non-numeric text unchanged
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
        
[len(C) for (title,C) in col]

1 Postcode
2 Borough
3 Neighbourhood



[289, 289, 289]
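
The row-by-row parsing above can be illustrated on a small inline table. This is a minimal sketch, assuming a made-up HTML snippet in place of the Wikipedia page; the names `html`, `columns`, and `table` are mine and not part of the original notebook.

```python
import lxml.html as lh

# Hypothetical inline HTML standing in for the Wikipedia page
html = (
    "<div>"
    "<table>"
    "<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>"
    "<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>"
    "<tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>"
    "</table>"
    "<table><tr><td>unrelated row from another table</td></tr></table>"
    "</div>"
)

doc = lh.fromstring(html)
rows = doc.xpath('//tr')

# One (name, values) pair per header cell
columns = [(cell.text_content().strip(), []) for cell in rows[0]]

for row in rows[1:]:
    # A row with a different number of cells belongs to another table
    if len(row) != len(columns):
        break
    for (name, values), cell in zip(columns, row.iterchildren()):
        values.append(cell.text_content().strip())

table = {name: values for name, values in columns}
print(table)
```

Stopping at the first row with a different cell count relies on the postal code table being the first table on the page, which is also what the notebook's loop assumes.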

### Load, clean and group dataframe

In [4]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

#### The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [5]:
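#Rename columns. The new names are listed in alphabetical order, so sort the
#scraped columns first to make the assignment independent of dict ordering
df = df[sorted(df.columns)]
df.columns = ['Borough','Neighborhood','PostalCode']
df.columns.tolist()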

['Borough', 'Neighborhood', 'PostalCode']

In [6]:
#Remove the trailing \n from Neighborhood entries
df['Neighborhood'] = df['Neighborhood'].str.rstrip('\n')
df.head()

,Borough,Neighborhood,PostalCode
0,Not assigned,Not assigned,M1A
1,Not assigned,Not assigned,M2A
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A
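
Two common ways to drop a trailing newline are slicing off the last character and `str.rstrip`. The sketch below, on made-up values, shows where they differ: slicing removes the last character unconditionally, while `rstrip('\n')` only removes trailing newlines.

```python
import pandas as pd

# Made-up values; the last entry has no trailing newline
names = pd.Series(['Parkwoods\n', 'Victoria Village\n', 'Harbourfront'])

sliced = names.str[:-1]             # always drops the last character
stripped = names.str.rstrip('\n')   # drops trailing newlines only

print(sliced.tolist())
print(stripped.tolist())
```

If every entry is known to end in a newline, both give the same result; `rstrip` is the safer choice when that is not guaranteed.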


In [7]:
#Rearrange the columns, moving PostalCode (the last column) to the front
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
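
The list rotation used to move the last column to the front can be seen in isolation on a one-row toy frame. This is a sketch with made-up data; `toy` and `rotated` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Borough': ['North York'],
    'Neighborhood': ['Parkwoods'],
    'PostalCode': ['M3A'],
})

cols = toy.columns.tolist()
rotated = cols[-1:] + cols[:-1]   # last name first, then the rest in order
toy = toy[rotated]                # select columns in the new order

print(toy.columns.tolist())
```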

#### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [8]:
#Remove observations where Borough is "Not assigned"
df1 = df[df.Borough != 'Not assigned']
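
The filter above is a boolean mask: the comparison produces one True/False value per row, and indexing with it keeps only the True rows. A minimal sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto'],
})

mask = toy['Borough'] != 'Not assigned'   # one boolean per row
kept = toy[mask]                          # rows where the mask is True

print(mask.tolist())
print(kept['PostalCode'].tolist())
print(kept.index.tolist())
```

The filtered frame keeps the original row labels, so the index has gaps; `reset_index(drop=True)` renumbers it if that matters for later steps.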

#### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [9]:
#Use the Borough as the Neighborhood when Neighborhood is "Not assigned"
df2 = df1['Neighborhood'].where(df1['Neighborhood'] != 'Not assigned', df1['Borough'])

In [10]:
#Concatenate the new assignment to the original trimmed data set and delete the original Neighborhood column
df3 = pd.concat([df1, df2], axis=1)
df3.columns= ['PostalCode', 'Borough','Nold','Neighborhood']
df4 = df3.drop('Nold', axis=1)
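
The neighborhood fill can also be written in place, without the concat and drop steps. The following is a minimal sketch on made-up rows using `Series.where`, which keeps values where the condition holds and takes the replacement elsewhere; it is one possible formulation, not the only one.

```python
import pandas as pd

# Made-up rows: the first has a borough but no assigned neighborhood
toy = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

is_assigned = toy['Neighborhood'] != 'Not assigned'
# Keep assigned neighborhoods; otherwise fall back to the borough of the same row
toy['Neighborhood'] = toy['Neighborhood'].where(is_assigned, toy['Borough'])

print(toy['Neighborhood'].tolist())
```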

#### More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [11]:
#Sort by all three columns first; this makes the results easier to review and puts Neighborhoods in alphabetical order before grouping
df5 = df4.sort_values(['PostalCode','Borough','Neighborhood'],ascending=[True,True,True])

In [12]:
#Group by PostalCode and comma-separate the Neighborhoods
df_final = (df5.groupby('PostalCode')
   .agg({'Borough' : 'first', 'Neighborhood' : ','.join})
   .reset_index()
   .reindex(columns=df.columns))
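
The grouping step can be checked on a toy frame built around the M5A example. This sketch assumes made-up input rows; `toy` and `grouped` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

grouped = (toy.sort_values(['PostalCode', 'Neighborhood'])
              .groupby('PostalCode')
              # first borough per code; neighborhoods joined into one string
              .agg({'Borough': 'first', 'Neighborhood': ','.join})
              .reset_index())

print(grouped)
```

Sorting before grouping is what puts Harbourfront ahead of Regent Park in the joined string; `groupby` itself only orders the group keys.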

In [13]:
df_final.head()

,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Highland Creek,Port Union,Rouge Hill"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


#### In the last cell of the notebook, use the .shape attribute to print the number of rows and columns of the dataframe.

In [14]:
df_final.shape

(103, 3)