# Segmenting and Clustering Neighborhoods in Toronto

In this notebook we scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to extract its table of postal codes and load it into a pandas dataframe, so that the data is in a structured format. For the scraping we use the **BeautifulSoup** package.

In [1]:
from bs4 import BeautifulSoup

We also need the **requests** library, which lets us send the HTTP request for the page. We fetch the page and store the response in a variable named `wiki_url`, then pass its content to BeautifulSoup for parsing.

In [2]:
import requests
wiki_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(wiki_url.content,'lxml')
#print(soup.prettify())

We can uncomment the last line to inspect the whole HTML document.
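A network request can fail or hang, so it may be worth checking the response before parsing it. The following is a minimal sketch and not part of the original notebook; the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    Hypothetical helper; the timeout value is an arbitrary assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example usage (performs a real network request, so it is left commented out):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
```

The returned text could then be passed to `BeautifulSoup` in place of `wiki_url.content`.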

Now we need to find the table with the class ‘wikitable sortable’ in the HTML document and assign it to the variable *my_table*. Uncomment the last line of the next cell to take a look at the result.

In [85]:
my_table = soup.find('table',{'class':'wikitable sortable'})
#my_table

We can deduce that every `<tr>`...`</tr>` section corresponds to a row. We can use the `find_all()` method to find all the `<tr>` tags in the table, which returns a list-like result set. Now let's check the length of this result.

In [84]:
len(my_table.find_all("tr"))

290
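To see what `find_all()` returns without fetching the real page, here is a small sketch on a made-up three-row table. The markup and the `demo_` names are assumptions for illustration only, and it uses Python's bundled `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia table (made-up markup for illustration).
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'.
demo_soup = BeautifulSoup(html, 'html.parser')
demo_table = demo_soup.find('table', {'class': 'wikitable sortable'})

# find_all returns a list-like collection with one Tag per <tr>
demo_rows = demo_table.find_all('tr')
print(len(demo_rows))  # one header row plus two data rows
```

Note that the header row is counted as well, because it is also wrapped in a `<tr>` tag.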

The length of the result is 290, which means our table contains 290 rows, including the header row. Now let's import the numpy library to create an empty array of 290 rows and 3 columns, matching the table shown on the wiki page.

In [87]:
import numpy as np
matrix = np.empty((290, 3), dtype=object)

Now we will fill our array with the values of the table using a for loop. Note that we use the `.stripped_strings` generator.
When there is more than one thing inside a tag (as in our case), you can still look at just the strings using the `.strings` generator. Since these strings tend to have a lot of extra whitespace, you can remove it by using `.stripped_strings` instead.
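The difference between the two generators is easiest to see side by side. This is a minimal sketch on a made-up table row (the markup and variable names are assumptions, not taken from the Wikipedia page):

```python
from bs4 import BeautifulSoup

# Made-up row with newlines between and inside the cells.
row_html = (
    "<table><tr>\n"
    "<td>M5A</td>\n"
    "<td>Downtown Toronto</td>\n"
    "<td>Harbourfront\n</td>\n"
    "</tr></table>"
)
row = BeautifulSoup(row_html, 'html.parser').tr

# .strings yields every text node, including whitespace-only ones
raw = list(row.strings)

# .stripped_strings trims each string and skips the whitespace-only ones
clean = list(row.stripped_strings)

print(raw)
print(clean)
```

Here `clean` contains only the three cell values, while `raw` also contains the newline strings between the cells.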

In [89]:
for i, val in enumerate(my_table.find_all("tr")):
    for j,string in enumerate(val.stripped_strings):
        matrix[i][j]=string
#matrix
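The loop above relies on the hard-coded size of 290 rows. As a possible alternative, the rows can be collected into a list first and converted to an array afterwards, so the size follows from the data. This is only a sketch on made-up markup; it assumes every row yields the same number of strings, and the `demo_` names are hypothetical.

```python
from bs4 import BeautifulSoup
import numpy as np

# Made-up stand-in for the scraped table.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, 'html.parser').find('table')

# One list of cell strings per <tr>; no row count needed up front
rows = [list(tr.stripped_strings) for tr in demo_table.find_all('tr')]

# Assumes all rows have equal length, so the result is a 2-D array
demo_matrix = np.array(rows, dtype=object)
print(demo_matrix.shape)
```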

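But we wish to keep only the rows that have an assigned borough, so we eliminate the rows where the borough is 'Not assigned'. If a row has a borough but its neighbourhood is 'Not assigned', we set the neighbourhood to the borough name.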

In [94]:
# Transpose so that each table column becomes a row of matrix2
matrix2 = matrix.transpose()
# Positions of the entries whose borough is 'Not assigned'
indices = [i for (i, v) in enumerate(matrix2[1]) if v == 'Not assigned']
matrix2 = np.delete(matrix2, indices, 1)
# Where the neighbourhood is 'Not assigned', use the borough name instead
for (i, v) in enumerate(matrix2[2]):
    if v == 'Not assigned':
        matrix2[2][i] = matrix2[1][i]
# To check that the desired rows were removed, the next statement should print 'False'
print('Not assigned' in matrix2[1])
print(matrix2.shape)

False
(3, 213)
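For comparison, the same cleanup can be expressed with pandas boolean masks once the data is in a dataframe. The sketch below runs on a few made-up sample rows, not on the scraped table, and the variable names are assumptions.

```python
import pandas as pd

# Made-up sample rows for illustration.
demo = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only the rows with an assigned borough
assigned = demo[demo['Borough'] != 'Not assigned'].copy()

# Where the neighbourhood is missing, fall back to the borough name
missing = assigned['Neighbourhood'] == 'Not assigned'
assigned.loc[missing, 'Neighbourhood'] = assigned.loc[missing, 'Borough']

assigned = assigned.reset_index(drop=True)
print(assigned)
```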


We eliminated the rows with unassigned boroughs and are left with 213 rows, including the header row. Let's convert the array to a **pandas** dataframe.

In [95]:
import pandas as pd
matrix3=matrix2.transpose()
df_can_pc=pd.DataFrame(data=matrix3[1:,0:],
                       columns=matrix3[0,0:])
df_can_pc.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [97]:
result = df_can_pc.groupby(by=['Postcode','Borough'],sort=False).agg( ', '.join)
result.reset_index(inplace=True)
result.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
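The grouping step can be checked in isolation on a handful of rows. This is a minimal sketch using three sample rows taken from the table above; selecting the `Neighbourhood` column explicitly before aggregating is a stylistic choice here, not the notebook's own code.

```python
import pandas as pd

demo = pd.DataFrame({
    'Postcode': ['M4A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Victoria Village', 'Harbourfront', 'Regent Park'],
})

# Group rows sharing a postcode and borough, then join their
# neighbourhood names with ', '. sort=False keeps the original order.
merged = (
    demo.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
        .agg(', '.join)
        .reset_index()
)
print(merged)
```

The two M5A rows collapse into a single row whose neighbourhood is the comma-separated combination.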


In [99]:
result.shape

(103, 3)