# Segmenting and Clustering Neighborhoods in Toronto

In this notebook we scrape the Wikipedia page [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) to extract its table of postal codes and load it into a pandas dataframe. To parse the HTML, we use the **BeautifulSoup** package.

In [1]:
from bs4 import BeautifulSoup

We also need to import the **requests** library, which sends HTTP requests for us. We request the page we want to scrape and store the response in a variable named `wiki_url` (despite its name, it holds the response object rather than the URL), then pass the response content to BeautifulSoup.

In [2]:
import requests
wiki_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(wiki_url.content,'lxml')
#print(soup.prettify())

We can uncomment the last line to inspect the whole HTML document.
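
The cell above parses whatever the server returns, even an error page. Below is a minimal sketch of a more defensive fetch. The helper name `fetch_soup` and the 10-second timeout are assumptions, not part of the original notebook. It uses the built-in `html.parser` so that `lxml` is not required:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not from the notebook)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Example call (requires network access), left commented out:
# soup = fetch_soup("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
```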

Now we need to find the table with the class `wikitable sortable` in the HTML and assign it to the variable `my_table`. Uncomment the last line of the next cell to take a look at the result.

In [3]:
my_table = soup.find('table',{'class':'wikitable sortable'})
#my_table
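
One detail worth knowing: passing the full string `'wikitable sortable'` matches the `class` attribute exactly as written, so a table whose classes appear in a different order would be missed. The sketch below illustrates this on a small made-up HTML snippet (the snippet and variable names are assumptions for illustration). It also shows a CSS selector as an order-independent alternative:

```python
from bs4 import BeautifulSoup

# Made-up HTML with the same two classes in different orders
html = """
<table class="wikitable sortable"><tr><td>A</td></tr></table>
<table class="sortable wikitable"><tr><td>B</td></tr></table>
<table class="infobox"><tr><td>C</td></tr></table>
"""
demo = BeautifulSoup(html, "html.parser")

# Matches only the table whose class attribute reads exactly "wikitable sortable"
exact = demo.find_all("table", {"class": "wikitable sortable"})

# Matches every table that carries both classes, in any order
by_selector = demo.select("table.wikitable.sortable")

print(len(exact), len(by_selector))
```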

Each `<tr>`...`</tr>` section corresponds to one table row. We can use the `find_all()` method to collect all the `<tr>` tags in the table; it returns a list-like `ResultSet`. Now let's check its length.

In [4]:
len(my_table.find_all("tr"))

290

The length is 290, which means our table contains 290 rows, including the header row. Now let's import the NumPy library and create an empty array of 290 rows and 3 columns, matching the table shown on the wiki page.

In [5]:
import numpy as np
matrix = np.empty((290, 3), dtype=object)

Now we fill our array with the values of the table using a for loop. Note that we use the `stripped_strings` generator.
When a tag contains more than one string (as in our case), you can still iterate over just the strings with the `.strings` generator. Since these strings tend to carry a lot of extra whitespace, you can remove it by using the `.stripped_strings` generator instead.
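
To make the difference concrete, here is a small sketch on a made-up table row (the HTML and the variable names are assumptions, not taken from the Wikipedia page):

```python
from bs4 import BeautifulSoup

# Made-up table row for illustration; the line breaks mimic real page markup
row_html = """<tr>
<td>M5A</td>
<td><a href="#">Downtown Toronto</a></td>
<td>Harbourfront
</td>
</tr>"""
row = BeautifulSoup(row_html, "html.parser").tr

raw = list(row.strings)             # keeps whitespace-only strings and trailing newlines
clean = list(row.stripped_strings)  # strips whitespace and skips empty strings

print(raw)
print(clean)
```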

In [6]:
# i indexes the table row, j the cell within that row
for i, val in enumerate(my_table.find_all("tr")):
    for j, string in enumerate(val.stripped_strings):
        matrix[i][j] = string
#matrix
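
As an aside, the same data can be collected without sizing an array in advance. The sketch below is one possible alternative, not the approach used in this notebook. It builds a list of rows from a made-up three-row table (the HTML and variable names are assumptions):

```python
from bs4 import BeautifulSoup

# Made-up table with the same column layout as the Wikipedia table
table_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
table = BeautifulSoup(table_html, "html.parser").find("table")

# One list of cleaned cell strings per <tr>; no row count has to be hard-coded
rows = [list(tr.stripped_strings) for tr in table.find_all("tr")]
header, records = rows[0], rows[1:]

print(header)
print(records)
```

A list of lists grows with the table, so the code keeps working if the page gains or loses rows.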

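We only want to keep the rows that have an assigned borough, so we remove the rows where the borough is 'Not assigned'. If a row has a borough but its neighbourhood is 'Not assigned', we use the borough name as the neighbourhood.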

In [7]:
# Transpose so that each row of matrix2 holds one column of the table
matrix2 = matrix.transpose()
# Positions of the entries whose borough (second column) is 'Not assigned'
indices = [i for (i, v) in enumerate(matrix2[1]) if v == 'Not assigned']
matrix2 = np.delete(matrix2, indices, 1)
# Where the neighbourhood is 'Not assigned', fall back to the borough name
for (i, v) in enumerate(matrix2[2]):
    if v == 'Not assigned':
        matrix2[2][i] = matrix2[1][i]
# If the unassigned boroughs were removed successfully, the next statement prints False
print('Not assigned' in matrix2[1])
print(matrix2.shape)

False
(3, 213)
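
For comparison, the same two steps could also be written with pandas boolean masks. This is a minimal sketch on a hypothetical three-row frame (the data and the names `toy` and `assigned` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample with the same columns as the scraped table
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Step 1: keep only the rows with an assigned borough
assigned = toy[toy["Borough"] != "Not assigned"].copy()

# Step 2: where the neighbourhood is missing, fall back to the borough name
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]
assigned = assigned.reset_index(drop=True)

print(assigned)
```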


We removed the unassigned boroughs and are left with 213 rows, including the header row. Let's convert the array to a **pandas** dataframe.

In [8]:
import pandas as pd
matrix3 = matrix2.transpose()
# The first row holds the column names; the remaining rows are the data
df_torpc = pd.DataFrame(data=matrix3[1:], columns=matrix3[0])
df_torpc.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [9]:
result = df_torpc.groupby(by=['Postcode', 'Borough'], sort=False).agg(', '.join)
result.reset_index(inplace=True)
result.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |


In [10]:
result.shape

(103, 3)
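
To see what the grouping step does in isolation, here is a small sketch on a hypothetical three-row frame (the names `toy` and `combined` are assumptions). It joins the neighbourhoods that share a postcode and borough, then checks that each postcode appears only once:

```python
import pandas as pd

# Hypothetical sample: M5A appears twice, as in the Wikipedia table
toy = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

combined = (
    toy.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)   # concatenate the neighbourhood names of each group
       .reset_index()    # turn the group keys back into ordinary columns
)

print(combined)
print(combined["Postcode"].is_unique)
```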

Now that we have built a dataframe of the postal codes along with the borough and neighborhood names, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data.

I tried to use the [Geocoder](https://geocoder.readthedocs.io/index.html) Python package, but it didn't work, so I will use the CSV file provided here: http://cocl.us/Geospatial_data

In [11]:
import pandas as pd
df_latlng = pd.read_csv('http://cocl.us/Geospatial_data')
df_latlng.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Next, we rename `Postcode` to `Postal Code` and `Neighbourhood` to `Neighborhood`, so that the key column has the same name as in `df_latlng`.

In [12]:
result.rename(columns={'Postcode': 'Postal Code','Neighbourhood': 'Neighborhood'}, inplace=True)
result.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |


Finally, we merge both dataframes on the key `Postal Code`.

In [13]:
df_neighb = pd.merge(result, df_latlng, on='Postal Code', how='inner')
df_neighb.head(10)

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657162 | -79.378937 |
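
An inner merge silently drops postal codes that appear in only one of the two dataframes. The sketch below shows one way to check for such codes, using an outer merge with `indicator=True` on small hypothetical frames. The code `M9Z` and the names `left`, `right`, and `check` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical frames: 'M9Z' has no coordinates, 'M1B' has no borough entry
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M1B"],
    "Latitude": [43.753259, 43.725882, 43.806686],
    "Longitude": [-79.329656, -79.315572, -79.194353],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
# validate raises an error if a postal code is duplicated on either side.
check = pd.merge(left, right, on="Postal Code", how="outer",
                 indicator=True, validate="one_to_one")
unmatched = check[check["_merge"] != "both"]

print(unmatched[["Postal Code", "_merge"]])
```

If `unmatched` is empty for the real data, the inner merge above loses no rows.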
