# Peer-graded Assignment:
# _Segmenting and Clustering Neighbourhoods in Toronto_

_By Oludayo_

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

This assignment is in 3 parts as indicated below:

<a href='#part1'>1. Web Scraping and Data Cleansing</a>
<br>
<a href='#part2'>2. Geocoding</a>
<br>
<a href='#part3'>3. Exploration and Clustering</a></div>

Each part, once updated, is committed to GitHub, and the respective link is provided for assessment.

<a id='part1'></a>

## Part 1 - Web Scraping and Data Cleansing

### 1.1 Notebook Created

A notebook was created as required.

### 1.2 Web scraping

To scrape the Wikipedia page listing the __[Toronto neighbourhoods](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)__, the necessary libraries were first imported:

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

The `url` of the Wikipedia page for the Toronto neighbourhoods is then assigned as shown below:

In [2]:
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

Sometimes it is important to specify a `User-Agent` header to make sure the requested page is actually downloaded, since some servers refuse requests from unidentified clients. The following steps demonstrate this:

In [3]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'}

In [4]:
wikipedia_page = requests.get(wikipedia_link, headers = headers)

To check that the Wikipedia page was downloaded successfully, use either Option 1 or Option 2 below. A status code of `200` means the request succeeded; a code of `403` means the server refused the request (forbidden), while `404` would indicate a bad link (page not found).

In [5]:
# Option 1
wikipedia_page

<Response [200]>

In [6]:
# Option 2
wikipedia_page.status_code

200
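The meaning of a status code can also be looked up programmatically. The sketch below uses only the standard library's `http.HTTPStatus`; the helper names `describe_status` and `is_success` are hypothetical and not part of the original notebook. With `requests` itself, calling `wikipedia_page.raise_for_status()` raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code with its standard reason phrase (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in HTTPStatus
        phrase = "Unknown"
    return f"{code} {phrase}"


def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300


print(describe_status(200))  # 200 OK
print(describe_status(403))  # 403 Forbidden
print(is_success(403))       # False
```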

Now, to parse the downloaded HTML source, `BeautifulSoup` is applied as shown below:

In [7]:
# Parses the HTML content of the response
soup = BeautifulSoup(wikipedia_page.content, 'html.parser')
# soup

To view the `html` in a more organised, indented form, use `prettify` _(this is optional)_.

In [8]:
# print(soup.prettify())

In [9]:
# This extracts the "tbody" within the table where class is "wikitable sortable"
table = soup.find('table', {'class':'wikitable sortable'}).tbody
table

<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>
<td><a href="/wiki/North_York" tit

In [10]:
# Extracts all "tr" (table rows) within the table above
rows = table.find_all('tr')
rows

[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>, <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td></tr>, <tr>
 <td>M4A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
 </td></tr>, <tr>
 <td>M5A</td>
 <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
 <td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
 </td></tr>, <tr>
 <td>M5A</td>
 <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
 <td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
 </td></tr>, <tr>
 <td>M6A</td>
 <td

In [11]:
# Extracts the column headers from the "th" tags, removing any '\n' characters
columns = [i.text.replace('\n', '')
           for i in rows[0].find_all('th')]
columns

['Postcode', 'Borough', 'Neighbourhood']
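The header and row extraction can be tried offline on a small HTML fragment. The snippet below is a sketch under assumptions: `sample_html` is made-up data that only mimics the shape of the Wikipedia table, and `get_text(strip=True)` is used in place of the `replace('\n', '')` call to trim whitespace from each cell.

```python
from bs4 import BeautifulSoup

# A small stand-in for the Wikipedia table (illustrative data only)
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = sample_soup.find('table', {'class': 'wikitable sortable'}).find_all('tr')

# The first row holds the headers; get_text(strip=True) drops the trailing newline
sample_columns = [th.get_text(strip=True) for th in sample_rows[0].find_all('th')]

# Every remaining row becomes a list of cleaned cell values
sample_records = [[td.get_text(strip=True) for td in row.find_all('td')]
                  for row in sample_rows[1:]]

print(sample_columns)
print(sample_records)
```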

### 1.3 Creating the Dataframe

In [12]:
# Converts columns to pd dataframe
df = pd.DataFrame(columns = columns)
df

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|


In [13]:
# Extracts the cell values of every row and cleans them
# Then appends the values to the pd dataframe "df" created above
# Please note that the first row (rows[0]) is skipped because it is the header

for i in range(1, len(rows)):
    tds = rows[i].find_all('td')

    # Removes newlines and non-breaking spaces from every cell
    values = [td.text.replace('\n', '').replace('\xa0', '') for td in tds]

    # Appends only rows that match the header columns
    # (df.loc is used because DataFrame.append was removed in pandas 2.0)
    if len(values) == len(columns):
        df.loc[len(df)] = values

In [14]:
# Original shape of dataframe
df.shape

(289, 3)
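Growing a dataframe one row at a time inside a loop gets slow as the table grows. A common alternative, sketched here on made-up rows rather than the scraped data, is to collect the cleaned rows in a plain list and construct the dataframe once. The names `sample_columns`, `sample_records` and `df_sample` are illustrative only.

```python
import pandas as pd

sample_columns = ['Postcode', 'Borough', 'Neighbourhood']

# Illustrative rows standing in for the cleaned <td> values
sample_records = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M5A', 'Downtown Toronto', 'Harbourfront'],
]

# Build the frame in a single call instead of growing it inside a loop
df_sample = pd.DataFrame(sample_records, columns=sample_columns)

print(df_sample.shape)  # (3, 3)
```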

In [15]:
df.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |


In [16]:
(df['Borough'] == 'Not assigned').sum()

77

In [17]:
dff = df.copy() # copy to preserve the original content of "df" before manipulations
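Whether a second name preserves the original data depends on how it is created: plain assignment binds a second name to the same object, whereas `DataFrame.copy()` produces an independent frame. A minimal sketch on toy data (the variable names are illustrative):

```python
import pandas as pd

original = pd.DataFrame({'Borough': ['North York', 'Not assigned']})

alias = original                # same object under a second name
independent = original.copy()   # separate object with its own data

# A change made through the alias is visible in the original
alias.loc[0, 'Borough'] = 'Changed'

print(alias is original)               # True
print(original.loc[0, 'Borough'])      # Changed
print(independent.loc[0, 'Borough'])   # North York
```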

### 1.4 Dataframe manipulation

There are `77` "Not assigned" values in the **`Borough`** column. Since only cells with an assigned borough are to be processed, every row whose `Borough` is `Not assigned` is dropped as shown below:

In [18]:
# df_na is the output of the dataframe after dropping the Borough "Not assigned" rows
df_na = dff[dff.Borough != 'Not assigned']
df_na.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |


In [19]:
# Shape of dataframe after dropping the Borough "Not assigned" rows
df_na.shape

(212, 3)
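The row count is consistent with the earlier figures: 289 - 77 = 212. The same check can be sketched on a toy frame, where the number of rows removed should equal the number of "Not assigned" boroughs counted beforehand. The frame `toy` and the other names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Boolean mask marking the rows whose borough is missing
unassigned = toy['Borough'] == 'Not assigned'

# Keep everything else; .copy() makes the result independent of "toy"
kept = toy[~unassigned].copy()

print(unassigned.sum())  # 1
print(kept.shape)        # (2, 3)
```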

To find out how many rows have a borough name but a **`Not assigned`** neighbourhood, the following was done:

In [20]:
(df_na['Neighbourhood'] == 'Not assigned').sum()

1

So only `1` neighbourhood is `Not assigned`, and it will be replaced by its borough name.

In [21]:
# Using numpy was faster for this replacement
import numpy as np
# An explicit copy avoids pandas' SettingWithCopyWarning, since df_na was created by filtering dff
df_na = df_na.copy()
df_na['Neighbourhood'] = np.where(df_na['Neighbourhood'] == 'Not assigned', df_na['Borough'], df_na['Neighbourhood'])


In [22]:
df_na.shape

(212, 3)
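The same replacement can also be written with a boolean mask and `.loc`, which assigns directly on the frame. This is a sketch on toy data, not the notebook's own method:

```python
import pandas as pd

toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Rows whose neighbourhood is missing
mask = toy['Neighbourhood'] == 'Not assigned'

# Copy the borough name into the neighbourhood column for those rows only
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

print(toy['Neighbourhood'].tolist())  # ['Parkwoods', "Queen's Park"]
```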

In [23]:
df_na.head(5)

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |


The shape of the dataframe `df_na` is still the same - `(212, 3)`.

In [24]:
type(df_na)

pandas.core.frame.DataFrame

Next, postal codes that contain more than one neighbourhood are handled by combining their rows into a single row per `Postcode` and `Borough`, with the neighbourhood names separated by commas. For example, **`M5A`** is listed twice, with the neighbourhoods **Harbourfront** and **Regent Park**; these two rows are combined into one, which is verified further below.

In [25]:
# df_ca is the output of combining the neighbourhoods that share a postcode
df_ca = df_na.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df_ca.columns = ['Postcode', 'Borough', 'Neighbourhood']
df_ca.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
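To see what the grouping does in isolation, here is a sketch on three made-up rows. It uses `agg` with `', '.join`, which for this purpose behaves like the `apply` call in the cell above; `toy` and `merged` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One group per (Postcode, Borough) pair; the neighbourhood names in each
# group are joined into a single comma-separated string
merged = (toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
             .agg(', '.join)
             .reset_index())

print(merged.shape)  # (2, 3)
print(merged.loc[merged['Postcode'] == 'M5A', 'Neighbourhood'].iloc[0])
# Harbourfront, Regent Park
```

Because `groupby` sorts the group keys by default, the result is ordered by postcode rather than by the original row order.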


In [26]:
# df_ca is assigned to df_toronto
df_toronto = df_ca

In [27]:
type(df_toronto)

pandas.core.frame.DataFrame

In [28]:
# df_toronto is the final dataframe
df_toronto.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [29]:
df_toronto.to_csv('TorontoPostcodes.csv', index = False)
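The `index = False` argument keeps the row labels out of the file. The sketch below, which writes to in-memory buffers instead of disk and uses a one-row toy frame, shows the difference: when the index is written, reading the file back yields an extra column that pandas names `Unnamed: 0`.

```python
import io
import pandas as pd

toy = pd.DataFrame({'Postcode': ['M1B'], 'Borough': ['Scarborough']})

# Default behaviour: the row labels are written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

# Rewind both buffers before reading them back
with_index.seek(0)
without_index.seek(0)

cols_with_index = list(pd.read_csv(with_index).columns)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Postcode', 'Borough']
print(cols_without_index)  # ['Postcode', 'Borough']
```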

The merging of neighbourhoods that share a postcode was verified for `M5A` as follows:

In [30]:
df_toronto.loc[df_toronto['Postcode'] == 'M5A']

|    | Postcode | Borough | Neighbourhood |
|----|----------|---------|---------------|
| 53 | M5A | Downtown Toronto | Harbourfront, Regent Park |


In [31]:
# Final Data frame shape
df_toronto.shape

(103, 3)
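A few sanity checks can confirm that a final frame has the intended form: one row per postcode and no remaining "Not assigned" values. The helper `check_final_frame` below is hypothetical, and it is exercised on a toy frame rather than on `df_toronto`.

```python
import pandas as pd


def check_final_frame(frame):
    """Hypothetical helper: return a list of problems found in the merged table."""
    problems = []
    if not frame['Postcode'].is_unique:
        problems.append('duplicate postcodes')
    if (frame['Borough'] == 'Not assigned').any():
        problems.append('unassigned boroughs')
    if (frame['Neighbourhood'] == 'Not assigned').any():
        problems.append('unassigned neighbourhoods')
    return problems


good = pd.DataFrame({
    'Postcode': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront, Regent Park'],
})

bad = pd.DataFrame({
    'Postcode': ['M5A', 'M5A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Harbourfront', 'Not assigned'],
})

print(check_final_frame(good))  # []
print(check_final_frame(bad))   # ['duplicate postcodes', 'unassigned neighbourhoods']
```

In the notebook, the equivalent call would be `check_final_frame(df_toronto)`, with an empty list indicating that all checks passed.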

<a id='part2'></a>

## Part 2 - Geocoding

<a id='part3'></a>

## Part 3 - Exploration and Clustering