# **SEGMENTING AND CLUSTERING IN TORONTO**
###### _PEER GRADED ASSIGNMENT_
###### By Ruben de las Heras





### TABLE OF CONTENTS

1. Web Scraping and Data Cleansing

#### 1. WEB SCRAPING AND DATA CLEANSING

First, we import the libraries used throughout the assignment. The inline comments describe what each one is for.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle HTTP requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (the pandas.io.json import path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

!pip install beautifulsoup4 # tool to scrape data from a website
from bs4 import BeautifulSoup

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


We download the webpage with the *requests* library. *requests.get* returns a response object, and appending *.text* gives us the HTML source as a string.

In [33]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
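
The download step can be made a little more defensive. The following is only a minimal sketch: the helper name `fetch_html` and the 10-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import requests

WIKI_URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

def fetch_html(url, timeout=10):
    """Return the page body as text (hypothetical helper, not from the notebook)."""
    response = requests.get(url, timeout=timeout)  # Response object, not yet a string
    response.raise_for_status()                    # raises requests.HTTPError on 4xx/5xx
    return response.text                           # decoded HTML source

# Usage (performs a network call):
# source = fetch_html(WIKI_URL)
```

Raising on a bad status code keeps an error page from being parsed as if it were the postal code table.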

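We parse the HTML source with *BeautifulSoup*. The *'html5lib'* parser used here is a separate package and has to be installed (for example with *pip install html5lib*).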

In [3]:
soup = BeautifulSoup(source, 'html5lib')

Inspecting the HTML shows that the data we need sits inside a *table* tag, so we extract it. Note that *find* returns only the first match.

In [4]:
table = soup.find('table')
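
Because `find` returns the first match, it only works here as long as the postal code table is the first `<table>` on the page. The sketch below uses a hypothetical miniature page (not the real Wikipedia markup) to show the difference between taking the first table and selecting one by CSS class. The class name `wikitable` is an assumption and should be checked against the actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page with two tables (not the real Wikipedia markup)
html = """
<table id="nav"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable"><tr><th>Postal code</th></tr></table>
"""

# 'html.parser' ships with Python; it is used here only to avoid extra dependencies
demo_soup = BeautifulSoup(html, 'html.parser')

first_table = demo_soup.find('table')                     # first <table> in document order
wiki_table = demo_soup.find('table', class_='wikitable')  # first <table> carrying that class

print(first_table.get('id'))  # nav
print(wiki_table.th.text)     # Postal code
```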

Note: the raw output is hard to read, so *prettify* can be used to print it with one tag per line and indentation.

In [5]:
# print(table.prettify())
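
To illustrate what `prettify` changes, here is a small sketch on a hypothetical one-row table standing in for the real page. Slicing the result is just one possible way to keep a long table from flooding the notebook output.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table standing in for the real page
demo = BeautifulSoup('<table><tr><td>M1A</td></tr></table>', 'html.parser')

raw = str(demo.table)           # everything on a single line
pretty = demo.table.prettify()  # one tag per line, indented by nesting depth

print(raw)
print(pretty[:200])             # print only the first 200 characters
```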

In the table, each row is a *tr* element. We start building our dataset by extracting the rows.

In [6]:
rows = table.find_all('tr')

We extract the column headers (the *th* tags in the first row) and strip any *'\n'* from their text.

In [7]:
columns = [i.text.replace('\n', '')
           for i in rows[0].find_all('th')]
columns

['Postal code', 'Borough', 'Neighborhood']
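
An alternative to replacing `'\n'` by hand is Beautiful Soup's `get_text(strip=True)`, which trims leading and trailing whitespace. The sketch below runs on a hypothetical header row shaped like the scraped one. It is an option, not what the notebook itself does.

```python
from bs4 import BeautifulSoup

# Hypothetical header row with trailing newlines, similar in shape to the scraped one
header_html = ('<table><tr><th>Postal code\n</th><th>Borough\n</th>'
               '<th>Neighborhood\n</th></tr></table>')
header_row = BeautifulSoup(header_html, 'html.parser').find('tr')

# get_text(strip=True) trims surrounding whitespace, including '\n'
demo_columns = [th.get_text(strip=True) for th in header_row.find_all('th')]
print(demo_columns)  # ['Postal code', 'Borough', 'Neighborhood']
```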

Now that we have the column names, we can create an empty dataframe with them.

In [8]:
df = pd.DataFrame(columns = columns)
df

| | Postal code | Borough | Neighborhood |
|---|---|---|---|


We loop over every row and collect its cell values. The first row *(rows[0])* is skipped because it holds the header. Each cell is stripped of *'\n'* and non-breaking spaces, and the collected rows are then loaded into the dataframe *df*.


In [10]:
records = []
for row in rows[1:]:
    # clean every cell: drop newlines and non-breaking spaces
    values = [td.text.replace('\n', '').replace('\xa0', '') for td in row.find_all('td')]

    # keep only rows whose number of cells matches the header
    if len(values) == len(columns):
        records.append(values)

# DataFrame.append was removed in pandas 2.0, so the dataframe is built in one step
df = pd.DataFrame(records, columns = columns)

Check the shape of the *df*.


In [11]:
df.shape

(180, 3)

Check first rows.

In [13]:
df.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


Rename the first column to match the example.

In [16]:
df.rename({'Postal code': 'PostalCode'}, axis=1, inplace=True)
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


Count the *'Not assigned'* values in the **Borough** column.

In [17]:
(df['Borough'] == 'Not assigned').sum()

77
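
The same count can be read from `value_counts`, which also shows how the rows are distributed across all boroughs. This is a minimal sketch on hypothetical toy rows, not on the scraped data.

```python
import pandas as pd

# Hypothetical toy rows, not the scraped data
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York',
                'North York', 'Downtown Toronto'],
})

counts = toy['Borough'].value_counts()  # occurrences of every distinct borough
print(counts)
print(counts['Not assigned'])           # same figure as (toy['Borough'] == 'Not assigned').sum()
```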

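Create a copy of the dataframe before manipulating it, so the original stays intact. A plain assignment would only create a second name for the same object, so we use *copy()*.

In [18]:
dff = df.copy()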
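
The difference between binding a second name and making an independent copy can be seen on a small, hypothetical dataframe. The sketch below is illustrative only and uses toy data.

```python
import pandas as pd

original = pd.DataFrame({'Borough': ['North York', 'Not assigned']})  # hypothetical toy data

alias = original            # second name for the same object
snapshot = original.copy()  # independent copy (deep by default)

original.loc[0, 'Borough'] = 'CHANGED'

print(alias.loc[0, 'Borough'])     # follows the change
print(snapshot.loc[0, 'Borough'])  # keeps the old value
```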

Remove the rows whose **Borough** is _'Not assigned'_.

In [19]:
df_na = dff[dff.Borough != 'Not assigned']
df_na.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |


Check that the 77 *'Not assigned'* rows are gone: 180 - 77 = 103.

In [20]:
df_na.shape

(103, 3)
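
The arithmetic behind this check (rows before, minus rows removed, equals rows after) can also be asserted in code. This is a sketch on hypothetical toy data. The `reset_index(drop=True)` call is an optional extra that renumbers the remaining rows from 0, which the notebook itself does not do.

```python
import pandas as pd

# Hypothetical toy rows, not the scraped data
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
})

n_missing = (toy['Borough'] == 'Not assigned').sum()
kept = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# rows before - rows removed = rows after (180 - 77 = 103 in the notebook)
assert len(toy) - n_missing == len(kept)
print(kept)
```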

Check if there are *'Not assigned'* in **Neighborhoods**.

In [21]:
(df_na['Neighborhood'] == 'Not assigned').sum()

0

We now have to combine *Neighborhoods* that share *PostalCode* and *Borough*. 

In [24]:
df_ca = df_na.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(','.join).reset_index()
df_ca.columns = ['PostalCode', 'Borough', 'Neighborhood']
df_ca.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park / Ionview / East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile / Clairlea / Oakridge |
| 8 | M1M | Scarborough | Cliffside / Cliffcrest / Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff / Cliffside West |
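
The `groupby` and `join` combination only has a visible effect when a postal code spans several rows. The sketch below shows this on a hypothetical layout with one neighborhood per row. The data is invented for illustration and does not come from the scraped table.

```python
import pandas as pd

# Hypothetical layout in which one postal code spans several rows
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M1B'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'Scarborough'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Malvern'],
})

merged = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .apply(', '.join)  # concatenate the neighborhood names within each group
       .reset_index()
)
print(merged)
```

Groups with a single row pass through unchanged, which is what happens when every postal code already occupies exactly one row.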


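The source table already lists all the neighborhoods of a postal code in a single cell, separated by *' / '*, so the *groupby* had nothing to join. We therefore replace each *'/'* with a comma.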

In [27]:
df_ca = df_ca.apply(lambda x: x.str.replace("/",","))
df_ca.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern , Rouge |
| 1 | M1C | Scarborough | Rouge Hill , Port Union , Highland Creek |
| 2 | M1E | Scarborough | Guildwood , Morningside , West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
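
Replacing only the `'/'` character leaves a space before each comma (`'Malvern , Rouge'`) and runs over every column. One possible refinement, sketched here on hypothetical toy rows, is to restrict the replacement to the **Neighborhood** column and to replace the whole `' / '` separator.

```python
import pandas as pd

# Hypothetical toy rows, not the scraped data
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1G'],
    'Neighborhood': ['Malvern / Rouge', 'Woburn'],
})

# regex=False treats ' / ' as a literal string; including the spaces avoids 'Malvern , Rouge'
toy['Neighborhood'] = toy['Neighborhood'].str.replace(' / ', ', ', regex=False)
print(toy)
```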


We spot-check one *PostalCode* to confirm the result.

In [30]:
df_ca.loc[df_ca['PostalCode'] == 'M5A']

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 53 | M5A | Downtown Toronto | Regent Park , Harbourfront |


Check the shape of the dataframe.

In [31]:
df_ca.shape

(103, 3)
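
The shape is unchanged at 103 rows after grouping, which suggests that every postal code already had a single row. Beyond the shape, a few sanity checks could be run on the final dataframe. The sketch below uses hypothetical toy data, and the check names are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows standing in for the final dataframe
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M5A'],
    'Borough': ['Scarborough', 'Scarborough', 'Downtown Toronto'],
    'Neighborhood': ['Malvern, Rouge', 'Rouge Hill', 'Regent Park, Harbourfront'],
})

checks = {
    'unique_postal_codes': toy['PostalCode'].is_unique,
    'no_unassigned_borough': not (toy['Borough'] == 'Not assigned').any(),
    'no_missing_values': not toy.isna().any().any(),
}
print(checks)
```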