# IBM Certificate: Capstone Project

_This notebook contains the code for the Capstone project (IBM Certificate)_

## 1. Segmenting and Clustering Neighborhoods in Toronto

### 1.1 Importing the data

We import the following libraries:
- Pandas
- NumPy
- Requests


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import requests

The data we want are the **postal codes** in Canada, in particular the ones corresponding to **Toronto** in the province of Ontario. These data are available on Wikipedia, so the aim is to scrape them from the web page.

This will be done using the library **_requests_**.

In [2]:
# URL of the wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# Extract the content
r = requests.get(url)

Let's check what `r` contains so that we know how to retrieve, clean and organise the data.

In [3]:
# Check the status code
print('Status code: ', r.status_code)
# Check the encoding
print('Encoding: ', r.encoding)
# Check the data type
print('Data type: ', type(r))
# Check the header
print('Headers: ', r.headers['content-type'])

Status code:  200
Encoding:  UTF-8
Data type:  <class 'requests.models.Response'>
Headers:  text/html; charset=UTF-8
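
The checks above can be folded into a small guard that runs before any parsing. The sketch below is one possible way to do it; `is_html_ok` is a hypothetical helper name introduced here for illustration and is not part of **_requests_**.

```python
def is_html_ok(status_code, content_type):
    """Return True for a successful response whose body is HTML.

    Hypothetical helper, not part of the requests library.
    """
    return status_code == 200 and content_type.lower().startswith('text/html')


# Values printed by the cell above
print(is_html_ok(200, 'text/html; charset=UTF-8'))  # True
# A missing page should not be parsed
print(is_html_ok(404, 'text/html; charset=UTF-8'))  # False
```

With the response above, the call would be `is_html_ok(r.status_code, r.headers['content-type'])`. As an alternative, `r.raise_for_status()` raises an exception for 4xx and 5xx status codes.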


We need to read **HTML** content. Let's use `read_html` from the **Pandas** library to retrieve the data. It returns a list of dataframes, one for each table found in the page.

In [4]:
raw_data = pd.read_html(r.text)

In [5]:
postal_codes = raw_data[0]

print('\n\n Dimension of the dataframe: ', postal_codes.shape)
postal_codes.head()



 Dimension of the dataframe:  (180, 3)


| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
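
Taking `raw_data[0]` relies on the postal code table being the first one in the page. A slightly more defensive option is to select the table by its column names. The sketch below assumes the column names shown above; `find_table` is a hypothetical helper, and the two small dataframes are toy stand-ins for the parsed tables.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame that has every required column.

    Hypothetical helper: `tables` is a list of DataFrames, such as the
    list returned by pd.read_html.
    """
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError('No table with columns {}'.format(list(required_columns)))


# Toy stand-ins for the tables parsed from a page (illustrative data only)
navigation = pd.DataFrame({'Link': ['Home', 'About']})
codes = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

selected = find_table([navigation, codes], ['Postal Code', 'Borough', 'Neighborhood'])
print(selected.shape)  # (2, 3)
```

In the notebook this would be called as `find_table(raw_data, ['Postal Code', 'Borough', 'Neighborhood'])`. `pd.read_html` also accepts a `match` argument that keeps only the tables containing a given text.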


### 1.2 Filtering the data

We need to filter the data:
- rows whose borough is _'Not assigned'_ are removed
- a borough with no corresponding neighborhood gets the borough name as its neighborhood
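
The two rules above can be sketched on a toy dataframe. The postal codes below are made up for illustration and do not come from the Wikipedia table.

```python
import pandas as pd

# Toy data (illustrative only, not taken from the Wikipedia table)
toy = pd.DataFrame({
    'Postal Code': ['X1A', 'X2A', 'X3A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: keep only the rows with an assigned borough
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Rule 2: where the neighborhood is missing, reuse the borough name
missing = toy['Neighborhood'] == 'Not assigned'
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

print(toy['Neighborhood'].tolist())  # ['Parkwoods', "Queen's Park"]
```

The `.copy()` after the first filter makes the later assignment act on an independent dataframe rather than on a view of the original.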

In [6]:
# Keep only the rows whose borough is assigned
filter_1 = (postal_codes['Borough'] != 'Not assigned')
print('{} rows have a borough not assigned and need to be removed.'.format((~filter_1).sum()))
postal_codes = postal_codes[filter_1]

77 rows have a borough not assigned and need to be removed.


In [7]:
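# Flag the rows whose neighborhood is not assigned
filter_2 = (postal_codes['Neighborhood'] == 'Not assigned')
print('{} boroughs have no corresponding neighborhood.'.format(filter_2.sum()))
# Use the borough name as the neighborhood for those rows
postal_codes = postal_codes.copy()
postal_codes.loc[filter_2, 'Neighborhood'] = postal_codes.loc[filter_2, 'Borough']

0 boroughs have no corresponding neighborhood.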


In [27]:
print(' The dataframe is now of dimension {}.'.format(postal_codes.shape))
postal_codes.head()

 The dataframe is now of dimension (103, 3).


| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


### 1.3 Adding the latitude and longitude of each postal code

A CSV file with the latitude and longitude of each postal code is available at the following address: https://cocl.us/Geospatial_data

In [14]:
url_lat_lng = 'https://cocl.us/Geospatial_data'
lat_lng = pd.read_csv(url_lat_lng)

Let's preview the imported data.

In [20]:
lat_lng.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Now let's **merge** the dataframe **postal_codes** with the table **lat_lng**, which contains the latitude and longitude of each postal code.

In [22]:
df = pd.merge(postal_codes, lat_lng, how='left', on='Postal Code', sort=True, validate='1:1')
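
A left merge keeps every row of the left dataframe, so a postal code without coordinates would silently end up with `NaN` values. The sketch below shows one way to detect such rows with the `indicator` option of `pd.merge`. The frames are toy data: the two coordinates are copied from the preview above, while the borough labels and the code `X9Z` are made up for illustration.

```python
import pandas as pd

# Toy data (illustrative only): 'X9Z' has no coordinates on the right side
left = pd.DataFrame({'Postal Code': ['M1B', 'M1C', 'X9Z'],
                     'Borough': ['A', 'B', 'C']})
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# indicator=True adds a '_merge' column that records where each row came from
merged = pd.merge(left, right, how='left', on='Postal Code',
                  validate='1:1', indicator=True)

# Rows tagged 'left_only' found no match and carry NaN coordinates
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['Postal Code'].tolist())  # ['X9Z']
```

The `validate='1:1'` argument covers the opposite problem: the merge raises a `MergeError` if a postal code appears more than once in either dataframe.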

Let's verify that the latitude and longitude were added for each postal code and that the resulting dataframe has dimension (103, 5), as expected.

In [48]:
print('After merging the data, the new dataframe is of dimension {}.'.format(df.shape))
print('\nFirst 10 rows sorted by \'Postal Code\' in descending order')
df.sort_values('Postal Code', ascending=False).head(10)

After merging the data, the new dataframe is of dimension (103, 5).

First 10 rows sorted by 'Postal Code' in descending order


| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 102 | M9W | Etobicoke | Northwest, West Humber - Clairville | 43.706748 | -79.594054 |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... | 43.739416 | -79.588437 |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... | 43.688905 | -79.554724 |
| 99 | M9P | Etobicoke | Westmount | 43.696319 | -79.532242 |
| 98 | M9N | York | Weston | 43.706876 | -79.518188 |
| 97 | M9M | North York | Humberlea, Emery | 43.724766 | -79.532242 |
| 96 | M9L | North York | Humber Summit | 43.756303 | -79.565963 |
| 95 | M9C | Etobicoke | Eringate, Bloordale Gardens, Old Burnhamthorpe... | 43.643515 | -79.577201 |
| 94 | M9B | Etobicoke | West Deane Park, Princess Gardens, Martin Grov... | 43.650943 | -79.554724 |
| 93 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
