---

# Final Assignment of the 3rd Week of the Applied Data Science Capstone Project

#### By : Achraf Ougdal 

---

#### In this assignment we will explore and cluster the neighborhoods in Toronto based on their postal code and borough information.
#### The dataset isn't available as a ready-made file, so we need to scrape the Wikipedia page that contains the data.
##### This is the link to the Wikipedia page : https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

---

#### Let's Start by Importing the required libraries

In [1]:
import pandas as pd
import numpy as np
from pandas import json_normalize  # top-level location since pandas 1.0 (older releases: pandas.io.json)
from bs4 import BeautifulSoup
import requests

---

### 1st Section : Scraping, creating, and cleaning our dataset

#### 1. Scraping Data

In [2]:
# defining the url of the web page
data_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# storing the page as raw html into a variable
html_data = requests.get(data_url).text

# Parsing html data with BeautifulSoup
parsed_data = BeautifulSoup(html_data, 'html.parser')

# Printing the title to ensure that everything is okay
print(parsed_data.title)

<title>List of postal codes of Canada: M - Wikipedia</title>


##### Now that we have the html code of the page, let's extract the table containing the data

The table containing the data is the first table on the page, so we can get it directly with the find method, which returns the first match (in other words, there is no need to use find_all and pick an element from the result).
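
To make the difference between the two methods concrete, here is a minimal sketch on a tiny hand-written HTML snippet. The snippet and the `id` values are made up for illustration; they are not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical two-table page, standing in for the Wikipedia HTML.
html = """
<table id="codes"><tr><th>Postal Code</th></tr><tr><td>M3A</td></tr></table>
<table id="nav"><tr><td>navigation</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

first_table = soup.find("table")      # first match only (None if there is no match)
all_tables = soup.find_all("table")   # every match, in document order

print(first_table["id"])   # codes
print(len(all_tables))     # 2
```

If the layout of the page ever changes and the data table is no longer the first one, `find` would silently return a different table, so printing the headers (as we do in the next cell) is a cheap sanity check.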

#### 2. Creating our DataFrame

In [3]:
# finding the table
wiki_table = parsed_data.find('table')

# extracting the headers
headers = []
for th in wiki_table.find_all('th'):
    headers.append(th.text.replace('\n', '').strip())

# print the headers
print( "The Headers of the table are : ", headers)

# extracting data
data = []
for row in wiki_table.find_all('tr')[1:]: # [1:] to skip the first row (the first row is the headers)
    row_data = {}
    for header, td in zip(headers, row.find_all('td')):
        row_data[header] = td.text.replace('\n', '').strip()
    data.append(row_data)

# Transform the data to a data frame
wiki_table = json_normalize(data)

wiki_table.head()

The Headers of the table are :  ['Postal Code', 'Borough', 'Neighbourhood']


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


#### 3. Cleaning Our DataFrame

In [4]:
# Print the shape of the dataset
print("The Dataset has ", wiki_table.shape[0], " rows and ",  wiki_table.shape[1], " columns")

# check for duplicates in the "Postal Code" column
print("There are ", len(wiki_table['Postal Code'].unique()), " unique values in the Postal Code column")

The Dataset has  180  rows and  3  columns
There are  180  unique values in the Postal Code column


The number of unique postal codes equals the number of rows, so our dataset contains no duplicates in the Postal Code column.
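
As a side note, pandas can answer the duplicate question directly, without comparing counts by hand. The sketch below uses a toy frame with one deliberately repeated code (made-up data, not the scraped table).

```python
import pandas as pd

# Toy frame with one deliberately repeated postal code (hypothetical data).
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M4A"],
    "Borough": ["North York", "North York", "North York"],
})

# is_unique is True only when no value is repeated
has_duplicates = not toy["Postal Code"].is_unique

# keep=False flags every occurrence of a repeated value, not just the later ones
repeated_rows = toy[toy["Postal Code"].duplicated(keep=False)]

print(has_duplicates)       # True
print(len(repeated_rows))   # 2
```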

##### The assignment instructions ask us to process only the cells that have an assigned borough and to ignore cells whose borough is "Not assigned".

Let's drop all rows whose borough is "Not assigned".

In [5]:
# count the number of rows for each borough
wiki_table['Borough'].value_counts()

Not assigned        77
North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East Toronto         5
East York            5
York                 5
Mississauga          1
Name: Borough, dtype: int64

In [6]:
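# Replace "Not assigned" with NaN to make the dropping easier
# (assign the result back; calling replace(inplace=True) on a selected column is unreliable in recent pandas)
wiki_table['Borough'] = wiki_table['Borough'].replace("Not assigned", np.nan)
wiki_table.dropna(subset=['Borough'], inplace=True)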

# reset index and Print the new data set
wiki_table.reset_index(drop=True, inplace=True)
wiki_table.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
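
The same filtering can also be written as a single boolean mask, without going through NaN. This is only an alternative sketch on three rows copied from the tables above, not a change to the cell we ran.

```python
import pandas as pd

# Three rows copied from the scraped table shown earlier.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose borough is assigned, then renumber the index
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

print(assigned["Postal Code"].tolist())  # ['M3A', 'M4A']
```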


##### Let's check if there's a row with a "Not assigned" Neighbourhood

In [7]:
wiki_table[wiki_table['Neighbourhood'] == "Not assigned"]

Unnamed: 0,Postal Code,Borough,Neighbourhood


#### Since there are no rows with a "Not assigned" Neighbourhood, our dataset is now clean. Let's print the dataset along with its shape

In [8]:
print("\nThe cleaned Dataset has ", wiki_table.shape[0], " rows and ",  wiki_table.shape[1], " columns\n")
# Sort Values By Postal Code
wiki_table.sort_values("Postal Code", inplace=True)
wiki_table.head()


The cleaned Dataset has  103  rows and  3  columns



Unnamed: 0,Postal Code,Borough,Neighbourhood
6,M1B,Scarborough,"Malvern, Rouge"
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
18,M1E,Scarborough,"Guildwood, Morningside, West Hill"
22,M1G,Scarborough,Woburn
26,M1H,Scarborough,Cedarbrae


---

### 2nd Section : Get the latitude and the longitude coordinates of each neighborhood. 

#### For this task, we will use the csv file that has the geographical coordinates of each postal code

In [9]:
postal_code_coords = pd.read_csv("https://cocl.us/Geospatial_data")

In [10]:
postal_code_coords.sort_values("Postal Code", inplace=True)
postal_code_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [11]:
# Get the Coordinates of the postal codes that exist in the wiki_table dataframe
postal_codes = wiki_table["Postal Code"].unique()
postal_code_coords = postal_code_coords[postal_code_coords["Postal Code"].isin(postal_codes)]

In [12]:
postal_code_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
# merge the coords dataframe with the wiki_table dataframe on "Postal Code" (inner join) and store the result in a dataframe called dataset
dataset = pd.merge(left=wiki_table, right=postal_code_coords, on='Postal Code')
dataset.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
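
A small sketch of why the inner join is safe here: an inner merge keeps only the keys present in both frames, and the `validate` argument makes pandas raise an error if a postal code appears more than once on either side. The `"M9Z"` row below is a made-up extra code used only to show that unmatched rows disappear.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],   # "M9Z" is a made-up extra code
    "Latitude": [43.806686, 43.784535, 0.0],
    "Longitude": [-79.194353, -79.160497, 0.0],
})

# validate="one_to_one" raises MergeError if a key is duplicated in either frame
merged = pd.merge(left, right, on="Postal Code", how="inner", validate="one_to_one")

print(merged.shape)                    # (2, 4)
print(merged["Postal Code"].tolist())  # ['M1B', 'M1C']
```

Because the inner join already drops codes that are missing from `wiki_table`, the `isin` filter in cell 11 is not strictly required; it mainly makes the intermediate frame easier to inspect.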


In [14]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 5 columns):
Postal Code      103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
Latitude         103 non-null float64
Longitude        103 non-null float64
dtypes: float64(2), object(3)
memory usage: 4.8+ KB


#### There are no missing values in any columns of our dataset
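
Besides reading the non-null counts from `info()`, the missing values can be counted explicitly. A minimal sketch on a toy frame with one missing latitude (the NaN is inserted on purpose):

```python
import numpy as np
import pandas as pd

# Toy frame with one latitude deliberately set to NaN.
toy = pd.DataFrame({
    "Latitude": [43.806686, np.nan],
    "Longitude": [-79.194353, -79.160497],
})

# isna() gives a boolean frame; summing it counts the missing values per column
missing_per_column = toy.isna().sum()

print(missing_per_column["Latitude"])   # 1
print(missing_per_column["Longitude"])  # 0
```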

#### Moving on to the next section.

---

### 3rd Section : Exploring and clustering the neighborhoods in Toronto.

#### **NB : We will use all Boroughs for the clustering**

For this section we will cluster neighborhoods of Toronto.
Let's start by making a list of what we will do and what we need.

1. Identify Features and unnecessary columns
2. Plot the data points on a map to see the position of each neighborhood
3. Clean and Transform our dataset
4. Normalize and Scale our dataset
5. Cluster the neighborhoods
6. Visualize the results

#### 1. Identify Features and unnecessary columns

- **What are the columns that we should include in the features set** ?

> Let's start with the index: which column should be the index of our dataframe ?

Since we are clustering neighborhoods in Toronto, the natural choice for the index is the "Neighbourhood" column.

> But some values in the "Neighbourhood" column contain multiple names separated by commas, shouldn't we split them into separate rows ?

The answer is NO. Take for example the neighborhoods Rouge Hill, Port Union and Highland Creek: all three share the same postal code, the same borough and the same coordinates, so splitting them and plotting each one would put three markers on exactly the same point. There is therefore no need to split those values.
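
To illustrate the point, here is a small sketch that does split the row for postal code M1C and shows that the three resulting rows still describe a single location. It is only a demonstration; we do not apply it to our dataset.

```python
import pandas as pd

# The M1C row from our dataset.
row = pd.DataFrame({
    "Neighbourhood": ["Rouge Hill, Port Union, Highland Creek"],
    "Latitude": [43.784535],
    "Longitude": [-79.160497],
})

# Split the comma-separated names into a list, then give each name its own row
split = row.assign(
    Neighbourhood=row["Neighbourhood"].str.split(", ")
).explode("Neighbourhood")

print(len(split))                                                   # 3
print(split[["Latitude", "Longitude"]].drop_duplicates().shape[0])  # 1
```

Three rows, one distinct coordinate pair: the extra rows would only give that location three times the weight in the clustering.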

> Okay, now what are the features that we will use to cluster the neighborhoods ?

Let's start with the "Postal Code" column. It is a categorical variable with 103 unique values, one per row, so creating dummy variables for it would add 103 columns that each identify a single row. Each postal code also has exactly one latitude, one longitude and one Neighbourhood entry (no neighborhood has more than one postal code), so the column carries no extra information and we can drop it.

For the Borough column, we will transform it into numeric indicator columns (dummy variables, one 0/1 column per borough) and use them as features.

The Latitude and Longitude columns are already numeric, so we will use them as features directly.

##### Okay, we will make the Neighbourhood column the index and drop the Postal Code column in step 3, right after plotting the data on a map

#### 2. Plot the data points on a map to see the position of each neighborhood

In [15]:
#!pip install folium
import folium


# Plot the map of Toronto
toronto_map = folium.Map(location=[43.7043521,-79.4907077],tiles='Stamen Terrain', zoom_start=11)

# Add one marker per postal code, labelled with its neighbourhood name(s)
for lat, long, label in zip(dataset['Latitude'], dataset['Longitude'], dataset["Neighbourhood"]):
    folium.Marker([lat, long], popup=label, tooltip=label).add_to(toronto_map)

toronto_map

#### 3. Clean and Transform our dataset

In [16]:
# Set the Neighbourhood column as index
dataset.set_index('Neighbourhood', inplace=True)
dataset.head()

Neighbourhood,Postal Code,Borough,Latitude,Longitude
"Malvern, Rouge",M1B,Scarborough,43.806686,-79.194353
"Rouge Hill, Port Union, Highland Creek",M1C,Scarborough,43.784535,-79.160497
"Guildwood, Morningside, West Hill",M1E,Scarborough,43.763573,-79.188711
Woburn,M1G,Scarborough,43.770992,-79.216917
Cedarbrae,M1H,Scarborough,43.773136,-79.239476


In [17]:
# drop the Postal Code Column
dataset.drop("Postal Code", axis=1, inplace=True)
dataset.head()

Neighbourhood,Borough,Latitude,Longitude
"Malvern, Rouge",Scarborough,43.806686,-79.194353
"Rouge Hill, Port Union, Highland Creek",Scarborough,43.784535,-79.160497
"Guildwood, Morningside, West Hill",Scarborough,43.763573,-79.188711
Woburn,Scarborough,43.770992,-79.216917
Cedarbrae,Scarborough,43.773136,-79.239476


In [18]:
# Transform the Borough column into numeric indicator columns (one 0/1 dummy variable per borough)

dataset = pd.get_dummies(dataset, columns=['Borough'], prefix="", prefix_sep="")
dataset.head()

Neighbourhood,Latitude,Longitude,Central Toronto,Downtown Toronto,East Toronto,East York,Etobicoke,Mississauga,North York,Scarborough,West Toronto,York
"Malvern, Rouge",43.806686,-79.194353,0,0,0,0,0,0,0,1,0,0
"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,0,0,0,0,0,0,0,1,0,0
"Guildwood, Morningside, West Hill",43.763573,-79.188711,0,0,0,0,0,0,0,1,0,0
Woburn,43.770992,-79.216917,0,0,0,0,0,0,0,1,0,0
Cedarbrae,43.773136,-79.239476,0,0,0,0,0,0,0,1,0,0
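
One detail worth knowing: the output above shows 0/1 integers, but to my knowledge pandas 2.0 and later return True/False columns from `get_dummies` by default. Passing `dtype=int` keeps the 0/1 encoding on any version. A minimal sketch on three rows (the toy frame is just for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Borough": ["Scarborough", "North York", "Scarborough"]},
    index=["Woburn", "Parkwoods", "Cedarbrae"],
)

# dtype=int forces 0/1 integers instead of booleans
encoded = pd.get_dummies(toy, columns=["Borough"], prefix="", prefix_sep="", dtype=int)

print(encoded.columns.tolist())         # ['North York', 'Scarborough']
print(encoded.loc["Woburn"].tolist())   # [0, 1]
```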


#### 4. Normalize and Scale our dataset

In [19]:
# import the library
from sklearn.preprocessing import StandardScaler

cluster_dataset = StandardScaler().fit_transform(dataset)
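
To see what the scaling does, here is a small sketch on three coordinate pairs from our table plus a made-up 0/1 column. It also keeps the fitted scaler in a variable (`scaler` is a name I introduce here; the cell above does not keep it), which is handy if we later want to convert cluster centroids back to real latitude and longitude values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three coordinate pairs from the table above plus a made-up 0/1 indicator column.
X = np.array([
    [43.806686, -79.194353, 1.0],
    [43.784535, -79.160497, 0.0],
    [43.763573, -79.188711, 1.0],
])

scaler = StandardScaler()            # keep the fitted object for later reuse
X_scaled = scaler.fit_transform(X)

# After scaling, every column has mean 0 and standard deviation 1
print(np.allclose(X_scaled.mean(axis=0), 0.0))   # True
print(np.allclose(X_scaled.std(axis=0), 1.0))    # True

# inverse_transform maps scaled values back to the original units
print(np.allclose(scaler.inverse_transform(X_scaled), X))   # True
```

Note that the borough indicator columns are scaled too, so a rare borough (such as Mississauga with a single row) ends up with a very large value in its column, which makes that row stand out strongly in the distance computation.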

#### 5. Cluster the neighborhoods

> Which clustering technique should we use ?

Since we are going to do the same analysis as in the New York lab, we will use the K-Means algorithm.

In [20]:
from sklearn.cluster import KMeans

In [21]:
kmeans_model = KMeans(init="k-means++", n_clusters=4, n_init = 12)
kmeans_model.fit(cluster_dataset)

model_labels = kmeans_model.labels_
model_centroids = kmeans_model.cluster_centers_
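
We fixed the number of clusters at 4 without testing other values. One common way to check such a choice is to compare the inertia (the within-cluster sum of squared distances) for several values of k. The sketch below does this on synthetic data with three obvious groups, not on the Toronto data, and it sets `random_state` (which our cell above does not) so that the result is reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 2-D data: three well-separated blobs (not the Toronto data).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=12, random_state=0)
    inertias[k] = model.fit(X).inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the inertia drops sharply up to k=3 and only slowly afterwards; that bend is what people usually look for when picking k.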

In [22]:
print(model_labels)

[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 1 1 3
 3 3 3 3 3 1 3 3 3 3 3 3 0 3 0 0 0 0 0 0 0 0 1 1 3 0 0 0 0]


In [23]:
dataset["label"] = model_labels

In [24]:
print(len(list(dataset.index.values)))
print(len(dataset['Latitude']))
print(len(dataset['Longitude']))
print(len(dataset['label']))

103
103
103
103
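
Once the labels are attached, a quick summary per cluster helps to interpret them before plotting. A minimal sketch with four rows from our table and made-up labels (the real labels come from the model above):

```python
import pandas as pd

toy = pd.DataFrame({
    "Latitude": [43.806686, 43.784535, 43.763573, 43.770992],
    "Longitude": [-79.194353, -79.160497, -79.188711, -79.216917],
    "label": [2, 2, 1, 1],   # made-up labels, for illustration only
})

# Number of rows in each cluster
sizes = toy["label"].value_counts().sort_index()

# Average position of each cluster, in real coordinates
centres = toy.groupby("label")[["Latitude", "Longitude"]].mean()

print(sizes.to_dict())   # {1: 2, 2: 2}
print(centres)
```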


#### 6. Visualize the results

In [25]:
toronto_map = folium.Map(location=[43.7043521,-79.4907077],tiles='Stamen Terrain', zoom_start=11)

# one marker color per cluster label (4 clusters)
colors = ['red', 'blue', 'green', 'purple']
print(colors)
# Add one marker per postal code, colored by its cluster label
for lat, long, label in zip(dataset['Latitude'], dataset['Longitude'], dataset["label"]):
    folium.Marker(
        [lat, long],
        popup="Label :"+str(label),
        tooltip="Label :"+str(label),
        icon=folium.Icon(color=colors[label])).add_to(toronto_map)

toronto_map

['red', 'blue', 'green', 'purple']


---

### Now we can see on the map that our data points are grouped into 4 clusters, and the distribution of the points across the clusters makes sense

#### This concludes our Final Assignment, I hope you liked my work.
#### Thank you for your time and have a good day

---