# Applied Data Science Capstone
## IBM Data Science Professional Certificate - Course #9
### By: Ahmed Sadek

## Table of Contents
* [Introduction](#intro)
* [Scraping Wikipedia Table](#wiki)
* [Clustering](#clus)


<a id='intro' ></a>
## Introduction

>This notebook will be used to complete the final project **(Battle of the Neighbourhoods)** of the IBM Data Science Professional Certificate.

In [1]:
import pandas as pd
import numpy as np
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


<a id='wiki' ></a>
## Getting the data from Wikipedia

>Now we will scrape the table of boroughs and neighbourhoods from Wikipedia's list of Canadian postal codes beginning with M, which you can find [here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).


In [2]:
#link to the table.
link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#the read_html method returns a list of all tables in the html page
ls = pd.read_html(link,header=0)
#we're only interested in the first table, so we pick it as our dataframe
df = ls[0]
# checking to see everything worked smoothly.
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [3]:
#drop rows where Borough is Not assigned
df = df[df['Borough'] != 'Not assigned']
#confirm
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [4]:
#reset the index of the dataframe to have valid indices
df.reset_index(drop=True,inplace=True)
#confirm
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


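This concludes the initial cleaning: every column holds valid values and the index is contiguous again.

The rows keep the order of the Wikipedia table, which groups codes by their final letter (`M3A`, `M4A`, ... and then `M1B`, ...) rather than sorting them by the full **`Postal Code`** string.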
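
The two cleaning steps above can also be chained. The following is a minimal sketch on a toy dataframe (`raw` and `clean` are illustrative names and the rows are a hand-picked subset, not the scraped table), assuming the same column names as in `df`:

```python
import pandas as pd

# Toy stand-in for the scraped table (a few illustrative rows only)
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Drop unassigned boroughs and renumber the index in one chained expression
clean = raw[raw["Borough"] != "Not assigned"].reset_index(drop=True)

print(clean)
```

Chaining avoids `inplace=True` and leaves the original frame untouched, which can make it easier to re-run a cell.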

### Attaching geospatial data 
We can download the data by running the following command: `!wget -q -O 'Geospatial_Coordinates.csv' 'http://cocl.us/Geospatial_data'`

After downloading the file, we can load it with the pandas `read_csv` function.

In [5]:
geo = pd.read_csv('Geospatial_Coordinates.csv')
#confirm the structure
geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Then we can merge the two dataframes on their common **`Postal Code`** column to build a *full* dataframe. We use a pandas left merge (left join) with `df` on the left and `geo` on the right, so every row of `df` is kept.

In [6]:
#merge on the shared key explicitly rather than relying on pandas to infer it
full = pd.merge(df,geo,how='left',on='Postal Code')
#confirm
full.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
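
A left merge keeps rows from the left frame even when no coordinates are found, so it can be worth checking for unmatched postal codes. The sketch below uses toy frames (`left`, `right`, and the code `M9Z` are made up for illustration) together with the `indicator` and `validate` options of `pd.merge`:

```python
import pandas as pd

# Toy frames; the coordinates and the code "M9Z" are invented for illustration
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.75, 43.73],
    "Longitude": [-79.33, -79.32],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a `_merge` column saying where each row came from
merged = pd.merge(left, right, on="Postal Code", how="left",
                  validate="one_to_one", indicator=True)

# Rows present only on the left have no coordinates
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["Postal Code"].tolist())
```

If the list of unmatched codes is empty, every postal code received a latitude and longitude.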


<a id='clus'></a>
## Clustering

We will cluster the postal codes into 4 clusters, a small number chosen because only 39 rows belong to boroughs whose name contains **`Toronto`**.

In [10]:
#getting only Toronto Boroughs
tor = full[full.Borough.str.contains('Toronto')]
#confirming
tor.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


In [11]:
#number of observations
len(tor)

39

Now that the `tor` dataframe holds only the observations we're interested in, we'll first plot them with folium for a visual check.

In [14]:
#import folium and create a map centered around Toronto coordinates then view it
import folium
toronto_coordinates = [43.6532,-79.3832]
toronto_map = folium.Map(location=toronto_coordinates,zoom_start=12)
toronto_map

In [16]:
#Visualize our data points on the map using Folium.
for lat,lng,borough,neighbourhood in zip(tor['Latitude'],tor['Longitude'],tor['Borough'],tor['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    #parse_html belongs to folium.Popup, so it is not passed to CircleMarker
    folium.CircleMarker([lat,lng],radius=4,popup=label,color='blue',
        fill=True,fill_color='#3e35db',fill_opacity=0.7).add_to(toronto_map)
#view the map
toronto_map

Now we need to cluster the observations. We'll assume 4 clusters for simplicity and use the `KMeans` class from scikit-learn's `sklearn.cluster` module.

In [17]:
#import the KMeans class
from sklearn.cluster import KMeans
#set the number of clusters to 4
k = 4
#drop any unnecessary features and leave only the latitude and longitude columns to cluster by in the `tor` dataframe
clustering = tor.drop(['Postal Code','Borough','Neighbourhood'],axis=1)
clusters = KMeans(n_clusters=k,random_state=0).fit(clustering)
#add the cluster labels to the data frame
tor['Cluster'] = clusters.labels_
#confirm everything worked
tor.head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,3
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,3
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,3
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,3
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,0
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,3
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,3
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564,1
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568,3
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259,1


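Everything worked. The warning is not about broadcasting: `tor` was created by filtering `full`, so pandas cannot tell whether assigning the new `Cluster` column writes to an independent copy or to a view of `full`. The assignment still succeeded, as the output shows, and we don't reuse `full` afterwards, so it is fine for our purpose. Creating `tor` with an explicit `.copy()` would avoid the warning.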
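
As a minimal sketch of that fix, assuming toy data (`full_demo` and `tor_demo` are illustrative names, not the notebook's dataframes), taking an explicit copy after filtering makes the new column assignment unambiguous:

```python
import pandas as pd

# Toy stand-in for `full`; the values are illustrative
full_demo = pd.DataFrame({
    "Borough": ["Downtown Toronto", "North York", "East Toronto"],
    "Latitude": [43.65, 43.75, 43.68],
})

# .copy() makes tor_demo an independent dataframe, so adding a column
# to it is unambiguous and does not touch full_demo
tor_demo = full_demo[full_demo["Borough"].str.contains("Toronto")].copy()
tor_demo["Cluster"] = [0, 1]

print(tor_demo)
```

The original frame is left unchanged, and pandas has no reason to emit the warning.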

We'll assign a color to each cluster and visualize them via `folium`.

In [18]:
#assigning colors to cluster numbers.
color_dict = {0:'red',1:'blue',2:'yellow',3:'black'}

In [19]:
cluster_map = folium.Map(location=toronto_coordinates,zoom_start=10)
for lat, lon, neighbourhood, cluster in zip(tor['Latitude'], tor['Longitude'], tor['Neighbourhood'], tor['Cluster']):
    label = folium.Popup('Cluster #' + str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon],radius=4,popup=label,
        color=color_dict[cluster],fill=True,fill_color=color_dict[cluster],
        fill_opacity=0.7).add_to(cluster_map)
       
cluster_map

We can see that everything worked smoothly: the 4 clusters appear on the map, each in its own color.
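
Since the number of clusters was assumed rather than derived, one possible sanity check is to compare silhouette scores for several values of `k`. The sketch below runs on synthetic points scattered around three made-up centres (not the Toronto data), so the names `centres`, `points`, `scores`, and `best_k` are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic (latitude, longitude) points around three invented centres
centres = np.array([[43.65, -79.38], [43.68, -79.30], [43.67, -79.44]])
points = np.vstack([c + rng.normal(scale=0.005, size=(13, 2)) for c in centres])

# Fit KMeans for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# A higher silhouette score suggests tighter, better separated clusters
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On the real data, the same loop could be run on the `Latitude` and `Longitude` columns of `tor`; the score is only a heuristic, so a value of `k` chosen for interpretability may still be reasonable.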