# CAPSTONE PROJECT

This notebook is used mainly for the IBM Data Science Capstone project.

## Hello Capstone Project Course!

First, I import most of the libraries I will use.

In [25]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


To get the dataset, I converted the HTML table to a CSV file using http://wikitable2csv.ggor.de/, then downloaded it, uploaded it to my CC account, and read it.

Now, let's take a look at the shape of our dataframe.

In [26]:
df=pd.read_csv("/resources/Toronto.csv")
df.shape

(289, 3)

In [27]:
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


As we can see, rows with "Not assigned" in the Borough column carry no useful data, so we need to drop them.

In [28]:
df=df[df.Borough != "Not assigned"]
df.shape

(212, 3)
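The filter above relies on a boolean mask. A minimal sketch on a three-row frame (the `toy` and `toy_assigned` names are illustrative; the rows are borrowed from the preview above) shows what the mask keeps:

```python
import pandas as pd

# Illustrative miniature of the postcode table (rows copied from the preview).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# The comparison yields one True/False per row; indexing with it keeps the True rows.
mask = toy["Borough"] != "Not assigned"
toy_assigned = toy[mask].reset_index(drop=True)

print(toy_assigned.shape)
```

Only the Borough column is tested here, so M7A survives even though its neighbourhood is still "Not assigned".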

Now the dataframe is easier to work with, but we still need to handle the "Not assigned" values in the Neighbourhood column. Where a borough has no assigned neighbourhood, we use the borough name as the neighbourhood.

In [29]:
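df=df.reset_index(drop=True)
# where the neighbourhood is missing, fall back to the borough name
# (.loc assigns on the dataframe itself and avoids pandas' chained-assignment warning)
not_assigned = df.Neighbourhood == "Not assigned"
df.loc[not_assigned, "Neighbourhood"] = df.loc[not_assigned, "Borough"]

df.head(10)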

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
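The replacement step above can also be written without an explicit loop. The following is only a sketch of one possible alternative, using `Series.mask` on a two-row frame; the `toy` name is illustrative and the rows are taken from the output above:

```python
import pandas as pd

# Illustrative two-row frame: one row has a neighbourhood, one does not.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Lawrence Manor", "Not assigned"],
})

# Series.mask replaces values where the condition is True
# with the index-aligned value from the second argument.
toy["Neighbourhood"] = toy["Neighbourhood"].mask(
    toy["Neighbourhood"] == "Not assigned", toy["Borough"]
)

print(toy)
```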


Now we merge all the neighbourhoods that share the same postcode and borough into a single comma-separated row.

In [30]:
df=df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(','.join)
df=pd.DataFrame(df)
df.head(5)

Postcode,Borough,Neighbourhood
M1B,Scarborough,"Rouge,Malvern"
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,Scarborough,"Guildwood,Morningside,West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
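To make the grouping step concrete, here is a small sketch on three rows taken from the data above (the `toy` and `merged` names are illustrative). It uses `agg` followed by `reset_index`, which is one way to get a flat dataframe in a single chain; it is not necessarily how the notebook has to do it:

```python
import pandas as pd

# Two neighbourhoods share postcode M1B; M1G has only one.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Group on both key columns and join each group's names with a comma;
# reset_index turns the group keys back into ordinary columns.
merged = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(",".join)
       .reset_index()
)

print(merged)
```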


And now we move Postcode and Borough from the index back into regular columns.

In [31]:
df.reset_index(level=[0,1], inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


Now let's rename the first column so it matches the column name specified in the lab instructions.

In [32]:
df.rename(columns={"Postcode":"Postal Code"}, inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


Now, after downloading the file from http://cocl.us/Geospatial_data and uploading it to my CC account, it's time to load it and take a look at its shape.

In [33]:
df_coordinates=pd.read_csv("/resources/Toronto_Coordinates.csv")
df_coordinates.shape

(103, 3)

It is time to merge our two dataframes on their common "Postal Code" column, so we get one final dataset with the information from both files.

In [34]:
df_final=pd.merge(df, df_coordinates, on="Postal Code")
df_final.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


Let's check the shape of the final dataframe.

In [35]:
df_final.shape

(103, 5)
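`pd.merge` defaults to an inner join, so a postal code missing from either file would be dropped without any message. A minimal sketch of how one might check for that, assuming a made-up postal code `M9Z` that has no coordinates (the `left`, `right`, `checked`, and `unmatched` names are illustrative):

```python
import pandas as pd

# Illustrative frames: "M9Z" is a hypothetical code with no coordinates.
left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every row of `left`; indicator=True adds a "_merge" column
# saying where each row was found; validate raises if a key is duplicated.
checked = pd.merge(left, right, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every postal code found its coordinates.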

Now let's extract the boroughs whose name contains "Toronto" so we can cluster the neighbourhoods in Toronto.

In [36]:
df_toronto=df_final[df_final['Borough'].str.contains("Toronto")].reset_index(drop=True)
df_toronto.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049
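The selection above is a substring match on the borough name. A short sketch on four borough names that appear in this notebook (the `toy` and `toy_toronto` names are illustrative) shows which ones pass:

```python
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["East Toronto", "Scarborough", "Downtown Toronto", "North York"],
})

# regex=False treats "Toronto" as a literal substring rather than a pattern.
in_toronto = toy["Borough"].str.contains("Toronto", regex=False)
toy_toronto = toy[in_toronto].reset_index(drop=True)

print(toy_toronto)
```

Boroughs such as Scarborough and North York are left out because their names do not contain the word "Toronto".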


Now, as we are going to apply k-means clustering, we cannot have strings in our dataframe, so let's create a new dataframe without the columns containing objects.

In [37]:
df_ktoronto=df_toronto.drop(['Postal Code', 'Borough', 'Neighbourhood'], axis=1)
df_ktoronto.head()

Unnamed: 0,Latitude,Longitude
0,43.676357,-79.293031
1,43.679557,-79.352188
2,43.668999,-79.315572
3,43.659526,-79.340923
4,43.72802,-79.38879


Now we proceed to fit our k-means model.

In [40]:
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_ktoronto)
df_toronto.insert(0, 'Cluster Labels', kmeans.labels_)
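To see what the fitted model returns, here is a small sketch on four coordinate pairs copied from the table above. It assumes two clusters instead of four, purely to keep the example readable, and sets `n_init=10` explicitly; the `coords`, `model`, and `labels` names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude pairs taken from the Toronto table above.
coords = np.array([
    [43.676357, -79.293031],  # M4E, East Toronto
    [43.668999, -79.315572],  # M4L, East Toronto
    [43.728020, -79.388790],  # M4N, Central Toronto
    [43.712751, -79.390197],  # M4P, Central Toronto
])

# Assumption for this sketch: k=2. labels_ holds one integer per input row,
# in the same order as the rows, which is why it can be inserted as a column.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = model.labels_

print(labels)
print(model.cluster_centers_)
```

The label numbers themselves are arbitrary; only which rows share a label is meaningful.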

And now we create our map, colouring each neighbourhood by its cluster label.

In [44]:
map_clusters = folium.Map(location=[43.7001100, -79.4163000], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_ktoronto['Latitude'], df_ktoronto['Longitude'], df_toronto['Neighbourhood'], df_toronto['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster))
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster], # labels run from 0 to kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

And we are done!