# Applied Data Science Capstone
This notebook is used for the IBM Applied Data Science Capstone project.

In [1]:
import pandas as pd
import numpy as np

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Scrape Data from Wiki Page and Create a Data Frame

The first step is to import the libraries.

In [3]:
!pip install bs4 html5lib
from bs4 import BeautifulSoup
import requests




Now we fetch the HTML from the URL, keep it as text, and create a soup object from it.


In [4]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
data  = requests.get(url).text
soup = BeautifulSoup(data,"html5lib")  # create a soup object using the variable 'data'

In this step we process the data:
- The first table on the page is selected and its cells are read
- Cells whose borough is Not assigned are skipped
- Neighborhoods that share a postal code are combined into one comma-separated entry
- A data frame is created from the table content
- The column names are PostalCode, Borough and Neighborhood
- Rows are added to the data frame after preprocessing (removing special characters etc.)
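The parsing rules above can be tried on a single cell without fetching the page. This is a minimal sketch, not the notebook's exact logic: the helper name `parse_cell` and the sample strings are hypothetical, and it assumes each cell's text has the form `Borough(Neighborhood / Neighborhood)`.

```python
def parse_cell(code_text, span_text):
    """Return a row dict for one table cell, or None if the borough is not assigned.

    Hypothetical helper: assumes span_text looks like 'Borough(Area / Area)'.
    """
    if span_text.strip() == 'Not assigned':
        return None
    borough, _, rest = span_text.partition('(')
    # Drop the closing parenthesis and split the neighborhoods on '/'.
    names = [part.strip() for part in rest.rstrip(')').split('/')]
    return {
        'PostalCode': code_text.strip()[:3],
        'Borough': borough.strip(),
        'Neighborhood': ', '.join(name for name in names if name),
    }


rows = [
    parse_cell('M5A\n', 'Downtown Toronto(Regent Park / Harbourfront)'),
    parse_cell('M1A\n', 'Not assigned'),
]
# Keep only the cells that produced a row.
rows = [row for row in rows if row is not None]
print(rows)
```

Keeping the parsing in a small function like this makes it possible to check the string handling on a few sample cells before running it over the whole table.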

In [5]:
table_contents = []
table = soup.find('table')
for row in table.find_all('td'):
    # skip cells whose borough is not assigned
    if row.span.text == 'Not assigned':
        continue
    cell = {}
    cell['PostalCode'] = row.p.text[:3]  # the first three characters are the postal code
    cell['Borough'] = row.span.text.split('(')[0]
    # the neighborhoods sit inside the parentheses, separated by ' /'
    cell['Neighborhood'] = row.span.text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    table_contents.append(cell)

#display(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

df

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto Business | Enclave of M4L |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


In [6]:
print(df.shape)

(103, 3)


## Retrieve Geo Coordinates

Retrieve the coordinates from the provided CSV file. A geocoder-based alternative is kept, commented out, in a later cell.

In [7]:
!pip install wget
!python -m wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv -o Geospatial_Coordinates.csv

df_CSV = pd.read_csv('Geospatial_Coordinates.csv')


Saved under Geospatial_Coordinates (15).csv


In [8]:
df_CSV.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Some preprocessing is needed before the two data sets can be merged.
- The CSV is sorted differently from the scraped table, so the coordinates must first be put in the same row order as `df`.
- Then both data frames are combined.
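Both steps can also be done at once with a left join on the postal code, which keeps the row order of the left-hand frame. The sketch below uses small stand-in frames rather than the notebook's `df` and `df_CSV`: the names `left`, `coords` and `merged` are illustrative, and the sample rows are taken from this notebook's outputs.

```python
import pandas as pd

# Illustrative stand-ins for the scraped table and the coordinates CSV.
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M4A', 'M3A'],
    'Latitude': [43.806686, 43.725882, 43.753259],
    'Longitude': [-79.194353, -79.315572, -79.329656],
})

# A left join keeps every row of `left` in its original order and
# looks up the matching coordinates by postal code.
merged = left.merge(
    coords.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',
    validate='one_to_one',  # fail loudly if a postal code is duplicated
)
print(merged)
```

With a left join, a postal code that is missing from the CSV would show up as NaN coordinates, which is easy to check for afterwards.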

In [9]:
row_list=[]
# look up the coordinates for each postal code, in the row order of df
for index, row in df.iterrows():
    postalCode = row['PostalCode']
    geo_coords_row = df_CSV.loc[df_CSV['Postal Code'] == postalCode]
    cell = {}
    cell['PostalCode'] = geo_coords_row.iloc[0]['Postal Code']
    cell['Latitude'] = geo_coords_row.iloc[0]['Latitude']
    cell['Longitude'] = geo_coords_row.iloc[0]['Longitude']
    row_list.append(cell)

ordered_row=pd.DataFrame(row_list)
display(ordered_row.head())
display(ordered_row.shape)

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M3A | 43.753259 | -79.329656 |
| 1 | M4A | 43.725882 | -79.315572 |
| 2 | M5A | 43.65426 | -79.360636 |
| 3 | M6A | 43.718518 | -79.464763 |
| 4 | M7A | 43.662301 | -79.389494 |


(103, 3)

Add the latitude and longitude columns to the data frame.

In [10]:
df['Latitude'] = ordered_row['Latitude']
df['Longitude'] = ordered_row['Longitude']
df.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |


In [11]:
#!pip install geocoder
#import geocoder # import geocoder

#geo_coordinates=[]
#for index, row in df.iterrows():
#    #print(row['PostalCode'])
#    lat_lng_coords = None
#    # loop until you get the coordinates
#    while(lat_lng_coords is None):
#        g = geocoder.google('{}, Toronto, Ontario'.format(row['PostalCode']))
#        lat_lng_coords = g.latlng
#        if lat_lng_coords is not None:
#            cell = {}
#            cell['Latitude'] = lat_lng_coords[0]
#            cell['Longitude'] = lat_lng_coords[1]
#            geo_coordinates.append(cell)
            

## Cluster Analysis


Import the libraries and geocode Toronto to get the coordinates of the map centre.


In [12]:
import folium
from geopy.geocoders import Nominatim
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude


Create a Folium map of Toronto, add every point in the data frame as a marker, and show the map. At this stage the points are not yet clustered.


In [13]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)
df_copy = df.copy()

# add markers to map
# (the loop variable is named postal_code so that it does not overwrite df_copy)
for lat, lng, borough, postal_code in zip(df_copy['Latitude'], df_copy['Longitude'], df_copy['Borough'], df_copy['PostalCode']):
    label = '{}, {}'.format(postal_code, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto


In the next step we cluster the data points and show the clusters on the map.


In [14]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.cm as cm
import matplotlib.colors as colors


Run k-means on the standardized coordinates and add the cluster labels to the data frame as a new column.


In [15]:
kclusters = 4
k_means = KMeans(init="k-means++", n_clusters=kclusters, n_init=12)
# standardize latitude and longitude so both contribute equally to the distance
cluster_dataset = StandardScaler().fit_transform(df[['Latitude', 'Longitude']])
k_means.fit(cluster_dataset)
# add the cluster label of each row as the first column
df.insert(0, 'Cluster Labels', k_means.labels_)
df.head()

| | Cluster Labels | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 3 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | 3 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | 1 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | 0 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
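The `StandardScaler` step in the clustering cell above rescales each column to zero mean and unit variance before k-means measures distances. A minimal stdlib sketch of that computation on a single column follows; the helper name `standardize` is hypothetical, and the latitudes are those of the first five rows of the data frame.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0), as StandardScaler does."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


latitudes = [43.753259, 43.725882, 43.654260, 43.718518, 43.662301]
z = standardize(latitudes)
print([round(v, 3) for v in z])
```

After this transformation a column has mean 0 and standard deviation 1, so latitude and longitude are on the same scale when the distances are computed.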



Show it on the map. Each cluster gets its own colour.
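As an illustration of mapping cluster labels to colours, here is a stdlib-only sketch. It swaps in evenly spaced `colorsys` hues for the matplotlib rainbow colormap that the notebook uses; `cluster_palette` is a hypothetical helper, and the labels are those of the first five rows shown above.

```python
import colorsys


def cluster_palette(n_clusters):
    """Return n_clusters evenly spaced hues as '#rrggbb' strings."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_palette(4)
labels = [3, 3, 0, 1, 0]  # cluster labels of the first five rows
# Index the palette directly with the 0-based cluster label.
marker_colors = [palette[label] for label in labels]
print(marker_colors)
```

Because the labels returned by k-means start at 0, they can index the palette directly, and rows in the same cluster always get the same colour.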


In [16]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

df_nextcopy = df.copy()

# add markers to the map, coloured by cluster label (labels are 0-based)
for lat, lng, borough, cluster in zip(df_nextcopy['Latitude'], df_nextcopy['Longitude'], df_nextcopy['Borough'], df_nextcopy['Cluster Labels']):
    label = folium.Popup(str(borough) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters