# IBM Data Science Professional Certificate
https://www.coursera.org/specializations/ibm-data-science-professional-certificate

## COURSE 9 - Applied Data Science Capstone
https://www.coursera.org/learn/applied-data-science-capstone

### Week 03 - Segmenting and Clustering Neighborhoods in the city of Toronto, Canada

### Index:

* [Part 1 - scraping](#part_001)

* [Part 2 - mapping](#part_002)

### Part 1 - Task: <a id='part_001'></a>
1. Scrape a table from the Wikipedia page
    https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
2. Preprocess the data using pandas

In [1]:
#import libraries

import pandas as pd
import numpy as np

from bs4 import BeautifulSoup

import requests

In [2]:
# download the Wikipedia page and keep its HTML as text
table_link = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [3]:
# parse the HTML with BeautifulSoup; naming the parser explicitly avoids the "no parser specified" warning
soupcanada = BeautifulSoup(table_link, 'html.parser')

In [4]:
# locate the HTML table that holds the postal code details
wiki_table = soupcanada.find('table', {'class':'wikitable sortable'})

In [5]:
# create one empty list per column
postcode = []
borough = []
neighborhood = []

In [6]:
# append the cell text of every three-cell row to the lists
for row in wiki_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:
        postcode.append(cells[0].find(text = True))
        borough.append(cells[1].find(text = True))
        neighborhood.append(cells[2].find(text = True))

In [7]:
# create an empty dataframe
table_df = pd.DataFrame(columns = ['PostalCode', 'Borough', 'Neighbourhood'])

In [8]:
# insert data into the empty dataframe
table_df['PostalCode'] = postcode
table_df['Borough'] = borough
table_df['Neighbourhood'] = neighborhood
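
The two cells above can also be collapsed into a single constructor call. The following is a minimal sketch, assuming three equal-length lists; the `sample_*` names are hypothetical and the values are copied from the first rows of the scraped table, so nothing here touches the notebook's own variables.

```python
import pandas as pd

# Hypothetical sample lists; values copied from the first rows of the scraped table.
sample_postcode = ['M1A', 'M2A', 'M3A']
sample_borough = ['Not assigned', 'Not assigned', 'North York']
sample_neighborhood = ['Not assigned', 'Not assigned', 'Parkwoods']

# Build the dataframe in one step from a dict of equal-length lists.
sample_df = pd.DataFrame({
    'PostalCode': sample_postcode,
    'Borough': sample_borough,
    'Neighbourhood': sample_neighborhood,
})

print(sample_df)
```

Passing a dict keeps the column names and their data next to each other, and pandas raises an error if the lists differ in length.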

In [9]:
# set column PostalCode as index
table_df.set_index('PostalCode', inplace = True)
# view dataframe
table_df.head()

PostalCode,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront


In [10]:
# remove rows where Borough is 'Not assigned'; copy so later column assignments do not act on a view
table_df = table_df[table_df.Borough != 'Not assigned'].copy()
# view dataframe
table_df.head()

PostalCode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M5A,Downtown Toronto,Regent Park
M6A,North York,Lawrence Heights


In [11]:
# strip leading and trailing newline characters from the Neighbourhood strings
table_df['Neighbourhood'] = table_df['Neighbourhood'].str.strip('\n')
# view dataframe
table_df.head()

PostalCode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M5A,Downtown Toronto,Regent Park
M6A,North York,Lawrence Heights


In [12]:
# replace 'Not assigned' Neighbourhood values with the Borough of the same row
table_df['Neighbourhood'] = np.where(table_df['Neighbourhood'] == 'Not assigned', table_df['Borough'], table_df['Neighbourhood'])
# view dataframe
table_df.head()

PostalCode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M5A,Downtown Toronto,Regent Park
M6A,North York,Lawrence Heights


In [13]:
# group rows that share the same PostalCode and Borough, joining their neighbourhoods with commas
table_df = table_df.groupby(['PostalCode', 'Borough']).agg(lambda col: ','.join(col))
# view dataframe
table_df.head()

PostalCode,Borough,Neighbourhood
M1B,Scarborough,"Rouge,Malvern"
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,Scarborough,"Guildwood,Morningside,West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
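
The cleaning steps of Part 1 (filter, fill, group) can be tried in isolation. This is a minimal sketch on hypothetical toy rows that only mimic the scraped table; `toy_df` and `toy_grouped` are illustrative names, and the fill step assumes the intent is to fall back to the borough name.

```python
import pandas as pd

# Hypothetical toy rows that mimic the scraped table (not the real data).
toy_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows that have no borough; copy to avoid assigning into a view.
toy_df = toy_df[toy_df['Borough'] != 'Not assigned'].copy()

# 2. Where the neighbourhood is unassigned, fall back to the borough name.
unassigned = toy_df['Neighbourhood'] == 'Not assigned'
toy_df.loc[unassigned, 'Neighbourhood'] = toy_df.loc[unassigned, 'Borough']

# 3. Collapse to one row per postal code, joining neighbourhoods with commas.
toy_grouped = (
    toy_df.groupby(['PostalCode', 'Borough'])['Neighbourhood']
    .agg(','.join)
    .reset_index()
)

print(toy_grouped)
```

Selecting the `Neighbourhood` column before aggregating makes it explicit which values are joined, and `reset_index()` returns the group keys to ordinary columns.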


In [14]:
# move PostalCode and Borough from the index back into regular columns
table_df.reset_index(['PostalCode', 'Borough'], inplace = True)
# view dataframe
table_df.head()

,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [15]:
print('The final shape of the dataframe is', table_df.shape)

The final shape of the dataframe is (103, 3)


---------------------------

### Part 2 - Task:<a id='part_002'></a>
1. Merge the geocoder dataframe with dataframe from Part 1

In [16]:
!wget http://cocl.us/Geospatial_data

--2019-05-24 10:25:07--  http://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 159.8.72.228
Connecting to cocl.us (cocl.us)|159.8.72.228|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data [following]
--2019-05-24 10:25:07--  https://cocl.us/Geospatial_data
Connecting to cocl.us (cocl.us)|159.8.72.228|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-05-24 10:25:08--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 185.235.236.197
Connecting to ibm.box.com (ibm.box.com)|185.235.236.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-05-24 10:25:09--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5

In [17]:
# read the CSV file with GPS coordinates into a dataframe and rename its columns
gps_df = pd.read_csv('Geospatial_data')
gps_df.columns = ['PostalCode', 'GPS_Lat', 'GPS_Lon']
gps_df = gps_df.set_index('PostalCode')
gps_df.head()

PostalCode,GPS_Lat,GPS_Lon
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [18]:
# merge both tables (gps_df with table_df)
table_gps_df = pd.merge(table_df, gps_df, on = 'PostalCode')
table_gps_df.head()

,PostalCode,Borough,Neighbourhood,GPS_Lat,GPS_Lon
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
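
`pd.merge` performs an inner join by default, so a postal code without coordinates would silently disappear from the result. The sketch below shows one way to check for that with `how='left'` and `indicator=True`. The first two rows are taken from the tables above; the code `M9Z`, its borough label, and the names `left_df`, `coords_df`, `checked_df` and `missing` are hypothetical.

```python
import pandas as pd

# Two rows from the tables above plus one hypothetical code ('M9Z') that has no coordinates.
left_df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
coords_df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'GPS_Lat': [43.806686, 43.784535],
    'GPS_Lon': [-79.194353, -79.160497],
})

# A left join keeps every row of left_df; the '_merge' column records where each row was found.
checked_df = pd.merge(left_df, coords_df, on='PostalCode', how='left', indicator=True)

# Rows marked 'left_only' had no matching coordinates.
missing = checked_df.loc[checked_df['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list would indicate that every postal code received coordinates, in which case the inner and left joins give the same rows.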


In [19]:
import folium

In [20]:
# keep only the rows whose Borough name contains 'Toronto'
canada_table_df = table_gps_df[table_gps_df['Borough'].str.contains('Toronto')]
canada_table_df.head(15)

,PostalCode,Borough,Neighbourhood,GPS_Lat,GPS_Lon
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049


In [21]:
# set the GPS coordinates of the map centre; the zoom level is set in folium.Map below
# the centre lies near row 52 - M4Y - Downtown Toronto - Church and Wellesley (43.665860, -79.383160)
lat = 43.675860
lon = -79.393160

map_canda = folium.Map(location = [lat, lon], zoom_start = 12)
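
As an alternative to hard-coding the centre, it could be derived from the data. This is a minimal sketch, assuming the mean of the marker coordinates is an acceptable centre; it uses the first three rows of the Toronto subset shown above, and `subset_df`, `center_lat` and `center_lon` are hypothetical names.

```python
import pandas as pd

# First three rows of the Toronto subset shown above (M4E, M4K, M4L).
subset_df = pd.DataFrame({
    'GPS_Lat': [43.676357, 43.679557, 43.668999],
    'GPS_Lon': [-79.293031, -79.352188, -79.315572],
})

# Use the mean latitude and longitude as the map centre.
center_lat = subset_df['GPS_Lat'].mean()
center_lon = subset_df['GPS_Lon'].mean()

print(center_lat, center_lon)
```

The resulting pair could be passed as `location = [center_lat, center_lon]` to `folium.Map` in place of the fixed values.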

In [22]:
# add one circle marker per neighbourhood to the map
for lat, lng, label in zip(canada_table_df['GPS_Lat'], canada_table_df['GPS_Lon'], canada_table_df['Neighbourhood']):
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = 'lightcoral',
        fill_opacity = 0.5,
        parse_html = False).add_to(map_canda)

map_canda