# Coursera Data Science Capstone Project

This notebook scrapes data on neighborhoods in Canada from a Wikipedia page and then segments them. It is part of the Capstone Project of the IBM Data Science Specialization on Coursera.

## Part1 - Scraping data from Wiki

The Wikipedia page used for scraping is: [Postal Codes of Canada](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)

The package used to scrape the data is: [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/)

In [1]:
# requests package to get the html doc of a url
import requests

# Beautiful soup to Parse through the document
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np


### Storing Data into a data frame

The following three cells parse the postal code table at the URL with the Beautiful Soup package and store it in a data frame.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#get the html of the url 
url_html = requests.get(url).text

In [3]:
#parsing the html doc from above
html_doc = BeautifulSoup(url_html, 'html.parser')

In [4]:
# postal code table (the first <table> element on the page)
table = html_doc.table

# initiating an empty df
df = pd.DataFrame()

# reading the table into the empty data frame, cell by cell;
# get_text() is used because .string is None for cells with mixed content
for i, row in enumerate(table.find_all('tr')):
    for j, col in enumerate(row.find_all('td')):
        df.loc[i, j] = col.get_text().strip()

# changing column names to the text of the table's header cells
column_headers = [th.get_text().strip() for th in table.find_all('th')]

df.columns = column_headers
df

Unnamed: 0,Postal Code,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
176,M5Z,Not assigned,Not assigned
177,M6Z,Not assigned,Not assigned
178,M7Z,Not assigned,Not assigned
179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
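
The cell-by-cell parsing above can also be tried on a small inline table, which makes it easier to see what each step returns. The snippet below is only a sketch: `sample_html` is a hypothetical miniature of the Wikipedia table, and the names `sample_table`, `rows` and `sample_df` are made up for illustration. It uses `get_text()` rather than `.string`, because `.string` is `None` when a cell mixes plain text with nested tags.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table, for illustration only
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, <a href="#">Harbourfront</a></td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, 'html.parser').table

# header cells give the column names
headers = [th.get_text().strip() for th in sample_table.find_all('th')]

# get_text() also works for the last cell, where .string would be None
rows = []
for tr in sample_table.find_all('tr'):
    cells = [td.get_text().strip() for td in tr.find_all('td')]
    if cells:  # the header row has no <td> cells, so it is skipped
        rows.append(cells)

sample_df = pd.DataFrame(rows, columns=headers)
print(sample_df)
```

Collecting the rows in a list and building the data frame once avoids growing it one cell at a time, at the cost of an index that starts at 0 instead of 1.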


### Formatting the data frame
The following code cell modifies the data frame created in the step above into the format required by the assignment.

In [5]:
# presenting the data in the required format
# editing the df created

# dropping all rows whose Borough is 'Not assigned'
df = df[df['Borough'] != 'Not assigned']

# grouping by postal code and borough and joining neighborhood values with ', ';
# groupby returns a new object, so the result is assigned back to df,
# sort=False keeps the original row order and reset_index() restores the columns
df = df.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood'].apply(', '.join).reset_index()

df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [6]:
#displaying the shape of the data frame
df.shape

(103, 3)
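
To see what the grouping step does when one postal code appears on several rows, here is a minimal sketch on toy data. The data frame `toy` and the name `merged` are hypothetical. Note that `groupby` returns a new object instead of changing the data frame in place, so the result has to be assigned to a variable.

```python
import pandas as pd

# Hypothetical rows: one postal code listed twice with different neighborhoods
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# join the neighborhoods of each (postal code, borough) pair with ', ';
# sort=False keeps the groups in order of first appearance
merged = (
    toy.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
       .apply(', '.join)
       .reset_index()
)
print(merged)
```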

## Part2 - Finding Lat Long of Postal Codes

In this part we use the [pgeocode](https://pypi.org/project/pgeocode/) package to get the latitude and longitude of all the postal codes.

The package supports only the countries listed on the page linked above.

In [7]:
import pgeocode

In [8]:
nomi = pgeocode.Nominatim('ca')

# initiating latitude and longitude lists
latitude = []
longitude = []

# looping over all postal codes in the dataframe, querying each one once
# to get the latitude and longitude of the postal code
for code in df['Postal Code']:
    location = nomi.query_postal_code(code)
    latitude.append(location.latitude)
    longitude.append(location.longitude)

# adding Latitude and Longitude cols into the dataframe
df['Latitude'] = latitude
df['Longitude'] = longitude
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
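
Before mapping or clustering, it may be worth checking that every postal code actually received coordinates. The sketch below assumes that a code with no match ends up with `NaN` in both columns, which is how pgeocode appears to report unmatched codes. The frame `coords` is a stand-in: its first two rows copy values from the output above, and the third row (`X0X`) is a made-up unmatched code.

```python
import numpy as np
import pandas as pd

# First two rows copy values from the output above;
# the third row is a hypothetical code with no match (assumed to give NaN)
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'X0X'],
    'Latitude': [43.7545, 43.7276, np.nan],
    'Longitude': [-79.33, -79.3148, np.nan],
})

# rows where either coordinate is missing
missing = coords[coords[['Latitude', 'Longitude']].isna().any(axis=1)]
print(missing['Postal Code'].tolist())

# keep only rows that have both coordinates
clean = coords.dropna(subset=['Latitude', 'Longitude']).reset_index(drop=True)
print(clean.shape)
```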


In [9]:
# separating Toronto only rows then resetting index and deleting index column created due to resetting
tor_df = df[df['Borough'].str.contains('Toronto', case=False, regex=True)].reset_index(drop = True)

tor_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756
4,M4E,East Toronto,The Beaches,43.6784,-79.2941


Building a map of all the Toronto boroughs using the folium package.

In [10]:
import folium

In [11]:
# create map of Toronto, centered on the first row of the dataframe
map_tor = folium.Map(location=[tor_df['Latitude'][0], tor_df['Longitude'][0]], zoom_start=11)

# adding a marker for each latitude and longitude pair in the dataframe
for lat, lng, borough, neighborhood in zip(tor_df['Latitude'], tor_df['Longitude'], tor_df['Borough'], tor_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    # parse_html belongs to the Popup, not to the CircleMarker
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_tor)
map_tor

Running the k-means clustering algorithm, beginning the clustering at k = 4.

In [12]:
k_clusters = 4

# importing the k-means implementation from scikit-learn
from sklearn.cluster import KMeans

# importing matplotlib color modules used to color the clusters
import matplotlib.cm as cm
import matplotlib.colors as colors

In [13]:
# applying Clustering Algo
kmeans = KMeans(n_clusters=k_clusters, random_state=0).fit(tor_df.drop(columns = ['Neighborhood', 'Borough', 'Postal Code']))

# combining kmeans labels with the Toronto dataframe
tor_df = pd.concat([tor_df, pd.DataFrame(kmeans.labels_)], axis = 1)

# renaming kmeans combined column label
tor_df.rename(columns = {0 : "K Means Class"}, inplace = True)
tor_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,K Means Class
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,1
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,1
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783,1
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756,1
4,M4E,East Toronto,The Beaches,43.6784,-79.2941,0
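
The choice of k = 4 is a starting point. One common way to compare values of k is to look at the inertia (the within-cluster sum of squared distances) for a range of k and see where it stops dropping sharply. The sketch below is not part of the original analysis: it runs on synthetic points around four hypothetical centres, and `centres`, `points` and `inertias` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic coordinates: four tight groups around hypothetical centres
rng = np.random.default_rng(0)
centres = np.array([
    [43.65, -79.38],
    [43.68, -79.30],
    [43.70, -79.40],
    [43.66, -79.45],
])
points = np.vstack([c + rng.normal(scale=0.003, size=(20, 2)) for c in centres])

# fit k-means for several values of k and record the inertia of each fit
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 5))
```

On data like this the inertia falls steeply up to k = 4 and only slowly afterwards. With the real postal code coordinates the bend may be much less clear.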


In [14]:
# create map
map_clusters = folium.Map(location=[tor_df['Latitude'][0], tor_df['Longitude'][0]], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label (labels run from 0 to k_clusters - 1)
for lat, lon, poi, cluster in zip(tor_df['Latitude'], tor_df['Longitude'], tor_df['Neighborhood'], tor_df['K Means Class']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
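
The color step can be checked on its own, without drawing a map. This is a minimal sketch, assuming the cluster labels are the integers 0 to k - 1 that `KMeans` produces. The names `palette`, `sample_labels` and `marker_colors` are hypothetical and the labels are made up.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k_clusters = 4

# one evenly spaced color per cluster from the rainbow colormap, as hex strings
palette = [colors.to_hex(c) for c in cm.rainbow(np.linspace(0, 1, k_clusters))]

# hypothetical cluster labels; each label indexes the palette directly
sample_labels = [1, 1, 0, 3, 2]
marker_colors = [palette[label] for label in sample_labels]
print(palette)
print(marker_colors)
```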

**Important Note** - If you are unable to view the maps in this notebook, please paste the notebook's GitHub link into this website: [nbviewer](https://nbviewer.jupyter.org/).

This happens because GitHub renders the .ipynb file statically, so the interactive folium maps may not be displayed.

**Author** : Nishchay Nagpal

**Email** : nishchaynagpal419@gmail.com

**Git Repository Link** : [Coursera_Capstone](https://github.com/Anima419/Coursera_Capstone)