<h1>Segmenting and Clustering Toronto Neighborhoods</h1>

This project is part of the IBM Data Science Professional Certificate Capstone.

<b>Author</b>: Julian Oellrich

<h2 id="libraries">1. Setup and Libraries</h2>

Import Libraries

In [124]:
# Computation Libraries
import pandas as pd
import numpy as np

# API Communication
import json
import requests

# Plotting 
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

# Maps and geocoding
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium

# ML Algorithms
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# JSON handling and HTML parser backend
import lxml # parser backend used by BeautifulSoup
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Data scraping modules
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

# Print Statement
print('Libraries imported!')

Libraries imported!


<h2 id="data_import">2. Data Import</h2>

<h3>Scrape Data from Wikipedia</h3>

Scrape the data from the Wikipedia page with the BeautifulSoup library

In [125]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')  # name the parser explicitly instead of letting bs4 guess

Extract the first table from the main content container of the Wikipedia page

In [126]:
table_data = soup.find('div', class_='mw-parser-output')
table = table_data.table.tbody

Write the table data into a new dataframe

In [127]:
columns = ['PostalCode', 'Borough', 'Neighbourhood']
data = {key: [] for key in columns}  # one empty list per column

# Rows without <td> cells (such as a header row) add nothing to the lists
for row in table.find_all('tr'):
    for cell, column in zip(row.find_all('td'), columns):
        data[column].append(cell.text.replace('\n', ''))

df = pd.DataFrame.from_dict(data=data)[columns]
print(df.shape)
df.head(10)

(180, 3)


,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
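
The row-by-row extraction above can be tried offline on a small inline table. The following is a minimal sketch, not the notebook's own cell: `SAMPLE_HTML` and `table_to_frame` are hypothetical names, the sample only mimics the layout of the Wikipedia table, and the built-in `html.parser` is assumed so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample that mimics the layout of the Wikipedia table
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def table_to_frame(html, columns):
    """Collect the <td> text of every table row into a dataframe."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find('table').find_all('tr'):
        # get_text(strip=True) drops the trailing newline inside each cell
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == len(columns):  # header rows hold <th>, not <td>
            rows.append(cells)
    return pd.DataFrame(rows, columns=columns)


sample_df = table_to_frame(SAMPLE_HTML, ['PostalCode', 'Borough', 'Neighbourhood'])
print(sample_df)
```

Collecting complete rows first, rather than appending to one list per column, keeps the columns aligned even if a row is missing a cell.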


<h3>Clean Data</h3>

<h4>Remove Unassigned Boroughs</h4>

Only rows with an assigned borough should be processed, so rows whose borough is 'Not assigned' are removed from the dataframe.

1. The first step is to count how many rows have a 'Not assigned' Borough:

In [128]:
print_statement = 'There are {} rows where Borough is Not assigned'.format(
    df[df['Borough'] == 'Not assigned'].shape[0])
print(print_statement)

There are 77 rows where Borough is Not assigned


2. Drop the rows where Borough is 'Not assigned' and write the result into a new dataframe, df_cleaned:

In [129]:
df_cleaned = df.drop(df[df['Borough'] == 'Not assigned'].index) 
df_cleaned.head()

,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


3. Check whether any rows with a 'Not assigned' Borough remain in the new dataframe:

In [130]:
print_statement = 'There are {} rows where Borough is Not assigned'.format(
    df_cleaned[df_cleaned['Borough'] == 'Not assigned'].shape[0])  # mask built from df_cleaned itself
print(print_statement)

There are 0 rows where Borough is Not assigned
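
The count, drop, and verify steps above can be condensed with a single boolean mask. This is a minimal sketch on a three-row toy dataframe (`toy`, `unassigned`, and `toy_cleaned` are illustrative names, not part of the notebook). It builds each mask from the same dataframe it filters, which avoids index-alignment surprises.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# 1. Count: True marks rows whose borough is unassigned
unassigned = toy['Borough'] == 'Not assigned'
n_unassigned = int(unassigned.sum())

# 2. Drop: keep only rows where the mask is False
toy_cleaned = toy[~unassigned]

# 3. Verify: the mask is rebuilt from the cleaned dataframe itself
remaining = int((toy_cleaned['Borough'] == 'Not assigned').sum())

print(n_unassigned, len(toy_cleaned), remaining)
```

Inverting the mask with `~` has the same effect here as dropping by index, and the original index labels are preserved.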


<h4>Fill In Unassigned Neighborhoods</h4>

If a row has a borough but a 'Not assigned' neighborhood, the neighborhood is set to the name of the borough.

Let's check how many rows have a 'Not assigned' Neighbourhood:

In [131]:
print_statement = 'There are {} rows where Neighbourhood is Not assigned'.format(
    df_cleaned[df_cleaned['Neighbourhood'] == 'Not assigned'].shape[0])
print(print_statement)

There are 0 rows where Neighbourhood is Not assigned


<b>Conclusion:</b> There are <b>no rows</b> with a 'Not assigned' Neighbourhood but an assigned Borough, so nothing needs to be filled in.
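
No fill-in was needed for this data, but the rule stated above could be applied as follows if such rows appeared. This is a hedged sketch on a toy dataframe: the second row ('Z9Z', 'Example Borough') is invented purely for illustration and does not come from the Wikipedia table.

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M5A', 'Z9Z'],                      # 'Z9Z' is a made-up code
    'Borough': ['Downtown Toronto', 'Example Borough'],
    'Neighbourhood': ['Regent Park, Harbourfront', 'Not assigned'],
})

# Rows that have a borough but no neighbourhood
missing = toy['Neighbourhood'] == 'Not assigned'

# Copy the borough name into the neighbourhood column for those rows only
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']

print(toy)
```

Using `.loc` with the same mask on both sides writes to the dataframe in place and leaves all other rows untouched.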

<h4>Check for Duplicate Postal Codes</h4>

In [132]:
# create new test dataframe with only neighbourhoods and postal codes
# (.copy() so the new column below is not written to a slice of df_cleaned)
df_neigh = df_cleaned[['PostalCode','Neighbourhood']].copy()

# check PostalCode duplicates and create new bool column that marks duplicates with True
df_neigh['duplicate bool'] = df_neigh['PostalCode'].duplicated(keep = 'first')

# Output value count of duplicate PostalCode values
df_neigh['duplicate bool'].value_counts()

False    103
Name: duplicate bool, dtype: int64

<b>Conclusion:</b> There are no duplicate PostalCode values.
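
Had the check found duplicates, one possible remedy would be to merge the neighbourhoods that share a postal code into a single comma-separated row, matching the format the table already uses. The sketch below assumes that approach. The postal codes, boroughs, and neighbourhood names are placeholders invented for the example.

```python
import pandas as pd

# Placeholder data with one duplicated postal code
toy = pd.DataFrame({
    'PostalCode': ['Z1A', 'Z1A', 'Z2B'],
    'Borough': ['Borough A', 'Borough A', 'Borough B'],
    'Neighbourhood': ['First', 'Second', 'Third'],
})

# Group rows that share a postal code and borough, then join their neighbourhoods
merged = (
    toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)

print(merged)
```

After the merge, each postal code appears once, so `merged['PostalCode'].duplicated().any()` is `False`.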

<h4>Final cleaned DataFrame</h4>

Write the cleaned data to the dataframe df_toronto and print its shape

In [133]:
df_toronto = df_cleaned.copy()  # independent copy, so later changes do not alter df_cleaned

print_statement = '\n The toronto dataframe has {} rows and {} columns \n'.format(df_toronto.shape[0], df_toronto.shape[1])
print(print_statement)
df_toronto.head()


 The toronto dataframe has 103 rows and 3 columns 



,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


<h2 id="explore_venues">3. Explore Neighbourhood venues</h2>