<h1 align=center><font size = 5>Segmenting and Clustering Neighbourhoods in Toronto</font></h1>

## Introduction

In this lab, we extract the Canadian postal codes that begin with M (the Toronto area) from Wikipedia and clean the data set.


### Install necessary packages

In [1]:
! pip install geopy
! pip install folium==0.5.0
! pip install bs4
! pip install lxml
! pip install html5lib

Collecting geopy
  Downloading https://files.pythonhosted.org/packages/07/e1/9c72de674d5c2b8fcb0738a5ceeb5424941fefa080bfe4e240d0bacb5a38/geopy-2.0.0-py3-none-any.whl (111kB)
Collecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/8b/62/26ec95a98ba64299163199e95ad1b0e34ad3f4e176e221c40245f211e425/geographiclib-1.50-py3-none-any.whl
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.0.0
Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)

### Import necessary Libraries

In [2]:
import lxml
import html5lib
from bs4 import BeautifulSoup # library to parse HTML documents

import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

#import geocoder

# libraries for displaying images and HTML
from IPython.display import Image, HTML

# transform a JSON file into a pandas dataframe
# (pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize)
from pandas import json_normalize

import folium # plotting library

from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

print('Libraries imported.')

Libraries imported.


### Check that the Wikipedia page is responding (a status code of 200 means the request succeeded)

In [4]:
wikiurl = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# class of the target table (unused below; the lookup uses "wikitable sortable")
table_class = "wikitable sortable jquery-tablesorter"
response = requests.get(wikiurl)
print(response.status_code)

200
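As a small illustration of what the status code means, the sketch below classifies a code and prints its standard reason phrase using only the standard library. The helper names `is_success` and `describe` are hypothetical and not part of the notebook. With `requests`, the same check is also available through `response.ok` or `response.raise_for_status()`.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True for any 2xx status code (hypothetical helper)."""
    return 200 <= status_code < 300


def describe(status_code: int) -> str:
    """Return the code together with its standard reason phrase."""
    return f"{status_code} {HTTPStatus(status_code).phrase}"


for code in (200, 404):
    print(describe(code), "-> success" if is_success(code) else "-> failure")
```

Any code outside the 2xx range means the page could not be fetched as requested, so the parsing steps that follow should not be run on that response.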


### Get Table with postal codes from Canada

In [5]:
# parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# find the table with the postal codes on the web page
postal_table = soup.find('table', {'class': "wikitable sortable"})

### Convert HTML Table to dataframe

In [6]:
from io import StringIO

# read_html returns a list of dataframes; the first one is the postal code table
df = pd.read_html(StringIO(str(postal_table)))[0]
# print first 5 rows
print(df.head())

  Postal Code           Borough              Neighbourhood
0         M1A      Not assigned               Not assigned
1         M2A      Not assigned               Not assigned
2         M3A        North York                  Parkwoods
3         M4A        North York           Victoria Village
4         M5A  Downtown Toronto  Regent Park, Harbourfront
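To make the conversion from an HTML table to rows and columns more concrete, here is a minimal sketch that uses only the standard library on a made-up HTML snippet. It is not how `pd.read_html` is implemented; the class `TableRows` and the sample markup are assumptions for illustration only.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# made-up sample in the shape of the Wikipedia table
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *records = parser.rows
print(header)
print(records)
```

The first row becomes the column names and the remaining rows become the records, which is the same shape as the dataframe printed above.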


In [7]:
# get nr. of rows and columns of the dataframe
print(df.shape)
# get datatypes from columns
df.dtypes

(180, 3)


Postal Code      object
Borough          object
Neighbourhood    object
dtype: object

### Clean dataframe of postal codes from Canada

#### First, remove incomplete rows from the data set

In [8]:
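# exclude all rows whose Borough is 'Not assigned'
# (.copy() gives an independent dataframe, so later assignments do not act on a slice of the original)
df = df.loc[df['Borough'] != 'Not assigned'].copy()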

In [9]:
# compare the number of rows with the original dataframe to see how many rows were removed
print(df.shape)
print(df.head())

(103, 3)
  Postal Code           Borough                                Neighbourhood
2         M3A        North York                                    Parkwoods
3         M4A        North York                             Victoria Village
4         M5A  Downtown Toronto                    Regent Park, Harbourfront
5         M6A        North York             Lawrence Manor, Lawrence Heights
6         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government
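The filtering step can be tried in isolation on a small frame. This is a minimal sketch with three sample rows in the shape of the table above; the names `sample`, `mask`, `assigned` and `removed` are illustrative only.

```python
import pandas as pd

# three sample rows in the shape of the postal code table
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront"],
})

# boolean mask: True for every row that has a real borough name
mask = sample["Borough"] != "Not assigned"

# keep only those rows; .copy() makes the result independent of `sample`
assigned = sample.loc[mask].copy()

removed = len(sample) - len(assigned)
print(f"{removed} row(s) removed, {len(assigned)} kept")
```

Note that the surviving rows keep their original index labels, which is why the cleaned dataframe above starts at index 2.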


#### Check for duplicate entries and combine the results under one postal code
#### Apparently there are no duplicate entries, contrary to what the exam description suggests

In [10]:
# count the number of entries per Postal Code
df2 = df.groupby(by='Postal Code', as_index=False).agg({'Neighbourhood': 'count'})
# print the shape and the rows of all Postal Codes with more than one entry (0 rows for the currently available data)
print(df2.loc[df2['Neighbourhood'] > 1].shape)
df2.loc[df2['Neighbourhood'] > 1]

(0, 2)


Empty DataFrame
Columns: [Postal Code, Neighbourhood]
Index: []
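The current data contains no duplicate postal codes, so nothing has to be combined. If duplicates did occur, one possible way to merge them is sketched below on made-up rows; the grouping keys and the `", "` separator are assumptions chosen to match the format already used in the table.

```python
import pandas as pd

# made-up rows in which M5A appears twice
sample = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# postal codes that occur more than once
counts = sample["Postal Code"].value_counts()
duplicated_codes = counts[counts > 1].index.tolist()

# one row per postal code, neighbourhoods joined with ", "
combined = (
    sample.groupby(["Postal Code", "Borough"], as_index=False, sort=False)
    .agg({"Neighbourhood": lambda names: ", ".join(names)})
)
print(duplicated_codes)
print(combined)
```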


#### Cross-check the postal code that the exam description cites as having duplicate entries

In [11]:
# select & print row for postal code M5A
df.loc[df['Postal Code'] == 'M5A']

  Postal Code           Borough              Neighbourhood
4         M5A  Downtown Toronto  Regent Park, Harbourfront


#### M5A is in the correct format with two neighbourhoods in one row

#### Next step: replace missing Neighbourhood values with the name of the borough

In [12]:
# replace missing Neighbourhood values with the name of the borough (the assignment aligns on the row index)
df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df['Borough']
# check the first and last entries to confirm the replacement worked
df

     Postal Code           Borough                                      Neighbourhood
2            M3A        North York                                          Parkwoods
3            M4A        North York                                   Victoria Village
4            M5A  Downtown Toronto                          Regent Park, Harbourfront
5            M6A        North York                   Lawrence Manor, Lawrence Heights
6            M7A  Downtown Toronto        Queen's Park, Ontario Provincial Government
...          ...               ...                                                ...
160          M8X         Etobicoke      The Kingsway, Montgomery Road, Old Mill North
165          M4Y  Downtown Toronto                               Church and Wellesley
168          M7Y      East Toronto  Business reply mail Processing Centre, South C...
169          M8Y         Etobicoke  Old Mill South, King's Mill Park, Sunnylea, Hu...
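The replacement relies on pandas aligning the assigned values by row label, so only the selected rows receive their own borough name. A minimal sketch on made-up rows (the sample values and index labels are illustrative, not taken from the real table):

```python
import pandas as pd

# made-up rows; the first one has no neighbourhood name
sample = pd.DataFrame(
    {
        "Borough": ["Downtown Toronto", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    },
    index=[6, 2],
)

# rows whose neighbourhood is missing
missing = sample["Neighbourhood"] == "Not assigned"

# copy the borough name into the neighbourhood column for those rows only
sample.loc[missing, "Neighbourhood"] = sample.loc[missing, "Borough"]
print(sample)
```

Rows that already have a neighbourhood name are left unchanged.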


Summarise how many boroughs and postal codes are in the data set. Each row is one postal code and may list several neighbourhoods, so the row count is the number of postal codes.

In [46]:
print('The dataframe has {} boroughs and {} postal codes.'.format(
        df['Borough'].nunique(),
        df.shape[0]
    )
)

The dataframe has 10 boroughs and 103 postal codes.
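Because a row can list several neighbourhoods separated by commas (as M5A does), counting individual neighbourhoods requires splitting that column first. The following is a minimal sketch on two sample rows; it assumes that `", "` is the only separator and that neighbourhood names never contain a comma themselves.

```python
import pandas as pd

# two sample rows; M5A lists two neighbourhoods in one cell
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M5A"],
    "Neighbourhood": ["Parkwoods", "Regent Park, Harbourfront"],
})

# split each cell into a list, then give every name its own row
names = sample["Neighbourhood"].str.split(", ").explode().str.strip()

n_postal_codes = len(sample)
n_neighbourhoods = names.nunique()
print(f"{n_postal_codes} postal codes, {n_neighbourhoods} distinct neighbourhoods")
```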


Save the dataframe to a CSV file so that work can continue in the next part of the exercise.

In [105]:
df.to_csv('postal_codes_canada.csv')
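By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read without further options. The sketch below shows the difference on two sample rows, using an in-memory buffer instead of a real file. Whether to pass `index=False` depends on how the next part of the exercise reads the file, which is not shown here.

```python
import io

import pandas as pd

# two sample rows with non-default index labels, as in the cleaned dataframe
sample = pd.DataFrame(
    {"Postal Code": ["M3A", "M4A"], "Borough": ["North York", "North York"]},
    index=[2, 3],
)

# round trip with the index written (default behaviour)
with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# round trip without the index
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Reading the default output with `pd.read_csv(..., index_col=0)` is another way to restore the original index instead of getting an extra column.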