<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>
<h1 align=center><font size = 2>Ilan Benchetrit</font></h1>

## Scraping data

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Now let's import Beautiful Soup and its dependencies to scrape the Wikipedia page.

In [2]:
!conda install -c anaconda beautifulsoup4 --yes
from bs4 import BeautifulSoup

!conda install -c anaconda lxml --yes
!conda install -c anaconda html5lib --yes
!conda install -c anaconda requests --yes
import requests

print('BeautifulSoup and its dependencies imported')

#### Load the HTML page and scrape it with BeautifulSoup

In [43]:
!wget -q -O 'toronto_data.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Data downloaded!')

Data downloaded!
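The `!wget` line above depends on a shell command being available in the notebook environment. As a hedged alternative, the same download can be sketched with Python's standard library. The helper name `download_page`, the `timeout` default, and the `User-Agent` string are assumptions made for illustration and are not part of the original notebook.

```python
import urllib.request

# Same page as in the wget command above
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def download_page(url, filename, timeout=30):
    """Fetch `url` and write the raw bytes to `filename` (hypothetical helper)."""
    # The User-Agent value is an arbitrary illustrative string
    request = urllib.request.Request(
        url, headers={"User-Agent": "toronto-notebook-example/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content = response.read()
    with open(filename, "wb") as out_file:
        out_file.write(content)
    # Return the number of bytes written, as a quick sanity check
    return len(content)


# Usage (needs network access, so it is left commented out here):
# download_page(URL, "toronto_data.html")
```

Writing the bytes unchanged keeps the saved file equivalent to what `wget` would produce, so the BeautifulSoup cell that follows should not need to change.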


In [44]:
with open("toronto_data.html") as html_file:
    wikipage = BeautifulSoup(html_file,'lxml')

body = wikipage.find('tbody')

#print(body.prettify())

Then, we extract the useful data from the HTML page.
<br>In the following code, we assume that:
- if a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. 
- if a cell has a neighbourhood but no borough (M7A for instance), the borough is recorded as Not assigned.
- as it is not requested, we get rid of cardinal specifications
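The rules above can be sketched on plain strings, without any HTML, to make the splitting logic easier to follow. This is a minimal sketch that mirrors the parsing loop in the next cell; it assumes each cell's text looks like `Borough(Neighbourhood A / Neighbourhood B)`, which is an assumption about the page layout. The function name `split_cell_text` and the sample strings are hypothetical, and unlike the loop it also strips surrounding whitespace.

```python
def split_cell_text(text):
    """Split a cell's text into (borough, neighbourhood).

    Assumes the layout 'Borough(Neighbourhood A / Neighbourhood B)';
    this is an illustrative assumption, not a guarantee about the page.
    """
    if text.strip() == "Not assigned":
        # Unassigned cell: both fields carry the same placeholder
        return "Not assigned", "Not assigned"
    if "(" not in text:
        # No borough part: keep the neighbourhood, flag the borough
        neighbourhood = text.split(")")[0].replace(" /", ",")
        return "Not assigned", neighbourhood.strip()
    borough, rest = text.split("(", 1)
    # Keep only what is inside the first pair of parentheses
    neighbourhood = rest.split(")")[0].replace(" /", ",")
    return borough.strip(), neighbourhood.strip()


# Illustrative inputs (assumed raw strings, not copied from the page)
print(split_cell_text("North York(Parkwoods)"))
print(split_cell_text("Downtown Toronto(Regent Park / Harbourfront)"))
print(split_cell_text("Not assigned"))
```

Rows whose borough comes back as `Not assigned` are the ones dropped from the dataframe later on.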

In [45]:
data = []
for cell in body.find_all('p'):
    # first extract the postal code from the <b> element
    pcode = cell.b.text

    # then extract the borough and the neighbourhood
    try:
        try:
            borough = cell.i.text  # the borough is not assigned, so it is formatted in italics
            neighbourhood = borough
        except AttributeError:  # no <i> element: the cell has an assigned borough
            borough = cell.span.text.split('(')[0]
            neighbourhood = cell.span.text.split('(')[1]  # split the borough from the neighbourhoods
            neighbourhood = neighbourhood.split(')')[0]  # get rid of cardinal specifications
            neighbourhood = neighbourhood.replace(' /', ',')
    except IndexError:  # no '(' in the text: postal code without a borough, like M7A
        borough = 'Not assigned'
        neighbourhood = cell.span.text
        neighbourhood = neighbourhood.split(')')[0]  # get rid of cardinal specifications
        neighbourhood = neighbourhood.replace(' /', ',')

    # append this row to the data list
    data.append([pcode, borough, neighbourhood])

#print(data)

#### Transform the data into a *pandas* dataframe

In [46]:
# define the dataframe columns
column_names = ['Postal Code', 'Borough', 'Neighbourhood'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)

#df

In [47]:
# Build the dataframe directly from the list of rows; this replaces the
# row-by-row DataFrame.append, which was removed in pandas 2.0
df = pd.DataFrame(data, columns=column_names)

#df

In [48]:
# Delete rows for which Borough is not assigned
indexNames = df[ df['Borough'] == 'Not assigned' ].index
df.drop(indexNames , inplace=True)
df

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 8 | M9A | Etobicoke | Islington Avenue |
| 9 | M1B | Scarborough | Malvern, Rouge |
| 11 | M3B | North York | Don Mills |
| 12 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 13 | M5B | Downtown Toronto | Garden District, Ryerson |
| 14 | M6B | North York | Glencairn |


As requested, here is the shape of the final dataframe with clean data

In [49]:
df.shape

(102, 3)