# Segmenting and Clustering Neighbourhoods in Toronto

This notebook uses data about postal codes in Toronto. The data are published in a table on a Wikipedia page and will be scraped and transformed into a _pandas_ dataframe.

Scraping data from the Wikipedia page requires a web scraping package. 
Here BeautifulSoup is used, with lxml as the parser for the HTML. Other parsers can be used; for more information, see the [BeautifulSoup documentation about different parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers).
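
Since the parser is just a name passed to the `BeautifulSoup` constructor, switching parsers is a one-word change. The sketch below assumes a hypothetical helper, `make_soup`, that is not part of the original notebook: it tries lxml first and falls back to Python's built-in `html.parser` when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Hypothetical helper: parse with lxml if available, else html.parser."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # lxml is not installed: fall back to the parser shipped with Python
        return BeautifulSoup(html, 'html.parser')


# A made-up HTML fragment, used only for illustration
soup = make_soup('<p>Toronto <b>postal codes</b></p>')
print(soup.p.get_text())  # Toronto postal codes
```

For well-formed HTML the two parsers give the same tree for the tags of interest; they differ mainly in how they repair invalid markup.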

#### __Installing packages__

In [1]:
# Installing beautifulsoup4, the current version of BeautifulSoup
import sys
!conda install --yes --prefix {sys.prefix} beautifulsoup4

Collecting package metadata: ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\Duratorre\Anaconda

  added / updated specs:
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    beautifulsoup4-4.7.1       |           py36_1         144 KB
    certifi-2019.3.9           |           py36_0         156 KB
    conda-4.6.14               |           py36_0         2.1 MB
    ------------------------------------------------------------
                                           Total:         2.4 MB

The following packages will be SUPERSEDED by a higher-priority channel:

  beautifulsoup4     conda-forge::beautifulsoup4-4.7.1-py3~ --> pkgs/main::beautifulsoup4-4.7.1-py36_1
  ca-certificates    conda-forge::ca-certificates-2019.3.9~ --> pkgs/main::ca-certificates-2019.1.23-0
  certifi                                

In [2]:
# Installing lxml, the parser used in this notebook
!conda install --yes --prefix {sys.prefix} lxml

Collecting package metadata: ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\Duratorre\Anaconda

  added / updated specs:
    - lxml


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    lxml-4.3.3                 |   py36h1350720_0         1.2 MB
    ------------------------------------------------------------
                                           Total:         1.2 MB

The following packages will be SUPERSEDED by a higher-priority channel:

  lxml               conda-forge::lxml-4.3.3-py36heafd4d3_0 --> pkgs/main::lxml-4.3.3-py36h1350720_0




#### __Importing libraries__

In [2]:
import requests
from bs4 import BeautifulSoup

#### __Fetching data from the url__

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
data = requests.get(url).text
soup = BeautifulSoup(data, 'lxml') #creating a BeautifulSoup object
#print(soup.prettify()) #show the html in a formatted way
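
The request above assumes the page is reachable and returns successfully. A slightly more defensive variant is sketched below; `fetch_html` is a hypothetical helper name and the 10-second timeout is an arbitrary choice, neither of which comes from the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page HTML or raise on failure."""
    # timeout prevents the call from hanging indefinitely
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


# Usage (not executed here, since it needs network access):
# data = fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
```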

The data to be scraped are contained in the `table` element of the HTML whose class is "wikitable sortable". This element can be retrieved with the `find` method.

In [4]:
table = soup.find('table', class_='wikitable sortable')
#print(table.prettify())
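
Passing a string with two class names to `class_` matches the exact value of the `class` attribute. As a minimal sketch on a made-up miniature page (not the real Wikipedia HTML), the same table can also be located with a CSS selector, which matches the two classes in any order:

```python
from bs4 import BeautifulSoup

# A made-up miniature page standing in for the Wikipedia HTML
html = """
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable"><tr><td>M3A</td></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Matches the exact string value of the class attribute
by_find = soup.find('table', class_='wikitable sortable')

# CSS selector: matches a table that carries both classes, in any order
by_css = soup.select_one('table.wikitable.sortable')

print(by_find.td.get_text())  # M3A
print(by_css.td.get_text())   # M3A
```

`find` returns `None` when nothing matches, so checking the result before using it makes later errors easier to understand.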

Next, the data from the table are inserted into a list, with one entry per table row.

In [5]:
prototype = [] #list that will contain the data

for line in table.find_all('tr'): #each table row is enclosed in a 'tr' tag
    row = line.text.strip('\n').split('\n') #splitting the row text into its 3 cell values
    prototype.append(row)
                     
#print(prototype)
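
Splitting the row text on newlines relies on each cell being on its own line in the HTML. An alternative sketch, using a made-up three-row table, reads the header and data cells of each row directly, so it does not depend on how the markup is laid out:

```python
from bs4 import BeautifulSoup

# A made-up table with the same three columns as the Wikipedia one
html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
<tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find('table').find_all('tr'):
    # collect header ('th') and data ('td') cells, stripping surrounding whitespace
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    rows.append(cells)

print(rows[0])  # ['Postcode', 'Borough', 'Neighbourhood']
print(rows[1])  # ['M3A', 'North York', 'Parkwoods']
```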

#### __Creating a pandas dataframe__

In [6]:
#importing pandas 
import pandas as pd

In [7]:
df = pd.DataFrame(prototype[1:], columns = prototype[0]) #first row of prototype contains the columns' names
df.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |


Some rows have no Borough assigned ('Not assigned'), so these rows are dropped.

In [8]:
df2 = df[df.Borough != 'Not assigned'].reset_index(drop=True) #drop=True discards the old index instead of keeping it as a column
df2.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
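
The filtering step is a boolean mask followed by an index reset. As a minimal sketch on a made-up three-row frame (the column names follow the notebook, the data are illustrative only):

```python
import pandas as pd

# Made-up sample rows with the notebook's column names
df = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# True for rows to keep, False for rows without a borough
mask = df['Borough'] != 'Not assigned'

# Without the reset, the surviving rows would keep their old labels 1 and 2
filtered = df[mask].reset_index(drop=True)
print(filtered)
```

Only the row whose Borough is 'Not assigned' is removed; a row with an assigned Borough but an unassigned Neighbourhood is kept.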


Where the Neighbourhood is 'Not assigned', it is set to the name of the corresponding Borough.

In [9]:
locations = df2['Neighbourhood'] == 'Not assigned'
df2.loc[locations,'Neighbourhood'] = df2.loc[locations,'Borough']
df2.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
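
The same replacement can be written without `.loc` by using `Series.mask`, which swaps in values from another series wherever a condition holds. This is an alternative sketch on made-up data, not the approach used in the cell above:

```python
import pandas as pd

# Made-up sample rows with the notebook's column names
df = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# True where the neighbourhood is missing
missing = df['Neighbourhood'] == 'Not assigned'

# Where 'missing' is True, take the value from the Borough column instead
df['Neighbourhood'] = df['Neighbourhood'].mask(missing, df['Borough'])
print(df['Neighbourhood'].tolist())  # ['Parkwoods', "Queen's Park"]
```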


Some postal code areas contain more than one neighbourhood, so the same postal code is listed more than once. In this case, the rows that share a postal code are combined into a single row, with the neighbourhoods separated by commas.

In [10]:
join_postcodes = df2.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join)
join_postcodes.head(10)

Postcode  Borough    
M1B       Scarborough                                    Rouge,Malvern
M1C       Scarborough             Highland Creek,Rouge Hill,Port Union
M1E       Scarborough                  Guildwood,Morningside,West Hill
M1G       Scarborough                                           Woburn
M1H       Scarborough                                        Cedarbrae
M1J       Scarborough                              Scarborough Village
M1K       Scarborough        East Birchmount Park,Ionview,Kennedy Park
M1L       Scarborough                    Clairlea,Golden Mile,Oakridge
M1M       Scarborough    Cliffcrest,Cliffside,Scarborough Village West
M1N       Scarborough                       Birch Cliff,Cliffside West
Name: Neighbourhood, dtype: object

In [11]:
clean_data = pd.DataFrame(join_postcodes).reset_index()
clean_data.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park,Ionview,Kennedy Park |
| 7 | M1L | Scarborough | Clairlea,Golden Mile,Oakridge |
| 8 | M1M | Scarborough | Cliffcrest,Cliffside,Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff,Cliffside West |
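
The grouping and joining can be seen in isolation on a made-up three-row frame. This sketch uses `agg` with `', '.join` (a comma plus a space, a cosmetic choice that differs from the notebook) and shows that `groupby` sorts the result by its keys by default, which is why the output above starts at M1B rather than M3A:

```python
import pandas as pd

# Made-up sample: M5A appears twice, M3A once
df = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

joined = (
    df.groupby(['Postcode', 'Borough'])['Neighbourhood']
      .agg(', '.join)   # concatenate the neighbourhoods of each group
      .reset_index()    # turn the group keys back into ordinary columns
)
print(joined)
```

Passing `sort=False` to `groupby` would keep the groups in order of first appearance instead.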


The data are now ready for analysis.

In [12]:
clean_data.shape

(103, 3)
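
Besides checking the shape, a couple of quick sanity checks can confirm that the cleaning steps did what was intended. The sketch below runs them on a made-up two-row frame; the checks themselves are an assumption about what "clean" should mean here, namely one row per postal code and no remaining 'Not assigned' values.

```python
import pandas as pd

# Made-up sample in the same layout as clean_data
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge,Malvern', 'Highland Creek,Rouge Hill,Port Union'],
})

# Each postal code should appear exactly once after grouping
codes_unique = sample['Postcode'].is_unique

# No cell should still hold the placeholder value
has_placeholder = bool((sample == 'Not assigned').any().any())

print(codes_unique, has_placeholder)  # True False
```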

In [13]:
clean_data.to_csv('clean_data2.csv')
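
By default `to_csv` also writes the dataframe index as an unlabelled first column, which `read_csv` later names `Unnamed: 0`. A minimal sketch on made-up data, writing to an in-memory buffer instead of a file, shows the difference that `index=False` makes:

```python
import io

import pandas as pd

# Made-up single-row sample in the same layout as clean_data
sample = pd.DataFrame({
    'Postcode': ['M1B'],
    'Borough': ['Scarborough'],
    'Neighbourhood': ['Rouge,Malvern'],
})

# Default: the index is written as an extra, unlabelled column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the three data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Postcode', 'Borough', 'Neighbourhood']
print(cols_without_index)  # ['Postcode', 'Borough', 'Neighbourhood']
```

Whether to keep the index depends on how the file will be read back; with the default, `pd.read_csv('clean_data2.csv', index_col=0)` restores it as the index.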