<h1><center>Segmenting and Clustering Toronto's Neighborhoods</center></h1>

<h2>Import the required libraries</h2>

In [1]:
import urllib.request
import pandas as pd
import numpy as np

<h1>Scraping libraries and packages in Python</h1>

While searching online, I found a scraping script on GitHub that parses HTML tables into Python lists, which can then be loaded into a pandas DataFrame, so I decided to use it.
The code was written by Josua Schmid and is reproduced below:

In [2]:
# -----------------------------------------------------------------------------
# Name:        html_table_parser
# Purpose:     Simple class for parsing an (x)html string to extract tables.
#              Written in python3
#
# Author:      Josua Schmid
#
# Created:     05.03.2014
# Copyright:   (c) Josua Schmid 2014
# Licence:     AGPLv3
# -----------------------------------------------------------------------------

from html.parser import HTMLParser


class HTMLTableParser(HTMLParser):
    """ This class serves as a html table parser. It is able to parse multiple
    tables which you feed in. You can access the result per .tables field.
    """
    def __init__(
        self,
        decode_html_entities=False,
        data_separator=' ',
    ):

        HTMLParser.__init__(self, convert_charrefs=decode_html_entities)

        self._data_separator = data_separator

        self._in_td = False
        self._in_th = False
        self._current_table = []
        self._current_row = []
        self._current_cell = []
        self.tables = []

    def handle_starttag(self, tag, attrs):
        """ We need to remember the opening point for the content of interest.
        The other tags (<table>, <tr>) are only handled at the closing point.
        """
        if tag == 'td':
            self._in_td = True
        if tag == 'th':
            self._in_th = True

    def handle_data(self, data):
        """ This is where we save content to a cell """
        if self._in_td or self._in_th:
            self._current_cell.append(data.strip())
    
    def handle_endtag(self, tag):
        """ Here we exit the tags. If the closing tag is </tr>, we know that we
        can save our currently parsed cells to the current table as a row and
        prepare for a new row. If the closing tag is </table>, we save the
        current table and prepare for a new one.
        """
        if tag == 'td':
            self._in_td = False
        elif tag == 'th':
            self._in_th = False

        if tag in ['td', 'th']:
            final_cell = self._data_separator.join(self._current_cell).strip()
            self._current_row.append(final_cell)
            self._current_cell = []
        elif tag == 'tr':
            self._current_table.append(self._current_row)
            self._current_row = []
        elif tag == 'table':
            self.tables.append(self._current_table)
            self._current_table = []
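
To see why the class above tracks `_in_td` and `_in_th` flags, it helps to watch the order in which `HTMLParser` fires its callbacks. The snippet below is only a sketch: `EventLogger` and `sample_html` are hypothetical names I made up for illustration, and the HTML string is a tiny hand-written table, not the Wikipedia page.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every parser callback in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only chunks between tags
            self.events.append(("data", text))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


# Hand-written toy table (an assumption, not the real page content)
sample_html = (
    "<table>"
    "<tr><th>Postal Code</th><th>Borough</th></tr>"
    "<tr><td>M3A</td><td>North York</td></tr>"
    "</table>"
)

logger = EventLogger()
logger.feed(sample_html)
for event in logger.events:
    print(event)
```

Each cell arrives as a start tag, one or more data chunks, and an end tag, which is why the table parser collects data only while a cell is open and closes the row when `</tr>` arrives.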

<h1>Setting parameters</h1>

In [3]:
#Target the URL of Wikipedia
target='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Build the request for the URL and get the website content
req=urllib.request.Request(url=target)
f=urllib.request.urlopen(req)
xhtml=f.read().decode('utf-8')

#Instantiate HTMLTableParser Class and feed it
p=HTMLTableParser()
p.feed(xhtml)

#Write the table into Python list
Table_list=p.tables

#Use the from_dict method to read the first table into a pandas DataFrame
Canada_df=pd.DataFrame.from_dict(Table_list[0])

In [4]:
Canada_df

Unnamed: 0,0,1,2
0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
...,...,...,...
176,M5Z,Not assigned,Not assigned
177,M6Z,Not assigned,Not assigned
178,M7Z,Not assigned,Not assigned
179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Notice that the column labels are stored in the first row of the table content, so let's use the rename and drop methods to fix the table.

In [5]:
Canada_df.rename(columns=Canada_df.iloc[0],inplace=True)
Canada_df.drop([0], inplace=True)
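
The same header fix can be tried on a small frame first. This is a minimal sketch under my own assumptions: `raw` and `clean` are hypothetical names, the three rows are typed in by hand, and it assigns the column labels directly instead of using `rename`, which is a different route to the same result.

```python
import pandas as pd

# Toy frame whose first row holds the column labels (hand-typed example)
raw = pd.DataFrame([
    ["Postal Code", "Borough", "Neighbourhood"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
])

# Keep only the data rows, then promote the first row to column labels
clean = raw.iloc[1:].copy()
clean.columns = raw.iloc[0].tolist()
clean = clean.reset_index(drop=True)

print(clean)
```

Working on a copy leaves `raw` untouched, so the step can be re-run without reloading the page.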

In [6]:
Canada_df

Unnamed: 0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
176,M5Z,Not assigned,Not assigned
177,M6Z,Not assigned,Not assigned
178,M7Z,Not assigned,Not assigned
179,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


<h1>Cleaning the data</h1>

<h3>1. Ignore cells with a borough that is Not assigned.</h3>

In [8]:
df1=Canada_df[Canada_df['Borough']!='Not assigned']
df1.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
9,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
10,M1B,Scarborough,"Malvern, Rouge"
12,M3B,North York,Don Mills
13,M4B,East York,"Parkview Hill, Woodbine Gardens"
14,M5B,Downtown Toronto,"Garden District, Ryerson"
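
The filter in this step is a boolean mask. Here is a small sketch of the idea on made-up rows; `toy` and `assigned` are hypothetical names, and the `.copy()` call is my own addition to get an independent frame rather than a view of the original.

```python
import pandas as pd

# Hand-typed toy data; one row has no borough assigned
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# The comparison yields a True/False Series; indexing with it keeps the True rows
mask = toy["Borough"] != "Not assigned"
assigned = toy[mask].copy()

print(assigned)
```

Note that the surviving rows keep their original index labels, which is why the index is reset in a later step.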


<h3>2. Combine repeated postal codes into one row; first, check for repeats by comparing nunique with len, as below:</h3>

In [9]:
if(df1['Postal Code'].nunique()==len(df1['Postal Code'])):
    print('No repeated values')
else:
    print('There are repeated values')

No repeated values


Notice that both values are the same, so no postal code is repeated and there is nothing to combine.
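
For reference, if the check had found repeats, one way to combine them would be to group by postal code and join the neighbourhood names. This is only a sketch on made-up rows: `toy`, `has_duplicates` and `combined` are hypothetical names, and joining with a comma is my assumption about the desired format.

```python
import pandas as pd

# Hand-typed toy data where M5A appears twice
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# duplicated() flags every repeat after the first occurrence
has_duplicates = toy["Postal Code"].duplicated().any()

# One row per postal code, neighbourhoods joined with a comma
combined = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(has_duplicates)
print(combined)
```

Passing `sort=False` keeps the groups in order of first appearance instead of sorting them alphabetically.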

<h3>3. Replace "Not assigned" neighbourhood values; first, count how many there are</h3>

In [10]:
df1[df1['Neighbourhood']=='Not assigned']['Neighbourhood'].count()

0

Notice that the Neighbourhood column has no "Not assigned" values, so there is nothing to replace.
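
If there had been such values, a possible replacement would look like the sketch below. It assumes the missing neighbourhood should take the name of its borough, which is my assumption and not something this notebook states; `toy` and `missing` are hypothetical names and the rows are made up.

```python
import pandas as pd

# Hand-typed toy data; the first row has no neighbourhood assigned
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Flag the rows to fix, then take the value from Borough where flagged
missing = toy["Neighbourhood"] == "Not assigned"
toy["Neighbourhood"] = toy["Neighbourhood"].mask(missing, toy["Borough"])

print(toy)
```

`Series.mask` replaces values only where the condition is True, so rows that already have a neighbourhood stay as they were.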

<h3>4. Reset index to keep order</h3>

In [11]:
df2=df1.reset_index(drop=True)
df2

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [12]:
df2.shape

(103, 3)