# Segmenting and Clustering Neighborhoods in Toronto

## Data Mining

This section describes how to mine the data from the Wikipedia page: building the code to scrape the page and writing the data into a DataFrame.

### Scraping

* __Use the Notebook to build the code to scrape the following [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M "List of postal codes of Canada: M").__

Loading the necessary libraries. Unlike <code>pandas</code>, both <code>requests</code> and <code>lxml</code> may need to be installed first.

In [600]:
import pandas as pd

try:
    import requests
except ModuleNotFoundError:
    # Install the missing package from inside the notebook, then retry the import.
    ! conda install --yes requests
    import requests

try:
    import lxml.html as lh
except ModuleNotFoundError:
    # Same fallback for lxml.
    ! conda install --yes lxml
    import lxml.html as lh

The code below allows us to get the data from the table from the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M "List of postal codes of Canada: M").

In [601]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'



# Create a handle, page, to handle the contents of the website.
page = requests.get(url)

# Store the contents of the website under doc.
doc = lh.fromstring(page.content)

# Parse data that are stored between <tr>..</tr> of HTML.
tr_elements = doc.xpath('//tr')

Check the length of the first 12 rows.

In [602]:
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

The first 12 rows all have exactly 3 cells, which suggests that the leading rows collected in <code>tr_elements</code> come from the table. Next, parse the first row as our header.
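The check above looks only at the first 12 rows. A minimal sketch of tallying the width of every row and locating the first row that does not fit is shown here. The `rows` list is a hypothetical stand-in made of plain lists, not the real `tr_elements`; `len()` works the same way on `lxml` row elements.

```python
from collections import Counter

# Hypothetical stand-in for tr_elements: each inner list plays the role of one <tr>.
rows = [
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["Canadian postal codes"],  # a row that belongs to some other table
]

# Count how many rows have each width.
width_counts = Counter(len(row) for row in rows)
print(width_counts)

# Index of the first row that is not 3 cells wide, or None if every row matches.
first_odd = next((i for i, row in enumerate(rows) if len(row) != 3), None)
print(first_odd)
```

If the tally shows a single width, every row belongs to the table; otherwise `first_odd` marks where the table ends.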

Making a scraper.

In [603]:
def preparator(tr_elements):
    # Use the cells of the first row as headers; pair each header with an empty list.
    col = [(t.text_content(), []) for t in tr_elements[0]]
    # Since our first row is the header, data is stored from the second row onwards.
    for T in tr_elements[1:]:
        # If the row is not of size 3, the //tr data is not from our table.
        if len(T) != 3:
            break

        # i is the index of our column.
        for i, t in enumerate(T.iterchildren()):
            data = t.text_content()
            # For every column except the first, convert numeric values to integers.
            if i > 0:
                try:
                    data = int(data)
                except ValueError:
                    pass
            # Append the data to the list of the i-th column.
            col[i][1].append(data)
    return col

In [639]:
col = preparator(tr_elements)
col[0][1][:10]
col[1][1][:10]

['Not assigned',
 'Not assigned',
 'North York',
 'North York',
 'Downtown Toronto',
 'North York',
 'North York',
 'Downtown Toronto',
 'Not assigned',
 "Queen's Park"]

Scraping is complete.

### Data Cleaning

Forming the DataFrame from the scraped data. Examining the data for excess symbols, missing values, or other problems in the string columns of the DataFrame.
Changing the headers where necessary, etc.

* __The dataframe will consist of three columns: <span style="color:red">'PostalCode'</span>, <span style="color:red">'Borough'</span>, and <span style="color:red">'Neighborhood'</span>__

Examining number of columns and length of each column.

In [605]:
print(len(col), [len(C) for (title,C) in col])

3 [287, 287, 287]


Creating DataFrame.

In [606]:
dictionary={title:column for (title,column) in col}
df=pd.DataFrame(dictionary)

df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n
5,M6A,North York,Lawrence Heights\n
6,M6A,North York,Lawrence Manor\n
7,M7A,Downtown Toronto,Queen's Park\n
8,M8A,Not assigned,Not assigned\n
9,M9A,Queen's Park,Not assigned\n


In [607]:
df.tail()

Unnamed: 0,Postcode,Borough,Neighbourhood
282,M8Z,Etobicoke,Mimico NW\n
283,M8Z,Etobicoke,The Queensway West\n
284,M8Z,Etobicoke,Royal York South West\n
285,M8Z,Etobicoke,South of Bloor\n
286,M9Z,Not assigned,Not assigned\n


Renaming column <span style="color:red">'Postcode'</span> to <span style="color:red">'PostalCode'</span> and removing the trailing newline from the <span style="color:red">'Neighbourhood'</span> header.

In [608]:
df.rename(columns={'Postcode': 'PostalCode', 'Neighbourhood\n': 'Neighbourhood'}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n
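When several headers might carry stray whitespace, stripping all of them at once avoids listing each one in `rename`. The following is a minimal sketch on a toy frame; `toy` is a hypothetical name and is not part of the notebook.

```python
import pandas as pd

# Toy frame whose last header carries a trailing newline, as in the scraped table.
toy = pd.DataFrame({
    "Postcode": ["M3A"],
    "Borough": ["North York"],
    "Neighbourhood\n": ["Parkwoods\n"],
})

# Strip surrounding whitespace from every header, then rename one column.
toy.columns = toy.columns.str.strip()
toy = toy.rename(columns={"Postcode": "PostalCode"})
print(list(toy.columns))
```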


Check whether all names in the <span style="color:red">'Neighbourhood'</span> column end with the same 'dirty' tail, such as <span style="color:red">'\n'</span>.

In [609]:
len(pd.unique(df['Neighbourhood'].apply(lambda s: s[-1:])))

1

Yes. Each of them has the same "tin can tied to its tail". Now that we know this, we can apply a very simple rule to cut the "tin can" off.

In [610]:
df['Neighbourhood'] = df.Neighbourhood.apply(lambda s: s[:-1])
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
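Slicing with `s[:-1]` removes the last character whatever it is, so it relies on the check above. A more defensive variant, sketched here on a hypothetical toy Series, uses the pandas string accessor `str.rstrip`, which removes only trailing newlines and leaves clean values untouched.

```python
import pandas as pd

# Toy values: two with a trailing newline, one already clean.
names = pd.Series(["Parkwoods\n", "Victoria Village\n", "Harbourfront"])

# Remove trailing newline characters only.
cleaned = names.str.rstrip("\n")
print(cleaned.tolist())
```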


Now the DataFrame consists of three columns: <span style="color:red">'PostalCode'</span>, <span style="color:red">'Borough'</span>, and <span style="color:red">'Neighbourhood'</span>.

* __Only process the cells that have an assigned <span style="color:red">'Borough'</span>. Ignore cells with a <span style="color:red">'Borough'</span> that is 'Not assigned'.__

Keep only the rows that have an assigned value in the <span style="color:red">'Borough'</span> column.

In [611]:
df = df[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Reset the index so that it starts from 0 again.

In [612]:
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
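The filtering and index reset above can also be chained into one expression. This is a sketch on hypothetical toy data (`toy` and `assigned` are illustrative names), not a replacement for the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows with an assigned borough and renumber the index from 0 in one step.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Assigning the result to a new name also leaves the original frame untouched, which makes it easier to rerun a cell.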


* __If a cell has a <span style="color:red">'Borough'</span> but a 'Not assigned' <span style="color:red">'Neighbourhood'</span>, then the <span style="color:red">'Neighbourhood'</span> content will be the same as the <span style="color:red">'Borough'</span>.__ 

Checking whether such rows exist.

In [613]:
# Work on a copy so that the assignment does not target a view of df.
test = df[df['Neighbourhood'] == 'Not assigned'].copy()
test['Neighbourhood'] = test['Borough']
test.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
6,M9A,Queen's Park,Queen's Park


Fix it according to instructions.

In [614]:
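# Select the affected rows and copy the borough name into 'Neighbourhood'.
mask = df['Neighbourhood'] == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']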
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Queen's Park,Queen's Park
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North
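The same rule can be written without an intermediate DataFrame by using `Series.where`, which keeps a value where the condition holds and takes the replacement otherwise. This is a sketch on hypothetical toy data.

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M7A", "M9A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Queen's Park", "Not assigned"],
})

# Keep assigned neighbourhoods; fall back to the borough name otherwise.
toy["Neighbourhood"] = toy["Neighbourhood"].where(
    toy["Neighbourhood"] != "Not assigned", toy["Borough"]
)
print(toy)
```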


* __More than one neighborhood can exist in one postal code area. These names will be combined into one row with the neighborhoods separated with a comma.__

Converting each value of the <span style="color:red">'Neighbourhood'</span> column to a <code><span style="color:green">list</span></code> object, then grouping <span style="color:red">'Neighbourhood'</span> by <span style="color:red">'PostalCode'</span>, and finally applying <code>sum()</code> to concatenate the grouped lists.

In [615]:
df['Neighbourhood'] = df.Neighbourhood.apply(lambda s: [s])
pc = df[['PostalCode', 'Neighbourhood']]
pc = pc.groupby(['PostalCode']).sum()
pc.head()

PostalCode,Neighbourhood
M1B,"[Rouge, Malvern]"
M1C,"[Highland Creek, Rouge Hill, Port Union]"
M1E,"[Guildwood, Morningside, West Hill]"
M1G,[Woburn]
M1H,[Cedarbrae]


Joining each list into a comma-separated string, which frees the <span style="color:red">'Neighbourhood'</span> content from the square brackets.

In [616]:
pc['Neighbourhood'] = pc.Neighbourhood.apply(lambda l: ', '.join(l))
pc.reset_index(inplace=True)
pc.head()

Unnamed: 0,PostalCode,Neighbourhood
0,M1B,"Rouge, Malvern"
1,M1C,"Highland Creek, Rouge Hill, Port Union"
2,M1E,"Guildwood, Morningside, West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae


Check whether the DataFrames that must be joined have the same shape.

In [617]:
df.shape == pc.shape

False

Since the <code>.shape</code> of the <code>df</code> and <code>pc</code> DataFrames is not the same, another DataFrame is needed to join them.

Synthesizing the DataFrame <code>clear_df</code> from scratch.

In [618]:
clear_df = pd.DataFrame()
clear_df['PostalCode'] = pd.unique(df.PostalCode)
clear_df['Borough'] = False
clear_df['Neighbourhood'] = False
clear_df = clear_df.set_index('PostalCode')
clear_df.head()

PostalCode,Borough,Neighbourhood
M3A,False,False
M4A,False,False
M5A,False,False
M6A,False,False
M7A,False,False


Loading the data to the <code>clear_df</code> DataFrame.

In [619]:
df['Neighbourhood'] = df.Neighbourhood.apply(lambda s: s[0])
for pcd, b, _ in df.values:
    clear_df.loc[pcd, 'Borough'] = b
for pcd, nb in pc.values:
    clear_df.loc[pcd, 'Neighbourhood'] = nb
clear_df.reset_index().head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


Let us make sure that in the original DataFrame <code>df</code> no postal code is shared by two or more different boroughs.

In [620]:
test = clear_df.Borough.apply(lambda l: l.split(','))
test = test.apply(lambda l: len(l))
print('Number of distinct boroughs corresponding to each postal code not greater than {}.'.format(test.max()))

Number of distinct boroughs corresponding to each postal code not greater than 1.
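Because the loading loop stores a single borough per postal code in `clear_df` (each assignment overwrites the previous one), a more direct check counts the distinct boroughs per postal code in the row-level data. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M6A", "M6A", "M7A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})

# Number of distinct boroughs recorded for each postal code.
boroughs_per_code = toy.groupby("PostalCode")["Borough"].nunique()
print(boroughs_per_code.max())
```

A maximum of 1 means that no postal code is shared between boroughs.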


In [621]:
p_codes = ['M5G', 'M2H', 'M4B', 'M1J', 'M4G', 'M4M', 'M1R', 'M9V', 'M9L', 'M5V', 'M1B', 'M5A']
clear_df.loc[p_codes, :].reset_index()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Woodbine Gardens, Parkview Hill"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Maryvale, Wexford"
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo..."


It seems I made a mistake: 'M5A' should have two neighborhoods, 'Regent Park' and 'Harbourfront', but only 'Harbourfront' is there. How many rows in <code>df</code> contain postal code 'M5A'?

In [622]:
df[df['PostalCode'] == 'M5A']

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M5A,Downtown Toronto,Harbourfront


Only one row! So it is not my mistake. Let us make sure by visiting the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M "List of postal codes of Canada: M") and by searching the scraped data for 'Regent Park'.

In [623]:
len(df[df['Neighbourhood'] == 'Regent Park'])

0

There are 0 rows that have 'Regent Park' as a neighborhood.
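An equality test finds a name only when it fills the whole cell. Once names are combined into comma-separated strings, as in `clear_df`, a substring test is needed instead. A sketch on a hypothetical toy Series:

```python
import pandas as pd

neighbourhoods = pd.Series([
    "Harbourfront",
    "Rouge, Malvern",
    "Lawrence Heights, Lawrence Manor",
])

# Exact match: counts cells that consist of this name only.
exact = int((neighbourhoods == "Regent Park").sum())

# Substring match: also finds a name inside a combined cell.
inside = int(neighbourhoods.str.contains("Malvern", regex=False).sum())
print(exact, inside)
```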

* __In the last cell of your notebook, use the <code>.shape</code> method to print the number of rows of your dataframe.__

In [624]:
clear_df.shape

(103, 2)

Notebook: [Neighborhoods in Toronto.ipynb](https://github.com/Tungsteniac/Tungsteniac/blob/master/Neighborhoods%20in%20Toronto.ipynb)