# Clustering Neighborhoods in Toronto

This assignment explores and clusters the neighborhoods of Toronto.

## Import all essential packages
1. [Pandas](https://pandas.pydata.org/pandas-docs/stable/)
2. [Numpy](https://docs.scipy.org/doc/)

Note: reading HTML with pandas requires lxml to be installed, which can be done by running `pip install lxml`.

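The import cell below sets `InteractiveShell.ast_node_interactivity = "all"`, so every expression in a cell is displayed rather than only the last one. To set this behaviour for all instances of Jupyter (Notebook and Console), create a file `~/.ipython/profile_default/ipython_config.py` with the lines below.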

```python
c = get_config()
# Run all nodes interactively
c.InteractiveShell.ast_node_interactivity = "all"
```

In [41]:
import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Assignment requirement:
1. Use the Notebook to build the code to scrape the following [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)
2. The imported data frame should look like this:
<img src="img/example_df.png" style="height:500px">
3. To create the above dataframe:

    - The dataframe will consist of three columns: ___PostalCode___, ___Borough___, and ___Neighborhood___
    - Only process the cells that have an assigned borough. ___Ignore cells with a borough that is Not assigned.___
    - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row ___with the neighborhoods separated with a comma___ as shown in row 11 in the above table.

    - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
    - In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.
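The filtering and row-count requirements above can be sketched on a few rows. This is a minimal illustration, not the notebook's own cell: the `sample` frame is hypothetical data that mirrors the layout of the Wikipedia table, and the column names follow the scraped table (`Postcode`, `Neighbourhood`) rather than the assignment's spelling.

```python
import pandas as pd

# Hypothetical sample rows in the same layout as the scraped table
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront",
                      "Regent Park", "Not assigned"],
})

# Keep only the rows whose borough is assigned, then renumber the index
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

# .shape is a (rows, columns) tuple; the first entry is the row count
n_rows, n_cols = assigned.shape
print(n_rows)
```

Applying the filter before any merging keeps unassigned postal codes out of the final row count.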

In [45]:
with open("toronto_nb.html") as f:
    tables = pd.read_html(f.read())

toronto_nb = tables[0]
toronto_nb.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


### Tips:
The wiki page contains 3 tables; the first is the one we want, so take it directly from the returned list with `tables[0]`.
The `match` argument is worth trying, but it seems to require additional packages.
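As a rough sketch of the `match` argument mentioned above, the snippet below parses a small inline HTML string (hypothetical, not the real Wikipedia page) and keeps only the table whose text matches `"Postcode"`. `pd.read_html` needs an HTML parser such as lxml, so the example catches the `ImportError` that pandas raises when none is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables; only the first mentions "Postcode"
HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table>
  <tr><th>Other</th></tr>
  <tr><td>unrelated</td></tr>
</table>
"""

try:
    # match keeps only the tables containing text that matches the pattern
    tables = pd.read_html(StringIO(HTML), match="Postcode")
except ImportError:
    # No HTML parser (lxml, or bs4 with html5lib) is available
    tables = []

if tables:
    print(tables[0])
```

With `match`, the wanted table can be selected by its content rather than by its position in the returned list.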

In [43]:
indices = toronto_nb[(toronto_nb.Neighbourhood == 'Not assigned')].index
for index in indices:
    toronto_nb.loc[index, 'Neighbourhood'] = toronto_nb.loc[index, 'Borough']
toronto_nb.head()
toronto_nb.tail()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Unnamed: 0,Postcode,Borough,Neighbourhood
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor
287,M9Z,Not assigned,Not assigned
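The row-by-row loop in the cell above can also be written as a single boolean-mask assignment. The sketch below is one possible alternative, shown on a hypothetical `demo` frame rather than on `toronto_nb`.

```python
import pandas as pd

# Hypothetical rows: one neighbourhood is missing but its borough is known
demo = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Boolean mask of the rows whose neighbourhood is unassigned
unassigned = demo["Neighbourhood"] == "Not assigned"

# Copy the borough into the neighbourhood for exactly those rows
demo.loc[unassigned, "Neighbourhood"] = demo.loc[unassigned, "Borough"]
```

The assignment aligns on the index, so each masked row receives its own borough value.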


In [44]:
table = pd.pivot_table(toronto_nb, values=['Neighbourhood'], columns=['Postcode', 'Borough'], aggfunc=lambda x: ','.join(x))
toronto_nb = table.reset_index()
toronto_nb.drop('level_0', axis=1, inplace=True)
toronto_nb.rename(columns = {0: 'Neighbourhood'}, inplace=True)
toronto_nb.head()
toronto_nb.tail()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M1B,Scarborough,"Rouge,Malvern"
2,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
3,M1E,Scarborough,"Guildwood,Morningside,West Hill"
4,M1G,Scarborough,Woburn


Unnamed: 0,Postcode,Borough,Neighbourhood
175,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam..."
176,M9W,Etobicoke,Northwest
177,M9X,Not assigned,Not assigned
178,M9Y,Not assigned,Not assigned
179,M9Z,Not assigned,Not assigned
