# Instructions

For this assignment, you will explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to write code that scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts the table of postal codes, and transforms it into a pandas dataframe like the one shown below:

Format: ![What dataframe should look like](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1561766400000&hmac=d8kll666HCK-Njmyedf9N6OuZwJlu8ASZ3bvcJG7ST8)

3. To create the above dataframe:

   - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
   - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
   - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 of the above table.
   - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of both the Borough and the Neighborhood columns will be Queen's Park.
   - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
   - In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
4. Submit a link to your Notebook on your GitHub repository. (10 marks)
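
The requirements in step 3 can be turned into a small self-check. The sketch below is one possible way to do that, assuming the final dataframe uses the column names from the instructions (PostalCode, Borough, Neighborhood). The function name `check_requirements` and the sample rows are hypothetical and exist only for illustration.

```python
import pandas as pd

EXPECTED_COLUMNS = ["PostalCode", "Borough", "Neighborhood"]


def check_requirements(df):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append("columns differ from PostalCode, Borough, Neighborhood")
        return problems  # the remaining checks need these columns
    if (df["Borough"] == "Not assigned").any():
        problems.append("rows with an unassigned borough remain")
    if (df["Neighborhood"] == "Not assigned").any():
        problems.append("rows with an unassigned neighborhood remain")
    if df["PostalCode"].duplicated().any():
        problems.append("a postal code appears on more than one row")
    return problems


# Hypothetical sample rows, shaped like the expected result.
good = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront, Regent Park", "Queen's Park"],
})

# Hypothetical sample that still lists M5A twice.
bad = pd.DataFrame({
    "PostalCode": ["M5A", "M5A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Harbourfront", "Regent Park"],
})

print(check_requirements(good))
print(check_requirements(bad))
```

A check like this only confirms the shape of the result; it does not confirm that the scraped values themselves are correct.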

Note: There are different website scraping libraries and packages in Python. One of the most common packages is BeautifulSoup. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

The package is popular enough that there are many tutorials and examples of how to use it. Here is a very good YouTube video on how to use the BeautifulSoup package: https://www.youtube.com/watch?v=ng2o98k983k

Use the BeautifulSoup package, or any other method you are comfortable with, to transform the data in the table on the Wikipedia page into the above pandas dataframe.
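
For readers who prefer the BeautifulSoup route, here is a minimal sketch of turning an HTML table into a pandas dataframe. It parses a small inline HTML string rather than the live page, and that string is a hypothetical stand-in: the real page would first have to be downloaded, and its markup may differ from this sample.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source; not the actual Wikipedia markup.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # first <table> element in the document

# Collect the stripped text of every header and data cell, row by row.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# The first row holds the column names; the rest are data rows.
bs_df = pd.DataFrame(rows[1:], columns=rows[0])
print(bs_df)
```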

# Project

### Get data from website

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' 

In [52]:
# read_html returns a list of dataframes, one per table found on the page
tables = pd.read_html(url, header=0)

# we want the first table in the list
df = tables[0]
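
Selecting the table by position assumes the postal code table is the first one on the page. As a hedged alternative, the sketch below picks the table by its column names instead. The helper `pick_table` is hypothetical, and the small in-memory list only stands in for the list that `pd.read_html` returns.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first dataframe that contains every required column."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError(f"no table has columns {required_columns}")


# Hypothetical stand-ins for the tables parsed from a page.
tables = [
    pd.DataFrame({"Legend": ["x"]}),
    pd.DataFrame({
        "Postcode": ["M3A"],
        "Borough": ["North York"],
        "Neighbourhood": ["Parkwoods"],
    }),
]

picked = pick_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(picked)
```

This trades a dependency on table order for a dependency on column names, so it would still need adjusting if the page renames its headers.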

In [53]:
df.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Clean Data

In [54]:
df.replace("Not assigned", np.nan, inplace=True)    
df.dropna(subset=["Borough"], axis=0, inplace=True) # drop NaN values in Borough column

In [55]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [56]:
# Reset index because we dropped some rows.
df.reset_index(drop=True, inplace=True) 
df.head(2)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village


In [57]:
# Sort values to find 1 missing data point.
df.sort_values(by=["Neighbourhood"], na_position='first', inplace=True) 
df.head(2)

Unnamed: 0,Postcode,Borough,Neighbourhood
6,M7A,Queen's Park,
49,M5H,Downtown Toronto,Adelaide


In [58]:
# This only works because exactly one value is missing (the M7A row, whose
# borough is Queen's Park). With more missing values I would need a more
# general method.
df.replace(np.nan, "Queen's Park", inplace=True)
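
If more than one neighbourhood were missing, a more general approach would be to fill each gap from the Borough value on the same row. The following is a minimal sketch of that idea on a hypothetical three-row dataframe; the M9Z row and "Example Borough" are invented for illustration and are not from the Wikipedia table.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing neighbourhoods.
demo = pd.DataFrame({
    "Postcode": ["M7A", "M9Z", "M3A"],
    "Borough": ["Queen's Park", "Example Borough", "North York"],
    "Neighbourhood": [np.nan, np.nan, "Parkwoods"],
})

# fillna aligns on the index, so each gap takes the borough from its own row.
demo["Neighbourhood"] = demo["Neighbourhood"].fillna(demo["Borough"])
print(demo)
```

Unlike a blanket replacement, this touches only the Neighbourhood column and leaves rows that already have a value unchanged.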

In [59]:
df.sort_index(inplace=True)
df.head(2)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village


In [60]:
# Group the dataframe by the postcodes
grouped = df.groupby(['Postcode'])

# Isolate Neighbourhood
group_Neighbourhood = grouped['Neighbourhood']

# Find unique values
uniqNeigh = group_Neighbourhood.unique()

# Do the same with Borough
group_borough = grouped['Borough']
uniqBor = group_borough.unique()

# Reconstruct dataframe
newDF = pd.DataFrame(uniqBor)
newDF['Neighbourhood'] = uniqNeigh
df = newDF.reset_index()

In [61]:
df.head(2)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,[Scarborough],"[Rouge, Malvern]"
1,M1C,[Scarborough],"[Highland Creek, Rouge Hill, Port Union]"


In [62]:
# Pull each borough name out of its one-element array
lst = []
for i in df['Borough']:
    lst.append(i[0])

In [63]:
# Re-add names
df['Borough'] = lst

In [64]:
df.shape

(103, 3)

In [65]:
# Pull out the Neighbourhood series
seriesNeigh = df['Neighbourhood']

# Join each array of names into one comma-separated string, then replace the column
seriesNeigh = seriesNeigh.apply(', '.join)
df['Neighbourhood'] = seriesNeigh

In [67]:
df.head(2)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
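
The grouping steps above can also be condensed into a single `groupby` and `agg` call. This is a sketch on a small hypothetical sample rather than a replacement for the cells above. It assumes every postcode belongs to exactly one borough, because `"first"` keeps only the first borough seen in each group.

```python
import pandas as pd

# Hypothetical sample rows in the same layout as the cleaned dataframe.
rows = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per postcode: keep the first borough, join the neighbourhood names.
combined = (
    rows.groupby("Postcode", as_index=False)
        .agg({"Borough": "first", "Neighbourhood": ", ".join})
)
print(combined)
```

Because the join happens inside the aggregation, the intermediate array-valued columns and the later unpacking steps are not needed.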


In [None]:
df.shape
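
As a small illustration of what `df.shape` holds: it is a tuple of (rows, columns), so the row count the instructions ask for is its first element. The two-row dataframe below is a hypothetical sample used only to show the unpacking.

```python
import pandas as pd

# Hypothetical two-row sample in the final layout.
demo = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})

# shape is an attribute, not a method, so it is read without parentheses.
n_rows, n_cols = demo.shape
print(f"rows: {n_rows}, columns: {n_cols}")
```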