For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.
1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:
![dataframe](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1564704000000&hmac=uEVxJTTRbXL7VZOoCSjIH7wHCGYf9f0ywQFzqFOYfQI)

3. To create the above dataframe:

    - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    - Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.
    - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that **M5A** is listed twice and has two neighborhoods: **Harbourfront** and **Regent Park**. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in **row 11** in the above table.
    - If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough. So for the **9th** cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be **Queen's Park**.
    - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
    - In the last cell of your notebook, use the **.shape** attribute to print the number of rows of your dataframe.
    - Submit a link to your Notebook on your GitHub repository.
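
The three cleaning rules above can be tried out on a tiny hand-made frame before touching the scraped data. This is only a minimal sketch: the four rows are illustrative stand-ins rather than scraped values, the names `raw`, `assigned`, and `toy` are hypothetical, and the ", " separator is an assumption about how the combined neighborhoods should look.

```python
import pandas as pd

# Hand-made rows mimicking the Wikipedia table (illustrative, not scraped).
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: drop rows whose borough is "Not assigned".
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 3 (applied before merging): an unassigned neighborhood takes the borough name.
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# Rule 2: merge rows that share a postal code, joining neighborhoods with a comma.
toy = (
    assigned.groupby(["PostalCode", "Borough"], as_index=False)
    .agg({"Neighborhood": ", ".join})
)

print(toy)
print(toy.shape[0])  # number of rows after cleaning
```

Filling in the unassigned neighborhoods before grouping keeps the placeholder text from being joined into a combined string if a postal code ever mixes assigned and unassigned rows.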

In [1]:
# import the libraries
import pandas as pd
import numpy as np 
from bs4 import BeautifulSoup
import requests
from pprint import pprint

In [2]:
# scrape the data from the page
link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
raw_page = requests.get(link).text
# pprint(raw_page)

In [3]:
## use Beautiful Soup to parse the page (the page is HTML, so use an HTML parser)
soup = BeautifulSoup(raw_page, "html.parser")
# pprint(soup)

In [4]:
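## extract the first table on the page, which holds the postal codes
from io import StringIO

table = soup.find("table")
# wrap the markup in StringIO: recent pandas versions deprecate passing literal HTML
df = pd.read_html(StringIO(str(table)))[0]
df.head()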

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [5]:
# Cleaning the data according to the requirements

## Only process the cells that have an assigned borough. 
## Ignore cells with a borough that is Not assigned.
df = df[df["Borough"] != "Not assigned"]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [6]:
## More than one neighborhood can exist in one postal code area. 
## For example, in the table on the Wikipedia page, 
## you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. 
## These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
# group by postcode: keep the first borough and join the neighbourhoods with commas
# (DataFrame.append was removed in pandas 2.0, so aggregate instead of appending row by row)
clean_df = (
    df.groupby("Postcode", as_index=False)
      .agg({"Borough": "first", "Neighbourhood": ",".join})
)

clean_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [8]:
## If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. 
## So for the 9th cell in the table on the Wikipedia page, 
## the value of the Borough and the Neighborhood columns will be Queen's Park.

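# rows yielded by iterrows() are copies, so assigning to them does not change clean_df;
# use a boolean mask with .loc to write the borough into the dataframe itself
not_assigned = clean_df["Neighbourhood"] == "Not assigned"
clean_df.loc[not_assigned, "Neighbourhood"] = clean_df.loc[not_assigned, "Borough"]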

In [9]:
clean_df.shape

(103, 3)
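
Beyond the shape, the cleaning rules themselves can be checked directly. The following is a minimal sketch, not part of the assignment: `validate_clean` is a hypothetical helper, and `sample` is a small hand-built frame standing in for `clean_df`.

```python
import pandas as pd


def validate_clean(frame: pd.DataFrame) -> None:
    """Hypothetical helper: raise AssertionError if a cleaning rule is violated."""
    # Each postcode should appear exactly once after the rows are combined.
    assert frame["Postcode"].is_unique, "duplicate postcodes remain"
    # No row should keep the "Not assigned" placeholder in either column.
    assert (frame["Borough"] != "Not assigned").all(), "unassigned borough remains"
    assert (frame["Neighbourhood"] != "Not assigned").all(), "unassigned neighbourhood remains"


# Hand-built stand-in for clean_df (illustrative values only).
sample = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront,Regent Park", "Queen's Park"],
})

validate_clean(sample)       # passes silently when all rules hold
n_rows = sample.shape[0]     # the row count is the first element of .shape
print(n_rows)
```

In the notebook itself, the same helper could be called as `validate_clean(clean_df)` just before the final cell.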