# Segmenting and Clustering Neighbourhoods in Toronto

**Problem 1**

Build code in the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and extract the table of postal codes, then transform that data into a pandas dataframe.

In order to create the above dataframe:

* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area; such neighborhoods are combined into one row, separated by commas.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
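The rules in the list above can be sketched on a small in-memory table before touching the real page. This is a minimal illustration, not the notebook's own code: the rows in `raw` are made-up stand-ins, and the names `raw`, `clean`, and `result` are assumptions introduced here. The last step combines neighborhoods that share a postal code, as the groupby cell later in the notebook does.

```python
import pandas as pd

# Toy data standing in for the scraped table (values are illustrative only).
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
    }
)

# Rule: keep only rows with an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule: an unassigned neighborhood falls back to the borough name.
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# Rule: one row per postal code, neighborhoods joined with commas.
result = (
    clean.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(result)
print(result.shape)
```

Grouping on both PostalCode and Borough keeps the borough column in the result, so no separate merge is needed in this sketch.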


In [2]:
pip install bs4  # Install the Beautiful Soup package

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
Collecting soupsieve>1.2 (from beautifulsoup4->bs4)
  Downloading https://files.pythonhosted.org/packages/6f/8f/457f4a5390eeae1cc3aeab89deb7724c965be841ffca6cfca9197482e470/soupsieve-2.0.1-py3-none-any.whl
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.9.1 bs4-0.0.

In [1]:
from bs4 import BeautifulSoup
import requests #library to handle requests
import pandas as pd

In [2]:
Link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
Source = requests.get(Link).text
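The request above assumes the page is reachable and returns a successful response. A slightly more defensive variant might look like the sketch below. The helper name `fetch_html`, its `session` parameter, and the 10-second timeout are assumptions made for illustration; `timeout` and `raise_for_status()` are standard parts of the requests library.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the page body as text, raising on HTTP errors.

    `session` is a hypothetical hook: any object with a compatible
    `get(url, timeout=...)` method, such as a requests.Session.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could be written as `Source = fetch_html(Link)`.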


In [3]:
soup = BeautifulSoup(Source, 'html.parser')  # name the parser explicitly to avoid a bs4 warning

In [4]:
table = soup.find('table')

In [5]:
# Define an empty dataframe with three columns: PostalCode, Borough and Neighbourhoods
columns = ["PostalCode", "Borough", "Neighbourhoods"]
df = pd.DataFrame(columns=columns)

In [6]:
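# Walk every table row and collect the text of its <td> cells
for tr in table.find_all('tr'):
    row_data = []
    for td in tr.find_all('td'):
        row_data.append(td.text.strip())
    # Header rows use <th> cells, so they yield no <td> data and are skipped
    if len(row_data) == 3:
        df.loc[len(df)] = row_data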
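Appending with `df.loc[len(df)]` grows the dataframe one row at a time. An alternative is to collect the rows in a plain list and build the dataframe once. The sketch below runs on a small inline HTML string rather than the live page; the `html` snippet and the name `df_sketch` are illustrative assumptions.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Small inline table standing in for the Wikipedia page (illustrative values).
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row has <th> cells only, so it is skipped
        rows.append(cells)

# Build the dataframe in one step from the collected rows.
df_sketch = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighbourhoods"])
print(df_sketch)
```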

In [7]:
df.head()

| | PostalCode | Borough | Neighbourhoods |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


### Data Cleaning

In [8]:
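# Keep only rows with an assigned borough
df = df[df['Borough'] != 'Not assigned'].copy()
# Per the requirements, an unassigned neighbourhood takes the borough's name
unassigned = df['Neighbourhoods'] == 'Not assigned'
df.loc[unassigned, 'Neighbourhoods'] = df.loc[unassigned, 'Borough']
df.head()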

| | PostalCode | Borough | Neighbourhoods |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [12]:
grouped_df = df.groupby('PostalCode')['Neighbourhoods'].apply(', '.join)
grouped_df.head()

PostalCode
M1B                            Malvern, Rouge
M1C    Rouge Hill, Port Union, Highland Creek
M1E         Guildwood, Morningside, West Hill
M1G                                    Woburn
M1H                                 Cedarbrae
Name: Neighbourhoods, dtype: object

In [13]:
grouped_df = grouped_df.reset_index()
grouped_df.rename(columns={'Neighbourhoods': 'Neighbourhood_joined'}, inplace=True)
grouped_df.head()

| | PostalCode | Neighbourhood_joined |
|---|---|---|
| 0 | M1B | Malvern, Rouge |
| 1 | M1C | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Guildwood, Morningside, West Hill |
| 3 | M1G | Woburn |
| 4 | M1H | Cedarbrae |


In [20]:
df_merge = pd.merge(df, grouped_df, on='PostalCode')

In [21]:
df_merge.drop('Neighbourhoods', axis=1, inplace=True)

In [22]:
df_merge.head()

| | PostalCode | Borough | Neighbourhood_joined |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [23]:
df_merge.shape

(103, 3)
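The shape alone does not confirm that the cleaning rules held. A few quick checks on the final dataframe can help. This is a hedged sketch: `check_postal_table` is a hypothetical helper, and the `good` and `bad` frames are made-up examples rather than the scraped data.

```python
import pandas as pd


def check_postal_table(frame, code_col="PostalCode", placeholder="Not assigned"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # Each postal code should appear exactly once after grouping.
    if frame[code_col].duplicated().any():
        problems.append("duplicate postal codes")
    # No cell should be empty.
    if frame.isna().any().any():
        problems.append("missing values")
    # The placeholder text should have been filtered out or replaced.
    if (frame == placeholder).any().any():
        problems.append("placeholder values remain")
    return problems


good = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighbourhood_joined": ["Parkwoods", "Victoria Village"],
    }
)
bad = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M1A"],
        "Borough": ["Not assigned", "Scarborough"],
        "Neighbourhood_joined": ["Not assigned", "Malvern"],
    }
)

print(check_postal_table(good))
print(check_postal_table(bad))
```

In the notebook, the same helper could be called as `check_postal_table(df_merge)` just before printing the shape.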