## Assignment: Segmenting and Clustering Neighborhoods in Toronto - Part 1
#### Problem: Create a cleaned and grouped dataframe with the required data from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [15]:
import requests
from bs4 import BeautifulSoup

In [16]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content, "html.parser")
table = soup.find("table", class_ = "wikitable sortable")

In [17]:
postcodes = []
boroughs = []
neighborhoods = []

In [18]:
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 3:  # Only extract table body rows; heading rows have no <td> cells
        # find(string=True) returns the first text node of each cell
        postcodes.append(cells[0].find(string=True))
        boroughs.append(cells[1].find(string=True))
        neighborhoods.append(cells[2].find(string=True))
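
The row-extraction loop above can be tried offline on a small inline table. The following is a minimal sketch, assuming a hypothetical `sample_html` string that stands in for the Wikipedia page; it uses `get_text(strip=True)` instead of the notebook's first-text-node lookup, so trailing newlines are removed at extraction time.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the heading row uses <th>, so it yields no <td> cells
        # get_text(strip=True) joins all text in the cell and trims whitespace
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Note that `get_text` concatenates every text node in a cell, which is not identical to taking only the first text node, so results may differ for cells that contain links or nested markup.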

In [19]:
# Import pandas to convert the lists into a dataframe
import pandas as pd
df = pd.DataFrame(postcodes, columns=["PostalCode"])
df["Borough"] = boroughs
df["Neighborhood"] = neighborhoods

##### Eliminate postal codes with Borough "Not assigned"

In [20]:
# Drop rows whose borough is "Not assigned";
# .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[df["Borough"] != "Not assigned"].copy()

# If a row has a borough but a "Not assigned" neighborhood,
# use the borough name as the neighborhood
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
df.head(10)

   PostalCode           Borough       Neighborhood
 2        M3A        North York          Parkwoods
 3        M4A        North York   Victoria Village
 4        M5A  Downtown Toronto       Harbourfront
 5        M6A        North York   Lawrence Heights
 6        M6A        North York     Lawrence Manor
 7        M7A  Downtown Toronto       Queen's Park
 9        M9A         Etobicoke   Islington Avenue
10        M1B       Scarborough              Rouge
11        M1B       Scarborough            Malvern
13        M3B        North York  Don Mills North\n
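
The two rules applied in this cell (drop unassigned boroughs, then fall back to the borough name for unassigned neighborhoods) can be checked on a toy frame. This is a minimal sketch; the rows and the name `toy` are made up for illustration and are not taken from the scraped table.

```python
import pandas as pd

# Toy rows invented for illustration only.
toy = pd.DataFrame({
    "PostalCode": ["X1A", "X2A", "X3A"],
    "Borough": ["Not assigned", "Borough A", "Borough B"],
    "Neighborhood": ["Not assigned", "Not assigned", "Neighborhood B"],
})

# Rule 1: drop rows whose borough is unassigned.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: where only the neighborhood is unassigned, reuse the borough name.
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]

print(toy)
```

The original index labels are kept after filtering, which is why the notebook output above skips some index values.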


##### Group the dataframe by PostalCode, clean the dataframe and display the number of rows in it

In [28]:
df_agg = df.groupby(['PostalCode','Borough'], sort=False).agg(', '.join)
df_agg = df_agg.reset_index()
df_agg.columns = ['PostalCode', 'Borough', 'Neighborhood']
print(df_agg.head(10))

  PostalCode           Borough                      Neighborhood
0        M3A        North York                         Parkwoods
1        M4A        North York                  Victoria Village
2        M5A  Downtown Toronto                      Harbourfront
3        M6A        North York  Lawrence Heights, Lawrence Manor
4        M7A  Downtown Toronto                      Queen's Park
5        M9A         Etobicoke                  Islington Avenue
6        M1B       Scarborough                    Rouge, Malvern
7        M3B        North York                 Don Mills North\n
8        M4B         East York   Woodbine Gardens, Parkview Hill
9        M5B  Downtown Toronto      Ryerson\n, Garden District\n
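
To see how the grouping step merges neighborhoods that share a postal code, here is a small sketch using three rows copied from the output above. The names `toy` and `grouped` are illustrative; selecting the `Neighborhood` column explicitly before aggregating is a choice made here, not something the notebook does.

```python
import pandas as pd

# Three rows taken from the filtered dataframe shown earlier.
toy = pd.DataFrame({
    "PostalCode": ["M6A", "M6A", "M7A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

# sort=False keeps groups in order of first appearance;
# ", ".join concatenates the neighborhood strings within each group.
grouped = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)

print(grouped)
```

The two M6A rows collapse into a single row, so the grouped frame has one row per (PostalCode, Borough) pair.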


In [29]:
# Clean the Neighborhood text by removing new lines "\n"
df_agg = df_agg.replace('\n','', regex=True)
print(df_agg.head(10))

  PostalCode           Borough                      Neighborhood
0        M3A        North York                         Parkwoods
1        M4A        North York                  Victoria Village
2        M5A  Downtown Toronto                      Harbourfront
3        M6A        North York  Lawrence Heights, Lawrence Manor
4        M7A  Downtown Toronto                      Queen's Park
5        M9A         Etobicoke                  Islington Avenue
6        M1B       Scarborough                    Rouge, Malvern
7        M3B        North York                   Don Mills North
8        M4B         East York   Woodbine Gardens, Parkview Hill
9        M5B  Downtown Toronto          Ryerson, Garden District
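
An alternative to removing "\n" after grouping is to trim each neighborhood before joining. The sketch below assumes that approach and uses the two M5B values from the output above; `toy` and `joined` are illustrative names.

```python
import pandas as pd

# The two M5B neighborhoods as scraped, each with a trailing newline.
toy = pd.DataFrame({
    "PostalCode": ["M5B", "M5B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Ryerson\n", "Garden District\n"],
})

# str.strip() removes leading and trailing whitespace, including "\n",
# and touches only the Neighborhood column.
toy["Neighborhood"] = toy["Neighborhood"].str.strip()

joined = ", ".join(toy["Neighborhood"])
print(joined)
```

Unlike a frame-wide regex replace, this only affects one column and leaves any newline inside a value untouched, which may or may not be what is wanted.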


In [30]:
print('Number of rows in the dataframe:', df_agg.shape[0])

Number of rows in the dataframe: 103
