### Segmenting and Clustering Neighborhoods in Toronto

#### Building the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data and to transform the data into a pandas dataframe like the one shown below:

![alt text](https://i.ibb.co/wgBDk3Y/7-JXaz3-NNEei-Mw-Ape4i-f-Lg-40e690ae0e927abda2d4bde7d94ed133-Screen-Shot-2018-06-18-at-7-17-57-PM.png "Dataframe")

In [1]:
# importing required libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
print('libraries are imported')

libraries are imported


In [2]:
# gathering data from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
print('data is gathered')

data is gathered


In [3]:
soup= BeautifulSoup(source, 'lxml')

#### Transforming the data into a Pandas Dataframe

In [4]:
# defining the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighbourhood'] 

# instantiating the dataframe
neighbourhoods = pd.DataFrame(columns=column_names)

In [5]:
# loop to fill dataframe
table=soup.find('table')
for tr_table in table.find_all('tr'):
    raw_data=[]
    for td_table in tr_table.find_all('td'):
        raw_data.append(td_table.text.strip())
    if len(raw_data)==3:
        neighbourhoods.loc[len(neighbourhoods)] = raw_data


In [6]:
neighbourhoods

Unnamed: 0,Borough,Neighbourhood,PostalCode
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


#### Data Cleansing

In [7]:
#removing borough cells that are ''not assigned'
neighbourhoods = neighbourhoods[neighbourhoods.Borough!='Not assigned']
neighbourhoods = neighbourhoods[neighbourhoods.Borough!= 0]
neighbourhoods.reset_index(drop = True, inplace = True)
i = 0
for i in range(0,neighbourhoods.shape[0]):
    if neighbourhoods.iloc[i][2] == 'Not assigned':
        neighbourhoods.iloc[i][2] = neighbourhoods.iloc[i][1]
        i = i+1
                                 
df = neighbourhoods.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df.head()


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,Agincourt,M1S,Scarborough
1,"Alderwood, Long Branch",M8W,Etobicoke
2,"Bathurst Manor, Wilson Heights, Downsview North",M3H,North York
3,Bayview Village,M2K,North York
4,"Bedford Park, Lawrence Manor East",M5M,North York


In [8]:
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,Agincourt,M1S,Scarborough
1,"Alderwood, Long Branch",M8W,Etobicoke
2,"Bathurst Manor, Wilson Heights, Downsview North",M3H,North York
3,Bayview Village,M2K,North York
4,"Bedford Park, Lawrence Manor East",M5M,North York
...,...,...,...
175,"Willowdale, Willowdale West",M2R,North York
176,Woburn,M1G,Scarborough
177,Woodbine Heights,M4C,East York
178,York Mills West,M2P,North York


In [16]:
df.loc[df.Borough =='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighbourhood


In [21]:
df=df[df['Borough']!='Not assigned']

In [25]:
df.loc[df.Neighbourhood =='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighbourhood
63,Not assigned,M1A,Not assigned
64,Not assigned,M1Y,Not assigned
65,Not assigned,M1Z,Not assigned
66,Not assigned,M2A,Not assigned
67,Not assigned,M2B,Not assigned
...,...,...,...
135,Not assigned,M9S,Not assigned
136,Not assigned,M9T,Not assigned
137,Not assigned,M9X,Not assigned
138,Not assigned,M9Y,Not assigned


In [34]:
df2=df[df.PostalCode != 'Not assigned']

In [35]:
df2

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,Agincourt,M1S,Scarborough
1,"Alderwood, Long Branch",M8W,Etobicoke
2,"Bathurst Manor, Wilson Heights, Downsview North",M3H,North York
3,Bayview Village,M2K,North York
4,"Bedford Park, Lawrence Manor East",M5M,North York
...,...,...,...
175,"Willowdale, Willowdale West",M2R,North York
176,Woburn,M1G,Scarborough
177,Woodbine Heights,M4C,East York
178,York Mills West,M2P,North York


In [40]:
print('Dataframe shape is', df2.shape)

Dataframe shape is (103, 3)


In [52]:
toronto_geo = 'https://cocl.us/Geospatial_data'
geo_data = pd.read_csv(toronto_geo).set_index("Postal Code")
geo_data

Unnamed: 0_level_0,Latitude,Longitude
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
...,...,...
M9N,43.706876,-79.518188
M9P,43.696319,-79.532242
M9R,43.688905,-79.554724
M9V,43.739416,-79.588437


In [53]:
geo_data.rename(columns={'Postal Code':'PostalCode'},inplace=True)
df_merged = pd.merge(geo_data, df2, on='PostalCode')

KeyError: 'PostalCode'

In [50]:
geo_data

Unnamed: 0_level_0,Latitude,Longitude
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
...,...,...
M9N,43.706876,-79.518188
M9P,43.696319,-79.532242
M9R,43.688905,-79.554724
M9V,43.739416,-79.588437
