# Segmenting and Clustering Neighborhoods in Toronto

## 1. Scraping
Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the data from the table of postal codes, and transform it into a pandas dataframe.


We first install the Beautiful Soup package.

In [4]:
# install Beautiful Soup package and import it
! pip install bs4
from bs4 import BeautifulSoup

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading https://files.pythonhosted.org/packages/1d/5d/3260694a59df0ec52f8b4883f5d23b130bc237602a1411fa670eae12351e/beautifulsoup4-4.7.1-py3-none-any.whl (94kB)
Collecting soupsieve>=1.2 (from beautifulsoup4->bs4)
  Downloading https://files.pythonhosted.org/packages/77/78/bca00cc9fa70bba1226ee70a42bf375c4e048fe69066a0d9b5e69bc2a79a/soupsieve-1.8-py2.py3-none-any.whl (88kB)
Building wheels for collected packages: bs4
  Running setup.py bdist_wheel for bs4 ... done
  Stored in directory: /home/notebook/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: s

To parse the document, we will pass it into the BeautifulSoup constructor.

In [113]:
!wget --quiet https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M -O CodesCanada

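with open("CodesCanada", encoding="utf-8") as fp:
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(fp, "html.parser")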

Then we create a pandas dataframe that will contain the postal codes.

In [137]:
# import the pandas library
import pandas as pd

# define the columns of our dataframe
column_names=['Postalcode','Borough','Neighbourhood']

In [341]:
# collect the table rows once instead of querying the soup on every iteration
rows = soup.find_all('tr')

# count the number of postal code rows
n_max = len(rows) - 5

# read the Postalcode, the Borough and the Neighbourhood from each row
records = []
for i in range(1, n_max):
    cells = rows[i].contents
    records.append({'Postalcode': cells[1].text, 'Borough': cells[3].text, 'Neighbourhood': cells[5].text})

# create the dataframe in one step (DataFrame.append was removed in pandas 2.0)
df_CanadaPostCodes_raw = pd.DataFrame(records, columns=column_names)

# check data
df_CanadaPostCodes_raw.head(10)

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n
5,M5A,Downtown Toronto,Regent Park\n
6,M6A,North York,Lawrence Heights\n
7,M6A,North York,Lawrence Manor\n
8,M7A,Queen's Park,Not assigned\n
9,M8A,Not assigned,Not assigned\n
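
The indices 1, 3 and 5 used above are easier to follow on a small example. The sketch below uses a hand-written HTML fragment (an assumption that mimics the layout of the Wikipedia table, not a copy of it) and the `html.parser` backend. It shows that `.contents` interleaves whitespace text nodes with the `<td>` tags, which is why the cells sit at odd positions, and that `find_all("td")` gives the same cells without index arithmetic. The name `soup_demo` is illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written fragment that imitates the table layout
html = """
<table>
<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood</th>
</tr>
<tr>
<td>M3A</td>
<td>North York</td>
<td>Parkwoods
</td>
</tr>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")
row = soup_demo.find_all("tr")[1]  # first data row (row 0 is the header)

# .contents holds newline text nodes at even positions and <td> tags at odd ones
via_contents = [row.contents[i].text for i in (1, 3, 5)]

# find_all("td") returns only the cell tags, so no index arithmetic is needed
via_find_all = [td.get_text() for td in row.find_all("td")]

print(via_contents)
print(via_find_all)
```

Both lists still carry the trailing `\n` in the neighbourhood name, which is what the cleaning step removes.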


## 2. Clean the dataset
1. Only process the cells that have an assigned borough.
2. Remove the \n at the end of Neighbourhood names.
3. If Neighbourhood has the value 'Not assigned', replace it with the Borough value.
4. Aggregate the rows by Postalcode.

In [363]:
# 1) Drop the rows that have no assigned borough.
df_CanadaPostCodes = df_CanadaPostCodes_raw[df_CanadaPostCodes_raw['Borough'] != 'Not assigned'].reset_index(drop=True)

# 2) Remove the \n at the end of Neighbourhood names
df_CanadaPostCodes['Neighbourhood'] = df_CanadaPostCodes['Neighbourhood'].str.replace('\n', '')

# 3) If Neighbourhood has the value 'Not assigned', replace it with the Borough value.
#    .loc with a boolean mask avoids chained assignment (df.col[i] = ...), which may not write to the dataframe.
not_assigned = df_CanadaPostCodes['Neighbourhood'] == 'Not assigned'
df_CanadaPostCodes.loc[not_assigned, 'Neighbourhood'] = df_CanadaPostCodes.loc[not_assigned, 'Borough']
df_CanadaPostCodes.head(15)

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
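
The first three cleaning rules can also be tried on a tiny hand-made frame. This is a minimal sketch, not the notebook's own code: `clean_postcodes` is a hypothetical helper, the three input rows are made up to resemble the scraped data, and `Series.mask` is used here as one possible way to apply rule 3 without a loop.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped table
raw = pd.DataFrame({
    "Postalcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned\n", "Parkwoods\n", "Not assigned\n"],
})


def clean_postcodes(frame):
    """Hypothetical helper applying cleaning rules 1 to 3."""
    # Rule 1: keep only rows with an assigned borough
    out = frame[frame["Borough"] != "Not assigned"].reset_index(drop=True)
    # Rule 2: remove the newline from the neighbourhood names
    out["Neighbourhood"] = out["Neighbourhood"].str.replace("\n", "", regex=False)
    # Rule 3: where the neighbourhood is 'Not assigned', take the borough instead
    out["Neighbourhood"] = out["Neighbourhood"].mask(
        out["Neighbourhood"] == "Not assigned", out["Borough"]
    )
    return out


cleaned = clean_postcodes(raw)
print(cleaned)
```

The M1A row is dropped, and M7A ends up with "Queen's Park" as both borough and neighbourhood, matching row 6 of the output above.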


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.
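
As a minimal sketch of this combining step, assuming a small hand-made frame built from the rows shown earlier, `groupby` with `as_index=False` keeps `Postalcode` as a regular column while the neighbourhood names are joined into one string. The variable names `toy` and `combined` are illustrative, and the plain `", "` separator is a choice made for this example.

```python
import pandas as pd

# A few rows taken from the cleaned table above
toy = pd.DataFrame({
    "Postalcode": ["M5A", "M5A", "M6A", "M6A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

# One row per postal code: keep the first borough, join the neighbourhood names
combined = toy.groupby("Postalcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ", ".join}
)
print(combined)
```

Taking the first borough is safe only if every row of a postal code carries the same borough, which holds for the rows shown here.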

In [400]:
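# 4) Aggregate Postalcode: one row per postal code, neighbourhood names joined into a single string.
#    as_index=False keeps Postalcode as a regular column, so it is not both an index level and a column.
df_CanadaPostCodes = df_CanadaPostCodes.groupby('Postalcode', as_index=False).agg({'Borough': 'first', 'Neighbourhood': ' , '.join})
df_CanadaPostCodes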

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge , Malvern"
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union"
2,M1E,Scarborough,"Guildwood , Morningside , West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park , Ionview , Kennedy Park"
7,M1L,Scarborough,"Clairlea , Golden Mile , Oakridge"
8,M1M,Scarborough,"Cliffcrest , Cliffside , Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff , Cliffside West"


In [362]:
df_CanadaPostCodes.shape

(103, 3)

## 3. Map Postal Codes to Latitude and Longitude

1. Import data

In [387]:
!wget -O PostalCode.csv https://cocl.us/Geospatial_data

--2019-02-18 17:29:36--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-02-18 17:29:37--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.26.197
Connecting to ibm.box.com (ibm.box.com)|107.152.26.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-02-18 17:29:37--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-02-18 

In [401]:
# Assign the data to a pandas dataframe
df_PostalCodeCoord = pd.read_csv("PostalCode.csv")

# take a look at the dataset
df_PostalCodeCoord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


2. Merge with the postal code dataframe

In [402]:
#rename column "Postal Code" in dataframe df_PostalCodeCoord
df_PostalCodeCoord.rename(columns={'Postal Code':'Postalcode'}, inplace=True)

#merge based on Postalcode column
df_CanadaPostCodes=pd.merge(df_CanadaPostCodes,df_PostalCodeCoord, on='Postalcode')

#check merge
df_CanadaPostCodes



Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge , Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park , Ionview , Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea , Golden Mile , Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest , Cliffside , Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff , Cliffside West",43.692657,-79.264848
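
An inner merge silently drops postal codes that have no coordinates. The sketch below shows one way to check for that, assuming two small hand-made frames: the coordinates are copied from the table above, while `M9Z` is a made-up code added only to produce a mismatch. It uses the `how`, `indicator` and `validate` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Hand-made stand-ins for the two dataframes; "M9Z" is a hypothetical code
codes = pd.DataFrame({
    "Postalcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

merged = codes.merge(
    coords.rename(columns={"Postal Code": "Postalcode"}),
    on="Postalcode",
    how="left",             # keep every postal code, matched or not
    indicator=True,         # adds a "_merge" column describing each row's origin
    validate="one_to_one",  # raise if a postal code appears twice on either side
)

# Postal codes that found no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postal code received a latitude and longitude.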
