# Segmenting and Clustering Neighborhoods in Toronto (Part 1)

In this exercise, we use web scraping to build a dataframe of Toronto neighborhoods grouped by postal code.

## Import required libraries
For this exercise we'll use 'BeautifulSoup' for web scraping and 'requests' to fetch the data from the Wikipedia page at <https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M>.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import folium
import matplotlib
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

## Get data from website

First we'll download the page's HTML from Wikipedia.

In [2]:
#Get Data from website
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
source[0:3000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of postal codes of Canada: M - Wikipedia</title>\n<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":890001695,"wgRevisionId":890001695,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","Au
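The request above assumes the download succeeds. A minimal sketch of a more defensive fetch is shown here; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

def fetch_html(url, timeout=10):
    """Return the page body as text, raising if the request fails.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text

# Example usage (performs a real network request):
# source = fetch_html(URL)
```

Failing early on a bad status code avoids parsing an error page as if it were the postal code table.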

## Parse website data
Then we'll parse the data retrieved from Wikipedia and split the text of the first table into rows.

In [3]:
#Parse data
soup = BeautifulSoup(source,'lxml')
table = soup.find('table').text
table_split = table.split('\n\n')
table_split[0:10]

['',
 'Postcode\nBorough\nNeighbourhood',
 '\nM1A\nNot assigned\nNot assigned',
 '\nM2A\nNot assigned\nNot assigned',
 '\nM3A\nNorth York\nParkwoods',
 '\nM4A\nNorth York\nVictoria Village',
 '\nM5A\nDowntown Toronto\nHarbourfront',
 '\nM5A\nDowntown Toronto\nRegent Park',
 '\nM6A\nNorth York\nLawrence Heights',
 '\nM6A\nNorth York\nLawrence Manor']
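Splitting the table text on blank lines depends on how the page happens to lay out whitespace. As an alternative sketch (not the approach used in this notebook), the rows can be read cell by cell from the `<tr>` elements. The sample HTML and the helper name `table_to_rows` are made up for illustration, and the built-in `html.parser` is used here instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia table (illustrative data only).
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def table_to_rows(html):
    """Return the first table in `html` as a list of rows of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        # Collect header and data cells alike, trimming surrounding whitespace.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows

rows = table_to_rows(sample_html)
header, records = rows[0], rows[1:]
```

With this layout, `header` holds the column names and `records` holds one list per table row, so no empty leading column needs to be dropped later.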

## Create DataFrame and Clean Data
Once the data is parsed, we'll put it into a pandas dataframe and clean it.

In [4]:
#create new df 
df = pd.DataFrame({'col':table_split})

#split column of data into multiple columns
df = df['col'].apply(lambda x : pd.Series(x.split('\n')))

#Get column names
col_names = df.iloc[1]

#Drop rows with empty values
df.dropna(inplace=True)

#Drop empty columns
df.drop([0], axis=1, inplace=True)

#Reshape columns
df.columns = range(df.shape[1])

#Drop rows whose Borough is 'Not assigned'
df = df[~df[1].str.contains('Not assigned')].copy()

#Where the Neighbourhood is 'Not assigned', use the Borough name instead
na_mask = df[2].str.contains('Not assigned')
df.loc[na_mask, 2] = df.loc[na_mask, 1]

#Add column names
df.rename(columns=col_names, inplace=True)

#Reset row index
df.reset_index(drop=True, inplace=True)

df.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
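The same cleaning rules can be sketched with named columns, which tends to be easier to follow than integer column positions. The three sample records and the name `df_demo` are illustrative assumptions; the rules themselves (drop unassigned boroughs, fall back to the borough name for unassigned neighbourhoods) are the ones described above.

```python
import pandas as pd

# Illustrative records only, in the same shape as the scraped table.
records = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M7A", "Queen's Park", "Not assigned"],
]
df_demo = pd.DataFrame(records, columns=["Postcode", "Borough", "Neighbourhood"])

# Drop rows with no assigned borough; copy so later assignments are unambiguous.
df_demo = df_demo[df_demo["Borough"] != "Not assigned"].copy()

# Fill every unassigned neighbourhood with its borough name.
mask = df_demo["Neighbourhood"] == "Not assigned"
df_demo.loc[mask, "Neighbourhood"] = df_demo.loc[mask, "Borough"]

df_demo = df_demo.reset_index(drop=True)
```

Assigning through a single `.loc[mask, column]` call updates all matching rows at once and avoids chained indexing.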


## Group data
Now let's group the data by postcode, joining the neighbourhoods that share a postcode into a single comma-separated string.

In [5]:
#Group by postcode

df_grp = pd.DataFrame(df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(lambda tags: ', '.join(tags)))
df_grp.reset_index(inplace=True)
df_grp.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
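To make the grouping step concrete, here is a small sketch on three made-up rows. It uses `agg` with `', '.join` followed by `reset_index`, which is one possible way to write the same aggregation; `df_demo` and `grouped` are illustrative names.

```python
import pandas as pd

# Illustrative rows: two neighbourhoods share postcode M5A.
df_demo = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Join the neighbourhood names within each (Postcode, Borough) group,
# then turn the group keys back into ordinary columns.
grouped = (
    df_demo.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
```

Because `groupby` sorts the group keys by default, the result is ordered by postcode, which is why the output above starts at M1B rather than M3A.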


## DataFrame size

Now that the dataframe has been cleaned and grouped by postcode, let's check its size.

In [6]:
df_grp.shape

(103, 3)

As shown above, the dataframe has 103 rows and 3 columns.
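Since `shape` reports rows first and columns second, unpacking it makes the two numbers explicit. The sketch below also checks that each postcode appears only once after grouping; the two-row `df_demo` is illustrative data, not the real result.

```python
import pandas as pd

# Illustrative grouped data with one row per postcode.
df_demo = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})

n_rows, n_cols = df_demo.shape  # (rows, columns)
postcodes_unique = df_demo["Postcode"].is_unique  # True when no postcode repeats
```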