# Segmenting and Clustering Neighborhoods in Toronto

## Part 1

Import relevant libraries:

In [1]:
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
import numpy as np
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium 

Use urllib and BeautifulSoup to fetch the wikitable as HTML. Then initialise an empty list for each column, loop through the rows of the wikitable, and append each cell's text to the matching list:

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page,'html.parser')

postal_code = []
borough = []
neighbourhood = []

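# Skip the header row, then read the first text node of each cell.
# `string=True` replaces the deprecated `text=True` argument.
for items in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    data = items.find_all(['th','td'])
    postal_code.append(data[0].find(string=True))
    borough.append(data[1].find(string=True))
    neighbourhood.append(data[2].find(string=True))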
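The loop above can be tried offline against a small inline table. The following is a minimal sketch, not the notebook's own code: `sample_html` is a hypothetical stand-in for the Wikipedia page, and it uses `get_text(strip=True)` instead of `find(...)`, which joins all text in a cell and trims whitespace at extraction time.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline.
sample_html = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = []
# Skip the header row, then read the three cells of each data row.
for tr in soup.find("table", class_="wikitable").find_all("tr")[1:]:
    cells = tr.find_all(["th", "td"])
    # get_text(strip=True) drops the trailing newline while extracting.
    rows.append([cell.get_text(strip=True) for cell in cells[:3]])

print(rows)
```

Collecting one list per row, rather than one list per column, also keeps the three values of a row together if a cell is ever missing.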

Combine the lists into one dataframe:

In [3]:
df = pd.DataFrame(postal_code,columns=['postal_code'])
df['borough'] = borough
df['neighbourhood'] = neighbourhood
df

|     | postal_code | borough | neighbourhood |
| --- | --- | --- | --- |
| 0 | M1A\n | Not assigned\n | Not assigned\n |
| 1 | M2A\n | Not assigned\n | Not assigned\n |
| 2 | M3A\n | North York\n | Parkwoods\n |
| 3 | M4A\n | North York\n | Victoria Village\n |
| 4 | M5A\n | Downtown Toronto\n | Regent Park, Harbourfront\n |
| ... | ... | ... | ... |
| 175 | M5Z\n | Not assigned\n | Not assigned\n |
| 176 | M6Z\n | Not assigned\n | Not assigned\n |
| 177 | M7Z\n | Not assigned\n | Not assigned\n |
| 178 | M8Z\n | Etobicoke\n | Mimico NW, The Queensway West, South of Bloor,... |

Remove the unwanted newline character '\n' at the end of each string:

In [4]:
df.replace('\n','', regex=True, inplace=True)
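The regex replacement in the cell above removes every newline anywhere in a value. If only leading and trailing whitespace should go, a per-column `str.strip()` is one alternative. This is a small sketch on a hypothetical `sample` frame, assuming every column holds strings:

```python
import pandas as pd

# Hypothetical sample mirroring the scraped values, trailing newlines included.
sample = pd.DataFrame({
    "postal_code": ["M3A\n", "M4A\n"],
    "borough": ["North York\n", "North York\n"],
    "neighbourhood": ["Parkwoods\n", "Victoria Village\n"],
})

# Trim leading/trailing whitespace from each column; all columns hold strings here.
cleaned = sample.apply(lambda col: col.str.strip())
print(cleaned)
```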

Remove rows where the borough is 'Not assigned'. Note: no row with an assigned borough has a neighbourhood of 'Not assigned', so no separate handling is needed for that case at this stage.

In [5]:
df = df[df['borough'] != 'Not assigned']
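The claim that no assigned borough has an unassigned neighbourhood can be checked rather than assumed. The sketch below uses a hypothetical `sample` frame; the fallback of copying the borough name into the neighbourhood column is an illustrative assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical sample with one unassigned borough.
sample = pd.DataFrame({
    "postal_code": ["M1A", "M3A", "M4A"],
    "borough": ["Not assigned", "North York", "North York"],
    "neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough; copy so later edits are safe.
assigned = sample[sample["borough"] != "Not assigned"].copy()

# Count neighbourhoods that are still unassigned after the filter.
unassigned_count = int((assigned["neighbourhood"] == "Not assigned").sum())
print(unassigned_count)

# Assumed fallback: use the borough name where the neighbourhood is unassigned.
assigned["neighbourhood"] = assigned["neighbourhood"].where(
    assigned["neighbourhood"] != "Not assigned", assigned["borough"]
)
```

With a count of zero the fallback changes nothing, which matches the note above.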

In [6]:
df.shape

(103, 3)

## Part 2

Read the coordinates from a CSV file, because the API was not returning the expected results:

In [15]:
coords = pd.read_csv(r'C:\Users\Eilidh.Mayne\Documents\Coursera Repo\Coursera_Capstone\Geospatial_Coordinates.csv')

Merge the two dataframes on postal code and delete the duplicated postal code column:

In [16]:
newdf = pd.merge(df, coords, how='left', left_on = 'postal_code', right_on = 'Postal Code')
newdf.drop(['Postal Code'], axis=1, inplace=True)
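A left merge silently leaves `NaN` coordinates for any postal code missing from the CSV. As a hedged sketch, `indicator=True` and `validate` can surface such rows before the helper columns are dropped. The frames below are hypothetical samples, and `M9Z` is a made-up code with no coordinates.

```python
import pandas as pd

# Hypothetical samples; M9Z is a made-up code that has no coordinates.
neighbourhoods = pd.DataFrame({
    "postal_code": ["M3A", "M4A", "M9Z"],
    "borough": ["North York", "North York", "Hypothetical"],
})
coords_sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

merged = pd.merge(
    neighbourhoods, coords_sample,
    how="left", left_on="postal_code", right_on="Postal Code",
    validate="one_to_one",  # raise if a postal code repeats on either side
    indicator=True,         # add a _merge column marking where each row matched
)

# Postal codes that found no coordinates in the right-hand frame.
missing = merged.loc[merged["_merge"] == "left_only", "postal_code"].tolist()
print(missing)

# Drop the duplicated key and the helper column once the check is done.
merged = merged.drop(columns=["Postal Code", "_merge"])
```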

In [23]:
newdf

|     | postal_code | borough | neighbourhood | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- |
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.662744 | -79.321558 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 |
