<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Importing necessary libraries
Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [26]:
import pandas as pd
import numpy as np
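import urllib.request  # library to fetch the HTML from the URL
try:
    from bs4 import BeautifulSoup  # library for scraping information from HTML
except ImportError:
    # install the package if it is missing, then retry the import
    !conda install -c anaconda beautifulsoup4 --yes
    from bs4 import BeautifulSoup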

## Scraping the Wikipedia page

We specify the URL of the Wikipedia page. Using the urllib.request library, we query the page and store the returned HTML in a variable, which we then parse with BeautifulSoup.

In [27]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
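The fetch step above can fail for ordinary network reasons, and some servers reject requests that carry a generic client identifier. The following is a minimal sketch of a more defensive fetch, assuming a made-up `User-Agent` string and a 10-second timeout; `build_request` and `fetch_html` are hypothetical helper names, not part of the original notebook.

```python
import urllib.request

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def build_request(url, user_agent="toronto-neighborhoods-notebook/0.1"):
    """Return a Request carrying an explicit User-Agent header (value is an assumption)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Fetch the page and return its decoded HTML, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            # fall back to UTF-8 when the server does not declare a charset
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset)
    except OSError as exc:
        # URLError, HTTPError and socket timeouts all derive from OSError
        print(f"Could not fetch {url}: {exc}")
        return None
```

With these helpers, the parsing step would become `soup = BeautifulSoup(fetch_html(URL), "html.parser")`, after checking that the result is not `None`.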

We retrieve all instances of the \<table\> tag and pick the one that contains the list of postal codes of Canada, identified by the class "wikitable sortable". We then loop over its rows and append the value of each cell to one of three lists, one per column (A - PostalCode, B - Borough, C - Neighborhood).

In [28]:
all_tables = soup.find_all("table")
right_table = soup.find("table", class_="wikitable sortable")
postalCode = []
borough = []
neighborhood = []

# keep only data rows, i.e. rows with exactly three <td> cells
for row in right_table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 3:
        postalCode.append(cells[0].find(text=True))
        borough.append(cells[1].find(text=True))
        neighborhood.append(cells[2].find(text=True))
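Note that `find(text=True)` returns only the first text node inside a cell, so a cell that contains nested tags could be truncated. Below is a small sketch of the same row loop that uses `get_text()` instead, run against a hypothetical inline HTML sample rather than the live page; `SAMPLE_HTML` and `extract_rows` are illustrative names, and the markup is an assumption about the table's shape.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the table described above
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M5A</td><td><a href="#">Downtown Toronto</a></td>
      <td><a href="#">Regent Park</a>, <a href="#">Harbourfront</a>
</td></tr>
</table>
"""


def extract_rows(html):
    """Return one (postal code, borough, neighborhood) tuple per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable sortable")
    if table is None:
        raise ValueError("no table with class 'wikitable sortable' found")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) == 3:  # header rows use <th>, so they are skipped here
            # get_text() joins every text node in the cell; strip() drops the trailing newline
            rows.append(tuple(cell.get_text().strip() for cell in cells))
    return rows


rows = extract_rows(SAMPLE_HTML)
```

Because the values are stripped during extraction, the later newline cleanup would not be needed with this variant.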

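## Create a DataFrame from the scraped data

We create a DataFrame and assign the scraped data to the appropriate columns. We then clean the data: we strip newline characters, drop rows whose borough is "Not assigned", and, where a neighborhood is "Not assigned", use the borough name instead.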

In [29]:
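df = pd.DataFrame(postalCode, columns=['PostalCode'])
df['Borough'] = borough
df['Neighborhood'] = neighborhood
# strip the newline characters left over from the HTML cells
df = df.replace('\n', '', regex=True)
# drop rows that have no borough assigned
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)

# where a neighborhood is not assigned, fall back to the borough name
if df[df.Neighborhood == 'Not assigned'].shape[0] != 0:
    df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
df.head()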

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
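The cleaning rules can be tried in isolation on a few rows. This is a minimal sketch on hypothetical sample data, with `clean` as an illustrative helper name; it uses `Series.mask` for the borough fallback, which is a swapped-in alternative to the `df.loc` assignment and needs no surrounding `if` check.

```python
import pandas as pd

# Hypothetical sample rows covering both "Not assigned" cases
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})


def clean(df):
    """Drop rows without a borough and fill unassigned neighborhoods with the borough."""
    out = df.loc[df["Borough"] != "Not assigned"].reset_index(drop=True)
    unassigned = out["Neighborhood"].eq("Not assigned")
    # mask() replaces the values where the condition is True with the aligned borough
    return out.assign(Neighborhood=out["Neighborhood"].mask(unassigned, out["Borough"]))


cleaned = clean(sample)
print(cleaned)
```

The helper returns a new DataFrame and leaves its input unchanged, which makes it easy to rerun while experimenting.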


Let's check the size of the resulting DataFrame.

In [30]:
df.shape

(103, 3)
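Beyond the shape, a couple of extra checks can confirm that the postal codes are usable as a merge key. The sketch below runs on hypothetical sample data; `check_postal_codes` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd


def check_postal_codes(df, key="PostalCode"):
    """Return a small report: row count, duplicated keys, and missing values."""
    return {
        "rows": len(df),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
    }


# Hypothetical sample with one repeated code and one missing borough
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M4A"],
    "Borough": ["North York", "North York", None],
})
report = check_postal_codes(sample)
print(report)
```

On a clean table both `duplicate_keys` and `missing_values` would be expected to be zero.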

Now let's add the coordinates for each postal code to our DataFrame. We read them from a CSV file and merge the two tables on the PostalCode column.

In [31]:
df_coords = pd.read_csv("Geospatial_Coordinates.csv")
df_coords.rename(columns={'Postal Code':'PostalCode'},inplace=True)
df = pd.merge(df, df_coords, on='PostalCode')
df.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
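By default `pd.merge` performs an inner join, so any postal code without a match in the coordinates file would silently disappear. A minimal sketch of a merge that surfaces such rows is shown below, using small hypothetical frames; the code "M9Z" and its borough are made up purely to demonstrate a miss.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],  # "M9Z" is a made-up code with no coordinates
    "Borough": ["North York", "North York", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

coords = coords.rename(columns={"Postal Code": "PostalCode"})
merged = pd.merge(
    neighborhoods,
    coords,
    on="PostalCode",
    how="left",             # keep every neighborhood row, matched or not
    validate="one_to_one",  # raise if a postal code repeats on either side
    indicator=True,         # add a _merge column describing each row's origin
)
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

If `unmatched` is empty, the `_merge` column can be dropped and the result is equivalent to the inner join.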
