# Segmenting and Clustering Neighborhoods in Toronto
All three parts of this assignment share one notebook; each part is clearly labeled.

## Part 0: Load packages

In [1]:
%%capture

import numpy as np
import pandas as pd

# For web scraping
from bs4 import BeautifulSoup
import requests

# To obtain latitude/longitude
!pip install pgeocode
import pgeocode


## Part 1: Scrape Wikipedia page for Toronto

First, use Requests and BeautifulSoup to scrape the Wikipedia page "List of postal codes of Canada: M" and store all of its tables in a variable.

In [2]:
# Get the html 
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = requests.get(url).text

# Turn into a beautiful soup
soup = BeautifulSoup(html, 'html5lib')

# Find all html tables
tables = soup.find_all('table')
print(f"{len(tables)} tables were found")

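# Find the index of the table that contains the postal codes
tableIndex = None
for index, candidate in enumerate(tables):
    if "M1A" in str(candidate):
        tableIndex = index

if tableIndex is None:
    raise ValueError("No table containing 'M1A' was found")

table = tables[tableIndex]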


3 tables were found
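
As a minimal sketch, the same table lookup can be tried on a small inline HTML snippet, which avoids any network access. The snippet `sample_html` and the names `sample_tables`, `match_index` and `postal_table` are made up for illustration, and the built-in `html.parser` is used here instead of `html5lib` only to keep the example dependency-light. Unlike the loop in the cell above, this version stops at the first matching table.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two tables, one with a postal code
sample_html = """
<table><tr><td>Navigation</td></tr></table>
<table><tr><td><p>M1A</p><span>Not assigned</span></td></tr></table>
"""

# 'html.parser' ships with Python; the notebook itself uses 'html5lib'
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_tables = sample_soup.find_all("table")

# Index of the first table whose markup mentions "M1A", or None if there is none
match_index = next(
    (i for i, t in enumerate(sample_tables) if "M1A" in str(t)),
    None,
)

if match_index is None:
    raise ValueError("No table containing 'M1A' was found")

postal_table = sample_tables[match_index]
print(f"{len(sample_tables)} tables found, postal codes in table {match_index}")
```

Passing a default of `None` to `next` makes the "no table found" case explicit instead of leaving the index undefined.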


Now extract the contents of each cell in the correct table (the one that actually contains the postal codes). Cells containing the string "Not assigned" are skipped and never reach the dataframe ``neighborhoodsToronto``. The assignment states the following:
> if a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.

However, a visual inspection of the Wikipedia page shows that this case does not occur anywhere in the dataset: whenever a cell contains "Not assigned", it has no borough name either. It therefore suffices to filter out _all_ cells that contain "Not assigned". There are a number of odd neighborhood/borough names, most likely from the special-purpose codes mentioned on the Wikipedia page; the affected borough names are cleaned up manually.
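
For reference, here is a sketch of how the assignment's rule could be applied with pandas if such rows did exist. The dataframe `sample` and its contents (postal code `M0Z`, "Example Borough") are entirely hypothetical, since the scraped table contains no borough with an unassigned neighborhood.

```python
import pandas as pd

# Hypothetical rows, made up to exercise the rule; not taken from the Wikipedia table
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M0Z"],
    "Borough": ["Not assigned", "North York", "Example Borough"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Drop rows that have no borough at all
assigned = sample[sample["Borough"] != "Not assigned"].copy()

# Where the neighborhood is unassigned, fall back to the borough name
unassigned = assigned["Neighborhood"] == "Not assigned"
assigned.loc[unassigned, "Neighborhood"] = assigned.loc[unassigned, "Borough"]

assigned = assigned.reset_index(drop=True)
print(assigned)
```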

In [3]:
# Get the content of each cell
tableContents = []

for row in table.find_all('td'):
    # Skip cells without an assigned borough/neighborhood
    if "Not assigned" in str(row):
        continue

    cell = {}
    # The first three characters of the <p> text are the postal code
    cell['PostalCode'] = row.p.text[:3]
    # The <span> text is split at '(' into borough and neighborhood(s)
    cell['Borough'] = row.span.text.split('(')[0]
    cell['Neighborhood'] = row.span.text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    tableContents.append(cell)

# Transform into dataframe
neighborhoodsToronto = pd.DataFrame(tableContents)

# Clean up some odd borough names
neighborhoodsToronto['Borough'] = neighborhoodsToronto['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})
neighborhoodsToronto.head(12)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


In [4]:
neighborhoodsToronto.shape

(103, 3)
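
The string handling used in Part 1 to split a cell into borough and neighborhood can be checked offline with a small helper. This is only a sketch: `parse_cell_text` is a hypothetical function, and the sample strings assume the span text has the form `Borough(Neighborhood / Neighborhood)`, which is inferred from the split logic rather than copied from the page. Unlike the notebook code, it uses `partition`, so a cell without parentheses yields an empty neighborhood instead of raising an `IndexError`.

```python
def parse_cell_text(text):
    """Split 'Borough(Neighborhood / Neighborhood)' into (borough, neighborhoods)."""
    # partition returns an empty remainder when there is no '(' in the text
    borough, _, rest = text.partition("(")
    # Same replacements as in the notebook: drop parentheses, turn ' /' into ','
    neighborhood = rest.strip(")").replace(" /", ",").replace(")", " ").strip()
    return borough.strip(), neighborhood


# Assumed sample inputs, modeled on rows of the resulting dataframe
print(parse_cell_text("North York(Lawrence Manor / Lawrence Heights)"))
print(parse_cell_text("Scarborough(Malvern / Rouge)"))
print(parse_cell_text("Etobicoke"))
```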

## Part 2: Obtain latitude/longitude of each neighborhood

Use the ``pgeocode`` Python package to obtain the latitude and longitude of each neighborhood, looked up by postal code. The ``geocoder`` package suggested in the assignment did not work reliably: it needs _many_ repeated function calls, and even the example from the package's website (for Mountain View, CA) still returned None after more than 200 tries. That is not sustainable, so I decided to switch packages.

In [5]:
# Initialize variables for latitude and longitude
latitude  = np.empty(neighborhoodsToronto.shape[0])
longitude = np.empty(neighborhoodsToronto.shape[0])

# Loop over all postal codes with the pgeocode package
canadaGeoCode = pgeocode.Nominatim('ca')
for postalCodeIndex in neighborhoodsToronto.index:
    postalCode = neighborhoodsToronto.loc[postalCodeIndex, 'PostalCode']
    
    locationInformation = canadaGeoCode.query_postal_code(postalCode)
    
    latitude[postalCodeIndex]  = locationInformation.latitude
    longitude[postalCodeIndex] = locationInformation.longitude    
    #print("%s, latitude: %.3f, longitude: %.3f" % (postalCode, latitude[postalCodeIndex], longitude[postalCodeIndex]))
    
# Add latitude/longitude to neighborhoodsToronto dataframe
neighborhoodsToronto['Latitude']  = latitude
neighborhoodsToronto['Longitude'] = longitude

neighborhoodsToronto.head(12)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.7545 | -79.33 |
| 1 | M4A | North York | Victoria Village | 43.7276 | -79.3148 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6555 | -79.3626 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.7223 | -79.4504 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.6641 | -79.3889 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.6662 | -79.5282 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.8113 | -79.193 |
| 7 | M3B | North York | Don Mills North | 43.745 | -79.359 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.7063 | -79.3094 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.6572 | -79.3783 |
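
The row-by-row lookup in Part 2 could probably also be written as one batched call followed by a check for missing coordinates. The sketch below is an assumption-laden illustration: `attach_coordinates` and `fake_query` are hypothetical names, and `fake_query` is a stub that stands in for the geocoder so the example runs without downloading any data. Its two known coordinates are copied from the output above, and `M0Z` is a made-up code that has no match.

```python
import pandas as pd


def attach_coordinates(df, query):
    """Return a copy of df with Latitude/Longitude from a single batched lookup."""
    # query is assumed to take a list of codes and return a dataframe
    # with 'latitude' and 'longitude' columns in the same order
    result = query(df["PostalCode"].tolist())
    out = df.copy()
    # to_numpy() sidesteps index alignment between the two dataframes
    out["Latitude"] = result["latitude"].to_numpy()
    out["Longitude"] = result["longitude"].to_numpy()
    return out


def fake_query(codes):
    # Hypothetical stub; unknown codes get NaN coordinates
    known = {"M3A": (43.7545, -79.33), "M4A": (43.7276, -79.3148)}
    rows = [known.get(code, (float("nan"), float("nan"))) for code in codes]
    return pd.DataFrame(rows, columns=["latitude", "longitude"])


demo = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M0Z"]})
demo = attach_coordinates(demo, fake_query)

# Rows where the lookup produced no coordinates
missing = demo[demo[["Latitude", "Longitude"]].isna().any(axis=1)]
print(demo)
print(f"{len(missing)} postal code(s) without coordinates")
```

With the real package, the call would presumably be `attach_coordinates(neighborhoodsToronto, canadaGeoCode.query_postal_code)`. This assumes that `query_postal_code` accepts a list of codes and returns a dataframe with `latitude` and `longitude` columns, which is how I read the pgeocode documentation but have not verified here.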
