# Segmenting and Clustering Neighborhoods in Toronto

## To segment and cluster Toronto's neighborhoods, I will break the process into three parts:
1. Scrape the [Canadian Postal Code Wiki page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) to get the table containing postal code, borough, and neighborhood information for the city of Toronto
2. Obtain coordinates for each of the postal codes using the Geocoder package
3. Explore and cluster the Toronto neighborhoods for analysis

## Part 1
To get some practice with web scraping, I'm going to extract the postal code table with the BeautifulSoup package.

In [91]:
# Import BeautifulSoup4
from bs4 import BeautifulSoup

# Import Requests library so that we can feed the document behind the url to the BeautifulSoup constructor
import requests

# Get the html for soup
text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

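# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, 'html.parser')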

postalCodeTable = soup.find('table')
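
`soup.find('table')` returns the first `<table>` element in the document, so the cell above works only while the postal code table comes first on the page. Here is a minimal sketch of a more explicit selection. It assumes the target table carries a `wikitable` class, which is not verified against the live page, and the HTML is a made-up stand-in rather than the real article:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: a layout table followed by the data table.
sample_html = """
<table class="infobox"><tr><td>Navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find('table') alone picks whichever table appears first in the document.
first_table = sample_soup.find("table")

# Filtering on a CSS class picks the table by what it is, not where it sits.
data_table = sample_soup.find("table", class_="wikitable")

first_table_classes = first_table.get("class")
data_table_first_cell = data_table.find("td").text.strip()
print(first_table_classes, data_table_first_cell)
```

If the class-based lookup returns `None`, that is a cue to inspect the page source before going further.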

In [92]:
# Get the table headings for later
headings = []
for th in postalCodeTable.find('tr').find_all('th'):
    headings.append(th.text.replace('\n', ' ').strip())

# loop through each table row 'tr' and get table data 'td'
# store this data in an array and append the row data to a larger array of rows
arrayOfRows = []
for tr in postalCodeTable.find_all('tr'):
    temp_row = []
    for td in tr.find_all('td'):
        temp_row.append(td.text.replace('\n', ' ').strip())
    arrayOfRows.append(temp_row)

# Remove the empty first row: the header row has only 'th' cells, so it yields no 'td' data
del arrayOfRows[0]
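
The loop above can also skip rows without `td` cells as it goes, which avoids deleting the first row afterwards. Here is a minimal sketch of that variant. The helper name `table_to_rows` and the sample HTML are hypothetical, introduced only for illustration:

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Return (headings, rows) for a bs4 table element (hypothetical helper)."""
    # Headings come from the 'th' cells of the first row, as in the cell above.
    headings = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no 'td' cells, so it is skipped here
            rows.append(cells)
    return headings, rows


# Made-up sample table that mirrors the column layout of the scraped page.
sample_table_html = """
<table>
  <tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
  <tr><td>M5A
</td><td>Downtown Toronto
</td><td>Regent Park, Harbourfront
</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_table_html, "html.parser").find("table")
sample_headings, sample_rows = table_to_rows(sample_table)
print(sample_headings)
print(sample_rows)
```

Filtering on `if cells` does not rely on the header being the first row, so it still works if the table gains extra header rows.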

Now that we have the data scraped and placed into an array of rows, let's put it all together into a Pandas DataFrame object. To do this we will need to import some libraries.

In [93]:
import pandas as pd
import numpy as np

# Use the headings and row data to make a DataFrame object
df = pd.DataFrame(arrayOfRows, columns = headings)
df

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |
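
The DataFrame construction above relies on every scraped row having exactly one cell per heading. Here is a minimal sketch of a check that could run before building the DataFrame. The helper `find_ragged_rows` and the sample data are hypothetical, and the short row is deliberate:

```python
def find_ragged_rows(headings, rows):
    """Return (index, row) pairs whose cell count differs from the heading count."""
    expected = len(headings)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


sample_headings = ["Postal Code", "Borough", "Neighbourhood"]
sample_rows = [
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York"],  # deliberately short, to show what gets flagged
]

ragged = find_ragged_rows(sample_headings, sample_rows)
print(ragged)
```

An empty result means the rows and headings line up. Anything else points to a row worth inspecting in the page source.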


Now that we have the DataFrame object 'df', we have to clean it so that it matches the format in the project description. This means dropping every row whose borough is 'Not assigned'.

In [96]:
# Remove postal codes with no borough assigned to them
df = df[df.Borough != 'Not assigned']

# Make sure that there are no unassigned neighbourhoods
print('There are {} postal codes with unassigned neighbourhoods.'.format(
    df[df.Neighbourhood == 'Not assigned'].shape[0]))

# Print the number of rows in my DataFrame
print('There are {} rows in my DataFrame of Toronto postal codes!'.format(
    df.shape[0]))

There are 0 postal codes with unassigned neighbourhoods.
There are 103 rows in my DataFrame of Toronto postal codes!
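
Boolean filtering keeps the original index labels, so the cleaned frame above still starts at label 2 rather than 0. Here is a minimal sketch, on made-up toy data that mirrors the scraped columns, of the same filter followed by `reset_index(drop=True)` to renumber the rows:

```python
import pandas as pd

# Toy data with the same columns as the scraped table (not the real dataset).
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront"],
})

# Keep rows with an assigned borough, then renumber the index from 0.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Count remaining rows whose neighbourhood is still unassigned.
unassigned_neighbourhoods = int((assigned["Neighbourhood"] == "Not assigned").sum())

print(assigned)
print(unassigned_neighbourhoods)
```

Resetting the index is optional. It mainly helps later steps that select rows by position or join on the index.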
