# Segmenting and Clustering Neighborhoods in Toronto

Build the code to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M,
extract the data from its table of postal codes, and transform that data into a pandas dataframe.

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline 
from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs # The old 'samples_generator' module path was removed in newer scikit-learn releases

# For web scraping
import urllib.request # Open URLs
from bs4 import BeautifulSoup # Parse HTML documents

print('Done!')

Done!


#### 1. Scrape the Wikipedia page

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' # The page we are scraping
page = urllib.request.urlopen(url) # Open the url and put the HTML into the page variable

# https://simpleanalytical.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas

Parse the HTML from our URL into the BeautifulSoup parse tree format.

In [3]:
soup = BeautifulSoup(page, "lxml")

# The table we want is <table class="wikitable sortable">
# Rows start with <tr> and end with </tr>
# The top row of headers uses <th> tags
# The data rows beneath, one per postcode, use <td> tags
# Optional: To view the structure of the HTML >> print(soup.prettify())

Find all tables in the Wikipedia page.

In [4]:
all_tables = soup.find_all('table')

Find the right table.

In [5]:
postcode_table = soup.find('table', class_='wikitable sortable') 
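
To make the table lookup easier to reason about, here is a minimal sketch on a made-up two-table page (the HTML string and the `demo_soup`, `exact`, and `by_selector` names are illustrative assumptions, not part of the notebook). It uses the standard-library `html.parser` so it runs without `lxml`. A multi-word string passed to `class_` is compared against the whole class attribute, so it depends on the classes appearing in exactly that order; a CSS selector is a more tolerant alternative if the page markup ever changes.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: two tables, only one carrying the target classes.
html = """
<table class="infobox"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable"><tr><td>M3A</td></tr></table>
"""

demo_soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml needed

# A multi-word class_ string is matched against the full class attribute value.
exact = demo_soup.find("table", class_="wikitable sortable")

# A CSS selector checks each class separately, regardless of their order.
by_selector = demo_soup.select_one("table.wikitable.sortable")

print(exact.td.get_text())
print(exact == by_selector)
```

If either lookup returns `None`, the page layout has probably changed and the later row loop would fail, so it is worth checking before continuing.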

Loop through each row, pick out the "td" tags and take the contents from each into a list.

In [6]:
# There are three columns in the table that we want to scrape the data from, so create three empty lists (A, B, C) to store the data in
A = []
B = []
C = []

for row in postcode_table.find_all('tr'): # 'find_all' is the current name of the older 'findAll'; it returns every <tr> tag
    cells = row.find_all('td') # Collect the <td> tags of this row; the header row has none
    if len(cells) == 3: # Keep only complete data rows
        A.append(cells[0].find(string=True)) # First text node of each cell ('string=' replaces the older 'text=')
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))
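
The row loop can be tried offline on a small hand-written table. The sketch below is only an illustration under assumptions: the HTML snippet, its header names, and the `demo_table` and `rows` names are made up for this example. It also swaps in `get_text(strip=True)` in place of taking the first text node, which returns the full cell text and drops the trailing newline at the same time.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table: one header row, two data rows.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td><a href="#">North York</a></td><td><a href="#">Parkwoods</a>
</td></tr>
</table>
"""

demo_table = BeautifulSoup(html, "html.parser").find("table")

rows = []
for tr in demo_table.find_all("tr"):
    cells = tr.find_all("td")  # the header row only has <th>, so this list is empty
    if len(cells) == 3:        # keep complete data rows only
        # get_text(strip=True) gathers all text in the cell and trims whitespace
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Collecting one list per row, as done here, also makes it possible to build the dataframe in a single `pd.DataFrame(rows, columns=[...])` call.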

#### 2. Create the dataframe
The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

In [7]:
postcode_df = pd.DataFrame(A, columns = ['PostalCode'])
postcode_df['Borough'] = B
postcode_df['Neighborhood'] = C
postcode_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


Drop the rows whose borough is 'Not assigned'.

In [8]:
postcode_df = postcode_df[postcode_df.Borough !='Not assigned'] # Drops the rows but does not reset the index
postcode_df = postcode_df.reset_index(drop=True) # Reset the index
postcode_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
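
The effect of the filter and of `reset_index(drop=True)` is easiest to see on a three-row frame. This is a small sketch; the `demo`, `keep`, `filtered`, and `cleaned` names are illustrative and the rows are copied from the first output above.

```python
import pandas as pd

demo = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A"],
    "Borough": ["Not assigned", "Not assigned", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods"],
})

# Boolean mask: True for the rows we want to keep.
keep = demo["Borough"] != "Not assigned"

filtered = demo[keep]                       # the surviving row keeps its old label
cleaned = filtered.reset_index(drop=True)   # labels restart at 0; old index discarded

print(filtered.index.tolist())
print(cleaned.index.tolist())
```

Without `drop=True`, `reset_index` would keep the old labels as an extra column named `index`.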


More than one neighborhood can exist in one postal code area. Combine those rows into one row per postal code, separating the neighborhoods with a comma.

In [9]:
postcode_df2 = postcode_df.groupby(['PostalCode', 'Borough'], as_index=False).agg({'Neighborhood': ', '.join})
postcode_df2['Neighborhood'] = postcode_df2['Neighborhood'].str.replace('\n', '', regex=False) # Strip the newline characters left over from the scraped cells; assigning back avoids an in-place edit on a column selection
postcode_df2.head()

# https://stackoverflow.com/questions/45705474/transform-with-group-by-in-pandas

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
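
As a small illustration of the grouping step, the sketch below applies the same `groupby` and `agg` pattern to three rows (the `demo` and `combined` names are illustrative; the values are taken from the output above). Passing the bound method `', '.join` as the aggregator joins each group's neighborhood strings in their original row order.

```python
import pandas as pd

demo = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# One output row per (PostalCode, Borough) pair; as_index=False keeps the
# grouping keys as ordinary columns instead of moving them into the index.
combined = demo.groupby(["PostalCode", "Borough"], as_index=False).agg(
    {"Neighborhood": ", ".join}
)

print(combined)
```

Grouping on both `PostalCode` and `Borough` assumes each postal code belongs to a single borough; a code listed under two boroughs would produce two rows.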


If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough.

In [10]:
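postcode_df2['Neighborhood'] = postcode_df2['Neighborhood'].mask(postcode_df2['Neighborhood'] == 'Not assigned', postcode_df2['Borough']) # Where the neighborhood is 'Not assigned', take the borough of the same row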
postcode_df2.head(12)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
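
None of the rows shown above has a 'Not assigned' neighborhood, so the fallback is hard to see in the output. One way to check the rule in isolation is the sketch below, which uses `Series.mask` on a two-row frame; the second row (`X1A`, `Example Borough`) is a made-up placeholder, not data from the Wikipedia table, and the `demo` and `unassigned` names are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    "PostalCode": ["M1G", "X1A"],                   # 'X1A' is a hypothetical code
    "Borough": ["Scarborough", "Example Borough"],  # placeholder borough name
    "Neighborhood": ["Woburn", "Not assigned"],
})

# True where the neighborhood is missing.
unassigned = demo["Neighborhood"] == "Not assigned"

# mask() replaces values where the condition is True, taking the replacement
# from the same row of the Borough column (aligned by index).
demo["Neighborhood"] = demo["Neighborhood"].mask(unassigned, demo["Borough"])

print(demo)
```

Because the replacement is aligned by index, each row receives its own borough and rows with a real neighborhood are left unchanged.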


Print the shape of the dataframe as (rows, columns).

In [11]:
postcode_df2.shape

(103, 3)