# Segmenting & Clustering Neighborhoods in Toronto

### Install & import necessary libraries, modules, etc.

In [1]:
!pip install bs4 html5lib

from bs4 import BeautifulSoup
import requests

import pandas as pd



### Pull data from url and create BeautifulSoup object

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

data = requests.get(url).text

soup = BeautifulSoup(data, "html5lib")

Find the first table on the page using BeautifulSoup's `find` method on `table`.

Create an empty list (`table_contents`) to fill with the data extracted from the table.

Loop through the HTML table data cells (`td` tags), skipping cells whose borough is `'Not assigned'`. Parse the text of the remaining cells.

The postal code is the first 3 characters of the cell text. Assign those characters to the `'PostalCode'` dictionary key.

Neighborhoods follow the borough in parentheses, so the text can be split at `'('`. The borough is on the `[0]` side of the split. Assign it to the `'Borough'` key.

Neighborhoods are on the `[1]` side of the split. Strip off the trailing `')'`. Cells with multiple neighborhoods use slashes as separators; replace them with commas. A few cells have additional parentheses within the neighborhoods; replace those with spaces, then strip off any extra spaces. Assign the result to the `'Neighborhood'` key.

Append the dictionary for each cell to the `table_contents` list.

In [3]:
# Save the first <table> element returned by BeautifulSoup's 'find'
table = soup.find('table')

# Empty list to populate with table data extracted by for-loop below
table_contents = []

for row in table.find_all('td'):
    text = row.span.text
    if text == 'Not assigned':  # Ignore cells with 'Not assigned' borough
        continue
    cell = {}  # Create empty dictionary to fill with keys and corresponding extracted data
    cell['PostalCode'] = row.p.text[:3]  # Parse 3-character postal code, assign to key
    cell['Borough'] = text.split('(')[0]  # Take left side of split, assign to key
    # Take right side of split, clean text, and assign to key
    cell['Neighborhood'] = text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    table_contents.append(cell)  # Append dictionary to list
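
The string cleanup in the loop can be checked in isolation, without fetching the page. The sketch below applies the same chain of `split`, `strip`, and `replace` calls to two sample strings. `clean_cell_text` is a hypothetical helper name, and the sample strings are assumptions modelled on the `span` text the loop expects (a borough immediately followed by parenthesised neighborhoods).

```python
def clean_cell_text(text):
    """Split 'Borough(Neighborhood / Neighborhood)' into its two parts.

    Hypothetical helper mirroring the chained string calls in the loop.
    """
    parts = text.split('(')
    borough = parts[0]
    neighborhood = (
        parts[1]
        .strip(')')          # drop the closing parenthesis
        .replace(' /', ',')  # slashes between neighborhoods become commas
        .replace(')', ' ')   # any leftover parenthesis becomes a space
        .strip(' ')          # trim surrounding spaces
    )
    return borough, neighborhood


# Sample inputs are assumptions, not text copied from the page
print(clean_cell_text('North York(Parkwoods)'))
print(clean_cell_text('Downtown Toronto(Regent Park / Harbourfront)'))
```

Keeping the cleanup in one small function like this makes it easier to try out on the handful of irregular cells before running the full loop.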

### Cast list of dictionaries (`table_contents`) to pandas DataFrame. Check result.

In [4]:
df = pd.DataFrame(table_contents)
df

,Borough,Neighborhood,PostalCode
0,North York,Parkwoods,M3A
1,North York,Victoria Village,M4A
2,Downtown Toronto,"Regent Park, Harbourfront",M5A
3,North York,"Lawrence Manor, Lawrence Heights",M6A
4,Queen's Park,Ontario Provincial Government,M7A
5,Etobicoke,Islington Avenue,M9A
6,Scarborough,"Malvern, Rouge",M1B
7,North York,Don Mills North,M3B
8,East York,"Parkview Hill, Woodbine Gardens",M4B
9,Downtown Toronto,"Garden District, Ryerson",M5B


Reorder columns to match assignment.

In [5]:
df = df[['PostalCode', 'Borough', 'Neighborhood']].copy()  # copy so later column assignments act on df itself
df.head()

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
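
As an alternative to reordering after the fact, the column order can be fixed when the DataFrame is built, by passing `columns` to the `pd.DataFrame` constructor. A minimal sketch, using two hand-written records in place of `table_contents` (the name `example_df` is only for illustration):

```python
import pandas as pd

# Two hand-written records standing in for table_contents (illustrative only)
records = [
    {'PostalCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'},
    {'PostalCode': 'M4A', 'Borough': 'North York', 'Neighborhood': 'Victoria Village'},
]

# 'columns' selects and orders the dictionary keys in one step
example_df = pd.DataFrame(records, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(example_df.columns.tolist())
```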


Viewing the full `df` shows some poorly formatted borough names, where the borough has run together with other text from the cell.

Loop through the `Borough` column and build a list of its unique values to see which ones need adjustment. I defined a function (`col_values_list`) so I can rerun the check later or use it on other columns if need be.

In [6]:
def col_values_list(column):
    """Return the unique values of df[column], in order of first appearance."""
    col_values = []  # Empty list to populate w/ unique values

    for i in df[column]:
        if i not in col_values:
            col_values.append(i)

    return col_values

col_values_list('Borough')

['North York',
 'Downtown Toronto',
 "Queen's Park",
 'Etobicoke',
 'Scarborough',
 'East York',
 'York',
 'East Toronto',
 'West Toronto',
 'East YorkEast Toronto',
 'Central Toronto',
 'MississaugaCanada Post Gateway Processing Centre',
 'Downtown TorontoStn A PO Boxes25 The Esplanade',
 'EtobicokeNorthwest',
 'East TorontoBusiness reply mail Processing Centre969 Eastern']
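
The loop in `col_values_list` keeps the first occurrence of each value, in order. The same result is available from the standard library through `dict.fromkeys`, which relies on dictionaries preserving insertion order (Python 3.7+); pandas also provides `Series.unique()` for this purpose. A small sketch on a hand-written list, where `unique_in_order` is a hypothetical name:

```python
def unique_in_order(values):
    """Return the distinct items of values, keeping first-seen order."""
    return list(dict.fromkeys(values))


# Hand-written sample, not the real column
sample = ['North York', 'North York', 'Downtown Toronto', 'North York', 'Etobicoke']
print(unique_in_order(sample))
```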

Replace problematic borough names.  Check result with `col_values_list` function.

In [7]:
borough_fixes = {
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
}
df['Borough'] = df['Borough'].replace(borough_fixes)

col_values_list('Borough')

['North York',
 'Downtown Toronto',
 "Queen's Park",
 'Etobicoke',
 'Scarborough',
 'East York',
 'York',
 'East Toronto',
 'West Toronto',
 'East York/East Toronto',
 'Central Toronto',
 'Mississauga',
 'Downtown Toronto Stn A',
 'Etobicoke Northwest',
 'East Toronto Business']
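
The names that needed replacing share a pattern: a lowercase letter immediately followed by an uppercase one, where a separator was lost. As a rough sketch (a heuristic that is not part of the notebook, and one that would also flag legitimate names such as "McDonald"), a regular expression can list candidates for manual review. `find_fused` and the sample list are illustrative assumptions:

```python
import re

# Heuristic: a lowercase letter directly followed by an uppercase letter
FUSED = re.compile(r'[a-z][A-Z]')


def find_fused(names):
    """Return the names that look like two strings run together."""
    return [name for name in names if FUSED.search(name)]


boroughs = ['North York', "Queen's Park", 'East YorkEast Toronto', 'EtobicokeNorthwest']
print(find_fused(boroughs))
```

Running the same check on the cleaned list should return an empty list, which gives a quick confirmation that no fused names remain.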

Print the number of rows using the `.shape` attribute.

In [8]:
print('Number of rows in DataFrame: ' + str(df.shape[0]))

Number of rows in DataFrame: 103
