# Neighbourhood Segmentation and Clustering

In this notebook, we segment and cluster the neighborhoods in the city of Toronto.

A Wikipedia page lists all the information we need to explore and cluster the neighborhoods in Toronto. Because the data is not available in a ready-to-use format, we scrape the Wikipedia page with the **BeautifulSoup** package, clean the extracted data, and then read it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we can explore and cluster the neighborhoods in the city of Toronto.

In [54]:
#importing necessary libraries
#the "!" prefix runs pip as a shell command from the notebook, so pip itself does not need to be imported
!pip install beautifulsoup4
!pip install requests
from bs4 import BeautifulSoup
import requests
import pandas as pd



**Beautiful Soup** is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. It is available for Python 2.7 and Python 3. First, we download the source of the Wikipedia page and parse it with the BeautifulSoup package. Next, we extract the table containing the postcodes and other details and write each row of data to a pipe-delimited **.csv** file.
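To make the parse-then-extract idea concrete before touching the live page, here is a minimal sketch that runs on a small inline HTML fragment. The fragment, the `demo_` names, and the use of `get_text(strip=True)` are assumptions made for illustration; the real Wikipedia markup may differ, and the cells below follow the notebook's own approach.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like a wikitable (not the real page source).
demo_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# Build the parse tree with Python's built-in HTML parser.
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Locate the table by one of its CSS classes.
demo_table = demo_soup.find("table", class_="wikitable")

# Turn every table row into one pipe-delimited line.
demo_rows = []
for tr in demo_table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:  # skip rows that contain no cells
        demo_rows.append("|".join(cells))

demo_text = "\n".join(demo_rows)
print(demo_text)
```

Stripping each cell and joining rows with an explicit newline avoids relying on whitespace that happens to be present in the page source.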

In [2]:
#importing the wikipedia page & parsing
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text
soup = BeautifulSoup(source,'html.parser')

In [3]:
#extracting the necessary table details from soup
tables = soup.find_all('table', class_ = 'wikitable sortable')
len(tables) #checking for multiple tables
table = tables[0] #choosing the correct one
type(table)

bs4.element.Tag

In [49]:
#scraping the table data, formatting & loading to a .csv file
table_data_new = ""
for row in table.find_all('tr'):
    #data cells first, then header cells (a row on this page holds only one kind)
    cells = row.find_all('td') + row.find_all('th')
    #the text of the last cell in each row ends with a newline, which separates the rows
    if cells:
        table_data_new += "|".join(cell.text for cell in cells)

#non-ASCII characters are dropped while encoding; the with block closes the file
with open("postal_codes.csv", "wb") as file:
    bytes_written = file.write(bytes(table_data_new, encoding = "ascii", errors='ignore'))
bytes_written

8804

In [50]:
#importing the pipe-delimited .csv file to a dataframe using pandas
df = pd.read_csv('postal_codes.csv', sep = '|')
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


The above dataframe consists of three columns: Postcode, Borough, and Neighbourhood. We keep only the rows that have an assigned borough and drop the rows whose borough is Not assigned. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma.

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. For example, for the postal code M7A, the value of the Borough and the Neighborhood columns will be Queen's Park.
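The three rules above can be sketched on a tiny hand-made dataframe before applying them to the scraped data. The rows below are made up to mirror the cases just described (an unassigned borough, the duplicated M5A postcode, and M7A with an unassigned neighbourhood), and the `demo` name is hypothetical; this is an illustration of the rules, not necessarily the exact code used in the next cell.

```python
import pandas as pd

# Toy rows mirroring the cases described above (not the scraped data).
demo = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: drop rows whose borough is not assigned.
demo = demo[demo["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighbourhood takes the name of its borough.
unassigned = demo["Neighbourhood"] == "Not assigned"
demo.loc[unassigned, "Neighbourhood"] = demo.loc[unassigned, "Borough"]

# Rule 3: one row per postcode, neighbourhoods joined with a comma.
demo = (
    demo.groupby("Postcode", as_index=False)
        .agg({"Borough": "first", "Neighbourhood": ", ".join})
)
print(demo)
```

Running the rules in this order matters: filling neighbourhoods before dropping unassigned boroughs would copy "Not assigned" from the Borough column into rows that are about to be removed anyway, which is harmless but wasteful.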

In [51]:
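#Cleansing data
import numpy as np

#dropping rows without an assigned borough
df = df.drop(df[df['Borough']=='Not assigned'].index)

#an unassigned neighbourhood takes the name of its borough
df['Neighbourhood'] = np.where(df['Neighbourhood'] == 'Not assigned', df['Borough'], df['Neighbourhood'])

#combining neighbourhoods that share a postcode into one comma-separated row
df = df.groupby('Postcode').agg({'Borough':'first', 
                             'Neighbourhood': ', '.join}).reset_index()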

In [52]:
#updating the cleaned data in the .csv file
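#index=False keeps the row index out of the file, so reading it back does not add an extra unnamed column
df.to_csv("postal_codes.csv", sep='|', index=False)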

In [53]:
#checking for number of rows in final dataframe
df.shape 

(103, 3)