# Clustering Neighborhoods in Toronto

In this assignment, you will explore, segment, and cluster the neighborhoods of Toronto based on postal code and borough information.

The Toronto neighborhood data comes from a Wikipedia page that contains all the information needed to explore and cluster the neighborhoods. You will scrape the page, wrangle and clean the data, and then load it into a pandas DataFrame so that it is in a structured format.

Once the data is in a structured format, you can cluster Toronto neighborhoods.

In [7]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Retrieving data

1. Get the HTML data from the URL using Requests
2. Get postal codes and neighborhood data using BeautifulSoup

In [40]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_data = requests.get(url).text
soup = BeautifulSoup(html_data, "html5lib")  # parse the downloaded HTML into a soup object
table = soup.find('table')  # first table on the page
postal_codes = table.find_all('p')  # each <p> holds one postal code entry
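To make the lookups above easier to follow, here is a minimal sketch that runs the same `find` / `find_all` calls on a small inline HTML sample instead of the live page. The markup in `sample_html` is a simplified assumption inferred from the tags the code accesses (`<p>`, `<b>`, `<span>`); the real Wikipedia table may be structured differently, and the `M1A` row is included only to illustrate an unassigned entry. The sketch uses Python's built-in `"html.parser"` so it does not depend on `html5lib`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real Wikipedia table may differ.
sample_html = """
<table>
  <tr>
    <td><p><b>M1A</b><br/><span>Not assigned</span></p></td>
    <td><p><b>M3A</b><br/><span>North York(Parkwoods)</span></p></td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_cells = sample_soup.find("table").find_all("p")

for cell in sample_cells:
    # <b> holds the postal code, <span> holds the borough and neighborhoods
    print(cell.find("b").text, "|", cell.span.text)
```

Each `<p>` element yields one postal code and one text string that still needs to be split into borough and neighborhoods.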

3. Structure the data into a pandas DataFrame

In [49]:
postal_codes_data = []

for postal_code in postal_codes:
    # Skip postal codes that have no borough assigned
    if postal_code.span.text == 'Not assigned':
        continue

    dic = {}
    dic['PostalCode'] = postal_code.find('b').text.strip(' ')
    # The span text has the form "Borough(Neighborhood 1 / Neighborhood 2)"
    text = postal_code.span.text.split(sep='(')
    dic['Borough'] = text[0].strip(' ')
    dic['Neighboors'] = text[1].split(sep=')')[0].replace(' /', ',').strip(' ')

    postal_codes_data.append(dic)

df = pd.DataFrame(data=postal_codes_data)
df.head()

,PostalCode,Borough,Neighboors
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
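The string handling in the loop can be tried in isolation. The sketch below wraps it in a hypothetical helper, `parse_cell`, which is not part of the notebook. It assumes the cell text has the form `Borough(Neighborhood A / Neighborhood B)`, as inferred from the code and output above. It uses `str.partition` rather than indexing into a `split` result, so a cell without parentheses returns an empty neighborhood string instead of raising an `IndexError`.

```python
def parse_cell(cell_text):
    """Split 'Borough(Neighborhood A / Neighborhood B)' into its two parts."""
    # partition returns (before, separator, after); 'after' is '' if '(' is absent
    borough, _, rest = cell_text.partition('(')
    # Keep the text before the closing parenthesis and turn ' /' into ','
    neighborhoods = rest.split(')')[0].replace(' /', ',').strip()
    return borough.strip(), neighborhoods


# Assumed input format, reconstructed from the M5A row of the output
sample_text = "Downtown Toronto(Regent Park / Harbourfront)"
sample_borough, sample_neighborhoods = parse_cell(sample_text)
print(sample_borough)        # Downtown Toronto
print(sample_neighborhoods)  # Regent Park, Harbourfront
```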


In [51]:
df.shape

(103, 3)
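Beyond checking the shape, a few quick integrity checks can confirm that the cleaning step behaved as intended. The following is a minimal sketch, assuming the column names used above; `check_postal_codes` is a hypothetical helper, and `sample_df` only reproduces the five rows from the `df.head()` output so the example runs without scraping the page.

```python
import pandas as pd


def check_postal_codes(frame):
    """Return basic integrity checks for a postal-code DataFrame."""
    return {
        # Every postal code should appear exactly once
        "unique_postal_codes": bool(frame["PostalCode"].is_unique),
        # Rows with an unassigned borough should have been skipped
        "no_unassigned_boroughs": not (frame["Borough"] == "Not assigned").any(),
        # No cell should be missing
        "no_missing_values": not frame.isna().any().any(),
    }


# The five rows from the df.head() output
sample_df = pd.DataFrame(
    [
        {"PostalCode": "M3A", "Borough": "North York", "Neighboors": "Parkwoods"},
        {"PostalCode": "M4A", "Borough": "North York", "Neighboors": "Victoria Village"},
        {"PostalCode": "M5A", "Borough": "Downtown Toronto", "Neighboors": "Regent Park, Harbourfront"},
        {"PostalCode": "M6A", "Borough": "North York", "Neighboors": "Lawrence Manor, Lawrence Heights"},
        {"PostalCode": "M7A", "Borough": "Queen's Park", "Neighboors": "Ontario Provincial Government"},
    ]
)

checks = check_postal_codes(sample_df)
print(checks)
```

Calling `check_postal_codes(df)` on the full DataFrame would apply the same checks to all 103 rows.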