# Segmenting and Clustering Neighborhoods in Toronto
In this assignment, we will be exploring, segmenting, and clustering the neighborhoods in the city of Toronto.

## Part 1: Scraping data from Wikipedia webpage

First, install the required libraries.

In [None]:
import sys
# install all four packages in one call so conda resolves the environment only once
!conda install --yes --prefix {sys.prefix} beautifulsoup4 lxml html5lib requests
print("everything is installed now...")

Importing the required libraries 


In [23]:
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd

# to enable autocomplete in the notebook
%config IPCompleter.greedy=True 

Getting the webpage from Wikipedia

In [182]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_page = requests.get(url)
# stop here with an HTTPError if the request did not succeed
html_page.raise_for_status()

Parse the HTML content of the webpage with BeautifulSoup, then extract the first table tag, which holds the neighbourhood data.

In [183]:
soup = BeautifulSoup(html_page.content,'lxml')
table = soup.table
print(table.prettify())

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postcode
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Parkwoods" title="Parkwoods">
     Parkwoods
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Victoria_Village" title="Victoria Village">
     Victoria Village
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
     Downtown Toronto
    </a>
   </td>
   <td>
    <a href="

Extract headers from the table

In [210]:
headers = table.find_all('th')
headers_list = []
for x in headers:
    headers_list.append(x.text)
headers_list[2] = headers_list[2].replace('\n','')
print(headers_list)

['Postcode', 'Borough', 'Neighbourhood']


Extract rows from the table (delete the headers row)

In [211]:
content = table.find_all('tr')
del content[0]


Convert Extracted data to pandas dataframe

In [222]:
# initializing list of neighbourhoods
l = []

# put neighbourhoods in the list one by one (loop over the table rows)
for tr in content:
    # convert the extracted row to a list 
    row = tr.find_all('td')
    tmp_lst = [elem.text for elem in row]
    # the next line is to remove the \n (newline) from the last element of the list
    tmp_lst[2] = tmp_lst[2].replace('\n','')
    # appending the list to the list of the lists
    l.append(tmp_lst)
    
df_nbrs = pd.DataFrame(l,columns=headers_list)
print(df_nbrs.shape)
df_nbrs.head(30)

(288, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
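
The header and row extraction above can also be written without the manual `replace('\n','')` calls. The following is a minimal sketch, not the notebook's own method: it runs on a small hand-written HTML string (`sample_html`, an assumption standing in for the Wikipedia table) and uses the built-in `html.parser` so it needs no network access. All `sample_*` names are illustrative.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for the Wikipedia table (hypothetical markup, same layout).
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods
</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned
</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").table

# get_text(strip=True) trims surrounding whitespace, including the trailing newline.
sample_headers = [th.get_text(strip=True) for th in sample_table.find_all("th")]

# Keep only rows that contain <td> cells, which skips the header row.
sample_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in sample_table.find_all("tr")
    if tr.find_all("td")
]

sample_df = pd.DataFrame(sample_rows, columns=sample_headers)
print(sample_df)
```

Stripping every cell, rather than only the third one, also guards against stray whitespace in the Postcode and Borough columns.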


Ignore rows whose Borough is <b>Not assigned</b>

In [224]:
indexNames = df_nbrs[ df_nbrs['Borough'] == "Not assigned" ].index
df_nbrs.drop(indexNames , inplace=True)
df_nbrs.reset_index(inplace=True)
df_nbrs.drop("index",axis=1, inplace=True)
print(df_nbrs.shape)
df_nbrs.head(30)

(211, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
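
A more compact way to express the same filter is a boolean mask followed by `reset_index(drop=True)`. The sketch below uses a three-row toy frame (`toy` and `kept` are illustrative names, and the rows are copied from the output above) and shows that only the Borough column decides whether a row is dropped, so M7A survives even though its Neighbourhood is still "Not assigned".

```python
import pandas as pd

# Three rows taken from the scraped table shown above.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep rows with an assigned borough and renumber the index from 0.
kept = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(kept)
```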


In the next cell, I am grouping the dataframe by the Postcode column to join the Neighbourhoods that have the same Postcode

In [225]:
grouped = df_nbrs.groupby("Postcode").agg([','.join])
final_df = grouped.reset_index().droplevel(1,axis=1)
final_df.head(20)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,"Scarborough,Scarborough","Rouge,Malvern"
1,M1C,"Scarborough,Scarborough,Scarborough","Highland Creek,Rouge Hill,Port Union"
2,M1E,"Scarborough,Scarborough,Scarborough","Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,"Scarborough,Scarborough,Scarborough","East Birchmount Park,Ionview,Kennedy Park"
7,M1L,"Scarborough,Scarborough,Scarborough","Clairlea,Golden Mile,Oakridge"
8,M1M,"Scarborough,Scarborough,Scarborough","Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,"Scarborough,Scarborough","Birch Cliff,Cliffside West"
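
The repeated borough names in the output above come from applying `','.join` to every column. One possible alternative, sketched here on a toy frame with illustrative names (`toy`, `combined`), is to give each column its own aggregation: `"first"` for Borough and `",".join` for Neighbourhood. This assumes that every row sharing a Postcode also shares the same Borough, which holds for the rows shown in this notebook.

```python
import pandas as pd

# Rows copied from the filtered table, two of them sharing a postcode.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# One aggregation per column: keep the first borough, join the neighbourhoods.
combined = (
    toy.groupby("Postcode", as_index=False)
       .agg({"Borough": "first", "Neighbourhood": ",".join})
)
print(combined)
```

With `as_index=False` the Postcode stays a regular column, so no `reset_index` or `droplevel` call is needed afterwards.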


In the next cell, I am collapsing the repeated borough names in each row of the Borough column so that only one remains

In [216]:
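# every comma-separated entry in a row is the same borough, so keep only the first;
# a single column assignment also avoids chained assignment through .iloc
final_df["Borough"] = final_df["Borough"].str.split(',').str[0]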
final_df.head(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


Handling cases where Neighbourhood column is Not assigned

In [241]:
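# copy the Borough into Neighbourhood wherever the Neighbourhood is "Not assigned";
# .loc with a boolean mask writes to final_df itself instead of a temporary copy
not_assigned = final_df["Neighbourhood"] == "Not assigned"
final_df.loc[not_assigned, "Neighbourhood"] = final_df.loc[not_assigned, "Borough"]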
# showing the example mentioned (9th row in wikipedia page)
final_df[final_df["Postcode"]=="M7A"]

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park
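
The same replacement can be expressed in one step with `Series.mask`, which swaps in values from another Series wherever a condition is true. This is only a sketch on a two-row toy frame (`toy` is an illustrative name), not a claim about how the cell above must be written.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Where the neighbourhood is "Not assigned", take the value from Borough instead.
toy["Neighbourhood"] = toy["Neighbourhood"].mask(
    toy["Neighbourhood"] == "Not assigned", toy["Borough"]
)
print(toy)
```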


Showing that there are no rows with a "Not assigned" Borough

In [217]:
final_df[final_df["Borough"]=="Not assigned"]

Unnamed: 0,Postcode,Borough,Neighbourhood


Showing that there are no duplicates in the Postcode column, meaning that all Neighbourhoods with the same Postcode were combined into one row

In [218]:
final_df[final_df["Postcode"].duplicated()]

Unnamed: 0,Postcode,Borough,Neighbourhood


Showing that no row has a Neighbourhood equal to "Not assigned"

In [242]:
final_df[final_df["Neighbourhood"]=="Not assigned"]

Unnamed: 0,Postcode,Borough,Neighbourhood


In [244]:
final_df.shape

(103, 3)
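
The three checks above can be bundled into one small helper that returns counts instead of empty tables. `check_clean` is a hypothetical function introduced here for illustration; it assumes the column names used in this notebook.

```python
import pandas as pd

def check_clean(df):
    """Return problem counts; all zeros means the frame passed every check."""
    return {
        "unassigned_borough": int((df["Borough"] == "Not assigned").sum()),
        "unassigned_neighbourhood": int((df["Neighbourhood"] == "Not assigned").sum()),
        "duplicate_postcode": int(df["Postcode"].duplicated().sum()),
    }

# Toy frames: one clean, one with a leftover placeholder and a repeated postcode.
clean = pd.DataFrame({
    "Postcode": ["M1B", "M7A"],
    "Borough": ["Scarborough", "Queen's Park"],
    "Neighbourhood": ["Rouge,Malvern", "Queen's Park"],
})
dirty = pd.DataFrame({
    "Postcode": ["M1B", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Not assigned"],
})

print(check_clean(clean))
print(check_clean(dirty))
```

Calling `check_clean(final_df)` at the end of the cleaning steps would then summarize all three conditions in a single output.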

In [247]:
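# alternative to the manual scraping above: pd.read_html returns a list with one DataFrame per table on the page
ss = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")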

In [252]:
ss[0]

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


## Part 2:
