# Segmenting and Clustering Neighborhoods in Toronto
In this assignment, we explore, segment, and cluster the neighborhoods of the city of Toronto.

First, install the required libraries.

In [1]:
import sys
!conda install --yes --prefix {sys.prefix} beautifulsoup4
!conda install --yes --prefix {sys.prefix} lxml
!conda install --yes --prefix {sys.prefix} html5lib
!conda install --yes --prefix {sys.prefix} requests
print("everything is installed now...")

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    beautifulsoup4-4.8.0       |           py36_0         147 KB
    openssl-1.1.1d             |       h7b6447c_2         3.7 MB
    certifi-2019.9.11          |           py36_0         154 KB
    ca-certificates-2019.8.28  |                0         132 KB
    ------------------------------------------------------------
                                           Total:         4.2 MB

The following packages will be UPDATED:

    beautifulsoup4:  4.7.1-py36_1      --> 4.8.0-py36_0     
    ca-certificates: 2019.5.15-1       --> 2019.8.28-0      
    certifi:         2019.6.16-py36_1  --> 2019.9.11-py36_0 
    openssl:         1.1.1d-h7b6447c_1 --> 1.1.1d-h7b6447c_2


Downloading and Extracting Packa

Importing the required libraries 


In [10]:
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd


Getting the webpage from Wikipedia

In [11]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_page = requests.get(url)

Parsing the HTML content of the webpage using BeautifulSoup <br> 
Then extracting the `table` tag that contains the neighbourhood data

In [12]:
soup = BeautifulSoup(html_page.content,'lxml')
table = soup.table
print(table.prettify())

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postcode
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Parkwoods" title="Parkwoods">
     Parkwoods
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Victoria_Village" title="Victoria Village">
     Victoria Village
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
     Downtown Toronto
    </a>
   </td>
   <td>
    <a href="
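
`soup.table` returns the first `<table>` element in the document, which works here because the postal-code table comes first on the page. A minimal sketch, assuming the target table keeps the `wikitable` class seen in the output above, selects it by class instead. The inline HTML string is a made-up stand-in for the real page, and `html.parser` is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: an unrelated table first,
# followed by the data table carrying the class seen in the real output.
sample_html = """
<table class="infobox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# soup.table is shorthand for soup.find("table"): the first table in the page
first_table = soup.table

# class_ matches any table whose class list contains "wikitable"
data_table = soup.find("table", class_="wikitable")

print(first_table["class"])
print(data_table["class"])
```

Selecting by class is less likely to break if another table is ever placed before the one of interest.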

Extract headers from the table

In [13]:
headers = table.find_all('th')
# collect the text of every <th> cell in the header row
headers_list = [x.text for x in headers]
print(headers_list)

['Postcode', 'Borough', 'Neighbourhood\n']
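
The last header in the output above carries a trailing newline, which would end up in the DataFrame column name. A small sketch of one way to clean it, using a list literal copied from the printed output rather than the live page:

```python
# Header texts exactly as printed above, including the trailing newline.
raw_headers = ['Postcode', 'Borough', 'Neighbourhood\n']

# str.strip() removes leading and trailing whitespace, newlines included.
clean_headers = [h.strip() for h in raw_headers]
print(clean_headers)
```

With BeautifulSoup the same effect should be available at extraction time by calling `x.get_text(strip=True)` in place of `x.text`.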


Extract rows from the table (delete the headers row)

In [14]:
content = table.find_all('tr')
del content[0]
print(content)

[<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>, <tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>, <tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>, <tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>, <tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>, <tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>, <tr>
<td>M6A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Lawrence_Heigh

Convert the extracted data to a pandas DataFrame

In [47]:
# list of rows; each row is a list of [postcode, borough, neighbourhood]
l = []

# convert the table rows to lists of cell text, one row at a time
for tr in content:
    # extract the text of every <td> cell in this row
    row = tr.find_all('td')
    tmp_lst = [elem.text for elem in row]
    # remove the trailing \n (newline) from the last cell (the neighbourhood)
    tmp_lst[2] = tmp_lst[2].replace('\n','')
    # append this row to the list of rows
    l.append(tmp_lst)

df_nbrs = pd.DataFrame(l, columns=headers_list)
df_nbrs.head(30)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
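
The conversion above strips the newline only from the third cell. As a sketch of an alternative, `get_text(strip=True)` trims every cell, so no column index has to be hard-coded. The HTML below is a small sample built from rows shown in the earlier output, not the full page, and `html.parser` is an assumption made so the snippet runs without lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Sample rows taken from the table output above, wrapped in a minimal table.
sample_html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").table

# get_text(strip=True) trims surrounding whitespace on every cell
columns = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find("td")  # skips the header row, which has only <th> cells
]

df = pd.DataFrame(rows, columns=columns)
print(df)
```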


Drop the rows whose borough is "Not assigned"

In [48]:
indexNames = df_nbrs[ df_nbrs['Borough'] == "Not assigned" ].index
df_nbrs.drop(indexNames , inplace=True)
df_nbrs.head(30)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


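Reset the index so that the rows are numbered from 0 again

In [49]:
# drop=True discards the old index instead of adding it as an "index" column
df_nbrs.reset_index(drop=True, inplace=True)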
df_nbrs.head(30)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
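
The filtering and renumbering steps can also be written as one expression with a boolean mask. This is a sketch on a four-row sample copied from the first table above; the names `df` and `assigned` are illustrative and not part of the notebook.

```python
import pandas as pd

# First rows of the table shown earlier
df = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M2A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M4A", "North York", "Victoria Village"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# Keep rows with an assigned borough, then renumber the index from 0.
# This builds a new DataFrame and leaves df itself unchanged.
assigned = df[df["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Returning a new DataFrame rather than modifying in place keeps the original data available if a cell needs to be re-run.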
