# Segmenting and Clustering Neighborhoods in Toronto

In [15]:
conda install -c anaconda beautifulsoup4

Solving environment: done


  current version: 4.5.11
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    soupsieve-2.0              |             py_0          33 KB  anaconda
    openssl-1.1.1              |       h7b6447c_0         5.0 MB  anaconda
    certifi-2019.11.28         |           py36_1         157 KB  anaconda
    beautifulsoup4-4.8.2       |           py36_0         161 KB  anaconda
    ------------------------------------------------------------
                                           Total:         5.4 MB

The following NEW packages will be INSTALLED:

    soupsieve:      2.0-py_0          anaconda   

The following packages will be UPDATED:

### First of all, let's import the required libraries

In [1]:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup 
import numpy as np
import pandas as pd 
from urllib.request import urlopen

### Now let's define some variables to be used later

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html = urlopen(url)

soup = BeautifulSoup(html, 'html.parser')

In [3]:
my_table = soup.find_all('table', class_= 'wikitable')

### It is time to write a for loop that pulls the data from the tables at the URL above

In [4]:
postal_codes = []
boroughs = []
neighbourhoods = []

for table in my_table:
    rows = table.find_all('tr')
    
    for row in rows:
        cells = row.find_all('td')
        
        if len(cells)==3:
            postal_codes.append(cells[0].find(text=True).strip())
            boroughs.append(cells[1].find(text=True).strip())
            neighbourhoods.append(cells[2].find(text=True).strip())
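
The row-by-row extraction above can be tried offline on a small piece of markup. The sketch below is only an illustration: `SAMPLE_HTML` and `extract_rows` are hypothetical names, the markup is invented rather than copied from the Wikipedia page, and `get_text(strip=True)` is swapped in for `find(text=True).strip()` because it returns an empty string for an empty cell instead of `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (not the real Wikipedia page).
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def extract_rows(html):
    """Return (postal code, borough, neighbourhood) tuples from wikitable rows."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for table in soup.find_all("table", class_="wikitable"):
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            # Header rows contain <th> cells only, so they yield no <td> cells.
            if len(cells) == 3:
                records.append(tuple(cell.get_text(strip=True) for cell in cells))
    return records


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Collecting whole rows as tuples, rather than appending to three separate lists, keeps the three fields of each row together.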
            

### Let's put the data in a dataframe and check if anything is missing

In [5]:
df = pd.DataFrame(postal_codes,
                  columns = ['Postal Codes'])

df['Borough'] = boroughs
df['Neighbourhood'] = neighbourhoods

print(df.shape)
df.head(10)

(180, 3)


|   | Postal Codes | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| 7 | M8A | Not assigned | |
| 8 | M9A | Etobicoke | Islington Avenue |
| 9 | M1B | Scarborough | Malvern / Rouge |
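
In the rows shown above, the 'Neighbourhood' value is blank wherever the borough is 'Not assigned', so a check for missing data probably needs to count empty strings as well as NaN values. A minimal sketch, assuming three short hypothetical lists in place of the scraped ones:

```python
import pandas as pd

# Hypothetical sample lists standing in for the scraped data.
postal_codes = ["M1A", "M3A", "M4A"]
boroughs = ["Not assigned", "North York", "North York"]
neighbourhoods = ["", "Parkwoods", "Victoria Village"]

# Build all three columns in one step from a dict of lists.
df = pd.DataFrame(
    {
        "Postal Codes": postal_codes,
        "Borough": boroughs,
        "Neighbourhood": neighbourhoods,
    }
)

# Count empty strings and NaN values per column.
empty_counts = (df == "").sum()
missing_counts = df.isna().sum()

print(empty_counts)
print(missing_counts)
```

Scraped text is more likely to be blank than NaN, which is why the two counts are kept separate here.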


### Let's check whether any value in the 'Neighbourhood' column is 'Not assigned'

In [6]:
df['Neighbourhood'] != 'Not assigned'

0      True
1      True
2      True
3      True
4      True
       ... 
175    True
176    True
177    True
178    True
179    True
Name: Neighbourhood, Length: 180, dtype: bool
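
The comparison above returns one boolean per row, which is hard to read at a glance when there are 180 rows. A minimal sketch of reducing such a mask to a single answer with `Series.any()` and `Series.sum()`, using a small hypothetical frame rather than the scraped data:

```python
import pandas as pd

# Hypothetical sample frame, not the scraped table.
df = pd.DataFrame(
    {
        "Borough": ["Not assigned", "North York", "Etobicoke"],
        "Neighbourhood": ["", "Parkwoods", "Not assigned"],
    }
)

# True where the neighbourhood is literally the string 'Not assigned'.
is_unassigned = df["Neighbourhood"] == "Not assigned"

any_unassigned = bool(is_unassigned.any())  # is there at least one such row?
n_unassigned = int(is_unassigned.sum())     # how many such rows?

print(any_unassigned, n_unassigned)
```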

### Now it is time to clean the data so that it matches the final table required by the assignment

In [7]:
# Boolean mask: True for rows whose borough has been assigned
df_filter = df['Borough'] != 'Not assigned'

# Keep only those rows, then renumber the index from 0
df2 = df[df_filter].reset_index(drop = True)
df2.head()

|   | Postal Codes | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
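
Filtering with a boolean mask keeps the original row labels, which is why the index is reset afterwards. The sketch below shows the labels before and after the reset; the frame is a small hypothetical sample, not the scraped table.

```python
import pandas as pd

# Hypothetical sample frame, not the scraped table.
df = pd.DataFrame(
    {
        "Postal Codes": ["M1A", "M2A", "M3A", "M4A"],
        "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    }
)

# Rows that survive the filter keep their old labels (2 and 3 here).
filtered = df[df["Borough"] != "Not assigned"]
labels_before = filtered.index.tolist()

# drop=True discards the old labels instead of adding them as a column.
assigned = filtered.reset_index(drop=True)
labels_after = assigned.index.tolist()

print(labels_before, labels_after)
```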


In [10]:
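# Assign the result back instead of calling replace(..., inplace = True) on the selected column
df2['Neighbourhood'] = df2['Neighbourhood'].str.replace('/', ',', regex = False)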
df2.head()

|   | Postal Codes | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |
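
Swapping only the slash leaves a space before each comma, as in "Regent Park , Harbourfront". If a tidier separator is wanted, one option is a regular expression that also consumes the surrounding whitespace. A minimal sketch on a hypothetical two-item Series:

```python
import pandas as pd

# Hypothetical sample values in the same style as the scraped names.
names = pd.Series(["Regent Park / Harbourfront", "Parkwoods"])

# Literal replacement: only the slash changes, the spaces around it stay.
literal = names.str.replace("/", ",", regex=False)

# Regex replacement: the slash and any whitespace around it become ", ".
tidy = names.str.replace(r"\s*/\s*", ", ", regex=True)

print(literal.tolist())
print(tidy.tolist())
```

Whether the extra space matters depends on how the assignment's reference table is formatted, so the regex variant is offered only as an alternative.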


### A final check that the shape matches the intended shape

In [11]:
df2.shape

(103, 3)