# Segmenting and Clustering Neighborhoods

First, we import the libraries required for the project.

In [1]:
#importing required libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

### Import Data
1. Retrieve the page from the source website using the requests library.
2. Use Beautiful Soup to find every `table` element on the page.
3. Use pandas to convert the first table into a dataframe.

In [2]:
from io import StringIO  # newer pandas versions expect file-like input in read_html

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'  # source page
source = requests.get(url).text  # download the page HTML
soup = BeautifulSoup(source, 'lxml')  # parse the HTML
table = soup.find_all('table')  # find every <table> element
df = pd.read_html(StringIO(str(table)))[0]  # convert the first table to a data frame

### Identify Missing Values
Before cleaning, the data had 180 rows.
`.describe()` shows 180 unique values for Postal code, but "Not assigned" is the most frequent Borough value, appearing 77 times. Neighborhood has only 103 non-null values, leaving 77 missing.

In [3]:
df.describe()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| count | 180 | 180 | 103 |
| unique | 180 | 11 | 98 |
| top | M8X | Not assigned | Downsview |
| freq | 1 | 77 | 4 |
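The same counts can be read off directly rather than inferred from `.describe()`. The sketch below uses a small toy frame with hypothetical values (not rows from the Wikipedia table) to show one way of counting "Not assigned" boroughs and missing neighborhoods.

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical values, standing in for the scraped table
toy = pd.DataFrame({
    "Postal code": ["X1A", "X2B", "X3C"],
    "Borough": ["Not assigned", "Borough A", "Not assigned"],
    "Neighborhood": [np.nan, "Hood 1", np.nan],
})

# Number of rows whose Borough is the literal string "Not assigned"
n_unassigned = int(toy["Borough"].eq("Not assigned").sum())

# Number of rows whose Neighborhood is missing (NaN)
n_missing_hood = int(toy["Neighborhood"].isna().sum())

print(n_unassigned, n_missing_hood)
```

On the real data frame, the same two expressions applied to `df` should agree with the `freq` and `count` rows of the `.describe()` output.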


### Deal with missing values
After removing the rows with a "Not assigned" borough, 103 rows remained.

In [4]:
print("Before Cleaning",df.shape)
df = df[df['Borough']!="Not assigned"] #Remove Not Assigned from Borough
print("Removed Not Assigned",df.shape)

Before Cleaning (180, 3)
Removed Not Assigned (103, 3)


The assignment instructions stated:
>If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

However, when I checked the remaining rows for missing neighborhoods, there weren't any.

In [5]:
df['Neighborhood'].isna().any()  # True if any Neighborhood value is missing

False


In [6]:
# Commented out because it is redundant: no rows have a missing Neighborhood.
# Note that comparing with `!= np.nan` would not filter anything, because NaN never equals anything; dropna() does.
# df = df.dropna(subset=['Neighborhood'])
# print("Removed Nulls", df.shape)
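Had any such rows existed, the rule quoted from the assignment could be applied along the following lines. This is a minimal sketch on a toy frame with hypothetical values. It assumes that a missing neighborhood may appear either as NaN or as the literal string "Not assigned".

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical values, not taken from the Wikipedia table
toy = pd.DataFrame({
    "Postal code": ["X1A", "X2B", "X3C"],
    "Borough": ["Borough A", "Borough B", "Borough C"],
    "Neighborhood": ["Hood 1", np.nan, "Not assigned"],
})

# Treat both NaN and the literal string "Not assigned" as missing (assumption)
missing = toy["Neighborhood"].isna() | toy["Neighborhood"].eq("Not assigned")

# Where the neighborhood is missing, copy the borough name across
toy["Neighborhood"] = toy["Neighborhood"].mask(missing, toy["Borough"])

print(toy)
```

`Series.mask` replaces values only where the condition is true, so rows that already have a neighborhood are left untouched.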

### Correct the data format
Next, I reformatted the Neighborhood column, replacing " /" with ",".
I then reset the index so that it is contiguous again after the rows removed earlier.

In [7]:
df['Neighborhood'] = df['Neighborhood'].str.replace(" /", ",", regex=False)  # replace " /" with ","
df.reset_index(drop=True,inplace=True)
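To make the effect of these two steps concrete, here is a small sketch on a toy frame. The neighborhood names and the gap in the index are hypothetical, chosen only to mimic the situation after filtering.

```python
import pandas as pd

# Toy frame with a gap in the index, as left behind by row filtering
toy = pd.DataFrame(
    {"Neighborhood": ["Hood 1 / Hood 2", "Hood 3"]},
    index=[0, 5],
)

# Literal (non-regex) replacement of " /" with ","
toy["Neighborhood"] = toy["Neighborhood"].str.replace(" /", ",", regex=False)

# Discard the old index and renumber the rows from 0
toy = toy.reset_index(drop=True)

print(toy)
```

Passing `regex=False` keeps the pattern a plain string, which avoids surprises if the separator ever contains regex metacharacters.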

### Commentary
Whilst duplicates are expected in the Borough column, there are none in the Postal code column (103 unique values in 103 rows).\
There are, however, duplicates in the Neighborhood column (98 unique values in 103 rows).

In [8]:
df.describe(include='all')

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| count | 103 | 103 | 103 |
| unique | 103 | 10 | 98 |
| top | M9N | North York | Downsview |
| freq | 1 | 24 | 4 |
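The summary only reports how many Neighborhood values are unique. One way to see which rows are actually duplicated is sketched below on a toy frame with hypothetical values.

```python
import pandas as pd

# Toy rows with hypothetical values: "Hood 1" appears under two postal codes
toy = pd.DataFrame({
    "Postal code": ["X1A", "X2B", "X3C", "X4D"],
    "Borough": ["Borough A", "Borough A", "Borough B", "Borough B"],
    "Neighborhood": ["Hood 1", "Hood 1", "Hood 2", "Hood 3"],
})

# keep=False flags every occurrence of a repeated value, not just the later ones
dupes = toy[toy["Neighborhood"].duplicated(keep=False)]

print(dupes.sort_values("Neighborhood"))
```

Running the same filter on `df` would list each repeated neighborhood together with its postal codes.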


### Assumptions
I will assume that those duplicates arise because a Neighborhood can span multiple postal codes, while each Neighborhood still belongs to **a single borough**.\
The counts below also show that there are no hidden duplicates of Borough caused by different spellings or formats.

In [9]:
df['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
York                 5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64
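The single-borough assumption can be tested rather than taken on trust. The following sketch counts the distinct boroughs for each neighborhood on a toy frame with hypothetical values. One violation is included deliberately so the check has something to report.

```python
import pandas as pd

# Toy rows with hypothetical values: "Hood 2" is listed under two boroughs
toy = pd.DataFrame({
    "Postal code": ["X1A", "X2B", "X3C", "X4D"],
    "Borough": ["Borough A", "Borough A", "Borough B", "Borough C"],
    "Neighborhood": ["Hood 1", "Hood 1", "Hood 2", "Hood 2"],
})

# Number of distinct boroughs recorded for each neighborhood
boroughs_per_hood = toy.groupby("Neighborhood")["Borough"].nunique()

# Neighborhoods that break the single-borough assumption
violations = boroughs_per_hood[boroughs_per_hood > 1]

print(violations)
```

If the same check on `df` returns an empty result, the assumption holds for this data set.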

#### Let's check the shape of the final data frame.

In [10]:
df.shape

(103, 3)