# Coursera - Applied Data Science Capstone
## Week 3
### Segmenting and Clustering Neighborhoods in Toronto - Problem 1

This notebook is part of the Week 3 assignment for the Applied Data Science Capstone course on Coursera.

<b>Import Libraries</b>

In [1]:
# install the HTML parsers that pd.read_html relies on
!pip install lxml html5lib beautifulsoup4

import pandas as pd
import numpy as np



<b>Scrape Data from the Website and Save the Result as a DataFrame</b>

In [2]:
# scrape the following Wikipedia page to obtain the data that is in the table of postal codes 
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfs = pd.read_html(url)

# number of tables on website
print(len(dfs))

3


In [3]:
# display the first table (index 0) to confirm it is the one we want
dfs[0]

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
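
Selecting the table by position (`dfs[0]`) depends on the current page layout, which can change over time. A minimal sketch of a more defensive selection follows, assuming the wanted table is the first one that contains all three expected column names. `pick_table` is a hypothetical helper, and the two small frames only stand in for the list returned by `pd.read_html`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")

# Stand-ins for the list returned by pd.read_html (illustrative data only)
tables = [
    pd.DataFrame({"Province": ["ON"], "Prefix": ["M"]}),
    pd.DataFrame({
        "Postal Code": ["M3A"],
        "Borough": ["North York"],
        "Neighborhood": ["Parkwoods"],
    }),
]

wanted = pick_table(tables, ["Postal Code", "Borough", "Neighborhood"])
print(wanted.columns.tolist())
```

`pd.read_html` also accepts a `match` argument that restricts the result to tables containing the given text, which may be enough on its own for a page like this one.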


In [4]:
# build a pandas dataframe with three columns: PostalCode, Borough, and Neighborhood
# (.copy() creates an independent frame, so later changes do not trigger SettingWithCopyWarning)
df_postalCodes = dfs[0][['Postal Code', 'Borough', 'Neighborhood']].copy()

# rename the 'Postal Code' header to 'PostalCode'
df_postalCodes = df_postalCodes.rename(columns={"Postal Code": "PostalCode"})

df_postalCodes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


<b>Initial Analysis of Data</b>

Let's look at the shape, summary statistics, and column types of the data.

In [5]:
df_postalCodes.shape

(180, 3)

In [6]:
df_postalCodes.describe()

Unnamed: 0,PostalCode,Borough,Neighborhood
count,180,180,180
unique,180,11,100
top,M1M,Not assigned,Not assigned
freq,1,77,77


In [7]:
df_postalCodes['Borough'].value_counts()

Not assigned        77
North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East Toronto         5
East York            5
Mississauga          1
Name: Borough, dtype: int64

_As both outputs above show, 77 rows carry the placeholder value 'Not assigned' in 'Borough'._
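
The count of placeholder rows can also be computed directly rather than read from the output. The sketch below uses a small made-up frame (`sample` is illustrative, not the scraped data) and a boolean mask.

```python
import pandas as pd

# Illustrative stand-in for the scraped table
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
})

# Boolean mask: True where the placeholder string appears
is_unassigned = sample["Borough"].eq("Not assigned")

n_unassigned = int(is_unassigned.sum())  # number of placeholder rows
share = float(is_unassigned.mean())      # fraction of placeholder rows
print(n_unassigned, share)
```

Applied to `df_postalCodes`, the same mask should give the 77 reported above.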

In [8]:
df_postalCodes.dtypes

PostalCode      object
Borough         object
Neighborhood    object
dtype: object

_All three columns have dtype object (strings), as expected._

<b>Replace Missing Values in 'Borough'</b>

In [9]:
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

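# replace "Not assigned" with NaN
# (assigning the result back avoids chained in-place modification, which newer pandas versions do not apply)
df_postalCodes["Borough"] = df_postalCodes["Borough"].replace("Not assigned", np.nan)
df_postalCodes["Neighborhood"] = df_postalCodes["Neighborhood"].replace("Not assigned", np.nan)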

df_postalCodes.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [10]:
df_postalCodes['Borough'].isnull().value_counts()

False    103
True      77
Name: Borough, dtype: int64

In [11]:
df_postalCodes['Neighborhood'].isnull().value_counts()

False    103
True      77
Name: Neighborhood, dtype: int64

In [12]:
# drop every row that contains NaN in the "Borough" column
df_postalCodes = df_postalCodes.dropna(subset=["Borough"])

# reset the index, because rows were dropped
df_postalCodes = df_postalCodes.reset_index(drop=True)

df_postalCodes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [13]:
# checking if we have any missing neighborhood values to deal with
df_postalCodes['Neighborhood'].isnull().value_counts()

False    103
Name: Neighborhood, dtype: int64
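
No neighborhood values are missing here, so nothing further is needed. Had a row kept its borough but lost its neighborhood, one possible fallback is to reuse the borough name. The sketch below shows that idea on made-up data; the second row and its borough name are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative data: the second row is a hypothetical case with no neighborhood
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M0X"],
    "Borough": ["North York", "Example Borough"],
    "Neighborhood": ["Parkwoods", np.nan],
})

# Where Neighborhood is missing, fall back to the Borough value of the same row
sample["Neighborhood"] = sample["Neighborhood"].fillna(sample["Borough"])
print(sample)
```

`fillna` aligns the two Series on the index, so each missing value is filled from its own row.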

<b>Shape of Final Dataset</b>

In [14]:
# Assignment requirement: use the .shape attribute to show the number of rows (and columns) of the dataframe.
df_postalCodes.shape

(103, 3)
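
The cleaning steps in this notebook can be collected into a single function, which makes them easier to rerun if the page is scraped again. This is a sketch under the assumption that the raw table has the same three column names as above. `clean_postal_codes` is a hypothetical name, and the small `raw` frame is illustrative.

```python
import numpy as np
import pandas as pd

def clean_postal_codes(raw):
    """Select, rename, and filter the postal code table (sketch of the steps above)."""
    return (
        raw[["Postal Code", "Borough", "Neighborhood"]]
        .rename(columns={"Postal Code": "PostalCode"})
        .replace("Not assigned", np.nan)   # placeholder string -> NaN
        .dropna(subset=["Borough"])        # keep only rows with a borough
        .reset_index(drop=True)            # renumber rows from 0
    )

# Illustrative stand-in for dfs[0]
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

result = clean_postal_codes(raw)
print(result.shape)
```

Because every step returns a new frame, the chain does not modify `raw` and avoids in-place operations on a slice.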

<b>Save Dataframe to CSV file</b>

In [15]:
df_postalCodes.to_csv('df_postalCodes_Problem1.csv')
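
By default, `to_csv` writes the row index as an extra, unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The round trip below illustrates the difference that `index=False` makes. It uses an in-memory buffer and a small made-up frame instead of the real file.

```python
import io
import pandas as pd

# Illustrative stand-in for the cleaned dataframe
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the three data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```

Whether to keep the index is a design choice. If a later notebook reads the file with `index_col=0`, the default output is fine as it is.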