# Capstone Project - The Battle of the Neighborhoods

This Jupyter Notebook is mainly used for the __Capstone Project__ in the Coursera/IBM course _Applied Data Science Capstone_.

## Week 3 - Toronto Neighborhoods

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

![Coursera image](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1589414400000&hmac=XzHQN-7m3GJQYhlT8NyGjgPuEWl3iivMUr8EvLVFbVM)

Two alternatives are presented here. Alternative 1 uses only pandas; Alternative 2 additionally uses BeautifulSoup (please scroll down).

---

# Alternative 1 - pandas

In [1]:
# Import libraries and packages
import pandas as pd
import numpy as np

In [2]:
# Read all tables on the page and keep the one that has a 'Postal Code' column
wikipedia_url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
tables = pd.read_html(wikipedia_url, flavor='html5lib')
for table in tables:
    if 'Postal Code' in table:
        df_pandas = table
df_pandas.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
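The loop in the cell above relies on the fact that `in` applied to a DataFrame tests its column labels, not its cell values. The following is a minimal sketch of that selection step on two stand-in tables. The table contents and the names `tables`, `matches` and `df_example` are made up for illustration, and the explicit error is an added safeguard that the original cell does not have.

```python
import pandas as pd

# Two stand-in tables (illustrative data, not scraped from Wikipedia).
tables = [
    pd.DataFrame({"Province": ["Ontario"], "Code": ["M"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighborhood": [None, "Parkwoods"],
    }),
]

# `in` on a DataFrame checks the column labels, not the cell values.
matches = [t for t in tables if "Postal Code" in t]

# Fail loudly if no table qualifies, e.g. because the page layout changed.
if not matches:
    raise ValueError("no table with a 'Postal Code' column found")

df_example = matches[0]
print(list(df_example.columns))
```

Without such a check, a renamed column on the page would leave `df_pandas` undefined and the error would only surface in the next line.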


## Get basic information of the data frame

In [3]:
# Get basic information of the data frame
df_pandas.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,180,180,103
unique,180,11,98
top,M1M,Not assigned,Downsview
freq,1,77,4
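In the summary above, `count` is 180 for `Borough` but only 103 for `Neighborhood`. This is consistent with `describe()` counting only non-missing values, which suggests the empty neighborhood cells were read as missing values rather than as the text "Not assigned". A small sketch of that behavior, using a made-up four-row frame (`df_example` is an illustrative name):

```python
import pandas as pd

# Illustrative frame: two rows have no neighborhood at all.
df_example = pd.DataFrame({
    "Borough": ["Not assigned", "North York", "North York", "Not assigned"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village", None],
})

# For text columns, describe() reports count, unique, top and freq.
# `count` excludes missing values.
summary = df_example.describe()
print(summary)

# Count the missing neighborhoods explicitly.
missing = int(df_example["Neighborhood"].isna().sum())
print(missing)
```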


## Drop all rows in data frame which have a "Not assigned" in the "Borough" column

In [4]:
# Drop all rows in data frame which have a "Not assigned" in the "Borough" column
df_pandas.drop(df_pandas.index[df_pandas['Borough'] == 'Not assigned'], inplace = True)

# Reset the index after row drop
df_pandas = df_pandas.reset_index(drop=True)

# Show first five rows of altered data frame
df_pandas.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
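The same filtering can also be written with a boolean mask instead of an in-place `drop`. This is only an alternative sketch on a made-up three-row frame, not the notebook's own method; `df_example`, `assigned` and `df_clean` are illustrative names.

```python
import pandas as pd

# Illustrative frame with one unassigned borough.
df_example = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village"],
})

# Keep only rows whose borough is assigned, then renumber the index.
assigned = df_example["Borough"] != "Not assigned"
df_clean = df_example[assigned].reset_index(drop=True)

print(df_clean)
```

Because the mask produces a new frame, the original `df_example` stays unchanged, which can make it easier to re-run a cell.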


## Get basic information of the data frame

In [5]:
# Get basic information of the data frame
df_pandas.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,103,103,103
unique,103,10,98
top,M4G,North York,Downsview
freq,1,24,4


## If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough

In [6]:
df_pandas.loc[df_pandas['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df_pandas['Borough']
df_pandas.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
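The assignment above only matches the literal text "Not assigned". If a scrape instead yields missing values for empty neighborhood cells, a combined mask covers both cases. A minimal sketch under that assumption, with made-up rows (`df_example` and `needs_fill` are illustrative names):

```python
import pandas as pd

# Illustrative frame: one literal "Not assigned" and one missing value.
df_example = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto", "Etobicoke"],
    "Neighborhood": ["Parkwoods", "Not assigned", None],
})

# Rows where the neighborhood is missing or literally "Not assigned".
needs_fill = df_example["Neighborhood"].isna() | (
    df_example["Neighborhood"] == "Not assigned"
)

# Copy the borough into the neighborhood for exactly those rows.
df_example.loc[needs_fill, "Neighborhood"] = df_example.loc[needs_fill, "Borough"]

print(df_example)
```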


## Get basic information of the data frame

In [7]:
# Get basic information of the data frame
df_pandas.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,103,103,103
unique,103,10,98
top,M4G,North York,Downsview
freq,1,24,4


## Show the data frame

In [8]:
df_pandas

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing Centre
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


## Save data frame to csv file

In [9]:
df_pandas.to_csv(r'df_pandas.csv')
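By default `to_csv` also writes the row index as a first column without a header, and `read_csv` labels such a column `Unnamed: 0` when the file is read back. A short sketch of the difference, using an in-memory buffer instead of a file; the two-row frame and all variable names are illustrative.

```python
import io

import pandas as pd

# Illustrative two-row frame.
df_example = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
df_example.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
df_example.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Whether to keep the index in the file depends on how the CSV is consumed later; for a plain 0..n-1 index, `index=False` is often the tidier choice.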

## Show shape (number of rows and columns) of data frame

In [10]:
df_pandas.shape

(103, 3)

---

# Alternative 2 - BeautifulSoup

In [11]:
# Check whether the required packages are installed; install them if not
conda_package_check = !conda list beautifulsoup4
if 'beautifulsoup4' not in str(conda_package_check[-1]):
    print('Anaconda package "beautifulsoup4" is not installed yet.\n Installation will be executed now...')
    !conda install -c anaconda beautifulsoup4 --yes
else:
    print('Anaconda package "beautifulsoup4" is installed.\n No further actions needed...')
    
conda_package_check = !conda list lxml
if 'lxml' not in str(conda_package_check[-1]):
    print('Anaconda package "lxml" is not installed yet.\n Installation will be executed now...')
    !conda install -c anaconda lxml --yes    
else:
    print('Anaconda package "lxml" is installed.\n No further actions needed...')

Anaconda package "beautifulsoup4" is installed.
 No further actions needed...
Anaconda package "lxml" is installed.
 No further actions needed...
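The check above depends on IPython's `!` shell syntax and on conda being available. As an alternative sketch that runs in plain Python, the standard library can report whether a module is importable. Note that the import name for the `beautifulsoup4` package is `bs4`; the helper name `is_importable` is made up for this example. The later cells also use the `html5lib` parser, so it is included in the loop.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Import names of the packages used by this notebook.
for module_name in ("bs4", "lxml", "html5lib"):
    if is_importable(module_name):
        print(f'Module "{module_name}" is available.')
    else:
        print(f'Module "{module_name}" is missing and needs to be installed.')
```

This only reports the status; it does not install anything.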


In [12]:
# Import libraries and packages
import pandas as pd
import numpy as np
import requests
#import lxml.html as lh
from bs4 import BeautifulSoup

In [13]:
# Wikipedia URL
wikipedia_url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Request the URL
wikipedia_page = requests.get(wikipedia_url)

# Parse the downloaded page content
wikipedia_content = BeautifulSoup(wikipedia_page.content, 'html5lib')

# Get right table with 'Postal Code' in the header of the table
tables = wikipedia_content.find_all('table', class_='sortable')
for table in tables:
    ths = table.find_all('th')
    header = [th.text.strip() for th in ths]
    if 'Postal Code' in header:
        df_bs = pd.read_html(str(table), index_col=None, header=0, flavor='html5lib')[0]
        break
df_bs.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
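To illustrate the table selection without a network request, the sketch below runs the same idea on a small inline HTML snippet. It is a simplified variant under stated assumptions: the HTML is made up, Python's built-in `html.parser` is used instead of `html5lib`, and the rows are collected by hand rather than passed to `pd.read_html`. The names `html`, `columns`, `rows` and `df_example` are illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML with two tables; only the second is sortable.
html = """
<table class="wikitable"><tr><th>Province</th></tr><tr><td>Ontario</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

columns = None
rows = []
for table in soup.find_all("table", class_="sortable"):
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    if "Postal Code" not in header:
        continue
    columns = header
    # Collect the text of every data row; the header row has no <td> cells.
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    break

# Fail loudly if no matching table was found.
if columns is None:
    raise ValueError("no sortable table with a 'Postal Code' header found")

df_example = pd.DataFrame(rows, columns=columns)
print(df_example)
```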


## Get basic information of the data frame

In [14]:
# Get basic information of the data frame
df_bs.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,180,180,103
unique,180,11,98
top,M1M,Not assigned,Downsview
freq,1,77,4


## Drop all rows in data frame which have a "Not assigned" in the "Borough" column

In [15]:
# Drop all rows in data frame which have a "Not assigned" in the "Borough" column
df_bs.drop(df_bs.index[df_bs['Borough'] == 'Not assigned'], inplace = True)

# Reset the index after row drop
df_bs = df_bs.reset_index(drop=True)

# Show first five rows of altered data frame
df_bs.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


## Get basic information of the data frame

In [16]:
# Get basic information of the data frame
df_bs.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,103,103,103
unique,103,10,98
top,M4G,North York,Downsview
freq,1,24,4


## If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough

In [17]:
df_bs.loc[df_bs['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df_bs['Borough']
df_bs.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


## Get basic information of the data frame

In [18]:
# Get basic information of the data frame
df_bs.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,103,103,103
unique,103,10,98
top,M4G,North York,Downsview
freq,1,24,4


## Show the data frame

In [19]:
df_bs

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing Centre
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


## Save data frame to csv file

In [20]:
df_bs.to_csv(r'df_bs.csv')

## Show shape (number of rows and columns) of data frame

In [21]:
df_bs.shape

(103, 3)