# Coursera - IBM Data Scientist Specialization
## Capstone Project

#### This notebook is used to develop the Capstone Project of the Professional Data Scientist Certification specialization on Coursera
##### by Juliano Garcia

In [2]:
# Importing basic libraries
import pandas as pd
import numpy as np

In [3]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Importing and cleaning the dataset  
#### Toronto, Canada: Neighbourhood, Borough and Postcode information

In [4]:
# Importing the HTML scraper: BeautifulSoup
from bs4 import BeautifulSoup

# Importing html parser: lxml
import lxml

# Importing requests library
import requests

# Importing the geocoder library (disabled: not working)
# import geocoder

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# library to handle JSON files
import json

# Iteration tools library
import itertools

Our dataset is a table on a Wikipedia page:  
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M  
  
Below are the first 5 rows of the table, including the header row:  

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
</table>

Let's download the page and parse it to scrape the data

In [5]:
# Downloading and parsing the web page
html_page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(html_page, 'lxml')

# Getting the postcodes table
table = soup.find('table')
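
Note that `soup.find('table')` returns the first `<table>` element on the page, so it only works while the postcode table comes first. Here is a minimal sketch of a more targeted lookup, assuming the table keeps the `wikitable` class shown in the excerpt above. The `sample_html` string is a made-up stand-in for the downloaded page so the snippet runs offline, and Python's built-in `html.parser` is used instead of `lxml`:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: a navigation table
# followed by the postcode table (class names taken from the excerpt above).
sample_html = """
<table class="navbox"><tr><td>Navigation</td></tr></table>
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Match on the CSS class instead of relying on document order
postcode_table = sample_soup.find('table', class_='wikitable')

first_table_classes = sample_soup.find('table')['class']
postcode_table_classes = postcode_table['class']
print(first_table_classes, postcode_table_classes)
```

Matching on the class keeps the lookup stable if another table is ever added above the postcode table.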

Now we'll create a NumPy array with the values of the table.  
First let's get the header names from the `<th>` tags

In [6]:
# Let's get the table headers
columns = []
for col in table.find_all('th'):
    columns.append(col.text)

# Eliminates the '\n' at the end of the last string
columns[-1] = columns[-1][:-1]
columns

['Postcode', 'Borough', 'Neighbourhood']
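
Slicing off the last character works here because only the last header ends with a newline. As a hedged alternative, `get_text(strip=True)` trims the whitespace of every cell, so no position-specific fix is needed. The `header_html` string below is a shortened copy of the header row from the excerpt, and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Header row copied from the table excerpt (the last cell ends with a newline)
header_html = """
<table><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr></table>
"""
header_table = BeautifulSoup(header_html, 'html.parser').find('table')

# strip=True removes leading and trailing whitespace from the text of each cell
header_names = [th.get_text(strip=True) for th in header_table.find_all('th')]
print(header_names)
```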

Now we get the values of each row: every row is enclosed in a `<tr>` tag, and each column value of that row sits inside a `<td>` tag

In [7]:
# Initializing an array with the column headers
table_array = np.array([columns])

# Iterate through all the rows in the table
for i, lin in enumerate(table.find_all('tr')):
    if i != 0: # Ignores first row since it is the headers
        line = []
        for j, val in enumerate(lin.find_all('td')):
            val = val.text
            if j == 2: # Eliminates the '\n' at the end of all the last column values
                val = val[:-1]
            line.append(val)
        table_array = np.append(table_array, [line], axis=0)

# Now let's convert the array into a pandas dataframe with the first line as headers
df = pd.DataFrame(table_array[1:, :], columns=table_array[0,:])
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
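
A side note on the loop above: `np.append` returns a new array on every call, so the table is copied once per row. A minimal sketch of an alternative, assuming the same three-column layout: collect each row as a plain Python list and build the DataFrame once at the end. The `rows_html` string is a shortened copy of the excerpt, and `records` and `sample_df` are names made up for this example:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened copy of the table excerpt, used so the snippet runs offline
rows_html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td>
<td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""
rows_table = BeautifulSoup(rows_html, 'html.parser').find('table')

records = []
for tr in rows_table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row has no <td> cells, so it is skipped
        records.append(cells)

# Build the DataFrame in one step from the collected rows
sample_df = pd.DataFrame(records, columns=['Postcode', 'Borough', 'Neighbourhood'])
print(sample_df)
```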


#### Great! We have our raw table, now let's clean it!

In [8]:
# Eliminating the rows whose Borough is 'Not assigned'
df = df[df['Borough'] != 'Not assigned']
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
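
Filtering with a boolean mask keeps the original index labels, which is why the output above starts at 2. If a fresh 0-based index is preferred, `reset_index(drop=True)` can be chained after the filter. A small sketch on a hand-typed copy of the first rows (the `raw` and `filtered` names are made up for this example):

```python
import pandas as pd

# First four rows of the raw table, typed in by hand
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# True for the rows we want to keep
assigned_mask = raw['Borough'] != 'Not assigned'

# Keep the assigned rows and renumber the index from 0
filtered = raw[assigned_mask].reset_index(drop=True)
print(filtered)
```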


To end up with unique Postcodes in the dataset, let's group by Postcode and concatenate the neighbourhood names of each group.  
First let's create a function to apply to the grouped dataset

In [9]:
# Creates a function to concatenate the Neighbourhood names when grouping by Postcode
def group_postcode(x):
    return pd.Series({'Borough': x['Borough'].values[0],
                      'Neighbourhood': ', '.join(x['Neighbourhood'])})

In [10]:
df = df.groupby('Postcode').apply(group_postcode)
df.reset_index(inplace=True)
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
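
The same grouping can also be written with `agg`, passing one aggregation per column. This is an alternative sketch, not the approach used in this notebook; it assumes every Postcode maps to a single Borough, so taking the first value is safe. The three sample rows are copied from the filtered output shown earlier:

```python
import pandas as pd

# Rows copied from the filtered table (M5A appears twice)
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# 'first' keeps the first Borough of each group,
# ', '.join concatenates the Neighbourhood names of each group
grouped = (
    sample.groupby('Postcode', as_index=False)
          .agg({'Borough': 'first', 'Neighbourhood': ', '.join})
)
print(grouped)
```

With `as_index=False` the Postcode stays a regular column, so no separate `reset_index` call is needed.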


Now, do we have any "Not assigned" neighbourhoods? If so, let's replace them with the Borough name

In [11]:
pd.Series(df['Neighbourhood'] == 'Not assigned').sum()

1

In [12]:
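# Where the Neighbourhood is 'Not assigned', use the Borough name of the same row
df['Neighbourhood'] = df['Neighbourhood'].mask(df['Neighbourhood'] == 'Not assigned', df['Borough'])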
df.head(12)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [13]:
pd.Series(df['Neighbourhood'] == 'Not assigned').sum()

0

In [14]:
df.shape

(103, 3)
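
The manual checks above can be bundled into one small helper. This is only a sketch: `check_clean` is a hypothetical function name, and the sample frame is typed in from the first rows of the cleaned table rather than taken from the full dataset:

```python
import pandas as pd

def check_clean(frame):
    """Return simple sanity checks on a cleaned postcode table."""
    return {
        # Each Postcode should appear exactly once after grouping
        'unique_postcodes': bool(frame['Postcode'].is_unique),
        # No placeholder values should remain in either text column
        'no_unassigned_borough': not (frame['Borough'] == 'Not assigned').any(),
        'no_unassigned_neighbourhood': not (frame['Neighbourhood'] == 'Not assigned').any(),
    }

# First rows of the cleaned table, typed in by hand
cleaned_sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union', 'Woburn'],
})

checks = check_clean(cleaned_sample)
print(checks)
```

Calling `check_clean(df)` on the full table would run the same three checks on all 103 rows.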