# IBM - Applied data science capstone project
This notebook serves only to document and pass the capstone project for the IBM Data Science Professional Certificate on Coursera.

Check out the link for more information: [Applied Data Science Capstone](https://www.coursera.org/learn/applied-data-science-capstone)

## Peer graded assignment: Capstone Project Notebook 1
The following lines of code are only supposed to satisfy the peer graded assignment number 1 and do not serve any other purpose at this point.

In [1]:
# Import working libraries
import pandas as pd
import numpy as np

In [2]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!



## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
The following code sections only serve the purpose of solving the given tasks.

### Task 1: Create a dataframe about Toronto's neighborhoods
To solve this task, we have to scrape the neighborhoods table from the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) listing Toronto's postal codes.

To do so, we will use the pandas `read_html` function, which downloads every table found on the page.

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" # Define url
dfs = pd.read_html(url) # Download tables on website via pandas
print("The downloaded website contains {} tables.".format(len(dfs))) # Display how many tables have been downloaded

The downloaded website contains 3 tables.


As we can see, we have downloaded multiple tables. A quick visual check on the website itself shows us that the information we are interested in lies within the first table. We thus need to extract the first table (index 0) from the downloaded list.

In [4]:
df = dfs[0] # Get first table
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M1ANot assigned,M2ANot assigned,M3ANorth York(Parkwoods),M4ANorth York(Victoria Village),M5ADowntown Toronto(Regent Park / Harbourfront),M6ANorth York(Lawrence Manor / Lawrence Heights),M7AQueen's Park(Ontario Provincial Government),M8ANot assigned,M9AEtobicoke(Islington Avenue)
1,M1BScarborough(Malvern / Rouge),M2BNot assigned,M3BNorth York(Don Mills)North,M4BEast York(Parkview Hill / Woodbine Gardens),"M5BDowntown Toronto(Garden District, Ryerson)",M6BNorth York(Glencairn),M7BNot assigned,M8BNot assigned,M9BEtobicoke(West Deane Park / Princess Garden...
2,M1CScarborough(Rouge Hill / Port Union / Highl...,M2CNot assigned,M3CNorth York(Don Mills)South(Flemingdon Park),M4CEast York(Woodbine Heights),M5CDowntown Toronto(St. James Town),M6CYork(Humewood-Cedarvale),M7CNot assigned,M8CNot assigned,M9CEtobicoke(Eringate / Bloordale Gardens / Ol...
3,M1EScarborough(Guildwood / Morningside / West ...,M2ENot assigned,M3ENot assigned,M4EEast Toronto(The Beaches),M5EDowntown Toronto(Berczy Park),M6EYork(Caledonia-Fairbanks),M7ENot assigned,M8ENot assigned,M9ENot assigned
4,M1GScarborough(Woburn),M2GNot assigned,M3GNot assigned,M4GEast York(Leaside),M5GDowntown Toronto(Central Bay Street),M6GDowntown Toronto(Christie),M7GNot assigned,M8GNot assigned,M9GNot assigned


We can see that we downloaded the correct data. The data, however, is messy: each cell combines the postal code, the borough and the neighborhoods in a single string. It needs to be cleaned before we can use it:

#### Stacking the columns into a single column
To be able to iterate through the data more easily, we need to stack the columns on top of each other. We can do so by appending each column to a new variable. As a result, we will have converted our initial df with 20 rows and 9 columns into a new df with 180 rows and 1 column.

In [5]:
# Stack all columns on top of each other into one Series
# (Series.append was removed in pandas 2.0, so pd.concat is used instead)
df_stacked = pd.concat([df[column] for column in df.columns])

df_stacked = df_stacked.to_frame(name=0) # Convert back to df with a single column named 0
print("The stacked df has {} rows and {} columns:".format(df_stacked.shape[0], df_stacked.shape[1]))
df_stacked.head()

The stacked df has 180 rows and 1 columns:


Unnamed: 0,0
0,M1ANot assigned
1,M1BScarborough(Malvern / Rouge)
2,M1CScarborough(Rouge Hill / Port Union / Highl...
3,M1EScarborough(Guildwood / Morningside / West ...
4,M1GScarborough(Woburn)
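
The same column-by-column stacking can also be written without an explicit loop. The following is only a sketch of one alternative, using `DataFrame.melt` on a small made-up grid; the name `RAW` for the resulting column is an assumption and does not appear in the notebook. Unlike the approach above, `melt` also resets the index.

```python
import pandas as pd

# Small stand-in for the scraped table: 2 rows and 3 columns (made-up values)
grid = pd.DataFrame([["a0", "b0", "c0"],
                     ["a1", "b1", "c1"]])

# melt walks through the columns one after another and places
# all values in a single column, here called "RAW"
stacked = grid.melt(value_name="RAW")[["RAW"]]

print(stacked.shape)
print(stacked["RAW"].tolist())
```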


#### Getting the post codes
When we look at the data more closely, we notice that the postal code makes up the first three characters of each string. We can store these in a new column by slicing the first three positions of the string:

In [6]:
df_stacked['POSTAL_CODE'] = df_stacked[0].str[:3] # Receive first three characters as string
df_stacked.head()

Unnamed: 0,0,POSTAL_CODE
0,M1ANot assigned,M1A
1,M1BScarborough(Malvern / Rouge),M1B
2,M1CScarborough(Rouge Hill / Port Union / Highl...,M1C
3,M1EScarborough(Guildwood / Morningside / West ...,M1E
4,M1GScarborough(Woburn),M1G


#### Getting the boroughs
When we take a second look at the downloaded information, we can see that the borough always starts right after the postal code (the first three characters) and ends before a "(" character. We can use the split function to split the string at "(", keep the first part and then remove the postal code to isolate the borough:

In [7]:
split_char = "(" # Define ( as the character to split the string
df_stacked['BOROUGH'] = df_stacked[0].str.split(pat=split_char) # Split the string at the "(" character

df_stacked['BOROUGH'] = df_stacked['BOROUGH'].str[0] # Select first split
df_stacked['BOROUGH'] = df_stacked['BOROUGH'].str[3:] # Kick out the first three letters (postal code)

df_stacked.head()

Unnamed: 0,0,POSTAL_CODE,BOROUGH
0,M1ANot assigned,M1A,Not assigned
1,M1BScarborough(Malvern / Rouge),M1B,Scarborough
2,M1CScarborough(Rouge Hill / Port Union / Highl...,M1C,Scarborough
3,M1EScarborough(Guildwood / Morningside / West ...,M1E,Scarborough
4,M1GScarborough(Woburn),M1G,Scarborough


#### Getting the neighborhoods
In this step, we need to extract the neighborhoods. We can do that with the same approach as above, this time keeping the part after the "(" character. Afterwards, we will separate the neighborhoods with a "," instead of a "/", as the course material tells us to.

In [8]:
split_char = "(" # Define ( as the character to split the string
df_stacked['NEIGHBORHOOD'] = df_stacked[0].str.split(pat=split_char) # Split the string at the "(" character

df_stacked['NEIGHBORHOOD'] = df_stacked['NEIGHBORHOOD'].str[1] # Select second split
df_stacked['NEIGHBORHOOD'] = df_stacked['NEIGHBORHOOD'].str[:-1] # Drop the last character, the closing ")"

df_stacked['NEIGHBORHOOD'] = df_stacked['NEIGHBORHOOD'].str.replace(" / ", ", ")

df_stacked.head()

Unnamed: 0,0,POSTAL_CODE,BOROUGH,NEIGHBORHOOD
0,M1ANot assigned,M1A,Not assigned,
1,M1BScarborough(Malvern / Rouge),M1B,Scarborough,"Malvern, Rouge"
2,M1CScarborough(Rouge Hill / Port Union / Highl...,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
3,M1EScarborough(Guildwood / Morningside / West ...,M1E,Scarborough,"Guildwood, Morningside, West Hill"
4,M1GScarborough(Woburn),M1G,Scarborough,Woburn
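
The three extraction steps above (postal code, borough, neighborhoods) could also be done in a single pass with a regular expression. This is only a sketch under assumptions: the pattern below was written for the string shapes visible in the outputs above, and it keeps only the first parenthesised group, which may not be the desired behaviour for every row. It is not the method used in this notebook.

```python
import pandas as pd

# A few raw strings in the same shape as the scraped cells
raw = pd.Series([
    "M1ANot assigned",
    "M1BScarborough(Malvern / Rouge)",
    "M3BNorth York(Don Mills)North",
])

# Postal code: letter, digit, letter. Borough: everything up to "(".
# Neighborhood: optional text between the first "(" and the next ")".
pattern = r"^(?P<POSTAL_CODE>[A-Z]\d[A-Z])(?P<BOROUGH>[^(]+)(?:\((?P<NEIGHBORHOOD>[^)]*)\))?"

parts = raw.str.extract(pattern)
parts["NEIGHBORHOOD"] = parts["NEIGHBORHOOD"].str.replace(" / ", ", ", regex=False)

print(parts)
```

Rows without parentheses end up with a missing value (NaN) in `NEIGHBORHOOD`. Text that follows the closing parenthesis, as in the `M3B` cell, is ignored by this pattern.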


#### Cleaning up the dataframe
Finally, we can clean up the data to fulfill the course's requirements for task 1. First, we will drop the column with the messy data:

In [9]:
df_stacked.drop(columns=[0], inplace=True)
df_stacked.head()

Unnamed: 0,POSTAL_CODE,BOROUGH,NEIGHBORHOOD
0,M1A,Not assigned,
1,M1B,Scarborough,"Malvern, Rouge"
2,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
3,M1E,Scarborough,"Guildwood, Morningside, West Hill"
4,M1G,Scarborough,Woburn


Next, we will get rid of all rows where the borough is not assigned:

In [10]:
df_stacked = df_stacked.reset_index(drop=True) # Reset index
index_names = df_stacked[df_stacked['BOROUGH'] == 'Not assigned'].index # Define condition
df_stacked.drop(index_names, axis=0, inplace=True) # Drop rows that meet condition
df_stacked.head()

Unnamed: 0,POSTAL_CODE,BOROUGH,NEIGHBORHOOD
1,M1B,Scarborough,"Malvern, Rouge"
2,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
3,M1E,Scarborough,"Guildwood, Morningside, West Hill"
4,M1G,Scarborough,Woburn
5,M1H,Scarborough,Cedarbrae
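
Dropping rows by index, as done above, can also be expressed as a boolean mask that keeps only the rows we want. A minimal sketch on a made-up three-row sample (the variable names `sample` and `assigned` are illustrative only):

```python
import pandas as pd

# Made-up sample in the same layout as df_stacked
sample = pd.DataFrame({
    "POSTAL_CODE": ["M1A", "M1B", "M1C"],
    "BOROUGH": ["Not assigned", "Scarborough", "Scarborough"],
    "NEIGHBORHOOD": [None, "Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
})

# Keep only rows with an assigned borough and renumber the index from 0
assigned = sample[sample["BOROUGH"] != "Not assigned"].reset_index(drop=True)

print(assigned)
```

Note that this variant renumbers the index after filtering, whereas the cell above resets the index before dropping, which is why its output starts at index 1.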


The course says that the data set includes duplicates in the postal codes column. We therefore will check our dataframe for duplicates: 

In [11]:
duplicates = df_stacked['POSTAL_CODE'].duplicated() # Check for duplicates
print("We have {} duplicates in our postal codes.".format(duplicates.sum())) # Count duplicates = True

We have 0 duplicates in our postal codes.
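
Had the check found duplicates, the rows sharing a postal code would need to be combined into one. A sketch of how that could look, assuming duplicated postal codes share the same borough and their neighborhoods should be joined with a comma; the sample data is made up for illustration:

```python
import pandas as pd

# Made-up sample in which postal code M5A appears twice
sample = pd.DataFrame({
    "POSTAL_CODE": ["M5A", "M5A", "M1B"],
    "BOROUGH": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "NEIGHBORHOOD": ["Regent Park", "Harbourfront", "Malvern"],
})

# One row per postal code and borough, neighborhoods joined with ", "
combined = (
    sample.groupby(["POSTAL_CODE", "BOROUGH"])["NEIGHBORHOOD"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```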


As we have no duplicates in our postal code column, we can move on and check whether the value "Not assigned" appears in our neighborhoods column:

In [12]:
if 'Not assigned' in set(df_stacked['NEIGHBORHOOD']): # No "== True": it would chain with "in" and always evaluate to False
    print("There are missing neighborhoods, please check.")

else:
    print("No missing neighborhoods, proceed.")

No missing neighborhoods, proceed.
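
A word of caution on membership checks like this one: Python chains comparison operators, so combining `in` with `== True` in one expression does not do what it appears to. The sketch below, on made-up values, shows the effect along with two pandas-based checks (`eq(...).any()` for the placeholder text and `isna().any()` for missing values) that could be used instead.

```python
import pandas as pd

neighborhoods = {"Malvern, Rouge", "Not assigned"}

# Chained comparison: evaluated as
# ("Not assigned" in neighborhoods) and (neighborhoods == True)
chained = "Not assigned" in neighborhoods == True

# Plain membership test
plain = "Not assigned" in neighborhoods

print(chained)  # False, because a set never equals True
print(plain)    # True

# pandas-based checks on a made-up column
col = pd.Series(["Malvern, Rouge", None, "Not assigned"])
has_placeholder = col.eq("Not assigned").any()  # literal "Not assigned" present?
has_missing = col.isna().any()                  # any missing (NaN/None) values?
```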


Since we do not have any unassigned neighborhoods, we can proceed to the final step - assigning our cleaned dataframe to a new variable for further processing:

In [16]:
toronto = df_stacked
print("The Toronto neighborhood df has {} rows and {} columns:".format(toronto.shape[0], toronto.shape[1]))
print()
toronto.head()

The Toronto neighborhood df has 103 rows and 3 columns:



Unnamed: 0,POSTAL_CODE,BOROUGH,NEIGHBORHOOD
1,M1B,Scarborough,"Malvern, Rouge"
2,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
3,M1E,Scarborough,"Guildwood, Morningside, West Hill"
4,M1G,Scarborough,Woburn
5,M1H,Scarborough,Cedarbrae


### Task 2: Add geospatial data to your dataframe
For this task, we need to add geographical coordinates to our `toronto` dataframe. We will try the Geocoder library first and fall back on the provided CSV file if that does not work.
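
The CSV fallback boils down to a join on the postal code. The following is a minimal sketch, not the course's reference solution: the column names `Postal Code`, `Latitude` and `Longitude` are assumptions about the file's layout, a small inline table stands in for the result of `pd.read_csv(...)`, and the coordinates are placeholder numbers rather than real locations.

```python
import pandas as pd

# Made-up excerpt of the cleaned neighborhood dataframe
toronto_sample = pd.DataFrame({
    "POSTAL_CODE": ["M1B", "M1C"],
    "BOROUGH": ["Scarborough", "Scarborough"],
    "NEIGHBORHOOD": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
})

# Stand-in for the coordinates file (assumed column names, placeholder values)
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B", "M9Z"],
    "Latitude": [2.0, 1.0, 3.0],
    "Longitude": [-2.0, -1.0, -3.0],
})

# Align the key column name, then keep every neighborhood row (left join)
merged = toronto_sample.merge(
    coords.rename(columns={"Postal Code": "POSTAL_CODE"}),
    on="POSTAL_CODE",
    how="left",
)

print(merged)
```

A left join keeps all rows of the neighborhood dataframe, so a postal code missing from the coordinates file would show up with empty coordinates instead of silently disappearing.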

In [14]:
import geocoder # import library
print("Geocoder imported.")

Geocoder imported.


In [15]:

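# Postal code to look up; as an example, the first one in the toronto dataframe
postal_code = toronto['POSTAL_CODE'].iloc[0]

# initialize your variable to None
lat_lng_coords = None
attempts = 0

# retry a limited number of times, since the lookup can keep returning None
while lat_lng_coords is None and attempts < 5:
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng
    attempts += 1

if lat_lng_coords is not None:
    latitude, longitude = lat_lng_coords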
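
The lookup above resolves one postal code at a time and needs a `postal_code` value to work with. Below is a sketch of how it might be extended to every postal code with a bounded number of retries. The helper `lookup_coordinates`, its parameters and the stand-in `fake_lookup` are hypothetical names introduced for illustration, and the coordinates are placeholders so that the example runs without a network connection.

```python
import pandas as pd

def lookup_coordinates(postal_codes, lookup, max_attempts=3):
    """Collect coordinates for each postal code, retrying a limited number of times.

    `lookup` is any function that takes a postal code and returns
    [latitude, longitude], or None if nothing was found.
    """
    rows = []
    for postal_code in postal_codes:
        coords = None
        for _ in range(max_attempts):
            coords = lookup(postal_code)
            if coords is not None:
                break
        lat, lng = coords if coords is not None else (None, None)
        rows.append({"POSTAL_CODE": postal_code, "LATITUDE": lat, "LONGITUDE": lng})
    return pd.DataFrame(rows)

# Hypothetical stand-in for a real geocoding call (placeholder values)
def fake_lookup(postal_code):
    known = {"M1B": [1.0, -1.0]}
    return known.get(postal_code)

coords = lookup_coordinates(["M1B", "M1C"], fake_lookup)
print(coords)
```

In the notebook, `lookup` could be a function wrapping the call shown above, for example `lambda pc: geocoder.google('{}, Toronto, Ontario'.format(pc)).latlng`. Postal codes that still have no result after `max_attempts` tries are left with empty coordinates instead of blocking the loop.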