<h1 align = center>Applied Data Science Capstone Project Notebook</h1>
<h3 align = right>Nicolas Yanez, M.D., M.Sc.</h3>

## Week 1 - Capstone Project Notebook

Create a new repository on your GitHub account and name it <b>Coursera_Capstone</b>. I have prepared a guide to walk you through the process of creating a repository and setting it up. For Mac users, click [here][1]. For Windows users, click [here][2].

Now, start a Jupyter Notebook using any platform that you are comfortable with and do the following:

[1]: https://medium.com/@aklson_DS/how-to-properly-setup-your-github-repository-mac-version-3a8047b899e5
[2]: https://medium.com/@aklson_DS/how-to-properly-setup-your-github-repository-windows-version-ea596b398b

1. Write some markdown to explain that this notebook will be mainly used for the capstone project. **[DONE]**

This notebook will be used for the capstone project of the Applied Data Science Capstone course in the IBM Data Science Professional Certification.

2. Import the <i>pandas</i> library as pd. **[DONE]**
3. Import the <i>NumPy</i> library as np. **[DONE]**

In [1]:
import pandas as pd
import numpy as np

4. Print the following statement: Hello Capstone Project Course! **[DONE]**

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


5. Push the Notebook to your GitHub repository and submit a link to the notebook on your GitHub repository. **[DONE]**

## Week 1 PGA - END

## Week 2 - Segmenting and Clustering Neighborhoods in Toronto

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment. **[DONE]**
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a *pandas* dataframe like the one shown below:

![alt text][logo]

3. To create the above dataframe:  
  
  
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood **[DONE]**
* Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**. **[DONE]**
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that **M5A** is listed twice and has two neighborhoods: **Harbourfront** and **Regent Park**. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in **row 11** in the above table. **[DONE]**
* If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough. So for the **9th** cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be **Queen's Park**. **[DONE]**
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making. **[DONE]**
* In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe. **[DONE]**
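
The rules in the list above can be sketched on a few hand-typed rows. This is a minimal illustration, not the notebook's actual pipeline: the rows and the names `raw`, `clean`, and `n_rows` are assumptions made for the example, and the real table is much longer.

```python
import pandas as pd

# Illustrative rows only; they mimic the layout of the Wikipedia table, not its full contents.
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 1: keep only rows with an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood takes the borough's name.
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

# Rule 3: one row per postal code, with its neighborhoods joined by a comma.
clean = (
    clean.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)

# .shape is a (rows, columns) tuple, so index 0 is the row count.
n_rows = clean.shape[0]
print(clean)
print(n_rows)
```

Filling in the borough name before grouping keeps the placeholder text `Not assigned` from being joined into a combined neighborhood string.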


**Note:** There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use *pandas* to read the table into a *pandas* dataframe.

Another approach, which is worth learning for more complicated web-scraping cases, is to use the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

The package is so popular that there is a plethora of tutorials and examples on how to use it. Here is a very good YouTube video on how to use the BeautifulSoup package: https://www.youtube.com/watch?v=ng2o98k983k

Use *pandas*, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.
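
As a rough illustration of the *pandas*-only route mentioned in the note, the sketch below reads a small hand-written HTML table with `pandas.read_html`. The HTML fragment and the names `html` and `df_sample` are assumptions for the example, and it presumes an HTML parser such as `lxml` is installed. Passing the page URL instead of a string buffer should work in a similar way, although the real page may contain more than one table.

```python
from io import StringIO

import pandas as pd

# A tiny hand-written stand-in for the Wikipedia table (hypothetical, not the real page).
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

# read_html returns a list with one dataframe per <table> element it finds.
tables = pd.read_html(StringIO(html))
df_sample = tables[0]

# Rename the columns to the names the assignment asks for.
df_sample.columns = ["PostalCode", "Borough", "Neighborhood"]
print(df_sample)
```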

[logo]: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1577923200000&hmac=5pnBiPm5sXBYCe9OAcumreGkgKKmkttIDlQAcgxHbKk

In [3]:
# Install and import needed packages

#!pip install beautifulsoup4
from bs4 import BeautifulSoup

#!pip install lxml
import lxml

#!pip install html5lib
#!pip install requests
import requests

In [4]:
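# Fetch the Wikipedia page with a GET request and parse its HTML into a BeautifulSoup object
response = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
response.raise_for_status()  # Stop on an HTTP error instead of parsing an error page
source = response.text
soup = BeautifulSoup(source, 'lxml')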

In [5]:
# Recreate the postcode table as a list of values
data = []
table = soup.find('table', attrs = {'class': 'wikitable sortable'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = [ele.text.strip() for ele in row.find_all('td')]
    cols = [ele for ele in cols if ele]  # Get rid of empty values
    if cols:  # Skip rows without data cells, such as the header row (<th> cells only)
        data.append(cols)
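
To see what this kind of row loop yields without a network request, here is a minimal sketch on a hand-written HTML fragment. The fragment and the `sample_*` names are assumptions for illustration, and Python's built-in `html.parser` stands in for `lxml`. A header row built from `<th>` cells has no `<td>` cells, so `find_all('td')` returns nothing for it.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page (hypothetical); the real table is much longer.
sample_html = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront </td></tr>
    <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned </td></tr>
  </tbody>
</table>
"""

# 'html.parser' ships with Python; it is swapped in here for the notebook's 'lxml'.
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", attrs={"class": "wikitable sortable"})

# Collect the stripped text of every <td> cell, one list per table row.
sample_rows = []
for tr in sample_table.find("tbody").find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    sample_rows.append(cells)

# The header row has only <th> cells, so its list is empty; drop it.
sample_data = [cells for cells in sample_rows if cells]
print(sample_rows)
print(sample_data)
```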

In [6]:
# Build a pandas dataframe from the list, name its columns, drop rows with an unassigned borough, and combine the neighborhoods that share a postal code into a single row
df = pd.DataFrame(data)
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
df = df[df.Borough != 'Not assigned']

df = df.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()

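# Where a neighborhood is still 'Not assigned', fall back to the borough name.
# (Assigning to the row yielded by iterrows() may change a copy rather than df itself.)
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']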

In [7]:
# Dimensions of the final pandas dataframe
df.shape

(103, 3)