# Capstone project

## Coursera | Applied Data Science Capstone
### by Martin Kovarik

### Assignment
Goal: **Explore and cluster the neighborhoods in Toronto.**

1. Start by creating a new Notebook for this assignment.

2. Use the Notebook to build the code to scrape the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M <br>
    2a) Obtain the data that is in the table of postal codes.<br>
    2b) Transform the data into a pandas dataframe.<br>

3. To create the above dataframe:<br>
    3a) The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood<br>
    3b) Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.<br>
    3c) More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.<br>
    3d) If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.<br>
    3e) Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.<br>
    3f) In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.<br>

4. Submit a link to your Notebook on your GitHub repository. (10 marks)

### Ad1) Start by creating a new Notebook for this assignment.

Done.

### Ad2) Use the Notebook to build the code to scrape the following Wikipedia page: 
URL: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [102]:
# Libraries
import json
import xml

import folium
import geocoder
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
from pandas import json_normalize  # pandas >= 1.0; pandas.io.json.json_normalize is deprecated
from sklearn.cluster import KMeans

%matplotlib inline
#import warnings
#warnings.filterwarnings("ignore")

In [103]:
#Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

**2a) Obtain the data that is in the table of postal codes**

In [104]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
extracting_data = requests.get(url)
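
The request above does not check that the download succeeded, so a failed request would only surface later as a parsing error. A minimal sketch of a more defensive fetch, assuming the same `requests` library; the helper name `fetch_html`, the constant `WIKI_URL`, and the 10-second timeout are illustrative choices, not part of the original notebook:

```python
import requests

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on network or HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Usage (requires network access):
# html = fetch_html(WIKI_URL)
```

With this helper, the `BeautifulSoup` call in the next cell could take the returned string directly.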

In [105]:
soup = BeautifulSoup(extracting_data.text, 'lxml')

In [106]:
# Calling a Tag is shorthand for find_all(), so this lists every tag inside <head>.
soup.head()

[<meta charset="utf-8"/>,
 <title>List of postal codes of Canada: M - Wikipedia</title>,
 <script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy", ...

**2b) transform the data into a pandas  dataframe.**

In [107]:
data = []
columns = []
wiki_data = soup.find(class_='wikitable')  # first element with class "wikitable"
for tr in wiki_data.find_all('tr'):
    # Collect the stripped text of every header or data cell in this row.
    part = [cell.text.strip() for cell in tr.find_all(['td', 'th'])]
    if not columns:  # the first row is the table header
        columns = part
    else:
        data.append(part)
toronto = pd.DataFrame(data=data, columns=columns)
toronto.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
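
The parsing loop in this step can also be wrapped in a reusable function and tried on a small HTML string without touching the network. The sketch below is one way to do that; the function name `wikitable_to_frame` and the sample HTML are hypothetical, and it uses Python's built-in `html.parser` instead of `lxml` only to avoid an extra dependency:

```python
import pandas as pd
from bs4 import BeautifulSoup


def wikitable_to_frame(html):
    """Parse the first table with class 'wikitable' into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        raise ValueError("no table with class 'wikitable' found")
    rows = [
        [cell.text.strip() for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    rows = [row for row in rows if row]  # drop rows without any cells
    header, body = rows[0], rows[1:]     # first row holds the column names
    return pd.DataFrame(body, columns=header)


# Hypothetical two-row sample in the same layout as the Wikipedia table.
sample_html = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
sample = wikitable_to_frame(sample_html)
print(sample)
```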


### Ad3) To create the above dataframe

**3a) The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood**

In [108]:
# Rename the scraped headers to the column names the assignment requires.
toronto = toronto.rename(columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'})
toronto.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


**3b) Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.**

In [109]:
# Keep only the rows whose borough is assigned.
toronto = toronto[toronto['Borough'] != 'Not assigned']
toronto.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
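
The filter in this step is a boolean mask: the comparison yields one `True`/`False` value per row, and indexing with it keeps only the `True` rows. A small sketch on hypothetical sample data; the `.copy()` and `reset_index` calls are optional additions, not something the original cell does:

```python
import pandas as pd

# Hypothetical sample rows in the same layout as the scraped table.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

assigned = sample["Borough"] != "Not assigned"  # boolean mask, one value per row
filtered = sample[assigned].copy()              # copy so later edits do not touch `sample`
filtered = filtered.reset_index(drop=True)      # optional: renumber the rows from 0

print(filtered)
```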


**3c) More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.**

In [110]:
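def neighborhoods(grouped):
    # Join the Neighborhood values of one (PostalCode, Borough) group into one string.
    # sorted() orders whole cell values; names already combined in a cell keep their order.
    return ', '.join(sorted(grouped['Neighborhood'].tolist()))

group = toronto.groupby(['PostalCode', 'Borough'])
toronto = group.apply(neighborhoods).reset_index(name='Neighborhood')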

In [111]:
toronto.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
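
In the scraped table shown earlier, M5A already appears as a single row with both neighborhoods, so the grouping step has little to merge here. To illustrate the case the assignment describes, the sketch below runs a similar group-and-join on hypothetical data where M5A is listed twice. It uses `agg` with `', '.join` rather than the `apply` helper above, and it does not sort the names:

```python
import pandas as pd

# Hypothetical input where M5A appears twice, as the assignment text describes.
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M3A", "M5A"],
    "Borough": ["Downtown Toronto", "North York", "Downtown Toronto"],
    "Neighborhood": ["Harbourfront", "Parkwoods", "Regent Park"],
})

combined = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)   # join the names within each group, in row order
    .reset_index()    # turn the group keys back into ordinary columns
)
print(combined)
```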


**3d) If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.**

In [112]:
# Check whether any 'Neighborhood' value is 'Not assigned'.
# It turned out that none is, so no replacement is needed.
toronto['Neighborhood'].value_counts()

Downsview                                        4
Don Mills                                        2
Woburn                                           1
New Toronto, Mimico South, Humber Bay Shores     1
Glencairn                                        1
Steeles West, L'Amoreaux West                    1
East Toronto, Broadview North (Old East York)    1
Thorncliffe Park
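
Had the check found any such rows, the rule in 3d could be applied with a boolean mask and `.loc`. A minimal sketch on hypothetical data (the sample rows are made up for illustration and are not taken from the scraped table):

```python
import pandas as pd

# Hypothetical sample: the second row has a borough but no neighborhood.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

missing = sample["Neighborhood"] == "Not assigned"
print("Rows without a neighborhood:", int(missing.sum()))

# Copy the borough name into the neighborhood column for those rows only.
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]
print(sample)
```

Counting the mask with `missing.sum()` is also a more direct check than scanning the full `value_counts()` output.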

**3e) Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.**

Done.

**3f) In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.**

In [113]:
print(f"The dataset has {toronto.shape[0]} rows.")

The dataset has 103 rows.
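
The assignment calls `.shape` a method, but in pandas it is an attribute holding a `(rows, columns)` tuple, so it is read without parentheses. A small sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical three-row frame, used only to show what `shape` returns.
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
})

n_rows, n_cols = sample.shape  # attribute access, no parentheses
print(f"{n_rows} rows, {n_cols} columns")
```

`len(sample)` gives the same row count as `sample.shape[0]`.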
