## Data Science Capstone Project
There are three chapters in this project:

<b> (1) Web Scraping the List of Postal Codes of Canada</b>

<b> (2) Adding Geolocation to Postcodes</b>

<b> (3) Explore the Neighborhoods in Toronto</b>


### (1) Web Scraping the List of Postal Codes of Canada



In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import requests
from bs4 import BeautifulSoup



Download the HTML of the Wikipedia page.

In [2]:
response = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
# A status code of 200 means the page was loaded successfully.
#response.status_code

Use BeautifulSoup to parse the HTML and retrieve the table information.
The data of interest is in the first table, so only that table is scraped.

In [3]:
soup_obj = BeautifulSoup(response.text,'lxml')
#print(soup_obj.prettify())  # uncomment to inspect the parsed HTML
tables = soup_obj.find_all('tbody')
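
The parsing step can be tried offline on a small HTML string. The sketch below is an illustration only: `sample_html` is a hypothetical miniature of the Wikipedia table, and the built-in `html.parser` backend is used instead of `lxml` so that no extra parser is needed. It walks the rows and cells explicitly, which is one alternative to splitting the raw table text.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration.
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
first_table = soup.find_all("tbody")[0]

# Collect the stripped text of every header or data cell, row by row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in first_table.find_all("tr")
]
header, records = rows[0], rows[1:]
print(header)
print(records)
```

Reading cell by cell avoids depending on where newlines fall in the page source, at the cost of a slightly longer expression.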

Extract the table text and reshape it into the Nx5 array 'bits', one array row per table row.

In [4]:
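# Text of the first table; the cells are separated by newlines.
snip=tables[0].get_text()
# Drop the last character, split on newlines, and reshape to five tokens per table row.
bits=np.reshape(snip[0:-1].split("\n"),(-1,5))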
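
To see why the array has five columns, the reshape can be reproduced on a short string. This is a sketch under an assumption: `sample_text` is a hypothetical stand-in for the table text, built so that each row splits into an empty token, three cell values, and another empty token.

```python
import numpy as np

# Hypothetical table text: two rows, each framed by newlines.
sample_text = (
    "\nPostcode\nBorough\nNeighbourhood\n"
    "\n\nM3A\nNorth York\nParkwoods\n"
    "\n"
)

# Drop the last character, split on newlines, then fold into rows of five tokens.
tokens = sample_text[0:-1].split("\n")
sample_bits = np.reshape(tokens, (-1, 5))

print(sample_bits.shape)
print(sample_bits[1, 1:4].tolist())
```

The `-1` in the target shape lets NumPy infer the number of rows; the reshape raises an error if the token count is not a multiple of five.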

Create three DataFrames from the data table, one for each column, as they will be treated differently later. The empty first and last columns of the Nx5 data array need no explicit handling: they fall away when single columns are selected in the next step.

In [5]:
bits_p_df = pd.DataFrame(bits[1:],columns = bits[0])
bits_p_df.drop(['Borough','Neighbourhood'],axis=1,inplace=True)

bits_n_df = pd.DataFrame(bits[1:],columns = bits[0])
bits_n_df.drop('Borough',axis=1,inplace=True)

bits_b_df = pd.DataFrame(bits[1:],columns = bits[0])
bits_b_df.drop('Neighbourhood',axis=1,inplace=True)

Group the data by postal code. Concatenate the 'Neighbourhood' entries that share a postal code into one comma-separated string.


In [6]:
bits_p_li = list(bits_p_df.groupby(['Postcode'], as_index=True)['Postcode'].apply(lambda x: (','.join(x)).split(',')[0] ))
bits_n_li = list(bits_n_df.groupby(['Postcode'], as_index=True)['Neighbourhood'].apply(lambda x: ','.join(x)))
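# Use as_index=True, matching the two lines above, so apply returns a Series and list() yields one value per postcode.
bits_b_li = list(bits_b_df.groupby(['Postcode'], as_index=True)['Borough'].apply(lambda x: (','.join(x)).split(',')[0] ))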
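
The same grouping can also be expressed as a single aggregation. The sketch below is an alternative formulation, not the approach used in this notebook; `raw_df` holds hypothetical sample rows, and `"first"` is assumed to be an acceptable substitute for taking the first borough of each group.

```python
import pandas as pd

# Hypothetical sample rows: two postcodes, each listed with two neighbourhoods.
raw_df = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights", "Lawrence Manor"],
})

# One row per postcode: first borough, comma-joined neighbourhoods.
grouped_df = (
    raw_df.groupby("Postcode", as_index=False)
    .agg({"Borough": "first", "Neighbourhood": ",".join})
)
print(grouped_df)
```

Aggregating all columns in one call keeps the postcode, borough, and neighbourhood values aligned by construction, so no intermediate lists are needed.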

Collect the three lists in a dictionary and build a DataFrame from it.

In [7]:
postcode_dic = { 'Postcode':bits_p_li , 'Borough':bits_b_li , 'Neighbourhood':bits_n_li }
postcode_df = pd.DataFrame.from_dict(postcode_dic)

First, keep only the rows whose 'Borough' is not 'Not assigned'.

In [8]:
bo_is_assigned = (postcode_df['Borough'] != 'Not assigned') 
postcode_df = postcode_df[bo_is_assigned]
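
Boolean-mask filtering can be checked on a few rows. In this sketch `sample_df` contains hypothetical values, and the added `.copy()` is an optional precaution rather than part of the cell above.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the scraped table.
sample_df = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Boolean Series: True where a borough has been assigned.
is_assigned = sample_df["Borough"] != "Not assigned"

# Indexing with the mask keeps only the True rows; .copy() makes the result
# an independent DataFrame that can be modified later without warnings.
assigned_df = sample_df[is_assigned].copy()
print(assigned_df)
```

The filtered frame keeps the original index labels (1 and 2 here), which is why a later index reset is useful.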

Then copy the 'Borough' value to 'Neighbourhood' wherever 'Neighbourhood' has the value 'Not assigned'. Reset the index.

In [9]:
ne_not_assigned = (postcode_df['Neighbourhood'] == 'Not assigned')
# Assign through .loc in a single step; chained indexing may write to a temporary copy.
postcode_df.loc[ne_not_assigned, 'Neighbourhood'] = postcode_df.loc[ne_not_assigned, 'Borough']
postcode_df.reset_index(drop=True, inplace=True)
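
The fill can also be written without assigning into a boolean-indexed selection, using `Series.where`. This is a sketch of an alternative, with hypothetical sample rows in `sample_df`.

```python
import pandas as pd

# Hypothetical sample rows: the second neighbourhood is missing.
sample_df = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Keep 'Neighbourhood' where it is assigned; otherwise take the value from 'Borough'.
sample_df["Neighbourhood"] = sample_df["Neighbourhood"].where(
    sample_df["Neighbourhood"] != "Not assigned", sample_df["Borough"]
)
sample_df = sample_df.reset_index(drop=True)
print(sample_df)
```

`where` keeps the values for which the condition is True and substitutes the second argument elsewhere, so the condition is phrased as "is assigned" rather than "is not assigned".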

Display the DataFrame as a table. Index 85 is postcode M7A, table entry 9 on the Wikipedia page; its 'Neighbourhood' value was filled in by the previous cell.

In [10]:
postcode_df.head()
#postcode_df # use this line to look at all data

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


Finally, check the shape of the DataFrame:

In [11]:
postcode_df.shape

(103, 3)

Save the DataFrame postcode_df to a CSV file.

In [12]:
# to_csv returns None when given a file path, so the result is not assigned.
postcode_df.to_csv('postcode_df.csv', index=False, header=True)
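
A quick round trip shows what the saved file contains. The sketch below writes a hypothetical one-row `sample_df` to a temporary directory (the file name `postcode_sample.csv` is made up for the example) and reads it back; with `index=False` the index is not written, so no extra unnamed column appears on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row sample; the neighbourhood value contains a comma.
sample_df = pd.DataFrame({
    "Postcode": ["M1B"],
    "Borough": ["Scarborough"],
    "Neighbourhood": ["Rouge,Malvern"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "postcode_sample.csv")
    # index=False leaves the row index out of the file.
    sample_df.to_csv(csv_path, index=False)
    reloaded_df = pd.read_csv(csv_path)

print(reloaded_df.columns.tolist())
print(reloaded_df.loc[0, "Neighbourhood"])
```

The comma inside the neighbourhood value survives because `to_csv` quotes fields that contain the delimiter.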