# FIRST TASK - WEB SCRAPING

**In this first cell, I import all the libraries needed for the assignment.**

*Note: most of the libraries were already installed; I only had to install the BeautifulSoup4 library.*

In [14]:
# library to handle data in a vectorized manner
import numpy as np 

# library for data analysis
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# import needed libraries for web scraping
from bs4 import BeautifulSoup
import lxml
import html5lib
import requests

# map rendering library
import folium 

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

print('Libraries imported.')

Libraries imported.


**In the second cell, I download the HTML file so that I can scrape the information needed.**

In [15]:
!wget -q -O 'toronto_postcode.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Data downloaded!')

Data downloaded!


**Opening the file.**

In [16]:
with open('toronto_postcode.html') as html_file:
    soup = BeautifulSoup(html_file,'lxml')

**In the next cell, I locate the table from which I'll parse the values for the data frame.**

In [17]:
new_table = soup.find('table',{'class':'wikitable sortable'})
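
As a small illustration of the lookup above, the sketch below runs the same `find` call on a tiny inline HTML string instead of the downloaded page. The sample markup and the names `sample_html`, `table`, and `rows` are made up for this example, and it uses Python's built-in `html.parser` so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table (not the real page content).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Same lookup as in the notebook: the first table whose class attribute
# is "wikitable sortable".
table = soup.find('table', {'class': 'wikitable sortable'})

# Walk the table row by row; the header row has no <td> cells and is skipped.
rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)

print(rows)
```

Reading the cells row by row keeps each postal code together with its borough and neighborhood, which can be easier to check by eye than three separate column lists.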

**The following three cells parse the values for the PostalCode, Borough, and Neighborhood columns.**

In [18]:
# every third <td>, starting at offset 0, holds a postal code
values_post = new_table.find_all('td')[0::3]

PostalCode = [value.text for value in values_post]

In [19]:
# every third <td>, starting at offset 1, holds a borough
values_bor = new_table.find_all('td')[1::3]

Borough = [value.text for value in values_bor]

In [20]:
# every third <td>, starting at offset 2, holds a neighborhood
values_nei = new_table.find_all('td')[2::3]

# keep only the text before the first newline in each cell
Neighborhood = [value.text.split("\n")[0] for value in values_nei]
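
The three cells above rely on the table having exactly three columns: the `td` cells come back as one flat, row-by-row list, so a slice with step 3 picks out a single column. A minimal sketch of that idea, with plain strings standing in for the `td` elements (the `cells` list is illustrative data, not the scraped page):

```python
# Flat list of cell texts, in the order the cells appear in the table.
cells = [
    'M3A', 'North York', 'Parkwoods\n',
    'M4A', 'North York', 'Victoria Village\n',
    'M5A', 'Downtown Toronto', 'Harbourfront\n',
]

postal_codes = cells[0::3]   # positions 0, 3, 6, ...
boroughs = cells[1::3]       # positions 1, 4, 7, ...
# Keep only the text before the newline in the last column, as the notebook does.
neighborhoods = [text.split('\n')[0] for text in cells[2::3]]

print(postal_codes)
print(boroughs)
print(neighborhoods)
```

This shortcut only holds while every row has all three cells; if a row had a missing or merged cell, the columns would shift from that point on.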

**The next cell creates our data frame with Toronto's FSAs (forward sortation areas, the first three characters of a postal code).**

In [21]:
df_toronto = pd.DataFrame()
df_toronto['PostalCode'] = PostalCode
df_toronto['Borough'] = Borough
df_toronto['Neighborhood'] = Neighborhood

df_toronto.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


**The following two cells get rid of the Not assigned rows.**

- The first cell builds index masks that select the rows with Not assigned values.
- The second cell actually drops the rows whose Borough is Not assigned.

In [22]:
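# index labels of the rows whose Borough / Neighborhood is 'Not assigned'
borough_mask = df_toronto.index[df_toronto['Borough'] == 'Not assigned']
neighborhood_mask = df_toronto.index[df_toronto['Neighborhood'] == 'Not assigned']
# rows where both are unassigned; intersection() is the explicit set operation
# (using `&` on Index objects for this is deprecated in recent pandas)
neighborhood_and_borough_mask = borough_mask.intersection(neighborhood_mask)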

In [23]:
# borough_mask already holds index labels, so it can be passed to drop() directly
df_toronto.drop(borough_mask, inplace=True)
df_toronto.reset_index(drop=True, inplace=True)

print(df_toronto.shape)
df_toronto.head(10)

(212, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
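
The same filtering can also be written with a boolean mask instead of index labels. The following is a sketch of that alternative, not the code used above; the small `df` frame reuses a few rows from the earlier preview, and `has_borough` and `df_clean` are names chosen for the example.

```python
import pandas as pd

# Illustrative rows taken from the preview of the scraped table.
df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Boolean mask: True for the rows to keep.
has_borough = df['Borough'] != 'Not assigned'

# Select the kept rows and renumber them from 0.
df_clean = df[has_borough].reset_index(drop=True)

print(df_clean)
```

Selecting the rows to keep returns a new frame and leaves `df` untouched, which makes it easy to re-run the cell without side effects.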


**In the next cell, I once again use a mask, this time to replace a Not assigned Neighborhood with its Borough value.**

In [25]:
neighborhood_mask = df_toronto.index[df_toronto['Neighborhood'] == 'Not assigned']

# assign through .loc to avoid chained indexing, which may write to a temporary copy
df_toronto.loc[neighborhood_mask, 'Neighborhood'] = df_toronto.loc[neighborhood_mask, 'Borough']

In [32]:
#double check
df_toronto.loc[6,:]

PostalCode               M7A
Borough         Queen's Park
Neighborhood    Queen's Park
Name: 6, dtype: object
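
The replacement can also be done in one vectorized step with `Series.mask`, which swaps in another value wherever a condition is true. This is a sketch of an alternative under the assumption that unassigned neighborhoods are marked with the exact string `'Not assigned'`; the two-row `df` is example data.

```python
import pandas as pd

# Two example rows: one with a real neighborhood, one unassigned.
df = pd.DataFrame({
    'PostalCode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Lawrence Heights', 'Not assigned'],
})

# Where the condition is True, take the value from 'Borough';
# elsewhere keep the original neighborhood.
unassigned = df['Neighborhood'] == 'Not assigned'
df['Neighborhood'] = df['Neighborhood'].mask(unassigned, df['Borough'])

print(df)
```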

**Finally, in the following cell I define two lambda functions to merge the rows that share a postal code into a single row: one joins the neighborhoods and the other picks the borough.**

*Note: the neighborhoods will be separated with a comma.*

In [28]:
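# join all neighborhoods of a postal code into one comma-separated string
lambda_nei = lambda x : ', '.join(x)
# assumes every row of a postal code has the same borough, so any one of them will do
lambda_bo = lambda x : set(x).pop()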

new = df_toronto.groupby('PostalCode')
new_neighborhood = new['Neighborhood'].apply(lambda_nei)
new_borough = new['Borough'].apply(lambda_bo)

columns = list(zip(new_borough.index, new_borough, new_neighborhood))
df_toronto_new = pd.DataFrame(columns)
df_toronto_new.columns = ['PostalCode','Borough','Neighborhood']
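
For comparison, the grouping step can be expressed with a single `groupby(...).agg(...)` call. The sketch below is an alternative to the cell above rather than a copy of it: it takes the first borough of each group (assuming all rows of a postal code share one borough) and joins the neighborhoods with a comma. The three-row `df` is example data, and `merged` is a name chosen for the example.

```python
import pandas as pd

# Example rows: M5A appears twice and should collapse into one row.
df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One aggregation per column: first borough, comma-joined neighborhoods.
merged = (
    df.groupby('PostalCode', as_index=False)
      .agg({'Borough': 'first', 'Neighborhood': ', '.join})
)

print(merged)
```

With `as_index=False` the postal code stays a regular column, so the result does not need to be rebuilt with `zip`.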

### FINAL RESULT

In [31]:
print(df_toronto_new.shape)
df_toronto_new.head(10)

(103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


----------------
*The dataframe was parsed from the following Wikipedia page:*

[Toronto Postal Codes](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)