# IBM Applied Data Science Capstone Project

This notebook fulfills the requirements of the IBM Applied Data Science Capstone Project on Coursera.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

### Beautiful Soup

In [2]:
# Download the page and keep the response body (the raw HTML) as a string
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(html, 'lxml')
#print(soup.prettify())

The request returns a lot of information. BeautifulSoup's `prettify` function does a good job of cleaning up the raw HTML, but if you delete the # symbol and rerun the cell above you will see that there is still a lot of it, and if you are new to scraping or HTML it can be a little overwhelming. For this example the data is in a table, so to find it you search for the table tag (`<table>`).

In [3]:
table = soup.find_all('table')[0]
#table
len(table)

2

A page can contain more than one `table`, so we need to be a little more specific to get the one we want. Note that `len(table)` above counts the child nodes of the first table, not the number of tables on the page; `len(soup.find_all('table'))` would give that count. Looking at the HTML, the table we want has the class "wikitable sortable". Let's try to grab it with the **.find_all** function.
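One detail is worth checking when counting tables: `len()` on the result of `find_all` counts the matching tags, while `len()` on a single tag counts that tag's direct children. The sketch below illustrates this on a made-up HTML snippet; `sample_html` and the names derived from it are assumptions for illustration, not the Wikipedia page, and `html.parser` is used only so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
sample_html = (
    "<html><body>"
    "<table class='wikitable sortable'>"
    "<tr><td>M3A</td></tr><tr><td>M4A</td></tr>"
    "</table>"
    "<table class='navbox'><tr><td>footer</td></tr></table>"
    "</body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

all_tables = sample_soup.find_all("table")
n_tables = len(all_tables)        # number of <table> tags in the document
n_children = len(all_tables[0])   # number of direct children of the first table

# Filtering on the exact class string narrows the search to one table.
wiki_tables = sample_soup.find_all("table", class_="wikitable sortable")
n_wiki = len(wiki_tables)
```

In this snippet both counts happen to be the same number, which is why the two are easy to confuse.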

In [4]:
table = soup.find_all('table', class_ = "wikitable sortable")
#table
print(len(table),type(table))

1 <class 'bs4.element.ResultSet'>


In [5]:
table=table[0]
type(table)

bs4.element.Tag

An alternative to this is 
```table_alt = soup.find('table',{'class':'wikitable sortable'})```  
It returns the first matching tag directly instead of a list. The tag still contains the hyperlinks, which are the **"a href="** attributes; these would be useful if you wanted to scrape information from those links.

In [7]:
table_alt = soup.find('table',{'class':'wikitable sortable'})
#table_alt
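As a minimal sketch of how `find` relates to `find_all` and how the links could be collected, the example below uses a made-up table (`sample_html`, `first_match`, `direct_match`, and `links` are hypothetical names, not part of the notebook). It assumes the links of interest are `a` tags carrying an `href` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td></tr>
  <tr><td>M5A</td><td><a href="/wiki/Downtown_Toronto">Downtown Toronto</a></td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like ResultSet; find returns the first match itself.
first_match = sample_soup.find_all("table", class_="wikitable sortable")[0]
direct_match = sample_soup.find("table", {"class": "wikitable sortable"})

# Collect the link target of every <a> tag that has an href attribute.
links = [a["href"] for a in direct_match.find_all("a", href=True)]
```

Both lookups end up at the same table, so the choice between them is mostly a matter of whether you expect one match or several.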

### Create Pandas Dataframe

If you look at the HTML code of most tables you will see 'tr', 'th', and 'td' tags, so we are going to use this structure to get the data into a pandas DataFrame. 'tr' marks a row, 'th' a header cell, and 'td' a data cell.

In [8]:
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    cells = tr.find_all('td')
    # Keep the stripped text of each non-empty cell; a row without 'td' cells gives an empty list
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)
df = pd.DataFrame(res, columns=["Postcode", "Borough", "Neighbourhood"])
df = df.rename(columns={'Postcode': 'Postal Code', 'Neighbourhood': 'Neighborhood'})
df.head(10)

|  | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |
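The column names above are typed in by hand. As an alternative sketch, the header could be read from the 'th' cells instead. The example below does this on a made-up table (`sample_html`, `header`, `rows`, and `sample_df` are hypothetical names) and assumes the table has a single header row.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML used only for illustration.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

header, rows = [], []
for tr in sample_table.find_all("tr"):
    th_cells = [th.get_text(strip=True) for th in tr.find_all("th")]
    td_cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if th_cells:
        header = th_cells        # the header row supplies the column names
    elif td_cells:
        rows.append(td_cells)    # every other non-empty row is data

sample_df = pd.DataFrame(rows, columns=header)
```

Reading the header from the page keeps the column names in step with the source, at the cost of breaking if the page layout changes.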


In [9]:
df.shape

(289, 3)

##### Remove rows where the Borough is "Not assigned"

In [11]:
# .copy() avoids a SettingWithCopyWarning when df is modified later
df = df[df['Borough'] != 'Not assigned'].copy()
df.head(10)

|  | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |


In [12]:
df.shape

(212, 3)
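The filter used here is a boolean mask: the comparison produces one True/False value per row, and indexing with it keeps only the True rows. A minimal sketch on made-up data (`sample_df`, `mask`, and `filtered` are hypothetical names) shows the mask itself and why the surviving rows keep their original index labels.

```python
import pandas as pd

# Hypothetical rows used only for illustration.
sample_df = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# One boolean per row: True where the borough is assigned.
mask = sample_df["Borough"] != "Not assigned"

# Keep only the True rows; the original index labels (1 and 2) are preserved.
filtered = sample_df[mask].copy()
```

If consecutive labels are preferred, `reset_index(drop=True)` can be chained onto the result.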

##### Assign the Borough name to neighborhoods that are "Not assigned" but have a Borough assigned

In [13]:
# Use .loc instead of chained indexing so the assignment modifies df itself
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']

df.head(10)

|  | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Queen's Park |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |


In [14]:
df.shape

(212, 3)
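Conditional assignment of this kind is commonly written with `.loc`, which selects rows and a column in a single step. The sketch below shows the pattern on made-up data (`sample_df` and `missing` are hypothetical names); it is one way to do the replacement, not the only one.

```python
import pandas as pd

# Hypothetical rows used only for illustration.
sample_df = pd.DataFrame({
    "Postal Code": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Rows whose neighborhood is missing.
missing = sample_df["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only.
sample_df.loc[missing, "Neighborhood"] = sample_df.loc[missing, "Borough"]
```

The number of rows does not change, which matches the unchanged shape reported above.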

##### Group neighborhoods that have the same postal code

In [15]:
df = df.groupby(['Postal Code','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head(10)

|  | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [16]:
df.shape

(103, 3)
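To see what the grouping step does in isolation, here is a minimal sketch on three made-up rows (`sample_df` and `grouped` are hypothetical names). Rows sharing a postal code and borough collapse into one row whose neighborhoods are joined with a comma, and `groupby` sorts the result by the grouping keys.

```python
import pandas as pd

# Hypothetical rows used only for illustration.
sample_df = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

grouped = (
    sample_df.groupby(["Postal Code", "Borough"])["Neighborhood"]
    .apply(", ".join)   # join the neighborhood names within each group
    .reset_index()      # turn the group keys back into ordinary columns
)
```

The sorting by key is why the real result above starts at M1B rather than M3A.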