<img align=left src="Data/NUSlogo.png" width=125>
<br><br>

# RE2708 Lecture 5

## Working online

Dr. Cristian Badarinza

## Structure of this Lecture

1.	Some additional functions for data cleaning
2.	The **requests** library
3.	Three examples of web scraping
4.	Google Colab


## 1. Some additional functions for data cleaning

In [1]:
import pandas as pd

Let's start by downloading a data set from data.gov.sg into our ***Data*** folder.

*median-rent-by-town-and-flat-type.csv* 

(Source: https://data.gov.sg/dataset/median-rent-by-town-and-flat-type)

### Making sure the data is of a numeric type:

The first data set that we analyze contains the median rent by town and flat type:

In [2]:
df = pd.read_csv('Data/median-rent-by-town-and-flat-type.csv', parse_dates = ['quarter'])
df.head()

Unnamed: 0,quarter,town,flat_type,median_rent
0,2005-04-01,ANG MO KIO,1-RM,na
1,2005-04-01,ANG MO KIO,2-RM,na
2,2005-04-01,ANG MO KIO,3-RM,800
3,2005-04-01,ANG MO KIO,4-RM,950
4,2005-04-01,ANG MO KIO,5-RM,-


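Note that the column **median_rent** should be numeric, but pandas reads it as text, because missing values are recorded as `na` or `-`.

Therefore, we need to convert the column to a numeric type using the function `to_numeric`. The option `errors = 'coerce'` replaces every entry that cannot be parsed as a number with `NaN`: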

In [3]:
df['median_rent'] = pd.to_numeric(df['median_rent'], errors = 'coerce')
df.head()

Unnamed: 0,quarter,town,flat_type,median_rent
0,2005-04-01,ANG MO KIO,1-RM,
1,2005-04-01,ANG MO KIO,2-RM,
2,2005-04-01,ANG MO KIO,3-RM,800.0
3,2005-04-01,ANG MO KIO,4-RM,950.0
4,2005-04-01,ANG MO KIO,5-RM,
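
To see the effect of `errors = 'coerce'` in isolation, here is a minimal sketch on a small made-up series (the values are illustrative, not taken from the data set). It contrasts the default behaviour, which stops with an error, with the coerced version, which keeps going and marks unparseable entries as missing.

```python
import pandas as pd

# Made-up values that mimic the raw median_rent column
raw = pd.Series(['na', '800', '950', '-'])

# With the default errors='raise', the first unparseable entry stops the conversion
try:
    pd.to_numeric(raw)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# With errors='coerce', unparseable entries become NaN and the rest are converted
clean = pd.to_numeric(raw, errors='coerce')

n_missing = int(clean.isna().sum())  # number of entries that could not be parsed
mean_rent = clean.mean()             # NaN values are skipped by default

print(conversion_failed, n_missing, mean_rent)
```

Once the column is numeric, summary statistics such as `mean` work as expected and simply ignore the missing entries.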


### Renaming columns

In [4]:
df.columns = ['Quarter', 'Planning area', 'Flat type', 'Median Rent']
df.head()

Unnamed: 0,Quarter,Planning area,Flat type,Median Rent
0,2005-04-01,ANG MO KIO,1-RM,
1,2005-04-01,ANG MO KIO,2-RM,
2,2005-04-01,ANG MO KIO,3-RM,800.0
3,2005-04-01,ANG MO KIO,4-RM,950.0
4,2005-04-01,ANG MO KIO,5-RM,
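
Assigning a full list to `df.columns` requires naming every column in the right order. As an alternative sketch, `rename` takes a mapping from old to new names and leaves all other columns unchanged. The one-row data frame below is made up for illustration.

```python
import pandas as pd

# A made-up one-row data frame with the original column names
sample = pd.DataFrame({
    'quarter': ['2005-04-01'],
    'town': ['ANG MO KIO'],
    'flat_type': ['3-RM'],
    'median_rent': [800.0],
})

# Rename only the columns listed in the mapping; rename returns a new data frame
renamed = sample.rename(columns={'town': 'Planning area', 'median_rent': 'Median Rent'})

print(list(renamed.columns))
```

Because `rename` returns a new data frame, `sample` keeps its original column names unless the result is assigned back to it.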


## 2. The **requests** library

The **requests** library allows us to send HTTP requests from Python, so that we can read the content of publicly accessible web pages directly into our notebook.

In [5]:
import requests
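
As a minimal sketch of how a download with **requests** can be made a little more robust, the helper below sets a timeout and checks the response status before returning the page text. The function name `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the library.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a web page and return its HTML as a string (hypothetical helper)."""
    # The timeout stops the notebook from hanging if the server does not answer
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of returning an error page
    response.raise_for_status()
    return response.text

# Example call (needs an internet connection):
# html = fetch_html('https://data.gov.sg')
```

The returned text can then be passed on for parsing. Newer pandas versions expect it to be wrapped in `io.StringIO` before it is handed to `pd.read_html`, rather than passed as a raw HTML string.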

## 3. Three examples of web scraping

### Data on buyer's stamp duty

In [6]:
# Reading from the online source
df = pd.read_html(requests.get('https://www.iras.gov.sg/taxes/stamp-duty/for-property/buying-or-acquiring-property/buyer%27s-stamp-duty-(bsd)').text)

# Selecting the right data frame
df = df[0]

# Viewing the data frame
df.head()

Unnamed: 0_level_0,Before 20 Feb 2018,Before 20 Feb 2018
Unnamed: 0_level_1,Purchase Price or Market Value of the Property,BSD Rates
0,"First $180,000",1%
1,"Next $180,000",2%
2,Remaining Amount,3%
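
To illustrate how such a rate table is used, here is a small sketch that computes the duty for a given purchase price. It assumes that the rates are marginal, that is, each rate applies only to its own slice of the price, and it uses only the three rows shown above (rates before 20 Feb 2018). The names `BSD_BANDS` and `buyers_stamp_duty` are hypothetical.

```python
# Rate schedule copied from the table above (before 20 Feb 2018).
# Each entry is (width of the band in dollars, rate in percent);
# None marks the open-ended last band ("Remaining Amount").
BSD_BANDS = [(180_000, 1), (180_000, 2), (None, 3)]

def buyers_stamp_duty(price):
    """Apply each rate to its own slice of the price (assumed marginal schedule)."""
    duty = 0.0
    remaining = price
    for width, rate_pct in BSD_BANDS:
        # Portion of the price that falls into this band
        taxable = remaining if width is None else min(remaining, width)
        duty += taxable * rate_pct / 100
        remaining -= taxable
        if remaining <= 0:
            break
    return duty

duty = buyers_stamp_duty(500_000)
print(duty)
```

For a price of $500,000, the first $180,000 contributes $1,800, the next $180,000 contributes $3,600, and the remaining $140,000 contributes $4,200, for a total of $9,600.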


### Data on the world's tallest buildings

In [7]:
# Reading from the online source
df = pd.read_html(requests.get('https://www.skyscrapercenter.com/buildings').text)

# Selecting the right data frame
df = df[0]

# Renaming columns
df.columns = ['Rank', 'Name', 'City', 'Status', 'Completion', 'Height', 'Floors', 'Material', 'Function']

# Viewing the data
df.head()

Unnamed: 0,Rank,Name,City,Status,Completion,Height,Floors,Material,Function
0,1,Burj Khalifa,Dubai,,2010,"828 m / 2,717 ft",163,Steel Over Concrete,office / residential / hotel
1,2,Shanghai Tower,Shanghai,,2015,"632 m / 2,073 ft",128,Concrete-Steel Composite,hotel / office
2,3,Makkah Royal Clock Tower,Mecca,,2012,"601 m / 1,972 ft",120,Steel Over Concrete,serviced apartments / hotel / retail
3,4,Ping An Finance Center,Shenzhen,,2017,"599.1 m / 1,965 ft",115,Concrete-Steel Composite,office
4,5,Lotte World Tower,Seoul,,2017,"554.5 m / 1,819 ft",123,Concrete-Steel Composite,hotel / residential / office / retail
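
Note that the **Height** column is scraped as text (for example `828 m / 2,717 ft`), so the cleaning step from Section 1 applies here as well. The sketch below shows one possible way to extract the height in metres with a regular expression. The three rows are copied from the output above, and the new column name `Height (m)` is an assumption.

```python
import pandas as pd

# Three rows copied from the scraped table above
buildings = pd.DataFrame({
    'Name': ['Burj Khalifa', 'Shanghai Tower', 'Ping An Finance Center'],
    'Height': ['828 m / 2,717 ft', '632 m / 2,073 ft', '599.1 m / 1,965 ft'],
})

# Capture the number that comes directly before the unit "m"
metres = buildings['Height'].str.extract(r'([\d.,]+)\s*m', expand=False)

# Remove thousands separators (if any) and convert to a numeric type
buildings['Height (m)'] = pd.to_numeric(
    metres.str.replace(',', '', regex=False), errors='coerce'
)

print(buildings[['Name', 'Height (m)']])
```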


### Data on protected green areas

In [8]:
# Reading from the online source
df = pd.read_html(requests.get('https://en.wikipedia.org/wiki/List_of_parks_in_Singapore').text)

# Selecting the right data frame
df = df[2]

# Viewing the data frame
df.head()

Unnamed: 0,Name,Type,Area (m²)
0,Admiralty Park,Nature,270000.0
1,Ang Mo Kio Town Garden East,Community,49000.0
2,Ang Mo Kio Town Garden West,Community,206000.0
3,Bedok Town Park,Community,146000.0
4,Bishan-Ang Mo Kio Park,Community,620000.0


## 4. Google Colab

Similar to how we can collaborate in Google Docs, Google Slides or Google Sheets, there is an online collaboration tool for Jupyter Notebooks:

https://colab.research.google.com/