# Lesson 5: Web Scraping

<hr>

## Acquire Data
**Common data sources:**
* The internet (web scraping)
* Databases
* CSV and Excel files
* Parquet files

### Web Scraping
Extracting data from websites.

**Be ethical**
* Do not use scraped data for commercial purposes
* Use it for private purposes only

In [1]:
import pandas as pd

In [2]:
# get the web url
url = "https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics"

In [3]:
# read the data
data = pd.read_html(url)

In [4]:
type(data)

list

In [5]:
type(data[0])

pandas.core.frame.DataFrame

In [6]:
len(data)

1
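
`pd.read_html` returns one DataFrame per `<table>` element it finds, which is why the result is a list even when the page has a single table. The sketch below illustrates this on an inline HTML string rather than a live page. The second table (`Name`/`Role`) is made up for illustration, and the call assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Two small tables; the second one is hypothetical sample data.
html = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2022/2023</td><td>$180,174,103</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Role</th></tr>
  <tr><td>Ada</td><td>Editor</td></tr>
</table>
"""

# Every <table> in the document becomes one DataFrame in the list.
tables = pd.read_html(StringIO(html))
print(len(tables))

# `match` keeps only the tables whose text matches the given string or regex.
revenue_tables = pd.read_html(StringIO(html), match="Revenue")
print(revenue_tables[0])
```

When a page contains many tables, `match` can be easier to maintain than a hard-coded position such as `data[0]`, since the position changes if the page layout changes.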

In [7]:
data[0].head()

| | Year | Source | Revenue | Expenses | Asset rise | Net assets at end of year |
|---|---|---|---|---|---|---|
| 0 | 2022/2023 | PDF | $180,174,103 | $169,095,381 | $15,619,804 | $254,971,336 |
| 1 | 2021/2022 | PDF | $154,686,521 | $145,970,915 | $8,173,996 | $239,351,532 |
| 2 | 2020/2021 | PDF | $162,886,686 | $111,839,819 | $50,861,811 | $231,177,536 |
| 3 | 2019/2020 | PDF | $129,234,327 | $112,489,397 | $14,674,300 | $180,315,725 |
| 4 | 2018/2019 | PDF | $120,067,266 | $91,414,010 | $30,691,855 | $165,641,425 |


In [8]:
fundraising = data[0]

<hr>

### Data Wrangling
* **Data wrangling:** transforming and mapping data from a raw form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics.

In [9]:
fundraising.dtypes

Year                         object
Source                       object
Revenue                      object
Expenses                     object
Asset rise                   object
Net assets at end of year    object
dtype: object

In [10]:
# remove the dollar sign
fundraising['Exp'] = fundraising['Expenses'].str[1:]
fundraising.head()

| | Year | Source | Revenue | Expenses | Asset rise | Net assets at end of year | Exp |
|---|---|---|---|---|---|---|---|
| 0 | 2022/2023 | PDF | $180,174,103 | $169,095,381 | $15,619,804 | $254,971,336 | 169,095,381 |
| 1 | 2021/2022 | PDF | $154,686,521 | $145,970,915 | $8,173,996 | $239,351,532 | 145,970,915 |
| 2 | 2020/2021 | PDF | $162,886,686 | $111,839,819 | $50,861,811 | $231,177,536 | 111,839,819 |
| 3 | 2019/2020 | PDF | $129,234,327 | $112,489,397 | $14,674,300 | $180,315,725 | 112,489,397 |
| 4 | 2018/2019 | PDF | $120,067,266 | $91,414,010 | $30,691,855 | $165,641,425 | 91,414,010 |


In [11]:
# remove the thousands separators (commas)
fundraising['Exp'] = fundraising['Exp'].str.replace(',','')
# convert the strings (object dtype) to integers
fundraising['Exp'] = pd.to_numeric(fundraising['Exp'])

In [12]:
fundraising.head()

| | Year | Source | Revenue | Expenses | Asset rise | Net assets at end of year | Exp |
|---|---|---|---|---|---|---|---|
| 0 | 2022/2023 | PDF | $180,174,103 | $169,095,381 | $15,619,804 | $254,971,336 | 169095381 |
| 1 | 2021/2022 | PDF | $154,686,521 | $145,970,915 | $8,173,996 | $239,351,532 | 145970915 |
| 2 | 2020/2021 | PDF | $162,886,686 | $111,839,819 | $50,861,811 | $231,177,536 | 111839819 |
| 3 | 2019/2020 | PDF | $129,234,327 | $112,489,397 | $14,674,300 | $180,315,725 | 112489397 |
| 4 | 2018/2019 | PDF | $120,067,266 | $91,414,010 | $30,691,855 | $165,641,425 | 91414010 |


In [13]:
fundraising.dtypes

Year                         object
Source                       object
Revenue                      object
Expenses                     object
Asset rise                   object
Net assets at end of year    object
Exp                           int64
dtype: object

In [14]:
# revenue : remove dollar sign and replace comma
fundraising['Rev'] = fundraising['Revenue'].str[1:]
fundraising['Rev'] = fundraising['Rev'].str.replace(',','')
fundraising['Rev'] = pd.to_numeric(fundraising['Rev'])
fundraising.head()

| | Year | Source | Revenue | Expenses | Asset rise | Net assets at end of year | Exp | Rev |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022/2023 | PDF | $180,174,103 | $169,095,381 | $15,619,804 | $254,971,336 | 169095381 | 180174103 |
| 1 | 2021/2022 | PDF | $154,686,521 | $145,970,915 | $8,173,996 | $239,351,532 | 145970915 | 154686521 |
| 2 | 2020/2021 | PDF | $162,886,686 | $111,839,819 | $50,861,811 | $231,177,536 | 111839819 | 162886686 |
| 3 | 2019/2020 | PDF | $129,234,327 | $112,489,397 | $14,674,300 | $180,315,725 | 112489397 | 129234327 |
| 4 | 2018/2019 | PDF | $120,067,266 | $91,414,010 | $30,691,855 | $165,641,425 | 91414010 | 120067266 |
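
The same strip-and-convert steps were applied to both `Expenses` and `Revenue`, so they could be wrapped in a small helper. The sketch below is one possible way to do that. The function name `currency_to_number`, the `_num` column suffix, and the two-row sample frame are assumptions made for illustration, not part of the lesson's notebook.

```python
import pandas as pd


def currency_to_number(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from currency strings, then convert to numbers."""
    # The character class [$,] matches either a dollar sign or a comma.
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned)


# Two rows copied from the scraped table, used here as sample input.
sample = pd.DataFrame({
    "Revenue": ["$180,174,103", "$154,686,521"],
    "Expenses": ["$169,095,381", "$145,970,915"],
})

# Apply the same cleaning to every currency column.
for col in ["Revenue", "Expenses"]:
    sample[col + "_num"] = currency_to_number(sample[col])

print(sample.dtypes)
```

Removing characters with a regex, instead of slicing with `.str[1:]`, also copes with values that have no leading dollar sign, since slicing would cut off their first digit.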


In [15]:
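# deliberately corrupt one value to simulate dirty data
# (cast to object first: recent pandas versions warn or raise when a string is written into an int64 column)
fundraising['Rev'] = fundraising['Rev'].astype(object)
fundraising.loc[0, "Rev"] = 'spam'
fundraising.head()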

| | Year | Source | Revenue | Expenses | Asset rise | Net assets at end of year | Exp | Rev |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022/2023 | PDF | $180,174,103 | $169,095,381 | $15,619,804 | $254,971,336 | 169095381 | spam |
| 1 | 2021/2022 | PDF | $154,686,521 | $145,970,915 | $8,173,996 | $239,351,532 | 145970915 | 154686521 |
| 2 | 2020/2021 | PDF | $162,886,686 | $111,839,819 | $50,861,811 | $231,177,536 | 111839819 | 162886686 |
| 3 | 2019/2020 | PDF | $129,234,327 | $112,489,397 | $14,674,300 | $180,315,725 | 112489397 | 129234327 |
| 4 | 2018/2019 | PDF | $120,067,266 | $91,414,010 | $30,691,855 | $165,641,425 | 91414010 | 120067266 |


**N.B.:** The column now contains a string that cannot be parsed as a number, so calling `pd.to_numeric` on it with the default settings raises a `ValueError`.
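
A minimal sketch of that failure, using a short stand-in Series in place of the full `Rev` column (the variable names here are illustrative):

```python
import pandas as pd

# Stand-in for the corrupted column: one bad value among numeric strings.
rev = pd.Series(["spam", "154686521", "162886686"])

try:
    # The default is errors='raise', so the bad value stops the conversion.
    pd.to_numeric(rev)
    raised = False
except ValueError as exc:
    raised = True
    print(f"Conversion failed: {exc}")
```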

In [16]:
# handle the error:
# errors='coerce' replaces values that cannot be parsed with NaN
fundraising['Rev'] = pd.to_numeric(fundraising['Rev'], errors='coerce')

In [17]:
fundraising['Rev'].head()

0            NaN
1    154686521.0
2    162886686.0
3    129234327.0
4    120067266.0
Name: Rev, dtype: float64

In [18]:
fundraising['Rev'].iloc[0:5]

0            NaN
1    154686521.0
2    162886686.0
3    129234327.0
4    120067266.0
Name: Rev, dtype: float64
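
The column is now `float64` instead of `int64` because `NaN` is a floating-point value. Coercion also hides which inputs were bad, so it can help to compare the result with the raw values. The sketch below shows one way to do that on a small stand-in Series. The variable names are illustrative, and the nullable `Int64` dtype is an optional extra not used in the lesson.

```python
import pandas as pd

# Stand-in for the raw column before conversion.
raw = pd.Series(["spam", "154686521", "162886686"], name="Rev")
numeric = pd.to_numeric(raw, errors="coerce")

# Values that existed in the raw data but became NaN failed to parse.
failed = raw[numeric.isna() & raw.notna()]
print(failed)

# Optional: the nullable Int64 dtype keeps whole numbers and stores the gap as <NA>.
as_nullable_int = numeric.astype("Int64")
print(as_nullable_int)
```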