
# Web Scraping

![Data Science Workflow](img/ds-workflow.png)

## Acquire Data
### Common Data Sources
- **The Internet - Web Scraping**
- Databases
- CSV
- Excel
- Parquet

### Web Scraping
- Extracting data from websites
- Legal issues: [wikipedia.org](https://en.wikipedia.org/wiki/Web_scraping#Legal_issues)
- The legality of web scraping varies across the world.
- In general, web scraping may be against the terms of use of some websites, but the enforceability of these terms is unclear.

### Be ethical
- Not for commercial use
- Only private use

## Example
- Let's consider [https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics](https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics)
- **pandas** ```read_html()``` reads the HTML tables on a page into a list of DataFrame objects ([docs](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)).

In [1]:
import pandas as pd

In [27]:
url = "https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics"

In [28]:
# pd.read_html returns a list with one DataFrame per HTML table found on the page
data = pd.read_html(url)
type(data)

list

In [29]:
# Each table is a dataframe
type(data[0])

pandas.core.frame.DataFrame

In [30]:
data[0].head()

Unnamed: 0,Year,Source,Revenue,Expenses,Asset rise,Total assets
0,2021/22,PDF,"$ 154,686,521","$ 145,970,915","$ 8,173,996","$ 239,351,532"
1,2020/21,PDF,"$ 162,886,686","$ 111,839,819","$ 50,861,811","$ 231,177,536"
2,2019/20,PDF,"$ 129,234,327","$ 112,489,397","$ 14,674,300","$ 180,315,725"
3,2018/19,PDF,"$ 120,067,266","$ 91,414,010","$ 30,691,855","$ 165,641,425"
4,2017/18,PDF,"$ 104,505,783","$ 81,442,265","$ 21,619,373","$ 134,949,570"


In [31]:
fundraising = data[0]
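
Picking a table by position (`data[0]`) assumes the page layout never changes. As a hedged alternative, the sketch below uses the `match` parameter of `read_html` to keep only tables whose text matches a pattern. The HTML snippet is a made-up miniature, not the real Wikipedia page, and the example assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature page with two tables (not the real Wikipedia markup).
html = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2021/22</td><td>$ 154,686,521</td></tr>
  <tr><td>2020/21</td><td>$ 162,886,686</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
</table>
"""

# Without a filter, every table on the page is returned.
tables = pd.read_html(StringIO(html))
print(len(tables))

# With match, only tables containing the given text are returned.
revenue_tables = pd.read_html(StringIO(html), match="Revenue")
first = revenue_tables[0]
print(first)
```

Selecting by content is usually more robust than selecting by index, but it still depends on the page keeping the matched text.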

In [32]:
fundraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
dtype: object

In [33]:
fundraising.head()

Unnamed: 0,Year,Source,Revenue,Expenses,Asset rise,Total assets
0,2021/22,PDF,"$ 154,686,521","$ 145,970,915","$ 8,173,996","$ 239,351,532"
1,2020/21,PDF,"$ 162,886,686","$ 111,839,819","$ 50,861,811","$ 231,177,536"
2,2019/20,PDF,"$ 129,234,327","$ 112,489,397","$ 14,674,300","$ 180,315,725"
3,2018/19,PDF,"$ 120,067,266","$ 91,414,010","$ 30,691,855","$ 165,641,425"
4,2017/18,PDF,"$ 104,505,783","$ 81,442,265","$ 21,619,373","$ 134,949,570"



## Data Wrangling
- Data wrangling (also called data munging) transforms and maps data from a "raw" form into another format
- The goal is to make the data more appropriate and valuable for downstream purposes such as analytics

### Check the data types
- Remember ```.dtypes```: columns scraped as text must be converted to numbers before doing arithmetic on them
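
To illustrate the check, here is a minimal sketch on a hypothetical two-row frame named `sample`. The `Year` and `Expenses` values are copied from the table above; the `Count` column is made up for contrast. It lists the columns that are not numeric yet and therefore still need cleaning.

```python
import pandas as pd

# Hypothetical miniature of the scraped table.
sample = pd.DataFrame({
    "Year": ["2021/22", "2020/21"],
    "Expenses": ["$ 145,970,915", "$ 111,839,819"],
    "Count": [1, 2],  # made-up numeric column for contrast
})

# Inspect the type pandas assigned to each column.
print(sample.dtypes)

# Keep only the non-numeric columns: these are the candidates for conversion.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```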

In [34]:
# Use string slicing (.str[2:]) to drop the leading "$ ", then remove the thousands separators
fundraising["Exp"] = fundraising["Expenses"].str[2:].str.replace(",", "")
fundraising["Exp"] = pd.to_numeric(fundraising["Exp"])
fundraising.head(3)
# fundraising.dtypes

Unnamed: 0,Year,Source,Revenue,Expenses,Asset rise,Total assets,Exp
0,2021/22,PDF,"$ 154,686,521","$ 145,970,915","$ 8,173,996","$ 239,351,532",145970915
1,2020/21,PDF,"$ 162,886,686","$ 111,839,819","$ 50,861,811","$ 231,177,536",111839819
2,2019/20,PDF,"$ 129,234,327","$ 112,489,397","$ 14,674,300","$ 180,315,725",112489397


In [35]:
fundraising.head(3)

Unnamed: 0,Year,Source,Revenue,Expenses,Asset rise,Total assets,Exp
0,2021/22,PDF,"$ 154,686,521","$ 145,970,915","$ 8,173,996","$ 239,351,532",145970915
1,2020/21,PDF,"$ 162,886,686","$ 111,839,819","$ 50,861,811","$ 231,177,536",111839819
2,2019/20,PDF,"$ 129,234,327","$ 112,489,397","$ 14,674,300","$ 180,315,725",112489397


In [36]:
fundraising["Rev"] = fundraising["Revenue"].str[2:].str.replace(",", "")
fundraising["Rev"] = pd.to_numeric(fundraising["Rev"])
fundraising.head(3)

Unnamed: 0,Year,Source,Revenue,Expenses,Asset rise,Total assets,Exp,Rev
0,2021/22,PDF,"$ 154,686,521","$ 145,970,915","$ 8,173,996","$ 239,351,532",145970915,154686521
1,2020/21,PDF,"$ 162,886,686","$ 111,839,819","$ 50,861,811","$ 231,177,536",111839819,162886686
2,2019/20,PDF,"$ 129,234,327","$ 112,489,397","$ 14,674,300","$ 180,315,725",112489397,129234327
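
The `Exp` and `Rev` conversions repeat the same steps, so they could be folded into one helper. The sketch below is one possible way to do that, not the notebook's own method: `currency_to_number` is a hypothetical name, and it strips the `$` sign explicitly instead of assuming it sits in the first two characters.

```python
import pandas as pd


def currency_to_number(column: pd.Series) -> pd.Series:
    """Convert strings such as "$ 154,686,521" to numbers (hypothetical helper)."""
    cleaned = (
        column.astype(str)
        .str.replace("$", "", regex=False)  # drop the currency sign
        .str.replace(",", "", regex=False)  # drop thousands separators
        .str.strip()                        # drop leftover spaces
    )
    # Values that still cannot be parsed become NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


# Small stand-in frame with values copied from the table above.
sample = pd.DataFrame({
    "Revenue": ["$ 154,686,521", "$ 162,886,686"],
    "Expenses": ["$ 145,970,915", "$ 111,839,819"],
})

for name in ["Revenue", "Expenses"]:
    sample[name + "_num"] = currency_to_number(sample[name])

print(sample[["Revenue_num", "Expenses_num"]])
```

The same loop could be pointed at the other currency columns, assuming they follow the same format.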


In [43]:
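# Deliberately plant a non-numeric value to see how to_numeric copes with bad data
fundraising["Rev"] = fundraising["Rev"].astype(object)  # newer pandas warns or raises when a string is assigned into an int64 column
fundraising.loc[0, 'Rev'] = "spam"
fundraising.head(2)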

Unnamed: 0,Year,Source,Revenue,Expenses,Asset rise,Total assets,Exp,Rev
0,2021/22,PDF,"$ 154,686,521","$ 145,970,915","$ 8,173,996","$ 239,351,532",145970915,spam
1,2020/21,PDF,"$ 162,886,686","$ 111,839,819","$ 50,861,811","$ 231,177,536",111839819,162886686


In [44]:
# errors="coerce" turns values that cannot be parsed into NaN instead of raising an error
fundraising['Rev'] = pd.to_numeric(fundraising['Rev'], errors="coerce")

In [45]:
fundraising.head(4)

Unnamed: 0,Year,Source,Revenue,Expenses,Asset rise,Total assets,Exp,Rev
0,2021/22,PDF,"$ 154,686,521","$ 145,970,915","$ 8,173,996","$ 239,351,532",145970915,
1,2020/21,PDF,"$ 162,886,686","$ 111,839,819","$ 50,861,811","$ 231,177,536",111839819,162886686.0
2,2019/20,PDF,"$ 129,234,327","$ 112,489,397","$ 14,674,300","$ 180,315,725",112489397,129234327.0
3,2018/19,PDF,"$ 120,067,266","$ 91,414,010","$ 30,691,855","$ 165,641,425",91414010,120067266.0


In [46]:
fundraising.dtypes

Year             object
Source           object
Revenue          object
Expenses         object
Asset rise       object
Total assets     object
Exp               int64
Rev             float64
dtype: object
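
`Rev` became `float64` because `NaN` is a floating-point value. The sketch below, on a made-up series named `raw`, shows one way to find which entries failed to parse and, if whole numbers are preferred, to switch to the nullable `Int64` dtype. It is an illustration under those assumptions rather than a required step.

```python
import pandas as pd

# Made-up stand-in for the column with one planted bad value.
raw = pd.Series(["154686521", "spam", "129234327"], name="Rev")

# Unparseable entries become missing values.
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the input but are missing after conversion.
failed = raw[converted.isna() & raw.notna()]
print(failed)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
as_int = converted.astype("Int64")
print(as_int)
```

Inspecting `failed` before discarding anything makes it easier to tell a scraping glitch from a genuine gap in the data.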

In [47]:
# fundraising.loc[0, 'Rev'] = 154686521
# fundraising.head(2)