## What is Screen-Scraping?


Screen-scraping is the process of automatically extracting data from web pages.

A typical screen-scraping program:
- downloads a web page in HTML format
- finds the desired piece of information
- places that information in a convenient format
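
The three steps above can be sketched with only the Python standard library. This is a minimal illustration, not the approach used in the rest of this notebook: the download step is stubbed with a hard-coded string, and `SAMPLE_HTML`, `ItemCollector` and `prices` are hypothetical names made up for the example.

```python
from html.parser import HTMLParser

# Step 1 (download) is stubbed: SAMPLE_HTML stands in for a page fetched over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Fruit prices</h1>
  <ul>
    <li class="item">Apple: 3</li>
    <li class="item">Pear: 5</li>
  </ul>
</body></html>
"""


class ItemCollector(HTMLParser):
    """Collects the text of every <li class="item"> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "item") in attrs:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


# Step 2: find the desired pieces of information.
collector = ItemCollector()
collector.feed(SAMPLE_HTML)

# Step 3: place them in a convenient format (here, a dict of name -> number).
prices = {}
for item in collector.items:
    name, value = item.split(":")
    prices[name.strip()] = int(value)

print(prices)
```

In practice, a dedicated HTML library usually makes step 2 much shorter than a hand-written parser like this one.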

Screen-scraping can also be used to download other types of content, such as audio-visual content.

## Reading content from a web page in Python

Let's say we want to parse a table from this web page:
[https://en.wikipedia.org/wiki/List_of_countries_by_social_welfare_spending](https://en.wikipedia.org/wiki/List_of_countries_by_social_welfare_spending)


First, we import a couple of useful packages.

In [24]:
from bs4 import BeautifulSoup  # A package to work with HTML data
import requests  # A package to make HTTP requests
from io import StringIO  # Wraps the HTML text in a file-like object for pandas


Then, we request the page using the `requests` package and parse the HTML content using `BeautifulSoup`. We store the parsed page in a variable named _soup_.

In [4]:
LINK = "https://en.wikipedia.org/wiki/List_of_countries_by_social_welfare_spending"

headers = {
    "User-Agent": "DTU-Compute-Research-Bot/1.0"
}

r = requests.get(LINK, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")


To find our table, we open the browser's element inspector (Command + Shift + C in Chrome on macOS, Ctrl + Shift + C on Windows and Linux).
Scrolling through the elements, we find that our table is stored in a _&lt;table&gt;_ element of class _wikitable_.
Each row in the table is stored in a _&lt;tr&gt;_ (table row) element.
The cells of each row are a mix of _&lt;td&gt;_ (data cell) and _&lt;th&gt;_ (header cell) elements.



We use the ``find`` method to locate the first _&lt;table&gt;_ element of class _wikitable_ within _soup_.

Then, we use the ``find_all`` method to get all the rows (table row elements _&lt;tr&gt;_) in our table. We loop through the rows and use ``find_all`` again to get the headers (table header elements _&lt;th&gt;_) and the data (table data elements _&lt;td&gt;_).
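
To see the difference between ``find`` and ``find_all`` in isolation, here is a small sketch on a made-up HTML snippet rather than the Wikipedia page. `TOY_HTML` and the `toy_*` variables are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup

# A hypothetical snippet with two tables, not the Wikipedia page itself.
TOY_HTML = """
<table class="wikitable">
  <tr><th>Country</th><th>Score</th></tr>
  <tr><td>A</td><td>1</td></tr>
  <tr><td>B</td><td>2</td></tr>
</table>
<table class="wikitable">
  <tr><th>Other</th></tr>
</table>
"""

toy_soup = BeautifulSoup(TOY_HTML, "html.parser")

# find returns only the first match, so the second table is ignored.
toy_table = toy_soup.find("table", {"class": "wikitable"})

# find_all returns every match inside the element it is called on.
toy_rows = toy_table.find_all("tr")

# The first row holds header cells (<th>), the others hold data cells (<td>).
toy_header = [th.get_text(strip=True) for th in toy_rows[0].find_all("th")]
toy_data = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in toy_rows[1:]
]

print(toy_header)
print(toy_data)
```

Because ``find_all`` is called on `toy_table` rather than on `toy_soup`, it only sees rows that belong to the first table.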

In [12]:

table = soup.find("table",{"class":"wikitable"})
table_rows = table.find_all("tr")

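# Build the column headers from the two header rows of the table.
# clean() returns the visible text of a cell, with whitespace trimmed.
clean = lambda th: th.get_text(" ", strip=True)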

# Row 0
ths = table_rows[0].find_all("th")
row0 = [clean(th) for th in ths for _ in range(int(th.get("colspan", 1)))]

# Row 1
ths_2 = table_rows[1].find_all("th")
row1 = [clean(th) for th in ths_2 for _ in range(int(th.get("colspan", 1)))]

# Headers from row0 that span into row1 (e.g., Country)
carry = [clean(th) for th in ths if int(th.get("rowspan", 1)) == 2]

# Full second-row header (Country placeholder + the rest)
row1_full = carry + row1

headers = [(i+"-"+j)  if i!=j else i for (i,j) in zip(row0,row1_full)]


# Collect the data rows (everything after the two header rows)
rows = []
for tr in table_rows[2:]:
    tds = tr.find_all('td')
    row = [td.text.replace("\n","") for td in tds]
    rows.append(row)           
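
The header-merging logic in the cell above can be easier to follow on plain Python data. The sketch below replaces the _&lt;th&gt;_ elements with hypothetical dicts (`top_cells`, `bottom_cells`) and, like the cell above, assumes that the cells spanning both header rows come first in the top row.

```python
# Hypothetical header cells, written as plain dicts instead of <th> elements.
top_cells = [
    {"text": "Country", "colspan": 1, "rowspan": 2},
    {"text": "% of GDP", "colspan": 2, "rowspan": 1},
]
bottom_cells = [
    {"text": "Public", "colspan": 1, "rowspan": 1},
    {"text": "Net total", "colspan": 1, "rowspan": 1},
]


def expand(cells):
    """Repeat each cell's text once per column it spans."""
    return [c["text"] for c in cells for _ in range(c["colspan"])]


top = expand(top_cells)
bottom = expand(bottom_cells)

# Cells that span both header rows leave a gap in the second row; fill it.
carried = [c["text"] for c in top_cells if c["rowspan"] == 2]
bottom_full = carried + bottom

# Join the two levels with "-", unless both levels carry the same label.
flat = [t if t == b else f"{t}-{b}" for t, b in zip(top, bottom_full)]
print(flat)
```

If a table placed a two-row cell in the middle of the header, `carried + bottom` would misalign the labels, so this shortcut would need to be adapted.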
         

In [13]:
import pandas as pd
pd.DataFrame(rows, columns=headers)

Unnamed: 0,Country,% of GDP-Public,% of GDP-Net total,Per capita ($)-Constant,Per capita ($)-Current
0,Australia,17.1,22.9,8659.0,12246.0
1,Austria,31.6,29.4,16364.0,20349.0
2,Belgium,28.6,26.7,14138.0,18007.0
3,Bulgaria,20.5,—,4622.0,6054.0
4,Canada,19.3,—,8967.0,11974.0
5,Chile,12.9,20.9,3019.0,4217.0
6,Colombia,14.1,17.4,2091.0,3093.0
7,Costa Rica,12.6,13.0,2548.0,3291.0
8,Croatia,22.5,—,6690.0,8298.0
9,Czech Republic,22.1,20.8,8607.0,11108.0


Pandas can also extract tables from an HTML page automatically with ``pd.read_html``, which returns a list with one DataFrame per table it finds.
However, this will not help if you need to parse content other than tables.

In [25]:
tables = pd.read_html(StringIO(r.text), flavor="lxml")
tables[1]

Unnamed: 0_level_0,Country,% of GDP,% of GDP,Per capita ($),Per capita ($)
Unnamed: 0_level_1,Country,Public,Net total,Constant,Current
0,Australia,17.1,22.9,8659,12246
1,Austria,31.6,29.4,16364,20349
2,Belgium,28.6,26.7,14138,18007
3,Bulgaria,20.5,—,4622,6054
4,Canada,19.3,—,8967,11974
5,Chile,12.9,20.9,3019,4217
6,Colombia,14.1,17.4,2091,3093
7,Costa Rica,12.6,13.0,2548,3291
8,Croatia,22.5,—,6690,8298
9,Czech Republic,22.1,20.8,8607,11108
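
The two-level column labels in this output can be flattened into single strings similar to the hand-built headers used earlier. The following is a minimal sketch on a small, hand-made DataFrame (`df` and its values are a hypothetical stand-in for one of the tables returned by ``pd.read_html``).

```python
import pandas as pd

# Hypothetical stand-in for a table with a two-row header.
columns = pd.MultiIndex.from_tuples([
    ("Country", "Country"),
    ("% of GDP", "Public"),
    ("% of GDP", "Net total"),
])
df = pd.DataFrame(
    [["Australia", 17.1, 22.9], ["Austria", 31.6, 29.4]],
    columns=columns,
)

# Each column label is a (top, bottom) tuple; join the levels unless they repeat.
df.columns = [
    top if top == bottom else f"{top}-{bottom}"
    for top, bottom in df.columns
]

print(list(df.columns))
```

After this step the columns can be selected with a single string, for example `df["% of GDP-Public"]`.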
