# YAIS: Yet Another Introduction to Scraping

###### Let's talk about scraping today!
<br>
<br>

MLJC @ Toolbox

#### Today's Big Mission

* Understand the concepts underlying scraping so you can confidently start scraping by yourself
* Know about common pitfalls and scraping-related problems

To follow step by step: 

* create a virtual environment: `conda create -n scraping` or something else
* activate the virtual env: `conda activate scraping`
* install the deps: `pip install -r requirements.txt`
* run the notebook: `jupyter notebook` or something else

Github repo: @[MachineLearningJournalClub/YAIS-scraping](https://github.com/MachineLearningJournalClub/YAIS-scraping)

**Just stop me and ask whenever you want!**

#### What is scraping?

Scraping is just a fancy word for a **data collection** activity. To be more precise, scraping is a (not so new, have you ever heard of *screen scraping*?) way to collect data from web pages.

#### Why scraping?

"*Data is the new oil*" bla bla bla... To verify our hypotheses, get insights, and understand the world quantitatively, we just need data!

The web is full of data, but in an **unstructured** format, so scraping is just **searching + structuring information** in a way that is useful for our needs.

#### When should we scrape the web?

When an official API is not available!
APIs are "services" that websites offer to give access to their data in a programmatic way.

When a website releases an API, it is implicitly telling us which data can be accessed and how it should be accessed.


Let's make a quick example with a [free access API](https://datausa.io/about/api/). 

In [11]:
# make a request to the datausa.io API

import requests

url="https://datausa.io/api/data"
params={"drilldowns":"Nation",
        "measures":"Population",
        "year":"latest"}

resp = requests.get(url, params)

resp.json()['data'][0]['Population']


328239523
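
A minimal offline sketch of what the indexing `resp.json()['data'][0]['Population']` does. The body below is a hand-written, trimmed-down stand-in for the response: only the `data` list, the `Population` field and its value come from the cell above, while the `Nation` key and its value are assumptions made for illustration.

```python
import json

# Hand-written stand-in for the API response body (assumed shape, not real output).
raw_body = '{"data": [{"Nation": "United States", "Population": 328239523}]}'

payload = json.loads(raw_body)          # resp.json() returns plain dicts and lists like this
records = payload["data"]               # a list with one record per result row
population = records[0]["Population"]   # pick a field from the first record

print(type(payload).__name__, len(records), population)
```

Once the response is plain Python data, the "structuring" part of the job is already done for us, which is why an API is preferable when one exists.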

In [24]:
# trick for reveal js speaker notes
print("""import requests
import pprint

url = "https://datausa.io/api/data"
params = {"drilldowns":"Nation",
          "measures": "Population",
          "year": "latest"}

resp = requests.get(url,params)
resp.json()
""")

import requests
import pprint

url = "https://datausa.io/api/data"
params = {"drilldowns":"Nation",
          "measures": "Population",
          "year": "latest"}

resp = requests.get(url,params)
resp.json()



Do we really need scraping at all?
* Websites seldom provide APIs.
* An API often requires a **paid subscription** or some form of **restricted access** (e.g. [IMDB data](https://developer.imdb.com/)).
* APIs usually enforce some kind of **request limit** to preserve server resources.
* The API may not provide the data we want.

To get a solid understanding, we need a glimpse of the fundamental building blocks that scraping libraries build upon, that is, the *Fab Four* of the web tech stack.

<img src="https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview/http-layers.png" width=40% align=center/>


## HTTP

A standardized protocol based on the **client-server** model, which specifies the structure of the messages sent between clients and servers.


In the stone-age ...

1. The client (aka *user-agent*) **sends a request** to fetch the HTML page or a resource $\rightarrow$ *whom to contact*? 
2. The server (should) send back a **reply** with the requested resource $\rightarrow$ *what could go [wrong](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#5xx_server_errors)*?
3. The client **parses** the response and, if it's a [successful one](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#2xx_success), **renders** the content.
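
To make the three steps above concrete, here is a rough sketch of what the messages look like as text. Both helper functions, the `yais-demo/0.1` user-agent string and the canned response are hypothetical, written only for illustration; a real client such as Requests does all of this (and much more) for you.

```python
from urllib.parse import urlsplit

def build_get_request(url: str, user_agent: str = "yais-demo/0.1") -> str:
    """Step 1: compose the text of a minimal HTTP/1.1 GET request."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",        # method, resource, protocol version
        f"Host: {parts.netloc}",       # whom to contact
        f"User-Agent: {user_agent}",   # who is asking
        "Accept: text/html",
        "",                            # blank line closes the header section
        "",
    ]
    return "\r\n".join(lines)

def parse_response(raw: str):
    """Steps 2-3: split a raw response into status, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), reason, headers, body

request_text = build_get_request("https://developer.mozilla.org/en-US/")

# Canned reply standing in for what a server could send back.
canned = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<h1>Hello</h1>"
status, reason, headers, body = parse_response(canned)
is_success = 200 <= status < 300   # only 2xx responses are worth parsing further

print(request_text)
print(status, reason, headers, body, is_success)
```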

... Nowadays
<center> <img src="https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview/fetching_a_page.png" width=65% /> </center>

#### With a little help from my friends

Let's [explore](https://developer.mozilla.org/en-US/) the *Network* tab of the browser's developer tools

We'll use the [Requests](https://docs.python-requests.org/en/master/user/quickstart/) library to send and receive HTTP messages

In [16]:
# make an HTTP GET request
url="https://developer.mozilla.org/en-US/"
resp = requests.get(url)

resp.text

'<!DOCTYPE html><html lang="en-US" prefix="og: https://ogp.me/ns#"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1"><link rel="icon" href="/favicon-48x48.97046865.png"><link rel="apple-touch-icon" href="/apple-touch-icon.0ea0fa02.png"><meta name="theme-color" content="#ffffff"><link rel="manifest" href="/manifest.56b1cedc.json"><link rel="search" type="application/opensearchdescription+xml" href="/opensearch.xml" title="MDN Web Docs"><script>Array.prototype.flat&&Array.prototype.includes||document.write(\'<script src="https://polyfill.io/v3/polyfill.min.js?features=Array.prototype.flat%2Ces6"><\\/script>\')</script><title>MDN Web Docs</title><link rel="preload" as="font" type="font/woff2" crossorigin="" href="/static/media/ZillaSlab-Bold.subset.0beac26b.woff2"><meta name="description" content="The MDN Web Docs site provides information about Open Web technologies including HTML, CSS, and APIs for both Web sites and progressive web apps."><m

In [27]:
print("""
import requests

url = "https://developer.mozilla.org/"
url_params = {"test_param":23}
resp = requests.get(url, params=url_params)

resp.status_code""")


import requests

url = "https://developer.mozilla.org/"
url_params = {"test_param":23}
resp = requests.get(url, params=url_params)

resp.status_code
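
The `params` dictionary passed to `requests.get` in the cells above ends up in the query string of the URL. A small sketch of that encoding step, using only the standard library so it runs offline (the URL and parameters are the ones from the datausa.io example; nothing is actually requested):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://datausa.io/api/data"
params = {"drilldowns": "Nation", "measures": "Population", "year": "latest"}

# key=value pairs joined by "&" and appended after "?"
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)

# Going the other way: recover the parameters from a URL.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

This is also what you see in the browser's address bar or in the network tab when a page calls an API with parameters.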


## HTML

HTML's main purpose is to define the **structure** of a web page.

An HTML file contains a series of **nested elements** that wrap the content and make it appear or behave in a certain way.

<center><img src="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics/grumpy-cat-small.png" width=70% /></center>

HTML elements can also have *attributes* attached



<center><img src="https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics/grumpy-cat-attribute-small.png" width=75% /></center>
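
A small sketch of how a parser sees elements and attributes, using the standard library's `html.parser`. The one-line snippet is a made-up example in the spirit of the pictures above, and `ElementLogger` is a hypothetical helper name.

```python
from html.parser import HTMLParser

class ElementLogger(HTMLParser):
    """Record opening tags (with attributes), text and closing tags in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

parser = ElementLogger()
parser.feed('<p class="editor-note">My cat is <strong>very</strong> grumpy.</p>')

for event in parser.events:
    print(event)
```

The opening tag carries the attributes, the content sits between the opening and closing tags, and `<strong>` is nested inside `<p>`.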



#### HTML DOM



```HTML
<body>
    <h1> Title </h1>
    <p> Lorem ipsum dolor sit amet </p>
    <ul>
        <li> ... </li>
        <li> ... </li>
        <li> ... </li>
    </ul>
</body>
```


<img src="https://www.dottedsquirrel.com/content/images/2021/03/csssiblings.png" width=70% />
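
The nesting in the snippet above can be printed as a tree. This sketch uses `xml.etree.ElementTree` from the standard library, which only works here because the snippet happens to be well-formed XML; real-world HTML usually is not, and needs a forgiving parser. The `outline` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

html_doc = """
<body>
    <h1> Title </h1>
    <p> Lorem ipsum dolor sit amet </p>
    <ul>
        <li> ... </li>
        <li> ... </li>
        <li> ... </li>
    </ul>
</body>
""".strip()

root = ET.fromstring(html_doc)

def outline(node, depth=0):
    """Return one indented line per element, children below their parent."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(root)
print("\n".join(tree_lines))
```

`h1`, `p` and `ul` are children of `body` and siblings of each other; the three `li` elements are children of `ul`.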






We can parse and access HTML elements using the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library

In [25]:
# parse the HTML

from bs4 import BeautifulSoup

html_doc = resp.text
soup = BeautifulSoup(html_doc, "html.parser")

type(soup.head) 

bs4.element.Tag

In [36]:
print("""
from bs4 import BeautifulSoup

html_doc = resp.content
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify()[:200])

print(soup.head.prettify())
print(type(soup.head))""")


from bs4 import BeautifulSoup

html_doc = resp.content
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify()[:200])

print(soup.head.prettify())
print(type(soup.head))


#### With a little help from my friends

Let's explore the *Elements* tab of the developer tools

Let's try to extract blog post titles

In [33]:
# extract blog articles titles

for el in soup.find_all("h3"):
    print(el.a.text)

WebAssembly and Back Again: Fine-Grained Sandboxing in Firefox 95
Hacks Decoded: Seyi Akiwowo, Founder of Glitch
Hacks Decoded: Thomas Park, Founder of Codepip
Lots to see in Firefox 93!
Implementing form filling and accessibility in the Firefox PDF viewer


In [29]:
print("""
for el in soup.find_all("h3"):
    print(el.a.text)
""")


for el in soup.find_all("h3"):
    print(el.a.text)



# CSS

The main purpose of CSS is to define the **style** of HTML elements (e.g. sizes, colors, animations, etc.).

Before we can specify the style of an element we need to **select** it, right? This is what **CSS selectors** are for!

CSS selectors are a *mini-language* that lets you select one or multiple elements in a very clever way.
```
.classes
#ids
[attributes=value]
parent child
parent > child
sibling ~ sibling
sibling + sibling
:not(element.class, element2.class)
:is(element.class, element2.class)
parent:has(> child)
```
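
A few of the selectors listed above, tried on a tiny hand-written page with BeautifulSoup's `select` and `select_one`. The HTML, the post titles and the author names are invented for illustration; only the `div.blog-feed > ul > li` layout mirrors the structure used in this notebook.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="blog-feed">
  <ul>
    <li id="first">
      <h3><a href="/post-1">First post</a></h3>
      <p class="post-meta">Posted May 1, 2021 by Ann</p>
      <p>Summary of the first post.</p>
    </li>
    <li>
      <h3><a href="/post-2">Second post</a></h3>
      <p class="post-meta">Posted May 2, 2021 by Bob</p>
      <p>Summary of the second post.</p>
    </li>
  </ul>
</div>
<p>Footer paragraph outside the feed.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# parent > child: only direct children are matched
posts = soup.select("div.blog-feed > ul > li")

# parent child (descendant): the right-most part, <a>, is what gets returned
titles = [a.text for a in soup.select("div.blog-feed h3 a")]

# #id plus an attribute selector
first_link = soup.select_one("#first a[href]")["href"]

# :not(...) filters out the meta paragraphs; "li >" keeps the footer out
summaries = [p.text for p in soup.select("li > p:not(.post-meta)")]

# sibling + sibling: the <p> that immediately follows an <h3>
after_title = [p["class"] for p in soup.select("h3 + p")]

# selecting from an element searches only inside that element
second_meta = posts[1].select_one("p.post-meta").text

print(len(posts), titles, first_link, summaries, after_title, second_meta)
```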

In [57]:
# build a mini dataset from Mozilla's blog articles

import pandas as pd

# each <li> in the blog feed is one post
post_elements = soup.select("div.blog-feed > ul > li")

titles = []
authors = []
dates = []
articles = []

for article_el in post_elements:
    title = article_el.select_one("h3 a").text
    titles.append(title)

    # the meta line looks like "Posted <date> by <author>"
    post_meta = article_el.select_one("p.post-meta")
    date, author = post_meta.text.replace("Posted ", "").strip().split(" by ")

    dates.append(date)
    authors.append(author)

    # the first paragraph that is not the meta line is the description
    articles.append(article_el.select_one("p:not(.post-meta)").text)

pd.DataFrame({"title": titles, "authors": authors}).to_csv("mini_dataset.csv")

In [37]:
print("""
post_elements = soup.select("div.blog-feed > ul > li")

titles = []
urls = []
descriptions = []
dates = []
authors = []

for post_el in post_elements:
    titles.append(post_el.h3.text) 
    
    urls.append(post_el.h3.a["href"])
    
    post_meta = post_el.select_one("p.post-meta").text
    date, author = post_meta.replace("Posted","").strip().split(" by ") 
    dates.append(date)
    authors.append(author)
    
    descriptions.append(post_el.select_one("p:not(.post-meta)").text) 


import pandas as pd

mini_dataset = pd.DataFrame({"title": titles, "url":urls,
                             "description": descriptions, "dates": dates,
                             "author": authors})

mini_dataset.to_csv("mini_dataset_mozilla.csv")""")


post_elements = soup.select("div.blog-feed > ul > li")

titles = []
urls = []
descriptions = []
dates = []
authors = []

for post_el in post_elements:
    titles.append(post_el.h3.text) 
    
    urls.append(post_el.h3.a["href"])
    
    post_meta = post_el.select_one("p.post-meta").text
    date, author = post_meta.replace("Posted","").strip().split(" by ") 
    dates.append(date)
    authors.append(author)
    
    descriptions.append(post_el.select_one("p:not(.post-meta)").text) 


import pandas as pd

mini_dataset = pd.DataFrame({"title": titles, "url":urls,
                             "description": descriptions, "dates": dates,
                             "author": authors})

mini_dataset.to_csv("mini_dataset_mozilla.csv")
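
The `split(" by ")` used in the loop above raises a `ValueError` as soon as one post has no author in its meta line. A slightly more defensive sketch of the same parsing step, assuming the meta text has the form "Posted <date> by <author>"; the function name and the sample strings are made up.

```python
def parse_post_meta(text):
    """Split 'Posted <date> by <author>' into (date, author); author may be None."""
    cleaned = text.strip()
    if cleaned.startswith("Posted"):
        cleaned = cleaned[len("Posted"):].strip()
    # partition never raises: if " by " is missing, sep and author are empty
    date, sep, author = cleaned.partition(" by ")
    if not sep:
        return cleaned, None
    return date.strip(), author.strip()

print(parse_post_meta("Posted December 6, 2021 by Jane Doe"))
print(parse_post_meta("  Posted December 6, 2021  "))
```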


Some things to notice:
* the selected element is always the one matched by the **right-most** part of the selector.
* the selection is always **relative** to the (root) node we are working on.

## All you need is XPath

A query language for selecting HTML (XML) elements.

<img src="https://www.softwaretestinghelp.com/wp-content/qa/uploads/2019/05/XPATH-Syntax-screenshot-1-1.png" width=60% />

Some advantages:
* Bidirectional flow $\rightarrow$ moving left-right, up-down
* Partial matching with *contains()*
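
A short sketch of the "up-down" navigation, using the limited XPath subset built into `xml.etree.ElementTree`. The document, ids and names are invented for illustration. Note that this subset does not include `contains()`; for that you would need a full XPath engine (for example the one in lxml).

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<div class="blog-feed">
  <ul>
    <li><h3><a id="p1">First post</a></h3><p class="post-meta">Posted May 1 by Ann</p></li>
    <li><h3><a id="p2">Second post</a></h3><p class="post-meta">Posted May 2 by Bob</p></li>
  </ul>
</div>
""".strip())

# Down: every <a> that is a child of an <h3>, anywhere below the root
titles = [a.text for a in doc.findall(".//h3/a")]

# Attribute predicate: paragraphs with class="post-meta"
metas = [p.text for p in doc.findall(".//p[@class='post-meta']")]

# Up: start from one link and climb two levels (a -> h3 -> li) with ".."
enclosing = doc.findall(".//a[@id='p2']/../..")
meta_of_second = enclosing[0].find("p").text

print(titles, metas, enclosing[0].tag, meta_of_second)
```

Climbing back up from a matched element is something CSS selectors alone could not do for a long time, which is one reason XPath is still popular for scraping.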




# JS

JS's main purpose is to add **dynamic functionality and behaviour** to a web page.

JS code **can change** the underlying HTML structure or element styles, but it can also "silently" send HTTP requests to other servers (e.g. to retrieve data from a web API).
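
A toy illustration of why this matters for scraping. Both strings below are invented: the first mimics what "view source" would show for a JS-driven page, the second mimics a JSON payload that the page's script might fetch in the background.

```python
import json

# What the server sends first: an empty container that JS fills in later.
static_html = '<div id="results"></div><script src="/app.js"></script>'

# What the script might fetch afterwards (hypothetical JSON body).
api_body = '{"results": [{"candidate": "A", "votes": 120}, {"candidate": "B", "votes": 95}]}'

# The data we want is simply not in the static HTML...
in_source = "votes" in static_html

# ...but it is easy to read from the background response.
rows = json.loads(api_body)["results"]
votes_by_candidate = {row["candidate"]: row["votes"] for row in rows}

print(in_source, votes_by_candidate)
```

When the network tab reveals such a background request, calling that endpoint directly is often simpler than parsing the rendered page.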



#### With a little help from my friends

Let's explore *view source* of [this website](http://elezioni2015.consiglioveneto.it/elezioni2015/regionali/)

#### Headless browsers to the rescue

They provide automated control of a web page without actually rendering and displaying it:
* running JS
* automatic interaction (e.g. clicking a button, filling in a form, etc.)

Some solutions:
* [requests-html](https://github.com/psf/requests-html)
* [Selenium](https://selenium.dev)
* [Scrapy-selenium](https://github.com/clemfromspace/scrapy-selenium)

## Scraping or Crawling?

* extracting information from a few web pages $\rightarrow$ scraping 🔎
* discovering pages by following links according to rules $\rightarrow$ "Crawl it up baby now" 💃🕺
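
The link-following idea in a nutshell, as a standard-library sketch. It is not how Scrapy is implemented: the "website" is a hypothetical in-memory dictionary so nothing is downloaded, and `LinkExtractor`, `crawl` and `FAKE_SITE` are names made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Hypothetical three-page site: URL -> HTML
FAKE_SITE = {
    "https://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.org/a": '<a href="/b">B</a> <a href="https://other.example/x">ext</a>',
    "https://example.org/b": '<a href="/">home</a>',
}

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl that stays on the start site and visits each URL once."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:              # page not found: skip it
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)   # resolve relative links
            # rule: follow only links on the same site, never twice
            if absolute.startswith(start_url) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited

visited = crawl("https://example.org/", FAKE_SITE.get)
print(visited)
```

Swapping `FAKE_SITE.get` for a function that performs a real HTTP request would turn this into a (very naive) crawler; a framework adds scheduling, retries, throttling and data pipelines on top.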

#### Scrapy

Scrapy is a *framework*, so it expects you to:

* fit your mental model to the framework's conceptual model and execution flow
* follow conventions and rules (e.g. directory structure)
* extend and use the framework's classes and tools

#### How does Scrapy work?

<img src="https://docs.scrapy.org/en/latest/_images/scrapy_architecture_02.png" width=60% />

Let's crawl [Pagella Politica](https://pagellapolitica.it/) to build a fact checking dataset

Scrapy CLI

Scrapy provides a CLI tool to manage projects and spiders

To initialize a project:

`scrapy startproject fact_checking`

To create a spider:

`scrapy genspider pagellapolitica pagellapolitica.it`

To run the spider:

`scrapy crawl pagellapolitica`

Let's switch to the same ol' VS Code

# Legal and Ethical Concerns

Is web scraping a legal activity? 

... "ish"

We are collecting publicly available data after all... but be a nice human:
* introduce yourself: leave information about the crawler's owner and how to contact you
* respect *robots.txt*
* limit your request rate
* check the website's *terms and conditions*
* respect the website owner's will. [Read more](https://jaxenter.com/data-scraping-cases-165385.html) 
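
A minimal sketch of the first three points, using `urllib.robotparser` from the standard library. The robots.txt content, the bot name and the contact address are all made up, and the rules are parsed from a string so nothing is downloaded; against a real site you would point the parser at the site's own `/robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Introduce yourself: a descriptive User-Agent with a way to contact you (hypothetical values).
USER_AGENT = "yais-bot"
HEADERS = {"User-Agent": f"{USER_AGENT}/0.1 (contact: you@example.org)"}

# Hypothetical robots.txt content.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

allowed_blog = rules.can_fetch(USER_AGENT, "https://example.org/blog/")
allowed_private = rules.can_fetch(USER_AGENT, "https://example.org/private/page")

# Limit the request rate: use the site's Crawl-delay if present, else a default pause.
delay_seconds = rules.crawl_delay(USER_AGENT) or 1

print(allowed_blog, allowed_private, delay_seconds)
```

In a real scraper you would pass `HEADERS` with every request, skip URLs for which `can_fetch` is false, and call `time.sleep(delay_seconds)` between requests.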

Remember: scraping always implies consuming someone else's computing resources!

# Some tips:

* Check for an API first, then scrape by yourself
* Select your elements with the "*least likely to change*" heuristic
* Nowadays the web is dynamic! Always check the page source code.
* Web pages evolve, and your scraper should change accordingly. Try to anticipate these changes and build a robust scraper!
* Scraping some pages may not be easy: frontend developers will be your best friends 👩‍💻👨‍💻

#### Did you find the easter-eggs?



<img src="https://londonita.com/wp-content/uploads/2020/02/Abbey-Road-beatles.jpg" width=45% />

References

* for HTTP, HTML, CSS, JS: [MDN web docs](https://developer.mozilla.org/en-US/docs/Web)
* for CSS selectors: [dotted squirrel visual guide](https://www.dottedsquirrel.com/the-ultimate-visual-guide-to-css-selectors/)
* Scrapy: [official documentation](https://docs.scrapy.org/en/latest/)

Further reading: Seppe vanden Broucke, Bart Baesens; [Practical Web Scraping for Data Science](https://link.springer.com/book/10.1007/978-1-4842-3582-9)
