```{css, echo = FALSE}
.justify {
  text-align: justify !important
}
```


# A few words about me
1. PhD student (Demography) - IDESO (UNIGE)
2. Member of [WeData](https://wedata.ch/)
3. Fan of web scraping, text mining, network analysis and data visualization
4. My programming languages: 
    1. R, Nim and Julia 
    2. Python and JavaScript

# Plan
1. What is web scraping?
2. Web scraping in R
3. What is {hayalbaz}?
4. Advantages of {hayalbaz}

# What is web scraping?
## Definition

::: {.justify}

**Web scraping** is the process of extracting information from websites and converting it into a structured format, such as a spreadsheet or a database. This is usually done programmatically using a software tool or script, which simulates human browsing behavior to navigate through web pages, identify relevant data, and extract the desired information.

:::

**Main goal: collect data** 

## Web crawling?
Web crawling is the process of systematically browsing, discovering, and indexing the content of websites. This is typically done by automated programs called web crawlers, spiders, or bots. Web crawling is a crucial component of search engines, as it allows them to build a comprehensive index of web pages that can be used to serve relevant search results to users.

**Main goal: index and/or discover content by following URLs**

## What is it used for?
- Data mining/analysis
- Content aggregation
- Sentiment analysis
- Price comparison

## Where is it used?
- Business (analytics)
	- Competitor
	- Profiling
	- Finance
- Research
	- Analysis purpose
- Private
	- Automation

## Example
(==Reuse my scraper of le temps.ch==)

## Is it hard?
It depends...

**We can use a programming language or dedicated tools to make it easier**

**It is not always required, but it always helps to know a bit of HTML, CSS and JavaScript**

**With modern tools, we can scrape almost any website!**

## Limits
- **Legal and ethical considerations**: Web scraping may infringe on copyright, trademark, or other intellectual property rights. Additionally, it might violate a website's terms of service, privacy policies, or other legal agreements. It's essential to understand and comply with the legal and ethical aspects of web scraping to avoid potential legal issues.
	- Website terms of use
	- Privacy policies
	- Intellectual property, copyright or trademark

- **Data quality and accuracy**: Web scraping can sometimes result in inaccurate or incomplete data, especially when dealing with poorly structured or inconsistent websites. Data cleaning and validation are often necessary to ensure the quality of the extracted data.

- **Website structure changes**: Websites frequently change their structure, layout, or design. This can cause web scrapers to break or become unreliable, requiring constant maintenance and updates to keep up with website changes.

- **Dynamic and JavaScript-heavy websites**: Web scraping can be challenging when dealing with websites that rely heavily on JavaScript for content loading or have dynamic elements. Traditional web scraping libraries might not be able to extract data from such websites, and alternative approaches, such as using browser automation tools like Selenium or Playwright, may be needed.
	- JavaScript content loading
	- Dynamic/interactive elements

- **Rate limiting and server load**: Sending too many requests in a short period can overwhelm a website's server or trigger rate-limiting mechanisms. It's essential to respect a website's server resources and limit the rate of requests to avoid potential issues or bans.
	- Slow down / Block
	- DDoS attack

- **Anti-scraping techniques**: Many websites employ anti-scraping techniques to prevent or limit web scraping, such as CAPTCHAs, IP blocking, user-agent restrictions, or requiring login credentials. Overcoming these obstacles can be challenging and may require more advanced scraping techniques or proxy services.
	- CAPTCHA
	- IP blocking
	- Login
	- User-agent restriction

## Good practice
1. **Limit the rate of your requests** to avoid putting strain on the website's server.
2. **Identify your scraper in the User-Agent header** when sending requests.
3. **Respect the guidelines** provided in the robots.txt file.
4. **Cache data whenever possible** to minimize the number of requests you need to send.
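
Points 1, 2 and 4 can be sketched in a few lines of R. This is only a minimal illustration: the helper names (`default_fetch`, `make_polite_fetcher`), the two-second delay and the User-Agent string are assumptions made for this example, and the default fetcher assumes {httr2} is installed.

```r
# Hypothetical default fetcher: sends one GET request with an identifying
# User-Agent (assumes the {httr2} package is installed).
default_fetch <- function(url) {
  httr2::request(url) |>
    httr2::req_user_agent("my-scraper (contact: me@example.org)") |>
    httr2::req_perform() |>
    httr2::resp_body_string()
}

# Returns a function that waits at least `delay` seconds between two real
# requests and keeps already downloaded pages in an in-memory cache.
make_polite_fetcher <- function(fetch = default_fetch, delay = 2) {
  cache <- new.env()
  last_request <- NULL
  function(url) {
    # Cached pages are returned without contacting the server again
    if (exists(url, envir = cache, inherits = FALSE)) {
      return(get(url, envir = cache))
    }
    # Rate limiting: sleep until `delay` seconds have passed
    if (!is.null(last_request)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_request, units = "secs"))
      if (elapsed < delay) Sys.sleep(delay - elapsed)
    }
    last_request <<- Sys.time()
    body <- fetch(url)
    assign(url, body, envir = cache)
    body
  }
}
```

A call such as `fetch_page <- make_polite_fetcher()` followed by `fetch_page(url)` would then throttle and cache every download made through `fetch_page`.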

# Ethical considerations
## Is it legal?
**It depends...**
- A grey area? Not really...

**Scraping publicly available data on the internet is generally legal.**

**But be careful with:**
- Personal data
- Data under intellectual property

**Risks:**
- From a simple warning to legal action

## How to check?
robots.txt
- ex: https://www.letemps.ch/robots.txt

Website terms of use
- ex: https://twitter.com/en/tos
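
To show what reading a robots.txt file involves, here is a deliberately simplified sketch in base R. The function names `disallowed_paths` and `is_allowed` are made up for this example, and the parser only handles plain `Disallow` prefixes for a single user agent: it ignores `Allow` rules, wildcards and groups that list several user agents. A dedicated package such as {robotstxt} is the safer choice for real projects.

```r
# Collect the Disallow prefixes that apply to one user agent ("*" by default).
# Simplified on purpose: no Allow rules, no wildcards, no multi-agent groups.
disallowed_paths <- function(robots_lines, agent = "*") {
  lines <- trimws(sub("#.*$", "", robots_lines))   # drop comments
  lines <- lines[nzchar(lines)]                    # drop empty lines
  field <- tolower(trimws(sub(":.*$", "", lines))) # text before the first ":"
  value <- trimws(sub("^[^:]*:", "", lines))       # text after the first ":"
  in_group <- FALSE
  out <- character()
  for (i in seq_along(lines)) {
    if (field[i] == "user-agent") {
      in_group <- identical(value[i], agent)
    } else if (in_group && field[i] == "disallow" && nzchar(value[i])) {
      out <- c(out, value[i])
    }
  }
  out
}

# A path is considered allowed if it starts with none of the disallowed prefixes
is_allowed <- function(path, robots_lines) {
  !any(startsWith(path, disallowed_paths(robots_lines)))
}

# Example with an inline robots.txt (invented content)
robots <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /tmp/   # scratch area",
  "",
  "User-agent: badbot",
  "Disallow: /"
)
is_allowed("/articles/1", robots)
is_allowed("/private/report", robots)
```

For a live site, the lines could be obtained with `readLines("https://www.letemps.ch/robots.txt")` before being passed to `is_allowed()`.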

# Tools for web scraping
## Code, no code?
**Nowadays, there are browser extensions that perform web scraping tasks**
e.g. [Web Scraper](https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn)

**Using a programming language gives you more flexibility**

## Best programming language?
Typical ranking found online:
1. Python
2. JavaScript
3. Ruby
4. Java
5. PHP

**No place for R... but that doesn't mean there is nothing!**

## Web scraping in R
Packages for web scraping:
(exclude web server, json tools, etc.)
- [htmldf](https://github.com/alastairrushworth/htmldf): Collect page metadata
- [scrappy](https://github.com/villegar/scrappy/): Scraping helper for specific web sites
- [ralger](https://github.com/feddelegrand7/ralger): Easier rvest
- [parsel](https://github.com/till-tietz/parsel): Parallel processing for RSelenium
- [rvest](https://rvest.tidyverse.org/): Reference in R (wraps the httr and xml2 packages)
- [scraEP](https://github.com/cran/scraEP): Little tools for web scraping
- [Rcrawler](https://github.com/salimk/Rcrawler/): A crawler for R
- [curl](https://github.com/jeroen/curl): Modern web client (libcurl bindings)
- [chromote](https://github.com/rstudio/chromote): Headless Chrome driver
	- [crrri](https://github.com/RLesur/crrri)
	- [decapitated](https://github.com/hrbrmstr/decapitated/)
	- [chradle](https://github.com/milesmcbain/chradle)

## Most popular packages
For APIs (e.g. Google Maps API)
- [httr](https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html) / [httr2](https://cran.r-project.org/web/packages/httr2/vignettes/httr2.html)
For static websites (e.g. Wikipedia)
- [rvest](https://rvest.tidyverse.org/)
For dynamic websites (e.g. Twitter)
- [RSelenium](https://docs.ropensci.org/RSelenium/)

## Static vs Dynamic method
**The static method (rvest) simply sends a request and collects the raw HTML page**
**Pros**
- Fast
- Light
- Most websites are static
**Cons**
- Can't run JavaScript
- Easily blocked by anti-bot methods

**The dynamic method (RSelenium) automates a web browser to render the website (including JavaScript)**
**Pros**
- Can render JavaScript
- Can avoid anti-bot methods
- Can deal with interactive websites (trigger events)
**Cons**
- Slow
- Heavy

## Example with rvest
### Installation
rvest is available on CRAN and can be installed with `install.packages("rvest")`.

### Collection
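A minimal sketch of a static collection with rvest. To keep it self-contained, the page is an invented HTML snippet built with `minimal_html()`; for a real site, the same steps would start from `read_html("<url>")`. The class name `title` and the column names are assumptions made for this example.

```r
library(rvest)

# Invented page standing in for a real website
page <- minimal_html('
  <h1>Latest articles</h1>
  <ul>
    <li><a class="title" href="/a/1">First article</a></li>
    <li><a class="title" href="/a/2">Second article</a></li>
  </ul>
')

# Select the nodes with a CSS selector, then extract text and attributes
links <- html_elements(page, "a.title")
articles <- data.frame(
  title = html_text2(links),
  url   = html_attr(links, "href")
)
articles
```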

## Example with RSelenium
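### Installation
RSelenium is available on CRAN (`install.packages("RSelenium")`), but running it also requires Java and a web browser.
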
### Collection
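A minimal sketch of a dynamic collection with RSelenium, under several assumptions: Java and Firefox are installed, port 4567 is free, and the function name `scrape_rendered` is made up for this example. The browser renders the page (JavaScript included), and the resulting HTML is then handed over to rvest.

```r
# Hypothetical helper: render `url` in a real browser, then extract the text
# of the elements matching `css`. Assumes Java, Firefox, {RSelenium} and
# {rvest} are available on the machine.
scrape_rendered <- function(url, css = "h1", port = 4567L) {
  # Start a Selenium server and a Firefox client; chromever = NULL skips
  # the ChromeDriver download, which is not needed for Firefox
  driver <- RSelenium::rsDriver(
    browser = "firefox", port = port, chromever = NULL, verbose = FALSE
  )
  # Always close the browser and stop the server, even on error
  on.exit({
    driver$client$close()
    driver$server$stop()
  }, add = TRUE)

  driver$client$navigate(url)
  html <- driver$client$getPageSource()[[1]]  # HTML after rendering

  page <- rvest::read_html(html)
  rvest::html_text2(rvest::html_elements(page, css))
}
```

It would be called as `scrape_rendered("https://www.r-project.org/")`, for instance.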

## Limits of RSelenium
- Dependencies (Need Java)
	- Not easily shareable
	- Can be discouraging
- Not beginner friendly
	- Verbose

# What is {hayalbaz}
## {hayalbaz}
Author: [Colin Rundel](https://github.com/rundel) 
"Puppeteer in a different language - this R package provides a puppeteer inspired interface to the Chrome Devtools Protocol using chromote."

- What is Puppeteer?
- What is chromote?


## Puppeteer
[Puppeteer](https://pptr.dev/)
"Puppeteer is a Node.js library which provides a high-level API to control Chrome/Chromium over the [DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/). Puppeteer runs in [headless](https://developer.chrome.com/articles/new-headless/) mode by default, but can be configured to run in full (non-headless) Chrome/Chromium."

## {chromote}
[chromote](https://github.com/rstudio/chromote)
Chromote is an R implementation of the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/). It works with Chrome, Chromium, Opera, Vivaldi, and other browsers based on [Chromium](https://www.chromium.org/). By default it uses Google Chrome (which must already be installed on the system). To use a different browser, see [Specifying which browser to use](https://github.com/rstudio/chromote#specifying-which-browser-to-use).

Chromote is not the only R package that implements the Chrome DevTools Protocol. Here are some others:
- [crrri](https://github.com/RLesur/crrri) by Romain Lesur and Christophe Dervieux
- [decapitated](https://github.com/hrbrmstr/decapitated/) by Bob Rudis
- [chradle](https://github.com/milesmcbain/chradle) by Miles McBain

The interface to Chromote is similar to [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface) for node.js.

## Web scraping with chromote?
Yes, but really verbose
Example:
==give a code example==
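
As an indication of how verbose this can get, here is a minimal sketch that uses chromote directly to read one element from a page. The function name `scrape_with_chromote` and the default selector are assumptions made for this example; Chrome (or another Chromium-based browser) must be installed, and the final parsing step uses rvest.

```r
# Hypothetical helper: open a headless Chrome tab, load `url`, and return
# the text of the first element matching `css`.
scrape_with_chromote <- function(url, css = "h1") {
  b <- chromote::ChromoteSession$new()
  on.exit(b$close(), add = TRUE)

  # Register the "page loaded" event before navigating, to avoid missing it
  loaded <- b$Page$loadEventFired(wait_ = FALSE)
  b$Page$navigate(url, wait_ = FALSE)
  b$wait_for(loaded)

  # Walk the DOM through the DevTools Protocol: document -> node -> HTML
  doc  <- b$DOM$getDocument()
  node <- b$DOM$querySelector(doc$root$nodeId, css)
  html <- b$DOM$getOuterHTML(node$nodeId)$outerHTML

  rvest::html_text2(rvest::read_html(html))
}
```

Even for a single element, the code has to deal with events, node identifiers and raw protocol calls.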

## hayalbaz makes it easier
==Same example as before with hayalbaz==

## Advantages of hayalbaz over RSelenium
1. Easy installation
2. Less verbose
3. Faster
4. Easier overall
5. Good synergy with rvest
6. Promising features

**Note: I don't think the goal of hayalbaz is to replace RSelenium, but it can be a valuable alternative**

# Advantages of hayalbaz over RSelenium
## 1. Easy installation
RSelenium:
[The Complete RSelenium Installation Guide (2023)](https://www.youtube.com/watch?v=GnpJujF9dBw)
- Needs Java installed

## 2. Less verbose
==Same example as earlier with RSelenium and hayalbaz next to each other==

## 3. Faster
==Use the script benchmark_rselenium_hayalbaz.R==

## 4. Easier overall
Because:
1. Less installation
2. Less verbose
3. No port management
4. One web browser (though you can also use Chromium)
5. Self-explanatory methods

## 5. Good synergy with rvest
==Find an easy example==

## 6. Promising features
**Do you know {shinytest2}?**
It tests Shiny applications automatically.

[shinytest2](https://rstudio.github.io/shinytest2/) uses [chromote](https://github.com/rstudio/chromote) to render applications in a headless Chrome browser. [chromote](https://github.com/rstudio/chromote) allows for a live preview, better debugging tools, and/or simply using modern JavaScript/CSS.

By simply recording your actions as code and extending them to test the more particular aspects of your application, it will result in fewer bugs and more confidence in future Shiny application development.

### Example
Create a test by recording it:

```r
shinytest2::record_test()
```

### Like Puppeteer or Playwright
Recording actions helps to:
1. Save time
2. Be more beginner friendly

**This feature is still at the project stage**

**Again, the goal is not to replace RSelenium but it can be a valuable alternative**

# Thank you for your attention