![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)



### Legality of Scraping

#### 1) White Zone


<img src="./images/api.png" width="300">

-  Legally safe as long as you obtain information from APIs or public databases.

#### 2) Dark Zone



<img src="./images/dark.png" width="300">

-  Entails pretending to be a person. Avoid accepting website policies with a scraper. This applies particularly to activities within the Chinese intranet.

#### 3) Grey Zone

- Involves pretending to be yourself. Accepting cookies is generally fine, but ensure they do not imply agreement to any internal policy. Using an email to log in is acceptable, but avoid using personal information such as an address, ID, etc.


### Packages for Web Scraping

#### Requests

- **Function**: Fetches the HTML content of web pages. You first make the request and then extract data from the response with BeautifulSoup.

#### BeautifulSoup

- **Function**: BeautifulSoup does not simulate a browser and does not download pages itself. It parses HTML you have already fetched (for example with Requests) into a BeautifulSoup object, which simplifies locating HTML tags and extracting their content.
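The two-step workflow (request first, parse second) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `fetch_html` and `extract_links` and the sample snippet are made up for this example, and it assumes the `requests` and `beautifulsoup4` packages are installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Step 1: download the page with Requests and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_links(html: str) -> list[str]:
    """Step 2: parse the HTML with BeautifulSoup and collect every link target."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


# Offline demonstration on a hand-written snippet (no network needed).
sample_html = '<p>See <a href="/page1">one</a> and <a href="/page2">two</a>.</p>'
links = extract_links(sample_html)
print(links)
```

On a live site the two steps chain as `extract_links(fetch_html(url))`, where `url` is whichever page you are allowed to scrape.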

#### Selenium

- **Function**: Selenium is used to simulate a browser. It can perform human-like actions within the browser.

#### Scrapy

- **Function**: Ideal for large-scale projects and primarily designed for efficient data extraction. Scrapy doesn't use a browser, so it sometimes has to be combined with browser simulation to get past blocking mechanisms.

#### Differences

- **Use Case Dependence**: The choice among these tools depends on your specific needs.
- **Handling JavaScript**: JavaScript runs in the browser. To extract JavaScript-generated content, a browser environment like that provided by Selenium is necessary.

#### Handling Blockages

- **Cloudflare**: Many websites use Cloudflare to prevent DDoS attacks. Check if a site uses Cloudflare [here](https://checkforcloudflare.selesti.com/).
- **Selenium Strategies**:
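  - `time.sleep()` or `WebDriverWait()` can help avoid blockages: the first pauses for a fixed time, while the second waits until a condition (such as an element appearing on the page) is met.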
  - For anti-bot protection, `undetected-chromedriver` can be useful. Note that this is only available for Chrome, not Firefox.

#### Simulating Human-Like Behavior with Selenium

- **Short-Term**: Implement random waiting times or keystroke intervals.
- **Long-Term**: Emulate human sleep cycle behavior.
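A minimal sketch of both ideas, using only the standard library. The helper names (`human_pause`, `keystroke_delays`, `is_awake`) and the default intervals and waking hours are assumptions chosen for illustration; tune them to your own case. With Selenium you would sleep for one of the keystroke delays between successive single-character `send_keys` calls.

```python
import random
import time
from datetime import datetime


def human_pause(min_seconds: float = 1.0, max_seconds: float = 4.0) -> float:
    """Short-term: sleep for a random duration and return how long we slept."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


def keystroke_delays(text: str, low: float = 0.05, high: float = 0.3) -> list[float]:
    """Short-term: one random delay per character, to wait between key presses."""
    return [random.uniform(low, high) for _ in text]


def is_awake(now: datetime, wake_hour: int = 8, sleep_hour: int = 23) -> bool:
    """Long-term: only scrape during the simulated waking hours."""
    return wake_hour <= now.hour < sleep_hour
```

A scraping loop could call `is_awake(datetime.now())` before each batch and `human_pause()` between page loads.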


### HTML, Java, and JavaScript in the Context of Web Scraping

#### HTML (HyperText Markup Language)

- HTML is the standard markup language for documents designed to be displayed in a web browser. It describes the structure of web pages using markup.
- HTML will be your main nightmare: you want to extract the text you need by stripping away the markup around it.

All well-formed HTML pages share a basic structure:

```html
<html>
<head>
<title>my title</title>
</head>
<body>
</body>
</html>
```
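As a rough illustration of stripping markup to keep only the text, here is a sketch built on the standard library's `html.parser`, so it runs without extra installs. The class `TextExtractor` and the function `html_to_text` are hypothetical names for this example; in practice BeautifulSoup's `get_text()` does a similar job more conveniently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, dropping tags and script/style contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text fragments of `html`, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>my title</title><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(sample))
```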



#### JavaScript

- JavaScript is a scripting language that enables interactive web pages. Despite the similar name, it is unrelated to Java; on web pages it runs on the client side, inside the browser.


#### When is JavaScript Executed?

- **During Browsing**: The browser executes JavaScript when a page loads and when you interact with it, for example by clicking, scrolling, or submitting forms.
- **Scraping**: To access content that scripts load or alter dynamically, you need a tool that can execute JavaScript, such as Selenium.




### Basic Structure of XPath and CSS Selectors

When searching for elements in HTML, it's common to use either XPath or CSS selectors. The choice of selector depends on the specific requirements and context of the task. Below is a basic guide to understanding the structure of these selectors.

#### Search Hierarchy

1. **ID**: 
   - **Description**: Unique identifier for an element. Not always available.
   - **Format**: `id="unique_id"`
2. **Class**:
   - **Description**: Class name(s) associated with an element.
   - **Format**: `class="class_name"`
3. **Tag**:
   - **Description**: HTML tag (like `div`, `label`, etc.).
   - **Format**: `<tag>`
4. **Attribute**:
   - **Description**: An attribute and its value within an element.
   - **Format**: `attribute="attribute_value"`

#### XPath

- **Basic Structure**:

  ```xpath
  //tag[@attribute="attribute_value"][position]
  ```

The double slash `//` selects all matching elements anywhere in the document, regardless of how deeply they are nested.
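To see the attribute and position predicates in action, the sketch below uses the standard library's `xml.etree.ElementTree`. Two caveats: ElementTree implements only a subset of XPath (so each predicate is shown on its own, and `//` is written `.//` relative to an element), and it needs well-formed markup, which real pages often are not. The snippet and variable names are invented for illustration; Selenium and `lxml` accept fuller XPath expressions.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet; real pages are rarely this tidy.
PAGE = """
<html>
  <body>
    <div class="article"><p>First article</p></div>
    <div class="sidebar"><p>Navigation</p></div>
    <div class="article"><p>Second article</p></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Attribute predicate: every <div> whose class attribute equals "article".
articles = root.findall('.//div[@class="article"]')
article_texts = [div.find("p").text for div in articles]

# Position predicate (1-based): the <div> that is the second <div> child
# of its parent.
second_div = root.find(".//div[2]")
second_class = second_div.get("class")

print(article_texts, second_class)
```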

#### CSS

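You can also use CSS selectors, which match elements based on their tag, ID, class, or other attributes. They are easier to write but less robust than XPath; for instance, standard CSS selectors cannot match an element by its text content.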
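A short sketch of CSS selectors in practice, using BeautifulSoup's `select` and `select_one` methods (this assumes the `beautifulsoup4` package is installed). The HTML snippet and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

HTML = """
<div id="main">
  <p class="intro">Welcome</p>
  <a class="link" href="https://example.com/a">A</a>
  <a class="link external" href="https://example.org/b">B</a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "#main" matches by ID, "p.intro" by tag plus class.
intro_text = soup.select_one("#main p.intro").get_text()

# "a.link" matches every <a> whose class list contains "link".
link_targets = [a["href"] for a in soup.select("a.link")]

# [href^="..."] matches attributes that start with the given prefix.
org_links = soup.select('a[href^="https://example.org"]')
org_labels = [a.get_text() for a in org_links]

print(intro_text, link_targets, org_labels)
```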






### So How Do I Scrape?


- Cloudflare protection will block you if you send many requests at once, so parallelizing requests does not necessarily speed up the process.
- More professional setups run cloud services with multiple instances (and thus multiple IPs).
- Avoid accepting terms of service and, where needed, emulate human-like behavior.
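One way to act on the first point is to fetch pages strictly one at a time with a random pause in between. The sketch below is an assumption-laden illustration: `polite_crawl` is a hypothetical helper, the delay bounds are arbitrary, and `fetch` stands for whatever download function you use (a Requests call, a Selenium page load, and so on).

```python
import random
import time
from typing import Callable, Iterable


def polite_crawl(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 2.0,
    max_delay: float = 6.0,
) -> dict[str, str]:
    """Fetch URLs sequentially, sleeping a random interval between requests."""
    results: dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results
```

Passing the download function in as an argument keeps the pacing logic independent of the tool that actually retrieves the page.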

