# Definition

**Web scraping** means writing a program that visits one or more web pages, parses their HTML, and extracts specific data (for example titles, prices, images, or links).

# Is web scraping legal?

The legality of **web scraping** depends on the context, the country, and the terms of use of the target site. The key points are:

---

### ✅ **When web scraping is legal:**

* **Public sites with no explicit restrictions**: If the data is freely accessible, with no login or access barrier, and the site's terms of use do not explicitly forbid it, scraping is generally permitted.
* **Compliance with `robots.txt`**: This file states what the site allows or forbids for robots. Following its rules is recommended good practice, though not always a strict legal obligation.
* **Non-commercial, respectful use**: Extracting data for personal, educational, or scientific use is often tolerated, especially if the scraping does not overload the server.
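
As a minimal sketch of the `robots.txt` check described above, Python's standard `urllib.robotparser` module can tell whether a given URL may be fetched. The robots.txt content, the `my-scraper` user-agent name, and the `example.com` URLs are hypothetical, inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of downloading it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-scraper" is an assumed user-agent name; the "*" rules apply to it.
print(parser.can_fetch("my-scraper", "https://example.com/catalogue/page-1.html"))  # True
print(parser.can_fetch("my-scraper", "https://example.com/private/report.html"))    # False
print(parser.crawl_delay("my-scraper"))  # 2
```

Against a real site, one would typically call `parser.set_url(".../robots.txt")` followed by `parser.read()` instead of `parse()`.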

---

### ❌ **When web scraping is illegal or risky:**

* **Violation of the site's terms of use**: If the site explicitly forbids scraping in its terms of use, scraping may be treated as a breach of contract.
* **Unauthorized access or circumvention of protections**: Scraping content behind authentication or a paywall, or using techniques to hide your identity (proxy, spoofed user-agent), may be illegal (for example, a violation of the Computer Fraud and Abuse Act in the USA).
* **Privacy or personal data violations**: Scraping personal data without consent may violate the GDPR (in Europe) or other data protection laws.
* **Overload or attack**: Sending requests too frequently may be treated as a denial-of-service (DoS) attack or as a nuisance.

---


### 📝 In summary:

| **Condition**                                 | **Likely legality**             |
| --------------------------------------------- | ------------------------------- |
| Public site, non-personal data                | Generally legal                 |
| Compliance with terms of use and `robots.txt` | Recommended and often necessary |
| Personal data without consent                 | Risky and often illegal         |
| Circumventing protections (login, captcha)    | Illegal                         |
| Abusive use or server overload                | Illegal                         |

---

### Practical tips:

* Always read the site's **terms of use**.
* Prefer **official APIs** when available.
* Limit request frequency so as not to overload the site.
* Avoid scraping sensitive personal data.
* When in doubt, consult a **specialized lawyer**.
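
One possible way to limit request frequency is a small throttle that enforces a minimum delay between consecutive requests. The `Throttle` class below is a hypothetical helper written for this illustration, not part of any library, and the 0.2-second interval is an arbitrary value for the demo.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, None before the first one

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Demo: at most one "request" every 0.2 seconds (arbitrary value).
throttle = Throttle(min_interval=0.2)
for page in range(1, 4):
    throttle.wait()
    print(f"fetching page {page}")  # a real scraper would call requests.get(...) here
```

The right interval depends on the site; a `Crawl-delay` value in `robots.txt`, when present, is a reasonable starting point.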

---

# Libraries for web scraping

## 1. **Requests**

### How it works:

Requests is a simple, powerful Python library for **making HTTP requests** (GET, POST, etc.).

* It retrieves the raw content of a web page (the HTML code) by sending a request to the target URL.
* Example: downloading the HTML page at a given URL.

### Characteristics:

* It does not parse or interpret the content; it only retrieves the raw data.
* Very lightweight and easy to use.
* It does not execute JavaScript, so it misses content that only appears after JS has run.
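
A minimal sketch of a download helper built on Requests, assuming the `requests` package is installed. The `fetch_html` function, the 10-second timeout, and the `User-Agent` string with its contact address are illustrative choices, not something Requests prescribes.

```python
import requests

# Hypothetical identifying header; the contact address is a placeholder.
HEADERS = {"User-Agent": "my-scraper/0.1 (contact: me@example.com)"}


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the raw HTML of `url`, raising requests.HTTPError on a 4xx/5xx status."""
    # Without a timeout, requests.get can wait indefinitely on a silent server.
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()
    return response.text
```

A call such as `html = fetch_html("https://quotes.toscrape.com")` returns the page source as a string, which can then be handed to a parser.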

---

## 2. **BeautifulSoup**

### How it works:

BeautifulSoup is an **HTML/XML parsing** library.

* It takes HTML code as input (often fetched with Requests) and turns it into a **tree of Python objects** that is easy to manipulate.
* It lets you navigate the document tree, search for tags, and extract text and attributes (for example links, images, titles).

### Characteristics:

* Very intuitive and flexible for extracting exactly the targeted elements from the HTML.
* Works on static HTML content (what is in the page source).
* Pairs well with Requests for simple scraping.
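
To illustrate the parsing step on its own, the sketch below runs BeautifulSoup on an inline HTML fragment, assuming the `beautifulsoup4` package is installed. The fragment is made up for the example; it only imitates the kind of markup a quote page might contain.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, inlined so the example needs no network access.
HTML = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author A</small>
  <a class="tag" href="/tag/one/">one</a>
  <a class="tag" href="/tag/two/">two</a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
quote = soup.find("div", class_="quote")

# Text extraction: get_text(strip=True) removes surrounding whitespace.
text = quote.find("span", class_="text").get_text(strip=True)
author = quote.find("small", class_="author").get_text(strip=True)

# Attribute extraction: tags behave like dictionaries for their attributes.
tag_links = [a["href"] for a in quote.find_all("a", class_="tag")]

print(text)       # First quote
print(author)     # Author A
print(tag_links)  # ['/tag/one/', '/tag/two/']
```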

---

## 3. **Selenium**

### How it works:

Selenium is a **web browser automation** tool.

* It drives a real browser (Chrome, Firefox, etc.), optionally without a graphical interface (headless).
* It loads pages as a real user would, which makes it possible to handle content **loaded dynamically via JavaScript**.
* It can also interact with the page (clicking buttons, filling in forms).

### Characteristics:

* Often necessary for scraping sites whose **content is dynamic or generated by JS**.
* Heavier and slower than Requests + BeautifulSoup because it opens a full browser.
* Can simulate human browsing.

---

## Quick summary:

| Library           | Main role                      | Typical use                             | Strengths                       | Limitations         |
| ----------------- | ------------------------------ | --------------------------------------- | ------------------------------- | ------------------- |
| **Requests**      | Fetch a page's HTML code       | Accessing static pages                  | Simple, fast                    | No JS execution     |
| **BeautifulSoup** | Parse HTML and extract data    | Targeted extraction from static content | Flexible, intuitive             | Does not process JS |
| **Selenium**      | Automate a real browser        | Scraping dynamic sites                  | Handles JavaScript, interaction | Heavy, slower       |

---

# Examples with https://quotes.toscrape.com

In [15]:
!pip install requests beautifulsoup4 






In [20]:
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status

# Parse the HTML into a navigable tree
soup = BeautifulSoup(response.text, "html.parser")

# Extract the text and author of each quote on the page
for quote in soup.find_all("div", class_="quote"):
    texte = quote.find("span", class_="text").text
    auteur = quote.find("small", class_="author").text
    print(f"{texte} — {auteur}")

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” — Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.” — J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” — Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” — Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” — Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.” — Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.” — André Gide
“I have not failed. I've just found 10,000 ways that won't work.” — Thomas A. Edison
“A woman is like a tea bag; you never know how st

In [21]:
page = 1
while True:
    url = f"https://quotes.toscrape.com/page/{page}/"
    response = requests.get(url, timeout=10)
    # Pages past the last one show this message instead of quotes
    if "No quotes found!" in response.text:
        break

    soup = BeautifulSoup(response.text, "html.parser")
    print(f"\n📄 Page {page}")
    for quote in soup.find_all("div", class_="quote"):
        texte = quote.find("span", class_="text").text
        auteur = quote.find("small", class_="author").text
        print(f"{texte} — {auteur}")

    page += 1



📄 Page 1
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” — Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.” — J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” — Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” — Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” — Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.” — Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.” — André Gide
“I have not failed. I've just found 10,000 ways that won't work.” — Thomas A. Edison
“A woman is like a tea bag; you neve
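
An alternative to counting pages until a "No quotes found!" message appears is to follow the page's own "next" link. The sketch below assumes the pager is marked up as `<li class="next"><a href="...">`, which should be checked against the actual page source; `next_page_url` is a hypothetical helper, shown here on an inline fragment so it runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html: str, current_url: str):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    next_item = soup.find("li", class_="next")  # assumed pager markup
    if next_item is None or next_item.a is None:
        return None
    # The href is usually relative, so resolve it against the current URL.
    return urljoin(current_url, next_item.a["href"])


# Hypothetical pager fragment imitating the assumed markup.
sample = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
print(next_page_url(sample, "https://quotes.toscrape.com/page/1/"))
# https://quotes.toscrape.com/page/2/
print(next_page_url("<ul class='pager'></ul>", "https://quotes.toscrape.com/page/10/"))
# None
```

A scraping loop would then fetch a page, process it, and continue while `next_page_url` returns a URL.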

In [12]:
!pip install selenium



In [26]:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    # This version of the site builds the quotes with JavaScript
    driver.get("https://quotes.toscrape.com/js/")

    quotes = driver.find_elements(By.CLASS_NAME, "quote")
    for q in quotes:
        print(q.text)
        print("\n")
finally:
    driver.quit()  # always close the browser, even after an error


“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
Tags: change deep-thoughts thinking world


“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
Tags: abilities choices


“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
Tags: inspirational life live miracle miracles


“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
Tags: aliteracy books classic humor


“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
Tags: be-yourself inspirational


“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein
Tags: adulthood success value


“It is better to be hated 

## https://books.toscrape.com/catalogue/page-{2}.html

In [28]:
import csv
from bs4 import BeautifulSoup
import requests

base_url = "https://books.toscrape.com/catalogue/page-{}.html"
livres = []

# Scrape the first five catalogue pages
for page in range(1, 6):
    url = base_url.format(page)
    response = requests.get(url, timeout=10)
    # Pass raw bytes so BeautifulSoup detects the encoding itself; with
    # response.text the "£" sign may be mis-decoded as "Â£"
    soup = BeautifulSoup(response.content, "html.parser")

    articles = soup.find_all("article")

    for article in articles:
        titre = article.h3.a["title"]
        # [1:] drops the leading currency symbol
        prix = article.find("p", class_="price_color").text[1:]
        stock = article.find("p", class_="instock availability").text.strip()
        livres.append([titre, prix, stock])

# Save to CSV
with open("livres.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Price", "Stock"])
    writer.writerows(livres)

print("✅ Data saved to livres.csv")


✅ Data saved to livres.csv
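
The script above stores each price as text. If numeric prices are needed (for sorting or totals), one possible approach is to extract the digits with a regular expression and convert them to `Decimal`. `parse_price` is a hypothetical helper, and the sample strings are illustrative inputs rather than values taken from the CSV.

```python
import re
from decimal import Decimal


def parse_price(raw: str) -> Decimal:
    """Extract the first numeric amount from a price string such as "£51.77"."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no numeric amount in {raw!r}")
    return Decimal(match.group())


print(parse_price("£51.77"))   # 51.77
print(parse_price("Â£53.74"))  # 53.74, tolerates a mis-decoded currency symbol
print(parse_price("51.77") + parse_price("£3.23"))  # 55.00
```

`Decimal` is used instead of `float` so that amounts such as 51.77 are stored exactly.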
