# Workshop Context

This workshop teaches you how to build a complete data collection pipeline by combining web scraping, S3-compatible object storage (MinIO), and a NoSQL database (MongoDB).

**Target site:** https://quotes.toscrape.com


# Architecture

| Component | Role | Data types |
|-----------|------|------------|
| **MinIO** | S3-compatible object storage | Images, raw files, large exports |
| **MongoDB** | NoSQL document database | Structured metadata, queryable data |
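
One way to read this split: the raw payload goes to MinIO under an object key, while MongoDB keeps a small, queryable document that points to that key. The sketch below only illustrates the idea; the `raw-pages` bucket name, the key layout, the field names, and the `split_record` helper are assumptions, not part of the workshop code.

```python
import hashlib
from datetime import datetime, timezone


def split_record(url: str, raw_html: bytes, bucket: str = "raw-pages") -> tuple[dict, str]:
    """Return (metadata document for MongoDB, object key for MinIO).

    The bucket name and key layout are illustrative assumptions.
    """
    digest = hashlib.sha256(raw_html).hexdigest()
    # Content-addressed key: identical pages map to the same object.
    object_key = f"html/{digest[:2]}/{digest}.html"
    document = {
        "url": url,
        "bucket": bucket,
        "object_key": object_key,  # pointer from MongoDB to the MinIO object
        "size_bytes": len(raw_html),
        "sha256": digest,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }
    return document, object_key
```

Keeping only the key in MongoDB keeps documents small, while the bulky payload stays in object storage.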

# Prerequisites
## Technical environment

- Docker and Docker Compose
- Python 3.10+
- VS Code
- Internet connection

## Required knowledge

- Python basics
- Basic understanding of HTML and HTTP
- Docker

# Part 1: Setting up the infrastructure

## 1.1 Project structure

```
workshop/
├── docker-compose.yml
├── requirements.txt
├── config/
│   └── settings.py
├── src/
│   ├── __init__.py
│   ├── scraper.py
│   ├── storage/
│   │   ├── __init__.py
│   │   ├── minio_client.py
│   │   └── mongo_client.py
│   └── pipeline.py
└── tests/
    └── test_scraper.py
```
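
The tree lists `config/settings.py` without showing its contents. A minimal sketch of what such a module might hold follows, assuming configuration is read from environment variables. The variable names (`MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MONGO_URI`, `MONGO_DATABASE`) and the `Settings` class are hypothetical; the defaults simply mirror the values used in this workshop's `docker-compose.yml`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    minio_endpoint: str
    minio_access_key: str
    minio_secret_key: str
    mongo_uri: str
    mongo_database: str


def load_settings(env=None) -> Settings:
    """Build settings from a mapping (defaults to os.environ).

    Variable names are assumptions; defaults match the workshop's compose file.
    """
    env = os.environ if env is None else env
    return Settings(
        minio_endpoint=env.get("MINIO_ENDPOINT", "localhost:9000"),
        minio_access_key=env.get("MINIO_ACCESS_KEY", "minioadmin"),
        minio_secret_key=env.get("MINIO_SECRET_KEY", "minioadmin123"),
        mongo_uri=env.get("MONGO_URI", "mongodb://admin:admin123@localhost:27017/"),
        mongo_database=env.get("MONGO_DATABASE", "scraping_db"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in `tests/`.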

## 1.2 Target architecture

```
┌─────────────────────┐
│ quotes.toscrape.com │
│                     │
│  • Quotes           │
│  • Authors          │
│  • Tags             │
└──────────┬──────────┘
           │ Scraping
           ▼
┌─────────────────────┐
│      Scraper        │
│      Python         │
└──────────┬──────────┘
           │
     ┌─────┴─────┐
     │           │
     ▼           ▼
┌─────────┐  ┌─────────┐
│  MinIO  │  │ MongoDB │
│         │  │         │
│ Exports │  │ Quotes  │
│ Backups │  │ Authors │
│ Images  │  │ Tags    │
└─────────┘  └─────────┘
     │           │
     └─────┬─────┘
           ▼
┌─────────────────────┐
│   Analytics & NLP   │
│   ML Datasets       │
└─────────────────────┘
```
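
The fan-out in the diagram (one scraper feeding two stores) can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the workshop's `pipeline.py`: `run_pipeline`, its parameters, and the `exports/quotes.json` key are hypothetical, and the two stores are represented by plain callables so the sketch runs without MinIO or MongoDB.

```python
import json
from typing import Callable, Iterable


def run_pipeline(
    scrape: Callable[[], Iterable[dict]],
    save_document: Callable[[dict], None],
    save_object: Callable[[str, bytes], None],
    export_key: str = "exports/quotes.json",
) -> int:
    """Send each scraped item to the document store, then one JSON export
    of the whole batch to the object store. Returns the number of items."""
    quotes = list(scrape())
    for quote in quotes:
        save_document(quote)  # MongoDB branch of the diagram
    payload = json.dumps(quotes, ensure_ascii=False).encode("utf-8")
    save_object(export_key, payload)  # MinIO branch of the diagram
    return len(quotes)


if __name__ == "__main__":
    # In-memory stand-ins for the two stores (placeholder data, not real quotes).
    documents: list[dict] = []
    objects: dict[str, bytes] = {}
    sample = [{"text": "example quote", "author": "Example Author", "tags": ["example"]}]
    count = run_pipeline(lambda: sample, documents.append, objects.__setitem__)
    print(count, "item(s) stored;", len(objects), "export(s) written")
```

In a real run, the callables would wrap the MinIO and MongoDB clients; injecting them keeps the flow testable in isolation.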

## 1.3 Docker Compose configuration

Create the `docker-compose.yml` file:


```yml
services:
  minio:
    image: minio/minio:latest
    container_name: workshop-minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin123
    command: server /data --console-address ":9001"
    volumes:
      - minio_data:/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  mongodb:
    image: mongo:7.0
    container_name: workshop-mongodb
    ports:
      - "27017:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: admin123
      MONGO_INITDB_DATABASE: scraping_db
    volumes:
      - mongo_data:/data/db
    healthcheck:
      test: echo 'db.runCommand("ping").ok' | mongosh localhost:27017/test --quiet
      interval: 30s
      timeout: 10s
      retries: 3

  mongo-express:
    image: mongo-express:latest
    container_name: workshop-mongo-express
    ports:
      - "8081:8081"
    environment:
      ME_CONFIG_MONGODB_ADMINUSERNAME: admin
      ME_CONFIG_MONGODB_ADMINPASSWORD: admin123
      ME_CONFIG_MONGODB_URL: mongodb://admin:admin123@mongodb:27017/
      ME_CONFIG_BASICAUTH: "false"
    depends_on:
      - mongodb

volumes:
  minio_data:
  mongo_data:
```

- Start the containers:

```bash
docker compose up -d
```
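
Once the containers are up, a quick way to confirm that the published ports respond is a plain TCP connection test. The sketch below uses only the standard library; the port numbers come from the compose file above, while the `SERVICES` mapping and both function names are illustrative. A successful connection only shows that the port is open, not that the service has finished initializing.

```python
import socket

# Host ports published by the workshop's docker-compose.yml.
SERVICES = {
    "MinIO API": ("localhost", 9000),
    "MinIO console": ("localhost", 9001),
    "MongoDB": ("localhost", 27017),
    "mongo-express": ("localhost", 8081),
}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, reachable in check_services().items():
        print(f"{name}: {'reachable' if reachable else 'unreachable'}")
```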

## 1.4 Python dependencies

Create the `requirements.txt` file:

```text
# Web Scraping
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
fake-useragent==1.4.0

# Storage
minio==7.2.3
pymongo==4.6.1

# Utilities
python-dotenv==1.0.1

```

Then install the dependencies. From a notebook cell, run `!pip install -r requirements.txt`; from a terminal, drop the leading `!`.
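
Because every dependency is pinned with `==`, it can be useful to confirm that the active environment actually matches the pins. The following is a minimal sketch using the standard library's `importlib.metadata`; `parse_pins` and `check_pins` are hypothetical helpers, and the parser deliberately handles only simple `name==version` lines like the ones in this file.

```python
from importlib import metadata


def parse_pins(text: str) -> dict[str, str]:
    """Extract {package: version} from simple 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for every pin the environment does not satisfy."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != wanted:
            problems[name] = f"installed {installed}, pinned {wanted}"
    return problems


# Example usage (assumes requirements.txt is in the current directory):
#   from pathlib import Path
#   print(check_pins(parse_pins(Path("requirements.txt").read_text())))
```

An empty result means every pinned package is installed at the requested version.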

