# Exercise 05 - Storage in MinIO (Data Lake)

## Objectives
- Configure Spark to connect to MinIO
- Read and write data in MinIO
- Organize data into Bronze / Silver / Gold layers
- Understand S3 paths

---

## 1. Reminder: Data Lake Architecture

```
+------------------------------------------------------------------+
|                         MINIO (S3)                               |
+------------------------------------------------------------------+
|                                                                  |
|  +----------------+  +----------------+  +----------------+      |
|  |    BRONZE      |  |    SILVER      |  |     GOLD       |      |
|  |                |  |                |  |                |      |
|  | Raw data, as   |  | Cleaned and    |  | Aggregated     |      |
|  | received       |  | typed data     |  | data, ready    |      |
|  |                |  |                |  | for analysis   |      |
|  |                |  |                |  |                |      |
|  | s3a://bronze/  |  | s3a://silver/  |  | s3a://gold/    |      |
|  +----------------+  +----------------+  +----------------+      |
|                                                                  |
+------------------------------------------------------------------+
```

## 2. Configuring Spark for MinIO

In [1]:
from pyspark.sql import SparkSession

# Stop the existing session, if any (required so that the new JARs get loaded)
try:
    spark.stop()
except NameError:
    # No `spark` variable defined yet: nothing to stop
    pass

# Configure Spark to connect to MinIO
# The AWS JARs are downloaded automatically on first launch
spark = SparkSession.builder \
    .appName("Spark MinIO") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0,org.apache.hadoop:hadoop-aws:3.4.1,com.amazonaws:aws-java-sdk-bundle:1.12.262") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

print("Spark configured for MinIO!")
print(f"Version: {spark.version}")

Spark configured for MinIO!
Version: 4.0.1


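### Configuration explained

In the builder, each option below carries the `spark.hadoop.` prefix so that Spark forwards it to Hadoop.

| Option | Description |
|--------|-------------|
| `fs.s3a.endpoint` | Address of the MinIO server |
| `fs.s3a.access.key` | Access key (login identifier) |
| `fs.s3a.secret.key` | Secret key (password) |
| `fs.s3a.path.style.access` | Required for MinIO: the bucket is addressed in the URL path (`http://minio:9000/bucket`) rather than as a subdomain |
| `fs.s3a.impl` | Filesystem implementation used for `s3a://` paths |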
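
The credentials in the cell above are hard-coded, which is fine for this training environment. As a minimal sketch of an alternative, the same options could be assembled from environment variables. The variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY`, and the helper `minio_spark_conf`, are assumptions made for illustration; they are not part of the course setup.

```python
import os


def minio_spark_conf(env=None):
    """Build the Spark options for MinIO from environment variables.

    The variable names are hypothetical; the defaults are the values
    used in this notebook.
    """
    env = os.environ if env is None else env
    s3a = {
        "fs.s3a.endpoint": env.get("MINIO_ENDPOINT", "http://minio:9000"),
        "fs.s3a.access.key": env.get("MINIO_ACCESS_KEY", "minioadmin"),
        "fs.s3a.secret.key": env.get("MINIO_SECRET_KEY", "minioadmin123"),
        "fs.s3a.path.style.access": "true",
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }
    # Hadoop options are passed through Spark with the "spark.hadoop." prefix
    return {f"spark.hadoop.{key}": value for key, value in s3a.items()}


conf = minio_spark_conf()
for key in sorted(conf):
    # Avoid printing the secret in the notebook output
    shown = "***" if "secret" in key else conf[key]
    print(f"{key} = {shown}")
```

Each pair could then be passed to the session builder with `.config(key, value)` in a loop, in place of the literal values.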

## 3. Creating the buckets

In [2]:
%pip install minio

Note: you may need to restart the kernel to use updated packages.


In [3]:
from minio import Minio

# Connect to MinIO
client = Minio(
    "minio:9000",
    access_key="minioadmin",
    secret_key="minioadmin123",
    secure=False
)

# Create the buckets if they do not exist yet
for bucket in ["bronze", "silver", "gold"]:
    if not client.bucket_exists(bucket):
        client.make_bucket(bucket)
        print(f"Bucket '{bucket}' created")
    else:
        print(f"Bucket '{bucket}' already exists")

Bucket 'bronze' already exists
Bucket 'silver' already exists
Bucket 'gold' already exists


## 4. Creating test data

In [4]:
# Create a sales DataFrame
ventes = [
    ("2025-01-01", "Laptop", "Paris", 2, 999.99),
    ("2025-01-01", "Souris", "Lyon", 10, 29.99),
    ("2025-01-02", "Clavier", "Paris", 5, 79.99),
    ("2025-01-02", "Ecran", "Marseille", 3, 299.99),
    ("2025-01-03", "Laptop", "Lyon", 1, 999.99),
    ("2025-01-03", "Casque", "Paris", 8, 149.99)
]

colonnes = ["date", "produit", "ville", "quantite", "prix_unitaire"]

df_ventes = spark.createDataFrame(ventes, colonnes)
df_ventes.show()

+----------+-------+---------+--------+-------------+
|      date|produit|    ville|quantite|prix_unitaire|
+----------+-------+---------+--------+-------------+
|2025-01-01| Laptop|    Paris|       2|       999.99|
|2025-01-01| Souris|     Lyon|      10|        29.99|
|2025-01-02|Clavier|    Paris|       5|        79.99|
|2025-01-02|  Ecran|Marseille|       3|       299.99|
|2025-01-03| Laptop|     Lyon|       1|       999.99|
|2025-01-03| Casque|    Paris|       8|       149.99|
+----------+-------+---------+--------+-------------+



## 5. Writing to Bronze

S3 paths follow the format `s3a://bucket/path/file`.
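
To make the format concrete, here is a small standard-library sketch that builds such a path and splits it back into its bucket and object key. The helpers `s3a_path` and `split_s3a_path` are hypothetical names introduced for illustration only.

```python
from urllib.parse import urlparse


def s3a_path(bucket, *parts):
    """Build an s3a:// path from a bucket name and key segments."""
    key = "/".join(p.strip("/") for p in parts)
    return f"s3a://{bucket}/{key}"


def split_s3a_path(path):
    """Split an s3a:// path into (bucket, key)."""
    url = urlparse(path)
    if url.scheme != "s3a":
        raise ValueError(f"Not an s3a path: {path}")
    # The bucket sits where a hostname would be; the key is the rest
    return url.netloc, url.path.lstrip("/")


path = s3a_path("bronze", "ventes", "raw")
print(path)
print(split_s3a_path(path))
```

The bucket is the first component after `s3a://`; everything after it is the object key, which only looks like a directory hierarchy.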

In [5]:
# Write the raw data to Bronze
chemin_bronze = "s3a://bronze/ventes/raw"

df_ventes.write \
    .mode("overwrite") \
    .parquet(chemin_bronze)

print(f"Data written to: {chemin_bronze}")

Data written to: s3a://bronze/ventes/raw


In [6]:
# Verify by reading the data back
# (row order may differ from the original: it is not guaranteed across part files)
df_bronze = spark.read.parquet(chemin_bronze)
print("Reading from Bronze:")
df_bronze.show()

Reading from Bronze:
+----------+-------+---------+--------+-------------+
|      date|produit|    ville|quantite|prix_unitaire|
+----------+-------+---------+--------+-------------+
|2025-01-02|  Ecran|Marseille|       3|       299.99|
|2025-01-02|Clavier|    Paris|       5|        79.99|
|2025-01-01| Laptop|    Paris|       2|       999.99|
|2025-01-03| Casque|    Paris|       8|       149.99|
|2025-01-03| Laptop|     Lyon|       1|       999.99|
|2025-01-01| Souris|     Lyon|      10|        29.99|
+----------+-------+---------+--------+-------------+



## 6. Transforming to Silver

In the Silver layer, the data is cleaned and enriched.

In [7]:
from pyspark.sql.functions import col, to_date

# Transformation: cast the date column to a date type and add the total amount
df_silver = df_bronze \
    .withColumn("date", to_date(col("date"))) \
    .withColumn("montant_total", col("quantite") * col("prix_unitaire"))

df_silver.show()
df_silver.printSchema()

+----------+-------+---------+--------+-------------+-------------+
|      date|produit|    ville|quantite|prix_unitaire|montant_total|
+----------+-------+---------+--------+-------------+-------------+
|2025-01-02|  Ecran|Marseille|       3|       299.99|       899.97|
|2025-01-02|Clavier|    Paris|       5|        79.99|       399.95|
|2025-01-01| Laptop|    Paris|       2|       999.99|      1999.98|
|2025-01-03| Casque|    Paris|       8|       149.99|      1199.92|
|2025-01-03| Laptop|     Lyon|       1|       999.99|       999.99|
|2025-01-01| Souris|     Lyon|      10|        29.99|        299.9|
+----------+-------+---------+--------+-------------+-------------+

root
 |-- date: date (nullable = true)
 |-- produit: string (nullable = true)
 |-- ville: string (nullable = true)
 |-- quantite: long (nullable = true)
 |-- prix_unitaire: double (nullable = true)
 |-- montant_total: double (nullable = true)



In [8]:
# Write to Silver
chemin_silver = "s3a://silver/ventes/enriched"

df_silver.write \
    .mode("overwrite") \
    .parquet(chemin_silver)

print(f"Data written to: {chemin_silver}")

Data written to: s3a://silver/ventes/enriched


## 7. Aggregating to Gold

In the Gold layer, the data is prepared for analysis.

In [9]:
# Note: this import shadows Python's built-in sum() for the rest of the notebook
from pyspark.sql.functions import sum, count, avg

# Read from Silver
df_silver = spark.read.parquet(chemin_silver)

# Aggregate by city
df_gold_ville = df_silver.groupBy("ville").agg(
    count("*").alias("nb_ventes"),
    sum("quantite").alias("total_quantite"),
    sum("montant_total").alias("chiffre_affaires"),
    avg("montant_total").alias("panier_moyen")
)

df_gold_ville.show()

+---------+---------+--------------+------------------+-----------------+
|    ville|nb_ventes|total_quantite|  chiffre_affaires|     panier_moyen|
+---------+---------+--------------+------------------+-----------------+
|Marseille|        1|             3|            899.97|           899.97|
|    Paris|        3|            15|           3599.85|          1199.95|
|     Lyon|        2|            11|1299.8899999999999|649.9449999999999|
+---------+---------+--------------+------------------+-----------------+
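
The long decimals in the Lyon row (`1299.8899999999999`) come from binary floating-point arithmetic on `double` columns, not from a data error. As a plain-Python sketch of the same effect, and assuming two decimal places are wanted for monetary amounts, the total can be rounded for display or computed with `decimal.Decimal`:

```python
from decimal import Decimal

# Lyon sales from the test data: (quantite, prix_unitaire as text)
ventes_lyon = [(10, "29.99"), (1, "999.99")]

# A loop is used instead of sum(), because the notebook imports
# pyspark's sum, which shadows the built-in
total_float = 0.0
total_decimal = Decimal("0")
for quantite, prix in ventes_lyon:
    total_float += quantite * float(prix)      # binary floating point
    total_decimal += quantite * Decimal(prix)  # exact decimal arithmetic

print(total_float)            # may show a long tail of decimals
print(round(total_float, 2))  # rounded for display
print(total_decimal)
```

On the Spark side, `pyspark.sql.functions.round` applied to the aggregated columns would play the same display role; whether to store rounded values in Gold is a design choice this notebook does not settle.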



In [10]:
# Write to Gold
chemin_gold = "s3a://gold/ventes/par_ville"

df_gold_ville.write \
    .mode("overwrite") \
    .parquet(chemin_gold)

print(f"Data written to: {chemin_gold}")

Data written to: s3a://gold/ventes/par_ville


## 8. Checking the Data Lake

Let's list the objects in each bucket.

In [11]:
# List the objects in each bucket
for bucket in ["bronze", "silver", "gold"]:
    print(f"\n=== Bucket: {bucket} ===")
    objets = client.list_objects(bucket, recursive=True)
    for obj in objets:
        print(f"  {obj.object_name} ({obj.size} bytes)")


=== Bucket: bronze ===
  clients/ (0 bytes)
  clients/raw/_SUCCESS (0 bytes)
  clients/raw/part-00000-86c65917-8879-4404-8523-9c31de6d0601-c000.snappy.parquet (649 bytes)
  clients/raw/part-00002-86c65917-8879-4404-8523-9c31de6d0601-c000.snappy.parquet (1732 bytes)
  clients/raw/part-00004-86c65917-8879-4404-8523-9c31de6d0601-c000.snappy.parquet (1691 bytes)
  clients/raw/part-00007-86c65917-8879-4404-8523-9c31de6d0601-c000.snappy.parquet (1761 bytes)
  clients/raw/part-00009-86c65917-8879-4404-8523-9c31de6d0601-c000.snappy.parquet (1698 bytes)
  clients/raw/part-00011-86c65917-8879-4404-8523-9c31de6d0601-c000.snappy.parquet (1711 bytes)
  demo/premier_fichier.json (107 bytes)
  exercice/mon_profil.json (64 bytes)
  ventes/ (0 bytes)
  ventes/raw/_SUCCESS (0 bytes)
  ventes/raw/part-00000-afd7ff3d-5b75-4e94-b28a-f292741f3b19-c000.snappy.parquet (657 bytes)
  ventes/raw/part-00001-afd7ff3d-5b75-4e94-b28a-f292741f3b19-c000.snappy.parquet (1641 bytes)
  ventes/raw/part-0000
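
The listing shows that a Parquet dataset written by Spark is a directory holding a `_SUCCESS` marker and several `part-*` files, not a single file. The sketch below groups such a listing by dataset directory. The `(name, size)` pairs are copied from the first lines of the output above, and the helper `summarize` is a hypothetical name introduced for illustration.

```python
from collections import defaultdict

# (object_name, size) pairs copied from the bronze listing above
objets = [
    ("clients/raw/_SUCCESS", 0),
    ("clients/raw/part-00000-86c65917-8879-4404-8523-9c31de6d0601-c000.snappy.parquet", 649),
    ("clients/raw/part-00002-86c65917-8879-4404-8523-9c31de6d0601-c000.snappy.parquet", 1732),
    ("demo/premier_fichier.json", 107),
]


def summarize(objets):
    """Return {directory: (number of Parquet part files, total size)}."""
    resume = defaultdict(lambda: [0, 0])
    for nom, taille in objets:
        dossier, _, fichier = nom.rpartition("/")
        # Only count data files; skip _SUCCESS markers and other objects
        if fichier.startswith("part-") and fichier.endswith(".parquet"):
            resume[dossier][0] += 1
            resume[dossier][1] += taille
    return {dossier: tuple(valeurs) for dossier, valeurs in resume.items()}


print(summarize(objets))
```

With the MinIO client from section 3, the pairs could be taken from `obj.object_name` and `obj.size` for each object returned by `client.list_objects(bucket, recursive=True)`.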

## 9. Pipeline diagram

```
SOURCE DATA                          DATA LAKE (MinIO)
-----------                          -----------------

  [DataFrame]                        +-------------------+
       |                             |      BRONZE       |
       |  Raw ingestion              |  s3a://bronze/    |
       +--------------------------->|  ventes/raw       |
                                     +--------+----------+
                                              |
                                              | Cleaning
                                              | Enrichment
                                              v
                                     +-------------------+
                                     |      SILVER       |
                                     |  s3a://silver/    |
                                     |  ventes/enriched  |
                                     +--------+----------+
                                              |
                                              | Aggregation
                                              | Metrics
                                              v
                                     +-------------------+
                                     |       GOLD        |
                                     |  s3a://gold/      |
                                     |  ventes/par_ville |
                                     +-------------------+
                                              |
                                              v
                                        Reporting
                                        BI Tools
```

---

## Exercise

**Goal**: Build your own Bronze / Silver / Gold pipeline

**Instructions**:
1. Create a DataFrame of clients (id, nom, email, pays, date_inscription)
2. Write it to `s3a://bronze/clients/raw`
3. Transform: add an `annee_inscription` column extracted from the date
4. Write to `s3a://silver/clients/enriched`
5. Aggregate: count the clients per country
6. Write to `s3a://gold/clients/par_pays`

Your turn:

In [12]:
# Client data
clients = [
    (1, "Alice Martin", "alice@email.com", "France", "2023-03-15"),
    (2, "Bob Johnson", "bob@email.com", "USA", "2023-06-22"),
    (3, "Charlie Dupont", "charlie@email.com", "France", "2024-01-10"),
    (4, "Diana Smith", "diana@email.com", "UK", "2023-11-05"),
    (5, "Eve Bernard", "eve@email.com", "France", "2024-02-28")
]

colonnes = ["id", "nom", "email", "pays", "date_inscription"]

# TODO: Create the DataFrame
df_clients = spark.createDataFrame(clients, colonnes)
df_clients.show()

# TODO: Write to Bronze
chemin_bronze_exo = "s3a://bronze/clients/raw"

df_clients.write \
    .mode("overwrite") \
    .parquet(chemin_bronze_exo)

print(f"Data written to: {chemin_bronze_exo}")

+---+--------------+-----------------+------+----------------+
| id|           nom|            email|  pays|date_inscription|
+---+--------------+-----------------+------+----------------+
|  1|  Alice Martin|  alice@email.com|France|      2023-03-15|
|  2|   Bob Johnson|    bob@email.com|   USA|      2023-06-22|
|  3|Charlie Dupont|charlie@email.com|France|      2024-01-10|
|  4|   Diana Smith|  diana@email.com|    UK|      2023-11-05|
|  5|   Eve Bernard|    eve@email.com|France|      2024-02-28|
+---+--------------+-----------------+------+----------------+

Data written to: s3a://bronze/clients/raw


In [13]:
df_bronze_exo = spark.read.parquet(chemin_bronze_exo)
print("Reading from Bronze:")
df_bronze_exo.show()

Reading from Bronze:
+---+--------------+-----------------+------+----------------+
| id|           nom|            email|  pays|date_inscription|
+---+--------------+-----------------+------+----------------+
|  3|Charlie Dupont|charlie@email.com|France|      2024-01-10|
|  1|  Alice Martin|  alice@email.com|France|      2023-03-15|
|  5|   Eve Bernard|    eve@email.com|France|      2024-02-28|
|  4|   Diana Smith|  diana@email.com|    UK|      2023-11-05|
|  2|   Bob Johnson|    bob@email.com|   USA|      2023-06-22|
+---+--------------+-----------------+------+----------------+



In [14]:
df_bronze_exo.printSchema()

root
 |-- id: long (nullable = true)
 |-- nom: string (nullable = true)
 |-- email: string (nullable = true)
 |-- pays: string (nullable = true)
 |-- date_inscription: string (nullable = true)



In [15]:
# TODO: Transform and write to Silver
# Hint: use year(to_date(col("date_inscription"))) to extract the year

from pyspark.sql.functions import year, to_date, col

df_silver_exo = df_bronze_exo \
    .withColumn("date_inscription", to_date(col("date_inscription"))) \
    .withColumn("annee_inscription", year(col("date_inscription")))

df_silver_exo.show()
df_silver_exo.printSchema()

+---+--------------+-----------------+------+----------------+-----------------+
| id|           nom|            email|  pays|date_inscription|annee_inscription|
+---+--------------+-----------------+------+----------------+-----------------+
|  3|Charlie Dupont|charlie@email.com|France|      2024-01-10|             2024|
|  1|  Alice Martin|  alice@email.com|France|      2023-03-15|             2023|
|  5|   Eve Bernard|    eve@email.com|France|      2024-02-28|             2024|
|  4|   Diana Smith|  diana@email.com|    UK|      2023-11-05|             2023|
|  2|   Bob Johnson|    bob@email.com|   USA|      2023-06-22|             2023|
+---+--------------+-----------------+------+----------------+-----------------+

root
 |-- id: long (nullable = true)
 |-- nom: string (nullable = true)
 |-- email: string (nullable = true)
 |-- pays: string (nullable = true)
 |-- date_inscription: date (nullable = true)
 |-- annee_inscription: integer (nullable = true)



In [16]:
# Write to Silver
chemin_silver_exo = "s3a://silver/clients/enriched"

df_silver_exo.write \
    .mode("overwrite") \
    .parquet(chemin_silver_exo)

print(f"Data written to: {chemin_silver_exo}")

Data written to: s3a://silver/clients/enriched


In [17]:
# TODO: Aggregate and write to Gold
from pyspark.sql.functions import count

df_silver_clients = spark.read.parquet(chemin_silver_exo)

# Aggregate by country
df_gold_pays = df_silver_clients.groupBy("pays").agg(
    count("*").alias("nb_clients")
)

df_gold_pays.show()

+------+----------+
|  pays|nb_clients|
+------+----------+
|France|         3|
|    UK|         1|
|   USA|         1|
+------+----------+
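
As a sanity check on the Spark result, steps 3 and 5 (year extraction and count per country) can be reproduced in plain Python on the five exercise rows. This is only a cross-check sketch with hypothetical variable names, not how the pipeline itself should be implemented.

```python
from collections import Counter
from datetime import date

# Same rows as the exercise data
client_rows = [
    (1, "Alice Martin", "alice@email.com", "France", "2023-03-15"),
    (2, "Bob Johnson", "bob@email.com", "USA", "2023-06-22"),
    (3, "Charlie Dupont", "charlie@email.com", "France", "2024-01-10"),
    (4, "Diana Smith", "diana@email.com", "UK", "2023-11-05"),
    (5, "Eve Bernard", "eve@email.com", "France", "2024-02-28"),
]

# Step 3: extract the year from the ISO date string, keyed by client id
annees = {id_: date.fromisoformat(d).year for id_, _, _, _, d in client_rows}
print(annees)

# Step 5: count the clients per country
nb_clients = Counter(pays for _, _, _, pays, _ in client_rows)
print(dict(nb_clients))
```

The counts should match the `nb_clients` column shown above, whatever order Spark returns the rows in.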



In [18]:
# Write to Gold
chemin_gold_exo = "s3a://gold/clients/par_pays"

# Write the per-country aggregate (df_gold_pays), not the sales aggregate df_gold_ville
df_gold_pays.write \
    .mode("overwrite") \
    .parquet(chemin_gold_exo)

print(f"Data written to: {chemin_gold_exo}")

Data written to: s3a://gold/clients/par_pays


---

## Summary

In this notebook, you learned:
- How to **configure Spark for MinIO**
- The format of S3 paths: `s3a://bucket/path`
- How to **read and write** data in MinIO with Spark
- The **Bronze / Silver / Gold** organization

### Next step
In the next notebook, we will learn how to ingest data from **PostgreSQL**.