# Exercise 11 - Gold Aggregations

## Objectives
- Understand the Gold layer of the Data Lake
- Create business aggregations
- Build fact and dimension tables
- Optimize for BI and reporting

---

## 1. The Gold layer

```
+------------------------------------------------------------------+
|                       GOLD LAYER                                 |
+------------------------------------------------------------------+
|                                                                  |
|  Purpose: data ready for BI and decision-making                 |
|                                                                  |
|  +------------------------+   +------------------------+        |
|  | Aggregated tables      |   | KPIs and metrics       |        |
|  +------------------------+   +------------------------+        |
|  | - Sales by day         |   | - Total revenue        |        |
|  | - Sales by product     |   | - Average basket       |        |
|  | - Sales by region      |   | - Growth rate          |        |
|  +------------------------+   +------------------------+        |
|                                                                  |
|  +------------------------+   +------------------------+        |
|  | Dimensions             |   | Fact tables            |        |
|  +------------------------+   +------------------------+        |
|  | - dim_clients          |   | - fact_ventes          |        |
|  | - dim_produits         |   | - fact_commandes       |        |
|  | - dim_temps            |   | - fact_livraisons      |        |
|  +------------------------+   +------------------------+        |
|                                                                  |
+------------------------------------------------------------------+
```

## 2. Configuration

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from datetime import datetime

# Create the SparkSession
spark = SparkSession.builder \
    .appName("Agregations Gold") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0,org.apache.hadoop:hadoop-aws:3.4.1,com.amazonaws:aws-java-sdk-bundle:1.12.262") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

date_traitement = datetime.now().strftime("%Y-%m-%d")
print(f"Spark ready - Date: {date_traitement}")

Spark ready - Date: 2026-01-16


In [2]:
# Load the data from PostgreSQL
jdbc_url = "jdbc:postgresql://postgres:5432/app"
jdbc_properties = {
    "user": "postgres",
    "password": "postgres",
    "driver": "org.postgresql.Driver"
}

df_orders = spark.read.jdbc(url=jdbc_url, table="orders", properties=jdbc_properties)
df_order_details = spark.read.jdbc(url=jdbc_url, table="order_details", properties=jdbc_properties)
df_products = spark.read.jdbc(url=jdbc_url, table="products", properties=jdbc_properties)
df_customers = spark.read.jdbc(url=jdbc_url, table="customers", properties=jdbc_properties)
df_categories = spark.read.jdbc(url=jdbc_url, table="categories", properties=jdbc_properties)

print("Data loaded")

Data loaded


In [3]:
# Prepare the base data: net amount per order line, after discount
df_details = df_order_details.withColumn(
    "montant",
    F.round(F.col("unit_price") * F.col("quantity") * (1 - F.col("discount")), 2)
)

# Join with orders to get the order date, customer, and employee
df_ventes = df_details.join(
    df_orders.select("order_id", "customer_id", "order_date", "employee_id"),
    on="order_id",
    how="left"
)

print(f"Sales data prepared: {df_ventes.count()} rows")

Sales data prepared: 2155 rows
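The `montant` column is the net amount of one order line: unit price times quantity, reduced by the discount rate. The plain-Python sketch below reproduces that arithmetic outside Spark. The three order lines are assumed, illustrative values (they are not read from the database), chosen so that their sum matches the 440.0 total reported for 1996-07-04 in the daily aggregation.

```python
def line_amount(unit_price: float, quantity: int, discount: float) -> float:
    """Net amount of one order line, rounded to 2 decimals like the `montant` column."""
    return round(unit_price * quantity * (1 - discount), 2)


# Illustrative order lines (assumed values): (unit_price, quantity, discount)
lines = [(14.0, 12, 0.0), (9.8, 10, 0.0), (34.8, 5, 0.0)]

amounts = [line_amount(*line) for line in lines]
total = round(sum(amounts), 2)

print(amounts)
print(total)
```

A discount of 0.25 on two items at 10.0 would give `line_amount(10.0, 2, 0.25)`, that is 15.0.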


## 3. Aggregation by date

In [4]:
# Sales per day (panier_moyen_ligne is the average amount per order line, not per order)
gold_ventes_jour = df_ventes.groupBy("order_date").agg(
    F.countDistinct("order_id").alias("nb_commandes"),
    F.sum("quantity").alias("nb_articles"),
    F.round(F.sum("montant"), 2).alias("ca_total"),
    F.round(F.avg("montant"), 2).alias("panier_moyen_ligne")
).orderBy("order_date")

print("Sales per day:")
gold_ventes_jour.show(10)

Sales per day:
+----------+------------+-----------+--------+------------------+
|order_date|nb_commandes|nb_articles|ca_total|panier_moyen_ligne|
+----------+------------+-----------+--------+------------------+
|1996-07-04|           1|         27|   440.0|            146.67|
|1996-07-05|           1|         49|  1863.4|             931.7|
|1996-07-08|           2|        101| 2206.66|            367.78|
|1996-07-09|           1|        105|  3597.9|            1199.3|
|1996-07-10|           1|        102|  1444.8|             481.6|
|1996-07-11|           1|         57|  556.62|            185.54|
|1996-07-12|           1|        110|  2490.5|            622.63|
|1996-07-15|           1|         27|   517.8|             258.9|
|1996-07-16|           1|         46|  1119.9|             373.3|
|1996-07-17|           1|        121| 1614.88|            538.29|
+----------+------------+-----------+--------+------------------+
only showing top 10 rows
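To make the four daily metrics concrete, here is a small plain-Python sketch of the same grouping logic. The order lines and order ids are assumed, illustrative values (not read from the database), chosen to be consistent with the first two rows of the output above. The comments map each computed field to the Spark aggregate it imitates.

```python
from collections import defaultdict

# Illustrative order lines (assumed values): (order_date, order_id, quantity, montant)
rows = [
    ("1996-07-04", 10248, 12, 168.0),
    ("1996-07-04", 10248, 10, 98.0),
    ("1996-07-04", 10248, 5, 174.0),
    ("1996-07-05", 10249, 9, 167.4),
    ("1996-07-05", 10249, 40, 1696.0),
]

# Group the lines by order date, as groupBy("order_date") does
groups = defaultdict(list)
for order_date, order_id, quantity, montant in rows:
    groups[order_date].append((order_id, quantity, montant))

daily = {}
for order_date, items in sorted(groups.items()):
    amounts = [m for _, _, m in items]
    daily[order_date] = {
        "nb_commandes": len({oid for oid, _, _ in items}),            # countDistinct("order_id")
        "nb_articles": sum(q for _, q, _ in items),                   # sum("quantity")
        "ca_total": round(sum(amounts), 2),                           # sum("montant")
        "panier_moyen_ligne": round(sum(amounts) / len(amounts), 2),  # avg("montant")
    }

for order_date, metrics in daily.items():
    print(order_date, metrics)
```

Note that `nb_commandes` counts distinct orders while the average is taken over order lines, which is why a single order of three lines yields an average of 146.67 rather than 440.0.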


In [5]:
# Sales per month
gold_ventes_mois = df_ventes \
    .withColumn("annee", F.year("order_date")) \
    .withColumn("mois", F.month("order_date")) \
    .groupBy("annee", "mois").agg(
        F.countDistinct("order_id").alias("nb_commandes"),
        F.round(F.sum("montant"), 2).alias("ca_total")
    ).orderBy("annee", "mois")

print("Sales per month:")
gold_ventes_mois.show()

Sales per month:
+-----+----+------------+--------+
|annee|mois|nb_commandes|ca_total|
+-----+----+------------+--------+
| 1996|   7|          22|27861.89|
| 1996|   8|          25|25485.27|
| 1996|   9|          23| 26381.4|
| 1996|  10|          26|37515.72|
| 1996|  11|          25|45600.04|
| 1996|  12|          31|45239.63|
| 1997|   1|          33|61258.06|
| 1997|   2|          29|38483.63|
| 1997|   3|          30|38547.21|
| 1997|   4|          31|53032.95|
| 1997|   5|          32|53781.29|
| 1997|   6|          30|36362.79|
| 1997|   7|          33|51020.84|
| 1997|   8|          33|47287.67|
| 1997|   9|          37|55629.23|
| 1997|  10|          38|66749.24|
| 1997|  11|          34|43533.79|
| 1997|  12|          48|71398.41|
| 1998|   1|          55|94222.12|
| 1998|   2|          54|99415.29|
+-----+----+------------+--------+
only showing top 20 rows


In [6]:
# Add the month-over-month growth rate
# No partitionBy: every row goes to a single partition, which is acceptable for a few monthly rows
window_mois = Window.orderBy("annee", "mois")

gold_ventes_mois_growth = gold_ventes_mois.withColumn(
    "ca_mois_precedent",
    F.lag("ca_total", 1).over(window_mois)
).withColumn(
    "croissance_pct",
    F.round((F.col("ca_total") - F.col("ca_mois_precedent")) / F.col("ca_mois_precedent") * 100, 2)
)

gold_ventes_mois_growth.show()

+-----+----+------------+--------+-----------------+--------------+
|annee|mois|nb_commandes|ca_total|ca_mois_precedent|croissance_pct|
+-----+----+------------+--------+-----------------+--------------+
| 1996|   7|          22|27861.89|             NULL|          NULL|
| 1996|   8|          25|25485.27|         27861.89|         -8.53|
| 1996|   9|          23| 26381.4|         25485.27|          3.52|
| 1996|  10|          26|37515.72|          26381.4|         42.21|
| 1996|  11|          25|45600.04|         37515.72|         21.55|
| 1996|  12|          31|45239.63|         45600.04|         -0.79|
| 1997|   1|          33|61258.06|         45239.63|         35.41|
| 1997|   2|          29|38483.63|         61258.06|        -37.18|
| 1997|   3|          30|38547.21|         38483.63|          0.17|
| 1997|   4|          31|53032.95|         38547.21|         37.58|
| 1997|   5|          32|53781.29|         53032.95|          1.41|
| 1997|   6|          30|36362.79|         53781
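The growth column is the relative change against the previous month: (current - previous) / previous * 100. A minimal plain-Python sketch of that calculation follows, using the first four monthly totals from the table above. The function name `month_over_month` is a hypothetical helper, and returning `None` when the previous value is zero is a choice made for this sketch.

```python
from typing import List, Optional, Tuple


def month_over_month(series: List[Tuple[int, int, float]]) -> List[Optional[float]]:
    """Growth in % versus the previous row, mirroring lag() over an ordered window."""
    ordered = sorted(series)  # order by (annee, mois)
    growth: List[Optional[float]] = []
    previous: Optional[float] = None
    for _, _, ca_total in ordered:
        if previous is None or previous == 0:
            growth.append(None)  # no previous month, or nothing to divide by
        else:
            growth.append(round((ca_total - previous) / previous * 100, 2))
        previous = ca_total
    return growth


# First four months taken from the monthly table: (annee, mois, ca_total)
monthly = [(1996, 7, 27861.89), (1996, 8, 25485.27), (1996, 9, 26381.4), (1996, 10, 37515.72)]
print(month_over_month(monthly))
```

The first month has no predecessor, so its growth is undefined, which matches the NULL in the first row of the Spark output.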

## 4. Aggregation by product

In [7]:
# Enrich with product and category names
df_ventes_produit = df_ventes.join(
    df_products.select("product_id", "product_name", "category_id"),
    on="product_id",
    how="left"
).join(
    df_categories.select("category_id", "category_name"),
    on="category_id",
    how="left"
)

In [8]:
# Top products by revenue
gold_top_produits = df_ventes_produit.groupBy("product_id", "product_name").agg(
    F.sum("quantity").alias("quantite_vendue"),
    F.round(F.sum("montant"), 2).alias("ca_total"),
    F.countDistinct("order_id").alias("nb_commandes")
).orderBy(F.desc("ca_total"))

print("Top 10 products by revenue:")
gold_top_produits.show(10)

Top 10 products by revenue:
+----------+--------------------+---------------+---------+------------+
|product_id|        product_name|quantite_vendue| ca_total|nb_commandes|
+----------+--------------------+---------------+---------+------------+
|        38|       Côte de Blaye|            623|141396.73|          24|
|        29|Thüringer Rostbra...|            746| 80368.69|          32|
|        59|Raclette Courdavault|           1496|  71155.7|          54|
|        62|      Tarte au sucre|           1083| 47234.95|          48|
|        60|   Camembert Pierrot|           1577| 46825.48|          51|
|        56|Gnocchi di nonna ...|           1263| 42593.06|          50|
|        51|Manjimup Dried Ap...|            886| 41819.65|          39|
|        17|        Alice Mutton|            978| 32698.38|          37|
|        18|    Carnarvon Tigers|            539| 29171.88|          27|
|        28|   Rössle Sauerkraut|            640| 25696.64|          33|
+----------+--------------------+---------------+---------+------------+

In [9]:
# Sales by category
gold_categories = df_ventes_produit.groupBy("category_id", "category_name").agg(
    F.countDistinct("product_id").alias("nb_produits"),
    F.sum("quantity").alias("quantite_vendue"),
    F.round(F.sum("montant"), 2).alias("ca_total")
).orderBy(F.desc("ca_total"))

print("Sales by category:")
gold_categories.show()

Sales by category:
+-----------+--------------+-----------+---------------+---------+
|category_id| category_name|nb_produits|quantite_vendue| ca_total|
+-----------+--------------+-----------+---------------+---------+
|          1|     Beverages|         12|           9532|267868.16|
|          4|Dairy Products|         10|           9149|234507.26|
|          3|   Confections|         13|           7906|167357.19|
|          6|  Meat/Poultry|          6|           4199|163022.37|
|          8|       Seafood|         12|           7681|131261.72|
|          2|    Condiments|         12|           5298|106047.09|
|          7|       Produce|          5|           2990| 99984.57|
|          5|Grains/Cereals|          7|           4562| 95744.59|
+-----------+--------------+-----------+---------------+---------+



## 5. Aggregation by customer

In [10]:
# Enrich with customer data
df_ventes_client = df_ventes.join(
    df_customers.select("customer_id", "company_name", "country"),
    on="customer_id",
    how="left"
)

In [11]:
# RFM analysis (Recency, Frequency, Monetary)
# Reference date = most recent order date in the dataset
date_reference = df_ventes_client.agg(F.max("order_date")).collect()[0][0]

gold_rfm = df_ventes_client.groupBy("customer_id", "company_name", "country").agg(
    F.datediff(F.lit(date_reference), F.max("order_date")).alias("recence_jours"),
    F.countDistinct("order_id").alias("frequence"),
    F.round(F.sum("montant"), 2).alias("montant_total")
)

print("RFM analysis:")
gold_rfm.orderBy(F.desc("montant_total")).show(10)

RFM analysis:
+-----------+--------------------+-------+-------------+---------+-------------+
|customer_id|        company_name|country|recence_jours|frequence|montant_total|
+-----------+--------------------+-------+-------------+---------+-------------+
|      QUICK|          QUICK-Stop|Germany|           22|       28|    110277.31|
|      ERNSH|        Ernst Handel|Austria|            1|       30|    104874.97|
|      SAVEA|  Save-a-lot Markets|    USA|            5|       31|    104361.94|
|      RATTC|Rattlesnake Canyo...|    USA|            0|       18|      51097.8|
|      HUNGO|Hungry Owl All-Ni...|Ireland|            6|       19|      49979.9|
|      HANAR|       Hanari Carnes| Brazil|            9|       14|     32841.37|
|      KOENE|     Königlich Essen|Germany|           20|       14|     30908.38|
|      FOLKO|      Folk och fä HB| Sweden|            9|       19|     29567.56|
|      MEREP|      Mère Paillarde| Canada|          188|       13|     28872.19|
|      WHITC|W
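The three RFM metrics can be sketched in plain Python to show what each aggregate measures. The orders below are assumed, illustrative values (not read from the database). As in the notebook, the reference date is the most recent order date in the dataset, so the most recent customer always has a recency of 0.

```python
from datetime import date

# Illustrative order lines (assumed values): (customer_id, order_id, order_date, montant)
orders = [
    ("A", 1, date(1998, 4, 1), 100.0),
    ("A", 1, date(1998, 4, 1), 50.0),   # second line of the same order
    ("A", 2, date(1998, 5, 1), 200.0),
    ("B", 3, date(1998, 5, 6), 75.5),
]

# Reference date = most recent order in the dataset
reference = max(order_date for _, _, order_date, _ in orders)

rfm = {}
for customer_id in sorted({c for c, _, _, _ in orders}):
    mine = [o for o in orders if o[0] == customer_id]
    rfm[customer_id] = {
        "recence_jours": (reference - max(o[2] for o in mine)).days,  # datediff(reference, max(order_date))
        "frequence": len({o[1] for o in mine}),                       # countDistinct("order_id")
        "montant_total": round(sum(o[3] for o in mine), 2),           # sum("montant")
    }

print(rfm)
```

Frequency counts distinct orders, not order lines: customer A has three lines but only two orders.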

In [12]:
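# Customer segmentation: conditions are evaluated in order and the first match wins
# Labels are stored as-is: Fidele = loyal, Actif = active, A risque = at risk, Inactif = inactive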
gold_segments = gold_rfm.withColumn(
    "segment",
    F.when((F.col("recence_jours") <= 30) & (F.col("frequence") >= 5), "VIP")
     .when((F.col("recence_jours") <= 60) & (F.col("frequence") >= 3), "Fidele")
     .when(F.col("recence_jours") <= 90, "Actif")
     .when(F.col("recence_jours") <= 180, "A risque")
     .otherwise("Inactif")
)

# Breakdown by segment
gold_segments.groupBy("segment").agg(
    F.count("*").alias("nb_clients"),
    F.round(F.sum("montant_total"), 2).alias("ca_total")
).orderBy(F.desc("ca_total")).show()

+--------+----------+----------+
| segment|nb_clients|  ca_total|
+--------+----------+----------+
|     VIP|        51|1030351.06|
|  Fidele|        16|  84680.39|
|A risque|        11|  76254.86|
|   Actif|         6|   38005.9|
| Inactif|         5|  36500.74|
+--------+----------+----------+
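The chained `F.when(...)` calls behave like an if/elif ladder: the first condition that holds decides the segment. The plain-Python sketch below mirrors the same thresholds so the rules can be checked by hand. The function name `segment` is a hypothetical helper, and the labels are the ones stored by the notebook.

```python
def segment(recence_jours: int, frequence: int) -> str:
    """Return the customer segment; the first matching rule wins."""
    if recence_jours <= 30 and frequence >= 5:
        return "VIP"
    if recence_jours <= 60 and frequence >= 3:
        return "Fidele"
    if recence_jours <= 90:
        return "Actif"
    if recence_jours <= 180:
        return "A risque"
    return "Inactif"


print(segment(22, 28))   # QUICK row of the RFM table
print(segment(188, 13))  # MEREP row of the RFM table
print(segment(25, 2))    # recent but infrequent customer
```

Because the rules are ordered, a recent customer with only two orders falls through the first two conditions and lands in "Actif", while a customer with many orders but no purchase for more than 180 days is "Inactif" regardless of frequency.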



## 6. Geographic aggregation

In [13]:
# Sales by country
gold_pays = df_ventes_client.groupBy("country").agg(
    F.countDistinct("customer_id").alias("nb_clients"),
    F.countDistinct("order_id").alias("nb_commandes"),
    F.round(F.sum("montant"), 2).alias("ca_total")
).withColumn(
    "ca_moyen_client",
    F.round(F.col("ca_total") / F.col("nb_clients"), 2)
).orderBy(F.desc("ca_total"))

print("Sales by country:")
gold_pays.show()

Sales by country:
+-----------+----------+------------+---------+---------------+
|    country|nb_clients|nb_commandes| ca_total|ca_moyen_client|
+-----------+----------+------------+---------+---------------+
|        USA|        13|         122|245584.59|       18891.12|
|    Germany|        11|         122|230284.62|       20934.97|
|    Austria|         2|          40|128003.83|       64001.92|
|     Brazil|         9|          83|106925.77|       11880.64|
|     France|        10|          77| 81358.31|        8135.83|
|         UK|         7|          56| 58971.31|        8424.47|
|  Venezuela|         4|          46| 56810.63|       14202.66|
|     Sweden|         2|          37| 54495.14|       27247.57|
|     Canada|         3|          30| 50196.29|        16732.1|
|    Ireland|         1|          19|  49979.9|        49979.9|
|    Belgium|         2|          19| 33824.86|       16912.43|
|    Denmark|         2|          18| 32661.02|       16330.51|
|Switzerland|         
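The `ca_moyen_client` column is simply `ca_total / nb_clients`, rounded to two decimals. PySpark documents `F.round` as using HALF_UP rounding, whereas Python's built-in `round` on floats can round a value such as 64001.915 either way depending on its binary representation. The sketch below therefore uses `decimal` with an explicit HALF_UP mode to reproduce two rows of the table above. The function name `average_per_customer` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP


def average_per_customer(ca_total: str, nb_clients: int) -> Decimal:
    """ca_total / nb_clients, rounded half-up to 2 decimals."""
    value = Decimal(ca_total) / Decimal(nb_clients)
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


usa = average_per_customer("245584.59", 13)     # USA row
austria = average_per_customer("128003.83", 2)  # Austria row

print(usa)
print(austria)
```

A high average with very few customers (Austria has 2) says more about one or two large accounts than about the country as a market, so this column is best read together with `nb_clients`.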

## 7. Global KPIs

In [14]:
# Compute the KPIs
kpis = df_ventes.agg(
    F.round(F.sum("montant"), 2).alias("ca_total"),
    F.countDistinct("order_id").alias("nb_commandes_total"),
    F.countDistinct("customer_id").alias("nb_clients_actifs"),
    F.sum("quantity").alias("nb_articles_vendus")
).collect()[0]

print("=" * 50)
print("GLOBAL KPIs")
print("=" * 50)
print(f"Total revenue     : {kpis['ca_total']:,.2f} EUR")
print(f"Orders            : {kpis['nb_commandes_total']:,}")
print(f"Active customers  : {kpis['nb_clients_actifs']:,}")
print(f"Items sold        : {kpis['nb_articles_vendus']:,}")
print(f"Average basket    : {kpis['ca_total'] / kpis['nb_commandes_total']:,.2f} EUR")
print("=" * 50)

GLOBAL KPIs
Total revenue     : 1,265,792.95 EUR
Orders            : 830
Active customers  : 89
Items sold        : 51,317
Average basket    : 1,525.05 EUR


In [15]:
# Create a KPI table
gold_kpis = spark.createDataFrame([
    ("ca_total", float(kpis['ca_total']), "EUR"),
    ("nb_commandes", float(kpis['nb_commandes_total']), "count"),
    ("nb_clients", float(kpis['nb_clients_actifs']), "count"),
    ("nb_articles", float(kpis['nb_articles_vendus']), "count"),
    ("panier_moyen", float(kpis['ca_total'] / kpis['nb_commandes_total']), "EUR")
], ["kpi", "valeur", "unite"])

gold_kpis = gold_kpis.withColumn("date_calcul", F.lit(date_traitement))
gold_kpis.show()

+------------+------------------+-----+-----------+
|         kpi|            valeur|unite|date_calcul|
+------------+------------------+-----+-----------+
|    ca_total|        1265792.95|  EUR| 2026-01-16|
|nb_commandes|             830.0|count| 2026-01-16|
|  nb_clients|              89.0|count| 2026-01-16|
| nb_articles|           51317.0|count| 2026-01-16|
|panier_moyen|1525.0517469879517|  EUR| 2026-01-16|
+------------+------------------+-----+-----------+
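Two different averages appear in this notebook and are easy to confuse: the average basket per order (`panier_moyen` here) and the average amount per order line (`panier_moyen_ligne` in the daily table). The sketch below recomputes both from the totals printed above. It assumes that all 2155 order lines have a non-null `montant`, since an average ignores nulls.

```python
# Totals taken from the outputs above
ca_total = 1265792.95
nb_commandes = 830
nb_lignes = 2155  # assumption: every order line has a non-null montant

panier_moyen = ca_total / nb_commandes      # average revenue per order
montant_moyen_ligne = ca_total / nb_lignes  # average revenue per order line

print(f"{panier_moyen:,.2f}")
print(f"{montant_moyen_ligne:,.2f}")
```

The per-order figure is roughly 2.6 times the per-line figure because an order contains about 2.6 lines on average (2155 / 830).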



## 8. Save the Gold tables

In [16]:
# Save all Gold tables
tables_gold = {
    "ventes_jour": gold_ventes_jour,
    "ventes_mois": gold_ventes_mois_growth,
    "top_produits": gold_top_produits,
    "categories": gold_categories,
    "clients_rfm": gold_segments,
    "ventes_pays": gold_pays,
    "kpis": gold_kpis
}

print("Saving the Gold tables:")
print("=" * 50)

for nom, df in tables_gold.items():
    # One folder per table and per processing date; overwrite lets the same day be rerun
    chemin = f"s3a://gold/{nom}/{date_traitement}"
    df.write.mode("overwrite").parquet(chemin)
    # Note: df.count() triggers a second computation of the DataFrame
    print(f"[OK] {nom}: {df.count()} rows -> {chemin}")

print("=" * 50)
print("Save complete")

Saving the Gold tables:
[OK] ventes_jour: 480 rows -> s3a://gold/ventes_jour/2026-01-16
[OK] ventes_mois: 23 rows -> s3a://gold/ventes_mois/2026-01-16
[OK] top_produits: 77 rows -> s3a://gold/top_produits/2026-01-16
[OK] categories: 8 rows -> s3a://gold/categories/2026-01-16
[OK] clients_rfm: 89 rows -> s3a://gold/clients_rfm/2026-01-16
[OK] ventes_pays: 21 rows -> s3a://gold/ventes_pays/2026-01-16
[OK] kpis: 5 rows -> s3a://gold/kpis/2026-01-16
Save complete
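Each table is written to its own folder, with one subfolder per processing date, so a rerun on the same day replaces that day's snapshot and leaves earlier dates untouched. The sketch below isolates the path convention in plain Python. The helper `gold_path` and its `bucket` parameter are hypothetical names introduced for illustration, and a fixed date is used instead of `datetime.now()` so the result is reproducible.

```python
from datetime import date


def gold_path(table: str, processing_date: date, bucket: str = "gold") -> str:
    """Build the dated target path of one Gold table (hypothetical helper)."""
    return f"s3a://{bucket}/{table}/{processing_date.isoformat()}"


tables = ["ventes_jour", "ventes_mois", "kpis"]
run_date = date(2026, 1, 16)

paths = {name: gold_path(name, run_date) for name in tables}
for name, path in paths.items():
    print(f"{name} -> {path}")
```

Keeping the path logic in one function makes it easier to change the layout later, for example to a different bucket, without touching the write loop.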


---

## Exercise

**Goal**: Build a Gold analysis of employees

**Instructions**:
1. Compute sales per employee (number of orders, revenue)
2. Rank the employees by performance
3. Compute each employee's share of revenue (in %)
4. Save the result to Gold

Your turn:

In [17]:
# TODO: Load employees
df_employees = spark.read.jdbc(url=jdbc_url, table="employees", properties=jdbc_properties)
df_employees.show(5)

+-----------+---------+----------+--------------------+-----------------+----------+----------+--------------------+--------+------+-----------+-------+--------------+---------+-----+--------------------+----------+--------------------+
|employee_id|last_name|first_name|               title|title_of_courtesy|birth_date| hire_date|             address|    city|region|postal_code|country|    home_phone|extension|photo|               notes|reports_to|          photo_path|
+-----------+---------+----------+--------------------+-----------------+----------+----------+--------------------+--------+------+-----------+-------+--------------+---------+-----+--------------------+----------+--------------------+
|          1|  Davolio|     Nancy|Sales Representative|              Ms.|1948-12-08|1992-05-01|507 - 20th Ave. E...| Seattle|    WA|      98122|    USA|(206) 555-9857|     5467|   []|Education include...|         2|http://accweb/emm...|
|          2|   Fuller|    Andrew|Vice President, S.

In [18]:
# TODO: Compute sales per employee

# Aggregate the metrics (revenue and number of orders)
df_emp_stats = df_ventes.groupBy("employee_id").agg(
    F.countDistinct("order_id").alias("nb_commandes"),
    F.round(F.sum("montant"), 2).alias("ca_total")
)

# Add the employee names (join)
gold_employees = df_emp_stats.join(
    df_employees.select("employee_id", "first_name", "last_name"),
    on="employee_id",
    how="left"
)

In [19]:
# TODO: Rank and compute the shares
from pyspark.sql.window import Window

# Define the windows
window_global = Window.partitionBy()  # Empty partitionBy = a single window over the whole dataset
window_classement = Window.orderBy(F.desc("ca_total"))

gold_employees_perf = gold_employees.withColumn(
    "rang",
    F.rank().over(window_classement)
).withColumn(
    "total_global", 
    F.sum("ca_total").over(window_global)
).withColumn(
    "part_ca_pct",
    F.round((F.col("ca_total") / F.col("total_global")) * 100, 2)
).drop("total_global")

# Display the final report
print("--- Employee performance ---")
gold_employees_perf.select(
    "rang", "first_name", "last_name", "nb_commandes", "ca_total", "part_ca_pct"
).orderBy("rang").show()

--- Employee performance ---
+----+----------+---------+------------+---------+-----------+
|rang|first_name|last_name|nb_commandes| ca_total|part_ca_pct|
+----+----------+---------+------------+---------+-----------+
|   1|  Margaret|  Peacock|         156|232890.83|       18.4|
|   2|     Janet|Leverling|         127|202812.83|      16.02|
|   3|     Nancy|  Davolio|         123|192107.57|      15.18|
|   4|    Andrew|   Fuller|          96|166537.76|      13.16|
|   5|     Laura| Callahan|         104|126862.28|      10.02|
|   6|    Robert|     King|          72|124568.24|       9.84|
|   7|      Anne|Dodsworth|          43| 77308.05|       6.11|
|   8|   Michael|   Suyama|          67| 73913.13|       5.84|
|   9|    Steven| Buchanan|          42| 68792.26|       5.43|
+----+----------+---------+------------+---------+-----------+
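The rank and the revenue share can be verified by hand with a short plain-Python sketch that uses the `ca_total` values from the report above. The rank follows the usual SQL `RANK` semantics (1 plus the number of rows with a strictly higher value, so ties would share a rank and leave a gap), and the share is `ca_total` divided by the sum over all employees.

```python
# ca_total per employee, taken from the report above
ca_by_employee = {
    "Peacock": 232890.83, "Leverling": 202812.83, "Davolio": 192107.57,
    "Fuller": 166537.76, "Callahan": 126862.28, "King": 124568.24,
    "Dodsworth": 77308.05, "Suyama": 73913.13, "Buchanan": 68792.26,
}

total = sum(ca_by_employee.values())
ordered = sorted(ca_by_employee.items(), key=lambda item: item[1], reverse=True)

report = []
for name, ca in ordered:
    # rank(): 1 + number of employees with a strictly higher ca_total
    rang = 1 + sum(1 for other in ca_by_employee.values() if other > ca)
    part_ca_pct = round(ca / total * 100, 2)
    report.append((rang, name, part_ca_pct))

for row in report:
    print(row)
```

The employee totals add up to the global revenue KPI (1,265,792.95), which is a quick consistency check between the two Gold tables.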



---

## Summary

In this notebook, you learned:
- The role of the **Gold** layer in a Data Lake
- How to create **time-based aggregations** (day, month)
- How to compute **business KPIs**
- How to run a customer **RFM analysis**
- How to **segment** the data
- How to **save** multiple Gold tables

### Next step
In the next notebook, we will cover advanced DataFrame operations.