# Exercise 10 - Silver Transformations

## Objectives
- Enrich the data with computed columns
- Apply business transformations
- Normalize data structures
- Create denormalized views

---

## 1. Types of Silver transformations

```
+------------------------------------------------------------------+
|                    SILVER TRANSFORMATIONS                        |
+------------------------------------------------------------------+
|                                                                  |
|  +------------------------+   +------------------------+        |
|  | Computed columns       |   | Enrichment             |        |
|  +------------------------+   +------------------------+        |
|  | - Arithmetic           |   | - Table joins          |        |
|  | - Derived dates        |   | - External lookup      |        |
|  | - Categorization       |   | - Added dimensions     |        |
|  +------------------------+   +------------------------+        |
|                                                                  |
|  +------------------------+   +------------------------+        |
|  | Normalization          |   | Denormalization        |        |
|  +------------------------+   +------------------------+        |
|  | - Uniform types        |   | - Flattened view       |        |
|  | - Standard units       |   | - All in one table     |        |
|  | - Normalized codes     |   | - Read performance     |        |
|  +------------------------+   +------------------------+        |
|                                                                  |
+------------------------------------------------------------------+
```

## 2. Configuration

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from datetime import datetime

# Create the SparkSession
spark = SparkSession.builder \
    .appName("Transformations Silver") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0,org.apache.hadoop:hadoop-aws:3.4.1,com.amazonaws:aws-java-sdk-bundle:1.12.262") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

print("Spark ready")

Spark ready


In [2]:
# PostgreSQL configuration
jdbc_url = "jdbc:postgresql://postgres:5432/app"
jdbc_properties = {
    "user": "postgres",
    "password": "postgres",
    "driver": "org.postgresql.Driver"
}

# Load the Northwind tables
df_orders = spark.read.jdbc(url=jdbc_url, table="orders", properties=jdbc_properties)
df_order_details = spark.read.jdbc(url=jdbc_url, table="order_details", properties=jdbc_properties)
df_products = spark.read.jdbc(url=jdbc_url, table="products", properties=jdbc_properties)
df_customers = spark.read.jdbc(url=jdbc_url, table="customers", properties=jdbc_properties)
df_categories = spark.read.jdbc(url=jdbc_url, table="categories", properties=jdbc_properties)

print("Tables loaded")

Tables loaded


## 3. Computed columns

In [3]:
# Compute the amount for each order line
df_details = df_order_details.withColumn(
    "montant_ligne",
    F.round(F.col("unit_price") * F.col("quantity") * (1 - F.col("discount")), 2)
)

df_details.select("order_id", "product_id", "unit_price", "quantity", "discount", "montant_ligne").show(10)

+--------+----------+----------+--------+--------+-------------+
|order_id|product_id|unit_price|quantity|discount|montant_ligne|
+--------+----------+----------+--------+--------+-------------+
|   10248|        11|      14.0|      12|     0.0|        168.0|
|   10248|        42|       9.8|      10|     0.0|         98.0|
|   10248|        72|      34.8|       5|     0.0|        174.0|
|   10249|        14|      18.6|       9|     0.0|        167.4|
|   10249|        51|      42.4|      40|     0.0|       1696.0|
|   10250|        41|       7.7|      10|     0.0|         77.0|
|   10250|        51|      42.4|      35|    0.15|       1261.4|
|   10250|        65|      16.8|      15|    0.15|        214.2|
|   10251|        22|      16.8|       6|    0.05|        95.76|
|   10251|        57|      15.6|      15|    0.05|        222.3|
+--------+----------+----------+--------+--------+-------------+
only showing top 10 rows
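
As a sanity check on the formula above, the same arithmetic can be reproduced outside Spark. The sketch below is plain Python, not PySpark. The helper name `line_amount` is made up for illustration, and it uses `Decimal` with half-up rounding, which is the rounding mode `F.round` applies. The sample rows are copied from the output above.

```python
from decimal import Decimal, ROUND_HALF_UP

def line_amount(unit_price: str, quantity: int, discount: str) -> Decimal:
    """unit_price * quantity * (1 - discount), rounded to 2 decimals (half-up)."""
    raw = Decimal(unit_price) * quantity * (Decimal(1) - Decimal(discount))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# (unit_price, quantity, discount) taken from three rows of the output above
samples = [("42.4", 35, "0.15"), ("16.8", 6, "0.05"), ("14.0", 12, "0.0")]
amounts = [line_amount(*row) for row in samples]
print(amounts)
```

The three results correspond to the `montant_ligne` values 1261.4, 95.76 and 168.0 shown by Spark.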


In [4]:
# Add derived date columns
# Note: F.dayofweek returns 1 for Sunday through 7 for Saturday
df_orders_enrichi = df_orders \
    .withColumn("annee", F.year("order_date")) \
    .withColumn("mois", F.month("order_date")) \
    .withColumn("trimestre", F.quarter("order_date")) \
    .withColumn("jour_semaine", F.dayofweek("order_date")) \
    .withColumn("nom_jour", F.date_format("order_date", "EEEE")) \
    .withColumn("nom_mois", F.date_format("order_date", "MMMM"))

df_orders_enrichi.select("order_id", "order_date", "annee", "mois", "trimestre", "nom_jour").show(10)

+--------+----------+-----+----+---------+---------+
|order_id|order_date|annee|mois|trimestre| nom_jour|
+--------+----------+-----+----+---------+---------+
|   10248|1996-07-04| 1996|   7|        3| Thursday|
|   10249|1996-07-05| 1996|   7|        3|   Friday|
|   10250|1996-07-08| 1996|   7|        3|   Monday|
|   10251|1996-07-08| 1996|   7|        3|   Monday|
|   10252|1996-07-09| 1996|   7|        3|  Tuesday|
|   10253|1996-07-10| 1996|   7|        3|Wednesday|
|   10254|1996-07-11| 1996|   7|        3| Thursday|
|   10255|1996-07-12| 1996|   7|        3|   Friday|
|   10256|1996-07-15| 1996|   7|        3|   Monday|
|   10257|1996-07-16| 1996|   7|        3|  Tuesday|
+--------+----------+-----+----+---------+---------+
only showing top 10 rows


In [5]:
# Compute the delivery lead time in days (null while shipped_date is null)
df_orders_enrichi = df_orders_enrichi.withColumn(
    "delai_livraison_jours",
    F.datediff(F.col("shipped_date"), F.col("order_date"))
)

df_orders_enrichi.select("order_id", "order_date", "shipped_date", "delai_livraison_jours").show(10)

+--------+----------+------------+---------------------+
|order_id|order_date|shipped_date|delai_livraison_jours|
+--------+----------+------------+---------------------+
|   10248|1996-07-04|  1996-07-16|                   12|
|   10249|1996-07-05|  1996-07-10|                    5|
|   10250|1996-07-08|  1996-07-12|                    4|
|   10251|1996-07-08|  1996-07-15|                    7|
|   10252|1996-07-09|  1996-07-11|                    2|
|   10253|1996-07-10|  1996-07-16|                    6|
|   10254|1996-07-11|  1996-07-23|                   12|
|   10255|1996-07-12|  1996-07-15|                    3|
|   10256|1996-07-15|  1996-07-17|                    2|
|   10257|1996-07-16|  1996-07-22|                    6|
+--------+----------+------------+---------------------+
only showing top 10 rows


## 4. Categorization

In [6]:
# Categorize products by price tier
df_products_cat = df_products.withColumn(
    "niveau_prix",
    F.when(F.col("unit_price") < 10, "Economique")
     .when(F.col("unit_price") < 25, "Standard")
     .when(F.col("unit_price") < 50, "Premium")
     .otherwise("Luxe")
)

df_products_cat.select("product_name", "unit_price", "niveau_prix").show(10)

+--------------------+----------+-----------+
|        product_name|unit_price|niveau_prix|
+--------------------+----------+-----------+
|                Chai|      18.0|   Standard|
|               Chang|      19.0|   Standard|
|       Aniseed Syrup|      10.0|   Standard|
|Chef Anton's Caju...|      22.0|   Standard|
|Chef Anton's Gumb...|     21.35|   Standard|
|Grandma's Boysenb...|      25.0|    Premium|
|Uncle Bob's Organ...|      30.0|    Premium|
|Northwoods Cranbe...|      40.0|    Premium|
|     Mishi Kobe Niku|      97.0|       Luxe|
|               Ikura|      31.0|    Premium|
+--------------------+----------+-----------+
only showing top 10 rows
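
The `when` chain above is evaluated top to bottom and the first matching branch wins, so each threshold acts as an exclusive upper bound: a price of exactly 10.0 lands in "Standard" and 25.0 in "Premium", as the output shows. A minimal plain-Python sketch of the same bucketing follows. It is not PySpark, and the names `THRESHOLDS`, `LABELS` and the Python function `niveau_prix` are illustrative assumptions.

```python
from bisect import bisect_right

# Upper bounds are exclusive, mirroring the `<` comparisons in the when() chain
THRESHOLDS = [10, 25, 50]
LABELS = ["Economique", "Standard", "Premium", "Luxe"]

def niveau_prix(unit_price: float) -> str:
    # bisect_right counts how many thresholds are <= unit_price
    return LABELS[bisect_right(THRESHOLDS, unit_price)]

for price in (9.99, 10.0, 25.0, 97.0):
    print(price, niveau_prix(price))
```

Checking the boundary values this way is a cheap way to confirm which side of each threshold a price falls on before running the Spark job.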


In [7]:
# Categorize delivery lead times
df_orders_cat = df_orders_enrichi.withColumn(
    "categorie_delai",
    F.when(F.col("delai_livraison_jours").isNull(), "Non expedie")
     .when(F.col("delai_livraison_jours") <= 3, "Rapide")
     .when(F.col("delai_livraison_jours") <= 7, "Normal")
     .when(F.col("delai_livraison_jours") <= 14, "Lent")
     .otherwise("Tres lent")
)

df_orders_cat.groupBy("categorie_delai").count().show()

+---------------+-----+
|categorie_delai|count|
+---------------+-----+
|           Lent|  240|
|    Non expedie|   21|
|      Tres lent|   96|
|         Rapide|  146|
|         Normal|  327|
+---------------+-----+
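
The `isNull()` branch is placed first for a reason. In Spark a comparison such as `NULL <= 3` evaluates to NULL, which `when` treats as "no match", so without that first branch the 21 unshipped orders would fall through to `otherwise` and be labeled "Tres lent". The sketch below models this in plain Python, with `None` standing in for NULL. The `check_null_first` flag is a hypothetical switch added only to show both behaviors.

```python
from typing import Optional

def categorie_delai(delai: Optional[int], check_null_first: bool = True) -> str:
    """Plain-Python model of the when() chain; None stands in for SQL NULL."""
    if check_null_first and delai is None:
        return "Non expedie"
    # A NULL comparison never matches, so every branch below is skipped for None
    if delai is not None and delai <= 3:
        return "Rapide"
    if delai is not None and delai <= 7:
        return "Normal"
    if delai is not None and delai <= 14:
        return "Lent"
    return "Tres lent"

print(categorie_delai(None))                          # explicit NULL branch
print(categorie_delai(None, check_null_first=False))  # NULL falls to otherwise
```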



## 5. Enrichment through joins

In [8]:
# Join order_details with products
df_details_enrichi = df_details.join(
    df_products.select("product_id", "product_name", "category_id"),
    on="product_id",
    how="left"
)

df_details_enrichi.show(5)

+----------+--------+----------+--------+--------+-------------+--------------------+-----------+
|product_id|order_id|unit_price|quantity|discount|montant_ligne|        product_name|category_id|
+----------+--------+----------+--------+--------+-------------+--------------------+-----------+
|        11|   10248|      14.0|      12|     0.0|        168.0|      Queso Cabrales|          4|
|        14|   10249|      18.6|       9|     0.0|        167.4|                Tofu|          7|
|        41|   10250|       7.7|      10|     0.0|         77.0|Jack's New Englan...|          8|
|        42|   10248|       9.8|      10|     0.0|         98.0|Singaporean Hokki...|          5|
|        51|   10249|      42.4|      40|     0.0|       1696.0|Manjimup Dried Ap...|          7|
+----------+--------+----------+--------+--------+-------------+--------------------+-----------+
only showing top 5 rows


In [9]:
# Add the category name
df_details_enrichi = df_details_enrichi.join(
    df_categories.select("category_id", "category_name"),
    on="category_id",
    how="left"
)

df_details_enrichi.select("order_id", "product_name", "category_name", "montant_ligne").show(10)

+--------+--------------------+--------------+-------------+
|order_id|        product_name| category_name|montant_ligne|
+--------+--------------------+--------------+-------------+
|   10248|      Queso Cabrales|Dairy Products|        168.0|
|   10249|                Tofu|       Produce|        167.4|
|   10251| Gustaf's Knäckebröd|Grains/Cereals|        95.76|
|   10250|Jack's New Englan...|       Seafood|         77.0|
|   10248|Singaporean Hokki...|Grains/Cereals|         98.0|
|   10249|Manjimup Dried Ap...|       Produce|       1696.0|
|   10250|Manjimup Dried Ap...|       Produce|       1261.4|
|   10251|      Ravioli Angelo|Grains/Cereals|        222.3|
|   10250|Louisiana Fiery H...|    Condiments|        214.2|
|   10251|Louisiana Fiery H...|    Condiments|        336.0|
+--------+--------------------+--------------+-------------+
only showing top 10 rows


## 6. Complete denormalized view

In [10]:
# Build a complete sales view
df_ventes = df_details_enrichi.join(
    df_orders_cat.select(
        "order_id", "customer_id", "order_date", 
        "annee", "mois", "trimestre",
        "delai_livraison_jours", "categorie_delai"
    ),
    on="order_id",
    how="left"
)

# Add customer information
df_ventes = df_ventes.join(
    df_customers.select("customer_id", "company_name", "country"),
    on="customer_id",
    how="left"
)

print("Denormalized sales view:")
df_ventes.printSchema()

Denormalized sales view:
root
 |-- customer_id: string (nullable = true)
 |-- order_id: short (nullable = true)
 |-- category_id: short (nullable = true)
 |-- product_id: short (nullable = true)
 |-- unit_price: float (nullable = true)
 |-- quantity: short (nullable = true)
 |-- discount: float (nullable = true)
 |-- montant_ligne: double (nullable = true)
 |-- product_name: string (nullable = true)
 |-- category_name: string (nullable = true)
 |-- order_date: date (nullable = true)
 |-- annee: integer (nullable = true)
 |-- mois: integer (nullable = true)
 |-- trimestre: integer (nullable = true)
 |-- delai_livraison_jours: integer (nullable = true)
 |-- categorie_delai: string (nullable = true)
 |-- company_name: string (nullable = true)
 |-- country: string (nullable = true)



In [11]:
# Show a preview
df_ventes.select(
    "order_id", "order_date", "company_name", "country",
    "product_name", "category_name", "montant_ligne"
).show(10, truncate=False)

+--------+----------+-------------------------+-------+--------------------------------+--------------+-------------+
|order_id|order_date|company_name             |country|product_name                    |category_name |montant_ligne|
+--------+----------+-------------------------+-------+--------------------------------+--------------+-------------+
|10250   |1996-07-08|Hanari Carnes            |Brazil |Louisiana Fiery Hot Pepper Sauce|Condiments    |214.2        |
|10251   |1996-07-08|Victuailles en stock     |France |Louisiana Fiery Hot Pepper Sauce|Condiments    |336.0        |
|10251   |1996-07-08|Victuailles en stock     |France |Gustaf's Knäckebröd             |Grains/Cereals|95.76        |
|10251   |1996-07-08|Victuailles en stock     |France |Ravioli Angelo                  |Grains/Cereals|222.3        |
|10250   |1996-07-08|Hanari Carnes            |Brazil |Jack's New England Clam Chowder |Seafood       |77.0         |
|10248   |1996-07-04|Vins et alcools Chevalier|France |M

## 7. Window functions

In [12]:
# Rank products by amount within each order
window_order = Window.partitionBy("order_id").orderBy(F.desc("montant_ligne"))

df_ventes_rank = df_ventes.withColumn(
    "rang_produit",
    F.rank().over(window_order)
)

df_ventes_rank.filter(F.col("order_id") == 10248) \
    .select("order_id", "product_name", "montant_ligne", "rang_produit").show()

+--------+--------------------+-------------+------------+
|order_id|        product_name|montant_ligne|rang_produit|
+--------+--------------------+-------------+------------+
|   10248|Mozzarella di Gio...|        174.0|           1|
|   10248|      Queso Cabrales|        168.0|           2|
|   10248|Singaporean Hokki...|         98.0|           3|
+--------+--------------------+-------------+------------+
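
Order 10248 has no tied amounts, so `F.rank()`, `F.dense_rank()` and `F.row_number()` would all give 1, 2, 3 here. They differ only when two lines share the same amount. The plain-Python sketch below models the three numbering schemes. The function `rank_variants` and the tied sample amounts are hypothetical, chosen only to make the difference visible.

```python
def rank_variants(amounts):
    """Return (row_number, rank, dense_rank) lists for amounts sorted descending."""
    ordered = sorted(amounts, reverse=True)
    row_number, rank, dense_rank = [], [], []
    for i, value in enumerate(ordered, start=1):
        row_number.append(i)
        if i > 1 and value == ordered[i - 2]:
            # Tie with the previous row: both rank flavors repeat the previous value
            rank.append(rank[-1])
            dense_rank.append(dense_rank[-1])
        else:
            # rank leaves a gap after ties, dense_rank does not
            rank.append(i)
            dense_rank.append(dense_rank[-1] + 1 if dense_rank else 1)
    return row_number, rank, dense_rank

# Hypothetical order with two lines tied at 174.0
print(rank_variants([174.0, 98.0, 174.0, 168.0]))
```

With ties, `rank` yields 1, 1, 3, 4 while `dense_rank` yields 1, 1, 2, 3, so the choice matters if the column is later used to pick a "top N" per order.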



In [13]:
# Compute each line's share of the order total, as a percentage
window_total = Window.partitionBy("order_id")

df_ventes_pct = df_ventes.withColumn(
    "total_commande",
    F.sum("montant_ligne").over(window_total)
).withColumn(
    "pct_commande",
    F.round(F.col("montant_ligne") / F.col("total_commande") * 100, 2)
)

df_ventes_pct.filter(F.col("order_id") == 10248) \
    .select("order_id", "product_name", "montant_ligne", "total_commande", "pct_commande").show()

+--------+--------------------+-------------+--------------+------------+
|order_id|        product_name|montant_ligne|total_commande|pct_commande|
+--------+--------------------+-------------+--------------+------------+
|   10248|      Queso Cabrales|        168.0|         440.0|       38.18|
|   10248|Singaporean Hokki...|         98.0|         440.0|       22.27|
|   10248|Mozzarella di Gio...|        174.0|         440.0|       39.55|
+--------+--------------------+-------------+--------------+------------+



In [14]:
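# Running total of sales per customer
# Note: lines sharing the same order_date are tied in this ordering,
# so their relative order inside the window is not guaranteed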
window_client = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

df_cumul = df_ventes.withColumn(
    "cumul_achats",
    F.sum("montant_ligne").over(window_client)
)

df_cumul.filter(F.col("customer_id") == "VINET") \
    .select("order_id", "order_date", "montant_ligne", "cumul_achats") \
    .distinct() \
    .orderBy("order_date").show()

+--------+----------+-------------+------------+
|order_id|order_date|montant_ligne|cumul_achats|
+--------+----------+-------------+------------+
|   10248|1996-07-04|         98.0|        98.0|
|   10248|1996-07-04|        168.0|       266.0|
|   10248|1996-07-04|        174.0|       440.0|
|   10274|1996-08-06|        194.6|       978.6|
|   10274|1996-08-06|        344.0|       784.0|
|   10295|1996-09-02|        121.6|      1100.2|
|   10737|1997-11-11|         24.0|      1240.0|
|   10737|1997-11-11|        115.8|      1216.0|
|   10739|1997-11-12|        114.0|      1354.0|
|   10739|1997-11-12|        126.0|      1480.0|
+--------+----------+-------------+------------+
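
In the output above, the running total does not increase row by row inside a single order (see order 10274). Lines that share an `order_date` are tied, both in the window ordering and in the final `orderBy`, so their relative order is arbitrary. One way to get a deterministic result is to aggregate to one row per order before accumulating. In PySpark that would mean a `groupBy` on `customer_id`, `order_id` and `order_date` ahead of the window. The sketch below shows the idea in plain Python, with the VINET rows copied from the output above. The variable names are illustrative.

```python
from decimal import Decimal
from itertools import accumulate, groupby

# (order_id, order_date, montant_ligne) for customer VINET, from the output above
lines = [
    (10248, "1996-07-04", "98.0"), (10248, "1996-07-04", "168.0"),
    (10248, "1996-07-04", "174.0"),
    (10274, "1996-08-06", "194.6"), (10274, "1996-08-06", "344.0"),
    (10295, "1996-09-02", "121.6"),
    (10737, "1997-11-11", "24.0"), (10737, "1997-11-11", "115.8"),
    (10739, "1997-11-12", "114.0"), (10739, "1997-11-12", "126.0"),
]

# 1. One row per order: sort by (order_date, order_id), then sum each order's lines
lines.sort(key=lambda r: (r[1], r[0]))
order_totals = [
    (order_id, sum(Decimal(amount) for _, _, amount in group))
    for order_id, group in groupby(lines, key=lambda r: r[0])
]

# 2. Running total over the per-order totals
running = list(accumulate(total for _, total in order_totals))
cumul_par_commande = [(oid, cum) for (oid, _), cum in zip(order_totals, running)]
print(cumul_par_commande)
```

The per-order running totals match the largest `cumul_achats` value shown for each order in the Spark output.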



## 8. Complete transformation function

In [15]:
def transformer_ventes_silver(df_orders, df_order_details, df_products, df_categories, df_customers):
    """
    Build a denormalized, enriched view of sales.
    """
    # 1. Enrich order_details
    df = df_order_details.withColumn(
        "montant_ligne",
        F.round(F.col("unit_price") * F.col("quantity") * (1 - F.col("discount")), 2)
    )

    # 2. Join products
    df = df.join(
        df_products.select("product_id", "product_name", "category_id"),
        on="product_id",
        how="left"
    )

    # 3. Join categories
    df = df.join(
        df_categories.select("category_id", "category_name"),
        on="category_id",
        how="left"
    )

    # 4. Enrich orders
    df_ord = df_orders \
        .withColumn("annee", F.year("order_date")) \
        .withColumn("mois", F.month("order_date")) \
        .withColumn("trimestre", F.quarter("order_date")) \
        .withColumn("delai_livraison_jours", F.datediff("shipped_date", "order_date"))

    # 5. Join orders
    df = df.join(
        df_ord.select("order_id", "customer_id", "order_date", "annee", "mois", "trimestre", "delai_livraison_jours"),
        on="order_id",
        how="left"
    )

    # 6. Join customers
    df = df.join(
        df_customers.select("customer_id", "company_name", "country", "city"),
        on="customer_id",
        how="left"
    )

    # 7. Add metadata
    df = df.withColumn("_transformed_at", F.current_timestamp())
    
    return df

In [16]:
# Apply the transformation
df_silver_ventes = transformer_ventes_silver(
    df_orders, df_order_details, df_products, df_categories, df_customers
)

print(f"Silver view: {df_silver_ventes.count()} rows")
df_silver_ventes.show(5)

Silver view: 2155 rows
+-----------+--------+-----------+----------+----------+--------+--------+-------------+--------------------+--------------+----------+-----+----+---------+---------------------+--------------------+-------+-------+--------------------+
|customer_id|order_id|category_id|product_id|unit_price|quantity|discount|montant_ligne|        product_name| category_name|order_date|annee|mois|trimestre|delai_livraison_jours|        company_name|country|   city|     _transformed_at|
+-----------+--------+-----------+----------+----------+--------+--------+-------------+--------------------+--------------+----------+-----+----+---------+---------------------+--------------------+-------+-------+--------------------+
|      TOMSP|   10249|          7|        14|      18.6|       9|     0.0|        167.4|                Tofu|       Produce|1996-07-05| 1996|   7|        3|                    5|  Toms Spezialitäten|Germany|Münster|2026-01-15 16:30:...|
|      TOMSP|   10249|     

In [17]:
# Save to the Silver layer
date_traitement = datetime.now().strftime("%Y-%m-%d")
chemin_silver = f"s3a://silver/ventes/{date_traitement}"

df_silver_ventes.write.mode("overwrite").parquet(chemin_silver)
print(f"Saved: {chemin_silver}")

Saved: s3a://silver/ventes/2026-01-15


---

## Exercise

**Goal**: Build an enriched Silver view of employees

**Instructions**:
1. Read the `employees` table
2. Compute seniority (years since hire)
3. Categorize by seniority (Junior, Confirmé, Senior, Expert)
4. Add the number of orders handled by each employee
5. Save to Silver

Your turn:

In [18]:
# TODO: Read employees
df_employees = spark.read.jdbc(url=jdbc_url, table="employees", properties=jdbc_properties)
df_employees.show(5)

+-----------+---------+----------+--------------------+-----------------+----------+----------+--------------------+--------+------+-----------+-------+--------------+---------+-----+--------------------+----------+--------------------+
|employee_id|last_name|first_name|               title|title_of_courtesy|birth_date| hire_date|             address|    city|region|postal_code|country|    home_phone|extension|photo|               notes|reports_to|          photo_path|
+-----------+---------+----------+--------------------+-----------------+----------+----------+--------------------+--------+------+-----------+-------+--------------+---------+-----+--------------------+----------+--------------------+
|          1|  Davolio|     Nancy|Sales Representative|              Ms.|1948-12-08|1992-05-01|507 - 20th Ave. E...| Seattle|    WA|      98122|    USA|(206) 555-9857|     5467|   []|Education include...|         2|http://accweb/emm...|
|          2|   Fuller|    Andrew|Vice President, S.

In [19]:
# TODO: Compute seniority
# Hint: F.datediff(F.current_date(), F.col("hire_date")) / 365

df_emp_anciennete = df_employees.withColumn(
    "anciennete_ans",
    F.round(F.datediff(F.current_date(), F.col("hire_date")) / 365, 1)
)

print("Seniority preview:")
df_emp_anciennete.select("first_name", "last_name", "hire_date", "anciennete_ans").show(5)

Seniority preview:
+----------+---------+----------+--------------+
|first_name|last_name| hire_date|anciennete_ans|
+----------+---------+----------+--------------+
|     Nancy|  Davolio|1992-05-01|          33.7|
|    Andrew|   Fuller|1992-08-14|          33.4|
|     Janet|Leverling|1992-04-01|          33.8|
|  Margaret|  Peacock|1993-05-03|          32.7|
|    Steven| Buchanan|1993-10-17|          32.3|
+----------+---------+----------+--------------+
only showing top 5 rows
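
Dividing the day count by 365 ignores leap days, so the result drifts by roughly one day every four years. That is harmless for one-decimal seniority, but it can matter near a category boundary. In Spark, `F.months_between(...) / 12` is one possible alternative. The plain-Python sketch below compares the approximation with a calendar-based count of completed years. The helper `full_years` is hypothetical, and the reference date is the run date seen in the notebook outputs.

```python
from datetime import date

def full_years(start: date, end: date) -> int:
    """Number of completed calendar years between start and end."""
    years = end.year - start.year
    # Subtract one if the anniversary has not happened yet in the end year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years

hire_date = date(1992, 5, 1)    # Nancy Davolio, from the output above
reference = date(2026, 1, 15)   # run date seen in the notebook outputs

approx = round((reference - hire_date).days / 365, 1)  # same formula as the cell above
exact = full_years(hire_date, reference)
print(approx, exact)
```

Here the approximation reproduces the 33.7 shown by Spark, while the calendar count gives 33 completed years.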


In [20]:
# TODO: Categorize and add the number of orders

# Categorize
df_emp_cat = df_emp_anciennete.withColumn(
    "niveau_experience",
    F.when(F.col("anciennete_ans") < 5, "Junior")
     .when(F.col("anciennete_ans") < 15, "Confirmé")
     .when(F.col("anciennete_ans") < 25, "Senior")
     .otherwise("Expert")
)

# Count orders per employee
df_nb_orders = df_orders.groupBy("employee_id").agg(
    F.count("order_id").alias("nb_commandes_gerees")
)

# Join
df_silver_employees = df_emp_cat.join(
    df_nb_orders,
    on="employee_id",
    how="left"
).fillna({"nb_commandes_gerees": 0})  # Replace NULL with 0

print("Silver employees view:")
df_silver_employees.select("first_name", "last_name", "niveau_experience", "nb_commandes_gerees").show(5)

Silver employees view:
+----------+---------+-----------------+-------------------+
|first_name|last_name|niveau_experience|nb_commandes_gerees|
+----------+---------+-----------------+-------------------+
|     Nancy|  Davolio|           Expert|                123|
|   Michael|   Suyama|           Expert|                 67|
|     Janet|Leverling|           Expert|                127|
|    Steven| Buchanan|           Expert|                 42|
|  Margaret|  Peacock|           Expert|                156|
+----------+---------+-----------------+-------------------+
only showing top 5 rows


In [21]:
# Save to the Silver layer of the data lake
from datetime import datetime
date_traitement = datetime.now().strftime("%Y-%m-%d")

chemin_silver_emp = f"s3a://silver/employees/{date_traitement}"
df_silver_employees.write.mode("overwrite").parquet(chemin_silver_emp)

print(f"Save complete: {chemin_silver_emp}")

Save complete: s3a://silver/employees/2026-01-15


---

## Summary

In this notebook, you learned:
- How to create **computed columns** (amounts, derived dates)
- How to **categorize** data with when/otherwise
- How to **enrich** data through joins
- How to create **denormalized views**
- How to use **window functions**

### Next step
In the next notebook, we will learn how to build aggregations for the Gold layer.