# Exercise 12 - Advanced DataFrame Operations

## Objectives
- Master pivot operations
- Use User Defined Functions (UDFs)
- Work with complex types (arrays, maps)
- Optimize DataFrame performance

---

## 1. Configuration

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, IntegerType, ArrayType, MapType, StructType, StructField
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("Operations DataFrame Avancees") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0") \
    .getOrCreate()

print("Spark ready")

Spark ready


In [2]:
# Load the data
jdbc_url = "jdbc:postgresql://postgres:5432/app"
jdbc_properties = {
    "user": "postgres",
    "password": "postgres",
    "driver": "org.postgresql.Driver"
}

df_orders = spark.read.jdbc(url=jdbc_url, table="orders", properties=jdbc_properties)
df_order_details = spark.read.jdbc(url=jdbc_url, table="order_details", properties=jdbc_properties)
df_products = spark.read.jdbc(url=jdbc_url, table="products", properties=jdbc_properties)
df_customers = spark.read.jdbc(url=jdbc_url, table="customers", properties=jdbc_properties)

print("Data loaded")

Data loaded


## 2. Pivot and Unpivot

```
PIVOT: rows -> columns

Before:                         After:
+-------+------+-------+        +-------+--------+--------+--------+
| annee | mois | ventes|   -->  | annee | Jan    | Fev    | Mar    |
+-------+------+-------+        +-------+--------+--------+--------+
| 2023  | Jan  | 100   |        | 2023  | 100    | 200    | 150    |
| 2023  | Fev  | 200   |        | 2024  | 120    | 180    | 160    |
| 2023  | Mar  | 150   |        +-------+--------+--------+--------+
| 2024  | Jan  | 120   |
+-------+------+-------+
```
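
To make the diagram concrete, here is a minimal plain-Python sketch of what a pivot computes, without Spark. The `pivot` helper and the sample rows are hypothetical and only mirror the diagram; Spark additionally requires an aggregate function because several rows may land in the same cell.

```python
from collections import defaultdict

# Hypothetical sample rows mirroring the diagram: (annee, mois, ventes)
rows = [
    (2023, "Jan", 100),
    (2023, "Fev", 200),
    (2023, "Mar", 150),
    (2024, "Jan", 120),
]

def pivot(rows, columns):
    """Turn (key, column, value) rows into one dict per key, with one entry per column."""
    # Every key starts with all columns set to None (Spark would show NULL)
    table = defaultdict(lambda: {c: None for c in columns})
    for key, column, value in rows:
        # If the same (key, column) appears twice, the last value wins here
        table[key][column] = value
    return dict(table)

result = pivot(rows, ["Jan", "Fev", "Mar"])
print(result[2023])  # {'Jan': 100, 'Fev': 200, 'Mar': 150}
print(result[2024])  # {'Jan': 120, 'Fev': None, 'Mar': None}
```

Cells with no matching input row stay empty, which is why the Spark pivot further down shows `NULL` for the months with no orders.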

In [3]:
# Prepare the sales data
df_ventes = df_order_details.join(df_orders.select("order_id", "order_date"), on="order_id")
df_ventes = df_ventes.withColumn(
    "montant",
    F.round(F.col("unit_price") * F.col("quantity"), 2)
).withColumn(
    "annee", F.year("order_date")
).withColumn(
    "mois", F.month("order_date")
)

# Aggregate by year and month
df_mensuel = df_ventes.groupBy("annee", "mois").agg(
    F.round(F.sum("montant"), 2).alias("ca")
)

df_mensuel.orderBy("annee", "mois").show()

+-----+----+---------+
|annee|mois|       ca|
+-----+----+---------+
| 1996|   7|  30192.1|
| 1996|   8|  26609.4|
| 1996|   9|  27636.0|
| 1996|  10|  41203.6|
| 1996|  11|  49704.0|
| 1996|  12|  50953.4|
| 1997|   1|  66692.8|
| 1997|   2|  41207.2|
| 1997|   3|  39979.9|
| 1997|   4| 55699.39|
| 1997|   5|  56823.7|
| 1997|   6|  39088.0|
| 1997|   7| 55464.93|
| 1997|   8| 49981.69|
| 1997|   9| 59733.02|
| 1997|  10|  70328.5|
| 1997|  11| 45913.36|
| 1997|  12| 77476.26|
| 1998|   1|100854.72|
| 1998|   2|104561.95|
+-----+----+---------+
only showing top 20 rows


In [4]:
# PIVOT: months become columns
# F.first is safe here because df_mensuel has exactly one row per (annee, mois)
df_pivot = df_mensuel.groupBy("annee").pivot("mois").agg(F.first("ca"))

print("Pivot table by month:")
df_pivot.orderBy("annee").show()

Pivot table by month:
+-----+---------+---------+---------+---------+--------+-------+--------+--------+--------+-------+--------+--------+
|annee|        1|        2|        3|        4|       5|      6|       7|       8|       9|     10|      11|      12|
+-----+---------+---------+---------+---------+--------+-------+--------+--------+--------+-------+--------+--------+
| 1996|     NULL|     NULL|     NULL|     NULL|    NULL|   NULL| 30192.1| 26609.4| 27636.0|41203.6| 49704.0| 50953.4|
| 1997|  66692.8|  41207.2|  39979.9| 55699.39| 56823.7|39088.0|55464.93|49981.69|59733.02|70328.5|45913.36|77476.26|
| 1998|100854.72|104561.95|109825.45|134630.56|19898.66|   NULL|    NULL|    NULL|    NULL|   NULL|    NULL|    NULL|
+-----+---------+---------+---------+---------+--------+-------+--------+--------+--------+-------+--------+--------+



In [5]:
# Pivot with an explicit list of values: Spark skips the extra job
# it would otherwise run to discover the distinct values of "mois"
mois_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
df_pivot_opt = df_mensuel.groupBy("annee").pivot("mois", mois_list).agg(F.first("ca")).fillna(0)

df_pivot_opt.show()

+-----+---------+---------+---------+---------+--------+-------+--------+--------+--------+-------+--------+--------+
|annee|        1|        2|        3|        4|       5|      6|       7|       8|       9|     10|      11|      12|
+-----+---------+---------+---------+---------+--------+-------+--------+--------+--------+-------+--------+--------+
| 1997|  66692.8|  41207.2|  39979.9| 55699.39| 56823.7|39088.0|55464.93|49981.69|59733.02|70328.5|45913.36|77476.26|
| 1996|      0.0|      0.0|      0.0|      0.0|     0.0|    0.0| 30192.1| 26609.4| 27636.0|41203.6| 49704.0| 50953.4|
| 1998|100854.72|104561.95|109825.45|134630.56|19898.66|    0.0|     0.0|     0.0|     0.0|    0.0|     0.0|     0.0|
+-----+---------+---------+---------+---------+--------+-------+--------+--------+--------+-------+--------+--------+



In [6]:
# UNPIVOT: columns -> rows
# stack() works on every Spark version; Spark 3.4+ also provides DataFrame.unpivot()
# df_pivot_opt was filled with 0, so the isNotNull filter below keeps every row
df_unpivot = df_pivot_opt.select(
    "annee",
    F.expr(f"stack(12, {', '.join([f'{i}, `{i}`' for i in range(1, 13)])}) as (mois, ca)")
).filter(F.col("ca").isNotNull())

print("After unpivot:")
df_unpivot.orderBy("annee", "mois").show()

After unpivot:
+-----+----+--------+
|annee|mois|      ca|
+-----+----+--------+
| 1996|   1|     0.0|
| 1996|   2|     0.0|
| 1996|   3|     0.0|
| 1996|   4|     0.0|
| 1996|   5|     0.0|
| 1996|   6|     0.0|
| 1996|   7| 30192.1|
| 1996|   8| 26609.4|
| 1996|   9| 27636.0|
| 1996|  10| 41203.6|
| 1996|  11| 49704.0|
| 1996|  12| 50953.4|
| 1997|   1| 66692.8|
| 1997|   2| 41207.2|
| 1997|   3| 39979.9|
| 1997|   4|55699.39|
| 1997|   5| 56823.7|
| 1997|   6| 39088.0|
| 1997|   7|55464.93|
| 1997|   8|49981.69|
+-----+----+--------+
only showing top 20 rows
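
The `stack()` expression in the cell above is built inline and is hard to read. A small helper can build the same string; this is only a sketch, and `build_stack_expr` is a hypothetical name, not part of PySpark. It assumes the column names are numeric (as with months 1 to 12), so each name can double as its own label.

```python
def build_stack_expr(columns, key_name="mois", value_name="ca"):
    """Build a Spark SQL stack() expression that unpivots the given columns.

    Each column contributes a (label, `column`) pair. The label is the column
    name itself, which is only valid SQL because the names are numeric here;
    string labels would need to be quoted.
    """
    columns = list(columns)
    pairs = ", ".join(f"{c}, `{c}`" for c in columns)
    return f"stack({len(columns)}, {pairs}) as ({key_name}, {value_name})"

expr = build_stack_expr(range(1, 4))
print(expr)  # stack(3, 1, `1`, 2, `2`, 3, `3`) as (mois, ca)
```

The resulting string can be passed to `F.expr(...)` inside a `select`, exactly like the inline version.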


## 3. User Defined Functions (UDF)

In [7]:
from pyspark.sql.functions import udf

# Define a plain Python function
def categoriser_prix(prix):
    """Categorize a price into a tier."""
    if prix is None:
        return "Inconnu"
    elif prix < 10:
        return "Economique"
    elif prix < 25:
        return "Standard"
    elif prix < 50:
        return "Premium"
    else:
        return "Luxe"

# Register it as a UDF
categoriser_prix_udf = udf(categoriser_prix, StringType())

# Use the UDF
df_products_cat = df_products.withColumn(
    "niveau_prix",
    categoriser_prix_udf(F.col("unit_price"))
)

df_products_cat.select("product_name", "unit_price", "niveau_prix").show(10)

+--------------------+----------+-----------+
|        product_name|unit_price|niveau_prix|
+--------------------+----------+-----------+
|                Chai|      18.0|   Standard|
|               Chang|      19.0|   Standard|
|       Aniseed Syrup|      10.0|   Standard|
|Chef Anton's Caju...|      22.0|   Standard|
|Chef Anton's Gumb...|     21.35|   Standard|
|Grandma's Boysenb...|      25.0|    Premium|
|Uncle Bob's Organ...|      30.0|    Premium|
|Northwoods Cranbe...|      40.0|    Premium|
|     Mishi Kobe Niku|      97.0|       Luxe|
|               Ikura|      31.0|    Premium|
+--------------------+----------+-----------+
only showing top 10 rows
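
Because a UDF wraps an ordinary Python function, the thresholds can be checked without Spark before registering it. The sketch below repeats `categoriser_prix` so that it runs on its own; the test prices are made-up boundary values, not rows from the `products` table.

```python
def categoriser_prix(prix):
    """Categorize a price into a tier (same logic as the UDF above)."""
    if prix is None:
        return "Inconnu"
    elif prix < 10:
        return "Economique"
    elif prix < 25:
        return "Standard"
    elif prix < 50:
        return "Premium"
    else:
        return "Luxe"

# Each threshold belongs to the upper tier because the comparisons are strict (<)
cases = {
    None: "Inconnu",
    9.99: "Economique",
    10.0: "Standard",
    24.99: "Standard",
    25.0: "Premium",
    49.99: "Premium",
    50.0: "Luxe",
}
for prix, expected in cases.items():
    assert categoriser_prix(prix) == expected, (prix, expected)
print("all boundary cases pass")
```

This matches the output above, where Aniseed Syrup at 10.0 is `Standard` and Grandma's Boysenberry Spread at 25.0 is `Premium`.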


In [8]:
# UDF declared with a decorator (more readable)
@udf(returnType=StringType())
def format_telephone(phone):
    """Normalize a phone number by keeping only its digits."""
    if phone is None:
        return None
    # Keep only the digits
    digits = ''.join(filter(str.isdigit, phone))
    return digits if len(digits) > 0 else None

df_customers.withColumn(
    "phone_clean",
    format_telephone(F.col("phone"))
).select("company_name", "phone", "phone_clean").show(10)

+--------------------+--------------+-----------+
|        company_name|         phone|phone_clean|
+--------------------+--------------+-----------+
| Alfreds Futterkiste|   030-0074321| 0300074321|
|Ana Trujillo Empa...|  (5) 555-4729|   55554729|
|Antonio Moreno Ta...|  (5) 555-3932|   55553932|
|     Around the Horn|(171) 555-7788| 1715557788|
|  Berglunds snabbköp| 0921-12 34 65| 0921123465|
|Blauer See Delika...|    0621-08460|  062108460|
|Blondesddsl père ...|   88.60.15.31|   88601531|
|Bólido Comidas pr...|(91) 555 22 82|  915552282|
|            Bon app'|   91.24.45.40|   91244540|
|Bottom-Dollar Mar...|(604) 555-4729| 6045554729|
+--------------------+--------------+-----------+
only showing top 10 rows
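
One subtlety of the function above: `str.isdigit` also accepts non-ASCII digit characters. If only `0-9` should be kept, a regular expression is stricter. The following is a sketch of that variant; `keep_ascii_digits` is a hypothetical helper name, and whether non-ASCII digits should be dropped is an assumption about the desired behavior.

```python
import re

def keep_ascii_digits(phone):
    """Return only the ASCII digits 0-9 of a phone number, or None if there are none."""
    if phone is None:
        return None
    digits = re.sub(r"[^0-9]", "", phone)
    # An empty result becomes None, like in format_telephone
    return digits or None

print(keep_ascii_digits("(171) 555-7788"))  # 1715557788
print(keep_ascii_digits("030-0074321"))     # 0300074321
print(keep_ascii_digits("n/a"))             # None
```

The result stays a string, so leading zeros such as in `0300074321` are preserved. The same pattern could also be applied without a UDF through the built-in `F.regexp_replace`, which typically avoids the Python serialization overhead; unlike the helper, it returns an empty string rather than `None` when no digit remains.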


In [9]:
# UDF returning a complex type (array of strings)
@udf(returnType=ArrayType(StringType()))
def extraire_mots(texte):
    """Split a text into its words."""
    if texte is None:
        return []
    return texte.split()

df_products.withColumn(
    "mots_nom",
    extraire_mots(F.col("product_name"))
).select("product_name", "mots_nom").show(5, truncate=False)

+----------------------------+---------------------------------+
|product_name                |mots_nom                         |
+----------------------------+---------------------------------+
|Chai                        |[Chai]                           |
|Chang                       |[Chang]                          |
|Aniseed Syrup               |[Aniseed, Syrup]                 |
|Chef Anton's Cajun Seasoning|[Chef, Anton's, Cajun, Seasoning]|
|Chef Anton's Gumbo Mix      |[Chef, Anton's, Gumbo, Mix]      |
+----------------------------+---------------------------------+
only showing top 5 rows


## 4. Pandas UDF (vectorized, usually faster)

In [10]:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Pandas UDF: vectorized, it receives whole batches as pd.Series (via Arrow)
# instead of being called once per row, which is usually much faster
@pandas_udf("double")
def calculer_remise(prix: pd.Series, quantite: pd.Series) -> pd.Series:
    """Compute a tiered discount rate."""
    montant = prix * quantite
    # Reuse montant's index so the boolean masks below always align
    remise = pd.Series(0.0, index=montant.index)
    remise[montant >= 100] = 0.05
    remise[montant >= 500] = 0.10
    remise[montant >= 1000] = 0.15
    return remise

df_order_details.withColumn(
    "remise_calculee",
    calculer_remise(F.col("unit_price"), F.col("quantity"))
).select("order_id", "unit_price", "quantity", "remise_calculee").show(10)

+--------+----------+--------+---------------+
|order_id|unit_price|quantity|remise_calculee|
+--------+----------+--------+---------------+
|   10248|      14.0|      12|           0.05|
|   10248|       9.8|      10|            0.0|
|   10248|      34.8|       5|           0.05|
|   10249|      18.6|       9|           0.05|
|   10249|      42.4|      40|           0.15|
|   10250|       7.7|      10|            0.0|
|   10250|      42.4|      35|           0.15|
|   10250|      16.8|      15|           0.05|
|   10251|      16.8|       6|           0.05|
|   10251|      15.6|      15|           0.05|
+--------+----------+--------+---------------+
only showing top 10 rows
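
The discount tiers themselves can be checked row by row in plain Python, independently of pandas and Arrow. This sketch uses `bisect` to look up the tier; `remise_progressive`, `THRESHOLDS` and `RATES` are hypothetical names chosen for illustration, and the thresholds are copied from `calculer_remise`.

```python
from bisect import bisect_right

# Lower bounds of each paid tier and the matching rates (one more rate than bounds)
THRESHOLDS = [100, 500, 1000]
RATES = [0.0, 0.05, 0.10, 0.15]

def remise_progressive(prix, quantite):
    """Return the discount rate for one order line."""
    montant = prix * quantite
    # bisect_right counts how many thresholds are <= montant,
    # so a montant of exactly 100 already gets the 5% tier
    return RATES[bisect_right(THRESHOLDS, montant)]

print(remise_progressive(14.0, 12))  # 0.05 (montant = 168)
print(remise_progressive(9.8, 10))   # 0.0  (montant is about 98)
print(remise_progressive(42.4, 40))  # 0.15 (montant = 1696)
```

These three cases reproduce the first rows of the output above. The pandas version applies the same rule to a whole batch at once by overwriting the rate with successive boolean masks.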


## 5. Complex types: Arrays

In [11]:
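# Build an array of the products in each order
# collect_list keeps duplicates, collect_set removes them; neither guarantees
# an order, so "produits" and "product_ids" are not positionally aligned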
df_produits_commande = df_order_details.join(
    df_products.select("product_id", "product_name"),
    on="product_id"
).groupBy("order_id").agg(
    F.collect_list("product_name").alias("produits"),
    F.collect_set("product_id").alias("product_ids")
)

df_produits_commande.show(5, truncate=False)

+--------+------------------------------------------------------------------------------------------+------------+
|order_id|produits                                                                                  |product_ids |
+--------+------------------------------------------------------------------------------------------+------------+
|10248   |[Mozzarella di Giovanni, Queso Cabrales, Singaporean Hokkien Fried Mee]                   |[72, 42, 11]|
|10249   |[Manjimup Dried Apples, Tofu]                                                             |[51, 14]    |
|10250   |[Jack's New England Clam Chowder, Manjimup Dried Apples, Louisiana Fiery Hot Pepper Sauce]|[65, 51, 41]|
|10251   |[Gustaf's Knäckebröd, Ravioli Angelo, Louisiana Fiery Hot Pepper Sauce]                   |[65, 22, 57]|
|10252   |[Sir Rodney's Marmalade, Camembert Pierrot, Geitost]                                      |[33, 20, 60]|
+--------+------------------------------------------------------------------------------------------+------------+

In [12]:
# Array operations (element_at is 1-based; the "first" product depends on collect_list order)
df_array_ops = df_produits_commande.withColumn(
    "nb_produits",
    F.size("produits")
).withColumn(
    "premier_produit",
    F.element_at("produits", 1)
).withColumn(
    "produits_tries",
    F.sort_array("produits")
)

df_array_ops.select("order_id", "nb_produits", "premier_produit", "produits_tries").show(5, truncate=False)

+--------+-----------+-------------------------------+------------------------------------------------------------------------------------------+
|order_id|nb_produits|premier_produit                |produits_tries                                                                            |
+--------+-----------+-------------------------------+------------------------------------------------------------------------------------------+
|10248   |3          |Mozzarella di Giovanni         |[Mozzarella di Giovanni, Queso Cabrales, Singaporean Hokkien Fried Mee]                   |
|10249   |2          |Manjimup Dried Apples          |[Manjimup Dried Apples, Tofu]                                                             |
|10250   |3          |Jack's New England Clam Chowder|[Jack's New England Clam Chowder, Louisiana Fiery Hot Pepper Sauce, Manjimup Dried Apples]|
|10251   |3          |Gustaf's Knäckebröd            |[Gustaf's Knäckebröd, Louisiana Fiery Hot Pepper Sauce, Ravioli Angelo

In [13]:
# Explode: array -> multiple rows
df_explode = df_produits_commande.select(
    "order_id",
    F.explode("produits").alias("produit")
)

print("After explode:")
df_explode.show(10)

After explode:
+--------+--------------------+
|order_id|             produit|
+--------+--------------------+
|   10248|Mozzarella di Gio...|
|   10248|      Queso Cabrales|
|   10248|Singaporean Hokki...|
|   10249|Manjimup Dried Ap...|
|   10249|                Tofu|
|   10250|Jack's New Englan...|
|   10250|Manjimup Dried Ap...|
|   10250|Louisiana Fiery H...|
|   10251| Gustaf's Knäckebröd|
|   10251|      Ravioli Angelo|
+--------+--------------------+
only showing top 10 rows


## 6. Complex types: Maps

In [14]:
# Build a product -> quantity map for each order
df_map = df_order_details.join(
    df_products.select("product_id", "product_name"),
    on="product_id"
).groupBy("order_id").agg(
    F.map_from_entries(
        F.collect_list(
            F.struct(F.col("product_name"), F.col("quantity"))
        )
    ).alias("produits_quantites")
)

df_map.show(3, truncate=False)

+--------+------------------------------------------------------------------------------------------------------------+
|order_id|produits_quantites                                                                                          |
+--------+------------------------------------------------------------------------------------------------------------+
|10248   |{Mozzarella di Giovanni -> 5, Queso Cabrales -> 12, Singaporean Hokkien Fried Mee -> 10}                    |
|10249   |{Manjimup Dried Apples -> 40, Tofu -> 9}                                                                    |
|10250   |{Jack's New England Clam Chowder -> 10, Manjimup Dried Apples -> 35, Louisiana Fiery Hot Pepper Sauce -> 15}|
+--------+------------------------------------------------------------------------------------------------------------+
only showing top 3 rows
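
Building a map from entries only works cleanly when the keys are unique within each group, which holds here as long as no product name appears twice in the same order. In Spark 3.x the handling of duplicate keys is governed by the `spark.sql.mapKeyDedupPolicy` setting (`EXCEPTION` by default, or `LAST_WIN`). The plain-Python sketch below imitates the two policies to show the difference; it is an illustration, not Spark's implementation, and the duplicated `Tofu` entry is made up.

```python
def map_from_entries(entries, dedup_policy="EXCEPTION"):
    """Build a dict from (key, value) pairs, imitating Spark's two dedup policies."""
    result = {}
    for key, value in entries:
        if key in result and dedup_policy == "EXCEPTION":
            raise ValueError(f"duplicate map key: {key!r}")
        # With LAST_WIN, a later entry silently replaces the earlier one
        result[key] = value
    return result

order_10249 = [("Manjimup Dried Apples", 40), ("Tofu", 9)]
print(map_from_entries(order_10249))
# {'Manjimup Dried Apples': 40, 'Tofu': 9}

# Hypothetical order where the same product appears twice
with_duplicate = [("Tofu", 9), ("Tofu", 3)]
print(map_from_entries(with_duplicate, dedup_policy="LAST_WIN"))
# {'Tofu': 3}
```

If duplicates are possible in real data, summing the quantities per product before collecting the entries avoids relying on either policy.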


In [15]:
# Access the keys and values of a map
df_map.withColumn(
    "toutes_cles",
    F.map_keys("produits_quantites")
).withColumn(
    "toutes_valeurs",
    F.map_values("produits_quantites")
).select("order_id", "toutes_cles", "toutes_valeurs").show(3, truncate=False)

+--------+------------------------------------------------------------------------------------------+--------------+
|order_id|toutes_cles                                                                               |toutes_valeurs|
+--------+------------------------------------------------------------------------------------------+--------------+
|10248   |[Mozzarella di Giovanni, Queso Cabrales, Singaporean Hokkien Fried Mee]                   |[5, 12, 10]   |
|10249   |[Manjimup Dried Apples, Tofu]                                                             |[40, 9]       |
|10250   |[Jack's New England Clam Chowder, Manjimup Dried Apples, Louisiana Fiery Hot Pepper Sauce]|[10, 35, 15]  |
+--------+------------------------------------------------------------------------------------------+--------------+
only showing top 3 rows


## 7. Struct operations

In [16]:
# Build a nested structure
df_struct = df_customers.select(
    "customer_id",
    "company_name",
    F.struct(
        F.col("address").alias("rue"),
        F.col("city").alias("ville"),
        F.col("postal_code").alias("code_postal"),
        F.col("country").alias("pays")
    ).alias("adresse_complete")
)

df_struct.printSchema()
df_struct.show(5, truncate=False)

root
 |-- customer_id: string (nullable = true)
 |-- company_name: string (nullable = true)
 |-- adresse_complete: struct (nullable = false)
 |    |-- rue: string (nullable = true)
 |    |-- ville: string (nullable = true)
 |    |-- code_postal: string (nullable = true)
 |    |-- pays: string (nullable = true)

+-----------+----------------------------------+-----------------------------------------------------------+
|customer_id|company_name                      |adresse_complete                                           |
+-----------+----------------------------------+-----------------------------------------------------------+
|ALFKI      |Alfreds Futterkiste               |{Obere Str. 57, Berlin, 12209, Germany}                    |
|ANATR      |Ana Trujillo Emparedados y helados|{Avda. de la Constitución 2222, México D.F., 05021, Mexico}|
|ANTON      |Antonio Moreno Taquería           |{Mataderos  2312, México D.F., 05023, Mexico}              |
|AROUT      |Around the Horn     

In [17]:
# Access the fields of a struct with dot notation
df_struct.select(
    "company_name",
    F.col("adresse_complete.ville").alias("ville"),
    F.col("adresse_complete.pays").alias("pays")
).show(10)

+--------------------+-----------+-------+
|        company_name|      ville|   pays|
+--------------------+-----------+-------+
| Alfreds Futterkiste|     Berlin|Germany|
|Ana Trujillo Empa...|México D.F.| Mexico|
|Antonio Moreno Ta...|México D.F.| Mexico|
|     Around the Horn|     London|     UK|
|  Berglunds snabbköp|      Luleå| Sweden|
|Blauer See Delika...|   Mannheim|Germany|
|Blondesddsl père ...| Strasbourg| France|
|Bólido Comidas pr...|     Madrid|  Spain|
|            Bon app'|  Marseille| France|
|Bottom-Dollar Mar...|  Tsawassen| Canada|
+--------------------+-----------+-------+
only showing top 10 rows


## 8. Performance optimization

In [18]:
# Broadcast join (for small tables)
from pyspark.sql.functions import broadcast

# The categories table is small, so broadcast it to every executor
df_categories = spark.read.jdbc(url=jdbc_url, table="categories", properties=jdbc_properties)

df_joined = df_products.join(
    broadcast(df_categories),
    on="category_id",
    how="left"
)

df_joined.show(5)

+-----------+----------+--------------------+-----------+-------------------+----------+--------------+--------------+-------------+------------+-------------+--------------------+-------+
|category_id|product_id|        product_name|supplier_id|  quantity_per_unit|unit_price|units_in_stock|units_on_order|reorder_level|discontinued|category_name|         description|picture|
+-----------+----------+--------------------+-----------+-------------------+----------+--------------+--------------+-------------+------------+-------------+--------------------+-------+
|          1|         1|                Chai|          8| 10 boxes x 30 bags|      18.0|            39|             0|           10|           1|    Beverages|Soft drinks, coff...|     []|
|          1|         2|               Chang|          1| 24 - 12 oz bottles|      19.0|            17|            40|           25|           1|    Beverages|Soft drinks, coff...|     []|
|          2|         3|       Aniseed Syrup|          

In [19]:
# Cache DataFrames that are reused
df_ventes_cached = df_ventes.cache()

# The first action materializes the cache
print(f"Row count: {df_ventes_cached.count()}")

# Later actions read from the cache
print(f"Total revenue: {df_ventes_cached.agg(F.sum('montant')).collect()[0][0]:.2f}")

Row count: 2155
Total revenue: 1354458.59


In [20]:
# Show the execution plan (look for BroadcastHashJoin)
df_joined.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [category_id#22, product_id#19, product_name#20, supplier_id#21, quantity_per_unit#23, unit_price#24, units_in_stock#25, units_on_order#26, reorder_level#27, discontinued#28, category_name#1640, description#1641, picture#1642]
   +- BroadcastHashJoin [category_id#22], [category_id#1639], LeftOuter, BuildRight, false
      :- Scan JDBCRelation(products) [numPartitions=1] [product_id#19,product_name#20,supplier_id#21,category_id#22,quantity_per_unit#23,unit_price#24,units_in_stock#25,units_on_order#26,reorder_level#27,discontinued#28] PushedFilters: [], ReadSchema: struct<product_id:smallint,product_name:string,supplier_id:smallint,category_id:smallint,quantity...
      +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, smallint, true] as bigint)),false), [plan_id=3324]
         +- Scan JDBCRelation(categories) [numPartitions=1] [category_id#1639,category_name#1640,description#1641,picture#1642] PushedFil

In [21]:
# Release the cache
df_ventes_cached.unpersist()

DataFrame[order_id: smallint, product_id: smallint, unit_price: float, quantity: smallint, discount: float, order_date: date, montant: double, annee: int, mois: int]

---

## Exercise

**Goal**: Create a UDF and a pivot

**Instructions**:
1. Create a UDF that maps countries to regions (Europe, Amerique, Asie, Autre)
2. Apply it to the customers
3. Build a pivot of sales by region and by quarter

Your turn:

In [22]:
# UDF that maps a country to a region
@udf(returnType=StringType())
def categoriser_region(pays):
    if pays is None:
        return "Inconnu"

    # Lists of countries found in Northwind
    europe = ["UK", "France", "Germany", "Sweden", "Italy", "Spain", "Portugal", "Ireland", "Belgium", "Switzerland", "Austria", "Finland", "Poland", "Norway", "Denmark"]
    amerique = ["USA", "Canada", "Brazil", "Mexico", "Argentina", "Venezuela"]
    asie = ["Japan", "Singapore", "India"]
    
    if pays in europe:
        return "Europe"
    elif pays in amerique:
        return "Amerique"
    elif pays in asie:
        return "Asie"
    else:
        return "Autre"
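
The UDF above rebuilds its three lists on every call and scans them linearly. A possible alternative, sketched here in plain Python, builds a single lookup dictionary once. `REGION_BY_COUNTRY` and `region_for_country` are hypothetical names; the country lists are copied from the cell above.

```python
EUROPE = ["UK", "France", "Germany", "Sweden", "Italy", "Spain", "Portugal",
          "Ireland", "Belgium", "Switzerland", "Austria", "Finland", "Poland",
          "Norway", "Denmark"]
AMERIQUE = ["USA", "Canada", "Brazil", "Mexico", "Argentina", "Venezuela"]
ASIE = ["Japan", "Singapore", "India"]

# One dictionary built at import time: country -> region
REGION_BY_COUNTRY = {
    **dict.fromkeys(EUROPE, "Europe"),
    **dict.fromkeys(AMERIQUE, "Amerique"),
    **dict.fromkeys(ASIE, "Asie"),
}

def region_for_country(pays):
    """Return the region of a country, 'Autre' if unlisted, 'Inconnu' if missing."""
    if pays is None:
        return "Inconnu"
    return REGION_BY_COUNTRY.get(pays, "Autre")

print(region_for_country("France"))     # Europe
print(region_for_country("Brazil"))     # Amerique
print(region_for_country("Australia"))  # Autre
```

The function can be wrapped with `udf(...)` exactly like `categoriser_region`. The lookup is case-sensitive, so it assumes country names are spelled as in the lists.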

In [23]:
# Apply the UDF to the customers and build the pivot

# Apply the UDF to add the 'region_monde' column
df_cust_region = df_customers.withColumn("region_monde", categoriser_region(F.col("country")))

# Build the full sales table (joins)
df_ventes_complete = df_order_details.join(df_orders, on="order_id") \
                                     .join(df_cust_region, on="customer_id")

# Add the computed columns (amount and quarter)
df_analyse = df_ventes_complete.withColumn(
    "montant",
    F.col("unit_price") * F.col("quantity")
).withColumn(
    "trimestre",
    F.quarter("order_date")
)

# Pivot: regions as rows, quarters as columns
# Each quarter column sums that quarter across all years in the data
df_pivot_region = df_analyse.groupBy("region_monde") \
                            .pivot("trimestre", [1, 2, 3, 4]) \
                            .agg(F.round(F.sum("montant"), 2)) \
                            .fillna(0) \
                            .orderBy("region_monde")

print("=== Revenue by Region and Quarter ===")
df_pivot_region.show()

=== Revenue by Region and Quarter ===
+------------+---------+---------+---------+---------+
|region_monde|        1|        2|        3|        4|
+------------+---------+---------+---------+---------+
|    Amerique|194792.87|102882.88|110112.85| 119088.4|
|      Europe|268329.15|203257.43|139504.29|216490.72|
+------------+---------+---------+---------+---------+



---

## Summary

In this notebook, you learned how to:
- Build a **pivot** (rows to columns)
- Create Python and Pandas **UDFs**
- Work with **arrays** and **maps**
- Create and use **structs**
- **Optimize** performance (broadcast, cache)

### Next step
The next notebook takes a closer look at Spark joins.