# Exercise 13 - Spark Joins

## Objectives
- Master the different join types
- Understand join performance
- Avoid common pitfalls
- Optimize joins on large volumes

---

## 1. Join types

```
+--------------------+--------------------------------------------+
| Type               | Description                                |
+--------------------+--------------------------------------------+
| inner              | Rows present in both tables                |
| left (left_outer)  | All left rows + matching right rows        |
| right (right_outer)| All right rows + matching left rows        |
| full (full_outer)  | All rows from both tables                  |
| cross              | Cartesian product (all combinations)       |
| left_semi          | Left rows that have a match on the right   |
| left_anti          | Left rows with no match on the right       |
+--------------------+--------------------------------------------+

Visualization:

   LEFT         INNER        RIGHT          FULL
 +-----+       +-----+      +-----+       +-----+
 |#####|       |  ###|      |###  |       |#####|
 |#####|       |  ###|      |###  |       |#####|
 +-----+       +-----+      +-----+       +-----+
```
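
To make the table concrete without a Spark session, here is a minimal plain-Python model of the row-level semantics. It is a sketch, not Spark code: the `customers` and `orders` samples are hypothetical, and cross join is left out because it has no key.

```python
# Plain-Python model of join semantics (illustrative only, not the Spark API).
# Hypothetical sample data: customers keyed by id, orders as (order_id, customer_id).
customers = {"A": "Alpha", "B": "Beta", "C": "Gamma"}
orders = [(1, "A"), (2, "A"), (3, "B"), (4, "Z")]  # "Z" has no matching customer

order_keys = {cust_id for _, cust_id in orders}

# inner: one output row per matching (customer, order) pair
inner = [(c, o) for c in customers for o, k in orders if k == c]

# Rows that found no partner on the other side, padded with None
unmatched_left = [(c, None) for c in customers if c not in order_keys]
unmatched_right = [(None, o) for o, k in orders if k not in customers]

left = inner + unmatched_left                     # all left rows are kept
right = inner + unmatched_right                   # all right rows are kept
full = inner + unmatched_left + unmatched_right   # rows from both sides are kept

# left_semi / left_anti: only left-side rows, each at most once
left_semi = [c for c in customers if c in order_keys]
left_anti = [c for c in customers if c not in order_keys]

print(len(inner), len(left), len(right), len(full))
print(left_semi, left_anti)
```

Customer `A` appears twice in `inner` because it has two orders, but only once in `left_semi`: a semi join filters the left table and never multiplies its rows.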

## 2. Configuration

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("Jointures Spark") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0") \
    .getOrCreate()

print("Spark ready")

Spark ready


In [2]:
# Load the data
jdbc_url = "jdbc:postgresql://postgres:5432/app"
jdbc_properties = {
    "user": "postgres",
    "password": "postgres",
    "driver": "org.postgresql.Driver"
}

df_orders = spark.read.jdbc(url=jdbc_url, table="orders", properties=jdbc_properties)
df_order_details = spark.read.jdbc(url=jdbc_url, table="order_details", properties=jdbc_properties)
df_products = spark.read.jdbc(url=jdbc_url, table="products", properties=jdbc_properties)
df_customers = spark.read.jdbc(url=jdbc_url, table="customers", properties=jdbc_properties)
df_employees = spark.read.jdbc(url=jdbc_url, table="employees", properties=jdbc_properties)
df_categories = spark.read.jdbc(url=jdbc_url, table="categories", properties=jdbc_properties)
df_suppliers = spark.read.jdbc(url=jdbc_url, table="suppliers", properties=jdbc_properties)

print("Data loaded")

Data loaded


## 3. INNER JOIN

In [3]:
# Inner join: only the rows that have a match in both tables
df_inner = df_orders.join(
    df_customers,
    on="customer_id",
    how="inner"
)

print(f"Orders: {df_orders.count()}")
print(f"Customers: {df_customers.count()}")
print(f"Inner join: {df_inner.count()}")

Orders: 830
Customers: 91
Inner join: 830


In [4]:
# Alternative syntaxes

# Syntax with a join expression
df_join1 = df_orders.join(
    df_customers,
    df_orders.customer_id == df_customers.customer_id,
    "inner"
)

# Caution: this syntax keeps both customer_id columns
print("Columns:")
print(df_join1.columns)

Columns:
['order_id', 'customer_id', 'employee_id', 'order_date', 'required_date', 'shipped_date', 'ship_via', 'freight', 'ship_name', 'ship_address', 'ship_city', 'ship_region', 'ship_postal_code', 'ship_country', 'customer_id', 'company_name', 'contact_name', 'contact_title', 'address', 'city', 'region', 'postal_code', 'country', 'phone', 'fax']


In [5]:
# Join on a list of columns (list several names to join on a composite key)
# Example: order_details with products (a single key here)
df_multi = df_order_details.join(
    df_products,
    on=["product_id"],  # List of columns
    how="inner"
)

df_multi.select("order_id", "product_id", "product_name", "quantity").show(5)

+--------+----------+-----------------+--------+
|order_id|product_id|     product_name|quantity|
+--------+----------+-----------------+--------+
|   10253|        31|Gorgonzola Telino|      20|
|   10272|        31|Gorgonzola Telino|      40|
|   10273|        31|Gorgonzola Telino|      15|
|   10325|        31|Gorgonzola Telino|       4|
|   10335|        31|Gorgonzola Telino|      25|
+--------+----------+-----------------+--------+
only showing top 5 rows


## 4. LEFT JOIN

In [6]:
# No synthetic data is needed for the demonstration:
# the dataset already contains customers who have never ordered

# Left join: all customers, even those without orders
df_left = df_customers.join(
    df_orders,
    on="customer_id",
    how="left"
)

print(f"Customers: {df_customers.count()}")
print(f"Orders: {df_orders.count()}")
print(f"Left join: {df_left.count()}")

Customers: 91
Orders: 830
Left join: 832
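
The count of 832 is consistent with the other outputs in this notebook: the 830 matched rows of the inner join plus the 2 customers without orders. A minimal plain-Python sketch of that arithmetic follows. The helper name `left_join_row_count` and the sample keys are hypothetical, and non-null keys are assumed.

```python
from collections import Counter

def left_join_row_count(left_keys, right_keys):
    """Row count of an equality LEFT JOIN (illustrative model, non-null keys assumed)."""
    right_counts = Counter(right_keys)
    # A left row with n >= 1 matches yields n rows; with no match it yields 1 row.
    return sum(max(right_counts[key], 1) for key in left_keys)

# Hypothetical sample: customer "C" has no orders
customer_keys = ["A", "B", "C"]
order_keys = ["A", "A", "B"]
print(left_join_row_count(customer_keys, order_keys))
```

A left join can therefore return more rows than either input whenever keys repeat on the right side.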


In [7]:
# Find customers without orders
df_sans_commandes = df_left.filter(F.col("order_id").isNull())
print(f"Customers without orders: {df_sans_commandes.count()}")

if df_sans_commandes.count() > 0:
    df_sans_commandes.select("customer_id", "company_name").show()

Customers without orders: 2
+-----------+--------------------+
|customer_id|        company_name|
+-----------+--------------------+
|      PARIS|   Paris spécialités|
|      FISSA|FISSA Fabrica Int...|
+-----------+--------------------+



## 5. LEFT SEMI and LEFT ANTI

In [8]:
# LEFT SEMI: customers that have at least one order (each customer appears once)
# Equivalent to: WHERE EXISTS
df_semi = df_customers.join(
    df_orders,
    on="customer_id",
    how="left_semi"
)

print(f"Customers with orders (left_semi): {df_semi.count()}")
print("Columns:", df_semi.columns)  # Only the columns of customers

Customers with orders (left_semi): 89
Columns: ['customer_id', 'company_name', 'contact_name', 'contact_title', 'address', 'city', 'region', 'postal_code', 'country', 'phone', 'fax']


In [9]:
# LEFT ANTI: customers without any order
# Equivalent to: WHERE NOT EXISTS
df_anti = df_customers.join(
    df_orders,
    on="customer_id",
    how="left_anti"
)

print(f"Customers without orders (left_anti): {df_anti.count()}")
df_anti.show()

Customers without orders (left_anti): 2
+-----------+--------------------+--------------+------------------+--------------------+------+------+-----------+-------+---------------+---------------+
|customer_id|        company_name|  contact_name|     contact_title|             address|  city|region|postal_code|country|          phone|            fax|
+-----------+--------------------+--------------+------------------+--------------------+------+------+-----------+-------+---------------+---------------+
|      PARIS|   Paris spécialités|Marie Bertrand|             Owner|265, boulevard Ch...| Paris|  NULL|      75012| France|(1) 42.34.22.66|(1) 42.34.22.77|
|      FISSA|FISSA Fabrica Int...|    Diego Roel|Accounting Manager|  C/ Moralzarzal, 86|Madrid|  NULL|      28034|  Spain| (91) 555 94 44| (91) 555 55 93|
+-----------+--------------------+--------------+------------------+--------------------+------+------+-----------+-------+---------------+---------------+



## 6. FULL OUTER JOIN

In [10]:
# Full outer: all rows from both tables
df_full = df_customers.join(
    df_orders.select("order_id", "customer_id", "order_date"),
    on="customer_id",
    how="full"
)

print(f"Full outer join: {df_full.count()}")

Full outer join: 832


In [11]:
# Analyze the results of the full join
print("Rows with a customer but no order:")
print(df_full.filter(F.col("order_id").isNull()).count())

print("Rows with an order but no customer:")
print(df_full.filter(F.col("company_name").isNull()).count())

Rows with a customer but no order:
2
Rows with an order but no customer:
0


## 7. CROSS JOIN

In [12]:
# Cross join: Cartesian product (watch the volumes!)
# Every possible combination

# Example: all year x month combinations
df_annees = spark.createDataFrame([(2023,), (2024,)], ["annee"])
df_mois = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["mois"])

df_cross = df_annees.crossJoin(df_mois)
print(f"Years: {df_annees.count()}, Months: {df_mois.count()}")
print(f"Cross join: {df_cross.count()}")
df_cross.orderBy("annee", "mois").show()

Years: 2, Months: 4
Cross join: 8
+-----+----+
|annee|mois|
+-----+----+
| 2023|   1|
| 2023|   2|
| 2023|   3|
| 2023|   4|
| 2024|   1|
| 2024|   2|
| 2024|   3|
| 2024|   4|
+-----+----+



In [13]:
# Practical use: a time x categories grid
df_grille = df_annees.crossJoin(df_mois).crossJoin(
    df_categories.select("category_id", "category_name")
)

print(f"Full grid: {df_grille.count()} combinations")
df_grille.show(10)

Full grid: 64 combinations
+-----+----+-----------+--------------+
|annee|mois|category_id| category_name|
+-----+----+-----------+--------------+
| 2023|   1|          1|     Beverages|
| 2023|   1|          2|    Condiments|
| 2023|   1|          3|   Confections|
| 2023|   1|          4|Dairy Products|
| 2023|   1|          5|Grains/Cereals|
| 2023|   1|          6|  Meat/Poultry|
| 2023|   1|          7|       Produce|
| 2023|   1|          8|       Seafood|
| 2023|   2|          1|     Beverages|
| 2023|   2|          2|    Condiments|
+-----+----+-----------+--------------+
only showing top 10 rows
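
The size of a cross join is the product of the input sizes, which is why the volume warning matters. A small plain-Python sketch of the grid above, assuming the 8 category ids 1 to 8 suggested by the output:

```python
from itertools import product
from math import prod

years = [2023, 2024]
months = [1, 2, 3, 4]
category_ids = list(range(1, 9))  # assumption: 8 categories with ids 1..8

# itertools.product enumerates the same combinations a cross join would produce
grid = list(product(years, months, category_ids))

# The row count is simply the product of the input sizes
expected_size = prod(len(side) for side in (years, months, category_ids))
print(len(grid), expected_size)
```

The same arithmetic applied to two tables of one million rows each gives 10^12 rows, so a cross join is usually only reasonable when at least one side is tiny.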


## 8. Multiple joins (chaining)

In [14]:
# Build a complete view by chaining joins
df_complet = df_order_details \
    .join(df_orders.select("order_id", "customer_id", "employee_id", "order_date"), 
          on="order_id", how="left") \
    .join(df_products.select("product_id", "product_name", "category_id", "supplier_id"), 
          on="product_id", how="left") \
    .join(df_categories.select("category_id", "category_name"), 
          on="category_id", how="left") \
    .join(df_customers.select("customer_id", "company_name", "country"), 
          on="customer_id", how="left") \
    .join(df_employees.select("employee_id", F.concat(F.col("first_name"), F.lit(" "), F.col("last_name")).alias("employee_name")), 
          on="employee_id", how="left")

print(f"Complete view: {df_complet.count()} rows")
df_complet.printSchema()

Complete view: 2155 rows
root
 |-- employee_id: short (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- category_id: short (nullable = true)
 |-- product_id: short (nullable = true)
 |-- order_id: short (nullable = true)
 |-- unit_price: float (nullable = true)
 |-- quantity: short (nullable = true)
 |-- discount: float (nullable = true)
 |-- order_date: date (nullable = true)
 |-- product_name: string (nullable = true)
 |-- supplier_id: short (nullable = true)
 |-- category_name: string (nullable = true)
 |-- company_name: string (nullable = true)
 |-- country: string (nullable = true)
 |-- employee_name: string (nullable = true)



In [15]:
# Show a preview
df_complet.select(
    "order_id", "order_date", "company_name", 
    "employee_name", "product_name", "category_name", 
    "quantity", "unit_price"
).show(10, truncate=False)

+--------+----------+-------------------------+----------------+--------------------------------+--------------+--------+----------+
|order_id|order_date|company_name             |employee_name   |product_name                    |category_name |quantity|unit_price|
+--------+----------+-------------------------+----------------+--------------------------------+--------------+--------+----------+
|10248   |1996-07-04|Vins et alcools Chevalier|Steven Buchanan |Queso Cabrales                  |Dairy Products|12      |14.0      |
|10248   |1996-07-04|Vins et alcools Chevalier|Steven Buchanan |Singaporean Hokkien Fried Mee   |Grains/Cereals|10      |9.8       |
|10248   |1996-07-04|Vins et alcools Chevalier|Steven Buchanan |Mozzarella di Giovanni          |Dairy Products|5       |34.8      |
|10249   |1996-07-05|Toms Spezialitäten       |Michael Suyama  |Tofu                            |Produce       |9       |18.6      |
|10249   |1996-07-05|Toms Spezialitäten       |Michael Suyama  |Manji

## 9. Join optimization

In [16]:
from pyspark.sql.functions import broadcast

# Broadcast join: for small tables
# The small table is sent to all executors

# Without an explicit broadcast hint (Spark picks the join strategy itself)
df_no_broadcast = df_order_details.join(
    df_products,
    df_order_details.product_id == df_products.product_id
)

# With an explicit broadcast hint (usually faster when one side is small)
df_with_broadcast = df_order_details.join(
    df_products.select("product_id", "product_name", "category_id"),
    on="product_id"
).join(
    broadcast(df_categories),  # categories is small -> broadcast
    on="category_id"
)

print(f"Join with broadcast: {df_with_broadcast.count()}")

Join with broadcast: 2155
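
The idea behind a broadcast join, as the cell's comments describe it, is that every executor receives a full copy of the small table and joins its own partitions locally, so the large table does not need to be shuffled. The following plain-Python sketch models that mechanism only. The data, the partition layout and the function name `broadcast_hash_join` are hypothetical, and this is not how Spark is implemented internally.

```python
# Small table, copied ("broadcast") to every worker: category_id -> category_name
categories = {1: "Beverages", 2: "Condiments"}

# Large table, already split into partitions of (product_id, category_id) rows
partitions = [
    [(1, 1), (2, 1)],
    [(3, 2), (4, 9)],  # category 9 is missing from the small table
]

def broadcast_hash_join(partition, small_table):
    """Inner-join one partition against a local copy of the small table."""
    return [
        (product_id, category_id, small_table[category_id])
        for product_id, category_id in partition
        if category_id in small_table  # hash lookup, no sorting or shuffling
    ]

# Each partition is processed independently, then the results are concatenated
joined = [row for part in partitions for row in broadcast_hash_join(part, categories)]
print(joined)
```

The trade-off is memory: the small table must fit on every executor, so the hint only makes sense for genuinely small inputs such as `categories`.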


In [17]:
# Show the execution plan with broadcast
df_with_broadcast.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [category_id#22, product_id#15, order_id#14, unit_price#16, quantity#17, discount#18, product_name#20, category_name#59, description#60, picture#61]
   +- BroadcastHashJoin [category_id#22], [category_id#58], Inner, BuildRight, false
      :- Project [product_id#15, order_id#14, unit_price#16, quantity#17, discount#18, product_name#20, category_id#22]
      :  +- SortMergeJoin [product_id#15], [product_id#19], Inner
      :     :- Sort [product_id#15 ASC NULLS FIRST], false, 0
      :     :  +- Exchange hashpartitioning(product_id#15, 200), ENSURE_REQUIREMENTS, [plan_id=4796]
      :     :     +- Scan JDBCRelation(order_details) [numPartitions=1] [order_id#14,product_id#15,unit_price#16,quantity#17,discount#18] PushedFilters: [*IsNotNull(product_id)], ReadSchema: struct<order_id:smallint,product_id:smallint,unit_price:float,quantity:smallint,discount:float>
      :     +- Sort [product_id#19 ASC NULLS FIRST], false, 0
 

In [18]:
# Avoid duplicate columns
# Bad: keeps customer_id twice
df_bad = df_orders.join(
    df_customers,
    df_orders.customer_id == df_customers.customer_id
)
print("With duplicates:", [c for c in df_bad.columns if 'customer_id' in c])

# Good: a single column
df_good = df_orders.join(
    df_customers,
    on="customer_id"  # Use on= to avoid duplicates
)
print("Without duplicates:", [c for c in df_good.columns if 'customer_id' in c])

With duplicates: ['customer_id', 'customer_id']
Without duplicates: ['customer_id']


## 10. Joins with complex conditions

In [19]:
# Join with a compound condition
# Example: find pairs of distinct orders handled by the same employee on the same date

df_self = df_orders.alias("o1").join(
    df_orders.alias("o2"),
    (F.col("o1.employee_id") == F.col("o2.employee_id")) &
    (F.col("o1.order_id") != F.col("o2.order_id")) &
    (F.col("o1.order_date") == F.col("o2.order_date")),
    "inner"
)

print("Orders handled by the same employee on the same day:")
df_self.select(
    F.col("o1.order_id").alias("order_1"),
    F.col("o2.order_id").alias("order_2"),
    F.col("o1.employee_id"),
    F.col("o1.order_date")
).show(10)

Orders handled by the same employee on the same day:
+-------+-------+-----------+----------+
|order_1|order_2|employee_id|order_date|
+-------+-------+-----------+----------+
|  11024|  11026|          4|1998-04-15|
|  11026|  11024|          4|1998-04-15|
|  10766|  10767|          4|1997-12-05|
|  10767|  10766|          4|1997-12-05|
|  10968|  10969|          1|1998-03-23|
|  10969|  10968|          1|1998-03-23|
|  10854|  10855|          3|1998-01-27|
|  10855|  10854|          3|1998-01-27|
|  10844|  10845|          8|1998-01-21|
|  10845|  10844|          8|1998-01-21|
+-------+-------+-----------+----------+
only showing top 10 rows
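
Every pair in the output appears twice, once as (a, b) and once as (b, a), because the `!=` condition is symmetric. Replacing it with a strict `<` keeps each pair once. The plain-Python sketch below models both conditions on a few sample rows. The first four are taken from the output above, and the last row is hypothetical.

```python
# (order_id, employee_id, order_date) sample rows
orders = [
    (11024, 4, "1998-04-15"),
    (11026, 4, "1998-04-15"),
    (10766, 4, "1997-12-05"),
    (10767, 4, "1997-12-05"),
    (99999, 1, "1997-12-05"),  # hypothetical row: same date, different employee
]

def same_employee_same_day(a, b):
    return a[1] == b[1] and a[2] == b[2]

# Symmetric condition (!=): every pair shows up in both orders
mirrored = [(a[0], b[0]) for a in orders for b in orders
            if same_employee_same_day(a, b) and a[0] != b[0]]

# Asymmetric condition (<): each unordered pair shows up once
unique = [(a[0], b[0]) for a in orders for b in orders
          if same_employee_same_day(a, b) and a[0] < b[0]]

print(len(mirrored), len(unique))
print(unique)
```

In the PySpark cell, the equivalent change would be to write `F.col("o1.order_id") < F.col("o2.order_id")` in place of the `!=` comparison.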


In [20]:
# Join on an inequality (range join)
# Example: assign each product to a price band

df_fourchettes = spark.createDataFrame([
    ("Economique", 0, 10),
    ("Standard", 10, 25),
    ("Premium", 25, 50),
    ("Luxe", 50, 1000)
], ["categorie_prix", "prix_min", "prix_max"])

df_range = df_products.join(
    df_fourchettes,
    (F.col("unit_price") >= F.col("prix_min")) &
    (F.col("unit_price") < F.col("prix_max")),
    "left"
)

df_range.select("product_name", "unit_price", "categorie_prix").show(10)

+--------------------+----------+--------------+
|        product_name|unit_price|categorie_prix|
+--------------------+----------+--------------+
|                Chai|      18.0|      Standard|
|               Chang|      19.0|      Standard|
|       Aniseed Syrup|      10.0|      Standard|
|Chef Anton's Caju...|      22.0|      Standard|
|Chef Anton's Gumb...|     21.35|      Standard|
|Grandma's Boysenb...|      25.0|       Premium|
|Uncle Bob's Organ...|      30.0|       Premium|
|Northwoods Cranbe...|      40.0|       Premium|
|     Mishi Kobe Niku|      97.0|          Luxe|
|               Ikura|      31.0|       Premium|
+--------------------+----------+--------------+
only showing top 10 rows
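
The bands are half-open intervals, `prix_min <= unit_price < prix_max`, which is why a price of exactly 10.0 lands in "Standard" and 25.0 in "Premium" in the output above. A minimal plain-Python sketch of the same lookup, using the band values from the cell (the function name `price_band` is hypothetical):

```python
from bisect import bisect_right

# Lower bounds and labels of the bands defined in the cell above
lower_bounds = [0, 10, 25, 50]
labels = ["Economique", "Standard", "Premium", "Luxe"]
UPPER_LIMIT = 1000  # prix_max of the last band

def price_band(price):
    """Return the label of the half-open band [min, max) containing price, else None."""
    if price < lower_bounds[0] or price >= UPPER_LIMIT:
        return None  # mirrors the NULL produced by the left join when no band matches
    # bisect_right counts the lower bounds that are <= price
    return labels[bisect_right(lower_bounds, price) - 1]

for price in (9.99, 10.0, 25.0, 97.0, 1000.0):
    print(price, price_band(price))
```

Because the join condition contains no equality, Spark generally cannot use a hash or sort-merge strategy for it, so range joins are best kept to cases where one side is small, as with these four bands.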


---

## Exercise

**Goal**: Master joins

**Instructions**:
1. Find the products that have never been ordered (left anti)
2. Build a complete sales view with:
   - Order information
   - Customer name and country
   - Product name and category
   - Supplier name
3. Compute the revenue per supplier

Your turn:

In [21]:
# TODO: Products never ordered

df_never_ordered = df_products.join(
    df_order_details,
    on="product_id",
    how="left_anti"
)

print(f"Number of products never ordered: {df_never_ordered.count()}")
df_never_ordered.select("product_id", "product_name").show(truncate=False)

Number of products never ordered: 0
+----------+------------+
|product_id|product_name|
+----------+------------+
+----------+------------+



In [22]:
# TODO: Complete view with supplier

# Chain the joins; drop 'unit_price' from products so that only the one from order_details (the price actually paid) remains
df_products_clean = df_products.drop("unit_price")

df_vue_complete = df_order_details \
    .join(df_orders, on="order_id", how="left") \
    .join(df_products_clean, on="product_id", how="left") \
    .join(df_customers, on="customer_id", how="left") \
    .join(df_categories, on="category_id", how="left") \
    .join(
        df_suppliers.select(F.col("supplier_id"), F.col("company_name").alias("supplier_name")),
        on="supplier_id",
        how="left"
    )

# Step 3: compute the revenue per supplier

df_ca_fournisseur = df_vue_complete.withColumn(
    "montant",
    F.col("unit_price") * F.col("quantity")  # unit_price is unambiguous now that the products copy was dropped
).groupBy("supplier_name").agg(
    F.round(F.sum("montant"), 2).alias("ca_total")
).orderBy(F.desc("ca_total"))

print("--- Top suppliers by revenue ---")
df_ca_fournisseur.show(10, truncate=False)

--- Top suppliers by revenue ---
+---------------------------------+---------+
|supplier_name                    |ca_total |
+---------------------------------+---------+
|Aux joyeux ecclésiastiques       |163135.0 |
|Plutzer Lebensmittelgroßmärkte AG|155946.55|
|Gai pâturage                     |126582.0 |
|Pavlova, Ltd.                    |115386.05|
|G'day, Mate                      |69636.6  |
|Forêts d'érables                 |66266.7  |
|Specialty Biscuits, Ltd.         |63071.4  |
|Pasta Buttini s.r.l.             |52929.0  |
|Formaggi Fortini s.r.l.          |51082.5  |
|Norske Meierier                  |46897.2  |
+---------------------------------+---------+
only showing top 10 rows


In [23]:
# TODO: Revenue per supplier
from pyspark.sql import functions as F

# Compute the line amount (price * quantity; the discount column is not applied)
df_ca_fournisseur = df_vue_complete.withColumn(
    "montant",
    F.col("unit_price") * F.col("quantity")
).groupBy("supplier_name").agg(
    F.round(F.sum("montant"), 2).alias("ca_total")
).orderBy(F.desc("ca_total"))

print("--- Top suppliers by revenue ---")
df_ca_fournisseur.show(10, truncate=False)

--- Top suppliers by revenue ---
+---------------------------------+---------+
|supplier_name                    |ca_total |
+---------------------------------+---------+
|Aux joyeux ecclésiastiques       |163135.0 |
|Plutzer Lebensmittelgroßmärkte AG|155946.55|
|Gai pâturage                     |126582.0 |
|Pavlova, Ltd.                    |115386.05|
|G'day, Mate                      |69636.6  |
|Forêts d'érables                 |66266.7  |
|Specialty Biscuits, Ltd.         |63071.4  |
|Pasta Buttini s.r.l.             |52929.0  |
|Formaggi Fortini s.r.l.          |51082.5  |
|Norske Meierier                  |46897.2  |
+---------------------------------+---------+
only showing top 10 rows
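
The figures above are gross amounts: the calculation multiplies `unit_price` by `quantity` and leaves out the `discount` column that `order_details` also carries. Assuming `discount` is stored as a fraction between 0 and 1 (an assumption about the data, not something this notebook verifies), a net figure would multiply each line by `(1 - discount)`. A plain-Python sketch on hypothetical order lines:

```python
from collections import defaultdict

# Hypothetical order lines: (supplier_name, unit_price, quantity, discount)
order_lines = [
    ("Supplier A", 10.0, 5, 0.0),
    ("Supplier A", 20.0, 2, 0.25),
    ("Supplier B", 8.0, 10, 0.5),
]

gross = defaultdict(float)
net = defaultdict(float)
for supplier, unit_price, quantity, discount in order_lines:
    amount = unit_price * quantity
    gross[supplier] += amount                 # what the notebook computes
    net[supplier] += amount * (1 - discount)  # assumes discount is a fraction in [0, 1]

for supplier in gross:
    print(supplier, gross[supplier], net[supplier])
```

In the PySpark cell, the corresponding expression would be `F.col("unit_price") * F.col("quantity") * (1 - F.col("discount"))`.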


---

## Summary

In this notebook, you learned:
- The **join types**: inner, left, right, full, cross, semi, anti
- How to **chain** several joins
- How to use **broadcast** to optimize
- How to avoid **duplicate columns**
- How to write joins with **complex conditions**

### Next step
In the next notebook, we will learn about SQL functions with Spark.