# Exercise 14 - SQL functions with Spark

## Objectives
- Use Spark SQL to query DataFrames
- Master the common SQL functions
- Combine SQL with the DataFrame API
- Create temporary and permanent views

---

## 1. Spark SQL

```
+------------------------------------------------------------------+
|                         SPARK SQL                                |
+------------------------------------------------------------------+
|                                                                  |
|  DataFrame     <------>    SQL view   <------>    SQL query      |
|                                                                  |
|  df_orders     ------>    orders      ------>    SELECT * FROM   |
|                           (temp view)            orders          |
|                                                                  |
+------------------------------------------------------------------+
|                                                                  |
|  Advantages of SQL:                                              |
|  - Familiar syntax                                               |
|  - Readable complex queries                                      |
|  - Same optimizer (Catalyst) as the DataFrame API                |
|                                                                  |
+------------------------------------------------------------------+
```

## 2. Configuration

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("Spark SQL") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0") \
    .getOrCreate()

print("Spark ready")

Spark ready


In [2]:
# Load the data
jdbc_url = "jdbc:postgresql://postgres:5432/app"
jdbc_properties = {
    "user": "postgres",
    "password": "postgres",
    "driver": "org.postgresql.Driver"
}

tables = ["orders", "order_details", "products", "customers", "employees", "categories", "suppliers"]

for table in tables:
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=jdbc_properties)
    df.createOrReplaceTempView(table)  # Register the DataFrame as a temporary SQL view
    print(f"[OK] {table}: {df.count()} rows")

print("\nAll SQL views created")

[OK] orders: 830 rows
[OK] order_details: 2155 rows
[OK] products: 77 rows
[OK] customers: 91 rows
[OK] employees: 9 rows
[OK] categories: 8 rows
[OK] suppliers: 29 rows

All SQL views created


## 3. Basic SQL queries

In [3]:
# Simple SELECT
spark.sql("""
    SELECT customer_id, company_name, country
    FROM customers
    LIMIT 10
""").show()

+-----------+--------------------+-------+
|customer_id|        company_name|country|
+-----------+--------------------+-------+
|      ALFKI| Alfreds Futterkiste|Germany|
|      ANATR|Ana Trujillo Empa...| Mexico|
|      ANTON|Antonio Moreno Ta...| Mexico|
|      AROUT|     Around the Horn|     UK|
|      BERGS|  Berglunds snabbköp| Sweden|
|      BLAUS|Blauer See Delika...|Germany|
|      BLONP|Blondesddsl père ...| France|
|      BOLID|Bólido Comidas pr...|  Spain|
|      BONAP|            Bon app'| France|
|      BOTTM|Bottom-Dollar Mar...| Canada|
+-----------+--------------------+-------+



In [4]:
# WHERE
spark.sql("""
    SELECT product_name, unit_price
    FROM products
    WHERE unit_price > 50
    ORDER BY unit_price DESC
""").show()

+--------------------+----------+
|        product_name|unit_price|
+--------------------+----------+
|       Côte de Blaye|     263.5|
|Thüringer Rostbra...|    123.79|
|     Mishi Kobe Niku|      97.0|
|Sir Rodney's Marm...|      81.0|
|    Carnarvon Tigers|      62.5|
|Raclette Courdavault|      55.0|
|Manjimup Dried Ap...|      53.0|
+--------------------+----------+



In [5]:
# GROUP BY
spark.sql("""
    SELECT country, COUNT(*) as nb_clients
    FROM customers
    GROUP BY country
    ORDER BY nb_clients DESC
""").show()

+-----------+----------+
|    country|nb_clients|
+-----------+----------+
|        USA|        13|
|    Germany|        11|
|     France|        11|
|     Brazil|         9|
|         UK|         7|
|      Spain|         5|
|     Mexico|         5|
|  Venezuela|         4|
|  Argentina|         3|
|      Italy|         3|
|     Canada|         3|
|     Sweden|         2|
|    Belgium|         2|
|    Finland|         2|
|    Denmark|         2|
|Switzerland|         2|
|   Portugal|         2|
|    Austria|         2|
|     Norway|         1|
|    Ireland|         1|
+-----------+----------+
only showing top 20 rows


In [6]:
# HAVING
spark.sql("""
    SELECT country, COUNT(*) as nb_clients
    FROM customers
    GROUP BY country
    HAVING COUNT(*) >= 5
    ORDER BY nb_clients DESC
""").show()

+-------+----------+
|country|nb_clients|
+-------+----------+
|    USA|        13|
|Germany|        11|
| France|        11|
| Brazil|         9|
|     UK|         7|
|  Spain|         5|
| Mexico|         5|
+-------+----------+
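
The difference between `WHERE` and `HAVING` is that `HAVING` filters the groups after aggregation. The following is a minimal plain-Python sketch of that order of operations, not Spark code; the `countries` list is a made-up sample rather than the real `customers` table.

```python
from collections import Counter

# Hypothetical sample: one entry per customer row (not the real table).
countries = ["USA"] * 6 + ["France"] * 5 + ["Norway"] + ["Poland"]

# GROUP BY country + COUNT(*): one counter entry per group.
nb_clients = Counter(countries)

# HAVING COUNT(*) >= 5: the filter runs on the groups, after aggregation.
having = {country: n for country, n in nb_clients.items() if n >= 5}

# ORDER BY nb_clients DESC
result = sorted(having.items(), key=lambda item: item[1], reverse=True)
print(result)
```

A `WHERE` clause, by contrast, would remove individual rows from `countries` before the counter is built.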



## 4. Joins in SQL

In [7]:
# INNER JOIN
spark.sql("""
    SELECT 
        o.order_id,
        o.order_date,
        c.company_name,
        c.country
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.customer_id
    LIMIT 10
""").show()

+--------+----------+--------------+-------+
|order_id|order_date|  company_name|country|
+--------+----------+--------------+-------+
|   10374|1996-12-05|Wolski  Zajazd| Poland|
|   10611|1997-07-25|Wolski  Zajazd| Poland|
|   10792|1997-12-23|Wolski  Zajazd| Poland|
|   10870|1998-02-04|Wolski  Zajazd| Poland|
|   10906|1998-02-25|Wolski  Zajazd| Poland|
|   10998|1998-04-03|Wolski  Zajazd| Poland|
|   11044|1998-04-23|Wolski  Zajazd| Poland|
|   10529|1997-05-07|  Maison Dewey|Belgium|
|   10649|1997-08-28|  Maison Dewey|Belgium|
|   10760|1997-12-01|  Maison Dewey|Belgium|
+--------+----------+--------------+-------+



In [8]:
# Multi-table join
spark.sql("""
    SELECT 
        o.order_id,
        c.company_name,
        p.product_name,
        cat.category_name,
        od.quantity,
        od.unit_price
    FROM order_details od
    JOIN orders o ON od.order_id = o.order_id
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN products p ON od.product_id = p.product_id
    JOIN categories cat ON p.category_id = cat.category_id
    LIMIT 10
""").show(truncate=False)

+--------+-----------------+--------------------+-------------+--------+----------+
|order_id|company_name     |product_name        |category_name|quantity|unit_price|
+--------+-----------------+--------------------+-------------+--------+----------+
|10623   |Frankenversand   |Guaraná Fantástica  |Beverages    |3       |4.5       |
|10623   |Frankenversand   |Steeleye Stout      |Beverages    |30      |18.0      |
|10817   |Königlich Essen  |Côte de Blaye       |Beverages    |30      |263.5     |
|10703   |Folk och fä HB   |Chang               |Beverages    |5       |19.0      |
|11025   |Wartian Herkku   |Chai                |Beverages    |10      |18.0      |
|10468   |Königlich Essen  |Ipoh Coffee         |Beverages    |15      |36.8      |
|10632   |Die Wandernde Kuh|Chang               |Beverages    |30      |19.0      |
|10788   |QUICK-Stop       |Rhönbräu Klosterbier|Beverages    |40      |7.75      |
|10840   |LINO-Delicateses |Chartreuse verte    |Beverages    |10      |18.0

In [9]:
# LEFT JOIN with aggregation: customers without orders are kept, with a count of 0
spark.sql("""
    SELECT 
        c.customer_id,
        c.company_name,
        COUNT(o.order_id) as nb_commandes
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.company_name
    ORDER BY nb_commandes
    LIMIT 10
""").show()

+-----------+--------------------+------------+
|customer_id|        company_name|nb_commandes|
+-----------+--------------------+------------+
|      PARIS|   Paris spécialités|           0|
|      FISSA|FISSA Fabrica Int...|           0|
|      CENTC|Centro comercial ...|           1|
|      GROSR|GROSELLA-Restaurante|           2|
|      LAZYK|Lazy K Kountry Store|           2|
|      CONSH|Consolidated Hold...|           3|
|      TRAIH|Trail's Head Gour...|           3|
|      FRANR| France restauration|           3|
|      BOLID|Bólido Comidas pr...|           3|
|      THECR|     The Cracker Box|           3|
+-----------+--------------------+------------+
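
The zeros in the output above come from counting `o.order_id` rather than `*`: after a `LEFT JOIN`, a customer without orders still produces one row, with NULL in the order columns. Here is a small plain-Python sketch of that behaviour, assuming miniature illustrative tables and a hypothetical `left_join` helper (this is not how Spark executes the join).

```python
# Illustrative miniature tables (not the full Northwind data).
customers = ["PARIS", "CENTC", "LAZYK"]
orders = [(10259, "CENTC"), (10482, "LAZYK"), (10545, "LAZYK")]  # (order_id, customer_id)

def left_join(customers, orders):
    """Yield (customer_id, order_id); order_id is None when no order matches."""
    for customer_id in customers:
        matches = [oid for oid, cid in orders if cid == customer_id]
        if matches:
            for oid in matches:
                yield customer_id, oid
        else:
            yield customer_id, None  # unmatched left row is kept

count_order_id = {c: 0 for c in customers}  # COUNT(o.order_id): skips NULLs
count_star = {c: 0 for c in customers}      # COUNT(*): counts every joined row
for customer_id, order_id in left_join(customers, orders):
    count_star[customer_id] += 1
    if order_id is not None:
        count_order_id[customer_id] += 1

print(count_order_id)
print(count_star)
```

With `COUNT(*)`, the customer without orders would be reported with 1 instead of 0, which is why the query counts a column from the right-hand table.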



## 5. Date functions

In [10]:
# Extract date components (DAYOFWEEK: 1 = Sunday, ..., 7 = Saturday)
spark.sql("""
    SELECT 
        order_id,
        order_date,
        YEAR(order_date) as annee,
        MONTH(order_date) as mois,
        DAY(order_date) as jour,
        QUARTER(order_date) as trimestre,
        DAYOFWEEK(order_date) as jour_semaine,
        WEEKOFYEAR(order_date) as semaine
    FROM orders
    LIMIT 5
""").show()

+--------+----------+-----+----+----+---------+------------+-------+
|order_id|order_date|annee|mois|jour|trimestre|jour_semaine|semaine|
+--------+----------+-----+----+----+---------+------------+-------+
|   10248|1996-07-04| 1996|   7|   4|        3|           5|     27|
|   10249|1996-07-05| 1996|   7|   5|        3|           6|     27|
|   10250|1996-07-08| 1996|   7|   8|        3|           2|     28|
|   10251|1996-07-08| 1996|   7|   8|        3|           2|     28|
|   10252|1996-07-09| 1996|   7|   9|        3|           3|     28|
+--------+----------+-----+----+----+---------+------------+-------+
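
The numbering conventions are easy to get wrong: `DAYOFWEEK` starts at 1 for Sunday, and `WEEKOFYEAR` follows ISO 8601 weeks (starting on Monday). The sketch below reproduces the first output rows with Python's `datetime` module; `spark_like_date_parts` is a hypothetical helper written for this illustration, not a Spark function.

```python
from datetime import date

def spark_like_date_parts(d):
    """Mimic the date components selected in the query above."""
    return {
        "annee": d.year,
        "mois": d.month,
        "jour": d.day,
        "trimestre": (d.month - 1) // 3 + 1,
        # isoweekday(): Monday = 1 ... Sunday = 7; shift so that Sunday = 1.
        "jour_semaine": d.isoweekday() % 7 + 1,
        # ISO 8601 week number, as used by WEEKOFYEAR.
        "semaine": d.isocalendar()[1],
    }

parts = spark_like_date_parts(date(1996, 7, 4))
print(parts)
```

For 1996-07-04 (a Thursday) this gives `jour_semaine` 5 and `semaine` 27, in line with order 10248 above.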



In [11]:
# Compute the difference between two dates
spark.sql("""
    SELECT 
        order_id,
        order_date,
        shipped_date,
        DATEDIFF(shipped_date, order_date) as delai_jours
    FROM orders
    WHERE shipped_date IS NOT NULL
    LIMIT 10
""").show()

+--------+----------+------------+-----------+
|order_id|order_date|shipped_date|delai_jours|
+--------+----------+------------+-----------+
|   10248|1996-07-04|  1996-07-16|         12|
|   10249|1996-07-05|  1996-07-10|          5|
|   10250|1996-07-08|  1996-07-12|          4|
|   10251|1996-07-08|  1996-07-15|          7|
|   10252|1996-07-09|  1996-07-11|          2|
|   10253|1996-07-10|  1996-07-16|          6|
|   10254|1996-07-11|  1996-07-23|         12|
|   10255|1996-07-12|  1996-07-15|          3|
|   10256|1996-07-15|  1996-07-17|          2|
|   10257|1996-07-16|  1996-07-22|          6|
+--------+----------+------------+-----------+
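
`DATEDIFF` takes the end date first and the start date second, so swapping the arguments flips the sign. A minimal plain-Python sketch of the same arithmetic, where `datediff` is a hypothetical helper mirroring that argument order:

```python
from datetime import date

def datediff(end_date, start_date):
    """Number of days from start_date to end_date (end date comes first)."""
    return (end_date - start_date).days

# Order 10248: ordered on 1996-07-04, shipped on 1996-07-16.
delai_jours = datediff(date(1996, 7, 16), date(1996, 7, 4))
inverse = datediff(date(1996, 7, 4), date(1996, 7, 16))
print(delai_jours, inverse)
```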



In [12]:
# Format dates
spark.sql("""
    SELECT 
        order_id,
        order_date,
        DATE_FORMAT(order_date, 'dd/MM/yyyy') as date_fr,
        DATE_FORMAT(order_date, 'MMMM yyyy') as mois_annee,
        DATE_FORMAT(order_date, 'EEEE') as nom_jour
    FROM orders
    LIMIT 5
""").show(truncate=False)

+--------+----------+----------+----------+--------+
|order_id|order_date|date_fr   |mois_annee|nom_jour|
+--------+----------+----------+----------+--------+
|10248   |1996-07-04|04/07/1996|July 1996 |Thursday|
|10249   |1996-07-05|05/07/1996|July 1996 |Friday  |
|10250   |1996-07-08|08/07/1996|July 1996 |Monday  |
|10251   |1996-07-08|08/07/1996|July 1996 |Monday  |
|10252   |1996-07-09|09/07/1996|July 1996 |Tuesday |
+--------+----------+----------+----------+--------+



## 6. String functions

In [13]:
# String manipulation
spark.sql("""
    SELECT 
        company_name,
        UPPER(company_name) as upper_name,
        LOWER(company_name) as lower_name,
        LENGTH(company_name) as longueur,
        SUBSTRING(company_name, 1, 10) as debut
    FROM customers
    LIMIT 5
""").show(truncate=False)

+----------------------------------+----------------------------------+----------------------------------+--------+----------+
|company_name                      |upper_name                        |lower_name                        |longueur|debut     |
+----------------------------------+----------------------------------+----------------------------------+--------+----------+
|Alfreds Futterkiste               |ALFREDS FUTTERKISTE               |alfreds futterkiste               |19      |Alfreds Fu|
|Ana Trujillo Emparedados y helados|ANA TRUJILLO EMPAREDADOS Y HELADOS|ana trujillo emparedados y helados|34      |Ana Trujil|
|Antonio Moreno Taquería           |ANTONIO MORENO TAQUERÍA           |antonio moreno taquería           |23      |Antonio Mo|
|Around the Horn                   |AROUND THE HORN                   |around the horn                   |15      |Around the|
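|Berglunds snabbköp                |BERGLUNDS SNABBKÖP                |berglunds snabbköp                |18      |Berglunds |
+----------------------------------+----------------------------------+----------------------------------+--------+----------+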

In [14]:
# Concatenation (CONCAT and CONCAT_WS)
spark.sql("""
    SELECT 
        first_name,
        last_name,
        CONCAT(first_name, ' ', last_name) as full_name,
        CONCAT_WS(', ', last_name, first_name) as nom_formel
    FROM employees
""").show()

+----------+---------+----------------+-----------------+
|first_name|last_name|       full_name|       nom_formel|
+----------+---------+----------------+-----------------+
|     Nancy|  Davolio|   Nancy Davolio|   Davolio, Nancy|
|    Andrew|   Fuller|   Andrew Fuller|   Fuller, Andrew|
|     Janet|Leverling| Janet Leverling| Leverling, Janet|
|  Margaret|  Peacock|Margaret Peacock|Peacock, Margaret|
|    Steven| Buchanan| Steven Buchanan| Buchanan, Steven|
|   Michael|   Suyama|  Michael Suyama|  Suyama, Michael|
|    Robert|     King|     Robert King|     King, Robert|
|     Laura| Callahan|  Laura Callahan|  Callahan, Laura|
|      Anne|Dodsworth|  Anne Dodsworth|  Dodsworth, Anne|
+----------+---------+----------------+-----------------+
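
The two functions differ in how they treat NULL: `CONCAT` returns NULL as soon as one argument is NULL, while `CONCAT_WS` skips NULL arguments. The plain-Python sketch below models that difference with `None`; the `concat` and `concat_ws` helpers are written for this illustration and do not cover every case (for example a NULL separator).

```python
def concat(*parts):
    """Model of CONCAT: any NULL (None) argument makes the result NULL."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)

def concat_ws(sep, *parts):
    """Model of CONCAT_WS: NULL (None) arguments are skipped."""
    return sep.join(p for p in parts if p is not None)

print(concat("Nancy", " ", "Davolio"))
print(concat("Nancy", " ", None))
print(concat_ws(", ", "Davolio", "Nancy"))
print(concat_ws(", ", "Davolio", None))
```

In the `employees` output above no name is NULL, so both functions return a value on every row.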



In [15]:
# Search and pattern matching
spark.sql("""
    SELECT product_name
    FROM products
    WHERE product_name LIKE '%Sauce%'
       OR product_name LIKE '%Syrup%'
""").show(truncate=False)

+--------------------------------+
|product_name                    |
+--------------------------------+
|Aniseed Syrup                   |
|Northwoods Cranberry Sauce      |
|Louisiana Fiery Hot Pepper Sauce|
+--------------------------------+



In [16]:
# Regular expression
spark.sql("""
    SELECT product_name
    FROM products
    WHERE product_name RLIKE '^[A-C]'
""").show(truncate=False)

+----------------------------+
|product_name                |
+----------------------------+
|Chai                        |
|Chang                       |
|Aniseed Syrup               |
|Chef Anton's Cajun Seasoning|
|Chef Anton's Gumbo Mix      |
|Alice Mutton                |
|Carnarvon Tigers            |
|Côte de Blaye               |
|Chartreuse verte            |
|Boston Crab Meat            |
|Chocolade                   |
|Camembert Pierrot           |
+----------------------------+
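
`RLIKE` succeeds when the regular expression matches anywhere in the string, so the `^` anchor is what restricts the match to the first character. A rough equivalent with Python's `re` module is sketched below; Spark uses Java regular expressions, which behave the same way for a simple pattern like this one. The list mixes product names from this notebook with one made-up lowercase entry.

```python
import re

# Names taken from the outputs in this notebook, plus one hypothetical entry.
product_names = [
    "Chai",
    "Aniseed Syrup",
    "Mishi Kobe Niku",
    "Boston Crab Meat",
    "chocolate (hypothetical lowercase name)",
]

pattern = re.compile(r"^[A-C]")  # first character is an uppercase A, B or C

# search() looks for a match anywhere, like RLIKE; '^' pins it to the start.
matches = [name for name in product_names if pattern.search(name)]
print(matches)
```

The lowercase entry is rejected because the character class is case-sensitive.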



## 7. Aggregate functions

In [17]:
# Basic aggregations
spark.sql("""
    SELECT 
        COUNT(*) as nb_produits,
        ROUND(AVG(unit_price), 2) as prix_moyen,
        MIN(unit_price) as prix_min,
        MAX(unit_price) as prix_max,
        ROUND(SUM(units_in_stock), 0) as stock_total
    FROM products
""").show()

+-----------+----------+--------+--------+-----------+
|nb_produits|prix_moyen|prix_min|prix_max|stock_total|
+-----------+----------+--------+--------+-----------+
|         77|     28.83|     2.5|   263.5|       3119|
+-----------+----------+--------+--------+-----------+



In [18]:
# Aggregation with GROUP BY
spark.sql("""
    SELECT 
        cat.category_name,
        COUNT(*) as nb_produits,
        ROUND(AVG(p.unit_price), 2) as prix_moyen,
        SUM(p.units_in_stock) as stock_total
    FROM products p
    JOIN categories cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name
    ORDER BY nb_produits DESC
""").show()

+--------------+-----------+----------+-----------+
| category_name|nb_produits|prix_moyen|stock_total|
+--------------+-----------+----------+-----------+
|   Confections|         13|     25.16|        386|
|    Condiments|         12|     22.85|        507|
|     Beverages|         12|     37.98|        559|
|       Seafood|         12|     20.68|        701|
|Dairy Products|         10|     28.73|        393|
|Grains/Cereals|          7|     20.25|        308|
|  Meat/Poultry|          6|     54.01|        165|
|       Produce|          5|     32.37|        100|
+--------------+-----------+----------+-----------+



In [19]:
# Conditional aggregation
spark.sql("""
    SELECT 
        category_name,
        COUNT(*) as total_produits,
        SUM(CASE WHEN unit_price < 20 THEN 1 ELSE 0 END) as produits_pas_chers,
        SUM(CASE WHEN unit_price >= 20 AND unit_price < 50 THEN 1 ELSE 0 END) as produits_moyens,
        SUM(CASE WHEN unit_price >= 50 THEN 1 ELSE 0 END) as produits_chers
    FROM products p
    JOIN categories c ON p.category_id = c.category_id
    GROUP BY category_name
""").show()

+--------------+--------------+------------------+---------------+--------------+
| category_name|total_produits|produits_pas_chers|produits_moyens|produits_chers|
+--------------+--------------+------------------+---------------+--------------+
|Dairy Products|            10|                 2|              7|             1|
|  Meat/Poultry|             6|                 1|              3|             2|
|    Condiments|            12|                 5|              7|             0|
|     Beverages|            12|                10|              1|             1|
|Grains/Cereals|             7|                 4|              3|             0|
|       Seafood|            12|                 8|              3|             1|
|   Confections|            13|                 8|              4|             1|
|       Produce|             5|                 1|              3|             1|
+--------------+--------------+------------------+---------------+--------------+
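
`SUM(CASE WHEN ... THEN 1 ELSE 0 END)` counts the rows that satisfy a condition without filtering the others out. The plain-Python sketch below applies the same three buckets to the eleven Beverages prices that are visible in the window-function output of section 10; the twelfth price is not shown there, so the counts cover eleven products only.

```python
# The eleven Beverages unit prices visible in section 10 (one is not shown).
beverage_prices = [263.5, 46.0, 19.0, 18.0, 18.0, 18.0, 18.0, 15.0, 14.0, 14.0, 7.75]

# Each SUM(CASE WHEN ... THEN 1 ELSE 0 END) becomes a sum of 0/1 flags.
produits_pas_chers = sum(1 if p < 20 else 0 for p in beverage_prices)
produits_moyens = sum(1 if 20 <= p < 50 else 0 for p in beverage_prices)
produits_chers = sum(1 if p >= 50 else 0 for p in beverage_prices)

print(produits_pas_chers, produits_moyens, produits_chers)
```

Because the three conditions are mutually exclusive and cover every price, the three counts always add up to the number of rows in the group.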



## 8. Subqueries

In [20]:
# Subquery in WHERE
spark.sql("""
    SELECT product_name, unit_price
    FROM products
    WHERE unit_price > (
        SELECT AVG(unit_price) FROM products
    )
    ORDER BY unit_price DESC
""").show()

+--------------------+----------+
|        product_name|unit_price|
+--------------------+----------+
|       Côte de Blaye|     263.5|
|Thüringer Rostbra...|    123.79|
|     Mishi Kobe Niku|      97.0|
|Sir Rodney's Marm...|      81.0|
|    Carnarvon Tigers|      62.5|
|Raclette Courdavault|      55.0|
|Manjimup Dried Ap...|      53.0|
|      Tarte au sucre|      49.3|
|         Ipoh Coffee|      46.0|
|   Rössle Sauerkraut|      45.6|
|  Schoggi Schokolade|      43.9|
|        Vegie-spread|      43.9|
|Northwoods Cranbe...|      40.0|
|        Alice Mutton|      39.0|
|Queso Manchego La...|      38.0|
|Gnocchi di nonna ...|      38.0|
|    Gudbrandsdalsost|      36.0|
|Mozzarella di Gio...|      34.8|
|   Camembert Pierrot|      34.0|
|Wimmers gute Semm...|     33.25|
+--------------------+----------+
only showing top 20 rows


In [21]:
# Subquery with IN
spark.sql("""
    SELECT company_name, country
    FROM customers
    WHERE customer_id IN (
        SELECT DISTINCT customer_id
        FROM orders
        WHERE order_date >= '1997-01-01'
    )
""").show()

+--------------------+---------+
|        company_name|  country|
+--------------------+---------+
|      Wolski  Zajazd|   Poland|
|        Maison Dewey|  Belgium|
|Blauer See Delika...|  Germany|
|Magazzini Aliment...|    Italy|
|      Folk och fä HB|   Sweden|
|Ana Trujillo Empa...|   Mexico|
|      Island Trading|       UK|
|        Vaffeljernet|  Denmark|
|Blondesddsl père ...|   France|
|Split Rail Beer &...|      USA|
|Trail's Head Gour...|      USA|
|   LILA-Supermercado|Venezuela|
|      Wartian Herkku|  Finland|
| France restauration|   France|
|  Seven Seas Imports|       UK|
|  Eastern Connection|       UK|
|    HILARION-Abastos|Venezuela|
|       Hanari Carnes|   Brazil|
|Drachenblut Delik...|  Germany|
|Vins et alcools C...|   France|
+--------------------+---------+
only showing top 20 rows


In [22]:
# Subquery with EXISTS
spark.sql("""
    SELECT c.company_name
    FROM customers c
    WHERE EXISTS (
        SELECT 1
        FROM orders o
        JOIN order_details od ON o.order_id = od.order_id
        WHERE o.customer_id = c.customer_id
        AND od.quantity >= 100
    )
""").show(truncate=False)

+------------------+
|company_name      |
+------------------+
|Save-a-lot Markets|
|Ernst Handel      |
|QUICK-Stop        |
+------------------+



## 9. Common Table Expressions (CTE)

In [23]:
# Simple CTE
spark.sql("""
    WITH ventes_client AS (
        SELECT 
            c.customer_id,
            c.company_name,
            SUM(od.unit_price * od.quantity) as ca_total
        FROM customers c
        JOIN orders o ON c.customer_id = o.customer_id
        JOIN order_details od ON o.order_id = od.order_id
        GROUP BY c.customer_id, c.company_name
    )
    SELECT *
    FROM ventes_client
    WHERE ca_total > 30000
    ORDER BY ca_total DESC
""").show(truncate=False)

+-----------+----------------------------+------------------+
|customer_id|company_name                |ca_total          |
+-----------+----------------------------+------------------+
|QUICK      |QUICK-Stop                  |117483.390147686  |
|SAVEA      |Save-a-lot Markets          |115673.38964271545|
|ERNSH      |Ernst Handel                |113236.67978191376|
|HUNGO      |Hungry Owl All-Night Grocers|57317.39016246796 |
|RATTC      |Rattlesnake Canyon Grocery  |52245.90034675598 |
|HANAR      |Hanari Carnes               |34101.149973869324|
|FOLKO      |Folk och fä HB              |32555.55001926422 |
|MEREP      |Mère Paillarde              |32203.900234222412|
|KOENE      |Königlich Essen             |31745.749893188477|
|QUEEN      |Queen Cozinha               |30226.10017967224 |
+-----------+----------------------------+------------------+



In [24]:
# Multiple CTEs
spark.sql("""
    WITH 
    ventes_mensuelles AS (
        SELECT 
            YEAR(order_date) as annee,
            MONTH(order_date) as mois,
            SUM(od.unit_price * od.quantity) as ca
        FROM orders o
        JOIN order_details od ON o.order_id = od.order_id
        GROUP BY YEAR(order_date), MONTH(order_date)
    ),
    moyenne AS (
        SELECT AVG(ca) as ca_moyen
        FROM ventes_mensuelles
    )
    SELECT 
        v.annee,
        v.mois,
        ROUND(v.ca, 2) as ca,
        ROUND(m.ca_moyen, 2) as ca_moyen,
        CASE 
            WHEN v.ca > m.ca_moyen THEN 'Au dessus'
            ELSE 'En dessous'
        END as performance
    FROM ventes_mensuelles v
    CROSS JOIN moyenne m
    ORDER BY v.annee, v.mois
""").show()

+-----+----+---------+--------+-----------+
|annee|mois|       ca|ca_moyen|performance|
+-----+----+---------+--------+-----------+
| 1996|   7|  30192.1| 58889.5| En dessous|
| 1996|   8|  26609.4| 58889.5| En dessous|
| 1996|   9|  27636.0| 58889.5| En dessous|
| 1996|  10|  41203.6| 58889.5| En dessous|
| 1996|  11|  49704.0| 58889.5| En dessous|
| 1996|  12|  50953.4| 58889.5| En dessous|
| 1997|   1|  66692.8| 58889.5|  Au dessus|
| 1997|   2|  41207.2| 58889.5| En dessous|
| 1997|   3|  39979.9| 58889.5| En dessous|
| 1997|   4| 55699.39| 58889.5| En dessous|
| 1997|   5|  56823.7| 58889.5| En dessous|
| 1997|   6|  39088.0| 58889.5| En dessous|
| 1997|   7| 55464.93| 58889.5| En dessous|
| 1997|   8| 49981.69| 58889.5| En dessous|
| 1997|   9| 59733.02| 58889.5|  Au dessus|
| 1997|  10|  70328.5| 58889.5|  Au dessus|
| 1997|  11| 45913.36| 58889.5| En dessous|
| 1997|  12| 77476.26| 58889.5|  Au dessus|
| 1998|   1|100854.72| 58889.5|  Au dessus|
| 1998|   2|104561.95| 58889.5|  Au dessus|
+-----+----+---------+--------+-----------+
only showing top 20 rows

## 10. Window functions

In [25]:
# ROW_NUMBER, RANK, DENSE_RANK
spark.sql("""
    SELECT 
        category_name,
        product_name,
        unit_price,
        ROW_NUMBER() OVER (PARTITION BY category_name ORDER BY unit_price DESC) as row_num,
        RANK() OVER (PARTITION BY category_name ORDER BY unit_price DESC) as rang,
        DENSE_RANK() OVER (PARTITION BY category_name ORDER BY unit_price DESC) as dense_rang
    FROM products p
    JOIN categories c ON p.category_id = c.category_id
""").show(20)

+-------------+--------------------+----------+-------+----+----------+
|category_name|        product_name|unit_price|row_num|rang|dense_rang|
+-------------+--------------------+----------+-------+----+----------+
|    Beverages|       Côte de Blaye|     263.5|      1|   1|         1|
|    Beverages|         Ipoh Coffee|      46.0|      2|   2|         2|
|    Beverages|               Chang|      19.0|      3|   3|         3|
|    Beverages|                Chai|      18.0|      4|   4|         4|
|    Beverages|      Steeleye Stout|      18.0|      5|   4|         4|
|    Beverages|    Chartreuse verte|      18.0|      6|   4|         4|
|    Beverages|        Lakkalikööri|      18.0|      7|   4|         4|
|    Beverages|       Outback Lager|      15.0|      8|   8|         5|
|    Beverages|       Sasquatch Ale|      14.0|      9|   9|         6|
|    Beverages|Laughing Lumberja...|      14.0|     10|   9|         6|
|    Beverages|Rhönbräu Klosterbier|      7.75|     11|  11|         7|
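
The three ranking functions only differ on ties: `ROW_NUMBER` always increments, `RANK` leaves gaps after a tie, and `DENSE_RANK` does not. The plain-Python sketch below recomputes the three columns for the Beverages prices shown above; `rank_rows` is a hypothetical helper that assumes its input is already sorted in descending order.

```python
def rank_rows(prices_desc):
    """Return (row_number, rank, dense_rank) tuples for descending prices."""
    rows = []
    rank = dense_rank = 0
    previous = object()  # sentinel that never equals a price
    for row_number, price in enumerate(prices_desc, start=1):
        if price != previous:
            rank = row_number   # RANK jumps to the current row number
            dense_rank += 1     # DENSE_RANK just moves to the next integer
            previous = price
        rows.append((row_number, rank, dense_rank))
    return rows

beverage_prices = [263.5, 46.0, 19.0, 18.0, 18.0, 18.0, 18.0, 15.0, 14.0, 14.0, 7.75]
for price, row in zip(beverage_prices, rank_rows(beverage_prices)):
    print(price, row)
```

The four products at 18.0 share rank 4; the next price gets rank 8 but dense rank 5.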

In [26]:
# LAG and LEAD
spark.sql("""
    WITH ventes_mois AS (
        SELECT 
            DATE_TRUNC('month', order_date) as mois,
            SUM(od.unit_price * od.quantity) as ca
        FROM orders o
        JOIN order_details od ON o.order_id = od.order_id
        GROUP BY DATE_TRUNC('month', order_date)
    )
    SELECT 
        mois,
        ROUND(ca, 2) as ca_actuel,
        ROUND(LAG(ca, 1) OVER (ORDER BY mois), 2) as ca_precedent,
        ROUND(LEAD(ca, 1) OVER (ORDER BY mois), 2) as ca_suivant,
        ROUND((ca - LAG(ca, 1) OVER (ORDER BY mois)) / LAG(ca, 1) OVER (ORDER BY mois) * 100, 2) as evolution_pct
    FROM ventes_mois
    ORDER BY mois
""").show()

+-------------------+---------+------------+----------+-------------+
|               mois|ca_actuel|ca_precedent|ca_suivant|evolution_pct|
+-------------------+---------+------------+----------+-------------+
|1996-07-01 00:00:00|  30192.1|        NULL|   26609.4|         NULL|
|1996-08-01 00:00:00|  26609.4|     30192.1|   27636.0|       -11.87|
|1996-09-01 00:00:00|  27636.0|     26609.4|   41203.6|         3.86|
|1996-10-01 00:00:00|  41203.6|     27636.0|   49704.0|        49.09|
|1996-11-01 00:00:00|  49704.0|     41203.6|   50953.4|        20.63|
|1996-12-01 00:00:00|  50953.4|     49704.0|   66692.8|         2.51|
|1997-01-01 00:00:00|  66692.8|     50953.4|   41207.2|        30.89|
|1997-02-01 00:00:00|  41207.2|     66692.8|   39979.9|       -38.21|
|1997-03-01 00:00:00|  39979.9|     41207.2|  55699.39|        -2.98|
|1997-04-01 00:00:00| 55699.39|     39979.9|   56823.7|        39.32|
|1997-05-01 00:00:00|  56823.7|    55699.39|   39088.0|         2.02|
|1997-06-01 00:00:00
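
`LAG` and `LEAD` shift a column by a number of rows within the window ordering, which is why the first row has no previous value and its `evolution_pct` is NULL. A plain-Python sketch of the same shift on the first four monthly totals follows; the `lag` and `lead` helpers are hypothetical and work on a list that is already sorted by month.

```python
def lag(values, offset=1):
    """Value from `offset` rows earlier, None where no such row exists."""
    return ([None] * offset + values)[:len(values)]

def lead(values, offset=1):
    """Value from `offset` rows later, None where no such row exists."""
    return (values + [None] * offset)[offset:]

# First four monthly totals from the output above (July to October 1996).
ca = [30192.1, 26609.4, 27636.0, 41203.6]

ca_precedent = lag(ca)
ca_suivant = lead(ca)
evolution_pct = [
    None if prev is None else round((cur - prev) / prev * 100, 2)
    for cur, prev in zip(ca, ca_precedent)
]
print(ca_precedent, ca_suivant, evolution_pct)
```

The last `ca_suivant` is `None` here only because the sample stops after four months; in the full result the following month is available.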

In [27]:
# Running total and moving average (moyenne_7j spans the last 7 rows, i.e. 7 order dates, not 7 calendar days)
spark.sql("""
    WITH ventes_jour AS (
        SELECT 
            order_date,
            SUM(od.unit_price * od.quantity) as ca
        FROM orders o
        JOIN order_details od ON o.order_id = od.order_id
        GROUP BY order_date
    )
    SELECT 
        order_date,
        ROUND(ca, 2) as ca,
        ROUND(SUM(ca) OVER (ORDER BY order_date), 2) as ca_cumule,
        ROUND(AVG(ca) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), 2) as moyenne_7j
    FROM ventes_jour
    ORDER BY order_date
    LIMIT 20
""").show()

+----------+------+---------+----------+
|order_date|    ca|ca_cumule|moyenne_7j|
+----------+------+---------+----------+
|1996-07-04| 440.0|    440.0|     440.0|
|1996-07-05|1863.4|   2303.4|    1151.7|
|1996-07-08|2483.8|   4787.2|   1595.73|
|1996-07-09|3730.0|   8517.2|    2129.3|
|1996-07-10|1444.8|   9962.0|    1992.4|
|1996-07-11| 625.2|  10587.2|   1764.53|
|1996-07-12|2490.5|  13077.7|   1868.24|
|1996-07-15| 517.8|  13595.5|   1879.36|
|1996-07-16|1119.9|  14715.4|   1773.14|
|1996-07-17|2018.6|  16734.0|   1706.69|
|1996-07-18| 100.8|  16834.8|   1188.23|
|1996-07-19|2194.2|  19029.0|   1295.29|
|1996-07-22| 624.8|  19653.8|   1295.23|
|1996-07-23|2464.8|  22118.6|   1291.56|
|1996-07-24| 724.5|  22843.1|   1321.09|
|1996-07-25|1176.0|  24019.1|    1329.1|
|1996-07-26| 364.8|  24383.9|   1092.84|
|1996-07-29|4031.0|  28414.9|    1654.3|
|1996-07-30|1101.2|  29516.1|   1498.16|
|1996-07-31| 676.0|  30192.1|   1505.47|
+----------+------+---------+----------+
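
The frame `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` covers at most seven rows, and fewer at the start of the series. Since some dates have no orders, seven rows can span more than seven calendar days. The plain-Python sketch below recomputes both columns for the first eight rows of the output; `moving_average` is a hypothetical helper, not a Spark function.

```python
from itertools import accumulate

# Daily totals for the first eight order dates in the output above.
ca = [440.0, 1863.4, 2483.8, 3730.0, 1444.8, 625.2, 2490.5, 517.8]

# SUM(ca) OVER (ORDER BY order_date): running total.
ca_cumule = [round(total, 2) for total in accumulate(ca)]

def moving_average(values, window=7):
    """Average over the current row and up to window - 1 preceding rows."""
    averages = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        averages.append(round(sum(frame) / len(frame), 2))
    return averages

moyenne_7j = moving_average(ca)
print(ca_cumule)
print(moyenne_7j)
```

Without an explicit frame, `SUM(...) OVER (ORDER BY ...)` defaults to a `RANGE` frame ending at the current row, so rows that tie on the ordering key receive the same running total; here each date appears once, so the distinction does not show.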



---

## Exercise

**Goal**: Write complex SQL queries

**Instructions**:
1. Find the top 3 products per category by revenue (`ca`, short for "chiffre d'affaires", in the queries)
2. Compute the monthly revenue with the year-to-date running total
3. Find the customers whose total purchases are above the average

Your turn:

In [28]:
# TODO: Top 3 products per category
spark.sql("""
    WITH ventes_produit AS (
        SELECT 
            c.category_name,
            p.product_name,
            SUM(od.unit_price * od.quantity) as ca_total
        FROM order_details od
        JOIN products p ON od.product_id = p.product_id
        JOIN categories c ON p.category_id = c.category_id
        GROUP BY c.category_name, p.product_name
    ),
    classement AS (
        SELECT 
            *,
            RANK() OVER (PARTITION BY category_name ORDER BY ca_total DESC) as rang
        FROM ventes_produit
    )
    SELECT * FROM classement 
    WHERE rang <= 3
    ORDER BY category_name, rang
""").show(truncate=False)

+--------------+--------------------------------+------------------+----+
|category_name |product_name                    |ca_total          |rang|
+--------------+--------------------------------+------------------+----+
|Beverages     |Côte de Blaye                   |149984.20082092285|1   |
|Beverages     |Ipoh Coffee                     |25079.199867248535|2   |
|Beverages     |Chang                           |18559.19992351532 |3   |
|Condiments    |Vegie-spread                    |17696.30004119873 |1   |
|Condiments    |Sirop d'érable                  |16438.79990005493 |2   |
|Condiments    |Louisiana Fiery Hot Pepper Sauce|14606.999431610107|3   |
|Confections   |Tarte au sucre                  |49827.89999771118 |1   |
|Confections   |Sir Rodney's Marmalade          |23635.800323486328|2   |
|Confections   |Gumbär Gummibärchen             |21534.89967918396 |3   |
|Dairy Products|Raclette Courdavault            |76296.0           |1   |
|Dairy Products|Camembert Pierrot     

In [29]:
# TODO: Monthly revenue with year-to-date running total
spark.sql("""
    WITH ventes_mois AS (
        SELECT 
            YEAR(o.order_date) as annee,
            MONTH(o.order_date) as mois,
            SUM(od.unit_price * od.quantity) as ca_mois
        FROM orders o
        JOIN order_details od ON o.order_id = od.order_id
        GROUP BY YEAR(o.order_date), MONTH(o.order_date)
    )
    SELECT 
        annee,
        mois,
        ROUND(ca_mois, 2) as ca_mensuel,
        ROUND(SUM(ca_mois) OVER (PARTITION BY annee ORDER BY mois), 2) as cumul_annuel
    FROM ventes_mois
    ORDER BY annee, mois
""").show()

+-----+----+----------+------------+
|annee|mois|ca_mensuel|cumul_annuel|
+-----+----+----------+------------+
| 1996|   7|   30192.1|     30192.1|
| 1996|   8|   26609.4|     56801.5|
| 1996|   9|   27636.0|     84437.5|
| 1996|  10|   41203.6|    125641.1|
| 1996|  11|   49704.0|    175345.1|
| 1996|  12|   50953.4|    226298.5|
| 1997|   1|   66692.8|     66692.8|
| 1997|   2|   41207.2|    107900.0|
| 1997|   3|   39979.9|    147879.9|
| 1997|   4|  55699.39|   203579.29|
| 1997|   5|   56823.7|   260402.99|
| 1997|   6|   39088.0|   299490.99|
| 1997|   7|  55464.93|   354955.92|
| 1997|   8|  49981.69|   404937.61|
| 1997|   9|  59733.02|   464670.63|
| 1997|  10|   70328.5|   534999.13|
| 1997|  11|  45913.36|   580912.49|
| 1997|  12|  77476.26|   658388.75|
| 1998|   1| 100854.72|   100854.72|
| 1998|   2| 104561.95|   205416.67|
+-----+----+----------+------------+
only showing top 20 rows
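
`PARTITION BY annee` is what makes the running total restart each January. A plain-Python sketch of the same idea, using `itertools.groupby` as a stand-in for the partition and the first eight monthly totals from the output above:

```python
from itertools import accumulate, groupby

# (annee, mois, ca_mois) for the first eight rows of the output above.
ventes_mois = [
    (1996, 7, 30192.1), (1996, 8, 26609.4), (1996, 9, 27636.0),
    (1996, 10, 41203.6), (1996, 11, 49704.0), (1996, 12, 50953.4),
    (1997, 1, 66692.8), (1997, 2, 41207.2),
]

cumul_annuel = []
# PARTITION BY annee ORDER BY mois: sort, then accumulate inside each year.
for annee, rows in groupby(sorted(ventes_mois), key=lambda row: row[0]):
    rows = list(rows)
    totals = accumulate(ca for _, _, ca in rows)
    cumul_annuel.extend((a, m, round(t, 2)) for (a, m, _), t in zip(rows, totals))

for row in cumul_annuel:
    print(row)
```

Dropping the partition would give a single running total across all years instead.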


In [30]:
# TODO: Customers above the average
spark.sql("""
    WITH ventes_clients AS (
        SELECT 
            c.company_name,
            SUM(od.unit_price * od.quantity) as total_achats
        FROM customers c
        JOIN orders o ON c.customer_id = o.customer_id
        JOIN order_details od ON o.order_id = od.order_id
        GROUP BY c.company_name
    ),
    moyenne_globale AS (
        SELECT AVG(total_achats) as moyenne FROM ventes_clients
    )
    SELECT 
        v.company_name,
        ROUND(v.total_achats, 2) as ca_client,
        ROUND(m.moyenne, 2) as moyenne_generale
    FROM ventes_clients v
    CROSS JOIN moyenne_globale m
    WHERE v.total_achats > m.moyenne
    ORDER BY v.total_achats DESC
""").show(truncate=False)

+----------------------------+---------+----------------+
|company_name                |ca_client|moyenne_generale|
+----------------------------+---------+----------------+
|QUICK-Stop                  |117483.39|15218.64        |
|Save-a-lot Markets          |115673.39|15218.64        |
|Ernst Handel                |113236.68|15218.64        |
|Hungry Owl All-Night Grocers|57317.39 |15218.64        |
|Rattlesnake Canyon Grocery  |52245.9  |15218.64        |
|Hanari Carnes               |34101.15 |15218.64        |
|Folk och fä HB              |32555.55 |15218.64        |
|Mère Paillarde              |32203.9  |15218.64        |
|Königlich Essen             |31745.75 |15218.64        |
|Queen Cozinha               |30226.1  |15218.64        |
|White Clover Markets        |29073.45 |15218.64        |
|Frankenversand              |28722.71 |15218.64        |
|Berglunds snabbköp          |26968.15 |15218.64        |
|Piccolo und mehr            |26259.95 |15218.64        |
|Suprêmes déli

---

## Summary

In this notebook, you learned:
- How to use **Spark SQL** with temporary views
- The **date functions** (YEAR, MONTH, DATEDIFF, DATE_FORMAT)
- The **string functions** (CONCAT, UPPER) and **pattern matching** (LIKE, RLIKE)
- The **aggregate functions** (SUM, AVG, COUNT) and conditional aggregation with CASE
- **Subqueries** and **CTEs**
- The **window functions** (ROW_NUMBER, LAG, LEAD, running totals)

### Next step
In the next notebook, we will learn the basics of Kafka.