# Exercise 06 - Connecting to PostgreSQL

## Objectives
- Connect to PostgreSQL from Spark
- Read SQL tables with Spark
- Explore the Northwind database
- Run SQL queries on the data

---

## 1. The Northwind database

**Northwind** is a classic sample database that models a sales company.

```
+------------------------------------------------------------------+
|                    NORTHWIND DATABASE                            |
+------------------------------------------------------------------+
|                                                                  |
|  +------------+     +------------+     +------------+            |
|  | categories |     |  products  |     | suppliers  |            |
|  +------------+     +------------+     +------------+            |
|        |                  |                  |                   |
|        +--------+---------+------------------+                   |
|                 |                                                |
|                 v                                                |
|  +------------+     +------------+     +------------+            |
|  |   orders   |---->|order_details|<---|  products  |            |
|  +------------+     +------------+     +------------+            |
|        |                                                         |
|        v                                                         |
|  +------------+     +------------+                               |
|  | customers  |     | employees  |                               |
|  +------------+     +------------+                               |
|                                                                  |
+------------------------------------------------------------------+

Main tables:
- customers    : the company's customers
- products     : product catalog
- orders       : orders placed
- order_details: line items of each order
- employees    : the company's employees
- categories   : product categories
- suppliers    : suppliers
```

## 2. Configuring Spark for PostgreSQL

In [1]:
from pyspark.sql import SparkSession

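# Create the SparkSession.
# spark.jars.packages makes Spark fetch the PostgreSQL JDBC driver by its Maven coordinates.
spark = SparkSession.builder \
    .appName("Spark PostgreSQL") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0") \
    .getOrCreate()

print("Spark is ready!")

Spark is ready!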


In [2]:
# PostgreSQL connection settings
jdbc_url = "jdbc:postgresql://postgres:5432/app"
connection_properties = {
    "user": "postgres",
    "password": "postgres",
    "driver": "org.postgresql.Driver"
}

print("PostgreSQL configuration:")
print(f"  URL: {jdbc_url}")
print(f"  User: {connection_properties['user']}")

PostgreSQL configuration:
  URL: jdbc:postgresql://postgres:5432/app
  User: postgres
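
The connection settings above are hard-coded, which is fine for a local training container. As a minimal sketch, the same values could be read from environment variables instead. The variable names (`PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`) and the helper `jdbc_settings` are assumptions for illustration and are not part of the original notebook; the defaults reproduce the values used in this exercise.

```python
import os


def jdbc_settings(env=os.environ):
    """Build the JDBC URL and properties from environment variables.

    The variable names are an assumption; the defaults match the values
    hard-coded in the notebook cell.
    """
    host = env.get("PGHOST", "postgres")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "app")
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": env.get("PGUSER", "postgres"),
        "password": env.get("PGPASSWORD", "postgres"),
        "driver": "org.postgresql.Driver",
    }
    return url, properties
```

Calling `jdbc_url, connection_properties = jdbc_settings()` would then give the two variables used in the rest of the notebook, without the password appearing in the notebook itself.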


## 3. Reading a SQL table

In [3]:
# Read the customers table
df_customers = spark.read.jdbc(
    url=jdbc_url,
    table="customers",
    properties=connection_properties
)

print("customers table loaded!")
print(f"Number of customers: {df_customers.count()}")

customers table loaded!
Number of customers: 91


In [4]:
# Show the first rows
df_customers.show(5)

+-----------+--------------------+------------------+--------------------+--------------------+-----------+------+-----------+-------+--------------+--------------+
|customer_id|        company_name|      contact_name|       contact_title|             address|       city|region|postal_code|country|         phone|           fax|
+-----------+--------------------+------------------+--------------------+--------------------+-----------+------+-----------+-------+--------------+--------------+
|      ALFKI| Alfreds Futterkiste|      Maria Anders|Sales Representative|       Obere Str. 57|     Berlin|  NULL|      12209|Germany|   030-0074321|   030-0076545|
|      ANATR|Ana Trujillo Empa...|      Ana Trujillo|               Owner|Avda. de la Const...|México D.F.|  NULL|      05021| Mexico|  (5) 555-4729|  (5) 555-3745|
|      ANTON|Antonio Moreno Ta...|    Antonio Moreno|               Owner|     Mataderos  2312|México D.F.|  NULL|      05023| Mexico|  (5) 555-3932|          NULL|
|      ARO

In [5]:
# Show the schema
df_customers.printSchema()

root
 |-- customer_id: string (nullable = true)
 |-- company_name: string (nullable = true)
 |-- contact_name: string (nullable = true)
 |-- contact_title: string (nullable = true)
 |-- address: string (nullable = true)
 |-- city: string (nullable = true)
 |-- region: string (nullable = true)
 |-- postal_code: string (nullable = true)
 |-- country: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- fax: string (nullable = true)
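
`spark.read.jdbc()` is a shortcut. The same read can also be written with the generic `format("jdbc")` reader and individual options, which can be handy when more JDBC options are needed later. The sketch below assumes `spark` is an existing SparkSession; the function name `read_table_with_options` is hypothetical.

```python
def read_table_with_options(spark, url, table, user, password):
    """Option-style equivalent of spark.read.jdbc().

    `spark` is expected to be an existing SparkSession (assumption).
    """
    return (
        spark.read.format("jdbc")
        .option("url", url)
        .option("dbtable", table)  # table name, as in the `table` argument
        .option("user", user)
        .option("password", password)
        .option("driver", "org.postgresql.Driver")
        .load()
    )
```

For example, `read_table_with_options(spark, jdbc_url, "customers", "postgres", "postgres")` should return the same DataFrame as the `customers` read in this section.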



## 4. Reading other tables

In [6]:
# Read the products table
df_products = spark.read.jdbc(
    url=jdbc_url,
    table="products",
    properties=connection_properties
)

print(f"Number of products: {df_products.count()}")
df_products.show(5)

Number of products: 77
+----------+--------------------+-----------+-----------+-------------------+----------+--------------+--------------+-------------+------------+
|product_id|        product_name|supplier_id|category_id|  quantity_per_unit|unit_price|units_in_stock|units_on_order|reorder_level|discontinued|
+----------+--------------------+-----------+-----------+-------------------+----------+--------------+--------------+-------------+------------+
|         1|                Chai|          8|          1| 10 boxes x 30 bags|      18.0|            39|             0|           10|           1|
|         2|               Chang|          1|          1| 24 - 12 oz bottles|      19.0|            17|            40|           25|           1|
|         3|       Aniseed Syrup|          1|          2|12 - 550 ml bottles|      10.0|            13|            70|           25|           0|
|         4|Chef Anton's Caju...|          2|          2|     48 - 6 oz jars|      22.0|            

In [7]:
# Read the orders table
df_orders = spark.read.jdbc(
    url=jdbc_url,
    table="orders",
    properties=connection_properties
)

print(f"Number of orders: {df_orders.count()}")
df_orders.show(5)

Number of orders: 830
+--------+-----------+-----------+----------+-------------+------------+--------+-------+--------------------+--------------------+--------------+-----------+----------------+------------+
|order_id|customer_id|employee_id|order_date|required_date|shipped_date|ship_via|freight|           ship_name|        ship_address|     ship_city|ship_region|ship_postal_code|ship_country|
+--------+-----------+-----------+----------+-------------+------------+--------+-------+--------------------+--------------------+--------------+-----------+----------------+------------+
|   10248|      VINET|          5|1996-07-04|   1996-08-01|  1996-07-16|       3|  32.38|Vins et alcools C...|  59 rue de l'Abbaye|         Reims|       NULL|           51100|      France|
|   10249|      TOMSP|          6|1996-07-05|   1996-08-16|  1996-07-10|       1|  11.61|  Toms Spezialitäten|       Luisenstr. 48|       Münster|       NULL|           44087|     Germany|
|   10250|      HANAR|       

In [8]:
# Read the order_details table
df_order_details = spark.read.jdbc(
    url=jdbc_url,
    table="order_details",
    properties=connection_properties
)

print(f"Number of order lines: {df_order_details.count()}")
df_order_details.show(5)

Number of order lines: 2155
+--------+----------+----------+--------+--------+
|order_id|product_id|unit_price|quantity|discount|
+--------+----------+----------+--------+--------+
|   10248|        11|      14.0|      12|     0.0|
|   10248|        42|       9.8|      10|     0.0|
|   10248|        72|      34.8|       5|     0.0|
|   10249|        14|      18.6|       9|     0.0|
|   10249|        51|      42.4|      40|     0.0|
+--------+----------+----------+--------+--------+
only showing top 5 rows


## 5. Using SQL with Spark

Spark lets you run SQL queries on DataFrames.

In [9]:
# Register the DataFrames as temporary views
df_customers.createOrReplaceTempView("customers")
df_products.createOrReplaceTempView("products")
df_orders.createOrReplaceTempView("orders")
df_order_details.createOrReplaceTempView("order_details")

print("SQL views created!")

SQL views created!
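
When the number of tables grows, the registration calls above can be folded into a loop. This is a minimal sketch; the helper name `register_views` is hypothetical, and it only relies on `createOrReplaceTempView`, the same method used in the cell above.

```python
def register_views(dataframes):
    """Register each DataFrame under its dictionary key as a temporary view.

    `dataframes` maps a view name to a Spark DataFrame.
    Returns the sorted list of view names that were registered.
    """
    for name, df in dataframes.items():
        df.createOrReplaceTempView(name)
    return sorted(dataframes)
```

With the DataFrames from this notebook, the call would look like `register_views({"customers": df_customers, "products": df_products, "orders": df_orders, "order_details": df_order_details})`.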


In [10]:
# SQL query: customers per country
result = spark.sql("""
    SELECT country, COUNT(*) as nb_clients
    FROM customers
    GROUP BY country
    ORDER BY nb_clients DESC
""")

result.show()

+-----------+----------+
|    country|nb_clients|
+-----------+----------+
|        USA|        13|
|    Germany|        11|
|     France|        11|
|     Brazil|         9|
|         UK|         7|
|      Spain|         5|
|     Mexico|         5|
|  Venezuela|         4|
|  Argentina|         3|
|      Italy|         3|
|     Canada|         3|
|     Sweden|         2|
|    Belgium|         2|
|    Finland|         2|
|    Denmark|         2|
|Switzerland|         2|
|   Portugal|         2|
|    Austria|         2|
|     Norway|         1|
|    Ireland|         1|
+-----------+----------+
only showing top 20 rows
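
The same aggregation can be expressed with the DataFrame API instead of a SQL string. The sketch below is one possible equivalent of the query above; the function name `customers_per_country` is hypothetical, and it expects a DataFrame with a `country` column such as `df_customers`.

```python
def customers_per_country(df_customers):
    """DataFrame-API version of the GROUP BY country query."""
    return (
        df_customers.groupBy("country")
        .count()                                   # adds a column named "count"
        .withColumnRenamed("count", "nb_clients")  # same alias as in the SQL query
        .orderBy("nb_clients", ascending=False)
    )
```

Calling `customers_per_country(df_customers).show()` should list the same per-country counts as the SQL version, although the order of countries with equal counts is not guaranteed in either form.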


In [11]:
# SQL query: most expensive products
result = spark.sql("""
    SELECT product_name, unit_price
    FROM products
    ORDER BY unit_price DESC
    LIMIT 10
""")

result.show()

+--------------------+----------+
|        product_name|unit_price|
+--------------------+----------+
|       Côte de Blaye|     263.5|
|Thüringer Rostbra...|    123.79|
|     Mishi Kobe Niku|      97.0|
|Sir Rodney's Marm...|      81.0|
|    Carnarvon Tigers|      62.5|
|Raclette Courdavault|      55.0|
|Manjimup Dried Ap...|      53.0|
|      Tarte au sucre|      49.3|
|         Ipoh Coffee|      46.0|
|   Rössle Sauerkraut|      45.6|
+--------------------+----------+



In [12]:
# SQL query: total order amount per customer (gross amount; od.discount is not applied)
result = spark.sql("""
    SELECT 
        c.company_name,
        COUNT(DISTINCT o.order_id) as nb_commandes,
        ROUND(SUM(od.unit_price * od.quantity), 2) as montant_total
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    JOIN order_details od ON o.order_id = od.order_id
    GROUP BY c.company_name
    ORDER BY montant_total DESC
    LIMIT 10
""")

print("Top 10 customers by revenue:")
result.show(truncate=False)

Top 10 customers by revenue:
+----------------------------+------------+-------------+
|company_name                |nb_commandes|montant_total|
+----------------------------+------------+-------------+
|QUICK-Stop                  |28          |117483.39    |
|Save-a-lot Markets          |31          |115673.39    |
|Ernst Handel                |30          |113236.68    |
|Hungry Owl All-Night Grocers|19          |57317.39     |
|Rattlesnake Canyon Grocery  |18          |52245.9      |
|Hanari Carnes               |14          |34101.15     |
|Folk och fä HB              |19          |32555.55     |
|Mère Paillarde              |13          |32203.9      |
|Königlich Essen             |14          |31745.75     |
|Queen Cozinha               |13          |30226.1      |
+----------------------------+------------+-------------+
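
To make the aggregate concrete, here is the arithmetic behind `SUM(od.unit_price * od.quantity)` for a single order, using the three `order_details` rows of order 10248 shown in section 4. The helper names are hypothetical. The `net_total` variant assumes that `discount` is a fraction between 0 and 1; the preview only shows `0.0`, so that interpretation is an assumption.

```python
# Rows of order 10248 from the order_details preview:
# (unit_price, quantity, discount)
order_10248 = [
    (14.0, 12, 0.0),
    (9.8, 10, 0.0),
    (34.8, 5, 0.0),
]


def gross_total(lines):
    """Sum of unit_price * quantity, as computed by the SQL query."""
    return round(sum(price * qty for price, qty, _ in lines), 2)


def net_total(lines):
    """Same sum with the discount applied (assumes a fractional discount)."""
    return round(sum(price * qty * (1 - disc) for price, qty, disc in lines), 2)


print(gross_total(order_10248))  # 440.0
print(net_total(order_10248))    # 440.0, since every discount in this order is 0.0
```

The query in this section reports the gross figure. Multiplying by `(1 - od.discount)` inside the `SUM` would give the discounted amount under the same assumption.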



## 6. Reading with a custom SQL query

In [13]:
# Instead of reading a whole table, we can push a query down to PostgreSQL
query = "(SELECT customer_id, company_name, country FROM customers WHERE country = 'France') as french_customers"

df_french = spark.read.jdbc(
    url=jdbc_url,
    table=query,
    properties=connection_properties
)

print("French customers:")
df_french.show()

French customers:
+-----------+--------------------+-------+
|customer_id|        company_name|country|
+-----------+--------------------+-------+
|      BLONP|Blondesddsl père ...| France|
|      BONAP|            Bon app'| France|
|      DUMON|     Du monde entier| France|
|      FOLIG|   Folies gourmandes| France|
|      FRANR| France restauration| France|
|      LACOR|La corne d'abondance| France|
|      LAMAI|    La maison d'Asie| France|
|      PARIS|   Paris spécialités| France|
|      SPECD|Spécialités du monde| France|
|      VICTE|Victuailles en stock| France|
|      VINET|Vins et alcools C...| France|
+-----------+--------------------+-------+
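
The `table` argument accepts anything that is valid in a SQL `FROM` clause, which is why the query is wrapped in parentheses; the alias keeps the subquery valid on PostgreSQL versions that require one. A small helper can build that string and avoid forgetting either part. The name `as_subquery` is hypothetical.

```python
def as_subquery(sql, alias):
    """Wrap a SELECT statement so it can be passed as the `table` argument.

    Strips surrounding whitespace and a trailing semicolon, then adds the
    parentheses and the alias.
    """
    return f"({sql.strip().rstrip(';')}) AS {alias}"


query = as_subquery(
    "SELECT customer_id, company_name, country FROM customers WHERE country = 'France';",
    "french_customers",
)
print(query)
# (SELECT customer_id, company_name, country FROM customers WHERE country = 'France') AS french_customers
```

The resulting string can be passed as `table=query` exactly as in the cell above. Recent Spark versions also accept a `query` option on the `format("jdbc")` reader, which takes the bare `SELECT` without parentheses or alias.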



## 7. Helper function for reading tables

In [14]:
def lire_table(nom_table):
    """Read a PostgreSQL table into a Spark DataFrame."""
    # Reuse the connection settings defined in cell 2 instead of repeating them
    return spark.read.jdbc(
        url=jdbc_url,
        table=nom_table,
        properties=connection_properties
    )

# Usage example
df_categories = lire_table("categories")
df_categories.show()

+-----------+--------------+--------------------+-------+
|category_id| category_name|         description|picture|
+-----------+--------------+--------------------+-------+
|          1|     Beverages|Soft drinks, coff...|     []|
|          2|    Condiments|Sweet and savory ...|     []|
|          3|   Confections|Desserts, candies...|     []|
|          4|Dairy Products|             Cheeses|     []|
|          5|Grains/Cereals|Breads, crackers,...|     []|
|          6|  Meat/Poultry|      Prepared meats|     []|
|          7|       Produce|Dried fruit and b...|     []|
|          8|       Seafood|    Seaweed and fish|     []|
+-----------+--------------+--------------------+-------+
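
A reader function such as `lire_table` also makes it easy to load several tables in one go. The sketch below takes the reader as a parameter so it does not depend on any global state; the name `load_tables` is hypothetical.

```python
def load_tables(read_table, table_names):
    """Call `read_table` once per name and return a dict of name -> DataFrame."""
    return {name: read_table(name) for name in table_names}
```

For instance, `tables = load_tables(lire_table, ["categories", "suppliers", "employees"])` would give access to `tables["suppliers"]` and the other DataFrames by name.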



---

## Exercise

**Goal**: Explore the Northwind database with Spark SQL

**Instructions**:
1. Read the `employees` table
2. Display the schema and the first rows
3. Create a temporary view
4. Write SQL queries to find:
   - The number of employees per city
   - The employees hired after 1993

Your turn:

In [15]:
# TODO: Read the employees table
df_employees = lire_table("employees")

# TODO: Display the schema and the first rows
print("--- Schema ---")
df_employees.printSchema()

print("--- Data preview ---")
df_employees.show(5)

--- Schema ---
root
 |-- employee_id: short (nullable = true)
 |-- last_name: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- title: string (nullable = true)
 |-- title_of_courtesy: string (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- hire_date: date (nullable = true)
 |-- address: string (nullable = true)
 |-- city: string (nullable = true)
 |-- region: string (nullable = true)
 |-- postal_code: string (nullable = true)
 |-- country: string (nullable = true)
 |-- home_phone: string (nullable = true)
 |-- extension: string (nullable = true)
 |-- photo: binary (nullable = true)
 |-- notes: string (nullable = true)
 |-- reports_to: short (nullable = true)
 |-- photo_path: string (nullable = true)

--- Data preview ---
+-----------+---------+----------+--------------------+-----------------+----------+----------+--------------------+--------+------+-----------+-------+--------------+---------+-----+--------------------+----------+------------

In [16]:
# TODO: Create a temporary view
df_employees.createOrReplaceTempView("employees")
# TODO: Number of employees per city
spark.sql("""
    SELECT city, COUNT(*) as nb_employees
    FROM employees
    GROUP BY city
    ORDER BY nb_employees DESC
""").show()

+--------+------------+
|    city|nb_employees|
+--------+------------+
|  London|           4|
| Seattle|           2|
|Kirkland|           1|
| Redmond|           1|
|  Tacoma|           1|
+--------+------------+



In [17]:
# TODO: Employees hired after 1993
# Hint: use WHERE hire_date > '1993-12-31'
spark.sql("""
    SELECT first_name, last_name, hire_date
    FROM employees
    WHERE hire_date > '1993-12-31'
    ORDER BY hire_date ASC
""").show()

+----------+---------+----------+
|first_name|last_name| hire_date|
+----------+---------+----------+
|    Robert|     King|1994-01-02|
|     Laura| Callahan|1994-03-05|
|      Anne|Dodsworth|1994-11-15|
+----------+---------+----------+



---

## Summary

In this notebook, you learned:
- How to **configure Spark for PostgreSQL**
- How to **read tables** with `spark.read.jdbc()`
- The structure of the **Northwind** database
- How to use **Spark SQL** to analyze the data
- How to read with a **custom query**

### Next step
In the next notebook, we will learn how to ingest the PostgreSQL data into our MinIO Data Lake.