# Final Lab (TP Final), Phase 1: Bronze Ingestion (PostgreSQL → MinIO)

**Goal**: ingest raw data from PostgreSQL (Northwind) into the **Bronze** zone (MinIO/S3), following the Medallion architecture.

## Requirements covered
- Generic read of a PostgreSQL table
- Parquet write under `s3a://bronze/<table_name>/`
- Partitioning by **ingestion date** in `YYYY-MM-DD` format, implemented here as one dated subfolder per run
- Mandatory technical metadata columns:
  - `_ingestion_timestamp`
  - `_source_system` = `postgresql`
  - `_table_name`

> Note: an `_ingestion_date` column is also added to carry the partition value (technical column).

## 1) Execution context
This notebook is meant to be run **in the JupyterLab instance of the Docker environment** (`jupyter-spark` service).

- PostgreSQL inside Docker: `jdbc:postgresql://postgres:5432/app`
- MinIO inside Docker: `http://minio:9000`
- Expected buckets: `bronze`, `silver`, `gold`
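
Before building the Spark session, it can help to confirm that both services are reachable from the notebook container. The sketch below is a hypothetical helper (`is_reachable` and `check_services` are not part of the original notebook). It only tests TCP connectivity on the hosts and ports listed above; it says nothing about credentials or about whether the buckets exist.

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures
        return False

# Hostnames and ports taken from the Docker network described above
SERVICES = {"postgres": ("postgres", 5432), "minio": ("minio", 9000)}

def check_services(services: dict = SERVICES) -> dict:
    """Map each service name to True/False depending on TCP reachability."""
    return {name: is_reachable(host, port) for name, (host, port) in services.items()}
```

Calling `check_services()` in a notebook cell returns a dictionary with one boolean per service, which makes a network problem visible before Spark produces a much longer stack trace.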

In [8]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from datetime import datetime

# SparkSession with the MinIO (S3A) settings and the required JARs.
# Note: the hadoop-aws version must match the Hadoop version bundled with Spark.
spark = (SparkSession.builder
    .appName("TP Final - Phase 1 Bronze")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0,org.apache.hadoop:hadoop-aws:3.4.1,com.amazonaws:aws-java-sdk-bundle:1.12.262")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate())

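# Dynamic mode makes "overwrite" replace only the partitions present in the data.
# It only affects writes that use partitionBy(); a write to a dated folder
# already limits the overwrite to that folder.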
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

print("Spark OK")
print("S3A endpoint =", spark.sparkContext._jsc.hadoopConfiguration().get("fs.s3a.endpoint"))

Spark OK
S3A endpoint = http://minio:9000
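
The access keys are hard-coded in the cell above, which is convenient for a lab but easy to leak when the notebook is shared. One possible variant, assuming the container exposes them as environment variables (the names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` are hypothetical), builds the S3A options as a dictionary and falls back to the lab defaults:

```python
import os

def s3a_options(env=os.environ) -> dict:
    """Build the spark.hadoop.fs.s3a.* options, preferring environment variables.

    The environment variable names are assumptions made for this sketch;
    the default values are the ones used in the session cell.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": env.get("MINIO_ENDPOINT", "http://minio:9000"),
        "spark.hadoop.fs.s3a.access.key": env.get("MINIO_ACCESS_KEY", "minioadmin"),
        "spark.hadoop.fs.s3a.secret.key": env.get("MINIO_SECRET_KEY", "minioadmin123"),
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }
```

Each key/value pair can then be passed to `SparkSession.builder.config(key, value)` in a loop before calling `getOrCreate()`.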


In [9]:
# PostgreSQL JDBC configuration
jdbc_url = "jdbc:postgresql://postgres:5432/app"
jdbc_properties = {
    "user": "postgres",
    "password": "postgres",
    "driver": "org.postgresql.Driver",
}

# Ingestion date (partition value), formatted as YYYY-MM-DD
ingestion_date = datetime.now().strftime("%Y-%m-%d")
print("Ingestion date =", ingestion_date)

Ingestion date = 2026-01-16
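
`datetime.now()` uses the local time zone of the container, so the partition date can shift around midnight depending on where the notebook runs. A minimal sketch, assuming UTC is the desired reference and that re-running a past date should be possible (`resolve_ingestion_date` is a hypothetical helper, not part of the original notebook):

```python
from datetime import datetime, timezone
from typing import Optional

def resolve_ingestion_date(override: Optional[str] = None) -> str:
    """Return the partition date as YYYY-MM-DD.

    Without an override, use today's date in UTC. With an override, parse it
    as a calendar date and return it in zero-padded form.
    """
    if override is None:
        return datetime.now(timezone.utc).strftime("%Y-%m-%d")
    # strptime raises ValueError for malformed or impossible dates (e.g. 2026-02-30)
    parsed = datetime.strptime(override, "%Y-%m-%d")
    return parsed.strftime("%Y-%m-%d")
```

Passing an explicit date, for example `resolve_ingestion_date("2026-01-16")`, makes it possible to reload a given day into the same folder.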


## 2) Generic ingestion function (PostgreSQL → Bronze)
This function:
- reads the table over JDBC
- adds the mandatory technical columns
- writes Parquet to `s3a://bronze/<table_name>/<YYYY-MM-DD>/` (partitioning by folder)
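
The function below writes into a dated folder (`<table_name>/<YYYY-MM-DD>/`). An alternative is to let Spark manage the partition with `partitionBy`, which produces Hive-style folders such as `<table_name>/_ingestion_date=2026-01-16/` and lets readers of the table root skip folders when filtering on `_ingestion_date`. The sketch below is a hypothetical variant (`write_bronze_partitioned` is not part of the original notebook). It assumes `df` is a Spark DataFrame that already carries the `_ingestion_date` column and that `spark.sql.sources.partitionOverwriteMode` is set to `dynamic`, so that `overwrite` replaces only the partitions present in `df`.

```python
def write_bronze_partitioned(df, table_name: str, base_path: str = "s3a://bronze") -> str:
    """Write `df` as Parquet under <base_path>/<table_name>/, partitioned by _ingestion_date.

    Returns the table root path (not the path of a single partition).
    """
    target_path = f"{base_path}/{table_name}"
    (df.write
        .mode("overwrite")                # with dynamic mode, only partitions present in df are replaced
        .partitionBy("_ingestion_date")   # creates _ingestion_date=YYYY-MM-DD/ subfolders
        .parquet(target_path))
    return target_path
```

With this layout the partition value lives in the folder name rather than in the Parquet files, and Spark restores it as a column when the table root is read.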

In [10]:
def ingest_table_to_bronze(table_name: str, *, ingestion_date: str, base_path: str = "s3a://bronze") -> dict:
    """
    Ingest a PostgreSQL table into the Bronze zone on MinIO.

    Writes Parquet and partitions by ingestion date (YYYY-MM-DD) through the folder structure.
    Adds the technical metadata columns:
      - _ingestion_timestamp
      - _source_system = 'postgresql'
      - _table_name
      - _ingestion_date

    Returns a small dictionary of stats.
    """
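    # Evaluated once per query, so every row of the write gets the same timestamp
    ingestion_ts_col = F.current_timestamp()

    df = spark.read.jdbc(url=jdbc_url, table=table_name, properties=jdbc_properties)
    # count() is an action: it runs its own JDBC read, separate from the write below
    row_count = df.count()

    df_out = (df
        .withColumn("_ingestion_timestamp", ingestion_ts_col)
        .withColumn("_source_system", F.lit("postgresql"))
        .withColumn("_table_name", F.lit(table_name))
        .withColumn("_ingestion_date", F.lit(ingestion_date))
    )

    # One dated folder per table and per run
    target_path = f"{base_path}/{table_name}/{ingestion_date}"

    # "overwrite" replaces the dated folder, so a same-day re-run does not duplicate rows
    df_out.write.mode("overwrite").parquet(target_path)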

    return {
        "table": table_name,
        "rows": row_count,
        "path": target_path,
        "partition": ingestion_date,
    }
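
`table_name` ends up both in the JDBC read and in the S3 path, so an unexpected value (a typo with spaces, a schema-qualified name, a subquery) would either fail deep inside Spark or create an oddly named folder. A small guard could reject such names up front. This is a sketch under the assumption that Bronze table names are plain lowercase identifiers; `validate_table_name` and its pattern are illustrative and not part of the original notebook.

```python
import re

# Assumption: plain identifiers only (lowercase letters, digits, underscore),
# starting with a letter or an underscore
_TABLE_NAME_RE = re.compile(r"[a-z_][a-z0-9_]*")

def validate_table_name(table_name: str) -> str:
    """Return `table_name` unchanged if it is a plain identifier, else raise ValueError."""
    if not _TABLE_NAME_RE.fullmatch(table_name):
        raise ValueError(f"Unsupported table name: {table_name!r}")
    return table_name
```

Calling `validate_table_name(table_name)` as the first line of the ingestion function would turn a malformed name into an immediate, readable error.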

## 3) Running Phase 1: required tables (+ bonus)
Required tables: `customers`, `orders`, `order_details`, `products`

Bonus tables (if available): `employees`, `suppliers`, `categories`

In [11]:
required_tables = ["customers", "orders", "order_details", "products"]
bonus_tables = ["employees", "suppliers", "categories"]
tables_to_ingest = required_tables + bonus_tables

results = []
errors = []

print("=== PHASE 1 - BRONZE INGESTION ===")
for table in tables_to_ingest:
    try:
        stats = ingest_table_to_bronze(table, ingestion_date=ingestion_date)
        results.append(stats)
        print(f"[OK] {stats['table']} -> {stats['rows']} rows -> {stats['path']} (_ingestion_date={stats['partition']})")
    except Exception as e:
        # Keep going so that one failing table does not block the others
        errors.append({"table": table, "error": str(e)})
        print(f"[ERROR] {table}: {e}")

print("\nSummary OK:")
for r in results:
    print(f"  - {r['table']}: {r['rows']} rows")

if errors:
    print("\nSummary of errors:")
    for err in errors:
        print(f"  - {err['table']}: {err['error']}")

=== PHASE 1 - BRONZE INGESTION ===
[OK] customers -> 91 rows -> s3a://bronze/customers/2026-01-16 (_ingestion_date=2026-01-16)
[OK] orders -> 830 rows -> s3a://bronze/orders/2026-01-16 (_ingestion_date=2026-01-16)
[OK] order_details -> 2155 rows -> s3a://bronze/order_details/2026-01-16 (_ingestion_date=2026-01-16)
[OK] products -> 77 rows -> s3a://bronze/products/2026-01-16 (_ingestion_date=2026-01-16)
[OK] employees -> 9 rows -> s3a://bronze/employees/2026-01-16 (_ingestion_date=2026-01-16)
[OK] suppliers -> 29 rows -> s3a://bronze/suppliers/2026-01-16 (_ingestion_date=2026-01-16)
[OK] categories -> 8 rows -> s3a://bronze/categories/2026-01-16 (_ingestion_date=2026-01-16)

Summary OK:
  - customers: 91 rows
  - orders: 830 rows
  - order_details: 2155 rows
  - products: 77 rows
  - employees: 9 rows
  - suppliers: 29 rows
  - categories: 8 rows
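
The loop above records failures and carries on, which suits the bonus tables. A failure on a required table should probably stop the run before any downstream step. A possible follow-up check, assuming the `errors` list has the shape built in the loop (`failed_required_tables` is a hypothetical helper, and the sample errors are made-up values for illustration):

```python
def failed_required_tables(errors: list, required_tables: list) -> list:
    """Return the required tables that appear in `errors`, in the order of `required_tables`."""
    failed = {err["table"] for err in errors}
    return [t for t in required_tables if t in failed]

# Illustration with made-up errors: one bonus table and one required table failed
example_errors = [
    {"table": "employees", "error": "example failure"},
    {"table": "orders", "error": "example failure"},
]
example_required = ["customers", "orders", "order_details", "products"]
print(failed_required_tables(example_errors, example_required))  # ['orders']
```

In the notebook, raising an exception when this list is non-empty would make the cell fail visibly instead of only printing a summary.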


## 4) Quick checks
We read one table back from Bronze and check that the technical columns and the partition are present.

In [12]:
# Example: check customers (reads today's partition)
df_bronze_customers = spark.read.parquet(f"s3a://bronze/customers/{ingestion_date}")

expected_cols = {"_ingestion_timestamp", "_source_system", "_table_name", "_ingestion_date"}
missing = expected_cols - set(df_bronze_customers.columns)
print("Missing columns:", missing)

df_bronze_customers.select(
    "_ingestion_timestamp",
    "_source_system",
    "_table_name",
    "_ingestion_date",
).show(5, truncate=False)

df_bronze_customers.groupBy("_ingestion_date").count().orderBy("_ingestion_date").show(truncate=False)

Missing columns: set()
+--------------------------+--------------+-----------+---------------+
|_ingestion_timestamp      |_source_system|_table_name|_ingestion_date|
+--------------------------+--------------+-----------+---------------+
|2026-01-16 09:42:31.851842|postgresql    |customers  |2026-01-16     |
|2026-01-16 09:42:31.851842|postgresql    |customers  |2026-01-16     |
|2026-01-16 09:42:31.851842|postgresql    |customers  |2026-01-16     |
|2026-01-16 09:42:31.851842|postgresql    |customers  |2026-01-16     |
|2026-01-16 09:42:31.851842|postgresql    |customers  |2026-01-16     |
+--------------------------+--------------+-----------+---------------+
only showing top 5 rows
+---------------+-----+
|_ingestion_date|count|
+---------------+-----+
|2026-01-16     |91   |
+---------------+-----+
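
The same checks can be repeated for every ingested table by comparing the row count returned at ingestion time with the count read back from Bronze. The sketch below keeps the comparison logic separate from Spark so that it stays easy to test. `check_bronze_table` is a hypothetical helper, and the commented usage assumes the `results` list and the `spark` session from the cells above.

```python
EXPECTED_COLS = {"_ingestion_timestamp", "_source_system", "_table_name", "_ingestion_date"}

def check_bronze_table(columns, bronze_rows: int, source_rows: int) -> list:
    """Return a list of human-readable problems; an empty list means the table passed."""
    problems = []
    missing = EXPECTED_COLS - set(columns)
    if missing:
        problems.append(f"missing technical columns: {sorted(missing)}")
    if bronze_rows != source_rows:
        problems.append(f"row count mismatch: bronze={bronze_rows}, source={source_rows}")
    return problems

# Usage in the notebook (not executed here):
# for r in results:
#     df = spark.read.parquet(r["path"])
#     print(r["table"], check_bronze_table(df.columns, df.count(), r["rows"]))
```

An empty list for every table means the technical columns are present and no rows were lost between the JDBC read and the Parquet write.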

