# BRONZE LAYER - Extract from DIRECT sources

**Correct ETL flow**: Extract → Transform → Load

**Authors**: Nejma MOUALHI | Brieuc OLIVIERI | Nicolas TAING

---

## Data sources

### 1. PostgreSQL (original tables):
- Patient, Consultation, Diagnostic, Professionnel_de_sante, etc.
- AAAA + date: hospitalization data (82K rows)

### 2. RAW CSV files (read directly from /DATA_2024/):
- Healthcare establishments (416K rows)
- Satisfaction 2019 (1K rows)
- Deaths, 2019 ONLY (about 620K rows, FILTERED from 25M)
- French departments (101 rows, geographic reference)

The `deces.csv` file is filtered to keep only records from 2019.

In [1]:
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit, year, to_date, col
from datetime import datetime
import time

print("Imports loaded successfully")

Imports loaded successfully


In [2]:
# Spark configuration with more memory for large files
spark = SparkSession.builder \
    .appName("CHU_Bronze_Extract_Sources_Directes") \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

print(f"Spark {spark.version} started successfully")
print(f"Master: {spark.sparkContext.master}")

Spark 3.5.0 started successfully
Master: local[*]


In [3]:
# PostgreSQL configuration for the original tables
JDBC_URL = "jdbc:postgresql://chu_postgres:5432/healthcare_data"
JDBC_PROPS = {
    "user": "admin",
    "password": "admin123",
    "driver": "org.postgresql.Driver"
}

# Original PostgreSQL tables (13 tables)
POSTGRES_TABLES = [
    "Patient",
    "Consultation",
    "Diagnostic",
    "Professionnel_de_sante",
    "Mutuelle",
    "Adher",
    "Prescription",
    "Medicaments",
    "Laboratoire",
    "Salle",
    "Specialites",
    "date",
    "AAAA"
]

# Path configuration
DATA_DIR = "/home/jovyan/DATA_2024"
OUTPUT_BASE = "/home/jovyan/data/bronze"

print(f"{len(POSTGRES_TABLES)} PostgreSQL tables configured for extraction")
print(f"4 CSV files configured for extraction")
print(f"Destination: {OUTPUT_BASE}")

13 PostgreSQL tables configured for extraction
4 CSV files configured for extraction
Destination: /home/jovyan/data/bronze
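
The JDBC credentials in the cell above are hard-coded in the notebook. A minimal sketch of reading them from the environment instead is given below. The variable names `CHU_PG_USER` and `CHU_PG_PASSWORD` and the helper `build_jdbc_props` are hypothetical, chosen for illustration only; they are not part of the original project.

```python
import os


def build_jdbc_props(env=None):
    """Build the JDBC properties dict from environment variables.

    CHU_PG_USER and CHU_PG_PASSWORD are hypothetical variable names
    chosen for this sketch.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("CHU_PG_USER", "CHU_PG_PASSWORD") if k not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "user": env["CHU_PG_USER"],
        "password": env["CHU_PG_PASSWORD"],
        "driver": "org.postgresql.Driver",
    }


# Example with an explicit mapping instead of the real environment
props = build_jdbc_props({"CHU_PG_USER": "admin", "CHU_PG_PASSWORD": "secret"})
print(props["user"], props["driver"])
```

The returned dictionary has the same keys as `JDBC_PROPS`, so it could be passed to `spark.read.jdbc(..., properties=...)` unchanged.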


## PART 1: Extract PostgreSQL (original tables)

In [4]:
def ingest_postgres_table(table_name):
    """Extract one PostgreSQL table into the Bronze layer."""
    print(f"\n{'='*80}")
    print(f"Extracting PostgreSQL table: {table_name}")
    print(f"{'='*80}")
    
    start_time = time.time()
    
    try:
        df = spark.read.jdbc(
            url=JDBC_URL,
            table=f'"{table_name}"',
            properties=JDBC_PROPS
        )
        
        row_count = df.count()
        col_count = len(df.columns)
        
        print(f"Read: {row_count:,} rows, {col_count} columns")
        
        # Add ingestion metadata
        df_with_meta = df \
            .withColumn("ingestion_timestamp", current_timestamp()) \
            .withColumn("ingestion_date", lit(datetime.now().strftime("%Y-%m-%d")))
        
        # Save to Bronze
        output_path = f"{OUTPUT_BASE}/postgres/{table_name}"
        df_with_meta.write \
            .mode("overwrite") \
            .partitionBy("ingestion_date") \
            .parquet(output_path)
        
        elapsed = time.time() - start_time
        
        print(f"Saved to: {output_path}")
        print(f"Time elapsed: {elapsed:.2f}s")
        print(f"{table_name} extraction completed successfully")
        
        return {
            "source": "PostgreSQL",
            "table": table_name,
            "rows": row_count,
            "cols": col_count,
            "time_sec": round(elapsed, 2),
            "status": "SUCCESS"
        }
        
    except Exception as e:
        print(f"ERROR: {str(e)}")
        return {
            "source": "PostgreSQL",
            "table": table_name,
            "rows": 0,
            "cols": 0,
            "time_sec": 0,
            "status": f"ERROR: {str(e)}"
        }

print("PostgreSQL ingestion function defined")

PostgreSQL ingestion function defined
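
The function above builds the same result dictionary twice, once on success and once on failure. Assuming a small shared helper is acceptable, the record could be built in one place, as sketched below. `make_result` is a hypothetical name and not a function from the notebook.

```python
def make_result(source, table, rows=0, cols=0, elapsed=0.0, error=None):
    """Return one extraction summary record (hypothetical helper)."""
    return {
        "source": source,
        "table": table,
        "rows": rows,
        "cols": cols,
        "time_sec": round(elapsed, 2),
        # Same status convention as the notebook: SUCCESS or "ERROR: <message>"
        "status": "SUCCESS" if error is None else f"ERROR: {error}",
    }


ok = make_result("PostgreSQL", "Patient", rows=100_000, cols=16, elapsed=6.634)
ko = make_result("PostgreSQL", "Patient", error=ValueError("connection refused"))
print(ok["status"], ok["time_sec"])
print(ko["status"])
```

Keeping the record shape in one function means the summary cell can rely on the same keys for every source.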


In [5]:
# POSTGRESQL INGESTION
print("\n" + "="*80)
print("POSTGRESQL EXTRACTION - ORIGINAL TABLES")
print("="*80)

results = []

for table in POSTGRES_TABLES:
    result = ingest_postgres_table(table)
    results.append(result)

print("\n" + "="*80)
print("POSTGRESQL EXTRACTION COMPLETED")
print("="*80)


POSTGRESQL EXTRACTION - ORIGINAL TABLES

Extracting PostgreSQL table: Patient
Read: 100,000 rows, 16 columns
Saved to: /home/jovyan/data/bronze/postgres/Patient
Time elapsed: 6.63s
Patient extraction completed successfully

Extracting PostgreSQL table: Consultation
Read: 1,027,157 rows, 9 columns
Saved to: /home/jovyan/data/bronze/postgres/Consultation
Time elapsed: 10.50s
Consultation extraction completed successfully

Extracting PostgreSQL table: Diagnostic
Read: 15,490 rows, 2 columns
Saved to: /home/jovyan/data/bronze/postgres/Diagnostic
Time elapsed: 1.09s
Diagnostic extraction completed successfully

Extracting PostgreSQL table: Professionnel_de_sante
Read: 1,048,575 rows, 8 columns
Saved to: /home/jovyan/data/bronze/postgres/Professionnel_de_sante
Time elapsed: 5.15s
Professionnel_de_sante extraction completed successfully

Extracting PostgreSQL table: Mutuelle
Read: 254 rows, 3 columns
Saved to: /home/jovyan/data/bronze/postgres/Mutuelle
Time elapsed: 0.96s
Mutuelle extraction

## PART 2: Extract RAW CSV files (directly from /DATA_2024/)

In [6]:
def ingest_csv_file(name, file_path, separator=";", encoding="UTF-8"):
    """Extract one CSV file into the Bronze layer."""
    print(f"\n{'='*80}")
    print(f"Extracting CSV: {name}")
    print(f"File: {file_path}")
    print(f"{'='*80}")
    
    start_time = time.time()
    
    try:
        # Read the RAW CSV
        df = spark.read \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .option("sep", separator) \
            .option("encoding", encoding) \
            .csv(file_path)
        
        row_count = df.count()
        col_count = len(df.columns)
        
        print(f"Read: {row_count:,} rows, {col_count} columns")
        
        # Add ingestion metadata
        df_with_meta = df \
            .withColumn("ingestion_timestamp", current_timestamp()) \
            .withColumn("ingestion_date", lit(datetime.now().strftime("%Y-%m-%d")))
        
        # Save to Bronze
        output_path = f"{OUTPUT_BASE}/csv/{name}"
        df_with_meta.write \
            .mode("overwrite") \
            .parquet(output_path)
        
        elapsed = time.time() - start_time
        
        print(f"Saved to: {output_path}")
        print(f"Time elapsed: {elapsed:.2f}s")
        print(f"{name} extraction completed successfully")
        
        return {
            "source": "CSV",
            "table": name,
            "rows": row_count,
            "cols": col_count,
            "time_sec": round(elapsed, 2),
            "status": "SUCCESS"
        }
        
    except Exception as e:
        print(f"ERROR: {str(e)}")
        return {
            "source": "CSV",
            "table": name,
            "rows": 0,
            "cols": 0,
            "time_sec": 0,
            "status": f"ERROR: {str(e)}"
        }

print("CSV ingestion function defined")

CSV ingestion function defined


In [7]:
# 1. HEALTHCARE ESTABLISHMENTS
result = ingest_csv_file(
    name="etablissement_sante",
    file_path=f"{DATA_DIR}/Etablissement de SANTE/etablissement_sante.csv",
    separator=";"
)
results.append(result)


Extracting CSV: etablissement_sante
File: /home/jovyan/DATA_2024/Etablissement de SANTE/etablissement_sante.csv
Read: 416,665 rows, 24 columns
Saved to: /home/jovyan/data/bronze/csv/etablissement_sante
Time elapsed: 6.53s
etablissement_sante extraction completed successfully


In [8]:
# 2. SATISFACTION 2019
result = ingest_csv_file(
    name="satisfaction_esatis48h_2019",
    file_path=f"{DATA_DIR}/Satisfaction/2019/resultats-esatis48h-mco-open-data-2019.csv",
    separator=";"
)
results.append(result)


Extracting CSV: satisfaction_esatis48h_2019
File: /home/jovyan/DATA_2024/Satisfaction/2019/resultats-esatis48h-mco-open-data-2019.csv
Read: 1,152 rows, 25 columns
Saved to: /home/jovyan/data/bronze/csv/satisfaction_esatis48h_2019
Time elapsed: 1.06s
satisfaction_esatis48h_2019 extraction completed successfully


In [9]:
# 3. DEATHS, 2019 ONLY (FILTERED)
print(f"\n{'='*80}")
print(f"Extracting CSV: deces (2019 DATA ONLY - FILTERED)")
print(f"File: {DATA_DIR}/DECES EN FRANCE/deces.csv")
print(f"{'='*80}")

start_time = time.time()

try:
    # Read the raw CSV
    print("Reading complete CSV file...")
    df_deces_raw = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("mode", "PERMISSIVE") \
        .option("multiLine", "false") \
        .csv(f"{DATA_DIR}/DECES EN FRANCE/deces.csv")
    
    # FILTER: keep only 2019
    print("Filtering 2019 deaths only...")
    df_deces_full = df_deces_raw.filter(col("date_deces").startswith("2019"))
    
    # Repartition to optimize the write
    df_deces_full = df_deces_full.repartition(10)
    
    row_count = df_deces_full.count()
    col_count = len(df_deces_full.columns)
    print(f"Total 2019 records: {row_count:,} rows, {col_count} columns")
    print(f"FILTERED: Only 2019 data (98% reduction)")
    
    # Add ingestion metadata
    df_with_meta = df_deces_full \
        .withColumn("ingestion_timestamp", current_timestamp()) \
        .withColumn("ingestion_date", lit(datetime.now().strftime("%Y-%m-%d")))
    
    # Save to Bronze (2019 DATA only)
    output_path = f"{OUTPUT_BASE}/csv/deces_2019"
    df_with_meta.write \
        .mode("overwrite") \
        .option("compression", "snappy") \
        .parquet(output_path)
    
    elapsed = time.time() - start_time
    
    print(f"Saved to: {output_path}")
    print(f"Time elapsed: {elapsed:.2f}s")
    print(f"deces 2019 extraction completed ({row_count:,} rows)")
    
    results.append({
        "source": "CSV",
        "table": "deces_2019",
        "rows": row_count,
        "cols": col_count,
        "time_sec": round(elapsed, 2),
        "status": "SUCCESS"
    })
    
except Exception as e:
    print(f"ERROR: {str(e)}")
    import traceback
    traceback.print_exc()
    results.append({
        "source": "CSV",
        "table": "deces_2019",
        "rows": 0,
        "cols": 0,
        "time_sec": 0,
        "status": f"ERROR: {str(e)}"
    })


Extracting CSV: deces (2019 DATA ONLY - FILTERED)
File: /home/jovyan/DATA_2024/DECES EN FRANCE/deces.csv
Reading complete CSV file...
Filtering 2019 deaths only...
Total 2019 records: 620,626 rows, 10 columns
FILTERED: Only 2019 data (98% reduction)
Saved to: /home/jovyan/data/bronze/csv/deces_2019
Time elapsed: 36.51s
deces 2019 extraction completed (620,626 rows)
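
The "98% reduction" message in the cell above is a hard-coded string. As a rough sketch, the figure could be derived from row counts instead. The 25,000,000 total used here is the approximate size quoted in this notebook, not a measured count, and `reduction_pct` is a hypothetical helper.

```python
def reduction_pct(total_rows, kept_rows):
    """Percentage of rows dropped by a filter."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return 100.0 * (1 - kept_rows / total_rows)


# 620,626 rows kept (from the run above) out of roughly 25M (approximate)
pct = reduction_pct(25_000_000, 620_626)
print(f"FILTERED: Only 2019 data ({pct:.1f}% reduction)")
```

In the notebook, `total_rows` would come from `df_deces_raw.count()`, at the cost of an extra pass over the full file.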


In [10]:
# 4. FRENCH DEPARTMENTS (geographic reference)
result = ingest_csv_file(
    name="departements",
    file_path=f"{DATA_DIR}/departements-francais.csv",
    separator=";"
)
results.append(result)


Extracting CSV: departements
File: /home/jovyan/DATA_2024/departements-francais.csv
Read: 101 rows, 4 columns
Saved to: /home/jovyan/data/bronze/csv/departements
Time elapsed: 0.82s
departements extraction completed successfully


## GLOBAL SUMMARY

In [11]:
# SUMMARY - Global statistics
import pandas as pd

df_results = pd.DataFrame(results)

# Keep successful extractions only
success = df_results[df_results['status'] == 'SUCCESS']

print("\nGLOBAL STATISTICS")
print("="*60)
print(f"Tables extracted: {len(success)}/{len(results)}")
print(f"Total rows: {success['rows'].sum():,}")
print(f"Total time: {success['time_sec'].sum():.2f}s")

print("\nBy source:")
for source in success['source'].unique():
    source_data = success[success['source'] == source]
    print(f"\n  {source}:")
    print(f"    - Tables: {len(source_data)}")
    print(f"    - Rows: {source_data['rows'].sum():,}")
    print(f"    - Time: {source_data['time_sec'].sum():.2f}s")

print("\n" + "="*60)
print(f"\nData saved in: {OUTPUT_BASE}/")
print("  bronze/postgres/ - 13 original tables")
print("  bronze/csv/ - 4 CSV files")
print("="*60)


GLOBAL STATISTICS
Tables extracted: 17/17
Total rows: 4,712,928
Total time: 78.70s

By source:

  PostgreSQL:
    - Tables: 13
    - Rows: 3,674,384
    - Time: 33.78s

  CSV:
    - Tables: 4
    - Rows: 1,038,544
    - Time: 44.92s


Data saved in: /home/jovyan/data/bronze/
  bronze/postgres/ - 13 original tables
  bronze/csv/ - 4 CSV files


---

## BRONZE LAYER - EXTRACTION COMPLETE

### Data extracted:

#### PostgreSQL (13 tables):
- 100K patients
- 1M+ consultations
- 1M+ healthcare professionals
- 82K hospitalizations (AAAA + date tables)
- Diagnostic, Prescription, Medicaments, etc.

#### CSV (4 files read DIRECTLY):
- 416K healthcare establishments
- 1K satisfaction evaluations 2019
- 620K deaths in 2019 (FILTERED from 25M, about 98% fewer rows)
- 101 French departments

### Total: ~4.7 million rows (after filtering deaths to 2019)

### Next step:

**Notebook 02**: Transform Silver (Cleaning, Anonymization, Formatting)

**Important**: Deaths are now filtered to 2019 only for performance and consistency with 2019 satisfaction data. Hospitalizations come from AAAA + date tables.
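
The per-file counts reported in the extraction logs can be cross-checked against the global statistics with a few lines of plain Python. The numbers below are copied from the outputs in this notebook; the variable names are only illustrative.

```python
# Row counts copied from the CSV extraction logs in this notebook
csv_rows = {
    "etablissement_sante": 416_665,
    "satisfaction_esatis48h_2019": 1_152,
    "deces_2019": 620_626,
    "departements": 101,
}
postgres_rows = 3_674_384  # PostgreSQL subtotal from the global statistics

csv_total = sum(csv_rows.values())
grand_total = csv_total + postgres_rows
print(f"CSV rows: {csv_total:,}")
print(f"All rows: {grand_total:,}")
```

Both sums agree with the figures printed by the summary cell (1,038,544 CSV rows and 4,712,928 rows overall).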


In [12]:
# In a Jupyter cell
df = spark.read.parquet("/home/jovyan/data/bronze/postgres/Patient")
print(f"Patients in Bronze: {df.count():,}")
df.show(5)

Patients in Bronze: 100,000
+----------+----------+--------+------+--------------------+-----------------+-----------+----+--------------------+--------------+----------+---+----------------+--------------+-----+------+--------------------+--------------+
|Id_patient|       Nom|  Prenom|  Sexe|             Adresse|            Ville|Code_postal|Pays|               EMail|           Tel|      Date|Age|        Num_Secu|Groupe_sanguin| Poid|Taille| ingestion_timestamp|ingestion_date|
+----------+----------+--------+------+--------------------+-----------------+-----------+----+--------------------+--------------+----------+---+----------------+--------------+-----+------+--------------------+--------------+
|         1|Christabel|  Tougas|female|12 rue du Faubour...|       THIONVILLE|      57100|  FR|ChristabelTougas@...|03.85.46.00.55|  4/6/1980| 41|5571905089387417|            O+| 54.3|   162|2025-10-24 19:11:...|    2025-10-24|
|         2|  Lorraine|   Lebel|female|   21 rue Jean Vilar