# 📦 BRONZE LAYER - Extract from DIRECT sources

**Correct ETL flow**: Extract → Transform → Load

**Authors**: Nejma MOUALHI | Brieuc OLIVIERI | Nicolas TAING

---

## 🎯 Data sources

### 1. PostgreSQL (original tables):
- Patient, Consultation, Diagnostic, Professionnel_de_sante, etc.
- **AAAA + date**: hospitalization data (82K rows)

### 2. RAW CSV files (read directly from /DATA_2024/):
- ✅ Healthcare facilities (416K rows)
- ✅ Satisfaction 2019 (1K rows)
- ✅ **Deaths, 2019 ONLY** (620K rows, FILTERED from 25M)
- ✅ French departments (101 rows)

**Important**: The CSV files are read **DIRECTLY** from the file system, **NOT** from PostgreSQL!

**New**: The `deces.csv` file is now **FILTERED** to keep only **2019**, which cuts the rows written to Bronze by about 98%.

In [1]:
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit, year, to_date, col
from datetime import datetime
import time

print("✅ Imports OK")

✅ Imports OK


In [2]:
# Spark configuration with extra memory for the large files
spark = SparkSession.builder \
    .appName("CHU_Bronze_Extract_Sources_Directes") \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

print(f"✅ Spark {spark.version} started")
print(f"📊 Master: {spark.sparkContext.master}")

✅ Spark 3.5.0 started
📊 Master: local[*]


In [3]:
# PostgreSQL configuration for the original tables
JDBC_URL = "jdbc:postgresql://chu_postgres:5432/healthcare_data"
JDBC_PROPS = {
    "user": "admin",
    "password": "admin123",
    "driver": "org.postgresql.Driver"
}

# Original PostgreSQL tables (13 tables)
POSTGRES_TABLES = [
    "Patient",
    "Consultation",
    "Diagnostic",
    "Professionnel_de_sante",
    "Mutuelle",
    "Adher",
    "Prescription",
    "Medicaments",
    "Laboratoire",
    "Salle",
    "Specialites",
    "date",
    "AAAA"
]

# Path configuration
DATA_DIR = "/home/jovyan/DATA_2024"
OUTPUT_BASE = "/home/jovyan/data/bronze"

print(f"✅ {len(POSTGRES_TABLES)} PostgreSQL tables to extract")
print("✅ 4 CSV files to extract")  # Changed: 4 instead of 3
print(f"💾 Destination: {OUTPUT_BASE}")

✅ 13 PostgreSQL tables to extract
✅ 4 CSV files to extract
💾 Destination: /home/jovyan/data/bronze
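
The connection settings above embed the database user and password directly in the notebook. A minimal sketch of one alternative, assuming the credentials are exported as environment variables: the variable names `CHU_PG_USER` and `CHU_PG_PASSWORD` and the helper `build_jdbc_props` are hypothetical and not part of the original notebook.

```python
import os


def build_jdbc_props(env=None):
    """Build the JDBC properties dict from environment variables.

    CHU_PG_USER and CHU_PG_PASSWORD are hypothetical names chosen for
    this sketch; adapt them to the actual deployment.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("CHU_PG_USER", "CHU_PG_PASSWORD") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "user": env["CHU_PG_USER"],
        "password": env["CHU_PG_PASSWORD"],
        "driver": "org.postgresql.Driver",
    }
```

With the variables set in the container, `JDBC_PROPS = build_jdbc_props()` could replace the literal dictionary, and a missing variable would fail early with an explicit message.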


## 📊 PART 1: Extract PostgreSQL (original tables)

In [4]:
def ingest_postgres_table(table_name):
    """Extract one PostgreSQL table into the Bronze layer."""
    print(f"\n{'='*80}")
    print(f"🔄 Extract PostgreSQL: {table_name}")
    print(f"{'='*80}")
    
    start_time = time.time()
    
    try:
        df = spark.read.jdbc(
            url=JDBC_URL,
            table=f'"{table_name}"',
            properties=JDBC_PROPS
        )
        
        row_count = df.count()
        col_count = len(df.columns)
        
        print(f"📖 Read: {row_count:,} rows, {col_count} columns")
        
        # Add ingestion metadata
        df_with_meta = df \
            .withColumn("ingestion_timestamp", current_timestamp()) \
            .withColumn("ingestion_date", lit(datetime.now().strftime("%Y-%m-%d")))
        
        # Save to Bronze
        output_path = f"{OUTPUT_BASE}/postgres/{table_name}"
        df_with_meta.write \
            .mode("overwrite") \
            .partitionBy("ingestion_date") \
            .parquet(output_path)
        
        elapsed = time.time() - start_time
        
        print(f"💾 Saved: {output_path}")
        print(f"⏱️  Time: {elapsed:.2f}s")
        print(f"✅ {table_name} OK")
        
        return {
            "source": "PostgreSQL",
            "table": table_name,
            "rows": row_count,
            "cols": col_count,
            "time_sec": round(elapsed, 2),
            "status": "SUCCESS"
        }
        
    except Exception as e:
        print(f"❌ ERROR: {str(e)}")
        return {
            "source": "PostgreSQL",
            "table": table_name,
            "rows": 0,
            "cols": 0,
            "time_sec": 0,
            "status": f"ERROR: {str(e)}"
        }

print("✅ PostgreSQL ingestion function defined")

✅ PostgreSQL ingestion function defined
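
`ingest_postgres_table` wraps the table name in double quotes. PostgreSQL folds unquoted identifiers to lowercase, so a table created as `"Patient"` would otherwise be looked up as `patient`. The sketch below shows a quoting helper that also escapes embedded double quotes; `quote_pg_identifier` is a hypothetical name, not part of the notebook.

```python
def quote_pg_identifier(name: str) -> str:
    """Return `name` as a double-quoted PostgreSQL identifier.

    Embedded double quotes are doubled, following the SQL rule for
    delimited identifiers.
    """
    if not name:
        raise ValueError("identifier must not be empty")
    return '"' + name.replace('"', '""') + '"'


# Table names taken from POSTGRES_TABLES in this notebook.
for table in ["Patient", "date", "AAAA"]:
    print(quote_pg_identifier(table))
```

Under that assumption, the read call could pass `table=quote_pg_identifier(table_name)` instead of building the quoted string inline.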


In [5]:
# POSTGRESQL INGESTION
print("\n" + "="*80)
print("🚀 POSTGRESQL EXTRACTION - ORIGINAL TABLES")
print("="*80)

results = []

for table in POSTGRES_TABLES:
    result = ingest_postgres_table(table)
    results.append(result)

print("\n" + "="*80)
print("✅ POSTGRESQL EXTRACTION COMPLETE")
print("="*80)


🚀 POSTGRESQL EXTRACTION - ORIGINAL TABLES

🔄 Extract PostgreSQL: Patient
📖 Read: 100,000 rows, 16 columns
💾 Saved: /home/jovyan/data/bronze/postgres/Patient
⏱️  Time: 9.58s
✅ Patient OK

🔄 Extract PostgreSQL: Consultation
📖 Read: 1,027,157 rows, 9 columns
💾 Saved: /home/jovyan/data/bronze/postgres/Consultation
⏱️  Time: 16.74s
✅ Consultation OK

🔄 Extract PostgreSQL: Diagnostic
📖 Read: 15,490 rows, 2 columns
💾 Saved: /home/jovyan/data/bronze/postgres/Diagnostic
⏱️  Time: 1.17s
✅ Diagnostic OK

🔄 Extract PostgreSQL: Professionnel_de_sante
📖 Read: 1,048,575 rows, 8 columns
💾 Saved: /home/jovyan/data/bronze/postgres/Professionnel_de_sante
⏱️  Time: 6.23s
✅ Professionnel_de_sante OK

🔄 Extract PostgreSQL: Mutuelle
📖 Read: 254 rows, 3 columns
💾 Saved: /home/jovyan/data/bronze/postgres/Mutuelle
⏱️  Time: 1.32s
✅ Mutuelle OK

🔄 Extract PostgreSQL: Adher
📖 Read: 96,671 [output truncated]

## 📄 PART 2: Extract RAW CSV files (directly from /DATA_2024/)

In [6]:
def ingest_csv_file(name, file_path, separator=";", encoding="UTF-8"):
    """Extract one CSV file into the Bronze layer."""
    print(f"\n{'='*80}")
    print(f"🔄 Extract CSV: {name}")
    print(f"📁 File: {file_path}")
    print(f"{'='*80}")
    
    start_time = time.time()
    
    try:
        # Read the RAW CSV
        df = spark.read \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .option("sep", separator) \
            .option("encoding", encoding) \
            .csv(file_path)
        
        row_count = df.count()
        col_count = len(df.columns)
        
        print(f"📖 Read: {row_count:,} rows, {col_count} columns")
        
        # Add ingestion metadata
        df_with_meta = df \
            .withColumn("ingestion_timestamp", current_timestamp()) \
            .withColumn("ingestion_date", lit(datetime.now().strftime("%Y-%m-%d")))
        
        # Save to Bronze
        output_path = f"{OUTPUT_BASE}/csv/{name}"
        df_with_meta.write \
            .mode("overwrite") \
            .parquet(output_path)
        
        elapsed = time.time() - start_time
        
        print(f"💾 Saved: {output_path}")
        print(f"⏱️  Time: {elapsed:.2f}s")
        print(f"✅ {name} OK")
        
        return {
            "source": "CSV",
            "table": name,
            "rows": row_count,
            "cols": col_count,
            "time_sec": round(elapsed, 2),
            "status": "SUCCESS"
        }
        
    except Exception as e:
        print(f"❌ ERROR: {str(e)}")
        return {
            "source": "CSV",
            "table": name,
            "rows": 0,
            "cols": 0,
            "time_sec": 0,
            "status": f"ERROR: {str(e)}"
        }

print("✅ CSV ingestion function defined")

✅ CSV ingestion function defined
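
Both ingestion functions return the same record shape on success and on failure. As a sketch, that record could be built by a single helper so the two functions cannot drift apart; `make_result` is a hypothetical name introduced here, and the sample values are illustrative.

```python
def make_result(source, table, rows=0, cols=0, elapsed=0.0, error=None):
    """Build the per-table record that the notebook appends to `results`."""
    return {
        "source": source,
        "table": table,
        "rows": rows,
        "cols": cols,
        "time_sec": round(elapsed, 2),
        "status": "SUCCESS" if error is None else f"ERROR: {error}",
    }


# Illustrative calls: one success and one failure record.
ok = make_result("CSV", "departements", rows=101, cols=4, elapsed=1.2249)
ko = make_result("CSV", "departements", error=FileNotFoundError("no such file"))
print(ok["status"], ok["time_sec"])
print(ko["status"])
```

Each `return {...}` in the two functions would then shrink to one call, with the `except` branch passing `error=e`.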


In [7]:
# 1. HEALTHCARE FACILITIES
result = ingest_csv_file(
    name="etablissement_sante",
    file_path=f"{DATA_DIR}/Etablissement de SANTE/etablissement_sante.csv",
    separator=";"
)
results.append(result)


🔄 Extract CSV: etablissement_sante
📁 File: /home/jovyan/DATA_2024/Etablissement de SANTE/etablissement_sante.csv
📖 Read: 416,665 rows, 24 columns
💾 Saved: /home/jovyan/data/bronze/csv/etablissement_sante
⏱️  Time: 7.33s
✅ etablissement_sante OK


In [8]:
# 2. SATISFACTION 2019
result = ingest_csv_file(
    name="satisfaction_esatis48h_2019",
    file_path=f"{DATA_DIR}/Satisfaction/2019/resultats-esatis48h-mco-open-data-2019.csv",
    separator=";"
)
results.append(result)


🔄 Extract CSV: satisfaction_esatis48h_2019
📁 File: /home/jovyan/DATA_2024/Satisfaction/2019/resultats-esatis48h-mco-open-data-2019.csv
📖 Read: 1,152 rows, 25 columns
💾 Saved: /home/jovyan/data/bronze/csv/satisfaction_esatis48h_2019
⏱️  Time: 1.71s
✅ satisfaction_esatis48h_2019 OK


In [9]:
# 3. DEATHS, 2019 ONLY (FILTERED)
print(f"\n{'='*80}")
print("🔄 Extract CSV: deces (FILTERED, 2019 ONLY)")
print(f"📁 File: {DATA_DIR}/DECES EN FRANCE/deces.csv")
print(f"{'='*80}")

start_time = time.time()

try:
    # Read the raw CSV
    print("📖 Reading the full CSV file...")
    df_deces_raw = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("mode", "PERMISSIVE") \
        .option("multiLine", "false") \
        .csv(f"{DATA_DIR}/DECES EN FRANCE/deces.csv")
    
    # FILTER: keep 2019 only
    # NOTE: the column is named "date_deces" (not "datdec")
    print("🔍 Filtering to 2019 deaths only...")
    df_deces_full = df_deces_raw.filter(col("date_deces").startswith("2019"))
    
    # Repartition to optimize the write
    df_deces_full = df_deces_full.repartition(10)
    
    row_count = df_deces_full.count()
    col_count = len(df_deces_full.columns)
    print(f"📊 Total 2019: {row_count:,} rows, {col_count} columns")
    print("✅ FILTERED: 2019 data only (98% reduction)")
    
    # Add ingestion metadata
    df_with_meta = df_deces_full \
        .withColumn("ingestion_timestamp", current_timestamp()) \
        .withColumn("ingestion_date", lit(datetime.now().strftime("%Y-%m-%d")))
    
    # Save to Bronze (2019 DATA only)
    output_path = f"{OUTPUT_BASE}/csv/deces_2019"
    df_with_meta.write \
        .mode("overwrite") \
        .option("compression", "snappy") \
        .parquet(output_path)
    
    elapsed = time.time() - start_time
    
    print(f"💾 Saved: {output_path}")
    print(f"⏱️  Time: {elapsed:.2f}s")
    print(f"✅ deces 2019 OK ({row_count:,} rows)")
    
    results.append({
        "source": "CSV",
        "table": "deces_2019",
        "rows": row_count,
        "cols": col_count,
        "time_sec": round(elapsed, 2),
        "status": "SUCCESS"
    })
    
except Exception as e:
    print(f"❌ ERROR: {str(e)}")
    import traceback
    traceback.print_exc()
    results.append({
        "source": "CSV",
        "table": "deces_2019",
        "rows": 0,
        "cols": 0,
        "time_sec": 0,
        "status": f"ERROR: {str(e)}"
    })


🔄 Extract CSV: deces (FILTERED, 2019 ONLY)
📁 File: /home/jovyan/DATA_2024/DECES EN FRANCE/deces.csv
📖 Reading the full CSV file...
🔍 Filtering to 2019 deaths only...
📊 Total 2019: 620,626 rows, 10 columns
✅ FILTERED: 2019 data only (98% reduction)
💾 Saved: /home/jovyan/data/bronze/csv/deces_2019
⏱️  Time: 37.80s
✅ deces 2019 OK (620,626 rows)
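
The 98% figure printed by this cell is hard-coded rather than measured. Using the 620,626 rows kept and the approximate 25M-row total quoted in this notebook, the reduction can be computed instead. The 25,000,000 value is a rounded assumption, since the exact size of `deces.csv` is not shown here, and `reduction_pct` is a hypothetical helper.

```python
def reduction_pct(kept_rows: int, total_rows: int) -> float:
    """Percentage of rows removed by a filter."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return 100.0 * (1 - kept_rows / total_rows)


KEPT_2019 = 620_626        # row count reported for the 2019 filter
TOTAL_APPROX = 25_000_000  # assumption: "25M" rounded, exact count not shown

print(f"Reduction: {reduction_pct(KEPT_2019, TOTAL_APPROX):.1f}%")  # prints: Reduction: 97.5%
```

In the notebook itself, the total would come from `df_deces_raw.count()`, at the cost of one more pass over the full file.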


In [10]:
# 4. FRENCH DEPARTMENTS (geographic reference table)
result = ingest_csv_file(
    name="departements",
    file_path=f"{DATA_DIR}/departements-francais.csv",
    separator=";"
)
results.append(result)


🔄 Extract CSV: departements
📁 File: /home/jovyan/DATA_2024/departements-francais.csv
📖 Read: 101 rows, 4 columns
💾 Saved: /home/jovyan/data/bronze/csv/departements
⏱️  Time: 1.22s
✅ departements OK
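
Because the CSV files are read straight from the file system, a missing or misspelled path only surfaces when Spark tries to read it. Below is a minimal pre-flight sketch using only the standard library. `missing_inputs` is a hypothetical helper, and the relative paths are the ones used in the cells above.

```python
from pathlib import Path

# Relative paths taken from the ingestion cells of this notebook.
CSV_INPUTS = [
    "Etablissement de SANTE/etablissement_sante.csv",
    "Satisfaction/2019/resultats-esatis48h-mco-open-data-2019.csv",
    "DECES EN FRANCE/deces.csv",
    "departements-francais.csv",
]


def missing_inputs(data_dir, relative_paths):
    """Return the relative paths that are not regular files under data_dir."""
    base = Path(data_dir)
    return [p for p in relative_paths if not (base / p).is_file()]


missing = missing_inputs("/home/jovyan/DATA_2024", CSV_INPUTS)
if missing:
    print("Missing CSV inputs:", missing)
```

Running a check like this before the first extraction cell would report all absent files at once instead of one failure per cell.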


## 📊 OVERALL SUMMARY

In [11]:
# SUMMARY
import pandas as pd

df_results = pd.DataFrame(results)
df_results

| | source | table | rows | cols | time_sec | status |
|---|---|---|---|---|---|---|
| 0 | PostgreSQL | Patient | 100000 | 16 | 9.58 | SUCCESS |
| 1 | PostgreSQL | Consultation | 1027157 | 9 | 16.74 | SUCCESS |
| 2 | PostgreSQL | Diagnostic | 15490 | 2 | 1.17 | SUCCESS |
| 3 | PostgreSQL | Professionnel_de_sante | 1048575 | 8 | 6.23 | SUCCESS |
| 4 | PostgreSQL | Mutuelle | 254 | 3 | 1.32 | SUCCESS |
| 5 | PostgreSQL | Adher | 96671 | 2 | 1.42 | SUCCESS |
| 6 | PostgreSQL | Prescription | 1003845 | 2 | 2.97 | SUCCESS |
| 7 | PostgreSQL | Medicaments | 15455 | 12 | 1.49 | SUCCESS |
| 8 | PostgreSQL | Laboratoire | 677 | 3 | 1.25 | SUCCESS |
| 9 | PostgreSQL | Salle | 201735 | 5 | 1.97 | SUCCESS |
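
Since both ingestion functions catch exceptions and record them in `status`, a failed table does not stop the notebook. As a sketch, the per-source counts can be derived from `results` and any failure surfaced explicitly. `summarize_results` is a hypothetical helper, and the sample records below are illustrative, not actual output.

```python
from collections import Counter


def summarize_results(results):
    """Count successful extractions per source; raise if any extraction failed."""
    failed = [r["table"] for r in results if r["status"] != "SUCCESS"]
    if failed:
        raise RuntimeError(f"Extraction failed for: {', '.join(failed)}")
    return Counter(r["source"] for r in results)


# Illustrative records with the same keys as the notebook's `results` list.
sample = [
    {"source": "PostgreSQL", "table": "Patient", "status": "SUCCESS"},
    {"source": "CSV", "table": "departements", "status": "SUCCESS"},
    {"source": "CSV", "table": "deces_2019", "status": "SUCCESS"},
]
for source, n in summarize_results(sample).items():
    print(f"{source}: {n} table(s)")
```

Deriving the counts this way avoids typing the number of tables and files by hand in the closing messages.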


In [12]:
# STATISTICS BY SOURCE
success = df_results[df_results['status'] == 'SUCCESS']

print("\n📊 OVERALL STATISTICS")
print("="*60)
print(f"✅ Tables extracted: {len(success)}/{len(results)}")
print(f"📊 Total rows: {success['rows'].sum():,}")
print(f"⏱️  Total time: {success['time_sec'].sum():.2f}s")

print("\n📦 Breakdown by source:")
for source in success['source'].unique():
    source_data = success[success['source'] == source]
    print(f"\n  {source}:")
    print(f"    - Tables: {len(source_data)}")
    print(f"    - Rows: {source_data['rows'].sum():,}")
    print(f"    - Time: {source_data['time_sec'].sum():.2f}s")

print("\n" + "="*60)
print(f"\n💾 Data saved to: {OUTPUT_BASE}/")
print("  📂 bronze/postgres/ - 13 original tables")
print("  📂 bronze/csv/ - 4 CSV files")
print("="*60)


📊 OVERALL STATISTICS
✅ Tables extracted: 17/17
📊 Total rows: 4,712,928
⏱️  Total time: 96.26s

📦 Breakdown by source:

  PostgreSQL:
    - Tables: 13
    - Rows: 3,674,384
    - Time: 48.20s

  CSV:
    - Tables: 4
    - Rows: 1,038,544
    - Time: 48.06s


💾 Data saved to: /home/jovyan/data/bronze/
  📂 bronze/postgres/ - 13 original tables
  📂 bronze/csv/ - 4 CSV files


---

## ✅ BRONZE LAYER - EXTRACTION COMPLETE

### 📦 Extracted data:

#### PostgreSQL (13 tables):
- ✅ 100K patients
- ✅ 1M+ consultations
- ✅ 1M+ healthcare professionals
- ✅ **82K hospitalizations** (tables AAAA + date)
- ✅ Diagnostic, Prescription, Medicaments, etc.

#### CSV (4 files read DIRECTLY):
- ✅ 416K healthcare facilities
- ✅ 1K satisfaction ratings for 2019
- ✅ **620K deaths in 2019** (FILTERED from 25M, about 98% fewer rows)
- ✅ 101 French departments

### Total: ~4.7 million rows (optimized)

### 🎯 Next step:

👉 **Notebook 02**: Transform Silver (cleaning, anonymization, formats)

**Important**: Deaths are now **filtered to 2019** for performance and for consistency with the 2019 satisfaction data. Hospitalizations come from the **AAAA + date** tables.

In [14]:
# In a Jupyter cell
df = spark.read.parquet("/home/jovyan/data/bronze/postgres/Patient")
print(f"Patients in Bronze: {df.count():,}")
df.show(5)

Patients in Bronze: 100,000
+----------+----------+--------+------+--------------------+-----------------+-----------+----+--------------------+--------------+----------+---+----------------+--------------+-----+------+--------------------+--------------+
|Id_patient|       Nom|  Prenom|  Sexe|             Adresse|            Ville|Code_postal|Pays|               EMail|           Tel|      Date|Age|        Num_Secu|Groupe_sanguin| Poid|Taille| ingestion_timestamp|ingestion_date|
+----------+----------+--------+------+--------------------+-----------------+-----------+----+--------------------+--------------+----------+---+----------------+--------------+-----+------+--------------------+--------------+
|         1|Christabel|  Tougas|female|12 rue du Faubour...|       THIONVILLE|      57100|  FR|ChristabelTougas@...|03.85.46.00.55|  4/6/1980| 41|5571905089387417|            O+| 54.3|   162|2025-10-23 18:39:...|    2025-10-23|
|         2|  Lorraine|   Lebel|female|   21 rue Jean Vilar