# SILVER LAYER - Transformation and Cleaning

**ETL flow**: Bronze (raw) → **Silver (cleaned)** → Gold (business)

**Authors**: Nejma MOUALHI | Brieuc OLIVIERI | Nicolas TAING

---

## Transformations applied

### 1. ANONYMIZATION (GDPR):
- First and last names hashed (SHA-256)
- Sensitive data removed
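
As a rough illustration of the hashing step, the sketch below reproduces in plain Python what Spark's `sha2(col, 256)` yields for a string: the hexadecimal SHA-256 digest of its UTF-8 bytes, with missing values left as missing. The helper name `sha256_hex` is hypothetical and does not appear in the notebook.

```python
import hashlib
from typing import Optional


def sha256_hex(value: Optional[str]) -> Optional[str]:
    """Return the SHA-256 hex digest of a string, or None for a missing value."""
    if value is None:
        return None  # mirrors Spark, where sha2 of NULL is NULL
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


digest = sha256_hex("Tougas")
print(len(digest))                     # 64 hex characters = 256 bits
print(sha256_hex("Tougas") == digest)  # True: same input, same digest
print(sha256_hex("tougas") == digest)  # False: hashing is case-sensitive
```

Because the digest is deterministic, the hashed columns can still be used to match records, but identical names always map to the same value. Whether a salt or secret key is required on top of this is a policy decision that the notebook does not address.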

### 2. CLEANING:
- Consistent date format (YYYY-MM-DD)
- Correct column typing
- Deduplication
- NULL values handled
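
The date normalization can be sketched without Spark as follows. This assumes, as in the cells below, that source dates are `M/d/yyyy` strings and that unparseable or missing values should become NULL rather than raise an error. `normalize_date` is a hypothetical helper used only for illustration.

```python
from datetime import datetime
from typing import Optional


def normalize_date(raw: Optional[str]) -> Optional[str]:
    """Convert 'M/d/yyyy' text to ISO 'YYYY-MM-DD'; return None if unparseable."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw.strip(), "%m/%d/%Y").date().isoformat()
    except ValueError:
        return None  # treat malformed dates as missing


print(normalize_date("4/6/1980"))    # 1980-04-06
print(normalize_date("7/25/2013"))   # 2013-07-25
print(normalize_date("not a date"))  # None
```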

### 3. VALIDATION:
- Business constraints checked
- Outlier values filtered out
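
A minimal sketch of the validation logic, in plain Python over dictionaries rather than a Spark DataFrame. The year bounds are the ones used by the consultation filter later in this notebook; the function name and the sample rows are made up for illustration. One difference to keep in mind: this sketch keeps the first row per key, whereas Spark's `dropDuplicates` gives no guarantee about which duplicate survives.

```python
from typing import Dict, Iterable, List

MIN_YEAR, MAX_YEAR = 2013, 2025  # bounds used by the consultation filter


def validate_rows(rows: Iterable[Dict], key: str = "id_consultation") -> List[Dict]:
    """Keep rows whose 'annee' is within bounds, one row per key."""
    seen = set()
    kept = []
    for row in rows:
        year = row.get("annee")
        if year is None or not (MIN_YEAR <= year <= MAX_YEAR):
            continue  # missing or out-of-range year: rejected
        if row[key] in seen:
            continue  # duplicate key: rejected
        seen.add(row[key])
        kept.append(row)
    return kept


# Illustrative rows, not taken from the dataset
sample = [
    {"id_consultation": 1, "annee": 2015},
    {"id_consultation": 1, "annee": 2015},  # duplicate
    {"id_consultation": 2, "annee": 2012},  # too old
    {"id_consultation": 3, "annee": None},  # date failed to parse
    {"id_consultation": 4, "annee": 2025},
]
print([r["id_consultation"] for r in validate_rows(sample)])  # [1, 4]
```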

In [1]:
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, sha2, when, trim, upper, 
    to_date, year, month, dayofmonth,
    regexp_replace, coalesce, lit,
    current_timestamp
)
from datetime import datetime
import time

print("Imports loaded successfully")

Imports loaded successfully


In [2]:
# Spark configuration
spark = SparkSession.builder \
    .appName("CHU_Silver_Transform_Nettoyage") \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

print(f"Spark {spark.version} started successfully")

# Paths
BRONZE_BASE = "/home/jovyan/data/bronze"
SILVER_BASE = "/home/jovyan/data/silver"

print(f"Source: {BRONZE_BASE}")
print(f"Destination: {SILVER_BASE}")

Spark 3.5.0 started successfully
Source: /home/jovyan/data/bronze
Destination: /home/jovyan/data/silver


## PART 1: Patient Anonymization

In [3]:
print("="*80)
print("TRANSFORMATION: Patient (ANONYMIZATION)")
print("="*80)

start_time = time.time()

# Read from Bronze
df_patient_bronze = spark.read.parquet(f"{BRONZE_BASE}/postgres/Patient")
print(f"Read: {df_patient_bronze.count():,} rows")

# Preview BEFORE anonymization
print("\nBEFORE anonymization:")
df_patient_bronze.select("Id_patient", "Nom", "Prenom", "Sexe", "Age", "Date").show(3, truncate=False)

# ANONYMIZATION + CLEANING
df_patient_silver = df_patient_bronze.select(
    col("Id_patient").alias("id_patient"),
    
    # ANONYMIZATION: SHA-256 hash of identifying fields
    sha2(col("Nom"), 256).alias("nom_hash"),
    sha2(col("Prenom"), 256).alias("prenom_hash"),
    
    # Demographic data (kept)
    col("Sexe").alias("sexe"),
    col("Age").cast("integer").alias("age"),
    
    # UNIFORM DATE FORMAT: M/d/yyyy → yyyy-MM-dd
    to_date(col("Date"), "M/d/yyyy").alias("date_naissance"),
    
    # Location (coarse geography, acceptable under GDPR)
    trim(upper(col("Ville"))).alias("ville"),
    col("Code_postal").alias("code_postal"),
    trim(upper(col("Pays"))).alias("pays"),
    
    # Medical information
    col("Poid").cast("double").alias("poids_kg"),
    col("Taille").cast("double").alias("taille_cm"),
    trim(upper(col("Groupe_sanguin"))).alias("groupe_sanguin"),
    
    # Contact details (hashed)
    sha2(col("Tel"), 256).alias("telephone_hash"),
    sha2(col("EMail"), 256).alias("email_hash"),
    
    # Social security number (hashed)
    sha2(col("Num_Secu"), 256).alias("num_secu_hash"),
    
    # Metadata
    col("ingestion_date"),
    current_timestamp().alias("transformation_timestamp")
).dropDuplicates(["id_patient"])

# Preview AFTER anonymization
print("\nAFTER anonymization:")
df_patient_silver.select(
    "id_patient", "nom_hash", "prenom_hash", "sexe", "age", "date_naissance", "ville"
).show(3, truncate=False)

# Save to Silver
df_patient_silver.write \
    .mode("overwrite") \
    .parquet(f"{SILVER_BASE}/patient")

elapsed = time.time() - start_time
print(f"\nSaved to: {SILVER_BASE}/patient")
print(f"Time elapsed: {elapsed:.2f}s")
print(f"{df_patient_silver.count():,} patients anonymized")

TRANSFORMATION: Patient (ANONYMIZATION)
Read: 100,000 rows

BEFORE anonymization:
+----------+----------+------+------+---+---------+
|Id_patient|Nom       |Prenom|Sexe  |Age|Date     |
+----------+----------+------+------+---+---------+
|1         |Christabel|Tougas|female|41 |4/6/1980 |
|2         |Lorraine  |Lebel |female|7  |7/25/2013|
|3         |Jolie     |Majory|female|11 |8/8/2009 |
+----------+----------+------+------+---+---------+
only showing top 3 rows


AFTER anonymization:
+----------+----------------------------------------------------------------+----------------------------------------------------------------+------+---+--------------+-----------------+
|id_patient|nom_hash                                                        |prenom_hash                                                     |sexe  |age|date_naissance|ville            |
+----------+----------------------------------------------------------------+----------------------------------------------------------------+------+---+--------------+-----------------+
[output truncated]
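
One way to double-check the result of the Patient transformation is to confirm that none of the cleartext source columns survive in the Silver schema. The sketch below is a hedged example: the set of sensitive names is taken from the columns hashed in the cell above, while the helper `find_cleartext_columns` is hypothetical. In the notebook it could be called on `df_patient_silver.columns`.

```python
from typing import Iterable, List

# Cleartext columns that the Patient transformation replaces with hashes
SENSITIVE_SOURCE_COLUMNS = {"nom", "prenom", "tel", "email", "num_secu"}


def find_cleartext_columns(columns: Iterable[str]) -> List[str]:
    """Return the column names that match a sensitive source column (case-insensitive)."""
    return [c for c in columns if c.lower() in SENSITIVE_SOURCE_COLUMNS]


silver_columns = [
    "id_patient", "nom_hash", "prenom_hash", "sexe", "age", "date_naissance",
    "ville", "code_postal", "pays", "poids_kg", "taille_cm", "groupe_sanguin",
    "telephone_hash", "email_hash", "num_secu_hash",
    "ingestion_date", "transformation_timestamp",
]
print(find_cleartext_columns(silver_columns))                # []
print(find_cleartext_columns(["Id_patient", "Nom", "EMail"]))  # ['Nom', 'EMail']
```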

## PART 2: Consultation Cleaning (Dates + Typing)

In [4]:
print("="*80)
print("TRANSFORMATION: Consultation (DATES + TYPING)")
print("="*80)

start_time = time.time()

# Read from Bronze
df_consult_bronze = spark.read.parquet(f"{BRONZE_BASE}/postgres/Consultation")
print(f"Read: {df_consult_bronze.count():,} rows")

# CLEANING + FORMATS
df_consult_silver = df_consult_bronze.select(
    col("Num_consultation").alias("id_consultation"),
    col("Id_patient").alias("id_patient"),
    col("Code_diag").alias("id_diagnostic"),
    col("Id_prof_sante").alias("id_professionnel"),
    col("Id_mut").alias("id_mutuelle"),
    
    # DATE FORMAT: M/d/yyyy → yyyy-MM-dd
    to_date(col("Date"), "M/d/yyyy").alias("date_consultation"),
    
    # Extract date components
    year(to_date(col("Date"), "M/d/yyyy")).alias("annee"),
    month(to_date(col("Date"), "M/d/yyyy")).alias("mois"),
    dayofmonth(to_date(col("Date"), "M/d/yyyy")).alias("jour"),
    
    # Start and end times
    col("Heure_debut").alias("heure_debut"),
    col("Heure_fin").alias("heure_fin"),
    
    # Reason for visit
    trim(col("Motif")).alias("motif"),
    
    # Metadata
    col("ingestion_date"),
    current_timestamp().alias("transformation_timestamp")
).filter(
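    # VALIDATION: keep plausible years only; rows whose date failed to parse
    # (NULL annee) are dropped as well, since the comparison evaluates to NULL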
    (col("annee") >= 2013) & (col("annee") <= 2025)
).dropDuplicates(["id_consultation"])

print("\nCleaned data:")
df_consult_silver.select(
    "id_consultation", "date_consultation", "annee", "mois", "heure_debut", "motif"
).show(5)

# Save to Silver
df_consult_silver.write \
    .mode("overwrite") \
    .parquet(f"{SILVER_BASE}/consultation")

elapsed = time.time() - start_time
print(f"\nSaved to: {SILVER_BASE}/consultation")
print(f"Time elapsed: {elapsed:.2f}s")
print(f"{df_consult_silver.count():,} consultations cleaned")

TRANSFORMATION: Consultation (DATES + TYPING)
Read: 1,027,157 rows

Cleaned data:
+---------------+-----------------+-----+----+-------------------+---------------+
|id_consultation|date_consultation|annee|mois|        heure_debut|          motif|
+---------------+-----------------+-----+----+-------------------+---------------+
|     1059023437|       2015-06-20| 2015|   6|1970-01-01 13:00:00|   Consultation|
|     1059023446|       2015-06-20| 2015|   6|1970-01-01 08:00:00|   Consultation|
|     1059023466|       2015-06-20| 2015|   6|1970-01-01 08:00:00|Soins dentaires|
|     1059023468|       2015-06-20| 2015|   6|1970-01-01 13:00:00|   Consultation|
|     1059023498|       2015-06-20| 2015|   6|1970-01-01 13:00:00|   Consultation|
+---------------+-----------------+-----+----+-------------------+---------------+
only showing top 5 rows


Saved to: /home/jovyan/data/silver/consultation
Time elapsed: 8.05s
1,027,157 consultations cleaned
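
In the output above, `heure_debut` is displayed as a full timestamp dated 1970-01-01, which suggests that only the time of day is meaningful. The sketch below shows one possible way to keep just that part, in plain Python; `time_of_day` is a hypothetical helper. In Spark, a similar effect could presumably be obtained with `date_format(col("heure_debut"), "HH:mm")`, but the notebook does not do this.

```python
from datetime import datetime


def time_of_day(ts: str) -> str:
    """Extract 'HH:MM' from a 'YYYY-MM-DD HH:MM:SS' timestamp string."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%H:%M")


print(time_of_day("1970-01-01 13:00:00"))  # 13:00
print(time_of_day("1970-01-01 08:00:00"))  # 08:00
```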


## SILVER LAYER SUMMARY

In [5]:
# Final check
print("\n" + "="*80)
print("SILVER LAYER SUMMARY")
print("="*80)

silver_tables = [
    "patient",
    "consultation",
    "etablissement_sante",
    "satisfaction_2019",
    "deces_2019",
    "diagnostic",
    "professionnel_de_sante",
    "mutuelle",
    "medicaments",
    "laboratoire",
    "salle",
    "specialites"
]

total_rows = 0
for table in silver_tables:
    try:
        df = spark.read.parquet(f"{SILVER_BASE}/{table}")
        count = df.count()
        total_rows += count
        print(f"  {table:30} {count:>10,} rows")
    except Exception as e:
        # Report the failure type instead of silently discarding the exception
        print(f"  {table:30} ERROR ({type(e).__name__})")

print("="*80)
print(f"\nTOTAL SILVER: {total_rows:,} rows")
print(f"Storage: {SILVER_BASE}/")
print("="*80)


SILVER LAYER SUMMARY
  patient                           100,000 rows
  consultation                    1,027,157 rows
  etablissement_sante                72,017 rows
  satisfaction_2019                       8 rows
  deces_2019                        620,625 rows
  diagnostic                         15,490 rows
  professionnel_de_sante          1,048,575 rows
  mutuelle                              254 rows
  medicaments                        15,455 rows
  laboratoire                           677 rows
  salle                             201,735 rows
  specialites                            93 rows

TOTAL SILVER: 3,102,086 rows
Storage: /home/jovyan/data/silver/
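
The summary figures can be cross-checked with a few lines of plain Python. The counts below are copied from the output above. The second check is an assumption on our part rather than something the notebook states: a row count of exactly 1,048,575 equals the Excel sheet limit (1,048,576 rows) minus one header row, which may indicate that a source file was truncated during a spreadsheet export. `suspicious_tables` is a hypothetical helper.

```python
from typing import Dict, List

EXCEL_MAX_DATA_ROWS = 1_048_576 - 1  # Excel sheet row limit minus one header row

# Row counts copied from the summary output
silver_counts: Dict[str, int] = {
    "patient": 100_000,
    "consultation": 1_027_157,
    "etablissement_sante": 72_017,
    "satisfaction_2019": 8,
    "deces_2019": 620_625,
    "diagnostic": 15_490,
    "professionnel_de_sante": 1_048_575,
    "mutuelle": 254,
    "medicaments": 15_455,
    "laboratoire": 677,
    "salle": 201_735,
    "specialites": 93,
}


def suspicious_tables(counts: Dict[str, int]) -> List[str]:
    """Return tables whose row count matches the spreadsheet row limit exactly."""
    return [name for name, n in counts.items() if n == EXCEL_MAX_DATA_ROWS]


print(f"{sum(silver_counts.values()):,}")  # 3,102,086
print(suspicious_tables(silver_counts))    # ['professionnel_de_sante']
```

The total matches the figure printed by the notebook. The flagged table is only a candidate for closer inspection of its Bronze source, not a confirmed defect.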
