# Bronze Data Exploration

This notebook explores the **bronze layer** data after initial cleaning.

**Bronze Layer Characteristics:**
- Cleaned raw data
- **Proper column names assigned** (official FINESS structure)
- Standardized column names (lowercase, no special chars)
- UTF-8 encoding
- Duplicates removed
- All files in CSV format
- Ready for transformation
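
The naming rule in the list above (lowercase, no special characters) could be implemented roughly as follows. This is a minimal sketch, not the project's actual cleaning code: `standardize_column_name` is a hypothetical helper, the raw header strings are invented for illustration, and stripping accents and a leading BOM is an assumption inferred from the column names shown later in this notebook.

```python
import re
import unicodedata


def standardize_column_name(name: str) -> str:
    """Hypothetical helper: lowercase, ASCII-only, underscore-separated."""
    # Drop a byte-order mark and surrounding whitespace.
    name = name.replace("\ufeff", "").strip()
    # Decompose accented characters, then drop the combining marks.
    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()


# Invented raw headers, chosen to resemble a French source file.
raw_names = ["\ufeffCode démarche", "Date de décision", "Décision de la CCES"]
clean_names = [standardize_column_name(n) for n in raw_names]
print(clean_names)
```

Applied to a DataFrame, the same helper could be passed to `df.rename(columns=standardize_column_name)`.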

In [10]:
import pandas as pd
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', None)

# Define paths
BRONZE_PATH = Path('../data/bronze/2024')
print(f"Bronze data path: {BRONZE_PATH}")
print(f"Files: {[f.name for f in BRONZE_PATH.glob('*.csv')]}")

Bronze data path: ../data/bronze/2024
Files: ['health_metrics_clean.csv', 'has_etab_geo_clean.csv', 'has_demarche_clean.csv', 'finess_clean.csv']


## 1. FINESS Clean (Establishments)

The columns now carry the official FINESS names published on data.gouv.fr.

In [11]:
# Load cleaned FINESS
df_finess_bronze = pd.read_csv(BRONZE_PATH / 'finess_clean.csv', low_memory=False)

print(f"Shape: {df_finess_bronze.shape}")
print("\nColumn Names (Official FINESS Structure):")
print("="*70)
column_descriptions = {
    'structureet': 'Structure type',
    'finess_et': 'FINESS Establishment ID',
    'finess_ej': 'FINESS Legal Entity ID',
    'rs': 'Short name (Raison Sociale)',
    'rslongue': 'Long name (Raison Sociale Longue)',
    'numvoie': 'Street number',
    'typvoie': 'Street type (rue, avenue, etc.)',
    'voie': 'Street name',
    'lieuditbp': 'Place/BP',
    'ligneacheminement': 'Postal routing (CP + ville)',
    'telephone': 'Phone number',
    'siret': 'SIRET number',
    'libsph': 'SPH label (category detail)',
    'dateouv': 'Opening date',
    'datemaj': 'Last update date'
}

for i, col in enumerate(df_finess_bronze.columns):
    desc = column_descriptions.get(col, '')
    if desc:
        print(f"{i:2d}. {col:20s} - {desc}")
    else:
        print(f"{i:2d}. {col}")

Shape: (204915, 32)

Column Names (Official FINESS Structure):
 0. structureet          - Structure type
 1. finess_et            - FINESS Establishment ID
 2. finess_ej            - FINESS Legal Entity ID
 3. rs                   - Short name (Raison Sociale)
 4. rslongue             - Long name (Raison Sociale Longue)
 5. complrs
 6. compldistrib
 7. numvoie              - Street number
 8. typvoie              - Street type (rue, avenue, etc.)
 9. voie                 - Street name
10. compvoie
11. lieuditbp            - Place/BP
12. commune
13. departement
14. libdepartement
15. ligneacheminement    - Postal routing (CP + ville)
16. telephone            - Phone number
17. telecopie
18. categetab
19. libcategetab
20. categagretab
21. libcategagretab
22. siret                - SIRET number
23. codeape
24. codemft
25. libmft
26. codesph
27. libsph               - SPH label (category detail)
28. dateouv              - Opening date
29. dateautor
30. datemaj              - Last update date

In [12]:
# Show key columns
key_cols = ['finess_et', 'rs', 'rslongue', 'numvoie', 'typvoie', 'voie', 'ligneacheminement', 'siret', 'libsph']
print("Sample of key columns:")
df_finess_bronze[key_cols].head(10)

Sample of key columns:


| | finess_et | rs | rslongue | numvoie | typvoie | voie | ligneacheminement | siret | libsph |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000032 | CH BUGEY SUD | CENTRE HOSPITALIER BUGEY SUD | 700.0 | AV | DE NARVIK | 01300 BELLEY | 26010000000000.0 | Etablissement public de santé |
| 1 | 10000065 | CH DE TREVOUX - MONTPENSIER | CENTRE HOSPITALIER DE TREVOUX - MONTPENSIER | 14.0 | R | DE L'HOPITAL | 01606 TREVOUX CEDEX | 26010030000000.0 | Etablissement public de santé |
| 2 | 10000081 | CH DU PAYS DE GEX | CENTRE HOSPITALIER DU PAYS DE GEX | 160.0 | R | MARC PANISSOD | 01174 GEX CEDEX | 26010010000000.0 | Etablissement public de santé |
| 3 | 10000099 | CH DE MEXIMIEUX | CENTRE HOSPITALIER DE MEXIMIEUX | 13.0 | AV | DU DOCTEUR BOYER | 01800 MEXIMIEUX | 26010010000000.0 | Etablissement public de santé |
| 4 | 10000107 | CH DE PONT DE VAUX | CENTRE HOSPITALIER DE PONT DE VAUX | | CHE | DES NIVRES | 01190 PONT DE VAUX | 26010020000000.0 | Etablissement public de santé |
| 5 | 10000115 | CHI AIN VAL DE SAONE - PONT VEYLE | CH INTERCOMMUNAL AIN VAL DE SAONE - PONT DE VEYLE | | IMP | DE LA BISCUITERIE | 01290 PONT DE VEYLE | | Etablissement public de santé |
| 6 | 10000131 | CHI AIN VAL DE SAONE - THOISSEY | CH INTERCOMMUNAL AIN VAL DE SAONE - THOISSEY | 11.0 | R | DE L HOPITAL | 01140 THOISSEY | 20003000000000.0 | Etablissement public de santé |
| 7 | 10000180 | CH PUBLIC HAUTEVILLE - UNITE ESPERANCE | CENTRE HOSPITALIER PUBLIC DE HAUTEVILLE - UNIT... | | AV | FELIX MANGINI | 01110 PLATEAU D HAUTEVILLE | 26011020000000.0 | Etablissement public de santé |
| 8 | 10000198 | CH PUBLIC HAUTEVILLE - UNITE INTERDEPT | CENTRE HOSPITALIER PUBLIC DE HAUTEVILLE - UNIT... | | AV | FELIX MANGINI | 01110 PLATEAU D HAUTEVILLE | 26011020000000.0 | Etablissement public de santé |
| 9 | 10000214 | CH PUBLIC HAUTEVILLE - UNITE ALBARINE | CENTRE HOSPITALIER PUBLIC DE HAUTEVILLE - UNIT... | | AV | FELIX MANGINI | 01110 PLATEAU D HAUTEVILLE | 26011020000000.0 | Etablissement public de santé |
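
In the sample above, `finess_et` appears without a leading zero (`10000032`, although FINESS numbers are 9 characters long) and `siret` and `numvoie` appear as floats. That pattern suggests pandas inferred numeric types when the file was loaded. The sketch below shows one possible way to keep identifiers as text by passing `dtype` to `pd.read_csv`. The two-row CSV is made up for illustration (the names and SIRET values are placeholders, not real records), and whether the bronze file itself still contains the leading zeros is an assumption to verify.

```python
import io

import pandas as pd

# Made-up two-row extract: identifiers are placeholders, not real records.
csv_text = (
    "finess_et,rs,numvoie,siret\n"
    "010000032,ETABLISSEMENT A,700,12345678900012\n"
    "010000065,ETABLISSEMENT B,,\n"
)

# Default type inference turns the identifiers into numbers.
df_inferred = pd.read_csv(io.StringIO(csv_text))

# Reading the same columns as strings keeps every character as written.
id_cols = ["finess_et", "numvoie", "siret"]
df_text = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in id_cols})

print(df_inferred.dtypes)
print(df_text[["finess_et", "siret"]])
```

With string identifiers, joins against other sources do not depend on how each file happened to be parsed.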


In [13]:
# Data quality check
print("Data Quality:")
print(f"  - Total records: {len(df_finess_bronze):,}")
print(f"  - Duplicates: {df_finess_bronze.duplicated().sum()}")
print(f"  - Memory usage: {df_finess_bronze.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nMissing values in key columns:")
for col in ['finess_et', 'rs', 'rslongue', 'siret', 'ligneacheminement', 'libsph']:
    missing = df_finess_bronze[col].isna().sum()
    pct = (missing / len(df_finess_bronze)) * 100
    print(f"  {col:20s}: {missing:6,} ({pct:5.2f}%)")

Data Quality:
  - Total records: 204,915
  - Duplicates: 0
  - Memory usage: 238.19 MB

Missing values in key columns:
  finess_et           :      0 ( 0.00%)
  rs                  :      0 ( 0.00%)
  rslongue            : 20,674 (10.09%)
  siret               : 114,386 (55.82%)
  ligneacheminement   : 102,458 (50.00%)
  libsph              : 193,062 (94.22%)
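
The per-column loop above can also be written as one vectorized summary, which is easier to sort and reuse. This is a sketch on a small made-up frame standing in for `df_finess_bronze`; the `missing_summary` name and the sample values are illustrative only.

```python
import pandas as pd

# Small made-up frame standing in for df_finess_bronze.
df = pd.DataFrame(
    {
        "finess_et": ["010000032", "010000065", "010000081", "010000099"],
        "siret": ["12345678900012", None, None, "12345678900038"],
        "libsph": [None, None, None, "Etablissement public de santé"],
    }
)

# isna().sum() counts missing cells; isna().mean() gives the missing share.
missing_summary = pd.DataFrame(
    {
        "missing": df.isna().sum(),
        "pct": (df.isna().mean() * 100).round(2),
    }
).sort_values("pct", ascending=False)

print(missing_summary)
```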


## 2. HAS Demarche Clean (Certification Process)

In [14]:
# Load cleaned HAS demarche
df_has_demarche_bronze = pd.read_csv(BRONZE_PATH / 'has_demarche_clean.csv')

print(f"Shape: {df_has_demarche_bronze.shape}")
print(f"\nColumns: {list(df_has_demarche_bronze.columns)}")
print("\nNote: Column names are now standardized (lowercase, no BOM)")
print("\nSample:")
df_has_demarche_bronze.head()

Shape: (2348, 6)

Columns: ['code_demarche', 'annee_visite', 'mois_visite', 'date_deb_visite', 'date_de_decision', 'decision_de_la_cces']

Note: Column names are now standardized (lowercase, no BOM)

Sample:


| | code_demarche | annee_visite | mois_visite | date_deb_visite | date_de_decision | decision_de_la_cces |
|---|---|---|---|---|---|---|
| 0 | 30001 | 2021 | 09-Septembre | 21/09/2021 | 10/02/2022 | Certifié |
| 1 | 30002 | 2023 | 01-Janvier | 24/01/2023 | 08/03/2023 | Certifié avec mention |
| 2 | 30003 | 2022 | 01-Janvier | 19/01/2022 | 31/03/2022 | Certifié |
| 3 | 30004 | 2021 | 09-Septembre | 28/09/2021 | 14/12/2021 | Certifié |
| 4 | 30005 | 2021 | 11-Novembre | 29/11/2021 | 31/03/2022 | Certifié |
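
The sample shows dates stored as `dd/mm/yyyy` text and `mois_visite` values that combine a month number with a label (`09-Septembre`). A later step might parse them as sketched below, using the first three rows of the sample. The derived column names `mois_visite_num` and `delai_decision_jours` are hypothetical, and the explicit `format` assumes every row follows the day-first pattern seen here.

```python
import pandas as pd

# First three rows of the sample shown above.
df = pd.DataFrame(
    {
        "code_demarche": [30001, 30002, 30003],
        "mois_visite": ["09-Septembre", "01-Janvier", "01-Janvier"],
        "date_deb_visite": ["21/09/2021", "24/01/2023", "19/01/2022"],
        "date_de_decision": ["10/02/2022", "08/03/2023", "31/03/2022"],
    }
)

# An explicit day-first format avoids ambiguous dates such as 10/02/2022.
for col in ["date_deb_visite", "date_de_decision"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

# Hypothetical derived columns: month number and days from visit to decision.
df["mois_visite_num"] = df["mois_visite"].str.split("-", n=1).str[0].astype(int)
df["delai_decision_jours"] = (df["date_de_decision"] - df["date_deb_visite"]).dt.days

print(df[["code_demarche", "mois_visite_num", "delai_decision_jours"]])
```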


## 3. HAS Establishment Geography Clean

In [15]:
# Load cleaned HAS etab geo
df_has_geo_bronze = pd.read_csv(BRONZE_PATH / 'has_etab_geo_clean.csv')

print(f"Shape: {df_has_geo_bronze.shape}")
print(f"\nColumns: {list(df_has_geo_bronze.columns)}")
print("\nSample:")
df_has_geo_bronze.head()

Shape: (8211, 5)

Columns: ['code_demarche', 'finess_ej', 'finess_eg', 'rs_eg', 'site_principal']

Sample:


| | code_demarche | finess_ej | finess_eg | rs_eg | site_principal |
|---|---|---|---|---|---|
| 0 | 30001 | 350000402 | 350002176 | CLINIQUE DE L'ESPERANCE | True |
| 1 | 30002 | 340000272 | 340024314 | CLINIQUE SAINT JEAN SUD DE FRANCE | True |
| 2 | 30003 | 350002291 | 350000410 | CENTRE HOSPITALIER DE JANZE | True |
| 3 | 30004 | 530000249 | 530000124 | CLINIQUE NOTRE DAME DE PRITZ | True |
| 4 | 30004 | 530000249 | 530010438 | CENTRE MEDIPSY - CLINIQUE NOTRE DAME DE PRITZ | False |
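
`code_demarche` appears in both HAS files, and the sample shows one certification process (30004) covering two sites. The sketch below joins a few sample rows on that key. It assumes `code_demarche` is unique in the demarche table, which `validate="one_to_many"` checks at merge time; that uniqueness is not confirmed anywhere in this notebook.

```python
import pandas as pd

# Rows taken from the two samples shown in this notebook.
demarche = pd.DataFrame(
    {
        "code_demarche": [30003, 30004],
        "decision_de_la_cces": ["Certifié", "Certifié"],
    }
)
geo = pd.DataFrame(
    {
        "code_demarche": [30003, 30004, 30004],
        "finess_eg": [350000410, 530000124, 530010438],
        "site_principal": [True, True, False],
    }
)

# validate raises MergeError if code_demarche repeats on the left side.
merged = demarche.merge(geo, on="code_demarche", how="left", validate="one_to_many")

# Number of geographic sites attached to each certification process.
sites_per_demarche = merged.groupby("code_demarche")["finess_eg"].nunique()

# Keep only the main site of each process.
main_sites = merged[merged["site_principal"]]

print(sites_per_demarche)
```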


## 4. Health Metrics Clean (IQSS)

In [16]:
# Load cleaned health metrics (now CSV instead of Excel)
df_metrics_bronze = pd.read_csv(BRONZE_PATH / 'health_metrics_clean.csv')

print(f"Shape: {df_metrics_bronze.shape}")
print(f"\nColumns ({len(df_metrics_bronze.columns)}):")
for col in df_metrics_bronze.columns:
    print(f"  - {col}")

print("\nNote: Columns are normalized (lowercase, underscores)")
print("\nSample:")
df_metrics_bronze.head()

Shape: (1248, 27)

Columns (27):
  - finess
  - rs_finess
  - finess_geo
  - rs_finess_geo
  - region
  - type
  - participation
  - depot
  - nb_rep_score_all_ssr_ajust
  - score_all_ssr_ajust
  - classement
  - evolution
  - score_accueil_ssr_ajust
  - nb_rep_score_accueil_ssr_ajust
  - score_pec_ssr_ajust
  - nb_rep_score_pec_ssr_ajust
  - score_lieu_ssr_ajust
  - nb_rep_score_lieu_ssr_ajust
  - score_repas_ssr_ajust
  - nb_rep_score_repas_ssr_ajust
  - score_sortie_ssr_ajust
  - nb_rep_score_sortie_ssr_ajust
  - score_all_ssr_ajust_dp
  - taux_reco_brut
  - nb_reco_brut
  - score_ajust_esatis_region
  - score_ajust_esatis_type

Note: Columns are normalized (lowercase, underscores)

Sample:


| | finess | rs_finess | finess_geo | rs_finess_geo | region | type | participation | depot | nb_rep_score_all_ssr_ajust | score_all_ssr_ajust | classement | evolution | score_accueil_ssr_ajust | nb_rep_score_accueil_ssr_ajust | score_pec_ssr_ajust | nb_rep_score_pec_ssr_ajust | score_lieu_ssr_ajust | nb_rep_score_lieu_ssr_ajust | score_repas_ssr_ajust | nb_rep_score_repas_ssr_ajust | score_sortie_ssr_ajust | nb_rep_score_sortie_ssr_ajust | score_all_ssr_ajust_dp | taux_reco_brut | nb_reco_brut | score_ajust_esatis_region | score_ajust_esatis_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10780062 | CH DOCTEUR RECAMIER | 10000032 | CH BUGEY SUD | Auvergne-Rhône-Alpes | Centre Hospitaliers | Facultatif | Oui | | | DI | | | | | | | | | | | | | | | | |
| 1 | 10780112 | CH DU PAYS DE GEX | 10000081 | CH DU PAYS DE GEX | Auvergne-Rhône-Alpes | Centre Hospitaliers | Facultatif | Oui | | | DI | | | | | | | | | | | | | | | | |
| 2 | 10007987 | CH PUBLIC HAUTEVILLE | 10000180 | CH PUBLIC HAUTEVILLE - UNITE ESPERANCE | Auvergne-Rhône-Alpes | Centre Hospitaliers | Facultatif | Oui | 65.0 | 81.46 | A | Stable | 91.18 | 65.0 | 84.76 | 65.0 | 75.24 | 65.0 | 79.62 | 65.0 | 76.0 | 65.0 | 81.0 | 81.5 | 65.0 | 76.82 | 78.29 |
| 3 | 10007987 | CH PUBLIC HAUTEVILLE | 10000198 | CH PUBLIC HAUTEVILLE - UNITE INTERDEPT | Auvergne-Rhône-Alpes | Centre Hospitaliers | Obligatoire | Oui | 203.0 | 80.05 | A | Stable | 88.52 | 203.0 | 83.54 | 203.0 | 79.02 | 203.0 | 68.35 | 203.0 | 74.56 | 203.0 | 80.0 | 81.7 | 202.0 | 76.82 | 78.29 |
| 4 | 10007987 | CH PUBLIC HAUTEVILLE | 10000214 | CH PUBLIC HAUTEVILLE - UNITE ALBARINE | Auvergne-Rhône-Alpes | Centre Hospitaliers | Obligatoire | Oui | 68.0 | 82.46 | A | Stable | 89.09 | 68.0 | 84.39 | 68.0 | 85.1 | 68.0 | 77.22 | 68.0 | 73.54 | 68.0 | 82.0 | 77.6 | 67.0 | 76.82 | 78.29 |


In [17]:
# Check data types
print("Data types:")
print(df_metrics_bronze.dtypes)

Data types:
finess                        object
rs_finess                     object
finess_geo                    object
rs_finess_geo                 object
region                        object
                              ...   
score_all_ssr_ajust_dp       float64
taux_reco_brut               float64
nb_reco_brut                 float64
score_ajust_esatis_region    float64
score_ajust_esatis_type      float64
Length: 27, dtype: object
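
The `finess` and `finess_geo` columns are `object` here, while `finess_et` in the FINESS sample looks numeric, so a direct join between the two sources could fail to match on type alone. One way to align them is sketched below: cast both keys to 9-character, zero-padded strings before merging. `normalize_finess` and `finess_key` are hypothetical names, and the rows reuse values from the samples in this notebook.

```python
import pandas as pd


def normalize_finess(codes: pd.Series) -> pd.Series:
    """Hypothetical helper: FINESS codes as 9-character, zero-padded strings."""
    return codes.astype("string").str.strip().str.zfill(9)


# finess_et as integers (as displayed in section 1), finess_geo as text.
finess_side = pd.DataFrame(
    {"finess_et": [10000032, 10000081], "rs": ["CH BUGEY SUD", "CH DU PAYS DE GEX"]}
)
metrics_side = pd.DataFrame(
    {"finess_geo": ["10000032", "10000081"], "classement": ["DI", "DI"]}
)

# Build the same string key on both sides, then join on it.
finess_side["finess_key"] = normalize_finess(finess_side["finess_et"])
metrics_side["finess_key"] = normalize_finess(metrics_side["finess_geo"])
joined = metrics_side.merge(finess_side, on="finess_key", how="left")

print(joined[["finess_key", "rs", "classement"]])
```

Padding only restores leading zeros. It would not repair codes that lost other characters, so loading the identifiers as strings in the first place remains the safer option.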


## Comparison: Raw vs Bronze

**Major Improvements in Bronze Layer:**

### FINESS Data
- ✓ **Proper column names assigned** (32 official columns from data.gouv.fr)
  - Before: Unnamed columns like `Unnamed: 5`, `Unnamed: 6`
  - After: `complrs`, `compldistrib`, etc.
- ✓ Name-based access instead of positional indexing
  - Before: `df.iloc[:, 1]` 
  - After: `df['finess_et']`
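
The move from positional to name-based access can be illustrated as follows. This is a sketch under stated assumptions, not the actual cleaning script: the two-row input is made up, it is read with `header=None`, and only the first four names from the section 1 listing are used. The raw file's real separator and header layout are not shown in this notebook.

```python
import io

import pandas as pd

# Made-up headerless extract; values and separator are placeholders.
raw_text = (
    "S1,010000032,010000001,ETABLISSEMENT A\n"
    "S1,010000065,010000002,ETABLISSEMENT B\n"
)

# First four official names, as listed in section 1.
first_columns = ["structureet", "finess_et", "finess_ej", "rs"]

# Without names, columns can only be reached by position.
df_raw = pd.read_csv(io.StringIO(raw_text), header=None, dtype=str)

# With names supplied at load time, the same data is reachable by label.
df_named = pd.read_csv(
    io.StringIO(raw_text), header=None, names=first_columns, dtype=str
)

print(df_raw.iloc[:, 1].tolist())
print(df_named["finess_et"].tolist())
```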

### All Data Sources
- ✓ All files in CSV format (easier to work with)
- ✓ Column names standardized (lowercase, no special characters)
- ✓ UTF-8 encoding consistent across all files
- ✓ Duplicates removed
- ✓ Categorical values cleaned (numeric prefixes removed; note that `mois_visite` in the HAS demarche sample still shows values such as `09-Septembre`)
- ✓ Numeric columns properly typed
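
Some of the properties in this list can be asserted automatically before moving on. The function below is a hypothetical sketch (`check_bronze_invariants` is not part of the project as shown here) that checks two of them, standardized column names and absence of duplicate rows, and returns a list of problems instead of raising.

```python
import re

import pandas as pd


def check_bronze_invariants(df: pd.DataFrame) -> list[str]:
    """Hypothetical check: return a description of each violated property."""
    problems = []
    # Column names should contain only lowercase letters, digits, underscores.
    bad_names = [c for c in df.columns if not re.fullmatch(r"[a-z0-9_]+", str(c))]
    if bad_names:
        problems.append(f"non-standard column names: {bad_names}")
    # Fully identical rows should already have been removed.
    n_dup = int(df.duplicated().sum())
    if n_dup:
        problems.append(f"{n_dup} duplicated row(s)")
    return problems


good = pd.DataFrame({"code_demarche": [30001, 30002], "annee_visite": [2021, 2023]})
bad = pd.DataFrame({"Code Démarche": [30001, 30001], "annee_visite": [2021, 2021]})

print(check_bronze_invariants(good))
print(check_bronze_invariants(bad))
```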

**Next Step:** Run the processing pipeline to create the silver layer:
```bash
python scripts/run_processing.py --year 2024
```