# Raw Data Exploration

This notebook explores the **raw layer** data directly as downloaded from external sources.

**Raw Layer Characteristics:**
- Data as downloaded from APIs/sources
- May have encoding issues
- **FINESS has no proper column headers** (needs to be fixed)
- Inconsistent column names
- Potential duplicates
- Mixed formats (CSV, Excel)

In [17]:
import pandas as pd
from pathlib import Path
import warnings
# Caution: a blanket 'ignore' may also hide the warnings requested by on_bad_lines='warn' below
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', None)

# Define paths
RAW_PATH = Path('../data/raw/2024')
print(f"Raw data path: {RAW_PATH}")

Raw data path: ../data/raw/2024


## 1. FINESS Data (Establishments)

### ⚠️ Important: FINESS Column Header Issue

The raw FINESS file **does not have proper column headers**:
- Row 1: Metadata (finess, etalab, 93, date)
- Row 2+: Actual data
- No header row defining column names

**Solution**: In the cleaning stage (bronze layer), we assign official column names based on the data.gouv.fr FINESS structure.

In [None]:
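# Load FINESS raw data. With header=1 pandas skips the metadata line but
# promotes the first data record to column names, so the headers below are wrong.
df_finess_raw = pd.read_csv(
    RAW_PATH / 'finess.csv',
    sep=';',
    encoding='utf-8',
    header=1,  # Line 0 (metadata) is dropped; line 1 (first data record) becomes the header
    low_memory=False,
    on_bad_lines='warn',
    nrows=5  # Load only 5 rows for exploration
)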

print(f"Shape: {df_finess_raw.shape}")
print(f"\nColumn Names (Issues to Fix):")
print("="*70)

# Official FINESS column names that should be used
finess_columns = [
    'structureet',           # 0: Structure type
    'finess_et',             # 1: FINESS Establishment ID
    'finess_ej',             # 2: FINESS Legal Entity ID
    'rs',                    # 3: Short name (Raison Sociale)
    'rslongue',              # 4: Long name (Raison Sociale Longue)
    'complrs',               # 5: Complement RS
    'compldistrib',          # 6: Complement distribution
    'numvoie',               # 7: Street number
    'typvoie',               # 8: Street type (rue, avenue, etc.)
    'voie',                  # 9: Street name
    'compvoie',              # 10: Street complement
    'lieuditbp',             # 11: Place/BP
    'commune',               # 12: Municipality code
    'departement',           # 13: Department code
    'libdepartement',        # 14: Department name
    'ligneacheminement',     # 15: Postal routing line (CP + ville)
    'telephone',             # 16: Phone number
    'telecopie',             # 17: Fax number
    'categetab',             # 18: Establishment category code
    'libcategetab',          # 19: Establishment category name
    'categagretab',          # 20: Aggregated category code
    'libcategagretab',       # 21: Aggregated category name
    'siret',                 # 22: SIRET number
    'codeape',               # 23: APE code
    'codemft',               # 24: MFT code
    'libmft',                # 25: MFT label
    'codesph',               # 26: SPH code
    'libsph',                # 27: SPH label (category detail)
    'dateouv',               # 28: Opening date
    'dateautor',             # 29: Authorization date
    'datemaj',               # 30: Last update date
    'numuai'                 # 31: UAI number
]

# Create column descriptions dict
column_descriptions = {
    'structureet': 'Structure type',
    'finess_et': 'FINESS Establishment ID',
    'finess_ej': 'FINESS Legal Entity ID',
    'rs': 'Short name (Raison Sociale)',
    'rslongue': 'Long name (Raison Sociale Longue)',
    'numvoie': 'Street number',
    'typvoie': 'Street type (rue, avenue, etc.)',
    'voie': 'Street name',
    'lieuditbp': 'Place/BP',
    'ligneacheminement': 'Postal routing (CP + ville)',
    'telephone': 'Phone number',
    'siret': 'SIRET number',
    'libsph': 'SPH label (category detail)',
    'dateouv': 'Opening date',
    'datemaj': 'Last update date'
}

# Show current vs should be
for i, (current_col, should_be) in enumerate(zip(df_finess_raw.columns, finess_columns)):
    desc = column_descriptions.get(should_be, '')
    if desc:
        print(f"{i:2d}. '{current_col:30s}' → '{should_be:20s}' - {desc}")
    else:
        print(f"{i:2d}. '{current_col:30s}' → '{should_be}'")

### Official FINESS Column Structure

Based on data.gouv.fr documentation, here's what these columns should be:

| Index | Should Be | Description |
|-------|-----------|-------------|
| 0 | structureet | Structure type |
| 1 | finess_et | **FINESS Establishment ID** |
| 2 | finess_ej | FINESS Legal Entity ID |
| 3 | rs | Short name |
| 4 | rslongue | **Long name** (preferred) |
| 7 | numvoie | Street number |
| 8 | typvoie | Street type (rue, av, etc.) |
| 9 | voie | Street name |
| 11 | lieuditbp | Place/BP |
| 15 | ligneacheminement | **Postal code + city** |
| 22 | siret | **SIRET number** |
| 27 | libsph | **Category description** |

✅ **This is fixed in the bronze layer!**
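
A minimal sketch of how such a fix might look at read time: skip the metadata line, tell pandas there is no header row, and pass the names explicitly. The in-memory sample and the three-entry name list are illustrative stand-ins (the real file needs all 32 names), and the actual bronze-layer code may differ.

```python
import io

import pandas as pd

# Illustrative stand-in for the raw file: one metadata line, then data rows.
sample = io.StringIO(
    "finess;etalab;93;date\n"
    "structureet;010000024;010780054\n"
    "structureet;010000032;010780062\n"
)

# Shortened, hypothetical name list; the real file needs all 32 names.
names = ["structureet", "finess_et", "finess_ej"]

df = pd.read_csv(
    sample,
    sep=";",
    header=None,  # the file has no header row
    skiprows=1,   # drop the metadata line
    names=names,  # assign the official names while reading
)

print(df.columns.tolist())
print(f"Rows kept: {len(df)}")
```

Unlike `header=1`, this keeps the first establishment as a data row instead of consuming it as column names.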

In [19]:
# Show how the data looks with the wrong headers
print("Sample of raw data (headers taken from the first data record):")
df_finess_raw.head(2)

Sample of raw data (headers taken from the first data record):


Unnamed: 0,structureet,010000024,010780054,CH DE FLEYRIAT,CENTRE HOSPITALIER DE BOURG-EN-BRESSE FLEYRIAT,Unnamed: 5,Unnamed: 6,900,RTE,DE PARIS,Unnamed: 10,Unnamed: 11,451,01,AIN,01440 VIRIAT,0474454647,0474454114,355,Centre Hospitalier (C.H.),1102,Centres Hospitaliers,26010004500012,8610Z,03,ARS établissements Publics de santé dotation globale,1,Etablissement public de santé,1979-02-13,1979-02-13.1,2020-02-04,Unnamed: 31
0,structureet,10000032,10780062,CH BUGEY SUD,CENTRE HOSPITALIER BUGEY SUD,,,700.0,AV,DE NARVIK,,BP 139,34,1,AIN,01300 BELLEY,479425959,479425996,355,Centre Hospitalier (C.H.),1102,Centres Hospitaliers,26010003700068,8610Z,3,ARS établissements Publics de santé dotation g...,1,Etablissement public de santé,1901-01-01,1901-01-01,2021-07-07,
1,structureet,10000065,10780096,CH DE TREVOUX - MONTPENSIER,CENTRE HOSPITALIER DE TREVOUX - MONTPENSIER,,,14.0,R,DE L'HOPITAL,,BP 615,427,1,AIN,01606 TREVOUX CEDEX,474105000,474105019,355,Centre Hospitalier (C.H.),1102,Centres Hospitaliers,26010028400017,8610Z,3,ARS établissements Publics de santé dotation g...,1,Etablissement public de santé,1901-01-01,1901-01-01,2018-01-12,


In [20]:
# Demonstrate the problem with positional access
print("Accessing data by position (fragile):")
print(f"  FINESS ET (col 1): {df_finess_raw.iloc[0, 1]}")
print(f"  Long name (col 4): {df_finess_raw.iloc[0, 4]}")
print(f"  SIRET (col 22): {df_finess_raw.iloc[0, 22]}")
print("\n⚠️ Problem: Using .iloc[:,1] is not readable and prone to errors!")

Accessing data by position (fragile):
  FINESS ET (col 1): 10000032
  Long name (col 4): CENTRE HOSPITALIER BUGEY SUD
  SIRET (col 22): 26010003700068

⚠️ Problem: Using .iloc[:,1] is not readable and prone to errors!
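
The printed values point to a second issue: the FINESS ID appears as `10000032`, while the identifier that ended up in the header row is `010000024`, so type inference has dropped the leading zero. The sketch below shows two possible ways around this on made-up in-memory rows. It assumes FINESS identifiers are 9 characters wide, as the sample suggests.

```python
import io

import pandas as pd

raw = "structureet;010000032;01\nstructureet;010000065;01\n"
names = ["structureet", "finess_et", "departement"]

# Default type inference turns the IDs into integers and drops the zero.
inferred = pd.read_csv(io.StringIO(raw), sep=";", header=None, names=names)

# Option 1: read every column as text so nothing is reinterpreted.
as_text = pd.read_csv(
    io.StringIO(raw), sep=";", header=None, names=names, dtype=str
)

# Option 2: repair already-loaded values (assumes 9-character IDs).
repaired = inferred["finess_et"].astype(str).str.zfill(9)

print(inferred["finess_et"].tolist())
print(as_text["finess_et"].tolist())
print(repaired.tolist())
```

Reading as text is usually the safer choice, since it also protects department codes, phone numbers, and SIRET values.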


## 2. HAS Demarche Data (Certification Process)

In [21]:
# Load HAS demarche - this one has proper headers
df_has_demarche_raw = pd.read_csv(RAW_PATH / 'has_demarche.csv')

print(f"Shape: {df_has_demarche_raw.shape}")
print(f"\n✓ HAS files have proper column names")
print(f"Columns: {list(df_has_demarche_raw.columns)}")
print("\nNote: May have BOM characters or special chars to clean")
print("\nSample:")
df_has_demarche_raw.head()

Shape: (2348, 6)

✓ HAS files have proper column names
Columns: ['code_demarche', 'annee_visite', 'mois_visite', 'date_deb_visite', 'date_de_decision', 'Decision_de_la_CCES']

Note: May have BOM characters or special chars to clean

Sample:


|   | code_demarche | annee_visite | mois_visite | date_deb_visite | date_de_decision | Decision_de_la_CCES |
|---|---------------|--------------|-------------|-----------------|------------------|---------------------|
| 0 | 30001 | 2021 | 09-Septembre | 21/09/2021 | 10/02/2022 | Certifié |
| 1 | 30002 | 2023 | 01-Janvier | 24/01/2023 | 08/03/2023 | Certifié avec mention |
| 2 | 30003 | 2022 | 01-Janvier | 19/01/2022 | 31/03/2022 | Certifié |
| 3 | 30004 | 2021 | 09-Septembre | 28/09/2021 | 14/12/2021 | Certifié |
| 4 | 30005 | 2021 | 11-Novembre | 29/11/2021 | 31/03/2022 | Certifié |
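
The sample shows dates stored as `DD/MM/YYYY` strings and `mois_visite` carrying a numeric prefix. A minimal sketch of how these could be parsed, assuming every row follows the format of the sample rows (the `mois_visite_num` column name is hypothetical):

```python
import pandas as pd

# Two rows copied from the sample output.
df = pd.DataFrame(
    {
        "code_demarche": [30001, 30002],
        "mois_visite": ["09-Septembre", "01-Janvier"],
        "date_deb_visite": ["21/09/2021", "24/01/2023"],
        "date_de_decision": ["10/02/2022", "08/03/2023"],
    }
)

# An explicit format avoids day/month ambiguity (10/02 is 10 February).
for col in ["date_deb_visite", "date_de_decision"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

# Keep only the numeric part of values such as "09-Septembre".
df["mois_visite_num"] = df["mois_visite"].str.split("-").str[0].astype(int)

print(df.dtypes)
```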


## 3. HAS Establishment Geography Data

In [22]:
# Load HAS etab geo
df_has_geo_raw = pd.read_csv(RAW_PATH / 'has_etab_geo.csv')

print(f"Shape: {df_has_geo_raw.shape}")
print(f"\nColumns: {list(df_has_geo_raw.columns)}")
print("\nSample:")
df_has_geo_raw.head()

Shape: (8211, 5)

Columns: ['code_demarche', 'FINESS_EJ', 'FINESS_EG', 'RS_eg', 'Site_Principal']

Sample:


|   | code_demarche | FINESS_EJ | FINESS_EG | RS_eg | Site_Principal |
|---|---------------|-----------|-----------|-------|----------------|
| 0 | 30001 | 350000402 | 350002176 | CLINIQUE DE L'ESPERANCE | True |
| 1 | 30002 | 340000272 | 340024314 | CLINIQUE SAINT JEAN SUD DE FRANCE | True |
| 2 | 30003 | 350002291 | 350000410 | CENTRE HOSPITALIER DE JANZE | True |
| 3 | 30004 | 530000249 | 530000124 | CLINIQUE NOTRE DAME DE PRITZ | True |
| 4 | 30004 | 530000249 | 530010438 | CENTRE MEDIPSY - CLINIQUE NOTRE DAME DE PRITZ | False |
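
In the sample, `code_demarche` 30004 appears on two rows, so the file seems to hold one row per site rather than one per procedure. Whether each procedure always has exactly one principal site is an assumption worth checking. A small sketch of that check on the sample rows:

```python
import pandas as pd

# Rows copied from the sample output.
df = pd.DataFrame(
    {
        "code_demarche": [30001, 30002, 30003, 30004, 30004],
        "FINESS_EG": [350002176, 340024314, 350000410, 530000124, 530010438],
        "Site_Principal": [True, True, True, True, False],
    }
)

per_demarche = df.groupby("code_demarche").agg(
    n_sites=("FINESS_EG", "nunique"),
    n_principal=("Site_Principal", "sum"),
)

# Procedures without exactly one principal site deserve a closer look.
suspicious = per_demarche[per_demarche["n_principal"] != 1]

print(per_demarche)
print(f"Procedures without exactly one principal site: {len(suspicious)}")
```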


## 4. Health Metrics (IQSS)

In [23]:
# Load health metrics (Excel)
df_metrics_raw = pd.read_excel(RAW_PATH / 'health_metrics.xlsx')

print(f"Shape: {df_metrics_raw.shape}")
print(f"\nColumns ({len(df_metrics_raw.columns)}):")
for col in df_metrics_raw.columns:
    print(f"  - {col}")
print("\n⚠️ Issues to fix:")
print("  - Mixed case column names")
print("  - Spaces in column names")
print("  - Special characters/accents")
print("\nSample:")
df_metrics_raw.head()

Shape: (1248, 27)

Columns (27):
  - finess
  - rs_finess
  - finess_geo
  - rs_finess_geo
  - region
  - type
  - participation
  - Depot
  - nb_rep_score_ALL_ssr_ajust
  - score_ALL_ssr_ajust
  - classement
  - evolution
  - score_ACCUEIL_ssr_ajust
  - nb_rep_score_ACCUEIL_ssr_ajust
  - score_PEC_ssr_ajust
  - nb_rep_score_PEC_ssr_ajust
  - score_LIEU_ssr_ajust
  - nb_rep_score_LIEU_ssr_ajust
  - score_REPAS_ssr_ajust
  - nb_rep_score_REPAS_ssr_ajust
  - score_SORTIE_ssr_ajust
  - nb_rep_score_SORTIE_ssr_ajust
  - score_ALL_ssr_ajust_dp
  - taux_reco_brut
  - nb_reco_brut
  - SCORE_AJUST_ESATIS_REGION
  - SCORE_AJUST_ESATIS_TYPE

⚠️ Issues to fix:
  - Mixed case column names
  - Spaces in column names
  - Special characters/accents

Sample:


Unnamed: 0,finess,rs_finess,finess_geo,rs_finess_geo,region,type,participation,Depot,nb_rep_score_ALL_ssr_ajust,score_ALL_ssr_ajust,classement,evolution,score_ACCUEIL_ssr_ajust,nb_rep_score_ACCUEIL_ssr_ajust,score_PEC_ssr_ajust,nb_rep_score_PEC_ssr_ajust,score_LIEU_ssr_ajust,nb_rep_score_LIEU_ssr_ajust,score_REPAS_ssr_ajust,nb_rep_score_REPAS_ssr_ajust,score_SORTIE_ssr_ajust,nb_rep_score_SORTIE_ssr_ajust,score_ALL_ssr_ajust_dp,taux_reco_brut,nb_reco_brut,SCORE_AJUST_ESATIS_REGION,SCORE_AJUST_ESATIS_TYPE
0,10780062,CH DOCTEUR RECAMIER,10000032,CH BUGEY SUD,Auvergne-Rhône-Alpes,Centre Hospitaliers,2- Facultatif,1- Oui,,,DI,,,,,,,,,,,,,,,,
1,10780112,CH DU PAYS DE GEX,10000081,CH DU PAYS DE GEX,Auvergne-Rhône-Alpes,Centre Hospitaliers,2- Facultatif,1- Oui,,,DI,,,,,,,,,,,,,,,,
2,10007987,CH PUBLIC HAUTEVILLE,10000180,CH PUBLIC HAUTEVILLE - UNITE ESPERANCE,Auvergne-Rhône-Alpes,Centre Hospitaliers,2- Facultatif,1- Oui,65.0,81.46,A,2-Stable,91.18,65.0,84.76,65.0,75.24,65.0,79.62,65.0,76.0,65.0,81.0,81.5,65.0,76.82,78.29
3,10007987,CH PUBLIC HAUTEVILLE,10000198,CH PUBLIC HAUTEVILLE - UNITE INTERDEPT,Auvergne-Rhône-Alpes,Centre Hospitaliers,1- Obligatoire,1- Oui,203.0,80.05,A,2-Stable,88.52,203.0,83.54,203.0,79.02,203.0,68.35,203.0,74.56,203.0,80.0,81.7,202.0,76.82,78.29
4,10007987,CH PUBLIC HAUTEVILLE,10000214,CH PUBLIC HAUTEVILLE - UNITE ALBARINE,Auvergne-Rhône-Alpes,Centre Hospitaliers,1- Obligatoire,1- Oui,68.0,82.46,A,2-Stable,89.09,68.0,84.39,68.0,85.1,68.0,77.22,68.0,73.54,68.0,82.0,77.6,67.0,76.82,78.29
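
One possible way to normalize these column names is sketched below. The `normalize_column` helper is hypothetical, not the pipeline's actual function, and the last two inputs are invented to show accent and BOM handling.

```python
import re
import unicodedata


def normalize_column(name: str) -> str:
    """Lowercase a name, strip accents, and collapse other characters to '_'."""
    # Decompose accented characters, then drop anything outside ASCII
    # (combining accents and a stray BOM both disappear here).
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    snake = re.sub(r"[^0-9a-zA-Z]+", "_", ascii_name).strip("_")
    return snake.lower()


columns = [
    "Depot",
    "score_ALL_ssr_ajust",
    "SCORE_AJUST_ESATIS_REGION",
    " Région ",              # invented example with accent and spaces
    "\ufeffcode_demarche",   # invented example with a leading BOM
]
print([normalize_column(c) for c in columns])
```

Applied to a frame, this would be `df.columns = [normalize_column(c) for c in df.columns]`.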


## Summary

**Raw Data Issues Identified:**

### FINESS (Critical)
- ❌ **No proper column headers** - the names pandas shows are values from the first data record or generic `Unnamed: N` labels
- ❌ Requires positional indexing (`.iloc[:, 1]`) - fragile
- ✅ **Fixed in bronze layer** with 32 official column names

### HAS Data
- ⚠️ Has headers but may contain BOM characters
- ⚠️ Some special characters to clean

### Health Metrics
- ⚠️ Excel format (slower to read)
- ⚠️ Column names need normalization
- ⚠️ Categorical values have numeric prefixes
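
The numeric prefixes (for example `2- Facultatif` or `2-Stable`) could be split into a code and a label roughly as follows. This is a sketch on values copied from the sample, mixed into one Series purely for illustration.

```python
import pandas as pd

# Values copied from the sample output.
s = pd.Series(["2- Facultatif", "1- Obligatoire", "2-Stable", "1- Oui"])

# Capture the leading digits and the label, tolerating optional spaces.
parts = s.str.extract(r"^(?P<code>\d+)\s*-\s*(?P<label>.+)$")
parts["code"] = parts["code"].astype(int)

print(parts)
```

Values that do not match the pattern come back as NaN, so on real data the integer cast would need to run after handling those rows.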

### All Sources
- ⚠️ Potential duplicates
- ⚠️ Missing values
- ⚠️ Mixed formats
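
These cross-source checks can be quantified before cleaning. A minimal sketch, using a hypothetical `quality_report` helper and a tiny made-up frame:

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarise duplicate rows and missing values for one raw table."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_share": df.isna().mean().round(2).to_dict(),
    }


# Made-up frame: one repeated row and one missing score.
demo = pd.DataFrame(
    {
        "finess": ["010000032", "010000032", "010000081"],
        "score": [81.46, 81.46, None],
    }
)
report = quality_report(demo)
print(report)
```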

**Next Step:** Run the cleaning pipeline to create the bronze layer with proper column names:
```bash
python scripts/run_cleaning.py --year 2024
```

**Result**: Bronze layer will have:
- ✅ FINESS with 32 properly named columns (`finess_et`, `rslongue`, `siret`, etc.)
- ✅ All data in standardized CSV format
- ✅ Clean column names (lowercase, no special chars)
- ✅ No duplicates
- ✅ Ready for transformation