# Veltis Data Exploration (2023)

**Note on Column Headers**:
- **FINESS** raw files do NOT contain column headers. We assign names to the key columns manually for readability.
- **HAS** files are comma-separated and contain a header row.
- **Health Metrics** is an Excel workbook whose first row holds the headers (the default assumed by `pd.read_excel`).

In [1]:
# Auto-install third-party dependencies if missing
try:
    import pandas as pd
    import openpyxl  # Excel engine required by pd.read_excel
except ImportError:
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pandas", "openpyxl"])
    import pandas as pd

from pathlib import Path

# Set the data path: prefer the absolute workspace path, fall back to a path relative to the current directory
base_path = Path('/workspaces/MVP-web-scrapping-project/data/raw/2023')
if not base_path.exists():
    base_path = Path('data/raw/2023')

print(f"Data Directory: {base_path}")

Data Directory: /workspaces/MVP-web-scrapping-project/data/raw/2023
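
The fallback in the cell above never checks that the relative path exists, so a wrong working directory only shows up later as "not found" messages. A minimal sketch of a stricter variant, assuming a hypothetical helper named `resolve_data_dir` (not part of the original notebook), could fail early instead:

```python
from pathlib import Path
from typing import Iterable


def resolve_data_dir(candidates: Iterable[Path]) -> Path:
    """Return the first candidate that is an existing directory.

    Hypothetical helper: raises FileNotFoundError listing every path tried,
    rather than silently returning a path that does not exist.
    """
    tried = []
    for candidate in candidates:
        candidate = Path(candidate)
        if candidate.is_dir():
            return candidate
        tried.append(str(candidate))
    raise FileNotFoundError(f"No data directory found; tried: {', '.join(tried)}")
```

In this notebook it would be called as `base_path = resolve_data_dir([Path('/workspaces/MVP-web-scrapping-project/data/raw/2023'), Path('data/raw/2023')])`.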


## 1. FINESS Registry (`finess.csv`)
*Raw file has no headers. We skip row 1 (metadata) and assign names manually.*

In [2]:
finess_path = base_path / 'finess.csv'

if finess_path.exists():
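    # FINESS raw file: no header row, semicolon-separated, UTF-8 encoded
    # (decoding it as latin-1 turns "é" into "Ã©" in the preview)
    df_finess = pd.read_csv(
        finess_path,
        sep=';',
        encoding='utf-8',
        header=None,
        skiprows=1,           # skip the metadata line at the top of the file
        on_bad_lines='warn'   # warn about and skip malformed rows
    )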
    
    # Name the columns of interest by position; the others keep their integer labels
    column_map = {
        1: 'FINESS_ET',
        4: 'RAISON_SOCIALE',
        15: 'VILLE_CP',
        22: 'SIRET',
        27: 'CATEGORIE'
    }
    df_finess = df_finess.rename(columns=column_map)
    
    print(f"Loaded {len(df_finess)} rows.")
    display(df_finess.head(3))
else:
    print("finess.csv not found")

Loaded 204916 rows.



| | 0 | FINESS_ET | 2 | 3 | RAISON_SOCIALE | 5 | 6 | 7 | 8 | 9 | ... | SIRET | 23 | 24 | 25 | 26 | CATEGORIE | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | structureet | 10000024 | 10780054 | CH DE FLEYRIAT | CENTRE HOSPITALIER DE BOURG-EN-BRESSE FLEYRIAT | | | 900.0 | RTE | DE PARIS | ... | 26010000000000.0 | 8610Z | 3.0 | ARS établissements Publics de santé dotation... | 1.0 | Etablissement public de santé | 1979-02-13 | 1979-02-13 | 2020-02-04 | |
| 1 | structureet | 10000032 | 10780062 | CH BUGEY SUD | CENTRE HOSPITALIER BUGEY SUD | | | 700.0 | AV | DE NARVIK | ... | 26010000000000.0 | 8610Z | 3.0 | ARS établissements Publics de santé dotation... | 1.0 | Etablissement public de santé | 1901-01-01 | 1901-01-01 | 2021-07-07 | |
| 2 | structureet | 10000065 | 10780096 | CH DE TREVOUX - MONTPENSIER | CENTRE HOSPITALIER DE TREVOUX - MONTPENSIER | | | 14.0 | R | DE L'HOPITAL | ... | 26010030000000.0 | 8610Z | 3.0 | ARS établissements Publics de santé dotation... | 1.0 | Etablissement public de santé | 1901-01-01 | 1901-01-01 | 2018-01-12 | |
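
In the preview above, `FINESS_ET` shows 8 digits (`10000024`) and `SIRET` appears as a float (`26010000000000.0`). FINESS numbers are 9 characters long, so the leading zero has most likely been dropped by numeric type inference. A minimal sketch of one way to avoid this is to read every field as text with `dtype=str`. The rows below are made-up sample data in the same shape as the raw file, not real records:

```python
import io

import pandas as pd

# Illustrative sample only: a metadata line, then semicolon-separated
# records with no header row (same layout as assumed for finess.csv).
sample = io.StringIO(
    "finess;etalab;demo\n"
    "structureet;010000024;010780054;CH DE FLEYRIAT\n"
    "structureet;010000032;010780062;CH BUGEY SUD\n"
)

df_sample = pd.read_csv(
    sample,
    sep=";",
    header=None,
    skiprows=1,   # skip the metadata line
    dtype=str,    # keep every field as text: no int/float conversion
)
df_sample = df_sample.rename(columns={1: "FINESS_ET"})

print(df_sample["FINESS_ET"].tolist())  # ['010000024', '010000032']
```

Reading identifiers as strings also keeps long codes such as SIRET from being rounded when they are shown as floats.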


## 2. HAS Certification (`has_demarche.csv`)
*Uses comma separator.*

In [3]:
has_path = base_path / 'has_demarche.csv'
if has_path.exists():
    # HAS: Comma separator, UTF-8
    df_has = pd.read_csv(has_path, sep=',', encoding='utf-8')
    print(f"Loaded {len(df_has)} rows.")
    display(df_has.head(3))
else:
    print("has_demarche.csv not found")

Loaded 2348 rows.


| | code_demarche | annee_visite | mois_visite | date_deb_visite | date_de_decision | Decision_de_la_CCES |
|---|---|---|---|---|---|---|
| 0 | 30001 | 2021 | 09-Septembre | 21/09/2021 | 10/02/2022 | Certifié |
| 1 | 30002 | 2023 | 01-Janvier | 24/01/2023 | 08/03/2023 | Certifié avec mention |
| 2 | 30003 | 2022 | 01-Janvier | 19/01/2022 | 31/03/2022 | Certifié |
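
The two date columns are loaded as plain `DD/MM/YYYY` strings. A minimal sketch of converting them to datetimes and measuring the gap between visit and decision, using only the three preview rows above (the `days_to_decision` column name is an illustrative choice, not part of the source data):

```python
import pandas as pd

# The three preview rows from has_demarche.csv
df_dates = pd.DataFrame(
    {
        "code_demarche": [30001, 30002, 30003],
        "date_deb_visite": ["21/09/2021", "24/01/2023", "19/01/2022"],
        "date_de_decision": ["10/02/2022", "08/03/2023", "31/03/2022"],
    }
)

# An explicit format avoids day/month ambiguity (e.g. 08/03 is 8 March here)
for col in ["date_deb_visite", "date_de_decision"]:
    df_dates[col] = pd.to_datetime(df_dates[col], format="%d/%m/%Y")

# Hypothetical derived column: days between visit start and decision
df_dates["days_to_decision"] = (
    df_dates["date_de_decision"] - df_dates["date_deb_visite"]
).dt.days

print(df_dates["days_to_decision"].tolist())  # [142, 43, 71]
```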


## 3. HAS Geolocation (`has_etab_geo.csv`)
*Uses comma separator.*

In [4]:
geo_path = base_path / 'has_etab_geo.csv'
if geo_path.exists():
    # GEOLOC: Comma separator, UTF-8
    df_geo = pd.read_csv(geo_path, sep=',', encoding='utf-8')
    print(f"Loaded {len(df_geo)} rows.")
    display(df_geo.head(3))
else:
    print("has_etab_geo.csv not found")

Loaded 8211 rows.


| | code_demarche | FINESS_EJ | FINESS_EG | RS_eg | Site_Principal |
|---|---|---|---|---|---|
| 0 | 30001 | 350000402 | 350002176 | CLINIQUE DE L'ESPERANCE | True |
| 1 | 30002 | 340000272 | 340024314 | CLINIQUE SAINT JEAN SUD DE FRANCE | True |
| 2 | 30003 | 350002291 | 350000410 | CENTRE HOSPITALIER DE JANZE | True |
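
Both HAS files share a `code_demarche` column, and the row counts (2348 procedures vs. 8211 sites) suggest that one certification procedure can cover several sites. A minimal sketch of joining the two, assuming `code_demarche` is unique in `has_demarche.csv` (the `validate` argument raises a `MergeError` if that assumption is wrong). The frames below hold only the preview rows shown above:

```python
import pandas as pd

df_has_demo = pd.DataFrame(
    {
        "code_demarche": [30001, 30002, 30003],
        "Decision_de_la_CCES": ["Certifié", "Certifié avec mention", "Certifié"],
    }
)
df_geo_demo = pd.DataFrame(
    {
        "code_demarche": [30001, 30002, 30003],
        "FINESS_EG": [350002176, 340024314, 350000410],
        "RS_eg": [
            "CLINIQUE DE L'ESPERANCE",
            "CLINIQUE SAINT JEAN SUD DE FRANCE",
            "CENTRE HOSPITALIER DE JANZE",
        ],
        "Site_Principal": [True, True, True],
    }
)

df_sites = df_has_demo.merge(
    df_geo_demo,
    on="code_demarche",
    how="left",              # keep every procedure, even without a site
    validate="one_to_many",  # assumption: one procedure, many sites
    indicator=True,          # adds a _merge column flagging unmatched rows
)

print(df_sites[["code_demarche", "RS_eg", "Decision_de_la_CCES", "_merge"]])
```

Rows where `_merge` is `left_only` would be procedures with no geolocated site, which is worth checking before any analysis.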


## 4. Health Metrics (`health_metrics.xlsx`)

In [5]:
metrics_path = base_path / 'health_metrics.xlsx'
if metrics_path.exists():
    df_metrics = pd.read_excel(metrics_path)
    print(f"Loaded {len(df_metrics)} rows.")
    display(df_metrics.head(3))
else:
    print("health_metrics.xlsx not found")

Loaded 1268 rows.


| | finess | rs_finess | finess_geo | rs_finess_geo | region | type | participation | Depot | nb_rep_score_ALL_ssr_ajust | score_ALL_ssr_ajust | ... | nb_rep_score_LIEU_ssr_ajust | score_REPAS_ssr_ajust | nb_rep_score_REPAS_ssr_ajust | score_SORTIE_ssr_ajust | nb_rep_score_SORTIE_ssr_ajust | score_ALL_ssr_ajust_dp | taux_reco_brut | nb_reco_brut | SCORE_AJUST_ESATIS_REGION | SCORE_AJUST_ESATIS_TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10780062 | CH DOCTEUR RECAMIER | 10000032 | CH BUGEY SUD | Auvergne-Rhône-Alpes | Centre Hospitaliers | 2- Facultatif | 1- Oui | | | ... | | | | | | | | | | |
| 1 | 10007987 | CH PUBLIC HAUTEVILLE | 10000180 | CH PUBLIC HAUTEVILLE - UNITE ESPERANCE | Auvergne-Rhône-Alpes | Centre Hospitaliers | 2- Facultatif | 1- Oui | 80.0 | 80.76 | ... | 80.0 | 79.18 | 80.0 | 75.42 | 80.0 | 81.0 | 80.0 | 80.0 | 76.02 | 77.89 |
| 2 | 10007987 | CH PUBLIC HAUTEVILLE | 10000198 | CH PUBLIC HAUTEVILLE - UNITE INTERDEPT | Auvergne-Rhône-Alpes | Centre Hospitaliers | 1- Obligatoire | 1- Oui | 188.0 | 78.94 | ... | 188.0 | 69.73 | 188.0 | 73.1 | 188.0 | 79.0 | 80.9 | 188.0 | 76.02 | 77.89 |
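
The `finess_geo` value in the first metrics row (`10000032`, CH BUGEY SUD) also appears as `FINESS_ET` in the FINESS preview, which suggests the two datasets can be linked on that identifier. A minimal sketch under that assumption, using only preview rows and a hypothetical helper `to_finess_code` that renders numeric identifiers as 9-character, zero-padded strings before joining:

```python
import pandas as pd


def to_finess_code(series: pd.Series) -> pd.Series:
    """Format numeric identifiers as 9-character, zero-padded strings.

    Hypothetical helper; assumes the column has no missing values.
    """
    return series.astype("int64").astype(str).str.zfill(9)


metrics_demo = pd.DataFrame(
    {
        "finess_geo": [10000032, 10000180, 10000198],
        "rs_finess_geo": [
            "CH BUGEY SUD",
            "CH PUBLIC HAUTEVILLE - UNITE ESPERANCE",
            "CH PUBLIC HAUTEVILLE - UNITE INTERDEPT",
        ],
    }
)
finess_demo = pd.DataFrame(
    {
        "FINESS_ET": [10000024, 10000032, 10000065],
        "RAISON_SOCIALE": [
            "CENTRE HOSPITALIER DE BOURG-EN-BRESSE FLEYRIAT",
            "CENTRE HOSPITALIER BUGEY SUD",
            "CENTRE HOSPITALIER DE TREVOUX - MONTPENSIER",
        ],
    }
)

# Build the same string key on both sides, then left-join onto the metrics
metrics_demo["finess_key"] = to_finess_code(metrics_demo["finess_geo"])
finess_demo["finess_key"] = to_finess_code(finess_demo["FINESS_ET"])

linked = metrics_demo.merge(
    finess_demo[["finess_key", "RAISON_SOCIALE"]],
    on="finess_key",
    how="left",
)

print(linked[["finess_key", "rs_finess_geo", "RAISON_SOCIALE"]])
```

With only three rows per side, a single match (CH BUGEY SUD) is expected; the unmatched rows simply fall outside this small sample.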
