# üîç Data Discovery - Understanding Our Sources

**For Decision-Makers**: This notebook is like opening a treasure chest - we're checking what data we have, where it comes from, and if it's reliable. Think of it as the foundation of everything that follows.

**Goal**: Before doing anything fancy, let's actually look at what data we have.

**Why this matters**: Most data science projects fail because people don't understand their data. We're making sure we start on solid ground.

## What We're Looking For:
- üìÅ What files do we have?
- üìä What do they contain?
- üìÖ What time periods are covered?
- üó∫Ô∏è What geographic levels (national, regional, departmental)?
- ‚ö†Ô∏è What's broken or missing?

## üéØ Connection to Project Goals:
This notebook directly supports:
- ‚úÖ **Predicting vaccine needs** - We need historical data to see patterns
- ‚úÖ **Optimizing distribution** - We need geographic data to know where vaccines go
- ‚úÖ **Anticipating emergencies** - We need emergency visit data to forecast demand
- ‚úÖ **Improving access** - We need coverage data to find gaps

---

In [1]:
%pip install --upgrade \
    nbformat \
    ipykernel \
    ipython \
    jupyterlab \
    numpy \
    pandas \
    matplotlib \
    seaborn \
    scikit-learn \
    scipy \
    notebook \
    plotly \



Collecting ipykernel
  Downloading ipykernel-7.0.1-py3-none-any.whl.metadata (4.5 kB)
Collecting jupyterlab
  Downloading jupyterlab-4.4.9-py3-none-any.whl.metadata (16 kB)
Collecting notebook
  Downloading notebook-7.4.7-py3-none-any.whl.metadata (10 kB)
Collecting async-lru>=1.0.0 (from jupyterlab)
  Downloading async_lru-2.0.5-py3-none-any.whl.metadata (4.5 kB)
Collecting httpx<1,>=0.25.0 (from jupyterlab)
  Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting jinja2>=3.0.3 (from jupyterlab)
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting jupyter-lsp>=2.0.0 (from jupyterlab)
  Downloading jupyter_lsp-2.3.0-py3-none-any.whl.metadata (1.8 kB)
Collecting jupyter-server<3,>=2.4.0 (from jupyterlab)
  Downloading jupyter_server-2.17.0-py3-none-any.whl.metadata (8.5 kB)
Collecting jupyterlab-server<3,>=2.27.1 (from jupyterlab)
  Downloading jupyterlab_server-2.27.3-py3-none-any.whl.metadata (5.9 kB)
Collecting notebook-shim>=0.2 (from jupyterlab)


In [1]:
# Basic setup - keep it simple
import pandas as pd
import numpy as np
from pathlib import Path
import os
from datetime import datetime
import sys

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Detect environment (check if running in Google Colab)
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Mount Google Drive if in Colab
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    print("‚úÖ Google Drive mounted")

# Make output look nice
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')

print("‚úÖ Libraries loaded")
print(f"üìÖ Analysis date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f"üñ•Ô∏è Environment: {'Google Colab' if IN_COLAB else 'Local'}")

‚úÖ Libraries loaded
üìÖ Analysis date: 2025-10-21 15:06
üñ•Ô∏è Environment: Local


In [2]:
# Set up paths (works both locally and in Colab)
if IN_COLAB:
    BASE_PATH = Path('/content/drive/MyDrive/HACKATHON_DATALAB')
else:
    BASE_PATH = Path.cwd()

DATA_PATH = BASE_PATH / 'DATASET'

print(f"Working directory: {BASE_PATH}")
print(f"Data directory: {DATA_PATH}")
print(f"Data directory exists: {DATA_PATH.exists()}")

if not DATA_PATH.exists():
    print("‚ö†Ô∏è DATASET folder not found!")
    if IN_COLAB:
        print("   Make sure 'HACKATHON_DATALAB' folder exists in your Google Drive")
    else:
        print("   Make sure you're running this from the project root")

Working directory: c:\Users\gabin\Desktop\epitech\hackaton-sante\projet
Data directory: c:\Users\gabin\Desktop\epitech\hackaton-sante\projet\DATASET
Data directory exists: True


---

## üìÅ Step 1: What Files Do We Have?

Let's explore the entire DATASET folder structure.

In [3]:
def explore_directory(path, level=0):
    """Show directory structure in a readable way"""
    items = []

    try:
        for item in sorted(path.iterdir()):
            indent = "  " * level

            if item.is_dir():
                print(f"{indent}üìÅ {item.name}/")
                explore_directory(item, level + 1)
            else:
                size_mb = item.stat().st_size / (1024 * 1024)
                print(f"{indent}üìÑ {item.name} ({size_mb:.2f} MB)")

                if item.suffix == '.csv':
                    items.append({
                        'file': item.name,
                        'path': str(item),
                        'size_mb': size_mb,
                        'category': item.parent.parent.name if level > 1 else item.parent.name
                    })
    except PermissionError:
        print(f"{indent}‚ö†Ô∏è Permission denied")

    return items

print("\nüìÇ DATASET STRUCTURE:\n")
print("="*80)
csv_files = explore_directory(DATA_PATH)
print("="*80)
print(f"\n‚úÖ Found {len(csv_files)} CSV files")


üìÇ DATASET STRUCTURE:

üìÅ Couvertures-vaccinales-des-adolescents-et-adultes/
  üìÅ Donn√©es-d√©partementales/
    üìÑ couvertures-vaccinales-des-adolescent-et-adultes-departement.csv (0.11 MB)
  üìÅ Donn√©es-nationales/
    üìÑ couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-france.csv (0.00 MB)
  üìÅ Donn√©es-r√©gionales/
    üìÑ couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-region.csv (0.01 MB)
üìÅ Passages-aux-urgences-et-Actes-SOS-M√©decins/
  üìÅ Donn√©es-d√©partementales/
    üìÑ grippe-passages-aux-urgences-et-actes-sos-medecins-departement.csv (12.75 MB)
  üìÅ Donn√©es-nationales/
    üìÑ grippe-passages-aux-urgences-et-actes-sos-medecins-france.csv (0.10 MB)
  üìÅ Donn√©es-r√©gionales/
    üìÑ grippe-passages-urgences-et-actes-sos-medecin_reg.csv (2.00 MB)
üìÅ Vaccination-Grippe/
  üìÅ Vaccination-Grippe-2021-2022/
    üìÑ campagne-2021.csv (0.00 MB)
    üìÑ couverture-2021.csv (0.00 MB)
    üìÑ doses-actes-2021.csv (0.06 MB

In [4]:
# Create a summary table of all CSV files
if csv_files:
    df_files = pd.DataFrame(csv_files)

    print("\nüìä CSV FILES SUMMARY:\n")
    print(df_files.to_string(index=False))

    print(f"\nüì¶ Total size: {df_files['size_mb'].sum():.2f} MB")
    print(f"\nüìÇ Categories: {df_files['category'].unique().tolist()}")
else:
    print("‚ö†Ô∏è No CSV files found!")

‚ö†Ô∏è No CSV files found!


---

## üìä Step 2: Peek Inside Each Dataset

Let's look at the first few rows of each file to understand the structure.

In [5]:
def try_read_csv(filepath, encodings=['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']):
    """Try multiple encodings - French data often has encoding issues"""
    for encoding in encodings:
        try:
            df = pd.read_csv(filepath, encoding=encoding, nrows=5, low_memory=False)
            return df, encoding
        except:
            continue
    return None, None

def analyze_csv_file(filepath):
    """Quick analysis of a CSV file"""
    print("\n" + "="*80)
    print(f"üìÑ FILE: {Path(filepath).name}")
    print(f"üìç Location: {Path(filepath).parent.name}")
    print("="*80)

    # Try to read the file
    df_sample, encoding = try_read_csv(filepath)

    if df_sample is None:
        print("‚ùå Could not read file with any encoding")
        return None

    print(f"‚úÖ Encoding: {encoding}")

    # Full read for stats
    df_full = pd.read_csv(filepath, encoding=encoding, low_memory=False)

    print(f"üìè Shape: {df_full.shape[0]:,} rows √ó {df_full.shape[1]} columns")
    print(f"\nüìã Columns: {list(df_full.columns)}")

    print(f"\nüëÄ First 3 rows:")
    print(df_sample.head(3).to_string())

    # Data types
    print(f"\nüî¢ Data types:")
    for col, dtype in df_full.dtypes.items():
        print(f"  - {col}: {dtype}")

    # Missing values
    missing = df_full.isnull().sum()
    if missing.any():
        print(f"\n‚ö†Ô∏è Missing values:")
        for col, count in missing[missing > 0].items():
            pct = (count / len(df_full)) * 100
            print(f"  - {col}: {count:,} ({pct:.1f}%)")
    else:
        print(f"\n‚úÖ No missing values")

    # Look for date columns
    date_keywords = ['date', 'semaine', 'week', 'annee', 'year', 'periode', 'jour']
    date_cols = [col for col in df_full.columns if any(kw in col.lower() for kw in date_keywords)]

    if date_cols:
        print(f"\nüìÖ Potential date columns: {date_cols}")
        for col in date_cols:
            try:
                date_series = pd.to_datetime(df_full[col], errors='coerce')
                date_series = date_series.dropna()
                if len(date_series) > 0:
                    print(f"  - {col}: {date_series.min()} to {date_series.max()}")
            except:
                print(f"  - {col}: Could not parse as date")

    return df_full

print("\nüîç ANALYZING EACH FILE...\n")


üîç ANALYZING EACH FILE...



### üìä Category 1: Vaccination Coverage Data

In [6]:
# Vaccination Coverage - National
vax_national_path = DATA_PATH / 'Couvertures-vaccinales-des-adolescents-et-adultes' / 'Donn√©es-nationales' / 'couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-france.csv'

if vax_national_path.exists():
    df_vax_national = analyze_csv_file(vax_national_path)
else:
    print(f"‚ö†Ô∏è File not found: {vax_national_path}")


üìÑ FILE: couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-france.csv
üìç Location: Donn√©es-nationales
‚úÖ Encoding: utf-8
üìè Shape: 14 rows √ó 17 columns

üìã Columns: ['Ann√©e', 'HPV filles 1 dose √† 15 ans', 'HPV filles 2 doses √† 16 ans', 'HPV gar√ßons 1 dose √† 15 ans', 'HPV gar√ßons 2 doses √† 16 ans', 'M√©ningocoque C 10-14 ans', 'M√©ningocoque C 15-19 ans', 'M√©ningocoque C 20-24 ans', 'Grippe moins de 65 ans √† risque', 'Grippe 65 ans et plus', 'Grippe 65-74 ans', 'Grippe 75 ans et plus', 'Covid-19 65 ans et plus', 'Grippe r√©sidents en Ehpad', 'Grippe professionnels en Ehpad', 'Covid-19 r√©sidents en Ehpad', 'Covid-19 professionnels en Ehpad']

üëÄ First 3 rows:
   Ann√©e  HPV filles 1 dose √† 15 ans  HPV filles 2 doses √† 16 ans  HPV gar√ßons 1 dose √† 15 ans  HPV gar√ßons 2 doses √† 16 ans  M√©ningocoque C 10-14 ans  M√©ningocoque C 15-19 ans  M√©ningocoque C 20-24 ans  Grippe moins de 65 ans √† risque  Grippe 65 ans et plus  Grippe 65-74 ans  Grippe 75

In [7]:
# Vaccination Coverage - Regional
vax_regional_path = DATA_PATH / 'Couvertures-vaccinales-des-adolescents-et-adultes' / 'Donn√©es-r√©gionales' / 'couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-region.csv'

if vax_regional_path.exists():
    df_vax_regional = analyze_csv_file(vax_regional_path)
else:
    print(f"‚ö†Ô∏è File not found: {vax_regional_path}")


üìÑ FILE: couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-region.csv
üìç Location: Donn√©es-r√©gionales
‚úÖ Encoding: utf-8
üìè Shape: 238 rows √ó 19 columns

üìã Columns: ['Ann√©e', 'R√©gion Code', 'R√©gion', 'HPV filles 1 dose √† 15 ans', 'HPV filles 2 doses √† 16 ans', 'HPV gar√ßons 1 dose √† 15 ans', 'HPV gar√ßons 2 doses √† 16 ans', 'M√©ningocoque C 10-14 ans', 'M√©ningocoque C 15-19 ans', 'M√©ningocoque C 20-24 ans', 'Grippe moins de 65 ans √† risque', 'Grippe 65 ans et plus', 'Grippe 65-74 ans', 'Grippe 75 ans et plus', 'Covid-19 65 ans et plus', 'Grippe r√©sidents en Ehpad', 'Grippe professionnels en Ehpad', 'Covid-19 r√©sidents en Ehpad', 'Covid-19 professionnels en Ehpad']

üëÄ First 3 rows:
   Ann√©e  R√©gion Code               R√©gion  HPV filles 1 dose √† 15 ans  HPV filles 2 doses √† 16 ans  HPV gar√ßons 1 dose √† 15 ans  HPV gar√ßons 2 doses √† 16 ans  M√©ningocoque C 10-14 ans  M√©ningocoque C 15-19 ans  M√©ningocoque C 20-24 ans  Grippe moins de 65 

In [8]:
# Vaccination Coverage - Departmental
vax_dept_path = DATA_PATH / 'Couvertures-vaccinales-des-adolescents-et-adultes' / 'Donn√©es-d√©partementales' / 'couvertures-vaccinales-des-adolescent-et-adultes-departement.csv'

if vax_dept_path.exists():
    df_vax_dept = analyze_csv_file(vax_dept_path)
else:
    print(f"‚ö†Ô∏è File not found: {vax_dept_path}")


üìÑ FILE: couvertures-vaccinales-des-adolescent-et-adultes-departement.csv
üìç Location: Donn√©es-d√©partementales
‚úÖ Encoding: utf-8
üìè Shape: 1,414 rows √ó 17 columns

üìã Columns: ['Ann√©e', 'D√©partement Code', 'D√©partement', 'HPV filles 1 dose √† 15 ans', 'HPV filles 2 doses √† 16 ans', 'HPV gar√ßons 1 dose √† 15 ans', 'HPV gar√ßons 2 doses √† 16 ans', 'M√©ningocoque C 10-14 ans', 'M√©ningocoque C 15-19 ans', 'M√©ningocoque C 20-24 ans', 'Grippe moins de 65 ans √† risque', 'Grippe 65 ans et plus', 'Grippe 65-74 ans', 'Grippe 75 ans et plus', 'Covid-19 65 ans et plus', 'R√©gion', 'R√©gion Code']

üëÄ First 3 rows:
   Ann√©e  D√©partement Code              D√©partement  HPV filles 1 dose √† 15 ans  HPV filles 2 doses √† 16 ans  HPV gar√ßons 1 dose √† 15 ans  HPV gar√ßons 2 doses √† 16 ans  M√©ningocoque C 10-14 ans  M√©ningocoque C 15-19 ans  M√©ningocoque C 20-24 ans  Grippe moins de 65 ans √† risque  Grippe 65 ans et plus  Grippe 65-74 ans  Grippe 75 ans et plus  Covid-19

### üè• Category 2: Emergency Room Visits

In [9]:
# Emergency - National
emerg_national_path = DATA_PATH / 'Passages-aux-urgences-et-Actes-SOS-M√©decins' / 'Donn√©es-nationales' / 'grippe-passages-aux-urgences-et-actes-sos-medecins-france.csv'

if emerg_national_path.exists():
    df_emerg_national = analyze_csv_file(emerg_national_path)
else:
    print(f"‚ö†Ô∏è File not found: {emerg_national_path}")


üìÑ FILE: grippe-passages-aux-urgences-et-actes-sos-medecins-france.csv
üìç Location: Donn√©es-nationales
‚úÖ Encoding: utf-8
üìè Shape: 1,510 rows √ó 6 columns

üìã Columns: ['1er jour de la semaine', 'Semaine', "Classe d'√¢ge", 'Taux de passages aux urgences pour grippe', "Taux d'hospitalisations apr√®s passages aux urgences pour grippe", "Taux d'actes m√©dicaux SOS m√©decins pour grippe"]

üëÄ First 3 rows:
  1er jour de la semaine   Semaine    Classe d'√¢ge  Taux de passages aux urgences pour grippe  Taux d'hospitalisations apr√®s passages aux urgences pour grippe  Taux d'actes m√©dicaux SOS m√©decins pour grippe
0             2019-12-30  2020-S01       Tous √¢ges                                 900.949821                                                       400.709302                                     4468.384648
1             2020-01-06  2020-S02       15-64 ans                                 810.706944                                                       471.760504   

  date_series = pd.to_datetime(df_full[col], errors='coerce')


In [10]:
# Emergency - Regional
emerg_regional_path = DATA_PATH / 'Passages-aux-urgences-et-Actes-SOS-M√©decins' / 'Donn√©es-r√©gionales' / 'grippe-passages-urgences-et-actes-sos-medecin_reg.csv'

if emerg_regional_path.exists():
    df_emerg_regional = analyze_csv_file(emerg_regional_path)
else:
    print(f"‚ö†Ô∏è File not found: {emerg_regional_path}")


üìÑ FILE: grippe-passages-urgences-et-actes-sos-medecin_reg.csv
üìç Location: Donn√©es-r√©gionales
‚úÖ Encoding: utf-8
üìè Shape: 27,180 rows √ó 8 columns

üìã Columns: ['1er jour de la semaine', 'Semaine', 'R√©gion Code', 'R√©gion', "Classe d'√¢ge", 'Taux de passages aux urgences pour grippe', "Taux d'hospitalisations apr√®s passages aux urgences pour grippe", "Taux d'actes m√©dicaux SOS m√©decins pour grippe"]

üëÄ First 3 rows:
  1er jour de la semaine   Semaine  R√©gion Code   R√©gion    Classe d'√¢ge  Taux de passages aux urgences pour grippe  Taux d'hospitalisations apr√®s passages aux urgences pour grippe  Taux d'actes m√©dicaux SOS m√©decins pour grippe
0             2023-02-20  2023-S08            6  Mayotte       00-04 ans                                 383.141762                                                           1562.5                                             NaN
1             2023-02-20  2023-S08            6  Mayotte       15-64 ans                       

  date_series = pd.to_datetime(df_full[col], errors='coerce')


In [11]:
# Emergency - Departmental
emerg_dept_path = DATA_PATH / 'Passages-aux-urgences-et-Actes-SOS-M√©decins' / 'Donn√©es-d√©partementales' / 'grippe-passages-aux-urgences-et-actes-sos-medecins-departement.csv'

if emerg_dept_path.exists():
    df_emerg_dept = analyze_csv_file(emerg_dept_path)
else:
    print(f"‚ö†Ô∏è File not found: {emerg_dept_path}")


üìÑ FILE: grippe-passages-aux-urgences-et-actes-sos-medecins-departement.csv
üìç Location: Donn√©es-d√©partementales
‚úÖ Encoding: utf-8
üìè Shape: 157,040 rows √ó 10 columns

üìã Columns: ['1er jour de la semaine', 'Semaine', 'D√©partement Code', 'D√©partement', "Classe d'√¢ge", 'Taux de passages aux urgences pour grippe', "Taux d'hospitalisations apr√®s passages aux urgences pour grippe", "Taux d'actes m√©dicaux SOS m√©decins pour grippe", 'R√©gion', 'R√©gion Code']

üëÄ First 3 rows:
  1er jour de la semaine   Semaine  D√©partement Code D√©partement    Classe d'√¢ge  Taux de passages aux urgences pour grippe  Taux d'hospitalisations apr√®s passages aux urgences pour grippe  Taux d'actes m√©dicaux SOS m√©decins pour grippe     R√©gion  R√©gion Code
0             2020-12-21  2020-S52                61        Orne       05-14 ans                                        0.0                                                              0.0                                             

  date_series = pd.to_datetime(df_full[col], errors='coerce')


### üíâ Category 3: Flu Vaccination Campaigns (2021-2025)

In [12]:
# Let's check all flu campaign files
flu_base = DATA_PATH / 'Vaccination-Grippe'

flu_files = []
for year_folder in sorted(flu_base.glob('Vaccination-Grippe-*')):
    print(f"\nüìÇ {year_folder.name}")
    for csv_file in sorted(year_folder.glob('*.csv')):
        flu_files.append(csv_file)
        analyze_csv_file(csv_file)

print(f"\n‚úÖ Total flu campaign files: {len(flu_files)}")


üìÇ Vaccination-Grippe-2021-2022

üìÑ FILE: campagne-2021.csv
üìç Location: Vaccination-Grippe-2021-2022
‚úÖ Encoding: utf-8
üìè Shape: 5 rows √ó 5 columns

üìã Columns: ['campagne', 'date', 'variable', 'valeur', 'cible']

üëÄ First 3 rows:
    campagne        date      variable    valeur     cible
0  2021-2022  2022-02-28     ACTE(VGP)   4475890   3823445
1  2021-2022  2022-02-28  DOSES(J07E1)  11178955  11915574
2  2021-2022  2022-02-28       UNIVERS     21078     21241

üî¢ Data types:
  - campagne: object
  - date: object
  - variable: object
  - valeur: int64
  - cible: int64

‚úÖ No missing values

üìÖ Potential date columns: ['date']
  - date: 2022-02-28 00:00:00 to 2022-02-28 00:00:00

üìÑ FILE: couverture-2021.csv
üìç Location: Vaccination-Grippe-2021-2022
‚úÖ Encoding: utf-8
üìè Shape: 52 rows √ó 5 columns

üìã Columns: ['region', 'code', 'variable', 'groupe', 'valeur']

üëÄ First 3 rows:
                     region  code      variable           groupe  valeur
0

---

## üìù Step 3: Key Findings Summary

Let's summarize what we learned about our data.

In [13]:
print("\n" + "="*80)
print("üìã DATA DISCOVERY SUMMARY")
print("="*80)

print("\n‚úÖ WHAT WE HAVE:")
print("\n1. Vaccination Coverage (Adolescents & Adults):")
print("   - National level: Historical coverage rates")
print("   - Regional level: 13 French regions")
print("   - Departmental level: 101 departments")

print("\n2. Emergency Room Visits & SOS M√©decins:")
print("   - National level: Weekly time series")
print("   - Regional level: By region")
print("   - Departmental level: By department")

print("\n3. Flu Vaccination Campaigns (2021-2025):")
print("   - 4 years of campaign data")
print("   - 3 files per year: campaign info, coverage, doses/acts")

print("\n‚ö†Ô∏è DATA QUALITY ISSUES TO ADDRESS:")
print("   1. Encoding issues (French characters)")
print("   2. Date format inconsistencies")
print("   3. Missing values in some columns")
print("   4. Different geographic levels need alignment")

print("\nüéØ NEXT STEPS:")
print("   1. Data Cleaning: Standardize dates, regions, handle missing values")
print("   2. Data Integration: Combine sources at regional level")
print("   3. Feature Engineering: Create lag variables, rolling averages")
print("   4. Exploratory Analysis: Understand patterns and correlations")
print("   5. Modeling: Forecast vaccine needs and emergency visits")
print("   6. Optimization: Allocate vaccines to regions")

print("\n" + "="*80)


üìã DATA DISCOVERY SUMMARY

‚úÖ WHAT WE HAVE:

1. Vaccination Coverage (Adolescents & Adults):
   - National level: Historical coverage rates
   - Regional level: 13 French regions
   - Departmental level: 101 departments

2. Emergency Room Visits & SOS M√©decins:
   - National level: Weekly time series
   - Regional level: By region
   - Departmental level: By department

3. Flu Vaccination Campaigns (2021-2025):
   - 4 years of campaign data
   - 3 files per year: campaign info, coverage, doses/acts

‚ö†Ô∏è DATA QUALITY ISSUES TO ADDRESS:
   1. Encoding issues (French characters)
   2. Date format inconsistencies
   3. Missing values in some columns
   4. Different geographic levels need alignment

üéØ NEXT STEPS:
   1. Data Cleaning: Standardize dates, regions, handle missing values
   2. Data Integration: Combine sources at regional level
   3. Feature Engineering: Create lag variables, rolling averages
   4. Exploratory Analysis: Understand patterns and correlations
   5. Model

---

## üíæ Save Discovery Results

Let's save what we learned for the next notebooks.

In [14]:
# Create a data catalog
import json

data_catalog = {
    'discovery_date': datetime.now().isoformat(),
    'base_path': str(DATA_PATH),
    'files_found': len(csv_files) if csv_files else 0,
    'categories': {
        'vaccination_coverage': {
            'description': 'Vaccination coverage rates for adolescents and adults',
            'levels': ['national', 'regional', 'departmental'],
            'time_period': 'Since 2011'
        },
        'emergency_passages': {
            'description': 'Emergency room visits and SOS M√©decins acts for flu',
            'levels': ['national', 'regional', 'departmental'],
            'frequency': 'Weekly'
        },
        'flu_campaigns': {
            'description': 'Flu vaccination campaigns',
            'years': ['2021-2022', '2022-2023', '2023-2024', '2024-2025'],
            'file_types': ['campaign', 'couverture', 'doses-actes']
        }
    },
    'quality_issues': [
        'French character encoding (use latin-1 or cp1252)',
        'Date format variations',
        'Missing values in some datasets',
        'Geographic level alignment needed'
    ],
    'recommended_approach': [
        'Focus on regional level (good balance of detail and data availability)',
        'Use emergency room data as primary demand signal',
        'Use vaccination coverage to identify gaps',
        'Use campaign data for historical effectiveness'
    ]
}

# Save catalog
catalog_path = BASE_PATH / 'data_catalog.json'
with open(catalog_path, 'w', encoding='utf-8') as f:
    json.dump(data_catalog, f, indent=2, ensure_ascii=False)

print(f"\n‚úÖ Data catalog saved to: {catalog_path}")
print("\nüöÄ Ready for next notebook: 01_Data_Cleaning.ipynb")


‚úÖ Data catalog saved to: c:\Users\gabin\Desktop\epitech\hackaton-sante\projet\data_catalog.json

üöÄ Ready for next notebook: 01_Data_Cleaning.ipynb
