# 🔍 Data Discovery - Understanding Our Sources

**For Decision-Makers**: This notebook is an inventory check - we look at what data we have, where it comes from, and whether it's reliable. Everything that follows builds on it.

**Goal**: Before doing anything fancy, let's actually look at what data we have.

**Why this matters**: Many data science projects go wrong because nobody took a close look at the data first. We're making sure we start on solid ground.

## What We're Looking For:
- 📁 What files do we have?
- 📊 What do they contain?
- 📅 What time periods are covered?
- 🗺️ What geographic levels (national, regional, departmental)?
- ⚠️ What's broken or missing?

## 🎯 Connection to Project Goals:
This notebook directly supports:
- ✅ **Predicting vaccine needs** - We need historical data to see patterns
- ✅ **Optimizing distribution** - We need geographic data to know where vaccines go
- ✅ **Anticipating emergencies** - We need emergency visit data to forecast demand
- ✅ **Improving access** - We need coverage data to find gaps

---

In [1]:
%pip install --upgrade \
    nbformat \
    ipykernel \
    ipython \
    jupyterlab \
    numpy \
    pandas \
    matplotlib \
    seaborn \
    scikit-learn \
    scipy \
    notebook \
    plotly \
    PyGAD \
    prophet \
    xgboost \
    statsmodels

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Basic setup - keep it simple
import pandas as pd
import numpy as np
from pathlib import Path
import os
from datetime import datetime
import sys

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Detect environment (check if running in Google Colab)
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Mount Google Drive if in Colab
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    print("✅ Google Drive mounted")

# Make output look nice
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')

print("✅ Libraries loaded")
print(f"📅 Analysis date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f"🖥️ Environment: {'Google Colab' if IN_COLAB else 'Local'}")

✅ Libraries loaded
📅 Analysis date: 2025-10-22 15:26
🖥️ Environment: Local


In [3]:
# Set up paths (works both locally and in Colab)
if IN_COLAB:
    BASE_PATH = Path('/content/drive/MyDrive/HACKATHON_DATALAB')
else:
    BASE_PATH = Path.cwd()

DATA_PATH = BASE_PATH / 'DATASET'

print(f"Working directory: {BASE_PATH}")
print(f"Data directory: {DATA_PATH}")
print(f"Data directory exists: {DATA_PATH.exists()}")

if not DATA_PATH.exists():
    print("⚠️ DATASET folder not found!")
    if IN_COLAB:
        print("   Make sure 'HACKATHON_DATALAB' folder exists in your Google Drive")
    else:
        print("   Make sure you're running this from the project root")

Working directory: /Users/fadybekkar/Desktop/EPITECH/HACK/Hackaton_Data/projet
Data directory: /Users/fadybekkar/Desktop/EPITECH/HACK/Hackaton_Data/projet/DATASET
Data directory exists: True


---

## 📁 Step 1: What Files Do We Have?

Let's explore the entire DATASET folder structure.

In [4]:
def explore_directory(path, level=0, visited=None):
    """Show directory structure in a readable way"""
    if visited is None:
        visited = set()

    items = []
    indent = "  " * level

    try:
        resolved_path = path.resolve()
    except (RuntimeError, OSError) as exc:
        print(f"{indent}⚠️ Could not resolve path {path}: {exc}")
        return items

    if resolved_path in visited:
        print(f"{indent}🔁 {path.name}/ (already visited)")
        return items

    visited.add(resolved_path)

    try:
        entries = sorted(path.iterdir(), key=lambda p: p.name.lower())
    except PermissionError:
        print(f"{indent}⚠️ Permission denied")
        return items

    for item in entries:
        item_indent = indent
        try:
            if item.is_dir():
                if item.is_symlink():
                    print(f"{item_indent}🔗 {item.name}/ (symlink skipped)")
                    continue
                print(f"{item_indent}📁 {item.name}/")
                items.extend(explore_directory(item, level + 1, visited))
            else:
                size_mb = item.stat().st_size / (1024 * 1024)
                print(f"{item_indent}📄 {item.name} ({size_mb:.2f} MB)")
                if item.suffix == '.csv':
                    items.append({
                        'file': item.name,
                        'path': str(item),
                        'size_mb': size_mb,
                        'category': item.parent.parent.name if level > 1 else item.parent.name
                    })
        except PermissionError:
            print(f"{item_indent}⚠️ Permission denied: {item.name}")
        except OSError as exc:
            print(f"{item_indent}⚠️ OS error on {item.name}: {exc}")

    return items

print("\n📂 DATASET STRUCTURE:\n")
print("="*80)
csv_files = explore_directory(DATA_PATH)
print("="*80)
print(f"\n✅ Found {len(csv_files)} CSV files")



📂 DATASET STRUCTURE:

📁 Couvertures-vaccinales-des-adolescents-et-adultes/
  📁 Données-départementales/
    📄 couvertures-vaccinales-des-adolescent-et-adultes-departement.csv (0.10 MB)
  📁 Données-nationales/
    📄 couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-france.csv (0.00 MB)
  📁 Données-régionales/
    📄 couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-region.csv (0.01 MB)
📁 Passages-aux-urgences-et-Actes-SOS-Médecins/
  📁 Données-départementales/
    📄 grippe-passages-aux-urgences-et-actes-sos-medecins-departement.csv (12.60 MB)
  📁 Données-nationales/
    📄 grippe-passages-aux-urgences-et-actes-sos-medecins-france.csv (0.10 MB)
  📁 Données-régionales/
    📄 grippe-passages-urgences-et-actes-sos-medecin_reg.csv (1.98 MB)
📁 Vaccination-Grippe/
  📁 Vaccination-Grippe-2021-2022/
    📄 campagne-2021.csv (0.00 MB)
    📄 couverture-2021.csv (0.00 MB)
    📄 doses-actes-2021.csv (0.06 MB)
  📁 Vaccination-Grippe-2022-2023/
    📄 campagne-2022.csv (0.00 MB)
 

In [5]:
# Create a summary table of all CSV files
if csv_files:
    df_files = pd.DataFrame(csv_files)

    print("\n📊 CSV FILES SUMMARY:\n")
    print(df_files.to_string(index=False))

    print(f"\n📦 Total size: {df_files['size_mb'].sum():.2f} MB")
    print(f"\n📂 Categories: {df_files['category'].unique().tolist()}")
else:
    print("⚠️ No CSV files found!")


📊 CSV FILES SUMMARY:

                                                                    file                                                                                                                                                                                                              path   size_mb                                          category
        couvertures-vaccinales-des-adolescent-et-adultes-departement.csv    /Users/fadybekkar/Desktop/EPITECH/HACK/Hackaton_Data/projet/DATASET/Couvertures-vaccinales-des-adolescents-et-adultes/Données-départementales/couvertures-vaccinales-des-adolescent-et-adultes-departement.csv  0.104764 Couvertures-vaccinales-des-adolescents-et-adultes
couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-france.csv /Users/fadybekkar/Desktop/EPITECH/HACK/Hackaton_Data/projet/DATASET/Couvertures-vaccinales-des-adolescents-et-adultes/Données-nationales/couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-france.csv  0.00101
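
The folder names in the listing above already tell us the geographic level of each file. Here is a minimal sketch of tagging a file with that level. `geographic_level` and `LEVEL_MARKERS` are hypothetical helpers (not part of this notebook), and the matching assumes the `Données-*` folder naming shown above. Accents are stripped first because some file systems return composed and others decomposed accented characters.

```python
import unicodedata

# Hypothetical mapping: accent-free folder-name fragment -> level label
LEVEL_MARKERS = {
    "nationales": "national",
    "regionales": "regional",
    "departementales": "departmental",
}

def _strip_accents(text: str) -> str:
    """Remove combining accents so 'régionales' matches 'regionales'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def geographic_level(folder_name: str) -> str:
    """Return the geographic level implied by a folder name, or 'unknown'."""
    normalized = _strip_accents(folder_name).lower()
    for marker, level in LEVEL_MARKERS.items():
        if marker in normalized:
            return level
    return "unknown"
```

With the summary table from the previous cell, this could be applied as `df_files['path'].map(lambda p: geographic_level(Path(p).parent.name))`. The flu campaign folders carry no level in their name and would come back as `'unknown'`.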

---

## 📊 Step 2: Peek Inside Each Dataset

Let's look at the first few rows of each file to understand the structure.

In [6]:
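def try_read_csv(filepath, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Try multiple encodings - French data often has encoding issues"""
    # 'latin-1' accepts every byte, so it goes last as the catch-all fallback
    for encoding in encodings:
        try:
            df = pd.read_csv(filepath, encoding=encoding, nrows=5, low_memory=False)
            return df, encoding
        except (UnicodeDecodeError, pd.errors.ParserError, pd.errors.EmptyDataError):
            # Wrong encoding or unreadable sample: try the next candidate
            continue
    return None, None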

def analyze_csv_file(filepath):
    """Quick analysis of a CSV file"""
    print("\n" + "="*80)
    print(f"📄 FILE: {Path(filepath).name}")
    print(f"📍 Location: {Path(filepath).parent.name}")
    print("="*80)

    # Try to read the file
    df_sample, encoding = try_read_csv(filepath)

    if df_sample is None:
        print("❌ Could not read file with any encoding")
        return None

    print(f"✅ Encoding: {encoding}")

    # Full read for stats
    df_full = pd.read_csv(filepath, encoding=encoding, low_memory=False)

    print(f"📏 Shape: {df_full.shape[0]:,} rows × {df_full.shape[1]} columns")
    print(f"\n📋 Columns: {list(df_full.columns)}")

    print(f"\n👀 First 3 rows:")
    print(df_sample.head(3).to_string())

    # Data types
    print(f"\n🔢 Data types:")
    for col, dtype in df_full.dtypes.items():
        print(f"  - {col}: {dtype}")

    # Missing values
    missing = df_full.isnull().sum()
    if missing.any():
        print(f"\n⚠️ Missing values:")
        for col, count in missing[missing > 0].items():
            pct = (count / len(df_full)) * 100
            print(f"  - {col}: {count:,} ({pct:.1f}%)")
    else:
        print(f"\n✅ No missing values")

    # Look for date columns
    date_keywords = ['date', 'semaine', 'week', 'annee', 'year', 'periode', 'jour']
    date_cols = [col for col in df_full.columns if any(kw in col.lower() for kw in date_keywords)]

    if date_cols:
        print(f"\n📅 Potential date columns: {date_cols}")
        for col in date_cols:
            try:
                date_series = pd.to_datetime(df_full[col], errors='coerce')
                date_series = date_series.dropna()
                if len(date_series) > 0:
                    print(f"  - {col}: {date_series.min()} to {date_series.max()}")
            except (ValueError, TypeError):
                print(f"  - {col}: Could not parse as date")

    return df_full

print("\n🔍 ANALYZING EACH FILE...\n")


🔍 ANALYZING EACH FILE...
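
A note on the encoding fallback: the order of the candidates matters, because `latin-1` can decode any byte sequence and will therefore never "fail". The sketch below shows one way to check the encoding directly on raw bytes, strictest candidate first. `detect_encoding` is a hypothetical helper, and the candidate list is an assumption, not something the data provider documents.

```python
def detect_encoding(raw: bytes, candidates=("utf-8-sig", "cp1252", "latin-1")) -> str:
    """Return the first candidate encoding that decodes `raw` without error.

    'utf-8-sig' is plain UTF-8 that also tolerates a leading byte-order mark.
    'latin-1' maps every byte to a character, so it must stay last.
    """
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")
```

For a file this could be called as `detect_encoding(Path(filepath).read_bytes())`. A successful decode only shows that the bytes are valid in that encoding, not that the text is right, so a quick look at the accented column names is still worthwhile.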



### 📊 Category 1: Vaccination Coverage Data

In [7]:
# Vaccination Coverage - National
vax_national_path = DATA_PATH / 'Couvertures-vaccinales-des-adolescents-et-adultes' / 'Données-nationales' / 'couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-france.csv'

if vax_national_path.exists():
    df_vax_national = analyze_csv_file(vax_national_path)
else:
    print(f"⚠️ File not found: {vax_national_path}")


📄 FILE: couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-france.csv
📍 Location: Données-nationales
✅ Encoding: utf-8
📏 Shape: 14 rows × 17 columns

📋 Columns: ['Année', 'HPV filles 1 dose à 15 ans', 'HPV filles 2 doses à 16 ans', 'HPV garçons 1 dose à 15 ans', 'HPV garçons 2 doses à 16 ans', 'Méningocoque C 10-14 ans', 'Méningocoque C 15-19 ans', 'Méningocoque C 20-24 ans', 'Grippe moins de 65 ans à risque', 'Grippe 65 ans et plus', 'Grippe 65-74 ans', 'Grippe 75 ans et plus', 'Covid-19 65 ans et plus', 'Grippe résidents en Ehpad', 'Grippe professionnels en Ehpad', 'Covid-19 résidents en Ehpad', 'Covid-19 professionnels en Ehpad']

👀 First 3 rows:
   Année  HPV filles 1 dose à 15 ans  HPV filles 2 doses à 16 ans  HPV garçons 1 dose à 15 ans  HPV garçons 2 doses à 16 ans  Méningocoque C 10-14 ans  Méningocoque C 15-19 ans  Méningocoque C 20-24 ans  Grippe moins de 65 ans à risque  Grippe 65 ans et plus  Grippe 65-74 ans  Grippe 75 ans et plus  Covid-19 65 ans et plus  Grip

In [8]:
# Vaccination Coverage - Regional
vax_regional_path = DATA_PATH / 'Couvertures-vaccinales-des-adolescents-et-adultes' / 'Données-régionales' / 'couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-region.csv'

if vax_regional_path.exists():
    df_vax_regional = analyze_csv_file(vax_regional_path)
else:
    print(f"⚠️ File not found: {vax_regional_path}")


📄 FILE: couvertures-vaccinales-des-adolescents-et-adultes-depuis-2011-region.csv
📍 Location: Données-régionales
✅ Encoding: utf-8
📏 Shape: 238 rows × 19 columns

📋 Columns: ['Année', 'Région Code', 'Région', 'HPV filles 1 dose à 15 ans', 'HPV filles 2 doses à 16 ans', 'HPV garçons 1 dose à 15 ans', 'HPV garçons 2 doses à 16 ans', 'Méningocoque C 10-14 ans', 'Méningocoque C 15-19 ans', 'Méningocoque C 20-24 ans', 'Grippe moins de 65 ans à risque', 'Grippe 65 ans et plus', 'Grippe 65-74 ans', 'Grippe 75 ans et plus', 'Covid-19 65 ans et plus', 'Grippe résidents en Ehpad', 'Grippe professionnels en Ehpad', 'Covid-19 résidents en Ehpad', 'Covid-19 professionnels en Ehpad']

👀 First 3 rows:
   Année  Région Code               Région  HPV filles 1 dose à 15 ans  HPV filles 2 doses à 16 ans  HPV garçons 1 dose à 15 ans  HPV garçons 2 doses à 16 ans  Méningocoque C 10-14 ans  Méningocoque C 15-19 ans  Méningocoque C 20-24 ans  Grippe moins de 65 ans à risque  Grippe 65 ans et plus  Grippe 65-

In [9]:
# Vaccination Coverage - Departmental
vax_dept_path = DATA_PATH / 'Couvertures-vaccinales-des-adolescents-et-adultes' / 'Données-départementales' / 'couvertures-vaccinales-des-adolescent-et-adultes-departement.csv'

if vax_dept_path.exists():
    df_vax_dept = analyze_csv_file(vax_dept_path)
else:
    print(f"⚠️ File not found: {vax_dept_path}")


📄 FILE: couvertures-vaccinales-des-adolescent-et-adultes-departement.csv
📍 Location: Données-départementales
✅ Encoding: utf-8
📏 Shape: 1,414 rows × 17 columns

📋 Columns: ['Année', 'Département Code', 'Département', 'HPV filles 1 dose à 15 ans', 'HPV filles 2 doses à 16 ans', 'HPV garçons 1 dose à 15 ans', 'HPV garçons 2 doses à 16 ans', 'Méningocoque C 10-14 ans', 'Méningocoque C 15-19 ans', 'Méningocoque C 20-24 ans', 'Grippe moins de 65 ans à risque', 'Grippe 65 ans et plus', 'Grippe 65-74 ans', 'Grippe 75 ans et plus', 'Covid-19 65 ans et plus', 'Région', 'Région Code']

👀 First 3 rows:
   Année  Département Code              Département  HPV filles 1 dose à 15 ans  HPV filles 2 doses à 16 ans  HPV garçons 1 dose à 15 ans  HPV garçons 2 doses à 16 ans  Méningocoque C 10-14 ans  Méningocoque C 15-19 ans  Méningocoque C 20-24 ans  Grippe moins de 65 ans à risque  Grippe 65 ans et plus  Grippe 65-74 ans  Grippe 75 ans et plus  Covid-19 65 ans et plus                      Région  Rég

### 🏥 Category 2: Emergency Room Visits

In [10]:
# Emergency - National
emerg_national_path = DATA_PATH / 'Passages-aux-urgences-et-Actes-SOS-Médecins' / 'Données-nationales' / 'grippe-passages-aux-urgences-et-actes-sos-medecins-france.csv'

if emerg_national_path.exists():
    df_emerg_national = analyze_csv_file(emerg_national_path)
else:
    print(f"⚠️ File not found: {emerg_national_path}")


📄 FILE: grippe-passages-aux-urgences-et-actes-sos-medecins-france.csv
📍 Location: Données-nationales
✅ Encoding: utf-8
📏 Shape: 1,510 rows × 6 columns

📋 Columns: ['1er jour de la semaine', 'Semaine', "Classe d'âge", 'Taux de passages aux urgences pour grippe', "Taux d'hospitalisations après passages aux urgences pour grippe", "Taux d'actes médicaux SOS médecins pour grippe"]

👀 First 3 rows:
  1er jour de la semaine   Semaine    Classe d'âge  Taux de passages aux urgences pour grippe  Taux d'hospitalisations après passages aux urgences pour grippe  Taux d'actes médicaux SOS médecins pour grippe
0             2019-12-30  2020-S01       Tous âges                                 900.949821                                                       400.709302                                     4468.384648
1             2020-01-06  2020-S02       15-64 ans                                 810.706944                                                       471.760504                               



In [11]:
# Emergency - Regional
emerg_regional_path = DATA_PATH / 'Passages-aux-urgences-et-Actes-SOS-Médecins' / 'Données-régionales' / 'grippe-passages-urgences-et-actes-sos-medecin_reg.csv'

if emerg_regional_path.exists():
    df_emerg_regional = analyze_csv_file(emerg_regional_path)
else:
    print(f"⚠️ File not found: {emerg_regional_path}")


📄 FILE: grippe-passages-urgences-et-actes-sos-medecin_reg.csv
📍 Location: Données-régionales
✅ Encoding: utf-8
📏 Shape: 27,180 rows × 8 columns

📋 Columns: ['1er jour de la semaine', 'Semaine', 'Région Code', 'Région', "Classe d'âge", 'Taux de passages aux urgences pour grippe', "Taux d'hospitalisations après passages aux urgences pour grippe", "Taux d'actes médicaux SOS médecins pour grippe"]

👀 First 3 rows:
  1er jour de la semaine   Semaine  Région Code   Région    Classe d'âge  Taux de passages aux urgences pour grippe  Taux d'hospitalisations après passages aux urgences pour grippe  Taux d'actes médicaux SOS médecins pour grippe
0             2023-02-20  2023-S08            6  Mayotte       00-04 ans                                 383.141762                                                           1562.5                                             NaN
1             2023-02-20  2023-S08            6  Mayotte       15-64 ans                                 728.597450            



In [12]:
# Emergency - Departmental
emerg_dept_path = DATA_PATH / 'Passages-aux-urgences-et-Actes-SOS-Médecins' / 'Données-départementales' / 'grippe-passages-aux-urgences-et-actes-sos-medecins-departement.csv'

if emerg_dept_path.exists():
    df_emerg_dept = analyze_csv_file(emerg_dept_path)
else:
    print(f"⚠️ File not found: {emerg_dept_path}")


📄 FILE: grippe-passages-aux-urgences-et-actes-sos-medecins-departement.csv
📍 Location: Données-départementales
✅ Encoding: utf-8
📏 Shape: 157,040 rows × 10 columns

📋 Columns: ['1er jour de la semaine', 'Semaine', 'Département Code', 'Département', "Classe d'âge", 'Taux de passages aux urgences pour grippe', "Taux d'hospitalisations après passages aux urgences pour grippe", "Taux d'actes médicaux SOS médecins pour grippe", 'Région', 'Région Code']

👀 First 3 rows:
  1er jour de la semaine   Semaine  Département Code Département    Classe d'âge  Taux de passages aux urgences pour grippe  Taux d'hospitalisations après passages aux urgences pour grippe  Taux d'actes médicaux SOS médecins pour grippe     Région  Région Code
0             2020-12-21  2020-S52                61        Orne       05-14 ans                                        0.0                                                              0.0                                             NaN  Normandie           28
1       
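
The `Semaine` column holds labels such as `2020-S01`, which `pd.to_datetime` cannot read as dates. In the rows shown above, the label lines up with the ISO week whose Monday is given in `1er jour de la semaine` (for example `2020-S01` and `2019-12-30`). Assuming that holds for the whole file, here is a minimal sketch of converting a label to its Monday. `week_label_to_monday` is a hypothetical helper.

```python
import re
from datetime import date

# Labels look like '2020-S01': four-digit year, 'S', two-digit week number
_WEEK_LABEL = re.compile(r"^(\d{4})-S(\d{2})$")

def week_label_to_monday(label: str) -> date:
    """Convert a 'YYYY-Sww' label to the Monday of that ISO week."""
    match = _WEEK_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unexpected week label: {label!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # ISO weekday 1 is Monday; requires Python 3.8+
    return date.fromisocalendar(year, week, 1)
```

Comparing the result against `1er jour de la semaine` for every row would be a cheap way to confirm the ISO-week assumption before relying on it.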



### 💉 Category 3: Flu Vaccination Campaigns (2021-2025)

In [13]:
# Let's check all flu campaign files
flu_base = DATA_PATH / 'Vaccination-Grippe'

flu_files = []
for year_folder in sorted(flu_base.glob('Vaccination-Grippe-*')):
    print(f"\n📂 {year_folder.name}")
    for csv_file in sorted(year_folder.glob('*.csv')):
        flu_files.append(csv_file)
        analyze_csv_file(csv_file)

print(f"\n✅ Total flu campaign files: {len(flu_files)}")


📂 Vaccination-Grippe-2021-2022

📄 FILE: campagne-2021.csv
📍 Location: Vaccination-Grippe-2021-2022
✅ Encoding: utf-8
📏 Shape: 5 rows × 5 columns

📋 Columns: ['campagne', 'date', 'variable', 'valeur', 'cible']

👀 First 3 rows:
    campagne        date      variable    valeur     cible
0  2021-2022  2022-02-28     ACTE(VGP)   4475890   3823445
1  2021-2022  2022-02-28  DOSES(J07E1)  11178955  11915574
2  2021-2022  2022-02-28       UNIVERS     21078     21241

🔢 Data types:
  - campagne: object
  - date: object
  - variable: object
  - valeur: int64
  - cible: int64

✅ No missing values

📅 Potential date columns: ['date']
  - date: 2022-02-28 00:00:00 to 2022-02-28 00:00:00

📄 FILE: couverture-2021.csv
📍 Location: Vaccination-Grippe-2021-2022


✅ Encoding: utf-8
📏 Shape: 52 rows × 5 columns

📋 Columns: ['region', 'code', 'variable', 'groupe', 'valeur']

👀 First 3 rows:
                     region  code      variable           groupe  valeur
0        11 - ILE-DE-France    11     ACTE(VGP)  moins de 65 ans     204
1        11 - ILE-DE-France    11  DOSES(J07E1)  moins de 65 ans     458
2  24 - CENTRE-VAL-DE-LOIRE    24     ACTE(VGP)  moins de 65 ans     214

🔢 Data types:
  - region: object
  - code: int64
  - variable: object
  - groupe: object
  - valeur: int64

✅ No missing values

📄 FILE: doses-actes-2021.csv
📍 Location: Vaccination-Grippe-2021-2022
✅ Encoding: utf-8
📏 Shape: 1,076 rows × 6 columns

📋 Columns: ['campagne', 'date', 'jour', 'variable', 'groupe', 'valeur']

👀 First 3 rows:
    campagne        date  jour      variable           groupe   valeur
0  2020-2021  2020-10-13     1     ACTE(VGP)   65 ans et plus   296119
1  2020-2021  2020-10-13     1  DOSES(J07E1)   65 ans et plus  1685461
2  2020-2021  2020-10-13    

📏 Shape: 5 rows × 5 columns

📋 Columns: ['campagne', 'date', 'variable', 'valeur', 'cible']

👀 First 3 rows:
    campagne        date      variable    valeur     cible
0  2022-2023  2023-02-28     ACTE(VGP)   5373898   4475890
1  2022-2023  2023-02-28  DOSES(J07E1)  11219443  11178955
2  2022-2023  2023-02-28       UNIVERS     20834     21078

🔢 Data types:
  - campagne: object
  - date: object
  - variable: object
  - valeur: int64
  - cible: int64

✅ No missing values

📅 Potential date columns: ['date']
  - date: 2023-02-28 00:00:00 to 2023-02-28 00:00:00

📄 FILE: couverture-2022.csv
📍 Location: Vaccination-Grippe-2022-2023
✅ Encoding: utf-8
📏 Shape: 52 rows × 5 columns

📋 Columns: ['region', 'code', 'variable', 'groupe', 'valeur']

👀 First 3 rows:
                     region  code      variable           groupe  valeur
0        11 - ILE-DE-France    11     ACTE(VGP)  moins de 65 ans     266
1        11 - ILE-DE-France    11  DOSES(J07E1)  moins de 65 ans     478
2  24 - CENTRE-VAL-D

  - date: 2021-10-22 00:00:00 to 2023-02-28 00:00:00
  - jour: 1970-01-01 00:00:00.000000001 to 1970-01-01 00:00:00.000000134

📂 Vaccination-Grippe-2023-2024

📄 FILE: campagne-2023.csv
📍 Location: Vaccination-Grippe-2023-2024
✅ Encoding: utf-8
📏 Shape: 5 rows × 5 columns

📋 Columns: ['campagne', 'date', 'variable', 'valeur', 'cible']

👀 First 3 rows:
    campagne        date      variable    valeur     cible
0  2023-2024  2024-02-28     ACTE(VGP)   5645127   5373898
1  2023-2024  2024-02-28  DOSES(J07E1)  10510433  11219443
2  2023-2024  2024-02-28       UNIVERS     20585     20834

🔢 Data types:
  - campagne: object
  - date: object
  - variable: object
  - valeur: int64
  - cible: int64

✅ No missing values

📅 Potential date columns: ['date']
  - date: 2024-02-28 00:00:00 to 2024-02-28 00:00:00

📄 FILE: couverture-2023.csv
📍 Location: Vaccination-Grippe-2023-2024
✅ Encoding: utf-8
📏 Shape: 52 rows × 5 columns

📋 Columns: ['region', 'code', 'variable', 'groupe', 'valeur']

👀 First 3 r

✅ Encoding: utf-8
📏 Shape: 52 rows × 5 columns

📋 Columns: ['region', 'code', 'variable', 'groupe', 'valeur']

👀 First 3 rows:
               region  code      variable           groupe  valeur
0  11 - ILE-DE-France    11     ACTE(VGP)   65 ans et plus    3930
1  11 - ILE-DE-France    11  DOSES(J07E1)   65 ans et plus    5433
2  11 - ILE-DE-France    11     ACTE(VGP)  moins de 65 ans     320

🔢 Data types:
  - region: object
  - code: int64
  - variable: object
  - groupe: object
  - valeur: int64

✅ No missing values

📄 FILE: doses-actes-2024.csv
📍 Location: Vaccination-Grippe-2024-2025
✅ Encoding: utf-8


📏 Shape: 964 rows × 6 columns

📋 Columns: ['campagne', 'date', 'jour', 'variable', 'groupe', 'valeur']

👀 First 3 rows:
    campagne        date  jour      variable           groupe  valeur
0  2023-2024  2023-10-17     1     ACTE(VGP)   65 ans et plus  165358
1  2023-2024  2023-10-17     1  DOSES(J07E1)   65 ans et plus  460513
2  2023-2024  2023-10-17     1     ACTE(VGP)  moins de 65 ans   45120

🔢 Data types:
  - campagne: object
  - date: object
  - jour: int64
  - variable: object
  - groupe: object
  - valeur: int64

✅ No missing values

📅 Potential date columns: ['date', 'jour']
  - date: 2023-10-17 00:00:00 to 2025-01-28 00:00:00
  - jour: 1970-01-01 00:00:00.000000001 to 1970-01-01 00:00:00.000000135

✅ Total flu campaign files: 12
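
One thing to watch in the output above: `jour` is an integer day counter, but the keyword match treats it as a date column, and `pd.to_datetime` reads the integers as nanoseconds since 1970. That is where the `1970-01-01 00:00:00.000000001` ranges come from. Here is a minimal sketch of a stricter check. `date_range_or_none` and the 90% threshold are assumptions for illustration, not part of the analysis function.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def date_range_or_none(series: pd.Series, min_parsed_share: float = 0.9):
    """Return (min, max) timestamps if the column looks like dates, else None."""
    # Integer or float columns (like a day counter) are not calendar dates
    if series.empty or is_numeric_dtype(series):
        return None
    parsed = pd.to_datetime(series, errors="coerce")
    # Reject columns where most values failed to parse
    if parsed.notna().mean() < min_parsed_share:
        return None
    return parsed.min(), parsed.max()
```

With this check, `jour` would be skipped and `date` would still report its real range.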


---

## 📝 Step 3: Key Findings Summary

Let's summarize what we learned about our data.

In [14]:
print("\n" + "="*80)
print("📋 DATA DISCOVERY SUMMARY")
print("="*80)

print("\n✅ WHAT WE HAVE:")
print("\n1. Vaccination Coverage (Adolescents & Adults):")
print("   - National level: Historical coverage rates")
print("   - Regional level: 13 French regions")
print("   - Departmental level: 101 departments")

print("\n2. Emergency Room Visits & SOS Médecins:")
print("   - National level: Weekly time series")
print("   - Regional level: By region")
print("   - Departmental level: By department")

print("\n3. Flu Vaccination Campaigns (2021-2025):")
print("   - 4 years of campaign data")
print("   - 3 files per year: campaign info, coverage, doses/acts")

print("\n⚠️ DATA QUALITY ISSUES TO ADDRESS:")
print("   1. Encoding issues (French characters)")
print("   2. Date format inconsistencies")
print("   3. Missing values in some columns")
print("   4. Different geographic levels need alignment")

print("\n🎯 NEXT STEPS:")
print("   1. Data Cleaning: Standardize dates, regions, handle missing values")
print("   2. Data Integration: Combine sources at regional level")
print("   3. Feature Engineering: Create lag variables, rolling averages")
print("   4. Exploratory Analysis: Understand patterns and correlations")
print("   5. Modeling: Forecast vaccine needs and emergency visits")
print("   6. Optimization: Allocate vaccines to regions")

print("\n" + "="*80)


📋 DATA DISCOVERY SUMMARY

✅ WHAT WE HAVE:

1. Vaccination Coverage (Adolescents & Adults):
   - National level: Historical coverage rates
   - Regional level: 13 French regions
   - Departmental level: 101 departments

2. Emergency Room Visits & SOS Médecins:
   - National level: Weekly time series
   - Regional level: By region
   - Departmental level: By department

3. Flu Vaccination Campaigns (2021-2025):
   - 4 years of campaign data
   - 3 files per year: campaign info, coverage, doses/acts

⚠️ DATA QUALITY ISSUES TO ADDRESS:
   1. Encoding issues (French characters)
   2. Date format inconsistencies
   3. Missing values in some columns
   4. Different geographic levels need alignment

🎯 NEXT STEPS:
   1. Data Cleaning: Standardize dates, regions, handle missing values
   2. Data Integration: Combine sources at regional level
   3. Feature Engineering: Create lag variables, rolling averages
   4. Exploratory Analysis: Understand patterns and correlations
   5. Modeling: Forecast vaccine needs and emergency visits
   6. Optimization: Allocate vaccines to regions

---

## 💾 Save Discovery Results

Let's save what we learned for the next notebooks.

In [15]:
# Create a data catalog
import json

data_catalog = {
    'discovery_date': datetime.now().isoformat(),
    'base_path': str(DATA_PATH),
    'files_found': len(csv_files) if csv_files else 0,
    'categories': {
        'vaccination_coverage': {
            'description': 'Vaccination coverage rates for adolescents and adults',
            'levels': ['national', 'regional', 'departmental'],
            'time_period': 'Since 2011'
        },
        'emergency_passages': {
            'description': 'Emergency room visits and SOS Médecins acts for flu',
            'levels': ['national', 'regional', 'departmental'],
            'frequency': 'Weekly'
        },
        'flu_campaigns': {
            'description': 'Flu vaccination campaigns',
            'years': ['2021-2022', '2022-2023', '2023-2024', '2024-2025'],
            'file_types': ['campagne', 'couverture', 'doses-actes']
        }
    },
    'quality_issues': [
        'French character encoding (files read so far decoded as utf-8; fall back to cp1252 or latin-1 only if that fails)',
        'Date format variations',
        'Missing values in some datasets',
        'Geographic level alignment needed'
    ],
    'recommended_approach': [
        'Focus on regional level (good balance of detail and data availability)',
        'Use emergency room data as primary demand signal',
        'Use vaccination coverage to identify gaps',
        'Use campaign data for historical effectiveness'
    ]
}

# Save catalog
catalog_path = BASE_PATH / 'data_catalog.json'
with open(catalog_path, 'w', encoding='utf-8') as f:
    json.dump(data_catalog, f, indent=2, ensure_ascii=False)

print(f"\n✅ Data catalog saved to: {catalog_path}")
print("\n🚀 Ready for next notebook: 01_Data_Cleaning.ipynb")


✅ Data catalog saved to: /Users/fadybekkar/Desktop/EPITECH/HACK/Hackaton_Data/projet/data_catalog.json

🚀 Ready for next notebook: 01_Data_Cleaning.ipynb
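
For the next notebook, here is a minimal sketch of reading the catalog back and checking that it has the expected top-level keys. `load_catalog` and `missing_catalog_keys` are hypothetical helpers. The key names are the ones written by the cell above.

```python
import json

# Top-level keys written by the catalog cell in this notebook
REQUIRED_KEYS = (
    "discovery_date",
    "base_path",
    "files_found",
    "categories",
    "quality_issues",
    "recommended_approach",
)

def load_catalog(path):
    """Read the JSON catalog; UTF-8 matches how it was written."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def missing_catalog_keys(catalog: dict) -> list:
    """Return the required keys that are absent, in REQUIRED_KEYS order."""
    return [key for key in REQUIRED_KEYS if key not in catalog]
```

A typical use would be `catalog = load_catalog(BASE_PATH / 'data_catalog.json')` followed by a check that `missing_catalog_keys(catalog)` is empty.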
