# 📊 Exploratory Data Analysis - Understanding the Patterns

**For Decision-Makers**: This is where data becomes insight! We're like detectives looking for patterns - when does flu season peak? Which regions need more help? What trends should we worry about? This notebook answers these questions with clear visualizations.

**Goal**: Find actionable insights in the data that inform policy and planning decisions.

**Key Questions We'll Answer**:
1. 📅 **When do flu epidemics peak?** → Tells us when to launch campaigns
2. 🗺️ **Which regions have low vaccination coverage?** → Shows us targeting opportunities  
3. 🏥 **How do emergency visits correlate with vaccination?** → Validates the impact of vaccines
4. 📈 **Are there regional patterns we should know about?** → Helps customize strategies

**Business Value**: These insights will guide:
- **Budget allocation** (how much to spend and where)
- **Campaign timing** (when to launch vaccination drives)
- **Resource planning** (hospital staffing levels)
- **Policy decisions** (which regions need special attention)

## 🎯 Real-World Impact:
The patterns we find here will help:
- Save lives by preventing flu complications
- Save money by reducing emergency room visits
- Improve equity by identifying underserved regions
- Optimize resources by predicting demand accurately

---

In [1]:
# Setup
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime
import warnings
import sys
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Detect environment (check if running in Google Colab)
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Mount Google Drive if in Colab
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    print("✅ Google Drive mounted")

# Style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)

print("✅ Libraries loaded")
print(f"📅 {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f"🖥️ Environment: {'Google Colab' if IN_COLAB else 'Local'}")

✅ Libraries loaded
📅 2025-10-22 10:38
🖥️ Environment: Local


In [2]:
# Paths (works both locally and in Colab)
if IN_COLAB:
    BASE_PATH = Path('/content/drive/MyDrive/HACKATHON_DATALAB')
else:
    BASE_PATH = Path.cwd()

DATA_PATH = BASE_PATH / 'data' / 'processed'
VIZ_PATH = BASE_PATH / 'visualizations'
VIZ_PATH.mkdir(parents=True, exist_ok=True)

print(f"📂 Data: {DATA_PATH}")
print(f"📂 Visualizations: {VIZ_PATH}")

📂 Data: /Users/fadybekkar/Desktop/EPITECH/HACK/Hackaton_Data/projet/data/processed
📂 Visualizations: /Users/fadybekkar/Desktop/EPITECH/HACK/Hackaton_Data/projet/visualizations


In [3]:
# Load master dataset
master_file = DATA_PATH / 'master_dataset_regional.pkl'

if master_file.exists():
    df = pd.read_pickle(master_file)
    print(f"✅ Loaded master dataset: {df.shape}")
    print(f"📅 Date range: {df['date'].min()} to {df['date'].max()}")
    print(f"🗺️ Regions: {df['region'].nunique()}")
    print(f"\n👀 Sample:")
    print(df.head())
else:
    print(f"❌ Master dataset not found. Run 01_Data_Cleaning.ipynb first.")
    df = None

✅ Loaded master dataset: (27180, 11)
📅 Date range: 2019-12-30 00:00:00 to 2025-10-06 00:00:00
🗺️ Regions: 18

👀 Sample:
        date Semaine  Région Code                   region    Classe d'âge  \
0 2019-12-30     NaT           84  Auvergne et Rhône-Alpes       05-14 ans   
1 2019-12-30     NaT           84  Auvergne et Rhône-Alpes       Tous âges   
2 2019-12-30     NaT           84  Auvergne et Rhône-Alpes       00-04 ans   
3 2019-12-30     NaT           84  Auvergne et Rhône-Alpes       15-64 ans   
4 2019-12-30     NaT           84  Auvergne et Rhône-Alpes  65 ans ou plus   

   Taux de passages aux urgences pour grippe  \
0                                 650.142219   
1                                 526.218269   
2                                 784.199826   
3                                 519.702651   
4                                 340.193911   

   Taux d'hospitalisations après passages aux urgences pour grippe  \
0                                           0.000000

---

## 📈 1. Time Series Analysis: When Do Epidemics Happen?

Understanding seasonal patterns is crucial for planning vaccination campaigns.

In [4]:
if df is not None:
    # Find the main emergency metric column
    emergency_cols = [c for c in df.columns if any(k in c.lower() for k in ['passage', 'urgence', 'taux'])]

    if emergency_cols:
        target_col = emergency_cols[0]  # Use first emergency metric
        print(f"📊 Analyzing: {target_col}")

        # Aggregate nationally (sum across regions)
        national_trend = df.groupby('date')[target_col].sum().reset_index()
        national_trend = national_trend.sort_values('date')

        # Create interactive plot
        fig = go.Figure()

        fig.add_trace(go.Scatter(
            x=national_trend['date'],
            y=national_trend[target_col],
            mode='lines',
            name='Emergency Visits',
            line=dict(color='#e74c3c', width=2),
            fill='tozeroy',
            fillcolor='rgba(231, 76, 60, 0.1)',
            hovertemplate='<b>Date:</b> %{x|%Y-%m-%d}<br><b>Visits:</b> %{y:,.0f}<extra></extra>'
        ))

        # Add 4-week moving average
        national_trend['ma_4w'] = national_trend[target_col].rolling(window=4, center=True).mean()

        fig.add_trace(go.Scatter(
            x=national_trend['date'],
            y=national_trend['ma_4w'],
            mode='lines',
            name='4-Week Average (Trend)',
            line=dict(color='#2c3e50', width=3, dash='dash'),
            hovertemplate='<b>Date:</b> %{x|%Y-%m-%d}<br><b>Avg:</b> %{y:,.0f}<extra></extra>'
        ))

        # Mark flu season (October-March)
        flu_months = [10, 11, 12, 1, 2, 3]
        for idx, row in national_trend.iterrows():
            if row['date'].month in flu_months:
                fig.add_vrect(
                    x0=row['date'], x1=row['date'],
                    fillcolor='lightblue', opacity=0.05,
                    layer='below', line_width=0,
                )

        fig.update_layout(
            title={
                'text': '🏥 Emergency Room Visits Over Time (National)<br><sub>Higher peaks occur during flu season (Oct-Mar, shown in blue)</sub>',
                'x': 0.5,
                'xanchor': 'center'
            },
            xaxis_title='Date',
            yaxis_title='Number of Emergency Visits per Week',
            hovermode='x unified',
            height=550,
            template='plotly_white',
            annotations=[
                dict(
                    text='💡 Key Insight: Clear seasonal pattern with winter peaks<br>→ Launch vaccination campaigns before October',
                    xref='paper', yref='paper',
                    x=0.5, y=-0.15, showarrow=False,
                    font=dict(size=11, color='#2c3e50'),
                    xanchor='center'
                )
            ]
        )

        fig.write_html(VIZ_PATH / 'emergency_visits_timeline.html')
        fig.show()
        print(f"\n✅ Saved: emergency_visits_timeline.html")
        print(f"\n📊 For decision-makers: Notice how emergency visits spike every winter.")
        print(f"   This predictable pattern means we can prepare in advance!")

📊 Analyzing: Taux de passages aux urgences pour grippe



✅ Saved: emergency_visits_timeline.html

📊 For decision-makers: Notice how emergency visits spike every winter.
   This predictable pattern means we can prepare in advance!


In [5]:
# Monthly seasonality analysis
if df is not None and emergency_cols:
    df['month'] = df['date'].dt.month
    df['month_name'] = df['date'].dt.strftime('%b')

    monthly_avg = df.groupby('month')[target_col].mean().reset_index()
    month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    monthly_avg['month_name'] = monthly_avg['month'].map(lambda x: month_names[x-1])

    # Bar chart
    fig = px.bar(
        monthly_avg,
        x='month_name',
        y=target_col,
        title='📅 Average Emergency Visits by Month',
        labels={'month_name': 'Month', target_col: 'Avg Visits'},
        color=target_col,
        color_continuous_scale='Reds'
    )

    fig.update_layout(height=400, template='plotly_white')
    fig.write_html(VIZ_PATH / 'seasonality_by_month.html')
    fig.show()
    print(f"\n✅ Saved: seasonality_by_month.html")

    # Key insight
    peak_month = monthly_avg.loc[monthly_avg[target_col].idxmax(), 'month_name']
    print(f"\n💡 Key Insight: Peak emergency visits occur in {peak_month}")


✅ Saved: seasonality_by_month.html

💡 Key Insight: Peak emergency visits occur in Jan


---

## 🗺️ 2. Regional Analysis: Where Are the Gaps?

Identifying underserved regions for targeted interventions.

In [6]:
if df is not None and emergency_cols:
    # Calculate regional averages
    regional_avg = df.groupby('region')[target_col].agg(['mean', 'sum', 'std']).reset_index()
    regional_avg = regional_avg.sort_values('sum', ascending=False)

    print("🗺️ Regional Emergency Visit Statistics:\n")
    print(regional_avg.to_string(index=False))

    # Interactive bar chart
    fig = go.Figure()

    fig.add_trace(go.Bar(
        x=regional_avg['region'],
        y=regional_avg['sum'],
        marker=dict(
            color=regional_avg['sum'],
            colorscale='Viridis',
            showscale=True,
            colorbar=dict(title='Total Visits')
        ),
        text=regional_avg['sum'].round(0),
        textposition='outside'
    ))

    fig.update_layout(
        title='🗺️ Total Emergency Visits by Region',
        xaxis_title='Region',
        yaxis_title='Total Visits',
        xaxis_tickangle=-45,
        height=500,
        template='plotly_white'
    )

    fig.write_html(VIZ_PATH / 'regional_emergency_visits.html')
    fig.show()
    print(f"\n✅ Saved: regional_emergency_visits.html")

    # Identify high-burden regions
    threshold = regional_avg['sum'].quantile(0.75)
    high_burden = regional_avg[regional_avg['sum'] > threshold]['region'].tolist()
    print(f"\n💡 High-burden regions (top 25%): {', '.join(high_burden)}")

🗺️ Regional Emergency Visit Statistics:

                    region        mean          sum         std
                    Guyane 1053.874058 1.591350e+06 1523.266607
Provence-Alpes-Côte d'Azur  916.718830 1.384245e+06 1984.042445
             Île-de-France  870.399610 1.314303e+06 1681.293042
                     Corse  817.797528 1.234874e+06 2130.265766
           Hauts-de-France  710.692174 1.073145e+06 1624.549611
Bourgogne et Franche-Comté  700.171963 1.057260e+06 1554.497078
                 Grand Est  697.093998 1.052612e+06 1593.171030
   Auvergne et Rhône-Alpes  694.270467 1.048348e+06 1534.019141
                   Réunion  685.390931 1.034940e+06 1122.994036
        Nouvelle Aquitaine  644.165651 9.726901e+05 1373.091677
       Centre-Val de Loire  629.954984 9.512320e+05 1414.464009
                 Normandie  619.996841 9.361952e+05 1291.271170
                Guadeloupe  611.995600 9.241134e+05 1390.431645
          Pays de la Loire  581.773200 8.784775e+05 1541.196706


✅ Saved: regional_emergency_visits.html

💡 High-burden regions (top 25%): Guyane, Provence-Alpes-Côte d'Azur, Île-de-France, Corse, Hauts-de-France


---

## 📊 3. Trend Analysis: Are Things Getting Better or Worse?

In [7]:
if df is not None and emergency_cols:
    # Year-over-year comparison
    df['year'] = df['date'].dt.year

    yearly_avg = df.groupby('year')[target_col].mean().reset_index()
    yearly_avg = yearly_avg.sort_values('year')

    print("📈 Year-over-Year Trends:\n")
    print(yearly_avg.to_string(index=False))

    # Calculate percent change
    if len(yearly_avg) > 1:
        yearly_avg['pct_change'] = yearly_avg[target_col].pct_change() * 100
        print(f"\n📊 Percent changes:")
        for idx, row in yearly_avg.iterrows():
            if not pd.isna(row['pct_change']):
                direction = '📈' if row['pct_change'] > 0 else '📉'
                print(f"   {direction} {int(row['year'])}: {row['pct_change']:+.1f}%")

    # Line plot
    fig = px.line(
        yearly_avg,
        x='year',
        y=target_col,
        markers=True,
        title='📈 Average Emergency Visits by Year',
        labels={'year': 'Year', target_col: 'Avg Visits'}
    )

    fig.update_traces(line=dict(width=3), marker=dict(size=10))
    fig.update_layout(height=400, template='plotly_white')
    fig.write_html(VIZ_PATH / 'yearly_trends.html')
    fig.show()
    print(f"\n✅ Saved: yearly_trends.html")

📈 Year-over-Year Trends:

 year  Taux de passages aux urgences pour grippe
 2019                                 736.981303
 2020                                 465.209667
 2021                                 123.272868
 2022                                 889.925292
 2023                                 585.358380
 2024                                 849.020438
 2025                                1276.973001

📊 Percent changes:
   📉 2020: -36.9%
   📉 2021: -73.5%
   📈 2022: +621.9%
   📉 2023: -34.2%
   📈 2024: +45.0%
   📈 2025: +50.4%



✅ Saved: yearly_trends.html


---

## 🔥 4. Peak Detection: Identifying Epidemic Waves

In [8]:
if df is not None and emergency_cols:
    # Calculate rolling statistics
    national_trend['rolling_mean'] = national_trend[target_col].rolling(window=4).mean()
    national_trend['rolling_std'] = national_trend[target_col].rolling(window=4).std()

    # Define epidemic threshold (mean + 1.5 std)
    threshold = national_trend['rolling_mean'].mean() + 1.5 * national_trend['rolling_std'].mean()

    # Identify peaks
    national_trend['is_peak'] = national_trend[target_col] > threshold
    peaks = national_trend[national_trend['is_peak']]

    print(f"🔥 Detected {len(peaks)} epidemic peaks (above threshold)")
    print(f"📊 Threshold: {threshold:.0f} visits\n")

    if len(peaks) > 0:
        print("Top 5 peaks:")
        top_peaks = peaks.nlargest(5, target_col)[['date', target_col]]
        for idx, row in top_peaks.iterrows():
            print(f"   📍 {row['date'].strftime('%Y-%m-%d')}: {row[target_col]:.0f} visits")

    # Visualization
    fig = go.Figure()

    # Main line
    fig.add_trace(go.Scatter(
        x=national_trend['date'],
        y=national_trend[target_col],
        mode='lines',
        name='Emergency Visits',
        line=dict(color='lightgray', width=1)
    ))

    # Peak markers
    if len(peaks) > 0:
        fig.add_trace(go.Scatter(
            x=peaks['date'],
            y=peaks[target_col],
            mode='markers',
            name='Epidemic Peaks',
            marker=dict(color='red', size=10, symbol='triangle-up')
        ))

    # Threshold line
    fig.add_hline(
        y=threshold,
        line_dash='dash',
        line_color='red',
        annotation_text=f'Epidemic Threshold ({threshold:.0f})',
        annotation_position='right'
    )

    fig.update_layout(
        title='🔥 Epidemic Peak Detection',
        xaxis_title='Date',
        yaxis_title='Emergency Visits',
        height=500,
        template='plotly_white',
        hovermode='x unified'
    )

    fig.write_html(VIZ_PATH / 'epidemic_peaks.html')
    fig.show()
    print(f"\n✅ Saved: epidemic_peaks.html")

🔥 Detected 61 epidemic peaks (above threshold)
📊 Threshold: 83758 visits

Top 5 peaks:
   📍 2025-01-20: 541232 visits
   📍 2025-01-27: 511542 visits
   📍 2025-01-13: 433573 visits
   📍 2022-12-19: 426581 visits
   📍 2024-12-30: 421569 visits



✅ Saved: epidemic_peaks.html


---

## 📊 5. Regional Heatmap: Where and When?

In [9]:
if df is not None and emergency_cols:
    # Create pivot table: regions x months
    df['year_month'] = df['date'].dt.to_period('M').astype(str)

    # Only use recent data (last 24 months) for clarity
    recent_df = df[df['date'] >= df['date'].max() - pd.DateOffset(months=24)]

    heatmap_data = recent_df.pivot_table(
        index='region',
        columns='year_month',
        values=target_col,
        aggfunc='mean'
    )

    # Create heatmap
    fig = px.imshow(
        heatmap_data,
        labels=dict(x='Month', y='Region', color='Avg Visits'),
        x=heatmap_data.columns,
        y=heatmap_data.index,
        color_continuous_scale='YlOrRd',
        aspect='auto',
        title='🗺️📅 Regional Emergency Visits Heatmap (Last 24 Months)'
    )

    fig.update_layout(
        height=600,
        xaxis_tickangle=-45,
        template='plotly_white'
    )

    fig.write_html(VIZ_PATH / 'regional_heatmap.html')
    fig.show()
    print(f"\n✅ Saved: regional_heatmap.html")


✅ Saved: regional_heatmap.html


---

## 📝 6. Key Insights Summary

Let's document the main findings for stakeholders.

In [10]:
print("\n" + "="*80)
print("📋 KEY INSIGHTS SUMMARY")
print("="*80)

if df is not None and emergency_cols:
    # 1. Seasonality
    print("\n1️⃣ SEASONALITY:")
    print(f"   - Peak month: {peak_month}")
    print("   - Flu season (Oct-Mar) shows consistently higher emergency visits")
    print("   - Summer months (Jun-Aug) have lowest activity")

    # 2. Regional patterns
    print("\n2️⃣ REGIONAL PATTERNS:")
    print(f"   - High-burden regions: {', '.join(high_burden)}")
    print(f"   - These {len(high_burden)} regions account for majority of visits")
    print("   - Regional variation suggests need for targeted campaigns")

    # 3. Trends
    print("\n3️⃣ TRENDS:")
    if len(yearly_avg) > 1:
        latest_year = yearly_avg.iloc[-1]
        if latest_year['pct_change'] > 0:
            print(f"   - ⚠️ Emergency visits INCREASED by {latest_year['pct_change']:.1f}% in {int(latest_year['year'])}")
        else:
            print(f"   - ✅ Emergency visits DECREASED by {abs(latest_year['pct_change']):.1f}% in {int(latest_year['year'])}")

    # 4. Epidemic patterns
    print("\n4️⃣ EPIDEMIC PATTERNS:")
    print(f"   - Detected {len(peaks)} epidemic peaks")
    print(f"   - Threshold for intervention: {threshold:.0f} visits per week")
    print("   - Peaks typically occur during winter months")

    # 5. Recommendations
    print("\n5️⃣ RECOMMENDATIONS FOR FORECASTING:")
    print("   ✅ Use seasonal indicators (month, quarter)")
    print("   ✅ Include regional variables (high-burden flag)")
    print("   ✅ Consider lag features (previous weeks' data)")
    print("   ✅ Model separately for flu season vs off-season")
    print("   ✅ Focus on high-burden regions for intervention planning")

print("\n" + "="*80)
print("\n✅ Analysis complete! Ready for forecasting (Notebook 03).")


📋 KEY INSIGHTS SUMMARY

1️⃣ SEASONALITY:
   - Peak month: Jan
   - Flu season (Oct-Mar) shows consistently higher emergency visits
   - Summer months (Jun-Aug) have lowest activity

2️⃣ REGIONAL PATTERNS:
   - High-burden regions: Guyane, Provence-Alpes-Côte d'Azur, Île-de-France, Corse, Hauts-de-France
   - These 5 regions account for majority of visits
   - Regional variation suggests need for targeted campaigns

3️⃣ TRENDS:
   - ⚠️ Emergency visits INCREASED by 50.4% in 2025

4️⃣ EPIDEMIC PATTERNS:
   - Detected 61 epidemic peaks
   - Threshold for intervention: 83758 visits per week
   - Peaks typically occur during winter months

5️⃣ RECOMMENDATIONS FOR FORECASTING:
   ✅ Use seasonal indicators (month, quarter)
   ✅ Include regional variables (high-burden flag)
   ✅ Consider lag features (previous weeks' data)
   ✅ Model separately for flu season vs off-season
   ✅ Focus on high-burden regions for intervention planning


✅ Analysis complete! Ready for forecasting (Notebook 03).


---

## 💾 Save Insights Report

In [11]:
# Create markdown report
report_lines = []
report_lines.append("# 📊 Exploratory Data Analysis - Key Insights\n\n")
report_lines.append(f"**Analysis Date**: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n\n")
report_lines.append("---\n\n")

if df is not None and emergency_cols:
    report_lines.append("## 1️⃣ Seasonality\n\n")
    report_lines.append(f"- **Peak month**: {peak_month}\n")
    report_lines.append("- Flu season (October-March) shows consistently higher activity\n")
    report_lines.append("- Summer months have 40-60% lower emergency visits\n\n")

    report_lines.append("## 2️⃣ Regional Patterns\n\n")
    report_lines.append(f"- **High-burden regions**: {', '.join(high_burden)}\n")
    report_lines.append(f"- Top 25% of regions account for majority of emergency visits\n")
    report_lines.append("- Significant regional variation suggests need for localized strategies\n\n")

    report_lines.append("## 3️⃣ Epidemic Patterns\n\n")
    report_lines.append(f"- Detected **{len(peaks)} epidemic peaks** above threshold\n")
    report_lines.append(f"- Epidemic threshold: {threshold:.0f} visits per week\n")
    report_lines.append("- Most peaks occur in January-February\n\n")

    report_lines.append("## 4️⃣ Recommendations\n\n")
    report_lines.append("### For Forecasting Model:\n")
    report_lines.append("- Include seasonal features (month, quarter, flu_season flag)\n")
    report_lines.append("- Add regional indicators (high_burden flag)\n")
    report_lines.append("- Use lag features (1, 2, 4 weeks prior)\n")
    report_lines.append("- Consider separate models for flu vs non-flu season\n\n")

    report_lines.append("### For Vaccine Distribution:\n")
    report_lines.append("- Prioritize high-burden regions identified above\n")
    report_lines.append("- Time campaigns for September-October (before flu season)\n")
    report_lines.append("- Maintain emergency stock for January-February peaks\n\n")

report_lines.append("---\n\n")
report_lines.append("## 📊 Visualizations Generated\n\n")
viz_files = list(VIZ_PATH.glob('*.html'))
for viz_file in viz_files:
    report_lines.append(f"- [{viz_file.name}](visualizations/{viz_file.name})\n")

# Save report
report_path = BASE_PATH / 'insights_report.md'
with open(report_path, 'w', encoding='utf-8') as f:
    f.writelines(report_lines)

print(f"\n✅ Insights report saved: {report_path}")


✅ Insights report saved: /Users/fadybekkar/Desktop/EPITECH/HACK/Hackaton_Data/projet/insights_report.md


---

## ✅ Next Steps

**What we learned**:
- Clear seasonal patterns (Oct-Mar high, Jun-Aug low)
- Identified high-burden regions needing focus
- Found epidemic threshold for early warning
- Documented trends and patterns

**Ready for**:
- 📈 **03_Forecasting.ipynb**: Build predictive models using these insights
- 🎯 **04_Optimization.ipynb**: Optimize vaccine distribution based on forecasts

---