# 🧭 Mpox Outbreak Predictor

## 📌 Problem Statement
Mpox (formerly known as Monkeypox) is a viral disease that has caused significant outbreaks in several regions.  
The objective of this project is to develop a machine learning model that can predict and classify Mpox outbreak risks based on available case data, trends, and related features.  

## 🎯 Project Goal
- Perform exploratory data analysis (EDA) to understand patterns and trends in Mpox outbreak data.
- Build a classification model to predict whether a given set of conditions indicates **High Risk** or **Low Risk** of an outbreak.
- Provide insights into the most important features driving outbreak predictions.

## 🗂 Dataset
We will be working with the following datasets (located in `data/raw/`):
- `owid-monkeypox-data.csv`
- `mpox.csv`
- `mpox-daily-confirmed-cases.csv`
- `DATA.csv`

## 🔄 Project Life Cycle
1. **Data Collection** → Gather outbreak datasets from multiple sources.
2. **Data Cleaning** → Handle missing values, duplicates, and formatting issues.
3. **Exploratory Data Analysis (EDA)** → Identify patterns, correlations, and anomalies.
4. **Feature Engineering** → Create new features to improve model performance.
5. **Model Training** → Build classification models using ML algorithms.
6. **Model Evaluation** → Assess performance using accuracy, precision, recall, and F1-score.
7. **Deployment** → Deploy the model via API or Streamlit dashboard.
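
The evaluation metrics named in step 6 can be computed directly from the confusion counts. The sketch below is a minimal illustration on made-up labels, not results from this project; the helper name `classification_metrics` and the sample predictions are assumptions introduced here.

```python
def classification_metrics(y_true, y_pred, positive="High Risk"):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    # Confusion counts with `positive` treated as the class of interest
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only
y_true = ["High Risk", "Low Risk", "High Risk", "Low Risk", "High Risk"]
y_pred = ["High Risk", "High Risk", "Low Risk", "Low Risk", "High Risk"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall are reported alongside accuracy because outbreak labels are often imbalanced, and accuracy alone can hide a model that rarely predicts the minority class.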


# 📦 Step 1: Load and Preview Datasets

In [None]:
import pandas as pd
import os

# Path to raw data folder
data_path = "data/raw/"

# Dataset file names
files = [
    "owid-monkeypox-data.csv",
    "mpox.csv",
    "mpox-daily-confirmed-cases.csv",
    "DATA.csv"
]

# Dictionary to store DataFrames
datasets = {}

# Load each dataset
for file in files:
    file_path = os.path.join(data_path, file)
    try:
        df = pd.read_csv(file_path)
        datasets[file] = df
        print(f"✅ {file} loaded successfully! Shape: {df.shape}")
    except Exception as e:
        print(f"❌ Error loading {file}: {e}")

# Preview first 5 rows of each dataset
for name, df in datasets.items():
    print(f"\n📄 Preview of {name}:")
    display(df.head())


✅ owid-monkeypox-data.csv loaded successfully! Shape: (33666, 15)
✅ mpox.csv loaded successfully! Shape: (264, 9)
✅ mpox-daily-confirmed-cases.csv loaded successfully! Shape: (127729, 3)
✅ DATA.csv loaded successfully! Shape: (25000, 11)

📄 Preview of owid-monkeypox-data.csv:


Unnamed: 0,location,iso_code,date,total_cases,total_deaths,new_cases,new_deaths,new_cases_smoothed,new_deaths_smoothed,new_cases_per_million,total_cases_per_million,new_cases_smoothed_per_million,new_deaths_per_million,total_deaths_per_million,new_deaths_smoothed_per_million
0,Africa,OWID_AFR,2022-05-01,27.0,2.0,0.0,0.0,0.29,0.0,0.0,0.019,0.0,0.0,0.0014,0.0
1,Africa,OWID_AFR,2022-05-02,27.0,2.0,0.0,0.0,0.29,0.0,0.0,0.019,0.0,0.0,0.0014,0.0
2,Africa,OWID_AFR,2022-05-03,27.0,2.0,0.0,0.0,0.29,0.0,0.0,0.019,0.0,0.0,0.0014,0.0
3,Africa,OWID_AFR,2022-05-04,27.0,2.0,0.0,0.0,0.29,0.0,0.0,0.019,0.0,0.0,0.0014,0.0
4,Africa,OWID_AFR,2022-05-05,27.0,2.0,0.0,0.0,0.29,0.0,0.0,0.019,0.0,0.0,0.0014,0.0



📄 Preview of mpox.csv:


Unnamed: 0,date,location,iso_code,new_cases,new_deaths,new_cases_per_million,new_deaths_per_million,total_cases,total_deaths
0,2022-05,Central African Republic,CAF,0,0,0.0,0.0,0,0
1,2022-05,Congo,COG,0,0,0.0,0.0,0,0
2,2022-05,Ghana,GHA,0,0,0.0,0.0,0,0
3,2022-05,Democratic Republic of Congo,COD,10,0,0.101,0.0,10,0
4,2022-05,Cameroon,CMR,1,0,0.036,0.0,1,0



📄 Preview of mpox-daily-confirmed-cases.csv:


Unnamed: 0,Entity,Day,Daily cases
0,Africa,2022-05-01,0.29
1,Africa,2022-05-02,0.29
2,Africa,2022-05-03,0.29
3,Africa,2022-05-04,0.29
4,Africa,2022-05-05,0.29



📄 Preview of DATA.csv:


Unnamed: 0,Patient_ID,Systemic Illness,Rectal Pain,Sore Throat,Penile Oedema,Oral Lesions,Solitary Lesion,Swollen Tonsils,HIV Infection,Sexually Transmitted Infection,MonkeyPox
0,P0,,False,True,True,True,False,True,False,False,Negative
1,P1,Fever,True,False,True,True,False,False,True,False,Positive
2,P2,Fever,False,True,True,False,False,False,True,False,Positive
3,P3,,True,False,False,False,True,True,True,False,Positive
4,P4,Swollen Lymph Nodes,True,True,True,False,False,True,True,False,Positive


### Data Overview

In [8]:
import pandas as pd
import os

# 📂 Path to raw data folder (relative to project root)
data_path = "data/raw/"

# 📄 Dataset file names
files = [
    "owid-monkeypox-data.csv",
    "mpox.csv",
    "mpox-daily-confirmed-cases.csv",
    "DATA.csv"
]

# 📦 Dictionary to store DataFrames
datasets = {}

# 🔹 Load each dataset
for file in files:
    file_path = os.path.join(data_path, file)
    try:
        df = pd.read_csv(file_path)
        datasets[file] = df
        print(f"✅ {file} loaded successfully! Shape: {df.shape}")
    except FileNotFoundError:
        print(f"❌ Error: File not found -> {file_path}")
    except Exception as e:
        print(f"❌ Error loading {file}: {e}")

# 📊 DATA OVERVIEW
for name, df in datasets.items():
    print(f"\n🔍 Overview of {name}")
    print("-" * 60)
    print(f"Shape: {df.shape}")
    
    print("\n📌 Columns & Data Types:")
    print(df.dtypes)
    
    print("\n🚩 Missing Values per Column:")
    print(df.isnull().sum())
    
    print("\n📈 Summary Statistics:")
    print(df.describe(include='all').T)
    
    print("\n" + "="*60)


✅ owid-monkeypox-data.csv loaded successfully! Shape: (33666, 15)
✅ mpox.csv loaded successfully! Shape: (264, 9)
✅ mpox-daily-confirmed-cases.csv loaded successfully! Shape: (127729, 3)
✅ DATA.csv loaded successfully! Shape: (25000, 11)

🔍 Overview of owid-monkeypox-data.csv
------------------------------------------------------------
Shape: (33666, 15)

📌 Columns & Data Types:
location                            object
iso_code                            object
date                                object
total_cases                        float64
total_deaths                       float64
new_cases                          float64
new_deaths                         float64
new_cases_smoothed                 float64
new_deaths_smoothed                float64
new_cases_per_million              float64
total_cases_per_million            float64
new_cases_smoothed_per_million     float64
new_deaths_per_million             float64
total_deaths_per_million           float64
new_deaths_smoot

# 🧹 DATA CLEANING PIPELINE

### 🔹 Step 2A: Cleaning owid-monkeypox-data.csv

Goals:

- Rename key columns (`location` → `country`; the `date` column already has the expected name)

- Convert date to datetime

- Check for missing values in cases/deaths

- Remove duplicates

- Keep only relevant columns
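
The cleaning cell in this step focuses on renaming, date conversion, and forward filling. As a rough sketch of the remaining goals (missing-value check, duplicate removal, column selection), the snippet below works on a small made-up frame. The toy values and the choice of "relevant" columns are assumptions for illustration, not decisions taken in this notebook.

```python
import pandas as pd

# Toy frame that mimics a few OWID columns; values are made up for illustration
toy = pd.DataFrame({
    "location": ["Africa", "Africa", "Africa", "Ghana"],
    "iso_code": ["OWID_AFR", "OWID_AFR", "OWID_AFR", "GHA"],
    "date": ["2022-05-01", "2022-05-01", "2022-05-02", "2022-05-01"],
    "total_cases": [27.0, 27.0, None, 0.0],
    "total_deaths": [2.0, 2.0, 2.0, 0.0],
    "new_cases_smoothed": [0.29, 0.29, 0.29, 0.0],
})

# Count missing values in the case/death columns before changing anything
missing_counts = toy[["total_cases", "total_deaths"]].isnull().sum()

# Drop rows that repeat the same location/date pair, keeping the first one
deduped = toy.drop_duplicates(subset=["location", "date"], keep="first")

# Keep only the columns assumed to be relevant (hypothetical selection)
relevant_cols = ["location", "iso_code", "date", "total_cases", "total_deaths"]
trimmed = deduped[relevant_cols].reset_index(drop=True)

print(missing_counts)
print(trimmed)
```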

In [37]:
# Reload dataset
owid_path = "data/raw/owid-monkeypox-data.csv"
owid_df = pd.read_csv(owid_path)

# Show first rows + columns
print("📊 Shape:", owid_df.shape)
print("🔎 Columns:", owid_df.columns.tolist())
display(owid_df.head())


📊 Shape: (33666, 15)
🔎 Columns: ['location', 'iso_code', 'date', 'total_cases', 'total_deaths', 'new_cases', 'new_deaths', 'new_cases_smoothed', 'new_deaths_smoothed', 'new_cases_per_million', 'total_cases_per_million', 'new_cases_smoothed_per_million', 'new_deaths_per_million', 'total_deaths_per_million', 'new_deaths_smoothed_per_million']


Unnamed: 0,location,iso_code,date,total_cases,total_deaths,new_cases,new_deaths,new_cases_smoothed,new_deaths_smoothed,new_cases_per_million,total_cases_per_million,new_cases_smoothed_per_million,new_deaths_per_million,total_deaths_per_million,new_deaths_smoothed_per_million
0,Africa,OWID_AFR,2022-05-01,27.0,2.0,0.0,0.0,0.29,0.0,0.0,0.019,0.0,0.0,0.0014,0.0
1,Africa,OWID_AFR,2022-05-02,27.0,2.0,0.0,0.0,0.29,0.0,0.0,0.019,0.0,0.0,0.0014,0.0
2,Africa,OWID_AFR,2022-05-03,27.0,2.0,0.0,0.0,0.29,0.0,0.0,0.019,0.0,0.0,0.0014,0.0
3,Africa,OWID_AFR,2022-05-04,27.0,2.0,0.0,0.0,0.29,0.0,0.0,0.019,0.0,0.0,0.0014,0.0
4,Africa,OWID_AFR,2022-05-05,27.0,2.0,0.0,0.0,0.29,0.0,0.0,0.019,0.0,0.0,0.0014,0.0


In [40]:
import pandas as pd

# Load dataset
owid_path = "data/raw/owid-monkeypox-data.csv"
owid_df = pd.read_csv(owid_path)

# 🔹 Standardize column names
owid_df.rename(columns={"location": "country"}, inplace=True)

# 🔹 Convert date to datetime
owid_df["date"] = pd.to_datetime(owid_df["date"], errors="coerce")

# 🔹 Sort by country + date
owid_df.sort_values(by=["country", "date"], inplace=True)

# 🔹 Handle missing values with forward fill (per country)
owid_df["total_cases"] = owid_df.groupby("country")["total_cases"].ffill()
owid_df["total_deaths"] = owid_df.groupby("country")["total_deaths"].ffill()

# 🔹 Reset index
owid_df.reset_index(drop=True, inplace=True)

# ✅ Save cleaned dataset (create the output folder first if it does not exist)
os.makedirs("data/processed", exist_ok=True)
owid_df.to_csv("data/processed/owid_monkeypox_clean.csv", index=False)

print("✅ OWID dataset cleaned & saved successfully!")
print("Shape:", owid_df.shape)
print("Columns:", owid_df.columns.tolist())


✅ OWID dataset cleaned & saved successfully!
Shape: (33666, 15)
Columns: ['country', 'iso_code', 'date', 'total_cases', 'total_deaths', 'new_cases', 'new_deaths', 'new_cases_smoothed', 'new_deaths_smoothed', 'new_cases_per_million', 'total_cases_per_million', 'new_cases_smoothed_per_million', 'new_deaths_per_million', 'total_deaths_per_million', 'new_deaths_smoothed_per_million']



---

### 📌 Key Findings (Step 2A: `owid-monkeypox-data.csv` Cleaning)

1. Original dataset had **33,666 rows and 15 columns**.
2. The country field is named **`location`**, not `country` (fixed by renaming).
3. Missing values in `total_cases` and `total_deaths` were forward-filled per country.
4. Dates are now standardized as `datetime64`.
5. Cleaned dataset saved as **`data/processed/owid_monkeypox_clean.csv`** for future use.
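
To make the per-country forward fill in finding 3 concrete, here is a small sketch on made-up data. It shows that `groupby(...).ffill()` fills gaps only within each country, so a leading gap for one country is not filled with the last value of the previous country. The country names `A` and `B` and all values are hypothetical.

```python
import pandas as pd

# Made-up cumulative counts with gaps; not taken from the OWID file
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime([
        "2022-05-01", "2022-05-02", "2022-05-03",
        "2022-05-01", "2022-05-02",
    ]),
    "total_cases": [5.0, None, 7.0, None, 3.0],
})

# Sort first so that "previous row" means "previous date for this country"
toy = toy.sort_values(["country", "date"]).reset_index(drop=True)

# Forward fill inside each country group only
toy["total_cases_filled"] = toy.groupby("country")["total_cases"].ffill()

print(toy)
```

The gap for country `A` on 2022-05-02 takes the previous value, while the first row of country `B` stays missing because there is no earlier value in its own group. Leading gaps like this would need a separate decision, such as filling with 0.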

---
 

### 🔹 Step 2B: Cleaning mpox.csv

In [41]:
import pandas as pd

# Load dataset
mpox_path = "data/raw/mpox.csv"
mpox_df = pd.read_csv(mpox_path)

# 🔎 Inspect structure
print("📊 Shape:", mpox_df.shape)
print("🔎 Columns:", mpox_df.columns.tolist())

# Preview first 5 rows
display(mpox_df.head())


📊 Shape: (264, 9)
🔎 Columns: ['date', 'location', 'iso_code', 'new_cases', 'new_deaths', 'new_cases_per_million', 'new_deaths_per_million', 'total_cases', 'total_deaths']


Unnamed: 0,date,location,iso_code,new_cases,new_deaths,new_cases_per_million,new_deaths_per_million,total_cases,total_deaths
0,2022-05,Central African Republic,CAF,0,0,0.0,0.0,0,0
1,2022-05,Congo,COG,0,0,0.0,0.0,0,0
2,2022-05,Ghana,GHA,0,0,0.0,0.0,0,0
3,2022-05,Democratic Republic of Congo,COD,10,0,0.101,0.0,10,0
4,2022-05,Cameroon,CMR,1,0,0.036,0.0,1,0


In [45]:
# 🔹 Copy dataframe to avoid modifying raw
mpox_clean = mpox_df.copy()

# 🔹 Standardize column names
mpox_clean.columns = [col.strip().lower().replace(" ", "_") for col in mpox_clean.columns]

# 🔹 Convert date column
mpox_clean["date"] = pd.to_datetime(mpox_clean["date"], errors="coerce")

# 🔹 Sort by location + date
mpox_clean.sort_values(by=["location", "date"], inplace=True)

# 🔹 Forward fill cumulative counts (per location)
for col in ["total_cases", "total_deaths"]:
    if col in mpox_clean.columns:
        mpox_clean[col] = mpox_clean.groupby("location")[col].ffill()

# 🔹 Handle missing values in new_cases/new_deaths → fill with 0
for col in ["new_cases", "new_deaths"]:
    if col in mpox_clean.columns:
        mpox_clean[col] = mpox_clean[col].fillna(0)

# 🔹 Reset index
mpox_clean.reset_index(drop=True, inplace=True)

# 🔹 Save cleaned version
os.makedirs("data/processed", exist_ok=True)
mpox_clean.to_csv("data/processed/mpox_clean.csv", index=False)

print("✅ mpox.csv cleaned & saved! Shape:", mpox_clean.shape)
display(mpox_clean.head())


✅ mpox.csv cleaned & saved! Shape: (264, 9)


Unnamed: 0,date,location,iso_code,new_cases,new_deaths,new_cases_per_million,new_deaths_per_million,total_cases,total_deaths
0,2024-11-01,Angola,AGO,2,0,0.052,0.0,2,0
1,2024-12-01,Angola,AGO,2,0,0.052,0.0,4,0
2,2025-01-01,Angola,AGO,0,0,0.0,0.0,4,0
3,2025-02-01,Angola,AGO,3,0,0.078,0.0,7,0
4,2025-03-01,Angola,AGO,1,0,0.026,0.0,8,0



---

✅ **The cleaned dataset is now consistent and ready for analysis.**
 

### 🗝 Key Findings from `mpox.csv` Cleaning

1. Dataset covers **264 rows, 9 columns** (monthly aggregates per location; raw dates use the `YYYY-MM` format).
2. Dates converted into proper `datetime` format for analysis (each month maps to its first day).
3. Cumulative case & death counts forward-filled per location to ensure continuity.
4. Missing monthly new cases/deaths replaced with **0** instead of NaN.
5. Final dataset saved as **`data/processed/mpox_clean.csv`** for downstream tasks.
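
One way to sanity-check the cleaned file is to confirm that the cumulative column matches the running sum of new cases per location. The sketch below re-creates the five Angola rows shown in the cleaned preview above. It assumes each location's series starts at its first reported month, which may not hold for every location in the real file.

```python
import pandas as pd

# Angola rows copied from the cleaned preview; dates given as YYYY-MM strings
angola = pd.DataFrame({
    "location": ["Angola"] * 5,
    "date": ["2024-11", "2024-12", "2025-01", "2025-02", "2025-03"],
    "new_cases": [2, 2, 0, 3, 1],
    "total_cases": [2, 4, 4, 7, 8],
})

# An explicit format maps each month to its first day
angola["date"] = pd.to_datetime(angola["date"], format="%Y-%m")
angola = angola.sort_values(["location", "date"]).reset_index(drop=True)

# Running sum of new cases within each location
angola["expected_total"] = angola.groupby("location")["new_cases"].cumsum()

# Rows where the reported cumulative count disagrees with the running sum
mismatches = angola[angola["expected_total"] != angola["total_cases"]]

print(angola)
print("Mismatching rows:", len(mismatches))
```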

---
 

### 📌 Step 3: Inspect mpox-daily-confirmed-cases.csv

In [48]:
# =========================
# 📌 Step 3: Inspect mpox-daily-confirmed-cases.csv
# =========================
daily_df = datasets["mpox-daily-confirmed-cases.csv"]

print("📊 Shape:", daily_df.shape)
print("🔎 Columns:", daily_df.columns.tolist())

display(daily_df.head())


📊 Shape: (127729, 3)
🔎 Columns: ['Entity', 'Day', 'Daily cases']


Unnamed: 0,Entity,Day,Daily cases
0,Africa,2022-05-01,0.29
1,Africa,2022-05-02,0.29
2,Africa,2022-05-03,0.29
3,Africa,2022-05-04,0.29
4,Africa,2022-05-05,0.29


In [49]:
# =========================
# 📌 Step 3: Clean mpox-daily-confirmed-cases.csv
# =========================

# Copy dataset
daily_df = datasets["mpox-daily-confirmed-cases.csv"].copy()

# 🔹 Rename columns for consistency
daily_df.rename(columns={
    "Entity": "location",
    "Day": "date",
    "Daily cases": "daily_cases"
}, inplace=True)

# 🔹 Convert date to datetime
daily_df["date"] = pd.to_datetime(daily_df["date"], errors="coerce")

# 🔹 Handle missing values (assign the result back instead of chained inplace fillna)
daily_df["daily_cases"] = daily_df["daily_cases"].fillna(0)

# 🔹 Sort by location + date
daily_df.sort_values(by=["location", "date"], inplace=True)

# 🔹 Reset index
daily_df.reset_index(drop=True, inplace=True)

print("✅ mpox-daily-confirmed-cases.csv cleaned successfully!")
print("📊 Shape after cleaning:", daily_df.shape)
display(daily_df.head())


✅ mpox-daily-confirmed-cases.csv cleaned successfully!
📊 Shape after cleaning: (127729, 3)


Unnamed: 0,location,date,daily_cases
0,Africa,2022-05-01,0.29
1,Africa,2022-05-02,0.29
2,Africa,2022-05-03,0.29
3,Africa,2022-05-04,0.29
4,Africa,2022-05-05,0.29


---


### ✅ Key Findings (`mpox-daily-confirmed-cases.csv`)

1. Dataset contains **127,729 rows** and **3 columns** (location, date, daily cases).
2. Columns were renamed for consistency: `Entity` → `location`, `Day` → `date`, `Daily cases` → `daily_cases`.
3. Dates are now in proper **datetime format**.
4. Missing values in `daily_cases` were replaced with **0** to maintain continuity.
5. Data is now sorted by **location and date** → ready for analysis & aggregation.
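
As an example of the aggregation mentioned in finding 5, the sketch below rolls daily values up to monthly totals per location. It uses a small made-up frame with the cleaned column names. The fractional values in the preview (for example 0.29) suggest the source column may hold smoothed averages, so treat the sums as illustrative; that reading is an assumption.

```python
import pandas as pd

# Made-up rows using the cleaned column names (location, date, daily_cases)
daily = pd.DataFrame({
    "location": ["Africa"] * 4 + ["Ghana"] * 2,
    "date": pd.to_datetime([
        "2022-05-30", "2022-05-31", "2022-06-01", "2022-06-02",
        "2022-05-31", "2022-06-01",
    ]),
    "daily_cases": [0.29, 0.29, 0.5, 0.5, 0.0, 1.0],
})

# Group by location and calendar month, then sum the daily values
monthly = (
    daily
    .assign(month=daily["date"].dt.to_period("M"))
    .groupby(["location", "month"], as_index=False)["daily_cases"]
    .sum()
    .rename(columns={"daily_cases": "monthly_cases"})
)

print(monthly)
```

Monthly totals built this way line up with the `YYYY-MM` granularity of `mpox.csv`, which could make the two sources easier to compare.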

---



### 📌 Step 4: Cleaning DATA.csv

In [50]:
# =========================
# 📌 Step 4: Clean DATA.csv
# =========================

# Copy dataset
data_df = datasets["DATA.csv"].copy()

# 🔹 Quick inspection
print("📊 Shape:", data_df.shape)
print("🔎 Columns:", data_df.columns.tolist())
display(data_df.head())


📊 Shape: (25000, 11)
🔎 Columns: ['Patient_ID', 'Systemic Illness', 'Rectal Pain', 'Sore Throat', 'Penile Oedema', 'Oral Lesions', 'Solitary Lesion', 'Swollen Tonsils', 'HIV Infection', 'Sexually Transmitted Infection', 'MonkeyPox']


Unnamed: 0,Patient_ID,Systemic Illness,Rectal Pain,Sore Throat,Penile Oedema,Oral Lesions,Solitary Lesion,Swollen Tonsils,HIV Infection,Sexually Transmitted Infection,MonkeyPox
0,P0,,False,True,True,True,False,True,False,False,Negative
1,P1,Fever,True,False,True,True,False,False,True,False,Positive
2,P2,Fever,False,True,True,False,False,False,True,False,Positive
3,P3,,True,False,False,False,True,True,True,False,Positive
4,P4,Swollen Lymph Nodes,True,True,True,False,False,True,True,False,Positive
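
The preview suggests three cleaning needs for this file: missing values in `Systemic Illness`, boolean symptom flags, and a text label in `MonkeyPox`. The sketch below shows one possible treatment on a small sample that mirrors a subset of the columns for rows `P0`, `P1`, and `P4`. Filling missing illness values with the string `"None"` and encoding `Positive` as 1 are assumptions, not choices confirmed by this notebook.

```python
import pandas as pd

# Subset of columns for three preview rows (P0, P1, P4)
sample = pd.DataFrame({
    "Patient_ID": ["P0", "P1", "P4"],
    "Systemic Illness": [None, "Fever", "Swollen Lymph Nodes"],
    "Rectal Pain": [False, True, True],
    "HIV Infection": [False, True, True],
    "MonkeyPox": ["Negative", "Positive", "Positive"],
})

clean = sample.copy()

# Standardize column names: lower case with underscores
clean.columns = [c.strip().lower().replace(" ", "_") for c in clean.columns]

# Assumption: a missing systemic illness means no systemic illness was recorded
clean["systemic_illness"] = clean["systemic_illness"].fillna("None")

# Convert boolean symptom flags to 0/1 integers
bool_cols = clean.select_dtypes(include="bool").columns
clean[bool_cols] = clean[bool_cols].astype(int)

# Assumption: encode the target as 1 = Positive, 0 = Negative
clean["monkeypox"] = clean["monkeypox"].map({"Negative": 0, "Positive": 1})

print(clean)
```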
