## Synthetic Cholera and Typhoid Dataset and Outbreak Prediction Model

This page presents a synthetic (self-created) dataset that shows how **Cholera** and **Typhoid** outbreaks can be predicted from environmental and community factors.
The data is not real, but it demonstrates how an AI model would work if actual region-specific data from Ghana were available.

### Features

**Region** : The region in Ghana where the data is from.

**City** : The area in Ghana where the data is from.

**Year** : The year the data is from.

**Month** : The month of the year.

**Rainfall_mm**: Average rainfall in millimeters (mm) for that month.

**Temperature_celsius**: Average temperature in degrees Celsius.

**Sanitation_Index**: A number between 0 and 100 showing how clean the area is. Lower numbers mean poorer sanitation; higher numbers mean better sanitation.

**Water_Quality_Index**: A score between 0 and 100 that shows how clean the area's water is.

**Population_Density**: Number of people living per square kilometer.

**Waste_Management_Score**: A score between 0 and 100 showing how well waste is collected and disposed of.

**Cholera_Cases**: The number of cholera cases that occurred.

**Typhoid_Cases**: The number of typhoid cases that occurred.

**Next_Month_Cholera**: The number of cholera cases in the following record for the same city. This is the value the model predicts.

**Next_Month_Typhoid**: The number of typhoid cases in the following record for the same city. This is the value the model predicts.


### Model 
Several regression models are trained and tested on the above dataset to predict next-month cholera and typhoid case counts. The best-performing model is chosen for use by the platform.

## Generating the synthetic dataset

In [2]:
import pandas as pd
import numpy as np
import random
import joblib
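
The dataset-generation cell relies on a `city_region_map` dictionary that this page never defines. A minimal sketch of what it might look like, assuming a plain `dict` from city name to region name. Only the Accra / Greater Accra pair appears in the sample output on this page; the other entries are illustrative assumptions, not the notebook's actual list.

```python
# Hypothetical mapping of city -> region (an assumption for illustration).
# Only "Accra": "Greater Accra" is confirmed by the sample output on this page.
city_region_map = {
    "Accra": "Greater Accra",
    "Kumasi": "Ashanti",
    "Tamale": "Northern",
    "Cape Coast": "Central",
}

# Every city resolves to exactly one region, which is how the Region column is derived
print(city_region_map["Accra"])
```

Any mapping of this shape works, since the generation code only calls `city_region_map.keys()` and looks up each sampled city.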


In [22]:
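# Seed NumPy's random generator so the dataset is reproducible
np.random.seed(42)

n_samples = 2000

# NOTE: city_region_map (a dict mapping city name -> region name) is not
# defined on this page; it must exist before this cell runs.
cities = np.random.choice(list(city_region_map.keys()), size=n_samples)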
regions = [city_region_map[city] for city in cities]
years = np.random.choice(range(2019, 2025), size=n_samples)
months = np.random.choice(range(1, 13), size=n_samples)

rainfall = np.random.randint(50, 250, size=n_samples)
temperature = np.round(np.random.uniform(26, 33, size=n_samples), 1)
population_density = np.random.randint(100, 7000, size=n_samples)
sanitation = np.round(np.random.uniform(30, 90, size=n_samples), 2)
water_quality_index = np.round(np.random.uniform(30, 90, size=n_samples), 2)
waste_management_score = np.round(np.random.uniform(30, 90, size=n_samples), 2)

df = pd.DataFrame({
    "Region": regions,
    "City": cities,
    "Year": years,
    "Month": months,
    "Rainfall_mm": rainfall,
    "Temperature_celsius": temperature,
    "Sanitation_Index": sanitation,
    "Water_Quality_Index": water_quality_index,
    "Population_Density": population_density,
    "Waste_Management_Score": waste_management_score
})


df["Cholera_Cases"] = (
    (df["Rainfall_mm"] / 12)
    + ((100 - df["Sanitation_Index"]) / 8)
    + (df["Population_Density"] / 900)
    + np.random.normal(0, 0.5, size=df.shape[0])
).round().astype(int)

df["Typhoid_Cases"] = (
    (df["Rainfall_mm"] / 18)
    + ((100 - df["Water_Quality_Index"]) / 9)
    + (df["Population_Density"] / 1100)
    + np.random.normal(0, 0.5, size=df.shape[0])
).round().astype(int)
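
To make the two case formulas concrete, here is a small worked example of their noise-free part. The function names and input values are hypothetical, chosen so each term comes out to a whole number; the notebook itself adds Gaussian noise and rounds to an integer.

```python
def expected_cholera(rainfall_mm, sanitation_index, population_density):
    """Noise-free part of the Cholera_Cases formula."""
    return rainfall_mm / 12 + (100 - sanitation_index) / 8 + population_density / 900


def expected_typhoid(rainfall_mm, water_quality_index, population_density):
    """Noise-free part of the Typhoid_Cases formula."""
    return rainfall_mm / 18 + (100 - water_quality_index) / 9 + population_density / 1100


# Illustrative inputs (assumptions, not rows from the dataset)
cholera = expected_cholera(120, 60, 1800)  # 10 + 5 + 2
typhoid = expected_typhoid(180, 55, 2200)  # 10 + 5 + 2
print(cholera, typhoid)  # prints "17.0 17.0"
```

Both formulas rise with rainfall and population density and fall as the sanitation or water-quality score improves.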

In [23]:
# Add next-month targets: within each city, take the case counts of the following row
df = df.sort_values(by=["City", "Year", "Month"])
df["Next_Month_Cholera"] = df.groupby("City")["Cholera_Cases"].shift(-1)
df["Next_Month_Typhoid"] = df.groupby("City")["Typhoid_Cases"].shift(-1)
df = df.dropna(subset=["Next_Month_Cholera", "Next_Month_Typhoid"])
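
A toy illustration of how `groupby(...).shift(-1)` builds these targets may help. The tiny frame here is made up for the example. It also shows a side effect visible in the `df.head()` output on this page: because rows are sampled at random, a city can have several records for the same month, so the "next" row is not always the next calendar month.

```python
import pandas as pd

# Hypothetical records: city "A" has two rows for month 2
toy = pd.DataFrame({
    "City": ["A", "A", "A", "B", "B"],
    "Year": [2019, 2019, 2019, 2019, 2019],
    "Month": [1, 2, 2, 1, 2],
    "Cholera_Cases": [10, 20, 30, 5, 7],
})

toy = toy.sort_values(by=["City", "Year", "Month"])
# shift(-1) pulls the following row's value up, separately for each city
toy["Next_Month_Cholera"] = toy.groupby("City")["Cholera_Cases"].shift(-1)

# The last row of each city has no successor, so its target is NaN and is dropped
labelled = toy.dropna(subset=["Next_Month_Cholera"])
print(labelled)
```

In this sketch the first month-2 row of city "A" receives its target from another month-2 row, which is the same pattern as the repeated `Accra, 2019, 3` rows in the sample output.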



In [20]:
df.head()

| | Region | City | Year | Month | Rainfall_mm | Temperature_celsius | Sanitation_Index | Water_Quality_Index | Population_Density | Waste_Management_Score | Cholera_Cases | Typhoid_Cases | Next_Month_Cholera | Next_Month_Typhoid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 352 | Greater Accra | Accra | 2019 | 2 | 216 | 29.3 | 41.87 | 38.69 | 5205 | 35.15 | 50 | 39 | 47.0 | 25.0 |
| 148 | Greater Accra | Accra | 2019 | 3 | 226 | 29.9 | 42.25 | 77.58 | 3016 | 52.49 | 47 | 25 | 44.0 | 35.0 |
| 317 | Greater Accra | Accra | 2019 | 3 | 191 | 28.6 | 67.82 | 58.79 | 6911 | 86.14 | 44 | 35 | 42.0 | 28.0 |
| 1047 | Greater Accra | Accra | 2019 | 3 | 239 | 31.8 | 68.06 | 74.37 | 3916 | 48.02 | 42 | 28 | 51.0 | 36.0 |
| 1325 | Greater Accra | Accra | 2019 | 3 | 243 | 31.6 | 46.4 | 56.94 | 5095 | 63.61 | 51 | 36 | 51.0 | 33.0 |


In [9]:
df.describe()

| | Year | Month | Rainfall_mm | Temperature_celsius | Sanitation_Index | Water_Quality_Index | Population_Density | Waste_Management_Score | Cholera_Cases | Typhoid_Cases | Cholera_Outbreak | Typhoid_Outbreak |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 2020.875 | 6.465 | 156.9 | 29.345 | 58.69625 | 62.17365 | 3529.245 | 58.9319 | 23.33 | 17.195 | 0.715 | 0.71 |
| std | 1.410433 | 3.491378 | 57.329866 | 1.422636 | 17.51664 | 18.056698 | 1870.540543 | 17.806592 | 6.095645 | 4.357974 | 0.452547 | 0.454901 |
| min | 2019.0 | 1.0 | 50.0 | 27.1 | 30.66 | 30.38 | 177.0 | 30.3 | 10.0 | 6.0 | 0.0 | 0.0 |
| 25% | 2019.0 | 3.0 | 107.0 | 28.0 | 43.3075 | 45.8525 | 2125.5 | 42.09 | 18.75 | 14.0 | 0.0 | 0.0 |
| 50% | 2021.0 | 7.0 | 166.0 | 29.3 | 58.905 | 64.87 | 3403.0 | 59.265 | 25.0 | 17.0 | 1.0 | 1.0 |
| 75% | 2022.0 | 10.0 | 204.25 | 30.5 | 73.3775 | 77.87 | 5131.25 | 73.0 | 28.0 | 20.0 | 1.0 | 1.0 |
| max | 2023.0 | 12.0 | 249.0 | 32.0 | 89.81 | 89.88 | 6998.0 | 89.78 | 35.0 | 27.0 | 1.0 | 1.0 |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Region                  200 non-null    object 
 1   City                    200 non-null    object 
 2   Year                    200 non-null    int64  
 3   Month                   200 non-null    int64  
 4   Rainfall_mm             200 non-null    int32  
 5   Temperature_celsius     200 non-null    float64
 6   Sanitation_Index        200 non-null    float64
 7   Water_Quality_Index     200 non-null    float64
 8   Population_Density      200 non-null    int32  
 9   Waste_Management_Score  200 non-null    float64
 10  Cholera_Cases           200 non-null    int64  
 11  Typhoid_Cases           200 non-null    int64  
 12  Cholera_Outbreak        200 non-null    int64  
 13  Typhoid_Outbreak        200 non-null    int64  
dtypes: float64(4), int32(2), int64(6), object(2)

## Data Preprocessing and Model Training

In [39]:
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import (
    RandomForestRegressor,
    GradientBoostingRegressor,
    HistGradientBoostingRegressor,
    StackingRegressor,
)
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

In [24]:
numeric_features = [
    "Temperature_celsius", "Rainfall_mm", "Population_Density",
    "Water_Quality_Index", "Sanitation_Index", "Waste_Management_Score"
]
categorical_features = ["Region", "City"]

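# Numeric columns: fill missing values with the median, then standardize
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical columns: fill missing values with the most frequent category,
# then one-hot encode; categories unseen during fit are encoded as all zeros
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Apply each transformer to its own group of columns
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])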
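
As a quick check of what this preprocessor produces, the sketch here rebuilds it and applies it to a three-row frame. The toy values (and the Kumasi / Ashanti row) are assumptions for illustration only. It shows that the output has one column per numeric feature plus one per distinct region and city, and that columns not listed in the transformer, such as `Year` and `Month`, are dropped.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = [
    "Temperature_celsius", "Rainfall_mm", "Population_Density",
    "Water_Quality_Index", "Sanitation_Index", "Waste_Management_Score",
]
categorical_features = ["Region", "City"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

# Toy rows (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    "Region": ["Greater Accra", "Ashanti", "Greater Accra"],
    "City": ["Accra", "Kumasi", "Accra"],
    "Year": [2019, 2019, 2020],
    "Month": [1, 2, 3],
    "Rainfall_mm": [100, 150, 200],
    "Temperature_celsius": [27.0, 29.5, 31.0],
    "Sanitation_Index": [40.0, 60.0, 80.0],
    "Water_Quality_Index": [35.0, 55.0, 75.0],
    "Population_Density": [500, 3000, 6000],
    "Waste_Management_Score": [45.0, 65.0, 85.0],
})

transformed = preprocessor.fit_transform(toy)
# 6 scaled numeric columns + 2 one-hot regions + 2 one-hot cities
print(transformed.shape)  # prints "(3, 10)"
```

Because `Year` and `Month` never reach the regressors, none of the models on this page can learn seasonal or yearly patterns from them.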


## Model Training

In [25]:
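# Features: drop the current-month case counts and both targets.
# Year and Month remain in X but are not listed in the ColumnTransformer,
# so the preprocessor drops them (remainder='drop' is the default).
X = df.drop(["Cholera_Cases", "Typhoid_Cases", "Next_Month_Cholera", "Next_Month_Typhoid"], axis=1)

# Two targets are predicted jointly: next-month cholera and typhoid cases
y_multi = df[["Next_Month_Cholera", "Next_Month_Typhoid"]]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y_multi, test_size=0.2, random_state=42)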


In [28]:
xgb_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', MultiOutputRegressor(XGBRegressor(random_state=42, n_estimators=300, learning_rate=0.1, max_depth=5)))
])

xgb_pipe.fit(X_train, y_train)
y_pred_xgb = xgb_pipe.predict(X_test)

In [29]:
print("\n===== XGBoost Results =====")
print("Cholera → RMSE:", np.sqrt(mean_squared_error(y_test["Next_Month_Cholera"], y_pred_xgb[:, 0])))
print("Cholera → R²:", r2_score(y_test["Next_Month_Cholera"], y_pred_xgb[:, 0]))
print("Typhoid → RMSE:", np.sqrt(mean_squared_error(y_test["Next_Month_Typhoid"], y_pred_xgb[:, 1])))
print("Typhoid → R²:", r2_score(y_test["Next_Month_Typhoid"], y_pred_xgb[:, 1]))


===== XGBoost Results =====
Cholera → RMSE: 6.10497595293681
Cholera → R²: -0.15127453486020248
Typhoid → RMSE: 4.448413314491459
Typhoid → R²: -0.14170687453519726
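
The same four metric lines are repeated for every model on this page. One possible way to factor them out is a small helper like the one sketched here; `evaluate` and `TARGETS` are hypothetical names, and the arrays are toy values rather than notebook results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

TARGETS = ["Cholera", "Typhoid"]  # column order of the two-output predictions


def evaluate(y_true, y_pred, targets=TARGETS):
    """Return {target: {"RMSE": ..., "R2": ...}} for column-aligned 2-D inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scores = {}
    for i, name in enumerate(targets):
        rmse = float(np.sqrt(mean_squared_error(y_true[:, i], y_pred[:, i])))
        r2 = float(r2_score(y_true[:, i], y_pred[:, i]))
        scores[name] = {"RMSE": rmse, "R2": r2}
    return scores


# Toy data for illustration only
y_true = np.array([[10, 5], [20, 7], [30, 9]])
y_pred = np.array([[12, 5], [18, 7], [30, 9]])
scores = evaluate(y_true, y_pred)
for name, metrics in scores.items():
    print(f"{name} -> RMSE: {metrics['RMSE']:.3f}, R2: {metrics['R2']:.3f}")
```

With the notebook's variables this would be called as `evaluate(y_test, y_pred_xgb)`, since `np.asarray` also accepts a DataFrame.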


In [32]:
rf_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', MultiOutputRegressor(RandomForestRegressor(random_state=42, n_estimators=200)))
])

rf_pipe.fit(X_train, y_train)
y_pred_rf = rf_pipe.predict(X_test)

In [33]:
print("\n===== Random Forest Results =====")
print("Cholera → RMSE:", np.sqrt(mean_squared_error(y_test["Next_Month_Cholera"], y_pred_rf[:, 0])))
print("Cholera → R²:", r2_score(y_test["Next_Month_Cholera"], y_pred_rf[:, 0]))
print("Typhoid → RMSE:", np.sqrt(mean_squared_error(y_test["Next_Month_Typhoid"], y_pred_rf[:, 1])))
print("Typhoid → R²:", r2_score(y_test["Next_Month_Typhoid"], y_pred_rf[:, 1]))


===== Random Forest Results =====
Cholera → RMSE: 5.7385121103472025
Cholera → R²: -0.017207581235789693
Typhoid → RMSE: 4.2701749581173285
Typhoid → R²: -0.052048328906511454


In [34]:
gb_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', MultiOutputRegressor(GradientBoostingRegressor(random_state=42)))
])

gb_pipe.fit(X_train, y_train)
y_pred_gb = gb_pipe.predict(X_test)

In [35]:
print("\n===== Gradient Boosting Results =====")
print("Cholera → RMSE:", np.sqrt(mean_squared_error(y_test["Next_Month_Cholera"], y_pred_gb[:, 0])))
print("Cholera → R²:", r2_score(y_test["Next_Month_Cholera"], y_pred_gb[:, 0]))
print("Typhoid → RMSE:", np.sqrt(mean_squared_error(y_test["Next_Month_Typhoid"], y_pred_gb[:, 1])))
print("Typhoid → R²:", r2_score(y_test["Next_Month_Typhoid"], y_pred_gb[:, 1]))


===== Gradient Boosting Results =====
Cholera → RMSE: 5.742481135163961
Cholera → R²: -0.0186151650334625
Typhoid → RMSE: 4.1951663439734
Typhoid → R²: -0.015413010860267917


In [40]:
stacking_regressor = StackingRegressor(
    estimators=[
        ('xgb', XGBRegressor(random_state=42, n_estimators=300, learning_rate=0.1, max_depth=5)),
        ('rf', RandomForestRegressor(random_state=42, n_estimators=200)),
        ('gb', GradientBoostingRegressor(random_state=42)),
        ('hgb', HistGradientBoostingRegressor(random_state=42))
    ],
    final_estimator=XGBRegressor(random_state=42, learning_rate=0.1, n_estimators=100)
)

blend_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', MultiOutputRegressor(stacking_regressor))
])

blend_pipe.fit(X_train, y_train)
y_pred_blend = blend_pipe.predict(X_test)

In [41]:
print("\n===== Blended Ensemble (Stacking) =====")
print("Cholera → RMSE:", np.sqrt(mean_squared_error(y_test["Next_Month_Cholera"], y_pred_blend[:, 0])))
print("Cholera → R²:", r2_score(y_test["Next_Month_Cholera"], y_pred_blend[:, 0]))
print("Typhoid → RMSE:", np.sqrt(mean_squared_error(y_test["Next_Month_Typhoid"], y_pred_blend[:, 1])))
print("Typhoid → R²:", r2_score(y_test["Next_Month_Typhoid"], y_pred_blend[:, 1]))


===== Blended Ensemble (Stacking) =====
Cholera → RMSE: 5.917636736980268
Cholera → R²: -0.08170188609550522
Typhoid → RMSE: 4.430398956098188
Typhoid → R²: -0.13247865273063986


### Choosing the best model

In [44]:
# Store results from each model
results = pd.DataFrame({
    "Model": ["XGBoost", "Random Forest", "Gradient Boosting", "Blended Ensemble"],
    "Cholera_RMSE": [
        np.sqrt(mean_squared_error(y_test["Next_Month_Cholera"], y_pred_xgb[:, 0])),
        np.sqrt(mean_squared_error(y_test["Next_Month_Cholera"], y_pred_rf[:, 0])),
        np.sqrt(mean_squared_error(y_test["Next_Month_Cholera"], y_pred_gb[:, 0])),
        np.sqrt(mean_squared_error(y_test["Next_Month_Cholera"], y_pred_blend[:, 0]))
    ],
    "Cholera_R2": [
        r2_score(y_test["Next_Month_Cholera"], y_pred_xgb[:, 0]),
        r2_score(y_test["Next_Month_Cholera"], y_pred_rf[:, 0]),
        r2_score(y_test["Next_Month_Cholera"], y_pred_gb[:, 0]),
        r2_score(y_test["Next_Month_Cholera"], y_pred_blend[:, 0])
    ],
    "Typhoid_RMSE": [
        np.sqrt(mean_squared_error(y_test["Next_Month_Typhoid"], y_pred_xgb[:, 1])),
        np.sqrt(mean_squared_error(y_test["Next_Month_Typhoid"], y_pred_rf[:, 1])),
        np.sqrt(mean_squared_error(y_test["Next_Month_Typhoid"], y_pred_gb[:, 1])),
        np.sqrt(mean_squared_error(y_test["Next_Month_Typhoid"], y_pred_blend[:, 1]))
    ],
    "Typhoid_R2": [
        r2_score(y_test["Next_Month_Typhoid"], y_pred_xgb[:, 1]),
        r2_score(y_test["Next_Month_Typhoid"], y_pred_rf[:, 1]),
        r2_score(y_test["Next_Month_Typhoid"], y_pred_gb[:, 1]),
        r2_score(y_test["Next_Month_Typhoid"], y_pred_blend[:, 1])
    ]
})

# Display comparison table
print("===== Model Performance Comparison =====")
print(results)

# Identify the best model based on lowest average RMSE
results["Avg_RMSE"] = (results["Cholera_RMSE"] + results["Typhoid_RMSE"]) / 2
best_model_row = results.loc[results["Avg_RMSE"].idxmin()]

print("\n🏆 Best Overall Model Based on RMSE:")
print(best_model_row[["Model", "Cholera_RMSE", "Typhoid_RMSE", "Avg_RMSE"]])


===== Model Performance Comparison =====
               Model  Cholera_RMSE  Cholera_R2  Typhoid_RMSE  Typhoid_R2
0            XGBoost      6.104976   -0.151275      4.448413   -0.141707
1      Random Forest      5.738512   -0.017208      4.270175   -0.052048
2  Gradient Boosting      5.742481   -0.018615      4.195166   -0.015413
3   Blended Ensemble      5.917637   -0.081702      4.430399   -0.132479

🏆 Best Overall Model Based on RMSE:
Model           Gradient Boosting
Cholera_RMSE             5.742481
Typhoid_RMSE             4.195166
Avg_RMSE                 4.968824
Name: 2, dtype: object
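
Every R² value in the comparison table is negative. As a reminder of what that means, the sketch here uses made-up numbers to show that a model which always predicts the mean of the true values scores exactly 0, and that anything worse than that scores below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets (illustrative values only)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Baseline: always predict the mean of the true values
mean_pred = np.full_like(y_true, y_true.mean())

# A deliberately bad prediction: the targets in reverse order
reversed_pred = y_true[::-1]

r2_mean = r2_score(y_true, mean_pred)          # 1 - 10/10 = 0
r2_reversed = r2_score(y_true, reversed_pred)  # 1 - 40/10 = -3
print(r2_mean, r2_reversed)
```

R² is computed as 1 minus the ratio of the model's squared error to the squared error of that mean baseline, so a negative score means the model's error is the larger of the two.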


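Therefore, the chosen model is the Gradient Boosting Regressor, which has the lowest average RMSE (4.97) of the four candidates. All four models have negative R² scores, meaning none of them predicts better than the mean of the test targets on this synthetic data.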

### Exporting the model

In [46]:
# Save the fitted Gradient Boosting pipeline (preprocessing + model) to disk
joblib.dump(gb_pipe, 'cholera_gb_pipeline.joblib')
print("Model saved successfully!")


Model saved successfully!
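
A saved pipeline can later be restored with `joblib.load` and used for prediction. The sketch here shows that round trip on a small stand-in pipeline so that it runs on its own; the toy data, the two-feature layout, and the temporary file name are all assumptions, not the notebook's actual `gb_pipe` or `cholera_gb_pipeline.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the trained pipeline (random data, for illustration only)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Rainfall_mm": rng.integers(50, 250, size=40),
    "Sanitation_Index": rng.uniform(30, 90, size=40),
})
y_toy = pd.DataFrame({
    "Next_Month_Cholera": rng.integers(10, 40, size=40),
    "Next_Month_Typhoid": rng.integers(5, 30, size=40),
})
toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", MultiOutputRegressor(
        GradientBoostingRegressor(random_state=42, n_estimators=10))),
]).fit(X_toy, y_toy)

# Save to a temporary file, then load it back as a separate object
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    joblib.dump(toy_pipe, path)
    loaded = joblib.load(path)

# New data must have the same columns the pipeline was fitted on
new_rows = pd.DataFrame({"Rainfall_mm": [180], "Sanitation_Index": [45.0]})
preds = loaded.predict(new_rows)
print(preds.shape)  # one row, two outputs: cholera and typhoid
```

For the real model, the input frame would need the `Region`, `City`, and six numeric columns that the preprocessor expects.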
