## Synthetic Cholera and Typhoid Dataset and Outbreak Prediction Model

This page presents a synthetic (self-created) dataset to show how **Cholera** and **Typhoid** outbreaks can be predicted from environmental and community factors.
The data is not real, but it demonstrates how an AI model would work if actual region-specific data from Ghana were available.

### Features

**Region**: The region in Ghana where the data is from.

**City**: The city in Ghana where the data is from.

**Year**: The year the data is from.

**Month**: The month of the year (1 to 12).

**Rainfall_mm**: Average rainfall in millimeters (mm) for that month.

**Temperature_celsius**: Average temperature in degrees Celsius.

**Sanitation_Index**: A number between 0 and 100 showing how clean the area is. Lower numbers mean poorer sanitation; higher numbers mean better sanitation.

**Water_Quality_Index**: A score between 0 and 100 showing how clean the water is.

**Population_Density**: Number of people living per square kilometer.

**Waste_Management_Score**: A score between 0 and 100 showing how well waste is collected and disposed of.

**Cholera_Cases**: The number of cholera cases that occurred.

**Typhoid_Cases**: The number of typhoid cases that occurred.

**Cholera_Outbreak**: 1 if an outbreak occurred, 0 otherwise.

**Typhoid_Outbreak**: 1 if an outbreak occurred, 0 otherwise.
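
The definitions above imply simple range constraints that can be checked before training. The following is a minimal sketch under that assumption; `find_schema_problems` and the `sample` frame are hypothetical and not part of the original notebook, and the column names follow the DataFrame built in the Model section.

```python
import pandas as pd

# Hypothetical helper, not part of the original notebook.
INDEX_COLUMNS = ["Sanitation_Index", "Water_Quality_Index", "Waste_Management_Score"]
BINARY_COLUMNS = ["Cholera_Outbreak", "Typhoid_Outbreak"]


def find_schema_problems(frame: pd.DataFrame) -> list[str]:
    """Return descriptions of columns that break the feature definitions."""
    problems = []
    # Index-style scores are documented as lying between 0 and 100.
    for col in INDEX_COLUMNS:
        if col in frame and not frame[col].between(0, 100).all():
            problems.append(f"{col} has values outside 0-100")
    # Months are calendar months.
    if "Month" in frame and not frame["Month"].between(1, 12).all():
        problems.append("Month has values outside 1-12")
    # Outbreak flags are documented as 0 or 1.
    for col in BINARY_COLUMNS:
        if col in frame and not frame[col].isin([0, 1]).all():
            problems.append(f"{col} has values other than 0 and 1")
    return problems


# Made-up rows with three deliberate violations (illustration only).
sample = pd.DataFrame({
    "Month": [8, 13],
    "Sanitation_Index": [89.81, 60.13],
    "Water_Quality_Index": [31.46, 120.0],
    "Cholera_Outbreak": [0, 2],
})
print(find_schema_problems(sample))
```

Columns that are absent from the frame are skipped rather than reported, so the same check can run on partial extracts.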

### Model
Train and test a model on the dataset above to predict cholera and typhoid outbreaks.

In [38]:
import pandas as pd
import numpy as np
import random
import joblib


In [39]:
# Seed
np.random.seed(42)

# City → Region mapping
city_region_map = {
    'Accra': 'Greater Accra',
    'Cape Coast': 'Central',
    'Kumasi': 'Ashanti',
    'Ho': 'Volta',
    'Tamale': 'Northern',
    'Takoradi': 'Western',
    'Wa': 'Upper West',
    'Bolgatanga': 'Upper East'
}

# Generate random cities and map to regions
cities = np.random.choice(list(city_region_map.keys()), size=200)
regions = [city_region_map[city] for city in cities]

# Random years and months
years = np.random.choice(range(2019, 2024), size=200)
months = np.random.choice(range(1, 13), size=200)



In [40]:
# Environmental & community features
rainfall = np.random.randint(50, 250, size=200)
temperature = np.round(np.random.uniform(27, 32, size=200), 1)
population_density = np.random.randint(100, 7000, size=200)
sanitation = np.round(np.random.uniform(0.3, 0.9, size=200) * 100, 2)
water_quality_index = np.round(np.random.uniform(0.3, 0.9, size=200) * 100, 2)
waste_management_score = np.round(np.random.uniform(0.3, 0.9, size=200) * 100, 2)


In [41]:
df = pd.DataFrame({
    "Region": regions,
    "City": cities,
    "Year": years,
    "Month": months,
    "Rainfall_mm": rainfall,
    "Temperature_celsius": temperature,
    "Sanitation_Index": sanitation,
    "Water_Quality_Index": water_quality_index,
    "Population_Density": population_density,
    "Waste_Management_Score": waste_management_score
})

In [42]:
# Cholera cases
cholera_cases = ((df["Rainfall_mm"] / 10) +  ((100 - df["Sanitation_Index"]) / 10) + (df["Population_Density"] / 1000)).round().astype(int)

# Typhoid cases
typhoid_cases = ((df["Rainfall_mm"] / 15) +   ((100 - df["Water_Quality_Index"]) / 10) + (df["Population_Density"] / 1200)).round().astype(int)

df["Cholera_Cases"] = cholera_cases
df["Typhoid_Cases"] = typhoid_cases


In [43]:
# Cholera outbreak if cases >= 20
df["Cholera_Outbreak"] = (cholera_cases >= 20).astype(int)

# Typhoid outbreak if cases >= 15
df["Typhoid_Outbreak"] = (typhoid_cases >= 15).astype(int)


In [44]:
df.head()


| | Region | City | Year | Month | Rainfall_mm | Temperature_celsius | Sanitation_Index | Water_Quality_Index | Population_Density | Waste_Management_Score | Cholera_Cases | Typhoid_Cases | Cholera_Outbreak | Typhoid_Outbreak |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Upper West | Wa | 2019 | 8 | 79 | 31.3 | 89.81 | 31.46 | 1414 | 31.95 | 10 | 13 | 0 | 0 |
| 1 | Volta | Ho | 2022 | 1 | 66 | 31.1 | 60.13 | 82.21 | 4920 | 85.25 | 16 | 10 | 0 | 0 |
| 2 | Northern | Tamale | 2023 | 9 | 162 | 32.0 | 65.72 | 31.28 | 6248 | 67.0 | 26 | 23 | 1 | 1 |
| 3 | Upper West | Wa | 2022 | 11 | 111 | 32.0 | 34.02 | 82.48 | 2998 | 77.79 | 21 | 12 | 1 | 0 |
| 4 | Ashanti | Kumasi | 2023 | 6 | 133 | 29.8 | 75.0 | 61.74 | 177 | 58.89 | 16 | 13 | 0 | 0 |
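
The case counts and outbreak flags in these rows can be recomputed by hand from the formulas and thresholds in cells 42 and 43. The sketch below does that for rows 0 and 2 in plain Python; the function names are illustrative and not part of the notebook.

```python
def expected_cases(rainfall_mm, sanitation_index, water_quality_index, population_density):
    """Recompute the synthetic case counts for one row using the notebook's formulas."""
    cholera = rainfall_mm / 10 + (100 - sanitation_index) / 10 + population_density / 1000
    typhoid = rainfall_mm / 15 + (100 - water_quality_index) / 10 + population_density / 1200
    return round(cholera), round(typhoid)


def outbreak_flags(cholera_cases, typhoid_cases):
    """Apply the notebook's thresholds: cholera >= 20 cases, typhoid >= 15 cases."""
    return int(cholera_cases >= 20), int(typhoid_cases >= 15)


# Row 0 (Wa): cholera = 7.9 + 1.019 + 1.414 = 10.333, which rounds to 10
row0 = expected_cases(79, 89.81, 31.46, 1414)
# Row 2 (Tamale): cholera = 16.2 + 3.428 + 6.248 = 25.876, which rounds to 26
row2 = expected_cases(162, 65.72, 31.28, 6248)

print(row0, outbreak_flags(*row0))  # (10, 13) (0, 0)
print(row2, outbreak_flags(*row2))  # (26, 23) (1, 1)
```

Both results match the `Cholera_Cases`, `Typhoid_Cases`, and outbreak columns shown for those rows.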


In [45]:
df.describe()

| | Year | Month | Rainfall_mm | Temperature_celsius | Sanitation_Index | Water_Quality_Index | Population_Density | Waste_Management_Score | Cholera_Cases | Typhoid_Cases | Cholera_Outbreak | Typhoid_Outbreak |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 2020.875 | 6.465 | 156.9 | 29.345 | 58.69625 | 62.17365 | 3529.245 | 58.9319 | 23.33 | 17.195 | 0.715 | 0.71 |
| std | 1.410433 | 3.491378 | 57.329866 | 1.422636 | 17.51664 | 18.056698 | 1870.540543 | 17.806592 | 6.095645 | 4.357974 | 0.452547 | 0.454901 |
| min | 2019.0 | 1.0 | 50.0 | 27.1 | 30.66 | 30.38 | 177.0 | 30.3 | 10.0 | 6.0 | 0.0 | 0.0 |
| 25% | 2019.0 | 3.0 | 107.0 | 28.0 | 43.3075 | 45.8525 | 2125.5 | 42.09 | 18.75 | 14.0 | 0.0 | 0.0 |
| 50% | 2021.0 | 7.0 | 166.0 | 29.3 | 58.905 | 64.87 | 3403.0 | 59.265 | 25.0 | 17.0 | 1.0 | 1.0 |
| 75% | 2022.0 | 10.0 | 204.25 | 30.5 | 73.3775 | 77.87 | 5131.25 | 73.0 | 28.0 | 20.0 | 1.0 | 1.0 |
| max | 2023.0 | 12.0 | 249.0 | 32.0 | 89.81 | 89.88 | 6998.0 | 89.78 | 35.0 | 27.0 | 1.0 | 1.0 |


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Region                  200 non-null    object 
 1   City                    200 non-null    object 
 2   Year                    200 non-null    int64  
 3   Month                   200 non-null    int64  
 4   Rainfall_mm             200 non-null    int32  
 5   Temperature_celsius     200 non-null    float64
 6   Sanitation_Index        200 non-null    float64
 7   Water_Quality_Index     200 non-null    float64
 8   Population_Density      200 non-null    int32  
 9   Waste_Management_Score  200 non-null    float64
 10  Cholera_Cases           200 non-null    int64  
 11  Typhoid_Cases           200 non-null    int64  
 12  Cholera_Outbreak        200 non-null    int64  
 13  Typhoid_Outbreak        200 non-null    int64  
dtypes: float64(4), int32(2), int64(6), object(2)

## Data Preprocessing and Model Training

In [47]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

In [48]:
y = df[["Cholera_Outbreak", "Typhoid_Outbreak"]]
x = df.drop(["Cholera_Outbreak", "Typhoid_Outbreak"], axis=1)

# Encode City
le_city = LabelEncoder()
x['City'] = le_city.fit_transform(x['City'])

# Encode Region
le_region = LabelEncoder()
x['Region'] = le_region.fit_transform(x['Region'])
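
`LabelEncoder` numbers the distinct values in sorted order, so the integer codes carry an alphabetical ordering that has no meaning for cities or regions. The sketch below reproduces that numbering with plain `sorted()` instead of calling scikit-learn, assuming all eight cities from `city_region_map` appear in the data; `sorted_codes` is a hypothetical helper.

```python
# City -> Region mapping copied from the notebook.
city_region_map = {
    'Accra': 'Greater Accra',
    'Cape Coast': 'Central',
    'Kumasi': 'Ashanti',
    'Ho': 'Volta',
    'Tamale': 'Northern',
    'Takoradi': 'Western',
    'Wa': 'Upper West',
    'Bolgatanga': 'Upper East',
}


def sorted_codes(values):
    """Assign 0..n-1 to the unique values in sorted order (the ordering LabelEncoder uses)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


city_codes = sorted_codes(city_region_map.keys())
region_codes = sorted_codes(city_region_map.values())

print(city_codes["Cape Coast"], region_codes["Central"])  # 2 1
print(city_codes["Tamale"], region_codes["Northern"])     # 6 3
```

These pairs agree with the encoded `City` and `Region` values printed for rows 71 and 124 of `X_train`. Because the codes imply an order such as Accra < Wa, one-hot encoding is a common alternative for nominal features like these.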


In [49]:
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
print(X_train)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42))
])

     Region  City  Year  Month  Rainfall_mm  Temperature_celsius  \
71        1     2  2021     10          194                 27.8   
124       3     6  2022      8          234                 30.7   
184       2     0  2023      1          226                 29.1   
97        6     3  2019      9          154                 29.3   
149       4     1  2022     10          213                 31.8   
..      ...   ...   ...    ...          ...                  ...   
67        2     0  2023     10          185                 30.3   
192       2     0  2021      5          226                 27.9   
117       7     5  2019      2           96                 30.0   
47        5     7  2019     12          125                 27.7   
172       0     4  2023     11          202                 27.3   

     Sanitation_Index  Water_Quality_Index  Population_Density  \
71              75.82                45.94                6096   
124             86.21                88.87         

In [50]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder # Import OneHotEncoder
from sklearn.impute import SimpleImputer # Good practice for missing values
from sklearn.compose import ColumnTransformer # To apply different steps to different columns
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
# Assume your data lists (regions, cities, etc.) are defined above this line

# --- 1. Reuse the DataFrame built above ---
# `df` still holds the original string values for Region and City;
# the manual LabelEncoder step was applied to the copy `x`, not to `df`.

# --- 2. Separate Features (X) and Target (y) ---
# Use the DataFrame BEFORE manual encoding
y = df[["Cholera_Outbreak", "Typhoid_Outbreak"]]
X = df.drop(["Cholera_Outbreak", "Typhoid_Outbreak"], axis=1)

# --- 3. Identify Column Types ---
# List the names of columns that are numeric vs categorical
numeric_features = ["Year", "Month", "Rainfall_mm", "Temperature_celsius",
                    "Sanitation_Index", "Water_Quality_Index",
                    "Population_Density", "Waste_Management_Score"]
categorical_features = ["Region", "City"]

# --- 4. Create Preprocessing Pipelines for Each Type ---

# Pipeline for numeric features: Impute missing values (median) then scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Pipeline for categorical features: Impute missing values (most frequent) then OneHotEncode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)) # handle_unknown is important!
])

# --- 5. Combine Preprocessing Steps with ColumnTransformer ---
# This applies the correct transformer to the correct columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # Unlisted columns (here Cholera_Cases and Typhoid_Cases) pass through unscaled
)

# --- 6. Create the Full Pipeline ---
# Chain the preprocessor and the classifier
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(eval_metric='logloss', random_state=42))
])

# --- 7. Split Data ---
# Split the ORIGINAL X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
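
Note that `X` as built in this cell still contains `Cholera_Cases` and `Typhoid_Cases`, and the outbreak labels are defined as thresholds on exactly those columns, so a classifier can recover the labels from the case counts alone. The sketch below illustrates this on the five rows shown by `df.head()` and shows one possible way to exclude the case counts; the `leaky_columns` name and the `X_no_leak` frame are assumptions for illustration, not something the notebook does.

```python
import pandas as pd

# Five rows copied from the df.head() output; only the columns needed here.
sample = pd.DataFrame({
    "Rainfall_mm": [79, 66, 162, 111, 133],
    "Cholera_Cases": [10, 16, 26, 21, 16],
    "Typhoid_Cases": [13, 10, 23, 12, 13],
    "Cholera_Outbreak": [0, 0, 1, 1, 0],
    "Typhoid_Outbreak": [0, 0, 1, 0, 0],
})

# The labels can be rebuilt from the case counts with the notebook's thresholds.
rebuilt_cholera = (sample["Cholera_Cases"] >= 20).astype(int)
rebuilt_typhoid = (sample["Typhoid_Cases"] >= 15).astype(int)
print((rebuilt_cholera == sample["Cholera_Outbreak"]).all())
print((rebuilt_typhoid == sample["Typhoid_Outbreak"]).all())

# Hypothetical alternative: drop the case counts together with the targets,
# so the model only sees environmental and community features.
target_columns = ["Cholera_Outbreak", "Typhoid_Outbreak"]
leaky_columns = ["Cholera_Cases", "Typhoid_Cases"]
X_no_leak = sample.drop(columns=target_columns + leaky_columns)
print(list(X_no_leak.columns))
```

With the case counts removed, the model would have to learn the relationship from rainfall, sanitation, water quality, and population density, which is closer to the prediction task described at the top of the page.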


In [51]:
# Train
pipe.fit(X_train, y_train)

# Predict
y_pred = pipe.predict(X_test)


In [52]:
# Cholera
y_test_cholera = y_test["Cholera_Outbreak"]
y_pred_cholera = y_pred[:, 0]  

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Cholera Accuracy:", accuracy_score(y_test_cholera, y_pred_cholera))
print("Cholera Confusion Matrix:\n", confusion_matrix(y_test_cholera, y_pred_cholera))
print(classification_report(y_test_cholera, y_pred_cholera))

# Typhoid
y_test_typhoid = y_test["Typhoid_Outbreak"]
y_pred_typhoid = y_pred[:, 1]

print("Typhoid Accuracy:", accuracy_score(y_test_typhoid, y_pred_typhoid))
print("Typhoid Confusion Matrix:\n", confusion_matrix(y_test_typhoid, y_pred_typhoid))
print(classification_report(y_test_typhoid, y_pred_typhoid))


Cholera Accuracy: 1.0
Cholera Confusion Matrix:
 [[12  0]
 [ 0 38]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      1.00      1.00        38

    accuracy                           1.00        50
   macro avg       1.00      1.00      1.00        50
weighted avg       1.00      1.00      1.00        50

Typhoid Accuracy: 1.0
Typhoid Confusion Matrix:
 [[15  0]
 [ 0 35]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       1.00      1.00      1.00        35

    accuracy                           1.00        50
   macro avg       1.00      1.00      1.00        50
weighted avg       1.00      1.00      1.00        50
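
The reported scores follow directly from the confusion matrices. Here is a minimal sketch in plain Python, assuming scikit-learn's layout in which rows are true labels and columns are predicted labels; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(matrix):
    """Compute accuracy, precision and recall for class 1 from a 2x2 confusion matrix.

    Layout follows scikit-learn: matrix = [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


print(binary_metrics([[12, 0], [0, 38]]))  # cholera: (1.0, 1.0, 1.0)
print(binary_metrics([[15, 0], [0, 35]]))  # typhoid: (1.0, 1.0, 1.0)
```

Each matrix sums to 50 test rows, which is the default 25% hold-out of `train_test_split` applied to the 200 generated rows.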



In [54]:
# pipe = your trained pipeline
joblib.dump(pipe, 'cholera_xgb_pipeline.joblib')
print("Model saved successfully!")


Model saved successfully!
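
To reuse the saved pipeline, new data has to carry the same columns that `X` had at training time, including `Cholera_Cases` and `Typhoid_Cases`. The following is a rough sketch of that step; `make_input_frame` and `predict_outbreaks` are hypothetical helpers, and the example row simply repeats row 0 of `df.head()`.

```python
import pandas as pd

# Columns of X at fit time, in the order of the training DataFrame.
EXPECTED_COLUMNS = [
    "Region", "City", "Year", "Month", "Rainfall_mm", "Temperature_celsius",
    "Sanitation_Index", "Water_Quality_Index", "Population_Density",
    "Waste_Management_Score", "Cholera_Cases", "Typhoid_Cases",
]


def make_input_frame(records):
    """Build a DataFrame with the training columns, failing early if any are missing."""
    frame = pd.DataFrame(records)
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return frame[EXPECTED_COLUMNS]


def predict_outbreaks(frame, model_path="cholera_xgb_pipeline.joblib"):
    """Load the saved pipeline and return its predictions (one column per disease)."""
    import joblib  # imported lazily so make_input_frame works without joblib installed
    pipe = joblib.load(model_path)
    return pipe.predict(frame)


new_rows = make_input_frame([{
    "Region": "Upper West", "City": "Wa", "Year": 2019, "Month": 8,
    "Rainfall_mm": 79, "Temperature_celsius": 31.3,
    "Sanitation_Index": 89.81, "Water_Quality_Index": 31.46,
    "Population_Density": 1414, "Waste_Management_Score": 31.95,
    "Cholera_Cases": 10, "Typhoid_Cases": 13,
}])
print(new_rows.shape)

# Once the .joblib file exists in the working directory:
# predictions = predict_outbreaks(new_rows)
```

The call to `predict_outbreaks` is left commented out because it depends on the file written by the cell above.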
