## **Data Preprocessing**

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder

In [2]:
# Load data
df = pd.read_csv('Malaria_Project_Data.csv')

In [3]:
# Count missing values per column
print("Missing Values:")
df.isnull().sum()

Missing Values:


Country                           0
Latitude                          0
Longitude                         0
Year                              0
Month                             0
Population                        0
Malaria_cases                     0
Malaria_incidence                 0
Malaria_deaths                    0
Vector_species                    0
Insecticide_resistance_level      0
Drug_resistance_reported          0
Intervention_history            829
Outbreak_flag                     0
Avg_temperature                   0
Total_rainfall                    0
Rainfall_1month_lag               0
Humidity                          0
Vegetation_index                  0
Altitude                          0
Poverty_rate                      0
Literacy_rate                     0
Urban_rural                       0
Sanitation_access                 0
Bed_net_coverage                  0
Healthcare_access                 0
dtype: int64

In [4]:
# Check dataset for outliers
col = 'Malaria_incidence'
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()
print(f"Column: {col}, Outliers: {outliers}")

Column: Malaria_incidence, Outliers: 452


In [5]:
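# Remove outliers from 'Malaria_incidence' (keep rows inside the IQR bounds).
# .copy() avoids SettingWithCopyWarning when new columns are assigned later.
df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)].copy()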

In [6]:
# Fill missing values in 'Intervention_history' with 'None'
df['Intervention_history'] = df['Intervention_history'].fillna('None')

In [7]:
# Handle skewed distribution
df['Malaria_incidence_log'] = np.log1p(df['Malaria_incidence'])

---

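I applied a log1p transformation (`np.log1p(x)`, which computes log(1 + x)) to reduce the skewness of `Malaria_incidence`. I chose log1p over a plain log because it is defined at zero, so zero-incidence rows are handled without issues.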
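
As a small illustration of why log1p is convenient here, the sketch below applies it to a few made-up incidence values (the array is hypothetical, not taken from the dataset) and shows that `np.expm1` reverses it:

```python
import numpy as np

# Hypothetical incidence values, including a zero (not taken from the dataset)
incidence = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# log1p(x) = log(1 + x): defined at zero, where a plain log would give -inf
incidence_log = np.log1p(incidence)

# expm1(y) = exp(y) - 1 is the inverse of log1p
incidence_back = np.expm1(incidence_log)
```

If a model is later trained on the log-transformed column, `np.expm1` is one way to map its predictions back to the original incidence scale.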

---

In [8]:
# Drop one of two highly correlated features (Total_rainfall and Rainfall_1month_lag)
if 'Rainfall_1month_lag' in df.columns:
    df = df.drop(columns=['Rainfall_1month_lag'])

## **Feature Engineering**

In [9]:
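df['Decade'] = (df['Year'] // 10) * 10

df['Temp_Humidity_Index'] = df['Avg_temperature'] * df['Humidity']

# Sort chronologically within each country so the difference compares consecutive records
df['Month'] = df['Month'].astype(int)
df = df.sort_values(['Country', 'Year', 'Month'])
df['Rainfall_diff'] = df.groupby('Country')['Total_rainfall'].diff()
df['Rainfall_diff'] = df['Rainfall_diff'].fillna(0)  # first record of each country has no predecessor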

---

- **Decade** groups Year into decades (1990, 2000, 2010, etc.).

- **Temp_Humidity_Index** is the product of average temperature and humidity. It captures their interaction, since mosquito lifecycles depend on both together.

- **Rainfall_diff** is the change in total rainfall from the previous record. It captures sudden changes in rainfall, which could drive mosquito outbreaks.
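
The three features above can be sketched on a toy frame. The values below are invented for illustration, and computing the rainfall difference within each country (after sorting by date) is an assumption of this sketch, so that the difference never crosses a country boundary:

```python
import pandas as pd

# Toy data (hypothetical values) using the same column names as the dataset
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B'],
    'Year': [1999, 1999, 2000, 2011, 2011],
    'Month': [11, 12, 1, 5, 6],
    'Avg_temperature': [25.0, 26.0, 27.0, 30.0, 31.0],
    'Humidity': [60.0, 70.0, 80.0, 50.0, 55.0],
    'Total_rainfall': [100.0, 150.0, 90.0, 20.0, 80.0],
})

# Floor division drops the last digit of the year: 1999 -> 1990, 2011 -> 2010
toy['Decade'] = (toy['Year'] // 10) * 10

# Simple multiplicative interaction term
toy['Temp_Humidity_Index'] = toy['Avg_temperature'] * toy['Humidity']

# Change in rainfall relative to the previous record of the same country
toy = toy.sort_values(['Country', 'Year', 'Month'])
toy['Rainfall_diff'] = toy.groupby('Country')['Total_rainfall'].diff().fillna(0)
```

The first record of each country has no predecessor, so its difference is filled with 0.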

---

In [10]:
# Create datetime column for sorting and lagging
df['Month'] = df['Month'].astype(int)
df['Date'] = pd.to_datetime(df['Year'].astype(str) + '-' + df['Month'].astype(str) + '-01')
df.sort_values(by=['Country', 'Date'], inplace=True)

In [11]:
# Create lagged features
features_to_lag = ['Total_rainfall', 'Avg_temperature', 'Humidity', 'Temp_Humidity_Index']
lag_months = [1, 2]

for feature in features_to_lag:
    for lag in lag_months:
        df[f'{feature}_lag{lag}'] = df.groupby('Country')[feature].shift(lag)

---

**Mosquito breeding cycles** and **malaria parasite development** take time, so:

- A rainfall spike today would probably cause increased cases 1–2 months later, not immediately.

- Temperature and humidity also influence mosquito lifespan and biting rate with a delay.

These delays motivated the lagged features: 1- and 2-month lags of rainfall, temperature, humidity, and the temperature-humidity index, computed separately for each country.
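
To make the effect of a per-country lag concrete, here is a minimal sketch on invented data (two hypothetical countries, three months each). It shows that `groupby(...).shift(lag)` leaves NaN at the start of every country rather than borrowing values from the previous country:

```python
import pandas as pd

# Toy monthly rainfall for two hypothetical countries
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2000-01-01', '2000-02-01', '2000-03-01'] * 2),
    'Total_rainfall': [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

# Lags are only meaningful on chronologically sorted data
toy = toy.sort_values(['Country', 'Date'])

# shift(lag) within each country: row t receives the value from row t - lag
for lag in (1, 2):
    toy[f'Total_rainfall_lag{lag}'] = toy.groupby('Country')['Total_rainfall'].shift(lag)
```

The first `lag` rows of each country have no earlier value to draw from, which is why rows with NaN lags have to be dropped or imputed before modelling.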

---

In [12]:
# Drop rows with NaNs from lagging
df.dropna(subset=[f'{f}_lag{l}' for f in features_to_lag for l in lag_months], inplace=True)

In [13]:
# One-hot encode Month
month_ohe = pd.get_dummies(df['Month'], prefix='Month')
df = pd.concat([df, month_ohe], axis=1)

---

The **Month** column was one-hot encoded because seasonality is critical for the time series forecasting part of this project.

Malaria outbreaks are often seasonal (for example, rainy seasons → more mosquitoes → more cases).

Keeping Month as a plain number (1–12) would imply a linear order that may mislead some models: on the calendar, January (1) is adjacent to December (12), yet numerically it is closer to June (6).

So, **one-hot encoding Month lets models learn seasonal patterns freely, without assuming incorrect numerical distances between months.**
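
The distance argument can be checked with a small sketch. The three months below are chosen only for illustration: as plain numbers, January is further from December than from June, while after one-hot encoding every pair of distinct months is equally far apart:

```python
import pandas as pd

# Three illustrative months: January, June, December
months = pd.Series([1, 6, 12], name='Month')

# Numeric encoding: January looks far from December and close to June
gap_jan_dec = abs(1 - 12)  # 11
gap_jan_jun = abs(1 - 6)   # 5

# One-hot encoding: one indicator column per month
month_ohe = pd.get_dummies(months, prefix='Month').astype(int)

# Squared Euclidean distance between the one-hot rows
jan, jun, dec = month_ohe.to_numpy()
dist_jan_jun = int(((jan - jun) ** 2).sum())
dist_jan_dec = int(((jan - dec) ** 2).sum())
```

Both one-hot distances are the same, so the encoding itself imposes no ordering and the model has to learn which months behave alike from the data.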

---

In [14]:
# Drop Month, since original column is now encoded
df = df.drop(columns=['Month'])

In [15]:
# Ordinal encoding for Insecticide_resistance_level
ordinal_map = {'Low': 0, 'Moderate': 1, 'High': 2}
df['Insecticide_resistance_level_encoded'] = df['Insecticide_resistance_level'].map(ordinal_map)

# Label encoding for Drug_resistance_reported
df['Drug_resistance_reported_encoded'] = df['Drug_resistance_reported'].map({'No': 0, 'Yes': 1})

# One-hot encoding for Vector_species
df = pd.get_dummies(df, columns=['Vector_species'], prefix='Vector')

# One-hot encoding for Intervention_history
df = pd.get_dummies(df, columns=['Intervention_history'], prefix='Intervene')

In [16]:
# Drop original categorical columns
df = df.drop(columns=['Insecticide_resistance_level', 'Drug_resistance_reported'])

In [17]:
# Reset index
df = df.reset_index(drop=True)

In [18]:
# Quick check
df.head()

| | Country | Latitude | Longitude | Year | Population | Malaria_cases | Malaria_incidence | Malaria_deaths | Outbreak_flag | Avg_temperature | ... | Insecticide_resistance_level_encoded | Drug_resistance_reported_encoded | Vector_An. arabiensis | Vector_An. funestus | Vector_An. gambiae | Intervene_Health education | Intervene_Larval control | Intervene_Net distribution | Intervene_None | Intervene_Spraying |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Burkina Faso | 5.94 | 40.86 | 1990 | 92649622 | 741601 | 81.37 | 19017 | 1 | 26.4 | ... | 2 | 0 | True | False | False | False | True | False | False | False |
| 1 | Burkina Faso | 5.94 | 40.86 | 1990 | 199953233 | 472963 | 240.32 | 6466 | 0 | 32.1 | ... | 2 | 0 | False | True | False | True | False | False | False | False |
| 2 | Burkina Faso | 5.94 | 40.86 | 1990 | 112752093 | 29872 | 6.01 | 1009 | 0 | 33.6 | ... | 1 | 0 | False | False | True | False | False | True | False | False |
| 3 | Burkina Faso | 5.94 | 40.86 | 1990 | 54664948 | 90316 | 10.35 | 3811 | 0 | 23.5 | ... | 0 | 0 | True | False | False | True | False | False | False | False |
| 4 | Burkina Faso | 5.94 | 40.86 | 1990 | 104799026 | 249847 | 33.3 | 7898 | 0 | 27.0 | ... | 1 | 0 | True | False | False | False | False | False | True | False |


In [19]:
df.to_csv('Preprocessed_Malaria_Data.csv', index=False)