## **Data Preprocessing**

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load data
df = pd.read_csv('Malaria_Project_Data.csv')

In [3]:
# Check each column for missing values
print("Missing Values:")
df.isnull().sum()

Missing Values:


Country                           0
Latitude                          0
Longitude                         0
Year                              0
Month                             0
Population                        0
Malaria_cases                     0
Malaria_incidence                 0
Malaria_deaths                    0
Vector_species                    0
Insecticide_resistance_level      0
Drug_resistance_reported          0
Intervention_history            829
Outbreak_flag                     0
Avg_temperature                   0
Total_rainfall                    0
Rainfall_1month_lag               0
Humidity                          0
Vegetation_index                  0
Altitude                          0
Poverty_rate                      0
Literacy_rate                     0
Urban_rural                       0
Sanitation_access                 0
Bed_net_coverage                  0
Healthcare_access                 0
dtype: int64

In [4]:
# Fill missing values in 'Intervention_history' with 'None'
df['Intervention_history'] = df['Intervention_history'].fillna('None')

In [5]:
# Handle skewed distribution
df['Malaria_incidence_log'] = np.log1p(df['Malaria_incidence'])

---

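I applied a log1p transformation (np.log1p(x), which computes log(1 + x)) to reduce the skewness of Malaria_incidence. Unlike a plain logarithm, log1p is defined at zero, so records with zero incidence need no special handling.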
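
As a minimal sketch of that reasoning, the snippet below applies log1p to a small right-skewed series that contains a zero. The values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incidence values, including a zero (not from the dataset)
incidence = pd.Series([0.0, 1.5, 3.0, 8.0, 20.0, 45.0, 600.0])

# log1p(x) = log(1 + x), so a zero input maps to zero instead of -inf
log1p_incidence = np.log1p(incidence)

# Compare sample skewness before and after the transform
print("skew before:", incidence.skew())
print("skew after: ", log1p_incidence.skew())

# np.expm1 inverts log1p if values are needed on the original scale again
restored = np.expm1(log1p_incidence)
```

If a model is later trained on the log-transformed target, np.expm1 is the matching inverse for reporting predictions in the original units.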

---

In [6]:
# Drop one of two highly correlated features (Total_rainfall and Rainfall_1month_lag)
if 'Rainfall_1month_lag' in df.columns:
    df = df.drop(columns=['Rainfall_1month_lag'])
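
---

The notebook does not show how the correlation between the two rainfall columns was measured. One way such a check could look is sketched below on synthetic data; the toy frame, the random seed, and the 0.9 cutoff are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two rainfall columns (not the project data)
rng = np.random.default_rng(0)
rain = rng.gamma(shape=2.0, scale=50.0, size=200)
toy = pd.DataFrame({
    "Total_rainfall": rain,
    # Near-duplicate of Total_rainfall by construction
    "Rainfall_1month_lag": rain + rng.normal(0.0, 5.0, size=200),
})

# Series.corr computes the Pearson correlation by default
corr = toy["Total_rainfall"].corr(toy["Rainfall_1month_lag"])
print("Pearson correlation:", corr)

THRESHOLD = 0.9  # assumed cutoff, not a value from the notebook

# Drop the redundant column only when the correlation exceeds the cutoff
if abs(corr) > THRESHOLD:
    toy = toy.drop(columns=["Rainfall_1month_lag"])
```

---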

## **Feature Engineering**

In [7]:
# Group years into decades (1990, 2000, 2010, ...)
df['Decade'] = (df['Year'] // 10) * 10

# Interaction term between temperature and humidity
df['Temp_Humidity_Index'] = df['Avg_temperature'] * df['Humidity']

# Change in rainfall relative to the previous row;
# the first row has no predecessor, so its NaN is replaced with 0
df['Rainfall_diff'] = df['Total_rainfall'].diff().fillna(0)

---

- **Decade** groups Year into decades (1990, 2000, 2010, etc.).

- **Temp_Humidity_Index** captures the temperature-humidity interaction, since mosquito lifecycles depend on both temperature and humidity together.

- **Rainfall_diff** captures sudden changes in rainfall from one record to the next, which could drive mosquito outbreaks.
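
Because Rainfall_diff is computed over the whole table in row order, the first row of each country is compared with the last row of the country before it. A per-country variant is sketched below on toy data; it assumes rows can be sorted by Country, Year and Month, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two countries, three months each
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [1990] * 6,
    "Month": [1, 2, 3, 1, 2, 3],
    "Total_rainfall": [100.0, 120.0, 90.0, 10.0, 15.0, 40.0],
})

# Make sure rows are in chronological order within each country
toy = toy.sort_values(["Country", "Year", "Month"]).reset_index(drop=True)

# Global difference: B's first row is differenced against A's last row
toy["diff_global"] = toy["Total_rainfall"].diff().fillna(0)

# Per-country difference: the first row of every country starts at 0
toy["diff_by_country"] = (
    toy.groupby("Country")["Total_rainfall"].diff().fillna(0)
)

print(toy[["Country", "Month", "diff_global", "diff_by_country"]])
```

In the toy output the two columns differ only at the first row of country B, which is exactly the boundary case the grouped version is meant to avoid.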

---

In [8]:
# One-hot encode Month
month_ohe = pd.get_dummies(df['Month'], prefix='Month')
df = pd.concat([df, month_ohe], axis=1)

---

The **Month** column was one-hot encoded because seasonality is critical for the time series forecasting part of this project.

Malaria outbreaks are often seasonal (for example, rainy seasons → more mosquitoes → more cases).

Therefore, keeping Month as a simple number (1–12) would imply a linear order that may mislead some models. On the calendar, January (1) is adjacent to December (12) and far from June (6), but numerically 1 is much closer to 6 than to 12.

So, **one-hot encoding Month lets models learn seasonal patterns freely, without assuming incorrect numerical distances between months.**
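
The distance argument can be checked on a three-month toy example. This is only an illustrative sketch; the variable names are hypothetical and `dtype=int` is used purely to make the arithmetic readable.

```python
import numpy as np
import pandas as pd

months = pd.Series([1, 6, 12], name="Month")  # January, June, December

# Integer encoding: distance is the plain numeric gap
int_gap_jan_dec = abs(1 - 12)  # January looks far from December
int_gap_jan_jun = abs(1 - 6)   # January looks closer to June

# One-hot encoding: one indicator column per month
ohe = pd.get_dummies(months, prefix="Month", dtype=int)
jan, jun, dec = ohe.to_numpy()  # one row vector per month

# Euclidean distance between any two distinct months is identical
dist_jan_dec = np.linalg.norm(jan - dec)
dist_jan_jun = np.linalg.norm(jan - jun)

print(int_gap_jan_dec, int_gap_jan_jun)
print(dist_jan_dec, dist_jan_jun)
```

With one-hot columns no pair of months is treated as closer than any other, so the model has to learn which months behave alike from the data itself.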

---

In [9]:
# Drop Month, since original column is now encoded
df = df.drop(columns=['Month'])

In [10]:
# Reset index
df = df.reset_index(drop=True)

In [11]:
# Quick check
df.head()

| | Country | Latitude | Longitude | Year | Population | Malaria_cases | Malaria_incidence | Malaria_deaths | Vector_species | Insecticide_resistance_level | ... | Month_3 | Month_4 | Month_5 | Month_6 | Month_7 | Month_8 | Month_9 | Month_10 | Month_11 | Month_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Nigeria | 2.47 | 44.29 | 1990 | 143923543 | 330873 | 584.61 | 15019 | An. arabiensis | Moderate | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | Nigeria | 2.47 | 44.29 | 1990 | 97549065 | 245357 | 37.45 | 9857 | An. gambiae | Moderate | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | Nigeria | 2.47 | 44.29 | 1990 | 194477233 | 261763 | 41.45 | 5472 | An. funestus | Moderate | ... | True | False | False | False | False | False | False | False | False | False |
| 3 | Nigeria | 2.47 | 44.29 | 1990 | 79631839 | 993608 | 470.89 | 37775 | An. gambiae | Low | ... | False | True | False | False | False | False | False | False | False | False |
| 4 | Nigeria | 2.47 | 44.29 | 1990 | 10125408 | 45100 | 21.47 | 1916 | An. arabiensis | High | ... | False | False | True | False | False | False | False | False | False | False |


In [12]:
df.to_csv('Preprocessed_Malaria_Data.csv', index=False)