# Dengue Forecasting Project – Introduction

This project was developed as a hands-on application of the skills acquired during the IBM Data Science Professional Certificate. To put my knowledge into practice, I chose to work with a real-world dataset sourced from a friend’s undergraduate thesis in Brazil.

The dataset, made publicly available on Harvard’s Dataverse platform, contains time series data on dengue cases and climate variables and was assembled with the objective of predicting future outbreaks. The original academic project was carried out by students from São Paulo, Brazil, and gained media attention for its innovative use of machine learning in public health forecasting.

### Data Sources
+ Dataset source: [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NN7EOY)
+ Original project news coverage: [Metrópoles Dengue Algorithm](https://www.metropoles.com/sao-paulo/estudantes-de-sp-criam-algoritmo-capaz-de-prever-casos-de-dengue)

### Objective
The main goal of this project is to clean, explore, and model the data to predict dengue cases using machine learning techniques. The project is structured into the following steps:

1. Data Loading and Cleaning – Apply best practices to handle missing values, inconsistencies, and prepare features.

2. Exploratory Data Analysis (EDA) – Visualize time trends, climate variables, and correlations with dengue outbreaks.

3. Modeling and Forecasting – Train and optimize a Random Forest model to predict future dengue cases.

4. Evaluation – Assess model performance and visualize predictions.

By the end of this notebook, the project aims to demonstrate not only a functional dengue forecasting pipeline but also a clear application of end-to-end data science methodologies.

---

# Data Cleaning
Before performing any analysis or modeling, it is crucial to clean and preprocess the data to ensure its quality and usability. The raw dataset presented several issues, such as missing values, invalid entries, and potential data entry errors. The following steps were taken to clean and prepare the data for analysis and forecasting:

### 1. Loading the Data
The dataset was read from a .tab file using `pandas.read_csv()` with `\t` as the separator. Proper error handling was implemented to ensure the process fails gracefully in case of missing or malformed files.
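
The loading step could look roughly like the sketch below. The function name `load_dataset`, the return-`None` convention, and the in-memory sample are assumptions made for illustration; the actual notebook may handle errors differently.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read a tab-separated file and return a DataFrame, or None on failure.

    `source` may be a path or a file-like object. The function name and
    the return-None convention are assumptions for this sketch.
    """
    try:
        return pd.read_csv(source, sep="\t")
    except FileNotFoundError:
        print(f"File not found: {source}")
    except pd.errors.EmptyDataError:
        print("The file is empty.")
    except pd.errors.ParserError as exc:
        print(f"Malformed file: {exc}")
    return None


# Tiny in-memory stand-in for the real .tab file (toy values).
sample = io.StringIO("dt_notificacao\tqntd_casos\n2024-01-05\t12\n2024-01-06\t7\n")
df = load_dataset(sample)

# A missing file is reported and yields None instead of raising.
missing = load_dataset("no_such_file.tab")
```

Returning `None` keeps the failure visible to the caller without a traceback; raising a custom exception would be an equally reasonable choice.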

### 2. Initial Exploration
We examined the structure of the dataset using:

* `.info()` to understand column types and null values,

* `.head()` to preview the first rows,

* `.isnull().sum()` to assess missing data.

### 3. Handling 'NULL' Strings
Some numeric columns were stored as strings because missing entries had been written as the literal text `'NULL'`. These entries were replaced with `pandas.NA` so that pandas treats them as missing values in subsequent steps.
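
As a minimal sketch of this replacement, using a toy column (the values are invented and the follow-up conversion with `pd.to_numeric` is an assumption about how the column becomes numeric again):

```python
import pandas as pd

# Toy column where a missing reading was written as the string 'NULL'.
df = pd.DataFrame({"temp_media_mensal": ["24.1", "NULL", "22.8"]})

# Swap the literal string for a real missing value.
df["temp_media_mensal"] = df["temp_media_mensal"].replace("NULL", pd.NA)

# Convert the column to a numeric dtype; anything unparseable becomes missing.
df["temp_media_mensal"] = pd.to_numeric(df["temp_media_mensal"], errors="coerce")
```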

### 4. Handling Suspicious Zeros
In climate-related columns such as `temp_media_mensal` (average monthly temperature), zero values are unrealistic and most likely indicate missing data. These zeros were replaced with NaN.
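
A short sketch of the zero replacement on toy data (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temp_media_mensal": [23.5, 0.0, 25.1, 0.0]})

# Treat physically implausible zeros as missing readings.
df["temp_media_mensal"] = df["temp_media_mensal"].replace(0, np.nan)
```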

### 5. Filling Missing Values
For the key climate columns:

* `precipitacao_total_mensal` (rainfall),

* `temp_media_mensal` (temperature),

* `vento_vlc_media_mensal` (wind speed),

we applied linear interpolation, followed by forward fill (`ffill`) and backward fill (`bfill`), to impute missing values.
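
The imputation chain can be sketched as follows on a toy frame (values invented). Interpolation fills interior gaps, while the forward and backward fills cover gaps at the edges of the series that interpolation cannot reach.

```python
import numpy as np
import pandas as pd

climate_cols = [
    "precipitacao_total_mensal",
    "temp_media_mensal",
    "vento_vlc_media_mensal",
]

df = pd.DataFrame({
    "precipitacao_total_mensal": [np.nan, 120.0, np.nan, 80.0],
    "temp_media_mensal": [24.0, np.nan, 26.0, np.nan],
    "vento_vlc_media_mensal": [2.0, 2.5, np.nan, 3.5],
})

# Interior gaps: linear interpolation. Edge gaps: forward fill, then backward fill.
df[climate_cols] = df[climate_cols].interpolate(method="linear").ffill().bfill()
```

Note that the leading gap in `precipitacao_total_mensal` has no earlier value to interpolate from, so it is only filled by the final `bfill`.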

### 6. Removing Duplicates
Duplicate rows were identified and removed to prevent data leakage or bias in model training.

### 7. Creating Date Features
The `dt_notificacao` column (notification date) was converted to datetime format. Rows with invalid or missing dates were dropped. Several temporal features were then derived, including:

* `year`, `month`, `day`,

* `day_of_week`, `week_of_year`, `day_of_year`.

These features can improve temporal understanding and forecasting.
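
A minimal sketch of the date handling, on invented dates. Using the ISO calendar for `week_of_year` is an assumption; the notebook may compute the week differently.

```python
import pandas as pd

df = pd.DataFrame({"dt_notificacao": ["2024-02-29", "not a date", None, "2023-01-01"]})

# Invalid strings and missing values become NaT, then those rows are dropped.
df["dt_notificacao"] = pd.to_datetime(df["dt_notificacao"], errors="coerce")
df = df.dropna(subset=["dt_notificacao"]).reset_index(drop=True)

dates = df["dt_notificacao"].dt
df["year"] = dates.year
df["month"] = dates.month
df["day"] = dates.day
df["day_of_week"] = dates.dayofweek  # Monday=0 ... Sunday=6
df["week_of_year"] = dates.isocalendar().week.astype(int)  # ISO week (assumption)
df["day_of_year"] = dates.dayofyear
```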

### 8. Creating Binary Indicator Features (has_ Columns)
We created binary features to flag the presence (1) or absence (0) of key symptoms and environmental measurements, such as:

* `has_precipitacao`, `has_vento`

* `has_febre`, `has_vomito`, `has_nausea`, etc.

These simplify the interpretation of sparse symptom data and help improve model robustness.
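
One way such flags might be built is sketched below. The raw symptom column names (`febre`, `vomito`) and the rule "present means value greater than zero" are assumptions for illustration; only the `has_` names come from the project.

```python
import pandas as pd

df = pd.DataFrame({
    "precipitacao_total_mensal": [0.0, 85.2, 140.0],
    "vento_vlc_media_mensal": [2.1, 0.0, 3.4],
    "febre": [1, 0, 1],   # hypothetical raw symptom column
    "vomito": [0, 0, 1],  # hypothetical raw symptom column
})

# Map each source column to the name of its binary indicator.
source_to_flag = {
    "precipitacao_total_mensal": "has_precipitacao",
    "vento_vlc_media_mensal": "has_vento",
    "febre": "has_febre",
    "vomito": "has_vomito",
}

for source, flag in source_to_flag.items():
    # Assumed rule: a positive value counts as "present"; missing counts as absent.
    df[flag] = (df[source].fillna(0) > 0).astype(int)
```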

### 9. Final Checks
After all transformations:

* `.info()` and `.isnull().sum()` were used to confirm the absence of nulls.

We also calculated the percentage of zero values in critical columns to identify remaining sparsity in the data.
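
These final checks could be expressed along these lines (toy indicator columns, invented values):

```python
import pandas as pd

df = pd.DataFrame({
    "has_febre": [1, 0, 0, 1],
    "has_vomito": [0, 0, 0, 1],
})

# Total number of remaining nulls across the frame.
remaining_nulls = int(df.isnull().sum().sum())

# Share of zeros per column, in percent.
zero_pct = (df.eq(0).mean() * 100).round(1)
```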

---

# Exploratory Data Analysis (EDA)

Exploratory Data Analysis is a critical step to understand the data's characteristics, identify patterns, and uncover insights that will guide the feature engineering and modeling process. This section provides an overview of the key findings from our analysis of the dengue dataset.

## Dengue Cases Over Time

The time series plot reveals the temporal dynamics of dengue cases. We can observe a clear seasonality, with distinct peaks occurring annually. These peaks suggest a strong correlation with specific periods of the year, likely influenced by climate factors. The data also highlights a significant outbreak in 2024, which surpasses previous years in magnitude.

<div style="text-align: center;">
  <img src="../models/dengue_cases_over_time.png" alt="Dengue Cases Over Time" width="600">
</div>

## Distribution of Dengue Cases by Month

A box plot was used to examine the distribution of cases across different months. This visualization confirms the seasonality observed in the time series plot, showing that cases are predominantly high during the first few months of the year (specifically, months 1 to 5). This period aligns with the Brazilian summer and rainy season, reinforcing the link between climate and dengue incidence. The plot also shows a high number of outliers, representing months with unusually large outbreaks, such as the one in 2024.

<div style="text-align: center;">
  <img src="../models/cases_by_month.png" alt="Distribution of Dengue Cases by Month" width="600">
</div>

## Monthly Dengue Cases per Year

This heatmap provides a more detailed look at the seasonal patterns by showing the number of dengue cases for each month of each year. The color intensity clearly shows that the highest case numbers consistently fall between months 1 and 5. The year 2024 stands out with particularly intense values, especially from February to May, visually representing the scale of the recent outbreak.

<div style="text-align: center;">
  <img src="../models/seasonality_heatmap.png" alt="Monthly Dengue Cases per Year" width="600">
</div>
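
The table behind a year-by-month heatmap like this one can be sketched with a pivot. The case counts below are invented, and the column layout (`year`, `month`, `qntd_casos`) is an assumption about the prepared data.

```python
import pandas as pd

monthly = pd.DataFrame({
    "year": [2023, 2023, 2024, 2024],
    "month": [1, 2, 1, 2],
    "qntd_casos": [100, 150, 400, 900],  # toy values
})

# Rows are years, columns are months, cells are total cases.
pivot = monthly.pivot_table(
    index="year", columns="month", values="qntd_casos", aggfunc="sum"
)
```

Passing `pivot` to `seaborn.heatmap` would then render one colored cell per year and month.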

## Distribution of Symptoms in Dengue Cases

A pie chart was generated to visualize the distribution of reported symptoms. Fever is the most common symptom, accounting for 56.8% of the symptom reports shown in the chart, followed by nausea (27.3%) and vomiting (15.9%). Because these shares sum to 100%, they describe each symptom's share of all reported symptoms rather than the percentage of cases presenting it. This distribution gives a general picture of the clinical profile of the reported dengue cases.

<div style="text-align: center;">
  <img src="../models/symptoms_pie_chart.png" alt="Distribution of Symptoms in Dengue Cases" width="600">
</div>

## Correlation between Cases and Climate Variables

A correlation heatmap was created to quantify the relationships between the number of dengue cases and key climate variables. The plot shows a low correlation between `qntd_casos` and the climate variables at the same time step. However, it's worth noting the strong correlations among the climate variables themselves, particularly between `temp_media_mensal` (mean monthly temperature) and `vento_vlc_media_mensal` (mean monthly wind velocity), with a correlation of 0.85. This suggests that these variables might be interdependent.

<div style="text-align: center;">
  <img src="../models/correlation_heatmap.png" alt="Correlation between Cases and Climate Variables" width="600">
</div>
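
A correlation matrix of this kind can be computed directly with pandas. The sketch below uses invented numbers in which temperature and wind move in lockstep, so their correlation is exactly 1; it does not reproduce the 0.85 reported above.

```python
import pandas as pd

df = pd.DataFrame({
    "qntd_casos": [10, 40, 30, 20, 50],
    "temp_media_mensal": [20.0, 22.0, 24.0, 26.0, 28.0],
    "vento_vlc_media_mensal": [1.0, 1.5, 2.0, 2.5, 3.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")
```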

## Correlation Matrix (Relevant Features)

To address the low contemporaneous correlation, we created lagged features for both dengue cases and climate variables. In the correlation matrix that includes these lagged features, the current number of cases (`qntd_casos`) shows a much stronger correlation with lagged climate variables such as `temp_media_lag3` and `vento_media_lag3`. This is a key finding: climate conditions from previous months are more strongly associated with current dengue cases than the conditions of the current month, which is consistent with a delayed effect of climate on the spread of the disease. The heatmap also shows that the lagged climate features are highly correlated with one another, so they carry partly overlapping information.

<div style="text-align: center;">
  <img src="../models/lag_features_heatmap.png" alt="Correlation matrix (Relevant Features)" width="600">
</div>

### Key EDA Takeaways

The exploratory analysis confirmed the strong seasonal nature of dengue outbreaks in Brazil, with climate variables such as temperature and wind velocity showing a delayed impact on the number of cases. These findings support the decision to include lagged features in our predictive models, as they capture essential temporal dynamics for accurate forecasting.

---

# Machine Learning: Dengue Forecasting

With a clean and well-understood dataset, the next step is to build a predictive model to forecast future dengue cases. This section details the process of training, evaluating, and optimizing a machine learning model for this time-series forecasting task.

### Feature Engineering: Lag Features

Building on the insights from our EDA, which showed a strong correlation between dengue cases and past climate conditions, we engineered **lag features**. These features represent the values of key variables from previous months. Specifically, we created lag features for 3 and 4 months prior (`lag3` and `lag4`) for dengue cases, precipitation, temperature, and wind velocity. This allows the model to capture the delayed effect of these factors on the spread of the disease.
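
A minimal sketch of lag creation with `shift`, assuming one row per month in chronological order. The toy values and the name `casos_lag3` are illustrative; `temp_media_lag3` follows the naming used in the EDA section.

```python
import pandas as pd

monthly = pd.DataFrame({
    "qntd_casos": [10, 20, 30, 40, 50, 60],
    "temp_media_mensal": [25.0, 26.0, 27.0, 24.0, 22.0, 21.0],
})

for lag in (3, 4):
    # shift(lag) moves each value `lag` rows down, so row t sees the value from t - lag.
    monthly[f"casos_lag{lag}"] = monthly["qntd_casos"].shift(lag)
    monthly[f"temp_media_lag{lag}"] = monthly["temp_media_mensal"].shift(lag)

# The first rows have no history to look back on and are dropped.
model_data = monthly.dropna().reset_index(drop=True)
```

Because the longest lag is 4 months, the first four rows are lost; this is the usual cost of lag features on short series.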

### Baseline Model

To establish a benchmark for performance, we first trained a **Random Forest Regressor** with default parameters. Random Forest is a robust and effective model for this type of problem, as it can handle non-linear relationships and different data types without extensive preprocessing. The model was trained on a time-series split of the data, with 80% used for training and 20% for testing.

The forecast plot suggests that the baseline model picks up the seasonal pattern and the general trend of dengue cases, although the metrics show that it explains only a small share of the variance.

**Model Evaluation:**
- **Mean Squared Error (MSE):** 6.2 million
- **R-squared (R²):** 0.14

<div style="text-align: center;">
  <img src="../models/baseline_dengue_forecast.png" alt="Baseline Dengue Forecast" width="600">
</div>
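
The baseline workflow can be sketched as below. The data is synthetic (a noisy seasonal curve), the feature names are placeholders, and the resulting scores have no relation to the figures reported above; the point is the chronological 80/20 split and the metric calls.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic monthly series standing in for the real data.
rng = np.random.default_rng(0)
n = 60
months = np.arange(n)
data = pd.DataFrame({
    "casos_lag3": 100 + 50 * np.sin(2 * np.pi * (months - 3) / 12) + rng.normal(0, 5, n),
    "temp_media_lag3": 24 + 3 * np.cos(2 * np.pi * (months - 3) / 12) + rng.normal(0, 0.5, n),
})
data["qntd_casos"] = 100 + 50 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 5, n)

features = ["casos_lag3", "temp_media_lag3"]

# Chronological split: the first 80% of rows train, the last 20% test. No shuffling.
split = int(len(data) * 0.8)
train, test = data.iloc[:split], data.iloc[split:]

# Default hyperparameters; random_state is fixed only for reproducibility.
model = RandomForestRegressor(random_state=42)
model.fit(train[features], train["qntd_casos"])
pred = model.predict(test[features])

mse = mean_squared_error(test["qntd_casos"], pred)
r2 = r2_score(test["qntd_casos"], pred)
```

Splitting by position rather than at random keeps every test month strictly later than every training month, which is what a forecast faces in practice.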

### Hyperparameter Optimization

While the baseline model captures the overall seasonality, its test R² of 0.14 leaves considerable room for improvement, so we tuned its hyperparameters. We used **Randomized Search Cross-Validation (`RandomizedSearchCV`)** to efficiently search for the best combination of parameters for the Random Forest model.

A key aspect of this step was using a `TimeSeriesSplit` cross-validation strategy. This ensures that the model is always trained on past data and evaluated on future data, preventing data leakage and providing a more realistic estimate of its performance on new, unseen data. We ran a "complete" search to explore a wide range of parameters.

The best parameters found were:
- `n_estimators`: 500
- `min_samples_split`: 5
- `min_samples_leaf`: 2
- `max_samples`: 0.8
- `max_features`: 'sqrt'
- `max_depth`: None
- `bootstrap`: True

The R² in cross-validation was **0.30**, suggesting a promising potential for the model with these parameters.
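
The search setup described in this section can be sketched as follows. The parameter grid, `n_iter`, the number of splits, and the synthetic data are all assumptions chosen to keep the example small; they are not the values used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic, time-ordered data standing in for the lagged feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X[:, 0] * 3 + rng.normal(scale=0.5, size=80)

# Hypothetical search space; max_samples requires bootstrap=True.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", 1.0],
    "max_samples": [0.8, None],
}

# Each fold trains on an initial segment and validates on the segment that follows it.
cv = TimeSeriesSplit(n_splits=3)

search = RandomizedSearchCV(
    RandomForestRegressor(bootstrap=True, random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="r2",
    cv=cv,
    random_state=42,
)
search.fit(X, y)

# Sanity check: in every fold, all training indices precede all validation indices.
folds_are_ordered = all(tr.max() < te.min() for tr, te in cv.split(X))
```

After fitting, `search.best_params_` holds the selected combination and `search.best_score_` the mean cross-validated R².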

### Optimized Model Results

Visually, the optimized model's forecast improves on the baseline. As seen in the plot below, the predictions align more closely with the actual peaks and valleys of the dengue case numbers, especially for the large outbreak in 2024.

**Model Evaluation:**
- **Mean Squared Error (MSE):** ~6.2 million
- **R-squared (R²):** 0.14

While the R² score on the test set is similar to the baseline model's (0.14), the visual comparison suggests that the optimized model fits high-magnitude events better, particularly the peak in 2024. The low R² for both models may indicate that a significant portion of the variance in dengue cases is not captured by the features used, or that the models are unstable on this specific test set. The optimized model's cross-validation R² (0.30) is more encouraging, although it is not directly comparable to the test-set scores.

<div style="text-align: center;">
  <img src="../models/optimized_dengue_forecast_complete.png" alt="Optimized Dengue Forecast" width="600">
</div>

### Summary

The modeling process highlighted the importance of time-aware validation and feature engineering in forecasting tasks. While the R² scores remained modest, the use of lag features and hyperparameter optimization led to a more responsive model that better captures dengue seasonality and extreme case surges. This lays a strong foundation for further model experimentation, including alternative regressors and ensemble methods.

---

## Conclusion

This project successfully demonstrates a complete end-to-end data science pipeline for analyzing and forecasting dengue cases. Starting with raw data, we performed a series of crucial steps to transform it into a usable format, uncovering key insights along the way that guided our predictive modeling efforts.

The Exploratory Data Analysis (EDA) confirmed a strong seasonal pattern in dengue outbreaks and revealed a critical finding: a significant delay between climate variables and the incidence of new cases. This insight was the foundation for our feature engineering, where we successfully incorporated lagged features into our models.

Our machine learning efforts, using a Random Forest Regressor, established a baseline for forecasting. While the model's final R-squared score on the test set was low (0.14), a visual inspection of the predictions suggests that it captures the overall seasonal trends and the timing of major outbreaks, including the one in 2024. The gap between the cross-validation R-squared (0.30) and the test-set R-squared (0.14) suggests that there is still significant room for improvement.

## Future Work

For future work, the project can be expanded in several promising directions:
* **Exploring More Advanced Models:** The next logical step is to implement more powerful models like **XGBoost**, which is well-suited for capturing complex interactions and may yield a higher predictive accuracy.
* **Enriching the Dataset:** Incorporating additional features, such as population density, public health intervention data, or other socioeconomic indicators, could provide the model with a more complete picture of the factors that influence dengue spread.

## Final Thoughts

In summary, this project provides a robust foundation for dengue forecasting, delivering valuable insights and a working model that can be refined to better support public health authorities in their efforts to combat the disease.  
With further improvements, this forecasting approach has the potential to become a powerful decision-making tool for anticipating outbreaks and optimizing public health interventions in vulnerable regions.

---

# Acknowledgments and References

This project utilizes a publicly available dataset and draws inspiration from an academic project.

- **Dataset source:** [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NN7EOY)
- **Original project news coverage:** [Metrópoles Dengue Algorithm](https://www.metropoles.com/sao-paulo/estudantes-de-sp-criam-algoritmo-capaz-de-prever-casos-de-dengue)
- **Libraries:** The project was developed using standard Python libraries, including Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn.