# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Predict Dengue Cases

[![](../images/denguecampaign-wpfb-v2.jpg)](https://www.nea.gov.sg/dengue-zika/dengue)

# <u>Background</u>

Dengue fever is a mosquito-borne viral illness that poses a significant public health threat in Singapore. Dengue fever is endemic in Singapore, with frequent outbreaks causing considerable morbidity and occasionally even mortality. With the incidence of <u>[**dengue increasing globally**](https://www.paho.org/en/news/3-8-2023-dengue-cases-increase-globally-vector-control-community-engagement-key-prevent-spread#:~:text=During%20the%20EPI%2DWIN%20Webinar,million%20infections%20occurring%20each%20year.)</u>, it is crucial to implement effective prevention and control strategies.

In Singapore, the National Environment Agency (NEA) has initiated several programs to combat the spread of dengue, including the launch of <u>[**Project Wolbachia**](https://www.nea.gov.sg/corporate-functions/resources/research/wolbachia-aedes-mosquito-suppression-strategy)</u>. This project involves the release of male mosquitoes carrying Wolbachia bacteria. When these males mate with wild females that do not carry Wolbachia, the resulting eggs do not hatch, which reduces the mosquito population and, in turn, the transmission of dengue.

Despite these efforts, predicting the number of dengue cases and understanding the trends in dengue outbreaks remain challenges. Weather patterns, public awareness, and the success of Project Wolbachia all play a role in influencing the incidence of dengue.

As part of the <u>[**Vector Biology and Control Division (VBCD)**](https://www.nea.gov.sg/corporate-functions/who-we-are/groups-and-divisions/public-health-groups-and-divisions)</u>, we carry out research and surveillance of mosquitoes and other vectors and risk assessment of present and future vector-borne diseases that pose a public health threat to Singapore. The division continually explores and evaluates new and safe control tools, and develops novel strategies for risk intervention and management tailored to Singapore's unique urban and dense landscape.

# <u>Problem Statement</u>

Dengue fever is a serious public health concern in Singapore, leading to considerable <u>[**morbidity**](https://www.straitstimes.com/singapore/health/singapore-records-19-dengue-deaths-in-2022-nearly-four-times-2021-s-toll#:~:text=SINGAPORE%20%2D%20There%20were%2019%20deaths,people%20died%20of%20the%20disease.)</u> and <u>[**economic burden**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10021432/#:~:text=We%20estimated%20that%20the%20average,21%2C262%20DALYs%20from%202010%E2%80%932020.)</u>. The National Environment Agency (NEA) is tasked with managing and reducing the impact of dengue in the country. Predicting dengue outbreaks is a complex task that requires taking into account a variety of factors, including environmental, social, and behavioral variables.

Project Wolbachia in Singapore aims to reduce dengue transmission by releasing mosquitoes infected with Wolbachia bacteria. However, the project has faced challenges including ecological concerns, <u>[**skepticism from the community**](https://www.todayonline.com/voices/project-wolbachia-residents-are-killing-helpful-mosquitoes-which-can-be-nuisance)</u>, <u>[**high costs**](https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0000024#:~:text=Under%20an%20assumed%20steady%2Dstate,to%202020%20under%2040%25%20intervention)</u>, <u>[**weather disruptions**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9098883/)</u>, and integration with other control measures. Monitoring and evaluating the project's impact are essential for its success and to inform future dengue control strategies.

As part of the VBCD, we seek to develop a predictive model capable of forecasting the number of dengue cases and understanding the trends in dengue outbreaks for a subsequent period. The model would incorporate a wide range of relevant factors that influence the spread of dengue, over a period of 2012 to 2023.

Specifically, the model should include:

1. **Weather Data**: It is well-known that climatic conditions play a pivotal role in affecting mosquito behavior and the spread of dengue fever. Hence, the model will incorporate data on weather variables such as temperature, humidity, and rainfall to assess their potential influence on dengue cases.

2. **Google Search Trends**: A notable indicator of public awareness and concern regarding dengue is the volume of related search queries on Google. Therefore, the model will include Google search trends data on dengue-related terms to reflect the behavioral dynamics that might contribute to the spread of the disease.

3. **Project Wolbachia Data**: Project Wolbachia, initiated to curb the transmission of dengue, involves the release of male mosquitoes carrying Wolbachia bacteria. Their matings with wild females produce eggs that do not hatch, which suppresses the mosquito population. The model will take into account data on the timing and locations of Project Wolbachia releases to understand their potential impact on the incidence of dengue cases.

Accurate predictions from the model can have significant implications for public health in Singapore. With reliable forecasts, the VBCD can:

1. **Allocate Resources Efficiently**: By anticipating dengue outbreaks, the VBCD can allocate resources, such as mosquito control teams and public health campaigns, more efficiently to areas at higher risk.

2. **Target Prevention Measures Effectively**: With insights into the factors driving dengue outbreaks, the VBCD can tailor prevention measures, such as fogging, removal of breeding sites, and public education, to specific high-risk areas and times.

3. **Raise Public Awareness**: During periods of high predicted risk, the VBCD can increase public awareness campaigns to inform residents about the importance of mosquito control and dengue prevention measures.

4. **Assess the Impact of Project Wolbachia**: By incorporating data on Project Wolbachia, the model can provide insights into the effectiveness of the program and inform decisions on future releases of Wolbachia-infected mosquitoes.

Ultimately, the development of a robust predictive model for dengue cases in Singapore will contribute to more effective dengue management and prevention strategies, thereby reducing the impact of dengue on public health and the economy.

# <u>Methodology</u>

**1. Data Collection and Integration**
   - **Weather Data:** Collect and integrate weather data such as temperature, humidity, and rainfall, which influence mosquito behavior and dengue transmission.
   - **Dengue Cases Data:** Obtain historical weekly dengue cases data to understand the patterns and trends of dengue outbreaks in Singapore.
   - **Google Search Trends Data:** Acquire Google trends data on dengue-related search terms to gauge public awareness and concern about dengue.
   - **Project Wolbachia Data:** Collect data on the release of Wolbachia-infected mosquitoes, including the timing, location, and scale of each release.

**2. Data Preprocessing and Cleaning**
   - **Data Wrangling:** Clean, transform, and integrate data from various sources into a unified dataset suitable for analysis.
   - **Missing Data and Outliers:** Handle missing values and outliers to ensure data quality and reliability.
   - **Temporal Alignment:** Aggregate and align data on a weekly or monthly basis to match the frequency of dengue cases data.
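
The temporal alignment step can be sketched with pandas as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`date`, `rainfall_mm`, `mean_temp_c`), the sample values, the Sunday week anchor, and the choice to sum rainfall while averaging temperature are all assumptions.

```python
import pandas as pd

# Hypothetical daily weather records; names and values are illustrative only.
daily = pd.DataFrame({
    "date": pd.date_range("2023-01-02", periods=14, freq="D"),  # starts on a Monday
    "rainfall_mm": [0.0, 5.2, 0.0, 12.4, 0.0, 0.0, 3.1,
                    0.0, 0.0, 20.0, 1.5, 0.0, 0.0, 0.0],
    "mean_temp_c": [27.1, 26.8, 27.5, 26.0, 27.9, 28.2, 27.4,
                    28.0, 28.3, 26.1, 26.7, 27.8, 28.1, 28.4],
})

# Aggregate to weeks ending on Sunday (an assumed anchor):
# total rainfall per week, average temperature per week.
weekly = (
    daily.set_index("date")
         .resample("W-SUN")
         .agg({"rainfall_mm": "sum", "mean_temp_c": "mean"})
)
print(weekly)
```

The week anchor should match however the weeks in the dengue case data are defined, otherwise weather and case counts will be misaligned by a few days.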

**3. Exploratory Data Analysis (EDA)**
   - **Descriptive Statistics:** Generate summary statistics and visualizations to understand the distribution, patterns, and relationships among variables.
   - **Trends and Seasonality:** Analyze temporal patterns, including trends and seasonality, in the dengue cases data.
   - **Correlation Analysis:** Identify potential relationships between weather variables, Google search trends, Project Wolbachia data, and dengue cases.

**4. Feature Engineering**
   - **Lag Variables:** Create lag variables for weather data and Google search trends to capture their delayed effects on dengue cases.
   - **Interaction Terms:** Generate interaction terms among weather variables to account for their combined effects on dengue transmission.
   - **Geospatial Features:** Encode the location of dengue clusters and Project Wolbachia releases as numeric features.
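
As a rough sketch of the lag and interaction features described above, `DataFrame.shift` can move each weekly series back by a fixed number of weeks. The lags of 1 to 3 weeks, the column names, and the sample values are assumptions for illustration; suitable lags would need to come from the exploratory analysis.

```python
import pandas as pd

# Hypothetical weekly weather table; values are made up.
weekly = pd.DataFrame(
    {
        "rainfall_mm": [20.7, 21.5, 3.0, 45.2, 10.1, 0.0],
        "mean_temp_c": [27.3, 27.6, 28.4, 26.9, 27.8, 28.9],
    },
    index=pd.date_range("2023-01-08", periods=6, freq="W-SUN"),
)

# Lag variables: the rainfall observed k weeks earlier, for k = 1..3 (assumed lags).
for lag in (1, 2, 3):
    weekly[f"rainfall_mm_lag{lag}"] = weekly["rainfall_mm"].shift(lag)

# Interaction term: rainfall and temperature from the same earlier week, multiplied.
weekly["rain_x_temp_lag2"] = weekly["rainfall_mm_lag2"] * weekly["mean_temp_c"].shift(2)

# The first rows have no history to look back on, so they are dropped.
features = weekly.dropna()
print(features)
```

Shifting only looks backwards in time, so each row uses information that would already have been available in that week.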

**5. Modeling and Prediction**
   - **SARIMAX Model:** Build a SARIMAX model to forecast the number of dengue cases, incorporating external regressors such as weather data and Google search trends.
   - **Hyperparameter Tuning:** Optimize the model's hyperparameters to enhance its predictive performance.

**6. Model Evaluation**
   - **Model Validation:** Split the data into training and validation sets to assess the model's performance on unseen data.
   - **Evaluation Metrics:** Use appropriate metrics, such as MAE, RMSE, or MAPE (the primary metric for this project), to evaluate the model's accuracy in predicting dengue cases.
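
For time series, the validation set is usually the most recent stretch of data rather than a random sample, so that the model is always tested on weeks that come after the ones it was trained on. The sketch below illustrates such a split together with MAE and RMSE; the helper names, the weekly counts, and the naive baseline are illustrative assumptions.

```python
import math

def chronological_split(series, n_validation):
    """Hold out the last n_validation points; earlier points form the training set."""
    if not 0 < n_validation < len(series):
        raise ValueError("n_validation must be between 1 and len(series) - 1")
    return series[:-n_validation], series[-n_validation:]

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

weekly_cases = [120, 135, 150, 170, 160, 155, 180, 210]  # made-up counts
train, valid = chronological_split(weekly_cases, n_validation=3)

# Naive baseline: repeat the last training value for every validation week.
baseline = [train[-1]] * len(valid)
print(mae(valid, baseline), rmse(valid, baseline))
```

A baseline like this gives a reference point: a forecasting model is only useful if it beats the naive forecast on the held-out weeks.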

**7. Cost-Benefit Analysis**
   - **Cost Estimation:** Estimate the annual costs of the dengue prevention program, including Project Wolbachia, public awareness campaigns, and other interventions.
   - **Benefit Quantification:** Quantify the benefits of the program in terms of reduced dengue cases, healthcare costs, and productivity losses.
   - **Trade-Off Analysis:** Assess the trade-offs between the costs and benefits of the program to optimize resource allocation and maximize benefits.

**8. Recommendations and Insights**
   - **Policy Recommendations:** Provide recommendations to the NEA for targeted dengue prevention measures based on the model's predictions and cost-benefit analysis.
   - **Model Interpretation:** Offer insights into the factors contributing to dengue outbreaks and the effectiveness of Project Wolbachia.

This methodology outlines a comprehensive approach to predict dengue cases in Singapore, assess the cost-benefit of the dengue prevention program, and provide actionable recommendations to the NEA.

# <u>Objective</u>

1. Collect and preprocess data on weather patterns, dengue clusters, Google search trends for dengue, and Project Wolbachia releases.
2. Conduct exploratory data analysis to understand the relationships between the collected variables and the number of dengue cases.
3. Engineer relevant features that can improve the predictive power of the model.
4. Experiment with different models to forecast the number of dengue cases for the subsequent period, incorporating the identified features and external regressors.
5. Evaluate the performance of the model using the Mean Absolute Percentage Error (MAPE) metric to assess the accuracy of its predictions as a percentage of the actual number of dengue cases.
6. Analyze the model's results and provide insights into the factors contributing to the predicted number of dengue cases and trends.
7. Conduct a cost-benefit analysis of the dengue prevention program, including:
   - Estimating the annual cost of the dengue prevention program, including the costs associated with Project Wolbachia releases, public awareness campaigns, and other prevention measures.
   - Quantifying the benefits of the program in terms of the reduction in the number of dengue cases, the reduction in the economic burden of dengue (e.g. productivity losses), and the improvement in public health.
   - Assessing the trade-offs between the costs and benefits of the program, considering the optimal allocation of resources to maximize the benefits and minimize the costs.
   - Providing recommendations for optimizing the dengue prevention program to achieve maximum benefits at the lowest possible cost.
8. Provide recommendations to the NEA based on the model's predictions and the cost-benefit analysis, including potential modifications to the Project Wolbachia release strategy and targeted dengue prevention measures.

We are using the Mean Absolute Percentage Error (MAPE) metric because it measures the accuracy of the model's predictions as a percentage of the actual number of dengue cases. This makes it easier to interpret and communicate the model's performance to stakeholders. Moreover, MAPE provides a relative error measure, which is useful for comparing the accuracy of different models or evaluating the model on datasets with different scales. It is particularly suitable for this project because it allows for a more intuitive understanding of the model's prediction errors in relation to the actual number of dengue cases, which is crucial for effective resource allocation and dengue prevention efforts.
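
For reference, MAPE can be computed as in the short sketch below. The function name and sample numbers are illustrative. Because each error is divided by the actual value, the metric is undefined for any week with zero actual cases, so the sketch rejects that input.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equally long")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

actual_cases = [200, 250, 400]      # made-up weekly counts
predicted_cases = [180, 275, 400]   # made-up forecasts

# Per-week errors are 10%, 10% and 0%, so the mean is about 6.67%.
print(round(mape(actual_cases, predicted_cases), 2))
```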

By forecasting the number of dengue cases over the next 16 weeks and assessing the model with MAPE, this project aims to improve the readiness of the NEA, supported by the Ministry of Health as our secondary stakeholder, to respond to the spread of dengue in Singapore. Better forecasts can reduce the public health impact of dengue outbreaks and guide the allocation of resources towards proactive prevention measures. The same forecasts also allow the Ministry of Health to take pre-emptive measures that protect public health while making efficient use of resources.

**<u>[Cost Analysis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10021432/#:~:text=3.2-,Cost%2Deffectiveness%20of%20Wolbachia%20interventions,-Assuming%20steady%20state)</u> of Project Wolbachia:**

1. **Effectiveness:** Wolbachia interventions were found to be cost-effective at 40% intervention effectiveness or above. The cost averted increases as the assumed intervention effectiveness of Wolbachia increases from 40% to 90%.

2. **Costs Averted Over Time:** At 40% intervention effectiveness, over US\\$329.40M would have been averted from 2010 to 2020. As the assumed intervention effectiveness increases from 40\% to 80\%, the estimated cost averted would also have increased to US\\$658.79M over the same period.

3. <u>[**DALYs Averted:**](https://www.who.int/data/gho/indicator-metadata-registry/imr-details/158#:~:text=Definition%3A-,One%20DALY%20represents%20the%20loss%20of%20the%20equivalent%20of%20one,health%20condition%20in%20a%20population.)</u> The costs per DALY averted decrease over time, indicating that the Wolbachia intervention becomes more cost-effective in terms of health outcomes over the years.

**Approach**:

Assuming steady-state costs of **US\\$22.7M** per year and targeting 80% effectiveness, we will use the formula ``(1 - MAPE) × (median cost averted) = steady-state cost × 1.8`` to calculate the net amount of cost averted. The median is used as the measure of central tendency for the cost-averted data because it is less sensitive to outliers and extreme values than the mean.
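
One possible reading of this formula, offered only as a hedged sketch: the left-hand side discounts the median cost averted by the model's error, and the right-hand side is 1.8 times the steady-state cost. The MAPE value and the median cost averted used below are made-up placeholders, not project results, and comparing the two sides as a break-even check is an assumption about how the formula is meant to be applied.

```python
STEADY_STATE_COST_M = 22.7  # steady-state cost per year, in millions of US dollars (from the text)
MULTIPLIER = 1.8            # multiplier taken from the formula in the text

def discounted_cost_averted(mape_fraction, median_cost_averted_m):
    """Left-hand side of the formula: (1 - MAPE) x median cost averted."""
    return (1.0 - mape_fraction) * median_cost_averted_m

# Right-hand side of the formula: steady-state cost x 1.8.
threshold_m = STEADY_STATE_COST_M * MULTIPLIER

# Placeholder inputs (assumptions): MAPE of 10%, median cost averted of 50 (millions).
example_m = discounted_cost_averted(mape_fraction=0.10, median_cost_averted_m=50.0)

print(f"Right-hand side: {threshold_m:.2f}M")
print(f"Left-hand side:  {example_m:.2f}M")
print("Left-hand side exceeds right-hand side:", example_m > threshold_m)
```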

# <u>Part 1</u>

# Scraping Weather Data

We will be scraping historical daily weather data from all weather stations in Singapore.

In [18]:
from bs4 import BeautifulSoup
import requests
import os
import pandas as pd
import io
import random
from time import sleep

In [19]:
url = "http://www.weather.gov.sg/climate-historical-daily/"
test = requests.get(url)
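# Name the parser explicitly so the result does not depend on which parsers are installed
BeautifulSoup(test.text, "html.parser")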

<!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US">
<![endif]--><!--[if IE 8]>
<html class="ie ie8" lang="en-US">
<![endif]--><!--[if !(IE 7) & !(IE 8)]><!--><html lang="en-US">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="width=device-width" name="viewport"/>
<!--<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache, no-store, must-revalidate" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
-->
<!-- <meta http-equiv="cache-control" content="public, max-age=60, must-revalidate" /> -->
<title>Historical Daily Records | </title>
<link href="http://gmpg.org/xfn/11" rel="profile"/>
<link href="http://www.weather.gov.sg/xmlrpc.php" rel="pingback"/>
<link href="http://www.weather.gov.sg/wp-content/themes/wiptheme/lib/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
<link href="http://www.weath

In [20]:
# Finding the dropdown menu containing station names and codes
dropdown_menu = BeautifulSoup(test.text, "html.parser").find('ul', class_='dropdown-menu long-dropdown')

# Initialize a dictionary to store station names and codes
station_info = {}

# Loop through each <a> tag within the dropdown menu
for link in dropdown_menu.find_all('a'):
    station_name = link.get_text()
    station_code = link['onclick'].split("setYear('")[1].split("')")[0]
    station_info[station_code] = station_name

# Print the extracted station names and codes
for code, name in station_info.items():
    print(f"{code} : {name}")


S104 : Admiralty
S105 : Admiralty West
S109 : Ang Mo Kio
S86 : Boon Lay (East)
S63 : Boon Lay (West)
S120 : Botanic Garden
S55 : Buangkok
S64 : Bukit Panjang
S90 : Bukit Timah
S92 : Buona Vista
S61 : Chai Chee
S24 : Changi
S114 : Choa Chu Kang (Central)
S121 : Choa Chu Kang (South)
S11 : Choa Chu Kang (West)
S50 : Clementi
S118 : Dhoby Ghaut
S107 : East Coast Parkway
S39 : Jurong (East)
S101 : Jurong (North)
S44 : Jurong (West)
S117 : Jurong Island
S33 : Jurong Pier
S31 : Kampong Bahru
S71 : Kent Ridge
S122 : Khatib
S66 : Kranji Reservoir
S112 : Lim Chu Kang
S08 : Lower Peirce Reservoir
S07 : Macritchie Reservoir
S40 : Mandai
S108 : Marina Barrage
S113 : Marine Parade
S111 : Newton
S119 : Nicoll Highway
S116 : Pasir Panjang
S94 : Pasir Ris (Central)
S29 : Pasir Ris (West)
S06 : Paya Lebar
S106 : Pulau Ubin
S81 : Punggol
S77 : Queenstown
S25 : Seletar
S102 : Semakau Island
S80 : Sembawang
S60 : Sentosa Island
S36 : Serangoon
S110 : Serangoon North
S84 : Simei
S79 : Somerset (Road)
S43 :

In [4]:
output_folder = "../data"  # Folder where the combined CSV will be saved
combined_csv_filename = os.path.join(output_folder, "combined_data.csv")

# Ensure the output folder exists
os.makedirs(output_folder, exist_ok=True)

base_url_template = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_{}_{:04d}{:02d}.csv"

# Collect one DataFrame per station-month; concatenating once at the end
# avoids repeatedly copying a growing DataFrame inside the loop
monthly_frames = []

for station_id, station_name in station_info.items():
    for year in range(2010, 2024):  # 2010 to 2023 inclusive; adjust the range as needed
        for month in range(1, 13):
            csv_url = base_url_template.format(station_id, year, month)

            # Download the monthly CSV file for this station
            csv_response = requests.get(csv_url)
            if csv_response.status_code == 200:
                # Read the downloaded bytes straight into a DataFrame
                csv_content = io.BytesIO(csv_response.content)
                monthly_frames.append(pd.read_csv(csv_content, encoding='ISO-8859-1'))

                print(f"Downloaded and processed: {station_name}_{year:04d}_{month:02d}")
            else:
                # Report the URL that failed, since no local file is written
                print(f"Failed to download: {csv_url}. Status code:", csv_response.status_code)

            # Pause for a random interval between requests to go easy on the server
            sleep(random.uniform(0.0005, 2))

# Combine all station-months and save them as a single CSV file
combined_df = pd.concat(monthly_frames, ignore_index=True) if monthly_frames else pd.DataFrame()
combined_df.to_csv(combined_csv_filename, index=False)

print(f"Combined data saved to '{combined_csv_filename}'.")


Downloaded and processed: Admiralty_2010_01
Downloaded and processed: Admiralty_2010_02
Downloaded and processed: Admiralty_2010_03
Downloaded and processed: Admiralty_2010_04
Downloaded and processed: Admiralty_2010_05
Downloaded and processed: Admiralty_2010_06
Downloaded and processed: Admiralty_2010_07
Downloaded and processed: Admiralty_2010_08
Downloaded and processed: Admiralty_2010_09
Downloaded and processed: Admiralty_2010_10
Downloaded and processed: Admiralty_2010_11
Downloaded and processed: Admiralty_2010_12
Downloaded and processed: Admiralty_2011_01
Downloaded and processed: Admiralty_2011_02
Downloaded and processed: Admiralty_2011_03
Downloaded and processed: Admiralty_2011_04
Downloaded and processed: Admiralty_2011_05
Downloaded and processed: Admiralty_2011_06
Downloaded and processed: Admiralty_2011_07
Downloaded and processed: Admiralty_2011_08
Downloaded and processed: Admiralty_2011_09
Downloaded and processed: Admiralty_2011_10
Downloaded and processed: Admira