# Predicting Dengue Fever Cases from Weather Data with the CatBoost Algorithm

### Introduction

This project tackles the dengue fever challenge (DengAI: Predicting Disease Spread) hosted on DrivenData.
This notebook covers five steps:
1. importing packages and reading data from source
2. conducting basic EDA (Exploratory Data Analysis)
3. feature engineering
4. training a model and testing it on a held-out part of the training data
5. training the model again using all training data and predicting for the 'submit' set

### Step 1: importing packages and reading data from source 

In [None]:
# import packages
import pandas as pd
import numpy as np
import os 
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as MAE
from catboost import CatBoostRegressor

In [None]:
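# read every CSV file in ./Data into a dict keyed by file name (without the extension)
path = os.path.join(os.getcwd(), "Data")

data = {}
for file in os.listdir(path):
    if file.endswith(".csv"):  # skip anything that is not a CSV file
        # escape the dot and anchor the pattern so only the trailing ".csv" is removed
        data[re.sub(r"\.csv$", "", file)] = pd.read_csv(os.path.join(path, file))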

training_X = data.pop("DengAI_Predicting_Disease_Spread_-_Training_Data_Features")
training_Y = data.pop("DengAI_Predicting_Disease_Spread_-_Training_Data_Labels")
submit_X = data.pop("DengAI_Predicting_Disease_Spread_-_Test_Data_Features")
submit_Y = data.pop("DengAI_Predicting_Disease_Spread_-_Submission_Format")

# joining data for train_test_split in the later stage
training = training_X.merge(training_Y, on=['city', 'year', 'weekofyear'], how='left')
submit = submit_X.merge(submit_Y, on=['city', 'year', 'weekofyear'], how='left')

### Step 2: conduct basic EDA (Exploratory Data Analysis)

In [None]:
# check the shape of training data
training.shape

In [None]:
# check the shape of submission data
submit.shape

In [None]:
# take a look at the training data
training.head()

In [None]:
# take a look at the submission data
submit.head()

In [None]:
# check datatypes of each column
training.dtypes

In [None]:
# check summary of the training data
training.describe(include='all')

In [None]:
# check summary of the submission data
submit.describe(include='all')

In [None]:
# Check if there is any NA in training data
training.isnull().sum()

In [None]:
# Check if there is any NA in submission data
submit.isnull().sum()

In [None]:
# Observe histogram of each numeric feature
# (DataFrame.hist creates its own figure, so a separate plt.figure() call would only add an empty one)
training.hist(figsize=(20,20), layout=(8,3))
plt.tight_layout()
plt.show()

In [None]:
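# Explore correlation between numeric variables
# (numeric_only=True needs pandas >= 1.5; without it, pandas >= 2.0 raises on string columns such as city)
corr = training.corr(numeric_only=True)
plt.figure(figsize=(20,20))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, center=0, annot=True)
plt.show()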

### Step 3: feature engineering

Based on the exploratory data analysis, the following steps seem appropriate:
1. generating datetime features
2. filling NAs
3. removing highly correlated features
4. one-hot encoding for categorical features

For step 2, the missing values are all in weather-related columns, so it makes sense to fill them within groups of city and month.
Since missing values account for less than 10% of each column, a naive method (the group median) is used.
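
The group-wise median fill described above can be illustrated on a small toy frame. This is a minimal sketch with made-up values; the `temp` column and the toy rows are hypothetical and only the `city` and `month` keys mirror the real data. It also shows how the share of missing values per column can be checked.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with the same key columns as the notebook
toy = pd.DataFrame({
    "city": ["sj", "sj", "sj", "iq", "iq", "iq"],
    "month": [1, 1, 1, 1, 1, 1],
    "temp": [25.0, np.nan, 27.0, 30.0, 32.0, np.nan],
})

# Share of missing values per column (isnull().mean() is the fraction of NAs)
na_share = toy.isnull().mean()

# Median of each (city, month) group, broadcast back onto the original rows
group_median = toy.groupby(["city", "month"])["temp"].transform("median")

# Fill each missing value with the median of its own group
toy["temp_filled"] = toy["temp"].fillna(group_median)
print(na_share)
print(toy)
```

Each missing `temp` is replaced by the median of its own city and month, so a gap in one city is never filled with values from the other.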

In [None]:
# generating date_time features
training['week_start_date'] = pd.to_datetime(training['week_start_date'], format='%Y-%m-%d')
training['quarter'] = training.week_start_date.dt.quarter
training['month'] = training.week_start_date.dt.month
training['day'] = training.week_start_date.dt.day

submit['week_start_date'] = pd.to_datetime(submit['week_start_date'], format='%Y-%m-%d')
submit['quarter'] = submit.week_start_date.dt.quarter
submit['month'] = submit.week_start_date.dt.month
submit['day'] = submit.week_start_date.dt.day

In [None]:
# fillna via naive method, median, grouped by city and month
# transform('median') keeps the original row order (so predictions stay aligned with submit_Y)
# and only touches numeric columns, which avoids errors on string columns in recent pandas versions
for df in (training, submit):
    num_cols = list(df.select_dtypes('number').columns.difference(['month']))
    df[num_cols] = df[num_cols].fillna(df.groupby(['city', 'month'])[num_cols].transform('median'))

In [None]:
# Remove some highly correlated non-datetime related features (absolute correlation > 0.9)
features_to_remove = ['reanalysis_sat_precip_amt_mm', 'reanalysis_specific_humidity_g_per_kg', 'reanalysis_avg_temp_k', 'reanalysis_tdtr_k']
training.drop(features_to_remove, axis=1, inplace=True)
submit.drop(features_to_remove, axis=1, inplace=True)
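
The list of features to remove is hard-coded above. As a hedged sketch of how such candidates could be found programmatically, the snippet below lists every pair of columns whose absolute correlation exceeds 0.9. The toy columns `a`, `a_copy` and `b` are hypothetical; on the real data the same steps would be applied to `training.corr(numeric_only=True)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.01 * rng.normal(size=200),  # nearly a linear copy of "a"
    "b": rng.normal(size=200),                         # independent noise
})

# Absolute correlation matrix
corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once and the diagonal is ignored
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) pairs and keep those above the threshold
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

From each reported pair, one column can then be dropped; which one to keep remains a judgment call.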

In [None]:
# one-hot encoding for categorical data: city
training = pd.concat([pd.get_dummies(training.city).astype(int), training], axis=1)
submit = pd.concat([pd.get_dummies(submit.city).astype(int), submit], axis=1)

### Step 4: train a model and test result simply using training data

This step estimates the model's error on held-out data before the model is used to predict the 'submit' dataset.

In [None]:
# drop week_start_date and year before training
training.drop(['week_start_date', 'year'], axis=1, inplace=True)
submit.drop(['week_start_date', 'year'], axis=1, inplace=True)

In [None]:
# split into train (60%), cv (20%) and test (20%), stratified by city
X_train, X_test, y_train, y_test = train_test_split(
    training.drop(['total_cases'], axis=1), training.total_cases,
    test_size=0.2, stratify=training.city, random_state=123)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train, y_train, test_size=0.25, stratify=X_train.city, random_state=123)

# city is already one-hot encoded, so drop the original categorical column
X_train = X_train.drop('city', axis=1)
X_cv = X_cv.drop('city', axis=1)
X_test = X_test.drop('city', axis=1)
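
The two consecutive splits above yield roughly 60% train, 20% cv and 20% test, because 25% of the remaining 80% is 20% of the whole. The sketch below checks this arithmetic on a hypothetical toy frame of 100 rows; the column `x` and the 60/40 city mix are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 100 rows, two "cities" with 60 and 40 rows (made-up proportions)
toy = pd.DataFrame({"city": ["sj"] * 60 + ["iq"] * 40, "x": range(100)})

# First split: hold out 20% as the test set
rest, test = train_test_split(toy, test_size=0.2, stratify=toy.city, random_state=123)

# Second split: 25% of the remaining 80% becomes the cv set
train, cv = train_test_split(rest, test_size=0.25, stratify=rest.city, random_state=123)

print(len(train), len(cv), len(test))

# Stratifying keeps the city mix close to the original 60/40 in every part
sj_share_train = (train.city == "sj").mean()
print(sj_share_train)
```

Stratifying by `city` matters here because the two cities differ in size, and an unstratified split could leave one of them under-represented in the smaller sets.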

In [None]:
# train the model
model = CatBoostRegressor(iterations = 4000, learning_rate = 0.3, loss_function='MAE', eval_metric='MAE', use_best_model=True, random_seed=123)
model.fit(X_train, y_train, eval_set=(X_cv, y_cv), verbose=False)

In [None]:
# check result
y_pred = model.predict(X_test)
print('Mean Absolute Error:', MAE(y_test, y_pred))
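
For reference, the mean absolute error is the average of the absolute differences between true and predicted values. The following sketch uses made-up numbers to show that a manual computation agrees with `mean_absolute_error` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true case counts and predictions
y_true = np.array([3, 0, 12, 7])
y_pred = np.array([2.5, 1.0, 9.0, 7.5])

# Manual MAE: mean of |true - predicted|
mae_manual = np.mean(np.abs(y_true - y_pred))

# Same quantity via scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)
```

Because the errors are not squared, a single badly missed outbreak week affects MAE less than it would affect a squared-error metric.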

In [None]:
# observe feature importances
pd.DataFrame(data={'features':X_train.columns, 'importance':model.feature_importances_}).plot.bar(x='features', y='importance')
plt.show()

### Step 5: train the model again using all training data and do prediction for the 'submit' set

In [None]:
# split all training data into train (70%) and cv (30%); no test set is held out this time
X_train, X_cv, y_train, y_cv = train_test_split(training.drop(['total_cases'], axis=1), training.total_cases, test_size=0.3, stratify=training.city, random_state=123)
X_train = X_train.drop('city', axis=1)
X_cv = X_cv.drop('city', axis=1)

In [None]:
# build model
model = CatBoostRegressor(iterations = 4000, learning_rate = 0.3, loss_function='MAE', eval_metric='MAE', use_best_model=True, random_seed=123)
model.fit(X_train, y_train, eval_set=(X_cv, y_cv), verbose=False)

In [None]:
# do prediction on the 'submit' data and save as csv for final submission
submit_Y.total_cases = model.predict(submit.drop(['total_cases', 'city'], axis=1)).astype(int)
submit_Y.to_csv('submission.csv', index=False)
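
The cell above converts predictions with `astype(int)`, which truncates toward zero. As a hedged alternative (not what this notebook does), predictions could be rounded to the nearest integer and clipped at zero, since case counts cannot be negative. The sketch below compares both on made-up raw predictions.

```python
import numpy as np

# Made-up raw model outputs, including a small negative value
raw_pred = np.array([-0.7, 0.4, 3.6, 12.9])

# What astype(int) does: truncate toward zero
truncated = raw_pred.astype(int)

# Alternative: round to nearest integer, then clip negatives to zero
rounded = np.clip(np.rint(raw_pred), 0, None).astype(int)

print(truncated)
print(rounded)
```

Whether rounding improves the leaderboard score is an empirical question; the sketch only shows how the two conversions differ.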