<a href="https://colab.research.google.com/github/BhekiMabheka/Data_Driven_Competions/blob/master/Predicting_Spread_Disease.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DengAI: Predicting Disease Spread

**Overview:**

**Dengue fever** is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue fever can cause severe bleeding, low blood pressure, and even death.

Because it is carried by mosquitoes, the transmission dynamics of dengue are <a href="https://ehp.niehs.nih.gov/wp-content/uploads/121/11-12/ehp.1306556.pdf">related to climate variables</a> such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts that will have significant public health implications worldwide.

In recent years dengue fever has been spreading. Historically, the disease has been most prevalent in Southeast Asia and the Pacific islands. These days, many of the nearly half a billion cases per year occur in Latin America.


In [0]:
## Data wrangling and plotting packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Machine learning (scikit-learn) packages
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

In [0]:
train_feats = pd.read_csv("https://raw.githubusercontent.com/BhekiMabheka/Data/master/Training_Data_Feeats.csv")
print(train_feats.shape)

In [0]:
train_feats.head(2)

In [0]:
tranin_labels = pd.read_csv("https://raw.githubusercontent.com/BhekiMabheka/Data/master/Traning_Labels.csv")
print(tranin_labels.shape)
tranin_labels.head(2)

In [0]:
# Join features and labels on the shared key columns (inner join by default)
full = pd.merge(left = train_feats, right = tranin_labels, on = ["city", "year", "weekofyear"])
print(full.shape)
full.head(2)
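
The merge above relies on each (`city`, `year`, `weekofyear`) combination appearing exactly once in both tables. A minimal sketch of how that assumption could be checked is shown here on small made-up frames (`feats_demo` and `labels_demo` are hypothetical stand-ins, not the competition data):

```python
import pandas as pd

# Hypothetical toy frames that mimic the key columns of the real data.
feats_demo = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "precipitation_amt_mm": [12.4, 22.8, 25.4],
})
labels_demo = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "total_cases": [4, 5, 0],
})

keys = ["city", "year", "weekofyear"]

# validate="one_to_one" raises pandas.errors.MergeError if a key
# combination is duplicated in either frame.
merged_demo = pd.merge(feats_demo, labels_demo, on=keys, how="inner", validate="one_to_one")

# An inner join silently drops unmatched rows, so compare the row counts.
rows_kept = len(merged_demo)
print(rows_kept, "of", len(feats_demo), "feature rows kept")
```

If the row count after the merge is lower than the number of feature rows, some weeks had no matching label and were dropped.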

In [0]:
numeric_feats_df = full.select_dtypes(include = [np.number]) # Select only the numeric features
cat_feats_df = full.select_dtypes(include = ["object"])      # Select only the categorical features (np.object was removed in NumPy 1.24)

# Impute missing values in every numeric feature with the column mean
imp = SimpleImputer(missing_values = np.nan, strategy = 'mean')
# Fit the imputer and transform the dataset, keeping the original column names
numeric_feats_df = pd.DataFrame(data = imp.fit_transform(numeric_feats_df), columns = numeric_feats_df.columns.tolist())
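
To make the effect of mean imputation concrete, here is a small sketch on a made-up frame (`demo` and its column names are hypothetical). Each missing value is replaced by the mean of the observed values in the same column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
demo = pd.DataFrame({
    "temp": [25.0, np.nan, 27.0],
    "rain": [10.0, 30.0, np.nan],
})

imp_demo = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns a NumPy array, so the column names are restored by hand.
demo_imputed = pd.DataFrame(imp_demo.fit_transform(demo), columns=demo.columns)
print(demo_imputed)
```

The missing `temp` entry becomes the mean of 25.0 and 27.0, and the missing `rain` entry becomes the mean of 10.0 and 30.0.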

In [0]:
full = pd.concat([cat_feats_df, numeric_feats_df], axis = 1)
full.head()

In [0]:
X = full.drop(columns = ["city", "week_start_date", "year", "weekofyear", "total_cases"]) # Predictors
y = full.total_cases # Target label

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 2, test_size = 0.2)

In [0]:
# Fit the gradient boosting regressor on the training data
grd_boost = GradientBoostingRegressor(random_state = 2).fit(X = X_train, y = y_train)

# Train Predictions
train_predictions = grd_boost.predict(X_train)

# Test Predictions
test_predictions = grd_boost.predict(X_test)


In [0]:
# Check performance on the training data
train_mean_absolute_error = mean_absolute_error(y_pred = train_predictions, y_true = y_train)

# Check performance on the testing data
test_mean_absolute_error = mean_absolute_error(y_pred = test_predictions, y_true = y_test)

In [0]:
print("TRAIN MEAN ABSOLUTE ERROR: ", np.round(train_mean_absolute_error, 2))
print("TEST MEAN ABSOLUTE ERROR : ", np.round(test_mean_absolute_error, 2))

The Gradient Boosting Regressor model is overfitting. I will try parameter tuning to handle this; if that doesn't work, I will remove noisy data by dropping features with `zero variance`, and later do feature engineering.
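
As a rough illustration of the zero-variance step mentioned above, the sketch below uses scikit-learn's `VarianceThreshold` on a made-up frame. The frame `X_demo` and its column names are hypothetical; this is one possible way to do it, not necessarily the approach taken later in this notebook.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy predictors; "constant_flag" never changes.
X_demo = pd.DataFrame({
    "humidity": [70.1, 72.5, 68.9, 75.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],
    "temp": [298.1, 299.4, 297.8, 300.2],
})

# threshold=0.0 (the default) keeps only features with non-zero variance.
selector = VarianceThreshold(threshold=0.0)
selector.fit(X_demo)

# get_support() returns a boolean mask over the input columns.
kept_columns = X_demo.columns[selector.get_support()].tolist()
X_reduced = X_demo[kept_columns]
print(kept_columns)
```

Selecting columns through the mask keeps the result as a DataFrame with its original column names, which is convenient for later feature engineering.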

### Parameter Tuning
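
One common way to tune a gradient boosting model is a cross-validated grid search. The sketch below is an illustration only: it runs on synthetic data from `make_regression`, and the grid values are assumptions chosen to keep the example small, not recommended settings for the dengue data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the competition data).
X_syn, y_syn = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=2)

# Hypothetical grid: shallower trees and smaller learning rates tend to reduce overfitting.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=2),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so MAE is negated
    cv=3,
)
search.fit(X_tr, y_tr)

best_params = search.best_params_
# With the default refit=True, predict() uses the best model refit on all of X_tr.
test_mae = mean_absolute_error(y_te, search.predict(X_te))
print(best_params, round(test_mae, 2))
```

Scoring with `neg_mean_absolute_error` matches the metric used above, so the cross-validated score and the held-out MAE are directly comparable.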