# DengAI: Predicting Disease Spread

**Problem description**

Our goal is to predict the total_cases label for each (city, year, weekofyear) in the test set. There are two cities, San Juan and Iquitos, with test data for each city spanning 5 and 3 years respectively. You will make one submission that contains predictions for both cities. The data for each city have been concatenated along with a city column indicating the source: sj for San Juan and iq for Iquitos. The test set is a pure future hold-out, meaning the test data are sequential and non-overlapping with any of the training data. Throughout, missing values have been filled as NaNs.
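
Because the test set is a pure future hold-out, a local validation split that also keeps the held-out rows strictly later in time can mimic it more closely than a random split. The sketch below is one way to do that per city; the helper name `chronological_split`, the `holdout_weeks` parameter, and the toy data are assumptions made for illustration and are not part of the competition material.

```python
import pandas as pd

def chronological_split(df, holdout_weeks=3):
    """Hypothetical helper: keep the last `holdout_weeks` rows of each city as hold-out."""
    train_parts, holdout_parts = [], []
    for _, city_df in df.groupby("city", sort=False):
        # Order each city's rows in time before cutting
        city_df = city_df.sort_values(["year", "weekofyear"])
        train_parts.append(city_df.iloc[:-holdout_weeks])
        holdout_parts.append(city_df.iloc[-holdout_weeks:])
    return pd.concat(train_parts), pd.concat(holdout_parts)

# Toy data standing in for the real training file
toy_df = pd.DataFrame({
    "city": ["sj"] * 10 + ["iq"] * 10,
    "year": [2000] * 20,
    "weekofyear": list(range(1, 11)) * 2,
    "total_cases": list(range(20)),
})
toy_train, toy_holdout = chronological_split(toy_df, holdout_weeks=3)
print(len(toy_train), len(toy_holdout))
```

Splitting inside each city keeps both San Juan and Iquitos represented in the hold-out part.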

**Importing libraries**

We use Keras, a high-level neural networks API written in Python and capable of running on top of TensorFlow, CNTK, or Theano, for the deep learning tasks. We also use Matplotlib and Seaborn for plotting the graphs.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load the data

The dataset consists of the following files:

1.   Training Data Features
The features for the training dataset.
2.   Training Data Labels
The number of dengue cases for each row in the training dataset.
3.   Test Data Features
The features for the testing dataset.
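
Since the labels file holds one `total_cases` value per (city, year, weekofyear) row, features and labels can be joined on those keys instead of relying on row order. This is a minimal sketch on toy frames; the values and the `toy_` names are made up for illustration.

```python
import pandas as pd

keys = ["city", "year", "weekofyear"]

# Toy stand-ins for the features and labels files (values are invented)
toy_features = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "station_avg_temp_c": [25.4, 26.7, 26.4],
})
toy_labels = pd.DataFrame({
    "city": ["iq", "sj", "sj"],
    "year": [2000, 1990, 1990],
    "weekofyear": [26, 19, 18],
    "total_cases": [0, 5, 4],
})

# validate="one_to_one" raises if a key appears more than once on either side
toy_merged = toy_features.merge(toy_labels, on=keys, how="left", validate="one_to_one")
print(toy_merged)
```

The join stays correct even when the two files are sorted differently, as in the toy frames above.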


In [2]:
dengue_features_train_df = pd.read_csv("dengue_features_train.csv")
dengue_features_test_df = pd.read_csv("dengue_features_test.csv")
dengue_labels_train_df = pd.read_csv("dengue_labels_train.csv")

dengue_features_train_df.head()


In [None]:
dengue_features_train_df['year'].unique()

**Data Analysis**

In [None]:
dengue_features_train_df.head()

In [None]:
dengue_features_train_df.describe()

Getting the data types for the columns 

In [None]:
dengue_features_train_df.dtypes

**Checking whether the features have missing values**


In [None]:
dengue_features_train_df.isnull().sum()

It seems column `ndvi_ne` has more missing values than the other columns.

**Fill the missing values with each column's mean**


In [None]:
# Feature columns to impute; each one is filled with its own column mean
columns_to_fill = [
    "ndvi_ne", "ndvi_nw", "ndvi_se", "ndvi_sw",
    "precipitation_amt_mm",
    "reanalysis_air_temp_k", "reanalysis_avg_temp_k",
    "reanalysis_dew_point_temp_k", "reanalysis_max_air_temp_k",
    "reanalysis_min_air_temp_k", "reanalysis_precip_amt_kg_per_m2",
    "reanalysis_relative_humidity_percent", "reanalysis_sat_precip_amt_mm",
    "reanalysis_specific_humidity_g_per_kg", "reanalysis_tdtr_k",
    "station_avg_temp_c", "station_diur_temp_rng_c",
    "station_max_temp_c", "station_min_temp_c", "station_precip_mm",
]
for column in columns_to_fill:
    dengue_features_train_df[column] = dengue_features_train_df[column].fillna(
        dengue_features_train_df[column].mean()
    )
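
The mean fill above can be expressed in a single call, and the same idea can be applied per city so that San Juan rows are not filled with values influenced by Iquitos (and vice versa). This is only a sketch on a toy frame; the per-city variant is a suggested alternative, not what the notebook does.

```python
import numpy as np
import pandas as pd

toy_fill = pd.DataFrame({
    "city": ["sj", "sj", "sj", "iq", "iq", "iq"],
    "ndvi_ne": [0.1, np.nan, 0.3, 0.5, np.nan, 0.7],
})
fill_cols = ["ndvi_ne"]

# Variant 1: global column mean, equivalent to the column-by-column fill
global_filled = toy_fill.copy()
global_filled[fill_cols] = toy_fill[fill_cols].fillna(toy_fill[fill_cols].mean())

# Variant 2 (alternative): mean computed within each city
city_filled = toy_fill.copy()
city_filled[fill_cols] = toy_fill.groupby("city")[fill_cols].transform(
    lambda s: s.fillna(s.mean())
)
print(global_filled, city_filled, sep="\n")
```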

**Checking again to verify whether any null values remain**

In [None]:
dengue_features_train_df.to_csv("dengue_features_train_clean.csv", index=False)

In [None]:
dengue_features_train_df.isnull().sum()

In [None]:
dengue_features_train_df['total_cases'] = dengue_labels_train_df['total_cases']
# Drop the date and identifier columns before computing the correlation matrix
dengue_features_train__corr_df = dengue_features_train_df.drop(
    columns=['week_start_date', 'year', 'weekofyear', 'city']
)
dengue_features_train_corr = dengue_features_train__corr_df.corr()

In [None]:
sns.set() 
plt.figure(figsize=(10, 5))
plt.title('Correlation Plot of all features')
ax = sns.heatmap(dengue_features_train_corr,cmap="BuPu")


**Observation from the dataset**

No variable is exceptionally good at predicting the label (`total_cases`).

The first four variables (the Normalized Difference Vegetation Index features) appear to be only very weakly correlated with the other variables. They do not appear to be very useful for predicting the label.

Most of the temperature variables appear to be strongly correlated with one another.
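
The visual impression from a heatmap can be backed up by listing the variable pairs whose absolute correlation exceeds some cutoff. The sketch below does this on a toy frame; the helper name `strongly_correlated_pairs`, the 0.9 threshold, and the toy values are assumptions chosen for illustration.

```python
import pandas as pd

def strongly_correlated_pairs(corr, threshold=0.9):
    """Hypothetical helper: return (a, b, r) for column pairs with |r| >= threshold."""
    pairs = []
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy data: two temperature-like columns and one vegetation-like column
toy_corr_df = pd.DataFrame({
    "temp_k": [298.0, 299.0, 300.0, 301.0, 302.0],
    "temp_c": [24.85, 25.85, 26.85, 27.85, 28.85],
    "ndvi": [0.3, 0.1, 0.4, 0.1, 0.5],
})
toy_pairs = strongly_correlated_pairs(toy_corr_df.corr(), threshold=0.9)
print(toy_pairs)
```

Highly correlated pairs found this way are candidates for dropping or combining before fitting a linear model.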

In [None]:
sns.set(font_scale = 1.0)
(abs(dengue_features_train_corr)
 .total_cases
 .drop('total_cases')
 .sort_values()
 .plot
 .barh())



It appears that certain variables feature prominently in the bar chart, suggesting that they may be common drivers of dengue cases. For example, specific humidity (in g/kg), dew point temperature (in K), minimum air temperature (in K), and station minimum temperature (in °C) appear to be relatively strongly correlated with the `total_cases` label.


**Splitting into training, cross-validation and testing dataset**

In [None]:
dengue_features_train__corr_df.head()

In [None]:
X=dengue_features_train_df.iloc[:,4:-1]
y=dengue_features_train_df.iloc[:,-1]

In [None]:
X.head()

In [None]:
y.head()

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn import linear_model as lm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score 
from datetime import datetime, timedelta

In [None]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split( X, y, test_size=0.3, random_state=42)

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training split only, then apply the same scaling to the held-out split
train_X = sc.fit_transform(train_X)
test_X = sc.transform(test_X)
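
When standardizing with a hold-out set, the usual pattern is to learn the mean and standard deviation from the training rows and reuse them unchanged on the held-out rows. A minimal sketch with made-up numbers (the `demo_` names are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

demo_train = np.array([[1.0], [2.0], [3.0]])
demo_heldout = np.array([[2.0], [4.0]])

# Learn mean and standard deviation from the training rows only
demo_scaler = StandardScaler().fit(demo_train)

demo_train_scaled = demo_scaler.transform(demo_train)
# Held-out rows are scaled with the training statistics, not their own
demo_heldout_scaled = demo_scaler.transform(demo_heldout)
print(demo_heldout_scaled.ravel())
```

A held-out value equal to the training mean (2.0 here) maps to 0, which would not hold if the scaler were refitted on the held-out rows.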

In [None]:
reg_l = lm.LinearRegression()
reg_l_fit = reg_l.fit(train_X, train_y)
reg_l_pred=reg_l_fit.predict(test_X)
MAE_l=mean_absolute_error(test_y, reg_l_pred)

In [None]:
print ("MAE :", MAE_l)

In [None]:
ridge001 = lm.Ridge(alpha = 0.01)
ridge001_fit = ridge001.fit(train_X, train_y)
ridge001_pred=ridge001_fit.predict(test_X)
MAE_r001=mean_absolute_error(test_y, ridge001_pred)
print ("MAE :", MAE_r001)

In [None]:
ridge100 = lm.Ridge(alpha = 100)
ridge100_fit = ridge100.fit(train_X, train_y)
ridge100_pred=ridge100_fit.predict(test_X)
MAE_r100=mean_absolute_error(test_y, ridge100_pred)
print ("MAE :", MAE_r100)
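
The comparison of regularization strengths can also be written as a loop that records the MAE for each `alpha`. The sketch below uses synthetic data, so its numbers say nothing about the dengue data; the `alpha` grid and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic regression data (not the dengue data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)
X_fit, X_val = X_demo[:150], X_demo[150:]
y_fit, y_val = y_demo[:150], y_demo[150:]

mae_by_alpha = {}
for alpha in (0.01, 1.0, 100.0):
    ridge_demo = Ridge(alpha=alpha).fit(X_fit, y_fit)
    mae_by_alpha[alpha] = mean_absolute_error(y_val, ridge_demo.predict(X_val))

# Pick the alpha with the lowest validation MAE
best_alpha = min(mae_by_alpha, key=mae_by_alpha.get)
print(mae_by_alpha, best_alpha)
```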

In [None]:
test_X= dengue_features_test_df

In [None]:
test_X.head()

In [None]:
test_X=test_X.fillna(0)

In [None]:
test_X.to_csv("dengue_features_test_clean.csv", index=False)

In [None]:
test_X.shape

In [None]:
test_X.isna().sum()

In [None]:
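from sklearn.preprocessing import StandardScaler
# Scale with statistics from the training-split rows the models were fitted on,
# rather than refitting the scaler on the test set
sc = StandardScaler().fit(X.loc[train_y.index])
# Keep only the feature columns (drop city, year, weekofyear, week_start_date) to match the training features
test_X = sc.transform(test_X.iloc[:, 4:])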

In [None]:
test_X

In [None]:
predict=ridge001.predict(test_X)
type(predict)
len(predict)
y = np.array(np.round(predict), dtype=int)
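
A linear model can output negative values, while case counts cannot be negative. One possible post-processing step, shown here on made-up predictions, is to clip at zero after rounding; this is a suggestion and not part of the original pipeline.

```python
import numpy as np

# Made-up raw model outputs, including a negative one
raw_predictions = np.array([-3.4, 0.4, 12.6, 47.2])

# Round to whole cases, then clip anything below zero up to zero
clipped_cases = np.clip(np.round(raw_predictions), 0, None).astype(int)
print(clipped_cases)  # [ 0  0 13 47]
```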

In [None]:
dengue_labels_train_df = pd.read_csv("dengue_labels_train.csv")

In [None]:
pred_y = pd.DataFrame(y, columns=["total_cases"])

In [None]:
Submission_Deng_AI = pd.DataFrame()
Submission_Deng_AI["city"] = dengue_features_test_df["city"]
Submission_Deng_AI["year"]=dengue_features_test_df["year"]
Submission_Deng_AI["weekofyear"]=dengue_features_test_df["weekofyear"]
Submission_Deng_AI["total_cases"] =pred_y

In [None]:
sub_df = pd.read_csv("submission_format - submission_format.csv")

In [None]:
y=sub_df['total_cases']

In [None]:
y = np.array(np.round(y), dtype=int)

In [None]:
sub_df['total_cases']=y

In [None]:
sub_df.to_csv("sbclean.csv", index=False)

**Using other Models**

In [None]:
from keras.models import Model
# Embedding is imported from keras.layers; the keras.layers.embeddings path is not available in recent Keras versions
from keras.layers import Dense, Input, Dropout, LSTM, Activation, Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform

In [None]:
def LSTMmodel(input_shape):
    # input_shape is (timesteps, n_features) for a single sample
    input_ = Input(shape=input_shape, dtype='float32')
    # return_sequences=False: only the last 32-dimensional hidden state is returned
    X = LSTM(32, return_sequences=False)(input_)
    # Add dropout with a rate of 0.5
    X = Dropout(0.5)(X)
    # Single linear output unit for the predicted number of cases
    X = Dense(1)(X)
    model = Model(inputs=input_, outputs=X)
    return model

In [None]:
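# LSTM layers expect (timesteps, n_features); here each week is treated as a one-step sequence
model = LSTMmodel((1, X.shape[1]))
model.summary()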
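
An LSTM consumes input of shape (samples, timesteps, features), so the weekly feature table has to be turned into overlapping windows before training. The sketch below builds such windows with NumPy; the helper name `make_windows` and the window length of 4 weeks are assumptions for illustration, not choices taken from the notebook.

```python
import numpy as np

def make_windows(features, targets, timesteps):
    """Hypothetical helper: stack sliding windows of `timesteps` consecutive rows.

    Each window is paired with the target of its last row.
    """
    X_seq, y_seq = [], []
    for end in range(timesteps, len(features) + 1):
        X_seq.append(features[end - timesteps:end])
        y_seq.append(targets[end - 1])
    return np.array(X_seq), np.array(y_seq)

# Toy data: 10 weeks, 2 features per week
toy_features_arr = np.arange(20, dtype="float32").reshape(10, 2)
toy_targets_arr = np.arange(10)

X_seq, y_seq = make_windows(toy_features_arr, toy_targets_arr, timesteps=4)
print(X_seq.shape, y_seq.shape)  # (7, 4, 2) (7,)
```

With windows like these, `X_seq.shape[1:]` is the `(timesteps, n_features)` tuple that the model's input layer expects. Windows would need to be built separately per city so that a window never mixes San Juan and Iquitos rows.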