## Abstract

The goal of this competition is to predict the hourly rain gauge total from radar observations. The evaluation metric is Mean Absolute Error (MAE). Predictions are written to the file submission.csv, which consists of the columns Id and Expected. A deep learning method, the Long Short-Term Memory (LSTM) network, was applied to the dataset. LSTM networks are well suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series.
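For reference, Mean Absolute Error averages the absolute difference between each predicted and observed gauge total. The sketch below is only an illustration: the helper name `mean_absolute_error` and the sample values are assumptions, not taken from the competition data.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all samples."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical hourly gauge totals (mm) and model predictions
observed = [1.0, 0.5, 3.0, 0.0]
predicted = [0.5, 0.5, 2.0, 1.0]
mae = mean_absolute_error(observed, predicted)
print(mae)
```

Because the errors are not squared, a single very large gauge reading influences MAE less than it would influence a squared-error metric.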

## Dataset

The columns in the train and test dataset are:

Id:  A unique number for the set of observations over an hour at a gauge.
minutes_past:  For each set of radar observations, the minutes past the top of the hour that the radar observations were carried out.  Radar observations are snapshots at that point in time.
radardist_km:  Distance of gauge from the radar whose observations are being reported.
Ref:  Radar reflectivity in dBZ
Ref_5x5_10th:   10th percentile of reflectivity values in 5x5 neighborhood around the gauge.
Ref_5x5_50th:   50th percentile
Ref_5x5_90th:   90th percentile
RefComposite:  Maximum reflectivity in the vertical column above gauge.  In dBZ.
RefComposite_5x5_10th
RefComposite_5x5_50th
RefComposite_5x5_90th
RhoHV:  Correlation coefficient (unitless)
RhoHV_5x5_10th
RhoHV_5x5_50th
RhoHV_5x5_90th
Zdr:    Differential reflectivity in dB
Zdr_5x5_10th
Zdr_5x5_50th
Zdr_5x5_90th
Kdp:  Specific differential phase (deg/km)
Kdp_5x5_10th
Kdp_5x5_50th
Kdp_5x5_90th
Expected:  Actual gauge observation in mm at the end of the hour.

## EDA

In [17]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [18]:
N_FEATURES = 22

# taken from http://simaaron.github.io/Estimating-rainfall-from-weather-radar-readings-using-recurrent-neural-networks/
THRESHOLD = 73 

Importing train.csv

In [19]:
train_df = pd.read_csv("train.csv")

In [20]:
# to reduce memory consumption
train_df[train_df.columns[1:]] = train_df[train_df.columns[1:]].astype(np.float32)
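The saving from the float32 cast can be checked on a small stand-in frame. This is a sketch under assumptions: the `demo` frame and its values are made up, and it uses the dict form of `DataFrame.astype`, which is an alternative spelling of the cast rather than the notebook's own line.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: an integer Id plus float64 feature columns
demo = pd.DataFrame({
    "Id": np.arange(1000, dtype=np.int64),
    "Ref": np.linspace(0.0, 50.0, 1000),
    "Kdp": np.linspace(-1.0, 1.0, 1000),
})
before = int(demo.memory_usage(index=False).sum())

# Cast every column except Id to float32 (4 bytes per value instead of 8)
demo = demo.astype({col: np.float32 for col in demo.columns[1:]})
after = int(demo.memory_usage(index=False).sum())

print(before, after)
```

The Id column is left as an integer so that it can still be used as an exact grouping key.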

Train dataset consists of 13765201 rows and 24 columns.

In [21]:
train_df.shape

(13765201, 24)

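Checking every column for null values. The output below reports False for all columns, most likely because this cell (In [36]) was run after the fill step (In [35]).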

In [36]:
#cross checking for null values
train_df.isnull().any()

Id                       False
minutes_past             False
radardist_km             False
Ref                      False
Ref_5x5_10th             False
Ref_5x5_50th             False
Ref_5x5_90th             False
RefComposite             False
RefComposite_5x5_10th    False
RefComposite_5x5_50th    False
RefComposite_5x5_90th    False
RhoHV                    False
RhoHV_5x5_10th           False
RhoHV_5x5_50th           False
RhoHV_5x5_90th           False
Zdr                      False
Zdr_5x5_10th             False
Zdr_5x5_50th             False
Zdr_5x5_90th             False
Kdp                      False
Kdp_5x5_10th             False
Kdp_5x5_50th             False
Kdp_5x5_90th             False
Expected                 False
dtype: bool

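Removing every Id that has no valid Ref reading, that is, every hourly sequence whose Ref values are all null. Rows with a null Ref are kept when another observation for the same Id has a valid value. The dataset then consists of 9125329 entries with 24 columns.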

In [34]:
good_ids = set(train_df.loc[train_df['Ref'].notna(), 'Id'])
train_df = train_df[train_df['Id'].isin(good_ids)]
train_df.shape


(8926102, 24)
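The filter keeps or drops whole Ids rather than individual rows. A toy illustration of that behaviour, assuming a made-up frame `toy` with only the Id and Ref columns:

```python
import numpy as np
import pandas as pd

# Toy frame: Id 1 has one valid Ref, Id 2 has none, Id 3 has only valid values
toy = pd.DataFrame({
    "Id":  [1, 1, 2, 2, 3],
    "Ref": [np.nan, 10.0, np.nan, np.nan, 5.0],
})

# Ids with at least one non-null Ref reading
good_ids = set(toy.loc[toy["Ref"].notna(), "Id"])
kept = toy[toy["Id"].isin(good_ids)]

print(sorted(good_ids))  # Id 2 is absent
print(len(kept))         # rows of Id 1 and Id 3, including the NaN row of Id 1
```

This is why null values can remain after the step and still need to be filled afterwards.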

Replacing the remaining null values with 0 in every column

In [35]:
train_df.fillna(0.0, inplace=True)
train_df.reset_index(drop=True, inplace=True)
train_df.head()

Unnamed: 0,Id,minutes_past,radardist_km,Ref,Ref_5x5_10th,Ref_5x5_50th,Ref_5x5_90th,RefComposite,RefComposite_5x5_10th,RefComposite_5x5_50th,...,RhoHV_5x5_90th,Zdr,Zdr_5x5_10th,Zdr_5x5_50th,Zdr_5x5_90th,Kdp,Kdp_5x5_10th,Kdp_5x5_50th,Kdp_5x5_90th,Expected
0,2,1.0,2.0,9.0,5.0,7.5,10.5,15.0,10.5,16.5,...,0.998333,0.375,-0.125,0.3125,0.875,1.059998,-1.410004,-0.350006,1.059998,1.016001
1,2,6.0,2.0,26.5,22.5,25.5,31.5,26.5,26.5,28.5,...,1.005,0.0625,-0.1875,0.25,0.6875,0.0,0.0,0.0,1.409988,1.016001
2,2,11.0,2.0,21.5,15.5,20.5,25.0,26.5,23.5,25.0,...,1.001667,0.3125,-0.0625,0.3125,0.625,0.349991,0.0,-0.350006,1.759995,1.016001
3,2,16.0,2.0,18.0,14.0,17.5,21.0,20.5,18.0,20.5,...,1.001667,0.25,0.125,0.375,0.6875,0.349991,-1.059998,0.0,1.059998,1.016001
4,2,21.0,2.0,24.5,16.5,21.0,24.5,24.5,21.0,24.0,...,0.998333,0.25,0.0625,0.1875,0.5625,-0.350006,-1.059998,-0.350006,1.759995,1.016001


Removing outliers by keeping only the rows whose Expected value is below THRESHOLD (73)

In [1]:
train_df = train_df[train_df['Expected'] < THRESHOLD]


In [26]:
train_df.shape

(8926102, 24)

Grouping the train dataset by Id

In [27]:
train_groups = train_df.groupby("Id")
train_size = len(train_groups)

In [28]:
MAX_SEQ_LEN = train_groups.size().max()
MAX_SEQ_LEN

19

In [29]:
X_train = np.zeros((train_size, MAX_SEQ_LEN, N_FEATURES), dtype=np.float32)
y_train = np.zeros(train_size, dtype=np.float32)

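# Fill one zero-padded sequence per Id
for i, (_, group) in enumerate(train_groups):
    X = group.values
    seq_len = X.shape[0]
    X_train[i, :seq_len, :] = X[:, 1:23]  # columns 1-22 are the features
    y_train[i] = X[0, 23]  # column 23 is Expected, identical for every row of an Id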
    
del train_groups
X_train.shape, y_train.shape

((714838, 19, 22), (714838,))
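The grouping and zero-padding step can be followed on a tiny example. This is a sketch with made-up data: the frame `toy`, the feature names `f1` and `f2`, and the sizes are assumptions chosen to keep the arrays readable.

```python
import numpy as np
import pandas as pd

# Toy data: Id 7 has 3 observations, Id 9 has only 1
toy = pd.DataFrame({
    "Id":       [7, 7, 7, 9],
    "f1":       [1.0, 2.0, 3.0, 4.0],
    "f2":       [0.1, 0.2, 0.3, 0.4],
    "Expected": [5.0, 5.0, 5.0, 8.0],
})
groups = toy.groupby("Id")
max_len = int(groups.size().max())
n_features = 2

X = np.zeros((len(groups), max_len, n_features), dtype=np.float32)
y = np.zeros(len(groups), dtype=np.float32)
for i, (_, group) in enumerate(groups):
    values = group.values
    X[i, :len(values), :] = values[:, 1:1 + n_features]  # feature columns only
    y[i] = values[0, -1]                                  # one target per Id

print(X.shape)  # one padded sequence per Id
print(X[1])     # Id 9: first row filled, remaining rows stay zero
```

Shorter sequences are padded at the end, so the model sees real observations first and zeros afterwards.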

Preprocessing the test dataset and replacing NaN values with zero

In [31]:
test_df = pd.read_csv("test.csv")
test_df[test_df.columns[1:]] = test_df[test_df.columns[1:]].astype(np.float32)
test_ids = np.sort(test_df['Id'].unique())  # sorted so the order matches groupby("Id") below

# Convert all NaNs to zero
test_df = test_df.fillna(0.0)
test_df = test_df.reset_index(drop=True)

Preparing the X_test values from the test dataset

In [None]:
test_groups = test_df.groupby("Id")
test_size = len(test_groups)

X_test = np.zeros((test_size, MAX_SEQ_LEN, N_FEATURES), dtype=np.float32)

# Same zero-padding as for the training set, without a target column
for i, (_, group) in enumerate(test_groups):
    X = group.values
    seq_len = X.shape[0]
    X_test[i, :seq_len, :] = X[:, 1:23]
    
del test_groups
X_test.shape

## LSTM

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. This is a behavior required in complex problem domains like machine translation, speech recognition, and more. LSTMs are a complex area of deep learning.

Importing the required Keras layers and the Model class

In [37]:
from keras.layers import (
    Input,
    Dense,
    LSTM,
    AveragePooling1D,
    TimeDistributed,
    Flatten,
    Bidirectional,
    Dropout
)
from keras.models import Model


We use two callbacks here, early_stopping and reduce_lr. Callbacks are functions that can be applied at certain stages of the training process, such as at the end of each epoch. In our solution, EarlyStopping(monitor='val_loss', patience=5) monitors the validation loss at each epoch and interrupts training once that loss has not improved for 5 consecutive epochs. However, since we set patience=5, we won’t get the best model, but the model five epochs after the best model. Since N_EPOCHS is set to 4 in this notebook, a patience of 5 is never reached, so early stopping only takes effect if the number of epochs is raised.
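The patience rule can be traced by hand with a short simulation. This is a simplified model of the stopping logic, not the Keras implementation: the function name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the minimum is reached at epoch 3
losses = [2.0, 1.5, 1.4, 1.45, 1.5, 1.6, 1.7, 1.8, 1.9]
stop = early_stopping_epoch(losses, patience=5)
best_epoch = losses.index(min(losses)) + 1
print(best_epoch, stop)

# With only 4 epochs the patience of 5 can never be exhausted
print(early_stopping_epoch(losses[:4], patience=5))
```

Keras's EarlyStopping also accepts `restore_best_weights=True`, which rolls the model back to the weights of the best epoch instead of keeping the later ones.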

We also use ReduceLROnPlateau, which reduces the learning rate when a metric has stopped improving.

Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This callback monitors a quantity and if no improvement is seen for a 'patience' number of epochs, the learning rate is reduced.
1. monitor: quantity to be monitored.
2. factor: factor by which the learning rate will be reduced. new_lr = lr * factor
3. patience: number of epochs with no improvement after which learning rate will be reduced.
4. min_delta: threshold for measuring the new optimum, to only focus on significant changes.
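How these four parameters interact can be sketched with a small simulation. This is a simplified model under assumptions, not the Keras implementation: it ignores cooldown and min_lr, the function name `simulate_reduce_lr` and the loss values are made up, and the starting rate of 0.001 is assumed to be the optimizer's initial learning rate.

```python
def simulate_reduce_lr(val_losses, lr=0.001, factor=0.1, patience=3, min_delta=0.01):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            # Only an improvement larger than min_delta counts
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = lr * factor  # new_lr = lr * factor
                wait = 0
        history.append(lr)
    return history

# Hypothetical losses: epochs 3-5 improve by less than min_delta
losses = [1.0, 0.9, 0.895, 0.893, 0.892, 0.5]
rates = simulate_reduce_lr(losses)
print(rates)
```

With min_delta=0.01, the small improvements in epochs 3 to 5 count as stagnation, so the rate is cut after the third of them.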

In [None]:
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=5)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, min_delta=0.01)

In [None]:
BATCH_SIZE = 1024 #Training batch size at a time
N_EPOCHS = 4 #total number of training epochs

In [None]:
def get_model_deep(shape=(19, 22)):
    inp = Input(shape)
    x = Dense(16)(inp)
    x = Bidirectional(LSTM(64, return_sequences=True))(x)
    x = TimeDistributed(Dense(64))(x)
    x = Bidirectional(LSTM(128, return_sequences=True))(x)
    x = TimeDistributed(Dense(1))(x)
    x = AveragePooling1D()(x)
    x = Flatten()(x)
    x = Dropout(0.5)(x)
    x = Dense(1)(x)

    model = Model(inp, x)
    return model
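To see how the tensor shape changes through the network, the per-sample shapes can be traced without Keras. This sketch assumes the Keras defaults that the code above relies on: Bidirectional concatenates the forward and backward outputs, and AveragePooling1D uses a pool size of 2 with 'valid' padding. The function name `trace_shapes` is hypothetical.

```python
def trace_shapes(timesteps=19, features=22):
    """List (layer, output shape) pairs, excluding the batch dimension."""
    pooled = timesteps // 2  # pool size 2, stride 2, 'valid' padding
    return [
        ("Input", (timesteps, features)),
        ("Dense(16)", (timesteps, 16)),                    # applied to each timestep
        ("Bidirectional(LSTM(64))", (timesteps, 2 * 64)),  # forward + backward concatenated
        ("TimeDistributed(Dense(64))", (timesteps, 64)),
        ("Bidirectional(LSTM(128))", (timesteps, 2 * 128)),
        ("TimeDistributed(Dense(1))", (timesteps, 1)),
        ("AveragePooling1D()", (pooled, 1)),
        ("Flatten()", (pooled,)),
        ("Dropout(0.5)", (pooled,)),
        ("Dense(1)", (1,)),
    ]

shapes = trace_shapes()
for name, shape in shapes:
    print(f"{name:30s} {shape}")
```

The final Dense(1) layer combines the pooled per-timestep values into a single rainfall estimate per Id.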

compile configures the model for training with an optimizer and a loss.
The loss minimized here is the mean absolute error ('mae'), which matches the competition metric.
The optimizer is Adam, a method for stochastic optimization.

In [None]:
model = get_model_deep((19,22))
model.compile(optimizer='adam', loss='mae')
model.summary()

Fitting the model with a batch size of 1024 for N_EPOCHS (4) epochs, holding out 20% of the training data for validation.

In [None]:
model.fit(X_train, y_train, 
            batch_size=BATCH_SIZE, epochs=N_EPOCHS, 
            validation_split=0.2, callbacks=[early_stopping, reduce_lr])

Predicting the test results

In [None]:
y_pred = model.predict(X_test, batch_size=BATCH_SIZE)
sample_solution = pd.DataFrame({'Id': test_ids, 'Expected': y_pred.reshape(-1)})
sample_solution.head(5)


Writing the solution to the submission.csv file

In [None]:
sample_solution.to_csv("submission.csv", index=False)

## Conclusion

## Contributions

1. Performed data cleaning along with EDA by checking for null values and replacing them with placeholder values.
2. Added comments, observations after each result, and explanations to the code.
3. Changed the values of epochs and patience to check if the model is best trained with minimum loss, tweaked the 


## Citations

1. https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/
2. https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
3. https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/
4. https://www.kaggle.com/ilya16/lstm-models
5. http://www.easy-tensorflow.com/tf-tutorials/recurrent-neural-networks/bidirectional-rnn-for-classification

## LICENSE

Copyright 2019, Chaitanya Prasanna Kumar
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.