# Recurrent Neural Network & Classification: 
The objective is to detect security breaches by predicting suspicious access using an RNN model and the provided Logfile data.

Logfile data includes login information like LogID, Timestamp, Method, Path, Status Code, Source, Remote Address, User Agent etc. The last indicator in each row denotes breach (1) or no breach (0), which is the target variable.

The expectation is that you will use the keras package to solve this problem (https://keras.io/).

# 1. Data Processing: 
This data set is a bit messy, so the preprocessing portion is largely a tutorial to make sure students have data ready for keras. 

a) Import the following libraries: 

In [1]:
import sys
import os
import json
import pandas
import numpy
import optparse
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from keras.callbacks import TensorBoard
from keras.models import Sequential, load_model
from keras.layers import Dense, Activation, LSTM, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras import backend as k
from collections import OrderedDict

Using TensorFlow backend.


b) We will read the data in slightly differently than before: 

In [0]:
dataframe = pandas.read_csv("https://canvas.uchicago.edu/courses/17447/files/1909483/download?verifier=QpuhYSiinG7g0BJhPyc5oP1gwKr0nc0xg8tD6q9Z&wrap=1", engine='python', quotechar='|', header=None)

c) We then need to convert to a `numpy.ndarray` type: 

In [0]:
dataset = dataframe.values

d) Check the shape of the data set - it should be (26773, 2). Spend some time looking at the data. 

e) Store all rows and the 0th index as the feature data: 

In [0]:
X = dataset[:,0]

f) Store all rows and index 1 as the target variable: 

In [0]:
Y = dataset[:,1]

g) In the next step, we will clean up the predictors. This includes removing features that are not valuable, such as timestamp and source. 

In [0]:
for index, item in enumerate(X):
    # Parse each JSON log entry, drop the fields we do not need, and re-serialize it compactly
    reqJson = json.loads(item, object_pairs_hook=OrderedDict)
    del reqJson['timestamp']
    del reqJson['headers']
    del reqJson['source']
    del reqJson['route']
    del reqJson['responsePayload']
    X[index] = json.dumps(reqJson, separators=(',', ':'))
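
To make the effect of the loop above concrete, here is a minimal sketch on a single made-up record. The five deleted keys come from the cell above; the remaining field names (`method`, `path`, `statusCode`) and all values are hypothetical, chosen only to resemble the columns described in the introduction.

```python
import json
from collections import OrderedDict

# Hypothetical log entry; real rows in the data set may contain other fields.
raw = ('{"timestamp": 1502738402847, "method": "post", "path": "/login", '
       '"statusCode": 401, "source": {"remoteAddress": "10.0.0.1"}, '
       '"route": "/login", "headers": {"user-agent": "curl"}, '
       '"responsePayload": "denied"}')

# OrderedDict keeps the remaining keys in their original order.
record = json.loads(raw, object_pairs_hook=OrderedDict)
for key in ("timestamp", "headers", "source", "route", "responsePayload"):
    del record[key]

# Compact separators drop the spaces after ',' and ':'.
cleaned = json.dumps(record, separators=(",", ":"))
print(cleaned)  # {"method":"post","path":"/login","statusCode":401}
```

Dropping the spaces matters here because the model works at the character level, so every removed character shortens the input sequence.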

h) We will next tokenize our data, which means converting each text into a sequence of integers. Given the data, we will tokenize every character (thus `char_level=True`).

In [0]:
tokenizer = Tokenizer(filters='\t\n', char_level=True)
tokenizer.fit_on_texts(X)

# we will need this later
num_words = len(tokenizer.word_index) + 1
X = tokenizer.texts_to_sequences(X)
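
As a rough illustration of what character-level tokenization produces, the following pure-Python sketch builds a character index and converts texts to integer sequences. It is a simplified stand-in, not the Keras implementation: the helper names are hypothetical, and it ignores the `filters` and lowercasing options of `Tokenizer`.

```python
from collections import Counter

def fit_char_index(texts):
    """Map each character to an integer, most frequent first, starting at 1."""
    counts = Counter(ch for text in texts for ch in text)
    # Index 0 is left unused so it can serve as the padding value later.
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, char_index):
    """Replace every known character by its integer index."""
    return [[char_index[ch] for ch in text if ch in char_index] for text in texts]

texts = ['{"a":1}', '{"b":11}']          # toy inputs, not taken from the data set
char_index = fit_char_index(texts)
sequences = texts_to_sequences(texts, char_index)
vocab_size = len(char_index) + 1         # +1 accounts for the reserved index 0

print(char_index)
print(sequences)
print(vocab_size)
```

The `+ 1` mirrors the `num_words` computation above: the vocabulary passed to the embedding layer has to include the reserved index 0.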

i) We need to pad our data, as each observation has a different length:

In [0]:
max_log_length = 1024
X_processed = sequence.pad_sequences(X, maxlen=max_log_length)
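
With its default settings, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` items. The sketch below reproduces that behaviour for a single sequence in plain Python; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

padded = pad_pre([5, 3, 8], maxlen=5)
truncated = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(padded)     # [0, 0, 5, 3, 8]
print(truncated)  # [3, 4, 5, 6, 7]
```

After this step every observation has the same length, so the whole data set can be stored as one 2-D array with `max_log_length` columns.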

j) Create your train set to be 75% of the data and your test set to be 25%

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(X_processed, Y, test_size=.25,random_state=0)

# 2. Model 1 - RNN: 
The first model will be a pretty minimal RNN with only an embedding layer, LSTM layer, and Dense layer. In the next model we will add a few more layers.

a) Start by creating an instance of a Sequential model: https://keras.io/getting-started/sequential-model-guide/

In [0]:
k.clear_session()
model = Sequential()

b) From there, add an Embedding layer: https://keras.io/layers/embeddings/

Params:
- input_dim = num_words (the variable we created above)
- output_dim = 32
- input_length = max_log_length (we also created this above) 
- Keep all other variables as the defaults (shown below)

In [0]:
model.add(Embedding(input_dim=num_words, output_dim=32, input_length=max_log_length))

c) Add an LSTM layer https://keras.io/layers/recurrent/#lstm

Params:
- units = 64
- recurrent_dropout=0.5

In [0]:
model.add(LSTM(units=64, recurrent_dropout=.5))

d) Finally, we will add a Dense layer: https://keras.io/layers/core/#dense 

Params:
- units = 1 (this will be our output)
- activation = relu

In [0]:
model.add(Dense(units=1, activation="relu"))

e) Compile model using the .compile() method: https://keras.io/models/model/

Params:
- loss = binary_crossentropy
- optimizer = adam
- metrics = accuracy



In [0]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=['acc'])

f) Print the model summary:

In [16]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1024, 32)          2016      
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 26,913
Trainable params: 26,913
Non-trainable params: 0
_________________________________________________________________
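
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the vocabulary size of 63 is not stated in the notebook but is implied by the 2,016 embedding parameters (2016 / 32).

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per vocabulary entry.
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # Four blocks (three gates plus the candidate), each with
    # input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weights plus one bias per unit.
    return (input_dim + 1) * units

num_words_implied = 2016 // 32
emb = embedding_params(num_words_implied, 32)
lstm = lstm_params(32, 64)
dense = dense_params(64, 1)

print(emb, lstm, dense, emb + lstm + dense)  # 2016 24832 65 26913
```

The same LSTM formula with an input dimension of 64 gives 33,024, which matches the second LSTM layer that appears in Model 3.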


g) Use the `.fit()` method to fit the model on the train data. Use `validation_split=0.25`, `epochs=3`, and `batch_size=128`.

In [17]:
model_1_fit = model.fit(X_train, Y_train, validation_split=.25, epochs=3, batch_size=128)

Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
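
The sample counts in the log line above follow from the two successive 25% splits. The arithmetic below is a sketch that assumes scikit-learn rounds the test size up and Keras truncates the training portion to an integer; under those assumptions it reproduces the reported numbers from the 26,773 rows.

```python
import math

n_rows = 26773                      # from the expected shape (26773, 2)

# train_test_split with test_size=.25 (assumed to round the test set up)
n_test = math.ceil(0.25 * n_rows)
n_fit = n_rows - n_test             # rows passed to model.fit()

# validation_split=.25 holds out the last quarter of the rows given to fit()
n_train = int(n_fit * (1 - 0.25))
n_val = n_fit - n_train

print(n_test, n_fit, n_train, n_val)  # 6694 20079 15059 5020
```

Note that the validation rows are carved out of the training set, so the model is fitted on roughly 56% of the original data.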


h) Use the `.evaluate()` method to get the loss value & the accuracy value on the test data. Use a batch size of 128 again.

In [18]:
model.evaluate(X_test, Y_test, batch_size=128)



[1.1093023050568658, 0.4720645353603192]

# 3) Model 2 - RNN + Dropout Layers + New Activation Function:

Now we will add a few new layers to our RNN and switch the activation function. You will be creating a new model here, so make sure to call it something different than the model from Part 2.

a) This RNN needs to have the following layers (add in this order):

- Embedding Layer (use same params as before)
- Dropout Layer (https://keras.io/layers/core/#dropout) - use a value of 0.5 for now
- LSTM Layer (use same params as before)
- Dropout Layer - use a value of 0.5 
- Dense Layer - (switch to a sigmoid activation function)

In [20]:
k.clear_session()
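model2 = Sequential()
# Note: the two Dropout(0.5) layers listed in (a) are not added in this cell, so the
# summary and results below are for Embedding -> LSTM -> Dense with a sigmoid output.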
model2.add(Embedding(num_words, 32, input_length=max_log_length))
model2.add(LSTM(64, recurrent_dropout=0.5))
model2.add(Dense(1, activation="sigmoid"))

model2.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1024, 32)          2016      
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 26,913
Trainable params: 26,913
Non-trainable params: 0
_________________________________________________________________


In [21]:
model_2_fit = model2.fit(X_train, Y_train, epochs=3, batch_size=128,validation_split=0.25)

Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [23]:
model2.evaluate(X_test, Y_test, batch_size=128) 



[0.34538009765435235, 0.8624141021098242]

# 4) Recurrent Neural Net Model 3: Build Your Own
a) RNN Requirements: 
- Use 5 or more layers
- Add a layer that was not utilized in Model 1 or Model 2 (Note: This could be a new Dense layer or an additional LSTM)

In [24]:
k.clear_session()
model3 = Sequential()
model3.add(Embedding(num_words, 32, input_length=max_log_length))
model3.add(Dropout(0.5))
model3.add(LSTM(64,recurrent_dropout=0.5,return_sequences= True))
model3.add(Dropout(0.5))
model3.add(LSTM(64, recurrent_dropout=0.5))
model3.add(Dropout(0.5))
model3.add(Dense(1, activation="sigmoid"))

model3.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=['acc'])

model3.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1024, 32)          2016      
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024, 32)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 1024, 64)          24832     
_________________________________________________________________
dropout_2 (Dropout)          (None, 1024, 64)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dropout_3 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total para

In [25]:
model_3_fit = model3.fit(X_train, Y_train, epochs=3, batch_size=128, validation_split=0.25)

Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [26]:
model3.evaluate(X_test, Y_test, batch_size=128)



[0.09791403422969684, 0.9763967733544167]

# Conceptual Questions: 

# 5) Explain the difference between the relu activation function and the sigmoid activation function. 

__Rectified linear unit (ReLU)__, defined as $f(x)=x^{+}=\max(0,x)$, is an activation function with strong biological motivation. 

ReLU's biggest advantage is its efficiency: it is cheap to compute, and because it maps every negative input to zero, only some neurons are active at a time, which yields sparse activations. ReLU is also less prone to vanishing gradients than sigmoid, since its gradient is 1 for every positive input, but it has problems of its own, which I highlight below. 

Drawbacks: ReLU's main weakness is the dying ReLU problem. In this state, neurons become 'stuck' and output zero for all inputs (sometimes referred to as 'dying'); because the gradient is also zero there, they stop updating. If too many neurons become stuck in this dead state, the model can become extremely ineffective. There are methods for mitigating this problem, including __Leaky ReLUs__, which allow a small, positive gradient when the unit is not active, and using a smaller learning rate. In addition to dying neurons, the ReLU function is also non-differentiable at zero.

__Sigmoid__, defined as: $S(x)={\frac{1}{1+e^{-x}}}={\frac{e^{x}}{e^{x}+1}}$

The sigmoid function is bounded in (0, 1), so its output can be read as a probability, which is why it is typically used for binary classification problems. It is also differentiable at every point. The bounded nature of the sigmoid creates a weakness: _vanishing gradients_. For inputs of large magnitude, where the output is close to 0 or 1, the gradient with respect to those inputs is close to zero, so weight updates become tiny and the net does not learn well.
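
The two functions and their gradients are easy to compare numerically. The following is a minimal sketch in plain Python; the choice of 0 for the ReLU gradient at exactly zero is a convention assumed here, since the derivative is undefined at that point.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Undefined at x == 0 mathematically; 0 is used here by convention.
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  relu={relu(x):5.1f}  relu'={relu_grad(x):.1f}  "
          f"sigmoid={sigmoid(x):.5f}  sigmoid'={sigmoid_grad(x):.5f}")
```

The sigmoid gradient peaks at 0.25 for `x = 0` and is already below 0.0001 at `x = ±10`, while the ReLU gradient is exactly 0 for negative inputs and exactly 1 for positive ones.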


# 6) With regard to question 5, which of these activation functions performed best (they were used in Model 1 & Model 2)? Why do you think that is?

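The sigmoid function performed much better than ReLU: Model 2 reached a test accuracy of about 0.86, against about 0.47 for Model 1. A likely reason is that binary cross-entropy expects a predicted probability in (0, 1). Sigmoid maps any input into that range, whereas ReLU outputs exactly 0 for every negative input (where its gradient is also zero) and is unbounded for positive inputs, so its output is not a valid probability and the output unit can get stuck.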
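
One way to see why a ReLU output unit pairs poorly with binary cross-entropy is to evaluate the loss by hand. The sketch below is illustrative only: the pre-activation value `z` is made up, and the clipping constant `EPS` is an assumption used to keep the logarithm finite rather than a value taken from the notebook.

```python
import math

EPS = 1e-7  # assumed clipping constant, used only to keep log() finite

def binary_crossentropy(y_true, y_pred):
    # Clip the prediction into (0, 1) before taking logs.
    p = min(max(y_pred, EPS), 1.0 - EPS)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = -2.0  # hypothetical pre-activation of the output unit
loss_relu = binary_crossentropy(1, relu(z))        # ReLU outputs exactly 0
loss_sigmoid = binary_crossentropy(1, sigmoid(z))  # sigmoid outputs about 0.12

print(round(loss_relu, 2), round(loss_sigmoid, 2))
```

With the ReLU output the loss sits at the clipping ceiling and the gradient through the activation is zero, so the unit gets no signal to recover. With the sigmoid output the loss is finite and the gradient is non-zero.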

# 7) Explain how dropout works (you can look at the keras code) for (a) training, and (b) test data sets.

(a) Dropout is a regularization technique used to prevent overfitting. During training, randomly selected neurons are ignored ('dropped out'). This leads to a better generalization error, since the neural network will be less sensitive to specific neuron weights.

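(b) Dropout only applies to training. At test time no neurons are dropped and the full network is used. Keras scales the kept activations by 1/(1 - rate) during training, so the expected activation is the same in both modes and no rescaling is needed at test time. 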
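
The difference between the two modes can be sketched in a few lines of plain Python. This is a simplified illustration of inverted dropout, not the Keras source: the function name and the explicit `training` flag are assumptions made for the example.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate` while training."""
    if not training:
        # Test time: every unit is kept and nothing is rescaled.
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    # Training: surviving values are scaled up so the expected sum is unchanged.
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0] * 10000

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)
mean_train = sum(train_out) / len(train_out)

print(sorted(set(train_out)))   # [0.0, 2.0]
print(round(mean_train, 2))     # close to 1.0
print(test_out == activations)  # True
```

Because roughly half of the values are zeroed and the rest doubled, the average activation during training stays near the test-time value of 1.0.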



# 8) Explain why problems such as this are better modeled with RNNs than CNNs.

This problem involves sequential data: each input from the log file is an ordered sequence of characters. RNNs are typically used for problems of this nature, since their feedback loops give them a memory of earlier steps in the sequence. RNNs can also be fed data of different lengths, while CNNs typically require fixed-size input. In this case, the log entries do not have consistent lengths.

# 9) Explain what RNN problem is solved using LSTM and briefly describe how.

Recurrent neural networks (RNNs) have a problem with long-term dependencies. While RNNs were created with the idea of connecting previous information to the present task, there are situations where more context is needed to make a reasonable prediction. The gap between the relevant information and the point where it is needed can grow large, and because gradients shrink as they are propagated back through many time steps, a simple RNN struggles to learn such dependencies. 

This problem of long-term dependencies is addressed by Long Short-Term Memory networks (LSTMs), a special kind of RNN. LSTMs keep the chain structure found in RNNs, but each repeating module contains four interacting layers instead of one. These modules are much more complex than those of a simple RNN, and give the LSTM the ability to add information to or remove it from the cell state through gates. Gates are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs a value between 0 and 1 (0 lets nothing through; 1 lets everything through).
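
The gating mechanism can be illustrated with a single scalar LSTM cell. The sketch below follows the standard LSTM equations, but it is a toy: the hidden size is 1, and the weights are hypothetical values picked by hand so that the forget gate stays almost fully open and the input gate almost fully closed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell; `w` maps a name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the candidate to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # proposed new content
    c = f * c_prev + i * g           # gates act by pointwise multiplication
    h = o * math.tanh(c)
    return h, c

# Hypothetical weights: forget gate near 1, input gate near 0.
weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # start with a value of 1.0 stored in the cell state
for x in [0.5, -0.3, 0.8] * 10:
    h, c = lstm_step(x, h, c, weights)

print(round(c, 3))  # still close to 1.0 after 30 steps
```

With these settings the stored value survives 30 time steps almost unchanged, which is the behaviour that lets an LSTM carry information across long gaps.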