# Recurrent Neural Network & Classification: 
The objective is to detect security breaches by predicting suspicious access, using an RNN model trained on the provided log file data.

The log file data includes login information such as LogID, Timestamp, Method, Path, Status Code, Source, Remote Address, and User Agent. The last value in each row is the target variable: 1 denotes a breach and 0 denotes no breach.

The expectation is that you will use the keras package to solve this problem (https://keras.io/).

# 1. Data Processing: 
This data set is a bit messy, so the preprocessing portion is largely a tutorial to make sure students have data ready for keras. 

a) Import the following libraries: 

In [1]:
import sys
import os
import json
import pandas
import numpy
import optparse
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from keras.callbacks import TensorBoard
from keras.models import Sequential, load_model
from keras.layers import Activation, Dense, Dropout, Embedding, LSTM
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras import backend as k
from collections import OrderedDict

Using TensorFlow backend.


b) We will read the data in slightly differently than before: 

In [2]:
dataframe = pandas.read_csv("dev-access.csv", engine='python', quotechar='|', header=None)
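
To see why `quotechar='|'` matters, here is a minimal sketch using the standard library's `csv` module, which has the same quoting concept, so it runs without the data file. The two rows and their field names are made up, and the layout (`|<JSON log entry>|,<label>`) is an assumption based on how the columns are used later.

```python
import csv
import io

# Two made-up rows in the assumed layout: |<JSON log entry>|,<label>
raw = (
    '|{"method":"get","path":"/login","statusCode":200}|,0\n'
    '|{"method":"post","path":"/login","statusCode":401}|,1\n'
)

# quotechar='|' keeps the commas and double quotes inside the JSON in one field
rows = list(csv.reader(io.StringIO(raw), quotechar='|'))

for log_entry, label in rows:
    print(label, log_entry)
```

Each row parses into exactly two fields, which is why the data set is expected to have two columns.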

c) We then need to convert to a `numpy.ndarray` type: 

In [3]:
dataset = dataframe.values

d) Check the shape of the data set - it should be (26773, 2). Spend some time looking at the data. 

e) Store all rows and the 0th index as the feature data: 

In [4]:
X = dataset[:,0]

f) Store all rows and index 1 as the target variable: 

In [None]:
Y = dataset[:,1]

g) In the next step, we will clean up the predictors. This includes removing features that are not valuable, such as timestamp and source. 

In [None]:
for index, item in enumerate(X):
    # Parse each JSON log entry, drop the unneeded fields, then re-serialize it compactly
    reqJson = json.loads(item, object_pairs_hook=OrderedDict)
    del reqJson['timestamp']
    del reqJson['headers']
    del reqJson['source']
    del reqJson['route']
    del reqJson['responsePayload']
    X[index] = json.dumps(reqJson, separators=(',', ':'))
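
The effect of this loop is easiest to see on a single record. The sketch below uses a hypothetical log entry (all values are invented) and `dict.pop` with a default instead of `del`, so that a record missing one of the keys does not raise a `KeyError`. That substitution is an illustration choice, not part of the assignment code.

```python
import json
from collections import OrderedDict

# Hypothetical log entry; field values are invented for illustration
item = json.dumps(OrderedDict([
    ("timestamp", 1502738402847),
    ("method", "post"),
    ("path", "/login"),
    ("statusCode", 401),
    ("source", {"remoteAddress": "10.0.0.1"}),
    ("route", "/login"),
    ("headers", {"user-agent": "curl"}),
    ("responsePayload", {"error": "Unauthorized"}),
]))

drop_keys = ["timestamp", "headers", "source", "route", "responsePayload"]
req = json.loads(item, object_pairs_hook=OrderedDict)
for key in drop_keys:
    req.pop(key, None)  # tolerate a missing key, unlike del

cleaned = json.dumps(req, separators=(",", ":"))
print(cleaned)  # {"method":"post","path":"/login","statusCode":401}
```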

h) Next, we will tokenize our data, which just means vectorizing our text as sequences of integers. Given the data, we will tokenize every character (hence `char_level=True`):

In [None]:
tokenizer = Tokenizer(filters='\t\n', char_level=True)
tokenizer.fit_on_texts(X)

# we will need this later
num_words = len(tokenizer.word_index) + 1
X = tokenizer.texts_to_sequences(X)
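
As a rough picture of what character-level tokenization produces, here is a pure-Python sketch. It is not the Keras implementation: it only mimics the idea of giving each character an integer index (more frequent characters get lower indices) and reserving index 0. The two input strings are made up, and the order of equally frequent characters may differ from what `Tokenizer` yields.

```python
from collections import Counter

texts = ['{"method":"get"}', '{"method":"post"}']  # made-up cleaned entries

# Count characters across the corpus (lowercased, as Tokenizer does by default)
counts = Counter(ch for text in texts for ch in text.lower())

# Index 0 is reserved for padding, so real characters start at 1
char_index = {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

# Each text becomes a list of integers, one per character
sequences = [[char_index[ch] for ch in text.lower()] for text in texts]
num_words = len(char_index) + 1  # +1 for the reserved index 0

print(sequences[0])
print(num_words)
```

The `+ 1` is the reason `num_words` above is computed as `len(tokenizer.word_index) + 1`.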

i) We need to pad our data because each observation has a different length:

In [None]:
max_log_length = 1024
X_processed = sequence.pad_sequences(X, maxlen=max_log_length)
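
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence. The helper below, `pad_pre`, is a hypothetical stand-in written only to illustrate that behavior on short lists; it is not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]                    # drop items from the front if too long
    return [value] * (maxlen - len(seq)) + seq   # fill the front if too short

print(pad_pre([5, 3, 8], maxlen=5))              # [0, 0, 5, 3, 8]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

With `max_log_length = 1024`, every log entry ends up as exactly 1024 integers, with zeros at the front of shorter entries.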

j) Create your train set to be 75% of the data and your test set to be 25%

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X_processed, Y, test_size=0.25, random_state=0)

# 2. Model 1 - RNN: 
The first model will be a fairly minimal RNN with only an Embedding layer, an LSTM layer, and a Dense layer. In the next model, we will add a few more layers.

a) Start by creating an instance of a Sequential model: https://keras.io/getting-started/sequential-model-guide/

In [None]:
k.clear_session()
model1 = Sequential()

b) From there, add an Embedding layer: https://keras.io/layers/embeddings/

Params:
- input_dim = num_words (the variable we created above)
- output_dim = 32
- input_length = max_log_length (we also created this above) 
- Keep all other variables as the defaults (shown below)

In [None]:
model1.add(Embedding(input_dim=num_words, output_dim=32, input_length=max_log_length))
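
Conceptually, an Embedding layer is a trainable lookup table with one row of `output_dim` values per token index. The sketch below imitates only the lookup step in plain Python; the table values are random placeholders, and the vocabulary size of 63 is an assumption taken from the 2016 embedding parameters (2016 / 32) reported in the model summary.

```python
import random

random.seed(0)
num_words, output_dim = 63, 32  # 63 is assumed from the summary: 2016 / 32

# One row of output_dim floats per token index (random placeholder values)
table = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
         for _ in range(num_words)]

sequence_ids = [0, 0, 7, 12, 7]               # a short, padded sequence of token indices
embedded = [table[i] for i in sequence_ids]   # lookup: index -> row

print(len(embedded), len(embedded[0]))        # 5 32
```

For a real input of length 1024, the same lookup gives the `(None, 1024, 32)` output shape, where `None` is the batch dimension.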

c) Add an LSTM layer https://keras.io/layers/recurrent/#lstm

Params:
- units = 64
- recurrent_dropout=0.5

In [None]:
model1.add(LSTM(units=64, recurrent_dropout=.5))

d) Finally, we will add a Dense layer: https://keras.io/layers/core/#dense 

Params:
- units = 1 (this will be our output)
- activation = relu

In [None]:
model1.add(Dense(units=1, activation="relu"))

e) Compile model using the .compile() method: https://keras.io/models/model/

Params:
- loss = binary_crossentropy
- optimizer = adam
- metrics = accuracy



In [None]:
model1.compile(optimizer="adam", loss="binary_crossentropy", metrics=['acc'])

f) Print the model summary:

In [None]:
model1.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 1024, 32)          2016      
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 26,913
Trainable params: 26,913
Non-trainable params: 0
_________________________________________________________________
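
The parameter counts in the summary can be reproduced by hand. This is a sketch, assuming the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias vector) and a vocabulary size of 63, which is inferred from 2016 / 32 rather than stated in the assignment.

```python
vocab_size, embed_dim, lstm_units = 63, 32, 64  # 63 is inferred, not given

# Embedding: one embed_dim-sized row per token index
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)  # 2016 24832 65 26913
```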


g) Use the `.fit()` method to fit the model on the train data. Use `validation_split=0.25`, `epochs=3`, and `batch_size=128`.

In [None]:
model1_fit = model1.fit(X_train, Y_train, validation_split=.25, epochs=3, batch_size=128)

Train on 15059 samples, validate on 5020 samples
Epoch 1/3
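
The sample counts in this output follow from the two splits applied so far. The arithmetic below is a sketch, assuming that scikit-learn rounds the test share up and that Keras holds out the last fraction of the training arrays for validation.

```python
import math

n_samples = 26773                       # rows in the data set (step 1d)
n_test = math.ceil(0.25 * n_samples)    # test share, rounded up
n_train_full = n_samples - n_test       # rows passed to .fit()

n_fit = int(n_train_full * (1 - 0.25))  # first 75% is used for fitting
n_val = n_train_full - n_fit            # last 25% is used for validation

print(n_test, n_train_full, n_fit, n_val)  # 6694 20079 15059 5020
```

Because the validation rows come from the end of `X_train` rather than from a random draw, the earlier shuffle in `train_test_split` is what keeps that slice representative.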


h) Use the `.evaluate()` method to get the loss value & the accuracy value on the test data. Use a batch size of 128 again.

In [None]:
model1.evaluate(X_test, Y_test, batch_size=128)

# 3. Model 2 - RNN + Dropout Layers + New Activation Function:

Now we will add a few new layers to our RNN and switch the activation function. You will be creating a new model here, so make sure to call it something different than the model from Part 2.
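
One way to see why the activation switch matters: binary cross-entropy treats the model output as a probability, so it expects a value between 0 and 1. The sketch below compares the two activations on a few arbitrary pre-activation values. The helper functions are written here for illustration only, and the clipping constant `eps=1e-7` is an assumption chosen to resemble how deep learning libraries guard against `log(0)`.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    # Clip the prediction so log() never sees exactly 0 or 1
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

for z in (-2.0, 0.5, 3.0):  # arbitrary pre-activation values
    print(z, relu(z), round(sigmoid(z), 3))

# A true breach (label 1) scored with a negative pre-activation:
print(binary_crossentropy(1, relu(-2.0)))     # relu outputs 0, so the loss is very large
print(binary_crossentropy(1, sigmoid(-2.0)))  # sigmoid keeps the loss moderate
```

ReLU can output exactly 0 or values above 1, neither of which is a usable probability, while sigmoid always stays strictly between 0 and 1.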

a) This RNN needs to have the following layers (add in this order):

- Embedding Layer (use same params as before)
- Dropout Layer (https://keras.io/layers/core/#dropout) - use a value of 0.5 for now
- LSTM Layer (use same params as before)
- Dropout Layer - use a value of 0.5 
- Dense Layer - (switch to a sigmoid activation function)

In [None]:
k.clear_session()
model2 = Sequential()
model2.add(Embedding(num_words, 32, input_length=max_log_length))
model2.add(Dropout(0.5))
model2.add(LSTM(64, recurrent_dropout=0.5))
model2.add(Dropout(0.5))
model2.add(Dense(1, activation="sigmoid"))

model2.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])

model2.summary()

In [None]:
model2_fit = model2.fit(X_train, Y_train, validation_split=.25, epochs=3, batch_size=128)
model2.evaluate(X_test, Y_test, batch_size=128)

# 4. Recurrent Neural Net Model 3: Build Your Own

a) RNN Requirements: 
- Use 5 or more layers
- Add a layer that was not utilized in Model 1 or Model 2 (Note: This could be a new Dense layer or an additional LSTM)

In [None]:
k.clear_session()
model3 = Sequential()
model3.add(Embedding(num_words, 32, input_length=max_log_length))
model3.add(Dropout(0.5))
model3.add(LSTM(64, recurrent_dropout=0.5, return_sequences=True))
model3.add(Dropout(0.5))
model3.add(LSTM(64, recurrent_dropout=0.5))
model3.add(Dropout(0.5))
model3.add(Dense(1, activation="sigmoid"))

model3.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=['acc'])

model3.summary()