# Assignment 7
## Akarsh Sahu
### 11-20-2019

## Part 1: Data Preparation

#### a) Import the following libraries: 

In [1]:
import sys
import os
import json
import pandas
import numpy
import optparse

from keras.callbacks import TensorBoard
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from collections import OrderedDict

Using TensorFlow backend.


#### b) We will read the data in slightly differently than before: 

In [2]:
dataframe = pandas.read_csv("dev-access.csv", engine='python', quotechar='|', header=None)

#### c) We then need to convert to a numpy.ndarray type: 

In [3]:
dataset = dataframe.values

#### d) Check the shape of the data set - it should be (26773, 2). Spend some time looking at the data.

In [4]:
dataset.shape

(26773, 2)

#### e) Store all rows and the 0th index as the feature data: 

In [5]:
X = dataset[:,0]

#### f) Store all rows and index 1 as the target variable: 

In [11]:
Y = dataset[:,1]

#### g) In the next step, we will clean up the predictors. This includes removing features that are not valuable, such as timestamp and source. 

In [12]:
for index, item in enumerate(X):
    # Quick hack to space out json elements
    reqJson = json.loads(item, object_pairs_hook=OrderedDict)
    del reqJson['timestamp']
    del reqJson['headers']
    del reqJson['source']
    del reqJson['route']
    del reqJson['responsePayload']
    X[index] = json.dumps(reqJson, separators=(',', ':'))
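
To see what the loop above does to a single record, here is a minimal sketch on one made-up log entry. The field values (and the fields that survive, such as `method` and `path`) are hypothetical; only the five deleted keys come from the cell above. The helper `clean_record` is an illustrative name, and it uses `pop` instead of `del` so that a record missing one of the keys does not raise an error.

```python
import json
from collections import OrderedDict

# Hypothetical log record; every value here is made up for illustration.
sample = json.dumps({
    "timestamp": 1502738402847,
    "method": "post",
    "query": {},
    "path": "/login",
    "statusCode": 401,
    "source": {"remoteAddress": "88.141.113.237"},
    "route": "/login",
    "headers": {"host": "localhost:8002"},
    "requestPayload": {"username": "Carl2"},
    "responsePayload": {"statusCode": 401},
})

# The five keys removed in the assignment's cleaning loop.
DROPPED_KEYS = ("timestamp", "headers", "source", "route", "responsePayload")


def clean_record(raw):
    """Parse one JSON log line, drop the unwanted keys, re-serialize compactly."""
    record = json.loads(raw, object_pairs_hook=OrderedDict)
    for key in DROPPED_KEYS:
        record.pop(key, None)  # tolerate records that lack a key
    # Compact separators remove the spaces json.dumps would add by default.
    return json.dumps(record, separators=(",", ":"))


cleaned = clean_record(sample)
print(cleaned)
```

The compact separators matter here because the model is character-level: every space that is left in the string becomes one more token in the sequence.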

#### h) We will next tokenize our data, which just means vectorizing our text. Given the data, we will tokenize every character (thus char_level = True).

In [13]:
tokenizer = Tokenizer(filters='\t\n', char_level=True)
tokenizer.fit_on_texts(X)

# we will need this later
num_words = len(tokenizer.word_index)+1
X = tokenizer.texts_to_sequences(X)
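
The idea behind character-level tokenization can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it assumes characters are lowercased, ranked by frequency, and numbered from 1 so that index 0 stays free for padding. The function names and the two sample strings are made up.

```python
from collections import Counter


def fit_char_index(texts):
    """Build a character -> integer index, most frequent character first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower())
    ordered = [ch for ch, _ in counts.most_common()]
    # Start at 1: index 0 is reserved for padding.
    return {ch: i + 1 for i, ch in enumerate(ordered)}


def texts_to_sequences(texts, char_index):
    """Replace every known character with its integer index."""
    return [[char_index[ch] for ch in text.lower() if ch in char_index]
            for text in texts]


texts = ['{"a":1}', '{"b":22}']
char_index = fit_char_index(texts)
vocab_size = len(char_index) + 1  # +1 for the reserved padding index
sequences = texts_to_sequences(texts, char_index)
print(vocab_size, sequences)
```

The `+ 1` mirrors the `num_words` line in the cell above: the embedding layer needs one row per character plus one for the padding value.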

#### i) We need to pad our data, as each observation has a different length

In [33]:
max_log_length = 1024
X_processed = sequence.pad_sequences(X, maxlen=max_log_length)
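
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence. A minimal pure-Python sketch of that behaviour for one sequence (the helper name `pad_pre` is made up, and `maxlen` is assumed to be at least 1):

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]  # truncate from the front if too long
    return [value] * (maxlen - len(seq)) + seq


print(pad_pre([1, 2, 3], 5))
print(pad_pre([1, 2, 3, 4, 5, 6], 4))
```

Padding at the front keeps the real characters at the end of the sequence, closest to the final RNN state that the Dense layer reads.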

In [39]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_processed, Y, test_size=0.25)

## Part 2: Model 1 - RNN: 

#### a-d) The first model will be a pretty minimal RNN with only an embedding layer, simple RNN and Dense layer.

In [40]:
from keras import models
from keras import layers

model_1 = models.Sequential()
model_1.add(Embedding(num_words, 32, input_length = max_log_length))
model_1.add(layers.SimpleRNN(units = 32, activation  = 'relu'))
model_1.add(layers.Dense(1, activation = 'sigmoid'))

#### e) Compile model using the .compile() method:

In [41]:
model_1.compile(optimizer = 'adam',
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])

#### f) Print the model summary

In [42]:
model_1.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 1024, 32)          2016      
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, 32)                2080      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
Total params: 4,129
Trainable params: 4,129
Non-trainable params: 0
_________________________________________________________________
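
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types; the function names are made up, and the vocabulary size of 63 is inferred from the embedding row (2016 / 32), not read from the data.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per token index.
    return vocab_size * embed_dim


def simple_rnn_params(input_dim, units):
    # Input kernel + recurrent kernel + bias.
    return units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # Weight matrix + bias.
    return units * (input_dim + 1)


vocab_size = 2016 // 32  # inferred from the embedding row of the summary
counts = [
    embedding_params(vocab_size, 32),
    simple_rnn_params(32, 32),
    dense_params(32, 1),
]
print(counts, sum(counts))
```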


#### g) Use the .fit() method to fit the model on the train data. Use a validation split of 0.25, epochs=3 and batch size = 128.

In [43]:
model_1.fit(X_train, y_train, epochs = 3, batch_size = 128, validation_split = 0.25)


Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x2513fc6c0f0>
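
The sample counts in the log above can be reproduced by hand. This is a sketch of the arithmetic, assuming that scikit-learn rounds the test-set size up and that Keras truncates the training portion when applying `validation_split`; both assumptions are consistent with the logged numbers.

```python
import math

n_samples = 26773  # rows in the data set

# train_test_split(test_size=0.25): the test size is rounded up.
n_test = math.ceil(0.25 * n_samples)
n_train_full = n_samples - n_test

# fit(validation_split=0.25): the last 25% of the training data is held out.
n_fit = int(n_train_full * (1 - 0.25))
n_val = n_train_full - n_fit

print(n_test, n_train_full, n_fit, n_val)
```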

#### h) Use the .evaluate() method to get the loss value & the accuracy value on the test data. Use a batch size of 128 again.

In [44]:
model_1.evaluate(X_test, y_test, batch_size = 128)



[0.0816379901915477, 0.9859575629234314]

## Part 3: Model 2 - LSTM + Dropout Layers:

#### a) This RNN needs to have the following layers (add in this order):

- Embedding Layer (use same params as before)
- LSTM Layer (units = 64, recurrent_dropout = 0.5)
- Dropout Layer - use a value of 0.5
- Dense Layer - (use same params as before)

In [45]:
model_2 = models.Sequential()
model_2.add(Embedding(num_words, 32, input_length = max_log_length))
model_2.add(layers.LSTM(units = 64, recurrent_dropout=0.5))
model_2.add(layers.Dropout(0.5))
model_2.add(layers.Dense(1, activation = 'sigmoid'))

#### b) Compile model using the .compile() method:

Params:
- loss = binary_crossentropy
- optimizer = adam
- metrics = accuracy

In [46]:
model_2.compile(optimizer = 'adam',
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])

#### c) Print the model summary

In [47]:
model_2.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 1024, 32)          2016      
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 65        
Total params: 26,913
Trainable params: 26,913
Non-trainable params: 0
_________________________________________________________________
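
The LSTM row has roughly four times as many parameters per unit as a simple RNN would, because an LSTM learns four sets of weights (three gates plus the candidate cell state). A short check using the standard formula; the function names are made up:

```python
def lstm_params(input_dim, units):
    # Four weight sets, each with input kernel, recurrent kernel and bias.
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # Weight matrix + bias.
    return units * (input_dim + 1)


lstm_count = lstm_params(32, 64)   # 32-dim embeddings feeding 64 LSTM units
dense_count = dense_params(64, 1)  # 64 LSTM outputs feeding 1 sigmoid unit
total = 2016 + lstm_count + 0 + dense_count  # embedding + LSTM + dropout + dense
print(lstm_count, dense_count, total)
```

The Dropout layer contributes 0 because it has no weights to learn.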


#### d) Use the .fit() method to fit the model on the train data. Use a validation split of 0.25, epochs=3 and batch size = 128.

In [48]:
model_2.fit(X_train, y_train, epochs = 3, batch_size = 128, validation_split = 0.25)


Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x25144e1b3c8>

#### e) Use the .evaluate() method to get the loss value & the accuracy value on the test data. Use a batch size of 128 again.

In [44]:
model_2.evaluate(X_test, y_test, batch_size = 128)



[0.0816379901915477, 0.9859575629234314]

## Part 4: Recurrent Neural Net Model 3: Build Your Own

#### a) RNN Requirements:
- Use 5 or more layers
- Add a layer that was not utilized in Model 1 or Model 2 (Note: This could be a new Dense layer or an additional LSTM)

In [56]:
model_3 = models.Sequential()
model_3.add(Embedding(num_words, 64, input_length = max_log_length))
model_3.add(layers.LSTM(units = 64, recurrent_dropout=0.5))
model_3.add(layers.Dropout(0.5))
model_3.add(layers.Dense(32, activation = 'relu'))
model_3.add(layers.Dense(1, activation = 'sigmoid'))

#### b) Compiler Requirements:
- Try a new optimizer for the compile step
- Keep accuracy as a metric (feel free to add more metrics if desired)

In [57]:
model_3.compile(optimizer = 'rmsprop',
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])

#### c) Print the model summary

In [58]:
model_3.summary()

Model: "sequential_15"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 1024, 64)          4032      
_________________________________________________________________
lstm_14 (LSTM)               (None, 64)                33024     
_________________________________________________________________
dropout_6 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 33        
Total params: 39,169
Trainable params: 39,169
Non-trainable params: 0
_________________________________________________________________


#### d) Use the .fit() method to fit the model on the train data. Use a validation split of 0.25, epochs=3 and batch size = 128.

In [59]:
model_3.fit(X_train, y_train, epochs = 3, batch_size = 128, validation_split = 0.25)


Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x251045d2f60>

#### e) Use the .evaluate() method to get the loss value & the accuracy value on the test data. Use a batch size of 128 again.

In [60]:
model_3.evaluate(X_test, y_test, batch_size = 128)



[0.09551219332741737, 0.9698237180709839]

## Part 5: Conceptual Questions: 

#### 5) Explain the difference between the relu activation function and the sigmoid activation function.

Sigmoid is one of the most widely used activation functions. It appears in logistic regression and in basic neural network implementations, where it is often the first activation function introduced. Its main appeal is that its output lies in the interval (0, 1), which makes it a natural choice for models that have to predict a probability. For deeper neural networks, sigmoid is not preferred in the hidden layers because of several drawbacks, one of which is “vanishing gradients”: its derivative is at most 0.25 and approaches 0 for inputs of large magnitude, so gradients shrink as they are propagated back through many layers. There are ways to work around this problem, and sigmoid is still very popular as the output activation in classification problems.

These days, ReLU is the most popular activation function for deep neural networks. Most deep learning applications in areas such as computer vision and speech recognition now use ReLU instead of logistic (sigmoid) activation functions. ReLU outputs 0 if the input is less than 0 and returns the input unchanged otherwise, that is, f(x) = max(0, x). The operation of ReLU is often described as closer to the way biological neurons work. ReLU is non-linear and has two additional major benefits compared to sigmoid: sparsity and a reduced likelihood of vanishing gradients. For larger neural networks, training with ReLU is also typically much faster than training with sigmoid.
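
The difference is easy to see numerically. A minimal sketch of both functions and their derivatives for scalar inputs (the function names are illustrative):

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x == 0


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # constant slope for positive inputs


for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x), relu(x), relu_grad(x))

# Best case for sigmoid: ten layers, each contributing its maximum slope.
print(0.25 ** 10)
```

Even in the best case the sigmoid gradient is multiplied by at most 0.25 per layer, while ReLU passes the gradient through unchanged for every active unit.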

#### 6) Describe what one epoch actually is (epoch was a parameter used in the .fit() method).

The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters (1 epoch = 1 forward pass + 1 backward pass for ALL training samples). An epoch consists of one or more batches. For example, training in which each epoch has a single batch is called batch gradient descent. We can think of a for-loop over the number of epochs, where each iteration passes over the whole training dataset. Within this for-loop is a nested for-loop that iterates over each batch of samples, where one batch has the specified “batch size” number of samples. The number of epochs is traditionally large, often hundreds or thousands, allowing the learning algorithm to run until the error from the model has been sufficiently minimized.
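
The nested loop can be sketched as follows. No model is trained here; the sketch only counts how many parameter updates would happen, using the sample count, batch size and epoch count from the `.fit()` calls in this assignment. The function name is made up.

```python
def count_updates(n_samples, batch_size, epochs):
    """Count the mini-batch updates performed over a whole training run."""
    updates = 0
    for epoch in range(epochs):                       # outer loop: one pass per epoch
        for start in range(0, n_samples, batch_size):  # inner loop: one step per batch
            end = min(start + batch_size, n_samples)   # the last batch may be smaller
            # a real training loop would run forward/backward on samples[start:end]
            updates += 1
    return updates


per_epoch = count_updates(15059, 128, 1)
total = count_updates(15059, 128, 3)
print(per_epoch, total)
```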

#### 7) Explain how dropout works (you can look at the keras code and/or documentation) for (a) training, and (b) test data sets.

Dropout is one of the most popular regularization techniques for deep neural networks. (a) During training, at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it is entirely ignored during this training step, but it may be active during the next step. The hyperparameter p is the dropout rate, and it is typically set to 50%. (b) After training, when the model is evaluated on test data or used for prediction, neurons are no longer dropped.

Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end you get a more robust network that generalizes better.

The power of dropout is that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there is a total of 2^N possible networks, where N is the number of droppable neurons. This is such a huge number that it is virtually impossible for the same neural network to be sampled twice. Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent, since they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.
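
The train/test difference can be sketched in a few lines. This is a pure-Python illustration of “inverted” dropout, not the Keras code: as the Keras documentation describes, the values that are kept during training are scaled up by 1 / (1 - rate), so nothing needs to be rescaled at test time. The function name and the sample activations are made up.

```python
import random


def dropout(values, rate, training, rng):
    """Apply inverted dropout to a list of activations."""
    if not training or rate == 0.0:
        return list(values)  # test time: pass everything through unchanged
    keep = 1.0 - rate
    # Training: zero each value with probability `rate`, scale the survivors.
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, 0.5, training=True, rng=rng)
test_out = dropout(activations, 0.5, training=False, rng=rng)
print(train_out, test_out)
```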

#### 8) Explain why problems such as this homework assignment are better modeled with RNNs than CNNs. What type of problem will CNNs outperform RNNs on?

This homework assignment is about identifying breaches by predicting suspicious access from log files. Each log entry is a sequence of characters, and RNNs are designed for sequence prediction problems, so this assignment is better modeled with an RNN than with a CNN.

(RNNs are also better than CNNs where the data has temporal properties, such as a time series, and where the data is context sensitive, as in sentence completion.)

They are different neural network architectures that perform well on different types of data, but some types of data can be processed by either architecture. Examples are image classification and text classification, where both have been effective. Moreover, some deep learning applications may benefit from a combination of the two architectures.

RNNs are good with sequential data (one thing happens after another) and are used a lot in problems that can be framed as “what will happen next given…”, while CNNs are especially good at problems like image classification and computer vision in general. CNNs outperform RNNs on tasks such as medical image analysis, image recognition, face detection and recognition, and full-motion video analysis. CNNs have also been researched and tested on other tasks and have been effective on traditional Natural Language Processing (NLP) tasks; specifically, they have achieved very impressive results in semantic parsing, sentence modeling, and search query retrieval. CNNs have been employed in drug discovery, where they learn chemical features and have been used to predict novel biomolecules for combating disease. Finally, CNNs have also been applied to more traditional machine learning problems, such as game playing.

#### 9) Explain what RNN problem is solved using LSTM and briefly describe how.

LSTM is designed to mitigate the vanishing gradient problem in RNNs, which leaves a plain RNN with only a short-term memory and is the reason some people say RNNs have a bad memory. LSTM has internal mechanisms called gates that regulate the flow of information. In other words, LSTM networks have internal contextual state cells that act as long-term or short-term memory cells. The gates learn which data in a sequence is important to keep and which to throw away. By doing that, the network can pass relevant information down a long chain of sequence steps to make predictions.

The output of the LSTM network is modulated by the state of these cells, which is a very important property when the prediction needs to depend on the historical context of the inputs rather than only on the very last input. In other words, LSTM networks keep contextual information about their inputs by integrating a loop that allows information to flow from one step to the next. (LSTM predictions are always conditioned on the past inputs the network has seen.) LSTM networks learn what to remember and when to forget through their gate weights, in particular those of the forget gate.
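
A single LSTM step can be sketched with scalars to show how the forget gate controls memory. This is an illustrative toy using the standard LSTM equations, not the Keras layer; the weight names and values are made up so that one cell almost always remembers and the other almost always forgets.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell; `w` maps weight names to floats."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g  # keep part of the old memory, add part of the new
    h = o * math.tanh(c)    # expose part of the memory as the output
    return h, c


def run(weights, steps, c0=1.0):
    h, c = 0.0, c0
    for _ in range(steps):
        h, c = lstm_step(0.0, h, c, weights)
    return c


zeros = {k: 0.0 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                          "wo", "uo", "bo", "wg", "ug", "bg")}
remember = dict(zeros, bf=10.0, bi=-10.0)  # forget gate ~1, input gate ~0
forget = dict(zeros, bf=-10.0, bi=-10.0)   # forget gate ~0, input gate ~0

c_kept = run(remember, steps=100)
c_lost = run(forget, steps=1)
print(c_kept, c_lost)
```

With the forget gate close to 1, the cell state survives 100 steps almost unchanged, which is the mechanism that lets an LSTM carry information (and gradients) across long sequences.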