# Stateful model with Keras: understanding long-term dependencies

## Library imports

In [1]:
%matplotlib inline
from __future__ import division, print_function

In [2]:
import numpy as np
from numpy.random import choice
from keras.utils import get_file
from keras.layers import TimeDistributed, Activation
from keras.layers import Dropout
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding, LSTM
from keras.layers import BatchNormalization

Using TensorFlow backend.


## Data preparation

### Loading text

In [28]:
path = get_file('kaspersky.txt', origin="https://psv4.userapi.com/c834700/u7402511/docs/d10/49ffe7ed6ef1/Kaspersky.txt")
text = open(path).read()
print('Corpus length: ', len(text))

Downloading data from https://psv4.userapi.com/c834700/u7402511/docs/d10/49ffe7ed6ef1/Kaspersky.txt
Corpus length:  45115


In [29]:
print(text[:1000])

Flexible fingerprint for detection of malware
US 8955120 B2
ABSTRACT
System and method for analyzing a target object for similarity to classes of reference objects. A first and a second set of attributes of the target object is identified composed respectively of attributes having values that are common, and variable, among a class of similar objects. A first hash is computed representing the first set of attributes according to a first hashing algorithm that is sensitive to variations in the first set of attributes among the class of similar objects. A second hash representing the second set of attributes is computed according to a second hashing algorithm that is insensitive to variations in the second set of attributes among the class of similar objects. An aggregate representation of the target object that is based on the first hash and the second hash is generated.
DESCRIPTION
CLAIM TO PRIORITY
The Application claims priority to Russian Federation Patent Application No. 2013129552

### Finding unique characters, converting to string, mapping, index converter

In [5]:
chars = sorted(set(text))
vocab_size = len(chars)
print("Total unique chars: ", vocab_size)
''.join(chars)

Total unique chars:  80


"\n '()*,-./0123456789:;ABCDEFGHIJKLMNOPRSTUVWXYZ\\_abcdefghijklmnopqrstuvwxyz«®—“”"

Map from chars to indices and vice versa

In [6]:
char_indices = dict((c,i) for i,c in enumerate(chars))
indices_char = dict((i,c) for i,c in enumerate(chars))
print(char_indices)
print(indices_char)

{'\n': 0, ' ': 1, "'": 2, '(': 3, ')': 4, '*': 5, ',': 6, '-': 7, '.': 8, '/': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, ':': 20, ';': 21, 'A': 22, 'B': 23, 'C': 24, 'D': 25, 'E': 26, 'F': 27, 'G': 28, 'H': 29, 'I': 30, 'J': 31, 'K': 32, 'L': 33, 'M': 34, 'N': 35, 'O': 36, 'P': 37, 'R': 38, 'S': 39, 'T': 40, 'U': 41, 'V': 42, 'W': 43, 'X': 44, 'Y': 45, 'Z': 46, '\\': 47, '_': 48, 'a': 49, 'b': 50, 'c': 51, 'd': 52, 'e': 53, 'f': 54, 'g': 55, 'h': 56, 'i': 57, 'j': 58, 'k': 59, 'l': 60, 'm': 61, 'n': 62, 'o': 63, 'p': 64, 'q': 65, 'r': 66, 's': 67, 't': 68, 'u': 69, 'v': 70, 'w': 71, 'x': 72, 'y': 73, 'z': 74, '«': 75, '®': 76, '—': 77, '“': 78, '”': 79}
{0: '\n', 1: ' ', 2: "'", 3: '(', 4: ')', 5: '*', 6: ',', 7: '-', 8: '.', 9: '/', 10: '0', 11: '1', 12: '2', 13: '3', 14: '4', 15: '5', 16: '6', 17: '7', 18: '8', 19: '9', 20: ':', 21: ';', 22: 'A', 23: 'B', 24: 'C', 25: 'D', 26: 'E', 27: 'F', 28: 'G', 29: 'H', 30: 'I', 31: 'J', 32: 'K', ...

`idx` holds the whole corpus as a list of character indices

In [7]:
idx = [char_indices[c] for c in text]
print(idx[:10])
''.join(indices_char[i] for i in idx[:80])

[27, 60, 53, 72, 57, 50, 60, 53, 1, 54]


'Flexible fingerprint for detection of malware\nUS 8955120 B2\nABSTRACT\nSystem and '
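
As a quick sanity check, the same mapping can be exercised on a toy string. This is only a minimal sketch: `toy_text` and the helper names `encode` and `decode` are illustrative assumptions and do not appear in the notebook.

```python
# Toy corpus standing in for the patent text (assumption for illustration only).
toy_text = "hash of a file"

# Same construction as above: sorted unique characters, then two lookup tables.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

def encode(s):
    """Turn a string into a list of integer character indices."""
    return [toy_char_indices[c] for c in s]

def decode(indices):
    """Turn a list of character indices back into a string."""
    return ''.join(toy_indices_char[i] for i in indices)

encoded = encode(toy_text)
print(encoded)
print(decode(encoded))
```

Encoding followed by decoding should reproduce the input exactly, which is what the `idx` cell above relies on.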

## Preprocess

In [8]:
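maxlen = 40  # length of each training window, in characters
sentences = []
next_char = []
# slide a window over the corpus one character at a time;
# the target is the same window shifted right by one character
for i in range(0, len(idx)-maxlen+1):
    sentences.append(idx[i: i+maxlen])
    next_char.append(idx[i+1: i+maxlen+1])
print('Number sentences: ', len(sentences))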

Number sentences:  45076


### Numpy array of inputs

In [9]:
sentences_2 = np.array(sentences[:-2])  # drop the trailing windows so every row has maxlen entries
next_char_2 = np.array(next_char[:-2])
print(sentences_2[:1])
print(next_char_2[:1])
sentences_2.shape, next_char_2.shape

[[27 60 53 72 57 50 60 53  1 54 57 62 55 53 66 64 66 57 62 68  1 54 63 66
   1 52 53 68 53 51 68 57 63 62  1 63 54  1 61 49]]
[[60 53 72 57 50 60 53  1 54 57 62 55 53 66 64 66 57 62 68  1 54 63 66  1
  52 53 68 53 51 68 57 63 62  1 63 54  1 61 49 60]]


((45074, 40), (45074, 40))
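
The windowing can be seen more easily on a toy sequence. The sketch below is an illustration under assumed names (`toy_idx`, `window`), not the notebook's code. It stops the loop one step earlier than the notebook does: the notebook's range ends at `len(idx)-maxlen+1`, so its final target holds only `maxlen-1` values, which is presumably why the trailing entries are dropped with `[:-2]` before building the arrays.

```python
# Toy index sequence and a short window (both are illustrative assumptions).
toy_idx = list(range(10))   # stands in for `idx`
window = 4                  # stands in for `maxlen`

inputs, targets = [], []
# Stop one step earlier than the notebook's loop so every target is full length.
for i in range(len(toy_idx) - window):
    inputs.append(toy_idx[i: i + window])
    targets.append(toy_idx[i + 1: i + window + 1])

print(inputs[0], targets[0])
print(len(inputs), len(targets[-1]))
```

Each target row is its input row shifted by one position, so the model is asked to predict the next character at every timestep of the window.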

### Latent vectors

In [10]:
n_fac = 24  # size of the embedding (latent) vector learned for each character

## Model creation

In [13]:
model = Sequential([
    Embedding(vocab_size, n_fac, input_length=maxlen),  # maps each character index to an n_fac-dimensional vector
    BatchNormalization(),
    LSTM(512, return_sequences=True),  # return_sequences=True emits an output at every timestep; stateful=True could be added here (not used in this run)
    Dropout(0.2),
    TimeDistributed(Dense(vocab_size)),
    Activation('softmax')
])

In [14]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 40, 24)            1920      
_________________________________________________________________
batch_normalization_2 (Batch (None, 40, 24)            96        
_________________________________________________________________
lstm_4 (LSTM)                (None, 40, 512)           1099776   
_________________________________________________________________
dropout_4 (Dropout)          (None, 40, 512)           0         
_________________________________________________________________
time_distributed_3 (TimeDist (None, 40, 80)            41040     
_________________________________________________________________
activation_3 (Activation)    (None, 40, 80)            0         
Total params: 1,142,832
Trainable params: 1,142,784
Non-trainable params: 48
_________________________________________________________________
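
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard per-layer formulas with the values from this notebook; the variable names are illustrative.

```python
# Values taken from the notebook.
vocab_size, n_fac, units = 80, 24, 512

# Embedding: one n_fac-dimensional vector per character.
embedding = vocab_size * n_fac
# BatchNormalization keeps gamma, beta, moving mean and moving variance
# per feature; only gamma and beta are trainable.
batch_norm = 4 * n_fac
batch_norm_frozen = 2 * n_fac
# LSTM: four weight sets (input, forget and output gates plus the cell
# candidate), each with input weights, recurrent weights and a bias.
lstm = 4 * (units * (n_fac + units) + units)
# Dense applied at every timestep: weights plus biases.
dense = units * vocab_size + vocab_size

total = embedding + batch_norm + lstm + dense
print(embedding, batch_norm, lstm, dense)
print(total, total - batch_norm_frozen, batch_norm_frozen)
```

The results match the summary: 1,142,832 parameters in total, of which 48 (the batch-normalization moving statistics) are non-trainable.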

In [15]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

## Train and test 

In [16]:
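def print_example():
    seed_string = 'malware detection can based on procedure'  # 40-character seed
    for i in range(320):
        # encode the last maxlen characters as a batch containing one sequence
        x = np.array([char_indices[c] for c in seed_string[-maxlen:]])[np.newaxis, :]
        preds = model.predict(x, verbose=0)[0][-1]  # distribution for the character after the window
        preds = preds / np.sum(preds)  # renormalise so the probabilities sum to 1
        next_char = choice(chars, p=preds)  # sample instead of taking the argmax
        seed_string = seed_string + next_char
    print(seed_string)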
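
`print_example` samples the next character from the predicted distribution instead of always picking the most likely one. The sketch below shows that sampling step in isolation with the standard library; the distribution `toy_probs` is made up and `sample_char` is a hypothetical helper, neither comes from the model.

```python
import random

# Made-up distribution over a 4-character vocabulary (not model output).
toy_chars = ['a', 'b', 'c', 'd']
toy_probs = [0.7, 0.2, 0.1, 0.0]

def sample_char(chars, probs, rng):
    """Draw one character with probability proportional to its weight."""
    total = sum(probs)
    weights = [p / total for p in probs]   # same renormalisation as in print_example
    return rng.choices(chars, weights=weights, k=1)[0]

rng = random.Random(0)                     # fixed seed so the run is repeatable
draws = [sample_char(toy_chars, toy_probs, rng) for _ in range(1000)]
print({c: draws.count(c) for c in toy_chars})
```

Likely characters dominate the draws, but less likely ones still appear, which is what lets the generated text vary between calls.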

In [18]:
model.fit(sentences_2, np.expand_dims(next_char_2, -1), batch_size=64, epochs=1)

Epoch 1/1


<keras.callbacks.History at 0x3f8eee4a20>

In [19]:
print_example()

malware detection can based on proceduresh interfaces) in program in ard wilith beed. The extrive, disgling that is considered to be similar to the files from a subset of attributes containiret between fuch, a maticious, with a set of attributes accosing the drofen ndoursss BA. Further portions, RApdirilal filters, a file hash of the file uses comparison mod


In [20]:
from keras import backend as K
K.set_value(model.optimizer.lr, 0.001)  # set the learning rate explicitly (0.001 is also Adam's default in Keras)

In [21]:
model.fit(sentences_2, np.expand_dims(next_char_2, -1), batch_size=64, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x3f96e5afd0>

In [22]:
print_example()

malware detection can based on procedured in this manner, “sinfinulation of the file's sections, the size of the selection of attributes (or constituent data for attribute subsets. In one embodiment, the set of attribute identification module is configured to compute a second hash generation module is configured to generate an aggregate representation of the


In [24]:
model.save_weights('C:/Users/Gavrilov/My Projects/RNN_LSTM_1.h5')

In [30]:
print_example()

malware detection can based on procedured in second hash, the set of file strings and their quantity, the filtered set of strings, the second one will becings to the input of attribute hash generation module is further configured to compute a second hash representing the second set of attributes, the hash function generation module 140 receives hashes of all
