# Stateful model with Keras: understanding long-term dependencies

## Library imports

In [178]:
%matplotlib inline
from __future__ import division, print_function

In [179]:
import numpy as np
from numpy.random import choice
from keras.utils import get_file  # only get_file is used; avoids a wildcard import
from keras.models import Sequential
from keras.layers import TimeDistributed, Activation, Dropout, Dense, Embedding, LSTM, BatchNormalization

## Data preparation

### Loading text

In [28]:
path = get_file('kaspersky.txt', origin="https://psv4.userapi.com/c834700/u7402511/docs/d10/49ffe7ed6ef1/Kaspersky.txt")
text = open(path).read()
print('Corpus length: ', len(text))

Downloading data from https://psv4.userapi.com/c834700/u7402511/docs/d10/49ffe7ed6ef1/Kaspersky.txt
Corpus length:  45115


In [33]:
print(text[:1000])

Flexible fingerprint for detection of malware
US 8955120 B2
ABSTRACT
System and method for analyzing a target object for similarity to classes of reference objects. A first and a second set of attributes of the target object is identified composed respectively of attributes having values that are common, and variable, among a class of similar objects. A first hash is computed representing the first set of attributes according to a first hashing algorithm that is sensitive to variations in the first set of attributes among the class of similar objects. A second hash representing the second set of attributes is computed according to a second hashing algorithm that is insensitive to variations in the second set of attributes among the class of similar objects. An aggregate representation of the target object that is based on the first hash and the second hash is generated.
DESCRIPTION
CLAIM TO PRIORITY
The Application claims priority to Russian Federation Patent Application No. 2013129552

### Extracting unique characters, i.e. building the vocabulary

In [180]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Total unique chars: ", vocab_size)
print(chars)

Total unique chars:  80
['\n', ' ', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '\\', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '«', '®', '—', '“', '”']


### Creating dictionaries that map characters to indices and indices back to characters

In [181]:
char_indices = dict((c,i) for i,c in enumerate(chars))
indices_char = dict((i,c) for i,c in enumerate(chars))
print(char_indices, "\n", indices_char)

{'\n': 0, ' ': 1, "'": 2, '(': 3, ')': 4, '*': 5, ',': 6, '-': 7, '.': 8, '/': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, ':': 20, ';': 21, 'A': 22, 'B': 23, 'C': 24, 'D': 25, 'E': 26, 'F': 27, 'G': 28, 'H': 29, 'I': 30, 'J': 31, 'K': 32, 'L': 33, 'M': 34, 'N': 35, 'O': 36, 'P': 37, 'R': 38, 'S': 39, 'T': 40, 'U': 41, 'V': 42, 'W': 43, 'X': 44, 'Y': 45, 'Z': 46, '\\': 47, '_': 48, 'a': 49, 'b': 50, 'c': 51, 'd': 52, 'e': 53, 'f': 54, 'g': 55, 'h': 56, 'i': 57, 'j': 58, 'k': 59, 'l': 60, 'm': 61, 'n': 62, 'o': 63, 'p': 64, 'q': 65, 'r': 66, 's': 67, 't': 68, 'u': 69, 'v': 70, 'w': 71, 'x': 72, 'y': 73, 'z': 74, '«': 75, '®': 76, '—': 77, '“': 78, '”': 79} 
 {0: '\n', 1: ' ', 2: "'", 3: '(', 4: ')', 5: '*', 6: ',', 7: '-', 8: '.', 9: '/', 10: '0', 11: '1', 12: '2', 13: '3', 14: '4', 15: '5', 16: '6', 17: '7', 18: '8', 19: '9', 20: ':', 21: ';', 22: 'A', 23: 'B', 24: 'C', 25: 'D', 26: 'E', 27: 'F', 28: 'G', 29: 'H', 30: 'I', 31: 'J', 32: '

### Vector creation 

#### Converting every character of the text into its integer index using the dictionary

In [182]:
idx = [char_indices[c] for c in text]
print(idx[:100])

[27, 60, 53, 72, 57, 50, 60, 53, 1, 54, 57, 62, 55, 53, 66, 64, 66, 57, 62, 68, 1, 54, 63, 66, 1, 52, 53, 68, 53, 51, 68, 57, 63, 62, 1, 63, 54, 1, 61, 49, 60, 71, 49, 66, 53, 0, 41, 39, 1, 18, 19, 15, 15, 11, 12, 10, 1, 23, 12, 0, 22, 23, 39, 40, 38, 22, 24, 40, 0, 39, 73, 67, 68, 53, 61, 1, 49, 62, 52, 1, 61, 53, 68, 56, 63, 52, 1, 54, 63, 66, 1, 49, 62, 49, 60, 73, 74, 57, 62, 55]
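As a quick sanity check, the indices printed above can be decoded back into text. The sketch below is only an illustration: the vocabulary is copied from the printed `chars` output instead of being recomputed from the corpus, and the helper name `decode` is hypothetical.

```python
# Vocabulary copied from the printed `chars` output (80 symbols, already sorted).
chars = list(
    "\n '()*,-./0123456789:;"
    "ABCDEFGHIJKLMNOPRSTUVWXYZ"      # the printed vocabulary has no 'Q'
    "\\_abcdefghijklmnopqrstuvwxyz"
    "\u00ab\u00ae\u2014\u201c\u201d"
)
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}


def decode(indices):
    """Map a list of integer indices back to a string (hypothetical helper)."""
    return "".join(indices_char[i] for i in indices)


# First 45 values of the printed `idx[:100]` output.
first_idx = [27, 60, 53, 72, 57, 50, 60, 53, 1, 54, 57, 62, 55, 53, 66,
             64, 66, 57, 62, 68, 1, 54, 63, 66, 1, 52, 53, 68, 53, 51,
             68, 57, 63, 62, 1, 63, 54, 1, 61, 49, 60, 71, 49, 66, 53]

decoded = decode(first_idx)
print(decoded)  # Flexible fingerprint for detection of malware
```

Encoding the decoded string with `char_indices` gives back the same list, so the two dictionaries are inverses of each other.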


### Preprocess

#### Splitting idx into overlapping chunks of 40

In [183]:
maxlen = 40
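sentences = []  # input windows: each one is a list of maxlen indices
next_char = []  # target windows: the same windows shifted right by one position
# note: for the last i the target window is one element short (39 instead of 40)
for i in range(0, len(idx) - maxlen+1):
    sentences.append(idx[i: i+maxlen])  # windows overlap: each starts one character after the previous one
    next_char.append(idx[i+1: i+maxlen+1])  # the target at every position is the character that follows it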
print('Sequence of 40 with first element as second of previous: ', sentences[:2], '\n')
print('Sequence of 40 where last element is a missed int from related sequence', next_char[:2], '\n')
print('Number sequences: ', len(sentences))

Sequence of 40 with first element as second of previous:  [[27, 60, 53, 72, 57, 50, 60, 53, 1, 54, 57, 62, 55, 53, 66, 64, 66, 57, 62, 68, 1, 54, 63, 66, 1, 52, 53, 68, 53, 51, 68, 57, 63, 62, 1, 63, 54, 1, 61, 49], [60, 53, 72, 57, 50, 60, 53, 1, 54, 57, 62, 55, 53, 66, 64, 66, 57, 62, 68, 1, 54, 63, 66, 1, 52, 53, 68, 53, 51, 68, 57, 63, 62, 1, 63, 54, 1, 61, 49, 60]] 

Sequence of 40 where last element is a missed int from related sequence [[60, 53, 72, 57, 50, 60, 53, 1, 54, 57, 62, 55, 53, 66, 64, 66, 57, 62, 68, 1, 54, 63, 66, 1, 52, 53, 68, 53, 51, 68, 57, 63, 62, 1, 63, 54, 1, 61, 49, 60], [53, 72, 57, 50, 60, 53, 1, 54, 57, 62, 55, 53, 66, 64, 66, 57, 62, 68, 1, 54, 63, 66, 1, 52, 53, 68, 53, 51, 68, 57, 63, 62, 1, 63, 54, 1, 61, 49, 60, 71]] 

Number sequences:  45076
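The windowing can be seen more easily on a toy index list. This is a minimal sketch, not the notebook's exact loop: `make_windows` and `toy_idx` are hypothetical names, and the loop stops at `len(idx) - maxlen` so that every target window is full length (the loop above runs one step further, which leaves its last target window with 39 elements).

```python
def make_windows(idx, maxlen):
    """Return (inputs, targets): overlapping windows and the same windows shifted by one."""
    inputs, targets = [], []
    # Stopping at len(idx) - maxlen keeps every target window at full length.
    for i in range(len(idx) - maxlen):
        inputs.append(idx[i: i + maxlen])
        targets.append(idx[i + 1: i + maxlen + 1])
    return inputs, targets


toy_idx = list(range(10))            # stand-in for the real index list
inputs, targets = make_windows(toy_idx, maxlen=4)

print(inputs[0], targets[0])         # [0, 1, 2, 3] [1, 2, 3, 4]
print(len(inputs))                   # 6
```

Each target window equals the next input window, which is what makes this a next-character prediction task at every time step.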


### Creating NumPy arrays of inputs from chunked idx

In [184]:
sentences = np.array(sentences[:-2])  # 2-D array of shape (num_sequences, maxlen); the trailing windows are dropped
next_char = np.array(next_char[:-2])  # dropping the tail also removes the final, shorter target window
print('Array of sentences: ', sentences[:2], '\n')
print('Array of next_char: ', next_char[:2])
sentences.shape, next_char.shape

Array of sentences:  [[27 60 53 72 57 50 60 53  1 54 57 62 55 53 66 64 66 57 62 68  1 54 63 66
   1 52 53 68 53 51 68 57 63 62  1 63 54  1 61 49]
 [60 53 72 57 50 60 53  1 54 57 62 55 53 66 64 66 57 62 68  1 54 63 66  1
  52 53 68 53 51 68 57 63 62  1 63 54  1 61 49 60]] 

Array of next_char:  [[60 53 72 57 50 60 53  1 54 57 62 55 53 66 64 66 57 62 68  1 54 63 66  1
  52 53 68 53 51 68 57 63 62  1 63 54  1 61 49 60]
 [53 72 57 50 60 53  1 54 57 62 55 53 66 64 66 57 62 68  1 54 63 66  1 52
  53 68 53 51 68 57 63 62  1 63 54  1 61 49 60 71]]


((45074, 40), (45074, 40))
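The shapes involved can be checked on toy data. The sketch below uses made-up windows (`toy_sentences`, `toy_next_char`) and shows two things: `np.array` on a list of equal-length lists gives the same 2-D array as concatenating one-row arrays, and `np.expand_dims(..., -1)` adds the trailing axis that the training calls later in the notebook apply to the targets.

```python
import numpy as np

# Toy stand-ins for the lists of equal-length windows.
toy_sentences = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]
toy_next_char = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]

x = np.array(toy_sentences)                  # shape (3, 4)
y = np.array(toy_next_char)                  # shape (3, 4)

# Concatenating one-row arrays produces the same 2-D array.
x_concat = np.concatenate([[np.array(row)] for row in toy_sentences])

# A trailing axis of size 1 gives one integer label per time step.
y_expanded = np.expand_dims(y, -1)           # shape (3, 4, 1)

print(x.shape, y_expanded.shape)             # (3, 4) (3, 4, 1)
```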

### Latent vectors

In [194]:
n_fac = 24  # size of the latent (embedding) vector learned for each character

## Model creation

In [195]:
model = Sequential([
    Embedding(vocab_size, n_fac, input_length=maxlen), 
    BatchNormalization(),
    LSTM(512, return_sequences=True), 
    Dropout(0.2),
    TimeDistributed(Dense(vocab_size)),
    Activation('softmax')
])

In [196]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_26 (Embedding)     (None, 40, 24)            1920      
_________________________________________________________________
batch_normalization_7 (Batch (None, 40, 24)            96        
_________________________________________________________________
lstm_32 (LSTM)               (None, 40, 512)           1099776   
_________________________________________________________________
dropout_32 (Dropout)         (None, 40, 512)           0         
_________________________________________________________________
time_distributed_26 (TimeDis (None, 40, 80)            41040     
_________________________________________________________________
activation_26 (Activation)   (None, 40, 80)            0         
Total params: 1,142,832
Trainable params: 1,142,784
Non-trainable params: 48
_________________________________________________________________
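The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sketch using the standard formulas for these layer types; the variable names are chosen for illustration only.

```python
vocab_size, n_fac, lstm_units = 80, 24, 512

# Embedding: one n_fac-dimensional vector per character.
embedding = vocab_size * n_fac
# Batch normalization: gamma and beta (trainable) plus
# moving mean and moving variance (non-trainable), per feature.
batch_norm = 4 * n_fac
batch_norm_frozen = 2 * n_fac
# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm = 4 * (lstm_units * (n_fac + lstm_units) + lstm_units)
# Dense: weights plus bias, shared across all time steps by TimeDistributed.
dense = lstm_units * vocab_size + vocab_size

total = embedding + batch_norm + lstm + dense
trainable = total - batch_norm_frozen

print(embedding, batch_norm, lstm, dense)    # 1920 96 1099776 41040
print(total, trainable)                      # 1142832 1142784
```

Almost all of the parameters sit in the LSTM layer, so changing the number of LSTM units has by far the largest effect on model size.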

In [197]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
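For a single time step, sparse categorical cross-entropy is the negative log of the probability assigned to the true character. The sketch below is plain Python, not the Keras implementation; `per_step_loss` is a hypothetical helper. It gives a useful reference point: a model that spreads probability uniformly over the 80 characters has a loss of about ln(80).

```python
import math


def per_step_loss(probs, target_index):
    """Negative log-probability of the true class for one time step."""
    return -math.log(probs[target_index])


vocab_size = 80
target = 27                                   # index of 'F' in the vocabulary

# Uniform prediction: roughly what an untrained model is expected to produce.
uniform = [1.0 / vocab_size] * vocab_size
baseline = per_step_loss(uniform, target)
print(round(baseline, 3))                     # 4.382

# Confident, correct prediction: 0.99 on the true character.
confident = [0.01 / (vocab_size - 1)] * vocab_size
confident[target] = 0.99
low = per_step_loss(confident, target)
print(round(low, 3))                          # 0.01
```

A training loss well below 4.38 therefore indicates that the model has learned something beyond guessing characters at random.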

## Train and test 

In [191]:
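def print_example():
    """Generate 320 characters from the model, starting from a fixed 40-character seed."""
    seed_string="malware detection can be implemented in "
    for i in range(320):
        # encode the last 40 characters as a batch of one sequence, shape (1, 40)
        x=np.array([char_indices[c] for c in seed_string[-40:]])[np.newaxis,:]
        # keep only the distribution predicted for the final time step
        preds = model.predict(x, verbose=0)[0][-1]
        # renormalize in float64 so the probabilities sum to 1 for sampling
        preds = np.asarray(preds).astype('float64')
        preds = preds/np.sum(preds)
        # sample the next character instead of always taking the most likely one
        next_char = choice(chars, p=preds)
        seed_string = seed_string + next_char
    print(seed_string)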
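The sampling step can be looked at in isolation. This is a sketch on a made-up three-character vocabulary and a made-up softmax output, with a fixed seed so the run is repeatable. It contrasts sampling from the predicted distribution with greedy argmax decoding, which would always pick the same character for a given context.

```python
import numpy as np

chars = ['a', 'b', 'c']                                # toy vocabulary
preds = np.array([0.7, 0.2, 0.1], dtype=np.float32)    # toy softmax output

# float32 probabilities may not sum to exactly 1, so renormalize in float64.
probs = preds.astype(np.float64)
probs = probs / probs.sum()

rng = np.random.RandomState(0)                         # fixed seed for repeatability
samples = rng.choice(chars, size=10000, p=probs)

greedy = chars[int(np.argmax(probs))]                  # always the most likely character
freq_a = float(np.mean(samples == 'a'))                # empirical frequency, close to 0.7

print(greedy, round(freq_a, 2))
```

Sampling keeps the generated text varied, which is why repeated calls with the same seed string produce different continuations.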

In [154]:
model.fit(sentences, np.expand_dims(next_char, -1), batch_size=64, epochs=1)

Epoch 1/1


<keras.callbacks.History at 0x3fab2f8ac8>

In [192]:
print_example()

malware detection can be implemented in k4csn«.lr/pztLwF:1rMJ73o8g;,“)-jUFgNc6z4KnffJ«pdF(Hft8I3(Smx'®H“—6//MTOZ-RW\TC/Thsyf
I(TO_XFc“\Og/069C;UGf®J/dJ”922_wZhfrXGunYjH7®M00jCy6-3pJ«:mZgf)3K« JcWh-0:«yMHIGm7(86LB)Ef.x\bDto“XsFn)-E_8dDq )8jpOV_Wi_cS,(LxG“YH.:covHr«I/xAt,
4* xh“/v22bnX9o“Ci78VWZrc.jS F;Cl*kmsZ3o:sC2GLZ4xXtyeu06Z1v3”NDNv«R3ePPW)drkOudN83wiJ/
y2


In [193]:
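from keras import backend as K
K.set_value(model.optimizer.lr, 0.001)  # update the backend variable; plain attribute assignment may not affect an already-built training function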

In [198]:
model.fit(sentences, np.expand_dims(next_char, -1), batch_size=64, epochs=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x3f92482e10>

In [199]:
print_example()

malware detection can be implemented in this manner, as be an application server, an administrative server, client computers, or a network appliance.
A user may enter information to the computer 2 using input devices connected to the user input interface 14 such as a mouse 68 and keyboard 70. Additionally, the input device may be a trackpad, fingerprint scan


In [200]:
print_example()

malware detection can be implemented in software application executing on a compres for alaly analysis will be deemed malicious at 500. If, at stage 490, it is found that a cluster of similar nimit and am ollow hoaving respectively different values for a common type of attribute, when processed by hash generation module 130 for the subset of fixed attributes


In [201]:
print_example()

malware detection can be implemented in to hashes of file attribute subsets. Being generated on the basis of instructions that limit access to the aforementioned data stored in the files.
In another embodiment, the attribute hash generation module creates a hash of a file attribute subset, which includes at least one variable attribute. In another embodiment


In [202]:
model.save_weights('C:/Users/Gavrilov/My Projects/LSTM_weights_1.h5')