# Sequential Data Analysis

Sequential data is data in which the order of the elements matters, such as text. In Keras, we process sequential data with recurrent neural networks (RNNs) or a variant such as the LSTM.

### Import Necessary Libraries

In [2]:
import numpy
import sys
from keras import Model
from keras.optimizers import Adadelta
from keras.layers import Dense, Dropout, LSTM, Input
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical

### Prepare the Dataset

We will use a text file containing a few paragraphs of *The Fellowship of the Ring* by J. R. R. Tolkien as the dataset for our RNN. The dataset is deliberately small because RNNs take a long time to train, especially as the data size increases.

This is an example use case only, so the resulting model will have low accuracy.

First we import the text file into the program and convert all the characters to lower case to make them easier to process. Then we create mappings of unique characters to integers and vice-versa.

Next we prepare the input-output pairs. Each input is a sequence of 100 characters, and the output is the 101st character. Once that is done, we reshape the input data to `[samples, time steps, features]` and normalise it by dividing by the vocabulary size. Finally, we one-hot encode the output data and return the results to the caller.
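To make the windowing concrete, here is a minimal sketch of the same preparation on a toy string. The string `"hello world"` and the window length of 3 are illustrative assumptions (the notebook uses a text file and a window of 100), and the one-hot step uses plain NumPy rather than `to_categorical` so the snippet runs without Keras.

```python
import numpy as np

# Toy stand-in for the text file; the notebook itself uses seq_length=100.
text = "hello world"
seq_length = 3  # illustrative value only

# Map each unique character to an integer index.
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}

# Slide a window over the text: seq_length characters in, the next one out.
windows, targets = [], []
for i in range(len(text) - seq_length):
    windows.append([char_to_int[c] for c in text[i:i + seq_length]])
    targets.append(char_to_int[text[i + seq_length]])

# Reshape to [samples, time steps, features] and scale into [0, 1).
X = np.reshape(windows, (len(windows), seq_length, 1)) / float(len(chars))

# One-hot encode the targets: one row per sample, one column per character.
Y = np.eye(len(chars))[targets]

print(chars)
print(windows[0], "->", targets[0])
print(X.shape, Y.shape)
```

The first window covers `"hel"` and its target is the following `"l"`; every later window is shifted one character to the right.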

In [3]:
def loadText():
    # raw string so the backslashes in the Windows path are not treated as escapes
    rawtextfile=r"D:\Projects\AIML\Keras\inputtext.txt"
    with open(rawtextfile, 'r', encoding='utf-8') as f:
        rawtext=f.read()
    rawtext=rawtext.lower()

    # create mapping of unique characters to integers
    chars=sorted(list(set(rawtext)))
    char_to_int=dict((c,i) for i,c in enumerate(chars))
    int_to_char=dict((i,c) for i,c in enumerate(chars))

    # summarise the loaded data
    nchars=len(rawtext)
    nvocab=len(chars)
    print("Total Characters: ", nchars)
    print("Total Vocab: ", nvocab)

    # prepare the dataset of input and output pairs.
    seq_length=100
    datax=[]
    datay=[]
    for i in range(0, nchars - seq_length, 1):
        seq_in = rawtext[i:i + seq_length]
        seq_out = rawtext[i + seq_length]
        datax.append([char_to_int[char] for char in seq_in])
        datay.append(char_to_int[seq_out])
    npatterns = len(datax)
    print ("Total Patterns: ", npatterns)

    # reshape X to be [samples, time steps, features]
    X = numpy.reshape(datax, (npatterns, seq_length, 1))

    # normalize
    X = X / float(nvocab)

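    # one hot encode the output variable; num_classes keeps one column per
    # vocabulary entry even if the highest index never occurs as a target
    Y = to_categorical(datay, num_classes=nvocab)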
    return X,Y,datax,nvocab,int_to_char


Call the above function to obtain the dataset `X` and `Y`.

In [4]:
X,Y,X_,n_vocab,int_to_char = loadText()

Total Characters:  5184
Total Vocab:  41
Total Patterns:  5084
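The printed numbers can be checked by hand. The following sketch reproduces the pattern count from the character count and window length, and derives the number of batches per epoch for the batch size of 128 used in the training cell later in this notebook. Nothing here touches the text file; the values are copied from the output above.

```python
import math

n_chars = 5184      # "Total Characters" from the output above
seq_length = 100    # window length used in loadText()
batch_size = 128    # value passed to model.fit() later in the notebook

# One pattern per starting position that still has a following character.
n_patterns = n_chars - seq_length

# The last batch is smaller than the others, hence the ceiling.
steps_per_epoch = math.ceil(n_patterns / batch_size)

print(n_patterns)                   # 5084
print((n_patterns, seq_length, 1))  # shape of X
print(steps_per_epoch)              # 40
```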


### Create the Model

Now we create the model. It stacks two LSTM layers, each followed by Dropout regularisation, and ends with a dense softmax layer that predicts the next character.

In [5]:
def getModel (X,Y):
    inputs=Input(shape=(X.shape[1],X.shape[2]),name="Input")
    # the input shape is already defined by the Input layer above
    lstm1=LSTM(256, return_sequences=True)(inputs)
    drop1=Dropout(0.2)(lstm1)
    lstm2=LSTM(256)(drop1)
    drop2=Dropout(0.2)(lstm2)
    outputs=Dense(Y.shape[1], activation='softmax')(drop2)
    model=Model(inputs=inputs,outputs=outputs)
    return model

In [6]:
model=getModel(X,Y)
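It can help to know how large this model is relative to the dataset. The sketch below counts trainable parameters by hand using the standard formulas for an LSTM layer with biases (four gates, each with input weights, recurrent weights, and a bias) and for a dense layer. The helper names `lstm_params` and `dense_params` are hypothetical, and the output width of 41 assumes `Y` has one column per vocabulary entry.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # One weight per input/output pair plus one bias per output.
    return input_dim * units + units


n_vocab = 41  # assumed width of the softmax layer

layers = {
    "lstm1": lstm_params(1, 256),      # one feature per time step
    "lstm2": lstm_params(256, 256),    # fed by the 256 outputs of lstm1
    "dense": dense_params(256, n_vocab),
}
total = sum(layers.values())  # Dropout layers add no parameters

for name, count in layers.items():
    print(name, count)
print("total", total)
```

Under these assumptions the model has roughly 800,000 parameters against about 5,000 training patterns, which is one reason to treat the results as a demonstration rather than a usable text generator. Calling `model.summary()` should report the actual counts.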


### Compile and Train the Model

In [7]:
model.compile(optimizer=Adadelta(), loss='categorical_crossentropy', metrics=['accuracy'])

In [9]:
model.fit(X, Y, epochs=100, batch_size=128, callbacks=[EarlyStopping(monitor='loss', patience=5, restore_best_weights=True)])

Epoch 1/100
[1m40/40[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m68s[0m 1s/step - accuracy: 0.0353 - loss: 3.7062
Epoch 2/100
[1m40/40[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m55s[0m 1s/step - accuracy: 0.0440 - loss: 3.7050
Epoch 3/100
[1m40/40[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m54s[0m 1s/step - accuracy: 0.0618 - loss: 3.7036
Epoch 4/100
[1m40/40[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m56s[0m 1s/step - accuracy: 0.0720 - loss: 3.7030
Epoch 5/100
[1m40/40[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m58s[0m 1s/step - accuracy: 0.0909 - loss: 3.7020
Epoch 6/100
[1m40/40[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m55s[0m 1s/step - accuracy: 0.1152 - loss: 3.7005
Epoch 7/100
[1m40/40[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m61s[0m 2s/step - accuracy: 0.1347 - loss: 3.6998
Epoch 8/100
[1m40/40[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m62s[0m 2s/step - accuracy: 0.1419 - loss: 3.6979
Epoch 9/100
[1m40/40[0m [32m━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x28b80689220>
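To put the reported loss values in context, the sketch below computes the categorical cross-entropy of a model that assigns equal probability to all 41 characters, and compares it with the losses logged for epochs 1 and 8 above. The vocabulary size and the two loss values are copied from the notebook output; everything else is plain arithmetic.

```python
import math

n_vocab = 41  # "Total Vocab" from the dataset summary

# Cross-entropy of a uniform guess: -log(1 / n_vocab) = log(n_vocab).
uniform_loss = -math.log(1.0 / n_vocab)
print(round(uniform_loss, 4))  # 3.7136

# Loss values copied from the training log above.
reported = {1: 3.7062, 8: 3.6979}
improvement = {epoch: uniform_loss - loss for epoch, loss in reported.items()}
for epoch, gain in improvement.items():
    print(f"epoch {epoch}: {gain:.4f} below the uniform baseline")
```

The logged losses sit only slightly below this baseline, so after eight epochs the model has learned little beyond the uniform guess. One possible contributor, offered as an assumption rather than a diagnosis, is that `Adadelta()` in Keras defaults to a learning rate of 0.001, which tends to converge slowly.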

### Generate Text

Now we can use the trained model to generate the next characters in the sequence. The `write_next()` function does this in the following steps:

1. Pick a random starting point from the input data `X_` to use as the initial pattern.
2. Print the initial pattern (seed) as a string of characters.
3. Use a loop to generate 100 characters by:
   - Reshaping the pattern to be suitable for input into the model.
   - Normalizing the input data.
   - Using the model to predict the next character.
   - Choosing the most likely next character.
   - Printing the predicted character.
   - Updating the pattern by adding the predicted character and removing the earliest character.
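The steps above can be sketched without a trained network. In the following example, `StubModel` is a hypothetical stand-in for the Keras model: its `predict` method returns a fixed, deterministic distribution so that the loop can be run and checked. The `generate` helper, the five-letter vocabulary, and the three-character seed are likewise assumptions made for illustration.

```python
import numpy as np


class StubModel:
    """Hypothetical stand-in for the trained Keras model."""

    def __init__(self, n_vocab):
        self.n_vocab = n_vocab

    def predict(self, x, verbose=0):
        # Recover the last character index from the normalised input and
        # put most of the probability mass on the index that follows it.
        last = int(round(float(x[0, -1, 0]) * self.n_vocab))
        probs = np.full((1, self.n_vocab), 0.01)
        probs[0, (last + 1) % self.n_vocab] = 1.0
        return probs / probs.sum()


def generate(model, seed, n_vocab, int_to_char, n_chars=100):
    pattern = list(seed)  # copy so the caller's seed is left unchanged
    out = []
    for _ in range(n_chars):
        # Reshape to [1, time steps, 1] and normalise as during training.
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
        probs = model.predict(x, verbose=0)
        # Greedy choice: take the most likely next character.
        index = int(np.argmax(probs))
        out.append(int_to_char[index])
        # Slide the window: drop the oldest index, append the new one.
        pattern = pattern[1:] + [index]
    return "".join(out)


int_to_char = dict(enumerate("abcde"))
seed = [0, 1, 2]  # "abc"
text = generate(StubModel(5), seed, 5, int_to_char, n_chars=7)
print(text)  # deabcde
```

Because the choice is always the arg-max, the output is fully determined by the seed; a model that favours one character at every step will repeat that character indefinitely.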

In [19]:
def write_next(model,X_,n_vocab,int_to_char):
  # pick a random seed...
  start = numpy.random.randint(0, len(X_))
  # ... in order to decide which X datum to use to start
  # (copied so that appending below does not modify X_)
  pattern = list(X_[start])

  print ("Seed:")
  print ("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
  # generate characters
  for i in range(100):
    # We transform the integer mapping of the characters to
    # real numbers suitable for input into our model.
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x/float(n_vocab)
    # We use the model to estimate the probability distribution for
    # the next character
    prediction = model.predict(x, verbose=0)
    # We choose as the next character whichever the model thinks is most likely
    index = int(numpy.argmax(prediction))
    result = int_to_char[index]
    # We write the character without a newline so the output reads as text
    sys.stdout.write(result)
    sys.stdout.flush()
    # We add the integer to our pattern...
    pattern.append(index)
    # ... and drop the earliest integer from our pattern.
    pattern = pattern[1:len(pattern)]
  print("\ndone")

In [20]:
write_next(model,X_,n_vocab,int_to_char)

Seed:
" getting mixed up in the business of your betters, or you’ll land in trouble too big for you, i says  "
[100 generated characters, all of them spaces]
done


### Sources

+ Infosys Springboard
+ [Mastering Keras, Module 2.3: LSTMs (Packt Publishing, GitHub)](https://github.com/PacktPublishing/Mastering-Keras/blob/master/Module%202.3_%20LSTMs.ipynb)