## Text generation using RNN - Character Level

To generate text with an RNN, we first need to convert the raw text into a supervised learning problem.
Take, for example, the following corpus:
"Her brother shook his head incredulously"
First we divide the data into a tabular format containing input (X) and output (y) sequences. For a character-level model, X and y look like this:
X

(1): Her b , (2): er br , (3): r bro , (4):  brot

Y

(1): r , (2): o , (3): t , (4): h

--> Note that in the example above, each X sequence is five characters long and each y is a single character, which makes this a many-to-one architecture. We can, however, change the number of input characters to suit the problem.
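The windowing described above can be sketched in a few lines of plain Python. The helper name `make_windows` and its parameters are illustrative assumptions, not part of the notebook; it only shows how the (X, y) pairs are cut from the corpus.

```python
def make_windows(text, seq_len=5, step=1):
    """Split text into (input window, next character) pairs."""
    X, y = [], []
    # stop early enough that every window has a following character
    for i in range(0, len(text) - seq_len, step):
        X.append(text[i:i + seq_len])
        y.append(text[i + seq_len])
    return X, y

corpus = "Her brother shook his head incredulously"
X, y = make_windows(corpus)

# show the first four pairs
for window, target in list(zip(X, y))[:4]:
    print(repr(window), "->", repr(target))
```

Raising `seq_len` gives the model more context per sample, and raising `step` reduces the number of overlapping samples.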

--> A model is trained on such data. To generate text, we give the model any five characters, from which it predicts the next character.

It then appends the predicted character to the right end of the input sequence and discards the first character on the left end. It predicts again from the new sequence, and the cycle continues for a fixed number of iterations. An example is shown below:

X

(1) : incre ,

(2): ncre<predicted character 1> ,


(3): cre<predicted character 1><predicted character 2> ,

(4):  re<predicted character 1><predicted character 2><predicted character 3>,

Y

(1) : <predicted character 1>,     

(2): <predicted character 2>,     

(3) : <predicted character 3>,    

(4):  <predicted character 4>,
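The cycle above can be sketched as a small loop. This is a minimal illustration under stated assumptions: `generate` and `toy_predictor` are hypothetical names, and the toy predictor is only a stand-in for a trained model so that the window mechanics can be followed by hand.

```python
def generate(seed, predict_next, n_chars):
    """Slide a fixed-size window over the text while appending predictions."""
    window = seed
    generated = seed
    for _ in range(n_chars):
        next_char = predict_next(window)  # a trained model would be called here
        generated += next_char
        # drop the leftmost character, append the new one on the right
        window = window[1:] + next_char
    return generated

# hypothetical stand-in for a trained model: it simply returns
# the first character of the current window
def toy_predictor(window):
    return window[0]

result = generate("incre", toy_predictor, 4)
print(result)
```

The window always keeps the same length as the seed, which is what lets the same model be called at every step.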



# Notebook overview
1. Preprocess data

2. LSTM model

3. Generate code

In [None]:
# import libraries

import os
import re
import numpy as np
import random
import sys
import io
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense , Activation
from tensorflow.keras.layers import LSTM , GRU , Bidirectional
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import get_file



In [None]:
# access to Google drive in colab to get the dataset

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 1.Preprocess data

We are going to build a C code generator by training an RNN on a large corpus of C code (the Linux kernel source). You can download the C code used as source text from the following link: [https://github.com/torvalds/linux/tree/master](https://github.com/torvalds/linux/tree/master).


We have already downloaded the entire kernel folder and stored it in a local directory.




# Load C code

In [None]:
# set path where C files reside

path=r'/content/drive/MyDrive/Colab Notebooks/RNNs/kernel'

os.chdir(path)

files_names = os.listdir(path)

print(files_names)


['async.c', '.gitignore', 'acct.c', 'audit.c', 'Kconfig.kexec', 'Kconfig.freezer', 'Kconfig.hz', 'Makefile', 'Kconfig.locks', 'audit_tree.c', 'audit.h', 'Kconfig.preempt', 'audit_fsnotify.c', 'kexec.c', 'freezer.c', 'fork.c', 'cred.c', 'delayacct.c', 'kallsyms_selftest.c', 'kallsyms_internal.h', 'kexec_elf.c', 'auditfilter.c', 'kexec_internal.h', 'kallsyms_selftest.h', 'iomem.c', 'kexec_file.c', 'extable.c', 'auditsc.c', 'dma.c', 'exit.h', 'fail_function.c', 'kcmp.c', 'crash_reserve.c', 'kexec_core.c', 'gen_kheaders.sh', 'bounds.c', 'exit.c', 'elfcorehdr.c', 'kheaders.c', 'cfi.c', 'groups.c', 'crash_core.c', 'cpu.c', 'hung_task.c', 'irq_work.c', 'jump_label.c', 'kcov.c', 'configs.c', 'backtracetest.c', 'kallsyms.c', 'audit_watch.c', 'exec_domain.c', 'compat.c', 'cpu_pm.c', 'capability.c', 'context_tracking.c', 'scs.c', 'sysctl-test.c', 'signal.c', 'seccomp.c', 'rseq.c', 'nsproxy.c', 'stop_machine.c', 'sysctl.c', 'latencytop.c', 'module_signature.c', 'panic.c', 'regset.c', 'stacktrace.c

In [None]:
# use a regex to keep only the .c files

c_names = r"\.c$"

c_files = list()

for file in files_names:
    # re.search scans the whole name; re.match would only test the start
    if re.search(c_names, file):
        c_files.append(file)

print(c_files)



['async.c', 'acct.c', 'audit.c', 'audit_tree.c', 'audit_fsnotify.c', 'kexec.c', 'freezer.c', 'fork.c', 'cred.c', 'delayacct.c', 'kallsyms_selftest.c', 'kexec_elf.c', 'auditfilter.c', 'iomem.c', 'kexec_file.c', 'extable.c', 'auditsc.c', 'dma.c', 'fail_function.c', 'kcmp.c', 'crash_reserve.c', 'kexec_core.c', 'bounds.c', 'exit.c', 'elfcorehdr.c', 'kheaders.c', 'cfi.c', 'groups.c', 'crash_core.c', 'cpu.c', 'hung_task.c', 'irq_work.c', 'jump_label.c', 'kcov.c', 'configs.c', 'backtracetest.c', 'kallsyms.c', 'audit_watch.c', 'exec_domain.c', 'compat.c', 'cpu_pm.c', 'capability.c', 'context_tracking.c', 'scs.c', 'sysctl-test.c', 'signal.c', 'seccomp.c', 'rseq.c', 'nsproxy.c', 'stop_machine.c', 'sysctl.c', 'latencytop.c', 'module_signature.c', 'panic.c', 'regset.c', 'stacktrace.c', 'stackleak.c', 'tsacct.c', 'resource.c', 'ksysfs.c', 'softirq.c', 'padata.c', 'static_call_inline.c', 'static_call.c', 'notifier.c', 'taskstats.c', 'sys.c', 'params.c', 'kthread.c', 'scftorture.c', 'ptrace.c', 'trac

In [None]:
# load all C code into a list

full_code = list()

for file in c_files:
    # the context manager closes each file even if reading fails
    with open(file, "r", encoding="utf-8") as code:
        full_code.append(code.read())

In [None]:
# let's look at what a typical C file looks like

print(full_code[20])



// SPDX-License-Identifier: GPL-2.0-only
/*
 * crash.c - kernel crash support code.
 * Copyright (C) 2002-2004 Eric Biederman  <ebiederm@xmission.com>
 */

#include <linux/buildid.h>
#include <linux/init.h>
#include <linux/utsname.h>
#include <linux/vmalloc.h>
#include <linux/sizes.h>
#include <linux/kexec.h>
#include <linux/memory.h>
#include <linux/cpuhotplug.h>
#include <linux/memblock.h>
#include <linux/kmemleak.h>

#include <asm/page.h>
#include <asm/sections.h>

#include <crypto/sha1.h>

#include "kallsyms_internal.h"
#include "kexec_internal.h"

/* Location of the reserved area for the crash kernel */
struct resource crashk_res = {
	.name  = "Crash kernel",
	.start = 0,
	.end   = 0,
	.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
	.desc  = IORES_DESC_CRASH_KERNEL
};
struct resource crashk_low_res = {
	.name  = "Crash kernel",
	.start = 0,
	.end   = 0,
	.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
	.desc  = IORES_DESC_CRASH_KERNEL
};

/*
 * parsing the "crashkernel" comma

## If the set of characters or words (the vocabulary) is large, use word embeddings; if it is small, use one-hot encoding

In [None]:
# merge different c codes into one big c code

text = "\n".join(full_code)

print(f"Total number of characters in entire code {len(text)}")


Total number of characters in entire code 2230677


In [None]:
text



# convert characters to integers

In [None]:
# create character to index mapping

chars = sorted(list(set(text)))
print(f"Total number of unique characters {len(chars)}")

char_indices = dict((c , i) for i , c in enumerate(chars))
indices_char = dict((i , c) for i , c in enumerate(chars))



Total number of unique characters 99


In [None]:
indices_char

{0: '\t',
 1: '\n',
 2: ' ',
 3: '!',
 4: '"',
 5: '#',
 6: '$',
 7: '%',
 8: '&',
 9: "'",
 10: '(',
 11: ')',
 12: '*',
 13: '+',
 14: ',',
 15: '-',
 16: '.',
 17: '/',
 18: '0',
 19: '1',
 20: '2',
 21: '3',
 22: '4',
 23: '5',
 24: '6',
 25: '7',
 26: '8',
 27: '9',
 28: ':',
 29: ';',
 30: '<',
 31: '=',
 32: '>',
 33: '?',
 34: '@',
 35: 'A',
 36: 'B',
 37: 'C',
 38: 'D',
 39: 'E',
 40: 'F',
 41: 'G',
 42: 'H',
 43: 'I',
 44: 'J',
 45: 'K',
 46: 'L',
 47: 'M',
 48: 'N',
 49: 'O',
 50: 'P',
 51: 'Q',
 52: 'R',
 53: 'S',
 54: 'T',
 55: 'U',
 56: 'V',
 57: 'W',
 58: 'X',
 59: 'Y',
 60: 'Z',
 61: '[',
 62: '\\',
 63: ']',
 64: '^',
 65: '_',
 66: '`',
 67: 'a',
 68: 'b',
 69: 'c',
 70: 'd',
 71: 'e',
 72: 'f',
 73: 'g',
 74: 'h',
 75: 'i',
 76: 'j',
 77: 'k',
 78: 'l',
 79: 'm',
 80: 'n',
 81: 'o',
 82: 'p',
 83: 'q',
 84: 'r',
 85: 's',
 86: 't',
 87: 'u',
 88: 'v',
 89: 'w',
 90: 'x',
 91: 'y',
 92: 'z',
 93: '{',
 94: '|',
 95: '}',
 96: '~',
 97: 'å',
 98: '∩'}

In [None]:
print(f"Vocabulary size : {len(chars)}")



Vocabulary size : 99
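To make the role of the two dictionaries concrete, here is a small round trip on a toy string. The variable `sample_text` is an assumption used in place of the full corpus; the mapping logic mirrors the cell above.

```python
sample_text = "int main(void)"  # toy stand-in for the full corpus

chars = sorted(set(sample_text))
char_indices = {c: i for i, c in enumerate(chars)}  # character -> integer
indices_char = {i: c for i, c in enumerate(chars)}  # integer -> character

# encode the text as integers, then decode it back
encoded = [char_indices[c] for c in sample_text]
decoded = "".join(indices_char[i] for i in encoded)

print(encoded)
print(decoded)
```

`char_indices` is used when building the training arrays, and `indices_char` is used at generation time to turn a predicted index back into a character.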


# Divide data in input(X) and output(Y)


## Create sequences

In [None]:
# define length for each sequence

MAX_SEQUENCE_LENGTH = 50 # number of input characters (X) in each sequence
STEP = 3 # increment between each sequence
VOCAB_SIZE = len(chars) # total number of unique characters in dataset

sentences = [] # X
next_chars = [] # Y

for i in range(0,len(text)-MAX_SEQUENCE_LENGTH,STEP):
    sentences.append(text[i:i+MAX_SEQUENCE_LENGTH])
    next_chars.append(text[i+MAX_SEQUENCE_LENGTH])





In [None]:
print(f"Number of training samples : {len(sentences)}")

Number of training samples : 743543
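The sample count follows directly from the text length, the window length, and the step. A short check, using the character count reported earlier in the notebook:

```python
import math

text_length = 2230677      # total characters reported earlier
MAX_SEQUENCE_LENGTH = 50
STEP = 3

# range(0, text_length - MAX_SEQUENCE_LENGTH, STEP) yields this many start indices
n_samples = math.ceil((text_length - MAX_SEQUENCE_LENGTH) / STEP)
print(n_samples)
```

A smaller `STEP` produces more, and more heavily overlapping, training samples at the cost of memory and training time.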


# Create input and output using the created sequences

When you are not using the Keras Embedding layer as the very first layer, you need to convert your data into the following format:

input shape should be of the form (# samples , # timesteps , # features )

output shape should be of the form (# samples , # features ) for a many-to-one model like this one, or (# samples , # timesteps , # features ) when the model predicts an output at every timestep

samples : the number of data points (or sequences)

timesteps : the length of each sequence in your data (the MAX_SEQUENCE_LENGTH variable)

features : the number of features depends on the type of problem. In this problem, it is the vocabulary size, that is, the dimensionality of the one-hot vector that represents each character. If you are working with images, the feature size will be (height , width , channels ) and the input shape will be ( training_samples , timesteps , height , width , channels )
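A tiny example may help to see these dimensions. The three-character vocabulary and the two toy sequences below are assumptions chosen so that the arrays can be read by eye; nested lists are used instead of NumPy only to keep the sketch dependency-free.

```python
chars = ["a", "b", "c"]  # toy vocabulary, so features = 3
char_indices = {c: i for i, c in enumerate(chars)}
sentences = ["ab", "bc"]  # 2 samples, 2 timesteps each
next_chars = ["c", "a"]   # one target character per sample

def one_hot(ch):
    # vector of length len(chars) with a single 1 at the character's index
    return [int(char_indices[ch] == k) for k in range(len(chars))]

# X[sample][timestep][feature], y[sample][feature]
X = [[one_hot(ch) for ch in sentence] for sentence in sentences]
y = [one_hot(ch) for ch in next_chars]

print((len(X), len(X[0]), len(X[0][0])))  # (samples, timesteps, features)
print((len(y), len(y[0])))                # (samples, features)
```

The target has no timestep axis here because the model predicts a single next character per input sequence.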

In [None]:
# create X and Y

X = np.zeros((len(sentences),MAX_SEQUENCE_LENGTH,VOCAB_SIZE), dtype=np.bool_)
y = np.zeros((len(sentences),VOCAB_SIZE),dtype=np.bool_)

for i,sentence in enumerate(sentences):
    for t,char in enumerate(sentence):
        X[i,t,char_indices[char]] = 1
    y[i,char_indices[next_chars[i]]] = 1



In [None]:
print(f"shape of X {X.shape}")
print(f"shape of y {y.shape}")


shape of X (743543, 50, 99)
shape of y (743543, 99)


Here, X has the shape (#samples, #timesteps, #features). We state the third dimension (#features) explicitly because we do not use the Keras Embedding() layer in this case: there are only 99 distinct characters, so each one can be represented directly as a one-hot encoded vector without a learned embedding.
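One-hot encoding at this scale is memory hungry, which is one reason a one-byte boolean dtype is a sensible choice for X. The rough estimate below uses the shape printed above and assumes 1 byte per boolean element and 4 bytes per float32 element.

```python
n_samples, timesteps, features = 743543, 50, 99  # shape of X printed above

bytes_bool = n_samples * timesteps * features  # 1 byte per element
bytes_float32 = bytes_bool * 4                 # 4 bytes per element

print(f"bool    : {bytes_bool / 1024**3:.2f} GiB")
print(f"float32 : {bytes_float32 / 1024**3:.2f} GiB")
```

If memory is tight, a larger STEP or a shorter sequence length shrinks X proportionally.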

# 2.LSTM

In [None]:
# define model architecture: two stacked bidirectional LSTM layers with 128 units per direction

model = Sequential()
model.add(Bidirectional(LSTM(128, return_sequences=True, dropout=0.7), input_shape=(MAX_SEQUENCE_LENGTH, VOCAB_SIZE)))
model.add(Bidirectional(LSTM(128, dropout=0.5)))
model.add(Dense(VOCAB_SIZE, activation='softmax'))

# track accuracy so that model.evaluate returns both loss and accuracy
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])



In [None]:
# check model summary

model.summary()
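The parameter counts that the summary should report can be worked out by hand. This sketch assumes the usual Keras LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector) and that Bidirectional concatenates the forward and backward outputs; `lstm_params` is a helper name introduced here for illustration.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

vocab_size, units = 99, 128

layer1 = 2 * lstm_params(vocab_size, units)    # forward + backward direction
layer2 = 2 * lstm_params(2 * units, units)     # input is the 256-wide concatenation
dense = (2 * units) * vocab_size + vocab_size  # weights + biases
total = layer1 + layer2 + dense

print(layer1, layer2, dense, total)
```

Most of the parameters sit in the second recurrent layer, because its input is the concatenated output of both directions of the first layer.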

In [None]:
# fit the model
model_training = model.fit(X,y,batch_size=128,epochs=20)

Epoch 1/20
4755/5809 ━━━━━━━━━━━━━━━━━━━━ 13:15 755ms/step - loss: 3.2664

In [None]:
loss , accuracy = model.evaluate(X , y)
print(f'Loss : {loss} \n  Accuracy {accuracy}')
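The 5809 steps per epoch shown in the training log follow from the number of samples and the batch size. The short calculation below also gives a rough epoch duration, assuming the 755ms/step from the log stays constant.

```python
import math

n_samples, batch_size = 743543, 128

# the last batch is smaller than 128, but still counts as one step
steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch)

seconds_per_step = 0.755  # taken from the 755ms/step shown in the log
minutes_per_epoch = steps_per_epoch * seconds_per_step / 60
print(f"~{minutes_per_epoch:.0f} minutes per epoch")
```

At this rate, 20 epochs take roughly a day of training, so a larger batch size or a larger STEP may be worth trying first.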

# 3.Generate code

Create a function that makes next-character predictions based on temperature. If the temperature is greater than 1, the generated characters will be more varied and diverse. If the temperature is less than 1, the generated characters will be more conservative.

## 1.Input:

preds: an array of the probabilities the model assigns to each possible next character. For example, it might contain the probabilities of the next character being 'a', 'b', 'c', and so on.

temperature: a parameter that controls the "creativity" of the model. It defaults to 1.0.


## 2.Converting to NumPy array:

preds = np.asarray(preds).astype('float64')

This line first converts the preds input (which could be a list or another data structure) into a NumPy array. It then casts the array to 'float64' for numerical stability in the calculations that follow.


## 3.Applying Temperature:

preds = np.log(preds) / temperature

This is the core of the temperature scaling. It takes the logarithm of the probabilities and divides it by the temperature.

Higher temperature (e.g., > 1.0): Makes the probabilities more uniform, increasing the chance of the model selecting less likely characters, leading to more surprising and diverse output.

Lower temperature (e.g., < 1.0): Makes the probabilities more peaked, increasing the chance of the model sticking to its most confident predictions, leading to more conservative and predictable output.


## 4.Scaling Probabilities:

exp_preds = np.exp(preds)
preds = exp_preds / np.sum(exp_preds)


These lines exponentiate the temperature-scaled log-probabilities (preds) and then normalize them (divide by their sum) so that they add up to 1 and again represent valid probabilities.


## 5.Making a Choice:


probas = np.random.multinomial(1, preds, 1)


This line uses a multinomial distribution (like rolling a weighted die) to make a random choice of the next character based on the adjusted probabilities (preds). The result (probas) is a one-hot encoded array, meaning it has a 1 in the position of the chosen character and 0s elsewhere.


## 6.Returning the Selection:


return np.argmax(probas)


Finally, the function returns the index of the selected character (the position of the 1 in the probas array), which can then be used to retrieve the actual character from the indices_char dictionary created earlier in the code.

In short, the sample function introduces randomness and variability in the text generation process using the temperature parameter, allowing the model to produce more creative and less repetitive output.
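To see the effect of temperature on actual numbers, the rescaling steps can be applied to a toy distribution. This sketch uses only the standard library; `apply_temperature` and the three-value `probs` list are illustrative assumptions, and the function covers only the rescaling, not the random draw.

```python
import math

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution with log / temperature / softmax."""
    scaled = [math.log(p) / temperature for p in probs]
    exp_scaled = [math.exp(s) for s in scaled]
    total = sum(exp_scaled)
    return [e / total for e in exp_scaled]

probs = [0.7, 0.2, 0.1]  # toy next-character distribution

for t in [0.2, 1.0, 1.2]:
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

A temperature of 1.0 leaves the distribution unchanged, values below 1.0 push more mass onto the most likely character, and values above 1.0 flatten the distribution.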


In [None]:
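# define a function to sample the next character index from a probability array based on temperature

def sample(preds,temperature = 1.0):
  # cast to float64 for numerical stability
  preds = np.asarray(preds).astype('float64')
  # scale the log-probabilities by the temperature
  preds = np.log(preds)/temperature
  # exponentiate and renormalize so the values sum to 1 again
  exp_preds = np.exp(preds)
  preds = exp_preds/np.sum(exp_preds)
  # draw one sample; probas is a one-hot array marking the chosen index
  probas = np.random.multinomial(1,preds,1)
  return np.argmax(probas)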


In [None]:
'''
np.random.multinomial(n, pvals, size) runs `size` experiments of n draws each.
Each row of the result counts how many of the 10 draws landed on each of
the three possible outcomes in one of the two simulated experiments.
Because outcome 2 has a much higher probability (0.9),
it occurs most frequently in the results.
'''


np.random.multinomial(10,[0.05,0.9,0.05],size=2)
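The same "weighted die" idea can be reproduced with the standard library, which may make the counting explicit. This is an analogue written for illustration, not the NumPy call itself; the seed is fixed only so that repeated runs give the same counts.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [0.05, 0.9, 0.05]

for _ in range(2):  # two experiments, like size=2
    # roll the weighted die 10 times
    draws = rng.choices(range(3), weights=weights, k=10)
    counts = Counter(draws)
    # counts per outcome, in the same layout as one row of the NumPy result
    row = [counts[outcome] for outcome in range(3)]
    print(row)
```

Each printed row sums to 10, and the middle entry is expected to dominate because its weight is 0.9.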

In [None]:
# Generate
# pick a random starting point in the text to get an initial seed sequence
start_index = random.randint(0,len(text)-MAX_SEQUENCE_LENGTH-1)

# iterate through temperatures
for diversity in [0.2,0.5,1.0,1.2]:
    print('-'*50,'diversity: ',diversity)
    generated = ""
    sentence = text[start_index:start_index+MAX_SEQUENCE_LENGTH]
    generated += sentence
    print('Generating with seed : "' + sentence + '"')
    sys.stdout.write(generated)

    # generate 1000 characters, one per iteration
    for i in range(1000):
        # one-hot encode the current window: (1 sample, sequence length, vocabulary size)
        X_pred = np.zeros((1,MAX_SEQUENCE_LENGTH,VOCAB_SIZE))
        for t , char in enumerate(sentence):
            X_pred[0,t,char_indices[char]] = 1
        # predict the probability distribution of the next character
        preds = model.predict(X_pred,verbose=0)[0]
        # sample the next character index; higher diversity gives less expected choices
        next_index = sample(preds,diversity)
        # map the index back to its character
        next_char = indices_char[next_index]
        generated += next_char
        # slide the window: drop the first character, append the predicted one
        sentence = sentence[1:] + next_char
        # print the generated character immediately
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()
