# Text generation using RNN - Character Level

To generate text using an RNN, we need to convert raw text into a supervised learning problem.

Take, for example, the following corpus:

"Her brother shook his head incredulously"

First, we divide the data into a tabular format containing input (X) and output (y) sequences. For a character-level model, X and y look like this:

|      X     |  y  |
|------------|-----|
|    Her b   |  r  |
|    er br   |  o  |
|    r bro   |  t  |
|     brot   |  h  |
|    broth   |  e  |
|    .....   |  .  |
|    .....   |  .  |
|    ulous   |  l  |
|    lousl   |  y  |

Note that in the above problem, the sequence length of X is five characters and that of y is one character. Hence, this is a many-to-one architecture. We can, however, change the number of input characters to any number of characters depending on the type of problem.
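The windowing in the table above can be sketched in a few lines of Python. This is only an illustration: the helper name `make_windows` and its parameters are assumptions, not part of the notebook.

```python
def make_windows(text, seq_length=5, step=1):
    """Return (input, target) pairs: seq_length characters and the character that follows."""
    pairs = []
    for i in range(0, len(text) - seq_length, step):
        pairs.append((text[i:i + seq_length], text[i + seq_length]))
    return pairs

corpus = "Her brother shook his head incredulously"
pairs = make_windows(corpus)

# Show the first and last few rows of the table
for x, y in pairs[:3] + pairs[-2:]:
    print(repr(x), "->", repr(y))
```

Changing `seq_length` changes how many characters of context each input row carries, and `step` controls how far the window moves between rows.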

A model is trained on such data. To generate text, we give the model any five characters, from which it predicts the next character. The predicted character is then appended to the right end of the input sequence, and the first (leftmost) character is discarded. The model predicts again using the new sequence, and the cycle continues for a fixed number of iterations. An example is shown below:

Seed text: "incre"

| X | y |
|---|---|
| incre | < predicted char 1 > |
| ncre< predicted char 1 > | < predicted char 2 > |
| cre< predicted char 1 >< predicted char 2 > | < predicted char 3 > |
| re< predicted char 1 >< predicted char 2 >< predicted char 3 > | < predicted char 4 > |
| ... | ... |
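The generation cycle can be sketched as follows. Since no trained model is available at this point, `lookup_predictor` is a hypothetical stand-in that simply looks up the window in the example corpus; in the notebook this role is played by the trained LSTM.

```python
def generate(seed, predict_next, n_chars):
    """Slide a fixed-length window over the text while appending predicted characters."""
    window = seed
    generated = seed
    for _ in range(n_chars):
        next_char = predict_next(window)   # a trained model would be called here
        generated += next_char
        window = window[1:] + next_char    # drop the leftmost char, append the prediction
    return generated

corpus = "Her brother shook his head incredulously"

def lookup_predictor(window):
    """Hypothetical predictor: return the character that follows `window` in the corpus."""
    pos = corpus.find(window)
    if pos == -1 or pos + len(window) >= len(corpus):
        return " "                         # fallback when the window has no successor
    return corpus[pos + len(window)]

print(generate("incre", lookup_predictor, 8))
```

The window always stays five characters long, so the predictor only ever sees the most recent context.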

# Notebook Overview
1. Preprocess data
2. LSTM model
3. Generate code

In [1]:
# import libraries
import warnings
warnings.filterwarnings("ignore")

import os
import re
import numpy as np
import random
import sys
import io
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Activation, LSTM
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import get_file

# 1. Preprocess data

We're going to build a C code generator by training an RNN on a large corpus of C code (the Linux kernel source). You can browse the C code used as source text at the following link:
https://github.com/torvalds/linux/tree/master/kernel

We have already downloaded the entire kernel folder and stored it in a local directory.

## Load C code

In [3]:
!wget https://datasetsgun.s3.amazonaws.com/upgrad/kernel.zip

--2021-04-09 12:01:08--  https://datasetsgun.s3.amazonaws.com/upgrad/kernel.zip
Resolving datasetsgun.s3.amazonaws.com (datasetsgun.s3.amazonaws.com)... 52.217.33.220
Connecting to datasetsgun.s3.amazonaws.com (datasetsgun.s3.amazonaws.com)|52.217.33.220|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2923731 (2.8M) [application/zip]
Saving to: ‘kernel.zip’


2021-04-09 12:01:12 (1.00 MB/s) - ‘kernel.zip’ saved [2923731/2923731]



In [4]:
!unzip kernel.zip

Archive:  kernel.zip
 extracting: kernel/.gitignore       
  inflating: kernel/acct.c           
  inflating: kernel/async.c          
  inflating: kernel/audit.c          
  inflating: kernel/audit.h          
  inflating: kernel/audit_fsnotify.c  
  inflating: kernel/audit_tree.c     
  inflating: kernel/audit_watch.c    
  inflating: kernel/auditfilter.c    
  inflating: kernel/auditsc.c        
  inflating: kernel/backtracetest.c  
  inflating: kernel/bounds.c         
   creating: kernel/bpf/
  inflating: kernel/bpf/arraymap.c   
  inflating: kernel/bpf/bpf_inode_storage.c  
  inflating: kernel/bpf/bpf_iter.c   
  inflating: kernel/bpf/bpf_local_storage.c  
  inflating: kernel/bpf/bpf_lru_list.c  
  inflating: kernel/bpf/bpf_lru_list.h  
  inflating: kernel/bpf/bpf_lsm.c    
  inflating: kernel/bpf/bpf_struct_ops.c  
  inflating: kernel/bpf/bpf_struct_ops_types.h  
  inflating: kernel/bpf/bpf_task_storage.c  
  inflating: kernel/bpf/btf.c        
  inflating: kernel/bpf/cgroup.c  

In [6]:
# set path where C files reside

path = r"kernel"

os.chdir(path)

file_names = os.listdir()
print(file_names)

['.gitignore', 'acct.c', 'async.c', 'audit.c', 'audit.h', 'audit_fsnotify.c', 'audit_tree.c', 'audit_watch.c', 'auditfilter.c', 'auditsc.c', 'backtracetest.c', 'bounds.c', 'bpf', 'capability.c', 'cgroup', 'compat.c', 'configs.c', 'configs', 'context_tracking.c', 'cpu.c', 'cpu_pm.c', 'crash_core.c', 'crash_dump.c', 'cred.c', 'debug', 'delayacct.c', 'dma.c', 'dma', 'entry', 'events', 'exec_domain.c', 'exit.c', 'extable.c', 'fail_function.c', 'fork.c', 'freezer.c', 'futex.c', 'gcov', 'gen_kheaders.sh', 'groups.c', 'hung_task.c', 'iomem.c', 'irq', 'irq_work.c', 'jump_label.c', 'kallsyms.c', 'kcmp.c', 'Kconfig.freezer', 'Kconfig.hz', 'Kconfig.locks', 'Kconfig.preempt', 'kcov.c', 'kcsan', 'kexec.c', 'kexec_core.c', 'kexec_elf.c', 'kexec_file.c', 'kexec_internal.h', 'kheaders.c', 'kmod.c', 'kprobes.c', 'ksysfs.c', 'kthread.c', 'latencytop.c', 'livepatch', 'locking', 'Makefile', 'module.c', 'module_signature.c', 'module_signing.c', 'module-internal.h', 'notifier.c', 'nsproxy.c', 'padata.c', 'p

In [7]:
# use a regex to keep only the .c files
c_names = r".*\.c$"   # raw string so that \. reaches the regex engine unchanged

c_files = list()

for file in file_names:
    if re.match(c_names, file):
        c_files.append(file)

print(c_files)

['acct.c', 'async.c', 'audit.c', 'audit_fsnotify.c', 'audit_tree.c', 'audit_watch.c', 'auditfilter.c', 'auditsc.c', 'backtracetest.c', 'bounds.c', 'capability.c', 'compat.c', 'configs.c', 'context_tracking.c', 'cpu.c', 'cpu_pm.c', 'crash_core.c', 'crash_dump.c', 'cred.c', 'delayacct.c', 'dma.c', 'exec_domain.c', 'exit.c', 'extable.c', 'fail_function.c', 'fork.c', 'freezer.c', 'futex.c', 'groups.c', 'hung_task.c', 'iomem.c', 'irq_work.c', 'jump_label.c', 'kallsyms.c', 'kcmp.c', 'kcov.c', 'kexec.c', 'kexec_core.c', 'kexec_elf.c', 'kexec_file.c', 'kheaders.c', 'kmod.c', 'kprobes.c', 'ksysfs.c', 'kthread.c', 'latencytop.c', 'module.c', 'module_signature.c', 'module_signing.c', 'notifier.c', 'nsproxy.c', 'padata.c', 'panic.c', 'params.c', 'pid.c', 'pid_namespace.c', 'profile.c', 'ptrace.c', 'range.c', 'reboot.c', 'regset.c', 'relay.c', 'resource.c', 'resource_kunit.c', 'rseq.c', 'scftorture.c', 'scs.c', 'seccomp.c', 'signal.c', 'smp.c', 'smpboot.c', 'softirq.c', 'stackleak.c', 'stacktrace.c

In [8]:
# load the contents of every C file into a list
full_code = list()
for file in c_files:
    # the context manager closes the file even if reading fails
    with open(file, "r", encoding='utf-8') as code:
        full_code.append(code.read())

In [9]:
# let's look at what a typical C file looks like
print(full_code[20])

// SPDX-License-Identifier: GPL-2.0
/*
 * linux/kernel/dma.c: A DMA channel allocator. Inspired by linux/kernel/irq.c.
 *
 * Written by Hennus Bergman, 1992.
 *
 * 1994/12/26: Changes by Alex Nash to fix a minor bug in /proc/dma.
 *   In the previous version the reported device could end up being wrong,
 *   if a device requested a DMA channel that was already in use.
 *   [It also happened to remove the sizeof(char *) == sizeof(int)
 *   assumption introduced because of those /proc/dma patches. -- Hennus]
 */
#include <linux/export.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/seq_file.h>
#include <linux/proc_fs.h>
#include <linux/init.h>
#include <asm/dma.h>



/* A note on resource allocation:
 *
 * All drivers needing DMA channels, should allocate and release them
 * through the public routines `request_dma()' and `free_dma()'.
 *
 * In order to avoid problems, all processes should allocate resources in
 

In [10]:
# merge the individual C files into one large string
text = "\n".join(full_code)
print("Total number of characters in entire code: {}".format(len(text)))

Total number of characters in entire code: 2228854


In [11]:
# top_n: only consider first top_n characters and discard the rest for memory and computational efficiency
top_n = 400000
text = text[:top_n]

## Convert characters to integers

In [12]:
# create character to index mapping
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [13]:
print("Vocabulary size: {}".format(len(chars)))

Vocabulary size: 95
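To see how the two mappings work together, here is a minimal round trip on a toy string. The `toy_` names are assumptions chosen so that the sketch does not overwrite the notebook's own `chars`, `char_indices` and `indices_char`.

```python
toy_text = "int main(void)"

# Same construction as above, on a much smaller vocabulary
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode characters to integers, then decode back to text
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(encoded)
print(decoded)
```

Because the two dictionaries are inverses of each other, decoding the encoded sequence returns the original text.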


## Divide data in input (X) and output (y)

### Create sequences

In [14]:
# define length for each sequence
MAX_SEQ_LENGTH = 50          # number of input characters (X) in each sequence 
STEP           = 3           # increment between each sequence
VOCAB_SIZE     = len(chars)  # total number of unique characters in dataset

sentences  = []              # X
next_chars = []              # y

for i in range(0, len(text) - MAX_SEQ_LENGTH, STEP):
    sentences.append(text[i: i + MAX_SEQ_LENGTH])
    next_chars.append(text[i + MAX_SEQ_LENGTH])

In [15]:
print('Number of training samples: {}'.format(len(sentences)))

Number of training samples: 133317
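The number of training samples follows directly from the text length, the sequence length and the step. A quick check, assuming the truncated text contains exactly 400000 characters (the helper `count_windows` is hypothetical):

```python
import math

def count_windows(n_chars, seq_length, step):
    """Number of start positions visited by range(0, n_chars - seq_length, step)."""
    return len(range(0, n_chars - seq_length, step))

n_samples = count_windows(400000, 50, 3)
print(n_samples)

# Equivalent closed form: ceil((n_chars - seq_length) / step)
print(math.ceil((400000 - 50) / 3))
```

A smaller `STEP` produces more, heavily overlapping, samples at the cost of memory and training time.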


## Create input and output using the created sequences

When the Keras Embedding layer is not the first layer of the model, the data must be converted to the following format:
#### input shape should be of the form: (#samples, #timesteps, #features)
#### output shape (for this many-to-one model) should be of the form: (#samples, #features)

![Tensor shape](./jupyter resources/rnn_tensor.png)

- **#samples**: the number of data points (or sequences).
- **#timesteps**: the length of each sequence (the MAX_SEQ_LENGTH variable).
- **#features**: depends on the type of problem. Here, #features is the vocabulary size, that is, the dimensionality of the one-hot vector used to represent each character. If you're working with **images**, the feature size is (height, width, channels), and the input shape is (#training_samples, #timesteps, height, width, channels).
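As a small illustration of these shapes, the sketch below one-hot encodes a toy string with a sequence length of 3. All `toy_` names and the toy string itself are assumptions made for the example.

```python
import numpy as np

toy_text = "abcabcab"
toy_seq_length, toy_step = 3, 1
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

starts = range(0, len(toy_text) - toy_seq_length, toy_step)
toy_sentences = [toy_text[i:i + toy_seq_length] for i in starts]
toy_next_chars = [toy_text[i + toy_seq_length] for i in starts]

# X: (#samples, #timesteps, #features), y: (#samples, #features)
toy_X = np.zeros((len(toy_sentences), toy_seq_length, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_X[i, t, toy_char_indices[char]] = True
    toy_y[i, toy_char_indices[toy_next_chars[i]]] = True

print(toy_X.shape, toy_y.shape)
print(toy_X[0].astype(int))  # one-hot rows for "abc"
```

Each timestep of each sample has exactly one active feature, which is what makes the third dimension equal to the vocabulary size.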

In [16]:
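# create X and y as one-hot encoded boolean arrays
# (the np.bool alias has been removed from recent NumPy releases, so use the built-in bool)
X = np.zeros((len(sentences), MAX_SEQ_LENGTH, VOCAB_SIZE), dtype=bool)
y = np.zeros((len(sentences), VOCAB_SIZE), dtype=bool)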
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

In [17]:
print("Shape of X: {}".format(X.shape))
print("Shape of y: {}".format(y.shape))

Shape of X: (133317, 50, 95)
Shape of y: (133317, 95)


Here, X has the shape (#samples, #timesteps, #features). We specify the third dimension (#features) explicitly because we don't use the Embedding() layer of Keras in this case: there are only 95 distinct characters, so each character can be represented directly as a one-hot encoded vector. Unlike words, characters have no word embeddings.

# 2. LSTM

In [19]:
# define model architecture - using a two-layer LSTM with 128 LSTM cells in each layer
model = Sequential()
model.add(LSTM(128, input_shape=(MAX_SEQ_LENGTH, VOCAB_SIZE), return_sequences=True, dropout=0.5))
model.add(LSTM(128, dropout=0.5))
model.add(Dense(VOCAB_SIZE, activation = "softmax"))

optimizer = Adam(learning_rate=0.01)  # `lr` is a deprecated alias of `learning_rate`
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics = ['acc'])

In [20]:
# check model summary
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 50, 128)           114688    
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 95)                12255     
Total params: 258,527
Trainable params: 258,527
Non-trainable params: 0
_________________________________________________________________
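The parameter counts in the summary can be checked by hand. An LSTM layer has four gate blocks, each with input weights, recurrent weights and a bias. The helper names below are hypothetical and written only for this check.

```python
def lstm_params(input_dim, units):
    """4 gate blocks, each with (input_dim + units) * units weights plus units biases."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus bias vector of a fully connected layer."""
    return input_dim * units + units

vocab_size, n_units = 95, 128
layer_counts = [
    lstm_params(vocab_size, n_units),   # first LSTM reads one-hot vectors of size 95
    lstm_params(n_units, n_units),      # second LSTM reads the 128 outputs of the first
    dense_params(n_units, vocab_size),  # softmax layer over the vocabulary
]
print(layer_counts, sum(layer_counts))
```

The three values match the `Param #` column above, and their sum matches the reported total.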


In [21]:
# fit model
model.fit(X, y, batch_size=128, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fddd0102fd0>

# 3. Generate code

Next, we create a function that samples the next character from the predicted probabilities, controlled by a temperature parameter. The temperature rescales the distribution before sampling: a temperature greater than 1 flattens it, so the generated characters are more diverse, whereas a temperature less than 1 sharpens it, so the generated characters are more conservative.

In [22]:
# define function to sample the next character from a probability array based on temperature
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [23]:
# how np.random.multinomial behaves: 10 draws over 3 outcomes, repeated twice
np.random.multinomial(10, [0.05, 0.9, 0.05], size=2)

array([[1, 9, 0],
       [0, 8, 2]])
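To make the effect of temperature concrete, the sketch below applies the same rescaling as `sample` to a small, made-up distribution and prints the result for the three diversity values used later. The function name `apply_temperature` and the example probabilities are assumptions; subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector: log, divide by temperature, renormalize."""
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs) / temperature
    exp_logits = np.exp(logits - np.max(logits))  # shift for numerical stability
    return exp_logits / exp_logits.sum()

example_probs = [0.7, 0.2, 0.1]
for temperature in [0.5, 1.0, 1.5]:
    print(temperature, np.round(apply_temperature(example_probs, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged, 0.5 pushes more mass onto the most likely character, and 1.5 spreads mass toward the less likely ones.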

In [24]:
# generate code

start_index = random.randint(0, len(text) - MAX_SEQ_LENGTH - 1) # pick random code to start text generation

for diversity in [0.5, 1.0, 1.5]:
        print('-'*50, 'diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + MAX_SEQ_LENGTH]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(1000):
            x_pred = np.zeros((1, MAX_SEQ_LENGTH, VOCAB_SIZE))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()

-------------------------------------------------- diversity: 0.5
----- Generating with seed: " AUDIT_GET_FEATURE:
	case AUDIT_SET_FEATURE:
	case"
 AUDIT_GET_FEATURE:
	case AUDIT_SET_FEATURE:
	case_UAAFL__;	 );
	 a	a  el_s
et _S_nf _  TPe
m U(ou __E A
_ B
E E_TBe  Aor_ 
 cn_etn_
ir
 ccEo _  P_C_K_AaE)
 
			 	fa c  o e
_ee_I_ I_ _   _AA_It_HEe_  _dIaa;
 Ao TA-
 bA_MITTEiUksiA_tU 
E_
iFT
			s nl aeorea _o nemntioE _ctre   *a I  l
 	 oe ; aI  t;    aAo P
s  * s	e	
		  ai ) 	
	s esclet tt  eu iur et Te_etaiqud 


	 Mu  aIA  - n.aEseses_ _Es_ (	 (  _sIeITE_A  d)	 __  X  _sPTMa _U
ETEEg_NeIE_ETel e(t_MTCUP_f L_ A__ENn _nTAc( Et__FNs e_AA_aEl_EUU_S 	EA_ T_	r EP((
		c ae
rI
w  e
IE_BEON_eIE_S

		
	 eotc       i	u  t eota A T
	r ;I_ sa
O		
	
		ale taa  a! ee
  _n_s_a aeei i
o , cn,taasniu  aeeclis	 sic_E  o) T_
				 e	aacaw ;Eei siTE A___

	i  i f adac



	 f   se eU  c U_EUMu e  i F	 AD(d  D_rPE 
	  aett
an rtri
  isuU_E
NE_t ENr _
	
 t
*tai e * a T  c__eA i_

	 nneel _omT_Ms__ orA  a_Bia e_e_

In [25]:
# generate code

start_index = random.randint(0, len(text) - MAX_SEQ_LENGTH - 1) # pick random seed

for diversity in [0.5, 1.0, 1.5]:
        print('-'*50, 'diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + MAX_SEQ_LENGTH]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(1000):
            x_pred = np.zeros((1, MAX_SEQ_LENGTH, VOCAB_SIZE))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()

-------------------------------------------------- diversity: 0.5
----- Generating with seed: "_cred_subscribers(const struct cred *cred)
{
#ifde"
_cred_subscribers(const struct cred *cred)
{
#ifde au	ami) 
e	 oeu s tu	eaie
atfio se t epe ao
	a a

	t r taiet
a   a
 fart  o tea eeF
h;r es arn ;i

			aiuu nes_ar___ o N	 io 
 ef
A			 
  ar a((oee  fs c ei os oiTEU iar _suTtf; _EUr_ r:onUD_sMoTi ddr e );
	s 	ft
	 eb teuaereu
 	srn oi o_u tair s eoece
		   cs  

  (a& a es .a_eNULeO t(_Io
		a eTeaoue
	e negoc lee;   terdtoto  auap escedott_aoudt _A  ess( * a lE _Pa
	   ee' ;a
E	A TB  l
		  eted	raU_;


		*  iI tOi(f  pea	uM 	
  ;	la
				* eoerfeu a )oe ; ee c aacA ea ipi	tisi n_re  ud)  o ee_uo
a
 	ren-c -  einhroi enIa   no  c 
ctd

 
 a a_ct,osao 
l ls
	al poesn_=_r e aPei   _AoiO _oU_ Sd L_se
 IU(e_  EUde

		a	 iuolia(Ai_ei n  sn,e
	e _ (E__o
				r		ref  ia
_aa_. a T_lEoe_ia  iA  rDATIA(a	T EeN_;
E	
T_ AIiT_E N_T_UEn_B s I, RDT_eBo  Ee
	a 
 C__IETTcET eDS
								
g  c__ )c)

	e	  oo *ri