**Question** : Load, preprocess, and split the C program dataset, which consists of kernel C source code. Build, compile, and train a model with LSTM layers. Then generate code by creating a function that makes next-character predictions based on temperature.

**Description** :

* Load the C code: set the path where the C files reside and use a regex to keep only the .c files

* Consider only the first top_n characters and discard the rest, for memory and computational efficiency

* Convert characters to integers

* Divide the data into input (X) and output (y)

* Create the input and output arrays from the extracted sequences. X should be three-dimensional, with shape (number of sequences, time steps, vocabulary size). Use MAX_SEQ_LENGTH = 50, STEP = 3, and VOCAB_SIZE = len(chars)

* Build the model with the Sequential API: an LSTM layer with 128 units, input_shape=(MAX_SEQ_LENGTH, VOCAB_SIZE), and return_sequences=True; a Dropout layer with rate 0.1; a second LSTM layer with 128 units; another Dropout layer with rate 0.1; and a Dense output layer with VOCAB_SIZE units and softmax activation

* Compile the model with categorical_crossentropy as the loss, Adam as the optimizer, and accuracy as the metric

* Fit (train) the model on the training set for 20 epochs with a batch_size of 128

* Generate code: create a function that makes next-character predictions based on temperature. If the temperature is greater than 1, the generated characters are more diverse. If the temperature is less than 1, the generated characters are more conservative.

**SOLUTION** :

In [66]:
# import libraries
import warnings
warnings.filterwarnings("ignore")

import os
import re
import random
import sys

import numpy as np
import tensorflow as tf

In [67]:
path = r"C:/Users/gupta/Desktop/datasets/attachment_kernel_lyst7535/kernel/"

In [None]:
file_names = os.listdir(path)

In [73]:
# use regex to filter .c files
c_names = r".*\.c$"

c_files = list()

for file in file_names:
    if re.match(c_names, file):
        c_files.append(file)
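
As a quick check of the pattern above, here is a minimal sketch on a hypothetical list of file names (`toy_names` is made up for illustration and is not part of the dataset). It suggests that only names ending in `.c` pass the filter.

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
c_pattern = re.compile(r".*\.c$")

# Hypothetical file names, for illustration only.
toy_names = ["sched.c", "sched.h", "Makefile", "fork.c.orig", "notc"]

# Keep a name only if the whole name ends with the literal suffix ".c".
toy_c_files = [name for name in toy_names if c_pattern.match(name)]
print(toy_c_files)
```

A plain `name.endswith(".c")` would likely give the same result here; the regex is kept because the task description asks for it.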

In [82]:
# load all c code in a list
full_code = list()
for file in c_files:
    # the context manager closes the file even if read() raises
    with open(os.path.join(path, file), "r", encoding="utf-8") as code:
        full_code.append(code.read())

In [83]:
# merge different c codes into one big c code
text = "\n".join(full_code)

# keep only the first top_n characters to limit memory and compute
top_n = 400000
text = text[:top_n]

In [84]:
# create character to index mapping
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [85]:
# define length for each sequence
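MAX_SEQ_LENGTH = 50          # characters per input sequence
STEP           = 3           # stride between consecutive sequences
VOCAB_SIZE     = len(chars)  # number of distinct characters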

In [86]:
sentences  = []              # X
next_chars = []              # y

for i in range(0, len(text) - MAX_SEQ_LENGTH, STEP):
    sentences.append(text[i: i + MAX_SEQ_LENGTH])
    next_chars.append(text[i + MAX_SEQ_LENGTH])
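
To make the sliding-window split easier to follow, here is a minimal sketch on a toy string. The names `toy_text`, `toy_seq_length`, and `toy_step` are hypothetical stand-ins for `text`, `MAX_SEQ_LENGTH`, and `STEP`, with small values chosen so the windows are easy to read.

```python
# Toy stand-ins for text, MAX_SEQ_LENGTH and STEP (hypothetical values).
toy_text = "int main() {}"
toy_seq_length = 5
toy_step = 3

toy_sentences = []    # input windows (X)
toy_next_chars = []   # character that follows each window (y)

# Same loop as above: slide a fixed-length window over the text in strides.
for i in range(0, len(toy_text) - toy_seq_length, toy_step):
    toy_sentences.append(toy_text[i: i + toy_seq_length])
    toy_next_chars.append(toy_text[i + toy_seq_length])

for window, target in zip(toy_sentences, toy_next_chars):
    print(repr(window), "->", repr(target))
```

Each window overlaps the previous one by `toy_seq_length - toy_step` characters, so a smaller step yields more (and more redundant) training examples.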

In [88]:
# create X and y
# np.bool was removed in NumPy 1.24; the built-in bool is equivalent here
X = np.zeros((len(sentences), MAX_SEQ_LENGTH, VOCAB_SIZE), dtype=bool)
y = np.zeros((len(sentences), VOCAB_SIZE), dtype=bool)

In [90]:
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
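
The one-hot encoding used above can be illustrated on a toy string. This is a minimal sketch; all `toy_*` names are hypothetical and only mirror `chars`, `char_indices`, and `indices_char` on a three-character vocabulary.

```python
import numpy as np

# Hypothetical toy corpus with a three-character vocabulary.
toy_text = "abcab"
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# One row per time step, one column per vocabulary entry.
one_hot = np.zeros((len(toy_text), len(toy_chars)), dtype=bool)
for t, char in enumerate(toy_text):
    one_hot[t, toy_char_indices[char]] = True

# argmax along the vocabulary axis recovers the integer code of each step.
decoded = "".join(toy_indices_char[int(i)] for i in one_hot.argmax(axis=1))
print(one_hot.astype(int))
print(decoded)
```

Every row contains exactly one `True`, which is why `argmax` is enough to map a row back to its character.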

In [91]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(128, input_shape=(MAX_SEQ_LENGTH, VOCAB_SIZE), return_sequences=True))
model.add(tf.keras.layers.Dropout(0.1))
model.add(tf.keras.layers.LSTM(128))
model.add(tf.keras.layers.Dropout(0.1))
model.add(tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"))

In [92]:
model.compile(loss=tf.keras.losses.categorical_crossentropy, optimizer='Adam', metrics=['acc'])

In [99]:
# fit model for 20 epochs, as specified in the description
model.fit(X, y, batch_size=128, epochs=20, verbose=0)

In [100]:
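def sample(preds, temperature=1.0):
    """Sample an index from preds after rescaling it by temperature."""
    preds = np.asarray(preds).astype('float64')
    # a small epsilon avoids log(0) when a probability underflows to zero
    preds = np.log(preds + 1e-8) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)  # renormalize so the values sum to 1
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)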
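
To see what the temperature does before any sampling happens, here is a minimal sketch that applies the same rescaling to a made-up distribution. The helper `rescale` and the values in `toy_preds` are hypothetical and exist only for this illustration; they are not part of the model's output.

```python
import numpy as np

def rescale(preds, temperature):
    """Return the temperature-adjusted distribution, without sampling."""
    logits = np.log(np.asarray(preds, dtype="float64")) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# Hypothetical softmax output over a three-character vocabulary.
toy_preds = [0.7, 0.2, 0.1]

for temperature in (0.5, 1.0, 1.5):
    print(temperature, np.round(rescale(toy_preds, temperature), 3))
```

With a temperature below 1 the most likely character gains probability mass (more conservative output), a temperature of 1 leaves the distribution unchanged, and a temperature above 1 flattens it (more diverse output).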

In [None]:
# generate code

# pick a random seed sequence from the corpus
start_index = random.randint(0, len(text) - MAX_SEQ_LENGTH - 1)
for diversity in [0.5, 1.0, 1.5]:
    print('-'*50, 'diversity:', diversity)

    generated = ''
    sentence = text[start_index: start_index + MAX_SEQ_LENGTH]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)

    for i in range(1000):
        # one-hot encode the current window
        x_pred = np.zeros((1, MAX_SEQ_LENGTH, VOCAB_SIZE))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        generated += next_char
        # slide the window forward by one character
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()  # newline so the next diversity header starts on its own line