**Question** : Load, preprocess, and split the C program dataset taken from kernel C code. Build a model using LSTM and GRU layers, then print its summary.

**Description** :

Load the C code: set the path where the C files reside and use a regex to keep only the .c files.

Consider only the first top_n characters and discard the rest, for memory and computational efficiency.
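
The filtering step can be tried on a small in-memory list before pointing it at the real directory. The sketch below is illustrative only: the file names are made up (they are not files from the dataset), and `c_pattern` is a name chosen here for the compiled regex.

```python
import re

# Hypothetical directory listing; in the solution the names come from os.listdir().
file_names = ["sched.c", "sched.h", "Makefile", "fork.c", "notes.c.bak"]

# Raw string: "\." matches a literal dot and "$" anchors the end of the name,
# so "notes.c.bak" is rejected even though it contains ".c".
c_pattern = re.compile(r".*\.c$")

c_files = [name for name in file_names if c_pattern.match(name)]
print(c_files)
```

Anchoring the pattern at the end of the name is what keeps headers, backups, and build files out of the list.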

Convert characters to integers
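
A minimal sketch of the character-to-integer mapping on a toy string (the string `"int x;"` is an assumption for illustration, standing in for the merged C source). It also shows that the inverse mapping recovers the original text.

```python
text = "int x;"  # toy stand-in for the merged C source

# One integer per distinct character, assigned in sorted order.
chars = sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

encoded = [char_indices[c] for c in text]
decoded = "".join(indices_char[i] for i in encoded)

print(encoded)
print(decoded == text)
```

Sorting the character set makes the mapping deterministic across runs, which matters if the mapping is later reused to decode model output.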

Divide the data into input (X) and output (y).

Create the input and output from the generated sequences: X should have shape (number of samples, time steps, vocabulary size), with MAX_SEQ_LENGTH = 50, STEP = 3, and VOCAB_SIZE = len(chars).
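
The windowing and one-hot encoding can be sketched on a toy string. This is an illustration under stated assumptions: the text is `"abcdefghij"`, MAX_SEQ_LENGTH is shortened to 4 so the result is easy to inspect, and plain Python lists stand in for the NumPy arrays used in the solution. The helper `one_hot` is hypothetical.

```python
text = "abcdefghij"      # toy stand-in for the merged C source
MAX_SEQ_LENGTH = 4       # shortened from 50 for readability (assumption)
STEP = 3

chars = sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}
VOCAB_SIZE = len(chars)

# Slide a window of MAX_SEQ_LENGTH characters over the text, STEP at a time;
# the character right after each window is the prediction target.
sentences, next_chars = [], []
for i in range(0, len(text) - MAX_SEQ_LENGTH, STEP):
    sentences.append(text[i:i + MAX_SEQ_LENGTH])
    next_chars.append(text[i + MAX_SEQ_LENGTH])

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    row = [0] * size
    row[index] = 1
    return row

# X: samples x time steps x vocabulary; y: samples x vocabulary.
X = [[one_hot(char_indices[c], VOCAB_SIZE) for c in s] for s in sentences]
y = [one_hot(char_indices[c], VOCAB_SIZE) for c in next_chars]

print(sentences, next_chars)
print((len(X), len(X[0]), len(X[0][0])), (len(y), len(y[0])))
```

With STEP = 3 the windows overlap by one character here; in the exercise's setting (length 50, step 3) consecutive windows overlap by 47 characters.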

Build the model with the Sequential API. The first layer is an LSTM with 128 units, input_shape=(MAX_SEQ_LENGTH, VOCAB_SIZE), and return_sequences=True. The second layer is a Dropout layer with rate 0.1, the third is a GRU with 128 units, and the fourth is another Dropout layer with rate 0.1. The output is a Dense layer with VOCAB_SIZE units and softmax activation.
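
As a sanity check on the summary, the trainable parameter counts of these layers can be worked out by hand. The sketch below is a rough calculation, not part of the original solution. It assumes the TensorFlow 2 default `reset_after=True` for the GRU (two bias vectors per gate) and uses a made-up `VOCAB_SIZE = 97`; the real value depends on the characters in the dataset. The helper names are hypothetical.

```python
def lstm_params(units, input_dim):
    # 4 gates, each with input weights, recurrent weights, and one bias vector.
    return 4 * (units * (input_dim + units) + units)

def gru_params(units, input_dim):
    # 3 gates; assumes reset_after=True, which adds a second bias vector per gate.
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(units, input_dim):
    # One weight per input-output pair plus one bias per output unit.
    return units * input_dim + units

VOCAB_SIZE = 97  # hypothetical; the solution uses len(chars)

lstm = lstm_params(128, VOCAB_SIZE)    # LSTM sees one-hot vectors of size VOCAB_SIZE
gru = gru_params(128, 128)             # GRU sees the 128-dim LSTM outputs
dense = dense_params(VOCAB_SIZE, 128)  # Dense sees the 128-dim GRU output
total = lstm + gru + dense             # Dropout layers add no parameters

print(lstm, gru, dense, total)
```

If the numbers printed by `model.summary()` differ, the usual causes are a different vocabulary size or a GRU built with `reset_after=False`.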

**Level** : Hard

**Input format** : 
Path to the directory containing the C files




**Output format** : 
Model summary


**Sample input** : 
The C programming dataset of kernel C code


**Sample Output** : 
Summary

**SOLUTION** :

# import libraries
import warnings
warnings.filterwarnings("ignore")

import os
import re
import numpy as np
import random
import sys
import io
import tensorflow as tf

path = r"/home/sentinal/Music/Folder/archive/CODES/Resources/DL-Code/DL-Code/Generate Automatic Programming Code/attachment_kernel_lyst7535/kernel/"

os.chdir(path)

file_names = os.listdir()
print(file_names)

# use regex to filter .c files
c_names = r".*\.c$"  # raw string so the backslash reaches the regex engine unchanged

c_files = list()

for file in file_names:
    if re.match(c_names, file):
        c_files.append(file)

print(c_files)

# load all c code in a list
full_code = list()
for file in c_files:
    # the context manager closes the file even if read() raises
    with open(file, "r", encoding="utf-8") as code:
        full_code.append(code.read())

print(full_code[20])

# merge different c codes into one big c code
text = "\n".join(full_code)
print("Total number of characters in entire code: {}".format(len(text)))

top_n = 400000
text = text[:top_n]

text

# create character to index mapping
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

indices_char
print("Vocabulary size: {}".format(len(chars)))

# define length for each sequence
MAX_SEQ_LENGTH = 50          
STEP           = 3          
VOCAB_SIZE     = len(chars) 

sentences  = []              # X
next_chars = []              # y

for i in range(0, len(text) - MAX_SEQ_LENGTH, STEP):
    sentences.append(text[i: i + MAX_SEQ_LENGTH])
    next_chars.append(text[i + MAX_SEQ_LENGTH])

print('Number of training samples: {}'.format(len(sentences)))

sentences

# create X and y as one-hot arrays (np.bool was removed in NumPy 1.24; use the builtin bool)
X = np.zeros((len(sentences), MAX_SEQ_LENGTH, VOCAB_SIZE), dtype=bool)
y = np.zeros((len(sentences), VOCAB_SIZE), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

print("Shape of X: {}".format(X.shape))
print("Shape of y: {}".format(y.shape))

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(128, input_shape=(MAX_SEQ_LENGTH, VOCAB_SIZE), return_sequences=True,))
model.add(tf.keras.layers.Dropout(0.1))
model.add(tf.keras.layers.GRU(128))
model.add(tf.keras.layers.Dropout(0.1))
model.add(tf.keras.layers.Dense(VOCAB_SIZE, activation = "softmax"))

# check model summary
model.summary()