### Terminology

**data:** The dataset: a cleaned and preprocessed collection of text stored in a file, also known as a text corpus.

**ix_to_char:** Dictionary mapping each index (ix) to its character (char). It is used to look up the character for a given index, much as we look things up in a book's index, so that the model's output can be converted back into characters.

**char_to_ix:** Dictionary mapping each character (char) to its index (ix). It is used to convert the characters of the text into indexes that the model can work with; ix_to_char performs the reverse conversion.

**num_iterations:** The number of training iterations you want your model to go through; each iteration processes one example.

**n_a:** The number of hidden units in the RNN cell, i.e. the size of the hidden state vector.

**sample_size:** The number of samples to generate each time the model is checked. Samples are printed after a fixed number of iterations to see how the model is doing.

**vocab_size:** The number of unique items in the dataset. For a character-based RNN, the vocab size is the number of unique characters (here 26 letters + 1 newline character '\n', which marks the end of each example); for a word-based RNN, it is the number of unique words, which can be very large.

**THERE ARE SOME IMPORTANT TERMS I HAVE DEFINED INSIDE THE FUNCTIONS INSTEAD, TO GIVE A BETTER IDEA.**
**PLEASE SEARCH FOR:**
**TIME_STEPS in INITIALIZE_PARAMETERS**
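
To make the two dictionaries concrete, here is a minimal sketch on a tiny made-up corpus. The string `toy_data` is purely an assumption for illustration; the notebook builds the same mappings from the contents of "data_file".

```python
# Hypothetical toy corpus; the notebook reads its text from "data_file" instead.
toy_data = "abba\ncab\n"

chars = sorted(set(toy_data))   # unique characters in a stable order
vocab_size = len(chars)         # number of unique characters

char_to_ix = {ch: i for i, ch in enumerate(chars)}   # character -> index
ix_to_char = {i: ch for i, ch in enumerate(chars)}   # index -> character

encoded = [char_to_ix[ch] for ch in "cab"]           # text -> list of indices
decoded = "".join(ix_to_char[i] for i in encoded)    # list of indices -> text

print(vocab_size, encoded, decoded)
```

Because '\n' sorts before the letters, it receives index 0 in this toy corpus, and the round trip through both dictionaries returns the original text.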


In [2]:
import numpy as np

In [86]:
# Comments are placed below the lines of code they describe.

def RNN( data, ix_to_char, char_to_ix, num_iterations = 1000000, n_a = 50, sample_size=7, vocab_size=27 ):
    
    
    n_x, n_y = vocab_size, vocab_size
    # n_x--> The size of the one-hot encoded input vector: a one-dimensional vector of 0's with a single 1.
    #        Instead of raw indexes, we feed the model these one-hot encoded vectors
    #        to represent the characters.
    # Why do we use one-hot encoded vectors?
    # This binary representation makes no assumptions about the similarity of data points:
    # two characters are either equal or not.
    # Say we have "a" indexed at 1,
    #             "b" indexed at 2,
    #             "c" indexed at 3.
    # With raw indexes the model could be misled into treating c = a + b,
    # i.e. arithmetic relations between indexes could be mistaken for relations between characters.
    
    # The output is also a character, therefore n_x = n_y = vocab_size (27 here).
    
    parameters = initialize_parameters( n_a, n_x, n_y )
    # Initialize the weights randomly and the biases to zero.
    
    loss = initial_loss( vocab_size, sample_size )
    # Starting value for the smoothed loss; ignore this if you aren't familiar with it.
    
    with open( "data_file" ) as file:
        examples = file.readlines()
    examples = [x.lower().strip() for x in examples]
    # Build a Python list of examples (one per line) from the text corpus.
    
    np.random.shuffle(examples)
    # Shuffle the examples so that the model cannot learn from the order
    # in which they appear in the file.
    # We want the model to depend on no pattern other than those in the text itself.
    
    a_prev = np.zeros((n_a, 1))
    # Initialize the hidden state vector with zeros.
    
    #---------------------------------The iteration/optimization loop----------------------------------------#
    for j in range(num_iterations):
    # Iterate many times so that the parameters gradually approach a good approximation
    # of the function we are trying to learn, by reducing the cost function (loss).
    
        #----------------------making of input X and prediction Y------------------#
        index = j % len(examples)
        # Cycle through the examples from the 0th to the last, since 0 <= index < len(examples).
        
        X = [None] +[char_to_ix[ch] for ch in examples[index]]
        # (X-->input) Convert the example from examples[] into a Python list of character indices,
        # with None prepended to stand for the all-zero input at the first time step.
        
        Y = X[1:] + [char_to_ix["\n"]]
        # (Y-->targets to be predicted) X shifted one position to the left, with the index of "\n" appended.
        # If you are not familiar with Python, look up "list slicing".
        
        # Example: X=[None,"A","P","P","L","E"]
        #          Y=["A","P","P","L","E","\n"]; given None the model should predict "A", and so on.
        # In the example above X and Y are shown as lists of characters, but in the code they are lists of indices.
        # The characters are used only to make the example easier to read.

        #----------------------end of input/prediction making---------------------#
        
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)
        # One step of forward prop --> backward prop --> gradient clipping --> parameter update.
        # It returns the current loss, the dictionary of gradients, and the last hidden state.
        # The function and "a_prev" are explained in detail later in this code.
        
        loss = smooth(loss, curr_loss)
        # Keep an exponentially weighted running average of the loss so that the printed value is less noisy.
        # This affects only the reported loss, not the training itself.
        # Ignore for now.
        
        #---------------------------------sampling-----------------------------------#
        if j % 2000 == 0:
            
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            
            for name in range(sample_size):
                
                sampled_indices = sample(parameters, char_to_ix)
                # Sampling is performed here; it is explained in detail later.
                # Just remember that we do it to keep a check on our outputs:
                # we generate a number of outputs
                # and see how the model is performing.
                
                print_sample(sampled_indices, ix_to_char) 
                # Print the sampled output.
            print('\n')
        #-------------------------------end of sampling-------------------------------#
    
    #---------------------------------end of iteration/optimization loop-----------------------------------#
    
    return parameters
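
The construction of the input X and the targets Y can be tried on its own. The sketch below assumes a 27-character vocabulary in which '\n' has index 0 and the letters a to z have indices 1 to 26; the dictionary built here is an illustrative stand-in for the one created from "data_file".

```python
# Assumed vocabulary: "\n" at index 0, then "a".."z" at indices 1..26.
char_to_ix = {ch: i for i, ch in enumerate("\nabcdefghijklmnopqrstuvwxyz")}

example = "apple"

# Input: None (no character yet) followed by the indices of the example.
X = [None] + [char_to_ix[ch] for ch in example]

# Targets: the input shifted left by one, ending with the newline index.
Y = X[1:] + [char_to_ix["\n"]]

for x_t, y_t in zip(X, Y):
    print(x_t, "->", y_t)   # at each time step, the model sees x_t and should predict y_t
```

X and Y always have the same length, so every time step has exactly one input and one target.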

In [87]:
def initialize_parameters(n_a, n_x, n_y):

    Wax = np.random.randn(n_a, n_x)*0.01 
    # input to hidden weight matrix
    # dimensions--> ( number of RNN units, size of the one-hot encoded vector to represent a single character)
#-----------------------------------------------------------------------------------------------------------------#
    # A brief note on TIME STEPS and THE USE OF MATRICES IN THE CONTEXT OF RNNs:
    
    # The term "time step" in an RNN means (here)
    # the position of a character within a single example,
    # e.g. ["A","P","P","L","E","\n"].
    # Suppose we are at the first "P".
    # The characters after it ("P","L","E","\n") belong to future time steps,
    # and the character before it ("A") belongs to a past time step.
    
    # The number of time steps is the length of the example, not the number of RNN units (n_a).
    # The same weight matrices are reused at every time step, and forward_prop still loops over the time steps.
    # What the (n_a, n_x) matrix spares us is a loop over the hidden units:
    # a single matrix product computes the input's contribution to all n_a units at once.
    # The hidden state then carries information from past time steps into the present one.
#-----------------------------------------------------------------------------------------------------------------#
    
    Waa = np.random.randn(n_a, n_a)*0.01 
    # hidden to hidden weight matrix
    #dimensions--> ( number of RNN units, number of RNN units)
    
    Wya = np.random.randn(n_y, n_a)*0.01 
    # hidden to output weight matrix
    #dimensions--> ( one-hot encoded output vector for a single character, number of RNN units)
    
    b = np.zeros((n_a, 1)) 
    # hidden bias vector
    # dimensions--> ( number of RNN units, 1)
    # it is a column vector; NumPy broadcasting stretches it as needed during addition.

#-----------------------------------------------------------------------------------------------------------------#
    # A brief aside on BIAS and VARIANCE (the statistical terms, not the bias vectors b and by):
    
    # BIAS error arises when:
    # a problem that may be extremely complicated is approximated by a much simpler model.
    # A sign of high bias is a model that does not even fit the training set (underfitting).
    # Making the model more complex can solve this problem.
    
    # VARIANCE error:
    # When we train on different datasets, the model approximates the function differently each time;
    # how much the approximation changes from one training set to another is the VARIANCE.
    # High variance means overfitting.
    # To solve this problem, try regularization or more data.
    
    # TO SUM UP:
    # As a statistical method tries to match data points more closely, or when a more flexible method is used,
    # the bias decreases but the variance increases.
#-----------------------------------------------------------------------------------------------------------------#
        
    by = np.zeros((n_y, 1)) 
    # output bias vector
     
    parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b,"by": by}
    # Save the initialized weights and biases in the parameters dictionary.
    
    return parameters
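
As a rough illustration of how these parameters are used, the sketch below applies one set of weights to every time step of a short sequence. The sizes match the notebook's defaults (n_a = 50, n_x = 27); the random generator, the seed, and the index sequence for "apple" are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, used here only for reproducibility
n_a, n_x = 50, 27

Wax = rng.standard_normal((n_a, n_x)) * 0.01   # input -> hidden
Waa = rng.standard_normal((n_a, n_a)) * 0.01   # hidden -> hidden
b = np.zeros((n_a, 1))                         # hidden bias

a = np.zeros((n_a, 1))            # initial hidden state
for ix in [1, 16, 16, 12, 5]:     # assumed indices for "apple"
    x = np.zeros((n_x, 1))
    x[ix] = 1                     # one-hot input for this time step
    # The same Wax, Waa and b are reused at every time step.
    a = np.tanh(Wax @ x + Waa @ a + b)

print(a.shape)
```

Multiplying Wax by a one-hot vector simply selects one column of Wax, which is why a (n_a, n_x) matrix is enough to handle every character in the vocabulary.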

In [88]:
def initial_loss(vocab_size, seq_length):
    return -np.log(1.0/vocab_size)*seq_length
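
One way to read initial_loss: it is the cross-entropy a model would incur if it assigned the same probability to every character, accumulated over seq_length characters. The sketch below recomputes that baseline with the values the notebook passes in (vocab_size = 27 and sample_size = 7); the variable names are chosen here for illustration.

```python
import numpy as np

vocab_size, seq_length = 27, 7

# A model that knows nothing assigns probability 1/vocab_size to each character.
uniform = np.full(vocab_size, 1.0 / vocab_size)

per_char = -np.log(uniform[0])     # cross-entropy of one uniform prediction, log(27)
baseline = per_char * seq_length   # the value initial_loss returns for these inputs

print(per_char, baseline)          # roughly 3.30 and 23.07
```

A smoothed loss that falls well below this baseline indicates the model is doing better than guessing uniformly.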

In [89]:
# This function combines:

# (1.) forward propagation --> loss calculation

# (2.) backward propagation --> gradient calculation:
# i.e. the derivatives of the loss function with respect to the parameters.
# The derivatives are computed with the chain rule,
# which tells us how a change in one connection's weight matrix affects the loss through the connections after it.

# (3.) gradient clipping --> to avoid the exploding gradient problem, which can produce "NaN" (not a number).

# (4.) update parameters --> to update the weight matrices using the computed gradients.

def optimize( X, Y, a_prev, parameters, learning_rate = 0.01 ):
    
    loss, cache = forward_prop( X, Y, a_prev, parameters )
    #1
    
    gradients, a = backward_prop(X, Y, parameters, cache)
    #2
    
    gradients = clip(gradients, 5)
    #3
    
    parameters = update_parameters(parameters, gradients, learning_rate)
    #4
    
    return loss, gradients, a[len(X)-1]
    # We return:
    # the loss,
    # the clipped gradients,
    # the hidden state at the last time step.

In [90]:
def forward_prop(X, Y, a_0, parameters, vocab_size = 27):
    
    x, a, y_hat = {}, {}, {}
    # Here we initialize three dictionaries, each keyed by time step:
    # x--> the one-hot encoded characters of the example passed to this function.
    # a--> the hidden activations.
    # y_hat--> the softmax probabilities for the predicted character.
    
    a[-1] = np.copy(a_0)
    # saving a copy of a_0(the starting activation often a zero vector)
    
    loss = 0
    # initializing the loss to zero
    
    for t in range(len(X)):
    # Loop over one example, e.g. X=[None,"A","P","P","L","E"], len(X)=6.
    # P.S. X is not a list of characters but of their indices;
    # the characters are shown only for readability.
    
        x[t] = np.zeros((vocab_size,1)) 
        # First, the t-th entry is set to a zero vector.
        
        if X[t] is not None:
            x[t][X[t]] = 1
        # If X[t] is not None, position X[t] of the zero vector is set to 1.
        # Why position X[t]?
        # Because in the char_to_ix dictionary
        # each character's index is its position in the one-hot vector.
        
        # Example:
        # X=[None,1,16,16,12,5], which is [None,"A","P","P","L","E"]
        # so x[0] = [0,0,0,0,...,0] (27 items, all zero because X[0] is None)
        # x[1] = [0,1,0,0,...,0] (27 items, 1 at position 1)
        # x[2] = [0,0,0,0,...,1,...,0] (27 items, 1 at position 16)
        
        a[t], y_hat[t] = rnn_cell(parameters, a[t-1], x[t])
        # Get the current hidden activation and the prediction
        # from a single RNN cell step.
        
        loss -= np.log(y_hat[t][Y[t],0])
        # Update the loss by adding the cross-entropy term of this time step, -log(y_hat[t][Y[t]]).
    
    cache = (y_hat, a, x)
    #storing all these in cache
    
    return loss, cache
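
The one-hot encoding step inside forward_prop can be checked in isolation. This is a minimal sketch that assumes a 27-character vocabulary and a short hand-written index list; it is not taken from the training data.

```python
import numpy as np

vocab_size = 27
X = [None, 1, 16]   # assumed indices: no character, then "a", then "p"

x = {}
for t, ix in enumerate(X):
    x[t] = np.zeros((vocab_size, 1))   # start from an all-zero column vector
    if ix is not None:
        x[t][ix] = 1                   # switch on the position of this character

for t in x:
    print(t, int(x[t].sum()), x[t].ravel().nonzero()[0])
```

The first vector stays all zero because its input is None; every other vector contains exactly one 1, at the position given by the character's index.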

In [91]:
def rnn_cell(parameters, a_prev, x):
    
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    
    a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b) 
    # hidden state
    # x is a column vector with the dimensions discussed earlier,
    # so we can take its matrix product with Wax.
    
    y = np.dot(Wya, a_next) + by
    # Output scores: unnormalized log probabilities for the next character.
    
    p_t = softmax(y) 
    # Probabilities for the next character.
    
    return a_next, p_t

In [92]:
def softmax(y):
    
    e_x = np.exp(y - np.max(y))
    
    return e_x / e_x.sum(axis=0)
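
Subtracting np.max(y) before exponentiating does not change the result of softmax, but it keeps np.exp from overflowing. The sketch below compares a naive version with the version used in this notebook on deliberately large scores; the input values are made up for illustration.

```python
import numpy as np

def softmax(y):
    e_x = np.exp(y - np.max(y))   # shift so that the largest exponent is 0
    return e_x / e_x.sum(axis=0)

z = np.array([[1000.0], [1001.0], [1002.0]])   # large scores, chosen to force overflow

# Naive softmax: exp(1000) overflows to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.exp(z).sum(axis=0)

stable = softmax(z)

print(naive.ravel())    # all nan
print(stable.ravel())   # valid probabilities that sum to 1
```

Shifting every score by the same constant leaves the probabilities unchanged, so softmax(z) equals softmax(z - 1000).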

In [93]:
def clip(gradients, maxValue):
    
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']

    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue,out=gradient)
  
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    
    return gradients
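
The effect of np.clip with out= can be seen on a small array. The values below are invented for illustration; the threshold of 5 matches the one passed to clip in optimize.

```python
import numpy as np

grad = np.array([[-12.0, 0.5],
                 [3.0, 40.0]])   # made-up gradient values

# Clip every entry to the range [-5, 5], writing the result back into grad.
np.clip(grad, -5, 5, out=grad)

print(grad)
```

Note that this is element-wise clipping: each entry is limited independently, which can change the direction of the gradient. Rescaling the whole gradient by its norm is a different technique and is not what this notebook does.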

In [94]:
def backward_prop(X, Y, parameters, cache):
    
    
    gradients = {}
    # Initialize gradients as an empty dictionary
    
    
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    # Retrieve from cache and parameters
    
    
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])
    # each one should be initialized to zeros of the same dimension as its corresponding parameter
    
    
    for t in reversed(range(len(X))):
    # Backpropagate through time
    
        dy = np.copy(y_hat[t])
        dy[Y[t]] -= 1
        gradients = grad_calc(dy, gradients, parameters, x[t], a[t], a[t-1])
   
    return gradients, a
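
The two lines that build dy rely on a standard result: for a softmax output with cross-entropy loss, the gradient with respect to the scores is the probability vector minus the one-hot target. The sketch below checks this numerically with finite differences; the score vector, the target index, and the helper name `loss` are assumptions for this example only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)

def loss(z, target):
    # Cross-entropy of the softmax output against a single target index.
    return -np.log(softmax(z)[target, 0])

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 1))   # made-up scores for a 5-character vocabulary
target = 2

# Analytic gradient, built the same way as dy in backward_prop.
analytic = softmax(z)
analytic[target] -= 1

# Numerical gradient by central differences.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (loss(zp, target) - loss(zm, target)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)   # should be tiny if the two gradients agree
```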

In [95]:
def grad_calc(dy, gradients, parameters, x, a, a_prev):
    
    gradients['dWya'] += np.dot(dy, a.T)
    
    gradients['dby'] += dy
    
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next'] 
    # backprop into the hidden state a
    
    daraw = (1 - a * a) * da 
    # backprop through tanh nonlinearity
    
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    
    return gradients
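
The updates in grad_calc follow from the chain rule applied to a single time step. As a sketch in the notation of the code, where z_t is a name introduced here for the pre-activation (it does not appear in the code) and dy_t is the vector dy computed in backward_prop:

```latex
\begin{aligned}
z_t &= W_{ax}\, x_t + W_{aa}\, a_{t-1} + b, \qquad a_t = \tanh(z_t) \\
\frac{\partial L}{\partial a_t} &= W_{ya}^{\top}\, dy_t + da_{\text{next}} \\
\frac{\partial L}{\partial z_t} &= \left(1 - a_t \odot a_t\right) \odot \frac{\partial L}{\partial a_t}
  \qquad \text{(daraw in the code)} \\
dW_{ya} &\mathrel{+}= dy_t\, a_t^{\top}, \qquad db_y \mathrel{+}= dy_t \\
dW_{ax} &\mathrel{+}= \frac{\partial L}{\partial z_t}\, x_t^{\top}, \qquad
dW_{aa} \mathrel{+}= \frac{\partial L}{\partial z_t}\, a_{t-1}^{\top}, \qquad
db \mathrel{+}= \frac{\partial L}{\partial z_t} \\
da_{\text{next}} &\leftarrow W_{aa}^{\top}\, \frac{\partial L}{\partial z_t}
\end{aligned}
```

The factor 1 - a_t ⊙ a_t is the derivative of tanh, and da_next passes the gradient back to the previous time step, which is why backward_prop walks through the time steps in reverse.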

In [96]:
def update_parameters(parameters, gradients, lr):

    parameters['Wax'] += -lr * gradients['dWax']
    parameters['Waa'] += -lr * gradients['dWaa']
    parameters['Wya'] += -lr * gradients['dWya']
    parameters['b']  += -lr * gradients['db']
    parameters['by']  += -lr * gradients['dby']
    
    return parameters

In [97]:
def smooth(loss, cur_loss):
    return loss * 0.999 + cur_loss * 0.001
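
The smoothing function is an exponentially weighted moving average. The sketch below shows how slowly it reacts: the starting value of 23.0 and the constant per-iteration loss of 10.0 are invented for illustration.

```python
def smooth(loss, cur_loss):
    return loss * 0.999 + cur_loss * 0.001

running = 23.0                        # made-up starting value
for _ in range(1000):
    running = smooth(running, 10.0)   # pretend every iteration has loss 10.0

# Closed form after n steps: target + (start - target) * 0.999 ** n
expected = 10.0 + (23.0 - 10.0) * 0.999 ** 1000

print(running, expected)
```

Even after 1000 iterations the running value has covered only part of the distance to 10.0, so the printed loss lags behind the true per-example loss but is much less noisy.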

In [98]:
# Sampling: generate a sequence of character indices from the trained model

def sample(parameters, char_to_ix):
     
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    # Retrieve parameters and relevant shapes from "parameters" dictionary
    
    vocab_size = by.shape[0]
    #fetching vocab_size
    
    n_a = Waa.shape[1]
    # number of hidden units
    
    x = np.zeros((vocab_size,1))
    # Initial input: an all-zero vector (no character yet).
    
    a_prev = np.zeros((n_a,1))
    # Initial hidden state: all zeros.
    
    indices = []
    # Create an empty list that will hold
    # the indices of the generated characters.
    
    idx = -1 
    # idx holds the most recently sampled index and is used to detect a newline character; we initialize it to -1.
   
    
    counter = 0
    # Initialize the counter for the number of characters generated.
    newline_character = char_to_ix['\n']
    # Index of '\n'.
    
    while (idx != newline_character and counter != 50):
    # Stop on '\n' or after generating at most 50 characters.
    
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        # activation matrix getting made via learned parameters Wax,Waa & b
        
        z = np.dot(Wya, a) + by
        # output matrix getting made via learned parameters Wya and by
        
        y = softmax(z)
        # softmax probabilities for character prediction
        
        idx = np.random.choice(list(range(vocab_size)), p=y.ravel())
        # Sample the index of a character within the vocabulary from the probability distribution y
        
        indices.append(idx)
        #appending the index into indices list
        
        x = np.zeros((vocab_size, 1))
        # Reset the input vector to zeros.
        
        x[idx] = 1
        # Set the entry for the sampled index to 1, so that the sampled character becomes the next input.
        
        a_prev = a
        # Update "a_prev" to be "a"
        
        counter+=1

    if (counter == 50):
        indices.append(char_to_ix['\n'])
    
    return indices
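
The sampling step draws the next index at random according to the softmax probabilities instead of always taking the most likely one. The sketch below contrasts the two on a made-up three-character distribution; the probabilities and the seed are assumptions for illustration.

```python
import numpy as np

np.random.seed(0)   # fixed seed, only so that the run is repeatable

y = np.array([[0.1], [0.7], [0.2]])   # made-up probabilities for 3 characters

# Random sampling, as in sample(): each index is drawn with its probability.
draws = [np.random.choice(3, p=y.ravel()) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)

# Greedy alternative: always pick the most likely index.
greedy = int(np.argmax(y))

print(counts, greedy)
```

Sampling is what lets the model produce a different name on each call; a greedy choice would generate the same sequence every time for the same parameters.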

In [99]:
def print_sample(indices, ix_to_char):
    
    txt = ''.join(ix_to_char[ix] for ix in indices)
    
    txt = txt[0].upper() + txt[1:]  
    # capitalize first character 
    
    print ('%s' % (txt, ), end='')

In [100]:
with open('data_file', 'r') as f:
    data = f.read()
data = data.lower()
chars = sorted(set(data))
data_size, vocab_size = len(data), len(chars)
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

In [None]:
parameters = RNN(data, ix_to_char, char_to_ix)

In [107]:
#after training the results are :
for i in  range(10):
    indices = sample(parameters, char_to_ix)
    print_sample(indices, ix_to_char)

Moramery
Riellin
Pargele
Zobrra
Kantry
Forta
Rryrini
Burde
Chriethei
Gurill


In [None]:
#nice names bye