[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pantelis-nlp/tutorial-nlp-notebooks/blob/main/rnn_language_model.ipynb)

In [None]:
#data to train on
data = 'Chios island is crescent or kidney shaped, 50 km (31 mi) long from north to south, and 29 km (18 mi) at its widest, covering an area of 842.289 km2 (325.210 sq mi).[2] The terrain is mountainous and arid, with a ridge of mountains running the length of the island. The two largest of these mountains, Pelineon (1,297 m (4,255 ft)) and Epos (1,188 m (3,898 ft)), are situated in the north of the island. The center of the island is divided between east and west by a range of smaller peaks, known as Provatas.'

In [None]:

#import necessary libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, Model
import time 
 
"""
General idea:
We train the model on a list of fixed-length strings, each one-hot encoded, together with a list of target indices
giving the character that follows each string. Each string is num_chars characters long. The strings are taken from a
stretch of seq_length characters in steps of num_chars, so one batch holds num_examples = ceil(seq_length / num_chars)
strings. The encoded input is a 3D array of shape (num_examples, num_chars, vocab_size).

The exact details of the model are defined in a later cell.
"""
#get a sorted list of all characters used in the data (sorting keeps the index mapping the same across runs)
chars = sorted(set(data))

#define the size of our data and vocabulary
data_size, vocab_size = len(data), len(chars)

print('data has %d characters, %d unique.' % (data_size, vocab_size))

#create mappings of character to index and from index to character
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }


#length of text to extract
seq_length = 200

#define number of characters to extract per string
num_chars = 7


data has 508 characters, 43 unique.
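
As a quick check of the character mappings above, the following sketch builds the same kind of lookup tables on a toy string and confirms that encoding and then decoding returns the original text. The names `toy_text`, `toy_char_to_ix`, and `toy_ix_to_char` are hypothetical stand-ins for `data`, `char_to_ix`, and `ix_to_char`, and the use of `sorted` is an assumption of this sketch that makes the indices reproducible.

```python
toy_text = "chios island"  # hypothetical stand-in for `data`

# build the vocabulary and the two lookup tables
toy_chars = sorted(set(toy_text))
toy_char_to_ix = {ch: i for i, ch in enumerate(toy_chars)}
toy_ix_to_char = {i: ch for i, ch in enumerate(toy_chars)}

# encode the text as indices, then decode it back to a string
encoded = [toy_char_to_ix[ch] for ch in toy_text]
decoded = "".join(toy_ix_to_char[i] for i in encoded)

print(len(toy_chars), decoded == toy_text)
```

If the round trip fails, the two dictionaries are not inverses of each other, which would silently corrupt both the training targets and the sampled text.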


In [None]:
#perform one hot encoding of a list of strings
def one_hot_encoding(lst):
    #define list to store encoded slices of text
    X = []
    #go through every string in list
    for string in lst:
        #keep list of encoding per character
        encoding = []
        for char in string:
            #encode character in one-hot
            row = np.zeros((1,vocab_size))
            row[0][char_to_ix[char]] = 1
            #add to our encoding list
            encoding.append(row)
        #add our encoded string to X
        X.append(encoding)
    #convert to np array    
    X = np.array(X)
    
    #reshape to match our input shape for network
    X = X.reshape(X.shape[0], X.shape[1], X.shape[3])
    return X
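
The same encoding can also be written without explicit loops over characters. The sketch below is one possible alternative, not the notebook's method: it indexes into an identity matrix, so row `i` of `np.eye(vocab_size)` serves as the one-hot vector for index `i`. The function name `one_hot_encoding_vectorized` and the toy vocabulary are hypothetical, and the sketch assumes all strings have the same length.

```python
import numpy as np

def one_hot_encoding_vectorized(strings, char_to_ix, vocab_size):
    """Sketch: one-hot encode equal-length strings via identity-matrix indexing."""
    # integer index array of shape (num_strings, string_length)
    idx = np.array([[char_to_ix[ch] for ch in s] for s in strings])
    # fancy indexing returns shape (num_strings, string_length, vocab_size)
    return np.eye(vocab_size)[idx]

toy_map = {"a": 0, "b": 1, "c": 2}  # hypothetical toy vocabulary
X_toy = one_hot_encoding_vectorized(["abc", "cab"], toy_map, 3)
print(X_toy.shape)
```

The result already has the (num_examples, num_chars, vocab_size) layout the network expects, so no reshape is needed.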

In [None]:
#returns the input strings and target characters to fit the model with
def variable_input_output(p, n, seq_length):
    '''
    inputs:
    p : index in data at which to start extracting characters
    n : number of characters per input string, and the step between consecutive strings
    seq_length : how many characters to cover from the starting point
    '''
    #input (X) list: strings of length n
    inputs = [data[i:i+n] for i in range(p, p+seq_length, n)]
    #target (Y) list: the character that follows each input string
    targets = [data[i+n:i+n+1] for i in range(p, p+seq_length, n)]

    return inputs, targets
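
To make the windowing concrete, here is a small sketch of the same slicing logic applied to a toy string. The helper `make_windows` and the string `toy_text` are hypothetical and exist only for illustration. The helper takes the text as an argument instead of reading the global `data`.

```python
import math

def make_windows(text, p, n, seq_length):
    """Sketch of the windowing logic on an arbitrary string."""
    # strings of length n, starting every n characters
    inputs = [text[i:i + n] for i in range(p, p + seq_length, n)]
    # the single character that follows each string
    targets = [text[i + n:i + n + 1] for i in range(p, p + seq_length, n)]
    return inputs, targets

toy_text = "abcdefghij"  # hypothetical stand-in for `data`
inputs, targets = make_windows(toy_text, p=0, n=3, seq_length=6)
print(inputs, targets)  # ['abc', 'def'] ['d', 'g']

# with seq_length = 200 and num_chars = 7, each batch holds this many windows
num_examples = math.ceil(200 / 7)
print(num_examples)  # 29
```

Because the slice `text[i + n:i + n + 1]` returns an empty string past the end of the text, a window that ends at the last character gets an empty target, which is the case the training loop checks for.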





In [None]:


"""
The model follows this pipeline:

Input of shape (num_chars, vocab_size)

The input is convolved with 3 different 1D convolutions:
  10 filters of kernel size 2 -> output shape (num_chars - 1, 10)
  10 filters of kernel size 3 -> output shape (num_chars - 2, 10)
  10 filters of kernel size 4 -> output shape (num_chars - 3, 10)

The three outputs are concatenated along axis 1 to get a shape of (3*num_chars - 6, 10)

The concatenation is passed into a MaxPooling1D layer (default pool size 2) with same padding, giving output shape (ceil((3*num_chars - 6)/2), 10)

The result is flattened into shape (None, ceil((3*num_chars - 6)/2) * 10)

Finally, this is passed into a fully connected (FC) layer with a softmax over vocab_size units, giving output shape (None, vocab_size)

The idea behind this model is to let it learn from several context widths at once. The three convolutions look at every 2, 3, and 4 consecutive characters
of each encoded string, and each uses 10 filters so that it can pick up several different patterns at its width. Their outputs are concatenated so that the
later layers see all three views together.

"""
#define our input of shape (num_chars, vocab_size)
input_ = layers.Input(shape = (num_chars, vocab_size))

#run 3 different convolutions on it with different sizes and activation relu
conv1 = layers.Conv1D(10, (2), activation='relu')(input_)
conv2 = layers.Conv1D(10, (3), activation='relu')(input_)
conv3 = layers.Conv1D(10, (4), activation='relu')(input_)

#concatenate our different convolution results
concat = layers.concatenate([conv1,conv2,conv3], axis = 1)

#perform max pooling on our concatenation
pool = layers.MaxPooling1D(padding  = "same")(concat)

#flatten it 
flat = layers.Flatten()(pool)

#pass into FC and use softmax to get probabilities
final = layers.Dense(vocab_size, activation='softmax')(flat)

#create our model
model = Model(inputs=[input_], outputs=[final])

#generate a model summary
model.summary()

#compile our model with sparse categorical crossentropy loss, since the targets are integer character indices
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')


Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, 7, 43)]      0           []                               
                                                                                                  
 conv1d_3 (Conv1D)              (None, 6, 10)        870         ['input_2[0][0]']                
                                                                                                  
 conv1d_4 (Conv1D)              (None, 5, 10)        1300        ['input_2[0][0]']                
                                                                                                  
 conv1d_5 (Conv1D)              (None, 4, 10)        1730        ['input_2[0][0]']                
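
The shapes and parameter counts in the summary can be reproduced by hand. The sketch below traces them with plain arithmetic, assuming "valid" padding for the convolutions, a pool size of 2 with stride 2, and "same" padding for the pooling layer (the Keras defaults used above). The helper name `cnn_shapes` is hypothetical.

```python
import math

def cnn_shapes(num_chars, vocab_size, kernel_sizes=(2, 3, 4), filters=10, pool_size=2):
    """Sketch: trace output lengths and Conv1D parameter counts by hand."""
    # a 'valid' convolution shortens the sequence by kernel_size - 1
    conv_lengths = [num_chars - k + 1 for k in kernel_sizes]
    # each filter has kernel_size * vocab_size weights plus one bias
    conv_params = [(k * vocab_size + 1) * filters for k in kernel_sizes]
    # concatenating along the time axis adds the lengths
    concat_length = sum(conv_lengths)
    # 'same' padding with stride equal to pool_size rounds the length up
    pooled_length = math.ceil(concat_length / pool_size)
    flat_size = pooled_length * filters
    return conv_lengths, conv_params, concat_length, pooled_length, flat_size

shapes = cnn_shapes(num_chars=7, vocab_size=43)
print(shapes)  # ([6, 5, 4], [870, 1300, 1730], 15, 8, 80)
```

The first two lists match the three Conv1D rows of the summary. The pooled length is 8 rather than 7 because "same" padding rounds 15 / 2 up.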
                                                                                            

In [None]:
#train our model
def train(num_chars):
  #initialize iter counter and data pointer
  n, p = 0, 0
  #continuously run training for 20 minutes
  end = time.time() + 60 * 20
  while time.time() < end:
    #if the next window would run past the end of data, wrap around
    if (p+seq_length+1) >= len(data): 
      p = 0 # go from start of data
    
    #fetch input and target with current data pointer, num of characters to extract per training example, and how long the text should be to extract from
    input, target = variable_input_output(p, num_chars, seq_length)
    #if our target hit the end and has no character to follow input, drop last example
    if target[-1] == '':    
      input = input[:-1]
      target = target[:-1]

    #convert our input into a one hot encoded version: of shape (num_examples based on seq_length and num_chars, num_chars, vocab_size)
    X = one_hot_encoding(input)

    #convert our target into a numpy array of char indices
    Y = np.array([char_to_ix[ch] for ch in target])
    
    # every 100 iterations, sample 200 characters from the model to inspect its progress
    if n % 100 == 0:
      print("iter {}".format(n))
      #initialize a list that will store sampled indices 
      ixes = []
      #initialize the string that will be used for prediction, at the start this will always be the first num_chars characters of data
      st = [data[0:0+num_chars]]
      #sample 200 indices from model
      for i in range(200):
          #one hot encode our string
          x = one_hot_encoding(st)
          #get softmax prediction from model
          pred = model.predict(x, verbose = 0)
          #using softmax predictions as probabilities, choose an index from the vocab
          ix = np.random.choice(range(vocab_size), p=pred.ravel()) 
          #add it to our indices list 
          ixes.append(ix)
          #slide the window: drop the first character and append the sampled one. So "Chios i" would become "hios is" if ix was the index corresponding to "s"
          st[0] = st[0][1:] + ix_to_char[ix]


          
      #create text from sampled indices
      txt = ''.join(ix_to_char[ix] for ix in ixes)
      #print it out
      print('----\n %s \n----' % (txt, ))

    #fit our input and target
    model.fit(X,Y, verbose = 0)

    p += 1 # move data pointer
    n += 1 # iteration counter 

train(num_chars)
print("20 minutes elapsed, training stopped")

iter 0
----
 oe the serren of om, 8t itad iise ituand insd bnd erinean th on (1,297 m (4,255 ft)) and Epos (1,188 m (3,898 ft)), are situated in the oolaae. sh perndewnet peanda wides,, cuveinge of mountain or mou 
----
iter 100
----
 untadn the norta kouth, aid aid itudtid it itudtid ins running the length of the islandes of thesi mountainou  and arid, with t daadids widh te covering an area of 842.289 km2 (325.210 sq mi).[2] The  
----
iter 200
----
 ount, and 29 km (18 mi) at itse inlane s (mo8 m)1(t, wivest byna range of mountains running the length of the island. Tse mountains, Pelineon (1,297 m (4,255 ft)) and Epos (1,188 m (3,898 ft)), are si 
----
iter 300
----
 ovendng fs mlente soumtarn iie te coverong ane Eeos (1,188 m (3,898 ft)), are situated in the nurtesn buntar on (1,188 m (3,858 ft)), are situateidid oi soutt, divedis povesougtans area it opes alann  
----
iter 400
----
 inget on the ioland iss mountain us midnatis, outtaddi5 wi)  ta tar is souat, divided betweeaseamou
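
One detail of the sampling step in the training loop is worth isolating. The softmax output is a float32 array, and its entries may not sum to exactly 1, which NumPy's `choice` can in some cases reject. The sketch below shows one defensive way to sample: cast to float64 and renormalize first. The helper `sample_index` and the array `fake_pred` (a stand-in for the output of `model.predict`) are hypothetical.

```python
import numpy as np

def sample_index(probs, rng):
    """Sketch: draw one index from a probability vector after renormalizing."""
    # flatten the (1, vocab_size) prediction and work in float64
    p = np.asarray(probs, dtype=np.float64).ravel()
    # renormalize so the probabilities sum to 1 as closely as float64 allows
    p = p / p.sum()
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
fake_pred = np.array([[0.1, 0.7, 0.2]], dtype=np.float32)  # stand-in for model.predict output
samples = [sample_index(fake_pred, rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=3)
print(counts / counts.sum())
```

The printed frequencies should be close to the input probabilities, which confirms that characters are drawn in proportion to the model's predictions rather than always taking the most likely one.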