Assignment - 5

# Text Generation using RNNs

In this notebook, we will explore how to build and train a Recurrent Neural Network (RNN) to generate text based on a corpus. We will use a trigram approach for input and output sequence generation.


#### Importing dependencies

In [1]:
import csv
import itertools
import operator
import numpy as np
import nltk
import sys
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Download NLTK model data (you need to do this once)
nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nlt

True


---

### 1. Data Preprocessing

In this section, we preprocess the text data by:
- Removing unnecessary characters and multiple spaces.
- Converting the text to lowercase for consistency.

### Steps:
1. Load the raw text data.
2. Apply regex for cleaning.
3. Tokenize the text into individual words.

```python
# Example Python code for preprocessing


In [3]:
import re
def clean_roman_numerals(text):
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '', text)

In [7]:
import re
from nltk import tokenize

#alphabets= "([A-Za-z])"
#prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
#suffixes = "(Inc|Ltd|Jr|Sr|Co)"
#starters = "(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
#acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
#websites = "[.](com|net|org|io|gov|edu|me)"
#digits = "([0-9])"

# If you want to restrict the size of the voabulary
# Right now, we set it in the next cell to be the entire vocabular: vocabulary_size = len(word_freq.items())
#vocabulary_size = 3000

unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
text = ''
print( "Reading txt file...")
with open(r'data/Mahabharata.txt', 'r') as f:
    text = f.read()

#text = text.replace(",\n","\n")

# too many commas if i do this
#text = text.replace(","," ,")
#text = text.replace(":"," ,")
#text = text.replace(";"," ,")

#.. so i do this instead
text = text.replace(",","")
text = text.replace(":","")
text = text.replace(";","")

# too many apostrophes in shakespeare
text = text.replace("’","")

text = text.replace("?\n",".\n")
text = text.replace("!\n",".\n")
text = text.replace("?","")
text = text.replace("!","")
#text = text.replace("\n"," ")

text = text.replace('I ', 'i ')
text = clean_roman_numerals(text)
#text = text.replace('&', '')

_RE_COMBINE_WHITESPACE = re.compile(r"\s+")
text = _RE_COMBINE_WHITESPACE.sub(" ", text).strip()
print('done!')

Reading txt file...
done!


In [8]:
text = text.lower()
text = text.replace('i ', 'I ')

leftovers = ['ii', 'iii', 'cxi', 'cx', 'cxx', 'xx', 'xxxvi', 'xxxvi', 'xxxv', 'xxxi', 'xxi', 'cvi ', 'ci ', 'xvi ', 'lxi ', 
             'lxv','lxvi', 'lxxi', 'lxxvi', 'lxxvi', 'lxxv', 'lxxxi', 'cxxxi', 'cxxxi', 'cxxx', 'cxli', 'cxlvi', 'cxvl', 
             'cli ', 'cl ', 'cxxxvi','cvi ', 'cv ', 'ci ', 'cx ', 'cxx', 'cxi', 'li ' , 'xxx', 'xxvi', 'xxv', 'cxv', 'xci', 
             'xli', 'lxvi', 'lxi ', ' c ', 'lxxxvi', 'lxxxvi', 'lxxxv', ' v ', 'vi ', ' l ', 'lvi ', 'lv ', 'xlv ', ' x ', 
             'xi ', 'xl ', 'ix ']
for rn in leftovers:
    text = text.replace(rn, '')

text = text.replace('.  ', '. ')

In [9]:
sentences = tokenize.sent_tokenize(text)
for i in range(100, 110):
    print(sentences[i])
    print()

I shall therefore speak to you something.

mark ye.

to dwell with a king is alas difficult.

I shall tell you ye princes how ye may reside in the royal household avoiding every fault.

ye kauravas honourably or otherwise ye will have to pass this year in the king's palace undiscovered by those that know you.

then in the fourteenth year ye will live happy.

o son of pandu in this world that cherisher and protector of all beings the king who is a deity in an embodied form is as a great fire sanctified with all the _mantras_.

[6] one should present himself before the king after having obtained his permission at the gate.

no one should keep contact with royal secrets.

nor should one desire a seat which another may covet.



In [10]:
vocabulary_size = 40000

### 2. Creating Word Mappings
Here, we convert the cleaned text into numerical form by creating two dictionaries:

word_to_index: Maps each word to a unique index.
index_to_word: Reverse mapping to retrieve words from their corresponding indices.
This allows us to prepare the data for model training.
### Example code for word mappings


In [11]:
# Append SENTENCE_START and SENTENCE_END
sentences = ["%s %s %s" % (sentence_start_token, x[:-1].replace("&",""), sentence_end_token) for x in sentences] 
print(  "Parsed %d sentences." % (len(sentences)))

# Tokenize the sentences into words, making sure to remove end-of-sentence period
tokenized_sentences = [nltk.word_tokenize(sent.replace('.', '')) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print(  "Found %d unique words tokens." % len(word_freq.items()))

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
print("The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))

# Replace all words not in our vocabulary with the unknown token
#for i, sent in enumerate(tokenized_sentences):
#    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]
vocabulary_size = len(word_freq.items())
print("Using vocabulary size %d." % vocabulary_size)

print(  "\nExample sentence: '%s'" % sentences[0])
print(  "\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0])

Parsed 2800 sentences.
Found 6399 unique words tokens.
The least frequent word in our vocabulary is 'newsletter' and appeared 1 times.
Using vocabulary size 6399.

Example sentence: 'SENTENCE_START title the mahabharata of krishna-dwaipayana vyasa translated into english prose translator kisarI mohan gangulI release date april 1 2004 [ebook #12058] most recently updated december 14 2020 language english credits produced by john b. hare juliet sutherland david king and the online distributed proofreading team *** start of the project gutenberg ebook the mahabharata of krishna-dwaipayana vyasa translated into english prose *** produced by john b. hare juliet sutherland david king and the online distributed proofreading team the mahabharata of krishna-dwaipayana vyasa book 4 virata parva translated into english prose from the original sanskrit text by kisarI mohan gangulI [1883-1896] the mahabharata virata parva section (_pandava-pravesa parva_) om having bowed down to narayana and nara t

In [12]:
vocab[0:20]

[('the', 4269),
 ('of', 3388),
 ('and', 3277),
 ('SENTENCE_START', 2800),
 ('SENTENCE_END', 2800),
 ('in', 1208),
 ('with', 1073),
 ('to', 1057),
 ('that', 1023),
 ('a', 912),
 ('by', 691),
 ('is', 638),
 ('his', 596),
 ('o', 555),
 ('thou', 502),
 ('I', 467),
 ('this', 429),
 ('as', 428),
 ('king', 396),
 ('on', 391)]

In [13]:
sentences[0:5]

['SENTENCE_START title the mahabharata of krishna-dwaipayana vyasa translated into english prose translator kisarI mohan gangulI release date april 1 2004 [ebook #12058] most recently updated december 14 2020 language english credits produced by john b. hare juliet sutherland david king and the online distributed proofreading team *** start of the project gutenberg ebook the mahabharata of krishna-dwaipayana vyasa translated into english prose *** produced by john b. hare juliet sutherland david king and the online distributed proofreading team the mahabharata of krishna-dwaipayana vyasa book 4 virata parva translated into english prose from the original sanskrit text by kisarI mohan gangulI [1883-1896] the mahabharata virata parva section (_pandava-pravesa parva_) om having bowed down to narayana and nara the most exalted of male beings and also to the goddess saraswatI must the word _jaya_ be uttered SENTENCE_END',
 'SENTENCE_START janamejaya said "how did my great-grandfathers affli

### 3. Preparing Trigrams and Sequences
We now prepare the input sequences (bigrams) and the target word (third word) using trigrams. The process involves:

Creating sequences of n-grams (specifically trigrams).
Mapping each word in the sequence to its index.
### Example code for creating n-grams and sequences


In [14]:
%%time
from collections import Counter
from nltk import ngrams
bigram_counts = Counter(ngrams(text.split(), 2))
bigram_counts.most_common(10)

CPU times: user 22 ms, sys: 1.73 ms, total: 23.7 ms
Wall time: 23.4 ms


[(('of', 'the'), 665),
 (('in', 'the'), 289),
 (('and', 'the'), 258),
 (('son', 'of'), 234),
 (('to', 'the'), 167),
 (('by', 'the'), 166),
 (('the', 'son'), 152),
 (('the', 'king'), 148),
 (('on', 'the'), 144),
 (('with', 'the'), 143)]

In [15]:
%%time
import collections
def ngrams(text, n=2):
    return zip(*[text[i:] for i in range(n)])
bigram_counts = collections.Counter(ngrams(text.split(), 2))
bigram_counts.most_common(10)

CPU times: user 24.1 ms, sys: 1.85 ms, total: 26 ms
Wall time: 29.1 ms


[(('of', 'the'), 665),
 (('in', 'the'), 289),
 (('and', 'the'), 258),
 (('son', 'of'), 234),
 (('to', 'the'), 167),
 (('by', 'the'), 166),
 (('the', 'son'), 152),
 (('the', 'king'), 148),
 (('on', 'the'), 144),
 (('with', 'the'), 143)]

In [16]:
text[0:1000]

'title the mahabharata of krishna-dwaipayana vyasa translated into english prose translator kisarI mohan gangulI release date april 1 2004 [ebook #12058] most recently updated december 14 2020 language english credits produced by john b. hare juliet sutherland david king and the online distributed proofreading team *** start of the project gutenberg ebook the mahabharata of krishna-dwaipayana vyasa translated into english prose *** produced by john b. hare juliet sutherland david king and the online distributed proofreading team the mahabharata of krishna-dwaipayana vyasa book 4 virata parva translated into english prose from the original sanskrit text by kisarI mohan gangulI [1883-1896] the mahabharata virata parva section (_pandava-pravesa parva_) om having bowed down to narayana and nara the most exalted of male beings and also to the goddess saraswatI must the word _jaya_ be uttered. janamejaya said "how did my great-grandfathers afflicted with the fear of duryodhana pass their day

In [17]:
first_word_counts = Counter([ p.replace('. ', '') for p in re.findall('\..[^" "]*', text)])
first_word_counts.most_common(10)

[('and', 914),
 (".'", 243),
 ('I', 112),
 ('the', 97),
 ('o', 70),
 ('it', 59),
 ('thou', 51),
 ('let', 46),
 ('do', 41),
 ('."', 37)]

In [18]:
#X_train = [[sentence_start_token] for sent,times in first_word_counts if sent != 'o.']
#y_train = [sent for sent in first_word_counts if sent != 'o.']
X_train = [[sentence_start_token]*c for sent,c in first_word_counts.items() if sent != 'o.']
y_train = [[sent]*c for sent,c in first_word_counts.items() if sent != 'o.']

In [19]:
X_train = [item for sublist in X_train for item in sublist]
y_train = [item for sublist in y_train for item in sublist]

In [20]:
X_train[0:10]

['SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START']

In [21]:
print(y_train)

['hare', 'hare', 'janamejaya', '._', '._', '._', '._', 'having', 'having', 'having', 'having', 'having', 'having', 'having', 'having', 'having', 'having', 'having', 'having', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and',

In [22]:
len(X_train), len(y_train)

(2828, 2828)

In [23]:
import random

def fisher_yates (arr1, arr2):
     
    # We will Start from the last element
    # and swap one by one.
    n = len(arr1)
    if n != len(arr2):
        return None
    
    for i in range(n - 1, 0, -1):

        # Pick a random index from 0 to i
        j = random.randint(0, i)
        #print(i, j)

        # Swap arr[i] with the element at random index
        arr1[i], arr1[j] = arr1[j], arr1[i]
        arr2[i], arr2[j] = arr2[j], arr2[i]
        
    return arr1, arr2

In [24]:
import random as rd
one = ['a', 'b', 'c']
two = ['1', '2', '3']
one, two = fisher_yates(one, two)
one, two

(['a', 'b', 'c'], ['1', '2', '3'])

In [25]:
one = [['a'], ['b'], ['c']]
two = [['1'], ['2'], ['3']]
one, two = fisher_yates(one, two)
one, two

([['b'], ['c'], ['a']], [['2'], ['3'], ['1']])

In [26]:
X_train, y_train = fisher_yates(X_train, y_train)
len(X_train), len(y_train)

(2828, 2828)

In [27]:
X_tokens = [[word_to_index[symbol]] for symbol,word in zip(X_train, y_train) if word in word_to_index]
y_tokens = [[word_to_index[word]] for symbol,word in zip(X_train, y_train) if word in word_to_index]

In [28]:
X_train = X_tokens
y_train = y_tokens

In [29]:
len(X_train), len(y_train)

(2424, 2424)

In [30]:
X_train[0:5], y_train[0:5]

([[3], [3], [3], [3], [3]], [[15], [2], [2], [2], [2]])

In [31]:
ngrams_up_to_20 = []
for i in range(2, 21):
    ngram_counts = Counter(ngrams(text.split(), i))
    print('ngram-', i, 'length:', len(ngram_counts))
    ngrams_up_to_20.append(ngram_counts)

ngram- 2 length: 36579
ngram- 3 length: 56161
ngram- 4 length: 62475
ngram- 5 length: 64577
ngram- 6 length: 65318
ngram- 7 length: 65561
ngram- 8 length: 65656
ngram- 9 length: 65698
ngram- 10 length: 65723
ngram- 11 length: 65736
ngram- 12 length: 65747
ngram- 13 length: 65755
ngram- 14 length: 65763
ngram- 15 length: 65771
ngram- 16 length: 65779
ngram- 17 length: 65784
ngram- 18 length: 65786
ngram- 19 length: 65787
ngram- 20 length: 65788


In [32]:
def remove_periods(ngram):
    for wrd in ngram[0]:
        if '.' in wrd or "’" in wrd or "‘" in wrd:
            return False
    return True
    
def my_filter(ngrams):
    return filter(remove_periods, ngrams)

In [33]:
l = list(filter(lambda x: 1 < int(x[1]), ngrams_up_to_20[0].most_common()))
len(l), l

(7793,
 [(('of', 'the'), 665),
  (('in', 'the'), 289),
  (('and', 'the'), 258),
  (('son', 'of'), 234),
  (('to', 'the'), 167),
  (('by', 'the'), 166),
  (('the', 'son'), 152),
  (('the', 'king'), 148),
  (('on', 'the'), 144),
  (('with', 'the'), 143),
  (('of', 'a'), 113),
  (('of', 'his'), 94),
  (('foremost', 'of'), 85),
  (('like', 'a'), 85),
  (('of', 'pandu'), 83),
  (('began', 'to'), 83),
  (('vaisampayana', 'continued'), 82),
  (('from', 'the'), 76),
  (('it', 'is'), 76),
  (('o', 'king'), 75),
  (('sons', 'of'), 75),
  (('endued', 'with'), 72),
  (('the', 'kurus'), 69),
  (('as', 'the'), 69),
  (('king', 'of'), 67),
  (('I', 'am'), 67),
  (('to', 'be'), 67),
  (('all', 'the'), 65),
  (('o', 'thou'), 61),
  (('I', 'shall'), 59),
  (('in', 'a'), 59),
  (('and', 'having'), 58),
  (('at', 'the'), 58),
  (('and', 'beholding'), 57),
  (('of', 'all'), 56),
  (('with', 'a'), 56),
  (('like', 'the'), 55),
  (('project', 'gutenberg™'), 55),
  (('with', 'his'), 53),
  (('in', 'battle'), 

In [34]:
def my_filter(ngrams):
    return filter(remove_periods, list(filter(lambda x: 1 < int(x[1]), ngrams)))

In [35]:
bigrams_to_learn = ngrams_up_to_20[0]
X_train_example = [[word_to_index[sent[0][0]]] for sent in my_filter(bigrams_to_learn.most_common())
                  if sent[0][0] in word_to_index and sent[0][1] in word_to_index]
y_train_example = [[word_to_index[sent[0][1]]] for sent in my_filter(bigrams_to_learn.most_common())
                  if sent[0][0] in word_to_index and sent[0][1] in word_to_index]

In [36]:
X_train_example[0:10], y_train_example[0:10]

([[1], [5], [2], [20], [7], [10], [0], [0], [19], [6]],
 [[0], [0], [0], [1], [0], [0], [20], [18], [0], [0]])

In [37]:
len(X_train_example), len(y_train_example)

(7124, 7124)

In [38]:
trigrams_to_learn = ngrams_up_to_20[1].copy()
[sent[0] for sent in my_filter(trigrams_to_learn.most_common())]

[('the', 'son', 'of'),
 ('the', 'sons', 'of'),
 ('the', 'king', 'of'),
 ('king', 'of', 'the'),
 ('o', 'thou', 'of'),
 ('son', 'of', 'pandu'),
 ('section', 'vaisampayana', 'said'),
 ('by', 'means', 'of'),
 ('of', 'the', 'matsyas'),
 ('sons', 'of', 'pandu'),
 ('son', 'of', 'kuntI'),
 ('that', 'foremost', 'of'),
 ('son', 'of', 'pritha'),
 ('the', 'midst', 'of'),
 ('the', 'field', 'of'),
 ('of', 'the', 'kuru'),
 ('in', 'the', 'midst'),
 ('on', 'the', 'field'),
 ('endued', 'with', 'great'),
 ('field', 'of', 'battle'),
 ('the', 'kuru', 'race'),
 ('of', 'the', 'kurus'),
 ('son', 'of', 'virata'),
 ('the', 'city', 'of'),
 ('these', 'words', 'of'),
 ('the', 'foremost', 'of'),
 ('foremost', 'of', 'all'),
 ('bull', 'among', 'men'),
 ('I', 'do', 'not'),
 ('the', 'project', 'gutenberg'),
 ('that', 'best', 'of'),
 ('o', 'son', 'of'),
 ('that', 'slayer', 'of'),
 ('project', 'gutenberg™', 'electronic'),
 ('of', 'the', 'king'),
 ('like', 'unto', 'a'),
 ('on', 'the', 'ground'),
 ('and', 'the', 'son'),
 (

In [39]:
X_train_example.extend([[word_to_index[w] for w in sent[0][:-1]] for sent in my_filter(trigrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])])
y_train_example.extend([[word_to_index[w] for w in sent[0][1:]] for sent in my_filter(trigrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])])

In [40]:
len(X_train_example), len(y_train_example)

(11267, 11267)

In [41]:
X_train_example[1575:1585], y_train_example[1575:1585]

([[516], [601], [59], [489], [19], [0], [12], [16], [165], [57]],
 [[1], [6], [46], [0], [8], [747], [271], [55], [0], [12]])

In [42]:
bigrams_to_learn = ngrams_up_to_20[0]
X_train_2 = [[word_to_index[sent[0][0]]] for sent in my_filter(bigrams_to_learn.most_common())
                  if sent[0][0] in word_to_index and sent[0][1] in word_to_index]
y_train_2 = [[word_to_index[sent[0][1]]] for sent in my_filter(bigrams_to_learn.most_common())
                  if sent[0][0] in word_to_index and sent[0][1] in word_to_index]
X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)

In [43]:
len(X_train_2), len(y_train_2)

(7124, 7124)

In [44]:
X_train_2[0:10], y_train_2[0:10]

([[5], [2], [1], [2415], [6], [2252], [42], [862], [11], [54]],
 [[402], [257], [571], [6], [556], [5], [676], [9], [70], [5]])

In [45]:
X_train.extend(X_train_2)
y_train.extend(y_train_2)

In [46]:
len(X_train), len(y_train)

(9548, 9548)

In [47]:
random.sample(list(zip(X_train, y_train)), 10)

[([645], [36]),
 ([16], [646]),
 ([3], [2]),
 ([0], [2646]),
 ([3], [733]),
 ([8], [289]),
 ([3], [2]),
 ([34], [225]),
 ([2], [230]),
 ([3], [2])]

In [58]:
ngrams_to_learn = ngrams_up_to_20[1]
ngrams_to_learn.most_common(10)

[(('the', 'son', 'of'), 151),
 (('the', 'sons', 'of'), 53),
 (('the', 'king', 'of'), 50),
 (('king', 'of', 'the'), 48),
 (('o', 'thou', 'of'), 47),
 (('son', 'of', 'pandu'), 45),
 (('section', 'vaisampayana', 'said'), 42),
 (('by', 'means', 'of'), 42),
 (('of', 'the', 'matsyas'), 41),
 (('sons', 'of', 'pandu'), 38)]

In [59]:
[sent[0] for sent in my_filter(ngrams_to_learn.most_common(10))]

[('the', 'son', 'of'),
 ('the', 'sons', 'of'),
 ('the', 'king', 'of'),
 ('king', 'of', 'the'),
 ('o', 'thou', 'of'),
 ('son', 'of', 'pandu'),
 ('section', 'vaisampayana', 'said'),
 ('by', 'means', 'of'),
 ('of', 'the', 'matsyas'),
 ('sons', 'of', 'pandu')]

In [60]:
X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)
X_train_2[0:5], y_train_2[0:5], len(X_train_2), len(y_train_2)

([[459, 271], [2, 425], [8, 1474], [656, 154], [63, 93]],
 [[271, 1], [425, 1], [1474, 1], [154, 12], [93, 28]],
 45271,
 45271)

In [61]:
def my_filter(ngrams):
    return filter(remove_periods, ngrams)

In [62]:
X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
X_train_2 = X_train_2[:2000]
y_train_2 = y_train_2[:2000]
X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)
X_train_2[0:5], y_train_2[0:5], len(X_train_2), len(y_train_2)

([[17, 9], [13, 93], [0, 930], [8, 1], [14, 1]],
 [[9, 223], [93, 8], [930, 626], [1, 9], [1, 0]],
 2000,
 2000)

In [63]:
ngrams_to_learn = ngrams_up_to_20[1]
X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
print(X_train_2[0:5], y_train_2[0:5], len(X_train_2), len(y_train_2))

[[0, 20], [0, 120], [0, 18], [18, 1], [13, 14]] [[20, 1], [120, 1], [18, 1], [1, 0], [14, 1]] 45271 45271


In [64]:
word_to_index['SENTENCE_END']

4

In [65]:
def check_eos(trigram):
    if trigram[1] == word_to_index['SENTENCE_END']:
          return True  
    return False

trigrams_eos = list(filter(check_eos, y_train_2))
len(trigrams_eos), trigrams_eos[0:5]

(0, [])

In [66]:
from tqdm import tqdm
for i in tqdm(range(1, len(ngrams_up_to_20))):
    ngrams_to_learn = ngrams_up_to_20[i]
    X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in my_filter(ngrams_to_learn.most_common())
                   if all([w in word_to_index for w in sent[0]])]
    y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in my_filter(ngrams_to_learn.most_common())
                   if all([w in word_to_index for w in sent[0]])]
    X_train_2 = X_train_2[:2000]
    y_train_2 = y_train_2[:2000]
    X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)
    X_train.extend(X_train_2)
    y_train.extend(y_train_2)

100%|███████████████████████████████████████████| 18/18 [00:05<00:00,  3.08it/s]


In [67]:
len(X_train), len(y_train)

(81548, 81548)

In [68]:
print(random.sample(list(zip(X_train, y_train)), 10))

[([259, 1, 156, 452, 33, 365, 757, 5, 0, 129, 1, 48, 2], [1, 156, 452, 33, 365, 757, 5, 0, 129, 1, 48, 2, 13]), ([5, 324, 2, 1566, 1, 709, 15], [324, 2, 1566, 1, 709, 15, 86]), ([75, 674, 0, 18, 3533], [674, 0, 18, 3533, 34]), ([1278], [2]), ([0, 18, 15, 86, 306, 8, 15], [18, 15, 86, 306, 8, 15, 1957]), ([0, 722, 844, 41, 49, 672, 0, 260, 1351, 22, 530, 1618, 855, 25, 721, 1333, 1, 0, 18], [722, 844, 41, 49, 672, 0, 260, 1351, 22, 530, 1618, 855, 25, 721, 1333, 1, 0, 18, 3655]), ([452, 33, 365, 757, 5, 0, 129, 1, 48], [33, 365, 757, 5, 0, 129, 1, 48, 2]), ([5, 0, 129, 1, 48], [0, 129, 1, 48, 14]), ([1549, 1550, 3492, 1133, 48, 2454, 1551, 158, 1297, 1552, 36, 0, 1554, 1555], [1550, 3492, 1133, 48, 2454, 1551, 158, 1297, 1552, 36, 0, 1554, 1555, 1134]), ([112, 12, 594, 1391, 92, 1393, 11, 8, 514, 8, 200, 97, 66, 6, 1069], [12, 594, 1391, 92, 1393, 11, 8, 514, 8, 200, 97, 66, 6, 1069, 965])]


In [69]:
len(tokenized_sentences)

2800

In [70]:
tokenized_sentences[100]

['SENTENCE_START',
 'I',
 'shall',
 'therefore',
 'speak',
 'to',
 'you',
 'something',
 'SENTENCE_END']

In [71]:
[[word_to_index[w] for w in sent] for sent in tokenized_sentences if all([w in word_to_index for w in sent])][100]

[3, 15, 86, 130, 721, 7, 95, 1965, 4]

In [72]:
X_train_full_sentences = [[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences
                         if all([w in word_to_index for w in sent])]
y_train_full_sentences = [[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences
                         if all([w in word_to_index for w in sent])]

In [73]:
print(X_train_full_sentences[0:5], y_train_full_sentences[0:5])

[[3, 3480, 0, 1296, 1, 1549, 1550, 1551, 158, 1297, 1552, 3481, 2443, 2444, 2445, 3482, 1903, 3483, 917, 3484, 79, 835, 3485, 3486, 80, 500, 3487, 2446, 3488, 1553, 3489, 3490, 1297, 3491, 1132, 10, 2447, 1904, 2448, 2449, 2450, 2451, 18, 2, 0, 1298, 918, 2452, 2453, 669, 669, 669, 1905, 1, 0, 110, 320, 835, 0, 1296, 1, 1549, 1550, 1551, 158, 1297, 1552, 669, 669, 669, 1132, 10, 2447, 1904, 2448, 2449, 2450, 2451, 18, 2, 0, 1298, 918, 2452, 2453, 0, 1296, 1, 1549, 1550, 3492, 1133, 48, 2454, 1551, 158, 1297, 1552, 36, 0, 1554, 1555, 1134, 10, 2443, 2444, 2445, 79, 3493, 80, 0, 1296, 48, 2454, 117, 52, 3494, 1299, 53, 3495, 49, 1906, 161, 7, 1907, 2, 3496, 0, 500, 1300, 1, 1556, 756, 2, 51, 7, 0, 451, 3497, 292, 0, 708, 1908, 24, 1135], [3, 1557, 28, 37, 282, 521, 31, 3498, 208, 6, 0, 259, 1, 156, 452, 33, 365, 757, 5, 0, 129, 1, 48, 2, 13, 3499, 282, 521, 0, 427, 614, 189, 1136, 6, 670, 453, 7, 63, 547, 2, 349, 3500, 0, 1010, 79, 917, 80, 1011, 63, 365, 1909, 253, 79, 917, 80, 3501, 35

In [74]:
import random
last_n_words = []
for i in range(3, 20):
    tokenized_sentences_400 = random.sample(list(tokenized_sentences), 400)
    for s in tokenized_sentences_400:
        last_n_words.append(s[::-1][:i][::-1])

print(random.sample(last_n_words, 10))

[['SENTENCE_START', 'and', 'for', 'thee', 'all', 'my', 'doors', 'shall', 'be', 'open', 'SENTENCE_END'], [']', 'some', 'texts', 'read', '_diptasya_', 'for', '_diptayam_', 'SENTENCE_END'], ['purpose', 'such', 'as', 'creation', 'of', 'derivative', 'works', 'reports', 'performances', 'and', 'research', 'SENTENCE_END'], ['SENTENCE_START', 'let', 'every', 'preparation', 'therefore', 'for', 'battle', 'be', 'made', 'without', 'delay', 'SENTENCE_END'], ['used', 'in', 'the', 'same', 'sense', 'SENTENCE_END'], ['roar', 'also', 'of', 'many', 'elephants', 'in', 'the', 'midst', 'of', 'ranks', 'arrayed', 'for', 'battled', 'SENTENCE_END'], ['to', 'the', 'king', 'SENTENCE_END'], ['princess', 'kaikeyI', 'looking', 'on', 'then', 'I', 'almost', 'swoon', 'away', 'SENTENCE_END'], ['been', 'slain', 'by', 'the', 'gandharvas', 'SENTENCE_END'], ['hath', 'come', 'for', 'worshipping', 'the', 'illustrious', 'sons', 'of', 'pandu', 'who', 'deserve', 'to', 'be', 'worshipped', 'by', 'us', 'SENTENCE_END']]


In [75]:
len(last_n_words)

6800

In [76]:
X_train_eos = [[word_to_index[w] for w in sent[:-1]] for sent in last_n_words
                         if all([w in word_to_index for w in sent])]
y_train_eos = [[word_to_index[w] for w in sent[1:]] for sent in last_n_words
                         if all([w in word_to_index for w in sent])]

In [77]:
len(X_train_eos), len(y_train_eos)

(6800, 6800)

In [78]:
X_train.extend(X_train_eos)
y_train.extend(y_train_eos)

In [79]:
len(X_train), len(y_train)

(88348, 88348)

In [80]:
import pickle
with open('data/X_train_siddhartha.pkl', 'wb') as file:
    pickle.dump(X_train, file)

In [81]:
with open('data/y_train_siddhartha.pkl', 'wb') as file:
    pickle.dump(y_train, file)

In [82]:
with open('data/tokenized_sentences_siddhartha.pkl', 'wb') as file:
    pickle.dump(tokenized_sentences, file)

In [83]:
with open('data/word_to_index_siddhartha.pkl', 'wb') as file:
    pickle.dump(word_to_index, file)

In [84]:
with open('data/index_to_word_siddhartha.pkl', 'wb') as file:
    pickle.dump(index_to_word, file)

In [85]:
X_train2 = np.asarray(X_train,dtype=object)
y_train2 = np.asarray(y_train,dtype=object)

In [86]:
X_train2.shape, y_train2.shape

((88348,), (88348,))

In [87]:
print(random.sample(list(zip(X_train2, y_train2)), 10))

[([2, 101, 342, 183, 29, 46, 1141, 6], [101, 342, 183, 29, 46, 1141, 6, 168]), ([12, 2640, 1197, 486, 108, 7, 0, 20, 1, 222, 2, 2641, 8, 838, 274, 1, 74, 1, 475], [2640, 1197, 486, 108, 7, 0, 20, 1, 222, 2, 2641, 8, 838, 274, 1, 74, 1, 475, 0]), ([3], [2]), ([0, 18, 2, 3535], [18, 2, 3535, 21]), ([1911, 228, 919, 502, 2, 430, 2462, 22], [228, 919, 502, 2, 430, 2462, 22, 134]), ([22, 8, 11, 0, 185, 945, 22, 1346, 1979], [8, 11, 0, 185, 945, 22, 1346, 1979, 2]), ([7], [1209]), ([112, 68, 92, 3509, 36, 134], [68, 92, 3509, 36, 134, 235]), ([3734, 1, 3735, 29, 78, 2575, 2576, 725, 115, 2, 218, 116, 487, 87, 46, 26, 57], [1, 3735, 29, 78, 2575, 2576, 725, 115, 2, 218, 116, 487, 87, 46, 26, 57, 0]), ([0, 438, 131, 65, 419, 1, 1034, 3640, 7, 50, 116, 1350, 11, 61, 3641, 2], [438, 131, 65, 419, 1, 1034, 3640, 7, 50, 116, 1350, 11, 61, 3641, 2, 116])]


In [88]:
embedding_dim = 100
vocabulary_size, embedding_dim

(6399, 100)

In [89]:
import os
import numpy as np

#glove_dir = 'data/glove'
glove_dir = "data"

embeddings_index = {} #initialize dictionary
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf8')
try:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
except:
    print(line)
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [90]:
vocabulary_size

6399

In [91]:
embedding_dim = 100

embedding_matrix = np.zeros((vocabulary_size, embedding_dim))
for word, i in vocab:
    embedding_vector = embeddings_index.get(word)
    if i < vocabulary_size:
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

In [92]:
embedding_matrix.shape

(6399, 100)

In [93]:
vocab[200]

('bull', 48)

In [94]:
embedding_matrix[12]

array([ 2.00619996e-01, -8.98370028e-01,  2.54619986e-01, -7.58920014e-01,
        9.65050012e-02, -3.05970013e-01, -7.64230013e-01,  2.65679993e-02,
       -4.19140011e-01, -5.43119982e-02,  3.83730009e-02,  8.51670027e-01,
        5.33879995e-01,  2.73490012e-01, -1.12999998e-01,  1.20609999e-02,
        9.45160016e-02,  2.46339999e-02,  4.66340005e-01, -7.58130014e-01,
       -2.07120001e-01, -1.10250004e-01, -1.20290004e-01, -5.13180017e-01,
       -9.92090032e-02, -4.33939993e-01, -3.66420001e-01,  6.38860017e-02,
        5.58220029e-01, -1.75260007e-01,  2.27789998e-01,  1.84090003e-01,
       -5.54630011e-02, -6.50359988e-01, -1.28410006e+00,  3.77029985e-01,
       -5.21790028e-01, -1.59470007e-01, -5.88270009e-01, -4.60640013e-01,
       -1.70790002e-01, -3.26260000e-01,  1.11259997e+00, -2.29320005e-01,
       -5.94309986e-01,  2.97919989e-01,  8.04319978e-03,  1.69469997e-01,
        1.71079993e-01, -6.59990013e-02, -6.96300030e-01, -3.22090000e-01,
       -4.24439996e-01,  

In [95]:
from scipy import spatial

def find_closest_embeddings(embedding):
    return sorted(embeddings_index.keys(), key=lambda word: spatial.distance.euclidean(embeddings_index[word], embedding))

In [96]:
find_closest_embeddings(embeddings_index["king"])[1:6]

['prince', 'queen', 'monarch', 'brother', 'uncle']

In [97]:
print(find_closest_embeddings(
    embeddings_index["twig"] - embeddings_index["branch"] + embeddings_index["hand"]
)[:10])

['flashlight', 'twig', 'clipboard', 'shove', 'hand', 'fingers', 'clutching', 'clutched', 'tossing', 'stroking']


In [98]:
# from sklearn.manifold import TSNE
# tsne = TSNE(n_components=2, random_state=0)

In [99]:
# words =  list(embeddings_index.keys())[:500]
# vectors = [embeddings_index[word] for word in words]

In [100]:
# Y = tsne.fit_transform(vectors)

In [101]:
vocabulary_size, embedding_dim

(6399, 100)

### 4. Model Architecture
We define the architecture of our RNN model:

Embedding Layer: Maps input indices to dense vectors of fixed size.
SimpleRNN Layer: A simple recurrent layer that learns dependencies from the sequences.
Dense Layer: Outputs the predicted word by applying a softmax over the vocabulary.

In [102]:
class RNN:    
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        
        # Randomly initialize the network parameters
        #self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        #self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, embedding_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        
        # Set GLOVE embeddings matrix
        self.G = embedding_matrix

In [103]:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    # sometimes, may want to do this first:
    #x = np.vectorize(round)(x)
    
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

In [104]:
def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    
    # During forward propagation we save all hidden states in s because need them later.
    # We add one additional element for the initial hidden, which we set to 0
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    
    # For each time step...
    for t in np.arange(T):
        # embedding of x[t]:
        e_t = self.G[x[t]]
                             
        # Note that we are indxing U by x[t]. This is the same as multiplying U with a one-hot vector.
        #s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        s[t] = np.tanh(self.U.dot(e_t) + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
        
    return [o, s]

RNN.forward_propagation = forward_propagation

In [105]:
word_dim = vocabulary_size
hidden_dim = 100
embedding_dim = 100
U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, embedding_dim))
W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
x = np.random.randint(0, high=3000, size=word_dim)
T = len(x)
s = np.zeros((T + 1, hidden_dim))
s_m1 = np.zeros(hidden_dim)
o = np.zeros((T, word_dim))
e_0 = embedding_matrix[x[0]]
s_0 = np.tanh(U.dot(e_0) + W.dot(s_m1))
print(s_0.shape, V.shape)
o_0 = softmax(V.dot(s_0))
o_0.shape, o_0

(100,) (6399, 100)


((6399,),
 array([0.00015627, 0.00015627, 0.00015627, ..., 0.00015627, 0.00015627,
        0.00015627]))

In [106]:
def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o[-1], axis=1)

RNN.predict = predict

In [107]:
def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)

RNN.predict = predict

In [108]:
print ("x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train2[1000]]), X_train2[1000]))

x:
SENTENCE_START
[3]


In [109]:
print ("x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train2[20000]]), X_train2[20000]))

x:
them yudhishthira restored to that regenerate brahmana
[68, 74, 2456, 7, 8, 1303, 578]


In [110]:
print ("x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train2[30000]]), X_train2[30000]))

x:
brilliant rings on my ears and conch-bangles on my wrists and causing
[1936, 2487, 19, 31, 931, 2, 3562, 19, 31, 1937, 2, 2488]


In [111]:
vocabulary_size, X_train2[10000]

(6399, [7, 0])

In [112]:
np.random.seed(17)
model = RNN(vocabulary_size)
o, s = model.forward_propagation(X_train2[10000])
print (o.shape, o)

(2, 6399) [[0.00015685 0.00015394 0.0001592  ... 0.00015907 0.00016397 0.00015099]
 [0.00015478 0.00016109 0.00015791 ... 0.00015768 0.00015554 0.00015456]]


In [113]:
np.argmax(o[-1], axis=0)

2139

In [114]:
predictions = model.predict(X_train2[40000])
print(predictions.shape, predictions)

(17,) [ 500 4074 1288  282 2685 4301 4536 5921 6094 2489  629  930 4536  484
 6388 5126  484]


In [115]:
print ("x:\n%s" % (" ".join([index_to_word[x] for x in predictions])))

x:
most fanned saradwat how conch-bracelets speaketh grappled runnest lotus-leaves appear texts neuter grappled force forty 'indra force


In [116]:
def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples
    N = np.sum((len(y_i) for y_i in y))
    return self.calculate_total_loss(x,y)/N

RNN.calculate_total_loss = calculate_total_loss
RNN.calculate_loss = calculate_loss

In [117]:
# Limit to 1000 examples to save time
print ("Expected Loss for random predictions: %f" % np.log(vocabulary_size))
print ("Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000]))

Expected Loss for random predictions: 8.763897


  N = np.sum((len(y_i) for y_i in y))


Actual loss: 8.759229


In [118]:
def bptt(self, x, y):
    T = len(y)
    
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            dLdW += np.outer(delta_t, s[bptt_step-1])              
            #dLdU[:,x[bptt_step]] += delta_t
            dLdU += np.outer(delta_t, self.G[x[bptt_step]]) 
            
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
            
    return [dLdU, dLdV, dLdW]

RNN.bptt = bptt

In [119]:
def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
    
    # Calculate the gradients using backpropagation. We want to checker if these are correct.
    bptt_gradients = model.bptt(x, y)
    
    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']
    
    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):
        
        # Get the actual parameter value from the mode, e.g. model.W
        parameter = operator.attrgetter(pname)(self)
        print("Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape)))
               
        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
               
            # Save the original value so we can reset it later
            original_value = parameter[ix]
               
            # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
            parameter[ix] = original_value + h
            gradplus = model.calculate_total_loss([x],[y])
            parameter[ix] = original_value - h
            gradminus = model.calculate_total_loss([x],[y])
            estimated_gradient = (gradplus - gradminus)/(2*h)
               
            # Reset parameter to original value
            parameter[ix] = original_value
               
            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]
               
            # calculate The relative error: (|x - y|/(|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient) / (
                                np.abs(backprop_gradient) + np.abs(estimated_gradient))
            
               # If the error is to large fail the gradient check
            if relative_error > error_threshold:
                print( "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix))
                print( "+h Loss: %f" % gradplus)
                print( "-h Loss: %f" % gradminus)
                print( "Estimated_gradient: %f" % estimated_gradient)
                print( "Backpropagation gradient: %f" % backprop_gradient)
                print( "Relative Error: %f" % relative_error)
                return 
            it.iternext()
               
        print( "Gradient check for parameter %s passed." % (pname))

RNN.gradient_check = gradient_check

In [120]:
grad_check_vocab_size = 100
np.random.seed(10)
model = RNN(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0,1,2,3], [1,2,3,4])

Performing gradient check for parameter U with size 1000.
Gradient check for parameter U passed.
Performing gradient check for parameter V with size 1000.
Gradient check for parameter V passed.
Performing gradient check for parameter W with size 100.
Gradient check for parameter W passed.


In [121]:
# Performs one step of SGD.
def numpy_sdg_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)
    
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW

RNN.sgd_step = numpy_sdg_step

In [127]:
# Outer SGD Loop
# - model: The RNN model instance
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs

def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    
    for epoch in range(nepoch):
        
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print ("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))
            
            # Adjust the learning rate if loss increases
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5  
                print ("Setting learning rate to %f" % learning_rate)
            sys.stdout.flush()
            
        # For each training example...
        for i in range(len(y_train)):
            
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1

In [128]:
vocabulary_size

6399

In [129]:
np.random.seed(17)
model = RNN(vocabulary_size)
%timeit model.sgd_step(X_train2[10000], y_train2[10000], 0.005)

261 ms ± 38.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [130]:
np.random.seed(17)

# Train on a small subset of the data to see what happens
model = RNN(vocabulary_size)
losses = train_with_sgd(model, X_train2[10000:10100], y_train2[10000:10100], nepoch=10, evaluate_loss_after=1)

  N = np.sum((len(y_i) for y_i in y))


KeyboardInterrupt: 

In [131]:
len(index_to_word)

6400

In [132]:
def generate_sentence(model, senten_max_length):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    
    # Repeat until we get an end token and keep our sentences to less than senten_max_length words for now
    while (not new_sentence[-1] == word_to_index[sentence_end_token]) and len(new_sentence) < senten_max_length:
        next_word_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            
            # correcting for abnormalities
            #abs_v = [-i if i <0 else i for i in next_word_probs[-1][0]] 
            #nrm_v = [i/sum(abs_v) for i in abs_v] 
            #abs_v = [0 if i <0 else i for i in next_word_probs[-1][0]] 
            #abs_v = [0 if i <0 else i for i in next_word_probs[0][-1]] 
            #nrm_v = [i/sum(abs_v) for i in abs_v] 
            #samples = np.random.multinomial(1, nrm_v)
            #sampled_word = np.argmax(samples)
            
            # the secret sauce of creativity
            samples = np.random.multinomial(1, next_word_probs[0][-1])
            
            sampled_word = np.argmax(samples)
            
        new_sentence.append(sampled_word)

    print(new_sentence)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    #print(sentence_str)
    return sentence_str

In [133]:
senten_max_length = 20
generate_sentence(model, senten_max_length)

[3, 4941, 3294, 3351, 1818, 199, 3030, 5255, 5876, 404, 3871, 127, 5880, 4060, 2493, 2101, 1889, 4103, 3970, 5128]


['eclipse',
 'uncle',
 'maghavat',
 'hoisted',
 'pierced',
 'adhering',
 '_meghapushpa_',
 'raining',
 'night',
 'swan',
 'out',
 'repelling',
 'rati',
 'enquire',
 'influence',
 'support',
 'fornication',
 'belt']

In [134]:
num_sentences = 10
senten_min_length = 7
senten_max_length = 20

for i in range(num_sentences):
    sent = generate_sentence(model, senten_max_length)
    print (" ".join(sent))

[3, 6250, 382, 2254, 5450, 2696, 3814, 2245, 6304, 6028, 5687, 4670, 3166, 4315, 5205, 517, 5970, 3450, 2083, 1891]
corrupt kind concerning drag horse-lore arranged sending alteration praising redolent induce jumped she-elephants 'tell old enhances plain vritra
[3, 2019, 1432, 3398, 2805, 429, 2920, 3907, 499, 4844, 5103, 3271, 4967, 1164, 1198, 2514, 4940, 2316, 2584, 1445]
hours adopt juices lamentations soul favourites common agreement gashed 'you extol truthfulness break likest 6 obstruct _agni_ observed
[3, 855, 2475, 2918, 1795, 3437, 6358, 438, 2853, 4852, 1842, 3776, 5719, 2885, 1638, 1701, 5905, 3572, 247, 1312]
does vasukI fetch upraised intellectual regulating wise perils deprive chitrasena _vijaya_ highly-trained mother-in-law durga fragrance gushed horses_ covered
[3, 2991, 2475, 1054, 1750, 1564, 2407, 3183, 4794, 1888, 4488, 4863, 2174, 2762, 2097, 242, 2903, 6051, 3155, 994]
comes vasukI attired whither promise venerable 41 marshalling collection kaliya nakha-naki desti

### 5. Model Training
In this section, we train the model using categorical cross-entropy loss and the Adam optimizer. The goal is to minimize the loss over multiple epochs and improve the accuracy.

Training Details:
Loss function: Categorical Cross-Entropy
Optimizer: Adam
Batch Size: 32


In [None]:
np.random.seed(17)

# Train on a small subset of the data to see what happens
model = RNN(vocabulary_size)
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

In [135]:
import time
for i in range(5):
    time.sleep(15*60)
    losses = train_with_sgd(model, X_train2, y_train2, nepoch=20, evaluate_loss_after=1)

  N = np.sum((len(y_i) for y_i in y))


2024-10-19 00:23:57: Loss after num_examples_seen=0 epoch=0: 8.763614
2024-10-19 03:42:55: Loss after num_examples_seen=88348 epoch=1: 5.576958
2024-10-19 05:48:59: Loss after num_examples_seen=176696 epoch=2: 5.447429
2024-10-19 09:08:09: Loss after num_examples_seen=265044 epoch=3: 5.543287
Setting learning rate to 0.002500
2024-10-19 10:04:14: Loss after num_examples_seen=353392 epoch=4: 5.296645
2024-10-19 11:26:37: Loss after num_examples_seen=441740 epoch=5: 5.293215
2024-10-19 12:21:12: Loss after num_examples_seen=530088 epoch=6: 5.382714
Setting learning rate to 0.001250
2024-10-19 13:49:38: Loss after num_examples_seen=618436 epoch=7: 5.237830
2024-10-19 14:32:15: Loss after num_examples_seen=706784 epoch=8: 5.191588
2024-10-19 15:17:30: Loss after num_examples_seen=795132 epoch=9: 5.239074
Setting learning rate to 0.000625
2024-10-19 16:03:59: Loss after num_examples_seen=883480 epoch=10: 5.151594
2024-10-19 17:47:33: Loss after num_examples_seen=971828 epoch=11: 5.139855
20

KeyboardInterrupt: 

In [231]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

  N = np.sum((len(y_i) for y_i in y))


2024-10-16 09:55:45: Loss after num_examples_seen=0 epoch=0: 5.152209
2024-10-16 10:12:25: Loss after num_examples_seen=48950 epoch=1: 5.154802
Setting learning rate to 0.002500
2024-10-16 10:30:41: Loss after num_examples_seen=97900 epoch=2: 4.991009
2024-10-16 10:48:20: Loss after num_examples_seen=146850 epoch=3: 4.963550
2024-10-16 11:08:07: Loss after num_examples_seen=195800 epoch=4: 4.919193


In [232]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

  N = np.sum((len(y_i) for y_i in y))


2024-10-16 11:24:53: Loss after num_examples_seen=0 epoch=0: 4.936287
2024-10-16 11:39:36: Loss after num_examples_seen=48950 epoch=1: 5.093147
Setting learning rate to 0.002500
2024-10-16 12:35:15: Loss after num_examples_seen=97900 epoch=2: 4.892396
2024-10-16 12:49:06: Loss after num_examples_seen=146850 epoch=3: 4.876551
2024-10-16 13:05:37: Loss after num_examples_seen=195800 epoch=4: 4.888263
Setting learning rate to 0.001250


In [None]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

  N = np.sum((len(y_i) for y_i in y))


2024-10-16 13:30:10: Loss after num_examples_seen=0 epoch=0: 4.777440
2024-10-16 13:43:46: Loss after num_examples_seen=48950 epoch=1: 4.988670
Setting learning rate to 0.002500
2024-10-16 14:00:59: Loss after num_examples_seen=97900 epoch=2: 4.861099
2024-10-16 14:33:38: Loss after num_examples_seen=146850 epoch=3: 4.863277
Setting learning rate to 0.001250


In [None]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

In [None]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

### 6. Generating Text
Once the model is trained, we can use it to generate text. By providing a seed (e.g., a phrase), the model predicts the next word iteratively to form a sequence.

Generation Process:
Provide a seed text.
Generate the next word using the model’s prediction.
Append the new word to the seed and continue.


In [156]:
def generate_sentence(model, senten_max_length):
    prompt='''Of the god of called together all his younger brothers and said it in thus great'''
    # We start the sentence with the start token
    new_sentence = [word_to_index[word] for word in prompt if word in word_to_index]
#     new_sentence = [word_to_index[sentence_start_token]]
    
    # Repeat until we get an end token and keep our sentences to less than senten_max_length words for now
    while (not new_sentence[-1] == word_to_index[sentence_end_token]) and len(new_sentence) < senten_max_length:
        next_words_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            
            #print(next_word_probs[0][-1])
            samples = np.random.multinomial(1, next_words_probs[0][-1])
            sampled_word = np.argmax(samples)
            
        new_sentence.append(sampled_word)

    #print(new_sentence)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    #print(sentence_str)
    return sentence_str

In [172]:
def generate_sentence(model, senten_max_length):
    prompt = '''Of the god of called together all his younger brothers and said it in thus great'''
    
    # Convert the prompt words to their corresponding indices if they exist in word_to_index
    new_sentence = [word_to_index[word] for word in prompt.split() if word in word_to_index]

    # Ensure the sentence has a valid start by adding the start token if necessary
    if not new_sentence:
        new_sentence = [word_to_index[sentence_start_token]]

    # Repeat until we reach the end token or exceed the maximum sentence length
    while (new_sentence[-1] != word_to_index[sentence_end_token]) and len(new_sentence) < senten_max_length:
        # Get the probabilities for the next word
        next_word_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        
        # Avoid sampling unknown words
        while sampled_word == word_to_index[unknown_token]:
            # Sample a word based on the probabilities
            samples = np.random.multinomial(1, next_word_probs[0][-1])
            sampled_word = np.argmax(samples)
        
        # Append the sampled word to the sentence
        new_sentence.append(sampled_word)

    # Convert the generated sentence indices back to words, excluding start and end tokens
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    
    return ' '.join(sentence_str)


In [174]:
num_sentences = 20
senten_min_length = 7
senten_max_length = 40

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model, senten_max_length)
    print ("".join(sent))

god of called together all his younger brothers and said it in thus great are and I shall not in the hours and looking faces their long of what beautiful ye and the thou shinest ever in the
god of called together all his younger brothers and said it in thus great o offices and is coming foremost of drupada and informed vaisyas taking and _rakshasas_ charioteers saying 'who of the and resembling thou rescuest remain
god of called together all his younger brothers and said it in thus great I shall thou them king and o child her the their time and what o child up to the city is a younger of
god of called together all his younger brothers and said it in thus great and whom _gandiva_ for viands entertain who thou the thou o shall perform speech just your branches and long as the with their auspicious
god of called together all his younger brothers and said it in thus great be will to live by by means of the and red dice most tree advice symmetrical and challenged whose rain on a single car


In [162]:
def generate_sentence_with_prompt(model, prompt, senten_max_length):
    # Start the sentence with the provided prompt
    new_sentence = [word_to_index[word] for word in prompt if word in word_to_index]

    # If no valid words in the prompt, start with the sentence start token
    if not new_sentence:
        new_sentence = [word_to_index[sentence_start_token]]

    # Repeat until we get an end token and limit the sentence length
    while (not new_sentence[-1] == word_to_index[sentence_end_token]) and len(new_sentence) < senten_max_length:
        next_word_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]

        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            abs_v = [0 if i < 0 else i for i in next_word_probs[-1][0]]
            nrm_v = [i / sum(abs_v) for i in abs_v]

            samples = np.random.multinomial(1, nrm_v)
            sampled_word = np.argmax(samples)

        new_sentence.append(sampled_word)

    # Convert the word indices back to words
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]

    return sentence_str

In [163]:
num_sentences = 20
senten_min_length = 7
senten_max_length = 40

prompt= '''The son of the god of Justice, called together all his younger brothers and said'''

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence_with_prompt(model,prompt, senten_max_length)
    print (" ".join(sent))

  nrm_v = [i / sum(abs_v) for i in abs_v]


ValueError: pvals < 0, pvals > 1 or pvals contains NaNs

### 7. Conclusion
We successfully built and trained an RNN model for text generation using trigrams. The model generates text based on the patterns it learned from the input corpus. Further improvements can be made by experimenting with different architectures and hyperparameters.

