Assignment - 5

# Text Generation using RNNs

In this notebook, we will explore how to build and train a Recurrent Neural Network (RNN) to generate text based on a corpus. We will use a trigram approach for input and output sequence generation.


#### Importing dependencies

In [1]:
import csv
import itertools
import operator
import numpy as np
import nltk
import sys
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Download NLTK model data (you need to do this once)
nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/sudarshan/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nlt

True


---

### 1. Data Preprocessing

In this section, we preprocess the text data by:
- Removing unnecessary characters and multiple spaces.
- Converting the text to lowercase for consistency.

### Steps:
1. Load the raw text data.
2. Apply regex for cleaning.
3. Tokenize the text into individual words.

```python
# Example Python code for preprocessing


In [3]:
import re
def clean_roman_numerals(text):
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '', text)

In [4]:
import re
from nltk import tokenize

#alphabets= "([A-Za-z])"
#prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
#suffixes = "(Inc|Ltd|Jr|Sr|Co)"
#starters = "(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
#acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
#websites = "[.](com|net|org|io|gov|edu|me)"
#digits = "([0-9])"

# If you want to restrict the size of the voabulary
# Right now, we set it in the next cell to be the entire vocabular: vocabulary_size = len(word_freq.items())
#vocabulary_size = 3000

unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
text = ''
print( "Reading txt file...")
with open(r'data/siddhartha.txt', 'r') as f:
    text = f.read()

#text = text.replace(",\n","\n")

# too many commas if i do this
#text = text.replace(","," ,")
#text = text.replace(":"," ,")
#text = text.replace(";"," ,")

#.. so i do this instead
text = text.replace(",","")
text = text.replace(":","")
text = text.replace(";","")

# too many apostrophes in shakespeare
text = text.replace("’","")

text = text.replace("?\n",".\n")
text = text.replace("!\n",".\n")
text = text.replace("?","")
text = text.replace("!","")
#text = text.replace("\n"," ")

text = text.replace('I ', 'i ')
text = clean_roman_numerals(text)
#text = text.replace('&', '')

_RE_COMBINE_WHITESPACE = re.compile(r"\s+")
text = _RE_COMBINE_WHITESPACE.sub(" ", text).strip()
print('done!')

Reading txt file...
done!


In [5]:
text = text.lower()
text = text.replace('i ', 'I ')

leftovers = ['ii', 'iii', 'cxi', 'cx', 'cxx', 'xx', 'xxxvi', 'xxxvi', 'xxxv', 'xxxi', 'xxi', 'cvi ', 'ci ', 'xvi ', 'lxi ', 
             'lxv','lxvi', 'lxxi', 'lxxvi', 'lxxvi', 'lxxv', 'lxxxi', 'cxxxi', 'cxxxi', 'cxxx', 'cxli', 'cxlvi', 'cxvl', 
             'cli ', 'cl ', 'cxxxvi','cvi ', 'cv ', 'ci ', 'cx ', 'cxx', 'cxi', 'li ' , 'xxx', 'xxvi', 'xxv', 'cxv', 'xci', 
             'xli', 'lxvi', 'lxi ', ' c ', 'lxxxvi', 'lxxxvi', 'lxxxv', ' v ', 'vi ', ' l ', 'lvi ', 'lv ', 'xlv ', ' x ', 
             'xi ', 'xl ', 'ix ']
for rn in leftovers:
    text = text.replace(rn, '')

text = text.replace('.  ', '. ')

In [6]:
sentences = tokenize.sent_tokenize(text)
for i in range(100, 110):
    print(sentences[i])
    print()

he saw merchants trading princes hunting mourners wailing for their dead whores offering themselves physicians trying to help the sick priests determining the most suitable day for seeding lovers loving mothers nursing their children—and all of this was not worthy of one look from his eye it all lied it all stank it all stank of lies it all pretended to be meaningful and joyful and beautiful and it all was just concealed putrefaction.

the world tasted bitter.

life was torture.

a goal stood before siddhartha a single goal to become empty empty of thirst empty of wishing empty of dreams empty of joy and sorrow.

dead to himself not to be a self any more to find tranquility with an emptied heart to be open to miracles in unselfish thoughts that was his goal.

once all of my self was overcome and had died once every desire and every urge was silent in the heart then the ultimate part of me had to awake the innermost of my being which is no longer my self the great secret.

silently sidd

In [117]:
vocabulary_size = 40000

### 2. Creating Word Mappings
Here, we convert the cleaned text into numerical form by creating two dictionaries:

word_to_index: Maps each word to a unique index.
index_to_word: Reverse mapping to retrieve words from their corresponding indices.
This allows us to prepare the data for model training.
### Example code for word mappings


In [118]:
# Append SENTENCE_START and SENTENCE_END
sentences = ["%s %s %s" % (sentence_start_token, x[:-1].replace("&",""), sentence_end_token) for x in sentences] 
print(  "Parsed %d sentences." % (len(sentences)))

# Tokenize the sentences into words, making sure to remove end-of-sentence period
tokenized_sentences = [nltk.word_tokenize(sent.replace('.', '')) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print(  "Found %d unique words tokens." % len(word_freq.items()))

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
print("The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))

# Replace all words not in our vocabulary with the unknown token
#for i, sent in enumerate(tokenized_sentences):
#    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]
vocabulary_size = len(word_freq.items())
print("Using vocabulary size %d." % vocabulary_size)

print(  "\nExample sentence: '%s'" % sentences[0])
print(  "\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0])

Parsed 1476 sentences.
Found 4089 unique words tokens.
The least frequent word in our vocabulary is 'newsletter' and appeared 1 times.
Using vocabulary size 4089.

Example sentence: 'SENTENCE_START SENTENCE_START the project gutenberg ebook of siddhartha this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever SENTENCE_EN SENTENCE_END'

Example sentence after Pre-processing: '['SENTENCE_START', 'SENTENCE_START', 'the', 'project', 'gutenberg', 'ebook', 'of', 'siddhartha', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'united', 'states', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'SENTENCE_EN', 'SENTENCE_END']'


In [119]:
vocab[0:20]

[('SENTENCE_START', 2952),
 ('the', 2239),
 ('SENTENCE_EN', 1476),
 ('SENTENCE_END', 1476),
 ('and', 1423),
 ('to', 1226),
 ('of', 1112),
 ('a', 967),
 ('he', 935),
 ('his', 708),
 ('in', 689),
 ('had', 527),
 ('was', 512),
 ('this', 491),
 ('it', 484),
 ('you', 462),
 ('him', 458),
 ('with', 411),
 ('I', 395),
 ('“', 383)]

In [120]:
sentences[0:5]

['SENTENCE_START SENTENCE_START the project gutenberg ebook of siddhartha this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever SENTENCE_EN SENTENCE_END',
 'SENTENCE_START SENTENCE_START you may copy it give it away or re-use it under the terms of the project gutenberg license included with this ebook or online at www.gutenberg.org SENTENCE_EN SENTENCE_END',
 'SENTENCE_START SENTENCE_START if you are not located in the united states you will have to check the laws of the country where you are located before using this ebook SENTENCE_EN SENTENCE_END',
 'SENTENCE_START SENTENCE_START title siddhartha author hermann hesse release date february 1 2001 [ebook #2500] most recently updated december 22 2021 language english credits michael pullen chandra yenco and isaac jones *** start of the project gutenberg ebook siddhartha *** siddhartha an indian tale by herman hesse * * * contents first pa

### 3. Preparing Trigrams and Sequences
We now prepare the input sequences (bigrams) and the target word (third word) using trigrams. The process involves:

Creating sequences of n-grams (specifically trigrams).
Mapping each word in the sequence to its index.
### Example code for creating n-grams and sequences


In [121]:
%%time
from collections import Counter
from nltk import ngrams
bigram_counts = Counter(ngrams(text.split(), 2))
bigram_counts.most_common(10)

CPU times: user 11.1 ms, sys: 4.77 ms, total: 15.9 ms
Wall time: 37.2 ms


[(('of', 'the'), 269),
 (('in', 'the'), 215),
 (('he', 'had'), 187),
 (('to', 'the'), 111),
 (('of', 'his'), 87),
 (('to', 'be'), 85),
 (('in', 'his'), 77),
 (('and', 'the'), 77),
 (('I', 'have'), 73),
 (('for', 'a'), 72)]

In [122]:
%%time
import collections
def ngrams(text, n=2):
    return zip(*[text[i:] for i in range(n)])
bigram_counts = collections.Counter(ngrams(text.split(), 2))
bigram_counts.most_common(10)

CPU times: user 19.4 ms, sys: 2.22 ms, total: 21.6 ms
Wall time: 41.9 ms


[(('of', 'the'), 269),
 (('in', 'the'), 215),
 (('he', 'had'), 187),
 (('to', 'the'), 111),
 (('of', 'his'), 87),
 (('to', 'be'), 85),
 (('in', 'his'), 77),
 (('and', 'the'), 77),
 (('I', 'have'), 73),
 (('for', 'a'), 72)]

In [123]:
text[0:1000]

'the project gutenberg ebook of siddhartha this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever. you may copy it give it away or re-use it under the terms of the project gutenberg license included with this ebook or online at www.gutenberg.org. if you are not located in the united states you will have to check the laws of the country where you are located before using this ebook. title siddhartha author hermann hesse release date february 1 2001 [ebook #2500] most recently updated december 22 2021 language english credits michael pullen chandra yenco and isaac jones *** start of the project gutenberg ebook siddhartha *** siddhartha an indian tale by herman hesse * * * contents first part the son of the brahman with the samanas gotama awakening second part kamala with the childlike people sansara by the river the ferryman the son om govinda first part to romain rolland my dear friend the

In [124]:
first_word_counts = Counter([ p.replace('. ', '') for p in re.findall('\..[^" "]*', text)])
first_word_counts.most_common(10)

[('.”', 159),
 ('but', 117),
 ('he', 104),
 ('and', 68),
 ('the', 63),
 ('I', 55),
 ('siddhartha', 45),
 ('it', 40),
 ('when', 31),
 ('for', 30)]

In [125]:
#X_train = [[sentence_start_token] for sent,times in first_word_counts if sent != 'o.']
#y_train = [sent for sent in first_word_counts if sent != 'o.']
X_train = [[sentence_start_token]*c for sent,c in first_word_counts.items() if sent != 'o.']
y_train = [[sent]*c for sent,c in first_word_counts.items() if sent != 'o.']

In [126]:
X_train = [item for sublist in X_train for item in sublist]
y_train = [item for sublist in y_train for item in sublist]

In [127]:
X_train[0:10]

['SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START',
 'SENTENCE_START']

In [128]:
print(y_train)

['you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', 'you', '.gutenberg.org.', '.gutenberg.org.', '.gutenberg.org.', '.gutenberg.org.', 'title', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for', 'for

In [129]:
len(X_train), len(y_train)

(1655, 1655)

In [130]:
import random

def fisher_yates (arr1, arr2):
     
    # We will Start from the last element
    # and swap one by one.
    n = len(arr1)
    if n != len(arr2):
        return None
    
    for i in range(n - 1, 0, -1):

        # Pick a random index from 0 to i
        j = random.randint(0, i)
        #print(i, j)

        # Swap arr[i] with the element at random index
        arr1[i], arr1[j] = arr1[j], arr1[i]
        arr2[i], arr2[j] = arr2[j], arr2[i]
        
    return arr1, arr2

In [131]:
import random as rd
one = ['a', 'b', 'c']
two = ['1', '2', '3']
one, two = fisher_yates(one, two)
one, two

(['b', 'c', 'a'], ['2', '3', '1'])

In [132]:
one = [['a'], ['b'], ['c']]
two = [['1'], ['2'], ['3']]
one, two = fisher_yates(one, two)
one, two

([['c'], ['b'], ['a']], [['3'], ['2'], ['1']])

In [133]:
X_train, y_train = fisher_yates(X_train, y_train)
len(X_train), len(y_train)

(1655, 1655)

In [134]:
X_tokens = [[word_to_index[symbol]] for symbol,word in zip(X_train, y_train) if word in word_to_index]
y_tokens = [[word_to_index[word]] for symbol,word in zip(X_train, y_train) if word in word_to_index]

In [135]:
X_train = X_tokens
y_train = y_tokens

In [136]:
len(X_train), len(y_train)

(1314, 1314)

In [137]:
X_train[0:5], y_train[0:5]

([[0], [0], [0], [0], [0]], [[27], [18], [668], [27], [41]])

In [138]:
ngrams_up_to_20 = []
for i in range(2, 21):
    ngram_counts = Counter(ngrams(text.split(), i))
    print('ngram-', i, 'length:', len(ngram_counts))
    ngrams_up_to_20.append(ngram_counts)

ngram- 2 length: 24854
ngram- 3 length: 37872
ngram- 4 length: 41022
ngram- 5 length: 41764
ngram- 6 length: 41974
ngram- 7 length: 42048
ngram- 8 length: 42077
ngram- 9 length: 42090
ngram- 10 length: 42095
ngram- 11 length: 42098
ngram- 12 length: 42101
ngram- 13 length: 42104
ngram- 14 length: 42106
ngram- 15 length: 42107
ngram- 16 length: 42107
ngram- 17 length: 42107
ngram- 18 length: 42107
ngram- 19 length: 42107
ngram- 20 length: 42107


In [139]:
def remove_periods(ngram):
    for wrd in ngram[0]:
        if '.' in wrd or "’" in wrd or "‘" in wrd:
            return False
    return True
    
def my_filter(ngrams):
    return filter(remove_periods, ngrams)

In [140]:
l = list(filter(lambda x: 1 < int(x[1]), ngrams_up_to_20[0].most_common()))
len(l), l

(5130,
 [(('of', 'the'), 269),
  (('in', 'the'), 215),
  (('he', 'had'), 187),
  (('to', 'the'), 111),
  (('of', 'his'), 87),
  (('to', 'be'), 85),
  (('in', 'his'), 77),
  (('and', 'the'), 77),
  (('I', 'have'), 73),
  (('for', 'a'), 72),
  (('from', 'the'), 71),
  (('with', 'the'), 70),
  (('a', 'long'), 68),
  (('he', 'was'), 68),
  (('the', 'river'), 67),
  (('had', 'been'), 62),
  (('by', 'the'), 61),
  (('it', 'is'), 61),
  (('of', 'a'), 60),
  (('to', 'him'), 60),
  (('long', 'time'), 57),
  (('project', 'gutenberg™'), 55),
  (('with', 'a'), 53),
  (('of', 'this'), 51),
  (('when', 'he'), 50),
  (('did', 'not'), 48),
  (('at', 'the'), 48),
  (('for', 'the'), 46),
  (('the', 'world'), 46),
  (('into', 'the'), 46),
  (('that', 'he'), 46),
  (('the', 'same'), 44),
  (('had', 'to'), 42),
  (('it', 'was'), 41),
  (('able', 'to'), 41),
  (('in', 'a'), 40),
  (('as', 'a'), 39),
  (('him', 'and'), 39),
  (('on', 'the'), 38),
  (('all', 'of'), 35),
  (('his', 'heart'), 35),
  (('full', '

In [141]:
def my_filter(ngrams):
    return filter(remove_periods, list(filter(lambda x: 1 < int(x[1]), ngrams)))

In [142]:
bigrams_to_learn = ngrams_up_to_20[0]
X_train_example = [[word_to_index[sent[0][0]]] for sent in my_filter(bigrams_to_learn.most_common())
                  if sent[0][0] in word_to_index and sent[0][1] in word_to_index]
y_train_example = [[word_to_index[sent[0][1]]] for sent in my_filter(bigrams_to_learn.most_common())
                  if sent[0][0] in word_to_index and sent[0][1] in word_to_index]

In [143]:
X_train_example[0:10], y_train_example[0:10]

([[6], [10], [8], [5], [6], [5], [10], [4], [18], [22]],
 [[1], [1], [11], [1], [9], [31], [9], [1], [29], [7]])

In [144]:
len(X_train_example), len(y_train_example)

(4836, 4836)

In [145]:
trigrams_to_learn = ngrams_up_to_20[1].copy()
[sent[0] for sent in my_filter(trigrams_to_learn.most_common())]

[('a', 'long', 'time'),
 ('for', 'a', 'long'),
 ('the', 'project', 'gutenberg'),
 ('he', 'had', 'been'),
 ('the', 'exalted', 'one'),
 ('project', 'gutenberg™', 'electronic'),
 ('in', 'order', 'to'),
 ('which', 'he', 'had'),
 ('in', 'his', 'heart'),
 ('that', 'he', 'had'),
 ('in', 'the', 'forest'),
 ('he', 'did', 'not'),
 ('he', 'had', 'not'),
 ('of', 'the', 'world'),
 ('the', 'terms', 'of'),
 ('with', 'a', 'smile'),
 ('seemed', 'to', 'him'),
 ('long', 'time', 'he'),
 ('be', 'able', 'to'),
 ('the', 'project', 'gutenberg™'),
 ('project', 'gutenberg', 'literary'),
 ('gutenberg', 'literary', 'archive'),
 ('all', 'of', 'this'),
 ('he', 'had', 'to'),
 ('when', 'he', 'had'),
 ('literary', 'archive', 'foundation'),
 ('in', 'the', 'united'),
 ('the', 'united', 'states'),
 ('of', 'the', 'project'),
 ('by', 'the', 'river'),
 ('to', 'him', 'and'),
 ('as', 'if', 'he'),
 ('to', 'become', 'a'),
 ('in', 'front', 'of'),
 ('this', 'is', 'what'),
 ('he', 'had', 'learned'),
 ('gutenberg™', 'electronic', '

In [146]:
X_train_example.extend([[word_to_index[w] for w in sent[0][:-1]] for sent in my_filter(trigrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])])
y_train_example.extend([[word_to_index[w] for w in sent[0][1:]] for sent in my_filter(trigrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])])

In [147]:
len(X_train_example), len(y_train_example)

(7213, 7213)

In [148]:
X_train_example[1575:1585], y_train_example[1575:1585]

([[178], [142], [51], [149], [11], [577], [23], [156], [680], [12]],
 [[142], [10], [149], [116], [433], [6], [679], [680], [163], [681]])

In [149]:
bigrams_to_learn = ngrams_up_to_20[0]
X_train_2 = [[word_to_index[sent[0][0]]] for sent in my_filter(bigrams_to_learn.most_common())
                  if sent[0][0] in word_to_index and sent[0][1] in word_to_index]
y_train_2 = [[word_to_index[sent[0][1]]] for sent in my_filter(bigrams_to_learn.most_common())
                  if sent[0][0] in word_to_index and sent[0][1] in word_to_index]
X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)

In [150]:
len(X_train_2), len(y_train_2)

(4836, 4836)

In [151]:
X_train_2[0:10], y_train_2[0:10]

([[35], [15], [122], [35], [47], [1129], [130], [22], [54], [7]],
 [[973], [110], [9], [92], [12], [1130], [232], [48], [13], [225]])

In [152]:
X_train.extend(X_train_2)
y_train.extend(y_train_2)

In [153]:
len(X_train), len(y_train)

(6150, 6150)

In [154]:
random.sample(list(zip(X_train, y_train)), 10)

[([633], [36]),
 ([630], [9]),
 ([27], [316]),
 ([0], [117]),
 ([254], [904]),
 ([0], [637]),
 ([1033], [6]),
 ([1510], [33]),
 ([0], [88]),
 ([145], [26])]

In [155]:
ngrams_to_learn = ngrams_up_to_20[1]
ngrams_to_learn.most_common(10)

[(('a', 'long', 'time'), 56),
 (('for', 'a', 'long'), 47),
 (('the', 'project', 'gutenberg'), 20),
 (('he', 'had', 'been'), 19),
 (('the', 'exalted', 'one'), 18),
 (('project', 'gutenberg™', 'electronic'), 18),
 (('in', 'order', 'to'), 16),
 (('which', 'he', 'had'), 15),
 (('in', 'his', 'heart'), 14),
 (('that', 'he', 'had'), 14)]

In [156]:
[sent[0] for sent in my_filter(ngrams_to_learn.most_common(10))]

[('a', 'long', 'time'),
 ('for', 'a', 'long'),
 ('the', 'project', 'gutenberg'),
 ('he', 'had', 'been'),
 ('the', 'exalted', 'one'),
 ('project', 'gutenberg™', 'electronic'),
 ('in', 'order', 'to'),
 ('which', 'he', 'had'),
 ('in', 'his', 'heart'),
 ('that', 'he', 'had')]

In [157]:
X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)
X_train_2[0:5], y_train_2[0:5], len(X_train_2), len(y_train_2)

([[82, 13], [23, 1], [150, 37], [398, 10], [145, 26]],
 [[13, 161], [1, 81], [37, 1], [10, 7], [26, 16]],
 2377,
 2377)

In [158]:
def my_filter(ngrams):
    return filter(remove_periods, ngrams)

In [159]:
X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
X_train_2 = X_train_2[:2000]
y_train_2 = y_train_2[:2000]
X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)
X_train_2[0:5], y_train_2[0:5], len(X_train_2), len(y_train_2)

([[66, 8], [436, 6], [1, 1911], [1, 1632], [42, 120]],
 [[8, 11], [6, 1], [1911, 6], [1632, 429], [120, 37]],
 2000,
 2000)

In [160]:
ngrams_to_learn = ngrams_up_to_20[1]
X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in my_filter(ngrams_to_learn.most_common())
               if all([w in word_to_index for w in sent[0]])]
print(X_train_2[0:5], y_train_2[0:5], len(X_train_2), len(y_train_2))

[[7, 64], [22, 7], [1, 73], [8, 11], [1, 173]] [[64, 43], [7, 64], [73, 196], [11, 45], [173, 30]] 31793 31793


In [161]:
word_to_index['SENTENCE_END']

3

In [162]:
def check_eos(trigram):
    if trigram[1] == word_to_index['SENTENCE_END']:
          return True  
    return False

trigrams_eos = list(filter(check_eos, y_train_2))
len(trigrams_eos), trigrams_eos[0:5]

(0, [])

In [163]:
from tqdm import tqdm
for i in tqdm(range(1, len(ngrams_up_to_20))):
    ngrams_to_learn = ngrams_up_to_20[i]
    X_train_2 = [[word_to_index[w] for w in sent[0][:-1]] for sent in my_filter(ngrams_to_learn.most_common())
                   if all([w in word_to_index for w in sent[0]])]
    y_train_2 = [[word_to_index[w] for w in sent[0][1:]] for sent in my_filter(ngrams_to_learn.most_common())
                   if all([w in word_to_index for w in sent[0]])]
    X_train_2 = X_train_2[:2000]
    y_train_2 = y_train_2[:2000]
    X_train_2, y_train_2 = fisher_yates(X_train_2, y_train_2)
    X_train.extend(X_train_2)
    y_train.extend(y_train_2)

100%|███████████████████████████████████████████| 18/18 [00:03<00:00,  5.17it/s]


In [164]:
len(X_train), len(y_train)

(42150, 42150)

In [165]:
print(random.sample(list(zip(X_train, y_train)), 10))

[([7, 73, 125], [73, 125, 238]), ([910], [4]), ([1300, 1688, 12, 2415, 34, 2416, 12, 2417, 34, 2418], [1688, 12, 2415, 34, 2416, 12, 2417, 34, 2418, 214]), ([427, 24, 22, 1], [24, 22, 1, 325]), ([31, 150, 1, 1629, 431, 10, 239, 178, 142, 14, 11, 5, 31], [150, 1, 1629, 431, 10, 239, 178, 142, 14, 11, 5, 31, 897]), ([11], [23]), ([1248], [4]), ([374, 298, 15], [298, 15, 72]), ([6, 1, 95, 1, 1621, 6, 893, 6, 306, 6, 1585, 6, 1586, 1, 2300, 6, 1], [1, 95, 1, 1621, 6, 893, 6, 306, 6, 1585, 6, 1586, 1, 2300, 6, 1, 381]), ([25, 30, 4, 88], [30, 4, 88, 163])]


In [166]:
len(tokenized_sentences)

1476

In [167]:
tokenized_sentences[100]

['SENTENCE_START',
 'SENTENCE_START',
 'he',
 'saw',
 'merchants',
 'trading',
 'princes',
 'hunting',
 'mourners',
 'wailing',
 'for',
 'their',
 'dead',
 'whores',
 'offering',
 'themselves',
 'physicians',
 'trying',
 'to',
 'help',
 'the',
 'sick',
 'priests',
 'determining',
 'the',
 'most',
 'suitable',
 'day',
 'for',
 'seeding',
 'lovers',
 'loving',
 'mothers',
 'nursing',
 'their',
 'children—and',
 'all',
 'of',
 'this',
 'was',
 'not',
 'worthy',
 'of',
 'one',
 'look',
 'from',
 'his',
 'eye',
 'it',
 'all',
 'lied',
 'it',
 'all',
 'stank',
 'it',
 'all',
 'stank',
 'of',
 'lies',
 'it',
 'all',
 'pretended',
 'to',
 'be',
 'meaningful',
 'and',
 'joyful',
 'and',
 'beautiful',
 'and',
 'it',
 'all',
 'was',
 'just',
 'concealed',
 'putrefaction',
 'SENTENCE_EN',
 'SENTENCE_END']

In [168]:
[[word_to_index[w] for w in sent] for sent in tokenized_sentences if all([w in word_to_index for w in sent])][100]

[0,
 0,
 8,
 60,
 1297,
 1676,
 1677,
 1298,
 2379,
 1678,
 22,
 92,
 417,
 2380,
 1679,
 441,
 2381,
 1299,
 5,
 545,
 1,
 1044,
 1625,
 2382,
 1,
 156,
 1680,
 146,
 22,
 2383,
 1681,
 1045,
 673,
 2384,
 92,
 2385,
 33,
 6,
 13,
 12,
 23,
 911,
 6,
 30,
 312,
 26,
 9,
 1247,
 14,
 33,
 2386,
 14,
 33,
 1300,
 14,
 33,
 1300,
 6,
 2387,
 14,
 33,
 2388,
 5,
 31,
 1616,
 4,
 789,
 4,
 164,
 4,
 14,
 33,
 12,
 89,
 1046,
 2389,
 2,
 3]

In [169]:
X_train_full_sentences = [[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences
                         if all([w in word_to_index for w in sent])]
y_train_full_sentences = [[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences
                         if all([w in word_to_index for w in sent])]

In [170]:
print(X_train_full_sentences[0:5], y_train_full_sentences[0:5])

[[0, 0, 1, 73, 196, 427, 6, 21, 13, 427, 24, 22, 1, 325, 6, 878, 1570, 10, 1, 374, 298, 4, 156, 104, 1233, 6, 1, 95, 37, 40, 1234, 4, 17, 569, 40, 1571, 1572, 2], [0, 0, 15, 215, 458, 14, 197, 14, 145, 39, 1573, 14, 216, 1, 272, 6, 1, 73, 196, 343, 1235, 17, 13, 427, 39, 1003, 37, 879, 2], [0, 0, 66, 15, 46, 23, 667, 10, 1, 374, 298, 15, 72, 29, 5, 1004, 1, 523, 6, 1, 764, 103, 15, 46, 667, 108, 765, 13, 427, 2], [0, 0, 1574, 21, 2229, 2230, 1575, 2231, 1236, 2232, 766, 1576, 2233, 427, 2234, 2235, 2236, 156, 2237, 1577, 2238, 2239, 2240, 1005, 2241, 2242, 1578, 2243, 2244, 2245, 4, 2246, 2247, 375, 375, 375, 486, 6, 1, 73, 196, 427, 21, 375, 375, 375, 21, 76, 2248, 1237, 34, 2249, 1575, 375, 375, 375, 1006, 246, 186, 1, 136, 6, 1, 137, 17, 1, 147, 177, 668, 669, 186, 91, 17, 1, 286, 86, 428, 34, 1, 54, 1, 187, 1, 136, 217, 42, 246, 186, 5, 2250, 2251, 36, 198, 102, 1, 136, 6, 1, 137, 10, 1, 524, 6, 1, 287, 10, 1, 1579, 6, 1, 2252, 459, 1, 1580, 10, 1, 524, 6, 1, 2253, 166, 10, 1, 524,

In [171]:
import random
last_n_words = []
for i in range(3, 20):
    tokenized_sentences_400 = random.sample(list(tokenized_sentences), 400)
    for s in tokenized_sentences_400:
        last_n_words.append(s[::-1][:i][::-1])

print(random.sample(last_n_words, 10))

[['his', 'deep', 'sleep', 'would', 'meet', 'with', 'his', 'innermost', 'part', 'and', 'would', 'reside', 'in', 'the', 'atman', 'SENTENCE_EN', 'SENTENCE_END'], ['how', 'youre', 'able', 'to', 'write', '”', 'the', 'merchant', 'praised', 'him', 'SENTENCE_EN', 'SENTENCE_END'], ['with', 'this', 'trip', 'SENTENCE_EN', 'SENTENCE_END'], ['gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'wwwgutenbergorg', 'SENTENCE_EN', 'SENTENCE_END'], ['proper', 'it', 'is', 'for', 'a', 'brahman', 'to', 'speak', 'harsh', 'and', 'angry', 'words', 'SENTENCE_EN', 'SENTENCE_END'], ['loudly', 'and', 'used', 'crude', 'swearwords', 'SENTENCE_EN', 'SENTENCE_END'], ['one', 'in', 'the', 'grove', 'jetavana', '”', '“', 'youre', 'siddhartha', '”', 'govinda', 'exclaimed', 'loudly', 'SENTENCE_EN', 'SENTENCE_END'], ['the', 'beginning', 'and', 'as', 'a', 'child', 'again', 'he', 'had', 'to', 'smile', 'SENTENCE_EN', 'SENTENCE_END'], ['he', 'used', 'to', 'in', 'the', 'spring', 'of', 'his', 'years'

In [172]:
len(last_n_words)

6800

In [173]:
X_train_eos = [[word_to_index[w] for w in sent[:-1]] for sent in last_n_words
                         if all([w in word_to_index for w in sent])]
y_train_eos = [[word_to_index[w] for w in sent[1:]] for sent in last_n_words
                         if all([w in word_to_index for w in sent])]

In [174]:
len(X_train_eos), len(y_train_eos)

(6800, 6800)

In [175]:
X_train.extend(X_train_eos)
y_train.extend(y_train_eos)

In [176]:
len(X_train), len(y_train)

(48950, 48950)

In [177]:
import pickle
with open('data/X_train_siddhartha.pkl', 'wb') as file:
    pickle.dump(X_train, file)

In [178]:
with open('data/y_train_siddhartha.pkl', 'wb') as file:
    pickle.dump(y_train, file)

In [179]:
with open('data/tokenized_sentences_siddhartha.pkl', 'wb') as file:
    pickle.dump(tokenized_sentences, file)

In [180]:
with open('data/word_to_index_siddhartha.pkl', 'wb') as file:
    pickle.dump(word_to_index, file)

In [181]:
with open('data/index_to_word_siddhartha.pkl', 'wb') as file:
    pickle.dump(index_to_word, file)

In [182]:
X_train2 = np.asarray(X_train,dtype=object)
y_train2 = np.asarray(y_train,dtype=object)

In [183]:
X_train2.shape, y_train2.shape

((48950,), (48950,))

In [184]:
print(random.sample(list(zip(X_train2, y_train2)), 10))

[([427, 39, 1003], [39, 1003, 37]), ([6, 13, 493, 4], [13, 493, 4, 775]), ([1], [387]), ([0, 0, 19, 855, 17, 15, 208, 1037, 20, 67, 21, 2], [0, 19, 855, 17, 15, 208, 1037, 20, 67, 21, 2, 3]), ([186, 13, 775, 186, 14, 12, 23, 1618, 4, 2298, 14, 12, 275], [13, 775, 186, 14, 12, 23, 1618, 4, 2298, 14, 12, 275, 79]), ([45, 120], [120, 307]), ([1, 88, 30, 1, 1615, 30, 80, 1], [88, 30, 1, 1615, 30, 80, 1, 360]), ([122, 1, 222, 902, 6, 1, 688], [1, 222, 902, 6, 1, 688, 8]), ([20, 19, 1, 154, 207, 165, 5, 586, 91, 20, 2896, 21, 2], [19, 1, 154, 207, 165, 5, 586, 91, 20, 2896, 21, 2, 3]), ([0], [105])]


In [185]:
embedding_dim = 100
vocabulary_size, embedding_dim

(4089, 100)

In [186]:
import os
import numpy as np

#glove_dir = 'data/glove'
glove_dir = "data"

embeddings_index = {} #initialize dictionary
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf8')
try:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
except:
    print(line)
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [187]:
vocabulary_size

4089

In [188]:
embedding_dim = 100

embedding_matrix = np.zeros((vocabulary_size, embedding_dim))
for word, i in vocab:
    embedding_vector = embeddings_index.get(word)
    if i < vocabulary_size:
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

In [189]:
embedding_matrix.shape

(4089, 100)

In [190]:
vocab[200]

('joy', 31)

In [191]:
embedding_matrix[12]

array([-0.39159   ,  0.22118001,  0.81884003, -0.48398   , -0.57314003,
        0.083019  , -0.20906   , -0.074538  ,  0.049359  , -0.55949998,
       -0.32308999,  0.57011998, -0.21456   , -0.41084999,  0.29183   ,
        0.17476   , -0.96956998,  0.048109  ,  0.47062999,  0.74265999,
        0.74690002,  1.02139997, -0.13095   , -0.67132002,  0.37097999,
        0.43346   , -0.079043  , -0.53241998,  0.16960999,  0.28220001,
       -0.40671   ,  0.40191999, -0.23286   , -0.44812   ,  0.16073   ,
        0.266     , -0.57449001,  0.17587   ,  0.60320997, -0.29776999,
        0.17654   , -0.76122999,  0.10279   , -0.47314   , -0.76828998,
       -0.29628   ,  0.51100999,  0.59928   ,  0.64578998, -1.18060005,
        0.084544  , -0.59182   ,  0.1964    ,  0.88892001, -0.34691   ,
       -2.38919997, -0.12136   , -0.17922001,  0.87950999, -0.08393   ,
        0.21187   ,  1.3937    , -1.33019996,  0.54578   ,  0.18774   ,
       -0.27192   ,  0.50072998, -0.10156   ,  0.20821001,  0.21

In [192]:
from scipy import spatial

def find_closest_embeddings(embedding):
    return sorted(embeddings_index.keys(), key=lambda word: spatial.distance.euclidean(embeddings_index[word], embedding))

In [193]:
find_closest_embeddings(embeddings_index["king"])[1:6]

['prince', 'queen', 'monarch', 'brother', 'uncle']

In [194]:
print(find_closest_embeddings(
    embeddings_index["twig"] - embeddings_index["branch"] + embeddings_index["hand"]
)[:10])

['flashlight', 'twig', 'clipboard', 'shove', 'hand', 'fingers', 'clutching', 'clutched', 'tossing', 'stroking']


In [195]:
# from sklearn.manifold import TSNE
# tsne = TSNE(n_components=2, random_state=0)

In [196]:
# words =  list(embeddings_index.keys())[:500]
# vectors = [embeddings_index[word] for word in words]

In [197]:
# Y = tsne.fit_transform(vectors)

In [198]:
vocabulary_size, embedding_dim

(4089, 100)

### 4. Model Architecture
We define the architecture of our RNN model:

Embedding Layer: Maps input indices to dense vectors of fixed size.
SimpleRNN Layer: A simple recurrent layer that learns dependencies from the sequences.
Dense Layer: Outputs the predicted word by applying a softmax over the vocabulary.

In [199]:
class RNN:    
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        
        # Randomly initialize the network parameters
        #self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        #self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, embedding_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        
        # Set GLOVE embeddings matrix
        self.G = embedding_matrix

In [200]:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    # sometimes, may want to do this first:
    #x = np.vectorize(round)(x)
    
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

In [201]:
def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    
    # During forward propagation we save all hidden states in s because need them later.
    # We add one additional element for the initial hidden, which we set to 0
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    
    # For each time step...
    for t in np.arange(T):
        # embedding of x[t]:
        e_t = self.G[x[t]]
                             
        # Note that we are indxing U by x[t]. This is the same as multiplying U with a one-hot vector.
        #s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        s[t] = np.tanh(self.U.dot(e_t) + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
        
    return [o, s]

RNN.forward_propagation = forward_propagation

In [202]:
word_dim = vocabulary_size
hidden_dim = 100
embedding_dim = 100
U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, embedding_dim))
W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
x = np.random.randint(0, high=3000, size=word_dim)
T = len(x)
s = np.zeros((T + 1, hidden_dim))
s_m1 = np.zeros(hidden_dim)
o = np.zeros((T, word_dim))
e_0 = embedding_matrix[x[0]]
s_0 = np.tanh(U.dot(e_0) + W.dot(s_m1))
print(s_0.shape, V.shape)
o_0 = softmax(V.dot(s_0))
o_0.shape, o_0

(100,) (4089, 100)


((4089,),
 array([0.00024456, 0.00024456, 0.00024456, ..., 0.00024456, 0.00024456,
        0.00024456]))

In [203]:
def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o[-1], axis=1)

RNN.predict = predict

In [204]:
def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)

RNN.predict = predict

In [205]:
print ("x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train2[1000]]), X_train2[1000]))

x:
SENTENCE_START
[0]


In [206]:
print ("x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train2[20000]]), X_train2[20000]))

x:
melting from the beams of the sun dreams
[2285, 26, 1, 2286, 6, 1, 487, 773]


In [207]:
print ("x:\n%s\n%s" % (" ".join([index_to_word[x] for x in X_train2[30000]]), X_train2[30000]))

x:
it silently out of himself while exhaling with all the concentration of his
[14, 326, 111, 6, 59, 209, 1586, 17, 33, 1, 2259, 6, 9]


In [208]:
vocabulary_size, X_train2[10000]

(4089, [9, 570, 4])

In [209]:
np.random.seed(17)
model = RNN(vocabulary_size)
o, s = model.forward_propagation(X_train2[10000])
print (o.shape, o)

(3, 4089) [[0.00024803 0.00024104 0.00023871 ... 0.00024342 0.00022586 0.0002446 ]
 [0.00024418 0.00025014 0.000246   ... 0.00023526 0.00024535 0.00024223]
 [0.00024463 0.00023684 0.00023286 ... 0.00024458 0.00024566 0.0002521 ]]


In [210]:
np.argmax(o[-1], axis=0)

970

In [211]:
predictions = model.predict(X_train2[40000])
print(predictions.shape, predictions)

(18,) [3342 3114 1447 3096 3114 2041  359  431 1447 3096 3114 1727 3096 2567
  143  431 1447 3096]


In [212]:
print ("x:\n%s" % (" ".join([index_to_word[x] for x in predictions])))

x:
delude company bottom rot company manifestations high source bottom rot company murky rot represented same source bottom rot


In [213]:
def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples
    N = np.sum((len(y_i) for y_i in y))
    return self.calculate_total_loss(x,y)/N

RNN.calculate_total_loss = calculate_total_loss
RNN.calculate_loss = calculate_loss

In [214]:
# Limit to 1000 examples to save time
print ("Expected Loss for random predictions: %f" % np.log(vocabulary_size))
print ("Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000]))

Expected Loss for random predictions: 8.316056


  N = np.sum((len(y_i) for y_i in y))


Actual loss: 8.316056


In [215]:
def bptt(self, x, y):
    T = len(y)
    
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            dLdW += np.outer(delta_t, s[bptt_step-1])              
            #dLdU[:,x[bptt_step]] += delta_t
            dLdU += np.outer(delta_t, self.G[x[bptt_step]]) 
            
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
            
    return [dLdU, dLdV, dLdW]

RNN.bptt = bptt

In [216]:
def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
    
    # Calculate the gradients using backpropagation. We want to checker if these are correct.
    bptt_gradients = model.bptt(x, y)
    
    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']
    
    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):
        
        # Get the actual parameter value from the mode, e.g. model.W
        parameter = operator.attrgetter(pname)(self)
        print("Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape)))
               
        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
               
            # Save the original value so we can reset it later
            original_value = parameter[ix]
               
            # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
            parameter[ix] = original_value + h
            gradplus = model.calculate_total_loss([x],[y])
            parameter[ix] = original_value - h
            gradminus = model.calculate_total_loss([x],[y])
            estimated_gradient = (gradplus - gradminus)/(2*h)
               
            # Reset parameter to original value
            parameter[ix] = original_value
               
            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]
               
            # calculate The relative error: (|x - y|/(|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient) / (
                                np.abs(backprop_gradient) + np.abs(estimated_gradient))
            
               # If the error is to large fail the gradient check
            if relative_error > error_threshold:
                print( "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix))
                print( "+h Loss: %f" % gradplus)
                print( "-h Loss: %f" % gradminus)
                print( "Estimated_gradient: %f" % estimated_gradient)
                print( "Backpropagation gradient: %f" % backprop_gradient)
                print( "Relative Error: %f" % relative_error)
                return 
            it.iternext()
               
        print( "Gradient check for parameter %s passed." % (pname))

RNN.gradient_check = gradient_check

In [217]:
grad_check_vocab_size = 100
np.random.seed(10)
model = RNN(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0,1,2,3], [1,2,3,4])

Performing gradient check for parameter U with size 1000.
Gradient check for parameter U passed.
Performing gradient check for parameter V with size 1000.
Gradient check for parameter V passed.
Performing gradient check for parameter W with size 100.
Gradient check for parameter W passed.


In [218]:
# Performs one step of SGD.
def numpy_sdg_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)
    
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW

RNN.sgd_step = numpy_sdg_step

In [219]:
# Outer SGD Loop
# - model: The RNN model instance
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs

def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    
    for epoch in range(nepoch):
        
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print ("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))
            
            # Adjust the learning rate if loss increases
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5  
                print ("Setting learning rate to %f" % learning_rate)
            sys.stdout.flush()
            
        # For each training example...
        for i in range(len(y_train)):
            
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1

In [220]:
vocabulary_size

4089

In [221]:
np.random.seed(17)
model = RNN(vocabulary_size)
%timeit model.sgd_step(X_train2[10000], y_train2[10000], 0.005)

6.84 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [222]:
np.random.seed(17)

# Train on a small subset of the data to see what happens
model = RNN(vocabulary_size)
losses = train_with_sgd(model, X_train2[10000:10100], y_train2[10000:10100], nepoch=10, evaluate_loss_after=1)

2024-10-16 06:58:02: Loss after num_examples_seen=0 epoch=0: 8.317129


  N = np.sum((len(y_i) for y_i in y))


2024-10-16 06:58:03: Loss after num_examples_seen=100 epoch=1: 8.227105
2024-10-16 06:58:04: Loss after num_examples_seen=200 epoch=2: 8.092797
2024-10-16 06:58:05: Loss after num_examples_seen=300 epoch=3: 7.845445
2024-10-16 06:58:06: Loss after num_examples_seen=400 epoch=4: 7.435504
2024-10-16 06:58:06: Loss after num_examples_seen=500 epoch=5: 6.996474
2024-10-16 06:58:07: Loss after num_examples_seen=600 epoch=6: 6.589343
2024-10-16 06:58:08: Loss after num_examples_seen=700 epoch=7: 6.282422
2024-10-16 06:58:09: Loss after num_examples_seen=800 epoch=8: 6.043619
2024-10-16 06:58:10: Loss after num_examples_seen=900 epoch=9: 5.848634


In [223]:
len(index_to_word)

4090

In [224]:
def generate_sentence(model, senten_max_length):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    
    # Repeat until we get an end token and keep our sentences to less than senten_max_length words for now
    while (not new_sentence[-1] == word_to_index[sentence_end_token]) and len(new_sentence) < senten_max_length:
        next_word_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            
            # correcting for abnormalities
            #abs_v = [-i if i <0 else i for i in next_word_probs[-1][0]] 
            #nrm_v = [i/sum(abs_v) for i in abs_v] 
            #abs_v = [0 if i <0 else i for i in next_word_probs[-1][0]] 
            #abs_v = [0 if i <0 else i for i in next_word_probs[0][-1]] 
            #nrm_v = [i/sum(abs_v) for i in abs_v] 
            #samples = np.random.multinomial(1, nrm_v)
            #sampled_word = np.argmax(samples)
            
            # the secret sauce of creativity
            samples = np.random.multinomial(1, next_word_probs[0][-1])
            
            sampled_word = np.argmax(samples)
            
        new_sentence.append(sampled_word)

    print(new_sentence)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    #print(sentence_str)
    return sentence_str

In [225]:
senten_max_length = 20
generate_sentence(model, senten_max_length)

[0, 2112, 4036, 3822, 2563, 3310, 2228, 470, 2603, 1158, 2260, 423, 2105, 513, 957, 2765, 4041, 273, 3329, 3473]


['displayed',
 'widespread',
 'vision',
 'untouchable',
 'non-eternal',
 'wwwgutenbergorg/donate',
 'wont',
 'afar',
 'instead',
 'glow',
 'head',
 'healing',
 'divine',
 'thin',
 'drank',
 'outdated',
 'night',
 'disappointments']

In [226]:
num_sentences = 10
senten_min_length = 7
senten_max_length = 20

for i in range(num_sentences):
    sent = generate_sentence(model, senten_max_length)
    print (" ".join(sent))

[0, 1799, 2368, 1543, 527, 9, 3976, 23, 4, 3118, 4, 289, 10, 2847, 968, 7, 1105, 6, 9, 2790]
enticed thighs general magic his remaining not and disease and walk in approached writing a pondered of his
[0, 3311, 910, 442, 3629, 3635, 4062, 611, 2672, 3916, 1543, 808, 2511, 958, 2397, 3642, 2686, 1908, 1463, 3056]
enchantment beard nirvana immensely loyalty swamp including seeks e-mail general behold eagerness owned dripping unrelenting anew thinks seeking
[0, 3010, 530, 3792, 3300, 924, 3298, 2803, 2308, 4026, 98, 1, 351, 3528, 3539, 449, 12, 14, 1725, 6]
attempts verses blow sakyamunI mocking apparently shuddering wove 1500 very the along spare heed past was it ears
[0, 2615, 2047, 207, 1330, 2119, 2993, 3043, 754, 1247, 560, 3657, 817, 3543, 2778, 2585, 3270, 2481, 1367, 1726]
demonstrated wonderfully youre spreading tenderness profit complain suffered eye thank odds mute accusation guests patiently prayers rebirths politely
[0, 647, 2970, 1917, 2926, 1051, 2714, 1673, 3171, 802, 2834

### 5. Model Training
In this section, we train the model using categorical cross-entropy loss and the Adam optimizer. The goal is to minimize the loss over multiple epochs and improve the accuracy.

Training Details:
Loss function: Categorical Cross-Entropy
Optimizer: Adam
Batch Size: 32


In [227]:
np.random.seed(17)

# Train on a small subset of the data to see what happens
model = RNN(vocabulary_size)
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

  N = np.sum((len(y_i) for y_i in y))


2024-10-16 06:59:49: Loss after num_examples_seen=0 epoch=0: 8.317905
2024-10-16 07:17:05: Loss after num_examples_seen=48950 epoch=1: 5.360205
2024-10-16 07:35:34: Loss after num_examples_seen=97900 epoch=2: 5.340448
2024-10-16 07:52:52: Loss after num_examples_seen=146850 epoch=3: 5.297493
2024-10-16 08:10:56: Loss after num_examples_seen=195800 epoch=4: 5.294335


In [230]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

  N = np.sum((len(y_i) for y_i in y))


2024-10-16 08:31:39: Loss after num_examples_seen=0 epoch=0: 8.317905
2024-10-16 08:48:06: Loss after num_examples_seen=48950 epoch=1: 5.360205
2024-10-16 09:04:23: Loss after num_examples_seen=97900 epoch=2: 5.340448
2024-10-16 09:21:47: Loss after num_examples_seen=146850 epoch=3: 5.297493
2024-10-16 09:38:32: Loss after num_examples_seen=195800 epoch=4: 5.294335


In [231]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

  N = np.sum((len(y_i) for y_i in y))


2024-10-16 09:55:45: Loss after num_examples_seen=0 epoch=0: 5.152209
2024-10-16 10:12:25: Loss after num_examples_seen=48950 epoch=1: 5.154802
Setting learning rate to 0.002500
2024-10-16 10:30:41: Loss after num_examples_seen=97900 epoch=2: 4.991009
2024-10-16 10:48:20: Loss after num_examples_seen=146850 epoch=3: 4.963550
2024-10-16 11:08:07: Loss after num_examples_seen=195800 epoch=4: 4.919193


In [232]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

  N = np.sum((len(y_i) for y_i in y))


2024-10-16 11:24:53: Loss after num_examples_seen=0 epoch=0: 4.936287
2024-10-16 11:39:36: Loss after num_examples_seen=48950 epoch=1: 5.093147
Setting learning rate to 0.002500
2024-10-16 12:35:15: Loss after num_examples_seen=97900 epoch=2: 4.892396
2024-10-16 12:49:06: Loss after num_examples_seen=146850 epoch=3: 4.876551
2024-10-16 13:05:37: Loss after num_examples_seen=195800 epoch=4: 4.888263
Setting learning rate to 0.001250


In [233]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

  N = np.sum((len(y_i) for y_i in y))


2024-10-16 13:30:10: Loss after num_examples_seen=0 epoch=0: 4.777440
2024-10-16 13:43:46: Loss after num_examples_seen=48950 epoch=1: 4.988670
Setting learning rate to 0.002500
2024-10-16 14:00:59: Loss after num_examples_seen=97900 epoch=2: 4.861099
2024-10-16 14:33:38: Loss after num_examples_seen=146850 epoch=3: 4.863277
Setting learning rate to 0.001250
2024-10-16 15:35:27: Loss after num_examples_seen=195800 epoch=4: 4.771211


We ran the below cell but by mistake it was converted to mardown and we lost the running tail.

In [None]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=40, evaluate_loss_after=1)

In [None]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

In [None]:
import time
for i in range(10):
    losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)
    time.sleep(15 * 60)

  N = np.sum((len(y_i) for y_i in y))


In [None]:
losses = train_with_sgd(model, X_train2, y_train2, nepoch=5, evaluate_loss_after=1)

### 6. Generating Text
Once the model is trained, we can use it to generate text. By providing a seed (e.g., a phrase), the model predicts the next word iteratively to form a sequence.

Generation Process:
Provide a seed text.
Generate the next word using the model’s prediction.
Append the new word to the seed and continue.


In [228]:
def generate_sentence(model, senten_max_length):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    
    # Repeat until we get an end token and keep our sentences to less than senten_max_length words for now
    while (not new_sentence[-1] == word_to_index[sentence_end_token]) and len(new_sentence) < senten_max_length:
        next_words_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            
            #print(next_word_probs[0][-1])
            samples = np.random.multinomial(1, next_words_probs[0][-1])
            sampled_word = np.argmax(samples)
            
        new_sentence.append(sampled_word)

    #print(new_sentence)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    #print(sentence_str)
    return sentence_str

In [229]:
num_sentences = 20
senten_min_length = 7
senten_max_length = 20

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model, senten_max_length)
    print (" ".join(sent))

displayed widespread vision untouchable non-eternal wwwgutenbergorg/donate wont afar instead glow head healing divine thin drank outdated night disappointments
enticed thighs general magic ferried worries letter once his fate and like become being the woman back SENTENCE_EN
simpler branches sink visible cheek enchantment beard nirvana immensely loyalty swamp including seeks e-mail general behold eagerness owned
unrelenting anew thinks seeking balls attempts verses blow sakyamunI mocking apparently shuddering wove 1500 very all about or
spare fragrant isolated collect discard wonderfully youre spreading tenderness profit complain suffered eye thank odds mute accusation guests
prayers rebirths politely believers bid pit allow opening slipped clouds shaggy thoughtful fast stairs hurriedly praised formats pure
halted willingly teeth while such frightened distorted harmless self-castigation soil lured solicit encompassed ” “ little do heard
ablutions spit book soul dreamt the community thir

### 7. Conclusion
We successfully built and trained an RNN model for text generation using trigrams. The model generates text based on the patterns it learned from the input corpus. Further improvements can be made by experimenting with different architectures and hyperparameters.

