# Name of the Student : Aagam Manish Shah

# USC ID Number : 8791018480

# 7.1 Generative Models for Text

## (a) In this problem, we build a generative model that mimics the writing style of the prominent British mathematician, philosopher, prolific writer, and political activist Bertrand Russell.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import tensorflow as tf
from tensorflow import keras
import copy
import string

from sklearn.preprocessing import MinMaxScaler
import operator

## (b) Download the following books from Project Gutenberg http://www.gutenberg.org/ebooks/author/355 in text format:
## i. The Problems of Philosophy
## ii. The Analysis of Mind
## iii. Mysticism and Logic and Other Essays
## iv. Our Knowledge of the External World as a Field for Scientific Method in Philosophy
## Project Gutenberg adds a standard header and footer to each book and this is not part of the original text. Open the file in a text editor and delete the header and footer.
## The header is obvious and ends with the text:
## *** START OF THIS PROJECT GUTENBERG EBOOK AN INQUIRY INTO MEANING AND TRUTH ***
## The footer is all of the text after the line of text that says: THE END
## To have a better model, it is strongly recommended that you download the following books from The Library of Congress https://archive.org and convert them to text files:
## i. The History of Western Philosophy
## https://archive.org/details/westernphilosophy4
## ii. The Analysis of Matter
## https://archive.org/details/in.ernet.dli.2015.221533
## iii. An Inquiry into Meaning and Truth
## https://archive.org/details/BertrandRussell-AnInquaryIntoMeaningAndTruth
## Try to only use the text of the books and throw away unwanted text before and after the text, although in a large corpus, these are considered as noise and should not make big problems.
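
The manual header and footer removal described above could also be scripted. The following is only a minimal sketch: the helper name `strip_gutenberg` and the sample string are hypothetical, and it assumes the header ends on a line beginning with `*** START OF` and that the footer begins at the last line reading exactly `THE END`, as the task text describes.

```python
def strip_gutenberg(text, start_marker="*** START OF", end_marker="THE END"):
    """Return the text between the header's last line and the end marker line."""
    lines = text.splitlines()
    start = 0
    end = len(lines)
    # The header ends with the first line that begins with the start marker.
    for i, line in enumerate(lines):
        if line.strip().startswith(start_marker):
            start = i + 1
            break
    # Search backwards so an earlier "THE END" inside the body is not
    # mistaken for the footer boundary.
    for j in range(len(lines) - 1, start - 1, -1):
        if lines[j].strip() == end_marker:
            end = j  # drop the marker line and everything after it
            break
    return "\n".join(lines[start:end]).strip()


# Hypothetical miniature "book", for illustration only.
sample = (
    "License text\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK X ***\n"
    "Chapter 1\n"
    "Body.\n"
    "THE END\n"
    "Footer text"
)
body = strip_gutenberg(sample)
print(body)
```

If a marker is missing, the sketch keeps the text from the start or through the end of the file, which matches the remark that leftover material in a large corpus mostly acts as noise.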

In [2]:
for module in keras, sklearn, tf:
  print("{} version is {}".format(module.__name__, module.__version__))

tensorflow.keras version is 2.3.0-tf
sklearn version is 0.22.2.post1
tensorflow version is 2.2.0


## (c) LSTM: Train an LSTM to mimic Russell's style and thoughts:

### i. Concatenate your text files to create a corpus of Russell's writings.

In [3]:
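def generate_corpus(file_path, output_file_name='Corpus.txt'):
    # Concatenate every text file in file_path into a single corpus file.
    datafiles = sorted(os.listdir(file_path))
    with open(output_file_name, 'w') as outputfile:
        for fname in datafiles:
            if fname == output_file_name:
                continue  # skip a corpus left over from a previous run
            # Join with file_path so the function works from any working directory.
            with open(os.path.join(file_path, fname), encoding="ascii", errors='ignore') as infile:
                for line in infile:
                    outputfile.write(line)
    return output_file_name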

In [4]:
import os
os.chdir('/Data/Book')

In [5]:
outputfile = generate_corpus('/Data/Book')
contents = open(outputfile, 'r').read()
print("The length of corpus of Russell's writing is: ", len(contents))

The length of corpus of Russell's writing is:  5095252


### ii. Use a character-level representation for this model by using extended ASCII that has N = 256 characters. Each character will be encoded into an integer using its ASCII code. Rescale the integers to the range [0, 1], because LSTM uses a sigmoid activation function. LSTM will receive the rescaled integers as its input.

In [6]:
def generate_char_set(original_text, ignore_case=False, remove_punctuation=False):
    # Return the cleaned text together with the set of distinct characters it contains.
    clipped_text = original_text.lower() if ignore_case else original_text
    if remove_punctuation:
        clipped_text = clipped_text.translate(str.maketrans('', '', string.punctuation))
    char_set = set(clipped_text)
    return clipped_text, char_set

In [7]:
clipped_data, Character_set = generate_char_set(contents, ignore_case=True, remove_punctuation=True)
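# Despite the name, the values are indices into the sorted set of observed
# characters (0 to 37 here), not raw ASCII codes.
char2ascii_Dict = dict()
for index, char in enumerate(sorted(Character_set)):
    char2ascii_Dict[char] = index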
print("""--------------The original dictionary of distinct characters with its ASCII values: --------------\n""", sorted(char2ascii_Dict.items(), key=operator.itemgetter(1)))

rescaled_char_Dict = dict()
rescaled_values = MinMaxScaler().fit_transform(np.array(list(char2ascii_Dict.values())).reshape(-1, 1))
for index in range(len(rescaled_values)):
    rescaled_char_Dict[list(char2ascii_Dict.keys())[index]] = rescaled_values[index][0]
print("""--------------The rescaled dictionary of distinct characters with its ASCII values: --------------\n""", sorted(rescaled_char_Dict.items(), key=operator.itemgetter(1)))

ascii2char_Dict = {v:k for k, v in char2ascii_Dict.items()}
print("""--------------The reversed dictionary of distinct characters with its ASCII values: --------------\n""", sorted(ascii2char_Dict.items(), key=operator.itemgetter(1)))

--------------The original dictionary of distinct characters with its ASCII values: --------------
 [('\n', 0), (' ', 1), ('0', 2), ('1', 3), ('2', 4), ('3', 5), ('4', 6), ('5', 7), ('6', 8), ('7', 9), ('8', 10), ('9', 11), ('a', 12), ('b', 13), ('c', 14), ('d', 15), ('e', 16), ('f', 17), ('g', 18), ('h', 19), ('i', 20), ('j', 21), ('k', 22), ('l', 23), ('m', 24), ('n', 25), ('o', 26), ('p', 27), ('q', 28), ('r', 29), ('s', 30), ('t', 31), ('u', 32), ('v', 33), ('w', 34), ('x', 35), ('y', 36), ('z', 37)]
--------------The rescaled dictionary of distinct characters with its ASCII values: --------------
 [('\n', 0.0), (' ', 0.02702702702702703), ('0', 0.05405405405405406), ('1', 0.08108108108108109), ('2', 0.10810810810810811), ('3', 0.13513513513513514), ('4', 0.16216216216216217), ('5', 0.1891891891891892), ('6', 0.21621621621621623), ('7', 0.24324324324324326), ('8', 0.2702702702702703), ('9', 0.2972972972972973), ('a', 0.32432432432432434), ('b', 0.35135135135135137), ('c', 0.3783783
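
The cell above encodes each character as its index in the sorted set of observed characters. For comparison, the literal reading of step ii (ASCII code divided by N - 1 = 255) might look like the sketch below. The helper names `encode_ascii` and `decode_ascii` are hypothetical and are not part of the notebook.

```python
import numpy as np

N = 256  # size of the extended ASCII alphabet named in the task


def encode_ascii(text):
    """Map each character to its ASCII code and rescale the codes to [0, 1]."""
    codes = np.array([ord(ch) for ch in text], dtype=np.float64)
    if codes.size and codes.max() >= N:
        raise ValueError("text contains characters outside extended ASCII")
    return codes / (N - 1)


def decode_ascii(scaled):
    """Invert encode_ascii by rounding back to the nearest integer code."""
    return "".join(chr(int(round(float(v) * (N - 1)))) for v in scaled)


scaled = encode_ascii("mind")
print(scaled)
print(decode_ascii(scaled))
```

With only 38 distinct characters after lowercasing and punctuation removal, the index-based encoding spreads the symbols evenly over [0, 1], whereas raw ASCII codes would cluster the letters in a narrow band.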

### iii. Choose a window size, e.g., W = 100.

### iv. Inputs to the network will be the first W-1 = 99 characters of each sequence, and the output of the network will be the Wth character of the sequence. Basically, we are training the network to predict each character using the 99 characters that precede it. Slide the window in strides of S = 1 on the text. For example, if W = 5 and S = 1 and we want to train the network with the sequence ABRACADABRA, the first input to the network will be ABRA and the corresponding output will be C. The second input will be BRAC and the second output will be A, etc.

In [8]:
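window_size = 99  # number of input characters (W - 1 with W = 100)
X_data = []
y_data = []
for i in range(0, len(clipped_data) - window_size, 1):
    temp = clipped_data[i:i + window_size]    # 99 input characters
    end_char = clipped_data[i + window_size]  # the 100th character is the target
    X_data.append([rescaled_char_Dict[char] for char in temp])  # inputs use rescaled values
    y_data.append(char2ascii_Dict[end_char])                    # targets use integer class indices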

In [9]:
number_of_blocks = len(X_data)
X_data = np.reshape(X_data, (number_of_blocks, window_size, 1))
Y_data = keras.utils.to_categorical(y_data)
print(X_data.shape)
print(Y_data.shape)

(4942256, 99, 1)
(4942256, 38)
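
The windowing rule from step iv can be checked on the ABRACADABRA example with W = 5 and S = 1. This is a small sketch with a hypothetical helper `make_windows`; it works on raw characters rather than the rescaled values used in the cells above.

```python
def make_windows(text, W=5, S=1):
    """Split text into (input, target) pairs: W-1 input characters, then the W-th."""
    pairs = []
    for i in range(0, len(text) - W + 1, S):
        window = text[i:i + W]
        pairs.append((window[:-1], window[-1]))
    return pairs


pairs = make_windows("ABRACADABRA", W=5, S=1)
print(pairs[:2])   # [('ABRA', 'C'), ('BRAC', 'A')]
print(len(pairs))  # 7
```

A text of length L yields L - W + 1 windows at stride 1, which is the same count produced by the loop bound `len(clipped_data) - window_size` when `window_size` is W - 1.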


### v. Note that the output has to be encoded using a one-hot encoding scheme with N = 256 (or less) elements. This means that the network reads integers, but outputs a vector of N = 256 (or less) elements.

### vi. Use a single hidden layer for the LSTM with N = 256 (or less) memory units.

### vii. Use a Softmax output layer to yield a probability prediction for each of the characters between 0 and 1. This is actually a character classification problem with N classes. Choose log loss (cross entropy) as the objective function for the network (research what it means).
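
As a rough illustration of what the softmax layer and the log loss compute for a single training sample, here is a NumPy sketch. The three-class scores are hypothetical values chosen for the example; Keras performs the equivalent computation internally over batches.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(one_hot_target, probs, eps=1e-12):
    """Log loss for one sample: minus the log-probability of the true class."""
    return -np.sum(one_hot_target * np.log(probs + eps))


logits = np.array([2.0, 1.0, 0.1])  # hypothetical scores for 3 characters
target = np.array([1.0, 0.0, 0.0])  # one-hot: the true next character is class 0
probs = softmax(logits)
loss = cross_entropy(target, probs)
print(probs, loss)
```

Because the target is one-hot, only the probability assigned to the correct character contributes to the loss, so the loss approaches 0 as that probability approaches 1.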

In [10]:
memory_units = 256
model = keras.models.Sequential([keras.layers.LSTM(units=memory_units,input_shape=(X_data.shape[1], X_data.shape[2])),
                                 keras.layers.Dense(Y_data.shape[1], activation="softmax"),])
model.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(1e-4))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 256)               264192    
_________________________________________________________________
dense (Dense)                (None, 38)                9766      
Total params: 273,958
Trainable params: 273,958
Non-trainable params: 0
_________________________________________________________________
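
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gate blocks, each with input weights, recurrent weights, and a bias vector; the helper names are hypothetical.

```python
def lstm_params(units, input_dim):
    # Four gate blocks (input, forget, cell candidate, output), each with
    # input weights, recurrent weights, and one bias per unit.
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, outputs):
    # One weight per input-output pair plus one bias per output.
    return inputs * outputs + outputs


lstm = lstm_params(units=256, input_dim=1)
dense = dense_params(inputs=256, outputs=38)
print(lstm, dense, lstm + dense)  # 264192 9766 273958
```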


### viii. We do not use a test dataset. We are using the whole training dataset to learn the probability of each character in a sequence. We are not seeking a very accurate model. Instead, we are interested in a generalization of the dataset that can mimic the gist of the text.

### ix. Choose a reasonable number of epochs for training, considering your computational power (e.g., 30, although the network will need more epochs to yield a better model).

In [11]:
number_of_epochs = 30
output_directory = "./text_generation_checkpoint"
os.makedirs(output_directory, exist_ok=True)

# Save the weights after every epoch; the epoch number is part of the file name.
checkpoint_prefix = os.path.join(output_directory, 'ck_{epoch:02d}.hdf5')
checkpoint_callback = keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix, monitor='loss', save_weights_only=True, mode='min')
# There is no validation split, so early stopping has to monitor the training loss.
early_stopping = keras.callbacks.EarlyStopping(monitor='loss', patience=5, min_delta=1e-4)
history = model.fit(X_data, Y_data, epochs=number_of_epochs, batch_size=128, callbacks=[checkpoint_callback, early_stopping])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


### x. Use model checkpointing to save the network weights each time an improvement in loss is observed at the end of an epoch. Find the best set of weights in terms of loss.

In [14]:
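# A checkpoint is written after every epoch. ck_30 is taken here as the
# lowest-loss checkpoint, which assumes the training loss was still
# decreasing at the final epoch.
minimum_loss_checkpoint = os.path.join(output_directory, 'ck_30.hdf5')
model.load_weights(minimum_loss_checkpoint)
model.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(1e-4))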
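
Rather than hard-coding the last epoch, the best checkpoint could be picked from the recorded losses. The sketch below uses a hypothetical helper `best_checkpoint` and made-up loss values purely for illustration; in the notebook the list would presumably come from `history.history['loss']`.

```python
import os


def best_checkpoint(losses, directory, pattern="ck_{epoch:02d}.hdf5"):
    """Return (epoch, path) for the epoch with the lowest recorded loss."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    epoch = best_index + 1  # checkpoint file names number epochs from 1
    return epoch, os.path.join(directory, pattern.format(epoch=epoch))


# Hypothetical loss values, for illustration only (not the notebook's results).
example_losses = [2.91, 2.40, 2.15, 2.17, 2.08]
epoch, path = best_checkpoint(example_losses, "./text_generation_checkpoint")
print(epoch, path)
```

The returned path could then be passed to `model.load_weights` in place of the fixed `ck_30.hdf5` file name.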

### xi. Use the network with the best weights to generate 1000 characters, using the following text as initialization of the network:
### There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object.

In [15]:
initial_Text = 'There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object.'
initial_Text = initial_Text.translate(str.maketrans('', '', string.punctuation))

In [18]:
generate_Text = copy.copy(initial_Text.lower())
# Seed the network with the last 99 characters of the prompt.
encoded_List = [rescaled_char_Dict[char] for char in generate_Text][-99:]
for _ in range(1000):
    text = np.reshape(encoded_List, (1, len(encoded_List), 1))
    pred = model.predict(text)
    char_Index = np.argmax(pred)  # greedy choice: always take the most probable character
    char = ascii2char_Dict[char_Index]
    generate_Text += char
    # Slide the window: append the new character and drop the oldest one.
    encoded_List.append(rescaled_char_Dict[char])
    encoded_List = encoded_List[1:]
print(generate_Text)

there are those who take mental phenomena naively just as they would physical phenomena this school of psychologists tends not to emphasize the object and the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of the semse of 
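
The generated text falls into a repeating loop, which is a common outcome of greedy decoding with `np.argmax`: once the 99-character window repeats, the prediction repeats too. One alternative, which is not the method used in this notebook, is to sample the next character from the predicted distribution, optionally rescaled by a temperature. The helper below is a hypothetical sketch of that idea.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw a class index from probs after sharpening or flattening them."""
    if rng is None:
        rng = np.random.default_rng()
    # Work in log space; a temperature below 1 sharpens the distribution,
    # a temperature above 1 flattens it.
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


rng = np.random.default_rng(0)
example_probs = [0.1, 0.7, 0.2]  # hypothetical softmax output over 3 characters
print(sample_with_temperature(example_probs, temperature=1.0, rng=rng))
print(sample_with_temperature(example_probs, temperature=0.01, rng=rng))
```

In the generation loop, this would stand in for the `np.argmax(pred)` call, for example `sample_with_temperature(pred[0], temperature=0.5)`. Whether it yields more readable text with this particular model is untested here.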