
# Text Generation with Neural Networks

## Functions for Processing Text

### Reading a File in as a String

In [1]:
import tensorflow as tf

# Check that TensorFlow can see the GPU; training on a GPU is typically much faster than on the CPU.
print(tf.config.list_physical_devices('GPU'))

#physical_devices = tf.config.experimental.list_physical_devices('GPU')
#tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)

def read_file(filepath):
    """Read a text file and return its contents as one string."""
    with open(filepath) as f:
        str_text = f.read()

    return str_text

2023-06-07 01:29:05.518989: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


2023-06-07 01:29:07.775523: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

In [2]:
sentence = read_file('moby_dick_four_chapters.txt')
sentence

'Call me Ishmael.  Some years ago--never mind how long\nprecisely--having little or no money in my purse, and nothing\nparticular to interest me on shore, I thought I would sail about a\nlittle and see the watery part of the world.  It is a way I have of\ndriving off the spleen and regulating the circulation.  Whenever I\nfind myself growing grim about the mouth; whenever it is a damp,\ndrizzly November in my soul; whenever I find myself involuntarily\npausing before coffin warehouses, and bringing up the rear of every\nfuneral I meet; and especially whenever my hypos get such an upper\nhand of me, that it requires a strong moral principle to prevent me\nfrom deliberately stepping into the street, and methodically knocking\npeople\'s hats off--then, I account it high time to get to sea as soon\nas I can.  This is my substitute for pistol and ball.  With a\nphilosophical flourish Cato throws himself upon his sword; I quietly\ntake to the ship.  There is nothing surprising in this.  If t

### Tokenize and Clean Text

A neural language model takes sequential input, and each input word/token must be represented as a number rather than as raw text.
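
To make this concrete, here is a minimal sketch in plain Python of turning a token sequence into integer ids. The sample sentence and the names `tokens`, `vocab_words`, `token_to_id`, and `encoded` are illustrative assumptions, not part of the notebook's pipeline.

```python
# Illustrative only: map each distinct token to an integer id.
tokens = "call me ishmael some years ago call me".split()

# Sort the vocabulary so the ids are reproducible between runs.
vocab_words = sorted(set(tokens))
token_to_id = {word: i for i, word in enumerate(vocab_words)}

# The model sees this integer sequence instead of the raw strings.
encoded = [token_to_id[word] for word in tokens]
print(token_to_id)
print(encoded)
```

Repeated tokens ("call", "me") receive the same id each time they occur, so the sequence order is preserved while the text itself is replaced by numbers.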

In [3]:
import spacy, string

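# Remove punctuation, lowercase the text, and drop non-ASCII characters.
# Note: deleting punctuation outright fuses words joined by dashes
# (e.g. 'ago--never' becomes 'agonever').
def clean_text(txt):
    txt = "".join(t for t in txt if t not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii", "ignore")
    return txt

corpus = clean_text(sentence)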
#print(corpus)
print(f'Length of text: {len(corpus)} characters')

vocab = sorted(set(corpus))
print(f'{len(vocab)} unique characters')


Length of text: 59414 characters
30 unique characters


## Create Sequences of Tokens

In [4]:
# Load spaCy's small English pipeline and parse the cleaned text into tokens
nlp = spacy.load("en_core_web_sm")
doc = nlp(corpus)
#[token.text for token in doc]

#for token in doc:
#    print(token, token.idx)

# Drop stop words using spaCy's default English stop-word list
stopwords = nlp.Defaults.stop_words
lst = []
for token in corpus.split():
    if token.lower() not in stopwords:
        lst.append(token)

# Join the remaining words for display
#print("Original text  : ",corpus)
print("Text after removing stopwords  :   ",' '.join(lst))

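# Split each word into its characters (tf was imported in the first cell)
chars = tf.strings.unicode_split(lst, input_encoding='UTF-8')
print(chars)

# Map each character to an integer id
ids_from_chars = tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None)
ids = ids_from_chars(chars)
print(ids)

# Join the characters back into words to check the round trip
words_joined = tf.strings.reduce_join(chars, axis=-1).numpy()
print(words_joined)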

# Build a word-level vocabulary: lst holds words, so despite their names
# chars, char_to_int, and n_chars refer to words here.
chars = sorted(list(set(lst)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
n_chars = len(lst)
n_vocab = len(chars)
print("Total Words: ", n_chars)
print("Total Vocab: ", n_vocab)

# Slide a window of seq_length words over the text; the word that follows
# each window is the prediction target.
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = lst[i:i + seq_length]
    seq_out = lst[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Text after removing stopwords  :    ishmael years agonever mind long preciselyhaving little money purse particular interest shore thought sail little watery world way driving spleen regulating circulation find growing grim mouth damp drizzly november soul find involuntarily pausing coffin warehouses bringing rear funeral meet especially hypos upper hand requires strong moral principle prevent deliberately stepping street methodically knocking peoples hats offthen account high time sea soon substitute pistol ball philosophical flourish cato throws sword quietly ship surprising knew men degree time cherish nearly feelings ocean insular city manhattoes belted round wharves indian isles coral reefscommerce surrounds surf right left streets waterward extreme downtown battery noble mole washed waves cooled breezes hours previous sight land look crowds watergazers circumambulate city dreamy sabbath afternoon corlears hook coenties slip whitehall northward seeposted like silent sentinels town 

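
The windowing loop in the cell above can be easier to follow on a tiny input. The sketch below is a plain-Python illustration with a made-up six-word list and a window of 3 (the notebook uses `seq_length = 100`); `make_windows` is a hypothetical helper, not something the notebook defines.

```python
def make_windows(words, seq_length):
    """Return (inputs, targets): each input is seq_length consecutive
    words and its target is the word that immediately follows."""
    inputs, targets = [], []
    for i in range(len(words) - seq_length):
        inputs.append(words[i:i + seq_length])
        targets.append(words[i + seq_length])
    return inputs, targets


words = ["call", "me", "ishmael", "some", "years", "ago"]
inputs, targets = make_windows(words, seq_length=3)
for seq_in, seq_out in zip(inputs, targets):
    print(seq_in, "->", seq_out)
```

A list of n words yields n - seq_length patterns, so longer windows give the model more context per example but fewer examples overall.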

# Keras

### Keras Tokenization

In [5]:
from keras.preprocessing.text import Tokenizer
# create the tokenizer
t = Tokenizer()


In [6]:
# fit the tokenizer; each word in lst is treated as its own document
t.fit_on_texts(lst)



In [7]:
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

OrderedDict([('ishmael', 6), ('years', 6), ('agonever', 1), ('mind', 6), ('long', 14), ('preciselyhaving', 1), ('little', 30), ('money', 5), ('purse', 3), ('particular', 6), ('interest', 1), ('shore', 1), ('thought', 26), ('sail', 4), ('watery', 1), ('world', 7), ('way', 12), ('driving', 1), ('spleen', 1), ('regulating', 1), ('circulation', 1), ('find', 3), ('growing', 1), ('grim', 1), ('mouth', 4), ('damp', 3), ('drizzly', 1), ('november', 1), ('soul', 3), ('involuntarily', 2), ('pausing', 2), ('coffin', 4), ('warehouses', 2), ('bringing', 1), ('rear', 1), ('funeral', 1), ('meet', 1), ('especially', 4), ('hypos', 1), ('upper', 1), ('hand', 12), ('requires', 2), ('strong', 3), ('moral', 1), ('principle', 1), ('prevent', 1), ('deliberately', 1), ('stepping', 1), ('street', 4), ('methodically', 1), ('knocking', 1), ('peoples', 1), ('hats', 1), ('offthen', 1), ('account', 3), ('high', 5), ('time', 20), ('sea', 17), ('soon', 9), ('substitute', 1), ('pistol', 1), ('ball', 1), ('philosophica

### Convert to Numpy Matrix

In [8]:
encoded_docs = t.texts_to_matrix(lst, mode='count')
print(encoded_docs)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]


In [9]:
print(encoded_docs.shape)

(4601, 2565)
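
As a rough illustration of what `mode='count'` produces, the sketch below builds the same kind of matrix by hand in plain Python. It assumes, as in the Keras `Tokenizer`, that word indices start at 1 and column 0 is reserved, which is why the matrix has one more column than there are distinct words. The helper `count_matrix` and the toy inputs are hypothetical.

```python
def count_matrix(docs, word_index):
    """One row per document, one column per word index.
    Column 0 stays empty because indices start at 1."""
    n_cols = len(word_index) + 1
    matrix = [[0.0] * n_cols for _ in docs]
    for row, doc in zip(matrix, docs):
        for word in doc.split():
            row[word_index[word]] += 1.0
    return matrix


# Each "document" is a single word, mirroring how lst is passed in above.
docs = ["ishmael", "years", "ishmael"]

# Simplification: number words by first appearance (Keras orders by frequency).
word_index = {}
for doc in docs:
    for word in doc.split():
        word_index.setdefault(word, len(word_index) + 1)

matrix = count_matrix(docs, word_index)
for row in matrix:
    print(row)
```

With one word per row, each row should contain a single 1, which is consistent with the mostly-zero output printed above and with the shape `(4601, 2565)`: 4601 words, and 2564 distinct words plus the reserved column.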


In [10]:
import numpy as np
from keras.utils import to_categorical

# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize the word ids to the range [0, 1)
X = X / float(n_vocab)
# one-hot encode the output variable; num_classes pins the width to the vocabulary size
y = to_categorical(dataY, num_classes=n_vocab)

# Creating an LSTM-Based Model

In [16]:
import keras
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout

In [17]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_1 (LSTM)               (None, 256)               264192    
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_1 (Dense)             (None, 2564)              658948    
                                                                 
Total params: 923,140
Trainable params: 923,140
Non-trainable params: 0
_________________________________________________________________
None


2023-06-07 01:29:38.086744: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
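
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and a fully connected output layer; the function names are hypothetical.

```python
def lstm_params(units, input_dim):
    # Four gates, each with (input_dim + units) weights per unit plus one bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, outputs):
    # One weight per input/output pair plus one bias per output.
    return inputs * outputs + outputs


lstm = lstm_params(units=256, input_dim=1)       # one feature per time step
dense = dense_params(inputs=256, outputs=2564)   # one output per vocabulary word
print(lstm, dense, lstm + dense)
```

The three values agree with the `Param #` column and the total in the summary. The dropout layer adds no parameters.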

### Compiling the Model

In [18]:
from keras.utils import to_categorical


In [19]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


### Training the Model

In [24]:
# fit the model
history = model.fit(X, y, epochs=10, batch_size=32)
print(history.history['loss'])
print(history.history['accuracy'])

# plot loss and accuracy per epoch
from matplotlib import pyplot
pyplot.plot(history.history['loss'])
pyplot.plot(history.history['accuracy'])
pyplot.title('model loss vs accuracy')
pyplot.xlabel('epoch')
pyplot.legend(['loss', 'accuracy'], loc='upper right')
pyplot.show()

2023-06-07 01:32:58.201273: E tensorflow/compiler/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:306] gpu_async_0 cuMemAllocAsync failed to allocate 46162256 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)
 Reported by CUDA: Free memory/Total memory: 19398656/4225695744
2023-06-07 01:32:58.201315: E tensorflow/compiler/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:311] Stats: Limit:                       185729024
InUse:                        64810128
MaxInUse:                     75290128
NumAllocs:                         202
MaxAllocSize:                 46162256
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2023-06-07 01:32:58.201331: E tensorflow/compiler/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:63] Histogram of current allocation: (allocation_size_in_bytes, nb_allocation_of_that_sizes), ...;
2023-06-07 01:32:58.201339: E tensorflow/compiler/xla/stream_executor

InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
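
One plausible reading of this failure, offered as an assumption rather than a diagnosis: the 46162256-byte allocation that failed is exactly the size of the one-hot target matrix `y` if it holds 4501 patterns (4601 words minus `seq_length`) by 2564 classes in 4-byte floats, and the log reports only about 19 MB of free GPU memory. The arithmetic can be checked in plain Python; the variable names are illustrative.

```python
# Figures taken from the outputs above; float32 storage is an assumption.
n_words = 4601          # rows of encoded_docs, i.e. len(lst)
seq_length = 100
n_classes = 2564        # width of the Dense output layer
bytes_per_float32 = 4

n_patterns = n_words - seq_length
y_bytes = n_patterns * n_classes * bytes_per_float32

free_bytes = 19398656   # "Free memory" reported by CUDA in the log
print(y_bytes, y_bytes > free_bytes)
```

If that reading is right, options include freeing GPU memory held by other processes, enabling memory growth as in the commented-out lines of the first cell, or keeping the targets as integer ids and compiling with `loss='sparse_categorical_crossentropy'` so the dense one-hot matrix is never built.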

# Generating New Text

In [25]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.utils import pad_sequences

In [26]:
import sys

# chars holds the word vocabulary, so this maps ids back to words
int_to_char = dict((i, c) for i, c in enumerate(chars))
start = np.random.randint(0, len(dataX)-1)
# copy the seed so that appending below does not modify dataX
pattern = list(dataX[start])

# generate words one at a time, feeding each prediction back in
for i in range(1000):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_char[index]
    # separate the generated words with spaces
    sys.stdout.write(result + " ")
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")

2023-06-07 01:33:00.562498: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:429] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2023-06-07 01:33:00.562553: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at cudnn_rnn_ops.cc:1554 : UNKNOWN: Fail to find the dnn implementation.
2023-06-07 01:33:00.562592: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): UNKNOWN: Fail to find the dnn implementation.
	 [[{{node CudnnRNN}}]]
2023-06-07 01:33:00.562651: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): UNKNOWN: {{function_node __inference_gpu_lstm_with_fallback_4239_specialized_for_sequential_1_lstm_1_PartitionedCall_at___inference_predict_function_442

UnknownError: Graph execution error:

Fail to find the dnn implementation.
	 [[{{node CudnnRNN}}]]
	 [[sequential_1/lstm_1/PartitionedCall]] [Op:__inference_predict_function_4428]
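
Since the cell above failed before producing any text, the loop's logic can be checked separately from the GPU. The following plain-Python sketch uses a hypothetical `toy_predict` in place of `model.predict` and a hypothetical `generate` helper; it only illustrates the rolling-window mechanics, not the trained model's behaviour.

```python
def generate(seed, predict, n_steps):
    """Greedy generation: repeatedly pick the highest-scoring id,
    append it, and drop the oldest id so the window length stays fixed."""
    pattern = list(seed)          # copy so the caller's seed is untouched
    generated = []
    for _ in range(n_steps):
        scores = predict(pattern)
        index = max(range(len(scores)), key=scores.__getitem__)  # argmax
        generated.append(index)
        pattern = pattern[1:] + [index]
    return generated


# Hypothetical stand-in for model.predict: always favours (last id + 1) mod 4.
def toy_predict(pattern):
    scores = [0.0] * 4
    scores[(pattern[-1] + 1) % 4] = 1.0
    return scores


seed = [0, 1, 2]
print(generate(seed, toy_predict, n_steps=5))
```

Because the choice at each step is a plain argmax, the same seed always produces the same continuation.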

# Great Job!