# LSTM for Trump text generation

![alt text](https://miro.medium.com/v2/resize:fit:984/1*Mb_L_slY9rjMr8-IADHvwg.png)

LSTM stands for Long Short-Term Memory, and it's a type of RNN that is known to be effective for text generation, which is our primary goal.

Advantages:

- Capable of capturing complex sequential dependencies.
- Can retain context and remember information from the entire sequence, making it ideal for contextual situations, such as discussing Trump.

Limits:

- LSTM models require a large amount of data; it's a complex model that demands extensive training to handle context effectively.
- LSTMs are computationally intensive, and tuning hyperparameters has been a real challenge during this project.

Tensorflow will be used for the integration of the LSTM model.

## Dataset:

In [1]:
import os

path = 'Trump Rally Speeches/'
files = os.listdir(path)
files = [path + file for file in files]
 
dates = []
locations = []
years = []
days = []
months = []
speeches_text = []
 
month_ab = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep','Oct', 'Nov', 'Dec']

for file in files:
    for month in month_ab:
        if month in file:
            locations.append(file[file.find('/')+1:file.find(month)])
            break
    for i, mont in enumerate(month_ab):
        if month in file:
            date = file[file.find(month):file.find('.txt')]
            dates.append(date)
            months.append(date[:3])
            days.append(str(date[3]))
            years.append(date[-4:])
            break   
        
for file in files:
    with open(file, 'r') as f:
        speeches_text.append(f.read())     
        
import pandas as pd
 
df = pd.DataFrame({'Speech':files, 'Date':dates, 'Location':locations, 'Year':years, 'Month':months, 'Day':days, 'Speech_Text':speeches_text})

We will use the two preprocessing function defined in preprocessing_pipline.
One only remove the punctuation and do lower case, the other one also remove the stop and rare words.

In [3]:
from preprocessing import preprocessing_pipline

preprocessing = preprocessing_pipline(df['Speech_Text'])
df['Speech_Text_prepro'] = preprocessing.preprocess_light()
df['Speech_Text_prepro2'] = preprocessing.preprocess()


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
thank thank thank vice president pence hes good guy weve done great job together merry christmas mic
Thank you. Thank you. Thank you to Vice President Pence. He's a good guy. We've done a great job tog
thank thank thank vice president pence hes good guy weve done great job together merry christmas mic


### Data 
1. Without preprocessing -> terrible results... not intresting
2. Light preprocessing 
3. Heavy preprocessing (stop words and rare words)

In [4]:
from sklearn.model_selection import train_test_split

text_corpus = [word for speech in df['Speech_Text'].str.split() for word in speech]

# Preprocess text 
text_corpus_prepro = [word for speech in df['Speech_Text_prepro'].str.split() for word in speech]

text_corpus_prepro2 = [word for speech in df['Speech_Text_prepro2'].str.split() for word in speech]


## Preprocess before using LSTM

Input is the sequences and output the next word to guess.  
-We tokenize them, and we applied a padding to both: make them the same lenght
-Convert to numpy array  
-Split into train and test

In [5]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import numpy as np

def preprocess_for_LSTM(text_corpus, lenght_of_sequences=10, test_size=0.2):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text_corpus)

    input = []
    output = []


    for i in range(lenght_of_sequences, len(text_corpus)):
        input.append(text_corpus[i - lenght_of_sequences:i])
        output.append(text_corpus[i])

    input = tokenizer.texts_to_sequences(input)
    output = tokenizer.texts_to_sequences(output)

    # padding
    input = pad_sequences(input, maxlen=lenght_of_sequences, padding='pre')
    output = pad_sequences(output, maxlen=1, padding='pre')
    
    # into numpy arrays
    input = np.array(input)
    output = np.array(output)

    # Split your data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(input, output, test_size=test_size, random_state=42)
    
    return X_train, X_test, y_train, y_test, tokenizer

 

2023-11-05 22:22:39.147370: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-05 22:22:39.182658: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-05 22:22:39.182696: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-05 22:22:39.182736: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-05 22:22:39.190621: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-05 22:22:39.191137: I tensorflow/core/platform/cpu_feature_guard.cc:182] This Tens

In [6]:
X_train, X_test, y_train, y_test, tokenizer = preprocess_for_LSTM(text_corpus_prepro, lenght_of_sequences=10, test_size=0.2)
X_train2, X_test2, y_train2, y_test2, tokenizer2 = preprocess_for_LSTM(text_corpus_prepro2, lenght_of_sequences=10, test_size=0.2)


## The model

In [7]:
import tensorflow as tf
from keras.layers import Embedding, LSTM, Dense

class my_model_LSTM:
    def __init__(self, n, num_unique_words, max_sequence_length):
        """
        Initialize the LSTM model
        
        Args:
            n (int): Number of LSTM units.
            num_unique_words (int): Number of unique words 
            max_sequence_length (int): Maximum sequence length.
        """
        self.n = n
        self.num_unique_words = num_unique_words
        self.max_sequence_length = max_sequence_length
        self.model = self.build_model()
        
    def build_model(self):
        """
        Build and compile the LSTM model
        """
        model = tf.keras.Sequential()
        model.add(Embedding(self.num_unique_words, self.n))
        model.add(LSTM(units=self.n))
        model.add(Dense(units=self.num_unique_words, activation='softmax'))
        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
        return model
     
    def train(self, X_train, y_train, epochs=10):
        self.model.fit(X_train, y_train, epochs=epochs)
         
    def predict(self, test_corpus):
        return self.model.predict(test_corpus)
     
    def evaluate(self, X_test, y_test):
        return self.model.evaluate(X_test, y_test)
     
    def save(self, path):
        self.model.save(path)
         
    def load(self, path):
        self.model = tf.keras.models.load_model(path)
         
    def generate_text(self, seed_text, max_length=100):
        """
        Generate text based on a seed text

        Args:
            seed_text (str): Initial text seed, we will generate what's next
            max_length (int): Size of the text to generate

        Returns:
            str: Generated text
        """
        output_text = seed_text.split()
        prefix = output_text[-(self.n - 1):]

        for _ in range(max_length):
            input = [tokenizer.word_index[word] for word in prefix]
            input = tf.expand_dims(input, 0)

            predictions = self.model(input)
            predictions = tf.squeeze(predictions, 0)

            if len(predictions.shape) == 1:
                predictions = tf.expand_dims(predictions, 0)

            predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
            next_word = tokenizer.index_word[predicted_id]

            output_text.append(next_word)
            prefix = prefix[1:] + [next_word]

        return ' '.join(output_text)
    



### Light preprocessing 

In [43]:
num_unique_words = len(tokenizer.word_index) + 1
max_sequence_length = 40
model_LSTM = my_model_LSTM(50, num_unique_words, max_sequence_length)
history = model_LSTM.train(X_train, y_train, epochs=40)
model_LSTM.save('model_LSTM.h5')

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


In [44]:
predictions = model_LSTM.evaluate(X_test, y_test)
print("LSTM Model Accuracy:", predictions[1])
print("LSTM Model Loss:", predictions[0])

LSTM Model Accuracy: 0.12921422719955444
LSTM Model Loss: 7.493661403656006


In [23]:
model_LSTM = my_model_LSTM(256, num_unique_words, max_sequence_length)
model_LSTM.load('model_LSTM.h5')
seed_text = 'france forcing'
generated_text = model_LSTM.generate_text(seed_text, max_length=10)
print(generated_text)

france forcing mesa lift cleanest exonerated mine ab comeys leaking rising rancher


In [24]:
print("LSTM Model Perplexity:", np.exp(model_LSTM.evaluate(X_test, y_test)[0]))

LSTM Model Perplexity: 1796.618208383362


We face some overfiting... The perplexity is still way better than the baseline one: 1796 compare to 3146

### Light preprocessing 

In [8]:
num_unique_words2 = len(tokenizer2.word_index) + 1
max_sequence_length = 40
model_LSTM2 = my_model_LSTM(100, num_unique_words2, max_sequence_length)
model_LSTM2.train(X_train2, y_train2, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [9]:
model_LSTM2.save('model_LSTM2.h5')

  saving_api.save_model(


In [10]:
predictions = model_LSTM2.evaluate(X_test, y_test)
print("LSTM Model Accuracy:", predictions[1])
print("LSTM Model Loss:", predictions[0])

LSTM Model Accuracy: 0.15773722529411316
LSTM Model Loss: 6.406148433685303


In [17]:
model_LSTM2 = my_model_LSTM(256, num_unique_words2, max_sequence_length)
model_LSTM2.load('model_LSTM.h5')
seed_text = 'thank'
generated_text = model_LSTM2.generate_text(seed_text, max_length=30)
print(generated_text)

thank legacy doesnt son helicopter loophole decide many front politicians will… twin lady mckinley rapids conducted shattering pause sharp suppliers obviously stolen rigid 94 weissman soon add surged erdogan enough theft


In [21]:
print("LSTM Model Perplexity:", np.exp(model_LSTM2.evaluate(X_test, y_test)[0]))

LSTM Model Perplexity: 1796.618208383362


With complete preprocessing, the metrics are better, but the generation is less comprehensive and realistic (no stopwords, rare words). It's a trade-off that can be interesting but not ideal for a text generator.

ncreasing the number of LSTM units takes more time to train but allows for better results.