# CST-435 Recurrent Neural Network Project

## Team: 
Gabriel Aracena, Aaron Galicia, Joshua Canode

## Project Description

This assignment accomplishes two goals. It demonstrates how neural networks can be used in forecasting and how they can be used in practical applications involving text (e.g., completing a search request on Google).

Using a large set of texts for training, build an RNN that suggests the next word in a sentence (sequential learning). Consider the entire sentence when completing the sentence instead of words by themselves.

## Problem Statement

The goal of this project is to create a Recurrent Neural Network (RNN) model that predicts the next word in a given sentence, so that sentences can be completed in a coherent and contextually relevant manner. The broader aim is to demonstrate the capabilities of neural networks in text prediction, particularly for applications like search engine queries, chatbots, or predictive text input systems.
    
## Data

We chose to use a synthetic dataset generated by ChatGPT. The data contains a variety of short sentences. To introduce greater variety and enhance the diversity of the dataset, these sentences are categorized into questions, first-person perspectives, third-person perspectives, and different types of statements.


## Algorithm of the solution 

In the development of our RNN model, we followed a structured approach encompassing several pivotal phases:

1. **Data Collection and Preprocessing**: This involved gathering and preparing the dataset, including punctuation removal, tokenization, and integer encoding of words.

2. **Many-to-One Sequence Mapping**: We formulated the problem as a many-to-one sequence mapping task, focusing on predicting the next word in a given sequence.

3. **Model Construction with Keras**: The RNN model was built using Keras. In the notebook cells below it consists of three layers:
    - An embedding layer initialized with pretrained GloVe word embeddings (300-dimensional, frozen during training)
    - A GRU layer with 100 units to capture sequential dependencies
    - A dense softmax output layer for predicting the next word

4. **Embedding Quality Assessment**: We evaluated the quality of these embeddings by analyzing cosine similarities between word vectors.

5. **Training with Model Checkpoint and Early Stopping**: To optimize training, we implemented techniques such as Model Checkpoint and Early Stopping.

6. **Text Generation Function**: A function for generating text predictions was developed, allowing the model to complete sentences.
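
Step 4 above can be illustrated with a small sketch. It is not the code we ran; it only shows how cosine similarity ranks word vectors. The function names `cosine_similarity` and `most_similar` and the three-dimensional `toy_vectors` are made up for illustration (real GloVe vectors have 300 dimensions).

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Words without a pretrained vector are all-zeros; treat them as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, top_n=3):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Made-up vectors for illustration only; they are not taken from GloVe.
toy_vectors = {
    "win": [0.9, 0.1, 0.0],
    "victory": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
    "unknown": [0.0, 0.0, 0.0],
}

for neighbour, score in most_similar("win", toy_vectors):
    print(f"{neighbour}: {score:.3f}")
```

In the toy data, "victory" ranks closest to "win" because the two vectors point in nearly the same direction. The same functions should work on a word-to-vector dictionary such as the `embedding_index` built in the training cell, although scanning the full GloVe vocabulary this way would be slow.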




## Analysis of the findings  

From manual evaluation, the results are fairly good: the grammar and sentence structure of the generated sentences are mostly sound.

With the input text "*But remember, it's not just about winning; it's about the*", we got sentence completions such as:
- *But remember, it's not just about winning; it's about the new challenges as opportunities to grow and learn.*
- *But remember, it's not just about winning; it's about the friends we make along the way.*

As expected, with a higher temperature the resulting sentences appear more interesting and creative but make a little less sense:
- *But remember, it's not just about winning; it's about the force her a jedi that how be that legend of the silence comes questions?*


The text generation results at various temperature settings demonstrate the impact of temperature on the diversity and coherence of generated text. Lower temperatures produce highly deterministic text closely tied to the input, while higher temperatures lead to more creative but potentially less coherent outputs. The choice of temperature depends on the desired balance between maintaining context and introducing novelty, making it a critical parameter for controlling the behavior of text generation models.
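
To make the effect of temperature concrete, here is a minimal sketch of the same rescaling that the notebook's `sample` function applies before sampling, written in plain Python. The helper name `apply_temperature` and the four-word distribution `base` are hypothetical and chosen only for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by dividing its log by the temperature."""
    logits = [math.log(p + 1e-7) / temperature for p in probs]
    peak = max(logits)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical softmax output over a 4-word vocabulary.
base = [0.6, 0.25, 0.1, 0.05]

for t in (0.2, 1.0, 2.0):
    scaled = apply_temperature(base, t)
    print(t, [round(p, 3) for p in scaled])
```

At a temperature of 0.2 nearly all probability mass moves to the most likely word, which is why low temperatures give near-deterministic completions. At 1.0 the distribution is essentially unchanged, and at 2.0 it flattens, so unlikely words are sampled more often.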



### Train Model ###

In [2]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense
from keras.callbacks import EarlyStopping

# Read the source text from a file
file_path = 'source_text.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    source_text = file.read()

# Custom tokenizer: keep . , ! ? and ' attached to words (the default filter strips most of them);
# other symbols, including double quotes, are still filtered out
tokenizer = Tokenizer(filters='"#$%&()*+-/:;<=>@[\\]^_{|}~\t\n')
tokenizer.fit_on_texts([source_text])
sequence = tokenizer.texts_to_sequences([source_text])[0]
total_words = len(tokenizer.word_index) + 1

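# Generate input sequences for training: every prefix of the full token stream
# (the whole corpus is treated as one long sequence, not split per sentence)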
input_sequences = [sequence[:i] for i in range(1, len(sequence))]

# Calculate average sequence length before padding and adjust it if needed
average_sequence_len = np.mean([len(seq) for seq in input_sequences])
max_sequence_len = int(average_sequence_len * 1.5)
print(f"Average sequence length: {average_sequence_len}, Max sequence length after padding: {max_sequence_len}")

# Pad sequences to the same length
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
target_word = to_categorical(sequence[1:], num_classes=total_words)

# Load GloVe embeddings
embedding_index = {}
glove_path = 'GloVe840B/glove.840B.300d.txt'
with open(glove_path, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        try:
            # Ensure that conversion to float is possible
            coefs = np.asarray(values[1:], dtype='float32')
            word = values[0]
            embedding_index[word] = coefs
        except ValueError:
            # Skip the problematic line
            continue

# Create embedding matrix
embedding_dim = 300  # GloVe vector size
embedding_matrix = np.zeros((total_words, embedding_dim))
# Words not found in the embedding index keep their all-zeros row
for word, i in tokenizer.word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Model definition
model = Sequential()
model.add(Embedding(total_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_len, trainable=False))
model.add(GRU(units=100, return_sequences=False))
model.add(Dense(total_words, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Define the batch size
batch_size = 128  # Adjust as needed

# Define Early Stopping callback
early_stopping = EarlyStopping(monitor='loss', patience=5)

# Train the model
model.fit(input_sequences, target_word, epochs=100, verbose=1, callbacks=[early_stopping], batch_size=batch_size)


Average sequence length: 1071.5, Max sequence length after padding: 1607
Epochs 1/100 through 72/100 ran (per-epoch progress output was not captured).

### Save Model ###
Run this cell every time a model is trained and tested, so that each model is saved under its own timestamped file name.

In [3]:
from datetime import datetime
import os

base_dir = 'models'
os.makedirs(base_dir, exist_ok=True)
# After training, create a timestamp or a unique identifier for the model
model_id = datetime.now().strftime("%Y%m%d-%H%M%S")
model_name = f"model_{model_id}.h5"
model_path = os.path.join(base_dir, model_name)

# Save the model to the specified directory
model.save(model_path)
print(f"Model saved to {model_path}")

# Later, to load the model, you can use:
# from keras.models import load_model
# model = load_model(model_path)



Model saved to models\model_20231105-195652.h5
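
Because every saved file name carries a timestamp, the newest model can be located by name alone. The sketch below assumes the `models` directory and the `model_YYYYMMDD-HHMMSS.h5` naming used in the cell above. The helper `latest_model_path` is hypothetical and not part of the notebook.

```python
import os
import re

# File names follow the pattern used in the save cell: model_YYYYMMDD-HHMMSS.h5
MODEL_NAME = re.compile(r"^model_(\d{8}-\d{6})\.h5$")

def latest_model_path(base_dir="models"):
    """Return the path of the most recently timestamped model file, or None."""
    if not os.path.isdir(base_dir):
        return None
    stamped = []
    for name in os.listdir(base_dir):
        match = MODEL_NAME.match(name)
        if match:
            stamped.append((match.group(1), name))
    if not stamped:
        return None
    # This timestamp format sorts chronologically as plain text.
    _, newest = max(stamped)
    return os.path.join(base_dir, newest)

path = latest_model_path()
print(path if path else "No saved model found")
# With Keras installed, the file could then be restored with:
#   from keras.models import load_model
#   model = load_model(path)
```

Note that the tokenizer and `max_sequence_len` are not stored in the `.h5` file, so they would need to be rebuilt or saved separately before generating text with a reloaded model.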


### Generate Next Word ###
This generates only the next word by default. Adjust `num_words` to generate more words.

In [5]:
import numpy as np

def sample(preds, temperature=1.0):
    # Convert predictions to probabilities
    preds = np.asarray(preds).astype('float64')
    
    # Apply temperature scaling
    preds = np.log(preds + 1e-7) / temperature
    exp_preds = np.exp(preds)
    
    # Normalize predictions
    preds = exp_preds / np.sum(exp_preds)
    
    # Sample a single prediction with the probabilities to return a likely next word index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def generate_text_seq(model, tokenizer, text, num_words, max_sequence_len, temperature=1.0):
    input_text = text
    for _ in range(num_words):
        # Convert the input text to a sequence of word indexes
        input_seq = tokenizer.texts_to_sequences([input_text])[0]

        # Pad the sequence to the required length
        input_seq = pad_sequences([input_seq], maxlen=max_sequence_len, padding='pre')

        # Predict the next word index
        predictions = model.predict(input_seq, verbose=0)[0]
        predicted_word_index = sample(predictions, temperature)

        # Convert the index to a word
        predicted_word = tokenizer.index_word.get(predicted_word_index, '')

        # Append the predicted word to the input text
        input_text += ' ' + predicted_word

    return input_text.strip()

# Test the model on a new input sequence with temperature
test_text = "Hello, how are"
num_words = 1
temperature = 0.75  # Adjust the temperature as needed to vary randomness
generated_text = generate_text_seq(model, tokenizer, test_text, num_words, max_sequence_len, temperature)
print(generated_text)


Hello, how are the


### Generate Rest of Sentence ###
Generates words until one ends with a sentence-ending punctuation mark (`.`, `!`, or `?`). A maximum word limit prevents an endless feedback loop.

If the words get stuck in a feedback loop, or you are not happy with the results, try adjusting the temperature before training again. Higher values give more randomness, while lower values are more likely to produce a feedback loop.

In [6]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-7) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def generate_text_seq(model, tokenizer, text, max_sequence_len, temperature=1.0, punctuations=".!?"):
    input_text = text
    word_count = 0
    max_words = 50
    print(input_text, end=' ')
    while True:
        word_count += 1
        if word_count > max_words:
            print("\nError: Max Words Reached Before Punctuation")
            break
        # Convert the input text to a sequence of word indexes
        input_seq = tokenizer.texts_to_sequences([input_text])[0]

        # Pad the sequence to the required length
        input_seq = pad_sequences([input_seq], maxlen=max_sequence_len, padding='pre')

        # Predict the next word index
        predictions = model.predict(input_seq, verbose=0)[0]
        predicted_word_index = sample(predictions, temperature)

        # Convert the index to a word
        predicted_word = tokenizer.index_word.get(predicted_word_index, '')

        # Append the predicted word to the input text and print it
        input_text += ' ' + predicted_word
        print(predicted_word, end=' ', flush=True)

        # Break if the predicted word ends with a punctuation mark or is empty
        if any(predicted_word.endswith(punct) for punct in punctuations) or predicted_word == '':
            break

    print()  # To ensure we move to a new line after the sentence ends
    return input_text.strip()

# Define a list of temperatures. For example, low=0.2, medium=0.7, high=1.2
temperatures = [0.1, 0.2, 0.5, 0.7, 0.8, 0.85, 0.9, 1.2, 2]

# Test the model on a new input sequence with different temperatures
test_text = "But remember, it's not just about winning; it's about the"
print("The input text:", test_text)
for temp in temperatures:
    print(f"\nGenerating with temperature {temp}:")
    generated_text = generate_text_seq(model, tokenizer, test_text, max_sequence_len, temp)


The input text: But remember, it's not just about winning; it's about the

Generating with temperature 0.1:
But remember, it's not just about winning; it's about the new challenges as opportunities to grow and learn. 

Generating with temperature 0.2:
But remember, it's not just about winning; it's about the new student at hogwarts high? 

Generating with temperature 0.5:
But remember, it's not just about winning; it's about the new challenges as opportunities to grow and learn. 

Generating with temperature 0.7:
But remember, it's not just about winning; it's about the friends we make along the way. 

Generating with temperature 0.8:
But remember, it's not just about winning; it's about the friends we make along the way. 

Generating with temperature 0.85:
But remember, it's not just about winning; it's about the friends we make along the way. 

Generating with temperature 0.9:
But remember, it's not just about winning; it's about the new challenges as opportunities to each your way. 

### Notes ###
From Aaron, not ChatGPT-generated, lol:

I believe the training treats the entire training set as one big context instead of training on each sentence as its own context. This is probably why it generates long sentences that don't make much sense. We can change how the training data is tokenized to fix this, but I don't think it's a big deal for the scope of this project. We can just add this thought to the conclusion. The project says to use the embeddings from the GloVe algorithm; this is not implemented yet, nor do I know if Artsi actually wants us to do that.
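
One way to act on this note would be to build the training pairs sentence by sentence, so that no prefix crosses a sentence boundary. The following is a minimal sketch in plain Python, not what the notebook currently does. The helper names (`split_sentences`, `build_vocab`, `sentence_training_pairs`, `pad_pre`) and the sample text are hypothetical. In the notebook, the integer ids would instead come from calling `tokenizer.texts_to_sequences` on each sentence.

```python
import re

def split_sentences(text):
    """Split text into sentences, keeping the closing '.', '!' or '?'.

    Trailing text without end punctuation is ignored in this sketch.
    """
    parts = re.findall(r"[^.!?]+[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]

def build_vocab(sentences):
    """Assign each distinct lowercase token an integer id, starting at 1 (0 is padding)."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def sentence_training_pairs(sentences, vocab):
    """Build (prefix, next_word) pairs that never cross a sentence boundary."""
    pairs = []
    for sentence in sentences:
        ids = [vocab[token] for token in sentence.lower().split()]
        for i in range(1, len(ids)):
            pairs.append((ids[:i], ids[i]))
    return pairs

def pad_pre(seq, length):
    """Left-pad with zeros (or keep only the last `length` ids), like padding='pre'."""
    seq = seq[-length:]
    return [0] * (length - len(seq)) + seq

text = "I like tea. Do you like tea? She likes coffee."
sentences = split_sentences(text)
vocab = build_vocab(sentences)
pairs = sentence_training_pairs(sentences, vocab)
max_len = max(len(prefix) for prefix, _ in pairs)
inputs = [pad_pre(prefix, max_len) for prefix, _ in pairs]
targets = [target for _, target in pairs]
print(len(pairs), "pairs; longest prefix:", max_len)
```

With per-sentence prefixes the padded length is bounded by the longest sentence rather than by a fraction of the whole corpus, which should also shrink the input matrix considerably.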