### CS521 Final Project - Next-Word Prediction in Jokes
### Sharmisha Parvathaneni 

##### This notebook details the development and validation of a Long Short-Term Memory (LSTM) model that predicts the next words in a joke. It includes data preprocessing, model training, evaluation, and final predictions.


##### - Dataset URL: [Short Jokes Dataset](https://www.kaggle.com/datasets/abhinavmoudgil95/short-jokes)
##### - Number of Records: 100000
##### - Number of Fields: 2

In [1]:
#import required libraries
!pip install tensorflow
import numpy as np
import pandas as pd
import re
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense



2024-04-14 03:14:50.758905: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-14 03:14:50.763746: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-14 03:14:50.828517: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
#Load the dataset
jokes_df = pd.read_csv('shortjokes.csv')
jokes_df.head()

Unnamed: 0,ID,Joke
0,1,"[me narrating a documentary about narrators] ""..."
1,2,Telling my daughter garlic is good for you. Go...
2,3,I've been going through a really rough period ...
3,4,"If I could have dinner with anyone, dead or al..."
4,5,Two guys walk into a bar. The third guy ducks.


In [3]:
print("Number of records: ", jokes_df.shape[0])
print("Number of fields: ", jokes_df.shape[1])

Number of records:  100000
Number of fields:  2


In [4]:
jokes_df['Joke']

0        [me narrating a documentary about narrators] "...
1        Telling my daughter garlic is good for you. Go...
2        I've been going through a really rough period ...
3        If I could have dinner with anyone, dead or al...
4           Two guys walk into a bar. The third guy ducks.
                               ...                        
99995    Every time I walk into a singles bar I can hea...
99996    how wide is the universe? how long is a piece ...
99997    A man goes to a halloween party wearing nothin...
99998                           I don't Bolivia Peru-v it.
99999    What's the world's longest Ted Talk? How I Met...
Name: Joke, Length: 100000, dtype: object

In [5]:
# Function to clean text
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace non-word characters (punctuation, symbols) with spaces
    text = re.sub(r'\W', ' ', text)
    # Remove single letters that are surrounded by whitespace
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

# Applying the cleaning function to the Joke column
jokes_df['Cleaned_Joke'] = jokes_df['Joke'].apply(clean_text)

jokes_df[['Joke', 'Cleaned_Joke']]


Unnamed: 0,Joke,Cleaned_Joke
0,"[me narrating a documentary about narrators] ""...",me narrating documentary about narrators can ...
1,Telling my daughter garlic is good for you. Go...,telling my daughter garlic is good for you goo...
2,I've been going through a really rough period ...,i ve been going through really rough period at...
3,"If I could have dinner with anyone, dead or al...",if could have dinner with anyone dead or alive...
4,Two guys walk into a bar. The third guy ducks.,two guys walk into bar the third guy ducks
...,...,...
99995,Every time I walk into a singles bar I can hea...,every time walk into singles bar can hear mom ...
99996,how wide is the universe? how long is a piece ...,how wide is the universe how long is piece of ...
99997,A man goes to a halloween party wearing nothin...,a man goes to halloween party wearing nothing ...
99998,I don't Bolivia Peru-v it.,i don bolivia peru it
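
##### To make the effect of each cleaning step concrete, here is a minimal sketch that applies the same four steps to one joke from the dataset. The function body is repeated so the snippet runs on its own; the names `sample` and `cleaned` are illustrative only.

```python
import re

def clean_text(text):
    """Same steps as the notebook's clean_text, repeated so this runs standalone."""
    text = text.lower()                          # lowercase
    text = re.sub(r'\W', ' ', text)              # non-word characters -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop single letters between whitespace
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    return text

sample = "I don't Bolivia Peru-v it."
cleaned = clean_text(sample)
print(repr(cleaned))  # repr makes any trailing whitespace visible
```

##### Two details are worth noticing: the leading "i" survives because the single-letter pattern requires whitespace on both sides, and the final period leaves a trailing space behind. Calling `.strip()` on the result would remove that space if it matters downstream.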


### Tokenization

![Alt Text](../images/Tokenization.png)


In [6]:
# Tokenizer process
tokenizer = Tokenizer()  # Creates an instance of the Tokenizer class from the Keras preprocessing library.
# Fit on the raw 'Joke' column (not 'Cleaned_Joke'); by default the Tokenizer lowercases text and strips punctuation itself.
tokenizer.fit_on_texts(jokes_df['Joke'])  # Extracts the unique words and assigns an integer index to each word.

total_words = len(tokenizer.word_index) + 1
total_words

46923

##### By creating a vocabulary index (word_index), the tokenizer provides a consistent way to convert back and forth between words and their corresponding numeric indices. This consistency is important for both training the model and using it to generate new text.

#####  In the context of predicting the next word in a joke, the model needs to understand the sequence of words that come before to make a prediction about what comes next. Each word in a sequence is input as a numeric index, and the model learns to predict the next index (word) in the sequence. During inference, these predictions are converted back to words using the tokenizer's vocabulary. With tokenization, models can effectively capture the relationships between words in sequences, learning patterns like syntax, semantics, and common phrases.
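
##### The word-to-index round trip described above can be sketched in plain Python. This is a simplified stand-in for the Keras `Tokenizer`, not its implementation: it assumes whitespace-separated, already lowercased text, and the toy corpus is hypothetical. Like Keras, it ranks words by descending frequency starting at index 1 and leaves index 0 free for padding.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
texts = ["two guys walk into a bar", "a man walks into a bar"]

# Count words, then assign indices by descending frequency, starting at 1.
counts = Counter(word for line in texts for word in line.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}
index_word = {i: word for word, i in word_index.items()}

# +1 accounts for the reserved padding index 0, as in the notebook's total_words.
total_words = len(word_index) + 1

# Words -> indices -> words again.
encoded = [word_index[w] for w in "a man walks into a bar".split()]
decoded = " ".join(index_word[i] for i in encoded)

print(word_index)
print(encoded, "->", decoded)
```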

### Creating n-grams from Tokenized Text Data

In [7]:
#declaring ngrams
input_sequences = []
for line in jokes_df['Joke']:
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Generating n-grams from Token Sequences
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

##### By training on n-gram sequences, LSTMs learn the context gradually. For instance, in a training set consisting of the sentence "The quick brown fox", the LSTM would see sequences like ["The", "quick"] and ["The", "quick", "brown"], where the last word of each sequence is the target and the preceding words are the context. This step-by-step buildup of sequences helps the model understand how the presence of certain words affects the likelihood of following words, a concept crucial for generating coherent text.
##### Generating multiple subsequences from each joke maximizes the utilization of the data, creating numerous training examples from a single line of text. This approach improves the model's exposure to different patterns and contexts within the data, enhancing its ability to generalize and predict accurately in varied situations.
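
##### As a small illustration of the prefix generation, the sketch below applies the same slicing to four token ids taken from this notebook's first joke and splits each prefix into context and target. The helper name `prefix_sequences` is introduced here for illustration only.

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2, mirroring the loop in the cell above."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

tokens = [14, 8701, 1, 3039]  # first four token ids of the first joke

for seq in prefix_sequences(tokens):
    context, target = seq[:-1], seq[-1]  # last id is the word to predict
    print(context, "->", target)
```

##### A joke with n tokens therefore yields n - 1 training examples, and a single-token joke yields none.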

In [8]:
input_sequences[:5]

[[14, 8701],
 [14, 8701, 1],
 [14, 8701, 1, 3039],
 [14, 8701, 1, 3039, 43],
 [14, 8701, 1, 3039, 43, 25110]]

In [9]:
# Calculates the maximum length of sequences within the list input_sequences
max_len = max([len(x) for x in input_sequences])

### Pad Sequences

In [10]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad_sequences transforms the list of integer sequences (each integer is a word token) into a 2D NumPy array of shape (number of sequences, max_len), left-padded with zeros.
padded_input_sequences = pad_sequences(input_sequences, maxlen = max_len, padding='pre')   

In [11]:
padded_input_sequences

array([[   0,    0,    0, ...,    0,   14, 8701],
       [   0,    0,    0, ...,   14, 8701,    1],
       [   0,    0,    0, ..., 8701,    1, 3039],
       ...,
       [   0,    0,    0, ...,   27,    3,  620],
       [   0,    0,    0, ...,    3,  620,   23],
       [   0,    0,    0, ...,  620,   23,  379]], dtype=int32)

##### Neural networks require inputs to be of consistent size and shape. The pad_sequences function standardizes the length of all input sequences, which is essential for training these models efficiently.

##### With pre-padding, the zeros come first and the actual words sit at the end of each sequence, so the LSTM processes the real tokens last, right before it produces its final state. This is especially important in tasks like predicting the next word, where the context closer to the word being predicted is usually more relevant than the earlier context.

##### Many types of neural networks, such as LSTMs (Long Short-Term Memory networks) or GRUs (Gated Recurrent Units), are designed to work with sequence data and often expect all inputs to be of uniform length. Padding ensures compatibility with these architectural requirements.
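
##### For readers who want to see what pre-padding does to the data, here is a minimal NumPy sketch. It is a simplified re-implementation for illustration, not the Keras `pad_sequences` function itself; the helper name `pre_pad` is an assumption of this example.

```python
import numpy as np

def pre_pad(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; a simplified sketch of padding='pre'."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]  # keep the most recent tokens if the sequence is too long
        if seq:
            out[row, maxlen - len(seq):] = seq
    return out

sequences = [[14, 8701], [14, 8701, 1], [14, 8701, 1, 3039]]
padded = pre_pad(sequences, maxlen=4)
print(padded)
```

##### Every row now has the same length, and the real tokens are right-aligned so that the last column always holds an actual word.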

In [12]:
# Defines the input features by selecting all rows of padded_input_sequences and every column except the last
X = padded_input_sequences[:,:-1]

In [13]:
# Extracts the target variable
y = padded_input_sequences[:,-1]
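
##### The two slices above can be checked on a tiny padded array. The values below are illustrative and reuse token ids from the first joke.

```python
import numpy as np

padded = np.array([
    [0, 0, 14, 8701],
    [0, 14, 8701, 1],
    [14, 8701, 1, 3039],
], dtype=np.int32)

X = padded[:, :-1]  # every column except the last: the context tokens
y = padded[:, -1]   # the last column: the word to predict

print(X.shape, y.shape)
print(y)
```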

### One-Hot Encoding the Labels in Batches

In [14]:
total_rows = 100000
total_words = 46923
batch_size = 1000 

In [15]:
#Generator Function (batch_generator)
def batch_generator(y, batch_size, total_words):
    num_batches = int(np.ceil(len(y) / batch_size))
    for i in range(num_batches):
        start = i * batch_size
        end = min(start + batch_size, len(y))  
        y_batch = y[start:end]
        y_batch_encoded = tf.keras.utils.to_categorical(y_batch, num_classes=total_words) # Converts the numeric labels in y_batch into a one-hot encoded format
        yield y_batch_encoded


In [16]:
y = padded_input_sequences[:,-1]
# Iterate over the generator; each batch is one-hot encoded on demand and then discarded (demonstration only)
for encoded_batch in batch_generator(y, batch_size, total_words):
    pass

##### Using a generator is memory-efficient, which matters for a dataset of this size (100,000 jokes, each expanded into many n-gram sequences). Only one batch of one-hot labels is held in memory at a time, instead of the full one-hot label matrix with one row of 46,923 values per training sequence.
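
##### A rough back-of-the-envelope calculation shows why batching matters here. The number of training sequences is an assumption: it is an upper bound inferred from the training log in this notebook (3243 steps per epoch at batch size 512), not a value the notebook prints directly.

```python
total_words = 46923      # vocabulary size reported by the notebook
batch_size = 1000        # batch size used by batch_generator
bytes_per_value = 4      # float32, the default dtype of to_categorical

# Assumption: upper bound on the number of training sequences,
# inferred from 3243 steps per epoch at batch size 512.
num_sequences = 3243 * 512

full_bytes = num_sequences * total_words * bytes_per_value
batch_bytes = batch_size * total_words * bytes_per_value

print(f"all labels one-hot: {full_bytes / 1e9:.1f} GB")
print(f"one batch one-hot:  {batch_bytes / 1e6:.1f} MB")
```

##### Under these assumptions, the full one-hot matrix would need hundreds of gigabytes, while a single batch stays under 200 MB.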

### Building a Sequential model 

In [17]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

total_words = 46923
  
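max_sequence_len = 10  # Hard-coded; note that X above has max_len - 1 columns, so max_len would be the consistent value here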

model = Sequential()

#Adding Layers to the Model
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(LSTM(150))
model.add(Dense(total_words, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.build(input_shape=(None, max_sequence_len-1))

print(model.summary())




None


##### By training this model on sequences derived from jokes, it learns the patterns in how words are sequenced to form jokes, enabling it to generate or continue jokes by predicting subsequent words.


##### Model Summary
##### Layer Information:
##### Embedding Layer: Converts integer-encoded words to dense vectors of a fixed size (100 dimensions). The vocabulary has 46,923 entries, and each input sequence has length 9.
##### Output Shape: (None, 9, 100) indicates the batch size (None is any batch size), sequence length (9), and embedding dimension (100).
##### Parameters: 4,692,300 parameters, calculated as the product of the number of words in the vocabulary and the embedding dimensions (total_words * embedding_dimension).
##### LSTM Layer: A type of recurrent neural network that can learn order dependence in sequence prediction problems.
##### Output Shape: (None, 150) shows that for each sequence in the batch, the LSTM outputs a single vector of 150 features that captures the information from the sequence.
##### Parameters: 150,600 parameters, from the LSTM's four weight blocks (input, forget, and output gates plus the candidate cell state), each with input weights, recurrent weights, and a bias: 4 × ((100 + 150) × 150 + 150) = 150,600.
##### Dense Layer: This is the output layer of the model, using a softmax activation function. It maps the output of the LSTM into a probability distribution over the 46,923 possible next words.
##### Output Shape: (None, 46923) indicates that for each input sequence, the model predicts a probability distribution across all 46,923 words.
##### Parameters: 7,085,373, which comes from multiplying the output size of the LSTM by the number of words in the vocabulary, plus one bias term per word: 150 × 46,923 + 46,923 = 7,085,373.

##### Total Parameters:
##### 11,928,273 Total Parameters (45.50 MB): This is the sum of parameters from all layers. The memory size (45.50 MB) indicates the space required to store these parameters in memory. All parameters are trainable, meaning they are updated during the training process.
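
##### The parameter counts in the summary can be reproduced with a few lines of arithmetic. The helper functions below are written for this example only; they use the standard parameter formulas for embedding, LSTM, and dense layers with biases.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four weight blocks (three gates plus the candidate cell state),
    # each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embed_dim, lstm_units = 46923, 100, 150

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "lstm": lstm_params(embed_dim, lstm_units),
    "dense": dense_params(lstm_units, vocab_size),
}
total = sum(counts.values())

print(counts)
print(total, f"parameters, {total * 4 / 1024**2:.2f} MB as float32")
```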


##### The large number of parameters (almost 12 million) signifies a highly flexible model that can learn complex patterns in data. The extensive embedding and output layers allow the model to handle a large vocabulary with nuanced understanding. The LSTM's ability to remember and utilize context effectively makes it ideal for tasks like text generation where the meaning depends heavily on preceding words. This is critical in humor, where context can drastically alter the meaning.

##### By using a softmax activation in the output layer, the model is trained to output a probability distribution over all possible words. The word with the highest probability is selected as the predicted next word.
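
##### The softmax-then-argmax step can be sketched with NumPy. The five-word vocabulary and the raw scores below are hypothetical and chosen only to show the mechanics; in the notebook the scores would come from the Dense layer and the mapping from the tokenizer.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical vocabulary and raw scores, for illustration only.
index_word = {1: "a", 2: "bar", 3: "walk", 4: "into", 5: "guys"}
logits = np.array([0.0, 0.5, 2.0, 0.1, 1.0, 0.3])  # position 0 is the padding index

probs = softmax(logits)              # probabilities that sum to 1
next_index = int(np.argmax(probs))   # index of the most likely word
print(index_word[next_index], round(float(probs[next_index]), 3))
```

##### Always taking the argmax is the simplest decoding choice and matches the description above.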


In [None]:
import numpy as np

# Keep the labels as integer class indices rather than one-hot vectors
y_sparse = np.array(y, dtype='int32')

# Recompile with sparse categorical crossentropy, which accepts integer labels directly
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X, y_sparse, epochs=50, batch_size=512)

Epoch 1/50
3243/3243 ━━━━━━━━━━━━━━━━━━━━ 1411s 435ms/step - accuracy: 0.0721 - loss: 6.9527
Epoch 2/50
3243/3243 ━━━━━━━━━━━━━━━━━━━━ 1478s 440ms/step - accuracy: 0.1607 - loss: 5.7324
Epoch 3/50
3243/3243 ━━━━━━━━━━━━━━━━━━━━ 1387s 428ms/step - accuracy: 0.2303 - loss: 4.6811
Epoch 7/50
3243/3243 ━━━━━━━━━━━━━━━━━━━━ 1392s 425ms/step - accuracy: 0.2417 - loss: 4.5312
Epoch 8/50
3243/3243 ━━━━━━━━━━━━━━━━━━━━ 1395s 423ms/step - accuracy: 0.2528 - loss: 4.4042
Epoch 9/50
3243/3243 ━━━━━━━━━━━━━━━━━━━━ 1410s 435ms/step - accuracy: 0.2635 - loss: 4.2874
Epoch 10/50
3243/3243 ━━━━━━━━━━━━━━━━━━━━ 1438s 427ms/step - accuracy: 0.2742 - loss: 4.1835
Epoch 11/50
3243/3243 ━━━━━━━━━━━━━━━━━━━━ 1380s 426ms/step - accuracy: 0.28

<keras.src.callbacks.history.History at 0x7f24827471c0>
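
##### The training cell uses `sparse_categorical_crossentropy` with integer labels. The NumPy sketch below illustrates why this gives the same loss value as categorical crossentropy on one-hot labels, without ever building the one-hot matrix. The probabilities and labels are hypothetical.

```python
import numpy as np

# Hypothetical predicted distributions for 3 samples over a 4-word vocabulary.
probs = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
])
labels = np.array([1, 0, 2])   # integer class indices ("sparse" labels)
one_hot = np.eye(4)[labels]    # the same labels, one-hot encoded

# Sparse form: pick the probability of the true class directly.
sparse_loss = -np.log(probs[np.arange(len(labels)), labels]).mean()

# Dense form: multiply by the one-hot matrix and sum over classes.
dense_loss = -(one_hot * np.log(probs)).sum(axis=1).mean()

print(sparse_loss, dense_loss)
```

##### Both forms compute the mean negative log-probability of the correct word; the sparse form simply skips the multiplication by zeros.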

##### The accuracy and loss above suggest that training for more epochs (for example 100 or 150), adjusting the learning rate, experimenting with different batch sizes, and choosing a more diverse dataset might have increased the accuracy.

##### The LSTM model was able to predict the next words with a moderate degree of accuracy, as evidenced by the final training accuracy of approximately 42%. This suggests that while the LSTM can capture some patterns and structures inherent to the dataset, it struggles to produce coherent and contextually appropriate continuations that form a complete, well-structured joke.

##### The LSTM's performance suggests that while it has learned some patterns in the data, humor generation requires a more sophisticated understanding and modeling of language nuances, cultural references, and comedic timing, which are areas where GPT-style transformer models tend to perform better.



### To improve upon the LSTM model's capabilities, one might consider:
##### - Adding attention mechanisms to the LSTM model, which could improve performance by enabling the model to focus on critical parts of the input sequence.
##### - Experimenting with more sophisticated pre-training strategies that focus specifically on humor, for example curating a large dataset with varied comedic styles and using it to pre-train an LSTM before fine-tuning on joke-specific data.
##### - Applying meta-learning techniques so that the model can quickly adapt to new styles of humor or content with minimal additional data, enabling faster personalization and responsiveness to trends.
##### - Using hybrid models or switching to transformer-based architectures like GPT, which have been shown to perform better in text generation tasks due to their attention mechanisms and pre-training on vast datasets.


