## LSTM 

##### This notebook explores a dataset of short jokes sourced from Kaggle. We preprocess the data and use it to train an LSTM model that predicts the next word in a joke.
#### - **Dataset URL**: [Short Jokes Dataset](https://www.kaggle.com/datasets/abhinavmoudgil95/short-jokes)
#### - **Number of Records**: 231657
#### - **Number of Fields**: 2

In [1]:
%pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Importing Libraries

import pandas as pd
import os
import numpy as np

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

2024-04-12 17:17:14.168272: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
# Load the dataset
jokes_df = pd.read_csv('jokes.csv')
jokes_df.head()

Unnamed: 0,ID,Joke
0,1,"[me narrating a documentary about narrators] ""..."
1,2,Telling my daughter garlic is good for you. Go...
2,3,I've been going through a really rough period ...
3,4,"If I could have dinner with anyone, dead or al..."
4,5,Two guys walk into a bar. The third guy ducks.


In [4]:
print("Number of records: ", jokes_df.shape[0])
print("Number of fields: ", jokes_df.shape[1])

Number of records:  231657
Number of fields:  2


In [5]:
jokes_df['Joke']

0         [me narrating a documentary about narrators] "...
1         Telling my daughter garlic is good for you. Go...
2         I've been going through a really rough period ...
3         If I could have dinner with anyone, dead or al...
4            Two guys walk into a bar. The third guy ducks.
                                ...                        
231652                  The Spicy Sausage by Delia Katessen
231653    TIL That I Shouldn't have gone to law school, ...
231654    What did the RAM stick say to the politician? ...
231655    what do you call a play about victorian era me...
231656    Calculus should be taught in every high school...
Name: Joke, Length: 231657, dtype: object

##### The cleaning step below replaces two special spacing characters, the non-breaking space (\xa0) and the hair space (\u200a), with regular spaces so that the text becomes more uniform. This uniformity matters for natural language processing because the tokenizer splits on regular spaces: a special space left inside the text would glue two words into a single token and distort tokenization, feature extraction, and modeling.

In [6]:
# Cleaning Text Data
jokes_df['Joke'] = jokes_df['Joke'].apply(lambda x: x.replace(u'\xa0',u' '))
jokes_df['Joke'] = jokes_df['Joke'].apply(lambda x: x.replace('\u200a',' '))
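
##### A minimal sketch of why this cleaning matters, using a made-up sample string rather than a row from the dataset. It assumes words are split on the regular space character, which is the default for the Keras Tokenizer.

```python
# Illustrative only: the sample string is invented, not taken from jokes.csv.
sample = "Two\xa0guys walk\u200ainto a bar."

# Splitting on a regular space leaves the special spaces glued inside tokens.
raw_tokens = sample.split(" ")

# Same two replacements as the cleaning cell above.
cleaned = sample.replace("\xa0", " ").replace("\u200a", " ")
clean_tokens = cleaned.split(" ")

print(len(raw_tokens), raw_tokens)
print(len(clean_tokens), clean_tokens)
```

##### Before cleaning, the sample splits into 4 tokens because two word pairs stay glued together; after cleaning it splits into the expected 6.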

### Initializing and Fitting the Tokenizer


In [7]:
# Tokenizer Initialization with an OOV Token
tokenizer = Tokenizer(oov_token='<oov>') 
# Fitting the Tokenizer on Texts
tokenizer.fit_on_texts(jokes_df['Joke'])
total_words = len(tokenizer.word_index) + 1

print("Total number of words: ", total_words)
print("Word: ID")
print("------------")
print("<oov>: ", tokenizer.word_index['<oov>'])
print("Strong: ", tokenizer.word_index['strong'])
print("And: ", tokenizer.word_index['and'])
print("Consumption: ", tokenizer.word_index['consumption'])

Total number of words:  70650
Word: ID
------------
<oov>:  1
Strong:  1479
And:  7
Consumption:  9242


##### By converting text into a sequence of integers, it allows neural networks to process and learn from textual data. The inclusion of the OOV token helps in making the model robust to new words it encounters post-training, improving its ability to handle real-world data where not every word can be anticipated during training.

##### After fitting the tokenizer, tokenizer.word_index is a dictionary whose keys are the words and whose values are the corresponding unique indices. The OOV token is already part of this dictionary (index 1, as printed above). Indices start at 1 because index 0 is reserved for padding, so adding 1 to the dictionary length gives the number of distinct integer values the model has to handle. Knowing this vocabulary size is important because it sets the input dimension of the embedding layer and the number of units in the output layer.

##### Total number of words:  70650 - This is the size of the index space: 70649 entries in word_index (including the OOV token) plus the reserved padding index 0. Such a large number indicates that the dataset is quite diverse in terms of vocabulary.
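
##### The indexing idea can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: build_word_index and to_sequence are hypothetical helpers, the two texts are invented, and punctuation filtering is omitted.

```python
from collections import Counter

def build_word_index(texts, oov_token="<oov>"):
    """Hypothetical helper: OOV gets index 1, then words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, oov_token="<oov>"):
    """Map each word to its index, falling back to the OOV index for unseen words."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in text.lower().split()]

texts = ["the bar is open", "the joke is old"]   # toy corpus
word_index = build_word_index(texts)
total_words = len(word_index) + 1                # +1 for index 0, reserved for padding

print(word_index)
print("total_words:", total_words)
print(to_sequence("the new joke", word_index))   # "new" was never seen
```

##### In this toy run the unseen word "new" maps to index 1, the OOV index, while frequent words such as "the" receive the lowest indices after it.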

### Initialization of input_sequences

In [8]:
input_sequences = []
for line in jokes_df['Joke']:
    token_list = tokenizer.texts_to_sequences([line])[0]
    #print(token_list)
    
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# print(input_sequences)
print("Total input sequences: ", len(input_sequences))

Total input sequences:  3850485


##### Here each joke (line) is converted into a sequence of integers (token_list) using the previously fitted tokenizer. These integers represent words as per the vocabulary established by the tokenizer.
##### By training on these growing prefixes of each joke (called n-gram sequences here), the LSTM learns context gradually. For instance, from the sentence "The quick brown fox", the loop produces ["The", "quick"], ["The", "quick", "brown"], and ["The", "quick", "brown", "fox"]. In each sequence the last word is the target and the words before it are the input, so the model learns to predict "quick" from "The", "brown" from "The quick", and so on. This step-by-step buildup of sequences helps the model understand how the presence of certain words affects the likelihood of following words, a concept crucial for generating coherent text.
##### Generating multiple subsequences from each joke maximizes the utilization of the data, creating numerous training examples from a single line of text. This approach improves the model's exposure to different patterns and contexts within the data, enhancing its ability to generalize and predict accurately in varied situations.
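
##### A small sketch of the same slicing applied to words instead of integer IDs, so the pairs are easy to read. prefix_sequences is a hypothetical helper that mirrors the loop above.

```python
def prefix_sequences(token_list):
    """Same slicing as the loop above: every prefix of length 2 or more."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

words = ["The", "quick", "brown", "fox"]
sequences = prefix_sequences(words)

for seq in sequences:
    context, target = seq[:-1], seq[-1]   # last element becomes the label later on
    print(context, "->", target)
```

##### A line with n tokens yields n - 1 training sequences, which is how 231657 jokes turn into several million examples.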

### Pad Sequences 


In [9]:
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
input_sequences[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,   15, 9606,    2], dtype=int32)

##### To process multiple sequences simultaneously, they must all have the same length. Padding ensures that sequences within a batch have consistent dimensions, enabling efficient parallel processing.

In [10]:
# create features and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
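
##### The padding and the feature/label split can be sketched without TensorFlow. pre_pad is a hypothetical stand-in for pad_sequences with padding='pre'; it assumes no sequence is longer than maxlen (the real function would truncate). The token IDs are toy values.

```python
def pre_pad(sequences, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences(..., padding='pre'), without truncation."""
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]

# Toy prefix sequences of increasing length.
toy_sequences = [[15, 9606], [15, 9606, 2], [15, 9606, 2, 40]]
maxlen = max(len(seq) for seq in toy_sequences)

padded = pre_pad(toy_sequences, maxlen)

# Equivalent of input_sequences[:, :-1] and input_sequences[:, -1] on nested lists.
toy_xs = [row[:-1] for row in padded]
toy_labels = [row[-1] for row in padded]

print(padded)
print(toy_xs)
print(toy_labels)
```

##### Padding on the left keeps the target word in the last column of every row, which is why a single column slice is enough to separate the labels from the features.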

### Training LSTM model

In [None]:
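model = Sequential()
model.add(Embedding(total_words, 100))               # word index -> 100-dimensional vector
model.add(Bidirectional(LSTM(64)))                   # 64 units per direction
model.add(Dense(total_words, activation='softmax'))  # probability for every word in the vocabulary
# NOTE: this Adam instance is never passed to compile(). The string 'adam' below
# selects Adam with its default learning rate, so learning_rate=0.01 has no effect
# on the run logged below. Pass optimizer=adam to actually use it.
adam = Adam(learning_rate=0.01)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(xs, labels, epochs=3, batch_size=128)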

Epoch 1/3


2024-04-12 17:18:05.455255: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1416978480 exceeds 10% of free system memory.


30082/30082 ━━━━━━━━━━━━━━━━━━━━ 12126s 403ms/step - accuracy: 0.1281 - loss: 6.2899
Epoch 2/3
30082/30082 ━━━━━━━━━━━━━━━━━━━━ 11904s 396ms/step - accuracy: 0.2034 - loss: 5.3186
Epoch 3/3
30082/30082 ━━━━━━━━━━━━━━━━━━━━ 11956s 397ms/step - accuracy: 0.2245 - loss: 5.0806


<keras.src.callbacks.history.History at 0x7fd8c0cfcb50>

##### Sequential(): Initializes a sequential model, a linear stack of layers.
##### Embedding: Adds an embedding layer to the model. This layer converts integer indices into dense vectors of fixed size. Here, it transforms each word index in the input sequence into a dense vector of dimensionality 100.
##### Bidirectional(LSTM(64)): Adds a bidirectional LSTM layer with 64 units in each direction. A bidirectional LSTM reads the input sequence both forward and backward and concatenates the two final states, capturing dependencies in both directions. This enhances the model's ability to understand context.
##### Dense: Adds a fully connected (dense) layer with total_words units and softmax activation. This layer produces a probability distribution over the vocabulary, predicting the likelihood of each word in the vocabulary being the next word in the sequence.
##### The overall purpose of this cell is to construct, compile, and train a bidirectional LSTM model for next-word prediction. The model is trained to predict the next word in a sequence, given the preceding words. Training adjusts the model's parameters (weights and biases) to minimize the loss function (sparse categorical crossentropy) and improve prediction accuracy.
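
##### To get a feel for the model's size, the trainable parameters can be counted by hand. This is a sketch that assumes the standard Keras layer layouts with bias terms enabled; the numbers have not been checked against model.summary() in this notebook.

```python
total_words = 70650   # vocabulary size printed by the tokenizer cell
embed_dim = 100       # Embedding output dimension
lstm_units = 64       # units per LSTM direction

# Embedding: one vector of size embed_dim per index.
embedding_params = total_words * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * (embed_dim + lstm_units + 1) * lstm_units
bilstm_params = 2 * lstm_one_direction

# Dense: input is the concatenated forward and backward state (2 * 64), plus a bias.
dense_params = (2 * lstm_units + 1) * total_words

total_params = embedding_params + bilstm_params + dense_params

print("Embedding:", embedding_params)
print("Bidirectional LSTM:", bilstm_params)
print("Dense:", dense_params)
print("Total:", total_params)
```

##### Under these assumptions almost all of the roughly 16 million parameters sit in the embedding and output layers, both of which scale with the vocabulary size, while the recurrent layer itself is comparatively small.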

##### Training accuracy increased from 12.81% in the first epoch to 22.45% in the third, while the loss fell from 6.2899 to 5.0806. The gain shrinks from epoch to epoch (about 7.5 percentage points, then about 2.1), so there is considerable room for improvement. Possible enhancements include increasing model complexity, ensuring a diverse training dataset, applying regularization techniques, adjusting the learning rate, increasing epochs, tuning hyperparameters, and evaluating the model on a separate validation set, since the figures above are measured on the training data only. By iteratively refining the model architecture, preprocessing steps, and training process, it should be possible to achieve better accuracy and loss values.
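
##### The cost of each experiment is worth keeping in mind when trying these enhancements. The arithmetic below uses only numbers printed earlier in the notebook; max_sequence_len = 93 is an assumption obtained by counting the entries of the padded row shown above.

```python
import math

num_sequences = 3850485                 # "Total input sequences" printed earlier
batch_size = 128                        # from model.fit
max_sequence_len = 93                   # assumed: counted from the padded row shown earlier
epoch_seconds = [12126, 11904, 11956]   # from the training log

# Number of batches per epoch (the last batch is partial).
steps_per_epoch = math.ceil(num_sequences / batch_size)

# Size of the feature matrix xs: one int32 (4 bytes) per entry, label column excluded.
xs_bytes = num_sequences * (max_sequence_len - 1) * 4

total_hours = sum(epoch_seconds) / 3600

print("Steps per epoch:", steps_per_epoch)
print("Size of xs in bytes:", xs_bytes)
print("Total training time (hours):", round(total_hours, 1))
```

##### The step count agrees with the 30082 steps in the log, and the computed size of xs appears to match the 1416978480-byte allocation in the warning. Three epochs took about 10 hours, so training on a subset of the sequences may be a practical way to iterate faster.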

### Predicting Next Words in Joke

In [27]:
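from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Starting with two words
seed_text = "Donald trump"
# Number of words to generate. The original value is not shown in this notebook;
# 16 is an assumption that matches the length of the output below.
next_words = 16

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
    predicted_probs = model.predict(token_list, verbose=0)
    # Greedy decoding: take the most probable word
    predicted_index = int(np.argmax(predicted_probs, axis=-1)[0])
    # index_word is the tokenizer's reverse mapping (ID -> word); index 0 (padding) has no word
    output_word = tokenizer.index_word.get(predicted_index, "")
    seed_text += " " + output_word
    # The default Tokenizer filters strip punctuation, so this check is not expected to trigger
    if output_word in ['.', '!', '?']:
        break

print(seed_text)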


Donald trump is the best way to get a joke about the new barbie doll on the market
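
##### The word-selection step can be sketched on its own with a toy vocabulary. The scores below are made up, and softmax is written out by hand to stand in for the model's output layer; none of this comes from the trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Toy reverse vocabulary; index 0 is reserved for padding and has no word.
index_word = {1: "<oov>", 2: "the", 3: "bar", 4: "joke"}
scores = [0.0, 0.1, 2.0, 0.5, 1.2]   # made-up scores for indices 0..4

probs = softmax(scores)

# Greedy choice: the index with the highest probability.
predicted_index = max(range(len(probs)), key=probs.__getitem__)
output_word = index_word.get(predicted_index, "")

print([round(p, 3) for p in probs])
print(predicted_index, output_word)
```

##### Because the greedy choice always takes the single most probable word, it tends to favor frequent, generic continuations, which may partly explain the phrasing of the sentence generated above.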


##### The generated sentence above is mostly grammatical but not coherent as a joke. Factors such as the number of training epochs and the learning rate affect this result, and underfitting or overfitting could also be issues.

##### Overall, these issues still need to be addressed before the model can predict next words accurately enough to form a joke.


##### To improve a model's ability to predict humor-like words or generate jokes, one could consider:

##### 1. Using more advanced models such as Transformer-based architectures, which have shown better performance in understanding context and generating text.
##### 2. Incorporating larger and more diverse datasets that include a variety of humorous texts.
##### 3. Applying techniques like fine-tuning on specific humor-focused datasets after pre-training on general text to capture the nuances of humorous language.


##### Jokes frequently use ambiguities and creative twists in language, which can be tricky for models like LSTM networks to handle. These models often focus on the literal meaning of words and may miss the multiple interpretations and wordplay that make jokes funny. Essentially, jokes require understanding beyond just the surface level of the text, something that LSTM models might not be well-equipped to do.

