## Table of contents

- Recurrent Neural Network
- Long Short-Term Memory Network
- Language Models (Case Study 05)
- Conclusion
- References

# Recurrent Neural Network

A recurrent neural network (RNN) is a special type of artificial neural network adapted to work for time series data or data that involves sequences. Ordinary feedforward neural networks are only meant for data points that are independent of each other. However, if we have data in a sequence such that one data point depends upon the previous data point, we need to modify the neural network to incorporate the dependencies between these data points. RNNs have the concept of “memory” that helps them store the states or information of previous inputs to generate the next output of the sequence.

![RNN Architecture](RNN2.webp)
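
The "memory" described above can be sketched with a single scalar recurrence. This is a minimal illustration, not the architecture from the figure: the weights `w_xh`, `w_hh`, and `b_h` are hypothetical values chosen by hand, and a real RNN would use weight matrices learned from data.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """One step of a scalar RNN: h_t = tanh(w_xh * x_t + w_hh * h_prev + b_h)."""
    return math.tanh(w_xh * x_t + w_hh * h_prev + b_h)

def run_rnn(inputs, w_xh=0.5, w_hh=0.8, b_h=0.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0  # initial hidden state (empty memory)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
        states.append(h)
    return states

# Both sequences end with the same input (0.0), yet the final hidden
# states differ because the earlier inputs differ.
states_a = run_rnn([1.0, 0.0])
states_b = run_rnn([-1.0, 0.0])
print(states_a[-1], states_b[-1])
```

The final states have opposite signs even though the last input is identical, which is the sense in which the hidden state carries information about previous inputs.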


**Advantages and Shortcomings of RNNs**

RNNs have various advantages, such as:

- Ability to handle sequence data
- Ability to handle inputs of varying lengths
- Ability to store or “memorize” historical information

The disadvantages are:

- Computation can be very slow, because the time steps must be processed one after another.
- A standard (unidirectional) RNN does not take future inputs into account when making decisions.
- Vanishing gradient problem, where the gradients used to compute the weight update may get very close to zero, preventing the network from learning new weights. The longer the sequence (that is, the deeper the network when unrolled through time), the more pronounced this problem is.
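
The vanishing gradient problem from the list above can be illustrated numerically. The sketch below assumes a scalar tanh RNN, where backpropagating through one time step multiplies the gradient by `w_hh * (1 - tanh(a_t)**2)`. The recurrent weight (0.9) and the constant pre-activation (0.5) are made-up values for illustration.

```python
import math

def gradient_through_time(w_hh, pre_activations):
    """Magnitude of d h_T / d h_0 for a scalar tanh RNN, step by step.

    Each step contributes a factor w_hh * (1 - tanh(a_t)^2), so the
    product shrinks geometrically when every factor is below 1.
    """
    grad = 1.0
    history = []
    for a_t in pre_activations:
        grad *= w_hh * (1.0 - math.tanh(a_t) ** 2)
        history.append(abs(grad))
    return history

# Assumed values: recurrent weight 0.9, constant pre-activation 0.5.
history = gradient_through_time(0.9, [0.5] * 50)
print(f"after 1 step:   {history[0]:.3e}")
print(f"after 10 steps: {history[9]:.3e}")
print(f"after 50 steps: {history[49]:.3e}")
```

With these values every factor is roughly 0.7, so the gradient reaching the earliest time step is many orders of magnitude smaller than the gradient at the last one.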


For a video explanation, see: https://www.youtube.com/watch?v=AsNTP8Kwu80

# Long Short-Term Memory Network

LSTM stands for Long Short-Term Memory, and it is a type of recurrent neural network (RNN) architecture that is designed to overcome the limitations of traditional RNNs in capturing long-term dependencies in sequential data.

LSTMs address these limitations of plain RNNs, chiefly the vanishing gradient problem, by introducing a memory cell that can store information over long periods of time. The memory cell is equipped with gating mechanisms that control the flow of information, allowing the LSTM to selectively retain or forget information based on the current input and the learned patterns.

![LSTM Architecture](lstm.png)
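
The gating mechanism can be sketched as a single scalar LSTM step. This is a simplified illustration using the standard forget/input/output gate equations with scalars instead of vectors. The parameter values in `params` are hypothetical and set by hand so that the forget gate stays near 1 and the input gate near 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p maps each gate name to (w_x, w_h, b): weights for the input,
    the previous hidden state, and a bias.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # candidate value to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose

    c_t = f * c_prev + i * g          # updated memory cell
    h_t = o * math.tanh(c_t)          # updated hidden state
    return h_t, c_t

# Hypothetical parameters: a large forget bias keeps that gate near 1,
# and a very negative input bias keeps new writes near 0.
params = {
    "forget": (0.0, 0.0, 6.0),
    "input": (0.0, 0.0, -6.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 6.0),
}

h, c = 0.0, 1.0  # start with a value already stored in the memory cell
for x_t in [0.3, -0.7, 0.9, -0.2, 0.5] * 4:  # 20 time steps
    h, c = lstm_step(x_t, h, c, params)
print(f"cell state after 20 steps: {c:.3f}")
```

Because the cell state is updated additively and scaled by a forget gate close to 1, the stored value is still largely intact after 20 steps, regardless of the inputs seen in between.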

LSTMs have been widely used in various natural language processing tasks, including language modeling, machine translation, sentiment analysis, and speech recognition, due to their ability to model and understand sequential data with long-term dependencies.

For a video explanation, see: https://www.youtube.com/watch?v=YCzL96nL7j0

# Language Models

Language models are AI models designed to understand and generate human language. They are trained on large amounts of text data and learn the statistical patterns and relationships between words, phrases, and sentences. Language models can be used for a variety of natural language processing tasks, such as text generation, machine translation, sentiment analysis, and question answering.

The primary goal of a language model is to predict the next word or sequence of words given a context. It learns a probability distribution over a vocabulary of words, conditioned on that context. Neural language models typically do this with recurrent neural networks (RNNs), RNN variants such as long short-term memory (LSTM) networks, or transformers.
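
To make "a probability distribution over the vocabulary given a context" concrete, here is a minimal count-based sketch. It is a bigram model, which is much simpler than the neural models discussed in this document and only conditions on one previous word. The toy corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev_word, next_word in zip(words, words[1:]):
            counts[prev_word][next_word] += 1
    return counts

def next_word_distribution(counts, context_word):
    """Return P(next word | context word) as a dict (empty if unseen)."""
    followers = counts[context_word]
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}

toy_corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
bigram_counts = train_bigram_model(toy_corpus)
distribution = next_word_distribution(bigram_counts, "the")
print(distribution)
```

In this toy corpus "cat" follows "the" in 2 of 6 cases, so it receives the highest probability. An LSTM language model plays the same role but conditions on the whole preceding sequence rather than a single word.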

## Case Study 05

In this case study we implement text generation with an LSTM.

Text generation is a natural language processing task that involves generating new text based on a given prompt or context. The goal is to generate coherent and contextually relevant text that resembles human-written language.

Text generation can be approached in different ways, depending on the specific requirements and techniques used. One common approach is to use language models, such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, or transformer models.

In this lab we will implement the LSTM approach. The dataset used in this study is https://www.kaggle.com/aashita/nyt-comments.

In [1]:
import os
import string

import numpy as np
import pandas as pd
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences, to_categorical

In [80]:
def load_headlines(directory):
    """
    Load headlines from CSV files in the specified directory.

    Args:
        directory (str): Directory path where the headline files are located.

    Returns:
        list: List of loaded headlines.

    """
    all_headlines = []
    
    # Iterate over files in the directory
    for filename in os.listdir(directory):
        if 'Articles' in filename:
            # Read the CSV file into a DataFrame
            article_df = pd.read_csv(os.path.join(directory, filename))
            # Extract the headline values and append to the list
            all_headlines.extend(list(article_df.headline.values))
            # Break after the first file with 'Articles' in the name
            break
        
    # Filter out headlines with the value "Unknown"   
    all_headlines = [line for line in all_headlines if line != "Unknown"]
    return all_headlines


directory_path = 'NYC/'
headlines = load_headlines(directory_path)
print(headlines[:10])

['Finding an Expansive View  of a Forgotten People in Niger', 'And Now,  the Dreaded Trump Curse', 'Venezuela’s Descent Into Dictatorship', 'Stain Permeates Basketball Blue Blood', 'Taking Things for Granted', 'The Caged Beast Awakens', 'An Ever-Unfolding Story', 'O’Reilly Thrives as Settlements Add Up', 'Mouse Infestation', 'Divide in G.O.P. Now Threatens Trump Tax Plan']


In [65]:
tokenizer = Tokenizer()

In [83]:
def clean_data(data):
    """
    Clean the input data by removing punctuation, converting to lowercase,
    and removing non-ASCII characters.

    Args:
        data (str): Input data to be cleaned.

    Returns:
        str: Cleaned data.

 
    """
    # Remove punctuation characters and convert the text to lower case
    data = "".join(char for char in data if char not in string.punctuation).lower()
    # Encode with UTF-8 and decode with ASCII, ignoring non-ASCII characters
    data = data.encode("utf8").decode("ascii",'ignore')
    return data 

In [84]:
corpus = [clean_data(x) for x in headlines]
print(corpus[:5])

['finding an expansive view  of a forgotten people in niger', 'and now  the dreaded trump curse', 'venezuelas descent into dictatorship', 'stain permeates basketball blue blood', 'taking things for granted']


In [68]:
def get_seq_tokens(corpus):
    """
    Convert the corpus into input sequences and calculate the total number of words.

    Args:
        corpus (list): List of strings representing the corpus.

    Returns:
        input_sequences (list): List of input sequences.
        total_words (int): Total number of words in the vocabulary.

    """   
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1

    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words


In [69]:
input_sequences, total_words = get_seq_tokens(corpus)
print(input_sequences[:9])

[[169, 17], [169, 17, 665], [169, 17, 665, 367], [169, 17, 665, 367, 4], [169, 17, 665, 367, 4, 2], [169, 17, 665, 367, 4, 2, 666], [169, 17, 665, 367, 4, 2, 666, 170], [169, 17, 665, 367, 4, 2, 666, 170, 5], [169, 17, 665, 367, 4, 2, 666, 170, 5, 667]]
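
The output above shows that each headline is expanded into all of its prefixes of length two or more. The sketch below reproduces that expansion in plain Python on a toy corpus. `build_word_index` is a simplified stand-in for the Keras `Tokenizer` (ids start at 1, ordered by frequency) and is not the library's implementation.

```python
from collections import Counter

def build_word_index(corpus):
    """Map each word to an integer id starting at 1 (0 is left for padding)."""
    counts = Counter(word for line in corpus for word in line.split())
    # most_common() orders by descending count; ties keep first-seen order.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def prefix_sequences(corpus, word_index):
    """Expand every line into its token prefixes of length >= 2."""
    sequences = []
    for line in corpus:
        tokens = [word_index[word] for word in line.split()]
        for end in range(2, len(tokens) + 1):
            sequences.append(tokens[:end])
    return sequences

toy_corpus = ["taking things for granted", "the caged beast awakens"]
word_index = build_word_index(toy_corpus)
sequences = prefix_sequences(toy_corpus, word_index)
for seq in sequences:
    print(seq)
```

Each prefix later becomes one training example: all tokens except the last are the input, and the last token is the word to predict.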


In [70]:
def sequence_padding(input_sequences):
    """
    Pad the input sequences with zeros and create predictors and labels.

    Args:
        input_sequences (list): List of input sequences.

    Returns:
        predictors (numpy.ndarray): Array of input predictors.
        label (numpy.ndarray): Array of labels.
        max_sequence_len (int): Maximum sequence length.

    """    
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

    predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
    label = to_categorical(label, num_classes=total_words)

    return predictors, label, max_sequence_len

In [71]:
predictors, label, max_sequence_len = sequence_padding(input_sequences)
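
What `sequence_padding` does can be traced by hand on a tiny example. The sketch below mirrors its three steps (pre-padding, splitting off the last token, one-hot encoding) in plain Python. The helper names and toy values are illustrative and do not use the Keras utilities.

```python
def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    padded = [[pad_value] * (max_len - len(seq)) + list(seq) for seq in sequences]
    return padded, max_len

def split_predictors_labels(padded):
    """Use all but the last token as input and the last token as the target."""
    predictors = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    return predictors, labels

def one_hot(index, num_classes):
    """Return a list with a 1 at position index and 0 elsewhere."""
    vector = [0] * num_classes
    vector[index] = 1
    return vector

toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
toy_total_words = 5  # 4 distinct words plus index 0 reserved for padding

padded, max_len = pad_pre(toy_sequences)
toy_predictors, toy_labels = split_predictors_labels(padded)
toy_one_hot = [one_hot(label, toy_total_words) for label in toy_labels]

print(padded)          # [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
print(toy_predictors)  # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(toy_labels)      # [2, 3, 4]
print(toy_one_hot[0])  # [0, 0, 1, 0, 0]
```

Padding on the left keeps the most recent words next to the prediction position, which is why `padding='pre'` is used here.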

In [72]:
def create_model(max_sequence_len, total_words):
    # Calculate input length for the Embedding layer
    input_len = max_sequence_len - 1
    
    # Initialize the model
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))
    
    # Compile the model
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

# Create the model and print the model summary
model = create_model(max_sequence_len, total_words)
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 18, 10)            24220     
                                                                 
 lstm_3 (LSTM)               (None, 100)               44400     
                                                                 
 dropout_3 (Dropout)         (None, 100)               0         
                                                                 
 dense_3 (Dense)             (None, 2422)              244622    
                                                                 
Total params: 313,242
Trainable params: 313,242
Non-trainable params: 0
_________________________________________________________________
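
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers, with the sizes taken from the summary above (vocabulary of 2422, embedding dimension 10, 100 LSTM units). The helper names are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four weight sets (three gates plus the candidate), each with
    input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

vocab_size, embed_dim, lstm_units = 2422, 10, 100

n_embedding = embedding_params(vocab_size, embed_dim)
n_lstm = lstm_params(embed_dim, lstm_units)
n_dense = dense_params(lstm_units, vocab_size)

print(n_embedding)                     # 24220
print(n_lstm)                          # 44400
print(n_dense)                         # 244622
print(n_embedding + n_lstm + n_dense)  # 313242
```

The output layer accounts for most of the parameters because it needs one weight per LSTM unit for every word in the vocabulary.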


In [14]:
model.fit(predictors, label, epochs=100, verbose=5)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x1d1f6928eb0>

In [74]:
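def generate_text(seed_text, next_words, model, max_sequence_len):
    """
    Extend the seed text by repeatedly predicting the most likely next word.

    Args:
        seed_text (str): Text used to start the generation.
        next_words (int): Number of words to append.
        model: Trained Keras model.
        max_sequence_len (int): Maximum sequence length used during training.

    Returns:
        str: Seed text followed by the generated words, in title case.

    """
    for _ in range(next_words):
        # Convert the current text to a padded token sequence
        token_list = tokenizer.texts_to_sequences([clean_data(seed_text)])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')

        # Predict the index of the most likely next word
        predicted = int(np.argmax(model.predict(token_list), axis=-1)[0])

        # Map the index back to its word; index 0 is padding and has no word
        output_word = tokenizer.index_word.get(predicted, "")

        # Append the predicted word to the seed text
        seed_text += " " + output_word
    return seed_text.title()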

In [78]:
print(generate_text("finding an expansive", 3, model, max_sequence_len))
print(generate_text("science and technology", 5, model, max_sequence_len))
print(generate_text("Donald trump", 2, model, max_sequence_len))
print(generate_text("New york", 4, model, max_sequence_len))

Finding An Expansive Zombies Unleashed India
Science And Technology Am Became Am Am Learn
Donald Trump Unleashed Civics
New York Coal Mayor Coming Dreaded
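
The second sample above repeats the same word several times. `generate_text` always takes the single most probable word (the `argmax`), which can produce this kind of repetition. One common alternative, which the notebook does not use, is to sample from the predicted distribution after rescaling it with a temperature. The sketch below is standalone and uses a made-up three-word distribution. In the notebook, the probabilities would presumably come from `model.predict(token_list)[0]`.

```python
import math
import random

def apply_temperature(probabilities, temperature):
    """Rescale a distribution to p_i ** (1 / temperature), renormalized."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [math.log(p + 1e-12) / temperature for p in probabilities]
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]

def sample_index(probabilities, temperature=1.0, rng=random):
    """Draw one index at random from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probabilities, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]

toy_probs = [0.5, 0.3, 0.2]               # made-up next-word probabilities
sharp = apply_temperature(toy_probs, 0.5)  # closer to argmax
flat = apply_temperature(toy_probs, 2.0)   # closer to uniform

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_index(toy_probs, temperature=1.0, rng=rng) for _ in range(10)]
print(sharp, flat, samples)
```

Temperatures below 1 make the output more conservative, while temperatures above 1 make it more varied at the cost of coherence.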


# Conclusion

There are several ways to further improve the language model. Here are some suggestions:

- **Adding more data:** Increasing the size of your training dataset can often lead to better performance. If possible, consider acquiring more data or exploring additional sources to expand the training data.

- **Fine-tuning the model architecture:** Experiment with different model architectures, such as increasing or decreasing the number of layers, adjusting the number of units in each layer, or trying different types of layers (e.g., LSTM, GRU, or Transformer). You can also consider incorporating techniques like attention mechanisms or residual connections to improve the model's ability to capture long-range dependencies.

- **Fine-tuning hyperparameters:** Hyperparameters like the number of epochs, learning rate, batch size, and activation functions can significantly impact model performance. Try different combinations of these hyperparameters to find the optimal configuration. You can also utilize techniques like grid search or random search to systematically explore the hyperparameter space.

- **Regularization techniques:** Regularization techniques like dropout and L2 regularization can help prevent overfitting and improve the generalization of the model. Experiment with different dropout rates or regularization strengths to strike a balance between model complexity and generalization.

- **Transfer learning and pretraining:** If you have access to a pre-trained language model, such as GPT or BERT, you can leverage transfer learning by fine-tuning the pre-trained model on your specific task or domain. This can help improve the performance of your model, especially if you have limited training data.
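
The grid search mentioned under hyperparameter tuning can be sketched in a few lines. The search space and the `dummy_validation_loss` function below are placeholders: in practice the evaluation function would train a model with the given settings and return its validation loss, which this sketch does not do.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return the lowest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space; the values are placeholders, not recommendations.
param_grid = {
    "lstm_units": [50, 100, 200],
    "dropout": [0.1, 0.3],
    "embedding_dim": [10, 50],
}

def dummy_validation_loss(params):
    """Stand-in for training a model and measuring its validation loss."""
    return (
        abs(params["lstm_units"] - 100) / 100
        + abs(params["dropout"] - 0.3)
        + abs(params["embedding_dim"] - 50) / 50
    )

best_params, best_score = grid_search(param_grid, dummy_validation_loss)
print(best_params, best_score)
```

Grid search evaluates every combination (12 here), so its cost grows quickly with the number of hyperparameters. Random search samples a fixed number of combinations instead.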

# References

- https://machinelearningmastery.com/an-introduction-to-recurrent-neural-networks-and-the-math-that-powers-them/
- https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
- https://towardsdatascience.com/learn-how-recurrent-neural-networks-work-84e975feaaf7
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://towardsdatascience.com/language-modeling-with-lstms-in-pytorch-381a26badcbf
- https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275
- https://www.analyticsvidhya.com/blog/2022/02/explaining-text-generation-with-lstm/
- https://www.kaggle.com/code/shivamb/beginners-guide-to-text-generation-using-lstms/notebook