# Text Processing with NLTK: Lemmatization and Simple RNN Model Training

In [1]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')  # Download tokenizer if not already installed
nltk.download('stopwords')  # Download stopwords if not already installed
nltk.download('wordnet')  # WordNet data is required by WordNetLemmatizer

[nltk_data] Downloading package punkt to C:\Users\Karim
[nltk_data]     Nasr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Karim
[nltk_data]     Nasr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Text Cleaning and Lemmatization

The function `clean_text_Lemmatization` cleans raw text in seven steps:

1. **Lowercase Conversion:** The function converts all text to lowercase to ensure uniformity.
2. **URL Removal:** It removes any URLs present in the text using a regular expression pattern.
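3. **Special Character Removal:** Every character other than letters, digits, hyphens, and apostrophes is replaced with a space.
4. **Punctuation Removal:** Remaining punctuation is stripped. This pattern keeps hyphens but removes apostrophes, so only hyphens survive both steps (for example, `don't` becomes `dont`).
5. **Stopword Removal (Optional):** Common English stopwords are removed. This is controlled by the `remove_stopwords` flag, which is set to `True` inside the function body.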
6. **Tokenization:** The text is tokenized into individual words.
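7. **Lemmatization:** NLTK's WordNet lemmatizer reduces each token to its base form. No part-of-speech tag is passed, so tokens are lemmatized as nouns: plurals are reduced (`waves` becomes `wave`), while verb forms such as `whispered` are left unchanged.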

**Arguments:**
- `text (str)`: The text data to be cleaned.

**Returns:**
- `list`: A list of cleaned tokens ready for further processing.
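
The interaction between steps 3 and 4 is easy to miss, so here is a minimal sketch that applies only the regular-expression steps to a short string. The sample sentence and the `step3`/`step4` names are illustrative assumptions; the three patterns are copied from the function defined in this notebook.

```python
import re

# Hypothetical sample input (not from the notebook's text)
sample = "Don't miss https://example.com/page, it's well-known!"

text = sample.lower()
# Step 2: URL removal (same pattern as in clean_text_Lemmatization)
text = re.sub(r"http[s]?://\S+\b|\bwww\.\S+\b", "", text)
# Step 3: replace everything except letters, digits, '-' and "'" with a space
text = re.sub(r"[^a-zA-Z0-9\-\']", " ", text)
step3 = text.split()
# Step 4: drop anything that is not a word character, whitespace or '-'
text = re.sub(r"[^\w\s-]", "", text)
step4 = text.split()

print(step3)  # apostrophes still present
print(step4)  # apostrophes removed, hyphen kept
```

After step 3 the contractions are intact; after step 4 they lose their apostrophes, while the hyphenated word is unchanged.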

In [2]:
def clean_text_Lemmatization(text):
    """
    Cleans text by performing the following steps:

    1. Lowercase conversion
    2. URL removal
    3. Special character removal (excluding hyphen and apostrophe)
    4. Punctuation removal
    5. Stop word removal (customizable option)
    6. Tokenization
    7. Lemmatization

    Args:
        text (str): The text to be cleaned.

    Returns:
        list: A list of cleaned tokens.
    """

    # 1. Lowercase conversion
    text = text.lower()

    # 2. URL removal
    url_pattern = r"http[s]?://\S+\b|\bwww\.\S+\b"  # Improved URL pattern
    text = re.sub(url_pattern, "", text)

    # 3. Special character removal (excluding hyphen and apostrophe)
    special_char_pattern = r"[^a-zA-Z0-9\-\']"
    text = re.sub(special_char_pattern, " ", text)

    # 4. Punctuation removal (this also strips the apostrophes kept in step 3)
    text = re.sub(r'[^\w\s-]', '', text)

    # 5. Stop word removal (optional)
    remove_stopwords = True  # Flag to enable/disable stop word removal
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split() if word not in stop_words])

    # 6. Tokenization
    tokens = nltk.word_tokenize(text)

    # 7. Lemmatization (WordNetLemmatizer; with no POS tag, tokens are treated as nouns)
    lemmatizer = nltk.WordNetLemmatizer()
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return cleaned_tokens

In [3]:
# Define the text
text = """
"The wind whispered through the tall grass, carrying with it the scent of rain and distant adventures. Sarah stood at the edge of the cliff, her hair billowing around her like a cloak as she gazed out at the endless expanse of the ocean. The waves crashed against the rocks below, their rhythmic symphony a soothing balm to her troubled mind.

She had always been drawn to the sea, its vastness mirroring the depths of her own soul. There was a sense of freedom here, a liberation from the constraints of everyday life. With each crashing wave, she felt her worries wash away, replaced by a sense of peace and clarity.

But today, there was something different in the air, something electric and charged. As she watched, a storm began to brew on the horizon, dark clouds swirling ominously overhead. Yet, instead of fear, Sarah felt a surge of excitement coursing through her veins.

For as long as she could remember, she had yearned for adventure, for a life less ordinary. And now, it seemed, fate was finally calling her name. With a fierce determination burning in her heart, she took a step forward, towards the unknown.

The journey ahead would be fraught with peril and uncertainty, but Sarah welcomed it with open arms. For she knew that only by venturing into the unknown could she truly discover who she was meant to be.

And so, with the wind at her back and the ocean at her feet, Sarah set sail into the storm, ready to embrace whatever lay ahead. For in that moment, she knew that her destiny awaited her on the other side of the horizon."
"""

cleaned_tokens_lemm = clean_text_Lemmatization(text)
print(cleaned_tokens_lemm)

['wind', 'whispered', 'tall', 'grass', 'carrying', 'scent', 'rain', 'distant', 'adventure', 'sarah', 'stood', 'edge', 'cliff', 'hair', 'billowing', 'around', 'like', 'cloak', 'gazed', 'endless', 'expanse', 'ocean', 'wave', 'crashed', 'rock', 'rhythmic', 'symphony', 'soothing', 'balm', 'troubled', 'mind', 'always', 'drawn', 'sea', 'vastness', 'mirroring', 'depth', 'soul', 'sense', 'freedom', 'liberation', 'constraint', 'everyday', 'life', 'crashing', 'wave', 'felt', 'worry', 'wash', 'away', 'replaced', 'sense', 'peace', 'clarity', 'today', 'something', 'different', 'air', 'something', 'electric', 'charged', 'watched', 'storm', 'began', 'brew', 'horizon', 'dark', 'cloud', 'swirling', 'ominously', 'overhead', 'yet', 'instead', 'fear', 'sarah', 'felt', 'surge', 'excitement', 'coursing', 'vein', 'long', 'could', 'remember', 'yearned', 'adventure', 'life', 'le', 'ordinary', 'seemed', 'fate', 'finally', 'calling', 'name', 'fierce', 'determination', 'burning', 'heart', 'took', 'step', 'forward

### Tokenizing Text and Creating Training Data

This cell runs `clean_text_Lemmatization` on the text and builds the training data for a next-word prediction model.

- **Tokenization:** The cleaned text is tokenized to form individual words.
- **Vocabulary Generation:** A set of unique words is created from the tokenized text.
- **Mapping Words to Indices:** Each word in the vocabulary is mapped to a unique index for numerical representation.
- **Creating Training Samples:** A sliding window moves over the tokens one word at a time. Each pair of consecutive words forms the context, and the word that follows them is the target.

**Libraries Used:**
- `torch`: PyTorch library for building neural networks.
- `numpy`: NumPy library for numerical operations (imported here, but not used in the cells shown).

**Returns:**
- `data`: A list of tuples containing context-target pairs for training the model.
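
As a small illustration of the context-target construction, the sketch below builds the pairs for a five-token list. The token list is a made-up example, and `sorted` is used here (an assumption, not what the notebook does) only so that the indices are the same on every run.

```python
# Hypothetical toy token list
tokens = ["wind", "whispered", "tall", "grass", "wind"]

# sorted() makes the index assignment deterministic for this example
vocab = sorted(set(tokens))
word_to_ix = {word: i for i, word in enumerate(vocab)}

# Two context words -> the word that follows them
data = []
for i in range(len(tokens) - 2):
    context = [word_to_ix[tokens[i]], word_to_ix[tokens[i + 1]]]
    target = word_to_ix[tokens[i + 2]]
    data.append((context, target))

print(word_to_ix)
print(data)
```

A list of `n` tokens yields `n - 2` training samples, because the last two tokens have no following word to predict.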

In [4]:
%%time
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

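# Clean and tokenize the text
words = clean_text_Lemmatization(text)

# Build the vocabulary and the word <-> index lookups.
# Note: set iteration order can differ between Python sessions, so the indices may too.
vocab = set(words)
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for i, word in enumerate(vocab)}

# Slide a window over the tokens: two context words predict the third
data = []
for i in range(0, len(words) - 2):
    context = [word_to_ix[words[i]], word_to_ix[words[i + 1]]]
    target = word_to_ix[words[i + 2]]
    data.append((context, target))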

CPU times: total: 844 ms
Wall time: 1.48 s


### Recurrent Neural Network (RNN) Model Definition

This code cell defines an RNN model for word prediction. The model architecture consists of the following components:

- **Embedding Layer:** An embedding layer that maps each word index to a dense vector representation.
- **RNN Layer:** An RNN layer that takes the embedded inputs and produces hidden states.
- **Linear Layer:** A linear layer that predicts the next word in the sequence based on the final hidden state.

**Arguments:**
- `vocab_size`: The size of the vocabulary.
- `embedding_dim`: The dimensionality of the word embeddings.
- `hidden_dim`: The dimensionality of the hidden state of the RNN.

**Methods:**
- `forward`: Defines the forward pass of the model, taking inputs and returning the predicted output.
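
To make "produces hidden states" concrete, here is a minimal pure-Python sketch of the recurrence a simple (Elman) RNN applies at each time step, `h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)`. The weights, sizes, and the `rnn_step` helper are hypothetical toy values; PyTorch's `nn.RNN` uses the same tanh recurrence by default but keeps two separate bias vectors, which are merged into one `b` here for brevity.

```python
import math

def rnn_step(x, h_prev, W_ih, W_hh, b):
    """One Elman RNN step: h = tanh(W_ih @ x + W_hh @ h_prev + b)."""
    h = []
    for row_ih, row_hh, bias in zip(W_ih, W_hh, b):
        pre = bias
        pre += sum(w * xi for w, xi in zip(row_ih, x))
        pre += sum(w * hi for w, hi in zip(row_hh, h_prev))
        h.append(math.tanh(pre))
    return h

# Hypothetical toy sizes: embedding_dim = 2, hidden_dim = 2
W_ih = [[0.5, -0.5], [0.25, 0.75]]
W_hh = [[0.0, 1.0], [1.0, 0.0]]
b = [0.0, 0.0]

context = [[1.0, 0.0], [0.0, 1.0]]  # two embedded context words
h = [0.0, 0.0]                      # initial hidden state
for x in context:
    h = rnn_step(x, h, W_ih, W_hh, b)

final_hidden = h  # this is what the linear layer would receive
print(final_hidden)
```

The second step depends on the first through `h_prev`, which is how the order of the two context words influences the prediction.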

In [5]:
# Define RNN model
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(RNN, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.linear = nn.Linear(hidden_dim, vocab_size)

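    def forward(self, inputs):
        # inputs: 1-D LongTensor of context word indices, shape (seq_len,)
        embeds = self.embeddings(inputs)  # (seq_len, embedding_dim)
        # nn.RNN expects (seq_len, batch, input_size); use a batch of one
        rnn_out, _ = self.rnn(embeds.view(len(inputs), 1, -1))
        output = self.linear(rnn_out.view(len(inputs), -1))  # (seq_len, vocab_size)
        return output[-1]  # unnormalized scores (logits) from the last time step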

In [6]:
len(vocab)

121

### Model Training Setup

This code cell sets up the model training process by defining hyperparameters, initializing the model, loss function, and optimizer, and splitting the data into training and validation sets.

**Hyperparameters:**
- `EMBEDDING_DIM`: Dimensionality of the word embeddings.
- `HIDDEN_DIM`: Dimensionality of the hidden state of the RNN.
- `LEARNING_RATE`: Learning rate for the optimizer.
- `EPOCHS`: Number of training epochs.

**Model Initialization:**
- An instance of the `RNN` class is initialized with the specified hyperparameters.

**Loss Function:**
- Cross-Entropy Loss is used as the loss function for training.

**Optimizer:**
- Stochastic Gradient Descent (SGD) optimizer is employed for parameter optimization.

**Data Splitting:**
- The data is split into training and validation sets using an 80-20 split ratio.
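
For a sense of scale, the sketch below works out how many trainable parameters these hyperparameters imply for the vocabulary of 121 words reported above. It assumes the standard single-layer `nn.RNN` layout (input-to-hidden and hidden-to-hidden weight matrices plus two bias vectors); the variable names are illustrative.

```python
vocab_size = 121      # len(vocab) reported by the notebook
embedding_dim = 10    # EMBEDDING_DIM
hidden_dim = 10       # HIDDEN_DIM

# nn.Embedding: one vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# nn.RNN (one layer): W_ih, W_hh and two bias vectors
rnn_params = (hidden_dim * embedding_dim   # input-to-hidden weights
              + hidden_dim * hidden_dim    # hidden-to-hidden weights
              + 2 * hidden_dim)            # b_ih and b_hh

# nn.Linear: weight matrix plus bias
linear_params = hidden_dim * vocab_size + vocab_size

total = embedding_params + rnn_params + linear_params
print(embedding_params, rnn_params, linear_params, total)
```

Most of the parameters sit in the embedding and output layers, both of which grow linearly with the vocabulary size.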



In [7]:
%%time
# Hyperparameters
EMBEDDING_DIM = 10
HIDDEN_DIM = 10
LEARNING_RATE = 0.1
EPOCHS = 100

# Initialize model, loss function, and optimizer
model = RNN(len(vocab), EMBEDDING_DIM, HIDDEN_DIM)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)

from sklearn.model_selection import train_test_split

# Split the data into training and validation sets (80% - 20%)
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

CPU times: total: 78.1 ms
Wall time: 901 ms


### Training Loop

This cell contains the training loop, which trains the model for `EPOCHS` epochs on the training data. After every epoch the validation loss is also computed, and both average losses are printed every 10 epochs to monitor performance on unseen data.

**Training Process:**
- At the start of each epoch, the running total loss is reset to zero.
- For each training sample in the training data:
  - The model gradients are zeroed.
  - The context and target indices are converted to PyTorch tensors.
  - The model outputs one unnormalized score (logit) per vocabulary word for the next word.
  - The loss is calculated from these logits and the actual target; `nn.CrossEntropyLoss` applies the log-softmax internally.
  - Gradients are backpropagated through the network.
  - The optimizer updates the model parameters.
- Every 10 epochs, the average training loss per sample is printed.

**Validation Process:**
- The model is switched to evaluation mode and gradient tracking is disabled.
- The running total validation loss is reset to zero.
- For each validation sample in the validation data:
  - The context and target indices are converted to PyTorch tensors.
  - The model outputs one unnormalized score (logit) per vocabulary word for the next word.
  - The validation loss is calculated from these logits and the actual target.
- Every 10 epochs, the average validation loss per sample is printed.

**Note:** The model is switched back to training mode after validation.
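
The loss used here is computed directly from logits. As a minimal sketch of that calculation for a single sample, the function below evaluates `-log(softmax(logits)[target])` in pure Python; the function name and the example logits are illustrative assumptions, not part of the notebook.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one sample: log-sum-exp of the logits minus the target logit."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 0.5, -1.0]  # hypothetical scores for a 3-word vocabulary
loss_correct = cross_entropy_from_logits(logits, target=0)  # target has the top score
loss_wrong = cross_entropy_from_logits(logits, target=2)    # target has the lowest score
print(loss_correct, loss_wrong)
```

The loss is small when the target word has the highest score and grows as the target's score falls behind the others.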

In [8]:
# Training loop
for epoch in range(EPOCHS):
    total_loss = 0
    for context, target in train_data:
        model.zero_grad()

        context_idxs = torch.tensor(context, dtype=torch.long)
        target = torch.tensor([target], dtype=torch.long)

        logits = model(context_idxs)  # raw scores; CrossEntropyLoss applies log-softmax
        loss = loss_function(logits.unsqueeze(0), target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    if (epoch+1) % 10 == 0:
        print(f'Epoch {epoch+1}/{EPOCHS}, Training Loss: {total_loss/len(train_data):.4f}', end=' ')

    # Compute validation loss
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for context, target in val_data:
            context_idxs = torch.tensor(context, dtype=torch.long)
            target = torch.tensor([target], dtype=torch.long)

            logits = model(context_idxs)  # raw scores; CrossEntropyLoss applies log-softmax
            val_loss = loss_function(logits.unsqueeze(0), target)
            total_val_loss += val_loss.item()
    if (epoch+1) % 10 == 0:
        print(f'Validation Loss: {total_val_loss/len(val_data):.4f}')
    model.train()

Epoch 10/100, Training Loss: 1.6664 Validation Loss: 6.2019
Epoch 20/100, Training Loss: 0.5274 Validation Loss: 7.5834
Epoch 30/100, Training Loss: 0.2545 Validation Loss: 8.3788
Epoch 40/100, Training Loss: 0.1577 Validation Loss: 8.9227
Epoch 50/100, Training Loss: 0.1139 Validation Loss: 9.3189
Epoch 60/100, Training Loss: 0.0891 Validation Loss: 9.6255
Epoch 70/100, Training Loss: 0.0732 Validation Loss: 9.8758
Epoch 80/100, Training Loss: 0.0622 Validation Loss: 10.0883
Epoch 90/100, Training Loss: 0.0540 Validation Loss: 10.2732
Epoch 100/100, Training Loss: 0.0478 Validation Loss: 10.4370
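
In the log above, the training loss keeps falling while the validation loss keeps rising, a pattern that usually indicates the model is memorizing the training samples rather than generalizing. One way to put the numbers in context, sketched below, is to compare them with the cross-entropy of a uniform guess over the 121-word vocabulary, which is `ln(121)`. The loss values are copied from the log; the variable names are illustrative.

```python
import math

vocab_size = 121                      # len(vocab) reported by the notebook
uniform_loss = math.log(vocab_size)   # cross-entropy of guessing every word equally

first_val_loss = 6.2019               # validation loss at epoch 10 (from the log)
last_val_loss = 10.4370               # validation loss at epoch 100 (from the log)

print(round(uniform_loss, 4))         # 4.7958
print(first_val_loss > uniform_loss, last_val_loss > uniform_loss)
```

Both validation losses are above the uniform baseline, which is unsurprising for a model trained on a single short passage.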


### Prediction Function

This code cell defines a function `predict_next_words` to generate predictions for the next words given a starting text sequence. The function takes the trained model, starting text, word-to-index and index-to-word mappings, and the number of words to predict.

**Functionality:**
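- The starting text is cleaned and lemmatized to obtain the last two words as context. Both words must be in the vocabulary; an unseen word raises a `KeyError`.
- The next words are predicted iteratively: at each step the highest-scoring word (argmax) is chosen, and the context slides forward by one word to include it.
- The predicted words are appended to a list and returned.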

**Arguments:**
- `model`: Trained RNN model for word prediction.
- `text`: Starting text sequence.
- `word_to_ix`: Dictionary mapping words to indices.
- `ix_to_word`: Dictionary mapping indices to words.
- `num_words`: Number of words to predict.

**Returns:**
- `predicted_words`: List of predicted words.
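
The iterative prediction can be sketched without a trained model. In the example below, `toy_scores` is a hypothetical stand-in for the model (it simply favours one index based on the context), and `greedy_generate` is an illustrative helper that shows the argmax choice and the sliding two-word context.

```python
def greedy_generate(score_fn, context, num_words):
    """Repeatedly pick the highest-scoring index and slide the 2-word context."""
    context = list(context)
    generated = []
    for _ in range(num_words):
        scores = score_fn(context)
        next_ix = max(range(len(scores)), key=scores.__getitem__)  # argmax
        generated.append(next_ix)
        context = [context[-1], next_ix]  # drop the oldest word, add the new one
    return generated

def toy_scores(context):
    """Hypothetical stand-in for the trained model over a 4-word vocabulary."""
    favoured = (context[0] + context[1] + 1) % 4
    return [1.0 if i == favoured else 0.0 for i in range(4)]

result = greedy_generate(toy_scores, [0, 1], num_words=4)
print(result)  # [2, 0, 3, 0]
```

Because argmax is deterministic, the same starting context always produces the same continuation.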

In [11]:
# Prediction
def predict_next_words(model, text, word_to_ix, ix_to_word, num_words=5):
    words = clean_text_Lemmatization(text)
    context = [word_to_ix[words[-2]], word_to_ix[words[-1]]]
    predicted_words = []
    print(words, context)
    for _ in range(num_words):
        context_tensor = torch.tensor(context, dtype=torch.long)
        with torch.no_grad():
            output = model(context_tensor)
        predicted_ix = torch.argmax(output).item()
        predicted_word = ix_to_word[predicted_ix]
        predicted_words.append(predicted_word)
        context = [context[-1], predicted_ix]  # Update context for next prediction
    return predicted_words

text = 'sarah remember'
predicted_words = predict_next_words(model, text, word_to_ix, ix_to_word, num_words=10)
print(f'The predicted next {len(predicted_words)} words are: {predicted_words}')
print('\n' + text + ' ' + ' '.join(predicted_words))

['sarah', 'remember'] [77, 82]
The predicted next 10 words are: ['ahead', 'moment', 'knew', 'venturing', 'side', 'rock', 'rhythmic', 'symphony', 'soothing', 'welcomed']

sarah remember ahead moment knew venturing side rock rhythmic symphony soothing welcomed


### Prediction Function with Temperature Sampling

This code cell defines a modified prediction function `predict_next_word_temperature` to generate predictions for the next words using temperature sampling. Temperature sampling is a technique used to control the randomness of predictions generated by the model.

**Functionality:**
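- The starting text is cleaned and lemmatized to obtain the last two words as context. A word that is not in the vocabulary silently falls back to index 0, which corresponds to an arbitrary vocabulary word.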
- Using the model, the next words are predicted iteratively based on the context and temperature sampling.
- The predicted words are appended to a list and returned.

**Arguments:**
- `model`: Trained RNN model for word prediction.
- `text`: Starting text sequence.
- `word_to_ix`: Dictionary mapping words to indices.
- `ix_to_word`: Dictionary mapping indices to words.
- `num_words`: Number of words to predict.
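- `temperature`: Positive value that divides the logits before the softmax. Values below 1 sharpen the distribution towards the top-scoring words, values above 1 flatten it and increase randomness, and 1 leaves it unchanged.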

**Returns:**
- `predicted_words`: List of predicted words.
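
The effect of the temperature can be seen on a small example. The sketch below, in pure Python with made-up logits and an illustrative helper name, divides the logits by the temperature and applies a softmax, mirroring the two lines that do this in the function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then normalize with a softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for a 3-word vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the top-scoring word is largest at the lowest temperature and shrinks as the temperature rises, so sampling becomes more varied.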

In [12]:
def predict_next_word_temperature(model, text, word_to_ix, ix_to_word, num_words=5, temperature=1.0):
    words = clean_text_Lemmatization(text)
    context = [word_to_ix.get(words[-2], 0), word_to_ix.get(words[-1], 0)]
    predicted_words = []
    for _ in range(num_words):
        context_tensor = torch.tensor(context, dtype=torch.long)
        with torch.no_grad():
            output = model(context_tensor)
        output = output / temperature
        probabilities = torch.nn.functional.softmax(output, dim=0)
        predicted_ix = torch.multinomial(probabilities, 1).item()
        predicted_word = ix_to_word[predicted_ix]
        predicted_words.append(predicted_word)
        context = [context[-1], predicted_ix]  # Update context for next prediction
    return predicted_words

text = 'sarah remember'
predicted_words = predict_next_word_temperature(model, text, word_to_ix, ix_to_word, num_words=10, temperature=1)
print(f'The predicted next {len(predicted_words)} words are: {predicted_words}')
print('\n' + text + ' ' + ' '.join(predicted_words))

The predicted next 10 words are: ['ahead', 'mirroring', 'knew', 'charged', 'vastness', 'air', 'something', 'electric', 'charged', 'watched']

sarah remember ahead mirroring knew charged vastness air something electric charged watched


## Project Completed by:
[Karim Nasr](https://www.linkedin.com/in/karim-nasr-abu-al-fath/)