# Assignment 2 - Task 1

##### Name: Jaimis Arvindbhai Miyani
    
##### Student ID: 400551743
    
##### MacID: miyanij@mcmaster.ca
    
##### Subject: SEP 775 - Introduction to Computational Natural Language Processing

# RNN-Based Text Generation
### Objectives:
Implement a simple RNN for text generation to deepen your understanding of how recurrent neural networks can be used to model sequences and generate text based on learned patterns.

### 1. RNN Model Implementation
#### Implement a basic Recurrent Neural Network model from scratch using PyTorch or TensorFlow. Your model should include an embedding layer, at least one RNN layer, and a fully connected layer for output. Refer to the "Recurrent Neural Networks (RNN)" section of the lectures for guidance on the architecture.
#### Use the "Long Short-Term Memory RNNs (LSTMs)" section as a reference to enhance your model with LSTM cells to improve its ability to capture long-term dependencies in text.


In [3]:
import torch
import torch.nn as nn

class RNNModel(nn.Module):
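    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, num_layers=1, bidirectional=False):
        super().__init__()
        # Map token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=bidirectional)
        # A bidirectional LSTM produces two hidden states that forward() concatenates,
        # so the linear layer must accept twice the hidden size in that case
        fc_input_dim = hidden_dim * 2 if bidirectional else hidden_dim
        self.fc = nn.Linear(fc_input_dim, output_dim)
    
    def forward(self, x):
        # x holds token indices; with the default batch_first=False the LSTM reads it as (seq_len, batch)
        embedded = self.embedding(x)
        output, (hidden, cell) = self.lstm(embedded)
        # If bidirectional, concatenate the final forward and backward hidden states
        if self.lstm.bidirectional:
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        else:
            hidden = hidden[-1,:,:]
        # Project the final hidden state of the last layer onto the vocabulary
        output = self.fc(hidden)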
        return output
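
A note on tensor layout: `nn.LSTM` expects input shaped `(seq_len, batch, features)` unless `batch_first=True` is set, so an input of shape `(1, seq_length)` passed to the model above is read as `seq_length` separate sequences of length one. The following is a minimal sketch of one possible variant, not the model trained in this notebook, that treats the input as `(batch, seq_len)` and returns one next-word prediction per position. The class name `SeqLSTMModel` and the demo sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SeqLSTMModel(nn.Module):
    """Hypothetical variant: one next-token prediction per input position."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first=True makes the LSTM read its input as (batch, seq_len, features)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)      # (batch, seq_len, embedding_dim)
        output, _ = self.lstm(embedded)   # (batch, seq_len, hidden_dim)
        return self.fc(output)            # (batch, seq_len, vocab_size)

# Demo with assumed sizes: 2 sequences of 10 tokens from a 50-word vocabulary
demo_model = SeqLSTMModel(vocab_size=50, embedding_dim=8, hidden_dim=16)
demo_input = torch.randint(0, 50, (2, 10))
demo_logits = demo_model(demo_input)
print(demo_logits.shape)
```

With this layout, `nn.CrossEntropyLoss` would need the logits flattened to `(batch * seq_len, vocab_size)` and the targets flattened to `(batch * seq_len,)`.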


### 2. Dataset Preparation
#### Select a small text dataset for training your model. This could be a collection of poems, song lyrics, or any text of your choice. Preprocess the data by tokenizing the text into sequences and converting them into numerical format suitable for training your RNN.

In [4]:
import re
from collections import Counter

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Tokenize text by splitting on whitespace and punctuation
    tokens = re.findall(r'\b\w+\b', text)
    return tokens

def create_vocab(tokenized_text):
    # Create vocabulary (map tokens to indices)
    vocab = {}
    for token in tokenized_text:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def numerify_text(tokenized_text, vocab):
    # Convert tokens to numerical indices using vocabulary
    numerical_text = [vocab[token] for token in tokenized_text]
    return numerical_text

# Your choice of song lyrics
song_lyrics = """
Uh huh, uh huh (yeah, Rihanna)
Uh huh, uh huh (Good Girl Gone Bad)
Uh huh, uh huh (take three, action)
Uh huh, uh huh (Hov)

No clouds in my stones
Let it rain, I hydroplane in the bank
Comin' down like the Dow Jones
When the clouds come, we gone
We Roc-A-Fella
We fly higher than weather
In G5's or better
You know me (you know me)
In anticipation for precipitation stack chips for the rainy day
Jay, Rain Man is back
With Little Ms. Sunshine, Rihanna, where you at?

You have my heart
And we'll never be worlds apart
Maybe in magazines
But you'll still be my star
Baby, 'cause in the dark
You can't see shiny cars
And that's when you need me there
With you I'll always share
Because

When the sun shines, we'll shine together
Told you I'll be here forever
Said I'll always be your friend
Took an oath, I'ma stick it out to the end
Now that it's raining more than ever
Know that we'll still have each other
You can stand under my umbrella
You can stand under my umbrella, ella, ella, eh, eh, eh
Under my umbrella, ella, ella, eh, eh, eh
Under my umbrella, ella, ella, eh, eh, eh
Under my umbrella, ella, ella, eh, eh, eh, eh, eh-eh

These fancy things will never come in between
You're part of my entity, here for infinity
When the war has took its part
When the world has dealt its cards
If the hand is hard
Together we'll mend your heart
Because

When the sun shines, we shine together
Told you I'll be here forever
Said I'll always be your friend
Took an oath, I'ma stick it out to the end
Now that it's raining more than ever
Know that we'll still have each other
You can stand under my umbrella
You can stand under my umbrella, ella, ella, eh, eh, eh
Under my umbrella, ella, ella, eh, eh, eh
Under my umbrella, ella, ella, eh, eh, eh
Under my umbrella, ella, ella, eh, eh, eh, eh, eh-eh

You can run into my arms
It's okay, don't be alarmed
Come into me (there's no distance in between our love)
So gon' and let the rain pour
I'll be all you need and more
Because

When the sun shines, we shine together
Told you I'll be here forever
Said I'll always be your friend
Took an oath, I'ma stick it out to the end
Now that it's raining more than ever
Know that we'll still have each other
You can stand under my umbrella
You can stand under my umbrella, ella, ella, eh, eh, eh
Under my umbrella, ella, ella, eh, eh, eh
Under my umbrella, ella, ella, eh, eh, eh
Under my umbrella, ella, ella, eh, eh, eh, eh, eh-eh

It's raining, raining
Ooh, baby, it's raining, raining
Baby, come into me
Come into me
It's raining, raining
Ooh, baby, it's raining, raining
You can always come into me
Come into me
It's pouring rain
It's pouring rain
Come into me
Come into me
It's pouring rain
It's pouring rain, come into me
"""


# Preprocess the text
tokens = preprocess_text(song_lyrics)

# Create vocabulary
vocab = create_vocab(tokens)

# Numerify the text
numerical_text = numerify_text(tokens, vocab)

# Print tokens, vocabulary, and numerical text
print(f"--------------- Tokens ---------------\n\n{tokens}")
print(f"\n\n------------- Vocabulary -------------\n\n{vocab}")
print(f"\n\n----------- Numerical text -----------\n\n{numerical_text}")


--------------- Tokens ---------------

['uh', 'huh', 'uh', 'huh', 'yeah', 'rihanna', 'uh', 'huh', 'uh', 'huh', 'good', 'girl', 'gone', 'bad', 'uh', 'huh', 'uh', 'huh', 'take', 'three', 'action', 'uh', 'huh', 'uh', 'huh', 'hov', 'no', 'clouds', 'in', 'my', 'stones', 'let', 'it', 'rain', 'i', 'hydroplane', 'in', 'the', 'bank', 'comin', 'down', 'like', 'the', 'dow', 'jones', 'when', 'the', 'clouds', 'come', 'we', 'gone', 'we', 'roc', 'a', 'fella', 'we', 'fly', 'higher', 'than', 'weather', 'in', 'g5', 's', 'or', 'better', 'you', 'know', 'me', 'you', 'know', 'me', 'in', 'anticipation', 'for', 'precipitation', 'stack', 'chips', 'for', 'the', 'rainy', 'day', 'jay', 'rain', 'man', 'is', 'back', 'with', 'little', 'ms', 'sunshine', 'rihanna', 'where', 'you', 'at', 'you', 'have', 'my', 'heart', 'and', 'we', 'll', 'never', 'be', 'worlds', 'apart', 'maybe', 'in', 'magazines', 'but', 'you', 'll', 'still', 'be', 'my', 'star', 'baby', 'cause', 'in', 'the', 'dark', 'you', 'can', 't', 'see', 'shiny', '
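
The token list above contains fragments such as 'll', 's', and 't' because the pattern `\b\w+\b` splits contractions at the apostrophe. The sketch below compares that pattern with one possible alternative that keeps an apostrophe followed by letters inside the word. The function name `tokenize_keep_contractions` and its regular expression are assumptions for illustration, not part of the preprocessing used in this notebook.

```python
import re

def tokenize_keep_contractions(text):
    # Hypothetical alternative: an apostrophe followed by letters stays inside the word
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

line = "When the sun shines, we'll shine together"

# Pattern used by preprocess_text: the contraction is split into 'we' and 'll'
split_tokens = re.findall(r"\b\w+\b", line.lower())
# Alternative pattern: the contraction survives as a single token
joined_tokens = tokenize_keep_contractions(line)

print(split_tokens)
print(joined_tokens)
```

Keeping contractions whole gives a slightly larger vocabulary but avoids generated text with stray tokens such as "it s" or "we ll".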

### 3. Training (10 Points)
#### Train your RNN model on the prepared dataset. Aim to optimize the model to predict the next word in a sequence based on the given context. Adjust hyperparameters such as learning rate, number of epochs, and hidden layer dimensions to improve performance.

In [5]:
# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 100
hidden_dim = 256
output_dim = vocab_size
learning_rate = 0.001
num_epochs = 100
seq_length = 10  # Sequence length for training

# Convert numerical text to PyTorch tensor
numerical_text_tensor = torch.tensor(numerical_text)

# Define model, loss function, and optimizer
model = RNNModel(vocab_size, embedding_dim, hidden_dim, output_dim)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    for i in range(0, len(numerical_text_tensor) - seq_length, seq_length):
        # Get input and target sequences
        inputs = numerical_text_tensor[i:i+seq_length].unsqueeze(0)
        targets = numerical_text_tensor[i+1:i+1+seq_length].view(-1)

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if i % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i}/{len(numerical_text_tensor)-seq_length}], Loss: {loss.item():.4f}')

# Save the trained model
torch.save(model.state_dict(), 'rnn_model.pth')

Epoch [1/100], Step [0/552], Loss: 5.0324
Epoch [1/100], Step [100/552], Loss: 5.0420
Epoch [1/100], Step [200/552], Loss: 5.0411
Epoch [1/100], Step [300/552], Loss: 4.9097
Epoch [1/100], Step [400/552], Loss: 4.8458
Epoch [1/100], Step [500/552], Loss: 4.8033
Epoch [2/100], Step [0/552], Loss: 4.2868
Epoch [2/100], Step [100/552], Loss: 4.8396
Epoch [2/100], Step [200/552], Loss: 2.6307
Epoch [2/100], Step [300/552], Loss: 4.2102
Epoch [2/100], Step [400/552], Loss: 4.0916
Epoch [2/100], Step [500/552], Loss: 3.7073
Epoch [3/100], Step [0/552], Loss: 3.3117
Epoch [3/100], Step [100/552], Loss: 4.5739
Epoch [3/100], Step [200/552], Loss: 0.5607
Epoch [3/100], Step [300/552], Loss: 3.1494
Epoch [3/100], Step [400/552], Loss: 3.1376
Epoch [3/100], Step [500/552], Loss: 2.6763
Epoch [4/100], Step [0/552], Loss: 2.1112
Epoch [4/100], Step [100/552], Loss: 4.1520
Epoch [4/100], Step [200/552], Loss: 0.5199
Epoch [4/100], Step [300/552], Loss: 1.7719
Epoch [4/100], Step [400/552], Loss: 2.3

Epoch [31/100], Step [500/552], Loss: 1.1645
Epoch [32/100], Step [0/552], Loss: 0.7074
Epoch [32/100], Step [100/552], Loss: 1.2429
Epoch [32/100], Step [200/552], Loss: 0.4828
Epoch [32/100], Step [300/552], Loss: 0.5416
Epoch [32/100], Step [400/552], Loss: 1.1745
Epoch [32/100], Step [500/552], Loss: 1.1588
Epoch [33/100], Step [0/552], Loss: 0.7121
Epoch [33/100], Step [100/552], Loss: 1.2418
Epoch [33/100], Step [200/552], Loss: 0.4978
Epoch [33/100], Step [300/552], Loss: 0.5408
Epoch [33/100], Step [400/552], Loss: 1.1727
Epoch [33/100], Step [500/552], Loss: 1.1627
Epoch [34/100], Step [0/552], Loss: 0.7049
Epoch [34/100], Step [100/552], Loss: 1.2393
Epoch [34/100], Step [200/552], Loss: 0.4818
Epoch [34/100], Step [300/552], Loss: 0.5400
Epoch [34/100], Step [400/552], Loss: 1.1714
Epoch [34/100], Step [500/552], Loss: 1.1567
Epoch [35/100], Step [0/552], Loss: 0.7095
Epoch [35/100], Step [100/552], Loss: 1.2384
Epoch [35/100], Step [200/552], Loss: 0.4991
Epoch [35/100], St

Epoch [62/100], Step [500/552], Loss: 1.1408
Epoch [63/100], Step [0/552], Loss: 0.6883
Epoch [63/100], Step [100/552], Loss: 1.2121
Epoch [63/100], Step [200/552], Loss: 0.4916
Epoch [63/100], Step [300/552], Loss: 0.5274
Epoch [63/100], Step [400/552], Loss: 1.1433
Epoch [63/100], Step [500/552], Loss: 1.1454
Epoch [64/100], Step [0/552], Loss: 0.6807
Epoch [64/100], Step [100/552], Loss: 1.2112
Epoch [64/100], Step [200/552], Loss: 0.4771
Epoch [64/100], Step [300/552], Loss: 0.5272
Epoch [64/100], Step [400/552], Loss: 1.1428
Epoch [64/100], Step [500/552], Loss: 1.1401
Epoch [65/100], Step [0/552], Loss: 0.6874
Epoch [65/100], Step [100/552], Loss: 1.2110
Epoch [65/100], Step [200/552], Loss: 0.4907
Epoch [65/100], Step [300/552], Loss: 0.5269
Epoch [65/100], Step [400/552], Loss: 1.1421
Epoch [65/100], Step [500/552], Loss: 1.1445
Epoch [66/100], Step [0/552], Loss: 0.6798
Epoch [66/100], Step [100/552], Loss: 1.2101
Epoch [66/100], Step [200/552], Loss: 0.4771
Epoch [66/100], St

Epoch [93/100], Step [400/552], Loss: 1.1285
Epoch [93/100], Step [500/552], Loss: 1.1357
Epoch [94/100], Step [0/552], Loss: 0.6708
Epoch [94/100], Step [100/552], Loss: 1.1984
Epoch [94/100], Step [200/552], Loss: 0.4752
Epoch [94/100], Step [300/552], Loss: 0.5210
Epoch [94/100], Step [400/552], Loss: 1.1278
Epoch [94/100], Step [500/552], Loss: 1.1308
Epoch [95/100], Step [0/552], Loss: 0.6786
Epoch [95/100], Step [100/552], Loss: 1.1983
Epoch [95/100], Step [200/552], Loss: 0.4869
Epoch [95/100], Step [300/552], Loss: 0.5208
Epoch [95/100], Step [400/552], Loss: 1.1277
Epoch [95/100], Step [500/552], Loss: 1.1352
Epoch [96/100], Step [0/552], Loss: 0.6703
Epoch [96/100], Step [100/552], Loss: 1.1977
Epoch [96/100], Step [200/552], Loss: 0.4750
Epoch [96/100], Step [300/552], Loss: 0.5207
Epoch [96/100], Step [400/552], Loss: 1.1271
Epoch [96/100], Step [500/552], Loss: 1.1303
Epoch [97/100], Step [0/552], Loss: 0.6781
Epoch [97/100], Step [100/552], Loss: 1.1976
Epoch [97/100], St
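
The training loop pairs each input window with a target window shifted one token to the right, so every position is trained to predict the word that follows it. The sketch below isolates that indexing on a toy list of token ids. The helper name `make_windows` and the toy data are hypothetical; the slicing mirrors the loop above.

```python
def make_windows(token_ids, seq_length):
    """Yield (inputs, targets) pairs where targets are inputs shifted by one token."""
    # Step by seq_length so that consecutive input windows do not overlap
    for start in range(0, len(token_ids) - seq_length, seq_length):
        inputs = token_ids[start:start + seq_length]
        targets = token_ids[start + 1:start + 1 + seq_length]
        yield inputs, targets

toy_ids = [0, 1, 2, 3, 4, 5, 6]
pairs = list(make_windows(toy_ids, seq_length=3))
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each target list is the input list moved forward by one position, which is what lets a single forward pass supply `seq_length` next-word training examples.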

### 4. Text Generation
#### Once trained, use your model to generate text. Start with a seed sentence or word, then predict the next word using your model. Append the predicted word to your text and use the updated sequence as the new input to generate the next word. Repeat this process to generate a text of at least 100 words.


In [6]:
import torch
import torch.nn.functional as F
import numpy as np

# Load the trained model
model = RNNModel(vocab_size, embedding_dim, hidden_dim, output_dim)
model.load_state_dict(torch.load('rnn_model.pth'))
model.eval()

# Function to generate text
def generate_text(model, start_text, length=100):
    model.eval()
    current_text = start_text
    # Reverse lookup built once instead of searching the vocabulary on every step
    index_to_word = {index: word for word, index in vocab.items()}
    with torch.no_grad():
        for _ in range(length):
            # Tokenize the current text (every token must already be in vocab)
            tokens = preprocess_text(current_text)
            numerical_tokens = numerify_text(tokens, vocab)

            # Convert to PyTorch tensor and add batch dimension
            input_tensor = torch.tensor(numerical_tokens).unsqueeze(0)

            # Forward pass: the model returns one row of logits per input token
            output = model(input_tensor)

            # Keep the logits for the last token and apply softmax over the vocabulary;
            # float64 keeps the probabilities summing to 1 within NumPy's tolerance
            probabilities = F.softmax(output[-1].double(), dim=0).numpy()

            # Sample the next word based on the predicted probabilities
            predicted_index = int(np.random.choice(len(probabilities), p=probabilities))
            predicted_word = index_to_word[predicted_index]

            # Append the predicted word to the current text
            current_text += ' ' + predicted_word

    return current_text

# Seed sentence to start text generation
seed_sentence = "You can stand under my umbrella"


# Generate text
generated_text = generate_text(model, seed_sentence, length=100)

# Print the generated text
print(f"------------ Generated Text ------------\n\n{generated_text}")


------------ Generated Text ------------

You can stand under my umbrella you need and we ll mend your friend took an oath i ma stick it s raining you can stand under my arms it s no clouds come into me you need and more because when the world has dealt its cards if the world has dealt its cards if the rain come in the take three action uh huh uh huh uh huh war has dealt its cards if the bank comin down like the rain come into me there s no distance in between you can stand under my umbrella you need and more because when the end
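
The generation step turns the model's scores into probabilities with softmax and then samples from them, which is why repeated runs produce different text. The sketch below shows that step in isolation with plain Python and contrasts it with greedy selection, which always picks the most probable word. The logits are made-up values and the helper names are hypothetical.

```python
import math
import random

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def sample_index(probabilities, rng):
    # Draw one index; higher-probability entries are chosen more often
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-word vocabulary
toy_probs = softmax(toy_logits)

# Greedy decoding: always the index with the highest probability
greedy_index = max(range(len(toy_probs)), key=toy_probs.__getitem__)
# Sampling: a random draw weighted by the probabilities
sampled_index = sample_index(toy_probs, random.Random(0))

print(toy_probs, greedy_index, sampled_index)
```

Greedy selection tends to repeat the most frequent continuations, while sampling adds variety at the cost of occasionally choosing unlikely words.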


### 5. Analysis
#### Analyze the generated text. Discuss how well your model captures the style and coherence of the chosen dataset. Reflect on the performance of the basic RNN model versus the LSTM-enhanced version. Consider the effects of different hyperparameters on the quality of the generated text.

##### Style and Coherence Analysis:
The generated text reuses phrases from the training song, "Umbrella" by Rihanna, such as "you can stand under my umbrella," "took an oath," and "there s no distance in between." This indicates that the model has retained the vocabulary and recurring phrases of the song.

However, coherence breaks down in places. Phrases like "we ll mend your friend," "stand under my arms," and "if the bank comin down like the rain" splice together fragments of different lines and do not make sense as lyrics. This suggests that the model has learned local word-to-word patterns from the training data but struggles to maintain coherence and context over longer sequences.

##### Comparison with Original Song:
The generated text still includes elements from the original song, such as the repeated phrase "you can stand under my umbrella." It also produces word combinations that do not occur in the lyrics, although every individual word comes from the song, because the vocabulary is built from the lyrics alone.
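
One way to put a number on how much of the generated text departs from the lyrics is to count the adjacent word pairs (bigrams) that never occur in the source. The sketch below is an illustrative measure under that assumption, not a metric used in this assignment. The function names and the two toy sentences are hypothetical.

```python
def bigrams(tokens):
    # Adjacent word pairs, e.g. ['a', 'b', 'c'] -> [('a', 'b'), ('b', 'c')]
    return list(zip(tokens, tokens[1:]))

def novel_bigram_fraction(generated_tokens, source_tokens):
    """Fraction of generated bigrams that never occur in the source text."""
    source_pairs = set(bigrams(source_tokens))
    generated_pairs = bigrams(generated_tokens)
    if not generated_pairs:
        return 0.0
    unseen = [pair for pair in generated_pairs if pair not in source_pairs]
    return len(unseen) / len(generated_pairs)

# Toy example: only the last pair ('my', 'arms') is absent from the toy source
source = "you can stand under my umbrella".split()
generated = "you can stand under my arms".split()
fraction = novel_bigram_fraction(generated, source)
print(fraction)
```

A fraction near zero would mean the output is mostly copied word pairs, while a high fraction would point to combinations the model did not see during training.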

##### Reflecting on Model Performance:
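Basic RNN vs. LSTM-Enhanced Model: Only the LSTM version was trained in this notebook, so the comparison with a basic RNN is an expectation rather than a measured result. LSTM cells are designed to capture long-term dependencies better than a basic RNN, which should help with the stylistic nuances of the lyrics. However, even with the LSTM, the generated text exhibits inconsistencies and lacks full coherence.

Effects of Hyperparameters: Adjusting hyperparameters such as the learning rate, number of epochs, and hidden layer dimensions affects the quality of the generated text. In the training log, the loss at a given step changes little between epoch 32 and epoch 96 (for example, 1.2429 versus 1.1977 at step 100), which suggests that training for fewer epochs would give similar results. Further tuning of the learning rate and hidden dimension could potentially lead to better results.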

##### Conclusion:
While the generated text resembles the style of the song, there is still room for improvement in coherence and accuracy. Experimenting with different hyperparameters, and possibly using more advanced techniques, could help the model produce more coherent and stylistically accurate text. A larger and more diverse training dataset may also improve the model's ability to capture the intricacies of song lyrics.