# Eminem Song Lyrics Generation using GPT-2

This Jupyter notebook walks through fine-tuning a pre-trained GPT-2 model to generate song lyrics in the style of Eminem. The goal is to capture characteristic features of his writing, such as recurring themes, rhyme schemes, and dense wordplay, using GPT-2, a transformer language model built for text generation.

## Project Overview

- **Objective**: To train a GPT-2 model capable of generating song lyrics that mimic Eminem's style.
- **Data Source**: The dataset consists of a collection of Eminem's song lyrics, curated to provide a diverse representation of his work.
- **Approach**: We fine-tune a pre-trained GPT-2 model on the Eminem lyrics dataset after cleaning the text: removing section markers, handling special characters, stripping non-ASCII characters, and expanding contractions.
- **Expected Outcome**: A model that generates new lyrics reflecting Eminem's thematic elements, rhyme schemes, and lyrical complexity.

By the end of this notebook, we will have a fine-tuned model that can produce Eminem-style song lyrics, per-epoch training loss and perplexity curves, and sample generations under several decoding configurations.


In [5]:
# Imports

import torch
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch optimizer
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_linear_schedule_with_warmup
import numpy as np
import pandas as pd
import os
import re
from utilities import import_data_from_location, handle_special_characters
import utilities as util

# Initialize the tokenizer and model from the pre-trained 'gpt2' model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Check for GPU availability and set the device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

print(f"Setup completed. Using device: {device}")

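# Register a dedicated padding token and the song boundary markers
special_tokens = {'pad_token': '<PAD>'}
special_tokens_dict = {'additional_special_tokens': ['<startsong>', '<endsong>']}

tokenizer.add_special_tokens(special_tokens)
tokenizer.add_special_tokens(special_tokens_dict)

# Grow the embedding matrix so the newly added token IDs have rows
model.resize_token_embeddings(len(tokenizer))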





You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 50260. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


Setup completed. Using device: cpu


Embedding(50260, 768)
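
The warning above refers to the `pad_to_multiple_of` parameter. As a small arithmetic sketch of what padding the vocabulary would mean here, the snippet below rounds the new vocabulary size up to a multiple. The helper `round_up_to_multiple` is hypothetical, and the multiples 8 and 64 are assumptions chosen for illustration; the linked NVIDIA guide is the place to check which value suits a given GPU and data type.

```python
def round_up_to_multiple(n, multiple):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

base_vocab = 50257   # size of the stock GPT-2 vocabulary
added_tokens = 3     # <PAD>, <startsong>, <endsong>
vocab_size = base_vocab + added_tokens

# Illustrative multiples only; the right value depends on the hardware
padded_8 = round_up_to_multiple(vocab_size, 8)
padded_64 = round_up_to_multiple(vocab_size, 64)

print(vocab_size, padded_8, padded_64)  # 50260 50264 50304
```

The extra rows would be unused embedding slots: they cost a little memory but leave the tokenizer's IDs unchanged.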

## Data Preprocessing

In this section, we'll prepare the Eminem song lyrics dataset for training the GPT-2 model. This involves loading the dataset, cleaning the text, tokenizing the lyrics, and organizing the data into a suitable format for the model.
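
The cleaning helpers in `utilities` are not shown in this notebook. As a rough sketch of what such steps might look like, the stand-ins below reuse the function names called in the next cell, but their bodies and the tiny `CONTRACTIONS` table are assumptions made for illustration, not the project's actual implementation.

```python
import re

# Hypothetical, deliberately tiny table; the real util.contractions_dict is not shown
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot"}

def remove_sections(text):
    """Drop section markers such as [Verse 1] or [Chorus: Eminem]."""
    return re.sub(r"\[(Verse|Intro|Chorus|Interlude|Outro).*?\]", "", text).strip()

def remove_non_ascii_characters(text):
    """Keep only characters in the 7-bit ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")

def expand_contractions(text, contractions):
    """Replace each known contraction with its expanded form (case-insensitive match)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in contractions) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: contractions[m.group(0).lower()], text)

sample = "[Verse 1]\nI'm not afraid, don't déjà vu"
cleaned = remove_sections(sample)
cleaned = remove_non_ascii_characters(cleaned)
cleaned = expand_contractions(cleaned, CONTRACTIONS)
print(cleaned)
```

Note that stripping non-ASCII characters deletes accented letters outright rather than transliterating them, so words such as "déjà" lose characters. Whether that trade-off is acceptable depends on how often such words occur in the lyrics.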



In [9]:

file_loc = './Eminem_Lyrics.csv'
songs = util.import_data_from_location(file_loc)

def remove_sections(text):
    # Define a regular expression pattern to match the song sections and their variations
    pattern = r"\[(Verse|Intro|Chorus|Interlude|Outro).*?\]"
    
    # Replace the matched patterns with an empty string
    cleaned_text = re.sub(pattern, "", text)
    
    return cleaned_text.strip()


def PreProcess(df,col):
    df[col] = df[col].apply(remove_sections)
    print(f"Lyrics after remove_sections: {df[col][0]}")
    df[col] = df[col].apply(util.handle_special_characters)
    print(f"Lyrics after handle_special_characters: {df[col][0]}")
    df[col] = df[col].apply(util.remove_non_ascii_characters)
    print(f"Lyrics after remove_non_ascii_characters: {df[col][0]}")
    df[col] = df[col].apply(util.expand_contractions, args=(util.contractions_dict,))
    print(f"Lyrics after expanding contractions: {df[col][0]}")
    return df

preprocessed_songs = PreProcess(songs, 'Lyrics')


Error with encoding utf-8: 'utf-8' codec can't decode byte 0x92 in position 6: invalid start byte
Success with encoding: latin1
Lyrics after remove_sections: Thus far, this album has provided musical accompaniment to make your passing pleasant
Our next number is designed to drown out the sound of shovels
Music to be buried by
Lyrics after handle_special_characters: Thus far, this album has provided musical accompaniment to make your passing pleasant
Our next number is designed to drown out the sound of shovels
Music to be buried by
Removed non-ASCII characters: 
Removed non-ASCII characters: á
Removed non-ASCII characters: 
Removed non-ASCII characters: 
Removed non-ASCII characters: ó
Removed non-ASCII characters: 
Removed non-ASCII characters: 
Removed non-ASCII characters: é
Removed non-ASCII characters: 
Removed non-ASCII characters: 
Removed non-ASCII characters: 
Removed non-ASCII characters: ö
Removed non-ASCII characters: 
Removed non-ASCII characters: 
Removed no

In [10]:
# Define the custom dataset class for song lyrics
class SongDataset(Dataset):
    """A custom Dataset class for song lyrics."""
    def __init__(self, txt_list, tokenizer, max_length):
        """Initializes the dataset with tokenized and encoded song lyrics."""
        self.input_ids = []
        self.attn_masks = []

        # Tokenize and encode each song in the dataset
        for txt in txt_list:
            encodings = tokenizer('<startsong> ' + txt + ' <endsong>', truncation=True, max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings['input_ids']))
            self.attn_masks.append(torch.tensor(encodings['attention_mask']))

    def __len__(self):
        """Returns the number of songs in the dataset."""
        return len(self.input_ids)

    def __getitem__(self, idx):
        """Returns the tokenized and encoded data of the song at the specified index."""
        return self.input_ids[idx], self.attn_masks[idx]



# `preprocessed_songs` is a DataFrame, so pass the cleaned 'Lyrics' column as a list of strings
dataset = SongDataset(preprocessed_songs['Lyrics'].tolist(), tokenizer, max_length=512)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
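
With `padding="max_length"`, every song is truncated or right-padded to `max_length` tokens, and the attention mask marks which positions hold real tokens. The pure-Python sketch below mimics that behaviour on a toy sequence. The helper `pad_and_mask`, the toy token IDs, and `pad_id=0` are hypothetical and not part of the notebook. It also builds labels with padding positions set to -100, the value PyTorch's cross-entropy loss ignores by default, which is one common way to keep padding out of the language-modeling loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad `token_ids` to `max_length`; return ids, mask, labels."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    labels = ids + [IGNORE_INDEX] * n_pad  # padding contributes nothing to the loss
    return input_ids, attention_mask, labels

input_ids, attention_mask, labels = pad_and_mask([11, 22, 33], max_length=5, pad_id=0)
print(input_ids)       # [11, 22, 33, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0]
print(labels)          # [11, 22, 33, -100, -100]
```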

## Training Loop

In [13]:
epochs = 10
gradient_accumulation_steps = 4  # Adjust as needed for memory management
optimizer = AdamW(model.parameters(), lr=1e-5)
# The scheduler is stepped once per optimizer update, not once per batch
num_training_steps = (len(dataloader) // gradient_accumulation_steps) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

epoch_losses = []
epoch_perplexities = []
# Training loop with gradient accumulation
model.train()

for epoch in range(epochs):
    print(f"Epoch {epoch+1}/{epochs}")
    total_loss = 0
    optimizer.zero_grad()  # Initialize gradients to zero at the start of each epoch

    for batch_idx, (input_ids, masks) in enumerate(dataloader):
        input_ids, masks = input_ids.to(device), masks.to(device)
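        # Exclude padding positions from the loss (label -100 is ignored by the loss function)
        labels = input_ids.masked_fill(masks == 0, -100)
        outputs = model(input_ids, labels=labels, attention_mask=masks)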
        loss = outputs.loss / gradient_accumulation_steps  # Adjust loss for gradient accumulation
        loss.backward()  # Accumulate gradients
        total_loss += outputs.loss.item()  # Track the unscaled loss so the epoch average and perplexity are correct

        # Step the optimizer and scheduler every `gradient_accumulation_steps`
        if (batch_idx + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()  # Clear gradients after updating weights

        # Periodically print the loss, perplexity, and a sample prediction
        if batch_idx % 10 == 0:
            adjusted_loss = loss.item() * gradient_accumulation_steps  # Adjust the loss back for reporting
            perplexity = np.exp(adjusted_loss)
            print(f"Batch {batch_idx}/{len(dataloader)} - Loss: {adjusted_loss:.4f} - Perplexity: {perplexity:.2f}")

            # Decode and display a sample input, target, and prediction for qualitative evaluation
            input_sequence = tokenizer.decode(input_ids[0], skip_special_tokens=True)
            prediction_ids = torch.argmax(outputs.logits, dim=-1)[0]
            prediction_sequence = tokenizer.decode(prediction_ids, skip_special_tokens=True)

            print(f"  Input Sequence: {input_sequence}")
            print(f"  Target Sequence: {input_sequence}")  # Target is the same as input in language modeling
            print(f"  Prediction: {prediction_sequence}\n")

    # Compute and report the average loss and perplexity for the epoch
    avg_loss = total_loss / len(dataloader)
    avg_perplexity = np.exp(avg_loss)
    epoch_losses.append(avg_loss)
    epoch_perplexities.append(avg_perplexity)
    print(f"End of Epoch {epoch+1} - Average Loss: {avg_loss:.4f} - Average Perplexity: {avg_perplexity:.2f}\n")


Epoch 1/10


KeyboardInterrupt: 
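
Because the optimizer and scheduler advance only once every `gradient_accumulation_steps` batches, the number of scheduler steps is smaller than the number of batches. The sketch below works through that arithmetic and evaluates a linear schedule by hand. The batch count of 100 is a made-up figure, and `optimizer_steps_per_epoch` and `linear_lr` are hypothetical helpers written for illustration, not library functions.

```python
def optimizer_steps_per_epoch(num_batches, accumulation_steps):
    """Optimizer updates per epoch when stepping every `accumulation_steps` batches
    and dropping any leftover partial group at the end of the epoch."""
    return num_batches // accumulation_steps

def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

num_batches = 100  # hypothetical number of batches per epoch
epochs = 10
accumulation_steps = 4

updates = optimizer_steps_per_epoch(num_batches, accumulation_steps) * epochs
print(updates)  # 250

# Schedule sized to the real number of updates: decays to zero by the end
print(linear_lr(updates, total_steps=updates, base_lr=1e-5))

# Schedule sized to the raw batch count: still at 75% of the base rate at the end
print(linear_lr(updates, total_steps=num_batches * epochs, base_lr=1e-5))
```

Sizing the schedule to the batch count instead of the update count is not an error in itself, but it means the learning rate never completes its planned decay.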

## Training Loss and Perplexity

In [None]:
import matplotlib.pyplot as plt

def plot_metrics(epoch_losses, epoch_perplexities):
    epochs_range = range(1, len(epoch_losses) + 1)
    plt.figure(figsize=(12, 6))

    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, epoch_losses, label='Training Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title('Training Loss Over Epochs')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, epoch_perplexities, label='Training Perplexity')
    plt.xlabel('Epochs')
    plt.ylabel('Perplexity')
    plt.title('Training Perplexity Over Epochs')
    plt.legend()

    plt.tight_layout()
    plt.show()

plot_metrics(epoch_losses, epoch_perplexities)
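
Perplexity is the exponential of the mean per-token cross-entropy loss, so the two curves carry the same information on different scales. The worked example below uses made-up batch losses to show the relationship. It also shows why a loss that has been divided by the number of accumulation steps must be scaled back before exponentiating. The `perplexity` helper and all numbers here are illustrative assumptions.

```python
import math

def perplexity(mean_loss):
    """Perplexity corresponding to a mean cross-entropy loss (natural log base)."""
    return math.exp(mean_loss)

batch_losses = [3.2, 3.0, 2.8, 3.0]  # hypothetical unscaled batch losses
mean_loss = sum(batch_losses) / len(batch_losses)
print(f"mean loss {mean_loss:.2f} -> perplexity {perplexity(mean_loss):.2f}")

# Averaging losses that were divided by 4 accumulation steps shrinks the
# exponent and makes the model look far better than it is
accumulation_steps = 4
scaled_mean = mean_loss / accumulation_steps
print(f"scaled loss {scaled_mean:.2f} -> perplexity {perplexity(scaled_mean):.2f}")
```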

## Generation of New Songs

In [None]:
model.eval()
prompt = "<startsong>"

def generate_text_samples(model, tokenizer, prompt, device, num_samples=3):
    """
    Generates text samples with different configurations using the specified model and tokenizer.

    Parameters:
    - model: The trained model used for text generation.
    - tokenizer: The tokenizer for encoding and decoding the text.
    - prompt (str): The initial text to start the generation.
    - device: The device (CPU or GPU) on which the model is loaded.
    - num_samples (int, optional): The number of samples to generate for each configuration. Default is 3.
    """
    # Encode the prompt; every position is a real token, so the attention mask is all ones
    input_ids = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0).to(device)
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=device)
    # Define different generation configurations
    generation_configs = [
        {"config": {"temperature": 0.8, "top_k": 50, "top_p": 0.95}, "description": "Temperature"},
        {"config": {"temperature": 1.0, "top_k": 30, "top_p": 0.95}, "description": "Top-K"},
        {"config": {"temperature": 1.0, "top_k": 0, "top_p": 0.85}, "description": "Top-P"},
        {"config": {"temperature": 1.0, "top_k": 50, "top_p": 0.95, "num_beams": 5, "early_stopping": True}, "description": "Beam Search"},
        {"config": {"temperature": 1.0, "top_k": 50, "top_p": 0.95, "repetition_penalty": 2.0}, "description": "Repetition Penalty"},

        {"config": {"temperature": 0.7, "top_k": 50, "top_p": 0.95, "repetition_penalty": 2.5}, "description": "Temperature with High Repetition Penalty"},
        {"config": {"temperature": 0.9, "top_k": 20, "top_p": 0.85, "repetition_penalty": 2.0, "no_repeat_ngram_size": 2}, "description": "Top-K with N-Gram Repetition Prevention"},
        {"config": {"temperature": 1.0, "top_k": 0, "top_p": 0.8, "repetition_penalty": 2.0, "no_repeat_ngram_size": 3}, "description": "Top-P with N-Gram Repetition Prevention"},
        {"config": {"num_beams": 5, "early_stopping": True, "no_repeat_ngram_size": 3}, "description": "Beam Search with N-Gram Repetition Prevention"},
        {"config": {"temperature": 1.0, "top_k": 50, "top_p": 0.95, "repetition_penalty": 3.0, "no_repeat_ngram_size": 4}, "description": "High Repetition Penalty with Strong N-Gram Prevention"}
    ]
    for config in generation_configs:
        print(f"\nGenerating with {config['description']} configuration:")
        sample_outputs = model.generate(
            input_ids,
            attention_mask=attention_mask,
            pad_token_id=tokenizer.pad_token_id,
            do_sample=True,
            max_length=300,
            num_return_sequences=num_samples,
            **config["config"]  # Unpack only the generation parameters, not the description
        )

        for i, sample_output in enumerate(sample_outputs):
            print(f"{config['description']} Sample {i+1}: {tokenizer.decode(sample_output, skip_special_tokens=True)}\n")


# Example usage
print("Generating text samples with different configurations:")
print("-----------------------------------------------------")
print("Prompt: '<startsong>'")
generate_text_samples(model, tokenizer, prompt="<startsong>", device=device, num_samples=3)
print("Prompt: 'I'm feeling like a rap god'")
generate_text_samples(model, tokenizer, prompt="I'm feeling like a rap god", device=device, num_samples=2)
print("Prompt: 'my name is slim shady'")
generate_text_samples(model, tokenizer, prompt="my name is slim shady", device=device, num_samples=2)
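
The configurations above vary `temperature`, `top_k`, and `top_p`. To make their effect concrete, the toy sketch below applies each one to a small hand-written next-token distribution. It is a simplified illustration in plain Python: the three helper functions and the probabilities are hypothetical, and the library's own implementation operates on logits and may differ in edge cases.

```python
def apply_temperature(probs, temperature):
    """Rescale probabilities as p ** (1 / T), which matches dividing logits by T."""
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: p / total for tok, p in scaled.items()}

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; k <= 0 disables the filter."""
    if k <= 0 or k >= len(probs):
        return dict(probs)
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical next-token distribution
probs = {"shady": 0.5, "rap": 0.3, "mom": 0.15, "spaghetti": 0.05}

sharpened = apply_temperature(probs, 0.5)   # T < 1 concentrates mass on likely tokens
top2 = top_k_filter(probs, 2)               # only "shady" and "rap" survive
nucleus = top_p_filter(probs, 0.85)         # "shady", "rap", and "mom" survive

print(sharpened)
print(top2)
print(nucleus)
```

Lower temperatures and tighter `top_k` or `top_p` values make the output more predictable, while looser settings admit rarer tokens and more variety, at the cost of coherence.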