## 1. Initialization and Imports

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset

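This cell imports the libraries the notebook relies on. The `transformers` library supplies the GPT-2 tokenizer and model classes along with the `Trainer`, `TrainingArguments`, and `DataCollatorForLanguageModeling` utilities, and the `datasets` library supplies `load_dataset` for downloading the lyrics corpus.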


## 2. Model and Tokenizer Setup

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

This cell loads the tokenizer and model with the pre-trained GPT-2 parameters. GPT-2 does not define a padding token, so the end-of-sequence token is reused as the padding token. This lets shorter sequences be padded to a common length when they are batched.


## 3. Dataset Preparation

In [2]:
dataset = load_dataset("amishshah/song_lyrics")
dataset = dataset["train"].shuffle(seed=42)
subset_size = 10000
dataset = dataset.select(range(subset_size))
train_test_dataset = dataset.train_test_split(test_size=0.1)
train_dataset = train_test_dataset["train"]
val_dataset = train_test_dataset["test"]

This cell loads the dataset, shuffles the training split with a fixed seed, and keeps the first 10,000 songs of the shuffled data. Working with a subset reduces training time, and shuffling before selecting makes the subset a random sample, so it is less likely to be skewed by the order of the source file. The subset is then split into training (90%) and validation (10%) sets.
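
The order of operations matters here: shuffling before selecting is what makes the 10,000 rows a random sample. The sketch below mirrors the shuffle, select, split sequence in plain Python so the resulting sizes are easy to check. The helper name `sample_and_split` and the 50,000-row stand-in are hypothetical and are not part of the `datasets` library or the real corpus.

```python
import random

def sample_and_split(rows, subset_size, test_size, seed=42):
    """Shuffle, keep the first `subset_size` rows, then split off a validation share."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)                    # shuffle first ...
    subset = shuffled[:subset_size]          # ... so the subset is a random sample
    n_val = int(round(len(subset) * test_size))
    return subset[n_val:], subset[:n_val]    # (train, validation)

# Stand-in for the song dataset: 50,000 row indices (hypothetical size).
train_rows, val_rows = sample_and_split(range(50_000), subset_size=10_000, test_size=0.1)
print(len(train_rows), len(val_rows))  # 9000 1000
```

With `subset_size = 10000` and `test_size=0.1`, this corresponds to 9,000 training songs and 1,000 validation songs.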

## 4. Tokenization

In [None]:
def tokenize_function(examples):
    return tokenizer(examples['lyrics'], truncation=True, padding=True, max_length=512)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

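This cell converts the lyrics in the training and validation sets into token IDs the model can process. Each song is truncated to at most 512 tokens, and because `map` runs in batches, `padding=True` pads every song to the length of the longest song in its batch.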
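
To make the tokenizer's output concrete, here is a minimal pure-Python sketch of truncation followed by right-padding, along with the attention mask that marks real tokens (1) versus padding (0). The token IDs and the helper `truncate_and_pad` are hypothetical; only the padding ID 50256 (GPT-2's end-of-text ID, reused as the pad token in this notebook) comes from the notebook itself.

```python
PAD_ID = 50256  # GPT-2's end-of-text id, reused here as the padding id

def truncate_and_pad(batch, max_length, pad_id=PAD_ID):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids, with a small max_length so the truncation is visible.
encoded = truncate_and_pad([[11, 12, 13, 14, 15, 16], [21, 22], [31, 32, 33]], max_length=4)
print(encoded["input_ids"])       # [[11, 12, 13, 14], [21, 22, 50256, 50256], [31, 32, 33, 50256]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 1, 0]]
```

The real tokenizer returns the same two fields, `input_ids` and `attention_mask`, for each example.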

## 5. Training Setup

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=1000,
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

This cell defines the training arguments and creates a Trainer object from the model, the training arguments, the data collator, and the two datasets. Checkpoints are written to the output directory, and the trainer makes four passes over the training set. The warmup_steps parameter ramps the learning rate up over the first 500 steps, and the weight_decay parameter adds a regularization term that discourages the model from fitting the training data too closely. Training progress is logged every 1,000 steps. The data collator is created with mlm=False, which selects causal (next-token) language modeling rather than masked language modeling.
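
A quick calculation shows what these settings imply for the length of the run. The sketch below assumes 9,000 training examples (the 90% share of the 10,000-song subset), a single device, and no gradient accumulation; none of these is stated explicitly in the notebook. The helper names are hypothetical, and `linear_schedule_lr` illustrates a linear warmup followed by linear decay with an assumed base learning rate of 5e-5.

```python
import math

def schedule_summary(num_examples, batch_size, epochs, warmup_steps, logging_steps):
    """Derive step counts from the dataset size and the training arguments."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
        "log_events": total_steps // logging_steps,
    }

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

summary = schedule_summary(num_examples=9000, batch_size=4, epochs=4,
                           warmup_steps=500, logging_steps=1000)
print(summary["steps_per_epoch"], summary["total_steps"], summary["log_events"])  # 2250 9000 9
```

Under these assumptions the 500 warmup steps cover roughly the first 6% of training, and a logging interval of 1,000 steps yields only nine loss readings over the whole run.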

## 6. Model Training 

In [None]:
trainer.train()

This starts the training process using the configuration set up in the previous section.
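
Section 7 loads both the model and the tokenizer from `./results`. During training the `Trainer` writes its intermediate checkpoints into subfolders of the output directory, and no tokenizer is passed to the `Trainer` here, so the top-level directory may not contain everything `from_pretrained` needs. The notebook does not show a save step, so the following is only a sketch of one way the directory could be populated. The helper name `save_for_inference` is hypothetical; `save_model` and `save_pretrained` are the existing `Trainer` and tokenizer methods.

```python
def save_for_inference(trainer, tokenizer, output_dir="./results"):
    """Write the final model weights and the tokenizer files into output_dir."""
    trainer.save_model(output_dir)          # model weights and config
    tokenizer.save_pretrained(output_dir)   # vocabulary and tokenizer config
    return output_dir

# Intended use, after trainer.train() has finished:
# save_for_inference(trainer, tokenizer)
```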

## 7. Model Usage

In [27]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, pipeline
checkpoint_path = './results'
model = GPT2LMHeadModel.from_pretrained(checkpoint_path)
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint_path)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
results = text_generator('Write a song about overcoming a major challenge. ', max_length=600)
print(results[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Write a song about overcoming a major challenge. 

You know some people can’t stand the fact that somebody is losing
And I think I’m in a class, I have my reasons why
Is a stranger coming for you or is he? 

That all goes without saying
I can say this many times
Don’t want to tell you many truths
You don’t wanna say anything
It seems so simple and basic
But these days I’m too accustomed to be listening to
That everything else has gone unsaid

Then I’m ready to say the
Lord's Prayer
And let us pray for Him with all our heart
In His name
He will do me good.
I’ll help you with your other problems
He will make me feel my way to heaven
Come on around and save your man
Let them put down your sins
So forgive your enemies, be kind to yourself.
Let us all pray for our fellow man.
I’ll help them with their other problems
Come on around and save your man
Let us all pray for His name and forgive me with all of our heart

I hear a call that you’re standing right outside
I see, I hear the voice of y

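After training, the model can be used to generate lyrics. The fine-tuned model is loaded from the output directory and given a prompt. The `max_length` argument caps the combined length of the prompt and the generated text at 600 tokens (not characters), which may be why the output above stops mid-word.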

## 8. Compare to Base GPT-2 Model

In [28]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, pipeline
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
gpt_text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
prompt = "Write a song about overcoming a major challenge. "
results = gpt_text_generator(prompt, max_length=600)
print(results[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Write a song about overcoming a major challenge.  For example, let's call a song to be played on Friday nights. If The Beatles have played The Beatles, so do let that song have its same lead-up chorus.  Or say, Let's take The Beatles perform in the same place as The Doors performed in '67.  Then the songs would be called Let Them Dance and It would be a big song, except let's call The Doors in that place.  Let's assume that the lead-up chorus is as loud as The Doors used to be.  It's an extra six feet.  The lead-up song might have five songs.  These five would then each go through six separate pieces.  But the lead-up song, even if two songs are included, would still end in the same song.  The Beatles could have performed at 7:00PM or 11:00PM, at or close to midnight.  Let's put that number to a scale like 4-10. (The Beatles had the top four best-known songs of all time in 1967.)
That's where you leave your "The Beatles: Part 1 Songs Song" to the Beatles: Part 2 (to be completed.)
And 

## 9. Evaluate using GPT-3.5 Turbo

In [4]:
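import os
from openai import OpenAI

# Read the key from the environment rather than hard-coding it in the notebook.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
)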

def compare_lyrics(lyrics1, lyrics2):
    prompt_text = f"Here are two sets of song lyrics:\n\nLyrics A:\n{lyrics1}\n\nLyrics B:\n{lyrics2}\n\nWhich set of lyrics do you think is better?"
    
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt_text,
            }
        ],
        model="gpt-3.5-turbo",
    )

    # Print the judge model's reply message.
    print(chat_completion.choices[0].message)
    
    
prompt = "Complete this lyric about love and loss:"
# Load models and tokenizer
model_base = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

checkpoint_path = './results'
model_fine_tuned = GPT2LMHeadModel.from_pretrained(checkpoint_path)
tokenizer_fine_tuned = GPT2Tokenizer.from_pretrained(checkpoint_path)

text_generator_finetuned = pipeline('text-generation', model=model_fine_tuned, tokenizer=tokenizer_fine_tuned)
generated_lyrics_finetuned = text_generator_finetuned(prompt, max_length=500, truncation=True)[0]['generated_text']

text_generator_base = pipeline('text-generation', model=model_base, tokenizer=tokenizer)
generated_lyrics_base = text_generator_base(prompt, max_length=500, truncation=True)[0]['generated_text']

# Call the function to compare the lyrics
compare_lyrics(generated_lyrics_base, generated_lyrics_finetuned)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


ChatCompletionMessage(content='It is subjective and depends on personal preference, but based on the provided excerpts, Lyrics B seem to be more emotionally powerful and cohesive.', role='assistant', function_call=None, tool_calls=None)
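
In the comparison above, the base model's lyrics are always shown as "Lyrics A" and the fine-tuned model's as "Lyrics B", so a judge that tends to favor one position would skew the result. One possible safeguard, sketched below under stated assumptions, is to randomize the order for each comparison and map the verdict back to the model. The `judge` callable is hypothetical: it stands in for a function that sends both texts to the judge model and returns "A" or "B". The toy judge here exists only so the sketch runs without an API call.

```python
import random

def compare_with_random_order(judge, base_lyrics, finetuned_lyrics, rng=random):
    """Show the two texts in random order and report which model the judge preferred."""
    finetuned_is_a = rng.random() < 0.5
    if finetuned_is_a:
        lyrics_a, lyrics_b = finetuned_lyrics, base_lyrics
    else:
        lyrics_a, lyrics_b = base_lyrics, finetuned_lyrics
    verdict = judge(lyrics_a, lyrics_b).strip().upper()   # expected: "A" or "B"
    if verdict not in ("A", "B"):
        return "unclear"
    picked_a = verdict == "A"
    return "finetuned" if picked_a == finetuned_is_a else "base"

# Toy judge for demonstration only: it prefers the longer text.
def longer_text_judge(lyrics_a, lyrics_b):
    return "A" if len(lyrics_a) > len(lyrics_b) else "B"

print(compare_with_random_order(longer_text_judge, "short", "a much longer lyric"))  # finetuned
```

Because the verdict is translated back to "base" or "finetuned", the result no longer depends on which label a given model happened to receive.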


## 10. Loop Over Prompts to Find the Percentage of the Time the Fine-Tuned Model Is Preferred

In [12]:
song_prompts = [
    "Write a song about finding love in unexpected places.",
    "Write a song about the feeling of losing a close friend.",
    "Write a song about memories of summer evenings.",
    "Write a song about the journey of self-discovery.",
    "Write a song about the first snowfall of the year.",
    "Write a song about overcoming a major challenge.",
    "Write a song about dreams of a better future.",
    "Write a song about moments of peaceful solitude.",
    "Write a song about the thrill of a new adventure.",
    "Write a song about reflections on past mistakes.",
    "Write a song about celebrating a major achievement.",
    "Write a song about the magic of childhood imaginations.",
    "Write a song about long drives on scenic roads.",
    "Write a song about the pain of unrequited love.",
    "Write a song about the warmth of coming home.",
    "Write a song about nights spent under the stars.",
    "Write a song about the intensity of a storm.",
    "Write a song about growing old with someone.",
    "Write a song about the colors of autumn.",
    "Write a song about the loss of innocence.",
    "Write a song about finding strength within.",
    "Write a song about the power of forgiveness.",
    "Write a song about escaping from reality.",
    "Write a song about a moment of unexpected kindness.",
    "Write a song about the mysteries of the ocean.",
    "Write a song about a reunion after many years.",
    "Write a song about the heartache of saying goodbye.",
    "Write a song about a childhood memory.",
    "Write a song about the adrenaline of competition.",
    "Write a song about a betrayal by someone trusted.",
    "Write a song about the joy of a newborn's first laugh.",
    "Write a song about the struggle for justice.",
    "Write a song about a night out with friends.",
    "Write a song about dealing with inner demons.",
    "Write a song about a journey across the world.",
    "Write a song about the calm before a storm.",
    "Write a song about the exhilaration of a first kiss.",
    "Write a song about the sorrow of a grave mistake.",
    "Write a song about the beauty of a sunrise.",
    "Write a song about the challenges of parenthood.",
    "Write a song about finding a letter from the past.",
    "Write a song about the conflict between heart and mind.",
    "Write a song about the comfort of old friendships.",
    "Write a song about the sting of harsh truths.",
    "Write a song about a day spent in nature.",
    "Write a song about a vow of eternal loyalty.",
    "Write a song about the chaos of a city life.",
    "Write a song about longing for distant places.",
    "Write a song about the peace of a snowy day.",
    "Write a song about breaking free from constraints.",
    "Write a song about the thrill of a chase.",
    "Write a song about the warmth of a fireside gathering.",
    "Write a song about the pain of parting ways.",
    "Write a song about a secret kept for years.",
    "Write a song about the joy of a festival.",
    "Write a song about a love that could have been.",
    "Write a song about an unexpected encounter.",
    "Write a song about a promise made long ago.",
    "Write a song about the relief of a confession.",
    "Write a song about the struggle to belong.",
    "Write a song about an unforgettable day at the beach.",
    "Write a song about a journey on a train.",
    "Write a song about a lesson learned the hard way.",
    "Write a song about the bliss of a lazy day.",
    "Write a song about a fight for a cause.",
    "Write a song about reconnecting with an old flame.",
    "Write a song about the agony of a tough decision.",
    "Write a song about a heroic deed.",
    "Write a song about the discovery of a secret world.",
    "Write a song about a night of wild festivities.",
    "Write a song about a lifelong friendship.",
    "Write a song about a road trip with friends.",
    "Write a song about a mysterious stranger.",
    "Write a song about the fear of the unknown.",
    "Write a song about the thrill of victory.",
    "Write a song about the despair of defeat.",
    "Write a song about a moment of clarity.",
    "Write a song about the beauty of falling leaves.",
    "Write a song about the tension of a rivalry.",
    "Write a song about the bittersweet end of a journey.",
    "Write a song about the anticipation of a reunion.",
    "Write a song about a leap of faith.",
    "Write a song about the sorrow of unfulfilled dreams.",
    "Write a song about the joy of a surprise.",
    "Write a song about the weight of responsibility.",
    "Write a song about the thrill of the unknown.",
    "Write a song about a walk in the moonlight.",
    "Write a song about the shock of a sudden change.",
    "Write a song about the comfort of a familiar song.",
    "Write a song about the pain of a harsh reality.",
    "Write a song about the excitement of a new beginning."
]

In [16]:
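import os
import logging
import numpy as np
from openai import OpenAI
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# Read the key from the environment rather than hard-coding it in the notebook.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
)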

logging.getLogger().setLevel(logging.ERROR)

def compare_lyrics(lyrics1, lyrics2):
    prompt_text = f"Here are two sets of song lyrics:\n\nLyrics A:\n{lyrics1}\n\nLyrics B:\n{lyrics2}\n\nWhich set of lyrics do you think is better (for your answer, just put your response ex. Lyrics _)?"
    
    chat_completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt_text}],
        model="gpt-3.5-turbo",
    )
    return chat_completion.choices[0].message.content

# Base GPT-2 model and tokenizer
model_base = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Fine-tuned model and tokenizer
checkpoint_path = './results'
model_fine_tuned = GPT2LMHeadModel.from_pretrained(checkpoint_path)
tokenizer_fine_tuned = GPT2Tokenizer.from_pretrained(checkpoint_path)

text_generator_base = pipeline('text-generation', model=model_base, tokenizer=tokenizer)
text_generator_finetuned = pipeline('text-generation', model=model_fine_tuned, tokenizer=tokenizer_fine_tuned)
results = []

# Iterate over every prompt; the output below contains far more than three results.
for prompt in song_prompts:
    generated_lyrics_finetuned = text_generator_finetuned(prompt, max_length=500, truncation=True)[0]['generated_text']
    generated_lyrics_base = text_generator_base(prompt, max_length=500, truncation=True)[0]['generated_text']
    result = compare_lyrics(generated_lyrics_base, generated_lyrics_finetuned)
    results.append(result)
    
print(results)

# Lyrics B is always the fine-tuned model's output, so this is its win rate.
win_rate = np.mean([r.strip() == 'Lyrics B' for r in results])
print(f"The fine-tuned model was chosen {win_rate * 100:.1f}% of the time.")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

['Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics B', 'Lyrics A', 'Lyrics A', 'Lyrics B', 'Lyrics A', 'Lyrics A', 'Lyrics A', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics B', 'Lyrics A', 'Lyrics A', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics A', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics A', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics A', 'Lyrics A', 'Lyrics A', 'Lyrics A', 'Lyrics B', 'Lyrics A', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics A', 'Lyrics A', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics A', 'Lyrics A', 'Lyrics B', 'Lyrics A', 'Lyrics A', 'Lyrics B', 'Lyrics A', 'Lyrics A', 'Lyrics B', 'Lyrics A', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics B', 'Lyrics B', 'Lyrics A', 'Lyrics B', 'Lyrics A', 'Lyrics B', 'Ly
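
The exact string comparison `r == 'Lyrics B'` counts a reply only if the judge answers with precisely that text; a reply such as "Lyrics B." would be treated as a loss. The sketch below shows one more tolerant way to tally the verdicts. The helper names and the sample replies are hypothetical and are not the notebook's actual results.

```python
import re
from collections import Counter

def parse_verdict(reply):
    """Return 'A', 'B', or None from a reply such as 'Lyrics B.' or 'lyrics a'."""
    match = re.search(r"\blyrics\s+([ab])\b", reply, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None

def preference_rate(replies, target="B"):
    """Share of parseable replies that picked `target`, plus the raw counts."""
    counts = Counter(parse_verdict(r) for r in replies)
    decided = counts["A"] + counts["B"]
    rate = counts[target] / decided if decided else float("nan")
    return rate, counts

# Hypothetical replies for illustration; not the notebook's actual results.
sample = ["Lyrics B", "Lyrics B.", "lyrics a", "I prefer Lyrics B", "Both are good"]
rate, counts = preference_rate(sample)
print(f"{rate:.0%}", counts["A"], counts["B"], counts[None])  # 75% 1 3 1
```

Reporting the number of unparseable replies alongside the percentage makes it clear how many comparisons the figure is actually based on.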

## End Notebook