# Instructions

Run the cells in sequential order.

Dataset - https://huggingface.co/datasets/huggingartists/taylor-swift

The `huggingartists/taylor-swift` dataset on Hugging Face is a collection of lyrics from Taylor Swift's songs, curated for use with Hugging Face's natural language processing tools. This notebook uses only the `text` field of the `train` split, where each entry holds the lyrics of one song. The lyrics are tokenized later in the notebook with the GPT-2 tokenizer from Hugging Face's `transformers`, a popular framework for training and deploying state-of-the-art language models.

Steps performed -

1) Dataset parsing

2) Tokenization, removal of special characters, NaN removal

3) Loading GPT2 model

4) Word embedding creation

5) Training 

6) Generation

7) ROUGE score

## To get started
Download the dataset and place it in the same directory as this notebook.
Run the cells sequentially.
(Training takes a long time, so use a GPU.)

--------------
Contributed by - Tannistha Mandal

In [48]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [49]:
!pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Importing Required Libraries

In [50]:
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset, DatasetDict

## Loading Data

In [51]:
data = load_dataset("huggingartists/taylor-swift")




  0%|          | 0/1 [00:00<?, ?it/s]

## Splitting into Training and Test Data

In [52]:
train_percentage = 1
validation_percentage = 0
test_percentage = 0

train, validation, test = np.split(data['train']['text'], [int(len(data['train']['text'])*train_percentage), int(len(data['train']['text'])*(train_percentage + validation_percentage))])

data = DatasetDict(
    {
        'train': Dataset.from_dict({'text': list(train)}),
        'validation': Dataset.from_dict({'text': list(validation)}),
        'test': Dataset.from_dict({'text': list(test)})
    }
)

train_data = data['train']['text']

## Preprocessing of Data

In [53]:
"""
Clean and recreate the dataset:
a. Remove special characters
b. Substitute multiple spaces with single spaces
c. Convert to lower case
"""

import re

def eda(sentences):
    """Return a cleaned, lower-cased copy of each lyric string."""
    processed_sentences = []

    for s in sentences:
        # Replace every non-word character (punctuation, newlines, ...) with a space
        processed_sentence = re.sub(r'\W', ' ', str(s))

        # Substitute multiple spaces with a single space
        processed_sentence = re.sub(r'\s+', ' ', processed_sentence)

        # Convert to lowercase
        processed_sentence = processed_sentence.lower()

        processed_sentences.append(processed_sentence)

    return processed_sentences
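
To make the effect of the two regular expressions concrete, here is a minimal sketch that applies the same steps to a single made-up string. The helper name `clean` and the sample text are illustrative only and are not part of the notebook.

```python
import re

def clean(text):
    # Replace every non-word character (apostrophes, commas, newlines, ...) with a space
    text = re.sub(r"\W", " ", str(text))
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text.lower()

sample = "She's with you,\nDo you get deja vu?"
cleaned = clean(sample)
print(repr(cleaned))
```

Note that the newline is replaced as well, so a cleaned song no longer contains `\n`, and that an apostrophe becomes a space (`she s`) rather than disappearing.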



In [54]:
# Printing number of songs in the dataset
song_data = eda(data['train']['text'])
print("Length of song_data :", len(song_data))


Length of song_data : 762


In [55]:
# Printing first song lyrics
song_data[0]

'car rides to malibu strawberry ice cream one spoon for two and tradin jackets laughin bout how small it looks on you watching reruns of glee bein annoying singin in harmony i bet shes braggin to all her friends sayin youre so unique hmm so when you gonna tell her that we did that too she thinks its special but its all reused that was our place i found it first i made the jokes you tell to her when shes with you do you get déjà vu when she s with you do you get déjà vu hmm do you get déjà vu huh do you call her almost say my name cause lets be honest we kinda do sound the same another actress i hate to think that i was just your type and i bet that she knows billy joel cause you played her uptown girl youre singin it together now i bet you even tell her how you love her in between the chorus and the verse so when you gonna tell her that we did that too she thinks its special but it s all reused that was the show we talked about played you the songs shes singing now when shes with you d

## Saving Data into Dataframe

In [56]:
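# DataFrame with one row per song (eda() has already replaced newlines with spaces, so split('\n') leaves each song whole)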
taylorswift_songs = pd.DataFrame([item for sublist in [i.split('\n') for i in song_data] for item in sublist])
taylorswift_songs.columns = ["All_Songs"]
taylorswift_songs.head(5)

Unnamed: 0,All_Songs
0,car rides to malibu strawberry ice cream one s...
1,vintage tee brand new phone high heels on cobb...
2,i can see you standing honey with his arms aro...
3,we could leave the christmas lights up til jan...
4,im doing good im on some new shit been saying ...


In [57]:
# Drop songs whose lyrics are too long: GPT-2 accepts at most 1024 tokens, so keep only songs under 350 words
taylorswift_songs = taylorswift_songs[taylorswift_songs['All_Songs'].apply(lambda x: len(x.split(' ')) < 350)]
taylorswift_songs.head()

Unnamed: 0,All_Songs
0,car rides to malibu strawberry ice cream one s...
1,vintage tee brand new phone high heels on cobb...
3,we could leave the christmas lights up til jan...
4,im doing good im on some new shit been saying ...
7,salt air and the rust on your door i never nee...


## Removing NAN values

In [58]:
# Treat empty strings as missing values, then drop those rows.
# replace() and dropna() return new objects, which avoids modifying a slice in place
df = taylorswift_songs.replace('', pd.NA).dropna()


In [59]:
df.to_csv('taylor_songs_1.csv', index=False)


## Creating Validation / Testing Data

In [60]:
#Create a very small test set to compare generated text with the reality
test_set = df.sample(n = 100)
df = df.loc[~df.index.isin(test_set.index)]

#Reset the indexes
test_set = test_set.reset_index()
df = df.reset_index()

In [61]:
test_set.head(5)

Unnamed: 0,index,All_Songs
0,292,ready for it i did something bad gorgeous sty...
1,406,there is some unbelievable music that has come...
2,568,1 ready for it 2 gorgeous 3 look what you made...
3,689,what a shame didnt wanna be the one that got a...
4,193,loving him is like driving a new maserati down...


### Saving the last 20 words in a new column, used later to evaluate the quality of the generated lyrics

In [62]:
#For the test set only, keep last 20 words in a new column, then remove them from original column
test_set['True_end_lyrics'] = test_set['All_Songs'].str.split().str[-20:].apply(' '.join)
test_set['Lyrics'] = test_set['All_Songs'].str.split().str[:-20].apply(' '.join)
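
The split into a prompt and a held-out ending can be sketched on plain strings. This is an illustration under the same slicing rule (`[:-20]` and `[-20:]`); the function name `split_prompt_and_ending` and the toy inputs are made up for the example.

```python
def split_prompt_and_ending(lyrics, n_last=20):
    words = lyrics.split()
    prompt = " ".join(words[:-n_last])   # everything except the last n_last words
    ending = " ".join(words[-n_last:])   # the last n_last words, kept as the reference
    return prompt, ending

long_song = " ".join(f"w{i}" for i in range(25))   # 25 placeholder words
short_song = "only five words in here"             # fewer than 20 words

long_prompt, long_ending = split_prompt_and_ending(long_song)
short_prompt, short_ending = split_prompt_and_ending(short_song)
print(repr(long_prompt), len(long_ending.split()))
print(repr(short_prompt), repr(short_ending))
```

A text with 20 words or fewer yields an empty prompt, which is why empty strings have to be removed from the new columns before generation.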


In [63]:
test_set.head(5)

Unnamed: 0,index,All_Songs,True_end_lyrics,Lyrics
0,292,ready for it i did something bad gorgeous sty...,long live new years daygetaway carwe are never...,ready for it i did something bad gorgeous styl...
1,406,there is some unbelievable music that has come...,what it takes to make a pro blush all the boys...,there is some unbelievable music that has come...
2,568,1 ready for it 2 gorgeous 3 look what you made...,for it 2 gorgeous 3 look what you made me do 4...,1 ready
3,689,what a shame didnt wanna be the one that got a...,this is the last time you said no one else thi...,what a shame didnt wanna be the one that got a...
4,193,loving him is like driving a new maserati down...,was red red red red red red loving him is like...,loving him is like driving a new maserati down...


## Removing NA values from newly generated columns

In [64]:
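# Songs with 20 words or fewer end up with an empty 'Lyrics' prompt; treat empty strings as missing
test_set.replace('', pd.NA, inplace=True)

# Drop rows with NaN values, then renumber the rows so labels and positions stay aligned
test_set.dropna(inplace=True)
test_set.reset_index(drop=True, inplace=True)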


In [65]:
(test_set=='').any()


index              False
All_Songs          False
True_end_lyrics    False
Lyrics             False
dtype: bool

### Saving new CSV

In [66]:
test_set.to_csv('taylor_songs_2.csv', index=False)


In [67]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Importing Required Libraries

In [68]:
import csv
import random

import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
# transformers' own AdamW is deprecated; torch.optim.AdamW is used instead
# (its default weight_decay is 0.01 rather than 0.0)
from torch.optim import AdamW
# Note: this Dataset shadows the datasets.Dataset imported earlier
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm, trange
from transformers import GPT2LMHeadModel, GPT2Tokenizer, get_linear_schedule_with_warmup


## Feeding Lyrics to a Custom SongLyrics Object

In [69]:
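class SongLyrics(Dataset):
    """Holds one tensor of GPT-2 token ids per song."""

    def __init__(self, lyrics, control_code="lyrics", truncate=False, gpt2_type="gpt2", max_length=1024):

        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type)
        self.lyrics = []

        for row in lyrics:
            # Note: row[:max_length] truncates by characters, not by tokens
            self.lyrics.append(torch.tensor(
                self.tokenizer.encode(f"<|{control_code}|>{row[:max_length]}<|endoftext|>")
            ))

        if truncate:
            self.lyrics = self.lyrics[:20000]
        self.lyrics_count = len(self.lyrics)

    def __len__(self):
        return self.lyrics_count

    def __getitem__(self, item):
        return self.lyrics[item]


In [70]:
# control_code is a short label wrapped in <|...|> at the start of every sample.
# It must be a string, not the lyrics column itself ("lyrics" is a placeholder label)
dataset = SongLyrics(df['All_Songs'], control_code="lyrics", truncate=True, gpt2_type="gpt2")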

## Loading GPT2 Models

In [71]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [72]:
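# Pack several short songs into one input tensor of at most max_seq_len tokens
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    """Return (packed_tensor, carry_on, remainder).

    carry_on is True while the pack can still grow. When new_tensor does not
    fit, the pack is returned unchanged and new_tensor comes back as remainder.
    """
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        # Prepend the new song; the first token of the existing pack is dropped
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None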
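
The packing logic is easier to follow on plain Python lists. The sketch below mirrors the three return values of `pack_tensor` with lists instead of tensors; the function name `pack`, the toy sequences, and the driver loop (which carries the sequence that did not fit into the next pack) are assumptions made for illustration.

```python
def pack(new_seq, packed, max_len):
    """List-based mirror of pack_tensor: returns (packed, carry_on, remainder)."""
    if packed is None:
        return new_seq, True, None
    if len(new_seq) + len(packed) > max_len:
        # No room left: hand back the finished pack and the sequence that did not fit
        return packed, False, new_seq
    # Prepend the new sequence; the first token of the existing pack is dropped
    return new_seq + packed[1:], True, None

sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = None
batches = []
for seq in sequences:
    packed, carry_on, remainder = pack(seq, packed, max_len=6)
    if not carry_on:
        batches.append(packed)
        packed = remainder  # start the next pack with the sequence that did not fit
if packed is not None:
    batches.append(packed)

print(batches)
```

Packing this way lets several short songs share one forward pass, at the cost of mixing unrelated songs in the same input sequence.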

## Training the Model

In [89]:
import os  # needed for the checkpoint path when save_model_on_epoch=True

def train(
    dataset, model, tokenizer,
    batch_size=16, epochs=10, lr=2e-5,
    max_seq_len=768, warmup_steps=200,
    gpt2_type="gpt2", output_dir=".", output_prefix="wreckgar",
    test_mode=False,save_model_on_epoch=False,
):

    model.train()

    optimizer = AdamW(model.parameters(), lr=lr)
    # NOTE: with num_training_steps=-1 the linear schedule sets the learning
    # rate to 0 once warmup_steps optimizer steps have been taken
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1
    )

    # One song per item; songs are packed into longer sequences below
    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
    loss=0
    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):

        print(f"Training epoch {epoch}")
        print(loss)
        for idx, entry in tqdm(enumerate(train_dataloader)):
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, max_seq_len)

            # Keep packing until the sequence is full or the data runs out
            if carry_on and idx != len(train_dataloader) - 1:
                continue

            #input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            # Gradient accumulation: update the weights every batch_size packed sequences
            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

            accumulating_batch_count += 1
            # Start the next pack with the song that did not fit (None if there was none)
            input_tensor = remainder
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model
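
The `train` function passes `num_training_steps=-1` to the scheduler. The following plain-Python sketch restates the usual shape of a linear warmup followed by linear decay, so the effect of that value can be checked by hand. It is a simplified re-statement written for this note, not the library's code, and the function name `linear_warmup_decay` is made up.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp-up, then linear decay towards zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# With total_steps=-1 the multiplier is 0 as soon as warmup ends
for step in (0, 100, 199, 200, 500):
    print(step, linear_warmup_decay(step, warmup_steps=200, total_steps=-1))

# With a realistic total the multiplier decays gradually instead
for step in (200, 600, 1000):
    print(step, linear_warmup_decay(step, warmup_steps=200, total_steps=1000))
```

Under this assumption, a run that takes fewer than `warmup_steps` optimizer steps never leaves the ramp-up phase, and a longer run would stop learning after it; passing the expected number of optimizer steps avoids both cases.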

In [90]:
#Train the model on the specific data we have
model = train(dataset, model, tokenizer)




Training epoch 0
0


315it [10:28,  2.00s/it]


Training epoch 1
tensor(4.4171, grad_fn=<NllLossBackward0>)


315it [10:35,  2.02s/it]


Training epoch 2
tensor(3.0927, grad_fn=<NllLossBackward0>)


315it [10:34,  2.01s/it]


Training epoch 3
tensor(3.3269, grad_fn=<NllLossBackward0>)


315it [10:06,  1.93s/it]


Training epoch 4
tensor(4.0313, grad_fn=<NllLossBackward0>)


315it [10:15,  1.95s/it]


Training epoch 5
tensor(3.5744, grad_fn=<NllLossBackward0>)


315it [10:15,  1.95s/it]


Training epoch 6
tensor(3.4518, grad_fn=<NllLossBackward0>)


315it [10:13,  1.95s/it]


Training epoch 7
tensor(3.8574, grad_fn=<NllLossBackward0>)


315it [10:13,  1.95s/it]


Training epoch 8
tensor(2.3084, grad_fn=<NllLossBackward0>)


315it [10:15,  1.95s/it]


Training epoch 9
tensor(2.5323, grad_fn=<NllLossBackward0>)


315it [10:24,  1.98s/it]


## Generating the Lyrics

In [94]:
def generate(
    model,
    tokenizer,
    prompt,
    entry_count=10,
    entry_length=30, # maximum number of generated tokens (not words)
    top_p=0.8,
    temperature=1.,
):

    model.eval()

    generated_num = 0
    generated_list = []

    filter_value = -float("Inf")

    with torch.no_grad():

        for entry_idx in trange(entry_count):

            entry_finished = False

            generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)

            for i in range(entry_length):
                
                outputs = model(generated, labels=generated)
                loss, logits = outputs[:2]
                logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[
                    ..., :-1
                ].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = filter_value

                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)

                if next_token.item() == tokenizer.eos_token_id:  # "<|endoftext|>"
                    entry_finished = True

                if entry_finished:

                    generated_num = generated_num + 1

                    output_list = list(generated.squeeze().numpy())
                    output_text = tokenizer.decode(output_list)
                    generated_list.append(output_text)
                    break
            
            if not entry_finished:
              output_list = list(generated.squeeze().numpy())
              output_text = f"{tokenizer.decode(output_list)}<|endoftext|>" 
              generated_list.append(output_text)
                
    return generated_list
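
The filtering step inside `generate` is top-p (nucleus) sampling: tokens are sorted by probability and kept until their cumulative probability first exceeds `top_p`, and everything else is masked out before sampling. A minimal pure-Python sketch of that selection rule follows; the function name `top_p_keep` and the toy probabilities are assumptions for illustration, and the notebook itself works on logits with tensor operations.

```python
def top_p_keep(probs, top_p):
    """Return the indices kept by nucleus filtering, most likely first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)          # the token that crosses the threshold is still kept
        cumulative += probs[idx]
        if cumulative > top_p:
            break
    return kept

probs = [0.1, 0.5, 0.04, 0.3, 0.06]   # toy distribution over five tokens
print(top_p_keep(probs, top_p=0.7))
print(top_p_keep(probs, top_p=0.85))
```

A lower `top_p` keeps fewer candidates and makes the output more predictable; a higher value admits less likely tokens and increases variety.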
     


In [95]:
# Generate one continuation per prompt. test_data should be a DataFrame with a 'Lyrics' column
def text_generation(test_data):
  generated_lyrics = []
  for i in range(len(test_data)):
    # .iloc indexes by position, so this also works if rows were dropped and the index has gaps
    x = generate(model, tokenizer, test_data['Lyrics'].iloc[i], entry_count=1)
    generated_lyrics.append(x)
  return generated_lyrics


In [96]:
generated_lyrics = text_generation(test_set)
generated_lyrics

100%|██████████| 1/1 [00:05<00:00,  5.61s/it]
100%|██████████| 1/1 [00:17<00:00, 17.31s/it]
100%|██████████| 1/1 [00:03<00:00,  3.47s/it]
100%|██████████| 1/1 [00:24<00:00, 24.96s/it]
100%|██████████| 1/1 [00:20<00:00, 20.98s/it]
100%|██████████| 1/1 [00:11<00:00, 11.31s/it]
100%|██████████| 1/1 [00:16<00:00, 16.86s/it]
100%|██████████| 1/1 [00:17<00:00, 17.02s/it]
100%|██████████| 1/1 [00:28<00:00, 28.02s/it]
100%|██████████| 1/1 [00:25<00:00, 25.25s/it]
100%|██████████| 1/1 [00:00<00:00,  1.67it/s]
100%|██████████| 1/1 [00:27<00:00, 27.01s/it]
100%|██████████| 1/1 [00:14<00:00, 14.46s/it]
100%|██████████| 1/1 [00:14<00:00, 14.99s/it]
100%|██████████| 1/1 [00:20<00:00, 20.94s/it]
100%|██████████| 1/1 [00:18<00:00, 18.55s/it]
100%|██████████| 1/1 [00:12<00:00, 12.09s/it]
100%|██████████| 1/1 [01:09<00:00, 69.84s/it]
100%|██████████| 1/1 [00:05<00:00,  5.66s/it]
100%|██████████| 1/1 [00:26<00:00, 26.21s/it]
100%|██████████| 1/1 [00:21<00:00, 21.57s/it]
100%|██████████| 1/1 [00:20<00:00,

[["ready for it i did something bad gorgeous style love story you belong with melook what you made me do end gameking of my heartdelicate shake it offdancing with our hands tied so it goes surprise songblank spacedressbad blood shouldve said nodont blame me fine i'll never tell that liea lace bodysuitfrozen moment loved it live and learnoh good i got a chance to buy you these<|endoftext|>"],
 ['there is some unbelievable music that has come out of artists who are from la did you know that like los angeles has put out so many bands and artists that i m such a fan of and i d love to play you some music i m a fan of that s come from la is that okay this one came out in 1981 8 years before i was born and i love this song it s called bette davis eyes her hair is harlow gold her lips sweet surprise her hands are never cold shes got bette davis eyes shell turn her music on you wont have to think twice shes pure as new york snow she s got bette davis eyes and shell tease you shell unease you a

In [97]:
my_generations=[]

for i in range(len(generated_lyrics)):
  a = test_set['Lyrics'].iloc[i].split()[-30:] # Last 30 words of the prompt
  b = ' '.join(a)
  c = ' '.join(generated_lyrics[i]) # Full output: the prompt followed by the generated text
  my_generations.append(c.split(b)[-1]) # Keep only what comes after the prompt
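
The extraction step can be tried on a single string. The sketch below uses the same split-on-prompt-tail idea and, as an optional extra that the notebook does not perform, also strips the `<|endoftext|>` marker. The function name `extract_continuation` and the example text are made up for illustration.

```python
EOS = "<|endoftext|>"

def extract_continuation(prompt, generated, n_tail=30):
    # Use the last n_tail words of the prompt as the marker to split on
    tail = " ".join(prompt.split()[-n_tail:])
    continuation = generated.split(tail)[-1]
    # Optional clean-up: remove the end-of-text marker and surrounding spaces
    return continuation.replace(EOS, "").strip()

prompt = "the quick brown fox jumps over"
generated = prompt + " the lazy dog" + EOS
print(extract_continuation(prompt, generated))
```

Leaving the marker in place, as in the cell above, means it is later treated as part of the prediction when the text is scored.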

In [98]:
test_set['Generated_lyrics'] = my_generations
test_set.head()

Unnamed: 0,index,All_Songs,True_end_lyrics,Lyrics,Generated_lyrics
0,292,ready for it i did something bad gorgeous sty...,long live new years daygetaway carwe are never...,ready for it i did something bad gorgeous styl...,fine i'll never tell that liea lace bodysuitf...
1,406,there is some unbelievable music that has come...,what it takes to make a pro blush all the boys...,there is some unbelievable music that has come...,what it takes to make a pro blush her blue ey...
2,568,1 ready for it 2 gorgeous 3 look what you made...,for it 2 gorgeous 3 look what you made me do 4...,1 ready,"for use. In the Arduino IDE, you can clone it..."
3,689,what a shame didnt wanna be the one that got a...,this is the last time you said no one else thi...,what a shame didnt wanna be the one that got a...,ever ill ever call you babe what a waste taki...
4,193,loving him is like driving a new maserati down...,was red red red red red red loving him is like...,loving him is like driving a new maserati down...,was red red red red red red xxx listening to ...


In [100]:
test_set.to_csv("Final_Generated_1.csv")
df = pd.read_csv('Final_Generated_1.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,index,All_Songs,True_end_lyrics,Lyrics,Generated_lyrics
0,0,292,ready for it i did something bad gorgeous sty...,long live new years daygetaway carwe are never...,ready for it i did something bad gorgeous styl...,fine i'll never tell that liea lace bodysuitf...
1,1,406,there is some unbelievable music that has come...,what it takes to make a pro blush all the boys...,there is some unbelievable music that has come...,what it takes to make a pro blush her blue ey...
2,2,568,1 ready for it 2 gorgeous 3 look what you made...,for it 2 gorgeous 3 look what you made me do 4...,1 ready,"for use. In the Arduino IDE, you can clone it..."
3,3,689,what a shame didnt wanna be the one that got a...,this is the last time you said no one else thi...,what a shame didnt wanna be the one that got a...,ever ill ever call you babe what a waste taki...
4,4,193,loving him is like driving a new maserati down...,was red red red red red red loving him is like...,loving him is like driving a new maserati down...,was red red red red red red xxx listening to ...


In [101]:
import pandas as pd

# Reload the saved results and extract the generated and reference endings
df = pd.read_csv('Final_Generated_1.csv')
preds = df['Generated_lyrics'].tolist()
refs = df['True_end_lyrics'].tolist()

In [102]:
preds[0]

" fine i'll never tell that liea lace bodysuitfrozen moment loved it live and learnoh good i got a chance to buy you these<|endoftext|>"

In [103]:
refs[0]

'long live new years daygetaway carwe are never ever getting back together this is why we cant have nice things'

## Evaluation Metrics

In [104]:
!pip install torchmetrics


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [105]:
import pandas as pd
from torchmetrics.text.rouge import ROUGEScore

# Compute one ROUGE value for a (reference, prediction) pair of strings.
# `app` names the entry to return, e.g. "rougeLsum_fmeasure"
def cc(ref, pred, app):
    rg = ROUGEScore()
    return rg(pred, ref)[app].item()

# Apply the function to each row of the DataFrame to calculate the Rouge score
df['rouge_score_fmeasure'] = df.apply(lambda row: cc(row['True_end_lyrics'], row['Generated_lyrics'], "rougeLsum_fmeasure"), axis=1)
df['rouge_score_precision'] = df.apply(lambda row: cc(row['True_end_lyrics'], row['Generated_lyrics'], "rougeLsum_precision"), axis=1)
df['rouge_score_recall'] = df.apply(lambda row: cc(row['True_end_lyrics'], row['Generated_lyrics'], "rougeLsum_recall"), axis=1)

# Print the updated DataFrame
df.head()


Unnamed: 0.1,Unnamed: 0,index,All_Songs,True_end_lyrics,Lyrics,Generated_lyrics,rouge_score_fmeasure,rouge_score_precision,rouge_score_recall
0,0,292,ready for it i did something bad gorgeous sty...,long live new years daygetaway carwe are never...,ready for it i did something bad gorgeous styl...,fine i'll never tell that liea lace bodysuitf...,0.044444,0.04,0.05
1,1,406,there is some unbelievable music that has come...,what it takes to make a pro blush all the boys...,there is some unbelievable music that has come...,what it takes to make a pro blush her blue ey...,0.36,0.3,0.45
2,2,568,1 ready for it 2 gorgeous 3 look what you made...,for it 2 gorgeous 3 look what you made me do 4...,1 ready,"for use. In the Arduino IDE, you can clone it...",0.170213,0.148148,0.2
3,3,689,what a shame didnt wanna be the one that got a...,this is the last time you said no one else thi...,what a shame didnt wanna be the one that got a...,ever ill ever call you babe what a waste taki...,0.16,0.133333,0.2
4,4,193,loving him is like driving a new maserati down...,was red red red red red red loving him is like...,loving him is like driving a new maserati down...,was red red red red red red xxx listening to ...,0.291667,0.25,0.35


In [107]:
#Mean of f_measure
print("Rouge F-Measure:", df['rouge_score_fmeasure'].mean())
#Mean of Precision
print("Rouge Precision:", df['rouge_score_precision'].mean())
#Mean of Recall
print("Rouge Recall:", df['rouge_score_recall'].mean())

Rouge F-Measure: 0.313116654753685
Rouge Precision: 0.26482994094491
Rouge Recall: 0.3878714295849204
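
To show what these three numbers measure, here is a minimal pure-Python sketch of ROUGE-L on whitespace-separated words, based on the longest common subsequence (LCS). It is a simplified illustration: the `torchmetrics` implementation applies its own normalization and tokenization, so its values can differ. The function names and example strings are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, prediction):
    ref, pred = reference.split(), prediction.split()
    lcs = lcs_length(ref, pred)
    precision = lcs / len(pred) if pred else 0.0   # share of predicted words in the LCS
    recall = lcs / len(ref) if ref else 0.0        # share of reference words in the LCS
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure

precision, recall, fmeasure = rouge_l(
    "was red red loving him",
    "was red listening to him now",
)
print(precision, recall, fmeasure)
```

Recall is divided by the reference length, which appears consistent with the table above: the reported recall values (0.05, 0.45, 0.2, 0.35) are all multiples of 1/20, matching the 20-word references.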


In [108]:
df.to_csv("taylor_songs_results_final.csv")