# Yeatinator - Generate Lyrics in the Style of Your Favorite Artists

![Yeatinator Logo](./yeat.jpg)


In [2]:
from yeatinor import *

We can use the Genius API to retrieve the lyrics of various Yeat songs.

In [None]:
songs_yeat = get_songs_by_artist('Yeat',300)

In [4]:
import re

# songs_yeat is a dict; keep only its values (the raw lyrics)
data = list(songs_yeat.values())

# Remove the first line of each song (the title)
data = [song[song.find('\n') + 1:] for song in data]

# Remove text in square brackets
data = [re.sub(r'\[.*?\]', '', song) for song in data]

# Remove text in parentheses
data = [re.sub(r'\(.*?\)', '', song) for song in data]


# Sanity check
print(data[0][:295])

Yeat concert, Yeat, Yeat 
Man, one of the biggest artists in the world right now
We seen it all, niggas was gettin' tazed
Bitches was poppin' pussy by the front door
Niggas was throwin' chairs, everything was goin' down
Fifty-thousand people, Yeat concert, twizz shit right here
Real twizz shit 
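
To make the three cleaning steps concrete, here is a small sketch that applies them to a made-up lyric string. The sample text and the `clean_lyrics` helper are illustrative assumptions and are not part of the notebook.

```python
import re

def clean_lyrics(raw: str) -> str:
    """Apply the same three cleaning steps as the cell above to one song."""
    # Drop everything up to and including the first newline (the title line)
    body = raw[raw.find('\n') + 1:]
    # Remove text in square brackets, e.g. [Intro]
    body = re.sub(r'\[.*?\]', '', body)
    # Remove text in parentheses, e.g. (woo)
    body = re.sub(r'\(.*?\)', '', body)
    return body

# Made-up sample, only for illustration
sample = "Some Title Lyrics\n[Intro]\nyeah (woo), i'm up\n[Verse 1]\nsecond line (ayy)"
cleaned = clean_lyrics(sample)
print(repr(cleaned))
```

The cleaned text still contains stray spaces and blank lines. The stray spaces are filtered out when the corpus is built, while each blank line still contributes one line-break token.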


In [5]:
# Build the corpus: a flat list of lowercase words,
# with a ' \n ' token appended at the end of every line
corpus = []
for song in data:
    for line in song.split("\n"):
        line_words = [word for word in line.lower().split(' ') if word.strip() != '']
        line_words.append(' \n ')
        corpus.extend(line_words)


In [6]:
# get the unique words in the corpus
unique_words = list(set(corpus))
vocab_size = len(unique_words)
words = {w: idx for idx, w in enumerate(unique_words)}
print('Vocab size:', vocab_size)

Vocab size: 8299


In [7]:
# Now let's define our model; we'll use Keras for this
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout,Bidirectional,GlobalMaxPool1D
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import numpy as np



In [9]:
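sequence_length = 15  # number of context words fed to the model
step = 3              # stride between consecutive training windows
prev_words = []       # context windows
next_words = []       # the word that follows each window
for i in range(0, len(corpus) - sequence_length, step):
    prev_words.append(corpus[i: i + sequence_length])
    next_words.append(corpus[i + sequence_length])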
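
The windowing above is easier to follow on a tiny example. The sketch below uses the same slicing on a made-up ten-token corpus with a shorter window. `make_windows` and `toy_corpus` are hypothetical names introduced only for this illustration.

```python
def make_windows(tokens, sequence_length, step):
    """Return (contexts, targets) using the same slicing as the cell above."""
    contexts, targets = [], []
    for i in range(0, len(tokens) - sequence_length, step):
        contexts.append(tokens[i: i + sequence_length])  # the context words
        targets.append(tokens[i + sequence_length])      # the word to predict
    return contexts, targets

# Made-up corpus; ' \n ' is the line-break token used in the notebook
toy_corpus = ["i", "got", "racks", " \n ", "i", "got", "more", " \n ", "yeah", " \n "]
contexts, targets = make_windows(toy_corpus, sequence_length=4, step=3)
for ctx, tgt in zip(contexts, targets):
    print(ctx, "->", repr(tgt))
```

Because the line-break token is part of the corpus, it can appear both inside a context window and as a target, which is how the model gets a chance to learn where lines end.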


In [10]:
word_index = {w: idx for idx, w in enumerate(unique_words)}
index_word = {idx: w for idx, w in enumerate(unique_words)}

In [11]:
def generator(prev_word_list, next_word_list, batch_size):
    index = 0
    while True:
        x = np.zeros((batch_size, sequence_length), dtype=np.int32)
        y = np.zeros((batch_size), dtype=np.int32)
        for i in range(batch_size):
            for t, w in enumerate(prev_word_list[index % len(prev_word_list)]):
                x[i, t] = word_index[w]
            y[i] = word_index[next_word_list[index % len(prev_word_list)]]
            index = index + 1
        yield x, y
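
To see what the generator yields, here is a self-contained sketch with toy data. It follows the same logic as the cell above, but takes the vocabulary and sequence length as arguments instead of reading globals. `batch_generator` and the toy inputs are assumptions made for this illustration.

```python
import numpy as np

def batch_generator(contexts, targets, word_index, sequence_length, batch_size):
    """Endlessly yield (x, y) index batches, wrapping around at the end of the data."""
    index = 0
    n = len(contexts)
    while True:
        x = np.zeros((batch_size, sequence_length), dtype=np.int32)
        y = np.zeros((batch_size,), dtype=np.int32)
        for i in range(batch_size):
            # Encode each context word as its integer id
            for t, w in enumerate(contexts[index % n]):
                x[i, t] = word_index[w]
            y[i] = word_index[targets[index % n]]
            index += 1
        yield x, y

# Toy vocabulary and windows, only for illustration
toy_index = {"a": 0, "b": 1, "c": 2, "d": 3}
toy_contexts = [["a", "b"], ["b", "c"], ["c", "d"]]
toy_targets = ["c", "d", "a"]

gen = batch_generator(toy_contexts, toy_targets, toy_index, sequence_length=2, batch_size=2)
x1, y1 = next(gen)
x2, y2 = next(gen)
print(x1.tolist(), y1.tolist())
print(x2.tolist(), y2.tolist())
```

With three samples and a batch size of two, the second batch wraps around to the first sample again. This wrap-around is what allows the number of steps per epoch to be rounded up when the dataset size is not a multiple of the batch size.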


## LSTM Model

In [12]:
model = Sequential(name='Yeatinator')
model.add(Embedding(input_dim=vocab_size, output_dim=124))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(GlobalMaxPool1D())
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(prev_words, next_words, test_size=0.2, random_state=42)

In [None]:
BATCH_SIZE = 64
model.fit(generator(X_train, y_train, BATCH_SIZE),
                    steps_per_epoch=int(len(X_train)/BATCH_SIZE) + 1,
                    epochs=10,
                    validation_data=generator(X_test, y_test, BATCH_SIZE),
                    validation_steps=int(len(X_test)/BATCH_SIZE) + 1)



In [21]:
source = "i'm at the top"
source = source.split(' ')
n_token = 100
lyrics = source.copy()
input_text = source.copy()
temperature = 1  # note: defined but not applied; sampling below uses the raw softmax output
for i in range(n_token):
    # Encode the current context; every word must already be in the vocabulary
    encoded = np.array([word_index[word] for word in input_text]).reshape(1, -1)
    y_pred = model.predict(encoded, verbose=0)
    # Cast to float64 and renormalize so np.random.choice accepts the probabilities
    probs = y_pred[0].astype(np.float64)
    probs = probs / np.sum(probs)
    next_idx = np.random.choice(range(vocab_size), size=1, p=probs)[0]
    input_text.append(index_word[next_idx])
    # Keep only the last `sequence_length` words as context
    if len(input_text) > sequence_length:
        input_text = input_text[1:]
    lyrics.append(index_word[next_idx])

print(' '.join(lyrics))

i'm at the top  
  got to me up, i got some bitch, and right go nobody, if all that sleep  
  got to of bitch that am when baby, bitch, first, the oh yeah  
  back it 'bout i'ma really wanna big feel at landin'?"  
  i gon' all these ran you just from the i'm not this but i heard i end  
  i flipping skin me  
  ah, and they pushing  
  i took gotta what's her her if yeah, when yeah, the need changin' like up  
  i'm this inside you want the never feeling  
  i be couldn't keller,
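
The sampling cell defines `temperature = 1` but never uses it, so words are drawn directly from the softmax output. As a minimal sketch of how a temperature could be applied (`sample_with_temperature` is a hypothetical helper, not part of the notebook), the probabilities can be rescaled in log space before sampling:

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from `probs` after rescaling by `temperature`.

    Returns the sampled index and the rescaled distribution.
    """
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    # Work in log space; the small epsilon avoids log(0)
    logits = np.log(probs + 1e-12) / temperature
    # Softmax with max-subtraction for numerical stability
    scaled = np.exp(logits - np.max(logits))
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled)), scaled

# Toy distribution, only for illustration
toy_probs = [0.7, 0.2, 0.1]
_, sharp = sample_with_temperature(toy_probs, temperature=0.5, rng=np.random.default_rng(0))
idx, same = sample_with_temperature(toy_probs, temperature=1.0, rng=np.random.default_rng(0))
_, flat = sample_with_temperature(toy_probs, temperature=2.0, rng=np.random.default_rng(0))
print(sharp.round(3), same.round(3), flat.round(3))
```

A temperature below 1 concentrates probability on the most likely words, which tends to give more repetitive but more coherent text. A temperature above 1 flattens the distribution and makes the output more random. At exactly 1 the distribution is unchanged, which matches what the notebook does.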


The model generates lyrics that echo Yeat's vocabulary and tone, but we can clearly see the limits of training a generative model from scratch: most lines are not coherent sentences. Song lyrics (Yeat's especially) also don't always use correct grammar, which makes it harder for the model to learn the relationships between words. Let's try another approach and fine-tune GPT-2 so that it incorporates Yeat's style.

![Yeatinator Logo](./yeatnogrammar.jpg)


## Fine-Tuning the GPT-2 Model

In [None]:

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments

tokenizer_yeat= AutoTokenizer.from_pretrained("gpt2")
model_yeat = AutoModelForCausalLM.from_pretrained("gpt2")


def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
    tokenizer = tokenizer,
    file_path = file_path,
    block_size = block_size,
    )
    return dataset

def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=mlm,
    )
    return data_collator

def train(train_file_path,
      output_dir,
      overwrite_output_dir,
      per_device_train_batch_size,
      num_train_epochs,
      save_steps):


    tokenizer = tokenizer_yeat
    train_dataset = load_dataset(train_file_path, tokenizer)
    data_collator = load_data_collator(tokenizer)

    tokenizer.save_pretrained(output_dir)

    model = model_yeat
    model.save_pretrained(output_dir)

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=overwrite_output_dir,
        per_device_train_batch_size=per_device_train_batch_size,
        num_train_epochs=num_train_epochs,
        save_steps=save_steps,  # previously accepted by train() but never passed on
        do_train=True,
    )
    trainer = Trainer(
          model=model,
          args=training_args,
          data_collator=data_collator,
          train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model()





In [9]:
train_file_path = "/lines_of_data.txt"
model_name = 'gpt2'
output_dir = '/content/result'
overwrite_output_dir = False
per_device_train_batch_size = 8
num_train_epochs = 10
save_steps = 500

In [10]:
train(
    train_file_path=train_file_path,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)


Token indices sequence length is longer than the specified maximum sequence length for this model (344983 > 1024). Running this sequence through the model will result in indexing errors
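
The warning above appears because the whole training file is tokenized in one pass (344,983 tokens) before it is cut into training examples. As far as we can tell, `TextDataset` then slices the token ids into consecutive blocks of `block_size`, so no sequence longer than the 1024-token limit is actually fed to GPT-2. Here is a minimal sketch of that chunking, assuming plain non-overlapping blocks and no special tokens. `chunk_token_ids` is a hypothetical helper, not the library's code.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Split a long list of token ids into non-overlapping blocks.

    A trailing partial block (shorter than block_size) is dropped.
    """
    return [
        token_ids[i: i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Stand-in for the tokenized lyrics file: same length as reported in the warning
fake_ids = list(range(344983))
blocks = chunk_token_ids(fake_ids, block_size=128)
print(len(blocks), len(blocks[0]), max(len(b) for b in blocks))
```

Under these assumptions the file yields 2,695 blocks of 128 tokens. With a batch size of 8 on a single device and 10 epochs, that is roughly 3,370 training steps, which is consistent with the loss log ending at step 3000.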


| Step | Training Loss |
|------|---------------|
| 500  | 3.0879 |
| 1000 | 2.6903 |
| 1500 | 2.4784 |
| 2000 | 2.3444 |
| 2500 | 2.2324 |
| 3000 | 2.1617 |

In [11]:
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer

In [12]:
def generate_text(sequence, max_length):
    model = model_yeat
    tokenizer = tokenizer_yeat

    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')

    final_outputs = model.generate(
        ids.to(model.device),  # use whichever device the model is on instead of hard-coding 'cuda'
        do_sample=True,
        max_length=max_length,
        pad_token_id=model.config.eos_token_id,
        #top_k=50,
        #top_p=0.95,
    )

    # Decode the generated output
    generated_text = tokenizer.decode(final_outputs[0], skip_special_tokens=True)

    # Remove parentheses
    formatted_text = generated_text.replace('(', '').replace(')', '')

    formatted_text = formatted_text.replace(', ', ',\n').replace('. ', '.\n')

    # Split text by lines and filter out empty lines
    lines = formatted_text.splitlines()
    non_empty_lines = [line for line in lines if line.strip()]  # Filter out empty or whitespace-only lines

    # Join the non-empty lines back with new lines
    final_text = "\n".join(non_empty_lines)

    # Print the final formatted text without extra empty lines
    print(final_text)

In [13]:
generate_text('im at the top',100)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


im at the top',
'I got a whole lot of fuckin' money,
I got a whole lot of fuckin' bullets',
"I got a whole lot of fuckin' bullets,
just sayin' out loud",
'I got all these racks,
yeah,
all they rackie',
"Yeah,
I got these bitches on me,
it's way too much food",
"I'm fuckin' my money up,
bitch,
I'm fuckin' my money down ",
"I might be the


## Let's compare the results of the two models

Prompt: "i'm at the top" for the LSTM model and 'im at the top' for GPT-2, as entered in the cells above.

### LSTM Model
```
i'm at the top  
  got to me up, i got some bitch, and right go nobody, if all that sleep  
  got to of bitch that am when baby, bitch, first, the oh yeah  
  back it 'bout i'ma really wanna big feel at landin'?"  
  i gon' all these ran you just from the i'm not this but i heard i end  
  i flipping skin me  
  ah, and they pushing  
  i took gotta what's her her if yeah, when yeah, the need changin' like up  
  i'm this inside you want the never feeling  
  i be couldn't keller,
```

### Fine-Tuned GPT-2
```
im at the top',
'I got a whole lot of fuckin' money,
I got a whole lot of fuckin' bullets',
"I got a whole lot of fuckin' bullets,
just sayin' out loud",
'I got all these racks,
yeah,
all they rackie',
"Yeah,
I got these bitches on me,
it's way too much food",
"I'm fuckin' my money up,
bitch,
I'm fuckin' my money down ",
"I might be the
```




The results of the fine-tuned GPT-2 model are satisfying: it captures Yeat's style in a more ordered and structured way, whereas the LSTM model struggled to produce grammatically correct sentences.