<a href="https://colab.research.google.com/github/MMBazel/LO_GenAI_Workshops/blob/main/%5BExplainer%5D_HelloTaylorSwift_FineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning a TinyLlama (tinyllama_tayswifty) Model For Fun & Profit



> TinyLlama is a 1.1B Llama model that is currently being trained on 3 trillion tokens, which recently started on September 1st. In this project, I fine-tune the latest version of TinyLlama to generate song lyrics in the style of Taylor Swift.





### Source Materials

This HelloTaylorSwift tutorial is based primarily on the following notebook and dataset:

*   [Original kaggle notebook](https://www.kaggle.com/code/tommyadams/fine-tuning-tinyllama)
*   [Kaggle dataset](https://www.kaggle.com/datasets/thespacefreak/taylor-swift-song-lyrics-all-albums)


However, these other resources are also helpful:

*   Similar Model: https://huggingface.co/huggingartists/taylor-swift
*   Similar Dataset: https://huggingface.co/datasets/huggingartists/taylor-swift


Tutorials on SFT and fine-tuning, TinyLlama, and Hugging Face:
*   [Fine-Tune Your Own Tiny-Llama on Custom Dataset](https://www.youtube.com/watch?v=OVqe6GTrDFM)
*   [TinyLlama LLM: A Step-by-Step Guide to Implementing the 1.1B Model on Google Colab](https://dev.to/_ken0x/tinyllama-llm-a-step-by-step-guide-to-implementing-the-11b-model-on-google-colab-1pjh)
*   [Instruct-Tune Llama to Create ChatGPT Like Chatbots | Custom Dataset, Huggingface, SFT](https://www.youtube.com/watch?v=6XeTk8cZUsM)
*   [https://github.com/uygarkurt/SFT-TinyLlama/tree/main](https://github.com/uygarkurt/SFT-TinyLlama/tree/main)



### Tools

You'll need accounts for:

*   `Google Colab` - Ideally [Pro](https://colab.research.google.com/signup) (it's just faster to use a GPU like A100 or V100 High-RAM ~$10)

*   `Huggingface` - Also ideally [Pro](https://huggingface.co/pricing) (there are some great benefits, including unlimited model and dataset upload ~ $9)



# Get Set-up

### 💡 What each library does



1. `torch`: PyTorch is an open-source machine learning library developed by Facebook. It provides a flexible ecosystem for building and training deep learning models. In this code, PyTorch is used for tensor operations, model training, and GPU acceleration.

2. `re`: The `re` module in Python provides support for regular expressions. It allows you to search, match, and manipulate strings based on specific patterns. In this code, `re` is used for cleaning and preprocessing the lyrics data.

3. `peft`: PEFT (Parameter-Efficient Fine-Tuning) is a library that provides methods for efficient fine-tuning of large language models. It offers techniques like LoRA (Low-Rank Adaptation) to reduce the number of trainable parameters while still achieving good performance. In this code, PEFT is used to apply LoRA to the pre-trained model.

4. `transformers`: The `transformers` library, developed by Hugging Face, provides a wide range of pre-trained transformer models and tools for natural language processing tasks. It offers a unified API for loading, fine-tuning, and using these models. In this code, `transformers` is used to load the pre-trained TinyLlama model and its associated tokenizer.

5. `trl`: TRL (Transformer Reinforcement Learning) is a Hugging Face library for post-training language models, with trainers for supervised fine-tuning as well as reinforcement-learning-based methods. In this code, TRL provides the `SFTTrainer` (Supervised Fine-Tuning trainer) used to fine-tune the model.

6. `datasets`: The `datasets` library, also developed by Hugging Face, provides a collection of popular datasets for various machine learning tasks. It offers a standardized interface for loading, preprocessing, and manipulating datasets. In this code, `datasets` is used to load the Taylor Swift example dataset and create dataset splits.

7. `huggingface_hub`: The `huggingface_hub` library provides functionality for interacting with the Hugging Face Hub, a platform for sharing and discovering pre-trained models, datasets, and other AI-related resources. In this code, `notebook_login` from `huggingface_hub` is used to authenticate and log in to the Hugging Face Hub.

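8. `numpy` (imported as `np`): NumPy is a fundamental library for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical functions. Note that the code in this notebook does not actually import NumPy; the dataset is split with plain Python list slicing.

9. `random`: Python's built-in `random` module generates pseudo-random numbers. In this code, it is used to pick random segments of the test lyrics as prompts.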

### 💬 About the techniques



> I used Hugging Face's transformers and peft (parameter-efficient fine-tuning) packages for this project. One of the major challenges of fine-tuning a large language model (LLM) is the high memory usage on the GPU. To address this challenge, I used the quantization and fine-tuning methods described in the 2023 paper "QLoRA: Efficient Finetuning of Quantized LLMs".


> These methods collectively enhance the efficiency of the project, enabling the creation of Taylor Swift-style song lyrics while optimizing GPU memory utilization and computational resources.
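
To see why quantization matters for GPU memory, here is a back-of-the-envelope sketch of the space needed just to store the weights at different precisions. It assumes 1.1e9 parameters (taken from the "1.1B" in the model name) and ignores activations, gradients, and optimizer state, so real usage will be higher. The helper name `weight_memory_gb` is made up for illustration.

```python
# Rough memory needed to hold the model weights only.
# Assumption: 1.1e9 parameters, based on the "1.1B" in the model name.
NUM_PARAMS = 1.1e9

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "4-bit": 0.5,
}


def weight_memory_gb(num_params, bytes_per_param):
    """Return the weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype:>8}: {weight_memory_gb(NUM_PARAMS, nbytes):.2f} GB")
```

Note that the model-loading cell in this notebook does not pass a quantization config, so the weights are loaded at whatever precision `from_pretrained` defaults to.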





## ▶️ Install necessary libraries

In [None]:
!pip install trl transformers accelerate git+https://github.com/huggingface/peft.git -Uqqq
!pip install -i https://pypi.org/simple/ bitsandbytes -qqq
!pip install einops wandb -Uqqq



In [None]:
import torch
import re
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TrainingArguments
from trl import SFTTrainer
from datasets import Dataset
import random

### ▶️ Log In With Hugging Face Credentials

Note: You need to create a Huggingface account and then create a user access token.

See this [doc](https://huggingface.co/docs/hub/en/security-tokens).

You'll want access so you can upload your trained model.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


# Load Data

## About the dataset: `Taylor Swift Song Lyrics (All Albums)`

Link: https://www.kaggle.com/datasets/thespacefreak/taylor-swift-song-lyrics-all-albums

The following albums were included:

*   Taylor Swift (2006)
*   Fearless (Taylor's Version) (2021)
*   Speak Now (Deluxe Package) (2010)
*   Red (Deluxe Edition) (2012)
*   1989 (Deluxe) (2014)
*   reputation (2017)
*   Lover (2019)
*   folklore (deluxe version) (2020)
*   evermore (deluxe version) (2020)

To understand our data better, let's define each column.


*   album_name - Name of the album
*   track_title - Name of the song
*   track_n - Track number
*   lyric - Lyric at each line
*   line - Line number per song


## ▶️ Load small data subset

We'll use the dataset that I've already uploaded to Huggingface for ease. It's the same as the Kaggle dataset used in the original tutorial.

*   [mmbazel/Taylor-Swift-Example](https://huggingface.co/datasets/mmbazel/Taylor-Swift-Example)

In [None]:
from datasets import load_dataset

dataset = load_dataset("mmbazel/Taylor-Swift-Example")

In [None]:
# Extracting the lyrics from the dataset
train_data = dataset["train"]
lyrics = train_data["lyric"]

## ▶️ Clean & Process The Lyrics

Below we clean and preprocess the song lyrics by replacing or removing specific characters and substrings, such as accented letters, punctuation marks, and unusual Unicode whitespace. The cleaned lyrics are stored in a new list called `cleaned_lyrics`.

In [None]:
# Cleaning the lyrics
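# Unusual Unicode spaces and hyphens that become a plain space
replace_with_space = ['\u2005', '\u200b', '\u205f', '\xa0', '-']
# Accented letters, curly apostrophes (\u2018, \u2019), and the Cyrillic 'е' (\u0435) mapped to ASCII
replace_letters = {'í':'i', 'é':'e', 'ï':'i', 'ó':'o', ';':',', '\u2018':"'", '\u2019':"'", ':':',', '\u0435':'e'}
# Regex patterns removed outright; raw strings avoid invalid-escape warnings
# \u2013 and \u2014 are en and em dashes, \u201c and \u201d are curly double quotes
remove_list = [r'\)', r'\(', '\u2013', '\u201c', '\u201d', '"', r'\[.*\]', r'.*\|.*', '\u2014']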

In [None]:
cleaned_lyrics = []
for lyric in lyrics:
    cleaned_lyric = lyric
    for old, new in replace_letters.items():
        cleaned_lyric = cleaned_lyric.replace(old, new)
    for string in remove_list:
        cleaned_lyric = re.sub(string,'',cleaned_lyric)
    for string in replace_with_space:
        cleaned_lyric = re.sub(string,' ',cleaned_lyric)
    cleaned_lyrics.append(cleaned_lyric)

### 💡 Explanation

The code above does the following:

1. It defines two lists and a dictionary:
   - `replace_with_space`: A list of characters (unusual Unicode spaces and the hyphen) that will be replaced with a space character.
   - `replace_letters`: A dictionary mapping certain characters (like accented letters, semicolons, and colons) to simpler replacements.
   - `remove_list`: A list of regular-expression patterns whose matches will be completely removed from the lyrics.

2. It initializes an empty list called `cleaned_lyrics` to store the cleaned lyrics.

3. It iterates over each lyric in `lyrics` (the `lyric` column of the dataset, one line of a song per entry).

4. For each lyric:
   - It starts from the original lyric, stored as `cleaned_lyric`.
   - It replaces any character in the `replace_letters` dictionary with its corresponding value (e.g., 'í' is replaced with 'i').
   - It removes any match for the patterns in `remove_list` using the `re.sub` function (e.g., it removes parentheses, dashes, double quotes, any text in square brackets, and any line containing a pipe character).
   - It replaces any character present in the `replace_with_space` list with a space character.
   - After cleaning the lyric, it appends the `cleaned_lyric` to the `cleaned_lyrics` list.
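
The steps above can be tried on a few made-up strings. This is a minimal, standalone sketch: the function name `clean_lyric` and the sample inputs are hypothetical, and the curly quotes and dashes are written as Unicode escapes on the assumption that those are the characters the cleaning lists are meant to target.

```python
import re

# Same three collections as in the cleaning cell, written with Unicode escapes.
replace_with_space = ['\u2005', '\u200b', '\u205f', '\xa0', '-']
replace_letters = {'í': 'i', 'é': 'e', 'ï': 'i', 'ó': 'o', ';': ',',
                   '\u2018': "'", '\u2019': "'", ':': ',', '\u0435': 'e'}
remove_list = [r'\)', r'\(', '\u2013', '\u201c', '\u201d', '"',
               r'\[.*\]', r'.*\|.*', '\u2014']


def clean_lyric(lyric):
    """Apply the three cleaning passes to a single lyric line."""
    cleaned = lyric
    for old, new in replace_letters.items():
        cleaned = cleaned.replace(old, new)   # plain character swaps
    for pattern in remove_list:
        cleaned = re.sub(pattern, '', cleaned)  # delete regex matches
    for pattern in replace_with_space:
        cleaned = re.sub(pattern, ' ', cleaned)  # normalize to spaces
    return cleaned


samples = [
    "Caf\u00e9 (oh\u2005oh), it\u2019s over",  # accent, parentheses, odd space, curly apostrophe
    "[Chorus]",                               # section marker in square brackets
    "Verse 1 | Some Artist",                  # line containing a pipe
    "half-hearted; maybe",                    # hyphen and semicolon
]
for sample in samples:
    print(repr(clean_lyric(sample)))
```

Section markers and lines containing a pipe come out as empty strings rather than being dropped, so the cleaned list keeps the same length as the input.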

# Split train-test set

## ▶️ Determine the train, test, validation split

In [None]:
# Splitting the cleaned_lyrics into training, validation, and test sets
train_percentage = 0.9
validation_percentage = 0.05
test_percentage = 0.05

In [None]:
# Calculate split indices
train_index = int(len(cleaned_lyrics) * train_percentage)
validation_index = int(len(cleaned_lyrics) * (train_percentage + validation_percentage))

In [None]:
# Splitting cleaned_lyrics into training, validation, and test sets
train_lyrics = cleaned_lyrics[:train_index]
validation_lyrics = cleaned_lyrics[train_index:validation_index]
test_lyrics = cleaned_lyrics[validation_index:]
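
To make the index arithmetic concrete, here is a small sketch that applies the same 90/5/5 split to a toy list of 100 items. The helper name `split_indices` and the toy data are made up for illustration.

```python
def split_indices(n_items, train_pct, validation_pct):
    """Return the two cut points used to slice a list into train/validation/test."""
    train_index = int(n_items * train_pct)
    validation_index = int(n_items * (train_pct + validation_pct))
    return train_index, validation_index


toy = [f"line {i}" for i in range(100)]
train_index, validation_index = split_indices(len(toy), 0.9, 0.05)

train = toy[:train_index]
validation = toy[train_index:validation_index]
test = toy[validation_index:]

print(len(train), len(validation), len(test))
print(test[0])
```

The slices are taken in order with no shuffling, so the validation and test sets are simply the last lines of the dataset.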

## ▶️ Create a Huggingface dataset object

In [None]:
# Create new datasets with a single 'text' column for training, validation, and testing
train_lyrics_dataset = Dataset.from_dict({'text': train_lyrics})
validation_lyrics_dataset = Dataset.from_dict({'text': validation_lyrics})
test_lyrics_dataset = Dataset.from_dict({'text': test_lyrics})

# Fine-Tune Model

Below we're going to start setting up a pre-trained language model (TinyLlama-1.1B) and configuring it for low-rank adaptation using the LoRA (Low-Rank Adaptation) technique.

## ▶️ Load Model From Huggingface

Here we load the pre-trained `TinyLlama-1.1B-step-50K-105b` model from the Hugging Face Hub using the `AutoModelForCausalLM.from_pretrained` function. This is a pre-trained language model for causal language modeling tasks, such as text generation.

In [None]:
# Loading the pre-trained model
model_name = "PY007/TinyLlama-1.1B-step-50K-105b"
model = AutoModelForCausalLM.from_pretrained(model_name)

## ▶️ Setting up the tokenizer

The code loads the tokenizer associated with the pre-trained model using `AutoTokenizer.from_pretrained`. The tokenizer is responsible for converting text data into numerical representations (tokens) that the model can understand.

It sets the `pad_token` to be the same as the `eos_token` (end-of-sequence token), which gives the tokenizer a padding token to use when batching sequences of different lengths.

In [None]:
# Creating tokenizer and defining the pad token
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

### ▶️ Move model to appropriate device

The code checks if a CUDA-enabled GPU is available. If so, it moves the model to the GPU using `model.to("cuda")`. Otherwise, it moves the model to the CPU using `model.to("cpu")`.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)  # Move the model to the appropriate device

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Line

## Defining a function for text generation

*   The `generate_lyrics` function is defined to generate text based on a given query.

*   It tokenizes the input query using the tokenizer and moves the tokens to the appropriate device.

*   It sets up the generation configuration using `GenerationConfig` with specific parameters, such as `max_new_tokens` (maximum number of new tokens to generate), `repetition_penalty` (penalizing repeated tokens), `temperature` (controlling the randomness of the generated text), and `do_sample` (enabling sampling for text generation).

*   The function generates text using the `model.generate` method and decodes the output tokens back into text using the tokenizer.

*   Finally, it prints the generated text, excluding the original query.

In [None]:
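# repetition_penalty was originally 1.3; this version uses 1.5
# max_new_tokens was originally 250; this version uses 200

def generate_lyrics(query, model, tokenizer):
    encoding = tokenizer(query, return_tensors="pt").to(device)
    generation_config = GenerationConfig(
        max_new_tokens=200,                   # Upper bound on generated tokens
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.5,               # Values above 1 discourage repeated tokens
        eos_token_id=tokenizer.eos_token_id,
        temperature=1.3,                      # Values above 1 make sampling more random
        do_sample=True,                       # Sample instead of greedy decoding
    )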
    outputs = model.generate(input_ids=encoding.input_ids, generation_config=generation_config)
    text_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

    output_lines = text_output[len(query):].split('\n')
    for line in output_lines:
        if line.strip():
            print(line)
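
To build intuition for `temperature` and `repetition_penalty`, here is a simplified pure-Python sketch of the two ideas on a toy four-token vocabulary. It is an illustration only: the function names and toy scores are made up, and the library's real implementation works on tensors and may differ in detail.

```python
import math


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the scores of tokens that have already appeared."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive scores
        else:
            adjusted[token_id] *= penalty   # push negative scores further down
    return adjusted


def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
seen = [0]                      # token 0 was already generated

penalized = apply_repetition_penalty(logits, seen, penalty=1.5)
probs_sharp = softmax_with_temperature(penalized, temperature=0.7)
probs_flat = softmax_with_temperature(penalized, temperature=1.3)

print([round(p, 3) for p in probs_sharp])
print([round(p, 3) for p in probs_flat])
```

A temperature above 1, like the 1.3 used in `generate_lyrics`, spreads probability across more tokens, which makes the output more varied but also more likely to wander.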

## Using LoRA

Here we're setting up a pre-trained language model (TinyLlama-1.1B) for low-rank adaptation using the LoRA technique.

LoRA is a method that allows fine-tuning large language models with much fewer trainable parameters, reducing memory requirements and enabling faster fine-tuning on specific tasks or datasets.

By applying LoRA to the pre-trained model, the code prepares the model for further fine-tuning or adaptation while keeping the original pre-trained weights intact.

From the original notebook:

> **Low-rank adaptation:**
This technique freezes the existing weights of TinyLlama and adds two smaller matrices with lower rank than the weight matrices into the model. Only these two smaller matrices are then trained, instead of all of the model weights. Another way to think of this is that we are grouping weights together and training a scalar for each group, which is much easier than training each weight individually. In addition, low-rank adaptation is only done for the query and value weights in the attention heads of the transformer, while all other areas of the model are frozen. This greatly reduces the computation needed to fine-tune the model, while not impairing performance.
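
To get a feel for the savings, here is a rough parameter count. It is a sketch under a few assumptions: the layer shapes are read off the model summary printed earlier in this notebook (`q_proj` maps 2048 to 2048, `v_proj` maps 2048 to 256, and there are 22 decoder layers), adapters are attached only to those two projections as the passage above describes, and the rank is 32 as in this notebook's LoRA settings. The function names are illustrative.

```python
# Shapes taken from the model summary printed in this notebook.
NUM_LAYERS = 22
TARGETS = {"q_proj": (2048, 2048), "v_proj": (2048, 256)}  # (in_features, out_features)
LORA_RANK = 32


def full_params(in_features, out_features):
    """Number of weights in the original (frozen) linear layer."""
    return in_features * out_features


def lora_params(in_features, out_features, rank):
    """Number of weights in the two low-rank matrices added by LoRA."""
    # A has shape (rank, in_features); B has shape (out_features, rank)
    return rank * in_features + out_features * rank


frozen = NUM_LAYERS * sum(full_params(i, o) for i, o in TARGETS.values())
trainable = NUM_LAYERS * sum(lora_params(i, o, LORA_RANK) for i, o in TARGETS.values())

print(f"frozen weights in adapted layers: {frozen:,}")
print(f"trainable LoRA weights:           {trainable:,}")
print(f"ratio: {trainable / frozen:.2%}")
```

Under these assumptions the adapters add about 4.5 million trainable parameters, roughly 4% of the weights they sit alongside and well under 1% of the 1.1B-parameter model.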



### ▶️ Preparing the model for low-rank adaptation (LoRA)

In [None]:
# Preparing the model for low-rank adaptation (e.g., LoRA)
prepared_model = prepare_model_for_kbit_training(model)

lora_alpha = 32
lora_dropout = 0.05
lora_rank = 32

### ▶️ Configuring the LoRA parameters

In [None]:
# Configuring the LoRA parameters
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_rank,
    bias="none",  # setting to 'none' for only training weight params instead of biases
    task_type="CAUSAL_LM")


### ▶️ Applying LoRA to the prepared model

In [None]:
# Applying LoRA to the prepared model
peft_model = get_peft_model(prepared_model, peft_config)

## Set up fine-tuning configuration

### ▶️ Replace `mmbazel` with your Huggingface username

The code sets `output_dir` to `"mmbazel/tinyllama_tayswifty"`, which is the repository on the Hugging Face Hub where the fine-tuned model will be saved.

In [None]:
output_dir = "mmbazel/tinyllama_tayswifty" # Model repo on your hugging face account where you want to save your model

### ▶️ Set training arguments

In [None]:
# Setting training arguments
per_device_train_batch_size = 3  # Examples per device in each forward/backward pass
gradient_accumulation_steps = 2  # Batches to accumulate before each weight update
optim = "paged_adamw_32bit"      # Optimizer
save_steps = 10                  # Save a checkpoint every 10 steps
logging_steps = 10               # Log the training loss every 10 steps
learning_rate = 2e-3             # Peak learning rate reached after warmup
max_grad_norm = 0.3              # Gradient-clipping threshold
max_steps = 200                  # Total number of training steps
warmup_ratio = 0.03              # Fraction of steps spent warming up the learning rate from 0
lr_scheduler_type = "cosine"     # Learning-rate schedule
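
The `learning_rate`, `warmup_ratio`, `max_steps`, and `lr_scheduler_type` settings together describe a schedule: a short linear warmup followed by a cosine decay. The sketch below reproduces that shape in plain Python. The function name `lr_at_step` is made up, and the trainer's actual scheduler may differ in details such as how warmup steps are rounded.

```python
import math


def lr_at_step(step, base_lr=2e-3, max_steps=200, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0 to 1
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 3, 6, 50, 100, 150, 200):
    print(f"step {step:>3}: lr = {lr_at_step(step):.6f}")
```

With these values the rate peaks at 2e-3 within the first few steps and then falls smoothly toward zero by step 200.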

### ▶️ Creating the SFT (Supervised Fine-Tuning) trainer


*   The code creates an `SFTTrainer` object, which is responsible for fine-tuning the LoRA-adapted model (`peft_model`) on the `train_lyrics_dataset`.

*   It passes various configurations to the `SFTTrainer`, such as the `peft_config`, `max_seq_length` (maximum sequence length), `dataset_text_field` (the field in the dataset containing the text), `tokenizer`, and `training_arguments`.

*   `peft_model.config.use_cache` is set to `False`, which disables the model's key/value cache during training.

In [None]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    push_to_hub=True,
    report_to='none'
)

In [None]:
# Creating the SFT trainer
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=train_lyrics_dataset,
    peft_config=peft_config,
    max_seq_length=500,
    dataset_text_field='text',
    tokenizer=tokenizer,
    args=training_arguments
)
peft_model.config.use_cache = False

Map:   0%|          | 0/7522 [00:00<?, ? examples/s]



## Actually training the model

In [None]:
# Training the model
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10 | 4.7909 |
| 20 | 3.4515 |
| 30 | 3.4947 |
| 40 | 3.3161 |
| 50 | 3.2292 |
| 60 | 3.181 |
| 70 | 3.0821 |
| 80 | 3.2482 |
| 90 | 3.0432 |
| 100 | 3.2312 |


TrainOutput(global_step=200, training_loss=3.254313144683838, metrics={'train_runtime': 61.602, 'train_samples_per_second': 19.48, 'train_steps_per_second': 3.247, 'total_flos': 103816598925312.0, 'train_loss': 3.254313144683838, 'epoch': 0.16})
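
The `'epoch': 0.16` figure in this output can be reproduced from the training settings. The sketch below assumes a single GPU and uses 7,522 training examples, the count shown in this run's output when the trainer processed the dataset.

```python
per_device_train_batch_size = 3
gradient_accumulation_steps = 2
max_steps = 200
num_train_examples = 7522  # count reported in this run's output
num_devices = 1            # assumption: a single GPU

# Examples consumed per optimizer step (the effective batch size)
examples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = examples_per_step * max_steps
epochs = examples_seen / num_train_examples

print(f"effective batch size: {examples_per_step}")
print(f"examples seen:        {examples_seen}")
print(f"epochs:               {epochs:.2f}")
```

In other words, 200 steps at an effective batch size of 6 cover about 1,200 lyric lines, or roughly one sixth of the training set.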

# ▶️ Try Out The Fine-Tuned Model

In [None]:
# Generate lyrics using random segments of the test lyrics
num_examples = 5  # Number of random examples to generate
max_segment_length = 200  # Number of consecutive lyric lines joined into each prompt

In [None]:
for i in range(num_examples):
    # Randomly select a starting index for the lyric segment
    start_index = random.randint(0, len(test_lyrics) - max_segment_length)
    end_index = start_index + max_segment_length

    # Extract the lyric segment
    lyric_segment = ' '.join(test_lyrics[start_index:end_index])

    print(f"Example {i+1}:")
    print("INPUT:")
    print(lyric_segment)
    print("OUTPUT:")
    generate_lyrics(lyric_segment, model, tokenizer)
    print("\n")

Example 1:
INPUT:
Never be so polite You forget your power Never wield such power You forget to be polite And if I didn't know better I'd think you were listening to me now If I didn't know better I'd think you were still around What died didn't stay dead What died didn't stay dead You're alive, you're alive in my head What died didn't stay dead What died didn't stay dead You're alive, so alive The autumn chill that wakes me up You loved the amber skies so much Long limbs and frozen swims You'd always go past where our feet could touch And I complained the whole way there The car ride back and up the stairs I should've asked you questions I should've asked you how to be Asked you to write it down for me Should've kept every grocery store receipt 'Cause every scrap of you would be taken from me Watched as you signed your name Marjorie All your closets of backlogged dreams And how you left them all to me What died didn't stay dead What died didn't stay dead You're alive, you're alive in 

# 🗣️ What's Next

### 😭 But the lyrics don't make sense?????

We've fine-tuned the model on a dataset of song lyrics, so ideally the generated output would be coherent and recognizably in Taylor Swift's style. In practice, it often isn't.


There could be a few reasons for this:

*   **Insufficient fine-tuning**

Depending on the size and quality of your fine-tuning dataset, the model may need more training to capture the patterns and style of the lyrics. This run stopped after 200 steps, which the `TrainOutput` above reports as only 0.16 of an epoch, so the model saw a small fraction of the dataset. Training for longer or on a larger dataset may improve the results.


*   **Overfitting**

If the fine-tuning dataset is small or not diverse enough, the model may overfit to the specific examples in the dataset. This can lead to the model generating lyrics that are too similar to the training data or not generalizing well to new inputs.



*   **Model architecture limitations**

The TinyLlama model architecture may have limitations in capturing the complexities and nuances of song lyrics. Some nonsense or incoherent output may be inherent to the model's design and capacity.
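
One rough way to check for the overfitting described above is to measure how much of a generated passage is copied word for word from the training lyrics. The sketch below computes the fraction of 4-word sequences in the output that also appear in the training lines. The function names and toy lines are made up for illustration, and this is only a heuristic, not a standard metric.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_fraction(generated, training_lines, n=4):
    """Fraction of the generated text's n-grams that also occur in the training lines."""
    generated_ngrams = ngrams(generated, n)
    if not generated_ngrams:
        return 0.0
    training_ngrams = set()
    for line in training_lines:
        training_ngrams |= ngrams(line, n)
    return len(generated_ngrams & training_ngrams) / len(generated_ngrams)


# Toy stand-ins for train_lyrics and model outputs
toy_training_lines = [
    "the rain keeps falling on the road",
    "we drive all night to the sea",
]
print(copied_fraction("the rain keeps falling on the road", toy_training_lines))
print(copied_fraction("the rain keeps falling on my window", toy_training_lines))
print(copied_fraction("purple clouds sing over quiet hills tonight", toy_training_lines))
```

A value near 1.0 suggests the model is reciting training lines, while a value near 0.0 means the wording is new (though not necessarily good).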

### 🤔 So what can we do instead?

Try the following:

*   **`Expand and diversify the fine-tuning dataset`**

Ensure that your fine-tuning dataset is large enough and covers a wide range of song styles, themes, and vocabularies. A diverse dataset can help the model learn more robust and generalizable patterns.

*  **`Adjust fine-tuning hyperparameters`**

Experiment with different hyperparameters during fine-tuning, such as learning rate, batch size, and number of epochs. Fine-tuning with optimal hyperparameters can help the model better adapt to the song lyrics domain.

*  **`Implement post-processing techniques`**

Applying post-processing techniques to filter out or modify nonsensical parts can help improve the quality of the generated lyrics. This can include removing or replacing specific words, applying language rules, or using semantic similarity measures.

*  **`Use a different model architecture`**

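If the TinyLlama model consistently produces nonsensical output even after fine-tuning, it may be worth trying a different pre-trained model, such as GPT-2 or GPT-3. TinyLlama is itself transformer-based, so any gain would come from the model's size and training rather than from a different kind of architecture. The checkpoint name used here (`step-50K-105b`) also suggests an early point in TinyLlama's planned 3-trillion-token training run, so a later checkpoint may do better.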

*  **`Combine multiple approaches`**

Integrating fine-tuned models with other techniques, such as template-based generation, rule-based generation, or retrieval-based generation, can help provide a framework or structure for the generated lyrics while leveraging the model's creativity.

*  **`Iterative refinement`**

Treat the generated lyrics as a starting point and iteratively refine them through human curation and editing. Identify the most promising parts, make necessary modifications, and use them as inspiration for further generation or manual refinement.
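
As a starting point for the post-processing idea in the list above, here is a small sketch that tidies raw model output: it collapses extra whitespace, drops lines with no letters, and limits how many times the same line can repeat in a row. The function name `tidy_generated_lyrics`, the `max_repeats` parameter, and the rules themselves are assumptions chosen for illustration, not part of the original notebook.

```python
def tidy_generated_lyrics(text, max_repeats=2):
    """Clean up generated lyrics with a few simple line-level rules."""
    kept = []
    repeat_count = 0
    for raw_line in text.split("\n"):
        line = " ".join(raw_line.split())  # collapse runs of whitespace
        if not any(ch.isalpha() for ch in line):
            continue  # skip blank lines and lines made only of symbols or digits
        if kept and line.lower() == kept[-1].lower():
            repeat_count += 1
            if repeat_count >= max_repeats:
                continue  # too many identical lines in a row
        else:
            repeat_count = 0
        kept.append(line)
    return "\n".join(kept)


raw = "Hello   world\n\n!!!\nla la\nla la\nla la\nla la\nnew line"
print(tidy_generated_lyrics(raw))
```

Rules like these only remove obvious noise; they cannot make incoherent lines meaningful, so they work best alongside the human curation described in the last item.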