In [1]:
%%capture
!pip install tensorboard transformers==4.40.0 accelerate==0.29.3 evaluate==0.4.1 bitsandbytes==0.43.1 huggingface_hub==0.22.2 trl==0.8.6 peft==0.10.0 chardet
!pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118

In [2]:
%%capture
!pip install gensim nltk datasets==2.18.0 keybert

In [3]:
%%capture
#!pip install flash-attn --no-build-isolation


This section uses the KeyBERT model to extract keywords from the abstracts of machine learning research papers in an arXiv dataset. The keywords later serve as the target labels for fine-tuning. The steps are:

* **Load Dataset**: Load the "CShorten/ML-ArXiv-Papers" dataset from the Hugging Face Hub, specifically the training split.
* **Extract Abstracts**: Extract the abstracts of the papers from the dataset to use for keyword extraction.
* **Initialize KeyBERT Model**: Initialize the KeyBERT model, which leverages BERT embeddings for keyword extraction.
* **Extract Keywords**: For each abstract, extract keywords using KeyBERT, considering only single words (unigrams) and removing common English stop words. Collect these keywords into a list.
* **Store Results**: Store each abstract along with its associated keywords in a list.
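
As a small illustration of the last two steps, the sketch below turns a list of `(keyword, score)` pairs, the shape that `extract_keywords` returns for a single document, into one comma-separated label string. The keywords and scores are made up for illustration, and `to_label` is a hypothetical helper name, not part of KeyBERT.

```python
# Hypothetical (keyword, score) pairs; the values are invented for this example.
scored_keywords = [("quantization", 0.61), ("transformers", 0.54), ("pruning", 0.47)]


def to_label(pairs):
    """Drop the scores and join the keywords into one comma-separated string."""
    return ", ".join(keyword for keyword, _score in pairs)


label = to_label(scored_keywords)
print(label)
```

The label string keeps the order in which the keywords were returned, which is also the order the fine-tuned model is later trained to reproduce.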

In [4]:
from keybert import KeyBERT
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
# Use only 500 samples to keep training fast; the full dataset has more than 30K.
dataset = dataset.shuffle(seed=65).select(range(500))

# Extract abstracts to train on
abstracts = dataset["abstract"]

# Initialize KeyBERT model
kw_model = KeyBERT()

# Extract keywords and associate them with abstracts
keywords_labels = []
for abstract in abstracts:
    tmp = kw_model.extract_keywords(abstract, keyphrase_ngram_range=(1, 1), stop_words='english')
    labels = [keyword for keyword, score in tmp]
    labels = ', '.join(labels)
    keywords_labels.append([abstract, labels])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Logging into Hugging Face Hub

The `notebook_login` function from the `huggingface_hub` library authenticates the notebook with the Hugging Face Hub. Llama 3 is a gated model, so you also need to request access to it; approval can take anywhere from 2 minutes to 2 hours.
* Request Llama 3 access: https://huggingface.co/meta-llama/Meta-Llama-3-8B
* Create an HF token if you don't have one: https://huggingface.co/settings/tokens

In [5]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

# Importing Libraries for Model Training and Fine-Tuning

   - `os`: For interacting with the operating system.
   - `random`: For generating random numbers, useful for reproducibility.
   - `torch`: The core PyTorch library for tensor operations and deep learning.
   - `AutoModelForCausalLM`: For loading pre-trained causal language models.
   - `AutoTokenizer`: For loading tokenizers compatible with the models.
   - `TrainingArguments`: For specifying training configurations.
   - `set_seed`: For setting the random seed to ensure reproducibility.
   - `EarlyStoppingCallback`: For implementing early stopping during training.
   - `BitsAndBytesConfig`: For configurations related to memory optimization and quantization.
   - `SFTTrainer`: For supervised fine-tuning of transformer models.
   - `setup_chat_format`: For setting up the chat format, if applicable.
   - `LoraConfig`: Configuration class for Low-Rank Adaptation (LoRA).
   - `AutoPeftModelForCausalLM`: For loading pre-trained models with PEFT configurations.



In [6]:
import os
import random
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    set_seed,
    EarlyStoppingCallback,
    BitsAndBytesConfig
)
from trl import SFTTrainer, setup_chat_format
from peft import LoraConfig, AutoPeftModelForCausalLM

In [7]:
# Configuration parameters
model_id = "meta-llama/Meta-Llama-3-8B"
torch_dtype = torch.bfloat16
quant_storage_dtype = torch.bfloat16
gradient_checkpointing = True

# Setting a seed for reproducibility
set_seed(42)

In [9]:
# Initialize the tokenizer (the transformers imports from the earlier cell are reused)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [10]:
from datasets import Dataset
import pandas as pd

# Custom Prompt Formatting for the Llama Model

To prepare the prompts for the Llama model, we define a custom function `formatting_prompts_func`. This function takes a list of examples, where each example is a pair of an input text and its corresponding response. It formats each example into a structured prompt that contains the input text, the expected response, and an End-of-Sequence (EOS) token marking the end of the example.

The provided `llama_prompt` template defines the structure of the prompt, with placeholders for the input and response texts. It ensures that the assistant's role and standards are clearly stated in each prompt.

The `EOS_TOKEN` variable is set to the tokenizer's end-of-sequence token, ensuring that each prompt has a proper ending to prevent infinite generation.

Finally, the `formatting_prompts_func` function iterates over the examples, formats each one according to the `llama_prompt` template, and appends the EOS token. The formatted prompts are then returned as a dictionary with the key "text".


In [11]:
llama_prompt = """As an assistant for topic modeling, my role is to act considerately, reliably, and as a trustworthy guide in the domain of topic modeling. I aim to provide assistance that meets these standards consistently.

### Input:
{}

### Response:
{}"""

# The EOS token must be appended to every training example
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """Format [input, response] pairs into prompt strings that end with EOS."""
    texts = []
    for input_text, response in examples:
        # Without EOS_TOKEN the model never learns when to stop generating
        text = llama_prompt.format(input_text, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}


dataset = formatting_prompts_func(keywords_labels)
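
To make the difference between a training example and an inference prompt concrete, here is a minimal sketch that uses a shortened stand-in template and a placeholder EOS string. Both `template` and `eos_token` are assumptions for illustration; the notebook itself takes them from `llama_prompt` and `tokenizer.eos_token`.

```python
# Stand-in values, not the real template or the real Llama 3 EOS token.
template = "### Input:\n{}\n\n### Response:\n{}"
eos_token = "<EOS>"


def build_training_text(abstract, keywords):
    """Training example: input, expected response, then EOS."""
    return template.format(abstract, keywords) + eos_token


def build_inference_prompt(abstract):
    """Inference prompt: the response slot is left empty for the model to fill."""
    return template.format(abstract, "")


example = build_training_text("A short abstract.", "abstract, short")
prompt = build_inference_prompt("A short abstract.")
print(example)
```

The training text ends with the EOS marker, while the inference prompt stops right after `### Response:` so that generation continues from there.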

# Configuration for Model Quantization

The snippet below configures 4-bit quantization with the `BitsAndBytesConfig` class. Quantization stores the model's weights at reduced precision, which cuts memory usage enough to load and fine-tune an 8B-parameter model on hardware with limited GPU memory.

The `quantization_config` object is initialized with the following parameters:
- `load_in_4bit`: Enables loading the linear-layer weights in 4-bit format.
- `bnb_4bit_use_double_quant`: Enables double quantization, which also quantizes the quantization constants to save additional memory.
- `bnb_4bit_quant_type`: Specifies the 4-bit data type, with "nf4" selecting the 4-bit NormalFloat type.
- `bnb_4bit_compute_dtype`: Defines the data type used for computation; the 4-bit weights are dequantized to this type (here `torch_dtype`) for the matrix multiplications.
- `bnb_4bit_quant_storage`: Specifies the data type in which the packed 4-bit weights are stored. A misspelled variant such as `bnb_4bit_quant_storage_dtype` is not recognized and is ignored with an "Unused kwargs" warning.

Next, the model is loaded using the specified `quantization_config`. Additional parameters include:
- `model_id`: The identifier or name of the pre-trained model to load.
- `attn_implementation`: Specifies the attention implementation, with "sdpa" selecting PyTorch's scaled dot-product attention.
- `torch_dtype`: The Torch data type for the layers that are not quantized, set to `quant_storage_dtype` here.
- `use_cache`: Controls whether the key/value cache is used; it is disabled here whenever gradient checkpointing is enabled.

After loading, `model.eval()` puts the model in evaluation mode, which disables dropout. It does not freeze any parameters, and the trainer switches the model back to training mode when training starts.


In [12]:
# Configuration for model quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    # The accepted keyword is `bnb_4bit_quant_storage`; the `_dtype` spelling is ignored
    bnb_4bit_quant_storage=quant_storage_dtype,
)

# Load the model with the quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    attn_implementation="sdpa",  # use sdpa, alternatively use "flash_attention_2"
    torch_dtype=quant_storage_dtype,
    use_cache=not gradient_checkpointing,
)

model.eval()

`low_cpu_mem_usage` was None, now set to True since model is quantized.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Ll
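
The layer shapes in the printout above allow a rough estimate of how much weight memory 4-bit loading saves. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes that only the `Linear4bit` layers are stored at 4 bits (0.5 bytes per parameter), that the embedding, the output head, and the norm weights stay in bf16 (2 bytes per parameter), and that the output head has the same shape as the embedding; the printout is truncated before the head, so that last point is an assumption. Quantization constants, activations, and optimizer state are ignored.

```python
# Shapes read off the printed LlamaForCausalLM.
hidden, kv_dim, mlp_dim, vocab, n_layers = 4096, 1024, 14336, 128256, 32

# Parameters held in Linear4bit modules, per decoder layer.
attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * hidden * mlp_dim                        # gate_proj, up_proj, down_proj
quantized_params = n_layers * (attn + mlp)

# Assumption: everything else stays in bf16.
norm_params = n_layers * 2 * hidden + hidden      # two RMSNorms per layer plus the final norm
other_params = 2 * vocab * hidden + norm_params   # embedding + output head + norms

total_params = quantized_params + other_params
bf16_gib = total_params * 2 / 2**30
mixed_gib = (quantized_params * 0.5 + other_params * 2) / 2**30

print(f"parameters: {total_params:,}")
print(f"all bf16:   {bf16_gib:.1f} GiB")
print(f"4-bit mix:  {mixed_gib:.1f} GiB")
```

Under these assumptions the weights shrink to roughly a third of their bf16 size, with most of the remainder coming from the unquantized embedding and output head.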

# Enabling Gradient Checkpointing and Configuring LoRA (Low-Rank Adaptation)

The code below enables gradient checkpointing if it was requested in the configuration and then sets up Low-Rank Adaptation (LoRA).

## Gradient Checkpointing
Gradient checkpointing reduces memory use during backpropagation by storing only a subset of the activations and recomputing the rest in the backward pass, trading extra computation for memory. If `gradient_checkpointing` is `True`, the following steps are performed:

- The `model.gradient_checkpointing_enable()` method is called to enable gradient checkpointing for the model.
- The `model.enable_input_require_grads()` method is called so that the embedding outputs require gradients. This matters when the base weights are frozen, as in adapter training, because gradients still have to flow back through the checkpointed blocks to reach the adapter weights.

## Configuration for LoRA
LoRA fine-tunes a large language model efficiently by freezing the original weights and training a pair of small low-rank matrices for each targeted layer. The `peft_config` object is initialized with the following parameters:

- `lora_alpha`: Scaling factor for the adapter update, which is multiplied by `lora_alpha / r`; set to 8 here.
- `lora_dropout`: Dropout probability applied in the LoRA layers to reduce overfitting.
- `r`: Rank of the update, that is, the inner dimension shared by the two low-rank matrices; set to 16 here.
- `bias`: Specifies which bias parameters are trained; "none" keeps all biases frozen.
- `target_modules`: Indicates which modules receive adapters, set to "all-linear" to target all linear layers except the output layer.
- `task_type`: Specifies the type of task, here set to "CAUSAL_LM" for causal language modeling.

With these settings only the adapter weights are trained, which keeps the memory cost of fine-tuning low.


In [13]:
# Enable gradient checkpointing if specified
if gradient_checkpointing:
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()

# Configuration for LoRA (Low-Rank Adaptation)
peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
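
The number of trainable parameters follows directly from `r` and the layer shapes in the model printout: an adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights. The sketch below redoes that arithmetic by hand. It assumes that "all-linear" covers exactly the seven projection layers of each decoder block and not the output head; under that assumption the result matches the `trainable params: 41,943,040` figure that the trainer reports.

```python
r = 16
lora_alpha = 8
n_layers = 32

# (in_features, out_features) of each Linear4bit in one decoder layer, from the printout.
linear_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_params(in_features, out_features, rank):
    """Matrix A is rank x in_features, matrix B is out_features x rank."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, r) for i, o in linear_shapes.values())
trainable = n_layers * per_layer
scaling = lora_alpha / r  # factor applied to the adapter output

print(f"trainable adapter parameters: {trainable:,}")
print(f"adapter scaling: {scaling}")
```

With `lora_alpha=8` and `r=16` the adapter output is scaled by 0.5, so doubling `r` without changing `lora_alpha` would halve the scaling as well.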

# Setup for Training

Early stopping halts training when a monitored metric stops improving, typically to prevent overfitting or to avoid wasting compute once performance plateaus. The code below creates an `early_stop` callback, but it is not active in this run: the `callbacks` argument is commented out, and the trainer receives no evaluation dataset to monitor.

The `early_stop` object is initialized with the following parameters:
- `early_stopping_patience`: Specifies the number of consecutive evaluations with no improvement after which training will be stopped.
- `early_stopping_threshold`: Defines the threshold for measuring improvement, indicating the minimum improvement required to continue training.
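
The interaction of patience and threshold can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the library's implementation: `should_stop` is a hypothetical helper, the monitored metric is assumed to be a loss where lower is better, and the loss values are made up.

```python
def should_stop(eval_losses, patience, threshold):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    stale = 0  # consecutive evaluations without a sufficient improvement
    for i, loss in enumerate(eval_losses):
        if best is None or best - loss > threshold:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None


# Invented evaluation losses, for illustration only.
losses = [2.0, 1.5, 1.4, 1.35, 1.3, 1.28, 1.27, 1.26]
stop_at = should_stop(losses, patience=5, threshold=0.3)
never = should_stop(losses, patience=5, threshold=0.0)
print(stop_at, never)
```

In this toy run the loss keeps falling, yet a threshold of 0.3 treats every small gain as "no improvement" and stops training, whereas a threshold of 0.0 never triggers. A threshold as large as 0.3 is therefore worth reconsidering before the callback is enabled.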

The `trainer` is then initialized with the following configurations:
- `model`: The pre-trained language model to be fine-tuned.
- `tokenizer`: The tokenizer associated with the model.
- `train_dataset`: The training dataset, converted from a Pandas DataFrame.
- `dataset_text_field`: The name of the text field in the dataset.
- `peft_config`: Configuration for Parameter-Efficient Fine-Tuning (PEFT).
- `max_seq_length`: The maximum sequence length for tokenized inputs.
- `packing`: Specifies whether to use packing for memory optimization, set to `False` here.
- `dataset_kwargs`: Additional keyword arguments for dataset processing.
- `args`: A `TrainingArguments` object specifying the training configuration, such as number of epochs, batch size, and learning rate. It also includes:
  - `output_dir`: The directory to save the trained adapter and logs.
  - `report_to`: Indicates where to report training metrics, here set to "tensorboard".


In [14]:
# Setup for early stopping, if needed
early_stop = EarlyStoppingCallback(early_stopping_patience=5, early_stopping_threshold=0.3)

# Initialize the trainer with all configurations
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=Dataset.from_pandas(pd.DataFrame(dataset)),
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=512,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    },
    args=TrainingArguments(
        num_train_epochs=2,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,  # takes precedence over warmup_ratio when greater than 0
        warmup_ratio=0.03,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_torch_fused",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="adapter_model_dir",
        report_to="tensorboard",
    ),
    # callbacks=[early_stop],
)

Map:   0%|          | 0/500 [00:00<?, ? examples/s]
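
The batch settings above determine how many optimizer updates the run performs. The arithmetic below is a rough sketch that assumes a single GPU and ignores leftover micro-batches that do not fill a complete accumulation window, so the trainer's exact step count may differ slightly.

```python
import math

num_samples = 500
per_device_batch = 2
grad_accum = 4
epochs = 2
num_devices = 1  # assumption: one GPU

# Samples that contribute to each optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
updates_per_epoch = batches_per_epoch // grad_accum  # assumption: partial windows dropped
total_updates = updates_per_epoch * epochs

# Warmup length that warmup_ratio=0.03 would imply if warmup_steps were not set.
ratio_warmup = math.ceil(total_updates * 0.03)

print(effective_batch, updates_per_epoch, total_updates, ratio_warmup)
```

Because `warmup_steps=5` is set explicitly, the ratio-based value is only shown for comparison; the two settings are close for a run of this length.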

In [19]:
# Print trainable parameters if this is the main process
if trainer.is_world_process_zero():
    trainer.model.print_trainable_parameters()

# Start training
trainer.train()

# Save the trained model
trainer.save_model()

trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5195983464188562


Step,Training Loss
1,1.5812
2,1.7591
3,1.6696
4,1.5759
5,1.7125
6,1.6477
7,1.7265
8,1.6072
9,1.6659
10,1.6272


In [20]:
# Default value for topic
default_topic = "Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a reasonable way (see inductive bias)."

# Prompt the user to enter a value
user_input = input("Please enter the value for src (press Enter to use the default value): ")

# Use the default value if the input is empty
topic = user_input if user_input else default_topic

# Print the input text that will be sent to the model
print("Input:", topic)


Please enter the value for src (press Enter to use the default value): 
Input: Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a reasonable way (see inductive bias).


In [21]:
inputs = tokenizer(
    [llama_prompt.format(topic, "")],  # input text, with the response slot left empty
    return_tensors="pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

text = tokenizer.batch_decode(outputs)


# Some postprocessing for output, before displaying
response = text[0].split("### Response:")[1].strip()
unique_words = list(dict.fromkeys(response.split(', ')))
cleaned_response = ', '.join(unique_words)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [22]:
print(f"Topic:{topic}")
print(f"Model Response:{cleaned_response}")

Topic:Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a reasonable way (see inductive bias).
Model Response:learning, supervised, labeled, training, examples, input, instances, supervised,
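
The response above still contains `supervised` twice. Splitting on `', '` leaves the trailing comma attached to the last item, so `supervised,` and `supervised` count as different words. The sketch below shows one more forgiving way to clean the list; `clean_keywords` is a hypothetical helper, and the optional `eos_token` argument is an assumption for cases where the decoded text still contains the end-of-sequence marker.

```python
def clean_keywords(response, eos_token=""):
    """Split on commas, trim each item, and drop blanks and repeats (order preserved)."""
    if eos_token:
        response = response.replace(eos_token, "")
    seen = {}
    for part in response.split(","):
        word = part.strip()
        if word:
            seen.setdefault(word, None)
    return ", ".join(seen)


raw = "learning, supervised, labeled, training, examples, input, instances, supervised,"
cleaned = clean_keywords(raw)
print(cleaned)
```

A dictionary is used instead of a set because it keeps the keywords in the order the model produced them.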


# Loading and Merging PEFT Model for Causal Language Modeling


First, the PEFT model is loaded using the `AutoPeftModelForCausalLM.from_pretrained()` method with the following arguments:
- The adapter directory: The directory where the trainer saved the adapter, which is the `output_dir` given to `TrainingArguments` (`"adapter_model_dir"`).
- `torch_dtype`: Specifies the Torch data type for the loaded weights, set to `torch.float16`.
- `low_cpu_mem_usage`: Reduces peak CPU memory while loading, set to `True`.

Next, the `merge_and_unload()` method folds the LoRA weights into the base model's weights and removes the adapter wrappers, returning a plain model.

Finally, the merged model is saved using the `save_pretrained()` method with the following arguments:
- The target directory: The directory where the merged model is saved, `"full_model_dir"` here.
- `safe_serialization`: Saves the weights in the safetensors format.
- `max_shard_size`: Specifies the maximum size of each file shard during saving, set to "2GB" to manage file size.

The result is a standalone model that contains the fine-tuned weights and can be loaded without PEFT.


In [None]:
# Load PEFT model on CPU for merging, if needed.
output_dir = "adapter_model_dir"  # directory the trainer saved the adapter to
model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("full_model_dir", safe_serialization=True, max_shard_size="2GB")
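
What merging does to a single layer can be shown with a toy example in plain Python. The sketch assumes the standard LoRA formulation, in which the merged weight is `W + (alpha / r) * B @ A`; the 2x2 matrices and the helper names are invented for illustration and are unrelated to the real model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A as a new matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy base weight with a rank-1 adapter; all numbers are made up.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # rank x in_features
B = [[0.5], [1.0]]  # out_features x rank
alpha, rank = 2, 1

merged = merge_lora(W, A, B, alpha, rank)

# The merged layer gives the same output as base layer plus scaled adapter.
x = [[3.0], [4.0]]  # column vector input
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)]
merged_out = matmul(merged, x)
print(merged, merged_out)
```

Since both paths produce the same output, the merged model needs no adapter code at inference time, which is why it can be saved and loaded as an ordinary checkpoint.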