## PURPOSE OF THIS NOTEBOOK: 

This notebook contains the training and testing of the third version of the Job Title to ONET Family Matching model. This time, we are focusing on maintaining model performance while significantly reducing the size of the model. I will be experimenting with different models until I land one that is exactly what I need.

In [1]:
from transformers import (
        AutoModelForCausalLM, 
        AutoTokenizer, 
        BitsAndBytesConfig, 
        TrainingArguments,
        HfArgumentParser,
        Trainer,
        TrainingArguments,
        DataCollatorForLanguageModeling,
        EarlyStoppingCallback,
        pipeline,
        logging,
        set_seed)
from tqdm.autonotebook import tqdm 
from functools import partial
from torch import cuda
from datasets import load_dataset
from huggingface_hub import notebook_login
from CommonFunctions import read_and_prepare_data, CustomDataset, Loss
from torch.utils.data import DataLoader
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
from trl import SFTTrainer
from random import randrange

import pandas as pd 
import os
import numpy as np
import torch
import bitsandbytes as bnb

model_name = "meta-llama/Llama-2-7b-hf"

In [2]:
device = 'cuda' if cuda.is_available() else 'cpu'

In [3]:
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [4]:
def create_model_and_tokenizer():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
        bnb_4_bit_compute_dtype=torch.bfloat16,
    )

    n_gpus = torch.cuda.device_count()
    max_memory = f'{6900}MB'

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        use_safetensors=True,
        quantization_config=bnb_config,
        trust_remote_code=True,
        device_map="auto",
        max_memory = {i: max_memory for i in range(n_gpus)},
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side="right"

    return model, tokenizer

In [5]:
model, tokenizer = create_model_and_tokenizer()
model.eval()

The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )


## Data structure for Llama model:

From the reading I have done about the llama model, this model compared to BERT expects a different format of input.
The formatting goes as follows:   
   
**Instruction**: Prompt for the model, think of telling it what to do. In this case we're asking the model to categorize a job title to one of x ONET families/codes  
  
**Input**: the job title  
  
**Output**: The resulting expected ONET code.  
  
***Now from here I would need to create a new version of the test/train dataframes with a new DataManipulation.ipynb file.***

## Data Prep

In [6]:
# Setting the directory of the output 
OUTPUT_DIR = "Data"

In [7]:
# import test and training data
train = load_dataset("csv", data_files="../Data/Training_Data.csv")
test = load_dataset("csv", data_files="../Data/TestingData.csv")


In [8]:
print(train['train'][randrange(len(train['train']))])

{'instruction': 'Categorize the job title into one of the 22 job families:\n\n11\n13\n15\n17\n19\n21\n23\n25\n27\n29\n31\n33\n35\n37\n39\n41\n43\n45\n47\n49\n51\n53\n\n', 'input': 'Security Consultant', 'output': 13}


In [9]:
# Create the prompt formats 
def create_prompt(sample):
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"

    # create static strings for prompt
    intro = f'{INTRO_BLURB}'
    instruction = f"{INSTRUCTION_KEY}\n{sample['instruction']}"
    input = f"{INPUT_KEY}\n{sample['input']}" if sample['input'] else None
    response = f"{RESPONSE_KEY}\n{sample['output']}"
    end = f'{END_KEY}'

    # turn strings into a list of strings. 
    parts = [part for part in [intro, instruction, input, response, end] if part]

    # join the prompt template into one string 
    formatted_prompt = "\n\n".join(parts)

    # store formatted prompt into a key "text"
    sample['text'] = formatted_prompt
    
    return sample

In [10]:
print(create_prompt(train['train'][randrange(len(train['train']))]))

{'instruction': 'Categorize the job title into one of the 22 job families:\n\n11\n13\n15\n17\n19\n21\n23\n25\n27\n29\n31\n33\n35\n37\n39\n41\n43\n45\n47\n49\n51\n53\n\n', 'input': 'Spinneret Cleaner', 'output': 53, 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCategorize the job title into one of the 22 job families:\n\n11\n13\n15\n17\n19\n21\n23\n25\n27\n29\n31\n33\n35\n37\n39\n41\n43\n45\n47\n49\n51\n53\n\n\n\nInput:\nSpinneret Cleaner\n\n### Response:\n53\n\n### End'}


In [11]:
def get_max_length(model):
    # Pull model config
    conf = model.config

    max_length = None

    for length_setting in ['n_positions', 'max_position_embeddings', 'seq_length']:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"found max length: {max_length}")
            break
        if not max_length:
            max_length = 1024
            print(f"using default max length: {max_length}")
        return max_length

In [12]:
def preprocess_batch(batch, tokenizer, max_length):
    print("preprocessing tokenizer...")
    return tokenizer(
        batch['text'],
        max_length = max_length,
        truncation = True,
    )

In [13]:
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset: str):
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt)

    # Apply preprocessing to each batch of the dataset & and remove "instruction", "input", "output", and "text" fields
    _preprocessing_function = partial(preprocess_batch, max_length = max_length, tokenizer = tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched = True,
        remove_columns = ["instruction", "input", "output", "text"],
    )

    # Filter out samples that have "input_ids" exceeding "max_length"
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle dataset
    dataset = dataset.shuffle(seed = seed)

    return dataset

In [14]:
seed = 42
max_length = get_max_length(model)
preprocessed_train = preprocess_dataset(tokenizer, max_length, seed, dataset=train)
preprocessed_test = preprocess_dataset(tokenizer, max_length, seed, dataset=test)
OUTPUT_DIR = '../Data/Models'

using default max length: 1024
Preprocessing dataset...
Preprocessing dataset...


Map:   0%|          | 0/13483 [00:00<?, ? examples/s]

preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...
preprocessing tokenizer...


Filter:   0%|          | 0/13483 [00:00<?, ? examples/s]

## Start training of model

In [15]:
def create_peft_config(r, lora_alpha, target_modules, lora_dropout, bias, task_type):

    config = LoraConfig(
        r = r,
        lora_alpha = lora_alpha,
        target_modules = target_modules,
        lora_dropout = lora_dropout,
        bias = bias,
        task_type = task_type,
    )

    return config

In [16]:
def find_all_linear_names(model):

    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')
    print(f"LoRA module names: {list(lora_module_names)}")
    return list(lora_module_names)

In [17]:
def print_trainable_parameters(model, use_4bit = False):
    """
    Prints the number of trainable parameters in the model.

    :param model: PEFT model
    """

    trainable_params = 0
    all_param = 0

    for _, param in model.named_parameters():
        num_params = param.numel()
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel
        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params

    if use_4bit:
        trainable_params /= 2

    print(
        f"All Parameters: {all_param:,d} || Trainable Parameters: {trainable_params:,d} || Trainable Parameters %: {100 * trainable_params / all_param}"
    )

In [24]:
trainingarguments = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps=0.2,
    warmup_ratio=0.05,
    save_strategy='epoch',
    group_by_length=True,
    output_dir=OUTPUT_DIR,
    report_to='tensorboard',
    save_safetensors=True,
    lr_scheduler_type='cosine',
    seed=42
)

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset
)

## Train model

In [20]:
preprocessed_train = preprocessed_train['train']

In [21]:
fine_tune(model, tokenizer, preprocessed_train, lora_r, lora_alpha, lora_dropout, bias, task_type, per_device_train_batch_size, gradient_accumulation_steps, warmup_steps, max_steps, learning_rate, fp16, logging_steps, output_dir, optim, epochs)

LoRA module names: ['down_proj', 'up_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'v_proj']
All Parameters: 3,540,389,888 || Trainable Parameters: 39,976,960 || Trainable Parameters %: 1.1291682911958425
Training...


  0%|          | 0/20 [00:00<?, ?it/s]

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 1.7048, 'learning_rate': 0.0001, 'epoch': 0.0}
{'loss': 1.757, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 1.4412, 'learning_rate': 0.00018888888888888888, 'epoch': 0.0}
{'loss': 1.012, 'learning_rate': 0.00017777777777777779, 'epoch': 0.0}
{'loss': 0.6915, 'learning_rate': 0.0001666666666666667, 'epoch': 0.0}
{'loss': 0.468, 'learning_rate': 0.00015555555555555556, 'epoch': 0.0}
{'loss': 0.3466, 'learning_rate': 0.00014444444444444444, 'epoch': 0.0}
{'loss': 0.3097, 'learning_rate': 0.00013333333333333334, 'epoch': 0.0}
{'loss': 0.2253, 'learning_rate': 0.00012222222222222224, 'epoch': 0.0}
{'loss': 0.2133, 'learning_rate': 0.00011111111111111112, 'epoch': 0.0}
{'loss': 0.2105, 'learning_rate': 0.0001, 'epoch': 0.0}
{'loss': 0.1834, 'learning_rate': 8.888888888888889e-05, 'epoch': 0.0}
{'loss': 0.1857, 'learning_rate': 7.777777777777778e-05, 'epoch': 0.0}
{'loss': 0.1975, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.0}
{'loss': 0.1774, 'learning_rate': 5.55555555555

In [23]:
# save the model and evaluate in the future.
# Load fine-tuned weights
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map = "auto", torch_dtype = torch.bfloat16)
# Merge the LoRA layers with the base model
model = model.merge_and_unload()

# Save fine-tuned model at a new location
output_merged_dir = "../Data/Models/"
os.makedirs(output_merged_dir, exist_ok = True)
model.save_pretrained(output_merged_dir, safe_serialization = True)


The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

ValueError: Need either a `state_dict` or a `save_folder` containing offloaded weights.

In [None]:

%load_ext tensorboard
%tensorboard --logdir Data/runs