Let's think about quantization from a very high level - and use some oversimplifications to understand what's really happening under the hood.

In essence, we can think of quantization as placing a pin on the number line (our quantization constant) and then expressing a low-precision, zero-centered range around that pin for each block of 64 weights. Because our weights are approximately normally distributed and we scale each block into the range [-1, 1], we can use the NF4 datatype to express our high-precision weights in a low-precision format in a roughly optimal way. We still need some higher-precision numbers, but this process lets us represent many numbers in low precision for the cost of one number in high precision.
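To make the "pin plus low-precision range" picture concrete, here is a minimal pure-Python sketch of blockwise absmax quantization. It is an illustration under stated assumptions, not the bitsandbytes implementation: the 16 levels are evenly spaced stand-ins for the real NF4 levels, and the names `quantize_block` and `dequantize_block` are hypothetical.

```python
import random

def quantize_block(block, levels):
    """Scale a block into [-1, 1] by its absolute maximum and map each
    value to the index of the nearest level."""
    absmax = max(abs(x) for x in block) or 1.0  # the high-precision "pin"
    codes = [
        min(range(len(levels)), key=lambda i: abs(x / absmax - levels[i]))
        for x in block
    ]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Recover approximate values from 4-bit codes and the block constant."""
    return [levels[c] * absmax for c in codes]

# Assumption: 16 evenly spaced levels instead of the real NF4 quantiles.
levels = [-1 + 2 * i / 15 for i in range(16)]

rng = random.Random(0)
block = [rng.gauss(0, 0.02) for _ in range(64)]  # one block of 64 weights

codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"absmax={absmax:.4f}, max round-trip error={max_error:.5f}")
```

Each block stores 64 four-bit codes plus a single higher-precision constant, which is the trade described above.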

However, we can take it one step further - and we can actually quantize the range of quantization constants we wind up with as well! This winds up saving us ~0.373 bits per parameter.
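The ~0.373 figure can be reproduced with a little arithmetic. The sketch below assumes the setup described in the QLoRA paper: 32-bit constants for blocks of 64 weights, which are then quantized to 8 bits in groups of 256 with one 32-bit constant per group. These block sizes are assumptions taken from that description, not values read from this notebook.

```python
block_size = 64          # weights per first-level quantization constant
second_block_size = 256  # first-level constants per second-level constant

# Without double quantization: one 32-bit constant per 64 weights.
plain_overhead = 32 / block_size

# With double quantization: an 8-bit constant per 64 weights, plus one
# 32-bit constant shared by 256 of those 8-bit constants.
double_overhead = 8 / block_size + 32 / (block_size * second_block_size)

saved = plain_overhead - double_overhead
print(f"plain:  {plain_overhead:.3f} bits per parameter")
print(f"double: {double_overhead:.3f} bits per parameter")
print(f"saved:  {saved:.3f} bits per parameter")
```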

In [1]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [2]:
%pip install torch
%pip install git+https://github.com/huggingface/accelerate.git
%pip install bitsandbytes
%pip install datasets==2.13.1
%pip install git+https://github.com/huggingface/transformers.git
%pip install git+https://github.com/huggingface/peft.git
%pip install git+https://github.com/lvwerra/trl.git
%pip install scipy

Note: you may need to restart the kernel to use updated packages.


Set up Python environment

***Fine-tune LLaMA 2 models on datasets***



In [2]:
import argparse
import os
from functools import partial

import bitsandbytes as bnb
import torch
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    set_seed,
)

In [3]:
import torch
torch.cuda.is_available()

True

In [4]:
import pandas as pd
import time

In [5]:
file_path = "Tikvah_Filtered(15,000-30,000) - Tikvah_Filtered.csv"

In [6]:
df = pd.read_csv(file_path)

In [7]:
df.tail()

Unnamed: 0,id,text,date,IsAD,TAG,hashtags,emojis,symbols,links
5352,29973,የኢትዮጵያ ተስፋዎች ያስተሳሰራቸው፤ የኢትዮጵያ ክፍሎች ፤ ሀገራችንን ከገ...,2019-04-15 22:37:52,,,"['#ሰውነት', '#ከተለያዩ', '#የተሰባሰቡ፤', '#ኑ', '#ከጥላቻ',...",👆,-_,[]
5353,29977,"ፎቶ: ጥዋት የTIKVAH ETH ቤተሰብ አባላት(ከWSU, AMU, WKU የ...",2019-04-15 23:10:39,,,['#እሁድ'],,"-""""",[]
5354,29982,በወጣት ነብያት አቅራቢነት እውቀት እና ጥላቻ ንግግሮች በሚል እንዲሁም ወ...,2019-04-15 23:26:00,,,['#StopHateSpeech'],,"""""",[]
5355,29988,"ልዩ የኪነ ጥበብ ዝግጅትም ተካሂዶ ነበር። ከAMUU, WSU(ትሩ ላይፍ) ...",2019-04-15 23:36:15,,,"['#StopHateSpeech', '#wku']",,,[]
5356,29996,እሁድ የማጠቀለያው ዝግጅትየክብር እንግዶች በተገኙበት @tsegabwlde ...,2019-04-15 23:39:38,,,"['#StopHateSpeech', '#WKU']",🔝,,[]


In [8]:
dataset = df[['text','hashtags']]
dataset.tail()

Unnamed: 0,text,hashtags
5352,የኢትዮጵያ ተስፋዎች ያስተሳሰራቸው፤ የኢትዮጵያ ክፍሎች ፤ ሀገራችንን ከገ...,"['#ሰውነት', '#ከተለያዩ', '#የተሰባሰቡ፤', '#ኑ', '#ከጥላቻ',..."
5353,"ፎቶ: ጥዋት የTIKVAH ETH ቤተሰብ አባላት(ከWSU, AMU, WKU የ...",['#እሁድ']
5354,በወጣት ነብያት አቅራቢነት እውቀት እና ጥላቻ ንግግሮች በሚል እንዲሁም ወ...,['#StopHateSpeech']
5355,"ልዩ የኪነ ጥበብ ዝግጅትም ተካሂዶ ነበር። ከAMUU, WSU(ትሩ ላይፍ) ...","['#StopHateSpeech', '#wku']"
5356,እሁድ የማጠቀለያው ዝግጅትየክብር እንግዶች በተገኙበት @tsegabwlde ...,"['#StopHateSpeech', '#WKU']"


In [9]:
dataset = dataset.dropna(subset=['hashtags'])
#dataset = dataset[dataset['hashtags'].astype(bool)]  # Keep only non-empty lists
dataset = dataset[dataset['hashtags'].apply(lambda x: x != '[]')]

# Reset the index after dropping rows
dataset = dataset.reset_index(drop=True)

In [10]:

dataset.head()

Unnamed: 0,text,hashtags
0,ጨና⬆️ በከፋ ዞን ወረዳ 18/01/2011 ኣም ባለቤትነቱ የማን እንደ ሆ...,"['#update', '#በጨና']"
1,ኢንጂነር ታከለ ኡማ⬆️ የሀገር ሽማግሌዎች ተግባር አለምን ያስደነቀ፣ ለሁ...,"['#update', '#የጋሞ', '#የማይዘነጋና', '#ታከለ_ኡማ']"
2,የኢሬቻ የሰላም ሽልማት ሀገር በቀል ባህሎችን በማጎልበት ለሀገራዊ ፋይዳ ...,['#የኢሬቻ']
3,የምስራቅ ዞን ፀጥታና አስተዳደር ሀላፊ አቶ ለBBC Afaan Oromoo ...,"['#update', '#ወለጋ', '#ታከለ_ቶሎሶ']"
4,ጎንደር⬆️ የጎንደር ከተማ እምነት ተከታዮች በኣል በከተማዋ በድምቀት እን...,"['#update', '#እስልምና', '#የመስቀል', '#አንድነታቸው', '#..."


In [11]:
dataset.shape

(4154, 2)

In [12]:
import re

def update_hashtags(dataset):
    """Preprocess data: when '#' is followed by spaces and then a word,
    join the '#' and the word into a single hashtag."""
    for index, row in dataset.iterrows():
        text = row['text']

        # Find every '#' followed by optional whitespace and a word
        matches = re.findall(r'#\s*(\w+)', text)

        for match in matches:
            hashtag = '#' + match
            # Overwrite the 'hashtags' column; with several matches the last one wins
            dataset.at[index, 'hashtags'] = hashtag
            # Rewrite the running text so that earlier replacements are kept
            text = re.sub(r'#\s*' + re.escape(match), hashtag, text)

        # Store the normalized text
        dataset.at[index, 'text'] = text


# Call the function to update hashtags
update_hashtags(dataset)

# Display the updated DataFrame
dataset.head()


Unnamed: 0,text,hashtags
0,ጨና⬆️ በከፋ ዞን ወረዳ 18/01/2011 ኣም ባለቤትነቱ የማን እንደ ሆ...,"['#update', '#በጨና']"
1,ኢንጂነር ታከለ ኡማ⬆️ የሀገር ሽማግሌዎች ተግባር አለምን ያስደነቀ፣ ለሁ...,"['#update', '#የጋሞ', '#የማይዘነጋና', '#ታከለ_ኡማ']"
2,የኢሬቻ የሰላም ሽልማት ሀገር በቀል ባህሎችን በማጎልበት ለሀገራዊ ፋይዳ ...,['#የኢሬቻ']
3,የምስራቅ ዞን ፀጥታና አስተዳደር ሀላፊ አቶ ለBBC Afaan Oromoo ...,"['#update', '#ወለጋ', '#ታከለ_ቶሎሶ']"
4,ጎንደር⬆️ የጎንደር ከተማ እምነት ተከታዮች በኣል በከተማዋ በድምቀት እን...,"['#update', '#እስልምና', '#የመስቀል', '#አንድነታቸው', '#..."
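The text normalization step can be tried on a plain string without pandas. This is a small sketch using only the standard library; the helper name `normalize_hashtags` and the sample sentence are made up for illustration.

```python
import re

def normalize_hashtags(text):
    """Collapse '#   word' into '#word'. In Python 3, \\w is Unicode-aware
    for str patterns, so Amharic words are matched as well."""
    return re.sub(r'#\s+(\w+)', r'#\1', text)

sample = "Breaking # update from #  WKU and #StopHateSpeech"
normalized = normalize_hashtags(sample)
found = re.findall(r'#\w+', normalized)

print(normalized)
print(found)
```

Doing the substitution in a single `re.sub` call avoids looping over matches, which is one way the notebook function could be simplified if only the text column needs to change.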


In [13]:
df2 = dataset.copy()

In [14]:
from datasets import Dataset

# Create a dictionary containing your Amharic text data
data_dict = {"text": dataset['text'].tolist(), "hashtags": dataset['hashtags'].tolist()}

# Create a Dataset object
dataset = Dataset.from_dict(data_dict)



In [None]:
# df2['formatted_text'] = 'text: ' + df2['text'] +',' + 'hashtags: #' + df2['hashtags'].astype(str)

# # Create a dictionary containing your Amharic text data
# data_dict = {"formatted_text": df2['formatted_text'].tolist()}

# # Create a Dataset object
# fullDataset = Dataset.from_dict(data_dict)



In [None]:
# # Print the first few examples
# print(fullDataset['formatted_text'][:5])

In [None]:
# print(len(fullDataset))

In [None]:
# # Save the dataset to a file (e.g., in Arrow format)
# fullDataset.to_csv("sample_data/fullDataset.csv")


In [15]:
dataset.shape

(4154, 2)

In [16]:
train_dataset = dataset.select(range(2000))
test_dataset = dataset.select(range(2000, len(dataset)))
dataset = train_dataset
dataset_subset = test_dataset

In [17]:
print(dataset['text'][0])

ጨና⬆️ በከፋ ዞን ወረዳ 18/01/2011 ኣም ባለቤትነቱ የማን እንደ ሆነ እና ለምን አላማ እንደ ሆነ ያልታወቀ ግምቱ ከ 28,000,000 በር በላይ 5 ኩምታል የኢትዮጵያ ብር በኮድ 3 አ አ 47875 በሆነች ፕኪአፕ ቶዮታ መኪና ወደ ቤንች ማጅ ዞን አቅጣጫ እየሄደች ሳለ በዋቻ ከተማ ወጣቶች ከፍተኛ እና በስልታዊ ርብረብ በቁጥጥር ስር ልዉል ችለዋል ቀሪዉን ጉዳይ የምመለከተው አካል በማጣራት ላይ ይገኛል። ገንዘቡም ለጊዜው በከተማው በምገኝ CBE ባንክ በፌደራል ፖሊስ እና በወረዳው ፖሊስ እየተጠበቀ ይገኛል። ©ፎቶ Amanuel(TIKVAH ETH) @tsegabwolde @tikvahethiopia


In [18]:
print(dataset_subset['text'][0])

አማራ ክልል‼️ የአማራ ክልል ወጣቶች የአካባቢያቸውን ሰላምና ደህንነት ነቅተው በመጠበቅ በሃገሪቱ ለተጀመረው የለውጥ እንቅስቃሴ ሊሰሩ እንደሚገባ የክልሉ የሰላምና ደህንነት ቢሮ ሃላፊ አስገነዘቡ። የቢሮው ሃላፊ ብርጋዴል ጀነራል ትናንት ከወልድያ ከተማ ነዋሪዎች ጋር ባደረጉት ውይይት እንደገለፁት በአገሪቱ ለተጀመረው የለውጥ እንቅስቃሴ ስኬታማነት የወጣቶች ተሳትፎ ወሳኝ ነው፡፡ ባለፉት ኣመታት አማራው በክልሉ ውስጥም ሆነ በሌሎች ክልሎች ማንነቱን መሰረት ያደረጉ የሰብአዊ መብት ጥሰቶች ሲፈፀሙበት መቆየታቸውን ገልፀዋል፡፡ አሁን ላይ የዜጎችን ማህበራዊና ኢኮኖሚያዊ ደህንነት ሊያረጋግጥ የሚችል የለ

In [20]:
# Custom Tokenizer
class CustomTokenizer:
  def __init__(self):
        self.pad_token = "[PAD]"  # You can choose any string for the pad_token

  def tokenize(self, text):
    # Custom tokenization logic here
    # For simplicity, let's split the text into tokens based on spaces
    tokens = text.split()
    return tokens

# Instantiate the custom tokenizer
custom_tokenizer = CustomTokenizer()

Function to download the LLaMA 2 model and its tokenizer. It requires a bitsandbytes configuration.

In [19]:
def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = f'{40960}MB'

    # Load the pre-trained causal language model with Hugging Face Transformers
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # dispatch the model efficiently across the available resources
        max_memory={i: max_memory for i in range(n_gpus)},
    )
    # token=True reuses the token stored by the Hugging Face login
    tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)

    # The LLaMA tokenizer has no pad token, so reuse the EOS token for padding
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer


Pre-processing dataset

Instruction fine-tuning is a common technique used to fine-tune a base LLM for a specific downstream use-case.



In [20]:
def create_prompt_formats(sample):
    """
    Format the fields of the sample ('text', 'hashtags'),
    then concatenate them using two newline characters.
    :param sample: Sample dictionary
    """

    INTRO_BLURB = "Identify Hashtags from the given text."
    INSTRUCTION_KEY = "### Text:"
    RESPONSE_KEY = "Hashtags:"
    END_KEY = "### End"

    blurb = f"{INTRO_BLURB}"
    text = f"{INSTRUCTION_KEY}\n{sample['text']}"
    response = f"{RESPONSE_KEY}\n{sample['hashtags']}"
    end = f"{END_KEY}"

    parts = [part for part in [blurb, text, response, end] if part]

    formatted_prompt = "\n\n".join(parts)

    sample["text"] = formatted_prompt

    return sample
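To see what a training prompt looks like, the same layout can be reproduced on a toy sample. This is a standalone sketch; the helper `build_training_prompt` and the sample values are hypothetical and only mirror the keys used in the function above.

```python
def build_training_prompt(text, hashtags):
    """Join the instruction, the input text, the expected hashtags and the
    end marker with blank lines, as in create_prompt_formats."""
    parts = [
        "Identify Hashtags from the given text.",
        f"### Text:\n{text}",
        f"Hashtags:\n{hashtags}",
        "### End",
    ]
    return "\n\n".join(parts)

example = build_training_prompt("Join us at #WKU today", "['#WKU']")
print(example)
```

The model is trained on the whole string, so the "Hashtags:" key is what signals where the answer starts and "### End" where it stops.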

Use the model tokenizer to turn these prompts into token sequences.

* The goal is to create input sequences of uniform length that do not exceed the model's maximum token limit. Such sequences are suitable for fine-tuning the language model because they maximize efficiency and minimize computational overhead.

In [21]:
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


def preprocess_dataset(tokenizer, max_length: int, seed, dataset):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    """

    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)

    # Tokenize each batch of the dataset and remove the raw 'text' and 'hashtags' columns
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["text", "hashtags"],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset
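Two details of the functions above can be checked without loading a model: the order in which the length settings are tried, and the fact that truncating to max_length and then keeping only sequences strictly shorter than max_length drops every sample that had to be truncated. The sketch below uses stand-in config objects and plain lists of token ids; `resolve_max_length` and `truncate_then_filter` are hypothetical names.

```python
from types import SimpleNamespace

def resolve_max_length(config, default=1024):
    """Return the first length setting found on the config, else a default."""
    for name in ("n_positions", "max_position_embeddings", "seq_length"):
        value = getattr(config, name, None)
        if value:
            return value
    return default

def truncate_then_filter(token_lists, max_length):
    """Mimic tokenization with truncation followed by the length filter."""
    truncated = [tokens[:max_length] for tokens in token_lists]
    return [tokens for tokens in truncated if len(tokens) < max_length]

llama_like = SimpleNamespace(max_position_embeddings=4096)  # stand-in config
unknown = SimpleNamespace()                                 # no length setting

print(resolve_max_length(llama_like))
print(resolve_max_length(unknown))

kept = truncate_then_filter([[1, 2], [1, 2, 3, 4, 5]], max_length=4)
print(kept)
```

If over-long samples should be kept in truncated form, the filter would need `<=` instead of `<`.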

**Create a bitsandbytes configuration**

> This allows us to load our LLM in 4 bits. Compared with 16-bit weights, this divides the memory used by the weights by roughly 4 and lets us load the model on smaller devices. We choose the bfloat16 compute data type and nested quantization for memory-saving purposes.



In [22]:
def create_bnb_config():
    """Create and return a configuration object for 4-bit quantization
    with the bitsandbytes (bnb) library."""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,  # load the weights in 4-bit precision
        bnb_4bit_use_double_quant=True,  # nested quantization of the quantization constants
        bnb_4bit_quant_type="nf4",  # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,  # data type used for computation
    )

    return bnb_config
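A rough estimate shows where the "divide by 4" figure comes from. The sketch assumes a nominal 7 billion parameters and counts weights only; in practice some layers (for example embeddings) stay in higher precision, so real usage is somewhat higher. The 0.127 bits of overhead per parameter is the double-quantization figure from the QLoRA paper.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

n_params = 7_000_000_000  # assumption: nominal size of a "7B" model

fp16_gib = weight_memory_gib(n_params, 16)
nf4_gib = weight_memory_gib(n_params, 4 + 0.127)  # 4-bit weights plus constants
ratio = fp16_gib / nf4_gib

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f"NF4 weights:    {nf4_gib:.2f} GiB")
print(f"reduction:      {ratio:.2f}x")
```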

**LoRA configuration**

> To leverage the LoRA method, we need to wrap the model as a PeftModel.


In [23]:
def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for the model
    :param modules: Names of the modules to apply Lora to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config
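The roles of `r` and `lora_alpha` are easier to see on a tiny example. In LoRA the frozen weight W is left untouched and a low-rank product B·A, scaled by lora_alpha / r, is added to it. The sketch below uses plain Python lists and made-up 2x3 matrices; it illustrates the idea and is not how PEFT implements it.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, r, lora_alpha):
    """Return W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

r, lora_alpha = 2, 8                      # same ratio (4.0) as r=16, lora_alpha=64
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # frozen weight, d_out=2, d_in=3
a = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.0]]    # r x d_in
b_init = [[0.0, 0.0], [0.0, 0.0]]         # B starts at zero, so training starts from W
b_trained = [[1.0, 0.0], [0.0, 1.0]]      # made-up values after training

start = lora_effective_weight(w, a, b_init, r, lora_alpha)
after = lora_effective_weight(w, a, b_trained, r, lora_alpha)
print(start)
print(after)
```

Only A and B receive gradients, which is why the number of trainable parameters stays small.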

> The previous function needs the names of the target modules whose
matrices will be updated. The following function collects them for our model:

In [24]:


def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

> Once everything is set up and the base model is prepared, we can
use the print_trainable_parameters() helper function to see how many trainable parameters are in the model.

In [25]:
def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params //= 2  # integer division keeps the ':,d' format below valid
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )
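The trainable parameter count that this helper reports during training (39,976,960) can be derived by hand. A LoRA adapter on a linear layer with d_in inputs and d_out outputs adds r * (d_in + d_out) parameters. The sketch assumes the published Llama-2-7b dimensions (hidden size 4096, MLP size 11008, 32 layers) and that all seven linear projections in each layer are targeted, which is what find_all_linear_names would return for this model.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

hidden, intermediate, n_layers, r = 4096, 11008, 32, 16

per_layer = (
    4 * lora_params(hidden, hidden, r)            # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(hidden, intermediate, r)    # gate_proj, up_proj
    + lora_params(intermediate, hidden, r)        # down_proj
)
total = n_layers * per_layer
print(f"{total:,d}")
```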


**Train**

Now we can pre-process our dataset and load our model using the configurations defined above.


In [28]:

from huggingface_hub import login

login()


In [29]:
# Load model from HF with user's token and with bitsandbytes config

model_name = "meta-llama/Llama-2-7b-hf"

bnb_config = create_bnb_config()

model, tokenizer2 = load_model(model_name, bnb_config)

OSError: You are trying to access a gated repo.
Make sure to request access at https://huggingface.co/meta-llama/Llama-2-7b-hf and pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`.

In [None]:
#tokenizer = custom_tokenizer

In [None]:

import random

seed = 42
random.seed(seed)

In [None]:
## Preprocess dataset

max_length = get_max_length(model)

dataset = preprocess_dataset(tokenizer2, max_length, seed, dataset)

Found max lenth: 4096
Preprocessing dataset...


Map:   0%|          | 0/18404 [00:00<?, ? examples/s]

Map:   0%|          | 0/18404 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18404 [00:00<?, ? examples/s]

**Fine-tuning process using a single GPU**

In [None]:
def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get lora module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)

    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)

    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=50,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )

    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs


    # Verifying the datatypes before training

    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)

    do_train = True

    # Launch training
    print("Training...")

    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)

    ###

    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)

    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()


output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer2, dataset, output_dir)


all params: 3,540,389,888 || trainable params: 39,976,960 || trainable%: 1.1291682911958425
torch.float32 302387200 0.08541070604255438
torch.uint8 3238002688 0.9145892939574456
Training...


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,1.0765
2,1.1646
3,1.1747
4,0.9646
5,1.0061
6,0.997
7,0.9498
8,0.9518
9,0.9249
10,0.8949


***** train metrics *****
  epoch                    =       0.01
  total_flos               =  7728012GF
  train_loss               =     0.8976
  train_runtime            = 0:03:17.67
  train_samples_per_second =      1.012
  train_steps_per_second   =      0.253
{'train_runtime': 197.6706, 'train_samples_per_second': 1.012, 'train_steps_per_second': 0.253, 'total_flos': 8297890158944256.0, 'train_loss': 0.8975745522975922, 'epoch': 0.01}
Saving last checkpoint of the model...


* If we prefer to train for a number of epochs (full passes of the
 training dataset through the model) instead of a number of training
 steps (forward and backward passes through the model with one batch
 of data), we can replace the max_steps argument with num_train_epochs.

* The trainer.model.save_pretrained(output_dir) call saves the LoRA adapter weights and their configuration so they can be loaded later for inference. The tokenizer is saved separately in a later cell.
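The relation between steps and epochs can be worked out from the training arguments. The sketch below uses the values from the train function (batch size 1, 4 accumulation steps, 50 steps), assumes a single GPU, and takes the 18,404 examples shown in the preprocessing progress output as the dataset size.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
n_devices = 1            # assumption: single GPU
max_steps = 50
num_examples = 18404     # from the Map/Filter progress output

samples_per_step = per_device_train_batch_size * gradient_accumulation_steps * n_devices
samples_seen = samples_per_step * max_steps
epochs = samples_seen / num_examples
steps_per_epoch = math.ceil(num_examples / samples_per_step)

print(f"samples per optimizer step: {samples_per_step}")
print(f"samples seen in {max_steps} steps: {samples_seen}")
print(f"fraction of an epoch: {epochs:.4f}")
print(f"steps needed for one epoch: {steps_per_epoch}")
```

The result, about 0.01 of an epoch, agrees with the epoch value in the train metrics above.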

**Merge weights**

> Once we have our fine-tuned weights, we can build our fine-tuned
model and save it to a new directory, with its associated tokenizer.
By performing these steps, we can have a memory-efficient fine-tuned
model and tokenizer ready for inference!

In [None]:
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)


In [None]:
# save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)

('results/llama2/final_merged_checkpoint/tokenizer_config.json',
 'results/llama2/final_merged_checkpoint/special_tokens_map.json',
 'results/llama2/final_merged_checkpoint/tokenizer.json')

In [None]:
#model.save_pretrained(output_merged_dir, safe_serialization=True)


In [None]:
def create_prompt_formats_for_test(sample):
    """
    Format the 'text' field of the sample as an inference prompt,
    leaving out the expected hashtags.
    :param sample: Sample dictionary
    """

    INTRO_BLURB = "Identify Hashtags from the given text."
    INSTRUCTION_KEY = "### Text:"
    # RESPONSE_KEY = "Hashtags:"
    END_KEY = "### End"

    blurb = f"{INTRO_BLURB}"
    text = f"{INSTRUCTION_KEY}\n{sample['text']}"
    # response = f"{RESPONSE_KEY}\n{sample['hashtags']}"
    # end = f"{END_KEY}"

    parts = [part for part in [blurb, text] if part]

    formatted_prompt = "\n\n".join(parts)

    sample["text"] = formatted_prompt

    return sample

In [None]:
sample = dataset_subset[10]

prompt = create_prompt_formats_for_test(sample)

In [None]:
print(prompt)

{'text': 'Identify Hashtags from the given text.\n\n### Text:\n#DrAbiyAhmed ጠ/ሚር ዶክተር አቢይ አህመድ በሰርቢያ ቤልግሬድ በተካሄደው 18ኛው የአለም የቤት ውስጥ ውድድር ኢትዮጵያ በአንደኝነት ደረጃ ስላጠናቀቀች የተሰማቸውን ደስታ ገለፁ። ጠቅላይ ሚኒስትሩ ለመላው ኢትዮጵያውያን እንኳን ደስ አለን ! እንኳን ደስ አላችሁ ! ብለዋል።', 'hashtags': '#DrAbiyAhmed'}
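The test prompt above stops after the input text, whereas the training prompts continue with a "Hashtags:" key. One possible adjustment, which is not part of the original notebook, is to end the inference prompt with that key so the model is cued to answer, and then to cut the generated text at "### End". The sketch below shows the string handling only; `build_inference_prompt`, `extract_response` and the simulated generation are hypothetical.

```python
RESPONSE_KEY = "Hashtags:"
END_KEY = "### End"

def build_inference_prompt(text):
    """Same layout as the training prompt, but stopping at the response key."""
    parts = ["Identify Hashtags from the given text.", f"### Text:\n{text}", RESPONSE_KEY]
    return "\n\n".join(parts) + "\n"

def extract_response(generated, prompt):
    """Drop the echoed prompt and keep what precedes the end marker."""
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    answer, _, _ = completion.partition(END_KEY)
    return answer.strip()

prompt_text = build_inference_prompt("Join us at #WKU today")

# Stand-in for model output: the prompt is echoed, followed by a completion.
simulated_generation = prompt_text + "['#WKU']\n\n### End\n\nIdentify Hashtags from"
answer = extract_response(simulated_generation, prompt_text)
print(answer)
```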


In [None]:
import time

**Inference using Instruction or Question Only**


In [None]:
input_text = f"Instruction: {prompt['text']}"

In [None]:
# Tokenize the input
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

# Measure inference time (generation plus decoding)
start_time = time.time()

# Generate predictions. Note that max_length counts the prompt tokens as well,
# and that temperature, top_k and top_p only take effect when do_sample=True.
output = model.generate(input_ids, max_length=500, temperature=1.0, top_k=50, top_p=0.95, num_return_sequences=1)
generated_output = tokenizer.decode(output[0], skip_special_tokens=True)

end_time = time.time()

# Calculate the inference time
inference_time = end_time - start_time


In [None]:
# Print the formatted input
print(f"======")
print(f"Input:\n======\n{input_text}\n")
print(f"======================")
print(f"Generated Output:\n======================\n{generated_output}\n")
print(f"=========================================")
print(f"Inference Time:{inference_time} seconds\n==========================================")

Input:
Instruction: Identify Hashtags from the given text.

### Text:
#DrAbiyAhmed ጠ/ሚር ዶክተር አቢይ አህመድ በሰርቢያ ቤልግሬድ በተካሄደው 18ኛው የአለም የቤት ውስጥ ውድድር ኢትዮጵያ በአንደኝነት ደረጃ ስላጠናቀቀች የተሰማቸውን ደስታ ገለፁ። ጠቅላይ ሚኒስትሩ ለመላው ኢትዮጵያውያን እንኳን ደስ አለን ! እንኳን ደስ አላችሁ ! ብለዋል።

Generated Output:
Instruction: Identify Hashtags from the given text.

### Text:
#DrAbiyAhmed ጠ/ሚር ዶክተር አቢይ አህመድ በሰርቢያ ቤልግሬድ በተካሄደው 18ኛው የአለም የቤት ውስጥ ውድድር ኢትዮጵያ በአንደኝነት ደረጃ ስላጠናቀቀች የተሰማቸውን ደስታ ገለፁ። ጠቅላይ ሚኒስትሩ ለመላው ኢትዮጵያውያን እንኳን ደስ አለን ! እንኳን ደስ አላችሁ !

**Fine-tuning using multiple GPUs**

In [None]:
# def train(model, tokenizer, dataset, output_dir):
#     # Apply preprocessing to the model to prepare it by
#     # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
#     model.gradient_checkpointing_enable()

#     # 2 - Using the prepare_model_for_kbit_training method from PEFT
#     model = prepare_model_for_kbit_training(model)

#     # Get lora module names
#     modules = find_all_linear_names(model)

#     # Create PEFT config for these modules and wrap the model to PEFT
#     peft_config = create_peft_config(modules)
#     model = get_peft_model(model, peft_config)

#     # Print information about the percentage of trainable parameters
#     print_trainable_parameters(model)

#     #total_batch_size = n_gpus * per_device_batch_size
#     # Training parameters
#     trainer = Trainer(
#         model=model,
#         train_dataset=dataset,
#         args=TrainingArguments(
#             n_gpu=2,
#             per_device_train_batch_size=2,
#             gradient_accumulation_steps=4,
#             warmup_steps=2,
#             max_steps=20,
#             learning_rate=2e-4,
#             fp16=True,
#             logging_steps=1,
#             output_dir="outputs",
#             optim="paged_adamw_8bit",

#         ),
#         data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
#     )

#     model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs


#     # Verifying the datatypes before training

#     dtypes = {}
#     for _, p in model.named_parameters():
#         dtype = p.dtype
#         if dtype not in dtypes: dtypes[dtype] = 0
#         dtypes[dtype] += p.numel()
#     total = 0
#     for k, v in dtypes.items(): total+= v
#     for k, v in dtypes.items():
#         print(k, v, v/total)

#     do_train = True

#     # Launch training
#     print("Training...")

#     if do_train:
#         train_result = trainer.train()
#         metrics = train_result.metrics
#         trainer.log_metrics("train", metrics)
#         trainer.save_metrics("train", metrics)
#         trainer.save_state()
#         print(metrics)

#     ###

#     # Saving model
#     print("Saving last checkpoint of the model...")
#     os.makedirs(output_dir, exist_ok=True)
#     trainer.model.save_pretrained(output_dir)

#     # Free memory for merging weights
#     del model
#     del trainer
#     torch.cuda.empty_cache()


# output_dir = "results/llama2/final_checkpoint_2g"
# train(model, tokenizer, dataset, output_dir)


In [None]:
# model_2g = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
# model_2g = model_2g.merge_and_unload()

In [None]:
# # save tokenizer for easy inference
# tokenizer_2g = AutoTokenizer.from_pretrained(model_name)