In [10]:
%%capture
!pip install transformers datasets accelerate bitsandbytes xformers langchain sentence_transformers autotrain-advanced faiss-gpu

# 🤗 HuggingFace Hub Credentials
Before we can load Llama 2 using a number of tricks, we first need to accept its license. The steps are as follows:


* Create a HuggingFace account [here](https://huggingface.co)
* Apply for Llama 2 access [here](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
* Get your HuggingFace token [here](https://huggingface.co/settings/tokens)

After doing so, we can log in with our HuggingFace credentials so that this environment knows we have permission to download the Llama 2 model we are interested in.

In [18]:
from huggingface_hub import notebook_login
notebook_login()

# 🦙 **Llama 2**

Now comes one of the more interesting parts of this tutorial: how to load a Llama 2 model on a T4 GPU!

We will be focusing on the `'meta-llama/Llama-2-13b-chat-hf'` variant. It is large enough to give interesting and useful results, yet small enough to run in our environment.

We start by defining our model and identifying if our GPU is correctly selected. We expect the output of `device` to show a cuda device:
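
A minimal sketch of such a check, assuming PyTorch as the backend (`pick_device` is a hypothetical helper name, and it falls back to `'cpu'` when PyTorch or a GPU is missing):

```python
import importlib.util


def pick_device() -> str:
    """Return 'cuda:<index>' when a CUDA GPU is visible, otherwise 'cpu'."""
    # Hypothetical helper: guard the import so the check also runs without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    if torch.cuda.is_available():
        return f"cuda:{torch.cuda.current_device()}"
    return "cpu"


device = pick_device()
print(device)
```

On a Colab T4 runtime this would be expected to print a `cuda:` device. A result of `cpu` usually means the runtime type was not switched to GPU.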

We will start with prompting the model without any examples by simply asking the LLM the question directly:

Personally, I am not that convinced by the answer. I think the sentiment is more neutral than positive. Also, we have to search the generated text for the answer.
Instead, let's give the model an example of how we want the answer to be formatted:

In [None]:
prompt = """
<s>[INST] <<SYS>>

You are a helpful assistant.

<</SYS>>

Classify the text into neutral, negative or positive.
Text: I think the food was alright.
Sentiment:
[/INST]

Neutral</s><s>

[INST]
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
[/INST]
"""
print(generator(prompt)[0]["generated_text"])


<s>[INST] <<SYS>>

You are a helpful assistant.

<</SYS>>

Classify the text into neutral, negative or positive.
Text: I think the food was alright.
Sentiment:
[/INST]

Neutral</s><s>

[INST]
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
[/INST]

Neutral
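
Writing such prompts by hand gets error-prone as the number of examples grows. The sketch below assembles the same kind of few-shot prompt from a list of (question, answer) pairs, following the Llama 2 chat layout used above. `build_few_shot_prompt` is a hypothetical helper, not part of any library.

```python
def build_few_shot_prompt(system, examples, query):
    """Assemble a Llama 2 chat prompt with worked examples before the final query."""
    turns = []
    for i, (user_msg, answer) in enumerate(examples):
        # Only the first turn carries the <<SYS>> block.
        sys_block = f"<<SYS>>\n{system}\n<</SYS>>\n\n" if i == 0 else ""
        turns.append(f"<s>[INST] {sys_block}{user_msg} [/INST] {answer} </s>")
    # If there are no examples, the system block goes into the final turn instead.
    sys_block = f"<<SYS>>\n{system}\n<</SYS>>\n\n" if not examples else ""
    turns.append(f"<s>[INST] {sys_block}{query} [/INST]")
    return "".join(turns)


task = "Classify the text into neutral, negative or positive.\nText: {}\nSentiment:"
few_shot = build_few_shot_prompt(
    "You are a helpful assistant.",
    [(task.format("I think the food was alright."), "Neutral")],
    task.format("I think the food was okay."),
)
print(few_shot)
```

The resulting string could be passed to `generator(...)` in place of the hand-written prompt.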


### RAG

In [6]:

from huggingface_hub import notebook_login
notebook_login()

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-7b-chat-hf'

# 4-bit Quantization to load Llama 2 with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval()

# Our text generator
generator = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=500,
    repetition_penalty=1.1
)

In [None]:
# Wrap the Hugging Face pipeline so LangChain can use it as an LLM
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generator)

In [9]:
import pandas as pd

# The files are tab-separated and have no header row
column_names = ['context', 'label']
train = pd.read_csv("/content/drive/MyDrive/limited_liability-drive/train.csv", header=None, sep='\t', names=column_names)
validation = pd.read_csv("/content/drive/MyDrive/limited_liability-drive/validation.csv", header=None, sep='\t', names=column_names)

In [10]:
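def removing_text(context):
  """Drop everything up to the first ':' (for example the 'text lib:' prefix)."""
  # Note: any later colons are replaced by spaces when the parts are re-joined.
  text=context.split(":")
  text=" ".join(text[1:])
  return text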
train.head()

,context,label
0,text lib:associate addendum entered into by th...,limitations. notwithstanding any other provisi...
1,text lib:disaster management or disaster backu...,11 . limitation of liability. in the event tha...
2,text lib:form shall immediately be returned by...,limits on liability 10.01 pbms' liability here...
3,text lib:network or client s website(s) will b...,limitation of liability. in no event will the ...
4,text lib:temporary restraining order or other ...,7. limitation of liability. in no event will c...


In [11]:
train['context'] = train['context'].apply(removing_text)
validation['context'] = validation['context'].apply(removing_text)
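
To make the effect of `removing_text` concrete, the sketch below re-declares it and runs it on a made-up string. It also shows a hypothetical variant, `strip_prefix`, that splits only on the first colon so that colons inside the clause itself survive. Whether that matters depends on the data.

```python
def removing_text(context):
    # Same logic as the notebook's helper: drop the part before the first colon.
    text = context.split(":")
    return " ".join(text[1:])


def strip_prefix(context):
    # Hypothetical variant: split once, so colons inside the clause are kept.
    return context.split(":", 1)[-1]


# Made-up example string in the style of the 'context' column.
sample = "text lib:liability is limited as follows: fees paid in the prior year."
print(repr(removing_text(sample)))
print(repr(strip_prefix(sample)))
```

The two also differ on inputs without any colon: `removing_text` returns an empty string, while `strip_prefix` returns the input unchanged.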

In [12]:
def prompt_format(instruction,context,response):
  """Build one Llama 2 chat-style training example: system prompt, context, then the target answer."""
  p=f"""
  <s>[INST] <<SYS>>

  {instruction}

  <</SYS>>

  context: {context}
  ans:
  [/INST]

  {response}</s><s>"""
  return p

# System instruction shared by every example
instruction=" your task to extract limited liability clause from the given context as it is, which talk about outlines the limitations and restrictions on the liability of the service provider (or the party providing goods or services) to the purchaser (the party receiving the goods or services). if you dont know the answer say No, dont make thing up "

# Add a 'text' column holding the fully formatted prompt for each row
train['text']=[ prompt_format(instruction,train['context'][i],train['label'][i]) for i in range(len(train)) ]
validation['text']=[ prompt_format(instruction,validation['context'][i],validation['label'][i]) for i in range(len(validation)) ]
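
The `text` column built above includes the gold answer, which is what training needs. At inference time the prompt would normally stop at `[/INST]` so that the model writes the answer itself. A minimal sketch, where `inference_prompt` is a hypothetical helper and the sample strings are made up:

```python
def inference_prompt(instruction, context):
    """Format a prompt like the training examples, but stop before the answer."""
    return (
        f"<s>[INST] <<SYS>>\n\n{instruction}\n\n<</SYS>>\n\n"
        f"context: {context}\nans:\n[/INST]"
    )


demo = inference_prompt(
    "Extract the limitation of liability clause from the context.",  # made-up instruction
    "in no event will either party be liable for indirect damages.",  # made-up context
)
print(demo)
```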

In [13]:
train['text'][0]



' \n  <s>[INST] <<SYS>>\n\n   your task to extract limited liability clause from the given context as it is, which talk about outlines the limitations and restrictions on the liability of the service provider (or the party providing goods or services) to the purchaser (the party receiving the goods or services). if you dont know the answer say No, dont make thing up \n\n  <</SYS>>\n\n  context: associate addendum entered into by the parties pursuant to section 17.2 hereof. nothing contained in this section shall bar a claim for contributory negligence. 12.2 intellectual property. subject to the limitations set forth herein, provider shall indemnify and hold the indemnified parties harmless from and against any claim asserted or any claim, suit or proceeding brought against any indemnified party as a result of the infringement of the services or the works uponany patent, trademark, copyright, trade secret or other intellectual property or proprietary right of anythird party. indemnified

In [12]:
! pip install --quiet torch
! pip install --quiet sentencepiece
! pip install --quiet --upgrade bitsandbytes
! pip install --quiet --upgrade accelerate

In [13]:
!  pip install langchain -q

In [15]:
%%capture
%pip install accelerate peft bitsandbytes trl

In [16]:
! pip show bitsandbytes

Name: bitsandbytes
Version: 0.41.2.post2
Summary: k-bit optimizers and matrix multiplication routines.
Home-page: https://github.com/TimDettmers/bitsandbytes
Author: Tim Dettmers
Author-email: dettmers@cs.washington.edu
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: 


In [1]:
import os
import torch
#from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer



In [15]:
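# Base model from the Hugging Face hub
# (alternative: "NousResearch/Llama-2-7b-chat-hf")
base_model = "openlm-research/open_llama_3b_v2"

# Instruction dataset from the reference tutorial (its loading is commented out; we train on our own data)
guanaco_dataset = "mlabonne/guanaco-llama2-1k"

# Name under which the fine-tuned adapter is saved
new_model = "llama-2-7b-chat-QA"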

In [19]:
#dataset = load_dataset(guanaco_dataset, split="train")

In [21]:
#QLORA

### 4-bit quantization configuration
4-bit quantization via QLoRA allows efficient fine-tuning of large language models on consumer hardware while retaining high performance. This makes fine-tuning far more accessible for real-world applications.

QLoRA quantizes a pre-trained language model to 4 bits and freezes its parameters. A small number of trainable Low-Rank Adapter (LoRA) layers are then added to the model.

During fine-tuning, gradients are backpropagated through the frozen 4-bit quantized model into the Low-Rank Adapter layers only. The entire pretrained model therefore stays fixed at 4 bits while only the adapters are updated. The QLoRA authors report that this 4-bit setup matches the performance of 16-bit fine-tuning on their benchmarks.
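
As a rough back-of-the-envelope check of why 4-bit loading matters on a 16 GB T4, the sketch below estimates the memory needed for the weights alone. The parameter counts are nominal assumptions. The per-parameter overhead for quantization constants (about 0.5 bits, or about 0.127 bits with double quantization) follows the figures given in the QLoRA paper. Activations, optimizer state, and any layers kept in higher precision are ignored.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter counts (assumptions; real checkpoints differ slightly).
for name, n_params in [("3B", 3e9), ("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, 4 + 0.5)       # 4-bit weights + quantization constants
    nf4_dq = weight_memory_gb(n_params, 4 + 0.127)  # same, with double quantization
    print(f"{name}: fp16 {fp16:.1f} GB | nf4 {nf4:.1f} GB | nf4 + double quant {nf4_dq:.1f} GB")
```

Under these assumptions a 13B model in 16-bit precision would not fit on a T4 at all, while its 4-bit version leaves room for adapters and activations.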

In [17]:
# 4-bit NF4 quantization config (QLoRA) via bitsandbytes
compute_dtype = torch.float16

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

# Load the base model in 4-bit, placing all of it on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"":0}
)
model.config.use_cache = False   # the KV cache is not needed during training
model.config.pretraining_tp = 1

# Load the tokenizer; reuse EOS as the padding token and pad on the right
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [18]:
t = train[['text']][1:5]  # rows 1 to 4 only: a small subset of the training prompts


In [19]:
t.head()

,text
1,\n <s>[INST] <<SYS>>\n\n your task to extr...
2,\n <s>[INST] <<SYS>>\n\n your task to extr...
3,\n <s>[INST] <<SYS>>\n\n your task to extr...
4,\n <s>[INST] <<SYS>>\n\n your task to extr...


In [20]:
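# LoRA configuration
peft_params = LoraConfig(
    lora_alpha=16,          # scaling numerator: updates are scaled by lora_alpha / r
    lora_dropout=0.1,       # dropout applied on the LoRA branch
    r=64,                   # rank of the low-rank update matrices
    bias="none",            # leave bias terms frozen
    task_type="CAUSAL_LM",
)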
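
To get a feel for what `r` and `lora_alpha` in the `LoraConfig` above imply, the sketch below counts the trainable parameters LoRA adds to a single weight matrix and computes the scaling factor `lora_alpha / r`. The 4096 x 4096 projection size is a hypothetical value chosen for illustration, not read from the model.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in matrix (A is r x d_in, B is d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 64, 16          # values from the LoraConfig above
d_in = d_out = 4096             # hypothetical projection size, for illustration only

full = d_in * d_out
extra = lora_extra_params(d_in, d_out, r)
scaling = lora_alpha / r        # the low-rank update is scaled by alpha / r

print(f"frozen weights: {full:,}")
print(f"LoRA weights:   {extra:,} ({extra / full:.1%} of the frozen matrix)")
print(f"scaling factor: {scaling}")
```

Even at the fairly high rank of 64, the adapter holds only a few percent as many parameters as the matrix it modifies.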

In [21]:
from torch.utils.data import Dataset
class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = str(self.data.iloc[idx]['text'])
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
        }

# Create an instance of the custom dataset.
# Every example is padded or truncated to max_length tokens, so memory use grows with it.
max_length = 3000  # Adjust as needed
train_dataset = CustomDataset(t, tokenizer, max_length)
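
Because `CustomDataset` pads every example to `max_length`, a single very long example can set the cost for all of them. One possible way to pick a smaller value is to look at the distribution of token counts and choose a length that covers most examples. The sketch below uses made-up counts and a hypothetical helper, `suggest_max_length`. In practice the counts could come from `len(tokenizer(text)["input_ids"])`.

```python
def suggest_max_length(token_counts, coverage_pct=95):
    """Smallest length that fits coverage_pct percent of the examples without truncation."""
    ordered = sorted(token_counts)
    # Nearest-rank percentile, computed with integer arithmetic (ceiling division).
    rank = max(1, (coverage_pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up token counts with one long outlier.
token_counts = [310, 420, 450, 480, 510, 530, 600, 640, 900, 2950]
print(suggest_max_length(token_counts, coverage_pct=90))
print(suggest_max_length(token_counts, coverage_pct=100))
```

Examples longer than the chosen value would be truncated, so the trade-off between coverage and memory has to be checked against the actual clauses.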

In [22]:
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",      # paged AdamW optimizer from bitsandbytes
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,              # gradient clipping threshold
    max_steps=-1,                   # -1: let num_train_epochs decide the number of steps
    warmup_ratio=0.03,
    group_by_length=True,           # batch examples of similar length together
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

# Supervised fine-tuning: the LoRA adapters are attached to the quantized model
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)
trainer.train()

# Save the trained adapter weights and the tokenizer
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


OutOfMemoryError: ignored
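
The run above stopped with an out-of-memory error. A few existing `TrainingArguments` fields are commonly used to lower peak memory. The sketch below collects them in a plain dict. The specific values are assumptions for illustration, and whether they are enough here has not been tested.

```python
# Hypothetical overrides; every key is an existing TrainingArguments field.
memory_saving_overrides = {
    "per_device_train_batch_size": 1,   # keep the per-step batch as small as possible
    "gradient_accumulation_steps": 4,   # larger effective batch without holding more examples at once
    "gradient_checkpointing": True,     # recompute activations in the backward pass instead of storing them
}

effective_batch_size = (
    memory_saving_overrides["per_device_train_batch_size"]
    * memory_saving_overrides["gradient_accumulation_steps"]
)
print(effective_batch_size)
```

These could be passed as `TrainingArguments(output_dir="./results", **memory_saving_overrides)` together with the other settings. Lowering `max_length` in the dataset also reduces activation memory, since every example is padded to that length.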

In [None]:
# Inspect the training logs in TensorBoard
from tensorboard import notebook
log_dir = "results/runs"
notebook.start("--logdir {} --port 4000".format(log_dir))


In [None]:
# Silence non-critical transformers log messages
logging.set_verbosity(logging.CRITICAL)

# Quick generation test with the loaded model
prompt = "Who is Leonardo Da Vinci?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

# Reference: https://www.datacamp.com/tutorial/fine-tuning-llama-2
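
By default the text-generation pipeline returns the prompt followed by the completion, which is why the answer has to be searched for in the printed text. One option is to pass `return_full_text=False` when calling the pipeline. Another is to strip the prompt afterwards, as in this sketch. `completion_only` is a hypothetical helper and the output string is a made-up stand-in for real model output.

```python
def completion_only(prompt, generated_text):
    """Return just the newly generated part of a text-generation output."""
    # The pipeline's default output starts with the prompt itself.
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


prompt_text = "<s>[INST] Who is Leonardo Da Vinci? [/INST]"
fake_output = prompt_text + " Leonardo da Vinci was an Italian polymath."  # made-up stand-in
print(completion_only(prompt_text, fake_output))
```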