Copyright (c) Meta Platforms, Inc. and affiliates.
This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.



This notebook shows how to train a Llama 2 model on a single GPU (e.g. A10 with 24GB) using int8 quantization and LoRA.

### Step 0: Install pre-requirements and convert checkpoint

We use the Hugging Face trainer and model classes, which means that the checkpoint has to be converted from its original format into the dedicated Hugging Face format.
The conversion can be done by running the `convert_llama_weights_to_hf.py` script provided with the `transformers` package.
Given that the original checkpoint resides under `models/7B`, we can install all requirements and convert the checkpoint with:

In [1]:
# %%bash
# pip install transformers datasets accelerate sentencepiece protobuf==3.20 py7zr scipy peft bitsandbytes fire torch_tb_profiler ipywidgets tqdm vllm
# TRANSFORM=`python -c "import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')"`
# python ${TRANSFORM} --input_dir models --model_size 7B --output_dir models_hf/7B

In [2]:
# llama-recipes/src/llama_recipes/utils/dataset_utils.py

### Step 1: Load the model

Point `model_id` to the folder (or Hugging Face Hub repository) that holds the model weights.

In [3]:
from datasets import load_from_disk
train_data = load_from_disk("custom_data/linear_work_data.hf")



In [4]:
!nvidia-smi

Tue Jan  9 10:38:13 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   21C    P8              19W / 300W |      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [5]:
import os
from huggingface_hub import login

# Read the access token from the environment instead of hard-coding it in the notebook.
login(token=os.environ["HF_TOKEN"])

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/ec2-user/.cache/huggingface/token
Login successful


In [6]:
import time

## Important 

It is important to consider which model we are using to parse the resume. Here we load the chat variant, `meta-llama/Llama-2-7b-chat-hf`.

In [7]:
import os
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_id="meta-llama/Llama-2-7b-chat-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_id)

# Load the weights in int8; the access token comes from the environment rather than the notebook.
model = LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='auto', torch_dtype=torch.float16, token=os.environ["HF_TOKEN"])

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [8]:
import pandas as pd
import pickle
from datasets import Dataset

In [9]:

import sys
sys.path.append('/home/ec2-user/SageMaker/llama_root/src')
sys.path.append('../llama-recipes/src/llama_recipes/')

### Step 3: Check base model

Run the base model on an example input:

In [10]:

eval_prompt = work_prompt = f'''
You are an accurate agent working for a job platform. You will be given the raw 
unstructured text of a user's resume, and the task is to extract the entire work experience of the 
user from the resume. The response should be broken into a numbered list with each item of the list 
containing the complete and accurate information about the work experience of the users.
1. Designation 1 @ Company 1 [From "mm/yyy" to "mm/yyyy"] : "complete job description as given in resume"\n
2. Designation 2 @ Company 2 [From "mm/yyy" to "mm/yyyy"] :  "complete job description as given in resume"\n
Please follow this structure closely and keep the response within the token limit." 

This is the resume text:\n{{resume_text}}\n
This is the output in the required_format:\n'''

In [11]:
example_resume_text = '''
 S\n EVANAND\n\n\nEmail: sevaanand863@gmail.com\nMobile: +919110416415\n\n\nPROFESSIONAL SUMMARY:\n Having 2+ years of technical experience in Analysis, Design, Development, Testing\n and Implementation of Client Server Application and Data warehousing ETL (Extract,\n Transform and Load) in Informatica Power Center 10.4 and INFORMATICA intelligent\n cloud services.\n Main areas of expertise are Developing and Testing the data warehousing\n projects with data quality standards.\n Extensive experience in Extraction, Transformation and Loading of data\n directly from heterogeneous source systems like fat fles, Oracle by using\n Informatica power center.\n Tuned several mappings for the better performance and involved in Performance\n Testing.\n Implemented exceptional handling mechanism by using Exception transformation &amp;\n Human Task.\n Creating Informatica IICS mappings for the diferent plans using various\n transformations.\n Have working experience in Informatica Intelligent Cloud Services IICS components -\n application integration, data integration, Informatica data quality and Informatica\n power center and CRM application - Salesforce.\n Worked on SCD Type1,SCD Type2 in IICS\n Worked on Mapping, Mapping Task, Mapplet, Task Flows\n Experience on all important General transformations.\n Used informatica developer tool to develop the mapping with power center\n transformations.\n Customized SQL override queries where ever possible to minimize the use of Joiner,\n Aggregator and Lookup Transformations.\n Developed all the mappings according to the design document and mapping specs\n provided and performed unit testing.\n Used Parameterization for Mapping, Workfows and sessions.\n Worked on running &amp; scheduling the Informatica jobs using Shell Scripts written on\n the UNIX box.\n Error handling &amp; issue analysis during the testing and maintenance.\n Hands on dynamic parameter fle creation.\n Identifying the bottlenecks and implement the Performance 
tuning &amp;\n Optimization techniques in power center.\n Review and initial approval for various Docs like IDS, IRS, PDI, KEDB, Mapping\n sheets.\n Good Knowledge on Data Warehousing concepts like Star Schema, Dimensions\n and Fact tables.\n Optimizing Informatica Mappings and Sessions to improve the performance.\n Experience of handling slowly changing dimensions to maintain complete\n history using Type I, Type II and Type III strategies.\n Created UNIX Shell scripts to run the Informatica Workfows &amp; controlling the ETL\n fow.\n Hands on Admin activities.\n Excellent problem-solving skills with strong technical background and good\n interpersonal skills.\n\n\n\n\nEXPERIENCE SUMMARY:\n, Worked as a Programmer Analyst with COGNIZANT from Jan 2022 to April 2023.\n\n Worked as a Software Engineer with Birla Soft LTD from Jan 2021 to Jan 2022.\n\n\n\n\n TECHNICAL ENVIRONMENT:\nOperating System : Windows, Linux\nTools : Informatica developer, IICS, PUTTY, SQL Developer and WinSCP\nRDBMS : Oracle ,SQL, PostgreSQL\nLanguages : Unix,\nScheduling Tools : Autosys, Control-M\n\n\n\n PROJECT PROFILE:\n\n\n #PROJECT 1\n\n Client : Verizon\n Project Name : HR Union Recruit in\n Domain : Telecom\n Role : IICS Developer\n Environment : IICS, Oracle 11g, PostgreSQL , Windows 10\n\nProject Description:\n The Project HR Union involves the migration of severance&rsquo;s data in PeopleSoft to\nPostgreSQL.\n\nInformatica Cloud&rsquo;s Data Integration Services consume the Data from Peoplesoft system\nand perform the\n\nbusiness logic to load in Severance&rsquo;s database (PostgreSQL) and then provide the data to\ndownstream\n\nvendors in the form of Files.\n Responsibilities:\n\n Creating Informatica IICS mappings for the diferent plans using various\n transformations.\n Have working experience in Informatica Intelligent Cloud Services IICS\n components - application integration, data integration, Informatica data quality\n and Informatica power center and CRM application - 
Salesforce.\n Analysis of the specifcations provided by the clients.\n Used Various Transformations such as Sorted, Lookup, Joiner, Aggregator,\n Sequence Generator. Lookup, Normalizer, Transaction Control Transformation.\n Worked on Diferent tasks like Mapping Task Replication Task, Synchronization\n Task, Power Center Task in IICS.\n Designed, Developed and implemented ETL Processes using IICS Data\n Integration\n Created IICS connection using various cloud connectors in IICS Administrator\n Extensively used informatica IICS&ndash; Mapping, Mapping Task, Task Flow.\n, Developed complex mappings using transformations such as the Source\n qualifer, Joiner, Aggregator, Update Strategy, Expression, Connected Lookup,\n Unconnected Lookup and Router transformations.\n Created informatica mappings for stage, Dimensions and Fact table loads.\n Created SCD type-1 and type-2 mappings for loading the dimension tables.\n Done extensive testing and wrote queries in SQL to ensure the loading of the\n data.\n Developed and implemented the coding of Informatica Mapping for the\n diferent stages of ETL.\n Involved in Unit testing\n On-time Production migration without defects\n Involved in Post production Support.\n\n\n\n\n#Project 2\n\n Client : Discover Fin bank\n Domain : Banking\n Environment : Informatica power center 9.X, Oracle10g\n Role : Informatica Support and Developer\n\n\n\nDISCRIPTIOIN:\n\n This application was designed to load member and subscriber eligibility information\nas received from the customers in the form of fat fles and oracle database. The system\nwas designed to store the eligibility information of the members belonging to the various\ncontracts for the various vendor customer services being provided to them by the client.\nIt was used to store the historical information pertaining to each and every member who\nwas entitled to receive the customer services. 
The various other front-end applications\nwould access this database to determine the authenticity of the members and the type of\nservices they were entitled to the system.\nResponsibilities:\n\n Understanding existing business model and customer requirements.\n Understanding the mapping specifcations and requirements.\n Managing priorities of tasks, scheduling and tracking progress.\n Extraction of data from various sources using Informatica.\n Designed various mappings for extracting data from various sources involving fat\n fles and relational tables.\n Used Source Analyzer and Warehouse Designer to import the source and target\n database schemas and the mapping designer to map source to the target.\n Used Transformation Developer to create the flters, joiner, update strategy, lookups\n and\n Aggregation transformations, which are used in mappings.\n Created various tasks like sessions, worklets, and workfows in the workfow\n manager to test the mapping during development.\n To keep track of historical data slowly changing dimensions are implemented.\n Created and Monitored Batches and Sessions using Informatica Power Centre.\n Created and executed sessions and batches using Server Manager.\n Worked with Mapping Variables and Mapping Parameters.\n Developed all the mappings according to the design document and mapping specs\n provided and performed unit testing.\n, Created test plan, Test Design, Test scripts and responsible for implementation of\n Test cases as Manual test scripts.\n Developed mapping to load the data in slowly changing dimension.\n Checked the output according to the specifcations.\n Confgured and ran the Debugger from within the Mapping Designer to troubleshoot\n the mapping before the normal run of the workfow.\n Tuned several mappings for the better performance and involved in Performance\n Testing.\n Documenting test cases and Informatica mappings\n Prepared documentation for business data fow from source to target and also for\n the 
changes made to the mappings/sessions existing to eliminate the errors.\n Provide weekly status report to the Project Manager and discuss issues related to\n quality and deadlines.'
'''

In [12]:
ep = eval_prompt.format(resume_text=example_resume_text)
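
The template above is an f-string whose placeholder is written with doubled braces, so `{{resume_text}}` survives as a literal `{resume_text}` and is only filled in by the later `.format` call. A minimal sketch of that two-stage pattern, using a shortened, hypothetical template and resume text:

```python
# Stage 1: inside an f-string, doubled braces are escaped and become single braces.
template = f'''This is the resume text:\n{{resume_text}}\nThis is the output:\n'''
print(repr(template))

# Stage 2: str.format fills the placeholder that survived stage 1.
sample_resume = "Software Engineer at Example Corp, 01/2021 to 01/2022"  # hypothetical text
prompt = template.format(resume_text=sample_resume)
print(prompt)
```

A plain (non-f) string with single braces would behave the same way here; the doubling is needed only because of the `f` prefix.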

In [13]:
# The tokenizer does not accept a `streamer` argument (it belongs to `generate`),
# so tokenize the prompt on its own and print the decoded output once generation finishes.
model_input = tokenizer(ep, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=1024)[0], skip_special_tokens=True))



You are an accurate agent working for a job platform. You will be given the raw 
unstructured text of a user's resume, and the task is to extract the entire work experience of the 
user from the resume. The response should be broken into a numbered list with each item of the list 
containing the complete and accurate information about the work experience of the users.
1. Designation 1 @ Company 1 [From "mm/yyy" to "mm/yyyy"] : "complete job description as given in resume"

2. Designation 2 @ Company 2 [From "mm/yyy" to "mm/yyyy"] :  "complete job description as given in resume"

Please follow this structure closely and keep the response within the token limit." 

This is the resume text:

 S
 EVANAND


Email: sevaanand863@gmail.com
Mobile: +919110416415


PROFESSIONAL SUMMARY:
 Having 2+ years of technical experience in Analysis, Design, Development, Testing
 and Implementation of Client Server Application and Data warehousing ETL (Extract,
 Transform and Load) in Informatica Power Ce

We can see that the base model only repeats the prompt instead of extracting the work experience.

### Step 4: Prepare model for PEFT

Let's prepare the model for Parameter Efficient Fine Tuning (PEFT):

In [14]:
model.train()

def create_peft_config(model):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_int8_training,
    )

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=64,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules = ["q_proj", "v_proj"]
    )
    
    # peft_config = LoraConfig(
    #     task_type=TaskType.CAUSAL_LM,
    #     inference_mode=False,
    #     r=8,
    #     lora_alpha=32,
    #     lora_dropout=0.05,
    #     target_modules = ["q_proj", "v_proj"]
    # )

    # prepare int-8 model for training
    model = prepare_model_for_int8_training(model)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model, peft_config

# create peft config
model, lora_config = create_peft_config(model)





trainable params: 33,554,432 || all params: 6,771,970,048 || trainable%: 0.49548996469513035
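
The trainable-parameter count printed above can be checked by hand. LoRA adds two low-rank matrices per adapted projection: `A` of shape `r x d_in` and `B` of shape `d_out x r`. The sketch below assumes the standard Llama 2 7B shape (32 decoder layers, hidden size 4096, square `q_proj` and `v_proj` projections), which is not stated in this notebook:

```python
def lora_trainable_params(num_layers, d_in, d_out, rank, modules_per_layer):
    """Count LoRA parameters: each adapted module gets A (rank x d_in) and B (d_out x rank)."""
    per_module = rank * d_in + d_out * rank
    return num_layers * modules_per_layer * per_module

# Assumed Llama 2 7B dimensions; q_proj and v_proj are the two targeted modules.
trainable = lora_trainable_params(num_layers=32, d_in=4096, d_out=4096, rank=64, modules_per_layer=2)
total = 6_771_970_048  # "all params" as reported by print_trainable_parameters()

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / total:.4f}")

# The commented-out r=8 configuration would train eight times fewer parameters.
print(f"r=8: {lora_trainable_params(32, 4096, 4096, 8, 2):,}")
```

Under these assumptions the r=64 count matches the reported 33,554,432 exactly.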


### Step 5: Define an optional profiler

In [15]:
from transformers import TrainerCallback
from contextlib import nullcontext
enable_profiler = False
output_dir = "tmp/linear_workex"

config = {
    'lora_config': lora_config,
    'learning_rate': 1e-4,
    'num_train_epochs': 1,
    'gradient_accumulation_steps': 2,
    'per_device_train_batch_size': 2,
    'gradient_checkpointing': False,
}

# Set up profiler
if enable_profiler:
    wait, warmup, active, repeat = 1, 1, 2, 1
    total_steps = (wait + warmup + active) * (1 + repeat)
    schedule =  torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=repeat)
    profiler = torch.profiler.profile(
        schedule=schedule,
        on_trace_ready=torch.profiler.tensorboard_trace_handler(f"{output_dir}/logs/tensorboard"),
        record_shapes=True,
        profile_memory=True,
        with_stack=True)
    
    class ProfilerCallback(TrainerCallback):
        def __init__(self, profiler):
            self.profiler = profiler
            
        def on_step_end(self, *args, **kwargs):
            self.profiler.step()

    profiler_callback = ProfilerCallback(profiler)
else:
    profiler = nullcontext()

In [16]:
!nvidia-smi

Tue Jan  9 10:41:12 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   27C    P0              74W / 300W |  16496MiB / 23028MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
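
As a rough sanity check on the memory reading above, the weight footprint can be estimated from the parameter count. The sketch assumes one byte per parameter in int8 and two in fp16, and it ignores that some modules stay in higher precision under int8 loading. The remaining memory in use would plausibly go to things such as activations, the generation cache, and the CUDA context.

```python
all_params = 6_771_970_048    # "all params" reported after adding LoRA
lora_params = 33_554_432      # "trainable params" reported for the adapter
base_params = all_params - lora_params

GIB = 1024 ** 3

def weights_gib(num_params, bytes_per_param):
    """Approximate size of the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB

int8_gib = weights_gib(base_params, 1)
fp16_gib = weights_gib(base_params, 2)
print(f"int8 weights: ~{int8_gib:.1f} GiB")
print(f"fp16 weights: ~{fp16_gib:.1f} GiB")
```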
                                                                    

### Step 6: Fine tune the model

Here, we fine-tune the model for a single epoch, which takes a bit more than an hour on an A100.

In [17]:
from transformers import default_data_collator, Trainer, TrainingArguments

# Define training args
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    bf16=True,  # Use BF16 if available
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=5,
    save_strategy="no",
    optim="adamw_torch_fused",
    max_steps=total_steps if enable_profiler else -1,
    **{k:v for k,v in config.items() if k != 'lora_config'}
)

with profiler:
    # Create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_data,
        data_collator=default_data_collator,
        callbacks=[profiler_callback] if enable_profiler else [],
    )

    # Start training
    trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
5,2.2261
10,2.0146
15,2.1033
20,1.9992
25,1.9039
30,1.8484
35,1.9857
40,2.0104
45,1.8887
50,1.9122
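
The logged loss fluctuates from one logging step to the next, so comparing window averages gives a steadier picture than single values. A small sketch using the ten values from the table above:

```python
from statistics import mean

# (step, training loss) pairs copied from the log above.
log = [(5, 2.2261), (10, 2.0146), (15, 2.1033), (20, 1.9992), (25, 1.9039),
       (30, 1.8484), (35, 1.9857), (40, 2.0104), (45, 1.8887), (50, 1.9122)]

losses = [loss for _, loss in log]
first_half = mean(losses[:5])
second_half = mean(losses[5:])

print(f"mean loss, steps 5-25:  {first_half:.4f}")
print(f"mean loss, steps 30-50: {second_half:.4f}")
print(f"change: {second_half - first_half:+.4f}")
```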


In [18]:
print('done, on the 9th of Jan, the year of our lord 2024')

done, on the 9th of Jan, the year of our lord 2024


### Step 7:
Save model checkpoint

In [19]:
model.save_pretrained(output_dir)

### Step 8:
Try the fine-tuned model on a resume sampled from the evaluation set to see the learning progress:

In [20]:
eval_df = pd.read_csv('custom_data/model_eval_df.csv')

In [21]:
eval_df.shape

(161, 5)

In [22]:
import html 

In [27]:
rt = eval_df.sample()['resume'].values[0]
rt = html.unescape(rt)
print(rt)

PRADEEP KUMAR
Project Manager- IT Infra
 pkindians@gmail.com
 9910393860
 1097/2 FF Pinewood Enclave Sec-2 Wave City Ghaziabad U.P.-201002




Professional Experience
Project Manager-IT Infra, 04/2022 – present | Noida, India
iBoss Tech Solutions Private Limited
 •Work closely with IT Director to leverage IT for business benefit.
 •Build system and process for smooth operations.
 •Run projects to ensure that they meet deadline, customer requirements and organizational goals in
 efficient manner.
 •Plan, schedule and supervise the work of each tech team to ensure the services are provided on time and
 in efficient manner
 •Evaluate, plan and procure, operationalize and retire appropriate technology solutions.
 •Mange relevant contracts and ensure compliance and governance.
 •Control cost and budgeting regarding IT systems.
 •Responsible for managing Operations and Projects ensure highest uptime of IT services. Service call
 closure to meet business SLAs and ensure all systems are as per

In [28]:

eval_prompt = f'''
You are an accurate agent working for a job platform. You will be given the raw 
unstructured text of a user's resume, and the task is to extract the entire work experience of the 
user from the resume. The response should be presented into a numbered list with each item of the list 
being an unbroken line of text containing the complete and accurate information about the work experience of the users. 
Here is an example structure:\n
1. Designation 1 @ Company 1 [From "mm/yyy" to "mm/yyyy"] : "complete job description as given in resume"\n
2. Designation 2 @ Company 2 [From "mm/yyy" to "mm/yyyy"] :  "complete job description as given in resume"\n
Please follow this structure accurately and keep the response within the token limit." 

This is the resume text:\n{{resume_text}}\n
This is the output in the required_format:\n'''

In [29]:
ep = eval_prompt.format(resume_text=rt)

In [30]:
# The tokenizer does not accept a `streamer` argument (it belongs to `generate`),
# so tokenize the prompt on its own and print the decoded output once generation finishes.
model_input = tokenizer(ep, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=1024)[0], skip_special_tokens=True))



You are an accurate agent working for a job platform. You will be given the raw 
unstructured text of a user's resume, and the task is to extract the entire work experience of the 
user from the resume. The response should be presented into a numbered list with each item of the list 
being an unbroken line of text containing the complete and accurate information about the work experience of the users. 
Here is an example structure:

1. Designation 1 @ Company 1 [From "mm/yyy" to "mm/yyyy"] : "complete job description as given in resume"

2. Designation 2 @ Company 2 [From "mm/yyy" to "mm/yyyy"] :  "complete job description as given in resume"

Please follow this structure accurately and keep the response within the token limit." 

This is the resume text:
PRADEEP KUMAR
Project Manager- IT Infra
 pkindians@gmail.com
 9910393860
 1097/2 FF Pinewood Enclave Sec-2 Wave City Ghaziabad U.P.-201002




Professional Experience
Project Manager-IT Infra, 04/2022 – present | Noida, India
iBoss T

In [None]:
# eval_prompt = f'''
# You are a helpful language model working for a job platform. You will be given the raw 
#  unstructured text of a user's resume, and the task is to extract the work experience of the 
#  user from the raw text in the following format: \n{{work_format}}\n

#  This is the resume text:\n{{resume_text}}\n
#  This is the output in the required format:\n
# '''

In [None]:
# work_format = '''{
#     'work_experience': [{'company': 'company Name 1',
#                          'role': 'job designation 1',
#                          'start_date': 'mm/yyyy',
#                          'end_date': 'mm/yyyy',
#                          'description': 'complete Job description taken from resume'},
#                         {'company': 'company name 2',
#                          'role': 'job designation 2',
#                          'start_date': 'mm/yyyy',
#                          'end_date': 'mm/yyyy',
#                          'description': 'complete Job description taken from resume'}]
# }'''

In [31]:
import os
model.push_to_hub('lakshay/linear-work-peft', token=os.environ["HF_TOKEN"])

adapter_model.safetensors:   0%|          | 0.00/134M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/lakshay/linear-work-peft/commit/dd2610e6d1fdc7621e824849db8762d0d6b21200', commit_message='Upload model', commit_description='', oid='dd2610e6d1fdc7621e824849db8762d0d6b21200', pr_url=None, pr_revision=None, pr_num=None)

## Personal Information Evaluation

In [None]:
import os
model.push_to_hub('lakshay/llama2-test', token=os.environ["HF_TOKEN"], max_shard_size='2GB')

adapter_model.safetensors:   0%|          | 0.00/16.8M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/lakshay/llama2-test/commit/9460af41bdcca6c6b9cafac27d3ee09a4bd6c36a', commit_message='Upload model', commit_description='', oid='9460af41bdcca6c6b9cafac27d3ee09a4bd6c36a', pr_url=None, pr_revision=None, pr_num=None)

## PI validation loop

In [None]:
validation_data = pd.read_csv('custom_data/validation_dataset.csv')

In [None]:
validation_data.sample()

In [None]:
validation_data.resume.values[:1]

In [None]:
from tqdm.notebook import tqdm
import ast

In [None]:

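# NOTE: `pi_eval_prompt` (a `string.Template`, judging by `.substitute`) and `pi_format`
# are not defined in this notebook; define them before running this cell.
error_list = list()
correct_list = list()

model.eval()
for uid, rt in tqdm(validation_data[['id', 'resume']].sample(frac=1).values[:200]):

    eval_prompt = pi_eval_prompt.substitute(
                pi_format=pi_format,
                resume_text=rt)

    sample_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
    try:
        with torch.no_grad():
            full_document = tokenizer.decode(model.generate(**sample_input, max_new_tokens=200)[0], skip_special_tokens=True)
    except Exception as exc:
        # Skip samples whose generation fails and report which one it was.
        print(f'generation failed for {uid}: {exc}')
        continue

    try:
        # Strip the echoed prompt, then parse the remainder as a Python literal.
        out_str = full_document.replace(eval_prompt, '').replace('$', '')
        out_json = ast.literal_eval(out_str)
        correct_list.append({uid: out_json})
    except Exception:
        # Keep unparseable generations for later inspection.
        error_list.append(full_document)
        continue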
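
The post-processing in the loop can be tried without a GPU. The sketch below assumes `pi_eval_prompt` is a `string.Template` (suggested by the `.substitute` call); the template text, the `pi_format` string, and the fake model outputs are all hypothetical stand-ins.

```python
import ast
from string import Template

# Hypothetical stand-ins for the undefined pi_eval_prompt and pi_format.
pi_eval_prompt = Template("Extract the details as $pi_format\nResume:\n$resume_text\nOutput:\n")
pi_format = "{'name': '...', 'email': '...'}"

def parse_generation(full_document, prompt):
    """Strip the echoed prompt and parse the rest as a Python literal; return None on failure."""
    out_str = full_document.replace(prompt, '').replace('$', '')
    try:
        return ast.literal_eval(out_str.strip())
    except (ValueError, SyntaxError):
        return None

prompt = pi_eval_prompt.substitute(pi_format=pi_format, resume_text="Jane Doe, jane@example.com")

# A causal LM returns the prompt followed by its continuation.
good = prompt + "{'name': 'Jane Doe', 'email': 'jane@example.com'}"
bad = prompt + "Sure! Here is the information you asked for"

print(parse_generation(good, prompt))
print(parse_generation(bad, prompt))
```

Returning `None` for unparseable text mirrors the loop's split into `correct_list` and `error_list` without raising.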

In [None]:
# correct_list

In [None]:
'hello there, $, yes'.replace('there,','').replace('$','')

In [None]:
len(correct_list)

In [None]:
# correct_list

with open('custom_data/validation_output.pkl','wb') as f:
    pickle.dump(correct_list,f)

In [25]:
import os
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_id="lakshay/llama2-test"

# tokenizer = LlamaTokenizer.from_pretrained(model_id)

model = LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='auto', torch_dtype=torch.float16, token=os.environ["HF_TOKEN"])

adapter_config.json:   0%|          | 0.00/581 [00:00<?, ?B/s]

ValueError: 
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
                        these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
                        `device_map` to `from_pretrained`. Check
                        https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
                        for more details.
                        