# Finetune Llama-3 with LLaMA Factory

Please use a **free** Tesla T4 Colab GPU to run this!

Project homepage: https://github.com/hiyouga/LLaMA-Factory

## Install Dependencies

In [None]:
%cd /content/
%rm -rf LLaMA-Factory
!git clone https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
%ls
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --quiet
!pip install --no-deps xformers==0.0.25 --quiet
!pip install .[torch,bitsandbytes] --quiet

/content
Cloning into 'LLaMA-Factory'...
remote: Enumerating objects: 12069, done.
remote: Counting objects: 100% (860/860), done.
remote: Compressing objects: 100% (389/389), done.
remote: Total 12069 (delta 559), reused 686 (delta 456), pack-reused 11209
Receiving objects: 100% (12069/12069), 218.35 MiB | 27.05 MiB/s, done.
Resolving deltas: 100% (8798/8798), done.
/content/LLaMA-Factory
assets/       docker-compose.yml  examples/  pyproject.toml  requirements.txt  src/
CITATION.cff  Dockerfile          LICENSE    README.md       scripts/          tests/
data/         evaluation/         Makefile   README_zh.md    setup.py
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done

### Check GPU environment

In [None]:
import torch

# Warn early if PyTorch cannot see a CUDA device.
if not torch.cuda.is_available():
  print("Please set up a GPU before using LLaMA Factory: https://medium.com/mlearning-ai/training-yolov4-on-google-colab-316f8fff99c6")

## Update Dataset

In [None]:
import json
import re
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("dzunggg/legal-qa-v1")

# Strip the leading "Q:" and "A:" prefixes
def remove_prefix(example):
    if example['question'].startswith('Q:'):
        example['question'] = example['question'][2:].strip()
    if example['answer'].startswith('A:'):
        example['answer'] = example['answer'][2:].strip()
    return example

# Replace line breaks with spaces and remove links
def clean_text(text):
    # Replace newlines and carriage returns with spaces
    text = text.replace('\n', ' ').replace('\r', ' ')
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    return text

# Apply the prefix removal to every example
dataset = dataset.map(remove_prefix)

# Return the shared instruction used for every example
def generate_instruction():
    return "Please provide detailed answers to the following legal questions."

# Convert an example to the instruction/input/output format
def convert_to_new_format(example):
    question = clean_text(example['question'])
    answer = clean_text(example['answer'])
    return {
        "instruction": generate_instruction(),
        "input": question,
        "output": answer
    }

# Apply the conversion
new_dataset = {}
new_dataset['train'] = [convert_to_new_format(example) for example in dataset['train']]

# Save the dataset as a JSON file (UTF-8, since ensure_ascii=False keeps non-ASCII characters)
with open('legal_qa_v1_train.json', 'w', encoding='utf-8') as f:
    json.dump(new_dataset['train'], f, indent=4, ensure_ascii=False)

print("Dataset has been saved to legal_qa_v1_train.json files.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/6.21M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3742 [00:00<?, ? examples/s]

Map:   0%|          | 0/3742 [00:00<?, ? examples/s]

Dataset has been saved to legal_qa_v1_train.json files.
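
To make the target record layout concrete, here is a minimal, self-contained sketch of the same cleaning idea applied to one made-up record. The sample text and the `strip_prefix` and `to_alpaca_record` helpers are illustrative assumptions, not part of the dataset or of LLaMA Factory, and the whitespace collapsing is an addition that the cell above does not perform.

```python
import re

INSTRUCTION = "Please provide detailed answers to the following legal questions."

def strip_prefix(text, prefix):
    """Drop a leading prefix such as 'Q:' or 'A:' if it is present."""
    return text[len(prefix):].strip() if text.startswith(prefix) else text

def clean_text(text):
    """Remove URLs, then collapse whitespace runs (including newlines) to one space."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def to_alpaca_record(question, answer, instruction=INSTRUCTION):
    """Build one instruction/input/output record from a raw question and answer."""
    return {
        "instruction": instruction,
        "input": clean_text(strip_prefix(question, "Q:")),
        "output": clean_text(strip_prefix(answer, "A:")),
    }

# Hypothetical sample, used only to show the resulting shape.
sample = to_alpaca_record(
    "Q: What is a tort?\nSee https://example.com/torts",
    "A: A civil wrong.",
)
print(sample)
```

Collapsing whitespace after removing links avoids the double spaces that a plain URL deletion can leave behind.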


In [None]:
%mv legal_qa_v1_train.json data/legal_qa_v1_train.json

In [None]:
import json
from datasets import load_dataset
import re

# Load the dataset
dataset = load_dataset("ibunescu/qa_legal_dataset_train")

def clean_text(text):
    # Replace newlines and carriage returns with spaces
    text = text.replace('\n', ' ').replace('\r', ' ')
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    return text

# Extract the needed columns and build the instruction/input/output format
def convert_to_new_format(example):
    question = clean_text(example['Question'])
    answer = clean_text(example['Answer'])
    return {
        "instruction": "Please provide a detailed answer to the following legal question.",
        "input": question,
        "output": answer
    }

# Apply the conversion
new_dataset = {}
new_dataset['train'] = [convert_to_new_format(example) for example in dataset['train']]

# Save the dataset as a JSON file (UTF-8, since ensure_ascii=False keeps non-ASCII characters)
with open('legal_qa_train.json', 'w', encoding='utf-8') as f:
    json.dump(new_dataset['train'], f, indent=4, ensure_ascii=False)

print("Dataset has been saved to legal_qa_train.json files.")

Downloading data:   0%|          | 0.00/42.8M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/97539 [00:00<?, ? examples/s]

Dataset has been saved to legal_qa_train.json files.


In [None]:
%mv legal_qa_train.json data/legal_qa_train.json
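
LLaMA Factory resolves dataset names such as `legal_qa_v1_train` through `data/dataset_info.json`, and this notebook does not show that registration step. The sketch below is one way it might be done, assuming the default Alpaca-style columns (`instruction`, `input`, `output`) so that only `file_name` is needed per entry. The `register_datasets` helper is hypothetical and not part of LLaMA Factory.

```python
import json
from pathlib import Path

def register_datasets(info_path, entries):
    """Add or overwrite entries in a dataset_info.json-style file and return the result."""
    path = Path(info_path)
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    info.update(entries)
    path.write_text(json.dumps(info, indent=2, ensure_ascii=False), encoding="utf-8")
    return info

# Dataset names must match the ones passed to the `dataset` training argument.
ENTRIES = {
    "legal_qa_v1_train": {"file_name": "legal_qa_v1_train.json"},
    "legal_qa_train": {"file_name": "legal_qa_train.json"},
}

# In this notebook the call would look like:
# register_datasets("data/dataset_info.json", ENTRIES)
```

Existing entries in the file are preserved; only the two names above are added or replaced.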

In [None]:
import json

%cd /content/LLaMA-Factory/

NAME = "Llama-3"
AUTHOR = "LLaMA Factory"

# Fill in any {{name}} / {{author}} placeholders in the outputs of both datasets.
for path in ("data/legal_qa_v1_train.json", "data/legal_qa_train.json"):
  with open(path, "r", encoding="utf-8") as f:
    dataset = json.load(f)
  for sample in dataset:
    sample["output"] = sample["output"].replace("{{"+ "name" + "}}", NAME).replace("{{"+ "author" + "}}", AUTHOR)
  with open(path, "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=2, ensure_ascii=False)

/content/LLaMA-Factory
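
Because the cleaning steps can strip text down to nothing (for example, an answer that consisted only of a link), it may be worth checking the saved records before training. The following is a minimal sketch of such a check; `find_invalid_records` is a hypothetical helper and the sample records are made up for illustration.

```python
def find_invalid_records(records, required=("instruction", "input", "output")):
    """Return indices of records that lack a required key or have an empty output."""
    bad = []
    for index, record in enumerate(records):
        missing = [key for key in required if key not in record]
        if missing or not str(record.get("output", "")).strip():
            bad.append(index)
    return bad

# Made-up records: the second has an empty output, the third lacks "input".
records = [
    {"instruction": "Answer.", "input": "What is a lien?", "output": "A security interest."},
    {"instruction": "Answer.", "input": "Any link?", "output": "   "},
    {"instruction": "Answer.", "output": "Missing the input key."},
]
print(find_invalid_records(records))
```

In the notebook, `records` would be the list loaded with `json.load` from one of the files under `data/`.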


## Fine-tune model via LLaMA Board

In [None]:
%cd /content/LLaMA-Factory/
!GRADIO_SHARE=1 llamafactory-cli webui

## Fine-tune model via Command Line

Training takes about 30 minutes.

In [None]:
!rm -rf llama3_lora

In [None]:
# !python -m xformers.info
!pip uninstall xformers -y
!pip install xformers --quiet
# !python -m xformers.info

Found existing installation: xformers 0.0.25
Uninstalling xformers-0.0.25:
  Successfully uninstalled xformers-0.0.25

In [None]:
from llamafactory.extras.misc import torch_gc

torch_gc()  # release cached GPU memory before training

In [None]:
import json

args = dict(
  stage="sft",                        # do supervised fine-tuning
  do_train=True,
  model_name_or_path="nvidia/Llama3-ChatQA-1.5-8B", # base model: the Llama-3-based ChatQA 1.5 8B model
  # dataset="legal_qa_v1_train",             # train on the first legal QA dataset only
  dataset="legal_qa_v1_train,legal_qa_train",  # train on both legal QA datasets prepared above
  # dataset='processed_data',
  template="llama3",                     # use llama3 prompt template
  finetuning_type="lora",                   # use LoRA adapters to save memory
  lora_target="all",                     # attach LoRA adapters to all linear layers
  output_dir="llama3_lora",                  # the path to save LoRA adapters
  per_device_train_batch_size=8,               # the batch size
  gradient_accumulation_steps=6,               # the gradient accumulation steps
  lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
  logging_steps=10,                      # log every 10 steps
  warmup_ratio=0.1,                      # warm up over the first 10% of training steps
  save_steps=1000,                      # save checkpoint every 1000 steps
  learning_rate=1e-4,                     # the learning rate
  num_train_epochs=10.0,                    # the epochs of training
  max_samples=500,                      # use at most 500 examples from each dataset
  max_grad_norm=1.0,                     # clip gradient norm to 1.0
  quantization_bit=8,                     # quantize the base model to 8 bits
  loraplus_lr_ratio=16.0,                   # use LoRA+ algorithm with lambda=16.0
  use_unsloth=True,                      # use UnslothAI's LoRA optimization for 2x faster training
  # use_unsloth=False,
  fp16=True,                         # use float16 mixed precision training
  overwrite_output_dir=True,
)

# Write the config to disk and close the file handle when done
with open("train_llama3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli train train_llama3.json

/content/LLaMA-Factory
2024-05-25 02:01:50.323358: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-05-25 02:01:50.377005: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-25 02:01:50.377047: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-25 02:01:50.379101: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-25 02:01:50.387623: I tensorflow/core/pl
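
The batch-size, epoch, and sample-cap settings above jointly determine how many optimizer steps the run takes. The sketch below works through that arithmetic for the values in this notebook. It is an approximation under stated assumptions (one GPU, both datasets capped at 500 examples, no examples dropped during preprocessing); the trainer's own rounding may differ slightly, and `estimate_schedule` is a hypothetical helper.

```python
import math

def estimate_schedule(num_examples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, num_devices=1):
    """Roughly estimate effective batch size, steps per epoch, total steps, warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = max(num_examples // effective_batch, 1)
    total_steps = math.ceil(steps_per_epoch * epochs)
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps

# Two datasets, each capped by max_samples=500 (assumed to apply per dataset).
NUM_EXAMPLES = 2 * 500

effective, per_epoch, total, warmup = estimate_schedule(
    NUM_EXAMPLES, per_device_batch=8, grad_accum=6, epochs=10.0, warmup_ratio=0.1
)
print(f"effective batch: {effective}, steps/epoch: {per_epoch}, "
      f"total steps: {total}, warmup steps: {warmup}")
```

If this estimate holds, the total step count stays well below `save_steps=1000`, so periodic checkpointing would not fire during the run; lowering `save_steps` is one option if intermediate checkpoints are wanted.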

## Infer the fine-tuned model

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your term

In [None]:
# !pip uninstall functorch
# !pip install -U functorch
# !pip install -U triton
# !pip uninstall torch torchvision torchaudio -y
# !pip install torch==2.0.1+cu121 torchvision==0.15.2+cu121 torchaudio==2.0.2+cu121 -f https://download.pytorch.org/whl/torch_stable.html
!pip install -U -r /content/LLaMA-Factory/requirements.txt

In [None]:
from llamafactory.extras.misc import torch_gc

torch_gc()  # release cached GPU memory before loading the model

In [None]:
from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

args = dict(
  model_name_or_path="nvidia/Llama3-ChatQA-1.5-8B", # same base model as in training
  adapter_name_or_path="llama3_lora_11",            # path to the saved LoRA adapters (the training cell above writes to llama3_lora)
  template="llama3",                     # same as the one used in training
  finetuning_type="lora",                  # same as the one used in training
  quantization_bit=8,                    # quantize the base model to 8 bits
  use_unsloth=True,                     # use UnslothAI's LoRA optimization for 2x faster generation
)
chat_model = ChatModel(args)

messages = []
print("Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.")
while True:
  query = input("\nUser: ")
  if query.strip() == "exit":
    break
  if query.strip() == "clear":
    messages = []
    torch_gc()
    print("History has been removed.")
    continue

  messages.append({"role": "user", "content": query})
  print("Assistant: ", end="", flush=True)

  response = ""
  for new_text in chat_model.stream_chat(messages):
    print(new_text, end="", flush=True)
    response += new_text
  print()
  messages.append({"role": "assistant", "content": response})

torch_gc()

[INFO|tokenization_utils_base.py:2108] 2024-05-25 05:13:30,209 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--nvidia--Llama3-ChatQA-1.5-8B/snapshots/3b98162e3f97550d62aeeb19ea50208f968c678a/tokenizer.json
[INFO|tokenization_utils_base.py:2108] 2024-05-25 05:13:30,210 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-05-25 05:13:30,212 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--nvidia--Llama3-ChatQA-1.5-8B/snapshots/3b98162e3f97550d62aeeb19ea50208f968c678a/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-05-25 05:13:30,214 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--nvidia--Llama3-ChatQA-1.5-8B/snapshots/3b98162e3f97550d62aeeb19ea50208f968c678a/tokenizer_config.json


/content/LLaMA-Factory




05/25/2024 05:13:30 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>


INFO:llamafactory.data.template:Replace eos token: <|eot_id|>


05/25/2024 05:13:30 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>


INFO:llamafactory.data.template:Add pad token: <|eot_id|>
[INFO|configuration_utils.py:733] 2024-05-25 05:13:30,758 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--nvidia--Llama3-ChatQA-1.5-8B/snapshots/3b98162e3f97550d62aeeb19ea50208f968c678a/config.json
[INFO|configuration_utils.py:796] 2024-05-25 05:13:30,760 >> Model config LlamaConfig {
  "_name_or_path": "nvidia/Llama3-ChatQA-1.5-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "t

05/25/2024 05:13:30 - INFO - llamafactory.model.utils.quantization - Quantizing model to 8 bit.


INFO:llamafactory.model.utils.quantization:Quantizing model to 8 bit.


05/25/2024 05:13:30 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.


INFO:llamafactory.model.patcher:Using KV cache for faster generation.


05/25/2024 05:13:30 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.


INFO:llamafactory.model.adapter:Upcasting trainable params to float32.


05/25/2024 05:13:30 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA


INFO:llamafactory.model.adapter:Fine-tuning method: LoRA

==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


[INFO|configuration_utils.py:733] 2024-05-25 05:13:30,990 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--nvidia--Llama3-ChatQA-1.5-8B/snapshots/3b98162e3f97550d62aeeb19ea50208f968c678a/config.json
[INFO|configuration_utils.py:796] 2024-05-25 05:13:30,993 >> Model config LlamaConfig {
  "_name_or_path": "nvidia/Llama3-ChatQA-1.5-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.41.0"

model.safetensors.index.json:   0%|          | 0.00/28.1k [00:00<?, ?B/s]

[INFO|modeling_utils.py:3474] 2024-05-25 05:13:31,167 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--nvidia--Llama3-ChatQA-1.5-8B/snapshots/3b98162e3f97550d62aeeb19ea50208f968c678a/model.safetensors.index.json


Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/6.08G [00:00<?, ?B/s]

[INFO|modeling_utils.py:1519] 2024-05-25 05:15:59,663 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:962] 2024-05-25 05:15:59,680 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 

In [None]:
from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

args = dict(
  # model_name_or_path="nvidia/Llama3-ChatQA-1.5-8B", # base model used in the training cell above
  model_name_or_path="meta-llama/Meta-Llama-3-8B",
#   cache_path = './weights',
  adapter_name_or_path="llama3_lora_12",            # path to the saved LoRA adapters
  template="llama3",                     # same as the one used in training
  finetuning_type="lora",                  # same as the one used in training
  quantization_bit=8,                    # quantize the base model to 8 bits
  use_unsloth=True,                     # use UnslothAI's LoRA optimization for 2x faster generation
#   device = "cuda"
)
chat_model = ChatModel(args)


background_prompt = """
As an AI legal assistant, you are a highly trained expert in U.S. and Canadian law. Your purpose is to provide accurate, comprehensive, and professional legal information to assist users with a wide range of legal questions and issues.

When responding to queries, adhere to the following guidelines:

1. Clarity and Precision:
- Provide clear, concise answers using precise legal terminology.
- Explain complex legal concepts in a manner accessible to non-legal professionals.

2. Comprehensive Coverage:
- Offer thorough, well-rounded responses that address all relevant aspects of the question.
- Explain pertinent legal principles, statutes, case law, and their implications.

3. Contextual Relevance:
- Tailor your advice to the specific context of each question.
- Utilize examples or analogies to illustrate legal concepts when appropriate.

4. Statutory and Case Law References:
- When citing statutes, explain their relevance and application to the matter at hand.
- When referencing case law, summarize the key facts, legal issues, court decisions, and the broader implications of the ruling.

5. Professional Tone:
- Maintain a professional, respectful demeanor in all interactions.
- Ensure your advice is legally sound and adheres to the highest ethical standards.

Remember, your role is to provide general legal information and analysis.

This is a detailed description of the case or general questions, or detailed instructions for you:
"""

messages = []
print("Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.")
while True:
    query = input("\nUser: ")
    if query.strip() == "exit":
        break
    if query.strip() == "clear":
        messages = []
        torch_gc()
        print("History has been removed.")
        continue

    # Combine the user input with the background prompt
    combined_query = background_prompt + query
    messages.append({"role": "user", "content": combined_query})
    print("\n\nAssistant: ", end="", flush=True)

    response = ""
    for new_text in chat_model.stream_chat(messages):
        print(new_text, end="", flush=True)
        response += new_text
    print()
    messages.append({"role": "assistant", "content": response})

torch_gc()

/content/LLaMA-Factory


tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

[INFO|tokenization_utils_base.py:2108] 2024-05-25 20:50:29,500 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/tokenizer.json
[INFO|tokenization_utils_base.py:2108] 2024-05-25 20:50:29,501 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-05-25 20:50:29,501 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-05-25 20:50:29,502 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/tokenizer_config.json


05/25/2024 20:50:29 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>


INFO:llamafactory.data.template:Replace eos token: <|eot_id|>


05/25/2024 20:50:29 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>


INFO:llamafactory.data.template:Add pad token: <|eot_id|>


config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

[INFO|configuration_utils.py:733] 2024-05-25 20:50:30,104 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/config.json
[INFO|configuration_utils.py:796] 2024-05-25 20:50:30,106 >> Model config LlamaConfig {
  "_name_or_path": "meta-llama/Meta-Llama-3-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.0",

05/25/2024 20:50:30 - INFO - llamafactory.model.utils.quantization - Quantizing model to 8 bit.


INFO:llamafactory.model.utils.quantization:Quantizing model to 8 bit.


05/25/2024 20:50:30 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.


INFO:llamafactory.model.patcher:Using KV cache for faster generation.


05/25/2024 20:50:30 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.


INFO:llamafactory.model.adapter:Upcasting trainable params to float32.


05/25/2024 20:50:30 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA


INFO:llamafactory.model.adapter:Fine-tuning method: LoRA


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

[INFO|modeling_utils.py:3474] 2024-05-25 20:50:31,705 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/model.safetensors.index.json


Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

[INFO|modeling_utils.py:1519] 2024-05-25 20:53:53,909 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:962] 2024-05-25 20:53:53,912 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}



Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

[INFO|modeling_utils.py:4280] 2024-05-25 20:54:00,299 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4288] 2024-05-25 20:54:00,301 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at meta-llama/Meta-Llama-3-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.


generation_config.json:   0%|          | 0.00/177 [00:00<?, ?B/s]

[INFO|configuration_utils.py:917] 2024-05-25 20:54:00,647 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/generation_config.json
[INFO|configuration_utils.py:962] 2024-05-25 20:54:00,648 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}

[INFO|tokenization_utils_base.py:2106] 2024-05-25 20:54:00,697 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2106] 2024-05-25 20:54:00,698 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2106] 2024-05-25 20:54:00,699 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2106] 2024-05-25 20:54:00,699 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2106] 2024-05-25 20:54:00,705 >> loading file tokenizer.json
[INFO|tokenizati

05/25/2024 20:54:02 - INFO - llamafactory.model.adapter - Loaded adapter(s): llama3_lora_12


INFO:llamafactory.model.adapter:Loaded adapter(s): llama3_lora_12


05/25/2024 20:54:02 - INFO - llamafactory.model.loader - all params: 8051232768


INFO:llamafactory.model.loader:all params: 8051232768


Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.


Assistant: If you're looking for legal advice tailored to your individual circumstances, it's important to consult with a qualified attorney. Online legal resources like this forum can provide general information and perspectives, but they cannot substitute for the personalized legal advice that comes from a direct attorney-client relationship. Your question is quite general, making it challenging to provide specific guidance in this format. However, you could begin by reaching out to local attorneys to arrange a consultation. In the meantime, continue to research and educate yourself on the matters you've mentioned, such as estate planning and probate law. Being informed can be empowering and helpful in guiding your interactions with legal counsel. Remember, hiring an attorney is a decision based on personal needs and circumstances. Ensure that you select an attorney who aligns wit
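
The chat loop above prepends `background_prompt` to every user turn, so the full prompt is repeated in the history on each exchange. One possible alternative, sketched here under the assumption that a single copy at the start of the conversation is enough, is to prepend it only when the history is empty. `build_messages` is a hypothetical helper, not part of LLaMA Factory.

```python
def build_messages(history, query, background_prompt):
    """Return a new message list, prepending the prompt only on the first user turn."""
    content = background_prompt + query if not history else query
    return history + [{"role": "user", "content": content}]

# Made-up two-turn conversation to show the behaviour.
PROMPT = "You are a legal assistant.\n\n"
turn_one = build_messages([], "What is probate?", PROMPT)
turn_one.append({"role": "assistant", "content": "Probate is a court process."})
turn_two = build_messages(turn_one, "How long does it take?", PROMPT)
print(turn_two[0]["content"])
print(turn_two[2]["content"])
```

This keeps the context shorter on long conversations; whether the model follows the instructions as well with a single copy would need to be checked empirically.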

In [None]:
import torch

# Print GPU memory usage
print(torch.cuda.memory_summary())

# Release cached memory
torch.cuda.empty_cache()

# Print GPU memory usage again
print(torch.cuda.memory_summary())


|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  15934 MiB |  16545 MiB |   1654 GiB |   1639 GiB |
|       from large pool |  15807 MiB |  16418 MiB |   1613 GiB |   1597 GiB |
|       from small pool |    127 MiB |    163 MiB |     41 GiB |     41 GiB |
|---------------------------------------------------------------------------|
| Active memory         |  15934 MiB |  16545 MiB |   1654 GiB |   1639 GiB |
|       from large pool |  15807 MiB |  16418 MiB |   1613 GiB |   1597 GiB |
|       from small pool |    127 MiB |    163 MiB |     41 GiB |     41 GiB |
|---------------------------------------------------------------
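
As a rough cross-check on these numbers, the weight memory of the model can be estimated from the parameter count reported in the log above (`all params: 8051232768`). This is weight-only arithmetic and a sketch under simplifying assumptions: it ignores activations, the KV cache, LoRA parameters, and the fact that quantization libraries typically keep some layers in higher precision. `weight_memory_mib` is a hypothetical helper.

```python
def weight_memory_mib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in MiB."""
    return num_params * bits_per_param / 8 / 2**20

NUM_PARAMS = 8_051_232_768  # taken from the "all params" log line

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_mib(NUM_PARAMS, bits):,.0f} MiB")
```

The roughly 15934 MiB of allocated memory in the summary above is close to the 16-bit estimate; this sketch does not establish why, but it suggests checking how the weights were actually loaded if memory is tight.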

## Merge the LoRA adapter and optionally upload model

NOTE: The free Colab tier has only 12 GB of RAM, while merging a LoRA adapter into an 8B model needs at least 18 GB, so you **cannot** run this step on the free tier.
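
For intuition about what merging does, the standard LoRA update folds the low-rank factors into each adapted weight matrix as `W + (alpha / r) * B @ A`. The toy sketch below applies that formula to small nested-list matrices in pure Python. It illustrates the arithmetic only and is not how LLaMA Factory implements the export; `merge_lora` and the sample values are assumptions for illustration.

```python
def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a) for nested-list matrices."""
    scale = alpha / rank
    merged = [row[:] for row in weight]  # copy so the input is left untouched
    for i in range(len(weight)):
        for j in range(len(weight[0])):
            delta = sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            merged[i][j] += scale * delta
    return merged

# Toy 2x2 weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (rank, in_features)
B = [[1.0], [3.0]]    # shape (out_features, rank)
print(merge_lora(W, A, B, alpha=2, rank=1))
```

Since the result is a dense matrix of the same shape as the original weight, merging works on the full set of base weights, which is consistent with the memory requirement noted above.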

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your term

In [None]:
import json

%cd /content/LLaMA-Factory

args = dict(
  model_name_or_path="nvidia/Llama3-ChatQA-1.5-8B",  # non-quantized base model the adapter is merged into
  adapter_name_or_path="/content/LLaMA-Factory/llama3_lora_11",  # load the saved LoRA adapter
  template="llama3",                # same as the one used in training
  finetuning_type="lora",           # same as the one used in training
  export_dir="llama3_lora_merged_11",  # directory where the merged model is saved
  export_size=2,                    # shard size (in GB) of the merged model files
  export_device="cuda",             # device used for the export: `cpu` or `cuda`
  #export_hub_model_id="your_id/your_model",  # Hugging Face Hub ID to upload the model to
)

# Write the export config and close the file handle when done
with open("merge_llama3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli export merge_llama3.json

/content/LLaMA-Factory
/content/LLaMA-Factory
2024-05-25 06:41:55.659554: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-05-25 06:41:55.715008: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-25 06:41:55.715069: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-25 06:41:55.716995: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-25 06:41:55.72597
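
A small pre-flight check before the export command can catch a wrong `adapter_name_or_path` early. This is a sketch under assumptions: the helper `missing_adapter_files` is hypothetical, and the two required file names are simply the ones that appear in the adapter archives unpacked elsewhere in this notebook, not a list taken from the LLaMA Factory documentation.

```python
import os

# File names as they appear in the adapter archives unpacked in this notebook
REQUIRED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")

def missing_adapter_files(adapter_dir):
    """Return the required adapter files that are absent from adapter_dir."""
    return [
        name for name in REQUIRED_ADAPTER_FILES
        if not os.path.isfile(os.path.join(adapter_dir, name))
    ]

adapter_dir = "/content/LLaMA-Factory/llama3_lora_11"
missing = missing_adapter_files(adapter_dir)
if missing:
    print(f"{adapter_dir} is missing: {', '.join(missing)}")
else:
    print(f"{adapter_dir} looks complete")
```

A directory that does not exist at all is reported as missing both files, so the same check also covers a mistyped path.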

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import shutil

# Source folder paths
src_lora_merged = '/content/LLaMA-Factory/llama3_lora_merged'
src_lora = '/content/LLaMA-Factory/llama3_lora'

# Destination folder paths
dst_lora_merged = '/content/drive/MyDrive/Colab Notebooks/LLaMA-Lawyer-Finetuned/llama3_lora_merged'
dst_lora = '/content/drive/MyDrive/Colab Notebooks/LLaMA-Lawyer-Finetuned/llama3_lora'

# Copy the folders
shutil.copytree(src_lora_merged, dst_lora_merged)
shutil.copytree(src_lora, dst_lora)
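
By default `shutil.copytree` raises `FileExistsError` when the destination already exists, which is easy to hit when the cell is re-run. The sketch below shows one possible way to make the copy repeatable; the helper name `copy_dir` is hypothetical, and `dirs_exist_ok=True` requires Python 3.8 or newer.

```python
import shutil
from pathlib import Path

def copy_dir(src, dst):
    """Copy src to dst, tolerating an existing dst; return False if src is missing."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        print(f"skip: {src} does not exist")
        return False
    # dirs_exist_ok=True merges into an existing destination instead of raising
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return True

copy_dir(
    "/content/LLaMA-Factory/llama3_lora",
    "/content/drive/MyDrive/Colab Notebooks/LLaMA-Lawyer-Finetuned/llama3_lora",
)
```

Files already present in the destination are overwritten by files of the same name from the source; files that exist only in the destination are left alone.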


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Path of the model and tokenizer
model_path = "StevenChen16/llama3-8b-Lawyer"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_path)
model.to('cuda' if torch.cuda.is_available() else 'cpu')
# model.to('cpu')

# Test inference
def generate_response(prompt, max_length=2000):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Pass the attention mask along with the input IDs
    outputs = model.generate(**inputs, max_length=max_length)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Example input
prompt = "Please briefly describe the legal basis of Roe v. Wade."
response = generate_response(prompt)
print("\nResponse:\n", response)
print('\nEnd\n')


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/51.3k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/325 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/728 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/9 [00:00<?, ?it/s]

model-00001-of-00009.safetensors:   0%|          | 0.00/1.97G [00:00<?, ?B/s]

model-00002-of-00009.safetensors:   0%|          | 0.00/1.90G [00:00<?, ?B/s]

model-00003-of-00009.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00004-of-00009.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00005-of-00009.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00006-of-00009.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00007-of-00009.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00008-of-00009.safetensors:   0%|          | 0.00/1.31G [00:00<?, ?B/s]

model-00009-of-00009.safetensors:   0%|          | 0.00/1.05G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/194 [00:00<?, ?B/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 
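
The out-of-memory error is plausible from the download log alone. The arithmetic below is a rough sketch: the shard sizes are copied from the log above, while the other two inputs are assumptions, namely that the checkpoint stores 16-bit weights that are upcast to 32-bit when no `torch_dtype` is given, and that the GPU has a nominal 16 GB of memory.

```python
# Shard sizes in GB, copied from the download log above
shard_sizes_gb = [1.97, 1.90, 1.98, 1.95, 1.98, 1.95, 1.98, 1.31, 1.05]

checkpoint_gb = sum(shard_sizes_gb)  # size of the weights as stored
fp32_gb = checkpoint_gb * 2          # assumption: 16-bit weights upcast to 32-bit on load
gpu_gb = 16.0                        # assumption: nominal GPU memory, adjust for your runtime

print(f"checkpoint: {checkpoint_gb:.2f} GB")
print(f"as float32: {fp32_gb:.2f} GB")
print(f"fits in float32: {fp32_gb <= gpu_gb}")
print(f"fits as stored:  {checkpoint_gb <= gpu_gb}")
```

Under these assumptions the weights alone come to about 16 GB as stored and roughly twice that in float32, before counting activations or the KV cache, so loading in a lower precision or with quantization is likely needed on a 16 GB card.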

In [None]:
!pip install ollama

Collecting ollama
  Downloading ollama-0.2.0-py3-none-any.whl (9.5 kB)
Collecting httpx<0.28.0,>=0.27.0 (from ollama)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<0.28.0,>=0.27.0->ollama)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<0.28.0,>=0.27.0->ollama)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, ollama
Successfully installed h11-0.14.0 httpcore-1.0.5 httpx-0.27.0 ollama-0.2.0


In [None]:
%cd /content
!unzip 4.zip

/content
Archive:  4.zip
   creating: llama3_lora_4/
   creating: llama3_lora_4/runs/
   creating: llama3_lora_4/runs/May23_01-18-46_autodl-container-6d0f499972-4221ed4b/
  inflating: llama3_lora_4/runs/May23_01-18-46_autodl-container-6d0f499972-4221ed4b/events.out.tfevents.1716398343.autodl-container-6d0f499972-4221ed4b.2451.0  
  inflating: llama3_lora_4/trainer_log.jsonl  
  inflating: llama3_lora_4/README.md  
  inflating: llama3_lora_4/adapter_model.safetensors  
  inflating: llama3_lora_4/adapter_config.json  
  inflating: llama3_lora_4/tokenizer_config.json  
  inflating: llama3_lora_4/special_tokens_map.json  
  inflating: llama3_lora_4/tokenizer.json  
  inflating: llama3_lora_4/training_args.bin  
  inflating: llama3_lora_4/train_results.json  
  inflating: llama3_lora_4/all_results.json  
  inflating: llama3_lora_4/trainer_state.json  


In [None]:
import os
from huggingface_hub import HfApi, HfFolder

api = HfApi()
# Read the token from the environment (or the cached login) instead of hardcoding it
token = os.environ.get("HF_TOKEN") or HfFolder.get_token()

model_dir = "/content/LLaMA-Factory/llama3_lora_merged_11"
repo_id_base = "StevenChen16/llama3-8b-lawyer-v2"

# Create the main repository
api.create_repo(repo_id=repo_id_base, token=token, private=False, exist_ok=True)

# Walk the model folder and upload each file
for model_name in os.listdir(model_dir):
    model_path = os.path.join(model_dir, model_name)
    if os.path.isfile(model_path):
        path_in_repo = f"{model_name}"
        api.upload_file(
            path_or_fileobj=model_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id_base,
            token=token
        )


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


model-00003-of-00009.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00005-of-00009.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00002-of-00009.safetensors:   0%|          | 0.00/1.90G [00:00<?, ?B/s]

model-00006-of-00009.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00007-of-00009.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00009-of-00009.safetensors:   0%|          | 0.00/1.05G [00:00<?, ?B/s]

model-00008-of-00009.safetensors:   0%|          | 0.00/1.31G [00:00<?, ?B/s]

model-00001-of-00009.safetensors:   0%|          | 0.00/1.97G [00:00<?, ?B/s]

model-00004-of-00009.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]
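
Since the upload loop only sends regular files that sit directly inside `model_dir`, a dry run that lists those files and their sizes can confirm what will be pushed before any network traffic starts. This is an illustrative sketch; the helper `list_uploadable_files` is hypothetical and does not call the Hugging Face Hub.

```python
import os

def list_uploadable_files(model_dir):
    """Return (name, size_in_bytes) for each regular file directly inside model_dir."""
    entries = []
    for name in sorted(os.listdir(model_dir)):
        path = os.path.join(model_dir, name)
        if os.path.isfile(path):  # subdirectories are skipped, as in the upload loop
            entries.append((name, os.path.getsize(path)))
    return entries

model_dir = "/content/LLaMA-Factory/llama3_lora_merged_11"
if os.path.isdir(model_dir):
    for name, size in list_uploadable_files(model_dir):
        print(f"{name}: {size / 1e9:.2f} GB")
else:
    print(f"{model_dir} does not exist")
```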

In [None]:
%cd /content
!unzip 11.zip
%mv llama3_lora_11 LLaMA-Factory/

/content
Archive:  11.zip
replace llama3_lora_11/runs/May25_10-33-02_autodl-container-5cef4489a6-e842d0bc/events.out.tfevents.1716604396.autodl-container-5cef4489a6-e842d0bc.1961.0? [y]es, [n]o, [A]ll, [N]one, [r]ename: N
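
The `replace ...?` prompt appears because the archive had already been unpacked once, and `unzip` asks before overwriting. One possible alternative that never prompts is Python's standard `zipfile` module, whose `extractall` overwrites existing files silently. The helper name `extract_archive` below is hypothetical, and the sketch extracts straight into the target directory rather than unpacking and then moving.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir, overwriting existing files without prompting."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()

# Unpack the adapter archive directly into the LLaMA-Factory checkout, if present
if Path("/content/11.zip").exists():
    names = extract_archive("/content/11.zip", "/content/LLaMA-Factory")
    print(f"extracted {len(names)} entries")
```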


In [None]:
%cd /content
!unzip 12.zip
%mv llama3_lora_12 LLaMA-Factory/

/content
Archive:  12.zip
   creating: llama3_lora_12/
   creating: llama3_lora_12/runs/
   creating: llama3_lora_12/runs/May25_14-47-25_autodl-container-5cef4489a6-e842d0bc/
  inflating: llama3_lora_12/runs/May25_14-47-25_autodl-container-5cef4489a6-e842d0bc/events.out.tfevents.1716619658.autodl-container-5cef4489a6-e842d0bc.7003.0  
   creating: llama3_lora_12/runs/May25_14-54-29_autodl-container-5cef4489a6-e842d0bc/
  inflating: llama3_lora_12/runs/May25_14-54-29_autodl-container-5cef4489a6-e842d0bc/events.out.tfevents.1716620080.autodl-container-5cef4489a6-e842d0bc.7188.0  
   creating: llama3_lora_12/runs/May26_00-40-46_autodl-container-5cef4489a6-e842d0bc/
  inflating: llama3_lora_12/runs/May26_00-40-46_autodl-container-5cef4489a6-e842d0bc/events.out.tfevents.1716655267.autodl-container-5cef4489a6-e842d0bc.1877.0  
  inflating: llama3_lora_12/README.md  
  inflating: llama3_lora_12/adapter_model.safetensors  
  inflating: llama3_lora_12/adapter_config.json  
  inflating: llama3_l