In [None]:
"""
This notebook evaluates the fine-tuned model Pot-l/llama-7b-lawbot-true on the joey234/mmlu-professional_law dataset.
Final accuracy on a random sample of 20 test questions: 0.40 (8 correct answers out of 20).
"""

In [None]:

!pip install --upgrade \
  "transformers==4.38.1" \
  "datasets==2.17.0" \
  "accelerate==0.27.1" \
  "evaluate==0.4.1" \
  "bitsandbytes==0.42.0"


In [None]:

!pip install git+https://github.com/huggingface/trl --upgrade

In [None]:
!pip install git+https://github.com/huggingface/peft --upgrade

In [None]:
import torch
# assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'

In [None]:
# install the build prerequisites for flash-attn (flash-attn itself is not installed here)
!pip install ninja packaging

In [None]:
from huggingface_hub import login
import os

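# The HF_KEY environment variable must hold a Hugging Face access token before this cell runs.
# The token that was hardcoded here has been redacted; never commit secrets to a notebook.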

login(
    token=os.environ.get('HF_KEY'),
    add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
from datasets import load_dataset

def create_conversation(sample):

    exam_instruction = """
    You are a law exam bot. Users will provide legal scenarios or questions,
    and you are required to generate correct answer options based on the provided context.
    INSTRUCTIONS:
        1. Read the provided legal scenario or question carefully.
        2. Analyze the legal principles, rules, and precedents relevant to the scenario.
        3. Generate the correct answer options (A, B, C, D) based on the provided context.
        Note: Assign option 0 to A, option 1 to B, option 2 to C, and option 3 to D.

    Strictly follow the OUTPUT Schema:
    [Options character(A or B or C or D)]: [Selected answer option's content]
    e.g.
    A: Cold pursuit
        """

    prompt = f"Question: {sample['question']}\nOptions:\n"
    ans_list = ["A","B","C","D"]
    for i, choice in enumerate(sample['choices']):
        prompt += f"{ans_list[i]}: {choice}\n"

    # The dataset stores the correct option as an index (0-3); map it to its letter
    answer_idx = sample['answer']
    answer = ans_list[answer_idx] + ": " + sample['choices'][answer_idx]


    return {
        "messages": [
            {"role": "system", "content": exam_instruction},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer}
        ]
    }
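
The output schema in the system prompt (`A: Cold pursuit`) means a reply can be scored by its leading option letter. The following is a minimal sketch of a parser for that schema. The helper name `extract_option_letter` is hypothetical and not part of the original notebook, which compares the first character of the reply directly.

```python
import re
from typing import Optional

# Hypothetical helper (not in the original notebook): pull the option letter
# out of a reply that follows the "<letter>: <content>" schema.
_OPTION_RE = re.compile(r"^\s*([ABCD])\s*:")


def extract_option_letter(reply: str) -> Optional[str]:
    """Return 'A'..'D' if the reply starts with '<letter>:', else None."""
    match = _OPTION_RE.match(reply)
    return match.group(1) if match else None


print(extract_option_letter("A: Cold pursuit"))
print(extract_option_letter("  C: Yes, because the testimony is reliable."))
print(extract_option_letter("Answer unclear"))
```

Requiring the colon keeps a free-text reply such as "Answer unclear" from being counted as option A, which a plain first-character comparison would do.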

In [None]:
dataset = load_dataset("joey234/mmlu-professional_law", split="test")
dataset = dataset.shuffle().select(range(1500))

dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)

dataset = dataset.train_test_split(test_size=300/1500)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/965 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/33.9k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.04M [00:00<?, ?B/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1534 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [None]:

dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

Creating json from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

642813

In [None]:
from datasets import load_dataset


dataset = load_dataset("json", data_files="train_dataset.json", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
dataset

Dataset({
    features: ['messages'],
    num_rows: 1200
})
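
The 1200 rows shown above follow from the split arguments used earlier: 1500 selected rows with `test_size=300/1500`. A small arithmetic sketch, assuming the test share is rounded to a whole number of rows (the variable names are illustrative only):

```python
# Illustrative arithmetic only; variable names are not from the notebook.
total_rows = 1500            # rows kept by .select(range(1500))
test_fraction = 300 / 1500   # test_size passed to train_test_split

test_rows = round(total_rows * test_fraction)
train_rows = total_rows - test_rows

print(f"train rows: {train_rows}, test rows: {test_rows}")
```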

# Evaluation

## Merge model

In [None]:
### COMMENT IN TO MERGE PEFT AND BASE MODEL ####
import torch
from peft import PeftModel, PeftConfig, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Pot-l/llama-7b-lawbot-true"

# Manual route: load the base model on CPU, resize its embeddings to match
# the fine-tuned tokenizer, then attach the adapter.
config = PeftConfig.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, model_id)

# One-step route: reloads the base model and adapter in float16 and replaces
# the model built above; only the tokenizer from the manual route is reused later.
model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
# Merge the LoRA weights into the base model and save the result locally
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama-7b-lawbot-true-merged-local", safe_serialization=True, max_shard_size="2GB")

adapter_config.json:   0%|          | 0.00/734 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.59k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/51.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/557 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


adapter_model.safetensors:   0%|          | 0.00/3.61G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Eval

In [None]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "./llama-7b-lawbot-true-merged-local",
    low_cpu_mem_usage=True)

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

In [None]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

In [None]:
from datasets import load_dataset
from random import randint

# load the test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
# randint includes both endpoints, so the upper bound must be len - 1
rand_idx = randint(0, len(eval_dataset) - 1)

# sample test
prompt = pipe.tokenizer.apply_chat_template(
    eval_dataset[rand_idx]["messages"][:2],
    tokenize=False,
    add_generation_prompt=True
    )

outputs = pipe(prompt,
               max_new_tokens=256,
               do_sample=False,
               temperature=0.1,
               top_k=50,
               top_p=0.1,
               eos_token_id=pipe.tokenizer.eos_token_id,
               pad_token_id=pipe.tokenizer.pad_token_id
               )

print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")

Query:
Question: A pet breeder is in the business of breeding calves at his cattle ranch where he has a stable of prolific cows who are very fertile. The newborn calves need constant attention and care. One day one of the employees inadvertently leaves the fence door open and a newly-born calf breaks free and goes to his neighbor's land. The breeder went to the neighbor's land to retrieve the calf for its safety and to make sure it was unharmed. However, he was arrested on a trespass charge after entering the land. The breeder appealed. Will the court dismiss the charge?
Options:
A: Yes, because he had a limited privilege to enter the land to prevent harm to his chattel.
B: Yes, because the tender pet doctrine allows temporary entry to retrieve baby animals.
C: No, because the neighbor had a right to keep any living chattels that crossed onto his land.
D: No, because his status as a breeder made him unqualified for a limited license.

Original Answer:
A: Yes, because he had a limited p

In [None]:
from tqdm import tqdm

def evaluate(sample):
    prompt = pipe.tokenizer.apply_chat_template(sample["messages"][:2], tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.1, top_k=50, top_p=0.1, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
    predicted_answer = outputs[0]['generated_text'][len(prompt):].strip()

    print(f"Original Answer:\n{sample['messages'][2]['content']}")
    print(f"Generated Answer:\n{predicted_answer}")

    # Compare only the leading option letter; slicing avoids an IndexError on an empty generation
    return int(predicted_answer[:1] == sample["messages"][2]["content"][:1])

success_rate = []
number_of_eval_samples = 20
# Iterate over the eval dataset and predict
for s in tqdm(eval_dataset.shuffle().select(range(number_of_eval_samples))):
    success_rate.append(evaluate(s))

# Compute accuracy
accuracy = sum(success_rate)/len(success_rate)

print(f"Accuracy: {accuracy*100:.2f}%")

  5%|▌         | 1/20 [00:18<05:59, 18.94s/it]

Original Answer:
A: it violates the statute of frauds.
Generated Answer:
B: there was no consideration.


 10%|█         | 2/20 [00:54<08:33, 28.53s/it]

Original Answer:
C: Yes, because that kind of testimony is reliable and not excludable as hearsay.
Generated Answer:
C: Yes, because that kind of testimony is reliable and not excludable as hearsay.


 15%|█▌        | 3/20 [01:31<09:14, 32.62s/it]

Original Answer:
D: No, because it is hearsay within hearsay, and there are no hearsay exceptions that apply.
Generated Answer:
A: Yes, because it is as an exception to hearsay as a spontaneous declaration to an opponent-party.


 20%|██        | 4/20 [04:27<23:46, 89.14s/it]

Original Answer:
C: A defendant started a joke about the victim's brother. When word got to the victim about the defendant's joke, the victim became incensed. He rushed to the defendant's home, broke open the door and found the defendant preparing dinner in the kitchen. He immediately said, "I'm going to kill you. " The defendant knew that the victim had been convicted of attempted murder several years ago, and he cringed when the victim took out a gun and pointed it at him. The defendant could have easily darted for the open front door and evaded the victim but, instead, he suddenly pulled a knife from the kitchen wall, lunged at the victim, and stabbed him to death. Unknown to the defendant, the victim's gun was not loaded.
Generated Answer:
C: A defendant started a joke about the victim's brother. When word got to the victim about the defendant's joke, the victim became incensed. He rushed to the defendant's home, broke open the door and found the defendant preparing dinner in the k

 25%|██▌       | 5/20 [04:58<17:02, 68.13s/it]

Original Answer:
B: The bank, because its note is secured by a purchase money mortgage.
Generated Answer:
B: The bank, because its note is secured by a purchase money mortgage.


 30%|███       | 6/20 [05:32<13:13, 56.67s/it]

Original Answer:
B: The librarian is not in privity of contract with the homeowners' association.
Generated Answer:
C: The librarian is not in privity of estate with the teacher.


 35%|███▌      | 7/20 [05:53<09:42, 44.82s/it]

Original Answer:
A: It is an agreement between two or more to commit a crime.
Generated Answer:
C: No act is needed other than the solicitation.


 40%|████      | 8/20 [06:40<09:05, 45.46s/it]

Original Answer:
A: The testimony is admissible because it is a party admission made through the party's authorized agent concerning a matter within the scope of his employment.
Generated Answer:
A: The testimony is admissible because it is a party admission made through the party's authorized agent concerning a matter within the scope of his employment.


 45%|████▌     | 9/20 [07:05<07:11, 39.27s/it]

Original Answer:
C: admissible under the excited utterance exception.
Generated Answer:
B: admissible, even though it is hearsay.


 50%|█████     | 10/20 [07:44<06:31, 39.20s/it]

Original Answer:
B: He has been unjustly enriched and he owes her restitution under a quasi-contract legal theory.
Generated Answer:
B: He has been unjustly enriched and he owes her restitution under a quasi-contract legal theory.


 55%|█████▌    | 11/20 [08:26<06:00, 40.09s/it]

Original Answer:
C: Yes, because the patient has failed to introduce evidence that the first orthopedist's care fell below the professional standard of care.
Generated Answer:
C: Yes, because the patient has failed to introduce evidence that the first orthopedist's care fell below the professional standard of care.


 60%|██████    | 12/20 [09:09<05:25, 40.75s/it]

Original Answer:
A: Yes because the Sheriff is stressing a speculative possibility that his office may incur funding shortages and more enforcement problems at some undefined future time.
Generated Answer:
C: No, because anyone has the right to object to the executive branch when it tries to legislate its own opinion into the existing law.


 65%|██████▌   | 13/20 [09:35<04:15, 36.54s/it]

Original Answer:
B: A price increase is sufficient for discharge.
Generated Answer:
D: The impracticality must go to a basic assumption on which the contract was made.


 70%|███████   | 14/20 [10:15<03:44, 37.44s/it]

Original Answer:
B: The evidence is admissible to show that the written agreement was subject to an oral condition precedent.
Generated Answer:
C: The evidence is barred, because the written contract appears to be a complete and total integration of the parties' agreement.


 75%|███████▌  | 15/20 [10:46<02:57, 35.60s/it]

Original Answer:
A: guilty, because she failed to pay the $14 before regaining possession of her car.
Generated Answer:
C: not guilty, because the $14 charge was excessively high.


 80%|████████  | 16/20 [11:18<02:17, 34.48s/it]

Original Answer:
B: The case presents a nonjusticiable political question.
Generated Answer:
B: The case presents a nonjusticiable political question.


 85%|████████▌ | 17/20 [11:49<01:40, 33.40s/it]

Original Answer:
A: competent, because she had personal knowledge of the matter.
Generated Answer:
A: competent, because she had personal knowledge of the matter.


 90%|█████████ | 18/20 [12:33<01:13, 36.59s/it]

Original Answer:
B: The particular materials involved consisted of serious scientific studies of human sexual urges.
Generated Answer:
C: The police did not have a search warrant when they entered the bookstore to purchase the particular materials involved in this obscenity prosecution.


 95%|█████████▌| 19/20 [13:10<00:36, 36.79s/it]

Original Answer:
D: not succeed, because the resident's remarks were not published or communicated to anyone but the plaintiff.
Generated Answer:
C: not succeed, because the resident's remarks were a matter of personal opinion rather than statements of fact.


100%|██████████| 20/20 [13:39<00:00, 41.00s/it]

Original Answer:
A: The performance of the roofer would be a constructive condition precedent to the performance by the homeowner.
Generated Answer:
C: The performances of the homeowner and the roofer would be constructive concurrent conditions.
Accuracy: 40.00%
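
With only 20 sampled questions, the 40.00% figure above carries wide uncertainty. As a rough sketch, the Wilson score interval below estimates that range. It is a standard binomial interval; the helper name `wilson_interval` and the 95% level (`z=1.96`) are assumptions and not part of the original notebook.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# 8 correct answers out of 20, as reported by the evaluation run above
low, high = wilson_interval(8, 20)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

For 8 of 20 this gives roughly 22% to 61%, a range that still includes the 25% expected from random guessing among four options, so a larger `number_of_eval_samples` would be needed for a firmer estimate.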



