How to use a custom dataset for evaluation? #1383

Open
Gooooooogo opened this issue May 4, 2024 · 2 comments
Labels
question Further information is requested

Comments

@Gooooooogo

No description provided.

@lzd-1230

lzd-1230 commented Jul 5, 2024

Have you figured out how to evaluate on your own dataset?

@rasbt
Collaborator

rasbt commented Jul 5, 2024

One way is to evaluate the model's responses on the test set. I have an example here: https://github.com/rasbt/LLM-workshop-2024/blob/main/06_finetuning/06_part-4.ipynb

Note that the code there is not 100% complete, since it is for a workshop where part of it is an exercise, but let me post the missing parts below:
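
For context, the snippets below assume that test_data is a list of instruction/input/output entries loaded from a JSON test split and that format_input builds an Alpaca-style prompt from an entry, as in the workshop notebook. A minimal sketch of those two pieces (the file name is just a placeholder):

import json

# Load the test split; "test_data.json" is a placeholder file name
with open("test_data.json", "r") as f:
    test_data = json.load(f)


def format_input(entry):
    # Alpaca-style prompt formatting, as used in the workshop notebook
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry.get("input") else ""
    return instruction_text + input_text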

IN:

from litgpt import LLM
from tqdm import tqdm

llm = LLM.load("path/to/your/model")

# Generate a response for each test-set entry and store it alongside the entry
for i in tqdm(range(len(test_data))):
    response = llm.generate(format_input(test_data[i]))
    test_data[i]["response"] = response

OUT:

100%|██████████| 165/165 [00:47<00:00,  3.51it/s]

IN:

with open("test_with_response.json", "w") as json_file:
    json.dump(test_data, json_file, indent=4)

del llm

llm = LLM.load("meta-llama/Meta-Llama-3-8B-Instruct", access_token="...")


def generate_model_scores(json_data, json_key):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = llm.generate(prompt, max_new_tokens=50)
        try:
            scores.append(int(score))
        except ValueError:
            continue

    return scores


scores = generate_model_scores(json_data, "response")
print(f"\n{model}")
print(f"Number of scores: {len(scores)} of {len(json_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")

OUT:

Scoring entries: 100%|██████████| 165/165 [00:30<00:00,  5.50it/s]

response_before
Number of scores: 161 of 165
Average score: 84.02
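
A small follow-up that is not part of the original notebook: judge models sometimes wrap the number in extra text, so a more forgiving parser than int(score) can recover a few of the otherwise skipped entries. A minimal sketch:

import re

def parse_score(text):
    # Take the first integer found in the judge's reply, or None if there is none
    match = re.search(r"\d+", text)
    if match is None:
        return None
    value = int(match.group())
    return value if 0 <= value <= 100 else None

Inside generate_model_scores, the try/except around int(score) would then become a check that parse_score(score) is not None before appending it.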
