### Tutorial: Coarse-grained Topic Retrieval and Fine-grained Line Retrieval Evaluation ###

In this tutorial, we walk through the Topic Retrieval and Line Retrieval evaluations on our model LongChat-13B-16K. Specifically, we evaluate the model on Topic Retrieval with 10-topic testcases and use our auto_topic_eval module to check the accuracy of its outputs. We then run the Line Retrieval evaluation with 300-line testcases. By the end, users should know how to run our evaluation module and how to read its output.

### Topic Retrieval Evaluation ###
In this section, we demonstrate how to run the Topic Retrieval evaluation on our model LongChat-13B-16K with 10-topic testcases and use our auto_topic_eval module to examine the accuracy of the model outputs.

#### Step 1: Import necessary modules ####

In [1]:
from longeval.utils import maybe_monkey_patch, get_output_dir, longeval_load_model, load_testcases, test_topics_one_sample, test_lines_one_sample 
from longeval.eval import longeval_test
from argparse import Namespace
from tqdm import tqdm

import os

#### Step 2: Configure evaluation options and set up the output directory ####
Here we set the model to be evaluated to LongChat-13B-16K and the evaluation task to Topic Retrieval. A single GPU with 40 GB of memory is used for this evaluation, and flash attention is enabled to save memory.

In [2]:
args = Namespace(model_name_or_path="lmsys/longchat_13b_16k",
                 task="topics",
                 num_gpus=1,
                 max_gpu_memory=40,
                 longchat_ratio=8,
                 longchat_flash_attn=True)

output_dir = get_output_dir(args)

output to evaluation/topics/predictions/longchat_13b_16k
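The printed path suggests how the output directory is derived from `args`. The following is a minimal sketch that reproduces the paths printed in this tutorial. `sketch_output_dir` is a hypothetical helper written for illustration; the real `get_output_dir` may do more (for example, create the directory).

```python
from argparse import Namespace


def sketch_output_dir(args: Namespace) -> str:
    """Rebuild the prediction path printed by the tutorial (hypothetical helper)."""
    # Assumption: only the last component of the model id is used in the path.
    model_name = args.model_name_or_path.rstrip("/").split("/")[-1]
    return f"evaluation/{args.task}/predictions/{model_name}"


demo_args = Namespace(model_name_or_path="lmsys/longchat_13b_16k", task="topics")
print(sketch_output_dir(demo_args))
```

Changing `task` changes only the middle component of the path, so predictions from different tasks do not overwrite each other.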


#### Step 3: Load patches, tokenizer, and model ####

In [3]:
maybe_monkey_patch(args)
model, tokenizer = longeval_load_model(args)

lmsys/longchat_13b_16k


Condensing Positional embeddings from 16384 to 2048

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
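The log reports positions being condensed from 16384 to 2048, a factor of 8 that matches `longchat_ratio=8`. Below is a minimal sketch of that arithmetic, assuming that condensing means dividing each position index by the ratio before it reaches the positional embedding. This is an assumption for illustration; the patch applied by `maybe_monkey_patch` is not shown in this tutorial.

```python
def condensed_positions(seq_len: int, ratio: int) -> list[float]:
    """Map token positions 0..seq_len-1 into the condensed range (assumed scheme)."""
    return [pos / ratio for pos in range(seq_len)]


ratio = 8              # longchat_ratio from args
max_positions = 16384  # context length reported in the log

# Size of the condensed position range.
print(max_positions // ratio)

# Under this assumption, every position stays below the condensed range size.
positions = condensed_positions(max_positions, ratio)
print(positions[:3], positions[-1])
```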

#### Step 4: Invoke the evaluation function ####

In [4]:
# we use 10-topics testcases
num_topics = 10

print(f"************ Start testing {num_topics} topics per prompt ***********")
# running average of the prompt length across the testcases
avg_length = 0

# test_file contains our pre-generated testcases
test_file = f"evaluation/topics/testcases/{num_topics}_topics.jsonl"
# the output of this evaluation is directed to output_file
output_file = os.path.join(output_dir, f"{num_topics}_response.txt")

# load testcases and start evaluation
test_cases = load_testcases(test_file)
for idx, test_case in tqdm(enumerate(test_cases)):
    _, prompt_length, summary = test_topics_one_sample(model=model, tokenizer=tokenizer, test_case=test_case, output_file=output_file, idx=idx, args=args)
    avg_length += prompt_length / len(test_cases)

print(f"************ Finish testing {num_topics} topics per prompt with average prompt length {avg_length} ************")

************ Start testing 10 topics per prompt ***********


1it [00:20, 20.15s/it]

Label: The future of sustainable agriculture, Predict: ['The first topic we discussed was the future of sustainable agriculture.'], prompt length: 6472


2it [00:40, 20.54s/it]

Label: The impact of technology on privacy and security, Predict: ['The first topic we discussed was the impact of technology on privacy and security.'], prompt length: 6479


3it [01:04, 21.89s/it]

Label: The effects of climate change on ocean ecosystems, Predict: ['The first topic we discussed was "The effects of climate change on ocean ecosystems."'], prompt length: 6148


4it [01:21, 19.79s/it]

Label: The role of sports in society, Predict: ['The first topic we discussed was the role of sports in society.'], prompt length: 6263


5it [01:38, 19.05s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6192


6it [01:57, 19.09s/it]

Label: The history and impact of the Renaissance, Predict: ['The first topic we discussed was the history and impact of the Renaissance.'], prompt length: 6690


7it [02:17, 19.30s/it]

Label: The impact of technology on human connection, Predict: ['The first topic we discussed was "The impact of technology on human connection."'], prompt length: 6540


8it [02:34, 18.66s/it]

Label: The benefits of reading for pleasure, Predict: ['The first topic we discussed was the benefits of reading for pleasure.'], prompt length: 6547


9it [02:51, 18.03s/it]

Label: The role of sports in society, Predict: ['The first topic we discussed was the role of sports in society.'], prompt length: 6289


10it [03:09, 17.93s/it]

Label: The history and culture of ancient civilizations, Predict: ['The first topic we discussed was the history and culture of ancient civilizations.'], prompt length: 5978


11it [03:27, 18.13s/it]

Label: The future of renewable energy storage, Predict: ['The first topic we discussed was the future of renewable energy storage.'], prompt length: 6594


12it [03:47, 18.58s/it]

Label: The benefits of spending time in nature, Predict: ['The first topic we discussed was the benefits of spending time in nature.'], prompt length: 6401


13it [04:04, 18.21s/it]

Label: The role of art in society, Predict: ['The first topic we discussed was The role of art in society.'], prompt length: 6606


14it [04:23, 18.29s/it]

Label: The history and impact of the Renaissance, Predict: ['The first topic we discussed was the history and impact of the Renaissance.'], prompt length: 6503


15it [04:40, 17.98s/it]

Label: The role of sports in society, Predict: ['The first topic we discussed was the role of sports in society.'], prompt length: 6562


16it [05:00, 18.67s/it]

Label: The history and culture of the Middle Ages, Predict: ['The first topic we discussed was the history and culture of the Middle Ages.'], prompt length: 6339


17it [05:19, 18.79s/it]

Label: The benefits of mindfulness meditation, Predict: ['The first topic we discussed was the benefits of mindfulness meditation.'], prompt length: 6334


18it [05:40, 19.43s/it]

Label: The benefits of a plant-based diet, Predict: ['The first topic we discussed was the benefits of a plant-based diet.'], prompt length: 6454


19it [05:59, 19.14s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6426


20it [06:19, 19.46s/it]

Label: The benefits of a plant-based diet, Predict: ['The first topic we discussed was the benefits of a plant-based diet.'], prompt length: 6283


21it [06:34, 18.24s/it]

Label: The benefits of regular exercise, Predict: ['The first topic we discussed was the benefits of regular exercise.'], prompt length: 6197


22it [06:58, 19.86s/it]

Label: The impact of social media on mental health in adults, Predict: ['The first topic we discussed was the impact of social media on mental health in adults.'], prompt length: 6615


23it [07:21, 20.92s/it]

Label: The impact of social media on mental health in adults, Predict: ['The first topic we discussed was the impact of social media on mental health in adults.'], prompt length: 6500


24it [07:37, 19.28s/it]

Label: The benefits of volunteering, Predict: ['The first topic we discussed was the benefits of volunteering.'], prompt length: 6294


25it [07:57, 19.40s/it]

Label: The future of sustainable agriculture, Predict: ['The first topic we discussed was the future of sustainable agriculture.'], prompt length: 6424


26it [08:17, 19.70s/it]

Label: The history and culture of the Middle Ages, Predict: ['The first topic we discussed was the history and culture of the Middle Ages.'], prompt length: 6379


27it [08:32, 18.44s/it]

Label: The psychology of happiness, Predict: ['The first topic we discussed was the psychology of happiness.'], prompt length: 6322


28it [08:51, 18.45s/it]

Label: The benefits of reading for pleasure, Predict: ['The first topic we discussed was "The benefits of reading for pleasure."'], prompt length: 6451


29it [09:13, 19.41s/it]

Label: The history and culture of the Middle Ages, Predict: ['The first topic we discussed was the history and culture of the Middle Ages.'], prompt length: 6732


30it [09:32, 19.27s/it]

Label: The history and culture of ancient civilizations, Predict: ['The first topic we discussed was the history and culture of ancient civilizations.'], prompt length: 6162


31it [09:52, 19.77s/it]

Label: The effects of air pollution on human health, Predict: ['The first topic we discussed was the effects of air pollution on human health.'], prompt length: 6506


32it [10:08, 18.49s/it]

Label: The benefits of volunteering, Predict: ['The first topic we discussed was the benefits of volunteering.'], prompt length: 6336


33it [10:31, 19.95s/it]

Label: The impact of social media on mental health in adults, Predict: ['The first topic we discussed was the impact of social media on mental health in adults.'], prompt length: 6410


34it [10:49, 19.30s/it]

Label: The history and impact of the Renaissance, Predict: ['The first topic we discussed was the history and impact of the Renaissance.'], prompt length: 6179


35it [11:08, 19.06s/it]

Label: The future of sustainable agriculture, Predict: ['The first topic we discussed was the future of sustainable agriculture.'], prompt length: 6138


36it [11:28, 19.36s/it]

Label: The benefits of a plant-based diet, Predict: ['The first topic we discussed was the benefits of a plant-based diet.'], prompt length: 6156


37it [11:46, 19.16s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6629


38it [12:03, 18.46s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6092


39it [12:21, 18.28s/it]

Label: The impact of technology on human connection, Predict: ['The first topic we discussed was the impact of technology on human connection.'], prompt length: 6294


40it [12:39, 18.25s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6392


41it [12:57, 18.15s/it]

Label: The history and impact of the Renaissance, Predict: ['The first topic we discussed was the history and impact of the Renaissance.'], prompt length: 6338


42it [13:13, 17.51s/it]

Label: The psychology of happiness, Predict: ['The first topic we discussed was The psychology of happiness.'], prompt length: 6414


43it [13:31, 17.62s/it]

Label: The effects of sleep on overall health, Predict: ['The first topic we discussed was the effects of sleep on overall health.'], prompt length: 6310


44it [13:54, 19.12s/it]

Label: The impact of social media on mental health in adults, Predict: ['The first topic we discussed was the impact of social media on mental health in adults.'], prompt length: 6283


45it [14:12, 18.92s/it]

Label: The impact of social media on communication, Predict: ['The first topic we discussed was the impact of social media on communication.'], prompt length: 6413


46it [14:28, 18.08s/it]

Label: The psychology of happiness, Predict: ['The first topic we discussed was The psychology of happiness.'], prompt length: 6583


47it [14:45, 17.65s/it]

Label: The role of art in society, Predict: ['The first topic we discussed was The role of art in society.'], prompt length: 6266


48it [15:03, 17.91s/it]

Label: The effects of sleep on overall health, Predict: ['The first topic we discussed was the effects of sleep on overall health.'], prompt length: 6546


49it [15:20, 17.58s/it]

Label: The benefits of reading for pleasure, Predict: ['The first topic we discussed was "The benefits of reading for pleasure."'], prompt length: 6066


50it [15:39, 18.78s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6520
************ Finish testing 10 topics per prompt with average prompt length 6380.939999999998 ************
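The running average computed in the loop can be cross-checked from the printed log. The following is a minimal sketch, assuming each result line ends with `prompt length: <int>` as in the output above. `mean_prompt_length` is a hypothetical helper, not part of longeval, and the predictions in the sample lines are abbreviated.

```python
import re

# Matches the trailing "prompt length: N" field of a result line.
PROMPT_LENGTH = re.compile(r"prompt length: (\d+)\s*$")


def mean_prompt_length(log_lines):
    """Average the 'prompt length' values found in evaluation log lines."""
    lengths = []
    for line in log_lines:
        match = PROMPT_LENGTH.search(line)
        if match:  # progress bars and banner lines do not match and are skipped
            lengths.append(int(match.group(1)))
    if not lengths:
        raise ValueError("no prompt lengths found")
    return sum(lengths) / len(lengths)


sample_log = [
    "1it [00:20, 20.15s/it]",
    "Label: The future of sustainable agriculture, Predict: ['...'], prompt length: 6472",
    "Label: The impact of technology on privacy and security, Predict: ['...'], prompt length: 6479",
]
print(mean_prompt_length(sample_log))
```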





#### Step 5: Examine the model output with auto_topic_eval module ####
Because the model outputs for the Topic Retrieval evaluation are in natural language, there is no simple way to parse them and check their correctness. We therefore developed an auto_topic_eval module that queries ChatGPT to assess the model outputs.

Before we start using our auto_topic_eval module, we need to set OPENAI_API_KEY as an environment variable, e.g. export OPENAI_API_KEY=<YOUR_KEY>.
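It can help to confirm that the key is visible to the Python process before spending time on the evaluation. The following is a minimal sketch; `require_api_key` is a hypothetical helper and not part of auto_topic_eval.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key, or raise a clear error if it is missing or empty."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running auto-evaluation")
    return key


# Demonstration with an explicit mapping, so the real environment is not touched.
print(require_api_key({"OPENAI_API_KEY": "sk-demo"}))
```

In a notebook, calling `require_api_key()` with no arguments checks the actual environment of the running kernel.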

Then we can import the module, load the model output from the previous step, and pass it to ChatGPT.

In [5]:
from longeval.auto_topic_eval import chatgpt_auto_eval

In [6]:
with open(output_file, "r") as json_file:
    json_list = list(json_file)

chatgpt_auto_eval(json_list)

--------------- Start auto-evaluation, you should verify it does this correctly --------------
Question #0: Label: The future of sustainable agriculture, Predict: 'The first topic we discussed was the future of sustainable agriculture. - auto-eval goes with correct
Question #1: Label: The impact of technology on privacy and security, Predict: 'The first topic we discussed was the impact of technology on privacy and security. - auto-eval goes with correct
Question #2: Label: The effects of climate change on ocean ecosystems, Predict: 'The first topic we discussed was "The effects of climate change on ocean ecosystems." - auto-eval goes with correct
Question #3: Label: The role of sports in society, Predict: 'The first topic we discussed was the role of sports in society. - auto-eval goes with correct
Question #4: Label: The benefits of learning a new language, Predict: 'The first topic we discussed was the benefits of learning a new language. - auto-eval goes with correct
Question #5: L

As reported at the end of the auto-evaluation output, the Topic Retrieval accuracy of our LongChat-13B-16K model on 10-topic testcases is 1.0.
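The auto-evaluation banner asks you to verify its verdicts. One cheap cross-check is a normalized substring match between label and prediction. The sketch below is not what auto_topic_eval does; it is a stricter, offline heuristic with hypothetical helper names, and it will reject correct answers that paraphrase the topic.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def label_in_prediction(label: str, prediction: str) -> bool:
    """True if the normalized label appears verbatim in the normalized prediction."""
    return normalize(label) in normalize(prediction)


# Label/prediction pairs copied from the evaluation log.
print(label_in_prediction(
    "The role of art in society",
    "The first topic we discussed was The role of art in society.",
))
print(label_in_prediction(
    "The effects of climate change on ocean ecosystems",
    'The first topic we discussed was "The effects of climate change on ocean ecosystems."',
))
```

Any sample where this check and the ChatGPT verdict disagree is a good candidate for manual inspection.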

### Line Retrieval Evaluation ###
In this section, we show the steps to run Line Retrieval evaluation on our model LongChat-13B-16K with 300-line testcases.

The steps of running Line Retrieval evaluation are very similar to those of Topic Retrieval evaluation.

#### Step 1: Import necessary modules ####

In [7]:
from longeval.utils import maybe_monkey_patch, get_output_dir, longeval_load_model, load_testcases, test_topics_one_sample, test_lines_one_sample 
from longeval.eval import longeval_test
from argparse import Namespace
from tqdm import tqdm

import os

#### Step 2: Configure evaluation options and set up the output directory ####
Here we set the model to be evaluated to LongChat-13B-16K and the evaluation task to Line Retrieval. A single GPU with 40 GB of memory is used for this evaluation, and flash attention is enabled to save memory.

In [8]:
args = Namespace(model_name_or_path="lmsys/longchat_13b_16k",
                 task="lrt",
                 num_gpus=1,
                 max_gpu_memory=40,
                 longchat_ratio=8,
                 longchat_flash_attn=True)

output_dir = get_output_dir(args)

output to evaluation/lrt/predictions/longchat_13b_16k


#### Step 3: Load patches, tokenizer, and model ####

In [None]:
maybe_monkey_patch(args)
model, tokenizer = longeval_load_model(args)

#### Step 4: Start the evaluation ####

In [9]:
# we use 300-line testcases for this demonstration
num_lines = 300

print(f"************ Start testing {num_lines} lines per LRT prompt ************")
# a variable used to know the average length of the testcases
avg_length = 0
# a variable used to count the number of correct model outputs
num_correct = 0

# test_file contains our pre-generated testcases
test_file = f"evaluation/lines/testcases/{num_lines}_lines.jsonl"
# the output of this evaluation is directed to output_file
output_file = os.path.join(output_dir, f"{num_lines}_response.txt")

# load testcases and start evaluation
test_cases = load_testcases(test_file)
for idx, test_case in tqdm(enumerate(test_cases)):
    correct, prompt_length, summary = test_lines_one_sample(model=model, tokenizer=tokenizer, test_case=test_case, output_file=output_file, idx=idx, args=args)
    avg_length += prompt_length / len(test_cases)
    num_correct += correct
accuracy = num_correct / len(test_cases)

with open(output_file, "a+") as f:
    f.write(f"Accuracy: {accuracy}")

print(f"************ Finish testing {num_lines} lines per prompt with average prompt length {avg_length}, accuracy: {accuracy} ************")

************ Start testing 300 lines per LRT prompt ************


0it [00:00, ?it/s]

Using conversation template: vicuna_v1.1


1it [00:35, 35.52s/it]

Label: 29324, Predict: The <REGISTER_CONTENT> in line racial-fedora is <29324>., Parsed: 29324, prompt length: 7117
Using conversation template: vicuna_v1.1


2it [01:11, 35.62s/it]

Label: 29602, Predict: The <REGISTER_CONTENT> in line scandalous-typewriter is <29602>., Parsed: 29602, prompt length: 7136
Using conversation template: vicuna_v1.1


3it [01:46, 35.66s/it]

Label: 34055, Predict: The <REGISTER_CONTENT> in line measly-pocketbook is <34055>., Parsed: 34055, prompt length: 7126
Using conversation template: vicuna_v1.1


4it [02:21, 35.14s/it]

Label: 41534, Predict: The <REGISTER_CONTENT> in line agreeable-cleric is <41534>., Parsed: 41534, prompt length: 7123
Using conversation template: vicuna_v1.1


5it [02:56, 35.33s/it]

Label: 34230, Predict: The <REGISTER_CONTENT> in line internal-adulthood is <34230>., Parsed: 34230, prompt length: 7116
Using conversation template: vicuna_v1.1


6it [03:31, 34.95s/it]

Label: 28113, Predict: The <REGISTER_CONTENT> in line erratic-dinner is <28113>., Parsed: 28113, prompt length: 7086
Using conversation template: vicuna_v1.1


7it [04:05, 34.72s/it]

Label: 41644, Predict: The <REGISTER_CONTENT> in line dirty-briefly is <41644>., Parsed: 41644, prompt length: 7108
Using conversation template: vicuna_v1.1


8it [04:39, 34.61s/it]

Label: 15711, Predict: The <REGISTER_CONTENT> in line worried-monitor is <15711>., Parsed: 15711, prompt length: 7142
Using conversation template: vicuna_v1.1


9it [05:15, 34.92s/it]

Label: 18766, Predict: The <REGISTER_CONTENT> in line screeching-testing is <18766>., Parsed: 18766, prompt length: 7086
Using conversation template: vicuna_v1.1


10it [05:49, 34.74s/it]

Label: 13860, Predict: The <REGISTER_CONTENT> in line decorous-afterlife is <13860>., Parsed: 13860, prompt length: 7129
Using conversation template: vicuna_v1.1


11it [06:28, 35.89s/it]

Label: 14266, Predict: The <REGISTER_CONTENT> in line highfalutin-arrogance is <14266>., Parsed: 14266, prompt length: 7139
Using conversation template: vicuna_v1.1


12it [07:02, 35.39s/it]

Label: 28264, Predict: The <REGISTER_CONTENT> in line innocent-pony is <28264>., Parsed: 28264, prompt length: 7104
Using conversation template: vicuna_v1.1


13it [07:35, 34.68s/it]

Label: 1470, Predict: The <REGISTER_CONTENT> in line hot-chairperson is <1470>., Parsed: 1470, prompt length: 7147
Using conversation template: vicuna_v1.1


14it [08:08, 34.17s/it]

Label: 1217, Predict: The <REGISTER_CONTENT> in line noisy-ecology is <1217>., Parsed: 1217, prompt length: 7133
Using conversation template: vicuna_v1.1


15it [08:45, 35.06s/it]

Label: 21374, Predict: The <REGISTER_CONTENT> in line relieved-guacamole is <21374>., Parsed: 21374, prompt length: 7142
Using conversation template: vicuna_v1.1


16it [09:19, 34.83s/it]

Label: 5195, Predict: The <REGISTER_CONTENT> in line momentous-ruckus is <34408>., Parsed: 34408, prompt length: 7125
Using conversation template: vicuna_v1.1


17it [09:54, 34.70s/it]

Label: 42378, Predict: The <REGISTER_CONTENT> in line hissing-architect is <42378>., Parsed: 42378, prompt length: 7144
Using conversation template: vicuna_v1.1


18it [10:30, 35.07s/it]

Label: 39807, Predict: The <REGISTER_CONTENT> in line apathetic-iris is <39807>., Parsed: 39807, prompt length: 7151
Using conversation template: vicuna_v1.1


19it [11:05, 35.24s/it]

Label: 38877, Predict: The <REGISTER_CONTENT> in line talented-military is <38877>., Parsed: 38877, prompt length: 7114
Using conversation template: vicuna_v1.1


20it [11:40, 34.96s/it]

Label: 11672, Predict: The <REGISTER_CONTENT> in line foolish-karate is <11672>., Parsed: 11672, prompt length: 7123
Using conversation template: vicuna_v1.1


21it [12:14, 34.75s/it]

Label: 800, Predict: The <REGISTER_CONTENT> in line wet-leptocephalus is <800>., Parsed: 800, prompt length: 7110
Using conversation template: vicuna_v1.1


22it [12:51, 35.45s/it]

Label: 2547, Predict: The <REGISTER_CONTENT> in line berserk-cappelletti is <2547>., Parsed: 2547, prompt length: 7140
Using conversation template: vicuna_v1.1


23it [13:27, 35.50s/it]

Label: 41865, Predict: The <REGISTER_CONTENT> in line wacky-cob is <41865>., Parsed: 41865, prompt length: 7101
Using conversation template: vicuna_v1.1


24it [14:01, 35.10s/it]

Label: 3957, Predict: The <REGISTER_CONTENT> in line obedient-pony is <3957>., Parsed: 3957, prompt length: 7090
Using conversation template: vicuna_v1.1


25it [14:37, 35.40s/it]

Label: 47089, Predict: The <REGISTER_CONTENT> in line open-hippodrome is <47089>., Parsed: 47089, prompt length: 7158
Using conversation template: vicuna_v1.1


26it [15:11, 35.05s/it]

Label: 20619, Predict: The <REGISTER_CONTENT> in line dull-trap is <20619>., Parsed: 20619, prompt length: 7107
Using conversation template: vicuna_v1.1


27it [15:44, 34.39s/it]

Label: 38707, Predict: The <REGISTER_CONTENT> in line young-hair is <38707>., Parsed: 38707, prompt length: 7112
Using conversation template: vicuna_v1.1


28it [16:18, 34.37s/it]

Label: 48262, Predict: The <REGISTER_CONTENT> in line capable-ripple is <48262>., Parsed: 48262, prompt length: 7132
Using conversation template: vicuna_v1.1


29it [16:53, 34.36s/it]

Label: 14845, Predict: The <REGISTER_CONTENT> in line ignorant-purple is <14845>., Parsed: 14845, prompt length: 7143
Using conversation template: vicuna_v1.1


30it [17:24, 33.53s/it]

Label: 39683, Predict: The <REGISTER_CONTENT> in line ugly-real is <39683>., Parsed: 39683, prompt length: 7139
Using conversation template: vicuna_v1.1


31it [18:00, 34.15s/it]

Label: 9156, Predict: The <REGISTER_CONTENT> in line abashed-dulcimer is <9156>., Parsed: 9156, prompt length: 7108
Using conversation template: vicuna_v1.1


32it [18:34, 34.24s/it]

Label: 41880, Predict: The <REGISTER_CONTENT> in line sour-sensor is <41880>., Parsed: 41880, prompt length: 7149
Using conversation template: vicuna_v1.1


33it [19:11, 35.09s/it]

Label: 9659, Predict: The <REGISTER_CONTENT> in line crabby-lilac is <10923>., Parsed: 10923, prompt length: 7138
Using conversation template: vicuna_v1.1


34it [19:46, 34.98s/it]

Label: 8559, Predict: The <REGISTER_CONTENT> in line macho-cymbal is <8559>., Parsed: 8559, prompt length: 7160
Using conversation template: vicuna_v1.1


35it [20:19, 34.37s/it]

Label: 24384, Predict: The <REGISTER_CONTENT> in line industrious-discussion is <24384>., Parsed: 24384, prompt length: 7141
Using conversation template: vicuna_v1.1


36it [20:53, 34.35s/it]

Label: 18869, Predict: The <REGISTER_CONTENT> in line unable-rudiment is <18869>., Parsed: 18869, prompt length: 7136
Using conversation template: vicuna_v1.1


37it [21:26, 33.92s/it]

Label: 42911, Predict: The <REGISTER_CONTENT> in line calm-lunge is <42911>., Parsed: 42911, prompt length: 7134
Using conversation template: vicuna_v1.1


38it [22:03, 34.85s/it]

Label: 47394, Predict: The <REGISTER_CONTENT> in line tranquil-detective is <47394>., Parsed: 47394, prompt length: 7123
Using conversation template: vicuna_v1.1


39it [22:39, 35.07s/it]

Label: 5917, Predict: The <REGISTER_CONTENT> in line flippant-iron is <5917>., Parsed: 5917, prompt length: 7113
Using conversation template: vicuna_v1.1


40it [23:15, 35.25s/it]

Label: 45363, Predict: The <REGISTER_CONTENT> in line flagrant-championship is <45363>., Parsed: 45363, prompt length: 7135
Using conversation template: vicuna_v1.1


41it [23:50, 35.38s/it]

Label: 2391, Predict: The <REGISTER_CONTENT> in line tacit-banyan is <2391>., Parsed: 2391, prompt length: 7138
Using conversation template: vicuna_v1.1


42it [24:25, 35.13s/it]

Label: 6694, Predict: The <REGISTER_CONTENT> in line lamentable-clarification is <6694>., Parsed: 6694, prompt length: 7153
Using conversation template: vicuna_v1.1


43it [24:56, 34.04s/it]

Label: 178, Predict: The <REGISTER_CONTENT> in line impossible-mattress is <178>., Parsed: 178, prompt length: 7117
Using conversation template: vicuna_v1.1


44it [25:31, 34.10s/it]

Label: 40530, Predict: The <REGISTER_CONTENT> in line low-struggle is <40530>., Parsed: 40530, prompt length: 7119
Using conversation template: vicuna_v1.1


45it [26:02, 33.32s/it]

Label: 9638, Predict: The <REGISTER_CONTENT> in line mighty-hemp is <9638>., Parsed: 9638, prompt length: 7116
Using conversation template: vicuna_v1.1


46it [26:37, 33.71s/it]

Label: 30419, Predict: The <REGISTER_CONTENT> in line spicy-indicator is <30419>., Parsed: 30419, prompt length: 7156
Using conversation template: vicuna_v1.1


47it [27:12, 34.28s/it]

Label: 27954, Predict: The <REGISTER_CONTENT> in line bewildered-robe is <27954>., Parsed: 27954, prompt length: 7111
Using conversation template: vicuna_v1.1


48it [27:47, 34.28s/it]

Label: 1639, Predict: The <REGISTER_CONTENT> in line itchy-satisfaction is <1639>., Parsed: 1639, prompt length: 7135
Using conversation template: vicuna_v1.1


49it [28:21, 34.26s/it]

Label: 41767, Predict: The <REGISTER_CONTENT> in line bewildered-craft is <17565>., Parsed: 17565, prompt length: 7110
Using conversation template: vicuna_v1.1


50it [28:55, 34.71s/it]

Label: 6138, Predict: The <REGISTER_CONTENT> in line faulty-sushi is <6138>., Parsed: 6138, prompt length: 7119
************ Finish testing 300 lines per prompt with average prompt length 7126.68, accuracy: 0.94 ************





As shown at the end of the output, our model reaches an accuracy of 0.94 on the 300-line Line Retrieval testcases: 47 of the 50 predictions match their labels.
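Each log line carries a `Parsed` value extracted from the model's free-text answer. The following is a minimal sketch of one way to extract it, assuming the answer is the last integer in the response. `parse_last_number` is a hypothetical helper; the parsing inside `test_lines_one_sample` may differ.

```python
import re


def parse_last_number(response: str):
    """Return the last integer in the response, or None if there is none."""
    numbers = re.findall(r"\d+", response)
    return int(numbers[-1]) if numbers else None


# (label, model response) pairs copied from the log above:
# one correct prediction and one incorrect prediction.
samples = [
    (29324, "The <REGISTER_CONTENT> in line racial-fedora is <29324>."),
    (5195, "The <REGISTER_CONTENT> in line momentous-ruckus is <34408>."),
]
correct = sum(parse_last_number(resp) == label for label, resp in samples)
print(correct / len(samples))
```

Unlike Topic Retrieval, this task has an exact numeric answer, so accuracy can be computed by direct comparison without a ChatGPT judge.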