## Tutorial: Coarse-grained Topic Retrieval and Fine-grained Line Retrieval Evaluation ##

In this tutorial, we walk through the Topic Retrieval and Line Retrieval evaluations on our model LongChat-13B-16K. Specifically, we evaluate LongChat-13B-16K on Topic Retrieval with 10-topic testcases and use our auto_topic_eval module to check the accuracy of the outputs. We then run the Line Retrieval evaluation with 300-line testcases. By the end of the tutorial, you should know how to use the evaluation module and how to read its output.

### Topic Retrieval Evaluation ###
In this section, we demonstrate how to run the Topic Retrieval evaluation on our model LongChat-13B-16K with 10-topic testcases and use our auto_topic_eval module to examine the accuracy of the model outputs.

#### Step 1: Import necessary modules ####

In [8]:
from longeval.utils import maybe_monkey_patch, get_output_dir, longeval_load_model, load_testcases, test_topics_one_sample, test_lrt_one_sample 
from longeval.eval import longeval_test
from argparse import Namespace
from tqdm import tqdm

import os

#### Step 2: Configure evaluation options and set up the output directory ####

Here we set the model to be evaluated to LongChat-13B-16K. The evaluation task we run first is Topic Retrieval. A single GPU with 40 GB of memory is used for this evaluation. Flash-attention is enabled to save memory.

In [2]:
args = Namespace(model_name_or_path="lmsys/longchat_13b_16k",
                 task="topics",
                 num_gpus=1,
                 max_gpu_memory=40,
                 longchat_ratio=8,
                 longchat_flash_attn=True)

output_dir = get_output_dir(args)

output to evaluation/topics/predictions/longchat_13b_16k
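The printed path suggests that the output directory is derived from the task name and the last component of the model path. The sketch below reproduces that layout with a hypothetical helper, `sketch_output_dir`. It is an assumption inferred from the printed output, not the actual implementation of `get_output_dir`.

```python
import os
from argparse import Namespace


def sketch_output_dir(args: Namespace) -> str:
    """Hypothetical stand-in for get_output_dir, inferred from the printed path."""
    # Assumption: only the final component of the model path is used.
    model_dir = os.path.basename(os.path.normpath(args.model_name_or_path))
    return "/".join(["evaluation", args.task, "predictions", model_dir])


demo_args = Namespace(model_name_or_path="lmsys/longchat_13b_16k", task="topics")
print(sketch_output_dir(demo_args))
```

With `task="lrt"` the same helper would point at `evaluation/lrt/predictions/longchat_13b_16k`, which matches the path printed in the Line Retrieval section.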


#### Step 3: Load patches, tokenizer, and model ####

In [3]:
maybe_monkey_patch(args)
model, tokenizer = longeval_load_model(args)

lmsys/longchat_13b_16k
building interpolation to 16384

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
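The `building interpolation to 16384` message is consistent with `longchat_ratio=8` applied to a 2048-token base context, since 2048 × 8 = 16384. The sketch below illustrates the general idea of position interpolation: positions in the long sequence are divided by the ratio so that they fall back into the original range. The base length of 2048 and the helper name `interpolate_positions` are assumptions for illustration and are not taken from the `longeval` code.

```python
from typing import List


def interpolate_positions(seq_len: int, ratio: float) -> List[float]:
    """Scale integer positions 0..seq_len-1 down by `ratio` (illustrative only)."""
    return [pos / ratio for pos in range(seq_len)]


base_context = 2048  # assumed original context length
ratio = 8            # matches longchat_ratio in the configuration above
target_context = base_context * ratio

positions = interpolate_positions(target_context, ratio)
print(target_context)                  # 16384
print(max(positions) < base_context)   # True
```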

#### Step 4: Start the evaluation ####

In [9]:
# we use the 10-topic testcases
num_topics = 10

print(f"************ Start testing {num_topics} topics per prompt ***********")
# running average of the prompt length across all testcases
avg_length = 0

# test_file contains our pre-generated testcases
test_file = f"evaluation/topics/testcases/{num_topics}_topics.jsonl"
# the output of this evaluation is directed to output_file
output_file = os.path.join(output_dir, f"{num_topics}_response.txt")

# load testcases and start evaluation
test_cases = load_testcases(test_file)
for idx, test_case in tqdm(enumerate(test_cases)):
    _, prompt_length, summary = test_topics_one_sample(model=model, tokenizer=tokenizer, test_case=test_case, output_file=output_file, idx=idx, args=args)
    avg_length += prompt_length / len(test_cases)

print(f"************ Finish testing {num_topics} topics per prompt with average prompt length {avg_length} ************")

************ Start testing 10 topics per prompt ***********


1it [00:19, 19.21s/it]

Label: The future of sustainable agriculture, Predict: ['The first topic we discussed was the future of sustainable agriculture.'], prompt length: 6472


2it [00:39, 19.96s/it]

Label: The impact of technology on privacy and security, Predict: ['The first topic we discussed was the impact of technology on privacy and security.'], prompt length: 6479


3it [01:02, 21.42s/it]

Label: The effects of climate change on ocean ecosystems, Predict: ['The first topic we discussed was "The effects of climate change on ocean ecosystems."'], prompt length: 6148


4it [01:19, 19.40s/it]

Label: The role of sports in society, Predict: ['The first topic we discussed was the role of sports in society.'], prompt length: 6263


5it [01:36, 18.71s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6192


6it [01:55, 18.76s/it]

Label: The history and impact of the Renaissance, Predict: ['The first topic we discussed was the history and impact of the Renaissance.'], prompt length: 6690


7it [02:14, 19.00s/it]

Label: The impact of technology on human connection, Predict: ['The first topic we discussed was "The impact of technology on human connection."'], prompt length: 6540


8it [02:32, 18.38s/it]

Label: The benefits of reading for pleasure, Predict: ['The first topic we discussed was the benefits of reading for pleasure.'], prompt length: 6547


9it [02:48, 17.76s/it]

Label: The role of sports in society, Predict: ['The first topic we discussed was the role of sports in society.'], prompt length: 6289


10it [03:05, 17.67s/it]

Label: The history and culture of ancient civilizations, Predict: ['The first topic we discussed was the history and culture of ancient civilizations.'], prompt length: 5978


11it [03:24, 17.86s/it]

Label: The future of renewable energy storage, Predict: ['The first topic we discussed was the future of renewable energy storage.'], prompt length: 6594


12it [03:43, 18.29s/it]

Label: The benefits of spending time in nature, Predict: ['The first topic we discussed was the benefits of spending time in nature.'], prompt length: 6401


13it [04:00, 17.94s/it]

Label: The role of art in society, Predict: ['The first topic we discussed was The role of art in society.'], prompt length: 6606


14it [04:18, 18.01s/it]

Label: The history and impact of the Renaissance, Predict: ['The first topic we discussed was the history and impact of the Renaissance.'], prompt length: 6503


15it [04:35, 17.72s/it]

Label: The role of sports in society, Predict: ['The first topic we discussed was the role of sports in society.'], prompt length: 6562


16it [04:55, 18.39s/it]

Label: The history and culture of the Middle Ages, Predict: ['The first topic we discussed was the history and culture of the Middle Ages.'], prompt length: 6339


17it [05:14, 18.50s/it]

Label: The benefits of mindfulness meditation, Predict: ['The first topic we discussed was the benefits of mindfulness meditation.'], prompt length: 6334


18it [05:35, 19.13s/it]

Label: The benefits of a plant-based diet, Predict: ['The first topic we discussed was the benefits of a plant-based diet.'], prompt length: 6454


19it [05:53, 18.84s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6426


20it [06:13, 19.16s/it]

Label: The benefits of a plant-based diet, Predict: ['The first topic we discussed was the benefits of a plant-based diet.'], prompt length: 6283


21it [06:28, 17.96s/it]

Label: The benefits of regular exercise, Predict: ['The first topic we discussed was the benefits of regular exercise.'], prompt length: 6197


22it [06:51, 19.56s/it]

Label: The impact of social media on mental health in adults, Predict: ['The first topic we discussed was the impact of social media on mental health in adults.'], prompt length: 6615


23it [07:14, 20.60s/it]

Label: The impact of social media on mental health in adults, Predict: ['The first topic we discussed was the impact of social media on mental health in adults.'], prompt length: 6500


24it [07:29, 18.98s/it]

Label: The benefits of volunteering, Predict: ['The first topic we discussed was the benefits of volunteering.'], prompt length: 6294


25it [07:49, 19.10s/it]

Label: The future of sustainable agriculture, Predict: ['The first topic we discussed was the future of sustainable agriculture.'], prompt length: 6424


26it [08:09, 19.38s/it]

Label: The history and culture of the Middle Ages, Predict: ['The first topic we discussed was the history and culture of the Middle Ages.'], prompt length: 6379


27it [08:24, 18.14s/it]

Label: The psychology of happiness, Predict: ['The first topic we discussed was the psychology of happiness.'], prompt length: 6322


28it [08:42, 18.16s/it]

Label: The benefits of reading for pleasure, Predict: ['The first topic we discussed was "The benefits of reading for pleasure."'], prompt length: 6451


29it [09:04, 19.09s/it]

Label: The history and culture of the Middle Ages, Predict: ['The first topic we discussed was the history and culture of the Middle Ages.'], prompt length: 6732


30it [09:22, 18.95s/it]

Label: The history and culture of ancient civilizations, Predict: ['The first topic we discussed was the history and culture of ancient civilizations.'], prompt length: 6162


31it [09:43, 19.45s/it]

Label: The effects of air pollution on human health, Predict: ['The first topic we discussed was the effects of air pollution on human health.'], prompt length: 6506


32it [09:58, 18.19s/it]

Label: The benefits of volunteering, Predict: ['The first topic we discussed was the benefits of volunteering.'], prompt length: 6336


33it [10:21, 19.61s/it]

Label: The impact of social media on mental health in adults, Predict: ['The first topic we discussed was the impact of social media on mental health in adults.'], prompt length: 6410


34it [10:38, 18.99s/it]

Label: The history and impact of the Renaissance, Predict: ['The first topic we discussed was the history and impact of the Renaissance.'], prompt length: 6179


35it [10:57, 18.74s/it]

Label: The future of sustainable agriculture, Predict: ['The first topic we discussed was the future of sustainable agriculture.'], prompt length: 6138


36it [11:16, 19.05s/it]

Label: The benefits of a plant-based diet, Predict: ['The first topic we discussed was the benefits of a plant-based diet.'], prompt length: 6156


37it [11:35, 18.85s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6629


38it [11:51, 18.16s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6092


39it [12:09, 17.98s/it]

Label: The impact of technology on human connection, Predict: ['The first topic we discussed was the impact of technology on human connection.'], prompt length: 6294


40it [12:27, 17.94s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6392


41it [12:44, 17.84s/it]

Label: The history and impact of the Renaissance, Predict: ['The first topic we discussed was the history and impact of the Renaissance.'], prompt length: 6338


42it [13:00, 17.21s/it]

Label: The psychology of happiness, Predict: ['The first topic we discussed was The psychology of happiness.'], prompt length: 6414


43it [13:18, 17.31s/it]

Label: The effects of sleep on overall health, Predict: ['The first topic we discussed was the effects of sleep on overall health.'], prompt length: 6310


44it [13:40, 18.78s/it]

Label: The impact of social media on mental health in adults, Predict: ['The first topic we discussed was the impact of social media on mental health in adults.'], prompt length: 6283


45it [13:58, 18.59s/it]

Label: The impact of social media on communication, Predict: ['The first topic we discussed was the impact of social media on communication.'], prompt length: 6413


46it [14:14, 17.77s/it]

Label: The psychology of happiness, Predict: ['The first topic we discussed was The psychology of happiness.'], prompt length: 6583


47it [14:30, 17.34s/it]

Label: The role of art in society, Predict: ['The first topic we discussed was The role of art in society.'], prompt length: 6266


48it [14:48, 17.62s/it]

Label: The effects of sleep on overall health, Predict: ['The first topic we discussed was the effects of sleep on overall health.'], prompt length: 6546


49it [15:05, 17.28s/it]

Label: The benefits of reading for pleasure, Predict: ['The first topic we discussed was "The benefits of reading for pleasure."'], prompt length: 6066


50it [15:23, 18.47s/it]

Label: The benefits of learning a new language, Predict: ['The first topic we discussed was the benefits of learning a new language.'], prompt length: 6520
************ Finish testing 10 topics per prompt with average prompt length 6380.939999999998 ************
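The trailing digits in `6380.939999999998` come from floating-point rounding: the loop adds `prompt_length / len(test_cases)` once per testcase instead of dividing a single integer sum. A minimal sketch of the two ways to compute the mean, using the first five prompt lengths from the log above as sample data:

```python
import math

# First five prompt lengths copied from the log above (sample data only).
prompt_lengths = [6472, 6479, 6148, 6263, 6192]

# Incremental form, as in the evaluation loop.
avg_incremental = 0.0
for length in prompt_lengths:
    avg_incremental += length / len(prompt_lengths)

# Single division at the end; the integer sum itself is exact.
avg_direct = sum(prompt_lengths) / len(prompt_lengths)

print(avg_incremental, avg_direct)
print(math.isclose(avg_incremental, avg_direct))
```

Both forms agree to within rounding error, so the reported average is fine for comparison purposes; rounding it when printing (for example with `{avg_length:.2f}`) would only tidy the display.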





#### Step 5: Examine the model output with the auto_topic_eval module ####
Because the model outputs are free-form natural language, there is no simple rule-based way to parse them and check their correctness. We therefore developed the auto_topic_eval module, which queries ChatGPT to assess the model outputs.

Before we start using our auto_topic_eval module, we need to set OPENAI_API_KEY as an environment variable, e.g. export OPENAI_API_KEY=<YOUR_KEY>.
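A quick way to confirm that the key is visible to the notebook process is to check the environment before calling the module. This is a minimal sketch using only the standard library; the helper name `has_openai_key` is made up for illustration and is not part of `longeval`.

```python
import os


def has_openai_key(env=os.environ) -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before running auto-evaluation.")
```

Note that a variable exported in a shell after the notebook server has started is not picked up by the already-running kernel, so the check is worth running inside the notebook itself.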

Then we can load the model output from the previous step and pass it to ChatGPT for assessment.

In [18]:
from longeval.auto_topic_eval import chatgpt_auto_eval

In [19]:
with open(output_file, 'r') as json_file:
    json_list = list(json_file)

chatgpt_auto_eval(json_list)

--------------- Start auto-evaluation, you should verify it does this correctly --------------
Question #0: Label: The future of sustainable agriculture, model output: 'The first topic we discussed was the future of sustainable agriculture. - auto-eval goes with correct
Question #1: Label: The impact of technology on privacy and security, model output: 'The first topic we discussed was the impact of technology on privacy and security. - auto-eval goes with correct
Question #2: Label: The effects of climate change on ocean ecosystems, model output: 'The first topic we discussed was "The effects of climate change on ocean ecosystems." - auto-eval goes with correct
Question #3: Label: The role of sports in society, model output: 'The first topic we discussed was the role of sports in society. - auto-eval goes with correct
Question #4: Label: The benefits of learning a new language, model output: 'The first topic we discussed was the benefits of learning a new language. - auto-eval goes wi

Question #45: Label: The psychology of happiness, model output: 'The first topic we discussed was The psychology of happiness. - auto-eval goes with correct
Question #46: Label: The role of art in society, model output: 'The first topic we discussed was The role of art in society. - auto-eval goes with correct
Question #47: Label: The effects of sleep on overall health, model output: 'The first topic we discussed was the effects of sleep on overall health. - auto-eval goes with correct
Question #48: Label: The benefits of reading for pleasure, model output: 'The first topic we discussed was "The benefits of reading for pleasure." - auto-eval goes with correct
Question #49: Label: The benefits of learning a new language, model output: 'The first topic we discussed was the benefits of learning a new language. - auto-eval goes with correct
---------- End auto-evaluation, predict accuracy 1.0 ---------------


As shown at the end of the outputs, the Topic Retrieval accuracy of our LongChat-13B-16K model on 10-topic testcases is 1.0.
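Since the auto-evaluation banner asks you to verify its verdicts, a cheap cross-check is a case-insensitive substring match between each label and the model output. The sketch below is not how auto_topic_eval works; it is a simple baseline, and the helper name `label_in_prediction` is an assumption introduced for illustration.

```python
def label_in_prediction(label: str, prediction: str) -> bool:
    """Case-insensitive check that the label text appears in the prediction."""
    return label.strip().lower() in prediction.lower()


# Label/prediction pairs copied from the evaluation log above.
samples = [
    ("The future of sustainable agriculture",
     "The first topic we discussed was the future of sustainable agriculture."),
    ("The effects of climate change on ocean ecosystems",
     'The first topic we discussed was "The effects of climate change on ocean ecosystems."'),
    ("The role of art in society",
     "The first topic we discussed was The role of art in society."),
]

matches = [label_in_prediction(label, pred) for label, pred in samples]
print(f"{sum(matches)}/{len(matches)} labels found verbatim")
```

A correct but paraphrased answer would fail this check, which is why a language-model judge is used for the reported accuracy; the substring match is only useful for spotting disagreements worth inspecting by hand.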

### Line Retrieval Evaluation ###
In this section, we show the steps to run Line Retrieval evaluation on our model LongChat-13B-16K with 300-line testcases.

The steps of running Line Retrieval evaluation are very similar to those of Topic Retrieval evaluation. 

#### Step 1: Import necessary modules ####

In [1]:
from longeval.utils import maybe_monkey_patch, get_output_dir, longeval_load_model, load_testcases, test_topics_one_sample, test_lrt_one_sample 
from longeval.eval import longeval_test
from argparse import Namespace
from tqdm import tqdm

import os

#### Step 2: Configure evaluation options and set up the output directory ####

Here we again set the model to be evaluated to LongChat-13B-16K. The evaluation task this time is Line Retrieval. A single GPU with 40 GB of memory is used for this evaluation. Flash-attention is enabled to save memory.

In [2]:
# we change the task to "lrt" for the Line Retrieval evaluation
args = Namespace(model_name_or_path="lmsys/longchat_13b_16k",
                 task="lrt",
                 num_gpus=1,
                 max_gpu_memory=40,
                 longchat_ratio=8,
                 longchat_flash_attn=True)

output_dir = get_output_dir(args)

output to evaluation/lrt/predictions/longchat_13b_16k


#### Step 3: Load patches, tokenizer, and model ####

In [3]:
maybe_monkey_patch(args)
model, tokenizer = longeval_load_model(args)

lmsys/longchat_13b_16k
building interpolation to 16384

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

#### Step 4: Start the evaluation ####

In [6]:
# we use 300-line testcases for this demonstration
num_lines = 300

print(f"************ Start testing {num_lines} lines per LRT prompt ************")
# running average of the prompt length across all testcases
avg_length = 0
# a variable used to count the number of correct model outputs
num_correct = 0

# test_file contains our pre-generated testcases
test_file = f"evaluation/lrt/testcases/{num_lines}_lines.jsonl"
# the output of this evaluation is directed to output_file
output_file = os.path.join(output_dir, f"{num_lines}_response.txt")

# load testcases and start evaluation
test_cases = load_testcases(test_file)
for idx, test_case in tqdm(enumerate(test_cases)):
    correct, prompt_length, summary = test_lrt_one_sample(model=model, tokenizer=tokenizer, test_case=test_case, output_file=output_file, idx=idx, args=args)
    avg_length += prompt_length / len(test_cases)
    num_correct += correct
accuracy = num_correct / len(test_cases)

with open(output_file, "a+") as f:
    f.write(f"Accuracy: {accuracy}")

print(f"************ Finish testing {num_lines} lines per prompt with average prompt length {avg_length}, accuracy: {accuracy} ************")

************ Start testing 300 lines per LRT prompt ************
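Because the cell above opens the response file in `"a+"` mode, rerunning it appends another `Accuracy: ...` entry rather than replacing the previous one. The sketch below shows one way to read back the most recent value from such text. The helper name `last_accuracy` and the example string are placeholders for illustration, not real evaluation results.

```python
import re
from typing import Optional


def last_accuracy(text: str) -> Optional[float]:
    """Return the last 'Accuracy: <number>' value found in text, or None."""
    found = re.findall(r"Accuracy:\s*([0-9]*\.?[0-9]+)", text)
    return float(found[-1]) if found else None


# Placeholder contents for illustration only; real response files will differ.
example = "...responses...\nAccuracy: 0.9\n...responses...\nAccuracy: 0.94"
print(last_accuracy(example))
```

To apply it to a real run, pass the contents of `output_file` read with `open(output_file).read()`.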
