# Evaluation of long memory
The evaluation has two parts:
1. Evaluation as chat memory
2. Evaluation as a RAG system

The retrieved results are submitted to an LLM judge, and its verdicts are used to compute the score.

## Evaluation as chat memory
- Dataset: MSC (2022), as classified by MemGPT

In [1]:
import pandas as pd
df = pd.read_json('memory_datasets_full_rewrite.json', orient='records', lines=True)
df.head()

Unnamed: 0,metadata,normal_memory_question,conflict_memory_question,generated_answer_dialog,full_dialog,B_Contradiction,A_Correct
0,"{'initial_data_id': 'valid_1', 'session_id': 4}","{'B': 'Hey, remember that time we talked about...","{'B:': 'Hey, remember when we discussed our fa...","{'time': '7 days 8 hours ago', 'dialog': [{'A'...","[{'time': '7 days 8 hours ago', 'dialog': [{'A...",Contradiction,Correct
1,"{'initial_data_id': 'valid_2', 'session_id': 4}","{'B': 'Hey, remember that time we talked about...",{'B:': 'I remember you said you went to Disney...,"{'time': '14 days 1 hour ago', 'dialog': [{'A'...","[{'time': '14 days 1 hour ago', 'dialog': [{'A...",Contradiction,Correct
2,"{'initial_data_id': 'valid_0', 'session_id': 4}","{'B': 'Hey, remember that time we talked about...",{'B:': 'I remember you saying that your dad wa...,"{'time': '6 days 12 hours ago', 'dialog': [{'A...","[{'time': '6 days 12 hours ago', 'dialog': [{'...",Contradiction,Correct
3,"{'initial_data_id': 'valid_10', 'session_id': 4}","{'B': 'Hey, remember that time we talked about...",{'B:': 'I remember you said your parents were ...,"{'time': '6 days 6 hours ago', 'dialog': [{'A'...","[{'time': '6 days 6 hours ago', 'dialog': [{'A...",Contradiction,Correct
4,"{'initial_data_id': 'valid_7', 'session_id': 4}","{'B': 'Hey, remember that time we talked about...",{'B:': 'I remember you mentioned you prefer re...,"{'time': '11 days 4 hours ago', 'dialog': [{'A...","[{'time': '11 days 4 hours ago', 'dialog': [{'...",Contradiction,Correct


In [95]:
segments = []
current_segment = ""

for dialog in df['full_dialog'][1]:
    chat_log = dialog['dialog']
    for message in chat_log:
        for speaker, text in message.items():
            current_segment += f"{speaker}: {text} "
            # TODO: if a segment is too long, have the LLM rewrite it
            if speaker=="B":
                segments.append(current_segment.strip())
                current_segment = ""
    if current_segment:
        segments.append(current_segment.strip())
        current_segment = ""  # reset so the leftover is not repeated in the next session

In [None]:
for i, segment in enumerate(segments, 1):
    print(f"Segment {i}:\n{segment}\n")

In [73]:
segments[0]

'A: Hello, how are you doing? B: I love spending time with my family'
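
The segmentation cell above can be wrapped in a function so the same logic is reusable in the automated run. This is a minimal sketch, assuming each session is a dict with a `dialog` list of single-key `{speaker: text}` messages as in the dataset preview; the name `split_into_segments` and the `closing_speaker` parameter are hypothetical.

```python
def split_into_segments(full_dialog, closing_speaker="B"):
    """Split sessions into segments that end whenever `closing_speaker` talks."""
    segments = []
    for session in full_dialog:
        current = []
        for message in session["dialog"]:
            for speaker, text in message.items():
                current.append(f"{speaker}: {text}")
                if speaker == closing_speaker:
                    segments.append(" ".join(current))
                    current = []
        # Flush a trailing turn that was never answered within this session.
        if current:
            segments.append(" ".join(current))
    return segments
```

With this helper, `segments = split_into_segments(df['full_dialog'][1])` would stand in for the loop above. Unlike the loop, the buffer is local to each session, so an unanswered last turn cannot leak into the next session.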

In [2]:
from dotenv import load_dotenv
from openai import OpenAI
import os

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [3]:
def create_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

In [96]:
embeddings = []
for log in segments:
    vector = create_embedding(log)
    embeddings.append(vector)
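
The loop above sends one request per segment. The embeddings endpoint also accepts a list of strings, so a batched variant is possible. The sketch below is an assumption-laden alternative, not the notebook's method: `embed_batch` is a hypothetical helper, and it takes the client as a parameter so it can be exercised without network access.

```python
def embed_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request and return vectors in input order."""
    response = client.embeddings.create(input=list(texts), model=model)
    # Each returned item carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

Usage would look like `embeddings = embed_batch(client, segments)`. Very long lists may exceed the per-request limits of the API and would need to be split into chunks first.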

In [97]:
import numpy as np
from numpy.linalg import norm

vector_similarity = []
for i in range(len(embeddings)-1):
    score = np.dot(embeddings[i], embeddings[i+1]) / (norm(embeddings[i]) * norm(embeddings[i+1]))
    vector_similarity.append(score)
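
The same adjacent-pair cosine similarity can be computed without a Python loop. This is a sketch of an equivalent vectorized form, assuming all embeddings have the same length; `adjacent_cosine_similarity` is a hypothetical name.

```python
import numpy as np

def adjacent_cosine_similarity(vectors):
    """Cosine similarity between each vector and the one that follows it."""
    matrix = np.asarray(vectors, dtype=float)
    if len(matrix) < 2:
        return np.array([])
    # Normalise rows once so each similarity reduces to a dot product.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return np.sum(unit[:-1] * unit[1:], axis=1)
```

The result has `len(vectors) - 1` entries, where entry `i` compares segment `i` with segment `i + 1`, matching `vector_similarity` above.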

In [98]:
LOWER_BOUND = 0.35

# Similarity scores below the threshold mark a topic shift between adjacent segments.
breakpoints = [x for x in vector_similarity if x < LOWER_BOUND]

print("Breakpoints:", breakpoints)

Breakpoints: [0.28731827595322723, 0.323692597246544, 0.3071488772589786, 0.2854928553250751, 0.25922169305843074, 0.3068364065999007, 0.335511122009109]


### Group by breakpoints

Group the chat log segments and embed the full text of each group.

In [99]:
classify_group = []
start = 0
for i in breakpoints:
    classify_group.append(segments[start:vector_similarity.index(i)])
    start = vector_similarity.index(i)
classify_group.append(segments[start:])
eval_full_group_text = []
for group in classify_group:
    group_full_text = ' '.join(map(str, group))
    vector = create_embedding(group_full_text)
    eval_full_group_text.append({
        "text":group_full_text,
        "vector":vector
    })

In [102]:
eval_full_group_text[3]['text']

'A: I mostly eat a fresh and raw diet, so I save on groceries. B: Your economic skills are amazing'
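
The grouping cell looks up each breakpoint with `vector_similarity.index(i)`, which returns the first position holding that value, so two identical scores would resolve to the same position. A possible alternative is to track positions directly. The sketch below assumes that approach; `find_breakpoints` and `group_segments` are hypothetical names.

```python
def find_breakpoints(similarities, lower_bound=0.35):
    """Return the positions (not the values) whose similarity is below the bound."""
    return [i for i, score in enumerate(similarities) if score < lower_bound]

def group_segments(segments, breakpoint_indices):
    """Cut `segments` at each breakpoint index, skipping empty groups."""
    groups, start = [], 0
    for index in breakpoint_indices:
        # Same convention as the cell above: the segment at `index` opens the next group.
        if index > start:
            groups.append(segments[start:index])
            start = index
    groups.append(segments[start:])
    return groups
```

Skipping empty groups matters when the very first similarity is below the threshold, since an empty group would otherwise be sent to the embedding call as an empty string. Because `vector_similarity[i]` compares segments `i` and `i + 1`, cutting at `index + 1` would instead keep segment `i` with the group before it; the sketch keeps the notebook's convention so the recorded outputs stay comparable.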

Group the chat log segments, generate a summary for each group, and embed the summary.

In [5]:
from prompt import summary_prompt
def summary_group(chat_log):
    messages = [{"role": "user", "content": summary_prompt.format(chat_log=chat_log)}]
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    return completion.choices[0].message.content

In [103]:
classify_group_with_vector = []
start = 0
for i in breakpoints:
    group_list = []
    for j in range(start,vector_similarity.index(i)):
        log_memory = {
            "text":segments[j],
            "vector":embeddings[j]
        }
        group_list.append(log_memory)    
    classify_group_with_vector.append(group_list)
    start = vector_similarity.index(i)
group_list = []
for i in range(start,len(segments)):
    log_memory = {
        "text":segments[i],
        "vector":embeddings[i]
    }
    group_list.append(log_memory)
classify_group_with_vector.append(group_list)

In [104]:
eval_long_memory = []
for i, group in enumerate(classify_group_with_vector):
    group_text = summary_group(eval_full_group_text[i]['text'])
    group_dict = {
        "description":group_text,
        "child":group
    }
    eval_long_memory.append(group_dict)

In [106]:
from component import LongMemory
m = LongMemory()
for group in eval_long_memory:
    m.add_group_memory(group)

Embed the individual chat log segments only (no grouping).

In [84]:
eval_chat_log = []
for group in classify_group_with_vector:
    eval_chat_log.extend(group)
eval_chat_log[0]['text']

'A: Hello, how are you doing? B: I love spending time with my family'

### Evaluation 
Three test configurations are compared:
1. The long-memory structure
2. Group the chat log and embed the full text of each group directly
3. No grouping; embed each chat log segment only

Judge LLM

In [84]:
tools = [
    {
    "type": "function",
    "function": {
      "name": "judge",
      "parameters": {
        "type": "object",
        "properties": {
          "response": {"type": "string",
                      "description":"Is RAG sufficient",
                      "enum": ["sufficient", "insufficient"]},
        },
        "required": ["response"],
      },
    }
  }
]

judge_prompt = """You are a judge evaluating the output of a Retrieval-Augmented Generation (RAG) system. 
You will receive a Q&A and the corresponding source document that generated the question. 
Your task is to evaluate whether the memory system's response provides relevant information

Question:{question}

Generated document:{gold_document}

Memory response:{text}"""

In [85]:
import json

def judgement(question, gold_document, text):
    """Ask the judge model whether the retrieved text is sufficient for the question."""
    messages = [{"role": "user", "content": judge_prompt.format(question=question, gold_document=gold_document, text=text)}]
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
        tool_choice={
            "type":"function",
            "function":{"name":"judge"},
        }
    )
    # The tool-call arguments are returned as a JSON string.
    res = json.loads(completion.choices[0].message.tool_calls[0].function.arguments)
    return res["response"]
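
The `arguments` field of a tool call is a JSON string, and the model is not guaranteed to respect the enum. A minimal sketch of decoding the string with `json.loads` and checking the verdict against the two allowed values follows; `parse_verdict` and `ALLOWED_VERDICTS` are hypothetical names.

```python
import json

ALLOWED_VERDICTS = {"sufficient", "insufficient"}

def parse_verdict(arguments):
    """Decode tool-call arguments and return the judge's verdict."""
    payload = json.loads(arguments)
    verdict = payload.get("response")
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"Unexpected verdict: {verdict!r}")
    return verdict
```

Raising on an unexpected value keeps a malformed judge reply from being silently counted as a miss when the scores are aggregated.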

In [22]:
generated_answer_dialog = ""

for chat_log in df['generated_answer_dialog'][1]['dialog']:
    for speaker, text in chat_log.items():
        generated_answer_dialog += f"{speaker}: {text} "

In [107]:
question = f"B: {df['normal_memory_question'][1]['B']} A: {df['normal_memory_question'][1]['A']}"
question

'B: Hey, remember that time we talked about our jobs and expenses? What was that one thing you said you did to save money? A: I eat a fresh and raw diet to save on groceries.'

1. Long memory

In [108]:
question_vector = create_embedding(question)

In [109]:
documents = m.get_relavant_memory(question, vector=question_vector)

Searching with provided vector


In [98]:
if documents.get('memory'):
    text = f"Abstract: {documents.get('group_description')}\nOrigin text: {documents.get('memory')[0].get('text')}"
else:
    text = None
print(text)

Abstract: In the chat, Person A mentions that they primarily consume a fresh and raw diet, which helps them save money on groceries. Person B compliments A on their impressive economic skills.
Origin text: A: I mostly eat a fresh and raw diet, so I save on groceries. B: Your economic skills are amazing


In [90]:
judgement(question=question, gold_document=generated_answer_dialog, text=text)

'True'

2. Group the chat log and embed the full text of each group directly

In [99]:
max_score = 0
similar_group = None
for i, group in enumerate(eval_full_group_text):
    group_vector = group['vector']
    score = np.dot(group_vector, question_vector) / (norm(group_vector) * norm(question_vector))
    if score>max_score:
        max_score = score
        similar_group = group
text = similar_group['text']

In [100]:
judgement(question=question, gold_document=generated_answer_dialog, text=text)

'True'

3. No grouping; embed each chat log segment only

In [116]:
max_score = 0
similar_chat_log = None
for i, chat_log in enumerate(eval_chat_log):
    log_vector = chat_log['vector']
    score = np.dot(log_vector, question_vector) / (norm(log_vector) * norm(question_vector))
    if score>max_score:
        max_score = score
        similar_chat_log = chat_log
text = similar_chat_log['text']

'A: Maybe you should consider going back to school. I did. I major in economics. B: I have to walk 3 miles to work to save money everyday what do you do?'

In [117]:
judgement(question=question, gold_document=generated_answer_dialog, text=text)

'False'
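
Configurations 2 and 3 run the same nearest-neighbour search over a list of `{"text", "vector"}` dicts. That search could be factored into one helper, sketched below under the assumption that every item has a `vector` key; `most_similar` is a hypothetical name.

```python
import numpy as np

def most_similar(query_vector, items):
    """Return (item, score) for the entry whose 'vector' is closest to the query."""
    if not items:
        return None, None
    matrix = np.asarray([item["vector"] for item in items], dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Cosine similarity of the query against every stored vector at once.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    best = int(np.argmax(scores))
    return items[best], float(scores[best])
```

For example, `similar_group, score = most_similar(question_vector, eval_full_group_text)`. One behavioural difference from the loops above: those start from `max_score = 0` and leave the result as `None` when no score is positive, whereas this sketch always returns the best candidate of a non-empty list.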

### Auto evaluation

In [25]:
import pandas as pd
df = pd.read_json('memory_datasets_full_rewrite.json', orient='records', lines=True)
df = df.drop(['B_Contradiction', 'A_Correct', 'conflict_memory_question', 'metadata'], axis=1)
df['long_memory'] = None
df['long_memory_doc'] = None
df['full_group_text'] = None
df['full_group_doc'] = None
df['chat_log'] = None
df['chat_log_doc'] = None
df.head()

Unnamed: 0,normal_memory_question,generated_answer_dialog,full_dialog,long_memory,long_memory_doc,full_group_text,full_group_doc,chat_log,chat_log_doc
0,"{'B': 'Hey, remember that time we talked about...","{'time': '7 days 8 hours ago', 'dialog': [{'A'...","[{'time': '7 days 8 hours ago', 'dialog': [{'A...",,,,,,
1,"{'B': 'Hey, remember that time we talked about...","{'time': '14 days 1 hour ago', 'dialog': [{'A'...","[{'time': '14 days 1 hour ago', 'dialog': [{'A...",,,,,,
2,"{'B': 'Hey, remember that time we talked about...","{'time': '6 days 12 hours ago', 'dialog': [{'A...","[{'time': '6 days 12 hours ago', 'dialog': [{'...",,,,,,
3,"{'B': 'Hey, remember that time we talked about...","{'time': '6 days 6 hours ago', 'dialog': [{'A'...","[{'time': '6 days 6 hours ago', 'dialog': [{'A...",,,,,,
4,"{'B': 'Hey, remember that time we talked about...","{'time': '11 days 4 hours ago', 'dialog': [{'A...","[{'time': '11 days 4 hours ago', 'dialog': [{'...",,,,,,


In [111]:
for index in range(len(df)):
    print(f'---process : {index}---')
    # Process the full dialog
    segments = []
    current_segment = ""
    for dialog in df['full_dialog'][index]:
        chat_log = dialog['dialog']
        for message in chat_log:
            for speaker, text in message.items():
                current_segment += f"{speaker}: {text} "
                # TODO: if a segment is too long, have the LLM rewrite it
                if speaker=="B":
                    segments.append(current_segment.strip())
                    current_segment = ""
        if current_segment:
            segments.append(current_segment.strip())
            current_segment = ""  # reset so the leftover is not repeated in the next session
            
    # Embed each chat log segment
    embeddings = []
    for log in segments:
        vector = create_embedding(log)
        embeddings.append(vector)
        
    # Group the segments according to the breakpoints
    vector_similarity = []
    for i in range(len(embeddings)-1):
        score = np.dot(embeddings[i], embeddings[i+1]) / (norm(embeddings[i]) * norm(embeddings[i+1]))
        vector_similarity.append(score)
        
    LOWER_BOUND = 0.35
    breakpoints = [x for x in vector_similarity if x < LOWER_BOUND]
    
    classify_group = []
    start = 0
    for i in breakpoints:
        classify_group.append(segments[start:vector_similarity.index(i)])
        start = vector_similarity.index(i)
    classify_group.append(segments[start:])
    
    # Grouped: embed the full text of each group
    eval_full_group_text = []
    for group in classify_group:
        group_full_text = ' '.join(map(str, group))
        vector = create_embedding(group_full_text)
        eval_full_group_text.append({
            "text":group_full_text,
            "vector":vector
        })
    
    # Long-memory structure: group, summarize, then embed
    classify_group_with_vector = []
    start = 0
    for i in breakpoints:
        group_list = []
        for j in range(start,vector_similarity.index(i)):
            log_memory = {
                "text":segments[j],
                "vector":embeddings[j]
            }
            group_list.append(log_memory)    
        classify_group_with_vector.append(group_list)
        start = vector_similarity.index(i)
    group_list = []
    for i in range(start,len(segments)):
        log_memory = {
            "text":segments[i],
            "vector":embeddings[i]
        }
        group_list.append(log_memory)
    classify_group_with_vector.append(group_list)
    
    eval_long_memory = []
    for i, group in enumerate(classify_group_with_vector):
        group_text = summary_group(eval_full_group_text[i]['text'])
        group_dict = {
            "description":group_text,
            "child":group
        }
        eval_long_memory.append(group_dict)
    
    m = LongMemory()
    for group in eval_long_memory:
        m.add_group_memory(group)
    
    # Embed individual chat log segments only
    eval_chat_log = []
    for group in classify_group_with_vector:
        eval_chat_log.extend(group)
        
    # golden documents
    generated_answer_dialog = ""
    for chat_log in df['generated_answer_dialog'][index]['dialog']:
        for speaker, text in chat_log.items():
            generated_answer_dialog += f"{speaker}: {text} "
    
    # question
    question = f"B: {df['normal_memory_question'][index]['B']}"
    question_vector = create_embedding(question)
    
    # Evaluation
    # 1. long memory result
    documents = m.get_relavant_memory(question, vector=question_vector)
    if documents.get('memory'):
        long_memory_doc = f"Abstract: {documents.get('group_description')}\nOrigin text: {documents.get('memory')[0].get('text')}"
    else:
        long_memory_doc = None
    # 2. full group text embedding
    max_score = 0
    similar_group = None
    for i, group in enumerate(eval_full_group_text):
        group_vector = group['vector']
        score = np.dot(group_vector, question_vector) / (norm(group_vector) * norm(question_vector))
        if score>max_score:
            max_score = score
            similar_group = group
    full_group_doc = similar_group['text']
    # 3. chat log embedding only
    max_score = 0
    similar_chat_log = None
    for i, chat_log in enumerate(eval_chat_log):
        log_vector = chat_log['vector']
        score = np.dot(log_vector, question_vector) / (norm(log_vector) * norm(question_vector))
        if score>max_score:
            max_score = score
            similar_chat_log = chat_log
    chat_log_doc = similar_chat_log['text']
    # judge
    long_memory_result = judgement(question=question, gold_document=generated_answer_dialog, text=long_memory_doc)
    full_group_result = judgement(question=question, gold_document=generated_answer_dialog, text=full_group_doc)
    chat_log_result = judgement(question=question, gold_document=generated_answer_dialog, text=chat_log_doc)
    # record
    df.loc[index, 'long_memory'] = long_memory_result
    df.loc[index, 'long_memory_doc'] = long_memory_doc
    df.loc[index, 'full_group_text'] = full_group_result
    df.loc[index, 'full_group_doc'] = full_group_doc
    df.loc[index, 'chat_log'] = chat_log_result
    df.loc[index, 'chat_log_doc'] = chat_log_doc

---process : 1---
Searching with provided vector


In [112]:
df[:5]

Unnamed: 0,normal_memory_question,generated_answer_dialog,full_dialog,long_memory,long_memory_doc,full_group_text,full_group_doc,chat_log,chat_log_doc
0,"{'B': 'Hey, remember that time we talked about...","{'time': '7 days 8 hours ago', 'dialog': [{'A'...","[{'time': '7 days 8 hours ago', 'dialog': [{'A...",insufficient,"Abstract: In the chat, A greets B and shares t...",insufficient,A: Hi! How are you doing tonight? B: I'm doing...,sufficient,A: A little bit. I can get into taylor swift. ...
1,"{'B': 'Hey, remember that time we talked about...","{'time': '14 days 1 hour ago', 'dialog': [{'A'...","[{'time': '14 days 1 hour ago', 'dialog': [{'A...",sufficient,Abstract: A mentions that they primarily consu...,sufficient,"A: I mostly eat a fresh and raw diet, so I sav...",insufficient,A: Maybe you should consider going back to sch...
2,"{'B': 'Hey, remember that time we talked about...","{'time': '6 days 12 hours ago', 'dialog': [{'A...","[{'time': '6 days 12 hours ago', 'dialog': [{'...",sufficient,"Abstract: In the chat, person A and person B d...",insufficient,"A: Hello what are doing today? B: I am good, I...",sufficient,A: Neat!! I used to work in the human services...
3,"{'B': 'Hey, remember that time we talked about...","{'time': '6 days 6 hours ago', 'dialog': [{'A'...","[{'time': '6 days 6 hours ago', 'dialog': [{'A...",sufficient,"Abstract: In this chat, A and B engage in a co...",sufficient,A: I never drink or use drugs. I am 19 and jus...,sufficient,A: Not sure. I have a part time job at burger ...
4,"{'B': 'Hey, remember that time we talked about...","{'time': '11 days 4 hours ago', 'dialog': [{'A...","[{'time': '11 days 4 hours ago', 'dialog': [{'...",sufficient,"Abstract: In the chat, Person A and Person B d...",sufficient,A: Same. I try to get a small workout in a thr...,sufficient,A: Same. I try to get a small workout in a thr...


In [93]:
index = 1
s = f"Question: B: {df['normal_memory_question'][index]}\n\nlong memory:\n{df['long_memory'][index]}\n{df['long_memory_doc'][index]}\n\nfull group\n{df['full_group_text'][index]}\n{df['full_group_doc'][index]}\n\nchat log\n{df['chat_log'][index]}\n{df['chat_log_doc'][index]}"
print(s)

Question: B: {'B': 'Hey, remember that time we talked about our jobs and expenses? What was that one thing you said you did to save money?', 'A': 'I eat a fresh and raw diet to save on groceries.'}

long memory:
insufficient
Abstract: In the chat, Person A and Person B discuss B's interest in studying economics instead of medicine, emphasizing the importance of pursuing a career that brings happiness. A shares their experience of moving into a new home, mentioning they haven't furnished it yet due to financial constraints, but have future plans for the space. B expresses hope that A will make good money at their new job to furnish the house, while A acknowledges the challenge of spending on the house. The conversation wraps up with encouragement about managing finances wisely.
Origin text: A: I'm hoping so. I'll still have no money though as I will be spending it all on the new house! B: At least those are worthwile expenses.  Your an economist, so I trust you'll figure out the best wa

In [87]:
judgement(question=question, gold_document=generated_answer_dialog, text=df['full_group_doc'][index])

'sufficient'

In [82]:
generated_answer_dialog = ""

for chat_log in df['generated_answer_dialog'][index]['dialog']:
    for speaker, text in chat_log.items():
        generated_answer_dialog += f"{speaker}: {text} "
question = f"B: {df['normal_memory_question'][index]['B']} A: {df['normal_memory_question'][index]['A']}"

print(judge_prompt.format(question=question, gold_document=generated_answer_dialog, text=df['long_memory_doc'][index]))

You are a judge evaluating the output of a Retrieval-Augmented Generation (RAG) system. 
You will receive a Q&A and the corresponding source document that generated the question. 
Your task is to evaluate whether the RAG system's response provides sufficient information
If the response is irrelevant or if the provided context is insufficient to make a determination, consider it an incorrect or inadequate answer.

Question:B: Hey, remember that time we talked about our favorite movies? What was yours? A: Clueless!

Generated document:A: Hello, I'm sitting here with my dog. How are you? B: I'm well friend. Looking for new employment at the moment. A: What would be your dream job? B: A writer. I'm currently an er doctor. A: What was the worst accident you have seen in the er? B: A man had his throat slit in a home invasion A: That is very scary. I would rather stick to my knitting passion. B: I have a daughter who people say is a child prodigy A: What talents does she have? B: Math! I hat

In [94]:
df.to_json('evaluation.json', orient='records')

### Observe results

In [None]:
import pandas as pd

df = pd.read_json('evaluation.json')

In [116]:
long_memory_score = df[df['long_memory'] == 'sufficient'].shape[0]
full_group_score = df[df['full_group_text'] == 'sufficient'].shape[0]
chat_log_score = df[df['chat_log'] == 'sufficient'].shape[0]
print(f'MSC datasets counts:{len(df)}')
print(f'Long memory score:{(long_memory_score/len(df))*100}%')
print(f'Full group score:{(full_group_score/len(df))*100}%')
print(f'Chat log score:{(chat_log_score/len(df))*100}%')

MSC datasets counts:500
Long memory score:71.2%
Full group score:53.2%
Chat log score:59.4%
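
Each score above is the share of rows whose verdict is `sufficient`. A small helper makes that explicit and avoids repeating the filter three times. This is a sketch only; `sufficiency_rate` is a hypothetical name and it accepts any iterable of verdict strings, including a DataFrame column.

```python
def sufficiency_rate(verdicts, positive="sufficient"):
    """Percentage of verdicts equal to `positive`, rounded to one decimal."""
    verdicts = list(verdicts)
    if not verdicts:
        return 0.0
    hits = sum(1 for verdict in verdicts if verdict == positive)
    return round(100 * hits / len(verdicts), 1)
```

For example, `sufficiency_rate(df['long_memory'])` would give the long-memory score as a percentage.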


## Evaluation as a RAG system
- Dataset: MultiHop-RAG (2024)