In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [19]:
import nest_asyncio

nest_asyncio.apply()

In [3]:
import pandas as pd
import os
import pathlib

root_dir = pathlib.PurePath(os.path.dirname(os.getcwd())).parent
data_dir = os.path.join(root_dir, 'data')

# Evaluating Conversation Data
## Building the Data

First, convert the [DSTC11-Track5](https://github.com/alexa/dstc11-track5) data into conversation data for DeepEval.

In [7]:
from datasets import load_dataset

ds = load_dataset("NomaDamas/DSTC-11-Track-5", "default")
ds_df = ds["train"].to_pandas()
original_df = ds_df.loc[ds_df['target'] == True].sample(20).reset_index(drop=True)

In [8]:
original_df.head()

| | log | target | knowledge | response |
|---|---|---|---|---|
| 0 | [{'speaker': 'U', 'text': 'Can you help me fin... | True | [{'doc_id': 7, 'doc_type': 'faq', 'domain': 'r... | Yes, the patio for outdoor eating is really ni... |
| 1 | [{'speaker': 'U', 'text': 'I need a 4 star hot... | True | [{'doc_id': 1, 'doc_type': 'review', 'domain':... | The Autumn House has pretty bad reviews about ... |
| 2 | [{'speaker': 'U', 'text': 'I'm looking for a t... | True | [{'doc_id': 8, 'doc_type': 'review', 'domain':... | Hamilton Lodge offers clean rooms for their gu... |
| 3 | [{'speaker': 'U', 'text': 'Hi I am looking to ... | True | [{'doc_id': 1, 'doc_type': 'review', 'domain':... | Although one reviewer found the staff attentiv... |
| 4 | [{'speaker': 'U', 'text': 'A moderately priced... | True | [{'doc_id': 1, 'doc_type': 'review', 'domain':... | Yes, those that have previously dined at the C... |


In [9]:
original_df.iloc[0]['log']

array([{'speaker': 'U', 'text': 'Can you help me find a restaurant that serves African food with a moderate price range please'},
       {'speaker': 'S', 'text': "I am sorry, there aren't any options available. May I ask if there is another type of restaurant you would be interested in?"},
       {'speaker': 'U', 'text': 'Yes how about Asian food in the same price range?'},
       {'speaker': 'S', 'text': 'Yes I have the Yippee Noodle Bar in the center of town on King street in the moderate price range. They serve Asian cuisine. Is there anything else I can do for you?'},
       {'speaker': 'U', 'text': 'Yeah, are they situated in a nice part of town that provides a nice outdoor eating experience?'}],
      dtype=object)

In [10]:
original_df.iloc[0]['response']

'Yes, the patio for outdoor eating is really nice, especially in the fall. Do you want to make a reservation?'

Next, create one `ConversationalTestCase` instance per sampled dialogue.

In [17]:
from deepeval.test_case import LLMTestCase, ConversationalTestCase

conversation_test_cases = []
for _, row in original_df.iterrows():
    turns = []
    full_dialog = row["log"]
    # Pair each utterance with the one that follows it, passing the text
    # (not the whole {'speaker', 'text'} dict) to LLMTestCase.
    for i in range(len(full_dialog) - 1):
        turns.append(LLMTestCase(input=full_dialog[i]["text"],
                                 actual_output=full_dialog[i + 1]["text"]))
    # The last user utterance is answered by the reference response.
    turns.append(LLMTestCase(input=full_dialog[-1]["text"], actual_output=row["response"]))
    conversation_test_cases.append(ConversationalTestCase(
        turns=turns,
        chatbot_role="Chatbot is a helpful assistant to find a great hotel.",
    ))
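
The loop above pairs every utterance with the one that follows it, so some turns end up with a system utterance as `input` and a user utterance as `actual_output`. As a rough alternative, the sketch below keeps only user-to-system pairs. The helper name `pair_user_system_turns` is hypothetical, the sample texts are abbreviated from the first dialogue, and the function works on plain dicts so it can be tried without DeepEval.

```python
from typing import Dict, List, Tuple


def pair_user_system_turns(
    log: List[Dict[str, str]], final_response: str
) -> List[Tuple[str, str]]:
    """Return (user_text, system_text) pairs.

    A trailing user utterance with no system reply in the log is paired
    with `final_response` (the reference answer from the dataset row).
    """
    pairs = []
    pending_user = None
    for utterance in log:
        if utterance["speaker"] == "U":
            pending_user = utterance["text"]
        elif utterance["speaker"] == "S" and pending_user is not None:
            pairs.append((pending_user, utterance["text"]))
            pending_user = None
    if pending_user is not None:
        pairs.append((pending_user, final_response))
    return pairs


# Abbreviated version of the first dialogue shown earlier.
log = [
    {"speaker": "U", "text": "Can you help me find a restaurant that serves African food?"},
    {"speaker": "S", "text": "I am sorry, there aren't any options available."},
    {"speaker": "U", "text": "Yes how about Asian food in the same price range?"},
    {"speaker": "S", "text": "Yes I have the Yippee Noodle Bar in the center of town."},
    {"speaker": "U", "text": "Do they provide a nice outdoor eating experience?"},
]
response = "Yes, the patio for outdoor eating is really nice, especially in the fall."

pairs = pair_user_system_turns(log, response)
print(len(pairs))  # 3
```

Each pair could then be passed as `LLMTestCase(input=user_text, actual_output=system_text)`. Turn numbers reported by the metrics would differ from the outputs recorded in this notebook, which were produced with the consecutive pairing.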

### Role Adherence

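To use this metric, `chatbot_role` must be set on the `ConversationalTestCase`. The role defined above mentions only hotels, so a restaurant dialogue such as the first sample is penalized for leaving that role.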

In [18]:
from deepeval.metrics import RoleAdherenceMetric

metric = RoleAdherenceMetric()

metric.measure(conversation_test_cases[0])
print(metric.score)
print(metric.reason)

0.4
The score is 0.4 because the LLM chatbot responses are out of character for a role that assists users in finding a great hotel. For instance, in turn #1, the chatbot says, 'I am sorry, there aren't any options available. May I ask if there is another type of restaurant you would be interested in?' which is not relevant to hotel finding. In turn #3, the chatbot provides a suggestion for a restaurant, 'Yes I have the Yippee Noodle Bar in the center of town on King street in the moderate price range. They serve Asian cuisine. Is there anything else I can do for you?' which again deviates from its role. Furthermore, in turn #5, 'Yes, the patio for outdoor eating is really nice, especially in the fall. Do you want to make a reservation?' focuses on restaurant reservations instead of hotel recommendations. These deviations not only affect the relevance of the conversation but also the chatbot's adherence to the role of providing hotel suggestions.
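
The reason text attributes the low score to the role naming only hotels while the dialogue is about a restaurant. One possible adjustment is a role string that lists every domain the chatbot is expected to cover. The sketch below is illustrative only: `build_chatbot_role` is a hypothetical helper, and the domain names passed to it are assumptions rather than values read from the dataset.

```python
from typing import Sequence


def build_chatbot_role(domains: Sequence[str]) -> str:
    """Build a role description that names every domain the chatbot covers."""
    if not domains:
        raise ValueError("at least one domain is required")
    if len(domains) == 1:
        joined = domains[0]
    else:
        # "a, b or c" style enumeration
        joined = ", ".join(domains[:-1]) + " or " + domains[-1]
    return f"Chatbot is a helpful assistant to find a great {joined}."


# Assumed domain names, chosen to match the dialogues shown in this notebook.
role = build_chatbot_role(["hotel", "restaurant"])
print(role)  # Chatbot is a helpful assistant to find a great hotel or restaurant.
```

The resulting string can be passed as `chatbot_role` when building each `ConversationalTestCase`. Whether it raises the Role Adherence score has not been verified here.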


### Conversation Relevancy

In [20]:
from deepeval.metrics import ConversationRelevancyMetric

metric = ConversationRelevancyMetric()

metric.measure(conversation_test_cases[0])
print(metric.score)
print(metric.reason)

1.0
The score is 1.0 because there are no irrelevancies found in the messages, indicating that all 'actual_output' responses are perfectly relevant to their respective 'inputs'.


### Knowledge Retention

In [21]:
from deepeval.metrics import KnowledgeRetentionMetric

metric = KnowledgeRetentionMetric()

metric.measure(conversation_test_cases[0])
print(metric.score)
print(metric.reason)

1.0
The score is 1.00 because there are no attritions indicating forgetfulness or inconsistencies, suggesting perfect retention.


### Conversation Completeness

In [22]:
from deepeval.metrics import ConversationCompletenessMetric

metric = ConversationCompletenessMetric()

metric.measure(conversation_test_cases[0])
print(metric.score)
print(metric.reason)

1.0
The score is 1.0 because the LLM response fully satisfies the user intention by providing information on a restaurant that serves either African or Asian food, falls within a moderate price range, and offers a nice outdoor eating experience, without any incompleteness reported.
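
Every cell above scores only `conversation_test_cases[0]`. To get a picture of all sampled dialogues, the same `measure` / `score` pattern can be applied in a loop. The sketch below assumes only that a metric exposes `measure(test_case)` and a `score` attribute, as used in this notebook. `score_all` and `StubMetric` are hypothetical names, and the stub stands in for a real metric so the example runs without an LLM call.

```python
from statistics import mean
from typing import Any, Iterable, List


def score_all(metric: Any, test_cases: Iterable[Any]) -> List[float]:
    """Run one metric over every test case and collect the scores."""
    scores = []
    for case in test_cases:
        metric.measure(case)         # same call as in the cells above
        scores.append(metric.score)  # score of the most recent measure()
    return scores


class StubMetric:
    """Hypothetical stand-in for a DeepEval metric (no LLM call)."""

    def __init__(self) -> None:
        self.score = None

    def measure(self, test_case: dict) -> None:
        # Pretend the score was precomputed and stored on the fake test case.
        self.score = test_case["score"]


stub_cases = [{"score": 0.4}, {"score": 1.0}, {"score": 1.0}]
scores = score_all(StubMetric(), stub_cases)
print(scores, round(mean(scores), 2))  # [0.4, 1.0, 1.0] 0.8
```

With a real metric, the call would look like `score_all(RoleAdherenceMetric(), conversation_test_cases)`. Each `measure` call invokes the judge model, so cost grows with the number of conversations.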
