# Final Project

The main assignment of this course is the report describing your Conversational AI final project. This project aims to develop two conversational agents that communicate with each other. One of them would simulate a User (traveler) interested in booking a hotel or restaurant based on specific preferences and constraints. The other would be the Assistant who helps the user find an adequate business vendor and points out their pros and cons based on prior reviews. 

__Project requirements__ \
The two conversational agents should be designed in a way that fits their purpose. \
At least one of the agents should be fine-tuned. \
You should explore two different versions of the Assistant agent. Think of using different fine-tuning or prompting approaches here. \
At least one agent should consult the knowledge base with reviews. \
Use two different personas for the User, which you can define using the Big-5 personality traits or simulate your own traveler types. \
Optionally, for extra points: enhance the system further by incorporating memory. This is for extra points since we didn't cover it in the assignments; however, a user-friendly notebook for working with memory using Mem0 is available. Note that showing the effect of memory requires the setup to be designed in a corresponding way (e.g., the conversations need to be organized into sessions). \
Design N (at least 10) histories to initiate the conversation. \
Incorporate a mechanism to stop the conversation. The conversation should stop once the User expresses satisfaction after receiving a recommendation that fits the requirements.

The success of the agents should be evaluated in two ways: \
Using objective metrics: number of turns before completion, length of the conversation (number of tokens), etc. \
Using subjective evaluation metrics, such as those in Assignment 3, operationalized with human subjects and an LLM as a judge. You could focus on optimizing for short, informative, or pleasant conversations, for example. Ensure that you include an evaluation of how often the Assistant actually fulfilled the User's request.
All project choices: design of the agents, of the conversations, the evaluation, and the experiments need to be clearly motivated, well-explained, and supported with citations where relevant. The evaluation may or may not show that your motivation/expectation was correct - there will be no point deduction for this, but if there is a mismatch between your expectations and your findings, you are expected to reflect on why this may be.
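
The objective metrics listed above (number of turns, conversation length) could be computed along these lines. This is a minimal sketch, assuming a conversation is a list of `{"role", "content"}` dictionaries and approximating tokens by whitespace-separated words. `conversation_metrics` is a hypothetical helper, and a real run would more likely count tokens with the model's own tokenizer.

```python
from typing import Dict, List


def conversation_metrics(messages: List[Dict[str, str]]) -> Dict[str, int]:
    """Count the turns and the approximate tokens in one conversation."""
    num_turns = len(messages)
    # Whitespace splitting is only a rough proxy for model tokens (assumption)
    num_tokens = sum(len(message["content"].split()) for message in messages)
    return {"num_turns": num_turns, "num_tokens": num_tokens}


example = [
    {"role": "user", "content": "I need a cheap hotel in the north."},
    {"role": "assistant", "content": "How about the Ashley hotel?"},
]
print(conversation_metrics(example))
```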

__Report structure__ \
Title and all author names \
Abstract summarizing the research question, method, and main findings \
Introduction section with a background to the problem addressed in this final assignment \
Methodology - description of the methods you used and how they work, including a motivation for their design. \
Experimental setup - with details on the data, evaluation metrics, parameter values, and implementation environment. \
Results section presenting the experimental questions and the corresponding outcomes of the analysis, including visualizations of the results as figures or tables. \
Conclusions section with: \
Summary of the findings and a discussion of their implications \
Limitations of your research approach, together with the envisioned future work \
Division of labor - 1 paragraph that describes how the implementation and the report writing were split among the team members. \
Statement of use of generative AI - if you used generative AI, indicate for what purpose and to what extent. \
References (tip: use the LaTeX/BibTeX reference system,  examples are in the template below) \
Further specification \
Use the Springer formatting for Lecture Notes in Computer Science (LNCS). For details on the LNCS style, see Springer’s Author Instructions. \
Use LaTeX with Overleaf. \
The easiest is probably to start from the Overleaf LNCS template. \
The maximum page length is 12 pages. References and appendices don't count towards the limit. \
Check the rubric before you start. \
The deadline is strict, with a full point deduction for every day you are late. In the event of special personal, medical, or other issues, please notify us before the deadline to determine if we can find a solution. \
Note: footnotes with references to websites can also be seen as related work in case they refer to original work. \

## Plan

Assistant
- finetune a model (domain specific)
- add knowledge to a model 
- (use an ontology if still time)

User
- two different personalities with prompting
    - friendly/polite American vs. straightforward
    - more detail vs. more simple 

General
- add memory

In [44]:
# imports
import numpy as np 
import json
import os
import shutil
import subprocess
import sys
from typing import List
from datasets import Dataset
import re


from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSequenceClassification, pipeline, BitsAndBytesConfig
import transformers, trl, peft
import torch
import random
torch.manual_seed(3407); random.seed(3407); np.random.seed(3407)


from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, get_peft_model, PeftModel
from peft import PeftConfig


In [2]:
device = torch.device("cpu")

## Get all the data

delete the whole "data" folder and make a new, empty "data" folder before running this
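
The manual reset described above could also be scripted. The following is a minimal sketch, not part of the original notebook; `reset_dir` is a hypothetical helper and the default folder name `"data"` is taken from the instruction above.

```python
import os
import shutil


def reset_dir(path: str = "data") -> None:
    """Delete `path` if it exists and recreate it as an empty folder."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)
```

Calling `reset_dir("data")` before the cloning step would give the same starting state as deleting and recreating the folder by hand.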

In [3]:
def setup_repo(repo_url: str, repo_name: str, work_dir: str = "data"):
    # Create the working directory if needed, then move into it
    os.makedirs(work_dir, exist_ok=True)
    os.chdir(work_dir)

    # Remove the repo if it exists (the path is relative to work_dir after chdir)
    if os.path.exists(repo_name):
        shutil.rmtree(repo_name)

    # Clone repo
    subprocess.run(["git", "clone", repo_url], check=True)

    # Move into repo/data
    os.chdir(os.path.join(repo_name, "data"))


setup_repo("https://github.com/lkra/dstc11-track5.git", "dstc11-track5")


Cloning into 'dstc11-track5'...


In [4]:
## List all files in the current directory iteratively:
for dirname, _, filenames in os.walk('.'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./knowledge_aug_reviews.json
./output_schema.json
./knowledge_aug_domain_reviews.json
./README.md
./knowledge.json
./test/labels.json
./test/logs.json
./train/labels.json
./train/logs.json
./train/logs_bkp.json
./train/bkp/labels.json
./train/bkp/logs.json
./val/labels.json
./val/logs.json


In [5]:
with open('train/logs.json', 'r') as f:
    train_ds=json.load(f)

with open('train/labels.json', 'r') as f:
    labels=json.load(f)

with open('knowledge.json', 'r') as f:
    knowledge_base=json.load(f)

In [6]:
def format_dialogue(dialogue: List[dict]) -> List[dict]: 
    """
    Args:
    dialogue (List[dict]): A list of dictionaries where each dictionary contains two keys:
        - 'speaker' (str): A string indicating the speaker of the turn ('U' for user, 'S' for system).
        - 'text' (str): The text spoken by the respective speaker.

    Returns:
        List[dict]: A new array with a specific role and content

    """
    # Your solution here
    messages=[]
    messages.append({"role": "system", "content": "You are an assistant."})
    for dialogue_element in dialogue:
        role = "user" if dialogue_element['speaker'] == 'U' else "system"
        messages.append({"role": role, "content": dialogue_element['text']})

    return messages

In [7]:
def reformat_dataset(dataset, labels_dataset): 
    reformatted_dataset = {
        "messages": []
    }
    for sample_index in range(len(dataset)): 
        # Your solution here
        try:
            sample_dialogue = format_dialogue(dataset[sample_index])
            sample_response = labels_dataset[sample_index]['response']
            sample_dialogue.append({"role": "system", "content": sample_response})
            
            reformatted_dataset["messages"].append(sample_dialogue)
        except:
            continue


        
    return reformatted_dataset

reformatted_dataset = reformat_dataset(train_ds, labels)
dataset = Dataset.from_dict(reformatted_dataset)
dataset

Dataset({
    features: ['messages'],
    num_rows: 16897
})

In [8]:
def process_dataset_split(split: str) -> Dataset: 
    """Loads, reformats, and processes a dataset split for model training or evaluation.

    This function loads a dataset split (e.g., 'val', 'test') and generates a dataset for it, similar to what we had for the train split.

    Args:
        split (str): The name of the dataset split to process

    Returns:
        dataset: A HuggingFace `Dataset` object that contains the preprocessed and reformatted data for the specified split.

    """
    with open(f'{split}/logs.json', 'r') as f:
        data=json.load(f)

    with open(f'{split}/labels.json', 'r') as f:
        labels=json.load(f)

    data_ds = reformat_dataset(data, labels)
    new_dataset = Dataset.from_dict(data_ds)
    
    return new_dataset
    

validation_ds = process_dataset_split("val")
test_ds = process_dataset_split("test")

validation_ds, test_ds

(Dataset({
     features: ['messages'],
     num_rows: 2129
 }),
 Dataset({
     features: ['messages'],
     num_rows: 2798
 }))

## data preparation user finetuning

In [9]:
def format_dialogue_user(dialogue: List[dict]) -> List[dict]: 
    """
    Args:
    dialogue (List[dict]): A list of dictionaries where each dictionary contains two keys:
        - 'speaker' (str): A string indicating the speaker of the turn ('U' for user, 'S' for system).
        - 'text' (str): The text spoken by the respective speaker.

    Returns:
        List[dict]: A new array with a specific role and content

    """
    # Your solution here
    messages=[]
    messages.append({"role": "system", "content": "You are a user simulator."})
    for dialogue_element in dialogue:
        role = "assistant" if dialogue_element['speaker'] == 'U' else "user" #roles swapped so it learns to behave like a user
        messages.append({"role": role, "content": dialogue_element['text']})

    return messages
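
The role swap used for the user simulator can be illustrated on a toy dialogue. This is a minimal sketch, assuming the same `'U'`/`'S'` speaker labels as the dataset; `swap_roles` is a hypothetical stand-in that mirrors the mapping in `format_dialogue_user` but leaves out the system prompt.

```python
from typing import Dict, List


def swap_roles(dialogue: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Map user turns ('U') to 'assistant' and system turns ('S') to 'user'."""
    return [
        {"role": "assistant" if turn["speaker"] == "U" else "user", "content": turn["text"]}
        for turn in dialogue
    ]


toy_dialogue = [
    {"speaker": "U", "text": "Are there any moderate priced hotels in the North?"},
    {"speaker": "S", "text": "Yes I have two. Would you like me to book one?"},
]
for message in swap_roles(toy_dialogue):
    print(message["role"], "->", message["content"])
```

With this mapping, the turns the model is trained to produce (role `assistant`) are the traveler's utterances.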

In [10]:
def reformat_dataset_user(dataset): 
    reformatted_dataset = {
        "messages": []
    }
    for sample_index in range(len(dataset)):
    #for sample_index in range(1):
        try:
            sample_dialogue = format_dialogue_user(dataset[sample_index][:-1]) # exclude last user message; roles are swapped so the model learns to respond as a user
            sample_response = dataset[sample_index][-1]['text'] #use original last user message as response
            sample_dialogue.append({"role": "assistant", "content": sample_response})
            
            reformatted_dataset["messages"].append(sample_dialogue)
        except:
            continue

    return reformatted_dataset


reformatted_dataset_user = reformat_dataset_user(train_ds)
dataset_user = Dataset.from_dict(reformatted_dataset_user)
dataset_user

Dataset({
    features: ['messages'],
    num_rows: 32604
})

In [11]:
def process_dataset_split_user(split: str) -> Dataset: 
    """Loads, reformats, and processes a dataset split for model training or evaluation.

    This function loads a dataset split (e.g., 'val', 'test') and generates a dataset for it, similar to what we had for the train split.

    Args:
        split (str): The name of the dataset split to process

    Returns:
        dataset: A HuggingFace `Dataset` object that contains the preprocessed and reformatted data for the specified split.

    """
    with open(f'{split}/logs.json', 'r') as f:
        data=json.load(f)

    data_ds = reformat_dataset_user(data)
    new_dataset = Dataset.from_dict(data_ds)
    
    return new_dataset
    

validation_ds_user = process_dataset_split("val")
test_ds_user = process_dataset_split("test")

validation_ds_user, test_ds_user

(Dataset({
     features: ['messages'],
     num_rows: 2129
 }),
 Dataset({
     features: ['messages'],
     num_rows: 2798
 }))

Results from the finetuning:

finetuned2.1: \
TrainOutput(global_step=2113, training_loss=1.3732818416436587, metrics={'train_runtime': 6148.8185, 'train_samples_per_second': 2.748, 'train_steps_per_second': 0.344, 'total_flos': 4.201868035718554e+16, 'train_loss': 1.3732818416436587, 'entropy': 1.2068944639629788, 'num_tokens': 3938417.0, 'mean_token_accuracy': 0.6782568560706245, 'epoch': 1.0})

finetuned2.2: \
TrainOutput(global_step=2113, training_loss=1.1545771166886756, metrics={'train_runtime': 6163.3328, 'train_samples_per_second': 2.742, 'train_steps_per_second': 0.343, 'total_flos': 4.219805159713382e+16, 'train_loss': 1.1545771166886756, 'entropy': 1.1386982864803739, 'num_tokens': 3955363.0, 'mean_token_accuracy': 0.6888072027100457, 'epoch': 1.0})

We will proceed with the model from finetuned 2.2.
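
The comparison behind this choice can be written out as a small calculation. This is a minimal sketch using the (rounded) values from the two `TrainOutput` lines above. It assumes the reported loss is the mean token-level cross-entropy in nats, so that `exp(loss)` can be read as a training perplexity. Both numbers are training metrics, not validation results.

```python
import math

# Values copied from the TrainOutput lines above (rounded to four decimals)
runs = {
    "finetuned2.1": {"train_loss": 1.3733, "mean_token_accuracy": 0.6783},
    "finetuned2.2": {"train_loss": 1.1546, "mean_token_accuracy": 0.6888},
}

for name, metrics in runs.items():
    # Assumption: the loss is a mean cross-entropy in nats, so exp(loss) is a perplexity
    perplexity = math.exp(metrics["train_loss"])
    print(
        f"{name}: loss={metrics['train_loss']:.4f}, "
        f"perplexity~{perplexity:.2f}, "
        f"accuracy={metrics['mean_token_accuracy']:.4f}"
    )

# Pick the run with the lowest training loss
best_run = min(runs, key=lambda name: runs[name]["train_loss"])
print("Lowest training loss:", best_run)
```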

In [12]:
# loading the second finetuned model (assistant)
base_model_id = "Qwen/Qwen3-1.7B"
adapter_path  = "/Users/benutzer/Documents/GitHub/CAI/outputs2/adapter2" 
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
model_base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto", device_map="auto")
model_base_for_adapter = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto", device_map="auto")
finetuned_assitant = PeftModel.from_pretrained(model_base_for_adapter, adapter_path)

`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.78s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.96s/it]


In [13]:
#loading finetuned user
base_model_id = "Qwen/Qwen3-1.7B"
adapter_path_user  = "/Users/benutzer/Documents/GitHub/CAI/outputsUser/adapterUser" 
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
model_base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto", device_map="auto")
model_base_for_adapter = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto", device_map="auto")
finetuned_user = PeftModel.from_pretrained(model_base_for_adapter, adapter_path_user)

Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.40s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:19<00:00,  9.59s/it]


## Create agent 1: Assistant 1

In [14]:
dialogue = test_ds[0]['messages'][:-1]
response = test_ds[0]['messages'][-1]

dialogue, response

([{'content': 'You are an assistant.', 'role': 'system'},
  {'content': "I'm looking to stay at a 3 star hotel in the north.",
   'role': 'user'},
  {'content': 'Sorry, I have no results for that query. Would you like to try a different area of town?',
   'role': 'system'},
  {'content': 'Are there any moderate priced hotels in the North?',
   'role': 'user'},
  {'content': 'Yes I have two. Would you like me to book one?',
   'role': 'system'},
  {'content': 'I need a hotel to include free parking; does either have that?',
   'role': 'user'},
  {'content': 'Yes both of them have free parking.', 'role': 'system'},
  {'content': 'Which one would you recommend?', 'role': 'user'},
  {'content': 'How about the Ashley hotel?', 'role': 'system'},
  {'content': 'Is the Ashley hotel a 3 star hotel?', 'role': 'user'},
  {'content': 'the ashley is actually a 2 star hotel.', 'role': 'system'},
  {'content': 'Does this hotel have rooms with a good view of the neighborhood?',
   'role': 'user'}],
 {

In [15]:
text = tokenizer.apply_chat_template(dialogue, tokenize=False, add_generation_prompt=True, enable_thinking=False)
model_inputs = tokenizer([text], return_tensors="pt").to(finetuned_assitant.device)

generated_ids = finetuned_assitant.generate(**model_inputs, max_new_tokens=500)
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]

generated_text = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
print("Finetuned Model: ", generated_text)
print("Ground-truth: ", response["content"])
dialogue.append({'content': generated_text, 'role': 'system'})

Finetuned Model:  Yes, guests have been pleased with the view from their rooms at the Ashley Hotel. Is there anything else you'd like to know about them?
Ground-truth:  Apparently it does according to previous customers, they say that the view is beautiful especially on the higher floors.


In [16]:

text = tokenizer.apply_chat_template(dialogue, tokenize=False, add_generation_prompt=True, enable_thinking=False)
model_inputs = tokenizer([text], return_tensors="pt").to(finetuned_user.device)

generated_ids = finetuned_user.generate(**model_inputs, max_new_tokens=500)
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]

print("User Model: ", tokenizer.decode(output_ids, skip_special_tokens=True).strip())
print("Ground-truth: ", response["content"])

Finetuned Model:  Do they have a friendly staff?
Ground-truth:  Apparently it does according to previous customers, they say that the view is beautiful especially on the higher floors.


In [43]:
text = tokenizer.apply_chat_template(dialogue, tokenize=False, add_generation_prompt=True, enable_thinking=False)
model_inputs = tokenizer([text], return_tensors="pt").to(model_base.device)

generated_ids = model_base.generate(**model_inputs, max_new_tokens=500)
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]

print("Base Model: ", tokenizer.decode(output_ids, skip_special_tokens=True).strip())
print("Ground-truth: ", response["content"])

Base Model:  I have 21 guesthouses in the area. I can provide you with a list of guesthouses that have a 4 star rating. Would you like me to proceed with that?
Ground-truth:  According to reviews, the Golden Curry has large portion sizes. Do you want me to book a table for you?


In [17]:
dialogue = test_ds[1]['messages'][:-1]
response = test_ds[1]['messages'][-1]

dialogue, response

([{'content': 'You are an assistant.', 'role': 'system'},
  {'content': 'Hi! Can you give me some information on the Golden Curry restaurant?',
   'role': 'user'},
  {'content': 'The golden curry is an expensive indian restaurant located in the centre of town. Is there anything else you would like to know?',
   'role': 'system'},
  {'content': 'Are the portion sizes here large?', 'role': 'user'}],
 {'content': 'According to reviews, the Golden Curry has large portion sizes. Do you want me to book a table for you?',
  'role': 'system'})

In [56]:
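def prepare_messages(history, assistant_name, user_name):
    """
    Apply the selected assistant and user persona prompts to a dialogue history.

    The first message (the system prompt) is replaced by the assistant persona prompt,
    and the user persona prompt is inserted at the front as a 'user' message.
    The history list is modified in place and also returned.
    """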
    system_prompt = ''
    user_prompt = ''

    if assistant_name == 'friendly':
        system_prompt = 'You are an assistant. Your task is to help the user by informing them about hotels and restaurants. Be as friendly as possible, the user is your best friend. Elaborate on your answers and provide details is a joyful, whimsical and enthusiastic tone.'
        
    elif assistant_name == 'efficient':
        system_prompt = 'You are an assistant. Your task is to help the user by informing them about hotels and restaurants in a structured way. Efficiency is valued over tone. Provide details but not unnecessarily so. Double check your answers before providing them and only answer if you are sure about your information, otherwise admit that you do not know.'
    
    else:
        print('This assistant configuration does not exist. No modification made to the prompting.')

    
    if user_name == 'business':
        user_prompt = 'You are a simulated user. You are a business man who enjoys staying in cities because of their vibrant life. Here you can try new activities, enjoy the night culture and bars. While you like traveling, you often do so for business so sometimes you want the hotels to be close to the airport and the beds should be comfortable.'
    
    elif user_name == 'creative':
        user_prompt = 'You are a simulated user. You are dreamy and creative. You like to stay in hotels that are close to nature, where you can go on hikes and watch birds. You are looking for quiet environments to read and relax, but also like meeting new people and making friends. You are conscious of the environment and would like your accommodations to reflect your values. Since you own a cat, the rooms have to be pet friendly if you travel together.'
    
    else:
        print('This user configuration does not exist. No modification made to the prompting.')


    if system_prompt != '':
        history[0] = {'content': system_prompt, 'role': 'system'}

    if user_prompt != '':
        history.insert(0, {'content': user_prompt, 'role': 'user'})

    
    return history

In [51]:
_qwen3_model = None
_qwen3_tokenizer = None


def get_qwen3_model():
    global _qwen3_model, _qwen3_tokenizer
    if _qwen3_model is None or _qwen3_tokenizer is None:
        model_name = "Qwen/Qwen3-1.7B"
        _qwen3_tokenizer = AutoTokenizer.from_pretrained(model_name)
        _qwen3_model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype="auto",
            device_map="auto"
        )
    return _qwen3_model, _qwen3_tokenizer

In [57]:
def query_qwen_as_a_judge(messages):
    
    content = messages[-1]['content']
    if messages[-1]['role'] == 'system':
        role = 'system'
        system_prompt = {
                "role": "system",
                "content": """
        ### Role Assignment
        You are a Satisfaction Evaluation Judge.
        Your job is to evaluate the satisfaction in the **user’s response** with respect to the **assistant’s request**, in the context of conversations about hotels. The score should be higher if the user expresses satisfaction after receiving a recommendation that fits the requirements and does not ask a next question that continues the conversation.

        ### Task Definition
        You must:
        1. Assign a **satisfaction score** from **0.0 to 1.0**
        2. Provide a **short explanation** (maximum 2 sentences)

        ### Output Format (STRICT)
        Return ONLY:

        <JSON>
        {
        "content": "",
        "role": "system",
        "satisfaction_score": float between 0.0 and 1.0,
        "explanation": "brief rationale"
        }
        </JSON>
        """
            }

    else:
        role = 'user'
        system_prompt = {
                "role": "system",
                "content": """
        ### Role Assignment
        You are a Coherence Evaluation Judge.
        Your job is to evaluate how coherent the **assistant’s response** is with respect to the **user’s request**, in the context of conversations about hotels.
        
        ### Task Definition
        You must:
        1. Assign a **coherence score** from **0.0 to 1.0**
        2. Provide a **short explanation** (maximum 2 sentences)

        ### Output Format (STRICT)
        Return ONLY:

        <JSON>
        {
        "content": "",
        "role": "user",
        "coherence_score": float between 0.0 and 1.0,
        "explanation": "brief rationale"
        }
        </JSON>
        """
            }

    messages_judge = [system_prompt] + messages

    judge, tokenizer_judge = get_qwen3_model()

    text_judge = tokenizer_judge.apply_chat_template(
        messages_judge,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs_judge = tokenizer_judge([text_judge], return_tensors="pt").to(judge.device)
    generated_ids_judge = judge.generate(
        **model_inputs_judge, max_new_tokens=512
    )
    output_ids_judge = generated_ids_judge[0][model_inputs_judge.input_ids.shape[1]:]
    raw_output_judge = tokenizer_judge.decode(output_ids_judge, skip_special_tokens=True).strip()

    # Extract the JSON payload between the <JSON> tags
    match = re.search(r"<JSON>(.*?)</JSON>", raw_output_judge, re.DOTALL)
    if match:
        json_str = match.group(1).strip()
        try:
            data = json.loads(json_str)
            print(data)
            return data
        except json.JSONDecodeError:
            return {"content": content, "role": role, "satisfaction_score": None, "explanation": "JSON parse error"}
    
    return {"content": content, "role": role, "satisfaction_score": None, "explanation": "No JSON found"}
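
The extraction step at the end of the judge function can be tested in isolation, without loading a model. This is a minimal sketch; `extract_judge_json` is a hypothetical helper that uses the same `<JSON>...</JSON>` pattern and returns `None` instead of a fallback dictionary.

```python
import json
import re
from typing import Optional


def extract_judge_json(raw_output: str) -> Optional[dict]:
    """Return the dict found between <JSON> tags, or None if it is missing or invalid."""
    match = re.search(r"<JSON>(.*?)</JSON>", raw_output, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None


sample = 'Some text <JSON>{"satisfaction_score": 0.9, "explanation": "User is happy."}</JSON>'
print(extract_judge_json(sample))
print(extract_judge_json("no json here"))
```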

In [None]:
def conversation(assistant_model, user_model, dialogue_history, assistant_name=None, user_name=None):
    role = None
    model = None
    satisfaction_scores = []

    if not (assistant_name==None and user_name==None):
        dialogue_history = prepare_messages(dialogue_history, assistant_name, user_name)

    for i in range(10):

        score = None

        if i > 0:
            print("querying llm")
            judge_result = query_qwen_as_a_judge(dialogue_history)
            print(f'query result: {judge_result}')
            satisfaction_scores.append(judge_result)
            score = judge_result.get('satisfaction_score')  # the coherence judge returns no satisfaction score
            if score is not None:
                score = float(score)

        if dialogue_history[-1]['role'] == 'system':
            model = user_model
            role = 'user'
        else:
            model = assistant_model
            role = 'system'
            if score is not None and score >= 0.9:
                return dialogue_history, satisfaction_scores

        text = tokenizer.apply_chat_template(dialogue_history, tokenize=False, add_generation_prompt=True, enable_thinking=False)
        model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=500)
        output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]

        generated_text = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
        dialogue_history.append({'content': generated_text, 'role': role})

    return dialogue_history, satisfaction_scores
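
The stopping rule used in the loop above can be isolated as a small predicate. This is a minimal sketch, assuming a score is either `None` (no usable judge output) or a float in [0, 1]. `should_stop` is a hypothetical helper, and the threshold of 0.9 is the value used in `conversation`.

```python
from typing import List, Optional


def should_stop(score: Optional[float], threshold: float = 0.9) -> bool:
    """Stop only when a score is available and reaches the threshold."""
    return score is not None and score >= threshold


# Scripted scores for successive turns (illustrative values, not real judge output)
scores: List[Optional[float]] = [None, 0.4, 0.95, 0.2]
stop_turn = next((i for i, s in enumerate(scores) if should_stop(s)), None)
print("Conversation would stop at turn index:", stop_turn)
```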

In [None]:
print(dialogue)
print()
dialogue = test_ds[1]['messages'][:-1]
generated_dialogue, satisfaction_scores = conversation(finetuned_assitant, finetuned_user, dialogue, 'friendly', 'creative')

for item in generated_dialogue:
    print(item)

print()

for item in satisfaction_scores:
    print(item)

# print()
# dialogue = test_ds[1]['messages'][:-1] # conversation() modifies the list in place, so rebuild it to start a new conversation
# generated_dialogue2, satisfaction_scores2 = conversation(finetuned_assitant, finetuned_user, dialogue, 'efficient', 'business')

# for item in generated_dialogue2:
#     print(item)

[{'content': 'You are a simulated user. You are dreamy and creative. You like to stay in hotels that are close to nature, where you can go on hikes and watch birds. You are looking for quiet environments to read and relax, but also like meeting new people and making friends. You are conscious of the environment and would like your accommodations to reflect your values. Since you own a cat, the rooms have to be pet friendly if you travel together.', 'role': 'user'}, {'content': 'You are an assistant. Your task is to help the user by informing them about hotels and restaurants. Be as friendly as possible, the user is your best friend. Elaborate on your answers and provide details is a joyful, whimsical and enthusiastic tone.', 'role': 'system'}, {'content': 'Hi! Can you give me some information on the Golden Curry restaurant?', 'role': 'user'}, {'content': 'The golden curry is an expensive indian restaurant located in the centre of town. Is there anything else you would like to know?', '
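
The note in the commented-out code about the conversation continuing instead of restarting is consistent with in-place mutation: `conversation` appends to the list it receives. The following is a minimal sketch of that effect; `append_turn` is a hypothetical stand-in for `conversation`, and using `copy.deepcopy` is one possible way to keep the original history unchanged.

```python
import copy
from typing import Dict, List


def append_turn(history: List[Dict[str, str]], text: str) -> List[Dict[str, str]]:
    """Hypothetical stand-in for conversation(): appends to the list it is given."""
    history.append({"role": "user", "content": text})
    return history


original = [{"role": "system", "content": "You are an assistant."}]

# Passing the list directly changes it for every later use
append_turn(original, "first run")
print(len(original))

# Passing a deep copy leaves the original untouched
fresh = copy.deepcopy(original)
append_turn(fresh, "second run")
print(len(original), len(fresh))
```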

In [None]:
#experimental setup
'''
10 histories = 10 conversations with each setting

Settings:
User:
1. base model as user                   model_base
2. finetuned user                       finetuned_user
3. finetuned user with personality 1
4. finetuned user with personality 2

System:
5. base model as system
6. finetuned system                     finetuned_assistant
7. finetuned system with personality 1
8. finetuned system with personality 1 + knowledge
9. finetuned system with personality 2
10. finetuned system with personality 2 + knowledge

test all user-system combinations: 24 * 10 * 10 conversations

extra:
finetuned system as user with base and finetuned user as system with base

Criteria:
- average word length of response
- length of conversations (nr turns)
    - good because ended fast
    - longer conversation = more engaging
    - probably have to judge this ourselves
- received LLM scores
    - change based on length of conversation?

'''
pass
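
The experiment grid described above could be enumerated as follows. This is a minimal sketch, assuming one conversation per history for every user-system combination. The setting labels are hypothetical shorthand for the numbered settings, not identifiers from the notebook.

```python
from itertools import product

# Shorthand labels for the four user settings and six system settings (hypothetical names)
user_settings = ["base", "finetuned", "finetuned+persona1", "finetuned+persona2"]
system_settings = [
    "base",
    "finetuned",
    "finetuned+persona1",
    "finetuned+persona1+knowledge",
    "finetuned+persona2",
    "finetuned+persona2+knowledge",
]
num_histories = 10

# One run per (user setting, system setting, history)
runs = [
    (user, system, history_id)
    for user, system in product(user_settings, system_settings)
    for history_id in range(num_histories)
]
print(len(user_settings) * len(system_settings), "combinations")
print(len(runs), "conversations")
```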