# Final Project

The main assignment of this course is the report describing your Conversational AI final project. This project aims to develop two conversational agents that communicate with each other. One of them would simulate a User (traveler) interested in booking a hotel or restaurant based on specific preferences and constraints. The other would be the Assistant who helps the user find an adequate business vendor and points out their pros and cons based on prior reviews. 

__Project requirements__ \
The two conversational agents should be designed in a way that fits their purpose. \
At least one of the agents should be fine-tuned. \
You should explore two different versions of the Assistant agent. Think of using different fine-tuning or prompting approaches here. \
At least one agent should consult the knowledge base with reviews. \
Use two different personas for the User, which you can define using the Big-5 personality traits or simulate your own traveler types. \
Optionally, for extra points: enhance the system further by incorporating memory. This is for extra points since we didn't cover it in the assignments; however, there is a user-friendly notebook for working with memory using Mem0. Note that showing the effect of memory requires the setup to be designed in a corresponding way (e.g., the conversations need to be organized into sessions). \
Design N (at least 10) histories to initiate the conversation. \
Incorporate a mechanism to stop the conversation. The conversation should stop once the User expresses satisfaction after receiving a recommendation that fits the requirements.
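
One way to satisfy the stopping requirement is to scan each User utterance for a satisfaction marker and cap the number of turns. The following is a minimal sketch, not the project's implementation: the marker phrases, the `run_conversation` helper, and the stub agents are all hypothetical stand-ins for the real LLM-backed agents.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Hypothetical marker phrases; a real User prompt could instead be told
# to emit one fixed token such as "[DONE]" when satisfied.
SATISFACTION_MARKERS = ("that works for me", "i'll book it", "[done]")


def user_is_satisfied(utterance: str) -> bool:
    """Return True if the User utterance contains any satisfaction marker."""
    text = utterance.lower()
    return any(marker in text for marker in SATISFACTION_MARKERS)


def run_conversation(
    user_agent: Callable[[List[Message]], str],
    assistant_agent: Callable[[List[Message]], str],
    history: List[Message],
    max_turns: int = 10,
) -> List[Message]:
    """Alternate User and Assistant turns until satisfaction or max_turns."""
    messages = list(history)
    for _ in range(max_turns):
        user_text = user_agent(messages)
        messages.append({"role": "user", "content": user_text})
        if user_is_satisfied(user_text):
            break
        messages.append({"role": "assistant", "content": assistant_agent(messages)})
    return messages


# Stub agents standing in for the real agents (assumption for illustration).
def stub_user(messages: List[Message]) -> str:
    has_recommendation = any(m["role"] == "assistant" for m in messages)
    if has_recommendation:
        return "Great, that works for me. [DONE]"
    return "I need a cheap hotel."


def stub_assistant(messages: List[Message]) -> str:
    return "The Example Inn is cheap and guests praise its breakfast."


transcript = run_conversation(stub_user, stub_assistant, history=[])
print(len(transcript))  # 3 messages: user, assistant, user
```

The `max_turns` cap guards against conversations in which the User never expresses satisfaction, which also makes failed conversations easy to count during evaluation.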

The success of the agents should be evaluated in two ways: \
Using objective metrics: number of turns before completion, length of the conversation (number of tokens), etc. \
Using subjective evaluation metrics, such as those in Assignment 3, operationalized with human subjects and an LLM as a judge. You could focus on optimizing for short, informative, or pleasant conversations, for example. Ensure that you include an evaluation of how often the Assistant actually fulfilled the User's request. \
All project choices (the design of the agents, the conversations, the evaluation, and the experiments) need to be clearly motivated, well explained, and supported with citations where relevant. The evaluation may or may not show that your motivation or expectation was correct. There is no point deduction for this, but if your expectations and your findings do not match, you are expected to reflect on why.
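
The objective metrics named above (number of turns and conversation length in tokens) can be computed directly from a finished transcript. This is a rough sketch under stated assumptions: the `conversation_metrics` helper is hypothetical, transcripts are assumed to be lists of `role`/`content` dictionaries, and the default token counter is a plain whitespace split rather than a model tokenizer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def conversation_metrics(
    messages: List[Message],
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> Dict[str, int]:
    """Compute simple objective metrics for one finished conversation."""
    # Ignore the system prompt; only spoken turns count.
    spoken = [m for m in messages if m["role"] in ("user", "assistant")]
    return {
        "num_turns": len(spoken),
        "num_user_turns": sum(m["role"] == "user" for m in spoken),
        "num_tokens": sum(count_tokens(m["content"]) for m in spoken),
    }


example = [
    {"role": "system", "content": "You are an assistant."},
    {"role": "user", "content": "I need a cheap hotel."},
    {"role": "assistant", "content": "The Example Inn is cheap."},
    {"role": "user", "content": "Great, thanks."},
]
metrics = conversation_metrics(example)
print(metrics)
```

To count model tokens instead of whitespace-separated words, a tokenizer-based counter can be passed in, for example `count_tokens=lambda t: len(tok(t)["input_ids"])` with a loaded Hugging Face tokenizer `tok`.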

__Report structure__ \
Title and all author names \
Abstract summarizing the research question, method, and main findings \
Introduction section with a background to the problem addressed in this final assignment \
Methodology - description of the methods you used and how they work, including a motivation for their design. \
Experimental setup - with details on the data, evaluation metrics, parameter values, and implementation environment. \
Results section presenting the experimental questions and the corresponding outcomes of the analysis, including visualizations of the results as figures or tables. \
Conclusions section with: \
Summary of the findings and a discussion of their implications \
Limitations of your research approach, together with the envisioned future work \
Division of labor - 1 paragraph that describes how the implementation and the report writing were split among the team members. \
Statement of use of generative AI - if you used generative AI, indicate for what purpose and to what extent. \
References (tip: use the LaTeX/BibTeX reference system; examples are in the template below)

__Further specification__ \
Use the Springer format for Lecture Notes in Computer Science (LNCS). For details on the LNCS style, see Springer’s Author Instructions. \
Use LaTeX with Overleaf. \
The easiest approach is probably to start from the Overleaf LNCS template. \
The maximum page length is 12 pages. References and appendices don't count towards the limit. \
Check the rubric before you start. \
The deadline is strict, with a full point deduction for every day you are late. In the event of special personal, medical, or other issues, please notify us before the deadline to determine if we can find a solution. \
Note: footnotes with references to websites can also be seen as related work if they refer to original work.

## Plan

Assistant
- fine-tune a model (domain-specific)
- add knowledge to a model
- (use an ontology if time permits)

User
- two different personalities with prompting
    - friendly/polite American vs. straightforward
    - more detailed vs. simpler
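
The two prompted User personas in this plan could be expressed as system prompts. The sketch below is illustrative only: the persona wording, the `PERSONAS` dictionary, and the `build_user_messages` helper are hypothetical names, not part of the notebook.

```python
from typing import Dict, List

# Hypothetical persona descriptions derived from the plan; wording is illustrative.
PERSONAS: Dict[str, str] = {
    "friendly_detailed": (
        "You are a friendly, polite American traveler. You greet the assistant, "
        "say thank you, and describe your preferences in detail."
    ),
    "direct_simple": (
        "You are a straightforward traveler. You skip small talk and state "
        "your requirements in short, simple sentences."
    ),
}


def build_user_messages(persona: str, goal: str) -> List[Dict[str, str]]:
    """Build the initial chat messages for a simulated User with a persona."""
    system_prompt = (
        f"{PERSONAS[persona]} Your goal: {goal} "
        "When a recommendation meets all your requirements, say so clearly."
    )
    return [{"role": "system", "content": system_prompt}]


messages = build_user_messages("direct_simple", "book a cheap hotel with free parking.")
print(messages[0]["content"])
```

Keeping the persona text separate from the goal makes it possible to pair every conversation history with both personas, so that differences in the evaluation can be attributed to the persona rather than to the task.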

General
- add memory

In [1]:
# imports
import numpy as np 
import json
import os
import shutil
import subprocess
import sys
from typing import List
from datasets import Dataset


from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers, trl, peft
import torch
import random
torch.manual_seed(3407); random.seed(3407); np.random.seed(3407)


from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, get_peft_model, PeftModel




In [2]:
device = torch.device("cpu")

## Get all the data

In [3]:
def setup_repo(repo_url: str, repo_name: str, work_dir: str = "data"):
    """Clone a fresh copy of the repo inside work_dir and move into its data folder."""
    os.makedirs(work_dir, exist_ok=True)
    os.chdir(work_dir)

    # Remove repo if it exists (paths are now relative to work_dir)
    if os.path.exists(repo_name):
        shutil.rmtree(repo_name)

    # Clone repo
    subprocess.run(["git", "clone", repo_url], check=True)

    # Move into repo/data
    os.chdir(os.path.join(repo_name, "data"))


setup_repo("https://github.com/lkra/dstc11-track5.git", "dstc11-track5")


Cloning into 'dstc11-track5'...


In [4]:
# List all files under the current directory recursively
for dirname, _, filenames in os.walk('.'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./knowledge_aug_reviews.json
./output_schema.json
./knowledge_aug_domain_reviews.json
./README.md
./knowledge.json
./test/labels.json
./test/logs.json
./train/labels.json
./train/logs.json
./train/logs_bkp.json
./train/bkp/labels.json
./train/bkp/logs.json
./val/labels.json
./val/logs.json


In [5]:
with open('train/logs.json', 'r') as f:
    train_ds=json.load(f)

with open('train/labels.json', 'r') as f:
    labels=json.load(f)

with open('knowledge.json', 'r') as f:
    knowledge_base=json.load(f)
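
Since at least one agent has to consult the review knowledge base, a simple lexical retrieval step is one possible starting point. The sketch below runs on a toy dictionary; the nested shape (domain, entity id, `reviews`, `sentences`) is an assumption about `knowledge.json` that should be checked against the loaded `knowledge_base`, and the `review_sentences` and `retrieve` helpers are hypothetical.

```python
from typing import Dict, List, Tuple

# Toy knowledge base in the nested shape assumed here:
# domain -> entity id -> {"name", "reviews": {id: {"sentences": {id: text}}}}
toy_kb = {
    "hotel": {
        "1": {
            "name": "Example Inn",
            "reviews": {
                "0": {"sentences": {"0": "The breakfast was great.", "1": "Parking is free."}},
                "1": {"sentences": {"0": "The room was noisy at night."}},
            },
        }
    }
}


def review_sentences(kb: Dict, domain: str, entity_id: str) -> List[str]:
    """Flatten all review sentences for one entity."""
    reviews = kb[domain][entity_id].get("reviews") or {}
    return [s for review in reviews.values() for s in review["sentences"].values()]


def retrieve(kb: Dict, domain: str, entity_id: str, query: str, k: int = 2) -> List[str]:
    """Rank sentences by word overlap with the query (a simple lexical baseline)."""
    query_words = set(query.lower().split())
    scored: List[Tuple[int, str]] = [
        (len(query_words & set(s.lower().rstrip(".").split())), s)
        for s in review_sentences(kb, domain, entity_id)
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Keep only sentences that share at least one word with the query.
    return [s for score, s in ranked[:k] if score > 0]


print(retrieve(toy_kb, "hotel", "1", "is parking free"))
```

Word overlap is only a baseline; an embedding-based retriever could replace the scoring line without changing the rest of the interface.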

In [6]:
def format_dialogue(dialogue: List[dict]) -> List[dict]:
    """Convert a dialogue log into chat messages.

    Args:
        dialogue (List[dict]): A list of turns, each a dictionary with two keys:
            - 'speaker' (str): The speaker of the turn ('U' for user, 'S' for system).
            - 'text' (str): The text spoken by that speaker.

    Returns:
        List[dict]: Chat messages with 'role' and 'content' keys, starting with a system prompt.
    """
    messages = [{"role": "system", "content": "You are an assistant."}]
    for turn in dialogue:
        # 'S' turns in the logs are the assistant's replies in chat format
        role = "user" if turn['speaker'] == 'U' else "assistant"
        messages.append({"role": role, "content": turn['text']})

    return messages

In [7]:
def reformat_dataset(dataset, labels_dataset):
    """Pair each dialogue with its labelled response as the final assistant turn."""
    reformatted_dataset = {
        "messages": []
    }
    for sample_index in range(len(dataset)):
        try:
            sample_dialogue = format_dialogue(dataset[sample_index])
            sample_response = labels_dataset[sample_index]['response']
        except (KeyError, IndexError):
            # Skip samples that have no 'response' label
            continue

        sample_dialogue.append({"role": "assistant", "content": sample_response})
        reformatted_dataset["messages"].append(sample_dialogue)

    return reformatted_dataset

reformatted_dataset = reformat_dataset(train_ds, labels)
dataset = Dataset.from_dict(reformatted_dataset)
dataset

Dataset({
    features: ['messages'],
    num_rows: 16897
})

In [8]:
def process_dataset_split(split: str) -> Dataset: 
    """Loads, reformats, and processes a dataset split for model training or evaluation.

    This function loads a dataset split (e.g., 'val', 'test') and generates a dataset for it, similar to what we had for the train split.

    Args:
        split (str): The name of the dataset split to process

    Returns:
        dataset: A HuggingFace `Dataset` object that contains the preprocessed and reformatted data for the specified split.

    """
    with open(f'{split}/logs.json', 'r') as f:
        data=json.load(f)

    with open(f'{split}/labels.json', 'r') as f:
        labels=json.load(f)

    data_ds = reformat_dataset(data, labels)
    new_dataset = Dataset.from_dict(data_ds)
    
    return new_dataset
    

validation_ds = process_dataset_split("val")
test_ds = process_dataset_split("test")

validation_ds, test_ds

(Dataset({
     features: ['messages'],
     num_rows: 2129
 }),
 Dataset({
     features: ['messages'],
     num_rows: 2798
 }))

In [9]:
model_id = "Qwen/Qwen3-1.7B"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.19s/it]


In [10]:
peft_cfg = LoraConfig(r=16, 
           lora_alpha=32, 
           lora_dropout = 0.05, 
           bias = "none", 
           use_rslora = False, 
           target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"])

model = get_peft_model(base, peft_cfg)

In [11]:
def pick_bf16():
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        return major >= 8
    return False

In [12]:
NUM_TRAIN_EPOCHS = 2
LEARNING_RATE    = 1e-4
# Roughly 9% of all optimizer steps; 2113 = ceil(16897 / (2 * 4)) steps per epoch on one device
WARMUP_STEPS     = ((2113 * NUM_TRAIN_EPOCHS)//100) * 9

os.environ["WANDB_DISABLED"] = "true"

sft_args = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=LEARNING_RATE,
    warmup_steps=WARMUP_STEPS,
    num_train_epochs=NUM_TRAIN_EPOCHS,
    logging_steps=10,
    lr_scheduler_type="cosine_with_restarts",
    weight_decay=0.01,
    max_length=1024,
    optim="adamw_torch_fused",
    fp16=not pick_bf16(),
    bf16=pick_bf16(),
    packing=False,
    dataset_num_proc=2,
    report_to="none",
    seed=3407
)
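
The constant 2113 in the warmup computation appears to be the number of optimizer steps per epoch. The arithmetic can be reproduced as follows, assuming single-device training and that the final partial batch counts as a step (the variable names are illustrative):

```python
import math

num_examples = 16897             # rows in the training split
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 2
num_devices = 1                  # assumption: single-device training

# Examples consumed per optimizer step
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

# Same formula as the notebook: 9 warmup steps per full hundred total steps
warmup_steps = (total_steps // 100) * 9

print(steps_per_epoch, total_steps, warmup_steps)  # 2113 4226 378
```

Deriving the value from the dataset size instead of hard-coding 2113 would keep the warmup proportional if the batch size or the training split changes.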

In [13]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,      
    eval_dataset=validation_ds,
    args=sft_args,
    processing_class=tok
)

Tokenizing train dataset (num_proc=2): 100%|██████████| 16897/16897 [00:10<00:00, 1612.10 examples/s]
Truncating train dataset (num_proc=2): 100%|██████████| 16897/16897 [00:00<00:00, 17122.97 examples/s]
Tokenizing eval dataset (num_proc=2): 100%|██████████| 2129/2129 [00:02<00:00, 966.67 examples/s] 
Truncating eval dataset (num_proc=2): 100%|██████████| 2129/2129 [00:00<00:00, 10632.50 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.


In [None]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.




## Create agent 1: Assistant 1

In [None]:
trainer.save_model("outputs/adapter")  
tok.save_pretrained("outputs/adapter")

## Create agent 2: Assistant 2

## Create agent 3: User 1 MAD

## Create agent 4: User 2 SIMP

In [3]:
mydict = {'one': 1, 'two': 2}
for key in mydict:
    print(key)

for key in mydict.keys():
    print(key)

one
two
one
two
