# Creating the evaluation dataset

## Introduction

In this notebook, I create the evaluation dataset used by my RAGAS-based evaluation pipeline.

## Importing the necessary libraries

In [18]:
from collections import Counter
from atlassian import Jira
from datetime import datetime, timedelta
from ragas import Dataset
import re
import os

import sys
sys.path.append("/Users/adam/Documents/University/Third Year Project/rag_project/src")

from polaris_rag.retrieval.document_loader import load_support_tickets
from polaris_rag.retrieval.document_preprocessor import build_jira_ticket_text
from polaris_rag.generation.llm_interface import OpenAILikeLLM

USERNAME = 'ac2650@cam.ac.uk'
PASSWORD = os.getenv('JIRA_API_TOKEN')


jira = Jira(
    url='https://ucam-rcs.atlassian.net',
    username=USERNAME,
    password=PASSWORD,
    cloud=True,
    api_version='3',
)



## Loading the chosen helpdesk tickets

In [4]:
with open('/Users/adam/Documents/University/Third Year Project/rag_project/data/test/eval_ticket_keys.txt', 'r') as f:
    # Skip blank lines so a trailing newline does not become an empty key.
    eval_ticket_keys = [line.strip() for line in f if line.strip()]

issues = load_support_tickets(
    username=USERNAME,
    password=PASSWORD,
    start_date="2025-01-01",
    limit=None,
    keys=eval_ticket_keys,
)

In [5]:
print(len(issues))

76
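
The count alone does not show whether every key in the file was actually returned. Below is a minimal sketch of a cross-check, assuming each loaded issue is a dict with a `key` field (as the `issue['key']` lookup in this notebook suggests). `find_missing_keys` is a hypothetical helper, not part of `polaris_rag`, and the keys in the example are made up.

```python
def find_missing_keys(expected_keys, issues):
    """Return the requested ticket keys that are absent from the loaded issues."""
    loaded = {issue["key"] for issue in issues}
    # Preserve the order of the key file so the report is easy to cross-check.
    return [key for key in expected_keys if key not in loaded]


# Stand-in data; in the notebook these would be eval_ticket_keys and issues.
example_keys = ["ABC-1", "ABC-2", "ABC-3"]
example_issues = [{"key": "ABC-1"}, {"key": "ABC-3"}]
missing = find_missing_keys(example_keys, example_issues)
print(missing)
```

An empty list would mean every requested ticket was loaded.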


## Creating the evaluation dataset

In [7]:
dataset = Dataset(
    name="RAG_evaluation_v1.0",
    backend="local/csv",
    root_dir="/Users/adam/Documents/University/Third Year Project/rag_project/data/evaluation",
)

In [47]:
issue = issues[1]
issue_key = issue['key']

issue_text = build_jira_ticket_text(issue)

query_pattern = r"\[INITIAL_DESCRIPTION\]\s*(.*?)\s*\[CONVERSATION\]"
conversation_pattern = r"\[CONVERSATION\]\s*(.*?)\s*<END_TICKET>"

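# Fail loudly if a section is missing, rather than leaving the variable undefined or stale.
query_match = re.search(query_pattern, issue_text, re.DOTALL)
if query_match is None:
    raise ValueError(f"{issue_key}: no [INITIAL_DESCRIPTION] section found")
issue_query = query_match.group(1)

conversation_match = re.search(conversation_pattern, issue_text, re.DOTALL)
if conversation_match is None:
    raise ValueError(f"{issue_key}: no [CONVERSATION] section found")
issue_conversation = conversation_match.group(1)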
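
The cell above handles a single ticket. To apply the same split to all loaded tickets, the two patterns could be wrapped in a function, as sketched below. `split_ticket_text` is a hypothetical name, and the sample text is invented to mirror the `[INITIAL_DESCRIPTION]` / `[CONVERSATION]` / `<END_TICKET>` layout that the patterns expect.

```python
import re

QUERY_PATTERN = re.compile(
    r"\[INITIAL_DESCRIPTION\]\s*(.*?)\s*\[CONVERSATION\]", re.DOTALL
)
CONVERSATION_PATTERN = re.compile(
    r"\[CONVERSATION\]\s*(.*?)\s*<END_TICKET>", re.DOTALL
)


def split_ticket_text(ticket_text):
    """Return (query, conversation); raise ValueError if a section is missing."""
    query_match = QUERY_PATTERN.search(ticket_text)
    conversation_match = CONVERSATION_PATTERN.search(ticket_text)
    if query_match is None or conversation_match is None:
        raise ValueError("ticket text is missing an expected section marker")
    return query_match.group(1), conversation_match.group(1)


# Invented sample in the same layout as the real ticket text.
sample = (
    "<BEGIN_TICKET>\n"
    "[INITIAL_DESCRIPTION]\nMy job fails.\n\n"
    "[CONVERSATION]\n"
    "<MESSAGE id=0001 role=HELPDESK_ASSIGNEE>\nTry again.\n</MESSAGE>\n"
    "<END_TICKET>"
)
query, conversation = split_ticket_text(sample)
print(query)
print(conversation)
```

Raising on a missing marker makes a malformed ticket stop the run, so it cannot silently reuse values from a previous ticket.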


In [48]:
llm = OpenAILikeLLM(
    model_name='llama-cam',
    api_base='https://llm.hpc.cam.ac.uk/v1',
    api_key=os.getenv('LLM_API_KEY'),  # assumed variable name; keeps the key out of the notebook
)

In [77]:
print(issue_text)

<BEGIN_TICKET>
[TICKET_KEY] HPCSSUP-98836
[STATUS] RESOLVED
[CREATED] 2026-02-05T13:24:02.355+0000
[SUMMARY] Missing Files After RDS Directory Change?

[INITIAL_DESCRIPTION]
Hello,
I am a user of CSD3 (ejs237, dp012) and recently noticed that I can no longer access my files at
/rds/user/ejs237/rds-dirac-dp012/ejs237/.
I was informed that the directory has been moved to
/rds/project/dirac_vol2/rds-dirac-dp012, but this location does not appear to be valid or accessible in my case.
Could you please let me know where my previous data have been moved?
Thank you very much for your help.
Best regards,
Eun-jin Shin


[CONVERSATION]
<MESSAGE id=0001 role=HELPDESK_ASSIGNEE>
Hello Eun-jin,
You can fix the symlink by using the following commands:
cd ~/rds
rm rds-dirac-dp012
ln -s /rds/project/rds-1EqYKotbyrc rds-dirac-dp012
Please let me know if this works.
Best regards,
Elisabeth Reeve
HPC Support
</MESSAGE>

<MESSAGE id=0002 role=TICKET_CREATOR>
Hello Elisabeth,
Thanks so much for your help. Th

In [100]:
anonymisation_prompt = f"""
You are a strict data anonymisation engine for HPC helpdesk tickets.

TASK
Remove or replace sensitive and personally identifiable information (PII) from the ticket text.

RULES
- Do NOT summarise or rewrite.
- Preserve technical meaning.
- Preserve commands, file paths, and system names unless they contain personal identifiers.
- Replace sensitive info with deterministic indexed placeholders and reuse consistently:
  [PERSON_1], [PERSON_2]...
  [EMAIL_1], [EMAIL_2]...
  [USER_ID_1], [USER_ID_2]...
  [PROJECT_ID_1]...
  [PHONE_1]...
  [IP_ADDRESS_1]...
  [TICKET_ID_1]...
- Numbering restarts for each ticket.
- If no sensitive info exists, return the original ticket unchanged.
- Never return an empty response.

TICKET (redact only the text between the markers)
<<<TICKET_START>>>
{issue_text}
<<<TICKET_END>>>
""".strip()

In [102]:
response = llm.generate(
    prompt=anonymisation_prompt,
    max_tokens=5000,
)

print(response)

 


Your response:


<<<TICKET_START>>>
<BEGIN_TICKET>
[TICKET_KEY] HPCSSUP-98836
[STATUS] RESOLVED
[CREATED] 2026-02-05T13:24:02.355+0000
[SUMMARY] Missing Files After RDS Directory Change?

[INITIAL_DESCRIPTION]
Hello,
I am a user of CSD3 ([PERSON_1], [PERSON_2]) and recently noticed that I can no longer access my files at
/rds/user/[USER_ID_1]/rds-dirac-[USER_ID_2]/[USER_ID_1]/.
I was informed that the directory has been moved to
/rds/project/[PROJECT_ID_1], but this location does not appear to be valid or accessible in my case.
Could you please let me know where my previous data have been moved?
Thank you very much for your help.
Best regards,
[PERSON_1]


[CONVERSATION]
<MESSAGE id=0001 role=HELPDESK_ASSIGNEE>
Hello [PERSON_1],
You can fix the symlink by using the following commands:
cd ~/rds
rm rds-dirac-[USER_ID_2]
ln -s /rds/project/[PROJECT_ID_1] rds-dirac-[USER_ID_2]
Please let me know if this works.
Best regards,
[PERSON_2]
HPC Support
</MESSAGE>

<MESSAGE id=0002 role=TICKE
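
The response above starts with a `Your response:` preamble and echoes the `<<<TICKET_START>>>` marker, so it cannot be stored as-is. Below is a minimal sketch of a clean-up step, assuming the model keeps echoing the markers from the prompt. `extract_anonymised_ticket` is a hypothetical helper, and taking the text after the last start marker is a design guess rather than observed behaviour of the model.

```python
START_MARKER = "<<<TICKET_START>>>"
END_MARKER = "<<<TICKET_END>>>"


def extract_anonymised_ticket(response):
    """Return the text after the last start marker, up to the end marker if any."""
    # Drop any preamble such as "Your response:" that precedes the echoed marker.
    _, found, after = response.rpartition(START_MARKER)
    body = after if found else response
    # The end marker can be absent, for example when generation hits max_tokens.
    body = body.partition(END_MARKER)[0]
    return body.strip()


# Invented response in the same shape as the output above.
raw = (
    "\n\nYour response:\n\n"
    "<<<TICKET_START>>>\n<BEGIN_TICKET>\n[STATUS] RESOLVED\n<<<TICKET_END>>>\n"
)
cleaned = extract_anonymised_ticket(raw)
print(cleaned)
```

If neither marker is present, the function returns the stripped response unchanged, which keeps the "never return an empty response" rule checkable downstream.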

In [60]:
prompt = f"""SYSTEM PROMPT

You are an expert technical editor creating gold-standard reference answers for an evaluation dataset used to assess a Retrieval-Augmented Generation (RAG) system for an HPC helpdesk assistant.

Your task is to convert a helpdesk ticket (initial description + conversation) into a high-quality, reusable reference answer.

Strict rules:
	•	The answer must be fully supported by the ticket conversation.
	•	Do not invent steps, commands, explanations, or policies that are not explicitly supported.
	•	Remove all personal information (names, emails, project codes, usernames, IPs).
	•	Do not mention the specific user or ticket.
	•	Do not include conversational phrases (e.g., “Thanks for your email”).
	•	Do not include internal administrative notes.
	•	Generalise the resolution so it applies to any user with the same issue.
	•	If multiple possible causes are discussed, include only those confirmed or clearly supported.
	•	Use precise technical language.
	•	Provide clear step-by-step instructions where applicable.
	•	If the conversation does not contain enough information to fully resolve the issue, state that further information would be required.

Output format:

Problem
[1–2 sentence neutral description of the issue]

Cause
[Root cause if explicitly stated or strongly implied; otherwise omit this section]

Resolution
[Clear numbered steps or structured explanation]

<END_ANSWER>

Do not include any other sections.

⸻

USER PROMPT TEMPLATE

Below is a helpdesk ticket.

[INITIAL_DESCRIPTION]
{issue_query}

[CONVERSATION]
{issue_conversation}

Generate the reference answer according to the system instructions."""

In [50]:
print(prompt)

SYSTEM PROMPT

You are an expert technical editor creating gold-standard reference answers for an evaluation dataset used to assess a Retrieval-Augmented Generation (RAG) system for an HPC helpdesk assistant.

Your task is to convert a helpdesk ticket (initial description + conversation) into a high-quality, reusable reference answer.

Strict rules:
	•	The answer must be fully supported by the ticket conversation.
	•	Do not invent steps, commands, explanations, or policies that are not explicitly supported.
	•	Remove all personal information (names, emails, project codes, usernames, IPs).
	•	Do not mention the specific user or ticket.
	•	Do not include conversational phrases (e.g., “Thanks for your email”).
	•	Do not include internal administrative notes.
	•	Generalise the resolution so it applies to any user with the same issue.
	•	If multiple possible causes are discussed, include only those confirmed or clearly supported.
	•	Use precise technical language.
	•	Provide clear step-by-s

In [62]:
response = llm.generate(
    prompt=prompt,
    stop=["<END_ANSWER>"],
)

print(response)

Problem
The user can no longer access their files at `/rds/user/ejs237/rds-dirac-dp012/ejs237/` and was informed that the directory has been moved to `/rds/project/dirac_vol2/rds-dirac-dp012`, but this location is not valid or accessible.

Cause
The issue is due to a broken symlink to the user's directory.

Resolution
To resolve the issue, follow these steps:
1. Navigate to the `~/rds` directory.
2. Remove the existing symlink: `rm rds-dirac-dp012`
3. Create a new symlink: `ln -s /rds/project/rds-1EqYKotbyrc rds-dirac-dp012`
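
The response above still contains the username and project code from the ticket, even though the prompt asks for personal information to be removed. Below is a minimal sketch of a check that flags such leaks before an answer is stored. `find_leaked_identifiers` is a hypothetical helper, and it assumes the identifiers to look for are supplied separately, for instance collected by hand or from ticket metadata.

```python
import re


def find_leaked_identifiers(text, identifiers):
    """Return the identifiers that still occur in text as standalone tokens."""
    leaked = []
    for identifier in identifiers:
        # The lookarounds reject matches inside a longer alphanumeric token,
        # while still matching after a hyphen, as in "rds-dirac-dp012".
        pattern = rf"(?<![A-Za-z0-9]){re.escape(identifier)}(?![A-Za-z0-9])"
        if re.search(pattern, text):
            leaked.append(identifier)
    return leaked


# One line taken from the generated answer above.
answer_line = "2. Remove the existing symlink: `rm rds-dirac-dp012`"
leaks = find_leaked_identifiers(answer_line, ["ejs237", "dp012"])
print(leaks)
```

A non-empty result could be used to reject the answer or to send it back through the anonymisation prompt.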


