#### Auto-fill Questionnaire using Chain of Thought or Few-Shot Examples

This notebook showcases the application of few-shot examples in autofilling questionnaires. It utilizes a json file (`risk_questionnaire_cot.json`) to
provide the LLM with example responses for some use-cases.

By leveraging these few-shot examples, we can enable seamless completion of lengthy questionnaires, minimizing manual effort and improving overall efficiency.


In [1]:
from risk_atlas_nexus.blocks.inference import (
    RITSInferenceEngine,
    WMLInferenceEngine,
    OllamaInferenceEngine,
    VLLMInferenceEngine,
)
from risk_atlas_nexus.blocks.inference.params import (
    InferenceEngineCredentials,
    RITSInferenceEngineParams,
    WMLInferenceEngineParams,
    OllamaInferenceEngineParams,
    VLLMInferenceEngineParams,
)

from risk_atlas_nexus.data import load_resource
from risk_atlas_nexus.library import RiskAtlasNexus

  from tqdm.autonotebook import tqdm


##### Risk Atlas Nexus uses Large Language Models (LLMs) to infer risks dimensions. Therefore requires access to LLMs to inference or call the model.

**Available Inference Engines**: WML, Ollama, vLLM, RITS. Please follow the [Inference APIs](https://github.com/IBM/risk-atlas-nexus?tab=readme-ov-file#install-for-inference-apis) guide before going ahead.

_Note:_ RITS is intended solely for internal IBM use and requires TUNNELALL VPN for access.


In [2]:
inference_engine = OllamaInferenceEngine(
    model_name_or_path="granite3.2:8b",
    credentials=InferenceEngineCredentials(api_url="http://localhost:11434"),
    parameters=OllamaInferenceEngineParams(
        num_predict=1000, temperature=0, repeat_penalty=1, num_ctx=8192
    ),
)

# inference_engine = WMLInferenceEngine(
#     model_name_or_path="ibm/granite-20b-code-instruct",
#     credentials={
#         "api_key": "WML_API_KEY",
#         "api_url": "WML_API_URL",
#         "project_id": "WML_PROJECT_ID",
#     },
#     parameters=WMLInferenceEngineParams(
#         max_new_tokens=1000, decoding_method="greedy", repetition_penalty=1
#     ),
# )

# inference_engine = VLLMInferenceEngine(
#     model_name_or_path="ibm-granite/granite-3.1-8b-instruct",
#     credentials=InferenceEngineCredentials(
#         api_url="VLLM_API_URL", api_key="VLLM_API_KEY"
#     ),
#     parameters=VLLMInferenceEngineParams(max_tokens=1000, temperature=0.7),
# )

# inference_engine = RITSInferenceEngine(
#     model_name_or_path="ibm-granite/granite-3.1-8b-instruct",
#     credentials={
#         "api_key": "RITS_API_KEY",
#         "api_url": "RITS_API_URL",
#     },
#     parameters=RITSInferenceEngineParams(max_tokens=1000, temperature=0.7),
# )

[2025-07-20 21:48:13:985] - INFO - RiskAtlasNexus - OLLAMA inference engine will execute requests on the server at http://localhost:11434.
[2025-07-20 21:48:14:24] - INFO - RiskAtlasNexus - Created OLLAMA inference engine.


##### Create an instance of RiskAtlasNexus

_Note: (Optional)_ You can specify your own directory in `RiskAtlasNexus(base_dir=<PATH>)` to utilize custom AI ontologies. If left blank, the system will use the provided AI ontologies.


In [3]:
risk_atlas_nexus = RiskAtlasNexus()

[2025-07-20 21:48:14:187] - INFO - RiskAtlasNexus - Created RiskAtlasNexus instance. Base_dir: None


#### Defining Examples for Auto-Assist Functionality

The auto-assist feature utilizes few-shot examples defined in the file `risk_atlas_nexus/data/templates/risk_questionnaire_cot.json` to predict the output of the risk questionnaire.

**Customization:**

To adapt this auto-assist functionality to custom risk questionnaire, users need to provide their own set of questions, example intents, and corresponding answers in a json file such as in [risk_questionnaire_cot.json](https://github.com/IBM/risk-atlas-nexus/blob/main/src/risk_atlas_nexus/data/templates/risk_questionnaire_cot.json). This will enable the LLM to learn from these few-shot examples and generate responses for unseen queries.

**CoT Template - Zero Shot method**

Each question is accompanied by corresponding examples provided as an empty list.

```shell
  [
      {
          "question": "In which environment is the system used?",
          "cot_examples": []
      }
      ...
  ]
```

**CoT Template - Few Shot method**

Each question is associated with a list of examples, each containing intent, answer, and optional explanation.

```shell
  [
      {
          "question": "In which environment is the system used?",
          "cot_examples": [
            {
              "intent": "Find patterns in healthcare insurance claims",
              "answer": "Insurance Claims Processing or Risk Management or Data Analytics",
              "explanation": "The system might be used by an insurance company's claims processing department to analyze and identify patterns in healthcare insurance claims."
            },
            {
                "intent": "optimize supply chain management in Investment banks",
                "answer": "Treasury Departments or Asset Management Divisions or Private Banking Units",
                "explanation": null
            },
            ...
          ]
      }
      ...
  ]
```

In this notebook, we're using a simplified template to cover 7 questions
from the Airo questionnaire:

1. AI Domain
2. System environment
3. Utilized techniques
4. Intended User
5. Intended Purpose
6. System Application
7. AI Subject


#### Load Risk Questionnaire

**Note:** The cell below loads examples of risk questionnaires from Risk Atlas Master. To load your custom questionnaire, create it according to the specified format and load it instead.


In [4]:
risk_questionnaire = load_resource("risk_questionnaire_cot.json")

risk_questionnaire[0]

{'no': 'Q1',
 'question': 'What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other',
 'cot_examples': [{'intent': 'Optimize supply chain management in Investment banks',
   'answer': 'Strategy',
   'explanation': 'Since the task is involved in improving the processes to ensure better performance. It is not finance since the task is on supply chain optimization and not on financial aspects even though the application domain is banks.'},
  {'intent': 'Ability to create dialog flows and integrations from natural language instructions.',
   'answer': 'Customer service/support',
   'explanation': 'Since the task relates to human conversations or generating human converstations or support.'},
  {'i

There are two ways to use the inference engine to get the LLM outputs. `generate_zero_shot_risk_questionnaire_output` which gives the zero-shot output for the question and `generate_few_shot_risk_questionnaire_output` which gives the output using few-shot examples defined above.


#### Auto Assist Questionnaire - Zero Shot


In [None]:
usecase = "Medical Diagnosis and Treatment Suggestions"

results = risk_atlas_nexus.generate_zero_shot_risk_questionnaire_output(
    usecase, risk_questionnaire, inference_engine
)

# Display Results
for index, (question_data, result) in enumerate(
    zip(risk_questionnaire, results), start=1
):
    print(
        f"\n{index}: "
        + question_data["question"]
        + "\nA: "
        + result.prediction["answer"]
    )

#### Auto Assist Questionnaire - Few Shot


In [None]:
import json 
with open("risk_questionnaire_benchmark.json") as f:
    usecases = json.load(f)

predictions = []

In [7]:
index_map = {0: "UseCase", 1: "Domain", 2: "Environment", 3: "Techniques_Utilised", 4: "Intended_User", 5: "Purpose", 6: "Application", 7: "Subject" }

In [None]:
for usecase_dict in usecases:
    usecase = usecase_dict["UseCase"]
    results = risk_atlas_nexus.generate_few_shot_risk_questionnaire_output(
        usecase,
        risk_questionnaire,
        inference_engine,
    )
    
    result_dict = {}
    # Display Results
    for index, (question_data, result) in enumerate(
        zip(risk_questionnaire, results), start=1
    ):
        print(
            f"\n{index}: "
            + question_data["question"]
            + "\nA: "
            + result.prediction["answer"]
        )
        result_dict[index_map[index]] = {}
        result_dict[index_map[index]]["answer"] = result.prediction["answer"]
        result_dict[index_map[index]]["explanation"] = result.prediction["explanation"]
        result_dict["usecase"] = usecase
        # print(result)
    
    predictions.append(result_dict)

Inferring with OLLAMA: 100%|██████████| 7/7 [01:19<00:00, 11.32s/it]



1: What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other
A: Communications

2: In which environment is the system used?
A: Multilingual Websites or Digital Content Platforms

3: What techniques are utilised in the system? Multi-modal: {Document Question/Answering, Image and text to text, Video and text to text, visual question answering}, Natural language processing: {feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, zero shot classification}, computer vision: {image classification, image segmentation, text to image, object detection}, audio:{audio classif

Inferring with OLLAMA: 100%|██████████| 7/7 [01:10<00:00, 10.11s/it]


1: What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other
A: Code/software engineering

2: In which environment is the system used?
A: Software Development Teams

3: What techniques are utilised in the system? Multi-modal: {Document Question/Answering, Image and text to text, Video and text to text, visual question answering}, Natural language processing: {feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, zero shot classification}, computer vision: {image classification, image segmentation, text to image, object detection}, audio:{audio classification, audi




In [None]:
with open("./autoassist_questionnaire.json", "w") as f:
    json.dump(predictions, f)

In [None]:
import json

with open("./autoassist_questionnaire.json") as f:
    autoassist_data = json.load(f)

usecase_autoassist = []
domain_autoassist = []
environment_autoassist = []
techniques_utilised_autoassist = []
intended_user_autoassist = []
purpose_autoassist = []
application_autoassist = []
subject_autoassist = []


for data in autoassist_data:
    domain_autoassist.append(data["Domain"])
    environment_autoassist.append(data["Environment"])
    techniques_utilised_autoassist.append(data["Techniques_Utilised"])
    intended_user_autoassist.append(data["Intended_User"])
    purpose_autoassist.append(data["Purpose"])
    application_autoassist.append(data["Application"])
    subject_autoassist.append(data["Subject"])
    usecase_autoassist.append(data["usecase"])



In [32]:
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Coherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

coherence_metric = GEval(
    name="Coherence",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)

test_cases = []
question = "In which environment is the system used?"
for index, data in enumerate(environment_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[coherence_metric])




Metrics Summary

  - ✅ Coherence [GEval] (score: 0.9, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The response directly addresses the prompt by identifying relevant departments (Customer Service, Sales, Helpdesks) and providing plausible explanations for how automated email response generation can benefit each. The explanations are specific and demonstrate an understanding of the potential use cases within each department. The response is well-organized and easy to understand., error: None)

For test case:

  - input: Intent: Automated Email Response Generation.In which environment is the system used?
  - actual output: Answer: Customer Service or Sales Departments or Helpdesks.Explanation: 1. Customer Service: Automated email response generation can be used by customer service teams to quickly and efficiently respond to customer inquiries, improving response times and customer satisfaction. 2. Sales Departments: Sales teams might use this system to aut

EvaluationResult(test_results=[TestResult(name='test_case_9', success=True, metrics_data=[MetricData(name='Coherence [GEval]', threshold=0.5, success=True, score=0.9, reason='The response directly addresses the prompt by identifying relevant departments (Customer Service, Sales, Helpdesks) and providing plausible explanations for how automated email response generation can benefit each. The explanations are specific and demonstrate an understanding of the potential use cases within each department. The response is well-organized and easy to understand.', strict_mode=False, evaluation_model='gemma3n (Ollama)', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nCoherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be\nrelevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received. \n \nEvaluation Steps:\n

In [None]:
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Coherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

coherence_metric = GEval(
    name="Coherence",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "What domain does your use request fall under? Customer service/support, Technical, Information retrieval, Strategy, Code/software engineering, Communications, IT/business automation, Writing assistant, Financial, Talent and Organization including HR, Product, Marketing, Cybersecurity, Healthcare, User Research, Sales, Risk and Compliance, Design, Other"
for index, data in enumerate(domain_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[coherence_metric])


In [33]:
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Coherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

coherence_metric = GEval(
    name="Coherence",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "What techniques are utilised in the system? Multi-modal: {Document Question/Answering, Image and text to text, Video and text to text, visual question answering}, Natural language processing: {feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, zero shot classification}, computer vision: {image classification, image segmentation, text to image, object detection}, audio:{audio classification, audio to audio, text to speech}, tabular: {tabular classification, tabular regression}, reinforcement learning"
for index, data in enumerate(techniques_utilised_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[coherence_metric])




Metrics Summary

  - ✅ Coherence [GEval] (score: 0.9, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The response directly addresses the prompt by identifying natural language processing, specifically text generation, as the core technique for automated email response generation. It accurately describes the process of analyzing the email, understanding its intent, and generating a natural language response. The explanation is coherent, relevant, and provides a clear rationale for the answer., error: None)

For test case:

  - input: Intent: Automated Email Response Generation.What techniques are utilised in the system? Multi-modal: {Document Question/Answering, Image and text to text, Video and text to text, visual question answering}, Natural language processing: {feature extraction, fill mask, question answering, sentence similarity, summarization, table question answering, text classification, text generation, token classification, translation, zero sho



In [41]:
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Coherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

coherence_metric = GEval(
    name="Coherence",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "Who is the intended user of the system?"
for index, data in enumerate(intended_user_autoassist[12:]):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[coherence_metric])




Metrics Summary

  - ✅ Coherence [GEval] (score: 0.9, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The response directly addresses the prompt by explaining why website owners or developers need to use automated language translation systems. It covers the relevance to a diverse audience, improved user experience, increased traffic, and potential benefits like sales or engagement. The explanation is well-structured, relevant, and provides a clear rationale., error: None)

For test case:

  - input: Intent: Sentiment Analysis for Social Media Monitoring.Who is the intended user of the system?
  - actual output: Answer: Website owners or developers.Explanation: Website owners or developers need to cater to a diverse audience that speaks different languages. By using an automated language translation system, they can easily translate their website content into multiple languages, making it accessible to a broader audience. This can help improve user experienc

EvaluationResult(test_results=[TestResult(name='test_case_2', success=True, metrics_data=[MetricData(name='Coherence [GEval]', threshold=0.5, success=True, score=0.9, reason='The response directly addresses the prompt by explaining why website owners or developers need to use automated language translation systems. It covers the relevance to a diverse audience, improved user experience, increased traffic, and potential benefits like sales or engagement. The explanation is well-structured, relevant, and provides a clear rationale.', strict_mode=False, evaluation_model='gemma3n (Ollama)', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nCoherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be\nrelevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received. \n \nEvaluation Steps:\n[\n    "Read the input 

In [45]:
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Coherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

coherence_metric = GEval(
    name="Coherence",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "What is the intended purpose of the system?"
for index, data in enumerate(purpose_autoassist[8:]):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[coherence_metric])



Metrics Summary

  - ✅ Coherence [GEval] (score: 0.9, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The response directly addresses the prompt by explaining the benefits of automating email response generation. It covers key aspects like efficiency, consistency, reduced manual effort, and error reduction, aligning well with the evaluation steps. The explanation is relevant, coherent, and provides a clear rationale for the proposed automation., error: None)

For test case:

  - input: Intent: Code Autocompletion and Debugging in Software Engineering.What is the intended purpose of the system?
  - actual output: Answer: To automate the generation of customized email responses, improving customer service efficiency and consistency, while reducing manual effort and potential human error..Explanation: The intent of automating email response generation is to create a system that can automatically produce tailored email replies based on predefined templates and 

EvaluationResult(test_results=[TestResult(name='test_case_1', success=True, metrics_data=[MetricData(name='Coherence [GEval]', threshold=0.5, success=True, score=0.9, reason='The response directly addresses the prompt by explaining the benefits of automating email response generation. It covers key aspects like efficiency, consistency, reduced manual effort, and error reduction, aligning well with the evaluation steps. The explanation is relevant, coherent, and provides a clear rationale for the proposed automation.', strict_mode=False, evaluation_model='gemma3n (Ollama)', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nCoherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be\nrelevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received. \n \nEvaluation Steps:\n[\n    "Read the input and question c

In [46]:
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Coherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

coherence_metric = GEval(
    name="Coherence",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "What is the application of the system?"
for index, data in enumerate(application_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[coherence_metric])



Metrics Summary

  - ✅ Coherence [GEval] (score: 0.9, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The response directly addresses the prompt by defining automated data analysis in the context of automated report generation. It accurately describes the process, including the use of machine learning algorithms for data extraction and trend identification, and explains the value of these insights for business intelligence. The explanation is relevant, coherent, and provides a clear understanding of the concept., error: None)

For test case:

  - input: Intent: Automated Report Generation in Business Intelligence.What is the application of the system?
  - actual output: Answer: Automated Data Analysis: Use machine learning algorithms to extract insights and trends from large datasets..Explanation: The application of the system for automated report generation in business intelligence involves Automated Data Analysis. This involves using machine learning algo



In [47]:
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import ContextualRelevancyMetric

criteria = """Coherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be
relevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received."""

coherence_metric = GEval(
    name="Coherence",
    criteria=criteria,
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Read the input and question carefully and identify if the actual output is relevant to the input.",
        "Read the actual output. Check if the actual output covers the input and the explanation of the output is valid, relevant and probable to be correct.",
        "Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],

)


test_cases = []
question = "Who is the subject as per the intent?"
for index, data in enumerate(subject_autoassist):

    actual_output = "Answer: " + data["answer"] + ".Explanation: " + data["explanation"] 
    input = "Intent: " + usecase_autoassist[index] + "." + question
    retrieval_context = [input]
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output, 
    )
    test_cases.append(test_case)


evaluate(test_cases=test_cases, metrics=[coherence_metric])



Metrics Summary

  - ✅ Coherence [GEval] (score: 0.5, threshold: 0.5, strict: False, evaluation model: gemma3n (Ollama), reason: The explanation directly addresses the input by stating the AI system's subject is email correspondence and detailing the core functionality of analyzing and generating responses. It accurately reflects the need for understanding email content and context, demonstrating strong coherence with the evaluation steps., error: None)

For test case:

  - input: Intent: Automated Email Response Generation.Who is the subject as per the intent?
  - actual output: Answer: Email Correspondence.Explanation: The system would need to analyze and generate responses to incoming emails, implying that the subject of the AI system is the email correspondence itself. This could involve understanding the content and context of incoming emails to generate appropriate automated responses.
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass

EvaluationResult(test_results=[TestResult(name='test_case_9', success=True, metrics_data=[MetricData(name='Coherence [GEval]', threshold=0.5, success=True, score=0.5, reason="The explanation directly addresses the input by stating the AI system's subject is email correspondence and detailing the core functionality of analyzing and generating responses. It accurately reflects the need for understanding email content and context, demonstrating strong coherence with the evaluation steps.", strict_mode=False, evaluation_model='gemma3n (Ollama)', error=None, evaluation_cost=0.0, verbose_logs='Criteria:\nCoherence (1-5) - the collective relevance and correctness of the answer. We align this dimension with the structure and coherence whereby the answer should be\nrelevant and correct. The answer should not be completely irrelevant, but should be plausible extraction based on the information received. \n \nEvaluation Steps:\n[\n    "Read the input and question carefully and identify if the act