### Evaluation of the Aggregate Requirement Artifact Set

In [212]:
# Read the evaluation prompt from file
with open("prompts/evaluation_prompt_aggregate.txt", "r") as file:
    EVALUATION_PROMPT = file.read()
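
The template file itself is not shown in this notebook. Since `get_geval_score` fills it with `str.format`, it presumably exposes the six named fields passed there. A minimal sketch of checking a template for those fields before any API call is made; `SAMPLE_TEMPLATE` is a made-up stand-in (its wording is an assumption, not the contents of `prompts/evaluation_prompt_aggregate.txt`), and `template_fields` and `check_template` are hypothetical helper names:

```python
from string import Formatter

# Field names passed to EVALUATION_PROMPT.format(...) in get_geval_score
EXPECTED_FIELDS = {"criteria", "steps", "metric_name", "standard", "srs", "requirement"}

# Hypothetical stand-in for the real template file, for illustration only
SAMPLE_TEMPLATE = (
    "Standard:\n{standard}\n\nSRS:\n{srs}\n\n"
    "Evaluation Criteria:\n{criteria}\n\nEvaluation Steps:\n{steps}\n\n"
    "Requirement artifacts:\n{requirement}\n\n{metric_name} score (1-5):"
)


def template_fields(template: str) -> set:
    """Return the set of named replacement fields in a str.format template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_template(template: str) -> None:
    """Raise ValueError if the template's fields differ from EXPECTED_FIELDS."""
    found = template_fields(template)
    missing = EXPECTED_FIELDS - found
    unexpected = found - EXPECTED_FIELDS
    if missing or unexpected:
        raise ValueError(f"missing={sorted(missing)}, unexpected={sorted(unexpected)}")


check_template(SAMPLE_TEMPLATE)  # passes silently for the stand-in template
```

Calling `check_template(EVALUATION_PROMPT)` right after loading the file would surface a renamed or missing placeholder as a `ValueError` rather than as a `KeyError` in the middle of the evaluation loop.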

In [213]:

# Metric 1: Feasibility
FEASIBILITY_SCORE_CRITERIA = """
Feasibility (1-5) - A requirement or an aggregate is achievable/feasible/attainable if and only if \
there exists at least one system design and implementation that correctly \
implements the requirement or all the requirements stated in the aggregate at a \
definable cost. 
"""

FEASIBILITY_SCORE_STEPS = """
1. Read the generated requirement artifacts and the SRS document carefully.
2. Check if the set of requirement artifacts is feasible to accomplish according to the Evaluation Criteria.
3. Assign a single feasibility score from 1 to 5, where 1 is the lowest and 5 is the highest based on the evaluation criteria.
"""

# Metric 2: Clarity
CLARITY_SCORE_CRITERIA = """
Clarity (1-5) - A requirement or an aggregate is clear/precise/meaningful if and only if \
numeric quantities are used whenever possible, and the appropriate levels of \
precision are used for all numeric quantities.
"""

CLARITY_SCORE_STEPS = """
1. Read the generated requirement artifacts and the SRS document carefully.
2. Check if the set of requirement artifacts is clear and meaningful according to the Evaluation Criteria.
3. Assign a single clarity score from 1 to 5, where 1 is the lowest and 5 is the highest based on the evaluation criteria.
"""

# Metric 3: Completeness
COMPLETENESS_SCORE_CRITERIA = """
Completeness (1-5) - A requirement is complete if it is capable of standing alone when separated from \
other requirements and does not need further amplification. An aggregate of \
requirements is complete if and only if it includes all significant requirements, \
whether relating to functionality, performance, design constraints, attributes, or external interfaces. 
"""

COMPLETENESS_SCORE_STEPS = """
1. Read the generated requirement artifacts and the SRS document carefully.
2. Check if the set of requirement artifacts is complete according to the Evaluation Criteria.
3. Assign a single completeness score from 1 to 5, where 1 is the lowest and 5 is the highest based on the evaluation criteria.
"""

# Metric 4: Unambiguity
UNAMBIGUITY_SCORE_CRITERIA = """
Unambiguity (1-5) - A requirement or an aggregate is unambiguous if different readers \
with similar backgrounds would be able to draw only one interpretation \
of the requirement or of each requirement in the aggregate. 
"""

UNAMBIGUITY_SCORE_STEPS = """
1. Read the generated requirement artifacts and the SRS document carefully.
2. Check if the set of requirement artifacts is unambiguous according to the Evaluation Criteria.
3. Assign a single unambiguity score from 1 to 5, where 1 is the lowest and 5 is the highest based on the evaluation criteria.
"""

# Metric 5: Consistency
CONSISTENCY_SCORE_CRITERIA = """
Consistency (1-5) - An aggregate is internally consistent if and only if \
no subset of individual requirements stated in it conflict. The new requirement artifacts should not contradict or confuse \
the existing requirements in the SRS document.
"""

CONSISTENCY_SCORE_STEPS = """
1. Read the generated requirement artifacts and the SRS document carefully.
2. Check if the aggregate set is in conflict with itself or the other requirement artifacts in the document according to the evaluation criteria.
3. Assign a single score for consistency based on the Evaluation Criteria.
"""

# Metric 6: Modifiability
MODIFIABILITY_SCORE_CRITERIA = """
Modifiability (1-5) - An aggregate is modifiable if and only if \
its structure and style are such that any changes to the requirements \
can be made easily, completely, and consistently while retaining \
the structure and style. 
"""

MODIFIABILITY_SCORE_STEPS = """
1. Read the generated requirement artifacts and the SRS document carefully.
2. Check if the requirement artifacts are easily modifiable based on the evaluation criteria.
3. Assign a single score for modifiability based on the Evaluation Criteria.
"""

# Metric 7: Correctness
CORRECTNESS_SCORE_CRITERIA = """
Correctness (1-5) - A requirement is correct if it accurately describes \
a functionality to be delivered. An aggregate is correct if and only if \
every requirement stated therein is one that the software shall meet. 
"""

CORRECTNESS_SCORE_STEPS = """
1. Read the generated requirement artifacts and the SRS document carefully.
2. Check if the full functionality of the set of requirements is captured based on the evaluation criteria.
3. Assign a single score for correctness based on the Evaluation Criteria.
"""

# Metric 8: Necessity
NECESSITY_SCORE_CRITERIA = """
Necessity (1-5) - A requirement is necessary if the stated requirement \
is an essential capability, physical characteristic, or quality factor of \
the product or process. If it is removed or deleted, a deficiency will exist, \
which cannot be fulfilled by other capabilities of the product or process. 
"""

NECESSITY_SCORE_STEPS = """
1. Read the generated requirement artifacts and the SRS document carefully.
2. Check if the requirement artifacts are necessary based on the evaluation criteria.
3. Assign a single score for necessity based on the Evaluation Criteria.
"""

In [214]:
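# Get the OpenAI API key
import json

# Use a context manager so the config file is closed after reading
with open("config.json") as config_file:
    config_data = json.load(config_file)
openai_api_key = config_data["OPENAI_API_KEY"]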
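
Keeping the key in `config.json` works, but such a file is easy to commit by accident. A minimal sketch of a loader that prefers an environment variable and falls back to the config file; the helper name `load_api_key` and the use of an environment variable at all are assumptions, not part of the original notebook:

```python
import json
import os


def load_api_key(config_path: str = "config.json", name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment if set, else from the JSON config file."""
    key = os.environ.get(name)
    if key:
        return key
    # Fall back to the same config.json entry the notebook reads
    with open(config_path) as config_file:
        return json.load(config_file)[name]
```

With this in place, `openai_api_key = load_api_key()` behaves like the cell above whenever the environment variable is unset.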

In [215]:
# Create the OpenAI Client
from openai import OpenAI
client = OpenAI(api_key=openai_api_key)

In [216]:
# Provide the generated requirement artifacts
from pathlib import Path

# Path.read_text() opens, reads, and closes each file
req_artifact_1 = Path("outputs/best_outputs/user_story_1.json").read_text()
req_artifact_2 = Path("outputs/best_outputs/user_story_2.json").read_text()
req_artifact_3 = Path("outputs/best_outputs/user_story_3.json").read_text()
req_artifact_4 = Path("outputs/best_outputs/user_story_4.json").read_text()
bad_artifact_5 = Path("inputs/bad_examples/bad_story_1.json").read_text()

In [217]:
# Provide the SRS standard
import pymupdf4llm

# Parse the PDF file into Markdown for passing to the model
ieee_std_path = "inputs/IEEE 830-1998.pdf"
ieee_std_txt = pymupdf4llm.to_markdown(ieee_std_path)

In [218]:
# Provide the reference SRS
with open("inputs/edited_srs_trimmed.md", "r") as srs_file:
    srs = srs_file.read()

In [219]:
# Function to get an evaluation score of the generated requirements artifacts
def get_geval_score(
    criteria: str, steps: str, standard: str, srs: str, requirement: str, metric_name: str
):
    """Fill the evaluation prompt for one metric and return the model's raw text reply."""
    prompt = EVALUATION_PROMPT.format(
        criteria=criteria,
        steps=steps,
        metric_name=metric_name,
        standard=standard,
        srs=srs,
        requirement=requirement,
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=200,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content

In [220]:
import pandas as pd
import re

evaluation_metrics = {
    # "Feasibility": (FEASIBILITY_SCORE_CRITERIA, FEASIBILITY_SCORE_STEPS),
    "Clarity": (CLARITY_SCORE_CRITERIA, CLARITY_SCORE_STEPS),
    "Completeness": (COMPLETENESS_SCORE_CRITERIA, COMPLETENESS_SCORE_STEPS),
    "Unambiguous": (UNAMBIGUITY_SCORE_CRITERIA, UNAMBIGUITY_SCORE_STEPS),
    "Consistency": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),
    # "Modifiability": (MODIFIABILITY_SCORE_CRITERIA, MODIFIABILITY_SCORE_STEPS),
    "Correctness": (CORRECTNESS_SCORE_CRITERIA, CORRECTNESS_SCORE_STEPS),
    "Necessity": (NECESSITY_SCORE_CRITERIA, NECESSITY_SCORE_STEPS),
    }

requirements = {
    "Traffic Data": req_artifact_1,
    "Inspection Notifications": req_artifact_2,
    "Color-Code": req_artifact_3,
    "PowerPoint Report": req_artifact_4,
    "Bad Example": bad_artifact_5
    }

data = {"Evaluation Type": [], "Requirement Type": [], "Score": []}

for eval_type, (criteria, steps) in evaluation_metrics.items():
    for req_type, req in requirements.items():
        data["Evaluation Type"].append(eval_type)
        data["Requirement Type"].append(req_type)
        result = get_geval_score(criteria, steps, ieee_std_txt, srs, req, eval_type)
        print("-"*30)
        print(f"{eval_type} for {req_type}:")
        print("---")
        print(result)
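        # Prefer the digit after "Score"; fall back to the first standalone 1-5 digit
        match = re.search(r"[Ss]core\D*?([1-5])\b", result) or re.search(r"\b([1-5])\b", result)
        if match is None:
            raise ValueError(f"No 1-5 score found in reply for {eval_type} / {req_type}")
        score_num = int(match.group(1))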
        data["Score"].append(score_num)

------------------------------
Clarity for Traffic Data:
---
**Clarity Score: 4**

The requirement artifacts are generally clear and precise, with numeric quantities and appropriate levels of precision used in most cases. However, there is a slight room for improvement in specifying certain details more explicitly. For example, the term "user-friendly interface" in LAF-1 could be more precisely defined, and the success-end condition in UC-4 could specify what constitutes "supporting information" more clearly. Overall, the artifacts are well-articulated and understandable.
------------------------------
Clarity for Inspection Notifications:
---
**Clarity Score: 4**

**Rationale:**
The requirement artifacts are generally clear and precise, with specific functionalities and conditions well-defined. Numeric quantities are used appropriately, such as "inspections coming due in the next month" and "99.9% of the time." However, some descriptions could benefit from additional precision, such a
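
The loop above pulls a number out of each free-text reply, which is the step most likely to go wrong when a reply contains other digits (a numbered list, a percentage) or no digit at all. A minimal sketch of that step as a standalone helper, so it can be checked against sample replies without calling the API. The name `extract_score` is hypothetical, and the pattern is a heuristic that assumes replies follow the "<Metric> Score: N" shape seen in the output above:

```python
import re
from typing import Optional


def extract_score(reply: str) -> Optional[int]:
    """Return the 1-5 rating found in a model reply, or None if there is none."""
    # Prefer a digit that follows the word "Score", e.g. "**Clarity Score: 4**"
    match = re.search(r"[Ss]core\D*?([1-5])\b", reply)
    if match is None:
        # Fall back to the first standalone digit in the 1-5 range
        match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


# Sample replies: the first mirrors the format printed above, the others are made up
samples = [
    "**Clarity Score: 4**\n\nThe requirement artifacts are generally clear",
    "1. Read the artifacts.\nClarity Score: 3",
    "No rating given",
]
for reply in samples:
    print(extract_score(reply))
```

Returning `None` rather than raising lets the caller decide whether to retry, skip, or record a missing value for that metric.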

In [221]:
pivot_df = pd.DataFrame(data, index=None).pivot(
    index="Evaluation Type", columns="Requirement Type", values="Score"
)
styled_pivot_df = pivot_df.style.highlight_max(color="lightgreen", axis=1)
display(styled_pivot_df)

Evaluation Type,Bad Example,Color-Code,Inspection Notifications,PowerPoint Report,Traffic Data
Clarity,2,3,4,4,4
Completeness,3,4,4,4,4
Consistency,2,5,5,5,5
Correctness,2,5,5,5,5
Necessity,3,4,5,3,4
Unambiguous,2,4,4,4,4
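
To compare artifacts at a glance, the six metric scores can be collapsed into one mean per artifact. A minimal sketch using the values from the table above; weighting every metric equally is an assumption made for illustration, not something the notebook prescribes:

```python
from statistics import mean

# Scores copied from the pivot table above (rows: metrics, columns: artifacts)
columns = [
    "Bad Example",
    "Color-Code",
    "Inspection Notifications",
    "PowerPoint Report",
    "Traffic Data",
]
scores = {
    "Clarity": [2, 3, 4, 4, 4],
    "Completeness": [3, 4, 4, 4, 4],
    "Consistency": [2, 5, 5, 5, 5],
    "Correctness": [2, 5, 5, 5, 5],
    "Necessity": [3, 4, 5, 3, 4],
    "Unambiguous": [2, 4, 4, 4, 4],
}

# Average each column across the six metrics, rounded to two decimals
mean_by_artifact = {
    name: round(mean(row[i] for row in scores.values()), 2)
    for i, name in enumerate(columns)
}

# Print artifacts from lowest to highest mean score
for name, avg in sorted(mean_by_artifact.items(), key=lambda item: item[1]):
    print(f"{name:<26}{avg:.2f}")
```

With the notebook's `pivot_df` still in memory, `pivot_df.mean(axis=0)` should give the same column means without retyping the values.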
