## `TL;DR`: How sampling works

1. Iterate over each dataset directory; the directory name corresponds to the `<dataset_name>`

2. Load the dataset CSV files (`RE.csv`, `ST.csv`, `mapping.csv`) and create a dataframe for each

3. Randomly generate `number_of_subsets` subsets using one of two sampling functions, selected with the `use_primitive` boolean variable:

   > **Hint**: the key difference lies in how the functions handle `one-to-many` / `many-to-many` mappings. To also include indirectly connected requirements and tests, select the first option; note that this may increase the sample size.

   - Option 1: `sample_data()` - recursive; subset size >= `sample_size` argument:
      1. Randomly sample *`n`* requirements (IDs)
      2. Use the mappings to find all related tests (IDs)
      3. Check if any tests are related to more than one requirement
      4. Recursively add any indirectly connected requirements and check their connected tests as well
      5. Repeat until all directly or indirectly connected requirements and tests have been identified
      6. Filter and return the new subset (`requirements`, `tests`, and `mappings` dataframes)

   - Option 2: `sample_data_primitive()` - non-recursive alternative; strictly respects the `sample_size` argument:
      1. Randomly sample *`n`* requirements (IDs)
      2. Use the mappings to find all related tests (IDs)
      3. Filter and return the new subset (`requirements`, `tests`, and `mappings` dataframes)

4. Store the subsets within the corresponding `dataset` directory:
    - subset directories use an index naming convention starting at `01`
    - each subset directory contains three files: `RE.csv`, `ST.csv`, and `mapping.csv`
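
To make the difference between the two options concrete, the sketch below applies both strategies to a small, made-up mapping. The data (`toy_mapping`) and the helper names (`direct_tests`, `connected_closure`) are assumptions for illustration only; they are not the notebook's functions, which operate on dataframes.

```python
# Hypothetical toy mapping: requirement ID -> set of test IDs
toy_mapping: dict[str, set[str]] = {
    "R1": {"T1", "T2"},
    "R2": {"T2"},        # shares T2 with R1
    "R3": {"T3"},
    "R4": {"T3", "T4"},  # shares T3 with R3
}


def direct_tests(sampled: set[str], mapping: dict[str, set[str]]) -> set[str]:
    """Option 2 style: only the tests mapped directly to the sampled requirements."""
    return set().union(*(mapping[req] for req in sampled))


def connected_closure(sampled: set[str], mapping: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Option 1 style: expand until no new requirements or tests are reachable."""
    reqs: set[str] = set()
    tests: set[str] = set()
    pending = list(sampled)
    while pending:
        req = pending.pop()
        if req in reqs:
            continue  # already processed
        reqs.add(req)
        tests |= mapping[req]
        # Queue every other requirement that shares at least one test with `req`
        pending.extend(r for r, ts in mapping.items() if ts & mapping[req] and r not in reqs)
    return reqs, tests


print(sorted(direct_tests({"R1"}, toy_mapping)))  # ['T1', 'T2']
reqs, tests = connected_closure({"R1"}, toy_mapping)
print(sorted(reqs), sorted(tests))  # ['R1', 'R2'] ['T1', 'T2']
```

Sampling only `R1` keeps one requirement under the primitive strategy, whereas the closure also pulls in `R2` because both requirements share `T2`. This is why the recursive option can exceed `sample_size`.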

**Potential TODOs**:
- Add additional code to find how many different mappings are in each dataset / subset? Save these mappings to CSV?
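
One possible reading of this TODO is to count how many requirement-test links fall into each relationship type. The sketch below assumes the mapping dataframe has the `Req ID` and `Test ID` columns used elsewhere in this notebook, with comma-separated test IDs; the function name `count_mapping_kinds` and the demo data are hypothetical.

```python
import pandas as pd


def count_mapping_kinds(mappings: pd.DataFrame) -> dict[str, int]:
    """Classify every (requirement, test) link by relationship type and count them."""
    # One row per (Req ID, Test ID) link; requirements without tests are dropped
    links = mappings[["Req ID", "Test ID"]].dropna(subset=["Test ID"])
    links = links.assign(**{"Test ID": links["Test ID"].str.split(",")}).explode("Test ID")
    links["Test ID"] = links["Test ID"].str.strip()
    links = links[links["Test ID"] != ""].drop_duplicates().reset_index(drop=True)

    # Fan-out per requirement and fan-in per test, aligned with each link row
    tests_per_req = links.groupby("Req ID")["Test ID"].transform("nunique")
    reqs_per_test = links.groupby("Test ID")["Req ID"].transform("nunique")

    kind = pd.Series("many-to-many", index=links.index)
    kind.loc[(tests_per_req == 1) & (reqs_per_test == 1)] = "one-to-one"
    kind.loc[(tests_per_req > 1) & (reqs_per_test == 1)] = "one-to-many"
    kind.loc[(tests_per_req == 1) & (reqs_per_test > 1)] = "many-to-one"
    return kind.value_counts().to_dict()


# Hypothetical demo data in the same shape as mapping.csv
demo = pd.DataFrame({
    "Req ID": ["R1", "R2", "R3", "R4"],
    "Test ID": ["T1, T2", "T3", "T3", None],
})
print(count_mapping_kinds(demo))
```

In the demo, `R1` fans out to two tests (two one-to-many links), `R2` and `R3` share `T3` (two many-to-one links), and `R4` has no tests and is ignored. The resulting dictionary could be turned into a dataframe and written with `to_csv` if the counts should be stored.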

---

## Imports

In [1]:
import os
import pandas as pd
import random

## Read & store data

In [2]:
def read_requirements(directory: str) -> pd.DataFrame:
    file_path = os.path.join(directory, "RE.csv")
    return pd.read_csv(file_path, dtype=str, on_bad_lines="warn")

def read_tests(directory: str) -> pd.DataFrame:
    file_path = os.path.join(directory, "ST.csv")
    test_df = pd.read_csv(file_path, dtype=str, on_bad_lines="warn")
    
    # Some Purpose columns are intentionally left blank; populate them with empty strings
    test_df["Purpose"] = test_df["Purpose"].fillna("")
    return test_df

def read_mappings(directory: str) -> pd.DataFrame:
    file_path = os.path.join(directory, "mapping.csv")
    return pd.read_csv(file_path, dtype=str, on_bad_lines="warn")

In [3]:
def save_subset_data(requirements: pd.DataFrame, tests: pd.DataFrame, mappings: pd.DataFrame, output_dir: str, subset_index: int) -> None:
    # Subset directories are named by a zero-padded, two-digit index (e.g., "01", "10")
    subset_dir = os.path.join(output_dir, f"{subset_index:02d}")
    os.makedirs(subset_dir, exist_ok=True)

    # Write the three subset files without the dataframe index column
    requirements.to_csv(os.path.join(subset_dir, "RE.csv"), index=False)
    tests.to_csv(os.path.join(subset_dir, "ST.csv"), index=False)
    mappings.to_csv(os.path.join(subset_dir, "mapping.csv"), index=False)


### Sampling functions

In [4]:
def recursive_include_connected(req_id: str, requirements_set: set, tests_set: set, mappings: dict[str, set]):
    """
    Helper function for `sample_data()`.

    Recursively identifies all directly or indirectly connected system test cases and requirements -  
    adds each identified artifact to the `requirements_set` or `tests_set` respectively.
    """
    # Base case: current requirement has already been processed
    if req_id in requirements_set: 
        return
    
    # Add the current requirement ID to the set of requirement IDs
    requirements_set.add(req_id)

    # Look up the set of tests mapped to this requirement
    connected_tests = mappings[req_id]

    # Add the connected tests to the set of tests (in-place union)
    tests_set |= connected_tests

    # Recursively check if any connected test maps to more than one requirement
    for test_id in connected_tests:
        # Requirement ID's connected to the current test
        connected_reqs = [req for req in mappings if test_id in mappings[req]]

        for req in connected_reqs:
            recursive_include_connected(req, requirements_set, tests_set, mappings)


def sample_data(requirements: pd.DataFrame, tests: pd.DataFrame, mappings: pd.DataFrame, sample_size: int) -> \
        tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Randomly samples a subset of requirement specifications and recursively includes all
    directly or indirectly connected system test cases and requirements
    (e.g., through one-to-many or many-to-one relationships).

    This means the final subset may include more requirements than specified by `sample_size`, 
    depending on the interconnections in the mapping data. 
    """
    # Convert ground truth data into a dictionary where keys are req_id and values are sets of test_ids
    mappings_dict: dict[str, set] = {
        row["Req ID"]: # key
        set(map(str.strip, str(row["Test ID"]).split(","))) if pd.notna(row["Test ID"]) else set() # value 
        for _, row in mappings.iterrows()
    }
    
    # Sample "n" random requirement ID's
    sampled_requirements: set[str] = set(random.sample(list(mappings["Req ID"]), sample_size))

    # Sets to hold the sampled requirement and test ID's
    sampled_requirements_set: set[str] = set()
    sampled_tests_set: set[str] = set()

    # Recursively find any connected tests and requirements; add them to the corresponding set
    for req_id in sampled_requirements:
        recursive_include_connected(req_id, sampled_requirements_set, sampled_tests_set, mappings_dict)

    # Create filtered dataframes containing only the sampled requirements, tests, and mappings
    filtered_reqs = requirements[requirements["ID"].isin(sampled_requirements_set)]
    filtered_tests = tests[tests["ID"].isin(sampled_tests_set)]
    filtered_mapping = mappings[
            # Find all rows that contain a sampled req_id or test_id
            mappings["Req ID"].isin(sampled_requirements_set) | 
            mappings["Test ID"].apply(
                    lambda test_id_str: bool(sampled_tests_set & set(map(str.strip, test_id_str.split(","))))
                    if pd.notna(test_id_str) else False
                )
        ]
    
    return filtered_reqs, filtered_tests, filtered_mapping


def sample_data_primitive(requirements: pd.DataFrame, tests: pd.DataFrame, mappings: pd.DataFrame, sample_size: int) -> \
        tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """ 
    Randomly samples a subset of requirement specifications and includes only the directly 
    connected system test cases based on the provided mappings.

    Unlike the recursive version, this function does not follow indirect connections, 
    so the returned subset strictly reflects the initially sampled requirements.
    """

    # Convert ground truth data into a dictionary where keys are req_id and values are sets of test_ids
    mappings_dict: dict[str, set] = {
        row["Req ID"]: # key
        set(map(str.strip, str(row["Test ID"]).split(","))) if pd.notna(row["Test ID"]) else set() # value 
        for _, row in mappings.iterrows()
    }

    # Sample "n" random requirement ID's
    sampled_requirements: set[str] = set(random.sample(list(mappings["Req ID"]), sample_size))
    
    # Set to hold the sampled test ID's
    sampled_tests: set[str] = set()

    # Find all test ID's connected to the sampled requirements
    for req_id in sampled_requirements:
        sampled_tests |= mappings_dict[req_id]

    # Create filtered dataframes containing only the sampled requirements, tests, and mappings
    filtered_reqs = requirements[requirements["ID"].isin(sampled_requirements)]
    filtered_tests = tests[tests["ID"].isin(sampled_tests)]
    filtered_mapping = mappings[
            # Find all rows that contain a sampled req_id
            mappings["Req ID"].isin(sampled_requirements)
        ]

    return filtered_reqs, filtered_tests, filtered_mapping
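
`recursive_include_connected()` makes one nested call per newly discovered requirement and rescans the whole mapping for every connected test. On a heavily interconnected dataset this could in principle become slow or approach Python's recursion limit. The following is a minimal sketch of an iterative alternative that should reach the same requirements and tests; the names `build_reverse_index` and `include_connected_iterative` are hypothetical and not part of this notebook.

```python
from collections import defaultdict, deque


def build_reverse_index(mappings: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert requirement -> tests into test -> requirements to avoid full scans."""
    reverse: dict[str, set[str]] = defaultdict(set)
    for req_id, test_ids in mappings.items():
        for test_id in test_ids:
            reverse[test_id].add(req_id)
    return reverse


def include_connected_iterative(seed_reqs: set[str], mappings: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Breadth-first expansion over the requirement/test graph, without recursion."""
    reverse = build_reverse_index(mappings)
    reqs: set[str] = set()
    tests: set[str] = set()
    queue = deque(seed_reqs)
    while queue:
        req_id = queue.popleft()
        if req_id in reqs:
            continue  # already processed
        reqs.add(req_id)
        # Assumption: unknown requirement IDs are treated as having no tests
        for test_id in mappings.get(req_id, set()):
            tests.add(test_id)
            # Queue requirements that share this test and have not been processed yet
            queue.extend(reverse[test_id] - reqs)
    return reqs, tests


# Hypothetical example: R1, R2, and R3 are chained through T1 and T2; R4 is isolated
example = {"R1": {"T1"}, "R2": {"T1", "T2"}, "R3": {"T2"}, "R4": {"T9"}}
reqs, tests = include_connected_iterative({"R1"}, example)
print(sorted(reqs), sorted(tests))  # ['R1', 'R2', 'R3'] ['T1', 'T2']
```

The reverse index is built once per call, so each test lookup is a dictionary access instead of a pass over every requirement.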

## Main execution cell

In [None]:
datasets_dir = "." # Datasets are currently stored in the same directory as this Jupyter Notebook

number_of_subsets = 10

# TODO: Placeholder value.
# The datasets have vastly different population sizes, which makes a single fixed sample size a poor fit.
# Options: hard-code one value per dataset (e.g., in a list or dict), or
# derive the sample size dynamically as a proportion (percentage) of the population.
sample_size = 3

use_primitive = False # Determines the sampling function

sampling_function = sample_data_primitive if use_primitive else sample_data # Select the sampling function

for dataset in os.listdir(datasets_dir):
    # Construct filepath to the current dataset
    dataset_path = os.path.join(datasets_dir, dataset)

    # Skip the current iteration if we encountered a file rather than a directory
    if not os.path.isdir(dataset_path):
        continue
    
    requirements = read_requirements(dataset_path)
    tests        = read_tests(dataset_path)
    mappings     = read_mappings(dataset_path)
    
    # Shift the entire range by 1 so that indexing starts at 1
    for i in range(1, number_of_subsets + 1):
        # Sample dataset to create new subset
        sampled_reqs, sampled_tests, sampled_mappings = sampling_function(
            requirements, tests, mappings, sample_size
        )

        # Save the sampled data to CSV files; use i as the subset index
        save_subset_data(sampled_reqs, sampled_tests, sampled_mappings, dataset_path, i)
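
The comments in the cell above mention deriving the sample size from the population instead of hard-coding it. A minimal sketch of that idea follows, assuming a hypothetical helper `proportional_sample_size`; the fraction and the minimum are illustrative values, not settings taken from this notebook.

```python
import math


def proportional_sample_size(population: int, fraction: float = 0.25, minimum: int = 1) -> int:
    """Return a sample size that is a fraction of the population, within [minimum, population]."""
    if population <= 0:
        return 0  # nothing to sample from
    size = math.ceil(population * fraction)
    # Never ask for more items than exist, since random.sample() would raise ValueError
    return min(population, max(minimum, size))


print(proportional_sample_size(100))  # 25
print(proportional_sample_size(7))    # 2
print(proportional_sample_size(2))    # 1
```

Inside the dataset loop, the population could for example be taken as `len(mappings)`, so that each dataset gets its own `sample_size` before the sampling function is called.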


---

### [Extra] execution cell to manually sample 100 Mozilla requirements

Uses the *primitive* sampling method.

In [None]:
number_of_subsets = 10
sample_size = 25 

dataset_path = "./deprecated_datasets/AMINA-100/"
output_path = "./AMINA/"

requirements = read_requirements(dataset_path)
tests        = read_tests(dataset_path)
mappings     = read_mappings(dataset_path)

# Shift the entire range by 1 so that indexing starts at 1
for i in range(1, number_of_subsets + 1):
    # Sample dataset to create new subset
    sampled_reqs, sampled_tests, sampled_mappings = sample_data_primitive(
        requirements, tests, mappings, sample_size
    )

    print(f"\nITERATION: {i}")
    print("Number of rows in requirements DataFrame:", sampled_reqs.shape[0])
    print("Number of rows in tests DataFrame:", sampled_tests.shape[0])
    print("Number of rows in mappings DataFrame:", sampled_mappings.shape[0])

    # Save the sampled data to CSV files; use i as the subset index
    save_subset_data(sampled_reqs, sampled_tests, sampled_mappings, output_path, i)



ITERATION: 1
Number of rows in requirements DataFrame: 25
Number of rows in tests DataFrame: 25
Number of rows in mappings DataFrame: 25

ITERATION: 2
Number of rows in requirements DataFrame: 25
Number of rows in tests DataFrame: 21
Number of rows in mappings DataFrame: 25

ITERATION: 3
Number of rows in requirements DataFrame: 25
Number of rows in tests DataFrame: 20
Number of rows in mappings DataFrame: 25

ITERATION: 4
Number of rows in requirements DataFrame: 25
Number of rows in tests DataFrame: 20
Number of rows in mappings DataFrame: 25

ITERATION: 5
Number of rows in requirements DataFrame: 25
Number of rows in tests DataFrame: 23
Number of rows in mappings DataFrame: 25

ITERATION: 6
Number of rows in requirements DataFrame: 25
Number of rows in tests DataFrame: 23
Number of rows in mappings DataFrame: 25

ITERATION: 7
Number of rows in requirements DataFrame: 25
Number of rows in tests DataFrame: 22
Number of rows in mappings DataFrame: 25

ITERATION: 8
Number of rows in re