Receives input from Generate_Gherkins

INPUT: raw generated gherkins (model, timestamp, us_id, user_story, assistant_response, prompt_tokens, completion_tokens, created)
- `created` is the Unix timestamp returned by the model for when the request was processed
- `timestamp` is generated in our code when the request is made, as a backup in case `created` is null

REQUIRED OUTPUT FORMAT:
For input to METEOR/TF-IDF/SentenceTransformer: us_id, us_text, scenario_title, model, scenario_text, scenario_id, where scenario_text is the full scenario text. (Unsure whether we also want feature info, e.g. title, description.)
Other outputs:
- Above but for step data
- Parse error data
- Lint report data

Should this combine Sample_Data_Base_Preprocessing and Pipeline, or should it stay generic to all data, with a separate sample data base preprocessing step kept alongside it? I think all generated data will be the same, or can be made the same, so if we need separate handling for human data we can add it later.

All outputs:
- Feature files
- Full scenario data for traceability scoring (this is the input?) X
- Step data (for clustering?)
- Parser error data (parser is used to create above set)
- Lint report data


Completed outputs:
1. Feature file for each ai_response, written to `gherkins/sample_data/<exp_label>/feature_files/<model>/<app>` directory
2. 

In [None]:
import pandas as pd
import numpy as np
import re
import string
from pathlib import Path

from importlib import reload
import config
reload(config)

from config import DATASET_NAME, EXPERIMENT_NAME, INPUT_DATA_PATH, GENERATION_TECHNIQUE

In [None]:
exp_dir = Path(f"../data/{DATASET_NAME}/experiment_outputs/{EXPERIMENT_NAME}/{GENERATION_TECHNIQUE}/")
exp_dir.mkdir(parents=True, exist_ok=True)

input_file_name = f"{GENERATION_TECHNIQUE}_raw_results.csv"

In [None]:
# Read the raw BDD dataset, containing the model outputs
raw_df = pd.read_csv(exp_dir / input_file_name) # TODO: store prompts? Important for multi-turn chats where we should record the order of presentation of user stories.

raw_df.head()

In [None]:
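# Temporary fixes for issues already corrected in the data generation code (applied here to avoid rerunning generation)
raw_df.rename(columns={'user_prompt': 'us_text'}, inplace=True)

# Keep the second underscore-separated token of us_id
raw_df["us_id"] = raw_df["us_id"].str.split('_').str[1]

# Store us_id as an integer so it can be used as a join key later
raw_df["us_id"] = raw_df["us_id"].astype(np.int64)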

In [None]:
raw_df.dtypes

In [None]:
raw_df.head()

In [None]:
# Check for duplicates (on model and us_id)
duplicates = raw_df[raw_df.duplicated(subset=["model", "us_id"], keep=False)]

print(duplicates.shape[0], "duplicate rows found:")

duplicates.head()

In [None]:
# Check for missing values
raw_df.isna().sum()

In [None]:
raw_df.head()

In [None]:
# Check a response has been generated for each user story by each model (number of rows should equal number of unique user stories * number of unique models)
print("Number of rows in raw_df:", raw_df.shape[0])
 
raw_df.nunique()
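
The row-count check above says whether something is missing, but not which pairs. A minimal sketch on toy data (the `toy_df` frame and its values are illustrative, not taken from the experiment) that lists the `(model, us_id)` combinations with no response:

```python
import pandas as pd

# Toy stand-in for raw_df; values are illustrative only.
toy_df = pd.DataFrame({
    "model": ["m1", "m1", "m2"],
    "us_id": [1, 2, 1],
})

# Every (model, us_id) combination we expect to see.
expected = pd.MultiIndex.from_product(
    [toy_df["model"].unique(), toy_df["us_id"].unique()],
    names=["model", "us_id"],
)

# Combinations actually present in the data.
observed = pd.MultiIndex.from_frame(toy_df[["model", "us_id"]].drop_duplicates())

# Pairs for which no response was generated.
missing_pairs = expected.difference(observed)
print(len(missing_pairs), "missing (model, us_id) pairs:", list(missing_pairs))
```

An empty result means the number of rows equals unique user stories times unique models, provided there are no duplicates.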

In [None]:
# Remove triple backticks and language specifiers from each ai_response
def remove_padding(input_string):
    match = re.search(r"```[\w]*\n(.*?)\n```", input_string, re.DOTALL)

    if match:
        return match.group(1).strip()
    
    return input_string.strip()

raw_df['ai_response'] = raw_df['ai_response'].apply(remove_padding)
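
To illustrate what the pattern in `remove_padding` keeps and drops, here is a small self-contained sketch. The helper name `strip_code_fence` and the sample strings are hypothetical; the regex is the same one used above.

```python
import re

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable

def strip_code_fence(text):
    """Same pattern as remove_padding: keep the body of the first fenced block."""
    match = re.search(FENCE + r"[\w]*\n(.*?)\n" + FENCE, text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

# A response wrapped in a fence, with chatter before and after it.
fenced = (
    f"Here you go:\n{FENCE}gherkin\n"
    "Feature: Login\n  Scenario: Valid user\n"
    f"{FENCE}\nHope that helps."
)

# A response with no fence at all: only surrounding whitespace is removed.
bare = "  Feature: Login\n  Scenario: Valid user\n"

print(strip_code_fence(fenced))
print(strip_code_fence(bare))
```

Note that only the first fenced block is kept, so a response that splits its Gherkin across several fenced blocks would lose everything after the first one.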

In [None]:
raw_df.head()

<b>Create Feature Files</b>

At this point, we write each `ai_response` to its own feature file, then parse and lint those files.

In [None]:
# Create and write feature file for each ai_response
# TODO: fix this for directory structure
def write_feature_file(record, experiment_dir):
    model = record['model']
    app_id = record['app_id']
    us_id = record['us_id']

    feature_content = record['ai_response']

    filename = f"{app_id}_{model}_{us_id}"

    feature_dir = experiment_dir / "features" / model
    feature_dir.mkdir(parents=True, exist_ok=True)

    feature_file_path = feature_dir / f"{filename}.feature"

    try:   
        with open(feature_file_path, 'w', encoding='utf-8') as f:
            f.write(feature_content.strip())
            
    except Exception as e:
        print(f"Error writing {feature_file_path}: {e}")

# for index, row in raw_df.iterrows():
#     write_feature_file(row, exp_dir)

<b>Review and Process Parsed Data</b>

Next, we read and review the parsed gherkin step data (generated in Gherkin_Parser.ipynb).

In [None]:
# Read parsed step data from gherkin parser output
parse_df = pd.read_csv(exp_dir / 'parsed_step_data.csv')

In [None]:
parse_df.head()

In [None]:
# parse_df contains a record per step
parse_df.shape

In [None]:
parse_df.nunique()

In [None]:
parse_df.dtypes

In [None]:
# Add us_text to parse_df by merging with raw_df on model, app_id, us_id
parse_df = parse_df.merge(raw_df[['model', 'app_id', 'us_id', 'us_text']], on=['model', 'app_id', 'us_id'], how='left')
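
A left merge like this one silently duplicates step rows if the right-hand frame has repeated keys, and leaves `us_text` empty where no match exists. A hedged sketch on toy frames (`steps` and `stories` are illustrative names and data) showing how pandas' `validate` and `indicator` options can surface both problems:

```python
import pandas as pd

keys = ["model", "app_id", "us_id"]

# Toy stand-in for parse_df: one row per step.
steps = pd.DataFrame({
    "model": ["m1", "m1", "m2"],
    "app_id": ["a", "a", "a"],
    "us_id": [1, 1, 1],
    "step_text": ["a user exists", "the user logs in", "a user exists"],
})

# Toy stand-in for raw_df: one row per (model, app_id, us_id).
stories = pd.DataFrame({
    "model": ["m1"],
    "app_id": ["a"],
    "us_id": [1],
    "us_text": ["As a user I want to log in"],
})

# validate raises MergeError if the right-hand keys are not unique;
# indicator adds a _merge column marking rows that found no match.
merged = steps.merge(
    stories[keys + ["us_text"]],
    on=keys,
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows after merge;", len(unmatched), "without a user story")
```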

In [None]:
parse_df.head(1)

In [None]:
parse_df.nunique()

In [None]:
parse_df.to_csv(exp_dir / 'processed_step_data.csv', index=False)

<b>Review `gherkin-lint` Reports</b>

Read and review reports generated by `gherkin-lint`.

TODO: perform this in another notebook and read results here.

<b>Create Full Scenario Dataset for Traceability Evaluation</b>

Next, we use the parsed data to create a dataset of complete scenarios (joining the parsed steps) to use in computing similarity between user stories and gherkins, in our traceability experiments.

In [None]:
processed_step_df = parse_df.copy()

In [None]:
processed_step_df.shape

In [None]:
# Remove all rows for any us_id where parsing failed (error == True) for at least one model's output, to maintain a matched-pair dataset for traceability experiments
error_us_ids = processed_step_df.loc[processed_step_df['error'] == True, 'us_id'].unique()
processed_step_df = processed_step_df[~processed_step_df['us_id'].isin(error_us_ids)].reset_index(drop=True)

processed_step_df.shape
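
The effect of the matched-pair filter is easiest to see on a toy frame (illustrative data only): a parse error for one model removes that user story for every model.

```python
import pandas as pd

# Toy stand-in for processed_step_df; m2 failed to parse for user story 1.
toy = pd.DataFrame({
    "model": ["m1", "m2", "m1", "m2"],
    "us_id": [1, 1, 2, 2],
    "error": [False, True, False, False],
})

# User stories with at least one failed parse, across all models.
bad_ids = toy.loc[toy["error"], "us_id"].unique()

# Drop those user stories for every model, not just the one that failed.
matched = toy[~toy["us_id"].isin(bad_ids)].reset_index(drop=True)
print(matched)
```

Here user story 1 is dropped for `m1` as well, even though `m1` parsed cleanly, so both models are evaluated on the same set of user stories.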

In [None]:
# TODO: also remove user stories that resulted in gherkins with gherkin lint errors

In [None]:
# Check that for each model's output, us-feature mapping is one-to-one
us_feature_counts = processed_step_df.groupby(['model', 'us_id'])['feature_name'].nunique()

us_feature_counts[us_feature_counts > 1]

In [None]:
# Assign unique numeric scenario_id to each scenario_name within each model and us_id
processed_step_df['scenario_id'] = processed_step_df.groupby(['model', 'us_id'])['scenario_name'].transform(lambda x: pd.factorize(x)[0] + 1)
processed_step_df['scenario_id'] = processed_step_df["model"] + "_" + processed_step_df["us_id"].astype(str) + "_" + processed_step_df['scenario_id'].astype(str)
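
A small sketch of the resulting identifiers, using toy data with made-up model names and scenario titles. Rows that share a scenario name within the same model and user story receive the same `scenario_id`.

```python
import pandas as pd

# Toy step-level data: two steps of "Login ok", one of "Login fails", and so on.
toy = pd.DataFrame({
    "model": ["m1", "m1", "m1", "m1", "m2"],
    "us_id": [7, 7, 7, 8, 7],
    "scenario_name": ["Login ok", "Login ok", "Login fails", "Login ok", "Login ok"],
})

# Number scenarios 1..n within each (model, us_id) group, in order of first appearance.
number = toy.groupby(["model", "us_id"])["scenario_name"].transform(
    lambda s: pd.factorize(s)[0] + 1
)

# Build the composite id <model>_<us_id>_<n>.
toy["scenario_id"] = toy["model"] + "_" + toy["us_id"].astype(str) + "_" + number.astype(str)
print(toy["scenario_id"].tolist())
```

Because numbering is by name, two distinct scenarios with identical titles in the same feature would share an id.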

In [None]:
processed_step_df.head()

In [None]:
processed_step_df.shape

In [None]:
processed_step_df.nunique()

In [None]:
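import ast

def flatten_step(row):
    step_text = f"{row['step_keyword']} {row['step_text']}"

    # After the CSV round trip the data table is a string, not a list of rows.
    # Assumes the parser wrote it as a Python-literal list of lists.
    table = row['step_data_table']
    if isinstance(table, str):
        try:
            table = ast.literal_eval(table)
        except (ValueError, SyntaxError):
            table = [[table]]  # unparseable: keep the raw string as one cell
    if isinstance(table, list):
        for table_row in table:
            step_text += " | " + " | ".join(map(str, table_row))
        step_text += " | "

    if pd.notna(row['step_doc_string']):
        step_text += f" \"\"\" {row['step_doc_string']} \"\"\" "

    return step_text.strip()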

In [None]:
processed_step_df['flat_step'] = processed_step_df.apply(flatten_step, axis=1)

In [None]:
processed_step_df.head()

In [None]:
full_scenarios = (
    processed_step_df.groupby(['app_id', 'model', 'us_id', 'scenario_id'])
    .agg({
        'flat_step': lambda steps: " ".join(steps),  # join all steps
        'feature_name': 'first',
        'scenario_name': 'first',
        'scenario_examples': 'first',
        'us_text': 'first'
    })
    .reset_index()
)

full_scenarios.rename(columns={'flat_step': 'scenario_text'}, inplace=True)

# TODO: keep scenario description?

In [None]:
full_scenarios.head()

In [None]:
full_scenarios.to_csv(exp_dir / 'parsed_scenario_data.csv', index=False)

<b>Process Step Dataset for Step-Based Traceability Experiments</b>

`flat_step` concatenates `step_keyword` and `step_text`, followed by `step_data_table` and `step_doc_string` when those are present.

In [None]:
processed_step_df.nunique()

In [None]:
processed_step_df.head(1)

In [None]:
# Assign a unique numeric step_id to each step, by position, within each scenario_id, model, and us_id
# (cumcount numbers rows by position, so repeated step text within a scenario still gets distinct ids)
processed_step_df['step_id'] = processed_step_df.groupby(['model', 'us_id', 'scenario_id']).cumcount() + 1
processed_step_df['step_id'] = processed_step_df['scenario_id'].astype(str) + "_" + processed_step_df['step_id'].astype(str)
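
Whether step ids are unique depends on how steps are numbered within a scenario. A minimal sketch on toy data (step text is made up) comparing value-based numbering with `pd.factorize` against position-based numbering with `cumcount` when a scenario repeats a step:

```python
import pandas as pd

# Toy scenario in which the same step text appears twice.
toy = pd.DataFrame({
    "scenario_id": ["m1_7_1", "m1_7_1", "m1_7_1", "m1_7_2"],
    "flat_step": [
        "When I click Save",
        "Then I see a message",
        "When I click Save",
        "Given a user",
    ],
})

# Value-based: identical step text within a scenario shares one number.
by_value = toy.groupby("scenario_id")["flat_step"].transform(
    lambda s: pd.factorize(s)[0] + 1
).astype(int)

# Position-based: every row gets its own number within the scenario.
by_position = toy.groupby("scenario_id").cumcount() + 1

print(by_value.tolist())
print(by_position.tolist())
```

With value-based numbering the first and third steps collide, so position-based numbering is the safer choice when `step_id` must identify a single row.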

In [None]:
processed_step_df.head(1)

In [None]:
processed_step_df.drop(columns=['filepath', 'feature_keyword', 'feature_tags', 'rule_name', 'rule_description', 'rule_tags', 'scenario_keyword', 'scenario_tags', 'step_keyword', 'step_text', 'step_data_table', 'step_doc_string', 'error'], inplace=True)

In [None]:
processed_step_df.head()

In [None]:
processed_step_df.to_csv(exp_dir / 'processed_step_data.csv', index=False)