Receives input from Generate_Gherkins

INPUT: raw generated gherkins (model, timestamp, us_id, user_story, assistant_response, prompt_tokens, completion_tokens, created)
- `created` is the Unix timestamp returned by the model for when the request was processed
- `timestamp` is generated in our code when the request is made, as a backup in case `created` is null
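
A minimal sketch of the fallback described above, assuming both columns hold Unix seconds (the notes do not say how `timestamp` is stored). The toy frame and the `request_time` column name are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the INPUT description above.
toy = pd.DataFrame({
    "created": [1762428981, None],          # model-reported time, may be null
    "timestamp": [1762428980, 1762428990],  # recorded locally when the request is made
})

# Prefer the model-reported time and fall back to the local one where it is null.
toy["request_time"] = toy["created"].fillna(toy["timestamp"]).astype("int64")
```

Keeping the result in a separate column leaves both raw values available for later inspection.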

REQUIRED OUTPUT FORMAT: 
For input to METEOR/TF-IDF/SentenceTransformer: us_id, us_text, scenario_title, model, scenario_text, scenario_id, where scenario_text is the full scenario text (unsure whether we also want feature info, e.g. title, description)
Other outputs:
- Above but for step data
- Parse error data
- Lint report data

Open question: should this notebook combine Sample_Data_Base_Preprocessing and Pipeline, or should it stay generic to all data while a separate sample data base preprocess remains? I think all generated data will have the same format (or can be made to), so if human-written data needs separate handling we can add that later.

All outputs:
- Feature files
- Full scenario data for traceability scoring (this is the input?) X
- Step data (for clustering?)
- Parser error data (parser is used to create above set)
- Lint report data


Completed outputs:
1. Feature file for each ai_response, written to `gherkins/sample_data/<exp_label>/feature_files/<model>/<app>` directory
2. 

In [1]:
import pandas as pd
import numpy as np
import re
import string
from pathlib import Path

from importlib import reload
import config
reload(config)

from config import DATASET_NAME, EXPERIMENT_NAME, INPUT_DATA_PATH, GENERATION_TECHNIQUE

In [2]:
exp_dir = Path(f"../data/{DATASET_NAME}/experiment_outputs/{EXPERIMENT_NAME}/{GENERATION_TECHNIQUE}/")
exp_dir.mkdir(parents=True, exist_ok=True)

input_file_name = f"{GENERATION_TECHNIQUE}_raw_results.csv"

In [3]:
# Read the raw BDD dataset, containing the model outputs
raw_df = pd.read_csv(exp_dir / input_file_name) # TODO: store prompts? Important for multi-turn chats where we should record the order of presentation of user stories.

raw_df.head()

Unnamed: 0,model,app_id,system_prompt,reminder,us_id,user_prompt,ai_response,prompt_tokens,completion_tokens,response_created
0,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,g04-recycling_1,"As a user, I want to click on the address, so ...",```gherkin\nFeature: Open Google Maps from add...,83,250,1762428981
1,google-gemini-2.0-flash-001,g04-recycling,You are a QA Engineer. Please generate a compl...,,g04-recycling_1,"As a user, I want to click on the address, so ...",```gherkin\nFeature: Address Link Opens Google...,70,219,1762428981
2,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,g04-recycling_2,"As a user, I want to be able to anonymously vi...",```gherkin\nFeature: Anonymous viewing of publ...,87,272,1762428981
3,google-gemini-2.0-flash-001,g04-recycling,You are a QA Engineer. Please generate a compl...,,g04-recycling_2,"As a user, I want to be able to anonymously vi...",```gherkin\nFeature: Anonymous User Can View P...,74,355,1762428981
4,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,g04-recycling_3,"As a user, I want to be able to enter my zip c...",```gherkin\nFeature: Nearby Recycling Faciliti...,92,363,1762428981


In [4]:
# Temporary changes that I have fixed in the data generation code (but don't want to rerun generation)
raw_df.rename(columns={'user_prompt': 'us_text'}, inplace=True)

raw_df["us_id"] = raw_df["us_id"].str.split('_').str[1]

raw_df["us_id"] = raw_df["us_id"].astype(np.int64)

In [5]:
raw_df.dtypes

model                 object
app_id                object
system_prompt         object
reminder             float64
us_id                  int64
us_text               object
ai_response           object
prompt_tokens          int64
completion_tokens      int64
response_created       int64
dtype: object

In [6]:
raw_df.head()

Unnamed: 0,model,app_id,system_prompt,reminder,us_id,us_text,ai_response,prompt_tokens,completion_tokens,response_created
0,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,1,"As a user, I want to click on the address, so ...",```gherkin\nFeature: Open Google Maps from add...,83,250,1762428981
1,google-gemini-2.0-flash-001,g04-recycling,You are a QA Engineer. Please generate a compl...,,1,"As a user, I want to click on the address, so ...",```gherkin\nFeature: Address Link Opens Google...,70,219,1762428981
2,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,2,"As a user, I want to be able to anonymously vi...",```gherkin\nFeature: Anonymous viewing of publ...,87,272,1762428981
3,google-gemini-2.0-flash-001,g04-recycling,You are a QA Engineer. Please generate a compl...,,2,"As a user, I want to be able to anonymously vi...",```gherkin\nFeature: Anonymous User Can View P...,74,355,1762428981
4,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,3,"As a user, I want to be able to enter my zip c...",```gherkin\nFeature: Nearby Recycling Faciliti...,92,363,1762428981


In [7]:
# Check for duplicates (on model and us_id)
duplicates = raw_df[raw_df.duplicated(subset=["model", "us_id"], keep=False)]

print(duplicates.shape[0], "duplicate rows found:")

duplicates.head()

0 duplicate rows found:


Unnamed: 0,model,app_id,system_prompt,reminder,us_id,us_text,ai_response,prompt_tokens,completion_tokens,response_created


In [8]:
# Check for missing values
raw_df.isna().sum()

model                  0
app_id                 0
system_prompt          0
reminder             102
us_id                  0
us_text                0
ai_response            0
prompt_tokens          0
completion_tokens      0
response_created       0
dtype: int64

In [9]:
raw_df.head()

Unnamed: 0,model,app_id,system_prompt,reminder,us_id,us_text,ai_response,prompt_tokens,completion_tokens,response_created
0,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,1,"As a user, I want to click on the address, so ...",```gherkin\nFeature: Open Google Maps from add...,83,250,1762428981
1,google-gemini-2.0-flash-001,g04-recycling,You are a QA Engineer. Please generate a compl...,,1,"As a user, I want to click on the address, so ...",```gherkin\nFeature: Address Link Opens Google...,70,219,1762428981
2,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,2,"As a user, I want to be able to anonymously vi...",```gherkin\nFeature: Anonymous viewing of publ...,87,272,1762428981
3,google-gemini-2.0-flash-001,g04-recycling,You are a QA Engineer. Please generate a compl...,,2,"As a user, I want to be able to anonymously vi...",```gherkin\nFeature: Anonymous User Can View P...,74,355,1762428981
4,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,3,"As a user, I want to be able to enter my zip c...",```gherkin\nFeature: Nearby Recycling Faciliti...,92,363,1762428981


In [10]:
# Check a response has been generated for each user story by each model (number of rows should equal number of unique user stories * number of unique models)
print("Number of rows in raw_df:", raw_df.shape[0])
 
raw_df.nunique()

Number of rows in raw_df: 102


model                  2
app_id                 1
system_prompt          1
reminder               0
us_id                 51
us_text               51
ai_response          102
prompt_tokens         34
completion_tokens     90
response_created      67
dtype: int64
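
The row count check above (102 = 51 user stories × 2 models) can also be made explicit by listing any missing (model, us_id) combinations. This is a sketch on a toy frame; `missing_pairs` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def missing_pairs(df, keys=("model", "us_id")):
    """Return the key combinations that have no row in df."""
    keys = list(keys)
    # Every combination of the unique values seen in each key column.
    expected = pd.MultiIndex.from_product(
        [df[k].unique() for k in keys], names=keys
    )
    # The combinations that actually occur.
    observed = pd.MultiIndex.from_frame(df[keys].drop_duplicates())
    return expected.difference(observed)

# Toy data: model m2 has no response for user story 2.
toy = pd.DataFrame({
    "model": ["m1", "m2", "m1"],
    "us_id": [1, 1, 2],
})
gaps = missing_pairs(toy)
```

An empty result means every model produced a response for every user story. Unlike comparing totals, this names the missing pairs directly.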

In [11]:
# Remove the triple-backtick fence and language specifier wrapping each ai_response
def remove_padding(input_string):
    match = re.search(r"```[\w]*\n(.*?)\n```", input_string, re.DOTALL)

    if match:
        return match.group(1).strip()
    
    return input_string.strip()

raw_df['ai_response'] = raw_df['ai_response'].apply(remove_padding)
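
To show how `remove_padding` behaves on different response shapes, here is a small self-contained check. The function is copied from the cell above; the sample strings are made up for illustration.

```python
import re

def remove_padding(input_string):
    # Same logic as the notebook cell: keep only the body of the first fenced block.
    match = re.search(r"```[\w]*\n(.*?)\n```", input_string, re.DOTALL)
    if match:
        return match.group(1).strip()
    return input_string.strip()

fenced = "```gherkin\nFeature: A\n  Scenario: B\n```"
chatty = "Here you go:\n```gherkin\nFeature: A\n```\nHope it helps"
plain = "  Feature: A  "
unclosed = "```gherkin\nFeature: A"

results = {
    "fenced": remove_padding(fenced),
    "chatty": remove_padding(chatty),
    "plain": remove_padding(plain),
    "unclosed": remove_padding(unclosed),
}
```

Text around a fenced block is discarded, and unfenced input is only stripped. A response whose fence is never closed passes through with its opening backticks intact, so it may be worth checking for before writing feature files.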

In [12]:
raw_df.head()

Unnamed: 0,model,app_id,system_prompt,reminder,us_id,us_text,ai_response,prompt_tokens,completion_tokens,response_created
0,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,1,"As a user, I want to click on the address, so ...",Feature: Open Google Maps from address link\n\...,83,250,1762428981
1,google-gemini-2.0-flash-001,g04-recycling,You are a QA Engineer. Please generate a compl...,,1,"As a user, I want to click on the address, so ...",Feature: Address Link Opens Google Maps in New...,70,219,1762428981
2,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,2,"As a user, I want to be able to anonymously vi...",Feature: Anonymous viewing of public informati...,87,272,1762428981
3,google-gemini-2.0-flash-001,g04-recycling,You are a QA Engineer. Please generate a compl...,,2,"As a user, I want to be able to anonymously vi...",Feature: Anonymous User Can View Public Recycl...,74,355,1762428981
4,openai-gpt-4o-mini,g04-recycling,You are a QA Engineer. Please generate a compl...,,3,"As a user, I want to be able to enter my zip c...",Feature: Nearby Recycling Facilities Search\n\...,92,363,1762428981


<b>Create Feature Files</b>

At this point, we write each `ai_response` to its own feature file, then parse and lint those files.

In [13]:
# Create and write feature file for each ai_response
# TODO: fix this for directory structure
def write_feature_file(record, experiment_dir):
    model = record['model']
    app_id = record['app_id']
    us_id = record['us_id']

    feature_content = record['ai_response']

    filename = f"{app_id}_{model}_{us_id}"

    feature_dir = experiment_dir / "features" / model
    feature_dir.mkdir(parents=True, exist_ok=True)

    feature_file_path = feature_dir / f"{filename}.feature"

    try:   
        with open(feature_file_path, 'w', encoding='utf-8') as f:
            f.write(feature_content.strip())
            
    except Exception as e:
        print(f"Error writing {feature_file_path}: {e}")

for _, row in raw_df.iterrows():
    write_feature_file(row, exp_dir)

<b>Review and Process Parsed Data</b>

Next, we read and review the parsed gherkin step data (generated in Gherkin_Parser.ipynb).

In [14]:
# Read parsed step data from gherkin parser output
parse_df = pd.read_csv(exp_dir / 'parsed_step_data.csv')

In [15]:
parse_df.head()

Unnamed: 0,filepath,model,app_id,us_id,feature_name,feature_description,feature_keyword,feature_tags,rule_name,rule_description,...,scenario_description,scenario_keyword,scenario_tags,scenario_examples,step_keyword,step_keyword_type,step_text,step_data_table,step_doc_string,error
0,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Scenario,,,Given,Context,I am on the website,,,False
1,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Scenario,,,When,Action,I click the address link,,,False
2,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Scenario,,,Then,Outcome,a new tab should open,,,False
3,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Scenario,,,And,Conjunction,"the new tab's URL should contain ""google.com/m...",,,False
4,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Scenario,,,Given,Context,I am on the website,,,False


In [16]:
# parse_df contains a record per step
parse_df.shape

(2318, 22)

In [17]:
parse_df.nunique()

filepath                 102
model                      2
app_id                     1
us_id                     51
feature_name              89
feature_description       58
feature_keyword            1
feature_tags               0
rule_name                  0
rule_description           0
rule_tags                  0
scenario_name            487
scenario_description       0
scenario_keyword           1
scenario_tags              0
scenario_examples          0
step_keyword               4
step_keyword_type          4
step_text               1633
step_data_table            6
step_doc_string            0
error                      1
dtype: int64

In [18]:
parse_df.dtypes

filepath                 object
model                    object
app_id                   object
us_id                     int64
feature_name             object
feature_description      object
feature_keyword          object
feature_tags            float64
rule_name               float64
rule_description        float64
rule_tags               float64
scenario_name            object
scenario_description    float64
scenario_keyword         object
scenario_tags           float64
scenario_examples       float64
step_keyword             object
step_keyword_type        object
step_text                object
step_data_table          object
step_doc_string         float64
error                      bool
dtype: object

In [19]:
# Add us_text to parse_df by merging with raw_df on model, app_id, us_id
parse_df = parse_df.merge(raw_df[['model', 'app_id', 'us_id', 'us_text']], on=['model', 'app_id', 'us_id'], how='left')
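
Because this merge attaches one user story to many step rows, it can be guarded with the `validate` argument of `DataFrame.merge`. It raises if the right-hand side has duplicate keys, which would otherwise silently multiply step rows. A sketch on toy frames with made-up values:

```python
import pandas as pd

# One row per step (left side of the merge).
steps = pd.DataFrame({
    "model": ["m1", "m1", "m2"],
    "app_id": ["a", "a", "a"],
    "us_id": [1, 1, 1],
    "step_text": ["s1", "s2", "s3"],
})
# One row per (model, app_id, us_id) (right side of the merge).
stories = pd.DataFrame({
    "model": ["m1", "m2"],
    "app_id": ["a", "a"],
    "us_id": [1, 1],
    "us_text": ["As a user ...", "As a user ..."],
})

keys = ["model", "app_id", "us_id"]
# validate="many_to_one" raises pandas.errors.MergeError if keys repeat in `stories`.
merged = steps.merge(stories, on=keys, how="left", validate="many_to_one")
```

After the merge, the row count should equal that of the step data and `us_text` should have no missing values.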

In [20]:
parse_df.head(1)

Unnamed: 0,filepath,model,app_id,us_id,feature_name,feature_description,feature_keyword,feature_tags,rule_name,rule_description,...,scenario_keyword,scenario_tags,scenario_examples,step_keyword,step_keyword_type,step_text,step_data_table,step_doc_string,error,us_text
0,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,Scenario,,,Given,Context,I am on the website,,,False,"As a user, I want to click on the address, so ..."


In [21]:
parse_df.nunique()

filepath                 102
model                      2
app_id                     1
us_id                     51
feature_name              89
feature_description       58
feature_keyword            1
feature_tags               0
rule_name                  0
rule_description           0
rule_tags                  0
scenario_name            487
scenario_description       0
scenario_keyword           1
scenario_tags              0
scenario_examples          0
step_keyword               4
step_keyword_type          4
step_text               1633
step_data_table            6
step_doc_string            0
error                      1
us_text                   51
dtype: int64

In [22]:
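# Note: this overwrites the parser output read in cell 14. If the read and merge cells are re-run afterwards,
# the file already contains us_text, so the merge yields us_text_x / us_text_y instead of us_text.
parse_df.to_csv(exp_dir / 'parsed_step_data.csv', index=False)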

<b>Review `gherkin-lint` Reports</b>

Read and review reports generated by `gherkin-lint`.

TODO: perform this in another notebook and read results here.

<b>Create Full Scenario Dataset for Traceability Evaluation</b>

Next, we join the parsed steps of each scenario into a single text, producing a dataset of complete scenarios. Our traceability experiments use this dataset to compute similarity between user stories and gherkins.

In [23]:
scenarios_df = parse_df.copy()

In [24]:
scenarios_df.shape

(2318, 23)

In [25]:
# Remove all rows for any us_id where parsing failed (error == True) for at least one model's output,
# so that the traceability experiments keep a matched-pair dataset
error_us_ids = scenarios_df.loc[scenarios_df['error'], 'us_id'].unique()
scenarios_df = scenarios_df[~scenarios_df['us_id'].isin(error_us_ids)].reset_index(drop=True)

scenarios_df.shape

(2318, 23)
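
The shape is unchanged here because no parse errors occurred in this run. To show what the filter does when there are errors, here is a sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy data: model m2 failed to parse for user story 1.
toy = pd.DataFrame({
    "model": ["m1", "m2", "m1", "m2"],
    "us_id": [1, 1, 2, 2],
    "error": [False, True, False, False],
})

# Collect user stories with at least one failed parse, then drop them for all models.
error_us_ids = toy.loc[toy["error"], "us_id"].unique()
matched = toy[~toy["us_id"].isin(error_us_ids)].reset_index(drop=True)
```

User story 1 is dropped for both models, even though only one of them failed, so each remaining user story has output from every model.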

In [26]:
# TODO: also remove user stories that resulted in gherkins with gherkin lint errors

In [27]:
# Check that for each model's output, us-feature mapping is one-to-one
us_feature_counts = scenarios_df.groupby(['model', 'us_id'])['feature_name'].nunique()

us_feature_counts[us_feature_counts > 1]

Series([], Name: feature_name, dtype: int64)

In [28]:
# Number the distinct scenario_name values within each (model, us_id) group,
# then build a string scenario_id of the form <model>_<us_id>_<n>
scenarios_df['scenario_id'] = scenarios_df.groupby(['model', 'us_id'])['scenario_name'].transform(lambda x: pd.factorize(x)[0] + 1)
scenarios_df['scenario_id'] = scenarios_df["model"] + "_" + scenarios_df["us_id"].astype(str) + "_" + scenarios_df['scenario_id'].astype(str)
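
A toy illustration of the numbering scheme, with made-up scenario names. The extra `astype(int)` is a defensive addition that is not in the notebook cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "model": ["m1", "m1", "m1", "m1", "m2"],
    "us_id": [1, 1, 1, 2, 1],
    "scenario_name": ["open map", "open map", "bad address", "search", "open map"],
})

# Number distinct scenario names in order of first appearance within each group.
local = toy.groupby(["model", "us_id"])["scenario_name"].transform(
    lambda names: pd.factorize(names)[0] + 1
)
toy["scenario_id"] = (
    toy["model"] + "_" + toy["us_id"].astype(str) + "_" + local.astype(int).astype(str)
)
```

Since the number is derived from `scenario_name`, two scenarios with the same name under the same model and user story receive the same `scenario_id`, and their steps are later joined into one scenario.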

In [29]:
scenarios_df.head()

Unnamed: 0,filepath,model,app_id,us_id,feature_name,feature_description,feature_keyword,feature_tags,rule_name,rule_description,...,scenario_tags,scenario_examples,step_keyword,step_keyword_type,step_text,step_data_table,step_doc_string,error,us_text,scenario_id
0,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,,Given,Context,I am on the website,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1
1,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,,When,Action,I click the address link,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1
2,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,,Then,Outcome,a new tab should open,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1
3,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,,And,Conjunction,"the new tab's URL should contain ""google.com/m...",,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1
4,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,,Given,Context,I am on the website,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_2


In [30]:
scenarios_df.shape

(2318, 24)

In [31]:
scenarios_df.nunique()

filepath                 102
model                      2
app_id                     1
us_id                     51
feature_name              89
feature_description       58
feature_keyword            1
feature_tags               0
rule_name                  0
rule_description           0
rule_tags                  0
scenario_name            487
scenario_description       0
scenario_keyword           1
scenario_tags              0
scenario_examples          0
step_keyword               4
step_keyword_type          4
step_text               1633
step_data_table            6
step_doc_string            0
error                      1
us_text                   51
scenario_id              491
dtype: int64

In [32]:
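import ast

def flatten_step(row):
    # Flatten one parsed step (keyword, text, optional data table and doc string) into a single string
    step_text = f"{row['step_keyword']} {row['step_text']}"

    table = row['step_data_table']
    if isinstance(table, str):
        # After pd.read_csv the table is a string, so iterating over it directly would yield characters.
        # Assumption: the parser saved it as a Python-literal list of rows, e.g. "[['a', 'b'], ['1', '2']]"
        table = ast.literal_eval(table)
    if isinstance(table, (list, tuple)):
        for table_row in table:
            step_text += " | " + " | ".join(str(cell) for cell in table_row)
        step_text += " | "

    if pd.notna(row['step_doc_string']):
        step_text += f" \"\"\" {row['step_doc_string']} \"\"\" "

    return step_text.strip()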

In [33]:
scenarios_df['flat_step'] = scenarios_df.apply(flatten_step, axis=1)
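
One point to watch when flattening data tables: `step_data_table` has dtype `object` after `pd.read_csv`, so each non-null cell is a string rather than a list of rows. The sketch below assumes the parser serialised tables as Python-literal lists of rows, which has not been verified against the parser output. `flatten_table` is a hypothetical helper.

```python
import ast

def flatten_table(cell):
    """Render a data table as ' | a | b | 1 | 2 | '."""
    # Parse the string form back into rows (assumed Python-literal serialisation).
    rows = ast.literal_eval(cell) if isinstance(cell, str) else cell
    flat = ""
    for table_row in rows:
        flat += " | " + " | ".join(str(value) for value in table_row)
    return flat + " | "

cell = "[['name', 'zip'], ['Depot', '12345']]"
flat = flatten_table(cell)

# Iterating over the raw string instead treats every character as a "row".
naive = "".join(" | " + " | ".join(char) for char in cell)
```

If the tables turn out to be stored in another format (for example JSON), the parsing line would need to change accordingly.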

In [34]:
scenarios_df.head()

Unnamed: 0,filepath,model,app_id,us_id,feature_name,feature_description,feature_keyword,feature_tags,rule_name,rule_description,...,scenario_examples,step_keyword,step_keyword_type,step_text,step_data_table,step_doc_string,error,us_text,scenario_id,flat_step
0,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Given,Context,I am on the website,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1,Given I am on the website
1,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,When,Action,I click the address link,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1,When I click the address link
2,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Then,Outcome,a new tab should open,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1,Then a new tab should open
3,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,And,Conjunction,"the new tab's URL should contain ""google.com/m...",,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1,"And the new tab's URL should contain ""google.c..."
4,..\data\mendeley\experiment_outputs\g04-recycl...,google-gemini-2.0-flash-001,g04-recycling,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Given,Context,I am on the website,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_2,Given I am on the website


In [35]:
full_scenarios = (
    scenarios_df.groupby(['app_id', 'model', 'us_id', 'scenario_id'])
    .agg({
        'flat_step': lambda steps: " ".join(steps),  # join all steps
        'feature_name': 'first',
        'scenario_name': 'first',
        'scenario_examples': 'first',
        'us_text': 'first'
    })
    .reset_index()
)

full_scenarios.rename(columns={'flat_step': 'scenario_text'}, inplace=True)
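
The aggregation relies on `groupby` keeping rows in their original order within each group, so steps are joined in the order they were parsed. A toy check with made-up steps:

```python
import pandas as pd

# Steps of two scenarios, interleaved on purpose.
toy = pd.DataFrame({
    "scenario_id": ["m1_1_2", "m1_1_1", "m1_1_1", "m1_1_2"],
    "flat_step": ["Given c", "Given a", "When b", "Then d"],
})

# Join each scenario's steps into one text; within a group, row order is preserved.
joined = toy.groupby("scenario_id", sort=False)["flat_step"].agg(" ".join)
```

With the default `sort=True`, the output rows are ordered by the group keys, and the string `scenario_id` values sort lexicographically (for example `..._1_10` before `..._1_2`). This affects only row order in the output, not the text of each scenario.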

In [36]:
full_scenarios.head()

Unnamed: 0,app_id,model,us_id,scenario_id,scenario_text,feature_name,scenario_name,scenario_examples,us_text
0,g04-recycling,google-gemini-2.0-flash-001,1,google-gemini-2.0-flash-001_1_1,Given I am on the website When I click the add...,Address Link Opens Google Maps in New Tab,Clicking the address link opens Google Maps in...,,"As a user, I want to click on the address, so ..."
1,g04-recycling,google-gemini-2.0-flash-001,1,google-gemini-2.0-flash-001_1_2,Given I am on the website When I click the add...,Address Link Opens Google Maps in New Tab,The Google Maps URL contains the correct address,,"As a user, I want to click on the address, so ..."
2,g04-recycling,google-gemini-2.0-flash-001,1,google-gemini-2.0-flash-001_1_3,Given I am on the website When I click the add...,Address Link Opens Google Maps in New Tab,Clicking the address link does not close the c...,,"As a user, I want to click on the address, so ..."
3,g04-recycling,google-gemini-2.0-flash-001,1,google-gemini-2.0-flash-001_1_4,Given I am on the website Then the address lin...,Address Link Opens Google Maps in New Tab,Address link uses the correct HTML attribute t...,,"As a user, I want to click on the address, so ..."
4,g04-recycling,google-gemini-2.0-flash-001,2,google-gemini-2.0-flash-001_2_1,Given I am an anonymous user When I visit the ...,Anonymous User Can View Public Recycling Cente...,Anonymous user views the recycling center list,,"As a user, I want to be able to anonymously vi..."


In [37]:
full_scenarios.to_csv(exp_dir / 'parsed_scenario_data.csv', index=False)