Receives input from Generate_Gherkins

INPUT: raw generated gherkins (model, timestamp, us_id, user_story, assistant_response, prompt_tokens, completion_tokens, created)
- created is the Unix timestamp returned by the model for when the request was processed
- timestamp is generated in our code when the request is made, as a backup in case created is null
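
The fallback described above could be sketched as follows. The column names `created` and `timestamp` come from the input description; the toy values and the `response_time` column name are assumptions made only for illustration.

```python
import pandas as pd

# Toy rows: `created` is missing for the second request (hypothetical values)
raw = pd.DataFrame({
    "us_id": [1, 2],
    "created": [1760086208, None],
    "timestamp": [1760086207, 1760086209],
})

# Prefer the model-reported time; fall back to our own request-time stamp
raw["response_time"] = raw["created"].fillna(raw["timestamp"]).astype("int64")

print(raw[["us_id", "response_time"]])
```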

REQUIRED OUTPUT FORMAT:
For input to METEOR/TF-IDF/SentenceTransformer: us_id, us_text, scenario_title, model, scenario_text, scenario_id
- scenario_text is the full scenario text (unsure whether we also want feature info, e.g. title, description)
Other outputs:
- The same fields as above, but per step
- Parse error data
- Lint report data

Should Sample_Data_Base_Preprocessing and Pipeline be combined? Or should this notebook be generic to all data, with a separate sample data base preprocess kept alongside it? I think all generated data will have (or can be given) the same format, so if human data needs separate handling we can add that later.

All outputs:
- Feature files
- Full scenario data for traceability scoring (this is the input?) X
- Step data (for clustering?)
- Parser error data (parser is used to create above set)
- Lint report data


Completed outputs:
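1. Feature file for each ai_response, written to the `gherkins/sample_data/<exp_label>/feature_files/<model>/<app>` directory
2. Full scenario data (one row per scenario), written to `parsed_scenario_data.csv` in the experiment directory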

In [1]:
import pandas as pd
import re
import string
from pathlib import Path

In [2]:
exp_dir = Path("../data/gherkins/sample_data/test")

In [3]:
# Read the raw BDD dataset, containing the model outputs
# TODO: store prompts? Important for multi-turn chats, where we should record
# the order in which user stories were presented.
raw_df = pd.read_csv(exp_dir / 'test_preprocess_input.csv')

raw_df.head()

Unnamed: 0,experiment,model,app_id,us_id,us_text,ai_response,prompt_tokens,completion_tokens,timestamp_response_generated
0,test,openai-gpt-4o-mini,g04,1,"As a user, I want to click on the address, so ...",```gherkin\nFeature: Open Google Maps from add...,78,70,1760086208
1,test,openai-gpt-4o-mini,g04,2,"As a user, I want to be able to anonymously vi...",```gherkin\nFeature: Anonymous viewing of publ...,185,162,1760086210
2,test,openai-gpt-4o-mini,g04,3,"As a user, I want to be able to enter my zip c...",```gherkin\nFeature: Search for nearby recycli...,389,228,1760086216
3,test,openai-gpt-4o-mini,g04,4,"As a user, I want to be able to get the hours ...",```gherkin\nFeature: View hours of operation f...,660,187,1760086222
4,test,openai-gpt-4o-mini,g04,5,"As a user, I want to have a flexible pick up t...",```gherkin\nFeature: Flexible pickup time for ...,879,244,1760086228


In [4]:
# df['instance_id'] = df['model'].str.replace('/','-') + '_' + df['us_id'].astype(str)

In [5]:
# Check for duplicates (on model and us_text)
duplicates = raw_df[raw_df.duplicated(subset=["model", "us_text"], keep=False)]

print(duplicates.shape[0], "duplicate rows found:")

duplicates.head()

0 duplicate rows found:


Unnamed: 0,experiment,model,app_id,us_id,us_text,ai_response,prompt_tokens,completion_tokens,timestamp_response_generated


In [6]:
# Check for missing values
raw_df.isna().sum()

experiment                      0
model                           0
app_id                          0
us_id                           0
us_text                         0
ai_response                     0
prompt_tokens                   0
completion_tokens               0
timestamp_response_generated    0
dtype: int64

In [7]:
# Check a response has been generated for each user story by each model (number of rows should equal number of unique user stories * number of unique models)
print("Number of rows in raw_df:", raw_df.shape[0])
 
raw_df.nunique()

Number of rows in raw_df: 10


experiment                       1
model                            2
app_id                           1
us_id                            5
us_text                          5
ai_response                     10
prompt_tokens                   10
completion_tokens               10
timestamp_response_generated     8
dtype: int64
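
The coverage check above is done by eye (10 rows = 5 user stories x 2 models). A minimal sketch of an explicit version follows, assuming the same `model`, `app_id` and `us_id` columns; the helper name `missing_pairs` and the toy frame are hypothetical.

```python
import pandas as pd

def missing_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Return the (model, app_id, us_id) combinations that have no response row."""
    keys = ["model", "app_id", "us_id"]
    stories = df[["app_id", "us_id"]].drop_duplicates()
    models = df[["model"]].drop_duplicates()
    # Every model is expected to answer every user story
    expected = models.merge(stories, how="cross")
    merged = expected.merge(df[keys].drop_duplicates(), on=keys, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", keys].reset_index(drop=True)

# Toy data: model m2 has no response for user story 2
toy = pd.DataFrame({
    "model": ["m1", "m1", "m2"],
    "app_id": ["g04", "g04", "g04"],
    "us_id": [1, 2, 1],
})
gaps = missing_pairs(toy)
print(gaps)
```

An empty result means every model produced a response for every user story.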

In [8]:
# Remove triple backticks and language specifiers from each ai_response
def remove_padding(input_string):
    match = re.search(r"```[\w]*\n(.*?)\n```", input_string, re.DOTALL)

    if match:
        return match.group(1).strip()
    
    return input_string.strip()

raw_df['ai_response'] = raw_df['ai_response'].apply(remove_padding)
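
A quick sanity check of the fence stripping, as a sketch. The function body is repeated from the cell above so that the check runs on its own; the three sample strings are made up for illustration. Note the last case: a response whose fence is never closed does not match the pattern and is returned with its backticks intact.

```python
import re

def remove_padding(input_string):
    # Same logic as the cell above, repeated so this check is self-contained
    match = re.search(r"```[\w]*\n(.*?)\n```", input_string, re.DOTALL)
    return match.group(1).strip() if match else input_string.strip()

# Hypothetical model responses
fenced = "```gherkin\nFeature: Demo\n  Scenario: One\n```"
chatty = "Here is the feature:\n```gherkin\nFeature: Demo\n```\nLet me know!"
unterminated = "```gherkin\nFeature: Demo"

for sample in (fenced, chatty, unterminated):
    print(repr(remove_padding(sample)))
```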

In [9]:
raw_df.head()

Unnamed: 0,experiment,model,app_id,us_id,us_text,ai_response,prompt_tokens,completion_tokens,timestamp_response_generated
0,test,openai-gpt-4o-mini,g04,1,"As a user, I want to click on the address, so ...",Feature: Open Google Maps from address link\n\...,78,70,1760086208
1,test,openai-gpt-4o-mini,g04,2,"As a user, I want to be able to anonymously vi...",Feature: Anonymous viewing of public informati...,185,162,1760086210
2,test,openai-gpt-4o-mini,g04,3,"As a user, I want to be able to enter my zip c...",Feature: Search for nearby recycling facilitie...,389,228,1760086216
3,test,openai-gpt-4o-mini,g04,4,"As a user, I want to be able to get the hours ...",Feature: View hours of operation for recycling...,660,187,1760086222
4,test,openai-gpt-4o-mini,g04,5,"As a user, I want to have a flexible pick up t...",Feature: Flexible pickup time for recycling se...,879,244,1760086228


<b>Create Feature Files</b>

At this point, we write each `ai_response` to its own feature file, then parse and lint those files.

In [10]:
feature_file_dir = Path("../data/gherkins/sample_data")

In [11]:
# Create and write a feature file for each ai_response
def write_feature_file(record, feature_file_dir=feature_file_dir):
    experiment = record['experiment']
    model = record['model']
    app_id = record['app_id']
    us_id = record['us_id']

    feature_content = record['ai_response']

    filename = f"{model}_{app_id}_{us_id}"

    # Target: <feature_file_dir>/<experiment>/feature_files/<model>/<app_id>/
    # parents=True creates any missing intermediate directories in one call
    app_sub_dir = feature_file_dir / experiment / "feature_files" / model / app_id
    app_sub_dir.mkdir(parents=True, exist_ok=True)

    feature_file_path = app_sub_dir / f"{filename}.feature"

    try:
        with open(feature_file_path, 'w', encoding='utf-8') as f:
            f.write(feature_content.strip())

    except Exception as e:
        print(f"Error writing {feature_file_path}: {e}")

# for index, row in raw_df.iterrows():
#     write_feature_file(row)
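
Model identifiers that contain a path separator (for example a provider/model form) would turn into an extra directory level in the path built above. The sketch below shows one way to guard against that, assuming the same directory layout; `safe_component` and `feature_path` are hypothetical helpers, not part of the notebook, and the model name is made up.

```python
import re
import tempfile
from pathlib import Path

def safe_component(name) -> str:
    """Replace characters that are unsafe in a file or directory name with '-'."""
    return re.sub(r"[^A-Za-z0-9._-]+", "-", str(name))

def feature_path(base_dir: Path, experiment, model, app_id, us_id) -> Path:
    """Build <base>/<experiment>/feature_files/<model>/<app_id>/<model>_<app_id>_<us_id>.feature."""
    model_dir = safe_component(model)
    app_dir = safe_component(app_id)
    filename = f"{model_dir}_{app_dir}_{us_id}.feature"
    return base_dir / safe_component(experiment) / "feature_files" / model_dir / app_dir / filename

# Write one toy feature file into a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = feature_path(Path(tmp), "test", "openai/gpt-4o-mini", "g04", 1)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("Feature: Demo", encoding="utf-8")
    written = path.read_text(encoding="utf-8")
    relative = path.relative_to(tmp).as_posix()

print(relative)
```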

<b>Review and Process Parsed Data</b>

Next, we read and review the parsed gherkin step data (generated in Gherkin_Parser.ipynb).

In [12]:
# Read parsed step data from gherkin parser output
parse_df = pd.read_csv(exp_dir / 'parsed_step_data.csv')

In [13]:
parse_df.head()

Unnamed: 0,filepath,model,app_id,us_id,feature_name,feature_description,feature_keyword,feature_tags,rule_name,rule_description,...,scenario_description,scenario_keyword,scenario_tags,scenario_examples,step_keyword,step_keyword_type,step_text,step_data_table,step_doc_string,error
0,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Scenario,,,Given,Context,I am on the page with the address link,,,False
1,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Scenario,,,When,Action,I click on the address link,,,False
2,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Scenario,,,Then,Outcome,a new tab should open,,,False
3,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Scenario,,,And,Conjunction,"the new tab's URL should start with ""https://w...",,,False
4,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Scenario,,,And,Conjunction,the new tab's URL should contain the address a...,,,False


In [14]:
# parse_df contains a record per step
parse_df.shape

(118, 22)

In [15]:
# Add us_text to parse_df by merging with raw_df on model, app_id, us_id
parse_df = parse_df.merge(raw_df[['model', 'app_id', 'us_id', 'us_text']], on=['model', 'app_id', 'us_id'], how='left')
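
A left merge like the one above silently multiplies step rows if the right-hand side ever contains duplicate keys. The sketch below uses the `validate` argument of `DataFrame.merge` to turn that into an error; the toy frames and their values are hypothetical.

```python
import pandas as pd

keys = ["model", "app_id", "us_id"]

# Toy step-level data (many rows per key) and story-level data (one row per key)
steps = pd.DataFrame({
    "model": ["m1", "m1", "m2"],
    "app_id": ["g04", "g04", "g04"],
    "us_id": [1, 1, 1],
    "step_text": ["a", "b", "c"],
})
stories = pd.DataFrame({
    "model": ["m1", "m2"],
    "app_id": ["g04", "g04"],
    "us_id": [1, 1],
    "us_text": ["story text", "story text"],
})

# validate="many_to_one" raises MergeError if the keys are not unique in `stories`
merged = steps.merge(stories, on=keys, how="left", validate="many_to_one")

# The step count must be unchanged and every step must have found its story
assert len(merged) == len(steps)
assert merged["us_text"].notna().all()
```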

In [16]:
parse_df.head(1)

Unnamed: 0,filepath,model,app_id,us_id,feature_name,feature_description,feature_keyword,feature_tags,rule_name,rule_description,...,scenario_keyword,scenario_tags,scenario_examples,step_keyword,step_keyword_type,step_text,step_data_table,step_doc_string,error,us_text
0,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,Scenario,,,Given,Context,I am on the page with the address link,,,False,"As a user, I want to click on the address, so ..."


In [17]:
# parse_df.to_csv(exp_dir / 'parser_step_data.csv', index=False)

In [18]:
parse_df.nunique()

filepath                10
model                    2
app_id                   1
us_id                    5
feature_name            10
feature_description      0
feature_keyword          1
feature_tags             0
rule_name                0
rule_description         0
rule_tags                0
scenario_name           26
scenario_description     0
scenario_keyword         1
scenario_tags            0
scenario_examples        0
step_keyword             4
step_keyword_type        4
step_text               97
step_data_table          0
step_doc_string          0
error                    1
us_text                  5
dtype: int64

<b>Review `gherkin-lint` Reports</b>

Read and review reports generated by `gherkin-lint`.

TODO: perform this in another notebook and read results here.

<b>Create Full Scenario Dataset for Traceability Evaluation</b>

Next, we join the parsed steps into complete scenarios. The resulting dataset is used to compute similarity between user stories and gherkins in our traceability experiments.

In [19]:
scenarios_df = parse_df.copy()

In [20]:
# Remove all rows for us_ids where parsing failed (error == True) for one or more models' outputs,
# to maintain a matched-pair dataset for the traceability experiments
error_us_ids = scenarios_df.loc[scenarios_df['error'] == True, 'us_id'].unique()
scenarios_df = scenarios_df[~scenarios_df['us_id'].isin(error_us_ids)].reset_index(drop=True)

scenarios_df.shape

(118, 23)
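
The sample data holds a single app, so filtering on `us_id` alone is enough here. If `us_id` values can repeat across apps (an assumption, not something this dataset shows), the same matched-pair filter could be keyed on `(app_id, us_id)` instead, as in this sketch on toy data:

```python
import pandas as pd

# Toy data: model m2 failed to parse for user story 1
toy = pd.DataFrame({
    "model": ["m1", "m2", "m1", "m2"],
    "app_id": ["g04", "g04", "g04", "g04"],
    "us_id": [1, 1, 2, 2],
    "error": [False, True, False, False],
})

story_key = ["app_id", "us_id"]

# User stories with at least one failed parse, for any model
bad = toy.loc[toy["error"], story_key].drop_duplicates()

# Anti-join: keep only rows whose story key is absent from `bad`
flagged = toy.merge(bad, on=story_key, how="left", indicator=True)
kept = flagged.loc[flagged["_merge"] == "left_only", toy.columns].reset_index(drop=True)

print(kept)
```

User story 1 is dropped for both models, so each remaining story still has one output per model.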

In [21]:
# TODO: also remove user stories that resulted in gherkins with gherkin lint errors

In [24]:
# Check that for each model's output, us-feature mapping is one-to-one
us_feature_counts = scenarios_df.groupby(['model', 'us_id'])['feature_name'].nunique()

us_feature_counts[us_feature_counts > 1]

Series([], Name: feature_name, dtype: int64)

In [None]:
# Assign unique numeric scenario_id to each scenario_name within each model and us_id
scenarios_df['scenario_id'] = scenarios_df.groupby(['model', 'us_id'])['scenario_name'].transform(lambda x: pd.factorize(x)[0] + 1)
scenarios_df['scenario_id'] = scenarios_df["model"] + "_" + scenarios_df["us_id"].astype(str) + "_" + scenarios_df['scenario_id'].astype(str)
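
To illustrate the numbering on a small example: `pd.factorize` assigns codes in order of first appearance, so scenarios are numbered 1..n within each group. This sketch also adds `app_id` to both the grouping and the id string, which is an assumption about how multi-app data might be handled rather than what the cell above does.

```python
import pandas as pd

# Toy step rows: two apps reuse the same us_id and scenario name
toy = pd.DataFrame({
    "model": ["m1"] * 5,
    "app_id": ["g04", "g04", "g04", "g05", "g05"],
    "us_id": [1, 1, 1, 1, 1],
    "scenario_name": ["Valid zip", "Valid zip", "Invalid zip", "Valid zip", "Valid zip"],
})

group_keys = ["model", "app_id", "us_id"]

# Number scenarios 1..n in order of first appearance within each group
seq = toy.groupby(group_keys)["scenario_name"].transform(lambda s: pd.factorize(s)[0] + 1)

toy["scenario_id"] = (
    toy["model"] + "_" + toy["app_id"] + "_" + toy["us_id"].astype(str) + "_" + seq.astype(str)
)

print(toy[["app_id", "scenario_name", "scenario_id"]])
```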

In [33]:
scenarios_df.head()

Unnamed: 0,filepath,model,app_id,us_id,feature_name,feature_description,feature_keyword,feature_tags,rule_name,rule_description,...,scenario_tags,scenario_examples,step_keyword,step_keyword_type,step_text,step_data_table,step_doc_string,error,us_text,scenario_id
0,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,,Given,Context,I am on the page with the address link,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1
1,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,,When,Action,I click on the address link,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1
2,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,,Then,Outcome,a new tab should open,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1
3,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,,And,Conjunction,"the new tab's URL should start with ""https://w...",,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1
4,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,,And,Conjunction,the new tab's URL should contain the address a...,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1


In [34]:
scenarios_df.nunique()

filepath                10
model                    2
app_id                   1
us_id                    5
feature_name            10
feature_description      0
feature_keyword          1
feature_tags             0
rule_name                0
rule_description         0
rule_tags                0
scenario_name           26
scenario_description     0
scenario_keyword         1
scenario_tags            0
scenario_examples        0
step_keyword             4
step_keyword_type        4
step_text               97
step_data_table          0
step_doc_string          0
error                    1
us_text                  5
scenario_id             26
dtype: int64

In [36]:
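def flatten_step(row):
    # Flatten one parsed step (keyword, text, optional data table and doc string) into a single string.
    # NOTE: step_data_table and step_doc_string are empty for every row in this dataset, so the
    # two optional branches below are not exercised here. After a CSV round-trip, step_data_table
    # is a string rather than a list of rows and would need to be parsed before iterating.
    step_text = f"{row['step_keyword']} {row['step_text']}"

    if pd.notna(row['step_data_table']):
        for table_row in row['step_data_table']:
            step_text += " | " + " | ".join(table_row)
        step_text += " | "

    if pd.notna(row['step_doc_string']):
        step_text += f" \"\"\" {row['step_doc_string']} \"\"\" "

    return step_text.strip()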

In [37]:
scenarios_df['flat_step'] = scenarios_df.apply(flatten_step, axis=1)
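
Because `parse_df` is read from CSV, a non-empty `step_data_table` cell would arrive as a string. The sketch below shows one way to handle that, under the loud assumption that the parser serialises tables as a Python-literal list of rows (the sample data has no tables, so the real format is unconfirmed). `parse_table` and `flatten_step_sketch` are hypothetical names, and the row separator differs slightly from `flatten_step`.

```python
import ast
import pandas as pd

def parse_table(cell):
    """Return the table as a list of rows (lists of strings), or [] for an empty cell."""
    if isinstance(cell, (list, tuple)):
        return [[str(value) for value in table_row] for table_row in cell]
    if cell is None or (isinstance(cell, float) and pd.isna(cell)):
        return []
    # ASSUMPTION: the cell is a string such as "[['name', 'role'], ['ann', 'admin']]"
    return [[str(value) for value in table_row] for table_row in ast.literal_eval(cell)]

def flatten_step_sketch(row):
    """Join keyword, step text and any data-table rows into one string."""
    text = f"{row['step_keyword']} {row['step_text']}"
    for table_row in parse_table(row["step_data_table"]):
        text += " | " + " | ".join(table_row) + " |"
    return text

# Hypothetical rows: one with a serialised table, one without
with_table = {
    "step_keyword": "Given",
    "step_text": "these users",
    "step_data_table": "[['name', 'role'], ['ann', 'admin']]",
}
without_table = {"step_keyword": "When", "step_text": "I click", "step_data_table": float("nan")}

print(flatten_step_sketch(with_table))
print(flatten_step_sketch(without_table))
```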

In [38]:
scenarios_df.head()

Unnamed: 0,filepath,model,app_id,us_id,feature_name,feature_description,feature_keyword,feature_tags,rule_name,rule_description,...,scenario_examples,step_keyword,step_keyword_type,step_text,step_data_table,step_doc_string,error,us_text,scenario_id,flat_step
0,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Given,Context,I am on the page with the address link,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1,Given I am on the page with the address link
1,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,When,Action,I click on the address link,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1,When I click on the address link
2,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,Then,Outcome,a new tab should open,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1,Then a new tab should open
3,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,And,Conjunction,"the new tab's URL should start with ""https://w...",,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1,"And the new tab's URL should start with ""https..."
4,..\data\gherkins\sample_data\test\features\goo...,google-gemini-2.0-flash-001,g04,1,Address Link Opens Google Maps in New Tab,,Feature,,,,...,,And,Conjunction,the new tab's URL should contain the address a...,,,False,"As a user, I want to click on the address, so ...",google-gemini-2.0-flash-001_1_1,And the new tab's URL should contain the addre...


In [39]:
full_scenarios = (
    scenarios_df.groupby(['app_id', 'model', 'us_id', 'scenario_id'])
    .agg({
        'flat_step': lambda steps: " ".join(steps),  # join all steps
        'feature_name': 'first',
        'scenario_name': 'first',
        'scenario_examples': 'first',
        'us_text': 'first'
    })
    .reset_index()
)

full_scenarios.rename(columns={'flat_step': 'scenario_text'}, inplace=True)
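
Two properties of this aggregation are worth checking on a toy frame: rows keep their original order within each group, so steps are joined in file order, and `groupby` sorts group keys by default, so string ids ending in `_10` come before `_2`. The sketch below uses `sort=False` and named aggregation as one possible variant; the toy ids and steps are made up.

```python
import pandas as pd

# Toy step rows for two scenarios, in file order
steps = pd.DataFrame({
    "scenario_id": ["m1_1_2", "m1_1_2", "m1_1_10", "m1_1_10"],
    "flat_step": ["Given a", "When b", "Given c", "Then d"],
    "us_text": ["story"] * 4,
})

scenarios = (
    steps.groupby("scenario_id", sort=False)  # keep scenarios in order of first appearance
    .agg(
        scenario_text=("flat_step", lambda s: " ".join(s)),  # join steps in row order
        us_text=("us_text", "first"),
    )
    .reset_index()
)

print(scenarios)
```

With the default `sort=True`, the row for `m1_1_10` would be listed first; the joined text within each scenario is the same either way.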

In [40]:
full_scenarios.head(20)

Unnamed: 0,app_id,model,us_id,scenario_id,scenario_text,feature_name,scenario_name,scenario_examples,us_text
0,g04,google-gemini-2.0-flash-001,1,google-gemini-2.0-flash-001_1_1,Given I am on the page with the address link W...,Address Link Opens Google Maps in New Tab,Clicking the address link opens Google Maps in...,,"As a user, I want to click on the address, so ..."
1,g04,google-gemini-2.0-flash-001,2,google-gemini-2.0-flash-001_2_1,Given I am an anonymous user When I navigate t...,Anonymous User Can View Public Recycling Cente...,Anonymous user can view a list of recycling ce...,,"As a user, I want to be able to anonymously vi..."
2,g04,google-gemini-2.0-flash-001,2,google-gemini-2.0-flash-001_2_2,Given I am an anonymous user And a recycling c...,Anonymous User Can View Public Recycling Cente...,Anonymous user can view details of a specific ...,,"As a user, I want to be able to anonymously vi..."
3,g04,google-gemini-2.0-flash-001,3,google-gemini-2.0-flash-001_3_1,Given I am on the recycling facility search pa...,Find Recycling Facilities by Zip Code,Entering a valid zip code displays nearby recy...,,"As a user, I want to be able to enter my zip c..."
4,g04,google-gemini-2.0-flash-001,3,google-gemini-2.0-flash-001_3_2,Given I am on the recycling facility search pa...,Find Recycling Facilities by Zip Code,Entering an invalid zip code displays an error...,,"As a user, I want to be able to enter my zip c..."
5,g04,google-gemini-2.0-flash-001,3,google-gemini-2.0-flash-001_3_3,Given I am on the recycling facility search pa...,Find Recycling Facilities by Zip Code,Entering a zip code with no nearby facilities ...,,"As a user, I want to be able to enter my zip c..."
6,g04,google-gemini-2.0-flash-001,4,google-gemini-2.0-flash-001_4_1,Given I am on the details page of a recycling ...,Display Recycling Facility Hours,Recycling facility hours are displayed on the ...,,"As a user, I want to be able to get the hours ..."
7,g04,google-gemini-2.0-flash-001,4,google-gemini-2.0-flash-001_4_2,Given I have searched for recycling facilities...,Display Recycling Facility Hours,Recycling facility hours are displayed in the ...,,"As a user, I want to be able to get the hours ..."
8,g04,google-gemini-2.0-flash-001,4,google-gemini-2.0-flash-001_4_3,Given I am on the details page of a recycling ...,Display Recycling Facility Hours,"Closed recycling facility displays ""Closed"" fo...",,"As a user, I want to be able to get the hours ..."
9,g04,google-gemini-2.0-flash-001,4,google-gemini-2.0-flash-001_4_4,Given I am on the details page of a recycling ...,Display Recycling Facility Hours,Recycling center has no hours listed,,"As a user, I want to be able to get the hours ..."


In [41]:
full_scenarios.to_csv(exp_dir / 'parsed_scenario_data.csv', index=False)