# Data Preprocessing
We have an annotated dataset from the paper: https://arxiv.org/pdf/2410.14335

In this section, we augment and preprocess this dataset to build a corpus for training an LLM to generate accurate critical questions.

In [3]:
import transformers
import torch
import pandas as pd
import json


In [4]:
################################################################################
#######################   PATH VARIABLES        ################################
################################################################################

output_filtered_path = 'Data/Processed/Filtered/US2016_arguments_filtered.json'
raw_input_path = 'Data/Raw/US2016.jsonl'
augmented_data_path = 'Data/Processed/Augmented/US2016.json'

################################################################################
#######################   STATIC VARIABLES      ################################
################################################################################

device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
model_path = "Models/Meta-Llama-3-8B-Instruct"

## Filter Data
First, we extract the relevant information from the loosely structured raw dataset. We are only interested in:
- **modified_premises** --> the modified premises in their final form
- **read_premises** --> the premises for the critical question scheme
- **scheme** --> the scheme of the premise
- **modified_cqs** --> the generated critical questions

In [25]:
df = pd.read_json(raw_input_path, lines=True)

# View the DataFrame
print(df.head(5))

          id       IDS SCHEMES ARGS  \
0     HOLT_0        []      []   []   
1     HOLT_2        []      []   []   
2  CLINTON_4        []      []   []   
3     HOLT_5        []      []   []   
4    TRUMP_7  [224122]      []   []   

                                        INTERVENTION  \
0  I do n't expect us to cover all the issues of ...   
1  thank you\nIt 's about putting money—more mone...   
2                        trade is an important issue   
3                          would you like to respond   
4             We have to renegotiate our trade deals   

                                  SENTENCES FULL_ARGS comment  
0                                        []        []     NaN  
1                                        []        []     NaN  
2                                        []        []     NaN  
3                                        []        []     NaN  
4  [We have to renegotiate our trade deals]        []     NaN  


In [26]:
# Function to extract desired fields from each FULL_ARGS item
def extract_argument_info(full_args_list):
    extracted = []
    for entry in full_args_list:
        if isinstance(entry, dict):
            extracted.append({
                'modified_premises': entry.get('modified_premises', ''),
                'read_premises': entry.get('read_premises', ''),
                'scheme': entry.get('scheme', ''),
                'modified_cqs': entry.get('modified_cqs', '')
            })
    return extracted

# Flatten FULL_ARGS from all rows
all_args_data = []
for full_args in df['FULL_ARGS']:
    all_args_data.extend(extract_argument_info(full_args))

# Convert to DataFrame
args_df = pd.DataFrame(all_args_data)

# Save to JSON

args_df.to_json(output_filtered_path, orient='records', indent=2)

output_filtered_path

'Data/Processed/Filtered/US2016_arguments_filtered.json'

In [4]:
df_filtered = pd.read_json(output_filtered_path)
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   modified_premises  408 non-null    object
 1   read_premises      408 non-null    object
 2   scheme             408 non-null    object
 3   modified_cqs       408 non-null    object
dtypes: object(4)
memory usage: 12.9+ KB
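
Note that `info()` only counts nulls. Because `extract_argument_info` falls back to `''` for missing keys, an empty field still registers as non-null. The following is a minimal sketch of counting blank fields directly on the saved records; `count_blank_fields` and the toy records are illustrative assumptions, not part of the notebook or the dataset.

```python
FIELDS = ("modified_premises", "read_premises", "scheme", "modified_cqs")

def count_blank_fields(records, fields=FIELDS):
    """Count, per field, the records where the value is missing, None,
    or a whitespace-only string."""
    counts = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None or (isinstance(value, str) and not value.strip()):
                counts[field] += 1
    return counts

# Toy records for illustration only (hypothetical, not taken from the dataset).
toy_records = [
    {"modified_premises": "p1", "read_premises": "r1", "scheme": "s1", "modified_cqs": "q1\nq2"},
    {"modified_premises": "p2", "read_premises": "r2", "scheme": "s2", "modified_cqs": "  "},
]
blank_counts = count_blank_fields(toy_records)
print(blank_counts)
```

On the real data, the same function could be applied to the list obtained from `json.load` on `output_filtered_path`.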


## Augment the existing dataset
We augment the dataset by rephrasing both the premises and the generated critical questions.
For this, we use *Meta-Llama-3-8B-Instruct*.
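
Instruction-tuned models sometimes prepend a lead-in such as "Here is a rephrased version of the text:" even when asked not to. The following is a minimal sketch of a post-processing step, assuming such a lead-in sits at the very start of the output and ends in a colon; `clean_rephrasing` and its regular expression are hypothetical and heuristic, not part of the notebook.

```python
import re

# Heuristic assumption: a lead-in starts with "Here is" / "Here's" and
# runs up to the first colon on the same line.
_LEAD_IN = re.compile(r"^\s*here(?:'s| is)\b[^\n:]*:\s*", re.IGNORECASE)

def clean_rephrasing(text):
    """Remove one leading 'Here is ...:' phrase, if present, and trim whitespace."""
    return _LEAD_IN.sub("", text, count=1).strip()

example = clean_rephrasing("Here is a rephrased version of the text:\n\nTrade matters.")
print(example)
```

A genuine rephrasing that happens to begin with "Here is ...:" would also be trimmed, so the results are worth spot-checking before they are written back to the dataset.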

In [5]:
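def generate_answer(pipe, terminator, input_text):
    """Rephrase `input_text` with the chat model and return only the newly generated text."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant to rephrase given text. But you always keep in mind that the meaning must not change."},
        {"role": "user", "content": input_text},
    ]

    # Render the chat messages into the model's prompt format as a plain string
    prompt = pipe.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Sample a completion; generation stops at any of the given terminator token ids
    outputs = pipe(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminator,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )

    # The pipeline returns prompt + completion, so cut off the prompt prefix
    return outputs[0]["generated_text"][len(prompt):]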

In [6]:
df_filtered = pd.read_json(output_filtered_path)
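# Note: despite its name, this prefix is sent as part of the user message, not as the system message
system_prompt = "Please rephrase the following input text and give a direct answer without something like Here is a rephrased version of the text: "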


pipeline = transformers.pipeline(
    "text-generation",
    model=model_path,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device=device,
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Preview only: rephrase the first premise to sanity-check the prompt, then stop
for premise in df_filtered['modified_premises']:
    input_text = system_prompt + premise
    rephrased_premise = generate_answer(pipeline, terminators, input_text)
    print("Input text: ", input_text)
    print("Rephrased text: ", rephrased_premise)
    print("-----------")
    break

new_entries = []

for _, row in df_filtered.iterrows():
    input_text = system_prompt + row['modified_premises']
    rephrased_premise = generate_answer(pipeline, terminators, input_text)

    new_entry = {
        "modified_premises": rephrased_premise,
        "read_premises" : row['read_premises'],
        "scheme": row['scheme'],
        "modified_cqs": row['modified_cqs']
    }

    new_entries.append(new_entry)

# Combine old and new entries
full_data = df_filtered.to_dict(orient='records') + new_entries

# Save to new file
with open(augmented_data_path, "w") as f:
    json.dump(full_data, f, indent=2)

print(f"Extended dataset saved to {augmented_data_path}")





Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use mps
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Input text:  Please rephrase the following input text and give a direct answer without something like Here is a rephrased version of the text: 
we 've created a movement is true in this situation.
we 've created a movement is often a sign of situations in which Clinton and others, politicians, should have been doing this for years is true.
Clinton and others, politicians, should have been doing this for years might be true in this situation.
Rephrased text:  The fact that we've created a movement in this situation is a sign that Clinton and others, politicians, should have been doing this for years.
-----------



Extended dataset saved to Data/Processed/Augmented/US2016.json


In [7]:
df_augmented = pd.read_json(augmented_data_path)
df_augmented.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   modified_premises  816 non-null    object
 1   read_premises      816 non-null    object
 2   scheme             816 non-null    object
 3   modified_cqs       816 non-null    object
dtypes: object(4)
memory usage: 25.6+ KB


In [2]:
# Load the extended dataset
with open(augmented_data_path, "r") as f:
    data = json.load(f)

# This will store the new entries created from rephrased questions
rephrased_entries = []

# Rephrasing prompt
question_prompt = "Please rephrase the following input text and give a direct answer without something like Here is a rephrased version of the text: "

# enumerate() numbers every entry, including the ones skipped below
for i, entry in enumerate(data):
    print(f"{i}. Entry")
    cqs_field = entry.get("modified_cqs", "")
    if not cqs_field.strip():
        # Nothing to rephrase for entries without critical questions
        continue

    # Split the field into individual questions, one per line
    questions = [q.strip() for q in cqs_field.split('\n') if q.strip()]

    # Rephrase each question separately
    rephrased_questions = [
        generate_answer(pipeline, terminators, question_prompt + q)
        for q in questions
    ]

    # Combine all rephrased questions into a single string
    new_modified_cqs = "\n".join(rephrased_questions)

    new_entry = {
        "modified_premises": entry["modified_premises"],
        "read_premises": entry["read_premises"],
        "scheme": entry["scheme"],
        "modified_cqs": new_modified_cqs
    }

    rephrased_entries.append(new_entry)

# Append new entries to the original data
data.extend(rephrased_entries)

# Save updated dataset
with open(augmented_data_path, "w") as f:
    json.dump(data, f, indent=2)

print(f"Rephrased question sets added to {augmented_data_path}")

NameError: name 'augmented_data_path' is not defined

In [13]:
df_augmented = pd.read_json(augmented_data_path)
df_augmented.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1566 entries, 0 to 1565
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   modified_premises  1566 non-null   object
 1   read_premises      1566 non-null   object
 2   scheme             1566 non-null   object
 3   modified_cqs       1566 non-null   object
dtypes: object(4)
memory usage: 49.1+ KB
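
The row counts are consistent with the two augmentation passes: the premise pass doubles 408 entries to 816, and the question pass adds one entry for every row whose `modified_cqs` is not empty. The sketch below spells out that arithmetic. The figure of 66 skipped entries is inferred from the outputs above (816 - (1566 - 816)), not measured directly, and it assumes the question pass ran exactly once on the 816-row file.

```python
def expected_size(n_filtered, n_skipped_in_question_pass):
    """Expected number of rows after the premise pass and the question pass."""
    after_premise_pass = 2 * n_filtered  # originals + rephrased premises
    added_by_question_pass = after_premise_pass - n_skipped_in_question_pass
    return after_premise_pass + added_by_question_pass

# 408 filtered rows; 66 skipped rows is an inferred value (see the note above).
total = expected_size(408, 66)
print(total)
```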


In [5]:
df_hotpotqa = pd.read_json("Data/Raw/hotpot_train_v1.1.json")
df_hotpotqa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90447 entries, 0 to 90446
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   supporting_facts  90447 non-null  object
 1   level             90447 non-null  object
 2   question          90447 non-null  object
 3   context           90447 non-null  object
 4   answer            90447 non-null  object
 5   _id               90447 non-null  object
 6   type              90447 non-null  object
dtypes: object(7)
memory usage: 4.8+ MB


In [6]:
df_hotpotqa.head(5)

| | supporting_facts | level | question | context | answer | _id | type |
|---|---|---|---|---|---|---|---|
| 0 | [[Arthur's Magazine, 0], [First for Women, 0]] | medium | Which magazine was started first Arthur's Maga... | [[Radio City (Indian radio station), [Radio Ci... | Arthur's Magazine | 5a7a06935542990198eaf050 | comparison |
| 1 | [[Oberoi family, 0], [The Oberoi Group, 0]] | medium | The Oberoi family is part of a hotel company t... | [[Ritz-Carlton Jakarta, [The Ritz-Carlton Jaka... | Delhi | 5a879ab05542996e4f30887e | bridge |
| 2 | [[Allie Goertz, 0], [Allie Goertz, 1], [Allie ... | hard | Musician and satirist Allie Goertz wrote a son... | [[Lisa Simpson, [Lisa Marie Simpson is a ficti... | President Richard Nixon | 5a8d7341554299441c6b9fe5 | bridge |
| 3 | [[Peggy Seeger, 0], [Peggy Seeger, 1], [Ewan M... | medium | What nationality was James Henry Miller's wife? | [[Moloch: or, This Gentile World, [Moloch: or,... | American | 5a82171f5542990a1d231f4a | bridge |
| 4 | [[Cadmium chloride, 1], [Ethanol, 0]] | medium | Cadmium Chloride is slightly soluble in this c... | [[Cadmium chloride, [Cadmium chloride is a whi... | alcohol | 5a84dd955542997b5ce3ff79 | bridge |


In [17]:
df_hotpotqa.iloc[0]

supporting_facts       [[Arthur's Magazine, 0], [First for Women, 0]]
level                                                          medium
question            Which magazine was started first Arthur's Maga...
context             [[Radio City (Indian radio station), [Radio Ci...
answer                                              Arthur's Magazine
_id                                          5a7a06935542990198eaf050
type                                                       comparison
Name: 0, dtype: object

In [20]:
df_hotpotqa.iloc[5]['context']

[['Li Na',
  ['Li Na (; ; born 26 February 1982) is a retired Chinese professional tennis player, who achieved a career-high WTA-ranking of world No. 2 on 17 February 2014.',
   ' Over the course of her career, Li won seven WTA singles titles and two Grand Slam singles titles at the 2011 French Open and 2014 Australian Open.',
   " Li's rise to prominence came after those victories, which made her the first and only Grand Slam singles champion from East Asia and Asia as a whole.",
   ' Prior to this, she had already become the first player representing an East Asian and Asian country to appear in a Grand Slam singles final, a milestone she achieved at the 2011 Australian Open.',
   ' Li was also the runner-up at the 2013 Australian Open and 2013 WTA Tour Championships, a three-time quarterfinalist at Wimbledon and a semifinalist at the 2008 Beijing Olympic Games and 2013 US Open.',
   " Among her other most notable accolades, she was the first Chinese player to win a WTA tour title at 

In [18]:
context = df_hotpotqa.iloc[5]['context']

In [19]:
def flatten(lst):
    for item in lst:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

# Flatten the data and join into a single string
flattened = list(flatten(context))
result = ' '.join(flattened)

print(result)

Li Na Li Na (; ; born 26 February 1982) is a retired Chinese professional tennis player, who achieved a career-high WTA-ranking of world No. 2 on 17 February 2014.  Over the course of her career, Li won seven WTA singles titles and two Grand Slam singles titles at the 2011 French Open and 2014 Australian Open.  Li's rise to prominence came after those victories, which made her the first and only Grand Slam singles champion from East Asia and Asia as a whole.  Prior to this, she had already become the first player representing an East Asian and Asian country to appear in a Grand Slam singles final, a milestone she achieved at the 2011 Australian Open.  Li was also the runner-up at the 2013 Australian Open and 2013 WTA Tour Championships, a three-time quarterfinalist at Wimbledon and a semifinalist at the 2008 Beijing Olympic Games and 2013 US Open.  Among her other most notable accolades, she was the first Chinese player to win a WTA tour title at the Guangzhou International Women's Ope
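
Flattening with a plain join merges each title into its first sentence, as in "Li Na Li Na (; ; born ..." above. The following sketch shows one possible alternative that keeps the title separate, assuming every context item is a `[title, sentences]` pair as in the output of `df_hotpotqa.iloc[5]['context']`; `format_context` and the toy input are hypothetical.

```python
def format_context(context):
    """Render each [title, sentences] pair as one 'Title: text' line."""
    paragraphs = []
    for title, sentences in context:
        # Sentences carry leading spaces in the raw data, so strip before joining
        body = " ".join(sentence.strip() for sentence in sentences)
        paragraphs.append(f"{title}: {body}")
    return "\n".join(paragraphs)

# Toy input for illustration only (hypothetical, not taken from the dataset).
toy_context = [
    ["Title A", ["First sentence.", " Second sentence."]],
    ["Title B", ["Only sentence."]],
]
formatted = format_context(toy_context)
print(formatted)
```

Keeping one paragraph per line preserves the boundary between source articles, which a single space-joined string loses.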