# Data Preprocessing
We use an annotated dataset from the paper: https://arxiv.org/pdf/2410.14335

In this section we augment and preprocess this dataset to generate a corpus for training an LLM to generate accurate critical questions.

In [23]:
import transformers
import torch
import pandas as pd


In [24]:
################################################################################
#######################   PATH VARIABLES        ################################
################################################################################

output_filtered_path = 'Data/Processed/Filtered/US2016_arguments_filtered.json'
raw_input_path = 'Data/Raw/US2016.jsonl'
augmented_data_path = 'Data/Processed/Augmented/US2016.json'

################################################################################
#######################   STATIC VARIABLES      ################################
################################################################################

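# Prefer Apple's MPS backend, then CUDA, and fall back to the CPU otherwise
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")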
model_path = "Models/Meta-Llama-3-8B-Instruct"

## Filter Data
First we have to extract the relevant information from the poorly structured raw dataset. We are only interested in:
- **modified_premises** --> the modified premises as finally used
- **read_premises** --> the premises for the critical question scheme
- **scheme** --> the scheme of the premise
- **modified_cqs** --> the generated critical questions

In [25]:
df = pd.read_json(raw_input_path, lines=True)

# View the DataFrame
print(df.head(5))

          id       IDS SCHEMES ARGS  \
0     HOLT_0        []      []   []   
1     HOLT_2        []      []   []   
2  CLINTON_4        []      []   []   
3     HOLT_5        []      []   []   
4    TRUMP_7  [224122]      []   []   

                                        INTERVENTION  \
0  I do n't expect us to cover all the issues of ...   
1  thank you\nIt 's about putting money—more mone...   
2                        trade is an important issue   
3                          would you like to respond   
4             We have to renegotiate our trade deals   

                                  SENTENCES FULL_ARGS comment  
0                                        []        []     NaN  
1                                        []        []     NaN  
2                                        []        []     NaN  
3                                        []        []     NaN  
4  [We have to renegotiate our trade deals]        []     NaN  


In [26]:
# Function to extract desired fields from each FULL_ARGS item
def extract_argument_info(full_args_list):
    extracted = []
    for entry in full_args_list:
        if isinstance(entry, dict):
            extracted.append({
                'modified_premises': entry.get('modified_premises', ''),
                'read_premises': entry.get('read_premises', ''),
                'scheme': entry.get('scheme', ''),
                'modified_cqs': entry.get('modified_cqs', '')
            })
    return extracted

# Flatten FULL_ARGS from all rows
all_args_data = []
for full_args in df['FULL_ARGS']:
    all_args_data.extend(extract_argument_info(full_args))

# Convert to DataFrame
args_df = pd.DataFrame(all_args_data)

# Save to JSON

args_df.to_json(output_filtered_path, orient='records', indent=2)

output_filtered_path

'Data/Processed/Filtered/US2016_arguments_filtered.json'

In [27]:
df_filtered = pd.read_json(output_filtered_path)
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   modified_premises  408 non-null    object
 1   read_premises      408 non-null    object
 2   scheme             408 non-null    object
 3   modified_cqs       408 non-null    object
dtypes: object(4)
memory usage: 12.9+ KB
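
Note that `info()` counts empty strings as non-null, and `extract_argument_info` falls back to `''` for missing keys, so the report above cannot reveal fields that are actually empty. A minimal sketch of an explicit check follows. It assumes the filtered file is a list of records (as written by `to_json(..., orient='records')`) loaded with the standard `json` module. The helper names `is_empty` and `count_empty_fields` are hypothetical and not part of the original notebook.

```python
FIELDS = ["modified_premises", "read_premises", "scheme", "modified_cqs"]


def is_empty(value):
    """Treat None, blank strings, and empty lists/dicts as empty."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def count_empty_fields(records, fields=FIELDS):
    """Count, per field, how many records hold an empty or missing value."""
    counts = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            if is_empty(record.get(field)):
                counts[field] += 1
    return counts


# Toy records for illustration only (not taken from the dataset)
sample = [
    {"modified_premises": "p", "read_premises": "", "scheme": "s", "modified_cqs": []},
    {"modified_premises": "q", "read_premises": "r", "scheme": "s", "modified_cqs": ["cq"]},
]
print(count_empty_fields(sample))
```

To run it on the real data, one could load the file with `json.load(open(output_filtered_path))` and pass the resulting list to `count_empty_fields`.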


## Augment the existing dataset
We try to augment the dataset by rephrasing the generated critical questions as well as the premises used.
For that we use *Meta-Llama-3-8B-Instruct*.

In [28]:
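def generate_answer(pipe, terminator, input_text):
    """Rephrase `input_text` with the chat model behind `pipe`.

    `terminator` is the list of token ids that end generation. Only the newly
    generated text is returned; the prompt is sliced off.
    """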
    messages = [
        {"role": "system", "content": "You are a helpful assistant to rephrase given text. But you always keep in mind that the meaning must not change."},
        {"role": "user", "content": input_text},
    ]

    prompt = pipe.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
    )
    outputs = pipe(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminator,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )

    return outputs[0]["generated_text"][len(prompt):]
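
Instruction-tuned models sometimes prepend a lead-in such as "Here is a rephrased version of the text:" even when asked not to. The following is a minimal post-processing sketch, not part of the original notebook. The helper `strip_preamble` and its regular expression are assumptions chosen for illustration and only cover a few common phrasings.

```python
import re

# Hypothetical pattern: a leading "Here is ...:" / "Here's ...:" / "Here are ...:"
# phrase on the first line, up to and including the colon.
_PREAMBLE = re.compile(r"^\s*(here is|here's|here are)\b[^:\n]*:\s*", re.IGNORECASE)


def strip_preamble(text):
    """Remove a leading 'Here is ...:' phrase and surrounding quotes, if present."""
    cleaned = _PREAMBLE.sub("", text, count=1).strip()
    # Drop one pair of enclosing double quotes that some models add
    if len(cleaned) >= 2 and cleaned[0] == cleaned[-1] == '"':
        cleaned = cleaned[1:-1].strip()
    return cleaned
```

The function could be applied to the return value of `generate_answer` before the rephrased text is stored. Text without such a lead-in is returned unchanged apart from trimmed whitespace.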

In [29]:
df_filtered = pd.read_json(output_filtered_path)
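# Instruction prefix that is prepended to each premise in the *user* message
# (the actual system message is defined inside generate_answer)
system_prompt = "Please rephrase the following input text and give a direct answer without something like Here is a rephrased version of the text: "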


pipeline = transformers.pipeline(
    "text-generation",
    model=model_path,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device=device,
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]


for premise in df_filtered['modified_premises']:
    input_text = system_prompt + premise
    rephrased = generate_answer(pipeline, terminators, input_text)
    print("Input text: ", input_text)
    print("Rephrased text: ", rephrased)
    print("-----------")
    break



Device set to use mps
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Input text:  Please rephrase the following input text and give a direct answer without something like Here is a rephrased version of the text: 
we 've created a movement is true in this situation.
we 've created a movement is often a sign of situations in which Clinton and others, politicians, should have been doing this for years is true.
Clinton and others, politicians, should have been doing this for years might be true in this situation.
Rephrased text:  The creation of a movement in this situation is a long-overdue indication that Clinton and other politicians should have taken action years ago.
-----------
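
The loop above stops after the first premise. One possible way to extend it to the whole dataset and write the result to `augmented_data_path` is sketched below. This is an assumption about how the augmentation could be completed, not the notebook's actual implementation. The helpers `augment_records` and `save_records` and the `augmented` marker key are hypothetical, and the records are assumed to be plain dicts as stored in the filtered JSON file.

```python
import json
import os


def augment_records(records, rephrase, fields=("modified_premises",)):
    """Return each original record, followed by a rephrased copy when possible.

    `rephrase` is any callable mapping a string to its rephrased version.
    """
    augmented = []
    for record in records:
        # Keep the original, flagged with the hypothetical 'augmented' marker
        augmented.append(dict(record, augmented=False))
        copy = dict(record, augmented=True)
        changed = False
        for field in fields:
            value = record.get(field)
            # Only rephrase non-empty string fields
            if isinstance(value, str) and value.strip():
                copy[field] = rephrase(value)
                changed = True
        if changed:
            augmented.append(copy)
    return augmented


def save_records(records, path):
    """Write the records as indented JSON, creating the target folder if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2, ensure_ascii=False)
```

With the objects defined in this notebook, `rephrase` could be `lambda text: generate_answer(pipeline, terminators, system_prompt + text)`, and the result could be stored with `save_records(augmented, augmented_data_path)`. Passing `fields=("modified_premises", "modified_cqs")` would only rephrase the questions as well if they are stored as plain strings.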
