# Prepare dataset (jsonl file)

- Prepare datasets for fine-tuning GPT-3.5-turbo with **features as text** and **all-in-one strategy**.

- Here, only the argument component (AC) is given as a feature; the prompt built below does not include its sentence.

- We create the data files: `data_train_v1.jsonl`, `data_val_v1.jsonl`, `data_test_v1.jsonl`

## Libraries

In [144]:
import os
import json
import pandas as pd
import random

In [78]:
random.seed(42)

## Load csv file

In [79]:
data_dir = os.path.join(os.getcwd(), "data")

In [80]:
df = pd.read_csv(os.path.join(data_dir, "persuasive_essays_dataset.csv"), index_col=0)

In [81]:
df.isna().sum()

tag                                     0
label                                   0
start                                   0
end                                     0
argument_component                      0
essay_file                              0
essay_title                             0
essay_text                              0
sentence                                0
nr_essay_paragraphs                     0
paragraph_nr                            0
paragraph                               0
is_component_in_intro_paragraph         0
is_component_in_conclusion_paragraph    0
is_component_first_in_paragraph         0
is_component_last_in_paragraph          0
split                                   0
structral_featxt                        0
argument_counter                        0
dtype: int64

In [82]:
df.head()

Unnamed: 0,tag,label,start,end,argument_component,essay_file,essay_title,essay_text,sentence,nr_essay_paragraphs,paragraph_nr,paragraph,is_component_in_intro_paragraph,is_component_in_conclusion_paragraph,is_component_first_in_paragraph,is_component_last_in_paragraph,split,structral_featxt,argument_counter
0,T1,MajorClaim,503,575,we should attach more importance to cooperatio...,essay001.txt,Should students be taught to compete or to coo...,Should students be taught to compete or to coo...,"From this point of view, I firmly believe that...",4,1,It is always said that competition can effecti...,1,0,1,1,TRAIN,Topic: Should students be taught to compete or...,1
1,T3,Claim,591,714,"through cooperation, children can learn about ...",essay001.txt,Should students be taught to compete or to coo...,Should students be taught to compete or to coo...,"First of all, through cooperation, children ca...",4,2,"First of all, through cooperation, children ca...",0,0,1,0,TRAIN,Topic: Should students be taught to compete or...,2
2,T4,Premise,716,851,What we acquired from team work is not only ho...,essay001.txt,Should students be taught to compete or to coo...,Should students be taught to compete or to coo...,What we acquired from team work is not only ho...,4,2,"First of all, through cooperation, children ca...",0,0,0,0,TRAIN,Topic: Should students be taught to compete or...,3
3,T5,Premise,853,1086,"During the process of cooperation, children ca...",essay001.txt,Should students be taught to compete or to coo...,Should students be taught to compete or to coo...,"During the process of cooperation, children ca...",4,2,"First of all, through cooperation, children ca...",0,0,0,0,TRAIN,Topic: Should students be taught to compete or...,4
4,T6,Premise,1088,1191,All of these skills help them to get on well w...,essay001.txt,Should students be taught to compete or to coo...,Should students be taught to compete or to coo...,All of these skills help them to get on well w...,4,2,"First of all, through cooperation, children ca...",0,0,0,1,TRAIN,Topic: Should students be taught to compete or...,5


In [83]:
df.split.value_counts()

split
TRAIN    4823
TEST     1266
Name: count, dtype: int64

In [121]:
train_essays_l = list(df[df.split=="TRAIN"].essay_file.value_counts().index)
len(train_essays_l)

322

In [122]:
# validation set: 10% of the training essays (sampled by essay, not by component)

val_size = int(len(train_essays_l) * 10/100)
val_size

32

In [123]:
val_essays_l = random.sample(train_essays_l, val_size)

In [124]:
len(val_essays_l)
# val_essays_l

32

In [125]:
train_essays_l = list(set(train_essays_l) - set(val_essays_l))
len(train_essays_l)

290
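Sampling whole essays keeps all components of one essay in the same split, so the validation set contains no text from essays seen during training. The sketch below shows the same idea as a reusable function on toy data. `split_by_group` and the toy essay ids are hypothetical names, and the local `random.Random(seed)` is an assumption; the notebook itself seeds the global generator, so the sampled essays will differ.

```python
import random

def split_by_group(groups, val_fraction=0.1, seed=42):
    """Split unique group ids (here: essay file names) into train and validation lists."""
    unique = sorted(set(groups))      # sorted, so the result does not depend on input order
    rng = random.Random(seed)         # local generator, leaves the global random state alone
    n_val = int(len(unique) * val_fraction)
    val = rng.sample(unique, n_val)
    val_set = set(val)
    train = [g for g in unique if g not in val_set]
    return train, val

# Toy essay ids (hypothetical): 20 essays with 3 components each
essay_ids = [f"essay{i:03d}.txt" for i in range(1, 21) for _ in range(3)]
train_ids, val_ids = split_by_group(essay_ids)

# No essay may appear in both splits
assert not set(train_ids) & set(val_ids)
print(len(train_ids), len(val_ids))
```

Building the training list with a list comprehension instead of a set difference also keeps its order stable between runs.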

## Prepare prompt

In [126]:
# Dataset in chat completion format

def formatting_fct(task_description="", question="", answer="", mode="train"):
    
    # In "train" mode the assistant turn holds the gold answer; otherwise it is left empty
    assistant_content = answer if mode == "train" else ""
    
    prompt_d = {"messages": [
        {"role": "system", "content": task_description},
        {"role": "user", "content": question},
        {"role": "assistant", "content": assistant_content}
    ]}
    
    return prompt_d
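Before writing thousands of records, it can help to check that each one has the shape `formatting_fct` is meant to produce. The following is a minimal sketch of such a check. `check_record` and `EXPECTED_ROLES` are hypothetical helpers, and the rules they enforce (three turns in system/user/assistant order, string contents) describe this notebook's records, not a statement about what the fine-tuning API accepts.

```python
EXPECTED_ROLES = ["system", "user", "assistant"]

def check_record(record, require_answer=True):
    """Return a list of problems found in one chat-format record (empty list: looks fine)."""
    messages = record.get("messages")
    if not isinstance(messages, list):
        return ["'messages' is missing or not a list"]

    problems = []
    roles = [m.get("role") for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected role sequence: {roles}")
    for m in messages:
        if not isinstance(m.get("content"), str):
            problems.append(f"non-string content for role {m.get('role')!r}")
    # Training and validation records need a gold answer in the last turn
    if require_answer and messages and not messages[-1].get("content"):
        problems.append("assistant answer is empty")
    return problems

good = {"messages": [
    {"role": "system", "content": "task"},
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": "claim"},
]}
empty_answer = {"messages": good["messages"][:2] + [{"role": "assistant", "content": ""}]}

print(check_record(good))
print(check_record(empty_answer))
print(check_record(empty_answer, require_answer=False))
```

With `require_answer=False` the same check can be applied to records built with `mode="test"`.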

In [127]:
my_task_description = """### You are an expert in linguistics and you will classify an arguement component into three possible classes: major claim, claim, or premise.
"""

In [128]:
print(my_task_description)

### You are an expert in linguistics and you will classify an arguement component into three possible classes: major claim, claim, or premise.



In [129]:
def build_question(x):
    
    question = f"""### Here is an argument component given in quotation marks: "{x}"\nIs this argument compoment a major claim, a claim, or a premise? No other answer besides these three is accepted.
    """
    
    return question
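Because the closing triple quote of the f-string sits on its own indented line, every question ends with a newline followed by four spaces. The toy example below reproduces the effect with a shortened, hypothetical template; both function names are made up for illustration. If the trailing whitespace is removed, the same change has to be applied to the prompts used at inference time, since the fine-tuned model sees the prompt exactly as written.

```python
def with_trailing_whitespace(x):
    # Same layout as build_question: the closing quotes are on an indented line
    text = f"""Component: "{x}"
    """
    return text

def without_trailing_whitespace(x):
    # Single-line f-string: nothing follows the closing quotation mark
    return f'Component: "{x}"'

a = with_trailing_whitespace("abc")
b = without_trailing_whitespace("abc")

print(repr(a))
print(repr(b))

# The two variants differ only in the trailing newline and spaces
assert a.rstrip() == b
```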

In [130]:
question = build_question(df.iloc[0].argument_component)
print(question)

### Here is an argument component given in quotation marks: "we should attach more importance to cooperation during primary education"
Is this argument compoment a major claim, a claim, or a premise? No other answer besides these three is accepted.
    


In [131]:
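def build_answer(x):
    
    # Map the dataset labels to the answer strings used in the prompt
    label_map = {"MajorClaim": "major claim", "Claim": "claim", "Premise": "premise"}
    
    # Fail loudly on an unknown label instead of silently returning None
    if x not in label_map:
        raise ValueError(f"Unexpected label: {x!r}")
    
    return label_map[x]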

In [132]:
answer = build_answer(df.iloc[0].label)
print(answer)

major claim


In [133]:
print(formatting_fct(my_task_description, question, answer, mode="train"))

{'messages': [{'role': 'system', 'content': '### You are an expert in linguistics and you will classify an arguement component into three possible classes: major claim, claim, or premise.\n'}, {'role': 'user', 'content': '### Here is an argument component given in quotation marks: "we should attach more importance to cooperation during primary education"\nIs this argument compoment a major claim, a claim, or a premise? No other answer besides these three is accepted.\n    '}, {'role': 'assistant', 'content': 'major claim'}]}


## Prepare data files

### Train set

In [134]:
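data_file_train = []

# Use the row yielded by iterrows() directly; df.iloc[i] with an index label
# is only correct while the index happens to equal the row position
for _, row in df[df["essay_file"].isin(train_essays_l)].iterrows():
    
    question = build_question(row["argument_component"])
    answer = build_answer(row["label"])
    
    data_file_train.append( formatting_fct(my_task_description, question, answer, mode="train") )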

In [135]:
len(data_file_train)

4356
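A quick look at how the three answers are distributed can be useful before fine-tuning, since the assistant turn is the only place the label appears in a record. The sketch below counts the answers in a list of chat-format records. `answer_counts` and the toy records are hypothetical; no counts for the real dataset are implied.

```python
from collections import Counter

def answer_counts(records):
    """Count the assistant answers across a list of chat-format records."""
    return Counter(record["messages"][-1]["content"] for record in records)

def toy_record(answer):
    # Hypothetical record with the same shape as the output of formatting_fct
    return {"messages": [
        {"role": "system", "content": "task"},
        {"role": "user", "content": "question"},
        {"role": "assistant", "content": answer},
    ]}

toy_records = [toy_record(a) for a in ["premise", "premise", "claim", "major claim"]]
counts = answer_counts(toy_records)

for label, n in counts.most_common():
    print(f"{label}: {n} ({n / len(toy_records):.0%})")
```

In the notebook the call would be `answer_counts(data_file_train)`.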

In [136]:
for i in range(3):
    
    print(data_file_train[i])
    print()

{'messages': [{'role': 'system', 'content': '### You are an expert in linguistics and you will classify an arguement component into three possible classes: major claim, claim, or premise.\n'}, {'role': 'user', 'content': '### Here is an argument component given in quotation marks: "we should attach more importance to cooperation during primary education"\nIs this argument compoment a major claim, a claim, or a premise? No other answer besides these three is accepted.\n    '}, {'role': 'assistant', 'content': 'major claim'}]}

{'messages': [{'role': 'system', 'content': '### You are an expert in linguistics and you will classify an arguement component into three possible classes: major claim, claim, or premise.\n'}, {'role': 'user', 'content': '### Here is an argument component given in quotation marks: "through cooperation, children can learn about interpersonal skills which are significant in the future life of all students"\nIs this argument compoment a major claim, a claim, or a pre

### Validation set

In [137]:
data_file_val = []

# mode="train" is intended here: validation records also need the gold answer
for _, row in df[df["essay_file"].isin(val_essays_l)].iterrows():
    
    question = build_question(row["argument_component"])
    answer = build_answer(row["label"])
    
    data_file_val.append( formatting_fct(my_task_description, question, answer, mode="train") )

In [138]:
len(data_file_val)

467

In [139]:
for i in range(3):
    
    print(data_file_val[i])
    print()

{'messages': [{'role': 'system', 'content': '### You are an expert in linguistics and you will classify an arguement component into three possible classes: major claim, claim, or premise.\n'}, {'role': 'user', 'content': '### Here is an argument component given in quotation marks: "it can effectively save time which is considered as money in our modern society"\nIs this argument compoment a major claim, a claim, or a premise? No other answer besides these three is accepted.\n    '}, {'role': 'assistant', 'content': 'claim'}]}

{'messages': [{'role': 'system', 'content': '### You are an expert in linguistics and you will classify an arguement component into three possible classes: major claim, claim, or premise.\n'}, {'role': 'user', 'content': '### Here is an argument component given in quotation marks: "it is obvious that prepared food can bring about some negative influence result from utilizing the artificial ingredients, ignoring the nutrition of food and modifying people\'s eating

### Test set

In [140]:
data_file_test = []

# mode="test" leaves the assistant turn empty, so the gold label is not stored in the record
for _, row in df[df["split"] == "TEST"].iterrows():
    
    question = build_question(row["argument_component"])
    answer = build_answer(row["label"])
    
    data_file_test.append( formatting_fct(my_task_description, question, answer, mode="test") )

In [141]:
len(data_file_test)

1266

In [142]:
for i in range(3):
    
    print(data_file_test[i])
    print()

{'messages': [{'role': 'system', 'content': '### You are an expert in linguistics and you will classify an arguement component into three possible classes: major claim, claim, or premise.\n'}, {'role': 'user', 'content': '### Here is an argument component given in quotation marks: "the tourism bring large profit for the destination countries"\nIs this argument compoment a major claim, a claim, or a premise? No other answer besides these three is accepted.\n    '}, {'role': 'assistant', 'content': ''}]}

{'messages': [{'role': 'system', 'content': '### You are an expert in linguistics and you will classify an arguement component into three possible classes: major claim, claim, or premise.\n'}, {'role': 'user', 'content': '### Here is an argument component given in quotation marks: "this industry has affected the cultural attributes and damaged the natural environment of the tourist destinations"\nIs this argument compoment a major claim, a claim, or a premise? No other answer besides th
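The test records end with an empty assistant turn, and the gold labels stay in `df` rather than in the file. When these records are later used as prompts, one option is to drop that empty turn first. The helper below is a sketch under that assumption; `to_inference_messages` is a hypothetical name, and whether the empty turn has to be removed depends on how the prompts are sent to the model.

```python
def to_inference_messages(record):
    """Return the messages of a record without a trailing empty assistant turn."""
    messages = list(record["messages"])   # shallow copy, the record itself is not modified
    if messages and messages[-1]["role"] == "assistant" and not messages[-1]["content"]:
        messages = messages[:-1]
    return messages

# Toy record (hypothetical) shaped like the test records above
test_record = {"messages": [
    {"role": "system", "content": "task"},
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": ""},
]}

prompt_messages = to_inference_messages(test_record)
print([m["role"] for m in prompt_messages])
```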

## Save `jsonl` files

In [146]:
file_name = "data_train_v1.jsonl"

with open(os.path.join(data_dir, file_name), 'w') as fh:
    
    for entry in data_file_train:
        
        json.dump(entry, fh)
        fh.write('\n')

In [147]:
file_name = "data_val_v1.jsonl"

with open(os.path.join(data_dir, file_name), 'w') as fh:
    
    for entry in data_file_val:
        
        json.dump(entry, fh)
        fh.write('\n')

In [148]:
file_name = "data_test_v1.jsonl"

with open(os.path.join(data_dir, file_name), 'w') as fh:
    
    for entry in data_file_test:
        
        json.dump(entry, fh)
        fh.write('\n')
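The three cells above repeat the same write loop. A small pair of helpers can replace them and also read a file back to confirm that every line is valid JSON. This is a sketch on toy data written to a temporary directory; `write_jsonl`, `read_jsonl`, and the file name `toy.jsonl` are hypothetical.

```python
import json
import os
import tempfile

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Toy records (hypothetical) with the same shape as the notebook's records
records = [
    {"messages": [
        {"role": "system", "content": "task"},
        {"role": "user", "content": 'a "quoted" component\nsecond line'},
        {"role": "assistant", "content": "premise"},
    ]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.jsonl")
    write_jsonl(path, records)
    loaded = read_jsonl(path)

# Quotes and newlines inside the content survive the round trip
assert loaded == records
print(len(loaded))
```

In the notebook the call would be, for example, `write_jsonl(os.path.join(data_dir, "data_train_v1.jsonl"), data_file_train)`.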