# Important note:

Both the FEVER and SciFact datasets label claims the same way:

SUPPORT: The evidence backs up the claim. The abstract or sentences provided agree with what the claim says. This does not guarantee that the claim is true in all contexts, only that this evidence supports it.

REFUTE (or "CONTRADICT"): The evidence goes against the claim. The abstract or sentences provided contradict what the claim states. Again, this does not mean the claim is universally false, only that it does not hold up against this particular evidence.

That is why the raw data initially contains other columns (such as evidence): a label is only meaningful relative to the evidence it was assigned against.

**For now we set this aside and work with the claims and labels only.**
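
Because a label describes a claim relative to specific evidence, the same claim could in principle carry different labels against different evidence. Below is a minimal sketch, on invented toy rows (not taken from either dataset), of how one might look for such claims before dropping the evidence column. The column names are assumed to match the CSV files used in this notebook.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names mirror the notebook.
toy = pd.DataFrame({
    "claim": ["claim A", "claim A", "claim B", "claim B"],
    "evidence": ["[1]", "[3]", "[2]", "[5]"],
    "label": ["SUPPORT", "CONTRADICT", "SUPPORT", "SUPPORT"],
})

# Number of distinct labels attached to each claim.
labels_per_claim = toy.groupby("claim")["label"].nunique()

# Claims whose label depends on which evidence is considered.
conflicting = labels_per_claim[labels_per_claim > 1].index.tolist()
print(conflicting)
```

If this list is non-empty on the real data, dropping the evidence column leaves the same claim with two different labels, which may matter for training.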

### Scifact

Look at a sample of the data:

In [2]:
import pandas as pd
# example csv
df = pd.read_csv("data/csv/fold_4_claims_train_4_filtered.csv")
df

Unnamed: 0,id,claim,evidence,label,cited_doc_ids
0,2,1 in 5 million in UK have abnormal PrP positiv...,[4],CONTRADICT,[13734012]
1,3,"1,000 genomes project enables mapping of genet...","[2, 5]",SUPPORT,[14717500]
2,3,"1,000 genomes project enables mapping of genet...",[7],SUPPORT,[14717500]
3,5,1/2000 in UK have abnormal PrP positivity.,[4],SUPPORT,[13734012]
4,9,32% of liver transplantation programs required...,[15],SUPPORT,[44265107]
...,...,...,...,...,...
1028,1404,siRNA knockdown of A20 slows tumor progression...,[6],SUPPORT,"[33370, 38355793]"
1029,1404,siRNA knockdown of A20 slows tumor progression...,[7],SUPPORT,"[33370, 38355793]"
1030,1404,siRNA knockdown of A20 slows tumor progression...,[8],SUPPORT,"[33370, 38355793]"
1031,1404,siRNA knockdown of A20 slows tumor progression...,[9],SUPPORT,"[33370, 38355793]"


**The cited_doc_ids and evidence columns may turn out to be important. However, we do not yet know how to use them, so we ignore them for now.**
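
If these columns are needed later, note that a CSV stores them as text. The sketch below assumes the cells are string representations of Python lists (as in the sample output above) and turns them back into lists with `ast.literal_eval`. The toy frame is hypothetical and only reuses values visible in the sample.

```python
import ast
import pandas as pd

# Toy frame with list-like strings, as they would come back from pd.read_csv.
toy = pd.DataFrame({
    "evidence": ["[2, 5]", "[7]"],
    "cited_doc_ids": ["[14717500]", "[33370, 38355793]"],
})

# literal_eval safely parses Python literals (lists, numbers, strings) without executing code.
for column in ["evidence", "cited_doc_ids"]:
    toy[column] = toy[column].apply(ast.literal_eval)

print(toy.loc[0, "evidence"])
```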

##### Create a combined CSV file (this code has not been verified yet and still needs to be checked)

In [13]:
import os
import json
import csv

# Folder containing the subfolders with .jsonl files
input_folder = "data/cross_validation"
# Output CSV file
output_file = "scifact_combined_filtered_claims.csv"

# Create or overwrite the output CSV file
with open(output_file, 'w', newline='') as csv_file:
    # Define CSV writer
    csv_writer = csv.writer(csv_file)
    # Write the header row
    csv_writer.writerow(["id", "claim", "evidence", "label", "cited_doc_ids"])
    
    # Iterate through all subfolders in the input folder
    for subfolder in os.listdir(input_folder):
        subfolder_path = os.path.join(input_folder, subfolder)
        if os.path.isdir(subfolder_path):  # Check if it's a folder
            # Iterate through all files in the subfolder
            for filename in os.listdir(subfolder_path):
                if filename.endswith(".jsonl"):
                    input_file = os.path.join(subfolder_path, filename)
                    
                    # Process each .jsonl file
                    with open(input_file, 'r') as jsonl_file:
                        for line in jsonl_file:
                            data = json.loads(line)
                            evidence = data.get("evidence", {})
                            
                            # Filter items with labels in the evidence
                            for doc_id, evidence_list in evidence.items():
                                for evidence_item in evidence_list:
                                    if "label" in evidence_item:
                                        csv_writer.writerow([
                                            data["id"],
                                            data["claim"],
                                            evidence_item["sentences"],
                                            evidence_item["label"],
                                            data["cited_doc_ids"]
                                        ])

print(f"Combined CSV file created at: {output_file}")

Combined CSV file created at: scifact_combined_filtered_claims.csv
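
Since the cell above still needs checking, one option is to pull the row-building logic into a small function and test it on hand-made records. This is only a sketch: `iter_scifact_rows` is a hypothetical helper name, and the sample records are invented to match the fields the cell reads (`id`, `claim`, `evidence`, `cited_doc_ids`).

```python
import json

def iter_scifact_rows(lines):
    """Yield one [id, claim, sentences, label, cited_doc_ids] row per labelled evidence item."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        data = json.loads(line)
        for evidence_list in data.get("evidence", {}).values():
            for item in evidence_list:
                if "label" in item:
                    yield [data["id"], data["claim"], item["sentences"],
                           item["label"], data["cited_doc_ids"]]

# Invented record with two evidence items for the same document.
sample = json.dumps({
    "id": 1,
    "claim": "toy claim",
    "evidence": {"100": [{"sentences": [0, 1], "label": "SUPPORT"},
                         {"sentences": [4], "label": "SUPPORT"}]},
    "cited_doc_ids": [100],
})
# Invented record without any evidence.
no_evidence = json.dumps({"id": 2, "claim": "unlabelled claim",
                          "evidence": {}, "cited_doc_ids": [200]})

rows = list(iter_scifact_rows([sample, no_evidence]))
print(rows)
```

The toy run shows two properties of the approach: a claim produces one row per evidence item, and a claim with empty evidence produces no row at all.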


In [3]:
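# There are many duplicates: the cell above writes one row per evidence item, so a claim
# with several evidence items is repeated (the same claim may also appear in several fold files).
# Keep only the claim and label columns here; the duplicates are removed in the next cell.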
all_data = pd.read_csv('scifact_combined_filtered_claims.csv')
all_data = all_data[['claim', 'label']]
all_data

Unnamed: 0,claim,label
0,A diminished ovarian reserve is a very strong ...,CONTRADICT
1,A diminished ovarian reserve is a very strong ...,CONTRADICT
2,A diminished ovarian reserve is a very strong ...,CONTRADICT
3,A diminished ovarian reserve is a very strong ...,CONTRADICT
4,A diminished ovarian reserve is a very strong ...,CONTRADICT
...,...,...
6470,siRNA knockdown of A20 accelerates tumor progr...,CONTRADICT
6471,siRNA knockdown of A20 slows tumor progression...,SUPPORT
6472,siRNA knockdown of A20 slows tumor progression...,SUPPORT
6473,siRNA knockdown of A20 slows tumor progression...,SUPPORT


In [4]:
all_data_filter = all_data.drop_duplicates()
all_data_filter.index = range(len(all_data_filter))
all_data_filter

Unnamed: 0,claim,label
0,A diminished ovarian reserve is a very strong ...,CONTRADICT
1,A high microerythrocyte count raises vulnerabi...,CONTRADICT
2,A mutation in HNF4A leads to an increased risk...,SUPPORT
3,ADAR1 binds to Dicer to cleave pre-miRNA.,SUPPORT
4,AMP-activated protein kinase (AMPK) activation...,SUPPORT
...,...,...
687,mcm 5 s 2 U is required for proper decoding of...,SUPPORT
688,miR-142-5P is a known regulator of raised body...,SUPPORT
689,miRNAs enforce homeostasis by suppressing low-...,SUPPORT
690,siRNA knockdown of A20 slows tumor progression...,SUPPORT


In [5]:
all_data_filter.to_csv('scifact_combined_FULL_filtered.csv', index=False)
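
Before discarding duplicates it can help to know how many rows are affected. A minimal sketch on invented toy data follows. `duplicated()` marks every repeat after the first occurrence, and `reset_index(drop=True)` is an equivalent way to renumber the rows.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "claim": ["claim A", "claim A", "claim A", "claim B"],
    "label": ["SUPPORT", "SUPPORT", "SUPPORT", "CONTRADICT"],
})

# Rows that repeat an earlier (claim, label) pair.
n_duplicates = int(toy.duplicated().sum())

# Drop them and renumber the index from 0.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```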

### Fever

The FEVER dataset has approximately the same structure, so we can use approximately the same code to convert it to CSV format.

In [32]:
# Input file paths
input_files = [
    "fever_train.jsonl",
    "fever_shared_task_dev.jsonl"
]

# Output CSV file path
output_file = "fever_combined.csv"

# Create or overwrite the output CSV file
with open(output_file, 'w', newline='') as csv_file:
    # Define CSV writer
    csv_writer = csv.writer(csv_file)
    # Write the header row
    csv_writer.writerow(["id", "verifiable", "label", "claim", "evidence"])
    
    # Process each input file
    for input_file in input_files:
        with open(input_file, 'r') as jsonl_file:
            for line in jsonl_file:
                data = json.loads(line)
                # Write the relevant fields to the CSV
                csv_writer.writerow([
                    data["id"],
                    data["verifiable"],
                    data["label"],
                    data["claim"],
                    data["evidence"]
                ])

print(f"Combined CSV file created at: {output_file}")

Combined CSV file created at: fever_combined.csv
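
As a possible alternative to the manual loop, pandas can read JSON Lines directly with `pd.read_json(..., lines=True)`. The sketch below uses two invented records held in memory, with the same five fields as above. Whether this behaves identically on the real files has not been checked here.

```python
import io
import pandas as pd

# Two invented JSONL records with the five fields used in the cell above.
jsonl_text = (
    '{"id": 1, "verifiable": "VERIFIABLE", "label": "SUPPORTS", '
    '"claim": "toy claim one", "evidence": [[[10, 20, "Some_Page", 0]]]}\n'
    '{"id": 2, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", '
    '"claim": "toy claim two", "evidence": [[[11, null, null, null]]]}\n'
)

# lines=True tells pandas that each line is a separate JSON object.
frame = pd.read_json(io.StringIO(jsonl_text), lines=True)
frame = frame[["id", "verifiable", "label", "claim", "evidence"]]
print(frame[["id", "label"]])
```

Several files could then be combined with `pd.concat` and written with `to_csv(index=False)`.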


In [6]:
fever_all = pd.read_csv('fever_combined.csv')
fever_all

Unnamed: 0,id,verifiable,label,claim,evidence
0,75397,VERIFIABLE,SUPPORTS,Nikolaj Coster-Waldau worked with the Fox Broa...,"[[[92206, 104971, 'Nikolaj_Coster-Waldau', 7],..."
1,150448,VERIFIABLE,SUPPORTS,Roman Atwood is a content creator.,"[[[174271, 187498, 'Roman_Atwood', 1]], [[1742..."
2,214861,VERIFIABLE,SUPPORTS,"History of art includes architecture, dance, s...","[[[255136, 254645, 'History_of_art', 2]]]"
3,156709,VERIFIABLE,REFUTES,Adrienne Bailon is an accountant.,"[[[180804, 193183, 'Adrienne_Bailon', 0]]]"
4,83235,NOT VERIFIABLE,NOT ENOUGH INFO,System of a Down briefly disbanded in limbo.,"[[[100277, None, None, None]]]"
...,...,...,...,...,...
165442,8538,VERIFIABLE,REFUTES,Hermit crabs are arachnids.,"[[[15450, 19262, 'Hermit_crab', 0], [15450, 19..."
165443,145641,VERIFIABLE,REFUTES,Michael Hutchence died on a boat.,"[[[168967, 182663, 'Michael_Hutchence', 15]]]"
165444,87517,VERIFIABLE,SUPPORTS,The Cyclades are located to the southeast of G...,"[[[104709, 118125, 'Cyclades', 0]]]"
165445,111816,NOT VERIFIABLE,NOT ENOUGH INFO,Theresa May worked the docks.,"[[[131223, None, None, None]]]"


**Again, we do not yet know what to do with the evidence column. The verifiable column does not seem important for now either, so we keep only the claims and the labels.**

**Note that there are now three label types: 'SUPPORTS', 'REFUTES', and 'NOT ENOUGH INFO'. We keep all of them.**
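
Should the evidence column become useful later, it can be parsed back from its string form. Judging from the output above, each entry appears to be a list of four values whose third element is a page title and whose fourth is a sentence index. That reading is an assumption. `evidence_pages` is a hypothetical helper, and the two inputs are copied from rows 3 and 4 of the output.

```python
import ast

def evidence_pages(evidence_str):
    """Return the sorted page titles referenced in one evidence cell (None entries skipped)."""
    evidence_sets = ast.literal_eval(evidence_str)
    pages = {
        item[2]
        for evidence_set in evidence_sets
        for item in evidence_set
        if item[2] is not None
    }
    return sorted(pages)

verifiable = "[[[180804, 193183, 'Adrienne_Bailon', 0]]]"
not_verifiable = "[[[100277, None, None, None]]]"

print(evidence_pages(verifiable))
print(evidence_pages(not_verifiable))
```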

In [8]:
fever_all['label'].unique()

array(['SUPPORTS', 'REFUTES', 'NOT ENOUGH INFO'], dtype=object)

In [11]:
# Again there are duplicates; remove them with drop_duplicates() (not sure what else we can do)
fever_all = fever_all[['claim', 'label']]
fever_all = fever_all.drop_duplicates()
fever_all.index = range(len(fever_all))
fever_all

Unnamed: 0,claim,label
0,Nikolaj Coster-Waldau worked with the Fox Broa...,SUPPORTS
1,Roman Atwood is a content creator.,SUPPORTS
2,"History of art includes architecture, dance, s...",SUPPORTS
3,Adrienne Bailon is an accountant.,REFUTES
4,System of a Down briefly disbanded in limbo.,NOT ENOUGH INFO
...,...,...
155951,Hermit crabs are arachnids.,REFUTES
155952,Michael Hutchence died on a boat.,REFUTES
155953,The Cyclades are located to the southeast of G...,SUPPORTS
155954,Theresa May worked the docks.,NOT ENOUGH INFO


In [12]:
fever_all.to_csv('fever_combined_final.csv', index=False)

### Combine

Since the SciFact and FEVER dataframes now have the same structure (a claim column and a label column), we can simply concatenate them. First we rename the labels in the SciFact dataframe (SUPPORT to SUPPORTS, CONTRADICT to REFUTES) so that they match the FEVER dataframe.

In [13]:
all_data_filter

Unnamed: 0,claim,label
0,A diminished ovarian reserve is a very strong ...,CONTRADICT
1,A high microerythrocyte count raises vulnerabi...,CONTRADICT
2,A mutation in HNF4A leads to an increased risk...,SUPPORT
3,ADAR1 binds to Dicer to cleave pre-miRNA.,SUPPORT
4,AMP-activated protein kinase (AMPK) activation...,SUPPORT
...,...,...
687,mcm 5 s 2 U is required for proper decoding of...,SUPPORT
688,miR-142-5P is a known regulator of raised body...,SUPPORT
689,miRNAs enforce homeostasis by suppressing low-...,SUPPORT
690,siRNA knockdown of A20 slows tumor progression...,SUPPORT


In [15]:
# Work on an explicit copy so the assignment does not trigger pandas' SettingWithCopyWarning
all_data_filter = all_data_filter.copy()
# Rename the SciFact labels to match the FEVER label names
all_data_filter['label'] = all_data_filter['label'].replace({'SUPPORT': 'SUPPORTS', 'CONTRADICT': 'REFUTES'})
all_data_filter


Unnamed: 0,claim,label
0,A diminished ovarian reserve is a very strong ...,REFUTES
1,A high microerythrocyte count raises vulnerabi...,REFUTES
2,A mutation in HNF4A leads to an increased risk...,SUPPORTS
3,ADAR1 binds to Dicer to cleave pre-miRNA.,SUPPORTS
4,AMP-activated protein kinase (AMPK) activation...,SUPPORTS
...,...,...
687,mcm 5 s 2 U is required for proper decoding of...,SUPPORTS
688,miR-142-5P is a known regulator of raised body...,SUPPORTS
689,miRNAs enforce homeostasis by suppressing low-...,SUPPORTS
690,siRNA knockdown of A20 slows tumor progression...,SUPPORTS


In [14]:
fever_all

Unnamed: 0,claim,label
0,Nikolaj Coster-Waldau worked with the Fox Broa...,SUPPORTS
1,Roman Atwood is a content creator.,SUPPORTS
2,"History of art includes architecture, dance, s...",SUPPORTS
3,Adrienne Bailon is an accountant.,REFUTES
4,System of a Down briefly disbanded in limbo.,NOT ENOUGH INFO
...,...,...
155951,Hermit crabs are arachnids.,REFUTES
155952,Michael Hutchence died on a boat.,REFUTES
155953,The Cyclades are located to the southeast of G...,SUPPORTS
155954,Theresa May worked the docks.,NOT ENOUGH INFO


In [16]:
# ignore_index=True already gives the result a fresh 0..n-1 index
final_dataframe = pd.concat([all_data_filter, fever_all], ignore_index=True)
final_dataframe

Unnamed: 0,claim,label
0,A diminished ovarian reserve is a very strong ...,REFUTES
1,A high microerythrocyte count raises vulnerabi...,REFUTES
2,A mutation in HNF4A leads to an increased risk...,SUPPORTS
3,ADAR1 binds to Dicer to cleave pre-miRNA.,SUPPORTS
4,AMP-activated protein kinase (AMPK) activation...,SUPPORTS
...,...,...
156643,Hermit crabs are arachnids.,REFUTES
156644,Michael Hutchence died on a boat.,REFUTES
156645,The Cyclades are located to the southeast of G...,SUPPORTS
156646,Theresa May worked the docks.,NOT ENOUGH INFO


In [17]:
len(final_dataframe) == len(all_data_filter) + len(fever_all)

True
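
The length check can be complemented by a check on the label values themselves, for example to catch a SciFact label that was not renamed. A minimal sketch on toy frames follows. `check_labels` and `EXPECTED_LABELS` are hypothetical names introduced here.

```python
import pandas as pd

EXPECTED_LABELS = {"SUPPORTS", "REFUTES", "NOT ENOUGH INFO"}

def check_labels(frame):
    """Return the set of label values that are not expected (empty if all are fine)."""
    return set(frame["label"].dropna().unique()) - EXPECTED_LABELS

# Toy frames invented for illustration.
toy_ok = pd.DataFrame({"claim": ["a", "b"], "label": ["SUPPORTS", "REFUTES"]})
toy_bad = pd.DataFrame({"claim": ["c"], "label": ["CONTRADICT"]})

print(check_labels(toy_ok))
print(check_labels(toy_bad))
```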

In [18]:
final_dataframe.label.value_counts()

SUPPORTS           80980
NOT ENOUGH INFO    40489
REFUTES            35179
Name: label, dtype: int64

In [19]:
final_dataframe.to_csv('final_dataframe_fine_tuning.csv', index=False)