This script takes the final gaza-israel and ukraine-russia datasets and aggregates the four labels ("True", "False", "Misleading", "NEI") into two labels ("True", "False") for a binary classification task. 

To keep a clear overview and to avoid confusion with the original label aggregation step ("claims_postprocessing_label_aggregation.ipynb"), I created a separate folder and a separate script for this. The labels are aggregated into two classes because the evaluation of the first DEFAME model (model1_run_1) revealed inter-coder reliability problems with the ground truth labels *within* and *between* the fact-checking websites used (see the scripts in evaluation/gt_label_check for more details). Claims with the sub-labels "False", "Misleading" and "NEI" were not labeled consistently and reliably. Many claims with the ground truth label "False" (and "NEI") should have been labeled "Misleading". In addition, for many claims, especially in the ukraine-russia datasets, the distinction between the labels "False" and "NEI" was unclear. This occurred mainly with claims stating that a politician or public figure had made a statement that was in fact never made. Some websites labeled these claims as "False", while others used labels such as "no evidence" or "unsubstantiated", which I aggregated into the "NEI" label.

To resolve the reliability problems, several human experts (professional fact-checkers) would ideally be hired to label all claims again. Disagreements between coders could then be resolved by majority voting. However, I have neither access to human experts nor the resources to pay them.
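
As a rough illustration of the majority-voting idea, the sketch below picks the most frequent label per claim and flags ties for manual adjudication. The function name `majority_label` and the example annotations are hypothetical and are not part of this project.

```python
from collections import Counter

def majority_label(annotations):
    """Return the most frequent label, or None if the list is empty or the top count is tied."""
    counts = Counter(annotations).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: would need adjudication by an additional expert
    return counts[0][0]

# Hypothetical annotations from three fact-checkers for two claims (invented for illustration)
claim_a = ["False", "Misleading", "Misleading"]
claim_b = ["False", "NEI", "Misleading"]

print(majority_label(claim_a))
print(majority_label(claim_b))
```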

The reliability problem is therefore resolved by aggregating the sub-labels "False", "Misleading" and "NEI" into a single "False" label. This decision seems reasonable for two reasons. First, the fact-checking websites did not disagree on the distinction between true and false claims. Second, for the real-world application of automatic multimodal fact-checking systems, what matters most is whether a claim is true or false; whether a claim is factually false, misleading, or unsupported by evidence is likely to be less relevant.

Future work could address the reliability problem between the "False", "Misleading" and "NEI" sub-labels. This is discussed in the "Discussion/Conclusion" section of the thesis.

In [1]:
## 1) Load libraries

import pandas as pd
import numpy as np 

In [2]:
## 2) Load datasets 

df_gaza_israel_final = pd.read_csv("../gaza_israel/Combined_dataset/gaza_israel_dataset_combined_010724_300425_final.csv", index_col=0)
print(len(df_gaza_israel_final))
print(df_gaza_israel_final.columns)


df_ukraine_russia_final = pd.read_csv("../ukraine_russia/Combined_dataset/ukraine_russia_dataset_combined_010724_300425_final.csv", index_col=0)
print(len(df_ukraine_russia_final))
print(df_ukraine_russia_final.columns)

100
Index(['id', 'Website', 'Article_URL', 'Headline', 'Claim_Date', 'Review_Date',
       'Query/Keyword', 'Original_Claim_Website', 'Original_Claim_Only',
       'Claim', 'Image_URL', 'Image_Path', 'Label_Website', 'Label',
       'Context/Label_Explanation', 'Text_Only_Claim', 'Normal_Image',
       'AI_Generated_Image', 'Altered_Image', 'Data_Collection_Type'],
      dtype='object')
79
Index(['id', 'Website', 'Article_URL', 'Headline', 'Claim_Date', 'Review_Date',
       'Query/Keyword', 'Original_Claim_Website', 'Original_Claim_Only',
       'Claim', 'Image_URL', 'Image_Path', 'Label_Website', 'Label',
       'Context/Label_Explanation', 'Text_Only_Claim', 'Normal_Image',
       'AI_Generated_Image', 'Altered_Image', 'Data_Collection_Type'],
      dtype='object')


In [4]:
## Check column data types

print(df_gaza_israel_final.dtypes)
print(df_ukraine_russia_final.dtypes)

id                            int64
Website                      object
Article_URL                  object
Headline                     object
Claim_Date                   object
Review_Date                  object
Query/Keyword                object
Original_Claim_Website       object
Original_Claim_Only          object
Claim                        object
Image_URL                    object
Image_Path                   object
Label_Website                object
Label                        object
Context/Label_Explanation    object
Text_Only_Claim                bool
Normal_Image                   bool
AI_Generated_Image             bool
Altered_Image                  bool
Data_Collection_Type         object
dtype: object
id                            int64
Website                      object
Article_URL                  object
Headline                     object
Claim_Date                   object
Review_Date                  object
Query/Keyword                object
Original_Claim

In [5]:
## 3) Look at original (final) label distribution

print(df_gaza_israel_final["Label"].value_counts())
print(df_ukraine_russia_final["Label"].value_counts())


Label
False         59
Misleading    31
True           6
NEI            4
Name: count, dtype: int64
Label
FALSE         53
Misleading    15
TRUE           9
NEI            2
Name: count, dtype: int64
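
The two outputs above use different casing ("False" vs. "FALSE"), and the introduction notes that websites labeled similar claims differently. One way to inspect both at once is a per-website cross-tabulation on case-normalised labels. The sketch below uses invented toy data; the website names and the `Label_norm` column are illustrative assumptions, not values from the real datasets.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Website": ["Site A", "Site A", "Site B", "Site B", "Site B"],
    "Label": ["False", "FALSE", "NEI", "Misleading", "false"],
})

# Normalise whitespace and casing so "FALSE" and "False" are counted together
toy["Label_norm"] = toy["Label"].str.strip().str.lower()

# Rows: websites, columns: normalised labels, cells: number of claims
per_site = pd.crosstab(toy["Website"], toy["Label_norm"])
print(per_site)
```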


In [6]:
## 4) Aggregate labels into binary labels 

# Create the new binary label column 
df_gaza_israel_final['Label_Binary'] = np.where(
    df_gaza_israel_final['Label'].str.lower().isin(['false', 'misleading', 'nei']),  ## account for the fact that the labels are sometimes in Capital letters (after using Google Sheets)
    'False', 
    'True'  # remaining claims are true
)
# Check that the re-labeling was successful
print(df_gaza_israel_final['Label_Binary'].value_counts())



# Create the new binary label column 
df_ukraine_russia_final['Label_Binary'] = np.where(
    df_ukraine_russia_final['Label'].str.lower().isin(['false', 'misleading', 'nei']),  ## account for the fact that the labels are sometimes in Capital letters (after using Google Sheets)
    'False', 
    'True'  # remaining claims are true
)

print(df_ukraine_russia_final['Label_Binary'].value_counts())


Label_Binary
False    94
True      6
Name: count, dtype: int64
Label_Binary
False    70
True      9
Name: count, dtype: int64


In [7]:
## 5) Check the data type of the new column to make sure it is a string and not a boolean value

print(df_gaza_israel_final.dtypes)
print(df_ukraine_russia_final.dtypes)

id                            int64
Website                      object
Article_URL                  object
Headline                     object
Claim_Date                   object
Review_Date                  object
Query/Keyword                object
Original_Claim_Website       object
Original_Claim_Only          object
Claim                        object
Image_URL                    object
Image_Path                   object
Label_Website                object
Label                        object
Context/Label_Explanation    object
Text_Only_Claim                bool
Normal_Image                   bool
AI_Generated_Image             bool
Altered_Image                  bool
Data_Collection_Type         object
Label_Binary                 object
dtype: object
id                            int64
Website                      object
Article_URL                  object
Headline                     object
Claim_Date                   object
Review_Date                  object
Query/Keyword 

In [8]:
## 5) Reorder the columns


df_gaza_israel_final.head()

# reorder columns

new_column_order = [
    'id', 
    'Website', 
    'Article_URL', 
    'Headline', 
    'Claim_Date', 
    'Review_Date', 
    'Query/Keyword',
    'Original_Claim_Website',  
    'Original_Claim_Only',
    'Claim',
    'Image_URL',
    'Image_Path',
    'Label_Website',
    'Label',
    'Label_Binary',   
    'Context/Label_Explanation', 
    'Text_Only_Claim', 
    'Normal_Image', 
    'AI_Generated_Image', 
    'Altered_Image',
    'Data_Collection_Type'
]

df_gaza_israel_final_binary = df_gaza_israel_final[new_column_order]

df_ukraine_russia_final_binary = df_ukraine_russia_final[new_column_order]


df_gaza_israel_final_binary.head()
df_ukraine_russia_final_binary.head()


Unnamed: 0,id,Website,Article_URL,Headline,Claim_Date,Review_Date,Query/Keyword,Original_Claim_Website,Original_Claim_Only,Claim,...,Image_Path,Label_Website,Label,Label_Binary,Context/Label_Explanation,Text_Only_Claim,Normal_Image,AI_Generated_Image,Altered_Image,Data_Collection_Type
0,0,AFP Factcheck,https://factcheck.afp.com/doc.afp.com.372Y6CV,Fake newspaper cover on Ukrainian soldiers in ...,2025-03-15 0:00:00,2025-03-19 0:00:00,"""War in Ukraine""","""The Kursk expedition was a disaster and a com...","""70,000 Ukrainian soldiers in the Kursk region...",This image shows a screenshot of an authentic ...,...,images/ukraine_russia/0.jpg,altered,False,False,But the supposed Hull Daily Mail headline blas...,False,False,False,True,Manual
1,1,AFP Factcheck,https://factcheck.afp.com/doc.afp.com.36YR9KZ,"No, Zelensky hasn't bought Eagle's Nest, it is...",2025-02-18 0:00:00,2025-02-27 0:00:00,"""War in Ukraine""","According to the latest claims, Zelensky alleg...","According to the latest claims, Zelensky alleg...",Ukrainian President Volodymyr Zelenskyy purcha...,...,,FALSE,False,False,The Eagle's Nest is in the property of the sta...,True,False,False,False,Manual
2,2,AFP Factcheck,https://factcheck.afp.com/doc.afp.com.36YC3DG,Claims that Ukraine banned Truth Social are false,2025-02-20 0:00:00,2025-02-21 0:00:00,"""War in Ukraine""","""BREAKING: Zelensky blocks access to President...","""BREAKING: Zelensky blocks access to President...",In February 2025 Ukrainian President Volodymyr...,...,,FALSE,False,False,A spokesperson for Trump Media and Technology ...,True,False,False,False,Manual
3,3,AFP Factcheck,https://factcheck.afp.com/doc.afp.com.36P98ZW,Fake 'apocalypse' cover of The Economist circu...,2024-11-18 0:00:00,2024-12-03 0:00:00,"""War in Ukraine""","""APOCALYPSE: Allowing missile strikes deep int...","""APOCALYPSE: Allowing missile strikes deep int...",This image shows a screenshot of an authentic ...,...,images/ukraine_russia/3.jpg,FALSE,False,False,"However, The Economist does list Telegram amon...",False,False,False,True,Manual
4,4,AFP Factcheck,https://factcheck.afp.com/doc.afp.com.36MM6QY,Old photo misrepresented as coffins of 'Britis...,2024-11-03 0:00:00,2024-11-27 0:00:00,"""War in Ukraine""","""Recently, 18 members of the British special f...","""18 British Special Forces were killed in Ukra...",This image shows the coffins of 18 British Spe...,...,images/ukraine_russia/4.jpg,FALSE,False,False,A reverse image search and keyword searches on...,False,True,False,False,Manual
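
As an alternative to listing every column by hand, the new column can be moved directly behind "Label" with `DataFrame.pop` and `DataFrame.insert`. This is only a sketch on an invented toy frame; unlike the cell above, it modifies the frame in place, so it would need to be applied to a copy if the original frame should stay unchanged.

```python
import pandas as pd

# Toy frame with the new column appended at the end (invented for illustration)
toy = pd.DataFrame({
    "id": [0, 1],
    "Label": ["False", "TRUE"],
    "Context/Label_Explanation": ["...", "..."],
    "Label_Binary": ["False", "True"],
})

# Remove the column, then re-insert it right after "Label"
binary = toy.pop("Label_Binary")
toy.insert(toy.columns.get_loc("Label") + 1, "Label_Binary", binary)

print(list(toy.columns))
```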


In [10]:
## 5) Check the data type of the new column to make sure it is a string and not a boolean value

print(df_gaza_israel_final.dtypes)
print(df_ukraine_russia_final.dtypes)

id                            int64
Website                      object
Article_URL                  object
Headline                     object
Claim_Date                   object
Review_Date                  object
Query/Keyword                object
Original_Claim_Website       object
Original_Claim_Only          object
Claim                        object
Image_URL                    object
Image_Path                   object
Label_Website                object
Label                        object
Context/Label_Explanation    object
Text_Only_Claim                bool
Normal_Image                   bool
AI_Generated_Image             bool
Altered_Image                  bool
Data_Collection_Type         object
Label_Binary                 object
dtype: object
id                            int64
Website                      object
Article_URL                  object
Headline                     object
Claim_Date                   object
Review_Date                  object
Query/Keyword 

In [9]:
## 6) Save the new dataframes in the respective gaza_israel or ukraine_russia data folders

df_gaza_israel_final_binary.to_csv("../gaza_israel/Combined_dataset/gaza_israel_dataset_combined_010724_300425_final_binary.csv")
df_ukraine_russia_final_binary.to_csv("../ukraine_russia/Combined_dataset/ukraine_russia_dataset_combined_010724_300425_final_binary.csv")
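
One caveat when loading the saved files again: with default settings, `pd.read_csv` infers a column that contains only "True" and "False" as boolean, so the string dtype checked above does not survive the round trip by itself. The sketch below demonstrates this on an invented in-memory toy frame and shows one option, passing `dtype` for the column, to keep the labels as strings.

```python
import io
import pandas as pd

# Toy frame standing in for the saved dataset (invented for illustration)
toy = pd.DataFrame({"id": [0, 1], "Label_Binary": ["False", "True"]})
csv_text = toy.to_csv()

# Default type inference turns the "True"/"False" strings into booleans
inferred = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Forcing the dtype keeps the labels as text
as_text = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={"Label_Binary": str})

print(inferred["Label_Binary"].dtype)
print(as_text["Label_Binary"].tolist())
```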
