# Dataset Preprocessing for Evaluating Claim Decomp Models

This notebook preprocesses the reconstructed dataset from `chen-etal-2022-generating` (University of Texas) for evaluating our fine-tuned claim decomposition models.

## Install Required Libraries

In [1]:
!pip install tqdm
!pip install pandas
!pip install beautifulsoup4
!pip install argparse
!pip install requests
!pip install allennlp==2.7
!pip install torch==1.9.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting argparse
  Downloading argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting allennlp==2.7
  Downloading allennlp-2.7.0-py3-none-any.whl (738 kB)
Collecting tensorboardX>=1.2
  Downloading tensorboardX-2.6-py2.py3-none-any.whl (114 kB)
Collecting dill
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting jsonnet>=0.10.0
  Downloading jsonnet-0.19.1.tar.gz (593 kB)

# Reconstruct ClaimDecomp Dataset

Because the initial dataset downloaded from ClaimDecomp has incomplete fields (for example, missing claims and persons), the data has to be reconstructed by processing the source article of each claim.
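Before reconstructing, it can help to measure how incomplete the raw records are. The following is a minimal sketch, not part of `reconstruct_dataset.py`: the helper `count_missing_fields` and the field list are assumptions for illustration, with field names taken from the columns that appear later in this notebook.

```python
import json

# Hypothetical field list, based on the columns shown later in this notebook
REQUIRED_FIELDS = ["claim", "person", "venue", "justification", "full_article"]

def count_missing_fields(lines, required=REQUIRED_FIELDS):
    """Count, per field, how many JSONL records lack a non-empty value."""
    missing = {field: 0 for field in required}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for field in required:
            # Treat absent keys, null values, and empty strings as missing
            if not record.get(field):
                missing[field] += 1
    return missing

# Toy records standing in for lines of the raw train.jsonl
sample_lines = [
    '{"claim": "A claim", "person": "", "venue": "a TV ad"}',
    '{"claim": null, "person": "Someone", "venue": "a speech"}',
]
missing_counts = count_missing_fields(sample_lines)
print(missing_counts)
```

To run it on a real file, pass an open file handle instead of `sample_lines`, since iterating a file yields one line at a time.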

## Information on train.jsonl

Removed Invalid Samples: 
* **717-719**

Skipped Samples: 
1. 241
2. 513
3. 579
4. 586

Total Samples: 
**793**



In [None]:
!python3 reconstruct_dataset.py --input_path ./claim_decomp_raw/train.jsonl --output ./Reconstructed/train.jsonl

## Information on dev.jsonl

Removed Invalid Samples: **34-36**

Total Samples: **197**

In [None]:
!python3 reconstruct_dataset.py --input_path ./claim_decomp_raw/dev.jsonl --output ./Reconstructed/dev.jsonl

Size of Dataframe: 197

197it [05:15,  1.60s/it]


## Information on test.jsonl

Total Samples: **200**

In [None]:
!python3 reconstruct_dataset.py --input_path ./claim_decomp_raw/test.jsonl --output ./Reconstructed/test.jsonl

Size of Dataframe: 200

200it [06:26,  1.93s/it]


# Dataframe Creation

The following code builds a pandas DataFrame from the JSONL files that were reconstructed from the original ClaimDecomp dataset.

In [27]:
import pandas as pd
import numpy as np
import json
from pandas import json_normalize

path = "./reconstructed_data/test.jsonl"
with open(path, 'r') as jsonl_file:
    jsonl_list = list(jsonl_file)

#listOfRows is a list of dict, each containing data for each row in the df with keys = column name
listOfRows = []
for entry in jsonl_list:
    loadedEntry = json.loads(entry)
    listOfRows.append(loadedEntry)

df = pd.DataFrame(listOfRows)
df.head()


Unnamed: 0,example_id,label,url,annotations,claim,person,venue,justification,full_article
0,5088301214328986128,half-true,https://www.politifact.com/factchecks/2018/oct...,"[{'questions': ['Has Barr received $36,550 fro...","Says Kentucky Rep. Andy Barr ""would let shady ...",With Honor,"stated on September 10, 2018 in a TV ad:","With Honor says Barr ""would let shady payday l...","The ""cross-partisan"" group With Honor, which f..."
1,8487672735991906749,barely-true,https://www.politifact.com/factchecks/2018/aug...,[{'questions': ['Did Nicholson make $1 million...,"""New reports show Kevin Nicholson made over $1...",Tammy Baldwin,"stated on July 31, 2018 in a TV ad:","Baldwin says: ""New reports show Kevin Nicholso...",U.S. Sen. Tammy Baldwin used a TV ad to attack...
2,3277676096708619167,pants-fire,https://www.politifact.com/factchecks/2018/apr...,[{'questions': ['Will people be taken into loc...,Says that unless the recipient called back abo...,Anonymous Caller,"stated on April 10, 2018 in an anonymous phone...",An automated phone message fielded in Austin s...,A phone message from a New York area code abou...
3,6252140003382293541,half-true,https://www.politifact.com/factchecks/2016/jul...,"[{'questions': ['Was Donald Trump ""Excited"" fo...","""Donald Trump said he was excited for the 2008...",Elizabeth Warren,"stated on July 25, 2016 in a speech at the Dem...","Warren said, ""Donald Trump said he was excited...",Sen. Elizabeth Warren's speech at the Democrat...
4,-2194277234040896494,barely-true,https://www.politifact.com/factchecks/2016/jun...,[{'questions': ['Has Obama cited climate chang...,"""The president has said the national security ...",Paul Babeu,"stated on May 26, 2016 in an interview on Fox ...","Babeu said, ""The president has said the nation...","Paul Babeu, the Republican Sheriff of Pinal Co..."
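The cell above parses the JSONL file line by line. As an alternative sketch, `pandas.read_json` with `lines=True` loads the same kind of file in one call. The toy file and its contents below are made up for illustration; pandas infers column dtypes on load, so it is worth spot-checking columns such as `example_id` when applying this to the real data.

```python
import json
import os
import tempfile

import pandas as pd

# Toy rows standing in for ./reconstructed_data/test.jsonl
rows = [
    {"claim": "Claim A", "label": "half-true",
     "annotations": [{"questions": ["Q1?"]}]},
    {"claim": "Claim B", "label": "false",
     "annotations": [{"questions": ["Q2?", "Q3?"]}]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.jsonl")
    with open(toy_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # lines=True tells pandas to parse one JSON object per line
    toy_df = pd.read_json(toy_path, lines=True)

print(toy_df.shape)
print(toy_df.columns.tolist())
```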


In [28]:
#Select only the claim and label columns; .copy() avoids SettingWithCopyWarning
dfFiltered = df.loc[ : ,['claim', 'label']].copy()

#Combine all subquestions for each claim into a list
allQuestions = []
for annotationList in df['annotations']:
  questions = []
  for annotation in annotationList:
    questions.extend(annotation['questions'])
  allQuestions.append(questions)

# Name the column to specify that these questions are from human annotators
dfFiltered['annotated-questions'] = pd.Series(allQuestions, index=df.index)
dfFiltered.head()


Unnamed: 0,claim,label,annotated-questions
0,"Says Kentucky Rep. Andy Barr ""would let shady ...",half-true,"[Has Barr received $36,550 from payday lenders..."
1,"""New reports show Kevin Nicholson made over $1...",barely-true,[Did Nicholson make $1 million consulting for ...
2,Says that unless the recipient called back abo...,pants-fire,[Will people be taken into local police custod...
3,"""Donald Trump said he was excited for the 2008...",half-true,"[Was Donald Trump ""Excited"" for the 2008 housi..."
4,"""The president has said the national security ...",barely-true,[Has Obama cited climate change as the top nat...
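Each entry of the `annotations` column is a list with one dictionary per annotator, and each dictionary holds a `questions` list. The sketch below shows the flattening step in isolation on toy data; `flatten_questions` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
from itertools import chain

def flatten_questions(annotations):
    """Merge the 'questions' lists of all annotators into one flat list."""
    return list(chain.from_iterable(a["questions"] for a in annotations))

# Toy annotations with two annotators
toy_annotations = [
    {"questions": ["Did X happen?", "Was Y involved?"]},
    {"questions": ["Is Z accurate?"]},
]
flat_questions = flatten_questions(toy_annotations)
print(flat_questions)
```

Applied to the DataFrame, this would correspond to `df['annotations'].apply(flatten_questions)`.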


## Further clean data by removing breakline tokens

In [29]:
# Number each question, remove breaklines \n, and join the questions into one string
def removeBreaklines(questionSeries):
  def formatQuestions(questions):
    # Add qsn numbering for openAI dataset
    numbered = [str(i+1) + ". " + qsn.replace('\n','') for i, qsn in enumerate(questions)]
    # Combine questions into a single string
    return " \n".join(numbered)
  # Return a new Series instead of mutating a slice of the DataFrame in place
  return questionSeries.apply(formatQuestions)

print("Before breakline removal:\n")
print(dfFiltered.loc[100, "annotated-questions"])

dfFiltered['annotated-questions'] = removeBreaklines(dfFiltered['annotated-questions'])

print("\nAfter breakline removal:\n")
dfFiltered.loc[100, "annotated-questions"]

Before breakline removal:

["Can sending more girls to school boost a country's GDP?", 'Do experts agree that putting more girls in school leads to economic growth?', 'Can educating more women possibly lead to a stronger economy?', 'Is there evidence that supports a causal link between increased education of women and economic growth?', 'Is education the only barrier to employment for women?']

After breakline removal:



"1. Can sending more girls to school boost a country's GDP? \n2. Do experts agree that putting more girls in school leads to economic growth? \n3. Can educating more women possibly lead to a stronger economy? \n4. Is there evidence that supports a causal link between increased education of women and economic growth? \n5. Is education the only barrier to employment for women?"

## Preview the state of our Dataframe

In [30]:
dfFiltered.head()

Unnamed: 0,claim,label,annotated-questions
0,"Says Kentucky Rep. Andy Barr ""would let shady ...",half-true,"1. Has Barr received $36,550 from payday lende..."
1,"""New reports show Kevin Nicholson made over $1...",barely-true,1. Did Nicholson make $1 million consulting fo...
2,Says that unless the recipient called back abo...,pants-fire,1. Will people be taken into local police cust...
3,"""Donald Trump said he was excited for the 2008...",half-true,"1. Was Donald Trump ""Excited"" for the 2008 hou..."
4,"""The president has said the national security ...",barely-true,1. Has Obama cited climate change as the top n...
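The preview shows each claim's questions numbered and joined into a single string. A standalone sketch of that formatting step on toy input follows; `format_questions` is a hypothetical name used only for this illustration.

```python
def format_questions(questions):
    """Number each question, drop embedded line breaks, and join the results."""
    numbered = []
    for number, question in enumerate(questions, start=1):
        cleaned = question.replace("\n", "")
        numbered.append(f"{number}. {cleaned}")
    # Questions are separated by a space followed by a newline
    return " \n".join(numbered)

formatted = format_questions(["Is A true?\n", "Did B\n happen?"])
print(repr(formatted))
```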


## Reclassify Labels into Two Categories

For easier evaluation, the six original labels are collapsed into two categories, True (1) and False (0).

**Original labels mapped to True:**

1. half-true
2. mostly-true
3. true

**Original labels mapped to False:**

1. pants-fire
2. false
3. barely-true

In [31]:
def convert_labels(df):
    # map labels to 1 or 0 according to the above classification
    label_equivalent = {
        'pants-fire': 0,
        'false': 0,
        'barely-true': 0,
        'half-true': 1,
        'mostly-true': 1,
        'true': 1
    }
    
    # convert the 'label' column values to 1 or 0
    df['label'] = df['label'].map(label_equivalent)
    return df

dfFiltered = convert_labels(dfFiltered)
dfFiltered.head()

Unnamed: 0,claim,label,annotated-questions
0,"Says Kentucky Rep. Andy Barr ""would let shady ...",1,"1. Has Barr received $36,550 from payday lende..."
1,"""New reports show Kevin Nicholson made over $1...",0,1. Did Nicholson make $1 million consulting fo...
2,Says that unless the recipient called back abo...,0,1. Will people be taken into local police cust...
3,"""Donald Trump said he was excited for the 2008...",1,"1. Was Donald Trump ""Excited"" for the 2008 hou..."
4,"""The president has said the national security ...",0,1. Has Obama cited climate change as the top n...
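`Series.map` silently yields `NaN` for any label that is missing from the dictionary. The sketch below is a stricter variant that raises instead; `convert_labels_strict` and `LABEL_TO_BINARY` are hypothetical names, and the toy labels are made up.

```python
import pandas as pd

LABEL_TO_BINARY = {
    "pants-fire": 0,
    "false": 0,
    "barely-true": 0,
    "half-true": 1,
    "mostly-true": 1,
    "true": 1,
}

def convert_labels_strict(labels):
    """Map labels to 0/1 and raise if any label has no mapping."""
    mapped = labels.map(LABEL_TO_BINARY)
    unmapped = labels[mapped.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels: {list(unmapped)}")
    return mapped.astype(int)

toy_labels = pd.Series(["half-true", "pants-fire", "true"])
binary_labels = convert_labels_strict(toy_labels)
print(binary_labels.tolist())
```

Failing loudly here is a design choice: an unexpected label would otherwise end up as an empty cell in the exported CSV.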


## Exporting Dataframe

In [32]:
fileName = "preprocessed_data/test.csv"
dfFiltered.to_csv(fileName, index=False, encoding='utf-8-sig', header=True)
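`to_csv` does not create missing directories, so the export fails if the output folder does not exist yet. The following sketch, using a temporary directory and a made-up toy DataFrame, creates the folder first and reads the file back to confirm that the multi-line question strings survive the round trip.

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({
    "claim": ["Claim A", "Claim B"],
    "label": [1, 0],
    "annotated-questions": ["1. Q1? \n2. Q2?", "1. Q3?"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "preprocessed_data", "toy.csv")
    # Create the output directory if it is missing
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    toy_df.to_csv(out_path, index=False, encoding="utf-8-sig")
    # utf-8-sig strips the byte order mark again on load
    reloaded = pd.read_csv(out_path, encoding="utf-8-sig")

questions_match = (
    reloaded["annotated-questions"].tolist()
    == toy_df["annotated-questions"].tolist()
)
print(questions_match)
```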

# Abstracting Dataframe Creation

The above cells are abstracted into the `preprocessData` function.

The following cell is a condensed version suitable for extraction into a standalone Python file for dataframe creation.

In [36]:
import pandas as pd
import json
from tqdm import tqdm

def convert_labels(df):
    # map labels to 1 or 0 according to the above classification
    label_equivalent = {
        'pants-fire': 0,
        'false': 0,
        'barely-true': 0,
        'half-true': 1,
        'mostly-true': 1,
        'true': 1
    }
    
    # convert the 'label' column values to 1 or 0
    df['label'] = df['label'].map(label_equivalent)
    return df

# Number each question, remove breaklines \n, and join the questions into one string
def removeBreaklines(questionSeries):
  def formatQuestions(questions):
    # Add qsn numbering for openAI dataset
    numbered = [str(i+1) + ". " + qsn.replace('\n','') for i, qsn in enumerate(questions)]
    # Combine questions into a single string
    return " \n".join(numbered)
  # Return a new Series instead of mutating a slice of the DataFrame in place
  return questionSeries.apply(formatQuestions)

# Selects clean data and exports it to destPath
def preprocessData(sourcePath, destPath):
  
  #Load jsonl file into a list of dict, each dict containing data for a row in the df with keys = column name
  with open(sourcePath, 'r') as jsonl_file:
      listOfRows = [json.loads(entry) for entry in jsonl_file]

  #Convert list of dict into a df
  df = pd.DataFrame(listOfRows)

  #Select only the claim and label columns; .copy() avoids SettingWithCopyWarning
  dfFiltered = df.loc[ : ,['claim', 'label']].copy()

  #Combine all subquestions for each claim into a list
  allQuestions = []
  for annotationList in tqdm(df['annotations']):
    questions = []
    for annotation in annotationList:
      questions.extend(annotation['questions'])
    allQuestions.append(questions)

  # Name the column to specify that these questions are from human annotators
  dfFiltered['annotated-questions'] = pd.Series(allQuestions, index=df.index)

  #Number the questions and remove breakline tokens
  dfFiltered['annotated-questions'] = removeBreaklines(dfFiltered['annotated-questions'])
  dfFinal = convert_labels(dfFiltered)

  #Export filtered df to csv [openAI]
  dfFinal.to_csv(destPath, index=False, encoding='utf-8-sig', header=True)

  #Export filtered df to json
  # dfFinal.to_json(destPath, orient='table', index=False)

#File Paths
devSource = "./reconstructed_data/dev.jsonl"
devDest = "./preprocessed_data/dev.csv"

trainSource = "./reconstructed_data/train.jsonl"
trainDest = "./preprocessed_data/train.csv"

testSource = "./reconstructed_data/test.jsonl"
testDest = "./preprocessed_data/test.csv"

#Data Extraction
print("Creating dev.csv")
preprocessData(devSource, devDest)

print("\nCreating train.csv")
preprocessData(trainSource, trainDest)

print("\nCreating test.csv")
preprocessData(testSource, testDest)


Creating dev.csv


100%|██████████| 197/197 [00:00<00:00, 26086.12it/s]



Creating train.csv


100%|██████████| 793/793 [00:00<00:00, 37043.32it/s]



Creating test.csv


100%|██████████| 200/200 [00:00<00:00, 23618.57it/s]


In [None]:
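#File Paths
devSource = "./reconstructed_data/dev.jsonl"
devDest = "./filtered_data/dev.json"

#Data Extraction
# Note: preprocessData exports with to_csv by default; enable the commented-out
# to_json export inside it before running this cell so that dev.json contains JSON
print("Creating dev.json")
preprocessData(devSource, devDest)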

Creating dev.json


100%|██████████| 197/197 [00:00<00:00, 26484.96it/s]
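For the JSON export, one option is JSON Lines, which matches the format of the input files. The sketch below is an assumption-labeled alternative to the `orient='table'` export mentioned in the code comments, not the notebook's own method: it uses `to_json` with `orient="records"` and `lines=True` on a made-up toy DataFrame and parses the result back to check it.

```python
import json

import pandas as pd

toy_df = pd.DataFrame({
    "claim": ["Claim A", "Claim B"],
    "label": [1, 0],
    "annotated-questions": ["1. Q1? \n2. Q2?", "1. Q3?"],
})

# With no path given, to_json returns the serialized text as a string
json_text = toy_df.to_json(orient="records", lines=True)

# Each non-empty line is one JSON object; embedded newlines stay escaped
records = [json.loads(line) for line in json_text.splitlines() if line]
print(len(records))
print(records[0]["claim"])
```

Passing a file path as the first argument to `to_json` writes the same text to disk instead of returning it.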
