# Dataset Preparation
Before training the Google T5 model on our claim decomposition task, we first need to fetch the training data.

This file contains code to reconstruct the ClaimDecomp dataset described in the `chen-etal-2022-generating` paper from the University of Texas.

## Install Required Libraries

In [None]:
!pip install tqdm
!pip install pandas
!pip install beautifulsoup4
!pip install argparse
!pip install requests
!pip install allennlp==2.7
!pip install torch==1.9.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting argparse
  Downloading argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


Collecting allennlp==2.7
  Downloading allennlp-2.7.0-py3-none-any.whl (738 kB)
Collecting wandb<0.13.0,>=0.10.0
  Downloading wandb-0.12.21-py2.py3-none-any.whl (1.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting boto3<2.0,>=1.14
  Downloading boto3-1.24.95-py3-none-any.whl (132 kB)
Collecting torch<1.10.0,>=1.6.0
  Downloading torch-1.9.1-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)

Collecting torch==1.9.0
  Downloading torch-1.9.0-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.9.1
    Uninstalling torch-1.9.1:
      Successfully uninstalled torch-1.9.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.10.1 requires torch==1.9.1, but you have torch 1.9.0 which is incompatible.
torchtext 0.13.1 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.
torchaudio 0.12.1+cu113 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.
Successfully installed torch-1.9.0


## Download ClaimDecomp Dataset from the University of Texas

In [None]:
!mkdir -p ./claim_decomp_raw
!wget -O ./claim_decomp_raw/train.jsonl 'https://www.cs.utexas.edu/~jfchen/claim-decomp/train.jsonl'
!wget -O ./claim_decomp_raw/dev.jsonl 'https://www.cs.utexas.edu/~jfchen/claim-decomp/dev.jsonl'
!wget -O ./claim_decomp_raw/test.jsonl 'https://www.cs.utexas.edu/~jfchen/claim-decomp/test.jsonl'

# Reconstruct ClaimDecomp Dataset

Because the initial dataset downloaded from ClaimDecomp has incomplete fields (such as missing claims and persons), the data has to be reconstructed by processing the source article of each claim.
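
A quick way to see which fields need reconstructing is to count, per field, how many raw records lack a value. The sketch below is illustrative only: the helper name `count_missing_fields` and the list of expected fields are assumptions (the field names are taken from the columns of the reconstructed dataframe in this file), and it is not part of `reconstruct_dataset.py`.

```python
import json
from collections import Counter

# Assumed field names, taken from the columns of the reconstructed dataframe
EXPECTED_FIELDS = ["claim", "person", "venue", "justification", "full_article"]

def count_missing_fields(jsonl_lines, fields=EXPECTED_FIELDS):
    """Count how many records have each field absent or empty."""
    missing = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for field in fields:
            if not record.get(field):
                missing[field] += 1
    return dict(missing)

# Toy records, not real ClaimDecomp data
sample_lines = [
    '{"example_id": 1, "claim": "A toy claim", "person": ""}',
    '{"example_id": 2, "url": "https://example.com"}',
]
print(count_missing_fields(sample_lines))
```

To inspect a real split, the lines of an opened jsonl file can be passed in place of `sample_lines`.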

## Information on train.jsonl

Removed Invalid Samples: 
* **717-719**

Skipped Samples: 
1. 241
2. 513
3. 579
4. 586

Total Samples: 
**793**
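
The total listed here can be checked by counting the non-empty lines of a jsonl file, since each line holds one sample. This is a minimal sketch, assuming the total refers to the number of records in the reconstructed file; `count_jsonl_records` is a hypothetical helper, and the demonstration runs on a throwaway file rather than the real data.

```python
import os
import tempfile

def count_jsonl_records(path):
    """Return the number of non-empty lines (records) in a jsonl file."""
    with open(path, "r", encoding="utf-8") as jsonl_file:
        return sum(1 for line in jsonl_file if line.strip())

# Demonstration on a throwaway file; for the real check, pass the path of a
# reconstructed split and compare the result with the total listed above.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        demo_file.write('{"claim": "a"}\n{"claim": "b"}\n\n{"claim": "c"}\n')
    demo_count = count_jsonl_records(demo_path)

print(demo_count)
```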



In [None]:
!python3 reconstruct_dataset.py --input_path ./claim_decomp_raw/train.jsonl --output ./Reconstructed/train.jsonl

## Information on dev.jsonl

Removed Invalid Samples: **34-36**

Total Samples: **197**

In [None]:
!python3 reconstruct_dataset.py --input_path ./claim_decomp_raw/dev.jsonl --output ./Reconstructed/dev.jsonl

Size of Dataframe: 197

197it [05:15,  1.60s/it]


## Information on test.jsonl

Total Samples: **200**

In [None]:
!python3 reconstruct_dataset.py --input_path ./claim_decomp_raw/test.jsonl --output ./Reconstructed/test.jsonl

Size of Dataframe: 200

200it [06:26,  1.93s/it]


# Dataframe Creation

The following code builds a Pandas dataframe from the jsonl files reconstructed from the original ClaimDecomp dataset.

In [None]:
import pandas as pd
import numpy as np
import json

path = "./reconstructed_data/dev.jsonl"

# Each line of a jsonl file is one JSON object. Parse every non-empty line into
# a dict; listOfRows is then a list of dict, one per df row, with keys = column name
with open(path, 'r', encoding='utf-8') as jsonl_file:
    listOfRows = [json.loads(line) for line in jsonl_file if line.strip()]

df = pd.DataFrame(listOfRows)
df.head()

# print(df['claim'][0])
# df['annotations'][0][1]['questions']
# df.iloc[0]['annotations']

| | example_id | label | url | annotations | claim | person | venue | justification | full_article |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8057719209342304749 | false | https://www.politifact.com/factchecks/2020/apr... | [{'questions': ['Is voting fraud widespread in... | With voting by mail, “you get thousands and th... | Donald Trump | stated on April 7, 2020 in a press briefing: | Trump said that with voting by mail, "you get ... | The daily White House briefings about coronavi... |
| 1 | -3333998957238197422 | barely-true | https://www.politifact.com/factchecks/2019/mar... | [{'questions': ['Was the federal aid given by ... | "I’ve already traveled to Washington, D.C., an... | Ron DeSantis | stated on March 5, 2019 in his State of the St... | DeSantis said, "I’ve already traveled to Washi... | Editor’s note, March 10 12:55 p.m.: Two days a... |
| 2 | -5816336384767541299 | barely-true | https://www.politifact.com/factchecks/2016/nov... | [{'questions': ['Is this ban directly linked t... | Says that when San Francisco banned plastic gr... | James Quintero | stated on October 10, 2016 in a panel discussi... | Quintero said that when San Francisco banned p... | Reused grocery bags made Californians sick, a ... |
| 3 | 7968458905312541095 | true | https://www.politifact.com/factchecks/2014/dec... | [{'questions': ['Is it true that The United S... | The United States "decided waterboarding was t... | Sheldon Whitehouse | stated on December 14, 2014 in a TV interview: | Sheldon Whitehouse said the United States "dec... | The so-called "CIA torture report" has heighte... |
| 4 | -2095875040468818200 | false | https://www.politifact.com/factchecks/2020/sep... | [{'questions': ['Has Trump been accused of wal... | Quotes Donald Trump as saying, “I’ll tell you ... | Viral image | stated on September 21, 2020 in a post on Face... | A Facebook post quotes President Trump as sayi... | President Donald Trump, a former beauty pagean... |


Exploration Code, to be removed

In [None]:
# tmp = []
# l1 = df.iloc[0]['annotations'][0]['questions']
# l2 = df.iloc[0]['annotations'][1]['questions']
# tmp.extend(l1)
# tmp.extend(l2)

# "".join(tmp)
# tmp

In [None]:
#Select only the claim and justification from the df
#.copy() gives an independent dataframe, so new columns are not written to a view of df
dfFiltered = df.loc[ : ,['claim', 'justification']].copy()

#Combine all subquestions for each claim into a list
allQuestions = []
for annotationList in df['annotations']:
  questions = []
  for annotation in annotationList:
    questions.extend(annotation['questions'])
  allQuestions.append(questions)

#Assign the whole column at once instead of chained indexing per row
dfFiltered['questions'] = allQuestions

dfFiltered.head()


| | claim | justification | questions |
|---|---|---|---|
| 0 | With voting by mail, “you get thousands and th... | Trump said that with voting by mail, "you get ... | [Is voting fraud widespread in the US?, Is the... |
| 1 | "I’ve already traveled to Washington, D.C., an... | DeSantis said, "I’ve already traveled to Washi... | [Was the federal aid given by Trump for hurric... |
| 2 | Says that when San Francisco banned plastic gr... | Quintero said that when San Francisco banned p... | [Is this ban directly linked to outbreaks of f... |
| 3 | The United States "decided waterboarding was t... | Sheldon Whitehouse said the United States "dec... | [Is it true that The United States " decided ... |
| 4 | Quotes Donald Trump as saying, “I’ll tell you ... | A Facebook post quotes President Trump as sayi... | [Has Trump been accused of walking into a dres... |
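
If one (claim, question) pair per row is more convenient than one list per claim, the list column can be unpacked with `DataFrame.explode`. The sketch below uses a toy frame with the same column names as `dfFiltered`; whether the later training code expects this layout is an assumption, and the names `toy` and `pairs` are illustrative.

```python
import pandas as pd

# Toy frame mirroring the columns of dfFiltered (not real ClaimDecomp data)
toy = pd.DataFrame({
    "claim": ["claim A", "claim B"],
    "justification": ["just A", "just B"],
    "questions": [["Q1?", "Q2?"], ["Q3?"]],
})

# explode() repeats claim/justification once per element of the list column
pairs = toy.explode("questions").reset_index(drop=True)
pairs = pairs.rename(columns={"questions": "question"})
print(pairs)
```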


## Further clean data by removing breakline tokens

In [None]:
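# Clean data by removing breaklines \n
# questionSeries is a Series of lists of question strings; the lists are
# modified in place, so the dataframe holding them is updated as well
def removeBreaklines(questionSeries):
  for questions in questionSeries:
    for i in range(len(questions)):
      questions[i] = questions[i].replace('\n','')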

print("Before breakline removal:\n")
print(dfFiltered.loc[100, "questions"])

tmp = dfFiltered.loc[ : , 'questions']
removeBreaklines(tmp)

print("\nAfter breakline removal:\n")
dfFiltered.loc[100, "questions"]


Before breakline removal:

['Did Hilary Clinton ever say that Rubio scares her?\n', 'Have any democrats ever said that they were concerned about Rubio?', 'Has the person that this claim originates from always supported Hillary Clinton as president?', "Did Hilary Clinton say there's only one candidate who scares her--Marco Rubio?", 'Are Democrats concerned about Rubio?']

After breakline removal:



['Did Hilary Clinton ever say that Rubio scares her?',
 'Have any democrats ever said that they were concerned about Rubio?',
 'Has the person that this claim originates from always supported Hillary Clinton as president?',
 "Did Hilary Clinton say there's only one candidate who scares her--Marco Rubio?",
 'Are Democrats concerned about Rubio?']
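
`removeBreaklines` edits the question lists in place, which is why the change is visible in `dfFiltered` even though only `tmp` was passed in. Where in-place mutation is not wanted, a non-mutating variant can return a new Series instead. This is a minimal sketch on toy data; `strip_newlines` is a hypothetical name, not a function used elsewhere in this file.

```python
import pandas as pd

def strip_newlines(question_lists):
    """Return a new Series of lists with newline characters removed."""
    return question_lists.apply(
        lambda questions: [q.replace("\n", "") for q in questions]
    )

# Toy data, not real ClaimDecomp questions
raw = pd.Series([["Is this true?\n", "Who said it?"], ["When?\n"]])
cleaned = strip_newlines(raw)

print(cleaned.tolist())
print(raw.tolist())  # the input Series is left unchanged
```

The result would then be assigned back explicitly, for example to the `questions` column.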

## Exporting Dataframe

In [None]:
fileName = "filtered_data/dev.csv"
dfFiltered.to_csv(fileName, index=False, encoding='utf-8-sig', header=True)
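
CSV has no list type, so each list in the `questions` column is written as its string representation. When the file is read back, that column therefore contains strings unless it is parsed again. The sketch below shows one way to restore the lists with `ast.literal_eval`; it uses a toy frame and an in-memory buffer rather than the real file, and the variable names are illustrative.

```python
import ast
import io

import pandas as pd

# Toy frame, not real ClaimDecomp data
toy = pd.DataFrame({
    "claim": ["claim A"],
    "questions": [["Is it true?", "Who said it's true?"]],
})

# Write to an in-memory buffer instead of a file for the demonstration
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Without a converter the column comes back as plain strings
as_text = pd.read_csv(io.StringIO(csv_text))

# literal_eval safely parses the Python-literal string back into a list
restored = pd.read_csv(io.StringIO(csv_text),
                       converters={"questions": ast.literal_eval})

print(type(as_text.loc[0, "questions"]), type(restored.loc[0, "questions"]))
```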

# Abstracting Dataframe Creation

The cells above are condensed into the `filterData` function.

The following cell is a condensed version that can be moved into a standalone Python file for dataframe creation.

In [None]:
import json

import pandas as pd
from tqdm import tqdm

# Clean data by removing breaklines \n
# questionSeries is a Series of lists of question strings, modified in place
def removeBreaklines(questionSeries):
  for questions in questionSeries:
    for i in range(len(questions)):
      questions[i] = questions[i].replace('\n','')

# Loads a reconstructed jsonl file, keeps the claim, justification and
# questions of each sample, and exports the cleaned data to destPath
def filterData(sourcePath, destPath):

  #Load jsonl file: each non-empty line is one JSON object
  #listOfRows is a list of dict, each dict containing data for a row in the df with keys = column name
  with open(sourcePath, 'r', encoding='utf-8') as jsonl_file:
    listOfRows = [json.loads(line) for line in jsonl_file if line.strip()]

  #Convert list of dict into a df
  df = pd.DataFrame(listOfRows)

  #Select only the claim and justification from the original df
  #.copy() gives an independent dataframe, so new columns are not written to a view of df
  dfFiltered = df.loc[ : ,['claim', 'justification']].copy()

  #Combine all subquestions for each claim into a list
  allQuestions = []
  for annotationList in tqdm(df['annotations']):
    questions = []
    for annotation in annotationList:
      questions.extend(annotation['questions'])
    allQuestions.append(questions)

  #Assign the whole column at once instead of chained indexing per row
  dfFiltered['questions'] = allQuestions

  #Remove breakline tokens
  removeBreaklines(dfFiltered['questions'])

  #Export filtered df to csv
  dfFiltered.to_csv(destPath, index=False, encoding='utf-8-sig', header=True)

  #Export filtered df to json
  # dfFiltered.to_json(destPath, orient='table', index=False)

#File Paths
devSource = "./reconstructed_data/dev.jsonl"
devDest = "./filtered_data/dev.csv"

trainSource = "./reconstructed_data/train.jsonl"
trainDest = "./filtered_data/train.csv"

testSource = "./reconstructed_data/test.jsonl"
testDest = "./filtered_data/test.csv"

#Data Extraction
print("Creating dev.csv")
filterData(devSource, devDest)

print("\nCreating train.csv")
filterData(trainSource, trainDest)

print("\nCreating test.csv")
filterData(testSource, testDest)


Creating dev.csv


100%|██████████| 197/197 [00:00<00:00, 18918.78it/s]



Creating train.csv


100%|██████████| 793/793 [00:00<00:00, 36134.62it/s]


Creating test.csv



100%|██████████| 200/200 [00:00<00:00, 18152.45it/s]
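
The three source and destination pairs follow one naming pattern, so they can also be generated in a loop. The sketch below is one possible arrangement, not the code that produced the output above: `split_paths` and `SPLITS` are hypothetical names. It also creates the destination folder first, since `DataFrame.to_csv` does not create missing folders.

```python
import os
import tempfile

SPLITS = ["dev", "train", "test"]

def split_paths(split, source_dir, dest_dir):
    """Build the (source, destination) paths for one split.

    dest_dir is created if it is missing.
    """
    os.makedirs(dest_dir, exist_ok=True)
    source = os.path.join(source_dir, f"{split}.jsonl")
    dest = os.path.join(dest_dir, f"{split}.csv")
    return source, dest

# Demonstration inside a temporary folder; in this notebook the folders would
# be "./reconstructed_data" and "./filtered_data", and each pair would be
# passed on as filterData(source, dest).
with tempfile.TemporaryDirectory() as tmp_dir:
    dest_dir = os.path.join(tmp_dir, "filtered_data")
    pairs = [split_paths(s, os.path.join(tmp_dir, "reconstructed_data"), dest_dir)
             for s in SPLITS]
    dest_dir_created = os.path.isdir(dest_dir)

print([os.path.basename(dest) for _, dest in pairs], dest_dir_created)
```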


In [None]:
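#File Paths
devSource = "./reconstructed_data/dev.jsonl"
devDest = "./filtered_data/dev.json"

#Data Extraction
#filterData exports with to_csv, so dev.json holds csv content unless the
#to_json line inside filterData is enabled in place of the to_csv line
print("Creating dev.json")
filterData(devSource, devDest)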

Creating dev.json


100%|██████████| 197/197 [00:00<00:00, 26484.96it/s]
