<a href="https://colab.research.google.com/github/Avalionnet/Avalionnet/blob/main/data/dataset_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset Preparation
Before training the Google T5 model for our claim decomposition task, we first need to fetch the training data.

This notebook contains code to reconstruct the ClaimDecomp dataset introduced in the `chen-etal-2022-generating` paper from the University of Texas.

## Install required libraries

In [None]:
!pip install tqdm
!pip install pandas
!pip install beautifulsoup4
# argparse ships with the Python standard library, so it needs no separate install
!pip install requests
!pip install allennlp==2.7
!pip install torch==1.9.0

## Download ClaimDecomp Dataset from the University of Texas

In [2]:
!mkdir -p ./ClaimDecomp
!wget -O ./ClaimDecomp/train.jsonl 'https://www.cs.utexas.edu/~jfchen/claim-decomp/train.jsonl'
!wget -O ./ClaimDecomp/dev.jsonl 'https://www.cs.utexas.edu/~jfchen/claim-decomp/dev.jsonl'
!wget -O ./ClaimDecomp/test.jsonl 'https://www.cs.utexas.edu/~jfchen/claim-decomp/test.jsonl'

--2022-10-20 06:19:50--  https://www.cs.utexas.edu/~jfchen/claim-decomp/train.jsonl
Resolving www.cs.utexas.edu (www.cs.utexas.edu)... 128.83.120.48
Connecting to www.cs.utexas.edu (www.cs.utexas.edu)|128.83.120.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 658106 (643K)
Saving to: ‘./ClaimDecomp/train.jsonl’


2022-10-20 06:19:52 (660 KB/s) - ‘./ClaimDecomp/train.jsonl’ saved [658106/658106]

--2022-10-20 06:19:53--  https://www.cs.utexas.edu/~jfchen/claim-decomp/dev.jsonl
Resolving www.cs.utexas.edu (www.cs.utexas.edu)... 128.83.120.48
Connecting to www.cs.utexas.edu (www.cs.utexas.edu)|128.83.120.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 341551 (334K)
Saving to: ‘./ClaimDecomp/dev.jsonl’


2022-10-20 06:19:54 (427 KB/s) - ‘./ClaimDecomp/dev.jsonl’ saved [341551/341551]

--2022-10-20 06:19:54--  https://www.cs.utexas.edu/~jfchen/claim-decomp/test.jsonl
Resolving www.cs.utexas.edu (www.cs.utexas.edu)... 128.83.120.48
Co

# Reconstruct Dataset

The dataset downloaded from ClaimDecomp is incomplete: fields such as the claim and the person are missing. The data therefore has to be reconstructed by processing information from the source article of each claim.
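
To see which fields actually need reconstructing, one could count how many records lack each expected field. The following is a minimal sketch and not part of `reconstruct_dataset.py`: the helper name `count_missing_fields`, the sample records, and the choice of required fields (`claim` and `person`, taken from the columns of the reconstructed data shown at the end of this notebook) are assumptions.

```python
import json
from collections import Counter


def count_missing_fields(lines, required_fields):
    """Count, per field, how many JSONL records lack a non-empty value.

    `lines` is any iterable of JSON strings, for example an open file.
    """
    missing = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        for field in required_fields:
            value = record.get(field)
            if value is None or value == "" or value == []:
                missing[field] += 1
    return dict(missing)


# Hypothetical two-record sample; real input would be an open handle on
# a file such as ./ClaimDecomp/train.jsonl.
sample_lines = [
    '{"example_id": 1, "url": "https://example.com/a", "claim": "Some claim"}',
    '{"example_id": 2, "url": "https://example.com/b"}',
]
print(count_missing_fields(sample_lines, ["claim", "person"]))
```

Fields that never appear in the result are present in every record, so an empty dict would mean nothing needs reconstructing.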

## Information on train.jsonl

Removed Invalid Samples: **717-719**

Skipped Samples: **241, 513, 579, 586**

Total Samples: **793**
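
The exclusions listed above can be sketched as an index-based filter. This is an illustration only: `drop_samples` is a hypothetical helper rather than a function from `reconstruct_dataset.py`, and it assumes that positions are 0-based, that the range 717-719 is inclusive, and that skipped samples are left out of the final count. The placeholder size of 800 raw records is also an assumption.

```python
def drop_samples(records, invalid_ranges=(), skipped=()):
    """Return the records whose position is not excluded.

    `invalid_ranges` holds (start, end) pairs, treated as inclusive.
    Positions are assumed to be 0-based.
    """
    excluded = set(skipped)
    for start, end in invalid_ranges:
        excluded.update(range(start, end + 1))
    return [rec for idx, rec in enumerate(records) if idx not in excluded]


# Indices come from the lists above; the records themselves are placeholders.
records = [{"position": i} for i in range(800)]
kept = drop_samples(
    records,
    invalid_ranges=[(717, 719)],
    skipped=[241, 513, 579, 586],
)
print(len(kept))  # 800 placeholders minus 7 excluded positions = 793
```

Under these assumptions, removing the seven positions from 800 records leaves 793, which is consistent with the total stated above.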



In [None]:
!python3 reconstruct_dataset.py --input_path ./ClaimDecomp/train.jsonl --output ./Reconstructed/train.jsonl

## Information on dev.jsonl

Removed Invalid Samples: **34-36**

Total Samples: **197**

In [6]:
!python3 reconstruct_dataset.py --input_path ./ClaimDecomp/dev.jsonl --output ./Reconstructed/dev.jsonl

Size of Dataframe: 197

197it [05:15,  1.60s/it]


## Information on test.jsonl

Total Samples: **200**

In [7]:
!python3 reconstruct_dataset.py --input_path ./ClaimDecomp/test.jsonl --output ./Reconstructed/test.jsonl

Size of Dataframe: 200

200it [06:26,  1.93s/it]


# Dataframe Creation

The following code constructs a pandas DataFrame from the JSONL files reconstructed from the original ClaimDecomp dataset.

In [18]:
import json

import pandas as pd

# Earlier cells write their output to ./Reconstructed/; adjust this path if
# your reconstructed files live there instead.
path = "./reconstructed_data/dev.jsonl"

# listOfRows is a list of dicts, one per DataFrame row, with keys = column names.
# Each non-empty line of the JSONL file holds one JSON object.
listOfRows = []
with open(path, "r", encoding="utf-8") as jsonl_file:
    for line in jsonl_file:
        if line.strip():
            listOfRows.append(json.loads(line))

df = pd.DataFrame(listOfRows)
df.head()

|  | example_id | label | url | annotations | claim | person | venue | justification | full_article |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8057719209342304749 | false | https://www.politifact.com/factchecks/2020/apr... | [{'questions': ['Is voting fraud widespread in... | With voting by mail, “you get thousands and th... | Donald Trump | stated on April 7, 2020 in a press briefing: | Trump said that with voting by mail, "you get ... | The daily White House briefings about coronavi... |
| 1 | -3333998957238197422 | barely-true | https://www.politifact.com/factchecks/2019/mar... | [{'questions': ['Was the federal aid given by ... | "I’ve already traveled to Washington, D.C., an... | Ron DeSantis | stated on March 5, 2019 in his State of the St... | DeSantis said, "I’ve already traveled to Washi... | Editor’s note, March 10 12:55 p.m.: Two days a... |
| 2 | -5816336384767541299 | barely-true | https://www.politifact.com/factchecks/2016/nov... | [{'questions': ['Is this ban directly linked t... | Says that when San Francisco banned plastic gr... | James Quintero | stated on October 10, 2016 in a panel discussi... | Quintero said that when San Francisco banned p... | Reused grocery bags made Californians sick, a ... |
| 3 | 7968458905312541095 | true | https://www.politifact.com/factchecks/2014/dec... | [{'questions': ['Is it true that The United S... | The United States "decided waterboarding was t... | Sheldon Whitehouse | stated on December 14, 2014 in a TV interview: | Sheldon Whitehouse said the United States "dec... | The so-called "CIA torture report" has heighte... |
| 4 | -2095875040468818200 | false | https://www.politifact.com/factchecks/2020/sep... | [{'questions': ['Has Trump been accused of wal... | Quotes Donald Trump as saying, “I’ll tell you ... | Viral image | stated on September 21, 2020 in a post on Face... | A Facebook post quotes President Trump as sayi... | President Donald Trump, a former beauty pagean... |
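
Since the goal is to train T5 on claim decomposition, a natural next step is to pair each claim with its annotated sub-questions. Below is a minimal sketch, assuming `annotations` is a list of dicts that each hold a `questions` list, as the truncated preview above suggests. The helper name `claim_question_pairs` and the toy rows are hypothetical and not taken from the dataset.

```python
def claim_question_pairs(rows):
    """Expand each record into one (example_id, claim, question) dict per question.

    `rows` is a list of dicts with `example_id`, `claim`, and `annotations` keys.
    """
    pairs = []
    for row in rows:
        for annotation in row.get("annotations", []):
            for question in annotation.get("questions", []):
                pairs.append(
                    {
                        "example_id": row["example_id"],
                        "claim": row["claim"],
                        "question": question,
                    }
                )
    return pairs


# Toy rows mimicking the assumed structure; not real ClaimDecomp data.
toy_rows = [
    {"example_id": 1, "claim": "Claim A", "annotations": [{"questions": ["Q1?", "Q2?"]}]},
    {"example_id": 2, "claim": "Claim B", "annotations": [{"questions": ["Q3?"]}]},
]
for pair in claim_question_pairs(toy_rows):
    print(pair)
```

If the assumption about `annotations` holds, `pd.DataFrame(claim_question_pairs(listOfRows))` would give one row per claim and question pair.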
