### Title: Data Combination Notebook - LIAR Dataset
### Filename: combine_datasets.ipynb

# Combining LIAR Dataset Files
This notebook combines the three TSV files of the LIAR dataset (the train, validation, and test splits) into a single CSV file.

## Dataset Information
The LIAR dataset consists of three files:
- train.tsv
- valid.tsv
- test.tsv

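Each file is tab-separated. Every row holds one statement, its truthfulness label (one of six classes, from `pants-fire` to `true`), and metadata about the speaker and the context in which the statement was made.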

In [2]:
# Importing the required libraries
import pandas as pd
import os

#### Define Column Names
The TSV files carry no column names of their own, so we take them from the dataset's README file:

In [3]:
# Defining the Column names based on the README file
column_names = [
    "id",
    "label",
    "statement",
    "subject",
    "speaker",
    "speaker_job",
    "state_info",
    "party_affiliation",
    "barely_true_counts",
    "false_counts",
    "half_true_counts",
    "mostly_true_counts",
    "pants_on_fire_counts",
    "context",
]
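
Because the names are supplied by hand, a mismatch between the length of this list and the number of fields in the files may go unnoticed when loading. A minimal sketch of a pre-load check follows; the `count_fields` helper is hypothetical and not part of the original notebook, and it assumes UTF-8 files with no embedded tabs in the first line.

```python
def count_fields(path, sep="\t"):
    """Count the separator-delimited fields on the first line of a text file."""
    with open(path, encoding="utf-8") as handle:
        first_line = handle.readline().rstrip("\n")
    return len(first_line.split(sep))
```

For example, comparing `count_fields(os.path.join("liar_dataset", "train.tsv"))` with `len(column_names)` before calling `pd.read_csv` would flag a list that is too short or too long.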

#### Load and Combine the Datasets

In [4]:
# Dataset Path
dataset_path = "liar_dataset"

In [5]:
# Loading the Three TSV files
train_df = pd.read_csv(
    os.path.join(dataset_path, "train.tsv"), sep="\t", names=column_names
)
valid_df = pd.read_csv(
    os.path.join(dataset_path, "valid.tsv"), sep="\t", names=column_names
)
test_df = pd.read_csv(
    os.path.join(dataset_path, "test.tsv"), sep="\t", names=column_names
)
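
Once the three frames are concatenated, nothing in the data records which split a row came from. If that matters downstream, one option is to tag each frame while loading it. The sketch below is an assumption-laden variant of the cell above: the `load_split` and `load_all_splits` helpers and the extra `split` column are hypothetical and do not appear in the LIAR README or the original notebook.

```python
import os

import pandas as pd


def load_split(dataset_path, split_name, column_names):
    """Read `<split_name>.tsv` and tag every row with its split of origin."""
    path = os.path.join(dataset_path, f"{split_name}.tsv")
    df = pd.read_csv(path, sep="\t", names=column_names)
    df["split"] = split_name  # hypothetical extra column, not in the README
    return df


def load_all_splits(dataset_path, column_names, splits=("train", "valid", "test")):
    """Load every split and stack them into one frame with a fresh index."""
    frames = [load_split(dataset_path, name, column_names) for name in splits]
    return pd.concat(frames, ignore_index=True)
```

Calling `load_all_splits(dataset_path, column_names)` would then yield the same rows as the combined frame built in this notebook, plus one column that allows the original splits to be recovered later.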

In [6]:
# Combining the three datasets
combined_df = pd.concat([train_df, valid_df, test_df], axis=0, ignore_index=True)
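
A quick sanity check after concatenation can confirm that no rows were lost and show whether any identifier appears more than once. This is only a sketch: the `check_concat` helper is hypothetical, and it assumes the `id` column is meant to be unique, which the source does not state.

```python
import pandas as pd


def check_concat(parts, combined, id_column="id"):
    """Compare the combined frame against its parts and count repeated ids."""
    expected_rows = sum(len(part) for part in parts)
    return {
        "row_count_matches": len(combined) == expected_rows,
        "duplicate_ids": int(combined[id_column].duplicated().sum()),
    }
```

In this notebook it could be called as `check_concat([train_df, valid_df, test_df], combined_df)`.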

#### Basic Dataset Statistics

In [7]:
print("Dataset Statistics:")
print(f"Total number of rows: {len(combined_df)}")
print("\nDistribution of labels:")
print(combined_df["label"].value_counts())
print("\nMissing values in each column:")
print(combined_df.isnull().sum())

Dataset Statistics:
Total number of rows: 12791

Distribution of labels:
label
half-true      2627
false          2507
mostly-true    2454
barely-true    2103
true           2053
pants-fire     1047
Name: count, dtype: int64

Missing values in each column:
id                         0
label                      0
statement                  0
subject                    2
speaker                    2
speaker_job             3568
state_info              2751
party_affiliation          2
barely_true_counts         2
false_counts               2
half_true_counts           2
mostly_true_counts         2
pants_on_fire_counts       2
context                  131
dtype: int64
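
Several columns (`subject`, `speaker`, `party_affiliation`, and the five count columns) each report exactly 2 missing values, which may point to the same two incomplete rows. A minimal sketch for pulling those rows out for inspection is shown below; the `rows_missing` helper is hypothetical and not part of the original notebook.

```python
import pandas as pd


def rows_missing(df, columns):
    """Return the rows of `df` where at least one of `columns` is null."""
    mask = df[columns].isnull().any(axis=1)
    return df.loc[mask]
```

For instance, `rows_missing(combined_df, ["subject", "speaker", "party_affiliation"])` would list the affected rows so they can be reviewed before deciding whether to drop or repair them.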


#### Exporting the Combined Dataset

In [8]:
# Save the combined dataset
output_path = os.path.join(dataset_path, "liars_dataset.csv")
combined_df.to_csv(output_path, index=False)
print(f"\nCombined dataset saved to: {output_path}")


Combined dataset saved to: liar_dataset/liars_dataset.csv
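
To confirm that the export can be read back as written, the CSV can be reloaded and compared with the in-memory frame. The sketch below checks only the shape and the column names, not the cell values, and the `verify_roundtrip` helper is hypothetical.

```python
import pandas as pd


def verify_roundtrip(df, csv_path):
    """Reload `csv_path` and check it has the same shape and columns as `df`."""
    reloaded = pd.read_csv(csv_path)
    same_shape = reloaded.shape == df.shape
    same_columns = list(reloaded.columns) == list(df.columns)
    return same_shape and same_columns
```

Here it would be called as `verify_roundtrip(combined_df, output_path)` right after the save.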
