# Get PolitiFact Dataset

- HuggingFace: https://huggingface.co/datasets/LittleFish-Coder/Fake_News_PolitiFact

- Splits:
    - `train`: 381
    - `test`: 102

- Columns:
    - `text`: str
    - `embeddings`: list of float
    - `label`: int
        - `0`: real
        - `1`: fake

In [1]:
from datasets import load_dataset


In [2]:
# download the dataset from the Hugging Face Hub (or reuse the local cache in ./dataset) and load it
dataset = load_dataset("LittleFish-Coder/Fake_News_PolitiFact", download_mode="reuse_cache_if_exists", cache_dir="dataset")

Generating train split: 100%|██████████| 381/381 [00:00<00:00, 20245.14 examples/s]
Generating test split: 100%|██████████| 102/102 [00:00<00:00, 12641.66 examples/s]


In [3]:
print(type(dataset))
print(dataset)

<class 'datasets.dataset_dict.DatasetDict'>
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 381
    })
    test: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 102
    })
})
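
The split sizes printed above can also be summarized as proportions. The following is a minimal sketch, not part of the original notebook: `split_summary` is a hypothetical helper, and `example_splits` is a stand-in that only mimics the lengths of the two splits (381 and 102) so the snippet runs without downloading anything.

```python
from typing import Dict, Mapping, Sized, Tuple


def split_summary(splits: Mapping[str, Sized]) -> Dict[str, Tuple[int, float]]:
    """Return {split_name: (num_rows, fraction_of_all_rows)}."""
    total = sum(len(rows) for rows in splits.values())
    return {name: (len(rows), len(rows) / total) for name, rows in splits.items()}


# Stand-in for the DatasetDict above: ranges with the same lengths as the splits.
example_splits = {"train": range(381), "test": range(102)}
summary = split_summary(example_splits)
for name, (num_rows, fraction) in summary.items():
    print(f"{name}: {num_rows} rows ({fraction:.1%})")
```

Because a `DatasetDict` behaves like a dictionary of splits and each split supports `len()`, passing `dataset` itself instead of `example_splits` should work the same way.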


In [4]:
dataset.keys()  # type: ignore

dict_keys(['train', 'test'])

In [5]:
# train data
train_dataset = dataset['train']    # type: ignore
# first row of the train data
print(train_dataset[0].keys())
print(train_dataset[0])

dict_keys(['text', 'label', 'embeddings'])
{'text': 'Inside a Fake News Sausage Factory: ‘This Is All About Income’ In Tbilisi, the two-room rented apartment Mr. Latsabidze shares with his younger brother is an unlikely offshore outpost of America’s fake news industry. The two brothers, both computer experts, get help from a third young Georgian, an architect.\n\nThey say they have no keen interest in politics themselves and initially placed bets across the American political spectrum and experimented with show business news, too. They set up a pro-Clinton website, walkwithher.com, a Facebook page cheering Bernie Sanders and a web digest of straightforward political news plagiarized from The New York Times and other mainstream news media.\n\nBut those sites, among the more than a dozen registered by Mr. Latsabidze, were busts. Then he shifted all his energy to Mr. Trump. His flagship pro-Trump website, departed.co, gained remarkable traction in a crowded field in the prelude to the Nov

In [6]:
from collections import Counter

In [7]:
# select the first 100 rows
top_100_train_dataset = train_dataset.select(range(100))    # type: ignore

In [8]:
# count the label distribution of the first 100 samples
print("Label distribution of the first 100 samples:")
print(Counter(top_100_train_dataset['label']))

Label distribution of the first 100 samples:
Counter({0: 69, 1: 31})
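
The raw `Counter` output is easier to read with the label names from the column description (`0`: real, `1`: fake) and with fractions next to the counts. A minimal sketch, assuming that mapping: `named_distribution` and `LABEL_NAMES` are hypothetical names, and `example_labels` is a stand-in list built to match the counts shown above.

```python
from collections import Counter

# Mapping taken from the column description at the top of this notebook.
LABEL_NAMES = {0: "real", 1: "fake"}


def named_distribution(labels):
    """Return {label_name: (count, fraction)} for an iterable of int labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        LABEL_NAMES[label]: (count, count / total)
        for label, count in sorted(counts.items())
    }


# Stand-in labels with the same counts as the output above (69 real, 31 fake).
example_labels = [0] * 69 + [1] * 31
distribution = named_distribution(example_labels)
for name, (count, fraction) in distribution.items():
    print(f"{name}: {count} ({fraction:.0%})")
```

With the real data, the same helper could be called as `named_distribution(top_100_train_dataset['label'])`.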


In [9]:
# select the first 10 rows
top_10_train_dataset = train_dataset.select(range(10))  # type: ignore

In [10]:
# count the label distribution of the first 10 samples
print("Label distribution of the first 10 samples:")
print(Counter(top_10_train_dataset['label']))

Label distribution of the first 10 samples:
Counter({0: 7, 1: 3})
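
Taking the first N rows gives a skewed subset here (7 real vs. 3 fake in the first 10). If a balanced subset is wanted, one option is to collect the first `k` row indices per label and pass them to `select`. The sketch below is an assumption about how that could be done, not something the original notebook does: `first_k_per_label` is a hypothetical helper, and the order of `example_labels` is made up (only the 7/3 counts come from the output above).

```python
from collections import defaultdict


def first_k_per_label(labels, k):
    """Return sorted row indices of the first k rows seen for each label."""
    picked = defaultdict(list)
    for index, label in enumerate(labels):
        if len(picked[label]) < k:
            picked[label].append(index)
    return sorted(i for indices in picked.values() for i in indices)


# Stand-in labels in a hypothetical order: 7 real (0) and 3 fake (1).
example_labels = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
balanced_indices = first_k_per_label(example_labels, k=3)
print(balanced_indices)
```

On the real split, the resulting indices could be used as `train_dataset.select(first_k_per_label(train_dataset['label'], k=5))`, since `select` accepts a list of row indices.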
