In [1]:
import datasets

# Fake News English

This dataset contains URLs of news articles classified as either fake or satire. The articles classified as fake also have the URL of a rebutting article.

Paper: https://dl.acm.org/doi/10.1145/3201064.3201100

#### Data Fields
- article_number: An integer used as an index for each row
- url_of_article: A string containing the URL of the article to be assessed and classified as either Fake or Satire
- fake_or_satire: A class label for the article at url_of_article, taking one of two values: Fake (1) or Satire (0)
- url_of_rebutting_article: A string containing the URL of the article used to refute the article at url_of_article


In [2]:
dataset_1 = datasets.load_dataset("community-datasets/fake_news_english")

Downloading readme:   0%|          | 0.00/5.01k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/43.1k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/492 [00:00<?, ? examples/s]

In [5]:
dataset_1

DatasetDict({
    train: Dataset({
        features: ['article_number', 'url_of_article', 'fake_or_satire', 'url_of_rebutting_article'],
        num_rows: 492
    })
})

In [6]:
dataset_1['train'][0]

{'article_number': 375,
 'url_of_article': 'http://www.redflagnews.com/headlines-2016/cdc-proposes-rule-to-apprehend-and-detain-anyone-anywhere-at-any-time-for-any-duration-without-due-process-or-right-of-appeal-and-administer-forced-vaccinations',
 'fake_or_satire': 1,
 'url_of_rebutting_article': 'http://www.snopes.com/cdc-forced-vaccinations/'}
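
The integer stored in fake_or_satire can be mapped back to a readable name. The following is a minimal sketch, assuming the mapping given under Data Fields (Satire = 0, Fake = 1). `LABEL_NAMES` and `describe` are hypothetical helpers, not part of the `datasets` library, and the record is copied from the output above.

```python
# Mapping taken from the Data Fields description: Satire (0), Fake (1).
LABEL_NAMES = {0: "Satire", 1: "Fake"}


def describe(record):
    """Return a one-line summary of a dataset row (hypothetical helper)."""
    label = LABEL_NAMES[record["fake_or_satire"]]
    # Fall back to a placeholder if the rebuttal URL is missing or empty.
    rebuttal = record.get("url_of_rebutting_article") or "no rebuttal listed"
    return f"#{record['article_number']}: {label} (rebuttal: {rebuttal})"


# The first training row, as printed by dataset_1['train'][0].
record = {
    "article_number": 375,
    "url_of_article": "http://www.redflagnews.com/headlines-2016/cdc-proposes-rule-to-apprehend-and-detain-anyone-anywhere-at-any-time-for-any-duration-without-due-process-or-right-of-appeal-and-administer-forced-vaccinations",
    "fake_or_satire": 1,
    "url_of_rebutting_article": "http://www.snopes.com/cdc-forced-vaccinations/",
}

print(describe(record))
```

The same helper could be applied to any row returned by indexing `dataset_1['train']`, since each row is a plain dictionary with these four keys.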

# Fake News Graph Classification dataset

The dataset consists of two sets of tree-structured fake/real news propagation graphs extracted from Twitter. Unlike most benchmark datasets for graph classification, the graphs here are directed trees: the root node represents the news item, and the leaf nodes are Twitter users who retweeted it. Four node feature types are available, most of which encode the users' historical tweets with pretrained language models:
- bert: the 768-dimensional node feature composed of Twitter user historical tweets encoded by bert-as-service.
- content: the 310-dimensional node feature composed of a 300-dimensional “spacy” vector plus a 10-dimensional “profile” vector.
- profile: the 10-dimensional node feature composed of ten Twitter user profile attributes.
- spacy: the 300-dimensional node feature composed of Twitter user historical tweets encoded by the spaCy word2vec encoder.
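
To make the relationship between the feature types concrete, here is a small sketch of how a 310-dimensional “content” vector could be assembled from a “spacy” vector and a “profile” vector. The concatenation order (spacy first, then profile) is an assumption, and `build_content_feature` is a hypothetical helper, not a function from DGL or the reference repository.

```python
SPACY_DIM = 300    # size of the "spacy" feature, per the list above
PROFILE_DIM = 10   # size of the "profile" feature, per the list above


def build_content_feature(spacy_vec, profile_vec):
    """Concatenate the two vectors (hypothetical helper; order is assumed)."""
    if len(spacy_vec) != SPACY_DIM or len(profile_vec) != PROFILE_DIM:
        raise ValueError("expected a 300-dim spacy vector and a 10-dim profile vector")
    return list(spacy_vec) + list(profile_vec)


# Placeholder values only; real features come from the dataset itself.
content = build_content_feature([0.0] * SPACY_DIM, [1.0] * PROFILE_DIM)
print(len(content))
```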

Reference: <https://github.com/safe-graph/GNN-FakeNews>

#### Statistics:

- Politifact:
    - Graphs: 314
    - Nodes: 41,054
    - Edges: 40,740
    - Classes:
        - Fake: 157
        - Real: 157
    - Node feature size:
        - bert: 768
        - content: 310
        - profile: 10
        - spacy: 300
- Gossipcop:
    - Graphs: 5,464
    - Nodes: 314,262
    - Edges: 308,798
    - Classes:
        - Fake: 2,732
        - Real: 2,732
    - Node feature size:
        - bert: 768
        - content: 310
        - profile: 10
        - spacy: 300
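
Because every graph is a tree, a graph with n nodes has exactly n - 1 edges, so the totals should satisfy edges = nodes - graphs. The sketch below checks that arithmetic using only the figures listed in the statistics. The `STATS` dictionary and `expected_tree_edges` are illustrative names introduced here.

```python
# Figures copied from the statistics list.
STATS = {
    "Politifact": {"graphs": 314, "nodes": 41_054, "edges": 40_740},
    "Gossipcop": {"graphs": 5_464, "nodes": 314_262, "edges": 308_798},
}


def expected_tree_edges(num_graphs, num_nodes):
    """A forest of num_graphs trees with num_nodes nodes in total
    has num_nodes - num_graphs edges."""
    return num_nodes - num_graphs


for name, s in STATS.items():
    expected = expected_tree_edges(s["graphs"], s["nodes"])
    print(name, expected, expected == s["edges"])
```

Both subsets pass the check, which is consistent with the description of the graphs as trees.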

In [12]:
# import torch
# torch.__version__
# ! nvcc --version
# ! pip install  dgl -f https://data.dgl.ai/wheels/cu121/repo.html

In [16]:
from dgl.data import FakeNewsDataset

In [17]:
dataset = FakeNewsDataset('gossipcop', 'bert')

Downloading C:\Users\harri\.dgl\gossipcop.zip from https://data.dgl.ai/dataset/FakeNewsGOS.zip...


C:\Users\harri\.dgl\gossipcop.zip:   0%|          | 0.00/1.40G [00:00<?, ?B/s]

Extracting file to C:\Users\harri\.dgl\gossipcop_0df94731


In [18]:
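graph, label = dataset[0]          # first propagation graph and its label
num_classes = dataset.num_classes  # number of graph classes
feat = dataset.feature             # node features for all graphs
labels = dataset.labels            # one label per graph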

In [19]:
labels

tensor([0., 0., 0.,  ..., 1., 1., 1.], dtype=torch.float64)
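
The `labels` tensor is stored as float64. Below is a small sketch for tallying how many graphs fall in each class, assuming the values have first been converted to a plain Python list (for example with `labels.tolist()`). `class_counts` is a hypothetical helper, and the sample list is a stand-in rather than real data.

```python
from collections import Counter


def class_counts(label_values):
    """Count occurrences of each class after casting float labels to int."""
    return Counter(int(v) for v in label_values)


# Stand-in values; in the notebook this would be labels.tolist().
sample_labels = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(class_counts(sample_labels))
```

For Gossipcop, the statistics list 2,732 graphs per class, so a tally over the full label list should show two equal counts.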