In [2]:
import datasets
import torch
from dgl.data import FakeNewsDataset
import pandas as pd

# Fake News English

This dataset contains URLs of news articles classified as either fake or satire. The articles classified as fake also have the URL of a rebutting article.

Paper: https://dl.acm.org/doi/10.1145/3201064.3201100

#### Data Fields
- article_number: An integer used as an index for each row
- url_of_article: A string which contains URL of an article to be assessed and classified as either Fake or Satire
- fake_or_satire: A class label for the article above, taking one of two values: Fake (1) or Satire (0)
- url_of_rebutting_article: A string containing the URL of the article used to refute the article in url_of_article
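
As a minimal sketch of how the `fake_or_satire` class label can be decoded, the snippet below maps the integer codes listed above to names and tallies them. The names `LABEL_NAMES` and `count_labels` and the sample rows are hypothetical, introduced only for illustration; rows are assumed to be plain dicts carrying the fields described above.

```python
from collections import Counter

# Integer codes as documented in the field list above.
LABEL_NAMES = {0: "Satire", 1: "Fake"}


def count_labels(rows):
    """Count rows per class name, given dict-like rows with a 'fake_or_satire' field."""
    return Counter(LABEL_NAMES[row["fake_or_satire"]] for row in rows)


# Hypothetical rows for illustration only; the URLs are placeholders.
sample_rows = [
    {"article_number": 1, "url_of_article": "http://example.com/a",
     "fake_or_satire": 1, "url_of_rebutting_article": "http://example.com/rebuttal"},
    # Assumption: satire rows carry no rebutting URL.
    {"article_number": 2, "url_of_article": "http://example.com/b",
     "fake_or_satire": 0, "url_of_rebutting_article": None},
]
label_counts = count_labels(sample_rows)
print(label_counts)
```

Once the dataset is loaded, `count_labels(fake_news_dataset['train'])` should give the class balance of the 492 rows.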


In [3]:
fake_news_dataset = datasets.load_dataset("community-datasets/fake_news_english")

In [4]:
fake_news_dataset

DatasetDict({
    train: Dataset({
        features: ['article_number', 'url_of_article', 'fake_or_satire', 'url_of_rebutting_article'],
        num_rows: 492
    })
})

In [5]:
fake_news_dataset['train'][0]

{'article_number': 375,
 'url_of_article': 'http://www.redflagnews.com/headlines-2016/cdc-proposes-rule-to-apprehend-and-detain-anyone-anywhere-at-any-time-for-any-duration-without-due-process-or-right-of-appeal-and-administer-forced-vaccinations',
 'fake_or_satire': 1,
 'url_of_rebutting_article': 'http://www.snopes.com/cdc-forced-vaccinations/'}

# Fake News Graph Classification dataset

The dataset is composed of two sets of tree-structured fake/real news propagation graphs extracted from Twitter. Unlike most benchmark datasets for graph classification, the graphs in this dataset are directed trees: the root node represents the news item, and the leaf nodes are Twitter users who retweeted it. Node features encode users' historical tweets with different pretrained language models:
- bert: the 768-dimensional node feature composed of Twitter user historical tweets encoded by bert-as-service
- content: the 310-dimensional node feature composed of a 300-dimensional “spacy” vector plus a 10-dimensional “profile” vector
- profile: the 10-dimensional node feature composed of ten Twitter user profile attributes.
- spacy: the 300-dimensional node feature composed of Twitter user historical tweets encoded by the spaCy word2vec encoder.

Reference: <https://github.com/safe-graph/GNN-FakeNews>

#### Statistics:

- Politifact:
    - Graphs: 314
    - Nodes: 41,054
    - Edges: 40,740
    - Classes:
        - Fake: 157
        - Real: 157
    - Node feature size:
        - bert: 768
        - content: 310
        - profile: 10
        - spacy: 300
- Gossipcop:
    - Graphs: 5,464
    - Nodes: 314,262
    - Edges: 308,798
    - Classes:
        - Fake: 2,732
        - Real: 2,732
    - Node feature size:
        - bert: 768
        - content: 310
        - profile: 10
        - spacy: 300
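
Because every graph is a tree, a graph with n nodes has exactly n - 1 edges, so the total edge count should equal the total node count minus the number of graphs. The sketch below checks the statistics listed above against that identity; `STATS` and `expected_edges` are names introduced here for illustration.

```python
# Figures copied from the statistics above.
STATS = {
    "politifact": {"graphs": 314, "nodes": 41_054, "edges": 40_740},
    "gossipcop": {"graphs": 5_464, "nodes": 314_262, "edges": 308_798},
}


def expected_edges(graphs, nodes):
    """A forest of `graphs` trees with `nodes` nodes in total has nodes - graphs edges."""
    return nodes - graphs


for name, s in STATS.items():
    consistent = expected_edges(s["graphs"], s["nodes"]) == s["edges"]
    print(f"{name}: edge count consistent with tree structure = {consistent}")
```

Both datasets satisfy the identity, which agrees with the description of the graphs as trees.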

In [6]:
# import torch
# torch.__version__
# ! nvcc --version
# ! pip install  dgl -f https://data.dgl.ai/wheels/cu121/repo.html

In [7]:
gossipcop_dataset = FakeNewsDataset('gossipcop', 'bert')

In [8]:
graph, label = gossipcop_dataset[0]
num_classes = gossipcop_dataset.num_classes
feat = gossipcop_dataset.feature
labels = gossipcop_dataset.labels

In [9]:
labels

tensor([0., 0., 0.,  ..., 1., 1., 1.], dtype=torch.float64)

# Covid Vaccine Tweets dataset

This dataset collects recent tweets about the Pfizer & BioNTech vaccine.

The data is collected using the tweepy Python package to access the Twitter API.

Inspiration: study the subjects of recent tweets about the vaccine made in collaboration by Pfizer and BioNTech, and perform various NLP tasks on this data source.

In [10]:
vaccination_dataset = pd.read_csv('../data/vaccination_tweets/vaccination_tweets.csv')

In [11]:
vaccination_dataset.head()

Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet
0,1340539111971516416,Rachel Roh,"La Crescenta-Montrose, CA",Aggregator of Asian American news; scanning di...,2009-04-08 17:52:46,405,1692,3247,False,2020-12-20 06:06:44,Same folks said daikon paste could treat a cyt...,['PfizerBioNTech'],Twitter for Android,0,0,False
1,1338158543359250433,Albert Fong,"San Francisco, CA","Marketing dude, tech geek, heavy metal & '80s ...",2009-09-21 15:27:30,834,666,178,False,2020-12-13 16:27:13,While the world has been on the wrong side of ...,,Twitter Web App,1,1,False
2,1337858199140118533,eli🇱🇹🇪🇺👌,Your Bed,"heil, hydra 🖐☺",2020-06-25 23:30:28,10,88,155,False,2020-12-12 20:33:45,#coronavirus #SputnikV #AstraZeneca #PfizerBio...,"['coronavirus', 'SputnikV', 'AstraZeneca', 'Pf...",Twitter for Android,0,0,False
3,1337855739918835717,Charles Adler,"Vancouver, BC - Canada","Hosting ""CharlesAdlerTonight"" Global News Radi...",2008-09-10 11:28:53,49165,3933,21853,True,2020-12-12 20:23:59,"Facts are immutable, Senator, even when you're...",,Twitter Web App,446,2129,False
4,1337854064604966912,Citizen News Channel,,Citizen News Channel bringing you an alternati...,2020-04-23 17:58:42,152,580,1473,False,2020-12-12 20:17:19,Explain to me again why we need a vaccine @Bor...,"['whereareallthesickpeople', 'PfizerBioNTech']",Twitter for iPhone,0,0,False
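
In the preview above, the `hashtags` column appears to hold Python-style lists stored as strings, with missing values where a tweet has no hashtags. A minimal sketch of turning those strings into real lists follows; the helper name `parse_hashtags` is hypothetical, and the storage format is an assumption based on the displayed rows.

```python
import ast


def parse_hashtags(value):
    """Turn a stringified list such as "['PfizerBioNTech']" into a Python list.

    Missing values (None, NaN, or an empty string) become an empty list.
    """
    if not isinstance(value, str) or not value.strip():
        return []
    # literal_eval only parses Python literals, so it is safer than eval here.
    return ast.literal_eval(value)


print(parse_hashtags("['whereareallthesickpeople', 'PfizerBioNTech']"))
print(parse_hashtags(float("nan")))
```

Applied to the DataFrame, this would look like `vaccination_dataset['hashtags'].apply(parse_hashtags)`.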


# Natural Language Processing with Disaster Tweets

https://www.kaggle.com/competitions/nlp-getting-started

Twitter has become an important communication channel in times of emergency.

The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:

The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.
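
To illustrate why the metaphor problem matters, here is a minimal sketch of a naive keyword matcher. It is not a method from the competition; the keyword set, the function name, and the metaphorical tweet are made up for illustration. It flags a literal and a metaphorical tweet alike, which is the failure a learned model is meant to avoid.

```python
import re

# Hypothetical keyword list; a real system would need far more than this.
DISASTER_KEYWORDS = {"ablaze", "fire", "earthquake", "wildfires", "evacuation"}


def keyword_baseline(text):
    """Return 1 if any disaster keyword occurs as a word in the text, else 0."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return int(bool(words & DISASTER_KEYWORDS))


literal = "Forest fire near La Ronge Sask. Canada"      # labelled 1 in train.csv
metaphor = "That sunset last night set the sky ABLAZE"  # hypothetical metaphorical tweet
print(keyword_baseline(literal), keyword_baseline(metaphor))
```

Both tweets are scored 1, even though only the first reports a real disaster.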

EDA example: https://www.kaggle.com/code/hamditarek/fake-news-detection-on-twitter-eda/notebook

In [12]:
disaster_dataset = pd.read_csv('../data/disaster_tweets/train.csv')

In [13]:
disaster_dataset.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


# FakeNewsNet

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UEMMHS

The complete dataset cannot be distributed because of Twitter privacy policies and news publishers' copyrights. Social engagements and user information are not disclosed because of Twitter policy. This code repository can be used to download news articles from published websites and relevant social media data from Twitter.

In [14]:
fnn_gossipcop_fake = pd.read_csv("../data/FakeNewsNet/gossipcop_fake.csv")

In [15]:
fnn_gossipcop_fake.head()

Unnamed: 0,id,news_url,title,tweet_ids
0,gossipcop-2493749932,www.dailymail.co.uk/tvshowbiz/article-5874213/...,Did Miley Cyrus and Liam Hemsworth secretly ge...,284329075902926848\t284332744559968256\t284335...
1,gossipcop-4580247171,hollywoodlife.com/2018/05/05/paris-jackson-car...,Paris Jackson & Cara Delevingne Enjoy Night Ou...,992895508267130880\t992897935418503169\t992899...
2,gossipcop-941805037,variety.com/2017/biz/news/tax-march-donald-tru...,Celebrities Join Tax March in Protest of Donal...,853359353532829696\t853359576543920128\t853359...
3,gossipcop-2547891536,www.dailymail.co.uk/femail/article-3499192/Do-...,Cindy Crawford's daughter Kaia Gerber wears a ...,988821905196158981\t988824206556172288\t988825...
4,gossipcop-5476631226,variety.com/2018/film/news/list-2018-oscar-nom...,Full List of 2018 Oscar Nominations – Variety,955792793632432131\t955795063925301249\t955798...
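
The `tweet_ids` column in the preview above looks like a single string of tweet IDs separated by tabs. A minimal sketch of splitting it into integers follows; the helper name `split_tweet_ids` is hypothetical, and the tab-separated layout is an assumption based on the `\t` visible in the displayed rows.

```python
def split_tweet_ids(value):
    """Split a whitespace-separated (here: tab-separated) string of tweet IDs into ints.

    Missing values (None or NaN) become an empty list.
    """
    if not isinstance(value, str):
        return []
    # str.split() with no argument splits on any whitespace, which covers tabs.
    return [int(tweet_id) for tweet_id in value.split()]


# The first two IDs of the first row shown above (the row is truncated in the display).
ids = split_tweet_ids("284329075902926848\t284332744559968256")
print(ids)
```

On the DataFrame this could be applied as `fnn_gossipcop_fake['tweet_ids'].apply(split_tweet_ids)`.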


### From Kaggle

This is a repository for an ongoing data collection project for fake news research at ASU. We describe and compare FakeNewsNet with other existing datasets in "Fake News Detection on Social Media: A Data Mining Perspective". We also perform a detailed analysis of the FakeNewsNet dataset and build a fake news detection model on it in "Exploiting Tri-Relationship for Fake News Detection".

EDA: https://www.kaggle.com/code/jaybhanushali1792/buzzfeed-news-analysis-and-classification

#### Buzzfeed

In [19]:
bf_fake_news_dataset = pd.read_csv("../data/FakeNewsNet/kaggle/BuzzFeed_fake_news_content.csv")

In [17]:
bf_fake_news_dataset.head()

Unnamed: 0,id,title,text,url,top_img,authors,source,publish_date,movies,images,canonical_link,meta_data
0,Fake_1-Webpage,Proof The Mainstream Media Is Manipulating The...,I woke up this morning to find a variation of ...,http://www.addictinginfo.org/2016/09/19/proof-...,http://addictinginfo.addictinginfoent.netdna-c...,Wendy Gittleson,http://www.addictinginfo.org,{'$date': 1474243200000},,"http://i.imgur.com/JeqZLhj.png,http://addictin...",http://addictinginfo.com/2016/09/19/proof-the-...,"{""publisher"": ""Addicting Info | The Knowledge ..."
1,Fake_10-Webpage,Charity: Clinton Foundation Distributed “Water...,Former President Bill Clinton and his Clinton ...,http://eaglerising.com/36899/charity-clinton-f...,http://eaglerising.com/wp-content/uploads/2016...,View All Posts,http://eaglerising.com,{'$date': 1474416521000},,http://constitution.com/wp-content/uploads/201...,http://eaglerising.com/36899/charity-clinton-f...,"{""description"": ""The possibility that CHAI dis..."
2,Fake_11-Webpage,A Hillary Clinton Administration May be Entire...,After collapsing just before trying to step in...,http://eaglerising.com/36880/a-hillary-clinton...,http://eaglerising.com/wp-content/uploads/2016...,"View All Posts,Tony Elliott",http://eaglerising.com,{'$date': 1474416638000},,http://constitution.com/wp-content/uploads/201...,http://eaglerising.com/36880/a-hillary-clinton...,"{""description"": ""Hillary Clinton may be the fi..."
3,Fake_12-Webpage,Trump’s Latest Campaign Promise May Be His Mos...,"Donald Trump is, well, deplorable. He’s sugges...",http://www.addictinginfo.org/2016/09/19/trumps...,http://addictinginfo.addictinginfoent.netdna-c...,John Prager,http://www.addictinginfo.org,{'$date': 1474243200000},,"http://i.imgur.com/JeqZLhj.png,http://2.gravat...",http://addictinginfo.com/2016/09/19/trumps-lat...,"{""publisher"": ""Addicting Info | The Knowledge ..."
4,Fake_13-Webpage,Website is Down For Maintenance,Website is Down For Maintenance,http://www.proudcons.com/clinton-foundation-ca...,,,http://www.proudcons.com,,,,,"{""og"": {""url"": ""http://www.proudcons.com"", ""ty..."
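
The `publish_date` values in the preview above look like stringified dicts of the form `{'$date': ...}`. A minimal sketch of converting them to datetimes follows, assuming the number is milliseconds since the Unix epoch in UTC (an assumption, not something the dataset documents here); `parse_publish_date` is a hypothetical helper name.

```python
import ast
from datetime import datetime, timezone


def parse_publish_date(value):
    """Convert a string like "{'$date': 1474243200000}" to an aware UTC datetime.

    Assumes '$date' holds milliseconds since the Unix epoch; returns None for
    missing values.
    """
    if not isinstance(value, str):
        return None
    millis = ast.literal_eval(value)["$date"]
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


published = parse_publish_date("{'$date': 1474243200000}")
print(published.isoformat())
```

For the first row this prints `2016-09-19T00:00:00+00:00`, which matches the `2016/09/19` in that article's URL and lends some support to the milliseconds assumption.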


In [20]:
bf_real_news_dataset = pd.read_csv("../data/FakeNewsNet/kaggle/BuzzFeed_real_news_content.csv")

In [21]:
pf_real_news_dataset = pd.read_csv("../data/FakeNewsNet/kaggle/PolitiFact_real_news_content.csv")

In [None]:
# BuzzFeedNews.txt

In [None]:
# BuzzFeedNewsUser.txt

In [None]:
# BuzzFeedUser.txt

In [None]:
# BuzzFeedUserFeature.mat

In [None]:
# BuzzFeedUserUser.txt

#### Politifact

In [22]:
pf_real_news_dataset.head()

Unnamed: 0,id,title,text,url,top_img,authors,source,publish_date,movies,images,canonical_link,meta_data
0,Real_1-Webpage,Trump Just Insulted Millions Who Lost Everythi...,16.8k SHARES SHARE THIS STORY\n\nHillary Clint...,http://occupydemocrats.com/2016/09/27/trump-ju...,http://occupydemocrats.com/wp-content/uploads/...,"Brett Bose,Grant Stern,Steve Bernstein,Natalie...",http://occupydemocrats.com,{'$date': 1474934400000},,http://occupydemocrats.com/wp-content/uploads/...,http://occupydemocrats.com/2016/09/27/trump-ju...,"{""generator"": ""Powered by Visual Composer - dr..."
1,Real_10-Webpage,Famous dog killed in spot she waited a year fo...,Famous dog killed in spot she waited a year fo...,http://rightwingnews.com/top-news/famous-dog-k...,http://rightwingnews.com/wp-content/uploads/20...,,http://rightwingnews.com,{'$date': 1474948336000},,http://rightwingnews.com/wp-content/uploads/20...,http://rightwingnews.com/top-news/famous-dog-k...,"{""googlebot"": ""noimageindex"", ""og"": {""site_nam..."
2,Real_100-Webpage,House oversight panel votes Clinton IT chief i...,Story highlights The House Oversight panel vot...,http://cnn.it/2deaH2d,http://i2.cdn.cnn.com/cnnnext/dam/assets/16091...,"Tom Lobianco,Deirdre Walsh",http://cnn.it,,,http://i2.cdn.cnn.com/cnnnext/dam/assets/17050...,http://www.cnn.com/2016/09/22/politics/bryan-p...,"{""description"": ""Members of the House Oversigh..."
3,Real_101-Webpage,America Just Tragically Lost A Country Music I...,We are absolutely heartbroken to hear about th...,http://newsbake.com/entertainment-news/music-e...,http://newsbake.com/wp-content/uploads/2016/05...,Nancy Wells,http://newsbake.com,{'$date': 1474898600000},https://www.youtube.com/embed/8ozTJcu-_BU,http://0.gravatar.com/avatar/0d702c6042933cd78...,http://newsbake.com/entertainment-news/music-e...,"{""shareaholic"": {""site_name"": ""NewsBake"", ""lan..."
4,Real_102-Webpage,Monuments to the Battle for the New South,"Nine years ago, a driver lost control of his p...",http://politi.co/2dd9U1x,http://static.politico.com/25/ed/85332de14c45b...,"Jack Shafer,Lisa Rab",http://politi.co,{'$date': 1473941820000},,http://static.politico.com/25/ed/85332de14c45b...,http://www.politico.com/magazine/story/2016/09...,"{""description"": ""Virginia, increasingly divers..."


# FNC-1 (Fake News Challenge Stage 1) #FakeNewsChallenge

https://github.com/FakeNewsChallenge

In [27]:
fnc_stance_detection = pd.read_csv("../data/fnc_stance_detection/train_bodies.csv")

In [28]:
fnc_stance_detection.head()

Unnamed: 0,Body ID,articleBody
0,0,A small meteorite crashed into a wooded area i...
1,4,Last week we hinted at what was to come as Ebo...
2,5,(NEWSER) – Wonder how long a Quarter Pounder w...
3,6,"Posting photos of a gun-toting child online, I..."
4,7,At least 25 suspected Boko Haram insurgents we...


In [34]:
print(fnc_stance_detection['articleBody'][0])

A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports. 

Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. 
Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.

The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.

Humberto Garcia

In [29]:
fnc_stance_detection_stances = pd.read_csv("../data/fnc_stance_detection/train_stances.csv")

In [36]:
fnc_stance_detection_stances[fnc_stance_detection_stances['Stance'] == "agree"].head()

Unnamed: 0,Headline,Body ID,Stance
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
5,'Nasa Confirms Earth Will Experience 6 Days of...,154,agree
8,Banksy 'Arrested & Real Identity Revealed' Is ...,1739,agree
11,Woman detained in Lebanon is not al-Baghdadi's...,1468,agree
17,"No, Robert Plant Didn’t Rip Up an $800 Million...",295,agree


In [37]:
fnc_stance_detection_stances[fnc_stance_detection_stances['Body ID'] == 158].head()

Unnamed: 0,Headline,Body ID,Stance
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
3107,It's 'rubbish' that Robert Plant turned down £...,158,unrelated
6392,Robert Plant ripped up $800M Led Zeppelin reun...,158,unrelated
8059,ISIS Militant “Jihadi John” Identified As Youn...,158,unrelated
11688,Claim: Comcast Got Complaining Customer Fired ...,158,unrelated
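
The two FNC-1 files are linked by `Body ID`: each row of `train_stances.csv` pairs a headline with an article body from `train_bodies.csv`. A plain-Python sketch of that join is shown below; the function name and the sample rows are hypothetical placeholders, not data from the files.

```python
def join_stances_with_bodies(stances, bodies):
    """Attach the article body to each stance row via the shared 'Body ID' key."""
    body_by_id = {row["Body ID"]: row["articleBody"] for row in bodies}
    return [
        {**stance, "articleBody": body_by_id[stance["Body ID"]]}
        for stance in stances
        if stance["Body ID"] in body_by_id  # inner join: drop stances without a body
    ]


# Hypothetical rows for illustration only; the texts are placeholders.
bodies = [{"Body ID": 158, "articleBody": "<body text of article 158>"}]
stances = [
    {"Headline": "<related headline>", "Body ID": 158, "Stance": "agree"},
    {"Headline": "<other headline>", "Body ID": 158, "Stance": "unrelated"},
    {"Headline": "<headline whose body is not loaded>", "Body ID": 999, "Stance": "unrelated"},
]
pairs = join_stances_with_bodies(stances, bodies)
print(len(pairs), [pair["Stance"] for pair in pairs])
```

With the DataFrames loaded above, `fnc_stance_detection_stances.merge(fnc_stance_detection, on="Body ID")` performs the equivalent inner join in pandas.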


# CREDBANK

In [None]:
# https://github.com/compsocial/CREDBANK-data

In [None]:
# https://www.kaggle.com/datasets/fabliha6459345398/credbank-dataset
# ~20 gb of data!

# PHEME

In [None]:
# https://github.com/kochkinaelena/Multitask4Veracity?tab=readme-ov-file

In [None]:
# https://www.kaggle.com/datasets/usharengaraju/pheme-dataset

# Fake News Detection

### ISOT Fake News Dataset

### PolitiFact

### Covid Fake News Dataset

### UPFD (User Preference-aware Fake News Detection)

# MuMiN

In [1]:
! pip uninstall -y mumin
! pip install mumin

Found existing installation: mumin 1.8.0
Uninstalling mumin-1.8.0:
  Successfully uninstalled mumin-1.8.0
Collecting mumin
  Obtaining dependency information for mumin from https://files.pythonhosted.org/packages/10/d2/47adce9e07968bbe3d60e7cbafabd40056e013ccc2491e81026fe567fa1c/mumin-1.8.0-py3-none-any.whl.metadata
  Using cached mumin-1.8.0-py3-none-any.whl.metadata (5.4 kB)
Using cached mumin-1.8.0-py3-none-any.whl (31 kB)
Installing collected packages: mumin
Successfully installed mumin-1.8.0


In [2]:
import mumin

In [11]:
dataset = mumin.MuminDataset(twitter_bearer_token="")

In [14]:
dataset.dataset_path

WindowsPath('mumin-small.zip')

In [16]:
dataset.compile()

2024-07-31 20:10:18,112 [INFO] Loading dataset
2024-07-31 20:10:21,093 [INFO] Shrinking dataset
2024-07-31 20:10:21,550 [INFO] Rehydrating tweet nodes


Rehydrating:   0%|          | 0/5261 [00:00<?, ?it/s]

2024-07-31 20:10:21,960 [ERROR] [403] {"client_id":"29128305","detail":"When authenticating requests to the Twitter API v2 endpoints, you must use keys and tokens from a Twitter developer App that is attached to a Project. You can create a project via the developer portal.","registration_url":"https://developer.twitter.com/en/docs/projects/overview","title":"Client Forbidden","required_enrollment":"Appropriate Level of API Access","reason":"client-not-enrolled","type":"https://api.twitter.com/2/problems/client-forbidden"}
2024-07-31 20:10:22,386 [ERROR] [403] {"client_id":"29128305","detail":"When authenticating requests to the Twitter API v2 endpoints, you must use keys and tokens from a Twitter developer App that is attached to a Project. You can create a project via the developer portal.","registration_url":"https://developer.twitter.com/en/docs/projects/overview","title":"Client Forbidden","required_enrollment":"Appropriate Level of API Access","reason":"client-not-enrolled","type"

KeyboardInterrupt: 
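
One way to keep the bearer token out of the notebook is to read it from an environment variable and fail early when it is missing. The sketch below assumes a variable named `TWITTER_BEARER_TOKEN` and a helper called `get_bearer_token`, both hypothetical. It does not resolve the 403 above, which concerns the developer App's enrollment in a Project rather than how the token is passed.

```python
import os


def get_bearer_token(env_var="TWITTER_BEARER_TOKEN"):
    """Read the Twitter bearer token from an environment variable.

    Raises a RuntimeError early if the variable is unset or empty.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before compiling the dataset."
        )
    return token


# Intended use with the cells above (not run here):
# dataset = mumin.MuminDataset(twitter_bearer_token=get_bearer_token())
```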

In [17]:
! pip install  dgl -f https://data.dgl.ai/wheels/cu121/repo.html

Looking in links: https://data.dgl.ai/wheels/cu121/repo.html


# https://zubiaga.org/datasets/