Misinformation_Baselines

Misinformation detection is one of the major problems facing the Natural Language Processing community, and research on curbing the spread of fake news is of prime importance. In this project, I implement baseline models for three stance detection misinformation datasets.

Datasets

ByteDance Dataset -

The ByteDance dataset was released by ByteDance (a China-based global Internet technology company) as the competition dataset for the task Fake News Classification. The training set consists of 320,767 news pairs with 3 class labels: agreed, disagreed, and unrelated. The test set contains 80,126 news pairs without labels. The news pairs are available in both Chinese and English.

Fake News Challenge-1 Dataset -

This dataset was provided by Fake News Challenge-1 (FNC-1) and is derived from Emergent (a digital journalism project for rumor debunking). There are 49,972 headline-body pairs in total, with stances labeled by expert journalists. Each instance pairs a headline with a body text, taken either from the same news article or from two different articles, and carries one of the stance labels Agrees, Disagrees, Discusses, or Unrelated.

Covid-Stance Dataset -

This is a stance detection dataset of user-generated content on Twitter in the context of COVID-19. It is a collection of approximately 14 thousand tweets, manually annotated with the tweet author's opinion on the use of “chloroquine” and “hydroxychloroquine” to prevent or treat COVID-19. Each instance belongs to one of three classes: Neutral, Against, or Favor.
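As a quick reference, the label sets of the three datasets can be collected in one mapping and turned into class indices. This is a minimal sketch: the dictionary keys, the label order, and the helper name `build_label_index` are illustrative assumptions, not names taken from the repository.

```python
# Label sets as described in the dataset summaries above.
# The keys and the ordering of labels are assumptions made for illustration.
STANCE_LABELS = {
    "ByteDance": ["agreed", "disagreed", "unrelated"],
    "FNC-1": ["Agrees", "Disagrees", "Discusses", "Unrelated"],
    "Covid-Stance": ["Neutral", "Against", "Favor"],
}


def build_label_index(dataset_name):
    """Map each stance label of a dataset to an integer class index."""
    return {label: i for i, label in enumerate(STANCE_LABELS[dataset_name])}


if __name__ == "__main__":
    for name in STANCE_LABELS:
        print(name, build_label_index(name))
```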

Baselines

Siamese BiLSTM + LSTM -

Siamese LSTMs are used because each dataset consists of source-target pairs, and we observe that the Siamese (shared-weight) encoders learn better than separate LSTMs.
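The weight sharing can be sketched as one encoder applied to both sides of the pair. The sketch below uses PyTorch, which is an assumption about the framework. The class name, the layer sizes, the stacking of a BiLSTM followed by an LSTM, and the concatenation of the two encodings before the classifier are also illustrative assumptions, not the repository's actual configuration.

```python
import torch
import torch.nn as nn


class SiameseBiLSTM(nn.Module):
    """Hypothetical Siamese encoder: the same BiLSTM + LSTM stack encodes source and target."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=64, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.lstm = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        # Assumption: the two encodings are concatenated before classification.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def encode(self, token_ids):
        x = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        x, _ = self.bilstm(x)              # (batch, seq_len, 2 * hidden_dim)
        _, (h, _) = self.lstm(x)           # h: (1, batch, hidden_dim)
        return h[-1]                       # final hidden state per example

    def forward(self, source_ids, target_ids):
        # Both inputs pass through the same modules, so all encoder weights are shared.
        source_vec = self.encode(source_ids)
        target_vec = self.encode(target_ids)
        return self.classifier(torch.cat([source_vec, target_vec], dim=-1))


if __name__ == "__main__":
    model = SiameseBiLSTM(vocab_size=100)
    source = torch.randint(1, 100, (4, 12))   # 4 source texts, 12 tokens each
    target = torch.randint(1, 100, (4, 20))   # 4 target texts, 20 tokens each
    print(model(source, target).shape)
```

A non-Siamese variant would instantiate two separate encoder stacks, doubling the encoder parameters.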

MultiChannel CNNs + LSTM -

MultiChannel CNNs are used to capture local features in the input and have proved effective in various NLP tasks (beyond their traditional applications in computer vision). The features generated by the CNNs and the LSTMs are added and then fed to a dense layer for the final classification.
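The additive combination of CNN and LSTM features described above might look like the following PyTorch sketch. The kernel sizes, feature dimensions, the max-over-time pooling, the projection layer that brings the CNN features to the LSTM feature size, and the single-sequence input are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class MultiChannelCNNLSTM(nn.Module):
    """Hypothetical model: parallel CNN channels and an LSTM whose features are summed."""

    def __init__(self, vocab_size, embed_dim=300, feat_dim=64,
                 kernel_sizes=(2, 3, 4), num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One convolution per channel, each with a different kernel width.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, feat_dim, k) for k in kernel_sizes]
        )
        # Assumption: project concatenated channel features to feat_dim so they can be added.
        self.cnn_proj = nn.Linear(feat_dim * len(kernel_sizes), feat_dim)
        self.lstm = nn.LSTM(embed_dim, feat_dim, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)                 # (batch, seq_len, embed_dim)
        channels_first = x.transpose(1, 2)            # (batch, embed_dim, seq_len)
        # Max-over-time pooling of each channel's activations.
        pooled = [torch.relu(conv(channels_first)).max(dim=2).values
                  for conv in self.convs]
        cnn_feat = self.cnn_proj(torch.cat(pooled, dim=1))   # (batch, feat_dim)
        _, (h, _) = self.lstm(x)
        lstm_feat = h[-1]                                     # (batch, feat_dim)
        fused = cnn_feat + lstm_feat                          # element-wise addition
        return self.classifier(fused)


if __name__ == "__main__":
    model = MultiChannelCNNLSTM(vocab_size=100)
    tokens = torch.randint(1, 100, (4, 16))   # sequence must be at least as long as the widest kernel
    print(model(tokens).shape)
```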

Method

Word Features only -

With the above two models, we use only the word-level embeddings obtained from GloVe 300d as the initial embeddings.
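One common way to initialize an embedding layer from GloVe is to build a matrix whose rows follow the model's vocabulary indices. The sketch below assumes the standard GloVe text format (a token followed by space-separated floats on each line). The function name, the `word_to_index` mapping with indices starting at 1, and the reserved all-zero row 0 for padding are illustrative assumptions.

```python
import numpy as np


def build_embedding_matrix(glove_lines, word_to_index, dim=300):
    """Build a (vocab_size + 1, dim) matrix from GloVe-format text lines.

    Row 0 is kept as zeros for padding; words missing from GloVe also stay zero.
    Assumes word_to_index maps words to indices 1..len(word_to_index).
    """
    matrix = np.zeros((len(word_to_index) + 1, dim), dtype=np.float32)
    for line in glove_lines:
        # Split from the right so tokens containing spaces are kept whole.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        index = word_to_index.get(parts[0])
        if index is not None:
            matrix[index] = [float(value) for value in parts[1:]]
    return matrix


if __name__ == "__main__":
    # Tiny 3-dimensional stand-in for a real GloVe file.
    toy_lines = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6", "dog 1.0 1.0 1.0"]
    toy_vocab = {"the": 1, "cat": 2, "unseen": 3}
    print(build_embedding_matrix(toy_lines, toy_vocab, dim=3))
```

With a real GloVe file, an open file handle can be passed as `glove_lines`, since the function only iterates over lines.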

Word + Sentence Level -

Here, along with the word representations, we generate sentence-level representations using BERT (Bidirectional Encoder Representations from Transformers). The two features have separate encoders, and the representations obtained from these encoders are fused and passed to the dense layer for the final classification.
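A minimal PyTorch sketch of the two-encoder fusion follows. It assumes the BERT sentence vectors are precomputed (768 dimensions, the hidden size of BERT-base) and passed in as tensors. The fusion by concatenation, the choice of encoders, the layer sizes, and the class name are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn


class WordSentenceFusion(nn.Module):
    """Hypothetical fusion of a word-level encoder and a sentence-level encoder."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=64,
                 sentence_dim=768, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Word-level encoder over token embeddings (e.g. initialized from GloVe).
        self.word_encoder = nn.LSTM(embed_dim, hidden_dim,
                                    batch_first=True, bidirectional=True)
        # Sentence-level encoder over precomputed BERT sentence vectors.
        self.sentence_encoder = nn.Sequential(
            nn.Linear(sentence_dim, 2 * hidden_dim), nn.ReLU()
        )
        # Assumption: fusion is a concatenation of the two representations.
        self.classifier = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, token_ids, sentence_vec):
        _, (h, _) = self.word_encoder(self.embedding(token_ids))
        # h[0] is the forward direction's final state, h[1] the backward direction's.
        word_feat = torch.cat([h[0], h[1]], dim=-1)        # (batch, 2 * hidden_dim)
        sent_feat = self.sentence_encoder(sentence_vec)    # (batch, 2 * hidden_dim)
        fused = torch.cat([word_feat, sent_feat], dim=-1)  # (batch, 4 * hidden_dim)
        return self.classifier(fused)


if __name__ == "__main__":
    model = WordSentenceFusion(vocab_size=100)
    tokens = torch.randint(1, 100, (4, 16))
    sentence_vectors = torch.randn(4, 768)   # stand-in for BERT sentence embeddings
    print(model(tokens, sentence_vectors).shape)
```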

