
Misinformation-textAnalysis

Misinformation, Fake News and Propaganda

Introduction

Disinformation, defined as the subset of misinformation where there is intent to mislead, has seen an astronomical rise, both in its success in terms of spread and impact and in the effort to combat it. Though known state-backed disinformation campaigns date back to at least the Cold War era, they arguably only caught the public eye after the 2016 US presidential election (Allcott & Gentzkow, 2017). Later in 2016, disinformation was claimed to have influenced the 'Brexit' referendum, and in 2018 a similar development was suspected during the Brazilian presidential election. As this unfolded, so-called 'fact-checking' organisations grew explosively, as did their cooperation with news agencies, social media companies (e.g. Facebook/Meta) and governments. These organisations, often volunteer-based or financed through charity, tend to have their capacity outpaced by the sheer volume of suspected disinformation content.

One proposed solution is to use AI to automate the classification of news articles or posts based on their linguistic features. Though improvements are being made here, the most accurate models depend strongly on metadata, such as the publication network, which is often unavailable when content is shared through social media. They also have other downsides, such as disproportionately high false-positive rates when outlets that have previously shared disinformation publish content that contains no disinformation.

This project offers an alternative, 'intermediate' solution. Instead of classification, we aim to discover the topics present in Russian propaganda. This can streamline fact-checking by establishing a basis on which already fact-checked disinformation and propaganda can be matched with newly published, unchecked articles and posts. To build our mixed-membership model we use Latent Dirichlet Allocation (LDA), a three-level hierarchical Bayesian model in which each item of a collection is modelled as a finite mixture over an underlying set of topics (Blei et al., 2001).
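To make the modelling step concrete, below is a minimal sketch of fitting an LDA model with the tidytext and topicmodels packages listed under 'Packages & Libraries'. It assumes a data frame docs with columns doc_id and text; those names, and the choice of k = 10 topics, are illustrative assumptions rather than the repository's actual settings.

library(dplyr)
library(tidytext)
library(topicmodels)

# Build a document-term matrix from raw text (docs is an assumed input)
dtm <- docs %>%
  unnest_tokens(word, text) %>%                 # one token (word) per row
  anti_join(get_stopwords(), by = "word") %>%   # drop common stopwords
  count(doc_id, word) %>%                       # term counts per document
  cast_dtm(doc_id, word, n)                     # matrix format expected by topicmodels

# Fit LDA; k is the number of topics and would normally be tuned (e.g. with ldatuning)
lda_fit <- LDA(dtm, k = 10, control = list(seed = 1234))

# Per-topic word probabilities (beta) and per-document topic mixtures (gamma)
topic_words <- tidy(lda_fit, matrix = "beta")
doc_mixtures <- tidy(lda_fit, matrix = "gamma")

The gamma matrix is what makes this a mixed-membership model: each document receives a probability over every topic rather than a single hard label.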

Project scope

This project restricts itself to pro-Kremlin disinformation. We say 'pro-Kremlin' rather than 'Kremlin-backed' because direct ties with the Russian Internet Research Agency (IRA) and official backing by the Kremlin are perhaps expected but are not verified with hard proof. The scope is also limited to a preliminary test of the theoretical possibility and validity of LDA mixed-membership modelling, without going into the practical application of the results.

About the data

The disinformation texts were collected by the EUvsDisinfo project, started in 2015, which identifies and fact-checks disinformation cases originating from pro-Kremlin media and spread across the EU. More information about the project can be found here: https://euvsdisinfo.eu/. The dataset collected from EUvsDisinfo runs from 2015 to 2019 and can be found here: https://www.kaggle.com/datasets/stevenpeutz/misinformation-fake-news-text-dataset-79k
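As a starting point, the Kaggle export can be read into R with readr (part of the tidyverse listed below); the file name in this sketch is an assumption about the downloaded file, not a path taken from this repository.

library(readr)

# Read the downloaded Kaggle CSV (file name is assumed; adjust to your download)
docs <- read_csv("EUvsDisinfo_dataset.csv")
str(docs)  # inspect the available columns before preprocessing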

Packages & Libraries

Required packages & libraries

packages <- c("textstem","tokenizers","tidytext","dplyr","stringr","corpus","tidyverse","stopwords","SnowballC","tidyr","topicmodels","ldatuning","wordcloud","stm","Rtsne","ggrepel","knitr")

# Install only the packages that are not already present
install.packages(setdiff(packages, rownames(installed.packages())))
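After installation, the packages can be attached in one step; this loop is a small convenience sketch rather than code from the repository.

# Attach every package in the list (library() needs character.only for string names)
invisible(lapply(packages, library, character.only = TRUE))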
