
texty/manipulative_news_methodology


How we used NLP to analyse more than 2 million news items from hundreds of "junk sites"

This is the methodology for the "We've got a bad news" project (in English).

Repository structure

  • classifier - scripts for training and applying the language model classifier
  • data_collection - scripts to load RSS feeds and Facebook feeds of selected sites, and a scrapy project to load the HTML of each article
  • data_processing - scripts to prepare data for the classifier
  • Aggregated ranking - grouped results for the whole news database. In the final product we do not include Russian sites, big Ukrainian sites, or sites with less than 25% manipulative news
  • .._annotation.csv - annotated sample of news HTMLs. html_id is the key id of the article in the data file; the other columns are annotations
  • cls_tool - Django site for annotation

Table of contents

  1. Data
  2. Annotation
  3. Classification
  4. Final ranking

Data

Scripts for data collection and their descriptions are in the data_collection folder.

The data can be downloaded here (2 GB). html_id is the key field, ra_summary is the readability HTML of the article page, and real_url is the link to the article.

In total we collected 306 500 articles in Ukrainian and 2 301 000 articles in Russian. Next we filtered out articles not about Ukrainian politics and society (excluding celebrities, international news, etc.). This left 1 174 000 relevant articles in Russian and 227 400 in Ukrainian. The websites in the final ranking produced 289 300 relevant articles in total.

The data for the project are news items from around 200 websites, collected from December 2017 until November 2018. For each site we collected the RSS feed every hour, as well as daily Facebook feeds. Collection broke off several times for technical reasons.

For every link from a site's RSS or Facebook feed we downloaded the full text and processed it with a readability algorithm (Mozilla's Readability and Python readability). Readability parsing errors occur in less than 5% of cases, with no significant error rate for any individual website. Next we removed HTML tags and tokenized the text.
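A minimal sketch of this step, assuming the Python readability-lxml package and naive whitespace tokenization (the function name and details are illustrative, not the project's actual code):

```python
import requests
from lxml import html as lxml_html
from readability import Document  # readability-lxml package

def extract_article_tokens(url):
    """Download a page, keep its readable part, strip tags, tokenize."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    doc = Document(response.text)
    readable_html = doc.summary()  # article body as HTML (cf. ra_summary)

    # Strip remaining HTML tags, then naive whitespace tokenization.
    text = lxml_html.fromstring(readable_html).text_content()
    return text.split()
```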

Annotation

Please find the annotation tool in the cls_tool folder.
We invited journalists with experience as newsfeed editors to label the training set. In total we collected 1300 relevant annotated articles in Ukrainian and 6000 in Russian.
All annotators were interviewed and instructed about the possible labels of manipulative news. During annotation we maintained a Facebook group to discuss uncertainties and labeling in general. We controlled annotation quality by monitoring the labels and the time intervals between annotations (whether the time between two labels was enough to read the article).
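For illustration, the timing check could look like the sketch below (the DataFrame columns and the threshold are hypothetical):

```python
import pandas as pd

MIN_SECONDS = 60  # illustrative threshold, not from the project

def flag_fast_annotations(df):
    """df columns (assumed): annotator, html_id, labeled_at (datetime)."""
    df = df.sort_values(["annotator", "labeled_at"])
    gaps = df.groupby("annotator")["labeled_at"].diff().dt.total_seconds()
    # Labels placed sooner than MIN_SECONDS after the previous one are
    # unlikely to reflect a full reading of the article.
    return df[gaps < MIN_SECONDS]
```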
Initially we used the following labels for annotation (text exactly as written in the annotator instructions):

  1. Fully fictional news
  2. Manipulative title / click-bait
  3. Conspiracies / Pseudo-science
  4. Emotionally charged news
  5. Manipulation through bad argumentation
  6. Political conspiracies
  7. Normal text
  8. Other (non-relevant content)

In the end we kept only emotional manipulation and manipulative argumentation. The notion of clickbait headlines turned out to be ambiguous, and we did not manage to build a working classifier for this type of manipulation. The remaining manipulation types occurred rarely, so there were not enough positive examples for training.

Classification

The classifier folder contains links to pretrained models, classification scripts, instructions on how to download the libraries, and test datasets.

We tried various NLP approaches to detect manipulation in news: machine learning models on bag-of-words and document vectors, and an LSTM on word vectors. In the end we used text classification with the language model developed by fast.ai (the ULMFiT approach). Code for training language models on a Wikipedia corpus can be found here.

We used example code from the fast.ai course to train the classifiers and found most of the defaults to work best for our data. We increased the dropouts for training the Ukrainian classifier and added a multilabel final layer to detect multiple manipulation types at once (the multilabel classifier was as accurate as, or better than, individual classifiers for each manipulation type).
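As a sketch of what this training setup looks like with the fastai v1 API (the file name, column names, and hyperparameters here are illustrative assumptions, not the project's exact settings):

```python
import pandas as pd
from fastai.text import (TextClasDataBunch, AWD_LSTM,
                         text_classifier_learner)

df = pd.read_csv("annotated_articles.csv")       # hypothetical file
valid = df.sample(frac=0.2, random_state=0)
train = df.drop(valid.index)

# Two label columns -> fastai builds a multilabel classifier with a
# sigmoid output and binary cross-entropy loss.
data_clas = TextClasDataBunch.from_df(
    ".", train_df=train, valid_df=valid,
    text_cols="text", label_cols=["emotional", "argumentation"],
)

# drop_mult scales all dropouts; it can be raised for the smaller
# Ukrainian corpus, as described above.
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.7)
learn.load_encoder("wiki_lm_encoder")            # fine-tuned LM encoder
learn.fit_one_cycle(4, 1e-2)
```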

The language model contains:

  1. an input layer of vocabulary size (up to 60 000 tokens that occur more than 10 times)
  2. an embedding layer of size 400
  3. 3 LSTM layers of 1150 cells each
  4. the model output is the result of "concat pooling": the last hidden LSTM state, max-pooling over the LSTM states, and the average of the LSTM states, over up to bptt last activations. The size of the LM output is 3 * embedding size

The final feedforward layer for language-model training predicts the next word.
For classification we replace this last layer with a feedforward network of 50 and then 2 cells, since we have 2 categories to classify. We also changed the default categorical cross-entropy loss to binary cross-entropy, and the softmax activation to sigmoid, in order to perform multilabel classification.
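A minimal PyTorch sketch of this classification head, written from the description above (not taken from the project code):

```python
import torch
import torch.nn as nn

class ConcatPoolingHead(nn.Module):
    def __init__(self, emb_size=400, n_labels=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(3 * emb_size, 50),  # concat pooling -> 50 cells
            nn.ReLU(),
            nn.Linear(50, n_labels),      # 2 manipulation types
        )

    def forward(self, lstm_out):
        # lstm_out: (batch, seq_len, emb_size) activations of the last
        # LSTM layer for up to bptt time steps.
        last = lstm_out[:, -1]                 # last hidden state
        max_pool = lstm_out.max(dim=1).values  # max-pooling over states
        avg_pool = lstm_out.mean(dim=1)        # average of states
        pooled = torch.cat([last, max_pool, avg_pool], dim=1)
        # Sigmoid outputs + binary cross-entropy give multilabel scores.
        return torch.sigmoid(self.classifier(pooled))
```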

You can download and use all models according to the project's license.

Final ranking

In the final ranking we kept only sites with more than 200 relevant news items and more than 25% manipulative news. The ranking is a simple aggregation of the classification results.
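A hypothetical sketch of that aggregation with pandas (the file and column names are assumptions):

```python
import pandas as pd

preds = pd.read_csv("classified_articles.csv")  # illustrative file
# Expected columns (assumed): site, is_manipulative (0/1).

ranking = (
    preds.groupby("site")
    .agg(n_items=("is_manipulative", "size"),
         manip_share=("is_manipulative", "mean"))
    .query("n_items > 200 and manip_share > 0.25")
    .sort_values("manip_share", ascending=False)
)
print(ranking)
```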

Confusion matrix of the classifier's predictions on the validation set (threshold = 0.41, True = the item is emotionally manipulative):

                      Classifier: True   Classifier: False
Ground truth: True                 139                 115
Ground truth: False                123                1138
ROC curves for the classifiers (built on the validation set):
[figures: ROC curve for Russian, ROC curve for Ukrainian]
[figure: modelled distribution of scores; the share of positive examples (emotional news) in the population is 20%]
  • True Positive Rate: TPR = TP / (TP + FN)
  • True Negative Rate: TNR = TN / (TN + FP)
  • Geometric mean of accuracy: GA = sqrt(TPR * TNR)
  • TP - true positives, FP - false positives, etc.
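Plugging the confusion matrix above into these formulas:

```python
from math import sqrt

tp, fn = 139, 115   # ground truth True row
fp, tn = 123, 1138  # ground truth False row

tpr = tp / (tp + fn)   # ~0.547
tnr = tn / (tn + fp)   # ~0.902
ga = sqrt(tpr * tnr)   # ~0.703
```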

Performance of Language Model

Language models' perplexities before and after fine-tuning:

(Accuracy below is for the language model (LM), not for the classifier. Roughly, it is the share of words correctly predicted by the language model given some input sequence.)

The LM on the Wikipedia corpus was trained with a vocabulary of at most 30k tokens, while for fine-tuning we used up to 60k tokens.


                     ru                                uk
                     loss     accuracy  perplexity     loss     accuracy  perplexity
wiki LM              6.2167   0.1611    501.0470       5.34244  0.1963    209.0221
after fine-tuning    3.4486   0.3915    31.4563        3.5339   0.3711    34.2580
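Note that perplexity here is simply the exponential of the cross-entropy loss, which the table confirms:

```python
from math import exp

print(exp(6.2167))  # ~501.05, the ru wiki LM perplexity
print(exp(3.4486))  # ~31.46, the ru perplexity after fine-tuning
```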
