Fake News Challenge (FNC-1)

This GitHub repository contains all of the code and data for our final NLP class project. Our initial project proposal is available as well.

Data

The original dataset and scorer script are located in the fnc-1/ directory. All additional data files can be found in the data/ directory, including the following:

  • Generated feature files for the train and test sets: feats.train.all.csv and feats.competition_test.all.csv
  • Binary file containing 300-dimensional word2vec embeddings: GoogleNews-vectors-negative300.bin
  • Serialized LDA topic model and dictionary: lda.model.pkl and lda.dct.pkl
  • "The New York Times Newswire Service" portion of the English Gigaword corpus: nyt.txt
  • Subdirectory for sentiment analysis: sentiment/
  • Serialized count and TF-IDF vectorizer models: vectorizer.count.pkl and vectorizer.tfidf.pkl
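For illustration, these serialized artifacts can be loaded along the following lines. This is a sketch that assumes gensim is installed and that the .pkl files were written with pickle; the project's own loading code lives in the scripts and may differ:

import pickle

from gensim.models import KeyedVectors

# Load the binary 300-dimensional word2vec embeddings
embeddings = KeyedVectors.load_word2vec_format(
    "data/GoogleNews-vectors-negative300.bin", binary=True)

# Unpickle the serialized LDA model and TF-IDF vectorizer
with open("data/lda.model.pkl", "rb") as f:
    lda = pickle.load(f)
with open("data/vectorizer.tfidf.pkl", "rb") as f:
    tfidf = pickle.load(f)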

Scripts

All Python and bash scripts are included in the scripts/ directory. To run them, create a conda environment and install the requirements:

[FakeNewsChallenge]$ conda create -n fnc-nyu python=3.6
[FakeNewsChallenge]$ source activate fnc-nyu
(fnc-nyu) [FakeNewsChallenge]$ pip install -r requirements.txt

Python scripts

Once your environment is set up, you can run the following scripts:

  • feature_builder.py: feature engineering module
  • run_1stage.py: train and run 1-stage classifier
  • run_2stage.py: train and run 2-stage classifier
  • scorer.py: original evaluation script published by the FNC-1 organizers
  • train_lda_model.py: train LDA topic model on the NYT portion of Gigaword
  • tune_xgb_params_1stage.py: tune hyperparameters for 1-stage classifier
  • tune_xgb_params_2stage.py: tune hyperparameters for 2-stage classifier
  • utils.py: utility functions (configuration file import and text preprocessing)

Note that most of the scripts require that you pass a configuration file as an argument:

(fnc-nyu) [FakeNewsChallenge]$ python run_1stage.py competition_test.yml

Most importantly, the configuration file (here competition_test.yml) specifies the features to be included or generated on each run.
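
For illustration, a configuration file might look roughly like the following; the keys and feature names shown here are hypothetical and may differ from the real competition_test.yml:

# Hypothetical config sketch; actual keys may differ
features:
  - tfidf_cosine      # cosine similarity of TF-IDF vectors
  - lda_topics        # topic overlap from the LDA model
  - sentiment         # headline/body sentiment scores
train_file: data/feats.train.all.csv
test_file: data/feats.competition_test.all.csv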

Bash scripts

Alternatively, use sbatch with the bash scripts below to run the longer Python jobs on an HPC cluster:

  • run_1stage.sh
  • run_2stage.sh
  • train_lda_model.sh
  • tune_xgb_params_1stage.sh
  • tune_xgb_params_2stage.sh
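
For example, to submit the 1-stage job (exact resource flags depend on your cluster's Slurm configuration):

(fnc-nyu) [FakeNewsChallenge]$ sbatch scripts/run_1stage.sh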

Predictions and results

There are two output files containing the predictions on the competition test set:

  • predictions.1stage.csv: contains the 1-stage classifier predictions
  • predictions.2stage.csv: contains the 2-stage classifier predictions

1-stage classifier

The top score after feature selection and hyperparameter tuning for the 1-stage classifier is 9128.5, or 78.35%.

(fnc) [mt3685@c38-15 FakeNewsChallenge]$ python scripts/scorer.py fnc-1/competition_test_stances.csv predictions.1stage.csv
CONFUSION MATRIX:
-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |    144    |     4     |   1607    |    148    |
-------------------------------------------------------------
| disagree  |    12     |     1     |    522    |    162    |
-------------------------------------------------------------
|  discuss  |    190    |     2     |   3874    |    398    |
-------------------------------------------------------------
| unrelated |     2     |     0     |    246    |   18101   |
-------------------------------------------------------------
ACCURACY: 0.870

MAX  - the best possible score (100% accuracy)
NULL - score as if all predicted stances were unrelated
TEST - score based on the provided predictions

||    MAX    ||    NULL   ||    TEST   ||
|| 11651.25  ||  4587.25  ||  9128.5   ||
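
For reference, the FNC-1 metric awards 0.25 points for correctly separating related from unrelated pairs and a further 0.75 points for predicting the exact related label. A minimal sketch of the rule (the authoritative implementation is scripts/scorer.py):

RELATED = {"agree", "disagree", "discuss"}

def fnc_score(gold, pred):
    """Weighted FNC-1 score over parallel lists of stance labels."""
    score = 0.0
    for g, p in zip(gold, pred):
        if (g in RELATED) == (p in RELATED):
            score += 0.25  # correct related/unrelated decision
        if g in RELATED and g == p:
            score += 0.75  # exact related label
    return score

Under this rule, the all-"unrelated" NULL baseline earns 0.25 points for each of the 18,349 unrelated test pairs, i.e. 4587.25.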

2-stage classifier

The top score after feature selection and hyperparameter tuning for the 2-stage classifier is 9161.5, or 78.63%.

(fnc) [mt3685@c38-15 FakeNewsChallenge]$ python scripts/scorer.py fnc-1/competition_test_stances.csv predictions.2stage.csv
CONFUSION MATRIX:
-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |    27     |     0     |   1733    |    143    |
-------------------------------------------------------------
| disagree  |     9     |     0     |    533    |    155    |
-------------------------------------------------------------
|  discuss  |    45     |     0     |   4060    |    359    |
-------------------------------------------------------------
| unrelated |     5     |     0     |    366    |   17978   |
-------------------------------------------------------------
ACCURACY: 0.868

MAX  - the best possible score (100% accuracy)
NULL - score as if all predicted stances were unrelated
TEST - score based on the provided predictions

||    MAX    ||    NULL   ||    TEST   ||
|| 11651.25  ||  4587.25  ||  9161.5   ||

Resampling

The original dataset is highly imbalanced, with the majority of example pairs being "unrelated":

-------------------------------------------------------------
|   agree   | disagree  |  discuss  | unrelated |   total   |
-------------------------------------------------------------
|   3678    |    840    |   8909    |   36545   |   49972   |
-------------------------------------------------------------

To improve recall on "disagree" examples, we experimented with oversampling the training data to introduce a bias towards "disagree" predictions. After resampling, the number of "disagree" samples in the training data increases from 840 to 3678 (the original number of "agree" samples). When using resampling, the recall for "disagree" improves, while the precision for "discuss" decreases significantly.
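
A minimal sketch of this oversampling step, assuming the training features sit in a pandas DataFrame with a Stance column (hypothetical names; the actual pipeline in run_2stage.py may differ):

import pandas as pd

def oversample_disagree(train: pd.DataFrame, target: int = 3678,
                        seed: int = 0) -> pd.DataFrame:
    """Upsample "disagree" rows (with replacement) until there are `target` of them."""
    disagree = train[train["Stance"] == "disagree"]
    extra = disagree.sample(n=target - len(disagree), replace=True,
                            random_state=seed)
    # Shuffle so the duplicated rows are not clustered at the end
    return pd.concat([train, extra]).sample(frac=1, random_state=seed)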

The overall score of the 2-stage classifier with resampling is 9115.75, or 78.24%.

(fnc) [mt3685@c38-15 FakeNewsChallenge]$ python scripts/scorer.py fnc-1/competition_test_stances.csv predictions.2stage.csv
CONFUSION MATRIX:
-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |    25     |    21     |   1718    |    139    |
-------------------------------------------------------------
| disagree  |     4     |     7     |    529    |    157    |
-------------------------------------------------------------
|  discuss  |    33     |    84     |   3993    |    354    |
-------------------------------------------------------------
| unrelated |     6     |     3     |    366    |   17974   |
-------------------------------------------------------------
ACCURACY: 0.866

MAX  - the best possible score (100% accuracy)
NULL - score as if all predicted stances were unrelated
TEST - score based on the provided predictions

||    MAX    ||    NULL   ||    TEST   ||
|| 11651.25  ||  4587.25  ||  9115.75  ||

About

CSCI-GA.2590 - Natural Language Processing - Spring 2018 - Term Project
