GitHub - david-yoon/detecting-incongruity: TensorFlow implementation of "Detecting Incongruity Between News Headline and Body Text via a Deep Hierarchical Encoder," AAAI-19

detecting-incongruity

This repository contains the source code and data corpus used in the following paper:

Detecting Incongruity Between News Headline and Body Text via a Deep Hierarchical Encoder, AAAI-19, paper

Requirements

  tensorflow==1.4 (tested on cuda-8.0, cudnn-6.0)
  python==2.7
  scikit-learn==0.20.0
  nltk==3.3

Download Dataset

Download the preprocessed dataset with the following script:

cd data
sh download_processed_dataset_aaai-19.sh

The downloaded dataset will be placed in the following paths of the project:

/data/aaai-19/para
/data/aaai-19/whole

Format (example):

test_title.npy: [100000, 49] - (#samples, #token (index))
test_body: [100000, 1200] - (#samples, #token (index))
test_label: [100000] - (#samples)
dic_mincutN.txt: dictionary
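
The shapes listed above can be sanity-checked after download. The following is a minimal sketch, not part of the repository: `check_split` is a hypothetical helper, and the arrays are tiny synthetic stand-ins (4 samples instead of 100000) so the snippet runs without the real files. With the real data, each array would be loaded with `np.load` from the downloaded folder.

```python
import os
import tempfile

import numpy as np


def check_split(title, body, label):
    """Verify that the three arrays of one split agree on the sample count.

    Hypothetical helper: it only checks the layout described in the README,
    (#samples, #tokens) for title/body and (#samples,) for label.
    """
    if title.ndim != 2 or body.ndim != 2 or label.ndim != 1:
        raise ValueError("expected 2-D title/body arrays and a 1-D label array")
    n = label.shape[0]
    if title.shape[0] != n or body.shape[0] != n:
        raise ValueError("sample counts differ between title, body, and label")
    return {"samples": n, "title_len": title.shape[1], "body_len": body.shape[1]}


# Synthetic stand-in for test_title.npy, written and re-read via np.save/np.load.
tmp_dir = tempfile.mkdtemp()
title_path = os.path.join(tmp_dir, "test_title.npy")
np.save(title_path, np.zeros((4, 49), dtype=np.int32))

title = np.load(title_path)                  # (#samples, #token indices)
body = np.zeros((4, 1200), dtype=np.int32)   # synthetic body array
label = np.zeros(4, dtype=np.int32)          # synthetic labels

info = check_split(title, body, label)
print(info)
```

A mismatch raises `ValueError`, which usually points to an incomplete download or to mixing files from the "para" and "whole" folders.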

Source Code

Choose the source folder according to the training method:

whole-type: use the code in ./src_whole
para-type: use the code in ./src_para

Training Phase

Each source code folder contains a reference script for training the model:

train_reference_scripts.sh

<< for example >>
Train the AHDE model on the dataset with the "whole" method:

python AHDE_Model.py --batch_size 256 --encoder_size 80 --context_size 10 --encoderR_size 49 --num_layer 1 --hidden_dim 300  --num_layer_con 1 --hidden_dim_con 300 --embed_size 300 --lr 0.001 --num_train_steps 100000 --is_save 1 --graph_prefix 'ahde' --corpus 'aaai-19_whole' --data_path '../data/target_aaai-19_whole/'

Results will be displayed in the console
The final test result will be stored in "./TEST_run_result.txt"

※ hyper parameters

major parameters: edit in the training script
other parameters: edit in "./params.py"
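
When sweeping over the major parameters, it can help to keep them in one place and generate the command line from them. The sketch below is not part of the repository: `build_command` is a hypothetical helper, and the flag names and values are copied from the reference command above. It only assembles and prints the command; nothing is launched.

```python
# Flags and values copied from the "whole" reference command in this README.
PARAMS = [
    ("batch_size", 256),
    ("encoder_size", 80),
    ("context_size", 10),
    ("encoderR_size", 49),
    ("num_layer", 1),
    ("hidden_dim", 300),
    ("num_layer_con", 1),
    ("hidden_dim_con", 300),
    ("embed_size", 300),
    ("lr", 0.001),
    ("num_train_steps", 100000),
    ("is_save", 1),
    ("graph_prefix", "ahde"),
    ("corpus", "aaai-19_whole"),
    ("data_path", "../data/target_aaai-19_whole/"),
]


def build_command(script, params):
    """Turn (flag, value) pairs into an argv list for the training script."""
    argv = ["python", script]
    for name, value in params:
        argv.append("--" + name)
        argv.append(str(value))
    return argv


argv = build_command("AHDE_Model.py", PARAMS)
print(" ".join(argv))
```

To actually start training, the resulting list could be passed to `subprocess.call(argv)` from inside ./src_whole, using an interpreter that satisfies the requirements listed above.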

Inference Phase

Each source code folder contains an inference script.
Before running it, set "model_path" in "eval_AHDE.sh" to the path of your trained model.

<< for example >>
Evaluate the test dataset with the AHDE model and the "whole" method:

	src_whole$ sh eval_AHDE.sh

Results will be displayed in the console
Scores for the test set will be stored in "./output.txt"
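
Once per-sample scores are available, they can be summarized with accuracy and AUROC. This README does not specify the layout of "./output.txt", so the sketch below works on plain Python lists and leaves parsing out. It assumes (this is an assumption, not something the repository states) that each score lies in [0, 1], that a higher score means "incongruent", and that labels are 0 or 1. The function names are hypothetical.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum(1 for s, y in zip(scores, labels) if int(s >= threshold) == y)
    return correct / float(len(labels))


def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U), with tied scores given average ranks."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one sample of each class")
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        # Find the run [i, j) of samples sharing the same score.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Synthetic scores and labels, for illustration only.
scores = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 0, 1]
acc = accuracy(scores, labels)
auc = auroc(scores, labels)
print("accuracy=%.2f auroc=%.2f" % (acc, auc))
```

The rank-based form sorts once instead of comparing every positive against every negative, which matters for a test set of 100,000 samples.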

Dataset Statistics

whole case

data     Samples      tokens (avg) headline    tokens (avg) body text
train    1,700,000    13.71                    499.81
dev      100,000      13.69                    499.03
test     100,000      13.55                    769.23

Note

We crawled the articles for the "dev" and "test" datasets from different media outlets.

Newly introduced dataset (English version)

We create an English version of the dataset, nela-17, using NELA 2017 data. Please refer to the dataset repository [link].
If you want to run our model (AHDE) with the nela-17 data, you can use the preprocessed dataset that is compatible with our code.

cd data
sh download_processed_dataset_nela-17.sh

Training script (refer to "train_reference_scripts.sh"):

python AHDE_Model.py --batch_size 64 --encoder_size 200 --context_size 50 --encoderR_size 25 --num_layer 1 --hidden_dim 100  --num_layer_con 1 --hidden_dim_con 100 --embed_size 300 --use_glove 1 --lr 0.001 --num_train_steps 100000 --is_save 1 --graph_prefix 'ahde' --corpus 'nela-17_whole' --data_path '../data/target_nela-17_whole/'

Other implementation (PyTorch version)

PyTorch implementation [link] by M. Lee
It is compatible with the preprocessed dataset.

Cite

Please cite our paper when you use our code, dataset, or model.

@inproceedings{yoon2019detecting,
title={Detecting Incongruity between News Headline and Body Text via a Deep Hierarchical Encoder},
author={Yoon, Seunghyun and Park, Kunwoo and Shin, Joongbo and Lim, Hongjun and Won, Seungpil and Cha, Meeyoung and Jung, Kyomin},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={33},
pages={791--800},
year={2019}
}
