SMP 2018 (1st prize)

This contest asks participants to distinguish human-written articles from machine-generated ones. We won first place out of 240 teams.

Task description

Given an article, the task is to build an algorithm that identifies the type of author: automatic summary, machine translation, robot writer, or human writer. For more details, see SMP EUPT 2018.

1.Set up

tensorflow >= 1.4.0
keras >= 1.2.0
gensim
scikit-learn
You may also need keras.utils.vis_utils for model visualization.

2.Data Preprocessing

my_utils/: for data preprocessing
- my_utils/data: convert the original data to CSV files
- my_utils/data_preprocess: create data sequences and batches as input for the deep learning models
- my_utils/w2v_process: get the vocabularies and pre-trained embeddings for words and characters
- my_utils/metrics: calculate the precision, recall, and F1 score for each author category
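
The sequence-building step can be sketched in plain Python as follows. This is a minimal illustration, not the code in my_utils/data_preprocess: the function names (`build_vocab`, `encode`, `batches`), the reserved ids, the sequence lengths, and the toy articles are all assumptions made for the example.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids (assumed): padding and out-of-vocabulary tokens


def build_vocab(token_lists, min_count=1):
    """Map each token seen at least min_count times to an integer id >= 2."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab) + 2
    return vocab


def encode(tokens, vocab, max_len):
    """Turn tokens into ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(tok, UNK) for tok in tokens[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def batches(rows, batch_size):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Toy, already-segmented articles (hypothetical data).
word_docs = [["the", "cat", "sat"], ["the", "dog"]]
# Character view of the same articles: one token per character.
char_docs = [list("".join(doc)) for doc in word_docs]

word_vocab = build_vocab(word_docs)
char_vocab = build_vocab(char_docs)

# Each article becomes two fixed-length id sequences: one of words, one of chars.
word_seqs = [encode(doc, word_vocab, max_len=4) for doc in word_docs]
char_seqs = [encode(doc, char_vocab, max_len=8) for doc in char_docs]
```

Each article ends up with one word-level and one character-level sequence of fixed length, which is the kind of paired input a combined word and char model would consume.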

3.Models

There are 12 models in total, each combining word-level and character-level representations. The best model we devised, word_rcnn_char_cgru, is inspired by the two papers cited in the note below the table.

Here are the scores of the different models:

model	off-line	on-line
word_char_cnn	0.9888	0.9849
word_char_rnn	0.9894	0.9863
deep_word_char_cnn	0.9887	0.9828
word_rcnn_char_rnn	0.9899	0.9879
word_rnn_char_rcnn	0.9902	0.9872
word_char_cgru	0.9896	0.9861
word_cgru_char_rcnn	0.9904	untested
word_rcnn_char_cgru	0.9910	0.9882
word_cgru_char_rnn	0.9887	untested
word_rnn_char_cgru	0.9899	untested
word_rnn_char_cnn	0.9897	0.9862
word_char_rcnn	0.9894	0.9884

Note that rcnn comes from "A Hybrid Framework for Text Modeling with Convolutional RNN", while cgru comes from "A C-LSTM Neural Network for Text Classification".

The source code is derived from https://github.com/fuliucansheng/360
We use the model directory to define the model architectures and the train directory to train them.
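
To make the naming concrete, here is a rough sketch of what a two-branch model in the spirit of word_rcnn_char_cgru might look like. It is not the repository's architecture: it uses the `tensorflow.keras` functional API rather than the older standalone Keras listed in the setup, it reads "rcnn" as a recurrent layer followed by convolution and "cgru" as convolution followed by a GRU, and every size (vocabulary, sequence length, embedding and hidden dimensions) is a placeholder assumption. Only the four output classes come from the task description.

```python
from tensorflow.keras import Model, layers

# All sizes below are placeholder assumptions, not the repository's settings.
WORD_VOCAB, CHAR_VOCAB = 20000, 5000
WORD_LEN, CHAR_LEN = 200, 400
EMBED_DIM, HIDDEN = 100, 64
NUM_CLASSES = 4  # automatic summary, machine translation, robot writer, human writer


def build_word_rcnn_char_cgru():
    word_in = layers.Input(shape=(WORD_LEN,), name="word_ids")
    char_in = layers.Input(shape=(CHAR_LEN,), name="char_ids")

    # Word branch ("rcnn", as assumed here): recurrent layer, then convolution.
    w = layers.Embedding(WORD_VOCAB, EMBED_DIM)(word_in)
    w = layers.Bidirectional(layers.GRU(HIDDEN, return_sequences=True))(w)
    w = layers.Conv1D(HIDDEN, 3, activation="relu")(w)
    w = layers.GlobalMaxPooling1D()(w)

    # Char branch ("cgru", as assumed here): convolution, then a GRU.
    c = layers.Embedding(CHAR_VOCAB, EMBED_DIM)(char_in)
    c = layers.Conv1D(HIDDEN, 3, activation="relu")(c)
    c = layers.GRU(HIDDEN)(c)

    # Merge both views of the article and classify the author type.
    merged = layers.concatenate([w, c])
    out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)
    return Model(inputs=[word_in, char_in], outputs=out)


model = build_word_rcnn_char_cgru()
```

The other model names in the table can be read the same way: the first part names the encoder applied to the word sequence and the second part the encoder applied to the character sequence.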

4.Ensemble

For the ensemble, we use LightGBM to combine the 12 models with extra statistical features; the code is in ensemble. For more details, see https://github.com/TFknight/SMP-2018-Ensemble-Guide
On the test dataset, we only use a simple but efficient voting mechanism for ensembling; the code is in evaluate/predict
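
A voting ensemble of this kind can be sketched as a plain majority vote over the labels predicted by each model. This is an assumed illustration, not the code in evaluate/predict: the function name `majority_vote`, the label strings, and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per article.

    predictions_per_model[m][i] is model m's predicted label for article i.
    On a tie, the tied label that appears earliest in model order wins.
    """
    voted = []
    # zip(*...) regroups the predictions so each tuple holds one article's votes.
    for labels in zip(*predictions_per_model):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Three hypothetical models voting on three articles.
preds = [
    ["human", "robot", "summary"],
    ["human", "human", "summary"],
    ["robot", "robot", "translation"],
]
final_labels = majority_vote(preds)
```

With an even number of models, ties are possible, so the order in which models are listed affects the result under this rule.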

5.Main files

my_utils/: for data preprocessing
- my_utils/data: convert the original data to CSV files
- my_utils/data_preprocess: create data sequences and batches as input for the deep learning models
- my_utils/w2v_process: get the vocabularies and pre-trained embeddings for words and characters
- my_utils/metrics: calculate the precision, recall, and F1 score for each author category
models/: for creating deep learning models
- deepzoo: for keeping all models
init/config.py: for saving the path of models, data and so on
train: for training models
figure: for saving the visualization of models
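
As an illustration of the per-category scores that my_utils/metrics is described as computing, precision, recall, and F1 can be derived from true-positive, false-positive, and false-negative counts for each label. The sketch below is an assumption about the computation, not the repository's code; the function name `per_class_scores` and the label strings are hypothetical.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Guard against division by zero when a label is never predicted or never true.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical gold labels and predictions for four articles.
y_true = ["human", "human", "robot", "robot"]
y_pred = ["human", "robot", "robot", "robot"]
scores = per_class_scores(y_true, y_pred)
```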

Acknowledgment

Thanks to my teammates in GDUFS-iiip for all their efforts.
We hope that more people will join our lab: the Data Mining Lab at GDUFS (广外数据挖掘实验室).

Quincy1994/smp2018
