Skip to content

312shan/nCoV_sentence_simi

 
 

Repository files navigation

nCoV-2019 related sentence similarity

If useful for you, maybe a star to encourage our work.

introduce

ERNIE , RoBerta based model for sentence similarity

For example:

387,支原体肺炎,支原体肺炎的症状及治疗方法是什么,肺炎衣原体与肺炎支原体有什么区别?,0
388,支原体肺炎,支原体肺炎的症状及治疗方法是什么,肺炎支原体培养及药敏的检验单怎么看?,0
389,支原体肺炎,支原体肺炎的症状及治疗方法是什么,小儿支原体与小儿支原体肺炎相同吗?,0
390,支原体肺炎,宝宝支原体肺炎感染的症状有哪些?,宝宝肺炎支原体感染的症状是什么?,1
391,支原体肺炎,宝宝支原体肺炎感染的症状有哪些?,宝宝支原体肺炎感染有什么症状?,1

95.2 acc online (simply choose the 1st fold, 1/6)

  • ERNIE 1.0
  • Nadam with 2.0*1e-5 lr
  • OHEM CE, with label smoothing
  • cosine lr scheduler with warmup
  • clean noise data by an overfitted model

more tricks maybe

  • simply change the model

  • add any 'word2vec' features

  • split into multipiece data,get N bert,
    using multiple feature to train a tree based
    model, lightGBM, Xgboost...

  • for those hard example, maybe add the nearest sentence
    (pair with label) for reference info, into bert

  • pseudo label

  • more open data(e.g ping an CHIP 2019)

  • ...

denpendency

  • opencv-python
  • pytorch >= 1.4
  • pandas
  • yacs
  • sklearn

prepare

train

you maye change the data path, have a look at train.py test.py

export PYTHONPATH=./
sh train_pipeline.sh

ref

https://tianchi.aliyun.com/competition/entrance/231776/introduction?spm=5176.12281949.1003.4.21eb2448atCLQk

About

nCoV related sentence similarity by BERT

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 99.8%
  • Shell 0.2%