transformer_for_textclassification

Transformer Architecture

Fine-tuning Tricks

This demo applies a transformer decoder to a text classification task. In practice, it is hard to tune the model's hyperparameters to good values, so here is a brief summary of what worked when fine-tuning them.

  • Adam optimization with a learning rate of 1.5e-4 (1e-4 magnitude) and a schedule that decays the learning rate exponentially after each epoch. A larger learning rate (1e-3 magnitude) gives poor results. A minimal optimizer/scheduler sketch appears after this list.

  • I tried adding a BiLSTM layer on top of the transformer decoder but got a worse result. I suspect this is due to a learning-rate mismatch between the decoder layers and the LSTM layer, which makes it hard to fine-tune both sets of parameters well at the same time (a per-module learning-rate sketch appears after this list). It also reinforces how effective the now-common "unsupervised pre-training + supervised fine-tuning" recipe is.

  • For the activation function, GeLU seems to work worse than ReLU (perhaps due to the limited corpus).

  • For regularization, use dropout with a rate of 0.1 (rates in the 0.2 to 0.5 range hurt) and label smoothing for better generalization (see the loss sketch after this list).

  • The pre-norm residual x + dropout(sublayer(norm(x))) gives a better result than the post-norm norm(x + dropout(sublayer(x))); both variants are sketched after this list.

  • For the position-wise feed-forward network, a larger inner dimensionality helps (d_ff = 1024 for d_model = 300; see the sketch after this list).

  • For the multi-head self-attention network, more heads give better performance (a head-count example appears after this list).

  • The more data, the better the results!
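
A minimal sketch of the optimizer setup described above, assuming PyTorch; `model`, the epoch count, and the decay factor gamma=0.9 are illustrative placeholders, not values taken from this repository.

```python
import torch

model = torch.nn.Linear(300, 2)  # stand-in for the actual classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4)  # 1e-4 magnitude
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(20):  # the epoch count is arbitrary here
    # ... run one training epoch: forward pass, loss.backward(), optimizer.step() ...
    scheduler.step()  # decay the learning rate once per epoch
```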
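For the BiLSTM observation, one way to address the learning-rate mismatch is to give each module its own rate via optimizer parameter groups. This is a hypothetical illustration, not the repository's code: `decoder` and `bilstm` are placeholder modules and both rates are made up.

```python
import torch
import torch.nn as nn

decoder = nn.TransformerDecoderLayer(d_model=300, nhead=6)  # stand-in for the decoder stack
bilstm = nn.LSTM(input_size=300, hidden_size=150, bidirectional=True, batch_first=True)

optimizer = torch.optim.Adam([
    {"params": decoder.parameters(), "lr": 1.5e-4},  # decoder at the base rate
    {"params": bilstm.parameters(), "lr": 1e-3},     # LSTM at a larger rate
])
```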
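A sketch of the regularization setup: dropout at 0.1 inside the model and label smoothing on the classification loss. The built-in `label_smoothing` argument requires PyTorch >= 1.10; the smoothing value 0.1 is an illustrative choice.

```python
import torch.nn as nn

dropout = nn.Dropout(p=0.1)  # applied inside the sublayers
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothed targets for better generalization
```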
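The two residual arrangements compared above, sketched as small PyTorch modules; `sublayer` stands for the self-attention or feed-forward block, and the shapes assume d_model = 300.

```python
import torch.nn as nn

class PreNormResidual(nn.Module):
    """x + dropout(sublayer(norm(x))) -- the variant that worked better here."""
    def __init__(self, d_model, sublayer, p=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(p)

    def forward(self, x):
        return x + self.dropout(self.sublayer(self.norm(x)))

class PostNormResidual(nn.Module):
    """norm(x + dropout(sublayer(x))) -- the post-norm arrangement."""
    def __init__(self, d_model, sublayer, p=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(p)

    def forward(self, x):
        return self.norm(x + self.dropout(self.sublayer(x)))
```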
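A minimal position-wise feed-forward sketch with the wider inner layer suggested above (d_ff = 1024 for d_model = 300); the class name and dropout placement are illustrative, not copied from this repository.

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=300, d_ff=1024, p=0.1):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)  # expand to the inner dimensionality
        self.w2 = nn.Linear(d_ff, d_model)  # project back to the model dimension
        self.act = nn.ReLU()                # ReLU worked better than GeLU on this corpus
        self.dropout = nn.Dropout(p)

    def forward(self, x):
        return self.w2(self.dropout(self.act(self.w1(x))))
```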
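For the head count, note that the embedding dimension must be divisible by the number of heads, so with d_model = 300 valid counts include 2, 3, 4, 5, 6, 10, and 12. A one-line example using PyTorch's built-in attention module (the specific values are illustrative):

```python
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=300, num_heads=6, dropout=0.1, batch_first=True)
```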
