YahooEmbeddings

Code for a project in the Text Mining course (732A81) at Linköping University, fall 2022. See the paper folder for the final report.

Abstract

Pretrained word embeddings are readily available and can be used to build representations of longer pieces of text. This paper investigates three simple strategies for representing paragraphs in a Q/A topic-classification problem, together with several common classifiers. The data is the large-scale Yahoo! Answers dataset, which contains triples of question title, question content, and best answer. The investigated paragraph representations are distributed bag of words (DBOW), mean pooling, and projecting an observation's word embeddings onto their first principal component. The DBOW and mean-pooling representations perform equally well with logistic regression (69% accuracy) and a multilayer perceptron (71-72% accuracy). SVMs with linear and radial-basis-function kernels are also investigated. The best model falls 4 percentage points short of the state of the art on accuracy. Still, the simplicity of these approaches demonstrates the power of pretrained word embeddings and of simple strategies for representing longer pieces of text for topic classification in question-and-answer settings. spaCy word embeddings are used throughout the study.
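
As a rough illustration (not the code from this repository), the sketch below shows how the mean-pooling and first-principal-component representations could be built from spaCy's static word vectors and fed to a scikit-learn classifier. The en_core_web_md model, the toy texts, and the labels are assumptions made for the example; the DBOW representation would be trained separately, e.g. with gensim's Doc2Vec.

```python
# Minimal sketch of two paragraph representations from the abstract,
# assuming spaCy's en_core_web_md model is installed:
#   python -m spacy download en_core_web_md
import numpy as np
import spacy
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

nlp = spacy.load("en_core_web_md")

def mean_pool(text: str) -> np.ndarray:
    """Average the static word vectors of all tokens that have one."""
    doc = nlp(text)
    vectors = [t.vector for t in doc if t.has_vector]
    return np.mean(vectors, axis=0) if vectors else np.zeros(nlp.vocab.vectors_length)

def first_pc(text: str) -> np.ndarray:
    """Use the first principal component of the token-vector matrix as the
    paragraph vector (one reading of the PCA strategy described above)."""
    doc = nlp(text)
    vectors = np.array([t.vector for t in doc if t.has_vector])
    if len(vectors) < 2:  # PCA needs at least two token vectors
        return mean_pool(text)
    return PCA(n_components=1).fit(vectors).components_[0]

# Hypothetical usage: toy texts and labels stand in for the concatenated
# Yahoo! Answers triples and their topic classes.
texts = ["How do vaccines work?", "Who won the 1998 World Cup?"]
labels = [0, 1]
X = np.vstack([mean_pool(t) for t in texts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```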

Data

Data is available on Google Drive.

See the official repo for the data and the corresponding paper.
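
A minimal loading sketch, assuming the headerless CSV layout of the Yahoo! Answers release (class index followed by the three text fields) and the file name train.csv:

```python
import pandas as pd

# Assumed column order for the headerless Yahoo! Answers CSV release:
# class index, question title, question content, best answer.
columns = ["label", "title", "content", "answer"]
train = pd.read_csv("train.csv", header=None, names=columns)

# Concatenate the triple into one paragraph per observation.
train["text"] = (
    train[["title", "content", "answer"]].fillna("").agg(" ".join, axis=1)
)
```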
