This project aims to classify news documents into 91 different classes using various text classification models and techniques. The report covers the methodology, performance comparison, and suggestions for enhancements.
The full report is available here: Results Report
- Omar Hawash
- Date: 14.04.2024
- NLP Course - An-Najah National University
- Supervisor: Dr. Hamed Abdelhaq
Text classification is a fundamental task in natural language processing (NLP) that involves categorizing text documents into predefined classes or categories. In this project, we explore advanced text classification techniques to achieve high accuracy and F1-Score.
- Dataset: News documents with body text and 91 different classes.
- Training Data: 11413 samples
- Testing Data: 4024 samples
- Naive Bayes theoretical equation
- Scikit-learn models
- Gensim pre-trained models
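
The first approach applies the Naive Bayes decision rule directly. The following is a minimal sketch, assuming a multinomial model with add-one (Laplace) smoothing and whitespace tokenization. The project's own implementation may differ, and the names `train_nb` and `predict_nb` as well as the toy data are hypothetical:

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate log-priors and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    return log_prior, word_counts, vocab


def predict_nb(tokens, log_prior, word_counts, vocab):
    """Return argmax_c [ log P(c) + sum_w log P(w | c) ] with add-one smoothing."""
    best_class, best_score = None, -math.inf
    for c, prior in log_prior.items():
        total = sum(word_counts[c].values())
        score = prior
        for w in tokens:
            if w not in vocab:
                continue  # skip words never seen in training
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy data (invented for illustration, not from the project dataset).
train_docs = [
    "oil prices rise".split(),
    "crude oil exports".split(),
    "wheat harvest grows".split(),
    "wheat and corn prices".split(),
]
train_labels = ["crude", "crude", "grain", "grain"]

model = train_nb(train_docs, train_labels)
print(predict_nb("oil exports fall".split(), *model))
print(predict_nb("corn harvest".split(), *model))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied, and the smoothing term keeps a single unseen word from zeroing out a class.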
| Model | Mean Average F1-Score (%) |
|---|---|
| Naive Bayes_01 | 3.57 |
| Naive Bayes_02 | 3.63 |
| Naive Bayes_03 | 3.48 |
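
The tables report a mean average F1-Score. Assuming this means the unweighted (macro) average of per-class F1 scores, which is an interpretation rather than something stated here, it could be computed as in the sketch below. `macro_f1` is a hypothetical helper and the labels are invented:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); guard against an empty denominator.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (invented): the rare class "c" is never predicted.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "a", "a"]
print(f"{100 * macro_f1(y_true, y_pred):.2f}%")
```

Under this definition a rare class counts as much as a frequent one, so with 91 classes a model can have reasonable accuracy and still score low if it misses many small classes.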
| Model | Mean Average F1-Score (%) |
|---|---|
| Sk_learn Count Vector | 30.73 |
| Sk_learn TF-IDF Vector | 13.73 |
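
The two scikit-learn rows differ only in how documents are turned into feature vectors. A minimal sketch of that comparison, assuming a `MultinomialNB` classifier with default settings and a tiny invented corpus (the project's actual settings and preprocessing are not given here):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (invented for illustration, not the project's news data).
train_texts = [
    "oil prices rise on supply fears",
    "crude oil exports fall",
    "wheat harvest grows this year",
    "corn and wheat prices climb",
]
train_labels = ["crude", "crude", "grain", "grain"]
test_texts = ["oil supply falls", "wheat harvest shrinks"]
test_labels = ["crude", "grain"]

results = {}
for name, vectorizer in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    # Each pipeline fits the vectorizer on training text only, then the classifier.
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(train_texts, train_labels)
    predictions = model.predict(test_texts)
    results[name] = f1_score(test_labels, predictions, average="macro")

for name, score in results.items():
    print(f"{name}: macro F1 = {100 * score:.2f}%")
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary from being fitted on test documents, which makes the two scores comparable.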
| Model | Mean Average F1-Score (%) |
|---|---|
| Glove | 26.67 |
| Word2Vec | 15.96 |
| FastText | 5.03 |
| Glove_02 | 35.70 |
| Glove_03 | 37.73 |
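
Embedding-based models need one fixed-length vector per document. A common way to get one, assumed here because the document representation is not described, is to average the vectors of the words in the document. The sketch below uses a tiny made-up embedding table in place of real pre-trained vectors, and `document_vector` is a hypothetical helper:

```python
def document_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Made-up 3-dimensional vectors standing in for pre-trained GloVe/Word2Vec/FastText.
toy_embeddings = {
    "oil": [1.0, 0.0, 0.0],
    "crude": [0.8, 0.2, 0.0],
    "wheat": [0.0, 1.0, 0.0],
}

doc = "crude oil shipment".split()  # "shipment" is out of vocabulary and is skipped
print(document_vector(doc, toy_embeddings, dim=3))
print(document_vector(["unknown"], toy_embeddings, dim=3))
```

With Gensim, a loaded `KeyedVectors` object supports the same `token in vectors` and `vectors[token]` lookups, so it could in principle be passed in place of the toy dictionary. The resulting document vectors can then be fed to any standard classifier.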
| Model | Mean Average F1-Score (%) |
|---|---|
| SVM | 36.78 |
| Random Forest | 22.66 |
- The from-scratch Naive Bayes implementation scored very low, with a mean average F1-Score between 3.48% and 3.63% across its three variants.
- Scikit-learn Naive Bayes scored much higher. The CountVectorizer model (30.73%) clearly outperformed the TF-IDF model (13.73%) in F1-Score.
- Word embedding models, particularly GloVe, achieved the highest performance, with a best mean average F1-Score of 37.73%.
- Changing the solver of logistic regression had a negligible effect on results.
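
The solver comparison in the last point can be reproduced in outline by refitting the same model with different `solver` values. A minimal sketch, assuming scikit-learn's `LogisticRegression` and a small synthetic two-class dataset rather than the project's own features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic features stand in for the document vectors used in the project.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for solver in ["lbfgs", "liblinear", "saga"]:
    clf = LogisticRegression(solver=solver, max_iter=5000)
    clf.fit(X_train, y_train)
    scores[solver] = f1_score(y_test, clf.predict(X_test), average="macro")

for solver, score in scores.items():
    print(f"{solver}: macro F1 = {100 * score:.2f}%")
```

These solvers minimize nearly the same L2-regularized objective, so close scores are expected once each has converged, which is consistent with the negligible effect reported above.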
Performance varied widely across preprocessing and model choices, from about 3.5% mean average F1-Score for the from-scratch Naive Bayes to 37.73% for the best GloVe configuration. Word embedding models, especially GloVe, showed the most promising results. Further enhancements could include exploring different preprocessing methods and experimenting with deep learning architectures.