Advanced Text Classification Project

Overview

This project aims to classify news documents into 91 different classes using various text classification models and techniques. The report covers the methodology, performance comparison, and suggestions for enhancements.

The full report is available here: Results Report

Author

Omar Hawash
Date: 14.04.2024
NLP Course - An-Najah National University
Supervisor: Dr. Hamed Abdelhaq

Introduction

Text classification is a fundamental task in natural language processing (NLP) that involves categorizing text documents into predefined classes or categories. In this project, we explore advanced text classification techniques to achieve high accuracy and F1-Score.

Hypothesis

Dataset: News documents with body text and 91 different classes.
Training Data: 11413 samples
Testing Data: 4024 samples

Materials

Naive Bayes theoretical equation
Scikit-learn models
Gensim pre-trained models

Procedure

Models Accuracy

A. Naive Bayes from scratch

Model	Mean Average F1-Score (%)
Naive Bayes_01	3.57
Naive Bayes_02	3.63
Naive Bayes_03	3.48
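
The from-scratch implementation itself is not reproduced in this README. As a rough illustration of the technique, here is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing, using only the standard library. The function names, the `alpha` parameter, and the toy documents are assumptions made for this example, not the project's actual code or data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(docs)
    # log P(class), estimated from class frequencies in the training set
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    # log P(word | class) with add-alpha (Laplace) smoothing
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(model, tokens):
    """Return the class with the highest posterior log-score."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # Sum log-probabilities to avoid underflow; words outside the
        # training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in vocab
        )
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
train_docs = [
    "oil prices rise".split(),
    "crude oil output".split(),
    "wheat harvest grows".split(),
    "grain and wheat exports".split(),
]
train_labels = ["crude", "crude", "grain", "grain"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "oil output falls".split()))
```

Working in log space matters with long news documents, where multiplying many small probabilities directly would underflow to zero.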

B. Scikit-learn Naive Bayes

Model	Mean Average F1-Score (%)
Sk_learn Count Vector	30.73
Sk_learn TF-IDF Vector	13.73
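
To make the difference between the two feature types concrete, the sketch below computes raw term counts and TF-IDF weights for a few toy documents using only the standard library. It is an illustration, not the project's pipeline: the helper names and documents are invented, and the IDF formula is the smoothed variant ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as its default, without the L2 normalization that scikit-learn also applies by default.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Raw term counts per document (bag of words)."""
    return [Counter(tokens) for tokens in docs]


def tfidf_vectors(docs):
    """TF-IDF weights using the smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter()  # number of documents containing each term
    for tokens in docs:
        df.update(set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in docs]


# Toy documents, invented for illustration only.
docs = [
    "oil prices rise".split(),
    "oil output falls".split(),
    "wheat prices rise".split(),
]
counts = count_vectors(docs)
weights = tfidf_vectors(docs)

print(counts[2])   # every term in the third document has count 1
print(weights[2])  # "wheat" gets more weight than "prices" or "rise"
```

Count features keep every term at its raw frequency, while TF-IDF down-weights terms that appear in many documents.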

C. Word Embedding Models

Model	Mean Average F1-Score (%)
Glove	26.67
Word2Vec	15.96
FastText	5.03
Glove_02	35.7
Glove_03	37.73
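
This README does not say how word vectors were turned into document features. One common approach, assumed here purely for illustration, is to average the vectors of the words in each document and pass the result to a classifier. The sketch below shows that averaging step with a hypothetical three-dimensional embedding table; the function name and the vectors are invented for the example.

```python
def document_vector(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 3-dimensional vectors; real pre-trained vectors
# typically have 50 to 300 dimensions.
toy_embeddings = {
    "oil": [1.0, 0.0, 0.0],
    "prices": [0.0, 1.0, 0.0],
    "wheat": [0.0, 0.0, 1.0],
}

# "soar" is not in the table, so only "oil" and "prices" are averaged.
vec = document_vector("oil prices soar".split(), toy_embeddings, dim=3)
print(vec)  # [0.5, 0.5, 0.0]
```

Under this assumption, vocabulary coverage matters: a document whose words are mostly missing from the embedding table collapses toward the zero vector and becomes hard to classify.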

D. SVM & Random Forest

Model	Mean Average F1-Score (%)
SVM	36.78
Random Forest	22.66

Results

The from-scratch Naive Bayes implementation performed poorly, especially in F1-Score (3.48% to 3.63%).
Scikit-learn Naive Bayes scored much higher. Specifically, the CountVectorizer model (30.73%) outperformed the TF-IDF model (13.73%) in F1-Score.
Word embedding models, particularly GloVe, achieved the highest performance, with a best mean average F1-Score of 37.73%.
Changing the solver of logistic regression had a negligible effect on results.
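
The tables report a "mean average F1-Score". Assuming this refers to the macro-averaged F1 (the unweighted mean of per-class F1 scores), the sketch below shows how it is computed and why it can sit far below accuracy when there are many classes. The function name and the labels are invented for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class that is never predicted scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented labels: a classifier that always predicts the majority class.
y_true = ["earn"] * 8 + ["grain", "crude"]
y_pred = ["earn"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.8
print(macro_f1(y_true, y_pred))  # roughly 0.296
```

Because every class counts equally in the average, rare classes that a model never predicts pull the score down sharply, which is a plausible factor with 91 classes.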

Conclusion

Performance varied widely across preprocessing choices and models, from about 3.5% to 37.73% mean average F1-Score. Word embedding models, especially GloVe, showed the most promising results. Further enhancements could include exploring different preprocessing methods and experimenting with deep learning architectures.
