This project explores text classification using the Naive Bayes algorithm, implemented in two different environments:
- Python (scikit-learn)
- RapidMiner (Text Processing extension)
Both Multinomial Naive Bayes (term occurrences) and Bernoulli Naive Bayes (binary term occurrences) were tested.
The aim was to compare their performance in terms of accuracy, precision, recall, F1-score, and execution time.
This project implements a text sentiment classifier using the Naive Bayes algorithm. It includes two variations of Naive Bayes:
- Multinomial Naive Bayes: Based on word frequency in texts, suitable for problems where term frequency matters.
- Bernoulli Naive Bayes: Uses binary term occurrence (presence/absence of words), suitable for cases where the existence of a word is more important than its frequency.
The classifier can be used for sentiment analysis in text files (e.g., positive or negative sentiment) and can be applied to text documents organized into folders.
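The difference between the two variants comes down to how documents are vectorized. A minimal sketch of the distinction (the example document is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["great great great movie"]

# Term frequencies, the input Multinomial NB expects.
counts = CountVectorizer().fit_transform(doc).toarray()

# Presence/absence (binary=True), the input Bernoulli NB expects.
binary = CountVectorizer(binary=True).fit_transform(doc).toarray()

print(counts)  # [[3 1]]  -> "great" appears 3 times, "movie" once
print(binary)  # [[1 1]]  -> both words are simply present
```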
- Support for Multinomial and Bernoulli Naive Bayes.
- Calculation of performance metrics such as accuracy, confusion matrix, and classification report.
- Parameterization via min_df and max_df for term frequency pruning.
- Capable of analyzing large text datasets with fast training and prediction.
To run this project, make sure you have the following Python library installed:
- scikit-learn
The `glob`, `os`, and `time` modules used by the script are part of the Python standard library and need no separate installation.
Install dependencies:
pip install scikit-learn
Your project folder should follow this structure:
project-directory/
│
├── txt_sentoken/
│ ├── neg/ # Files with negative sentiment
│ └── pos/ # Files with positive sentiment
│
└── main_script.py # The main script containing the code
Reads text files from specific folders that correspond to categories (e.g., 'neg' and 'pos') and returns the texts along with their labels.
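Such a loader might look as follows (a minimal sketch; the function name `load_texts` and the `*.txt` glob pattern are illustrative, not necessarily the script's actual names):

```python
import glob
import os

def load_texts(base_dir, categories=("neg", "pos")):
    """Read all .txt files under base_dir/<category>/ and return texts with labels."""
    texts, labels = [], []
    for label in categories:
        pattern = os.path.join(base_dir, label, "*.txt")
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(label)  # folder name doubles as the class label
    return texts, labels
```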
Trains a Naive Bayes model (Multinomial or Bernoulli, depending on `nb_type`) and performs predictions. The data is first vectorized using `CountVectorizer`, then the model predicts and the results are printed.
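This training-and-prediction step could be sketched like this (the helper name `train_and_predict` is hypothetical; the actual script's signature may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

def train_and_predict(train_texts, train_labels, test_texts,
                      nb_type="multinomial", min_df=1, max_df=1.0):
    # Binary occurrences for Bernoulli NB, raw counts for Multinomial NB.
    vectorizer = CountVectorizer(min_df=min_df, max_df=max_df,
                                 binary=(nb_type == "bernoulli"))
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    model = BernoulliNB() if nb_type == "bernoulli" else MultinomialNB()
    model.fit(X_train, train_labels)
    return model.predict(X_test)
```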
print_results(model_name, execution_time, execution_time_with_load, accuracy, conf_matrix, classification_rep)
Prints the results of the model execution: execution time, accuracy, confusion matrix, and classification report (precision, recall, f1-score).
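These metrics all come from `sklearn.metrics`; a minimal sketch with toy labels (the label vectors below are made up for the example):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = ["neg", "neg", "pos", "pos", "pos"]
y_pred = ["neg", "pos", "pos", "pos", "neg"]

accuracy = accuracy_score(y_true, y_pred)                          # 0.6
conf_matrix = confusion_matrix(y_true, y_pred, labels=["neg", "pos"])
report = classification_report(y_true, y_pred, digits=4)

print(f"Accuracy: {accuracy:.4f}")
print(conf_matrix)  # rows = true class, columns = predicted class
print(report)       # per-class precision, recall, f1-score, support
```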
- Prepare folders: Place the text files into the correct folders (`neg` and `pos`) as shown above.
- Parameter settings: Adjust the term frequency thresholds (`min_df`, `max_df`), the training/test split (`test_size`), and other script parameters as needed.
- Run: Execute the script to train the model and view the results.
Example execution:
python main_script.py
Confusion Matrix:
[[234 45]
[ 32 289]]
Accuracy: 0.8717
Classification Report:
precision recall f1-score support
neg 0.8797 0.8387 0.8587 279
pos 0.8653 0.9003 0.8824 321
- Bernoulli Naive Bayes achieved the best overall performance:
- Python: 79.80% accuracy
- RapidMiner: 79.75% accuracy
- Multinomial Naive Bayes:
- Python: 77.40% accuracy
- RapidMiner: 73.40% accuracy
- Python execution was approximately twice as fast as RapidMiner.
The full academic report with detailed methodology, results, and comparison is available in:
📂 report/Text Classification (EN).pdf
- Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
- Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
- RapidMiner Documentation: https://docs.rapidminer.com
- Scikit-learn Documentation: https://scikit-learn.org/stable/modules/naive_bayes.html
✍️ Onour Imprachim