This repository contains a Jupyter notebook pipeline for email spam detection using classical text preprocessing, feature extraction, and classification.
- spam_NLP.csv: Original email messages labelled “spam” or “ham”.
- spam_NLP_cleaned.csv: Cleaned version of the text (lowercasing, punctuation removed, etc.).
- TEST_DATA_spam.csv: Smaller dataset used for quick testing/validation.
- main.ipynb: Main notebook that loads the data, cleans it, extracts features, trains the classifier, and evaluates it.
- logic.ipynb: Supporting notebook for experiments and for exploring alternative preprocessing and feature setups.
- Text preprocessing: tokenization, lowercasing, stopword removal, punctuation removal, and possibly further cleaning (stemming or lemmatization, if included).
- Feature extraction:
  - Bag of Words (count vectorization)
  - TF-IDF vectorization
- Classification:
  - Multinomial Naive Bayes classifier
- Evaluation:
  - Accuracy
  - Precision, Recall, F1-score
  - Confusion matrix
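
The steps listed above can be sketched end to end with scikit-learn. This is a minimal illustration, not the notebook's actual code: the toy messages, the `clean_text` and `build_model` helpers, the split parameters, and the use of scikit-learn's built-in English stopword list are all assumptions made for the example.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages invented for illustration; the notebook works on the CSV files.
messages = [
    "WIN a FREE prize now!!!",
    "Claim your free cash reward today",
    "Cheap loans, click here to win",
    "Free entry: win cash now",
    "Are we still meeting for lunch?",
    "Please review the attached report",
    "See you at the office tomorrow",
    "Thanks for the notes from the meeting",
]
labels = ["spam"] * 4 + ["ham"] * 4

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase the text and strip ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(_PUNCT_TABLE)


def build_model(use_tfidf=True):
    """Vectorizer (TF-IDF or Bag of Words) followed by Multinomial Naive Bayes."""
    vectorizer_cls = TfidfVectorizer if use_tfidf else CountVectorizer
    # The vectorizer tokenizes the cleaned text and drops English stopwords.
    vectorizer = vectorizer_cls(preprocessor=clean_text, stop_words="english")
    return make_pipeline(vectorizer, MultinomialNB())


# Hold out a stratified quarter of the toy data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=0
)

model = build_model(use_tfidf=True)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows and columns of the confusion matrix follow the order ["ham", "spam"].
cm = confusion_matrix(y_test, y_pred, labels=["ham", "spam"])
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
print(cm)
```

Calling `build_model(use_tfidf=False)` swaps in plain count vectorization, which makes it easy to compare the two feature setups on the same split.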
- Python 3.8+ (recommended)
- Required packages:
  - pandas
  - numpy
  - scikit-learn
  - matplotlib
The dataset (spam_NLP.csv, spam_NLP_cleaned.csv, TEST_DATA_spam.csv) contains email messages labeled as spam or ham.
It is a processed version of a mail spam dataset (5,796 rows).
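
A quick way to check the row count and the spam/ham balance of a CSV file is sketched below. The column names `label` and `text` are assumptions, since the CSV schema is not documented here, and `summarize_labels` is a hypothetical helper; the example reads an in-memory stand-in rather than the real files.

```python
import io

import pandas as pd


def summarize_labels(csv_source, label_column="label"):
    """Return (row count, {label: count}) for a CSV path or file-like object."""
    df = pd.read_csv(csv_source)
    return len(df), df[label_column].value_counts().to_dict()


# In-memory stand-in for a dataset file; the column names are assumed.
sample_csv = io.StringIO(
    "label,text\n"
    "ham,See you at noon\n"
    "spam,Win a prize now\n"
    "ham,Lunch tomorrow?\n"
)

n_rows, counts = summarize_labels(sample_csv)
print(n_rows, counts)
```

To inspect a real file, pass its path instead, for example `summarize_labels("spam_NLP.csv", label_column=...)` with whatever the label column is actually called.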
- Clone the repository

    git clone https://github.com/Splendorius/Spam-Detection-Text-Processing.git
    cd Spam-Detection-Text-Processing

- Install dependencies

    pip install pandas numpy scikit-learn matplotlib

- Launch Jupyter Notebook

    jupyter notebook

- Open the notebook main.ipynb and run the cells top-to-bottom.