A machine learning project that classifies emails as Spam or Ham (legitimate) using two models — Multinomial Naive Bayes and Linear SVM — with a fully interactive Streamlit web app.
| Classifier | EDA | Model Performance |
|---|---|---|
| Real-time spam detection | Word clouds & distributions | Confusion matrices & ROC curves |
```
spamshield/
│
├── spam_app.py                         # Streamlit web application
├── spam_email_classifier.py            # Core ML training & evaluation script
├── spam.csv                            # Dataset (SMS Spam Collection)
│
├── outputs/
│   ├── chart1_distribution.png         # Spam vs Ham pie chart
│   ├── chart2_wordcloud.png            # Spam word cloud
│   ├── cm_Linear_SVM.png               # SVM confusion matrix
│   ├── cm_Multinomial_Naive_Bayes.png  # Naive Bayes confusion matrix
│   ├── roc_Linear_SVM.png              # SVM ROC curve
│   └── roc_Multinomial_Naive_Bayes.png # Naive Bayes ROC curve
│
├── requirements.txt                    # Python dependencies
└── README.md
```
```bash
git clone https://github.com/YOUR_USERNAME/spamshield.git
cd spamshield
python -m venv venv

# Activate — macOS/Linux
source venv/bin/activate

# Activate — Windows
venv\Scripts\activate

pip install -r requirements.txt
streamlit run spam_app.py
```

The app will open automatically at http://localhost:8501.

To train the models and regenerate the charts from the command line:

```bash
python spam_email_classifier.py
```

This trains both models, prints metrics to the console, and saves all charts as `.png` files.
| Model | Vectorizer | Test Accuracy | ROC-AUC |
|---|---|---|---|
| Multinomial Naive Bayes | CountVectorizer | ~98.2% | 0.98 |
| Linear SVM | TF-IDF | ~98.3% | 0.99 |
Both models are wrapped in sklearn Pipelines (vectorizer → classifier) to prevent data leakage.
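A minimal sketch of what such a pipeline looks like, using toy data rather than `spam.csv` (the real script trains on the full dataset):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus; 1 = spam, 0 = ham
texts = ["free prize claim now", "see you at lunch", "urgent cash offer", "meeting at 3pm"]
labels = [1, 0, 1, 0]

# Vectorizer and classifier chained in one estimator, so the
# vectorizer is fit only on whatever data .fit() receives
svm_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])
svm_pipe.fit(texts, labels)

print(svm_pipe.predict(["claim your free prize"]))  # → [1]
```

Because vectorizer and classifier are fit together inside the pipeline, cross-validation and train/test splits never leak test vocabulary statistics into training.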
- Paste any email or select a pre-loaded example
- Switch between Naive Bayes and SVM in the sidebar
- Get instant SPAM 🚫 or HAM ✅ prediction
- Run a batch demo on 4 pre-written test emails at once
- Spam vs Ham donut chart
- Message length distribution by category
- Word cloud of most common spam keywords
- Interactive data table preview (adjustable row count)
- Side-by-side metrics comparison table
- Per-model tabs with:
  - Accuracy, Precision, Recall, F1, ROC-AUC cards
  - Confusion matrix heatmap (test set)
  - ROC curve (train vs test AUC)
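The ROC-AUC figures measure how well a model's decision scores rank spam above ham. A toy illustration (the scores below are made up, not taken from the real test set):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical decision scores for six messages (higher = more spam-like)
y_true = np.array([0, 0, 0, 1, 1, 1])   # 1 = spam
scores = np.array([-1.2, 0.5, -0.1, 0.4, 0.9, 1.7])

# AUC = fraction of (spam, ham) pairs ranked correctly: 8 of 9 here
print(round(roc_auc_score(y_true, scores), 2))  # → 0.89
```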
SMS Spam Collection Dataset
- Source: UCI ML Repository via GitHub mirror
- 5,574 SMS messages (after deduplication: ~5,169)
- Class distribution: 87.4% Ham, 12.6% Spam
- Columns: `Category` (spam/ham), `Message` (text)
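The `Category` column maps directly to a binary target. A quick sketch of that encoding, using a toy frame that mirrors the two-column schema:

```python
import pandas as pd

# Toy frame with the same columns as spam.csv
df = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you soon", "WIN a free prize!", "lunch at noon?"],
})

# 1 = spam, 0 = ham
df["label"] = (df["Category"] == "spam").astype(int)
print(df["label"].tolist())  # → [0, 1, 0]
```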
```
pandas
numpy
scikit-learn
matplotlib
seaborn
wordcloud
streamlit
```
Install all via:

```bash
pip install -r requirements.txt
```

Linear SVM:

- True Negatives (Ham correctly identified): 1,104
- False Positives (Ham flagged as Spam): 3
- False Negatives (Spam missed): 17
- True Positives (Spam correctly caught): 169
Multinomial Naive Bayes:
- True Negatives: 1,103
- False Positives: 4
- False Negatives: 17
- True Positives: 169
Both models perform near-identically on unseen data. SVM has a slight edge in precision (fewer false alarms).
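As a sanity check, the headline metrics can be recomputed from the raw counts of the first confusion matrix above — this is plain arithmetic, independent of any model:

```python
# Counts from the first confusion matrix above
tn, fp, fn, tp = 1104, 3, 17, 169

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # how many flagged messages were really spam
recall    = tp / (tp + fn)   # how much spam was actually caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# → accuracy=0.985 precision=0.983 recall=0.909 f1=0.944
```

The high precision (few false alarms) at the cost of some missed spam is the usual trade-off for spam filters, where flagging a legitimate email is worse than letting one spam through.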
Words most strongly associated with spam: free, call, txt, claim, prize, urgent, mobile, won, reply, stop, cash, offer
- Fork the repository
- Create a feature branch: `git checkout -b feature/my-feature`
- Commit your changes: `git commit -m "Add my feature"`
- Push to the branch: `git push origin feature/my-feature`
- Open a Pull Request
This project is licensed under the MIT License — see the LICENSE file for details.
Made with ❤️ using Python, scikit-learn, and Streamlit.