Add the email_spam_detection directory in NLP/projects #88
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,213 @@ | ||
|
|
||
| # Email Spam Detection | ||
|
|
||
| ### AIM | ||
| To develop a machine learning-based system that classifies email content as spam or ham (not spam). | ||
|
|
||
| ### DATASET LINK | ||
| [https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification](https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification) | ||
|
|
||
|
|
||
| ### NOTEBOOK LINK | ||
| [https://www.kaggle.com/code/inshak9/email-spam-detection](https://www.kaggle.com/code/inshak9/email-spam-detection) | ||
|
|
||
|
|
||
| ### LIBRARIES NEEDED | ||
|
|
||
| ??? quote "LIBRARIES USED" | ||
|
|
||
| - pandas | ||
| - numpy | ||
| - scikit-learn | ||
| - matplotlib | ||
| - seaborn | ||
|
|
||
|
|
||
| --- | ||
|
|
||
| ### DESCRIPTION | ||
| !!! info "What is the requirement of the project?" | ||
| - A robust system to detect spam emails is essential to combat increasing spam content. | ||
| - It improves user experience by automatically filtering unwanted messages. | ||
|
|
||
| ??? info "Why is it necessary?" | ||
| - Spam emails consume time and resources and may pose security risks such as phishing. | ||
| - Filtering them helps organizations and individuals streamline their email communication. | ||
|
|
||
| ??? info "How is it beneficial and used?" | ||
| - Provides a quick and automated solution for spam classification. | ||
| - Used in email services, IT systems, and anti-spam software to filter messages. | ||
|
|
||
| ??? info "How did you start approaching this project? (Initial thoughts and planning)" | ||
| - Analyzed the dataset and prepared features. | ||
| - Implemented various machine learning models for comparison. | ||
|
|
||
| ??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." | ||
| - Documentation from [scikit-learn](https://scikit-learn.org) | ||
| - Blog: Introduction to Spam Classification with ML | ||
|
|
||
| --- | ||
|
|
||
| ### EXPLANATION | ||
|
|
||
| #### DETAILS OF THE DIFFERENT FEATURES | ||
| The dataset contains features such as word frequencies, special-character frequencies, and capital-letter run lengths that help distinguish spam emails from ham. | ||
|
|
||
| | Feature | Description | | ||
| |----------------------|-------------------------------------------------| | ||
| | `word_freq_x` | Frequency of specific words in the email body | | ||
| | `capital_run_length` | Length of runs of consecutive capital letters | | ||
| | `char_freq` | Frequency of special characters like `;` and `$` | | ||
| | `is_spam` | Target variable (1 = Spam, 0 = Ham) | | ||
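To make the feature definitions above concrete, here is a minimal sketch of how features of this kind could be derived from raw email text. The helper name `extract_features`, the tracked word `free`, and the tracked character `$` are illustrative assumptions; this is not the dataset's actual extraction code.

```python
import re


def extract_features(text, word="free", char="$"):
    """Compute illustrative spam features from raw text (hypothetical helper)."""
    words = re.findall(r"[A-Za-z]+", text)
    # Percentage of words equal to `word` (case-insensitive)
    matches = sum(w.lower() == word for w in words)
    word_freq = 100.0 * matches / len(words) if words else 0.0
    # Percentage of characters equal to `char`
    char_freq = 100.0 * text.count(char) / len(text) if text else 0.0
    # Length of the longest run of consecutive capital letters
    runs = re.findall(r"[A-Z]+", text)
    capital_run_length = max((len(r) for r in runs), default=0)
    return {
        "word_freq": word_freq,
        "char_freq": char_freq,
        "capital_run_length": capital_run_length,
    }


features = extract_features("WIN a FREE prize, free entry $$$")
print(features)
```

Expressing frequencies as percentages is a design choice of this sketch; any consistent normalization would serve the same purpose.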
|
|
||
| --- | ||
|
|
||
| #### WHAT I HAVE DONE | ||
|
|
||
| === "Step 1" | ||
|
|
||
| Initial data exploration and understanding: | ||
| - Loaded the dataset using pandas. | ||
| - Explored dataset features and target variable distribution. | ||
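The exploration in this step might look roughly like the sketch below. It uses a tiny synthetic DataFrame whose column names are borrowed from the feature table; the values are made up for illustration, and with the real data you would load the downloaded CSV with `pd.read_csv` instead.

```python
import pandas as pd

# Synthetic stand-in for the real dataset (values are invented for illustration).
df = pd.DataFrame({
    "word_freq_x": [0.0, 0.32, 0.0, 1.10, 0.05, 0.90],
    "capital_run_length": [3, 45, 2, 120, 5, 60],
    "char_freq": [0.0, 0.18, 0.01, 0.40, 0.0, 0.25],
    "is_spam": [0, 1, 0, 1, 0, 1],
})

# Basic structure: number of rows/columns and column types
print(df.shape)
print(df.dtypes)

# Summary statistics for the numeric features
print(df.describe())

# Target distribution as counts and as proportions
class_counts = df["is_spam"].value_counts()
class_share = df["is_spam"].value_counts(normalize=True)
print(class_counts)
print(class_share)
```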
|
|
||
| === "Step 2" | ||
|
|
||
| Data cleaning and preprocessing: | ||
| - Checked for missing values. | ||
| - Standardized features using scaling techniques. | ||
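A minimal sketch of this step, assuming `StandardScaler` from scikit-learn is the scaling technique (the text does not say which scaler was used). The data is synthetic and only the column names come from the feature table.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (values are invented for illustration).
df = pd.DataFrame({
    "word_freq_x": [0.0, 0.32, 0.0, 1.10, 0.05, 0.90],
    "capital_run_length": [3, 45, 2, 120, 5, 60],
    "char_freq": [0.0, 0.18, 0.01, 0.40, 0.0, 0.25],
    "is_spam": [0, 1, 0, 1, 0, 1],
})

# Count missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Separate features from the target, then standardize each feature
# to zero mean and unit variance
X = df.drop(columns="is_spam")
y = df["is_spam"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(np.round(X_scaled.mean(axis=0), 6))
print(np.round(X_scaled.std(axis=0), 6))
```

In a full workflow the scaler would normally be fitted on the training split only, so that statistics from the test data do not leak into training.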
|
|
||
| === "Step 3" | ||
|
|
||
| Feature engineering and selection: | ||
| - Extracted relevant features for spam classification. | ||
| - Used correlation matrix to select significant features. | ||
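One way correlation-based selection could be done is sketched below. The data is randomly generated so that two features carry signal and one is pure noise, and the cutoff of `0.3` is a hypothetical threshold chosen for this example, not a value taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic data: two informative features and one noise feature.
rng = np.random.default_rng(0)
n = 200
is_spam = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "word_freq_x": is_spam * 1.0 + rng.normal(0, 0.3, n),
    "capital_run_length": is_spam * 50 + rng.normal(0, 20, n),
    "char_freq": rng.normal(0, 1, n),  # unrelated to the target by construction
    "is_spam": is_spam,
})

# Pearson correlation matrix of all columns
corr = df.corr()

# Absolute correlation of each feature with the target, strongest first
target_corr = corr["is_spam"].drop("is_spam").abs().sort_values(ascending=False)
print(target_corr)

# Keep features whose correlation with the target exceeds the cutoff
threshold = 0.3  # hypothetical cutoff
selected = target_corr[target_corr > threshold].index.tolist()
print(selected)
```

Pearson correlation only captures linear relationships, so a feature dropped by this filter may still be useful to a non-linear model.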
|
|
||
| === "Step 4" | ||
|
|
||
| Model training and evaluation: | ||
| - Trained models: KNN, Naive Bayes, SVM, Random Forest, and AdaBoost. | ||
| - Evaluated models using accuracy, precision, and recall. | ||
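The loop below is a sketch of how a subset of the classifiers (KNN, Naive Bayes, SVM, Random Forest) could be trained and scored with the three metrics. It runs on synthetic data from `make_classification`, uses default hyperparameters, and assumes `GaussianNB` as the Naive Bayes variant; none of these choices are confirmed by the project, and the printed numbers are unrelated to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the email features
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),  # assumed variant
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the training split and score it on the test split
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```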
|
|
||
| === "Step 5" | ||
|
|
||
| Model optimization and fine-tuning: | ||
| - Tuned hyperparameters using GridSearchCV. | ||
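A minimal `GridSearchCV` sketch follows. The choice of KNN as the tuned model, the parameter grid, and the 5-fold setting are assumptions made for illustration; the text does not state which models or grids were searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the email features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: 3 neighbor counts x 2 weighting schemes = 6 candidates
param_grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
}

# Evaluate every combination with 5-fold cross-validation on accuracy
search = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refitted on all the supplied data with the best parameter combination.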
|
|
||
| === "Step 6" | ||
|
|
||
| Validation and testing: | ||
| - Tested models on unseen data to check performance. | ||
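Testing on unseen data can be sketched as a held-out split combined with a pipeline, so that scaling is learned from the training rows only. The data is synthetic, and the use of `SVC` and a 25% test split are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the email features
X, y = make_classification(n_samples=400, n_features=10, random_state=1)

# Hold out 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# The scaler inside the pipeline is fitted on the training split only
pipeline = make_pipeline(StandardScaler(), SVC())
pipeline.fit(X_train, y_train)

# Accuracy and a per-class report on the held-out rows
test_accuracy = pipeline.score(X_test, y_test)
report = classification_report(
    y_test, pipeline.predict(X_test), target_names=["ham", "spam"]
)
print(round(test_accuracy, 3))
print(report)
```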
|
|
||
| --- | ||
|
|
||
| #### PROJECT TRADE-OFFS AND SOLUTIONS | ||
|
|
||
| === "Trade Off 1" | ||
| - **Accuracy vs. Training Time**: | ||
| - Models like Random Forest took longer to train but achieved higher accuracy compared to Naive Bayes. | ||
|
|
||
| === "Trade Off 2" | ||
| - **Complexity vs. Interpretability**: | ||
| - Simpler models like Naive Bayes were more interpretable but slightly less accurate. | ||
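The accuracy versus training time trade-off can be checked empirically by timing each model's `fit` call. The sketch below does this on synthetic data with a hypothetical helper `time_fit`; absolute timings depend on hardware and data size, so no particular values are claimed here.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for the email features
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)


def time_fit(model, X, y):
    """Return the wall-clock seconds taken by model.fit (hypothetical helper)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


timings = {
    "Naive Bayes": time_fit(GaussianNB(), X, y),
    "Random Forest": time_fit(
        RandomForestClassifier(n_estimators=100, random_state=0), X, y
    ),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```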
|
|
||
| --- | ||
|
|
||
| ### SCREENSHOTS | ||
| <!-- Attach the screenshots and images --> | ||
|
|
||
| !!! success "Project structure or tree diagram" | ||
|
|
||
| ``` mermaid | ||
| graph LR | ||
| A[Start] --> B[Load Dataset]; | ||
| B --> C[Preprocessing]; | ||
| C --> D[Train Models]; | ||
| D --> E{Compare Performance}; | ||
| E -->|Best Model| F[Deploy]; | ||
| E -->|Retry| C; | ||
| ``` | ||
|
|
||
| ??? tip "Visualizations and EDA of different features" | ||
|
|
||
| === "Feature Correlation Heatmap" | ||
|  | ||
|
|
||
| ??? example "Model performance graphs" | ||
|
|
||
| === "Model Comparison" | ||
|  | ||
|
|
||
| --- | ||
|
|
||
| ### MODELS USED AND THEIR EVALUATION METRICS | ||
|
|
||
| | Model | Accuracy | Precision | Recall | | ||
| |----------------------|----------|-----------|--------| | ||
| | KNN | 90% | 89% | 88% | | ||
| | Naive Bayes | 92% | 91% | 90% | | ||
| | SVM | 94% | 93% | 91% | | ||
| | Random Forest | 95% | 94% | 93% | | ||
| | AdaBoost | 97% | 97% | 100% | | ||
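For reference, the three metrics in the table can be computed from confusion-matrix counts as sketched below. The counts are invented for illustration and are not the project's results; they simply show how a recall of 100% (no spam missed) can coexist with lower precision and accuracy when some ham is flagged as spam.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Share of all emails classified correctly
    accuracy = (tp + tn) / total
    # Share of emails flagged as spam that really are spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Share of actual spam emails that were flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: 90 spam caught, 10 ham wrongly flagged,
# 0 spam missed, 100 ham correctly passed through
accuracy, precision, recall = classification_metrics(tp=90, fp=10, fn=0, tn=100)
print(accuracy, precision, recall)
```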
|
|
||
| --- | ||
|
|
||
| #### MODELS COMPARISON GRAPHS | ||
|
|
||
| !!! tip "Models Comparison Graphs" | ||
|
|
||
| === "Accuracy Comparison" | ||
|  | ||
|
|
||
| --- | ||
|
|
||
| ### CONCLUSION | ||
|
|
||
| #### WHAT YOU HAVE LEARNED | ||
|
|
||
| !!! tip "Insights gained from the data" | ||
| - The choice of features significantly affects spam detection performance. | ||
| - Simple models like Naive Bayes can achieve competitive performance. | ||
|
|
||
| ??? tip "Improvements in understanding machine learning concepts" | ||
| - Gained hands-on experience with classification models and model evaluation techniques. | ||
|
|
||
| ??? tip "Challenges faced and how they were overcome" | ||
| - Balancing accuracy against training time was challenging; this was addressed through hyperparameter tuning. | ||
|
|
||
| --- | ||
|
|
||
| #### USE CASES OF THIS MODEL | ||
|
|
||
| === "Application 1" | ||
|
|
||
| **Email Service Providers** | ||
| - Automated filtering of spam emails for improved user experience. | ||
|
|
||
| === "Application 2" | ||
|
|
||
| **Enterprise Email Security** | ||
| - Used in enterprise software to detect phishing and spam emails. | ||
|
|
||
| --- | ||
|
|
||
| ### FEATURES PLANNED BUT NOT IMPLEMENTED | ||
|
|
||
| === "Feature 1" | ||
|
|
||
| - Integration of deep learning models (LSTM) for improved accuracy. | ||
|
|
||
| --- | ||
|
|
||
| ### **DEVELOPER** | ||
| ***Insha Khan*** | ||
|
|
||
| [LinkedIn](https://www.linkedin.com/in/insha-khan-4087532a4/){ .md-button } | ||
| [GitHub](https://www.github.com/ikcod){ .md-button } | ||
|
|
||
| ##### Happy Coding 🤓 | ||
| #### Show some ❤️ by 🌟 this repository! | ||
Binary file added
BIN
+26.1 KB
docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - AdaBoost.png
Binary file added
BIN
+26.6 KB
docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Decision Tree.png
Binary file added
BIN
+26.7 KB
docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Naive Bayes.png
Binary file added
BIN
+27 KB
docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Random Forest.png
Binary file added
BIN
+24.7 KB
docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - SVM.png
Binary file added
BIN
+37.3 KB
docs/NLP/projects/Email_Spam_Detection/images/Model accracy comparison.png