Spam Detector Network

Project Description

In 2023, approximately 45.6% of global emails were identified as spam, marking a decrease from nearly 49% in 2022. Despite this decline, spam remains a critical issue in email communications. This project focuses on developing an effective spam email detection system using deep learning techniques, with a particular emphasis on leveraging a pre-trained DistilBERT model.

Project Objectives

Developing a Robust Model: Utilize state-of-the-art deep learning models for accurate classification of spam and non-spam emails.
Efficiency in Processing: Implement efficient preprocessing steps to handle large volumes of email data.
Real-Time Detection: Deploy the model for real-time spam detection in email systems.
Integration with Node-RED: Integrate the spam detection model with Node-RED for seamless workflow automation.

Project Structure

Dataset Used

Enron Spam Dataset
- Description: Contains emails from Enron Corporation labeled as spam or non-spam.
- Link: Enron Spam Dataset

Model Building

Utilized a pre-trained DistilBERT model, customized for binary classification by incorporating a ReLU activation layer, Dropout layer, and final linear layer.

Training and Validation

Tuned hyperparameters on the validation set to find optimal settings.
Retrained the final model on the combined training and validation sets, using the selected hyperparameters, to reduce overfitting and maximize performance.

Performance Evaluation

Evaluated model using metrics like accuracy, precision, recall, and F1-score.
Achieved an accuracy of 97% with strong recall: the high accuracy indicates few misclassifications overall, and the strong recall indicates that few spam emails were missed (few false negatives).

Deployment and Monitoring

Deployed trained model in production for real-time spam detection.
Monitored model performance and server uptime for reliability.

File Structure

SpamDetectionNetwork.ipynb: Jupyter notebook for model development and training.
Prediction-Server.py: Python script for setting up server for real-time spam prediction.
SpamDetector-NodeRed.json: Node-RED configuration file for integration and deployment.
Model/: Directory containing saved model files.
- saved_model.pb: TensorFlow model file.
- variables/: Directory for model variables.

Model Characteristics

The project selected DistilBERT for its efficient training and reduced size. The customized architecture replaces the final classification layer with a ReLU activation, a Dropout layer for regularization, and a linear layer for binary classification. The feature-extraction layers are frozen, retaining their pre-trained weights so that fine-tuning focuses on the new classification layers.
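
The architecture described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the notebook's actual code: the class name `SpamClassifier`, the dropout rate of 0.3, and the choice to classify from the first token's hidden state are all assumptions. The hidden size of 768 matches the base DistilBERT checkpoint.

```python
import torch
from torch import nn


class SpamClassifier(nn.Module):
    """Frozen encoder followed by a ReLU -> Dropout -> Linear head (hypothetical name)."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768,
                 dropout: float = 0.3, num_labels: int = 2):
        super().__init__()
        self.encoder = encoder
        # Freeze the pre-trained feature extractor so only the head is trained.
        for param in self.encoder.parameters():
            param.requires_grad = False
        # Dropout rate is an assumed value, not taken from the project.
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the hidden state of the first token as the sequence representation.
        first_token = outputs.last_hidden_state[:, 0]
        return self.head(first_token)  # logits of shape (batch, num_labels)
```

In practice the encoder would presumably be loaded with `DistilBertModel.from_pretrained("distilbert-base-uncased")` from the `transformers` library and passed in as `encoder`. Keeping it as a constructor argument also lets the head be exercised with a small stand-in module.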

Data Preprocessing

Data Reading and Integration:
- Reads email data from a CSV file.
- Merges 'Subject' and 'Message' columns into a unified 'Text' column for streamlined text processing.
Label Encoding:
- Converts categorical labels ('Spam' and 'Ham') into binary format (1 for 'spam', 0 for 'ham') to facilitate classification.
Data Cleaning and Handling:
- Removes unnecessary columns like 'Date', 'Subject', and 'Message ID' to simplify the dataset.
- Handles missing values to ensure data completeness.
Text Preprocessing:
- Uses googletrans to translate non-English text to English, so that emails in different languages are processed uniformly.
- Cleans text by removing non-alphanumeric characters and other noise.
- Uses DistilBERT's tokenizer to tokenize the text, ensuring compatibility with the model's input requirements.
- Removes stopwords and lemmatizes tokens for improved analysis.
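
The reading, merging, label-encoding, and cleaning steps above could look something like the sketch below. It uses only the standard library for illustration, while the project itself lists Pandas. The label column name `Spam/Ham`, the helper names, and the sample rows are assumptions. Translation, stopword removal, and lemmatization are left out.

```python
import csv
import io
import re

# The label column name is an assumption; 'Subject' and 'Message' come from the text above.
LABEL_COLUMN = "Spam/Ham"
LABEL_TO_ID = {"ham": 0, "spam": 1}


def clean_text(text):
    """Replace non-alphanumeric characters with spaces and collapse whitespace."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def load_emails(csv_file):
    """Return a list of {'Text': str, 'Label': int} rows from an open CSV file."""
    rows = []
    for record in csv.DictReader(csv_file):
        subject = (record.get("Subject") or "").strip()
        message = (record.get("Message") or "").strip()
        label = (record.get(LABEL_COLUMN) or "").strip().lower()
        # Merge subject and message into one text field, then clean it.
        text = clean_text(f"{subject} {message}")
        # One possible missing-value policy: drop rows without text or a known label.
        if not text or label not in LABEL_TO_ID:
            continue
        # Other columns (Date, Message ID) are simply not carried over.
        rows.append({"Text": text, "Label": LABEL_TO_ID[label]})
    return rows


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "Message ID,Subject,Message,Spam/Ham,Date\n"
    "0,Meeting moved,See you at 3pm!,ham,2001-05-14\n"
    "1,WIN $$$ now,Click here >>> http://example.com,spam,2004-08-02\n"
    "2,,,ham,2001-05-15\n"
)
emails = load_emails(sample)
```

Dropping incomplete rows is only one way to handle missing values. Filling an empty subject or message with an empty string and keeping the row is an equally plausible reading of the step described above.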

Data Loader

Implemented a DistilBERTDataset class that supplies inputs during model training and validation. The class uses the DistilBERT tokenizer for tokenization, padding, and truncation, and the preprocessed CSV data is batched through PyTorch's DataLoader with multiprocessing support.

Model Training and Evaluation

Training Process

Utilized AdamW optimizer for training with CrossEntropyLoss as the loss function.
Monitored training progress through epochs, saving best model weights based on validation accuracy and loss.
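
A training loop of this shape might look like the following sketch. The data here is a toy stand-in: random 2-D features and a single linear layer instead of DistilBERT embeddings and the real classifier. The learning rate, the epoch count, and the rule "prefer higher validation accuracy, break ties by lower validation loss" are assumptions about how the best weights were chosen.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Toy data: the label is 1 when the two features sum to a positive number.
x_train = torch.randn(200, 2)
y_train = (x_train.sum(dim=1) > 0).long()
x_val = torch.randn(100, 2)
y_val = (x_val.sum(dim=1) > 0).long()

model = nn.Linear(2, 2)  # stand-in for the real classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

best_key = (-1.0, float("-inf"))  # (validation accuracy, negative validation loss)
best_state = None
history = []

for epoch in range(20):
    # One full-batch training step per epoch keeps the sketch short.
    model.train()
    optimizer.zero_grad()
    train_loss = criterion(model(x_train), y_train)
    train_loss.backward()
    optimizer.step()

    # Evaluate on the validation split without tracking gradients.
    model.eval()
    with torch.no_grad():
        val_logits = model(x_val)
        val_loss = criterion(val_logits, y_val).item()
        val_accuracy = (val_logits.argmax(dim=1) == y_val).float().mean().item()
    history.append((train_loss.item(), val_loss, val_accuracy))

    # Keep a copy of the weights whenever the validation result improves.
    key = (val_accuracy, -val_loss)
    if key > best_key:
        best_key = key
        best_state = copy.deepcopy(model.state_dict())

# Restore the best weights seen during training.
model.load_state_dict(best_state)
```

With real data the inner step would iterate over mini-batches from a DataLoader, and `best_state` would typically be written to disk with `torch.save`.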

Validation and Testing

Validated model performance on a separate test dataset to ensure generalizability and accuracy.
Evaluated metrics such as precision, recall, and F1-score to assess model effectiveness in spam detection.
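
For reference, the metrics named above can be computed from the four confusion-matrix counts as in the sketch below. The labels and predictions are made-up toy values, not results from this project, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1, treating `positive` as the spam class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # ham passed through

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1,
            "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn}}


# Toy example: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Precision falls when legitimate emails are flagged as spam (false positives), while recall falls when spam is missed (false negatives). Reporting both alongside accuracy therefore shows which kind of error dominates.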

Attention Masking

During dataset preparation, each input sequence was tokenized and converted into a sequence of token IDs. Because sequences vary in length, they are padded to a common length, and attention masks were applied during testing so that the padding does not affect the result. Each mask marks real tokens with 1 and padding tokens with 0, and positions marked 0 are ignored in the attention calculation.
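
The relationship between padding and the attention mask can be illustrated with a few lines of plain Python. In the project this work is done by the DistilBERT tokenizer, so the helper below is only a hypothetical illustration. The IDs 101, 102, and 0 are the [CLS], [SEP], and [PAD] tokens of the uncased BERT vocabulary, and the other IDs are arbitrary.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad a list of token IDs and build the matching attention mask."""
    ids = list(token_ids)[:max_length]      # truncate sequences that are too long
    mask = [1] * len(ids)                   # 1 = real token, attended to
    padding = max_length - len(ids)
    ids = ids + [pad_id] * padding          # pad short sequences
    mask = mask + [0] * padding             # 0 = padding, ignored by attention
    return ids, mask


batch = [
    [101, 2489, 3076, 102],  # longer email (arbitrary IDs between [CLS] and [SEP])
    [101, 7592, 102],        # shorter email
]
encoded = [pad_and_mask(ids, max_length=5) for ids in batch]
# encoded[0] -> ([101, 2489, 3076, 102, 0], [1, 1, 1, 1, 0])
# encoded[1] -> ([101, 7592, 102, 0, 0],    [1, 1, 1, 0, 0])
```

A real tokenizer handles truncation slightly differently, because it keeps the closing [SEP] token when it shortens a sequence. The simple slice above does not.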

Model Testing and Results

Achieved an accuracy of 97% with a robust recall score: the high accuracy indicates few misclassifications overall, and the recall indicates that few spam emails were missed. Detailed error analysis, including confusion matrices and inspection of tokenized misclassified emails, improved understanding of the model and its performance.

Tools and Technologies Used

Deep Learning Framework: PyTorch for developing and training deep learning models.
Model Architecture: DistilBERT for efficient text classification tasks.
Data Handling: Pandas and NumPy for data manipulation and preprocessing.
Visualization: Matplotlib for visualizing training progress and model evaluation metrics.
Collaborative Development: Google Colab for cloud-based Jupyter environment and GPU utilization.
Deployment: Node-RED for workflow automation and integration, Flask for creating API endpoints.

Server Setup and Deployment

Setting Up Prediction Server

Implemented Flask application for hosting the trained model, enabling real-time predictions via HTTP requests.
Utilized Docker for containerization, ensuring portability and scalability of the prediction server.
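
A prediction endpoint of this kind might be structured as in the sketch below. It is not the content of Prediction-Server.py: the `/predict` route, the JSON field names, the 0.5 threshold, and the keyword-based `predict_spam_probability` placeholder (which stands in for the real model call) are all assumptions made for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_spam_probability(text):
    """Placeholder for the trained model; returns a fake score based on keywords."""
    suspicious = ("free", "winner", "click here")
    hits = sum(phrase in text.lower() for phrase in suspicious)
    return min(1.0, 0.1 + 0.4 * hits)


@app.route("/predict", methods=["POST"])
def predict():
    # Accept a JSON body such as {"text": "..."}; tolerate missing or malformed JSON.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    text = payload.get("text", "")
    if not isinstance(text, str) or not text.strip():
        return jsonify({"error": "field 'text' is required"}), 400

    probability = predict_spam_probability(text)
    label = "spam" if probability >= 0.5 else "ham"
    return jsonify({"label": label, "spam_probability": probability})
```

During development such an app could be started with `app.run()`. A client, for example a Node-RED HTTP request node, would then POST the email text as JSON and route the message according to the returned `label`.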

Monitoring and Maintenance

Integrated monitoring tools to track model performance and server uptime, ensuring continuous operation and reliability.
Implemented logging mechanisms to capture errors and user interactions for troubleshooting and improvement.

Node-RED Workflow

Workflow Integration

Developed Node-RED flows to automate email processing and spam detection.
Integrated with IMAP servers for retrieving emails, processing through Flask API for prediction, and routing based on spam classification.

Real-Time Decision Making

Configured decision-making nodes to classify emails as spam or non-spam based on model predictions.
Automated responses for spam emails, ensuring efficient email management and user communication.

Conclusion

The Spam Detector Network project encompasses a comprehensive approach to building and deploying a spam detection system using advanced machine learning techniques. The integration with Node-RED enhances automation capabilities, making it suitable for real-time email spam detection and management.

Future Proposals

Cloud Deployment: Deploying the model on a cloud server using Docker containers for improved scalability and management.
Big Data Handling: Implementing Hadoop or Spark frameworks for handling large volumes of data and advanced result analysis.
Mobile Application: Developing a cross-platform mobile app using Flutter for local model execution or Swift for optimized iOS performance.

Authors

License

This project is licensed under the GNU General Public License v3.0. Refer to the LICENSE file for more information.

Acknowledgment

Gratitude to Marcel Wiechmann, creator of the Enron Spam Dataset, for providing valuable data for this project.
