# **Human Trafficking Detection Using Phishing Email Dataset**

## **Project Overview**
This project explores whether phishing email analysis can help identify potential human trafficking activity, on the premise that phishing emails are often used to support illegal activities such as trafficking. We train machine learning models on email features to classify emails as phishing or non-phishing, and we add sentiment analysis to capture emotional undertones that may be linked to trafficking behavior.

## **Dataset Source**
We used the phishing email dataset, which contains the following columns:
- **Sender**: Email address of the sender.
- **Receiver**: Email address of the receiver.
- **Date**: The date the email was sent.
- **Subject**: Subject line of the email.
- **Body**: Full content of the email.
- **Label**: Classification of the email (1 = phishing, 0 = non-phishing).
- **URLs**: Number of URLs in the email body.

Dataset link: [Phishing Email Dataset](https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset)

## **Data Preprocessing**
- **Text Cleaning**: We removed URLs and special characters and converted all text to lowercase to standardize the input.
- **Handling Missing Values**: We checked for missing data, dropped records with missing values in critical columns like the label, and prepared the text data for vectorization.
- **Text Vectorization**: We utilized TF-IDF to transform the email content into numerical features suitable for machine learning algorithms.
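
The text cleaning step above can be sketched as follows. This is a minimal illustration that assumes regex-based cleaning; the patterns and the `clean_text` name are hypothetical and are not taken from the project code.

```python
import re

# Assumed patterns; the project's exact cleaning rules are not shown in this README.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # also drops non-ASCII letters
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Strip URLs, lowercase, remove special characters, collapse whitespace."""
    if text is None:
        return ""  # treat a missing body or subject as empty text
    text = URL_RE.sub(" ", str(text))
    text = text.lower()
    text = NON_ALNUM_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_text("URGENT!!! Verify your account at http://example.com/login NOW"))
# -> urgent verify your account at now
```

URLs are removed before the special-character pass so that fragments such as `http` or `com` do not survive as stray tokens.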

## **Exploratory Data Analysis (EDA)**
- **Missing Value Check**: We checked each column for missing values and handled them as described under Data Preprocessing.
- **Data Cleaning**: We removed noise from the text to improve model performance.
- **Visualization**: We plotted the distribution of phishing vs. non-phishing emails to check the class balance.
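
The missing-value check and the class distribution can be computed before anything is plotted. The sketch below uses only the standard library and assumes rows are available as dictionaries; the `summarize` helper, the lowercase column names, and the toy rows are all hypothetical.

```python
from collections import Counter


def summarize(rows, label_key="label"):
    """Return (label counts, missing-value counts per column) for dict rows."""
    label_counts = Counter()
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # Count None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[column] += 1
        label = row.get(label_key)
        if label is not None and str(label).strip() != "":
            label_counts[int(label)] += 1
    return label_counts, missing


# Hypothetical toy rows, not taken from the dataset.
rows = [
    {"subject": "Verify your account", "body": "Click here", "label": 1},
    {"subject": "Lunch tomorrow", "body": "", "label": 0},
    {"subject": None, "body": "You won a prize", "label": 1},
]

label_counts, missing = summarize(rows)
print(dict(label_counts))
print(dict(missing))
```

The label counts are the numbers a bar chart of phishing vs. non-phishing emails would display.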

## **Modeling**
We trained two machine learning models:
1. **Naive Bayes**: A probabilistic model often used in text classification due to its simplicity and effectiveness.
   - **Accuracy**: 97.81%
   - **Precision**: 0.96 for non-phishing, 1.00 for phishing.
   - **Recall**: 1.00 for non-phishing, 0.96 for phishing.
   - **F1-Score**: 0.98 for both classes.

2. **Support Vector Machine (SVM)**: Chosen for its high accuracy and ability to handle high-dimensional data.
   - **Accuracy**: 99.56%
   - **Precision**: 1.00 for non-phishing, 0.99 for phishing.
   - **Recall**: 0.99 for non-phishing, 1.00 for phishing.
   - **F1-Score**: 0.99 for non-phishing, 1.00 for phishing.
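
A minimal sketch of how the two models could be trained on TF-IDF features with scikit-learn. The README does not state which SVM variant or hyperparameters were used, so `LinearSVC` with default settings is an assumption here, and the toy emails are made up for illustration; they are not rows from the dataset and will not reproduce the scores above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_models():
    """Return two untrained TF-IDF pipelines keyed by model name."""
    return {
        "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
        # LinearSVC is an assumed choice; the SVM variant is not specified.
        "svm": make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0)),
    }


# Made-up toy emails (1 = phishing, 0 = non-phishing).
texts = [
    "verify your account password now",
    "urgent your account is suspended click to verify",
    "you won a prize claim your reward now",
    "meeting moved to friday afternoon",
    "please review the attached project report",
    "lunch tomorrow with the team",
]
labels = [1, 1, 1, 0, 0, 0]

models = build_models()
for name, model in models.items():
    model.fit(texts, labels)
    print(name, model.predict(["please verify your password now"]))
```

Placing the vectorizer inside each pipeline means the TF-IDF vocabulary is learned only from the data passed to `fit`, which avoids leaking test-set vocabulary when the data is split for evaluation.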

### **Why SVM?**
SVM outperformed Naive Bayes on accuracy (99.56% vs. 97.81%) and reached precision and recall of at least 0.99 for both classes, which means it keeps both false positives and false negatives low. We therefore selected it for this classification problem.
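
To show how false positives and false negatives feed into these metrics, here is a short sketch that computes precision, recall, and F1 for one class from raw counts. The counts are hypothetical and are not the project's confusion matrix; they were chosen only so that the arithmetic is easy to follow.

```python
def class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for one class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the phishing class: no false positives,
# 4 phishing emails missed out of 100.
precision, recall, f1 = class_metrics(tp=96, fp=0, fn=4)
print(round(precision, 2), round(recall, 2), round(f1, 2))
# -> 1.0 0.96 0.98
```

With no false positives, precision is perfect, and the four missed phishing emails are what pull recall and F1 down.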

## **Sentiment Analysis**
We incorporated VADER Sentiment Analysis to capture emotional tones in the email bodies. This helps in identifying whether emails with aggressive or emotional language patterns could be associated with trafficking behavior.
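
A minimal sketch of scoring an email body with VADER. It assumes the `vaderSentiment` package (the README does not say whether that package or NLTK's bundled VADER was used). The `tone_label` helper and its 0.05 threshold follow the commonly used convention for VADER's compound score and are not confirmed by the project.

```python
def vader_scores(text):
    """Return VADER's neg/neu/pos/compound scores for one text."""
    # Imported lazily: vaderSentiment is an assumed optional dependency
    # (pip install vaderSentiment).
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    return SentimentIntensityAnalyzer().polarity_scores(text)


def tone_label(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse tone label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(tone_label(-0.6))
# -> negative
```

With the package installed, `tone_label(vader_scores(body)["compound"])` yields a label per email. VADER is sensitive to capitalization and punctuation, so it may be preferable to score the raw body rather than the cleaned, lowercased text.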

## **Conclusion**
This project successfully built a classification model for phishing emails, which could potentially be extended to detect emails related to human trafficking. By combining machine learning with sentiment analysis, we've enhanced the detection capabilities, enabling the system to flag not only phishing attempts but also suspicious emotional undertones in the emails.
