# **Email Spam Filtering Using Machine Learning**

## 🔹 **Project Description (Short Summary)**

This project focuses on building an Email/SMS Spam Filter using Supervised Machine Learning.
The system automatically classifies incoming messages as ‘Spam’ or ‘Ham (Not Spam)’ based on their content.

The dataset contains labeled examples of spam and non-spam messages. The text data is preprocessed by converting to lowercase, removing URLs, punctuation, and stopwords, followed by TF-IDF vectorization to convert text into numerical features.

A Multinomial Naive Bayes classifier is trained on this data to learn spam patterns such as promotional words, links, and trigger phrases. The model is then tested on unseen data to evaluate its accuracy, precision, recall, and F1-score.

Finally, the system allows users to input any message and instantly predicts whether it is spam or not.
The project demonstrates how Supervised Learning can be used to create an intelligent, fast, and efficient spam detection system.

---

# 🧠 **PROJECT TITLE: Email Spam Filtering using Machine Learning (Supervised Learning)**

---

## 🔹 **1. Objective**

The main objective of this project is to develop a **spam detection system** using **Supervised Machine Learning**, which can classify incoming messages or emails as **‘Spam’** or **‘Ham (Not Spam)’** automatically based on their textual content.

---

## 🔹 **2. Introduction**

Spam is one of the most persistent problems in email and SMS communication. Spam messages often contain unwanted advertisements, fake links, or malicious content. Filtering them by hand is not practical, so we use **Machine Learning algorithms** that learn patterns from previously labeled messages (spam/ham) and predict the category of new ones.

This project demonstrates a **Supervised Learning** approach where a model is trained using **labeled data** to classify new messages.

---

## 🔹 **3. What is Supervised Learning?**

Supervised Learning is a machine learning technique where the model learns from **labeled examples**.
For example, in this project:

* Input (X): Text message or email content
* Output (y): Label — “Spam” or “Ham”

The algorithm learns from known examples and then predicts labels for unseen messages.

---

## 🔹 **4. Algorithm Used: Multinomial Naive Bayes**

**Naive Bayes** is a probabilistic algorithm based on Bayes’ Theorem.
It assumes that the features (words) in a message are conditionally independent of each other given the class, hence “naive”.

It calculates:
$$
P(\text{Spam} \mid \text{Message}) = \frac{P(\text{Message} \mid \text{Spam}) \, P(\text{Spam})}{P(\text{Message})}
$$

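The class (Spam or Ham) with the higher posterior probability is selected as the output. Because P(Message) is the same for both classes, only the numerators need to be compared.
Training reduces to counting word occurrences per class, so the algorithm is fast and scales well to large text collections.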
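
To make the formula concrete, here is a minimal sketch in plain Python that applies Bayes’ Theorem to a short message. The priors and per-word likelihoods are made-up illustrative values, not estimates from any dataset, and `posterior` is a hypothetical helper name.

```python
import math

# Hypothetical priors and per-word likelihoods (illustrative values only).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.01},
    "ham": {"free": 0.02, "meeting": 0.20},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    # Work in log space so long messages do not underflow.
    log_joint = {
        c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in words)
        for c in priors
    }
    # P(Message) is the sum of the joint probabilities over both classes.
    evidence = sum(math.exp(v) for v in log_joint.values())
    return {c: math.exp(v) / evidence for c, v in log_joint.items()}

result = posterior(["free"])
print(result)
```

With these assumed numbers, a message containing only "free" gets a spam posterior of about 0.91, because 0.4 × 0.30 = 0.12 outweighs 0.6 × 0.02 = 0.012.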

---

## 🔹 **5. Dataset Description**

We use a dataset containing two columns:

* **Label**: Spam / Ham
* **Message**: The text content of each email/SMS

Example rows:

| Label | Message                                                |
| ----- | ------------------------------------------------------ |
| ham   | Hey, are we meeting today?                             |
| spam  | Congratulations! You’ve won a free iPhone! Click here. |

---

## 🔹 **6. Data Preprocessing Steps**

1. **Text Cleaning**

   * Convert all text to lowercase
   * Remove punctuation, URLs, numbers, and HTML tags

2. **Tokenization**

   * Split each message into individual words (tokens)

3. **Stopwords Removal**

   * Remove common words like “the”, “is”, “and” that carry little meaning

4. **Lemmatization**

   * Reduce words to their base form (e.g., "running" → "run")

5. **Feature Extraction (TF-IDF)**

   * Convert cleaned text into numerical values using the **TF-IDF Vectorizer**, which gives higher weight to words that are frequent in a message but rare across the dataset, and lower weight to words that appear in most messages.
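
The effect of step 5 can be seen in a small sketch using scikit-learn's `TfidfVectorizer` with its default settings. The three documents are invented for illustration and stand in for already-cleaned messages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented, already-cleaned messages (illustrative only).
docs = [
    "free prize click now",
    "free entry lucky draw",
    "meeting report today",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per message

# Look up the weights of two words in the first message.
vocab = vectorizer.vocabulary_
row0 = matrix[0].toarray().ravel()
free_weight = row0[vocab["free"]]
prize_weight = row0[vocab["prize"]]

print(matrix.shape)
print(f"free={free_weight:.3f} prize={prize_weight:.3f}")
```

Both words occur once in the first message, but "free" also appears in the second one, so its inverse document frequency, and therefore its weight, is lower than that of "prize".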

---

## 🔹 **7. Model Training and Testing**

1. **Split dataset** into training (80%) and testing (20%).
2. **Train the model** on training data using Multinomial Naive Bayes.
3. **Evaluate the model** on test data using metrics:

   * Accuracy
   * Precision
   * Recall
   * F1-Score
   * Confusion Matrix

---

## 🔹 **8. Evaluation Metrics Explained**

* **Accuracy:** Fraction of all predictions that are correct.
* **Precision:** Out of all messages predicted as spam, how many were actually spam.
* **Recall:** Out of all real spam messages, how many were correctly identified.
* **F1-Score:** Harmonic mean of Precision and Recall; useful when the classes are imbalanced, because accuracy alone can look high even if most spam is missed.
* **Confusion Matrix:** A table of true positives, false positives, false negatives, and true negatives.
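
As a worked example, the sketch below derives each metric from the four cells of a confusion matrix, treating spam as the positive class. The counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts (illustrative values only).
tp = 90   # spam correctly flagged as spam
fp = 5    # ham wrongly flagged as spam
fn = 10   # spam that slipped through as ham
tn = 895  # ham correctly left alone

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.985 precision=0.947 recall=0.900 f1=0.923
```

Note how accuracy stays at 0.985 even though one in ten spam messages is missed; this is why precision, recall, and F1 are reported alongside it.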

---

## 🔹 **9. Result Summary**

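Typical results for this kind of model on standard spam datasets are shown below. The values are approximate and are not produced by the six-message sample dataset used in the code section of this document.

| Metric    | Value (approx.) |
| --------- | --------------- |
| Accuracy  | 0.97–0.99       |
| Precision | 0.98            |
| Recall    | 0.97            |
| F1 Score  | 0.97            |

At these levels, the model catches most spam (recall of about 0.97) while rarely flagging legitimate messages as spam (precision of about 0.98).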

---

## 🔹 **10. Applications**

* Email spam filtering systems (like Gmail)
* SMS spam detection in mobile networks
* Social media message moderation
* Fraud detection in online platforms

---

## 🔹 **11. Advantages**

* Fast and lightweight algorithm
* Works well on text data
* High accuracy with TF-IDF
* Easy to deploy and interpret

---

## 🔹 **12. Limitations**

* Naive Bayes assumes independence between words (not always true)
* Context understanding is limited (no deep semantic analysis)
* May misclassify borderline cases with mixed content

---

## 🔹 **13. Future Scope**

* Use **Deep Learning models (LSTM, BERT)** for better contextual understanding.
* Real-time spam detection APIs.
* Integration with mail servers for automatic filtering.
* Visual dashboards for spam analytics.

---

## 🔹 **14. Conclusion**

This project demonstrates how **Supervised Learning** can be used effectively to filter spam emails and messages. Using **TF-IDF features** and a **Multinomial Naive Bayes classifier**, the model can reach high accuracy and provides an automated, fast, and scalable spam detection solution.

---

# **CODE (Input → Output Interactive Version)**

The script below trains the classifier on a small in-line sample dataset, prints its test metrics, saves the model, and then prompts for messages to classify interactively.

```python
# EMAIL SPAM FILTERING USING SUPERVISED LEARNING
# Author: Dhruv Patel (Example)
# Algorithm: Multinomial Naive Bayes with TF-IDF features

import re
import nltk
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

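# Download the NLTK resources used below (skipped if already up to date).
# 'punkt_tab' is the tokenizer data that newer NLTK releases look for.
for resource in ('stopwords', 'wordnet', 'punkt', 'punkt_tab'):
    nltk.download(resource, quiet=True)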

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

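# ---------- Text Preprocessing Function ----------
# Build these once instead of on every call.
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, strip URLs and non-letters, drop stopwords, lemmatize."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # remove URLs
    # Replace non-letters with a space so words joined by punctuation
    # (e.g. "prize.click") are not merged into one token.
    text = re.sub(r'[^a-z\s]', ' ', text)
    tokens = nltk.word_tokenize(text)
    cleaned = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(cleaned)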

# ---------- Load or Create Dataset ----------
data = {
    'label': ['ham','spam','ham','spam','ham','spam'],
    'message': [
        'Hey, are you coming to college today?',
        'Congratulations! You won $1000 cash prize. Click here to claim.',
        'Can you send me the report file?',
        'Exclusive offer! Get free entry in our lucky draw!',
        'Lunch at 1 PM?',
        'Free Viagra now available online! Order now!'
    ]
}
df = pd.DataFrame(data)

# Preprocess text
df['clean_text'] = df['message'].apply(preprocess_text)

# Convert labels to numbers
df['label_num'] = df['label'].map({'ham':0, 'spam':1})

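# Split data. 30% is held out here because the sample has only six messages;
# use test_size=0.2 for the 80/20 split on a full dataset.
# stratify keeps one ham and one spam message in the two-message test set.
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_text'], df['label_num'],
    test_size=0.3, random_state=42, stratify=df['label_num']
)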

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train Naive Bayes model
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Predict
y_pred = model.predict(X_test_vec)

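# Evaluate. Passing labels= keeps the report valid even if a class
# happens to be missing from the test set.
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n",
      classification_report(y_test, y_pred, labels=[0, 1],
                            target_names=['ham', 'spam'], zero_division=0))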

# ---------- Save Model ----------
joblib.dump((model, vectorizer), 'spam_filter_model.joblib')
print("\nModel saved successfully!\n")

# ---------- User Input (Interactive Part) ----------
print("Enter your message to test (type 'exit' to stop):")
while True:
    msg = input("\nEnter message: ")
    if msg.strip().lower() == 'exit':
        break
    clean_msg = preprocess_text(msg)          # same cleaning as in training
    msg_vec = vectorizer.transform([clean_msg])
    pred = model.predict(msg_vec)[0]
    label = 'SPAM' if pred == 1 else 'HAM'
    print(f"Prediction: {label}")
print("Program ended.")
```
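
Once the script has saved `spam_filter_model.joblib`, a separate program can reload it without retraining. The sketch below shows one way to do that. So that it runs on its own, it first trains and saves a tiny stand-in model in a temporary directory; the four training texts and the `classify` helper are hypothetical. In a real setup, the same `preprocess_text` function used during training should be applied before `transform`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in training set (hypothetical) so the sketch is self-contained.
texts = ["free prize click now", "win free cash offer",
         "meeting at noon today", "send me the report"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Save and reload the (model, vectorizer) pair as one tuple.
path = os.path.join(tempfile.mkdtemp(), "spam_filter_model.joblib")
joblib.dump((model, vectorizer), path)
loaded_model, loaded_vectorizer = joblib.load(path)

def classify(message):
    """Return 'SPAM' or 'HAM' using the reloaded model."""
    features = loaded_vectorizer.transform([message.lower()])
    return "SPAM" if loaded_model.predict(features)[0] == 1 else "HAM"

print(classify("free cash prize"))
print(classify("report for the meeting"))
```

The vectorizer is stored together with the model because the classifier only understands the vocabulary it was fitted on. Since joblib files are pickle-based, they should only be loaded from sources you trust.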

---

## **Example Input & Output**

**Input 1:**

```
Enter message: Congratulations! You’ve won a free ticket. Click here to claim now!
```

**Output:**

```
Prediction: SPAM 
```

**Input 2:**

```
Enter message: Hey, what time is our meeting tomorrow?
```

**Output:**

```
Prediction: HAM 
```

**Input 3:**

```
Enter message: exit
```

**Output:**

```
Program ended.
```

---

## ✅ Summary of Workflow

1. Data → Preprocessed → TF-IDF → Naive Bayes Model
2. Model → Predicts Spam or Ham → Displays result instantly
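
The two workflow steps can also be bundled into a single scikit-learn `Pipeline`, so that vectorization and classification are fitted and applied together. This is an alternative arrangement, not the structure used in the script above. The training texts are invented, and the built-in `stop_words="english"` option stands in for the NLTK-based preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented sample data (illustrative only).
texts = [
    "Free prize, click to claim now",
    "Exclusive offer, win free cash",
    "Are we meeting today",
    "Please send the report file",
]
labels = ["spam", "spam", "ham", "ham"]

# Step 1: Data -> TF-IDF -> Naive Bayes, fitted as one object.
spam_pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
spam_pipeline.fit(texts, labels)

# Step 2: the fitted pipeline predicts directly from raw text.
predictions = spam_pipeline.predict([
    "Claim your free prize",
    "Is the report ready for the meeting",
]).tolist()
print(predictions)
```

A pipeline removes the risk of transforming new messages with a different vectorizer than the one used in training, since both parts are saved and loaded as one object.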

