# Spam Detection Pipeline

**Objective:** Build an end-to-end spam detection model using **TF-IDF** and **Logistic Regression**.

---
## 🧠 1️⃣ What is Spam Detection?

Spam detection is a binary text classification problem: given a message or email, decide whether it is **spam (unwanted)** or **ham (legitimate)**.

**Example:**
- "Win a FREE iPhone! Click here now!" → Spam 🚫
- "Your meeting is scheduled for tomorrow at 10 AM." → Ham ✅

---
## ⚙️ 2️⃣ Import Required Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

---
## 📚 3️⃣ Create or Load Dataset

For demonstration, we’ll create a small sample dataset. In real-world applications, datasets like **SMS Spam Collection** (UCI) or **Enron Email Dataset** are used.

In [None]:
data = {
    'text': [
        'Win a brand new car! Click here to claim.',
        'Congratulations! You have won a lottery.',
        'Please verify your account details to continue.',
        'Hey, are we still on for the meeting tomorrow?',
        'Don’t forget to submit your assignment by 5 PM.',
        'Get free vouchers now!!!',
        'Call me when you reach the office.',
        'URGENT! Your mobile number has been selected!',
        'Can you review my project report?',
        'Exclusive offer just for you. Limited time only!'
    ],
    'label': ['spam', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam']
}

df = pd.DataFrame(data)
df.head()
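
For a larger corpus, the same `text`/`label` DataFrame can be loaded from a file instead of a dictionary. The sketch below assumes a tab-separated layout with the label first, the message second, and no header row (the layout the UCI SMS Spam Collection is commonly distributed in; check your copy). The two inline rows and the names `raw` and `sms_df` are made up for illustration.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for a tab-separated file: label<TAB>message, no header row.
raw = io.StringIO(
    "ham\tSee you at lunch tomorrow\n"
    "spam\tFree entry! Text WIN to claim your prize\n"
)

# Swap `raw` for a file path when loading a real dataset.
# QUOTE_NONE keeps stray quote characters inside messages from confusing the parser.
sms_df = pd.read_csv(
    raw, sep="\t", header=None, names=["label", "text"], quoting=csv.QUOTE_NONE
)

# Check the class balance before splitting; spam corpora are often imbalanced.
print(sms_df["label"].value_counts())
```

Once loaded this way, `sms_df['text']` and `sms_df['label']` can stand in for the `df` columns used in the rest of the notebook.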

---
## 🧹 4️⃣ Preprocessing & Data Splitting

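We’ll split the dataset into training and test sets, then convert the text into numerical vectors using **TF-IDF**. The vectorizer is fitted on the training set only and then applied to the test set, so no vocabulary or document-frequency information leaks from the test data.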

In [None]:
X = df['text']
y = df['label']

# Stratify so both classes appear in the train and test sets despite the tiny sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print('✅ TF-IDF vectorization complete!')
print('Vocabulary size:', len(vectorizer.get_feature_names_out()))
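
To make the TF-IDF step less of a black box, here is a minimal sketch that recomputes the weights by hand on a three-document toy corpus and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The corpus and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration.
docs = ['free prize now', 'meeting now', 'free free offer']

# Raw term counts: one row per document, one column per vocabulary term.
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency (scikit-learn's default variant).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print('Manual result matches TfidfVectorizer:', np.allclose(manual_tfidf, sklearn_tfidf))
```

Terms that occur in fewer documents get a larger IDF, which is why rare, distinctive words tend to carry more weight than common ones.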

---
## 🤖 5️⃣ Model Training (Logistic Regression)

In [None]:
model = LogisticRegression(max_iter=300)
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)

print('✅ Model training complete!')
print('\nAccuracy:', accuracy_score(y_test, y_pred))

---
## 📊 6️⃣ Model Evaluation

In [None]:
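# zero_division=0 avoids warnings when a class is never predicted in this tiny test set
print('\nClassification Report:\n', classification_report(y_test, y_pred, zero_division=0))

# Fix the label order explicitly so the matrix rows/columns always match the tick labels
class_order = ['ham', 'spam']
cm = confusion_matrix(y_test, y_pred, labels=class_order)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Ham','Spam'], yticklabels=['Ham','Spam'])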
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
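
The precision and recall figures in the classification report come straight from the confusion matrix. As a worked example, the sketch below uses six hypothetical labels and predictions (not the notebook's results), treats `spam` as the positive class, and checks the hand-computed values against scikit-learn.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions, for illustration only.
y_true = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']
y_hat = ['spam', 'ham', 'ham', 'ham', 'spam', 'spam']

# With labels=['ham', 'spam'], rows are actual classes and columns are predictions,
# so ravel() yields: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=['ham', 'spam']).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many really were spam
recall = tp / (tp + fn)     # of the real spam messages, how many were caught

print('Precision:', precision, '| sklearn:', precision_score(y_true, y_hat, pos_label='spam'))
print('Recall:', recall, '| sklearn:', recall_score(y_true, y_hat, pos_label='spam'))
```

For a spam filter, low precision means legitimate mail gets blocked, while low recall means spam gets through; which matters more depends on the application.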

---
## 💬 7️⃣ Test with Custom Messages
You can now test your model on new messages.

In [None]:
samples = [
    'Win cash prizes instantly! Click the link below.',
    'Please send me the report by evening.',
    'You have been selected for a free vacation!'
]

sample_features = vectorizer.transform(samples)
predictions = model.predict(sample_features)

for text, label in zip(samples, predictions):
    emoji = '🚫' if label == 'spam' else '✅'
    print(f'{text} → {label.upper()} {emoji}')
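
Hard labels hide how confident the model is. A possible extension is to read the spam probability from `predict_proba` and apply your own cutoff. The sketch below is self-contained, with made-up training messages, and the `threshold` value of 0.5 is an arbitrary assumption rather than a recommended setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, for illustration only.
texts = [
    'win a free prize now',
    'free vouchers click now',
    'meeting at the office tomorrow',
    'please review the project report',
]
labels = ['spam', 'spam', 'ham', 'ham']

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=300).fit(vec.fit_transform(texts), labels)

new_messages = ['free prize, click now', 'see you at the office']

# predict_proba returns one column per class, ordered as in clf.classes_.
spam_col = list(clf.classes_).index('spam')
probs = clf.predict_proba(vec.transform(new_messages))[:, spam_col]

threshold = 0.5  # hypothetical cutoff; raise it to flag fewer legitimate messages
for text, p in zip(new_messages, probs):
    verdict = 'SPAM' if p >= threshold else 'HAM'
    print(f'{text} → P(spam)={p:.2f} → {verdict}')
```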

---
## ⚙️ 8️⃣ Full Pipeline Integration

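In real-world applications, you can use scikit-learn's **`Pipeline`** to chain the TF-IDF vectorizer and the classifier into a single estimator. Calling `fit` or `predict` on the pipeline then runs both steps on raw text, so the same preprocessing is always applied at training and prediction time.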

In [None]:
from sklearn.pipeline import Pipeline

spam_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=1000)),
    ('model', LogisticRegression(max_iter=300))
])

spam_pipeline.fit(X_train, y_train)
print('✅ End-to-End Pipeline Ready!')
print('Test Accuracy:', spam_pipeline.score(X_test, y_test))
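
A practical benefit of wrapping both steps in a pipeline is that it can be cross-validated as one unit: the vectorizer is refitted on the training portion of every fold, so the held-out messages never influence the vocabulary. The sketch below assumes a made-up six-message dataset and three stratified folds; with so little data the scores are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Tiny made-up dataset, for illustration only.
texts = [
    'win a free prize now',
    'free vouchers click now',
    'claim your free lottery prize',
    'meeting at the office tomorrow',
    'please review the project report',
    'call me after the meeting',
]
labels = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('model', LogisticRegression(max_iter=300)),
])

# Stratified folds keep one spam and one ham message in every held-out split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv)

print('Fold accuracies:', scores)
print('Mean accuracy:', scores.mean())
```

On a realistic dataset, the mean and spread of the fold scores give a more stable estimate than a single train/test split.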

---
## ✅ Summary

- Built an end-to-end **Spam Detection pipeline** using scikit-learn.
- Used **TF-IDF** for text feature extraction.
- Applied **Logistic Regression** for classification.
- Evaluated using **accuracy**, a **classification report**, and a **confusion matrix**.
- Demonstrated **Pipeline automation**.

---
📘 **Next:** `09-News_Category_Classification.ipynb` — Learn how to classify text into multiple categories!