# Task 4: Spam Email Detection using Machine Learning

This notebook demonstrates a complete workflow for building a text classification model to detect spam emails. 

### Workflow:
1. **Data Creation**: Building a small hand-written sample of emails, repeated to enlarge it.
2. **Preprocessing**: Lowercasing, tokenizing, and removing English stop words (handled inside the vectorizer).
3. **Feature Extraction**: Converting text to numerical data using **TF-IDF**.
4. **Modeling**: Training a **Multinomial Naive Bayes** classifier.
5. **Evaluation**: Measuring performance using a Confusion Matrix and Classification Report.
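
As a compact illustration of steps 2 to 4, the vectorizer and classifier can also be chained into a single scikit-learn `Pipeline`. This is only a minimal sketch of one possible arrangement, not the structure used in the cells that follow; the toy messages and the names `texts`, `labels`, and `spam_pipeline` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy messages, for illustration only (1 = spam, 0 = ham).
texts = [
    "Claim your free prize now",
    "Lunch at noon tomorrow?",
    "Free gift card, claim today",
    "Notes from the lecture attached",
]
labels = [1, 0, 1, 0]

# Chain preprocessing + feature extraction and the classifier into one estimator.
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
    ("nb", MultinomialNB()),
])
spam_pipeline.fit(texts, labels)

# The pipeline accepts raw strings directly at prediction time.
train_preds = spam_pipeline.predict(texts)
new_pred = spam_pipeline.predict(["free prize"])[0]
print(train_preds, new_pred)
```

Bundling the steps this way keeps the vectorizer fitted on training text only, because `fit` and `predict` always pass through the same chain.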

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1. Create a Sample Dataset
data = {
    'text': [
        'Get rich quick! Click here for a free prize now!',
        'Hey, are we still meeting for lunch at 12?',
        'Urgent: Your account has been compromised. Login immediately.',
        'The project report is attached for your review.',
        'Win a $1000 Walmart gift card! Claim yours today.',
        'Can you send me the notes from yesterday\'s lecture?',
        'CONGRATULATIONS! You have been selected for a free cruise.',
        'Please let me know if you can attend the wedding.',
        'Low interest rates available for your mortgage. Apply now!',
        'The weather today is expected to be sunny and warm.'
    ] * 10,  # Repeat the 10 messages 10x; the copies are exact duplicates, so test scores will be optimistic
    'label': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 10  # 1 = Spam, 0 = Ham
}

df = pd.DataFrame(data)
print(f"Dataset Size: {df.shape}")
df.head()

## Feature Extraction and Data Splitting
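Machine learning models cannot operate on raw text directly. We use **TF-IDF (Term Frequency-Inverse Document Frequency)** to convert each message into a numeric vector: a word's weight grows with how often it appears in that message and shrinks with how many messages contain it.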
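
To make the weighting concrete, here is a small sketch on three hypothetical documents (the names `docs`, `vectorizer`, and `matrix` are illustrative and not part of the notebook's pipeline). With scikit-learn's default settings, the inverse document frequency is computed as `ln((1 + n) / (1 + df)) + 1`, so a word that appears in every document receives the lowest possible weight.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "free" appears in every document, "prize" in only one.
docs = [
    "free prize free prize",
    "free lunch meeting",
    "free weather report",
]

vectorizer = TfidfVectorizer()           # default settings: smooth_idf=True, norm="l2"
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

vocab = vectorizer.vocabulary_           # maps each term to its column index
idf = vectorizer.idf_                    # one IDF value per column

idf_free = idf[vocab["free"]]    # df = 3 of 3 documents
idf_prize = idf[vocab["prize"]]  # df = 1 of 3 documents
print(f"idf(free)  = {idf_free:.3f}")
print(f"idf(prize) = {idf_prize:.3f}")

# Each row is L2-normalized, so message length does not dominate the weights.
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))
print(row_norms.ravel())
```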

In [None]:
# 2. Split Data (stratify keeps the spam/ham ratio the same in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)

# 3. Feature Extraction (TF-IDF)
tfidf = TfidfVectorizer(stop_words='english', lowercase=True)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f"Feature matrix shape: {X_train_tfidf.shape}")

## Model Training and Evaluation
We will use the **Multinomial Naive Bayes** algorithm. It is designed for count-like features such as word counts, and in practice it also works well with the fractional weights produced by TF-IDF.
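
The sketch below illustrates how a fitted `MultinomialNB` arrives at a prediction: it adds the log prior of each class to the sum of per-word log probabilities, weighted by the word counts of the message, and picks the class with the highest score. The count matrix and the names `X`, `y`, `nb`, and `new_msg` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts; columns stand for the words ["free", "prize", "lunch"].
X = np.array([
    [2, 1, 0],   # spam
    [1, 2, 0],   # spam
    [0, 0, 2],   # ham
    [0, 1, 3],   # ham
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

nb = MultinomialNB(alpha=1.0).fit(X, y)  # alpha=1.0 is Laplace smoothing

# A new message containing "free" once and "prize" once.
new_msg = np.array([[1, 1, 0]])

# Score per class = log prior + sum over words of count * log P(word | class).
scores = new_msg @ nb.feature_log_prob_.T + nb.class_log_prior_
manual_pred = nb.classes_[np.argmax(scores, axis=1)]

print("Manual prediction:", manual_pred[0])
print("model.predict:    ", nb.predict(new_msg)[0])
```

Smoothing (`alpha`) matters here because a word never seen in one class would otherwise get probability zero and rule that class out entirely.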

In [None]:
# 4. Train Model
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# 5. Predict and Evaluate
y_pred = model.predict(X_test_tfidf)

print("--- Accuracy Score ---")
print(f"{accuracy_score(y_test, y_pred) * 100:.2f}%")

print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred))

# 6. Visualizing the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Spam Detection')
plt.show()
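
A caveat on the scores above: because the dataset repeats 10 messages ten times, a random split will typically place copies of the same message in both the training and test sets, so the reported accuracy mostly reflects memorization. One possible way to check for this is to split by message identity using scikit-learn's `GroupShuffleSplit`. The sketch below uses hypothetical placeholder messages, and the names `df_demo`, `naive_leak`, and `group_leak` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Hypothetical placeholder data: 10 distinct messages, each repeated 10 times.
unique_texts = [f"placeholder message {i}" for i in range(10)]
df_demo = pd.DataFrame({
    "text": unique_texts * 10,
    "label": [1, 0] * 5 * 10,
    "group": list(range(10)) * 10,  # one group id per distinct message
})

# Naive random split: copies of a message can land on both sides.
train_df, test_df = train_test_split(df_demo, test_size=0.2, random_state=42)
naive_leak = test_df["text"].isin(train_df["text"]).mean()

# Group-aware split: all copies of a message stay on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df_demo, groups=df_demo["group"]))
group_train, group_test = df_demo.iloc[train_idx], df_demo.iloc[test_idx]
group_leak = group_test["text"].isin(group_train["text"]).mean()

print(f"Test messages also seen in training (random split): {naive_leak:.0%}")
print(f"Test messages also seen in training (group split):  {group_leak:.0%}")
```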

## Live Testing
Try a custom sentence to see whether the model classifies it as Spam or Ham.
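
When testing custom sentences, it can help to look at the predicted probability as well as the label, and to check how many words of the message the vectorizer actually recognizes. Words never seen during training are ignored, so a message made entirely of unknown words falls back to the class prior. The sketch below is self-contained and trains its own tiny model on hypothetical messages; `tfidf_demo`, `clf_demo`, and `predict_with_confidence` are illustrative names, not part of the notebook's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (1 = spam, 0 = ham).
texts = [
    "Claim your free prize now",
    "Lunch at noon tomorrow?",
    "Free gift card, claim today",
    "Notes from the lecture attached",
]
labels = [1, 0, 1, 0]

tfidf_demo = TfidfVectorizer(stop_words="english")
clf_demo = MultinomialNB().fit(tfidf_demo.fit_transform(texts), labels)

def predict_with_confidence(message):
    """Return (label, spam probability, number of recognized terms)."""
    vec = tfidf_demo.transform([message])
    spam_col = list(clf_demo.classes_).index(1)   # column of the spam class
    p_spam = float(clf_demo.predict_proba(vec)[0][spam_col])
    label = "SPAM" if clf_demo.predict(vec)[0] == 1 else "HAM"
    return label, p_spam, int(vec.nnz)            # nnz = non-zero features

known = predict_with_confidence("free prize claim")
unknown = predict_with_confidence("zebra giraffe")
print(known)
print(unknown)
```

A recognized-term count of zero is a signal that the prediction carries no information about the message itself.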

In [None]:
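def predict_spam(message):
    """Return "SPAM" or "HAM" for a single message using the fitted tfidf and model."""
    # Reuse the fitted vectorizer; words not seen during training are ignored
    msg_tfidf = tfidf.transform([message])
    prediction = model.predict(msg_tfidf)
    return "SPAM" if prediction[0] == 1 else "HAM"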

test_msg = "Congratulations! You have won a lottery ticket. Claim now."
print(f"Message: '{test_msg}'")
print(f"Prediction: {predict_spam(test_msg)}")