# Spam Mail Detection Project

This notebook walks through building a spam detection model step by step with Python and scikit-learn.

## Workflow:
1. Import libraries
2. Load and explore the dataset
3. Preprocess and vectorize the text (TF-IDF)
4. Train/test split
5. Train a Naive Bayes classifier
6. Evaluate the model
7. Test the model with custom inputs

## Step 1: Import libraries

In [None]:
# Libraries for data handling, feature extraction, modeling, and evaluation
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## Step 2: Load the dataset
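Download a spam dataset, such as the [SMS Spam Collection](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset),
and place it in the same folder as this notebook. The code below expects a file named `spam.csv` in which column `v1` holds the label (`ham` or `spam`) and column `v2` holds the message text.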

In [None]:
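# Load only the label (v1) and text (v2) columns; latin-1 avoids a decode error on this file
df = pd.read_csv('spam.csv', encoding='latin-1', usecols=['v1', 'v2'])
df = df.rename(columns={'v1': 'label', 'v2': 'message'})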

# Convert labels to binary (ham = 0, spam = 1)
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

df.head()

## Step 3: Preprocess and vectorize text data

In [None]:
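# Initialize the TF-IDF vectorizer (lowercases text by default and drops English stop words)
# Note: fitting on the full dataset lets test messages influence the vocabulary and IDF weights.
# For a stricter evaluation, split first and fit the vectorizer on the training messages only.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['message'])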
y = df['label']

X.shape  # (number_of_samples, number_of_features)
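
To make the TF-IDF step less of a black box, here is a minimal standard-library sketch of the weighting. It assumes whitespace tokenization and no stop-word removal, so it is a simplification of `TfidfVectorizer`, but it follows scikit-learn's default smoothed formula, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The function name `tfidf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in doc_freq.items()
    }

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors


# Hypothetical toy corpus, not taken from the dataset
toy_docs = ["free prize now", "lunch now", "free free prize"]
toy_vectors = tfidf(toy_docs)

for doc, vector in zip(toy_docs, toy_vectors):
    print(doc, "->", {term: round(weight, 3) for term, weight in vector.items()})
```

In the second toy document, "lunch" receives a higher weight than "now" because it occurs in fewer documents, which is the intuition behind down-weighting common words.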

## Step 4: Train/Test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
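
The split above is purely random. Because ham messages considerably outnumber spam in the SMS Spam Collection, it can help to keep the class ratio the same in both sets, which `train_test_split` supports through its `stratify` parameter (for example `stratify=y`). The standard-library sketch below illustrates the idea only; `stratified_split` is a hypothetical helper, not part of scikit-learn, and the toy labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in the test set."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 40 ham (0) and 10 spam (1)
toy_labels = [0] * 40 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels)

print("Train size:", len(train_idx), "Test size:", len(test_idx))
print("Spam in test set:", sum(toy_labels[i] for i in test_idx))
```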

## Step 5: Train Naive Bayes classifier

In [None]:
model = MultinomialNB()
model.fit(X_train, y_train)
print("Model training completed.")
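
For intuition about what `MultinomialNB` learns, the following standard-library sketch trains a multinomial Naive Bayes model on raw word counts with Laplace smoothing (`alpha=1.0`, which is also scikit-learn's default). It is a simplified illustration under those assumptions, not the library's implementation: the helper names `train_nb` and `predict_nb` and the toy messages are hypothetical, and the notebook's model is fit on TF-IDF weights rather than counts.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        term_counts[label].update(tokens)
        vocab.update(tokens)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Laplace smoothing: add alpha to every term count in the vocabulary
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words never seen during training are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t]
            for t in doc.lower().split()
            if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy training data (1 = spam, 0 = ham)
toy_docs = ["win free prize now", "free cash prize", "lunch tomorrow", "see you at lunch"]
toy_labels = [1, 1, 0, 0]
prior, likelihood = train_nb(toy_docs, toy_labels)

toy_prediction = predict_nb("free prize", prior, likelihood)
print("free prize ->", "Spam" if toy_prediction == 1 else "Not Spam")
```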

## Step 6: Evaluate the model

In [None]:
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
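
To help read the output, the sketch below recomputes the main metrics by hand for binary labels. It uses the same layout as scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`). The helper `binary_metrics` and the toy labels are illustrative assumptions, not part of the notebook's pipeline.

```python
def binary_metrics(y_true, y_pred):
    """Compute the confusion matrix and common metrics for labels 0 (ham) and 1 (spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Precision: of the messages flagged as spam, how many really were spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real spam messages, how many were caught
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels and predictions
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]
result = binary_metrics(toy_true, toy_pred)
print(result)
```

On an imbalanced dataset, accuracy alone can look high even when much of the spam is missed, so the spam-class recall and precision in the classification report are worth checking as well.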

## Step 7: Test with custom email messages

In [None]:
sample_emails = [
    "Free entry to win a $500 gift card! Click here now!",
    "Hey, are we still meeting tomorrow for lunch?"
]

sample_features = vectorizer.transform(sample_emails)
predictions = model.predict(sample_features)

for email, label in zip(sample_emails, predictions):
    print(email, "->", "Spam" if label == 1 else "Not Spam")
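
Predicting on new text requires the fitted vectorizer and the model to stay together. One possible way to enforce that is scikit-learn's `make_pipeline`, sketched below on hypothetical toy data. The pipeline accepts raw strings, learns the TF-IDF vocabulary only from the messages it is fit on, and exposes `predict_proba` for a spam probability. The variable names and messages here are illustrative assumptions, not part of the notebook above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for a training split (1 = spam, 0 = ham)
toy_messages = [
    "win free prize now",
    "free cash prize claim now",
    "claim your free gift card",
    "are we meeting for lunch tomorrow",
    "see you at lunch",
    "call me when you get home",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Vectorizer and classifier are fit together on the training messages only
spam_pipeline = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
spam_pipeline.fit(toy_messages, toy_labels)

new_messages = ["Free prize waiting, claim now", "Lunch tomorrow?"]
toy_predictions = spam_pipeline.predict(new_messages)
# Column 1 holds the probability of class 1 (spam)
spam_probabilities = spam_pipeline.predict_proba(new_messages)[:, 1]

for message, label, prob in zip(new_messages, toy_predictions, spam_probabilities):
    print(message, "->", "Spam" if label == 1 else "Not Spam", f"(p_spam={prob:.2f})")
```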

## Next Steps
- Improve text preprocessing by adding stemming or lemmatization.
- Try other algorithms like Logistic Regression or Random Forest.
- Experiment with deep learning models, which may improve accuracy given enough training data.
- Deploy this model using Flask, Django, or FastAPI.
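
As a starting point for the preprocessing suggestion above, here is a minimal cleaning sketch using only the standard library. The function name `clean_text` and the placeholder tokens `url` and `number` are assumptions chosen for illustration. The idea is to normalize messages before vectorizing, for example with `df['message'].map(clean_text)`.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_text(text):
    """Lowercase, replace URLs and numbers with placeholders, and strip punctuation."""
    text = text.lower()
    text = URL_RE.sub(" url ", text)        # collapse every link into one token
    text = NUMBER_RE.sub(" number ", text)  # collapse digit runs into one token
    text = NON_LETTER_RE.sub(" ", text)     # drop punctuation and symbols
    return " ".join(text.split())           # normalize whitespace


cleaned = clean_text("WIN $500 now!!! Visit http://example.com/prize")
print(cleaned)
```

Whether placeholders like these help depends on the data, so it is worth comparing evaluation metrics with and without the cleaning step.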