# 📧 Email Spam Detection using Naive Bayes

## Problem Statement
Email communication is a crucial part of everyday life, but the increasing volume of spam emails poses a significant challenge. Spam emails often contain fraudulent content, phishing attempts, or advertisements that can clutter inboxes and lead to security risks.

This project aims to build a **Naive Bayes-based Email Spam Classifier** to automatically detect and filter spam emails. Using the **Spam Email Dataset**, we will preprocess email text, extract meaningful features, and train a Naive Bayes model to classify emails as **spam or not spam**.

## Objectives
- Load and preprocess the dataset by cleaning text data and transforming labels.
- Vectorize the text with **CountVectorizer** (word counts) followed by **TF-IDF** weighting.
- Train a **Naive Bayes classifier** (MultinomialNB) for spam detection.
- Evaluate the model using **accuracy, precision, recall, and F1-score**.
- Visualize results using **word clouds, confusion matrices, and bar plots**.

By the end of this project, we aim to have a spam classifier whose quality is measured on held-out test data, so that it can help improve email security and reduce inbox clutter.


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [3]:
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data[['v1', 'v2']] # Selecting only relevant columns (v1: label, v2: email text)
data.columns = ['label', 'message'] # Renaming columns for better readability

In [5]:
# Convert categorical labels ('ham' or 'spam') to binary format (0 = ham, 1 = spam)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
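
One thing worth checking after the mapping step: `Series.map` with a dictionary returns `NaN` for any value that is not a key, so a miscased or misspelled label would silently become missing. The sketch below shows this on a small hypothetical frame (the `toy` data and its miscased `'Spam'` entry are made up for illustration and are not part of the dataset).

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    'label': ['ham', 'spam', 'ham', 'Spam'],  # 'Spam' is deliberately miscased
    'message': ['see you at 5', 'WIN a free prize now',
                'lunch tomorrow?', 'claim your reward'],
})

mapped = toy['label'].map({'ham': 0, 'spam': 1})

# Values missing from the dictionary come back as NaN
unmapped = toy.loc[mapped.isna(), 'label'].unique().tolist()
print("Labels that were not mapped:", unmapped)
```

If the list is non-empty on the real data, the labels probably need normalizing (for example with `.str.lower()`) before mapping.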

In [6]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(data['message'], data['label'], test_size=0.2, random_state=42)
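
Spam datasets are often imbalanced, so a purely random split can leave the test set with a different spam ratio than the training set. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. A minimal sketch on hypothetical toy labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 ham (0) and 4 spam (1) messages
toy_messages = pd.Series([f"message {i}" for i in range(20)])
toy_labels = pd.Series([0] * 16 + [1] * 4)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_messages, toy_labels,
    test_size=0.25, random_state=42,
    stratify=toy_labels,  # preserve the ham/spam ratio in both splits
)

print("Spam share in train:", ys_train.mean())
print("Spam share in test: ", ys_test.mean())
```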

In [7]:
# Create a text processing and classification pipeline
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
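
The first two pipeline steps turn each message into word counts and then reweight those counts by TF-IDF. scikit-learn also provides `TfidfVectorizer`, which combines both steps; with default settings the two routes should produce the same matrix. The sketch below checks this on a few hypothetical messages (the `docs` list is made up for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

# Hypothetical example messages
docs = [
    "free prize waiting for you",
    "are we still meeting for lunch",
    "claim your free prize now",
]

# Two-step route, as in the pipeline above
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("Same TF-IDF matrix:", same)
```

Keeping the steps separate, as the pipeline does, makes it easy to drop the `'tfidf'` step and train on raw counts for comparison.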


In [8]:
# Train the model
model.fit(X_train, y_train)
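
Once the model is fitted, the metrics listed in the objectives can be computed from its predictions on the test set. The following is a minimal, self-contained sketch using hypothetical toy messages in place of `X_train` and `X_test` (the names `toy_model`, `train_msgs`, and `test_msgs` are illustrative). On the real data, the same calls would take `y_test` and `model.predict(X_test)`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical toy data (0 = ham, 1 = spam)
train_msgs = [
    "win a free prize now", "claim your cash reward today",
    "free entry to win cash", "urgent prize waiting claim now",
    "are we meeting for lunch", "see you at the office tomorrow",
    "can you call me later", "thanks for the notes from class",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_msgs = [
    "lunch tomorrow at the office", "free cash prize claim now",
    "call me after class", "win a reward today",
]
test_labels = [0, 1, 0, 1]

toy_model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])
toy_model.fit(train_msgs, train_labels)
preds = toy_model.predict(test_msgs)

acc = accuracy_score(test_labels, preds)
# Rows are true labels, columns are predicted labels, in the order [ham, spam]
cm = confusion_matrix(test_labels, preds, labels=[0, 1])
report = classification_report(
    test_labels, preds, labels=[0, 1],
    target_names=['ham', 'spam'], zero_division=0,
)

print("Accuracy:", acc)
print(cm)
print(report)
```

The classification report lists precision, recall, and F1-score per class, which is more informative than accuracy alone when spam is the minority class.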