# Spam Detection

## 1. Introduction
This notebook builds a spam detection model using the 20 Newsgroups dataset from `scikit-learn`. We treat some newsgroup categories as 'ham' (not spam) and others as 'spam' to create a binary classification problem. The workflow covers loading the data, extracting TF-IDF features with English stop-word removal, training a Multinomial Naive Bayes classifier, and evaluating it on the held-out test split.

## 2. Data Loading and Preparation

In [None]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

# Define the categories for 'ham' and 'spam'
ham_categories = ['rec.sport.baseball', 'sci.space']
spam_categories = ['talk.politics.guns', 'misc.forsale']

# Load the datasets
ham_train = fetch_20newsgroups(subset='train', categories=ham_categories, shuffle=True, random_state=42)
spam_train = fetch_20newsgroups(subset='train', categories=spam_categories, shuffle=True, random_state=42)

ham_test = fetch_20newsgroups(subset='test', categories=ham_categories, shuffle=True, random_state=42)
spam_test = fetch_20newsgroups(subset='test', categories=spam_categories, shuffle=True, random_state=42)

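# Combine the data: ham documents first, then spam documents.
# Labels are integers in the same order: 0 = ham, 1 = spam.
X_train = ham_train.data + spam_train.data
y_train = np.concatenate([np.zeros(len(ham_train.data), dtype=int), np.ones(len(spam_train.data), dtype=int)])

X_test = ham_test.data + spam_test.data
y_test = np.concatenate([np.zeros(len(ham_test.data), dtype=int), np.ones(len(spam_test.data), dtype=int)])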
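
The label construction above relies on the texts and labels being concatenated in the same order. The following minimal sketch illustrates this on a toy corpus and checks the class balance. The names `ham_docs`, `spam_docs`, `texts`, and `labels` are hypothetical stand-ins for the newsgroup data and are not part of the notebook.

```python
import numpy as np

# Hypothetical stand-ins for ham_train.data and spam_train.data
ham_docs = ["the pitcher threw a shutout", "the probe reached orbit"]
spam_docs = ["rifle for sale", "selling a used monitor", "gun control debate"]

# Concatenate texts and labels in the same order so they stay aligned
texts = ham_docs + spam_docs
labels = np.concatenate([
    np.zeros(len(ham_docs), dtype=int),  # 0 = ham
    np.ones(len(spam_docs), dtype=int),  # 1 = spam
])

# Every text must have exactly one label
assert len(texts) == len(labels)

# Count documents per class to spot imbalance before training
class_counts = np.bincount(labels, minlength=2)
print(dict(zip(["ham", "spam"], class_counts.tolist())))
```

Running the same `np.bincount` check on `y_train` would require integer labels, which is one reason to prefer an integer dtype over the float default of `np.zeros` and `np.ones`.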

## 3. Text Preprocessing and Feature Extraction

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the test data
X_test_tfidf = vectorizer.transform(X_test)
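
To make the fit/transform split concrete, here is a small sketch on a toy corpus (the `toy_*` names and sentences are illustrative assumptions, not notebook data). It shows that the vocabulary is learned only from the training texts, that English stop words are dropped, and that words unseen during fitting are ignored at transform time.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, assumed purely for illustration
toy_train = [
    "the rocket reached orbit",
    "the rocket is for sale",
    "baseball season opens",
]
toy_test = ["a rocket and a telescope"]

toy_vectorizer = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary and IDF weights from training texts only
toy_train_tfidf = toy_vectorizer.fit_transform(toy_train)

# transform reuses that vocabulary; "telescope" was never seen, so it is dropped
toy_test_tfidf = toy_vectorizer.transform(toy_test)

# By default each non-empty row is scaled to unit L2 norm
row_norms = np.sqrt(
    np.asarray(toy_train_tfidf.multiply(toy_train_tfidf).sum(axis=1))
).ravel()

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_tfidf.shape, toy_test_tfidf.shape)
```

Fitting the vectorizer on the training split alone, as the notebook does, keeps test-set vocabulary and document frequencies from leaking into the features.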

## 4. Model Building and Training

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize and train the Multinomial Naive Bayes model
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
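
As an alternative arrangement, the vectorizer and classifier can be chained with `make_pipeline` so that raw text goes in and predictions come out in a single call. This is a sketch on a toy corpus, not the notebook's own setup: the `toy_texts`, `toy_labels`, and `toy_pipeline` names and the example sentences are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, assumed for illustration: 0 = ham, 1 = spam
toy_texts = [
    "the pitcher threw a perfect game",
    "the shuttle launch was delayed",
    "selling my old bike cheap",
    "rifle ownership laws debated",
]
toy_labels = [0, 0, 1, 1]

# The pipeline fits the vectorizer and the classifier together
toy_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    MultinomialNB(),
)
toy_pipeline.fit(toy_texts, toy_labels)

# New raw texts are vectorized with the fitted vocabulary before prediction
new_texts = ["selling a used bike", "the pitcher threw a shutout"]
toy_pred = toy_pipeline.predict(new_texts)
toy_proba = toy_pipeline.predict_proba(new_texts)

print(toy_pred)
print(np.round(toy_proba, 3))
```

Bundling both steps this way makes it harder to accidentally call `fit_transform` on test data, and the fitted pipeline can be passed directly to utilities such as `cross_val_score`.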

## 5. Model Evaluation

In [None]:
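# Plotting libraries used for the confusion matrix heatmap below
import matplotlib.pyplot as plt
import seaborn as sns

# Make predictions
y_pred = model.predict(X_test_tfidf)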

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('\nClassification Report:')
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))
print('\nConfusion Matrix:')
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='g', xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
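
To show how the heatmap relates to the numbers in the classification report, the sketch below derives accuracy, precision, and recall for the 'spam' class directly from a confusion matrix. The label arrays `y_true_toy` and `y_pred_toy` are made-up values for illustration and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only: 0 = ham, 1 = spam
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

# Metrics for the spam class, computed by hand from the matrix cells
manual_accuracy = (tp + tn) / cm.sum()
manual_precision = tp / (tp + fp)  # of posts flagged as spam, how many were spam
manual_recall = tp / (tp + fn)     # of actual spam posts, how many were flagged

# The hand-computed values agree with scikit-learn's helpers
assert np.isclose(manual_precision, precision_score(y_true_toy, y_pred_toy))
assert np.isclose(manual_recall, recall_score(y_true_toy, y_pred_toy))

print(cm)
print(manual_accuracy, manual_precision, manual_recall)
```

In the heatmap, the off-diagonal cells correspond to `fp` (ham predicted as spam) and `fn` (spam predicted as ham), which are the two error types to inspect when accuracy alone looks high.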

## 6. Conclusion
The Multinomial Naive Bayes model classified the selected newsgroup posts as 'spam' or 'ham' with high accuracy, precision, and recall on the held-out test set (see the classification report and confusion matrix in Section 5). This suggests that TF-IDF features are effective for this kind of text classification. Because the two classes are defined by newsgroup topic, these scores measure how well the model separates those topics, not its performance on real-world spam.