
# 📧 Email/SMS Spam Detection Using Machine Learning

This notebook builds a spam detection system with machine learning. Each step is explained along the way, so it should be easy to follow even if you are new to the topic.

**Dataset:** `spam.csv` (uploaded in the same directory)  
**Goal:** Classify SMS or email messages as **spam** or **ham (not spam)** using a machine learning model.

Let's get started! 🚀



## 1️⃣ Importing Required Libraries

Let's begin by importing all the essential libraries for data handling, text processing, model training, and evaluation.


In [None]:

# Importing Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import joblib



## 2️⃣ Loading the Dataset

We'll load the dataset `spam.csv`. Make sure it is in the same directory as this notebook.


In [None]:

# Load Dataset
df = pd.read_csv('spam.csv', encoding='latin-1')
df.head()
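
The `encoding='latin-1'` argument matters when a file contains bytes that are not valid UTF-8. The sketch below illustrates this on a small temporary file rather than on `spam.csv`; the column names `v1` and `v2` and the two messages are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV in latin-1 so it contains bytes that are invalid as UTF-8.
# The column names and messages are illustrative assumptions.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_spam.csv")
    with open(path, "w", encoding="latin-1", newline="") as f:
        f.write("v1,v2\nham,Café at 5?\nspam,WIN £100 now\n")

    # Try pandas' default (UTF-8) first and fall back to latin-1 on failure.
    try:
        toy = pd.read_csv(path)
        used_encoding = "utf-8"
    except UnicodeDecodeError:
        toy = pd.read_csv(path, encoding="latin-1")
        used_encoding = "latin-1"

print(used_encoding)
print(toy)
```

If the default read fails with a `UnicodeDecodeError`, passing an explicit encoding is usually the first thing to try.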



## 3️⃣ Data Exploration

Let's check the dataset structure, inspect column names, and understand what we have.


In [None]:

# Explore Data
print("Columns in dataset:", df.columns)
df.info()
df.describe()
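
Beyond the structure of the table, it is often worth checking how the two classes are balanced, since spam datasets tend to contain far more ham than spam. This is a minimal sketch on a made-up four-row frame; the column names `label` and `text` are assumptions that match the names used later in this notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset; the messages are invented for illustration.
toy_df = pd.DataFrame({
    "label": ["ham", "ham", "ham", "spam"],
    "text": [
        "See you at lunch",
        "Call me later",
        "Running late, sorry",
        "WIN a free prize now",
    ],
})

# Fraction of messages in each class (normalize=True returns shares, not counts).
class_share = toy_df["label"].value_counts(normalize=True)
print(class_share)

# Average message length per class is a cheap first look at how the classes differ.
toy_df["length"] = toy_df["text"].str.len()
print(toy_df.groupby("label")["length"].mean())
```

A strong imbalance is a hint that accuracy alone may paint too rosy a picture of the model.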



## 4️⃣ Data Preprocessing

We'll:
- Rename columns if necessary
- Encode labels ('ham' → 0, 'spam' → 1)
- Handle missing values (if any)


In [None]:

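# Preprocessing
# Keep the first two columns (label and message text) and give them clear names.
# This assumes the label comes first and the text second; adjust if your file differs.
df = df.iloc[:, :2].copy()  # .copy() avoids SettingWithCopyWarning below
df.columns = ['label', 'text']

# Encode labels: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Check for missing values, then drop incomplete rows (if any)
print(df.isnull().sum())
df = df.dropna(subset=['label', 'text']).astype({'label': int})

df.head()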
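
One pitfall of `map` is that any label not present in the dictionary silently becomes `NaN`. The sketch below shows one possible way to guard against that; the sample rows, including the deliberately messy `'Spam '` label, are invented for illustration.

```python
import pandas as pd

# Invented sample with a messy label ('Spam ') and a missing label (None).
raw = pd.DataFrame({
    "label": ["ham", "spam", "Spam ", None],
    "text": ["hi there", "free prize", "win cash", "hello"],
})

# Drop incomplete rows first, then normalise label text before mapping.
cleaned = raw.dropna(subset=["label", "text"]).copy()
cleaned["label"] = (
    cleaned["label"].str.strip().str.lower().map({"ham": 0, "spam": 1})
)

# Labels outside the mapping would be NaN here; count them so they can be inspected.
unmapped_count = int(cleaned["label"].isna().sum())
print("Unmapped labels:", unmapped_count)

# Safe to cast to int once nothing is unmapped.
if unmapped_count == 0:
    cleaned["label"] = cleaned["label"].astype(int)
print(cleaned)
```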



## 5️⃣ Splitting the Data

We'll split the dataset into **training** and **testing** sets to evaluate our model's performance.


In [None]:

# Split Data
X = df['text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
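
When one class is much rarer than the other, a purely random split can leave the test set with very few spam examples. `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below uses an invented 20-message dataset (4 spam, 16 ham) to show the effect; it is an optional variation, not what the cell above does.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Invented data: 20 placeholder messages, of which the first 4 are spam (label 1).
texts = [f"message {i}" for i in range(20)]
labels = [1 if i < 4 else 0 for i in range(20)]

# stratify=labels keeps the 20% spam share in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print("Train class counts:", Counter(y_tr))
print("Test class counts:", Counter(y_te))
```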



## 6️⃣ Feature Extraction (TF-IDF)

We'll use **TF-IDF** (Term Frequency-Inverse Document Frequency) to convert text into numerical features suitable for machine learning.


In [None]:

# Feature Extraction (TF-IDF)
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"TF-IDF shape: {X_train_tfidf.shape}")



## 7️⃣ Model Training

We'll train a **Logistic Regression** model to classify messages as spam or ham.


In [None]:

# Model Training
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Predictions
y_pred = model.predict(X_test_tfidf)
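
Besides hard 0/1 predictions, logistic regression can report a probability for each class through `predict_proba`, which makes it possible to pick a custom decision threshold. The sketch below trains on six invented messages; the `threshold` value of 0.5 is an assumption chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Six invented training messages: 1 = spam, 0 = ham.
train_texts = [
    "win a free prize now",
    "free cash offer claim now",
    "urgent claim your free reward",
    "are we still meeting for lunch",
    "please call me when you get home",
    "see you at the office tomorrow",
]
train_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

# predict_proba returns one column per class, ordered as in toy_model.classes_.
proba = toy_model.predict_proba(toy_vectorizer.transform(["free cash prize"]))
spam_column = list(toy_model.classes_).index(1)
spam_probability = proba[0, spam_column]

threshold = 0.5  # illustrative cut-off; raise it to flag fewer messages as spam
is_spam = spam_probability >= threshold
print(f"P(spam) = {spam_probability:.3f}, flagged as spam: {is_spam}")
```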



## 8️⃣ Model Evaluation

Let's evaluate the model using **accuracy** and a **classification report**.


In [None]:

# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print("Classification Report:")
print(classification_report(y_test, y_pred))
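
The numbers in the classification report all derive from four counts: true negatives, false positives, false negatives, and true positives. The sketch below computes them with `confusion_matrix` on ten invented labels and predictions, then checks precision and recall for the spam class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented ground truth and predictions (1 = spam, 0 = ham).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# With labels=[0, 1] the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Precision: of the messages flagged as spam, how many really were spam.
precision = precision_score(y_true, y_hat)
# Recall: of the real spam messages, how many were caught.
recall = recall_score(y_true, y_hat)
print(f"Precision={precision:.2f} Recall={recall:.2f}")
```

For a spam filter, a false positive (a legitimate message flagged as spam) is often the more costly mistake, so precision on the spam class deserves a close look.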



## 9️⃣ Save the Model (Optional)

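Save the trained model and the fitted vectorizer so both can be reloaded later. Predictions on new text need the same vectorizer that was used during training.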


In [None]:

# Save Model (Optional)
joblib.dump(model, 'spam_classifier.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
print("Model and vectorizer saved!")
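
Reloading works the same way in reverse with `joblib.load`. The sketch below is a self-contained round trip on a tiny invented dataset, written to a temporary directory so it does not touch the files saved above; it checks that the reloaded pair gives the same predictions as the original.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set: 1 = spam, 0 = ham.
train_texts = [
    "win a free prize now",
    "free cash offer claim now",
    "are we still meeting for lunch",
    "please call me when you get home",
]
train_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

new_messages = ["claim your free prize", "call me after lunch"]

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "spam_classifier.pkl")
    vectorizer_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)

    # Load both artifacts; the vectorizer must be the one fitted with the model.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

    original_pred = toy_model.predict(toy_vectorizer.transform(new_messages))
    reloaded_pred = loaded_model.predict(loaded_vectorizer.transform(new_messages))

print(list(original_pred), list(reloaded_pred))
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.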



## ✅ Conclusion

Congratulations! 🎉 You've successfully built a spam detection model using machine learning. This project helps you understand data preprocessing, feature extraction, model training, and evaluation.

Feel free to experiment with other models (like Naive Bayes, SVM, Random Forest) or fine-tune hyperparameters for better performance.
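
One convenient way to try other models is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that swapping the classifier is a one-line change. The sketch below is only a starting point on six invented messages; the candidate names in the dictionary are arbitrary, and scoring on the training data serves purely as a smoke test, not as a real evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Six invented messages: 1 = spam, 0 = ham.
texts = [
    "win a free prize now",
    "free cash offer claim now",
    "urgent claim your free reward",
    "are we still meeting for lunch",
    "please call me when you get home",
    "see you at the office tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

# Candidate classifiers; the dictionary keys are arbitrary display names.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

fitted = {}
for name, estimator in candidates.items():
    # Each pipeline gets its own vectorizer, fitted together with the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), estimator)
    pipeline.fit(texts, labels)
    fitted[name] = pipeline
    # Training accuracy only confirms the pipeline runs end to end.
    print(name, pipeline.score(texts, labels))
```

On the real dataset, each candidate would be compared on the held-out test set or with cross-validation instead.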

Happy Learning! 🚀
