# SMS Spam Classification using Machine Learning

This project builds a machine learning model that classifies SMS messages as **Spam** or **Ham (Not Spam)**.
It uses TF-IDF text features and Logistic Regression, and reaches about **95% accuracy** on a held-out test set, although it catches only about two thirds of the spam messages (recall of 0.67).


## 1. Project Overview

In this notebook, we will:

- Load and explore the SMS Spam dataset
- Clean and preprocess the text
- Convert text to numerical features using TF-IDF
- Train a Logistic Regression model
- Evaluate accuracy, precision, recall, F1-score, and the confusion matrix
- Test the model with a custom message

This project is beginner-friendly and demonstrates basic NLP and classification.


## 2. Import Required Libraries

We import pandas for data handling and scikit-learn for splitting the data, extracting TF-IDF features, training the model, and computing the evaluation metrics.


## 3. Load the Dataset

The dataset used here is the **SMS Spam Collection Dataset** from Kaggle/UCI.
It contains:
- 747 spam messages
- 4825 ham (normal) messages

We load the CSV file and display the first few rows.
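
The class counts listed above describe a heavily imbalanced dataset, which matters when reading the accuracy figure later on. As a rough sanity check, the arithmetic below works out the spam share and the accuracy that a trivial "always predict ham" rule would reach. The variable names are illustrative and do not appear in the notebook.

```python
# Class counts quoted in this section.
n_spam = 747
n_ham = 4825
n_total = n_spam + n_ham

# Share of each class in the full dataset.
spam_share = n_spam / n_total
ham_share = n_ham / n_total

# A rule that always answers "ham" is right exactly as often as ham occurs.
majority_baseline_accuracy = ham_share

print(f"Total messages: {n_total}")
print(f"Spam share: {spam_share:.1%}")
print(f"Always-ham baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Any model trained on this data should be judged against that baseline rather than against 50%.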


## 4. Data Preprocessing

We perform the following steps:
- Keep only the two relevant columns, `v1` and `v2` (the CSV file contains extra columns)
- Rename them to `label` and `message`
- Convert the labels 'ham' and 'spam' into the numerical values 0 and 1

This prepares the dataset for model training.
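
The steps above can be sketched on a tiny stand-in table. The rows and the `Unnamed: 2` column name are made up for illustration, and the `dropna` call is an assumption about how missing values might be handled; the notebook code itself does not drop any rows.

```python
import pandas as pd

# Made-up stand-in for the raw CSV: v1 holds the label, v2 the message text,
# and "Unnamed: 2" plays the role of an extra, unused column.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", None],
    "v2": ["See you at lunch", "WIN a free prize now", "Call me later", "orphan row"],
    "Unnamed: 2": [None, None, None, None],
})

# Keep the two useful columns and give them readable names.
df = raw[["v1", "v2"]].rename(columns={"v1": "label", "v2": "message"})

# Assumption: rows with a missing label or message are simply dropped.
df = df.dropna(subset=["label", "message"]).copy()

# Map the text labels to integers: ham -> 0, spam -> 1.
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

# A label outside the mapping would become NaN, so check that none did.
assert df["label_num"].notna().all()

print(df)
```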


## 5. Split Dataset into Training and Testing Sets

We divide the dataset into:
- **80% training data**
- **20% testing data**

This helps evaluate how well the model generalizes to unseen messages.
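
Because spam is the minority class, one optional variation is a stratified split, which keeps the ham/spam ratio the same in both parts. The sketch below uses made-up toy data and the `stratify` parameter of `train_test_split`; the notebook's own call does not pass `stratify`, so this is a possible refinement rather than what was run.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 80 ham (0) and 20 spam (1) placeholder messages.
messages = [f"message {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

X_train, X_test, y_train, y_test = train_test_split(
    messages,
    labels,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels,   # preserve the ham/spam ratio in both splits
)

print("Train:", Counter(y_train))
print("Test: ", Counter(y_test))
```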


## 6. Build and Train the Machine Learning Model

We first use TF-IDF to convert each message into a numerical vector, removing common English stop words along the way.
We then train **Logistic Regression**, a simple and effective algorithm for text classification, on those vectors.
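
One way to keep the two stages together is scikit-learn's `make_pipeline`, so that raw text goes in and a label comes out. This is a minimal sketch on made-up messages, not the notebook's code, which fits the vectorizer and the model as separate objects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training messages: 0 = ham, 1 = spam.
train_messages = [
    "are we still meeting for lunch today",
    "call me when you get home",
    "see you at the office tomorrow",
    "win a free prize claim now",
    "free cash prize click to claim",
    "urgent claim your free reward now",
]
train_labels = [0, 0, 0, 1, 1, 1]

# Chain the vectorizer and the classifier into a single estimator.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(),
)
pipeline.fit(train_messages, train_labels)

# The learned vocabulary is the set of terms each message is scored on.
tfidf = pipeline.named_steps["tfidfvectorizer"]
print(len(tfidf.vocabulary_), "terms:", sorted(tfidf.vocabulary_))

# Predict directly from raw text; the pipeline vectorizes internally.
prediction = pipeline.predict(["claim your free prize"])
print(prediction)
```

A pipeline also guarantees that the vectorizer is only ever fitted on training text, which avoids leaking test vocabulary into the features.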


## 7. Evaluate Model Performance

We calculate:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion matrix

This helps us understand how well the model detects both ham and spam messages.
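
To see how these metrics relate to one another, the sketch below recomputes them by hand from the confusion matrix in the notebook output, assuming scikit-learn's layout (rows are true classes, columns are predicted classes, in the order ham, spam). The variable names are illustrative.

```python
# Confusion matrix from the notebook output:
# rows = true class, columns = predicted class, order = [ham (0), spam (1)].
tn, fp = 962, 3    # true ham:  predicted ham, predicted spam
fn, tp = 50, 100   # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1        = {f1:.2f}")
```

The spam-class figures agree with the classification report (0.97, 0.67, 0.79) and show that the high accuracy comes mostly from ham messages: 50 of the 150 spam messages in the test set are missed.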


## 8. Test the Model with a Custom Example

We type a message and let the model predict whether it is spam or ham.
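
Since the model misses some spam at the default cut-off, it can help to look at the predicted spam probability instead of only the final label. The sketch below trains a toy model on made-up messages and wraps prediction in a hypothetical helper, `classify_message`, with an adjustable `threshold`; in the notebook, the already fitted `vectorizer` and `model` would take the place of the toy ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up stand-in training data: 0 = ham, 1 = spam.
train_messages = [
    "are we still meeting for lunch today",
    "call me when you get home",
    "see you at the office tomorrow",
    "win a free prize claim now",
    "free cash prize click to claim",
    "urgent claim your free reward now",
]
train_labels = [0, 0, 0, 1, 1, 1]

vectorizer = TfidfVectorizer(stop_words="english")
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_messages), train_labels)


def classify_message(text, threshold=0.5):
    """Return (label, spam_probability) for one message (hypothetical helper)."""
    features = vectorizer.transform([text])
    # Column 1 of predict_proba is the probability of class 1 (spam).
    spam_probability = model.predict_proba(features)[0, 1]
    label = "SPAM" if spam_probability >= threshold else "HAM"
    return label, spam_probability


label, prob = classify_message("free prize claim now")
print(label, round(prob, 3))
```

Lowering `threshold` flags more messages as spam, which raises recall at the cost of more ham being flagged by mistake.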


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load dataset
df = pd.read_csv("dataset.csv", encoding="latin-1")

# Keep only relevant columns (dataset contains extra columns)
df = df[['v1', 'v2']]
df.columns = ['label', 'message']

# Convert labels to numbers
df['label_num'] = df.label.map({'ham': 0, 'spam': 1})

# Split into train + test
X_train, X_test, y_train, y_test = train_test_split(
    df.message,
    df.label_num,
    test_size=0.2,
    random_state=42
)

# Convert text → numerical features
vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train model
model = LogisticRegression()
model.fit(X_train_vec, y_train)

# Make predictions
y_pred = model.predict(X_test_vec)

# Metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Test with your own message
test_message = ["Congratulations! You won a free lottery!!"]
test_vec = vectorizer.transform(test_message)
print("\nModel prediction for test message:", model.predict(test_vec))


Accuracy: 0.9524663677130045

Classification Report:
               precision    recall  f1-score   support

           0       0.95      1.00      0.97       965
           1       0.97      0.67      0.79       150

    accuracy                           0.95      1115
   macro avg       0.96      0.83      0.88      1115
weighted avg       0.95      0.95      0.95      1115


Confusion Matrix:
 [[962   3]
 [ 50 100]]

Model prediction for test message: [1]


In [8]:
# Test the model with your own message

test_message = ["Free recharge click this link!"]  # change the text to anything you want
test_message_vector = vectorizer.transform(test_message)
prediction = model.predict(test_message_vector)

if prediction[0] == 1:
    print("Prediction: SPAM")
else:
    print("Prediction: HAM (Not Spam)")


Prediction: HAM (Not Spam)
