# SMS Spam Detection using Naive Bayes

This notebook builds a **spam vs. ham classifier** for SMS messages using **Multinomial Naive Bayes**, with **CountVectorizer** for feature extraction. Each message is classified as either:

- **Ham** (not spam)
- **Spam** (unsolicited promotional or fraudulent messages)

---

## 📂 Dataset: SMS Spam Collection Dataset
- 📄 Contains 5,574 SMS messages labeled as **'spam'** or **'ham'**
- 📁 Source: UCI Machine Learning Repository (via Kaggle)
- 🧾 Columns Used:
  - `v1`: Label (`ham` or `spam`)
  - `v2`: Text message content

---

## 🔍 What This Notebook Does:
1. **Loads and cleans the dataset**
2. **Encodes text labels** into binary form (`ham` → 0, `spam` → 1)
3. **Converts text** into numeric features using **CountVectorizer**
4. **Splits the data** into training and testing sets
5. **Trains** a Naive Bayes model on the training data
6. **Evaluates** the model using:
   - Accuracy
   - Precision
   - Recall
   - F1-Score
7. **Visualizes** results with a **confusion matrix**
8. Includes **custom functions** to test:
   - Single message prediction
   - Bulk message prediction
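
The steps above can be sketched end to end on a handful of made-up messages. This is a minimal illustration, not the notebook's exact code: the toy `texts` and `labels` are hypothetical stand-ins for the `v2` and `label` columns, and the sketch splits the raw text *before* vectorizing by wrapping `CountVectorizer` and `MultinomialNB` in a scikit-learn pipeline, whereas the notebook fits the vectorizer on the full dataset first.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages standing in for the `v2` column (not taken from the dataset)
texts = [
    "free cash prize claim now",
    "win free money today",
    "urgent claim your free prize",
    "free entry win cash now",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me later",
    "thanks for the lunch yesterday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # same encoding as the notebook: ham -> 0, spam -> 1

# Split the raw text first so the vocabulary is learned from training messages only
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training text, then trains Naive Bayes on the counts
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(X_train, y_train)

# Predict on the held-out messages and report accuracy
toy_predictions = toy_model.predict(X_test)
print("accuracy:", accuracy_score(y_test, toy_predictions))
```

With only eight messages the accuracy figure is not meaningful; the point is the order of operations (split, vectorize, train, evaluate).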

---

## 🧠 Future Improvements:
- Use **TF-IDF Vectorization** to down-weight words that appear in most messages
- Experiment with other classifiers (e.g., Logistic Regression, SVM)
- Build a **Streamlit app** for real-time prediction
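
As a rough idea of what the first two improvements might look like, the sketch below swaps in `TfidfVectorizer` and `LogisticRegression` from scikit-learn. It is an assumption about how one could proceed, not something this notebook implements; the training messages are hypothetical, and default hyperparameters are used throughout.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training messages (not taken from the dataset); ham -> 0, spam -> 1
train_texts = [
    "free cash prize claim now",
    "win free money today",
    "urgent claim your free prize",
    "free entry win cash now",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me later",
    "thanks for the lunch yesterday",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF features feed a logistic regression classifier instead of Naive Bayes
tfidf_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
tfidf_model.fit(train_texts, train_labels)

# Column 1 of predict_proba is the probability of the spam class (label 1)
new_messages = ["claim your free prize", "see you at lunch"]
spam_probabilities = tfidf_model.predict_proba(new_messages)[:, 1]
for msg, prob in zip(new_messages, spam_probabilities):
    print(f"{msg!r}: spam probability {prob:.2f}")
```

On the real dataset, any such variant would need to be compared against the Naive Bayes baseline using the same train/test split and metrics.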

---

## ✅ Ideal For:
- Beginners learning text classification
- Quick demo of Naive Bayes for NLP tasks
- Anyone interested in spam detection logic


In [None]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score


In [None]:

data = pd.read_csv('/kaggle/input/spam-and-ham/spam.csv', encoding='latin1')  # Read the dataset

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.shape

In [None]:
data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)  # Remove the unwanted columns
data.head()

In [None]:
data['label'] = data.v1.map({'ham':0,'spam':1}) # Mapping text labels to numerical values: 'ham' → 0, 'spam' → 1
data.head()

In [None]:
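# Convert messages into word count features.
# Note: the vectorizer is fitted on the full dataset before the train/test split,
# so its vocabulary also includes words that occur only in the test messages.
wrd_array = CountVectorizer()  # Create a CountVectorizer object
wrd_cnt = wrd_array.fit_transform(data['v2'])  # Fit and transform the text data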

# Show the shape of the word count matrix
print("Shape of word count matrix:", wrd_cnt.shape)


In [None]:
y = data['label']  # Target labels (0 = ham, 1 = spam)

# Split the features and labels into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(
    wrd_cnt, y, test_size=0.3, random_state=42
)

In [None]:
model = MultinomialNB()
model.fit(X_train, y_train)  # Train the classifier on the training set

In [None]:
# Make predictions on the test data
predictions = model.predict(X_test)


In [None]:
# Visualize the confusion matrix as a heatmap
c_matrix = confusion_matrix(y_test, predictions)
sns.heatmap(c_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted Ham', 'Predicted Spam'],
            yticklabels=['Actual Ham', 'Actual Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# Calculate accuracy, precision, recall, and F1 score on the test set
print("accuracy:", accuracy_score(y_test, predictions))
print("precision:", precision_score(y_test, predictions))
print("recall:", recall_score(y_test, predictions))
print("f1 score:", f1_score(y_test, predictions))

In [None]:
# Function to check if a single message is spam or ham
def check_message(message):
    message_vector = wrd_array.transform([message])  # Convert the message to a vector using the fitted vectorizer
    prediction = model.predict(message_vector)[0]  # Predict using the trained model
    score = model.predict_proba(message_vector)[0]  # Class probabilities: index 0 is ham, index 1 is spam

    # Print the message and prediction results
    print(f"Message: {message}")
    print(f"Predicted as: {'Spam' if prediction == 1 else 'Ham'}")
    print(f"Ham score: {score[0]*100:.2f}%")
    print(f"Spam score: {score[1]*100:.2f}%")

# Test the function with a sample message
message = "free money , review your account"
check_message(message)


In [None]:
# Function to check if multiple messages are spam or ham
def check_spam(messages):
    for msg in messages:
        check_message(msg)  # Reuse the single-message check defined in the previous cell
        print("-" * 50)  # Separator for readability

# List of messages to test
messages = [
    "free cash offer now",
    "are we meeting today?",
    "win big prizes",
    "hii austin how are you",
    "limited time deal just for you"
]

# Run the function to check all messages
check_spam(messages)
