# **Spam Classification Using Naive Bayes**

- This notebook demonstrates how to build a spam classifier with the Naive Bayes algorithm. The dataset is SMSSpamCollection, a set of SMS messages each labeled "spam" or "ham" (non-spam).
- The goal is to classify each message as spam or ham, combining Natural Language Processing (NLP) techniques for text cleaning and vectorization with a Naive Bayes classifier.

**Loading the Data**

- First, we load the dataset from the file and inspect the first few rows to understand the structure of the data.


In [1]:
import pandas as pd

# The file is tab-separated with no header row, so column names are supplied explicitly
df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "message"])
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
df.shape

(5572, 2)

**Text preprocessing**

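- Preprocessing the text data is an important step in NLP tasks. We clean each message by replacing non-letter characters with spaces, converting the text to lowercase, and removing stopwords. A PorterStemmer is also instantiated below, but the cleaning loop as written does not apply it, so the words are not stemmed.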


## Why Vectorize Only the "Message" Column?
- The "label" column is categorical (either "ham" or "spam"). It is encoded separately, in the Label Encoding step, with pd.get_dummies(), which yields a single indicator column (False/0 for ham, True/1 for spam).
- The "message" column contains the actual text data, which needs to be transformed into numerical features for machine learning algorithms to work with.

In [3]:
import nltk 
import re 
from nltk.corpus import stopwords 
from nltk.stem import PorterStemmer
ps=PorterStemmer()

In [4]:
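# Build the stopword set once; calling stopwords.words() inside the loop is slow.
# Requires the NLTK "stopwords" corpus, e.g. via nltk.download("stopwords").
stop_words = set(stopwords.words("english"))

corpus = []
for message in df["message"]:
    # Keep letters only, lowercase, and split into words
    words = re.sub("[^a-zA-Z]", " ", message).lower().split()
    # Drop stopwords and rejoin into a single cleaned string
    words = [word for word in words if word not in stop_words]
    corpus.append(" ".join(words))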
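
An earlier cell creates `ps`, but the cleaning loop never calls it, so the corpus is not stemmed. If stemming is wanted, one possible sketch follows. The `clean_message` helper and the tiny inline stopword set are illustrative assumptions (the notebook itself uses NLTK's full English list), and applying stemming would change the vocabulary size and the scores reported later.

```python
import re
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative stopword set only; the notebook uses stopwords.words("english")
DEMO_STOPWORDS = {"the", "are", "is", "a", "to"}

def clean_message(text, stop_words=DEMO_STOPWORDS):
    """Keep letters only, lowercase, drop stopwords, then stem each word."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(ps.stem(word) for word in words if word not in stop_words)

cleaned = clean_message("The calls are running!!")
print(cleaned)
```

The same helper could be applied to new messages before prediction so that training and inference text go through identical cleaning.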

**Vectorization**

- Since machine learning models cannot work directly with text data, we need to convert our text data into numerical representations. We will use the **CountVectorizer** method to convert text into a document-term matrix.


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
X=cv.fit_transform(corpus).toarray()

**Label Encoding**

In [6]:
# Label encoding: drop_first=True keeps a single "spam" column (True/1 = spam, False/0 = ham)
y = pd.get_dummies(df["label"], drop_first=True)
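
A small sketch of what this encoding produces, using made-up labels. With `drop_first=True`, `pd.get_dummies` returns a one-column DataFrame rather than a 1-D array, which is why the label shapes later read `(3900, 1)` and why scikit-learn warns about a column-vector `y`. The `y_flat` variant is an assumed alternative, not what the notebook uses.

```python
import pandas as pd

# Made-up labels for illustration only
toy_labels = pd.Series(["ham", "spam", "ham", "spam"], name="label")

# drop_first=True drops the "ham" column and keeps a single "spam" indicator column
dummies = pd.get_dummies(toy_labels, drop_first=True)
print(list(dummies.columns), dummies.shape)

# An explicit 1-D alternative: 1 for spam, 0 for ham
y_flat = (toy_labels == "spam").astype(int)
print(y_flat.tolist())
```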

- Split the data into training and testing sets (x_train, x_test, y_train, y_test).
- Optionally, apply SMOTE to the training data only (x_train, y_train) to balance the classes. The cells in this notebook do not apply it and train on the original split.
- Train the model on the training data.
- Evaluate the model on the untouched test data to see how well it generalizes.

**Train test split**

- We split the dataset into a training set (70%) and a test set (30%). The training set is used to fit the model, and the test set is held out to evaluate the model's performance.


In [7]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1)

In [8]:
print("Training set shape:", x_train.shape, y_train.shape)
print("Testing set shape:", x_test.shape, y_test.shape)

Training set shape: (3900, 7623) (3900, 1)
Testing set shape: (1672, 7623) (1672, 1)


**Model Training**

- We will now train a **Naive Bayes** model using the training data.
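
As a reminder of what the classifier computes, the multinomial Naive Bayes decision rule can be written as below. Here $c$ ranges over the two classes, $x_t$ is the count of term $t$ in the message, $N_{ct}$ is the total count of term $t$ in training messages of class $c$, $N_c = \sum_t N_{ct}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter (scikit-learn's `MultinomialNB` defaults to $\alpha = 1$).

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{t=1}^{|V|} x_t \log \hat{\theta}_{ct} \right],
\qquad
\hat{\theta}_{ct} = \frac{N_{ct} + \alpha}{N_c + \alpha\,|V|}
$$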

**Naive Bayes classifier with default parameters**

- Here, we train a Naive Bayes classifier on the training data and evaluate the model using accuracy metrics.


In [9]:
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()
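# y_train is a one-column DataFrame; ravel() passes it as a 1-D array,
# which avoids scikit-learn's column-vector warning for y
model.fit(x_train, y_train.values.ravel())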
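
For intuition, here is a minimal sketch of the same fit-and-predict flow on a four-message toy set. The messages and labels are hypothetical and not taken from SMSSpamCollection. It passes a 1-D label list and a sparse matrix, both of which `MultinomialNB` accepts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, not taken from SMSSpamCollection
toy_messages = ["free prize win", "win cash free", "see you at lunch", "lunch at noon"]
toy_targets = [1, 1, 0, 0]  # 1 = spam, 0 = ham (1-D, so no shape warning)

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_messages)  # sparse input works with MultinomialNB

toy_model = MultinomialNB()  # default alpha=1.0 (Laplace smoothing)
toy_model.fit(toy_X, toy_targets)

toy_pred = toy_model.predict(toy_cv.transform(["free cash", "lunch at noon"]))
print(toy_pred.tolist())
```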


**Predictions**

In [10]:
ypred_train=model.predict(x_train)
ypred_test=model.predict(x_test)

**Evaluation**

- Let's evaluate the model's performance on the test set using accuracy. In addition to accuracy, we will also include precision, recall, and F1-score for a better evaluation.

In [11]:
from sklearn.metrics import accuracy_score
print("Train Accuracy:", accuracy_score(y_train, ypred_train))
print("Test Accuracy:", accuracy_score(y_test, ypred_test))

Train Accuracy: 0.9923076923076923
Test Accuracy: 0.9748803827751196


In [12]:
from sklearn.metrics import classification_report
print(classification_report(y_test, ypred_test))

              precision    recall  f1-score   support

       False       0.99      0.98      0.99      1442
        True       0.88      0.95      0.91       230

    accuracy                           0.97      1672
   macro avg       0.94      0.96      0.95      1672
weighted avg       0.98      0.97      0.98      1672



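### Key Insights:
- The model identifies "ham" (labeled False) very reliably: precision 0.99 and recall 0.98 on 1,442 test messages.
- For "spam" (labeled True), recall is 0.95 but precision drops to 0.88, so roughly 12% of the messages flagged as spam are actually ham.
- Overall test accuracy is about 0.97 against 0.99 on the training set, so the model generalizes well with only a small gap.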
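
The precision and recall figures in the classification report come from the confusion matrix. The sketch below shows the arithmetic on a small set of hypothetical labels (not the notebook's predictions); on the real data, the same call would be `confusion_matrix(y_test, ypred_test)`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration (1 = spam, 0 = ham)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))
```

For a spam filter, a false positive means a legitimate message is flagged as spam, which is why the precision of the spam class deserves attention alongside overall accuracy.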

**Saving the model**

In [13]:
import joblib

# Save the model
joblib.dump(model, 'spam_classifier_model.pkl')

# Save the vectorizer
joblib.dump(cv, 'count_vectorizer.pkl')

['count_vectorizer.pkl']

In [14]:
# Load the model and vectorizer
model = joblib.load('spam_classifier_model.pkl')
cv = joblib.load('count_vectorizer.pkl')

# Example prediction
sample_message = "Congratulations! You've won a $1000 Walmart gift card. Click here to claim now."
sample_message_processed = cv.transform([sample_message]).toarray()
prediction = model.predict(sample_message_processed)

print("Prediction:", "Spam" if prediction[0] == 1 else "Ham")

Prediction: Spam
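
The notebook saves the model and the vectorizer as two files that must be kept in sync. One alternative, sketched here as an assumption rather than the notebook's method, is to wrap both steps in a scikit-learn `Pipeline` and persist that single object. The toy data and the file name are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not taken from SMSSpamCollection
toy_messages = ["free prize win", "win cash free", "see you at lunch", "lunch at noon"]
toy_targets = [1, 1, 0, 0]  # 1 = spam, 0 = ham

spam_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
spam_pipeline.fit(toy_messages, toy_targets)

# Save and reload the whole pipeline as one artifact
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spam_pipeline.pkl")  # illustrative file name
    joblib.dump(spam_pipeline, path)
    restored = joblib.load(path)

# Raw text goes straight in; the pipeline vectorizes it before predicting
restored_pred = restored.predict(["win a free prize"])
print("Spam" if restored_pred[0] == 1 else "Ham")
```

This sketch does not include the regex and stopword cleaning used during training; to match the notebook exactly, that cleaning would need to run on each message before it reaches the pipeline.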


### **Project by : SIREESHA RAGIPATI**