**Bag of Words** is a text representation technique that **converts a document into a vector of word counts**, ignoring grammar and word order but keeping multiplicity (frequency).

It "bags" all the unique words across the corpus into a vocabulary, then records how often each vocabulary word appears in each document.
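To make the idea concrete, here is a minimal sketch in plain Python. The two example sentences and the helper name `to_bow` are made up for illustration. Tokenization is a simple whitespace split, which is an assumption and not how `CountVectorizer` tokenizes.

```python
from collections import Counter

# Two toy documents (illustrative, not taken from the SMS dataset).
docs = ["free entry win free prize", "ok see you later"]

# Vocabulary: every unique token across all documents, in sorted order.
vocab = sorted({token for doc in docs for token in doc.split()})

def to_bow(doc, vocab):
    """Return the count of each vocabulary word in `doc` (hypothetical helper)."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocab]

vectors = [to_bow(doc, vocab) for doc in docs]

print(vocab)
for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Each vector has one position per vocabulary word. "free" appears twice in the first sentence, so its count is 2, while word order is lost entirely.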

# **SMS Spam Classifier Using Bag of Words**

# **Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# **Load Dataset**

In [None]:
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', names=['label', 'message'])
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
df.shape

(5572, 2)

# **Text Preprocessing**

In [None]:
import re
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Keep alphanumerics and spaces
    text = re.sub(r'\s+', ' ', text).strip()    # Remove extra spaces
    return text


df['cleaned_message'] = df['message'].apply(clean_text)
df.head()

Unnamed: 0,label,message,cleaned_message
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...


In [None]:
df.sample(5)

Unnamed: 0,label,message,cleaned_message
3278,ham,Its a great day. Do have yourself a beautiful ...,its a great day do have yourself a beautiful one
1981,ham,"Sorry, I'll call later",sorry ill call later
3218,ham,Come to mahal bus stop.. &lt;DECIMAL&gt;,come to mahal bus stop ltdecimalgt
1424,ham,Lol great now im getting hungry.,lol great now im getting hungry
1151,ham,(That said can you text him one more time?),that said can you text him one more time


# **Convert Labels into Binary**

In [None]:
df['label_num'] = df['label'].map({'ham':0, 'spam':1})

# **Train-Test Split**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned_message'], df['label_num'], test_size=0.2, random_state=42)

# **Bag of Words Vectorization**

In [None]:
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
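The vocabulary is learned from the training messages only (`fit_transform`), and the test messages are mapped onto that fixed vocabulary (`transform`). The small sketch below shows two consequences on toy sentences, which are made up for illustration: words never seen during fitting are ignored at transform time, and with default settings `CountVectorizer` drops single-character tokens such as "u" or "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training messages (illustrative, not taken from the dataset).
train_msgs = ["free entry to win a prize", "u dun say so early"]

toy_vectorizer = CountVectorizer()
toy_train = toy_vectorizer.fit_transform(train_msgs)

# Learned vocabulary in column order. "a" and "u" are missing because the
# default token pattern only keeps tokens of two or more characters.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)

# "cash" and "now" were never seen during fit, so only "win" is counted.
toy_new = toy_vectorizer.transform(["win cash now"])
print(toy_new.toarray())
```

Fitting on the training split alone keeps test-set words out of the feature space, so the evaluation reflects how the model handles unseen text.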


# **Train Naive Bayes Classifier**

In [None]:
model = MultinomialNB()
model.fit(X_train_bow, y_train)

# **Evaluate The Model**

In [None]:
y_pred = model.predict(X_test_bow)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.99      1.00      0.99       966
           1       1.00      0.91      0.95       149

    accuracy                           0.99      1115
   macro avg       0.99      0.95      0.97      1115
weighted avg       0.99      0.99      0.99      1115

Accuracy: 0.9874439461883409
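The spam row of the report can be reproduced by hand. The counts below are inferred from the rounded report and the exact accuracy above (14 errors out of 1115, all of them spam messages predicted as ham). They are not an output of the notebook, so treat them as an assumption.

```python
# Confusion-matrix counts for the spam class, inferred from the report above.
# Assumption: these are reconstructed values, not printed by the notebook.
tn, fp, fn, tp = 966, 0, 14, 135

precision = tp / (tp + fp)                 # share of predicted spam that is spam
recall = tp / (tp + fn)                    # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("accuracy:", accuracy)
```

Under these counts the classifier never flags a legitimate message as spam, but it lets roughly 9% of spam through. Accuracy alone hides this because ham makes up most of the test set.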


# **Predict on New SMS**

In [None]:
def predict_spam(message):
    msg_clean = clean_text(message)
    msg_vector = vectorizer.transform([msg_clean])
    prediction = model.predict(msg_vector)
    return "SPAM" if prediction[0] else "HAM"

predict_spam("ACTION REQUIRED. Please verify your Bank of America account information to avoid a hold on your account. Click here to confirm: [Link]")


'SPAM'

In [None]:
import pickle

# Save the CountVectorizer and model
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

with open('spam_classifier.pkl', 'wb') as f:
    pickle.dump(model, f)


In [None]:
from google.colab import files

# Download the model
files.download('spam_classifier.pkl')

# Download the vectorizer
files.download('vectorizer.pkl')

