# SMS Spam classifier

Have you ever received messages that look fraudulent? Messages like:

`Congratulations, you have won $1,000,000. Click this link to claim your prize.`

Messages of this type are usually fake and are classified as spam, since many people receive large numbers of them.

In this project, we aim to classify messages as spam or not spam, which helps keep unwanted messages out of your inbox.

SMS classification is a binary classification problem in which we assign a given SMS to one of two categories:
* Spam
* Not spam

To achieve this, let's use a binary classification machine learning algorithm from the scikit-learn (`sklearn`) Python package.
 

### Import the required libraries and obtain the dataset to train the model.

To train a machine learning model, the most important step is to collect data. Here we use the spam.csv dataset from Kaggle, which contains over 5,000 messages labelled as either spam or not spam ("ham").

Link : https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv('spam.csv', encoding='Windows-1252')
df.head()

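|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----|----|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |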


### Preprocessing the data
We remove the unnecessary columns from the dataset and rename the remaining ones to prepare the data for training our model.

In [4]:
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)
df.head()

|   | v1 | v2 |
|---|----|----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [5]:
df.shape

(5572, 2)

In [6]:
df.columns = ['Label', 'Content']
df.head()

|   | Label | Content |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Converting data into numbers.
The machine doesn't understand textual data on its own. However, it is great at finding patterns and working with numbers. So here we convert the label column into numbers: `ham` (not spam) becomes 1 and `spam` becomes 0. Later we'll create a function to reverse the process so that we can report the final result as a string.

In [7]:
def relabel(x):
    if x == 'ham':
        return 1
    else:
        return 0

In [8]:
df['Label'] = df['Label'].apply(relabel)
df.head()

|   | Label | Content |
|---|-------|---------|
| 0 | 1 | Go until jurong point, crazy.. Available only ... |
| 1 | 1 | Ok lar... Joking wif u oni... |
| 2 | 0 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 1 | U dun say so early hor... U c already then say... |
| 4 | 1 | Nah I don't think he goes to usf, he lives aro... |
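
The same encoding, together with its inverse, can also be kept in a pair of dictionaries so that the two directions cannot drift apart. This is only a minimal sketch: the names `LABEL_TO_ID` and `ID_TO_LABEL` are illustrative and do not appear in the notebook, which uses `relabel()` and `revertlabel()` instead.

```python
# Hypothetical names for illustration; the notebook uses relabel() / revertlabel().
LABEL_TO_ID = {"ham": 1, "spam": 0}
# Derive the inverse from the forward mapping so both always agree.
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}

labels = ["ham", "ham", "spam", "ham"]
encoded = [LABEL_TO_ID[label] for label in labels]
decoded = [ID_TO_LABEL[idx] for idx in encoded]

print(encoded)            # [1, 1, 0, 1]
print(decoded == labels)  # True
```

With pandas, the same dictionary could be passed to `Series.map` in place of `apply(relabel)`.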


### Splitting our data
Here we split our data into two sets, training and testing, and separate the Content (denoted by X) from the Label (denoted by y). We first train the model on the training data and then test it on the testing data to measure its accuracy. Since no test size is given, `train_test_split` uses its default and holds out 25% of the rows for testing.

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Content'], df['Label'], random_state=42)

In [13]:
X_train.shape, y_train.shape

((4179,), (4179,))

In [14]:
X_test.shape, y_test.shape

((1393,), (1393,))
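
To see where the 4179 and 1393 come from, here is a rough pure-Python sketch of a shuffled 75/25 split. The helper `split_indices` is hypothetical and does not reproduce scikit-learn's exact shuffling; it only shows the idea of shuffling row indices and cutting them at the test-set size.

```python
import math
import random

def split_indices(n_rows, test_fraction=0.25, seed=42):
    """Shuffle row indices and split them into (train, test) lists."""
    n_test = math.ceil(n_rows * test_fraction)
    indices = list(range(n_rows))
    # A seeded generator makes the split repeatable, like random_state does.
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5572)
print(len(train_idx), len(test_idx))  # 4179 1393
```

Fixing the seed matters because accuracy figures are only comparable between runs when the same rows end up in the test set.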

In [15]:
X_train.values

array(['U can call now...',
       'Tell them u have a headache and just want to use 1 hour of sick time.',
       'Never try alone to take the weight of a tear that comes out of ur heart and falls through ur eyes... Always remember a STUPID FRIEND is here to share... BSLVYL',
       ..., "Prabha..i'm soryda..realy..frm heart i'm sory",
       'Nt joking seriously i told',
       'In work now. Going have in few min.'], dtype=object)

### Bag of words
As mentioned earlier, the machine doesn't understand text. So here we convert the Content into a numerical format using a count vectorizer, which the model can work with.

The count vectorizer builds a table in which each distinct word found in the training messages becomes a separate column. For every message, each column holds the number of times that word occurs in the message, so 0 means the word is absent. The vocabulary is learned from the training data only (`fit_transform`), and the test data is then encoded with that same vocabulary (`transform`).

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

countVector = CountVectorizer()
dataTrain = countVector.fit_transform(X_train.values)
dataTest = countVector.transform(X_test.values)
dataTrain

<4179x7441 sparse matrix of type '<class 'numpy.int64'>'
	with 55194 stored elements in Compressed Sparse Row format>
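
The idea behind the count table can be sketched in plain Python. This is a simplified stand-in, not `CountVectorizer` itself: the functions `tokenize`, `fit_vocabulary` and `transform` are hypothetical, and the tokenization (lowercasing, keeping tokens of two or more word characters) only mimics the vectorizer's default behaviour.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in CountVectorizer's default pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(messages):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(messages, vocabulary):
    """Return one row of word counts per message; unknown words are ignored."""
    rows = []
    for msg in messages:
        counts = Counter(tokenize(msg))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train = ["Free prize, free entry", "Call me now"]
vocabulary = fit_vocabulary(train)
print(list(vocabulary))              # ['call', 'entry', 'free', 'me', 'now', 'prize']
print(transform(train, vocabulary))  # [[0, 1, 2, 0, 0, 1], [1, 0, 0, 1, 1, 0]]
print(transform(["free call tomorrow"], vocabulary))  # [[1, 0, 1, 0, 0, 0]]
```

Note the 2 in the first row: "free" occurs twice, so the table stores counts rather than presence flags. The word "tomorrow" was never seen during fitting, so it is dropped, which is what happens to unseen words in the test set.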

## Model 1. Naive Bayes model
Here we use the multinomial Naive Bayes classifier (`MultinomialNB`), which works on word counts, to classify whether the given text messages are spam or not.

The `fit()` method lets the model train on the training data. 

In [17]:
from sklearn import naive_bayes
model_1 = naive_bayes.MultinomialNB()
model_1.fit(dataTrain, y_train)
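
For intuition about what fitting involves, below is a small from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on a toy dataset. It is an illustration under simplifying assumptions, not scikit-learn's implementation; `train_nb`, `predict_nb` and the toy messages are all made up for this example. Labels follow the notebook's convention (0 = spam, 1 = not spam).

```python
import math
from collections import Counter, defaultdict

def train_nb(token_lists, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(tokens)
    vocab = {tok for tokens in token_lists for tok in tokens}
    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for label, count in class_counts.items():
        model["log_prior"][label] = math.log(count / len(labels))
        # alpha is added to every word count so unseen words never get probability 0.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            tok: math.log((word_counts[label][tok] + alpha) / denom) for tok in vocab
        }
    return model

def predict_nb(model, tokens):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        scores[label] = log_prior + sum(
            likelihood[tok] for tok in tokens if tok in model["vocab"]
        )
    return max(scores, key=scores.get)

train_tokens = [
    ["free", "prize", "win"],
    ["win", "cash", "free"],
    ["see", "you", "later"],
    ["call", "you", "later"],
]
train_labels = [0, 0, 1, 1]  # 0 = spam, 1 = not spam

toy_model = train_nb(train_tokens, train_labels)
print(predict_nb(toy_model, ["free", "cash"]))  # 0
print(predict_nb(toy_model, ["see", "you"]))    # 1
```

Working in log space keeps the product of many small probabilities from underflowing to zero on longer messages.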

The `predict()` method returns the model's predictions. We pass the test data to see what predictions we get.

In [18]:
predictions_1 = model_1.predict(dataTest)

### Calculating accuracy
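We use the `classification_report` and `accuracy_score` functions from `sklearn.metrics` to see how well the predicted values match the `y_test` values.

The model reaches an accuracy of 98.2% on the test set. Note that both functions expect the true labels first and the predictions second. The cell below passes them in the opposite order, so the precision and recall columns are swapped relative to the usual convention, and the support column counts predicted labels rather than actual ones. Accuracy is symmetric and is not affected.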

In [19]:
from sklearn.metrics import classification_report, accuracy_score
print(classification_report(predictions_1, y_test, target_names = ['Spam', 'Not Spam'], digits = 2))
print("Accuracy = ",accuracy_score(predictions_1, y_test))

              precision    recall  f1-score   support

        Spam       0.88      0.98      0.93       172
    Not Spam       1.00      0.98      0.99      1221

    accuracy                           0.98      1393
   macro avg       0.94      0.98      0.96      1393
weighted avg       0.98      0.98      0.98      1393

Accuracy =  0.9820531227566404
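
To make the metrics in the report concrete, here is a hand-rolled sketch of accuracy, precision and recall on a made-up set of eight labels. The helper functions are hypothetical and only mirror the standard definitions; they are not the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the two label lists agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: 0 = spam, 1 = not spam. One spam message is missed.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1]

print(accuracy(y_true, y_pred))                       # 0.875
print(precision_recall(y_true, y_pred, positive=0))   # (1.0, 0.6666666666666666)
# Swapping the arguments swaps precision and recall but leaves accuracy unchanged.
print(precision_recall(y_pred, y_true, positive=0))   # (0.6666666666666666, 1.0)
print(accuracy(y_pred, y_true))                       # 0.875
```

For a spam filter the two errors are not equally costly: low precision on the spam class means legitimate messages get filtered, while low recall means spam slips through.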


In [20]:
def revertlabel(x):
    if x:
        return "Not Spam"
    return "Spam"

### Predicting based on user input
The following code lets the user enter a message, and the model predicts whether that message is spam or not spam.

In [21]:
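new_message = input("Enter a text to be classified as spam or not spam: ")
# predict() returns an array with one entry per message, so take the first element.
prediction = model_1.predict(countVector.transform([new_message]))[0]
print(f"The message is {revertlabel(prediction)}")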

The message is Not Spam
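
If several messages need to be classified at once, the same steps can be wrapped in a small function. This is a sketch only: `classify_messages` is a hypothetical helper, and it assumes a model and vectorizer that expose `predict()` and `transform()` the way `model_1` and `countVector` do, with the notebook's encoding of 1 for not spam and 0 for spam.

```python
def classify_messages(texts, model, vectorizer):
    """Return 'Spam' or 'Not Spam' for each message in texts."""
    # Encode all messages with the already-fitted vectorizer, then predict in one call.
    features = vectorizer.transform(texts)
    predictions = model.predict(features)
    return ["Not Spam" if label == 1 else "Spam" for label in predictions]
```

In the notebook this would be called as `classify_messages(["some text", "another text"], model_1, countVector)`.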


In [29]:
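import pickle
# Save the trained model and the fitted vectorizer; both are needed to classify new messages.
with open('model.pkl', 'wb') as model_file:
    pickle.dump(model_1, model_file)
with open('vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(countVector, vectorizer_file)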
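
Loading the saved objects back works the same way in reverse. The sketch below shows a round trip using a throwaway dictionary in a temporary directory as a stand-in for the model; `save_pickle` and `load_pickle` are hypothetical helper names. In the notebook, the equivalent calls would be `load_pickle('model.pkl')` and `load_pickle('vectorizer.pkl')`.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, path):
    """Write one object to disk; the file is closed even if dumping fails."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)

def load_pickle(path):
    """Read one object back. Only unpickle files you created or trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Stand-in object for the demonstration; a fitted model is saved the same way.
settings = {"alpha": 1.0, "classes": [0, 1]}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.pkl"
    save_pickle(settings, demo_path)
    restored = load_pickle(demo_path)

print(restored == settings)  # True
```

The model and the vectorizer must always be loaded together, because the model's columns only make sense with the vocabulary the vectorizer learned during fitting.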

## There we go!

We have tried to solve the SMS classification problem using the Naive Bayes classifier. It performs well, with a promising accuracy score of 0.98 on the test set!

For more information on different machine learning models, I highly recommend going through the **scikit-learn documentation** and trying them out on your own.

## And that marks the end of this notebook! 