# Dataset Description

The dataset is a single collection of 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

The file contains one message per line. Each line has two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

Dataset: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

# Data Preprocessing

In [1]:
import numpy as np
import pandas as pd
import nltk
import nltk.corpus
import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

In [2]:
df = pd.read_csv('spam.csv', encoding = "ISO-8859-1")

In [3]:
df.head()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [5]:
df.shape

(5572, 5)

In [6]:
# Drop the three mostly empty columns (Unnamed: 2 to Unnamed: 4)
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

In [7]:
df.head()

,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [8]:
#v1 is the target column

In [9]:
df.duplicated().sum()

403

In [10]:
df = df.drop_duplicates(keep = 'first')

In [11]:
df.shape

(5169, 2)

Label-encode the target column: 0 for ham and 1 for spam.

In [12]:
from sklearn.preprocessing import LabelEncoder
encode = LabelEncoder()

In [13]:
df['v1'] = encode.fit_transform(df['v1'])

In [14]:
df.head()

,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
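
The 0-ham and 1-spam mapping follows from LabelEncoder sorting the class names before numbering them. A minimal sketch on a hypothetical four-label sample (not the real column) shows how the mapping can be inspected and reversed:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample labels, standing in for df['v1'].
sample_labels = ["ham", "spam", "ham", "ham"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_labels)

# classes_ lists the original labels in the order of their integer codes.
classes = list(encoder.classes_)

# inverse_transform maps the integer codes back to the original labels.
decoded = list(encoder.inverse_transform(encoded))

print(classes, encoded.tolist())
```

In the notebook, the same check would be encode.classes_ after cell 13.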


In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['v2'],df['v1'],test_size = 0.30, random_state=0)
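
The split above is not stratified, so the share of spam can differ slightly between the training and test sets. The sketch below uses hypothetical toy labels (90 ham, 10 spam, not the real data) to show how the stratify argument of train_test_split keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 messages, 10% labelled spam (1).
texts = np.array([f"message {i}" for i in range(100)])
labels = np.array([0] * 90 + [1] * 10)

# stratify=labels asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.30, random_state=0, stratify=labels
)

train_spam_share = y_tr.mean()
test_spam_share = y_te.mean()
print(train_spam_share, test_spam_share)
```

For this notebook the equivalent change would be passing stratify=df['v1'] in cell 15; the reported metrics would then change slightly.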

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer().fit(X_train)

In [17]:
x_train_vector = count_vector.transform(raw_documents=X_train.values).toarray()
X_test_vector = count_vector.transform(raw_documents=X_test.values).toarray()
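
CountVectorizer is fitted on the training messages only, so the vocabulary never sees the test set. A small sketch on a hypothetical three-message corpus shows what fit and transform do, and that words outside the learned vocabulary are ignored at transform time:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for X_train.
train_docs = ["free entry to win", "ok see you soon", "win a free prize"]

vectorizer = CountVectorizer().fit(train_docs)

# The default tokenizer lowercases and keeps tokens of 2+ characters,
# so the single-letter word "a" is not part of the vocabulary.
vocabulary = sorted(vectorizer.vocabulary_)

# "cash" was never seen during fit, so it contributes no column.
counts = vectorizer.transform(["free free cash"])
row = counts.toarray()[0].tolist()

print(vocabulary)
print(row)
```

transform returns a sparse matrix; the notebook calls .toarray() to densify it, which works at this data size but uses more memory.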

## Building the model

In [19]:
#Fit the MultinomialNB model
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train_vector, y_train)

MultinomialNB()

In [20]:
model.predict(x_train_vector)

array([0, 0, 0, ..., 0, 0, 0])
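
To classify a new message, it has to go through the same fitted vectorizer before model.predict. One way to keep the two steps together is a scikit-learn pipeline. This is a sketch on a hypothetical toy corpus, not the notebook's trained model:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = ham.
train_texts = [
    "win free prize now",
    "free cash win",
    "see you at lunch",
    "are you home yet",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and then applies Naive Bayes.
spam_clf = make_pipeline(CountVectorizer(), MultinomialNB())
spam_clf.fit(train_texts, train_labels)

# Raw strings go in; the vectorizer step runs automatically.
predictions = spam_clf.predict(["free prize", "see you"])
print(predictions)
```

With the objects from this notebook, the equivalent call would be model.predict(count_vector.transform(["some new message"]).toarray()).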

# Checking the model's accuracy

In [21]:
from sklearn.metrics import classification_report

In [22]:
print(classification_report(y_train, model.predict(x_train_vector)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3174
           1       0.98      0.97      0.97       444

    accuracy                           0.99      3618
   macro avg       0.99      0.98      0.99      3618
weighted avg       0.99      0.99      0.99      3618



In [23]:
print(classification_report(y_test, model.predict(X_test_vector)))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1342
           1       0.97      0.90      0.94       209

    accuracy                           0.98      1551
   macro avg       0.98      0.95      0.96      1551
weighted avg       0.98      0.98      0.98      1551



In [24]:
from sklearn.metrics import confusion_matrix

In [25]:
confusion_matrix(y_train, model.predict(x_train_vector))

array([[3165,    9],
       [  14,  430]], dtype=int64)
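
The spam row of the training classification report can be recomputed from this matrix. In scikit-learn's layout rows are actual classes and columns are predicted classes, so the four cells are TN, FP, FN and TP. A short worked sketch using the counts above:

```python
import numpy as np

# Training confusion matrix from the cell above.
cm = np.array([[3165, 9],
               [14, 430]])

# ravel() flattens row by row: TN, FP, FN, TP for a binary problem.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of messages predicted spam, how many are spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Rounded to two decimals, these values agree with the class 1 row of the training report (0.98, 0.97, 0.97).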

In [26]:
# Pass plain arrays so rows are matched by position. Wrapping only the
# predictions in a new Series would align on index labels and miscount.
pd.crosstab(index=y_train.to_numpy(), columns=model.predict(x_train_vector),
            rownames=['Actuals'], colnames=['Predicted'])

Predicted,0,1
Actuals,,
0,3165,9
1,14,430


Training accuracy, computed from the confusion matrix above:

In [27]:
print("Accuracy: ", (3165+430)/(3165+9+14+430))

Accuracy:  0.9936428966279712
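
The hard-coded counts above can be avoided. As a sketch, accuracy is the diagonal of the confusion matrix divided by its total, and accuracy_score gives the same value from label arrays. The label arrays here are rebuilt from the matrix counts purely for illustration; in the notebook they would be y_train and model.predict(x_train_vector):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Training confusion matrix reported earlier in the notebook.
cm = np.array([[3165, 9],
               [14, 430]])

# Correct predictions sit on the diagonal.
acc_from_matrix = np.trace(cm) / cm.sum()

# Rebuild label arrays from the four cell counts (illustration only).
y_true = np.repeat([0, 0, 1, 1], cm.ravel())
y_pred = np.repeat([0, 1, 0, 1], cm.ravel())
acc_from_labels = accuracy_score(y_true, y_pred)

print(acc_from_matrix, acc_from_labels)
```

The same accuracy_score call on y_test and model.predict(X_test_vector) would give the held-out accuracy, which is the more informative number for judging generalization.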
