### Spam Detection using Text Mining and Naive Bayes Classifier

- Each message is labeled as either spam or ham (legitimate).

In [1]:
# Import useful libraries used for data management
import pandas as pd
import numpy as np

# load dataset 'Spam.csv', reading it with ISO-8859-1 encoding
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
dataset = pd.read_csv("Spam.csv", encoding = "ISO-8859-1")

In [2]:
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html
# set max_colwidth 500
pd.set_option('display.max_colwidth', 500)  

dataset.head()

Unnamed: 0,Text,Class
0,"Hi...I have to use R to find out the 90% confidence-interval for the sensitivityand specificity of the following diagnostic test:A particular diagnostic test for multiple sclerosis was conducted on 20 MSpatients and 20 healthy subjects, 6 MS patients were classified as healthyand 8 healthy subjects were classified as suffering from the MS.Furthermore, I need to find the number of MS patients required for asensitivity of 1%...Is there a simple R-command which can do that for me?I am completel...",ham
1,"Francesco Poli wrote:) On Sun, 15 Apr 2007 21:24:00 +0200 Arnoud Engelfriet wrote:) ) The sign [X] (hereafter ""the Mark"") is a trademark, rights to which) ) are held by [Y], representing [Z] if applicable (hereafter ""the) ) Mark Holder"").) ) Wait, the ""Mark Holder"" would be [Y], I think.I thought you used Y and Z for cases where Y is licensing Z'strademark (if Y is Z's subsidiary or authorized licensee for example).Then the trademark holder is Z but Y has certain rights to the mark.The ""if a...",ham
2,"Stephen Thorne wrote:) What I was thinking was possibly using the web browser and a local) light weight http server of some kind (SimpleHTTPServer or something)) that would serve ajaxy data to the web browser and integrate with) things like the journal. This way even if the Web activity with its) MozEmbed component dies via OOM, the backend store can still be) written (and the backend store can take SIGDANGER signals and cache) that data to disk).This is exactly what's being worked on.",ham
3,"Hi,I have this site that auto generates an index.html file every 15 minutes(it's a blog aggregator).I need that every time the file is generated, all the contents betweenthe lines(h4 class=""post-title"")(ahref=""http://domain.com/2006/08/bourne-shell.html"")Bourne Shell(/a)(/h4)and (p)(a href=""http://domain.com/2006/08/bourne-shell.html"")S??b, 14 Abr2007 12:31:07(/a)(/p)to be deleted (including these two ones).I've tried several ways and googled but i can't do the trick.Any help would be apprec...",ham
4,"Author: metzeDate: 2007-04-16 08:20:13 +0000 (Mon, 16 Apr 2007)New Revision: 22249WebSVN: http://websvn.samba.org/cgi-bin/viewcvs.cgi?view=rev&root=samba&rev=22249Log:move tdb code to lib/tdb/ as in samba4metzeAdded: branches/SAMBA_3_0/source/lib/tdb/Removed: branches/SAMBA_3_0/source/tdb/Modified: branches/SAMBA_3_0/source/Makefile.in branches/SAMBA_3_0/source/configure.inChangeset:Modified: branches/SAMBA_3_0/source/Makefile.in===============================================================...",ham


In [3]:
dataset['Class'].value_counts()

ham     4864
spam    3246
Name: Class, dtype: int64
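
With 4864 ham and 3246 spam messages, a classifier that always predicts the majority class gives a useful reference point for the accuracies reported later. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative, not part of the notebook):

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4864, "spam": 3246}
total = sum(counts.values())

# Share of each class in the full dataset
for label, n in counts.items():
    print(f"{label}: {n / total:.3f}")

# Accuracy of a trivial model that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total
print("Majority-class baseline accuracy:", round(baseline_accuracy, 3))
```

Any model worth keeping should score clearly above this baseline.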

In [4]:
# map the 'Class' column to a new numeric 'Label' column; 'ham' as 0, 'spam' as 1
# Alternatively, you can do one-hot-encoding, and drop the attribute Class_ham.
dataset['Label'] = dataset['Class'].map({'ham':0, 'spam':1})

In [5]:
dataset.head()

Unnamed: 0,Text,Class,Label
0,"Hi...I have to use R to find out the 90% confidence-interval for the sensitivityand specificity of the following diagnostic test:A particular diagnostic test for multiple sclerosis was conducted on 20 MSpatients and 20 healthy subjects, 6 MS patients were classified as healthyand 8 healthy subjects were classified as suffering from the MS.Furthermore, I need to find the number of MS patients required for asensitivity of 1%...Is there a simple R-command which can do that for me?I am completel...",ham,0
1,"Francesco Poli wrote:) On Sun, 15 Apr 2007 21:24:00 +0200 Arnoud Engelfriet wrote:) ) The sign [X] (hereafter ""the Mark"") is a trademark, rights to which) ) are held by [Y], representing [Z] if applicable (hereafter ""the) ) Mark Holder"").) ) Wait, the ""Mark Holder"" would be [Y], I think.I thought you used Y and Z for cases where Y is licensing Z'strademark (if Y is Z's subsidiary or authorized licensee for example).Then the trademark holder is Z but Y has certain rights to the mark.The ""if a...",ham,0
2,"Stephen Thorne wrote:) What I was thinking was possibly using the web browser and a local) light weight http server of some kind (SimpleHTTPServer or something)) that would serve ajaxy data to the web browser and integrate with) things like the journal. This way even if the Web activity with its) MozEmbed component dies via OOM, the backend store can still be) written (and the backend store can take SIGDANGER signals and cache) that data to disk).This is exactly what's being worked on.",ham,0
3,"Hi,I have this site that auto generates an index.html file every 15 minutes(it's a blog aggregator).I need that every time the file is generated, all the contents betweenthe lines(h4 class=""post-title"")(ahref=""http://domain.com/2006/08/bourne-shell.html"")Bourne Shell(/a)(/h4)and (p)(a href=""http://domain.com/2006/08/bourne-shell.html"")S??b, 14 Abr2007 12:31:07(/a)(/p)to be deleted (including these two ones).I've tried several ways and googled but i can't do the trick.Any help would be apprec...",ham,0
4,"Author: metzeDate: 2007-04-16 08:20:13 +0000 (Mon, 16 Apr 2007)New Revision: 22249WebSVN: http://websvn.samba.org/cgi-bin/viewcvs.cgi?view=rev&root=samba&rev=22249Log:move tdb code to lib/tdb/ as in samba4metzeAdded: branches/SAMBA_3_0/source/lib/tdb/Removed: branches/SAMBA_3_0/source/tdb/Modified: branches/SAMBA_3_0/source/Makefile.in branches/SAMBA_3_0/source/configure.inChangeset:Modified: branches/SAMBA_3_0/source/Makefile.in===============================================================...",ham,0


In [6]:
# Now let's define X and y 

X = dataset['Text']
y = dataset['Label']

In [7]:
# show the dimension of the X
#https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html
X.shape

(8110,)

In [8]:
# Now let's do train/test split on the X and y data (70% train, 30% test)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3 , random_state=1)

In [9]:
# show the dimension of X_train
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html
X_train.shape

(5677,)

In [10]:
# show the dimension of X_test
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html

X_test.shape

(2433,)

In [11]:
# show the number of 'ham' and 'spam' in the training set
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
y_train.value_counts()

0    3393
1    2284
Name: Label, dtype: int64
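
The training split holds 3393 ham and 2284 spam messages, which is close to, but not exactly, the class ratio of the full dataset. If the ratio should be preserved exactly in both splits, `train_test_split` accepts a `stratify` argument. The sketch below uses a hypothetical toy dataset (100 made-up messages, 60 ham and 40 spam) rather than `Spam.csv`:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 ham (0) and 40 spam (1) messages
labels = [0] * 60 + [1] * 40
texts = [f"message {i}" for i in range(100)]

# stratify=labels keeps the 60/40 class ratio in both the train and test split
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=1, stratify=labels
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

In the notebook itself this would amount to adding `stratify=y` to the existing call, which would change the split and therefore the numbers reported below.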

In [12]:
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Now we are ready to vectorize the data
# first, instantiate the vectorizer
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# When building the vocabulary, ignore terms whose document frequency is strictly lower than min_df
# or strictly higher than max_df (float values are proportions of documents)
vectorizer = CountVectorizer(encoding='utf-8',stop_words='english',min_df=0.02, max_df=0.5)
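
To make the effect of `min_df` and `max_df` concrete, here is a small pure-Python sketch of document-frequency filtering. The corpus, the whitespace tokenization, and the thresholds are all hypothetical simplifications chosen for a four-document example; `CountVectorizer` applies its own tokenizer and stop-word list.

```python
from collections import Counter

# Hypothetical toy corpus (not taken from Spam.csv)
docs = [
    "free software offer",
    "free price offer",
    "meeting notes offer",
    "software patch notes",
]
min_df, max_df = 0.5, 0.6  # proportions picked for this tiny corpus

# Document frequency: number of documents that contain each term at least once
doc_freq = Counter(term for doc in docs for term in set(doc.split()))
n_docs = len(docs)

# Keep terms whose document-frequency proportion lies within [min_df, max_df]
vocab = sorted(
    term for term, count in doc_freq.items()
    if min_df <= count / n_docs <= max_df
)
print(vocab)
```

Here "offer" appears in 3 of 4 documents and is dropped by `max_df`, while terms that appear in only one document are dropped by `min_df`.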

#### TODO
- Please try TfidfVectorizer later to see the model performance (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- You need to use "from sklearn.feature_extraction.text import TfidfVectorizer" to import the TfidfVectorizer first
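
As a starting point for the TODO above, the following sketch shows `TfidfVectorizer` used with the same fit-on-train, transform-on-test pattern as `CountVectorizer`. The documents are hypothetical stand-ins for `X_train` and `X_test`, and `min_df`/`max_df` are left at their defaults because the toy corpus is too small for the proportions used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for X_train / X_test (not from Spam.csv)
train_docs = [
    "cheap software price offer",
    "meeting notes for the samba patch",
    "cheap price on adobe software suite",
    "patch review and meeting time",
]
test_docs = ["cheap adobe offer", "samba meeting"]

tfidf = TfidfVectorizer(stop_words="english")

# Learn vocabulary and idf weights on the training documents only
train_tfidf = tfidf.fit_transform(train_docs)
# Reuse the learned vocabulary and idf weights for the test documents
test_tfidf = tfidf.transform(test_docs)

print(train_tfidf.shape, test_tfidf.shape)
```

The resulting matrices can be passed to `MultinomialNB` in place of the count matrices to compare model performance.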

In [13]:
# Learn the vocabulary dictionary and return term-document matrix.
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform
X_train_vec = vectorizer.fit_transform(X_train)

In [14]:
# print the terms
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.get_feature_names_out
vectorizer.get_feature_names_out()

array(['00', '000', '0000', '00our', '00proposition', '00we', '01', '02',
       '0200', '03', '04', '05', '06', '07', '08', '09', '0nline', '10',
       '100', '11', '12', '128', '129', '13', '14', '149', '149ableton',
       '15', '16', '17', '18', '185technologies', '19', '1imited', '1o',
       '20', '2006', '2007', '21', '22', '23', '24', '2441', '2442',
       '2480', '25', '26', '27', '28', '29', '2http', '30', '31', '319',
       '32', '33', '34', '35', '36', '369', '37', '38', '39', '399', '40',
       '40speedy', '41', '42', '44', '449', '45', '47', '48', '49',
       '49adobe', '49http', '50', '500', '56', '59', '60', '69adobe',
       '75', '79', '80', '819', '89', '899', '90', '95', '95you',
       '95your', '99', '______________________________________________',
       '______________________________________________r', 'able', 'ac',
       'access', 'account', 'accurate', 'acrobat', 'action', 'actually',
       'add', 'added', 'additional', 'address', 'adobe', 'advance',


In [15]:
print(X_train_vec)

  (0, 784)	3
  (0, 258)	1
  (0, 246)	1
  (0, 441)	1
  (0, 305)	3
  (0, 704)	1
  (0, 348)	2
  (0, 584)	2
  (0, 544)	1
  (0, 388)	2
  (0, 183)	4
  (0, 347)	1
  (0, 428)	1
  (0, 577)	1
  (0, 699)	1
  (0, 734)	1
  (0, 262)	1
  (0, 228)	2
  (0, 382)	1
  (0, 400)	1
  (0, 497)	1
  (1, 39)	1
  (1, 37)	1
  (1, 638)	2
  (1, 754)	2
  :	:
  (5676, 758)	1
  (5676, 447)	1
  (5676, 125)	1
  (5676, 167)	1
  (5676, 539)	1
  (5676, 110)	1
  (5676, 466)	1
  (5676, 575)	1
  (5676, 511)	1
  (5676, 112)	1
  (5676, 489)	1
  (5676, 239)	1
  (5676, 264)	1
  (5676, 440)	1
  (5676, 634)	2
  (5676, 245)	1
  (5676, 362)	1
  (5676, 626)	1
  (5676, 501)	1
  (5676, 275)	1
  (5676, 599)	1
  (5676, 229)	1
  (5676, 582)	1
  (5676, 747)	1
  (5676, 680)	1


In [16]:
# Now let's transform the test data without fitting (fit only on training)! 
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.transform
X_test_vec = vectorizer.transform(X_test)
print(X_test_vec)

  (0, 23)	1
  (0, 36)	1
  (0, 37)	1
  (0, 51)	1
  (0, 70)	1
  (0, 71)	2
  (0, 89)	3
  (0, 93)	2
  (0, 122)	1
  (0, 161)	1
  (0, 172)	6
  (0, 176)	1
  (0, 179)	1
  (0, 183)	2
  (0, 191)	2
  (0, 204)	2
  (0, 227)	1
  (0, 251)	1
  (0, 252)	1
  (0, 274)	6
  (0, 286)	1
  (0, 290)	2
  (0, 301)	1
  (0, 323)	6
  (0, 331)	6
  :	:
  (2432, 27)	1
  (2432, 32)	4
  (2432, 35)	1
  (2432, 37)	4
  (2432, 48)	1
  (2432, 49)	1
  (2432, 51)	1
  (2432, 52)	4
  (2432, 54)	3
  (2432, 55)	3
  (2432, 56)	1
  (2432, 57)	2
  (2432, 60)	1
  (2432, 61)	1
  (2432, 122)	2
  (2432, 160)	7
  (2432, 338)	2
  (2432, 491)	1
  (2432, 497)	1
  (2432, 611)	4
  (2432, 612)	2
  (2432, 666)	2
  (2432, 721)	1
  (2432, 735)	1
  (2432, 771)	1


In [17]:
X_train_vec.shape

(5677, 793)

In [18]:
#Import Multinomial Naive Bayes model from sklearn
from sklearn.naive_bayes import MultinomialNB

# Create a Multinomial Naive Bayes classifier, a common choice for word-count features (it is also often used with tf-idf)
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
mnb = MultinomialNB()
# train the model using training set 
mnb.fit(X_train_vec, y_train)

MultinomialNB()

In [19]:
# Log probability of each term given class 1 (spam), i.e. log P(term | spam)
coeff_df = pd.DataFrame(mnb.feature_log_prob_[1, :], index=vectorizer.get_feature_names_out(), columns=['Log Probability'])
coeff_df = coeff_df.sort_values(by=['Log Probability'], ascending=False)

# display the 50 terms with the highest probability under the spam class
# (frequent in spam, though not necessarily rare in ham)
coeff_df.head(50)

Unnamed: 0,Log Probability
com,-3.702794
adobe,-4.342634
79,-4.403574
price,-4.556449
software,-4.811341
php,-4.833474
suite,-4.885449
time,-4.887933
microsoft,-4.909304
2007,-4.9144
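
The values above are smoothed estimates of log P(term | spam): for each class, `MultinomialNB` adds `alpha` to every term's total count and divides by the class's total count plus `alpha` times the vocabulary size. A minimal hand-computed sketch on a hypothetical three-term vocabulary (the data and the helper name `feature_log_prob` are illustrative):

```python
import math

# Rows are documents, columns are term counts (hypothetical data)
X = [[2, 0, 1], [1, 1, 0], [0, 3, 1]]
y = [1, 1, 0]
alpha = 1.0  # Laplace smoothing, the MultinomialNB default

def feature_log_prob(X, y, target, alpha):
    """log P(term | class) with additive smoothing."""
    n_features = len(X[0])
    # Total count of each term over the documents of the target class
    term_totals = [
        sum(row[j] for row, label in zip(X, y) if label == target)
        for j in range(n_features)
    ]
    denom = sum(term_totals) + alpha * n_features
    return [math.log((t + alpha) / denom) for t in term_totals]

spam_log_prob = feature_log_prob(X, y, target=1, alpha=alpha)
print([round(v, 4) for v in spam_log_prob])
```

Exponentiating a log probability recovers the probability itself, so the per-class values sum to 1 across the vocabulary.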


In [20]:
# Make class prediction for test set
# y_pred_class is the binary label
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.predict
y_pred_class = mnb.predict(X_test_vec)

# y_pred_prob is the probability estimate
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.predict_proba
y_pred_prob = mnb.predict_proba(X_test_vec)
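
`predict_proba` returns one column per class, so the probability estimates can be turned into labels with a custom cutoff instead of the default decision. The sketch below uses hypothetical probability rows and assumes column 1 holds P(spam), which follows from the 0/1 label mapping used earlier; `classify` is an illustrative helper, not a scikit-learn function.

```python
# Hypothetical rows shaped like predict_proba output: [P(ham), P(spam)]
probs = [[0.90, 0.10], [0.40, 0.60], [0.55, 0.45], [0.05, 0.95]]

def classify(probs, threshold=0.5):
    """Label a message as spam (1) when P(spam) reaches the threshold."""
    return [1 if p_spam >= threshold else 0 for _, p_spam in probs]

default_labels = classify(probs)        # cutoff at 0.5
lenient_labels = classify(probs, 0.4)   # flags more messages as spam

print(default_labels)
print(lenient_labels)
```

Lowering the threshold trades fewer missed spam messages for more legitimate messages flagged as spam.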

In [21]:
# import libraries for evaluation measures
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

print("Accuracy:",accuracy_score(y_test, y_pred_class, normalize=True, sample_weight=None))
print("Confusion Matrix:", confusion_matrix(y_test, y_pred_class))
print("Classification Report:",classification_report(y_test, y_pred_class))

Accuracy: 0.9445129469790382
Confusion Matrix: [[1464    7]
 [ 128  834]]
Classification Report:               precision    recall  f1-score   support

           0       0.92      1.00      0.96      1471
           1       0.99      0.87      0.93       962

    accuracy                           0.94      2433
   macro avg       0.96      0.93      0.94      2433
weighted avg       0.95      0.94      0.94      2433
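
The figures in the classification report can be reproduced directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. A short sketch of that arithmetic for the spam class, using the four numbers printed above:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
tn, fp = 1464, 7     # true ham:  predicted ham, predicted spam
fn, tp = 128, 834    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many are spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The 128 false negatives explain why spam recall (0.87) is well below spam precision (0.99).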



In [22]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

selectkbest = SelectKBest(chi2, k=400)
selectkbest.fit(X_train_vec, y_train)
X_train_vec2 = selectkbest.transform(X_train_vec)
X_test_vec2 = selectkbest.transform(X_test_vec)
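
`SelectKBest(chi2, k=400)` ranks terms by a chi-squared statistic that compares each term's observed total count per class with the count expected if the term were independent of the class. A hand-computed sketch on hypothetical data (the helper `chi2_scores` is illustrative and assumes every term occurs at least once):

```python
# Rows are documents, columns are term counts (hypothetical data)
X = [[3, 0], [2, 1], [0, 1], [0, 2]]
y = [1, 1, 0, 0]

def chi2_scores(X, y):
    """Chi-squared score per term: sum over classes of (observed - expected)^2 / expected."""
    classes = sorted(set(y))
    n_docs = len(y)
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)  # total count of term j
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            class_share = sum(1 for label in y if label == c) / n_docs
            expected = total * class_share
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

scores = chi2_scores(X, y)
print(scores)
```

The first term occurs only in spam documents and gets the higher score, so it would be kept ahead of the second term.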

In [23]:
selectkbest.get_support(indices=True)

array([  0,   2,   3,   4,   5,   6,   7,  10,  13,  14,  15,  16,  17,
        19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,
        32,  33,  34,  35,  37,  38,  39,  40,  41,  42,  43,  44,  46,
        48,  50,  51,  53,  54,  59,  63,  65,  68,  69,  72,  73,  74,
        75,  79,  81,  83,  84,  85,  86,  87,  90,  91,  92,  93,  94,
        97,  99, 100, 105, 107, 109, 110, 111, 112, 113, 115, 116, 118,
       121, 122, 123, 125, 128, 129, 130, 131, 132, 136, 140, 143, 147,
       149, 150, 152, 153, 155, 156, 157, 158, 160, 167, 171, 172, 177,
       178, 183, 185, 189, 191, 193, 194, 199, 200, 201, 204, 205, 210,
       216, 217, 218, 221, 222, 228, 229, 230, 232, 233, 238, 239, 240,
       241, 242, 243, 244, 245, 247, 250, 251, 253, 255, 257, 258, 260,
       261, 263, 264, 265, 266, 268, 269, 271, 274, 277, 281, 282, 284,
       287, 289, 290, 293, 296, 298, 299, 303, 312, 317, 321, 323, 329,
       331, 332, 334, 335, 337, 341, 342, 343, 345, 347, 349, 35

In [24]:
#Import Multinomial Naive Bayes model from sklearn
from sklearn.naive_bayes import MultinomialNB

# Create a Multinomial Naive Bayes classifier, a common choice for word-count features (it is also often used with tf-idf)
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
mnb = MultinomialNB()
# train the model using training set 
mnb.fit(X_train_vec2, y_train)

MultinomialNB()

In [25]:
# Log probability of each selected term given class 1 (spam), i.e. log P(term | spam)
selected_terms = vectorizer.get_feature_names_out()[selectkbest.get_support(indices=True)]
coeff_df = pd.DataFrame(mnb.feature_log_prob_[1, :], index=selected_terms, columns=['Log Probability'])
coeff_df = coeff_df.sort_values(by=['Log Probability'], ascending=False)

# display the 50 selected terms with the highest probability under the spam class
coeff_df.head(50)

Unnamed: 0,Log Probability
com,-3.36317
adobe,-4.003011
79,-4.063951
price,-4.216826
software,-4.471718
php,-4.49385
suite,-4.545826
microsoft,-4.569681
2007,-4.574777
59,-4.643496


In [26]:
# Make class prediction for test set
# y_pred_class is the binary label
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.predict
y_pred_class = mnb.predict(X_test_vec2)

# y_pred_prob is the probability estimate
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.predict_proba
y_pred_prob = mnb.predict_proba(X_test_vec2)

In [27]:
print("Accuracy:",accuracy_score(y_test, y_pred_class, normalize=True, sample_weight=None))
print("Confusion Matrix:", confusion_matrix(y_test, y_pred_class))
print("Classification Report:",classification_report(y_test, y_pred_class))

Accuracy: 0.9116317303740238
Confusion Matrix: [[1463    8]
 [ 207  755]]
Classification Report:               precision    recall  f1-score   support

           0       0.88      0.99      0.93      1471
           1       0.99      0.78      0.88       962

    accuracy                           0.91      2433
   macro avg       0.93      0.89      0.90      2433
weighted avg       0.92      0.91      0.91      2433



In [None]:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import accuracy_score

#Import Multinomial Naive Bayes model from sklearn
from sklearn.naive_bayes import MultinomialNB

# Create a Multinomial Naive Bayes classifier, a common choice for word-count features (it is also often used with tf-idf)
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
mnb = MultinomialNB()

sfs = SequentialFeatureSelector(mnb,  
           n_features_to_select=400,
           direction='forward',
           scoring='accuracy',
           cv=4)

sfs.fit(X_train_vec, y_train) 
X_train_vec3 = sfs.transform(X_train_vec)
X_test_vec3 = sfs.transform(X_test_vec)
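
Forward sequential selection is expensive at this scale: at every step each remaining candidate feature is scored with cross-validation. A rough count of the model fits implied by the settings above (793 vocabulary terms, 400 features to select, `cv=4`), as a back-of-the-envelope sketch:

```python
n_features = 793   # vocabulary size from X_train_vec.shape
n_select = 400     # n_features_to_select
cv = 4             # cross-validation folds

# At step s, s features are already chosen, leaving n_features - s candidates to score
candidate_evals = sum(n_features - step for step in range(n_select))
model_fits = candidate_evals * cv

print("candidate evaluations:", candidate_evals)
print("model fits:", model_fits)
```

Close to a million fits are required, so this cell can take a long time to run even though each Naive Bayes fit is fast.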

In [None]:
sfs.get_support(indices=True)

In [None]:
#Import Multinomial Naive Bayes model from sklearn
from sklearn.naive_bayes import MultinomialNB

# Create a Multinomial Naive Bayes classifier, a common choice for word-count features (it is also often used with tf-idf)
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
mnb = MultinomialNB()
# train the model using training set 
mnb.fit(X_train_vec3, y_train)

In [None]:
# Log probability of each selected term given class 1 (spam), i.e. log P(term | spam)
# Use the support of sfs (not selectkbest) so the term names match the columns of X_train_vec3
selected_terms = vectorizer.get_feature_names_out()[sfs.get_support(indices=True)]
coeff_df = pd.DataFrame(mnb.feature_log_prob_[1, :], index=selected_terms, columns=['Log Probability'])
coeff_df = coeff_df.sort_values(by=['Log Probability'], ascending=False)

# display the 50 selected terms with the highest probability under the spam class
coeff_df.head(50)

In [None]:
# Make class prediction for test set
# y_pred_class is the binary label
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.predict
y_pred_class = mnb.predict(X_test_vec3)

# y_pred_prob is the probability estimate
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.predict_proba
y_pred_prob = mnb.predict_proba(X_test_vec3)

In [None]:
print("Accuracy:",accuracy_score(y_test, y_pred_class, normalize=True, sample_weight=None))
print("Confusion Matrix:", confusion_matrix(y_test, y_pred_class))
print("Classification Report:",classification_report(y_test, y_pred_class))

In [None]:
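from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Fit a logistic regression; its coefficient magnitudes serve as feature importances
# (max_iter is raised here as a precaution against convergence warnings on count data)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_vec, y_train)

# threshold=-np.inf disables the importance cutoff, so exactly the max_features=400
# features with the largest absolute coefficients are kept
selector = SelectFromModel(lr, prefit=True, max_features=400, threshold=-np.inf)
X_train_vec4 = selector.transform(X_train_vec)
X_test_vec4 = selector.transform(X_test_vec)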
X_train_vec4.shape
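
With `threshold=-np.inf` and `max_features` set, `SelectFromModel` effectively keeps the features with the largest absolute coefficients. A pure-Python sketch of that ranking on hypothetical coefficients (five features, keeping three):

```python
# Hypothetical logistic regression coefficients, one per feature
coefs = [0.8, -1.5, 0.1, -0.3, 2.0]
max_features = 3

# Importance is the absolute value of the coefficient, so sign does not matter
importances = [abs(c) for c in coefs]

# Rank feature indices by importance and keep the top max_features
top = sorted(range(len(coefs)), key=lambda i: importances[i], reverse=True)[:max_features]
selected = sorted(top)  # report kept indices in column order

print(selected)
```

Strongly negative coefficients (terms that indicate ham) are kept alongside strongly positive ones (terms that indicate spam).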

In [None]:
#Import Multinomial Naive Bayes model from sklearn
from sklearn.naive_bayes import MultinomialNB

# Create a Multinomial Naive Bayes classifier, a common choice for word-count features (it is also often used with tf-idf)
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
mnb = MultinomialNB()
# train the model using training set 
mnb.fit(X_train_vec4, y_train)

In [None]:

# Log probability of each selected term given class 1 (spam), i.e. log P(term | spam)
selected_terms = vectorizer.get_feature_names_out()[selector.get_support(indices=True)]
coeff_df = pd.DataFrame(mnb.feature_log_prob_[1, :], index=selected_terms, columns=['Log Probability'])
coeff_df = coeff_df.sort_values(by=['Log Probability'], ascending=False)

# display the 50 selected terms with the highest probability under the spam class
coeff_df.head(50)

In [None]:
# Make class prediction for test set
# y_pred_class is the binary label
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.predict
y_pred_class = mnb.predict(X_test_vec4)

# y_pred_prob is the probability estimate
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.predict_proba
y_pred_prob = mnb.predict_proba(X_test_vec4)

In [None]:
print("Accuracy:",accuracy_score(y_test, y_pred_class, normalize=True, sample_weight=None))
print("Confusion Matrix:", confusion_matrix(y_test, y_pred_class))
print("Classification Report:",classification_report(y_test, y_pred_class))