# SPAM Classifier
A simple program that labels an SMS as ham or spam based on the importance of the words used in the message itself.

We use the spam.csv file, which can be found on the internet; in this notebook it is read from the local path "D:/Acadview training notes/spam.csv".

The main Python libraries used are:

numpy

pandas

sklearn.feature_extraction.text

StratifiedShuffleSplit (from sklearn.model_selection)

and various scikit-learn classification models

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Importing the spam.csv file. We read it with the latin1 encoding so that SMS text containing bytes that are not valid UTF-8 can be loaded without decoding errors.

The createDataset method takes the path to the spam.csv file as input and returns X, the column containing the SMS text; y, the labels of those messages (i.e. ham or spam); and df, the full DataFrame.
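The effect of the encoding choice can be sketched on a tiny in-memory sample. The two rows below are hypothetical and are not taken from spam.csv; they only assume the same column names (class and SMS) that the notebook uses. The byte 0xA3 is the pound sign in latin1 but is not valid UTF-8 on its own.

```python
import io
import pandas as pd

# Hypothetical two-row sample with the same column names as the notebook.
# 0xA3 is the pound sign in latin1 and an invalid lone byte in UTF-8.
raw = b"class,SMS\nham,See you at 5\nspam,Win \xa3100 now\n"

# Decoding the raw bytes as UTF-8 fails on the 0xA3 byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin1 maps every byte to a character, so the read succeeds.
sample = pd.read_csv(io.BytesIO(raw), encoding="latin1")
print(utf8_ok)
print(sample)
```

Because latin1 assigns a character to every possible byte, it never raises a decoding error, although it may show the wrong characters if the file was actually written in a different encoding.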

In [14]:
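def createDataset(filePath):
    # Read the CSV; latin1 avoids decoding errors on non-UTF-8 bytes
    df = pd.read_csv(filePath, encoding="latin1")
    X = pd.DataFrame(df["SMS"])
    # 1-D label array taken from the column, instead of hard-coding 5572 rows
    y = df["class"].values
    # Class counts: spam first, then ham
    print(np.sum(df["class"] == 'spam'))
    print(np.sum(df["class"] == 'ham'))
    return X, y, df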

In [15]:
filePath = "D:/Acadview training notes/spam.csv"
X,y,df = createDataset(filePath)

747
4825


# Using TF-IDF for feature extraction

We vectorize the SMS column so that each message, which is a raw string, becomes a numeric feature vector. Vectorizing creates one feature per distinct word, which gives the classifiers many more features to work with.

The TF-IDF vectorizer weights each word by how often it appears in a message and by how rare it is across all messages, so the words that characterise a message receive higher weights. This is why we use it: the SMS messages are classified using the important words highlighted by the TF-IDF vectorizer.
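As a small illustration of this weighting, the sketch below runs TfidfVectorizer with its default settings on three made-up messages (the corpus is hypothetical and not part of spam.csv). In the first message, "prize" appears in no other message while "free" appears in two, so "prize" should receive the larger weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the weighting.
corpus = [
    "free prize call now",
    "call me when you are free",
    "are you coming home now",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: messages x vocabulary

vocab = vectorizer.vocabulary_            # maps each word to its column index
row0 = tfidf[0].toarray().ravel()         # weights of the first message

prize_weight = row0[vocab["prize"]]       # word found in one message only
free_weight = row0[vocab["free"]]         # word found in two messages
print(tfidf.shape)
print(prize_weight > free_weight)
```

Both words occur once in the first message, so the difference in weight comes only from the inverse document frequency part of TF-IDF.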

In [16]:
def vectorize():
    from sklearn.feature_extraction.text import TfidfVectorizer
    # Uses the global df returned by createDataset
    textFeatures = df['SMS']
    vectorizer = TfidfVectorizer()
    # Fit on the whole SMS column; the train/test split happens afterwards
    fX = vectorizer.fit_transform(textFeatures)
    return fX


In [17]:
fX = vectorize()

Stratified shuffle split is a cross-validation method that combines StratifiedKFold and ShuffleSplit. In the spam dataset there are clearly far fewer spam messages than ham messages (747 versus 4825), so the split into training and testing sets should be stratified so that the ratio of spam to ham labels is the same in both. StratifiedShuffleSplit does this by preserving the percentage of samples of each class in every split.
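The following minimal sketch shows the class ratio being preserved. The labels are synthetic (20 spam and 80 ham, chosen only so the arithmetic is easy to check) and the feature array is a placeholder; neither comes from spam.csv.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic imbalanced labels: 20% spam, 80% ham (illustrative only).
labels = np.array(["spam"] * 20 + ["ham"] * 80)
# Placeholder features; only the number of rows matters for the split.
features = np.arange(len(labels)).reshape(-1, 1)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=7)
train_idx, test_idx = next(splitter.split(features, labels))

# Fraction of spam in each half of the split.
train_spam_ratio = np.mean(labels[train_idx] == "spam")
test_spam_ratio = np.mean(labels[test_idx] == "spam")
print(train_spam_ratio, test_spam_ratio)
```

Both halves keep the original 20% share of spam, which a plain shuffled split does not guarantee.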

# Splitting the data set using Stratified Shuffle

In [18]:
from sklearn.model_selection import StratifiedShuffleSplit
def spliter(fX,y):
    s = StratifiedShuffleSplit(n_splits=2, test_size=0.5, random_state=7)

    # Each iteration overwrites the previous one, so the last of the two splits is returned
    for train_index, test_index in s.split(fX, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = fX[train_index], fX[test_index]
        # Index with the arrays directly; recent NumPy treats y[[train_index]] as a 2-D index
        y_train, y_test = y[train_index], y[test_index]
    return X_train, X_test, y_train, y_test

In [19]:
X_train, X_test, y_train, y_test = spliter(fX,y)

TRAIN: [2230 5530 2234 ..., 3325 3899 1588] TEST: [5139 3271 5013 ..., 2774 1670 1685]
TRAIN: [1407  942  175 ..., 1590 2529 5560] TEST: [2241 1679 2228 ..., 5484 5408 3383]


In [20]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
def model_accuracy(y_true,pred):
    # Print the accuracy (as a percentage), the confusion matrix and the per-class report
    print("1) Accuracy score = ",accuracy_score(y_true,pred)*100,"%")
    print("2) Confusion matrix : \n",confusion_matrix(y_true,pred))
    print("3) Classification report : \n",classification_report(y_true,pred))

# Multinomial Naive Bayes

In [21]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB


In [22]:
# alpha is the additive smoothing parameter
model = MultinomialNB(alpha=0.22)
model.fit(X_train,y_train)
pred = model.predict(X_test)

In [23]:
model_accuracy(y_test,pred)

1) Accuracy score =  98.1694185212 %
2) Confusion matrix : 
 [[2403    9]
 [  42  332]]
3) Classification report : 
              precision    recall  f1-score   support

        ham       0.98      1.00      0.99      2412
       spam       0.97      0.89      0.93       374

avg / total       0.98      0.98      0.98      2786
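The figures in this report can be checked by hand from the confusion matrix. The sketch below copies the Multinomial Naive Bayes matrix printed above and assumes scikit-learn's layout (rows are true labels, columns are predicted labels, in the order ham, spam), with spam treated as the positive class.

```python
import numpy as np

# Confusion matrix copied from the Multinomial Naive Bayes output.
# Rows: true ham, true spam. Columns: predicted ham, predicted spam.
cm = np.array([[2403, 9],
               [42, 332]])

# With spam as the positive class:
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # correct predictions over all 2786 messages
spam_precision = tp / (tp + fp)        # predicted spam that really is spam
spam_recall = tp / (tp + fn)           # real spam that was caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(accuracy * 100, spam_precision, spam_recall, spam_f1)
```

The recall shows that 42 of the 374 spam messages in the test set were labelled ham, which the high overall accuracy alone does not reveal. The same arithmetic applies to the confusion matrices of the other models.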



# Gaussian Naive Bayes

In [24]:
from sklearn.naive_bayes import GaussianNB

In [25]:
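modelg = GaussianNB()
# Class counts in the test set: spam first, then ham
print(np.sum(y_test=='spam'))
print(np.sum(y_test=='ham'))
# GaussianNB does not accept sparse input, so convert to dense arrays
modelg.fit(X_train.toarray(),y_train)
pred1 = modelg.predict(X_test.toarray())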

374
2412


In [26]:
model_accuracy(y_test,pred1)

1) Accuracy score =  90.5240488155 %
2) Confusion matrix : 
 [[2183  229]
 [  35  339]]
3) Classification report : 
              precision    recall  f1-score   support

        ham       0.98      0.91      0.94      2412
       spam       0.60      0.91      0.72       374

avg / total       0.93      0.91      0.91      2786



# Bernoulli Naive Bayes

In [27]:
from sklearn.naive_bayes import BernoulliNB

In [28]:
modelb = BernoulliNB()
modelb.fit(X_train,y_train)
pred2 = modelb.predict(X_test)

In [29]:
model_accuracy(y_test,pred2)

1) Accuracy score =  97.1643933955 %
2) Confusion matrix : 
 [[2409    3]
 [  76  298]]
3) Classification report : 
              precision    recall  f1-score   support

        ham       0.97      1.00      0.98      2412
       spam       0.99      0.80      0.88       374

avg / total       0.97      0.97      0.97      2786



# Support Vector Classifier

In [30]:
from sklearn.svm import SVC


In [31]:
svc = SVC(C=10000)
svc.fit(X_train,y_train)
pred3 = svc.predict(X_test)

In [32]:
model_accuracy(y_test,pred3)

1) Accuracy score =  98.3129935391 %
2) Confusion matrix : 
 [[2408    4]
 [  43  331]]
3) Classification report : 
              precision    recall  f1-score   support

        ham       0.98      1.00      0.99      2412
       spam       0.99      0.89      0.93       374

avg / total       0.98      0.98      0.98      2786



# Decision Tree

In [33]:
from sklearn.tree import DecisionTreeClassifier

In [34]:
dtc = DecisionTreeClassifier(criterion='gini')
dtc.fit(X_train,y_train)
pred4 = dtc.predict(X_test)

In [35]:
model_accuracy(y_test,pred4)

1) Accuracy score =  96.4106245513 %
2) Confusion matrix : 
 [[2362   50]
 [  50  324]]
3) Classification report : 
              precision    recall  f1-score   support

        ham       0.98      0.98      0.98      2412
       spam       0.87      0.87      0.87       374

avg / total       0.96      0.96      0.96      2786



# Ensemble

In [36]:
from sklearn.ensemble import BaggingClassifier

In [37]:
from sklearn.naive_bayes import MultinomialNB

In [38]:
mnb = MultinomialNB()


In [39]:
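# The base estimator is passed positionally because its keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
bgg = BaggingClassifier(mnb,n_estimators=120,random_state=7)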
bgg.fit(X_train,y_train)
pre = bgg.predict(X_test)

In [40]:
model_accuracy(y_test,pre)

1) Accuracy score =  94.2928930366 %
2) Confusion matrix : 
 [[2412    0]
 [ 159  215]]
3) Classification report : 
              precision    recall  f1-score   support

        ham       0.94      1.00      0.97      2412
       spam       1.00      0.57      0.73       374

avg / total       0.95      0.94      0.94      2786



# KNeighbors

In [41]:
from sklearn.neighbors import KNeighborsClassifier

In [42]:
knn = KNeighborsClassifier(n_neighbors=20,algorithm='brute')
knn.fit(X_train,y_train)
pree = knn.predict(X_test)

In [43]:
model_accuracy(y_test,pree)

1) Accuracy score =  95.5132806892 %
2) Confusion matrix : 
 [[2411    1]
 [ 124  250]]
3) Classification report : 
              precision    recall  f1-score   support

        ham       0.95      1.00      0.97      2412
       spam       1.00      0.67      0.80       374

avg / total       0.96      0.96      0.95      2786



# Logistic Regression

In [44]:
from sklearn.linear_model import LogisticRegression

In [45]:
LR = LogisticRegression()
LR.fit(X_train,y_train)
predd = LR.predict(X_test)

In [46]:
model_accuracy(y_test,predd)

1) Accuracy score =  94.472361809 %
2) Confusion matrix : 
 [[2409    3]
 [ 151  223]]
3) Classification report : 
              precision    recall  f1-score   support

        ham       0.94      1.00      0.97      2412
       spam       0.99      0.60      0.74       374

avg / total       0.95      0.94      0.94      2786

