## Agenda <br>

**1. Use TF-IDF Vectorizer to vectorize the sms text data into features after removing stopwords.**<br>

**2. Train Naive Bayes Classifier for spam/ham detection.**<br>

**3. Using Gridsearch to find the best parameters and scores.**<br>

**4. Analyzing results and identifying the class imbalance issue.**<br>

**5. Checking for abnormal cases of incorrect predictions by checking the probabilities of each class prediction.**

**6. Addressing the class imbalance issue using the SMOTE oversampling technique and comparing results with the previous model.**

In [1]:
## Loading in the libraries.

import pandas as pd

## Vectorizing text data using Tf-idf values.

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV
from sklearn.metrics import classification_report,roc_auc_score,accuracy_score,confusion_matrix

from sklearn.naive_bayes import MultinomialNB

from imblearn.over_sampling import SMOTE

In [2]:
## Loading in the dataset

sms_df = pd.read_csv('attachment_sms_spam.xls')
original_df = sms_df.copy()
sms_df.head()

Unnamed: 0,type,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
## Initializing vectorizer with stopwords

vectorizer = TfidfVectorizer(strip_accents='ascii',stop_words=stopwords.words('english'))  ## NLTK corpus file ids are lowercase
vectorizer.fit(sms_df.text)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', '...aven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn'],
        strip_accents='ascii', sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [4]:
#Preparing X and y (in binary format for predicting probabilities)

X = sms_df.text
y = sms_df['type'].map({'ham':0,'spam':1})

# Transforming X into tf-idf values

X_ = vectorizer.fit_transform(X)
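
To make the transformation concrete, here is a minimal pure-Python sketch of the weighting that the printed vectorizer settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) correspond to, assuming scikit-learn's documented smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row. The toy corpus and the `tfidf_rows` helper are hypothetical, and tokenisation is simplified to whitespace splitting with no stop-word removal.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of smoothed-idf, L2-normalised tf-idf weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Hypothetical three-message corpus.
toy_docs = ["free entry win cup", "ok lar joking", "win free prize"]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

Terms that occur in fewer messages ("entry", "cup") receive a larger weight than terms shared across messages ("free", "win"), which is the property the classifier relies on.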

In [5]:
## Training and Test sets

X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.3, random_state=42)

In [6]:
## Training Naive Bayes Classifier

n_bayes_model = MultinomialNB()
n_bayes_model.fit(X_train,y_train)
pred = n_bayes_model.predict(X_test)
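
As a rough illustration of what a multinomial Naive Bayes model estimates, the sketch below fits class priors and additively smoothed term likelihoods, then turns the per-class log scores into probabilities. It is a simplified stand-in, not scikit-learn's implementation: the helper names and the toy count matrix are hypothetical, and it uses integer counts rather than tf-idf values.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """X: list of count vectors, y: list of labels. Returns (log_prior, log_likelihood)."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_lik = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        feature_totals = [sum(r[j] for r in rows) for j in range(n_features)]
        # Additive (Laplace/Lidstone) smoothing keeps unseen terms from getting probability 0.
        denom = sum(feature_totals) + alpha * n_features
        log_lik[c] = [math.log((t + alpha) / denom) for t in feature_totals]
    return log_prior, log_lik

def predict_proba(x, log_prior, log_lik):
    """Normalise the per-class log scores into probabilities."""
    scores = {c: log_prior[c] + sum(xi * ll for xi, ll in zip(x, log_lik[c]))
              for c in log_prior}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp.values())
    return {c: v / total for c, v in exp.items()}

# Hypothetical counts for the terms "free", "win", "ok", "lar"; label 1 = spam.
X_toy = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1], [0, 1, 1, 0]]
y_toy = [1, 1, 0, 0, 0]
log_prior, log_lik = fit_multinomial_nb(X_toy, y_toy, alpha=1.0)
probs = predict_proba([1, 1, 0, 0], log_prior, log_lik)
print(probs)
```

The `alpha` argument plays the role of the smoothing parameter tuned later in the notebook: with `alpha=0` in this sketch, a term never seen in a class would require `log(0)`, which is why a strictly positive value is normally used.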

In [7]:
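## Checking results
## classification_report expects (y_true, y_pred). The call below passes (pred, y_test), so the
## predictions act as the reference: the precision and recall columns are swapped relative to the
## usual convention, and support counts predicted labels rather than actual ones.
print(classification_report(pred,y_test))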

              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1495
           1       0.79      1.00      0.88       178

   micro avg       0.97      0.97      0.97      1673
   macro avg       0.89      0.98      0.93      1673
weighted avg       0.98      0.97      0.97      1673



In [8]:
## Applying Gridsearch

In [9]:
grid = GridSearchCV(MultinomialNB(),param_grid={'alpha':list(range(0,20,2))},cv=10,return_train_score=True,
                   scoring='roc_auc')
grid.fit(X_,y)


GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'alpha': [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='roc_auc', verbose=0)

In [10]:
#Checking Grid results for params - ROC_AUC Scores

grid_df = pd.DataFrame.from_dict(grid.cv_results_)
grid_df[['param_alpha','mean_test_score', 'mean_train_score']]

Unnamed: 0,param_alpha,mean_test_score,mean_train_score
0,0,0.981572,0.999964
1,2,0.985413,0.993902
2,4,0.981028,0.990144
3,6,0.978047,0.987469
4,8,0.975563,0.985404
5,10,0.97356,0.983678
6,12,0.971886,0.982187
7,14,0.970369,0.980839
8,16,0.969096,0.979648
9,18,0.967843,0.978607
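
The scores in this table are ROC AUC values. One way to read them: AUC is the probability that a randomly chosen spam message receives a higher spam score than a randomly chosen ham message, with ties counted as half. The sketch below computes it by brute-force pair counting; the `roc_auc` helper and the six scores are hypothetical and only meant to show the definition.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted spam probabilities.
y_demo = [0, 0, 1, 0, 1, 1]
score_demo = [0.05, 0.30, 0.35, 0.60, 0.80, 0.95]
auc_demo = roc_auc(y_demo, score_demo)
print(round(auc_demo, 4))
```

Because AUC only looks at the ranking of scores, it does not depend on the 0.5 decision threshold. A model can therefore score a high AUC and still produce many false negatives at the default threshold.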


In [11]:
## Visualizing the comparisons with probabilities for each class.

output_df = pd.DataFrame(y_test)
output_df['pred'] = pred

pred_class = pd.DataFrame(n_bayes_model.predict_proba(X_test)) ## Creating a dataframe of class probabilities
pred_class.rename(columns={0:'Class0',1:'Class1'},inplace=True)
pred_class.index = output_df.index

output_df = output_df.join(pred_class)
output_df.head()

Unnamed: 0,type,pred,Class0,Class1
3690,0,0,0.997118,0.002882
3527,0,0,0.98458,0.01542
724,0,0,0.932968,0.067032
3370,0,0,0.997756,0.002244
468,0,0,0.94308,0.05692


In [24]:
## Let's analyze the instances where the classifier was wrong by checking the probabilities
## of each class. When the probabilities are close to 50/50, the prediction can easily go either way.
## In some cases, however, the classifier predicted the wrong class with high confidence.
## Let's check those messages.

prob_df  = output_df[(output_df['type']!=output_df['pred'])] ## Finding instances of mismatch

abnormal_predictions = prob_df[prob_df.Class0>0.75].copy()   ## Confidently wrong instances; .copy() avoids SettingWithCopyWarning

abnormal_predictions['text'] = sms_df.loc[abnormal_predictions.index,'text'] ## Attaching text to it.

abnormal_predictions.head()

Unnamed: 0,type,pred,Class0,Class1,text
3742,1,0,0.801979,0.198021,2/2 146tf150p
1507,1,0,0.807099,0.192901,Thanks for the Vote. Now sing along with the s...
1893,1,0,0.886661,0.113339,CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
856,1,0,0.919405,0.080595,Talk sexy!! Make new friends or fall in love i...
4676,1,0,0.829242,0.170758,"Hi babe its Chloe, how r u? I was smashed on s..."


In [13]:
## The confusion matrix shows how the classifier favours the predominant class 0.
## All 48 incorrect predictions are false negatives (Type II errors): spam predicted as ham.
## Addressing the class imbalance by oversampling should reduce the issue.

cm_df = pd.crosstab(y_test,pred)
cm_df.index.rename('Actual',inplace=True)
cm_df.columns.rename('Predicted',inplace=True)
cm_df

Actual \ Predicted,0,1
0,1447,0
1,48,178
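
Reading the matrix with rows as actual labels and columns as predictions, the per-class metrics can be derived directly from the four counts. The helper below is a hypothetical sketch, not part of the notebook; it takes the actual labels as the reference.

```python
def per_class_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for class 0 and class 1, with class 1 (spam) as positive."""
    def prf(correct, predicted_total, actual_total):
        precision = correct / predicted_total if predicted_total else 0.0
        recall = correct / actual_total if actual_total else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        return precision, recall, f1
    return {
        0: prf(tn, tn + fn, tn + fp),  # predicted-0 total, actual-0 total
        1: prf(tp, tp + fp, tp + fn),  # predicted-1 total, actual-1 total
    }

# Counts taken from the confusion matrix above.
metrics = per_class_metrics(tn=1447, fp=0, fn=48, tp=178)
for label, (p, r, f1) in metrics.items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# class 0: precision=0.97 recall=1.00 f1=0.98
# class 1: precision=1.00 recall=0.79 f1=0.88
```

Seen this way, the weakness of the first model is the class 1 recall: 48 of the 226 spam messages in the test set are missed.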


In [14]:
## ADDRESSING CLASS IMBALANCE WITH SMOTE OVERSAMPLING

In [15]:
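## Oversampling X and y based on class imbalance in y.

sm = SMOTE(random_state=2)

## fit_resample is the current name of the method (older imbalanced-learn releases called it fit_sample).
## Passing y.values keeps y_sm a NumPy array, which the cells below rely on.
X_sm,y_sm = sm.fit_resample(X_,y.values)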
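
For intuition about what SMOTE adds, the sketch below generates synthetic minority points by picking a minority sample, choosing one of its nearest minority neighbours, and interpolating at a random position on the segment between them. It is a simplified illustration of the idea, not imbalanced-learn's implementation; the `smote_like` helper, its parameters and the toy points are hypothetical.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating towards one of the k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen point (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append([a + gap * (b - a) for a, b in zip(base, target)])
    return synthetic

# Hypothetical 2-D minority samples.
minority = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9], [0.9, 0.1]]
new_points = smote_like(minority, n_new=3)
print(new_points)
```

Every synthetic point lies between two existing minority samples, so oversampling fills in the minority region of feature space instead of duplicating rows.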

In [16]:
## Training and Test sets

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_sm, y_sm, test_size=0.3, random_state=42)
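
One thing to keep in mind with this ordering: the split happens after oversampling, so synthetic points derived from a message can land on the other side of the split from that message, and the test set itself contains synthetic samples. A common alternative is to split first and oversample only the training portion. The sketch below shows that ordering on index lists; it uses plain random duplication as a stand-in for SMOTE, and the `split_then_oversample` helper and toy labels are hypothetical.

```python
import random

def split_then_oversample(labels, test_fraction=0.3, seed=42):
    """Return (train_idx, test_idx); minority rows are duplicated in the training part only."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    train_pos = [i for i in train_idx if labels[i] == 1]
    train_neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = sorted((train_pos, train_neg), key=len)
    if minority:
        # Random duplication stands in for SMOTE here; the test rows are left untouched.
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        train_idx = train_idx + extra
    return train_idx, test_idx

# Hypothetical imbalanced labels: 17 ham, 3 spam.
labels = [0] * 17 + [1] * 3
train_idx, test_idx = split_then_oversample(labels)
print(len(train_idx), len(test_idx))
```

With this ordering the test set keeps the original class ratio, so its metrics describe performance on data that looks like real traffic. The split in this sketch is not stratified.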

In [17]:
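## Training Naive Bayes Classifier
## Compared with the first report, the class 1 precision column rises from 0.79 to 0.99 and its F1 score from 0.88 to 0.98.
## As in the first report, the arguments are passed as (pred2, y_test2), so precision and recall are swapped
## relative to classification_report's (y_true, y_pred) convention.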

n_bayes_model2 = MultinomialNB()
n_bayes_model2.fit(X_train2,y_train2)
pred2 = n_bayes_model2.predict(X_test2)
print(classification_report(pred2,y_test2))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1447
           1       0.99      0.97      0.98      1450

   micro avg       0.98      0.98      0.98      2897
   macro avg       0.98      0.98      0.98      2897
weighted avg       0.98      0.98      0.98      2897



In [18]:
## Applying Gridsearch

grid2 = GridSearchCV(MultinomialNB(),param_grid={'alpha':list(range(0,20,2))},cv=10,return_train_score=True,
                   scoring='roc_auc')
grid2.fit(X_sm,y_sm)

#Checking Grid results for params - ROC_AUC Scores
## Read the results from grid2; grid.cv_results_ would only reproduce the table of the first search.

grid_df2 = pd.DataFrame.from_dict(grid2.cv_results_)
grid_df2[['param_alpha','mean_test_score', 'mean_train_score']]

In [19]:
## False negatives (Type II errors) drop sharply, from 48 to 10, but 45 false positives now appear where there were none.
## Accuracy on the respective test sets rises from about 0.971 to 0.981. Note that this test set also contains SMOTE-generated samples.

cm_df2 = pd.crosstab(y_test2,pred2)
cm_df2.index.rename('Actual',inplace=True)
cm_df2.columns.rename('Predicted',inplace=True)
cm_df2

Actual \ Predicted,0,1
0,1437,45
1,10,1405
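
The two confusion matrices in this notebook can be summarised side by side with a few ratios. The `summarize` helper is a hypothetical sketch; the counts are copied from the two matrices, reading rows as actual labels and columns as predictions. The two matrices come from different test sets (the second is drawn from the oversampled data), so the comparison is indicative only.

```python
def summarize(tn, fp, fn, tp):
    """Accuracy plus the two error rates, with class 1 (spam) as the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "false_negative_rate": fn / (fn + tp),  # share of spam predicted as ham
        "false_positive_rate": fp / (fp + tn),  # share of ham predicted as spam
    }

before = summarize(tn=1447, fp=0, fn=48, tp=178)     # first model
after = summarize(tn=1437, fp=45, fn=10, tp=1405)    # model trained on oversampled data
for name, stats in (("before", before), ("after", after)):
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

The false negative rate falls from roughly 21% to under 1%, while the false positive rate goes from 0 to about 3%.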


In [20]:
## Visualizing the comparisons with probabilities for each class.

output_df2 = pd.DataFrame(y_test2,columns=['actual'])
output_df2['pred'] = pred2

pred_class2 = pd.DataFrame(n_bayes_model2.predict_proba(X_test2)) ## Creating a dataframe of class probabilities
pred_class2.rename(columns={0:'Class0',1:'Class1'},inplace=True)
pred_class2.index = output_df2.index

output_df2 = output_df2.join(pred_class2)
output_df2.head()

Unnamed: 0,actual,pred,Class0,Class1
0,1,1,0.01804,0.98196
1,0,0,0.789127,0.210873
2,0,0,0.985222,0.014778
3,1,1,0.001368,0.998632
4,1,1,0.001384,0.998616


In [22]:
## Let's analyze the instances where the classifier was wrong by checking the probabilities
## of each class. When the probabilities are close to 50/50, the prediction can easily go either way.
## In some cases, however, the classifier predicted the wrong class with high confidence.
## Let's check those messages.
## Caveat: output_df2 is indexed by position within the resampled test split, not by sms_df row label,
## so the lookup below does not necessarily return the messages that were actually misclassified.

prob_df2  = output_df2[(output_df2['actual']!=output_df2['pred'])] ## Finding instances of mismatch

abnormal_predictions2 = prob_df2[(prob_df2.Class0>0.75)|(prob_df2.Class1>0.75)].copy()   ## Confidently wrong instances; .copy() avoids SettingWithCopyWarning

abnormal_predictions2['text'] = sms_df.loc[abnormal_predictions2.index,'text'] ## Attaching text to it.

abnormal_predictions2

Unnamed: 0,actual,pred,Class0,Class1,text
105,0,1,0.192867,0.807133,Umma my life and vava umma love you lot dear
441,1,0,0.773505,0.226495,Yes..he is really great..bhaji told kallis bes...
1722,0,1,0.119628,0.880372,Am watching house – very entertaining – am get...
2218,0,1,0.208842,0.791158,* Will have two more cartons off u and is very...
2243,0,1,0.043489,0.956511,Nope wif my sis lor... Aft bathing my dog then...
2364,1,0,0.795696,0.204304,Fantasy Football is back on your TV. Go to Sky...
2401,0,1,0.19398,0.80602,Hi! This is Roger from CL. How are you?
2631,1,0,0.752421,0.247579,No way I'm going back there!
2800,0,1,0.057018,0.942982,I've told him that i've returned it. That shou...
