In [98]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Every day we receive many different messages, and a variety of algorithms are used to categorize them as ham (legitimate) or spam. In this analysis, I will try to predict whether or not a message is spam. I will use the SMS Spam Collection dataset hosted by UC Irvine, which you can find here: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection.

##**Exploratory Data Analysis**

First I need to import the necessary packages.

In [99]:
# Import the necessary packages
import numpy as np
import pandas as pd
import nltk
import string
import re
from nltk.corpus import stopwords
from sklearn.model_selection import cross_val_score

In [100]:
# Import the dataset
df = pd.read_table('/content/drive/My Drive/NLP/SMSSpamCollection', header = None, encoding = 'utf-8')
# Note: The path above points to my Google Drive, so it may be different for you.

First, I will look at the dataset: what the variables are, whether there are any missing values, and the shape of the dataset.

In [101]:
# Check the first 5 rows
df.head()

      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [102]:
# Check the unique values of the dataframe
df.nunique()

0       2
1    5169
dtype: int64

In [103]:
# Check sum of missing values
df.isnull().sum()

0    0
1    0
dtype: int64

In [104]:
# Check the shape of df
df.shape

(5572, 2)

In [105]:
# Get info about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       5572 non-null   object
 1   1       5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [106]:
# Check the number of ham and spam messages
classes = df[0]
print(classes.value_counts())

ham     4825
spam     747
Name: 0, dtype: int64


Note: The number of spam messages is relatively small. If I simply labeled every message as ham, I would already get around 87% accuracy: (number of ham) / (total number of messages) * 100 = 4825 / 5572 * 100, which is about 86.6%.
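
The baseline mentioned above can be checked with a few lines of arithmetic. This is a minimal sketch that uses the class counts printed above; the variable names are my own.

```python
# Class counts taken from the value_counts() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Accuracy of a trivial classifier that labels every message as ham
baseline_accuracy = ham_count / total
print(f"Majority-class baseline: {baseline_accuracy:.1%}")
```

Any model in this analysis is only useful to the extent that it beats this number.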

In [107]:
# Check the names of columns of df
df.columns

Int64Index([0, 1], dtype='int64')

To start the analysis, I need to convert ham and spam to 0 and 1. The following function performs that operation.

In [108]:
# Convert ham and spam to 0 and 1
def convert(row):
  if row[0] == "ham":
    return 0
  else:
    return 1

df['spam'] = df.apply(convert, axis=1)
df = df.drop([0], axis=1)
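
As a side note, the same conversion could probably be written without a row-wise function by using pandas' `Series.map` with a dictionary. The sketch below uses a small made-up DataFrame (not the real dataset), and `toy` and `label_map` are names I introduce only for illustration. One difference to keep in mind: `map` returns NaN for any label missing from the dictionary, while the function above returns 1 for anything that is not "ham".

```python
import pandas as pd

# Toy stand-in for the real dataframe: column 0 is the label, column 1 the text
toy = pd.DataFrame({0: ["ham", "spam", "ham"],
                    1: ["see you later", "win a free prize", "ok"]})

# Hypothetical mapping from label to integer class
label_map = {"ham": 0, "spam": 1}

# Vectorized replacement for the row-wise convert() function
toy["spam"] = toy[0].map(label_map)
toy = toy.drop(columns=[0])
print(toy)
```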

In [109]:
# Drop the duplicates
df.drop_duplicates(inplace = True)

In [110]:
# Reset Index
df = df.reset_index(drop=True)

In [111]:
# Check the shape of df
df.shape

(5169, 2)

After applying the function to the dataset, I dropped the duplicate rows and reset the index to start from 0 again. Removing the duplicate rows reduces the number of rows from 5572 to 5169.

In [112]:
# Check the number of ham and spam messages
classes = df['spam']
print(classes.value_counts())

0    4516
1     653
Name: spam, dtype: int64


In [113]:
# Check the tail of dataset
df.tail()

                                                      1  spam
5164  This is the 2nd time we have tried 2 contact u...     1
5165               Will ü b going to esplanade fr home?     0
5166  Pity, * was in mood for that. So...any other s...     0
5167  The guy did some bitching but I acted like i'd...     0
5168                         Rofl. Its true to its name     0


##**Text Preprocessing**

I need to replace email addresses, URLs, phone numbers, money symbols, and numbers with placeholder tokens, and remove punctuation and extra whitespace (including leading and trailing whitespace). The reason is that I want to see whether the occurrence of an email address or phone number, rather than a specific email address or phone number, can indicate whether or not the message is spam. I also need to make all the words lowercase ("he" and "He" have the same meaning), remove the stopwords (is, which, this, etc.), and apply lexicon normalization ("car" and "cars" have the same meaning).

In [114]:
# Take the text column
text_messages = df[1]

In [115]:
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

In [116]:
# Download stopwords and wordnet
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [117]:
# Import stopwords
stop_words = nltk.corpus.stopwords.words('english')

To replace email addresses and phone numbers, I used regular expressions. I took the necessary expressions from the following website: http://regexlib.com/Search.aspx.

In [118]:
# Make a function to conduct text preprocessing
# Build the stopword set once instead of on every token lookup
stop_word_set = set(stop_words)

def preprocess(text):
    assert isinstance(text, str)
    # Replace email addresses with emailaddr
    # Note: the ^...$ anchors in the email, url, and phone patterns mean
    # they match only when the whole message is an address, url, or number
    processed = re.sub(r'^.+@[^\.].*\.[a-z]{2}$', 'emailaddr', text)
    
    # Replace urls with webaddr
    processed = re.sub(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$', 'webaddr', processed)
    
    # Replace money symbols with money
    processed = re.sub(r'£|\$', 'money', processed)

    # Replace phone numbers with phone
    processed = re.sub(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$', 'phone', processed)
    
    # Replace numbers with 'number'
    processed = re.sub(r'\d+(\.\d+)?', 'number', processed)
    
    # Lowercase the text and replace punctuation with spaces
    processed = re.sub(r'[^\w\d\s]', ' ', processed.lower())
    
    # Drop stopwords and lemmatize the remaining terms
    return ' '.join(
        lem.lemmatize(term) 
        for term in processed.split()
        if term not in stop_word_set
    )
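
The URL pattern in the function above is wrapped in `^` and `$`, so it only fires when the entire message is a URL. The sketch below illustrates the difference on an invented message. `UNANCHORED_URL` is a hypothetical variant of mine, not part of the original analysis, and I have not re-run the models with it.

```python
import re

# Pattern as used in preprocess(): anchored to the start and end of the string
ANCHORED_URL = re.compile(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$')

# Hypothetical variant without anchors, so URLs inside a longer message match
UNANCHORED_URL = re.compile(r'https?://[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(/\S*)?')

# Invented example message (not taken from the dataset)
message = "Claim your prize at http://example.com/win now"

anchored_result = ANCHORED_URL.sub('webaddr', message)
unanchored_result = UNANCHORED_URL.sub('webaddr', message)

print(anchored_result)    # unchanged: the message does not start with "http"
print(unanchored_result)  # the embedded URL is replaced by the placeholder
```

The same reasoning applies to the email and phone patterns, which are anchored in the same way.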

In [119]:
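# Apply the function on the text data
# Note: the result is only displayed here; it is not assigned back,
# so text_messages still holds the raw text after this cell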
text_messages.apply(preprocess)

0       go jurong point crazy available bugis n great ...
1                                 ok lar joking wif u oni
2       free entry number wkly comp win fa cup final t...
3                     u dun say early hor u c already say
4                     nah think go usf life around though
                              ...                        
5164    numbernd time tried number contact u u moneynu...
5165                          ü b going esplanade fr home
5166                                 pity mood suggestion
5167    guy bitching acted like interested buying some...
5168                                       rofl true name
Name: 1, Length: 5169, dtype: object

##**Feature Engineering**

After cleaning the data, I need to perform feature engineering. First, I need to tokenize the data. Tokens can be not only single words or sentences, but also sequences of words. For example, a bigram is a pair of two adjacent words taken as one token. The advantage of bigrams is that they capture some of the context that single words lose. I will use unigrams and bigrams, and then calculate tf-idf statistics, which weight each token by how often it appears in a message relative to how common it is across all messages.
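
As a rough illustration of the two steps described above, the sketch below builds unigrams and bigrams by hand and weights them with a smoothed idf followed by L2 normalization, which is, as far as I understand, what `TfidfVectorizer` does with its default settings. The toy messages and the helper names (`ngrams`, `tfidf`) are mine, and the tokenization is a plain `split()`, which is simpler than sklearn's tokenizer.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (joined by a space) for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(' '.join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return grams

# Toy corpus (invented, not from the dataset)
docs = ["free entry win prize", "ok see you later", "win free prize now"]
doc_grams = [ngrams(d.split()) for d in docs]

# Document frequency: in how many documents each n-gram appears
df_counts = Counter(g for grams in doc_grams for g in set(grams))
n_docs = len(docs)

def tfidf(grams):
    """tf * smoothed idf, then L2-normalize the document vector."""
    tf = Counter(grams)
    weights = {g: count * (math.log((1 + n_docs) / (1 + df_counts[g])) + 1)
               for g, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()}

first = tfidf(doc_grams[0])
for gram, weight in sorted(first.items(), key=lambda kv: -kv[1]):
    print(f"{gram:12s} {weight:.3f}")
```

In this toy corpus the bigram "free entry" appears in only one message, so it gets a higher weight than the unigram "free", which appears in two.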

In [120]:
# Import TfidfVectorizer to calculate tf-idf statistics. I use ngram_range=(1, 2) to take unigrams and bigrams
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
n_grams = vectorizer.fit_transform(text_messages)
print(n_grams)

  (0, 3447)	0.19481166213785306
  (0, 17542)	0.19481166213785306
  (0, 42108)	0.19481166213785306
  (0, 9517)	0.19481166213785306
  (0, 7585)	0.19481166213785306
  (0, 23683)	0.19481166213785306
  (0, 48759)	0.19481166213785306
  (0, 17853)	0.19481166213785306
  (0, 7591)	0.19481166213785306
  (0, 21112)	0.19481166213785306
  (0, 31375)	0.18589385203479303
  (0, 5433)	0.19481166213785306
  (0, 10748)	0.19481166213785306
  (0, 33489)	0.19481166213785306
  (0, 22943)	0.19481166213785306
  (0, 45051)	0.19481166213785306
  (0, 17168)	0.19481166213785306
  (0, 46559)	0.1089915740494989
  (0, 3446)	0.19481166213785306
  (0, 17535)	0.09113128303211691
  (0, 42067)	0.09330658180194616
  (0, 9510)	0.1643214680072348
  (0, 7584)	0.18589385203479303
  (0, 23682)	0.1643214680072348
  (0, 48749)	0.13756803769805487
  :	:
  (5167, 11906)	0.1009052492406862
  (5167, 28961)	0.11421857769983029
  (5167, 6034)	0.07814420589252409
  (5167, 15387)	0.13984547406511025
  (5167, 41209)	0.05632770674862126
  

In [121]:
# Check the shape of the final data
n_grams.shape

(5169, 50506)

##**Build Models**

I will build several models to see which one predicts best. First, I will try an SVM model. I will run a grid search over a few parameter values to find the best ones.
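
To give a sense of what the grid search costs, the sketch below enumerates the candidate settings for the same small grid used in the SVM cells that follow. It only counts combinations and does not train anything; the names `combinations` and `total_cv_fits` are mine.

```python
from itertools import product

# Same grid as in the SVM cell
param_grid = {'C': [1, 2, 3],
              'kernel': ['linear', 'sigmoid']}
cv_folds = 5

# Every combination of parameter values is one candidate model
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

# Each candidate is trained once per cross-validation fold
total_cv_fits = len(combinations) * cv_folds

for candidate in combinations:
    print(candidate)
print("Cross-validation fits:", total_cv_fits)
```

The number of fits grows multiplicatively with every parameter added to the grid, which is why large grids quickly become slow.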

In [122]:
# Import GridSearchCV and SVM
from sklearn.model_selection import GridSearchCV
from sklearn import svm

In [123]:
# Make the parameters, SVM, and the model
# Parameters
param_grid = {'C': [1, 2, 3],
              'kernel': [ 'linear', 'sigmoid']}
# Make the SVM
svm = svm.SVC()

# Model
grid = GridSearchCV(estimator=svm, param_grid=param_grid, scoring='f1', cv=5)

In [124]:
# Fit the model and find the best parameters
# Fit Model
grid.fit(n_grams, df['spam'])

# Get the best parameters
print("Best Parameters: ", grid.best_params_)

Best Parameters:  {'C': 2, 'kernel': 'sigmoid'}


In [125]:
# Make SVM with the best parameters
from sklearn import svm
svm = svm.SVC(C = 2, kernel = 'sigmoid')

In [126]:
# Run SVM and get the scores for 5-fold cross validation
scores = cross_val_score(
    estimator=svm,
    X=n_grams,
    y=df['spam'],
    cv=5
)
scores

array([0.98065764, 0.98259188, 0.98162476, 0.97969052, 0.98257502])

In [127]:
# Take the average of the scores
scores.mean()*100

98.14279642213155

With SVM I get around 98% accuracy. Note again that the data are not well balanced: there are 4516 ham and 653 spam messages, which means that by labeling all the messages as ham I would get about 87% accuracy. I will make the confusion matrix to see which class is predicted incorrectly.

In [128]:
# Split into train test 80/20
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(n_grams, df['spam'], test_size = 0.2, random_state = 123)

In [129]:
# Fit model on train dataset
svm1 = svm.fit(X_train, y_train)

In [130]:
# Make the confusion matrix
from sklearn.metrics import confusion_matrix
svm_pred = svm1.predict(X_test)
print(confusion_matrix(y_test, svm_pred))

[[909   2]
 [ 17 106]]


From the matrix above (rows are the true classes, columns are the predicted classes) I can see that the errors are mainly False Negatives: 17 spam messages are predicted to be ham, while only 2 ham messages are predicted to be spam. Next, I will try a random forest model to see how well it predicts.
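
For reference, the cells of the SVM confusion matrix can be unpacked into the usual metrics. The sketch below assumes sklearn's layout (rows are true labels, columns are predicted labels, with 0 = ham and 1 = spam) and uses the numbers printed above; the variable names are mine.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
matrix = [[909, 2],
          [17, 106]]
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"false positives (ham flagged as spam): {fp}")
print(f"false negatives (spam passed as ham):  {fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy looks high because ham dominates the test set, while recall on the spam class shows how much spam still gets through.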

In [131]:
# Import Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

In [132]:
# Make the parameters, RF, and the model
params= {#'bootstrap': [True, False],
         'max_depth': [100],
         'n_estimators': [300, 400]}

# Make the RF
RandomForestClassifier = RandomForestClassifier()

# Model
gridRF = GridSearchCV(estimator=RandomForestClassifier, param_grid=params, scoring='f1', cv=5)

In [133]:
# Fit the model and find the best parameters
# Fit Model
gridRF.fit(n_grams, df['spam'])

# Get the best parameters
print("Best Parameters: ", gridRF.best_params_)

Best Parameters:  {'max_depth': 100, 'n_estimators': 300}


In [134]:
# Make RF with the best parameters
from sklearn.ensemble import RandomForestClassifier
RandomForestClassifier = RandomForestClassifier(max_depth = 100, n_estimators = 300)

In [135]:
# Run RF and get the scores for 5-fold cross validation
scoresRF = cross_val_score(
    estimator=RandomForestClassifier,
    X=n_grams,
    y=df['spam'],
    cv=5
)
scoresRF

array([0.97388781, 0.97195358, 0.96615087, 0.95841393, 0.96030978])

In [136]:
# Take the average of the scores
scoresRF.mean()*100

96.61431933805315

In [137]:
# Fit model on train dataset
RandomForestClassifier1 = RandomForestClassifier.fit(X_train, y_train)

In [138]:
# Make the confusion matrix
from sklearn.metrics import confusion_matrix
RF_pred = RandomForestClassifier1.predict(X_test)
print(confusion_matrix(y_test, RF_pred))

[[911   0]
 [ 36  87]]


I used grid search for RF as well. However, I did not try many parameters as it takes too long to run; I would recommend trying more parameters if you have more computing power. As a result, I got about 96.6% accuracy (lower than SVM). The confusion matrix shows that False Negatives are again the main cause of errors: 36 spam messages are predicted to be ham, and no ham message is predicted to be spam. Next I will try a relatively simple model, Logistic Regression.

In [139]:
# Import the package
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(random_state=0)

In [140]:
# Conduct 5-fold cross validation and get the scores
scoresLR = cross_val_score(
    estimator=LR,
    X=n_grams,
    y=df['spam'],
    cv=5
)
scoresLR

array([0.93617021, 0.92553191, 0.93617021, 0.92843327, 0.93223621])

In [141]:
# Calculate the average score for LR
scoresLR.mean()*100

93.17083629023651

As you can see, this relatively simple model predicts less well (about 93% accuracy), so next I will try a more advanced model: XGBoost.

In [142]:
# Import XGBoost
import xgboost as xgb

In [143]:
# Make the parameters, XGBoost, and the model
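# Parameters
# Note: the key 'learning rate' contains a space; XGBoost's parameter is named learning_rate,
# so this entry most likely does not vary the learning rate during the search
param_gridxgb = {'learning rate': [0.1, 0.5],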
              'n_estimators': [400],
              'subsample': [0.75, 0.85],
              'colsample_bytree':[0.5, 0.7],
              'reg_alpha':[0.001, 0.003]}

# XGBoost
xgbm = xgb.XGBClassifier()

# Model
grid_xgb = GridSearchCV(estimator=xgbm, param_grid=param_gridxgb, scoring='f1', cv=5)

In [144]:
# Fit the model and find the best parameters
# Fit Model
grid_xgb.fit(n_grams, df['spam'])

# Get the best parameters
print("Best Parameters: ", grid_xgb.best_params_)

Best Parameters:  {'colsample_bytree': 0.7, 'learning rate': 0.1, 'n_estimators': 400, 'reg_alpha': 0.003, 'subsample': 0.85}


In [146]:
# Make XGBoost with the best parameters
import xgboost as xgb
spam_model_xgboost = xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 400, subsample = 0.85, colsample_bytree=0.7, reg_alpha=0.003)

In [147]:
# Run XGBoost and get the scores for 5-fold cross validation
scoresXGB = cross_val_score(
    estimator=spam_model_xgboost,
    X=n_grams,
    y=df['spam'],
    cv=5
)
scoresXGB

array([0.9787234 , 0.97969052, 0.98162476, 0.9729207 , 0.9767667 ])

In [148]:
# Take the average of the scores
scoresXGB.mean()*100

97.79452159959256

In [149]:
# Fit model on train dataset
spam_model_xgboost1 = spam_model_xgboost.fit(X_train, y_train)

In [150]:
# Make the confusion matrix
from sklearn.metrics import confusion_matrix
xgb_pred = spam_model_xgboost1.predict(X_test)
print(confusion_matrix(y_test, xgb_pred))

[[909   2]
 [ 19 104]]


Overall, after trying XGBoost, Logistic Regression, Random Forest, and SVM, I found that the SVM model has the highest accuracy (around 98%). The errors are mainly False Negatives, where the algorithm predicts messages to be ham even though they are spam. One possible reason why SVM performs better than XGBoost is the low number of observations (5,169 in total).
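
To make the comparison concrete, the three confusion matrices printed above can be summarized side by side. This is a small sketch that assumes sklearn's layout (rows are true labels, columns are predictions, 1 = spam); Logistic Regression is left out because no confusion matrix was computed for it, and the names `confusion`, `summarize`, and `summary` are mine.

```python
# Confusion matrices copied from the outputs above (same 80/20 test split)
confusion = {
    "SVM": [[909, 2], [17, 106]],
    "Random Forest": [[911, 0], [36, 87]],
    "XGBoost": [[909, 2], [19, 104]],
}

def summarize(matrix):
    """Return false positives, false negatives, and recall on the spam class."""
    (tn, fp), (fn, tp) = matrix
    return {"fp": fp, "fn": fn, "spam_recall": tp / (tp + fn)}

summary = {name: summarize(m) for name, m in confusion.items()}

for name, stats in summary.items():
    print(f"{name:14s} FP={stats['fp']:2d} FN={stats['fn']:2d} "
          f"spam recall={stats['spam_recall']:.3f}")

best_recall_model = max(summary, key=lambda name: summary[name]["spam_recall"])
print("Highest spam recall:", best_recall_model)
```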