# SMS Spam Prediction

The dataset used was taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset). It contains 5572 English SMS messages, each tagged as ham (legitimate) or spam.

The goal of this project is to build a model that will classify SMS messages as ham or spam.

We will train a machine learning model and use it inside a web application that lets users enter a new SMS message and see whether it is predicted to be ham or spam.

In [50]:
# importing dependencies

%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

## 1. Machine Learning Approach

### 1.1 Data Exploration

In [51]:
df = pd.read_csv('spam.csv', encoding='latin-1')

In [52]:
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [53]:
df.shape

(5572, 5)

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [55]:
df.describe(include='all')

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| count | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top | ham | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | MK17 92H. 450Ppw 16" | GNT:-)" |
| freq | 4825 | 30 | 3 | 2 | 2 |


In [56]:
# dropping unnecessary columns

df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

In [57]:
df.v1.value_counts()

v1
ham     4825
spam     747
Name: count, dtype: int64

In [58]:
# check for null values

df.isnull().sum()

v1    0
v2    0
dtype: int64

In [59]:
# rename columns

df.rename(columns={'v1':'label', 'v2':'message'}, inplace=True)

X = df['message']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
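
The classes are imbalanced (4825 ham versus 747 spam), so a plain random split can leave the test set with a somewhat different spam share than the training set. As a hedged alternative to the split above, not what this notebook does, `train_test_split` accepts a `stratify` argument that keeps the class ratio roughly equal in both parts. The sketch below uses a small made-up frame (`toy`) rather than the real messages.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with roughly the same ham/spam ratio as the real dataset
toy = pd.DataFrame({
    'message': [f'msg {i}' for i in range(100)],
    'label': ['spam'] * 13 + ['ham'] * 87,
})

# stratify=... asks for the same label proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['message'], toy['label'],
    test_size=0.2, random_state=1234, stratify=toy['label'])

# Compare the class shares of both parts
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Whether stratification changes the reported scores here is not something this notebook measures; it mainly makes the evaluation less dependent on the random seed.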

In [60]:
# preprocessing sms

def text_preprocessing(text, language, minWordSize):
	# replace every non-letter character with a space
	text = re.sub('[^a-zA-Z]', ' ', text)

	# convert to lower case and split into words
	words = text.lower().split()

	# remove stop words, but keep the negations 'not' and 'no'
	# ("n't" can never match here: the apostrophe was already replaced by a space above)
	stop_words = set(stopwords.words(language))
	whitelist = ["n't", "not", "no"]
	words = [w for w in words if w not in stop_words or w in whitelist]

	# stem the remaining words
	stemmer = SnowballStemmer(language)
	words = [stemmer.stem(w) for w in words]

	# remove stemmed words shorter than minWordSize
	words = [w for w in words if len(w) >= minWordSize]

	return ' '.join(words)
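
One detail of `text_preprocessing` is easy to miss: because non-letters are replaced with spaces before anything else happens, contractions are split at the apostrophe. The following minimal sketch reproduces only the first two steps of the function (the helper name `strip_and_split` is made up for illustration) to show what the later steps actually receive.

```python
import re

def strip_and_split(text):
    # Same first steps as text_preprocessing: non-letters become spaces, then lowercase and split
    return re.sub('[^a-zA-Z]', ' ', text).lower().split()

tokens = strip_and_split("Don't call 2 WIN!!")
print(tokens)  # ['don', 't', 'call', 'win']
```

The token `"n't"` never appears, so the `"n't"` entry in the whitelist has no effect, and the stray `t` is removed later by the minimum word length filter.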

In [61]:
nltk.download('stopwords')
language = 'english'
minWordLength = 2

# apply the preprocessing to every message; assigning the result back avoids
# writing element by element into a slice of the original DataFrame
X_train = X_train.apply(text_preprocessing, args=(language, minWordLength))
X_test = X_test.apply(text_preprocessing, args=(language, minWordLength))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\denis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [62]:
X_train_clean = X_train.copy()
X_test_clean = X_test.copy()

In [63]:
import pickle

# Convert to bag-of-words representation
count_vect = CountVectorizer()
X_train_bag_of_words = count_vect.fit_transform(X_train)
X_test_bag_of_words = count_vect.transform(X_test)

# Save the CountVectorizer object
with open('count_vectorizer.pkl', 'wb') as f:
    pickle.dump(count_vect, f)

# Apply TF-IDF transformation
tfidf_transformer = TfidfTransformer()
X_train_tf = tfidf_transformer.fit_transform(X_train_bag_of_words)
X_test_tf = tfidf_transformer.transform(X_test_bag_of_words)

# Save the TfidfTransformer object
with open('tfidf_transformer.pkl', 'wb') as f:
    pickle.dump(tfidf_transformer, f)
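
To make the two transformations above less of a black box, here is a small pure-Python sketch of what they compute, assuming scikit-learn's default settings for `TfidfTransformer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The three documents are made up and stand in for already-preprocessed messages.

```python
import math
from collections import Counter

# Hypothetical, already-preprocessed messages
docs = ["free entry win", "call later", "win free prize free"]

# Bag of words: one count per vocabulary term per document
vocab = sorted({w for d in docs for w in d.split()})
counts = [Counter(d.split()) for d in docs]

# Smoothed inverse document frequency (assumed to mirror the scikit-learn default)
n_docs = len(docs)
df = {t: sum(1 for c in counts if t in c) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(c):
    # term count times idf, then scale the row to unit Euclidean length
    raw = [c[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

tfidf = [tfidf_row(c) for c in counts]
for doc, row in zip(docs, tfidf):
    print(doc, [round(v, 3) for v in row])
```

Terms that occur in many messages get a smaller IDF weight, so frequent but uninformative words contribute less to the feature vector than rare ones.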

In [64]:
model = LogisticRegression()
parameters = [{'C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000,10000, 100000]}]

grid_search = GridSearchCV(estimator = model, 
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 4,
                           n_jobs = -1)

grid_search = grid_search.fit(X_train_tf, y_train)

best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

print('Best accuracy: ', best_accuracy)
print('Best parameters:', best_parameters)

Best accuracy:  0.979134496944715
Best parameters: {'C': 1000}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
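
The warning above says the solver stopped at its iteration limit for at least one fit. One commonly used remedy, which the warning itself suggests, is to raise `max_iter` on `LogisticRegression`. The sketch below shows the idea on synthetic data (`X_toy`, `y_toy` are made up); the value 1000 is an assumption rather than a tuned setting, and the scores reported in this notebook were obtained without it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data standing in for the TF-IDF features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

# Same grid-search setup as above, but with a higher iteration limit
model = LogisticRegression(max_iter=1000)
grid = GridSearchCV(estimator=model,
                    param_grid={'C': [0.1, 1, 10]},
                    scoring='accuracy',
                    cv=4)
grid.fit(X_toy, y_toy)

print('Best parameters:', grid.best_params_)
print('Best accuracy: ', grid.best_score_)
```

Very large values of `C` mean very weak regularisation, which tends to need more iterations, so the warning is most likely to come from the upper end of the grid.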


In [65]:
# predict the test set results
y_pred = grid_search.predict(X_test_tf)

# accuracy score
print('Accuracy: ', accuracy_score(y_test, y_pred))

# classification report
print(classification_report(y_test, y_pred))

Accuracy:  0.9883408071748879
              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       943
        spam       1.00      0.92      0.96       172

    accuracy                           0.99      1115
   macro avg       0.99      0.96      0.98      1115
weighted avg       0.99      0.99      0.99      1115
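
The per-class numbers in the report are derived from counts of true positives, false positives and false negatives. The sketch below shows the arithmetic with hypothetical counts; they are not the confusion matrix of this model, which the notebook does not print.

```python
def precision_recall_f1(tp, fp, fn):
    # precision: share of predicted positives that are correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # recall: share of actual positives that were found
    recall = tp / (tp + fn) if tp + fn else 0.0
    # f1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for the 'spam' class
p, r, f1 = precision_recall_f1(tp=90, fp=2, fn=10)
print(round(p, 2), round(r, 2), round(f1, 2))
```

For a spam filter, recall on the spam class tells how much spam gets through, while precision tells how often a legitimate message would be flagged by mistake.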



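The model classifies SMS messages with a mean cross-validated accuracy of 97.9% on the training data and an accuracy of 98.8% on the held-out test data. On the test set it finds 92% of the spam messages (recall), with a spam precision that rounds to 1.00. We will now save this model and use it in a web application.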

In [66]:
# export the model

import pickle
with open('spam_classifier.pkl', 'wb') as f:
	pickle.dump(grid_search, f)
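
At prediction time the web application has to apply the saved objects in the same order as during training: count vectorizer, TF-IDF transformer, then the classifier. The sketch below illustrates that flow under stated assumptions: the toy messages and labels are made up, the objects are round-tripped through `pickle` in memory instead of the `.pkl` files, and `predict_message` is a hypothetical helper name, not part of the original project.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical, already-preprocessed toy messages standing in for the training data
toy_messages = ['free entri win prize', 'call later', 'win cash prize free',
                'see lunch later', 'free cash call', 'ok meet lunch']
toy_labels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']

# Fit the three objects the same way the notebook does
count_vect = CountVectorizer()
tfidf = TfidfTransformer()
features = tfidf.fit_transform(count_vect.fit_transform(toy_messages))
clf = LogisticRegression(max_iter=1000).fit(features, toy_labels)

# In-memory pickle round trip, standing in for saving to and loading from disk
count_vect, tfidf, clf = (pickle.loads(pickle.dumps(obj))
                          for obj in (count_vect, tfidf, clf))

def predict_message(clean_text):
    """Classify one preprocessed message with the restored objects."""
    counts = count_vect.transform([clean_text])  # transform only, never fit, at prediction time
    return clf.predict(tfidf.transform(counts))[0]

print(predict_message('free prize'))
```

In the real application, a raw message would first need to go through `text_preprocessing` with the same `language` and minimum word length used for training, otherwise the vocabulary will not match.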