## Introduction
In this notebook I work with the SMS Spam Collection, a tab-separated text file from the UCI Machine Learning Repository that contains SMS messages labeled as ham or spam. My goal is to build a model that predicts, as accurately as possible, whether a message is ham or spam. The project covers data examination and manipulation, vectorization, and building two different classification models. At the end I compare whether a Naive Bayes or a Logistic Regression model gives better results on the test data, and I use the confusion matrix to find the misclassified messages and get ideas about which words, numbers, etc. led the model to make wrong decisions.

Data : https://archive.ics.uci.edu/dataset/228/sms+spam+collection

In [60]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

## Loading, labeling and splitting the data

In [11]:
data = pd.read_table('sms.tsv', header=None, names=['label', 'message'])
data.sample(5)

,label,message
13,ham,I've been searching for the right words to tha...
3095,ham,Have you emigrated or something? Ok maybe 5.30...
4082,ham,Hurry home. Soup is DONE!
1851,ham,Then cant get da laptop? My matric card wif ü ...
3395,ham,Bull. Your plan was to go floating off to IKEA...


In [17]:
# Encode the labels as numbers: ham -> 0, spam -> 1
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
data.shape

(5572, 2)

In [20]:
data.isna().any()

label      False
message    False
dtype: bool

In [25]:
# Defining X and y for later use
X = data.message
y = data.label

print(X.shape)    # Examining the length and dimensions of X and y.
y.shape           # CountVectorizer needs 1-dimensional input, so in this case we don't have to modify anything.

(5572,)


(5572,)

In [31]:
# Splitting up the data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape)
print(y_train.shape)    # Checking if we got the same number of rows in X and y


(4179,)
(4179,)
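The row counts above follow from the default `test_size` of 0.25 in `train_test_split`. A minimal sketch on placeholder data (the integer list only stands in for the messages; `items` and `labels` are hypothetical names) that should reproduce the same split sizes:

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 5572 dummy items standing in for the messages and labels.
items = list(range(5572))
labels = [i % 2 for i in items]

# With no test_size given, scikit-learn holds out 25% of the rows for testing.
items_train, items_test, labels_train, labels_test = train_test_split(
    items, labels, random_state=1
)

print(len(items_train), len(items_test))
```

Passing `stratify=y` would additionally keep the ham/spam proportions equal in both parts, which may be worth considering for an imbalanced dataset like this one.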


## Vectorizing our data

In [37]:
vect = CountVectorizer()

X_train_dtm = vect.fit_transform(X_train)    # Learning the vocabulary and transforming the text to a document-term matrix

X_train_dtm                                  # We got a 4179x7456 matrix, which means there were 7456 unique tokens in the 4179 training messages.

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [39]:
# Transforming our testing data into a document-term matrix
# Here we only transform and do not fit again, because refitting would replace the vocabulary learned from the training data

X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>

## Naive Bayes Model

In [45]:
# Multinomial Naive Bayes is well suited to classification with discrete features such as word counts, so we will try it first.

nb = MultinomialNB()

nb.fit(X_train_dtm, y_train)            # Training our model

y_pred = nb.predict(X_test_dtm)         # Making our predictions on the test data

metrics.accuracy_score(y_test, y_pred)  # Checking the accuracy of our model

# The resulting accuracy is about 98.9%, meaning the model predicted the right class for that share of the test messages.
# A likely reason for the high accuracy is that spam and ham messages use clearly different vocabulary, with little ambiguous phrasing.

0.9885139985642498

In [50]:
# With the help of the confusion matrix, let's see how many messages caused problems for our model

metrics.confusion_matrix(y_test, y_pred)

# Of the 185 spam messages in the test set, the model correctly flagged 174 and missed 11 (false negatives).
# It also flagged 5 of the 1208 ham messages as spam (false positives).

array([[1203,    5],
       [  11,  174]])
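The accuracy, precision and recall for the spam class can be recomputed directly from the four cells of this matrix. A short sketch, with the values copied from the output above (the variable names are only illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are the true classes, columns are the predicted classes.
cm = np.array([[1203, 5],
               [11, 174]])

# Flatten row by row: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy matches the value returned by `metrics.accuracy_score`, and the precision and recall round to the 0.97 and 0.94 reported for class 1 in the classification report.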

In [54]:
# Let's look at the 5 ham messages that were wrongly predicted as spam (false positives)
X_test[(y_test == 0) & (y_pred == 1)]

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [70]:
# Calculate AUC
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]

metrics.roc_auc_score(y_test, y_pred_prob)  # The AUC of about 0.987 means that a randomly chosen spam message gets a higher predicted spam probability than a randomly chosen ham message about 98.7% of the time, which is a good value.

0.9866431000536962
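AUC can be read as a ranking measure: the share of (spam, ham) pairs in which the spam message receives the higher score. The sketch below checks this reading on a few made-up labels and scores (hypothetical values, not from the dataset) by comparing a manual pairwise count with `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (hypothetical values, not from the dataset).
y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.8, 0.35, 0.9])

pos = scores[y_true == 1]   # scores of the spam examples
neg = scores[y_true == 0]   # scores of the ham examples

# Compare every (spam, ham) pair: a win when spam scores higher, half a win for ties.
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
pairwise_auc = wins / (len(pos) * len(neg))

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Both numbers agree, which is why AUC depends only on how the predicted probabilities order the messages and not on the 0.5 decision threshold.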

In [59]:
print(metrics.classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1208
           1       0.97      0.94      0.96       185

    accuracy                           0.99      1393
   macro avg       0.98      0.97      0.97      1393
weighted avg       0.99      0.99      0.99      1393



## Logistic Regression

In [64]:
logreg = LogisticRegression()  # Object from LogisticRegression class

logreg.fit(X_train_dtm, y_train)  # Training our model

y_pred_class = logreg.predict(X_test_dtm)  # Making predictions on the testing data

In [65]:
metrics.accuracy_score(y_test, y_pred_class)

# The accuracy of this model is only slightly lower than that of Naive Bayes

0.9877961234745154

In [71]:
# AUC Score

y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]

metrics.roc_auc_score(y_test, y_pred_prob)
# The AUC score of the Logistic Regression model (about 0.994) is slightly higher than that of Naive Bayes (about 0.987).
# There are no large differences between the two models' predictions; both do well on this task.

0.9936280651512441

In [75]:
print(metrics.classification_report(y_test, y_pred_class))


              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1208
           1       0.99      0.91      0.95       185

    accuracy                           0.99      1393
   macro avg       0.99      0.96      0.97      1393
weighted avg       0.99      0.99      0.99      1393



## Getting information for tuning our model

In [82]:
X_train_tokens = vect.get_feature_names_out()   # Getting our feature names as an array

X_train_tokens

array(['00', '000', '008704050406', ..., 'zyada', 'èn', '〨ud'],
      dtype=object)

In [83]:
nb.feature_count_  # This fitted attribute holds how many times each token appeared in ham (row 0) and in spam (row 1).

array([[ 0.,  0.,  0., ...,  1.,  1.,  1.],
       [ 5., 23.,  2., ...,  0.,  0.,  0.]])

In [94]:
# By creating a pandas DataFrame from this information we can see which tokens appear most often in spam messages.

ham_token_count = nb.feature_count_[0, :]
spam_token_count = nb.feature_count_[1, :]

tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')

tokens.sort_values(by='spam', ascending=False)

token,ham,spam
to,1161.0,509.0
call,172.0,271.0
you,1442.0,218.0
your,310.0,198.0
for,360.0,158.0
...,...,...
happening,7.0,0.0
happened,13.0,0.0
happend,3.0,0.0
happen,16.0,0.0


In [None]:
# Using this DataFrame we could also locate the exact words that may have caused our model's classification mistakes.
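Raw counts favor common words such as "to" and "you", which are frequent in both classes. One possible next step, sketched here as an assumption rather than as part of the original analysis, is to turn the counts into per-class frequencies and rank tokens by their spam-to-ham ratio. The counts below are copied from the table above; the class sizes `ham_docs` and `spam_docs` are placeholders, and in the notebook they could be read from `nb.class_count_`:

```python
import pandas as pd

# Token counts copied from the table above (a small subset of the rows).
tokens_demo = pd.DataFrame(
    {"ham": [1161.0, 172.0, 1442.0, 16.0], "spam": [509.0, 271.0, 218.0, 0.0]},
    index=pd.Index(["to", "call", "you", "happen"], name="token"),
)

# Placeholder class sizes (assumed values); in the notebook use nb.class_count_.
ham_docs, spam_docs = 3600.0, 560.0

# Add 1 to every count so tokens with zero occurrences do not produce a zero ratio,
# then convert the counts to per-class frequencies.
ham_freq = (tokens_demo["ham"] + 1) / ham_docs
spam_freq = (tokens_demo["spam"] + 1) / spam_docs

# A ratio above 1 means the token is relatively more common in spam than in ham.
tokens_demo["spam_ratio"] = spam_freq / ham_freq
print(tokens_demo.sort_values("spam_ratio", ascending=False))
```

With these placeholder class sizes, "call" ranks above "to" even though "to" has the higher raw spam count, because "to" is also very common in ham.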