# Hometask 2. Naive Bayes algorithm for spam classification.
*Author: Marina Talianskaia*


In [400]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
import scipy.sparse
from sklearn.model_selection import cross_validate

## 1. Preparing data

In [401]:
data = pd.read_csv(r'C:\For Python\SMSSpamCollection.csv', sep='\t', names=['label', 'message'])
data.head(7)

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |


In [402]:
print('Number of columns', data.shape[1], ', number of rows', data.shape[0])
print('Number of observations in each class:')
data['label'].value_counts()

Number of columns 2 , number of rows 5572
Number of observations in each class:


ham     4825
spam     747
Name: label, dtype: int64

###### Quality metric comment:
The classes are of very different sizes (4825 ham vs. 747 spam), so the sample is imbalanced. Accuracy is therefore a poor metric for evaluating the model, because it is biased toward the larger class. I would choose the F1-score instead, as it combines both Precision and Recall.
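
To make this concrete, here is a small sketch that uses only the class counts printed above. The "always ham" baseline is a hypothetical classifier introduced purely for illustration; it is not part of the homework.

```python
# Class counts taken from the value_counts() output above.
n_ham, n_spam = 4825, 747

# Hypothetical baseline: label every message as ham (0), never as spam (1).
tp = 0        # spam correctly flagged as spam
fp = 0        # ham wrongly flagged as spam
fn = n_spam   # spam that was missed
tn = n_ham    # ham correctly passed through

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy = {accuracy:.3f}")
print(f"spam F1  = {f1:.3f}")
```

The baseline never finds a single spam message, yet its accuracy is roughly 0.87, while its F1-score for the spam class is 0.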

In [403]:
#Re-labeling the classes: 0 for 'ham' and 1 for 'spam'
data['label'] = data.label.map({'spam':1,'ham':0})

In [404]:
#Splitting into training and test samples:
X_train, X_test, y_train, y_test = train_test_split(data['message'],data['label'],random_state=53)

In [405]:
#Data processing via Vectorizer
vector = CountVectorizer(lowercase=True, stop_words='english')
X_train_trans = vector.fit_transform(X_train)
X_test_trans = vector.transform(X_test)
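
As a rough illustration of what the fit/transform steps produce, here is a pure-Python sketch of bag-of-words counting. It is a simplification and not how `CountVectorizer` is implemented: it splits on whitespace and skips stop-word removal, and the toy messages are made up.

```python
from collections import Counter

# Toy corpus (made up for illustration; not taken from the SMS dataset).
train_docs = ["Free entry win prize", "are you free tonight", "win win win"]

# "fit": build a sorted vocabulary from the lowercased training tokens.
vocabulary = sorted({tok for doc in train_docs for tok in doc.lower().split()})
index = {tok: j for j, tok in enumerate(vocabulary)}

def to_counts(doc):
    """'transform': count known tokens; tokens outside the vocabulary are dropped."""
    row = [0] * len(vocabulary)
    for tok, n in Counter(doc.lower().split()).items():
        if tok in index:
            row[index[tok]] = n
    return row

print(vocabulary)
print(to_counts("WIN a free prize"))
```

The vocabulary is learned from the training messages only, which is why the test sample above goes through `transform` and not `fit_transform`.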

## 2. 'Manual' Naive Bayes

Bayes' theorem looks like this:
$$P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)}$$

where A is the class of the message (spam or ham) and B is the observed content of the message (its words).

So, we need to find the following variables:
- General probability of any message being spam: P(spam)

$$P(spam) = \frac{Quantity\ of\ spam\ labeled\ messages}{total\ quantity\ of\ messages}$$

- General probability of any message being ham: P(ham)

$$P(ham) = \frac{Quantity\ of\ ham\ labeled\ messages}{total\ quantity\ of\ messages}$$

- Probability of a word appearing in spam messages: P(word|spam)
- Probability of a word appearing in ham messages: P(word|ham)
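
For a message made of the words $word_1, \dots, word_n$, these four quantities combine as follows. This is the standard Naive Bayes form, written out here as a sketch under the "naive" assumption that the words are conditionally independent given the class:

$$P(spam | word_1, \dots, word_n) \propto P(spam) \prod\limits_{i=1}^n P(word_i | spam)$$

$$P(ham | word_1, \dots, word_n) \propto P(ham) \prod\limits_{i=1}^n P(word_i | ham)$$

The denominator of Bayes' theorem, $P(word_1, \dots, word_n)$, is the same for both classes, so it can be dropped when the two are only compared with each other.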

In [406]:
#Probability of a message being a Spam
p_spam = sum(y_train) / len(y_train)
p_spam

0.13208901651112706

In [407]:
#Probability of a message being a Ham
p_ham = 1 - sum(y_train) / len(y_train)
p_ham

0.867910983488873

$$P(word1 | ham) = \frac{qty\ of\ word1\ belonging\ to\ category\ ham}{total\ qty\ of\ words\ belonging\ to\ category\ ham} $$


$$P(word1 | spam) = \frac{qty\ of\ word1\ belonging\ to\ category\ spam}{total\ qty\ of\ words\ belonging\ to\ category\ spam} $$

At the same time, some words that appear in the test sample may never occur in the training messages of one of the classes. Their estimated probability for that class would be zero, which makes the whole product for that class zero. To avoid this, we apply add-one (Laplace) smoothing: 1 is added to every word count and, so that the probabilities still sum to one, the vocabulary size is added to the denominator:

$$P(word1|ham) = \frac{qty\ of\ word1\ belonging\ to\ category\ ham\ + 1}{total\ qty\ of\ words\ in\ ham\ messages\ + vocabulary\ size}$$

$$P(word1|spam) = \frac{qty\ of\ word1\ belonging\ to\ category\ spam\ + 1}{total\ qty\ of\ words\ in\ spam\ messages\ + vocabulary\ size}$$
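
A small sketch of add-one smoothing on made-up counts may help here. The vocabulary, the counts, and the helper name `word_probs` are all hypothetical. Note that the denominator grows by the vocabulary size, which is what summing the "+ 1" frequencies does in the cells that follow.

```python
# Hypothetical 4-word vocabulary and word counts in spam messages.
vocabulary = ["free", "win", "lunch", "tonight"]
spam_counts = [4, 1, 0, 0]

def word_probs(counts, alpha=0):
    """Estimate P(word | class); alpha=1 gives add-one (Laplace) smoothing."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

unsmoothed = word_probs(spam_counts)          # "lunch" gets probability 0
smoothed = word_probs(spam_counts, alpha=1)   # every word gets a non-zero share

for word, raw, smooth in zip(vocabulary, unsmoothed, smoothed):
    print(f"{word:8s} raw={raw:.3f} smoothed={smooth:.3f}")
```

Without smoothing, a single unseen word such as "lunch" would force the spam score of the whole message to zero, whatever the other words say.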

In [408]:
#Positional indices of all spam-labeled messages:
spam_array = np.where(y_train == 1)[0]
#Sparse matrix with all spam messages
spam = X_train_trans.tocsr()[spam_array]
spam

<552x7252 sparse matrix of type '<class 'numpy.int64'>'
	with 8391 stored elements in Compressed Sparse Row format>

In [409]:
#Positional indices of all ham-labeled messages:
ham_array = np.where(y_train == 0)[0]
#Sparse matrix with all ham messages
ham = X_train_trans.tocsr()[ham_array]
ham


<3627x7252 sparse matrix of type '<class 'numpy.int64'>'
	with 24375 stored elements in Compressed Sparse Row format>

In [410]:
#How many times each word appears in ham-labeled messages, plus 1 for smoothing (as an array)
ham_freq = ham.toarray().sum(axis=0) + 1

In [411]:
#How many times each word appears in spam-labeled messages, plus 1 for smoothing (as an array)
spam_freq = spam.toarray().sum(axis=0) + 1

In [412]:
#Getting the probability of each word given that the message is ham:
ham_sum = sum(ham_freq)
ham_prob = ham_freq/ham_sum
ham_prob

array([3.04757261e-05, 3.04757261e-05, 6.09514522e-05, ...,
       6.09514522e-05, 3.04757261e-05, 6.09514522e-05])

In [413]:
#Getting the probability of each word given that the message is spam:
spam_sum = sum(spam_freq)
spam_prob = spam_freq/spam_sum
spam_prob

array([4.98038972e-04, 1.55637179e-03, 6.22548714e-05, ...,
       6.22548714e-05, 1.24509743e-04, 6.22548714e-05])

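After that, let's take the logarithm of both sides of the equation, which turns the product of probabilities into a sum. The term $-log(P(bodyText))$ is the same for both classes and is omitted; the formula for spam is analogous:
$$log(P(ham | bodyText)) = log(P(ham)) + \sum\limits_{i=1}^n log(P(word_i | ham))$$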
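
Besides being convenient, the logarithm also protects against floating-point underflow. The sketch below uses made-up numbers (a hypothetical 100-word message in which every word has probability 1e-5) to show the difference.

```python
import math

# Hypothetical message: 100 words, each with likelihood 1e-5 under one class.
word_probs = [1e-5] * 100
prior = 0.87

# Direct product: the true value (about 1e-500) is too small for a float.
product = prior
for p in word_probs:
    product *= p

# Sum of logarithms: the same quantity on a log scale stays representable.
log_score = math.log(prior) + sum(math.log(p) for p in word_probs)

print(product)
print(log_score)
```

The direct product collapses to 0.0 for both classes, so they could no longer be compared, while the log score remains a finite negative number.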

In [414]:
#Decision rule: return 'ham' if log(P(ham | bodyText)) >= log(P(spam | bodyText)), otherwise return 'spam'

In [420]:
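def spam_or_ham(row):
    #Classifies one vectorized message (a 1 x n_words sparse row): 0 = ham, 1 = spam
    prob_ham = np.log(p_ham)
    prob_spam = np.log(p_spam)
    #find() returns the row indices, column indices and values of the non-zero entries
    _, word_ids, counts = scipy.sparse.find(row)
    for word_id, count in zip(word_ids, counts):
        #Each occurrence of a word adds its log-probability once
        prob_ham = prob_ham + np.log(ham_prob[word_id]) * count
        prob_spam = prob_spam + np.log(spam_prob[word_id]) * count

    if prob_ham >= prob_spam:
        return 0
    else:
        return 1

ans = []
for row in X_test_trans:
    ans.append(spam_or_ham(row))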
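
The per-message loop above can also be written as one matrix-vector product per class. The sketch below shows the idea on small dense toy arrays with made-up values; the names mirror the notebook's variables but the numbers do not come from it. The same expression should also work with the sparse `X_test_trans`, though that was not run here.

```python
import numpy as np

# Toy stand-ins (made-up values) for the notebook's priors and word probabilities.
p_ham, p_spam = 0.8, 0.2
ham_prob = np.array([0.5, 0.3, 0.2])
spam_prob = np.array([0.1, 0.2, 0.7])

# Two toy messages as rows of word counts over a 3-word vocabulary.
X = np.array([[2, 1, 0],
              [0, 0, 3]])

# log P(class) + sum over words of count * log P(word | class), for all rows at once.
log_ham = np.log(p_ham) + X @ np.log(ham_prob)
log_spam = np.log(p_spam) + X @ np.log(spam_prob)

# Same decision rule as above: ham (0) wins ties, otherwise spam (1).
pred = (log_spam > log_ham).astype(int)
print(pred)
```

The first toy message is dominated by words that are likely under ham and the second by a word that is likely under spam, so the predictions are 0 and 1.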

In [421]:
print(classification_report(y_test, ans))

             precision    recall  f1-score   support

          0       0.98      0.99      0.99      1198
          1       0.94      0.90      0.92       195

avg / total       0.98      0.98      0.98      1393



## 3. 'Automatic' Naive Bayes algorithm

In [417]:
NB = MultinomialNB()
NB.fit(X_train_trans, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [418]:
print(classification_report(y_test, NB.predict(X_test_trans)))

             precision    recall  f1-score   support

          0       0.98      0.99      0.99      1198
          1       0.94      0.90      0.92       195

avg / total       0.98      0.98      0.98      1393
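
The identical reports are expected, because `MultinomialNB` with `alpha=1.0` uses the same add-one estimate as the manual version. The sketch below checks this on a made-up toy count matrix by comparing manual log-probabilities with the fitted model's `feature_log_prob_` attribute.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (rows = messages, columns = words) and labels; made-up data.
X = np.array([[2, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 2, 3],
              [0, 1, 0, 2]])
y = np.array([0, 0, 1, 1])

# Manual estimate with add-one smoothing, following the steps of section 2.
manual_log_prob = []
for label in (0, 1):
    freq = X[y == label].sum(axis=0) + 1
    manual_log_prob.append(np.log(freq / freq.sum()))
manual_log_prob = np.array(manual_log_prob)

# sklearn estimate with the default smoothing parameter alpha=1.0.
nb = MultinomialNB(alpha=1.0).fit(X, y)

print(np.allclose(manual_log_prob, nb.feature_log_prob_))
```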



## 4. Cross-validation

In [419]:
valid_results = cross_validate(estimator=NB, X=X_train_trans, y=y_train, cv=5, scoring='f1')
print('F1 scores: ', valid_results['test_score'])
print('Average F1 score: ', np.mean(valid_results['test_score']))

F1 scores:  [0.92173913 0.92982456 0.93693694 0.93965517 0.89867841]
Average F1 score:  0.9253668430571876
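
One caveat about the call above: the vectorizer was fitted on the whole training sample before the folds were split, so each validation fold contributed to the vocabulary. A possible alternative, which is not what this notebook did, is to wrap the vectorizer and the classifier in a pipeline so that both are refitted inside every fold. The toy messages below are made up; in the notebook the raw `X_train` and `y_train` would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy raw messages and labels (made up for illustration): 1 = spam, 0 = ham.
texts = ["win a free prize now", "free entry to win cash",
         "claim your free prize", "win cash now",
         "are you coming tonight", "see you at lunch",
         "call me when you are home", "lunch at home tonight"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline receives raw text, so the vocabulary is rebuilt on each training fold.
pipe = make_pipeline(CountVectorizer(lowercase=True, stop_words='english'),
                     MultinomialNB())

results = cross_validate(pipe, texts, labels, cv=2, scoring='f1')
print('F1 scores: ', results['test_score'])
```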


## 5. Comments and results

We see that the 'manual' model performs the same as the sklearn NB classifier, with an overall F1-score of about 0.98 (the avg / total row). At the same time, the minority class (spam) scores lower, with an F1-score of 0.92 and a recall of 0.90, so the sample imbalance shows up in the results, which favour the majority class. In practical terms, the right trade-off depends on what we find more important: catching every spam message, or never letting a good message be filtered as spam.
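
One way to act on this trade-off, sketched here as an assumption and not something done in this homework, is to require the spam log-score to beat the ham log-score by some margin before a message is flagged. The scores, labels, and the helper `precision_recall` below are made up for illustration.

```python
# Hypothetical (log_ham, log_spam, true_label) triples; 1 = spam, 0 = ham.
scored = [(-10.0, -4.0, 1), (-9.0, -8.0, 1), (-7.0, -6.5, 0),
          (-5.0, -9.0, 0), (-6.0, -12.0, 0), (-8.0, -7.0, 1)]

def precision_recall(margin):
    """Spam precision and recall when spam must win by more than `margin`."""
    tp = fp = fn = 0
    for log_ham, log_spam, label in scored:
        pred = 1 if (log_spam - log_ham) > margin else 0
        tp += pred == 1 and label == 1
        fp += pred == 1 and label == 0
        fn += pred == 0 and label == 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for margin in (0.0, 2.0):
    p, r = precision_recall(margin)
    print(f"margin={margin}: precision={p:.2f}, recall={r:.2f}")
```

With a margin of 0 (the rule used in section 2) the toy data gives precision 0.75 and recall 1.00; with a margin of 2 no good message is flagged, but two of the three spam messages are missed.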


*Note: In the end I was unable to code Step 2 on my own (beyond counting the frequency of each word in the text array), despite understanding the mathematics behind it, so some ideas were taken from the work of my colleagues.*