## Homework №1: Spam detection 
*Bezhenaer Olga*

### Content of the task:
1. Download the SMS spam dataset: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
2. Choose a quality metric and justify the choice;
3. Implement naive Bayes «by hand» for the spam detection task;
4. Choose a measure of test accuracy and justify your choice; perform 5-fold cross-validation for this task;
5. Compare your results with sklearn naive_bayes.

In [1]:
import re

import nltk
import numpy as np
import pandas as pd
import scipy.sparse  # makes scipy.sparse.find available
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

nltk.download("stopwords")
nltk.download("punkt")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Коля\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Коля\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 1.  Data Preprocessing 

In [2]:
#data loading 
data = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])
data.head(n=5)

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


#### Exploratory data analysis 

In [3]:
data.shape

(5572, 2)

In [4]:
data["label"].value_counts()  # the classes are imbalanced: far more ham than spam

ham     4825
spam     747
Name: label, dtype: int64

In [5]:
# relabel: spam is the positive class (1), ham is the negative class (0)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
data.head(n=5)

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


#### Bag of words 

In [6]:
#converting to the lower case 
data['message'] = data['message'].map(lambda x: x.lower())  

In [7]:
# removing punctuation (regex=True is required in recent pandas versions)
data['message'] = data.message.str.replace(r'[^\w\s]', '', regex=True)

In [8]:
# removing digits
data['message'] = data.message.str.replace(r'[0-9]', '', regex=True)

In [9]:
#tokenizing the text into single words
data['message'] = data.message.apply(nltk.word_tokenize)  

In [10]:
# applying word stemming
stemmer = PorterStemmer()
data['message'] = data['message'].apply(lambda x: [stemmer.stem(y) for y in x])  

In [11]:
#checking the results
data['message'].head(10)

0    [go, until, jurong, point, crazi, avail, onli,...
1                         [ok, lar, joke, wif, u, oni]
2    [free, entri, in, a, wkli, comp, to, win, fa, ...
3    [u, dun, say, so, earli, hor, u, c, alreadi, t...
4    [nah, i, dont, think, he, goe, to, usf, he, li...
5    [freemsg, hey, there, darl, it, been, week, no...
6    [even, my, brother, is, not, like, to, speak, ...
7    [as, per, your, request, mell, mell, oru, minn...
8    [winner, as, a, valu, network, custom, you, ha...
9    [had, your, mobil, month, or, more, u, r, enti...
Name: message, dtype: object

In [12]:
data['message'] = data['message'].apply(lambda x: ' '.join(x))

In [13]:
# Creating training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(data['message'],data['label'],random_state=42)

print('Number of rows in the total set: {}'.format(data.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


In [14]:
# transform the data into word-occurrence counts, which become the model's features (stop words are also removed here)
count_vect = CountVectorizer(stop_words='english')  
X_train_new = count_vect.fit_transform(X_train)  

X_test_new = count_vect.transform(X_test)
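
To make the bag-of-words step concrete, here is a minimal plain-Python sketch of what a count matrix looks like. It is a simplification and not what `CountVectorizer` does internally: the toy messages and the helper `to_counts` are hypothetical, tokens are split on whitespace only, and no stop words are removed.

```python
from collections import Counter

# Hypothetical toy messages, already lower-cased and cleaned.
train_messages = ["free entri win", "ok see you", "win free prize free"]

# Vocabulary: every distinct token seen in training, sorted so that each
# word gets a stable column index.
vocabulary = sorted({token for msg in train_messages for token in msg.split()})
column = {word: j for j, word in enumerate(vocabulary)}


def to_counts(message):
    """Return the row of word counts for one message; unseen words are ignored."""
    row = [0] * len(vocabulary)
    for word, n in Counter(message.split()).items():
        if word in column:
            row[column[word]] = n
    return row


# One row per message, one column per vocabulary word.
matrix = [to_counts(msg) for msg in train_messages]
print(vocabulary)
print(matrix)
```

As in the cell above, the vocabulary is built from the training messages only, and words that appear only in a test message are dropped when that message is transformed.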

### 2. Metric for quality

As mentioned above, the classes in our data are imbalanced. In that case the "accuracy" score is not trustworthy. The main quality metrics are therefore recall, precision and F1-score. What is important for us?
 - Equality of recall_1 and recall_0 (meaning that the classifier predicts both the positive and the negative class well): the closer recall_1 and recall_0 are to each other, the better the classifier;
 - F1-score: it combines precision and recall and, unlike "accuracy", is far less distorted by class imbalance.
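
The point about accuracy can be illustrated with a small plain-Python sketch. The labels are hypothetical (90 ham, 10 spam) and the "classifier" simply predicts ham for every message; `precision_recall_f1` is an illustrative helper, not a library function.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical imbalanced sample: 90 ham (0) and 10 spam (1).
y_true = [0] * 90 + [1] * 10
# A useless classifier that labels everything as ham.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("spam precision/recall/F1:", precision_recall_f1(y_true, y_pred, positive=1))
print("ham precision/recall/F1:", precision_recall_f1(y_true, y_pred, positive=0))
```

Accuracy is 0.9 even though not a single spam message is caught, while recall_1 and the spam F1 are both 0 and expose the failure immediately.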

### 3. Naive Bayes "by hand"


#### Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
P(A | B) is the probability that a text containing a given word is spam, i.e. P(spam | word).

We need to find these 4 probabilities:
- Overall probability that any given message is spam: P(spam)
- Overall probability that any given message is not spam (is ham): P(ham)
- Probability that a word appears in spam messages: P(word | spam)
- Probability that a word appears in ham messages: P(word | ham)
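
As a worked illustration of how these four probabilities combine for a single word, the sketch below applies Bayes' theorem directly. All numbers are hypothetical and chosen only for the example; `posterior_spam` is an illustrative helper.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) from the two likelihoods and the spam prior."""
    p_ham = 1.0 - p_spam
    # P(word) = P(word | spam) * P(spam) + P(word | ham) * P(ham)
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Hypothetical values: the word is 20 times more likely in spam than in ham,
# but spam itself is rare.
print(posterior_spam(p_word_given_spam=0.02, p_word_given_ham=0.001, p_spam=0.13))
```

Even with a small prior P(spam), a word that is much more frequent in spam than in ham pushes the posterior well above one half.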


#### Step 1. Finding overall probabilities

In [15]:
# Total number of hams and spams
number_of_hams = y_train.value_counts()[0]
number_of_spams = y_train.value_counts()[1]
total = number_of_hams + number_of_spams

P(ham) = number of documents belonging to category ham / total number of documents

P(spam) = number of documents belonging to category spam / total number of documents

In [16]:
# Finding overall probabilities
p_spam= number_of_spams/total #P(spam)
p_ham= number_of_hams/total #P(ham)

print("Probability of spam message",p_spam.round(3) )
print("Probability of usual message",p_ham.round(3) )

Probability of spam message 0.134
Probability of usual message 0.866


#### Step 2. Finding the probability that a word appears in spam/ham messages:

P(word1 | ham) = (count of word1 belonging to category ham + 1) / (total count of words belonging to ham + number of distinct words in the training data set, i.e. our vocabulary)

P(word1 | spam) = (count of word1 belonging to category spam + 1) / (total count of words belonging to spam + number of distinct words in the training data set, i.e. our vocabulary)
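
The two formulas above (add-one, or Laplace, smoothing) can be sketched in plain Python on a toy corpus. The messages and the helper `smoothed_likelihoods` are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def smoothed_likelihoods(messages, vocabulary):
    """Return {word: P(word | class)} with add-one smoothing for one class."""
    counts = Counter(token for msg in messages for token in msg.split())
    total_words = sum(counts[word] for word in vocabulary)
    denominator = total_words + len(vocabulary)
    return {word: (counts[word] + 1) / denominator for word in vocabulary}


# Hypothetical, already preprocessed training messages.
spam_messages = ["free win free", "win prize"]
ham_messages = ["ok see you", "see you free"]

# Vocabulary: distinct words over the whole training set (both classes).
vocabulary = sorted({t for msg in spam_messages + ham_messages for t in msg.split()})

p_word_spam = smoothed_likelihoods(spam_messages, vocabulary)
p_word_ham = smoothed_likelihoods(ham_messages, vocabulary)

print(p_word_spam["free"], p_word_ham["free"])
print(p_word_spam["ok"])  # never seen in spam, yet not zero thanks to the +1
```

Because of the +1, a word that never occurs in a class still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero. The smoothed probabilities of each class still sum to 1.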

*Note: I honestly failed to implement Step 2 and really do not understand how to do it "by hand". That is why the code for Step 2 is copied from the work of my colleague, Alexander Marinskiy.*

In [17]:
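# find probabilities for spam
indices = np.where(y_train == 1)[0]
spam = X_train_new.tocsr()[indices, :]

# per-word counts in spam messages, plus 1 for add-one (Laplace) smoothing
frequency_spam = spam.toarray().sum(axis=0) + 1
# dividing by the smoothed total gives P(word | spam)
probability_spam = frequency_spam / frequency_spam.sum()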

In [18]:
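# find probabilities for ham
indices = np.where(y_train == 0)[0]
ham = X_train_new.tocsr()[indices, :]

# per-word counts in ham messages, plus 1 for add-one (Laplace) smoothing
frequency_ham = ham.toarray().sum(axis=0) + 1
# dividing by the smoothed total gives P(word | ham)
probability_ham = frequency_ham / frequency_ham.sum()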

log(P(ham | bodyText)) = log(P(ham)) + log(P(bodyText | ham))
= log(P(ham)) + log(P(word1 | ham)) + log(P(word2 | ham)) …
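
The log form above can be sketched in plain Python as follows. The priors and word likelihoods are hypothetical numbers, and `classify` is an illustrative helper rather than the function used in this notebook. Summing logarithms instead of multiplying many small probabilities avoids numerical underflow on long messages.

```python
import math

# Hypothetical priors and smoothed word likelihoods for a tiny vocabulary.
log_prior = {"ham": math.log(0.87), "spam": math.log(0.13)}
log_likelihood = {
    "ham": {"free": math.log(0.01), "win": math.log(0.005), "see": math.log(0.05)},
    "spam": {"free": math.log(0.1), "win": math.log(0.08), "see": math.log(0.005)},
}


def classify(tokens):
    """Return (best_label, scores), where scores[label] = log P(label) + sum of log P(word | label)."""
    scores = {}
    for label in log_prior:
        score = log_prior[label]
        for token in tokens:
            # Words outside the vocabulary are skipped.
            if token in log_likelihood[label]:
                score += log_likelihood[label][token]
        scores[label] = score
    # On a tie, max() keeps the first label in the dict, i.e. "ham".
    return max(scores, key=scores.get), scores


print(classify(["free", "win", "free"]))
print(classify(["see"]))
```

A repeated word contributes its log-likelihood once per occurrence, which is the same as multiplying the log-likelihood by the word count.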

In [19]:
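def spam_or_ham(arr):
    """Classify one sparse row of word counts: return 1 for spam, 0 for ham."""
    prob_ham = np.log(p_ham)
    prob_spam = np.log(p_spam)
    # find() returns (row indices, column indices, values) of the non-zero entries
    arr = scipy.sparse.find(arr)
    for i in range(len(arr[1])):
        # add log P(word | class) once per occurrence of the word
        prob_ham = prob_ham + np.log(probability_ham[arr[1][i]]) * arr[2][i]
        prob_spam = prob_spam + np.log(probability_spam[arr[1][i]]) * arr[2][i]

    # ties are resolved in favor of ham
    if prob_ham >= prob_spam:
        return 0
    else:
        return 1

ans = []
for i in X_test_new:
    ans.append(spam_or_ham(i))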

#### Step 3. Checking the model quality

In [20]:
print(classification_report(y_test, ans))

             precision    recall  f1-score   support

          0       0.99      0.99      0.99      1207
          1       0.95      0.92      0.93       186

avg / total       0.98      0.98      0.98      1393



The average F1-score is high, but the recalls differ noticeably: 0.99 for ham versus 0.92 for spam. The model works in favor of the majority class (0, "ham"), which can be caused by class imbalance.

### 4. Sklearn naive_bayes

In [21]:
#training the model
model = MultinomialNB().fit(X_train_new, y_train)

In [22]:
#testing the model
predictions = model.predict(X_test_new)

In [23]:
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.99      0.99      0.99      1207
          1       0.95      0.92      0.93       186

avg / total       0.98      0.98      0.98      1393



### 5. 5-fold cross validation 

The results of the "by hand" model and sklearn naive_bayes look quite similar, so 5-fold cross-validation was implemented only for the sklearn model. Cross-validation is done first for "accuracy" and then for the "f1" score, in order to compare the results.
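
A minimal plain-Python sketch of how 5-fold cross-validation partitions the samples is given here. The helper `k_fold_indices` is hypothetical and uses contiguous, unshuffled folds. For a classifier with an integer `cv`, scikit-learn's `cross_validate` uses stratified folds instead, so the actual splits behind the scores in this section differ from this simplification.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 4179 is the size of the training set in this notebook.
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(4179, k=5), start=1):
    print("fold", fold, "train:", len(train_idx), "test:", len(test_idx))
```

Each sample is used for testing exactly once and for training in the other four folds; the reported score is the mean over the five test folds.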

In [24]:
from sklearn.model_selection import cross_validate

In [25]:
scores = cross_validate(estimator=model, X=X_train_new, y=y_train, cv=5, scoring='accuracy')
print('Accuracy scores: ', scores['test_score'])
print('Average Accuracy score: ', np.mean(scores['test_score']))

Accuracy scores:  [0.97729988 0.96172249 0.96052632 0.96407186 0.97245509]
Average Accuracy score:  0.9672151260922446


The accuracy is quite high, but look at "F1":

In [26]:
scores = cross_validate(estimator=model, X=X_train_new, y=y_train, cv=5, scoring='f1')
print('F1 scores: ', scores['test_score'])
print('Average F1 score: ', np.mean(scores['test_score']))

F1 scores:  [0.91914894 0.86554622 0.85957447 0.86956522 0.89956332]
Average F1 score:  0.8826796317822622


We see that the average F1 score is noticeably lower than the accuracy. This means that the model does not work as well as we would like, despite the high "accuracy".

### Results:
The quality of the "by hand" model and the sklearn model looks the same (but this is due to the "good job done" by my colleague), and both models show quite good results (f1 = 0.99 for the negative class, f1 = 0.93 for the positive class). The difference between these two F1 scores confirms that class imbalance skews the results in favor of the majority class (ham, the negative class). Cross-validation also confirms the necessity of solving the class imbalance problem (maybe this can be done by downsampling the majority class); better data preprocessing may also be needed.
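
One possible way to downsample the majority class is sketched below in plain Python. This was not run as part of this homework: the helper `downsample_majority` and the toy data are hypothetical, and whether downsampling actually improves F1 on this dataset is untested.

```python
import random


def downsample_majority(messages, labels, majority=0, seed=42):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    keep = rng.sample(majority_idx, k=min(len(minority_idx), len(majority_idx)))
    # Keep the original order of the retained samples.
    selected = sorted(minority_idx + keep)
    return [messages[i] for i in selected], [labels[i] for i in selected]


# Hypothetical toy data: 8 ham (0) and 2 spam (1).
toy_labels = [0] * 8 + [1] * 2
toy_messages = ["msg{}".format(i) for i in range(10)]

balanced_messages, balanced_labels = downsample_majority(toy_messages, toy_labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Downsampling should be applied to the training data only, so that the test set (or each test fold) keeps the original class proportions.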