<a href="https://colab.research.google.com/github/borisevstratov/homeworkHSM/blob/master/nlip_hw2_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HW2: Spam detection with Naive Bayes
*Boris Evstratov*

## Task
* Download the SMS spam dataset: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
* Choose a quality metric and justify the choice.
* Implement Naive Bayes "by hand" for the spam detection task.
* Choose a measure of the test's accuracy and justify your choice; perform 5-fold validation for this task.
* Compare your results with sklearn `naive_bayes`.
* The result is expected as a self-sufficient Jupyter notebook (with all comments, graphs, etc.) in your GitHub within 2 weeks (by the next lecture).

## 1. Exploratory data analysis

In [75]:
import pandas as pd
import numpy as np
import scipy.sparse
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [76]:
# Mount google drive folder to access dataframe from google colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# Specify the path to the csv data file
DataPath = '/content/drive/My Drive/GSOM/2 semester/Natural Language and Image Recognition/Homework/Homework 2/SMSSpamCollection.csv'

In [78]:
# Read the tab-separated data file into a pandas DataFrame
DataFrame = pd.read_csv(DataPath, delimiter='\t', header=None, names=['sender', 'message'])
DataFrame.head()

| | sender | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [79]:
# Check the percentage of 'spam' and 'ham' messages
DataFrame['sender'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: sender, dtype: float64

In [80]:
print('Total number of records: ', DataFrame.shape[0])

Total number of records:  5572
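
With roughly 87% of the messages labelled ham, plain accuracy can look good even for a useless classifier, which is one argument for reporting precision, recall and F1 on the spam class instead. The sketch below works this through for a hypothetical baseline that labels every message as ham; the counts are derived from the proportions and the total printed above, and the baseline itself is an illustration, not part of the notebook.

```python
# Class counts implied by the proportions and total printed above.
n_total = 5572
n_spam = round(0.134063 * n_total)   # spam messages
n_ham = n_total - n_spam             # ham messages

# Hypothetical baseline: label every message as ham (class 0).
tp = 0          # spam correctly flagged
fp = 0          # ham wrongly flagged
fn = n_spam     # spam missed
tn = n_ham      # ham correctly passed

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
# Precision is undefined with no positive predictions; treat it as 0 here.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) > 0 else 0.0)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The baseline reaches an accuracy equal to the ham share of the data while catching no spam at all, so its spam recall and F1 are zero.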


In [81]:
# Check the data for missing values
DataFrame.isnull().values.any()

False

## 2.0 Preparing the data

In [0]:
# Map 'ham' and 'spam' to 0 and 1, respectively
DataFrame['label'] = DataFrame['sender'].map({'ham': 0, 'spam': 1})

In [0]:
# Split the dataframe into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(DataFrame['message'], 
                                                    DataFrame['label'], 
                                                    random_state=113)
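
The split above is a single hold-out split, while the task also asks for 5-fold validation. A minimal sketch of how shuffled fold indices could be generated with NumPy alone is shown below; the helper name `kfold_indices` is hypothetical, and scikit-learn's `KFold` and `StratifiedKFold` offer the same functionality ready-made.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=113):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold validation.

    Hypothetical helper for illustration; not part of the notebook.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)  # nearly equal-sized folds
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

# Example with the dataset size reported above (5572 messages).
splits = list(kfold_indices(5572))
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)
```

Each message lands in exactly one test fold, so averaging a metric over the five folds uses every record for testing once. The vectorizer would need to be refitted on each training fold to avoid leaking test vocabulary.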

In [0]:
# Instantiate CountVectorizer: lowercase text, keep tokens of 2+ word characters, drop English stop words
CountVector = CountVectorizer(lowercase = True, token_pattern = '(?u)\\b\\w\\w+\\b', stop_words = 'english')

In [0]:
# Learn the vocabulary from the training messages only, then convert both sets to word-count matrices
TrainingData = CountVector.fit_transform(X_train)
TestingData = CountVector.transform(X_test)

## 2.1 Naive Bayes manual implementation

In [0]:
# Class priors estimated from the training labels (labels are 0/1, so the sum counts spam)
ProbSpam = sum(y_train) / len(y_train)
ProbHam = 1 - ProbSpam

In [0]:
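# Word likelihoods for 'spam' with add-one (Laplace) smoothing
IndiciesSpam = np.where(y_train == 1)[0]
SpamData = TrainingData.tocsr()[IndiciesSpam,:]

# Count of each vocabulary word over all spam messages, plus 1 so unseen words never get probability zero
FreqSpam = SpamData.toarray().sum(axis=0) + 1
ProbsSpam = FreqSpam / FreqSpam.sum()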

In [0]:
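# Word likelihoods for 'ham' with add-one (Laplace) smoothing
IndiciesHam = np.where(y_train == 0)[0]
HamData = TrainingData.tocsr()[IndiciesHam,:]

# Count of each vocabulary word over all ham messages, plus 1 so unseen words never get probability zero
FreqHam = HamData.toarray().sum(axis=0) + 1
ProbsHam = FreqHam / FreqHam.sum()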
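
The two cells above amount to the add-one (Laplace) smoothed estimate of the per-class word probabilities. Written out, with notation that is introduced here only for illustration ($N_{c,j}$ is the total count of vocabulary word $w_j$ in training messages of class $c$, and $|V|$ is the vocabulary size):

```latex
\hat{P}(w_j \mid c) = \frac{N_{c,j} + 1}{\sum_{k=1}^{|V|} N_{c,k} + |V|},
\qquad c \in \{\text{ham}, \text{spam}\}
```

The $+\,|V|$ in the denominator comes from summing the $+1$ over every vocabulary word, which is what dividing by the sum of the already incremented frequencies does in the code.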

In [0]:
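def SpamOrHam(Arr):
    """Classify one vectorized message (a 1 x n sparse row): 0 = ham, 1 = spam."""
    # Start from the log priors; summing logs avoids underflow from multiplying many small probabilities
    PrHam = np.log(ProbHam)
    PrSpam = np.log(ProbSpam)
    # scipy.sparse.find returns (row indices, column indices, values) of the nonzero entries
    _, Cols, Counts = scipy.sparse.find(Arr)
    for Col, Count in zip(Cols, Counts):
        # Add each word's log likelihood, weighted by its count in the message
        PrHam += np.log(ProbsHam[Col]) * Count
        PrSpam += np.log(ProbsSpam[Col]) * Count

    # Ties go to ham
    return 0 if PrHam >= PrSpam else 1

# Classify every row of the test matrix
PredictionsMI = [SpamOrHam(Row) for Row in TestingData]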
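
The per-message loop above computes, for each class, the log prior plus the count-weighted sum of word log likelihoods. The same scores can be obtained for all messages at once with a single matrix product. The sketch below shows this on a small made-up count matrix; the toy data and the helper name `fit_log_params` are assumptions for illustration and do not come from the notebook.

```python
import numpy as np

# Toy term-count matrices (rows = messages, columns = vocabulary terms).
# All numbers are made up for illustration.
X_train = np.array([[2, 0, 1],
                    [0, 3, 0],
                    [1, 0, 2],
                    [0, 2, 1]])
y_train = np.array([1, 0, 1, 0])  # 1 = spam, 0 = ham
X_test = np.array([[1, 0, 1],
                   [0, 2, 0]])

def fit_log_params(X, y):
    """Return log priors and add-one-smoothed log likelihoods per class."""
    classes = np.array([0, 1])
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    counts = np.vstack([X[y == c].sum(axis=0) for c in classes]) + 1
    log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_prior, log_lik

log_prior, log_lik = fit_log_params(X_train, y_train)

# One matrix product scores every message against both classes at once.
scores = X_test @ log_lik.T + log_prior   # shape: (n_messages, 2)
# argmax picks class 0 on ties, matching the `PrHam >= PrSpam` rule.
predictions = scores.argmax(axis=1)
print(predictions)
```

Column 0 of `scores` plays the role of `PrHam` and column 1 the role of `PrSpam`, so the prediction rule is unchanged; only the looping is replaced by linear algebra.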

## 2.2 Naive Bayes implementation using SciKit-Learn

In [91]:
# Fit the MultinomialNB classifier on the same count features as the manual implementation
NBCLassifierSK = MultinomialNB()
NBCLassifierSK.fit(TrainingData, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [0]:
# Predict labels for the test set
PredictionsSK = NBCLassifierSK.predict(TestingData)

## 3. Evaluating the model

In [102]:
print('------------ Naive Bayes: Manual Implementation -----')
print(classification_report(y_test, PredictionsMI))

------------ Naive Bayes: Manual Implementation -----
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1208
           1       0.94      0.94      0.94       185

   micro avg       0.98      0.98      0.98      1393
   macro avg       0.97      0.96      0.96      1393
weighted avg       0.98      0.98      0.98      1393



In [103]:
print('------------ Naive Bayes: SciKit-Learn --------------')
print(classification_report(y_test, PredictionsSK))

------------ Naive Bayes: SciKit-Learn --------------
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1208
           1       0.94      0.94      0.94       185

   micro avg       0.98      0.98      0.98      1393
   macro avg       0.97      0.96      0.96      1393
weighted avg       0.98      0.98      0.98      1393



## Conclusion

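The manual implementation of Naive Bayes achieves the same reported test results as the scikit-learn implementation of the multinomial Naive Bayes classifier: precision, recall and F1 are all 0.94 for the spam class and 0.99 for the ham class on 1393 test messages.

This is expected if both follow the same approach. The manual version estimates the class priors from the training labels and the word likelihoods with add-one smoothing, which corresponds to `MultinomialNB` with its default settings `alpha=1.0` and `fit_prior=True`.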