In [1]:
import pandas as pd

# Data Collection

We will use the SMS Spam Collection, a dataset originally compiled and posted on the UCI Machine Learning Repository, which hosts a large collection of datasets for experimental research. If you're interested, you can review the [abstract](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) and the original [compressed data file](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/) on the UCI site.  

This is a tab-separated file with no header row, so we supply the column names ourselves.

In [2]:
df = pd.read_table('SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])
print(df.shape)
df.head()

(5572, 2)


|   | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


# Preprocessing

In [3]:
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

|   | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
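The split sizes above follow from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, a quarter of the rows are held out for testing. A minimal sketch of that arithmetic, assuming the default 0.25 test fraction and rounding the test count up:

```python
import math

n_rows = 5572
test_fraction = 0.25  # scikit-learn's default when test_size and train_size are both omitted

# Assumption: the test count is rounded up and the remainder goes to training.
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

print(n_train, n_test)  # prints 4179 1393
```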


# Bag of Words

The basic idea of Bag of Words (BoW) is to take a piece of text and count how often each word occurs in it.
BoW treats each word individually; the order in which the words occur does not matter.

We can convert a collection of documents to a matrix in which each document is a row, each word (token) is a column, and each (row, column) value is the number of times that token occurs in that document.

To do this, we will use scikit-learn's `CountVectorizer`, which does the following:

1. It tokenizes the string (separates it into individual words) and assigns an integer ID to each token.  
2. It counts the occurrences of each of those tokens.
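To make the two steps concrete, here is a small standard-library sketch that builds a document-term matrix for four toy sentences. The sentences and the helper `tokenize` are made up for illustration. The regular expression mimics `CountVectorizer`'s default of lowercasing and keeping tokens of two or more word characters, but this is an approximation of the idea, not the library's implementation.

```python
import re
from collections import Counter

# Hypothetical toy documents, not taken from the SMS dataset.
documents = [
    "Hello, how are you!",
    "Win money, win from home.",
    "Call me now.",
    "Hello, call hello you tomorrow?",
]

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]

# Step 1: assign an integer ID to every distinct token (alphabetical order).
vocabulary = sorted({token for tokens in tokenized for token in tokens})
token_ids = {token: idx for idx, token in enumerate(vocabulary)}

# Step 2: count occurrences, one row per document and one column per token.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so most entries are zero even for this tiny corpus; the second row holds a 2 in the `win` column because that word appears twice.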

In [5]:
import string
from collections import Counter

# Roughly what CountVectorizer does internally, sketched with pandas.
# This is an approximation: by default CountVectorizer tokenizes with a
# regular expression rather than stripping punctuation and splitting on spaces.

# Step 1: Convert all strings to lower case.
messages = df.sms_message.str.lower()

# Step 2: Remove all punctuation.
messages = messages.str.translate(str.maketrans('', '', string.punctuation))

# Step 3: Tokenize on spaces.
messages = messages.str.split(' ')

# Step 4: Count token frequencies per message.
token_counts = messages.apply(Counter)

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
count_vector

CountVectorizer()

In [7]:
training_data = count_vector.fit_transform(X_train)
testing_data = count_vector.transform(X_test)

In [8]:
training_data

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>
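The summary above also shows why the result is stored as a sparse matrix: only a tiny fraction of the cells are non-zero. A quick check of that fraction, using only the numbers printed in the output:

```python
n_messages, n_tokens = 4179, 7456   # matrix shape from the output above
stored = 55209                      # non-zero entries reported by the sparse matrix

total_cells = n_messages * n_tokens
density = stored / total_cells
avg_distinct_tokens = stored / n_messages

print(f"{density:.4%} of the cells are non-zero")
print(f"about {avg_distinct_tokens:.1f} distinct tokens per message")
```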

In [9]:
frequency_matrix = training_data.toarray()
frequency_matrix

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [10]:
# count_vector.get_feature_names_out() returns the feature names for this dataset,
# i.e. the set of words that make up the vocabulary of our 'documents'.
# It requires scikit-learn 1.0 or later and replaces get_feature_names(),
# which was removed in scikit-learn 1.2.

frequency_matrix = pd.DataFrame(frequency_matrix, 
                                columns = count_vector.get_feature_names_out())
print(frequency_matrix.shape)
frequency_matrix.head()

(4179, 7456)


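|   | 00 | 000 | 008704050406 | 0121 | 01223585236 | 01223585334 | 0125698789 | 02 | 0207 | 02072069400 | ... | zed | zeros | zhong | zindgi | zoe | zoom | zouk | zyada | èn | 〨ud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |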


# Model Training

Gaussian Naive Bayes is better suited to continuous data, as it assumes that the input features follow a Gaussian (normal) distribution.  
Multinomial Naive Bayes is suited to classification with discrete features, such as the word counts used here.  

The term 'Naive' in Naive Bayes comes from the fact that the algorithm assumes the features it uses to make predictions are independent of each other, which may not always be the case.
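Written as a formula, the independence assumption lets the probability of a class given a message factor into one term per word. The symbols here are our own notation rather than something defined elsewhere in this notebook: $c$ is the class (ham or spam) and $w_1, \dots, w_n$ are the words of the message.

$$
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

The predicted class is the one for which the right-hand side is largest.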

One application is search engines: a search engine that treats the words of a query as independent entities is being 'naive' in the same sense.  

Applying this to our problem of classifying messages as spam, the Naive Bayes algorithm looks at each word individually, not as entities linked to one another. For spam detection this usually works well, because certain red-flag words almost guarantee that a message is spam; for example, emails containing words like 'viagra' are usually classified as spam.
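The word-by-word reasoning described above can be sketched in plain Python. This is a simplified illustration of multinomial Naive Bayes with add-one (Laplace) smoothing, not scikit-learn's implementation. The function names `train_multinomial_nb` and `predict` and the four toy messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods for each token."""
    vocabulary = {tok for doc in docs for tok in doc}
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs in class_doc_counts.items():
        total_tokens = sum(token_counts[label].values())
        denom = total_tokens + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                tok: math.log((token_counts[label][tok] + alpha) / denom)
                for tok in vocabulary
            },
        }
    return model

def predict(model, doc):
    """Return the class with the highest log prior plus summed token log likelihoods."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in doc:
            # Tokens never seen during training are ignored.
            if tok in params["log_likelihood"]:
                score += params["log_likelihood"][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical, already-tokenized training messages.
train_docs = [
    ["win", "cash", "now"],
    ["free", "cash", "prize"],
    ["see", "you", "at", "lunch"],
    ["are", "you", "free", "now"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(train_docs, train_labels)
print(predict(model, ["win", "cash"]))  # prints spam
print(predict(model, ["see", "you"]))   # prints ham
```

Each token contributes its own term to the score regardless of its neighbours, which is the 'naive' assumption in action. Working in log space avoids numerical underflow when many small probabilities are multiplied.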

In [11]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB()

# Evaluating our model

In [12]:
predictions = naive_bayes.predict(testing_data)

**Precision** tells us what proportion of the messages we classified as spam actually were spam.  
`[True Positives/(True Positives + False Positives)]`  
**Recall (sensitivity)** tells us what proportion of the messages that actually were spam we classified as spam.  
`[True Positives/(True Positives + False Negatives)]`  
For classification problems with skewed class distributions, like ours, accuracy by itself is not a very good metric.
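A small sketch shows why accuracy alone can mislead on skewed data. The helper `classification_metrics` and the label lists are hypothetical: 100 messages, 13 of them spam, scored against a "classifier" that labels everything as ham.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall), treating `positive` as the spam class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical skewed data: 13 spam (1) and 87 ham (0).
y_true = [1] * 13 + [0] * 87
always_ham = [0] * 100

accuracy, precision, recall = classification_metrics(y_true, always_ham)
print(accuracy, precision, recall)  # prints 0.87 0.0 0.0
```

The do-nothing classifier reaches 87% accuracy while catching no spam at all, which is why precision and recall are reported alongside accuracy.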

In [13]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
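As a sanity check, the four printed scores are consistent with a single set of confusion-matrix counts on the 1393 test messages. The counts below are inferred from the scores, not read from the notebook, so treat them as an assumption. F1 is computed as the harmonic mean of precision and recall.

```python
# Assumed confusion-matrix counts, inferred from the printed scores.
tp, fp, fn, tn = 174, 5, 11, 1203

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy score: ', accuracy)
print('Precision score: ', precision)
print('Recall score: ', recall)
print('F1 score: ', f1)
```

Under these counts the model lets 11 spam messages through and flags 5 legitimate messages as spam.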


# Prediction

In [14]:
msg = [input('Enter the message :')]

msg = count_vector.transform(msg)

'ham' if naive_bayes.predict(msg)[0]==0 else 'spam'

Enter the message :Hii, Win Cash


'spam'

# Conclusion

- Naive Bayes can handle an extremely large number of features.  
- It performs well even in the presence of irrelevant features and is relatively unaffected by them.  
- Naive Bayes works well right out of the box, and tuning its parameters is rarely necessary, except usually in cases where the distribution of the data is known.  
- It rarely overfits the data.