# Naive Bayes Project: Spam Detection

## 1. Introduction
In this project, the Naive Bayes algorithm is used to build a machine learning model that classifies messages as spam or not spam. Note that Naive Bayes assumes the features are independent of one another given the class. In practice this assumption does not always hold, which can affect the final prediction. One of the major advantages of Naive Bayes over other classification algorithms is its ability to handle an extremely large number of features. It is relatively simple and rarely needs parameter tuning. It rarely overfits the data, and prediction is very fast.

<a id='import'></a>
## 2. Import Libraries

In [15]:
# basic libraries
import pandas as pd

# load the tab-separated SMS Spam Collection file (it has no header row) and name the two columns
df = pd.read_table('C:/Users/xiaoj/Downloads/smsspamcollection~1/SMSSpamCollection', header=None, names=['label', 'sms_message'])
df.head()

|   | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [16]:
df.shape

(5572, 2)

## 3. Data Preprocessing

In [17]:
# map the string values in 'label' to integers: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham':0,'spam':1})
df.head()

|   | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


Machine learning algorithms require numerical input, so the text has to be converted to numerical data.

In [18]:
# data preprocessing with CountVectorizer()
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

**Bag of Words (BoW)** represents a text by counting how often each word occurs in it. Note that BoW treats each word individually; the order in which the words occur does not matter.
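The idea can be sketched in plain Python. This is a minimal illustration on made-up messages, not the project data; the helper name `bag_of_words` and the whitespace tokenization are assumptions chosen for brevity.

```python
from collections import Counter

def bag_of_words(message, vocabulary):
    """Return the count of each vocabulary word in the message (hypothetical helper)."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]

# Toy messages, made up for illustration.
messages = ["free prize free entry", "call me later"]

# The vocabulary is the sorted set of all words seen in the messages.
vocabulary = sorted({word for message in messages for word in message.split()})

vectors = [bag_of_words(message, vocabulary) for message in messages]
print(vocabulary)
print(vectors)

# Word order does not matter: a shuffled message gives the same vector.
shuffled_vector = bag_of_words("entry free prize free", vocabulary)
print(shuffled_vector == vectors[0])
```

Each message becomes a fixed-length vector of counts, which is the kind of numerical input a classifier can consume.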

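**sklearn.feature_extraction.text.CountVectorizer** tokenizes each string and counts the occurrences of each token. By default, CountVectorizer converts all tokens to lowercase (the **lowercase** parameter defaults to True) and treats punctuation as a token separator, keeping only tokens of two or more alphanumeric characters (the **token_pattern** parameter). Stop words are not removed by default (**stop_words** defaults to None); passing stop_words='english' enables the built-in English stop word list. The vectorizer created above uses the default settings, so stop words are kept.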
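The effect of these parameters can be checked on a couple of made-up messages. The sketch below is illustrative only: the toy messages and variable names are assumptions, and the second vectorizer enables stop word removal, which the project's own `count_vector` does not.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages, made up for illustration.
toy_messages = ["Win a FREE prize, win big!", "This is the prize."]

# Default settings: lowercase=True, stop_words=None, default token_pattern.
default_vectorizer = CountVectorizer()
default_vectorizer.fit(toy_messages)
default_vocab = sorted(default_vectorizer.vocabulary_)

# Same messages with the built-in English stop word list enabled.
filtered_vectorizer = CountVectorizer(stop_words="english")
filtered_vectorizer.fit(toy_messages)
filtered_vocab = sorted(filtered_vectorizer.vocabulary_)

print("default: ", default_vocab)
print("filtered:", filtered_vocab)
```

With the defaults, "FREE" is stored as "free", punctuation disappears, and the single-character token "a" is dropped by the token pattern, while common words such as "the" stay in the vocabulary until stop word removal is switched on.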

## 4. Training and testing sets

In [25]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df['sms_message'], df['label'], random_state = 42)
print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(x_train.shape[0]))
print('Number of rows in the test set: {}'.format(x_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


In [29]:
# learn the vocabulary from the training data and return its document-term matrix
x_training_data = count_vector.fit_transform(x_train)

# transform the test data using the vocabulary learned from the training data
x_testing_data = count_vector.transform(x_test)
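The reason for calling `fit_transform` on the training set and only `transform` on the test set is that the vocabulary should come from the training data alone. A small sketch on made-up messages (the names `toy_train` and `toy_test` are illustrative assumptions) shows what happens to words that appear only in the test set:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, made up for illustration.
toy_train = ["free prize now", "see you at lunch"]
toy_test = ["free lunch tomorrow"]

toy_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training messages.
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)

# transform reuses that vocabulary; unseen words are silently ignored.
toy_test_matrix = toy_vectorizer.transform(toy_test)

print(toy_train_matrix.shape)  # one column per training vocabulary word
print(toy_test_matrix.shape)   # same number of columns as the training matrix
print("tomorrow" in toy_vectorizer.vocabulary_)
```

Both matrices share the same columns, so a model fitted on the training matrix can be applied to the test matrix directly.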

## 5. Naive Bayes implementation using scikit-learn


Bayes' theorem computes the probability of an event given related evidence. It combines the prior probability of a class with the likelihood of the observed features to give the posterior probability of that class.
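Written out, with notation that is assumed here rather than taken from the notebook ($y$ for the class, spam or ham, and $x_1, \dots, x_n$ for the features, such as word counts), Bayes' theorem reads:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
$$

Under the naive assumption that the features are independent given the class, the likelihood factorizes, and the predicted class is the one with the largest posterior:

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

Here $P(y)$ is the prior, $P(x_i \mid y)$ is the per-feature likelihood, and the left-hand side is the posterior. The denominator is the same for every class, so it can be dropped when comparing classes.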

The **sklearn.naive_bayes** module provides Naive Bayes classifiers. The **MultinomialNB** classifier is suitable for discrete features such as word counts, while **Gaussian Naive Bayes** is better suited to continuous data, since it assumes that the features follow a Gaussian distribution within each class.
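The difference between the two classifiers can be sketched on tiny made-up datasets. All arrays below are illustrative assumptions, not project data: one holds word counts, the other a single continuous measurement.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Discrete features: toy word counts over a three-word vocabulary.
toy_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 0]])
toy_count_labels = np.array([1, 1, 0, 0])

mnb = MultinomialNB()
mnb.fit(toy_counts, toy_count_labels)
mnb_pred = mnb.predict(np.array([[1, 0, 0], [0, 1, 0]]))

# Continuous features: one toy real-valued measurement per sample.
toy_measurements = np.array([[1.0], [1.2], [3.0], [3.3]])
toy_measure_labels = np.array([0, 0, 1, 1])

gnb = GaussianNB()
gnb.fit(toy_measurements, toy_measure_labels)
gnb_pred = gnb.predict(np.array([[1.1], [3.1]]))

print(mnb_pred)
print(gnb_pred)
```

Since CountVectorizer produces word counts, MultinomialNB is the natural fit for this project.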

In [30]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(x_training_data, y_train)

MultinomialNB()

In [31]:
predictions = naive_bayes.predict(x_testing_data)

## 6. Evaluating the model

In [32]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))
print('F1 score: ', f1_score(y_test, predictions))

Accuracy score:  0.9885139985642498
Precision score:  0.9775280898876404
Recall score:  0.9354838709677419
F1 score:  0.956043956043956
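To make the four scores concrete, the sketch below computes them by hand from the counts of true and false positives and negatives. The label lists are made up for illustration (1 = spam, 0 = ham) and are not the project's predictions.

```python
# Toy labels, made up for illustration: 1 = spam, 0 = ham.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(toy_true, toy_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that was missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham correctly passed

accuracy = (tp + tn) / len(pairs)    # share of all messages classified correctly
precision = tp / (tp + fp)           # share of flagged messages that are spam
recall = tp / (tp + fn)              # share of spam messages that were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

In spam detection, precision reflects how rarely a legitimate message is flagged as spam, and recall reflects how much spam is caught; a recall lower than the precision means the model misses more spam than it wrongly flags.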
