# Spam Detection with Naive Bayes #

Spam detection is one of the major applications of machine learning.
I will use the Naive Bayes algorithm to build a model that classifies SMS messages as spam or not spam.
This is a binary classification problem, since every message is classified as either 'Spam' or 'Not Spam' and nothing else.
It is also a supervised learning problem: we feed a labelled dataset into the model so that it can learn from it and make predictions on new messages.

This project has been broken down into the following steps: 

- Understanding our dataset
- Data Preprocessing
- Training and testing sets
- Applying Bag of Words processing to the dataset
- Naive Bayes implementation using scikit-learn
- Evaluating the model
- Conclusion

### Understanding our dataset ### 


We will be using the SMS Spam Collection dataset from the UCI Machine Learning Repository. The file has two tab-separated columns and no header row, so we name the columns ourselves when loading it. The first column takes one of two values: 'ham', which means the message is not spam, and 'spam', which means it is. The second column is the text content of the SMS message being classified.

In [38]:
import pandas as pd
# The file is tab-separated with no header row, so the column names are supplied here.
# Adjust the path if the file lives elsewhere (e.g. 'smsspamcollection/SMSSpamCollection').
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'sms_message'])

# Display the first 5 rows
df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Data Preprocessing ###

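Now that we have a basic understanding of what our dataset looks like, let's convert the labels to binary variables: 0 to represent 'ham' (i.e. not spam) and 1 to represent 'spam'. scikit-learn classifiers can also work with string labels, but numeric labels are convenient here because the precision, recall and F1 functions used later treat 1 as the positive class by default.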


In [39]:
df['label'] = df.label.map({'ham':0, 'spam':1})

In [42]:
df.shape

(5572, 2)

### Training and testing sets ###

In [41]:
# Split into training and testing sets (train_test_split holds out 25% of the rows for testing by default)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


### Applying Bag of Words processing to the dataset ###

The dataset is a large collection of text (5,572 messages), which must be converted to numerical form before modelling. I apply Bag of Words processing with scikit-learn's CountVectorizer to turn the text into a matrix of token counts, with one row per message and one column per word in the vocabulary. Bag of Words takes a piece of text and counts how often each word occurs in it, ignoring word order.

In [43]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the CountVectorizer
count_vector = CountVectorizer()

# Learn the vocabulary from the training data and return the document-term matrix
training_data = count_vector.fit_transform(X_train)

# Transform the testing data with the vocabulary learned above.
# Note that we do not fit the CountVectorizer on the testing data.
testing_data = count_vector.transform(X_test)
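
To make the Bag of Words idea concrete, here is a minimal pure-Python sketch of the same fit/transform pattern on two toy training messages. The function names (`tokenize`, `fit_vocabulary`, `transform`) and the example messages are made up for illustration and are not part of the original notebook. The tokenizer only loosely mirrors CountVectorizer's defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(documents):
    # Build a word -> column index mapping from the training documents only
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def transform(documents, vocabulary):
    # Count each vocabulary word per document; unseen words are ignored
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Hypothetical toy messages, for illustration only
train_docs = ["Win a free prize now", "Are you free now or later"]
test_docs = ["Free prize inside"]

vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)

print(list(vocabulary))
print(train_matrix)
print(test_matrix)
```

The word "inside" appears only in the test message, so it has no column and is dropped. This is why the vocabulary is learned from the training data alone and merely applied to the testing data.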

### Naive Bayes implementation using scikit-learn ###

scikit-learn provides several Naive Bayes implementations in its `sklearn.naive_bayes` module. I will use the multinomial Naive Bayes classifier, `MultinomialNB`, to make predictions on the dataset. This classifier is suitable for classification with discrete features, such as word counts in text classification, and it takes integer word counts as its input.

In [44]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB()

In [45]:
predictions = naive_bayes.predict(testing_data)
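
For intuition about what the multinomial model computes, here is a rough pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is a simplified illustration, not scikit-learn's actual implementation. The function names and the four tokenized toy messages are hypothetical, and `alpha=1.0` is chosen to match the default smoothing value of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    # docs: list of token lists; labels: 0 for ham, 1 for spam
    vocab = {w for doc in docs for w in doc}
    class_doc_counts = Counter(labels)

    # Total count of each word within each class
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    # log P(class): fraction of training documents in each class
    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}

    # log P(word | class) with additive smoothing so no probability is zero
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    # Score each class as log prior plus the sum of word log-likelihoods;
    # words outside the training vocabulary are skipped
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy training set, for illustration only
docs = [
    ["free", "prize", "now"],
    ["free", "entry", "win"],
    ["see", "you", "later"],
    ["call", "you", "now"],
]
labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
prediction = predict(["free", "prize", "later"], log_prior, log_likelihood)
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.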

### Evaluating our model ###

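I will use several metrics to evaluate the model on the test set: accuracy, precision, recall and the F1 score. Accuracy alone can be misleading when one class is much more common than the other, so precision and recall for the spam class are reported as well.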

In [47]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
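
As a sanity check on how the four scores relate, the sketch below recomputes them from confusion-matrix counts. The counts do not appear in the original output. They are back-calculated here as the integer counts consistent with the reported scores and the 1,393-row test set, so treat them as an inference rather than a recorded result. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    # tp/fp/fn/tn: true positives, false positives, false negatives, true negatives
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of actual spam messages, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Assumed counts, inferred from the reported scores (not printed by the notebook)
tp, fp, fn, tn = 174, 5, 11, 1203

accuracy, precision, recall, f1 = classification_metrics(tp, fp, fn, tn)
print(f"Accuracy score: {accuracy}")
print(f"Precision score: {precision}")
print(f"Recall score: {recall}")
print(f"F1 score: {f1}")
```

Under these inferred counts, the model mislabels 5 legitimate messages as spam and misses 11 spam messages out of 185, which is why recall is the lowest of the four scores even though accuracy is close to 99%.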


### Conclusion ###

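On the held-out test set, the model reached about 98.9% accuracy, with a precision of about 0.97 and a recall of about 0.94 for the spam class.

Advantages of Naive Bayes:

- Ability to handle an extremely large number of features (here, one per distinct word)
- Very simple to implement and fast to train
- Rarely overfits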

### Sources ###

https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
    
https://archive.ics.uci.edu/ml/machine-learning-databases/00228/