# Spam Classification
> **Work done by**: Nwachukwu Anthony  
> **Email**: nwachukwuanthony2015@gmail.com  
> **Note**: this is my solution to the Udacity Nanodegree course on NLP

## Our Mission ##

Spam detection is one of the major applications of machine learning on the web today. Nearly all major email service providers have built-in spam detection systems that automatically classify such mail as 'Junk Mail'. 

In this mission we will use the Naive Bayes algorithm to build a model that classifies SMS messages from this [dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) as spam or not spam, based on the training we give it. It helps to have some intuition for what a spammy text message looks like. Such messages usually contain words like 'free', 'win', 'winner', 'cash' and 'prize', because they are designed to catch your eye and tempt you to open them. Spam messages also tend to have words written in all capitals and to use a lot of exclamation marks. To the recipient, it is usually pretty straightforward to identify a spam text, and our objective here is to train a model to do that for us!
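
As a rough illustration of that intuition, the sketch below counts the three cues mentioned above for a single message. The function name `spam_cues`, the word list and the sample messages are hypothetical and only meant to make the idea concrete; the model built in this notebook learns from word counts rather than from hand-written rules like these.

```python
import re

# Hypothetical trigger words, taken from the examples in the paragraph above
TRIGGER_WORDS = {"free", "win", "winner", "cash", "prize"}

def spam_cues(message):
    """Return simple hand-crafted cues for one message (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", message)
    return {
        # words from the trigger list, compared case-insensitively
        "trigger_words": sum(w.lower() in TRIGGER_WORDS for w in words),
        # words of two or more letters written entirely in capitals
        "all_caps_words": sum(len(w) > 1 and w.isupper() for w in words),
        "exclamation_marks": message.count("!"),
    }

print(spam_cues("WINNER!! You have won a FREE cash prize!"))
print(spam_cues("Ok, see you at lunch"))
```

The first message scores on all three cues while the second scores on none, which is the kind of difference we hope the trained model will pick up on its own.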

Identifying spam messages is a binary classification problem, as messages are classified as either 'Spam' or 'Not Spam' and nothing else. It is also a supervised learning problem, as we will feed a labelled dataset into the model, which it learns from in order to make future predictions. 


## Understanding the dataset


We will be using a [dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) from the UCI Machine Learning repository which has a very good collection of datasets for experimental research purposes. The direct data link is [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/).


**Here's a preview of the data:**

<img src="images/dqnb.png" height="1242" width="1242">

The columns in the data set are currently not named and as you can see, there are 2 columns. 

The first column takes two values, 'ham' which signifies that the message is not spam, and 'spam' which signifies that the message is spam. 

The second column is the text content of the SMS message that is being classified.

## Download the dataset

In [1]:
import requests, zipfile, io
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
zip_file_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
r = requests.get(zip_file_url)
# Stop early with a clear error if the download failed
r.raise_for_status()
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

## Import the dataset

In [2]:
import pandas as pd

data = "SMSSpamCollection"
df = pd.read_csv(data, names=['label', 'sms_message'],sep='\t', header=None)

# Output printing out first 5 rows
df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Data Preprocessing

We convert the labels to binary variables, 0 to represent 'ham' (i.e. not spam) and 1 to represent 'spam', for ease of computation. 

Scikit-learn classifiers can be trained on string labels, since they encode the classes internally. The model would therefore still be able to make predictions if we left our labels as strings, but we could have issues later when calculating performance metrics: for example, `precision_score` and `recall_score` treat the label `1` as the positive class by default, and that label does not exist if the labels are 'ham' and 'spam'. Hence, to avoid unexpected 'gotchas' later, it is good practice to feed our categorical values into the model as integers. 

### Convert Labels to integer values

In [3]:
#extract unique labels
uniqueLabel = pd.unique(df['label'])
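#assign each unique label to an integer value beginning from zero
#(labels are numbered in order of first appearance, so 'ham' gets 0 because the first row is 'ham')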
dicOfUniqueLabel = {uniqueLabel[i]:i for i in range(len(uniqueLabel))}
#effect this change on the dataframe
df['label'] = df.label.map(dicOfUniqueLabel)

print('Unique integer assignments: ', dicOfUniqueLabel, '\n\n')
df.head()

Unique integer assignments:  {'ham': 0, 'spam': 1} 




|   | label | sms_message |
|---|-------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
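
The cell above numbers the labels in the order they first appear in the file, so the result depends on the first row being 'ham'. A minimal alternative sketch, assuming the only labels are 'ham' and 'spam', is to spell the mapping out. The names `LABEL_TO_INT` and `encode_labels` are hypothetical helpers, not part of the notebook.

```python
# Explicit mapping: does not depend on which label appears first in the file
LABEL_TO_INT = {"ham": 0, "spam": 1}

def encode_labels(labels):
    """Map string labels to integers, failing loudly on anything unexpected."""
    unknown = set(labels) - set(LABEL_TO_INT)
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    return [LABEL_TO_INT[label] for label in labels]

print(encode_labels(["spam", "ham", "ham", "spam"]))
```

With pandas, the same dictionary could be passed to `df['label'].map(...)` exactly as the cell above does with `dicOfUniqueLabel`.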


In [4]:
#check the number of rows and columns
df.shape

(5572, 2)

#### Implementing Bag-of-words from scratch

In [5]:
def bagOfWords(corpus):
    import string
    from collections import Counter
    frequency_list = []
    for i in corpus:
        #Convert all strings to their lower case form
        result = i.lower()
        #Removing all punctuations
        result = result.translate(str.maketrans('', '', string.punctuation))
        #Tokenization
        result = result.split()
        #Count frequencies
        frequency_list.append(dict(Counter(result)))
    return frequency_list

def vectorOfWords(corpus):
    import numpy as np

    #Retrieve the labels: map each word to a column index, in order of first appearance
    featureIndex = {}
    for counts in corpus:
        for word in counts:
            if word not in featureIndex:
                featureIndex[word] = len(featureIndex)

    features = np.zeros([len(corpus), len(featureIndex)], dtype=int)

    #Compute the word vectors (a dictionary lookup avoids scanning a list for every word)
    for row, counts in enumerate(corpus):
        for word, count in counts.items():
            features[row, featureIndex[word]] = count

    return pd.DataFrame(features, columns=list(featureIndex))

vectorOfWords(bagOfWords(df['sms_message'])).head()

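|   | go | until | jurong | point | crazy | available | only | in | bugis | n | ... | heap | lowes | salesman | £750 | 087187272008 | now1 | pity | soany | suggestions | bitching |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |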
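
Because the full matrix is too wide to read, here is a small sketch of the same four steps (lowercase, strip punctuation, tokenize, count) on a two-message toy corpus. The corpus and the helper name `tokenize` are made up for illustration; only the standard library is used, so plain lists stand in for the DataFrame.

```python
import string
from collections import Counter

def tokenize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Hypothetical toy corpus
toy_corpus = ["Hello, how are you!", "Win money, win from home."]

# One word -> frequency mapping per message
counts = [Counter(tokenize(doc)) for doc in toy_corpus]

# Column labels: every distinct word, in order of first appearance
vocabulary = []
for doc_counts in counts:
    for word in doc_counts:
        if word not in vocabulary:
            vocabulary.append(word)

# One row per message, one column per vocabulary word
matrix = [[doc_counts.get(word, 0) for word in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has a non-zero entry only for the words that occur in that message, and 'win' gets a count of 2 in the second row because it appears twice.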


#### Implementing Bag-of-words in scikit-learn 

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

#Create an instance of CountVectorizer
count_vector = CountVectorizer()
#Fit the document dataset to the CountVectorizer object
count_vector.fit(df['sms_message'])
#Extract feature labels (get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later)
feature = count_vector.get_feature_names_out()
#Get the vector of words
doc_array = count_vector.transform(df['sms_message']).toarray()
#Convert array to DataFrame
frequency_matrix = pd.DataFrame(doc_array,columns=feature)
frequency_matrix.head()

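|   | 00 | 000 | 000pes | 008704050406 | 0089 | 0121 | 01223585236 | 01223585334 | 0125698789 | 02 | ... | zhong | zindgi | zoe | zogtorius | zoom | zouk | zyada | èn | ú1 | 〨ud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |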
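
The columns here differ from the from-scratch version for two reasons: `CountVectorizer` sorts its vocabulary, and with default settings it lowercases the text and keeps only tokens of two or more word characters. The sketch below imitates that default tokenization with the standard library so the effect is easy to see; `sklearn_like_tokens` is a hypothetical helper, not a scikit-learn function.

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def sklearn_like_tokens(text):
    """Approximate CountVectorizer's default preprocessing and tokenization."""
    return TOKEN_PATTERN.findall(text.lower())

# Second message of the dataset preview: the single-letter word 'u' is dropped
print(sklearn_like_tokens("Ok lar... Joking wif u oni..."))
```

This is why a one-letter column such as 'n' shows up in the from-scratch table but has no counterpart in the scikit-learn one.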


## Training and testing sets splitting

In [7]:
from sklearn.model_selection import train_test_split

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
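
The 4179 / 1393 split follows from `train_test_split` holding out 25% of the rows when no test size is given. The sketch below shows the idea with the standard library only: shuffle, then cut. `simple_train_test_split` is a hypothetical helper that assumes a non-empty input and a positive test fraction, and it will not reproduce the exact rows scikit-learn selects for `random_state=1`.

```python
import math
import random

def simple_train_test_split(items, test_fraction=0.25, seed=1):
    """Shuffle a copy of items and cut off the last test_fraction as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Number of held-out rows, rounded up
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[:-n_test], shuffled[-n_test:]

# Row indices stand in for the 5572 messages
train, test = simple_train_test_split(range(5572))
print(len(train), len(test))
```

Fixing the seed makes the shuffle repeatable, which is the same purpose `random_state` serves in the cell above.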


## Applying Bag of Words processing to our dataset

In [8]:
# Instantiate the CountVectorizer class
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
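
To see why the vectorizer is fitted on the training messages only, here is a simplified sketch of the fit/transform split. The helpers `fit_vocabulary` and `transform` and the sample messages are hypothetical, and the tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({word for doc in train_docs for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def transform(docs, vocabulary):
    """Count known words per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

train_docs = ["win cash now", "see you at home"]   # hypothetical training messages
test_docs = ["win a prize now"]                    # 'a' and 'prize' were never seen

vocabulary = fit_vocabulary(train_docs)
print(list(vocabulary))
print(transform(test_docs, vocabulary))
```

The test row has the same columns as the training matrix, and the unseen words simply contribute nothing. Fitting on the test set as well would leak information about it into the features.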


## Naive Bayes implementation using scikit-learn

We will be using scikit-learn's `sklearn.naive_bayes` module to make predictions on our dataset. 

Specifically, we will be using the multinomial Naive Bayes implementation. This classifier is suitable for classification with discrete features (such as, in our case, word counts for text classification). It takes integer word counts as its input. Gaussian Naive Bayes, on the other hand, is better suited to continuous data, as it assumes that each feature follows a Gaussian (normal) distribution within each class.
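
For intuition about what the multinomial model estimates, the sketch below trains a tiny version by hand: class priors from label frequencies, and per-class word probabilities from word counts with add-one (Laplace) smoothing, which corresponds to `alpha=1.0`. The function names and the toy data are hypothetical, and this is a simplified illustration rather than scikit-learn's actual implementation.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocabulary = sorted({word for doc in docs for word in doc})
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: all words seen in class c plus alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical toy data: 1 = spam, 0 = ham
docs = [["win", "cash", "now"], ["free", "cash", "prize"],
        ["see", "you", "at", "home"], ["are", "you", "home", "now"]]
labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict(["free", "cash"], log_prior, log_likelihood))
print(predict(["you", "at", "home"], log_prior, log_likelihood))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a word that never occurred in one class from forcing that class's probability to zero.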

In [9]:
from sklearn.naive_bayes import MultinomialNB

#train the model
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [10]:
#Make predictions
predictions = naive_bayes.predict(testing_data)

## Evaluating the model

**Accuracy** measures how often the classifier makes the correct prediction. It's the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

**Precision** tells us what proportion of the messages we classified as spam actually were spam.
It is the ratio of true positives (messages classified as spam that actually are spam) to all positives (all messages classified as spam, irrespective of whether that was the correct classification). In other words, it is the ratio

`[True Positives/(True Positives + False Positives)]`

**Recall (sensitivity)** tells us what proportion of the messages that actually were spam we classified as spam.
It is the ratio of true positives (messages classified as spam that actually are spam) to all the messages that actually were spam. In other words, it is the ratio

`[True Positives/(True Positives + False Negatives)]`

For classification problems with skewed class distributions, as in this case, accuracy by itself is not a very good metric. Suppose we had 100 text messages, of which only 2 were spam and the other 98 were not. We could classify 90 messages as not spam (including the 2 that actually were spam, which would be false negatives) and 10 as spam (all 10 false positives) and still get an accuracy of 88%, even though we did not catch a single spam message. For such cases, precision and recall come in very handy. These two metrics can be combined into the F1 score, which is the harmonic mean of precision and recall. This score ranges from 0 to 1, with 1 being the best possible F1 score.
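
The 100-message example can be checked by plugging its counts into the formulas above. In this sketch, `classification_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention chosen here for simplicity.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# The example above: 2 spam messages missed, 10 ham messages flagged as spam,
# and the remaining 88 ham messages classified correctly
print(classification_metrics(tp=0, fp=10, fn=2, tn=88))
```

Accuracy looks respectable while precision, recall and F1 are all zero, which is exactly the failure that accuracy alone hides.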

In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
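
As a sanity check, the printed F1 score can be recomputed from the printed precision and recall. The confusion-matrix counts in the second half of the sketch are not output from the notebook. They are inferred here from the printed scores and the 1393 test rows, so treat them as an assumption.

```python
# Scores as printed by the cell above
precision = 0.9720670391061452
recall = 0.9405405405405406

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))

# Counts inferred from the printed scores (an assumption, not notebook output)
tp, fp, fn, tn = 174, 5, 11, 1203
print(tp + fp + fn + tn)                             # size of the test set
print(round(tp / (tp + fp), 4))                      # precision
print(round(tp / (tp + fn), 4))                      # recall
print(round((tp + tn) / (tp + fp + fn + tn), 4))     # accuracy
```

Under these inferred counts the model lets 11 spam messages through and flags 5 legitimate messages as spam, out of 1393 test messages.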


## Conclusion

One of the major advantages that Naive Bayes has over other classification algorithms is its ability to handle an extremely large number of features. In this case, each word is treated as a feature and there are thousands of different words.

Also, it performs well even with the presence of irrelevant features and is relatively unaffected by them.

The other major advantage is its relative simplicity. Naive Bayes works well right out of the box, and tuning its parameters is rarely necessary, except usually in cases where the distribution of the data is known. 

It rarely ever overfits the data. Another important advantage is that its model training and prediction times are very fast for the amount of data it can handle.