# Naive Bayes Text Classification
In this notebook I'll use the Naive Bayes algorithm to build a model that classifies SMS messages as spam or not spam, trained on the [SMS Spam Collection dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).
If you don't know what a spammy message looks like: it usually contains words like 'win', 'cash', 'money', 'winner', 'free', etc., and it is designed to get noticed and tempt you to open it. Sometimes it also contains words in ALL CAPITALS and a lot of exclamation marks!!!
Our mission here is to train a model to detect spammy messages for us!

Identifying spam messages is a binary classification problem, as each message is classified as either 'Spam' or 'Not Spam' and nothing else. It is also a supervised learning problem: we feed a labelled dataset into the model, which it learns from in order to make future predictions.


### What is Naive Bayes ###

Bayes' theorem is one of the earliest probabilistic inference methods and is named after Reverend Thomas Bayes.

Bayes' theorem calculates the probability of an event occurring, based on other probabilities that are related to the event in question. It involves a prior (the probabilities that we are aware of or that are given to us) and a posterior (the probabilities we are looking to compute using the priors). The 'naive' part of the name refers to the assumption that the features (here, the words in a message) are independent of each other given the class.
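As a rough illustration of how a prior turns into a posterior, here is a small worked example of Bayes' theorem. All three input probabilities are hypothetical numbers chosen for readability; they are not measured from the dataset.

```python
# Hypothetical inputs, for illustration only (not taken from the SMS dataset)
p_spam = 0.2              # prior: P(spam)
p_free_given_spam = 0.6   # likelihood: P('free' appears | spam)
p_free_given_ham = 0.05   # likelihood: P('free' appears | ham)

p_ham = 1 - p_spam

# Total probability that 'free' appears in any message
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' theorem: P(spam | 'free') = P('free' | spam) * P(spam) / P('free')
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_spam_given_free, 3))  # 0.75
```

With these made-up numbers, seeing the word 'free' raises the probability of spam from the prior of 0.2 to a posterior of 0.75.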

### Step 1.1: Understanding our dataset ### 

**Here's a preview of the data:**

<img src="images/dqnb.png" height="1242" width="1242">

The columns in the data set are currently not named and as you can see, there are 2 columns. 

The first column takes two values, 'ham' which signifies that the message is not spam, and 'spam' which signifies that the message is spam. 

The second column is the text content of the message that is being classified.

What we'll do:
* Import the dataset into a pandas DataFrame using the `read_table` method. You can access it using the file path 'MessagesData/SpamHamMessages'.
* Also, rename the columns by passing the list `['label', 'message']` to the `names` argument of `read_table()`.
* Print the first five rows of the DataFrame with the new column names.

In [1]:

import pandas as pd
# Dataset available using filepath 'MessagesData/SpamHamMessages'
df = pd.read_table("MessagesData/SpamHamMessages", names=['label','message'])    

# Output printing out first 5 rows
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Step 1.2: Data Preprocessing ###

Now that we have a basic understanding of what our dataset looks like, let's convert our labels to binary variables (this is a binary classification problem): 0 to represent 'ham' (not spam) and 1 to represent 'spam', for ease of computation.

Our model would still be able to make predictions if we left our labels as strings, but we could run into issues later when calculating performance metrics.

>**TODO:**
* Convert the values in the 'label' column to numerical values using the `map` method with the dictionary `{'ham':0, 'spam':1}`. This maps the 'ham' value to 0 and the 'spam' value to 1.
* Also, to get an idea of the size of the dataset we are dealing with, print out the number of rows and columns using `shape`.

In [2]:
df['label'] = df.label.map({"ham":0,"spam":1})
df.shape

(5572, 2)

### Step 2.1: Bag of words ###

What we have here in our dataset is a large collection of text data (5,572 rows). Most ML algorithms expect numerical data as input, whereas our messages are text.

Here we'd like to introduce the Bag of Words (BoW) concept, a term for problems that involve a collection of text data that needs to be worked with. The basic idea of BoW is to take a piece of text and count the frequency of the words in that text. It is important to note that BoW treats each word individually, and the order in which the words occur does not matter.

Using a process which we will go through now, we can convert a collection of documents to a matrix, with each document being a row and each word (token) being a column, and the corresponding (row, column) values being the frequency of occurrence of each word or token in that document.

For example: 

Let's say we have 4 documents as follows:

`['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']`

Our objective here is to convert this set of text to a frequency distribution matrix, as follows:

<img src="images/countvectorizer.png" height="542" width="542">

Here as we can see, the documents are numbered in the rows, and each word is a column name, with the corresponding value being the frequency of that word in the document.

Let's break this down and see how we can do this conversion using a small set of documents.
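Before reaching for a library, it may help to see the conversion done by hand. The following is a minimal from-scratch sketch using only the Python standard library, applied to four short example documents. The variable names are illustrative, and the punctuation handling is a simplification rather than a copy of what scikit-learn does internally.

```python
import string
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

# Step 1: lowercase each document, strip punctuation, and split on whitespace
strip_punctuation = str.maketrans('', '', string.punctuation)
tokenized = [doc.lower().translate(strip_punctuation).split() for doc in documents]

# Step 2: count how often each word occurs in each document
counts = [Counter(tokens) for tokens in tokenized]

# Step 3: the sorted set of all words becomes the column order
vocabulary = sorted(set().union(*tokenized))

# Step 4: one row per document, one column per vocabulary word
matrix = [[doc_counts[word] for word in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` is the word-frequency vector of one document, which is the representation the rest of this section builds with a library.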

To handle this, we will be using scikit-learn's
[CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) class, which does the following:

* It tokenizes the string(separates the string into individual words) and gives an integer ID to each token.
* It counts the occurrence of each of those tokens.

**Please Note:**

* CountVectorizer automatically converts all tokenized words to lower case so that it does not treat words like 'He' and 'he' differently. It does this using the `lowercase` parameter, which is set to `True` by default.

* It also ignores all punctuation so that words followed by a punctuation mark (for example: 'hello!') are not treated differently. It does this using the `token_pattern` parameter, whose default regular expression selects tokens of 2 or more alphanumeric characters.

* The third parameter to take note of is the `stop_words` parameter. Stop words are the most commonly used words in a language. They include words like 'am', 'an', 'and', 'the' etc. By setting this parameter to `'english'`, CountVectorizer will automatically ignore all words (from our input text) that are found in scikit-learn's built-in list of English stop words. This can be helpful, as stop words can skew our calculations when we are trying to find key words that are indicative of spam.
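To make the three parameters above more tangible, here is a small standard-library sketch that imitates their effect on a single string. The regular expression is CountVectorizer's documented default `token_pattern`; the sample sentence and the stop-word set `toy_stop_words` are invented for illustration and are much smaller than scikit-learn's built-in English list.

```python
import re

text = "Hello!! I WIN a free prize, and you?"

# lowercase=True: fold everything to lower case before tokenizing
lowered = text.lower()

# token_pattern: the default pattern keeps runs of 2 or more word characters,
# so punctuation and one-letter tokens such as 'i' and 'a' are dropped
tokens = re.findall(r"(?u)\b\w\w+\b", lowered)
print(tokens)    # ['hello', 'win', 'free', 'prize', 'and', 'you']

# stop_words: a tiny hand-picked set, NOT scikit-learn's built-in English list
toy_stop_words = {"and", "you", "the"}
filtered = [token for token in tokens if token not in toy_stop_words]
print(filtered)  # ['hello', 'win', 'free', 'prize']
```

Note that the one-letter words disappear because of the token pattern alone, before any stop-word filtering takes place.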

In [3]:
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
count_vector.fit(documents)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_vector.get_feature_names_out().tolist()

['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

In [4]:
doc_array = count_vector.transform(documents)
doc_array.toarray()

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

In [5]:
frequency_matrix = pd.DataFrame(doc_array.toarray(), columns=count_vector.get_feature_names_out())
frequency_matrix

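| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |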


Congratulations! You have successfully implemented the Bag of Words approach for a small document dataset that we created.

### Step 3.1: Training and testing sets ###
Now that we understand how to deal with the BoW problem, we can get back to our dataset and split it into training and testing sets to use with our model.

>>**TODO:**
Split the dataset into a training and testing set by using the `train_test_split` method in sklearn. Split the data using the following variables:
* `X_train` is our training data for the 'message' column.
* `y_train` is our training data for the 'label' column.
* `X_test` is our testing data for the 'message' column.
* `y_test` is our testing data for the 'label' column.

Print out the number of rows we have in each of our training and testing sets.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


### Step 3.2: Applying Bag of Words processing to our dataset ###

Our mission now is to apply BoW to our dataset, as we did before,
using `CountVectorizer()`.

**TODO:**
   * Fit `CountVectorizer()` on our training data (`X_train`) and return the matrix.
   * Transform our testing data (`X_test`) with the fitted vectorizer to return the matrix.

In [7]:
# Instantiate the CountVectorizer
count_vector = CountVectorizer()

# Fit on the training data (this learns the vocabulary) and return the document-term matrix
training_data = count_vector.fit_transform(X_train)

# Transform the testing data and return the matrix. Note that we do not fit the CountVectorizer on the testing data
testing_data = count_vector.transform(X_test)
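
Why fit only on the training data? The sketch below imitates the fit and transform steps with the standard library to show what happens to a word that was never seen during fitting. The messages and the helper names `tokenize` and `transform` are invented for illustration and are not scikit-learn's implementation.

```python
import re

def tokenize(text):
    # Lowercase and keep tokens of 2 or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

train_messages = ["free cash prize", "call me tomorrow"]  # invented examples
test_messages = ["free lottery tomorrow"]

# "Fit": learn the vocabulary from the training messages only
vocabulary = sorted({word for msg in train_messages for word in tokenize(msg)})

def transform(messages):
    # "Transform": count only the words that are already in the learned vocabulary
    return [[tokenize(msg).count(word) for word in vocabulary] for msg in messages]

print(vocabulary)                # ['call', 'cash', 'free', 'me', 'prize', 'tomorrow']
print(transform(test_messages))  # [[0, 0, 1, 0, 0, 1]]
```

The unseen word 'lottery' is simply dropped, so the test matrix has the same columns as the training matrix, and nothing about the test set leaks into the vocabulary.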

### Step 5: Naive Bayes implementation using scikit-learn ###

We will be using scikit-learn's `sklearn.naive_bayes` module to make predictions on our dataset.

Specifically, we will be using the **multinomial Naive Bayes** implementation. This particular classifier is suitable for classification with discrete features (such as, in our case, word counts for text classification). It takes in integer word counts as its input. On the other hand, Gaussian Naive Bayes is better suited for continuous data, as it assumes that the input data has a Gaussian (normal) distribution.
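
To give a feel for what a multinomial Naive Bayes classifier computes from word counts, here is a compact from-scratch sketch. It is a simplified illustration and not scikit-learn's code: the four training messages are invented, the function names are hypothetical, and `alpha = 1.0` is the additive (Laplace) smoothing value, matching scikit-learn's default for `MultinomialNB`.

```python
import math
import re
from collections import Counter

# Toy training data invented for illustration (1 = spam, 0 = ham)
train = [
    ("win free cash now", 1),
    ("free prize claim now", 1),
    ("are you coming home", 0),
    ("see you at lunch", 0),
]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

# Count words per class and messages per class
word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(tokenize(text))
    doc_counts[label] += 1

vocabulary = set(word_counts[0]) | set(word_counts[1])
alpha = 1.0  # additive smoothing so unseen (word, class) pairs never get probability 0

def log_score(text, label):
    # log P(label) + sum over words of log P(word | label)
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in tokenize(text):
        if word not in vocabulary:
            continue  # words outside the training vocabulary are ignored
        count = word_counts[label][word]
        score += math.log((count + alpha) / (total_words + alpha * len(vocabulary)))
    return score

def predict(text):
    # Pick the class with the higher log score
    return max((0, 1), key=lambda label: log_score(text, label))

print(predict("free cash prize"))  # 1
print(predict("see you at home"))  # 0
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow for long messages.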

In [8]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [9]:
predictions = naive_bayes.predict(testing_data)

Now that predictions have been made on our test set, we need to check the accuracy of our predictions.

### Step 6.1: Evaluating our model ###

**Accuracy** measures how often the classifier makes the correct prediction. It’s the ratio of the number of correct predictions to the total number of predictions (in the test dataset).

**Precision** tells us what proportion of the messages we classified as spam actually were spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all positive predictions (true positives plus false positives).

**Recall (sensitivity)** tells us what proportion of the messages that actually were spam were classified by us as spam.
It is the ratio of true positives to all the messages that were actually spam (true positives plus false negatives).

**F1 score** is the harmonic mean of the precision and recall scores. It ranges from 0 to 1, with 1 being the best possible F1 score.
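
The four definitions can be written directly in terms of confusion-matrix counts. The counts below are hypothetical, picked only to keep the arithmetic easy to follow; they are not this model's results.

```python
# Hypothetical confusion-matrix counts for the spam (positive) class
tp = 8    # spam correctly predicted as spam
fp = 2    # ham wrongly predicted as spam
fn = 4    # spam wrongly predicted as ham
tn = 86   # ham correctly predicted as ham

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this made-up case accuracy is high mostly because ham dominates, while recall shows that a third of the spam slipped through. That is why accuracy alone can be misleading on imbalanced data.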

In [10]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy score: ', format(accuracy_score(y_test, predictions))) 
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562


### Step 6.2: Let's try our model in production! ###
Now that we have created, trained, and tested our model, let's try it on new messages that we write ourselves, to understand how it would be used in production.

In [11]:
def spam_ham_classification(text):
    # The vectorizer expects an iterable of documents, so wrap the string in a list
    prepro_txt = count_vector.transform([text])  # convert the text to a BoW vector
    # predict() returns an array with one label per document; take the first one
    predict_label = naive_bayes.predict(prepro_txt)[0]
    # Return spam or not spam based on our prediction:
    # ham is 0 and spam is 1, as we mapped above
    if predict_label == 1:
        return "Spam"
    else:
        return "Not Spam"

In [12]:
spam_ham_classification("Hello my son I miss you")

'Not Spam'

In [13]:
spam_ham_classification("win free cash open the link")

'Spam'
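
In a production setting it can be convenient to bundle the vectorizer and the classifier into a single object, so the two cannot get out of sync. The following is a minimal sketch using scikit-learn's `make_pipeline`, trained on four toy messages invented for illustration. The names `spam_pipeline` and `classify` are hypothetical and not part of the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data invented for illustration (1 = spam, 0 = ham)
messages = ["win free cash now",
            "free prize claim now",
            "are you coming home",
            "see you at lunch"]
labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and then classifies it in one call
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(messages, labels)

def classify(text):
    # predict() takes a list of raw strings and returns one label per string
    label = spam_pipeline.predict([text])[0]
    return "Spam" if label == 1 else "Not Spam"

print(classify("free cash prize"))
print(classify("see you at home"))
```

Because the pipeline accepts raw strings, the caller no longer has to remember to run the same fitted vectorizer before calling the classifier.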

### Step 7: Conclusion ###

One of the major advantages of Naive Bayes is that it rarely overfits the data. Another important advantage is that its model training and prediction times are very fast for the amount of data it can handle. All in all, Naive Bayes really is a gem of an algorithm!

Congratulations! You have successfully designed a model that can predict whether a message is spam or not!

Thank you for reaching the end!