### Machine Learning Technique 2 - Classification

# 1. Document classification - Spam or Not Spam Emails 

This notebook demonstrates text classification using basic scikit-learn functionality. 
Document classification has many applications, such as filtering spam, detecting languages, and classifying genres.

In this example we will build a simple spam filter that classifies the text of emails into two classes: spam or not spam (a.k.a. [ham](https://en.wiktionary.org/wiki/ham_e-mail)). 

While the filters in services like Gmail are far more advanced, the model we will have by the end of this lesson is effective and quite accurate.

#### Data:

The data has already been fetched and transformed for you. You just have to download the CSV file from your OneDrive at `/email_data/emails.csv`.

FYI: the original data comes from the following websites:

- [Enron-Spam](http://www.aueb.gr/users/ion/data/enron-spam/)
- [SpamAssassin](https://spamassassin.apache.org/publiccorpus/)

#### Links:

- [Tutorial this lesson is based on](http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html)


# 2. Import Libraries

In [None]:
# Starting by importing our beloved libraries: pandas, numpy, matplotlib.pyplot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

# 3. Load the Data

We are now going to load `emails.csv`, which contains all the email data. This dataset is already labelled, which means that each email carries the label `spam` or `ham`. 

*(BTW, ham is a relatively new synonym for "not spam")*

In [None]:
import sys
import csv

# The CSV contains very large fields, so we may need to raise the csv module's field size limit.
#csv.field_size_limit(sys.maxsize)
csv.field_size_limit(500 * 1024 * 1024)

# sep=None lets pandas detect the delimiter; this needs the Python parsing engine
email_data = pd.read_csv('../data/classification_data/emails.csv', sep=None, engine='python', encoding='utf-8')

In [None]:
# See the data


# 4. Explore

In [None]:
# length of the data
print('Number of Emails: {}'.format( ... ))

In [None]:
# Count the number of spam email and the number of ham emails


Great, we have a roughly 50/50 distribution of spam and ham emails. 

When doing classification with labelled data, it is best to have a training dataset in which each class is equally represented.
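
As a minimal sketch of how the class balance could be checked, the snippet below counts labels on a tiny made-up DataFrame. The `toy_emails` data is hypothetical; only the `text` and `class` column names are taken from this notebook.

```python
import pandas as pd

# Toy stand-in for the real email DataFrame (hypothetical data)
toy_emails = pd.DataFrame({
    'text': ['Win a free prize now', 'Meeting at 10am',
             'Cheap pills online', 'Lunch tomorrow?'],
    'class': ['spam', 'ham', 'spam', 'ham'],
})

# Absolute number of emails per class
class_counts = toy_emails['class'].value_counts()
print(class_counts)

# Share of each class, handy for spotting an imbalanced dataset
class_shares = toy_emails['class'].value_counts(normalize=True)
print(class_shares)
```

If one class dominated the shares, plain accuracy would become a misleading measure of the classifier's quality.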

## 4.1 Check some Emails

In [None]:


# Random number between 0 and #emails



Note that there is a lot of formatting code ([HTML](https://www.w3schools.com/html/)).

We might need some serious **pre-processing to remove all those things**.

## 4.2 Length of Emails

Just for fun, let's also check the length of the emails.

We are going to add the length of the raw email data as a column and then compute some statistics.
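
One possible way to do this, sketched on hypothetical toy data, is to apply `len` to every text and store the result in a new column. The column name `length` is an assumption here.

```python
import pandas as pd

# Toy stand-in for the real email DataFrame (hypothetical data)
toy_emails = pd.DataFrame({
    'text': ['Win a free prize now', 'Meeting at 10am', 'Hi'],
    'class': ['spam', 'ham', 'ham'],
})

# Character length of each raw email, computed with a lambda expression
toy_emails['length'] = toy_emails['text'].apply(lambda text: len(text))

# Show the first rows (at most 10) including the new column
print(toy_emails.head(10))
```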

In [None]:
# First, we can try to compute the length for each email using a lambda expression


In [None]:
# Now, we can add that to a new column


# Then print the first 10


We can now make use of a useful function that we have not seen yet: the [describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) function.

Let's just see it in action.
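
As a small illustration on made-up numbers (the `toy_emails` frame and its values are hypothetical), grouping by `class` and calling `describe()` on a numeric column yields one row of summary statistics per class:

```python
import pandas as pd

# Toy data: two spam and two ham emails with invented lengths
toy_emails = pd.DataFrame({
    'class': ['spam', 'spam', 'ham', 'ham'],
    'length': [120, 80, 40, 60],
})

# One row per class with count, mean, std, min, quartiles and max
length_stats = toy_emails.groupby('class')['length'].describe()
print(length_stats)
```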

In [None]:
# First we group our data by `class`, select the `length` column (it must be numerical),
# and then we compute some statistics.


# 5. Pre-Processing

We have seen that the data is a little dirty and contains a lot of HTML. Let's clean this up! What? Does it look like real garbage to you? It is not pretty, that is for sure, but fortunately for us, there are some amazing libraries out there.
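
The library used for cleaning is not named at this point; BeautifulSoup is a common choice for this kind of task. As a dependency-free sketch of the idea, the snippet below strips tags with the standard library's `html.parser` instead. `TextExtractor` and `strip_html` are hypothetical helper names, not part of this notebook.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that keeps only the text nodes of an HTML string."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every piece of text found between tags
        self.chunks.append(data)


def strip_html(raw_html):
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace (including newlines) into single spaces
    return ' '.join(''.join(parser.chunks).split())


sample = '<html><body><p>Hello <b>world</b>,</p>\n<p>click here!</p></body></html>'
print(strip_html(sample))
```

Note that this simple approach also keeps the contents of `<script>` and `<style>` elements, which a dedicated HTML library can remove more reliably.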

In [None]:
# BEFORE
email_data['text'][2]

In [None]:
# AFTER


Great, looks much better!

Let's use this and add a **new column** for our DataFrame:

In [None]:
# Let's remove all \n


In [None]:
# Check the number of duplicates and/or empty
def check_data():
    print('Number of duplicates: {}'.format( ... ))
    print('Number of empty: {}'.format( ... ))
    


In [None]:
# Remove all empty rows


# 6. Slice / Split the data into training and validation

Now we are going to split the data in order to train and evaluate the model. 

We will perform a 90%/10% split for training and testing. The training set will be used to train the model, using a technique called cross-validation. 

In `sklearn` there is a class called [KFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) for this purpose. Cross-validation consists of splitting the data into k parts (folds); each part is then used once as validation data while the remaining k-1 parts are used for training.
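
How the split is made is not shown here. One common option, sketched below on hypothetical toy data, is `train_test_split` with `test_size=0.1`, followed by `KFold` on the training part. The names `toy_train`, `toy_test` and `fold_sizes` are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

# 20 made-up emails with alternating labels (hypothetical data)
toy_emails = pd.DataFrame({
    'text': ['email number {}'.format(i) for i in range(20)],
    'class': ['spam', 'ham'] * 10,
})

# Hold out 10% of the rows for the final test
toy_train, toy_test = train_test_split(toy_emails, test_size=0.1, random_state=42)
print('train:', len(toy_train), 'test:', len(toy_test))

# 5-fold cross-validation: each fold serves once as validation data
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [
    (len(train_idx), len(val_idx))
    for train_idx, val_idx in kfold.split(toy_train)
]
print(fold_sizes)
```

Every training row lands in exactly one validation fold, so the validation sizes add up to the size of the training set.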

In [None]:
print('Size of Training : {}'.format( len(email_train) ))
print('Size of Testing  : {}'.format( len(email_test) ))

# 7. Feature Extraction

Now comes the time to extract knowledge from the processed text. A machine learning algorithm cannot work on raw text: it needs numbers. This is why we extract features from the text, for example by counting how often each word occurs.

We will start by using a basic `CountVectorizer` to count each word.

**Links:**

- [CountVectorizer Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [None]:
# Import CountVectorizer


# Create a new CountVectorizer instance


# fit_transform learns the vocabulary dictionary of the dataframe and extracts word counts as features.


In [None]:
# Check the size of the matrix


The matrix returned by `fit_transform` is an `n*m` matrix, where:

- `n` is the number of documents (emails)
- `m` is the number of distinct words (the size of the vocabulary)

The matrix looks something like this:

|Document |word1|word2|word3|...|
|---|---|---|---|---|
|0        | 4| 8| 0| ... |
|1        | 0| 23| 5| ... |
|2        | 12| 3| 14| ... |
|...        | ...| ...| ...| ... |
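
To make the shape of this matrix concrete, here is a small sketch on three made-up documents. The `toy_docs` list is hypothetical; `vocabulary` and `dist` mirror the names used in the next cell.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny documents standing in for emails (hypothetical data)
toy_docs = ['free money free', 'meeting about money', 'free meeting']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_docs)

# n documents by m distinct words
print(counts.shape)

# Vocabulary words ordered by their column index in the matrix
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)

# Dense view of the sparse count matrix (fine for toy data only)
print(counts.toarray())

# Total number of occurrences of each word across all documents
dist = counts.toarray().sum(axis=0)
print(dict(zip(vocabulary, dist.tolist())))
```

On the real dataset the matrix is far too large to densify, so sums should be taken on the sparse matrix directly.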

In [None]:
# Get the vocabulary


# Sum up the counts of each vocabulary word


# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in sorted(zip(vocabulary, dist), key=lambda pair: pair[1], reverse=True):
    print(count, tag)

# 8. Create the Model - Classify Emails

## 8.1 Naive Bayes Classifier

The first classification algorithm will be one based on Naive Bayes. 

**Links:**

- [Documentation MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

- [Advanced Explanations 1](http://www.statsoft.com/textbook/naive-bayes-classifier)
- [Advanced Explanations 2](http://software.ucv.ro/~cmihaescu/ro/teaching/AIR/docs/Lab4-NaiveBayes.pdf)
- [Advanced Explanations 3](http://ix.cs.uoregon.edu/~dou/research/papers/icdm11_fw.pdf)
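
As a minimal sketch of the workflow, the snippet below trains `MultinomialNB` on four made-up emails and classifies two new ones. All texts and variable names are hypothetical, and a model trained on so little data is only good for illustrating the API.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny labelled training set (hypothetical data)
toy_texts = [
    'free viagra offer',
    'win free money today',
    'linux users group meeting',
    'python class notes for tomorrow',
]
toy_labels = ['spam', 'spam', 'ham', 'ham']

# Turn the texts into word counts
vectorizer = CountVectorizer()
toy_counts = vectorizer.fit_transform(toy_texts)

# Fit the Naive Bayes classifier on the counts and their labels
classifier = MultinomialNB()
classifier.fit(toy_counts, toy_labels)

# New texts must go through the same fitted vectorizer before prediction
new_texts = ['free money offer', 'linux meeting tomorrow']
new_counts = vectorizer.transform(new_texts)
toy_predictions = classifier.predict(new_counts)
print(toy_predictions)
```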

In [None]:
# Import the sklearn implementation of Naive Bayes




In [None]:
# Now let's check a few examples

examples = [
    "I'm going to attend the Linux users group tomorrow.", 
    'Free Viagra call today!', 
    'Python online classes'
]



# 9. Measure Performance

All right, we have a model, and now we should measure its performance and find out if the model is accurate or not.

We will define a few helper functions to print the accuracy of our model(s).

In [None]:
from sklearn.model_selection import KFold, cross_val_score

def cross_validate(classifier, x, true_labels, k=5):
    # Create our cross validator with k folds (5 by default)
    cv = KFold(n_splits=k, shuffle=True, random_state=42)

    # Compute the score of the classifier on each fold
    scores = cross_val_score(classifier, x, true_labels, cv=cv)
    
    # Print the mean score and its spread over the folds
    print("Accuracy: {:.2f} (+/- {:.2f}) and {} folds".format(scores.mean(), scores.std() * 2, k))

    return scores

The model has been trained on the training data. We now have to take the test data and find out the real accuracy:
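
One detail matters at this step: the test texts need the same columns as the training matrix, so the vectorizer fitted on the training data is reused with `transform` rather than fitted again. A small sketch on hypothetical texts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training and test texts (hypothetical data)
train_texts = ['free money now', 'meeting tomorrow morning']
test_texts = ['free meeting', 'unseen words only']

# Learn the vocabulary from the training texts only
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary: transform only, no new fit
test_counts = vectorizer.transform(test_texts)

# Both matrices have the same number of columns
print(train_counts.shape, test_counts.shape)

# Words never seen during training are simply ignored
print(test_counts.toarray())
```

Fitting a second vectorizer on the test data would produce a different vocabulary, and the trained classifier could not interpret its columns.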

In [None]:
# Transform the test data with the CountVectorizer fitted on the training data


In [None]:
# Predict our test classes


In [None]:
# Measure Accuracy of Test


## 9.1 Classification Report, Confusion Matrix, ...

In this section, we are going to see additional ways to evaluate our classifier. These evaluations only work with labelled data, because the predictions are compared against the true values in your data.

**Links:**

- [Classification Report Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
- [Confusion Matrix Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
- [Confusion Matrix, Precision and Recall Explained](http://docs.statwing.com/the-confusion-matrix-and-the-precision-recall-tradeoff/)
- [Confusion Matrix Terminology](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)
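
As a quick sketch of both functions on five made-up labels (`toy_true` and `toy_pred` are hypothetical), note that in scikit-learn's confusion matrix the rows are the true classes and the columns are the predicted classes:

```python
from sklearn.metrics import classification_report, confusion_matrix

# True labels and predictions for five imaginary emails
toy_true = ['ham', 'ham', 'spam', 'spam', 'spam']
toy_pred = ['ham', 'spam', 'spam', 'spam', 'ham']

# Fix the label order so that row/column 0 is ham and row/column 1 is spam
toy_cm = confusion_matrix(toy_true, toy_pred, labels=['ham', 'spam'])
print(toy_cm)

# Precision, recall and F1-score per class
print(classification_report(toy_true, toy_pred, labels=['ham', 'spam']))
```

Passing `labels` explicitly avoids any doubt about which row and column belong to which class.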

In [None]:
# Print the classification report


In [None]:
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(true_labels, predictions, classes):
    # Create Confusion Matrix
    cm = ...

    print('Confusion Matrix:')
    print(cm)
    
    # Plot    

    plt.matshow(cm, cmap=plt.cm.binary, interpolation='nearest')
    
    plt.xlabel('predicted class')
    plt.ylabel('expected class')
    
    tick_marks = np.arange( len(classes) )
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)

    plt.colorbar()

**Confusion Matrix Explained:**

- There are 2 possible predicted classes: *spam* or *ham*.
- In total we have `1443 + 1418 + 33 + 136 = 3030` predictions.
- Out of 3030 predictions, `1443 + 136 = 1579` have been predicted as *ham* and `33 + 1418 = 1451` as *spam*.
- In reality, `136 + 1418 = 1554` emails are *spam* and `1443 + 33 = 1476` are *ham*.

**Terms:**

Let's go into more detail and define the correct terms:

- True Positives (TP): We predicted *ham* (yes, it is safe), and the emails are *ham*.
- True Negatives (TN): We predicted *spam* (no, it is not safe), and the emails are *spam*.
- False Positives (FP): We predicted *ham* (yes), but the emails are *spam*.
- False Negatives (FN): We predicted *spam* (no), but the emails are *ham*.

Those terms map to the confusion matrix above like this:

![Image 1](http://rasbt.github.io/mlxtend/user_guide/evaluate/confusion_matrix_files/confusion_matrix_1.png)

**Accuracy:**

Overall, how often is the classifier correct?

    ( TP + TN ) / total = ( 1443 + 1418 ) / 3030 = 0.94

**Precision:**

When it predicts *ham* (yes), how often is it correct?

    TP / ( TP + FP ) = 1443 / (1443 + 136) = 0.91

**Recall:**

Recall is also called *Sensitivity* or *True Positive Rate*.
It describes: When it's actually *ham* (yes), how often does it predict *ham* (yes)?

    TP / ( TP + FN ) = 1443 / (1443 + 33) = 0.98 (rounded up)
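
The three formulas can be checked with plain arithmetic. The sketch below hard-codes the four counts from the confusion matrix discussed above, with *ham* as the positive class; the variable names are illustrative.

```python
# Counts from the confusion matrix above (ham is the positive class)
tp = 1443  # predicted ham, actually ham
fn = 33    # predicted spam, actually ham
fp = 136   # predicted ham, actually spam
tn = 1418  # predicted spam, actually spam

total = tp + tn + fp + fn

# Share of all predictions that are correct
accuracy = (tp + tn) / total

# Share of ham predictions that are really ham
precision = tp / (tp + fp)

# Share of real ham emails that were predicted as ham
recall = tp / (tp + fn)

print('Total    : {}'.format(total))
print('Accuracy : {:.2f}'.format(accuracy))
print('Precision: {:.2f}'.format(precision))
print('Recall   : {:.2f}'.format(recall))
```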



In [None]:
from sklearn.metrics import accuracy_score

print('Accuracy                        :', ...)
print('Accuracy from sklearn function  :', accuracy_score(email_test['class'], predictions))

In [None]:
print('Precision:', ... )

In [None]:
print('Recall:', ... )