## Background ##
  
   Automatic email classification distinguishes between categories of email. Its most common use is filtering unsolicited email, known as spam, which improves people's working efficiency. The main goal of a classification algorithm here is to separate spam or unwanted emails from useful ones. An effective machine learning classifier can detect spam that carries annoying advertisements and unneeded information, and filter it with high accuracy. Different classifiers trained to separate spam from wanted email produce different results even when they use the same dataset.
   
   To make email classification effective, a typical content-based spam filter works on the words and phrases inside the email text: when certain keywords appear in an email, it is classified as spam. The goal of this project is to find the most effective algorithm. We apply six machine learning algorithms (Multinomial Naive Bayes, Bernoulli Naive Bayes, K Neighbors, Ridge Regression, Support Vector Machine and Random Forest) to the text and analyze their performance on email classification.

## Implementation ##
***

### 1. Training Classifiers

#### 1.1 Import 
***

In [30]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from time import time
from pprint import pprint
import matplotlib
matplotlib.use("TkAgg")
from matplotlib import pyplot as plt
import matplotlib as mpl
from mpl_toolkits.axes_grid1 import host_subplot
import mpl_toolkits.axisartist as AA
import joblib  # formerly sklearn.externals.joblib, which recent scikit-learn releases no longer ship
import pickle

#### 1.2 Get Dataset
We use the built-in function **fetch_20newsgroups** from **sklearn.datasets** to download the 20 Newsgroups data for four chosen categories (alt.atheism, talk.religion.misc, comp.graphics, sci.space). The helper below fetches the training set and the test set separately and returns both.
***

In [31]:
from sklearn.datasets import fetch_20newsgroups

def get_data():
    remove = ()
    categories = 'alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'

    print('start downloading...')
    t_start = time()

    data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=0, remove=remove)
    data_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=0, remove=remove)

    t_end = time()
    print('downloading completed，take %.3f sec' % (t_end - t_start))

    return data_train, data_test

data_train, data_test = get_data()


start downloading...
downloading completed，take 0.506 sec


**Explore the dataset**

Both *data_train* and *data_test* are **Bunch** objects, which have the following attributes:

> 1. bunch.data: list
> 2. bunch.target: array
> 3. bunch.filenames: list
> 4. bunch.DESCR: a description of the dataset
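
To make the attribute list concrete without downloading anything, the sketch below builds a small **Bunch** by hand. The two documents and their labels are made up for illustration; only the attribute names follow the list above.

```python
from sklearn.utils import Bunch

# A hand-built stand-in for the object returned by fetch_20newsgroups;
# the documents and labels here are invented for illustration.
toy = Bunch(
    data=["first raw text document", "second raw text document"],
    target=[0, 1],
    target_names=["alt.atheism", "comp.graphics"],
)

# Attribute access and key access reach the same underlying values.
print(toy.data[0])
print(toy["target"])

# target holds integer indices into target_names.
print([toy.target_names[i] for i in toy.target])
```

A Bunch behaves like a dictionary whose keys are also exposed as attributes, which is why both `toy.data` and `toy["data"]` work.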

Let's take a look at the first raw text document

In [32]:
print(data_train.data[0][:500], '...')

From: healta@saturn.wwc.edu (Tammy R Healy)
Subject: Re: note to Bobby M.
Lines: 52
Organization: Walla Walla College
Lines: 52

In article <1993Apr14.190904.21222@daffy.cs.wisc.edu> mccullou@snake2.cs.wisc.edu (Mark McCullough) writes:
>From: mccullou@snake2.cs.wisc.edu (Mark McCullough)
>Subject: Re: note to Bobby M.
>Date: Wed, 14 Apr 1993 19:09:04 GMT
>In article <1993Apr14.131548.15938@monu6.cc.monash.edu.au> darice@yoyo.cc.monash.edu.au (Fred Rice) writes:
>>In <madhausC5CKIp.21H@netcom.co ...


In [33]:
print('data type:', type(data_train))
print('# of texts in train set:', len(data_train.data))
print('# of texts in test set:', len(data_test.data))
print('names of the %d categories:' % len(data_train.target_names))

pprint(data_train.target_names)

data type: <class 'sklearn.utils.Bunch'>
# of texts in train set: 2034
# of texts in test set: 1353
names of the 4 categories:
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


#### 1.3 Preprocessing data
Now the dataset has been split into training and test sets. To train classifiers on this data, the features and the corresponding labels are extracted with two methods, **get_y_data()** and **tfidf_data()**:

> 1. **get_y_data(data_train, data_test)**: takes the two datasets and returns their label arrays.
> 2. **tfidf_data(data_train, data_test)**: takes the two datasets and returns their TF-IDF matrices.

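**TF-IDF** (term frequency, inverse document frequency) weights each term in a document by how often it occurs in that document, scaled down by how many documents in the dataset contain it. Terms that are frequent in one text but rare across the dataset therefore receive the largest weights.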
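
As a rough illustration of the TF-IDF weighting, here is a minimal pure-Python sketch. It assumes the smoothed IDF, sublinear TF and L2 normalization that **TfidfVectorizer** documents for the options used in this project; whitespace tokenization and the omission of stop words and **max_df** are simplifications, and the three documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sublinear term frequency: 1 + ln(count) instead of the raw count.
        raw = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # Scale the vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = ["space shuttle launch", "space image render", "image render render"]
vectors = tfidf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

In the first document, "shuttle" occurs in only one document while "space" occurs in two, so "shuttle" ends up with the larger weight even though both appear once.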

Additionally, to run the classifiers on another dataset, the new data has to be preprocessed in exactly the same way as the original dataset. This requires saving the **TfidfVectorizer** after fitting it on data_train, so that new raw text is transformed into **TF-IDF** vectors by this very vectorizer. As with the classifiers, Python's built-in **pickle** module is used to save the vectorizer, in the file **vec.pickle**.
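
The save-and-reload round trip can be sketched with the standard library alone. The dictionary below is only a stand-in for a fitted vectorizer, and the temporary directory is an assumption so that the sketch does not overwrite a real **vec.pickle**.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted vectorizer: a term -> column-index vocabulary.
fitted_vocabulary = {"graphics": 0, "orbit": 1, "space": 2}

path = os.path.join(tempfile.mkdtemp(), "vec.pickle")

# Write in binary mode; the with-block closes the file even on error.
with open(path, "wb") as fh:
    pickle.dump(fitted_vocabulary, fh)

# Read back in binary mode as well; Python 3 cannot unpickle from a
# file that was opened in text mode.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == fitted_vocabulary)
```

The restored object is an equal copy, so a vectorizer reloaded this way maps words to the same columns that the classifiers were trained on.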

***

In [34]:
def get_y_data(data_train, data_test):

    y_train = data_train.target
    y_test = data_test.target

    return y_train, y_test

def tfidf_data(data_train, data_test):

    vectorizer = TfidfVectorizer(input='content', stop_words='english', max_df=0.5, sublinear_tf=True)

    vec = vectorizer.fit(data_train.data)
    # persist the fitted vectorizer so new text can be transformed the same way
    with open("vec.pickle", "wb") as f:
        pickle.dump(vec, f)
    x_train = vectorizer.transform(data_train.data)  # x_train is a sparse scipy CSR matrix
    x_test = vectorizer.transform(data_test.data)

    return x_train, x_test

y_train, y_test = get_y_data(data_train, data_test)
x_train, x_test = tfidf_data(data_train, data_test)

In [35]:
print(' -- Examples : the first 2 texts -- ')

categories = data_train.target_names

for i in np.arange(2):
    print('----------------')
    print('category for text%d : %s\n' % (i + 1, categories[y_train[i]]))
    print(data_train.data[i][:500])
    print('...')
    print('\n\n')


 -- Examples : the first 2 texts -- 
----------------
category for text1 : alt.atheism

From: healta@saturn.wwc.edu (Tammy R Healy)
Subject: Re: note to Bobby M.
Lines: 52
Organization: Walla Walla College
Lines: 52

In article <1993Apr14.190904.21222@daffy.cs.wisc.edu> mccullou@snake2.cs.wisc.edu (Mark McCullough) writes:
>From: mccullou@snake2.cs.wisc.edu (Mark McCullough)
>Subject: Re: note to Bobby M.
>Date: Wed, 14 Apr 1993 19:09:04 GMT
>In article <1993Apr14.131548.15938@monu6.cc.monash.edu.au> darice@yoyo.cc.monash.edu.au (Fred Rice) writes:
>>In <madhausC5CKIp.21H@netcom.co
...



----------------
category for text2 : comp.graphics

From: ch381@cleveland.Freenet.Edu (James K. Black)
Subject: NEEDED: algorithms for 2-d & 3-d object recognition
Organization: Case Western Reserve University, Cleveland, OH (USA)
Lines: 23
Reply-To: ch381@cleveland.Freenet.Edu (James K. Black)
NNTP-Posting-Host: hela.ins.cwru.edu


Hi,
         I have a friend who is working on 2-d and 3-d object re

In [36]:
print('# of train set: %d, # of features: %d' % x_train.shape)

# of train set: 2034, # of features: 33809


#### 1.4 Train Classifier ####
***

The method **classifier** trains six different classifiers (Multinomial Naive Bayes classifier, Bernoulli Naive Bayes classifier, K Neighbors classifier, Ridge Regression classifier, Random Forest classifier, Support Vector Machine classifier) on the training data.

It then returns an array with the results for each classifier: its error rate, training time and testing time.

In [37]:
def classifier(x, y):
    print('\n\n===================\n evaluation of classifiers：\n')
    clfs = {"MultinomialNB": MultinomialNB(), 
            "BernoulliNB": BernoulliNB(),  
            "K_Neighbors": KNeighborsClassifier(),  
            "Ridge_Regression": RidgeClassifier(),  
            "RandomForest": RandomForestClassifier(n_estimators=200),  
            "SVC": SVC()  
            }
    result = []
    for name,clf in clfs.items():
        a = test_clf(name, clf, x, y)
        result.append(a)
        print('\n')
    return np.array(result)

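**test_clf** takes one of the classifiers as input. It wraps the classifier in a **GridSearchCV** object and selects the hyperparameter grid to search according to the type of the classifier (e.g. the grid **neighbors_can** is used when the classifier is the **K Neighbors classifier**).

Then it fits the grid search, which trains the classifier with 5-fold cross-validation for every candidate setting.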
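
The grid search step can be tried in isolation. The sketch below uses the small iris dataset bundled with scikit-learn purely as a stand-in for the TF-IDF features, and tunes only **n_neighbors** with the same candidate range as **neighbors_can**.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small bundled dataset used only for illustration (no download needed).
X, y = load_iris(return_X_y=True)

# Candidate values for the single hyperparameter being tuned.
neighbors_can = np.arange(1, 15)

# cv=5: each candidate is scored by 5-fold cross-validation.
model = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": neighbors_can},
                     cv=5)
model.fit(X, y)

# 14 candidates x 5 folds = 70 fits, plus one final refit on all the data.
print("candidates tried:", len(model.cv_results_["params"]))
print("best setting:", model.best_params_)
print("mean CV accuracy of best setting: %.3f" % model.best_score_)
```

This is also why the training time in **test_clf** is divided by 5 times the number of candidates: the measured time covers every fold of every candidate setting.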

Six classifiers are trained and tested. To run these trained classifiers on another email dataset, they are saved after training so that they can be loaded promptly later.

For saving, the **joblib** module comes in handy (older scikit-learn releases bundled it as **sklearn.externals.joblib**). Given a trained classifier **clf**, simply run **joblib.dump(clf, filename)** to save it. The saving step is embedded in the method **test_clf**, and after training, six **joblib** files are saved:

> 1. MultinomialNB.joblib
> 2. RandomForest.joblib
> 3. Ridge_Regression.joblib
> 4. BernoulliNB.joblib
> 5. K_Neighbors.joblib
> 6. SVC.joblib
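
A minimal sketch of the save-and-load cycle with **joblib** follows. It assumes the standalone joblib package is installed, uses the bundled iris data as a stand-in for the TF-IDF features, and writes to a temporary directory so that no real model file is overwritten.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.naive_bayes import MultinomialNB

# Iris features are non-negative, as MultinomialNB expects.
X, y = load_iris(return_X_y=True)
clf = MultinomialNB().fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "MultinomialNB.joblib")
joblib.dump(clf, path)           # save the fitted estimator
restored = joblib.load(path)     # load it back, e.g. in a later session

# The restored model should reproduce the original predictions exactly.
print((restored.predict(X) == clf.predict(X)).all())
```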

In [None]:
def test_clf(name, clf, x_train, y_train):
    # note: x_test and y_test are read from the notebook's global scope

    print('Classifier：', clf)

    # default grid over alpha; replaced below according to the classifier type
    alpha_can = np.logspace(-3, 2, 10)
    model = GridSearchCV(clf, param_grid={'alpha': alpha_can}, cv=5)
    m = alpha_can.size  # number of candidate settings in the grid

    if hasattr(clf, 'alpha'):
        model.set_params(param_grid={'alpha': alpha_can})
        m = alpha_can.size
    if hasattr(clf, 'n_neighbors'):
        neighbors_can = np.arange(1, 15)
        model.set_params(param_grid={'n_neighbors': neighbors_can})
        m = neighbors_can.size
    if hasattr(clf, 'C'):
        C_can = np.logspace(1, 3, 3)
        gamma_can = np.logspace(-3, 0, 3)
        model.set_params(param_grid={'C': C_can, 'gamma': gamma_can})
        m = C_can.size * gamma_can.size
    if hasattr(clf, 'max_depth'):
        max_depth_can = np.arange(4, 10)
        model.set_params(param_grid={'max_depth': max_depth_can})
        m = max_depth_can.size

    # approximate time of a single fit: total time / (5 folds * m candidates)
    t_start = time()
    model.fit(x_train, y_train)
    t_end = time()
    t_train = (t_end - t_start) / (5*m)

    print('Training time for 5 -fold cross validation：%.3f/(5*%d) = %.3fsec' % ((t_end - t_start), m, t_train))
    print('Optimal hyperparameter：', model.best_params_)

    # save the fitted grid search (it predicts with the best estimator)
    joblib.dump(model, "%s.joblib" % name)

    t_start = time()
    y_hat = model.predict(x_test)
    t_end = time()
    t_test = t_end - t_start
    print('Testing Time：%.3f sec' % t_test)

    acc = metrics.accuracy_score(y_test, y_hat)
    print('Accuracy ：%.2f%%' % (100 * acc))

    # derive a short display name from the estimator's repr
    name = str(clf).split('(')[0]
    index = name.find('Classifier')
    if index != -1:
        name = name[:index]
    if name == 'SVC':
        name = 'SVM'

    return t_train, t_test, 1-acc, name

# run the evaluation only after test_clf has been defined
result = classifier(x_train, y_train)



 evaluation of classifiers：

Classifier： MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Training time for 5 -fold cross validation：0.351/(5*10) = 0.007sec
Optimal hyperparameter： {'alpha': 0.003593813663804626}
Testing Time：0.001 sec
Accuracy ：89.58%


Classifier： BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
Training time for 5 -fold cross validation：0.492/(5*10) = 0.010sec
Optimal hyperparameter： {'alpha': 0.001}
Testing Time：0.003 sec
Accuracy ：88.54%


Classifier： KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
Training time for 5 -fold cross validation：11.115/(5*14) = 0.159sec
Optimal hyperparameter： {'n_neighbors': 3}
Testing Time：0.134 sec
Accuracy ：86.03%


Classifier： RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, random_state=None, solver='auto',
      

#### 1.5 Render Results ####
***

**draw** renders a diagram using results from previous method to evaluate results of different classifiers by listing the **error rate, training time and testing time** of each classifier.
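
One detail worth seeing in isolation is the layout of the result array. Each row mixes three numbers with a name, so NumPy stores every entry as a string, and the numeric columns have to be converted back before plotting. The rows below are made-up values in that layout, not measured results.

```python
import numpy as np

# Made-up rows in the layout (training time, testing time, error rate, name).
rows = [(0.007, 0.001, 0.10, "MultinomialNB"),
        (0.159, 0.134, 0.14, "KNeighbors")]
result = np.array(rows)

# Mixing floats and strings makes NumPy store every entry as a string.
print(result.dtype.kind)        # 'U' stands for a unicode string dtype

# Transposing gives one row per quantity, which unpacks into four arrays.
time_train, time_test, err, names = result.T
err = err.astype(float)         # convert back to numbers before plotting
print(err, list(names))
```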

In [None]:
def draw(result):
    # result rows are (training time, testing time, error rate, name);
    # np.array stored them as strings, so convert the numeric columns back
    time_train1, time_test1, err1, names = result.T
    time_test = time_test1.astype(float)
    time_train = time_train1.astype(float)
    err = err1.astype(float)
    x = np.arange(len(time_train))
    bar_width = 0.25
    ax1 = host_subplot(111, axes_class=AA.Axes)
    plt.subplots_adjust(right=0.75)

    # two extra y-axes sharing the x-axis, one per timing measure
    ax2 = ax1.twinx()
    ax3 = ax1.twinx()
    offset3 = 60
    offset2 = 0

    new_fixed_axis = ax3.get_grid_helper().new_fixed_axis
    ax3.axis["right"] = new_fixed_axis(loc="right", axes=ax3, offset=(offset3, 0))
    ax3.axis["right"].toggle(all=True)

    new_fixed_axis2 = ax2.get_grid_helper().new_fixed_axis
    ax2.axis["right"] = new_fixed_axis2(loc="right", axes=ax2, offset=(offset2, 0))
    ax2.axis["right"].toggle(all=True)

    ax1.set_ylabel("Error percentage")
    ax2.set_ylabel("Training time")
    ax3.set_ylabel("Testing time")

    b1 = ax1.bar(x, err, bar_width, alpha=0.2, color='r')
    b2 = ax2.bar(x + bar_width, time_train, bar_width, alpha=0.2, color='g')
    b3 = ax3.bar(x + bar_width * 2, time_test, bar_width, alpha=0.2, color='b')
    plt.xticks(x + bar_width * 2, names)
    plt.legend([b1[0], b2[0], b3[0]], ('Error Percentage', 'Training Time', 'Testing Time'), loc='upper left')
    plt.xlabel('Different Types Of Classifiers')
    plt.title('Evaluation Of Different Classifiers')
    plt.savefig("Performance_his.png")
    plt.show()

# call draw only after it has been defined
draw(result)

### 2. Running classifiers
***

First specify the filename of the raw text and transform it into the **TF-IDF** vector using the method **get_tfidf**.

#### 2.1 Loading the vectorizer

The vectorizer was fitted and saved in the file **vec.pickle**. To load it, open the file in binary mode and pass the file object to **pickle.load**. The raw text can then be transformed into a **TF-IDF** vector using this vectorizer.
***

In [None]:
import joblib  # formerly sklearn.externals.joblib, which recent scikit-learn releases no longer ship
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

run_file = "test.txt" # Please specify the file name you want to test

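def get_tfidf(filename):
    # load the vectorizer fitted on the training data (binary mode is required)
    with open("vec.pickle", "rb") as f:
        vectorizer = pickle.load(f)
    # read the whole raw text file into one string
    with open(filename) as fl:
        text = fl.read()
    return vectorizer.transform([text])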

x = get_tfidf(run_file)

print(x)

#### 2.2 Loading classifiers

The raw text has been transformed into a **TF-IDF** vector and can now be classified using the saved classifiers. Define a method **run(x, clf_name)**, where **x** is a **TF-IDF** vector and **clf_name** is the name of a saved classifier (without the **.joblib** extension). It returns the predicted label for the given x.
***

In [None]:
def run(x, clf_name):
	clf_name = clf_name + ".joblib"
	clf = joblib.load(clf_name)
	return clf.predict(x)

In [None]:
print("The classification for %s is: \n" % run_file)
clfs = ("MultinomialNB", "BernoulliNB",  "K_Neighbors",  "Ridge_Regression",  "RandomForest",  "SVC")
cat = ('alt.atheism', 'comp.graphics','sci.space', 'talk.religion.misc')
for clf in clfs:
    print("%s: %s\n" % (clf, cat[run(x, clf)[0]]))
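
The whole path from raw text to a category name can also be sketched in memory, without any saved files. The four training sentences and their labels are invented for illustration, and only two of the category names are used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up miniature corpus; labels index into target_names.
target_names = ["comp.graphics", "sci.space"]
train_texts = ["render the polygon with a shader",
               "the image uses 3d graphics rendering",
               "the rocket reached orbit",
               "nasa launched a new satellite into orbit"]
train_labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer(stop_words="english")
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# transform (not fit_transform) reuses the vocabulary learned above.
x_new = vectorizer.transform(["satellite orbit"])
predicted = clf.predict(x_new)   # array holding one integer class index
print(target_names[predicted[0]])
```

**predict** returns an array of integer class indices, one per input row, which is why the loop above takes element `[0]` and looks it up in the tuple of category names.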