## Implementation ##
***

### 1) Training Classifiers ###
***

#### 1.1) Import ####
***

In [5]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from time import time
from pprint import pprint
import matplotlib
matplotlib.use("TkAgg")
from matplotlib import pyplot as plt
import matplotlib as mpl
from mpl_toolkits.axes_grid1 import host_subplot
import mpl_toolkits.axisartist as AA
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
import pickle

#### 1.2) Get Dataset ####
***

We use **fetch_20newsgroups**, a built-in loader in **sklearn.datasets**, to download the 20 newsgroups data for four chosen categories (alt.atheism, talk.religion.misc, comp.graphics, sci.space). It is called once for the training subset and once for the testing subset, and both are returned.

If the files already exist locally, they are simply loaded from disk.

In [6]:
def get_data():

    remove = ()
    categories = 'alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'

    print('start downloading...')
    t_start = time()

    data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=0, remove=remove)
    data_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=0, remove=remove)

    t_end = time()
    print('downloading completed，take %.3f sec' % (t_end - t_start))

    return data_train, data_test


**print_data_info** prints the type, size, and categories of the training set and testing set.

In [7]:
def print_data_info(data_train, data_test):

    print('data type：', type(data_train))
    print('# of texts in train set ：', len(data_train.data))
    print('# of texts in test set：', len(data_test.data))
    print('name of%d categories：' % len(data_train.target_names))

    pprint(data_train.target_names)

data_train, data_test = get_data()
print_data_info(data_train, data_test)

start downloading...
downloading completed，take 0.517 sec
data type： <class 'sklearn.utils.Bunch'>
# of texts in train set ： 2034
# of texts in test set： 1353
name of4 categories：
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


**get_y_data** simply returns the label arrays of the training set and testing set.

In [9]:
def get_y_data(data_train, data_test):

    y_train = data_train.target
    y_test = data_test.target

    return y_train, y_test

**print_examples** prints the first 2 texts in the training set with their corresponding categories.

In [24]:
def print_examples(y_train, data_train):

    print(' -- Examples : the first 2 texts -- ')

    categories = data_train.target_names

    for i in np.arange(2):
        print('----------------')
        print('category for text%d : %s\n' % (i + 1, categories[y_train[i]]))
        print(data_train.data[i])
        print('\n\n')

y_train, y_test = get_y_data(data_train, data_test)
print_examples(y_train, data_train)

 -- Examples : the first 2 texts -- 
----------------
category for text1 : alt.atheism

From: healta@saturn.wwc.edu (Tammy R Healy)
Subject: Re: note to Bobby M.
Lines: 52
Organization: Walla Walla College
Lines: 52

In article <1993Apr14.190904.21222@daffy.cs.wisc.edu> mccullou@snake2.cs.wisc.edu (Mark McCullough) writes:
>From: mccullou@snake2.cs.wisc.edu (Mark McCullough)
>Subject: Re: note to Bobby M.
>Date: Wed, 14 Apr 1993 19:09:04 GMT
>In article <1993Apr14.131548.15938@monu6.cc.monash.edu.au> darice@yoyo.cc.monash.edu.au (Fred Rice) writes:
>>In <madhausC5CKIp.21H@netcom.com> madhaus@netcom.com (Maddi Hausmann) writes:
>>
>>>Mark, how much do you *REALLY* know about vegetarian diets?
>>>The problem is not "some" B-vitamins, it's balancing proteins.  
>>>There is also one vitamin that cannot be obtained from non-animal
>>>products, and this is only of concern to VEGANS, who eat no
>>>meat, dairy, or eggs.  I believe it is B12, and it is the only
>>>problem.  Supplements are avai

#### 1.3) Fit Data using TF-IDF Model ####
***

Next we call **tfidf_data**, which fits a **TF-IDF** vectorizer on the training set and uses it to transform both the training set and the testing set. The fitted vectorizer is saved to **vec.pickle** so that new text can later be transformed without fitting it again.

In [25]:
def tfidf_data(data_train, data_test):

    vectorizer = TfidfVectorizer(input='content', stop_words='english', max_df=0.5, sublinear_tf=True)

    vec = vectorizer.fit(data_train.data)
    # Save the fitted vectorizer so new text can be transformed the same way later
    with open("vec.pickle", "wb") as f:
        pickle.dump(vec, f)
    x_train = vectorizer.transform(data_train.data)  # x_train is a sparse scipy.sparse CSR matrix
    x_test = vectorizer.transform(data_test.data)

    return x_train, x_test, vectorizer

**print_x_data** prints the number of samples and the number of features of the training set after the **TF-IDF** transform.

In [30]:
def print_x_data(x_train, vectorizer):

    print('# of train set：%d，# of features：%d' % x_train.shape)
    
x_train, x_test, vectorizer = tfidf_data(data_train, data_test)
print_x_data(x_train, vectorizer)

# of train set：2034，# of features：33809
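
To make the weighting concrete, here is a minimal pure-Python sketch of the value **TfidfVectorizer** assigns to each term when `sublinear_tf=True` is combined with its default `smooth_idf=True` and `norm='l2'` settings. The toy documents and the helper name `tfidf_weights` are assumptions made for illustration; tokenization, stop-word removal, and the `max_df` filter are left out.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (docs are lists of tokens)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        row = {}
        for term, count in Counter(doc).items():
            tf = 1.0 + math.log(count)                            # sublinear_tf=True
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0   # smooth_idf=True
            row[term] = tf * idf
        norm = math.sqrt(sum(w * w for w in row.values()))        # norm='l2'
        weighted.append({t: w / norm for t, w in row.items()})
    return weighted

# Made-up, already tokenized documents.
docs = [["space", "orbit", "space"], ["graphics", "image"], ["space", "image"]]
weights = tfidf_weights(docs)
for row in weights:
    print(row)
```

In the second toy document, "graphics" appears in fewer documents than "image", so it receives the larger weight even though both occur once.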


#### 1.4) Train Classifier ####
***

Method **classifier** trains six classifiers on the training data: Multinomial Naive Bayes, Bernoulli Naive Bayes, K Neighbors, Ridge, Random Forest, and Support Vector Machine.

It returns an array with one row per classifier, holding its training time, testing time, error rate, and name.

In [31]:
def classifier(x, y):
    print('\n\n===================\n evaluation of classifiers：\n')
    clfs = {"MultinomialNB": MultinomialNB(), 
            "BernoulliNB": BernoulliNB(),  
            "K_Neighbors": KNeighborsClassifier(),  
            "Ridge_Regression": RidgeClassifier(),  
            "RandomForest": RandomForestClassifier(n_estimators=200),  
            "SVC": SVC()  
            }
    result = []
    for name,clf in clfs.items():
        a = test_clf(name, clf, x, y)
        result.append(a)
        print('\n')
    return np.array(result)

**test_clf** takes one classifier at a time. It wraps the classifier in a **GridSearchCV** object and sets the parameter grid according to the type of the classifier (e.g. **neighbors_can** is searched when the classifier is the **K Neighbors classifier**).

It then fits the search with 5-fold cross validation, saves the fitted model, and evaluates it on the testing set.
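
The grid selection can be sketched on its own. The example below assumes a hypothetical helper `pick_param_grid` and a small synthetic count matrix (neither is part of the notebook); it chooses the grid from the hyperparameters an estimator exposes and then lets **GridSearchCV** pick the best value.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

def pick_param_grid(clf):
    """Choose a search grid from the hyperparameters the estimator exposes."""
    if hasattr(clf, 'alpha'):
        return {'alpha': np.logspace(-3, 2, 10)}
    if hasattr(clf, 'n_neighbors'):
        return {'n_neighbors': np.arange(1, 5)}
    raise ValueError('no grid defined for %s' % type(clf).__name__)

# Synthetic non-negative counts: class 0 favours the first two columns,
# class 1 the last two. The data is made up for illustration.
rng = np.random.RandomState(0)
y_toy = np.repeat([0, 1], 20)
x_toy = rng.poisson(lam=np.where(y_toy[:, None] == 0, [5, 5, 1, 1], [1, 1, 5, 5]))

best = {}
for clf in (MultinomialNB(), KNeighborsClassifier()):
    model = GridSearchCV(clf, param_grid=pick_param_grid(clf), cv=5)
    model.fit(x_toy, y_toy)  # runs 5 fits per candidate value, then refits the best
    best[type(clf).__name__] = model.best_params_

print(best)
```

Building the grid before constructing the search object avoids the placeholder grid, at the cost of one extra helper. With `cv=5` and `m` candidate settings the search performs `5 * m` fits, which is why the notebook divides the total time by `5 * m`.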

In [32]:
def test_clf(name, clf, x_train, y_train):
    print ('Classifier：', clf)
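    alpha_can = np.logspace(-3, 2, 10)
    # Start with a placeholder grid; it is replaced below according to
    # which hyperparameters the classifier exposes
    model = GridSearchCV(clf, param_grid={'alpha': alpha_can}, cv=5)
    m = alpha_can.size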
    if hasattr(clf, 'alpha'):
        model.set_params(param_grid={'alpha': alpha_can})
        m = alpha_can.size
    if hasattr(clf, 'n_neighbors'):
        neighbors_can = np.arange(1, 15)
        model.set_params(param_grid={'n_neighbors': neighbors_can})
        m = neighbors_can.size
    if hasattr(clf, 'C'):
        C_can = np.logspace(1, 3, 3)
        gamma_can = np.logspace(-3, 0, 3)
        model.set_params(param_grid={'C':C_can, 'gamma':gamma_can})
        m = C_can.size * gamma_can.size
    if hasattr(clf, 'max_depth'):
        max_depth_can = np.arange(4, 10)
        model.set_params(param_grid={'max_depth': max_depth_can})
        m = max_depth_can.size
    t_start = time()
    model.fit(x_train, y_train)
    t_end = time()
    t_train = (t_end - t_start) / (5*m)
    print ('Training time for 5 -fold cross validation：%.3f/(5*%d) = %.3fsec' % ((t_end - t_start), m, t_train))
    print( 'Optimal hyperparameter：', model.best_params_)
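    joblib.dump(model, "%s.joblib"%name)  # save the fitted search object for later reuse
    # x_test and y_test are the notebook-level variables defined in the earlier cells
    t_start = time()
    y_hat = model.predict(x_test)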
    t_end = time()
    t_test = t_end - t_start
    print ('Testing Time：%.3f sec' % t_test)
    acc = metrics.accuracy_score(y_test, y_hat)
    print ('Accuracy ：%.2f%%' % (100 * acc))
    name = str(clf).split('(')[0]
    index = name.find('Classifier')
    if index != -1:
        name = name[:index]
    if name == 'SVC':
        name = 'SVM'
    return t_train, t_test, 1-acc, name



result = classifier(x_train, y_train)



 evaluation of classifiers：

Classifier： MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Training time for 5 -fold cross validation：0.375/(5*10) = 0.008sec
Optimal hyperparameter： {'alpha': 0.003593813663804626}
Testing Time：0.001 sec
Accuracy ：89.58%


Classifier： BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
Training time for 5 -fold cross validation：0.523/(5*10) = 0.010sec
Optimal hyperparameter： {'alpha': 0.001}
Testing Time：0.002 sec
Accuracy ：88.54%


Classifier： KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
Training time for 5 -fold cross validation：11.131/(5*14) = 0.159sec
Optimal hyperparameter： {'n_neighbors': 3}
Testing Time：0.137 sec
Accuracy ：86.03%


Classifier： RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, random_state=None, solver='auto',
      

#### 1.5) Render Results ####
***

**draw** takes the results returned by **classifier** and renders a grouped bar chart that compares the **error rate, training time and testing time** of each classifier.

In [None]:
def draw(result):
    time_train1, time_test1, err1, names = result.T
    # np.float was removed in NumPy 1.24; the built-in float is equivalent
    time_test = time_test1.astype(float)
    time_train = time_train1.astype(float)
    err = err1.astype(float)
    x= np.arange(len(time_train))
    bar_width = 0.25
    ax1 = host_subplot(111, axes_class=AA.Axes)
    plt.subplots_adjust(right = 0.75)

    ax2 = ax1.twinx()
    ax3 = ax1.twinx()
    offset3 = 60
    offset2 = 0

    new_fixed_axis = ax3.get_grid_helper().new_fixed_axis
    ax3.axis["right"] = new_fixed_axis(loc="right", axes=ax3, offset=(offset3, 0))
    ax3.axis["right"].toggle(all=True)

    new_fixed_axis2 = ax2.get_grid_helper().new_fixed_axis
    ax2.axis["right"] = new_fixed_axis2(loc="right", axes=ax2, offset=(offset2, 0))
    ax2.axis["right"].toggle(all=True)

    ax1.set_ylabel("Error percentage")
    ax2.set_ylabel("Training time")
    ax3.set_ylabel("Testing time")

    b1 = ax1.bar(x, err, bar_width, alpha=0.2, color='r')
    b2 = ax2.bar(x + bar_width, time_train, bar_width, alpha=0.2, color='g')
    b3 = ax3.bar(x + bar_width * 2, time_test, bar_width, alpha=0.2, color='b')
    plt.xticks(x + bar_width * 2, names)
    plt.legend([b1[0], b2[0], b3[0]], ('Error Percentage', 'Training Time', 'Testing Time'), loc='upper left')
    plt.xlabel('Different Types Of Classifiers')
    plt.title('Evaluation Of Different Classifiers')
    plt.savefig("Performance_his.png")
    plt.show()
    

draw(result)
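
The `astype` conversions in **draw** are needed because each row returned by **test_clf** mixes numbers with a name, so NumPy stores the whole array as strings. A small sketch with made-up values (not measurements from this notebook) shows the effect:

```python
import numpy as np

# Rows in the order returned by test_clf:
# (training time, testing time, error rate, name). The numbers are illustrative.
rows = [(0.008, 0.001, 0.10, 'MultinomialNB'),
        (0.159, 0.137, 0.14, 'KNeighbors')]
result = np.array(rows)

# Mixing floats and strings makes NumPy store every cell as a string.
print(result.dtype.kind)  # 'U' stands for unicode string

time_train, time_test, err, names = result.T
err = err.astype(float)   # convert back before plotting or doing arithmetic
print(err.mean())
```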

# Running classifiers #

***
Six classifiers were trained and tested. To enable running these trained classifiers on another email dataset, they were saved after being trained so that they can be loaded promptly later.



## Saving classifiers

The **joblib** module (bundled as **sklearn.externals.joblib** in older scikit-learn releases) comes in handy for this purpose. To save a trained classifier **clf**, simply run **joblib.dump(clf, filename)**. The saving functionality is embedded within the method **test_clf**, and after training, six **joblib** files are saved as

> 1. MultinomialNB.joblib
> 2. RandomForest.joblib
> 3. Ridge_Regression.joblib
> 4. BernoulliNB.joblib
> 5. K_Neighbors.joblib
> 6. SVC.joblib
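
As a minimal sketch of the save and load round trip, the example below fits a classifier on a tiny made-up count matrix, writes it to a temporary directory, and reads it back. It assumes the standalone `joblib` package is installed; the toy data only stands in for the TF-IDF features.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up count matrix standing in for the TF-IDF features.
x_toy = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
y_toy = np.array([0, 0, 1, 1])
clf = MultinomialNB().fit(x_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "MultinomialNB.joblib")
    joblib.dump(clf, path)        # save the fitted classifier
    restored = joblib.load(path)  # load it back, e.g. in a later session

# The restored classifier should predict exactly like the original one.
same = np.array_equal(clf.predict(x_toy), restored.predict(x_toy))
print(same)
```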


In [None]:
def test_clf(name, clf, x_train, y_train):
    print (u'Classifier：', clf)
    alpha_can = np.logspace(-3, 2, 10)
    model = GridSearchCV(clf, param_grid={'alpha': alpha_can}, cv=5)
    m = alpha_can.size
    if hasattr(clf, 'alpha'):
        model.set_params(param_grid={'alpha': alpha_can})
        m = alpha_can.size
    if hasattr(clf, 'n_neighbors'):
        neighbors_can = np.arange(1, 15)
        model.set_params(param_grid={'n_neighbors': neighbors_can})
        m = neighbors_can.size
    if hasattr(clf, 'C'):
        C_can = np.logspace(1, 3, 3)
        gamma_can = np.logspace(-3, 0, 3)
        model.set_params(param_grid={'C':C_can, 'gamma':gamma_can})
        m = C_can.size * gamma_can.size
    if hasattr(clf, 'max_depth'):
        max_depth_can = np.arange(4, 10)
        model.set_params(param_grid={'max_depth': max_depth_can})
        m = max_depth_can.size
    t_start = time()
    model.fit(x_train, y_train)
    t_end = time()
    t_train = (t_end - t_start) / (5*m)
    print (u'Training time for 5 -fold cross validation：%.3f/(5*%d) = %.3fsec' % ((t_end - t_start), m, t_train))
    print( u'Optimal hyperparameter：', model.best_params_)
    joblib.dump(model, "%s.joblib"%name)
    t_start = time()
    y_hat = model.predict(x_test)
    t_end = time()
    t_test = t_end - t_start
    print (u'Testing Time：%.3f sec' % t_test)
    acc = metrics.accuracy_score(y_test, y_hat)
    print (u'Accuracy ：%.2f%%' % (100 * acc))
    name = str(clf).split('(')[0]
    index = name.find('Classifier')
    if index != -1:
        name = name[:index]     # strip the trailing 'Classifier'
    if name == 'SVC':
        name = 'SVM'
    return t_train, t_test, 1-acc, name

## Saving vectorizer 

In order to run the classifiers on another dataset, the new dataset has to be preprocessed in exactly the same way as the original dataset. This requires saving the **TfidfVectorizer** after fitting it with **data_train**. New raw text is then transformed into **TF-IDF** vectors using this very vectorizer. As with the classifiers, the fitted object is written to disk, this time with Python's built-in **pickle** module. The vectorizer is saved in the file **vec.pickle**.
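
The following sketch illustrates why the saved vectorizer must be reused. It uses made-up sentences and a temporary file instead of the notebook's data, and compares the restored vectorizer with one that is fitted again on the new text.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts for illustration only.
train_texts = ["the shuttle reached orbit", "render the image with opengl"]
new_texts = ["a new image of the orbit"]

vectorizer = TfidfVectorizer().fit(train_texts)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vec.pickle")
    with open(path, "wb") as f:   # pickle files are binary
        pickle.dump(vectorizer, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

x_new = restored.transform(new_texts)                  # same columns as the training matrix
x_refit = TfidfVectorizer().fit_transform(new_texts)   # new vocabulary, different columns

print(x_new.shape[1], len(vectorizer.vocabulary_), x_refit.shape[1])
```

A classifier trained on the original features can only consume `x_new`; the columns of `x_refit` no longer correspond to the terms the classifier was trained on.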


In [None]:
def tfidf_data(data_train, data_test):

    vectorizer = TfidfVectorizer(input='content', stop_words='english', max_df=0.5, sublinear_tf=True)

    vec = vectorizer.fit(data_train.data)
    # Save the fitted vectorizer so new text can be transformed the same way later
    with open("vec.pickle", "wb") as f:
        pickle.dump(vec, f)
    x_train = vectorizer.transform(data_train.data)  # x_train is a sparse scipy.sparse CSR matrix
    x_test = vectorizer.transform(data_test.data)

    return x_train, x_test, vectorizer

## Running classifiers

First specify the filename of the raw text and transform it into the **TF-IDF** vector using the method **get_tfidf**.

### Loading the vectorizer

The vectorizer was fit and saved in the file **vec.pickle**. To load it, open the file in binary mode and pass the file object to **pickle.load**, e.g. **pickle.load(open("vec.pickle", "rb"))**. The raw text can then be transformed into a **TF-IDF** vector using this vectorizer.

In [3]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

run_file = "test.txt" # Please specify the file name you want to test

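def get_tfidf(filename):
    # Pickle files are binary, so the file must be opened in "rb" mode
    with open("vec.pickle", "rb") as f:
        vectorizer = pickle.load(f)
    # Read the whole raw text file into one string
    with open(filename) as fl:
        text = fl.read()
    # transform expects an iterable of documents, hence the one-element list
    return vectorizer.transform([text])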

x = get_tfidf(run_file)

print(x)

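ValueError: unsupported pickle protocol: 3

This error is raised when a pickle written under Python 3 (protocol 3) is read by a Python 2 interpreter, which only supports protocols up to 2. Run this cell with the same Python 3 environment that created **vec.pickle**.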

### Loading the classifiers
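
A possible way to load the saved classifiers is sketched below. The helper `load_classifiers` is hypothetical, and the toy model, the two-column features, and the temporary directory exist only so the sketch runs on its own; in the notebook the directory would be the one holding the six **joblib** files and the input would be the vector returned by **get_tfidf**.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# File stems taken from the list of saved classifiers in this document.
CLASSIFIER_NAMES = ["MultinomialNB", "RandomForest", "Ridge_Regression",
                    "BernoulliNB", "K_Neighbors", "SVC"]

def load_classifiers(directory, names=CLASSIFIER_NAMES):
    """Load every '<name>.joblib' file that exists in directory (hypothetical helper)."""
    loaded = {}
    for name in names:
        path = os.path.join(directory, "%s.joblib" % name)
        if os.path.exists(path):
            loaded[name] = joblib.load(path)
    return loaded

# Stand-in setup: one toy model with made-up features and two category names.
categories = ['comp.graphics', 'sci.space']
x_toy = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
y_toy = np.array([0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    joblib.dump(MultinomialNB().fit(x_toy, y_toy),
                os.path.join(tmp_dir, "MultinomialNB.joblib"))
    classifiers = load_classifiers(tmp_dir)

# Map each predicted label index back to its category name.
x_new = np.array([[0, 5]])
predictions = {name: categories[clf.predict(x_new)[0]]
               for name, clf in classifiers.items()}
print(predictions)
```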
