<h1> Running classifiers </h1>

***
Six classifiers were trained and tested. So that they can be run on another email dataset without retraining, each one was saved after training and can be loaded again later.



## Saving classifiers

The **joblib** module in **sklearn.externals** comes in handy for this purpose. Given a trained classifier **clf**, saving it only requires running **joblib.dump(clf, filename)**. The saving step was embedded in the method **test_clf** (where the object saved is the fitted **GridSearchCV** wrapper, **model**), and after training six **joblib** files are saved:

> 1. MultinomialNB.joblib
> 2. RandomForest.joblib
> 3. Ridge_Regression.joblib
> 4. BernoulliNB.joblib
> 5. K_Neighbors.joblib
> 6. SVC.joblib


In [None]:
def test_clf(name, clf):
    print('Classifier:', clf)
    alpha_can = np.logspace(-3, 2, 10)
    model = GridSearchCV(clf, param_grid={'alpha': alpha_can}, cv=5)
    m = alpha_can.size
    if hasattr(clf, 'alpha'):
        model.set_params(param_grid={'alpha': alpha_can})
        m = alpha_can.size
    if hasattr(clf, 'n_neighbors'):
        neighbors_can = np.arange(1, 15)
        model.set_params(param_grid={'n_neighbors': neighbors_can})
        m = neighbors_can.size
    if hasattr(clf, 'C'):
        C_can = np.logspace(1, 3, 3)
        gamma_can = np.logspace(-3, 0, 3)
        model.set_params(param_grid={'C':C_can, 'gamma':gamma_can})
        m = C_can.size * gamma_can.size
    if hasattr(clf, 'max_depth'):
        max_depth_can = np.arange(4, 10)
        model.set_params(param_grid={'max_depth': max_depth_can})
        m = max_depth_can.size
    t_start = time()
    model.fit(x_train, y_train)
    t_end = time()
    t_train = (t_end - t_start) / (5*m)
    print('Training time for 5-fold cross validation: %.3f/(5*%d) = %.3f sec' % ((t_end - t_start), m, t_train))
    print('Optimal hyperparameter:', model.best_params_)
    joblib.dump(model, "%s.joblib"%name)
    t_start = time()
    y_hat = model.predict(x_test)
    t_end = time()
    t_test = t_end - t_start
    print('Testing time: %.3f sec' % t_test)
    acc = metrics.accuracy_score(y_test, y_hat)
    print('Accuracy: %.2f%%' % (100 * acc))
    name = str(clf).split('(')[0]
    index = name.find('Classifier')
    if index != -1:
        name = name[:index]     # strip the trailing 'Classifier'
    if name == 'SVC':
        name = 'SVM'
    return t_train, t_test, 1-acc, name
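
The per-fit training time reported by **test_clf** divides the total grid-search time by 5*m, where m is the number of hyperparameter candidates. The sketch below reproduces that arithmetic with the standard library only. The helper names `count_fits` and `per_fit_time` and the 9-second total are hypothetical, and the middle gamma value is rounded. Note that **GridSearchCV** by default also refits the best candidate once on the full training set, so the reported figure is an approximation.

```python
from itertools import product

def count_fits(param_grid, n_folds=5):
    """Return (number of candidates m, number of cross-validation fits)."""
    m = len(list(product(*param_grid.values())))
    return m, n_folds * m

def per_fit_time(elapsed, param_grid, n_folds=5):
    """Average seconds per fit, mirroring t_train = elapsed / (5 * m)."""
    _, n_fits = count_fits(param_grid, n_folds)
    return elapsed / n_fits

# Grid used for SVC in test_clf: np.logspace(1, 3, 3) x np.logspace(-3, 0, 3)
svc_grid = {"C": [10.0, 100.0, 1000.0], "gamma": [0.001, 0.0316, 1.0]}

m, n_fits = count_fits(svc_grid)
print(m, n_fits)                      # 9 45
print(per_fit_time(9.0, svc_grid))    # 0.2 (9.0 s is a made-up total)
```

For the SVC grid this gives 9 candidates and 45 cross-validation fits, which is the `5*m` denominator used in **test_clf**.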

## Saving vectorizer 

In order to run the classifiers on another dataset, the new dataset has to be preprocessed in exactly the same way as the original one. This requires saving the **TfidfVectorizer** after fitting it on **data_train**, so that new raw text is transformed into **TF-IDF** vectors by this very vectorizer. Similarly to saving the classifiers, Python's built-in **pickle** module was used to save the vectorizer in the file **vec.pickle**.


In [None]:
def tfidf_data(data_train, data_test):

    vectorizer = TfidfVectorizer(input='content', stop_words='english', max_df=0.5, sublinear_tf=True)

    vec = vectorizer.fit(data_train.data)
    # Save the fitted vectorizer so new text can be transformed identically later
    with open("vec.pickle", "wb") as f:
        pickle.dump(vec, f)
    x_train = vectorizer.transform(data_train.data)  # x_train is sparse (scipy.sparse.csr.csr_matrix)
    x_test = vectorizer.transform(data_test.data)

    return x_train, x_test, vectorizer
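
The reason the fitted vectorizer has to be reused is that each token is tied to a fixed column index learned from the training data, and the classifiers expect exactly those columns. The following is a minimal standard-library sketch of that idea. The functions `fit_vocabulary` and `transform`, the file name `vec_demo.pickle`, and the sample sentences are hypothetical stand-ins and not the real **TfidfVectorizer**; only the save/load round trip with **pickle** mirrors the notebook.

```python
import os
import pickle
import tempfile

def fit_vocabulary(docs):
    """Toy stand-in for fitting: map each distinct token to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(vocab, doc):
    """Count tokens into a fixed-length vector; unseen tokens are ignored."""
    vec = [0] * len(vocab)
    for tok in doc.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

train_docs = ["space shuttle launch", "graphics card driver"]
vocab = fit_vocabulary(train_docs)

# Save and reload the fitted state; pickle files are opened in binary mode
path = os.path.join(tempfile.mkdtemp(), "vec_demo.pickle")
with open(path, "wb") as fh:
    pickle.dump(vocab, fh)
with open(path, "rb") as fh:
    loaded = pickle.load(fh)

# New text is mapped onto the columns learned from the training data
new_vec = transform(loaded, "new shuttle launch window")
print(new_vec)
```

The new document has the same length and column order as the training vectors, and words that were never seen during fitting are simply dropped.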

## Running classifiers

First specify the filename of the raw text and transform it into the **TF-IDF** vector using the method **get_tfidf**.

### Loading the vectorizer

The vectorizer was fitted and saved in the file **vec.pickle**. To load it, open the file in binary mode and pass the file object to **pickle.load**, for example **pickle.load(open("vec.pickle", "rb"))**. The raw text can then be transformed into a **TF-IDF** vector using this vectorizer.

In [2]:
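try:
    import joblib  # standalone package; needed on scikit-learn >= 0.23
except ImportError:
    from sklearn.externals import joblib  # older scikit-learn releases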
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

run_file = "test.txt" # Please specify the file name you want to test

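def get_tfidf(filename):
	# Load the fitted vectorizer; pickle files must be opened in binary mode
	with open("vec.pickle", "rb") as f:
		vectorizer = pickle.load(f)
	# Read the whole file as a single document
	with open(filename) as fl:
		text = fl.read()
	return vectorizer.transform([text])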

x = get_tfidf(run_file)

print(x)

  (0, 33654)	0.058523288651348045
  (0, 33597)	0.07950599605058221
  (0, 33381)	0.03840068981835156
  (0, 33366)	0.05499176868335385
  (0, 33342)	0.02995778455336368
  (0, 33334)	0.0397846329216497
  (0, 33309)	0.01844672331708908
  (0, 33303)	0.029723119935018725
  (0, 33288)	0.02270583426947962
  (0, 33182)	0.0423064812093078
  (0, 33062)	0.08096440111054665
  (0, 32993)	0.07025388891618414
  (0, 32928)	0.09288239260230895
  (0, 32849)	0.03967538645022774
  (0, 32547)	0.04548999989759632
  (0, 32504)	0.04618701148205565
  (0, 32429)	0.07894749349356875
  (0, 32420)	0.028584813824891843
  (0, 32336)	0.038923275399569364
  (0, 32232)	0.021168942671621935
  (0, 32207)	0.04781887956354195
  (0, 32152)	0.04372636092767193
  (0, 32141)	0.051256056735957026
  (0, 31967)	0.020312897962306932
  (0, 31746)	0.024577697676100376
  :	:
  (0, 6590)	0.04322181009406479
  (0, 6579)	0.035351909389793566
  (0, 6569)	0.05560959863517859
  (0, 6568)	0.074594107888131
  (0, 6496)	0.04372636092767193

### Loading classifiers

Now that the raw text has been transformed into a **TF-IDF** vector, it can be classified using the saved classifiers. Define a method **run(x, clf_name)**, where **x** is a **TF-IDF** vector and **clf_name** is the name of a saved classifier file without the **.joblib** extension. It returns the predicted label for the given **x**.


In [3]:
def run(x, clf_name):
	clf_name = clf_name + ".joblib"
	clf = joblib.load(clf_name)
	return clf.predict(x)

In [4]:
print("The classification for %s is: \n" % run_file)
clfs = ("MultinomialNB", "BernoulliNB",  "K_Neighbors",  "Ridge_Regression",  "RandomForest",  "SVC")
cat = ('alt.atheism', 'comp.graphics','sci.space', 'talk.religion.misc')
for clf in clfs:
    print("%s: %s\n" % (clf, cat[run(x, clf)[0]]))


The classification for test.txt is: 

MultinomialNB: talk.religion.misc

BernoulliNB: talk.religion.misc

K_Neighbors: talk.religion.misc

Ridge_Regression: talk.religion.misc

RandomForest: talk.religion.misc

SVC: talk.religion.misc
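
With six predictions available, one optional way to summarize them is a simple tally. The sketch below is not part of the original notebook: the `predictions` dictionary is hard-coded from the output above (index 3 is `talk.religion.misc` in `cat`), whereas in the notebook each value would come from `run(x, clf)[0]`.

```python
from collections import Counter

cat = ('alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc')

# Label indices hard-coded from the output above (3 = talk.religion.misc);
# in the notebook they would come from run(x, clf)[0].
predictions = {
    "MultinomialNB": 3, "BernoulliNB": 3, "K_Neighbors": 3,
    "Ridge_Regression": 3, "RandomForest": 3, "SVC": 3,
}

# Count how many classifiers chose each category name
votes = Counter(cat[idx] for idx in predictions.values())
label, count = votes.most_common(1)[0]
print("%s (%d of %d classifiers)" % (label, count, len(predictions)))
```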



