# 20 Newsgroups & Naive Bayes

This notebook is a modification of the 20_Newsgroups_LR example. It is motivated by the high dimensionality of the feature vectors in the vectorized 20 newsgroups dataset. Perhaps an even stronger motivation is the relatively low theoretical correlation between individual features, which are term frequencies multiplied by inverse document frequencies (TF-IDF); this suits the feature-independence assumption that Naive Bayes relies on.

---

I've used a 5-fold cross-validation scheme and `ParameterSampler` for hyperparameter tuning.

---

**Imports**

In [1]:
from __future__ import print_function

%matplotlib gtk

import matplotlib.pyplot as plt
plt.style.use("ggplot")

import os
import numpy as np
from sklearn import naive_bayes, preprocessing, cross_validation, datasets, grid_search
import cPickle as cpk
from time import time

**Global variables**

In [2]:
NB_FILE_PATH = "models/naive_bayes.model"
MNB_FILE_PATH = "models/multinomial_nb.model"
NB_OP_FILE_PATH = "models/optimized_naive_bayes.model"
MNB_OP_FILE_PATH = "models/optimized_multinomial_nb.model"

**Loading the dataset**

In [3]:
start = time()
# remove = ('headers', 'footers', 'quotes')
news = datasets.fetch_20newsgroups_vectorized(subset = "all")
print("Data loaded in {0} seconds".format(time() - start))
print("-------------------------------------------------------------------------------")

X = news.data
y = news.target

# _ = [print(topic) for topic in news.target_names] # you may uncomment this to see the topics

print("Number of documents in the dataset: {0}".format(*X.shape))
print("Size of the feature vector of a document: {1}".format(*X.shape))

Data loaded in 1.73747014999 seconds
-------------------------------------------------------------------------------
Number of documents in the dataset: 18846
Size of the feature vector of a document: 130107


**Some utility functions**

In [4]:
def to_str_repr(label):
    return news.target_names[label]

def formated_print(model_name, acc):
    pretty_stat = "Accuracy of the {0} model is {1}.".format(model_name, acc)
    print(pretty_stat)
    
def train_estimator(estimator, features):
    train, _ = next(iter(kfcv))
    estimator.fit(features[train], y[train])
    return estimator

def save_estimator(estimator, filename):
    """
        Note that it is recommended to use the '.pkl' or '.model' file extension.
        
        Why? Because I want it so.
        ;)
    """
    with open(filename, "wb") as fp:
        cpk.dump(estimator, fp)

def load_estimator(filename):
    with open(filename, "rb") as fp:
        return cpk.load(fp)

def load_or_train(estimator, feature_set, filename):
    if not os.path.exists(filename):
        print("Training...")
        start = time()
        estimator = train_estimator(estimator, feature_set)
        print("Trained in {0} seconds.".format(time() - start))
        print("-------------------------------------------------------------------------------")
        print("Saving...")
        start = time()
        save_estimator(estimator, filename)
        print("Saved in {0} seconds.".format(time() - start))
    else:
        print("Loading...")
        start = time()
        estimator = load_estimator(filename)
        print("Loaded in {0} seconds.".format(time() - start))
        
    return estimator
        
def evaluate_estimator(estimator, features):
    print("Evaluating model's accuracy...")
    start = time()
    score = np.mean(cross_validation.cross_val_score(estimator, features, y, cv = kfcv, n_jobs = 1))
    print("Evaluation done in {0} seconds".format(time() - start))
    print("-------------------------------------------------------------------------------")
    return score
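
The save and load helpers above rely on `cPickle`, which exists only on Python 2. As a minimal sketch, assuming Python 3 is also a target, the same round trip can be written with a guarded import. The `roundtrip` helper and the stand-in dictionary are hypothetical names used only for illustration; they are not part of this notebook.

```python
import os
import tempfile

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3


def roundtrip(obj):
    """Pickle obj to a temporary file and load it back (illustrative helper)."""
    fd, path = tempfile.mkstemp(suffix=".model")
    os.close(fd)
    try:
        with open(path, "wb") as fp:
            pickle.dump(obj, fp, protocol=pickle.HIGHEST_PROTOCOL)
        with open(path, "rb") as fp:
            return pickle.load(fp)
    finally:
        os.remove(path)


# A plain dict stands in for a fitted estimator here.
fake_model = {"alpha": 0.01, "fit_prior": True}
restored = roundtrip(fake_model)
print(restored == fake_model)
```

Opening the file in binary mode (`"wb"` / `"rb"`) matters on both Python versions, which is why the helpers above already do so.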

**Feature vector normalization and scaling**

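Because the feature matrix is sparse, we scale each feature only by its standard deviation and skip mean centering (`with_mean = False`): subtracting the mean would turn the zeros into non-zero values and destroy the sparsity.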

In [5]:
__X = preprocessing.scale(X, with_mean = False)
X_new = preprocessing.normalize(__X)
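
To see why skipping the centering step keeps the data sparse, here is a small plain-Python sketch of the two steps on a dense toy matrix. It is an illustration under simplifying assumptions (population standard deviation, constant columns left unscaled), not scikit-learn's implementation, and the function names are made up for this example.

```python
import math


def scale_columns_no_centering(rows):
    """Divide each column by its standard deviation; zeros stay zeros."""
    n_rows, n_cols = len(rows), len(rows[0])
    stds = []
    for j in range(n_cols):
        col = [row[j] for row in rows]
        mean = sum(col) / float(n_rows)
        var = sum((v - mean) ** 2 for v in col) / float(n_rows)
        stds.append(math.sqrt(var) or 1.0)  # constant columns are left unscaled
    return [[row[j] / stds[j] for j in range(n_cols)] for row in rows]


def l2_normalize_rows(rows):
    """Rescale each row to unit Euclidean length (all-zero rows are kept)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


toy = [[0.0, 2.0, 0.0],
       [1.0, 0.0, 0.0],
       [3.0, 4.0, 0.0]]
scaled = l2_normalize_rows(scale_columns_no_centering(toy))
for row in scaled:
    print(row)
```

Every entry that was zero in `toy` is still zero in `scaled`, so a sparse representation would not grow.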

**5-fold cross-validation**

To get more objective results, I have used a 5-fold cross-validation scheme with shuffling of the feature-target pairs. This makes it possible to average the results of 5 model evaluations, each trained and tested on different portions of the dataset.

In [6]:
kfcv = cross_validation.KFold(len(y), n_folds = 5, shuffle = True)
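
As a rough sketch of what a shuffled k-fold split does, the plain-Python generator below produces the index pairs. It is an assumption-laden illustration (the name `shuffled_kfold` and the `seed` argument are hypothetical), not the scikit-learn code.

```python
import random


def shuffled_kfold(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % n_folds folds get one extra sample.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


fold_sizes = [(len(tr), len(te)) for tr, te in shuffled_kfold(18846, 5)]
print(fold_sizes)
```

With 18846 documents the first fold holds 3770 test and 15076 training documents, which is why the mini-batch splits used later divide evenly.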

**Estimators with default parameters**

Both estimators are created with their default parameters. Unlike the estimator in the 20_Newsgroups_LR notebook, the Naive Bayes estimators have no worker-count setting to raise.

In [7]:
gaussian_nb = naive_bayes.GaussianNB()
multinomial_nb = naive_bayes.MultinomialNB()

**Accuracy benchmark for default estimators**

**Gaussian Naive Bayes**

Fitting on the whole dense matrix at once raised a `MemoryError`, so I chose to use the `partial_fit()` method and to train and test the model on mini-batches.

In [8]:
# gaussian_nb = load_or_train(gaussian_nb, X, NB_FILE_PATH)
train, test = next(iter(kfcv))

# np.split needs equal-sized sections: the 15076 training indices give 3769 batches of 4 documents
for batch_tr in np.split(train, 3769):
    gaussian_nb.partial_fit(X_new[batch_tr].toarray(), y[batch_tr], classes = range(20))
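
`np.split` only works when the number of indices is divisible by the number of sections. As a sketch of an alternative, a fixed-size batch generator avoids that constraint; `iter_batches` and the batch size of 64 are assumptions made for this example, not something the notebook uses.

```python
def iter_batches(indices, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


train_indices = list(range(15076))  # stand-in for the first fold's training indices
batches = list(iter_batches(train_indices, 64))
print(len(batches), len(batches[0]), len(batches[-1]))
```

Each batch could then be densified and passed to `partial_fit()` in the same way as above; the last batch is simply shorter than the others.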

In [9]:
# mean_acc = evaluate_estimator(gaussian_nb, X)
acc = 0.0
print("Evaluating...")
start = time()
for i, batch in enumerate(np.split(test, 10)):
    current_acc = gaussian_nb.score(X_new[batch].toarray(), y[batch])
    print("Batch Nr.{0} Accuracy {1}".format(i + 1, current_acc))
    acc += current_acc

print("-------------------------------------------------------------------------------")
print("Evaluation done in {0} seconds".format(time() - start))
formated_print("Gaussian Naive Bayes", acc / 10)

Evaluating...
Batch Nr.1 Accuracy 0.766578249337
Batch Nr.2 Accuracy 0.761273209549
Batch Nr.3 Accuracy 0.771883289125
Batch Nr.4 Accuracy 0.801061007958
Batch Nr.5 Accuracy 0.774535809019
Batch Nr.6 Accuracy 0.726790450928
Batch Nr.7 Accuracy 0.753315649867
Batch Nr.8 Accuracy 0.708222811671
Batch Nr.9 Accuracy 0.74801061008
Batch Nr.10 Accuracy 0.742705570292
-------------------------------------------------------------------------------
Evaluation done in 116.711883068 seconds
Accuracy of the Gaussian Naive Bayes model is 0.755437665782.
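
Averaging the per-batch accuracies gives the overall accuracy only because all ten test batches have the same size. If the batches were unequal, pooling the counts would be the safer choice. The sketch below uses made-up counts and a hypothetical `pooled_accuracy` helper to show the difference.

```python
def pooled_accuracy(batch_results):
    """batch_results: list of (n_correct, n_total) pairs, one per batch."""
    correct = sum(c for c, _ in batch_results)
    total = sum(t for _, t in batch_results)
    return correct / float(total)


# Hypothetical counts: two full batches and one smaller final batch.
counts = [(90, 100), (80, 100), (5, 10)]
mean_of_batches = sum(c / float(t) for c, t in counts) / len(counts)
print(pooled_accuracy(counts), mean_of_batches)
```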


**Multinomial Naive Bayes**

In [10]:
multinomial_nb = load_or_train(multinomial_nb, X_new, MNB_FILE_PATH)

Training...
Trained in 0.263236999512 seconds.
-------------------------------------------------------------------------------
Saving...
Saved in 6.56074094772 seconds.


In [11]:
mean_acc = evaluate_estimator(multinomial_nb, X_new)

formated_print("Multinomial Naive Bayes", mean_acc)

Evaluating model's accuracy...
Evaluation done in 1.66627597809 seconds
-------------------------------------------------------------------------------
Accuracy of the Multinomial Naive Bayes model is 0.889844445086.
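
The default `MultinomialNB` uses additive smoothing with `alpha = 1.0` and learns the class priors from the data (`fit_prior = True`). To make the role of these two parameters concrete, here is a tiny plain-Python sketch of a multinomial Naive Bayes classifier. It is a simplified illustration on toy counts, not scikit-learn's implementation, and all names in it are made up for this example.

```python
import math
from collections import defaultdict


def fit_multinomial_nb(docs, labels, vocab_size, alpha=1.0, fit_prior=True):
    """docs: list of {feature_index: count}; returns (log_prior, log_likelihood)."""
    classes = sorted(set(labels))
    totals = {c: defaultdict(float) for c in classes}
    n_docs = {c: 0 for c in classes}
    for doc, label in zip(docs, labels):
        n_docs[label] += 1
        for j, count in doc.items():
            totals[label][j] += count
    log_prior, log_lik = {}, {}
    for c in classes:
        # fit_prior=False replaces the empirical class frequencies by a uniform prior.
        prior = n_docs[c] / float(len(docs)) if fit_prior else 1.0 / len(classes)
        log_prior[c] = math.log(prior)
        # alpha is added to every feature count, so unseen features never get probability 0.
        denom = sum(totals[c].values()) + alpha * vocab_size
        log_lik[c] = [math.log((totals[c].get(j, 0.0) + alpha) / denom)
                      for j in range(vocab_size)]
    return log_prior, log_lik


def predict(doc, log_prior, log_lik):
    """Return the class with the highest joint log-probability."""
    def score(c):
        return log_prior[c] + sum(n * log_lik[c][j] for j, n in doc.items())
    return max(log_prior, key=score)


docs = [{0: 3, 1: 1}, {0: 2}, {2: 4, 1: 1}]
labels = ["hockey", "hockey", "space"]
model = fit_multinomial_nb(docs, labels, vocab_size=3, alpha=0.1)
print(predict({0: 1, 1: 1}, *model))
```

A smaller `alpha` trusts the observed counts more, while a larger one pulls every feature probability towards uniform.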


**Hyperparameter selection**

To choose near-optimal hyperparameters I will use `ParameterSampler` from scikit-learn.
I will search only over the Multinomial Naive Bayes estimator, due to its superior performance over the Gaussian Naive Bayes model.

In [12]:
multinomial_nb_search_space = {"fit_prior": [True, False],
                               "alpha": [10 ** -x for x in range(5)]
                              }
param_list = grid_search.ParameterSampler(multinomial_nb_search_space, n_iter = 10)
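
The search space above contains 2 x 5 = 10 combinations, so drawing 10 distinct samples covers the whole grid. The sketch below shows one way to draw distinct settings from such a grid in plain Python. It is an illustration under the assumption that sampling is done without replacement; `sample_params` and its `seed` argument are hypothetical and not part of scikit-learn.

```python
import itertools
import random


def sample_params(space, n_iter, seed=0):
    """Draw n_iter distinct settings from a grid of discrete choices."""
    names = sorted(space)
    grid = [dict(zip(names, values))
            for values in itertools.product(*(space[name] for name in names))]
    if n_iter > len(grid):
        raise ValueError("n_iter exceeds the number of grid points")
    return random.Random(seed).sample(grid, n_iter)


space = {"fit_prior": [True, False],
         "alpha": [10 ** -x for x in range(5)]}
sampled = sample_params(space, n_iter=10)
print(len(sampled))
```

For larger grids a smaller `n_iter` trades coverage for speed, which is the usual reason to sample instead of enumerating.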

In [13]:
results = dict()
for i, params in enumerate(param_list):
    nb = naive_bayes.MultinomialNB(**params)
    nb = load_or_train(nb, X_new, "tmp/tmp{0}.model".format(i))
    acc = evaluate_estimator(nb, X_new)
    results[acc] = params

Loading...
Loaded in 8.56118392944 seconds.
Evaluating model's accuracy...
Evaluation done in 1.5784201622 seconds
-------------------------------------------------------------------------------
Loading...
Loaded in 8.09067201614 seconds.
Evaluating model's accuracy...
Evaluation done in 1.62555885315 seconds
-------------------------------------------------------------------------------
Loading...
Loaded in 9.28507590294 seconds.
Evaluating model's accuracy...
Evaluation done in 2.26200318336 seconds
-------------------------------------------------------------------------------
Loading...
Loaded in 7.15705895424 seconds.
Evaluating model's accuracy...
Evaluation done in 2.22633314133 seconds
-------------------------------------------------------------------------------
Loading...
Loaded in 7.60700011253 seconds.
Evaluating model's accuracy...
Evaluation done in 1.82064414024 seconds
-------------------------------------------------------------------------------
Loading...
Loaded in 

In [25]:
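best_results = sorted(results.keys(), reverse=True)
# Read the parameters in the same order as best_results so that each point is coloured by its own accuracy
xs, ys = zip(*[(results[acc]["alpha"], results[acc]["fit_prior"]) for acc in best_results])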

print(best_results)

[0.90549746536205933, 0.90448936704780658, 0.9022077495244254, 0.89785640640911857, 0.89716665270850504, 0.88984444508565963, 0.88766837941520704, 0.8875092000706587, 0.88018671093867107, 0.87965608028077713]
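
Keying `results` by accuracy means that two parameter settings with exactly the same score would overwrite each other. A sketch of alternative bookkeeping, assuming a list of `(accuracy, params)` pairs, is given below. The scores in it are invented for illustration and are not results from this notebook.

```python
# Hypothetical scores for illustration only; they are not results from this notebook.
runs = [
    (0.88, {"alpha": 1.0, "fit_prior": True}),
    (0.90, {"alpha": 0.01, "fit_prior": False}),
    (0.90, {"alpha": 0.01, "fit_prior": True}),
]

# Sort by accuracy only, so ties never fall back to comparing dicts.
ranked = sorted(runs, key=lambda run: run[0], reverse=True)
best_acc, best_params = ranked[0]

# Unzip in ranked order so each point lines up with its accuracy.
accs = [acc for acc, _ in ranked]
alphas = [params["alpha"] for _, params in ranked]
priors = [params["fit_prior"] for _, params in ranked]
print(best_acc, best_params)
```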


In [19]:
cm = plt.cm.get_cmap('Blues')
sc = plt.scatter(xs, ys, c = best_results, vmin = 0.87, vmax = .91, s = 50, cmap = cm)
plt.colorbar(sc)
plt.show()

**Optimized estimators**

After playing around with the `ParameterSampler` hyperparameter optimizer, I've found the near-optimal configuration below.

In [22]:
multinomial_nb_op = naive_bayes.MultinomialNB(**results[best_results[0]])

**Accuracy benchmark for optimized models**

**Multinomial Naive Bayes**

In [23]:
multinomial_nb_op = load_or_train(multinomial_nb_op, X_new, MNB_OP_FILE_PATH)

Training...
Trained in 0.280801057816 seconds.
-------------------------------------------------------------------------------
Saving...
Saved in 5.22973680496 seconds.


In [24]:
mean_acc = evaluate_estimator(multinomial_nb_op, X_new)

formated_print("Optimized Multinomial Naive Bayes", mean_acc)

Evaluating model's accuracy...
Evaluation done in 1.40515398979 seconds
-------------------------------------------------------------------------------
Accuracy of the Optimized Multinomial Naive Bayes model is 0.905497465362.


# Conclusion

From all the evaluations done above I can conclude the following two things:
- Data preparation, in this case scaling and normalizing the feature vectors, is extremely important and leads to major accuracy improvements (as if I didn't know it already...).
- Multinomial Naive Bayes can be trained significantly faster than SGD Classifier, but its accuracy is lower, which means that SGD Classifier is still preferable.