# 20 Newsgroups & Logistic Regression

This notebook is a slightly enhanced version of the demo I presented at the first, closed-beta Data Science Community meetup, held on 16 January 2017.

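It shows how to use scikit-learn's LogisticRegression and SGDClassifier. The latter fits a linear model with stochastic gradient descent, which makes it a good choice for large datasets. Note that with its default hinge loss SGDClassifier trains a linear SVM; it corresponds to logistic regression only when given a logistic loss.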

---

The dataset is 20 Newsgroups, pre-vectorized with TF-IDF.
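
For readers unfamiliar with TF-IDF, here is a minimal pure-Python sketch of the textbook weighting (term frequency times the log of inverse document frequency). It is only an illustration: the `tf_idf` helper and the toy documents are made up for this example, and the exact variant scikit-learn applies to the pre-vectorized dataset may differ (for instance through smoothing or normalization).

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (textbook TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Toy corpus, invented for illustration only.
docs = [
    "the gpu renders the scene".split(),
    "the team won the game".split(),
    "the gpu driver crashed".split(),
]
vectors = tf_idf(docs)
```

A term that occurs in every document ("the") gets weight zero, while a term unique to one document ("scene") outweighs one shared by two documents ("gpu").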

I've used a 5-fold cross validation scheme and Grid Search for hyperparameter tuning.

To improve accuracy, the feature vectors were also scaled and normalized.

---

**Imports**

In [1]:
from __future__ import print_function

import os
import numpy as np
from sklearn import linear_model, preprocessing, cross_validation, datasets
import cPickle as cpk
from time import time



**Global variables**

In [4]:
LR_FILE_PATH = "../tmp/models/lr.pkl"
SGD_FILE_PATH = "../tmp/models/sgd.pkl"
LR_OP_FILE_PATH = "../tmp/models/optimized_lr.pkl"
SGD_OP_FILE_PATH = "../tmp/models/optimized_sgd.pkl"

**Loading the dataset**

In [3]:
start = time()
news = datasets.fetch_20newsgroups_vectorized()
print("Data loaded in {0} seconds".format(time() - start))
print("-------------------------------------------------------------------------------")

X = news.data
y = news.target

# _ = [print(topic) for topic in news.target_names] # you may uncomment this to see the topics

print("Number of documents in the dataset: {0}".format(*X.shape))
print("Size of the feature vector of a document: {1}".format(*X.shape))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Data loaded in 26.1854040623 seconds
-------------------------------------------------------------------------------
Number of documents in the dataset: 11314
Size of the feature vector of a document: 130107


**Some utility functions**

In [5]:
def to_str_repr(label):
    return news.target_names[label]

def formated_print(model_name, acc):
    pretty_stat = "Accuracy of the {0} model is {1}.".format(model_name, acc)
    print(pretty_stat)
    
def train_estimator(estimator, features):
    train, _ = next(iter(kfcv))
    estimator.fit(features[train], y[train])
    return estimator

def save_estimator(estimator, filename):
    """
        Note that it is recommended to use '.pkl' or '.model' file extension.
        
        Why? Because I want so.
        ;)
    """
    with open(filename, "wb") as fp:
        cpk.dump(estimator, fp)

def load_estimator(filename):
    with open(filename, "rb") as fp:
        return cpk.load(fp)

def load_or_train(estimator, feature_set, filename):
    if not os.path.exists(filename):
        print("Training...")
        start = time()
        estimator = train_estimator(estimator, feature_set)
        print("Trained in {0} seconds.".format(time() - start))
        print("-------------------------------------------------------------------------------")
        print("Saving...")
        start = time()
        save_estimator(estimator, filename)
        print("Saved in {0} seconds.".format(time() - start))
    else:
        print("Loading...")
        start = time()
        estimator = load_estimator(filename)
        print("Loaded in {0} seconds.".format(time() - start))
        
    return estimator
        
def evaluate_estimator(estimator, features):
    print("Evaluating model's accuracy...")
    start = time()
    score = np.mean(cross_validation.cross_val_score(estimator, features, y, cv = kfcv, n_jobs = 4))
    print("Evaluation done in {0} seconds".format(time() - start))
    print("-------------------------------------------------------------------------------")
    return score
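
The load-or-train helper above is a cache-on-disk pattern. Below is a minimal Python 3 sketch of the same idea, where the standard `pickle` module replaces Python 2's `cPickle`. The names `load_or_compute` and `expensive_fit` are hypothetical, and the sketch additionally creates the target directory, which the notebook's `save_estimator` assumes already exists.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the object pickled at `path`; on a cache miss, compute and save it."""
    if os.path.exists(path):
        with open(path, "rb") as fp:
            return pickle.load(fp)
    result = compute()
    # Create the target directory if it is missing.
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "wb") as fp:
        pickle.dump(result, fp)
    return result

calls = []

def expensive_fit():
    """Stand-in for model training; records how often it runs."""
    calls.append(1)
    return {"coef": [0.5, -1.2]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "models", "toy.pkl")
    first = load_or_compute(path, expensive_fit)   # computes and saves
    second = load_or_compute(path, expensive_fit)  # loads from disk
```

One consequence of this pattern is worth keeping in mind: once a file exists, changes to the estimator's parameters are silently ignored until the file is deleted.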

**Feature vector normalization and scaling**

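Because the vectors are sparse, subtracting the mean would turn most zeros into non-zero values and destroy the sparsity, so we scale each feature by its standard deviation only (with_mean = False). Each document vector is then normalized to unit length.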

In [6]:
__X = preprocessing.scale(X, with_mean = False)
X_new = preprocessing.normalize(__X)
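
To make the two steps concrete, here is a small pure-Python sketch on a dense toy matrix. It assumes the population standard deviation for the column scaling and the L2 norm for the row normalization; the helper names `scale_columns` and `l2_normalize_rows` are made up for this illustration and are not scikit-learn functions.

```python
import math

def scale_columns(rows):
    """Divide each column by its population standard deviation, without centering."""
    n = len(rows)
    scaled = [row[:] for row in rows]
    for j in range(len(rows[0])):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        if std == 0:
            continue  # constant column: leave it unchanged
        for row in scaled:
            row[j] /= std
    return scaled

def l2_normalize_rows(rows):
    """Rescale each row to unit Euclidean length (all-zero rows are kept as is)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row] if norm else row[:])
    return out

X_toy = [[0.0, 2.0, 0.0],
         [1.0, 0.0, 0.0],
         [0.0, 4.0, 3.0]]
X_scaled = scale_columns(X_toy)
X_unit = l2_normalize_rows(X_scaled)
```

Because nothing is subtracted, every zero entry stays zero, which is exactly why this form of scaling is safe for sparse matrices.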

**5 Fold cross validation**

To get more objective results, I used 5-fold cross-validation with shuffling of the feature-target pairs. The reported score is therefore the average of 5 model evaluations, each trained and tested on different portions of the dataset.

In [7]:
kfcv = cross_validation.KFold(len(y), n_folds = 5, shuffle = True)
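
As an illustration of what a shuffled k-fold split produces, here is a pure-Python sketch that yields train/test index lists. The function name and the `seed` parameter are assumptions for this example; the KFold call above sets no random state, so its folds will differ from run to run.

```python
import random

def shuffled_kfold(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

# 11314 is the number of documents reported when the dataset was loaded.
folds = list(shuffled_kfold(n_samples=11314, n_folds=5))
fold_sizes = [len(test) for _, test in folds]
```

Every document lands in exactly one test fold, so each of the 5 evaluations is scored on data the model did not see during that fit.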

**Estimators with default parameters**

The parameters are almost the defaults: to speed up training, I raised the number of workers from 1 to 3.

In [8]:
sgd = linear_model.SGDClassifier(n_jobs = 3)
lr = linear_model.LogisticRegression(n_jobs = 3)
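
To give an intuition for how stochastic gradient descent differs from a batch solver, here is a rough pure-Python sketch that updates the weights after every single sample. It makes several simplifying assumptions that do not match SGDClassifier's defaults: it uses a logistic loss (the default is hinge), a constant learning rate, binary labels, and made-up toy data. All names are hypothetical.

```python
import math
import random

def sgd_logistic(X, y, n_epochs=200, lr=0.5, alpha=0.00005, seed=0):
    """Fit binary logistic regression with per-sample gradient updates."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit the samples in a new order each epoch
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y[i]                  # gradient of the log loss w.r.t. z
            # Gradient step with a small L2 penalty (alpha) on the weights.
            w = [wj - lr * (err * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Predict class 1 when the linear score is positive."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Linearly separable toy data, invented for illustration.
X_toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.8, 1.1], [0.9, 0.7]]
y_toy = [0, 0, 0, 1, 1, 1]
w, b = sgd_logistic(X_toy, y_toy)
predictions = [predict(w, b, x) for x in X_toy]
```

Each update touches one sample only, so the cost per step does not grow with the dataset, which is the main reason this approach scales well.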

**Accuracy benchmark for default estimators**

**Logistic Regression**

In [9]:
lr = load_or_train(lr, X, LR_FILE_PATH)

Training...
Trained in 13.4839808941 seconds.
-------------------------------------------------------------------------------
Saving...
Saved in 1.54520487785 seconds.


In [10]:
mean_acc = evaluate_estimator(lr, X)

formated_print("Logistic Regression", mean_acc)

Evaluating model's accuracy...
Evaluation done in 47.4883239269 seconds
-------------------------------------------------------------------------------
Accuracy of the Logistic Regression model is 0.796623887995.


**SGD Classifier**

In [11]:
sgd = load_or_train(sgd, X, SGD_FILE_PATH)

Training...
Trained in 0.661247968674 seconds.
-------------------------------------------------------------------------------
Saving...
Saved in 2.02776503563 seconds.


In [12]:
mean_acc = evaluate_estimator(sgd, X)

formated_print("Stochastic Gradient Descent", mean_acc)

Evaluating model's accuracy...
Evaluation done in 3.21923995018 seconds
-------------------------------------------------------------------------------
Accuracy of the Stochastic Gradient Descent model is 0.872371479375.


**Optimized estimators**

After experimenting with the Grid Search hyperparameter optimizer, I found the near-optimal configuration below.

In [13]:
sgd_op = linear_model.SGDClassifier(n_iter = 25, alpha = 0.00005, n_jobs = 3)
lr_op = linear_model.LogisticRegression(max_iter = 500, C = 3593.8136638046258, n_jobs = 3)
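
The search itself is not shown in this notebook. As a rough sketch of the idea, an exhaustive grid search scores every combination of candidate values and keeps the best one. In the example below, `grid_search`, `toy_score`, and the candidate values are all invented for illustration; a real run would replace `toy_score` with the mean cross-validated accuracy of a model fitted with those parameters (scikit-learn's GridSearchCV automates this).

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Return (best_params, best_score) over the Cartesian product of param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(alpha, n_iter):
    """Made-up scoring function that peaks at alpha=1e-4, n_iter=25."""
    return 1.0 - abs(alpha - 1e-4) * 1000 - abs(n_iter - 25) / 100.0

# Hypothetical candidate values, not the grid used for this notebook.
grid = {"alpha": [1e-5, 1e-4, 1e-3], "n_iter": [5, 25, 50]}
best_params, best_score = grid_search(toy_score, grid)
```

The cost grows with the product of the grid sizes times the number of folds, so it pays to start with a coarse grid and refine around the best region.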

**Accuracy benchmark for optimized models**

**Logistic Regression**

In [14]:
lr_op = load_or_train(lr_op, X_new, LR_OP_FILE_PATH)

Training...
Trained in 17.5047140121 seconds.
-------------------------------------------------------------------------------
Saving...
Saved in 1.66693496704 seconds.


In [15]:
mean_acc = evaluate_estimator(lr_op, X_new)

formated_print("Optimized Logistic Regression", mean_acc)

Evaluating model's accuracy...
Evaluation done in 57.7978448868 seconds
-------------------------------------------------------------------------------
Accuracy of the Optimized Logistic Regression model is 0.919746016043.


**SGD Classifier**

In [16]:
sgd_op = load_or_train(sgd_op, X_new, SGD_OP_FILE_PATH)

Training...
Trained in 2.50821805 seconds.
-------------------------------------------------------------------------------
Saving...
Saved in 1.86658596992 seconds.


In [17]:
mean_acc = evaluate_estimator(sgd_op, X_new)

formated_print("Optimized Stochastic Gradient Descent", mean_acc)

Evaluating model's accuracy...
Evaluation done in 12.5524609089 seconds
-------------------------------------------------------------------------------
Accuracy of the Optimized Stochastic Gradient Descent model is 0.922662889297.


# Conclusion

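From the evaluations above I can conclude two things:
- Data preparation, in this case scaling and normalizing the feature vectors, is extremely important. Together with the tuned hyperparameters it raised accuracy from about 0.80 to 0.92 for Logistic Regression and from about 0.87 to 0.92 for the SGD Classifier. Since both changes were applied at once, the gain cannot be attributed to preprocessing alone.
- SGD Classifier trains much faster than Logistic Regression (about 0.7 s versus 13.5 s with default parameters, 2.5 s versus 17.5 s when optimized) while its accuracy is on par or slightly better, so it is preferable when working with big datasets.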
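
The comparison can be checked with a few lines of arithmetic. The figures below are copied from the cell outputs above; the timings come from single runs on one training split, so the ratios are rough indications rather than benchmarks.

```python
# Figures copied from the cell outputs in this notebook.
accuracy = {
    "lr": {"default": 0.796623887995, "optimized": 0.919746016043},
    "sgd": {"default": 0.872371479375, "optimized": 0.922662889297},
}
train_seconds = {
    "lr": {"default": 13.4839808941, "optimized": 17.5047140121},
    "sgd": {"default": 0.661247968674, "optimized": 2.50821805},
}

# Absolute accuracy gain of the optimized setup over the default one.
gains = {model: acc["optimized"] - acc["default"] for model, acc in accuracy.items()}

# How many times faster SGD trained than Logistic Regression.
speedups = {
    setup: train_seconds["lr"][setup] / train_seconds["sgd"][setup]
    for setup in ("default", "optimized")
}

for model, gain in gains.items():
    print("{0}: +{1:.1f} accuracy points".format(model, 100 * gain))
for setup, ratio in speedups.items():
    print("{0}: SGD trained about {1:.1f}x faster".format(setup, ratio))
```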