# Demo of Logistic Regression Algorithms

In this notebook, we implement two logistic regression algorithms based on Bregman distances, taken from 'Logistic Regression, AdaBoost and Bregman Distances' (Collins, Schapire and Singer, 2002; referred to below as Schapire et al.) and 'Bregman Distance to L1 Regularized Logistic Regression' (Huang and Gupta, 2010). We compare them to two more standard algorithms: Logistic Regression (L-BFGS, no regularization) and Lasso Regression (via coordinate descent).

We run these 4 algorithms to perform binary classification on a variety of datasets.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import cvxpy as cvx
from sklearn import preprocessing, datasets, model_selection, linear_model

# Import Data

The data are processed so that labels are +1 or -1 (which is useful for the algorithms we implement). We hold out 1/7 of the data as the test set and use the rest for training. For datasets with more than 2 classes (in particular, MNIST and Fashion MNIST), we keep only the instances belonging to 2 particular classes.

In [2]:
# Dictionary of datasets, keys are names (string) of datasets, values are 4 tuples: 
# (X_train, X_test, y_train, y_test)
data_dict = dict()
lb = preprocessing.LabelBinarizer(neg_label=-1, pos_label = 1)

for openml_id in [554, 40996, 59, 37, 53, 1510]:
    dataset = datasets.fetch_openml(data_id = openml_id)
    good_targets = np.unique(dataset.target)[0:2]
    ix = np.isin(dataset.target, good_targets)
    X = dataset.data[ix]
    y = lb.fit_transform(dataset.target[ix]).ravel()
    data_dict[dataset.details['name']]= model_selection.train_test_split(X, y, test_size=1/7.0, random_state=0)
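As a rough illustration of the preprocessing above that does not require downloading anything, the sketch below applies the same two steps to a small made-up target array: keep the first two classes in sorted order, then map them to -1 and +1. It uses plain NumPy in place of `LabelBinarizer` to stay dependency-light, which is an assumption of this example; `toy_target` and `to_pm1` are hypothetical names.

```python
import numpy as np

def to_pm1(target):
    """Keep the first two classes (sorted order) and map them to -1 / +1.

    Hypothetical helper mirroring the notebook's filtering step; returns the
    boolean row mask and the +/-1 labels of the kept rows.
    """
    target = np.asarray(target)
    good_targets = np.unique(target)[0:2]   # np.unique returns sorted classes
    ix = np.isin(target, good_targets)      # rows belonging to those two classes
    y = np.where(target[ix] == good_targets[1], 1, -1)
    return ix, y

toy_target = np.array(['3', '0', '1', '1', '0', '7', '1'])  # made-up labels
ix, y = to_pm1(toy_target)
print(ix)  # which rows survive the filter
print(y)   # their labels in {-1, +1}
```

With string targets such as the MNIST digits, sorted order means the classes '0' and '1' are the ones kept.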

# Logistic Regression (No Regularization)

In [3]:
def LogitR(X_train, X_test, y_train, y_test):
    # sklearn's Logistic Regression has no option to disable regularization
    # We set the regularization strength very low to get similar results
    # max_iter is passed as an int; recent sklearn versions reject a float such as 1e4
    Logit_model = linear_model.LogisticRegression(C = 1e9, solver = 'lbfgs', max_iter = 10000)
    Logit_model.fit(X_train, y_train)
    accuracy = Logit_model.score(X_test, y_test)
    
    # The predictions and weights can also be calculated:
    #predictions = Logit_model.predict(X_test)
    #weights = np.concatenate([Logit_model.intercept_, Logit_model.coef_[0]])

    return accuracy

# Lasso Regression (L1 Regularization)

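The regularization strength is determined by 4-fold cross-validation. Note that `LassoCV` fits an L1-penalized least-squares model to the +1/-1 labels rather than a logistic loss; the predicted class is the sign of its output.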

In [4]:
def LassoR(X_train, X_test, y_train, y_test):
    Lasso_model = linear_model.LassoCV(cv=4)
    Lasso_model.fit(X_train, y_train)

    predictions = np.sign(Lasso_model.predict(X_test))
    accuracy = np.mean(predictions==y_test)
    #weights = np.concatenate([[Lasso_model.intercept_], Lasso_model.coef_])
    return accuracy

# Bregman Logistic Regression by Schapire et al.

In [5]:
def BregmanLogit(X_train, X_test, y_train, y_test, max_iters = 1000):
    # First preprocess the data to
    # 1) Include a bias parameter
    # 2) Scale all instances x_i so the max l1 norm is 1/2
    #    (for a 2-D array, ord=np.inf is the max absolute row sum, i.e. the largest l1 norm of any row)
    X_train = np.concatenate([np.ones((X_train.shape[0],1)), X_train], axis = 1)
    X_test = np.concatenate([np.ones((X_test.shape[0],1)), X_test], axis = 1)
    
    l1_max = max(np.linalg.norm(X_train, ord = np.inf), np.linalg.norm(X_test, ord = np.inf))
    X_train = X_train/(2*l1_max)
    X_test = X_test/(2*l1_max)
    
    n_train_samples, x_dim = X_train.shape
    
    
    # Train weight vector (Parallel Algorithm, Section 5)
    w = np.zeros(x_dim)
    q = 1/2 * np.ones(n_train_samples)
    M = X_train * y_train[:, np.newaxis] # Makes M[i] = y[i] * x[i] so M[i][j] = y[i] x[i][j]
    
    M_pos = np.multiply(M, M>0)
    M_neg = np.multiply(-M, M<0)

    for t in range(1,max_iters+1):
        # Update q (on the first iteration q keeps its initial value of 1/2)
        if t>1:
            q = np.divide(q, np.multiply(1-q, np.exp(M @ d)) + q)
        
        # Update d
        W_pos = q @ M_pos
        W_neg = q @ M_neg

        def delta(w_pos, w_neg):
            # delta is picked to minimize the summand in Equation 27
            if w_pos == 0 and w_neg == 0:
                return 0
            if w_pos == 0 and w_neg != 0:
                return -99
            if w_pos != 0 and w_neg == 0:
                return 99
            return 1/2 * np.log(w_pos/w_neg)

        delta_vec = np.vectorize(delta)
        d = delta_vec(W_pos, W_neg)
        w += d

        if t>1 and abs(np.linalg.norm(w, ord=1)/np.linalg.norm(w-d, ord=1)-1) < 1e-4:
            # print("Converged on iter:", t)
            break
            
    # Make predictions on test and evaluate accuracy
    # Pass X_test @ w through h for class probabilities
    # from scipy.special import expit as h # Logistic Sigmoid
    predictions = np.sign(X_test @ w)
    accuracy = np.mean(y_test.T==predictions)
    return accuracy
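The closed form used in `delta` follows from minimizing each summand of the bound separately. The short derivation below is a sketch in our own notation, where $W_j^{+}$ and $W_j^{-}$ stand for the entries of `W_pos` and `W_neg`; it is not copied from the paper.

```latex
\begin{aligned}
f_j(\delta_j) &= W_j^{+}\left(e^{-\delta_j} - 1\right) + W_j^{-}\left(e^{\delta_j} - 1\right), \\
f_j'(\delta_j) &= -W_j^{+} e^{-\delta_j} + W_j^{-} e^{\delta_j} = 0
  \;\Longrightarrow\; e^{2\delta_j} = \frac{W_j^{+}}{W_j^{-}}
  \;\Longrightarrow\; \delta_j = \tfrac{1}{2}\ln\frac{W_j^{+}}{W_j^{-}}.
\end{aligned}
```

When $W_j^{-} = 0 < W_j^{+}$ the summand is strictly decreasing in $\delta_j$, so the minimizer runs off to $+\infty$ (and to $-\infty$ in the mirrored case). These are the two situations the code caps at 99 and -99.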

# Bregman Logistic Regression with L1 regularization

In [6]:
def BregmanLogit_Reg(X_train, X_test, y_train, y_test, alpha, max_iters = 1000):
    # First preprocess the data to
    # 1) Include a bias parameter
    # 2) Scale all instances x_i so the max l1 norm is 1/2
    #    (for a 2-D array, ord=np.inf is the max absolute row sum, i.e. the largest l1 norm of any row)
    X_train = np.concatenate([np.ones((X_train.shape[0],1)), X_train], axis = 1)
    X_test = np.concatenate([np.ones((X_test.shape[0],1)), X_test], axis = 1)
    
    l1_max = max(np.linalg.norm(X_train, ord = np.inf), np.linalg.norm(X_test, ord = np.inf))
    X_train = X_train/(2*l1_max)
    X_test = X_test/(2*l1_max)
    
    n_train_samples, x_dim = X_train.shape
    
    
    # Train weight vector (Parallel Algorithm, Section 5)
    w = np.zeros(x_dim)
    q = 1/2 * np.ones(n_train_samples)
    M = X_train * y_train[:, np.newaxis] # Makes M[i] = y[i] * x[i] so M[i][j] = y[i] x[i][j]
    
    M_pos = np.multiply(M, M>0)
    M_neg = np.multiply(-M, M<0)
    
    for t in range(1,max_iters+1):
        # Update q (on the first iteration q keeps its initial value of 1/2)
        if t>1:
            q = np.divide(q, np.multiply(1-q, np.exp(M @ d)) + q)
        
        # Update d
        W_pos = q @ M_pos
        W_neg = q @ M_neg
        
        d = cvx.Variable(x_dim)
        # d is chosen to minimize Equation 27 (Schapire et al); @ makes the dot products explicit
        bregman_bound = W_pos @ (cvx.exp(-d) - 1) + W_neg @ (cvx.exp(d) - 1)
        objective = cvx.Minimize(bregman_bound)
        constraint = [cvx.norm1(w+d) <= alpha]
        prob = cvx.Problem(objective, constraint)
        prob.solve()  # Returns the optimal value.
        d = d.value
        w += d
        
        if t>1 and abs(np.linalg.norm(w, ord=1)/np.linalg.norm(w-d, ord=1)-1) < 1e-3:
            #print("Converged on iter:", t)
            break
            
    # Make predictions on test and evaluate accuracy
    # Pass X_test @ w through h for class probabilities
    # from scipy.special import expit as h # Logistic Sigmoid
    predictions = np.sign(X_test @ w)
    accuracy = np.mean(y_test.T==predictions)
    return w, predictions, accuracy
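Both `BregmanLogit` and `BregmanLogit_Reg` update `q` incrementally from the latest step `d`. Algebraically, this should match recomputing `q[i] = 1 / (1 + exp((M @ w)[i]))` from the accumulated weights. The sketch below checks this numerically on made-up data; the random `M` and the arbitrary steps are assumptions for illustration only, not the steps either algorithm would choose.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.uniform(-0.5, 0.5, size=(6, 3))    # made-up stand-in for M[i] = y[i] * x[i]
w = np.zeros(3)
q = 0.5 * np.ones(6)                        # value of q at w = 0

for _ in range(5):
    d = rng.normal(scale=0.3, size=3)       # arbitrary step, not the optimal one
    q = q / ((1 - q) * np.exp(M @ d) + q)   # incremental update used in the notebook
    w = w + d

q_direct = 1.0 / (1.0 + np.exp(M @ w))      # recomputed from the accumulated w
print(np.allclose(q, q_direct))             # the two should agree up to rounding
```

Recomputing `q` directly from `w` is a possible sanity check when debugging, since it does not accumulate rounding error across iterations.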

In [7]:
# Evaluate accuracies of the 4 algorithms on the datasets

for func in [LogitR, LassoR, BregmanLogit]:
    print(func.__name__ + '\n')
    for key, value in data_dict.items():
        print(key, func(*value))
    print('--------------------\n')
    
print("BregmanLogit_Reg\n")    
for key, value in data_dict.items():
    if key in ['mnist_784', 'Fashion-MNIST']:
        # We limit the number of iterations for these cases because
        # cvxpy has convergence issues on these high-dimensional datasets
        print(key, BregmanLogit_Reg(*value, 1e5, 100)[2])
    else:
        print(key, BregmanLogit_Reg(*value, 1e3)[2])

LogitR

mnist_784 0.9976325757575758
Fashion-MNIST 0.984
ionosphere 0.9803921568627451
diabetes 0.8272727272727273
heart-statlog 0.8974358974358975
wdbc 0.9878048780487805
--------------------

LassoR

mnist_784 0.9910037878787878
Fashion-MNIST 0.989
ionosphere 0.9019607843137255
diabetes 0.8363636363636363
heart-statlog 0.8717948717948718
wdbc 0.9512195121951219
--------------------

BregmanLogit

mnist_784 0.9962121212121212
Fashion-MNIST 0.988
ionosphere 0.8823529411764706
diabetes 0.7363636363636363
heart-statlog 0.8717948717948718
wdbc 0.9024390243902439
--------------------

BregmanLogit_Reg

mnist_784 0.9990530303030303
Fashion-MNIST 0.9695
ionosphere 0.8627450980392157
diabetes 0.7272727272727273
heart-statlog 0.8974358974358975
wdbc 0.8658536585365854


# Results

Test accuracy (%) of each algorithm:

| Dataset | Instances | Features | LR | Bregman LR | LR+L1 | Bregman LR+L1 |
| ------------- |:-------------:| -----:| -----:| -----:| -----:| -----:|
| MNIST | 14780 | 784 | 99.8 | 99.6 | 99.1 | 99.9 |
| Fashion MNIST | 14000 | 784 | 98.4 | 98.8 | 98.9 | 97.0 |
| Ionosphere | 351 | 35 | 98.0 | 88.2 | 90.2 | 86.3 |
| Diabetes | 768 | 8 | 82.7 | 73.6 | 83.6 | 72.7 |
| Heart Statlog | 270 | 13 | 89.7 | 87.2 | 87.2 | 89.7 |
| WDBC | 569 | 30 | 98.8 | 90.2 | 95.1 | 86.6 |

There is no obvious relation between the number of features or instances and the performance of the Bregman versions relative to the non-Bregman versions.

On the high-dimensional data (MNIST and Fashion MNIST), all 4 algorithms perform very well and quite similarly. On the other datasets, however, our implementations of the Bregman versions mostly perform worse than the standard versions, by roughly 10 percentage points of test accuracy in the worst cases (98.0 vs 88.2 for the unregularized models on Ionosphere, and 83.6 vs 72.7 for the regularized models on Diabetes). The one exception is Heart Statlog, where Bregman LR+L1 (89.7) beats LR+L1 (87.2).
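To make the size of these gaps concrete, the sketch below computes, for each dataset, the difference in percentage points between each Bregman version and its standard counterpart. The accuracies are copied from the printed output above; the pairing (BregmanLogit against LogitR, BregmanLogit_Reg against LassoR) is our reading of the comparison.

```python
# Test accuracies copied from the printed output above, in the order
# (LogitR, LassoR, BregmanLogit, BregmanLogit_Reg)
acc = {
    'mnist_784':     (0.9976325757575758, 0.9910037878787878, 0.9962121212121212, 0.9990530303030303),
    'Fashion-MNIST': (0.984, 0.989, 0.988, 0.9695),
    'ionosphere':    (0.9803921568627451, 0.9019607843137255, 0.8823529411764706, 0.8627450980392157),
    'diabetes':      (0.8272727272727273, 0.8363636363636363, 0.7363636363636363, 0.7272727272727273),
    'heart-statlog': (0.8974358974358975, 0.8717948717948718, 0.8717948717948718, 0.8974358974358975),
    'wdbc':          (0.9878048780487805, 0.9512195121951219, 0.9024390243902439, 0.8658536585365854),
}

# Gap in percentage points: Bregman version minus its standard counterpart
gaps = {}
for name, (lr, lasso, breg, breg_reg) in acc.items():
    gaps[name] = (100 * (breg - lr), 100 * (breg_reg - lasso))
    print(f"{name:14s} unregularized {gaps[name][0]:+6.2f}   regularized {gaps[name][1]:+6.2f}")
```

A negative gap means the Bregman version scored lower on the test set. With test sets as small as 39 to 110 instances for the four smaller datasets, a single misclassified example moves these numbers by about one to two and a half points.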

It is difficult to fully understand where these differences come from, because we could not always make analogous choices identical across the algorithms, which would have allowed a more direct comparison. For example, the Lasso (LR+L1) model uses the penalized (Tikhonov) form of regularization, whereas the Bregman LR+L1 algorithm uses the constrained (Ivanov) form. Further, the Ivanov regularization parameter in the Bregman LR+L1 algorithm, as well as the number of iterations, could not be set completely freely: the optimization package CVXPY runs into convergence issues if these are set to bad values.

For the Bregman LR algorithm, we encountered an edge case not explicitly mentioned in the paper by Collins, Schapire and Singer, where certain weights may need to be adjusted by -inf or +inf. At first we tried simply adding or subtracting a large number in these cases; however, if the number of iterations became too large, these weights became np.nan and this propagated through our results. Thus, we were somewhat artificially limited in the number of iterations we could perform.
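One way to sidestep the infinite-step edge case is a smoothing trick common in the boosting literature: add a small constant to both sums before taking the log ratio. The sketch below is an alternative to the cap used in this notebook, not something taken from it, and we have not evaluated its effect on the accuracies reported above. The names `smoothed_delta` and `eps` are hypothetical.

```python
import numpy as np

def smoothed_delta(W_pos, W_neg, eps=1e-8):
    """Vectorized step 0.5 * log((W_pos + eps) / (W_neg + eps)).

    With eps > 0 the ratio stays finite and positive when either sum is zero,
    so no entry of the step is inf or nan (assuming non-negative, finite inputs).
    """
    W_pos = np.asarray(W_pos, dtype=float)
    W_neg = np.asarray(W_neg, dtype=float)
    return 0.5 * np.log((W_pos + eps) / (W_neg + eps))

# Made-up sums covering the four cases handled by the if-chain in `delta`
d = smoothed_delta([0.0, 0.3, 0.0, 0.2], [0.0, 0.0, 0.4, 0.2])
print(d)
```

The size of `eps` bounds the largest possible step, so it plays a role similar to the hard cap but scales with the magnitude of the nonzero sum.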