# Lab Assignment Four: Extending Logistic Regression

__*Austin Chen, Luke Hansen, Oscar Vallner*__

## 1. Preparation and Overview

### 1.1 Business Understanding

In today's world, it might be easy to make the assumption that those who have higher paying jobs must have achieved a higher level of education at some point in their lifetimes. For example, one might expect an employee who achieved a Masters degree to get paid more than an employee who only achieved a high school diploma. For the last couple decades, it has been a topic of wide cultural debate, whether higher level degrees of education (Masters and Doctoral) are worthy investments in careers.

Therefore, we have chosen to analyze a subset of 40 year's worth of the United States' federal payroll records. This large dataset contains the payroll records of government employees for the past 40 years. Our federal payroll data was obtained through the Freedom of Information Act by BuzzFeed news. We have chosen government payroll data for a few different reasons. First, having such a large volume of data (40 years worth of payroll records) provides us with flexibility; with a large dataset, we have the ability to subsample--whether it be randomized, by department, or year. Second, as patrons and benefactors of the United States Government, we have chosen to place faith in the assumption that officially published government data is accurate. The dataset contains several attributes regarding education level, payment, government agency division, etc. with each specific employees name abstracted. By trusting the integrity of this data, we can potentially create a useful, real-world classifier to predict an employee's highest education level. Even if the government has lied about the accuracy of the data, rendering the classifier useless for their internal operations, the classifier can still be useful to the public, contingent on an high accuracy rate.


While there is no singular use-case that this data is meant to solve, we have decided to pick apart this payroll data in order to see if there are any meaningful conclusions we can draw on any government worker's career and life decisions, based on data of their current job position. Perhaps it is _indeed_ true that those who achieve higher levels of education end up with higher compensations. It could be the case, however, that higher government compensation is merely a function of a longer length of service. We hope that through building a classifier, we can answer some of these questions.

Due to the sheer volume of the dataset, we have decided to take a subset of the 40 years of data, and narrow our focus to an easy-to-grasp classification problem regarding educational level. Given attributes such as the agency division, age, length of service, pay, and more, we will be attempting to clasisfy the highest level of education each employee received.

There may not be any __short-term__, immediate actions one could take with these results. However, through time, a trained Logistic Regression classifier could aid the United States government in analyzing the value of educational degrees in government jobs. By looking at factors such as pay, length of service, and age, the United States government could more appropriately create compensation models for their employees based off their highest level of education. Or, they could use any conclusions drawn from the classifer to verify whether the compensation consistency between many employees with the same degree. Furthermore, a classification model that is able to accurately predict an employee's education level could potentially be extended to contexts beyond government jobs. With enough data exploration, a classification model such as ours could be experimented with in other paradigms of the career market. _Perhaps_ a Masters degree isn't reflective of higher salaries in a government job, but could be indicative of higher compensation in a discipline such as engineering, business, or education. 

We ultimately decided to divide our classification problem into four classes. Each class represents the HIGHEST edcuation level that an employee attained.

1. Highschool (or under) not completed
2. Highschool Completed
3. Bachelor's Completed
4. Graduate degree completed (Masters and up)

As a baseline, our classification model should at least beat random, 25%. However, because this classification model could directly affect the United States government and its decisions, we should plan for our model to be as accurate as possible. In order to make helpful, informed decisions for the United States government based on our classification model, we wish to obtain at least 85% accuracy.


---
Link to dataset: https://ia600608.us.archive.org/16/items/opm-federal-employment-data/

In [1]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit
import plotly
from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
plotly.offline.init_notebook_mode()

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.


### 1.2 Class Variables

In [2]:
df = pd.read_csv('data/2014-clean.csv', encoding = 'latin1',low_memory=False)
df_adjust = df

# Encodes education into a much more manageable prediction attribute
df_adjust.Education[df_adjust.Education < 4] = 1
df_adjust.Education[(df_adjust.Education > 3) & (df_adjust.Education < 13)] = 2
df_adjust.Education[(df_adjust.Education > 12) & (df_adjust.Education < 17)] = 3
df_adjust.Education[df_adjust.Education > 16] = 4

#Binary encode, 8 is not a Supervisor, <8 are unordered types of Supervisors
df_adjust.SupervisoryStatus[df_adjust.SupervisoryStatus < 8] = 1
df_adjust.SupervisoryStatus[df_adjust.SupervisoryStatus == 8] = 0

#ensure data is integer encoded
df_adjust = df_adjust[np.isfinite(df_adjust['Education'])]
df_adjust = df_adjust[np.isfinite(df_adjust['SupervisoryStatus'])]
df_adjust.Education = df_adjust.Education.astype(int)
df_adjust.SupervisoryStatus = df_adjust.SupervisoryStatus.astype(int)

# Encodes Length of Service (LOS)
df_adjust.LOS[df_adjust.LOS == '< 1'] = 0
df_adjust.LOS[df_adjust.LOS == '1-2'] = 1
df_adjust.LOS[df_adjust.LOS == '3-4'] = 2
df_adjust.LOS[df_adjust.LOS == '5-9'] = 3
df_adjust.LOS[df_adjust.LOS == '10-14'] = 4
df_adjust.LOS[df_adjust.LOS == '15-19'] = 5
df_adjust.LOS[df_adjust.LOS == '20-24'] = 6
df_adjust.LOS[df_adjust.LOS == '25-29'] = 7
df_adjust.LOS[df_adjust.LOS == '30-34'] = 8
df_adjust.LOS[df_adjust.LOS == '35+'] = 9
df_adjust.LOS[df_adjust.LOS == 'UNSP'] = np.NaN

#we have more than ehough data. Data with missing elements is considered unreliable.
df_adjust = df_adjust.dropna()

#convert tointeger
df_adjust.Pay = df_adjust.Pay.astype(int)
df_adjust.LOS = df_adjust.LOS.astype(int)

In [3]:
print(df_adjust.Education.value_counts())
print(len(df_adjust))

2    306080
3    216420
4    137024
1      8342
Name: Education, dtype: int64
667866


In [4]:
# This block of code removes all agencies without at least 10,000 people in it

agency_names = (df_adjust.AgencyName.value_counts() > 10000)
trimmed_agencies = []

for i in range(0, len(agency_names)):
    if agency_names[i] == True:
        trimmed_agencies.append(agency_names.axes[0][i])
        
    else:
        break
print(trimmed_agencies)

df_adjust = df_adjust[~df_adjust['AgencyName'].isin(trimmed_agencies)]
print("Length: ", len(df_adjust))

['VETERANS HEALTH ADMINISTRATION', 'INTERNAL REVENUE SERVICE', 'FEDERAL AVIATION ADMINISTRATION', 'NATIONAL INSTITUTES OF HEALTH', 'VETERANS BENEFITS ADMINISTRATION', 'NATIONAL PARK SERVICE', 'FOOD AND DRUG ADMINISTRATION', 'OFC SEC HEALTH AND HUMAN SERVICES', 'ENVIRONMENTAL PROTECTION AGENCY', 'FEDERAL EMERGENCY MANAGEMENT AGENCY', 'INDIAN HEALTH SERVICE', 'BUREAU OF PRISONS/FEDERAL PRISON SYSTEM', 'DEPARTMENT OF ENERGY']
Length:  153661


In [5]:
#df_usable = df_adjust[['Agency','Age','Education','PayPlan', 'LOS', 'Category','Pay', 'SupervisoryStatus', 'Schedule', 'NSFTP', 'AgencyName']]
df_usable = df_adjust[['Age','Education','PayPlan', 'LOS', 'Category','Pay', 'SupervisoryStatus', 'Schedule', 'NSFTP', 'AgencyName']]

print(df_usable.head())
print(df_usable.describe())
for x in df_usable:
    print(x, " : ", len(df_usable[x].unique()))

     Age  Education PayPlan  LOS Category     Pay  SupervisoryStatus Schedule  \
0  35-39          3      GS    3        P  149993                  0        F   
1  35-39          3      GS    3        P  149993                  0        F   
2  40-44          3      GS    4        P  145504                  0        F   
3  30-34          3      GS    3        P  124995                  0        F   
4  30-34          3      GS    3        P  145827                  0        F   

   NSFTP                     AgencyName  
0      1  OFFICES, BOARDS AND DIVISIONS  
1      1  OFFICES, BOARDS AND DIVISIONS  
2      1  OFFICES, BOARDS AND DIVISIONS  
3      1  OFFICES, BOARDS AND DIVISIONS  
4      1  OFFICES, BOARDS AND DIVISIONS  
           Education            LOS            Pay  SupervisoryStatus  \
count  153661.000000  153661.000000  153661.000000      153661.000000   
mean        2.895354       4.259916   93296.271826           0.178399   
std         0.789885       2.319148   3430

In [6]:
#one hot encode necessary data
df_dummies = pd.get_dummies(data=df_usable, columns=['Age', 'PayPlan', 'Category', 'SupervisoryStatus', 'Schedule', 'NSFTP', 'AgencyName'])

df_dummies = df_dummies.dropna()

In [7]:
#splitting the dataset
#X_train, X_test, y_train1, y_test1 = train_test_split(df_dummies.drop('Education',1), df_dummies.Education, test_size=0.20, random_state=42)

### 1.3 Train-Test-Split & Cross Validation

## 2. Modeling

### 2.1 One-versus-all Logistic Regression Classifier

### 2.2 Performance Optimization

### 2.3 Comparing Results to sci-kit-learn

## 3. Modeling

## 4. Integrating with GridSearchCV

# SCRATCH WORK --------------------------------------------------------------------

## Logistic Regression Classes

In [8]:
class BinaryLogisticRegressionBase:
    # private:
    def __init__(self, eta, iterations=20):
        self.eta = eta
        self.iters = iterations
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        return 'Base Binary Logistic Regression Object, Not Trainable'
    
    # convenience, private:
    @staticmethod
    def _sigmoid(theta):
        return 1/(1+np.exp(-theta)) 
    
    @staticmethod
    def _add_bias(X):
        return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term
    
    # public:
    def predict_proba(self,X,add_bias=True):
        # add bias term if requested
        Xb = self._add_bias(X) if add_bias else X
        return self._sigmoid(Xb @ self.w_) # return the probability y=1
    
    def predict(self,X):
        return (self.predict_proba(X)>0.5) #return the actual prediction

In [9]:
# inherit from base class
class BinaryLogisticRegression(BinaryLogisticRegressionBase):
    #private:
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # is we have trained the object
        else:
            return 'Untrained Binary Logistic Regression Object'
        
    def _get_gradient(self,X,y):
        # programming \sum_i (yi-g(xi))xi
        gradient = np.zeros(self.w_.shape) # set gradient to zero
        for (xi,yi) in zip(X,y):
            # the actual update inside of sum
            gradi = (yi - self.predict_proba(xi,add_bias=False))*xi 
            # reshape to be column vector and add to gradient
            gradient += gradi.reshape(self.w_.shape) 
        
        return gradient/float(len(y))
       
    # public:
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        
        self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
        
        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb,y)
            self.w_ += gradient*self.eta # multiply by learning rate 

In [10]:
from scipy.special import expit

class VectorBinaryLogisticRegression(BinaryLogisticRegression):
    # inherit from our previous class to get same functionality
    @staticmethod
    def _sigmoid(theta):
        # increase stability, redefine sigmoid operation
        return expit(theta) #1/(1+np.exp(-theta))
    
    # but overwrite the gradient calculation
    def _get_gradient(self,X,y):
        ydiff = y-self.predict_proba(X,add_bias=False).ravel() # get y difference
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
        
        return gradient.reshape(self.w_.shape)

In [11]:
class LogisticRegression:
    def __init__(self, eta, iterations=20):
        self.eta = eta
        self.iters = iterations
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # is we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
    def fit(self,X,y):
        num_samples, num_features = X.shape
        self.unique_ = np.unique(y) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = [] # will fill this array with binary classifiers
        
        for i,yval in enumerate(self.unique_): # for each unique value
            y_binary = y==yval # create a binary problem
            # train the binary classifier for this class
            blr = VectorBinaryLogisticRegression(self.eta,self.iters)
            blr.fit(X,y_binary)
            # add the trained classifier to the list
            self.classifiers_.append(blr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for blr in self.classifiers_:
            probs.append(blr.predict_proba(X)) # get probability for each classifier
        
        return np.hstack(probs) # make into single matrix
    
    def predict(self,X):
        return np.argmax(self.predict_proba(X),axis=1) # take argmax along row

In [12]:
class RegularizedBinaryLogisticRegression(VectorBinaryLogisticRegression):
    # extend init functions
    def __init__(self, C=0.0, **kwds):        
        # need to add to the original initializer 
        self.C = C
        # but keep other keywords
        super().__init__(**kwds) # call parent initializer
        
        
    # extend previous class to change functionality
    def _get_gradient(self,X,y):
        # call get gradient from previous class
        gradient = super()._get_gradient(X,y)
        
        # add in regularization (to all except bias term)
        gradient[1:] += -2 * self.w_[1:] * self.C
        return gradient

In [13]:
# now redefine the Logistic Regression Function where needed
class RegularizedLogisticRegression(LogisticRegression):
    def __init__(self, C=0.0, **kwds):        
        # need to add to the original initializer 
        self.C = C
        # but keep other keywords
        super().__init__(**kwds) # call parent initializer
        
    def fit(self,X,y):
        num_samples, num_features = X.shape
        self.unique_ = np.unique(y) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = [] # will fill this array with binary classifiers
        
        for i,yval in enumerate(self.unique_): # for each unique value
            y_binary = y==yval # create a binary problem
            # train the binary classifier for this class
            blr = RegularizedBinaryLogisticRegression(eta=self.eta,
                                                      iterations=self.iters,
                                                      C=self.C)
            blr.fit(X,y_binary)
            # add the trained classifier to the list
            self.classifiers_.append(blr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T

In [24]:
from scipy.optimize import minimize_scalar
import copy
class LineSearchLogisticRegression(BinaryLogisticRegression):
    
    # define custom line search for problem
    @staticmethod
    def line_search_function(eta,X,y,w,grad,C):
        wnew = w + grad*eta
        yhat = (1/(1+np.exp(-X @ wnew)))>0.5
        return np.sum((y-yhat)**2)-C*np.sum(wnew**2)
        
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        
        self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
        
        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb,y)
            
            # do line search in gradient direction, using scipy function
            opts = {'maxiter':self.iters/20} # unclear exactly what this should be
            res = minimize_scalar(self.line_search_function, # objective function to optimize
                                  bounds=(self.eta/1000,self.eta*10), #bounds to optimize
                                  args=(Xb,y,self.w_,gradient,self.C), # additional argument for objective function
                                  method='bounded', # bounded optimization for speed
                                  options=opts) # set max iterations
            
            eta = res.x # get optimal learning rate
            self.w_ += gradient*eta # set new function values

In [25]:
class StochasticLogisticRegression(BinaryLogisticRegression):
    # stochastic gradient calculation 
    def _get_gradient(self,X,y):
        idx = int(np.random.rand()*len(y)) # grab random instance
        ydiff = y[idx]-self.predict_proba(X[idx],add_bias=False) # get y difference (now scalar)
        gradient = X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and multiply through
        
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.C
        
        return gradient

In [26]:
from numpy.linalg import pinv
class HessianBinaryLogisticRegression(BinaryLogisticRegression):
    # just overwrite gradient function
    def _get_gradient(self,X,y):
        g = self.predict_proba(X,add_bias=False).ravel() # get sigmoid value for all classes
        hessian = X.T @ np.diag(g*(1-g)) @ X - 2 * self.C # calculate the hessian

        ydiff = y-g # get y difference
        gradient = np.sum(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.C
        
        return pinv(hessian) @ gradient

In [27]:
from scipy.optimize import fmin_bfgs
class BFGSBinaryLogisticRegression(BinaryLogisticRegression):
    
    @staticmethod
    def objective_function(w,X,y,C):
        g = expit(X @ w)
        return -np.sum(np.log(g[y==1]))-np.sum(np.log(1-g[y==0])) + C*sum(w**2) #-np.sum(y*np.log(g)+(1-y)*np.log(1-g))

    @staticmethod
    def objective_gradient(w,X,y,C):
        g = expit(X @ w)
        ydiff = y-g # get y difference
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
        gradient = gradient.reshape(w.shape)
        gradient[1:] += -2 * w[1:] * C
        return -gradient
    
    # just overwrite fit function
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        
        self.w_ = fmin_bfgs(self.objective_function, # what to optimize
                            np.zeros((num_features,1)), # starting point
                            fprime=self.objective_gradient, # gradient function
                            args=(Xb,y,self.C), # extra args for gradient and objective function
                            gtol=1e-03, # stopping criteria for gradient, |v_k|
                            maxiter=self.iters, # stopping criteria iterations
                            disp=False)
        
        self.w_ = self.w_.reshape((num_features,1))

In [28]:
class MultiClassLogisticRegression:
    def __init__(self, eta, iterations=20, C=0.0001):
        self.eta = eta
        self.iters = iterations
        self.C = C
        self.classifiers_ = []
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # is we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
    def fit(self,X,y):
        num_samples, num_features = X.shape
        self.unique_ = np.sort(np.unique(y)) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = []
        for i,yval in enumerate(self.unique_): # for each unique value
            y_binary = y==yval # create a binary problem
            # train the binary classifier for this class
            hblr = BFGSBinaryLogisticRegression(self.eta,self.iters,self.C)
            hblr.fit(X,y_binary)
            #print(accuracy(y_binary,hblr.predict(X)))
            # add the trained classifier to the list
            self.classifiers_.append(hblr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for hblr in self.classifiers_:
            probs.append(hblr.predict_proba(X).reshape((len(X),1))) # get probability for each classifier
        
        return np.hstack(probs) # make into single matrix
    
    def predict(self,X):
        return np.argmax(self.predict_proba(X),axis=1) # take argmax along row

In [29]:
class ParallelMultiClassLogisticRegression(MultiClassLogisticRegression):
    @staticmethod
    def par_logistic(yval,eta,iters,C,X,y):
        y_binary = y==yval # create a binary problem
        # train the binary classifier for this class
        hblr = BFGSBinaryLogisticRegression(eta,iters,C)
        hblr.fit(X,y_binary)
        return hblr
    
    def fit(self,X,y):
        num_samples, num_features = X.shape
        self.unique_ = np.sort(np.unique(y)) # get each unique class value
        num_unique_classes = len(self.unique_)
        backend = 'threading' #'multiprocessing'
        
        self.classifiers_ = Parallel(n_jobs=-1,backend=backend)(
            delayed(self.par_logistic)(yval,self.eta,self.iters,self.C,X,y) for yval in self.unique_)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T

In [None]:
class MyLogisticRegression:
    def __init__(self, eta, iters=20, method="steepdescent", regularization="none", C=0.0):
        self.eta = eta
        self.iters = iters
        self.method = method
        self.regularization = regularization
        self.C = C
        
    
        
            
            

## Read and clean the data

## PCA

In [14]:
X_base = normalize(df_dummies.drop('Education',1).as_matrix())
y_base = df_dummies.Education
h, w = X_base.shape
print(X_base)

[[  2.00009334e-05   1.00000000e+00   0.00000000e+00 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  2.00009334e-05   1.00000000e+00   0.00000000e+00 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  2.74906532e-05   9.99999999e-01   0.00000000e+00 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 ..., 
 [  6.34920633e-05   9.99999996e-01   0.00000000e+00 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  3.53556779e-05   9.99999995e-01   0.00000000e+00 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  2.18895018e-05   9.99999998e-01   0.00000000e+00 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]]



Data with input dtype int64 was converted to float64 by the normalize function.



In [15]:
from sklearn.decomposition import PCA
n_components = 50
pca = PCA(n_components=n_components)
%time X_transform = pca.fit_transform(X_base)
pca_res = pca.components_.reshape((n_components,w))

CPU times: user 5.64 s, sys: 682 ms, total: 6.33 s
Wall time: 4.34 s


In [16]:
def plot_explained_variance(pca):
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })

In [17]:
%matplotlib inline
plot_explained_variance(pca)

### Conclusion 
- TODO: One hot encoding took 10 components (8 which needed to be one hot encoded) and increased the dimensionality to 466 components. Normalization and PCA reduced this number back down to needing 31 components to explain 96% of the variance in the data

In [18]:
y_graph = df_adjust.Education # make problem binary

plotly.offline.init_notebook_mode() # run at the start of every notebook

graph1 = {'labels': np.unique(y_graph),
          'values': np.bincount(y_graph)[1:],
            'type': 'pie'}
fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Binary Class Distribution',
                'autosize':False,
                'width':500,
                'height':300}

plotly.offline.iplot(fig)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.


In [21]:

y = y_base.values
X = X_transform
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(
                         n_splits=num_cv_iterations,
                         test_size  = 0.2)

In [22]:
%%time
# run logistic regression and vary some parameters
from sklearn import metrics as mt

# first we create a reusable logisitic regression object
#   here we can setup the object with different learning parameters and constants
lr_clf = RegularizedLogisticRegression(eta=0.1,iterations=100) # get object


for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
    lr_clf.fit(X[train_indices],y[train_indices])  # train object
    y_hat = lr_clf.predict(X[test_indices]) # get test set precitions

    # print the accuracy and confusion matrix 
    print("====Iteration",iter_num," ====")
    print("accuracy", mt.accuracy_score(y[test_indices],y_hat)) 
    print("confusion matrix\n",mt.confusion_matrix(y[test_indices],y_hat))

====Iteration 0  ====
accuracy 0.350047180555
confusion matrix
 [[    0   157     0     0]
 [    0 10758     0     0]
 [    0 11724     0     0]
 [    0  8094     0     0]]
====Iteration 1  ====
accuracy 0.352422477467
confusion matrix
 [[    0   175     0     0]
 [    0 10831     0     0]
 [    0 11756     0     0]
 [    0  7971     0     0]]
====Iteration 2  ====
accuracy 0.349851950672
confusion matrix
 [[    0   161     0     0]
 [    0 10752     0     0]
 [    0 11769     0     0]
 [    0  8051     0     0]]
CPU times: user 1min 2s, sys: 7.84 s, total: 1min 10s
Wall time: 36.8 s


## SCI KIT LEARN SHIT


In [23]:
%%time
from sklearn.linear_model import LogisticRegression as SKLogisticRegression

lr_sk = SKLogisticRegression() # all params default

for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
    lr_sk.fit(X[train_indices],y[train_indices])  # train object
    y_hat = lr_sk.predict(X[test_indices]) # get test set precitions

    # print the accuracy and confusion matrix 
    print("====Iteration",iter_num," ====")
    print("accuracy", mt.accuracy_score(y[test_indices],y_hat)) 
    print("confusion matrix\n",mt.confusion_matrix(y[test_indices],y_hat))

====Iteration 0  ====
accuracy 0.379168971464
confusion matrix
 [[    0     0   160     0]
 [    0     0 10952     0]
 [    0     0 11653     0]
 [    0     0  7968     0]]
====Iteration 1  ====
accuracy 0.379526892916
confusion matrix
 [[    0     0   176     0]
 [    0     0 10908     0]
 [    0     0 11664     0]
 [    0     0  7985     0]]
====Iteration 2  ====
accuracy 0.386587707025
confusion matrix
 [[    0     0   157     0]
 [    0     0 10755     0]
 [    0     0 11881     0]
 [    0     0  7940     0]]
CPU times: user 4.48 s, sys: 226 ms, total: 4.71 s
Wall time: 4.68 s


Aite so basically, scikit learn is better than us