# AccelerateAI: Logistic Regression

## Logistic Regression with sklearn vs statsmodels

We will generate a synthetic dataset at random and compare how sklearn and statsmodels fit a logistic regression to it.

### 1. Import Libraries and Prepare dataset

In [1]:
import numpy as np
import pandas as pd

np.set_printoptions(precision=5)

In [2]:
# Create random numbers with seed set for consistency
random_generator = np.random.default_rng(seed=124)

# Create an array with 600 rows and 3 columns
X_for_creating_probabilities = random_generator.normal(size=(600,3))

In [3]:
X_for_creating_probabilities[:5]

array([[-0.30062, -0.57846, -1.11057],
       [-1.18548,  0.81149,  0.60825],
       [-1.02757,  1.08403, -0.4984 ],
       [ 1.11852, -1.22129,  0.24011],
       [-1.91692,  0.07204,  1.0028 ]])

Let us create an array with the first column removed. The removed column can be thought of as random noise, or as a feature that we do not have access to when building the model, which is a common situation in real-life scenarios.

In [4]:
X1 = np.delete(X_for_creating_probabilities,0,axis=1)

X1[:5]

array([[-0.57846, -1.11057],
       [ 0.81149,  0.60825],
       [ 1.08403, -0.4984 ],
       [-1.22129,  0.24011],
       [ 0.07204,  1.0028 ]])

Now we will add two features/columns correlated with X1. Practical datasets often have highly correlated features, which make individual coefficient estimates unstable and increase the likelihood of overfitting. We will concatenate these two new features with X1 to form a single array.

In [5]:
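# Note: np.random.normal draws from NumPy's global, unseeded generator, so X2 changes between runs.
# Drawing from random_generator.normal(size=(600,2)) instead would make this step reproducible.
X2 = X1 + 0.1 * np.random.normal(size=(600,2))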

X_predictors = np.concatenate((X1,X2),axis=1)

In [6]:
X_predictors[:5]

array([[-0.57846, -1.11057, -0.5971 , -1.05121],
       [ 0.81149,  0.60825,  0.92463,  0.55305],
       [ 1.08403, -0.4984 ,  1.16891, -0.47691],
       [-1.22129,  0.24011, -1.12678,  0.25713],
       [ 0.07204,  1.0028 ,  0.00452,  1.02345]])
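To see how strongly the added columns track the originals, one option is to print the correlation matrix of the predictors. The sketch below rebuilds a similar array under an assumption: the noise is drawn from the seeded generator, and `X1` is generated directly rather than sliced from the three-column array, so the values will not match the cells above exactly.

```python
import numpy as np

# Assumed re-creation of the setup: all draws come from one seeded generator,
# so the numbers differ from the notebook's X_predictors.
rng = np.random.default_rng(seed=124)
X1 = rng.normal(size=(600, 2))
X2 = X1 + 0.1 * rng.normal(size=(600, 2))
X_predictors = np.concatenate((X1, X2), axis=1)

# rowvar=False treats each column as one variable
corr = np.corrcoef(X_predictors, rowvar=False)
print(np.round(corr, 3))
```

Columns 0 and 2, and columns 1 and 3, should show correlations close to 1, while columns 0 and 1 should be close to uncorrelated.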

We will create our outcome/target feature and make it depend on all three original columns, including the one we removed from ```X_predictors```. To do that, we pass the data through the logistic function with all weights set to 1 to get probabilities. Then we set the outcome feature to ```True``` when the probability is above 0.5.

In [7]:
Prob = 1 / (1 + np.e**(-np.matmul(X_for_creating_probabilities,[1,1,1])))

Y = Prob > 0.5

np.mean(Y)

0.5183333333333333

In [8]:
Y[:5]

array([False,  True, False,  True, False])
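Because the logistic function equals 0.5 exactly when its input is 0, thresholding the probability at 0.5 is the same as checking the sign of the linear predictor. The following is a small sketch of that equivalence; the helper name `sigmoid` is introduced here for illustration and is not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(seed=124)
X = rng.normal(size=(600, 3))

z = X @ np.ones(3)      # linear predictor with all weights equal to 1
prob = sigmoid(z)

labels_from_prob = prob > 0.5   # rule used in the notebook
labels_from_sign = z > 0        # equivalent rule on the linear predictor

print(np.array_equal(labels_from_prob, labels_from_sign))
```

So the target is simply whether the three original columns sum to a positive number, and the model only ever sees two of them (plus their noisy copies).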

Let us split the data into train and test sets. The first 330 rows form the training set and the remaining 270 rows form the test set.

In [9]:
k=330

X_train = X_predictors[:k]
y_train = Y[:k]

X_test = X_predictors[k:]
y_test = Y[k:]

print(f"X_train:{len(X_train)}  X_test:{len(X_test)}")

X_train:330  X_test:270
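The same sequential split can also be written with sklearn's `train_test_split`. This is a sketch with stand-in arrays (`X` and `y` here are assumptions, not the notebook's `X_predictors` and `Y`); `shuffle=False` keeps the row order, which mirrors the slicing above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=124)
X = rng.normal(size=(600, 4))   # stand-in for X_predictors
y = X.sum(axis=1) > 0           # stand-in for Y

# shuffle=False takes the first 330 rows for training, the rest for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=330, shuffle=False
)
print(f"X_train:{len(X_train)}  X_test:{len(X_test)}")
```

A sequential split is reasonable here because the rows are independent random draws; for real data with any ordering, shuffling before the split is usually the safer choice.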


### 2. Logistic regression with sklearn

In [10]:
from sklearn.linear_model import LogisticRegression

In [11]:
# sklearn LR applies regularization by default
sklearn_default = LogisticRegression(random_state=124).fit(X_train, y_train)

print(f"Intercept:{sklearn_default.intercept_}  Coefficients: {sklearn_default.coef_}")
print(f"Train Accuracy:{sklearn_default.score(X_train,y_train)}")
print(f"Test Accuracy:{sklearn_default.score(X_test,y_test)}")

Intercept:[-0.10917]  Coefficients: [[0.78477 1.36389 0.87383 0.39505]]
Train Accuracy:0.8121212121212121
Test Accuracy:0.7851851851851852


We can turn off regularization by setting ```penalty='none'``` (in scikit-learn 1.2 and later, pass ```penalty=None``` instead). Regularization shrinks the magnitude of the coefficients, so removing the penalty generally lets the coefficients grow.

In [12]:
# sklearn LR with no penalty
sklearn_no_penalty = LogisticRegression(random_state=124,penalty='none').fit(X_train, y_train)

print(f"Intercept:{sklearn_no_penalty.intercept_}  Coefficients: {sklearn_no_penalty.coef_}")
print(f"Train Accuracy:{sklearn_no_penalty.score(X_train,y_train)}")
print(f"Test Accuracy:{sklearn_no_penalty.score(X_test,y_test)}")

Intercept:[-0.14249]  Coefficients: [[ 0.54707  3.72311  1.21029 -1.83822]]
Train Accuracy:0.8151515151515152
Test Accuracy:0.7962962962962963


In the default model above, ```C=1.0``` is used. C is the inverse of the regularization strength, so smaller values of C mean stronger regularization. For example, if we set ```C=0.1```, the overall magnitude of the coefficients shrinks.

In [13]:
# sklearn LR with larger penalty
sklearn_larger_penalty = LogisticRegression(random_state=124,C=0.1).fit(X_train, y_train)

print(f"Intercept:{sklearn_larger_penalty.intercept_}  Coefficients: {sklearn_larger_penalty.coef_}")
print(f"Train Accuracy:{sklearn_larger_penalty.score(X_train,y_train)}")
print(f"Test Accuracy:{sklearn_larger_penalty.score(X_test,y_test)}")

Intercept:[-0.07565]  Coefficients: [[0.66188 0.76325 0.67581 0.6625 ]]
Train Accuracy:0.8090909090909091
Test Accuracy:0.7851851851851852
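The effect of C on the coefficients can be checked by fitting the same model over a range of values and comparing the L2 norm of the coefficient vector. This is a sketch on an assumed re-creation of the data (seeded noise, so the numbers differ from the cells above), and `max_iter=1000` is added only to give the solver room to converge.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed re-creation of the notebook's data with a single seeded generator
rng = np.random.default_rng(seed=124)
X_full = rng.normal(size=(600, 3))
X1 = X_full[:, 1:]                           # drop the hidden first column
X2 = X1 + 0.1 * rng.normal(size=(600, 2))    # noisy, correlated copies
X = np.concatenate((X1, X2), axis=1)
y = X_full.sum(axis=1) > 0

norms = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<5}  L2 norm of coefficients: {norms[C]:.3f}")
```

The norm should grow as C increases, which matches the pattern seen across the default, larger-penalty, and no-penalty fits.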


What is the optimal value for C?

We can search over candidate values with GridSearchCV, which scores each parameter combination by cross-validation on the training data.

In [14]:
from sklearn.model_selection import GridSearchCV

In [15]:
parameters = {'C':[0.01,0.1,1,10],'solver':['newton-cg','lbfgs']}
Logistic = LogisticRegression(random_state=124)
sklearn_GridSearchCV = GridSearchCV(Logistic,parameters)
sklearn_GridSearchCV.fit(X_train,y_train)

print(f"best_estimator:{sklearn_GridSearchCV.best_estimator_}")

best_estimator:LogisticRegression(C=0.01, random_state=124, solver='newton-cg')


In [16]:
print(f"Train Accuracy:{sklearn_GridSearchCV.score(X_train,y_train)}")
print(f"Test Accuracy:{sklearn_GridSearchCV.score(X_test,y_test)}")

Train Accuracy:0.8181818181818182
Test Accuracy:0.7925925925925926
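Beyond `best_estimator_`, the fitted search object keeps the cross-validated score of every parameter combination in `cv_results_`, which helps judge how much the choice of C actually matters. A minimal sketch, again on assumed stand-in data rather than the notebook's arrays:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in data: two visible predictors, target driven by three columns
rng = np.random.default_rng(seed=124)
X_full = rng.normal(size=(600, 3))
X = X_full[:, 1:]
y = X_full.sum(axis=1) > 0

parameters = {'C': [0.01, 0.1, 1, 10], 'solver': ['newton-cg', 'lbfgs']}
# GridSearchCV uses 5-fold cross-validation by default
search = GridSearchCV(LogisticRegression(max_iter=1000), parameters)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)[
    ['param_C', 'param_solver', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
print(search.best_params_)
```

If the mean scores differ by less than their standard deviations, the "best" C is largely a tie-break rather than a strong preference.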


### 3. Logistic regression with statsmodels

Statsmodels logistic regression applies no regularization and does not include an intercept by default. We need to use the ```sm.add_constant``` function to add the intercept column.

In [17]:
import statsmodels.api as sm

In [18]:
X_train_constant = sm.add_constant(X_train)
X_test_constant = sm.add_constant(X_test)

sm_model_all_predictors = sm.Logit(y_train,X_train_constant).fit()

Optimization terminated successfully.
         Current function value: 0.390316
         Iterations 7


In [19]:
print(sm_model_all_predictors.summary())

                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                  330
Model:                          Logit   Df Residuals:                      325
Method:                           MLE   Df Model:                            4
Date:                Sun, 11 Sep 2022   Pseudo R-squ.:                  0.4357
Time:                        12:39:53   Log-Likelihood:                -128.80
converged:                       True   LL-Null:                       -228.25
Covariance Type:            nonrobust   LLR p-value:                 6.521e-42
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1425      0.159     -0.897      0.370      -0.454       0.169
x1             0.5471      1.583      0.346      0.730      -2.556       3.650
x2             3.7231      1.566      2.377      0.0

In [20]:
sm_model_X1_X2 = sm.Logit(y_train,X_train_constant[:,:3]).fit()

Optimization terminated successfully.
         Current function value: 0.393376
         Iterations 7


In [21]:
print(sm_model_X1_X2.summary())

                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                  330
Model:                          Logit   Df Residuals:                      327
Method:                           MLE   Df Model:                            2
Date:                Sun, 11 Sep 2022   Pseudo R-squ.:                  0.4313
Time:                        12:39:53   Log-Likelihood:                -129.81
converged:                       True   LL-Null:                       -228.25
Covariance Type:            nonrobust   LLR p-value:                 1.782e-43
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1199      0.157     -0.764      0.445      -0.428       0.188
x1             1.7312      0.222      7.791      0.000       1.296       2.167
x2             1.8391      0.224      8.215      0.0

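With the two correlated copies dropped, the standard errors of x1 and x2 fall from about 1.6 to about 0.22, and we can observe that both x1 and x2 are now statistically significant.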

Statsmodels does not have an equivalent of sklearn's score method for computing accuracy. We have to use the predict method to get predicted probabilities, and then apply the decision rule that probabilities above 0.5 are True and the rest are False.

In [22]:
all_predicted_train = sm_model_all_predictors.predict(X_train_constant)> 0.5
all_predicted_test = sm_model_all_predictors.predict(X_test_constant)> 0.5

X1_X2_predicted_train = sm_model_X1_X2.predict(X_train_constant[:,:3])> 0.5
X1_X2_predicted_test = sm_model_X1_X2.predict(X_test_constant[:,:3])> 0.5

In [23]:
print(f"sm_train_all:{(y_train == all_predicted_train).mean()} and sm_test_all:{(y_test == all_predicted_test).mean()}")
print(f"sm_train_x1_x2:{(y_train == X1_X2_predicted_train).mean()} and sm_test_x1_x2:{(y_test == X1_X2_predicted_test).mean()}")

sm_train_all:0.8151515151515152 and sm_test_all:0.7962962962962963
sm_train_x1_x2:0.8121212121212121 and sm_test_x1_x2:0.7888888888888889


### 4. Summary

In [24]:
compare_list = [['sklearn','default',sklearn_default.score(X_train,y_train),sklearn_default.score(X_test,y_test)],
               ['sklearn','no_penalty',sklearn_no_penalty.score(X_train,y_train),sklearn_no_penalty.score(X_test,y_test)],
               ['sklearn','larger_penalty',sklearn_larger_penalty.score(X_train,y_train),sklearn_larger_penalty.score(X_test,y_test)],
               ['sklearn','GridSearchCV',sklearn_GridSearchCV.score(X_train,y_train),sklearn_GridSearchCV.score(X_test,y_test)],
               ['statsmodels','With Intercept + All predictors',(y_train == all_predicted_train).mean(),(y_test == all_predicted_test).mean()],
               ['statsmodels','With Intercept + X1 and X2',(y_train == X1_X2_predicted_train).mean(),(y_test == X1_X2_predicted_test).mean()] 
               ]

df = pd.DataFrame(compare_list, columns=['Library','Strategy','Train_Accuracy','Test_Accuracy'])

df

| | Library | Strategy | Train_Accuracy | Test_Accuracy |
|---|---|---|---|---|
| 0 | sklearn | default | 0.812121 | 0.785185 |
| 1 | sklearn | no_penalty | 0.815152 | 0.796296 |
| 2 | sklearn | larger_penalty | 0.809091 | 0.785185 |
| 3 | sklearn | GridSearchCV | 0.818182 | 0.792593 |
| 4 | statsmodels | With Intercept + All predictors | 0.815152 | 0.796296 |
| 5 | statsmodels | With Intercept + X1 and X2 | 0.812121 | 0.788889 |
