1. Load the dataset
2. Split it into training and test sets with train_test_split
3. Apply LogisticRegression with the liblinear and saga solvers
4. Apply GridSearchCV to find the best hyperparameters
5. Evaluate the trained model on the test dataset
6. Print the evaluation metrics

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
data = load_breast_cancer()
X = data.data
y = data.target

In [3]:
data

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [4]:
X

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [5]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

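The liblinear solver is built on LIBLINEAR, a library for large-scale linear classification that supports logistic regression and linear support vector machines. In scikit-learn's LogisticRegression it works with the L1 and L2 penalties.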

In [7]:
log_reg = LogisticRegression(solver='liblinear')

A parameter grid is defined so that GridSearchCV can cross-validate every combination and select the best hyperparameters.

In [8]:
param_grid = {
    'penalty': ['l2', 'l1'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'fit_intercept': [True, False]
}
grid_search = GridSearchCV(log_reg, param_grid, cv=5)
grid_search.fit(X_train, y_train)
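
To get a sense of how much work this search does, the grid can be enumerated before fitting. This is a minimal sketch, assuming the same param_grid and cv=5 as in the cell above; ParameterGrid is scikit-learn's helper for expanding such a dictionary, and the variable names are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    'penalty': ['l2', 'l1'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'fit_intercept': [True, False],
}
cv_folds = 5  # matches cv=5 passed to GridSearchCV

# Expand the dictionary into every hyperparameter combination
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)        # 2 * 6 * 2 = 24 combinations
n_cv_fits = n_candidates * cv_folds   # 24 * 5 = 120 cross-validation fits

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_cv_fits)
print("First candidate:", candidates[0])
```

With the default refit=True, GridSearchCV performs one more fit of the best candidate on the full training set after the cross-validation fits.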



In [9]:
# best hyperparameters
grid_search.best_params_

{'C': 100, 'fit_intercept': False, 'penalty': 'l1'}

C is the inverse of the regularization strength: as C decreases, regularization increases and the model becomes simpler.

Regularization is used to avoid overfitting.
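
One compact way to see this effect is to count how many coefficients an L1-penalized model keeps at different values of C. The sketch below assumes the same split as In [6]; the random_state=0 on the model and the reduced list of C values are assumptions added for repeatability and brevity, not part of the original run.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit an L1-penalized model at a small, medium and large C
nonzero_counts = {}
for C in [0.001, 1, 100]:
    model = LogisticRegression(solver='liblinear', penalty='l1', C=C, random_state=0)
    model.fit(X_train, y_train)
    # With L1, stronger regularization (smaller C) drives more coefficients to exactly zero
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))

for C, count in nonzero_counts.items():
    print(f"C={C}: {count} non-zero coefficients out of {X.shape[1]}")
```

The smallest C should leave only a handful of non-zero coefficients, while larger values of C let more features into the model.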

In [10]:
for penalty in ['l1', 'l2']:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        for fit_intercept in [True, False]:
            log_reg = LogisticRegression(solver='liblinear', penalty=penalty, C=C, fit_intercept=fit_intercept)
            log_reg.fit(X_train, y_train)
            print("\n")
            print(f"Penalty: {penalty}, C: {C}, fit_intercept: {fit_intercept}")
            print("Coefficients:", log_reg.coef_)
            print("Intercept:", log_reg.intercept_)



Penalty: l1, C: 0.001, fit_intercept: True
Coefficients: [[ 0.          0.          0.05920773  0.00454094  0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.         -0.00974846
   0.          0.          0.          0.          0.          0.        ]]
Intercept: [0.]


Penalty: l1, C: 0.001, fit_intercept: False
Coefficients: [[ 0.          0.          0.05920759  0.00454206  0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.         -0.00974936
   0.          0.          0.          0.          0.          0.        ]]
Intercept: 0.0


Penalty: l1, C: 0.01, fit_intercept: True
Coefficients: [[ 0.          0.          0.14090725  0.00725436  0.          0.
   0.          0.          0. 





Penalty: l1, C: 1, fit_intercept: True
Coefficients: [[ 4.1516401   0.13726017 -0.24974548 -0.01602487  0.          0.
   0.          0.          0.          0.          0.          1.69459925
   0.         -0.09948163  0.          0.          0.          0.
   0.          0.          0.0886935  -0.42311283 -0.03236683 -0.01545385
   0.          0.         -3.67962477  0.          0.          0.        ]]
Intercept: [0.]


Penalty: l1, C: 1, fit_intercept: False
Coefficients: [[ 4.3294493   0.13829702 -0.26989543 -0.01659272  0.          0.
   0.          0.          0.          0.          0.          1.70024389
   0.         -0.09927253  0.          0.          0.          0.
   0.          0.          0.01669019 -0.42454849 -0.02806291 -0.01497921
   0.          0.         -3.61333556  0.          0.          0.        ]]
Intercept: 0.0


Penalty: l1, C: 10, fit_intercept: True
Coefficients: [[ 1.45879069e+00  1.78768478e-01 -2.46660182e-02 -1.14502548e-02
   0.00000000e+00  0.000

In [11]:
# Evaluate the trained model on the test dataset
y_pred = grid_search.predict(X_test)
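
Calling predict on the GridSearchCV object works because, with the default refit=True, the best candidate is refit on the whole training set and predictions are delegated to it. The sketch below illustrates this with a reduced grid; the grid and the random_state=0 on the model are assumptions chosen to keep the example quick, not the settings used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reduced grid (assumption) so the sketch runs quickly
search = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=0),
    {'C': [0.1, 1]},
    cv=5,
)
search.fit(X_train, y_train)

# predict() on the search object delegates to the refit best estimator
same = np.array_equal(
    search.predict(X_test),
    search.best_estimator_.predict(X_test),
)
print("refit enabled:", search.refit)
print("predictions identical:", same)
```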

In [12]:
# Evaluation Metrics
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("Evaluation Metrics:")
print("Accuracy:\n", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)

Evaluation Metrics:
Accuracy:
 0.9824561403508771
Confusion Matrix:
 [[42  1]
 [ 1 70]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98        43
           1       0.99      0.99      0.99        71

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
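
The numbers in the report can be reproduced by hand from the confusion matrix, which helps when reading it. A minimal sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the matrix is copied from the output above.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
conf_matrix = np.array([[42, 1],
                        [1, 70]])

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = np.trace(conf_matrix) / conf_matrix.sum()

# Recall per class: correct predictions over the number of true samples (row sums)
recall = np.diag(conf_matrix) / conf_matrix.sum(axis=1)

# Precision per class: correct predictions over the number of predicted samples (column sums)
precision = np.diag(conf_matrix) / conf_matrix.sum(axis=0)

print("Accuracy:", accuracy)                        # (42 + 70) / 114
print("Recall per class:", recall.round(2))         # 42/43 and 70/71
print("Precision per class:", precision.round(2))   # 42/43 and 70/71
```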



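LogisticRegression model with the saga solver and the elasticnet penalty.

Penalty: type of regularization (L1, L2, or elasticnet, which mixes L1 and L2 according to l1_ratio).

saga: a variant of stochastic average gradient descent that is efficient on large datasets. Its fast convergence is only guaranteed when the features are on approximately the same scale, so on unscaled data it may stop before it has converged.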
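
Because saga converges quickly only when features share a similar scale, a common remedy is to standardize the inputs first. The sketch below is not part of the original notebook: the pipeline, the fixed l1_ratio=0.5, max_iter=5000 and random_state=0 are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features, then fit an elasticnet model with saga.
# The scaler is fit on the training data only, inside the pipeline.
scaled_saga = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5,
                       max_iter=5000, random_state=0),
)
scaled_saga.fit(X_train, y_train)

test_accuracy = scaled_saga.score(X_test, y_test)
print("Test accuracy with scaling:", test_accuracy)
```

Putting the scaler inside the pipeline also means it could be passed to GridSearchCV as a single estimator, so that each cross-validation fold is scaled using only its own training portion.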

In [13]:
log_reg_elasticnet = LogisticRegression(solver='saga', penalty='elasticnet')
param_grid_elasticnet = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'fit_intercept': [True, False],
    'l1_ratio': [0.1, 0.5, 0.9]
}

The elasticnet penalty is supported only by the saga solver, while the liblinear solver supports only L1 and L2 regularization. Thus, the grid search for elasticnet is run separately with saga, after the liblinear search.
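
The solver and penalty pairings can be checked directly, since scikit-learn rejects unsupported combinations when fit is called. This is a rough sketch: the helper accepts is hypothetical, and the scaling, max_iter=2000 and random_state=0 are assumptions added so that saga fits quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # scaling only to speed up saga

def accepts(solver, penalty):
    """Hypothetical helper: True if this solver/penalty pair fits without a ValueError."""
    # l1_ratio is only meaningful for the elasticnet penalty
    kwargs = {'l1_ratio': 0.5} if penalty == 'elasticnet' else {}
    model = LogisticRegression(solver=solver, penalty=penalty,
                               max_iter=2000, random_state=0, **kwargs)
    try:
        model.fit(X_scaled, y)
    except ValueError:
        return False
    return True

for solver in ['liblinear', 'saga']:
    for penalty in ['l1', 'l2', 'elasticnet']:
        print(f"{solver:9s} + {penalty:10s}: {accepts(solver, penalty)}")
```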

In [14]:
grid_search_elasticnet = GridSearchCV(log_reg_elasticnet, param_grid_elasticnet, cv=5)
grid_search_elasticnet.fit(X_train, y_train)



In [15]:
# best hyperparameters
best_params_elasticnet = grid_search_elasticnet.best_params_
print("Best Hyperparameters for ElasticNet:", best_params_elasticnet)

Best Hyperparameters for ElasticNet: {'C': 0.001, 'fit_intercept': True, 'l1_ratio': 0.5}


Refit a model with the best elasticnet settings to inspect its coefficients

In [16]:
log_reg_best_elasticnet = LogisticRegression(solver='saga', penalty='elasticnet', **best_params_elasticnet)
log_reg_best_elasticnet.fit(X_train, y_train)
print("Coefficients:", log_reg_best_elasticnet.coef_)
print("Intercept:", log_reg_best_elasticnet.intercept_)

Coefficients: [[ 0.          0.00163238  0.01461168  0.01007971  0.          0.
   0.          0.          0.          0.          0.          0.
   0.         -0.00293993  0.          0.          0.          0.
   0.          0.          0.          0.00316307  0.01444283 -0.01106964
   0.          0.          0.          0.          0.          0.        ]]
Intercept: [0.00040758]




In [17]:
y_pred_elasticnet = grid_search_elasticnet.predict(X_test)

Evaluation Metrics for elasticnet

In [18]:
accuracy_elasticnet = accuracy_score(y_test, y_pred_elasticnet)
conf_matrix_elasticnet = confusion_matrix(y_test, y_pred_elasticnet)
classification_rep_elasticnet = classification_report(y_test, y_pred_elasticnet)
print("Accuracy:", accuracy_elasticnet)
print("Confusion Matrix:\n", conf_matrix_elasticnet)
print("Classification Report:\n", classification_rep_elasticnet)

Accuracy: 0.9385964912280702
Confusion Matrix:
 [[37  6]
 [ 1 70]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.86      0.91        43
           1       0.92      0.99      0.95        71

    accuracy                           0.94       114
   macro avg       0.95      0.92      0.93       114
weighted avg       0.94      0.94      0.94       114

