In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In [2]:
x = np.arange(10).reshape(-1,1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
x

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

In [3]:
y

array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

In [4]:
model = LogisticRegression(solver='liblinear', random_state=0).fit(x, y)

In [5]:
model.predict_proba(x)

array([[0.74002157, 0.25997843],
       [0.62975524, 0.37024476],
       [0.5040632 , 0.4959368 ],
       [0.37785549, 0.62214451],
       [0.26628093, 0.73371907],
       [0.17821501, 0.82178499],
       [0.11472079, 0.88527921],
       [0.07186982, 0.92813018],
       [0.04422513, 0.95577487],
       [0.02690569, 0.97309431]])

In [6]:
model.predict(x)

array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
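The predictions above follow from the probabilities by thresholding at 0.5. As a minimal sketch, assuming the same binary liblinear model as in In [4], the class-1 probability can be reproduced by hand from `intercept_` and `coef_` with the logistic (sigmoid) function. The names `z`, `p1`, and `manual_pred` are illustrative only and do not appear in the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data and model as in the cells above
x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
model = LogisticRegression(solver='liblinear', random_state=0).fit(x, y)

# Linear score z = b + w * x, then the sigmoid maps it to P(y = 1 | x)
z = model.intercept_[0] + model.coef_[0, 0] * x.ravel()
p1 = 1.0 / (1.0 + np.exp(-z))

# Thresholding the probability at 0.5 gives the predicted class
manual_pred = (p1 > 0.5).astype(int)

print(np.allclose(p1, model.predict_proba(x)[:, 1]))
print(np.array_equal(manual_pred, model.predict(x)))
```

The point x = 3 gets a class-1 probability of about 0.62, which is why it is predicted as 1 even though its label is 0.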

In [7]:
model.score(x, y)

0.9

In [8]:
confusion_matrix(y, model.predict(x))

array([[3, 1],
       [0, 6]])

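C is the inverse of the regularization strength: smaller values regularize more strongly. The default is C=1.0; below we refit with C=10.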
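One way to see why C acts as an inverse strength is the L2-penalized objective in its commonly stated form. This is a sketch of the usual formulation, not a statement of exactly what every solver minimizes: the labels are recoded here as $\tilde{y}_i \in \{-1, +1\}$ (an assumption of this notation, the notebook uses 0 and 1), and the scaling and the treatment of the intercept $b$ vary by solver (liblinear, for instance, also penalizes the intercept).

```latex
\min_{w,\, b} \;\; \frac{1}{2}\, w^\top w
\;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-\tilde{y}_i \,(w^\top x_i + b)\right)\right)
```

C multiplies the data-fit term, so a larger C gives the penalty $\tfrac{1}{2} w^\top w$ relatively less weight, which means weaker regularization.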

In [9]:
model = LogisticRegression(solver='liblinear', C=10, random_state=0).fit(x, y)

In [10]:
model.predict_proba(x)

array([[0.97106534, 0.02893466],
       [0.9162684 , 0.0837316 ],
       [0.7810904 , 0.2189096 ],
       [0.53777071, 0.46222929],
       [0.27502212, 0.72497788],
       [0.11007743, 0.88992257],
       [0.03876835, 0.96123165],
       [0.01298011, 0.98701989],
       [0.0042697 , 0.9957303 ],
       [0.00139621, 0.99860379]])

In [11]:
model.predict(x)

array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

In [12]:
model.score(x, y)

1.0

In [13]:
confusion_matrix(y, model.predict(x))

array([[4, 0],
       [0, 6]])

In scikit-learn's logistic regression, the hyperparameter C is the inverse of the regularization strength. It balances the trade-off between fitting the training data closely and keeping the model simple enough to avoid overfitting. Here's what C does in logistic regression:

Low Values of C (Strong Regularization): When C is small (e.g., 0.01 or 0.001), the penalty on large coefficients dominates. The algorithm favors simpler models with small coefficients, which are less likely to fit noise in the training data. If C is too small, the model may fit even the training data poorly, which shows up as high bias and underfitting.

High Values of C (Weak Regularization): When C is large (e.g., 10 or 100, relative to the default of 1), the penalty carries little weight. The model is free to fit the training data more closely, even if that means capturing noise, so it is more prone to overfitting. In the example above, raising C from 1 to 10 lifted the accuracy from 0.9 to 1.0, but that score is measured on the training data itself and says nothing about how the model generalizes.
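The effect of C on the fitted model can be checked directly. The sketch below refits the notebook's toy data over a range of C values and prints the learned slope, intercept, and training accuracy. The grid `c_values` and the list `slopes` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as in the cells above
x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

c_values = [0.01, 0.1, 1, 10, 100]  # illustrative grid, small to large
slopes = []
for c in c_values:
    clf = LogisticRegression(solver='liblinear', C=c, random_state=0).fit(x, y)
    slopes.append(abs(clf.coef_[0, 0]))
    print(f"C={c:<6} slope={clf.coef_[0, 0]:.3f} "
          f"intercept={clf.intercept_[0]:.3f} "
          f"train accuracy={clf.score(x, y):.1f}")
```

Stronger regularization (small C) shrinks the slope toward zero, which flattens the probability curve; weaker regularization (large C) lets the slope grow and the curve sharpen around the decision boundary.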

In summary, the choice of C matters. A small C encourages a simpler model that is less likely to overfit but may have high bias. A large C lets the model fit the data more closely but increases the risk of overfitting. The best value depends on the dataset and the problem, so it is usually chosen by cross-validation: several values of C are tried, and the one with the best performance on held-out validation data is selected.
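A minimal sketch of that selection step, assuming scikit-learn's `GridSearchCV` with stratified 2-fold cross-validation. The grid of C values and the fold count are illustrative assumptions, and the 10-point toy dataset is far too small for the selected value to be meaningful; the code only shows the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Same toy data as in the cells above
x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Candidate values of C to compare (illustrative grid)
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

# Stratified folds keep both classes present in every split
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=0),
    param_grid,
    cv=cv,
    scoring='accuracy',
)
search.fit(x, y)

print(search.best_params_)  # C with the best mean validation accuracy
print(search.best_score_)   # that mean validation accuracy
```

On a real dataset you would typically use more folds (5 or 10) and a finer, log-spaced grid of C values.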