# Logistic Regression

In [2]:
# Import the necessary libraries
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In practice, real data is usually available. For this example, arrays of values for the input (𝑥) and the output (𝑦) are created by hand.

In [3]:
x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

Both input and output should be NumPy arrays. The function numpy.arange() creates an array of consecutive, evenly spaced values within a given range; here, the integers 0 through 9. The x array must be two-dimensional, with one column for each input and one row for each observation. To make x two-dimensional, .reshape() is called with the arguments -1, to get as many rows as needed, and 1, to put the values in a single column.
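The effect of .reshape(-1, 1) is easiest to see by comparing shapes before and after the call. The following is a minimal sketch; the names `flat` and `column` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# A one-dimensional array of ten values, as returned by np.arange()
flat = np.arange(10)

# -1 lets NumPy infer the number of rows; 1 fixes the number of columns
column = flat.reshape(-1, 1)

print(flat.shape)    # (10,)
print(column.shape)  # (10, 1)
```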

In [6]:
# Displaying the x matrix
print(x)

[[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]


In [7]:
# Displaying the y vector
print(y)

[0 0 0 0 1 1 1 1 1 1]


y is a one-dimensional array of ten items. Again, each item corresponds to one observation. It contains only zeros (0) and ones (1) because this is a binary classification problem.

Once both the input and output are defined, the classification model can be created. In this example it is an instance of the LogisticRegression class.

In [8]:
model = LogisticRegression(solver='liblinear', random_state=0)

Once the model has been created, it must be fitted (trained). In this example the .fit() method is used:

In [9]:
model.fit(x, y)

The .fit() method takes x, y and, optionally, weights associated with the observations. It then fits the model and returns the model instance itself.
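To make the result of fitting more tangible, the sketch below repeats the steps above and then inspects the fitted model. It relies on the standard scikit-learn attributes `classes_`, `intercept_` and `coef_`; the name `fitted` is introduced here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression(solver='liblinear', random_state=0)

# .fit() returns the very same object it was called on
fitted = model.fit(x, y)
print(fitted is model)   # True

# Class labels seen during fitting
print(model.classes_)    # [0 1]

# One intercept and one coefficient, since there is a single input column
print(model.intercept_.shape, model.coef_.shape)  # (1,) (1, 1)
```

Because .fit() returns the model itself, creating and fitting can also be chained in one statement.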

Once the model is fitted, its performance can be examined using the .predict_proba() method. This returns the matrix of probabilities of the predicted output being equal to zero or one.

In [10]:
model.predict_proba(x)

array([[0.74002157, 0.25997843],
       [0.62975524, 0.37024476],
       [0.5040632 , 0.4959368 ],
       [0.37785549, 0.62214451],
       [0.26628093, 0.73371907],
       [0.17821501, 0.82178499],
       [0.11472079, 0.88527921],
       [0.07186982, 0.92813018],
       [0.04422513, 0.95577487],
       [0.02690569, 0.97309431]])

In the matrix above, each row corresponds to a single observation. The left column displays the probability of the predicted output being zero, i.e. 1 - 𝑝(𝑥). The right column displays the probability of the output being one, i.e. 𝑝(𝑥).
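The two columns can be checked against the logistic function directly. The sketch below assumes the same data and solver as above and recomputes 𝑝(𝑥) = 1 / (1 + exp(-(b₀ + b₁𝑥))) from the fitted intercept and coefficient; the names `proba`, `z` and `p` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
model = LogisticRegression(solver='liblinear', random_state=0).fit(x, y)

proba = model.predict_proba(x)

# Linear part b0 + b1 * x, followed by the logistic (sigmoid) function
z = model.intercept_ + model.coef_[0, 0] * x.ravel()
p = 1.0 / (1.0 + np.exp(-z))

# Each row of the probability matrix sums to one
print(np.allclose(proba.sum(axis=1), 1.0))  # True

# The right column is p(x), the left column is 1 - p(x)
print(np.allclose(proba[:, 1], p))          # True
print(np.allclose(proba[:, 0], 1.0 - p))    # True
```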

The predictions themselves, which are based on the probability matrix and the values of 𝑝(𝑥), can be obtained with the .predict() method:

In [11]:
model.predict(x)

array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

This method returns the predicted output values as a one-dimensional array.
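For a binary problem like this one, the predicted class is 1 whenever 𝑝(𝑥) exceeds 0.5. The sketch below checks that relationship on the example data; `pred_from_proba` is a name chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
model = LogisticRegression(solver='liblinear', random_state=0).fit(x, y)

# Threshold the probability of class 1 (right column) at 0.5
pred_from_proba = (model.predict_proba(x)[:, 1] > 0.5).astype(int)

# The thresholded probabilities agree with .predict()
print(np.array_equal(pred_from_proba, model.predict(x)))  # True
```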

Since nine out of ten observations are classified correctly, the accuracy of the model is 9/10 = 0.9, which can be obtained with the .score() method:

In [12]:
model.score(x, y)

0.9

The .score() method takes both the input and output as arguments and returns the ratio of the number of correct predictions to the number of observations.
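The same ratio can be computed by hand, which makes clear what .score() measures for a classifier. This is a minimal sketch on the example data; `manual_accuracy` is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
model = LogisticRegression(solver='liblinear', random_state=0).fit(x, y)

# Fraction of observations whose predicted label matches the actual label
manual_accuracy = np.mean(model.predict(x) == y)

print(np.isclose(manual_accuracy, model.score(x, y)))  # True
```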

More detail about the model's accuracy is available from the confusion matrix. To create it, call confusion_matrix() with the actual and predicted outputs as arguments:

In [13]:
confusion_matrix(y, model.predict(x))

array([[3, 1],
       [0, 6]], dtype=int64)
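In scikit-learn's convention, the rows of the confusion matrix are the actual classes and the columns are the predicted classes. The matrix above therefore reads as three true negatives, one false positive, no false negatives and six true positives. The sketch below unpacks these four counts and also prints the classification_report() imported at the start, which is not otherwise used in this notebook; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
model = LogisticRegression(solver='liblinear', random_state=0).fit(x, y)

y_pred = model.predict(x)
cm = confusion_matrix(y, y_pred)

# For a binary problem, ravel() yields the counts in this fixed order
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)

# Accuracy is the share of the diagonal (correct) entries
accuracy = (tn + tp) / cm.sum()
print(accuracy)

# Per-class precision, recall and F1 score in text form
print(classification_report(y, y_pred))
```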

#### Optimizing the model

The existing model can be improved by adjusting its parameters. For example, the parameter C, which is the inverse of the regularization strength, can be set to 10.0 instead of its default value of 1.0. A larger C means weaker regularization.

In [14]:
model = LogisticRegression(solver='liblinear', C=10.0, random_state=0).fit(x, y)

This results in another model with different parameters.

In [15]:
# Display the accuracy
model.score(x, y)

1.0

In [16]:
# Display the confusion matrix
confusion_matrix(y, model.predict(x))

array([[4, 0],
       [0, 6]], dtype=int64)

So, the model has been tuned and now reaches the maximum accuracy of 1.0 on the data it was fitted on. The confusion matrix shows the same: all observations have been classified correctly.
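To see what changing C does to the fitted parameters, both settings can be compared side by side. The sketch below refits the model for C = 1.0 and C = 10.0 on the same data; the `results` dictionary is introduced here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

results = {}
for C in (1.0, 10.0):
    m = LogisticRegression(solver='liblinear', C=C, random_state=0).fit(x, y)
    # Store intercept, slope and training accuracy for this value of C
    results[C] = (m.intercept_[0], m.coef_[0, 0], m.score(x, y))
    print(C, results[C])
```

With weaker regularization (larger C) the coefficient is allowed to grow, which makes the probability curve steeper around the decision boundary.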