# Logistic Regression

---

> Logistic regression is a classification model that is very easy to implement but performs very well on linearly separable classes. It is one of the most widely used algorithms for classification in industry.

> To explain the idea behind logistic regression as a probabilistic model, let's first introduce the odds in favor of a particular event (often loosely called the odds ratio). The odds can be written as $$ \frac{p}{1-p} $$, where $$ p $$ stands for the probability of the positive event.

> **The term positive event does not necessarily mean good; it refers to the event that we want to predict, for example, that a patient has a certain disease.** We can think of the positive event as class label $$ y = 1 $$. We can then further define the logit function, which is simply the logarithm of the odds (log-odds):

$$ \text{logit}(p) = \log\frac{p}{1-p} $$
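> As a quick numeric check of the two definitions above, here is a minimal sketch. The helper names `odds` and `logit` are chosen for illustration only, and the probabilities are arbitrary example values.

```python
import math

def odds(p):
    """Odds in favor of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds: natural logarithm of the odds (requires 0 < p < 1)."""
    return math.log(odds(p))

# Probabilities below 0.5 give negative log-odds, 0.5 gives zero,
# and probabilities above 0.5 give positive log-odds.
for p in (0.1, 0.5, 0.9):
    print(f"p={p:.1f}  odds={odds(p):.3f}  logit={logit(p):+.3f}")
```

> Note the symmetry: the log-odds of 0.9 and 0.1 have the same magnitude and opposite signs.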

> The logit function takes input values in the **open interval (0, 1)** and transforms them to values over the **entire real-number range**, which we can use to express a linear relationship between feature values and the log-odds:

<center><img src="1.jpg"></center>

> Here, $$ p(y=1|x) $$ is the **conditional probability** that a particular sample belongs to class 1 given its features x.

> What we are actually interested in is predicting the probability that a certain sample belongs to a particular class, which is given by the inverse of the logit function. This inverse is called the logistic sigmoid function, often abbreviated to sigmoid function because of its characteristic S-shape:

<center><img src="2.jpg"></center>

> Here z is the net input, the linear combination of weights and sample features, $$ z = W^TX $$

<center><img src="3.jpg"></center>
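> The sigmoid and the net input can be sketched in a few lines of plain Python. This assumes the standard form of the sigmoid, 1 / (1 + e^(-z)); the weights and features below are hypothetical values, with the first feature fixed to 1 so that the first weight acts as a bias.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real net input z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, this algebraically equivalent form avoids overflow in exp.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def net_input(w, x):
    """Net input z: the dot product of weights and features."""
    return sum(wj * xj for wj, xj in zip(w, x))

w = [-1.0, 2.0, 0.5]   # hypothetical weights (w[0] plays the role of the bias)
x = [1.0, 0.8, -0.4]   # hypothetical sample, with x[0] = 1

z = net_input(w, x)
p = sigmoid(z)
print(f"z = {z:.2f}, predicted probability of class 1 = {p:.3f}")

# Applying the logit to p recovers z, confirming the two are inverses.
print(f"logit(p) = {math.log(p / (1.0 - p)):.2f}")
```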

> To explain how we can derive the cost function for logistic regression, let's first define the likelihood L that we want to maximize when we build a logistic regression model, assuming that the individual samples in our dataset are independent of one another. The formula is as follows:

<center><img src="4.jpg"></center>

> In practice, it is easier to maximize the (natural) log of this equation, which is called the log-likelihood function:

<center><img src="5.jpg"></center>
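> To make the relationship between the two quantities concrete, the sketch below evaluates both on a tiny made-up dataset. It assumes the usual Bernoulli form: each sample contributes $$ \phi(z)^{y}(1-\phi(z))^{1-y} $$ to the likelihood, and the log-likelihood sums the logarithms of those factors. The labels and net inputs are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(y, phi):
    """Product of per-sample Bernoulli probabilities (independent samples)."""
    result = 1.0
    for yi, pi in zip(y, phi):
        result *= pi ** yi * (1.0 - pi) ** (1 - yi)
    return result

def log_likelihood(y, phi):
    """Sum of per-sample log-probabilities; equals log(likelihood)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, phi))

y = [1, 0, 1, 1]              # hypothetical class labels
z = [2.0, -1.0, 0.5, -0.3]    # hypothetical net inputs
phi = [sigmoid(zi) for zi in z]

L = likelihood(y, phi)
ll = log_likelihood(y, phi)
print(f"likelihood = {L:.4f}, log-likelihood = {ll:.4f}, log(L) = {math.log(L):.4f}")
```

> Working with the sum rather than the product also avoids numerical underflow when the dataset is large, since a product of many probabilities quickly approaches zero.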

> Now we could use an optimization algorithm such as gradient ascent to maximize this log-likelihood function. Alternatively, let's rewrite the log-likelihood as a cost function J that can be minimized using gradient descent:

<center><img src="6.jpg"></center>
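> A minimal gradient descent sketch follows, assuming J is the negative log-likelihood, whose gradient with respect to the weights is $$ X^T(\phi(z) - y) $$. The toy data, the learning rate `eta`, and the iteration count `n_iter` are arbitrary choices for illustration, not tuned values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    """Negative log-likelihood J(w) summed over all samples."""
    phi = sigmoid(X @ w)
    return -np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))

def fit(X, y, eta=0.1, n_iter=200):
    """Batch gradient descent on J; returns final weights and the cost history."""
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iter):
        phi = sigmoid(X @ w)
        grad = X.T @ (phi - y)   # gradient of J with respect to w
        w -= eta * grad          # step against the gradient to reduce J
        history.append(cost(w, X, y))
    return w, history

# Hypothetical 1-D dataset; the first column of ones acts as the bias feature.
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0], [1, 2.5], [1, 3.0]])
y = np.array([0, 0, 1, 0, 1, 1])

w, history = fit(X, y)
print("weights:", w)
print(f"cost after first step: {history[0]:.4f}, after last step: {history[-1]:.4f}")
```

> The classes in this toy dataset overlap on purpose: with perfectly separable data the weights would keep growing without bound, because the cost can always be lowered a little further.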

> A simplified version to remember:

<center><img src="7.jpg"></center>

> The following plot illustrates the cost of classifying a single-sample instance for different values of $$ \phi(z) $$:

<center><img src="8.jpg"></center>
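> The same behaviour can be checked numerically. The sketch below assumes the usual per-sample cost: $$ -\log(\phi(z)) $$ when the true label is 1 and $$ -\log(1-\phi(z)) $$ when it is 0. The function name `sample_cost` is illustrative.

```python
import math

def sample_cost(phi, y):
    """Cost of one sample given predicted probability phi and true label y."""
    return -math.log(phi) if y == 1 else -math.log(1.0 - phi)

# The cost approaches 0 for confident correct predictions and grows
# without bound for confident wrong ones.
for phi in (0.01, 0.5, 0.99):
    print(f"phi={phi:.2f}  cost if y=1: {sample_cost(phi, 1):.3f}  "
          f"cost if y=0: {sample_cost(phi, 0):.3f}")
```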

# Evaluation Metrics for Classification

---

### Problem statement for illustration

<center><img src="9.jpg"></center>

In [2]:
y_true = [0,0,1,0,1,1,1,0,1,0]
y_pred = [0,1,1,0,0,1,1,1,1,0]

In [3]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_pred)

array([[3, 2],
       [1, 4]], dtype=int64)

In [6]:
from pandas import DataFrame
cm = DataFrame(confusion_matrix(y_true, y_pred), columns=['model says 0', 'model says 1'], index=['we know 0', 'we know 1'])
cm

|  | model says 0 | model says 1 |
|---|---|---|
| we know 0 | 3 | 2 |
| we know 1 | 1 | 4 |


In [8]:
cm.sum(axis=1)

we know 0    5
we know 1    5
dtype: int64

In [7]:
cm.sum(axis=0)

model says 0    4
model says 1    6
dtype: int64

<center><img src="10.jpg"></center>

> **true positive (TP)**

> A test result that correctly indicates the presence of a condition or characteristic

> **true negative (TN)**

> A test result that correctly indicates the absence of a condition or characteristic

> **false positive (FP)**

> A test result which wrongly indicates that a particular condition or attribute is present

> **false negative (FN)**

> A test result which wrongly indicates that a particular condition or attribute is absent
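The four definitions can be applied directly to the `y_true` and `y_pred` lists from the problem statement. This is a plain-Python sketch that treats label 1 as the positive class; for binary labels, scikit-learn's `confusion_matrix(y_true, y_pred).ravel()` should yield the same counts in the order TN, FP, FN, TP.

```python
y_true = [0, 0, 1, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, actually 1
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted 0, actually 0
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, actually 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, actually 1

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

These counts match the confusion matrix computed earlier: the first row holds TN and FP, the second row FN and TP.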

<center><img src="12.jpg"></center>
<center><img src="11.jpg"></center>
<center><img src="13.jpg"></center>

In [11]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
precision_score(y_true=y_true, y_pred=y_pred)

0.6666666666666666

In [12]:
recall_score(y_true=y_true, y_pred=y_pred)

0.8

In [13]:
f1_score(y_true=y_true, y_pred=y_pred)

0.7272727272727272

In [14]:
accuracy_score(y_true=y_true, y_pred=y_pred)

0.7
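The four scores can be reproduced by hand from the confusion-matrix counts (TP=4, TN=3, FP=2, FN=1). The sketch below assumes the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as the harmonic mean of precision and recall, and accuracy as the share of correct predictions.

```python
# Counts taken from the confusion matrix computed earlier.
tp, tn, fp, fn = 4, 3, 2, 1

precision = tp / (tp + fp)                          # of all predicted 1s, how many are right
recall = tp / (tp + fn)                             # of all actual 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)          # overall share of correct predictions

print(f"precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}  accuracy={accuracy:.4f}")
```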

# ROC curve & AUC

Receiver Operating Characteristic (ROC) graphs are useful tools for selecting classification models based on their performance with respect to the false positive rate (FPR) and the true positive rate (TPR), which are computed by shifting the decision threshold of the classifier. The diagonal of an ROC graph can be interpreted as random guessing, and classification models that fall below the diagonal are considered worse than random guessing. A perfect classifier would fall into the top left corner of the graph, with a TPR of 1 and an FPR of 0. Based on the ROC curve, we can then compute the ROC Area Under the Curve (ROC AUC) to characterize the performance of a classification model with a single number.

<center><img src="14.jpg"></center>
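To show what "shifting the decision threshold" means in practice, here is a plain-Python sketch that sweeps the threshold over a set of predicted scores and collects one (FPR, TPR) point per threshold. The `scores` list is hypothetical: it was made up so that thresholding at 0.5 reproduces the `y_pred` used earlier. In real code, `sklearn.metrics.roc_curve` and `sklearn.metrics.roc_auc_score` are the usual route.

```python
def roc_points(y_true, scores):
    """One (FPR, TPR) point per distinct threshold, from strictest to loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # a threshold above every score predicts no positives
    for thr in sorted(set(scores), reverse=True):
        pred = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [0, 0, 1, 0, 1, 1, 1, 0, 1, 0]
scores = [0.1, 0.6, 0.9, 0.2, 0.4, 0.8, 0.7, 0.55, 0.65, 0.3]  # hypothetical

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
print(f"ROC AUC = {auc(points):.2f}")
```

One of the points on this curve, FPR = 0.4 and TPR = 0.8, corresponds to the confusion matrix shown earlier (2 of 5 negatives flagged, 4 of 5 positives found).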