<h2 id="Contents">Contents<a href="#Contents"></a></h2>
        <ol>
        <li><a class="" href="#Classification">Classification</a></li>
<ol><li><a class="" href="#Why-Not-Linear-Regression?">Why Not Linear Regression?</a></li>
<li><a class="" href="#Logistic-Regression">Logistic Regression</a></li>
<ol><li><a class="" href="#The-Logistic-Model">The Logistic Model</a></li>
<li><a class="" href="#Estimating-the-Regression-Coefficients">Estimating the Regression Coefficients</a></li>
<li><a class="" href="#Multiple-Logistic-Regression">Multiple Logistic Regression</a></li>
<li><a class="" href="#Logistic-Regression-for-More-than-2-Response-Classes">Logistic Regression for More than 2 Response Classes</a></li>
</ol>

# Classification

When the response variable is qualitative (categorical), linear regression is not an appropriate tool for making predictions. Predicting a qualitative response for an observation is referred to as *classifying* that observation, and the process is called *classification*.

## Why Not Linear Regression?

To see why linear regression is not the best choice for classification, let's look at a simple example. Suppose that we are trying to predict the medical condition of a patient
in the emergency room on the basis of her symptoms. In this simplified
example, there are three possible diagnoses: stroke, drug overdose, and
epileptic seizure. We could consider encoding these values as a quantitative response variable, $Y$, as follows:
$$
Y = \begin{cases}
1 & \text{if stroke} \\
2 & \text{if drug overdose} \\
3 & \text{if epileptic seizure}
\end{cases}
$$
This coding is problematic for two reasons. First, it implies an ordering on the outcomes, putting drug overdose in
between stroke and epileptic seizure, and insisting that the difference
between stroke and drug overdose is the same as the difference between
drug overdose and epileptic seizure. In practice there is no particular reason for either to hold.

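Second, the encoding is arbitrary. We could just as well choose a different coding, for example:
$$
Y = \begin{cases}
1 & \text{if epileptic seizure} \\
2 & \text{if stroke} \\
3 & \text{if drug overdose}
\end{cases}
$$
which implies a different ordering of the three conditions and, in general, results in a different fitted linear model and different predictions.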
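
To make the dependence on the coding concrete, here is a minimal sketch in Python with NumPy. The data, the single "symptom score" predictor, and the helper `fit_line` are all hypothetical and exist only for illustration. The sketch fits a least squares line to the same observations under the coding stroke = 1, overdose = 2, seizure = 3 and under a permuted coding that puts stroke in the middle.

```python
import numpy as np

# Hypothetical toy data: one symptom score per patient and a diagnosis label.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
diagnosis = ["stroke", "stroke", "overdose", "seizure", "overdose", "seizure"]

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y ~ b0 + b1 * x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Two equally arbitrary numeric codings of the same three classes.
coding_a = {"stroke": 1, "overdose": 2, "seizure": 3}
coding_b = {"seizure": 1, "stroke": 2, "overdose": 3}

y_a = np.array([coding_a[d] for d in diagnosis], dtype=float)
y_b = np.array([coding_b[d] for d in diagnosis], dtype=float)

coef_a = fit_line(x, y_a)
coef_b = fit_line(x, y_b)

print("coding A (intercept, slope):", coef_a)
print("coding B (intercept, slope):", coef_b)
```

On this toy data the two fitted slopes even have opposite signs, although the underlying observations are identical.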

Note that if the outcome is binary, these problems do not arise, and we can use an encoding like:
$$
Y = \begin{cases}
0 & \text{if stroke} \\
1 & \text{if drug overdose}
\end{cases}
$$
In this case we can, in principle, use linear regression to predict the outcome and interpret the fitted value as an estimate of the probability of drug overdose. However, there is no guarantee that the output of the linear model will lie between 0 and 1, which makes it hard to interpret as a probability. Because of these problems, we use other models, such as logistic regression, for classification.
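
The range problem is easy to reproduce. The following sketch (Python with NumPy, using made-up data and a made-up predictor) fits a least squares line to a 0/1 response and evaluates it at a few new points.

```python
import numpy as np

# Hypothetical data: a single predictor and a binary (0/1) response.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Ordinary least squares fit of y ~ b0 + b1 * x.
design = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(design, y, rcond=None)

# Evaluate the fitted line at points inside and outside the observed range.
x_new = np.array([-2.0, 3.5, 10.0])
pred = b0 + b1 * x_new

for xi, pi in zip(x_new, pred):
    print(f"x = {xi:5.1f}  fitted value = {pi:.3f}")
```

The fitted values at the two extreme points fall below 0 and above 1 respectively, so they cannot be read as probabilities.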

## Logistic Regression

Rather than modeling the response $Y$ directly, logistic regression models the probability that $Y$ belongs to a particular category. For the `Default` data, logistic regression models the probability of default.
For example, the probability of default given balance can be written as
$$
\Pr(\text{default} = \text{Yes} \mid \text{balance})
$$

### The Logistic Model

We could start with a linear model
$$
p(X) = \beta_0 + \beta_1 X
$$
where $X$ is the explanatory variable and $p(X)$ is the probability of the outcome. However, under this model $p(X)$ can take any value from $-\infty$ to $+\infty$ (whenever $\beta_1 \neq 0$), which is not sensible for a probability. Instead, we use the logistic function, whose output always lies strictly between $0$ and $1$. The model is:
$$
p(X) = \frac{\exp(\beta_0+ \beta_1 X)}{1 + \exp(\beta_0+ \beta_1 X)}
$$
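
As a quick illustration, the logistic model can be evaluated in a few lines of Python with NumPy. The coefficient values below are hypothetical and chosen only to show the shape of the curve. The function uses the algebraically equivalent form $1 / (1 + e^{-(\beta_0 + \beta_1 X)})$.

```python
import numpy as np

def logistic(x, beta0, beta1):
    """Evaluate p(X) = exp(b0 + b1*X) / (1 + exp(b0 + b1*X)).

    Written as 1 / (1 + exp(-z)), which is the same function of z.
    """
    z = beta0 + beta1 * np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -3.0, 1.5

x_grid = np.linspace(-10.0, 10.0, 101)
p = logistic(x_grid, beta0, beta1)

print("smallest p on the grid:", p.min())
print("largest p on the grid: ", p.max())
```

However extreme the input, the output stays inside $(0, 1)$ and increases with $X$ when $\beta_1 > 0$, which gives the familiar S-shaped curve.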

>**Logit**: The logit (or *log-odds*) of a probability $p$ is the logarithm of the odds, $\log\left(\frac{p}{1-p}\right)$. Rearranging the logistic model gives the odds $\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1 X}$, and taking the logarithm of both sides we get:
$$
\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X
$$
> We see that the logistic regression model has a logit that is linear in $X$.

In a linear regression model, $\beta_1$ gives the
average change in $Y$ associated with a one-unit increase in $X$. In contrast,
in a logistic regression model, increasing $X$ by one unit changes the log odds
by $\beta_1$, or equivalently multiplies the odds by $e^{\beta_1}$. Because the relationship between $X$ and $p(X)$ is not a straight line, the change in $p(X)$ produced by a one-unit increase in $X$ depends on the current value of $X$.
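
This interpretation can be checked numerically. The sketch below (Python with NumPy, hypothetical coefficients) computes, for several starting values of $X$, the ratio of the odds after and before a one-unit increase, together with the corresponding change in probability.

```python
import numpy as np

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -3.0, 1.5

def prob(x):
    """Logistic model p(X) for the coefficients above."""
    z = beta0 + beta1 * x
    return np.exp(z) / (1.0 + np.exp(z))

def odds(x):
    """Odds p(X) / (1 - p(X))."""
    p = prob(x)
    return p / (1.0 - p)

xs = [0.0, 2.0, 4.0]
ratios = [odds(x + 1.0) / odds(x) for x in xs]   # odds multiplier per unit of X
deltas = [prob(x + 1.0) - prob(x) for x in xs]   # change in probability per unit of X

print("exp(beta1):", np.exp(beta1))
for x, r, d in zip(xs, ratios, deltas):
    print(f"x = {x:.1f}  odds ratio = {r:.4f}  change in p = {d:.4f}")
```

The odds ratio is the same at every starting point and equals $e^{\beta_1}$, while the change in probability varies with $X$.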

### Estimating the Regression Coefficients

Just as in linear regression, the coefficients $\beta_0$ and $\beta_1$ are unknown and must be estimated from the training data. Although (non-linear) least squares could be used to fit the model, *maximum likelihood* is generally preferred because it is more general and has better statistical properties.

The basic intuition behind using maximum likelihood
to fit a logistic regression model is as follows: we seek estimates for $\beta_0$ and
$\beta_1$ such that the predicted probability $\hat{p}(x_i)$ of default for each individual corresponds as closely as possible to that individual's observed
default status. Mathematically, we choose $\beta_0$ and $\beta_1$ to maximize the following function:
$$
\ell(\beta_0, \beta_1) = \prod_{i:y_i=1} p(x_i)\prod_{i: y_i = 0} (1-p(x_i))
$$
The above function is called the *likelihood function*.
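
One way to carry out this maximization numerically is Newton's method applied to the logarithm of the likelihood, which has the same maximizer because the logarithm is increasing. The sketch below is a minimal illustration under stated assumptions: the data are made up, the helper names (`sigmoid`, `log_likelihood`, `fit_logistic`) are hypothetical, and the number of iterations is fixed rather than chosen by a convergence test. It is not meant to describe how any particular library fits the model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """Log of the likelihood: sum_i [ y_i * z_i - log(1 + exp(z_i)) ], z = X @ beta."""
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

def fit_logistic(x, y, n_iter=25):
    """Maximize the log-likelihood over (beta0, beta1) with Newton's method."""
    X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)                 # first derivative of the log-likelihood
        hessian = -(X.T * (p * (1.0 - p))) @ X   # second derivative (negative definite)
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta

# Hypothetical data; the classes overlap, so a finite maximizer exists.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

beta_hat = fit_logistic(x, y)
X = np.column_stack([np.ones_like(x), x])

print("estimated (beta0, beta1):", beta_hat)
print("log-likelihood at estimate:", log_likelihood(beta_hat, X, y))
print("log-likelihood at zero:    ", log_likelihood(np.zeros(2), X, y))
```

If the two classes were perfectly separated by $X$, the likelihood would have no finite maximizer and this simple loop would not settle down, which is why the toy data deliberately overlap.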

>Maximum likelihood is a very general approach that is used to fit many
of the non-linear models that we examine throughout this book. In the
linear regression setting, the least squares approach is in fact a special case
of maximum likelihood. 

### Multiple Logistic Regression

By analogy with the extension from simple to multiple linear
regression, we can generalize the single-predictor logit to $p$ predictors $X = (X_1, \ldots, X_p)$:
$$
\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$
and hence the probability becomes:
$$
p(X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}
$$

Just as before, we can use maximum likelihood to estimate the coefficients.
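
For concreteness, here is a small sketch of the multiple logistic model in Python with NumPy. The function name `predict_proba`, the three predictors, and the coefficient values are all hypothetical. The sketch computes $p(X)$ for several observations at once and confirms that the log-odds recover the linear predictor.

```python
import numpy as np

def predict_proba(X, beta0, beta):
    """Return p(X) for each row of X under the multiple logistic model."""
    X = np.asarray(X, dtype=float)
    z = beta0 + X @ np.asarray(beta, dtype=float)   # beta0 + beta1*X1 + ... + betap*Xp
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for three predictors.
beta0 = -1.0
beta = np.array([0.8, -0.5, 0.2])

# Each row is one observation (X1, X2, X3).
X = np.array([
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],
    [3.0, 1.0, 5.0],
])

p = predict_proba(X, beta0, beta)
log_odds = np.log(p / (1.0 - p))

print("probabilities:", p)
print("log-odds:     ", log_odds)
print("linear part:  ", beta0 + X @ beta)
```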

### Logistic Regression for More than 2 Response Classes

The two-class logistic regression models discussed in the previous sections have multiple-class
extensions, but in practice they tend not to be used all that often. Instead, a popular choice for multiple-class classification is *discriminant analysis*.