# Introduction
Logistic regression is a simple method for classifying data.  Binary logistic regression predicts one of two classes (for example, diagnosing whether or not cancer is present).  Multinomial logistic regression extends this to more than two classes.

Logistic regression is very similar to linear regression.  The difference is in the details of the hypothesis function $h(x)$ and the cost function $J(\theta)$.  

With linear regression, $h(x)=\theta^Tx$ and $J(\theta) = \frac{1}{m}\sum_{i=1}^m(y^{(i)}-h(x^{(i)}))^2$.  In this case, $h(x)$ directly predicts the value of $y$.  With logistic regression, $h$ and $J$ are different: $h(x)$ predicts the probability that a row belongs to one category or the other.
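For comparison with what follows, here is a minimal sketch of the linear regression hypothesis and cost in plain Python. The function names `linear_hypothesis` and `mean_squared_error` and the toy data are illustrative assumptions, not part of the original text.

```python
def linear_hypothesis(theta, x):
    """h(x) = theta^T x for one row x (theta and x are equal-length lists)."""
    return sum(t * xj for t, xj in zip(theta, x))


def mean_squared_error(theta, X, y):
    """J(theta) = (1/m) * sum over rows of (y - h(x))^2."""
    m = len(X)
    return sum((yi - linear_hypothesis(theta, xi)) ** 2 for xi, yi in zip(X, y)) / m


# Toy data (assumed): the first column is a constant 1 for the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

# theta = [1, 2] reproduces y exactly, so the cost is zero.
print(mean_squared_error([1.0, 2.0], X, y))
```

Here $h(x)$ returns the predicted value itself, which is the part logistic regression replaces with a probability.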

### Calculating $h(x)$
Let's look at logistic regression where we are trying to predict whether a response is in a class or not.  Then we want $h(x)$ to represent the probability that the response is in the class.  For example, suppose we were trying to diagnose the presence of cancer based on some tests.  Then $h(x) = P(y=1 \ |\  x;\theta)$; that is, $h$ is the probability that $y$ is $1$ given the test results $x$ and the regression parameters $\theta$.  What does $h$ look like in this case?

From the expression above, we also know that $P(y=0 \ |\ x;\theta) = 1-h(x)$.  Two identities then follow for $h(x)$ and $P$:

$$ P(y=1) + P(y=0) = 1$$
and 
$$ \frac{P(y=1)}{P(y=0)} = \frac{h(x)}{1-h(x)} $$ 

This implies that when $P(y=1) = P(y=0)$, 

$$ h(x) = 1 - h(x) \implies h(x) = 0.5$$

There are a number of functions that fit these features, but logistic regression is named after one in particular, the **logistic function** (also called the sigmoid function; its inverse is the logit function): 

$$ g(z) = \frac{1}{1+e^{-z}}$$

If we use the logistic function to find the probability of class membership, then:

$$ h_{\theta}(x) = \frac{1}{1+e^{-\theta^Tx}}$$

The logistic function meets the requirements we set out for $h(x)$: its output always lies between $0$ and $1$, and $g(0) = 0.5$.  We can also develop a useful cost function for learning from it.
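As a rough illustration, the logistic (sigmoid) function $g(z) = 1/(1+e^{-z})$ and the hypothesis $h_{\theta}(x) = g(\theta^Tx)$ can be sketched in plain Python. The names `sigmoid` and `hypothesis` are assumptions made for this example, and the two-branch form is one common way to avoid overflow for large negative $z$.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), split in two to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) is at most 1
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability that y = 1 for row x."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return sigmoid(z)


print(sigmoid(0.0))                         # equal odds: 0.5
print(hypothesis([0.0, 1.0], [1.0, 3.0]))   # z = 3, a probability close to 1
```

Note that `sigmoid(z) + sigmoid(-z)` equals 1 up to rounding, matching the identity $P(y=1) + P(y=0) = 1$.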

### Calculating the Cost Function $J(\theta)$

The cost function expresses the error over the training data during training.  Machine learning, in general, can be thought of as an optimization problem whose goal is to minimize the cost function.

Since $h(x)$ represents a probability, $P(y=1)$, what we would like is for 

 * $Cost = 0$ when $h(x) = 1$ and $y = 1$, 
 * $Cost \to \infty$ as $h(x) \to 0$ and $y = 1$,
 * $Cost = 0$ when $h(x) = 0$ and $y = 0$, and
 * $Cost \to \infty$ as $h(x) \to 1$ and $y = 0$.

One easy way to support these relationships is to use natural logarithm functions to describe the cost function.  In this case we have:

* $Cost(h_{\theta}(x), y=1) = -\log(h_{\theta}(x))$
* $Cost(h_{\theta}(x), y=0) = -\log(1 - h_{\theta}(x))$

Where $0$ and $1$ are the values that $y$ can take on (the 'classes' recognized by the classifier after training).
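To see how the two logarithmic costs behave, the following sketch evaluates them for a few predicted probabilities. The function name `example_cost` and the sample values are assumptions chosen for illustration.

```python
import math


def example_cost(h, y):
    """Cost of one prediction h in (0, 1) for a true label y in {0, 1}."""
    if y == 1:
        return -math.log(h)        # small when h is near 1, large when h is near 0
    return -math.log(1.0 - h)      # small when h is near 0, large when h is near 1


for h in (0.99, 0.5, 0.01):
    print(h, example_cost(h, 1), example_cost(h, 0))
```

A confident correct prediction costs almost nothing, while a confident wrong one is penalized heavily, which is the behavior listed above.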

Now we can start looking at what $J(\theta)$ looks like for logistic regression.  In general, $J(\theta)$ is a *total* cost function used for two purposes:

* To direct the optimization central to training machine learning systems (Training), 
* To check the accuracy of the resulting system against test data (Testing).

So, in general, $J(\theta)$ is the average of the cost over the rows of training data.  The training data is divided into the predictor data $x^{(i)}$ and the *actual* response data $y^{(i)}$ for each row $i$.  The average is then:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^m Cost(h_{\theta}(x^{(i)}), y^{(i)})$$

This is a general expression for $J(\theta)$. 

We can now plug in our cost function for logistic regression to get the expression we will optimize.  Currently our expression for cost is broken into two parts, one for $y=0$ and another for $y=1$.  We would prefer to optimize a single expression so we combine the two as follows:

$$ Cost(h_{\theta}(x), y) = -(y \log(h_{\theta}(x)) + (1 - y)\log(1 - h_{\theta}(x)))$$

With this expression for cost, we can now write out the expression for $J(\theta)$ as:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m (y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\theta}(x^{(i)})))$$
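The combined cost can be computed directly from this formula. The sketch below is one possible plain-Python version; the names `sigmoid` and `cost`, the toy data, and the small clipping constant `eps` (added here only to keep the logarithm away from zero) are assumptions, not part of the original text.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), split in two to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)) over all rows."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, xi)))
        h = min(max(h, eps), 1.0 - eps)  # assumed safeguard against log(0)
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / m


# Toy data (assumed): the first column is a constant 1 for the intercept term.
X = [[1.0, -1.5], [1.0, -0.5], [1.0, 0.5], [1.0, 1.5]]
y = [0, 0, 1, 1]

print(cost([0.0, 0.0], X, y))  # every h is 0.5, so J = log(2), about 0.693
print(cost([0.0, 2.0], X, y))  # a better theta gives a lower cost
```

Because $y$ is either $0$ or $1$, exactly one of the two terms is active for each row, so the single expression reproduces the two-case cost.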

### The Update Rule for Learning $\theta$
The general update rule provides us with a way to iterate, improving our estimate for $\theta$ at each step.  The rule uses gradient descent, so generally, 

$$ \theta_j^{t+1} = \theta_j^t - \alpha \frac{\partial J(\theta)}{\partial \theta_j^t} $$

Here $j$ is the 'dimension' (or column) of the $x_j$ variable associated with $\theta_j$, $t$ is the time step of the iteration, and $\alpha$ is the learning rate.  Again, this is a general expression: it applies to any model trained with gradient descent.

Substituting the logistic regression cost function $J(\theta)$ and differentiating with respect to $\theta_j$ gives the following update rule:

$$ \theta_j^{t+1} = \theta_j^t - \frac{\alpha}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)} $$

Where $h_{\theta}(x)$ is defined as above.  Note that this rule works for any number of predictors in the logistic regression problem.
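A minimal sketch of batch gradient descent with this rule is shown below, in plain Python. It includes the $\frac{1}{m}$ factor that comes from differentiating $J(\theta)$ as defined above. The names `gradient_step` and `fit`, the learning rate, the number of steps, and the toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), split in two to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def gradient_step(theta, X, y, alpha):
    """One update: theta_j <- theta_j - (alpha/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m = len(X)
    # Prediction error h(x) - y for every row, using the current theta.
    errors = [sigmoid(sum(t * xj for t, xj in zip(theta, xi))) - yi
              for xi, yi in zip(X, y)]
    # All theta_j are updated simultaneously from the same errors.
    return [t - (alpha / m) * sum(e * xi[j] for e, xi in zip(errors, X))
            for j, t in enumerate(theta)]


def fit(X, y, alpha=0.1, steps=1000):
    """Run a fixed number of gradient descent steps starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        theta = gradient_step(theta, X, y, alpha)
    return theta


# Toy data (assumed): the first column is a constant 1 for the intercept term.
X = [[1.0, -1.5], [1.0, -0.5], [1.0, 0.5], [1.0, 1.5]]
y = [0, 0, 1, 1]

theta = fit(X, y)
for xi, yi in zip(X, y):
    p = sigmoid(sum(t * xj for t, xj in zip(theta, xi)))
    print(xi[1], yi, round(p, 3))
```

On this toy data the fitted probabilities fall below 0.5 for the rows labeled 0 and above 0.5 for the rows labeled 1. A real implementation would typically stop on a convergence test rather than a fixed step count.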