# Logistic Regression

Unlike linear regression, where the goal is to predict a continuous variable, logistic regression aims to classify examples into discrete classes. The simplest form is binary classification (e.g. 0 or 1), but classification into multiple classes is also possible.

Examples (binary cases): spam or not spam, fraudulent or normal transaction, etc.

#### Function

Instead of the linear function used in linear regression, logistic regression uses the sigmoid function to map any real number z to the interval (0, 1).

[IMAGES]

Here, h(x) gives us the probability that y takes the value 1 for the input x. Keep in mind that 1 usually denotes membership in the class with the positive outcome (e.g. spam, fraudulent). Given this probability, the probability of belonging to class 0 is simply 1 - h(x).

[IMAGES]
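
As a minimal sketch of the hypothesis described above, the sigmoid and the two class probabilities could be computed as follows. It assumes plain Python; the values of `theta` and `x` are made up for illustration and do not come from the course material.

```python
import math

def sigmoid(z):
    """Map any real number z into the interval (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h(x) = sigmoid(theta^T x); x is expected to start with the bias term 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]            # made-up parameters
x = [1.0, 0.5]                 # bias term 1 plus one feature
p_one = hypothesis(theta, x)   # probability that y = 1
p_zero = 1.0 - p_one           # probability that y = 0
```

Here z is exactly 0, so both probabilities come out as 0.5.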

#### Decision boundary

The decision boundary is an (n-1)-dimensional surface (a hyperplane if z is linear in x) that separates the regions in which the model predicts different classes. Taking n=2 as an example, it is a line dividing the plane into two subareas. It follows from a characteristic of the sigmoid function: h(x) is 0.5 or greater exactly when z is 0 or greater, so we predict 1 if z >= 0 and 0 otherwise. Taking the function inside z (e.g. a linear function) and setting it to zero gives us the decision boundary.

[IMAGES]
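
To make the n=2 case concrete, here is a small sketch that classifies points by the sign of z and solves z = 0 for the boundary line. The parameter values are invented for illustration, and the helper names `predict` and `boundary_x2` are hypothetical.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0 (equivalent to h(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

def boundary_x2(theta, x1):
    """For z = theta0 + theta1*x1 + theta2*x2, solve z = 0 for x2.

    Assumes theta2 is non-zero.
    """
    theta0, theta1, theta2 = theta
    return -(theta0 + theta1 * x1) / theta2

theta = [-3.0, 1.0, 1.0]                  # made-up: the boundary is x1 + x2 = 3
above = predict(theta, [1.0, 2.0, 2.0])   # x1 + x2 = 4, on the positive side
below = predict(theta, [1.0, 1.0, 1.0])   # x1 + x2 = 2, on the negative side
on_line = boundary_x2(theta, 1.0)         # x2 value on the boundary for x1 = 1
```

Note that the sigmoid never has to be evaluated for the class decision; the sign of z is enough.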

#### Cost Function

For logistic regression we cannot reuse the squared-error cost function of linear regression: combined with the sigmoid hypothesis it would not be convex, meaning it can have local optima in which gradient descent may get stuck.

![Convex vs not Convex](Resources/convexvsnot.png "Convex vs Not Convex")

Therefore, the following (already simplified) cost function is used instead. This function is convex.

![Logistic Regression Cost Function](Resources/logregcf.png "Logistic Regression Cost Function")
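
The following sketch assumes the simplified cost is the usual cross-entropy form, J(theta) = -(1/m) * sum(y*log(h(x)) + (1-y)*log(1-h(x))). The toy data and the clipping constant `eps` are assumptions added for illustration.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost(theta, X, y, eps=1e-12):
    """Average cross-entropy cost over all m training examples."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        h = min(max(h, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    return total / m

X = [[1.0, 0.0], [1.0, 1.0]]   # each row starts with the bias term 1
y = [0, 1]
cost_at_zero = cost([0.0, 0.0], X, y)    # every h(x) is 0.5 here
cost_better = cost([-2.0, 4.0], X, y)    # parameters that separate the two points
```

With all parameters at zero every prediction is 0.5, so the cost equals log(2); parameters that fit the data give a lower value.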

#### Minimizing the Cost Function using GD

To select the best set of model parameters theta, we have to minimize the cost function. Again, gradient descent (GD) can be used for this purpose.

![Gradient Descent for Logistic Regression](Resources/gdlogregcf.png "Gradient Descent for Logistic Regression")

Using batch GD, we can again update all theta simultaneously:

![Gradient Descent Update Rule](Resources/gdlogregcfderivative.png "Gradient Descent Update Rule")

This looks exactly like the expression used for linear regression, but it is not the same. Keep in mind that h(x) is no longer simply theta transposed times x, but the sigmoid function applied to that product.

![Difference to Linear Regression](Resources/differencetolinreg.png "Difference to Linear Regression")
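
A minimal sketch of batch GD for logistic regression, assuming the update theta_j := theta_j - alpha * (1/m) * sum((h(x) - y) * x_j). The learning rate, iteration count and toy data are made up, and the function name `batch_gradient_descent` is hypothetical.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def h(theta, x):
    # Unlike linear regression, the product theta^T x is passed through the sigmoid.
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def batch_gradient_descent(X, y, alpha=0.1, iterations=1000):
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Compute all errors with the current theta first ...
        errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
        gradient = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                    for j in range(n)]
        # ... then update every theta_j simultaneously.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta

# Toy data: bias term 1 plus one feature; larger feature values belong to class 1.
X = [[1.0, -1.5], [1.0, -0.5], [1.0, 0.5], [1.0, 1.5]]
y = [0, 0, 1, 1]
theta = batch_gradient_descent(X, y)
predictions = [1 if h(theta, xi) >= 0.5 else 0 for xi in X]
```

Building the whole gradient before touching `theta` is what makes the update simultaneous.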

#### Other means to minimize the Cost Function

Other algorithms to minimize the cost function are

* Conjugate Gradient
* BFGS
* L-BFGS

These require the cost value J(Theta) and the gradient (the partial derivative of the cost function with respect to each theta).
For more, check the Coursera documentation.


#### Multiclass classification

Here we learn one-vs-all classification. The method is simple: train a classifier hi(x) for each class i. For hi(x), the labels are rewritten as 1s and 0s, where 1 stands for the i-th class and 0 for all other classes. Do this for each class, et voilà! To classify a new x, pick the class i that maximizes hi(x), i.e. the class with the highest probability.

![One-vs-all Classification](Resources/multiclasslogreg.png "One-vs-all Classification")
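
The two steps of one-vs-all could be sketched like this: relabel the data for class i, then pick the class with the highest hi(x). The parameter vectors in `thetas` are made up (in practice each would come from training on its relabeled data), and both function names are hypothetical.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def binary_labels(y, i):
    """Relabel the targets for classifier i: 1 for class i, 0 for the rest."""
    return [1 if label == i else 0 for label in y]

def predict_one_vs_all(thetas, x):
    """Evaluate every hi(x) and return the most probable class index."""
    probabilities = [sigmoid(sum(t * v for t, v in zip(theta, x)))
                     for theta in thetas]
    best_class = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_class, probabilities

relabeled = binary_labels([0, 1, 2, 1], 1)   # training targets for classifier 1

# One made-up parameter vector per class (bias plus one feature).
thetas = [[2.0, -2.0], [-1.0, 1.0], [-4.0, 1.0]]
best_class, probabilities = predict_one_vs_all(thetas, [1.0, 2.0])
```

The individual probabilities do not need to sum to 1, since each classifier is trained independently.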


# Avoiding the problem of overfitting

### Overfitting vs. Underfitting

#### Overfitting

* Algorithm performs well on training instances, generalizes poorly to unseen instances
* The model is usually too complex (e.g. a high-degree polynomial) or the data set has too many features

#### Underfitting

* Algorithm performs poorly even on the training set
* Model is usually too simple (e.g. too few features)

![Overfitting vs. Underfitting](Resources/overvsunderfit.png "Overfitting vs. Underfitting")

### Methods to address overfitting

#### Reduce number of features
* by manual selection
* by model selection algorithms
* more on this later

#### Regularization
* Keep all the features, but reduce magnitudes / values of thetas
* works well if all attributes somehow contribute to the target

### Regularization

#### How to?

To regularize, we extend the cost function of our specific problem with a term weighted by the regularization parameter (Lambda), which penalizes large parameter values. When we then minimize the cost function, the added term is taken into account, so the thetas are usually kept smaller. If the regularization parameter is too large, we underfit; if it is too small, it has little effect and we may still overfit (check for yourself).

![Regularized Cost Function](Resources/regularizedcostfunction.png "Regularized Cost Function")
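
As an illustration, here is a sketch of a regularized squared-error cost for linear regression. It assumes the common form J = (1/(2m)) * (sum of squared errors + lambda * sum of theta_j^2 for j >= 1), with theta(0) left unpenalized; the data is made up.

```python
def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty on theta[1:]."""
    m = len(y)
    squared_errors = sum(
        (sum(t * v for t, v in zip(theta, xi)) - yi) ** 2
        for xi, yi in zip(X, y)
    )
    penalty = lam * sum(t * t for t in theta[1:])   # theta[0] is skipped
    return (squared_errors + penalty) / (2 * m)

X = [[1.0, 1.0], [1.0, 2.0]]   # bias term 1 plus one feature
y = [1.0, 2.0]
theta = [0.0, 1.0]             # fits the toy data perfectly
plain = regularized_cost(theta, X, y, lam=0.0)
penalized = regularized_cost(theta, X, y, lam=4.0)
```

Even a perfect fit now has a non-zero cost as soon as lambda is positive, which is what pushes the minimizer towards smaller thetas.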

#### Linear Regression - Gradient Descent

* Add the regularization term to the cost function
* Treat and update theta(0) separately, since it is not penalized
* Use the adjusted cost function for theta(1) onwards

![Regularized Gradient Descent](Resources/regularizedcostfunctiongd.png "Regularized Gradient Descent")
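
The steps above can be sketched as a single GD step for regularized linear regression. It assumes the penalty contributes (lambda/m) * theta_j to the gradient for j >= 1; the function name `gd_step` and all values are made up for illustration.

```python
def gd_step(theta, X, y, alpha, lam):
    """One simultaneous update of all parameters with L2 regularization."""
    m = len(y)
    errors = [sum(t * v for t, v in zip(theta, xi)) - yi
              for xi, yi in zip(X, y)]
    new_theta = []
    for j, t in enumerate(theta):
        grad = sum(e * xi[j] for e, xi in zip(errors, X)) / m
        if j > 0:
            # theta(0) is handled separately: no penalty term is added for it.
            grad += (lam / m) * t
        new_theta.append(t - alpha * grad)
    return new_theta

X = [[1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0]
theta = [0.0, 1.0]   # perfect fit, so only the penalty moves the parameters
shrunk = gd_step(theta, X, y, alpha=0.1, lam=2.0)
unchanged = gd_step(theta, X, y, alpha=0.1, lam=0.0)
```

Since the data term of the gradient is zero in this example, the step shows the pure shrinking effect of the penalty on theta(1).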

#### Linear Regression - Normal Equation
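To the usual normal equation theta = (XT * X)^-1 * XT * y, add lambda times a modified identity matrix inside the inverse: theta = (XT * X + lambda * L)^-1 * XT * y, where L is the identity matrix whose first diagonal value is 0 (so theta(0) is not penalized).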
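
A dependency-free sketch of the regularized normal equation, assuming the form theta = (XT * X + lambda * L)^-1 * XT * y with L being the identity matrix whose first diagonal entry is 0. Instead of inverting the matrix, it solves the linear system with a small Gaussian elimination helper (`solve`, a hypothetical name); in practice a linear algebra library would normally do this part.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Assumes A is square and non-singular.
    """
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def regularized_normal_equation(X, y, lam):
    n = len(X[0])
    # A = XT * X
    A = [[sum(xi[i] * xi[j] for xi in X) for j in range(n)] for i in range(n)]
    # Add lambda on the diagonal, skipping the first entry (theta(0)).
    for j in range(1, n):
        A[j][j] += lam
    # b = XT * y
    b = [sum(xi[i] * yi for xi, yi in zip(X, y)) for i in range(n)]
    return solve(A, b)

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]                                 # exactly y = 1 + 2x
theta_plain = regularized_normal_equation(X, y, lam=0.0)
theta_reg = regularized_normal_equation(X, y, lam=2.0)
```

Without regularization the exact line is recovered; with lambda = 2 the slope theta(1) shrinks while theta(0) adjusts to compensate.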

#### Logistic Regression - Regularized cost function

![Regularized Logistic Regression Cost Function](Resources/logregregcf.png "Regularized Logistic Regression Cost Function")

* GD: The update rule has the same form as for linear regression, even though the cost function is different; h(x) is again the sigmoid function
* Advanced methods: same as before, just with the regularized cost function; Jval and the gradients are still needed to feed these algorithms
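
A sketch of a function that returns both Jval and the gradients for regularized logistic regression, which is the pair that GD and the advanced methods consume. It assumes the penalty (lambda/(2m)) * sum of theta_j^2 for j >= 1 is added to the cross-entropy cost; the name `cost_and_gradient` and the toy data are made up.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost_and_gradient(theta, X, y, lam, eps=1e-12):
    """Return (Jval, gradient) for L2-regularized logistic regression."""
    m, n = len(y), len(theta)
    jval = 0.0
    gradient = [0.0] * n
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        hc = min(max(h, eps), 1.0 - eps)            # clip to avoid log(0)
        jval += -yi * math.log(hc) - (1 - yi) * math.log(1.0 - hc)
        for j in range(n):
            gradient[j] += (h - yi) * xi[j]
    jval = jval / m + (lam / (2 * m)) * sum(t * t for t in theta[1:])
    gradient = [g / m for g in gradient]
    for j in range(1, n):                           # theta(0) is not penalized
        gradient[j] += (lam / m) * theta[j]
    return jval, gradient

X = [[1.0, -1.5], [1.0, -0.5], [1.0, 0.5], [1.0, 1.5]]
y = [0, 0, 1, 1]
theta = [0.0, 2.0]
j_plain, grad_plain = cost_and_gradient(theta, X, y, lam=0.0)
j_reg, grad_reg = cost_and_gradient(theta, X, y, lam=4.0)
```

Compared with the unregularized call, only the cost value and the gradient entries for theta(1) onwards change; the entry for theta(0) stays the same.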





