# Chapter 4: Training Models

## Linear Regression
Linear Regression involves training a model that has a bias term and one parameter (weight) per feature.
<br> A Linear Regression model's performance is measured with the MSE or RMSE cost function.
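
As a rough illustration of the two points above, the sketch below computes a linear prediction (bias plus weighted sum of features) and scores it with MSE and RMSE. The function names and the toy data are made up for this example.

```python
import numpy as np

def predict(X, bias, weights):
    """Linear model: bias term plus the weighted sum of the features."""
    return bias + np.asarray(X, dtype=float) @ np.asarray(weights, dtype=float)

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target."""
    return float(np.sqrt(mse(y_true, y_pred)))

# Toy data (hypothetical): one feature, three instances
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 8.0])
y_hat = predict(X, bias=1.0, weights=[2.0])  # predictions: 3, 5, 7

print(mse(y, y_hat), rmse(y, y_hat))
```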
### The Normal Equation
Normal Equation - closed-form equation that directly computes the parameter values that minimize the cost function
<br> Scikit-Learn has a LinearRegression class that can compute linear regression easily.
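
A minimal sketch of the Normal Equation, using NumPy only and synthetic data invented for the example (true bias 4, true weight 3). It also solves the same problem with `np.linalg.lstsq`, a least-squares solver, to check that both routes agree.

```python
import numpy as np

# Synthetic data (assumption for illustration): y = 4 + 3x + noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, m)

# Add a column of ones so the bias term is learned like any other parameter
X_b = np.c_[np.ones((m, 1)), X]

# Normal Equation: theta = (X^T X)^-1 X^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Same problem solved with a least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_normal, theta_lstsq)
```

With Scikit-Learn, `LinearRegression().fit(X, y)` exposes the corresponding values as `intercept_` and `coef_`.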
### Computational Complexity
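The LinearRegression class is about O(n^2) in the number of features n, while the Normal Equation is about O(n^2.4) to O(n^3), depending on the implementation. Both scale linearly with the number of training instances.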
### Gradient Descent
Gradient Descent - generic optimization algorithm
<br> Gradient Descent operates by starting from an initial set of model parameters, then repeatedly modifying them so as to decrease the cost function.
<br> At each step, it computes the partial derivative of the cost function with respect to each parameter.
<br> Learning Rate - hyperparameter that determines how much the model's parameters are changed per step in gradient descent
<br> A Learning Rate that is too large can overshoot the minimum of the cost function and even diverge, while a rate that is too small will take too long to converge.
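
The effect of the learning rate can be seen on a toy problem. The sketch below minimizes the one-parameter cost J(theta) = theta^2 (chosen only because its gradient, 2 * theta, is easy to check by hand) with three different rates; the function name and the values are assumptions for illustration.

```python
def gradient_descent_1d(eta, theta0=10.0, n_steps=50):
    """Minimize J(theta) = theta**2, whose gradient is 2 * theta."""
    theta = theta0
    for _ in range(n_steps):
        gradient = 2 * theta
        theta = theta - eta * gradient  # step against the gradient
    return theta

# Too small: barely moves. Reasonable: close to 0. Too large: diverges.
for eta in (0.001, 0.1, 1.1):
    print(eta, gradient_descent_1d(eta))
```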
#### Batch Gradient Descent
Batch Gradient Descent - uses the gradient of the cost function, computed over the full training set, to figure out the direction in which to tweak the parameters
<br> To find a decent learning rate, use grid search.
<br> Batch Gradient Descent relies on the full training set at every step, so it can be slow on large training sets.
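
A minimal sketch of Batch Gradient Descent for Linear Regression with the MSE cost, on synthetic data invented for the example. The learning rate and iteration count are arbitrary choices, not tuned values.

```python
import numpy as np

# Synthetic data (assumption for illustration): y = 4 + 3x + noise
rng = np.random.default_rng(0)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, m)
X_b = np.c_[np.ones((m, 1)), X]  # bias column

eta = 0.1            # learning rate
n_iterations = 1000
theta = rng.normal(size=2)  # random initialization

for _ in range(n_iterations):
    # Gradient of the MSE over the FULL training set
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

# Reference solution from a least-squares solver, for comparison
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta, theta_exact)
```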
#### Stochastic Gradient Descent
Stochastic Gradient Descent - similar to batch, but at each step selects a single random instance to compute the partial derivatives and the direction of the parameter update
<br> Results in "jumpy" updates, which can help to escape local minima, but also means the parameters keep bouncing around the global minimum instead of settling on it.
<br> Learning Schedule - function that sets the Learning Rate at each iteration; with Stochastic Gradient Descent the rate should be gradually reduced over time
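
The sketch below shows one way Stochastic Gradient Descent and a learning schedule could fit together for Linear Regression. The schedule `t0 / (t + t1)`, its constants, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (assumption for illustration): y = 4 + 3x + noise
rng = np.random.default_rng(1)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, m)
X_b = np.c_[np.ones((m, 1)), X]

n_epochs = 50
t0, t1 = 5, 50  # hypothetical schedule constants

def learning_schedule(t):
    """Learning rate that shrinks as the iteration count t grows."""
    return t0 / (t + t1)

theta = rng.normal(size=2)
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)            # pick ONE random instance
        xi, yi = X_b[idx], y[idx]
        gradient = 2 * xi * (xi @ theta - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradient

print(theta)
```

Because each step uses a single instance, the result lands near the least-squares solution rather than exactly on it.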
#### Mini Batch Gradient Descent
Mini-batch Gradient Descent - at each step, computes the gradients on a small random subset (mini-batch) of the training set
<br> This provides many of the benefits of both Batch Gradient Descent and Stochastic Gradient Descent.
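
A minimal sketch of Mini-batch Gradient Descent on the same kind of synthetic data. The batch size, learning rate, and epoch count are arbitrary assumptions.

```python
import numpy as np

# Synthetic data (assumption for illustration): y = 4 + 3x + noise
rng = np.random.default_rng(2)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, m)
X_b = np.c_[np.ones((m, 1)), X]

batch_size = 20
n_epochs = 200
eta = 0.05
theta = rng.normal(size=2)

for epoch in range(n_epochs):
    shuffled = rng.permutation(m)  # new random order every epoch
    for start in range(0, m, batch_size):
        batch = shuffled[start:start + batch_size]
        X_batch, y_batch = X_b[batch], y[batch]
        # Gradient of the MSE over the mini-batch only
        gradients = (2 / len(batch)) * X_batch.T @ (X_batch @ theta - y_batch)
        theta = theta - eta * gradients

print(theta)
```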
## Polynomial Regression
Polynomial Regression - fits non-linear data with a linear model by adding powers of each feature as new features
<br> Beware of combinatorial explosion in the number of features when choosing a high-degree polynomial model.
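
The sketch below illustrates both points under simple assumptions: it fits quadratic synthetic data by adding x^2 as an extra feature and solving an ordinary linear least-squares problem, then counts how many polynomial terms (including the bias and all cross terms) exist for a given number of features and degree. The helper name `n_polynomial_features` is made up for this example.

```python
import numpy as np
from math import comb

# Synthetic quadratic data (assumption): y = 0.5x^2 + x + 2 + noise
rng = np.random.default_rng(3)
m = 100
x = 6 * rng.random(m) - 3
y = 0.5 * x**2 + x + 2 + rng.normal(0, 1, m)

# Add x**2 as a new feature, then fit a plain linear model on [1, x, x^2]
X_poly = np.c_[np.ones(m), x, x**2]
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(theta)  # roughly [2, 1, 0.5]

def n_polynomial_features(n_features, degree):
    """Count of monomials of total degree <= degree (bias term included)."""
    return comb(n_features + degree, degree)

print(n_polynomial_features(2, 3))    # 10
print(n_polynomial_features(100, 3))  # 176851
```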
## Learning Curve
If a model performs well on the training data, but poorly on the validation set, then it is overfitting the data.
<br> Learning Curves - plots of the model's performance on the training set and the validation set (using the cost function) as a function of the training set size
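
A minimal sketch of how the numbers behind a learning curve could be computed: train on growing subsets of the training set and record the RMSE on both sets. The data is synthetic and quadratic, and the model is deliberately a straight line so that it underfits; both are assumptions for illustration.

```python
import numpy as np

# Synthetic quadratic data (assumption), fitted with a plain linear model
rng = np.random.default_rng(4)
m = 120
x = 6 * rng.random(m) - 3
y = 0.5 * x**2 + x + 2 + rng.normal(0, 1, m)
X_b = np.c_[np.ones(m), x]

X_train, y_train = X_b[:80], y[:80]
X_val, y_val = X_b[80:], y[80:]

def rmse(X, y, theta):
    return float(np.sqrt(np.mean((X @ theta - y) ** 2)))

train_errors, val_errors = [], []
sizes = range(2, len(X_train) + 1)
for size in sizes:
    # Fit on the first `size` training instances only
    theta, *_ = np.linalg.lstsq(X_train[:size], y_train[:size], rcond=None)
    train_errors.append(rmse(X_train[:size], y_train[:size], theta))
    val_errors.append(rmse(X_val, y_val, theta))

print(train_errors[0], train_errors[-1], val_errors[-1])
```

Plotting `train_errors` and `val_errors` against `sizes` gives the learning curves. With two instances a line fits the training data perfectly, and the training error then rises to a plateau as more instances are added.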
### Bias/Variance Tradeoff
Generalization Error can be expressed as the sum of three different kinds of errors:
1. Bias - error caused by wrong assumptions, e.g. assuming the data is linear when in reality it is not
2. Variance - error caused by excessive sensitivity to small variations in the training data
3. Irreducible - error caused by the inherent noisiness of the data

Increasing the complexity of a model will increase its variance, but decrease its bias.
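
One rough way to see the tradeoff is to fit polynomials of increasing degree to the same data and compare training and validation error. The sketch below does this with `np.polyfit` on synthetic quadratic data; the degrees and data are assumptions for illustration.

```python
import numpy as np

# Synthetic quadratic data (assumption): y = 0.5x^2 + x + 2 + noise
rng = np.random.default_rng(5)
x_train = 6 * rng.random(30) - 3
y_train = 0.5 * x_train**2 + x_train + 2 + rng.normal(0, 1, 30)
x_val = 6 * rng.random(30) - 3
y_val = 0.5 * x_val**2 + x_val + 2 + rng.normal(0, 1, 30)

def mse_for_degree(degree):
    """Fit a polynomial of the given degree; return (train MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    return float(train_mse), float(val_mse)

# Degree 1 is too simple for this data, degree 8 is far more flexible
results = {d: mse_for_degree(d) for d in (1, 2, 8)}
for degree, (train_mse, val_mse) in results.items():
    print(degree, train_mse, val_mse)
```

The training error can only go down as the degree grows, since each model contains the simpler ones. The validation error is what reveals whether the extra flexibility helps.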
## Regularized Linear Models
Regularizing a model (constraining it) can reduce overfitting.
### Ridge Regression
Ridge Regression - regularized version of Linear Regression that adds a regularization term based on the squared l2 norm of the weight vector to the cost function
<br> Effective with polynomial models.
<br> Results in a learning algorithm that tries to keep the model weights as small as possible, which helps prevent overfitting.
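
A minimal sketch of Ridge Regression via its closed-form solution, assuming the common form theta = (X^T X + alpha * A)^-1 X^T y, where A is the identity matrix with a 0 in the top-left corner so the bias term is not regularized. The data, the polynomial features, and the scaling step are assumptions for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X_b^T X_b + alpha * A) theta = X_b^T y; the bias is not penalized."""
    m, n = X.shape
    X_b = np.c_[np.ones((m, 1)), X]
    A = np.eye(n + 1)
    A[0, 0] = 0.0  # do not regularize the bias term
    return np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)

# Synthetic linear data (assumption), deliberately given degree-5 features
rng = np.random.default_rng(6)
x = 3 * rng.random(20)
y = 1 + 0.5 * x + rng.normal(0, 0.5, 20)
X_poly = np.c_[x, x**2, x**3, x**4, x**5]

# Scale the features first: the penalty is sensitive to feature scale
X_scaled = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)

theta_plain = ridge_closed_form(X_scaled, y, alpha=0.0)  # no regularization
theta_ridge = ridge_closed_form(X_scaled, y, alpha=1.0)

print(np.linalg.norm(theta_plain[1:]), np.linalg.norm(theta_ridge[1:]))
```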
### Lasso Regression 
Least Absolute Shrinkage and Selection Operator Regression - similar to Ridge Regression, except that the regularization term uses the l1 norm of the weight vector instead of the l2 norm
<br> Tends to eliminate the weights of the least important features by setting them to zero.
<br> In effect, it performs feature selection and outputs a sparse model.
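
The sparsity effect can be sketched with a small coordinate-descent solver. This is one possible way to fit a Lasso model, not necessarily how any particular library does it, and it assumes the cost (1 / (2m)) * ||y - b - Xw||^2 + alpha * ||w||_1. The function names and the synthetic data (only the first two of five features matter) are made up for the example.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward 0; anything within [-threshold, threshold] becomes 0."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iterations=100):
    """Minimize (1 / (2m)) * ||y - b - X w||^2 + alpha * ||w||_1."""
    m, n = X.shape
    w = np.zeros(n)
    b = float(np.mean(y))
    for _ in range(n_iterations):
        for j in range(n):
            # Residual with feature j's current contribution removed
            residual = y - b - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / m
            z = X[:, j] @ X[:, j] / m
            w[j] = soft_threshold(rho, alpha) / z
        b = float(np.mean(y - X @ w))  # unpenalized bias term
    return b, w

# Synthetic data (assumption): only features 0 and 1 influence the target
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)

b, w = lasso_coordinate_descent(X, y, alpha=0.1)
print(b, w)
```

The weights of the three irrelevant features end up at exactly zero, while the two useful weights are kept, slightly shrunk toward zero.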
### Elastic Net
Elastic Net - a combination of Ridge and Lasso Regression
<br> The user selects a mix ratio that controls how much each type of regularization contributes.
<br> Setting the mix ratio to either extreme makes Elastic Net equivalent to pure Ridge or pure Lasso Regression.
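
A minimal sketch of the Elastic Net cost, assuming the form MSE + r * alpha * sum(|w|) + (1 - r) / 2 * alpha * sum(w^2), where r is the mix ratio and the bias term `theta[0]` is not regularized. The function name and the toy values are made up for the example.

```python
import numpy as np

def elastic_net_cost(theta, X_b, y, alpha, r):
    """MSE plus a mix of the l1 penalty (share r) and the l2 penalty (share 1 - r)."""
    weights = theta[1:]  # theta[0] is the bias and is not penalized
    mse = np.mean((X_b @ theta - y) ** 2)
    l1_penalty = np.sum(np.abs(weights))
    l2_penalty = 0.5 * np.sum(weights ** 2)
    return float(mse + r * alpha * l1_penalty + (1 - r) * alpha * l2_penalty)

# Toy values (hypothetical) chosen so that the MSE part is exactly 0
X_b = np.array([[1.0, 1.0, 0.0], [1.0, 2.0, 1.0]])
y = np.array([1.0, 0.0])
theta = np.array([0.0, 1.0, -2.0])

print(elastic_net_cost(theta, X_b, y, alpha=0.1, r=0.0))  # Ridge penalty only
print(elastic_net_cost(theta, X_b, y, alpha=0.1, r=1.0))  # Lasso penalty only
print(elastic_net_cost(theta, X_b, y, alpha=0.1, r=0.5))  # even mix
```

In Scikit-Learn, the `ElasticNet` class exposes the mix ratio as `l1_ratio`.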
### Early Stopping 
Early Stopping - regularization technique that stops training as soon as the validation error reaches its minimum
<br> Avoids overfitting by limiting the amount of training (number of iterations), not the amount of data.
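
A minimal sketch of Early Stopping wrapped around Batch Gradient Descent. It keeps a copy of the best parameters seen so far and stops once the validation error has not improved for `patience` epochs. The `patience` mechanism, the degree-10 features, and the synthetic data are assumptions for illustration.

```python
import numpy as np

# Synthetic quadratic data (assumption), expanded to degree-10 features
rng = np.random.default_rng(8)
m = 60
x = 6 * rng.random(m) - 3
y = 0.5 * x**2 + x + 2 + rng.normal(0, 1, m)

degree = 10
X_poly = np.column_stack([x**d for d in range(1, degree + 1)])
X_train_raw, X_val_raw = X_poly[:40], X_poly[40:]
y_train, y_val = y[:40], y[40:]

# Standardize with training-set statistics, then add the bias column
mean, std = X_train_raw.mean(axis=0), X_train_raw.std(axis=0)
X_train = np.c_[np.ones(40), (X_train_raw - mean) / std]
X_val = np.c_[np.ones(20), (X_val_raw - mean) / std]

def val_mse(theta):
    return float(np.mean((X_val @ theta - y_val) ** 2))

theta = np.zeros(degree + 1)
initial_val_error = val_mse(theta)
eta, patience = 0.01, 50  # hypothetical settings
best_val_error, best_theta, best_epoch = float("inf"), theta.copy(), 0

for epoch in range(5000):
    gradients = (2 / len(y_train)) * X_train.T @ (X_train @ theta - y_train)
    theta = theta - eta * gradients
    error = val_mse(theta)
    if error < best_val_error:
        # New best model: remember its parameters
        best_val_error, best_theta, best_epoch = error, theta.copy(), epoch
    elif epoch - best_epoch >= patience:
        break  # no improvement for `patience` epochs: stop training

print(best_epoch, best_val_error)
```

The model that is kept is `best_theta`, not the parameters from the last iteration.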
## Logistic Regression
Logistic Regression - estimates the probability that an instance belongs to a certain class
### Estimating Probabilities
Once the probability is calculated, the algorithm classifies the instance as yes if the probability is at least 50%, and no otherwise.
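
The two steps (estimate a probability, then threshold it) can be sketched as follows. The parameter values in `theta` are hypothetical, not trained.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(X_b, theta):
    """Estimated probability of the positive class for each instance."""
    return sigmoid(X_b @ theta)

def predict(X_b, theta, threshold=0.5):
    """Class 1 if the estimated probability reaches the threshold, else class 0."""
    return (predict_proba(X_b, theta) >= threshold).astype(int)

theta = np.array([-3.0, 2.0])  # hypothetical bias and weight
X_b = np.array([[1.0, 0.0], [1.0, 1.5], [1.0, 3.0]])  # first column is the bias input

print(predict_proba(X_b, theta))
print(predict(X_b, theta))
```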
### Training and Cost Function 
Log Loss - logistic regression cost function
<br> There is no known closed-form equation to compute the parameters that minimize the Log Loss, but the cost function is convex, so Gradient Descent can be used to find the global minimum.
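
A minimal sketch of training Logistic Regression by running Batch Gradient Descent on the Log Loss. The synthetic labels are drawn from a logistic model invented for the example, and the learning rate and iteration count are arbitrary.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(X_b, y, theta, eps=1e-12):
    """Average negative log-likelihood of the labels under the model."""
    p = np.clip(sigmoid(X_b @ theta), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Synthetic data (assumption): labels sampled from a known logistic model
rng = np.random.default_rng(9)
m = 200
x = rng.normal(0, 2, m)
y = (rng.random(m) < sigmoid(1.5 * x - 0.5)).astype(float)
X_b = np.c_[np.ones(m), x]

theta = np.zeros(2)
eta = 0.1
initial_loss = log_loss(X_b, y, theta)  # equals ln(2) when theta is all zeros

for _ in range(2000):
    # Gradient of the Log Loss: average of (prediction error * feature)
    gradients = X_b.T @ (sigmoid(X_b @ theta) - y) / m
    theta = theta - eta * gradients

final_loss = log_loss(X_b, y, theta)
print(initial_loss, final_loss, theta)
```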
### Decision Boundaries 
It is important to note that for instances close to the decision boundary (where the estimated probability is near 50%), the model still predicts one class or the other, but with very low confidence.
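
For a model with a single feature, the decision boundary is the input value where the estimated probability is exactly 50%, that is, where `theta0 + theta1 * x = 0`. The sketch below uses hypothetical parameter values to show how weak the confidence is just next to the boundary.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

theta0, theta1 = -4.0, 2.5    # hypothetical bias and weight
boundary = -theta0 / theta1   # probability is exactly 0.5 here

def classify(x):
    """Return (predicted class, estimated probability of class 1)."""
    p = sigmoid(theta0 + theta1 * x)
    return int(p >= 0.5), p

# Just below, just above, and far above the boundary
for x in (boundary - 0.05, boundary + 0.05, boundary + 2.0):
    print(x, classify(x))
```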
### Softmax Regression
Softmax Regression - generalized logistic regression model that can support multiple classes, without the need of numerous binary classifiers
<br> Predicts only one class at a time (multiclass, not multioutput), so it should only be used with mutually exclusive classes.
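
A minimal sketch of the prediction step in Softmax Regression: compute one score per class, turn the scores into probabilities that sum to 1, and pick the class with the highest probability. The parameter matrix `Theta` holds hypothetical values, not trained ones.

```python
import numpy as np

def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1 for each row."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def predict_class(X_b, Theta):
    """Pick the single most probable class for each instance."""
    return np.argmax(softmax(X_b @ Theta), axis=1)

# Hypothetical parameters: rows are (bias, feature), columns are the 3 classes
Theta = np.array([[1.0, 0.0, -2.0],
                  [-1.0, 0.0, 1.0]])
X_b = np.array([[1.0, -1.0], [1.0, 1.5], [1.0, 4.0]])  # first column is the bias input

probabilities = softmax(X_b @ Theta)
print(probabilities)
print(predict_class(X_b, Theta))
```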