# Chapter 4 Notes: Training Models

## Intro
 - Chapter will cover various topics important for properly training an ML model
 - Will start with linear regression
    - Closed form solution to directly calculate solution 
    - Gradient descent 
        - Batch GD, mini-batch GD, stochastic GD
 - Polynomial Regression
    - Detect overfitting with learning curves
 - Regularization techniques for avoiding overfitting
 - Logistic regression
 - Softmax regression


## Linear Regression 
 - notation and intro
     - Linear regression is based on the equation for a line: $y = mx + b$
     - That equation generalizes to $\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
        - $\hat{y}$ is the prediction
        - $\theta_0$ is the bias term (the y-intercept)
        - $x_1$ to $x_n$ are the feature values
        - $\theta_1$ to $\theta_n$ are the weights for each feature value (aka parameters)
        - So linear regression is a weighted sum of the features plus an offset
    - Can also be written in vector form as a dot product: $\hat{y} = \theta \cdot \textbf{x}$
        - $\theta$ is the parameter vector (including the bias term $\theta_0$) and $\textbf{x}$ is the feature vector with $x_0 = 1$
    - Assuming MSE is our cost function, we need to find the values of $\theta$ that minimize the MSE
        - $MSE(\textbf{X}, h_{\theta}) = \frac{1}{m} \sum_{i=1}^{m} (\theta^T \textbf{x}^{(i)} - y^{(i)})^2$, summed over the $m$ training instances
 - Normal Equation
     - The normal equation is a closed-form equation that directly gives the values of $\theta$ which minimize the cost function
     - $\hat{\theta} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y}$
        - $\hat{\theta}$ is the value of $\theta$ that minimizes the cost function
        - $\textbf{y}$ is the vector of target values for instances 1 to m
 - Computational Complexity
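     - Compute requirements grow quickly with the number of features n: inverting $\textbf{X}^T \textbf{X}$ is typically about $O(n^{2.4})$ to $O(n^3)$, so doubling n multiplies the computation time by roughly 5 to 8. The cost is linear in the number of instances m.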
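
To make the normal equation concrete, here is a minimal NumPy sketch. The toy data ($y = 4 + 3x$ plus noise) and the names `X_b`, `theta_hat` and `mse` are my own assumptions for illustration, not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

# Hypothetical toy data: y = 4 + 3x + Gaussian noise
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0.0, 0.5, size=m)

# Add x0 = 1 to every instance so theta_0 acts as the bias term
X_b = np.c_[np.ones((m, 1)), X]

# Normal equation: theta_hat = (X^T X)^-1 X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Same answer from a least-squares solver, which avoids the explicit inverse
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)


def mse(theta, X_b, y):
    """Mean squared error of the linear model theta on (X_b, y)."""
    return np.mean((X_b @ theta - y) ** 2)


print("theta_hat:", theta_hat)
print("MSE:", mse(theta_hat, X_b, y))
```

The recovered parameters should land close to the 4 and 3 used to generate the data. In practice a least-squares solver is usually preferred over the explicit inverse because it is better behaved numerically.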

## Gradient Descent 
 - Intro
    - GD is an optimization algorithm that iteratively tweaks the parameters to minimize a cost function. 
    - Measure the local gradient of the error function with respect to the parameter vector ($\theta$) and step in the direction of descending gradient. 
    - Random Initialization - start with random values for $\theta$
    - Learning Rate - the hyperparameter which determines the size of the change to $\theta$ after each step in GD. 
        - the step size is the learning rate times the gradient, so steps shrink as the slope of the cost function flattens near a minimum. 
    - A learning rate that is too small will take too long to converge
    - A learning rate that is too large can overshoot the minimum and diverge
    - For a general cost function, GD can get stuck in a local minimum instead of reaching the global minimum 
        - not a concern with MSE for linear regression since it is a convex function
    - Parameter Space - the space searched by GD for a minimum. Its dimensionality is the number of model parameters (one weight per feature, plus the bias term). 
 - Batch Gradient Descent
    - GD involves calculating the partial derivative of the cost function with respect to each parameter $\theta_j$
    - Batch GD calculates the gradient over the full training set, $\textbf{X}$, at every step, so it is slow on very large training sets. 
    - The partial derivatives together form the gradient vector $\nabla_{\theta} MSE(\theta)$, which points uphill. So subtract it from $\theta$ to go downhill.
    - Multiply the gradient by the learning rate, $\eta$, to set the step size: $\theta^{(\text{next})} = \theta - \eta \nabla_{\theta} MSE(\theta)$
    - Tolerance - a threshold, $\epsilon$, for the norm of the gradient vector: when the norm drops below the tolerance the algorithm stops. This lets you set the number of iterations very high and still stop as soon as GD has (almost) reached the minimum. 
    - Convergence rate scales inversely with the tolerance: dividing the tolerance by 10 can make the algorithm run about 10 times longer. 
 - Stochastic Gradient Descent
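    - picks a single random instance at each step and computes the gradient based only on that instance, instead of using the whole training set. 
    - Able to take steps much faster with less compute. But it will need more steps to approach the minimum, and with a constant learning rate it keeps bouncing around the minimum instead of settling on it. 
    - can be used as an out-of-core algorithm for huge datasets
    - its randomness helps it escape local minima
    - works well with a gradually decreasing learning rate (set by a learning schedule), which lets it settle near the minimum
    - sensitive to poorly shuffled data: training instances should be independent and identically distributed, so shuffle them (or pick them at random) during training
    - available in sklearn as `SGDRegressor`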
 - Mini-Batch Gradient Descent
     - computes the gradient on small random sets of instances called mini-batches
     - its path through parameter space is more direct than SGD but more erratic than batch GD. 
     - takes less time on each step than batch GD.
     - gets a performance boost from hardware-optimized matrix operations, especially on GPUs 
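
The three GD variants differ only in how many instances feed each gradient step. The sketch below is one way to write them in NumPy, assuming the same kind of toy linear data as before. The function names, the learning schedule `t0 / (step + t1)` and all hyperparameter values are illustrative choices of mine, not the book's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 4 + 3x + noise
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0.0, 0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]  # prepend x0 = 1 for the bias term


def mse_gradient(theta, X_b, y):
    """Gradient of the MSE: (2/m) * X^T (X theta - y)."""
    return 2 / len(y) * X_b.T @ (X_b @ theta - y)


def batch_gd(X_b, y, eta=0.1, n_iter=10_000, tol=1e-6):
    """Batch GD: every step uses the full training set."""
    theta = rng.normal(size=X_b.shape[1])  # random initialization
    for _ in range(n_iter):
        grad = mse_gradient(theta, X_b, y)
        if np.linalg.norm(grad) < tol:  # tolerance: stop once the gradient is tiny
            break
        theta = theta - eta * grad
    return theta


def minibatch_gd(X_b, y, batch_size=20, n_epochs=200, t0=5.0, t1=50.0):
    """Mini-batch GD with a decaying learning rate; batch_size=1 gives stochastic GD."""
    theta = rng.normal(size=X_b.shape[1])
    step = 0
    for _ in range(n_epochs):
        order = rng.permutation(len(y))  # reshuffle the instances every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            eta = t0 / (step + t1)  # learning schedule: eta shrinks over time
            theta = theta - eta * mse_gradient(theta, X_b[idx], y[idx])
            step += 1
    return theta


theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)
theta_batch = batch_gd(X_b, y)
theta_sgd = minibatch_gd(X_b, y, batch_size=1, n_epochs=50)
theta_mini = minibatch_gd(X_b, y, batch_size=20)

print("exact:     ", theta_exact)
print("batch GD:  ", theta_batch)
print("SGD:       ", theta_sgd)
print("mini-batch:", theta_mini)
```

Batch GD should match the least-squares solution almost exactly, while the stochastic and mini-batch runs end up near it rather than on it, which is the behavior the notes describe.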

## Polynomial Regression
 - You can use a linear model to fit nonlinear data by adding powers of each feature as new features
 - e.g. quadratic data: $y = ax^2 + bx + c$
 - generate a nonlinear dataset with X and y
 - add the square of each feature to the training set as a new feature
    - now instead of using $X_i$ to predict $y_i$ you use $X_i$ and $X^2_{i}$
    - they used `PolynomialFeatures` from sklearn.preprocessing to generate the second degree term for each instance of X
    - fit a linear regressor and it will return an intercept and a coefficient for each feature ($X$ and $X^2$)
    - if you started with more than one original feature then `PolynomialFeatures` also adds all combinations of features up to the given degree. With degree=3, (a,b) becomes ($a, b, a^2, ab, b^2, a^3, a^2b, ab^2, b^3$) 
        - Note the combinatorial explosion in the number of features: n features at degree d become $\frac{(n+d)!}{d!\,n!}$ features (including the bias column) 
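
A small sketch of the idea, using plain NumPy as a stand-in for `PolynomialFeatures` so it runs without sklearn. The quadratic toy data and the variable names are assumptions made for illustration.

```python
from math import comb

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quadratic data: y = 0.5 x^2 + x + 2 + noise
m = 200
X = 6 * rng.random((m, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0.0, 1.0, size=m)

# Add the square of the feature as a second feature. For a single input
# feature this matches PolynomialFeatures(degree=2, include_bias=False).
X_poly = np.c_[X, X ** 2]

# Fit a plain linear model on the expanded features
X_b = np.c_[np.ones((m, 1)), X_poly]
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
intercept, coef_x, coef_x2 = theta
print("intercept:", intercept, "coef for x:", coef_x, "coef for x^2:", coef_x2)

# Feature count for n=2 original features at degree d=3, bias column included
n_features_out = comb(2 + 3, 3)
print("features for (a, b) at degree 3:", n_features_out)
```

The model is still linear in its parameters. Only the inputs were transformed, which is why an ordinary least-squares fit can recover the curve.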

## Learning Curves
 - adding high-degree polynomial features is an easy way to generate an overly complex model that will overfit the training data and fail to generalize. 
 - Learning Curve - a plot of the model's performance on the training set and the validation set as a function of the training set size (or training iteration). Basically, train the model on the first n training instances and plot its error on those instances and on the validation set. Then do the same for n+1 instances, etc. 
 - They give the example of a linear regression model fit to the nonlinear dataset. It underfits the data, and you can tell because the training and validation RMSE both plateau at a fairly high value, close to each other, with the trainRMSE very slightly < valRMSE
    - adding more data to an underfit model will not improve performance; use a more complex model or better features instead 
 - Next they plot the learning curve for a 10th degree polynomial regressor. 
    - the training RMSE is significantly lower compared to the LR model.
    - there is a gap between the trainRMSE and valRMSE with the valRMSE larger. This means the model is overfitting. This can be lessened by feeding the model much more data to train on. 
 - The Bias/Variance Trade-off
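    - Bias - error from making faulty assumptions about the data (e.g. assuming it is linear when it is quadratic). Often occurs because the model is too simple, so a high-bias model tends to underfit. 
    - Variance - error from excessive sensitivity to small variations in the training data. Often occurs because the model is too complex, so a high-variance model tends to overfit. 
    - Irreducible Error - due to noise in the data itself. Can only be reduced by cleaning up the data. 
    - The trade-off: increasing model complexity typically raises variance and lowers bias, and reducing complexity does the opposite. 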
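
The learning-curve procedure can be sketched as follows. This is a rough NumPy version under my own assumptions (quadratic toy data, a 150/50 train/validation split, and hypothetical helper names `design`, `rmse`, `learning_curve`), not the book's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical quadratic data, split into 150 training and 50 validation instances
m = 200
x = 6 * rng.random(m) - 3
y = 0.5 * x ** 2 + x + 2 + rng.normal(0.0, 1.0, size=m)
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]


def design(x, degree):
    """Columns 1, x, x^2, ..., x^degree (x rescaled to [-1, 1] for numerical stability)."""
    return np.vander(x / 3.0, degree + 1, increasing=True)


def rmse(theta, x, y, degree):
    """Root mean squared error of the polynomial model theta on (x, y)."""
    return np.sqrt(np.mean((design(x, degree) @ theta - y) ** 2))


def learning_curve(degree, sizes):
    """Train on the first n instances for each n; return (train RMSE, val RMSE) arrays."""
    train_err, val_err = [], []
    for n in sizes:
        theta, *_ = np.linalg.lstsq(design(x_train[:n], degree), y_train[:n], rcond=None)
        train_err.append(rmse(theta, x_train[:n], y_train[:n], degree))
        val_err.append(rmse(theta, x_val, y_val, degree))
    return np.array(train_err), np.array(val_err)


sizes = np.arange(15, 151, 5)
lin_train, lin_val = learning_curve(1, sizes)     # straight line: expected to underfit
poly_train, poly_val = learning_curve(10, sizes)  # degree 10: expected to overfit

print("linear    train/val RMSE at full size:", lin_train[-1], lin_val[-1])
print("degree 10 train/val RMSE at full size:", poly_train[-1], poly_val[-1])
```

Plotting each pair of arrays against `sizes` gives the learning curves. The linear model's curves should plateau at a relatively high error, while the degree-10 model should show a lower training error and a gap to the validation error that narrows as the training set grows.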

## Regularized Linear Models
 - Intro 
    - $L_2$ norm of a vector is the square root of the sum of the squares of its components: $||\theta||_2 = \sqrt{\sum_{i=1}^{n} \theta_i^2}$, where n is the number of dimensions in the vector. 
    - $L_1$ norm of a vector is the sum of the absolute values of its components: $||\theta||_1 = \sum_{i=1}^{n} |\theta_i|$ 
    - Regularization is constraining a model to prevent overfitting
        - e.g. reducing the degree of a polynomial regressor
        - linear models are often regularized by constraining the weights
 - Ridge Regression
    - Regularization Term - the mathematical term used to implement regularization. In RR it is $\alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2$, which is added to the cost function. 
    - note that the reg term only covers $\theta_1$ to $\theta_n$. The bias term $\theta_0$ is not regularized
    - The regularization term should only be added to the cost function during training. Evaluate the trained model with the unregularized performance measure.
    - Alpha is a non-negative hyperparameter that controls how much to regularize. 
        - if alpha is very large then all the weights end up close to zero and the model is a horizontal line through the mean of the data 
        - if alpha = 0 then RR is just linear regression
    - $\textbf{w}$ = weights = $\theta_1$ to $\theta_n$
    - reg term = $\alpha \frac{1}{2} ||\textbf{w}||_2^2$, where $||\textbf{w}||_2$ is the $L_2$ norm (so the squared norm is just the sum of squares)
    - they show some examples in fig 4-17: 
        - for LR a larger alpha makes the line closer to horizontal
        - for PR a larger alpha produces flatter curves 
    - RR can also be calculated with a closed form equation: $\hat{\theta} = (\textbf{X}^T \textbf{X} + \alpha \textbf{A})^{-1} \textbf{X}^T \textbf{y}$, where $\textbf{A}$ is the $(n+1) \times (n+1)$ identity matrix except with a 0 in the top-left cell (the bias term)
 - Lasso Regression
    - Least Absolute Shrinkage and Selection Operator Regression
    - adds the $L_1$ norm of the weight vector, scaled by alpha, to the cost function.
    - tends to reduce the weights of the least important features to 0
        - for PR this will reduce most of the higher degree polynomial weights to 0
        - sort of an automatic feature selection that generates a sparse model
        - in parameter space this shows up as the optimization path driving one weight to 0 first, then continuing along that axis while it shrinks the others (see fig 4-19)
 - Elastic Net
    - a mix of RR and Lasso
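    - uses a mix ratio, r, which controls the balance of ridge vs lasso regression (r = 0 is ridge, r = 1 is lasso)
    - $J(\theta) = MSE(\theta) + r \alpha \sum_{i=1}^{n} |\theta_i| + \frac{1-r}{2} \alpha \sum_{i=1}^{n} \theta_i^2$, i.e. r weights the lasso ($L_1$) term and $\frac{1-r}{2}$ weights the ridge term
    - generally you want to use some form of regularization instead of plain LR. Ridge is a good default. 
        - if you suspect that only a few features are important then use Lasso or elastic net. 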
 - Early Stopping
    - good way to regularize any iterative learning algorithm
    - Stops training when the validation error reaches a minimum. 
    - As a model trains its validation error will decrease. But then as it becomes overfit the validation error will increase. 
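
As a quick check on the ridge closed-form equation, the sketch below builds the matrix $\textbf{A}$ explicitly and compares a few values of alpha. The toy data and the function name `ridge_closed_form` are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical toy data: y = 1 + 0.5x + noise
m = 50
X = 3 * rng.random((m, 1))
y = 1 + 0.5 * X[:, 0] + rng.normal(0.0, 0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]  # x0 = 1 column for the bias term


def ridge_closed_form(X_b, y, alpha):
    """theta_hat = (X^T X + alpha A)^-1 X^T y, with A = identity except A[0, 0] = 0."""
    A = np.eye(X_b.shape[1])
    A[0, 0] = 0.0  # the bias term is not regularized
    # solve() computes the same result as multiplying by the inverse, more stably
    return np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)


theta_ols = ridge_closed_form(X_b, y, alpha=0.0)     # plain linear regression
theta_mid = ridge_closed_form(X_b, y, alpha=10.0)    # weight shrunk toward 0
theta_huge = ridge_closed_form(X_b, y, alpha=1e6)    # nearly a horizontal line

for name, theta in [("alpha=0", theta_ols), ("alpha=10", theta_mid), ("alpha=1e6", theta_huge)]:
    print(name, "bias:", theta[0], "weight:", theta[1])
```

With a very large alpha the weight collapses toward 0 and the bias moves to the mean of `y`, which is the horizontal-line behavior noted above.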

## Logistic Regression
 - Intro 
    - logistic regression estimates the probability that an instance belongs to a particular class, which makes it a binary classifier. 
 - Estimating Probabilities
    - computes a weighted sum of the input features plus a bias term, just like linear regression
    - vector form of estimated probability: $\hat{p} = h_{\theta}(\textbf{x}) = \sigma(\theta^T \textbf{x})$
    - $\sigma(\cdot)$ is the sigmoid (logistic) function, which outputs a number between 0 and 1: $\sigma(t) = \frac{1}{1+e^{-t}}$
    - if $\hat{p}$ for a given instance is $\geq 0.5$ then $\hat{y}$ for that instance is 1, otherwise it is 0
 -  Training and the Cost Function
    - objective is to estimate high probabilities for positive instances (y=1) and low probabilities for negative instances (y=0) (binary classifier)
    - cost function for a given instance: $c(\theta) = -\log(\hat{p})$ if $y=1$, and $-\log(1-\hat{p})$ if $y=0$
    - a different cost expression for each outcome. So if the estimated probability matches the outcome the cost will be low, and vice versa. 
    - log loss - the average cost over the training set: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1-y^{(i)}) \log(1-\hat{p}^{(i)}) \right]$
    - no known closed-form solution for the $\theta$ that minimizes this cost function (no equivalent of the normal equation)
    - but the cost function is convex, so GD is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough). 
 - Decision Boundaries
    - logistic regression produces a decision boundary, which is the point where the estimated probability of the positive class ($\hat{p}$) crosses 50%
    - if you have a single feature then the decision boundary is a point on a line, if you have two features it is a line in a plane, etc
    - can be regularized with an $L_2$ or $L_1$ penalty (sklearn's `LogisticRegression` applies an $L_2$ penalty by default). 
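
Since there is no closed-form solution, logistic regression is trained with GD on the log loss. The sketch below is a minimal NumPy version on a made-up one-feature dataset. The data, the function names and the hyperparameters (`eta`, `n_iter`) are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical 1-feature data: class 1 becomes more likely as x grows
m = 300
x = rng.normal(0.0, 2.0, size=m)
y = (x + rng.normal(0.0, 1.0, size=m) > 0.5).astype(float)
X_b = np.c_[np.ones(m), x]  # x0 = 1 column for the bias term


def sigmoid(t):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))


def log_loss(theta, X_b, y, eps=1e-12):
    """Average log loss over the training set."""
    p_hat = np.clip(sigmoid(X_b @ theta), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))


def fit_logistic(X_b, y, eta=0.1, n_iter=5000):
    """Batch GD on the log loss, starting from theta = 0."""
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_iter):
        grad = X_b.T @ (sigmoid(X_b @ theta) - y) / len(y)  # gradient of the log loss
        theta -= eta * grad
    return theta


theta = fit_logistic(X_b, y)
y_pred = (sigmoid(X_b @ theta) >= 0.5).astype(float)  # predict 1 when p_hat >= 0.5
boundary = -theta[0] / theta[1]  # the x value where p_hat crosses 0.5

print("theta:", theta)
print("log loss:", log_loss(theta, X_b, y))
print("training accuracy:", np.mean(y_pred == y))
print("decision boundary at x =", boundary)
```

With a single feature the decision boundary is the single point `boundary` on the x axis, matching the note above about the boundary's dimensionality.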

## Softmax Regression
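 - Classification for multiple classes (aka multinomial logistic regression)
 - computes a score, $s_k(\textbf{x})$, for each class k for a given instance $\textbf{x}$
 - $s_k(\textbf{x}) = (\theta^{(k)})^T \textbf{x}$
     - $\theta^{(k)}$ is the parameter vector for class k. Each class has its own.
 - Softmax Function
     - calculates the probability $\hat{p}_k$ that a given instance belongs to class k
     - takes the exponential of each score and normalizes them by dividing by the sum of all the exponentials: $\hat{p}_k = \frac{\exp(s_k(\textbf{x}))}{\sum_{j=1}^{K} \exp(s_j(\textbf{x}))}$, where K is the number of classes
 - predicts the class with the highest estimated probability, which is simply the class with the highest score
 - predicts only one class at a time, so it should only be used with mutually exclusive classes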
 - Training 
    - Cross Entropy
        - penalizes the model when it estimates a low probability for the target class
        - comes from information theory, where it measures the average number of bits you send per option
    - compute the gradient vector of the cross entropy for each class, then use GD to find the parameter matrix which minimizes the cost function.
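
The softmax function, the cross entropy cost and the per-class gradient fit together roughly as in this NumPy sketch. The three-blob dataset, the learning rate and the iteration count are assumptions picked for illustration, and `Theta` is my name for the parameter matrix (one column per class).

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(P, Y):
    """Average cross entropy; Y is one-hot, P holds the estimated probabilities."""
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))


rng = np.random.default_rng(5)

# Hypothetical data: three 2-D Gaussian blobs, one per class
K, per_class = 3, 100
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + rng.normal(size=(per_class, 2)) for c in centers])
labels = np.repeat(np.arange(K), per_class)
X_b = np.c_[np.ones(len(X)), X]  # x0 = 1 column for the bias term
Y = np.eye(K)[labels]            # one-hot targets

Theta = np.zeros((X_b.shape[1], K))  # parameter matrix, one column per class
initial_loss = cross_entropy(softmax(X_b @ Theta), Y)

for _ in range(2000):
    P = softmax(X_b @ Theta)         # estimated class probabilities
    grad = X_b.T @ (P - Y) / len(X)  # column k is the gradient vector for class k
    Theta -= 0.1 * grad

final_loss = cross_entropy(softmax(X_b @ Theta), Y)
pred = np.argmax(X_b @ Theta, axis=1)  # highest score = highest probability
accuracy = np.mean(pred == labels)

print("cross entropy before/after:", initial_loss, final_loss)
print("training accuracy:", accuracy)
```

Before training every class gets probability 1/3, so the starting cross entropy is $\log 3$. Because softmax preserves the ordering of the scores, taking the argmax of the raw scores gives the same prediction as taking the argmax of the probabilities.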
    