# Math in Machine Learning

## 1. Statistics

### 1.1 The trade-off between bias and variance

Bias: the error introduced by overly simple assumptions in the model, i.e. how far the model's average prediction is from the true values.

A model with high bias pays very little attention to the training data and oversimplifies the problem. It leads to **high error** on both training and test data.

Variance: how much the model's predictions change when it is trained on a different sample of data.

A model with high variance pays a lot of attention to the training data, including its noise, and **does not generalize** to data it hasn’t seen before.

- Low variance and high bias ==> underfitting (model too simple)
- High variance and low bias ==> overfitting (model too complex)
- High variance and high bias ==> the worst case (inaccurate and unstable)
- Low variance and low bias ==> the ideal, which is rarely fully achievable in practice
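
As a rough illustration of the two failure modes, the sketch below compares a constant predictor (high bias) with a 1-nearest-neighbour predictor (high variance). The target `y = x^2`, the noise level, the sample sizes, and all function names are assumptions made up for this example.

```python
import random

random.seed(0)

def make_data(n):
    """Sample n points from y = x^2 plus Gaussian noise (a made-up toy target)."""
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [x * x + random.gauss(0, 0.1) for x in xs]
    return xs, ys

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(30)
test_x, test_y = make_data(30)

# High bias: ignore x and always predict the mean of the training targets.
mean_y = sum(train_y) / len(train_y)

def constant_model(x):
    return mean_y

# High variance: 1-nearest-neighbour memorizes every training point, noise included.
def nearest_neighbour_model(x):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

for name, model in [("constant", constant_model), ("1-NN", nearest_neighbour_model)]:
    print(name,
          "train MSE:", mse(model, train_x, train_y),
          "test MSE:", mse(model, test_x, test_y))
```

The 1-NN model reproduces the training targets exactly, so its training error is zero while its test error is not. The constant model has a nonzero error on both sets.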

### 1.2 Differentiate between correlation and covariance

**Covariance shows the direction in which two variables vary together, whereas correlation also shows the strength of that relationship on a unit-free scale from -1 to 1.**

Covariance is a statistical measure of how two random variables vary together: it is positive when they tend to increase together and negative when one tends to decrease as the other increases. Its magnitude depends on the units of the variables.

$$cov(X,Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

Correlation is a measure of how strongly two random variables move together. It is the covariance divided by the standard deviations of both variables, so it always lies between -1 and 1.

$$corr(X,Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y}$$
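
The two formulas can be sketched in plain Python as follows. The height and weight values are made-up data, and the function names are chosen for this example only. The point is that rescaling one variable changes the covariance but not the correlation.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

heights_cm = [150.0, 160.0, 170.0, 180.0]    # made-up data
heights_m = [h / 100 for h in heights_cm]    # same data in a different unit
weights_kg = [50.0, 58.0, 69.0, 80.0]        # made-up data

print(covariance(heights_cm, weights_kg))    # depends on the unit of height
print(covariance(heights_m, weights_kg))     # 100 times smaller
print(correlation(heights_cm, weights_kg))   # unit-free
print(correlation(heights_m, weights_kg))    # same value as the line above
```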

### 1.3 Gradient Descent

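Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost). It starts from an initial guess and repeatedly moves the parameters a small step in the direction of the negative gradient of the cost.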
Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
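
A minimal sketch of the update rule, assuming a one-parameter toy cost `(w - 3)^2` whose gradient is known in closed form. The cost, the learning rate, and the step count are illustrative choices, not values from these notes.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Toy cost (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
def cost_gradient(w):
    return 2 * (w - 3)

w_min = gradient_descent(cost_gradient, start=0.0)
print(w_min)  # close to 3
```

If the learning rate is too large the iterates can overshoot and diverge; if it is too small, convergence is slow.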

## Training

### How to deal with overfitting
1. Regularization. Add a penalty term on the model's coefficients to the objective function.
2. Use a simpler model. With fewer variables and parameters, the variance can be reduced.
3. Cross-validation methods such as k-fold can be used to detect overfitting and to choose hyperparameters such as the regularization strength.
4. If some model parameters are likely to cause overfitting, regularization techniques such as LASSO can be used to penalize these parameters.
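
The k-fold idea from the list above can be sketched as an index generator. `k_fold_indices` is a hypothetical helper written for this example (it does not shuffle the data, which a real pipeline usually would).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, non-overlapping folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in k_fold_indices(10, 3):
    print("train:", train_idx, "validate:", val_idx)
```

Each sample is used for validation exactly once. A model is trained on `train_idx` and scored on `val_idx` for every fold, and the scores are averaged; a large gap between training and validation scores suggests overfitting.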

### Combat the curse of dimensionality

1. Manual Feature Selection
2. Principal Component Analysis (PCA)
3. Multidimensional Scaling
4. Locally linear embedding

### Regularization

Regularization is a technique that discourages learning a more complex or flexible model, so as to reduce the risk of overfitting.

Lasso and Ridge regression are both types of regularized linear regression, which means they **add a penalty term to the linear regression objective function** to prevent overfitting. The main difference between the two is the type of penalty term they use.

- L1 norm: **Lasso (Least Absolute Shrinkage and Selection Operator) regression** adds a penalty term to the objective function that is proportional to the absolute value of the coefficients. This results in some of the coefficients becoming exactly zero, which effectively eliminates those features from the model. This is called feature selection and it helps in reducing the number of features in the model. Lasso regression is useful for models with many features and it can help to select a subset of those features that are most useful for predicting the target variable.

$$\text{Objective} = \text{SSE} + \lambda \sum_{j} |w_j|$$

- L2 norm: **Ridge Regression**, on the other hand, adds a penalty term to the objective function that is proportional to the sum of the squared coefficients. This results in all coefficients shrinking towards zero, but none of them becoming exactly zero. This helps to reduce the impact of any one feature on the model, which can help to prevent overfitting. Ridge Regression is useful when there is high multicollinearity between the features.

$$\text{Objective} = \text{SSE} + \lambda \sum_{j} w_j^2$$

In short, Lasso regression is useful for feature selection, while Ridge regression is useful for reducing the impact of correlated features.
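
To see why Lasso can produce exact zeros while Ridge only shrinks, the sketch below uses the simplest possible case: one feature, no intercept, where both objectives have a closed-form minimizer. The data and the helper names are made up for this example; real problems with many features need an iterative solver.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, clipping to exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_one_feature(xs, ys, lam):
    """Closed-form minimizers for y ~ w * x with one feature and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    ols = sxy / sxx                               # minimizes SSE
    ridge = sxy / (sxx + lam)                     # minimizes SSE + lam * w^2
    lasso = soft_threshold(sxy, lam / 2) / sxx    # minimizes SSE + lam * |w|
    return ols, ridge, lasso

xs = [1.0, 2.0, 3.0]
ys = [0.2, 0.1, 0.3]   # made-up data with a weak relationship

for lam in (0.0, 1.0, 5.0):
    print(lam, fit_one_feature(xs, ys, lam))
```

With `lam = 0` all three estimates coincide. As `lam` grows, the Ridge coefficient gets smaller but stays nonzero, while the Lasso coefficient is set to exactly zero once `lam / 2` exceeds the absolute value of `sum(x * y)`.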

### How to deal with unbalanced dataset

1. Oversampling or undersampling. Instead of sampling with a uniform distribution from the training dataset, we can use other distributions so the model sees a more balanced dataset.
2. **Data augmentation**. Add data in the less frequent categories by modifying existing data in a controlled way. For example, flip the images that show illnesses, or add noise to copies of the images in such a way that the illness remains visible.
3. **Using appropriate metrics**. Accuracy can be misleading on an imbalanced dataset. In the "asteroid is going to hit the Earth" example, a model that always made negative predictions would achieve an accuracy of 99.9% while never detecting a single positive case. Metrics such as precision, recall, and F-score describe the performance of the model better when using an imbalanced dataset.
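
The point about metrics can be checked with a small sketch. The dataset below (1 positive among 1000 samples) is made up to reproduce the 99.9% figure, and returning 0.0 when a metric is undefined is a convention chosen for this example.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1] + [0] * 999        # one positive case in 1000 samples
always_negative = [0] * 1000    # a model that never predicts the positive class

metrics = classification_metrics(y_true, always_negative)
print(metrics)
```

The accuracy looks excellent, but recall and F1 are zero, which exposes that the model never finds the positive case.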

### What is Momentum (w.r.t NN optimization)?

Momentum lets the optimization algorithm remember its last step and add some proportion of it to the current step. This way, even if the algorithm is stuck in a flat region or a small local minimum, it can get out and continue towards a lower minimum.
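
A minimal sketch of the classical momentum update, assuming a one-parameter toy cost `(w - 3)^2`. The names `velocity` and `beta` and the hyperparameter values are illustrative assumptions.

```python
def momentum_descent(grad, start, learning_rate=0.1, beta=0.9, steps=300):
    """Gradient descent where each step adds a fraction `beta` of the previous step."""
    w = start
    velocity = 0.0  # the remembered previous step
    for _ in range(steps):
        velocity = beta * velocity - learning_rate * grad(w)
        w += velocity
    return w

# Toy cost (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
def cost_gradient(w):
    return 2 * (w - 3)

w_min = momentum_descent(cost_gradient, start=0.0)
print(w_min)  # close to 3
```

Because `velocity` carries over between iterations, the parameter keeps moving even where the current gradient is close to zero, which is what helps in flat regions.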

### What is the difference between Batch Gradient Descent and Stochastic Gradient Descent?

Batch gradient descent computes the gradient using the whole dataset. This is great for convex or relatively smooth error manifolds. In this case, we move somewhat directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed learning rate, will eventually find the minimum located in its basin of attraction.

Stochastic gradient descent (SGD) computes the gradient using a single sample. SGD copes better than batch gradient descent with error manifolds that have lots of local maxima/minima. In this case, the noisier gradient calculated from a single sample tends to jerk the model out of local minima into a region that is hopefully closer to optimal.
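
The difference between the two update schemes can be sketched on a one-parameter linear model `y ~ w * x` with squared error. The data, learning rate, and epoch count are made-up values for illustration; the toy problem is convex, so both variants reach the same answer and only the way the gradient is computed differs.

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # made-up, noise-free data from y = 2x

def batch_gd(xs, ys, learning_rate=0.01, epochs=100):
    """One update per epoch, using the gradient averaged over the whole dataset."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

def sgd(xs, ys, learning_rate=0.01, epochs=100, seed=0):
    """One update per sample, visiting the samples in a shuffled order each epoch."""
    rng = random.Random(seed)
    w = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            grad = 2 * (w * xs[i] - ys[i]) * xs[i]   # gradient from a single sample
            w -= learning_rate * grad
    return w

print(batch_gd(xs, ys))  # close to 2
print(sgd(xs, ys))       # close to 2
```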

### What is vanishing gradient?

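As we add more and more hidden layers, backpropagation becomes less and less useful in passing information to the lower layers. Backpropagation multiplies the local derivatives of each layer together, so when these derivatives are smaller than 1 (as with sigmoid or tanh activations), the gradients shrink as they are passed back and become too small to update the weights of the early layers effectively.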
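
The shrinking can be made concrete with a small sketch, assuming a chain of one-unit sigmoid layers that all share the same weight. This is a deliberately simplified, hypothetical network, and the code tracks the derivative of the output with respect to the input rather than with respect to the weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_layers(x, n_layers, weight=1.0):
    """d(output)/d(input) for a chain of n one-unit sigmoid layers."""
    activation = x
    grad = 1.0
    for _ in range(n_layers):
        activation = sigmoid(weight * activation)
        # Chain rule: each layer contributes weight * sigmoid'(z),
        # and sigmoid'(z) = a * (1 - a) is never larger than 0.25.
        grad *= weight * activation * (1.0 - activation)
    return grad

for n in (1, 5, 10, 20):
    print(n, gradient_through_layers(0.5, n))
```

Each extra layer multiplies the gradient by a factor of at most 0.25 when the weight is 1, so the value printed for 20 layers is many orders of magnitude smaller than the value for 1 layer.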
