## Algorithms

### Regression problem:
* Ordinal regression: Data in rank ordered categories
* Poisson regression: Predict event counts
* Fast forest quantile regression: Predict a distribution
* Bayesian Linear regression: Linear model, small data sets
* Neural network regression: Accurate, long training times
* Decision Forest regression: Accurate, fast training times
* Boosted Decision Tree regression: Accurate, fast training times, large memory footprint

### Clustering
* K-means: unsupervised learning

### Anomaly detection
* PCA-Based Anomaly detection: fast training times
* One-class SVM: Under 100 features, aggressive boundary

### Two-class classification
* Two-class SVM: under 100 features, linear model
* Two-class averaged perceptron: fast training, linear model
* Two-class Bayes point machine: fast training, linear model
* Two-class decision forest: accurate, fast training
* Two-class logistic regression: fast training, linear model
* Two-class boosted decision tree: Accurate, fast training, large memory footprint
* Two-class decision jungle: Accurate, small memory footprint
* Two-class locally deep SVM: under 100 features
* Two-class neural network: Accurate, long training times

### Multiclass classification
* Logistic regression: fast training times, linear model
* Neural network: accurate, long training times
* Decision forest: accurate, fast training times
* Decision jungle: accurate, small memory footprint
* One-v-all multiclass: depends on the two-class classifier



### Machine learning
* Supervised learning (labeled data: both inputs and outputs are known)
* Unsupervised learning (data without labels: find patterns or intrinsic structure in the data, e.g. by clustering or reducing dimensions)
* Semi-supervised learning (a small amount of labeled data combined with a large amount of unlabeled data)
* Reinforcement learning (learn actions that maximize the **reward function**)


### Underfitting VS Overfitting
* **Underfitting**: the model performs poorly on both the training data and the test data (it is **excessively simple**)
* **Overfitting**: the model performs very well on the training data but fails to predict well on new data
* Both lead to poor predictions on new data
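
The contrast can be seen in a small sketch. The data (`y = 2x` plus noise) and both models are hypothetical, chosen only to make the two failure modes visible: one model ignores the input entirely, the other memorizes the training set.

```python
import random

def mse(model, xs, ys):
    """Mean squared error of `model` (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(n, rng):
    """Hypothetical noisy linear data: y = 2x + Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(50, rng)
test_x, test_y = make_data(50, rng)

# Underfitting: ignore x and always predict the training mean.
mean_y = sum(train_y) / len(train_y)
def too_simple(x):
    return mean_y

# Overfitting: memorize the training set (1-nearest-neighbor lookup).
def memorizer(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

for name, model in [("too simple", too_simple), ("memorizer", memorizer)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 2),
          "test MSE:", round(mse(model, test_x, test_y), 2))
```

The too-simple model has a large error on both sets, while the memorizer has zero training error but a nonzero test error because it has also memorized the noise.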

### Model validation strategy
* **Hold-out validation**
* **k-fold cross validation**
* **Leave-one-out cross validation**

#### Hold-out strategy
**Split the data into two parts: a training set and a test set**  
**Note:** The estimate can have **high variance**. The evaluation may depend heavily on which data points end up in the training set and which end up in the test set, so the result can differ significantly depending on how the split is made.  

**A better approach: split the data into a training set (60%), a cross-validation set (20%), and a test set (20%)**  (for a relatively small data set)  
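
A minimal sketch of the 60/20/20 split, using only the standard library. The function name `train_val_test_split` and its parameters are made up for this illustration.

```python
import random

def train_val_test_split(items, seed=0, train_frac=0.6, val_frac=0.2):
    """Shuffle `items` and cut them into training, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # shuffle so the split is not order-dependent
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (about 20%)
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Changing `seed` changes which points land in each set, which is exactly the source of the variance mentioned in the note.
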
#### k-fold cross validation
**The data set is divided into k equal-sized subsets, and the hold-out procedure is repeated k times. Each time, one of the k subsets is used as the test set, and the other k-1 subsets are combined to form the training set. The average error across all k trials is then computed.**  
Advantage: the result depends less on how the data is divided  
Disadvantage: time-consuming and computationally expensive (the model is trained k times)  
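
The procedure can be sketched as follows. The "model" here is a hypothetical placeholder that predicts the mean of the training targets; in practice the data would normally be shuffled before splitting.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_val_error(ys, k):
    """Average test MSE over k folds for a 'predict the training mean' model."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(ys), k):
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)
        fold_mse = sum((ys[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(fold_mse)
    return sum(fold_errors) / k

ys = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 2.0]
print(cross_val_error(ys, k=5))
```

Every sample is used for testing exactly once and for training k-1 times.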

#### Leave-one-out cross validation (LOOCV)
The extreme case of k-fold cross validation, where k = n (the sample size)  
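
A small sketch of LOOCV, assuming a hypothetical model that simply predicts the mean of its training targets (`fit_mean` and `predict_mean` are illustrative names):

```python
def leave_one_out_error(ys, fit, predict):
    """Hold out each sample once, train on the other n-1, average the squared errors."""
    n = len(ys)
    total = 0.0
    for i in range(n):
        training = ys[:i] + ys[i + 1:]  # all samples except the i-th
        model = fit(training)
        total += (predict(model) - ys[i]) ** 2
    return total / n

def fit_mean(training):
    """Hypothetical model: the mean of the training targets."""
    return sum(training) / len(training)

def predict_mean(model):
    return model

ys = [1.0, 2.0, 3.0, 6.0]
loocv = leave_one_out_error(ys, fit_mean, predict_mean)
print(round(loocv, 4))  # 6.2222
```

The model is trained n times, which is why LOOCV is usually reserved for small data sets.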

**Large dataset: hold-out validation**  
**Small dataset: cross validation (10-fold cross validation is common)**


## Model evaluation metrics

### Classification model
* Accuracy  

 * $Accuracy = \frac{n_{correct}}{n_{total}}$  
 * **Not a good metric for skewed datasets, e.g. a dataset in which only 1% of the samples are patients**
 * Only reliable when the classes are roughly balanced (symmetric data)

* Precision  

* Recall
* F score
* ROC
* AUC
* Log loss  
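
To see why accuracy misleads on skewed data, consider the 1% patients case from the accuracy notes. The labels below are hypothetical, and the "classifier" is deliberately useless: it always predicts the majority class.

```python
# Hypothetical screening data: 1 = patient, 0 = healthy, with only 1% patients.
y_true = [1] * 10 + [0] * 990
# A useless "classifier" that always predicts the majority class.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
patients_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)        # 0.99
print(patients_found)  # 0
```

The accuracy is 99% even though not a single patient is detected, which is why precision and recall are needed.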

### Confusion matrix visually shows more details of the error
|                  | Predicted value (+) | Predicted value (-) |
|------------------|---------------------|---------------------|
| Actual value (+) | TP (true positive)  | FN (false negative) |
| Actual value (-) | FP (false positive) | TN (true negative)  |

**True positive**: we predicted "+" and the true class is "+"  
**True negative**: we predicted "-" and the true class is "-"  
**False positive**: we predicted "+" and the true class is "-"  
**False negative**: we predicted "-" and the true class is "+"

$$Accuracy = \frac{TP+TN}{TP+FN+FP+TN}$$
  
$$Precision = \frac{TP}{TP+FP}$$  
  
$$Recall = \frac{TP}{TP+FN}$$  
  
**Precision and recall usually trade off against each other: raising one (for example by moving the decision threshold) tends to lower the other**
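
A sketch of how the confusion-matrix counts and the two metrics can be computed, and how moving the decision threshold trades one for the other. The labels and scores are made-up illustration data, and returning 0 when a denominator is empty is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

def precision_recall(y_true, y_pred):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0  # convention: 0 if nothing predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1]  # hypothetical model scores

for threshold in (0.75, 0.5, 0.25):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    print(threshold, confusion_counts(y_true, y_pred), precision_recall(y_true, y_pred))
```

With the strict threshold 0.75 precision is 1.0 and recall is 0.5; with the lenient threshold 0.25 recall rises to 1.0 while precision falls to about 0.67.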


### F score
**The F score combines precision and recall into a single measure**  
**Select the algorithm with the highest F score**

$$F_1 = 2\frac{P \cdot R}{P+R}$$

where $P$ is precision and $R$ is recall.
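
A short worked example of the formula. The three (precision, recall) pairs are hypothetical; they are chosen to show that a plain average can favor a degenerate classifier while $F_1$ does not.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs for three algorithms.
candidates = {"A": (0.5, 0.4), "B": (0.7, 0.1), "C": (0.02, 1.0)}

for name, (p, r) in candidates.items():
    print(name, "average:", round((p + r) / 2, 3), "F1:", round(f1_score(p, r), 3))

best = max(candidates, key=lambda name: f1_score(*candidates[name]))
print(best)  # A
```

Algorithm C has the highest simple average (0.51) only because its recall is 1.0, yet its $F_1$ is the lowest; the harmonic mean punishes a very low precision or recall.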

### ROC Curve
* **The ROC curve shows how many correct positive classifications can be gained as you allow for more and more false positives (true positive rate plotted against false positive rate)**  
* **The closer the curve is to the upper left corner, the better the classifier's performance**  
* Insensitive to changes in the class distribution of the dataset

### AUC (area under the curve)
**A higher AUC is better: 1.0 is a perfect classifier and 0.5 is no better than random guessing**
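
A minimal sketch of how ROC points and the AUC can be computed from classifier scores, using one threshold per distinct score and the trapezoidal rule. The labels and scores are made-up, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR), one per distinct threshold, starting at (0, 0)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
points = roc_points(y_true, scores)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1): each step accepts more positives at the cost of more false positives.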

### Regression Model
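* MAE (mean absolute error)
* MSE (mean squared error)
* RMSE (root mean squared error)
* MAPE (mean absolute percentage error)
* Adjusted $R^2$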
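
The metrics listed above can be computed as in the sketch below. The function name, the returned dictionary keys, and the sample values are illustrative; `n_features` is the number of predictors used by the model, which the adjusted $R^2$ penalizes.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Common regression metrics for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE is undefined when a true value is 0; this sketch assumes none are.
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape,
            "R2": r2, "adjusted R2": adjusted_r2}

metrics = regression_metrics([2, 4, 6, 8], [1, 5, 6, 10], n_features=1)
for name, value in metrics.items():
    print(name, round(value, 4))
```

For this data MAE is 1.0, MSE is 1.5, MAPE is 25%, $R^2$ is 0.7 and the adjusted $R^2$ is 0.55.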

### How to address underfitting (high bias)
* Train a more complex model
 1. Make the same algorithm more complex (polynomial terms, deeper decision trees)
 2. Switch to a more complex algorithm/model (e.g. a neural network or random forest)
* Add more features as input
* Adjust parameters / run a hyperparameter search
* Use ensemble learning: boosting

### How to fix high variance (overfitting)
**Overfitting often occurs when a model is too complex or when there is insufficient data**  
* Use more data
* Use **regularization**   
L2 regularization (Ridge): **results in many relatively small but nonzero parameters**
$$||\theta||_2^2 = \sum_{i=1}^n \theta_i^2$$  
L1 regularization (Lasso): **pushes certain weights to be exactly 0**
$$||\theta||_1 = \sum_{i=1}^n |\theta_i|$$
* Reduce the number of features
* Switch to a less complex model
* Adjust parameters / run a hyperparameter search
* Use ensemble learning: bagging and random forests  
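
The different effect of the two penalties can be illustrated with a deliberately simplified sketch. It assumes each coefficient can be treated on its own by minimizing $(\theta - a)^2 + \lambda \cdot penalty(\theta)$, where $a$ is the unregularized value; this is not a general Ridge or Lasso solver, and the function names and numbers are made up.

```python
def l2_penalty(theta):
    """Squared L2 norm: sum of squared coefficients."""
    return sum(t ** 2 for t in theta)

def l1_penalty(theta):
    """L1 norm: sum of absolute coefficients."""
    return sum(abs(t) for t in theta)

def ridge_1d(a, lam):
    """Minimizer of (theta - a)**2 + lam * theta**2: shrinks but never reaches 0."""
    return a / (1 + lam)

def lasso_1d(a, lam):
    """Minimizer of (theta - a)**2 + lam * |theta| (soft thresholding)."""
    magnitude = abs(a) - lam / 2
    if magnitude <= 0:
        return 0.0  # small coefficients are set to exactly zero
    return magnitude if a > 0 else -magnitude

unregularized = [3.0, -0.4, 0.2]
lam = 1.0
ridge = [ridge_1d(a, lam) for a in unregularized]
lasso = [lasso_1d(a, lam) for a in unregularized]
print(ridge)  # [1.5, -0.2, 0.1]
print(lasso)  # [2.5, 0.0, 0.0]
```

The L2 solution keeps every coefficient small but nonzero, while the L1 solution zeroes out the two small ones, which is why Lasso also acts as a feature selector.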

**Note: If a learning algorithm is suffering from high bias, getting more data will not help**

### Tuning hyperparameters
Examples:
1. number of trees in a random forest
2. learning rate of gradient descent
3. Regularization strength  

**Hyperparameter tuning**
* It is an iterative process; common search strategies:
 1. Grid search
 2. Random search (**recommended**)
 3. Bayesian optimization
* A hyperparameter can control the model complexity
* A hyperparameter can control the behavior of the training algorithm

### Hyperparameter VS parameter
**Hyperparameters are specified before the training algorithm starts and can't be optimized inside the training algorithm itself. They are external to the model.**
* Used in processes to help estimate model parameters
* Cannot be estimated from data
* Often specified by the practitioner
* Do not change during a training job  

**Parameters are the variables that the learning algorithm adjusts to fit your data. They are internal to the model.**
* Required by the model when making predictions
* They are estimated or learned from data
* Not set manually by the practitioner
* They change during a training job

Hyperparameter examples:  
**Neural network**
* Learning rate
* Number of layers
* Number of hidden units
* Type of unit
* Mini-batch size

**SVM**
* C, Kernel, gamma

**Lasso, Ridge regression**
* Regularization parameter

**K-means**
* Number of clusters

Parameter examples:  
**Neural networks**
* Weights and biases

**Linear regression**
* Weights $W$ and bias $b$
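
The distinction shows up clearly in a small gradient-descent sketch for one-variable linear regression. The function name and the data are illustrative: `learning_rate` and `n_epochs` are hyperparameters fixed before training, while `w` and `b` are parameters learned from the data.

```python
def fit_linear_regression(xs, ys, learning_rate, n_epochs):
    """Gradient descent on mean squared error for the model y = w * x + b."""
    w, b = 0.0, 0.0  # parameters: learned from the data, updated every epoch
    n = len(xs)
    for _ in range(n_epochs):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2 * sum(residuals) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

# Hyperparameters: chosen by the practitioner before training, fixed during it.
w, b = fit_linear_regression(xs, ys, learning_rate=0.05, n_epochs=2000)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```

Only `w` and `b` are needed to make predictions afterwards; the learning rate and epoch count only shaped how they were found.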


**Grid search exhaustively tries every possible hyperparameter combination, so it is a costly and time-consuming approach**  
**Prefer random search to grid search, especially when the hyperparameter search space is large**  
**Try random values rather than a grid, and use a coarse-to-fine sampling scheme**  
**For example, it can be helpful to first search over a coarse range (e.g. $10^{u}$ with $u$ drawn uniformly from $[-6, 1]$) and then, depending on where the best results turn up, narrow the range**
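
A sketch of coarse-to-fine random search over a learning rate sampled on a log scale. `validation_loss` is a hypothetical stand-in for training a model and measuring its validation loss; the trial counts and the "one decade either side" narrowing rule are assumptions made for this example.

```python
import math
import random

def validation_loss(learning_rate):
    """Hypothetical stand-in for 'train a model and return its validation loss'.
    It is smallest at learning_rate = 1e-3."""
    return (math.log10(learning_rate) + 3) ** 2

def random_search(low_exp, high_exp, n_trials, rng):
    """Sample learning rates as 10 ** uniform(low_exp, high_exp) and keep the best."""
    best_loss, best_lr = float("inf"), None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(low_exp, high_exp)
        loss = validation_loss(lr)
        if loss < best_loss:
            best_loss, best_lr = loss, lr
    return best_loss, best_lr

rng = random.Random(0)
# Coarse stage: the whole range 10 ** [-6, 1].
coarse_loss, coarse_lr = random_search(-6, 1, n_trials=20, rng=rng)
# Fine stage: one decade either side of the best coarse result.
center = math.log10(coarse_lr)
fine_loss, fine_lr = random_search(center - 1, center + 1, n_trials=20, rng=rng)
# Keep the best result seen in either stage.
best_loss, best_lr = min((coarse_loss, coarse_lr), (fine_loss, fine_lr))
print(coarse_lr, best_lr)
```

Sampling the exponent uniformly gives every order of magnitude the same number of trials on average, which a uniform sample of the learning rate itself would not.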