## Regression Model Performance

### R Squared - Goodness of fit (greater is better)
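- Residual Sum of Squares (squared differences between the actual values and the regression predictions) -> SS<sub>res</sub>
- Total Sum of Squares (squared differences between the actual values and their average) -> SS<sub>tot</sub>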
- R<sup>2</sup> = 1 - (SS<sub>res</sub> / SS<sub>tot</sub>)
- The better the model, the lower the Residual Sum of Squares and the higher the R<sup>2</sup>
    - R<sup>2</sup> = 1 -> perfect (suspicious)
    - R<sup>2</sup> ~0.9 -> very good
    - R<sup>2</sup> < 0.7 -> not great
    - R<sup>2</sup> < 0.4 -> terrible
    - R<sup>2</sup> < 0 -> the model fits worse than always predicting the average (treat as invalid)
- Problem: R<sup>2</sup> never decreases when another variable is added, even an irrelevant one, because the new variable does not impact SS<sub>tot</sub> but does impact SS<sub>res</sub>
    - SS<sub>tot</sub> doesn't change
    - SS<sub>res</sub> will decrease or stay the same
        - SS<sub>res</sub> will never increase when another variable is added
        - Ordinary Least Squares seeks to minimize SS<sub>res</sub>, and it can always give an unhelpful variable a coefficient of 0
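
The R<sup>2</sup> formula above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `r_squared` and the sample values are made up for the example.

```python
def r_squared(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    mean_y = sum(y_true) / len(y_true)
    # SS_res: squared errors of the model's predictions
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # SS_tot: squared errors of always predicting the average
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


# Hypothetical sample data, chosen only for illustration
y_true = [3.0, 5.0, 7.0, 9.0]

good_fit = r_squared(y_true, [2.5, 5.5, 7.0, 9.0])   # close predictions, R^2 near 1
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])   # always predicts the average: 0.0
bad_fit = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])    # worse than the average: negative

print(good_fit, baseline, bad_fit)
```

Predicting the average for every sample makes SS<sub>res</sub> equal to SS<sub>tot</sub>, which is why that baseline scores exactly 0 and anything worse goes negative.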

### Adjusted R Squared
- Adj R<sup>2</sup> = 1 - (1 - R<sup>2</sup>) * ((n - 1) / (n - k - 1))
    - k - number of independent variables
    - n - sample size
    - penalizes the model for each additional variable
        - the new variable must provide enough value to compensate for the penalty
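
To see the penalty at work, here is a small sketch of the adjusted formula. The function name `adjusted_r_squared` and the R<sup>2</sup> values are assumptions picked for illustration, not results from a real model.

```python
def adjusted_r_squared(r2, n, k):
    """Return 1 - (1 - r2) * (n - 1) / (n - k - 1).

    r2: plain R^2 of the model
    n:  sample size
    k:  number of independent variables
    """
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical scenario: a 4th variable nudges R^2 from 0.900 to 0.901
three_vars = adjusted_r_squared(0.900, n=50, k=3)
four_vars = adjusted_r_squared(0.901, n=50, k=4)

# Plain R^2 went up, but the adjusted value goes down:
# the new variable did not add enough to cover its penalty.
print(three_vars, four_vars, four_vars < three_vars)
```

In this made-up case the tiny gain in R<sup>2</sup> is outweighed by the larger (n - 1) / (n - k - 1) factor, so the adjusted score argues against keeping the extra variable.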

## Regression Model Selection
|Model|Pros|Cons|
|:-----|:-----|:-----|
|Linear|Works on any size of dataset, gives information about the relevance of features|Depends on the linear regression assumptions holding|
|Polynomial|Works on any size of dataset, works very well on nonlinear problems|Need to choose the right polynomial degree for a good bias/variance tradeoff|
|SVR|Easily adaptable, works very well on nonlinear problems, not biased by outliers|Feature scaling is compulsory, not well known, more difficult to understand|
|Decision Tree|Interpretability, no need for feature scaling, works on both linear/nonlinear problems|Poor results on very small datasets, overfitting can easily occur|
|Random Forest|Powerful and accurate, good performance on many problems, including nonlinear|No interpretability, overfitting can easily occur, need to choose the right number of trees|

# Classification

## Confusion Matrix
||Pred - Neg|Pred - Pos|
|:-----|:-----|:-----|
|Actual - Neg|True Neg|False Pos|
|Actual - Pos|False Neg|True Pos|

- Accuracy Rate = (TN + TP) / Total
- Error Rate = (FP + FN) / Total
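
The four cells and the two rates can be computed by hand, as in the sketch below. It assumes binary labels encoded as 0 (negative) and 1 (positive); the function name `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels, where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


# Hypothetical labels, chosen only for illustration
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
total = tn + fp + fn + tp

accuracy_rate = (tn + tp) / total
error_rate = (fp + fn) / total

print((tn, fp, fn, tp))           # (3, 1, 1, 3)
print(accuracy_rate, error_rate)  # 0.75 0.25
```

The two rates always sum to 1, since every prediction lands in exactly one of the four cells.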

## Classification Model Selection

|Model|Pros|Cons|
|:-----|:-----|:-----|
|Logistic Regression|Probabilistic approach, gives information about the statistical significance of features|Depends on the logistic regression assumptions holding|
|K-NN|Simple to understand, fast and efficient|Need to choose the number of neighbors k|
|SVM|Performant, not biased by outliers, not sensitive to overfitting|Not appropriate for nonlinear problems, not the best choice for a large number of features, more complex|
|Kernel SVM|High performance on nonlinear problems, not biased by outliers, not sensitive to overfitting|Not the best choice for a large number of features, more complex|
|Naive Bayes|Efficient, not biased by outliers, works on nonlinear problems, probabilistic approach|Based on the assumption that features have the same statistical relevance|
|Decision Tree|Interpretability, no need for feature scaling, works on both linear/nonlinear problems|Poor results on very small datasets, overfitting can easily occur|
|Random Forest|Powerful and accurate, good performance on many problems, including nonlinear|No interpretability, overfitting can easily occur, need to choose the number of trees|
