<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Stage-2: Data Understanding](08.00-mlpg-Stage-2-Data-Understanding.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [Stage-4: Data Preprocessing](10.00-mlpg-Stage-4-Data-Preprocessing.ipynb) ]>

# 9. Stage-3: Research

Based on the type of problem (_regression, classification, clustering, etc._), research the available algorithms and list here the ML algorithms that will be used to build the models. The final model will be selected based on performance.
* List the names of the algorithms (regressors or classifiers) and their types (linear, non-linear, ensemble, etc.) for the project
* Identify the evaluation metrics selected for the project and give the reasons for choosing them
* Describe the overall approach for developing the project (similar to the development methodology in traditional projects)

## 9.1. List of selected algorithms to build models
The following algorithms, covering a variety of regression/classification strategies and techniques, have been selected for the project. First, all models are trained with their default parameters. Next, only the few best performers (according to the evaluation metrics) are short-listed, and each of them is tuned individually to minimize error. Finally, the final model is selected by comparing the evaluation metrics of the tuned models on the training and test datasets. Select the algorithms based on the type of problem you are trying to solve. Commonly used algorithms are:

### 9.1.1. Regression algorithms
```
 1) Linear Regression (LR)          LinearRegression()            - Linear
 2) Lasso (Lasso)                   LassoCV()                     - Linear
 3) Ridge (Ridge)                   RidgeCV()                     - Linear
 4) ElasticNet (EN)                 ElasticNetCV()                - Linear
 5) K-Nearest Neighbors (KNN)       KNeighborsRegressor()         - Non-linear
 6) Support Vector Machines (SVM)   SVR()                         - Non-linear
 7) Decision Trees (DT)             DecisionTreeRegressor()       - Non-linear
 8) Random Forest (RF)              RandomForestRegressor()       - Ensemble - Bagging
 9) Gradient Boosting (GB)          GradientBoostingRegressor()   - Ensemble - Boosting
10) Extreme Gradient Boosting (XGB) XGBRegressor()                - Ensemble - Boosting
```
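The spot-check step described above (train every candidate with default parameters, then rank them) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the synthetic data from `make_regression`, the 5-fold setting, and the choice of `R^2` as scorer are all assumptions, and `XGBRegressor` is left out because it lives in the separate `xgboost` package.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): replace with the project's own X, y.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Candidate models with default hyperparameters
# (random_state only fixes the seed so that runs are repeatable).
models = {
    "LR": LinearRegression(),
    "Lasso": LassoCV(),
    "Ridge": RidgeCV(),
    "EN": ElasticNetCV(),
    "KNN": KNeighborsRegressor(),
    "SVM": SVR(),
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
}

# Mean 5-fold cross-validated R^2 per model (higher is better).
cv_r2 = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

# Rank the candidates; the top few would move on to tuning.
ranking = sorted(cv_r2, key=cv_r2.get, reverse=True)
for name in ranking:
    print(f"{name:6s} R^2 = {cv_r2[name]:.3f}")
```

On real data, features usually need scaling before KNN and SVM are compared fairly; that step belongs to the preprocessing stage and is omitted here.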

### 9.1.2. Classification algorithms
```
 1) Logistic Regression (LR)        LogisticRegression()          - Linear
 2) SGD Classifier (SGD)            SGDClassifier()               - Linear
 3) K-Nearest Neighbors (KNN)       KNeighborsClassifier()        - Non-linear
 4) Support Vector Machines (SVM)   SVC()                         - Non-linear
 5) Gaussian Naive Bayes (NB)       GaussianNB()                  - Non-linear
 6) Decision Trees (DT)             DecisionTreeClassifier()      - Non-linear
 7) Random Forest (RF)              RandomForestClassifier()      - Ensemble - Bagging
 8) Gradient Boosting (GB)          GradientBoostingClassifier()  - Ensemble - Boosting
 9) AdaBoost                        AdaBoostClassifier()          - Ensemble - Boosting
10) Extreme Gradient Boosting (XGB) XGBClassifier()               - Ensemble - Boosting
```
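The two-step approach from section 9.1 (spot-check with defaults, then tune a short-listed model) might look like the sketch below for a classification problem. The synthetic dataset, the accuracy scorer, the choice of KNN as the short-listed model, and the parameter grid are all hypothetical and only serve to show the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): replace with the project's own X, y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Step 1: spot-check a few candidates with default parameters.
candidates = {
    "LR": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}
baseline = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
for name, score in sorted(baseline.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:4s} baseline accuracy = {score:.3f}")

# Step 2: tune one short-listed model (KNN here, purely as an example).
# The grid contains the defaults (n_neighbors=5, weights="uniform"),
# so the tuned CV score cannot be worse than the baseline.
param_grid = {"n_neighbors": [3, 5, 11, 21], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("best params:", search.best_params_)
print(f"tuned accuracy = {search.best_score_:.3f}")
```

In a real project each short-listed model gets its own grid, and the tuned models are then compared on the held-out test set.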

## 9.2. List of model evaluation metrics
**Performance/Evaluation Metric:**
* An evaluation metric is a way to quantify the performance of a predictive model
* Evaluation metric ≠ Loss function
* There is no "one-size-fits-all" evaluation metric
* Get to know your data
* Keep in mind the business objective of your ML problem

Select one or more metrics based on the problem type and business priorities. Commonly used metrics are:

![](figures/MLPG-ModelEvalMetrics.png)

**NOTE:**
* CV is the cross-validation score; for regression, the scorer can be any metric such as `R^2`, `MAE`, `MSE`, `RMSE`, or `RMSLE`
* For classification, the CV scorer can be any metric such as `Accuracy`, `ROC-AUC`, `PR-AUC`, or `Logloss`
* The following are not metrics, but they help to gain insight into the type of errors a model is making
  - `Confusion matrix`
  - `Classification report (produces Precision, Recall, F1 scores)`
* Algorithm run-time is also a metric
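As a rough sketch of the notes above, scikit-learn's `cross_validate` can report several scorers and the fit/score times in one call. The dataset, the model, and the short names used as dictionary keys are assumptions made for the example; the scorer strings themselves are scikit-learn's built-in names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption): replace with the project's own X, y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Keys are labels chosen for this example; values are scikit-learn scorer names.
scoring = {
    "accuracy": "accuracy",
    "roc_auc": "roc_auc",
    "pr_auc": "average_precision",
    "log_loss": "neg_log_loss",  # negated so that higher is always better
}

results = cross_validate(LogisticRegression(), X, y, cv=5, scoring=scoring)

# One score per fold and per scorer; report the mean across folds.
for name in scoring:
    print(f"{name:9s} mean CV score = {results['test_' + name].mean():.3f}")

# Run-time is collected alongside the scores.
print(f"mean fit time   = {results['fit_time'].mean():.4f} s")
print(f"mean score time = {results['score_time'].mean():.4f} s")
```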

### 9.2.1. Regression model evaluation metrics
* **R^2, MAE, MSE, RMSE,** and **RMSLE** can be calculated independently using sklearn
* **Cross-Validation** scores can be calculated using any of the metrics mentioned above as the scorer
* **Bias** and **Variance** are assessed by calculating the evaluation metrics separately for the Training dataset (_`X_train, y_train`_) and the Test dataset (_`X_test, y_test`_), using any of the 5 metrics mentioned above, and comparing the two. Poor scores on both datasets suggest high bias; a good training score with a much worse test score suggests high variance
  - **Bias Error (Underfitting)**: Bias is the error introduced by the simplifying assumptions a model makes to make the target function easier to learn
    - `Low Bias`: Suggests fewer assumptions about the form of the target function (flexible model)<br>
    _Algorithms include Decision Trees, k-Nearest Neighbors, and SVMs_
    - `High Bias`: Suggests more assumptions about the form of the target function (simple model)<br>
    _Algorithms include Linear Regression, Linear Discriminant Analysis, and Logistic Regression_
  - **Variance Error (Overfitting)**: Variance is the amount by which the estimate of the target function would change if different training data were used
    - `Low Variance`: Suggests small changes to the estimate of the target function with changes to the training dataset (simple model)<br>
    _Algorithms include Linear Regression, Linear Discriminant Analysis, and Logistic Regression_
    - `High Variance`: Suggests large changes to the estimate of the target function with changes to the training dataset (flexible model)<br>
    _Algorithms include Decision Trees, k-Nearest Neighbors, and SVMs_
  - **Bias-Variance Trade-Off**: The goal of any supervised machine learning algorithm is to achieve low bias and low variance, and with them good prediction performance. In practice, reducing one tends to increase the other
    - Linear ML algorithms often have a high bias but a low variance
    - Nonlinear ML algorithms often have a low bias but a high variance
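The train-versus-test comparison described above can be sketched as follows. The helper name `regression_metrics`, the synthetic quadratic data, and the two models chosen for contrast are assumptions for illustration. RMSLE is only computed when targets and predictions are non-negative, since it is undefined otherwise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, r2_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, MSE, RMSE and (when defined) RMSLE."""
    mse = mean_squared_error(y_true, y_pred)
    metrics = {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }
    # RMSLE needs non-negative targets and predictions.
    if np.min(y_true) >= 0 and np.min(y_pred) >= 0:
        metrics["RMSLE"] = np.sqrt(mean_squared_log_error(y_true, y_pred))
    return metrics


# Synthetic, deliberately non-linear data (assumption): y = x^2 + 5 + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + 5 + rng.normal(0, 1.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {"LR": LinearRegression(), "DT": DecisionTreeRegressor(random_state=0)}
report = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    report[name] = {
        "train": regression_metrics(y_train, model.predict(X_train)),
        "test": regression_metrics(y_test, model.predict(X_test)),
    }
    print(f"{name}: train R2 = {report[name]['train']['R2']:.3f}, "
          f"test R2 = {report[name]['test']['R2']:.3f}")
```

On this data the linear model should score poorly on both splits (a high-bias pattern), while the unpruned tree should fit the training split almost perfectly and drop on the test split (a high-variance pattern).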

* **Residual Analysis:**
  - A regression line is the line of best fit through a set of data points
  - A residual measures how far, vertically, the regression line misses a data point
  - A residual plot shows the residuals on the vertical axis and the independent variable on the horizontal axis; it is typically used to find problems with a regression
  - Some data sets are not good candidates for linear regression, including:
    - Heteroscedastic data (the spread of the points around the line varies widely)
    - Data that is non-linearly associated
    - Data sets with outliers
  - The residual analysis is done during model validation or just before selecting the final model
  - Residual equation: $\epsilon = y - \hat{y}$
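A minimal sketch of the residual calculation above, assuming a synthetic linear dataset and an ordinary least-squares fit. For a least-squares line with an intercept, the residuals average to zero and are uncorrelated with the independent variable, which is why any visible pattern in a residual plot signals a problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): y = 2x + 1 + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)

# Fit the regression line and compute residuals = y - y_hat.
model = LinearRegression().fit(x.reshape(-1, 1), y)
y_hat = model.predict(x.reshape(-1, 1))
residuals = y - y_hat

print(f"mean residual          = {residuals.mean():.2e}")
print(f"corr(x, residuals)     = {np.corrcoef(x, residuals)[0, 1]:.2e}")
print(f"largest |residual|     = {np.abs(residuals).max():.3f}")

# A residual plot is then a scatter of `residuals` (vertical axis)
# against `x` (horizontal axis), e.g. with matplotlib's plt.scatter.
```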

* **Normality Test (Q-Q Plot):**
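  - A Q-Q plot compares the quantiles of the data (or of the residuals) with those of a normal distribution; points that stray from the straight line help to identify outliers and skewness
  - This can be done during the Data Understanding stage to understand the raw data, and again during model validation or just before selecting the final model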
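One way to sketch the Q-Q check is with `scipy.stats.probplot`, which returns the theoretical and ordered sample quantiles together with a straight-line fit. The two synthetic samples below are assumptions chosen to contrast a roughly normal sample with a skewed one; the correlation `r` of the fit is used here only as a quick numeric summary of how straight the plot is.

```python
import numpy as np
from scipy import stats

# Two synthetic samples (assumption): one normal, one right-skewed.
rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(0.0, 1.0, size=500),
    "skewed": rng.exponential(1.0, size=500),
}

qq_fit = {}
for name, sample in samples.items():
    # probplot returns the Q-Q points and a least-squares line through them.
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
        sample, dist="norm"
    )
    qq_fit[name] = r
    print(f"{name:7s} Q-Q line correlation r = {r:.4f}")

# To draw the plot itself, pass a matplotlib module or axes:
#   stats.probplot(sample, dist="norm", plot=plt)
```

An `r` close to 1 means the points lie close to the reference line; the skewed sample should give a visibly lower value.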

### 9.2.2. Classification model evaluation metrics
**Threshold Metrics for Balanced / Imbalanced Classification**
```
Accuracy    = Correct Predictions / Total Predictions
Error       = Incorrect Predictions / Total Predictions       (complement of Accuracy)
Sensitivity = TruePositive / (TruePositive + FalseNegative)
Specificity = TrueNegative / (FalsePositive + TrueNegative)   (negative-class counterpart of Sensitivity)
G-Mean      = sqrt(Sensitivity * Specificity)                 (Geometric-mean that balances both)
Precision   = TruePositive / (TruePositive + FalsePositive)
Recall      = TruePositive / (TruePositive + FalseNegative)   (same as Sensitivity)
F-Measure   = (2 * Precision * Recall) / (Precision + Recall)
```
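The formulas above translate directly into code. The sketch below uses a hypothetical helper, `threshold_metrics`, that takes the four confusion-matrix counts; the example counts are made up, and the function assumes every denominator is non-zero.

```python
import math


def threshold_metrics(tp, fp, tn, fn):
    """Compute the threshold metrics from confusion-matrix counts.

    Assumes all denominators are non-zero.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # also Recall / True Positive Rate
    specificity = tn / (fp + tn)   # True Negative Rate
    precision = tp / (tp + fp)
    recall = sensitivity
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
        "precision": precision,
        "recall": recall,
        "f_measure": (2 * precision * recall) / (precision + recall),
    }


# Made-up counts for an imbalanced problem: 50 positives, 950 negatives.
metrics = threshold_metrics(tp=40, fp=10, tn=940, fn=10)
for name, value in metrics.items():
    print(f"{name:12s} = {value:.4f}")
```

With these counts accuracy is 0.98 even though one in five positives is missed (sensitivity 0.80), which is why accuracy alone can mislead on imbalanced data.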

**Ranking Metrics for Balanced / Imbalanced Classification**
```
TruePositiveRate  = TruePositive / (TruePositive + FalseNegative)
FalsePositiveRate = FalsePositive / (FalsePositive + TrueNegative)
ROC AUC           = ROC Area Under Curve
PR AUC            = Precision-Recall Area Under Curve
```
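A small sketch of the ranking metrics using scikit-learn. The four labels and scores are a toy example; `average_precision_score` is used here as one common summary of the precision-recall curve, which is an assumption about how PR AUC is to be computed (integrating the curve with the trapezoidal rule gives a slightly different number).

```python
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

# Toy example: true labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# TPR and FPR at each threshold trace out the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold = {th:.2f}  FPR = {f:.2f}  TPR = {t:.2f}")

roc_auc = roc_auc_score(y_true, y_score)           # area under the ROC curve
pr_auc = average_precision_score(y_true, y_score)  # summary of the PR curve

print(f"ROC AUC = {roc_auc:.4f}")
print(f"PR AUC (average precision) = {pr_auc:.4f}")
```

Here three of the four (positive, negative) pairs are ranked correctly, so ROC AUC is 0.75.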

**Probabilistic Metrics for Balanced / Imbalanced Classification**
```
LogLoss         = -((1 - y) * log(1 - yhat) + y * log(yhat))  (for binary classification)
LogLoss         = -(sum over c in C of y_c * log(yhat_c))     (for multi-class classification)
BrierScore      = 1/N * sum over i = 1..N of (yhat_i - y_i)^2
BrierSkillScore = 1 - (BrierScore / BrierScore_ref)
```
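The probabilistic metrics above can be sketched in a few lines of NumPy. The function names and the toy labels and probabilities are assumptions. The per-example log loss is averaged over all examples, probabilities are clipped to avoid `log(0)`, and the reference forecast for the skill score is taken to be the positive-class base rate.

```python
import numpy as np


def log_loss_binary(y, yhat, eps=1e-15):
    """Mean binary log loss; yhat is clipped to avoid log(0)."""
    y = np.asarray(y, dtype=float)
    yhat = np.clip(np.asarray(yhat, dtype=float), eps, 1 - eps)
    return float(np.mean(-((1 - y) * np.log(1 - yhat) + y * np.log(yhat))))


def brier_score(y, yhat):
    """Mean squared difference between predicted probability and outcome."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    return float(np.mean((yhat - y) ** 2))


def brier_skill_score(y, yhat, yhat_ref):
    """1 - BrierScore / BrierScore_ref; positive means better than reference."""
    return 1 - brier_score(y, yhat) / brier_score(y, yhat_ref)


# Toy example (assumption): four labels and predicted probabilities.
y = [0, 1, 1, 0]
yhat = [0.1, 0.9, 0.8, 0.3]

# Reference forecast: always predict the base rate of the positive class.
base_rate = float(np.mean(y))
yhat_ref = [base_rate] * len(y)

ll = log_loss_binary(y, yhat)
bs = brier_score(y, yhat)
bss = brier_skill_score(y, yhat, yhat_ref)
print(f"LogLoss = {ll:.4f}, BrierScore = {bs:.4f}, BrierSkillScore = {bss:.2f}")
```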

### 9.2.3. How to choose a Binary Classification model evaluation metric (for imbalanced datasets)
* Are you predicting probabilities? 
  - Do you need class labels? 
    - Is the positive class more important? 
      - Use **Precision-Recall AUC**
    - Are both classes important? 
      - Use **ROC AUC**
  - Do you need probabilities? 
    - Use **Brier Score** and **Brier Skill Score**
* Are you predicting class labels? 
  - Is the positive class more important? 
    - Are false negatives and false positives equally important? 
      - Use **F1-Measure**
    - Are false negatives more important? 
      - Use **F2-Measure**
    - Are false positives more important? 
      - Use **F0.5-Measure**
  - Are both classes important? 
    - Does the majority class make up less than 80%-90% of the examples? 
      - Use **Accuracy**
    - Does the majority class make up more than 80%-90% of the examples? 
      - Use **G-Mean**

![](figures/MLPG-BinaryClassModelEvalMetric.png)
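The decision tree above can be written as a small helper. The function `choose_metric`, its parameter names, and the `majority_cutoff` default of 0.8 (the lower end of the 80%-90% range) are all hypothetical. The `f_beta` helper uses the standard F-beta formula, of which F1, F2, and F0.5 are the special cases `beta` = 1, 2, and 0.5.

```python
def choose_metric(predict_probabilities, need_class_labels=True,
                  positive_class_more_important=True,
                  costlier_error="equal", majority_share=0.5,
                  majority_cutoff=0.8):
    """Walk the decision tree and return the suggested metric name.

    costlier_error: "equal", "false_negative" or "false_positive".
    majority_share: fraction of examples in the majority class.
    majority_cutoff: assumed threshold within the 80%-90% range.
    """
    if predict_probabilities:
        if not need_class_labels:
            return "Brier Score and Brier Skill Score"
        if positive_class_more_important:
            return "Precision-Recall AUC"
        return "ROC AUC"
    if positive_class_more_important:
        return {
            "equal": "F1-Measure",
            "false_negative": "F2-Measure",
            "false_positive": "F0.5-Measure",
        }[costlier_error]
    return "Accuracy" if majority_share < majority_cutoff else "G-Mean"


def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Example: class labels, positive class matters, false negatives cost more.
print(choose_metric(False, costlier_error="false_negative"))

# Same precision and recall scored with the three F-measures.
for beta in (1.0, 2.0, 0.5):
    print(f"F{beta:g} = {f_beta(0.5, 1.0, beta):.4f}")
```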

## 9.3. Deliverables from Stage-3
* List of selected algorithms to build models
* List of selected Evaluation metrics with reasons
* Results of the overall research
* Model architecture
