<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Whitepapers/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Stage-11: Model Deployment](17.00-mlpg-Stage-11-Model-Deployment.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [Other Considerations - Modeling](18.02-mlpg-Other-Considerations-Modeling.ipynb) ]>

# 18. Other Considerations

## 18.1. Machine Learning
### 18.1.1. No Free Lunch Theorem for Machine Learning
* The No Free Lunch Theorem, often abbreviated as NFL or NFLT, is a theoretical finding that suggests all optimization algorithms perform equally well when their performance is averaged over all possible objective functions
* It implies that there is no single best optimization algorithm. Because of the close [relationship between optimization, search, and machine learning](https://machinelearningmastery.com/applied-machine-learning-as-a-search-problem/), it also implies that there is no single best machine learning algorithm for predictive modeling problems such as classification and regression
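
The averaging argument can be checked by hand on a toy problem. The sketch below is an illustration of the learning analogue of the theorem, not a proof: the five-point domain, the train/test split, and the three hypothetical learners (`always_zero`, `majority`, `minority`) are all assumptions made for the example. It enumerates every possible binary labeling of the domain and averages each learner's accuracy on the unseen points.

```python
from itertools import product

DOMAIN = [0, 1, 2, 3, 4]   # five possible inputs (assumed toy domain)
TRAIN = [0, 1, 2]          # inputs whose labels the learner sees
TEST = [3, 4]              # unseen inputs used for scoring


def always_zero(train_labels):
    """Ignore the data and always predict class 0."""
    return 0


def majority(train_labels):
    """Predict the most common training label (ties go to 0)."""
    return int(2 * sum(train_labels) > len(train_labels))


def minority(train_labels):
    """Deliberately predict the least common training label."""
    return 1 - majority(train_labels)


def average_unseen_accuracy(learner):
    """Average accuracy on TEST over every possible labeling of DOMAIN."""
    scores = []
    for labels in product([0, 1], repeat=len(DOMAIN)):
        guess = learner([labels[i] for i in TRAIN])
        hits = sum(guess == labels[i] for i in TEST)
        scores.append(hits / len(TEST))
    return sum(scores) / len(scores)


results = {fn.__name__: average_unseen_accuracy(fn)
           for fn in (always_zero, majority, minority)}
print(results)
```

Every learner, including the deliberately perverse `minority` rule, averages 0.5 on the unseen points. An algorithm only gains an edge when its assumptions match the structure of the particular problem, which is why algorithm selection has to be problem-specific.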

### 18.1.2. Ensemble Learning
* An ensemble is a machine learning model that combines the predictions from two or more models
* Why should we consider using an ensemble?
  - **Performance**: An ensemble can make better predictions and achieve better performance than any single contributing model
  - **Robustness**: An ensemble reduces the spread or dispersion of the predictions and of model performance. In other words, it improves the **reliability** of the model’s average performance
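
The robustness point can be illustrated with a small simulation. This is a sketch under strong assumptions: each "model" is just the true value plus independent Gaussian error, and `TRUE_VALUE`, `N_MODELS`, and `N_TRIALS` are hypothetical settings. Real models usually make correlated errors, so the reduction in spread is typically smaller in practice.

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 10.0   # hypothetical quantity every model tries to predict
N_MODELS = 5        # members in the ensemble
N_TRIALS = 2000     # repeated experiments used to measure spread


def noisy_prediction():
    """One model's prediction: the truth plus independent Gaussian error."""
    return TRUE_VALUE + random.gauss(0.0, 1.0)


# Spread of a single model's predictions across trials
single = [noisy_prediction() for _ in range(N_TRIALS)]

# Spread of an ensemble that averages N_MODELS predictions per trial
ensemble = [statistics.mean(noisy_prediction() for _ in range(N_MODELS))
            for _ in range(N_TRIALS)]

single_spread = statistics.stdev(single)
ensemble_spread = statistics.stdev(ensemble)
print(f"single model spread: {single_spread:.3f}")
print(f"ensemble spread:     {ensemble_spread:.3f}")
```

With independent errors, averaging `N_MODELS` predictions shrinks the standard deviation by roughly a factor of the square root of `N_MODELS`.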

### 18.1.3. How Do Ensembles Work
**Intuition for Regression Ensembles:**<br>
<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-IntuitionRegEnsembles.png"><br>
<br><br><br><br><br>
Image credit [ (Source) ](https://machinelearningmastery.com/how-ensemble-learning-works/)

**Intuition for Classification Ensembles:**<br>
<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-IntuitionClassEnsembles.png"><br>
<br><br><br><br><br>
Image credit [ (Source) ](https://machinelearningmastery.com/how-ensemble-learning-works/)

### 18.1.4. Machine Learning Workflow
<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-MLWorkflow.png"><br>
<br><br><br><br><br>
Image credit [ (Source) ](https://www.coursera.org/in)

### 18.1.5. Types of Unsupervised Learning
* Unsupervised learning (USL) involves tasks that operate on datasets without labeled responses or target values
* Instead, the goal is to capture interesting structure or information
* **Applications** of unsupervised learning:
  - Visualize the structure of a complex dataset
  - Density estimation to predict probabilities of events
  - Compress and summarize data
  - Extract features for supervised learning
  - Discover important clusters or outliers
* **Two major types** of unsupervised learning methods:
  - **Transformations**: Processes that extract or compute information from the data
  - **Clustering**: Find groups in data and assign every point in the dataset to one of the groups

<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-USLMethodTransformations.png"><br>
<br><br><br><br>
Image credit [ (Source) ](https://www.coursera.org/in)

<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-USLMethodClustering.png"><br>
<br><br><br><br>
Image credit [ (Source) ](https://www.coursera.org/in)
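
The two method types can be sketched side by side with scikit-learn. This is a minimal example, assuming the Iris dataset as a stand-in and PCA and K-Means as representatives of a transformation and a clustering method; the choice of 2 components and 3 clusters is also an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Use only the features; the labels are deliberately ignored
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Transformation: compress 4 features into 2 components for visualization
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Clustering: assign every point to one of 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)

print("transformed shape:", X_2d.shape)
print("first ten cluster assignments:", cluster_ids[:10])
```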

### 18.1.6. Hard Clustering and Soft (or Fuzzy) Clustering
* **Hard clustering**: each data point belongs to exactly one cluster
* **Soft clustering**: each data point is assigned a weight/score/probability of membership for each cluster
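
The difference shows up directly in the shape of the output. The sketch below assumes synthetic blob data and uses K-Means as an example of hard clustering and a Gaussian mixture model as an example of soft clustering; other algorithms would serve equally well.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three overlapping groups (assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

# Hard clustering: one cluster id per point
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Soft clustering: one membership probability per point per cluster
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
soft_memberships = gmm.predict_proba(X)

print("hard label of first point:", hard_labels[0])
print("soft memberships of first point:", np.round(soft_memberships[0], 3))
```

Each row of `soft_memberships` sums to 1, so a point near a cluster boundary can be split between clusters instead of being forced into one.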

## 18.1.7. How to Choose ML Algorithms?
* No free lunch theorem:
  - “All algorithms perform equally well when their performances are averaged over all possible objective functions”
  - There is no single best algorithm for predictive modeling problems such as regression and classification
* Determining which algorithm to use depends on many factors from the ``type of problem`` at hand to the ``type of output`` we are looking for
* The algorithm selection depends on the following factors, to name a few:
  1. Business objective or Problem statement (both in business terms and analytical terms)
  2. Type of the data
  3. Size of the data
  4. Feature space (number of features)
  5. Linearity 
  6. Accuracy and/or Interpretability of the output
  7. Algorithm run-time or Training time
  8. Available computing power

### 18.1.7.1. Defining Business objective / Problem statement
* Define the problem statement both in business terms and analytical terms
* Use the following guideline to select the algorithms:
  - ``Predict numerical values`` such as forecast: ``REGRESSION`` problem
    - Linear – Low computing power, Fast training time<br>
      _LinearRegression, LASSO, Ridge, ElasticNet_
    - Non-linear – High computing power, Long training time, More memory<br>
      _KNN, SVM, Decision Trees, Neural Networks_
    - Ensembles (Bagging & Boosting) – Accurate, Fast training time, More memory <br>
      _Random Forest, Extreme Gradient Boosting (XGBoost)_
  - ``Predict categories/classes`` (both binary class & multi-class): ``CLASSIFICATION`` problem
    - Logistic Regression – Fast training time
    - Neural Network – Accurate, Long training time
    - Random Forest – Accurate, Fast training time, More memory
    - SVM – Long training time, More memory, use under 100 features
  - ``Discover structure``: ``CLUSTERING`` problem
    - K-Means
  - ``Find unusual occurrences``: ``Anomaly Detection``
    - One-class SVM – use under 100 features
    - Local Outlier Factor (neighbors-based technique)
    - Minimum Covariance Determinant 
    - Isolation Forest (ensemble technique)
  - ``Generate Recommendations``: ``Recommender System``
    - Content-based filtering (Supervised learning - Classification):
       - Recommends products that are similar to the ones that a user has liked in the past
    - Collaborative Filtering (Supervised learning - Classification):
       - Recommends products based on the behavior of other users with similar preferences
       - Simple and appropriate for small datasets
    - Clustering (Unsupervised Learning):
       - Appropriate for large datasets
    - Neural Networks
       - Deep learning approach used by YouTube
  - ``Information Extraction``: ``Natural Language Processing (NLP)``
    - NLTK (Natural Language Toolkit)
  - ``Networks and Graphs analysis``:
    - NetworkX
  - ``Geographic/Geospatial/Maps``:
    - Folium
  - ``Image and video processing``:
    - OpenCV
    - TensorFlow (deep learning platform)
    - Keras (Python API built on top of TensorFlow)
    - PyTorch (deep learning library)
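
As one worked instance of the guideline above, the "find unusual occurrences" case can be sketched with two of the listed detectors. The data is synthetic, and the `contamination=0.05` setting and the five injected outliers are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 200 ordinary points around the origin plus 5 injected far-away points
inliers = rng.normal(0, 1, size=(200, 2))
outliers = rng.uniform(6, 8, size=(5, 2))
X = np.vstack([inliers, outliers])  # the last five rows are the injected outliers

# Both detectors return +1 for inliers and -1 for suspected anomalies
iso_labels = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X)

print("Isolation Forest flagged:", int((iso_labels == -1).sum()), "points")
print("Local Outlier Factor flagged:", int((lof_labels == -1).sum()), "points")
print("injected rows flagged by Isolation Forest:", int((iso_labels[-5:] == -1).sum()))
```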

### 18.1.7.2. Type of data
* Datasets may contain numerical values, categorical values, text, images, audio, video, or a mixture of any of these
* The type(s) of the data play an important role in branching out the options for algorithm types such as Regression, Classification, Clustering, Text extraction, Maps, etc.

### 18.1.7.3. Size of the data
* It’s recommended to gather a good amount of data to get reliable predictions; in practice, however, data availability is often a constraint
* ``If the training data is small``, or it has few observations relative to the number of features, then ``choose algorithms with high bias/low variance like Linear Regression, Naïve Bayes, or Linear SVM``
* ``If the training data is sufficiently large`` and the number of observations is high compared to the number of features, then ``choose low bias/high variance algorithms like KNN, Decision Trees, or kernel SVM``
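
One way to test this guidance on a given problem is to cross-validate a high-bias and a low-bias model at different sample sizes. The sketch below uses synthetic data; the helper `compare_models`, the sample sizes (60 and 3000), and the three chosen models are assumptions, and results on real data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_models(n_samples, n_features=20, seed=0):
    """Return mean 5-fold CV accuracy per model on a synthetic dataset."""
    X, y = make_classification(n_samples=n_samples, n_features=n_features,
                               n_informative=5, random_state=seed)
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),  # high bias
        "knn": KNeighborsClassifier(n_neighbors=5),                # low bias
        "decision_tree": DecisionTreeClassifier(random_state=seed),  # low bias
    }
    return {name: cross_val_score(model, X, y, cv=5).mean()
            for name, model in models.items()}


small = compare_models(n_samples=60)
large = compare_models(n_samples=3000)
for name in small:
    print(f"{name}: small={small[name]:.3f}  large={large[name]:.3f}")
```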

### 18.1.7.4. Feature space (number of features)
* The dataset may have a large number of features that may not all be relevant and significant
* For certain types of data, such as genetic or text data, the number of features can be very large compared to the number of data points
* A large number of features can bog down some algorithms, making training time infeasibly long
* ``If the feature space is large and there are few observations``, then ``use Linear SVM``
* Use PCA and feature selection techniques to reduce dimensionality and select important features
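
The options above can be compared in a few lines with scikit-learn pipelines. This is a sketch on synthetic "wide" data (100 observations, 500 features); the component count of 10 and the `k=20` selected features are arbitrary assumptions that would normally be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Many more features than observations, only a few of them informative
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

pipelines = {
    "linear_svm_only": make_pipeline(
        StandardScaler(), LinearSVC(max_iter=10000)),
    "pca_then_svm": make_pipeline(
        StandardScaler(), PCA(n_components=10), LinearSVC(max_iter=10000)),
    "select_then_svm": make_pipeline(
        StandardScaler(), SelectKBest(f_classif, k=20), LinearSVC(max_iter=10000)),
}

# Reduction steps live inside the pipeline so they are refit on each CV fold
scores = {name: cross_val_score(pipe, X, y, cv=5).mean()
          for name, pipe in pipelines.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Keeping PCA and feature selection inside the pipeline avoids leaking information from the validation folds into the dimensionality-reduction step.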

### 18.1.7.5. Linearity
* Many algorithms assume that classes can be separated by a straight line (or, in higher dimensions, a hyperplane), e.g., Logistic Regression and linear SVMs
* Linear Regression algorithms assume that data trends follow a straight line. If the data is linear, these algorithms perform quite well
* However, data is not always linear, so we need other algorithms that can handle high-dimensional and complex data structures, e.g., kernel SVM, Random Forest, NNs
* ``A practical way to check for linearity is to fit a straight line (for regression) or run a Logistic Regression or Linear SVM (for classification) and check the residual errors. A high error suggests the data is not linear and may need more complex algorithms to fit``
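
The residual check described above can be sketched for the regression case. The two synthetic targets (one linear, one quadratic) and the helper `linear_fit_r2` are assumptions for illustration; R² is used here as a convenient summary of the residual error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
noise = rng.normal(0, 0.5, size=200)

y_linear = 3 * x.ravel() + 2 + noise   # trend follows a straight line
y_curved = x.ravel() ** 2 + noise      # trend is a parabola


def linear_fit_r2(X, y):
    """Fit a straight line and return R^2 (1.0 means no residual error)."""
    model = LinearRegression().fit(X, y)
    return model.score(X, y)


r2_linear = linear_fit_r2(x, y_linear)
r2_curved = linear_fit_r2(x, y_curved)
print(f"R^2 on linear data: {r2_linear:.3f}")
print(f"R^2 on curved data: {r2_curved:.3f}")
```

A straight line explains almost all of the variance in the first target and almost none in the second, which signals that the second needs a more flexible model or transformed features.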

### 18.1.7.6. Accuracy/Interpretability of the output
* ``Accuracy``: How close the response predicted by a model for a given observation is to the true response value for that observation
* ``Flexible models`` tend to give ``high accuracy`` but ``low interpretability`` (e.g., KNN with k=1)
* ``Restrictive models`` tend to give ``lower accuracy`` but ``high interpretability`` (e.g., linear regression)
* The choice of the algorithm depends on the objective of the business problem
* If ``inference is the goal``, then ``Restrictive models are better`` as they are much more interpretable
* ``If accuracy is the goal``, then ``Flexible models are better``
* In general, as the flexibility of a method increases, its interpretability decreases
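
The two examples named above, linear regression and KNN with k=1, can be contrasted on synthetic data. In this sketch the feature names and the true coefficients (2.0, -1.0, 0.5) are hypothetical, chosen so that the fitted linear model can be read back against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
feature_names = ["price", "advertising", "season"]  # hypothetical names
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_train, y_train)             # restrictive
knn = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)  # very flexible

# The restrictive model exposes one readable coefficient per feature
for name, coef in zip(feature_names, linear.coef_):
    print(f"{name}: {coef:+.2f}")

# The flexible model has no coefficients; it reproduces the training data exactly
print("KNN (k=1) train R^2:", knn.score(X_train, y_train))
print("KNN (k=1) test R^2: ", round(knn.score(X_test, y_test), 3))
print("Linear test R^2:    ", round(linear.score(X_test, y_test), 3))
```

The linear model's coefficients answer "how does each feature move the prediction", while the k=1 model offers no such summary. Its perfect training fit also shows why flexible models should always be judged on held-out data.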

**Trade-off between _``Accuracy``_ and _``Interpretability``_**<br>
<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-AccuracyVsInterpretability.png"><br>
<br><br><br><br><br>
Image credit [(Source)](https://cdn.oreillystatic.com/en/assets/1/event/105/Overcoming%20the%20Barriers%20to%20Production-Ready%20Machine-Learning%20Workflows%20Presentation%201.pdf)<br>

<br>**Trade-off between _``Flexibility``_ and _``Interpretability``_**<br>
<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-FlexibilityVsInterpretability.png"><br>
<br><br><br><br><br>
Image credit [(Source)](https://faculty.marshall.usc.edu/gareth-james/index.html)

### 18.1.7.7. Algorithm run-time/Training time
* Higher accuracy typically means longer training time
* Also, algorithms require more time to train on large training data
* In real-world applications, the choice of algorithm is driven predominantly by these two factors: accuracy and training time
* ``Short run-time and easy to implement but low accuracy: Naïve Bayes, Linear Regression, and Logistic Regression``
* ``Long run-time and not so easy to implement but high accuracy: Random Forest, NN, SVM``
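
Training time is easy to measure directly instead of relying on general rules. The sketch below times `fit` for four of the algorithms mentioned above on a synthetic dataset; the dataset size and default hyperparameters are assumptions, and absolute timings depend heavily on hardware and data.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "kernel_svm": SVC(kernel="rbf"),
}

fit_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds[name] = time.perf_counter() - start

# Print models from fastest to slowest on this machine and dataset
for name, seconds in sorted(fit_seconds.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s")
```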

### 18.1.7.8. Available computing power
* The choice of algorithm also depends on the available CPUs and GPUs
* If multi-core CPUs and GPUs are available, then deep learning can be a good option

<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-MLAlgorithmsCheatSheet1.png"><br>
<br><br><br><br><br><br><br><br>
Image credit [(Source)](https://blogs.sas.com/content/subconsciousmusings/2020/12/09/machine-learning-algorithm-use/)<br>

<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-MLAlgorithmsCheatSheet2.png"><br>
<br><br><br><br><br><br><br><br>
Image credit [(Source)](https://www.kdnuggets.com/2020/05/guide-choose-right-machine-learning-algorithm.html)<br>

<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-MLAlgorithmsCheatSheet3.png"><br>
<br><br><br><br><br><br><br><br>
Image credit [(Source)](https://www.kdnuggets.com/2020/05/guide-choose-right-machine-learning-algorithm.html)<br>

### 18.1.7.9. Rule of Thumb
* **``It's not who has the best algorithm that wins. It's who has the most data``** – (Andrew Ng)
* **``An inferior algorithm with enough data can outperform a superior algorithm with less data``** – (A. Ng)
* **``More data beats better algorithms but better data beats more data``** – (Bala)<br>
  **Step 1**: Define the problem in analytical terms from the business objectives<br>
  **Step 2**: Start with a simple algorithm, build a baseline model, benchmark it, and get familiar with the data<br>
  **Step 3**: Then try more complex algorithms and build more complex models to meet the business needs
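
The baseline-first steps can be sketched as a small benchmark. The breast cancer dataset bundled with scikit-learn is used as an assumed stand-in for a real business problem, and the three candidates are illustrative choices, not a prescribed sequence.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    # Trivial baseline: always predict the most frequent class
    "baseline_most_frequent": DummyClassifier(strategy="most_frequent"),
    # Simple model: fast to train and easy to interpret
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    # More complex model: tried only after the simple one is benchmarked
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A complex model earns its extra cost only if it beats the simple benchmark by a margin that matters for the business objective.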
