<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Other Considerations - Machine-Learning](18.01-mlpg-Other-Considerations-Machine-Learning.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [Other Considerations - Algorithms](18.03-mlpg-Other-Considerations-Algorithms.ipynb) ]>

# 18. Other Considerations

## 18.2. Modeling

### 18.2.1. Most commonly used model categories in ML
* **Predictive models** (Supervised Learning): 
  - Analyze past data to predict future outcomes
  - There are 3 types:
    - Regression models 
    - Classification models
    - Time-series models
* **Descriptive models** (Unsupervised Learning): 
  - Quantify the relationships in data in a way that is often used to classify data sets into groups
  - There are 2 types:
    - Clustering (e.g., K-Means)
    - Association rules (e.g., Apriori)
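
The two categories can be contrasted with a minimal sketch, assuming scikit-learn is available. The synthetic data, the choice of `LogisticRegression` as the predictive model, and all variable names are illustrative assumptions, not part of the book's text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score

# Synthetic data: three well-separated groups (illustrative only).
X, y = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)

# Predictive (supervised): the labels y are used during fitting.
clf = LogisticRegression(max_iter=1000).fit(X, y)
train_accuracy = clf.score(X, y)

# Descriptive (unsupervised): K-Means sees only X and groups the points itself.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Compare the discovered groups with the true groups (label names may differ).
agreement = adjusted_rand_score(y, cluster_ids)
print(train_accuracy, agreement)
```

The cluster numbers assigned by K-Means are arbitrary, which is why the comparison uses the adjusted Rand index rather than plain accuracy.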

### 18.2.2. The drawbacks of a linear model
* Sensitive to outliers
* Overfitting: likely when the number of features is high relative to the number of observations
* Assumes a linear relationship between the features and the target; a nonlinear relationship, if present, is modeled poorly
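
A small sketch of two of these drawbacks, using only NumPy. The data is synthetic and the helper `r2` is a hypothetical name introduced for illustration; the example fits least-squares lines with `np.polyfit` rather than any specific model from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y_true, y_pred):
    """Coefficient of determination (1 - SS_res / SS_tot)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Nonlinear relationship: y depends on x quadratically.
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(scale=0.1, size=x.size)

r2_linear = r2(y, np.polyval(np.polyfit(x, y, deg=1), x))     # straight line
r2_quadratic = r2(y, np.polyval(np.polyfit(x, y, deg=2), x))  # adds an x^2 term

# Outlier sensitivity: one extreme point shifts the least-squares slope.
x_lin = np.arange(10, dtype=float)
y_lin = 2.0 * x_lin + 1.0                  # exact line with slope 2
slope_clean = np.polyfit(x_lin, y_lin, deg=1)[0]

y_outlier = y_lin.copy()
y_outlier[-1] += 100.0                     # corrupt a single observation
slope_outlier = np.polyfit(x_lin, y_outlier, deg=1)[0]

print(r2_linear, r2_quadratic)
print(slope_clean, slope_outlier)
```

The straight line explains almost none of the quadratic signal, and a single corrupted point is enough to change the fitted slope substantially.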

### 18.2.3. Cross-Validation
* CV is a statistical method used to estimate the performance of ML models on unseen data
* CV is a resampling method that evaluates the model on out-of-sample data
* Used to protect against overfitting in a predictive model, in a case where the amount of data is limited 
* CV is performed on the training dataset (e.g., k-fold cv, stratified k-fold cv, etc.)
* CV is a method that goes beyond evaluating a single model using a single Train/Test split of the data; it uses multiple Train/Test splits, each of which is used to train and evaluate a separate model
* CV by itself is used to evaluate a model; it does not learn or tune a new model
* When applying Supervised learning methods, follow a consistent series of steps:
  1) _`Partition the data set`_ into training and test sets using the Train/Test split function
  2) _`Call the Fit Method`_ on the training set to estimate the model
  3) _`Apply the model by using the Predict Method`_ to estimate target values for new data instances, or use the Score Method to evaluate the trained model's performance on the test set
* To tune model parameters, use a technique such as "Grid Search", which evaluates candidate parameter settings with CV
* A note on performing CV for more advanced scenarios:
  - In some cases (e.g., when feature values have very different ranges), the training and test sets must be scaled/normalized before use with a classifier. When scaling is needed, do not scale the entire dataset with a single transform before CV: the scaling parameters would then be computed from all the data, including the held-out folds, which indirectly leaks information about the test data into training (data leakage). Instead, compute the scaling on the training portion of each CV fold and apply it to that fold's held-out portion. The easiest way to do this in scikit-learn is to use `pipelines`.
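
The steps and the pipeline advice above can be sketched as follows. This is a minimal example under stated assumptions: the breast-cancer dataset, the `LogisticRegression` classifier, the 5-fold setup, and the candidate values for `C` are all illustrative choices, not prescribed by the text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1) Partition the data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaler + classifier in one pipeline. cross_val_score clones the pipeline
# for each fold, so the scaler is fit only on that fold's training portion.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# CV on the training set only; the test set stays untouched.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv)

# 2) Call the fit method on the training set.
model.fit(X_train, y_train)

# 3) Predict on new data, or score on the test set.
y_pred = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)

# Grid search: each candidate C is evaluated with the same CV splits.
grid = GridSearchCV(
    model, param_grid={"logisticregression__C": [0.1, 1.0, 10.0]}, cv=cv
)
grid.fit(X_train, y_train)

print(cv_scores.mean(), test_accuracy, grid.best_params_)
```

The step name `logisticregression` in the parameter grid is the lowercase class name that `make_pipeline` assigns automatically.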

### 18.2.4. Differences between PCA and EFA
![](figures/MLPG-OC-DifferencesPCAEFA.png)

### 18.2.5. Dimensionality Reduction and PCA

**Dimensionality reduction (DR):**
* DR is the transformation of input features from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains meaningful properties of the original data
* DR refers to techniques that reduce the number of input variables/features in a dataset
* DR is a data preparation technique performed on data before modeling; it might be performed after data cleaning and data scaling and before training a predictive model
* `DR methods` include `PCA, feature selection/engineering, matrix factorization, autoencoders`, etc.
* The number of input variables/features for a dataset is referred to as its dimensionality
* Higher dimensionality may mean 100s, 1000s, or even millions of input variables
* More input features often make a predictive modeling task more challenging, a problem generally referred to as the `curse of dimensionality`
* There is no best technique for DR and no mapping of techniques to problems
* It is good practice to either normalize or standardize data before using these methods if the input variables have different scales or units
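
To illustrate why scaling matters before DR, here is a short sketch using PCA as the DR method. The wine dataset and the variable names are assumptions chosen for illustration; its features have very different units and ranges.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)  # 13 features on very different scales

# PCA on the raw data: the feature with the largest numeric range dominates.
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardizing each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", ratio_raw)
print("scaled:", ratio_scaled)
```

Without scaling, nearly all of the variance is attributed to the first component, which mostly reflects one large-valued feature rather than shared structure in the data.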

**Why do we need DR techniques** (or) **the motivations for using them:**
* **Data compression:**
  - Less data needs `less computer memory` and can `speed up the learning algorithm`
  - **Note**: DR techniques reduce the **`n`** (`number of features`); not **`m`** (`number of observations which will remain the same`)
* **Visualization:**
  - It’s not easy to visualize data that has more than 3 dimensions
  - Reduce the dimensions of the data to 3 or fewer to plot it
  - We need to find a few features that can `summarize` all other features
  - `Example`: 100s of features related to a country’s economic system may be combined into 1 feature and that can be called “Economic activity”
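
A brief sketch of both motivations, assuming scikit-learn's digits dataset and PCA as the DR method (both are illustrative choices). It shows that only the number of features changes, and that the 2-D result is small enough to plot.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)  # m observations, n = 64 features

# Reduce n from 64 to 2; the number of observations m is unchanged.
Z = PCA(n_components=2).fit_transform(X)

print(X.shape, "->", Z.shape)
print(X.nbytes, "bytes ->", Z.nbytes, "bytes")
```

The two columns of `Z` can be passed directly to a scatter plot, for example colored by `y`, to get a 2-D overview of the data.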

**Principal Component Analysis (PCA):**
* PCA is one of the most commonly used dimensionality reduction techniques
* It’s often used to speed up supervised learning (both regression and classification)
* Any DR (such as PCA) performed on a training set must also be performed on new data, such as a test set, a validation set, and the data used when making predictions with the final model. However, note:
  - Define the PCA reduction ($x^{(i)}$ to $z^{(i)}$) only on the training set; not on the CV or test sets
  - Apply the same mapping ($x^{(i)}$ to $z^{(i)}$) to the CV and test sets after it is defined on the training set
* **Rule of Thumb:**
  - `Do not assume we need to do PCA; try the full ML algorithm first, then use PCA if needed`
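
The points above can be sketched as follows. This is a minimal example, assuming the digits dataset, a `LogisticRegression` classifier, and a 95% retained-variance threshold; all of these are illustrative choices rather than recommendations from the text.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then reuse it for the test set.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Rule of thumb: try the full feature set first as a baseline.
baseline = LogisticRegression(max_iter=2000).fit(X_train_s, y_train)
accuracy_full = baseline.score(X_test_s, y_test)

# Define the x -> z mapping on the training set only
# (keep enough components to retain 95% of the variance).
pca = PCA(n_components=0.95, svd_solver="full").fit(X_train_s)

# Apply the same mapping to both sets; the test set is never used for fitting.
Z_train = pca.transform(X_train_s)
Z_test = pca.transform(X_test_s)

clf = LogisticRegression(max_iter=2000).fit(Z_train, y_train)
accuracy_pca = clf.score(Z_test, y_test)

print(pca.n_components_, accuracy_full, accuracy_pca)
```

Comparing `accuracy_full` with `accuracy_pca` is one way to decide whether the reduction is worth keeping for a given problem.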

**Purpose of PCA:**
* Often used as a dimensionality-reduction technique to increase the interpretability of the datasets
* Represents multivariate data as a smaller set of variables (summary indices) to observe trends, jumps, clusters, and outliers
* Particularly useful in processing data where multicollinearity exists between the features
* Used to explain the variance-covariance structure of a set of variables through linear combinations
* The projection of data in a smaller feature space (normally 2D) acts as an overview and may uncover the relationships between observations and variables, and among the variables
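
The link between PCA and the variance-covariance structure can be sketched with NumPy alone. The synthetic data below (two correlated features plus one independent feature) and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features driven by one latent factor, plus an unrelated third feature.
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data and compute the covariance matrix of the features.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Principal components are the eigenvectors of the covariance matrix,
# ordered by decreasing eigenvalue (the variance along each component).
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()

# Each new variable is a linear combination of the original features.
Z = Xc @ eigvecs[:, :2]  # 2-D projection for an overview plot

print(explained_ratio)
print(eigvecs[:, 0])  # weights of the first linear combination
```

The first component loads mainly on the two correlated features, which is how PCA summarizes collinear variables with a single index.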

**Bad use of PCA:**
* Trying to prevent overfitting
  - We may think that reducing the feature space with PCA would address overfitting
  - It might work sometimes, but is not recommended because PCA does not consider the target values $y$
  - It’s better to use regularization instead of PCA to address overfitting issues
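
As a hedged illustration of the regularization alternative, the sketch below compares ordinary least squares with ridge regression on synthetic data that has almost as many features as training observations. The data-generating setup, the `alpha` value, and the variable names are assumptions, not values from the book.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Few observations relative to features: a setting prone to overfitting.
n_train, n_test, n_features = 35, 1000, 30
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))

# Only the first two features actually influence the target.
true_coef = np.zeros(n_features)
true_coef[:2] = [3.0, -2.0]
y_train = X_train @ true_coef + rng.normal(scale=2.0, size=n_train)
y_test = X_test @ true_coef + rng.normal(scale=2.0, size=n_test)

# Unregularized fit versus an L2-regularized fit that uses y while shrinking.
ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)

ols_train_r2, ols_test_r2 = ols.score(X_train, y_train), ols.score(X_test, y_test)
ridge_train_r2, ridge_test_r2 = ridge.score(X_train, y_train), ridge.score(X_test, y_test)

print("OLS   train/test R^2:", ols_train_r2, ols_test_r2)
print("Ridge train/test R^2:", ridge_train_r2, ridge_test_r2)
```

Unlike PCA, the ridge penalty is applied while fitting against $y$, so the shrinkage is guided by the prediction task itself.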
