# Machine Learning for practicing data scientists

I will assume that most (if not all) of you have taken, or are concurrently taking, one or more classes that focus on the mathematics and algorithms behind ML. Here we focus mostly on the practical aspects of ML, which tend to be glossed over in more academic courses but are nonetheless useful for a practicing data scientist.

### Acknowledgments & Credits

This lesson is adapted from the excellent curriculum materials by Cliburn Chan (2021) at https://github.com/cliburn/bios-823-2021/ under the MIT License.

-----

## Understanding ML

- Model is learned from data, rather than pre-specified
    * No explicit instructions
    * No expert-constructed rules
- Algorithms that get better at performing a task by learning from data

### Learning modalities

#### Labeled vs unlabeled data

- Labeled data $\to$ supervised learning
- Unlabeled data $\to$ unsupervised learning
    * Could also be self-supervised learning
- Future reward $\to$ reinforcement learning

#### Structured vs unstructured data

- Structured data $\to$ tabular data
- Unstructured data just means non-tabular:
    * free text
    * images, video, audio
    * sequences (e.g., time series)
    * molecular sequences
- In the past, unstructured data was first converted to structured data by *feature engineering*; this has been upended by deep learning methods

-----

### ML model examples

#### Supervised learning models

##### Nearest neighbor
[![KNN visualization](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1531424125/KNN_final_a1mrv9.png)](https://medium.com/analytics-vidhya/different-types-of-machine-learning-algorithm-b4f76b5730fd)

##### Linear models
[![Polynomial regression visual](https://static.javatpoint.com/tutorial/machine-learning/images/machine-learning-polynomial-regression.png)](https://shishirkant.com/polynomial-regression-for-python/)

##### Support vector machines
[![SVM boundary visualization](https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/SVM_margin.png/300px-SVM_margin.png)](https://en.wikipedia.org/wiki/Support_vector_machine)

##### Trees
[![Visualization of Decision Tree](https://i1.wp.com/cdn-images-1.medium.com/max/1024/0*sE2yI-WvzJKNhdme.png?ssl=1&w=1600&resize=1600&ssl=1)](https://towardsai.net/p/programming/decision-trees-explained-with-a-practical-example-fe47872d3b53)

##### Neural networks
[![Neural Network visualization](https://ml-cheatsheet.readthedocs.io/en/latest/_images/dynamic_resizing_neural_network_4_obs.png)](https://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html)

##### Deep Neural Networks
[![Deep Neural Network visualization](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*UnHS6UvgzgFtBqVxcITHcQ.png)](https://medium.com/@anushkamittal/an-introduction-to-neural-networks-8f2cd1280ca9)

-----

## ML stages

[![Machine learning lifecycle](https://images.javatpoint.com/tutorial/machine-learning/images/machine-learning-life-cycle.png)](https://www.javatpoint.com/machine-learning-life-cycle)

### Data processing

We typically need to process the data for it to work with a broad class of ML models. For example:

- Categorical features need to be encoded as numbers
- Sequences of categorical features need to be encoded as vectors
- Unstructured data columns (natural language text, etc.) need to be encoded as vectors
- Missing data may need to be imputed
- Large variations in measurement scales need to be standardized

**Note:** To avoid data *leakage*, any preprocessing that has a *fit* stage should estimate parameters on *training* data only.

#### Category encoding

- Encoding without labels
- Encoding with labels
- Numeric encoding
- One-hot encoding
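
As a rough illustration of the last two options, the sketch below implements numeric (ordinal) and one-hot encoding in plain Python. The helper names (`fit_categories`, `ordinal_encode`, `one_hot_encode`) and the toy color column are made up for this example; in practice one would usually reach for a library encoder instead.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from the training values."""
    return sorted(set(values))


def ordinal_encode(values, categories):
    """Map each category to its integer index; unseen categories map to -1."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index.get(v, -1) for v in values]


def one_hot_encode(values, categories):
    """Map each value to a 0/1 indicator vector; unseen values become all zeros."""
    return [[1 if v == cat else 0 for cat in categories] for v in values]


train_colors = ["red", "green", "blue", "green"]
test_colors = ["blue", "purple"]  # "purple" never appears in the training data

categories = fit_categories(train_colors)  # fit on training data only
train_ordinal = ordinal_encode(train_colors, categories)
test_one_hot = one_hot_encode(test_colors, categories)

print(categories)
print(train_ordinal)
print(test_one_hot)
```

Numeric encoding is compact but imposes an artificial order on the categories, which some models will treat as meaningful. One-hot encoding avoids that at the cost of one column per category.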

#### Feature selection

- Uninformative variables
- Collinear or multi-collinear variables
- Dependent features
- Recursive feature elimination

In deep learning, feature selection is often done implicitly by the model during training. However, this requires regularization to prevent overfitting, and to incentivize the model to drop or downweight uninformative features.
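
For tabular models, a simple filter can already remove the first two kinds of problem features listed above. The following is a minimal sketch, assuming numeric columns stored in a dict; the function name `select_features`, the thresholds, and the toy data are all hypothetical choices for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(columns, min_variance=1e-12, max_abs_corr=0.95):
    """Greedy filter: drop (near-)constant columns, then any column that is
    highly correlated with a column that has already been kept."""
    kept = []
    for name, values in columns.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance <= min_variance:
            continue  # uninformative: (almost) no variation
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue  # nearly collinear with a feature we already keep
        kept.append(name)
    return kept


features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same information, other unit
    "site": [1.0, 1.0, 1.0, 1.0, 1.0],            # constant column
    "age": [34.0, 51.0, 29.0, 62.0, 45.0],
}
kept = select_features(features)
print(kept)
```

Pairwise correlation only catches collinearity between two features; a feature that is a linear combination of several others would pass this filter. As with any fitted preprocessing step, the filter should be computed on training data only.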

#### Shuffling

- Shuffling breaks up any order to the observations
- This is important for cross-validation and for stochastic gradient descent. If you haven't shown that there isn't an order to the data, assume that there is.
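
A minimal sketch of a reproducible shuffle follows. The key points are that features and labels are permuted with the same index order, and that a fixed seed makes the shuffle repeatable. The function name `shuffled_indices` and the toy data are assumptions made for this example.

```python
import random


def shuffled_indices(n_rows, seed=0):
    """Return a reproducible random permutation of the row indices."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return indices


# Toy data sorted by label, as often happens after a database export.
X = [[0.1], [0.2], [0.3], [0.9], [1.0], [1.1]]
y = [0, 0, 0, 1, 1, 1]

order = shuffled_indices(len(X), seed=42)
X_shuffled = [X[i] for i in order]  # apply the SAME order to features ...
y_shuffled = [y[i] for i in order]  # ... and to labels, so pairs stay aligned

print(order)
```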

#### Standardization

- Distance-based models may be sensitive to scale
- Convert features to have zero mean and unit standard deviation
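
The sketch below standardizes a single feature column while respecting the rule about fitting on training data only: the mean and standard deviation are estimated on the training values and then reused, unchanged, for the test values. The function names and numbers are illustrative assumptions, and the population standard deviation (dividing by `n`) is used.

```python
import math


def fit_standardizer(column):
    """Estimate mean and (population) standard deviation from training values."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        std = 1.0  # constant column: avoid division by zero
    return mean, std


def standardize(column, mean, std):
    """Apply previously fitted parameters to any column (train or test)."""
    return [(v - mean) / std for v in column]


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_standardizer(train)     # fit on training data only
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)   # reuse the training parameters

print(train_z)
print(test_z)
```

Note that the standardized test values need not have zero mean or unit variance. That is expected, and re-fitting on the test set to "fix" it would leak test information into the pipeline.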

-----

### Model training

#### Memorization and generalization: The bias-variance trade-off

- The entire point of any form of learning is generalization, not memorization
- Model capacity is the amount of information a model can store
- If model capacity $\gg$ data complexity, the model can minimize its training error by simply memorizing the data $\to$ over-fitting
- If model capacity $\ll$ data complexity, the model cannot capture the structure in the data $\to$ under-fitting

[![Bias-vs-variance tradeoff visualization](figs/Bias-vs-Variance-tradeoff.jpg)](https://stats.stackexchange.com/questions/543509/why-test-error-and-variance-has-different-curve-in-bias-variance-trade-off-graph)

[![img](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*3flBvsYv8dsRqX4ruYdjew.png)](https://medium.com/@cristianefragata/machine-learning-bias-and-variance-26b6ee572af)
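
To make the trade-off concrete, here is a small sketch that fits three models of increasing capacity to the same toy data: a constant (the training mean), a straight line, and a 1-nearest-neighbor model that memorizes every training point. The data are invented (roughly `y = 2x` plus a hand-picked noise pattern), and the model choices are assumptions for illustration only.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_mean(x, y):
    """Lowest capacity: ignore x and always predict the training mean."""
    mean_y = sum(y) / len(y)
    return lambda x_new: [mean_y for _ in x_new]


def fit_line(x, y):
    """Moderate capacity: ordinary least-squares straight line."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return lambda x_new: [intercept + slope * v for v in x_new]


def fit_1nn(x, y):
    """Highest capacity: memorize every training point (1-nearest neighbor)."""
    def predict(x_new):
        return [y[min(range(len(x)), key=lambda i: abs(x[i] - v))] for v in x_new]
    return predict


x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [0.5, 1.5, 4.5, 5.5, 8.5, 9.5]
x_test = [0.4, 1.4, 2.4, 3.4, 4.4]
y_test = [0.3, 3.3, 4.3, 7.3, 8.3]

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("1-NN", fit_1nn)]:
    model = fit(x_train, y_train)
    results[name] = (mse(y_train, model(x_train)), mse(y_test, model(x_test)))
    print(f"{name:>5}: train MSE = {results[name][0]:.3f}, test MSE = {results[name][1]:.3f}")
```

On this toy data the constant model has high error on both sets (under-fitting), the memorizing model has zero training error but a clearly worse test error (over-fitting), and the line has similar, low error on both.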

#### Tracking training and validation measures

[![Visualization of training and test loss over epochs](https://www.baeldung.com/wp-content/uploads/sites/4/2021/11/epoch-training-curve.png)](https://www.baeldung.com/cs/ml-underfitting-overfitting)

#### Remedies for over-fitting

- Synthetic data and data augmentation
- Pre-training
- Early stopping
- L1 and L2 regularization
- Model-specific parameters for controlling model complexity
- Dropout

#### Remedies for under-fitting

- Collect more data
- Increase model complexity
- Decrease regularization
- Change learning rate

#### Data leakage

- Data leakage is using information in the model training process that would not be expected to be available at prediction time.
    * When evaluating model performance on held-out test data, if data leakage occurred, some information about the test data has been "leaked" to the model training, making the test data not truly unseen data.
    * Causes the predictive scores (metrics) to overestimate the model's performance.
    * Can also result in choosing an inferior model, or suboptimal hyperparameters.

[![Fig. 1: Visualization of the lifecycle of an ML model f. From: Bernett et al. Guiding questions to avoid data leakage in biological machine learning applications. Nat Methods 21, 1444–1453 (2024)](figs/41592_2024_2362_Fig1_HTML.png)](https://www.nature.com/articles/s41592-024-02362-y/figures/1)

- Data leakage can be quite subtle and can occur in many ways:
    * Feature leakage: Including a feature in training that "leaks" information about the target variable (but wouldn't be available at prediction time)
    * Row-wise leakage:
        + Duplication of rows or items between training and test data.
        + Data pre-processing that uses information from the test set. For example, standardizing or normalizing the data _before_ splitting the data.
        + Groups of related rows (e.g., same patient, same gene, etc.) that end up in both training and test data.
- Importance of using a reproducible workflow
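
One way to guard against the group form of row-wise leakage is to split by group rather than by row. The following is a minimal sketch under that assumption; the function name `group_train_test_split` and the patient identifiers are invented for illustration, and library implementations of group-aware splitting offer more options.

```python
import random


def group_train_test_split(groups, test_fraction=0.25, seed=0):
    """Split row indices so that every group lands entirely in train or in test."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_test = max(1, round(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Several rows (e.g., repeated measurements) per patient.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p4"]
train_idx, test_idx = group_train_test_split(patient_ids, test_fraction=0.25, seed=1)

train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
print(sorted(train_patients), sorted(test_patients))
```

Because the split is made on groups, the test fraction is measured in groups, not rows, so the row-level split can be uneven when group sizes differ.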

Data leakage has been identified as a major issue in ML reproducibility. For example, Kapoor and Narayanan (2023) identify data leakage as the most common cause of the reproducibility crisis in ML-based science:
> [...] We also tested the reproducibility of ML in a specific field: predicting civil wars, where complex ML models were thought to outperform traditional statistical models. Interestingly, when we corrected for data leakage, the supposed superiority of ML models disappeared: they did not perform any better than older methods. [...]
>
> Kapoor S, Narayanan A. [Leakage and the reproducibility crisis in machine-learning-based science.](https://doi.org/10.1016/j.patter.2023.100804) Patterns (N Y). 2023 Aug 4;4(9):100804. PMID: 37720327; PMCID: PMC10499856.

[![Fig. 2: Schematic overview of the seven questions designed to reveal data leakage. From: Bernett et al. Guiding questions to avoid data leakage in biological machine learning applications. Nat Methods 21, 1444–1453 (2024)](figs/41592_2024_2362_Fig2_HTML.png)](https://www.nature.com/articles/s41592-024-02362-y/figures/2)

Fig. 2: Schematic overview of the seven questions designed to reveal data leakage. From:
> Bernett, J., Blumenthal, D.B., Grimm, D.G. et al. [Guiding questions to avoid data leakage in biological machine learning applications.](https://doi.org/10.1038/s41592-024-02362-y) Nat Methods 21, 1444–1453 (2024).

#### Imbalanced data

- Choice of evaluation metrics (e.g., [Kappa](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html))
- Weighting samples
- Majority under-sampling
- Minority over-sampling
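
The sketch below illustrates the first two points on a made-up 95/5 class split. It compares plain accuracy with Cohen's kappa for a classifier that always predicts the majority class, and computes per-class weights that are inversely proportional to class frequency. The function names are assumptions, and the weighting formula `n / (n_classes * count)` is one common convention.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        return 0.0  # degenerate single-class case; a convention chosen for this sketch
    return (p_observed - p_expected) / (1.0 - p_expected)


def balanced_class_weights(y):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    counts = Counter(y)
    return {c: len(y) / (len(counts) * count) for c, count in counts.items()}


y_true = [0] * 95 + [1] * 5
y_majority = [0] * 100  # a "classifier" that always predicts the majority class

acc = accuracy(y_true, y_majority)
kappa = cohen_kappa(y_true, y_majority)
weights = balanced_class_weights(y_true)
print(acc, kappa, weights)
```

The majority-class predictor reaches 95% accuracy yet has a kappa of zero, which is why accuracy alone is a poor guide on imbalanced data.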

#### Hyper-parameter tuning

- Role of cross-validation
- Grid search
- Auto-tuning
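
The first two items can be combined into a few lines: for every candidate hyper-parameter value, estimate the validation error by k-fold cross-validation, then pick the value with the lowest average error. The sketch below does this for the number of neighbors in a one-dimensional k-nearest-neighbor regressor. The model, the grid, the synthetic data, and all function names are assumptions chosen to keep the example self-contained.

```python
import random


def k_fold_indices(n_rows, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def knn_predict(x_train, y_train, x_new, k):
    """Average the targets of the k training points closest to x_new."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))[:k]
    return sum(y_train[i] for i in nearest) / len(nearest)


def cv_mse(x, y, k, n_folds=5, seed=0):
    """Mean validation MSE over the folds for a given number of neighbors k."""
    fold_errors = []
    for train_idx, val_idx in k_fold_indices(len(x), n_folds, seed):
        x_tr = [x[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        errors = [(y[i] - knn_predict(x_tr, y_tr, x[i], k)) ** 2 for i in val_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


def grid_search(x, y, grid, n_folds=5, seed=0):
    """Evaluate every candidate in the grid and return the best one with all scores."""
    scores = {k: cv_mse(x, y, k, n_folds, seed) for k in grid}
    return min(scores, key=scores.get), scores


rng = random.Random(0)
x = [i / 4 for i in range(40)]
y = [2 * v + rng.gauss(0, 1) for v in x]  # synthetic: linear trend plus noise

best_k, cv_scores = grid_search(x, y, grid=[1, 3, 5, 9])
print(best_k, cv_scores)
```

The same folds are reused for every candidate so the comparison is fair, and a held-out test set (not shown) should stay untouched until the final model has been chosen.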

-----

### Model evaluation

#### Unsupervised learning metrics

##### Dimension reduction

- Principal Component Analysis (PCA)
    * Explained variance
    * Scree plot
    * Loadings
    * Biplots
- t-Distributed Stochastic Neighbor Embedding ([t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding))
- Uniform Manifold Approximation and Projection ([UMAP](https://umap-learn.readthedocs.io/en/latest/))
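
To show what "explained variance" means for PCA, the sketch below works through the special case of just two features, where the eigenvalues of the 2x2 covariance matrix have a closed form. This restriction, the function name, and the toy data are assumptions made to avoid any dependencies; real analyses would use a linear algebra or ML library and usually standardize the features first.

```python
import math


def pca_explained_variance_2d(x, y):
    """Explained-variance ratios of the two principal components of (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (sxx + syy) / 2
    radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + radius, half_trace - radius]
    total_variance = sxx + syy
    return [ev / total_variance for ev in eigenvalues]


# Two strongly correlated features: almost all variance lies along one direction.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ratios = pca_explained_variance_2d(x, y)
print(ratios)
```

A scree plot is simply these ratios (or the eigenvalues) plotted against the component number.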

Breast cancer data visualized with PCA, t-SNE, and UMAP:
![Plots contrasting PCA, t-SNE, and UMAP](figs/PCA-tSNE-UMAP.png)

[MNIST](https://en.wikipedia.org/wiki/MNIST_database) data visualized with t-SNE:
[![t-SNE visualization of MNIST data](https://upload.wikimedia.org/wikipedia/commons/f/f1/T-SNE_Embedding_of_MNIST.png)](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding)

##### Information criteria for probabilistic models

- Negative log likelihood and deviance
- AIC
- BIC
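
Both criteria are simple functions of the maximized log-likelihood, the number of fitted parameters, and (for BIC) the number of observations. The sketch below uses the standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L; the log-likelihood values and parameter counts in the example are hypothetical numbers, not results from a fitted model.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical comparison: the complex model fits slightly better
# (higher log-likelihood) but uses many more parameters.
n_obs = 100
simple = {"log_likelihood": -120.0, "n_params": 3}
complex_ = {"log_likelihood": -118.5, "n_params": 8}

for name, m in [("simple", simple), ("complex", complex_)]:
    print(name,
          aic(m["log_likelihood"], m["n_params"]),
          bic(m["log_likelihood"], m["n_params"], n_obs))
```

In this made-up case both criteria prefer the simpler model, and BIC penalizes the extra parameters more heavily than AIC because ln(100) is larger than 2.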

#### Supervised learning metrics

##### Classification metrics

- Confusion matrix and binary scores
- ROC curve
- PR curve
- Cumulative gains curve
- Discrimination threshold
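
The sketch below ties the first and last items together: it builds the confusion-matrix counts for a binary classifier at a given discrimination threshold and derives the usual binary scores from them. The function names, labels, and scores are invented for illustration.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    """Count TP, FP, TN, FN when predicting 1 for scores >= threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1:
            counts["tp" if t == 1 else "fp"] += 1
        else:
            counts["fn" if t == 1 else "tn"] += 1
    return counts


def binary_scores(c):
    """Precision, recall (true positive rate), F1, and false positive rate."""
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    actual_neg = c["fp"] + c["tn"]
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = c["fp"] / actual_neg if actual_neg else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]

default_counts = confusion_counts(y_true, y_score)                 # threshold 0.5
low_counts = confusion_counts(y_true, y_score, threshold=0.35)     # more permissive

print(default_counts, binary_scores(default_counts))
print(low_counts, binary_scores(low_counts))
```

Sweeping the threshold over all observed scores and recording (FPR, recall) at each step traces out the ROC curve, while recording (recall, precision) gives the PR curve.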

##### Regression metrics

- Residual plot
- Prediction error plot
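
Both plots are built from the same ingredients: a residual plot shows residuals against predicted values, and a prediction error plot shows true values against predicted values. The sketch below computes those quantities together with a few common summary metrics; the function name and the numbers are assumptions for illustration, and the plotting itself is left out.

```python
import math


def regression_report(y_true, y_pred):
    """Residuals plus MAE, RMSE, and R^2 for a set of predictions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)                # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total variation
    return {
        "residuals": residuals,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

report = regression_report(y_true, y_pred)
print(report)
```

In a residual plot, any visible pattern in the residuals (a curve, a funnel shape) suggests structure the model has not captured.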
