# Overfit, Regularize, Fine-tune, & beat the baseline

In [1]:
%load_ext autoreload
%autoreload 2
%pdb on

Automatic pdb calling has been turned ON


## List candidate DL models

- 3 Key choices:
    - Last Layer activation
    - Loss function
    - Optimization Configuration
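
The three choices above can be sketched for a few common task types. This is a minimal illustration in plain PyTorch, not a prescribed recipe: the helper `make_head_and_loss`, the task names, and the layer sizes are all hypothetical. Note that in PyTorch the last-layer activation is often folded into the loss (sigmoid inside `BCEWithLogitsLoss`, softmax inside `CrossEntropyLoss`), so the head itself outputs raw logits.

```python
import torch
from torch import nn


def make_head_and_loss(task: str, in_features: int, num_classes: int = 1):
    """Hypothetical helper: pick an output layer and a matching loss for a task type."""
    if task == "binary":
        # Sigmoid is applied inside BCEWithLogitsLoss, so the head emits one raw logit.
        return nn.Linear(in_features, 1), nn.BCEWithLogitsLoss()
    if task == "multiclass":
        # Softmax is applied inside CrossEntropyLoss, so the head emits one logit per class.
        return nn.Linear(in_features, num_classes), nn.CrossEntropyLoss()
    if task == "regression":
        # No activation: the output is an unbounded real value.
        return nn.Linear(in_features, 1), nn.MSELoss()
    raise ValueError(f"unknown task: {task}")


# Choices 1 and 2: last layer and loss function.
head, loss_fn = make_head_and_loss("multiclass", in_features=16, num_classes=3)
# Choice 3: optimization configuration (optimizer and learning rate are assumed values).
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

logits = head(torch.randn(8, 16))        # batch of 8 samples, 16 features each
targets = torch.randint(0, 3, (8,))      # integer class labels in {0, 1, 2}
loss = loss_fn(logits, targets)
print(logits.shape, loss.item())
```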

## Load the whole dataset

## Start with a simple architecture

## Overfit a single batch
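
One way to run this check is to train repeatedly on the same fixed batch and confirm that the loss drops towards zero. If it does not, the model or training loop likely has a bug. The sketch below uses random toy data and an arbitrary small network; the sizes, optimizer, and step count are assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): one fixed batch of 16 samples, 10 features, 3 classes.
x = torch.randn(16, 10)
y = torch.randint(0, 3, (16,))

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

initial_loss = loss_fn(model(x), y).item()

# Train on the same batch over and over; a healthy setup should memorize it.
for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

final_loss = loss.item()
print(f"loss: {initial_loss:.3f} -> {final_loss:.4f}")
```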

## Find a good learning rate
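
A simple way to approach this step is a coarse sweep: train briefly with several candidate learning rates from the same initial weights and compare the resulting losses. This is a simplified stand-in for a full learning-rate range test, and the helper `lr_sweep`, the toy data, and the candidate values are all hypothetical.

```python
import copy
import math

import torch
from torch import nn

torch.manual_seed(0)

# Toy regression data and model (assumptions for illustration only).
x = torch.randn(256, 10)
y = torch.randn(256, 1)
base_model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()


def lr_sweep(model, lrs, steps_per_lr=20):
    """Return the last training loss for each candidate learning rate."""
    results = {}
    for lr in lrs:
        trial = copy.deepcopy(model)  # every candidate starts from the same weights
        opt = torch.optim.SGD(trial.parameters(), lr=lr)
        for _ in range(steps_per_lr):
            opt.zero_grad()
            loss = loss_fn(trial(x), y)
            loss.backward()
            opt.step()
        results[lr] = loss.item()
    return results


candidate_lrs = [1e-4, 1e-3, 1e-2, 1e-1]
losses = lr_sweep(base_model, candidate_lrs)

# Treat a diverged run (NaN or inf loss) as the worst possible outcome.
best_lr = min(
    losses,
    key=lambda lr: losses[lr] if math.isfinite(losses[lr]) else float("inf"),
)
print(losses, best_lr)
```

The sweep only indicates a promising order of magnitude; the chosen value is usually refined later during hyper-parameter tuning.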

## Add more components

- Add layers
- Make the layers bigger
- Train for more epochs
- Always monitor the training and validation loss

## Regularize the model

- Re-train with different hyper-parameters multiple times.
- Add dropout
- Try different architectures: add or remove layers.
- Add L1 and/or L2 regularization.
- Change hyper-parameters to find the optimal configuration.
    - units per layer
    - optimizer `lr`
    - `#layers`
- Use Data Augmentation
- Use pre-trained models for feature extraction or fine-tuning.
- Apply gradient clipping to avoid exploding gradients (especially for RNNs).
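
Several of the items above (dropout, L1/L2 regularization, gradient clipping) can be combined in a single training step. The following is a minimal sketch in plain PyTorch; the network, the penalty strengths `l1_lambda` and `weight_decay`, and the clipping threshold `max_grad_norm` are assumed values, not recommendations.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy batch (assumption): 32 samples, 10 features, 2 classes.
x = torch.randn(32, 10)
y = torch.randint(0, 2, (32,))

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropout: active in train() mode, disabled in eval() mode
    nn.Linear(64, 2),
)
loss_fn = nn.CrossEntropyLoss()

# weight_decay adds an L2 penalty on the parameters inside the SGD update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
l1_lambda = 1e-5      # hypothetical L1 strength
max_grad_norm = 1.0   # hypothetical clipping threshold

model.train()
optimizer.zero_grad()
loss = loss_fn(model(x), y)

# L1 penalty: sum of absolute parameter values, added to the loss by hand.
l1_penalty = sum(p.abs().sum() for p in model.parameters())
(loss + l1_lambda * l1_penalty).backward()

# Rescale gradients so that their global L2 norm is at most max_grad_norm.
# The return value is the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
clipped_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
).item()

optimizer.step()
print(f"grad norm before: {norm_before.item():.3f}, after: {clipped_norm:.3f}")
```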

## Beat the baseline
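
What counts as the baseline depends on the problem. As one illustration for classification, the sketch below uses a majority-class predictor as the baseline and compares a model's validation accuracy against it. The helper `majority_class_accuracy`, the labels, and the `model_acc` value are all hypothetical.

```python
from collections import Counter


def majority_class_accuracy(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return hits / len(eval_labels)


# Toy labels (assumptions for illustration only).
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
val_labels = [0, 1, 0, 0, 1]

baseline_acc = majority_class_accuracy(train_labels, val_labels)
model_acc = 0.80  # hypothetical validation accuracy of the trained model
beats_baseline = model_acc > baseline_acc
print(baseline_acc, beats_baseline)
```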

## Conduct error analysis & XAI on the erroneous predictions

Options:
- Plot a confusion matrix [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/20_plot_confusion_matrix.ipynb)]
- Compare multiple ROC curves in a single plot [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/21_plot_roc_curve.ipynb)]
- Use AUC to evaluate multiclass problems [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/32_multiclass_auc.ipynb)]
- Ensemble models using `VotingClassifier` / `VotingRegressor` [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/46_ensembling.ipynb), [example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/47_ensemble_tuning.ipynb)]
- Adapt this pattern to solve many Machine Learning problems [[pattern](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/50_simple_ml_pattern.ipynb)]
- Conduct feature importance analysis
- Test feature generalization
    - Coverage is the percentage of samples that have a value for this feature in the data.
    - Look into how feature distribution changes from train to test.
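
The two feature-generalization checks above (coverage and train-to-test distribution change) can be sketched with the standard library alone. The helpers `coverage` and `mean_shift`, the record layout, and the toy values are hypothetical; the difference in means is only one crude indicator of distribution change.

```python
from statistics import mean


def coverage(rows, feature):
    """Fraction of rows in which `feature` is present and not None."""
    present = sum(1 for row in rows if row.get(feature) is not None)
    return present / len(rows)


def mean_shift(train_rows, test_rows, feature):
    """Difference between the test mean and the train mean of a numeric feature."""
    train_vals = [r[feature] for r in train_rows if r.get(feature) is not None]
    test_vals = [r[feature] for r in test_rows if r.get(feature) is not None]
    return mean(test_vals) - mean(train_vals)


# Toy records (assumptions for illustration only).
train = [
    {"age": 30, "income": 50.0},
    {"age": 40, "income": None},
    {"age": 50, "income": 70.0},
    {"age": None, "income": 60.0},
]
test = [
    {"age": 60, "income": 55.0},
    {"age": 70, "income": 65.0},
]

age_coverage = coverage(train, "age")
age_shift = mean_shift(train, test, "age")
print(age_coverage, age_shift)
```

A feature with low coverage, or one whose distribution differs strongly between train and test, is a candidate explanation for errors found during error analysis.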

Think about...
- What data would a human use to avoid these errors?

## Optimize your implementation

- PyTorch Lightning tips
    - Use workers in DataLoaders: `num_workers = 4 * num_GPU`
    - Pin memory: `DataLoader(dataset, pin_memory=True)`
    - Avoid CPU ↔ GPU transfers. Bad: `.cpu()` / `.item()` / `.numpy()`. Good: `.detach()`
    - Create tensors directly on the GPU: `torch.rand(2, 2, device=torch.device('cuda:0'))`
        - For inner tensors: `t = torch.rand(2, 2, device=self.device)`
    - Use `DistributedDataParallel`, not `DataParallel`
    - Use 16-bit precision: `Trainer(precision=16)`
    - Profile your code: `Trainer(profiler="simple")`, or `profiler = AdvancedProfiler(); trainer = Trainer(profiler=profiler)`
- DL tips → When your NN is not working: [37 Reasons why your Neural Network is not working](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607)
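
Some of the data-loading and device tips above also apply outside Lightning. The sketch below shows them in plain PyTorch under stated assumptions: the toy dataset and the stand-in loss are hypothetical, and the code falls back to the CPU with zero workers when no GPU is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

num_gpus = torch.cuda.device_count()
device = torch.device("cuda:0" if num_gpus > 0 else "cpu")

# Toy dataset (assumption): 64 samples, 10 features, binary labels.
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))

loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=4 * num_gpus,   # heuristic from the list above; 0 on CPU-only machines
    pin_memory=num_gpus > 0,    # pinned host memory only helps host-to-GPU copies
)

# Create tensors directly on the target device instead of building them
# on the CPU and moving them afterwards.
noise = torch.rand(2, 2, device=device)

# Accumulate metrics as detached tensors on the device and convert to a
# Python number once, rather than calling .item() on every step.
running_loss = torch.zeros((), device=device)
for _ in range(3):
    step_loss = (torch.rand(2, 2, device=device, requires_grad=True) ** 2).mean()
    running_loss += step_loss.detach()
average_loss = (running_loss / 3).item()  # single device-to-host transfer

print(loader.num_workers, noise.device, average_loss)
```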

## Measure performance on the test set to estimate the generalization error
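
A minimal sketch of this final step in plain PyTorch is shown below. The `evaluate` helper, the toy model, and the random "test set" are hypothetical; the point is to switch the model to evaluation mode, disable gradient tracking, and touch the test set only once, after all tuning is finished.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()  # no gradients are needed for evaluation
def evaluate(model, loader, loss_fn):
    """Return (mean loss, accuracy) over all samples in `loader`."""
    model.eval()  # disables dropout and uses running batch-norm statistics
    total_loss, correct, count = 0.0, 0, 0
    for inputs, targets in loader:
        logits = model(inputs)
        # Weight each batch loss by its size so the mean is per sample.
        total_loss += loss_fn(logits, targets).item() * len(targets)
        correct += (logits.argmax(dim=1) == targets).sum().item()
        count += len(targets)
    return total_loss / count, correct / count


torch.manual_seed(0)

# Toy stand-ins (assumptions): an untrained model and a random held-out set.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
test_set = TensorDataset(torch.randn(40, 10), torch.randint(0, 3, (40,)))
test_loader = DataLoader(test_set, batch_size=16)

test_loss, test_acc = evaluate(model, test_loader, nn.CrossEntropyLoss())
print(f"test loss: {test_loss:.3f}, test accuracy: {test_acc:.2f}")
```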

---