# Week 6

## Project management


### 4. Log your evaluation results

You should regularly measure various metrics and keep records of them for future reference. These records help you compare different runs, and they also help you diagnose bugs early.

__What should you log?__ At a minimum, record your _loss_ and your _main evaluation metric_ for both the _training_ and the _validation_ set. You can also measure additional evaluation metrics (e.g. per-class precision and recall for classification) as well as memory requirements (how much RAM is used) and time requirements (how long one batch takes).
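A minimal sketch of such a log, assuming a plain CSV file with one main metric per split. The class `MetricsLogger`, its column names, and the `seconds_per_batch` helper are hypothetical illustrations, not part of any library. Memory usage is left out because measuring it depends on your framework (in PyTorch, for example, `torch.cuda.max_memory_allocated()` reports peak GPU memory).

```python
import csv
import time
from pathlib import Path


class MetricsLogger:
    """Appends one row of metrics per evaluation to a CSV file (hypothetical helper)."""

    FIELDS = ["step", "train_loss", "train_metric",
              "val_loss", "val_metric", "sec_per_batch"]

    def __init__(self, path):
        self.path = Path(path)
        # Write the header only for a new file, so a resumed run keeps appending.
        if not self.path.exists():
            with self.path.open("w", newline="") as f:
                csv.DictWriter(f, fieldnames=self.FIELDS).writeheader()

    def log(self, **values):
        # Missing columns are written as empty cells; unknown keys raise ValueError.
        with self.path.open("a", newline="") as f:
            csv.DictWriter(f, fieldnames=self.FIELDS).writerow(values)


def seconds_per_batch(run_batch, num_batches):
    """Average wall-clock time of `run_batch()` over `num_batches` calls."""
    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch()
    return (time.perf_counter() - start) / num_batches
```

Because the file is plain CSV, runs can later be compared side by side in a spreadsheet or loaded with pandas.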

__How often should you log?__ Most often you evaluate after each epoch. However, with a really big dataset even a single epoch can take several hours. In that case we usually want information about the training earlier and more often than once per epoch, so we simply log our metrics every X steps instead of at the end of each epoch.
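A minimal sketch of step-based logging, assuming hypothetical `train_step` and `evaluate` callables that wrap your own training and evaluation code; `log_every` plays the role of X:

```python
def train_with_periodic_logging(num_epochs, batches_per_epoch,
                                train_step, evaluate, log_every):
    """Runs training and evaluates every `log_every` optimizer steps.

    `train_step(batch_index)` performs one update; `evaluate()` returns a dict
    of metrics. Both are hypothetical stand-ins for your own code.
    """
    log = []
    global_step = 0
    for epoch in range(num_epochs):
        for batch_index in range(batches_per_epoch):
            train_step(batch_index)
            global_step += 1
            # Count steps globally so the schedule does not reset at each epoch.
            if global_step % log_every == 0:
                log.append({"epoch": epoch, "step": global_step, **evaluate()})
    return log
```

Counting steps globally keeps the logging interval uniform even when `log_every` does not divide the epoch length.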

__How do you calculate the training set metrics?__ Calculating the metrics over the whole training set is usually quite costly, because training sets are much bigger than validation sets. You can either:
1. calculate the metrics only on a subset of the training set, or
2. calculate the metrics as you train and aggregate them at the end of the epoch.
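Both options can be sketched in a few lines; `RunningAverage` and `fixed_train_subset` are hypothetical helpers. Option 2 assumes each batch reports the *mean* metric over that batch, so batches are weighted by their size. Keep in mind that option 2 mixes results from parameters that change during the epoch (and, for models with dropout, from training mode), so it usually looks somewhat worse than evaluating the final model would. For option 1, a fixed seed keeps the subset identical across evaluations, so the numbers stay comparable.

```python
import random


class RunningAverage:
    """Sample-weighted mean of a per-batch metric, accumulated during training (option 2)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_mean, batch_size):
        # batch_mean is averaged over the batch, so weight it by the batch size.
        self.total += batch_mean * batch_size
        self.count += batch_size

    @property
    def value(self):
        return self.total / self.count if self.count else float("nan")


def fixed_train_subset(num_train, subset_size, seed=0):
    """Indices of a fixed random subset of the training set (option 1)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_train), subset_size))
```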


### 5. Save your models

You should regularly save your model, i.e. its parameters, so you do not lose your progress. You can then restore the model at any time and continue training or run additional evaluation on it.

__How often should you save your model?__ You can simply save your model every time you evaluate it.

__How many snapshots should you keep?__ You can keep all of them, but for bigger models this can easily fill up your disk. In that case, keep at least your last snapshot and your best-performing snapshot.
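A minimal sketch of the keep-last-and-best policy, assuming the model state is a picklable dictionary and that a higher validation score is better by default. `CheckpointManager` and the `last.pkl`/`best.pkl` file names are hypothetical; with PyTorch you would typically pass `model.state_dict()` (usually together with the optimizer's `state_dict()`) and write it with `torch.save` instead of `pickle`.

```python
import pickle
from pathlib import Path


class CheckpointManager:
    """Keeps only the most recent and the best-scoring snapshot (hypothetical helper)."""

    def __init__(self, directory, higher_is_better=True):
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.higher_is_better = higher_is_better
        self.best_score = None

    def _is_better(self, score):
        if self.best_score is None:
            return True
        return score > self.best_score if self.higher_is_better else score < self.best_score

    def save(self, state, score):
        """Overwrites the last snapshot; also overwrites the best one if `score` improved."""
        self._write(self.dir / "last.pkl", state)
        if self._is_better(score):
            self.best_score = score
            self._write(self.dir / "best.pkl", state)
            return True
        return False

    @staticmethod
    def _write(path, state):
        # Write to a temporary file first, then rename it over the target,
        # so an interrupted write does not destroy the previous snapshot.
        tmp = path.with_suffix(".tmp")
        with tmp.open("wb") as f:
            pickle.dump(state, f)
        tmp.replace(path)

    def load(self, which="last"):
        with (self.dir / f"{which}.pkl").open("rb") as f:
            return pickle.load(f)
```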

### 6. Hyperparameter management

__Reminder about hyperparameter tuning:__ Random search is generally the safest bet; however, it might still be too expensive for the resources you have available (depending on how demanding your project is to train).
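A minimal random-search sketch, assuming a hypothetical `evaluate(config)` function that trains a model and returns a higher-is-better validation score. The search space below is only an illustration; sampling the learning rate log-uniformly is a common choice because plausible values span several orders of magnitude. The `num_trials` budget is what you would shrink if resources are tight.

```python
import math
import random

# Illustrative search space; names and ranges are assumptions, not recommendations.
SEARCH_SPACE = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "dropout": ("uniform", 0.0, 0.5),
    "batch_size": ("choice", [32, 64, 128]),
}


def sample_config(space, rng):
    """Draws one independent random value for every hyperparameter."""
    config = {}
    for name, (kind, *args) in space.items():
        if kind == "log_uniform":
            low, high = args
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        elif kind == "uniform":
            config[name] = rng.uniform(*args)
        elif kind == "choice":
            config[name] = rng.choice(args[0])
        else:
            raise ValueError(f"unknown distribution {kind!r}")
    return config


def random_search(space, evaluate, num_trials, seed=0):
    """Returns (best_score, best_config) over `num_trials` random configurations."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_trials):
        config = sample_config(space, rng)
        results.append((evaluate(config), config))
    return max(results, key=lambda r: r[0])
```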

### 7. Keep experiment notes

Try to keep notes about the experiments you run. Your results and hyperparameters are logged in various files, but you should also write down your findings and thoughts, e.g. when you find that some hyperparameter seems to be very sensitive or that some technique seems to benefit your experiment.

### 8. Early stopping

You should stop training when you detect that the run is no longer improving. This technique is called _early stopping_ and it can save you a lot of GPU time. You can stop training for the following reasons:

1. No significant progress was made in the previous X epochs.
2. Results are getting significantly worse.
3. The model performs worse than a baseline after a certain number of epochs. Some hyperparameters (such as learning rate or batch size) make training converge more slowly, so make sure this criterion does not unfairly penalize them.

We recommend this technique only after you get to know your model and how it behaves on the task. If you use early stopping too liberally, you can stop runs that could have achieved interesting results.
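The three criteria can be sketched as a small stateful check called after every evaluation. This is a minimal sketch, assuming a higher-is-better validation metric; `EarlyStopping` and all its parameter names are hypothetical. `patience` and `min_delta` implement criterion 1, `max_drop` criterion 2, and `baseline` with `baseline_epoch` criterion 3. Setting `baseline_epoch` generously is one way to avoid penalizing slowly converging hyperparameters. A NaN metric is treated as an immediate stop.

```python
import math


class EarlyStopping:
    """Decides when to stop a run, for a validation metric where higher is better."""

    def __init__(self, patience=5, min_delta=1e-3, max_drop=None,
                 baseline=None, baseline_epoch=None):
        self.patience = patience
        self.min_delta = min_delta
        self.max_drop = max_drop
        self.baseline = baseline
        self.baseline_epoch = baseline_epoch
        self.best = -math.inf
        self.evals_without_improvement = 0
        self.reason = None

    def step(self, epoch, score):
        """Call once per evaluation; returns True if training should stop."""
        if math.isnan(score):
            self.reason = "metric is NaN"
            return True
        if score > self.best + self.min_delta:
            self.best = score
            self.evals_without_improvement = 0
        else:
            self.evals_without_improvement += 1
        # 1. No significant progress in the last `patience` evaluations.
        if self.evals_without_improvement >= self.patience:
            self.reason = "no significant progress"
            return True
        # 2. Results are significantly worse than the best seen so far.
        if self.max_drop is not None and self.best - score > self.max_drop:
            self.reason = "results got significantly worse"
            return True
        # 3. Still below the baseline after the grace period.
        if (self.baseline is not None and self.baseline_epoch is not None
                and epoch >= self.baseline_epoch and self.best < self.baseline):
            self.reason = "below baseline"
            return True
        return False
```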

## Growing your model

- You have to design a model, choose an optimization algorithm, propose a data representation, pick an evaluation metric, choose some form of regularization, etc.
- The sheer number of decisions can be overwhelming. Even once you have made them, it is hard to tell what to change if you fail to train your model.

- Another problem with developing deep learning models is that they can fail silently.
- The fact that no exception was raised during training does not mean that the training was done correctly.
- You can feed the wrong data in the wrong format, fail to calculate your loss or minimize it properly, miscalculate your evaluation metrics, etc.

In this section we mention a few tricks that can help you with these problems.

### 10. From train to test

#### Starting with few batches
When I implement a model, before running a full-blown experiment, I like to make sure that it at least seems to be capable of learning. For this I try to __overfit it on a very small number (tens) of training samples.__ I also often use a small version of the model for this, i.e. relatively small hidden layer sizes etc. If the model is not able to achieve good performance or decrease the loss on a small number of samples, it won't work on the full training set either. This usually means that there is a logical error in the implementation that should be addressed. This preliminary test serves as a sanity check and can be done locally on your personal computer.
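A minimal, dependency-free sketch of this sanity check. `overfit_sanity_check` and `make_toy_regressor` are hypothetical; the toy linear model fitted with full-batch gradient descent on 20 noise-free samples only stands in for a real network, where `train_step` would run one optimizer update on the same small, fixed batch and return its loss. The threshold `target_loss` is an assumption you would tune to your loss scale.

```python
import random


def overfit_sanity_check(train_step, num_steps=500, target_loss=1e-3):
    """Calls `train_step()` repeatedly on the same tiny dataset.

    `train_step` must perform one update and return the loss before that update.
    Returns (passed, losses); passed is True if the loss fell below `target_loss`.
    """
    losses = [train_step() for _ in range(num_steps)]
    passed = losses[-1] < target_loss and losses[-1] < losses[0]
    return passed, losses


def make_toy_regressor(samples, lr=0.1):
    """Toy stand-in for a real model: y_hat = w * x + b, full-batch GD on MSE."""
    w, b = 0.0, 0.0

    def train_step():
        nonlocal w, b
        n = len(samples)
        errors = [(w * x + b) - y for x, y in samples]
        loss = sum(e * e for e in errors) / n
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, samples)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
        return loss

    return train_step


rng = random.Random(0)
tiny_set = [(x, 3 * x + 1) for x in (rng.uniform(-1, 1) for _ in range(20))]
passed, losses = overfit_sanity_check(make_toy_regressor(tiny_set))
```

If you deliberately break the update (for example by flipping the sign of the gradient), the check fails, which is exactly the kind of silent bug it is meant to catch.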

#### Training with full training set
After you establish that your model works, you can train it normally on your training set using GPUs. You should first worry about your __performance on the training set__. If you are not able to achieve good performance on your training data, you have no chance of achieving it on test data. Bad training set results might indicate that your model is too small (try making it bigger), that it is not trained properly (try changing hyperparameters), or even that you are not feeding it data correctly (try checking what comes in and what comes out). You can judge what counts as a good result by looking at what other people achieve with similar data or on similar tasks.
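One cheap way to check what comes in, sketched here for classification with a hypothetical `label_report` helper: look at the label distribution and at the accuracy of always predicting the most common label. A model whose training accuracy does not clearly beat that trivial number is most likely not learning anything useful.

```python
from collections import Counter


def label_report(labels):
    """Label distribution plus the accuracy of always predicting the majority label."""
    if not labels:
        raise ValueError("no labels given")
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    distribution = {label: n / len(labels) for label, n in counts.items()}
    return distribution, majority_label, majority_count / len(labels)
```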

#### Improving results on test set
After you make sure that you can fit the training data, you can start worrying about __how well your model generalizes to the test set__. If your test set performance lags significantly behind your training set performance, you are probably overfitting, i.e. memorizing the training data, so the knowledge you have learned does not transfer to previously unseen data. You can address this by using more training data, applying data augmentation to the existing data, or employing regularization techniques.

#### Summary

1. __Several training samples__ - Is the model working? Solve with debugging and code review.
2. __Training set__ - Does it have enough capacity? Solve by making the model bigger and tuning the optimization hyperparameters.
3. __Test set__ - Can it generalize well? Solve by adding more data (including data augmentation) or by using regularization.
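The three checks can be read as a simple decision procedure. This is a minimal sketch with a hypothetical `next_stage` function; scores are assumed higher-is-better and measured on held-out data, and `target_train_score` and `max_gap` are thresholds you would choose yourself, for example from results other people report on similar tasks.

```python
def next_stage(overfits_tiny_set, train_score, heldout_score,
               target_train_score, max_gap):
    """Maps the three checks from the summary to the next thing to work on."""
    if not overfits_tiny_set:
        return "debug"           # 1. the model cannot even fit a few samples
    if train_score < target_train_score:
        return "capacity"        # 2. bigger model or better optimization hyperparameters
    if train_score - heldout_score > max_gap:
        return "generalization"  # 3. more data, augmentation, or regularization
    return "done"
```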