# Good practices of NN/DL project design
## What to do and - more importantly perhaps - not to do

# Is my project right for Neural Networks?

* The thought process should not be: “I have some data, why don’t we try neural networks”
* But it should be: “Given the problem, does it make sense to use neural networks?”

    * Do I really need non-linear modelling?
    * What literature is out there for similar problems?
    * How much data will I be able to gather or put my hands on?
    * Are there datasets out there that I can re-use before I collect my data?



## Do I really need non-linear modelling?

* Sometimes linear methods perform just as well if not better
* Less risk of catastrophic overfitting
* Faster to code, optimize, run, debug
* Use linear modelling as a baseline before you move to non-linear methods
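
A minimal sketch of what such a baseline could look like, assuming NumPy is available and using synthetic data. The helper names `fit_linear_baseline` and `predict_linear_baseline` are hypothetical, and a least-squares classifier is only one of many possible linear models:

```python
import numpy as np


def fit_linear_baseline(X, y):
    """Fit a least-squares linear classifier; y must hold 0/1 labels."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    targets = 2.0 * y - 1.0                        # map {0, 1} to {-1, +1}
    w, *_ = np.linalg.lstsq(Xb, targets, rcond=None)
    return w


def predict_linear_baseline(w, X):
    """Predict class 1 where the linear score is positive."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w > 0).astype(int)


# Synthetic two-class data (an assumption made purely for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 10))
true_w = rng.normal(size=10)
y = (X @ true_w > 0).astype(int)

X_train, y_train = X[:1000], y[:1000]
X_val, y_val = X[1000:], y[1000:]

w = fit_linear_baseline(X_train, y_train)
baseline_acc = (predict_linear_baseline(w, X_val) == y_val).mean()
print(f"Linear baseline validation accuracy: {baseline_acc:.3f}")
```

Any non-linear model tried afterwards should beat this number on the same validation set to justify its extra complexity.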

## Real-life example

Drop-in question: "I tried deep learning on my data and it didn't perform better than this other simpler method"

* Classifying gene expression samples
* O(1000) features
* O(1000) samples
* 2 classes
* NN looked like this:

```python
from keras.layers import Dense
from keras.models import Sequential

model = Sequential()
model.add(Dense(1000, input_dim=5000))
model.add(Dense(500))
model.add(Dense(2, activation="softmax"))

model.summary()
```

```
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
dense_12 (Dense)             (None, 1000)              5001000
_________________________________________________________________
dense_13 (Dense)             (None, 500)               500500
_________________________________________________________________
dense_14 (Dense)             (None, 2)                 1002
Total params: 5,502,502
Trainable params: 5,502,502
Non-trainable params: 0
_________________________________________________________________
```


## Parameters (weights) vs. samples

* If the number of parameters is many times higher than the number of samples, a NN like the one above will most likely memorize the training set instead of generalizing
* Ideally, we are looking for the inverse: way more samples than parameters
* Some rules of thumb out there:
    * 10x as many labelled samples as there are weights
    * A few thousand samples per class
    * Just try it and downscale/regularize until you're not overfitting anymore (or until you have a linear model)
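
As a quick sanity check, the parameter count of a stack of fully connected layers can be computed by hand and compared with the number of samples. The sketch below reproduces the total reported for the model shown earlier; the helper name `count_dense_params` is hypothetical, and the sample count of 1000 is an assumed order of magnitude taken from the example:

```python
def count_dense_params(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Input width followed by the width of each Dense layer
layer_sizes = [5000, 1000, 500, 2]
n_params = count_dense_params(layer_sizes)

n_samples = 1000  # assumed order of magnitude, as in the example
params_per_sample = n_params / n_samples

print(f"{n_params:,} parameters, about {params_per_sample:,.0f} per sample")
```

With thousands of parameters per sample, this model is far from the "10x as many samples as weights" rule of thumb.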

## And even if I have enough data for a NN...

... is Deep Learning the right choice?

* The tasks where Deep Learning shines are those that require feature extraction:
    * Imaging -> edge/object detection
    * Audio/text -> sound/word/sentence detection
    * Protein structure prediction -> mutation patterns/local structure/global structure

* Deep Learning makes feature extraction automatic and seems to work best when there is a hierarchy to these features
* Is your data made that way? 
    * Does it have an order (spatial/temporal)? 
    * Are smaller patterns going to form higher-order patterns?
* All these different types of layers need to be there for a reason


<img src="figures/feature_extraction.png"></img>

source: [datarobot](https://www.datarobot.com/blog/a-primer-on-deep-learning/)

## And even when both these conditions have been met

... you need a few more things:

* Domain knowledge is not enough
* Sometimes people with NN/DL knowledge and no domain knowledge end up being the right ones for the job (see AlphaFold)
* You also need lots of patience and time: these things rarely work out of the box

## A few more things to keep in mind

* You need extensive knowledge of your data:
    * Split the data in a rigorous way to avoid introducing biases
    * Check for _information leakage_ before you get overly optimistic results
    * Make sure that there are no errors in your data

And therein lies the main issue:
* Some think that DL is about having a model magically fixing your data
* Instead, DL is _mostly_ about knowing your data

## Know your train/validation/test sets

* A _train set_ is a set of samples used to tune the NN weights
* A _validation set_ is a set of samples used to tune the NN hyperparameters:
    * Type of model (maybe not even a NN)
    * Number of layers
    * Number of neurons per layer
    * Type of layers
    * Optimizer
    * Validation set results are NOT the ones that will get published
    * This holds even if you cross-validate
* A _test set_ is a secluded set of samples that is used only once, to test the final model
    * Gives an idea of how well the model generalizes to unseen data (these are the results that go in the paper)
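
A minimal sketch of a three-way split using only the standard library. The function name `three_way_split` and the 70/15/15 proportions are assumptions made for illustration. This is a plain random split, so on its own it does nothing to keep similar samples out of different sets:

```python
import random


def three_way_split(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]                # touched only once, at the very end
    val = shuffled[n_test:n_test + n_val]   # used to tune hyperparameters
    train = shuffled[n_test + n_val:]       # used to fit the weights
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Doing the split once and saving the resulting sample lists to disk makes it harder to touch the test set by accident during model development.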

### Beware of similar samples across sets

<img src="figures/homer.png" width=500>

<br>
<br>
<img src="figures/guyincognito.png">
(2F08 “Fear of Flying”)

## Knowing what each set does is half the battle

Train, validation and test sets cannot be too similar to each other, or you will not be able to tell if the network is generalizing or just memorizing

* _How_ different they should be depends on what you're trying to achieve
* Come up with a similarity measure
* At the very least remove duplicate samples
* You would be surprised how often scientists mess this up
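
The bare minimum, removing duplicates across sets, could be sketched as follows. The helper name `drop_overlap` and the toy sequences are hypothetical. The `key` argument stands in for whatever normalization or similarity measure fits your data; here it only catches exact matches after upper-casing:

```python
def drop_overlap(train, held_out, key=lambda sample: sample):
    """Return held_out without the samples whose key also occurs in train."""
    seen = {key(sample) for sample in train}
    return [sample for sample in held_out if key(sample) not in seen]


# Toy sequences, invented for illustration
train = ["MKTAYIAK", "GSHMLEDP", "MKVLAAGI"]
validation = ["mktayiak", "PLQRSTVW", "GSHMLEDP"]

# Two validation samples duplicate training samples once case is ignored
clean_validation = drop_overlap(train, validation, key=str.upper)
print(clean_validation)
```

Exact-match filtering will not catch near-duplicates, so it is a floor rather than a substitute for a proper similarity measure.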



<img src="figures/andrewng.png">

<img src="figures/trainvalidationleak1.png">

<img src="figures/andrewng.png">

<img src="figures/trainvalidationleak.png">

# Sad ending :(
<img src="figures/trainvalidationleak2.png">

## Another example, protein structure prediction

* For some reason most researchers try to split train/validation/test by sequence similarity
* If two proteins have <25% identical amino acids, they are deemed different enough
* But protein families/superfamilies contain many proteins that share no detectable sequence similarity
* Sequence similarity is not the right metric!

<img src="figures/25percent.png">
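
One alternative to a sequence-identity threshold is to split by group, so that all members of a family or superfamily land in the same set. The sketch below assumes each protein already carries such a label (for instance from a structural classification database). The function name `split_by_group`, the protein identifiers and the family names are all hypothetical:

```python
import random
from collections import defaultdict


def split_by_group(sample_to_group, held_out_fraction=0.2, seed=0):
    """Assign whole groups to train or held-out, so no group spans both."""
    groups = defaultdict(list)
    for sample, group in sample_to_group.items():
        groups[group].append(sample)

    group_ids = sorted(groups)                 # sort first for reproducibility
    random.Random(seed).shuffle(group_ids)

    n_held_out = max(1, round(len(group_ids) * held_out_fraction))
    held_out_groups = group_ids[:n_held_out]
    train_groups = group_ids[n_held_out:]

    train = [s for g in train_groups for s in groups[g]]
    held_out = [s for g in held_out_groups for s in groups[g]]
    return train, held_out


# Hypothetical protein identifiers mapped to hypothetical family labels
sample_to_group = {
    "protA": "fam1", "protB": "fam1", "protC": "fam2", "protD": "fam3",
    "protE": "fam3", "protF": "fam4", "protG": "fam5",
}
train, held_out = split_by_group(sample_to_group)
print("train:", train)
print("held out:", held_out)
```

The fraction applies to groups, not samples, so set sizes can be uneven when families differ a lot in size.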

## Lab 1: splitting a protein sequence dataset (~1 h.)

Jupyter notebook:

session_goodPracticesDatasetDesign/lab_validation/rigorous_train_validation_splitting.ipynb

Two different strategies will be tested:
* Random split
* Split by alignment score

Which works best? Different groups test different networks on each strategy

## Neural Nets are very good at detecting patterns and they will use this against you

### (a.k.a. target leakage)

## Target leakage

* Making a predictor when you know the answers is not as easy as it seems
* You need to remove any revealing information that you would not have access to in a real scenario
* Classic example: predict yearly salary of employee
    * But one of the features is "monthly income"
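
A crude first check for this kind of leakage is to flag features that correlate almost perfectly with the target. The sketch below uses the salary example with invented numbers. The function names and the 0.95 threshold are assumptions, and a linear correlation will miss subtler, non-linear leaks:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)  # assumes neither input is constant


def suspicious_features(features, target, threshold=0.95):
    """Names of features whose absolute correlation with the target is too high."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= threshold]


# Invented numbers, for illustration only
yearly_salary = [36000, 48000, 60000, 84000, 120000]
features = {
    "monthly_income": [salary / 12 for salary in yearly_salary],
    "years_of_experience": [6, 1, 9, 3, 7],
}

flagged = suspicious_features(features, yearly_salary)
print(flagged)
```

A flagged feature is not automatically a leak, but it deserves a manual look before any training run.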

## Example: detecting COVID-19 from chest scans 
(https://www.datarobot.com/blog/identifying-leakage-in-computer-vision-on-medical-images/)

* COVIDx dataset
* Training set: 66 positive COVID results, 120 random non-COVID examples
* 2-class classifier based on ResNet50 Featurizer
* Perfect validation results! Great!


## Example: detecting COVID-19 from chest scans 

Inspecting dataset with image embeddings tells another story: can anyone tell what's wrong?

<img src="figures/covidchest.png">
(https://www.datarobot.com/blog/identifying-leakage-in-computer-vision-on-medical-images/)

## Example: detecting COVID-19 from chest scans 

Let's look at the activation maps in more detail
* Get final layer's output after activation (ReLU) and plot figure

<img src="figures/covidchest2.png">
(https://www.datarobot.com/blog/identifying-leakage-in-computer-vision-on-medical-images/)

## Lab 2: looking for target leakage in a text dataset (~1 h.)

Jupyter notebook:

session_goodPracticesDatasetDesign/lab_targetLeakage/investigating_target_leakage.ipynb

Visualize the layers of a NN for Natural Language Processing:

* Can you tell if there is target leakage of some kind?
* Propose solutions to curb the issue

## Your model is only as good as your data 

Reasons why one of my networks wouldn't work:

* Labels were wrong (label for amino acid n was assigned to amino acid n+1)
* The actual target sequence was missing from the multiple sequence alignment
* Inputs weren't correctly scaled/normalized
* Script to convert 3-letter amino acid codes to 1-letter codes (LYS -> K) didn't work as expected
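
How the original script failed is not stated here, so the following is only a sketch of a conversion step written defensively: it covers the 20 standard amino acids and raises an error on anything else rather than silently skipping or mis-mapping it. The function name `three_to_one` is hypothetical:

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(residues):
    """Convert 3-letter residue codes to a 1-letter sequence, failing loudly."""
    unknown = [r for r in residues if r.upper() not in THREE_TO_ONE]
    if unknown:
        # Refuse to guess: a silent fallback would corrupt the dataset
        raise ValueError(f"Unknown residue codes: {sorted(set(unknown))}")
    return "".join(THREE_TO_ONE[r.upper()] for r in residues)


sequence = three_to_one(["LYS", "ALA", "Gly"])
print(sequence)
```

Non-standard residues then surface as explicit errors that you handle deliberately, instead of turning into shifted or truncated sequences.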



<img src="figures/unknown.png?0">

## NNs are robust

They will "kind of" work even when some labels are incorrect, but it is going to be very tricky to tell whether something is wrong, and what

* Before training:
    * Plot data distributions
    * Test all data preparation scripts
    * Manually look at data files
    * Check labels for mistakes and class imbalance

* While training:    
    * Look at badly predicted samples
    * Be paranoid when something doesn't work well, even more when it works surprisingly well
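
Two of these checks are easy to automate. The sketch below, with hypothetical helper names and invented labels and probabilities, reports the class balance and lists the samples whose true class received the lowest predicted probability, which are the first ones worth inspecting by hand:

```python
from collections import Counter


def class_fractions(labels):
    """Fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def worst_predictions(true_labels, predicted_probs, k=3):
    """Indices of the k samples whose true class got the lowest probability."""
    scored = [(probs[label], index)
              for index, (label, probs)
              in enumerate(zip(true_labels, predicted_probs))]
    return [index for _, index in sorted(scored)[:k]]


# Invented labels and predicted class probabilities, for illustration only
labels = [0, 0, 0, 0, 1]
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3], [0.4, 0.6]]

fractions = class_fractions(labels)
worst = worst_predictions(labels, probs, k=2)
print(fractions, worst)
```

A badly predicted sample can point to a hard case or to a wrong label, and only looking at the raw data tells you which.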

<br>
<br>
<img src="figures/monk.jpg" width=400>

## Having the right data for NN/DL, but not enough of it: what now?

Main avenues:
* Find more of it
* Cut down insignificant features
* Make smaller models
* Generate artificial samples: Data augmentation
* Transfer learning (so find more data, again)
* Think outside the (black) box

## Lab 3: transfer learning in imaging data (~1 h.)

Jupyter notebook: 

session_goodPracticesDatasetDesign/lab_transferLearning/transfer_learning.ipynb

* Christophe's lab on cell classification:
    * We want to train a larger network
    * Use a network pre-trained on completely different data
    * Is it going to help?

## Tips and tricks on training your Neural Networks

* Fix training seeds for reproducibility
* Calculate the metric for a naïve baseline predictor
* Train on a small dataset first: can you make the network overfit?
* Can you make it overfit on the full dataset?
* Now scale the model back
* Change one thing at a time!
* Neural Networks are not necessarily black boxes: visualize outputs from different layers to see where the network is focusing
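
The first two tips can be sketched with the standard library alone. The seed value, the function name `majority_class_accuracy` and the labels are assumptions made for illustration. In a real project the same seed also has to be passed to NumPy and to the deep learning framework, each of which has its own random number generator:

```python
import random
from collections import Counter

SEED = 42  # arbitrary value; what matters is that it is fixed and recorded


def majority_class_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in eval_labels)
    return hits / len(eval_labels)


# Same seed, same "random" draw: the basis of a reproducible experiment
random.seed(SEED)
train_labels = random.choices(["healthy", "disease"], weights=[0.8, 0.2], k=200)
random.seed(SEED)
train_labels_again = random.choices(["healthy", "disease"], weights=[0.8, 0.2], k=200)

# Invented evaluation labels with the same 80/20 imbalance
eval_labels = ["healthy"] * 8 + ["disease"] * 2
baseline = majority_class_accuracy(train_labels, eval_labels)
print(f"Naive baseline accuracy: {baseline:.2f}")
```

On imbalanced data this naive predictor can already look respectable, so a network's accuracy only means something relative to it.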