# Tutorial and in-depth usage description for **compresstraj**

### This tutorial uses the data provided in the examples folder. All paths are relative to this working directory.

## protein only

### compression

#### The directory `protein` contains a PDB file and an XTC file for hen egg-white lysozyme (HEWL).
#### The system contains 1963 particles and the trajectory has 500 frames.
#### Using a GPU is recommended but not required.

In [1]:
# to view all the options and documentation of the script
!python ../scripts/compress.py -h

usage: compress.py [-h] -r REFFILE -t TRAJFILE -p PREFIX [-e EPOCHS] [-b BATCH] [-l LATENT] [-c COMPRESSION] [-sel SELECTION] [-gid GPUID]
                   [-ckpt CKPT] [--layers LAYERS] [-o OUTDIR]

Process input files and model files for compression.

options:
  -h, --help            show this help message and exit
  -r, --reffile REFFILE
                        Path to the reference file (pdb/gro)
  -t, --trajfile TRAJFILE
                        Path to the trajectory file (xtc/trr/dcd/xyz)
  -p, --prefix PREFIX   prefix to to the files to be generated.
  -e, --epochs EPOCHS   Number of epochs to train [default=200]
  -b, --batch BATCH     batch size [default=128]
  -l, --latent LATENT   Number of latent dims
  -c, --compression COMPRESSION
                        Extent of compression to achieve if latent dimension is not specified. [default = 20]
  -sel, --selection SELECTION
                        a list of selections. the training will treat each selection as a separated ent

### `-l` overrides the latent dimension derived from the `-c` flag.
### For example, if the number of particles is 200 and both `-l 32` and `-c 10` are passed,
### the code sets the latent dimension to 32 (from the `-l` flag) and
### not 20 (which would be the case if only `-c` were passed: `N/c = 200/10 = 20`).
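The precedence between `-l` and `-c` can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the code used by `compress.py`: the function name `resolve_latent_dim` is hypothetical, and rounding down when `N` is not divisible by `c` is an assumption.

```python
from typing import Optional


def resolve_latent_dim(n_particles: int,
                       latent: Optional[int] = None,
                       compression: int = 20) -> int:
    """Return the latent dimension implied by the -l and -c flags.

    Hypothetical helper: mirrors the rule described in the text,
    where an explicit -l value wins over the value derived from -c.
    """
    if latent is not None:
        # -l was passed explicitly, so -c is ignored.
        return latent
    # Assumption: N/c is rounded down and never drops below 1.
    return max(1, n_particles // compression)


# Both flags passed: -l takes precedence.
print(resolve_latent_dim(200, latent=32, compression=10))  # 32
# Only -c passed: N/c = 200/10.
print(resolve_latent_dim(200, compression=10))  # 20
```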
### For a detailed description of the selection syntax, see the MDAnalysis [selection documentation](https://docs.mdanalysis.org/stable/documentation_pages/selections.html).

In [2]:
# run the compression
!python ../scripts/compress.py \
-r protein/hewl.pdb \
-t protein/hewl.xtc \
-p hewl \
-e 1000 \
-b 256 \
-c 20 \
-sel "all" \
-o protein/test 

Using device: cuda:0.
saved COG in protein/test/hewl_cog.npy
scaler saved at protein/test/hewl_scaler.pkl
selected atom coordinates saved at protein/test/hewl_select.pdb
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type             | Params | Mode 
-----------------------------------------------------
0 | model   | DenseAutoEncoder | 57.3 M | train
1 | loss_fn | RMSDLoss         | 0      | train
-----------------------------------------------------
57.3 M    Trainable params
0         Non-trainable params
57.3 M    Total params
229.089   Total estimated model params size (MB)
19        Modules in train mode
0         Modules in eval mode
Epoch 999: 100%|█| 2/2 [00:00<00:00, 24.65it/s, v_num=4, avg_training_loss=0.046`Trainer.fit` stopped: `max_epochs=1000` reached.
Epoch 999: 100%|█| 2/2 [00:00<00:00,  2.09it/s, v_num=4, avg_training_loss=0.046
Final model sav

### View the generated files

In [3]:
!ls protein/test

hewl_cog.npy	       hewl_losses.pkl	hewl_select.pdb
hewl_compressed.pkl    hewl_model.pt	restart_hewl.ckpt
hewl_config.json       hewl_rmsd.txt	restart_hewl-v1.ckpt
hewl_decompressed.xtc  hewl_scaler.pkl	restart_hewl-v2.ckpt


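### The output directory is created if it does not already exist. Here the output files are saved to `protein/test` (set by the `-o` flag). The files written by the compression step are listed below with short descriptions. (`hewl_decompressed.xtc` and `hewl_rmsd.txt` in the listing above are produced by the decompression step.)

`hewl_cog.npy`: the center of geometry of each frame.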

`hewl_config.json`: the configuration used to run the script.

`hewl_model.pt`: the trained AE model.

`hewl_select.pdb`: PDB file of the selected atoms.

`hewl_compressed.pkl`: the compressed coordinates in pickle format.

`hewl_losses.pkl`: training and validation losses recorded during training.

`hewl_scaler.pkl`: the scaler object in pickle format.
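Before moving on to decompression, it can be useful to confirm that every file in the list above was actually written. The following is a minimal sketch using only the standard library; the helper name `missing_outputs` is hypothetical, and the suffix list is copied from the descriptions above rather than from the script itself.

```python
from pathlib import Path

# Suffixes taken from the file descriptions above (prefix "hewl" removed).
COMPRESSION_SUFFIXES = [
    "_cog.npy",
    "_config.json",
    "_model.pt",
    "_select.pdb",
    "_compressed.pkl",
    "_losses.pkl",
    "_scaler.pkl",
]


def missing_outputs(outdir, prefix):
    """Return the names of expected compression outputs absent from outdir."""
    outdir = Path(outdir)
    return [
        f"{prefix}{suffix}"
        for suffix in COMPRESSION_SUFFIXES
        if not (outdir / f"{prefix}{suffix}").is_file()
    ]


if __name__ == "__main__":
    # Same -o and -p values as in the compression command above.
    missing = missing_outputs("protein/test", "hewl")
    print("all outputs present" if not missing else f"missing: {missing}")
```

An empty list means the files needed by `decompress.py` (`-m`, `-s`, `-r`, `-c`, `-cog`) are all in place.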


### decompression

In [4]:
# to view all the options and documentation of the script
!python ../scripts/decompress.py -h

usage: decompress.py [-h] -m MODEL -s SCALER -r REFFILE [-t TRAJFILE] -c COMPRESSED -p PREFIX [-sel SELECTION] [-gid GPUID] [-o OUTDIR]
                     [-cog COG]

Process input parameters.

options:
  -h, --help            show this help message and exit
  -m, --model MODEL     Path to the model file.
  -s, --scaler SCALER   Path to the scaler file.
  -r, --reffile REFFILE
                        Path to the reference file.
  -t, --trajfile TRAJFILE
                        Path to the trajectory file 1 (xtc)
  -c, --compressed COMPRESSED
                        Path to the compressed file.
  -p, --prefix PREFIX   output file prefix.
  -sel, --selection SELECTION
                        a list of selections. the training will treat each selection as a separated entity.
  -gid, --gpuID GPUID   select GPU to use [default=0]
  -o, --outdir OUTDIR   output directory
  -cog, --cog COG       center of geometry.


In [6]:
# decompress
!python ../scripts/decompress.py \
-m protein/test/hewl_model.pt \
-s protein/test/hewl_scaler.pkl \
-r protein/test/hewl_select.pdb \
-c protein/test/hewl_compressed.pkl \
-cog protein/test/hewl_cog.npy \
-p hewl -sel "all" -o protein/test \
-t protein/hewl.xtc # optional: if provided, the RMSD between the original and
                    # decompressed trajectories is calculated and its range is reported

Using device: cuda:0.
Decompressing.....
decompressed trajectory saved to protein/test/hewl_decompressed.xtc.
calculating RMSD between original and decompressed trajectory.
Computing RMSD: 100%|███████████████████████| 500/500 [00:00<00:00, 3968.50it/s]
RMSD values saved in protein/test/hewl_rmsd.txt
RMSD = (0.03, 0.06) nm


#### The decompressed trajectory is written to `protein/test/hewl_decompressed.xtc`. The RMSD values, if calculated, are stored in `protein/test/hewl_rmsd.txt`.
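To make the reported RMSD range concrete, the sketch below computes a frame-by-frame RMSD between two coordinate arrays with NumPy and prints the minimum and maximum. It is an illustration under stated assumptions, not the routine used by `decompress.py`: the function name `framewise_rmsd` is hypothetical, the arrays are synthetic, and no superposition is performed before the comparison (the script may or may not align frames).

```python
import numpy as np


def framewise_rmsd(original, decompressed):
    """RMSD per frame between two arrays of shape (n_frames, n_atoms, 3).

    Assumption: frames are compared as-is, without any fitting.
    """
    original = np.asarray(original, dtype=float)
    decompressed = np.asarray(decompressed, dtype=float)
    if original.shape != decompressed.shape:
        raise ValueError("trajectories must have identical shapes")
    diff = original - decompressed
    # Squared displacement per atom, averaged over atoms, then square-rooted.
    return np.sqrt((diff ** 2).sum(axis=-1).mean(axis=-1))


# Synthetic stand-in for a trajectory: 5 frames, 10 atoms.
rng = np.random.default_rng(0)
original = rng.normal(size=(5, 10, 3))
# Mimic a reconstruction error by shifting every coordinate by 0.05.
decompressed = original + 0.05

rmsd = framewise_rmsd(original, decompressed)
print(f"RMSD = ({rmsd.min():.2f}, {rmsd.max():.2f})")
```

With real data, the two arrays would hold the coordinates of the same atom selection read from the original and decompressed trajectories.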