# Tutorial and in-depth usage description for **compresstraj**

### This tutorial uses the data provided in the examples folder. All paths are relative to this working directory.

## protein and ligand 

### compression

#### The directory `protein_ligand` contains a PDB file and an XTC file for Cytochrome P450 and camphor.
#### The system contains 6469 particles and the trajectory has 271 frames.
#### Using a GPU is recommended but not required.

In [1]:
# to view all the options and documentation of the script
!python ../scripts/compress.py -h

usage: compress.py [-h] -r REFFILE -t TRAJFILE -p PREFIX [-e EPOCHS] [-b BATCH] [-l LATENT] [-c COMPRESSION] [-sel SELECTION] [-gid GPUID]
                   [-ckpt CKPT] [--layers LAYERS] [-o OUTDIR]

Process input files and model files for compression.

options:
  -h, --help            show this help message and exit
  -r, --reffile REFFILE
                        Path to the reference file (pdb/gro)
  -t, --trajfile TRAJFILE
                        Path to the trajectory file (xtc/trr/dcd/xyz)
  -p, --prefix PREFIX   prefix to to the files to be generated.
  -e, --epochs EPOCHS   Number of epochs to train [default=200]
  -b, --batch BATCH     batch size [default=128]
  -l, --latent LATENT   Number of latent dims
  -c, --compression COMPRESSION
                        Extent of compression to achieve if latent dimension is not specified. [default = 20]
  -sel, --selection SELECTION
                        a list of selections. the training will treat each selection as a separated ent

### `-l` overrides the latent dimension computed from the `-c` flag.
### For example, if the number of particles is 200 and both `-l 32` and `-c 10` are passed,
### the code sets the latent dimension to 32 (from the `-l` flag) and
### not 20 (which would be the case if only `-c` were passed: `N/c = 200/10 = 20`).
### for a detailed description of the selection commands, please see the MDAnalysis [selection documentation](https://docs.mdanalysis.org/stable/documentation_pages/selections.html).
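
The precedence between `-l` and `-c` described above can be sketched as a small function. This is an illustration only, not the code in `compress.py`: the function name `resolve_latent_dim` is hypothetical, and the use of integer division (with a minimum of 1) is an assumption, since the text only gives the exact case `200/10 = 20`.

```python
from typing import Optional


def resolve_latent_dim(n_particles: int,
                       latent: Optional[int] = None,
                       compression: int = 20) -> int:
    """Return the latent dimension implied by the -l and -c flags (hypothetical helper)."""
    if latent is not None:
        # -l was given explicitly, so it takes precedence over -c.
        return latent
    # Assumption: integer division with a floor of 1; the script may round differently.
    return max(1, n_particles // compression)


print(resolve_latent_dim(200, latent=32, compression=10))  # 32
print(resolve_latent_dim(200, compression=10))             # 20
```

The default `compression=20` mirrors the default shown in the help text for `-c`.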

In [2]:
# run the compression
# we will decompose the system into its components: 
## -protein
!python ../scripts/compress.py \
-r protein_ligand/p450_cam.pdb \
-t protein_ligand/p450_cam.xtc \
-p prt \
-e 1000 \
-b 256 \
-c 20 \
-sel "(not resname HEM) and (not resname CAM)" \
-o protein_ligand/test 

## -HEME subunit
!python ../scripts/compress.py \
-r protein_ligand/p450_cam.pdb \
-t protein_ligand/p450_cam.xtc \
-p hem \
-e 1000 \
-b 256 \
-c 20 \
-sel "resname HEM" \
-o protein_ligand/test 

## -ligand (camphor)
!python ../scripts/compress.py \
-r protein_ligand/p450_cam.pdb \
-t protein_ligand/p450_cam.xtc \
-p cam \
-e 1000 \
-b 256 \
-c 20 \
-sel "resname CAM" \
-o protein_ligand/test 

Using device: cuda:0.
saved COG in protein_ligand/test/prt_cog.npy
scaler saved at protein_ligand/test/prt_scaler.pkl
selected atom coordinates saved at protein_ligand/test/prt_select.pdb
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type             | Params | Mode 
-----------------------------------------------------
0 | model   | DenseAutoEncoder | 166 M  | train
1 | loss_fn | RMSDLoss         | 0      | train
-----------------------------------------------------
166 M     Trainable params
0         Non-trainable params
166 M     Total params
667.687   Total estimated model params size (MB)
19        Modules in train mode
0         Modules in eval mode
Epoch 999: 100%|█| 2/2 [00:00<00:00,  9.52it/s, v_num=0, avg_training_loss=0.080
`Trainer.fit` stopped: `max_epochs=1000` reached.
Epoch 999: 100%|█| 2/2 [00:03<00:00,  0.60it/s, v_num=0, avg_training_loss=0.0
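
The three compression runs above differ only in the prefix and the selection string, so they can be generated from one table. The sketch below builds the same argument lists in Python and prints them. The helper `compress_command` and the `SUBUNITS` dictionary are names introduced here for illustration; the flags and values are copied from the commands above.

```python
import shlex

# Prefix -> MDAnalysis selection string, copied from the three runs above.
SUBUNITS = {
    "prt": "(not resname HEM) and (not resname CAM)",
    "hem": "resname HEM",
    "cam": "resname CAM",
}


def compress_command(prefix, selection,
                     ref="protein_ligand/p450_cam.pdb",
                     traj="protein_ligand/p450_cam.xtc",
                     outdir="protein_ligand/test"):
    """Build the argument list for one compress.py run (hypothetical helper)."""
    return [
        "python", "../scripts/compress.py",
        "-r", ref,
        "-t", traj,
        "-p", prefix,
        "-e", "1000",
        "-b", "256",
        "-c", "20",
        "-sel", selection,
        "-o", outdir,
    ]


for prefix, selection in SUBUNITS.items():
    cmd = compress_command(prefix, selection)
    # Print only; pass cmd to subprocess.run(cmd, check=True) to actually execute it.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with selections that contain spaces and parentheses.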

### view the files generated

In [3]:
!ls protein_ligand/test

cam_cog.npy	    cam_select.pdb	hem_scaler.pkl	    prt_model.pt
cam_compressed.pkl  hem_cog.npy		hem_select.pdb	    prt_scaler.pkl
cam_config.json     hem_compressed.pkl	prt_cog.npy	    prt_select.pdb
cam_losses.pkl	    hem_config.json	prt_compressed.pkl  restart_cam.ckpt
cam_model.pt	    hem_losses.pkl	prt_config.json     restart_hem.ckpt
cam_scaler.pkl	    hem_model.pt	prt_losses.pkl	    restart_prt.ckpt


### A new folder is created if the output path does not exist. In this case the output files are saved to `protein_ligand/test` (from the `-o` flag). They are listed below with short descriptions.
Here `<subunit>` is one of the prefixes `prt`, `hem` or `cam`.

`<subunit>_cog.npy`: the center of geometry of each frame.

`<subunit>_config.json`: the configuration used to run the script.

`<subunit>_model.pt`: the trained AE model.

`<subunit>_select.pdb`: PDB of the selected atoms.

`<subunit>_compressed.pkl`: the compressed coordinates in pickle format.

`<subunit>_losses.pkl`: training and validation losses recorded during training.

`<subunit>_scaler.pkl`: the scaler object in pickle format.
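
Before moving on to decompression it can be useful to confirm that every file described above exists for each subunit. The following is a minimal sketch using only the naming pattern from this list; the helpers `expected_outputs` and `missing_outputs` are hypothetical and not part of compresstraj.

```python
from pathlib import Path

# Suffixes taken from the file descriptions above.
SUFFIXES = [
    "cog.npy", "config.json", "model.pt", "select.pdb",
    "compressed.pkl", "losses.pkl", "scaler.pkl",
]


def expected_outputs(prefix, outdir):
    """List the files a compression run with this prefix should have written."""
    return [Path(outdir) / f"{prefix}_{suffix}" for suffix in SUFFIXES]


def missing_outputs(prefixes, outdir):
    """Return the expected files that are not present on disk."""
    return [
        path
        for prefix in prefixes
        for path in expected_outputs(prefix, outdir)
        if not path.exists()
    ]


missing = missing_outputs(["prt", "hem", "cam"], "protein_ligand/test")
print(f"{len(missing)} expected file(s) missing")
for path in missing:
    print("  ", path)
```

The directory listing also shows `restart_<subunit>.ckpt` files. They are not described in the list above, so the sketch does not check for them.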


### decompression

In [4]:
# to view all the options and documentation of the script
!python ../scripts/decompress.py -h

usage: decompress.py [-h] -m MODEL -s SCALER -r REFFILE [-t TRAJFILE] -c COMPRESSED -p PREFIX [-sel SELECTION] [-gid GPUID] [-o OUTDIR]
                     [-cog COG]

Process input parameters.

options:
  -h, --help            show this help message and exit
  -m, --model MODEL     Path to the model file.
  -s, --scaler SCALER   Path to the scaler file.
  -r, --reffile REFFILE
                        Path to the reference file.
  -t, --trajfile TRAJFILE
                        Path to the trajectory file 1 (xtc)
  -c, --compressed COMPRESSED
                        Path to the compressed file.
  -p, --prefix PREFIX   output file prefix.
  -sel, --selection SELECTION
                        a list of selections. the training will treat each selection as a separated entity.
  -gid, --gpuID GPUID   select GPU to use [default=0]
  -o, --outdir OUTDIR   output directory
  -cog, --cog COG       center of geometry.


In [5]:
# decompress
## prt
!python ../scripts/decompress.py \
-m protein_ligand/test/prt_model.pt \
-s protein_ligand/test/prt_scaler.pkl \
-r protein_ligand/test/prt_select.pdb \
-c protein_ligand/test/prt_compressed.pkl \
-cog protein_ligand/test/prt_cog.npy \
-p prt -sel "(not resname HEM) and (not resname CAM)" \
-o protein_ligand/test

## heme subunit
!python ../scripts/decompress.py \
-m protein_ligand/test/hem_model.pt \
-s protein_ligand/test/hem_scaler.pkl \
-r protein_ligand/test/hem_select.pdb \
-c protein_ligand/test/hem_compressed.pkl \
-cog protein_ligand/test/hem_cog.npy \
-p hem -sel "resname HEM" \
-o protein_ligand/test


## ligand (camphor)
!python ../scripts/decompress.py \
-m protein_ligand/test/cam_model.pt \
-s protein_ligand/test/cam_scaler.pkl \
-r protein_ligand/test/cam_select.pdb \
-c protein_ligand/test/cam_compressed.pkl \
-cog protein_ligand/test/cam_cog.npy \
-p cam \
-sel "resname CAM" \
-o protein_ligand/test

Using device: cuda:0.
Decompressing.....
decompressed trajectory saved to protein_ligand/test/prt_decompressed.xtc.
Using device: cuda:0.
Decompressing.....
decompressed trajectory saved to protein_ligand/test/hem_decompressed.xtc.
Using device: cuda:0.
Decompressing.....
decompressed trajectory saved to protein_ligand/test/cam_decompressed.xtc.
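
Every input to `decompress.py` in the runs above follows from the prefix and the output directory of the compression step. Note that `-r` points to `<prefix>_select.pdb`, not to the original PDB. The sketch below derives the argument list from those two values; `decompress_command` is a hypothetical helper written for this tutorial, and only the flags shown in the commands above are used.

```python
from pathlib import Path


def decompress_command(prefix, selection, outdir="protein_ligand/test"):
    """Build the decompress.py argument list from the compression outputs."""
    out = Path(outdir)

    def artifact(suffix):
        # Files written by compress.py are named <prefix>_<suffix>.
        return (out / f"{prefix}_{suffix}").as_posix()

    return [
        "python", "../scripts/decompress.py",
        "-m", artifact("model.pt"),
        "-s", artifact("scaler.pkl"),
        "-r", artifact("select.pdb"),
        "-c", artifact("compressed.pkl"),
        "-cog", artifact("cog.npy"),
        "-p", prefix,
        "-sel", selection,
        "-o", out.as_posix(),
    ]


print(" ".join(decompress_command("cam", "resname CAM")))
```

The printed line is for inspection only, because the selection is not quoted for the shell. To execute the command, pass the list itself to `subprocess.run`.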


### The fragments of the original trajectory have now been decompressed separately. They must be recomposed into the full system.

In [6]:
# recompose.py documentation
!python ../scripts/recompose.py -h

usage: recompose.py [-h] -r REFFILE [-o OUTDIR] -c CONFIG [-t TRAJFILE] [-sel SELECTION]

Recomposed fragmented decompressed trajectories.

options:
  -h, --help            show this help message and exit
  -r, --reffile REFFILE
                        Path to the reference file (pdb/gro)
  -o, --outdir OUTDIR   output directory
  -c, --config CONFIG   config JSON
  -t, --trajfile TRAJFILE
                        Path to the trajectory file 1 (xtc)
  -sel, --selection SELECTION
                        atom selections.


In [7]:
# use recompose.py with a config JSON (here recompose-tutorial.json) to join the subunits.
# the subunits must be listed in the config JSON in the same order
# in which they appear in the original PDB.
!python ../scripts/recompose.py \
-r protein_ligand/p450_cam.pdb \
-o protein_ligand/test \
-c protein_ligand/recompose-tutorial.json \
-sel "all" \
-t protein_ligand/p450_cam.xtc # optional

100%|███████████████████████████████████████| 271/271 [00:00<00:00, 1974.65it/s]
recomposed trajectory written to protein_ligand/test/p450_cam_decompressed.xtc
the respective pdb has been written to protein_ligand/test/p450_cam_select.pdb
calculating RMSD between original and decompressed trajectory.
Computing RMSD: 100%|███████████████████████| 271/271 [00:00<00:00, 1565.92it/s]
RMSD values saved in protein_ligand/test/p450_cam_rmsd.txt
RMSD = (0.08, 0.14) nm


#### The decompressed trajectory is written to `protein_ligand/test/p450_cam_decompressed.xtc`.
#### The RMSD values, if calculated, are stored in `protein_ligand/test/p450_cam_rmsd.txt`.
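
For readers who want to see what a per-frame RMSD measures, here is a minimal pure-Python sketch for a single frame. It is not the implementation in `recompose.py`. It assumes a plain coordinate RMSD without any superposition, and it does not interpret the two numbers in the `RMSD = (0.08, 0.14) nm` line, whose meaning the output does not state. The coordinates are made-up toy values.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized sets of 3D points."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must have the same number of atoms")
    if not coords_a:
        raise ValueError("coordinate sets must not be empty")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the matching atoms in the two frames.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy two-atom frame (units are arbitrary here).
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
decompressed = [(0.0, 0.0, 0.1), (1.0, 0.1, 0.0)]
print(round(rmsd(original, decompressed), 3))  # 0.1
```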