# Basic Tutorial for using the Pipeline

Add the package directory to the Python path and import the package

In [None]:
import sys
sys.path.append("./pipeline")
import pipeline

## 1. Prepare your data

* Replace the files in `data/latent` with your latent embeddings named `latent_dict_<model_id>.pt`
* Each `.pt` file must contain a dictionary with one entry per protein structure: the key is `<protein_target_id>_<pdb_id>` and the value is the latent space embedding of that structure
* Replace the numbers in `gnn_model_names` in `pipeline/dataset/preparation/generate_proteins_table.py` with your model ids
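Before running the preparation script, it can help to check that a latent dictionary follows the layout described above. The helper below is a minimal sketch, not part of the pipeline: `check_latent_dict` and `KEY_PATTERN` are hypothetical names, and it assumes that the PDB id is four alphanumeric characters after the last underscore and that each embedding is a flat sequence of `latent_dim` values.

```python
import re

# Assumption: the PDB id is four alphanumeric characters and is separated
# from the protein target id by the last underscore in the key.
KEY_PATTERN = re.compile(r"^(?P<target>.+)_(?P<pdb>[0-9A-Za-z]{4})$")


def check_latent_dict(latent_dict, latent_dim):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, embedding in latent_dict.items():
        if not KEY_PATTERN.match(str(key)):
            problems.append(f"{key!r}: key is not <protein_target_id>_<pdb_id>")
        # Assumption: embeddings are flat (1-D), so len() gives their size.
        # len() works for lists, NumPy arrays, and torch tensors alike.
        if len(embedding) != latent_dim:
            problems.append(
                f"{key!r}: expected {latent_dim} values, got {len(embedding)}"
            )
    return problems


# Toy stand-in for the content of a latent_dict_<model_id>.pt file.
example = {"kinaseA_1abc": [0.0] * 4, "badkey": [0.0] * 3}
for problem in check_latent_dict(example, latent_dim=4):
    print(problem)
```

For a real file, the dictionary would presumably come from `torch.load` on the corresponding `data/latent/latent_dict_<model_id>.pt` file; if your embeddings carry an extra batch dimension, the length check needs adjusting.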

Run `pipeline/dataset/preparation/generate_proteins_table.py` to scrape the auxiliary protein data for your dataset:

In [None]:
!python3 pipeline/dataset/preparation/generate_proteins_table.py

Set the ID of the model you want to investigate and the size of the input latent vectors

In [None]:
model_id = 170
latent_dim = 512

Get the UniProt and ligand annotations for your dataset

In [None]:
anns_df = pipeline.create_anns_dataframe(model_id)
anns_df.to_csv('data/annotations/annotations.csv')

## (1a. Determine the optimal latent dimension [optional])

Run the elbow analysis and determine the optimal number of components visually

In [None]:
from scripts.elbow_analysis import elbow_analysis

elbow_analysis(model_id, latent_dim)
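For readers unfamiliar with elbow analysis, the sketch below illustrates the general idea on synthetic data. It is not the implementation in `scripts/elbow_analysis.py`, whose internals are not shown in this tutorial; it assumes a PCA-style criterion in which the cumulative explained variance flattens once enough components are included. The function name `explained_variance_ratio` and the toy data are hypothetical.

```python
import numpy as np


def explained_variance_ratio(latents):
    """Fraction of total variance captured by each principal component."""
    centered = latents - latents.mean(axis=0)
    # The squared singular values of the centered data are proportional
    # to the variances along the principal components (sorted descending).
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Synthetic data: 200 samples in 16 dimensions whose variance lies almost
# entirely in 3 directions, plus a small amount of isotropic noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 16))
latents = signal + 0.01 * rng.normal(size=(200, 16))

ratios = explained_variance_ratio(latents)
cumulative = np.cumsum(ratios)
for k in range(1, 6):
    # The curve should flatten after the third component: the "elbow".
    print(k, round(float(cumulative[k - 1]), 3))
```

On real embeddings the elbow is rarely this sharp, which is why the choice is made visually rather than by a fixed threshold.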

## (1b. Retrain the autoencoder [optional])

* Modify `train_range` and `val_range` in `pipeline/train.py` according to the ids of your models
* Run the training script

In [None]:
!python3 pipeline/train.py

## 2. Modify your base config file

Adjust the default hyperparameters in `base.yaml` to fit your dataset and use case

## 3. Run the pipeline

Override any hyperparameters in your base config by adding `++<hyperparameter>=<value>` to the command below.
On the first run on your dataset, set `++reload_families=true` so that the family dictionary is created.

In [None]:
!python3 pipeline/run_pipeline.py ++model_id=$model_id ++base_dir=$(pwd) ++reload_families=true ++clustering.agglomerative.n_clusters=[7,8,9,10]
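When several runs with different overrides are needed, the command line can be assembled programmatically. The following is a minimal sketch under the assumption that every override uses the `++<hyperparameter>=<value>` form shown above; `format_value` and `build_command` are hypothetical helpers, not part of the pipeline.

```python
import shlex


def format_value(value):
    """Render a Python value in the form used by the command above."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(format_value(v) for v in value) + "]"
    return str(value)


def build_command(overrides, script="pipeline/run_pipeline.py"):
    """Return the argument list for one pipeline run."""
    args = ["python3", script]
    args += [f"++{key}={format_value(value)}" for key, value in overrides.items()]
    return args


command = build_command({
    "model_id": 170,
    "reload_families": True,
    "clustering.agglomerative.n_clusters": [7, 8, 9, 10],
})
# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`; add a `base_dir` entry (for example `os.getcwd()`) to mirror the `++base_dir=$(pwd)` argument used in the cell above.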