# Python API tutorial

## Installation

### Getting `plinder`

Due to dependencies that are not installable via `pip`, `plinder` is currently not
available on PyPI.
You can download the official
[_GitHub_ repository](https://github.com/plinder-org/plinder/)
instead, for example via `git`.

```console
$ git clone https://github.com/plinder-org/plinder.git
```

### Creating the Conda environment

The most convenient way to install the aforementioned extra dependencies is a _Conda_
environment.
If you have not installed _Conda_ yet, we recommend installing it via
[miniforge](https://github.com/conda-forge/miniforge).
Afterwards, the environment can be created from the `environment.yml` in the local
repository clone.

:::{note}
We currently only support Linux.
`plinder` uses `openstructure` for some of its functionality.
`openstructure` is available from the `aivant` conda channel via
`conda install aivant::openstructure`, but it is only built for Linux architectures.
Windows and macOS users should refer to the relevant
[_Docker_](#docker-target) resources.
:::

```console
$ mamba env create -f environment.yml
$ mamba activate plinder
```

### Installing `plinder`

Now `plinder` can be installed into the created environment:

```console
$ pip install .
```

(docker-target)=
### Alternative: Using a Docker container

We also publish the `plinder` project as a Docker container as an alternative to the
_Conda_-based installation, to ensure the highest level of compatibility with
non-Linux platforms.
See the relevant Docker resources in the repository for more details:

- `docker-compose.yml`: defines a `base` image, the `plinder` "app" and a `test`
  container
- `dockerfiles/base/`: contains the files for the `base` image
- `dockerfiles/main/`: contains the files for the `plinder` "app" image

### Configure dataset environment variables

We need to set environment variables that point to the release and iteration of choice.
For the sake of demonstration, they are set here to a smaller tutorial example
dataset: `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=tutorial`.

:::{note}
The version used for the preprint is `PLINDER_RELEASE=2024-04` and
`PLINDER_ITERATION=v1`, while the current version with updated annotations to be used
for the MLSB challenge is `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=v2`.
:::

In [None]:
import os
from pathlib import Path

release = "2024-06"
iteration = "tutorial"
os.environ["PLINDER_RELEASE"] = release
os.environ["PLINDER_ITERATION"] = iteration
os.environ["PLINDER_REPO"] = str(Path.home() / "plinder-org/plinder")
os.environ["PLINDER_LOCAL_DIR"] = str(Path.home() / ".local/share/plinder")
os.environ["GCLOUD_PROJECT"] = "plinder"
version = f"{release}/{iteration}"

As an alternative, these variables can also be set from the terminal via `export`
(*UNIX*) or `set` (*Windows*).
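
When the variables may already be exported in the terminal, a notebook can avoid
overwriting them.
The following is a minimal sketch using only the standard library; the helper name
`configure_plinder_env` is hypothetical and not part of `plinder`, and its defaults are
simply the tutorial settings from above.

```python
import os


def configure_plinder_env(release="2024-06", iteration="tutorial"):
    """Set PLINDER_RELEASE/PLINDER_ITERATION unless they are already defined.

    Hypothetical helper for illustration; not part of the plinder API.
    """
    # setdefault keeps a value exported in the terminal and only falls back
    # to the given default if the variable is missing
    release = os.environ.setdefault("PLINDER_RELEASE", release)
    iteration = os.environ.setdefault("PLINDER_ITERATION", iteration)
    return f"{release}/{iteration}"


version = configure_plinder_env()
print(version)
```

With this approach, a value set via `export` takes precedence over the defaults
hard-coded in the notebook.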

## Overview

The user-facing subpackage of `plinder` is {mod}`plinder.core`.
It provides utility functions for accessing the dataset,
splits and annotations, including the following top-level functions:

:::{currentmodule} plinder.core
:::

- {func}`get_config()`: access *PLINDER* global configuration
- {func}`get_plindex()`: access full annotation table
- {func}`get_manifest()`: map *PLINDER* system to PDB ID
- {func}`get_split()`: access full split table

:::{currentmodule} plinder
:::

In addition, it provides access to the data class {class}`PlinderSystem` for
reconstituting a *PLINDER* system from its `system_id`.

To supplement these data, {mod}`plinder.core.scores` provides functionality for
querying metrics, such as protein/ligand similarity and cluster identity.

## Getting the configuration

First, we get the configuration to check that all parameters are set correctly.
In the snippet below, we check whether the local and remote *PLINDER* paths point to
the expected locations.

In [None]:
import plinder.core.utils.config

cfg = plinder.core.get_config()
print(f"local cache directory: {cfg.data.plinder_dir}")
print(f"remote data directory: {cfg.data.plinder_remote}")

## Query annotations

:::{currentmodule} plinder.core
:::

### Full dataset

The annotation table is also called the *PLINDER index*, or *PLINDEX* for short.
{func}`get_plindex()` loads the entire annotation table as a
[`pandas`](https://pandas.pydata.org) data frame.
A description of all columns is available in the
[Dataset Reference](#annotation-table-target).

In [None]:
from plinder.core import get_plindex
annotation_df = get_plindex()
annotation_df.head()

### Query specific columns 

:::{currentmodule} plinder.core.scores
:::

To query the annotation table for specific columns or to filter by specific criteria,
use {func}`query_index()`.
Called without any arguments, the function yields a table of `system_id` and
`entry_pdb_id`.
To select other columns, pass the `columns` argument, which is a list of
[column names](#annotation-table-target).

In [None]:
from plinder.core.scores import query_index
# Get system_id and entry_pdb_id columns
query_index()

In [None]:
# Get specific columns from the annotation table
cols_of_interest = [
    "system_id", "entry_pdb_id", "entry_release_date",
    "entry_oligomeric_state", "entry_clashscore", "entry_resolution",
]
query_index(columns=cols_of_interest)

### Query annotations with specific filters

We can also pass additional `filters`, where each filter is a logical comparison
of a column name with a given value.
Only rows that fulfill all conditions are returned.
See the description of
[`pandas.read_parquet()`](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html)
for more information on the filter syntax.

In [None]:
# Query for single-ligand systems
filters = [("system_num_ligand_chains", "==", "1")]
query_index(columns=cols_of_interest, filters=filters)
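
To make the "all conditions must hold" semantics concrete, here is a small pure-Python
sketch that applies a list of `(column, operator, value)` tuples to a few rows.
It only illustrates the logic: the rows are made up, the names `OPS`, `row_matches`,
`demo_rows` and `demo_filters` are hypothetical, and the real filtering is done by the
parquet reader rather than by code like this.

```python
import operator

# Comparison operators supported by this sketch (a subset chosen for illustration)
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def row_matches(row, filters):
    """Return True if the row satisfies every (column, op, value) filter."""
    return all(OPS[op](row[column], value) for column, op, value in filters)


# Made-up rows for illustration only; not taken from the PLINDER index
demo_rows = [
    {"system_id": "example_a", "system_num_ligand_chains": 1, "entry_resolution": 1.8},
    {"system_id": "example_b", "system_num_ligand_chains": 2, "entry_resolution": 1.5},
    {"system_id": "example_c", "system_num_ligand_chains": 1, "entry_resolution": 3.2},
]
demo_filters = [("system_num_ligand_chains", "==", 1), ("entry_resolution", "<", 2.5)]

# Only rows passing both conditions are kept
selected = [row["system_id"] for row in demo_rows if row_matches(row, demo_filters)]
print(selected)
```

Only `example_a` satisfies both conditions, so it is the only row kept.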

## Query protein similarity

We provide three kinds of similarity datasets:

- Similarity between ligand-bound structures (`holo`)
- Similarity between ligand-bound and unbound protein structures (`apo`)
- Similarity between ligand-bound and AlphaFold-predicted structures (`pred`)

Any of these can be selected via the `search_db` argument of
{func}`query_protein_similarity()`.

:::{note}
With the full dataset, some similarity queries might require a large amount of memory.
For example,
`query_protein_similarity(search_db="apo", filters=[("similarity", ">", "50")])`
will use more than 500 GB of RAM.
:::

In [72]:
from plinder.core.scores import query_protein_similarity
query_protein_similarity(
    search_db="apo",
    columns=["query_system", "target_system", "similarity"],
    filters=[("similarity", ">", "50"), ("metric", "==", "protein_qcov_max")]
)

2024-08-27 11:03:39,711 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.20s
2024-08-27 11:03:39,863 | plinder.core.scores.protein.query_protein_similarity:24 | INFO : runtime succeeded: 1.37s


|      | query_system           | target_system          | similarity |
| ---- | ---------------------- | ---------------------- | ---------- |
| 0    | 1b5d__1__1.A_1.B__1.D  | 1b49_A                 | 100        |
| 1    | 1b5d__1__1.A_1.B__1.D  | 1b49_B                 | 100        |
| 2    | 1b5d__1__1.A_1.B__1.D  | 6a9b_A                 | 100        |
| 3    | 1s2g__1__1.A_2.C__1.D  | 1s2l_A                 | 100        |
| 4    | 1s2g__1__1.A_2.C__1.D  | 1s2l_B                 | 100        |
| ...  | ...                    | ...                    | ...        |
| 3093 | 4n7m__1__1.A_1.B__1.C  | 6bz9__1__1.A_1.B__1.C  | 52         |
| 3094 | 4n7m__1__1.A_1.B__1.C  | 6f6r__1__1.A_1.B__1.C  | 52         |
| 3095 | 4n7m__1__1.A_1.B__1.C  | 6f6r__1__2.A_2.B__2.C  | 52         |
| 3096 | 4n7m__1__1.A_1.B__1.C  | 6f6r__2__1.A_1.B__1.C  | 52         |
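
A common follow-up step is to collect all targets per query system.
The sketch below does this in plain Python for the first five rows copied from the
output above; the variable names `hits` and `targets_by_system` are chosen for
illustration only.

```python
from collections import defaultdict

# (query_system, target_system, similarity) rows copied from the output above
hits = [
    ("1b5d__1__1.A_1.B__1.D", "1b49_A", 100),
    ("1b5d__1__1.A_1.B__1.D", "1b49_B", 100),
    ("1b5d__1__1.A_1.B__1.D", "6a9b_A", 100),
    ("1s2g__1__1.A_2.C__1.D", "1s2l_A", 100),
    ("1s2g__1__1.A_2.C__1.D", "1s2l_B", 100),
]

# Map each query system to the list of targets it matched
targets_by_system = defaultdict(list)
for query_system, target_system, similarity in hits:
    targets_by_system[query_system].append(target_system)

for system_id, targets in targets_by_system.items():
    print(system_id, targets)
```

On a real result, the same grouping can be done on the returned data frame instead of
a hand-written list.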


## Working with a PLINDER system

A {class}`PlinderSystem` is the representation of a single system.
This object provides access to all PDB entry and system level annotations, as well as
the structures of the system components.

### Load systems from IDs

To reconstitute a *PLINDER* system directly from its ID, use the {class}`PlinderSystem`
class.


In [57]:
from plinder.core import PlinderSystem
plinder_system = PlinderSystem(system_id="4agi__1__1.C__1.W")

Users can choose the granularity level of the input.
In the case above, the system was specified by its system ID, but as an alternative,
passing PDB IDs (or their two middle characters) is also possible, which gives all
systems corresponding to the given PDB IDs.
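
To see how a system ID relates to its PDB ID, the sketch below splits an ID into its
parts.
It assumes the layout `<PDB ID>__<biounit>__<receptor chains>__<ligand chains>`, with
multiple chains joined by a single underscore, which matches the IDs shown in this
tutorial but is an assumption here; `ParsedSystemID` and `parse_system_id` are
hypothetical names and not part of `plinder`.

```python
from typing import NamedTuple


class ParsedSystemID(NamedTuple):
    """Components of a system ID (field names are hypothetical)."""

    pdb_id: str
    biounit: str
    receptor_chains: list
    ligand_chains: list


def parse_system_id(system_id):
    # Assumed layout: <PDB ID>__<biounit>__<receptor chains>__<ligand chains>
    pdb_id, biounit, receptors, ligands = system_id.split("__")
    # Multiple chains within one part are assumed to be joined by "_"
    return ParsedSystemID(pdb_id, biounit, receptors.split("_"), ligands.split("_"))


parsed = parse_system_id("4agi__1__1.C__1.W")
print(parsed.pdb_id)        # 4agi
print(parsed.pdb_id[1:3])   # ag (the two middle characters)
print(parsed.receptor_chains, parsed.ligand_chains)
```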

### Accessing annotations

The `PlinderSystem.entry` property provides PDB entry-level annotations for that system.
Here, we will list the accessible categories of entry annotations and access the
oligomeric state of a given system.

In [None]:
entry_annotations = plinder_system.entry
print(list(entry_annotations.keys()))
print(entry_annotations["oligomeric_state"])

In contrast, `PlinderSystem.system` returns system-level annotations.
Here, we will extract the SMILES string of the first ligand of a given system.

In [None]:
system_annotations = plinder_system.system
print(list(system_annotations.keys()))
# Show ligand smiles of the first ligand of a given system
print(system_annotations["ligands"][0]["smiles"])

### Getting structure file paths

The `PlinderSystem` also provides access to the structure files the system is based on.
This could be helpful for loading the structures for training a model or performing
other calculations that require structural information.

In [None]:
print(plinder_system.ligands)

The same can be done for the receptor protein.

In [None]:
print(plinder_system.receptor_pdb)

### Inspect apo and predicted annotations

For users interested in using apo and predicted structures in model training, the
snippet below maps holo system IDs (`reference_system_id`) to apo or predicted
IDs (`id`) and reports their similarity measures.
This similarity data includes protein and pocket similarity
(see the description [here](/eval.md)), as well as all evaluation metrics calculated
upon superposition and transplantation of ligands into each apo/predicted structure.
The same information can also be accessed with {func}`query_links`.

In [None]:
plinder_system.linked_structures

Calling {func}`query_links` directly returns the table of links:

In [None]:
from plinder.core.scores import query_links
links = query_links()
links

Here we will use this table to get the PDB and chain IDs for apo structures
corresponding to a given system ID.

In [None]:
print(links[
    (links.reference_system_id ==  "4agi__1__1.C__1.W") & (links.kind == "apo")
].id.to_list())


The structure file locations for the linked structures can also be obtained.
The directories are named after the `reference_system_id` and `id` columns.

In [None]:
for file in plinder_system.linked_archive.glob("**/*.cif"):
    print(file)

## Working with split data

### Get split table

The split table sorts each PLINDER system into a cluster and defines the split it is
part of.
To access the splits, use {func}`get_split()`.

In [None]:
from plinder.core import get_split
split_df = get_split()
split_df

For example, this table can be used to get all system IDs that belong to the *test*
split.

In [None]:
split_df[split_df.split == "test"].system_id.to_list()