# Custom datasets and tasks


[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/BorgwardtLab/proteinshake/blob/main/docs/readthedocs/source/notebooks/custom.ipynb)

## Custom Dataset

The `Dataset` object handles two major steps: (i) downloading the raw PDB files from a source database, and (ii) annotating each protein, possibly with data from another source.
These are the only steps you need to implement. All other downstream work (parsing, cleaning, representations, frameworks) is provided by ProteinShake and can be reused, which keeps custom datasets simple.

Here is an example of how to create a custom dataset when you have your own annotations in a file that looks like this:

```
pdbid | annotation
-----------------
3ONJ  |   0.1
2DMW  |   0.7
2MOF  |   0.4
...
```

Each row corresponds to a protein that is hosted in the RCSB Protein Data Bank, so we can subclass `RCSBDataset` and add our own annotations:

```python
from proteinshake.datasets import RCSBDataset

class MyDataset(RCSBDataset):

    def __init__(self, *args, **kwargs):
        # These annotations could also be loaded from a local path or a remote server
        self.annotations = {'3ONJ': 0.1, '2DMW': 0.7, '2MOF': 0.4, '3NHA': 0.3, '2X27': 0.9, '3NHB': 0.2, '5GV0': 0.5, '1H8M': 0.9, '1IOU': 0.4, '2UWR': 0.6}
        pdb_ids = list(self.annotations.keys())
        # Restrict the download to the annotated PDB IDs
        super().__init__(*args, from_list=pdb_ids, **kwargs)

    def add_protein_attributes(self, protein):
        """ Store the annotation in the downloaded protein object
        """
        protein['protein']['my_annotation'] = self.annotations[protein['protein']['ID']]
        return protein
```

`MyDataset` now offers the same functionality as the hosted datasets.
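
The annotations in `__init__` are hard-coded for brevity. If they live in a file with the `pdbid | annotation` layout shown earlier, they could be read with a small parser like the one below. This is only a sketch using the standard library: `read_annotations` is a hypothetical helper, not part of ProteinShake, and it assumes one protein per row with a numeric annotation.

```python
import io

def read_annotations(handle):
    """Parse 'pdbid | annotation' rows into a {pdbid: float} dictionary."""
    annotations = {}
    for line in handle:
        if '|' not in line:
            continue  # skip the separator line, blank lines and '...'
        pdb_id, value = (field.strip() for field in line.split('|', 1))
        try:
            annotations[pdb_id.upper()] = float(value)
        except ValueError:
            continue  # skip the header row, whose value is not a number
    return annotations

# A real file would be passed as an open file handle instead;
# an in-memory file keeps this example self-contained.
example_file = io.StringIO(
    "pdbid | annotation\n"
    "-----------------\n"
    "3ONJ  |   0.1\n"
    "2DMW  |   0.7\n"
    "2MOF  |   0.4\n"
)
annotations = read_annotations(example_file)
print(annotations)  # {'3ONJ': 0.1, '2DMW': 0.7, '2MOF': 0.4}
```

The resulting dictionary has the same shape as `self.annotations`, so it could be assigned there in place of the literal.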

## Custom Task

The `Task` object defines four major components: (i) the base dataset, (ii) the prediction target, (iii) train/validation/test splits, and (iv) prediction evaluation.

This is a template for a fully customized task:

```python
from proteinshake.tasks import Task
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


class MyTask(Task):

    # This static attribute needs to correspond to a
    # ProteinShake dataset: either one of the hosted datasets
    # or one that you created yourself.
    DatasetClass = MyDataset

    def target(self, protein):
        """ Accepts a protein dictionary and
            returns the prediction target for that protein.
            This can accept pairs of proteins or other objects
            depending on the task.
        """
        return protein['protein']['my_annotation']

    def compute_custom_split(self, split_type):
        """ Let's implement a simple random split
            (60% train, 20% validation, 20% test).
            `split_type` is not used here.
        """
        train_index, valtest = train_test_split(range(self.size), test_size=0.4, random_state=0)
        val_index, test_index = train_test_split(valtest, test_size=0.5, random_state=0)
        return train_index, val_index, test_index

    def evaluate(self, y_true, y_pred):
        """ Accepts the true targets and a list of model outputs
            and returns a dictionary containing evaluation
            metrics. By convention, `y_pred`
            is a list of values where each item corresponds
            to a prediction on one item of the test set.
            `y_true` can be task.test_targets, task.val_targets,
            or any custom provided values.
        """
        return {'mse': mean_squared_error(y_true, y_pred)}
```
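
To make the split sizes concrete: holding out 40% of the data and then halving that hold-out gives a 60/20/20 split, so the 10 proteins of `MyDataset` end up as 6 train, 2 validation and 2 test items. The sketch below reproduces the same idea with the standard library only. `random_split` is a hypothetical helper, and its shuffling differs from scikit-learn's, so the indices it selects will not match those of `train_test_split`.

```python
import random

def random_split(n, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train/val/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n * test_fraction)
    n_val = round(n * val_fraction)
    test_index = indices[:n_test]
    val_index = indices[n_test:n_test + n_val]
    train_index = indices[n_test + n_val:]
    return train_index, val_index, test_index

train_index, val_index, test_index = random_split(10)
print(len(train_index), len(val_index), len(test_index))  # 6 2 2
```

Whatever method is used, the three index lists should be disjoint and together cover every protein exactly once.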

Now you (and others) can use your task as follows. We need `use_precomputed=False` here because the task is not (yet) hosted in the ProteinShake database.

```python
task = MyTask(root='task_test', use_precomputed=False, verbosity=1)
test_proteins = task.proteins[task.test_index]
print([p['protein']['ID'] for p in test_proteins])
print(task.test_targets)
```

```
['5GV0', '1IOU']
[0.5 0.4000000059604645]
```
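
As a rough illustration of what `evaluate` reports, the mean squared error for these two test targets can be computed by hand. The predictions below are made up for the example and do not come from a trained model, and `mse` is a hypothetical stand-in for scikit-learn's `mean_squared_error`.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0.5, 0.4]   # test targets printed above, rounded
y_pred = [0.6, 0.3]   # hypothetical model outputs
metrics = {'mse': mse(y_true, y_pred)}
print(metrics)        # the value is approximately 0.01
```

With a real model, the equivalent call on the task defined above would be `task.evaluate(task.test_targets, y_pred)`, which returns a dictionary of the same form.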


Tip: If you would like to share your custom tasks and datasets, please feel free to open a [pull request](https://github.com/BorgwardtLab/proteinshake/pulls) on our GitHub repository.