# Adding a dataset to EUGENe 

**Authorship:**
Adam Klie, *09/29/2022*
***
**Description:**
This tutorial is intended to show how to add a dataset to EUGENe. It's a pretty simple process, but it's important to follow the steps in order to ensure that the dataset is properly formatted and can be used by the EUGENe pipeline.
***

In [5]:
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload 
%autoreload 2

import os
import numpy as np
import pandas as pd
import eugene as eu

eu.settings.dataset_dir = "./tutorial_datasets"

# 1. Identify your dataset
EUGENe currently supports any dataset that consists of a set of DNA or RNA sequences and a set of labels for those sequences. A couple of notes:

* EUGENe currently supports directly reading from CSV files, numpy compressed files, FASTA files, BED files, BAM files and BigWig files. If you have a dataset in a different format, you will need to convert it to one of these formats and then use EUGENe to read it.
* BED, BAM and BigWig files carry their own labels, so you often don't need to provide a separate file with labels. However, you will need the right combination of files to get the labels you want. See the Read the Docs page on dataloading for more information.
* The labels can be any type of information, but should be aligned to your sequences in some way.
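
As a rough illustration of what "aligned" means here, the sketch below checks that every sequence has exactly one label and that the sequences contain only nucleotide characters. The helper name `check_aligned` and the allowed alphabet are assumptions made for this example; neither is part of EUGENe.

```python
# Hypothetical helper, not part of EUGENe: sanity-check sequences and labels
# before writing a loader function.
ALLOWED = set("ACGTUN")  # assumed alphabet: DNA/RNA bases plus N for unknown


def check_aligned(seqs, labels):
    """Return True if each sequence has one label and uses only allowed characters."""
    if len(seqs) != len(labels):
        return False
    return all(set(seq.upper()) <= ALLOWED for seq in seqs)


seqs = ["ACGT", "GGCA", "TTNA"]
labels = [0.5, 1.2, 0.0]
print(check_aligned(seqs, labels))      # True
print(check_aligned(seqs, labels[:2]))  # False: one sequence has no label
```

A check like this is cheap to run before converting your data into one of the supported file formats.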

# 2. Create a loader function
* This should be a normal Python function named according to the following convention:
    - If your dataset is associated with a publication, use the last name of the first author followed by the year of publication (nameYY). For example, if the dataset was published in 2021 by the author "Jane Doe", the function should be named `doe21`.
    - If your dataset is not associated with a publication, you can come up with the name of the dataset followed by the year of creation (nameYY). For example, if the dataset was created in 2021 and you've named it "eugene", the function should be named `eugene21`.
* Implementing the function
    - At minimum, the function should return a SeqData object to the user that contains either `seqs` or `ohe_seqs`.
    - The SeqData should also have `names` that can be from the dataset or just `seq1`, `seq2`, etc.
    - The SeqData should also have targets to predict in the `seqs_annot` field. These can be any type of information, but should be aligned to your sequences in some way.
    - The dataset should be downloaded to the user's current machine and then loaded in. You can use whatever method you want to download the dataset, but we offer a helper function (`try_download_urls`) that will try to download the dataset from a list of URLs. If the dataset is not available at any of the URLs, it will raise an error.
    - You are welcome to create your own functions for loading in different files and datatypes, but we have several dataload functions in the `_io.py` script that you can use to load in different types of files. See the Read the Docs page on the `dataload` module for more information.
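
The nameYY convention described above can be sketched in a few lines. The helper `loader_name` is hypothetical and exists only to make the rule concrete; EUGENe does not provide it, and you would normally just pick the name by hand.

```python
# Hypothetical helper, not part of EUGENe: build a loader name following nameYY.
def loader_name(name, year):
    """Lowercase name followed by the last two digits of the year."""
    return "{}{:02d}".format(name.lower(), year % 100)


print(loader_name("Doe", 2021))     # doe21
print(loader_name("eugene", 2021))  # eugene21
print(loader_name("Farley", 2015))  # farley15
```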

In [6]:
# The eugene helpers I most often use for adding a new dataset
from eugene.datasets._utils import try_download_urls
from eugene.dataload import read_csv
from eugene.dataload import SeqData

As an example, I will show how I added the dataset available at https://zenodo.org/record/6863861#.YzcG9exKglU. This dataset comes from Farley et al. (2015), so I will name the corresponding function `farley15`.

In [7]:
# Function definition for downloading and loading the farley15 dataset
def farley15(
    return_sdata=True, 
    **kwargs: dict
):  # returns a SeqData object, or the downloaded file paths if return_sdata=False

    # We typically start with a url list. We had to create a Zenodo archive for this dataset
    urls_list = [
        "https://zenodo.org/record/6863861/files/farley2015_seqs.csv?download=1",
        "https://zenodo.org/record/6863861/files/farley2015_seqs_annot.csv?download=1",
    ]

    # We then use a helper function to try downloading the files
    paths = try_download_urls([0, 1], urls_list, "farley15")

    # If specified, we return a SeqData object
    if return_sdata:
        # Here we just read in the first csv file
        path = paths[0]
        seq_col = "Enhancer"
        data = read_csv(
            path,
            sep=",",
            seq_col=seq_col,
            auto_name=True,
            return_dataframe=True,
            **kwargs,
        )
        
        # Make some cosmetic tweaks, build a SeqData and return it
        n_digits = len(str(len(data) - 1))
        ids = np.array(["seq{num:0{width}}".format(num=i, width=n_digits) for i in range(len(data))])
        sdata = SeqData(
            seqs=data[seq_col],
            names=ids,
            seqs_annot=data[["Barcode", "Biological Replicate 1 (RPM)", "Biological Replicate 2 (RPM)"]],
        )
        return sdata
    
    # Otherwise we just point the user to where we downloaded the file
    else:
        return paths
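
The zero-padded naming used in the loader can be tried on its own. This is a minimal sketch using only the standard library; `make_seq_names` is a name chosen for this example and is not an EUGENe function.

```python
# Hypothetical helper mirroring the name-building lines in the loader above.
def make_seq_names(n):
    """Return names seq0..seq(n-1), zero-padded so all names have the same length."""
    width = len(str(n - 1))  # digits needed for the largest index
    return ["seq{num:0{width}}".format(num=i, width=width) for i in range(n)]


names = make_seq_names(12)
print(names[0])   # seq00
print(names[-1])  # seq11
```

Padding keeps the names the same length, so they sort lexicographically in the same order as the sequences.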

# 3. Test your function
* You should test your function to make sure that it works as expected. 
    - You can first do this by running the function in a Jupyter notebook and then checking the output. 
    - We also require that you write a test function that will be run by our continuous integration (CI) pipeline. This test function should be named `test_{name of loader function}`. For example, the test function for the `farley15` function would be named `test_farley15`. This function should be added to the `test_datasets.py` script in the `tests` folder.
    - Test multiple aspects of your function
        * Does it load in the correct number of sequences?
        * Does it load in the targets?
        * Does it have proper names?

In [13]:
def test_farley15():
    sdata = farley15()
    assert sdata.n_obs == 163708
    assert "seq163707" == sdata.names[-1]
    assert sdata.seqs_annot.shape == (163708, 3)
    sdata_path = farley15(return_sdata=False)[0]
    assert os.path.exists(sdata_path) 

In [14]:
test_farley15()

Dataset farley15 farley2015_seqs.csv has already been downloaded.
Dataset farley15 farley2015_seqs_annot.csv has already been downloaded.
Dataset farley15 farley2015_seqs.csv has already been downloaded.
Dataset farley15 farley2015_seqs_annot.csv has already been downloaded.


Once you've confirmed everything works as anticipated, you can add your function to the `datasets` module.
This involves just copying the function into the `_datasets.py` script in the `datasets` module. Then add the function to the import statement in the module's `__init__.py` script.

Don't forget to add your test function to the `test_datasets.py` script in the `tests` folder as well! You can then run the tests to make sure everything works as expected.

```bash
pytest tests/test_datasets.py -k "test_farley15"
```

# 4. Document your function

## Docstring

Once you're happy with your function and have tested that it works as expected, you can document it by adding a docstring. The docstring should follow the numpydoc format and contain a Parameters and a Returns section. Don't worry too much about adding details for the dataset; that is done in the next step! You can see examples in the `_datasets.py` script and on the `datasets` module's Read the Docs page.

```python
"""
Reads in the farley15 dataset.

Parameters
----------
return_sdata : bool, optional
    If True, return SeqData object for the farley15 dataset. The default is True.
    If False, return the paths to any downloaded files.
**kwargs : dict
    Keyword arguments to pass to read_csv.

Returns
-------
sdata : SeqData
    SeqData object for the farley15 dataset. If return_sdata is False,
    the paths to the downloaded files are returned instead.
""" 
```

## Add information to `datasets.csv`
The last thing to do is to add your dataset to the `datasets.csv` file in the `datasets` folder. This file helps keep track of the datasets that are available in EUGENe and allows users to view them. The columns in the file are as follows:
* name (required): The name of the dataset. This should be the same as the name of the function.
* n_seqs (required): The number of sequences in the dataset.
* n_targets (required): The number of targets in the dataset.
* metadata (optional): Any additional information about the dataset stored in the `seqs_annot` field of the SeqData object.
* url (required): The URL pointing to the dataset. This doesn't have to be the URL where the dataset is hosted, but include the hosting URLs if they are available.
* description (required): A short description of the dataset. At minimum, this should include the type of data (DNA or RNA), the type of targets (e.g. binding affinity, gene expression, etc.) and a few sentences about how the dataset was generated. If there is a publication associated with the dataset, you can include a link to the publication in the description.
* author (required): Your name and email address. This is so we can contact you if there are any issues with the dataset.
* info_page (optional): A URL pointing to a page with more information about the dataset. I often build Notion pages for my datasets, but you can use whatever you want!
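
To make the columns concrete, here is a minimal sketch that formats one row with Python's `csv` module. The column order follows the list above, but the actual order and delimiter of `datasets.csv` are assumptions, and every value belongs to the hypothetical `doe21` dataset rather than a real one.

```python
import csv
import io

# Column order follows the list above; the real file's order and delimiter are assumptions.
COLUMNS = ["name", "n_seqs", "n_targets", "metadata", "url", "description", "author", "info_page"]

# Placeholder values for the hypothetical `doe21` dataset; none describe a real dataset.
row = {
    "name": "doe21",
    "n_seqs": 1000,
    "n_targets": 2,
    "metadata": "",
    "url": "https://example.org/doe21.csv",
    "description": "DNA sequences with two expression targets (placeholder text).",
    "author": "Jane Doe (jane.doe@example.org)",
    "info_page": "",
}

# Write to an in-memory buffer so the sketch does not touch any real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Using `csv.DictWriter` rather than joining strings by hand means descriptions that contain commas are quoted correctly.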

# 5. Submit a pull request
Once you've completed all of the above steps, you can submit a pull request to the EUGENe repository. We will review your pull request and merge it into the main branch if everything looks good. If there are any issues, we will let you know and you can make the necessary changes. Once your pull request is merged, your dataset will be available in the next release of EUGENe!

# Wrapping up
Hopefully this guide was helpful in getting you started with adding your own dataset to EUGENe. If you have any questions, feel free to open a GitHub issue.

---