# Adding the OAS Dataset: Downloading and Preprocessing

Adding a new dataset to BioNeMo is a common task, and this tutorial shows how to do it. The [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/) dataset is used as the example. OAS is a database of antibody sequences containing over one billion sequences from 80 different studies, intended for large-scale analysis.

Adding a new dataset can be broken into three development tasks, each of which can make use of base and helper classes in BioNeMo and NeMo. In this tutorial series, the dataset is added to the ESM1-nv pre-training pipeline. The three tasks are:

1. Preprocessing includes download of the raw data and any additional preparation steps, such as extracting the files. It also includes dividing the data into train, validation, and test splits. The preprocessing step can make use of two BioNeMo base classes, `RemoteResource` and `ResourcePreprocessor`, from `bionemo.utils.remote` and `bionemo.data.preprocess`, respectively. Their use is optional but they provide some basic functionality which can accelerate development. This step is covered by the current tutorial. </br></br>
2. Development of the new dataset class. Here, the NeMo dataset class [CSVMemMapDataset](https://github.com/NVIDIA/NeMo/blob/b0e5bf3627dbcfb3f4a72d73d3c5e92184d8b1f6/nemo/collections/nlp/data/language_modeling/text_memmap_dataset.py#L286) will be used. This task will be covered by the next tutorial, <a href="custom-dataset-class-fw.html">Modifying the Dataset Class</a>. </br></br>
3. Modification of the dataloader classes. This task will be covered by the third tutorial, <a href="custom-dataloader-fw.html">Adding a Custom Dataloader</a>. </br></br>

## Setup and Assumptions

This tutorial assumes that a copy of the BioNeMo framework repo exists on a workstation or server and has been mounted inside the container at `/workspace/bionemo`, as described in the <a href="../quickstart-fw.html#code-development">Code Development section of the Quickstart Guide</a>. This path is referred to with the variable `BIONEMO_WORKSPACE` throughout the tutorial.

All commands should be executed inside the BioNeMo docker container.

In [1]:
BIONEMO_WORKSPACE = '/workspace/bionemo'

In [2]:
### Utility functions 

from IPython.display import Code
import re
import os
import shutil

def stage_files(tag: str,
                source_directory: str = f'{BIONEMO_WORKSPACE}/examples/oas_dataset'):
    """Stage files for each step of the tutorial"""
    source_path = os.path.join(source_directory, tag)
    
    data_path = os.path.join(BIONEMO_WORKSPACE, 'bionemo/data/preprocess/protein')
    shutil.copyfile(os.path.join(source_path, 'oas_paired_subset_download.sh'), 
                    os.path.join(data_path, 'oas_paired_subset_download.sh'))
    
    preprocess_path = os.path.join(BIONEMO_WORKSPACE, 'bionemo/data/preprocess/protein')
    shutil.copyfile(os.path.join(source_path, 'oas_preprocess.py'), 
                    os.path.join(preprocess_path, 'oas_preprocess.py'))
    
    config_path = os.path.join(BIONEMO_WORKSPACE, 'examples/protein/esm1nv/conf')
    shutil.copyfile(os.path.join(source_path, 'pretrain_oas.yaml'), 
                    os.path.join(config_path, 'pretrain_oas.yaml'))
    
    pretrain_path = os.path.join(BIONEMO_WORKSPACE, 'examples/protein/esm1nv')
    shutil.copyfile(os.path.join(source_path, 'pretrain_oas.py'), 
                    os.path.join(pretrain_path, 'pretrain_oas.py'))

def show_code(filename: str,
              language: str,
              start_line = None,
              end_line = None,
              end_column = None):
    """Display syntax highlighted section of code"""
    
    with open(filename, 'r') as fh:
        code = fh.readlines()

    if end_line:
        code = code[:end_line]
        code.append('...\n')
    if start_line:
        code = code[start_line:]
        code.insert(0, '...\n')
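    if end_column:
        # Rebuild the list: reassigning the loop variable would leave `code` unchanged
        code = [line[:end_column] + '...\n' if len(line) > end_column else line
                for line in code]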
        
    code = ''.join(code)
    return Code(data=code, language=language)


def filter_log(logfile_list, regex):
    """Return the log output starting from the first line that matches a regex"""

    reg = re.compile(regex)
    # Index of the first matching line; fall back to the full log if nothing matches
    start = next((i for i, line in enumerate(logfile_list) if reg.search(line)), 0)
    return '\n'.join(logfile_list[start:])

In [3]:
! rm -rf /data/OASpaired

## Accessing OAS Dataset 

In [4]:
TUTORIAL_FILE_VERSION = 'step_010_download'
stage_files(TUTORIAL_FILE_VERSION)

The [paired sequence subset of the data](https://opig.stats.ox.ac.uk/webapps/oas/oas_paired/) will be used for this tutorial. The tutorial requires a shell script containing the URLs of the appropriate files. This script cannot be downloaded directly from the website. Instead, it must be generated from the [paired sequences search page](https://opig.stats.ox.ac.uk/webapps/oas/oas_paired/) by selecting "search" without choosing any attributes, as [described here](https://opig.stats.ox.ac.uk/webapps/oas/downloads_paired/).

The full download script currently contains links to 158 sequence files. This tutorial uses a subset of the data -- the first ten files. Save the script to `$BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_paired_subset_download.sh`. If preferred, the contents shown below can be copied directly instead of generating the script from the OAS website. The contents of `oas_paired_subset_download.sh` should look like this:


In [5]:
filename = f'{BIONEMO_WORKSPACE}/bionemo/data/preprocess/protein/oas_paired_subset_download.sh'
show_code(filename=filename, language='shell')

## Downloading and Verifying Data

The `RemoteResource` class is used to create the download location (if it does not already exist), download a file, and verify its checksum. If the dataset contains multiple files (as is the case with OAS data), then one `RemoteResource` is used per file. In practice, this class is rarely interacted with directly. Instead, it is usually called as part of the second class, `ResourcePreprocessor`, which will serve as the base class for the OAS preprocessing class.

The creation of the OAS preprocessing class will require the implementation of two methods: 

1. `get_remote_resources`, which creates a `RemoteResource` for each file, downloads it, and verifies the checksum; and
2. `prepare`, which performs any preprocessing on the data and splits it into train, val, and test datasets.
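
The overall shape of such a class can be sketched without BioNeMo installed. The block below is a minimal illustration only: `SketchResource`, `SketchPreprocessor`, and `SketchOASPreprocess` are hypothetical stand-ins written for this example, not the real `RemoteResource` and `ResourcePreprocessor` classes, and the URL is a placeholder. It only shows how a subclass fills in the two required methods.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchResource:
    """Hypothetical stand-in for one downloadable file (not BioNeMo's RemoteResource)."""
    url: str
    dest_filename: str
    checksum: Optional[str] = None  # expected MD5, if known


@dataclass
class SketchPreprocessor(ABC):
    """Hypothetical stand-in base class that declares the two required methods."""
    root_directory: str = '/data'
    dest_directory: str = 'OASpaired/raw'

    @abstractmethod
    def get_remote_resources(self) -> List[SketchResource]:
        """Return one resource per file to download."""

    @abstractmethod
    def prepare(self) -> List[str]:
        """Process the data and return the resulting file paths."""


@dataclass
class SketchOASPreprocess(SketchPreprocessor):
    def get_remote_resources(self) -> List[SketchResource]:
        urls = ['https://example.org/a_paired.csv.gz']  # placeholder URL
        return [SketchResource(url=u, dest_filename=u.split('/')[-1]) for u in urls]

    def prepare(self) -> List[str]:
        # A real implementation would download and process here; this only builds paths
        return [f'{self.root_directory}/{self.dest_directory}/{r.dest_filename}'
                for r in self.get_remote_resources()]


paths = SketchOASPreprocess().prepare()
print(paths)
```

Because both methods are abstract in the stand-in base class, a subclass that forgets to implement either one cannot be instantiated, which surfaces the omission early.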

### Data Preprocessing Class

First, let's create the functionality to download the files. In the same directory as `oas_paired_subset_download.sh`, create a file called `oas_preprocess.py`. The path will be `$BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_preprocess.py`. In this file, create a class based on `ResourcePreprocessor` that parses the URLs in the download script, returns a `RemoteResource` for each of the URLs, and downloads the files referenced by the URLs.

Here is an example of such a class for `oas_preprocess.py`. This class saves the downloaded files to `/data/OASpaired/raw`. 

In [6]:
filename = f'{BIONEMO_WORKSPACE}/bionemo/data/preprocess/protein/oas_preprocess.py'
show_code(filename=filename, language='python')
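
The URL-parsing step can also be tried in isolation. The sketch below assumes the download script consists of lines of the form `wget <url>`, which is an assumption about the script format. `parse_urls` is a hypothetical helper name, and the two URLs are taken from the log output of this tutorial.

```python
import re
from typing import List


def parse_urls(script_text: str) -> List[str]:
    """Return every http(s) URL found in the text of a download script."""
    return re.findall(r"https?://[^\s'\"]+", script_text)


# Two example lines in the assumed `wget <url>` format
example_script = (
    "wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz\n"
    "wget https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz\n"
)

urls = parse_urls(example_script)
# The destination filename is the last path component of each URL
filenames = [url.rsplit('/', 1)[-1] for url in urls]
print(filenames)
```

Each parsed URL and filename pair then supplies the information needed to build one resource per file.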

### Custom YAML Config

A custom YAML configuration file is useful for making changes to the model and training configuration parameters. Copy the file `$BIONEMO_WORKSPACE/examples/protein/esm1nv/conf/pretrain_small.yaml` to `examples/protein/esm1nv/conf/pretrain_oas.yaml`. 

To this new file, make the following modifications:

* Delete the entire downstream task validation portion in the model section (`model.dwnstr_task_validation`). This can be reintroduced in the future to enable this functionality, but for now removing it will simplify working with the configuration file.
* Give the training a new name -- here `esm1nv-oas` has been chosen.
* Set `do_training` to False, since the focus is currently data preprocessing.
* Disable Weights and Biases logging for now, since it won't be used for preprocessing, by creating an `exp_manager` section and setting `create_wandb_logger` to False.

Here is what the new yaml config file looks like.

In [7]:
filename = f'{BIONEMO_WORKSPACE}/examples/protein/esm1nv/conf/pretrain_oas.yaml'
show_code(filename, language='yaml')
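
For readers who prefer to see the four edits as operations, here is a small sketch that applies them to a plain Python dictionary. The dictionary is a hypothetical, heavily trimmed stand-in for the loaded YAML: the placement of `name` and `do_training` at the top level, the original value `esm1nv`, and the placeholder keys are assumptions. Only the four changes themselves come from the list above.

```python
import copy

# Hypothetical, heavily trimmed stand-in for the contents of pretrain_small.yaml
base_config = {
    'name': 'esm1nv',        # assumed original name
    'do_training': True,
    'model': {
        'dwnstr_task_validation': {'placeholder': True},
        'other_settings': 'unchanged',
    },
}


def make_oas_config(config: dict) -> dict:
    """Apply the four edits described above to a copy of the configuration."""
    new_config = copy.deepcopy(config)
    new_config['model'].pop('dwnstr_task_validation', None)  # drop downstream validation
    new_config['name'] = 'esm1nv-oas'                        # new training name
    new_config['do_training'] = False                        # preprocessing only
    new_config.setdefault('exp_manager', {})['create_wandb_logger'] = False
    return new_config


oas_config = make_oas_config(base_config)
print(oas_config)
```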

### Python Execution Script

A Python script to execute the job is also needed. In the directory `examples/protein/esm1nv`, copy the existing pre-train script `pretrain.py` to `pretrain_oas.py`. This will be the file which performs preprocessing and runs the pre-training once the pipeline is completed.

Make the following changes to the new pre-training file:
* Remove the imports for `UniRef50Preprocess` and `FLIPPreprocess`
* Add an import for `OASPairedPreprocess` from `bionemo.data.preprocess.protein.oas_preprocess`
* Modify the section with the log `Starting Preprocessing` so that it downloads the data and calculates the MD5 checksums for each of the OAS files.

Here is an example:

In [8]:
filename = f'{BIONEMO_WORKSPACE}/examples/protein/esm1nv/pretrain_oas.py'
show_code(filename=filename, language='python')
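
To make the checksum step concrete, the following sketch computes an MD5 checksum with Python's standard `hashlib` module. It is independent of BioNeMo: `md5_checksum` is a hypothetical helper name, and the demo file is a throwaway temporary file rather than an OAS download.

```python
import hashlib
import os
import tempfile


def md5_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    md5 = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()


# Demonstration on a small temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.bin')
    with open(demo_path, 'wb') as fh:
        fh.write(b'hello')
    demo_checksum = md5_checksum(demo_path)

print(demo_checksum)
```

Reading in chunks matters for the real sequence files, which can be far larger than is comfortable to load into memory at once.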

### Testing 

Run the pipeline with the following command:

```shell
cd examples/protein/esm1nv
python pretrain_oas.py
```

The end of the logged output is shown below:

In [9]:
std_out = ! cd {BIONEMO_WORKSPACE}/examples/protein/esm1nv && python pretrain_oas.py
print(filter_log(std_out, 'Calculating Checksums'))

[NeMo I 2023-08-17 16:44:39 pretrain_oas:26] ************** Calculating Checksums ***********
[NeMo I 2023-08-17 16:44:39 oas_preprocess:27] The following URLs were parsed: ['https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358523_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358524_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Eccles_2020/csv/SRR10358525_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179273_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179274_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_2019/csv/SRR9179275_paired.csv.gz', 'https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Goldstein_20

### Results 

This should have downloaded the ten sequence files and calculated their checksums. The files are found in the path (`root_directory`/`dest_directory`) as defined in the preprocessing class (`/data/OASpaired/raw`):

In [10]:
! ls /data/OASpaired/raw

SRR10358523_paired.csv.gz  SRR11528762_paired.csv.gz  SRR9179276_paired.csv.gz
SRR10358524_paired.csv.gz  SRR9179273_paired.csv.gz   SRR9179277_paired.csv.gz
SRR10358525_paired.csv.gz  SRR9179274_paired.csv.gz
SRR11528761_paired.csv.gz  SRR9179275_paired.csv.gz


## Decompressing OAS Sequence Files

In [11]:
TUTORIAL_FILE_VERSION = 'step_020_unzip'
stage_files(TUTORIAL_FILE_VERSION)

### Data Preprocessing Class

Now, the functionality to finish processing the data will be added. The following edits should be made to `bionemo/data/preprocess/protein/oas_preprocess.py`:

* Add the checksum list to the `checksum` dictionary in `get_remote_resources`
* Create a method `prepare_resource` that downloads each file, performs any additional processing (such as unzipping the files), and returns the final, full path of each file. 
* Create a method `prepare` that runs `prepare_resource` for each file.

In [12]:
filename = f'{BIONEMO_WORKSPACE}/bionemo/data/preprocess/protein/oas_preprocess.py'
show_code(filename=filename, language='python')
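
The extraction step on its own can be sketched with the standard library. This is an illustration, not the code from `oas_preprocess.py`: `extract_gzip` is a hypothetical helper name, and the demo uses a small temporary file. It keeps the original `.gz` file next to the extracted one and skips files that have already been extracted.

```python
import gzip
import os
import shutil
import tempfile


def extract_gzip(gz_path: str) -> str:
    """Decompress `name.gz` to `name` in the same directory and return the new path."""
    if not gz_path.endswith('.gz'):
        raise ValueError(f'Expected a .gz file, got: {gz_path}')
    out_path = gz_path[:-len('.gz')]
    if not os.path.exists(out_path):  # skip work that was already done
        with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)  # stream the data instead of loading it all
    return out_path


# Demonstration with a small temporary gzip file
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, 'demo_paired.csv.gz')
    with gzip.open(gz_path, 'wb') as fh:
        fh.write(b'sequence_id_heavy,sequence_heavy\n')
    csv_path = extract_gzip(gz_path)
    with open(csv_path, 'rb') as fh:
        extracted = fh.read()
    remaining = sorted(os.listdir(tmp_dir))

print(remaining)
```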

### Python Execution Script 

Now, modify the Python pre-train script so that it creates an instance of the class and runs the `prepare` method. This is the final set of changes that needs to be made to the pre-training script.

In [13]:
filename = f'{BIONEMO_WORKSPACE}/examples/protein/esm1nv/pretrain_oas.py'
show_code(filename=filename, language='python')

### Testing 

Execute the pre-train script as before:

```shell
cd examples/protein/esm1nv
python pretrain_oas.py
```

Below is the relevant portion of the log statements:

In [14]:
std_out = ! cd {BIONEMO_WORKSPACE}/examples/protein/esm1nv && python pretrain_oas.py
print(filter_log(std_out, 'Starting Preprocessing'))

[NeMo I 2023-08-17 16:48:11 pretrain_oas:27] ************** Starting Preprocessing ***********
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:11 oas_preprocess:70] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:11 oas_preprocess:66] Downloading https://opig.stats.ox.ac.uk/webapp

### Results 

The original files already existed, so they were not downloaded again, but each of them has now been extracted.

In [15]:
! ls /data/OASpaired/raw

SRR10358523_paired.csv	   SRR11528761_paired.csv.gz  SRR9179275_paired.csv
SRR10358523_paired.csv.gz  SRR11528762_paired.csv     SRR9179275_paired.csv.gz
SRR10358524_paired.csv	   SRR11528762_paired.csv.gz  SRR9179276_paired.csv
SRR10358524_paired.csv.gz  SRR9179273_paired.csv      SRR9179276_paired.csv.gz
SRR10358525_paired.csv	   SRR9179273_paired.csv.gz   SRR9179277_paired.csv
SRR10358525_paired.csv.gz  SRR9179274_paired.csv      SRR9179277_paired.csv.gz
SRR11528761_paired.csv	   SRR9179274_paired.csv.gz


The CSV files contain an extra row at the top and a lot of additional columns. See the first three lines from `SRR10358523_paired.csv` below, as an example.

These extra columns will increase seek time during training, so they should be removed. The files also need to be divided into training, validation, and test splits and numbered consecutively within each split.

In [16]:
! head -n 3 /data/OASpaired/raw/SRR10358523_paired.csv

"{""Run"": ""SRR10358523"", ""Link"": ""https://doi.org/10.1016/j.celrep.2019.12.027"", ""Author"": ""Eccles et al., 2020"", ""Species"": ""human"", ""Age"": ""33"", ""BSource"": ""PBMC"", ""BType"": ""RV+B-Cells"", ""Vaccine"": ""None"", ""Disease"": ""None"", ""Subject"": ""Healthy-1"", ""Longitudinal"": ""no"", ""Unique sequences"": 100, ""Isotype"": ""All"", ""Chain"": ""Paired""}"
sequence_id_heavy,sequence_heavy,locus_heavy,stop_codon_heavy,vj_in_frame_heavy,productive_heavy,rev_comp_heavy,v_call_heavy,d_call_heavy,j_call_heavy,sequence_alignment_heavy,germline_alignment_heavy,sequence_alignment_aa_heavy,germline_alignment_aa_heavy,v_alignment_start_heavy,v_alignment_end_heavy,d_alignment_start_heavy,d_alignment_end_heavy,j_alignment_start_heavy,j_alignment_end_heavy,v_sequence_alignment_heavy,v_sequence_alignment_aa_heavy,v_germline_alignment_heavy,v_germline_alignment_aa_heavy,d_sequence_alignment_heavy,d_sequence_alignment_aa_heavy,d_germline_alignment_heavy,d_germline_alignme
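
The extra first row is a single CSV field holding a JSON object with the run metadata, which is why the quotes appear doubled. As a small illustration, the sketch below decodes an abbreviated copy of that row (only three of the keys shown above are kept) using the standard `csv` and `json` modules.

```python
import csv
import io
import json

# Abbreviated copy of the metadata row shown above (only three keys kept)
first_line = '"{""Run"": ""SRR10358523"", ""Species"": ""human"", ""Chain"": ""Paired""}"\n'

# The CSV reader undoes the doubled quotes and yields one field for the whole row
row = next(csv.reader(io.StringIO(first_line)))
metadata = json.loads(row[0])

print(metadata['Run'], metadata['Species'], metadata['Chain'])
```

Since this row is not tabular data, any code that reads the sequence columns needs to skip it before parsing the header.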

## Cleaning and Splitting OAS Sequence Files

In [17]:
TUTORIAL_FILE_VERSION = 'step_030_csv'
stage_files(TUTORIAL_FILE_VERSION)

### Data Preprocessing Class

A new method, `process_files`, will be created to clean up the files and create train, validation, and test splits. For this exercise, only the columns containing the sequence id and the sequence for the antibody heavy chain will be retained (`sequence_id_heavy`, `sequence_heavy`).

Edit the file `$BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_preprocess.py` to add this functionality. These are the final edits that will need to be made to the preprocessing class. Here is an example of such a file:

In [18]:
filename = f'{BIONEMO_WORKSPACE}/bionemo/data/preprocess/protein/oas_preprocess.py'
show_code(filename=filename, language='python')
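
The cleaning step can be illustrated independently of the class. The sketch below uses only the standard `csv` module; `clean_csv` is a hypothetical helper name and the input rows are made-up placeholder data, so it should be read as one possible approach rather than the tutorial's `process_files` implementation. It skips the metadata row and writes only the requested columns.

```python
import csv
import io
from typing import List, TextIO


def clean_csv(src: TextIO, dst: TextIO, columns_to_keep: List[str]) -> int:
    """Skip the metadata row, keep only the requested columns, and return the row count."""
    next(src)  # first line is run metadata, not tabular data
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=columns_to_keep,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow(row)  # columns not listed in fieldnames are dropped
        count += 1
    return count


# Made-up placeholder input with a metadata row, a header, and one record
raw = io.StringIO(
    '"{""Run"": ""demo""}"\n'
    'sequence_id_heavy,sequence_heavy,locus_heavy,sequence_id_light,sequence_light\n'
    'id_1,ACGT,H,id_1L,TTGA\n'
)
cleaned = io.StringIO()
num_rows = clean_csv(raw, cleaned, ['sequence_id_heavy', 'sequence_heavy'])
print(cleaned.getvalue())
```

Passing the column list as a parameter keeps the function reusable for other column selections, such as the light chain columns.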

### Testing

As before, execute the pre-train script:

```shell
cd examples/protein/esm1nv
python pretrain_oas.py
```

This is what the end of the log looks like once preprocessing has started:

In [19]:
std_out = ! cd {BIONEMO_WORKSPACE}/examples/protein/esm1nv && python pretrain_oas.py
print(filter_log(std_out, 'Starting Preprocessing'))

[NeMo I 2023-08-17 16:48:23 pretrain_oas:26] ************** Starting Preprocessing ***********
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528761_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 remote:117] Resource already exists, skipping download: https://opig.stats.ox.ac.uk/webapps/ngsdb/paired/Alsoiussi_2020/csv/SRR11528762_paired.csv.gz
[NeMo I 2023-08-17 16:48:23 oas_preprocess:78] Extracting the gzipped file
[NeMo I 2023-08-17 16:48:23 oas_preprocess:74] Downloading https://opig.stats.ox.ac.uk/webapp

### Results 

This has split the data into train, val, and test directories and cleaned up the data:

In [20]:
! ls /data/OASpaired/processed/heavy/*

/data/OASpaired/processed/heavy/test:
x000.csv  x001.csv

/data/OASpaired/processed/heavy/train:
x000.csv  x001.csv  x002.csv  x003.csv	x004.csv  x005.csv

/data/OASpaired/processed/heavy/val:
x000.csv  x001.csv
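
The listing is consistent with assigning whole files to splits: ten input files become six training, two validation, and two test files, each numbered from `x000.csv` within its directory. The sketch below reproduces that layout under stated assumptions. The shuffle, the seed, the default counts, and the helper names `split_files` and `numbered_names` are all hypothetical, and the preprocessing class may choose files differently.

```python
import random
from typing import Dict, List


def split_files(files: List[str], num_val: int = 2, num_test: int = 2,
                seed: int = 0) -> Dict[str, List[str]]:
    """Assign whole files to val, test, and train after a reproducible shuffle."""
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    return {
        'val': shuffled[:num_val],
        'test': shuffled[num_val:num_val + num_test],
        'train': shuffled[num_val + num_test:],
    }


def numbered_names(split: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Give the files in each split consecutive names: x000.csv, x001.csv, ..."""
    return {name: [f'{name}/x{i:03d}.csv' for i in range(len(paths))]
            for name, paths in split.items()}


input_files = [f'file_{i}.csv' for i in range(10)]  # placeholder file names
split = split_files(input_files)
names = numbered_names(split)
print(names['train'])
```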


This is what the first five lines of one of the training files look like:

In [21]:
! head -n 5 /data/OASpaired/processed/heavy/train/x000.csv

sequence_id_heavy,sequence_heavy
AAACCTGAGACTTGAA-1_contig_1,GGGAGAGGAGGCCTGTCCTGGATTCGATTCCCAGTTCCTCACATTCAGTCAGCACTGAACACGGACCCCTCACCATGAACTTCGGGCTCAGCTTGATTTTCCTTGTCCTTGTTTTAAAAGGTGTCCAGTGTGAAGTGATGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTAGCTATGCCATGTCTTGGGTTCGCCAGACTCCGGAGAAGAGGCTGGAGTGGGTCGCAACCATTAGTAGTGGTGGTAGTTACACCTACTATCCAGACAGTGTGAAGGGGCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTACCTGCAAATGAGCAGTCTGAGGTCTGAGGACACGGCCATGTATTACTGTGCAAGACGGGGGAATGATGGTTACTACGAAGACTACTGGGGCCAAGGCACCACTCTCACAGTCTCCTCAGAGAGTCAGTCCTTCCCAAATGTCTTCCCCCTCGTCTCCTGCGAGAGCCCCCTGTCTGATAAGAATCTGGTGGCCATGGGCTGCCTGG
AAACCTGAGCGCCTTG-1_contig_2,GAGCTCTGACAGAGGAGGCCAGTCCTGGAATTGATTCCCAGTTCCTCACGTTCAGTGATGAGCACTGAACACAGACACCTCACCATGAACTTTGGGCTCAGATTGATTTTCCTTGTCCTTACTTTAAAAGGTGTGAAGTGTGAAGTGCAGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCGCTTTCAGTAGCTATGACATGTCTTGGGTTCGCCAGACTCCGGAGAAGAGGCTGGAGTGGGTCGCATACATTAGTAGTGGTGGTGGTATCACCTACTATCCAGACACTGTGA

## Optional Variation: Process the Light Chain Data

What if instead the light chain columns (`sequence_id_light` and `sequence_light`) were desired? How could the existing class be subclassed to create a preprocessing class for light chains?

Starting with the existing `OASPairedPreprocess` class, the only additional changes that would need to be made are:
* Change the `columns_to_keep` to preserve the light chains instead of the heavy ones
* Optionally change the directory for the processed files

Here is an example of a class that could be added to `$BIONEMO_WORKSPACE/bionemo/data/preprocess/protein/oas_preprocess.py` to accomplish this:

```python
@dataclass
class OASPairedLightPreprocessor(OASPairedPreprocess):
    """OASPairedLightPreprocessor to download and preprocess OAS paired antibody light chain data."""
    processed_directory: str = 'OASpaired/processed/light'
    columns_to_keep = ['sequence_id_light', 'sequence_light']
```
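
How the override behaves can be checked without BioNeMo. In the sketch below, `BasePreprocessSketch` and `LightPreprocessSketch` are hypothetical stand-ins. It assumes the parent class defines `columns_to_keep` as a plain class attribute without a type annotation, as the subclass above does.

```python
from dataclasses import dataclass


@dataclass
class BasePreprocessSketch:
    """Hypothetical stand-in for OASPairedPreprocess."""
    processed_directory: str = 'OASpaired/processed/heavy'
    # No type annotation, so this is a plain class attribute, not a dataclass field
    columns_to_keep = ['sequence_id_heavy', 'sequence_heavy']


@dataclass
class LightPreprocessSketch(BasePreprocessSketch):
    """Subclass that overrides the output directory and the retained columns."""
    processed_directory: str = 'OASpaired/processed/light'
    columns_to_keep = ['sequence_id_light', 'sequence_light']


light = LightPreprocessSketch()
print(light.processed_directory, light.columns_to_keep)
```

If the parent instead declared `columns_to_keep` as an annotated dataclass field, the subclass would need to redeclare it with an annotation as well (using `field(default_factory=...)` for a list), because the generated `__init__` would otherwise keep assigning the parent's default.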