# Adding the OAS Dataset: Modifying the Dataset Class

This tutorial is the second part of a series focused on adding a new dataset to BioNeMo using the [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/) database. There are three steps to this task:

1. Preprocessing includes downloading the raw data and any additional preparation steps, such as extracting the files. It also includes dividing the data into train, validation, and test splits. The preprocessing step can make use of two BioNeMo base classes, `RemoteResource` and `ResourcePreprocessor`, from `bionemo.utils.remote` and `bionemo.data.preprocess.dna.preprocess`, respectively. Their use is optional, but they provide some basic functionality that can accelerate development. This step was covered by the previous tutorial, <a href="custom-dataset-preprocessing-fw.html">Downloading and Preprocessing</a>. </br></br>
2. Development of the new dataset class. Here, the NeMo dataset class [CSVMemMapDataset](https://github.com/NVIDIA/NeMo/blob/b0e5bf3627dbcfb3f4a72d73d3c5e92184d8b1f6/nemo/collections/nlp/data/language_modeling/text_memmap_dataset.py#L286) will be used. This step will be completed during the current tutorial. </br></br>
3. Modification of the dataloader classes. This task will be covered by the third tutorial, <a href="custom-dataloader-fw.html">Adding a Custom Dataloader</a>. </br></br>

This tutorial assumes the first step has been completed successfully.

## Setup and Assumptions

This tutorial assumes that a copy of the BioNeMo framework repo exists on a workstation or server and has been mounted inside the container at `/workspace/bionemo`, as described in the <a href="../quickstart-fw.html#code-development">Code Development section of the Quickstart Guide</a>. This path is referred to by the variable `BIONEMO_WORKSPACE` throughout the tutorial.

All commands should be executed inside the BioNeMo docker container.

In [1]:
BIONEMO_WORKSPACE = '/workspace/bionemo'

In [2]:
### Utility functions 

from IPython.display import Code
import re
import os
import shutil

def stage_files(tag: str,
                source_directory: str = f'{BIONEMO_WORKSPACE}/examples/oas_dataset'):
    """Stage files for each step of the tutorial"""
    source_path = os.path.join(source_directory, tag)
    
    data_path = os.path.join(BIONEMO_WORKSPACE, 'bionemo/data/preprocess/protein')
    shutil.copyfile(os.path.join(source_path, 'oas_paired_subset_download.sh'), 
                    os.path.join(data_path, 'oas_paired_subset_download.sh'))
    
    preprocess_path = os.path.join(BIONEMO_WORKSPACE, 'bionemo/data/preprocess/protein')
    shutil.copyfile(os.path.join(source_path, 'oas_preprocess.py'), 
                    os.path.join(preprocess_path, 'oas_preprocess.py'))
    
    config_path = os.path.join(BIONEMO_WORKSPACE, 'examples/protein/esm1nv/conf')
    shutil.copyfile(os.path.join(source_path, 'pretrain_oas.yaml'), 
                    os.path.join(config_path, 'pretrain_oas.yaml'))
    
    pretrain_path = os.path.join(BIONEMO_WORKSPACE, 'examples/protein/esm1nv')
    shutil.copyfile(os.path.join(source_path, 'pretrain_oas.py'), 
                    os.path.join(pretrain_path, 'pretrain_oas.py'))

def show_code(filename: str,
              language: str,
              start_line = None,
              end_line = None,
              end_column = None):
    """Display syntax highlighted section of code"""
    
    with open(filename, 'r') as fh:
        code = fh.readlines()

    if end_line:
        code = code[:end_line]
        code.append('...\n')
    if start_line:
        code = code[start_line:]
        code.insert(0, '...\n')
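    if end_column:
        # Build a new list: reassigning the loop variable would leave `code` unchanged.
        # Only lines longer than `end_column` are truncated and marked with '...'.
        code = [line if len(line.rstrip('\n')) <= end_column
                else line[:end_column] + '...\n' for line in code]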
        
    code = ''.join(code)
    return Code(data=code, language=language)


def filter_log(logfile_list, regex):
    """Drop the log lines that precede the first line matching a regex"""

    reg = re.compile(regex)
    # Index of the first matching line, or None if no line matches
    first_match = next((i for i, line in enumerate(logfile_list) if reg.search(line)), None)
    if first_match is None:
        return ''
    return '\n'.join(logfile_list[first_match:])

def clean_progress_bar(logfile_list):
    """Remove incremental progress bar entries. Must also prune empty lines."""

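    # Match tqdm-style bars such as ' 50%|#####     | 5/10'; the pipe after the
    # percentage is escaped so it is a literal character rather than an alternation
    progress_reg = re.compile(r"""\d+%\|.+?\| (?P<cur_iter>\d+)/(?P<max_iter>\d+)""")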
    clean_list = []

    for row in logfile_list:
        keep_line = True
        row = row.strip()

        progress_match = re.search(progress_reg, row)
        if progress_match:
            if progress_match.group('cur_iter') != progress_match.group('max_iter'):
                keep_line = False

        if keep_line:
            clean_list.append(row)

    return '\n'.join(clean_list)

## Configuring the CSV Memory Mapped Dataset

In [3]:
TUTORIAL_FILE_VERSION = 'step_040_dataset'
stage_files(TUTORIAL_FILE_VERSION)

### Custom YAML Config

BioNeMo uses memory mapping to enable the flexibility of text-based data formats, such as CSV, while also minimizing memory usage. The key settings for [CSVMemMapDataset](https://github.com/NVIDIA/NeMo/blob/b0e5bf3627dbcfb3f4a72d73d3c5e92184d8b1f6/nemo/collections/nlp/data/language_modeling/text_memmap_dataset.py#L286) that must be changed in the `model.data` section of the YAML configuration file are:

* `dataset_paths`: the location of the data files for each split. For the OAS heavy chain data, the path is `/data/OASpaired/processed/heavy`, which contains the `train`, `val`, and `test` directories.
* `data_col`: the zero-based integer number of the column containing the pretraining data. This will be set to `1` to select the column `sequence_heavy`.
* `data_sep`: the delimiter for the CSV dataset, defaults to '`,`'. This will not need to be changed.
* `header_lines`: the number of header lines in the data files, defaults to `1`. This will not need to be changed.
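
To make the roles of `data_col`, `data_sep`, and `header_lines` concrete, the following minimal sketch mimics how those three settings select one field per row from a CSV file. It uses only the Python standard library and is not the `CSVMemMapDataset` implementation. The helper `read_column`, the column name `sequence_id`, and the sample sequences are hypothetical and exist only for illustration; only the column name `sequence_heavy` comes from this tutorial.

```python
import csv
import io


def read_column(text, data_col=1, data_sep=',', header_lines=1):
    """Return the field at index `data_col` from every non-header row."""
    reader = csv.reader(io.StringIO(text), delimiter=data_sep)
    rows = list(reader)
    # Skip the header rows, ignore blank rows, and keep a single column
    return [row[data_col] for row in rows[header_lines:] if row]


# Hypothetical sample data, for illustration only
sample_csv = (
    "sequence_id,sequence_heavy\n"
    "id_0,EVQLVESGG\n"
    "id_1,QVQLQESGP\n"
)

sequences = read_column(sample_csv)
print(sequences)
```

With `data_col=1` and `header_lines=1`, the header row is skipped and only the second field of each remaining row is returned.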

The range of existing data files must also be updated to reflect that there are six files (named `x000.csv` through `x005.csv`) for training and two files (`x000.csv` and `x001.csv`) each for the validation and test data. `do_training` will also be set to `True`, since a pretraining run is required to test the dataset class.
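
As a quick sanity check before editing the file ranges, the sketch below builds the file names expected for each split and reports any that are missing. It assumes that the `train`, `val`, and `test` directories sit directly under the dataset path and that the file counts match those described above. The helper names `expected_files` and `missing_files` are hypothetical and not part of BioNeMo.

```python
from pathlib import Path

# Number of CSV files per split, as described in this tutorial
FILES_PER_SPLIT = {'train': 6, 'val': 2, 'test': 2}


def expected_files(root, files_per_split=FILES_PER_SPLIT):
    """Build the expected paths, e.g. <root>/train/x000.csv through x005.csv."""
    root = Path(root)
    return {
        split: [root / split / f'x{index:03d}.csv' for index in range(count)]
        for split, count in files_per_split.items()
    }


def missing_files(root, files_per_split=FILES_PER_SPLIT):
    """Return the expected paths that do not exist on disk."""
    return [
        path
        for paths in expected_files(root, files_per_split).values()
        for path in paths
        if not path.is_file()
    ]


# Assumed layout: /data/OASpaired/processed/heavy/{train,val,test}/x000.csv ...
for path in missing_files('/data/OASpaired/processed/heavy'):
    print(f'Missing: {path}')
```

If nothing is printed, every expected file is present under the assumed layout.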

The YAML configuration file below demonstrates these changes. Config files are located in `{BIONEMO_WORKSPACE}/examples/protein/esm1nv/conf/`.

In [4]:
filename = f'{BIONEMO_WORKSPACE}/examples/protein/esm1nv/conf/pretrain_oas.yaml'
show_code(filename, language='yaml')

### Testing 

No additional changes should need to be made to other files before testing.

As before, execute the pretrain script:

```shell
cd examples/protein/esm1nv
python pretrain_oas.py
```

The entire log is shown this time for completeness, but the sections associated with loading data can be found by searching for the text "Loading data from".
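
When the full log is long, the relevant lines can also be pulled out programmatically. The sketch below assumes the log has been captured as a list of strings, as `std_out` is in the next cell. The helper `find_lines` and the placeholder log lines are hypothetical and do not reproduce real NeMo output.

```python
def find_lines(log_lines, text='Loading data from'):
    """Return (line_number, line) pairs for the lines that contain `text`."""
    return [(number, line)
            for number, line in enumerate(log_lines, start=1)
            if text in line]


# Placeholder lines for illustration only; real log lines will differ
example_log = [
    'first placeholder line',
    'Loading data from <placeholder path>',
    'last placeholder line',
]

for number, line in find_lines(example_log):
    print(f'{number}: {line}')
```

Passing the captured `std_out` list instead of `example_log` would list each data-loading message together with its position in the log.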

In [5]:
! rm -rf /result/nemo_experiments/esm1nv-oas
std_out = ! cd {BIONEMO_WORKSPACE}/examples/protein/esm1nv && python pretrain_oas.py
print('\n'.join(std_out))

[NeMo W 2023-08-17 16:22:29 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-08-17 16:22:29 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
    
    See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    
[NeMo I 2023-08-17 16:22:30 pretrain_oas:14] 
    
    ************** Experiment configuration ***********
[NeMo I 2023-08-17 16:22:30 pretrain_oas:15] 
    name: esm1nv-oas
    do_training: true
    do_testing: false
    restore_from_path: null
    trainer:
      devices: 1
      num_nodes: 1
      accelerator: gpu
      precision: 16-mixed
      logger: false
      enable_checkpoi

### Results 

The training run will create a directory called `esm1nv-oas_pretraining` in `/result/nemo_experiments/esm1nv-oas` containing the files (logs, checkpoints, etc.) for the training run:

In [6]:
! ls /result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining

checkpoints
cmd-args.log
events.out.tfevents.1692289352.drugdiscovery3-dt.335.0
git-info.log
hparams.yaml
lightning_logs.txt
nemo_error_log.txt
nemo_log_globalrank-0_localrank-0.txt
