# Training covid models
### This notebook is an example usage of how to use the model alongside the covid-data-collector in order to train, evaluate and test the model
#### In this notebook you will find example usages on how to use the core functionalities of the model 

#### Import third party modules, and also the data_collector: covid19_genome and the model module

In [1]:
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1" # Uncomment to disable GPU
import glob

from model import Model, DatasetName, load_model, remove_model

__ORIG_WD__ = os.getcwd()
print(__ORIG_WD__)

os.chdir(f"{__ORIG_WD__}/../data_collectors/")
from covid19_genome import Covid19Genome

os.chdir(__ORIG_WD__)


2024-02-25 21:21:07.317608: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-25 21:21:07.427599: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-25 21:21:07.797043: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-25 21:21:07.797085: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-25 21:21:07.871357: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to

/host/home/user_7321/user_7321_backup/user_7321/docker/ProjectB---Vital/vital/models


#### Create a model, or try to load it, if it was already have been created.

In order to use the model, the first thing you have to do is provide it with a dataset (with the help of the data_collector). In the following cell you are provided with an example that create the dataset.

You should note that when you are creating the dataset, you are passing the dataset type. You can obtain the available dataset types in the system by calling the model class function ```get_ds_types()```

In [2]:
model_name = "cov19-1024e"

try:
    print("loading model")
    model = load_model(model_name)
except Exception as e:
    print (e)
    print("creating model")
    covid19_genome = Covid19Genome()
    lineages = covid19_genome.getLocalLineages(1024)
    lineages.sort()
    dataset = []
    def get_dataset():
        for lineage in lineages:
            dataset.append((lineage, covid19_genome.getLocalAccessionsPath(lineage)))
        return dataset

    portions = {
        DatasetName.trainset.name: 0.8,
        DatasetName.validset.name: 0.1,
        DatasetName.testset.name: 0.1
    }

    dataset = get_dataset()
    model = Model(model_name)
    model.create_datasets(model.get_ds_types()[0], dataset, portions)

loading model
++++
./data/cov19-1024e


2024-02-25 21:21:22.197689: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-25 21:21:22.372639: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2256] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...


After you have created the model, and created its datasets. You can check which neural network structures is available. You can do that by calling the model class function ```get_ml_model_structure()```.

After you see all the ml_model structures available in the system, you can check which hyper parameters are needed to define each and every ml_model structure. This is done by calling the model class function ```get_ml_model_structure_hps()```. The ```get_ml_model_structure_hps()``` will return which hps are required, and what it their type.

In [3]:
print(model.get_ml_model_structures())
print(model.get_ml_model_structure_hps(model.get_ml_model_structures()[-1]))

['VitStructure', 'ConvStructure', 'VitStructure_ex', 'CLSTMStructure']
{'d_model': 'required', 'd_val': 'required', 'd_key': 'required', 'd_ff': 'required', 'heads': 'required', 'dropout_rate': 'optional', 'regularizer': 'optional', 'initializer': 'optional', 'activation': 'optional', 'LSTM_repeats': 'required', 'labels': 'required', 'LSTM_units': 'required'}


You can also see which properties help define the current type of dataset by calling to the model class function ```get_ds_props()``` This function could be called only after the dataset have been succesfully created. This function will return the properties of the dataset as well as their values.

In [4]:
print(model.get_ds_props())

{'coverage': 4, 'substitution_rate': 0.005, 'insertion_rate': 0.001, 'deletion_rate': 0.001, 'read_length': 128, 'frag_len': 128, 'num_frags': 256}


A use case of the system with the VitStructure model and the minhash genome datasets (a.k.a. mh_genome_ds).

In the mh_genome_ds the coverage is a dataset property that sets the genome coverage rate.

In the VitStructure, the model_depth is the number of transformer encoders.

In this example use-case these two parameters will help us define a neural network that will be trained on the dataset (with the current coverage rate)

In [5]:
sequencer_instrument_to_error_profile_map = {
    "illumina": {
        "substitution_rate": 0.005,
        "insertion_rate": 0.001,
        "deletion_rate": 0.001
    },
    "ont": {
        "substitution_rate": 0.01,
        "insertion_rate": 0.04,
        "deletion_rate": 0.04
    },
    "pacbio": {
        "substitution_rate": 0.005,
        "insertion_rate": 0.025,
        "deletion_rate": 0.025
    },
    "roche": {
        "substitution_rate": 0.002,
        "insertion_rate": 0.01,
        "deletion_rate": 0.01
    }
}

In [6]:
coverage = 4
ml_model_depth = 1
# coverage_list = [1, 4]
# ml_model_depth_list = [1, 2, 4]
sequencer_instrument = "illumina"
batch_size = 1024
mini_batch_size = 256

def get_model_name(ml_model_depth, coverage, sequencer_instrument):
    if not sequencer_instrument in sequencer_instrument_to_error_profile_map:
        raise Exception(f"Invalid sequencer instrument: {sequencer_instrument}")
    return f"clstm.{ml_model_depth}.{coverage}x.{sequencer_instrument}"

In [7]:
def write_to_res_file(results_file, model, ml_model_name):
    categorical_acc = float(model.ml_models[ml_model_name].net.metrics[0].result())
    loss = float(model.ml_models[ml_model_name].net.metrics[1].result())
    results_file.write(f"\nCategorical Accuracy is {categorical_acc}\n")
    results_file.write(f"\nloss is {loss}\n")

In [None]:
# Hyperparamaters exploration
# Search Grid: Dropout, learning rate, regularizer
epochs = 10


In [6]:
results = open('results.txt','a')
for coverage in coverage_list:
    for ml_model_depth in ml_model_depth_list:
        ml_model_name = get_model_name(ml_model_depth, coverage, sequencer_instrument)
        print(ml_model_name)
        results.write(f"Model Name is : {ml_model_name}\n\n")
        results.write(f"Model Depth is : {ml_model_depth}\n")
        results.write(f"Coverage is : {coverage}\n")
        newly_added = True
        try:
            model.add_ml_model(ml_model_name, hps={
                "structure": model.get_ml_model_structures()[2],
                "d_model": model.get_ds_props()["frag_len"],
                "seq_len": model.get_ds_props()["num_frags"],
                "d_val": 128,
                "d_key": 128,
                "heads": 8,
                "d_ff": 128+256,
                "labels":  len(model.get_labels()),
                "activation": "relu",
                "optimizer": {
                    "name": "Adam",
                    "params": {
                        "learning_rate": 0.001,
                    },
                },
                # "encoder_repeats": ml_model_depth,
                "LSTM_repeats":  ml_model_depth,
                "LSTM_units": 64,
                "regularizer": {
                    "name": "l2",
                    "params": {
                        "l2": 0.0001
                    }
                },
                "dropout": 0.2,
                "initializer": "he_normal"
            })
        except Exception as e:
            print (e)
            newly_added = False
            print("Model already exists")

        model.update_ds_props({"coverage": coverage,} 
        | sequencer_instrument_to_error_profile_map[sequencer_instrument])
        model.set_ds_batch_size(mini_batch_size)

        model.train(ml_model_name, epochs=30)
        results.write(f"Train Results : {ml_model_name}\n")
        write_to_res_file(results, model, model_name)
        results.write(f"Validation Results : {ml_model_name}\n")
        model.evaluate(ml_model_name)
        write_to_res_file(results, model, model_name)
        results.write(f"Test Results : {ml_model_name}\n")
        model.test(ml_model_name)
        write_to_res_file(results, model, model_name)
print("DONE")

NameError: name 'coverage_list' is not defined