# Training covid models
### This notebook shows how to use the model together with the covid-data-collector to train, evaluate, and test the model
#### It walks through examples of the model's core functionality

#### Import third-party modules, the data collector (covid19_genome), and the model module

In [1]:
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1" # Uncomment to disable GPU
import glob

from model import Model, DatasetName, load_model, remove_model

__ORIG_WD__ = os.getcwd()

os.chdir(f"{__ORIG_WD__}/../data_collectors/")
from covid19_genome import Covid19Genome

os.chdir(__ORIG_WD__)


2024-03-04 17:38:13.846578: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-04 17:38:13.909467: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-04 17:38:14.277567: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2024-03-04 17:38:14.277624: W tensorflow/com
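The import cell above changes the working directory to import the data collector and then changes it back. If the import fails, the notebook is left in the wrong directory. A minimal sketch of a safer pattern, assuming the directory change itself is needed: the `working_directory` helper below is hypothetical (it is not part of the model or data collector modules), and the temporary directory stands in for `../data_collectors/`.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def working_directory(path):
    """Temporarily switch the current working directory to `path`."""
    original = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Always restore the original directory, even if the body raises.
        os.chdir(original)


# Demonstration with a throwaway directory (a stand-in for the real
# data collector directory): the original directory is restored even
# though the body raises.
before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    try:
        with working_directory(tmp):
            raise ImportError("simulated failing import")
    except ImportError:
        pass
after = os.getcwd()
print(before == after)
```

In the notebook, the `from covid19_genome import Covid19Genome` line would go inside the `with` block.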

#### Create a model, or load it if it has already been created

To use the model, you first have to provide it with a dataset (with the help of the data_collector). The following cell shows an example that creates the dataset.

Note that you pass the dataset type when creating the dataset. You can list the dataset types available in the system by calling the model class function ```get_ds_types()```.

In [2]:
model_name = "cov19-1024e"

try:
    print("loading model")
    model = load_model(model_name)
except Exception:
    print("creating model")
    covid19_genome = Covid19Genome()
    lineages = covid19_genome.getLocalLineages(1024)
    lineages.sort()
    dataset = []
    def get_dataset():
        for lineage in lineages:
            dataset.append((lineage, covid19_genome.getLocalAccessionsPath(lineage)))
        return dataset

    portions = {
        DatasetName.trainset.name: 0.8,
        DatasetName.validset.name: 0.1,
        DatasetName.testset.name: 0.1
    }

    dataset = get_dataset()
    model = Model(model_name)
    model.create_datasets(model.get_ds_types()[0], dataset, portions)

loading model
++++
./data/cov19-1024e


2024-03-04 17:38:15.369719: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-03-04 17:38:15.384292: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-03-04 17:38:15.384416: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-03-04 17:38:15.384804: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operati
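The `portions` dictionary passed to `create_datasets` assigns a fraction of the data to the train, validation, and test sets. How the model module performs the split internally is not shown in this notebook, so the following is only an illustrative sketch. The `split_by_portions` helper and the accession IDs are hypothetical.

```python
def split_by_portions(items, portions):
    """Split `items` into consecutive slices sized by `portions` (name -> fraction)."""
    total = sum(portions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"portions must sum to 1, got {total}")
    splits, start = {}, 0
    names = list(portions)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(items)  # the last split takes whatever remains
        else:
            end = start + round(len(items) * portions[name])
        splits[name] = items[start:end]
        start = end
    return splits


# Hypothetical accession IDs for a single lineage.
accessions = [f"ACC{i:03d}" for i in range(20)]
portions = {"trainset": 0.8, "validset": 0.1, "testset": 0.1}
splits = split_by_portions(accessions, portions)
print({name: len(part) for name, part in splits.items()})
```

Checking that the fractions sum to 1 up front catches a mistyped portion before any data is written.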

After you have created the model and its datasets, you can check which neural network structures are available by calling the model class function ```get_ml_model_structures()```.

Once you know which ml_model structures are available in the system, you can check which hyperparameters (hps) each structure needs by calling the model class function ```get_ml_model_structure_hps()```. It returns each hp together with whether it is required or optional.

In [3]:
print(model.get_ml_model_structures())
print(model.get_ml_model_structure_hps(model.get_ml_model_structures()[1]))

['VitStructure', 'ConvStructure', 'VitStructure_ex', 'CLSTMStructure']
{'convclass': 'required', 'convnet': 'required', 'labels': 'required', 'seq_len': 'required', 'regularizer': 'optional', 'initializer': 'optional', 'd_model': 'required', 'activation': 'optional'}
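The required/optional map printed above can be used to check a hyperparameter dictionary before passing it to `add_ml_model`. The sketch below assumes nothing about the model module beyond that printed output. `missing_required_hps` is a hypothetical helper, and the candidate values are illustrative.

```python
def missing_required_hps(spec, hps):
    """Return the names marked 'required' in `spec` that are absent from `hps`."""
    return sorted(
        name for name, kind in spec.items()
        if kind == "required" and name not in hps
    )


# Spec copied from the ConvStructure output printed above.
conv_spec = {
    "convclass": "required", "convnet": "required", "labels": "required",
    "seq_len": "required", "regularizer": "optional", "initializer": "optional",
    "d_model": "required", "activation": "optional",
}

# Illustrative candidate that forgets d_model.
candidate = {
    "convclass": "resnet_v2", "convnet": "ResNet152V2",
    "labels": 188, "seq_len": 256,
}

print(missing_required_hps(conv_spec, candidate))  # ['d_model']
```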


You can also see which properties define the current type of dataset by calling the model class function ```get_ds_props()```. This function can be called only after the dataset has been successfully created. It returns the properties of the dataset together with their values.

In [4]:
print(model.get_ds_props())

{'coverage': 4, 'substitution_rate': 0.005, 'insertion_rate': 0.001, 'deletion_rate': 0.001, 'read_length': 128, 'frag_len': 128, 'num_frags': 256}
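The `coverage` and `read_length` properties printed above are related to the number of reads by the usual definition of sequencing coverage: total sequenced bases divided by genome length. Whether the dataset generator uses exactly this formula is an assumption. The sketch only illustrates the scale involved, and `reads_for_coverage` is a hypothetical helper.

```python
import math


def reads_for_coverage(genome_length, coverage, read_length):
    """Smallest read count with reads * read_length >= coverage * genome_length."""
    return math.ceil(coverage * genome_length / read_length)


# 29,903 nt is the length of the SARS-CoV-2 reference genome (NC_045512.2).
genome_length = 29_903
print(reads_for_coverage(genome_length, coverage=4, read_length=128))  # 935
```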


The following use case combines the VitStructure model with the minhash genome dataset (a.k.a. mh_genome_ds).

In mh_genome_ds, coverage is a dataset property that sets the genome coverage rate.

In VitStructure, the model depth (```ml_model_depth``` below, passed as the ```encoder_repeats``` hp) is the number of transformer encoders.

In this example, these two parameters define a neural network that is trained on the dataset at the current coverage rate.

In [5]:
coverage = 4
ml_model_depth = 1
sequencer_instrument = "illumina"
batch_size = 1024
mini_batch_size = 256

def write_to_res_file(results_file, model, ml_model_name):
    categorical_acc = float(model.ml_models[ml_model_name].net.metrics[0].result())
    loss = float(model.ml_models[ml_model_name].net.metrics[1].result())
    results_file.write(f"\nCategorical Accuracy is {categorical_acc}\n")
    results_file.write(f"\nloss is {loss}\n")

In [6]:
sequencer_instrument_to_error_profile_map = {
    "illumina": {
        "substitution_rate": 0.005,
        "insertion_rate": 0.001,
        "deletion_rate": 0.001
    },
    "ont": {
        "substitution_rate": 0.01,
        "insertion_rate": 0.04,
        "deletion_rate": 0.04
    },
    "pacbio": {
        "substitution_rate": 0.005,
        "insertion_rate": 0.025,
        "deletion_rate": 0.025
    },
    "roche": {
        "substitution_rate": 0.002,
        "insertion_rate": 0.01,
        "deletion_rate": 0.01
    }
}

def get_model_name(ml_model_depth, coverage, sequencer_instrument):
    if sequencer_instrument not in sequencer_instrument_to_error_profile_map:
        raise ValueError(f"Invalid sequencer instrument: {sequencer_instrument}")
    return f"vitex.{ml_model_depth}.{coverage}xxxx.{sequencer_instrument}"

ml_model_name = get_model_name(ml_model_depth, coverage, sequencer_instrument)
print(ml_model_name)

vitex.1.4xxxx.illumina
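Each error profile above gives per-base substitution, insertion, and deletion rates. To make those numbers concrete, here is a minimal sketch of applying such a profile to a read. It is not the data collector's implementation, which this notebook does not show. `apply_error_profile` is a hypothetical helper, and the random read is made up.

```python
import random

BASES = "ACGT"


def apply_error_profile(read, profile, rng):
    """Return a copy of `read` with random substitutions, insertions and deletions."""
    out = []
    for base in read:
        if rng.random() < profile["deletion_rate"]:
            continue  # drop this base
        if rng.random() < profile["substitution_rate"]:
            # replace the base with one of the three other bases
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
        if rng.random() < profile["insertion_rate"]:
            out.append(rng.choice(BASES))  # insert a random base after it
    return "".join(out)


illumina = {"substitution_rate": 0.005, "insertion_rate": 0.001, "deletion_rate": 0.001}
rng = random.Random(0)  # fixed seed so the run is repeatable
read = "".join(rng.choice(BASES) for _ in range(128))
noisy = apply_error_profile(read, illumina, rng)
print(len(read), len(noisy))
```

With the higher indel rates of the `ont` profile, the output length varies more from the input length, which is one reason the error profile is stored alongside the other dataset properties.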


#### Adding a new neural network

In this cell we create an ml_model with the required hps, plus some optional ones, as listed in the earlier output.

In [7]:
newly_added = True
try:
    model.add_ml_model(ml_model_name, hps={
        "structure": model.get_ml_model_structures()[2],
        "d_model": model.get_ds_props()["frag_len"],
        "seq_len": model.get_ds_props()["num_frags"],
        "d_val": 128,
        "d_key": 128,
        "heads": 8,
        "d_ff": 128+256,
        "convclass": "resnet_v2",
        "convnet": "ResNet152V2",
        "labels":  len(model.get_labels()),
        "activation": "relu",
        "optimizer": {
            "name": "Adam",
            "params": {
                "learning_rate": 0.001,
            },
        },
        "encoder_repeats": ml_model_depth,
        # "LSTM_repeats":  ml_model_depth,
        # "LSTM_units": 64,
        "pooling_width": 2,
        "kernel_width": 3,
        "regularizer": {
            "name": "l2",
            "params": {
                "l2": 0.0001
            }
        },
        "dropout": 0.2,
        "initializer": "he_normal"
    })
except Exception as e:
    print(e)
    newly_added = False
    print("Model already exists")

compiled the model with the following metrics: ['categorical_accuracy', 'AUC']
Heyyyyy
input size: (1, 256, 128, 4)
x after embedding size: (1, 256, 128, 1)
x before encoder size: (1, 256, 128)
x after encoder size: (1, 256, 128)
x after CNNEX size: (1, 115)
output size: (1, 188)
Heyyyyy
input size: (1, 256, 128, 4)
x after embedding size: (1, 256, 128, 1)
x before encoder size: (1, 256, 128)


x after encoder size: (1, 256, 128)


2024-03-04 17:38:16.706540: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:630] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2024-03-04 17:38:16.754462: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8700


x after CNNEX size: (1, 115)
output size: (1, 188)


In [8]:
# ml_model_class = "efficientnet"
# ml_model_name = "efficient_net_b0"

In [9]:
model.get_ml_model_structures()

['VitStructure', 'ConvStructure', 'VitStructure_ex', 'CLSTMStructure']

In [10]:
# import sys
# try:
#     model.add_ml_model(ml_model_name, hps={
#         "structure": model.get_ml_model_structures()[-1],
#         "convclass": ml_model_class,
#         "convnet": ml_model_name,
#         "seq_len": model.get_ds_props()["num_frags"],
#         "d_model": model.get_ds_props()["frag_len"],
#         "labels":  len(model.get_labels()),
#         "batch_size": batch_size,
#         "mini_batch_size": mini_batch_size,
#     })
# except:
#     # print exception info
#     print(sys.exc_info()[0])
#     print(sys.exc_info()[1])

In [11]:
models = model.list_ml_models()
print(models)

['vit.1.1xxxx.illumina', 'clstm.1.4xxxx.illumina', 'conv.1.4xxxx.illumina', 'vitex.1.4xxxx.illumina']
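The names listed above follow the pattern built by `get_model_name`: a prefix, the model depth, the coverage followed by `xxxx`, and the sequencer instrument, separated by dots. A small sketch of reading those fields back out of a name, assuming every registered name follows that convention. `parse_ml_model_name` and `MlModelName` are hypothetical.

```python
from typing import NamedTuple


class MlModelName(NamedTuple):
    prefix: str
    depth: int
    coverage: int
    instrument: str


def parse_ml_model_name(name):
    """Split '<prefix>.<depth>.<coverage>xxxx.<instrument>' into its fields."""
    parts = name.split(".")
    if len(parts) != 4 or not parts[2].endswith("xxxx"):
        raise ValueError(f"Unexpected ml_model name: {name}")
    prefix, depth, coverage, instrument = parts
    return MlModelName(prefix, int(depth), int(coverage[:-4]), instrument)


for name in ["vit.1.1xxxx.illumina", "vitex.1.4xxxx.illumina"]:
    print(parse_ml_model_name(name))
```

This is handy for filtering the output of `list_ml_models()` by coverage or instrument.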


In [12]:
# if newly_added:
#     assert False, "Please consider doing transfer learning"
# # model.transfer(get_model_name(ml_model_depth, coverage * 2, sequencer_instrument), ml_model_name, False)

In [13]:
# model.change_ml_hps(ml_model_name, {
#     "regularizer": {
#         "name": "l2",
#         "params": {
#             "l2": 0.00005,
#         },
#     },
#     "optimizer": {
#         "name": "AdamW",
#         "params": {
#             "learning_rate": 0.0002,
#         },
#     },
# })

#### Updating the dataset coverage and error profile

In [14]:
model.update_ds_props({
    "coverage": coverage,
    } | sequencer_instrument_to_error_profile_map[sequencer_instrument])
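The cell above merges two dictionaries with the `|` operator, which requires Python 3.9 or later. The sketch below shows what the merge produces and an equivalent spelling for older interpreters. The dictionaries are copies of the values used in this notebook.

```python
base = {"coverage": 4}
illumina = {"substitution_rate": 0.005, "insertion_rate": 0.001, "deletion_rate": 0.001}

merged_union = base | illumina        # dict union, Python 3.9+
merged_unpack = {**base, **illumina}  # same result on older Python 3 versions

print(merged_union == merged_unpack)
print(merged_union)

# On duplicate keys, the right-hand operand wins.
print({"coverage": 4} | {"coverage": 8})  # {'coverage': 8}
```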

#### Setting dataset batch size and training

In [15]:
# mini_batch_size = batch_size
# while True:
#     try:
#         model.set_ds_batch_size(mini_batch_size)
#         model.train(ml_model_name, epochs=30)
#         break  # training succeeded; stop retrying
#     except Exception:
#         # Halve the batch size and retry (assumes the failure was memory-related)
#         mini_batch_size = mini_batch_size // 2
#         model.change_ml_hps(ml_model_name, {
#             "batch_size": mini_batch_size,
#         })
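The commented-out cell above retries training with a smaller batch size after a failure. A minimal, terminating sketch of that idea follows. `train_with_backoff` and `fake_train` are hypothetical stand-ins, and `MemoryError` is used only for the demonstration. With TensorFlow, a GPU out-of-memory condition usually surfaces as `tf.errors.ResourceExhaustedError` instead.

```python
def train_with_backoff(train_fn, batch_size, min_batch_size=1):
    """Call `train_fn(batch_size)`, halving the batch size after each MemoryError."""
    while batch_size >= min_batch_size:
        try:
            train_fn(batch_size)
            return batch_size  # training succeeded with this batch size
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("training failed even at the minimum batch size")


attempts = []


def fake_train(batch_size):
    """Stand-in for model.train that 'runs out of memory' above 256."""
    attempts.append(batch_size)
    if batch_size > 256:
        raise MemoryError("simulated out-of-memory")


used = train_with_backoff(fake_train, 1024)
print(used, attempts)  # 256 [1024, 512, 256]
```

Catching only the memory-related exception, rather than every exception, keeps unrelated failures such as a misspelled model name from being silently retried.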

In [16]:
model.set_ds_batch_size(mini_batch_size)
# results = open('results.txt','a')
# results.write(f"Model Name is : {ml_model_name}\n\n")
# results.write(f"Model Depth is : {ml_model_depth}\n")
# results.write(f"Coverage is : {coverage}\n")

model.train(ml_model_name, epochs=30)
# results.write(f"Train Results : {ml_model_name}\n")
# write_to_res_file(results, model, ml_model_name)
# results.write(f"Validation Results : {ml_model_name}\n")
model.evaluate(ml_model_name)
# write_to_res_file(results, model, ml_model_name)
# results.write(f"Test Results : {ml_model_name}\n")
model.test(ml_model_name)
# write_to_res_file(results, model, ml_model_name)

Epoch 1/30
Heyyyyy
input size: (None, 256, 128, 4)
x after embedding size: (None, 256, 128, 1)
x before encoder size: (None, 256, 128)
x after encoder size: (None, 256, 128)
x after CNNEX size: (None, 115)
output size: (None, 188)
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


2024-03-04 17:38:22.058800: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape invit_structure_ex/sequential/dropout/dropout/SelectV2-2-TransposeNHWCToNCHW-LayoutOptimizer
2024-03-04 17:38:29.897635: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x7f5241808e10 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-03-04 17:38:29.897655: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2024-03-04 17:38:29.900196: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-03-04 17:38:29.963161: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


KeyboardInterrupt: 

In [None]:

# model.evaluate(ml_model_name)

0.00754310330376029
4.746018886566162


In [None]:

# model.test(ml_model_name)

25