# ProPythia DNA Deep Learning module quick start

This notebook walks through every step of the Deep Learning module, which covers a complete deep learning pipeline. The steps are:

- Data reading and validation
- Encoders
- DNA Descriptors
- Data splitting
- Model building and training
- Hyperparameter tuning

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import sys
sys.path.append("../")

## 1. Data reading and validation

(The machine learning pipeline uses the same module to read and validate the sequences.)

This module comprises functions to read and validate DNA sequences. First, it is necessary to create a `ReadDNA` object.

In [2]:
from read_sequence import ReadDNA
reader = ReadDNA()

It is possible to create sequence objects from a single DNA sequence, a *CSV* file, or a *FASTA* file. A single sequence is validated (every letter must belong to the DNA alphabet) and returned in upper case.

In [3]:
data = reader.read_sequence("ACGTACGAGCATGCAT")
print(data)

ACGTACGAGCATGCAT
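
To make the validation step concrete, here is a minimal sketch of what such a check could look like. The function name `validate_dna` and the choice to raise a `ValueError` are assumptions made for illustration; they are not taken from the `ReadDNA` implementation.

```python
# Hypothetical helper, not part of ProPythia: validates a single DNA sequence.
DNA_ALPHABET = set("ACGT")


def validate_dna(sequence: str) -> str:
    """Return the upper-cased sequence, or raise ValueError if it is invalid."""
    upper = sequence.strip().upper()
    unexpected = sorted(set(upper) - DNA_ALPHABET)
    if not upper or unexpected:
        raise ValueError(f"Invalid DNA sequence, unexpected characters: {unexpected}")
    return upper


print(validate_dna("acgtacgagcatgcat"))
```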


A *CSV* file must contain at least a column named 'sequence'. The labels can also be retrieved and validated: to do so, set the `with_labels` parameter to **True** and make sure the column with the labels is named 'label'.

In [4]:
filename = "../datasets/primer/dataset.csv"
data = reader.read_csv(filename, with_labels=False)
print(data.head())
print(data.shape)

print("-" * 100)

data = reader.read_csv(filename, with_labels=True)
print(data.head())
print(data.shape)

                                            sequence
0  CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA...
1  GAGTTTATATGGCGCGAGCCTAGTGGTTTTTGTACTTGTTTGTCGC...
2  GATCAGTAGGGAAACAAACAGAGGGCCCAGCCACATCTAGCAGGTA...
3  GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGC...
4  GGCGACCGAACTCCAACTAGAACCTGCATAACTGGCCTGGGAGATA...
(2000, 1)
----------------------------------------------------------------------------------------------------
                                            sequence  label
0  CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA...      0
1  GAGTTTATATGGCGCGAGCCTAGTGGTTTTTGTACTTGTTTGTCGC...      0
2  GATCAGTAGGGAAACAAACAGAGGGCCCAGCCACATCTAGCAGGTA...      0
3  GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGC...      1
4  GGCGACCGAACTCCAACTAGAACCTGCATAACTGGCCTGGGAGATA...      1
(2000, 2)


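Reading a *FASTA* file works like reading a *CSV* file: the sequences are always read, and the labels only if the user asks for them. The file must follow one of the two layouts below. The first has no labels; the second has the label after the sequence identifier, separated by a comma.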

```
>sequence_id1
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG...
>sequence_id2
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG...
``` 

```
>sequence_id1,label1
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG...
>sequence_id2,label2
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG...
``` 

In [5]:
filename = "../datasets/primer/dataset.fasta"
data = reader.read_fasta(filename, with_labels=False)
print(data.head())
print(data.shape)

print("-" * 100)

data = reader.read_fasta(filename, with_labels=True)
print(data.head())
print(data.shape)

                                            sequence
0  CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA...
1  GAGTTTATATGGCGCGAGCCTAGTGGTTTTTGTACTTGTTTGTCGC...
2  GATCAGTAGGGAAACAAACAGAGGGCCCAGCCACATCTAGCAGGTA...
3  GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGC...
4  GGCGACCGAACTCCAACTAGAACCTGCATAACTGGCCTGGGAGATA...
(2000, 1)
----------------------------------------------------------------------------------------------------
                                            sequence  label
0  CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA...      0
1  GAGTTTATATGGCGCGAGCCTAGTGGTTTTTGTACTTGTTTGTCGC...      0
2  GATCAGTAGGGAAACAAACAGAGGGCCCAGCCACATCTAGCAGGTA...      0
3  GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGC...      1
4  GGCGACCGAACTCCAACTAGAACCTGCATAACTGGCCTGGGAGATA...      1
(2000, 2)
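
As an illustration of the two *FASTA* layouts described above, the following sketch parses FASTA text using only the standard library. The function `parse_fasta` is hypothetical and is not the code behind `read_fasta`; it assumes that a labelled header has the form `sequence_id,label`.

```python
import io


def parse_fasta(handle, with_labels=False):
    """Parse FASTA text into a list of {'sequence': ..., 'label': ...} dicts."""
    records = []
    header, chunks = None, []

    def flush():
        # Store the record collected so far (if any).
        if header is None:
            return
        record = {"sequence": "".join(chunks).upper()}
        if with_labels:
            # Assumption: the label is whatever follows the last comma.
            _, separator, label = header.rpartition(",")
            if not separator:
                raise ValueError(f"No label found in header: {header!r}")
            record["label"] = label.strip()
        records.append(record)

    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    flush()
    return records


fasta_text = ">sequence_id1,0\nACTG\nACTG\n>sequence_id2,1\nTTGA\n"
print(parse_fasta(io.StringIO(fasta_text), with_labels=True))
```

The resulting list of dicts can be passed directly to `pd.DataFrame` to obtain a table with the same columns as the output above.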


## 2. Encoders

Deep learning models automatically extract features from the sequences, but they can only handle numerical values, so a numerical representation of each sequence must be built first. Encoders are cheap to compute and provide such a representation, which is then used as the input matrix for the model.

This module comprises functions to encode DNA sequences. The encoders that have been implemented are:

- One-hot encoding
- Chemical encoding
- K-mer One-hot encoding

Below there's an example for each of them.

| Encoder             | Sequence | Encoded sequence                             |
| ------------------- | -------- | -------------------------------------------- |
| One-Hot             | ACGT     | [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]] |
| Chemical            | ACGT     | [[1,1,1], [0,0,1], [1,0,0], [0,1,0]]         |
| K-mer One-Hot (k=2) | ACGT     | [[0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0], [0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0], [0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]] |

### 2.1. One-hot encoding

One-hot encoding is extensively used in deep learning models and is well suited for most models. It is a simple encoding that converts the DNA alphabet into a binary vector. 

- A -> [1,0,0,0]
- C -> [0,1,0,0]
- G -> [0,0,1,0]
- T -> [0,0,0,1]


To encode a sequence, we first need to create a `DNAEncoder` object.

In [6]:
from src.encoding import DNAEncoder
encoder = DNAEncoder('ACGTACGAGCATGCAT')

Now, we only need to specify the encoder method (one-hot, chemical, k-mer one-hot).

In [7]:
encoded_sequence = encoder.one_hot_encode()
print(encoded_sequence)

[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [1 0 0 0]
 [0 0 1 0]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]
 [0 0 1 0]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]]
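
For readers who want to see the mechanics, the mapping can be reproduced in a few lines of NumPy. This is only an illustrative sketch: the function name `one_hot` is hypothetical and this is not the implementation behind `DNAEncoder.one_hot_encode`.

```python
import numpy as np

NUCLEOTIDES = "ACGT"
# Column index of each nucleotide: A -> 0, C -> 1, G -> 2, T -> 3.
INDEX = {base: i for i, base in enumerate(NUCLEOTIDES)}


def one_hot(sequence: str) -> np.ndarray:
    """Return a (len(sequence), 4) matrix with a single 1 per row."""
    encoded = np.zeros((len(sequence), len(NUCLEOTIDES)), dtype=int)
    for position, base in enumerate(sequence.upper()):
        encoded[position, INDEX[base]] = 1
    return encoded


print(one_hot("ACGT"))
```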


### 2.2. Chemical encoding

The chemical encoding is a more elaborate encoding based on the chemical properties of the nucleotides. Each nucleotide is described by three binary chemical properties, and the three values are combined into a vector. In a nutshell, the chemical properties are:

<table>
  <thead>
    <tr>
      <th>Chemical property</th>
      <th>Class</th>
      <th>Nucleotides</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2">Ring structure</td>
      <td>Purine</td>
      <td>A, G</td>
    </tr>
    <tr>
      <td>Pyrimidine</td>
      <td>C, T</td>
    </tr>
    <tr>
      <td rowspan="2">Hydrogen bond</td>
      <td>Weak</td>
      <td>A, T</td>
    </tr>
    <tr>
      <td>Strong</td>
      <td>C, G</td>
    </tr>
    <tr>
      <td rowspan="2">Functional group</td>
      <td>Amino</td>
      <td>A, C</td>
    </tr>
    <tr>
      <td>Keto</td>
      <td>G, T</td>
    </tr>
  </tbody>
</table>

For each property, a nucleotide that belongs to the first class listed (purine, weak, amino) is assigned the value 1, and a nucleotide that belongs to the second class (pyrimidine, strong, keto) is assigned the value 0.

- A -> [1, 1, 1]
- C -> [0, 0, 1]
- G -> [1, 0, 0]
- T -> [0, 1, 0]

The encoder object is already created so we just need to specify the encoder method.

In [8]:
encoded_sequence = encoder.chemical_encode()
print(encoded_sequence)

[[1 1 1]
 [0 0 1]
 [1 0 0]
 [0 1 0]
 [1 1 1]
 [0 0 1]
 [1 0 0]
 [1 1 1]
 [1 0 0]
 [0 0 1]
 [1 1 1]
 [0 1 0]
 [1 0 0]
 [0 0 1]
 [1 1 1]
 [0 1 0]]
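
The rule in the property table can be sketched directly in code. The function name `chemical_encode_sketch` and the set names are hypothetical, and this is not the library's implementation; it only follows the table, with columns ordered as ring structure, hydrogen bond, functional group.

```python
import numpy as np

PURINES = set("AG")    # ring structure: purine -> 1, pyrimidine -> 0
WEAK_BOND = set("AT")  # hydrogen bond: weak -> 1, strong -> 0
AMINO = set("AC")      # functional group: amino -> 1, keto -> 0


def chemical_encode_sketch(sequence: str) -> np.ndarray:
    """Return a (len(sequence), 3) matrix of chemical property flags."""
    return np.array(
        [
            [int(base in PURINES), int(base in WEAK_BOND), int(base in AMINO)]
            for base in sequence.upper()
        ]
    )


print(chemical_encode_sketch("ACGT"))
```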


### 2.3. K-mer One-hot encoding

One-hot encoding represents each nucleotide independently, so it preserves only the position of each nucleotide in the sequence. Recent investigations, however, have shown that including high-order dependencies among nucleotides may improve the performance of DNA models. K-mer one-hot encoding aims to capture these dependencies by one-hot encoding every overlapping window of k consecutive nucleotides.

If k = 1, the encoder will create the same vector as the one-hot encoding.

If k = 2, 16 dinucleotides will be created, and the encoder will create a vector with the following values:

- AA = [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
- AC = [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
- AG = [0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]
- ...
- TT = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]

If k = 3, 64 trinucleotides will be created, and the encoder will create a vector with the following values:

- AAA = [1,0,0,0,...,0,0,0,0]
- AAC = [0,1,0,0,...,0,0,0,0]
- ...
- TTT = [0,0,0,0,...,0,0,0,1]

The value of k can be any integer from 1 up to the length of the sequence.

In [9]:
encoded_sequence = encoder.kmer_one_hot_encode(k=2)
print(encoded_sequence)

[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
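
The sliding-window behaviour is easy to reproduce. The sketch below is an assumption-laden illustration (the function name `kmer_one_hot_sketch` is hypothetical, and k-mers are assumed to be ordered lexicographically over `ACGT`, which is consistent with the output above), not the library's code.

```python
from itertools import product

import numpy as np


def kmer_one_hot_sketch(sequence: str, k: int) -> np.ndarray:
    """One-hot encode every overlapping k-mer of the sequence."""
    # All 4**k possible k-mers in lexicographic order: AA, AC, AG, AT, CA, ...
    kmers = ["".join(letters) for letters in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    n_windows = len(sequence) - k + 1
    encoded = np.zeros((n_windows, len(kmers)))
    for start in range(n_windows):
        encoded[start, index[sequence[start:start + k]]] = 1.0
    return encoded


encoded = kmer_one_hot_sketch("ACGTACGAGCATGCAT", k=2)
print(encoded.shape)           # one row per window, one column per dinucleotide
print(encoded.argmax(axis=1))  # column of the active dinucleotide in each window
```

A sequence of length 16 yields 16 - 2 + 1 = 15 windows for k = 2, which is why the output above has 15 rows.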


This module also allows the user to encode multiple sequences at once. The encoder can receive a column of a dataframe full of sequences and return an array of all encoded sequences.

In [10]:
df = pd.DataFrame(
    [
        ['CGACGATGCAT', 1], 
        ['CGAAGGTGTAC', 0], 
        ['AGTAGGGGTAA', 1]
    ], 
    columns=['sequence', 'labels']
)

column = df['sequence'].values
encoder = DNAEncoder(column)
encoded_sequences = encoder.one_hot_encode()
print(encoded_sequences)

[[[0 1 0 0]
  [0 0 1 0]
  [1 0 0 0]
  [0 1 0 0]
  [0 0 1 0]
  [1 0 0 0]
  [0 0 0 1]
  [0 0 1 0]
  [0 1 0 0]
  [1 0 0 0]
  [0 0 0 1]]

 [[0 1 0 0]
  [0 0 1 0]
  [1 0 0 0]
  [1 0 0 0]
  [0 0 1 0]
  [0 0 1 0]
  [0 0 0 1]
  [0 0 1 0]
  [0 0 0 1]
  [1 0 0 0]
  [0 1 0 0]]

 [[1 0 0 0]
  [0 0 1 0]
  [0 0 0 1]
  [1 0 0 0]
  [0 0 1 0]
  [0 0 1 0]
  [0 0 1 0]
  [0 0 1 0]
  [0 0 0 1]
  [1 0 0 0]
  [1 0 0 0]]]


## 3. DNA Descriptors

As mentioned in the `quick-start-DL.ipynb` notebook, descriptors are manually calculated features for a classification model. Most deep learning models, such as CNNs and RNNs, are designed to extract features from the encoded sequences on their own, so they cannot take descriptors as input. Descriptors are still mentioned here because fully connected deep neural networks (the MLP model used later) can use them as features.

So, at this point, the user can choose either encoders or descriptors to proceed to the next step. With encoders, it would look something like this:

In [11]:
reader = ReadDNA()
data = reader.read_csv(filename='../datasets/primer/dataset.csv', with_labels=True)

fps_x = data['sequence'].values
fps_y = data['label'].values

# choosing one hot encoding
encoder = DNAEncoder(fps_x)
fps_x = encoder.one_hot_encode()
print(fps_x.shape)

(2000, 50, 4)


Using descriptors it would be something like:

In [12]:
reader = ReadDNA()
data = reader.read_csv(filename='../datasets/primer/dataset.csv', with_labels=True)

from calculate_features import calculate_and_normalize
from sklearn.preprocessing import StandardScaler

fps_x, fps_y = calculate_and_normalize(data)

scaler = StandardScaler().fit(fps_x)
fps_x = scaler.transform(fps_x)
fps_y = fps_y.to_numpy()
print(fps_x.shape)

0 / 2000
100 / 2000
200 / 2000
300 / 2000
400 / 2000
500 / 2000
600 / 2000
700 / 2000
800 / 2000
900 / 2000
1000 / 2000
1100 / 2000
1200 / 2000
1300 / 2000
1400 / 2000
1500 / 2000
1600 / 2000
1700 / 2000
1800 / 2000
1900 / 2000
Done!
(2000, 247)


## 4. Data splitting

At this point the sequences have been converted into numerical representations and are ready to be split into training, validation, and test sets. Each set must then be wrapped in a *PyTorch* *DataLoader*, which is a *Python* iterable over a dataset. Both steps are handled by the function `data_splitting` from the `prepare_data.py` file.

In [13]:
from src.prepare_data import data_splitting
batch_size = 32
train_size = 0.6
validation_size = 0.2
test_size = 0.2

trainloader, testloader, validloader, _ = data_splitting(fps_x, fps_y, batch_size, train_size, test_size, validation_size)
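
To show what a 60/20/20 split with batches of 32 amounts to, here is a NumPy-only sketch. The helpers `split_indices` and `iterate_batches` are hypothetical stand-ins for the splitting and *DataLoader* steps; whether `data_splitting` shuffles or stratifies in this way is an assumption.

```python
import numpy as np


def split_indices(n_samples, train_size, validation_size, test_size, seed=2022):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    if abs(train_size + validation_size + test_size - 1.0) > 1e-9:
        raise ValueError("The three fractions must sum to 1.")
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(round(train_size * n_samples))
    n_valid = int(round(validation_size * n_samples))
    return (
        order[:n_train],
        order[n_train:n_train + n_valid],
        order[n_train + n_valid:],
    )


def iterate_batches(x, y, indices, batch_size):
    """Yield (features, labels) mini-batches, similar in spirit to a DataLoader."""
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield x[batch], y[batch]


# Placeholder arrays with the same shape as the one-hot encoded dataset.
x = np.zeros((2000, 50, 4))
y = np.zeros(2000, dtype=int)

train_idx, valid_idx, test_idx = split_indices(len(x), 0.6, 0.2, 0.2)
print(len(train_idx), len(valid_idx), len(test_idx))  # 1200 400 400
print(sum(1 for _ in iterate_batches(x, y, train_idx, 32)))  # 38 batches per epoch
```

With 2000 sequences, the training set holds 1200 samples, which gives 38 batches of at most 32 samples per epoch.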

## 5. Model building and training

**Important Note:** Before continuing, it is worth noting that all of the previous steps, from the data reading, calculation of encoder/descriptors, and even the data splitting step, were compiled into a single function called `prepare_data` that can be called from the `prepare_data.py` file. An example of how to use this function will be shown later.

At this point, the data is ready to be used by a model. The user can choose one of the 6 implemented *PyTorch* model families:

| Models                | Features    |
| --------------------- | ----------- |
| MLP                   | Descriptors |
| CNN                   | Encoders    |
| LSTM / BiLSTM         | Encoders    |
| GRU / BiGRU           | Encoders    |
| CNN-LSTM / CNN-BiLSTM | Encoders    |
| CNN-GRU / CNN-BiGRU   | Encoders    |

As the table shows, the MLP requires descriptors, while all the other models require encoders. The four recurrent families also have a bidirectional option, resulting in 2 + 4*2 = 10 different models.
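
The count of ten can be checked by enumerating the variants from the table. The names below are display names copied from the table, not necessarily the `model_label` strings the library expects.

```python
# Families without a bidirectional option.
base_models = ["MLP", "CNN"]
# Families that exist in a unidirectional and a bidirectional form.
recurrent_models = ["LSTM", "GRU", "CNN-LSTM", "CNN-GRU"]


def bidirectional_name(name: str) -> str:
    """Turn 'LSTM' into 'BiLSTM' and 'CNN-LSTM' into 'CNN-BiLSTM'."""
    prefix, _, cell = name.rpartition("-")
    return f"{prefix}-Bi{cell}" if prefix else f"Bi{cell}"


variants = base_models + [
    variant
    for name in recurrent_models
    for variant in (name, bidirectional_name(name))
]
print(len(variants), variants)
```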

Suppose we want to use descriptors as features; in that case we need to choose the *MLP* model. We also need to specify some parameters for the training function. To make this easier, a configuration file collects all the parameters that will be used from now on. An example of a `config.json` file is:

```json
{
    "combination":{
        "model_label": "mlp",
        "mode": "descriptor",
        "data_dir": "primer"
    },
    "do_tuning": false,
    "fixed_vals":{
        "epochs": 500,
        "optimizer_label": "adam",
        "loss_function": "cross_entropy",
        "patience": 8,
        "output_size": 2,
        "cpus_per_trial":1, 
        "gpus_per_trial":0,
        "num_samples": 15,
        "num_layers": 2,
        "kmer_one_hot": 3
    },
    "hyperparameters": {
        "hidden_size": 32,
        "lr": 1e-3,
        "batch_size": 32,
        "dropout": 0.35
    },
    "hyperparameter_search_space": {
        "hidden_size": [32, 64, 128, 256],
        "lr": [1e-5, 1e-2],
        "batch_size": [8, 16, 32],
        "dropout": [0.3, 0.5]
    },
    "train_all_combinations": false
}
```

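To read the values from the configuration file, we can use the function `read_config` from the `deep_ml.py` file. This function also validates the configuration file and returns a dictionary with the values. Note that the output below was produced with a configuration that selects the `cnn` model with the `one_hot` mode, so some values differ from the example file above.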

In [14]:
from deep_ml import read_config
config = read_config(filename='../config.json')

for key, val in config.items():
    if key in ("do_tuning", "train_all_combinations"):
        print(key, ":", val)
    else:
        print(key, "{")
        for k, v in val.items():
            print("\t", k,":", v)
        print("}")

Training on: cuda:0
combination {
	 model_label : cnn
	 mode : one_hot
	 data_dir : /home/jabreu/propythia/src/propythia/DNA/datasets/primer
}
do_tuning : False
fixed_vals {
	 epochs : 500
	 optimizer_label : adam
	 loss_function : CrossEntropyLoss()
	 patience : 7
	 output_size : 2
	 cpus_per_trial : 2
	 gpus_per_trial : 2
	 num_samples : 15
	 num_layers : 2
	 kmer_one_hot : 3
}
hyperparameters {
	 hidden_size : 32
	 lr : 0.001
	 batch_size : 32
	 dropout : 0.35
}
hyperparameter_search_space {
	 hidden_size : <ray.tune.sample.Categorical object at 0x7f7c10558d10>
	 lr : <ray.tune.sample.Float object at 0x7f7c10558a90>
	 batch_size : <ray.tune.sample.Categorical object at 0x7f7c105583d0>
	 dropout : <ray.tune.sample.Float object at 0x7f7c10558dd0>
}
train_all_combinations : False
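
The kind of validation a configuration reader might perform can be sketched with the standard library alone. The function `load_config` and the specific checks are assumptions for illustration; the real `read_config` also converts values (for example, the loss function and the search space) into library objects, which this sketch does not attempt.

```python
import json

# Top-level sections seen in the example config.json, with their expected types.
REQUIRED_SECTIONS = {
    "combination": dict,
    "do_tuning": bool,
    "fixed_vals": dict,
    "hyperparameters": dict,
    "hyperparameter_search_space": dict,
    "train_all_combinations": bool,
}


def load_config(text: str) -> dict:
    """Parse JSON text and check that every expected section is present."""
    config = json.loads(text)
    for key, expected_type in REQUIRED_SECTIONS.items():
        if key not in config:
            raise ValueError(f"Missing section: {key}")
        if not isinstance(config[key], expected_type):
            raise TypeError(f"Section {key!r} must be of type {expected_type.__name__}")
    # Every tuned hyperparameter should also have a default value.
    without_default = set(config["hyperparameter_search_space"]) - set(config["hyperparameters"])
    if without_default:
        raise ValueError(f"No default value for: {sorted(without_default)}")
    return config


example = """{
    "combination": {"model_label": "mlp", "mode": "descriptor", "data_dir": "primer"},
    "do_tuning": false,
    "fixed_vals": {"epochs": 500, "patience": 8},
    "hyperparameters": {"hidden_size": 32, "lr": 1e-3},
    "hyperparameter_search_space": {"hidden_size": [32, 64], "lr": [1e-5, 1e-2]},
    "train_all_combinations": false
}"""
config_sketch = load_config(example)
print(config_sketch["hyperparameters"])
```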


As we can see, there is a dict called 'hyperparameters' for the training. These values were chosen arbitrarily, which can lead to poor performance; hyperparameter tuning is the way to find better values. For now, let's keep it simple and use the default values. Hyperparameter tuning, which uses the dict called 'hyperparameter_search_space', is discussed later in the tutorial.

Now we just need to call the training function with these values to obtain a trained model. Before that, it is important to specify the device on which the model will be trained. Generally, it is a good idea to use the GPU if one is available. It is also good practice to set a seed so that the results are reproducible.

In [15]:
import numpy
import os
import torch

numpy.random.seed(2022)
torch.manual_seed(2022)
os.environ["CUDA_VISIBLE_DEVICES"] = '1,2,3,4,5'
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

Now we are ready to call the training function.

In [16]:
from src.train import traindata
hyperparameters = config['hyperparameters']
model = traindata(hyperparameters, device, config)

[1/500, 0/38] loss: 0.70695961
The Current Loss: 0.6356495572970464
trigger times: 0
[2/500, 0/38] loss: 0.64424402
The Current Loss: 0.49543410539627075
trigger times: 0
[3/500, 0/38] loss: 0.50877994
The Current Loss: 0.42889038645304167
trigger times: 0
[4/500, 0/38] loss: 0.36436936
The Current Loss: 0.3913511771422166
trigger times: 0
[5/500, 0/38] loss: 0.44344398
The Current Loss: 0.3759752168105199
trigger times: 0
[6/500, 0/38] loss: 0.33551249
The Current Loss: 0.37056772525493914
trigger times: 0
[7/500, 0/38] loss: 0.32777476
The Current Loss: 0.36761671763200027
trigger times: 0
[8/500, 0/38] loss: 0.32554531
The Current Loss: 0.3595673373112312
trigger times: 0
[9/500, 0/38] loss: 0.36578175
The Current Loss: 0.35537290343871486
trigger times: 0
[10/500, 0/38] loss: 0.32901487
The Current Loss: 0.3488809076639322
trigger times: 0
[11/500, 0/38] loss: 0.31945065
The Current Loss: 0.3502378830542931
trigger Times: 1
[12/500, 0/38] loss: 0.3177934
The Current Loss: 0.3420598
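
The `trigger times` counter in the log, together with the `patience` value in the configuration, suggests patience-based early stopping. The sketch below shows one common way to implement it. It assumes the counter increases whenever the validation loss is higher than in the previous epoch and resets otherwise; the actual rule used by `traindata` may differ, and the function name is hypothetical.

```python
def epoch_of_early_stop(validation_losses, patience):
    """Return the epoch (1-based) at which training would stop, or None."""
    trigger_times = 0
    previous_loss = float("inf")
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss > previous_loss:
            # The loss got worse: count one more trigger.
            trigger_times += 1
            if trigger_times >= patience:
                return epoch
        else:
            # The loss improved (or stayed equal): reset the counter.
            trigger_times = 0
        previous_loss = loss
    return None


# Illustrative loss values (not taken from a real run).
losses = [0.50, 0.40, 0.41, 0.42, 0.43]
print(epoch_of_early_stop(losses, patience=3))  # 5
print(epoch_of_early_stop(losses, patience=8))  # None
```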

Note that we did not need to read any data or compute encodings or descriptors ourselves. The training function already performs all of those steps through the `prepare_data` function mentioned in the important note at the start of this section. However, we need to call it again now to obtain the test set and check that the model is working properly. Reading and splitting the data twice is inconvenient, but it is required because 'batch_size', which is used when the data is loaded, will later be a tunable hyperparameter. Because the hyperparameters can only vary inside the training function, the data has to be read inside that function.

In [17]:
from src.prepare_data import prepare_data
mode = config['combination']['mode']
data_dir = config['combination']['data_dir']
kmer_one_hot = config['fixed_vals']['kmer_one_hot']
model_label = config['combination']['model_label'] 
batch_size = config['hyperparameters']['batch_size']

_, testloader, _, _, _ = prepare_data(
    data_dir=data_dir,
    mode=mode,
    batch_size=batch_size,
    k=kmer_one_hot
)

Now let's see how well the model performs on the test set. The metrics chosen are the accuracy, the Matthews correlation coefficient, and the confusion matrix.

In [18]:
from src.test import test

acc, mcc, report = test(device, model, testloader)
print("Results in test set:")
print("--------------------")
print("- model:  ", model_label)
print("- mode:   ", mode)
print("- dataset:", data_dir.split("/")[-1])
print("--------------------")
print('Accuracy: %.3f' % acc)
print('MCC: %.3f' % mcc)
print(report)

Results in test set:
--------------------
- model:   cnn
- mode:    one_hot
- dataset: primer
--------------------
Accuracy: 0.990
MCC: 0.980
[[200   3]
 [  1 196]]
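
Both scalar metrics can be recomputed from the confusion matrix, which is a useful sanity check. The sketch below assumes the common layout in which rows are true classes and columns are predicted classes; the function name `accuracy_and_mcc` is hypothetical and is not part of `src.test`.

```python
import math


def accuracy_and_mcc(confusion):
    """Compute accuracy and Matthews correlation coefficient for 2 classes."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


acc_check, mcc_check = accuracy_and_mcc([[200, 3], [1, 196]])
print(f"Accuracy: {acc_check:.3f}")  # Accuracy: 0.990
print(f"MCC: {mcc_check:.3f}")       # MCC: 0.980
```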


## 6. Hyperparameter tuning

As mentioned before, the hyperparameters used so far were chosen arbitrarily. *Hyperparameter tuning* is the process of searching for the hyperparameter values that give the best model performance, and it is implemented in the function `hyperparameter_tuning`. The function takes as input the config object (which must contain the hyperparameter search space) and the device on which the model will be trained. It creates an `ASHAScheduler`, which stops poorly performing trials early, and a `CLIReporter` object that reports the metrics (accuracy, Matthews correlation coefficient, and loss) on the console. Then, `num_samples` configurations are drawn from the hyperparameter search space and a model is trained on each of them. The best model is the one with the highest Matthews correlation coefficient; it is then evaluated on the test set and its metrics are printed.

In [19]:
os.chdir('../')
sys.path.append(os.getcwd())
from src.hyperparameter_tuning import hyperparameter_tuning
config['do_tuning'] = True
hyperparameter_tuning(device, config)

2022-08-17 15:59:25,306	ERROR syncer.py:147 -- Log sync requires rsync to be installed.


== Status ==
Current time: 2022-08-17 15:59:25 (running for 00:00:00.21)
Memory usage on this node: 88.7/251.3 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 256.000: None | Iter 128.000: None | Iter 64.000: None | Iter 32.000: None | Iter 16.000: None | Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 2.0/64 CPUs, 2.0/4 GPUs, 0.0/122.97 GiB heap, 0.0/56.69 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /home/jabreu/ray_results/traindata_2022-08-17_15-59-25
Number of trials: 15/15 (14 PENDING, 1 RUNNING)
+-----------------------+----------+------------------------+--------------+-----------+---------------+-------------+
| Trial name            | status   | loc                    |   batch_size |   dropout |   hidden_size |          lr |
|-----------------------+----------+------------------------+--------------+-----------+---------------+-------------|
| traindata_2f206_00000 | RUNNING  | 192.168.85.249:3931827 |         

2022-08-17 16:04:29,108	INFO tune.py:748 -- Total run time: 304.09 seconds (303.84 seconds for the tuning loop).


Best trial config: {'hidden_size': 256, 'lr': 0.0006702634531750633, 'batch_size': 8, 'dropout': 0.42089114402459704}
Best trial final validation loss: 0.3167447644472122
Best trial final validation accuracy: 0.995
Best trial final validation mcc: 0.98999899989999
Results in test set:
--------------------
- model:   cnn
- mode:    one_hot
- dataset: primer
--------------------
Accuracy: 1.000
MCC: 1.000
[[203   0]
 [  0 197]]
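
To illustrate what drawing `num_samples` configurations from the search space means, here is a sketch using only the standard library. It treats `hidden_size` and `batch_size` as categorical choices and `lr` and `dropout` as continuous ranges, as the printed configuration suggests. Sampling `lr` on a logarithmic scale is an assumption, and the function `sample_config` is hypothetical; the library itself relies on Ray Tune for this step.

```python
import math
import random


def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration from the example search space."""
    return {
        "hidden_size": rng.choice([32, 64, 128, 256]),
        # Assumption: learning rates are sampled log-uniformly in [1e-5, 1e-2].
        "lr": math.exp(rng.uniform(math.log(1e-5), math.log(1e-2))),
        "batch_size": rng.choice([8, 16, 32]),
        "dropout": rng.uniform(0.3, 0.5),
    }


rng = random.Random(2022)
trials = [sample_config(rng) for _ in range(15)]  # num_samples = 15
for trial in trials[:3]:
    print(trial)
```

Each sampled configuration would then be passed to the training function, and the trial with the highest validation MCC would be kept.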


We've reached the end of the deep learning pipeline. 