# Overview
In this exercise, we implement an LSTM, a popular RNN variant, and train it on DMS unpaired probabilities for RNA sequences. We then use its predictions to try to improve the RNA secondary structure predictions of Vienna RNAfold. RNAfold can be used either through its webserver interface or as a local installation. The workflow is organized as follows:

1. [Install packages and download data](#install-packages-and-download-data)
2. [Preprocess DMS data](#dataset-preprocessing)
3. [Define LSTM](#lstm-architecture)
4. [Weisfeiler Lehman algorithm](#weisfeiler-lehman-algorithm)
5. [RNAfold local implementation](#rnafold-local-implementation)
6. [LSTM training and evaluation](#lstm-training-and-evaluation)

# Install packages and download data
Before starting, make sure Python 3.6+ and the following libraries are installed:

In [41]:
%pip install numpy pandas scikit-learn biopython torch torchvision torch-geometric

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Download the DMS dataset if you have not already done so from: https://github.com/BackofenLab/ML_LS_resources/blob/master/exercise_9_rna_rnn_hands_on/dms_dataset.fa

In this workflow, we train an LSTM on DMS feature vectors of RNA structures to predict the likelihood that each nucleotide is unpaired, i.e., accessible and not involved in base pairing in the RNA's secondary structure. We then use these predicted DMS feature vectors to enhance the predictions of Vienna RNAfold, a thermodynamic model for RNA secondary structure prediction.

For this exercise, you may use RNAfold via either the web browser interface or by downloading, installing, and implementing it locally.

The Vienna RNAfold webserver may be accessed in a web browser via: http://rna.tbi.univie.ac.at//cgi-bin/RNAWebSuite/RNAfold.cgi

To install RNAfold locally, we will compile it from source. First download the source code (.tar.gz file) for your operating system into your working directory from: https://www.tbi.univie.ac.at/RNA/index.html#download

Then follow the instructions below to unpack, configure, and build the package directly from source.

In [None]:
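# Run this in the directory containing the downloaded .tar.gz file to build and install from source
!tar -zxvf ViennaRNA-2.7.0.tar.gz
# Each "!" command runs in its own subshell, so use the %cd magic to change directory persistently
%cd ViennaRNA-2.7.0
!./configure
!make
!sudo make install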

ViennaRNA-2.7.0/
ViennaRNA-2.7.0/license.txt
ViennaRNA-2.7.0/configure.ac
ViennaRNA-2.7.0/RNAlib2.pc.in
ViennaRNA-2.7.0/compile
ViennaRNA-2.7.0/NEWS
ViennaRNA-2.7.0/Makefile.in
ViennaRNA-2.7.0/packaging/
ViennaRNA-2.7.0/packaging/win_installer_archlinux_x86_64.nsi.in
ViennaRNA-2.7.0/packaging/viennarna.spec.in
ViennaRNA-2.7.0/packaging/macosx/
ViennaRNA-2.7.0/packaging/macosx/resources/
ViennaRNA-2.7.0/packaging/macosx/resources/welcome.txt.in
ViennaRNA-2.7.0/packaging/macosx/resources/welcome.txt
ViennaRNA-2.7.0/packaging/macosx/resources/conclusion.html
ViennaRNA-2.7.0/packaging/macosx/resources/ohm.png
ViennaRNA-2.7.0/packaging/macosx/Makefile.in
ViennaRNA-2.7.0/packaging/macosx/Makefile.am
ViennaRNA-2.7.0/packaging/macosx/Distribution.xml.in
ViennaRNA-2.7.0/packaging/macosx/uninstall-ViennaRNA-Package.sh
ViennaRNA-2.7.0/packaging/win_installer_fedora_i686.nsi.in
ViennaRNA-2.7.0/packaging/debian/
ViennaRNA-2.7.0/packaging/debian/perl-rna.install.in
ViennaRNA-2.7.0/packaging/debian/v
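
Once the build has finished, it can help to confirm that the `RNAfold` executable is actually reachable before running the rest of the notebook. The following is a minimal sketch, not part of the original workflow: the helper name `find_rnafold` is hypothetical, and the `--version` flag is assumed to be supported by the installed build.

```python
import shutil
import subprocess

def find_rnafold(executable="RNAfold"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)

path = find_rnafold()
if path is None:
    print("RNAfold not found on PATH; re-check the install step or use the webserver")
else:
    # Assumption: the installed build accepts --version
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    print(result.stdout.strip())
```

If the check reports that RNAfold is missing, the later cells that call it through `subprocess` will fail, so it is worth resolving first.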

Import needed libraries

In [42]:
import glob
import itertools
import os
import subprocess as sp
import uuid

import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from sklearn.metrics import average_precision_score
from sklearn.model_selection import KFold
from torch import nn, optim
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.sampler import SubsetRandomSampler
from torch_geometric.data import Data
from torchvision import datasets

# Dataset Preprocessing

The dataset consists of pairs ($s_i$, $v_i$) where:

$s_i$: RNA sequence (string of nucleotides: A, U, C, G)

$v_i$: Vector of likelihoods indicating the probability of each nucleotide in $s_i$ being unpaired (does NOT form a base pair in the corresponding RNA secondary structure)

For example, the RNA sequence `AUGCGA` might have DMS scores `[0.8, 0.2, 0.5, 0.9, 0.1, 0.7]`. Nucleotides with lower scores are more likely to be paired in the secondary structure. By training our LSTM to predict these unpaired probabilities, we can combine empirical DMS data with the thermodynamic model from RNAfold.

To preprocess this data and split it into the training, validation, and test sets needed to train and evaluate our LSTM, we define a custom PyTorch Dataset class.

This implements the following pipeline:
1. Read RNA sequences and their DMS data from the downloaded file
2. Encode RNA sequences into numerical representations using k-mers
3. Prepare target values based on DMS scores
4. Create graph structures 
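
The encoding in steps 2 and 3 can be sketched in a few lines of plain Python. This is a simplified illustration, not the notebook's class: the function names are hypothetical, the alphabet `"ATGC"` is assumed to match the k-mer list built in the cell below, and the later A/C masking and scaling step is omitted.

```python
import itertools

def build_kmer_vocab(k, alphabet="ATGC"):
    """Return all k-mers over the alphabet, in a fixed order."""
    return ["".join(p) for p in itertools.product(alphabet, repeat=k)]

def encode_kmers(seq, scores, k):
    """Slide a window of width k over seq and return (indices, targets)."""
    vocab = build_kmer_vocab(k)
    index_of = {kmer: i for i, kmer in enumerate(vocab)}
    indices, targets = [], []
    for start in range(len(seq) - k + 1):
        indices.append(index_of[seq[start:start + k]])
        # Same reduction as the notebook: the maximum score inside the window
        targets.append(max(scores[start:start + k]))
    return indices, targets

indices, targets = encode_kmers("ATGCGA", [0.8, 0.2, 0.5, 0.9, 0.1, 0.7], k=2)
print(indices)  # positions of AT, TG, GC, CG, GA in the 2-mer vocabulary
print(targets)
```

With `k=1`, as used later in this notebook, every nucleotide is its own k-mer and each target is simply that nucleotide's DMS score.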

In [43]:
class MyDataset(Dataset):

	def __init__(self, dataset_folder, kmer, scale_factor):
		self.scale_factor = scale_factor
		self.dataset_folder = dataset_folder
		self.kmer_list = self.get_kmer_list(kmer)
		
		value_list = self.read_dataset(self.dataset_folder)

		self.dataset, self.x_values, self.y_values, self.max_length, self.seq_list = self.construct_dataset(value_list, self.kmer_list)

		
		
	def concat(self, x_str):
		x_concat = ""
		for x in x_str:
			x_concat += x
	
		return x_concat
		
	def num_words(self):
		return len(self.kmer_list)
		
		
	def get_kmer_list(self, kmer):
		kmer_list = list(itertools.product(["A", "T", "G", "C"], repeat=kmer))
		kmer_list = [self.concat(x) for x in kmer_list]

		return kmer_list
		

	def read_dataset(self, file_):
		value_list = []
		# Use a context manager so the file is closed even if reading fails
		with open(file_) as files:
			file_lines = files.readlines()
		
		for i in range(0,len(file_lines),3):
			# Each RNA entry consists of three lines: ID, RNA sequence, DMS probability list
			value_list.append([file_lines[i], file_lines[i + 1], file_lines[i+2]])

		return value_list
		
		
	def construct_dataset(self, value_list, kmer_list):
		print(kmer_list)
		x_values = []
		y_values = []
		data_list = []
		seq_list = []
		max_length = 0
		

		for sample in value_list:
			seq = sample[1][:-1]
			seq_list.append(seq)
			dms = [float(dms) for dms in sample[2][:-2].split(" ")]
			kmer_length = len(kmer_list[0])

			x_encoded = []
			y_encoded = []
			edge_list = []

			
			for num,step_size in enumerate(range(0, len(seq) - (kmer_length-1))):
				# For each k-mer in the sequence, extract k-mer DMS scores and bases respectively
				kmer_dms = dms[step_size :step_size + kmer_length]
				kmer_dms = [dms for num, dms in enumerate(kmer_dms)]
				kmer_bases = seq[step_size:step_size+kmer_length]

				listofzeros = kmer_list.index(kmer_bases)
				x_encoded.append(listofzeros)
				# Create bidirectional graph edge between consecutive k-mers
				edge_list.append([num, num+1])
				edge_list.append([num+1, num])

				# Use the maximum DMS score in the k-mer as the target
				y_encoded.append(np.max(kmer_dms))

			# DMS only affects bases A and C so only scale if nucleotide is A or C, otherwise set to 0
			y_encoded = [y*self.scale_factor if seq[num] == "A" or seq[num] == "C" else 0 for num, y in enumerate(y_encoded)]
			
			
			# NOTE: despite its name, max_length holds the k-mer vocabulary size;
			# the LSTM uses it as the width of its one-hot input vectors
			max_length = len(kmer_list)
			
			# Append encoded features and targets
			x_values.append(x_encoded)
			y_values.append(y_encoded)
			
			# Create the graph data object
			# Nodes: indices representing k-mers from the RNA sequence
			# Edges: between consecutive k-mers, representing their sequential relationship in the RNA sequence 
			edge_index = torch.tensor(edge_list, dtype=torch.long)
			x = torch.tensor(x_encoded, dtype=torch.float)
			y = torch.tensor(y_encoded, dtype=torch.float)
	
			data = Data(x=x, y=y, edge_index=edge_index.t().contiguous(), id = str(sample[0]))
			data_list.append(data)
		
		 
			
		return data_list, x_values, y_values, max_length, seq_list
		
		
		
	def reconstruct_input(self, input_seq):
		# Rebuild the original RNA sequence from a list of k-mer indices by mapping each index back to its corresponding k-mer
		# (exact only for kmer=1, because longer k-mers overlap and would repeat bases)
		seq = ""
		for enum, val in enumerate(input_seq[0]):
				
			kmer = self.kmer_list[val]
			seq += kmer
	
		return seq
        	
        
	def __len__(self):
		return len(self.y_values)

	def __getitem__(self, idx):
		return self.x_values[idx], self.y_values[idx]
		
	def return_x(self):
		return self.x_values
		
	def return_y(self):
		return self.y_values

# LSTM Architecture 
Next, we define our LSTM class using `torch.nn` modules.

In [44]:
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, max_length):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.max_length = max_length
        self.LSTM = nn.LSTM(max_length, hidden_size, batch_first = True)
        self.out1 = nn.Linear(self.hidden_size, self.hidden_size//2)
        self.out2 = nn.Linear(self.hidden_size//2, 1)
        
    
    def forward(self, input, hidden):
    
        # Get the one-hot encoding of the RNA sequence input 
        hot_one = self.return_hotone(input)
        
        output, hidden = self.LSTM(hot_one, hidden)
        output = F.relu(output)
        output = F.relu(self.out1(output))
        output = self.out2(output)
        return output, hidden
    

    def initHidden(self):
        return (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))
        
    
    
    def return_hotone(self, input):
        # Build one one-hot row per k-mer index; each row is max_length (vocabulary size) wide
        hot_one_vec = []
        for batch in input:
            for val in batch:
                hot_one = [0] * self.max_length
                hot_one[val] = 1
                hot_one_vec.append(hot_one)
                
        return torch.FloatTensor([hot_one_vec])
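
To make the input shape concrete, here is a small plain-Python sketch of the one-hot step. The helper name `one_hot_rows` is hypothetical, and nested lists stand in for the tensor; the point is only to show the `(batch, seq_len, vocab_size)` layout that an LSTM with `batch_first = True` receives.

```python
def one_hot_rows(indices, vocab_size):
    """Turn a list of k-mer indices into a list of one-hot rows."""
    rows = []
    for idx in indices:
        row = [0] * vocab_size
        row[idx] = 1
        rows.append(row)
    return rows

# Indices for "ATGC" with the 1-mer vocabulary ['A', 'T', 'G', 'C']
rows = one_hot_rows([0, 1, 2, 3], vocab_size=4)
batch = [rows]  # the extra outer list plays the role of a batch dimension of size 1
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (batch, seq_len, vocab_size)
```

This is also why the network's input width equals the k-mer vocabulary size and not the sequence length.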
    
    

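Define a custom Pearson correlation coefficient function for the LSTM's predictions. Because it returns a plain Python float (`loss.item()`), it serves as an evaluation metric and not as a differentiable training loss.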

In [45]:
def pearson_loss(output, target_data, input_data):
	# Center data to focus on covariance 
	vx = output - torch.mean(output)
	vy = target_data - torch.mean(target_data)
	
	loss = torch.sum(vx * vy) / (torch.sqrt(torch.sum(vx ** 2)) * torch.sqrt(torch.sum(vy ** 2)))
	
	return loss.item()
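
As a sanity check on the formula, the same computation can be written with the standard library only and tried on inputs whose correlation is known in advance. This is an illustrative sketch; `pearson_r` is a hypothetical name and not the function used in training.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance of the centered data over the product of norms."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    vx = [x - mean_x for x in xs]
    vy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(vx, vy))
    denom = math.sqrt(sum(a * a for a in vx)) * math.sqrt(sum(b * b for b in vy))
    return cov / denom

r_up = pearson_r([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])   # perfectly increasing
r_down = pearson_r([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])     # perfectly decreasing
r_mixed = pearson_r([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])              # partial agreement
print(r_up, r_down, r_mixed)
```

Values near 1 mean the predicted DMS profile rises and falls together with the measured one, regardless of scale, which is why the scale factor applied to the targets does not affect this metric.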

Define a custom collate method for preparing the batches of data during training

In [46]:
def my_collate(batch):
    # Keep inputs as lists of k-mer indices and convert each target to a float tensor
    data = [item[0] for item in batch]
    target = [torch.FloatTensor(item[1]) for item in batch]
    return [data, target]

Load and prepare the dataset using the custom `MyDataset` class defined above.

In [47]:
scale_factor  = 100
dataset = MyDataset("./dms_dataset.fa", kmer=1, scale_factor = scale_factor)

['A', 'T', 'G', 'C']


Split processed dataset into 5 folds for 5-fold cross-validation.

In [48]:
# 2) Split dataset into 5 folds 
kfold_loss = []
fold_res = []
graph_sim_loss = []

kfold = KFold(n_splits=5, shuffle=False, random_state=None)
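
The index bookkeeping behind this split can be sketched without any libraries. The functions below are hypothetical illustrations: `contiguous_folds` mirrors how an unshuffled k-fold split partitions indices into consecutive blocks, and `split_held_out` shows how the training loop later divides each held-out block into a test half and a validation half.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train, held_out) index lists for unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        held_out = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, held_out
        start += size

def split_held_out(held_out):
    """First half becomes the test set, second half the validation set."""
    half = len(held_out) // 2
    return held_out[half:], held_out[:half]  # (validation, test)

folds = list(contiguous_folds(10, 5))
train0, held0 = folds[0]
val0, test0 = split_held_out(held0)
print(train0, val0, test0)
```

Because the split is not shuffled, neighbouring entries of the dataset file always end up in the same fold.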

# Weisfeiler Lehman Algorithm

For each fold, we will train our model for 10 epochs and then evaluate on the test data. We will then run RNAfold on the RNA sequences with two settings and compare the predictive performance:

1) With DMS feature vectors predicted by trained LSTM model (DMS + DL + thermodynamic model)
2) Only the RNA sequence, i.e., RNAfold alone (thermodynamic model only)

Both of these results will be converted into graphs so that we can compute the graph similarity between them using the Weisfeiler Lehman algorithm.
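
The idea can be illustrated with a compact sketch of one round of WL label refinement followed by a cosine similarity of the label histograms. This is a simplified stand-in and not the class defined below: all names are hypothetical, and it builds each new label from a tuple of the node's label and its sorted neighbour labels, whereas the notebook's class concatenates label strings.

```python
from collections import Counter

def build_neighbors(length, pairs):
    """Adjacency lists from backbone edges (i, i+1) plus base-pair edges."""
    nbrs = [[] for _ in range(length)]
    for i in range(length - 1):
        nbrs[i].append(i + 1)
        nbrs[i + 1].append(i)
    for i, j in pairs:
        nbrs[i].append(j)
        nbrs[j].append(i)
    return nbrs

def wl_refine(labels, nbrs, table):
    """One WL round: compress (own label, sorted neighbour labels) to an integer."""
    new_labels = []
    for node, label in enumerate(labels):
        signature = (label, tuple(sorted(labels[n] for n in nbrs[node])))
        if signature not in table:
            table[signature] = len(table)
        new_labels.append(table[signature])
    return new_labels

def wl_cosine(labels_a, nbrs_a, labels_b, nbrs_b, rounds=1):
    for _ in range(rounds):
        table = {}  # shared by both graphs within a round so labels are comparable
        labels_a = wl_refine(labels_a, nbrs_a, table)
        labels_b = wl_refine(labels_b, nbrs_b, table)
    hist_a, hist_b = Counter(labels_a), Counter(labels_b)
    dot = sum(hist_a[k] * hist_b[k] for k in hist_a)
    norm_a = sum(v * v for v in hist_a.values()) ** 0.5
    norm_b = sum(v * v for v in hist_b.values()) ** 0.5
    return dot / (norm_a * norm_b)

seq = list("GGAACC")
paired = build_neighbors(len(seq), [(0, 5), (1, 4)])  # small hairpin
unpaired = build_neighbors(len(seq), [])              # no base pairs at all
same = wl_cosine(seq, paired, seq, paired)
diff = wl_cosine(seq, paired, seq, unpaired)
print(same, diff)
```

Identical structures give a similarity of 1, while structures that differ in their base pairs share fewer refined labels and score lower.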

In [59]:
class Weisfeiler_Lehman_Graph():
	def __init__(self):
		self.label_dict = {}
	
	
	def propagate_class_labels(self, seq1, connections1):
		new_labels_seq = []
	
		# Add label for linear connection
		for num, seq in enumerate(seq1):
			if num == 0:
				new_labels_seq.append(str(seq) + str(seq1[num+1]))
			elif num == len(seq1) -1:
				new_labels_seq.append(str(seq) + str(seq1[num-1]))
			else:
				new_labels_seq.append(str(seq) + str(seq1[num-1]) + str(seq1[num+1]))
			
		# Add label for rest connections
		for connection in connections1:
			new_labels_seq[connection[0]] += seq1[connection[1]]
			new_labels_seq[connection[1]] += seq1[connection[0]]

		# Label compression
		self.label_hashing(new_labels_seq)
		labels = self.label_compression(new_labels_seq)
		
		# seq1.extend(labels)
		# Replace old labels with new rather than extend
		seq1 = labels
		return seq1		


	def label_compression(self, labels):
		labels_compressed = [self.label_dict[l] for l in labels]
	
		return labels_compressed


	def label_hashing(self, seq):
		label_keys = self.label_dict.keys()
		last_num = len(label_keys) - 1
		
		for label in seq:
			if label not in label_keys:
				self.label_dict[label] = str(last_num + 1)
				last_num = last_num + 1
		return


	def label_count(self, labels, max_label):
		labels_int = [int(l) for l in labels]
		count_list = [0] * (max_label +1)

		for label in labels_int:
			count_list[label] += 1
	
		return count_list


	def graph_similiarity(self, label1, label2):
		similarity = np.dot(label1, label2)	
	
		return similarity


	def calculate_weisfeiler_lehman_graph(self, seq1, connections1, seq2, connections2, num_iterations=1):

		# Hash initial labels in both sequences into self.label_dict
		self.label_hashing(seq1)
		self.label_hashing(seq2)
		
		# Convert each element of seq1, seq2 to compressed labels (ints as strings)
		label_seq1 = self.label_compression(seq1)
		label_seq2 = self.label_compression(seq2)
		
		# Iterative WL color refinement 
		for n in range(num_iterations):
			label_seq1 = self.propagate_class_labels(label_seq1, connections1)
			label_seq2 = self.propagate_class_labels(label_seq2, connections2)


		# Count how many times each label occurs in each graph 
		max_val = max(max([int(l) for l in label_seq1]), max([int(l) for l in label_seq2]))
		count_list1 = self.label_count(label_seq1, max_val)
		count_list2 = self.label_count(label_seq2, max_val)
		
		# Compute dot product of the label-frequency vectors 
		similarity_raw = self.graph_similiarity(count_list1, count_list2)

		# Optionally, normalize these graph similarity scores
		norm1 = np.linalg.norm(count_list1)
		norm2 = np.linalg.norm(count_list2)
		similarity_cosine = similarity_raw / (norm1 * norm2 + 1e-9)

		return similarity_cosine


# RNAfold Local Implementation
Define the `RNAfold` wrapper function and its helper functions, which run RNAfold with optional constraints and then extract from its output the graphs needed to compute the WL graph similarity. This assumes a locally installed version of RNAfold.

In [60]:
def RNAfold(rnafold_cmd, fasta_file, constraint):
    
    fasta_file_preffix = fasta_file.rsplit('.', 1)[0]
    output_pdf = fasta_file_preffix + '_proteins.fa'
    log_file = fasta_file_preffix + '_RNAfold.log'
    rnafold_cmd += ' {input_fasta} --filename-full'
    if constraint == True: rnafold_cmd += " -C"     
    
    rnafold_cmd = rnafold_cmd.format(input_fasta=fasta_file)
    
    with open(log_file, 'w') as lf:
    	sp.call(rnafold_cmd.split(), stdout = lf)
    	
    os.remove(log_file)
        
    return

In [61]:
def create_fasta_constraints(seq, dms_seq, file_name, constraints = False, divide_by = 1):

	# Build the constraint string only when requested, so the constraints flag is not silently overwritten
	constraint_str = ""
	
	# Previous cutoff threshold was 0.04
	# Apply hard constraint with cutoff threshold of 0.06
	if constraints:
		for dms in dms_seq:
			if dms/divide_by < 0.06:
				constraint_str += "."
			else:
				# "x" forces the nucleotide to stay unpaired
				constraint_str += "x"

	rand_id = str(uuid.uuid1())

	with open(file_name, "w") as file_:
		file_.write(">" + rand_id)
		file_.write("\n")
		file_.write(seq)
		if constraint_str: file_.write("\n" + constraint_str)

	return constraint_str, file_name
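
A short worked example may help to see what the thresholding produces. The sketch below is illustrative only: `dms_to_constraint` is a hypothetical helper, the scores are made up, and the record header `>example` stands in for the random ID written by the function above.

```python
def dms_to_constraint(scaled_scores, scale_factor=100, threshold=0.06):
    """Map scaled DMS predictions to a hard-constraint string."""
    symbols = []
    for score in scaled_scores:
        # "x" marks a base that must stay unpaired, "." leaves it unconstrained
        symbols.append("x" if score / scale_factor >= threshold else ".")
    return "".join(symbols)

seq = "ATGCGA"
constraint = dms_to_constraint([80.0, 2.0, 5.0, 90.0, 12.0, 1.0])
record = ">example\n" + seq + "\n" + constraint
print(record)
```

The constraint line has one symbol per nucleotide, so it must always be exactly as long as the sequence above it.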

In [62]:
def extract_graph(file_name):


	file_ = open(file_name)
	lines = file_.readlines()
	
	start_ind = lines.index("/pairs [\n")
	end_ind = start_ind + lines[start_ind:].index("] def\n")

	connected = lines[start_ind + 1:end_ind]
	connected = [[int(c[:-1].split(" ")[0][1:])-1, int(c[:-1].split(" ")[1][:-1])-1] for c in connected]
	file_.close()

	return connected
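
The parsing step above is fairly dense, so here is an equivalent sketch using a regular expression. It is not the notebook's function: `parse_pairs` is a hypothetical name, and the sample lines are a made-up stand-in that assumes the same layout `extract_graph` relies on (a `/pairs [` line, one `[i j]` entry per line with 1-based positions, then `] def`).

```python
import re

def parse_pairs(ps_lines):
    """Return 0-based (i, j) base pairs found between '/pairs [' and '] def'."""
    start = ps_lines.index("/pairs [\n")
    end = start + ps_lines[start:].index("] def\n")
    pairs = []
    for line in ps_lines[start + 1:end]:
        match = re.match(r"\[(\d+) (\d+)\]", line)
        if match:
            # Convert from 1-based positions in the file to 0-based indices
            pairs.append((int(match.group(1)) - 1, int(match.group(2)) - 1))
    return pairs

# Made-up excerpt for illustration only
sample = ["%%BeginProlog\n", "/pairs [\n", "[1 20]\n", "[2 19]\n", "] def\n", "init\n"]
pairs = parse_pairs(sample)
print(pairs)
```

Each returned pair becomes one extra edge in the structure graph, on top of the backbone edges between consecutive nucleotides.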


In [63]:
def create_graph(input_file, constraint):

	RNAfold("RNAfold", input_file, constraint)
	
	os.remove(input_file)
	current_files = glob.glob("./*.ps")
	connected = extract_graph(current_files[0])
	os.remove(current_files[0])

	return connected

# LSTM Training and Evaluation
Define the training and evaluation functions for our LSTM.

In [64]:
# Define training method 
def train_model(trainloader, net, criterion, optimizer, epoch, teacher_forcing_ratio = 0.25):

	torch.autograd.set_detect_anomaly(True)
	running_loss = 0.0
	net.zero_grad()
	
	
	for i, data in enumerate(trainloader):
		input_data, target_data = data
		loss = 0
		hidden = net.initHidden()
		optimizer.zero_grad()
		output, hidden = net(torch.IntTensor(input_data), hidden)

		# loss = pearson_loss(output.squeeze(), target_data[0], input_data)
		loss = criterion(output.squeeze(),target_data[0])
		
		
		loss.backward()
		optimizer.step()
		# Accumulate a plain float so each step's autograd graph can be freed
		running_loss = running_loss + loss.item()

	return net, running_loss/len(trainloader)

In [65]:
# Define evaluation method
def eval_model(dataloader, net, criterion, epoch, pearson =False):


	running_loss = 0
	input_data_list = []
	output_data_list = []
	target_data_list = []
	
	with torch.no_grad():
		for data in dataloader:
			input_data, target_data = data
			loss = 0
			hidden = net.initHidden()
			output, hidden = net(torch.IntTensor(input_data), hidden)
		
			if pearson == False:
				loss = criterion(output.squeeze(), target_data[0]).item()
			else:
			
				loss = pearson_loss(output.squeeze(), target_data[0], input_data)
			
			
			running_loss = running_loss + loss
			
			input_data_list.append(input_data)
			output_data_list.append(output.squeeze())
			target_data_list.append(target_data[0])
	    
	return running_loss/len(dataloader), output_data_list, input_data_list, target_data_list
	

Run the main training and evaluation loop for the LSTM on the preprocessed DMS data. The outer loop iterates over all folds, while the inner loop iterates over all training epochs. At the end of training, we evaluate the model on the test data. 

We use the trained LSTM to predict DMS values for the corresponding RNA sequences. These predictions are then converted into constraints and passed to RNAfold, combining the experimental DMS signal with RNAfold's thermodynamic scoring model.

We then run RNAfold in both settings, with and without the model's DMS feature vectors, and compute the WL graph similarity between the two predicted structures to see how the DMS values predicted by our trained LSTM affect RNAfold's predictions.
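
The loop below keeps a copy of the model from the best validation epoch, but only considers epochs in the second half of training. That selection rule can be sketched on its own as follows; `select_checkpoint` is a hypothetical helper and the validation losses are made-up numbers.

```python
def select_checkpoint(val_losses):
    """Return (epoch, loss) of the best validation loss after the warm-up epochs."""
    best_loss, best_epoch = float("inf"), None
    warmup = int(len(val_losses) / 2)
    for epoch, loss in enumerate(val_losses):
        # Same condition as the training loop: improve or tie, and past the halfway point
        if loss <= best_loss and epoch > warmup:
            best_loss, best_epoch = loss, epoch
    return best_epoch, best_loss

val_losses = [1.9, 1.2, 1.5, 1.4, 1.3, 1.35, 1.31, 1.28, 1.30, 1.29]
best_epoch, best_loss = select_checkpoint(val_losses)
print(best_epoch, best_loss)
```

Note that the low loss at epoch 1 is ignored here because it falls inside the warm-up window, so the saved model comes from epoch 7.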

In [66]:

for enum, indices in enumerate(kfold.split( dataset.return_x(), list(range(len(dataset.return_y()))))):
    # Enumerate over every fold

    train_indices, test_val_ind = indices[0], indices[1]
    split2 = int(np.floor(0.5 * len(test_val_ind)))
    val_indices,test_indices = test_val_ind[split2:], test_val_ind[:split2]
    
    g = torch.Generator()
    g.manual_seed(0)

    train_sampler = SubsetRandomSampler(train_indices)
    valid_sampler = SubsetRandomSampler(val_indices)
    test_sampler = SubsetRandomSampler(test_indices)

    train_loader = DataLoader(dataset, batch_size=1, sampler = train_sampler, collate_fn=my_collate, generator=g)
    validation_loader = DataLoader(dataset, batch_size = 1, sampler = valid_sampler, collate_fn=my_collate, generator=g)
    test_loader = DataLoader(dataset, batch_size = 1, sampler = test_sampler, collate_fn=my_collate, generator=g)

    hidden_size = 50
    net = LSTM(dataset.num_words(),hidden_size, max_length=dataset.max_length)
    saved_net =  LSTM(dataset.num_words(),hidden_size, max_length=dataset.max_length)
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

    # Train with L1 loss; the Pearson correlation is only used for the final test evaluation
    criterion = nn.L1Loss()
    best_validation = np.inf
    epoch_range = 10

    for epoch in range(0, epoch_range):
            
        net, running_loss = train_model(train_loader, net = net, criterion = criterion,  optimizer = optimizer, epoch = epoch)
        running_loss_val, output, input_data, target_data = eval_model(validation_loader, net, criterion = criterion, epoch = epoch)
        print(f'train-loss in epoch {epoch}: {running_loss}')

        if running_loss_val <= best_validation and epoch > int(epoch_range/2):
            best_validation = running_loss_val
            saved_net.load_state_dict(net.state_dict())
            
    
    print(f'best val-loss:{best_validation}')
    #### final test using pearson correlation #####
    print("##############################################################################################")
    
    
    running_loss, output, input_data, target_data_list = eval_model(test_loader, saved_net, criterion = criterion, epoch = epoch, pearson = True)
    
    print(f'pearson-correlation in fold {enum}: {running_loss}')
    
    file_name = "./test.fa"

    graph_similarity = []
    
    for en,dms_predicted in enumerate(output):
    
        seq = dataset.reconstruct_input(input_data[en])
        
        # Create constraints for the setting with DMS feature vector predicted from the LSTM and build its graph
        constraints, input_file = create_fasta_constraints(seq, dms_predicted, file_name, constraints = True, divide_by = scale_factor)
        first_connections = create_graph(input_file, constraint = True)
        
        # Create constraints for the setting without and build its graph
        constraints, input_file = create_fasta_constraints(seq, dms_predicted, file_name, constraints = False, divide_by = scale_factor)
        second_connections = create_graph(input_file, constraint = False)
        
        kernel = Weisfeiler_Lehman_Graph()

        # Compare both graphs using WL 
        graph_sim = kernel.calculate_weisfeiler_lehman_graph(seq, first_connections, seq, second_connections, num_iterations = 1)
        graph_similarity.append(graph_sim)

    kfold_loss.append(running_loss)
    graph_sim_loss.append(graph_similarity)
    
    
    print(f'graph similarity in fold {enum}: {np.mean(graph_similarity)}')

train-loss in epoch 0: 1.7065527439117432
train-loss in epoch 1: 1.3966858386993408
train-loss in epoch 2: 1.385648250579834
train-loss in epoch 3: 1.3779101371765137
train-loss in epoch 4: 1.3747235536575317
train-loss in epoch 5: 1.3742960691452026
train-loss in epoch 6: 1.3735946416854858
train-loss in epoch 7: 1.3715449571609497
train-loss in epoch 8: 1.3701952695846558
train-loss in epoch 9: 1.3700186014175415
best val-loss:1.2919033765792847
##############################################################################################
pearson-correlation in fold 0: 0.6263292723894119
graph similarity in fold 0: 0.9999999999970873
train-loss in epoch 0: 1.6560981273651123
train-loss in epoch 1: 1.3650643825531006
train-loss in epoch 2: 1.3553040027618408
train-loss in epoch 3: 1.3481696844100952
train-loss in epoch 4: 1.343807339668274
train-loss in epoch 5: 1.3416534662246704
train-loss in epoch 6: 1.3388503789901733
train-loss in epoch 7: 1.3396278619766235
train-loss in epoch 8

We then report the Pearson correlation and the (normalized) WL graph similarity for each fold. The Pearson correlation coefficient measures how well the DMS values predicted by the LSTM agree with the measured DMS values on the test data. The normalized graph similarity measures how similar the RNAfold predictions are between the two settings (with and without LSTM-predicted DMS constraints). We can use it to observe how much the LSTM predictions change RNAfold's predicted structures, with scores close to 1 indicating nearly identical predictions and thus no significant difference.

In [69]:
for num, auprc in enumerate(kfold_loss):
    print(f'pearson-correlation in fold {num}: {kfold_loss[num]}')
    print(f'Graph-similarity in fold {num}: {np.mean(graph_sim_loss[num])}')


pearson-correlation in fold 0: 0.6275616657733917
Graph-similarity in fold 0: 391.45
pearson-correlation in fold 1: 0.5887924933433533
Graph-similarity in fold 1: 389.26
pearson-correlation in fold 2: 0.5955320447683334
Graph-similarity in fold 2: 402.32
pearson-correlation in fold 3: 0.6232126426696777
Graph-similarity in fold 3: 403.33
pearson-correlation in fold 4: 0.5983535408973694
Graph-similarity in fold 4: 378.08
pearson-correlation in fold 5: 0.6263292723894119
Graph-similarity in fold 5: 0.9999999999970873
pearson-correlation in fold 6: 0.5900688728690148
Graph-similarity in fold 6: 0.9999999999971892
pearson-correlation in fold 7: 0.5955814132094384
Graph-similarity in fold 7: 0.9999999999971468
pearson-correlation in fold 8: 0.625910864174366
Graph-similarity in fold 8: 0.9999999999971996
pearson-correlation in fold 9: 0.600650106370449
Graph-similarity in fold 9: 0.9999999999970807


In [70]:
print(f'Average Loss over all folds: {np.mean(kfold_loss)}')
print(f'Average Graph-similarity over all folds: {np.mean(graph_sim_loss)}')

Average Loss over all folds: 0.6071992916464806
Average Graph-similarity over all folds: 196.94399999999857
