**☀️ Welcome to ParkerNet Version 1.0 Notebook 2 of 3!**


With this notebook you will be able to do the following:


1. Load the individual .keras models for ParkerNet (each trained with a different seed and train split).
2. Predict on a new dataset not used in the training, validation, or testing parts of Notebook 1.
3. Calculate simple soft voting (a plain average of the predictions from all models) and weighted soft voting (models weighted by their AUC_PRC, the area under the precision-recall curve).
4. Create prediction datasets for data visualization and analysis.

***Note: We use a day in E7 for prediction here as a working example, but this notebook can be adjusted to load in the rest of the files used for prediction.***

The files you will need to run this notebook are:

PSP_E1toE7_July23_nonoise.txt (this is the dataset used for training and validation in Notebook 1)

The following new datasets are used for prediction; each of these files contains approximately one day of data near perihelion from one encounter.

PSP_E1_ForPrediction.txt

PSP_E2_ForPrediction.txt

PSP_E4_ForPrediction.txt

PSP_E5_ForPrediction.txt

PSP_E6_ForPrediction.txt

PSP_E7_ForPrediction.txt

PSP_E8_ForPrediction.txt

You will also need the following pre-trained ParkerNet models, which are provided for you:

| **Model Name** | **Model Name** | **Model Name** |
|--------------|--------------|--------------|
| ParkerNet_09172024_splitM_seed493.keras | ParkerNet_09172024_splitM_seed552.keras | ParkerNet_09172024_splitM_seed838.keras |
| ParkerNet_09172024_splitM_seed1022.keras | ParkerNet_08262024_splitM_seed123.keras | ParkerNet_08262024_splitM_seed324.keras |
| ParkerNet_08262024_splitM_seed369.keras | ParkerNet_08262024_splitM_seed564.keras | ParkerNet_08262024_splitM_seed641.keras |
| ParkerNet_08262024_splitM_seed910.keras | ParkerNet_08262024_splitM_seed1153.keras | ParkerNet_08262024_splitM_seed1187.keras |
| ParkerNet_08262024_splitM_seed775.keras | ParkerNet_08262024_splitM_seed1337.keras | ParkerNet_08262024_splitM_seed1886.keras |
| ParkerNet_08262024_splitM_seed1953.keras | ParkerNet_08262024_splitM_seed1962.keras | ParkerNet_09092024_splitN_seed493.keras |
| ParkerNet_09092024_splitN_seed552.keras | ParkerNet_09092024_splitN_seed838.keras | ParkerNet_09092024_splitN_seed1022.keras |
| ParkerNet_09092024_splitN_seed123.keras | ParkerNet_09092024_splitN_seed324.keras | ParkerNet_09092024_splitN_seed369.keras |
| ParkerNet_09092024_splitN_seed564.keras | ParkerNet_09092024_splitN_seed641.keras | ParkerNet_09092024_splitN_seed910.keras |
| ParkerNet_09092024_splitN_seed1153.keras | ParkerNet_09092024_splitN_seed1187.keras | ParkerNet_09092024_splitN_seed775.keras |
| ParkerNet_09092024_splitN_seed1337.keras | ParkerNet_09092024_splitN_seed1886.keras | ParkerNet_09092024_splitN_seed1953.keras |
| ParkerNet_09092024_splitN_seed1962.keras | ParkerNet_09032024_splitP_seed493.keras | ParkerNet_09032024_splitP_seed552.keras |
| ParkerNet_09032024_splitP_seed838.keras | ParkerNet_09032024_splitP_seed1022.keras | ParkerNet_09032024_splitP_seed123.keras |
| ParkerNet_09042024_splitP_seed324.keras | ParkerNet_09042024_splitP_seed369.keras | ParkerNet_09042024_splitP_seed564.keras |
| ParkerNet_09042024_splitP_seed641.keras | ParkerNet_09042024_splitP_seed910.keras | ParkerNet_09042024_splitP_seed1153.keras |
| ParkerNet_09042024_splitP_seed1187.keras | ParkerNet_09032024_splitP_seed775.keras | ParkerNet_09042024_splitP_seed1337.keras |
| ParkerNet_09042024_splitP_seed1886.keras | ParkerNet_09042024_splitP_seed1953.keras | ParkerNet_09032024_splitP_seed1962.keras |
| ParkerNet_10082024_splitN_seed1843.keras | ParkerNet_10082024_splitN_seed2816.keras | ParkerNet_10082024_splitN_seed983.keras |
| ParkerNet_10142024_splitN_seed2221.keras | ParkerNet_10142024_splitN_seed3060.keras | ParkerNet_10142024_splitN_seed3247.keras |
| ParkerNet_10142024_splitN_seed3364.keras | ParkerNet_10142024_splitN_seed3539.keras | ParkerNet_10142024_splitN_seed3871.keras |
| ParkerNet_10142024_splitN_seed400.keras | ParkerNet_10142024_splitN_seed4032.keras | ParkerNet_10142024_splitN_seed454.keras |
| ParkerNet_10162024_splitM_seed1843.keras | ParkerNet_10162024_splitM_seed2221.keras | ParkerNet_10162024_splitM_seed3060.keras |
| ParkerNet_10162024_splitM_seed3247.keras | ParkerNet_10162024_splitM_seed3364.keras | ParkerNet_10162024_splitM_seed3871.keras |
| ParkerNet_10162024_splitM_seed4032.keras | ParkerNet_10162024_splitM_seed983.keras | |

***Usage***: If you would like to predict on a dataset that is neither in the training data nor among the files provided here, make sure you follow the pre-processing steps in the publication and Notebook 1 exactly. The test set must have the same number of columns, in the same order, with the same time resolution. Once you have that, you can predict using each of the pre-trained models and compute a weighted average prediction. If no ground truth label is available to compute AUC_PRC, you may use another method to compute the weights, e.g., the standard deviation across models for each sample, or the Shannon entropy.
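
As a minimal sketch of the two label-free quantities mentioned above, the snippet below computes the per-sample standard deviation across models and the Shannon entropy of the mean predicted probability. The array `probs` is made-up toy data, and how these quantities are turned into model weights is not specified in this notebook, so that step is left open here.

```python
import numpy as np

# Hypothetical probabilities: rows are samples, columns are models.
probs = np.array([
    [0.90, 0.85, 0.95],   # models agree
    [0.10, 0.90, 0.50],   # models disagree
])

# Per-sample standard deviation across models (larger = more disagreement).
spread = probs.std(axis=1)

# Binary Shannon entropy (in bits) of the mean predicted probability per sample.
p_mean = np.clip(probs.mean(axis=1), 1e-12, 1 - 1e-12)
entropy = -(p_mean * np.log2(p_mean) + (1 - p_mean) * np.log2(1 - p_mean))

print(spread)
print(entropy)
```

Both quantities are larger for the second sample, where the toy models disagree.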




In [None]:
#load libraries
import pandas as pd
import os
import numpy as np
import tensorflow as tf
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D,Activation, Dense, Dropout, Flatten, TimeDistributed, Bidirectional, LSTM, GlobalMaxPool1D, GlobalAveragePooling1D
from keras.optimizers import Adam
import matplotlib.pyplot as plt
from time import time
import seaborn as sns
import sklearn
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve
from sklearn.metrics import roc_auc_score, roc_curve, f1_score, precision_score, recall_score, accuracy_score, auc
from keras import initializers
import random
from keras.models import load_model
from sklearn.metrics import average_precision_score


The package versions used to run this notebook are printed below.

In [None]:
print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")
print(f"TensorFlow: {tf.__version__}")
print(f"Keras: {keras.__version__}")
print(f"Seaborn: {sns.__version__}")
print(f"Scikit-learn: {sklearn.__version__}")

Pandas: 2.2.2
NumPy: 1.26.4
TensorFlow: 2.18.0
Keras: 3.8.0
Seaborn: 0.13.2
Scikit-learn: 1.6.1


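Mount Google Drive to access the data and model files. If your files are stored elsewhere (for example, uploaded from your local machine), provide the corresponding path instead.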

In [None]:
from google.colab import drive
drive.mount('/content/MyDrive')

Mounted at /content/MyDrive


In [None]:
cd MyDrive/MyDrive

/content/MyDrive/MyDrive


In [None]:

# Load the dataset used for training
df = pd.read_csv('PSP_E1toE7_July23_nonoise.txt', sep="\t", parse_dates=['Datetime'], index_col='Datetime')


df['Class'] = df['Class'].map({True: 1, False: 0}).astype(int)

# Drop unnecessary columns
df.pop("indices")
df.pop("Encounter")
X_HCI_train = df.pop("X_HCI")
Y_HCI_train = df.pop("Y_HCI")
Z_HCI_train = df.pop("Z_HCI")
df.pop("Dist")

# Preview the dataset
df.head()




Datetime,B_r,Bmag,B_t,B_n,V_r,V_t,V_n,Vmag,V_nr,ProtonDensity,Class
2018-11-04 00:00:00.000,-66.058226,70.255204,-4.140612,-23.557585,267.886,-28.3115,-39.9285,272.321015,48.947177,391.665,0
2018-11-04 00:00:00.873,-65.699724,70.171543,-3.381591,-24.416317,263.905,-69.5283,-43.8579,276.411919,82.20523,536.104,0
2018-11-04 00:00:01.747,-66.442084,70.452417,-8.469396,-21.846322,266.871,-41.5544,-24.7354,271.217143,48.359158,367.138,0
2018-11-04 00:00:02.621,-65.771892,70.127234,-15.798794,-18.500954,260.689,-57.2835,-41.0414,270.04546,70.468403,504.809,0
2018-11-04 00:00:03.495,-64.232595,70.390542,-18.842718,-21.770487,267.265,-26.057,-45.9163,272.42954,52.794639,394.882,0


Some information about the prediction files:

- Each file meant for prediction has a class label created from the Huang et al. (2023) catalog data. This class column is used later to calculate the AUC_PRC in the weighted voting section.

- The file for each encounter spans about one day in time.

- When loading the files with the code below, make sure to change the encounter number: for example, when loading 'PSP_E1_ForPrediction.txt', change df_EX to df_E1, and so on.

# ====== Loading in data for prediction ======

In [None]:
# Read in the dataset to predict on (provided as PSP_EX_ForPrediction.txt).
# Change df_E7 to df_E1, df_E2, etc. when you load the corresponding file for each encounter (e.g., PSP_E1_ForPrediction.txt).
df_E7 = pd.read_csv('PSP_E7_ForPrediction.txt', sep="\t", parse_dates=['Datetime'], index_col='Datetime')

# Separate the 'Class_Huang_EX_Range' column into df_gt
df_gt_E7 = df_E7[['Class_Huang_E7_Range']].copy()

pd.set_option('future.no_silent_downcasting', True)
df_gt_E7['Class_Huang_E7_Range'] = df_gt_E7['Class_Huang_E7_Range'].replace({True: 1, False: 0}).infer_objects(copy=False).astype(int)

# Create df_test by dropping the 'Class_Huang_EX_Range' column and the other columns that are not needed.
# df_test contains only the features to predict on; the class column is kept in df_gt as ground truth
# for calculating AUC_PRC later.
df_test_E7 = df_E7.drop(columns=['Class_Huang_E7_Range'])

# Drop the HCI coordinate columns, which are not used as features
df_test_E7.pop("X_HCI")
df_test_E7.pop("Y_HCI")
df_test_E7.pop("Z_HCI")



In [None]:
df_test_E7.head()



Datetime,B_r,Bmag,B_t,B_n,V_r,V_t,V_n,Vmag,V_nr,ProtonDensity
2021-01-12 00:00:00.000000000,17.814236,41.5356,-37.521449,-8.362526,306.013,40.2825,-0.77722,308.652938,40.289997,140.895
2021-01-12 00:00:00.873799999,19.399466,42.466661,-37.776686,-6.637471,306.013,40.2825,-0.77722,308.652938,40.289997,140.895
2021-01-12 00:00:01.747599999,17.717973,42.221923,-38.32446,-6.60931,306.013,40.2825,-0.77722,308.652938,40.289997,140.895
2021-01-12 00:00:02.621399999,13.897688,41.848888,-39.473836,-8.688779,317.594,44.2585,-1.91183,320.663006,44.299773,122.34
2021-01-12 00:00:03.495199999,19.183878,42.398007,-37.809652,-5.24117,305.668,36.073,-0.6364,307.789193,36.078613,131.876


df_gt_E7 is the ground truth. Each prediction file contains a column called Class_Huang_EX_Range, which holds ground truth flags created from the Huang et al. (2023) catalog, spanning LQP start to TQP end. This column is used to calculate the AUC_PRC in the ensemble averaging section later in the notebook.

In [None]:
df_gt_E7.head()

Datetime,Class_Huang_E7_Range
2021-01-12 00:00:00.000000000,0
2021-01-12 00:00:00.873799999,0
2021-01-12 00:00:01.747599999,0
2021-01-12 00:00:02.621399999,0
2021-01-12 00:00:03.495199999,0


We will use the training-set statistics (mean and standard deviation) of split M to z-scale the prediction set.

In [None]:
# Split M: train: E1 to E4, val: E5-E6 EOD Sept 23, test 1: E6 Sept 24 + E7, test 2: E15
Xtrain = df.iloc[0:1615023,:-1]        # features: every column except the last
ytrain = df.iloc[0:1615023,-1]         # labels: the last column (Class)
Xval = df.iloc[1615023:1788065,:-1]    # features: every column except the last
yval = df.iloc[1615023:1788065,-1]     # labels: the last column (Class)
Xtest1 = df.iloc[1788065:,:-1]         # features: every column except the last
ytest1 = df.iloc[1788065:,-1]          # labels: the last column (Class)

z-scaling

In [None]:
Xpred = df_test_E7 #pred dataset has only input variables so use it as it is
train_mean = Xtrain.mean()
train_std = Xtrain.std()

train_df = (Xtrain - train_mean) / train_std
val_df = (Xval - train_mean) / train_std
test_df = (Xtest1 - train_mean) / train_std
pred_df = (Xpred - train_mean) / train_std
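
Because pandas aligns the subtraction above by column name, a prediction file with missing or renamed columns would produce NaN columns rather than an error. The sketch below shows one way to check the columns before scaling; `check_columns` and the toy frames are hypothetical helpers for illustration, not part of the original notebook.

```python
import pandas as pd

def check_columns(train_cols, pred_cols):
    """Raise if the prediction columns differ from the training columns in name or order."""
    if list(train_cols) != list(pred_cols):
        raise ValueError(
            f"Column mismatch: expected {list(train_cols)}, got {list(pred_cols)}"
        )

# Toy frames standing in for Xtrain and Xpred
toy_train = pd.DataFrame({"B_r": [1.0, 3.0], "Bmag": [2.0, 4.0]})
toy_pred = pd.DataFrame({"B_r": [2.0], "Bmag": [3.0]})

check_columns(toy_train.columns, toy_pred.columns)

# Z-scale the prediction features with the training mean and standard deviation
toy_scaled = (toy_pred - toy_train.mean()) / toy_train.std()
print(toy_scaled)
```

In this notebook the equivalent call would be `check_columns(Xtrain.columns, Xpred.columns)`, placed before the scaling step.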

Creating sequences

In [None]:
def split_sequences_nolags(dataset,labels, time_steps):
    data_X, data_Y = [], []
    for i in range(len(dataset)-time_steps):
        a = dataset.iloc[i:(i+time_steps)]
        data_X.append(a)
        data_Y.append(labels.iloc[i:i + time_steps])
    return np.array(data_X), np.array(data_Y)

Since the labels have been removed from the prediction set, you will need to create sequences of the features only, not the labels.

In [None]:
# Use this for a set with no labels. It is a modified version of the split_sequences_nolags function above that builds the feature sequences only.
def split_sequences_nolags_test(dataset, time_steps):
    data_X = []
    for i in range(len(dataset)-time_steps):
        a = dataset.iloc[i:(i+time_steps)]
        data_X.append(a)
        #data_Y.append(labels.iloc[i:i + time_steps])
    return np.array(data_X)
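
The Python loop above can be slow on long time series. As a sketch of one possible alternative (not the method used to produce the results in this notebook), NumPy's `sliding_window_view` builds the same overlapping windows without a loop. The function name `make_windows` is hypothetical; the final window is dropped so that the number of windows matches `range(len(dataset) - time_steps)`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(values, time_steps):
    """Return overlapping windows of shape (n_windows, time_steps, n_features)."""
    values = np.asarray(values)
    # sliding_window_view puts the window axis last: (n - T + 1, n_features, T)
    windows = sliding_window_view(values, window_shape=time_steps, axis=0)
    # Move the window axis to the middle and drop the final window so the
    # count matches range(len(dataset) - time_steps) in the loop version.
    return np.ascontiguousarray(windows.transpose(0, 2, 1)[:-1])

toy = np.arange(12, dtype=float).reshape(6, 2)  # 6 rows, 2 features
w = make_windows(toy, time_steps=3)
print(w.shape)
```

With a DataFrame such as `pred_df`, the call would be `make_windows(pred_df.to_numpy(), 50)`.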

In [None]:
train_features_new, train_labels_new = split_sequences_nolags(train_df,ytrain,50) #originally 343, previously:90, tried: 72, 25, current: 50
val_features_new, val_labels_new = split_sequences_nolags(val_df,yval,50)
test_features_new, test_labels_new = split_sequences_nolags(test_df,ytest1,50)#for test set with class labels
pred_features_new = split_sequences_nolags_test(pred_df,50) #note that the no class label sequence function is used here. The class label is removed and used later for calculating AUC_PRC

n_timesteps, n_features, n_outputs = train_features_new.shape[1],train_features_new.shape[2],train_labels_new.shape[1]
train_labels_new= train_labels_new.astype('float64')
val_labels_new = val_labels_new.astype('float64')
test_labels_new = test_labels_new.astype('float64')

# ====== Loading in Pre-Trained ParkerNet Models ======

You need this part of the code for the models to load correctly. The line starting with "@" is a decorator: it registers the custom weighted binary cross-entropy loss with Keras so that Keras can reconstruct the function when loading the pre-trained models. Without it, you will not be able to load the models.

In [None]:
# Multiplier applied to the loss on positive targets
POS_WEIGHT = 200
@keras.saving.register_keras_serializable()
def weighted_binary_crossentropy(target, output):
    """
    Weighted binary crossentropy between an output tensor
    and a target tensor. POS_WEIGHT is used as a multiplier
    for the positive targets.

    Combination of the following functions:
    * keras.losses.binary_crossentropy
    * keras.backend.tensorflow_backend.binary_crossentropy
    * tf.nn.weighted_cross_entropy_with_logits
    """
    # transform back to logits
    _epsilon = tf.convert_to_tensor(tf.keras.backend.epsilon(), output.dtype.base_dtype)
    output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
    output = tf.math.log(output / (1 - output))
    loss = tf.nn.weighted_cross_entropy_with_logits(labels=target, logits=output, pos_weight=POS_WEIGHT)

    return tf.reduce_mean(loss, axis=-1)
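
To make the effect of POS_WEIGHT concrete, here is a small NumPy sketch of the per-element quantity that `tf.nn.weighted_cross_entropy_with_logits` computes once the logits are mapped back to probabilities. The function name `weighted_bce_numpy` and the toy inputs are hypothetical and only for illustration; the loss used by the models is the TensorFlow function above.

```python
import numpy as np

def weighted_bce_numpy(y_true, p, pos_weight):
    """Per-element weighted binary cross-entropy on probabilities."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    # The positive-class term is multiplied by pos_weight; the negative term is not.
    return -(pos_weight * y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 0.0])   # one positive and one negative target
p = np.array([0.5, 0.5])   # the same predicted probability for both
loss = weighted_bce_numpy(y, p, pos_weight=200)
print(loss)
```

For the same predicted probability, the loss on the positive target is 200 times the loss on the negative target.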

Loading pre-trained ParkerNet models: here we load each of the pre-trained models (with differing train splits and seeds) and store each model's predictions on the dataset (E7 in this example) in a dataframe. This lets us keep a record of how well each model predicts and calculate an average, or "ensemble", prediction.

In [None]:

#from keras.models import load_model
#import os


# Function to extract the model name from the filename without the '.keras' extension
def get_model_name(filepath):
    # Extract the base filename (remove directory)
    filename = os.path.basename(filepath)
    # Remove the '.keras' extension
    filename = os.path.splitext(filename)[0]
    # Split the filename by underscores and extract the part after the second underscore
    return "_".join(filename.split('_')[2:])

# Load the pre-trained models and their filenames, these are provided to you along with this notebook
model_paths = [
    'ParkerNet_09172024_splitM_seed493.keras',
    'ParkerNet_09172024_splitM_seed552.keras',
    'ParkerNet_09172024_splitM_seed838.keras',
    'ParkerNet_09172024_splitM_seed1022.keras',
    'ParkerNet_08262024_splitM_seed123.keras',
    'ParkerNet_08262024_splitM_seed324.keras',
    'ParkerNet_08262024_splitM_seed369.keras',
    'ParkerNet_08262024_splitM_seed564.keras',
    'ParkerNet_08262024_splitM_seed641.keras',
    'ParkerNet_08262024_splitM_seed910.keras',
    'ParkerNet_08262024_splitM_seed1153.keras',
    'ParkerNet_08262024_splitM_seed1187.keras',
    'ParkerNet_08262024_splitM_seed775.keras',
    'ParkerNet_08262024_splitM_seed1337.keras',
    'ParkerNet_08262024_splitM_seed1886.keras',
    'ParkerNet_08262024_splitM_seed1953.keras',
    'ParkerNet_08262024_splitM_seed1962.keras',
    'ParkerNet_09092024_splitN_seed493.keras',
    'ParkerNet_09092024_splitN_seed552.keras',
    'ParkerNet_09092024_splitN_seed838.keras',
    'ParkerNet_09092024_splitN_seed1022.keras',
    'ParkerNet_09092024_splitN_seed123.keras',
    'ParkerNet_09092024_splitN_seed324.keras',
    'ParkerNet_09092024_splitN_seed369.keras',
    'ParkerNet_09092024_splitN_seed564.keras',
    'ParkerNet_09092024_splitN_seed641.keras',
    'ParkerNet_09092024_splitN_seed910.keras',
    'ParkerNet_09092024_splitN_seed1153.keras',
    'ParkerNet_09092024_splitN_seed1187.keras',
    'ParkerNet_09092024_splitN_seed775.keras',
    'ParkerNet_09092024_splitN_seed1337.keras',
    'ParkerNet_09092024_splitN_seed1886.keras',
    'ParkerNet_09092024_splitN_seed1953.keras',
    'ParkerNet_09092024_splitN_seed1962.keras',
    'ParkerNet_09032024_splitP_seed493.keras',
    'ParkerNet_09032024_splitP_seed552.keras',
    'ParkerNet_09032024_splitP_seed838.keras',
    'ParkerNet_09032024_splitP_seed1022.keras',
    'ParkerNet_09032024_splitP_seed123.keras',
    'ParkerNet_09042024_splitP_seed324.keras',
    'ParkerNet_09042024_splitP_seed369.keras',
    'ParkerNet_09042024_splitP_seed564.keras',
    'ParkerNet_09042024_splitP_seed641.keras',
    'ParkerNet_09042024_splitP_seed910.keras',
    'ParkerNet_09042024_splitP_seed1153.keras',
    'ParkerNet_09042024_splitP_seed1187.keras',
    'ParkerNet_09032024_splitP_seed775.keras',
    'ParkerNet_09042024_splitP_seed1337.keras',
    'ParkerNet_09042024_splitP_seed1886.keras',
    'ParkerNet_09042024_splitP_seed1953.keras',
    'ParkerNet_09032024_splitP_seed1962.keras',
    'ParkerNet_10082024_splitN_seed1843.keras',
    'ParkerNet_10082024_splitN_seed2816.keras',
    'ParkerNet_10082024_splitN_seed983.keras',
    'ParkerNet_10142024_splitN_seed2221.keras',
    'ParkerNet_10142024_splitN_seed3060.keras',
    'ParkerNet_10142024_splitN_seed3247.keras',
    'ParkerNet_10142024_splitN_seed3364.keras',
    'ParkerNet_10142024_splitN_seed3539.keras',
    'ParkerNet_10142024_splitN_seed3871.keras',
    'ParkerNet_10142024_splitN_seed400.keras',
    'ParkerNet_10142024_splitN_seed4032.keras',
    'ParkerNet_10142024_splitN_seed454.keras',
    'ParkerNet_10162024_splitM_seed1843.keras',
    'ParkerNet_10162024_splitM_seed2221.keras',
    'ParkerNet_10162024_splitM_seed3060.keras',
    'ParkerNet_10162024_splitM_seed3247.keras',
    'ParkerNet_10162024_splitM_seed3364.keras',
    'ParkerNet_10162024_splitM_seed3871.keras',
    'ParkerNet_10162024_splitM_seed4032.keras',
    'ParkerNet_10162024_splitM_seed983.keras',
]

# Create an empty DataFrame to store the predictions
df_allprobs = pd.DataFrame()

# function to predict and add the results to the dataframe
def predict_and_store(model, model_name, df, pred_features_new):
    # Predict using the model on the prediction set
    probs_pred = model.predict(pred_features_new, batch_size=1024)

    # Average the predicted probabilities over the time steps of each sequence
    final_classification = np.mean(probs_pred, axis=1)

    # Add the predictions to the dataframe with the model name as the column
    df[model_name] = pd.Series(np.squeeze(final_classification))

    return df

# Iterate through each model and its path
for model_path in model_paths:
    # Load the model
    model = load_model(model_path)

    # Extract the model name using the get_model_name function (without .keras)
    model_name = get_model_name(model_path)

    # Use the extracted name for the prediction column
    pred_column_name = f'Preds_{model_name}'

    # Make predictions with each model and add them to the DataFrame
    df_allprobs = predict_and_store(model, pred_column_name, df_allprobs, pred_features_new)

# Show the dataframe with all the model predictions
print(df_allprobs.head())

df_gt_E7.reset_index(drop=True, inplace=True)
df_allprobs.reset_index(drop=True, inplace=True)

# Concatenate df_gt and df_allprobs without changing column names from df_allprobs
df_gt_allpreds_E7 = pd.concat([df_gt_E7, df_allprobs], axis=1, join='inner')

# Display the resulting DataFrame
display(df_gt_allpreds_E7)

[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 16ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 15ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 15ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 15ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 15ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 15ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 14ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 15ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 15ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 14ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 14ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 15ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 15ms/step
[1m97/97[0m [32m━━━━━━━━━━━━━━━━━━━

Unnamed: 0,Class_Huang_E7_Range,Preds_splitM_seed493,Preds_splitM_seed552,Preds_splitM_seed838,Preds_splitM_seed1022,Preds_splitM_seed123,Preds_splitM_seed324,Preds_splitM_seed369,Preds_splitM_seed564,Preds_splitM_seed641,...,Preds_splitN_seed4032,Preds_splitN_seed454,Preds_splitM_seed1843,Preds_splitM_seed2221,Preds_splitM_seed3060,Preds_splitM_seed3247,Preds_splitM_seed3364,Preds_splitM_seed3871,Preds_splitM_seed4032,Preds_splitM_seed983
0,0,0.636117,0.778164,0.588504,0.765886,0.582340,0.616913,0.649699,0.697456,0.643355,...,0.787908,0.378456,0.761302,0.821427,0.809603,0.778581,0.597045,0.784108,0.754330,0.840811
1,0,0.636482,0.777926,0.588714,0.766686,0.581585,0.611313,0.649776,0.697938,0.645148,...,0.784633,0.379010,0.761142,0.822361,0.808814,0.779287,0.594154,0.783862,0.751140,0.841016
2,0,0.639039,0.777701,0.589028,0.767213,0.580806,0.606789,0.650461,0.698042,0.647063,...,0.781467,0.377667,0.761228,0.822558,0.808534,0.779732,0.593298,0.783538,0.750893,0.841008
3,0,0.641618,0.777965,0.589636,0.767859,0.580301,0.601268,0.649226,0.698384,0.648608,...,0.778753,0.379076,0.762085,0.823634,0.808042,0.780256,0.592323,0.783155,0.750087,0.840959
4,0,0.643883,0.777323,0.589132,0.768660,0.577425,0.595551,0.651783,0.699033,0.650894,...,0.773466,0.378902,0.761935,0.824129,0.807168,0.780727,0.590951,0.783383,0.746872,0.840916
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98825,0,0.751036,0.789669,0.727342,0.811545,0.833985,0.261037,0.696583,0.702812,0.811020,...,0.899303,0.521525,0.755369,0.859219,0.761755,0.855420,0.699583,0.819264,0.880864,0.861675
98826,0,0.750871,0.789900,0.727330,0.811665,0.833850,0.261654,0.697217,0.702985,0.810512,...,0.899310,0.521675,0.754813,0.859299,0.763793,0.855317,0.700561,0.819498,0.881000,0.861843
98827,0,0.750999,0.789716,0.727558,0.811800,0.833556,0.261928,0.696529,0.702671,0.810981,...,0.899535,0.519283,0.754911,0.859294,0.763530,0.855336,0.700974,0.819053,0.881282,0.861860
98828,0,0.750388,0.789510,0.727604,0.812236,0.833917,0.263382,0.696101,0.702940,0.810357,...,0.899526,0.518855,0.755045,0.859239,0.762727,0.855259,0.700953,0.819073,0.881176,0.862006
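
Each value in the table above is one probability per sequence, obtained inside predict_and_store by averaging a model's output over the time steps. A toy sketch of that reduction follows; the output shape (n_sequences, time_steps, 1) is an assumption made for illustration, and the numbers are made up.

```python
import numpy as np

# Hypothetical model output: 2 sequences, 3 time steps, 1 probability per step
toy_probs = np.array([
    [[0.2], [0.4], [0.6]],
    [[0.9], [0.9], [0.6]],
])

# Average over the time-step axis, then drop the trailing singleton axis
per_sequence = np.squeeze(np.mean(toy_probs, axis=1))
print(per_sequence.shape)
print(per_sequence)
```

The result has one entry per sequence, which is what gets stored in each `Preds_...` column.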


In [None]:
df_gt_allpreds_E7.to_csv('ParkerNet_allModel_predictions_E7_20210112.csv', index=False) #use this to save your predictions for each model into a csv file

# ====== Ensemble Prediction ======

### Simple Soft Voting (Average Prediction)

We will now calculate the average of all the model predictions (simple soft voting). This is not used in the analysis but is included in case you want to see the difference between simply averaging the models' predictions and using a weighted average.

In [None]:

# Dynamically extract all model prediction columns (everything except the ground truth)
model_columns = [col for col in df_gt_allpreds_E7.columns if col != 'Class_Huang_E7_Range']

# Ground truth labels
y_true = df_gt_allpreds_E7['Class_Huang_E7_Range'].values

# Step 1: Simple soft voting (equal weight for all models)
# Initialize the final predictions column
df_gt_allpreds_E7['SimpleSoftVoting_Predictions'] = np.zeros_like(y_true, dtype=float)

# Simply sum the probabilities of all models and divide by the number of models (average)
num_models = len(model_columns)
for model in model_columns:
    df_gt_allpreds_E7['SimpleSoftVoting_Predictions'] += df_gt_allpreds_E7[model]

# Normalize by the number of models to get the average (equal weight soft voting)
df_gt_allpreds_E7['SimpleSoftVoting_Predictions'] /= num_models

# The 'SimpleSoftVoting_Predictions' column now contains the soft voting ensemble probabilities
print(df_gt_allpreds_E7[['Class_Huang_E7_Range', 'SimpleSoftVoting_Predictions']].head())

   Class_Huang_E7_Range  SimpleSoftVoting_Predictions
0                     0                      0.652387
1                     0                      0.652057
2                     0                      0.651963
3                     0                      0.651543
4                     0                      0.651216
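
The sum-and-divide loop above is equivalent to a row-wise mean over the model columns. A minimal sketch on made-up data (the column names `Preds_a` and `Preds_b` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Class": [0, 1],
    "Preds_a": [0.2, 0.9],
    "Preds_b": [0.4, 0.7],
})

# Select the model prediction columns and average them row by row
pred_cols = [c for c in toy.columns if c.startswith("Preds_")]
toy["SimpleSoftVoting_Predictions"] = toy[pred_cols].mean(axis=1)
print(toy)
```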


### Weighted Average Prediction using AUC_PRC as weight


Now we will calculate the weighted soft voting, using the Huang class label to calculate each model's AUC_PRC (area under the precision-recall curve) and using that value as the model's weight. The result is saved as "Weighted_Voting_AUC_PRC" in the dataset. This is the column that has been used in the analysis for the paper. If you scroll to the right on the displayed dataframe, you will see this column added at the end.

In [None]:

#from sklearn.metrics import average_precision_score



# Define the target column and identify model prediction columns
target_column = 'Class_Huang_E7_Range'
model_columns = [col for col in df_gt_allpreds_E7.columns if col not in [target_column, 'SimpleSoftVoting_Predictions']]

# Calculate AUC-PRC for each model (estimated here as average precision)
auc_prc_scores = {}
for model in model_columns:
    auc_prc = average_precision_score(df_gt_allpreds_E7[target_column], df_gt_allpreds_E7[model])
    auc_prc_scores[model] = auc_prc

# Normalize AUC-PRC scores for weighting
total_auc = sum(auc_prc_scores.values())
weights = {model: auc / total_auc for model, auc in auc_prc_scores.items()}

# Calculate weighted soft voting predictions
df_gt_allpreds_E7['Weighted_Voting_AUC_PRC'] = sum(
    df_gt_allpreds_E7[model] * weight for model, weight in weights.items()
)


display(df_gt_allpreds_E7)

Unnamed: 0,Class_Huang_E7_Range,Preds_splitM_seed493,Preds_splitM_seed552,Preds_splitM_seed838,Preds_splitM_seed1022,Preds_splitM_seed123,Preds_splitM_seed324,Preds_splitM_seed369,Preds_splitM_seed564,Preds_splitM_seed641,...,Preds_splitM_seed1843,Preds_splitM_seed2221,Preds_splitM_seed3060,Preds_splitM_seed3247,Preds_splitM_seed3364,Preds_splitM_seed3871,Preds_splitM_seed4032,Preds_splitM_seed983,SimpleSoftVoting_Predictions,Weighted_Voting_AUC_PRC
0,0,0.636117,0.778164,0.588504,0.765886,0.582340,0.616913,0.649699,0.697456,0.643355,...,0.761302,0.821427,0.809603,0.778581,0.597045,0.784108,0.754330,0.840811,0.652387,0.649764
1,0,0.636482,0.777926,0.588714,0.766686,0.581585,0.611313,0.649776,0.697938,0.645148,...,0.761142,0.822361,0.808814,0.779287,0.594154,0.783862,0.751140,0.841016,0.652057,0.649400
2,0,0.639039,0.777701,0.589028,0.767213,0.580806,0.606789,0.650461,0.698042,0.647063,...,0.761228,0.822558,0.808534,0.779732,0.593298,0.783538,0.750893,0.841008,0.651963,0.649292
3,0,0.641618,0.777965,0.589636,0.767859,0.580301,0.601268,0.649226,0.698384,0.648608,...,0.762085,0.823634,0.808042,0.780256,0.592323,0.783155,0.750087,0.840959,0.651543,0.648848
4,0,0.643883,0.777323,0.589132,0.768660,0.577425,0.595551,0.651783,0.699033,0.650894,...,0.761935,0.824129,0.807168,0.780727,0.590951,0.783383,0.746872,0.840916,0.651216,0.648510
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98825,0,0.751036,0.789669,0.727342,0.811545,0.833985,0.261037,0.696583,0.702812,0.811020,...,0.755369,0.859219,0.761755,0.855420,0.699583,0.819264,0.880864,0.861675,0.769576,0.765197
98826,0,0.750871,0.789900,0.727330,0.811665,0.833850,0.261654,0.697217,0.702985,0.810512,...,0.754813,0.859299,0.763793,0.855317,0.700561,0.819498,0.881000,0.861843,0.769710,0.765338
98827,0,0.750999,0.789716,0.727558,0.811800,0.833556,0.261928,0.696529,0.702671,0.810981,...,0.754911,0.859294,0.763530,0.855336,0.700974,0.819053,0.881282,0.861860,0.769550,0.765179
98828,0,0.750388,0.789510,0.727604,0.812236,0.833917,0.263382,0.696101,0.702940,0.810357,...,0.755045,0.859239,0.762727,0.855259,0.700953,0.819073,0.881176,0.862006,0.769298,0.764931
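
To show how the weights enter the final column, here is a small worked sketch with two made-up models. The scores in `scores` are hypothetical AUC_PRC values chosen so the arithmetic is easy to follow; in the notebook they come from `average_precision_score`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Preds_a": [0.2, 0.9], "Preds_b": [0.4, 0.7]})

# Hypothetical AUC_PRC score for each model
scores = pd.Series({"Preds_a": 0.6, "Preds_b": 0.2})

# Normalize so the weights sum to 1 (0.75 and 0.25 here)
weights = scores / scores.sum()

# Weighted average of the model probabilities for each row
weighted = toy[weights.index].mul(weights, axis=1).sum(axis=1)
print(weights)
print(weighted)
```

The first row works out to 0.75 × 0.2 + 0.25 × 0.4 = 0.25, so the better-scoring model pulls the ensemble toward its own prediction.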


Save the dataframe for analysis (you will need this file for Notebook 3 of 3).

In [None]:
df_gt_allpreds_E7.to_csv('ParkerNet_allModel_averaged_predictions_E7_20210112.csv', index=False)

***Now you know how to load in pre-trained keras models, predict on a dataset using all the models, and make an ensemble prediction.***