# VIME Tutorial

### VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain

- Paper: Jinsung Yoon, Yao Zhang, James Jordon, Mihaela van der Schaar, 
  "VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain," 
  Neural Information Processing Systems (NeurIPS), 2020.

- Paper link: TBD

- Last updated Date: October 11th 2020

- Code author: Jinsung Yoon (jsyoon0823@gmail.com)

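This notebook is a user guide to self- and semi-supervised learning for tabular data with VIME. The original tutorial uses the MNIST database; the active cells below load `TCGA_InfoWithGrade.xlsx` through `load_excel_data` instead.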

In [1]:
# pip uninstall numpy
# pip install numpy

In [2]:
# Load the TF1 compatibility API once; the session-based check below requires graph mode.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

a = tf.constant(2)
b = tf.constant(3)

c = tf.add(a, b)

with tf.Session() as sess:
    result = sess.run(c)
    print(result)

Instructions for updating:
non-resource variables are not supported in the long term
5


### Prerequisite
Clone https://github.com/jsyoon0823/VIME.git to the current directory.

### Necessary packages and function calls

- data_loader: MNIST dataset loading and preprocessing
- supervised_models: supervised learning models (Logistic regression, XGBoost, and Multi-layer Perceptron)

- vime_self: Self-supervised learning part of VIME framework
- vime_semi: Semi-supervised learning part of VIME framework
- vime_utils: Some utility functions for VIME framework

In [3]:
! pip install xgboost
! pip install tf-slim
# ! pip install fancyimpute statsmodels




In [4]:
import keras
import tensorflow
import numpy as np
import os
import warnings
warnings.filterwarnings("ignore")
  
from data_loader import load_mnist_data
from data_loader import load_excel_data
from data_loader import load_excel_data_multi_class

from supervised_models import logit, xgb_model, mlp

from vime_self import vime_self
from vime_self import get_encoder
from vime_self import vime_self_fnn
from vime_semi import vime_semi
from vime_utils import perf_metric

In [5]:
# Example usage:
# x_train, y_train, x_unlab, x_test, y_test = load_excel_data('TCGA_InfoWithGrade.xlsx')

In [6]:
# print(x_train.shape)
# print(y_train.shape)
# print(x_unlab.shape)

# print(x_test.shape)
# print(y_test.shape)


### Set the parameters and define output

-   label_no: number of labeled samples to use
-   model_sets: supervised models to evaluate (logit, xgboost, mlp)
-   p_m: corruption probability for self-supervised learning
-   alpha: hyper-parameter that weights the feature loss against the mask loss
-   K: number of augmented samples
-   beta: hyper-parameter that weights the unsupervised loss against the supervised loss
-   label_data_rate: ratio of labeled data
-   metric: prediction performance metric (either acc or auc)
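
To make `p_m` concrete, here is a minimal sketch of the corruption step described in the VIME paper: a binary mask is drawn with probability `p_m`, and masked entries are replaced with values taken from other rows of the same column. The function names `sketch_mask` and `sketch_corrupt` and the `rng` argument are assumptions made for illustration; the repository's `vime_utils` may implement this step differently.

```python
import numpy as np

def sketch_mask(p_m, x, rng):
    """Draw a 0/1 mask shaped like x; each entry is 1 with probability p_m."""
    return rng.binomial(1, p_m, size=x.shape)

def sketch_corrupt(mask, x, rng):
    """Replace masked entries of x with values from other rows of the same column."""
    n_rows, n_cols = x.shape
    # Shuffle each column independently so replacements follow that
    # column's empirical distribution.
    x_bar = np.empty_like(x)
    for j in range(n_cols):
        x_bar[:, j] = x[rng.permutation(n_rows), j]
    # Keep original values where mask == 0, shuffled values where mask == 1.
    x_tilde = x * (1 - mask) + x_bar * mask
    # A masked entry may receive its own value back, so recompute the
    # mask from the entries that actually changed.
    mask_new = (x != x_tilde).astype(int)
    return mask_new, x_tilde

# Hypothetical demo data, not taken from the notebook.
rng = np.random.default_rng(0)
x_demo = rng.random((6, 4))
m_demo = sketch_mask(0.3, x_demo, rng)
m_new, x_tilde = sketch_corrupt(m_demo, x_demo, rng)
print(m_new)
```

A larger `p_m` corrupts more entries per row, which makes the pretext task harder for the encoder.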

In [7]:
# Experimental parameters
label_no = 1000  
model_sets = ['logit','xgboost','mlp']
  
# Hyper-parameters
p_m = 0.3
alpha = 2.0
K = 3
beta = 1.0
label_data_rate = 0.1

# Metric
metric = 'acc'
  
# Define output
results = np.zeros([len(model_sets)+2])  

### Load data

Load the original MNIST dataset and preprocess it.
- Use only a subset of the data as labeled data.
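
The labeled/unlabeled split can be pictured with the sketch below. It assumes the split is a random permutation followed by a cut at `int(n * label_data_rate)`; the helper name `split_labeled_unlabeled` and the `seed` argument are hypothetical, and the loaders in `data_loader` may differ in detail.

```python
import numpy as np

def split_labeled_unlabeled(x, y, label_data_rate, seed=0):
    """Randomly keep a fraction of rows as labeled; treat the rest as unlabeled."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_label = int(len(idx) * label_data_rate)
    label_idx, unlab_idx = idx[:n_label], idx[n_label:]
    # Labels of the unlabeled part are deliberately discarded.
    return x[label_idx], y[label_idx], x[unlab_idx]

# Hypothetical demo data with 672 rows and 3 features.
x_all = np.arange(672 * 3, dtype=float).reshape(672, 3)
y_all = np.arange(672) % 2
x_lab, y_lab, x_unl = split_labeled_unlabeled(x_all, y_all, label_data_rate=0.2)
print(x_lab.shape, y_lab.shape, x_unl.shape)  # (134, 3) (134,) (538, 3)
```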

In [8]:
# # Load data
# x_train, y_train, x_unlab, x_test, y_test = load_mnist_data(label_data_rate)
    
# # Use subset of labeled data
# x_train = x_train[:label_no, :]
# y_train = y_train[:label_no, :]

# print(x_train.shape)
# print(y_train.shape)
# print(x_unlab.shape)

# print(x_test.shape)
# print(y_test.shape)

In [9]:
# # Assuming the function load_excel_data_multi_class is defined in the same Jupyter notebook or imported properly

# # Call the function
# x_train, y_train , x_unlab, x_test, y_test = load_excel_data_multi_class('DARWIN.xlsx', label_data_rate=0.4, test_data_rate=0.2)

# # Print the shapes of the returned data
# print("x_train shape:", x_train.shape)
# print("y_train shape:", y_train.shape)
# print("x_unlab shape:", x_unlab.shape)
# print("x_test shape:", x_test.shape)
# print("y_test shape:", y_test.shape)

# x_unlab = x_unlab.to_numpy()

x_train shape: (56, 450)
y_train shape: (56, 2)
x_unlab shape: (84, 450)
x_test shape: (34, 450)
y_test shape: (34, 2)


In [10]:
from keras.utils import to_categorical

x_train, y_train, x_unlab, x_test, y_test = load_excel_data('TCGA_InfoWithGrade.xlsx')

# Ensure everything is a numpy ndarray
x_train = x_train.to_numpy()
y_train = y_train.to_numpy().reshape(-1, 1)
x_unlab = x_unlab.to_numpy()
x_test = x_test.to_numpy()
y_test = y_test.to_numpy().reshape(-1, 1) 

# Convert y_train and y_test into one-hot vectors
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# Apply your transformations
label_no = 1000  # Or any appropriate value
x_train = x_train[:label_no, :]
y_train = y_train[:label_no, :]


In [69]:
# All steps combined in one cell.
# Skip this cell unless you want to run every imputation mode and missingness rate in one pass.
listMODES = [ 'mean','median','mode','random_sampling']
listRATES = [0.05, 0.15, 0.4]

for mode in listMODES:
    for rate in listRATES:  


        from keras.utils import to_categorical

        x_train, y_train, x_unlab, x_test, y_test = load_excel_data('TCGA_InfoWithGrade.xlsx')

        # Ensure everything is a numpy ndarray
        x_train = x_train.to_numpy()
        y_train = y_train.to_numpy().reshape(-1, 1)
        x_unlab = x_unlab.to_numpy()
        x_test = x_test.to_numpy()
        y_test = y_test.to_numpy().reshape(-1, 1) 

        # Convert y_train and y_test into one-hot vectors
        y_train = to_categorical(y_train)
        y_test = to_categorical(y_test)

        # Apply your transformations
        label_no = 1000  # Or any appropriate value
        x_train = x_train[:label_no, :]
        y_train = y_train[:label_no, :]
        
        # Keep an unmodified copy: introduce_missingness edits its input in place.
        old_data = x_unlab.copy()

        import numpy as np

        def introduce_missingness(data, percentage=0.05, mechanism='MCAR'):
            """
            Introduce missingness into a numpy array based on a specific mechanism.

            Parameters:
            - data: numpy array
            - percentage: fraction of data that should be missing
            - mechanism: 'MCAR', 'MAR', or 'MNAR'

            Returns:
            - modified_data: numpy array with missing values
            """

            # Ensure data isn't an empty array
            if data.size == 0:
                return data

            # Convert the missing fraction into an absolute number of entries
            total_values = data.size
            missing_values = int(total_values * percentage)
            values_made_missing = 0

            # MCAR
            if mechanism == 'MCAR':
                # Randomly pick indices to set as missing
                missing_indices = np.random.choice(total_values, missing_values, replace=False)
                np.put(data, missing_indices, np.nan)
                return data

            # MAR
            if mechanism == 'MAR':
                while values_made_missing < missing_values:
                    # Pick two features randomly
                    feature_1 = np.random.choice(data.shape[1], 1)
                    feature_2 = np.random.choice(data.shape[1], 1)
                    while feature_1 == feature_2:
                        feature_2 = np.random.choice(data.shape[1], 1)

                    threshold = np.mean(data[:, feature_1])
                    # Make values in feature_2 missing based on values in feature_1
                    potential_missing_indices = np.where(data[:, feature_1] > threshold)
                    remaining_missing = missing_values - values_made_missing
                    sample_size = min(len(potential_missing_indices[0]), remaining_missing)
                    if sample_size == 0:
                        continue
                    sample_missing = np.random.choice(len(potential_missing_indices[0]), sample_size, replace=False)
                    data[potential_missing_indices[0][sample_missing], feature_2] = np.nan
                    values_made_missing += sample_size
                return data

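            # MNAR: one randomly chosen feature loses values where it exceeds its own mean.
            # Only a single column is touched, so the realised missing count can be far below missing_values.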
            if mechanism == 'MNAR':
                feature = np.random.choice(data.shape[1], 1)
                threshold = np.mean(data[:, feature])
                potential_missing_indices = np.where(data[:, feature] > threshold)
                sample_size = min(len(potential_missing_indices[0]), missing_values)
                sample_missing = np.random.choice(len(potential_missing_indices[0]), sample_size, replace=False)
                data[potential_missing_indices[0][sample_missing], feature] = np.nan
                return data

        # Sample usage

        # data = np.random.rand(100, 5)
        x_unlab = introduce_missingness(x_unlab, percentage=rate, mechanism='MCAR')
        
        import numpy as np
        from sklearn.impute import SimpleImputer, KNNImputer
        from sklearn.experimental import enable_iterative_imputer
        from sklearn.impute import IterativeImputer

        def impute_data(data, method='mean'):
            # Find the indices of the missing values
            missing_values_indices = np.argwhere(np.isnan(data))

            if method == 'mean':
                imputer = SimpleImputer(strategy='mean')
            elif method == 'median':
                imputer = SimpleImputer(strategy='median')
            elif method == 'mode':
                imputer = SimpleImputer(strategy='most_frequent')
            elif method == 'knn':
                imputer = KNNImputer(n_neighbors=5)
            elif method == 'iterative' or method == 'regression':
                imputer = IterativeImputer(max_iter=10, random_state=0)
            elif method == 'random_sampling':
                # Random sampling imputation
                data_imputed = data.copy()
                for feature in range(data.shape[1]):
                    missing_values_idx = np.where(np.isnan(data[:, feature]))[0]
                    observed_values = data[~np.isnan(data[:, feature]), feature]
                    imputed_values = np.random.choice(observed_values, size=len(missing_values_idx))
                    data_imputed[missing_values_idx, feature] = imputed_values
                return data_imputed, missing_values_indices
            else:
                raise ValueError(f"Unknown imputation method: {method}")

            # Apply the imputer
            imputed_data = imputer.fit_transform(data)

            return imputed_data, missing_values_indices


        x_unlab, missing_values_indices = impute_data(x_unlab, method=mode)
        from keras.utils import to_categorical

        # x_train = x_train.to_numpy()
        # Logistic regression
        y_test_hat = logit(x_train, y_train, x_test)
        results[0] = perf_metric(metric, y_test, y_test_hat) 

        # XGBoost
        y_test_hat = xgb_model(x_train, y_train, x_test)    
        results[1] = perf_metric(metric, y_test, y_test_hat)   

        # MLP
        mlp_parameters = dict()
        mlp_parameters['hidden_dim'] = 100
        mlp_parameters['epochs'] = 100
        mlp_parameters['activation'] = 'relu'
        mlp_parameters['batch_size'] = 100

        # y_train = to_categorical(y_train)
        # y_test = to_categorical(y_test)    

        x_train = np.array(x_train)
        y_train = np.array(y_train)
        x_test = np.array(x_test)

        y_test_hat = mlp(x_train, y_train, x_test, mlp_parameters)
        results[2] = perf_metric(metric, y_test, y_test_hat)

        # Report performance
        for m_it in range(len(model_sets)):  

          model_name = model_sets[m_it]  

          print('Supervised Performance, Model Name: ' + model_name + 
                ', Performance: ' + str(results[m_it]))
        
        # Train VIME-Self
        vime_self_parameters = dict()
        vime_self_parameters['batch_size'] = 128
        vime_self_parameters['epochs'] = 10

        vime_self_encoder, embeddings, all_activations, encoder_output_dim, history = get_encoder(
            x_unlab, architecture='default', p_m=p_m, alpha=alpha,
            parameters=vime_self_parameters)
        print("Encoder output shape: (?, {})".format(encoder_output_dim))
        # encoder_output_dim = 392
        # vime_self_encoder, embeddings, all_activations = vime_self_fnn(x_unlab, p_m, alpha, vime_self_parameters)

        # Save encoder
        if not os.path.exists('save_model'):
          os.makedirs('save_model')

        file_name = './save_model/encoder_model.h5'

        vime_self_encoder.save(file_name)  

        # Test VIME-Self
        x_train_hat = vime_self_encoder.predict(x_train)
        x_test_hat = vime_self_encoder.predict(x_test)

        y_test_hat = mlp(x_train_hat, y_train, x_test_hat, mlp_parameters)

        results[3] = perf_metric(metric, y_test, y_test_hat)

        print('VIME-Self Performance: ' + str(results[3]))
        import tensorflow as tf


        vime_semi_parameters = dict()
        vime_semi_parameters['hidden_dim'] = 100
        vime_semi_parameters['batch_size'] = 128
        vime_semi_parameters['iterations'] = 1000
        y_test_hat = vime_semi(x_train, y_train, x_unlab, x_test, 
                               vime_semi_parameters, p_m, K, beta, file_name,encoder_output_dim)

        # Test VIME
        results[4] = perf_metric(metric, y_test, y_test_hat)

        print('VIME Performance: '+ str(results[4]))
        
        for m_it in range(len(model_sets)):  

          model_name = model_sets[m_it]  

          print('Supervised Performance, Model Name: ' + model_name + 
                ', Performance: ' + str(results[m_it]))

        # Index the results explicitly instead of relying on the leaked loop variable m_it.
        print('VIME-Self Performance: ' + str(results[3]))

        print('VIME Performance: '+ str(results[4]))
        # Visual separator between runs
        for _ in range(7):
            print("*******************")





Total train data length: 672
Label data length (from train_idx): 134
Unlabel data length (from train_idx): 538
Provided label_data_rate: 0.2
Supervised Performance, Model Name: logit, Performance: 0.874251497005988
Supervised Performance, Model Name: xgboost, Performance: 0.874251497005988
Supervised Performance, Model Name: mlp, Performance: 0.8562874251497006
Train on 538 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Encoder output shape: (?, 23)
VIME-Self Performance: 0.8562874251497006
Iteration: 0/1000, Current loss: 0.7589
Iteration: 100/1000, Current loss: 0.6124
Iteration: 200/1000, Current loss: 0.604
Iteration: 300/1000, Current loss: 0.5864
INFO:tensorflow:Restoring parameters from ./save_model/class_model.ckpt
VIME Performance: 0.874251497005988
Supervised Performance, Model Name: logit, Performance: 0.874251497005988
Supervised Performance, Model Name: xgboost, Performance: 0.874251497005988
Supervise

Train on 538 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Encoder output shape: (?, 23)
VIME-Self Performance: 0.8682634730538922
Iteration: 0/1000, Current loss: 1.2692
Iteration: 100/1000, Current loss: 0.5195
Iteration: 200/1000, Current loss: 0.4513
Iteration: 300/1000, Current loss: 0.4536
INFO:tensorflow:Restoring parameters from ./save_model/class_model.ckpt
VIME Performance: 0.8922155688622755
Supervised Performance, Model Name: logit, Performance: 0.8922155688622755
Supervised Performance, Model Name: xgboost, Performance: 0.844311377245509
Supervised Performance, Model Name: mlp, Performance: 0.9041916167664671
VIME-Self Performance: 0.8682634730538922
VIME Performance: 0.8922155688622755
*******************
*******************
*******************
*******************
*******************
*******************
*******************
Total train data length: 672
Label data length (from train_idx): 134
Unlabel d

Train on 538 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Encoder output shape: (?, 23)
VIME-Self Performance: 0.6107784431137725
Iteration: 0/1000, Current loss: 1.0418
Iteration: 100/1000, Current loss: 0.6372
Iteration: 200/1000, Current loss: 0.6218
Iteration: 300/1000, Current loss: 0.6186
Iteration: 400/1000, Current loss: 0.6147
INFO:tensorflow:Restoring parameters from ./save_model/class_model.ckpt
VIME Performance: 0.7844311377245509
Supervised Performance, Model Name: logit, Performance: 0.8802395209580839
Supervised Performance, Model Name: xgboost, Performance: 0.8323353293413174
Supervised Performance, Model Name: mlp, Performance: 0.8622754491017964
VIME-Self Performance: 0.6107784431137725
VIME Performance: 0.7844311377245509
*******************
*******************
*******************
*******************
*******************
*******************
*******************
Total train data length: 672
Label 

Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Encoder output shape: (?, 23)
VIME-Self Performance: 0.7664670658682635
Iteration: 0/1000, Current loss: 0.6694
Iteration: 100/1000, Current loss: 0.4913
Iteration: 200/1000, Current loss: 0.4457
Iteration: 300/1000, Current loss: 0.4082
Iteration: 400/1000, Current loss: 0.4285
Iteration: 500/1000, Current loss: 0.4359
INFO:tensorflow:Restoring parameters from ./save_model/class_model.ckpt
VIME Performance: 0.8083832335329342
Supervised Performance, Model Name: logit, Performance: 0.8502994011976048
Supervised Performance, Model Name: xgboost, Performance: 0.7485029940119761
Supervised Performance, Model Name: mlp, Performance: 0.8502994011976048
VIME-Self Performance: 0.7664670658682635
VIME Performance: 0.8083832335329342
*******************
*******************
*******************
*******************
*******************
*******************
*******************
Total train data length: 672
Label d

In [67]:
print(x_train.shape)
print(y_train.shape)
print(x_unlab.shape)

print(x_test.shape)
print(y_test.shape)
# import numpy as np

# # Compute the correlation matrix using numpy
# correlation_matrix = np.corrcoef(x_unlab, rowvar=False)

# # Identify pairs of features with high absolute correlation
# threshold = 0.7
# high_corr_pairs = []
# for i in range(correlation_matrix.shape[0]):
#     for j in range(i+1, correlation_matrix.shape[1]):
#         if abs(correlation_matrix[i, j]) > threshold:
#             high_corr_pairs.append((i, j))

# print(f"Number of feature pairs with correlation above {threshold}: {len(high_corr_pairs)}")



(134, 23)
(134, 2)
(538, 23)
(167, 23)
(167, 2)


In [12]:
# Keep an unmodified copy: introduce_missingness edits its input in place.
old_data = x_unlab.copy()

import numpy as np

def introduce_missingness(data, percentage=0.05, mechanism='MCAR'):
    """
    Introduce missingness into a numpy array based on a specific mechanism.
    
    Parameters:
    - data: numpy array
    - percentage: fraction of data that should be missing
    - mechanism: 'MCAR', 'MAR', or 'MNAR'
    
    Returns:
    - modified_data: numpy array with missing values
    """
    
    # Ensure data isn't an empty array
    if data.size == 0:
        return data
    
    # Convert the missing fraction into an absolute number of entries
    total_values = data.size
    missing_values = int(total_values * percentage)
    values_made_missing = 0
    
    # MCAR
    if mechanism == 'MCAR':
        # Randomly pick indices to set as missing
        missing_indices = np.random.choice(total_values, missing_values, replace=False)
        np.put(data, missing_indices, np.nan)
        return data
    
    # MAR
    if mechanism == 'MAR':
        while values_made_missing < missing_values:
            # Pick two features randomly
            feature_1 = np.random.choice(data.shape[1], 1)
            feature_2 = np.random.choice(data.shape[1], 1)
            while feature_1 == feature_2:
                feature_2 = np.random.choice(data.shape[1], 1)

            threshold = np.mean(data[:, feature_1])
            # Make values in feature_2 missing based on values in feature_1
            potential_missing_indices = np.where(data[:, feature_1] > threshold)
            remaining_missing = missing_values - values_made_missing
            sample_size = min(len(potential_missing_indices[0]), remaining_missing)
            if sample_size == 0:
                continue
            sample_missing = np.random.choice(len(potential_missing_indices[0]), sample_size, replace=False)
            data[potential_missing_indices[0][sample_missing], feature_2] = np.nan
            values_made_missing += sample_size
        return data
    
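    # MNAR: one randomly chosen feature loses values where it exceeds its own mean.
    # Only a single column is touched, so the realised missing count can be far below missing_values.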
    if mechanism == 'MNAR':
        feature = np.random.choice(data.shape[1], 1)
        threshold = np.mean(data[:, feature])
        potential_missing_indices = np.where(data[:, feature] > threshold)
        sample_size = min(len(potential_missing_indices[0]), missing_values)
        sample_missing = np.random.choice(len(potential_missing_indices[0]), sample_size, replace=False)
        data[potential_missing_indices[0][sample_missing], feature] = np.nan
        return data

# Sample usage

# data = np.random.rand(100, 5)
x_unlab = introduce_missingness(x_unlab, percentage=0.4, mechanism='MNAR')


In [13]:
import numpy as np

# Example array
# arr = np.array([[1, 2, np.nan], [4, 5, 6], [7, np.nan, 9]])
arr = x_unlab
# Check for missing values (NaN)
has_missing_values = np.isnan(arr).any()

# Count of missing values
missing_count = np.isnan(arr).sum()

print(f"Has missing values: {has_missing_values}")
print(f"Count of missing values: {missing_count}")


Has missing values: True
Count of missing values: 35


In [14]:
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

def impute_data(data, method='mean'):
    # Find the indices of the missing values
    missing_values_indices = np.argwhere(np.isnan(data))
    
    if method == 'mean':
        imputer = SimpleImputer(strategy='mean')
    elif method == 'median':
        imputer = SimpleImputer(strategy='median')
    elif method == 'mode':
        imputer = SimpleImputer(strategy='most_frequent')
    elif method == 'knn':
        imputer = KNNImputer(n_neighbors=5)
    elif method == 'iterative' or method == 'regression':
        imputer = IterativeImputer(max_iter=10, random_state=0)
    elif method == 'random_sampling':
        # Random sampling imputation
        data_imputed = data.copy()
        for feature in range(data.shape[1]):
            missing_values_idx = np.where(np.isnan(data[:, feature]))[0]
            observed_values = data[~np.isnan(data[:, feature]), feature]
            imputed_values = np.random.choice(observed_values, size=len(missing_values_idx))
            data_imputed[missing_values_idx, feature] = imputed_values
        return data_imputed, missing_values_indices
    else:
        raise ValueError(f"Unknown imputation method: {method}")

    # Apply the imputer
    imputed_data = imputer.fit_transform(data)
    
    return imputed_data, missing_values_indices


x_unlab, missing_values_indices = impute_data(x_unlab, method="random_sampling")


In [15]:
import numpy as np

# Example array
# arr = np.array([[1, 2, np.nan], [4, 5, 6], [7, np.nan, 9]])
arr = x_unlab
# Check for missing values (NaN)
has_missing_values = np.isnan(arr).any()

# Count of missing values
missing_count = np.isnan(arr).sum()

print(f"Has missing values: {has_missing_values}")
print(f"Count of missing values: {missing_count}")

Has missing values: False
Count of missing values: 0


In [16]:
missing_values_indices.shape

(35, 2)

In [17]:
# import numpy as np
# import seaborn as sns
# import matplotlib.pyplot as plt

# # Randomly shuffle the indices
# shuffled_indices = np.random.permutation(x_unlab.shape[1])

# # Select the first 23 features of the shuffled order
# selected_features = shuffled_indices[:23]

# # Compute the correlation matrix for the subset
# subset_corr_matrix = np.corrcoef(x_unlab[:, selected_features], rowvar=False)

# # Plot the heatmap without annotations inside
# plt.figure(figsize=(12, 10))
# sns.heatmap(subset_corr_matrix, cmap="coolwarm", vmin=-1, vmax=1)
# plt.title("Heatmap for a Subset of Features")
# plt.show()


In [18]:
# import numpy as np
# import seaborn as sns
# import matplotlib.pyplot as plt
# from scipy.cluster.hierarchy import linkage

# # Randomly shuffle the indices
# shuffled_indices = np.random.permutation(x_unlab.shape[1])

# # Select a subset of features, here the first 23 of the shuffled order
# selected_features = shuffled_indices[:23]

# # Remove columns with zero variance
# non_zero_var_columns = np.var(x_unlab[:, selected_features], axis=0) != 0
# x_subset = x_unlab[:, selected_features][:, non_zero_var_columns]

# # Compute the correlation matrix for the subset
# subset_corr_matrix = np.corrcoef(x_subset, rowvar=False)

# # Convert the correlation matrix to a distance for linkage computation
# distances = 1 - np.abs(subset_corr_matrix)

# # Compute hierarchical clustering linkage
# link = linkage(distances, method="average", optimal_ordering=True)

# # Create a clustered heatmap using seaborn
# sns.clustermap(subset_corr_matrix, row_linkage=link, col_linkage=link, cmap="coolwarm", vmin=-1, vmax=1, figsize=(12, 10), annot=False)

# plt.show()





# # import seaborn as sns
# # corr_matrix = np.nan_to_num(corr_matrix)

# # sns.clustermap(corr_matrix, cmap="coolwarm", figsize=(20, 20))


In [19]:
# pip install networkx matplotlib


In [20]:
# import networkx as nx
# import matplotlib.pyplot as plt

# # Assuming high_corr_pairs is already defined

# # Create an empty graph
# G = nx.Graph()

# # Add edges based on high-correlation pairs
# G.add_edges_from(high_corr_pairs)

# # Draw the graph with spacing adjustments
# plt.figure(figsize=(25, 25))
# pos = nx.spring_layout(G, k=0.5, iterations=50)  # Increase k and iterations for more spacing and better layout
# nx.draw_networkx_nodes(G, pos, node_size=500, alpha=0.8)
# nx.draw_networkx_edges(G, pos, width=2.0, edge_color="gray")  # Increase width and set edge color for better visibility
# nx.draw_networkx_labels(G, pos, font_size=12)
# plt.title("Network Graph of High-correlation Feature Pairs")
# plt.axis("off")
# plt.show()


In [21]:
# import numpy as np

# # Compute the correlation matrix using numpy
# correlation_matrix = np.corrcoef(x_unlab, rowvar=False)

# # Identify pairs of features with high absolute correlation
# threshold = 0.85
# high_corr_pairs = []
# for i in range(correlation_matrix.shape[0]):
#     for j in range(i+1, correlation_matrix.shape[1]):
#         if abs(correlation_matrix[i, j]) > threshold:
#             high_corr_pairs.append((i, j, correlation_matrix[i, j]))

# print(f"Number of feature pairs with correlation above {threshold}: {len(high_corr_pairs)}")


In [22]:
# import numpy as np

# def process_datasets(x_train, x_test, x_unlab, threshold=0.9):
#     """
#     Removes highly correlated features from x_train, x_test, and x_unlab.
    
#     Parameters:
#         - x_train: Training data.
#         - x_test: Testing data.
#         - x_unlab: Unlabeled data.
#         - threshold: Correlation threshold for feature removal.
        
#     Returns:
#         - x_train_modified: Processed training data.
#         - x_test_modified: Processed testing data.
#         - x_unlab_modified: Processed unlabeled data.
#         - correlated_pairs: List of tuples containing highly correlated feature pairs.
#     """
    
#     # Compute the correlation matrix using the unlabeled data
#     correlation_matrix = np.corrcoef(x_unlab, rowvar=False)

#     # Identify high-correlation pairs and determine features to remove
#     features_to_remove = set()
#     correlated_pairs = []  # List to keep track of highly correlated pairs
#     for i in range(correlation_matrix.shape[0]):
#         for j in range(i + 1, correlation_matrix.shape[1]):
#             if abs(correlation_matrix[i, j]) > threshold:
#                 # Decide to remove the second feature of the pair
#                 features_to_remove.add(j)
#                 correlated_pairs.append((i, j))  # Add the pair to the list

#     # Determine features to keep
#     features_to_keep = [i for i in range(x_unlab.shape[1]) if i not in features_to_remove]

#     # Modify datasets
#     x_train_modified = x_train[:, features_to_keep]
#     x_test_modified = x_test[:, features_to_keep]
#     x_unlab_modified = x_unlab[:, features_to_keep]

#     return x_train_modified, x_test_modified, x_unlab_modified, correlated_pairs

# # Example usage:
# x_train_modified, x_test_modified, x_unlab_modified, correlated_pairs = process_datasets(x_train, x_test, x_unlab)


In [23]:
# x_train_modified.shape

In [24]:
# x_unlab=x_unlab_modified
# x_train = x_train_modified
# x_test= x_test_modified
# high_corr_pairs = correlated_pairs

In [25]:
# print(x_unlab.shape)
# print(x_train.shape)
# print(x_test.shape)


In [26]:
# x_unlabSaved = x_unlab
# x_unlab=x_modified

In [27]:
# import networkx as nx
# import matplotlib.pyplot as plt
# import numpy as np

# # Check if high_corr_pairs is properly generated and not empty
# if not high_corr_pairs or len(high_corr_pairs[0]) != 3:
#     print("high_corr_pairs is either empty or not generated correctly!")
#     exit()

# # Create an empty graph
# G = nx.Graph()

# # Add edges and weights
# edge_colors = []  # This will store the correlation values for color coding

# for i, j, corr_val in high_corr_pairs:
#     G.add_edge(i, j, weight=corr_val)
#     edge_colors.append(corr_val)

# # Normalize colors
# min_corr = min(edge_colors)
# max_corr = max(edge_colors)
# edge_colors_normalized = [(c - min_corr) / (max_corr - min_corr) for c in edge_colors]

# # Draw the graph with spacing adjustments
# plt.figure(figsize=(15, 15))
# pos = nx.spring_layout(G, k=0.75, iterations=100)  # Adjusting k and iterations for potentially denser graph
# # pos = nx.spring_layout(G, k=0.1, iterations=100)  # Reduce the value of k from 0.75 to 0.5 or any desired value

# nx.draw_networkx_nodes(G, pos, node_size=500, alpha=0.8)
# nx.draw_networkx_edges(G, pos, edge_color=edge_colors_normalized, edge_cmap=plt.cm.Blues, width=2.0)
# nx.draw_networkx_labels(G, pos, font_size=12)

# # Add colorbar for edges
# sm = plt.cm.ScalarMappable(cmap=plt.cm.Blues, norm=plt.Normalize(vmin=min_corr, vmax=max_corr))
# plt.colorbar(sm)

# plt.title("Network Graph of High-correlation Feature Pairs")
# plt.axis("off")
# plt.show()


### Train supervised models

- Train three supervised learning models (logistic regression, XGBoost, MLP).
- Save the performance of each supervised model in `results`.
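
For orientation, the sketch below shows one way an accuracy score can be computed when labels are one-hot vectors and predictions are class probabilities, as they are in this notebook. The helper name `accuracy_from_one_hot` and the demo arrays are hypothetical; `perf_metric` in `vime_utils` is the function actually used and may differ.

```python
import numpy as np

def accuracy_from_one_hot(y_true, y_prob):
    """Fraction of rows whose highest-probability class matches the one-hot label."""
    true_class = np.argmax(y_true, axis=1)
    pred_class = np.argmax(y_prob, axis=1)
    return float(np.mean(true_class == pred_class))

# Hypothetical labels and predicted probabilities for four samples.
y_true_demo = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
y_prob_demo = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])
acc_demo = accuracy_from_one_hot(y_true_demo, y_prob_demo)
print(acc_demo)  # 0.75
```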

In [28]:
# # x_unlabSaved = x_unlab
# x_unlab=x_modified
# x_unlab.shape

In [29]:
from keras.utils import to_categorical

# x_train = x_train.to_numpy()
# Logistic regression
y_test_hat = logit(x_train, y_train, x_test)
results[0] = perf_metric(metric, y_test, y_test_hat) 

# XGBoost
y_test_hat = xgb_model(x_train, y_train, x_test)    
results[1] = perf_metric(metric, y_test, y_test_hat)   

# MLP
mlp_parameters = dict()
mlp_parameters['hidden_dim'] = 100
mlp_parameters['epochs'] = 100
mlp_parameters['activation'] = 'relu'
mlp_parameters['batch_size'] = 100
      
# y_train = to_categorical(y_train)
# y_test = to_categorical(y_test)    

x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)

y_test_hat = mlp(x_train, y_train, x_test, mlp_parameters)
results[2] = perf_metric(metric, y_test, y_test_hat)

# Report performance
for m_it in range(len(model_sets)):  
    
  model_name = model_sets[m_it]  
    
  print('Supervised Performance, Model Name: ' + model_name + 
        ', Performance: ' + str(results[m_it]))

Supervised Performance, Model Name: logit, Performance: 0.7058823529411765
Supervised Performance, Model Name: xgboost, Performance: 0.7352941176470589
Supervised Performance, Model Name: mlp, Performance: 0.7352941176470589


In [30]:
# from sklearn.manifold import TSNE

# tsne = TSNE(n_components=2, perplexity=30, n_iter=300)
# x_unlab_visual = tsne.fit_transform(x_unlab)
# import matplotlib.pyplot as plt

# # 2D scatter plot
# plt.scatter(x_unlab_visual[:, 0], x_unlab_visual[:, 1], s=5, alpha=0.5, c='red')
# plt.xlabel('Component 1')
# plt.ylabel('Component 2')
# plt.title('2D Visualization of x_unlab_visual')
# plt.show()

In [31]:
# visualized_cols = x_unlab[:, 100:150]
# print(visualized_cols)

### Train & Test VIME-Self
Train only the self-supervised part of the VIME framework.
- Check the performance of the self-supervised part of the VIME framework.

In [32]:
# print(x_train.shape)
# print(x_unlab.shape)

In [33]:
# type(vime_self_encoder)

In [34]:
# Train VIME-Self
vime_self_parameters = dict()
vime_self_parameters['batch_size'] = 128
vime_self_parameters['epochs'] = 10

vime_self_encoder, embeddings, all_activations, encoder_output_dim, history = get_encoder(
    x_unlab, architecture='default', p_m=p_m, alpha=alpha,
    parameters=vime_self_parameters)
print("Encoder output shape: (?, {})".format(encoder_output_dim))
# encoder_output_dim = 392
# vime_self_encoder, embeddings, all_activations = vime_self_fnn(x_unlab, p_m, alpha, vime_self_parameters)
  
# Save encoder
os.makedirs('save_model', exist_ok=True)

file_name = './save_model/encoder_model.h5'
  
vime_self_encoder.save(file_name)  
        
# Test VIME-Self
x_train_hat = vime_self_encoder.predict(x_train)
x_test_hat = vime_self_encoder.predict(x_test)
      
y_test_hat = mlp(x_train_hat, y_train, x_test_hat, mlp_parameters)

results[3] = perf_metric(metric, y_test, y_test_hat)

print('VIME-Self Performance: ' + str(results[3]))


Train on 84 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Encoder output shape: (?, 450)
VIME-Self Performance: 0.6764705882352942
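
The encoder above is pretrained on corrupted copies of `x_unlab`. As described in the VIME paper, each entry is masked with probability `p_m`, masked entries are replaced by values drawn from the same column of other rows, and the network learns to recover both the mask and the original values (`alpha` is understood to weight the feature reconstruction term). The NumPy sketch below illustrates only the corruption step. The helper names, the explicit `rng` argument, and the toy data are assumptions made for this example; the repository's own implementation may differ.

```python
import numpy as np

def mask_generator(p_m, x, rng):
    """Draw a binary mask; each entry is 1 with probability p_m."""
    return rng.binomial(1, p_m, size=x.shape)

def pretext_generator(mask, x, rng):
    """Replace masked entries with values from the same column of other rows."""
    n_rows, n_cols = x.shape
    x_bar = np.empty_like(x)
    for j in range(n_cols):
        # Shuffle each column independently so marginals are preserved.
        x_bar[:, j] = x[rng.permutation(n_rows), j]
    x_tilde = x * (1 - mask) + x_bar * mask
    # An entry counts as corrupted only if its value actually changed.
    mask_new = (x != x_tilde).astype(int)
    return mask_new, x_tilde

rng = np.random.default_rng(0)
x_demo = rng.random((6, 4))          # toy stand-in for x_unlab
m = mask_generator(0.3, x_demo, rng)
m_new, x_tilde = pretext_generator(m, x_demo, rng)
print(m_new.sum(), "of", m.sum(), "masked entries changed value")
```

Shuffling within a column keeps every corrupted value realistic for that feature, so the pretext task cannot be solved by spotting out-of-range values.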


In [35]:
# import matplotlib.pyplot as plt

# def min_max_normalize(lst):
#     """Normalize list values to range between 0 and 1."""
#     return [(i-min(lst))/(max(lst)-min(lst)) for i in lst]

# # Extract the total, mask, and feature losses from the history object
# train_loss = history.history['loss']
# normalized_train_loss = min_max_normalize(train_loss)

# mask_loss = history.history.get('mask_loss', None)  # Get mask loss if it exists
# if mask_loss:
#     normalized_mask_loss = min_max_normalize(mask_loss)

# feature_loss = history.history.get('feature_loss', None)  # Get feature loss if it exists
# if feature_loss:
#     normalized_feature_loss = min_max_normalize(feature_loss)

# # Plotting the losses
# plt.figure(figsize=(10,6))

# # Original Losses
# plt.plot(train_loss, label='Training Loss', color='blue', linestyle='--')
# if mask_loss:
#     plt.plot(mask_loss, label='Mask Loss', color='red', linestyle='--')
# if feature_loss:
#     plt.plot(feature_loss, label='Feature Loss', color='green', linestyle='--')

# # Normalized Losses
# plt.plot(normalized_train_loss, label='Normalized Training Loss', color='blue')
# if mask_loss:
#     plt.plot(normalized_mask_loss, label='Normalized Mask Loss', color='red')
# if feature_loss:
#     plt.plot(normalized_feature_loss, label='Normalized Feature Loss', color='green')

# plt.title('Losses and Normalized Losses per Epoch')
# plt.xlabel('Epoch')
# plt.ylabel('Loss')
# plt.legend()
# plt.grid(True)
# plt.show()


In [36]:
# import matplotlib.pyplot as plt

# # Extract the total, mask, and feature losses from the history object
# train_loss = history.history['loss']
# mask_loss = history.history.get('mask_loss', None)  # Get mask loss if it exists
# feature_loss = history.history.get('feature_loss', None)  # Get feature loss if it exists


# # Plotting the training loss
# plt.figure(figsize=(10,6))
# plt.plot(train_loss, label='Training Loss', color='blue')
# if mask_loss:  # Plot mask loss if it exists
#     plt.plot(mask_loss, label='Mask Loss', color='red')

# if feature_loss:  # Plot feature loss if it exists
#     plt.plot(feature_loss, label='Feature Loss', color='green')
    
    
# plt.title('Losses per Epoch')
# plt.xlabel('Epoch')
# plt.ylabel('Loss')
# plt.legend()
# plt.grid(True)
# plt.show()


In [37]:
# vime_self_encoder.summary()


In [38]:
# # Plotting
# plt.figure(figsize=(10,6))
# for encoder, loss in losses.items():
#     plt.plot(loss, label=encoder)

# plt.title("Loss Curves for Each Encoder")
# plt.xlabel("Epochs")
# plt.ylabel("Loss")
# plt.legend()
# plt.grid(True)
# plt.show()

In [39]:
# ### RUN THIS IF YOU WANT THE ENCODER CHOICE TO BE AUTOMATIC

# def evaluate_architecture(architecture_name, x_unlab, p_m, alpha, vime_self_parameters, x_train, y_train, x_test, metric):
#     # Get encoder
#     vime_self_encoder, embeddings, all_activations, encoder_output_dim, history = get_encoder(x_unlab, architecture=architecture_name, p_m=p_m, alpha=alpha, parameters=vime_self_parameters)
#     print("Encoder output shape for {}: (?, {})".format(architecture_name, encoder_output_dim))

#     # Save encoder
#     if not os.path.exists('save_model'):
#         os.makedirs('save_model')

#     file_name = './save_model/encoder_model_{}.h5'.format(architecture_name)
#     vime_self_encoder.save(file_name)

#     # Test VIME-Self
#     x_train_hat = vime_self_encoder.predict(x_train)
#     x_test_hat = vime_self_encoder.predict(x_test)
    
#     y_test_hat = mlp(x_train_hat, y_train, x_test_hat, mlp_parameters)
#     result = perf_metric(metric, y_test, y_test_hat)
#     print("{}'S PERFORMANCE IS {}".format(architecture_name.upper(), str(result)))

#     return result

# # Train VIME-Self
# vime_self_parameters = {
#     'batch_size': 128,
#     'epochs': 10
# }

# architectures = ['default', 'fnn', 'autoencoder']
# results = {}

# # Evaluate each architecture
# for arch in architectures:
#     results[arch] = evaluate_architecture(arch, x_unlab, p_m, alpha, vime_self_parameters, x_train, y_train, x_test, metric)

# # Determine the best architecture
# best_architecture = max(results, key=results.get)


# # Use the best architecture
# vime_self_encoder, embeddings, all_activations, encoder_output_dim, history= get_encoder(x_unlab, architecture=best_architecture, p_m=p_m, alpha=alpha, parameters=vime_self_parameters)

# # Save the final encoder
# if not os.path.exists('save_model'):
#     os.makedirs('save_model')
# file_name = './save_model/encoder_model_final.h5'
# vime_self_encoder.save(file_name)

# # Final test with the best architecture
# x_train_hat = vime_self_encoder.predict(x_train)
# x_test_hat = vime_self_encoder.predict(x_test)
# y_test_hat = mlp(x_train_hat, y_train, x_test_hat, mlp_parameters)
# results[3] = perf_metric(metric, y_test, y_test_hat)
# print('VIME-Self Performance with best architecture:', results[3])
# print("THE CHOSEN ARCHITECTURE IS", best_architecture)


In [40]:
# counter = 0 

# for layer_name, activation in all_activations.items():
#     print(layer_name)
#     counter = counter+1  
    
#     print(activation) 
    
# print(counter)

In [41]:
# print(chosen_activations[:, neuron1_index].shape)
# print(chosen_activations[:, neuron2_index].shape)
# print(x_unlab[:, chosen_feature_index].shape)

In [42]:
# import matplotlib.pyplot as plt

# # Extracting activations using your function
# activations_dict = all_activations

# # Choose a layer for visualization, for example, the first dense layer
# # (You can adjust this according to the layer names you have in your model)
# chosen_layer_name = 'dense_3'  # example name, replace with actual layer name
# chosen_activations = activations_dict[chosen_layer_name]

# # Define indices for neurons you'd like to visualize
# neuron1_index = 0
# neuron2_index = 1

# # Chosen feature index from original data for coloring
# chosen_feature_index = 0

# # Visualization
# plt.scatter(chosen_activations[:, neuron1_index], chosen_activations[:, neuron2_index], c=x_unlab[:, chosen_feature_index], cmap='viridis')
# plt.colorbar()
# plt.title(f'Activations of {chosen_layer_name} and Feature Visualization')
# plt.xlabel(f'Activation Neuron {neuron1_index}')
# plt.ylabel(f'Activation Neuron {neuron2_index}')
# plt.show()


In [43]:
# activations_dict[chosen_layer_name].shape[1]

In [44]:
# neuron_indices = get_random_subset_indices(activations_dict[chosen_layer_name].shape[1], neuron_fraction)
# neuron_indices

In [45]:
# import numpy as np
# import seaborn as sns
# import matplotlib.pyplot as plt
# activations_dict = all_activations
# chosen_layer_name = 'dense_3'



# def compute_correlation_with_features(activations, features):
#     """Compute correlation between activations and features."""
#     num_neurons = activations.shape[1]
#     num_features = features.shape[1]
    
#     correlation_matrix = np.zeros((num_neurons, num_features))
    
#     for i in range(num_neurons):
#         for j in range(num_features):
#             correlation_matrix[i, j] = np.corrcoef(activations[:, i], features[:, j])[0, 1]
            
#     return correlation_matrix

# def get_random_subset_indices(total_length, fraction):
#     """Get a random subset of indices based on the given fraction."""
#     subset_length = int(total_length * fraction)
#     return np.random.choice(total_length, subset_length, replace=False)

# # Fraction of neurons and features you wish to visualize
# neuron_fraction = 1
# feature_fraction = 1

# # neuron_indices = get_random_subset_indices(embeddings.shape[1], neuron_fraction)
# neuron_indices = get_random_subset_indices(activations_dict[chosen_layer_name].shape[1], neuron_fraction)
# feature_indices = get_random_subset_indices(x_unlab.shape[1], feature_fraction)

# chosen_activations = activations_dict[chosen_layer_name][:, neuron_indices]
# subset_features = x_unlab[:, feature_indices]
# correlation_matrix = compute_correlation_with_features(chosen_activations, subset_features)

# # Visualization
# plt.figure(figsize=(25, 18))
# sns.heatmap(correlation_matrix, cmap='viridis', annot=True)
# plt.title(f'Correlation between Random Subset of Activations of {chosen_layer_name} and Features')
# plt.xlabel('All Features')
# plt.ylabel('All Neurons')
# plt.show()


In [46]:
# import matplotlib.pyplot as plt
# import seaborn as sns

# # Assuming `activations` is a matrix where rows are samples and columns are activations
# plt.figure(figsize=(10, 8))
# sns.heatmap(activation, cmap='viridis')
# plt.title('Activation Heatmap')
# plt.xlabel('Neurons')
# plt.ylabel('Samples')
# plt.show()


In [47]:
# # Assuming activations is a matrix and original_data is your feature data
# plt.scatter(activations[:, neuron1], activations[:, neuron2], c=original_data[:, chosen_feature], cmap='viridis')
# plt.colorbar()
# plt.title('Activations and Feature Visualization')
# plt.xlabel(f'Activation Neuron {neuron1}')
# plt.ylabel(f'Activation Neuron {neuron2}')
# plt.show()


In [48]:
# means = np.mean(embeddings, axis=0)
# variances = np.var(embeddings, axis=0)



In [49]:
# plt.figure(figsize=(10,6))
# plt.hist(means, bins=30, edgecolor='black', alpha=0.7) # you can adjust the number of bins as per your requirements
# plt.title("Distribution of Neuron Activation Means")
# plt.xlabel("Mean Activation Value")
# plt.ylabel("Number of Neurons")
# plt.grid(True, which='both', linestyle='--', linewidth=0.5)
# # plt.xlim([0.0, 1.5])  # adjust the x-axis range here
# plt.show()

In [50]:
# import matplotlib.pyplot as plt

# # Replace this placeholder with your actual variance data

# plt.figure(figsize=(10,6))
# plt.hist(variances, bins=30, color='purple', edgecolor='black', alpha=0.7)
# plt.title("Distribution of Neuron Activation Variances")
# plt.xlabel("Variance Value")
# plt.ylabel("Number of Neurons")
# plt.grid(True, which='both', linestyle='--', linewidth=0.5)
# plt.tight_layout()
# plt.show()



In [51]:
# data = embeddings 

# row_drop_fraction = 0.95  # Drop 95% of rows
# col_drop_fraction = 0.95  # Drop 95% of columns

# # Calculate the number of rows and columns to drop
# num_rows_to_drop = int(row_drop_fraction * data.shape[0])
# num_cols_to_drop = int(col_drop_fraction * data.shape[1])

# # Shuffle the row indices and column indices
# row_indices = np.arange(data.shape[0])
# col_indices = np.arange(data.shape[1])
# np.random.shuffle(row_indices)
# np.random.shuffle(col_indices)

# # Select the remaining rows and columns
# remaining_rows = row_indices[num_rows_to_drop:]
# remaining_cols = col_indices[num_cols_to_drop:]

# # Drop the specified rows and columns
# data_after_drop = data[remaining_rows][:, remaining_cols]

In [52]:
# from sklearn.neighbors import KernelDensity
# import numpy as np

# # Sample data
# activations = np.random.randn(5400, 79)

# # Compute KDE for each feature
# kde_list = []
# for i in range(activations.shape[1]):
#     kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
#     kde.fit(activations[:, i][:, np.newaxis])
#     kde_list.append(kde)




In [53]:
# import numpy as np
# import matplotlib.pyplot as plt

# x = np.linspace(-5, 5, 1000)  # Adjust the range as per your data

# plt.figure(figsize=(7, 5))

# # Subset of features for clarity
# features_to_plot = range(10)  # Plotting only the first 10 features as an example

# for i in features_to_plot:
#     y = np.exp(kde_list[i].score_samples(x[:, np.newaxis]))
#     plt.plot(x, y, label=f"Feature {i}")

# plt.xlim([-0.02, 0.02])  # Adjust as needed
# plt.ylim([0, 0.5])  # Adjust as needed

# plt.legend()
# plt.title("Density plots for activations")
# plt.show()


In [54]:
# data_after_drop.shape

In [55]:
# import seaborn as sns
# def plot_kde(activations, num_features_to_plot=40):
#     # Plot KDE for the specified number of features
#     plt.figure(figsize=(15, 10))
#     for i in range(num_features_to_plot):
#         sns.kdeplot(activations[:, i], label=f'Neuron {i+1}')
    
#     plt.xlabel('Activation Value')
#     plt.ylabel('Density')
#     plt.title('KDE of Neuron Activations')
#     plt.legend()
#     plt.show()

# # Call the function with your activations
# plot_kde(data_after_drop)


In [56]:
# # embeddings=embeddings.numpy()

# print(embeddings)



In [57]:
# from sklearn.decomposition import PCA

# # Assuming 'embeddings' is your numpy array with high-dimensional data
# pca = PCA(n_components=2)
# reduced_embeddings = pca.fit_transform(embeddings)

# import matplotlib.pyplot as plt

# # 2D scatter plot
# plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], s=5, alpha=0.06) 
# plt.xlabel('Component 1')
# plt.ylabel('Component 2')
# plt.title('2D Visualization of Embeddings')
# plt.show()


In [58]:
# from sklearn.manifold import TSNE

# tsne = TSNE(n_components=2, perplexity=30, n_iter=300)
# reduced_embeddings = tsne.fit_transform(embeddings)
# import matplotlib.pyplot as plt

# # 2D scatter plot
# plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], s=15, alpha=1)
# plt.xlabel('Component 1')
# plt.ylabel('Component 2')
# plt.title('2D Visualization of Embeddings')
# plt.show()

In [59]:
# import seaborn as sns

# sns.kdeplot(x_unlab_visual[:, 0], x_unlab_visual[:, 1], alpha=0.8, cmap='Reds', shade=True, label='Scatter Plot 1')
# sns.kdeplot(reduced_embeddings[:, 0], reduced_embeddings[:, 1], alpha=0.4, cmap='Blues', shade=True, label='Scatter Plot 2')

# plt.legend()
# plt.xlabel('X')
# plt.ylabel('Y')
# plt.title('Comparison of Density Plots')
# plt.show()


In [60]:
# import seaborn as sns

# sns.kdeplot(reduced_embeddings[:, 0], reduced_embeddings[:, 1], alpha=0.8, cmap='Blues', shade=True, label='Scatter Plot 2')
# plt.legend()
# plt.xlabel('X')
# plt.ylabel('Y')
# plt.title('Comparison of Density Plots')
# plt.show()

In [61]:
# from sklearn.manifold import Isomap
# import matplotlib.pyplot as plt

# # Instantiate and fit the Isomap model
# isomap = Isomap(n_components=2, n_neighbors=5)
# isomap_results = isomap.fit_transform(embeddings)

# # Plotting
# plt.scatter(isomap_results[:, 0], isomap_results[:, 1], s=5, alpha=0.06)
# plt.xlabel('Dimension 1')
# plt.ylabel('Dimension 2')
# plt.title('Isomap Visualization')
# plt.show()

In [62]:
# pip install umap-learn --user

In [63]:
# import umap

# reducer = umap.UMAP()
# reduced_embeddings = reducer.fit_transform(embeddings)

# import matplotlib.pyplot as plt

# # 2D scatter plot
# plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], s=5, alpha=0.06)
# plt.xlabel('Component 1')
# plt.ylabel('Component 2')
# plt.title('2D Visualization of Embeddings')
# plt.show()

### Train & Test VIME

Train the semi-supervised part of the VIME framework on top of the trained self-supervised encoder.
- Check the performance of the entire VIME framework.

In [64]:
# Train VIME-Semi
vime_semi_parameters = dict()
vime_semi_parameters['hidden_dim'] = 100
vime_semi_parameters['batch_size'] = 128
vime_semi_parameters['iterations'] = 1000
y_test_hat = vime_semi(x_train, y_train, x_unlab, x_test,
                       vime_semi_parameters, p_m, K, beta, file_name, encoder_output_dim)

# Test VIME
results[4] = perf_metric(metric, y_test, y_test_hat)
  
print('VIME Performance: '+ str(results[4]))

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Iteration: 0/1000, Current loss: 0.8873
Iteration: 100/1000, Current loss: 0.4405
Iteration: 200/1000, Current loss: 0.525
Iteration: 300/1000, Current loss: 0.5233
INFO:tensorflow:Restoring parameters from ./save_model/class_model.ckpt
VIME Performance: 0.6764705882352942
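
The `K` and `beta` arguments passed to `vime_semi` control the semi-supervised objective. Following the description in the VIME paper, the predictor is trained with a supervised loss on labeled rows plus `beta` times a consistency penalty that compares predictions on `K` corrupted copies of each unlabeled row with the prediction on the original row. The NumPy sketch below shows that combination on random logits. The function `semi_supervised_loss`, its argument layout, and the toy shapes are assumptions for illustration; the repository may compute the unsupervised term differently.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def semi_supervised_loss(y_onehot, logits_lab, logits_unlab, logits_unlab_aug, beta):
    """Cross-entropy on labeled rows plus beta times a consistency penalty.

    logits_unlab:     (n_unlab, n_classes) predictions on original rows
    logits_unlab_aug: (K, n_unlab, n_classes) predictions on K corrupted copies
    """
    probs = softmax(logits_lab)
    supervised = -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1))
    # Mean squared gap between corrupted-copy and original predictions.
    consistency = np.mean((logits_unlab_aug - logits_unlab[None, :, :]) ** 2)
    return supervised + beta * consistency, supervised, consistency

rng = np.random.default_rng(0)
K, n_lab, n_unlab, n_classes = 3, 5, 8, 2
y_onehot = np.eye(n_classes)[rng.integers(0, n_classes, n_lab)]
logits_lab = rng.normal(size=(n_lab, n_classes))
logits_unlab = rng.normal(size=(n_unlab, n_classes))
logits_aug = logits_unlab[None] + 0.1 * rng.normal(size=(K, n_unlab, n_classes))

total, sup, cons = semi_supervised_loss(y_onehot, logits_lab, logits_unlab,
                                        logits_aug, beta=1.0)
print(total, sup, cons)
```

Setting `beta` to zero reduces the objective to plain supervised training, which makes it a convenient knob for checking whether the unlabeled data helps.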


### Report Prediction Performances

- 3 Supervised learning models
- VIME with self-supervised part only
- Entire VIME framework

In [65]:
for m_it in range(len(model_sets)):  
    
  model_name = model_sets[m_it]  
    
  print('Supervised Performance, Model Name: ' + model_name + 
        ', Performance: ' + str(results[m_it]))
    
# Indices 3 and 4 hold the VIME-Self and VIME results stored in the cells above.
print('VIME-Self Performance: ' + str(results[3]))

print('VIME Performance: ' + str(results[4]))

Supervised Performance, Model Name: logit, Performance: 0.7058823529411765
Supervised Performance, Model Name: xgboost, Performance: 0.7352941176470589
Supervised Performance, Model Name: mlp, Performance: 0.7352941176470589
VIME-Self Performance: 0.6764705882352942
VIME Performance: 0.6764705882352942
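
Positional indexing into `results` is easy to get wrong as models are added. One alternative, sketched here under the assumption that `model_sets` is `['logit', 'xgboost', 'mlp']` (as the printed names suggest) and using the scores printed above, is to pair each score with a label. The `labels`, `scores`, and `report` names are introduced only for this example.

```python
model_sets = ['logit', 'xgboost', 'mlp']           # assumed from the printed names
scores = [0.7058823529411765, 0.7352941176470589,  # values copied from the output above
          0.7352941176470589, 0.6764705882352942,
          0.6764705882352942]

labels = model_sets + ['VIME-Self', 'VIME']
report = dict(zip(labels, scores))

for name, score in report.items():
    print(f'{name:<10s} {score:.4f}')

# On ties, max() returns the first key that reaches the top score.
best = max(report, key=report.get)
print('Best on this split:', best)
```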
