# LLM is a great rule-based feature engineer in few-shot tabular learning
## Overview
This notebook runs training and inference for few-shot tabular learning task over benchmark datasets. GPT-3.5 model is used in this tutorial.

## Overall process
* Prepare datasets
* Extract rules for prediction from training samples with the help of LLM
* Parse rules to the program code and convert data into the binary vector
* Train the linear model to predict the likelihood of each class from the binary vector
* Make inference with ensembling

## Differences in V6
* New pipeline

**Notes**
There are two major flaws within our logs and experiments. Logs with the outdated system may have versioning that leaves out details and includes versioning that isn't correct. For those, there was no distinction between condition prompting and interaction conditions prompting having different versioning for I only varied the interaction condition prmopting. Thus, anything with a -v2 is just cXv0 and icXv3 and -v1 is probably xCv0 and icXv2. There are probabyl are experiments with icXv0 but those are minimal. 

Second, I mislabeled the prompts as numerical for the longest time until August 4th. All experiments with the proper labeling in the feature description will now have a dash at the end with the note that clarifies this.


**DISCLAIMER:**
This code is heavily based on the open-source project FeatLLM developed by Sungwon Han.
You can find the original repository at: https://github.com/Sungwon-Han/FeatLLM.

Many of the functions and methodologies implemented here are derived from or inspired by the work in FeatLLM.
All credit goes to the original authors for their invaluable contributions to this project.
Modifications and extensions to the original codebase have been made to tailor it to specific use cases and requirements.


In [75]:
import os
import json
import copy
import numpy as np
from importlib import reload
import utils
reload(utils)
import requests

import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd

from tqdm import tqdm
from torch.optim import Adam
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LinearRegression

## Prepare datasets
1. Set dataset and simulation parameters (e.g. # of training shots, and the random seed)
2. Get SNPs data and split it into train/test dataset, given simulation parameters

In [76]:
_NUM_QUERY = 20 # Number of Queries
_SHOT = 16 # Number of training shots (cannot be 15-shot)
_SEED = 3 # Seed for fixing randomness
_NUM_OF_CONDITIONS = 15
_NUM_OF_CONDITIONS_FOR_INTERACTIONS = 0
_DATA = "hearing_loss_pyramid_llm_select_modified"
#_MODEL = "gpt-4o-2024-05-13"
_MODEL = "gpt-3.5-turbo"
#_MODEL = "Meta-Llama-3.1-405B-Instruct-Turbo"
_FUNCTION_MODEL = "gpt-3.5-turbo"
_REWRITING_FUNCTION_MODEL = "gpt-4-1106-preview"
_CONDITION_PROMPT_VERSION = "v5"
_INTERACTION_PROMPT_VERSION = "v0"
_NOTE = "" # Start note with a dash
#_NOTE = "-no-examples-only"
#_NOTE = "-fixed-categorical-conditions"
_RECORD_LOGS = True
_METADATA_VERSION = "v0" # Select higher versions for augmentation
utils.set_seed(_SEED)

now, let's get the dataset

In [77]:
df, X_train, X_test, y_train, y_test, target_attr, label_list, is_cat = utils.get_dataset(_DATA, _SHOT, _SEED)
X_all = df.drop(target_attr, axis=1)

## Extract rules for prediction from training samples with the help of LLM
To enable the LLM to extract rules based on a more accurate reasoning path, we guided the problem-solving process to mimic how a person might approach a tabular learning task.   

We divided the problem into two sub-tasks for this purpose:   
1. Understand the task description and the features provided by the data, inferring the causal relationships beforehand.   
2. Use the inferred information and few-shot samples to deduce the prediction rules for each class. This two-step reasoning process prevents the model from identifying spurious correlations in irrelevant columns and assists in focusing on more significant features.   

Our prompt comprises three main components as follows:  
* Task description
* Reasoning instruction
* Response instruction

In [78]:
reload(utils)
if 'ancestry' in _DATA:
    _DATA_TYPE = 'ancestry'
else:
    _DATA_TYPE = 'hearing_loss'
if "gpt" in _MODEL:
    ask_file_name = f'./templates/ask_llm_{_CONDITION_PROMPT_VERSION}_{_DATA_TYPE}.txt'
else: 
    ask_file_name = f'./templates/ask_llm_llama_{_CONDITION_PROMPT_VERSION}.txt'
meta_data_name = f"../data/{_DATA}-metadata-{_METADATA_VERSION}.json"
templates, feature_desc = utils.get_prompt_for_asking(
    _DATA, X_all, X_train, y_train, label_list, target_attr, ask_file_name, 
    meta_data_name, is_cat, num_query=_NUM_QUERY, num_conditions=_NUM_OF_CONDITIONS,
    prompt_version = _CONDITION_PROMPT_VERSION
)
print(templates[0])

{'c.35delG': '', 'c.235delC': '', 'p.V37I': '', 'A1555G': '', 'p.W77X': '', 'p.R75W': '', 'p.E47X': '', 'p.R143Q': '', 'p.R143W': '', 'c.234_235delC': '', 'p.R75Q': '', 'p.V207L': '', 'c.919_2A>G': '', 'p.G316X': '', 'p.E303Q': ''}
You are an expert in genetics. Given the task description and the list of features and data examples, you are extracting and engineering novel features to solve the task. The purpose of this process is to generate a set of rich, dense and robust features that better express the data.

## Task
Does the subject have hereditary hearing loss? With regards to SNP variants, no mutations being found for the SNP are indicated by 0, heterozygous mutations by 1, and homozygous mutations by 2.


## Features
- c.35delG
- c.235delC
- p.V37I
- A1555G
- p.W77X
- p.R75W
- p.E47X
- p.R143Q
- p.R143W
- c.234_235delC
- p.R75Q
- p.V207L
- c.919_2A>G
- p.G316X
- p.E303Q

## Examples
c.35delG is 0. c.235delC is 0. p.V37I is 0. A1555G is 0. p.W77X is 0. p.R75W is 0. p.E47X is 0. p

In [79]:
_DIVIDER = "\n\n---DIVIDER---\n\n"
_VERSION = "\n\n---VERSION---\n\n"

rule_file_name = f'./rules/{_DATA}/{_SHOT}_shot/rule-s{_SHOT}-c{_NUM_OF_CONDITIONS}{_CONDITION_PROMPT_VERSION}-{_MODEL}-q{_NUM_QUERY}-{_SEED}{_NOTE}.out'
if os.path.isfile(rule_file_name) == False:
    results = utils.query_gpt(templates, max_tokens=2000, temperature=1, model = _MODEL)
    if _RECORD_LOGS:
        with open(rule_file_name, 'w') as f:
            total_rules = _DIVIDER.join(results)
            f.write(total_rules)
else:
    with open(rule_file_name, 'r') as f:
        total_rules_str = f.read().strip()
        results = total_rules_str.split(_DIVIDER)

print(results[0])

1. **Understand the problem**: 
   - The task is to determine if a subject has hereditary hearing loss based on their genetic mutations.
   - The provided data includes various SNP variants and their mutation status (0, 1, 2).

2. **Feature Engineering**:
   - **c.35delG** and **p.V37I**: Since both are mutations associated with Connexin 26 protein, we can create an interaction feature to capture the combined effect of these mutations: `combined_c.35delG_p.V37I = c.35delG * p.V37I`
   - **p.R143Q** and **p.R143W**: These are two mutations at the same position. We can create a feature to capture the presence of either mutation: `either_p.R143Q_p.R143W = p.R143Q OR p.R143W`

3. **Analysis**:
   - The interaction feature `combined_c.35delG_p.V37I` can provide insights into how these mutations together affect the likelihood of hereditary hearing loss.
   - The feature `either_p.R143Q_p.R143W` can help identify subjects with mutations at the specific position regardless of the type of mutat

In [80]:
parsed_rules = []

# Iterate through each result in the results list
for result in results:
    # Use utils.query_gpt to transform each result
    transformed_result = utils.query_gpt(
        [f"Extract the list of engineered features (include their equation or instructions) and list them one after another in a new line: {result}\n\nIf some features are clumped up together, make sure to list them separately.\n\nList:"], 
        max_tokens=2000, 
        temperature=0, 
        model=_FUNCTION_MODEL
    )
    # Append the transformed result to the results_transformed list
    parsed_rules.append(transformed_result[0])

# The parsed_rules list now contains all the transformed results
print(parsed_rules)


100%|██████████| 1/1 [00:01<00:00,  1.19s/it]
100%|██████████| 1/1 [00:01<00:00,  1.21s/it]
100%|██████████| 1/1 [00:02<00:00,  2.92s/it]
100%|██████████| 1/1 [00:01<00:00,  1.57s/it]
100%|██████████| 1/1 [00:01<00:00,  1.31s/it]
100%|██████████| 1/1 [00:02<00:00,  2.17s/it]
100%|██████████| 1/1 [00:01<00:00,  1.43s/it]
100%|██████████| 1/1 [00:02<00:00,  2.54s/it]
100%|██████████| 1/1 [00:03<00:00,  3.09s/it]
100%|██████████| 1/1 [00:01<00:00,  1.33s/it]
100%|██████████| 1/1 [00:01<00:00,  1.95s/it]
100%|██████████| 1/1 [00:01<00:00,  1.44s/it]
100%|██████████| 1/1 [00:02<00:00,  2.04s/it]
100%|██████████| 1/1 [00:01<00:00,  1.99s/it]
100%|██████████| 1/1 [00:03<00:00,  3.01s/it]
100%|██████████| 1/1 [00:01<00:00,  1.28s/it]
100%|██████████| 1/1 [00:01<00:00,  1.69s/it]
100%|██████████| 1/1 [00:01<00:00,  1.55s/it]
100%|██████████| 1/1 [00:01<00:00,  1.52s/it]
100%|██████████| 1/1 [00:04<00:00,  4.61s/it]

['1. combined_c.35delG_p.V37I = c.35delG * p.V37I\n2. either_p.R143Q_p.R143W = p.R143Q OR p.R143W', '1. Combined mutations of SNP variants known to impact hearing loss\n2. Patterns or combinations of mutations commonly found together in individuals with hereditary hearing loss\n3. Presence of specific combinations of mutations known to increase the risk of hearing loss\n4. Additive effects of mutations on hearing loss likelihood\n5. Multiplicative effects of mutations on hearing loss likelihood', '1. Review the provided examples to understand the format of the data and the features that are relevant to hereditary hearing loss.\n2. Analyze the significance of each SNP variant in relation to hereditary hearing loss based on existing literature and research.\n3. Identify potential interactions between different SNP variants that may have a synergistic effect on hereditary hearing loss.\n4. Consider the impact of homozygous mutations versus heterozygous mutations for each SNP variant on th




In [81]:
print(results[0])

1. **Understand the problem**: 
   - The task is to determine if a subject has hereditary hearing loss based on their genetic mutations.
   - The provided data includes various SNP variants and their mutation status (0, 1, 2).

2. **Feature Engineering**:
   - **c.35delG** and **p.V37I**: Since both are mutations associated with Connexin 26 protein, we can create an interaction feature to capture the combined effect of these mutations: `combined_c.35delG_p.V37I = c.35delG * p.V37I`
   - **p.R143Q** and **p.R143W**: These are two mutations at the same position. We can create a feature to capture the presence of either mutation: `either_p.R143Q_p.R143W = p.R143Q OR p.R143W`

3. **Analysis**:
   - The interaction feature `combined_c.35delG_p.V37I` can provide insights into how these mutations together affect the likelihood of hereditary hearing loss.
   - The feature `either_p.R143Q_p.R143W` can help identify subjects with mutations at the specific position regardless of the type of mutat

In [82]:
print(parsed_rules[12])

1. c.35delG_and_c.235delC_mutations = c.35delG + c.235delC
2. A1555G_times_p.W77X = A1555G * p.W77X
3. R75W_AND_R75Q_mutations = p.R75W AND p.R75Q
4. c.234_235delC_times_c.919_2A>G = c.234_235delC * c.919_2A>G
5. R143Q_OR_R143W_mutations = p.R143Q OR p.R143W


## Parse rules to the program code and convert data into the binary vector

We utilize the rules generated in the previous stage to transform each sample into a binary vector. These vectors are created for each answer class, indicating whether the sample satisfies the rules associated with that class. However, since the rules generated by the LLM are based on natural language, parsing the text into program code is required for automatic data transformation.  

To address the challenges of parsing noisy text, instead of building complex program code, we leverage the LLM itself. We include the function name, input and output descriptions, and inferred rules in the prompt, then input it into the LLM. The generated code is executed using Python’s exec() function along with the provided function name to perform data conversion.

In [83]:
reload(utils)
_DIVIDER = "\n\n---DIVIDER---\n\n"
_VERSION = "\n\n---VERSION---\n\n"

saved_file_name = f'./rules/{_DATA}/{_SHOT}_shot/function-s{_SHOT}-c{_NUM_OF_CONDITIONS}{_CONDITION_PROMPT_VERSION}-ic{_NUM_OF_CONDITIONS_FOR_INTERACTIONS}{_INTERACTION_PROMPT_VERSION}-{_MODEL}-{_FUNCTION_MODEL}-q{_NUM_QUERY}-{_SEED}{_NOTE}.out'    
if os.path.isfile(saved_file_name) == False:
    function_file_name = './templates/ask_for_function_v2.txt'
    fct_strs_all = []
    for parsed_rule in tqdm(parsed_rules):
        fct_template = utils.get_prompt_for_generating_function_simple(
            parsed_rule, feature_desc, function_file_name
        )
        fct_results = utils.query_gpt(fct_template, max_tokens=2500, temperature=0, model = _FUNCTION_MODEL)
        fct_strs = [fct_txt.split('<start>')[1].split('<end>')[0].strip() for fct_txt in fct_results]
        fct_strs_all.append(fct_strs[0])
    if _RECORD_LOGS:
        with open(saved_file_name, 'w') as f:
            total_str = _VERSION.join([x for x in fct_strs_all])
            f.write(total_str)
else:
    with open(saved_file_name, 'r') as f:
        total_str = f.read().strip()
        fct_strs_all = [x.split(_DIVIDER) for x in total_str.split(_VERSION)]

In [87]:
# Examine function output
print(fct_strs_all[4])

['d', 'e', 'f', ' ', 'e', 'x', 't', 'r', 'a', 'c', 't', 'i', 'n', 'g', '_', 'e', 'n', 'g', 'i', 'n', 'e', 'e', 'r', 'e', 'd', '_', 'f', 'e', 'a', 't', 'u', 'r', 'e', 's', '(', 'd', 'f', '_', 'i', 'n', 'p', 'u', 't', ')', ':', '\n', ' ', ' ', ' ', ' ', 'd', 'f', '_', 'o', 'u', 't', 'p', 'u', 't', ' ', '=', ' ', 'p', 'd', '.', 'D', 'a', 't', 'a', 'F', 'r', 'a', 'm', 'e', '(', ')', '\n', ' ', ' ', ' ', ' ', 'd', 'f', '_', 'o', 'u', 't', 'p', 'u', 't', '[', "'", 'c', '.', '3', '5', 'd', 'e', 'l', 'G', "'", ']', ' ', '=', ' ', 'd', 'f', '_', 'i', 'n', 'p', 'u', 't', '[', "'", 'c', '.', '3', '5', 'd', 'e', 'l', 'G', "'", ']', '\n', ' ', ' ', ' ', ' ', 'd', 'f', '_', 'o', 'u', 't', 'p', 'u', 't', '[', "'", 'c', '.', '2', '3', '5', 'd', 'e', 'l', 'C', "'", ']', ' ', '=', ' ', 'd', 'f', '_', 'i', 'n', 'p', 'u', 't', '[', "'", 'c', '.', '2', '3', '5', 'd', 'e', 'l', 'C', "'", ']', '\n', ' ', ' ', ' ', ' ', 'd', 'f', '_', 'o', 'u', 't', 'p', 'u', 't', '[', "'", 'p', '.', 'V', '3', '7', 'I', "'", 

In [85]:
critique_fct_strs_all = fct_strs_all

#### Self-Critiqueing Function Writing

In [86]:
reload(utils)
critique_fct_strs_all = utils.self_critique_functions_simple(parsed_rules, feature_desc, fct_strs_all, X_train, _NUM_OF_CONDITIONS, _NUM_OF_CONDITIONS_FOR_INTERACTIONS, _REWRITING_FUNCTION_MODEL, condition_tolerance=30)

'list' object has no attribute 'strip'
Function string to critique: ['d', 'e', 'f', ' ', 'e', 'x', 't', 'r', 'a', 'c', 't', 'i', 'n', 'g', '_', 'e', 'n', 'g', 'i', 'n', 'e', 'e', 'r', 'e', 'd', '_', 'f', 'e', 'a', 't', 'u', 'r', 'e', 's', '(', 'd', 'f', '_', 'i', 'n', 'p', 'u', 't', ')', ':', '\n', ' ', ' ', ' ', ' ', 'd', 'f', '_', 'o', 'u', 't', 'p', 'u', 't', ' ', '=', ' ', 'p', 'd', '.', 'D', 'a', 't', 'a', 'F', 'r', 'a', 'm', 'e', '(', ')', '\n', ' ', ' ', ' ', ' ', 'd', 'f', '_', 'o', 'u', 't', 'p', 'u', 't', '[', "'", 'c', 'o', 'm', 'b', 'i', 'n', 'e', 'd', '_', 'c', '.', '3', '5', 'd', 'e', 'l', 'G', '_', 'p', '.', 'V', '3', '7', 'I', "'", ']', ' ', '=', ' ', 'd', 'f', '_', 'i', 'n', 'p', 'u', 't', '[', "'", 'c', '.', '3', '5', 'd', 'e', 'l', 'G', "'", ']', ' ', '*', ' ', 'd', 'f', '_', 'i', 'n', 'p', 'u', 't', '[', "'", 'p', '.', 'V', '3', '7', 'I', "'", ']', '\n', ' ', ' ', ' ', ' ', 'd', 'f', '_', 'o', 'u', 't', 'p', 'u', 't', '[', "'", 'e', 'i', 't', 'h', 'e', 'r', '_', 'p'

TypeError: can only concatenate str (not "list") to str

In [None]:
if _RECORD_LOGS:
    with open(saved_file_name, 'w') as f:
        total_str = _VERSION.join([_DIVIDER.join(x) for x in critique_fct_strs_all])
        f.write(total_str)

In [None]:
# Get function names and strings
fct_names = []
fct_strs_final = []
for fct_str in critique_fct_strs_all:
    if 'def' not in fct_str:
        continue
    fct_names.append(fct_str.split('def')[1].split('(')[0].strip())
    fct_strs_final.append(fct_str)

### Checking Outputs

In [None]:
# Print an arbitrary but particular function
print(critique_fct_strs_all[2])

def extracting_features(df_input):
    df_output = pd.DataFrame()
    df_output['NEW_FEATURE_1'] = df_input['FEATURE_1']
    df_output['NEW_FEATURE_2'] = df_input['FEATURE_2']
    df_output['NEW_FEATURE_3'] = df_input['FEATURE_3'].apply(lambda x: 1 if x in [1,2] else 0)
    df_output['NEW_FEATURE_4'] = df_input['FEATURE_4'].apply(lambda x: 1 if x in [0,1] else 0)
    df_output['NEW_FEATURE_5'] = df_input.apply(lambda row: 1 if row['FEATURE_3'] >= 0 and row['FEATURE_9'] >= 1 else 0, axis=1)
    df_output['NEW_FEATURE_6'] = df_input.apply(lambda row: 1 if row['FEATURE_4'] == 1 and row['FEATURE_8'] == 0 else 0, axis=1)
    df_output['NEW_FEATURE_7'] = df_input.apply(lambda row: row['FEATURE_4'] * row['FEATURE_5'], axis=1)
    
    return df_output


In [None]:
# Check some arbitrary function if it works
exec(critique_fct_strs_all[0].strip('` "'))
fct_strs_all[0]
locals()[fct_names[0]](X_train.head(1))

Unnamed: 0,combined_c.35delG_p.V37I,either_p.R143Q_p.R143W
874,0,0.0


In [None]:
# Check some arbitrary function if it works on the test data
exec(critique_fct_strs_all[0].strip('` "'))
fct_strs_all[0]
locals()[fct_names[0]](X_test.head(1))

Unnamed: 0,combined_c.35delG_p.V37I,either_p.R143Q_p.R143W
0,0,0


In [None]:
exec(critique_fct_strs_all[1].strip('` "'))
#locals()[fct_names[1][1]](X_test.head(268)).astype('int').to_numpy().shape

In [None]:
fct_names[1]

'extracting_engineered_features'

In [None]:
# This is the final list of functions that generates the features for each ensemble
# Type: str[][]
print(critique_fct_strs_all[8])

def extracting_engineered_features(df_input):
    df_output = pd.DataFrame()
    df_output['GJB2_gene_with_c.35delG_mutation'] = df_input['c.35delG']
    df_output['SLC26A4_gene_with_c.919_2A>G_mutation'] = df_input['c.919_2A>G']
    df_output['c.35delG_mutation_in_GJB2_disrupts_Connexin_26_protein'] = df_input['c.35delG'].apply(lambda x: 1 if x == 1 else 0)
    df_output['c.919_2A>G_mutation_in_SLC26A4_affects_Pendrin_protein'] = df_input['c.919_2A>G'].apply(lambda x: 1 if x == 1 else 0)
    df_output['Presence_of_both_mutations_can_lead_to_increased_risk_and_severity_of_hearing_loss'] = df_input.apply(lambda row: 1 if row['c.35delG'] == 1 and row['c.919_2A>G'] == 1 else 0, axis=1)
    df_output['Combined_disruptions_in_gap_junction_communication_and_ion_transport_pathways_can_worsen_hearing_loss'] = df_input.apply(lambda row: 1 if row['c.35delG'] == 1 or row['c.919_2A>G'] == 1 else 0, axis=1)
    df_output['combined_mutations'] = df_input['c.35delG'] * df_input['c.919_2A>G']
    
   

### Convert to binary vectors

In [None]:
mask = X_test.notna().all(axis=1)

# Dropping weird NAs
X_test = X_test[mask]
y_test = y_test[mask]

In [None]:
reload(utils)
executable_list, X_train_all_dict, X_test_all_dict = utils.convert_to_binary_vectors_simple(fct_strs_final, 
                                                                                     fct_names, 
                                                                                     X_train, 
                                                                                     X_test, 
                                                                                     num_of_features=_NUM_OF_CONDITIONS+_NUM_OF_CONDITIONS_FOR_INTERACTIONS,
                                                                                     include_original_features=True)

Iteration 2, Error in convert_to_binary_vectors: 'FEATURE_1'
                    Is the # of columns in X_train equal to X_test after applying the functions? True
                    Is the # of rows the same after applying the functions? True
                    How many conditional features after applying the functions? 15
Iteration 4, Error in convert_to_binary_vectors: 'FEATURE_1'
                    Is the # of columns in X_train equal to X_test after applying the functions? True
                    Is the # of rows the same after applying the functions? True
                    How many conditional features after applying the functions? 15
Iteration 5, Error in convert_to_binary_vectors: 'FEATURE_1'
                    Is the # of columns in X_train equal to X_test after applying the functions? True
                    Is the # of rows the same after applying the functions? True
                    How many conditional features after applying the functions? 15
Iteration 17, Error

In [None]:
# The number of functions should be == # of ensembles. If not, some of the functions are faulty and broke when applying to training data.
len(executable_list)

16

## Train the linear model to predict the likelihood of each class from the binary vector
When given the rules for each class and a sample, a simple method to measure the class likelihood of the sample is to count how many rules of each class it satisfies (i.e., the sum of the binary vector per class). However, not all rules carry the same importance, necessitating learning their significance from training samples.    
  
We aimed to train this importance using a basic linear model without bias, applied to each class's binary vector.

In [None]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neural_network import MLPClassifier

multiclass = True if len(label_list) > 2 else False
y_train_num = np.array([label_list.index(k) for k in y_train])
y_test_num = np.array([label_list.index(k) for k in y_test])
mlp = RandomForestClassifier()

# Fit the model
mlp.fit(X_train, y_train_num)
lr_pred_probs_train = mlp.predict_proba(X_train)
lr_metrics_train = utils.evaluate(lr_pred_probs_train, y_train_num, multiclass=multiclass, class_level_analysis=True, label_list=label_list)
lr_pred_probs_test = mlp.predict_proba(X_test)
lr_metrics_test = utils.evaluate(lr_pred_probs_test, y_test_num, multiclass=multiclass, class_level_analysis=True, label_list=label_list)
lr_metrics_test

{'AUC': 0.5837500862247361,
 'Accuracy': 0.335676625659051,
 'F1-Score': 0.3000211316316901,
 'Class No Precision': 0.24742268041237114,
 'Class No Recall': 0.9022556390977443,
 'Class No F1-Score': 0.3883495145631068,
 'Class Yes Precision': 0.8452380952380952,
 'Class Yes Recall': 0.1628440366972477,
 'Class Yes F1-Score': 0.27307692307692305}

In [None]:
from sklearn.ensemble import RandomForestClassifier

mlp_models = []

# Train an MLP model on each version of the training data
for X_train_now, X_test_now in zip(X_train_all_dict.values(), X_test_all_dict.values()):
    mlp = RandomForestClassifier()
    mlp.fit(X_train_now, y_train_num)
    mlp_models.append(mlp)
    lr_pred_probs_train = mlp.predict_proba(X_train_now)
    lr_metrics_train = utils.evaluate(lr_pred_probs_train, y_train_num, multiclass=multiclass, class_level_analysis=True, label_list=label_list)
    lr_pred_probs_test = mlp.predict_proba(X_test_now)
    lr_metrics_test = utils.evaluate(lr_pred_probs_test, y_test_num, multiclass=multiclass, class_level_analysis=True, label_list=label_list)
    print("num of features: ", X_train_now.shape[1])
    print(lr_metrics_test['AUC'])

# Initialize arrays to store ensemble predictions
ensemble_pred_probs_train = np.zeros((X_train_all_dict[0].shape[0], len(label_list)))
ensemble_pred_probs_test = np.zeros((X_test_all_dict[0].shape[0], len(label_list)))

# Predict probabilities for training and test sets using each model and combine them
for i, (X_train_now, X_test_now) in enumerate(zip(X_train_all_dict.values(), X_test_all_dict.values())):
    ensemble_pred_probs_train += mlp_models[i].predict_proba(X_train_now)
    ensemble_pred_probs_test += mlp_models[i].predict_proba(X_test_now)
    

# Average the probabilities
ensemble_pred_probs_train /= len(X_train_all_dict)
ensemble_pred_probs_test /= len(X_test_all_dict)

# Evaluate the ensemble predictions
ensemble_metrics_train = utils.evaluate(
    ensemble_pred_probs_train, 
    y_train_num, 
    multiclass=multiclass, 
    class_level_analysis=True, 
    label_list=label_list
)

ensemble_metrics_test = utils.evaluate(
    ensemble_pred_probs_test, 
    y_test_num, 
    multiclass=multiclass, 
    class_level_analysis=True, 
    label_list=label_list
)

# Output the test metrics
ensemble_metrics_test

num of features:  17
0.5837500862247361
num of features:  30
0.5643150306960061
num of features:  30
0.5827671242325998
num of features:  30
0.5615558391391323
num of features:  30
0.5616075739808236
num of features:  22
0.5593657308408636
num of features:  30
0.5827153893909085
num of features:  30
0.5827153893909085
num of features:  20
0.5832413602814375
num of features:  20
0.5610471131958337
num of features:  30
0.560572877146996
num of features:  25
0.5614609919293648
num of features:  18
0.561098848037525
num of features:  30
0.5837673311719667
num of features:  19
0.5924329171552735
num of features:  30
0.5632803338621784


{'AUC': 0.5805770159343313,
 'Accuracy': 0.335676625659051,
 'F1-Score': 0.3000211316316901,
 'Class No Precision': 0.24742268041237114,
 'Class No Recall': 0.9022556390977443,
 'Class No F1-Score': 0.3883495145631068,
 'Class Yes Precision': 0.8452380952380952,
 'Class Yes Recall': 0.1628440366972477,
 'Class Yes F1-Score': 0.27307692307692305}

In [None]:
from sklearn.tree import export_graphviz
import graphviz
estimator = mlp.estimators_[4]

# Export the tree to Graphviz format
dot_data = export_graphviz(estimator, out_file=None,
                           class_names=label_list,
                           filled=True, rounded=True,
                           special_characters=True)

# Visualize the tree using graphviz
graph = graphviz.Source(dot_data)  
graph.render("tree")  # Save the tree as a file
graph.view()  # View the tree

'tree.pdf'

In [None]:
print(ensemble_pred_probs_test)

[[0.19594749 0.80405251]
 [0.19594749 0.80405251]
 [0.85450577 0.14549423]
 ...
 [0.55842546 0.44157454]
 [0.84578913 0.15421087]
 [0.55842546 0.44157454]]


In [None]:
X_train_all_dict

{0: tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

In [None]:
X_test_all_dict

{0: tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 1.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 1., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 1: tensor([[0., 0., 0.,  ..., 1., 0., 0.],
         [0., 0., 0.,  ..., 1., 0., 0.],
         [0., 0., 1.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 1., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 3: tensor([[0., 0., 0.,  ..., 1., 0., 0.],
         [0., 0., 0.,  ..., 1., 0., 0.],
         [0., 0., 1.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 1., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 6: tensor([[0., 0., 0.,  ..., 1., 0., 0.],
         [0., 0., 0.,  ..., 1., 0., 0.],
         [0., 0., 1.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 1., 0.,  .

In [None]:
metrics = utils.evaluate(ensemble_pred_probs_test, y_test_num, multiclass=multiclass, class_level_analysis=True, label_list = label_list)
# Function to print metrics in a readable format
def print_metrics(metrics):
    print("\nEvaluation Metrics:")
    print("------------------------------------------------------")
    i = 1
    for metric, value in metrics.items():
        print(f"{metric:45}: {value:.4f}")
        i += 1
        if i % 3 == 0:
            print("------------------------------------------------------")
        
print_metrics(metrics)


Evaluation Metrics:
------------------------------------------------------
AUC                                          : 0.5806
Accuracy                                     : 0.3357
------------------------------------------------------
F1-Score                                     : 0.3000
Class No Precision                           : 0.2474
Class No Recall                              : 0.9023
------------------------------------------------------
Class No F1-Score                            : 0.3883
Class Yes Precision                          : 0.8452
Class Yes Recall                             : 0.1628
------------------------------------------------------
Class Yes F1-Score                           : 0.2731


In [None]:
saved_file_name = f'./logs/{_DATA}/{_SHOT}_shot/evaluation-s{_SHOT}-c{_NUM_OF_CONDITIONS}{_CONDITION_PROMPT_VERSION}-ic{_NUM_OF_CONDITIONS_FOR_INTERACTIONS}{_INTERACTION_PROMPT_VERSION}-{_MODEL}-{_FUNCTION_MODEL}-{_NUM_QUERY}-{(len(executable_list))}-{_SEED}{_NOTE}.out'
if _RECORD_LOGS:
    with open(saved_file_name, 'w') as f:
        f.write("Evaluation Metrics:\n")
        f.write("------------------------------------------------------\n")
        i = 1
        for metric, value in metrics.items():
            f.write(f"{metric:45}: {value:.4f}\n")
            i += 1
            if i % 3 == 0:
                f.write("------------------------------------------------------\n")