# Experiment: Ensemble Gplearn-DSR

This experiment uses the predictions of several prior models to choose the symbolic operators that are made available when training a downstream symbolic regression model.

### Hypothesis

The hypothesis is that the set of candidate terms can be narrowed using the predictions of prior models, which could in turn improve the accuracy and recovery time of the downstream model.

### Experimental Details

For this experiment, the dataset is sampled from the AI-Feynman dataset, restricted to equations with at most _3 variables_. As a result, the dataset consists of **52 equations**, each with 500 support points. _Gplearn_ and _Deep Symbolic Regression_ are chosen as the prior models.



| Model   | Method Type         | Accuracy | Dataset    | Support Points | Avg. Recovery Time (sec) | Noise  |
|---------|---------------------|----------|------------|----------------|--------------------------|--------|
| [Gplearn](https://github.com/trevorstephens/gplearn) | Genetic Programming | 0.38     | Feynman-03 | 500            | **11.24**                | 0      |
| [Deep Symbolic Regression](https://github.com/brendenpetersen/deep-symbolic-optimization) | Deep generative | 0.60 | Feynman-03 | 500 | 16.99 | 0 |
| [AIFeynman](https://github.com/SJ001/AI-Feynman) | Heuristic Based | **0.93** | Feynman-03 | 500 | **185.26** | 0 |

In [1]:
import os
from collections import defaultdict

import sympy as sp
from tqdm.notebook import tqdm

import pandas as pd
import seaborn as sns
import numpy as np

In [2]:
df_gplearn = pd.read_excel(
    "../logs/gplearn_2022-06-17_14-14-12.xlsx", 
    engine="openpyxl", 
    usecols="B:H"
)

df_dsr = pd.read_excel(
    "../logs/dsr_2022-06-17_15-26-36.xlsx", 
    engine="openpyxl",
    usecols="B:H"
)

df_aif = pd.read_excel(
    "../logs/AIF_2022-06-19_22-12-47.xlsx",
    engine="openpyxl",
    usecols="B:H"
)

In [3]:
df_aif.describe()

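|       | number_of_points | accuracy | time       |
|-------|------------------|----------|------------|
| count | 52.0             | 52.0     | 52.0       |
| mean  | 500.0            | 0.928769 | 185.269209 |
| std   | 0.0              | 0.218937 | 46.83518   |
| min   | 500.0            | 0.022    | 126.735669 |
| 25%   | 500.0            | 0.998    | 132.957314 |
| 50%   | 500.0            | 1.0      | 185.813932 |
| 75%   | 500.0            | 1.0      | 241.665853 |
| max   | 500.0            | 1.0      | 246.100262 |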


In [5]:
def sort_freq_dict(freqDict:dict) -> dict:
    sorted_freq_tuple = sorted(freqDict.items(), key=lambda item: item[1], reverse=True)
    return dict(sorted_freq_tuple)

def combine_freq_dicts(freq_dicts:list) -> dict:
    """Returns a combined frequency dictionary from input frequency dicts"""
    
    keys = []
    for freq_dict in freq_dicts:
        keys.extend(freq_dict.keys())
    
    keys = set(keys)  
    
    result = defaultdict(int)
    
    for key in keys:
        for freq_dict in freq_dicts:
            v = freq_dict.get(key)
            if v is not None:
                result[key] += v
        
    return dict(result)

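def compute_acc_weight(acc):
    """Returns the weight exp(10*acc - 10), which is 1 at acc = 1 and decays quickly below it"""
    return np.exp(10*acc-10)

def compute_weighted_op_count(opp_acc_list) -> list:
    """Returns one frequency dict per (ops_count, accuracy) pair, scaled by the accuracy weight"""
    
    weighted_opp_acc_list = []
    
    for ops_count, acc in opp_acc_list:
        
        w = compute_acc_weight(acc)
        
        # Build a new dict so the caller's ops_count is not modified in place
        weighted_ops_count = {k: v*w for k, v in ops_count.items()}
        
        weighted_opp_acc_list.append(weighted_ops_count)
        
    return weighted_opp_acc_list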
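
To give a feel for how sharply this weighting favours accurate predictions, here is a minimal sketch that evaluates the same formula for a few accuracies. It uses `math.exp` instead of `np.exp` only to stay dependency-free, and `accuracy_weight` is a hypothetical stand-in for `compute_acc_weight`. The values 0.38 and 0.60 are borrowed from the table above purely as sample inputs.

```python
import math


def accuracy_weight(accuracy: float) -> float:
    """Same formula as compute_acc_weight: exp(10 * accuracy - 10)."""
    return math.exp(10 * accuracy - 10)


# Sample accuracies, chosen for illustration only.
weights = {acc: accuracy_weight(acc) for acc in (0.38, 0.60, 1.00)}

for acc, w in weights.items():
    print(f"accuracy={acc:.2f} -> weight={w:.6f}")
```

Under this formula a prediction with accuracy 0.60 counts roughly nine times as much as one with accuracy 0.38 (the ratio is exp(2.2)), and a perfect prediction has weight 1.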

In [6]:
def get_op_freq(eq) -> dict:
    """Returns the frequency for each operator as dict"""
    
    ops_string = str(eq.count_ops(visual=True))
    ops_iterable = [op.strip() for op in ops_string.split("+")]

    result = {}

    for ops in ops_iterable:

        args = ops.split("*")
        if len(args) == 1: # occurs once
            op_name = args[0].lower()
            result[op_name] = 1
            continue

        freq, op_name = int(args[0]), args[1].lower()
        result[op_name] = freq

    return result    
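
The parsing step in `get_op_freq` can be illustrated without SymPy by feeding it a string directly. The sketch below is a hypothetical, standalone variant (`parse_visual_count` is not part of the notebook) that works on text of the form `ADD + MUL + POW + 2*SIN`, which is how the visual operator count is expected to print. It also skips a purely numeric term, on the assumption that an operator-free expression prints as `0`.

```python
def parse_visual_count(visual: str) -> dict:
    """Parse a string such as 'ADD + MUL + POW + 2*SIN' into {operator: frequency}."""
    counts = {}
    for term in visual.split("+"):
        term = term.strip()
        # Assumption: an expression with no operators prints as a bare number.
        if not term or term.isdigit():
            continue
        if "*" in term:
            # 'N*OP' means the operator occurs N times.
            freq, name = term.split("*", 1)
            counts[name.strip().lower()] = int(freq)
        else:
            # A bare 'OP' means the operator occurs once.
            counts[term.lower()] = 1
    return counts


example = parse_visual_count("ADD + MUL + POW + 2*SIN")
print(example)
```

Here `example` maps `add`, `mul` and `pow` to 1 and `sin` to 2, matching the dict shape that `get_op_freq` returns.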

# Heuristic Algorithm

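1. Compute a weight `w` for each predicted equation: `w = exp(10x - 10)`, where `x` is the accuracy of that prediction.
2. Multiply each operator's frequency by the weight.
3. Sum the weighted frequencies of each operator across the models.
4. Sort the operators by their combined weighted frequency, in descending order.
5. Choose the top `k` operators (`k = 4` in this run).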
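
The steps above can be sketched end to end on toy data. This is a compact illustration rather than the notebook's implementation: `rank_operators` is a hypothetical helper, the operator counts and accuracies are made up, and `collections.Counter` replaces the separate combine and sort helpers.

```python
import math
from collections import Counter


def rank_operators(predictions, k):
    """Rank operators by accuracy-weighted frequency and keep the top k.

    predictions: list of (operator_counts, accuracy) pairs, one per prior model.
    """
    combined = Counter()
    for op_counts, accuracy in predictions:
        weight = math.exp(10 * accuracy - 10)   # step 1
        for op, freq in op_counts.items():
            combined[op] += freq * weight       # steps 2 and 3
    # Steps 4 and 5: sort by combined score and keep the k highest.
    return [op for op, _ in combined.most_common(k)]


# Made-up predictions for a single equation from two prior models.
predictions = [
    ({"mul": 3, "add": 2, "sin": 1}, 0.95),  # accurate model
    ({"mul": 1, "div": 3}, 0.40),            # inaccurate model
]

top_ops = rank_operators(predictions, k=3)
print(top_ops)
```

Although `div` has the highest raw count in the second prediction, its low accuracy weight keeps it out of the top three, which is the behaviour the heuristic is designed to produce.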

In [7]:
k = 4  # number of top-ranked operators to keep per equation
cols = ["equation", "predicted_equation", "model_name", "accuracy"]
# Rows are paired by position, so both logs must list the equations in the same order
iterator = enumerate(zip(df_dsr[cols].iterrows(), df_gplearn[cols].iterrows()))

rows = []

for idx, model_results in iterator:
    
    consolidated_ops_count = []
    
    for _, result in model_results:
        
        sympy_eq = sp.simplify(sp.sympify(result.predicted_equation), inverse=True)
        ops_count = get_op_freq(sympy_eq)
        
        ops_acc = (ops_count, result.accuracy)
        
        consolidated_ops_count.append(ops_acc)
        
    weighted_ops_count = compute_weighted_op_count(consolidated_ops_count)

    consolidated_ops = combine_freq_dicts(weighted_ops_count)
    
    result_ops = sorted(consolidated_ops, key=consolidated_ops.get, reverse=True)[:k]
    ops = ','.join(result_ops)
    print(f'Equation {idx}: {ops}')
    
    row = {
        "equation": model_results[0][1].equation,
        "ops": ops,
        "top_k": k,
    }
    
    rows.append(row)

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
ensemble_df = pd.DataFrame(rows, columns=["equation", "ops", "top_k"])

Equation 0: mul,add,log,exp
Equation 1: mul,add,div,log
Equation 2: div,sin,exp,mul
Equation 3: div,add,mul,pow
Equation 4: mul
Equation 5: mul,exp,add,div
Equation 6: mul
Equation 7: mul
Equation 8: add,div,pow,mul
Equation 9: mul,div,add,sub
Equation 10: mul,sub,div,exp
Equation 11: mul,sin
Equation 12: div
Equation 13: sin,mul,cos,pow
Equation 14: mul,div,add,cos
Equation 15: div
Equation 16: div,mul,add,cos
Equation 17: mul,div,pow,add
Equation 18: mul,pow,log,div
Equation 19: mul,add,log,exp
Equation 20: mul,log,pow,add
Equation 21: add,cos,mul,sub
Equation 22: mul,div,add,pow
Equation 23: add,mul,pow,div
Equation 24: mul
Equation 25: div,mul,pow,add
Equation 26: mul,add,pow,log
Equation 27: mul,neg,div,log
Equation 28: mul,add,log,div
Equation 29: mul,div,log,add
Equation 30: mul,add,sub,exp
Equation 31: div,mul,add,pow
Equation 32: mul,exp,div,add
Equation 33: sub,pow,div,mul
Equation 34: mul,add,pow,div
Equation 35: mul,cos,neg,pow
Equation 36: mul,cos,neg,add
Equation 37: div,

In [8]:
ensemble_df[:5]

|   | equation | ops | top_k |
|---|----------|-----|-------|
| 0 | `exp(-x_1**2/2)/sqrt(2*pi)` | mul,add,log,exp | 4 |
| 1 | `exp(-(x_2/x_1)**2/2)/(sqrt(2*pi)*x_1)` | mul,add,div,log | 4 |
| 2 | `exp(-((x_2-x_2)/x_1)**2/2)/(sqrt(2*pi)*x_1)` | div,sin,exp,mul | 4 |
| 3 | `x_1/sqrt(1-x_2**2/x_3**2)` | div,add,mul,pow | 4 |
| 4 | `x_1*x_2` | mul | 4 |


In [9]:
ensemble_df.to_excel("../results/top_4_ops_gp-dsr-heuristic-exp.xlsx")