# MSSP: Multi-Set Symbolic Skeleton Prediction for Symbolic Regression

## Installation

Execute `!pip install git+https://github.com/NISL-MSU/MultiSetSR`

**IMPORTANT:** This code is implemented using PyTorch and CUDA. If you're running it on Google Colab, change the runtime type to GPU.

In [1]:
!pip install -q git+https://github.com/NISL-MSU/MultiSetSR
import warnings
warnings.filterwarnings("ignore")

Found existing installation: MultiSetSR 0.0.1
Uninstalling MultiSetSR-0.0.1:
  Successfully uninstalled MultiSetSR-0.0.1
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for MultiSetSR (pyproject.toml) ... done


## Example using pre-determined datasets

In this example, we will predict the symbolic skeletons corresponding to each variable of a system whose underlying equation is one of the following:

<br>

| Eq. | Underlying equation |
|-----|---------------------|
| E1  | $ (3.0375 x_1 x_2 + 5.5 \sin (9/4 (x_1 - 2/3)(x_2 - 2/3)))/5 $|
| E2  | $ 5.5 + (1- x_1/4) ^ 2 + \sqrt{x_2 + 10} \sin( x_3/5)$|
| E3  | $(1.5 e^{1.5  x_1} + 5 \cos(3 x_2)) / 10$|
| E4  | $((1- x_1)^2 + (1- x_3) ^ 2 + 100 (x_2 - x_1 ^ 2) ^ 2 + 100 (x_4 - x_3 ^ 2) ^ 2)/10000$|
| E5  | $\sin(x_1 + x_2 x_3) + \exp{(1.2  x_4)}$|
| E6  | $\tanh(x_1 / 2) + \text{abs}(x_2) \cos(x_3^2/5)$|
| E7  | $(1 - x_2^2) / (\sin(2 \pi \, x_1) + 1.5)$|
| E8  | $x_1^4 / (x_1^4 + 1) + x_2^4 / (x_2^4 + 1)$|
| E9  | $\log(2 x_2 + 1) - \log(4 x_1 ^ 2 + 1)$|
| E10 | $\sin(x_1 \, e^{x_2})$|
| E11 | $x_1 \, \log(x_2 ^ 4)$|
| E12 | $1 + x_1 \, \sin(1 / x_2)$|
| E13 | $\sqrt{x_1}\, \log(x_2 ^ 2)$|

In [2]:
from EquationLearning.SymbolicRegressor.MSSP import *

datasetName = 'E6'
data_loader = DataLoader(name=datasetName)
data = data_loader.dataset
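
For reference, E6 from the table can be written as a plain Python function. This is only a sketch for checking values by hand; the function name `e6` is hypothetical and is not part of the MultiSetSR package. Note that the table indexes variables from 1 ($x_1, x_2, x_3$), while the skeleton output further down names them `x0`, `x1`, `x2`.

```python
import math


def e6(x1: float, x2: float, x3: float) -> float:
    """E6 from the table: tanh(x1 / 2) + |x2| * cos(x3**2 / 5)."""
    return math.tanh(x1 / 2) + abs(x2) * math.cos(x3 ** 2 / 5)


# With x2 and x3 held fixed, only the tanh term changes with x1.
print(e6(0.0, 0.0, 0.0))   # 0.0
print(e6(0.0, -2.0, 0.0))  # 2.0
```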

**Define NN and load weights**

For this example, we have already trained a feedforward neural network on the generated dataset, so we only need to load its weights.

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
root = get_project_root()
folder = os.path.join(root, "EquationLearning", "saved_models", "saved_NNs", datasetName)
filepath = os.path.join(folder, "weights-NN-" + datasetName)
nn_model = NNModel(device=device, n_features=data.n_features, NNtype=data_loader.modelType)
nn_model.loadModel(filepath)

**Get Skeletons**

The following method generates several candidate symbolic skeletons for each variable and selects the most appropriate one.

In [4]:
regressor = MSSP(dataset=data, bb_model=nn_model)
regressor.get_skeletons()

********************************
Analyzing variable x0
********************************
Predicted skeleton 1 for variable x0: c*tanh(c*x0) + c
Predicted skeleton 2 for variable x0: c*tanh(c*x0 + c) + c
Predicted skeleton 3 for variable x0: c + tanh(c*x0)
Predicted skeleton 4 for variable x0: c + tanh(c*x0 + c)
Predicted skeleton 5 for variable x0: c*sqrt(c*tanh(c*x0) + c) + c

Choosing the best skeleton... (skeletons ordered based on number of nodes)
	Skeleton: c + tanh(c*x0). Correlation: 0.9997416536770057. Expr: tanh(0.536603*x0)
-----------------------------------------------------------
Selected skeleton: c*tanh(c*x0) + c

********************************
Analyzing variable x1
********************************
Predicted skeleton 1 for variable x1: c*Abs(x1) + c
Predicted skeleton 2 for variable x1: c*x1*tanh(c*x1) + c
Predicted skeleton 3 for variable x1: c + x1*tanh(c*x1 + c)
Predicted skeleton 4 for variable x1: c + x1*tanh(c*x1)
Predicted skeleton 5 for variable x1: c*x1*tanh(c*

[c*tanh(c*x0) + c, c*Abs(x1) + c, c*cos(c*x2**2 + c*x2 + c) + c]
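
The log above reports a correlation for each candidate. The source does not say exactly how this value is computed. The sketch below assumes it is a Pearson correlation between the candidate expression and the response along one variable. It uses the true E6 formula in place of the trained network, and the sampling grid and fixed values are hypothetical. Because Pearson correlation is unchanged by scaling and shifting, this assumption would also explain why the reported `Expr` omits the outer constants of the skeleton.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical grid for x0; the range used by the library is not stated here.
xs = [-5 + 10 * i / 199 for i in range(200)]

# Slice of E6 along x0 with x1 and x2 fixed at arbitrary values.
x1_fixed, x2_fixed = 1.5, 2.0
offset = abs(x1_fixed) * math.cos(x2_fixed ** 2 / 5)
slice_values = [math.tanh(x / 2) + offset for x in xs]

# Inner constant from the true equation vs. the one reported in the log.
exact = pearson(slice_values, [math.tanh(0.5 * x) for x in xs])
fitted = pearson(slice_values, [math.tanh(0.536603 * x) for x in xs])
print(exact, fitted)
```

The additive offset contributed by the fixed variables does not affect either value, so only the inner constant of `tanh` matters for the comparison.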

## Example using custom equations

Here we show how to use data generated from your own equations. Alternatively, you can bring your own dataset (e.g., a CSV file) and load the matrix $X$ (explanatory variables) and the vector $Y$ (response variable).

In this example, consider the simple equation $y = \frac{\sin(x_1 + 1.2 \, x_2) \, x_3^2}{2}$. Suppose that $x_1$ and $x_2$ are continuous variables and that $x_3$ is discrete and can take 100 evenly spaced values ($x_1 \in [-10, 10]$, $x_2 \in [-5, 5]$, and $x_3 \in \{-8, \dots, 8\}$).

**Generate and format data**

In [5]:
np.random.seed(7)
n = 10000
# Generate data from the equation
x1 = np.random.uniform(-10, 10, size=n)
x2 = np.random.uniform(-5, 5, size=n)
x3 = np.array([np.random.choice(np.linspace(-8, 8, 100)) for _ in range(n)])  # Example of discrete variable
X = np.array([x1, x2, x3]).T
Y = np.sin(x1 + 1.2 * x2) * (x3**2 / 2)  # Or load matrices X and Y from a CSV file

# Format the dataset
names = ['x0', 'x1', 'x2']  # Specify the names of the variables
types = ['continuous', 'continuous', 'discrete']  # Specify if the variables are continuous or discrete
dataset = InputData(X=X, Y=Y, names=names, types=types)
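
If the data lives in a CSV file instead, the same `X` and `Y` matrices can be built with NumPy. The sketch below is one possible way to do it, assuming a header row and the response in the last column. The in-memory text stands in for a real file, and its three rows are toy values computed from the equation above.

```python
import io

import numpy as np

# Stand-in for a real CSV file; pass a file path to np.loadtxt instead.
csv_text = """x0,x1,x2,y
0.5,-1.0,2.0,-1.2884
1.0,0.0,-4.0,6.7318
-2.0,1.5,8.0,-6.3574
"""

# skiprows=1 drops the header line.
table = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
X = table[:, :-1]  # explanatory variables, one column per variable
Y = table[:, -1]   # response variable
print(X.shape, Y.shape)  # (3, 3) (3,)

names = ['x0', 'x1', 'x2']
types = ['continuous', 'continuous', 'discrete']
# Then, as in the cell above:
# dataset = InputData(X=X, Y=Y, names=names, types=types)
```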

**Train a NN**

Unlike the previous example, we haven't trained a NN for this problem, so let's train one now. If you're not satisfied with the validation MSE, try increasing the number of epochs or using a different architecture. By default, we use `modelType='NN'`; if you need less complexity, try `modelType='NN2'`; if you need more, try `modelType='NN3'`.

In [6]:
from EquationLearning.Trainer.TrainNNmodel import Trainer

predictor = Trainer(dataset=dataset, modelType='NN')
predictor.train(batch_size=128, epochs=3000, printProcess=False)
# Save the model
# predictor.model.saveModel(path)  # Specify your own path

*****************************************
Start MLP training
*****************************************


100%|██████████| 3000/3000 [09:20<00:00,  5.35it/s]

Val MSE: 0.07133777567923286
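
One way to judge whether a validation MSE is acceptable is to compare it with the variance of the response. The sketch below assumes the usual definition of MSE (mean of squared residuals). Whether `Trainer` normalizes the targets before reporting the value is not stated here, and the numbers are toy values.

```python
def mse(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def variance(values):
    """Population variance, i.e. the MSE of always predicting the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Toy targets and predictions, for illustration only.
y_true = [0.0, 2.0, 4.0, 6.0]
y_pred = [0.5, 1.5, 4.5, 5.5]

error = mse(y_true, y_pred)
ratio = error / variance(y_true)  # fraction of the variance left unexplained
print(error, ratio)
```

A ratio close to 0 means the network captures most of the variation in the response; a ratio close to 1 means it does little better than predicting the mean.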





**Get Skeletons**

The following method generates several candidate symbolic skeletons for each variable and selects the most appropriate one.

In [7]:
regressor = MSSP(dataset=dataset, bb_model=predictor.model)
regressor.get_skeletons()

********************************
Analyzing variable x0
********************************
Predicted skeleton 1 for variable x0: c*cos(c + x0) + c
Predicted skeleton 2 for variable x0: c*sin(c + x0) + c
Predicted skeleton 3 for variable x0: c*cos(c*x0 + c) + c
Predicted skeleton 4 for variable x0: c*cos(c*x0) + c
Predicted skeleton 5 for variable x0: c*cos(x0) + c

Choosing the best skeleton... (skeletons ordered based on number of nodes)
	Skeleton: c*cos(x0) + c. Correlation: 0.9549434622348927. Expr: cos(x0)
	Skeleton: c*cos(c + x0) + c. Correlation: 0.9998902707120207. Expr: cos(x0 - 5.968739)
-----------------------------------------------------------
Selected skeleton: c*cos(c + x0) + c

********************************
Analyzing variable x1
********************************
Predicted skeleton 1 for variable x1: c*sin(c*x1 + c) + c
Predicted skeleton 2 for variable x1: c*cos(c*x1 + c) + c
Predicted skeleton 3 for variable x1: c*sin(c + x1) + c
Predicted skeleton 4 for variable x1: c*c

[c*cos(c + x0) + c, c*sin(c*x1 + c) + c, c*x2**2 + c]
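
As a sanity check, the selected skeletons can be compared with the known equation $y = \frac{\sin(x_1 + 1.2 \, x_2) \, x_3^2}{2}$ by fixing all variables but one at arbitrary constants $a$ and $b$. The notebook names the variables `x0`, `x1`, `x2`; these correspond to $x_1$, $x_2$, $x_3$ below, and each $c$ stands for an unspecified constant.

$$
\begin{aligned}
x_2 = a,\ x_3 = b:\quad & y = \tfrac{b^2}{2}\sin(x_1 + 1.2a) = \tfrac{b^2}{2}\cos\!\left(x_1 + 1.2a - \tfrac{\pi}{2}\right) && \Rightarrow\ c\cos(c + x_1) + c \\
x_1 = a,\ x_3 = b:\quad & y = \tfrac{b^2}{2}\sin(1.2\,x_2 + a) && \Rightarrow\ c\sin(c\,x_2 + c) + c \\
x_1 = a,\ x_2 = b:\quad & y = \tfrac{\sin(a + 1.2b)}{2}\,x_3^2 && \Rightarrow\ c\,x_3^2 + c
\end{aligned}
$$

In each case the additive constant of the skeleton is zero for the true equation. The $\cos$ form selected for `x0` is equivalent to the $\sin$ form up to a phase shift of $\pi/2$.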