### EvoGP Tutorial: Getting Started with Tree-Based Genetic Programming

This tutorial will guide you through using EvoGP for tree-based genetic programming (TGP), showcasing key features such as tree generation, problem definition, and algorithm customization.

In [1]:
# Import necessary modules
import torch
from evogp.tree import Forest, GenerateDiscriptor
from evogp.problem import SymbolicRegression, Classification
from evogp.algorithm import *
from evogp.pipeline import StandardPipeline

### Section 1: Introduction to Tree Generation

#### Understanding the `GenerateDiscriptor` Class Parameters

The `GenerateDiscriptor` class helps configure the parameters for tree generation. Let’s understand its arguments:

- **`max_tree_len`**: This parameter specifies the maximum number of nodes that the tree can have. It helps control the tree’s size and complexity.

- **`input_len`**: The number of input variables that the tree will take. This defines how many features or input dimensions the tree will work with.

- **`output_len`**: The number of outputs the tree will produce. This is used for multi-output problems.

- **`const_prob`**: The probability that a node in the tree will be a constant value rather than an input variable. A higher value makes constants more likely to appear in the tree.

- **`out_prob`**: The probability that a node in the tree will be an output node. This helps define how many nodes in the tree will directly correspond to outputs.

- **`depth2leaf_probs`**: A tensor that specifies the probability distribution for the tree’s growth at different depths. If not provided, it will be generated based on other parameters such as `max_layer_cnt` and `layer_leaf_prob`.

- **`roulette_funcs`**: A tensor that represents the cumulative probability distribution for selecting different functions (such as addition, subtraction, etc.) at each node. If not provided, it will be built from the `using_funcs` parameter.

- **`const_samples`**: This parameter contains the constant values that can be used in the tree. It can be either a list or a tensor of pre-defined constants. If not provided, the constants will be generated within the range defined by `const_range` and `sample_cnt`.

- **`using_funcs`**: A dictionary or list of functions that will be available for use at each node of the tree. If `roulette_funcs` is not provided, this parameter will be used to build it.

- **`max_layer_cnt`**: The maximum number of layers that the tree can have. This is used when `depth2leaf_probs` is not provided, helping to control the tree’s depth and structure.

- **`layer_leaf_prob`**: The probability of a node being a leaf at each layer in the tree. This is used if `depth2leaf_probs` is not provided.

- **`const_range`**: A tuple that defines the range from which constant values can be sampled. This is used if `const_samples` is not provided.

- **`sample_cnt`**: The number of constant samples to generate if `const_samples` is not provided. This works in conjunction with `const_range` to define the distribution of constants.

After initializing the `GenerateDiscriptor` class with the above parameters, they will be aggregated and processed into the following key parameters: `max_tree_len`, `input_len`, `output_len`, `const_prob`, `out_prob`, `depth2leaf_probs`, `roulette_funcs`, `const_samples`. These key parameters represent the most important aspects of the tree’s structure and behavior, which will be used throughout the genetic programming process.

You can print these parameters and use the `GenerateDiscriptor` to generate a tree as follows:

In [2]:
descriptor = GenerateDiscriptor(
    max_tree_len=64,
    input_len=2,
    output_len=1,
    using_funcs=["+", "-", "*", "/"],
    max_layer_cnt=5,
    const_samples=[-1, 0, 1]
)
print(descriptor)

Forest.random_generate(1, descriptor)

max_tree_len: 64
input_len: 2
output_len: 1
const_prob: 0.5
out_prob: 0.5
depth2leaf_probs: tensor([0.2000, 0.2000, 0.2000, 0.2000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000], device='cuda:0')
roulette_funcs: tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000, 1.0000], device='cuda:0')
const_samples: tensor([-1.,  0.,  1.], device='cuda:0')



Forest(pop size: 1)
[
  * * / + 1.00 1.00 + 0.00 0.00 / x[1] * x[1] 1.00 * + / x[0] x[0] / -1.00 1.00 + - x[1] x[1] / -1.00 0.00 , 
]
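The two lookup tables in the printed descriptor can be reproduced by hand. The sketch below is plain Python, not EvoGP's implementation: the helper names, the table lengths (10 and 29), the `layer_leaf_prob` value of 0.2, and the function ids 1 to 4 for `+`, `-`, `*`, `/` are all assumptions read off the output above.

```python
import bisect


def build_depth2leaf_probs(max_layer_cnt, layer_leaf_prob, table_len=10):
    """Leaf probability per depth (assumed layout, inferred from the printout).

    Depths above the last allowed layer use layer_leaf_prob; from the last
    layer onward a node must be a leaf, so the probability is 1.0.
    """
    return [
        layer_leaf_prob if depth < max_layer_cnt - 1 else 1.0
        for depth in range(table_len)
    ]


def build_roulette(func_ids, weights, table_len):
    """Cumulative distribution over function ids (hypothetical helper)."""
    probs = [0.0] * table_len
    total = sum(weights)
    for func_id, weight in zip(func_ids, weights):
        probs[func_id] = weight / total
    cumulative, running = [], 0.0
    for p in probs:
        running += p
        cumulative.append(running)
    return cumulative


def spin(cumulative, r):
    """Return the first function id whose cumulative value exceeds r in [0, 1)."""
    return bisect.bisect_right(cumulative, r)


depth_table = build_depth2leaf_probs(max_layer_cnt=5, layer_leaf_prob=0.2)
roulette = build_roulette(func_ids=[1, 2, 3, 4], weights=[1, 1, 1, 1], table_len=29)
print(depth_table)
print(roulette[:6])
print(spin(roulette, 0.1), spin(roulette, 0.6))
```

Reading the table this way explains why ids with a flat cumulative value (such as id 0 here) are never drawn: no random number in [0, 1) falls in a zero-width interval.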

#### Using the `update` Method

The `GenerateDiscriptor` class also provides an `update` method for changing parameters after initialization. It accepts any number of keyword arguments (`**kwargs`), merges them into the descriptor’s parameter dictionary, and, as the example below shows, returns a descriptor that reflects the new settings. Derived parameters such as `roulette_funcs` are rebuilt accordingly.

Here’s how you can use the `update` method to generate a tree with a different configuration:

In [3]:
new_descriptor = descriptor.update(using_funcs=["sin", "cos", "tan"])
print(new_descriptor)

Forest.random_generate(1, new_descriptor)

max_tree_len: 64
input_len: 2
output_len: 1
const_prob: 0.5
out_prob: 0.5
depth2leaf_probs: tensor([0.2000, 0.2000, 0.2000, 0.2000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000], device='cuda:0')
roulette_funcs: tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.3333, 0.6667, 1.0000, 1.0000,
        1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000, 1.0000], device='cuda:0')
const_samples: tensor([-1.,  0.,  1.], device='cuda:0')



Forest(pop size: 1)
[
  tan cos sin x[0] , 
]
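The printed trees appear to be in prefix (Polish) notation: each function is followed by its operands, so `tan cos sin x[0]` reads as tan(cos(sin(x0))). The sketch below is a small plain-Python evaluator for strings in that form. It is an illustration only; `eval_prefix` is a hypothetical helper, and it uses ordinary division, whereas EvoGP may handle division by zero differently.

```python
import math

# Operators recognised by this sketch (a subset chosen for illustration).
UNARY = {"sin": math.sin, "cos": math.cos, "tan": math.tan}
BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,  # plain division, no zero protection
}


def eval_prefix(tokens, inputs):
    """Evaluate a prefix-notation token list against a list of input values."""
    stream = iter(tokens)

    def walk():
        token = next(stream)
        if token in BINARY:
            left = walk()
            right = walk()
            return BINARY[token](left, right)
        if token in UNARY:
            return UNARY[token](walk())
        if token.startswith("x["):
            return inputs[int(token[2:-1])]  # "x[1]" -> inputs[1]
        return float(token)  # constant such as "1.00"

    return walk()


print(eval_prefix("tan cos sin x[0]".split(), [0.5]))
print(eval_prefix("+ x[0] * x[1] 1.00".split(), [2.0, 3.0]))
```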

### Section 2: Defining Problems

EvoGP supports several problem types, and a problem can be built either from existing data or from a custom function:

#### a. Symbolic Regression

In [4]:
X = torch.rand((100, 2), device="cuda")
y = X[:, 0] ** 2 + 2 * X[:, 1]
problem = SymbolicRegression(datapoints=X, labels=y)
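A symbolic-regression problem scores each tree by how closely it reproduces `labels` on `datapoints`. The sketch below assumes the score is the negated mean squared error, so that higher is better and a perfect fit scores zero; this is an assumption, although it is consistent with the non-positive fitness values in the pipeline log later in this tutorial. `neg_mse` is a hypothetical helper, not an EvoGP function.

```python
def neg_mse(predict, X, y):
    """Negated mean squared error of predict over rows X with targets y."""
    squared_errors = [(predict(row) - target) ** 2 for row, target in zip(X, y)]
    return -sum(squared_errors) / len(squared_errors)


# Tiny hand-written dataset following the same rule as the cell above.
X = [[0.0, 0.0], [1.0, 2.0], [0.5, 0.25]]
y = [row[0] ** 2 + 2 * row[1] for row in X]


def exact(row):
    # Candidate that matches the target formula exactly.
    return row[0] ** 2 + 2 * row[1]


def constant_zero(row):
    # Poor candidate that ignores its inputs.
    return 0.0


print(neg_mse(exact, X, y))
print(neg_mse(constant_zero, X, y))
```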

#### b. Classification

In [5]:
from sklearn.datasets import load_iris

data = load_iris()
X = torch.tensor(data.data, dtype=torch.float, device="cuda")
y = torch.tensor(data.target, dtype=torch.float, device="cuda")
problem = Classification(datapoints=X, labels=y)

#### c. Custom Functions
You can also create problems with custom functions:

In [6]:
def custom_function(x):
    y = (x[0] + x[1]) ** 2
    return y.reshape(-1)

problem = SymbolicRegression(
    func=custom_function,
    num_inputs=2,
    num_data=1000,
    lower_bounds=-5,
    upper_bounds=5
)
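Conceptually, a function-defined problem needs a dataset before any tree can be scored. The sketch below shows one way such a dataset could be produced from the same five arguments: draw `num_data` points uniformly from the bounds and label them with the function. Uniform sampling and the helper name `make_dataset` are assumptions; EvoGP's actual sampling scheme is not described here.

```python
import random


def make_dataset(func, num_inputs, num_data, lower_bounds, upper_bounds, seed=0):
    """Sample inputs uniformly within the bounds and label them with func."""
    rng = random.Random(seed)
    X = [
        [rng.uniform(lower_bounds, upper_bounds) for _ in range(num_inputs)]
        for _ in range(num_data)
    ]
    y = [func(row) for row in X]
    return X, y


def target(row):
    # Same rule as custom_function above, written for a plain Python list.
    return (row[0] + row[1]) ** 2


X, y = make_dataset(target, num_inputs=2, num_data=1000, lower_bounds=-5, upper_bounds=5)
print(len(X), len(y))
print(X[0], y[0])
```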

### Section 3: Customizing Algorithms

EvoGP assembles its algorithm from three interchangeable genetic operators: crossover, mutation, and selection.

#### Default Configuration

In [7]:
algorithm = GeneticProgramming(
    initial_forest=Forest.random_generate(pop_size=1000, descriptor=descriptor),
    crossover=DefaultCrossover(),
    mutation=DefaultMutation(mutation_rate=0.2, descriptor=descriptor),
    selection=DefaultSelection(survival_rate=0.3, elite_rate=0.01)
)

#### Using Variants
- Selection: `RouletteSelection`, `TruncationSelection`, `RankSelection`, `TournamentSelection`
- Crossover: `DiversityCrossover`, `LeafBiasedCrossover`
- Mutation: `HoistMutation`, `SinglePointMutation`, `MultiPointMutation`, `InsertMutation`, `DeleteMutation`, `SingleConstMutation`, `MultiConstMutation`, `CombinedMutation`

Example:

In [8]:
algorithm = GeneticProgramming(
    initial_forest=Forest.random_generate(pop_size=1000, descriptor=descriptor),
    crossover=LeafBiasedCrossover(),
    mutation=CombinedMutation(
        [
            DefaultMutation(mutation_rate=0.2, descriptor=descriptor),
            HoistMutation(mutation_rate=0.2),
            MultiPointMutation(mutation_rate=0.2, descriptor=descriptor),
        ]
    ),
    selection=TournamentSelection(5),
)
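For readers unfamiliar with tournament selection, the generic technique is sketched below in plain Python: repeatedly draw a small random group and keep its fittest member. This is the textbook procedure, not EvoGP's GPU implementation, and it assumes the positional argument in `TournamentSelection(5)` is the tournament size. `tournament_select` is a hypothetical helper.

```python
import random


def tournament_select(fitness, tournament_size, num_winners, rng):
    """Return indices of winners; each wins one randomly drawn tournament."""
    winners = []
    for _ in range(num_winners):
        # Draw distinct contenders, then keep the one with the highest fitness.
        contenders = rng.sample(range(len(fitness)), tournament_size)
        winners.append(max(contenders, key=lambda i: fitness[i]))
    return winners


rng = random.Random(0)
fitness = [-206.9, -12.5, -0.3, -87.1, -45.0, -3.2, -150.4, -9.9]
parents = tournament_select(fitness, tournament_size=5, num_winners=4, rng=rng)
print(parents)
```

Larger tournaments raise selection pressure: when the tournament size equals the population size, the best individual wins every time.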

### Section 4: Running the Pipeline

With the algorithm and problem defined, run them through a pipeline:

In [9]:
pipeline = StandardPipeline(
    algorithm,
    problem,
    generation_limit=50
)

best = pipeline.run()

Generation: 0, Cost time: 304.49ms
 	fitness: valid cnt: 740, max: -206.8605, min: -61244460.0000, mean: -329629.7188, std: 3537817.7500

Generation: 1, Cost time: 8.22ms
 	fitness: valid cnt: 899, max: -121.9854, min: -87567024.0000, mean: -193780.8281, std: 3249077.5000

Generation: 2, Cost time: 14.41ms
 	fitness: valid cnt: 928, max: -118.2328, min: -87687480.0000, mean: -348827.3750, std: 4484792.0000

Generation: 3, Cost time: 11.21ms
 	fitness: valid cnt: 918, max: -67.9589, min: -54049972.0000, mean: -71164.5234, std: 1792744.2500

Generation: 4, Cost time: 11.87ms
 	fitness: valid cnt: 916, max: -20.1768, min: -60463040.0000, mean: -163880.6562, std: 2746633.2500

Generation: 5, Cost time: 21.47ms
 	fitness: valid cnt: 918, max: -20.1768, min: -13894970.0000, mean: -50566.1367, std: 567872.7500

Generation: 6, Cost time: 16.55ms
 	fitness: valid cnt: 920, max: -0.0000, min: -87681184.0000, mean: -218541.3906, std: 3772692.5000

Generation: 7, Cost time: 14.01ms
 	fitness: vali

### Section 5: Inspecting Results

#### Predictions

In [10]:
predictions = best.forward(problem.datapoints[0])
print(predictions)

tensor([17.3489], device='cuda:0')


#### Symbolic Representation

In [11]:
expression = best.to_sympy_expr()
print(expression)

(x0 + x1)**2


#### Visualization

In [12]:
best.to_png("best_tree.png")