# Symbolic Transformer
## About the symbolic transformer
The SymbolicTransformer from gplearn generates new non-linear features automatically. It transforms the original dataset into additional features, which may help improve the performance of a machine learning model.

In other words, the symbolic transformer changes the feature space with the aim of improving the accuracy of the model.

## About the data
This notebook uses the diabetes dataset from scikit-learn (442 samples, 10 features), with Ridge regression as the estimator. After shuffling, the dataset is divided into training and testing sets: the first 300 samples for training and the remaining 142 for testing.

In [2]:
from gplearn.genetic import SymbolicTransformer
from sklearn.utils import check_random_state
from sklearn.datasets import load_diabetes
import numpy as np

# Shuffle with a fixed seed so the train/test split is reproducible
rng = check_random_state(0)
diabetes = load_diabetes()
perm = rng.permutation(diabetes.target.size)
diabetes.data = diabetes.data[perm]
diabetes.target = diabetes.target[perm]
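
The split sizes follow from the shape of the dataset. A quick check, assuming scikit-learn's bundled copy of the diabetes data (the names `n_train` and `n_test` are introduced here only for illustration):

```python
from sklearn.datasets import load_diabetes

# Load a fresh copy so this check does not depend on the shuffled arrays above.
data = load_diabetes()
n_samples, n_features = data.data.shape

n_train = 300                    # rows used for fitting in this notebook
n_test = n_samples - n_train     # every row after the first 300 is held out

print(f"samples={n_samples}, features={n_features}")
print(f"train={n_train}, test={n_test}")
```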

## About the model accuracy

The benchmark is a Ridge regression fitted on the original features of the training set. Its R2 score on the held-out samples is approximately 0.4341.

In [3]:
from sklearn.linear_model import Ridge
est = Ridge()
est.fit(diabetes.data[:300, :], diabetes.target[:300])
print(est.score(diabetes.data[300:, :], diabetes.target[300:]))

0.434057421057894
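
For a regressor such as Ridge, `score` returns the coefficient of determination R2. As a reminder of what that number means, here is a minimal sketch of the formula, 1 - SS_res / SS_tot. The helper name `r2_score_manual` and the toy values are made up for this illustration:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_score_manual(y_true, y_true)                    # predictions match exactly
mean_only = r2_score_manual(y_true, [6.0, 6.0, 6.0, 6.0])    # always predict the mean
print(perfect, mean_only)
```

A perfect prediction gives 1, and always predicting the mean of the targets gives 0, so a score of about 0.43 sits somewhat below the midpoint between those two reference points.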


## About the symbolic transformation
The SymbolicTransformer is trained on the first 300 samples of the dataset. It evolves a population of 2000 individuals over 20 generations. The best 100 individuals of the final generation form the "hall_of_fame", and from these the "least-correlated 10" are kept as new features (n_components=10). So the SymbolicTransformer generates many candidate features, but not all of them are used: only the 10 hall-of-fame individuals that are least correlated with one another are selected.
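
To make the "least-correlated" selection concrete, here is a simplified sketch of one way to do it: repeatedly find the most correlated pair among the candidates and drop the less fit member until the desired number remains. This is my own illustration of the idea, not gplearn's actual code. The function `least_correlated` and the layout of `outputs` (rows sorted from fittest to least fit) are assumptions made for the example.

```python
import numpy as np

def least_correlated(outputs, n_components):
    """Greedy pruning sketch.

    outputs: array of shape (n_programs, n_samples), one row per candidate
             feature, rows sorted from fittest to least fit (assumed layout).
    Returns the indices of the rows that survive.
    """
    outputs = np.asarray(outputs, dtype=float)
    corr = np.abs(np.corrcoef(outputs))   # pairwise |Pearson correlation|
    np.fill_diagonal(corr, 0.0)           # ignore self-correlation
    keep = list(range(len(outputs)))      # row indices still in play
    while len(keep) > n_components:
        sub = corr[np.ix_(keep, keep)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        # Rows are sorted by fitness, so the later position is the less fit one.
        keep.pop(max(i, j))
    return keep

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2.0 * a + 1.0                           # perfectly correlated with a
c = np.array([1.0, -1.0, 1.0, -1.0, 1.0])   # uncorrelated with a and b
selected = least_correlated([a, b, c], n_components=2)
print(selected)
```

In this toy case `b` carries no information beyond `a`, so it is the one removed and rows 0 and 2 remain.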

**Feature Generation:** The function_set for the SymbolicTransformer is specified explicitly, and parameters like the parsimony coefficient, max_samples, and random_state are set. With verbose=1 the fit prints progress for each generation. The cell below does not set n_jobs, so the fit runs with gplearn's default of a single job.

**Transformed Dataset:** The transformed features generated by the SymbolicTransformer are then concatenated with the original data, creating a new dataset called "new_diabetes."

**Model Performance with New Features:** A Ridge regression model is trained on the first 300 samples of the new dataset, and its performance is evaluated on the remaining 142 samples. The R2 score for this model is approximately 0.5337, an improvement over the initial benchmark of about 0.4341.
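
Several entries in the function_set ('div', 'sqrt', 'log', 'inv') are undefined for some inputs, for example division by zero or the logarithm of a negative number. As far as I understand the gplearn documentation, these functions are "protected" so that an evolved program always returns a finite value. The sketch below illustrates the idea with NumPy. The function names, the threshold `eps`, and the fallback values are assumptions chosen for the example, not a copy of the library's implementation.

```python
import numpy as np

def protected_div(x1, x2, eps=1e-3):
    """x1 / x2, or 1.0 where the denominator is close to zero."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(x2) > eps, x1 / x2, 1.0)

def protected_sqrt(x):
    """Square root of the absolute value, so negative inputs are allowed."""
    return np.sqrt(np.abs(x))

def protected_log(x, eps=1e-3):
    """Log of the absolute value, or 0.0 where the input is close to zero."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(x) > eps, np.log(np.abs(x)), 0.0)

def protected_inv(x, eps=1e-3):
    """1 / x, or 0.0 where the input is close to zero."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(x) > eps, 1.0 / x, 0.0)

x = np.array([-4.0, 0.0, 4.0])
print(protected_div(np.ones(3), x))
print(protected_sqrt(x))
print(protected_log(x))
print(protected_inv(x))
```

Every output is finite, even at zero and for negative inputs, which is what lets randomly assembled programs be evaluated on any row of the data.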

In [None]:
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                'abs', 'neg', 'inv', 'max', 'min']
gp = SymbolicTransformer(generations=20, population_size=2000,
                         hall_of_fame=100, n_components=10,
                         function_set=function_set,
                         parsimony_coefficient=0.0005,
                         max_samples=0.9, verbose=1,
                         random_state=0)
gp.fit(diabetes.data[:300, :], diabetes.target[:300])

# Apply the fitted transformer to all rows and append the new features
gp_features = gp.transform(diabetes.data)
new_diabetes = np.hstack((diabetes.data, gp_features))

est = Ridge()
est.fit(new_diabetes[:300, :], diabetes.target[:300])
print(est.score(new_diabetes[300:, :], diabetes.target[300:]))

## Some important things to analyze

**Increasing Variable Space:** Increasing the number of features (variables) in the model can lead to improved accuracy, especially if these new features capture previously unmodeled patterns or relationships in the data. However, there's a trade-off to consider. Increasing the dimensionality of the feature space can also lead to overfitting. Overfitting occurs when a model becomes too complex, capturing noise in the data rather than the underlying patterns. This can result in reduced model generalization and higher variance when the model is applied to new, unseen data. So, while adding more features can improve accuracy, it should be done carefully to avoid overfitting. Selecting the least-correlated variables limits redundancy among the new features, which reduces this risk but does not eliminate it.
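
The trade-off can be seen in a small synthetic experiment. The sketch below is not part of the diabetes example: it uses made-up data and plain least squares instead of Ridge, and all names are hypothetical. It fits the same target twice, once with 3 informative features and once with 40 extra pure-noise features added.

```python
import numpy as np

def fit_and_score(X_train, y_train, X_test, y_test):
    """Ordinary least squares with an intercept; returns (train R2, test R2)."""
    def with_intercept(X):
        return np.hstack([np.ones((X.shape[0], 1)), X])

    coef, *_ = np.linalg.lstsq(with_intercept(X_train), y_train, rcond=None)

    def r2(X, y):
        resid = y - with_intercept(X) @ coef
        return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

    return r2(X_train, y_train), r2(X_test, y_test)

rng = np.random.default_rng(0)
n_train, n_test = 60, 200
n_total = n_train + n_test

X = rng.normal(size=(n_total, 3))                     # informative features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n_total)
noise = rng.normal(size=(n_total, 40))                # features unrelated to y
X_wide = np.hstack([X, noise])

base_train, base_test = fit_and_score(X[:n_train], y[:n_train],
                                      X[n_train:], y[n_train:])
wide_train, wide_test = fit_and_score(X_wide[:n_train], y[:n_train],
                                      X_wide[n_train:], y[n_train:])

print(f"3 features:  train R2={base_train:.3f}  test R2={base_test:.3f}")
print(f"43 features: train R2={wide_train:.3f}  test R2={wide_test:.3f}")
```

With least squares, the training score can only stay the same or go up when columns are added, even if they are pure noise. The test score has no such guarantee and typically drops in a setup like this, which is the gap that signals overfitting.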

**Using the Least Correlated Variables:** The transformer keeps only the 10 least-correlated new variables. This is a strategy to mitigate the risk of multicollinearity. Multicollinearity occurs when two or more independent variables are highly correlated with each other, making it challenging for the model to distinguish the individual effects of these variables. By choosing the least correlated among the generated features, the model aims to incorporate diverse and non-redundant information, reducing the risk of multicollinearity and, potentially, overfitting. It's a way to control the dimensionality of the model while keeping the most valuable new information.

**Hall_of_Fame:** I initially struggled to understand what the hall of fame was. After some reading, I can state that in genetic programming it's a collection of the best-performing individuals (solutions) that have evolved through multiple generations of the algorithm (in this case 20). The hall of fame is typically used to preserve the most successful and promising solutions. In gplearn, hall_of_fame is the number of fittest programs from which the least-correlated n_components are chosen.

## Conclusions

Genetic programming is a type of evolutionary algorithm and a machine learning technique that is used to automatically evolve computer programs to perform a specific task or solve a problem. It is inspired by the process of natural selection and genetic evolution.

Genetic programming can be applied to a wide range of problems, including symbolic regression, symbolic classification, and automatic code generation.

In this example with the diabetes dataset, Ridge regression on the original features plus the features generated by the SymbolicTransformer turned out to be a better regression model than Ridge regression on the original features alone (R2 of about 0.5337 versus 0.4341). This improvement in performance comes from enlarging the feature space with the features automatically selected by the transformer (in this case the 10 least-correlated ones).