# Automated Feature Engineering

This notebook demonstrates the use of AutoFeat for automated feature engineering and selection on a regression dataset. It includes interpretability discussions, feature selection, model training, and evaluation before and after feature engineering.

## Overview
The key steps are a discussion of model interpretability, feature selection, training a baseline regression model, and evaluating the impact of the generated features.

## Procedure
- **Interpretability and Model Explainability**: Discuss the necessity of model interpretability.
- **Feature Selection**: Use `FeatureSelector` on the Diabetes dataset to determine the number of discarded features.
- **Feature Engineering with AutoFeat**: Apply `AutoFeatRegressor` to generate new features and assess the relative change in R2 score.

In [None]:
%pip install autofeat


In [16]:
import autofeat
import warnings
warnings.filterwarnings("ignore")

# 1.2
2. Perform feature selection for the Diabetes regression dataset using FeatureSelector(). How many
features are discarded? (4)

In [17]:
from autofeat import FeatureSelector
from sklearn import datasets


diabetes = datasets.load_diabetes()
display(diabetes.keys())
X = diabetes.data
y = diabetes.target
features = diabetes.feature_names
display(X.shape, y.shape, features)

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])

(442, 10)

(442,)

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [18]:
selector = FeatureSelector(verbose=2)

X_new = selector.fit_transform(X, y)

2024-04-26 07:59:46,971 INFO: [featsel] Feature selection run 1/5
2024-04-26 07:59:46,992 INFO: [featsel]	 5 initial features.
2024-04-26 07:59:47,009 INFO: 
[featsel]	 Selected   7 features after noise filtering.
2024-04-26 07:59:47,010 INFO: [featsel] Feature selection run 2/5
2024-04-26 07:59:47,027 INFO: [featsel]	 5 initial features.
2024-04-26 07:59:47,042 INFO: 
[featsel]	 Selected   5 features after noise filtering.
2024-04-26 07:59:47,043 INFO: [featsel] Feature selection run 3/5
2024-04-26 07:59:47,058 INFO: [featsel]	 6 initial features.
2024-04-26 07:59:47,073 INFO: 
[featsel]	 Selected   6 features after noise filtering.
2024-04-26 07:59:47,073 INFO: [featsel] Feature selection run 4/5
2024-04-26 07:59:47,088 INFO: [featsel]	 5 initial features.
2024-04-26 07:59:47,105 INFO: 
[featsel]	 Selected   5 features after noise filtering.
2024-04-26 07:59:47,105 INFO: [featsel] Feature selection run 5/5
2024-04-26 07:59:47,121 INFO: [featsel]	 6 initial features.
2024-04-26 07:59:

[featsel] Scaling data...done.
[featsel]	 Split  1/1:   6 candidate features identified.

Features Discarded

In [22]:
X_new.shape, X.shape

((442, 5), (442, 10))

In [24]:
discarded_features = X.shape[1] - X_new.shape[1]

print("Number of discarded features:", discarded_features)

Number of discarded features: 5


Five of the ten original features were discarded.
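The shape comparison gives only a count. To see which columns were kept, one option is to pass a pandas DataFrame so the column names survive selection. The sketch below assumes that the fitted selector exposes the selected names as `good_cols_`; `split_selected` and `report_selection` are hypothetical helper names introduced here, not part of autofeat.

```python
def split_selected(all_names, kept_names):
    """Return (kept, discarded) feature names in their original column order."""
    kept_set = set(kept_names)
    kept = [name for name in all_names if name in kept_set]
    discarded = [name for name in all_names if name not in kept_set]
    return kept, discarded


def report_selection(X_df, y):
    """Fit FeatureSelector on a DataFrame and name the kept and discarded columns."""
    # Imported lazily so split_selected can be used without autofeat installed.
    from autofeat import FeatureSelector

    selector = FeatureSelector(verbose=0)
    selector.fit(X_df, y)
    # Assumption: good_cols_ holds the names of the selected columns after fitting.
    return split_selected(list(X_df.columns), selector.good_cols_)
```

In the notebook this could be called as `report_selection(pd.DataFrame(X, columns=features), y)`. The selection runs shown in the log above select different numbers of features from run to run, so the exact subset may differ between executions.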

# 1.3
3. Perform a train-test split on your dataset. Select a regression model from scikit-learn and fit it to the
training dataset. What is the R2 score on the training and test set? (4)

In [31]:
%precision 4
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples; the fixed seed keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X_new,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

# Fit the baseline linear regression on the selected features.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# model.score() returns R2, so the train and test values are the same metric.
mse = mean_squared_error(y_test, y_pred)
r2_train = model.score(X_train, y_train)
r2_test = r2_score(y_test, y_pred)

print(f"R2-train: {r2_train:.4f}")
print(f"R2-test: {r2_test:.4f}")

R2-train: 0.5147
R2-test: 0.4694


Fitting scikit-learn's `LinearRegression` to the five selected features yields the following metrics:
- R2-train: 0.5147
- R2-test: 0.4694
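For reference, the R2 value returned by both `model.score` and `r2_score` is the coefficient of determination, 1 - SS_res / SS_tot. The following is a minimal pure-Python sketch of that definition; `r2_by_hand` and the toy numbers are illustrative and are not taken from the notebook.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean of y_true.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data for illustration only.
toy_targets = [1.0, 2.0, 3.0]
perfect = r2_by_hand(toy_targets, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r2_by_hand(toy_targets, [2.0, 2.0, 2.0])  # always predict the mean
print(perfect, mean_only)
```

Read this way, a test R2 of 0.4694 means the baseline model removes a little under half of the squared error that a predict-the-mean model would make on the held-out samples.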

# 1.4
4. Keeping the train and test dataset the same, run 3 feature engineering steps using AutoFeatRegressor().
What is the R2 score on the training and test set now? Mention any five new features generated by
the output of AutoFeatRegressor(). (5)

In [33]:
from autofeat import AutoFeatRegressor

model = AutoFeatRegressor(verbose=2, feateng_steps=3)
model.fit_transform(X_train, y_train)


2024-04-26 08:11:06,228 INFO: [AutoFeat] The 3 step feature engineering process could generate up to 14910 features.
2024-04-26 08:11:06,229 INFO: [AutoFeat] With 353 data points this new feature matrix would use about 0.02 gb of space.
2024-04-26 08:11:06,231 INFO: [feateng] Step 1: transformation of original features


[feateng]               0/              5 features transformed

2024-04-26 08:11:06,978 INFO: [feateng] Generated 20 transformed features from 5 original features - done.
2024-04-26 08:11:06,979 INFO: [feateng] Step 2: first combination of features


[feateng]             200/            300 feature tuples combined

2024-04-26 08:11:07,410 INFO: [feateng] Generated 1163 feature combinations from 300 original feature tuples - done.
2024-04-26 08:11:07,412 INFO: [feateng] Step 3: transformation of new features


[feateng]            1000/           1163 features transformed

2024-04-26 08:11:10,050 INFO: [feateng] Generated 4913 transformed features from 1163 original features - done.
2024-04-26 08:11:10,058 INFO: [feateng] Generated altogether 6508 new features in 3 steps
2024-04-26 08:11:10,059 INFO: [feateng] Removing correlated features, as well as additions at the highest level
2024-04-26 08:11:10,099 INFO: [feateng] Generated a total of 2949 additional features
2024-04-26 08:11:10,110 INFO: [featsel] Feature selection run 1/5


[featsel] Scaling data...done.


2024-04-26 08:11:11,447 INFO: [featsel]	 6 initial features.


[featsel]	 Split 17/19:  54 candidate features identified.

2024-04-26 08:11:14,492 INFO: 
[featsel]	 Selected   5 features after noise filtering.
2024-04-26 08:11:14,492 INFO: [featsel] Feature selection run 2/5


[featsel]	 Split 19/19:  63 candidate features identified.

2024-04-26 08:11:15,483 INFO: [featsel]	 17 initial features.


[featsel]	 Split 20/21:  23 candidate features identified.

2024-04-26 08:11:18,944 INFO: 
[featsel]	 Selected   6 features after noise filtering.
2024-04-26 08:11:18,945 INFO: [featsel] Feature selection run 3/5


[featsel]	 Split 21/21:  23 candidate features identified.

2024-04-26 08:11:20,865 INFO: [featsel]	 7 initial features.


[featsel]	 Split 18/19:  27 candidate features identified.

2024-04-26 08:11:23,723 INFO: 
[featsel]	 Selected   6 features after noise filtering.
2024-04-26 08:11:23,723 INFO: [featsel] Feature selection run 4/5


[featsel]	 Split 19/19:  28 candidate features identified.

2024-04-26 08:11:25,405 INFO: [featsel]	 11 initial features.


[featsel]	 Split 16/20:  29 candidate features identified.

2024-04-26 08:11:26,841 INFO: 
[featsel]	 Selected   9 features after noise filtering.
2024-04-26 08:11:26,841 INFO: [featsel] Feature selection run 5/5


[featsel]	 Split 20/20:  37 candidate features identified.

2024-04-26 08:11:28,156 INFO: [featsel]	 5 initial features.


[featsel]	 Split 17/19:  18 candidate features identified.

2024-04-26 08:11:31,191 INFO: 
[featsel]	 Selected   8 features after noise filtering.
2024-04-26 08:11:31,192 INFO: [featsel] 13 features after 5 feature selection runs
2024-04-26 08:11:31,193 INFO: [featsel] 12 features after correlation filtering
2024-04-26 08:11:31,205 INFO: [featsel] 11 features after noise filtering
2024-04-26 08:11:31,206 INFO: [AutoFeat] Computing 11 new features.


[AutoFeat]     7/   11 new features

2024-04-26 08:11:31,668 INFO: [AutoFeat]    11/   11 new features ...done.
2024-04-26 08:11:31,670 INFO: [AutoFeat] Final dataframe with 16 feature columns (11 new).
2024-04-26 08:11:31,671 INFO: [AutoFeat] Training final regression model.
2024-04-26 08:11:31,680 INFO: [AutoFeat] Trained model: largest coefficients:
2024-04-26 08:11:31,680 INFO: -147.60842794954348
2024-04-26 08:11:31,680 INFO: -4413.619439 * exp(x003)*Abs(x001)
2024-04-26 08:11:31,681 INFO: 348.155352 * exp(x002)*exp(x004)
2024-04-26 08:11:31,681 INFO: 153.254202 * exp(x000)*exp(x004)
2024-04-26 08:11:31,681 INFO: 118.141620 * Abs(x002 + x004)
2024-04-26 08:11:31,682 INFO: -25.535978 * 1/(1/x004 + 1/x002)
2024-04-26 08:11:31,682 INFO: 17.577041 * Abs(x000)/x000
2024-04-26 08:11:31,682 INFO: 0.630937 * x000/Abs(x004)
2024-04-26 08:11:31,682 INFO: -0.584893 * x002/x004
2024-04-26 08:11:31,683 INFO: -0.239618 * x003/Abs(x004)
2024-04-26 08:11:31,683 INFO: -0.029795 * 1/(x000 - x002**2)
2024-04-26 08:11:31,683 INFO: 0.006

[AutoFeat]    10/   11 new features

|  | x000 | x001 | x002 | x003 | x004 | `Abs(x000)/x000` | `exp(x003)*Abs(x001)` | `exp(x002)*exp(x004)` | `exp(x000)*exp(x004)` | `x003/Abs(x004)` | `1/(1/x004 + 1/x002)` | `1/(x000 - x002**2)` | `x002/x004` | `x000/Abs(x004)` | `Abs(x002 + x004)` | `1/(x003**2 - Abs(x004))` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.027364 | 0.050680 | 0.056301 | -0.039719 | 0.012117 | 1.0 | 0.048707 | 1.070813 | 1.040271 | -3.278014 | 0.009971 | 41.332121 | 4.646496 | 2.258347 | 0.068418 | -94.883541 |
| 1 | 0.000272 | 0.050680 | -0.033213 | -0.072854 | -0.018062 | 1.0 | 0.047119 | 0.950017 | 0.982368 | -4.033573 | -0.011700 | -1203.890244 | 1.838857 | 0.015086 | 0.051275 | -78.405611 |
| 2 | 0.017036 | -0.044642 | 0.097615 | -0.006584 | 0.049840 | 1.0 | 0.044349 | 1.158882 | 1.069163 | -0.132111 | 0.032994 | 133.202577 | 1.958559 | 0.341813 | 0.147455 | -20.081564 |
| 3 | -0.049872 | -0.044642 | -0.029770 | 0.030232 | -0.035307 | -1.0 | 0.046012 | 0.936995 | 0.918348 | 0.856261 | -0.016152 | -19.701046 | 0.843189 | -1.412542 | 0.065077 | -29.075759 |
| 4 | -0.037129 | -0.044642 | -0.081413 | 0.059685 | -0.065486 | -1.0 | 0.047387 | 0.863381 | 0.902475 | 0.911422 | -0.036293 | -22.853518 | 1.243222 | -0.566977 | 0.146899 | -16.149006 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 348 | -0.059471 | -0.044642 | -0.043542 | 0.008142 | -0.076264 | -1.0 | 0.045007 | 0.887093 | 0.873074 | 0.106762 | -0.027717 | -16.295385 | 0.570937 | -0.779809 | 0.119806 | -13.123800 |
| 349 | 0.008641 | 0.050680 | 0.083844 | 0.015505 | 0.030440 | 1.0 | 0.051472 | 1.121070 | 1.039854 | 0.509380 | 0.022332 | 620.796124 | 2.754423 | 0.283860 | 0.114283 | -33.113416 |
| 350 | -0.010903 | -0.044642 | -0.005670 | 0.078093 | -0.020218 | -1.0 | 0.048268 | 0.974444 | 0.969359 | 3.862652 | -0.004428 | -91.446093 | 0.280471 | -0.539297 | 0.025888 | -70.826732 |
| 351 | -0.038460 | -0.044642 | -0.040099 | -0.017629 | -0.023451 | -1.0 | 0.043862 | 0.938427 | 0.939967 | -0.751756 | -0.014797 | -24.957795 | 1.709907 | -1.640007 | 0.063550 | -43.214927 |
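Each generated column name is itself a formula over the input columns `x000` to `x004`, so the values can be checked by hand. As a small sketch, the snippet below recomputes a few engineered features for row 0 of the table above using only the standard library; the `row` and `formulas` dictionaries are illustrative names, and the inputs are the rounded values as printed.

```python
import math

# Row 0 of the transformed training frame, copied from the table above.
row = {"x000": 0.027364, "x001": 0.050680, "x002": 0.056301,
       "x003": -0.039719, "x004": 0.012117}

# Column name -> the same formula written out in plain Python.
formulas = {
    "Abs(x000)/x000": lambda r: abs(r["x000"]) / r["x000"],
    "exp(x003)*Abs(x001)": lambda r: math.exp(r["x003"]) * abs(r["x001"]),
    "exp(x002)*exp(x004)": lambda r: math.exp(r["x002"]) * math.exp(r["x004"]),
    "1/(1/x004 + 1/x002)": lambda r: 1 / (1 / r["x004"] + 1 / r["x002"]),
    "Abs(x002 + x004)": lambda r: abs(r["x002"] + r["x004"]),
}

recomputed = {name: fn(row) for name, fn in formulas.items()}
for name, value in recomputed.items():
    print(f"{name:>20} = {value:.6f}")
```

The recomputed values should agree with the row 0 entries to within the rounding of the printed inputs. Reading the formulas this way also helps interpretation: `Abs(x000)/x000` is simply the sign of `x000`, which is why that column only takes the values 1.0 and -1.0.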


In [58]:
# Evaluate the fitted AutoFeat model on the same split as the baseline.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2_train = model.score(X_train, y_train)
r2_test = r2_score(y_test, y_pred)

print("\n\nAutoFeat Regressor Results:")
print(f"- R2-train: {r2_train:.4f}")
print(f"- R2-test: {r2_test:.4f}\n")

# List the first five engineered feature formulas.
for i, feat in enumerate(model.new_feat_cols_[:5]):
    print(f"{i + 1}. Autofeat Feature {i}: {feat}")

2024-04-26 08:22:31,951 INFO: [AutoFeat] Computing 11 new features.
2024-04-26 08:22:31,958 INFO: [AutoFeat]    11/   11 new features ...done.
2024-04-26 08:22:31,961 INFO: [AutoFeat] Computing 11 new features.
2024-04-26 08:22:31,967 INFO: [AutoFeat]    11/   11 new features ...done.


[AutoFeat]    10/   11 new features

AutoFeat Regressor Results:
- R2-train: 0.5777
- R2-test: 0.5190

1. Autofeat Feature 0: Abs(x000)/x000
2. Autofeat Feature 1: exp(x003)*Abs(x001)
3. Autofeat Feature 2: exp(x002)*exp(x004)
4. Autofeat Feature 3: exp(x000)*exp(x004)
5. Autofeat Feature 4: x003/Abs(x004)




After three feature engineering steps, `AutoFeatRegressor` yields the following metrics on the same train/test split:
- R2-train: 0.5777
- R2-test: 0.5190

Five AutoFeat-generated features:
1. `Abs(x000)/x000`
2. `exp(x003)*Abs(x001)`
3. `exp(x002)*exp(x004)`
4. `exp(x000)*exp(x004)`
5. `x003/Abs(x004)`
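To quantify the effect of the engineered features, the relative change in R2 can be computed from the two sets of scores reported in this notebook. This is a small sketch; `relative_change` is a helper name introduced here for illustration.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after`."""
    return (after - before) / before


# R2 scores reported above: linear regression baseline vs. AutoFeatRegressor.
baseline = {"train": 0.5147, "test": 0.4694}
with_autofeat = {"train": 0.5777, "test": 0.5190}

changes = {split: relative_change(baseline[split], with_autofeat[split])
           for split in ("train", "test")}
for split, change in changes.items():
    print(f"R2-{split}: {baseline[split]:.4f} -> {with_autofeat[split]:.4f} ({change:+.1%})")
# R2-train: 0.5147 -> 0.5777 (+12.2%)
# R2-test: 0.4694 -> 0.5190 (+10.6%)
```

Both scores improve by a similar relative amount, and the gap between train and test R2 stays roughly the same size (about 0.05 before and about 0.06 after).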