# Problem 2 - Automated Feature Engineering

Sources:
*   https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes
*   https://github.com/cod3licious/autofeat



## 2.1

**Answer:**
When training ML models for real-world problems, output interpretability is crucial because it allows users to understand how the model makes its decisions and to check whether the logic is valid or a mistake is being made. In fields like finance or medicine in particular, a model's predictions cannot simply be taken at face value: stakeholders need to understand how they were made to be sure that they are ethical and that the logic behind them is not incorrect or unacceptable. Further, without interpretability, companies cannot explain why their models made particular decisions when asked, which can lead to lawsuits if a model turns out to be discriminatory towards certain populations (a recurring problem in practice). In addition, models are often used as a tool to help humans make informed, data-driven decisions rather than to make the decisions themselves. In that situation, the person making the decision wants to understand the decision-making process, not just see the result. Finally, interpretability is crucial for understanding when the model is making mistakes, what kind of mistake or bias is causing them, and ultimately what to fix so that the model improves and generalizes better.


## 2.2

In [None]:
!pip install autofeat

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from autofeat import FeatureSelector, AutoFeatRegressor
from sklearn import datasets
from sklearn.linear_model import LinearRegression

In [None]:
# Load the diabetes dataset and get the features and target
X, y = datasets.load_diabetes(return_X_y=True)

In [None]:
# Feature Selection
fs = FeatureSelector(verbose=1)
X_selected = fs.fit_transform(pd.DataFrame(X), pd.Series(y))

# Check how many features were discarded
print("Original feature count:", X.shape[1])
print("Selected feature count:", X_selected.shape[1])
discarded_features = X.shape[1] - X_selected.shape[1]  # number of features removed
print("Features discarded:", discarded_features)

[featsel] Scaling data...done.
Original feature count: 10
Selected feature count: 6
Features discarded: 4


**Answer:**
AutoFeat's FeatureSelector discards the features in the dataset that it deems poor predictors of the outcome variable. In our case, it kept 6 of the 10 original features and discarded the 4 that were less valuable for predicting the outcome y.
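
The count above does not say which features were dropped. A minimal sketch of how the discarded columns could be identified by comparing column labels follows. The `selected` list is made up for illustration and is not the actual output of `FeatureSelector`; in the notebook the labels would presumably come from `pd.DataFrame(X).columns` and `X_selected.columns`, and their types (integer vs. string) would need to match for the comparison to work.

```python
def discarded_columns(original, selected):
    """Return the labels in `original` that are missing from `selected`, in original order."""
    kept = set(selected)
    return [col for col in original if col not in kept]


# Hypothetical labels: 10 original columns, 6 of which survive selection.
original = list(range(10))
selected = [2, 3, 4, 6, 8, 9]  # illustrative only, not the real selection

dropped = discarded_columns(original, selected)
print("Discarded columns:", dropped)
```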


## 2.3

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print("R2 score on the training set:", r2_train)
print("R2 score on the test set:", r2_test)

R2 score on the training set: 0.5279193863361498
R2 score on the test set: 0.45260276297191937


**Answer:**

$R^2$ tells us how much of the variability in our outcome variable (y) is explained by the independent variables (x).

From the $R^2$ values above, we can immediately see that the training score is noticeably higher than the test score: about 0.53 on the training set versus about 0.45 on the test set. We would always expect the training score to be slightly higher (except in the theoretical case of zero overfitting), but a gap of this size could indicate overfitting. That would mean the model is too complex for the data and does not generalize well to new data.
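
To make the definition concrete, here is a small sketch that computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand, which is the quantity `LinearRegression.score` reports. The function name `r_squared` and the toy values are illustrative assumptions, and the code uses plain Python so that it runs without any dependencies.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets, made up for illustration (mean is 6.0).
y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                 # predictions match exactly
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
reversed_fit = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than the mean

print(perfect, baseline, reversed_fit)
```

A perfect fit scores 1, always predicting the mean scores 0, and a model that does worse than the mean gets a negative score.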


## 2.4

In [None]:
# Feature engineering with AutoFeatRegressor
afreg = AutoFeatRegressor(verbose=1, feateng_steps=3)
X_train_feat = afreg.fit_transform(X_train, y_train)
X_test_feat = afreg.transform(X_test)

# Fit the model again
model.fit(X_train_feat, y_train)



  x = um.multiply(x, x, out=x)
  ret = umr_sum(x, axis, dtype, out, keepdims=keepdims, where=where)


[featsel] Scaling data...done.


In [None]:
# Evaluate the model
r2_train_feat = model.score(X_train_feat, y_train)
r2_test_feat = model.score(X_test_feat, y_test)
print("R2 score on the training set with AutoFeat:", r2_train_feat)
print("R2 score on the test set with AutoFeat:", r2_test_feat)

R2 score on the training set with AutoFeat: 0.6325539093905881
R2 score on the test set with AutoFeat: 0.5191672981051741


In [None]:
# Convert X_train and X_train_feat to Pandas DataFrames
X_train_df = pd.DataFrame(X_train)
X_train_feat_df = pd.DataFrame(X_train_feat)

# Print new features
new_features = list(X_train_feat_df.columns[10:])
print("Five new features generated:", new_features[:5])

Five new features generated: ['Abs(x008)/x008', 'exp(x006)*Abs(x001)', 'exp(x002)*exp(x008)', 'exp(x002)*exp(x003)', '1/(x002**3 + x005**3)']


**Answer:**

From the AutoFeatRegressor output above, we can see that both of our $R^2$ scores have increased, which indicates that feature engineering has led to a better model fit. Our $R^2$ on the training data is now 0.63, well above the previous 0.53, and our $R^2$ on the test data has increased from 0.45 to 0.52. While this does indicate that our model performs better, the overfitting issue has not been solved but has grown: the gap between training and test scores is now about 0.11, up from about 0.08. This is likely because the added features made the model more complex, so it fits the training data more closely than it generalizes to new, unseen data.
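
The train/test gap can be checked directly from the scores printed in 2.3 and 2.4. This is only arithmetic on the reported values; the helper name `generalization_gap` is an illustrative choice.

```python
# R2 scores copied from the notebook output above.
r2_before = {"train": 0.5279193863361498, "test": 0.45260276297191937}
r2_after = {"train": 0.6325539093905881, "test": 0.5191672981051741}


def generalization_gap(scores):
    """Difference between training and test R2; larger values suggest more overfitting."""
    return scores["train"] - scores["test"]


gap_before = generalization_gap(r2_before)
gap_after = generalization_gap(r2_after)

print(f"Gap before AutoFeat: {gap_before:.3f}")
print(f"Gap after AutoFeat:  {gap_after:.3f}")
```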

Above we can see five examples of the newly generated features. Each is a transformation of one or more of the original features. Such features can capture non-linear relationships that the original features miss, but they can also cause overfitting by adapting to noise in the data as well as to the general trends.

Here are 5 newly generated features:
$\frac{\mathrm{Abs}(x008)}{x008},\ \exp(x006) \cdot \mathrm{Abs}(x001),\ \exp(x002) \cdot \exp(x008),\ \exp(x002) \cdot \exp(x003),\ \frac{1}{x002^3 + x005^3}$
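
To get a feel for what these expressions compute, the sketch below evaluates three of them on made-up input values (the numbers are hypothetical, chosen to be on the small scale of the standardized diabetes features). It shows that `Abs(x008)/x008` is just the sign of `x008`, that a product of exponentials equals the exponential of the sum, and that `1/(x002**3 + x005**3)` can become very large when the two cubes nearly cancel. That last behavior may be related to the NumPy warnings printed during fitting, though that is a guess.

```python
import math


def sign_feature(x008):
    """Abs(x008)/x008: +1 for positive input, -1 for negative input (undefined at 0)."""
    return abs(x008) / x008


def exp_product(a, b):
    """exp(a)*exp(b), which is mathematically the same as exp(a + b)."""
    return math.exp(a) * math.exp(b)


def inverse_cubic_sum(x002, x005):
    """1/(x002**3 + x005**3): grows without bound as the cubes approach cancelling."""
    return 1.0 / (x002 ** 3 + x005 ** 3)


# Hypothetical feature values, for illustration only.
print(sign_feature(0.03), sign_feature(-0.02))
print(exp_product(0.1, 0.2), math.exp(0.3))
print(inverse_cubic_sum(0.05, -0.049))
```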
