# Feature Engineering for Better Models

This notebook covers feature engineering, the process of creating and selecting the most useful features for your regression models.

**Objectives:**
- Understand the importance of feature engineering
- Create new features from the supermarket sales dataset
- Use interactive tools to test feature impact

**Note**: [./01_regression.ipynb](./01_regression.ipynb) introduced the term _"variable"_; this notebook uses _"feature"_. The two terms mean the same thing.

In [1]:
# Import libraries and load data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import ipywidgets as widgets

In [2]:
# Load dataset
file_path = '../../data/SuperMarketAnalysis.csv'
df = pd.read_csv(file_path)

## What is Feature Engineering?

Feature engineering means creating new input features, or transforming existing ones, to improve model performance. In our use case, we can add information by deriving features such as whether a given date falls on a weekend. Good features can make simple models powerful.

Let's create some new features from our dataset.

In [3]:
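# Example: Create new features based on existing data
# dayofweek counts from Monday=0, so 5 and 6 are Saturday and Sunday
df['Is_weekend'] = pd.to_datetime(df['Date']).dt.dayofweek >= 5
# Map each payment method to an integer code (categories are numbered in sorted order)
df['Payment_encoded'] = df['Payment'].astype('category').cat.codes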

# Prepare data
features = ['Unit price', 'Is_weekend', 'Payment_encoded']
X = df[features]
y = df['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'MSE: {mse:.2f}, R^2: {r2:.2f}')

MSE: 35497.57, R^2: 0.45
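
The `cat.codes` encoding used above turns each payment method into an arbitrary integer, which a linear model reads as an ordered quantity. A common alternative is one-hot encoding. The following is a minimal sketch on a small made-up frame (the payment labels here are assumptions for illustration, not values read from the dataset):

```python
import pandas as pd

# Hypothetical toy data; the payment labels are made up for illustration.
toy = pd.DataFrame({
    "Payment": ["Cash", "Credit card", "Ewallet", "Cash"],
})

# Integer codes: categories are sorted, then numbered 0, 1, 2, ...
codes = toy["Payment"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, with no implied ordering.
# drop_first=True drops one column so the remaining ones are not perfectly collinear.
one_hot = pd.get_dummies(toy["Payment"], prefix="Payment", drop_first=True, dtype=int)

print(codes.tolist())
print(one_hot)
```

Whether one-hot columns actually improve the regression on this dataset is something to check by rerunning the model, not something this sketch shows.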


Adding features does not always improve the MSE. When creating features, consider the predictive value each one adds and the type of model you are using. Many similar features, or features that carry little predictive signal for the target variable, can hurt performance.
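
One rough way to spot redundant or weak features is to look at pairwise correlations. The sketch below uses synthetic data generated on the spot (the column names `Unit price (cents)` and `Noise` are made up for illustration) and is only a first check, since correlation captures linear relationships only:

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; nothing here comes from the dataset.
rng = np.random.default_rng(0)
n = 200
unit_price = rng.uniform(10, 100, size=n)

toy = pd.DataFrame({
    "Unit price": unit_price,
    "Unit price (cents)": unit_price * 100,  # redundant: same information, rescaled
    "Noise": rng.normal(size=n),             # unrelated to the target
})
toy["Sales"] = 5 * toy["Unit price"] + rng.normal(scale=20, size=n)

corr = toy.corr()

# Correlation of each feature with the target: values near 0 suggest weak linear signal.
print(corr["Sales"].drop("Sales").round(2))

# Correlation between two features: values near 1 or -1 suggest redundancy.
print(round(corr.loc["Unit price", "Unit price (cents)"], 3))
```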

In [4]:
# Interactive: Select features to include
feature_options = features
@widgets.interact(selected=widgets.SelectMultiple(options=feature_options, value=tuple(feature_options), description='Features:'))
def update_features(selected):
    cols = list(selected)
    if not cols:
        # LinearRegression cannot be fit with zero features
        print('Select at least one feature.')
        return
    # Fit a fresh model so the `model` from the previous cell is left unchanged
    model_sel = LinearRegression()
    model_sel.fit(X_train[cols], y_train)
    y_pred_sel = model_sel.predict(X_test[cols])
    mse_sel = mean_squared_error(y_test, y_pred_sel)
    r2_sel = r2_score(y_test, y_pred_sel)
    print(f'MSE: {mse_sel:.2f}, R^2: {r2_sel:.2f}')

interactive(children=(SelectMultiple(description='Features:', index=(0, 1, 2), options=('Unit price', 'Is_week…

As you may have noticed, feature engineering often relies on domain knowledge. In more complex fields, useful features may only emerge from years of experience working in that field.

Question: In your opinion, what other features would be interesting to include?
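
As one possible starting point, date and time columns often yield several candidate features. The sketch below runs on a small made-up frame; the `Time` column and the date format are assumptions, so check them against the actual dataset before reusing the code:

```python
import pandas as pd

# Hypothetical rows; the 'Time' column and the m/d/YYYY format are assumptions.
toy = pd.DataFrame({
    "Date": ["1/5/2019", "3/8/2019", "2/17/2019"],
    "Time": ["13:08", "10:29", "19:45"],
})

dates = pd.to_datetime(toy["Date"], format="%m/%d/%Y")
toy["Month"] = dates.dt.month
toy["Day_of_week"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
toy["Hour"] = pd.to_datetime(toy["Time"], format="%H:%M").dt.hour

print(toy)
```

Each candidate can then be added to the `features` list and compared with the interactive cell above.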

## Conclusion

- Feature engineering can significantly improve model performance
- Use domain knowledge to create meaningful features. Or use GitHub Copilot to suggest new features and implement them in a notebook! (HVE 🚀)