# Feature Engineering with LLMs: A Tutorial

This notebook demonstrates how to use `skfeaturellm` to automatically generate meaningful features for your machine learning tasks using Large Language Models (LLMs).

## Overview

1. We'll use the California Housing dataset as an example
2. First, we'll train a baseline model without engineered features
3. Then, we'll use `skfeaturellm` to generate new features
4. Finally, we'll compare the model performance with and without the engineered features

## Setup and Data Loading


In [47]:
import os 

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

import pandas as pd

from skfeaturellm import LLMFeatureEngineer

## Dataset Description

The California Housing dataset contains information about housing blocks in California. Here are the features available in the dataset:


In [15]:
housing = fetch_california_housing()

df_housing = pd.DataFrame(
    data=housing.data,
    columns=housing.feature_names
)

df_housing['target'] = housing.target

df_housing

index,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


## Baseline Model

Let's first create a baseline model without any engineered features to establish a performance benchmark.


In [44]:
X = df_housing.drop('target', axis=1)
y = df_housing['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline_model = RandomForestRegressor(n_estimators=100, random_state=42)
baseline_model.fit(X_train, y_train)

baseline_preds = baseline_model.predict(X_test)

baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_preds))
baseline_r2 = r2_score(y_test, baseline_preds)

print("Baseline Model Performance:")
print(f"RMSE: {baseline_rmse:.4f}")

Baseline Model Performance:
RMSE: 0.4879
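
The cell above computes `baseline_r2` but prints only the RMSE. As a reference for what the two metrics measure, here is a minimal NumPy sketch of both formulas on made-up toy values. The numbers and the helper name `rmse_and_r2` are illustrative and not part of the notebook.

```python
import numpy as np


def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R^2) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    # RMSE: square root of the mean squared residual, in the target's units.
    rmse = np.sqrt(np.mean(residuals ** 2))

    # R^2: one minus the residual sum of squares over the total sum of squares.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2


# Toy values only; they do not come from the housing data.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]
toy_rmse, toy_r2 = rmse_and_r2(toy_true, toy_pred)
print(f"RMSE: {toy_rmse:.4f}, R^2: {toy_r2:.4f}")
```

Because the target is expressed in units of $100,000, the baseline RMSE of 0.4879 corresponds to a typical error of roughly $48,800.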


## LLM-based Feature Engineering

Now let's use `skfeaturellm` to automatically generate meaningful features. The library will:
1. Analyze the existing features and their descriptions
2. Understand the prediction target
3. Generate new features that could improve model performance

First, let's instantiate the `LLMFeatureEngineer` class:


In [36]:
llm_feng = LLMFeatureEngineer(
    model_name='gpt-4o',
    problem_type='regression',
    target_col='target',
    feature_prefix='feng_'
)

In [37]:
feature_descriptions = [
    {'name': 'MedInc', 'description': 'median income in block group', 'type': 'float'},
    {'name': 'HouseAge', 'description': 'median house age in block group', 'type': 'float'},
    {'name': 'AveRooms', 'description': 'average number of rooms per household', 'type': 'float'},
    {'name': 'AveBedrms', 'description': 'average number of bedrooms per household', 'type': 'float'},
    {'name': 'Population', 'description': 'block group population', 'type': 'float'},
    {'name': 'AveOccup', 'description': 'average number of household members', 'type': 'float'},
    {'name': 'Latitude', 'description': 'block group latitude', 'type': 'float'},
    {'name': 'Longitude', 'description': 'block group longitude', 'type': 'float'}
]

target_description = 'The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).'
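
Since the descriptions are typed by hand, a misspelled `name` can be easy to miss. A minimal sketch of a pre-flight check follows. The helper `check_descriptions` is hypothetical (it is not part of `skfeaturellm`), and the toy DataFrame and the shortened description list are for illustration only.

```python
import pandas as pd


def check_descriptions(df, feature_descriptions, target_col):
    """Return (missing, undescribed) lists of column names."""
    described = [d["name"] for d in feature_descriptions]
    # Names that are described but absent from the DataFrame.
    missing = [name for name in described if name not in df.columns]
    # Non-target columns that have no description.
    undescribed = [
        col for col in df.columns
        if col != target_col and col not in described
    ]
    return missing, undescribed


# Toy inputs: one described column is absent, one present column is undescribed.
toy_df = pd.DataFrame({"MedInc": [8.3252], "HouseAge": [41.0], "target": [4.526]})
toy_descriptions = [
    {"name": "MedInc", "description": "median income in block group", "type": "float"},
    {"name": "AveRooms", "description": "average number of rooms per household", "type": "float"},
]

missing, undescribed = check_descriptions(toy_df, toy_descriptions, target_col="target")
print("described but missing:", missing)
print("present but undescribed:", undescribed)
```

In the notebook, the same call would take `df_housing`, `feature_descriptions`, and `'target'`.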

In [38]:
llm_feng.fit(
    X=df_housing,
    feature_descriptions=feature_descriptions,
    target_description=target_description
)

In [39]:
llm_feng.generated_features_ideas

[FeatureEngineeringIdea(name='rooms_per_bedroom_ratio', description='This feature represents the ratio of average rooms to average bedrooms per household, which can provide insight into the types of housing structures in the block group. A high ratio could imply larger, more spacious homes, while a low ratio could indicate smaller, compact living spaces.', formula='AveRooms/AveBedrms'),
 FeatureEngineeringIdea(name='population_density', description='This feature calculates the population density in the block group by dividing the population by the average number of household members. This gives an indication of how crowded or spacious an area might be, which could be related to housing value.', formula='Population/AveOccup'),
 FeatureEngineeringIdea(name='longitude_latitude_interaction', description='This feature captures the interaction between longitude and latitude, which may account for geographical trends or regional effects influencing house values. Interaction terms can uncover 
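
Each generated idea carries a `formula` string over the existing column names, so it can be checked by hand before trusting it. The sketch below evaluates two of the formulas shown above on the first two rows of the housing table. How `skfeaturellm` evaluates formulas internally is not shown in this notebook; `pandas.DataFrame.eval` is used here only as a stand-in.

```python
import pandas as pd

# First two rows of the housing table shown earlier (only the columns needed).
rows = pd.DataFrame({
    "AveRooms": [6.984127, 6.238137],
    "AveBedrms": [1.023810, 0.971880],
    "Population": [322.0, 2401.0],
    "AveOccup": [2.555556, 2.109842],
})

# Formulas copied from the generated ideas above.
formulas = {
    "rooms_per_bedroom_ratio": "AveRooms/AveBedrms",
    "population_density": "Population/AveOccup",
}

for name, formula in formulas.items():
    # DataFrame.eval parses the string as an expression over column names.
    rows[name] = rows.eval(formula)

print(rows[list(formulas)])
```

Note that `Population/AveOccup` divides the population by the average household size, which works out to an approximate household count for the block group.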

## Model with Engineered Features

Now that we have our feature ideas, let's apply them to the data, train a new model, and compare its performance with the baseline:


In [40]:
df_housing_feng = llm_feng.transform(
    X=df_housing
)

df_housing_feng

index,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target,RoomsPerBedroom,PopulationPerHousehold,IncomePerRoom,LatitudeLongitudeInteraction,HouseholdSizeEffect,RoomsPerCapita,rooms_per_bedroom_ratio,population_density,longitude_latitude_interaction,income_per_room,age_income_ratio
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526,6.821705,126.0,1.192017,-4630.0724,3.257687,0.021690,6.821705,126.0,-4630.0724,1.192017,4.924807
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585,6.418626,1138.0,1.330750,-4627.2492,3.934608,0.002598,6.418626,1138.0,-4627.2492,1.330750,2.529694
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521,7.721053,177.0,0.875637,-4626.7840,2.589838,0.016710,7.721053,177.0,-4626.7840,0.875637,7.165100
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413,5.421277,219.0,0.970046,-4627.1625,2.214765,0.010425,5.421277,219.0,-4627.1625,0.970046,9.214793
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422,5.810714,259.0,0.612272,-4627.1625,1.763125,0.011118,5.810714,259.0,-4627.1625,0.612272,13.519838
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781,4.451872,330.0,0.309249,-4780.6332,0.609348,0.005971,4.451872,330.0,-4780.6332,0.309249,16.022560
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771,4.646667,114.0,0.418185,-4786.5829,0.818751,0.017174,4.646667,114.0,-4786.5829,0.418185,7.040050
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923,4.647423,433.0,0.326575,-4779.7046,0.730983,0.005169,4.647423,433.0,-4779.7046,0.326575,10.000000
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847,4.547677,349.0,0.350351,-4783.6476,0.879423,0.007192,4.547677,349.0,-4783.6476,0.350351,9.640103
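
In the rows shown, several engineered columns carry identical values under different names, for example `RoomsPerBedroom` and `rooms_per_bedroom_ratio`. A minimal sketch for finding and dropping such duplicates, using a toy frame built from three rows of the table above. It relies on exact equality, so on the full data a tolerance-based comparison might be preferable.

```python
import pandas as pd

# Three rows and five engineered columns copied from the table above.
toy = pd.DataFrame({
    "RoomsPerBedroom": [6.821705, 6.418626, 7.721053],
    "rooms_per_bedroom_ratio": [6.821705, 6.418626, 7.721053],
    "IncomePerRoom": [1.192017, 1.330750, 0.875637],
    "income_per_room": [1.192017, 1.330750, 0.875637],
    "age_income_ratio": [4.924807, 2.529694, 7.165100],
})

# Transpose so each column becomes a row, then flag rows repeating an earlier one.
is_duplicate = toy.T.duplicated(keep="first")
duplicate_columns = is_duplicate[is_duplicate].index.tolist()
print(duplicate_columns)

# Keep only the first occurrence of each distinct column.
deduplicated = toy.drop(columns=duplicate_columns)
```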


In [46]:
X_eng = df_housing_feng.drop('target', axis=1)
y_eng = df_housing_feng['target']

X_train_eng, X_test_eng, y_train_eng, y_test_eng = train_test_split(
    X_eng, y_eng, test_size=0.2, random_state=42
)

model_eng = RandomForestRegressor(n_estimators=100, random_state=42)
model_eng.fit(X_train_eng, y_train_eng)

eng_preds = model_eng.predict(X_test_eng)

eng_rmse = np.sqrt(mean_squared_error(y_test_eng, eng_preds))
eng_r2 = r2_score(y_test_eng, eng_preds)

print("Model Performance with Engineered Features:")
print(f"RMSE: {eng_rmse:.4f}")

print("\nPerformance Improvement:")
# Positive values mean the engineered model has a lower RMSE than the baseline.
print(f"RMSE Improvement: {((baseline_rmse - eng_rmse) / baseline_rmse * 100):.2f}%")

Model Performance with Engineered Features:
RMSE: 0.4879

Performance Improvement:
RMSE Improvement: 0.00%
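
An RMSE that matches the baseline to four decimals is what one would expect if both models were trained on the same inputs. The notebook does not say whether that happened here. One thing worth ruling out is that the frame used for the baseline already carried engineered columns, since `drop('target', axis=1)` keeps every other column. The sketch below, on a toy one-row frame built from the table above, selects the baseline inputs by explicit name and lists what the engineered set adds.

```python
import pandas as pd

# The eight original feature names, as listed in feature_descriptions.
ORIGINAL_FEATURES = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]

# Toy stand-in for df_housing_feng: original columns, target, two engineered columns.
toy = pd.DataFrame([{
    "MedInc": 8.3252, "HouseAge": 41.0, "AveRooms": 6.984127, "AveBedrms": 1.023810,
    "Population": 322.0, "AveOccup": 2.555556, "Latitude": 37.88, "Longitude": -122.23,
    "target": 4.526, "rooms_per_bedroom_ratio": 6.821705, "population_density": 126.0,
}])

# Baseline inputs: selected by name, so stray columns cannot leak in.
X_base = toy[ORIGINAL_FEATURES]
# Engineered inputs: everything except the target.
X_eng = toy.drop(columns=["target"])

engineered_only = [col for col in X_eng.columns if col not in X_base.columns]
print(engineered_only)
```

If this list comes back empty on the real frames, the two models saw the same features and the comparison is not measuring the effect of feature engineering.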
