# Feature Importance and Advanced Models

## Polynomial Regression

We now extend the feature space with quadratic and interaction terms to see whether they improve the model's predictive performance.
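
To make the extension concrete, a degree-2 expansion of $p$ original features $x_1, \dots, x_p$ corresponds to fitting a model of the following form. The symbols $\beta$ and $p$ are notation introduced here for illustration and do not appear in the notebook.

$$
\hat{y} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \sum_{i=1}^{p} \sum_{j=i}^{p} \beta_{ij}\, x_i x_j
$$

The terms with $j = i$ are the squares and the terms with $j > i$ are the pairwise interactions. Excluding the intercept, the expanded design matrix has $p + \tfrac{p(p+1)}{2} = \tfrac{p(p+3)}{2}$ columns, so the model is still linear in its coefficients even though it is quadratic in the inputs.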

First, we load the dataset and set up 5-fold cross-validation with `KFold`, shuffling the rows with a fixed seed so the splits are reproducible.

In [14]:
import pandas as pd
from sklearn.model_selection import KFold

# Load the dataset
df = pd.read_pickle("../data/winequality.pkl")

# Initialize KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

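Next, we expand the data with polynomial features using `PolynomialFeatures` from `sklearn.preprocessing`. Setting `degree=2` adds squared and pairwise interaction terms, and `include_bias=False` omits the constant column because `LinearRegression` fits its own intercept by default.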

In [15]:
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df.drop(columns=["quality"]))
y = df["quality"]

Now, we initialize the linear regression model and perform the k-fold cross-validation to evaluate its performance.

In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Initialize the model
lin_reg = LinearRegression()

# Store results
mse_list, rmse_list, r2_list = [], [], []

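# Perform K-Fold Cross-Validation on the polynomial feature matrix
for fold, (train_index, test_index) in enumerate(kf.split(X_poly)):
    X_train, X_test = X_poly[train_index], X_poly[test_index]
    # Positional indexing, so the split does not depend on the DataFrame's index labels
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]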

    lin_reg.fit(X_train, y_train)
    y_pred = lin_reg.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    mse_list.append(mse)
    rmse_list.append(rmse)
    r2_list.append(r2)

    print(f"Fold {fold+1}: R2 = {r2:.4f}, MSE = {mse:.4f}, RMSE = {rmse:.4f}")

# Aggregate results and save to CSV
results = pd.DataFrame({
    "Metric": ["MSE", "RMSE", "R2"],
    "Mean": [np.mean(mse_list), np.mean(rmse_list), np.mean(r2_list)],
    "Variance": [np.var(mse_list), np.var(rmse_list), np.var(r2_list)]
})
results.to_csv("../reports/tables/cross_validation_results.csv", index=False)

print("\nCross-validation results (aggregated):")
results

Fold 1: R2 = 0.2809, MSE = 0.4002, RMSE = 0.6326
Fold 2: R2 = 0.3375, MSE = 0.4879, RMSE = 0.6985
Fold 3: R2 = 0.3017, MSE = 0.4732, RMSE = 0.6879
Fold 4: R2 = 0.4021, MSE = 0.4081, RMSE = 0.6388
Fold 5: R2 = 0.1598, MSE = 0.4942, RMSE = 0.7030

Cross-validation results (aggregated):


|   | Metric | Mean | Variance |
|---|--------|----------|----------|
| 0 | MSE    | 0.452719 | 0.001627 |
| 1 | RMSE   | 0.672164 | 0.000914 |
| 2 | R2     | 0.296402 | 0.006358 |
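
The manual loop above could also be written more compactly with a pipeline and `cross_validate`. The following is a sketch on synthetic stand-in data, not the wine dataset; `X_demo`, `y_demo`, and the data-generating formula are assumptions made purely so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): the target depends on a square and an interaction
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

# The pipeline applies the polynomial expansion inside each fold before fitting
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    model, X_demo, y_demo, cv=cv,
    scoring=("r2", "neg_mean_squared_error"),
)

# scikit-learn reports MSE negated so that higher is always better; flip the sign back
r2_scores = scores["test_r2"]
mse_scores = -scores["test_neg_mean_squared_error"]
rmse_scores = np.sqrt(mse_scores)

print(f"R2:   mean = {r2_scores.mean():.4f}, variance = {r2_scores.var():.6f}")
print(f"MSE:  mean = {mse_scores.mean():.4f}, variance = {mse_scores.var():.6f}")
print(f"RMSE: mean = {rmse_scores.mean():.4f}, variance = {rmse_scores.var():.6f}")
```

Keeping the preprocessing inside the pipeline matters little for `PolynomialFeatures`, which learns nothing from the data, but it becomes important once a fitted step such as a scaler is added, since that step is then refit on each training fold only.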


## Regularization



## A Non-Linear Model

