# **Feature Engineering**

In this Jupyter Notebook, we build a predictive model for fuel mileage (miles per gallon) using the Seaborn `mpg` dataset. We perform feature engineering, handle missing data, and evaluate several linear regression models fitted on different sets of features. The goal is to understand how feature transformations and encoding techniques affect model performance.

***

## **1. Import Libraries**

First, we import the necessary libraries for data manipulation, visualization, and modeling.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder

***

## **2. Loading and Exploring the Data**

We will use Seaborn's `mpg` dataset, which contains information about various cars, including their fuel mileage and other characteristics.

### *2.1 ~ The Data*

**Note:** A few rows of the dataset have missing values (in the `horsepower` column), which we will handle in the next section.

In [2]:
# Load the mpg dataset
df = sns.load_dataset("mpg")

# Display the first few rows of the dataset
df.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |


In [3]:
# Display the entire dataset
df

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | usa | ford mustang gl |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | europe | vw pickup |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | usa | dodge rampage |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 | usa | ford ranger |


***

## **3. Handling Missing Values**

Before building our models, it's crucial to address any missing data to ensure the quality and reliability of our predictions.

In [4]:
# Identify and remove rows with any missing values
data = df.dropna().copy()

# Display the number of rows before and after dropping missing values
print(f"Original data shape: {df.shape}")
print(f"Data shape after dropping missing values: {data.shape}")

Original data shape: (398, 9)
Data shape after dropping missing values: (392, 9)
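
Before dropping rows, it can help to see which columns actually contain the gaps. The following is a minimal sketch on a small hypothetical frame (the values are made up for illustration and are not taken from the `mpg` dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 31.0, 22.0],
    "horsepower": [130.0, np.nan, 65.0, np.nan],
    "weight": [3504, 2200, 1800, 2600],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value
complete_rows = toy.dropna()
print(f"Rows before: {len(toy)}, rows after: {len(complete_rows)}")
```

Running `df.isna().sum()` on the real dataframe in the same way would show where the dropped rows come from.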


***

## **4. Feature Engineering**

Feature engineering involves creating new features or modifying existing ones to improve the performance of machine learning models.

### *4.1 ~ Basic Design Matrix*

We start by selecting the quantitative features that will serve as predictors in our initial model.

In [5]:
def basic_design_matrix(df):
    """
    Creates a design matrix with selected quantitative features.
    
    Parameters:
    df (pd.DataFrame): The input dataframe.
    
    Returns:
    pd.DataFrame: Design matrix with selected features.
    """
    X = df[["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]]
    return X

# Display the first few rows of the basic design matrix
basic_design_matrix(data).head()

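|   | cylinders | displacement | horsepower | weight | acceleration | model_year |
|---|---|---|---|---|---|---|
| 0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 |
| 1 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 |
| 2 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 |
| 3 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 |
| 4 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 |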


### *4.2 ~ Nonlinear Transformations*

Adding nonlinear transformations of the quantitative features lets a linear model capture more complex relationships between the predictors and `mpg`.

**Note:** We replace zeros with ones before applying the logarithm, because `log(0)` is undefined (NumPy returns `-inf`); a zero therefore maps to `log(1) = 0`.

In [6]:
def nonlinear_design_matrix(df):
    """
    Extends the basic design matrix with nonlinear transformations.
    
    Parameters:
    df (pd.DataFrame): The input dataframe.
    
    Returns:
    pd.DataFrame: Enhanced design matrix with nonlinear features.
    """
    X = basic_design_matrix(df).copy()
    
    # Apply nonlinear transformations to each quantitative feature
    # (iterate over a snapshot of the column names, since the loop adds columns)
    for col in list(X.columns):
        X[f'{col}^2'] = X[col] ** 2
        X[f'{col}^3'] = X[col] ** 3
        X[f'log_{col}'] = np.log(X[col].replace(0, 1))  # Avoid log(0)
        X[f'sin_{col}'] = np.sin(X[col])
    return X

# Display the first few rows of the nonlinear design matrix
nonlinear_design_matrix(data).head()

|   | cylinders | displacement | horsepower | weight | acceleration | model_year | cylinders^2 | cylinders^3 | log_cylinders | sin_cylinders | ... | log_weight | sin_weight | acceleration^2 | acceleration^3 | log_acceleration | sin_acceleration | model_year^2 | model_year^3 | log_model_year | sin_model_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 8.16166 | -0.901919 | 144.0 | 1728.0 | 2.484907 | -0.536573 | 4900 | 343000 | 4.248495 | 0.773891 |
| 1 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 8.214194 | -0.998328 | 132.25 | 1520.875 | 2.442347 | -0.875452 | 4900 | 343000 | 4.248495 | 0.773891 |
| 2 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 8.142063 | -0.784794 | 121.0 | 1331.0 | 2.397895 | -0.99999 | 4900 | 343000 | 4.248495 | 0.773891 |
| 3 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 8.14119 | 0.68948 | 144.0 | 1728.0 | 2.484907 | -0.536573 | 4900 | 343000 | 4.248495 | 0.773891 |
| 4 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 8.14584 | -0.451757 | 110.25 | 1157.625 | 2.351375 | -0.879696 | 4900 | 343000 | 4.248495 | 0.773891 |
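
Although the added columns are nonlinear in the original features, the model stays linear in its coefficients, so ordinary least squares still applies. The sketch below illustrates this on synthetic data (the quadratic relationship and the names `Phi` and `theta` are assumptions made for the example, not part of the notebook):

```python
import numpy as np

# Synthetic data: y depends on x quadratically (an assumption for illustration)
x = np.linspace(1.0, 10.0, 50)
y = 3.0 + 2.0 * x**2

# Design matrix with an intercept column, x, and x^2
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares: still a linear problem in the coefficients
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(theta, 6))
```

The fit recovers the intercept and the coefficient on `x**2`, with a coefficient close to zero on `x`, even though the relationship between `x` and `y` is a curve.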


### *4.3 ~ Categorical Data Encoding*

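The `origin` column is categorical and must be converted to numerical form before it can enter the regression. We use one-hot encoding with `drop='first'`, which omits the first category (`europe`) so that it serves as the baseline against which `japan` and `usa` are measured.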

In [7]:
# Initialize OneHotEncoder
oh_enc = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid multicollinearity

# Fit and transform the 'origin' column
origin_encoded = oh_enc.fit_transform(data[['origin']])

# Create a DataFrame with the encoded features
origin_encoded_df = pd.DataFrame(origin_encoded, 
                                 columns=oh_enc.get_feature_names_out(['origin']),
                                 index=data.index)

# Display the first few rows of the encoded categorical data
origin_encoded_df.head()

|   | origin_japan | origin_usa |
|---|---|---|
| 0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 |
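
To see why one category is dropped, consider what happens when every category gets its own column and the model also has an intercept: the dummy columns sum to the intercept column, so the design matrix loses rank. The sketch below uses `pd.get_dummies` as a stand-in for `OneHotEncoder` (a substitution made only to keep the example short) on a hypothetical sample of origins:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the three origin categories
origin = pd.DataFrame({"origin": ["usa", "japan", "europe", "usa", "japan", "europe"]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(origin, dtype=float)
# Reduced encoding: the first category (alphabetically, "europe") is dropped
reduced = pd.get_dummies(origin, drop_first=True, dtype=float)

intercept = np.ones((len(origin), 1))
full_with_intercept = np.hstack([intercept, full.to_numpy()])
reduced_with_intercept = np.hstack([intercept, reduced.to_numpy()])

# Four columns, but the dummy columns sum to the intercept column
rank_full = np.linalg.matrix_rank(full_with_intercept)
# Three columns with no redundancy
rank_reduced = np.linalg.matrix_rank(reduced_with_intercept)

print(full_with_intercept.shape[1], rank_full)
print(reduced_with_intercept.shape[1], rank_reduced)
```

With the full encoding the matrix has more columns than its rank, which makes the individual coefficients non-unique; the reduced encoding avoids this.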


In [8]:
def sklearn_ohe_design_matrix(df):
    """
    Combines nonlinear quantitative features with one-hot encoded categorical features.
    
    Parameters:
    df (pd.DataFrame): The input dataframe.
    
    Returns:
    pd.DataFrame: Combined design matrix.
    """
    # Apply nonlinear transformations
    X = nonlinear_design_matrix(df).copy()
    
    # Apply one-hot encoding to the 'origin' column
    ohe_cols = pd.DataFrame(oh_enc.transform(df[['origin']]),
                            columns=oh_enc.get_feature_names_out(['origin']),
                            index=df.index)
    
    # Combine quantitative and categorical features
    X = X.join(ohe_cols)
    return X

# Display the first few rows of the combined design matrix
sklearn_ohe_design_matrix(data).head()

|   | cylinders | displacement | horsepower | weight | acceleration | model_year | cylinders^2 | cylinders^3 | log_cylinders | sin_cylinders | ... | acceleration^2 | acceleration^3 | log_acceleration | sin_acceleration | model_year^2 | model_year^3 | log_model_year | sin_model_year | origin_japan | origin_usa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 144.0 | 1728.0 | 2.484907 | -0.536573 | 4900 | 343000 | 4.248495 | 0.773891 | 0.0 | 1.0 |
| 1 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 132.25 | 1520.875 | 2.442347 | -0.875452 | 4900 | 343000 | 4.248495 | 0.773891 | 0.0 | 1.0 |
| 2 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 121.0 | 1331.0 | 2.397895 | -0.99999 | 4900 | 343000 | 4.248495 | 0.773891 | 0.0 | 1.0 |
| 3 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 144.0 | 1728.0 | 2.484907 | -0.536573 | 4900 | 343000 | 4.248495 | 0.773891 | 0.0 | 1.0 |
| 4 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 110.25 | 1157.625 | 2.351375 | -0.879696 | 4900 | 343000 | 4.248495 | 0.773891 | 0.0 | 1.0 |


***

## **5. Model Evaluation**

To systematically evaluate and compare different models, we define a function that trains a model, computes performance metrics, and visualizes the results.

### *5.1 ~ Evaluation Function*

In [9]:
def evaluate_model(name, model, phi, data, models=None):
    """
    Trains the model, evaluates its performance, and updates the models dictionary.
    
    Parameters:
    name (str): Name of the model.
    model: The machine learning model to train.
    phi (function): Function to generate the design matrix.
    data (pd.DataFrame): The dataset to use for training and evaluation.
    models (dict, optional): Dictionary to store model performance metrics.
        A new dictionary is created when none is passed.
    
    Returns:
    None
    """
    # Avoid a mutable default argument, which would be shared across calls
    if models is None:
        models = {}
    
    # Generate the design matrix
    X = phi(data)
    
    # Extract target variable
    Y = data['mpg'].to_numpy()
    
    # Train the model
    model.fit(X, Y)
    
    # Make predictions
    Yhat = model.predict(X)
    
    # Compute Root Mean Squared Error (RMSE)
    rmse = np.sqrt(mean_squared_error(Y, Yhat))
    print(f"Model: {name}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.4f}\n")
    
    # Save the model and RMSE to the collection of models
    models[name] = {'model': model, 'phi': phi, 'rmse': rmse}
    
    # Generate diagnostic and comparison plots
    fig = make_subplots(rows=1, cols=2, subplot_titles=("Actual vs Predicted MPG", "Model RMSE Comparison"))
    
    # Scatter plot of Actual vs Predicted MPG
    fig.add_trace(go.Scatter(x=Yhat, y=Y, mode="markers", name=name), row=1, col=1)
    fig.update_xaxes(title_text="Predicted MPG (Yhat)", row=1, col=1)
    fig.update_yaxes(title_text="Actual MPG (Y)", row=1, col=1)
    
    # Plot y = yhat line
    ymin, ymax = min(Y.min(), Yhat.min()), max(Y.max(), Yhat.max())
    fig.add_trace(go.Scatter(x=[ymin, ymax], y=[ymin, ymax], mode="lines", name="y = yhat"), row=1, col=1)
    
    # Bar chart of RMSE for all models
    model_names = list(models.keys())
    rmse_values = [models[k]['rmse'] for k in model_names]
    fig.add_trace(go.Bar(x=model_names, y=rmse_values, name="RMSE"), row=1, col=2)
    fig.update_yaxes(title_text="RMSE", row=1, col=2)
    
    fig.update_layout(showlegend=False, height=600, width=1000)
    fig.show()

### *5.2 ~ Evaluating the Basic Model*

We start by evaluating a simple linear regression model using the basic design matrix.

In [10]:
# Initialize the models dictionary
models = {}

# Initialize the Linear Regression model
basic_model = LinearRegression()

# Evaluate the basic model
evaluate_model("Basic Linear Regression", basic_model, basic_design_matrix, data, models)

Model: Basic Linear Regression
Root Mean Squared Error (RMSE): 3.4044
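
The RMSE reported here is the square root of the average squared difference between actual and predicted values. As a small worked sketch with made-up numbers (not values from the dataset), the same quantity can be computed by hand:

```python
import numpy as np

# Hypothetical actual and predicted values
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

residuals = y - y_hat            # differences: 1, 0, -2
mse = np.mean(residuals ** 2)    # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.4f}")
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones, and the result is expressed in the same unit as the target (here, miles per gallon).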



### *5.3 ~ Evaluating the Nonlinear Model*

Next, we evaluate a model that includes nonlinear transformations of the features.

In [11]:
# Initialize the Linear Regression model
nonlinear_model = LinearRegression()

# Evaluate the nonlinear model
evaluate_model("Nonlinear Linear Regression", nonlinear_model, nonlinear_design_matrix, data, models)

Model: Nonlinear Linear Regression
Root Mean Squared Error (RMSE): 2.4964



### *5.4 ~ Evaluating the One-Hot Encoded Model*

We then incorporate categorical data encoding into our model.

In [12]:
# Initialize the Linear Regression model
ohe_model = LinearRegression()

# Evaluate the one-hot encoded model
evaluate_model("One-Hot Encoded Linear Regression", ohe_model, sklearn_ohe_design_matrix, data, models)

Model: One-Hot Encoded Linear Regression
Root Mean Squared Error (RMSE): 2.4729



### *5.5 ~ Evaluating the Imputed Model*

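Finally, instead of dropping the rows with missing values, we impute them with column means and evaluate the resulting model on the full dataset. Because this model is fitted and scored on all 398 rows rather than on the 392 complete ones, its RMSE is not directly comparable to the previous results.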

In [13]:
def imputed_design_matrix(df):
    """
    Creates a design matrix with imputed missing values.
    
    Parameters:
    df (pd.DataFrame): The input dataframe.
    
    Returns:
    pd.DataFrame: Design matrix with imputed values.
    """
    X = sklearn_ohe_design_matrix(df).copy()
    
    # Impute missing values with the mean of each column
    X = X.fillna(X.mean())
    return X

# Use the original dataframe with missing values
original_df = df.copy()

# Initialize the Linear Regression model
imputed_model = LinearRegression()

# Evaluate the imputed model
evaluate_model("Imputed Linear Regression", imputed_model, imputed_design_matrix, original_df, models)

Model: Imputed Linear Regression
Root Mean Squared Error (RMSE): 2.4913
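
Note that `imputed_design_matrix` fills the gaps after the nonlinear transformations have been applied, so each derived column (for example `horsepower^2`) is filled with its own column mean rather than with the transform of an imputed raw value. A small sketch with made-up numbers shows that the two orders give different results:

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower column with one missing value
hp = pd.Series([100.0, 200.0, np.nan], name="horsepower")

# Option A: transform first, then impute the derived column with its own mean
squared_then_imputed = (hp ** 2).fillna((hp ** 2).mean())

# Option B: impute the raw column first, then transform
imputed_then_squared = hp.fillna(hp.mean()) ** 2

print(squared_then_imputed.iloc[2])  # mean of 100^2 and 200^2
print(imputed_then_squared.iloc[2])  # square of the mean of 100 and 200
```

Neither order is wrong in itself, but they are different modeling choices, and it is worth knowing which one a pipeline makes.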



***

## **6. Summary**

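Through this notebook, we demonstrated the role of feature engineering in building predictive models. By progressively extending the design matrix with nonlinear transformations and categorical data encoding, we reduced the training RMSE from 3.4044 to 2.4964 and then to 2.4729. Imputing missing values instead of dropping them let us keep all 398 rows; the resulting RMSE of 2.4913 is slightly higher than that of the one-hot encoded model, but it is computed on a different set of rows. All of these figures are measured on the same data used for fitting, so they cannot reveal overfitting: richer feature sets will tend to lower training error even when they do not generalize, and model complexity must be balanced against that risk.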
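
One common way to check for overfitting is to score the model on rows it never saw during fitting. The following is a minimal sketch on synthetic data (the data, the 25% split, and the variable names are assumptions for illustration; the notebook itself does not perform a split):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Compare the error on the fitting rows with the error on the held-out rows
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Train RMSE: {train_rmse:.4f}")
print(f"Test RMSE:  {test_rmse:.4f}")
```

A test RMSE that is much larger than the training RMSE would suggest that the added features fit noise rather than structure.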

For convenience, we define a comprehensive function that incorporates all feature engineering steps.

In [14]:
# Define the final design matrix function; data_mean is optional
def get_design_matrix(df, data_mean=None):
    """
    Creates a comprehensive design matrix with nonlinear features, one-hot encoding, and imputed missing values.
    
    Parameters:
    df (pd.DataFrame): The input dataframe.
    data_mean (pd.Series): Mean values for imputing missing data. If None, uses the mean of the design matrix.
    
    Returns:
    pd.DataFrame: Final design matrix.
    """
    # Generate the design matrix with nonlinear transformations and one-hot encoding
    X = sklearn_ohe_design_matrix(df).copy()
    
    # Compute data_mean from the design matrix if not provided
    if data_mean is None:
        data_mean = X.mean()
    
    # Impute missing values with provided means
    X = X.fillna(data_mean)
    return X

# Generate the final design matrix, letting the function compute the column means itself
final_design_matrix = get_design_matrix(original_df)
final_design_matrix.head()

|   | cylinders | displacement | horsepower | weight | acceleration | model_year | cylinders^2 | cylinders^3 | log_cylinders | sin_cylinders | ... | acceleration^2 | acceleration^3 | log_acceleration | sin_acceleration | model_year^2 | model_year^3 | log_model_year | sin_model_year | origin_japan | origin_usa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 144.0 | 1728.0 | 2.484907 | -0.536573 | 4900 | 343000 | 4.248495 | 0.773891 | 0.0 | 1.0 |
| 1 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 132.25 | 1520.875 | 2.442347 | -0.875452 | 4900 | 343000 | 4.248495 | 0.773891 | 0.0 | 1.0 |
| 2 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 121.0 | 1331.0 | 2.397895 | -0.99999 | 4900 | 343000 | 4.248495 | 0.773891 | 0.0 | 1.0 |
| 3 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 144.0 | 1728.0 | 2.484907 | -0.536573 | 4900 | 343000 | 4.248495 | 0.773891 | 0.0 | 1.0 |
| 4 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 64 | 512 | 2.079442 | 0.989358 | ... | 110.25 | 1157.625 | 2.351375 | -0.879696 | 4900 | 343000 | 4.248495 | 0.773891 | 0.0 | 1.0 |
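
The optional `data_mean` argument makes it possible to compute the fill values once and reuse them on other data, for example on rows that were not part of fitting. The sketch below illustrates the pattern with a hypothetical helper, `impute_with_means`, and made-up frames; it mirrors the idea behind `get_design_matrix` but is not the notebook's function:

```python
import numpy as np
import pandas as pd

def impute_with_means(X, means=None):
    """Hypothetical helper: fill NaNs with the given means, or with X's own means."""
    if means is None:
        means = X.mean()
    return X.fillna(means), means

# Made-up "training" and "new" feature frames
train = pd.DataFrame({"horsepower": [100.0, np.nan, 140.0]})
new = pd.DataFrame({"horsepower": [np.nan, 90.0]})

# Learn the fill value on the training rows only
train_filled, train_means = impute_with_means(train)
# Reuse that value for rows the model has not seen
new_filled, _ = impute_with_means(new, means=train_means)

print(train_means["horsepower"])
print(new_filled)
```

Reusing the training means keeps the preprocessing of new rows consistent with what the model saw when it was fitted.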


In [15]:
# Example: Using the final design matrix function for modeling
# evaluate_model fits the model itself, so no separate fit call is needed
final_model = LinearRegression()
evaluate_model("Final Linear Regression", final_model, get_design_matrix, original_df, models)

Model: Final Linear Regression
Root Mean Squared Error (RMSE): 2.4913

