# Quantile Random Forest (QRF) imputation

This notebook demonstrates how to use MicroImpute's QRF imputer to impute values with Quantile Random Forests. QRF extends traditional random forests to estimate the entire conditional distribution of a target variable rather than only its mean, so predictions can be made at any quantile.

Currently, a single QRF object can impute only one variable at a time.
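
To make the idea concrete, here is a minimal sketch of how an ordinary random forest can be turned into a conditional-distribution estimator, in the spirit of Meinshausen's quantile regression forests. It is not MicroImpute's implementation: it uses scikit-learn's `RandomForestRegressor` on synthetic data, and the helper name `forest_quantiles` is hypothetical, introduced only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def forest_quantiles(forest, X_train, y_train, X_new, quantiles):
    """Estimate conditional quantiles from leaf co-membership weights.

    Illustrative helper, not part of MicroImpute or scikit-learn.
    """
    train_leaves = forest.apply(X_train)  # (n_train, n_trees) leaf ids
    new_leaves = forest.apply(X_new)  # (n_new, n_trees) leaf ids
    order = np.argsort(y_train)
    y_sorted = y_train[order]
    result = np.empty((len(X_new), len(quantiles)))
    for i, leaves in enumerate(new_leaves):
        # True where a training point shares a leaf with the new point
        same_leaf = train_leaves == leaves
        # Each tree spreads unit weight over its matching training points;
        # averaging over trees gives one weight per training point
        weights = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)
        # Weighted empirical CDF of the training targets
        cdf = np.cumsum(weights[order])
        idx = np.searchsorted(cdf, quantiles, side="left")
        result[i] = y_sorted[np.clip(idx, 0, len(y_sorted) - 1)]
    return result


# Synthetic data (an assumption for this sketch): noise grows with the
# second feature, so the conditional spread differs between points
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = X[:, 0] + rng.normal(scale=0.1 + 0.5 * X[:, 1])

forest = RandomForestRegressor(
    n_estimators=50, min_samples_leaf=5, random_state=0
).fit(X, y)

quantiles = [0.05, 0.5, 0.95]
X_new = np.array([[0.5, 0.1], [0.5, 0.9]])
estimates = forest_quantiles(forest, X, y, X_new, quantiles)
print(estimates)
```

Each row of `estimates` holds the 5th, 50th, and 95th percentile estimates for one new point. Because the noise in this toy data grows with the second feature, the second point would typically receive a wider interval than the first.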

## Setup and data preparation

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.datasets import load_diabetes
import warnings

warnings.filterwarnings("ignore")

# Import MicroImpute tools
from microimpute.comparisons.data import preprocess_data
from microimpute.evaluations import *
from microimpute.models import QRF
from microimpute.config import QUANTILES

In [2]:
# Load the diabetes dataset
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Display the first few rows of the dataset
df.head()

index,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


In [3]:
# Define variables for the model
predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = ["s1"]  # We'll impute 's1' (total serum cholesterol)

# Create a subset with only needed columns
diabetes_df = df[predictors + imputed_variables]

# Display summary statistics
diabetes_df.describe()

statistic,age,sex,bmi,bp,s1
count,442.0,442.0,442.0,442.0,442.0
mean,-2.511817e-19,1.23079e-17,-2.245564e-16,-4.79757e-17,-1.381499e-17
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905
min,-0.1072256,-0.04464164,-0.0902753,-0.1123988,-0.1267807
25%,-0.03729927,-0.04464164,-0.03422907,-0.03665608,-0.03424784
50%,0.00538306,-0.04464164,-0.007283766,-0.005670422,-0.004320866
75%,0.03807591,0.05068012,0.03124802,0.03564379,0.02835801
max,0.1107267,0.05068012,0.1705552,0.1320436,0.1539137


In [4]:
# Split data into training and testing sets
# (the rescaled values in later outputs suggest preprocess_data also normalizes the columns)
X_train, X_test = preprocess_data(diabetes_df)

# Let's see how many records we have in each set
print(f"Training set size: {X_train.shape[0]} records")
print(f"Testing set size: {X_test.shape[0]} records")

Training set size: 353 records
Testing set size: 89 records
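
The 353/89 split above corresponds to an 80/20 partition of the 442 records, and the rescaled values in later outputs suggest that the columns are also normalized. The following is a rough stand-in for that kind of preprocessing, not `preprocess_data` itself: the function name `split_and_normalize`, the z-score normalization, and the choice to normalize before splitting are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def split_and_normalize(df, test_size=0.2, seed=0):
    """Z-score every column, then split rows at random (illustrative only)."""
    normalized = (df - df.mean()) / df.std()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))
    # Round the test share up, so 442 rows give 89 test records
    n_test = int(np.ceil(test_size * len(df)))
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]
    return normalized.iloc[train_rows], normalized.iloc[test_rows]


# Placeholder frame with the same shape as diabetes_df (random values)
demo_rng = np.random.default_rng(1)
demo_df = pd.DataFrame(
    demo_rng.normal(size=(442, 5)), columns=["age", "sex", "bmi", "bp", "s1"]
)
train_part, test_part = split_and_normalize(demo_df)
print(len(train_part), len(test_part))
```

Normalizing before the split lets test-set statistics influence the scaling; fitting the mean and standard deviation on the training rows only is the stricter alternative.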


## Simulating missing data

For this example, we'll simulate missing data in our test set by removing the values we want to impute.

In [5]:
# Create a copy of the test set with missing values
X_test_missing = X_test.copy()

# Store the actual values for later comparison
actual_values = X_test_missing[imputed_variables].copy()

# Remove the values to be imputed
X_test_missing[imputed_variables] = np.nan

X_test_missing.head()

index,age,sex,bmi,bp,s1
287,0.952161,-0.937474,-0.130325,-0.335978,
211,1.943844,-0.937474,0.775037,0.45932,
72,1.333577,1.064282,-0.085057,-0.263679,
321,2.020127,-0.937474,1.091914,1.664559,
73,0.265611,1.064282,-0.424568,-0.046779,


## Training and using the QRF imputer

Now we'll train the QRF imputer and use it to impute the missing values in our test set.

In [6]:
# Initialize the QRF imputer
# In this example, the random forest hyperparameters are passed to fit() below
qrf_imputer = QRF()

# Fit the model with our training data
# This trains a quantile random forest model
fitted_qrf_imputer = qrf_imputer.fit(
    X_train,
    predictors,
    imputed_variables,
    n_estimators=100,
    min_samples_leaf=5,
)

In [7]:
# Impute values in the test set
# This uses the trained QRF model to predict missing values at specified quantiles
imputed_values = fitted_qrf_imputer.predict(X_test_missing, QUANTILES)

# Display the first few imputed values at the median (0.5 quantile)
imputed_values[0.5].head()

index,s1
0,-0.784218
1,0.862797
2,1.845227
3,1.036167
4,1.469592


## Evaluating the imputation results

Now let's compare the imputed values with the actual values to evaluate the performance of our imputer.

In [8]:
# Extract median predictions for evaluation
median_predictions = imputed_values[0.5]

# Compute a shared axis range for the perfect-prediction diagonal
min_val = min(actual_values.min().min(), median_predictions.min().min())
max_val = max(actual_values.max().max(), median_predictions.max().max())

# Convert data for plotting
plot_df = pd.DataFrame(
    {
        "Actual": actual_values.values.flatten(),
        "Imputed": median_predictions.values.flatten(),
    }
)

# Create the scatter plot
fig = px.scatter(
    plot_df,
    x="Actual",
    y="Imputed",
    opacity=0.7,
    title="Comparison of Actual vs. Imputed Values using QRF",
)

# Add the diagonal line (perfect prediction line)
fig.add_trace(
    go.Scatter(
        x=[min_val, max_val],
        y=[min_val, max_val],
        mode="lines",
        line=dict(color="red", dash="dash"),
        name="Perfect Prediction",
    )
)

# Update layout
fig.update_layout(
    xaxis_title="Actual Values",
    yaxis_title="Imputed Values",
    width=650,
    height=500,
    template="plotly_white",
    margin=dict(l=50, r=50, t=80, b=50),  # Adjust margins
)

fig.show()
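
The scatter plot gives a visual impression; a numeric summary can complement it. Below is a minimal sketch of two common point-error metrics for the median predictions. The helper `point_metrics` is a hypothetical name and is not part of MicroImpute.

```python
import numpy as np


def point_metrics(actual, predicted):
    """Return mean absolute error and root mean squared error."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors**2))),
    }


# In this notebook the helper could be applied to the plotted columns:
# point_metrics(plot_df["Actual"], plot_df["Imputed"])

# Toy values (made up) for a quick check of the arithmetic
toy = point_metrics([0.0, 0.0], [3.0, 4.0])
print(toy)
```

RMSE penalizes large misses more heavily than MAE, so a large gap between the two indicates that a few records are imputed poorly.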

## Examining quantile predictions

QRF provides predictions at different quantiles, allowing us to capture the entire conditional distribution of the missing values.

In [9]:
# Compare predictions at different quantiles for the first 5 records
quantiles_to_show = QUANTILES
comparison_df = pd.DataFrame(index=range(5))

# Add actual values
comparison_df["Actual"] = actual_values.iloc[:5, 0].values

# Add quantile predictions
for q in quantiles_to_show:
    # round() avoids float truncation in the label (e.g. 0.29 * 100 is 28.999...)
    comparison_df[f"Q{int(round(q * 100))}"] = imputed_values[q].iloc[:5, 0].values

comparison_df

index,Actual,Q5,Q10,Q30,Q50,Q70,Q90,Q95
0,2.625393,-2.026704,-0.928693,0.487162,-0.784218,0.487162,1.585172,1.585172
1,-0.524163,-0.957588,0.024842,-0.957588,0.862797,0.978377,1.209537,2.683183
2,2.163073,-1.477699,-1.477699,-1.477699,1.845227,3.203293,3.203293,3.203293
3,1.151747,-0.957588,-0.957588,0.516057,1.036167,0.631637,1.209537,2.683183
4,0.805007,-2.026704,-2.026704,0.227107,1.469592,0.024842,0.660532,1.469592
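
Some rows in the table above are not monotone across quantiles; in row 0, for example, Q30 is larger than Q50. The sketch below detects such quantile crossing and applies a simple per-row sort (monotone rearrangement) as one possible post-processing step. The values are copied from the table; the sorting step is a generic technique and an assumption here, not something the source says MicroImpute does.

```python
import numpy as np
import pandas as pd

quantile_cols = ["Q5", "Q10", "Q30", "Q50", "Q70", "Q90", "Q95"]
# Quantile predictions for the first five records, copied from the table
preds = pd.DataFrame(
    [
        [-2.026704, -0.928693, 0.487162, -0.784218, 0.487162, 1.585172, 1.585172],
        [-0.957588, 0.024842, -0.957588, 0.862797, 0.978377, 1.209537, 2.683183],
        [-1.477699, -1.477699, -1.477699, 1.845227, 3.203293, 3.203293, 3.203293],
        [-0.957588, -0.957588, 0.516057, 1.036167, 0.631637, 1.209537, 2.683183],
        [-2.026704, -2.026704, 0.227107, 1.469592, 0.024842, 0.660532, 1.469592],
    ],
    columns=quantile_cols,
)

# A row "crosses" when any quantile falls below the one before it
crossing = (np.diff(preds.to_numpy(), axis=1) < 0).any(axis=1)
print(crossing)

# Monotone rearrangement: sort each row so quantiles never decrease
rearranged = pd.DataFrame(
    np.sort(preds.to_numpy(), axis=1), columns=quantile_cols
)
```

Sorting restores a valid ordering but also changes which value is reported at each quantile, including the median, so it should be applied consistently before any evaluation.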


## Visualizing prediction intervals

Because QRF predicts several quantiles, pairs of them form prediction intervals that show how uncertain each imputed value is. The plot below draws the Q10-Q90 and Q30-Q70 intervals around the median prediction.

In [10]:
# Create a prediction interval plot for the first 10 records
# Number of records to plot
n_records = 10

# Prepare data for plotting
records = list(range(n_records))
actuals = actual_values.iloc[:n_records, 0].values
medians = imputed_values[0.5].iloc[:n_records, 0].values
q30 = imputed_values[0.3].iloc[:n_records, 0].values
q70 = imputed_values[0.7].iloc[:n_records, 0].values
q10 = imputed_values[0.1].iloc[:n_records, 0].values
q90 = imputed_values[0.9].iloc[:n_records, 0].values

# Create the base figure
fig = go.Figure()

# Add 80% prediction interval (Q10-Q90)
for i in range(n_records):
    fig.add_trace(
        go.Scatter(
            x=[i, i],
            y=[q10[i], q90[i]],
            mode="lines",
            line=dict(width=10, color="rgba(173, 216, 230, 0.15)"),
            hoverinfo="none",
            showlegend=False,
        )
    )

# Add 40% prediction interval (Q30-Q70)
for i in range(n_records):
    fig.add_trace(
        go.Scatter(
            x=[i, i],
            y=[q30[i], q70[i]],
            mode="lines",
            line=dict(width=10, color="rgba(70, 130, 180, 0.3)"),
            hoverinfo="none",
            showlegend=False,
        )
    )

# Add actual values
fig.add_trace(
    go.Scatter(
        x=records,
        y=actuals,
        mode="markers",
        marker=dict(color="black", size=8),
        name="Actual",
    )
)

# Add median predictions
fig.add_trace(
    go.Scatter(
        x=records,
        y=medians,
        mode="markers",
        marker=dict(color="red", size=8),
        name="Median (Q50)",
    )
)

# Add a legend-only entry for the 80% interval
fig.add_trace(
    go.Scatter(
        x=[None],  # No visible points; this trace exists only for the legend
        y=[None],
        mode="lines",
        line=dict(color="rgba(173, 216, 230, 0.15)", width=10),
        name="80% PI (Q10-Q90)",
    )
)

# Add a legend-only entry for the 40% interval
fig.add_trace(
    go.Scatter(
        x=[None],  # No visible points; this trace exists only for the legend
        y=[None],
        mode="lines",
        line=dict(color="rgba(70, 130, 180, 0.3)", width=10),
        name="40% PI (Q30-Q70)",
    )
)

# Update layout with smaller width to fit in the book layout
fig.update_layout(
    title="QRF Imputation Prediction Intervals",
    xaxis=dict(
        title="Data Record Index",
        showgrid=True,
        gridwidth=1,
        gridcolor="rgba(211, 211, 211, 0.7)",
    ),
    yaxis=dict(
        title="Total Serum Cholesterol (s1)",
        showgrid=True,
        gridwidth=1,
        gridcolor="rgba(211, 211, 211, 0.7)",
    ),
    width=650,
    height=500,
    template="plotly_white",
    margin=dict(l=50, r=50, t=80, b=50),  # Adjust margins
    legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99),
)

fig.show()
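
Prediction intervals can also be checked numerically through their empirical coverage: the share of actual values that fall inside the interval. A well-calibrated Q10-Q90 interval should contain roughly 80% of the actual values on a large enough sample. The sketch below uses a hypothetical helper, `interval_coverage`, and values copied from the first five rows of the quantile comparison table earlier in this notebook.

```python
import numpy as np


def interval_coverage(actual, lower, upper):
    """Fraction of actual values inside [lower, upper] (illustrative helper)."""
    actual, lower, upper = (
        np.asarray(a, dtype=float) for a in (actual, lower, upper)
    )
    return float(np.mean((actual >= lower) & (actual <= upper)))


# First five records from the quantile comparison table
demo_actual = [2.625393, -0.524163, 2.163073, 1.151747, 0.805007]
demo_q10 = [-0.928693, 0.024842, -1.477699, -0.957588, -2.026704]
demo_q90 = [1.585172, 1.209537, 3.203293, 1.209537, 0.660532]

coverage_80 = interval_coverage(demo_actual, demo_q10, demo_q90)
print(coverage_80)

# On the full test set the call could look like this:
# interval_coverage(
#     actual_values.iloc[:, 0],
#     imputed_values[0.1].iloc[:, 0],
#     imputed_values[0.9].iloc[:, 0],
# )
```

Five records are far too few to judge calibration; the full test set gives a more meaningful estimate.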

## Assessing the method's performance

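To check whether the model is overfitting, we can run cross-validation and compare its performance on the training and test folds. The table below reports one loss value per quantile for each split; a test loss well above the training loss points to overfitting.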

In [11]:
# Run cross-validation on the same data set
qrf_results = cross_validate_model(
    QRF, diabetes_df, predictors, imputed_variables
)

qrf_results

split,0.05,0.10,0.30,0.50,0.70,0.90,0.95
train,0.00348,0.00674,0.01373,0.012498,0.008909,0.006535,0.004148
test,0.004621,0.009479,0.0235,0.026368,0.02233,0.011269,0.005982
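
The source does not state which loss `cross_validate_model` reports. Assuming it is the standard quantile (pinball) loss, which is the usual choice for evaluating quantile predictions, a minimal sketch of that metric looks like this. The function name `quantile_loss` is introduced here for illustration and may not match MicroImpute's own implementation.

```python
import numpy as np


def quantile_loss(q, y_true, y_pred):
    """Mean pinball loss at quantile q (0 < q < 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit,
    # over-prediction (diff < 0) costs (1 - q) per unit
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


# At q = 0.9, missing low is penalized nine times as much as missing high
under_loss = quantile_loss(0.9, [1.0], [0.0])  # prediction one unit too low
over_loss = quantile_loss(0.9, [0.0], [1.0])  # prediction one unit too high
print(under_loss, over_loss)
```

At `q = 0.5` the loss is symmetric and equals half the mean absolute error, so losses at different quantiles are not directly comparable in magnitude.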


In [12]:
# Plot the results
plot_train_test_performance(qrf_results)

## Tuning the QRF model

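The QRF imputer accepts random forest hyperparameters, such as the `n_estimators` and `min_samples_leaf` arguments passed to `fit()` earlier in this notebook, which can be adjusted to improve performance. More details coming soon.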
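
Until then, one generic way to tune such parameters is an exhaustive grid search over candidate values, keeping the combination with the lowest validation loss. The sketch below is only an outline of that pattern: `grid_search` is a hypothetical helper, and `toy_score` is a made-up stand-in for a function that would fit a QRF with the given parameters and return its test loss.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination; lower is better."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(**params), params))
    best_score, best_params = min(results, key=lambda item: item[0])
    return best_params, best_score, results


def toy_score(n_estimators, min_samples_leaf):
    """Stand-in for a real validation loss (made up for this sketch)."""
    return abs(n_estimators - 100) / 100 + abs(min_samples_leaf - 5) / 5


param_grid = {"n_estimators": [50, 100, 200], "min_samples_leaf": [1, 5, 10]}
best_params, best_score, results = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

In practice the scoring function should be based on held-out data, for example the test row of the cross-validation table, so that the chosen parameters are not simply the ones that fit the training data best.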