# Weighted Average Modeling

Here I want to explore weighted averaging as an estimation technique over multiple base estimators. In my experience, subject matter experts (SMEs) often crowd-source several estimates, sometimes picking the one that provides the most comfort, and often averaging predictions together with a weighting scheme. Those weights are often chosen arbitrarily rather than learned from the data. Here I want to fit a linear regression whose coefficients serve as weights over several base estimators. To do that, let's create three base estimators for a regression problem, look at their test errors, and then build a regression on the base estimators' predictions, with conditions that ensure the resulting model is interpretable as a weighted-average calculation:

1. $ \forall b \in B, \ 0 \leq b \leq 1 $
2. $ b_{norm} = \frac{b}{\sum B}, \ \forall b \in B $
3. $ \hat{y} = X \cdot B_{norm} $

All coefficients are bounded between 0 and 1, and each coefficient is normalized by the sum of all coefficients so that they sum to 1, prior to linear estimation.
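
To make these conditions concrete, here is a minimal sketch of the clip-and-normalize step on made-up numbers. The helper name `normalize_weights`, the array `X_demo`, and the raw coefficients are all hypothetical and are not part of the exercise below.

```python
import numpy as np

def normalize_weights(b):
    """Clip raw coefficients to [0, 1], then rescale them so they sum to 1."""
    b = np.clip(np.asarray(b, dtype=float), 0.0, 1.0)
    total = b.sum()
    if total == 0:
        raise ValueError("at least one coefficient must be positive")
    return b / total

# Hypothetical predictions from three base estimators for four observations.
X_demo = np.array([
    [2.0, 3.0, 4.0],
    [1.0, 1.0, 1.0],
    [0.0, 2.0, 6.0],
    [5.0, 5.0, 2.0],
])

b_raw = np.array([0.5, 1.5, -0.2])   # deliberately violates the [0, 1] bounds
b_norm = normalize_weights(b_raw)    # clipped to [0.5, 1.0, 0.0], then divided by 1.5
y_hat = X_demo @ b_norm              # one weighted average per observation

print(np.round(b_norm, 3))           # approximately [0.333, 0.667, 0.0]
print(np.round(y_hat, 3))
```

Because the normalized weights are non-negative and sum to 1, each prediction falls between the smallest and largest base prediction for that observation.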

A mathematical expression of the optimization:

$ \hat{y} = X \cdot B_{norm} $ 
   
$ sse = .5*(\hat{y} - y)^{2} = .5*(X \cdot \begin{bmatrix} b_1*(\sum{B})^{-1} \\ \ldots \\ b_k*(\sum{B})^{-1} \end{bmatrix} - y)^{2} $  
  
$ \frac{\partial sse}{\partial b_i} = (X \cdot \begin{bmatrix} b_1*(\sum{B})^{-1} \\ \ldots \\ b_k*(\sum{B})^{-1} \end{bmatrix} - y) \cdot (x_i * b_i * (-1) * (\sum{B})^{-2} + x_i * (\sum{B})^{-1} + \sum_{j \neq i}{x_j * b_j * (-1) * (\sum{B})^{-2}}) $  
  
$      = (X \cdot \begin{bmatrix} b_1*(\sum{B})^{-1} \\ \ldots \\ b_k*(\sum{B})^{-1} \end{bmatrix} - y) \cdot ( X \cdot \begin{bmatrix} b_1 * (-1) * (\sum{B})^{-2} \\ \ldots \\ b_k * (-1) * (\sum{B})^{-2} \end{bmatrix} + x_i * (\sum{B})^{-1}) $  
  
$      = (X \cdot B * (\sum{B})^{-1} - y) \cdot ( X \cdot B  * (-(\sum{B})^{-2}) + x_i * (\sum{B})^{-1}) $

Here $ x_i $ is the column of $ X $ holding the predictions of base estimator $ i $. The partial derivatives for the other coefficients in $ B $ take the same form.
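
As a sanity check on the derivation, the sketch below compares the closed-form gradient with a central finite-difference approximation. The data is synthetic, and the function names (`loss`, `analytic_grad`, `numeric_grad`) are my own illustrative choices rather than anything used later in the exercise.

```python
import numpy as np

def loss(b, X, y):
    """0.5 * sum of squared errors of the normalized weighted average."""
    resid = X @ (b / b.sum()) - y
    return 0.5 * np.sum(resid ** 2)

def analytic_grad(b, X, y):
    """Gradient from the derivation, for all coefficients at once."""
    s = b.sum()
    resid = X @ b / s - y
    # d yhat / d b_i = x_i / s - (X @ b) / s**2, one column per coefficient.
    d_yhat = X / s - (X @ b)[:, None] / s ** 2
    return resid @ d_yhat

def numeric_grad(b, X, y, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    g = np.zeros_like(b)
    for i in range(b.size):
        step = np.zeros_like(b)
        step[i] = eps
        g[i] = (loss(b + step, X, y) - loss(b - step, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))    # synthetic stand-in for base predictions
y_demo = rng.normal(size=20)
b_demo = np.array([0.2, 0.5, 0.9])   # deliberately not summing to 1

g_analytic = analytic_grad(b_demo, X_demo, y_demo)
g_numeric = numeric_grad(b_demo, X_demo, y_demo)

print("analytic:", g_analytic)
print("numeric: ", g_numeric)
print("agree:", np.allclose(g_analytic, g_numeric, atol=1e-5))
```

One property worth noting: the loss depends on $ B $ only through the normalized weights, so rescaling $ B $ leaves it unchanged and the gradient is orthogonal to $ B $. This is why the optimization needs an explicit re-normalization (or some other constraint) to pin down the scale.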

The data I'll use for this exercise is:  
Liver Disorders  
Donated on 5/14/1990  
BUPA Medical Research Ltd. database donated by Richard S. Forsyth

In [1]:
from scipy.optimize import minimize
from sklearn.datasets import fetch_openml

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

In [2]:
X,y = fetch_openml(data_id=8,as_frame=True,return_X_y=True, parser='auto')

Here I'll create a Linear Regression, a Random Forest Regressor (with growth restrictions), and a KNN Regressor. These algorithms have very different properties: linearity, decision-based non-linearity, and spatial (distance-based) considerations, respectively. I'll apply the same scaling step as preprocessing in all three pipelines.



In [3]:
l1 = LinearRegression()
l2 = RandomForestRegressor(random_state=42,max_depth=4,n_estimators=50)
l3 = KNeighborsRegressor(n_neighbors=5)

m1 = Pipeline(
    steps=[
        ('scaler',MaxAbsScaler()),
        ('learner',l1)
    ]
)

m2 = Pipeline(
    steps=[
        ('scaler',MaxAbsScaler()),
        ('learner',l2)
    ]
)

m3 = Pipeline(
    steps=[
        ('scaler',MaxAbsScaler()),
        ('learner',l3)
    ]
)

Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size=.2,random_state=42)

m1.fit(Xtrain,ytrain)
m2.fit(Xtrain,ytrain)
m3.fit(Xtrain,ytrain)

p1 = m1.predict(Xtrain)
p2 = m2.predict(Xtrain)
p3 = m3.predict(Xtrain)

t1 = m1.predict(Xtest)
t2 = m2.predict(Xtest)
t3 = m3.predict(Xtest)

mae1 = mean_absolute_error(ytrain,p1)
mae2 = mean_absolute_error(ytrain,p2)
mae3 = mean_absolute_error(ytrain,p3)

tmae1 = mean_absolute_error(ytest,t1)
tmae2 = mean_absolute_error(ytest,t2)
tmae3 = mean_absolute_error(ytest,t3)

print('MAEs\n')
print(
    'Training:\n',
    '\nLinear Regression:',round(mae1,3),
    '\nRF:',round(mae2,3),
    '\nKNN:',round(mae3,3),'\n')
print(
    'Testing:\n',
    '\nLinear Regression:',round(tmae1,3),
    '\nRF:',round(tmae2,3),
    '\nKNN:',round(tmae3,3))

MAEs

Training:
 
Linear Regression: 2.276 
RF: 1.99 
KNN: 2.03 

Testing:
 
Linear Regression: 2.654 
RF: 2.482 
KNN: 2.678


The MAEs here are within a similar range. Naturally, the Random Forest could have been allowed to fit the training data more tightly, and potentially to test better. Here I prefer to work with three OK estimators.

Now, I'll take the training-set predictions of each, and build the above-mentioned weighted-averaging linear model. 

In [4]:
X1 = pd.DataFrame({
    'p1':p1,
    'p2':p2,
    'p3':p3
},index=Xtrain.index)
n_feats = X1.shape[1]

params = np.array([1/n_feats for n in range(n_feats)]) # Initialized with equal weights
params = np.clip(params,a_min=0,a_max=1) # Keep the initial weights within [0, 1] (a no-op for equal weights).
params = params/sum(params)
l = .001

for e in range(50):

    yhat = X1@params

    #sse = np.array((yhat - ytrain)**2).sum() # The SSE, but we don't need this for optimization.

    s = sum(params) # Equals 1 after normalization; kept explicit to mirror the derivation.
    resid = np.array(yhat - ytrain)
    # The gradient for all parameters: resid . (X/s - (X@params)/s^2).
    grad = resid@(np.array(X1)/s - np.array(X1@params)[:, np.newaxis]/s**2)

    params = np.clip(params - l*grad,0,1) # Clipped parameter update, so parameters will be bounded from 0 to 1.

    params = params/sum(params)
    #print(sse)

yhat = X1@params/sum(params)

mae_ = round(mean_absolute_error(ytrain,yhat),3)

X1_test = pd.DataFrame({
    'p1':t1,
    'p2':t2,
    'p3':t3
})

ypred = X1_test@(params/sum(params))

tmae_ = round(mean_absolute_error(ytest,ypred),3)

print("Learned weights: ",np.round(params/sum(params),3))

print("Training MAE:",mae_)
print("Testing MAE:",tmae_)

Learned weights:  [0.    0.864 0.136]
Training MAE: 1.977
Testing MAE: 2.484


Here the testing MAE (2.484) is comparable to the testing MAE of the Random Forest on its own (2.482), and the Random Forest's weight (0.864) is indeed the dominant weight after optimization, as expected.
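
The first cell imports `scipy.optimize.minimize`, but the loop above never uses it. As a hedged alternative to the hand-written clipped gradient loop, the same constrained problem can be handed to a general-purpose solver. The sketch below uses SLSQP with box bounds and a sum-to-one equality constraint on synthetic data; the function name `fit_simplex_weights` and the demo arrays are hypothetical, and this is a different optimizer from the one used above, so its weights need not match the loop's exactly.

```python
import numpy as np
from scipy.optimize import minimize

def fit_simplex_weights(P, y):
    """Least-squares weights over the columns of P, each in [0, 1], summing to 1."""
    P = np.asarray(P, dtype=float)
    y = np.asarray(y, dtype=float)
    k = P.shape[1]

    def half_sse(w):
        resid = P @ w - y
        return 0.5 * resid @ resid

    def grad(w):
        return P.T @ (P @ w - y)

    result = minimize(
        half_sse,
        x0=np.full(k, 1.0 / k),                      # start from equal weights
        jac=grad,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,                     # condition 1: 0 <= b <= 1
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # weights sum to 1
    )
    return result.x

# Synthetic stand-in: three noisy "estimators" of the same signal.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
P_demo = np.column_stack([
    signal + rng.normal(scale=1.0, size=200),
    signal + rng.normal(scale=0.3, size=200),   # the least noisy column
    signal + rng.normal(scale=0.6, size=200),
])

w = fit_simplex_weights(P_demo, signal)
print(np.round(w, 3))
```

With the constraint stated directly, there is no need to differentiate through the normalization; the solver works on the plain least-squares objective and enforces the weighted-average conditions itself.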

As an alternative, what if a simple linear regression with a non-negativity constraint were applied, without the condition that the parameters must sum to 1? The outcome is similar: after normalizing the resulting coefficients, the Random Forest weight dominates.

In [5]:
X1 = pd.DataFrame({
    'p1':p1,
    'p2':p2,
    'p3':p3
})

tmp_model = LinearRegression(fit_intercept=False,positive=True).fit(X1,ytrain)

tmp_model.coef_/sum(tmp_model.coef_)

array([0.        , 0.85115733, 0.14884267])
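
A regression with no intercept and non-negative coefficients is a non-negative least squares (NNLS) problem, so the same idea can be sketched with `scipy.optimize.nnls`. The data below is synthetic and the variable names are hypothetical. The sketch also illustrates a caveat: normalizing the coefficients after the fit changes the predictions unless the raw coefficients already sum to 1.

```python
import numpy as np
from scipy.optimize import nnls

# Synthetic stand-in for three columns of base-estimator predictions.
rng = np.random.default_rng(1)
target = rng.normal(loc=5.0, scale=2.0, size=300)
P_demo = np.column_stack([
    target + rng.normal(scale=2.0, size=300),
    target + rng.normal(scale=0.5, size=300),   # the least noisy column
    target + rng.normal(scale=1.0, size=300),
])

# Non-negative least squares: minimize ||P w - y|| subject to w >= 0, no intercept.
coef, residual_norm = nnls(P_demo, target)

weights = coef / coef.sum()     # rescaled so they read as averaging weights
pred_raw = P_demo @ coef        # what the constrained regression predicts
pred_avg = P_demo @ weights     # what the weighted average predicts

print("raw coefficients:", np.round(coef, 3), "sum:", round(float(coef.sum()), 3))
print("normalized weights:", np.round(weights, 3))
print("largest gap between the two predictions:", np.abs(pred_raw - pred_avg).max())
```

If the two sets of predictions differ noticeably, the normalized weights are better read as a description of relative importance than as a drop-in replacement for the fitted regression.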

Let's rerun the full exercise with default Random Forest parameters, which allow that estimator to fit the training data more closely. The expectation is that the Random Forest estimate will dominate even more strongly:

In [6]:
def full_exercise(rf_max_depth=None,rf_n_estimators=100):
    l1 = LinearRegression()
    l2 = RandomForestRegressor(random_state=42,max_depth=rf_max_depth,n_estimators=rf_n_estimators)
    l3 = KNeighborsRegressor(n_neighbors=5)

    m1 = Pipeline(
        steps=[
            ('scaler',MaxAbsScaler()),
            ('learner',l1)
        ]
    )

    m2 = Pipeline(
        steps=[
            ('scaler',MaxAbsScaler()),
            ('learner',l2)
        ]
    )

    m3 = Pipeline(
        steps=[
            ('scaler',MaxAbsScaler()),
            ('learner',l3)
        ]
    )

    Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size=.2,random_state=42)

    m1.fit(Xtrain,ytrain)
    m2.fit(Xtrain,ytrain)
    m3.fit(Xtrain,ytrain)

    p1 = m1.predict(Xtrain)
    p2 = m2.predict(Xtrain)
    p3 = m3.predict(Xtrain)

    t1 = m1.predict(Xtest)
    t2 = m2.predict(Xtest)
    t3 = m3.predict(Xtest)

    mae1 = mean_absolute_error(ytrain,p1)
    mae2 = mean_absolute_error(ytrain,p2)
    mae3 = mean_absolute_error(ytrain,p3)

    tmae1 = mean_absolute_error(ytest,t1)
    tmae2 = mean_absolute_error(ytest,t2)
    tmae3 = mean_absolute_error(ytest,t3)

    print('MAEs\n')
    print(
        'Training:\n',
        '\nLinear Regression:',round(mae1,3),
        '\nRF:',round(mae2,3),
        '\nKNN:',round(mae3,3),'\n')
    print(
        'Testing:\n',
        '\nLinear Regression:',round(tmae1,3),
        '\nRF:',round(tmae2,3),
        '\nKNN:',round(tmae3,3))
    
    X1 = pd.DataFrame({
        'p1':p1,
        'p2':p2,
        'p3':p3
    },index=Xtrain.index)
    n_feats = X1.shape[1]

    params = np.array([1/n_feats for n in range(n_feats)]) # Initialized with equal weights
    params = np.clip(params,a_min=0,a_max=1) # Keep the initial weights within [0, 1] (a no-op for equal weights).
    params = params/sum(params)
    l = .001

    for e in range(50):

        yhat = X1@params

        #sse = np.array((yhat - ytrain)**2).sum() # The SSE, but we don't need this for optimization.

        s = sum(params) # Equals 1 after normalization; kept explicit to mirror the derivation.
        resid = np.array(yhat - ytrain)
        # The gradient for all parameters: resid . (X/s - (X@params)/s^2).
        grad = resid@(np.array(X1)/s - np.array(X1@params)[:, np.newaxis]/s**2)

        params = np.clip(params - l*grad,0,1) # Clipped parameter update, so parameters will be bounded from 0 to 1.

        params = params/sum(params)
        #print(sse)

    yhat = X1@params/sum(params)

    mae_ = round(mean_absolute_error(ytrain,yhat),3)

    X1_test = pd.DataFrame({
        'p1':t1,
        'p2':t2,
        'p3':t3
    })

    ypred = X1_test@(params/sum(params))

    tmae_ = round(mean_absolute_error(ytest,ypred),3)

    print("\nLearned weights: ",np.round(params/sum(params),3))

    print("Training MAE:",mae_)
    print("Testing MAE:",tmae_)
    
    return

full_exercise()

MAEs

Training:
 
Linear Regression: 2.276 
RF: 0.94 
KNN: 2.03 

Testing:
 
Linear Regression: 2.654 
RF: 2.49 
KNN: 2.678

Learned weights:  [0. 1. 0.]
Training MAE: 0.94
Testing MAE: 2.49


That is indeed what we observe: the Random Forest is the only estimator left with a non-zero weight.
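
One thing to keep in mind when reading this result is that the weights are fitted on training-set predictions, so an estimator that fits the training data very closely looks better to the weight optimization than its test error suggests. A variation that is not part of this exercise is to fit the weights on out-of-fold predictions instead. The sketch below assumes synthetic data from `make_regression` and uses scikit-learn's `cross_val_predict`; the variable names are hypothetical, and the resulting weights say nothing about the liver-disorders data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data; the notebook's dataset is not used here.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

models = [
    LinearRegression(),
    RandomForestRegressor(n_estimators=50, random_state=42),
    KNeighborsRegressor(n_neighbors=5),
]

# Each row's prediction comes from a model fitted without that row (5-fold CV).
oof = np.column_stack(
    [cross_val_predict(m, X_demo, y_demo, cv=5) for m in models]
)

# Same non-negative, no-intercept regression as in the alternative above,
# but fitted on the out-of-fold predictions.
meta = LinearRegression(fit_intercept=False, positive=True).fit(oof, y_demo)
weights = meta.coef_ / meta.coef_.sum()

print("out-of-fold weights:", np.round(weights, 3))
```

The base estimators would then be refit on the full training set before the learned weights are applied to their test-set predictions.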

I expect this weighted-averaging technique to be a weaker final estimator than a simple regression with an intercept. However, if subject matter experts (SMEs) have a strong desire to consider independent estimates directly in a weighted-averaging scheme, then estimating the weights via optimization can yield a stronger estimator than an SME-selected weighting scheme.