# Multiple Outputs

A neural network can parameterize a multivariate probability distribution over several outputs as a function of its input. In practice, however, it is common to treat each predicted variable as independent of the others given the input.

Under this assumption, the value of one output does not change the probability of another, so the joint probability factorizes into a product of per-output terms. This simplifies both the model and the loss.

Given a neural network $f$ with parameters $\phi$ and input $x_i$, the probability of a multivariate output $y$ under the independence assumption is:

$$Pr(y|f[x_i,\phi]) = \prod_{d} Pr(y_d|f_d[x_i,\phi])$$

- where $f_d[x_i,\phi]$ is the $d^{th}$ set of network outputs, which describe the parameters of the distribution over $y_d$.
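The factorization can be checked numerically. The sketch below uses made-up per-output probabilities (the values in `per_output_probs` are assumptions for illustration) and shows that the joint probability is their product, or equivalently that the joint log probability is the sum of the per-output log probabilities.

```python
import numpy as np

# Hypothetical values of Pr(y_d | f_d[x_i, phi]) for D = 3 outputs.
per_output_probs = np.array([0.9, 0.5, 0.8])

# Under the independence assumption the joint probability is the product.
joint_prob = np.prod(per_output_probs)

# Equivalently, the joint log probability is the sum of per-output log probabilities.
joint_log_prob = np.sum(np.log(per_output_probs))

print(joint_prob)
print(np.isclose(np.log(joint_prob), joint_log_prob))
```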

To model multiple discrete variables, where each $y_d$ can take one of $K$ values, we use a separate categorical distribution for each $y_d$. The network outputs $f_d[x_i,\phi]$ then predict the $K$ values that define the categorical distribution over $y_d$.
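One way to realize this, sketched here in NumPy, is to let the network emit $D \times K$ raw values and apply a softmax separately to each group of $K$. The function name `split_into_categoricals` and the flat output layout are assumptions for illustration, not something fixed by the text.

```python
import numpy as np

def split_into_categoricals(flat_outputs, D, K):
    """Turn (batch, D*K) raw network outputs into (batch, D, K) probabilities."""
    logits = flat_outputs.reshape(-1, D, K)
    # Subtract the per-group maximum for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    # Softmax over the last axis: one categorical distribution per output d.
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
D, K = 3, 5
flat = rng.normal(size=(4, D * K))   # stand-in for the network's raw outputs
probs = split_into_categoricals(flat, D, K)

print(probs.shape)          # (4, 3, 5)
print(probs.sum(axis=-1))   # every entry is 1 up to rounding
```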

When we train the model by minimizing the negative log probability, the product over outputs becomes a sum:

$$L[\phi] = -\sum_{i=1}^{I}\log\left[Pr(y_i|f[x_i,\phi])\right] = -\sum_{i=1}^{I}\sum_{d}\log\left[Pr(y_{id}|f_d[x_i,\phi])\right]$$

- where $y_{id}$ is the $d^{th}$ output from the $i^{th}$ training example.
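The double sum translates directly into code. This is a minimal sketch that assumes the per-output categorical probabilities are already available as an array of shape `(I, D, K)` and the labels as integer class indices of shape `(I, D)`; the helper name `multi_output_nll` is hypothetical.

```python
import numpy as np

def multi_output_nll(probs, labels):
    """Negative log-likelihood summed over examples i and outputs d.

    probs:  (I, D, K) categorical probabilities
    labels: (I, D) integer class indices in [0, K)
    """
    I, D, _ = probs.shape
    i_idx = np.arange(I)[:, None]
    d_idx = np.arange(D)[None, :]
    # Probability assigned to the true class of each (example, output) pair.
    picked = probs[i_idx, d_idx, labels]
    # Clip so the logarithm stays finite if a probability is exactly 0.
    return -np.sum(np.log(np.clip(picked, 1e-12, 1.0)))

# Two examples, two outputs, two classes (made-up numbers).
probs = np.array([[[0.7, 0.3], [0.2, 0.8]],
                  [[0.5, 0.5], [0.9, 0.1]]])
labels = np.array([[0, 1],
                   [1, 0]])

loss = multi_output_nll(probs, labels)
print(loss)
```

Here the selected probabilities are 0.7, 0.8, 0.5 and 0.9, so the loss equals $-\log(0.7 \cdot 0.8 \cdot 0.5 \cdot 0.9)$.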

Because the loss is a sum over outputs, each output contributes its own term, and the per-output terms can be computed separately and added together.

To make two or more prediction types simultaneously, we similarly assume the errors in each are independent. For example, to predict wind direction and strength, we might choose the von Mises distribution (defined on circular domains) for the direction and the exponential distribution (defined on positive real numbers) for the strength.
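For the wind example, the combined loss is the sum of a von Mises term and an exponential term. The sketch below writes both negative log-likelihoods in NumPy; the function names and the example values are assumptions for illustration. In a real network, the concentration `kappa` and the rate `lam` must be positive, which is usually enforced by passing the raw outputs through a positive function such as softplus.

```python
import numpy as np

def von_mises_nll(y, mu, kappa):
    # -log Pr(y) = -kappa*cos(y - mu) + log(2*pi*I0(kappa)),
    # where I0 is the modified Bessel function of the first kind, order 0.
    return -kappa * np.cos(y - mu) + np.log(2 * np.pi * np.i0(kappa))

def exponential_nll(y, lam):
    # -log Pr(y) = -log(lam) + lam*y, for y >= 0.
    return -np.log(lam) + lam * y

def wind_nll(direction, strength, mu, kappa, lam):
    # Independence assumption: the two per-output losses simply add.
    return np.sum(von_mises_nll(direction, mu, kappa) + exponential_nll(strength, lam))

# Made-up targets and predicted parameters for two training examples.
direction = np.array([0.1, 3.0])   # radians
strength = np.array([2.0, 0.5])    # positive reals
mu = np.array([0.0, 3.1])          # predicted mean direction
kappa = np.array([2.0, 2.0])       # predicted concentration (> 0)
lam = np.array([0.5, 1.5])         # predicted rate (> 0)

total = wind_nll(direction, strength, mu, kappa, lam)
print(total)
```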

In [38]:
import numpy as np
import tensorflow as tf
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import CategoricalCrossentropy

In [16]:
K = 5  # K is the number of categories for y_d
numberofSamples = 1000 # Number of samples
inputFeatures = 5

In [22]:
model = Sequential(
    [
        Dense(64,activation="relu",input_shape=(inputFeatures,)), # Input has inputFeatures (5) features
        Dense(32,activation="relu"),
        Dense(K,activation="softmax")
    ]
)

In [23]:
model.compile(optimizer="adam",loss=CategoricalCrossentropy(),metrics=["accuracy"])

In [24]:
def SimulateData(countNumber:int,categoryNumber:int)->tuple:
  # Simulating prediction probabilities for each sample and category
  predictions = np.random.rand(countNumber,categoryNumber) # Non-negative values, so each row can be normalized to valid probabilities
  predictions /= predictions.sum(axis=1,keepdims=True) # Normalize each row to sum to 1
  trueLabels = np.random.randint(0,categoryNumber,size=(countNumber,))
  trueOneHot = np.eye(categoryNumber)[trueLabels] # One-hot encode labels
  return predictions,trueOneHot

In [25]:
def ComputeLoss(predictions:np.ndarray,trueLabels:np.ndarray)->float:
  # Clip predictions away from 0 and 1 so the logarithm stays finite
  predictions = np.clip(predictions,1e-7,1-1e-7)
  # Mean categorical cross-entropy over the samples
  loss = -np.sum(trueLabels*np.log(predictions))/len(trueLabels)
  return loss

In [26]:
xTrain,yTrain = SimulateData(numberofSamples,K)

In [27]:
print(f"Shape of data: {xTrain.shape}")
print(f"Label shape: {yTrain.shape}")

Shape of data: (1000, 5)
Label shape: (1000, 5)


In [28]:
model.fit(xTrain,yTrain,epochs=10,batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7d5889977e20>

In [29]:
xTest,_ = SimulateData(numberofSamples,K)

In [30]:
print(f"Test data shape: {xTest.shape}")

Test data shape: (1000, 5)


In [31]:
def PredictionIndependent(model,sample:np.ndarray)->np.ndarray:
  predictions = model.predict(sample)
  return predictions

In [32]:
predictions = PredictionIndependent(model,xTest)
print(f"Predictions Shape: {predictions.shape}")

Predictions Shape: (1000, 5)


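- Another example: independent regression outputs with scikit-learn's `MultiOutputRegressor`, which fits one regressor per target: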

In [35]:
sample,groundTruth = make_regression(
    n_samples=1000,
    n_features=5,
    n_targets=2,
    noise=0.1,
    random_state=1
)

In [36]:
print(f"Sample shape: {sample.shape}")
print(f"Labels shape: {groundTruth.shape}")

Sample shape: (1000, 5)
Labels shape: (1000, 2)


In [37]:
xTrain,xTest,yTrain,yTest = train_test_split(sample,groundTruth,test_size=0.2,random_state=1)

In [39]:
regressor = GradientBoostingRegressor(n_estimators=100,random_state=1)
model = MultiOutputRegressor(regressor)

In [41]:
model.fit(xTrain,yTrain)

In [42]:
predictions = model.predict(xTest)

In [43]:
print(f"Prediction shape: {predictions.shape}")

Prediction shape: (200, 2)


In [44]:
print(f"Prediction output example:\n{predictions[:5]}")

Prediction output example:
[[ 160.7917755   -63.72040025]
 [ 178.24929461  -34.3923282 ]
 [-122.37877501 -148.83883248]
 [  59.95969096   95.38790506]
 [-201.8961147  -152.31941062]]


- Another example: multi-output linear regression fit by gradient descent in NumPy:

In [45]:
def InitializeParameters(featureNumber:int,outputNumber:int)->np.ndarray:
  np.random.seed(42)
  return np.random.randn(featureNumber,outputNumber)

In [46]:
def Prediction(sample:np.ndarray,parameters:np.ndarray)->np.ndarray:
  return np.dot(sample,parameters)

In [47]:
def MSELoss(groundTruth:np.ndarray,predictions:np.ndarray)->np.ndarray:
  return np.mean((groundTruth-predictions)**2)

In [48]:
def GradientDescent(sample:np.ndarray,groundTruth:np.ndarray,parameters:np.ndarray,learningRate:float=0.01,iterations:int=1000)->np.ndarray:
  count = len(groundTruth)
  parameters = parameters.copy() # Work on a copy so the caller's initial parameters are not overwritten
  for _ in range(iterations):
    predictions = Prediction(sample,parameters)
    error = predictions-groundTruth
    gradients = np.dot(sample.T,error)/count # Gradient of the squared error, up to a constant factor
    parameters -= learningRate*gradients
  return parameters

In [49]:
sample,groundTruth = make_regression(
    n_samples=1000,
    n_features=5,
    n_targets=2,
    noise=0.1,
    random_state=1
)

In [50]:
xTrain,xTest,yTrain,yTest = train_test_split(sample,groundTruth,test_size=0.2,random_state=1)

In [51]:
xTrainBias = np.c_[np.ones((xTrain.shape[0],1)),xTrain]
xTestBias = np.c_[np.ones((xTest.shape[0],1)),xTest]
print(f"Train Bias shape:\n{xTrainBias.shape}")
print(f"Test Bias shape:\n{xTestBias.shape}")

Train Bias shape:
(800, 6)
Test Bias shape:
(200, 6)


In [52]:
featureNumber = xTrainBias.shape[1]
outputNumber = yTrain.shape[1]
print(f"Feature Count:\n{featureNumber}")
print(f"Output Count:\n{outputNumber}")

Feature Count:
6
Output Count:
2


In [53]:
parameters = InitializeParameters(featureNumber,outputNumber)
print(f"Parameters:\n{parameters}")

Parameters:
[[ 0.49671415 -0.1382643 ]
 [ 0.64768854  1.52302986]
 [-0.23415337 -0.23413696]
 [ 1.57921282  0.76743473]
 [-0.46947439  0.54256004]
 [-0.46341769 -0.46572975]]


In [54]:
optimized = GradientDescent(xTrainBias,yTrain,parameters)
print(f"Optimized Parameters:\n{optimized}")

Optimized Parameters:
[[ 6.41506808e-03 -5.76170471e-03]
 [ 7.27071249e+01  3.13973856e+01]
 [ 9.54029129e+00  7.19777695e+01]
 [ 3.27727804e+01  2.31694098e+01]
 [ 7.48195925e+01  6.07639188e+01]
 [ 7.05764292e+00  8.82642970e+00]]


In [56]:
predictions = Prediction(xTestBias,optimized)
print(f"Predictions Example:\n{predictions[:5]}")

Predictions Example:
[[ 125.41139384  -96.08222377]
 [ 177.77100372  -51.76026902]
 [-123.69643157 -162.66882841]
 [  61.3783659    84.32535235]
 [-233.81605846 -186.66328725]]
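Because the squared-error loss sums over output columns, the columns of the parameter matrix do not interact: fitting all outputs jointly gives the same result as fitting each output on its own. The sketch below illustrates this with a closed-form least-squares solve in place of gradient descent, on synthetic data generated only for this check (the data, shapes, and variable names are assumptions, not the dataset used above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only.
X = np.c_[np.ones((200, 1)), rng.normal(size=(200, 3))]  # bias column + 3 features
true_W = rng.normal(size=(4, 2))                         # 2 outputs
Y = X @ true_W + 0.1 * rng.normal(size=(200, 2))

# Joint fit: all output columns at once.
W_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Separate fits: one least-squares problem per output column.
W_separate = np.column_stack([
    np.linalg.lstsq(X, Y[:, d], rcond=None)[0] for d in range(Y.shape[1])
])

print(np.allclose(W_joint, W_separate))
```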
