# Data Simulation and Model Comparison

In this notebook, I will simulate data based on the risk and ambiguity task in the modeling qualitative data project.

The goal is to compare a classic utility-function model with the Estimated Values model under different conditions, varying the number of simulated participants and the noise level.

In this version of the task, there are 84 trials with the following parameters:

Values: 5, 8, 12, 25

Risk: 0.25, 0.5, 0.75

Ambiguity: 0, 0.24, 0.5, 0.74

## Libraries Used in the Experiment

First, we must import the necessary libraries for data manipulation, probabilistic programming, and visualization.

In [1]:
# Data manipulation and analysis
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations and array manipulation
import scipy as sp  # For scientific and technical computing
from scipy.special import expit  # For the sigmoid function, the choice function
from scipy import stats  # To draw from a truncated normal distribution

# Probabilistic programming and Bayesian statistical modeling
import pymc as pm  
import arviz as az  

# Data visualization
import matplotlib.pyplot as plt 
import seaborn as sns  

# Suppressing warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

import logging
logger = logging.getLogger("pymc")
logger.propagate = False
logger.setLevel(logging.ERROR)

## Loading the Choices Dataset

We will load a CSV file that contains all the possible choices for the task without responses.

In [2]:
# Loading the CSV file into a pandas DataFrame
# 'sim.csv' is the file that contains the dataset with all the different possible choices
db = pd.read_csv('sim.csv')

## Simulating Decision-Making Data

Next, we define a helper to plot the simulated attitudes and a function to simulate the decision-making data.

This function generates risk and ambiguity attitudes, adds noise, and simulates choices based on these parameters.
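As a quick illustration of the choice rule the simulation relies on, the sketch below computes the probability of choosing the lottery on a single trial. The helper name `lottery_choice_prob` and the example parameter values are assumptions made for illustration; the formula follows the one used in `sim_data`, with a certain reference amount of 5.

```python
from scipy.special import expit


def lottery_choice_prob(value, risk, ambiguity, alpha, beta, ref_value=5.0):
    """Probability of choosing the lottery over a certain reference amount.

    Hypothetical helper for illustration: alpha is the risk attitude and
    beta the ambiguity attitude.
    """
    u_ref = ref_value ** alpha                                   # utility of the sure option
    u_lotto = (value ** alpha) * (risk - beta * ambiguity / 2)   # utility of the lottery
    return expit(u_lotto - u_ref)                                # logistic choice rule


# A lottery worth 12 with a 50% chance of winning and no ambiguity
p_risk_only = lottery_choice_prob(12, 0.5, 0.0, alpha=0.8, beta=0.65)
# The same lottery with an ambiguity level of 0.5
p_ambiguous = lottery_choice_prob(12, 0.5, 0.5, alpha=0.8, beta=0.65)

print(f"risk only: {p_risk_only:.3f}, ambiguous: {p_ambiguous:.3f}")
```

With a positive ambiguity attitude parameter, adding ambiguity lowers the effective winning probability, so the lottery is chosen less often.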

In [3]:
def plot_uncertainty_att(α_true, β_true):
    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    
    # Plot the histogram for α_true with the header "Risk attitude"
    sns.histplot(α_true, bins=30, kde=True, ax=axes[0])
    axes[0].set_title("Risk attitude")
    axes[0].set_xlabel("Risk attitude values")
    axes[0].set_ylabel("Frequency")
    
    # Plot the histogram for β_true with the header "Ambiguity attitudes"
    sns.histplot(β_true, bins=30, kde=True, ax=axes[1])
    axes[1].set_title("Ambiguity attitudes")
    axes[1].set_xlabel("Ambiguity attitudes values")
    axes[1].set_ylabel("Frequency")
    
    # Adjust layout
    plt.tight_layout()
    plt.show()

In [4]:
# Function to simulate decision-making data
def sim_data(n_subs=10, noise=0.00001, 
             a_a=4, a_b=7, 
             b_mean=0.65, b_sd=1, lower_bound=-1.4, upper_bound=1.4):
    """
    Simulates decision-making data for a given number of subjects.

    Parameters:
    n_subs (int): Number of subjects to simulate.
    noise (float): Standard deviation of noise to add to risk and ambiguity attitudes.
    a_a (float): Alpha parameter for beta distribution to generate risk attitudes.
    a_b (float): Beta parameter for beta distribution to generate risk attitudes.
    b_mean (float): Mean for truncated normal distribution to generate ambiguity attitudes.
    b_sd (float): Standard deviation for truncated normal distribution to generate ambiguity attitudes.
    lower_bound (float): Lower bound for truncated normal distribution.
    upper_bound (float): Upper bound for truncated normal distribution.

    Returns:
    simdata: pd.DataFrame, Simulated dataset.
    sub_idx: np.ndarray, Array of subject indices.
    """

    # Generate risk attitudes using a beta distribution and scale it to 0-2
    α_part = np.random.beta(a_a, a_b, n_subs)
    α_true = α_part * 2
    
    # Generate ambiguity attitudes using a truncated normal distribution
    a, b = (lower_bound - b_mean) / b_sd, (upper_bound - b_mean) / b_sd
    β_true = stats.truncnorm.rvs(a, b, loc=b_mean, scale=b_sd, size=n_subs)

    # Create arrays of all the choices for each simulated participant
    value = np.tile(np.array(db.value), n_subs)
    risk = np.tile(np.array(db.risk), n_subs)
    ambiguity = np.tile(np.array(db.ambiguity), n_subs)
 
    # Define constant reference values
    refValue = 5  # constant reference value
    refProbability = 1  # constant reference probability
    refAmbiguity = 0  # constant reference ambiguity

    # Create arrays of reference values, replicated for each trial
    refProbabilities = np.tile(refProbability, len(value))
    refValue = np.tile(refValue, len(value))
    refAmbiguities = np.tile(refAmbiguity, len(value))

    # Repeat risk and ambiguity attitudes for each simulated participant
    # (integer division, because np.repeat expects an integer repeat count)
    riskTol = np.repeat(α_true, len(risk) // n_subs)
    ambTol = np.repeat(β_true, len(ambiguity) // n_subs)

    # Add noise to risk and ambiguity attitudes
    noise_dist_a = np.random.normal(loc=0, scale=noise, size=len(riskTol))
    noise_dist_b = np.random.normal(loc=0, scale=noise, size=len(riskTol))
    riskTol += noise_dist_a
    ambTol += noise_dist_b

    # Adjust values to stay within specified bounds
    riskTol = np.clip(riskTol, 0.1, 1.6)
    ambTol = np.clip(ambTol, -1.4, 1.4)

    # Calculate utility for reference and lottery
    uRef = refValue ** riskTol
    uLotto = (value ** riskTol) * (risk - ambTol * (ambiguity / 2))
    p = sp.special.expit(uLotto - uRef)  # Apply logistic function to calculate choice probabilities

    # Simulate choices based on probabilities
    choice = np.random.binomial(1, p, len(p))

    # Create subject indices for each trial
    sub_idx = np.repeat(np.arange(n_subs), len(db))  # len(db) trials per subject (84 in this task)
    ID = sub_idx + 1
    
    # Generate a DataFrame with the simulated data
    simdata = pd.DataFrame({'sub': ID,
                            'choice': choice,
                            'value': value, 
                            'risk': risk, 
                            'ambiguity': ambiguity,
                            'riskTol': riskTol,
                            'ambTol': ambTol})

    # Rank the value levels and create binary columns for each level
    simdata['level'] = simdata['value'].rank(method='dense').astype(int)

    simdata['l1'] = simdata.level > 0
    simdata['l2'] = simdata.level > 1
    simdata['l3'] = simdata.level > 2
    simdata['l4'] = simdata.level > 3

    simdata['l1'] = simdata['l1'].astype(int)
    simdata['l2'] = simdata['l2'].astype(int)
    simdata['l3'] = simdata['l3'].astype(int)
    simdata['l4'] = simdata['l4'].astype(int)
    
    return simdata, sub_idx

# Load the file with all different possible choices
db = pd.read_csv('sim.csv')
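The sampling step inside `sim_data` can be checked in isolation. The sketch below draws attitudes with the function's default parameters and confirms they stay within the intended ranges; the fixed seed, the sample size of 1000, and the plain-ASCII variable names are assumptions added for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)
n_subs = 1000

# Risk attitudes: Beta(4, 7) draws scaled from (0, 1) to (0, 2)
alpha_true = rng.beta(4, 7, size=n_subs) * 2

# Ambiguity attitudes: truncnorm takes its bounds in standard-deviation units
b_mean, b_sd, lower, upper = 0.65, 1.0, -1.4, 1.4
a, b = (lower - b_mean) / b_sd, (upper - b_mean) / b_sd
beta_true = stats.truncnorm.rvs(a, b, loc=b_mean, scale=b_sd,
                                size=n_subs, random_state=rng)

print(f"mean risk attitude: {alpha_true.mean():.3f}")
print(f"ambiguity attitude range: {beta_true.min():.3f} to {beta_true.max():.3f}")
```

`sim_data` itself draws from NumPy's global random state, so calling `np.random.seed` before it would be one way to make a simulation run repeatable.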

In [5]:
def Utility(df, n_subs, idx):
    """
    Estimate the utility function of the subjects using a model that accounts for both the value of an outcome 
    and the probability of its occurrence.

    Parameters:
    - df: DataFrame containing data on choice, value levels, risk, and ambiguity for each trial.
    - n_subs: Number of subjects in the dataset.
    - idx: Subject index for each trial (used for modeling individual variations).

    Returns:
    - trace: Samples from the posterior distribution.
    """
    
    # Define the probabilistic model for utility function
    with pm.Model() as Utility:
        
        # Hyperpriors define group-level distributions for subject-specific parameters.
        alpha_a = pm.TruncatedNormal('alpha_a', 4, 1, lower = 0)  # First shape parameter of the Beta prior on risk attitude
        alpha_b = pm.TruncatedNormal('alpha_b', 7, 3, lower = 0)  # Second shape parameter of the Beta prior on risk attitude
        bMu     = pm.Normal('bMu',   .65, 1)  # Group-level mean for ambiguity modulation

        # Individual subject priors.
        alpha = pm.Beta('alpha', alpha_a, alpha_b, shape = n_subs) # Subject-specific utility curvature
        α     = pm.Deterministic('α', alpha * 2) # Scale the value of alpha
        β     = pm.TruncatedNormal('β', bMu, 1, lower = -1.5, upper = 1.5, shape = n_subs) # Ambiguity modulation
        γ     = pm.LogNormal('γ', 0, .25, shape = n_subs) # Choice temperature: the SV difference is divided by γ, so larger γ means noisier choices

        # Calculate the expected value of the outcome using a power function.
        value = df['value'].values ** α[idx]  # Subjective value based on curvature parameter
        prob  = df['risk'].values  - (β[idx] * (df['ambiguity'].values/2))  # Probability of outcome considering ambiguity

        # Calculate the subjective value (SV) of the lottery for each trial
        svLotto = value * prob
        svRef   = 5 ** α[idx]  # Reference value

        # Convert SV into a probability of choosing the lottery using the inverse logit function.
        p  = (svLotto - svRef) / γ[idx]
        mu = pm.invlogit(p)

        # Define the likelihood of observations using a Binomial distribution, as the choice is binary.
        choice = pm.Binomial('choice', 1, mu, observed=df['choice'])

        trace = pm.sample(idata_kwargs={'log_likelihood':True})
           
    return trace
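Because the subjective-value difference is divided by γ before the logistic transform, γ behaves like a noise (temperature) parameter. The sketch below shows this for one hypothetical value difference; the numbers are illustrative assumptions, not estimates from the model.

```python
import numpy as np
from scipy.special import expit

sv_diff = 2.0  # hypothetical subjective-value advantage of the lottery
gammas = np.array([0.5, 1.0, 2.0, 4.0])

# Same transform as in the model: logistic of (SV difference / γ)
p_choose = expit(sv_diff / gammas)

for g, p in zip(gammas, p_choose):
    print(f"gamma={g:.1f} -> P(lottery)={p:.3f}")
```

As γ grows, the choice probability moves toward 0.5, so the same value difference has less influence on the choice.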

In [6]:
def estimate_values_ordinal(df, n_subs, idx):
    """
    Estimate the subjective value of each reward level using ordinal constraints and a group-level hyperprior for each level.
    Each level parameter is a positive increment added to the levels below it, so subjective values cannot decrease across levels.

    Parameters:
    - df: DataFrame with trial-specific details, such as choices, value levels, risk, and ambiguity levels.
    - n_subs: Total number of subjects in the dataset.
    - idx: A list indicating the subject ID for each observation/trial.

    Returns:
    - trace: Samples from the posterior distribution of the model.
    """
    
    with pm.Model() as estimate:

        # Hyperparameters for group-level distributions
        bMu  = pm.Normal('bMu', .65, 1)     # Mean for ambiguity effect distribution

        # Hyperparameters for group-level subjective value levels
        l1Mu = pm.TruncatedNormal('l1Mu', 4, 2, lower=0)  # Mean for value of level 1
        l2Mu = pm.TruncatedNormal('l2Mu', 4, 2, lower=0)  # ... level 2
        l3Mu = pm.TruncatedNormal('l3Mu', 4, 2, lower=0)  # ... level 3
        l4Mu = pm.TruncatedNormal('l4Mu', 4, 2, lower=0)  # ... level 4

        
        # Subject-specific priors 
        β = pm.Normal('β', bMu, 1, shape = n_subs)   # Modulation of ambiguity effect
        γ = pm.LogNormal('γ', 0, 0.25, shape = n_subs)   # Choice temperature: larger γ means more stochastic choices

        # Priors for subjective values of the different reward levels for each subject.
        level1 = pm.TruncatedNormal('level1', l1Mu, 1, lower = 0, shape = n_subs)
        level2 = pm.TruncatedNormal('level2', l2Mu, 1, lower = 0, shape = n_subs)
        level3 = pm.TruncatedNormal('level3', l3Mu, 1, lower = 0, shape = n_subs)
        level4 = pm.TruncatedNormal('level4', l4Mu, 1, lower = 0, shape = n_subs)

        # Calculate the total expected value for each trial by combining values from different levels
        val = (df['l1'].values * level1[idx] + 
               df['l2'].values * level2[idx] + 
               df['l3'].values * level3[idx] + 
               df['l4'].values * level4[idx]) 

        # Calculate adjusted probability by considering both risk and ambiguity levels modulated by β
        prob = (df['risk'].values) - (β[idx] * (df['ambiguity'].values/2))  

        # Compute the subjective value of the lottery option
        svLotto = val * prob
        svRef   = level1[idx]  # The subjective value of the reference option

        # Transform the SV difference between lottery and reference into a choice probability using the logistic function
        p  = (svLotto - svRef) / γ[idx]
        mu = pm.invlogit(p)

        # Likelihood of the observed choices given the computed probabilities
        choice = pm.Binomial('choice', 1, mu, observed=df['choice'])

        trace = pm.sample(idata_kwargs={'log_likelihood':True})
        
    return trace
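The indicator columns `l1` to `l4` make each level parameter an increment on top of the previous levels. The sketch below reproduces that coding for the four task values; the increment values for the single hypothetical subject are assumptions chosen for illustration.

```python
import numpy as np

values = np.array([5, 8, 12, 25])

# Dense rank of each value: 5 -> 1, 8 -> 2, 12 -> 3, 25 -> 4
level = np.searchsorted(np.unique(values), values) + 1

# Indicator columns as built in sim_data: column k is 1 when level > k
indicators = np.stack([(level > k).astype(int) for k in range(4)], axis=1)

# Hypothetical positive increments for one subject (level1 .. level4)
increments = np.array([4.0, 3.0, 2.0, 6.0])

# Subjective value of each reward: the sum of all increments up to its level
subjective_values = indicators @ increments

print(indicators)
print(subjective_values)
```

Since every increment is constrained to be positive, the resulting subjective values are ordered in the same way as the objective amounts, without assuming a particular functional form such as a power function.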

## Running Simulations and Model Comparisons

We then run simulations for different numbers of participants and levels of noise. The results of the model comparisons are stored in a data frame.

In [7]:
# Initialize an empty DataFrame to store results
results = pd.DataFrame(columns=['N', 'noise', 'Comparison'])

test = [(30, 0.1),(30, 0.3), (30, 0.5), (60, 0.3), (60, 0.5), (120, 0.5), (300, 0.5)]
# Loop over different values of N and noise
for SIM in test:
    try:
        N = SIM[0]
        noise = SIM[1]
        print(f"Simulating with N={N} and noise={noise}")
        
        sim, sim_idx = sim_data(N, noise=noise)
        utility = Utility(sim, N, sim_idx)
        estimated = estimate_values_ordinal(sim, N, sim_idx)
        
        comp = az.compare({
            'Classic Utility': utility,
            'Estimated values': estimated})

        # Save the results to the DataFrame 
        new_row = pd.DataFrame({'N': [N], 'noise': [noise], 'Comparison': [comp]})
        results = pd.concat([results, new_row], ignore_index=True)
        
    except Exception as e:
        print(f"Simulation failed for N={N} and noise={noise}. Error: {e}")


Simulating with N=30 and noise=0.1
Simulating with N=30 and noise=0.3
Simulating with N=30 and noise=0.5
Simulating with N=60 and noise=0.3
Simulating with N=60 and noise=0.5
Simulating with N=120 and noise=0.5
Simulating with N=300 and noise=0.5

## Printing the Results

In [8]:
for i in range(len(results)):
    # Print the values of N and noise for the current row
    print(f"N: {results['N'][i]}, noise: {results['noise'][i]}")
    # Print the comparison outcome for the current row
    print("Comparison outcome:")
    print(results['Comparison'][i])
    print("\n")  # Print a newline for better readability

N: 30, noise: 0.1
Comparison outcome:
                  rank     elpd_loo      p_loo  elpd_diff    weight  \
Classic Utility      0 -1237.291188  59.220923   0.000000  0.928588   
Estimated values     1 -1277.650057  78.840682  40.358868  0.071412   

                         se       dse  warning scale  
Classic Utility   26.289285  0.000000     True   log  
Estimated values  25.053221  9.241349    False   log  


N: 30, noise: 0.3
Comparison outcome:
                  rank     elpd_loo      p_loo  elpd_diff    weight  \
Classic Utility      0 -1404.250915  68.661170   0.000000  0.671807   
Estimated values     1 -1410.299069  84.274184   6.048153  0.328193   

                         se       dse  warning scale  
Classic Utility   25.758849  0.000000     True   log  
Estimated values  23.564797  9.048145    False   log  


N: 30, noise: 0.5
Comparison outcome:
                  rank     elpd_loo      p_loo  elpd_diff    weight  \
Estimated values     0 -1432.324020  89.450778   0.000000  0.692139   
Classic Utility      1 -1450.721192  76.544238  18.397173  0.307861   

Estimated values  24

## Summary of Results

In the simulations, the performance of the two models, <b>Classic Utility</b> and <b>Estimated Values</b>, was compared across different combinations of the number of participants (N) and noise levels.

<ul>
  <li><b>N: 30, noise: 0.1</b>
    <ul>
      <li>Classic Utility model fits better, with a higher <i>elpd_loo</i> (-1237.3 vs. -1277.7; <i>elpd_diff</i> = 40.4, <i>dse</i> = 9.2) and a weight of 0.9286.</li>
    </ul>
  </li>
  <li><b>N: 30, noise: 0.3</b>
    <ul>
      <li>Classic Utility model still ranks first, but the difference is small relative to its uncertainty (<i>elpd_diff</i> = 6.0, <i>dse</i> = 9.0). Classic Utility weight: 0.6718.</li>
    </ul>
  </li>
  <li><b>N: 30, noise: 0.5</b>
    <ul>
      <li>Estimated Values model starts to outperform the Classic Utility model with a higher <i>elpd_loo</i> and a weight of 0.6921.</li>
    </ul>
  </li>
  <li><b>N: 60, noise: 0.3</b>
    <ul>
      <li>Classic Utility model fits better, but the difference is reduced compared to lower noise levels. Classic Utility weight: 0.7167.</li>
    </ul>
  </li>
  <li><b>N: 60, noise: 0.5</b>
    <ul>
      <li>Estimated Values model fits better again, indicating its robustness in higher noise. Estimated Values weight: 0.5575.</li>
    </ul>
  </li>
  <li><b>N: 120, noise: 0.5</b>
    <ul>
      <li>Estimated Values model continues to outperform the Classic Utility model. Estimated Values weight: 0.5727.</li>
    </ul>
  </li>
  <li><b>N: 300, noise: 0.5</b>
    <ul>
      <li>Estimated Values model again ranks first over the Classic Utility model, with a substantial <i>elpd_diff</i> and a weight of 0.6287.</li>
    </ul>
  </li>
</ul>
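A rough way to judge how decisive each comparison is, offered here as a common heuristic rather than a formal test, is to divide <i>elpd_diff</i> by its standard error <i>dse</i>. The sketch below does this for the two comparisons whose full tables are printed above; the dictionary layout and labels are assumptions for illustration, while the numbers are copied from that output.

```python
# elpd_diff and dse copied from the printed comparison tables
comparisons = {
    "N=30, noise=0.1": (40.358868, 9.241349),
    "N=30, noise=0.3": (6.048153, 9.048145),
}

ratios = {}
for label, (elpd_diff, dse) in comparisons.items():
    # Difference in expected log predictive density, in units of its standard error
    ratios[label] = elpd_diff / dse
    print(f"{label}: elpd_diff / dse = {ratios[label]:.2f}")
```

A ratio well above 2, as at noise 0.1, points to a clear preference for the top-ranked model, whereas a ratio below 1, as at noise 0.3, means the two models are hard to tell apart on predictive accuracy.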

## Conclusion

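As the noise level increases, the <b>Estimated Values</b> model tends to outperform the <b>Classic Utility</b> model: Classic Utility was preferred at noise levels of 0.1 and 0.3, whereas Estimated Values was preferred at a noise level of 0.5 for every sample size tested (N = 30, 60, 120, and 300). This suggests that the Estimated Values model is more robust to noise across the sample sizes examined.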
