## CPAT simulation code

This Jupyter notebook contains the code needed to run the simulations reported in Elliott & Buttery (2022), using `RaschPy` to produce CPAT estimates. These estimates may then be compared with estimates from other estimation algorithms. That comparison is not included here, but the code below saves all the response dataframes, which may be passed to other estimation algorithms as appropriate.

Elliott, M. and Buttery, P. J. (2022) Non-iterative Conditional Pairwise Estimation for the Rating Scale Model, *Educational and Psychological Measurement*, *82*(5), 989-1019.


Import packages and set working directory (change this as appropriate).

In [None]:
import RaschPy as rp
import numpy as np
import pandas as pd
import random
import os
import pickle

os.chdir('my_working_directory')

Set high-level experiment parameters: number of simulations, proportion of data removed for reduced and missing data sets, and priority vector methods to be compared, all as per Elliott & Buttery (2022).

In [None]:
no_of_sims = 10000
missing_prop = 0.3
methods = ['ls', 'evm']

Set ranges for generating parameters, all as per Elliott & Buttery (2022).

In [None]:
item_count_range = [4, 10]
max_score_range = [3, 7]
item_diff_range = [0.5, 3.5]
sample_size_log_10_range = [2, 4]
offset_range = [-0.5, 1]
person_sd_range = [1, 3.5]
category_base_range = [0.5, 2]
disorder_prob = 0.5
max_disorder_range = [0.5, 1]
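
Most of the ranges above are sampled uniformly, but the sample size is drawn uniformly on a log10 scale and then converted back, so each order of magnitude is equally likely. The sketch below illustrates that one draw using only the standard library. `draw_sample_size` is a hypothetical helper and the seed is arbitrary; neither is part of the notebook, which uses unseeded `np.random` calls.

```python
import random


def draw_sample_size(log10_range, rng):
    """Draw a sample size uniformly on the log10 scale, rounded to an integer."""
    exponent = rng.uniform(log10_range[0], log10_range[1])
    return int(round(10 ** exponent))


rng = random.Random(2022)  # arbitrary seed, for a repeatable illustration only
sizes = [draw_sample_size([2, 4], rng) for _ in range(1000)]

# With a log-uniform draw, about half of the sample sizes fall below 10 ** 3
share_below_1000 = sum(size < 1000 for size in sizes) / len(sizes)
print(min(sizes), max(sizes), round(share_below_1000, 2))
```

A plain uniform draw between 100 and 10,000 would instead put only about 9% of the simulations below 1,000 persons.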

Generate simulations and save response data to file

In [None]:
%%time

sim_parameters_df = pd.DataFrame()
sim_dict = {}

item_diffs_df = pd.DataFrame(index=[f'Item_{item + 1}' for item in range(item_count_range[1])])
thresholds_df = pd.DataFrame(index=[threshold for threshold in range(max_score_range[1] + 1)])

for sim in range(no_of_sims):

    print(f'Simulation {sim + 1}/{no_of_sims}')

    # Generate the generating parameters for the simulation

    no_of_items = np.random.randint(item_count_range[0],
                                    item_count_range[1] + 1)

    sample_size_log_10 = np.random.uniform(sample_size_log_10_range[0],
                                           sample_size_log_10_range[1])
    no_of_persons = int(round(10 ** sample_size_log_10, 0))

    max_score = np.random.randint(max_score_range[0],
                                  max_score_range[1] + 1)

    item_range = np.random.uniform(item_diff_range[0],
                                   item_diff_range[1])

    category_base = np.random.uniform(category_base_range[0],
                                      category_base_range[1])

    person_sd = np.random.uniform(person_sd_range[0],
                                  person_sd_range[1])
    
    disorder_random = np.random.uniform(0, 1)
    if disorder_random < disorder_prob:
        disorder_flag = 1
    else:
        disorder_flag = 0
    max_disorder = np.random.uniform(max_disorder_range[0],
                                     max_disorder_range[1])
    max_disorder *= disorder_flag
    
    offset = np.random.uniform(offset_range[0],
                               offset_range[1])

    # Add generating parameters to sim_parameters_df

    sim_parameters = {'no_of_items': no_of_items,
                      'no_of_persons': no_of_persons,
                      'max_score': max_score,
                      'item_range': item_range,
                      'category_base': category_base,
                      'person_sd': person_sd,
                      'max_disorder': max_disorder,
                      'offset': offset}

    sim_parameters_df[f'Simulation {sim + 1}'] = sim_parameters

    # Generate simulation from parameters and save full data response dataframe to file

    sim_dict[f'Simulation {sim + 1}'] = rp.RSM_Sim(no_of_items=no_of_items,
                                                   no_of_persons=no_of_persons,
                                                   max_score=max_score,
                                                   item_range=item_range,
                                                   category_base=category_base,
                                                   person_sd=person_sd,
                                                   max_disorder=max_disorder,
                                                   offset=offset)

    sim_dict[f'Simulation {sim + 1}'].scores.to_csv(f'responses_full_{sim + 1}.csv')

    # Generate a reduced response dataframe (whole person lines removed), save to file
    # and add to simulation object as a new attribute, together with reduced set of
    # generating abilities

    reduced_data = sim_dict[f'Simulation {sim + 1}'].scores.sample(frac=1-missing_prop)
    reduced_data.to_csv(f'responses_reduced_{sim + 1}.csv')
    sim_dict[f'Simulation {sim + 1}'].scores_reduced = reduced_data
    
    abils_full = sim_dict[f'Simulation {sim + 1}'].abilities
    abils_reduced = abils_full.loc[reduced_data.index]
    sim_dict[f'Simulation {sim + 1}'].abilities_reduced = abils_reduced
    

    # Generate a missing data response dataframe (individual responses removed MCAR), save to file
    # and add to simulation object as a new attribute

    random_array = np.random.uniform(0, 1, (no_of_persons, no_of_items))
    random_df = pd.DataFrame(random_array)
    random_df.columns = sim_dict[f'Simulation {sim + 1}'].scores.columns
    random_df.index = sim_dict[f'Simulation {sim + 1}'].scores.index

    missing_data = sim_dict[f'Simulation {sim + 1}'].scores.where(random_df > missing_prop)

    missing_data.to_csv(f'responses_missing_{sim + 1}.csv')
    sim_dict[f'Simulation {sim + 1}'].scores_missing = missing_data
    
# Save sim_parameters_df and create a pickle file of the dictionary of all simulations;
# the pickle file may be opened later to retrieve the simulations with the three response
# dataframes and generating parameters stored as attributes (see cell at end).

sim_parameters_df.to_csv('simulation_generating_parameters.csv')

with open('simulation_dictionary.pkl', 'wb') as file:
    pickle.dump(sim_dict, file)
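
The cell above thins the data in two different ways: the reduced set drops whole persons, while the missing set keeps every person and blanks individual responses independently. The following standard-library sketch contrasts the two on a toy response matrix. The data, the `Person_n` labels and the seed are made up for illustration, and `None` stands in for the `NaN` that pandas would use.

```python
import random

rng = random.Random(0)  # arbitrary seed, for a repeatable illustration only
missing_prop = 0.3

# Toy response matrix: 10 persons x 4 items, scores 0-3 (made-up data)
scores = {f'Person_{p + 1}': [rng.randint(0, 3) for _ in range(4)]
          for p in range(10)}

# Reduced set: keep a random 70% of persons, each with all responses intact
n_keep = round(len(scores) * (1 - missing_prop))
kept_persons = rng.sample(sorted(scores), n_keep)
reduced = {person: scores[person] for person in kept_persons}

# Missing set: keep every person, but blank each response independently
# with probability missing_prop
missing = {person: [score if rng.random() > missing_prop else None
                    for score in row]
           for person, row in scores.items()}

n_blanked = sum(score is None for row in missing.values() for score in row)
print(len(reduced), 'of', len(scores), 'persons kept in the reduced set')
print(n_blanked, 'of', 4 * len(scores), 'responses blanked in the missing set')
```

The number of persons removed from the reduced set is fixed by `missing_prop`, whereas the number of blanked responses in the missing set varies from run to run around the same proportion.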

Define functions for the RMSE and SD ratio metrics; the RMS parameter estimation residual metric reuses `rmse` on arrays of expected scores

In [None]:
def rmse(x, y):

    mse = ((x - y) ** 2).mean()
    
    return np.sqrt(mse)

def sd_ratio(x, y):

    return y.std() / x.std()
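
As a quick check on how the two metrics read, here is a standard-library sketch applied to made-up values. In the notebook the first argument is always the generating parameter set and the second the estimate, so an SD ratio above 1 means the estimates are more spread out than the generating values. The functions here are plain-Python stand-ins for the ones above, not replacements.

```python
import math
import statistics


def rmse(x, y):
    """Root mean squared error between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def sd_ratio(x, y):
    """Standard deviation of y divided by the standard deviation of x."""
    return statistics.stdev(y) / statistics.stdev(x)


generating = [-1.0, 0.0, 1.0]  # made-up generating difficulties
estimated = [-1.2, 0.0, 1.2]   # made-up estimates, stretched by 20%

print(round(rmse(generating, estimated), 4))
print(round(sd_ratio(generating, estimated), 4))
```

Because the ratio divides one standard deviation by another of the same length, it does not matter whether the sample or the population formula is used, as long as both use the same one.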

Set the additive smoothing constant for parameter estimation. Here, the experiment is run with a single additive smoothing constant (the default `RaschPy` value of `constant=0.1`). To compare performance using different additive smoothing constants, change the value in this cell and re-run the cell below, moving the output files first so that they are not overwritten.

In [None]:
constant = 0.1
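
One way to avoid overwriting results when trying several constants is to put the constant into the output file name. This is only a sketch: `results_filename` is a hypothetical helper, the naming pattern is an assumption rather than the one used in the notebook, and the constants listed are illustrative values, not ones taken from the paper.

```python
def results_filename(data_set, method, constant):
    """Build a results file name that records the smoothing constant used."""
    return f'results_{data_set}_{method}_constant_{constant}.csv'


for constant in [0.05, 0.1, 0.2]:  # illustrative values only
    for method in ['ls', 'evm']:
        print(results_filename('full', method, constant))
```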

Generate parameter estimates and save comparison metrics to file

In [None]:
%%time

results_dict_full = {}
rsm_dict_full = {}
results_dict_reduced = {}
rsm_dict_reduced = {}
results_dict_missing = {}
rsm_dict_missing = {}

for method in methods:

    results_dict_full[method] = {}
    rsm_dict_full[method] = pd.DataFrame()
    results_dict_reduced[method] = {}
    rsm_dict_reduced[method] = pd.DataFrame()
    results_dict_missing[method] = {}
    rsm_dict_missing[method] = pd.DataFrame()
    
    for sim in range(no_of_sims):
    
        print(f'Simulation {sim + 1}/{no_of_sims}')
    
        # Generate estimates from the responses

        # Full data set

        rsm_sim = sim_dict[f'Simulation {sim + 1}']
    
        data_full = rsm_sim.scores
        max_score = rsm_sim.max_score
    
        rsm_full = rp.RSM(data_full, max_score=max_score)
    
        rsm_full.calibrate(method=method, constant=constant)

        diffs_rmse = rmse(rsm_sim.diffs, rsm_full.diffs)
        diffs_sd_ratio = sd_ratio(rsm_sim.diffs, rsm_full.diffs)
                                      
        thresholds_rmse = rmse(rsm_sim.thresholds, rsm_full.thresholds)
        thresholds_sd_ratio = sd_ratio(rsm_sim.thresholds, rsm_full.thresholds)

        exp_score_array_estimated_full = np.array([rsm_full.exp_score(rsm_sim.abilities.loc[person],
                                                                      rsm_full.diffs[item],
                                                                      rsm_full.thresholds)
                                                   for item in rsm_sim.items for person in rsm_sim.persons])
        exp_score_array_generating_full = np.array([rsm_full.exp_score(rsm_sim.abilities.loc[person],
                                                                       rsm_sim.diffs[item],
                                                                       rsm_sim.thresholds)
                                                    for item in rsm_sim.items for person in rsm_sim.persons])
        
        rms_param_est_residuals = rmse(exp_score_array_estimated_full,
                                       exp_score_array_generating_full)

        rsm_sim_results_full = {'RMSE diffs': diffs_rmse,
                                'SD ratio diffs': diffs_sd_ratio,
                                'RMSE thresholds': thresholds_rmse,
                                'SD ratio thresholds': thresholds_sd_ratio,
                                'RMS parameter estimation residuals': rms_param_est_residuals}

        results_dict_full[method][f'Simulation {sim + 1}'] = rsm_sim_results_full
        rsm_dict_full[method][f'Simulation {sim + 1}'] = rsm_full

        # Reduced data set

        data_reduced = pd.read_csv(f'responses_reduced_{sim + 1}.csv', index_col=0)
    
        rsm_reduced = rp.RSM(data_reduced, max_score=max_score)
    
        rsm_reduced.calibrate(method=method, constant=constant)

        diffs_rmse = rmse(rsm_sim.diffs, rsm_reduced.diffs)
        diffs_sd_ratio = sd_ratio(rsm_sim.diffs, rsm_reduced.diffs)
                                      
        thresholds_rmse = rmse(rsm_sim.thresholds, rsm_reduced.thresholds)
        thresholds_sd_ratio = sd_ratio(rsm_sim.thresholds, rsm_reduced.thresholds)

        exp_score_array_estimated_reduced = np.array([rsm_reduced.exp_score(rsm_sim.abilities_reduced.loc[person],
                                                                            rsm_reduced.diffs[item],
                                                                            rsm_reduced.thresholds)
                                                      for item in rsm_sim.items for person in data_reduced.index])
        exp_score_array_generating_reduced = np.array([rsm_reduced.exp_score(rsm_sim.abilities_reduced.loc[person],
                                                                             rsm_sim.diffs[item],
                                                                             rsm_sim.thresholds)
                                                       for item in rsm_sim.items for person in data_reduced.index])
        
        rms_param_est_residuals = rmse(exp_score_array_estimated_reduced,
                                       exp_score_array_generating_reduced)

        rsm_sim_results_reduced = {'RMSE diffs': diffs_rmse,
                                   'SD ratio diffs': diffs_sd_ratio,
                                   'RMSE thresholds': thresholds_rmse,
                                   'SD ratio thresholds': thresholds_sd_ratio,
                                   'RMS parameter estimation residuals': rms_param_est_residuals}

        results_dict_reduced[method][f'Simulation {sim + 1}'] = rsm_sim_results_reduced
        rsm_dict_reduced[method][f'Simulation {sim + 1}'] = rsm_reduced

        # Data set with missing responses

        data_missing = pd.read_csv(f'responses_missing_{sim + 1}.csv', index_col=0)
    
        rsm_missing = rp.RSM(data_missing, max_score=max_score)
    
        rsm_missing.calibrate(method=method, constant=constant)

        diffs_rmse = rmse(rsm_sim.diffs, rsm_missing.diffs)
        diffs_sd_ratio = sd_ratio(rsm_sim.diffs, rsm_missing.diffs)
                                      
        thresholds_rmse = rmse(rsm_sim.thresholds, rsm_missing.thresholds)
        thresholds_sd_ratio = sd_ratio(rsm_sim.thresholds, rsm_missing.thresholds)

        exp_score_array_estimated_missing = np.array([rsm_missing.exp_score(rsm_sim.abilities.loc[person],
                                                                            rsm_missing.diffs[item],
                                                                            rsm_missing.thresholds)
                                                      for item in rsm_sim.items for person in rsm_sim.persons
                                                      if pd.notna(data_missing.loc[person, item])])
        exp_score_array_generating_missing = np.array([rsm_missing.exp_score(rsm_sim.abilities.loc[person],
                                                                             rsm_sim.diffs[item],
                                                                             rsm_sim.thresholds)
                                                       for item in rsm_sim.items for person in rsm_sim.persons
                                                       if pd.notna(data_missing.loc[person, item])])
        
        rms_param_est_residuals = rmse(exp_score_array_estimated_missing,
                                       exp_score_array_generating_missing)

        rsm_sim_results_missing = {'RMSE diffs': diffs_rmse,
                                   'SD ratio diffs': diffs_sd_ratio,
                                   'RMSE thresholds': thresholds_rmse,
                                   'SD ratio thresholds': thresholds_sd_ratio,
                                   'RMS parameter estimation residuals': rms_param_est_residuals}

        results_dict_missing[method][f'Simulation {sim + 1}'] = rsm_sim_results_missing
        rsm_dict_missing[method][f'Simulation {sim + 1}'] = rsm_missing

# Save results to file

for method in methods:
    pd.DataFrame(results_dict_full[method]).to_csv(f'results_full_{method}.csv')
    pd.DataFrame(results_dict_reduced[method]).to_csv(f'results_reduced_{method}.csv')
    pd.DataFrame(results_dict_missing[method]).to_csv(f'results_missing_{method}.csv')
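
The RMS parameter estimation residual computed above compares expected scores under the estimated item parameters with expected scores under the generating item parameters, both evaluated at the generating abilities. The sketch below works through that calculation with the standard Rating Scale Model category probabilities. It is an illustration under stated assumptions, not `RaschPy`'s implementation: `category_probs` and `expected_score` are hypothetical helpers, thresholds are passed as tau_1 to tau_m without a leading zero, and all values are made up.

```python
import math


def category_probs(ability, difficulty, thresholds):
    """Rating Scale Model probabilities for scores 0..m, given tau_1..tau_m."""
    cumulative = [0.0]  # score 0 corresponds to an empty sum
    for tau in thresholds:
        cumulative.append(cumulative[-1] + ability - difficulty - tau)
    weights = [math.exp(value) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_score(ability, difficulty, thresholds):
    """Expected item score: sum over categories of score times probability."""
    probs = category_probs(ability, difficulty, thresholds)
    return sum(score * prob for score, prob in enumerate(probs))


# Made-up values for illustration only
abilities = [-1.0, 0.0, 1.5]
gen_diffs, gen_thresholds = [-0.5, 0.5], [-1.0, 0.0, 1.0]
est_diffs, est_thresholds = [-0.4, 0.4], [-0.9, 0.0, 0.9]

# Both sets of expected scores use the generating abilities
residuals = [expected_score(ability, est_diff, est_thresholds)
             - expected_score(ability, gen_diff, gen_thresholds)
             for gen_diff, est_diff in zip(gen_diffs, est_diffs)
             for ability in abilities]

rms_residual = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
print(round(rms_residual, 4))
```

Because it is expressed in score units, this metric summarises in a single number how much the errors in difficulties and thresholds jointly shift the model's predictions.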

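View the full-data results for the `evm` method, rounded to three decimal places, with one row per simulation.

In [None]: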
round(pd.DataFrame(results_dict_full['evm']), 3).T

### Retrieving simulations from file

This cell is not part of the simulation; it shows how to retrieve the simulation objects later. Each object contains all the generating parameters and the three response dataframes. Unpickling the objects requires `RaschPy`, which is why there is a line here to import it (redundant if it has previously been imported).

In [None]:
import pickle
import RaschPy as rp
with open('simulation_dictionary.pkl', 'rb') as file:
    retrieved_sim_dict = pickle.load(file)
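
`pickle` stores an object's class by reference (module and name), so the defining package must be importable when the file is loaded. Attributes added after creation, such as `scores_reduced`, are saved along with the rest of the object. The round trip below sketches this with a standard-library stand-in: `SimpleNamespace` is used here in place of a `RaschPy` simulation object, and the values are made up.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

# Stand-in for a simulation object (hypothetical; not a RaschPy class)
sim = SimpleNamespace(max_score=3, scores=[[0, 1], [3, 2]])
sim.scores_reduced = [[3, 2]]  # attribute attached after creation

sim_dict = {'Simulation 1': sim}

# Write to and read back from a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'simulation_dictionary.pkl')
    with open(path, 'wb') as file:
        pickle.dump(sim_dict, file)
    with open(path, 'rb') as file:
        retrieved = pickle.load(file)

print(retrieved['Simulation 1'].scores_reduced)
```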

View the full response dataframe for Simulation 1 in the retrieved dictionary.

In [None]:
retrieved_sim_dict['Simulation 1'].scores

View the generating item difficulties for Simulation 1 in the retrieved dictionary. For the generating thresholds, replace `diffs` with `thresholds`.

In [None]:
retrieved_sim_dict['Simulation 1'].diffs