## *RaschPy* simulation functionality

This notebook works through examples of how to generate simulated data sets with `RaschPy`. Simulated data is useful for experimental work in which the underlying 'ground truth' of the generating parameters must be known, for example when comparing the efficacy of different estimation algorithms, as in Elliott & Buttery (2022a), or when exploring the effect of fitting different Rasch models to the same data set, as in Elliott & Buttery (2022b). There are separate classes for each model: `SLM_Sim` for the simple logistic model (or dichotomous Rasch model) (Rasch, 1960), `PCM_Sim` for the partial credit model (Masters, 1982), `RSM_Sim` for the rating scale model (Andrich, 1978), `MFRM_Sim_Global` for the many-facet Rasch model (Linacre, 1994), `MFRM_Sim_Items` for the vector-by-item extended MFRM (Elliott & Buttery, 2022b), `MFRM_Sim_Thresholds` for the vector-by-threshold extended MFRM (Elliott & Buttery, 2022b) and `MFRM_Sim_Matrix` for the matrix extended MFRM (Elliott & Buttery, 2022b). All data is generated to fit the chosen model.

**References**

&nbsp;&nbsp;&nbsp;&nbsp; Andrich, D. (1978). A rating formulation for ordered response categories. *Psychometrika*, *43*(4), 561–573.

&nbsp;&nbsp;&nbsp;&nbsp; Elliott, M., & Buttery, P. J. (2022a). Non-iterative conditional pairwise estimation for the rating scale model. *Educational and Psychological Measurement*, *82*(5), 989–1019.

&nbsp;&nbsp;&nbsp;&nbsp; Elliott, M., & Buttery, P. J. (2022b). Extended rater representations in the many-facet Rasch model. *Journal of Applied Measurement*, *22*(1), 133–160.

&nbsp;&nbsp;&nbsp;&nbsp; Linacre, J. M. (1994). *Many-Facet Rasch Measurement*. MESA Press.

&nbsp;&nbsp;&nbsp;&nbsp; Masters, G. N. (1982). A Rasch model for partial credit scoring. *Psychometrika*, *47*(2), 149–174.

&nbsp;&nbsp;&nbsp;&nbsp; Rasch, G. (1960). *Probabilistic models for some intelligence and attainment tests*. Danmarks Pædagogiske Institut.

Import the packages and set the working directory (here called `my_working_directory`); your output files will be saved there.

In [1]:
import RaschPy as rp
import numpy as np
import pandas as pd
import os
import pickle

os.chdir('my_working_directory')

### `PCM_Sim`

Create an object `pcm_sim_1` of the class `PCM_Sim` with randomised central item difficulties, thresholds and person abilities. `PCM_Sim` does this automatically when you pass the `item_range`, `category_base`, `max_disorder`, `person_sd` and `offset` arguments: item difficulties are sampled from a uniform distribution and person abilities from a normal distribution. We pass `item_range=4` so that the items cover a range of 4 logits, and `person_sd=2` and `offset=0.5` so that the sample of persons has a mean ability 0.5 logits higher than the items, with a standard deviation of 2 logits.

We also pass `category_base=1.5` and `max_disorder=1`. This sets the base category width to 1.5 logits, with random uniform variation around that base controlled by `max_disorder`. With `max_disorder=1`, the minimum category width is 1 logit (and the maximum, symmetrically, is 2 logits); a smaller value permits more variation in category widths, and a negative value allows disordered thresholds (hence the name of the argument). A set of central item locations is generated from `item_range`, and sets of centred Rasch-Andrich thresholds, each summing to zero, are generated from `category_base` and `max_disorder`.

`PCM_Sim` also requires `max_score_vector`, a list containing the maximum possible score for each item (this can vary from item to item). This simulation has 5,000 persons and 12 items, with no missing data.

In [2]:
pcm_sim_1 = rp.PCM_Sim(no_of_items=12,
                       no_of_persons=5000,
                       max_score_vector=[3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5],
                       item_range=4,
                       category_base=1.5,
                       max_disorder=1,
                       person_sd=2,
                       offset=0.5)
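
The threshold-generation scheme described above can be illustrated with a minimal sketch in plain Python. This is not `RaschPy`'s internal code: the function name `sketch_centred_thresholds` is hypothetical, and the sketch assumes that category widths are drawn uniformly between `max_disorder` and `2 * category_base - max_disorder`, which is one reading of the description that is consistent with the minimum and maximum widths quoted for `max_disorder=1`.

```python
import random


def sketch_centred_thresholds(max_score, category_base, max_disorder, rng):
    """Illustrative only: one possible way to draw centred thresholds."""
    # Assumption: widths are uniform on
    # [max_disorder, 2 * category_base - max_disorder].
    low = max_disorder
    high = 2 * category_base - max_disorder
    widths = [rng.uniform(low, high) for _ in range(max_score - 1)]

    # Accumulate the widths into threshold positions, starting at 0.
    positions = [0.0]
    for width in widths:
        positions.append(positions[-1] + width)

    # Centre the positions so that the thresholds sum to zero.
    mean = sum(positions) / len(positions)
    centred = [position - mean for position in positions]

    # Prepend the 'dummy' threshold 0, fixed at 0.
    return [0.0] + centred


rng = random.Random(2022)  # arbitrary seed, for reproducibility only
thresholds = sketch_centred_thresholds(max_score=5, category_base=1.5,
                                       max_disorder=1, rng=rng)
print(thresholds)
```

Under this assumption, a negative `max_disorder` makes the lower bound of the width negative, so adjacent thresholds can swap order, which is how disordered thresholds would arise.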

Save the generated response dataframe, which is stored as an attribute `pcm_sim_1.scores`, to file, and view the first 5 lines.

In [3]:
pcm_sim_1.scores.to_csv('pcm_sim_1_scores.csv')
pcm_sim_1.scores.head(5)

Unnamed: 0,Item_1,Item_2,Item_3,Item_4,Item_5,Item_6,Item_7,Item_8,Item_9,Item_10,Item_11,Item_12
Person_1,1.0,2.0,2.0,1.0,1.0,3.0,5.0,2.0,1.0,2.0,3.0,4.0
Person_2,3.0,2.0,2.0,1.0,2.0,2.0,4.0,3.0,2.0,3.0,4.0,4.0
Person_3,3.0,1.0,2.0,2.0,1.0,1.0,5.0,3.0,4.0,3.0,3.0,5.0
Person_4,2.0,2.0,1.0,2.0,0.0,1.0,3.0,4.0,1.0,2.0,3.0,4.0
Person_5,2.0,3.0,3.0,2.0,0.0,2.0,5.0,5.0,2.0,2.0,4.0,5.0


Save the generating item, threshold and person parameters to file, and view the first 5 lines of the item difficulties and the person abilities, plus the dictionary of centred Rasch-Andrich thresholds (which includes the 'dummy' threshold 0, always set to 0).

In [4]:
pcm_sim_1.diffs.to_csv('pcm_sim_1_diffs.csv', header=None)
pcm_sim_1.diffs.head(5)

Item_1    0.360925
Item_2    0.290721
Item_3    0.240278
Item_4   -0.172308
Item_5    1.138493
dtype: float64

In [5]:
with open('pcm_sim_1_thresholds_centred.pickle', 'wb') as handle:
    pickle.dump(pcm_sim_1.thresholds_centred, handle, protocol=pickle.HIGHEST_PROTOCOL)
pcm_sim_1.thresholds_centred

{'Item_1': array([ 0.        , -1.40152395, -0.06087941,  1.46240336]),
 'Item_2': array([ 0.        , -1.22421373,  0.05521353,  1.1690002 ]),
 'Item_3': array([ 0.        , -1.6790758 ,  0.22558274,  1.45349306]),
 'Item_4': array([ 0.        , -1.8823932 ,  0.06683598,  1.81555722]),
 'Item_5': array([ 0.        , -1.10397365,  0.04350351,  1.06047014]),
 'Item_6': array([ 0.        , -1.59347338,  0.20876874,  1.38470464]),
 'Item_7': array([ 0.        , -2.28573067, -1.25590057, -0.19711798,  1.35478782,
         2.38396141]),
 'Item_8': array([ 0.        , -2.84881759, -1.83598418,  0.04111116,  1.80447455,
         2.83921606]),
 'Item_9': array([ 0.        , -3.14658612, -1.68155009, -0.03876519,  1.56995742,
         3.29694397]),
 'Item_10': array([ 0.        , -3.02457669, -1.76101312,  0.0295202 ,  1.79467891,
         2.9613907 ]),
 'Item_11': array([ 0.        , -2.20315082, -1.1104699 ,  0.00370286,  1.07064889,
         2.23926897]),
 'Item_12': array([ 0.        , -3.6

In [6]:
pcm_sim_1.abilities.to_csv('pcm_sim_1_abilities.csv', header=None)
pcm_sim_1.abilities.head(5)

Person_1    0.757072
Person_2    1.408041
Person_3    1.281264
Person_4    0.299648
Person_5    1.177038
dtype: float64

Save the dictionary of uncentred thresholds to file and view.

In [7]:
with open('pcm_sim_1_thresholds_uncentred.pickle', 'wb') as handle:
    pickle.dump(pcm_sim_1.thresholds_uncentred, handle, protocol=pickle.HIGHEST_PROTOCOL)
pcm_sim_1.thresholds_uncentred

{'Item_1': array([-1.04059888,  0.30004566,  1.82332843]),
 'Item_2': array([-0.93349242,  0.34593484,  1.45972151]),
 'Item_3': array([-1.43879787,  0.46586067,  1.69377099]),
 'Item_4': array([-2.0547015 , -0.10547232,  1.64324892]),
 'Item_5': array([0.03451911, 1.18199627, 2.19896291]),
 'Item_6': array([-2.0154115 , -0.21316937,  0.96276653]),
 'Item_7': array([-3.82237359, -2.79254348, -1.73376089, -0.18185509,  0.84731849]),
 'Item_8': array([-3.76382832, -2.75099492, -0.87389957,  0.88946382,  1.92420532]),
 'Item_9': array([-1.46886111e+00, -3.82507572e-03,  1.63895982e+00,  3.24768243e+00,
         4.97466898e+00]),
 'Item_10': array([-1.31132377, -0.04776021,  1.74277312,  3.50793183,  4.67464362]),
 'Item_11': array([-1.71743041, -0.6247495 ,  0.48942327,  1.5563693 ,  2.72498937]),
 'Item_12': array([-5.91121741, -3.942125  , -2.37844116, -0.38961202,  1.18766018])}
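
To see how the uncentred thresholds relate to response probabilities, the sketch below evaluates the standard partial credit model category probabilities (Masters, 1982) for one person and one item. It is a standalone illustration, not a call into `RaschPy`; the function name `pcm_category_probabilities` is hypothetical, and the ability and threshold values are copied from the `Person_1` and `Item_5` outputs above.

```python
import math


def pcm_category_probabilities(ability, uncentred_thresholds):
    """Return P(X = 0), ..., P(X = m) under the partial credit model."""
    # Exponent for score x is the sum over k <= x of (ability - threshold_k);
    # score 0 has an empty sum, i.e. an exponent of 0.
    exponents = [0.0]
    for threshold in uncentred_thresholds:
        exponents.append(exponents[-1] + ability - threshold)

    # Subtract the largest exponent before exponentiating, for stability.
    peak = max(exponents)
    weights = [math.exp(exponent - peak) for exponent in exponents]
    total = sum(weights)
    return [weight / total for weight in weights]


# Values copied from the outputs above (Person_1 ability, Item_5 thresholds).
ability = 0.757072
item_5_thresholds = [0.03451911, 1.18199627, 2.19896291]

probs = pcm_category_probabilities(ability, item_5_thresholds)
print([round(p, 3) for p in probs])
```

A simulated response could then be drawn by sampling a score from these probabilities, for example with `random.choices(range(len(probs)), weights=probs)`.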

View `max_score_vector`.

In [8]:
pcm_sim_1.max_score_vector

[3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5]

Create an object `pcm_1` of the class `PCM` from the response dataframe for analysis, and save the object `pcm_sim_1` to file with `pickle`. Note that when creating the `PCM` object, we pass a `max_score_vector` argument as well as the response dataframe. This is not essential, since `RaschPy` will infer the maximum scores from the response data if no vector is passed, but if no person achieves the maximum score for an item, the inferred maximum for that item will fall short of the full score range.

In [9]:
pcm_1 = rp.PCM(pcm_sim_1.scores, max_score_vector=pcm_sim_1.max_score_vector)

with open('pcm_sim_1.pickle', 'wb') as handle:
    pickle.dump(pcm_sim_1, handle, protocol=pickle.HIGHEST_PROTOCOL)
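
The risk of inferring maximum scores from the data can be shown with a toy example in plain Python. The response values here are invented for illustration, and the sketch assumes that inference amounts to taking the highest observed score per item; `RaschPy`'s own inference may differ in detail.

```python
# Toy responses for two items, each scored 0-3 by design.
responses = {
    "Item_1": [0, 1, 3, 2, 3],
    "Item_2": [0, 1, 2, 2, 1],  # nobody reaches the top score of 3
}

# The maximum scores the test designer intended.
intended_max_scores = [3, 3]

# Maximum scores inferred from the observed data alone.
inferred_max_scores = [max(scores) for scores in responses.values()]

print(inferred_max_scores)  # [3, 2]
```

The inferred maximum for `Item_2` is 2 rather than 3, which is why passing `max_score_vector` explicitly is the safer choice.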

You may wish to create a simulation based on specified, known item difficulties and/or person abilities. This may be done by passing lists to the `manual_diffs`, `manual_thresholds` and/or `manual_abilities` arguments (in which case, there is no need to pass the relevant `item_range`, `category_base`, `max_disorder`, `person_sd` or `offset` arguments). You may also customise the names of the items and/or persons by passing lists of the correct length to the `manual_person_names` and/or `manual_item_names` arguments.

The `manual_diffs` and `manual_abilities` arguments may also be used to generate random item difficulties and/or person abilities according to distributions other than the default uniform (for items) and normal (for persons). This is what is done in the example `pcm_sim_2` below: a set of specified, fixed item difficulties (4 items of difficulty between -1.5 logits and +1.5 logits, with maximum scores of 3, 5, 5 and 3) and sets of Rasch-Andrich thresholds (each summing to zero, with a 'dummy' threshold 0 of value 0) are passed together with a random uniform distribution of person abilities between -2 and 2 logits. For this simulation, we also set a proportion of 20% missing data (missing completely at random) by passing the argument `missing=0.2`.

In [10]:
pcm_sim_2 = rp.PCM_Sim(no_of_items=4,
                       no_of_persons=1000,
                       max_score_vector=[3, 5, 5, 3],
                       missing=0.2,
                       manual_diffs= [-1.5, -0.5, 0.5, 1.5],
                       manual_thresholds=[[0, -1, 0, 1], [0, -2, -1, 0, 1, 2], [0, -2, -1, 0, 1, 2], [0, -1, 0, 1]],
                       manual_abilities = np.random.uniform(-2, 2, 1000))
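
As a rough illustration of what 'missing completely at random' means, the sketch below blanks each response independently with probability 0.2, regardless of person, item or score. The helper `apply_mcar` is hypothetical and the toy scores are random placeholders; whether `RaschPy` blanks each response independently or removes an exact proportion is not stated here, so treat this as an assumption.

```python
import random


def apply_mcar(scores, missing, rng):
    """Blank each response independently with probability `missing`."""
    return [[None if rng.random() < missing else score for score in row]
            for row in scores]


rng = random.Random(0)  # arbitrary seed, for reproducibility only

# Placeholder complete data: 1000 persons by 4 items, scores 0-3.
complete = [[rng.randint(0, 3) for _ in range(4)] for _ in range(1000)]

with_missing = apply_mcar(complete, missing=0.2, rng=rng)

# Proportion of blanked responses; close to 0.2, not exactly 0.2.
n_missing = sum(score is None for row in with_missing for score in row)
proportion = n_missing / (1000 * 4)
print(proportion)
```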

Save the generated response dataframe, which is stored as an attribute `pcm_sim_2.scores`, to file, and view it.

In [11]:
pcm_sim_2.scores.to_csv('pcm_sim_2_scores.csv')
pcm_sim_2.scores

Unnamed: 0,Item_1,Item_2,Item_3,Item_4
Person_1,3.0,5.0,3.0,2.0
Person_2,3.0,4.0,,0.0
Person_3,1.0,0.0,1.0,
Person_4,1.0,3.0,2.0,0.0
Person_5,2.0,1.0,0.0,0.0
...,...,...,...,...
Person_996,3.0,,,2.0
Person_997,2.0,3.0,1.0,1.0
Person_998,3.0,1.0,,0.0
Person_999,,4.0,3.0,2.0


Save the generating item, threshold and person parameters to file, and view the item difficulties and the person abilities, plus the dictionary of centred Rasch-Andrich thresholds (which includes the 'dummy' threshold 0, always set to 0).

In [12]:
pcm_sim_2.diffs.to_csv('pcm_sim_2_diffs.csv', header=None)
pcm_sim_2.diffs

Item_1   -1.5
Item_2   -0.5
Item_3    0.5
Item_4    1.5
dtype: float64

In [13]:
with open('pcm_sim_2_thresholds_centred.pickle', 'wb') as handle:
    pickle.dump(pcm_sim_2.thresholds_centred, handle, protocol=pickle.HIGHEST_PROTOCOL)
pcm_sim_2.thresholds_centred

{'Item_1': array([ 0, -1,  0,  1]),
 'Item_2': array([ 0, -2, -1,  0,  1,  2]),
 'Item_3': array([ 0, -2, -1,  0,  1,  2]),
 'Item_4': array([ 0, -1,  0,  1])}

In [14]:
pcm_sim_2.abilities.to_csv('pcm_sim_2_abilities.csv', header=None)
pcm_sim_2.abilities.head(5)

Save the dictionary of uncentred thresholds to file and view.

In [15]:
with open('pcm_sim_2_thresholds_uncentred.pickle', 'wb') as handle:
    pickle.dump(pcm_sim_2.thresholds_uncentred, handle, protocol=pickle.HIGHEST_PROTOCOL)
pcm_sim_2.thresholds_uncentred

{'Item_1': array([-2.5, -1.5, -0.5]),
 'Item_2': array([-2.5, -1.5, -0.5,  0.5,  1.5]),
 'Item_3': array([-1.5, -0.5,  0.5,  1.5,  2.5]),
 'Item_4': array([0.5, 1.5, 2.5])}
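
The outputs above suggest that each uncentred threshold is the item difficulty plus the corresponding centred threshold, with the 'dummy' threshold 0 dropped. The short check below reproduces the `pcm_sim_2` values from the manual inputs in plain Python; it is an illustration of that relationship, not `RaschPy`'s own code.

```python
# Manual inputs passed to the simulation above.
manual_diffs = [-1.5, -0.5, 0.5, 1.5]
manual_thresholds = [[0, -1, 0, 1],
                     [0, -2, -1, 0, 1, 2],
                     [0, -2, -1, 0, 1, 2],
                     [0, -1, 0, 1]]

# Shift each centred threshold by its item difficulty,
# skipping the dummy threshold at index 0.
uncentred = {
    f"Item_{i + 1}": [diff + tau for tau in taus[1:]]
    for i, (diff, taus) in enumerate(zip(manual_diffs, manual_thresholds))
}

for item, values in uncentred.items():
    print(item, values)
```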

View `max_score_vector`.

In [16]:
pcm_sim_2.max_score_vector

[3, 5, 5, 3]

Create an object, `pcm_2`, of the class `PCM` from the response dataframe for analysis and save the object `pcm_sim_2` to file with `pickle`.

In [17]:
pcm_2 = rp.PCM(pcm_sim_2.scores, max_score_vector=pcm_sim_2.max_score_vector)

with open('pcm_sim_2.pickle', 'wb') as handle:
    pickle.dump(pcm_sim_2, handle, protocol=pickle.HIGHEST_PROTOCOL)
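
Files written with `pickle.dump` can be read back with `pickle.load`. The sketch below shows the round trip using a plain dictionary as a stand-in for a saved attribute such as `thresholds_centred`; the file name `example_thresholds.pickle` and the temporary directory are placeholders chosen so that the example does not touch your working directory.

```python
import os
import pickle
import tempfile

# Stand-in for a saved attribute such as `thresholds_centred`.
saved = {"Item_1": [0, -1, 0, 1], "Item_4": [0, -1, 0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_thresholds.pickle")

    # Write the object to file in binary mode.
    with open(path, "wb") as handle:
        pickle.dump(saved, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back, again in binary mode.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == saved)  # True
```

To reload a pickled simulation object such as `pcm_sim_2`, `RaschPy` needs to be importable in the session, because `pickle` stores a reference to the class rather than the class definition itself. Only unpickle files from sources you trust.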