## *RaschPy* simulation functionality

This notebook works through examples of how to generate simulated data sets with `RaschPy`. Simulated data are useful for experiments where knowledge of the underlying 'ground truth' of the generating parameters matters, for example when comparing the efficacy of different estimation algorithms, as in Elliott & Buttery (2022a), or when exploring the effect of fitting different Rasch models to the same data set, as in Elliott & Buttery (2022b). There is a separate class for each model: `SLM_Sim` for the simple logistic model (or dichotomous Rasch model) (Rasch, 1960), `PCM_Sim` for the partial credit model (Masters, 1982), `RSM_Sim` for the rating scale model (Andrich, 1978), `MFRM_Sim_Global` for the many-facet Rasch model (Linacre, 1994), `MFRM_Sim_Items` for the vector-by-item extended MFRM (Elliott & Buttery, 2022b), `MFRM_Sim_Thresholds` for the vector-by-threshold extended MFRM (Elliott & Buttery, 2022b) and `MFRM_Sim_Matrix` for the matrix extended MFRM (Elliott & Buttery, 2022b). All data are generated to fit the chosen model.

**References**

&nbsp;&nbsp;&nbsp;&nbsp; Andrich, D. (1978). A rating formulation for ordered response categories. *Psychometrika*, *43*(4), 561–573.

&nbsp;&nbsp;&nbsp;&nbsp; Elliott, M., & Buttery, P. J. (2022a). Non-iterative conditional pairwise estimation for the rating scale model. *Educational and Psychological Measurement*, *82*(5), 989–1019.

&nbsp;&nbsp;&nbsp;&nbsp; Elliott, M., & Buttery, P. J. (2022b). Extended rater representations in the many-facet Rasch model. *Journal of Applied Measurement*, *22*(1), 133–160.

&nbsp;&nbsp;&nbsp;&nbsp; Linacre, J. M. (1994). *Many-Facet Rasch Measurement*. MESA Press.

&nbsp;&nbsp;&nbsp;&nbsp; Masters, G. N. (1982). A Rasch model for partial credit scoring. *Psychometrika*, *47*(2), 149–174.

&nbsp;&nbsp;&nbsp;&nbsp; Rasch, G. (1960). *Probabilistic models for some intelligence and attainment tests*. Danmarks Pædagogiske Institut.

Import the packages and set the working directory (here called `my_working_directory`) - you will save your output files here.

In [1]:
import RaschPy as rp
import numpy as np
import pandas as pd
import os
import pickle

# my_working_directory
os.chdir('C:/Users/elliom/Downloads/sims')

### `RSM_Sim`

Create an object `rsm_sim_1` of the class `RSM_Sim` with randomised item difficulties, shared threshold set and person abilities. `RSM_Sim` will do this automatically when you pass `item_range`, `category_base`, `max_disorder`, `person_sd` and `offset` arguments to the simulation: item difficulties will be sampled from a uniform distribution and person abilities will be sampled from a normal distribution. We pass `item_range=4` to have items covering a range of 4 logits, and `person_sd=2` and `offset=0.5` to have a sample of persons with a mean ability 0.5 logits higher than the items, with a standard deviation of 2 logits.

We also pass the additional arguments `category_base=1.5` and `max_disorder=1`; this sets the base category width to 1.5 logits, with a degree of random uniform variation around it controlled by `max_disorder`. With `max_disorder=1`, the minimum category width is 1 logit (and the maximum, symmetrically, will be 2 logits); a smaller value permits more variation in category widths, and a negative value for `max_disorder` allows the presence of disordered thresholds (hence the name of the argument). From this, a set of central item locations is generated from `item_range`, and a set of centred Rasch-Andrich thresholds, summing to zero, is generated from `category_base` and `max_disorder`. One other argument that must be passed to `RSM_Sim` is `max_score`, which is the maximum possible score for each item. There are 5,000 persons and 12 items, with no missing data for this simulation.

In [4]:
rsm_sim_1 = rp.RSM_Sim(no_of_items=12,
                       no_of_persons=5000,
                       max_score=5,
                       item_range=4,
                       category_base=1.5,
                       max_disorder=1,
                       person_sd=2,
                       offset=0.5)
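
The way `category_base` and `max_disorder` shape the thresholds can be pictured with a minimal sketch. It assumes category widths are drawn uniformly between `max_disorder` and `2 * category_base - max_disorder`, which matches the description above but is not taken from the `RaschPy` source; the helper `sketch_centred_thresholds` is hypothetical and uses only `numpy`.

```python
import numpy as np

def sketch_centred_thresholds(max_score, category_base, max_disorder, rng):
    """Hypothetical helper (an assumption, not RaschPy's internal code)."""
    # One width per gap between adjacent thresholds (max_score - 1 gaps),
    # drawn uniformly and symmetrically around category_base.
    widths = rng.uniform(max_disorder,
                         2 * category_base - max_disorder,
                         size=max_score - 1)
    # Cumulative widths give uncentred threshold locations, starting at 0.
    raw = np.concatenate(([0.0], np.cumsum(widths)))
    # Centre so the thresholds sum to zero, then prepend the 'dummy' threshold 0.
    centred = raw - raw.mean()
    return np.concatenate(([0.0], centred))

rng = np.random.default_rng(2022)
thresholds = sketch_centred_thresholds(max_score=5, category_base=1.5,
                                       max_disorder=1, rng=rng)
print(thresholds)
```

With `max_disorder=1` every gap between adjacent thresholds lies between 1 and 2 logits; a negative `max_disorder` would allow negative widths, that is, disordered thresholds.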

Save the generated response dataframe, which is stored as an attribute `rsm_sim_1.scores`, to file, and view the first 5 lines.

In [5]:
rsm_sim_1.scores.to_csv('rsm_sim_1_scores.csv')
rsm_sim_1.scores.head(5)

Unnamed: 0,Item_1,Item_2,Item_3,Item_4,Item_5,Item_6,Item_7,Item_8,Item_9,Item_10,Item_11,Item_12
Person_1,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0
Person_2,0.0,0.0,2.0,1.0,0.0,3.0,2.0,2.0,1.0,1.0,1.0,0.0
Person_3,1.0,1.0,1.0,1.0,0.0,2.0,1.0,1.0,1.0,0.0,1.0,1.0
Person_4,1.0,2.0,3.0,2.0,0.0,4.0,5.0,4.0,1.0,3.0,1.0,2.0
Person_5,3.0,3.0,3.0,4.0,2.0,5.0,5.0,4.0,3.0,3.0,2.0,3.0


Save the generating item, threshold and person parameters to file, and view the first 5 lines of the item difficulties and the person abilities, plus the array of centred Rasch-Andrich thresholds (which includes the 'dummy' threshold 0, always set to 0).

In [6]:
rsm_sim_1.diffs.to_csv('rsm_sim_1_diffs.csv', header=None)
rsm_sim_1.diffs.head(5)

Item_1    0.842467
Item_2    0.798172
Item_3   -0.465888
Item_4   -0.257954
Item_5    1.853648
dtype: float64

In [9]:
pd.Series(rsm_sim_1.thresholds).to_csv('rsm_sim_1_thresholds.csv', header=None)
rsm_sim_1.thresholds

array([ 0.        , -2.64305761, -1.38372706,  0.03232781,  1.41243099,
        2.58202586])

In [11]:
rsm_sim_1.abilities.to_csv('rsm_sim_1_abilities.csv', header=None)
rsm_sim_1.abilities.head(5)

Person_1    5.191937
Person_2   -2.346023
Person_3   -2.663436
Person_4   -0.008643
Person_5    1.433023
dtype: float64
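
To see how generating parameters like these translate into responses, the following sketch computes rating scale model category probabilities for one person-item pair and draws a single response. It uses the standard RSM formula with `numpy` only; the function name `rsm_category_probs` is hypothetical, and this is not a claim about how `RSM_Sim` samples internally. The parameter values are copied from the outputs above (Person_5, Item_1 and the threshold array).

```python
import numpy as np

def rsm_category_probs(ability, difficulty, thresholds):
    """Category probabilities under the RSM (hypothetical helper).

    `thresholds` includes the 'dummy' threshold 0 in position 0.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    # Step terms (ability - difficulty - tau_k) for k = 1..max_score;
    # position 0 is set to 0 so that category 0 has log-numerator 0.
    steps = ability - difficulty - thresholds
    steps[0] = 0.0
    log_numerators = np.cumsum(steps)
    log_numerators -= log_numerators.max()  # for numerical stability
    numerators = np.exp(log_numerators)
    return numerators / numerators.sum()

thresholds = [0.0, -2.64305761, -1.38372706, 0.03232781, 1.41243099, 2.58202586]
probs = rsm_category_probs(ability=1.433023, difficulty=0.842467,
                           thresholds=thresholds)

# Draw one simulated response (a score from 0 to 5) from these probabilities.
rng = np.random.default_rng(0)
response = rng.choice(len(probs), p=probs)
print(probs.round(3), response)
```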

View `max_score`.

In [12]:
rsm_sim_1.max_score

5

Create an object `rsm_1` of the class `RSM` from the response dataframe for analysis, and save the object `rsm_sim_1` to file with `pickle`. Note that when creating the `RSM` object, we pass a `max_score` argument as well as the response dataframe. This is not essential, since `RaschPy` will infer the maximum score from the response data if no value is passed, but if no person achieves the maximum score on any item, the inferred value will fall short of the full score range.

In [14]:
rsm_1 = rp.RSM(rsm_sim_1.scores, max_score=rsm_sim_1.max_score)

with open('rsm_sim_1.pickle', 'wb') as handle:
    pickle.dump(rsm_sim_1, handle, protocol=pickle.HIGHEST_PROTOCOL)
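
The risk in leaving `max_score` to be inferred can be illustrated with a small, made-up response array. This sketch assumes that inference amounts to taking the highest observed score, which is an assumption about `RaschPy` rather than something confirmed here.

```python
import numpy as np

# Hypothetical responses to three items scored 0-5; nobody scores 5.
scores = np.array([[0, 2, 4],
                   [1, 3, 4],
                   [2, 2, 3]], dtype=float)

true_max_score = 5
# Highest observed score, ignoring any missing (NaN) responses.
inferred_max_score = int(np.nanmax(scores))

print(inferred_max_score)  # 4, one category short of the true maximum
```

Passing `max_score` explicitly, as in the cell above, avoids this under-count.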

You may wish to create a simulation based on specified, known item difficulties, thresholds and/or person abilities. This may be done by passing lists to the `manual_diffs`, `manual_thresholds` and/or `manual_abilities` arguments (in which case, there is no need to pass the relevant `item_range`, `category_base`, `max_disorder`, `person_sd` or `offset` arguments). You may also customise the names of the items and/or persons by passing lists of the correct length to the `manual_person_names` and/or `manual_item_names` arguments.

The `manual_diffs` and `manual_abilities` arguments may also be used to generate random item difficulties and/or person abilities according to distributions other than the default uniform (for items) and normal (for persons). This is what is done in the example `rsm_sim_2` below: a set of specified, fixed item difficulties (6 items with difficulties between -2.5 and +2.5 logits and a maximum score of 5) and a set of Rasch-Andrich thresholds (summing to zero, with a 'dummy' threshold 0 of value 0) are passed together with a random uniform distribution of person abilities between -2 and 2 logits. For this simulation, we also set a proportion of 20% missing data (missing completely at random) by passing the argument `missing=0.2`.

In [15]:
rsm_sim_2 = rp.RSM_Sim(no_of_items=6,
                       no_of_persons=1000,
                       max_score=5,
                       missing=0.2,
                       manual_diffs=[-2.5, -1.5, -0.5, 0.5, 1.5, 2.5],
                       manual_thresholds=[0, -2, -1, 0, 1, 2],
                       manual_abilities = np.random.uniform(-2, 2, 1000))
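
Other ability distributions can be prepared in the same way before being passed to `manual_abilities`. The sketch below is illustrative only: it uses a seeded `numpy` generator so that the draws are reproducible, and the bimodal mixture (its means, spread and the variable names) is a made-up example rather than anything prescribed by `RaschPy`.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
no_of_persons = 1000

# Uniform abilities between -2 and 2 logits, reproducible via the seed.
uniform_abilities = rng.uniform(-2, 2, no_of_persons)

# Bimodal alternative: an equal mixture of two normal distributions
# centred at -1.5 and +1.5 logits, each with a standard deviation of 0.5.
group = rng.integers(0, 2, no_of_persons)
bimodal_abilities = rng.normal(loc=np.where(group == 0, -1.5, 1.5), scale=0.5)

print(uniform_abilities[:3], bimodal_abilities[:3])
```

Either array has one entry per person, so its length must match `no_of_persons`.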

Save the generated response dataframe, which is stored as an attribute `rsm_sim_2.scores`, to file, and view the dataframe.

In [16]:
rsm_sim_2.scores.to_csv('rsm_sim_2_scores.csv')
rsm_sim_2.scores

Unnamed: 0,Item_1,Item_2,Item_3,Item_4,Item_5,Item_6
Person_1,5.0,5.0,4.0,3.0,2.0,1.0
Person_2,5.0,,,,2.0,3.0
Person_3,5.0,5.0,3.0,4.0,4.0,1.0
Person_4,2.0,3.0,1.0,2.0,0.0,0.0
Person_5,,5.0,,4.0,3.0,2.0
...,...,...,...,...,...,...
Person_996,5.0,5.0,5.0,3.0,3.0,
Person_997,5.0,,1.0,1.0,0.0,1.0
Person_998,4.0,4.0,1.0,0.0,0.0,
Person_999,5.0,,3.0,,1.0,0.0
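
The effect of `missing=0.2` can be pictured with a minimal sketch of missing-completely-at-random masking. It assumes each response is dropped independently with probability 0.2; whether `RaschPy` masks cells this way or removes an exact proportion is not stated here, and the complete score matrix below is random filler rather than Rasch-fitting data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Filler 'complete' responses: 1000 persons by 6 items, scores 0-5.
complete = rng.integers(0, 6, size=(1000, 6)).astype(float)

# Each cell is masked independently with probability 0.2, regardless of
# the person, the item or the score itself (missing completely at random).
mask = rng.random(complete.shape) < 0.2
scores = np.where(mask, np.nan, complete)

proportion_missing = np.isnan(scores).mean()
print(round(proportion_missing, 3))
```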


Save the generating item, threshold and person parameters to file, and view the item difficulties, the Rasch-Andrich thresholds (which include the 'dummy' threshold 0, always set to 0) and the first 5 person abilities.

In [17]:
rsm_sim_2.diffs.to_csv('rsm_sim_2_diffs.csv', header=None)
rsm_sim_2.diffs

Item_1   -2.5
Item_2   -1.5
Item_3   -0.5
Item_4    0.5
Item_5    1.5
Item_6    2.5
dtype: float64

In [18]:
pd.Series(rsm_sim_2.thresholds).to_csv('rsm_sim_2_thresholds.csv', header=None)
rsm_sim_2.thresholds

array([ 0, -2, -1,  0,  1,  2])

In [19]:
rsm_sim_2.abilities.to_csv('rsm_sim_2_abilities.csv', header=None)
rsm_sim_2.abilities.head(5)

Person_1    1.300112
Person_2    1.505473
Person_3    1.633697
Person_4   -1.762833
Person_5    1.594664
dtype: float64

View `max_score`.

In [20]:
rsm_sim_2.max_score

5

Create an object `rsm_2` of the class `RSM` from the response dataframe for analysis and save the object `rsm_sim_2` to file with `pickle`.

In [21]:
rsm_2 = rp.RSM(rsm_sim_2.scores, max_score=rsm_sim_2.max_score)

with open('rsm_sim_2.pickle', 'wb') as handle:
    pickle.dump(rsm_sim_2, handle, protocol=pickle.HIGHEST_PROTOCOL)
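
A pickled simulation can be read back later with `pickle.load`. The sketch below shows the round trip using a plain dictionary as a stand-in and a temporary directory, so that it runs without `RaschPy`; the stand-in contents are illustrative. Loading a real `RSM_Sim` pickle works the same way, but `RaschPy` must be importable in the session that loads it.

```python
import os
import pickle
import tempfile

# Stand-in for a simulation object (illustrative values only).
stand_in = {'diffs': [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5],
            'thresholds': [0, -2, -1, 0, 1, 2],
            'max_score': 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'rsm_sim_2.pickle')

    # Write the object, as in the cell above.
    with open(path, 'wb') as handle:
        pickle.dump(stand_in, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back in binary mode.
    with open(path, 'rb') as handle:
        reloaded = pickle.load(handle)

print(reloaded['max_score'])
```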