## *RaschPy* simulation functionality

This notebook works through examples of how to generate simulated data sets with `RaschPy` for experimental use where knowledge of the underlying 'ground truth' of the generating parameters is useful, for example when comparing the efficacy of different estimation algorithms, as in Elliott & Buttery (2022a), or when exploring the effect of fitting different Rasch models to the same data set, as in Elliott & Buttery (2022b). There are separate classes for each model: `SLM_Sim` for the simple logistic model (or dichotomous Rasch model) (Rasch, 1960), `PCM_Sim` for the partial credit model (Masters, 1982), `RSM_Sim` for the rating scale model (Andrich, 1978), `MFRM_Sim_Global` for the many-facet Rasch model (Linacre, 1994), `MFRM_Sim_Items` for the vector-by-item extended MFRM (Elliott & Buttery, 2022b), `MFRM_Sim_Thresholds` for the vector-by-threshold extended MFRM (Elliott & Buttery, 2022b) and `MFRM_Sim_Matrix` for the matrix extended MFRM (Elliott & Buttery, 2022b). All data are generated to fit the chosen model.

**References**

&nbsp;&nbsp;&nbsp;&nbsp; Andrich, D. (1978). A rating formulation for ordered response categories. *Psychometrika*, *43*(4), 561–573.

&nbsp;&nbsp;&nbsp;&nbsp; Elliott, M., & Buttery, P. J. (2022a). Non-iterative conditional pairwise estimation for the rating scale model. *Educational and Psychological Measurement*, *82*(5), 989–1019.

&nbsp;&nbsp;&nbsp;&nbsp; Elliott, M., & Buttery, P. J. (2022b). Extended rater representations in the many-facet Rasch model. *Journal of Applied Measurement*, *22*(1), 133–160.

&nbsp;&nbsp;&nbsp;&nbsp; Linacre, J. M. (1994). *Many-Facet Rasch Measurement*. MESA Press.

&nbsp;&nbsp;&nbsp;&nbsp; Masters, G. N. (1982). A Rasch model for partial credit scoring. *Psychometrika*, *47*(2), 149–174.

&nbsp;&nbsp;&nbsp;&nbsp; Rasch, G. (1960). *Probabilistic models for some intelligence and attainment tests*. Danmarks Pædagogiske Institut.

Import the packages and set the working directory (here called `my_working_directory`) - you will save your output files here.

In [1]:
import RaschPy as rp
import numpy as np
import pandas as pd
import os
import pickle

os.chdir('my_working_directory')

### `MFRM_Sim_Items`

Create an object `mfrm_sim_1` of the class `MFRM_Sim_Items` with randomised item difficulties, rater severities, shared threshold set and person abilities. `MFRM_Sim_Items` will do this automatically when you pass `item_range`, `rater_range`, `category_base`, `max_disorder`, `person_sd` and `offset` arguments to the simulation: item difficulties and rater severities will be sampled from a uniform distribution; person abilities will be sampled from a normal distribution. We pass `item_range=4` to have items covering a range of 4 logits, `rater_range=3` to have raters covering a range of 3 logits, and `person_sd=2` and `offset=0.5` to have a sample of persons with a mean ability 0.5 logits higher than the items, with a standard deviation of 2 logits. We also pass the additional arguments `category_base=1.5` and `max_disorder=1`; this sets the base category width to 1.5 logits, with a degree of random uniform variation around it controlled by `max_disorder`. With `max_disorder=1`, the minimum category width is 1 logit (and the maximum, symmetrically, is 2 logits); a smaller value permits more variation in category widths, and a negative value for `max_disorder` allows the presence of disordered thresholds (hence the name of the argument). From this, a set of central item locations is generated from `item_range`, and a shared set of centred Rasch-Andrich thresholds, summing to zero, is generated from `category_base` and `max_disorder`. One other argument that must be passed to `MFRM_Sim_Items` is `max_score`, which is the maximum possible score for each item. There are 500 persons, 8 items and 10 raters, with no missing data for this simulation.

In [2]:
mfrm_sim_1 = rp.MFRM_Sim_Items(no_of_items=8,
                               no_of_persons=500,
                               no_of_raters=10,
                               max_score=5,
                               item_range=4,
                               rater_range=3,
                               category_base=1.5,
                               max_disorder=1,
                               person_sd=2,
                               offset=0.5)
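
The description of `category_base` and `max_disorder` above can be illustrated with a small standalone sketch. It is not `RaschPy`'s own code: the helper `sketch_thresholds` is hypothetical, and it assumes that each category width is drawn uniformly between `max_disorder` and `2 * category_base - max_disorder`, and that the resulting thresholds are then centred to sum to zero.

```python
import random

def sketch_thresholds(max_score, category_base, max_disorder, seed=None):
    """Hypothetical helper: one way centred thresholds could be generated."""
    rng = random.Random(seed)
    # Assumed: each category width is uniform on
    # [max_disorder, 2 * category_base - max_disorder].
    low, high = max_disorder, 2 * category_base - max_disorder
    widths = [rng.uniform(low, high) for _ in range(max_score - 1)]
    # Cumulative sums of the widths give uncentred threshold locations.
    locations = [0.0]
    for width in widths:
        locations.append(locations[-1] + width)
    # Centre the thresholds so that they sum to zero.
    mean = sum(locations) / len(locations)
    centred = [loc - mean for loc in locations]
    # Prepend the 'dummy' threshold 0, fixed at 0.
    return [0.0] + centred

thresholds = sketch_thresholds(max_score=5, category_base=1.5,
                               max_disorder=1, seed=1)
# Gaps between successive (non-dummy) thresholds are the category widths.
widths = [b - a for a, b in zip(thresholds[1:], thresholds[2:])]
```

Under these assumptions, a negative `max_disorder` makes the lower bound of the width negative, so some successive thresholds can come out in reverse order, which is what disordered thresholds means here.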

Save the generated response dataframe, which is stored as an attribute `mfrm_sim_1.scores`, to file, and view the first 5 lines.

In [3]:
mfrm_sim_1.scores.to_csv('mfrm_sim_1_scores.csv')
mfrm_sim_1.scores.head(5)

Unnamed: 0,Unnamed: 1,Item_1,Item_2,Item_3,Item_4,Item_5,Item_6,Item_7,Item_8
Rater_1,Person_1,3.0,2.0,2.0,3.0,3.0,4.0,5.0,4.0
Rater_1,Person_2,2.0,0.0,2.0,2.0,3.0,3.0,4.0,3.0
Rater_1,Person_3,2.0,2.0,2.0,3.0,4.0,5.0,2.0,4.0
Rater_1,Person_4,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
Rater_1,Person_5,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
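
A saved scores file can be read back later with `pandas`. The sketch below assumes, as the output above suggests, that the rows are indexed by a two-level (rater, person) index; it uses a small stand-in dataframe and an in-memory buffer rather than the real file so that it runs on its own.

```python
import io
import pandas as pd

# Small stand-in for `mfrm_sim_1.scores`: rows indexed by (rater, person).
index = pd.MultiIndex.from_product([['Rater_1', 'Rater_2'],
                                    ['Person_1', 'Person_2']])
scores = pd.DataFrame([[3.0, 2.0], [2.0, 0.0], [4.0, 3.0], [1.0, 1.0]],
                      index=index, columns=['Item_1', 'Item_2'])

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
scores.to_csv(buffer)
buffer.seek(0)

# The first two CSV columns hold the rater and person labels.
restored = pd.read_csv(buffer, index_col=[0, 1])
```

Passing `index_col=[0, 1]` restores the two-level row index instead of leaving the rater and person labels as ordinary columns.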


Save the generating item, threshold, rater and person parameters to file, and view the first 5 lines of the item difficulties, rater severities and person abilities, plus the Rasch-Andrich thresholds (which include the 'dummy' threshold 0, always set to 0).

In [4]:
mfrm_sim_1.diffs.to_csv('mfrm_sim_1_diffs.csv', header=None)
mfrm_sim_1.diffs.head(5)

Item_1    0.013105
Item_2    2.808579
Item_3    1.169241
Item_4    0.119679
Item_5   -1.129521
dtype: float64

In [5]:
pd.Series(mfrm_sim_1.thresholds).to_csv('mfrm_sim_1_thresholds.csv', header=None)
mfrm_sim_1.thresholds

array([ 0.        , -3.4803554 , -1.5947356 ,  0.39064311,  1.42713817,
        3.25730971])
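
As a quick sanity check, the thresholds printed above can be tested against the settings used for this simulation: with `category_base=1.5` and `max_disorder=1`, the thresholds should sum to zero and every category width should lie between 1 and 2 logits. The values below are copied by hand from the output, so the sum is only zero up to rounding.

```python
# Thresholds copied from the output above; the first entry is the dummy threshold 0.
thresholds = [0.0, -3.4803554, -1.5947356, 0.39064311, 1.42713817, 3.25730971]

# The non-dummy thresholds should sum to (approximately) zero.
threshold_sum = sum(thresholds[1:])
is_centred = abs(threshold_sum) < 1e-6

# Category widths are the gaps between successive thresholds.
widths = [b - a for a, b in zip(thresholds[1:], thresholds[2:])]
widths_in_range = all(1 <= w <= 2 for w in widths)

print(is_centred, widths_in_range)
print([round(w, 3) for w in widths])
```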

In [9]:
pd.DataFrame(mfrm_sim_1.severities).T.to_csv('mfrm_sim_1_severities.csv')
pd.DataFrame(mfrm_sim_1.severities).T.head(5)

Unnamed: 0,Item_1,Item_2,Item_3,Item_4,Item_5,Item_6,Item_7,Item_8
Rater_1,-0.166803,-0.685445,-0.192489,-0.156795,0.04968,0.092203,-0.18494,0.069342
Rater_2,-1.36217,-0.054932,-0.489174,1.032735,0.401053,0.092069,-0.777921,-0.416458
Rater_3,0.283171,-1.402123,0.509001,-0.633801,-0.178995,0.372468,-0.091473,0.86795
Rater_4,-0.628057,0.515532,-0.479066,0.226833,-0.012193,-0.320237,0.078931,0.521392
Rater_5,0.160702,-0.883917,1.44404,0.074762,-0.148835,-0.103271,0.229064,1.282377


In [10]:
mfrm_sim_1.abilities.to_csv('mfrm_sim_1_abilities.csv', header=None)
mfrm_sim_1.abilities.head(5)

Person_1    1.477238
Person_2   -0.465914
Person_3    0.550679
Person_4    4.804664
Person_5   -3.572967
dtype: float64
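
To see how the generating parameters relate to the simulated scores, the category probabilities for a single response can be sketched directly. This is a minimal illustration rather than `RaschPy`'s implementation: the function `category_probabilities` is hypothetical, and it assumes the usual adjacent-category form in which the log-odds of score x over score x-1 equal ability minus item difficulty minus the rater's severity on that item minus threshold x. The parameter values are those shown above for Person_1, Item_1 and Rater_1.

```python
import math

def category_probabilities(ability, difficulty, severity, thresholds):
    """Score-category probabilities for one person-item-rater combination.

    `thresholds` includes the dummy threshold 0 (value 0) as its first entry.
    """
    # Category 0 has cumulative logit 0; each further category adds
    # (ability - difficulty - severity - threshold_k).
    logits = [0.0]
    for tau in thresholds[1:]:
        logits.append(logits[-1] + ability - difficulty - severity - tau)
    # Normalise with a softmax, shifted by the maximum for numerical stability.
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    norm = sum(weights)
    return [w / norm for w in weights]

probs = category_probabilities(
    ability=1.477238,      # Person_1
    difficulty=0.013105,   # Item_1
    severity=-0.166803,    # Rater_1 on Item_1
    thresholds=[0.0, -3.4803554, -1.5947356, 0.39064311, 1.42713817, 3.25730971],
)
print([round(p, 3) for p in probs])
```

A simulated score is then a single random draw from these six probabilities, which is why an individual observed score need not be the most probable category.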

View `max_score`.

In [11]:
mfrm_sim_1.max_score

5

Create an object `mfrm_1` of the class `MFRM` from the response dataframe for analysis and save the object `mfrm_sim_1` to file with `pickle`. Note that when creating the `MFRM` object, we have passed a `max_score` argument as well as the response dataframe. This is not essential, since `RaschPy` will infer the maximum score from the response data if no value is passed, but if no person achieves the maximum score, the inferred value will not cover the full score range.

In [12]:
mfrm_1 = rp.MFRM(mfrm_sim_1.scores, max_score=mfrm_sim_1.max_score)

with open('mfrm_sim_1.pickle', 'wb') as handle:
    pickle.dump(mfrm_sim_1, handle, protocol=pickle.HIGHEST_PROTOCOL)
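
The reason for passing `max_score` explicitly can be seen in a toy example. The sketch below does not call `RaschPy`; it assumes only that a maximum inferred from data can be no higher than the highest observed score.

```python
# Toy responses on a 0-5 scale in which nobody scores 5 (None marks missing data).
responses = [
    [3, 2, 4, None],
    [1, 0, 2, 3],
    [4, 4, 3, 2],
]

# Inferring the maximum from the data finds only the highest observed score.
observed = [score for row in responses for score in row if score is not None]
inferred_max_score = max(observed)

# Supplying the known maximum keeps the unused top category in the score range.
true_max_score = 5
unobserved_categories = sorted(set(range(true_max_score + 1)) - set(observed))

print(inferred_max_score, unobserved_categories)
```

Here the inferred maximum is 4 although the scale runs to 5, so the top category would be lost without the explicit argument.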

You may wish to create a simulation based on specified, known item difficulties, rater severities, thresholds and/or person abilities. This may be done by passing lists to the `manual_diffs`, `manual_severities`, `manual_thresholds` and/or `manual_abilities` arguments (in which case, there is no need to pass the relevant `item_range`, `rater_range`, `category_base`, `max_disorder`, `person_sd` or `offset` arguments). You may also customise the names of the items and/or persons by passing lists of the correct length to the `manual_person_names` and/or `manual_item_names` arguments.

The `manual_diffs` and `manual_abilities` arguments may also be used to supply random item difficulties and/or person abilities drawn from distributions other than the default uniform (for items) and normal (for persons). This is what is done in the example `mfrm_sim_2` below: a set of specified, fixed item difficulties (6 items of difficulty between -2.5 logits and +2.5 logits, with a maximum score of 5) and a set of Rasch-Andrich thresholds (summing to zero, with a 'dummy' threshold 0 of value 0) are passed together with 5 raters with specified severity profiles and a random uniform distribution of person abilities (between -2 and +2 logits). For this simulation, we also set a proportion of 10% missing data (missing completely at random) by passing the argument `missing=0.1`.

In [18]:
mfrm_sim_2 = rp.MFRM_Sim_Items(no_of_items=6,
                               no_of_persons=500,
                               no_of_raters=5,
                               max_score=5,
                               missing=0.1,
                               manual_abilities = np.random.uniform(-2, 2, 500),
                               manual_diffs=[-2.5, -1.5, -0.5, 0.5, 1.5, 2.5],
                               manual_thresholds=[0, -2, -1, 0, 1, 2],
                               manual_severities = {'Rater_1': {'Item_1': 0, 'Item_2': 0, 'Item_3': 0, 'Item_4': 0, 'Item_5': 0, 'Item_6': 0},
                                                    'Rater_2': {'Item_1': 0, 'Item_2': 0, 'Item_3': 0, 'Item_4': 0, 'Item_5': 0, 'Item_6': 0},
                                                    'Rater_3': {'Item_1': -1, 'Item_2': -2, 'Item_3': 0, 'Item_4': 2, 'Item_5': -2, 'Item_6': -1},
                                                    'Rater_4': {'Item_1': 2, 'Item_2': 1, 'Item_3': -1, 'Item_4': 2, 'Item_5': 1, 'Item_6': -1},
                                                    'Rater_5': {'Item_1': 1, 'Item_2': 1, 'Item_3': 2, 'Item_4': 1, 'Item_5': 1, 'Item_6': 1}})
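
Any sequence of the right length can serve as custom abilities, so other distributions are easy to set up. The sketch below is illustrative only: it uses the standard library's `random` module with a fixed seed (the seed value and the bimodal mixture are arbitrary choices, not taken from `RaschPy`) to build two candidate ability lists.

```python
import random

rng = random.Random(2022)  # fixed seed so the simulated sample is reproducible
no_of_persons = 500

# Uniform abilities between -2 and +2 logits, as in the example above.
uniform_abilities = [rng.uniform(-2, 2) for _ in range(no_of_persons)]

# A bimodal alternative: an equal mixture of N(-1.5, 0.5) and N(1.5, 0.5).
bimodal_abilities = [rng.gauss(rng.choice([-1.5, 1.5]), 0.5)
                     for _ in range(no_of_persons)]
```

Either list could then be passed as `manual_abilities`, provided its length matches `no_of_persons`.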

Save the generated response dataframe, which is stored as an attribute `mfrm_sim_2.scores`, to file, and view it.

In [19]:
mfrm_sim_2.scores.to_csv('mfrm_sim_2_scores.csv')
mfrm_sim_2.scores

Unnamed: 0,Unnamed: 1,Item_1,Item_2,Item_3,Item_4,Item_5,Item_6
Rater_1,Person_1,5.0,4.0,3.0,1.0,,1.0
Rater_1,Person_2,3.0,4.0,,0.0,0.0,
Rater_1,Person_3,5.0,5.0,5.0,4.0,4.0,2.0
Rater_1,Person_4,4.0,4.0,3.0,3.0,2.0,1.0
Rater_1,Person_5,5.0,4.0,3.0,3.0,2.0,0.0
...,...,...,...,...,...,...,...
Rater_5,Person_496,3.0,4.0,2.0,3.0,0.0,0.0
Rater_5,Person_497,,4.0,1.0,2.0,0.0,0.0
Rater_5,Person_498,,0.0,0.0,0.0,,0.0
Rater_5,Person_499,3.0,1.0,0.0,0.0,0.0,0.0
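
The empty cells in the output above are the missing responses. As a rough illustration of what missing completely at random means, the sketch below blanks each cell of a toy data set independently with probability 0.1. It is an assumption about the general mechanism, not a description of how `RaschPy` implements the `missing` argument.

```python
import random

rng = random.Random(0)
missing = 0.1  # target proportion of missing responses

# Toy complete data: 5 raters x 500 persons x 6 items, every score set to 3.
complete = [[3] * 6 for _ in range(5 * 500)]

# Each cell is blanked independently with probability `missing`, so
# missingness is unrelated to persons, items, raters or scores.
masked = [[None if rng.random() < missing else score for score in row]
          for row in complete]

n_cells = sum(len(row) for row in masked)
n_missing = sum(score is None for row in masked for score in row)
observed_proportion = n_missing / n_cells
```

With independent masking, the realised proportion of missing cells is close to, but not exactly, the target value.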


Save the generating item, threshold, rater and person parameters to file, and view the item difficulties, the Rasch-Andrich thresholds (which include the 'dummy' threshold 0, always set to 0) and the rater severities, plus the first 5 person abilities.

In [20]:
mfrm_sim_2.diffs.to_csv('mfrm_sim_2_diffs.csv', header=None)
mfrm_sim_2.diffs

Item_1   -2.5
Item_2   -1.5
Item_3   -0.5
Item_4    0.5
Item_5    1.5
Item_6    2.5
dtype: float64

In [21]:
pd.Series(mfrm_sim_2.thresholds).to_csv('mfrm_sim_2_thresholds.csv', header=None)
mfrm_sim_2.thresholds

array([ 0, -2, -1,  0,  1,  2])

In [23]:
pd.DataFrame(mfrm_sim_2.severities).T.to_csv('mfrm_sim_2_severities.csv')
pd.DataFrame(mfrm_sim_2.severities).T.head(5)

Unnamed: 0,Item_1,Item_2,Item_3,Item_4,Item_5,Item_6
Rater_1,0,0,0,0,0,0
Rater_2,0,0,0,0,0,0
Rater_3,-1,-2,0,2,-2,-1
Rater_4,2,1,-1,2,1,-1
Rater_5,1,1,2,1,1,1
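
Typing the nested severity dictionary by hand becomes error-prone with more raters or items. A minimal sketch of building the same rater-by-item structure from one row of values per rater is given below; the name `severity_rows` is purely illustrative.

```python
item_names = [f'Item_{i + 1}' for i in range(6)]

# One severity per item for each rater, in item order (values from the table above).
severity_rows = {
    'Rater_1': [0, 0, 0, 0, 0, 0],
    'Rater_2': [0, 0, 0, 0, 0, 0],
    'Rater_3': [-1, -2, 0, 2, -2, -1],
    'Rater_4': [2, 1, -1, 2, 1, -1],
    'Rater_5': [1, 1, 2, 1, 1, 1],
}

# Nested {rater: {item: severity}} dictionary, the layout used for `manual_severities`.
manual_severities = {rater: dict(zip(item_names, row))
                     for rater, row in severity_rows.items()}
```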


In [24]:
mfrm_sim_2.abilities.to_csv('mfrm_sim_2_abilities.csv', header=None)
mfrm_sim_2.abilities.head(5)

Person_1   -0.459733
Person_2   -1.262183
Person_3    1.141660
Person_4    0.403744
Person_5    0.271680
dtype: float64

View `max_score`.

In [25]:
mfrm_sim_2.max_score

5

Create an object, `mfrm_2`, of the class `MFRM` from the response dataframe for analysis and save the object `mfrm_sim_2` to file with `pickle`.

In [26]:
mfrm_2 = rp.MFRM(mfrm_sim_2.scores, max_score=mfrm_sim_2.max_score)

with open('mfrm_sim_2.pickle', 'wb') as handle:
    pickle.dump(mfrm_sim_2, handle, protocol=pickle.HIGHEST_PROTOCOL)
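
A pickled simulation can be reloaded in a later session with `pickle.load`. The sketch below uses a plain dictionary as a stand-in for the simulation object and a temporary directory in place of the working directory, so that it runs without `RaschPy` installed.

```python
import os
import pickle
import tempfile

# Stand-in for a simulation object; a plain dictionary keeps the sketch self-contained.
sim_stand_in = {'max_score': 5, 'thresholds': [0, -2, -1, 0, 1, 2]}

path = os.path.join(tempfile.mkdtemp(), 'sim_stand_in.pickle')

with open(path, 'wb') as handle:
    pickle.dump(sim_stand_in, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Reload in the same way from a later session.
with open(path, 'rb') as handle:
    reloaded = pickle.load(handle)
```

When reloading a real simulation object, `RaschPy` must be importable in the new session, because `pickle` stores a reference to the class rather than its code. Pickle files should only be loaded from trusted sources.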