# 1. Implementing a Rasch Model

## 1.1 JAG Model Description
An important question to ask before implementing anything is what exactly is being implemented. In my case, the question
was: "What exactly is a Rasch model?" When I started this project, my initial goal was to implement a Cornell paper,
*JAG: Joint Assessment Grading*. That paper defines the Rasch model as
    
$$ P(X_{ij} = 1 | S_i, Q_j) = \frac{1}{1+e^{-X_{ij} \cdot (S_i - Q_j)}} \tag{1}$$

where $X_{ij}$ is an encoded answer, either 1 or 0, from some sort of questionnaire. In the paper, the context is an
academic test. Thus, $S_i$ is the $i^{th}$ student's ability, $Q_j$ is the $j^{th}$ question's difficulty, and $X_{ij} = 1$
represents a correct answer while $X_{ij} = 0$ represents an incorrect answer. It is also worth noting that these
parameters are estimated by maximizing a likelihood function.

## 1.2 Why Explore Other Descriptions for the Rasch Model?
When I was implementing the Rasch model, I needed something to test my implementation against. Initially, following the JAG
paper, I implemented the Rasch model by maximizing the log likelihood of the model.

---
### 1.2.1: Side note on Rasch Model Optimization
In practice, I minimized the negative log likelihood using TensorFlow's
automatic differentiation, with the following loss function:

$$\text{loss} = -\sum_{i,j} \left[ X_{ij}\log\left(\sigma(X_{ij}(s_i-q_j))\right) + (1-X_{ij})\log\left(1-\sigma(X_{ij}(s_i-q_j))\right) \right]$$

This loss function is common in logistic regression, but I have no academic citation for it, since the majority of papers
exploring the Rasch model either do not define a complete loss function for logistic regression or do not use logistic
regression at all. A good example of the former is a paper from SAS on using logistic regression to optimize the Rasch
model: it defines separate loss functions for the two parameters but does not explain how to apply them so that both
parameters are optimized. In the latter case, Marginal Maximum Likelihood or Bayesian estimation techniques are used
instead. This lack of clarity about how the Rasch model is optimized was one of my motivations for
exploring further model descriptions.
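
The loss above can be written out directly. The following is a minimal plain-Python sketch, not the TensorFlow code used in this project; the names `sigmoid` and `negative_log_likelihood` and the nested-list layout of `X` are assumptions made for illustration. It transcribes the loss term by term, including the $X_{ij}$ factor inside $\sigma$.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def negative_log_likelihood(X, s, q):
    """Loss from this side note, transcribed term by term.

    X[i][j] is 1 for a correct answer and 0 otherwise (hypothetical layout),
    s[i] is student i's ability, q[j] is question j's difficulty.
    """
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            # sigma(X_ij * (s_i - q_j)), exactly as the loss is written above
            p = sigmoid(x * (s[i] - q[j]))
            total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return -total


# Toy data: two students, two questions (made-up values)
X = [[1, 0], [1, 1]]
s = [0.0, 1.0]
q = [0.0, 0.5]
print(negative_log_likelihood(X, s, q))
```

An automatic-differentiation framework would compute the gradients of this quantity with respect to `s` and `q`; the sketch only evaluates the loss itself.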

---

Eventually, I came across a tutorial on psychometric parameter estimation from Penn State's Social Science Research Institute.
The tutorial can be found here: https://quantdev.ssri.psu.edu/tutorials/introduction-irt-modeling. What is notable about this
tutorial is the following passage:

```
"The 1PL (also called the Rasch model) IRT model describes test items in terms of only one parameter, item difficulty, q. Item difficulty is simply how hard an item is (how high does the latent trait ability level need to be in order to have a 50% chance of getting the item right?). q is estimated for each item of the test."
```

The Rasch model this tutorial puts forth is defined as 
$$ P(X_{ij} = 1 | S_i, Q_j) =  \frac{e^{S_i - Q_j}}{1+e^{S_i - Q_j}} \tag{2}$$

Now an inconsistency has been introduced: Equation (1) carries the response $X_{ij}$ inside the exponent, while Equation (2)
does not. Thus, more descriptions need to be investigated to find out which one can be trusted.


Note: The notation in Equation (2) is changed based on my interpretation of the Penn State parameters in terms of the original
JAG implementation; in particular, $b = Q$ and $\theta = S$.
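
To make the comparison concrete, Equations (1) and (2) can be evaluated side by side. This is a small sketch in plain Python; the function names `p_eq1` and `p_eq2` and the sample values are hypothetical and chosen only for illustration. It reads Equation (1) literally, with the response $X_{ij}$ inside the exponent.

```python
import math


def p_eq1(x, s, q):
    """Equation (1), read literally: the response x sits inside the exponent."""
    return 1.0 / (1.0 + math.exp(-x * (s - q)))


def p_eq2(s, q):
    """Equation (2): the form given in the Penn State tutorial."""
    return math.exp(s - q) / (1.0 + math.exp(s - q))


s, q = 1.5, 0.5  # made-up ability and difficulty

# With x = 1 the two forms agree up to floating-point rounding.
print(p_eq1(1, s, q), p_eq2(s, q))

# With x = 0 the exponent in Equation (1) vanishes, so the value is 0.5
# no matter what s and q are.
print(p_eq1(0, s, q))
```

Under this reading, the two equations agree when $X_{ij} = 1$, and differ only in how Equation (1) behaves when $X_{ij} = 0$.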


## 1.3 Chris Hulme-Lowe Dissertation Model Description
While investigating this issue, I came across a dissertation that described IRT parameter estimation and optimization methods in great
detail. The dissertation presents the Rasch model as different from the 1PL (one-parameter logistic model):
"The Rasch model is less restrictive than the 1PL, but more restrictive than the 2PL," (). This is in contrast to
the Penn State tutorial, which presents the Rasch model as equivalent to the 1PL: "The 1PL (also called the Rasch model) IRT model describes..." ().
Yet the dissertation presents the Rasch model in a form nearly identical to the Penn State description:

$$ P(X_{ij} = 1 | S_i, Q_j) =  \frac{\exp(Da_j(S_i - Q_j))}{1+\exp(Da_j(S_i - Q_j))} \tag{3} $$

where, as explained in the dissertation, $D$ can be thought of as 1. In fact, most Rasch model software implementations set $D$ to 1:
"However, the logistic model has become so ubiquitous in IRT that many modern software packages leave D = 1,
which produces parameters in the logistic metric," (). In addition, $a_j$ is the discrimination parameter of the $j^{th}$ question
(a question's ability to determine student ability); for the Rasch model, $a_j = \bar{a}$, and typically $\bar{a} = 1$.

(However, Hulme-Lowe does not discuss the value of $\bar{a}$. This information comes from the documentation of
the R implementation of the Rasch model, "Although the common formulation of the Rasch model assumes that the discrimination parameter is fixed to 1...", which can be found here: https://www.rdocumentation.org/packages/ltm/versions/1.1-1/topics/rasch .)

Thus, when simplified, the Rasch model as described by Hulme-Lowe is

$$ P(X_{ij} = 1 | S_i, Q_j) =  \frac{\exp(S_i - Q_j)}{1+\exp(S_i - Q_j)} \tag{4} $$

And thus, with $D = 1$ and $\bar{a} = 1$, (4) = (3) = (2), in theory.
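
This equality can be checked numerically. The sketch below is plain Python with hypothetical function names (`p_eq3`, `p_eq4`) and made-up sample values; it assumes $D = 1$ and a shared discrimination of 1, as discussed above. Equations (2) and (4) are the same expression, so a single function covers both.

```python
import math


def p_eq3(s, q, a=1.0, D=1.0):
    """Equation (3): scaling constant D and discrimination a multiply (s - q)."""
    z = D * a * (s - q)
    return math.exp(z) / (1.0 + math.exp(z))


def p_eq4(s, q):
    """Equation (4), which is the same expression as Equation (2)."""
    z = s - q
    return math.exp(z) / (1.0 + math.exp(z))


# Made-up (ability, difficulty) pairs
for s, q in [(-1.0, 0.5), (0.0, 0.0), (2.0, 1.0)]:
    # With the defaults D = 1 and a = 1, Equation (3) reduces to Equation (4).
    assert math.isclose(p_eq3(s, q), p_eq4(s, q))

# A discrimination other than 1 breaks the equality whenever s != q.
print(p_eq3(2.0, 1.0, a=2.0), p_eq4(2.0, 1.0))
```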