Merge pull request #4 from junchenfeng/master
Fix convergence bug. Add DAO module. Add CI.
junchenfeng committed Aug 8, 2017
2 parents 5cb9cdb + 3405833 commit c69faa1
Showing 25 changed files with 675 additions and 1,127 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -22,7 +22,6 @@ other
build
dist
*.egg-info

#git ls-files -ci --exclude-standard -z | xargs -0 git rm --cached


@@ -88,3 +87,4 @@ docs/_build/

# compiled by cpython, see it at setup.py
pyirt/utl/clib.c
sandbox/
15 changes: 9 additions & 6 deletions .travis.yml
@@ -1,29 +1,32 @@
# Copied from https://gist.github.com/dan-blanchard/7045057 ("Quicker Travis builds that rely on numpy and scipy using Miniconda")

language: python
python:
- 2.7
notifications:
email: false

# Setup anaconda
before_install:
- wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
- chmod +x miniconda.sh
- ./miniconda.sh -b
- export PATH=/home/travis/miniconda/bin:$PATH
- export PATH=/home/travis/miniconda2/bin:$PATH
- conda config --add channels dan_blanchard
- conda config --add channels desilinguist
- conda update --yes conda
# The next couple lines fix a crash with multiprocessing on Travis and are not specific to using Miniconda
- sudo rm -rf /dev/shm
- sudo ln -s /run/shm /dev/shm
# Install packages
install:
- conda install --yes python=$TRAVIS_PYTHON_VERSION atlas numpy scipy matplotlib nose dateutil pandas statsmodels
- conda install --yes python=$TRAVIS_PYTHON_VERSION atlas numpy scipy python-coveralls
# Coverage packages are on my binstar channel
- conda install --yes -c dan_blanchard python-coveralls nose-cov cython
- conda install --yes -c dan_blanchard cython
- pip install nose-cov
- python setup.py install

# Run test
script:
- nosetests
- nosetests

# Calculate coverage
after_success:
113 changes: 46 additions & 67 deletions README.md
@@ -1,16 +1,39 @@
pyirt
=====
[![Build Status](https://img.shields.io/travis/17zuoye/pyirt/master.svg?style=flat)](https://travis-ci.org/17zuoye/pyirt)
[![Coverage Status](https://coveralls.io/repos/17zuoye/pyirt/badge.svg)](https://coveralls.io/r/17zuoye/pyirt)
[![Health](https://landscape.io/github/17zuoye/pyirt/master/landscape.svg?style=flat)](https://landscape.io/github/17zuoye/pyirt/master)
[![Build Status](https://img.shields.io/travis/junchenfeng/pyirt/master.svg?style=flat)](https://travis-ci.org/junchenfeng/pyirt)
[![Coverage Status](https://coveralls.io/repos/github/junchenfeng/pyirt/badge.svg?branch=master)](https://coveralls.io/github/junchenfeng/pyirt?branch=master)
[![Code Health](https://landscape.io/github/junchenfeng/pyirt/master/landscape.svg?style=flat)](https://landscape.io/github/junchenfeng/pyirt/master)
[![Download](https://img.shields.io/pypi/dm/pyirt.svg?style=flat)](https://pypi.python.org/pypi/pyirt)
[![License](https://img.shields.io/pypi/l/pyirt.svg?style=flat)](https://pypi.python.org/pypi/pyirt)



A Python library of IRT algorithms designed to cope with sparse data structures.

- The current version is in an early development stage. Use at your own peril.
- Built and tested under py2.7. py3.3 compatibility is tested in my own
environment.


# Demo
```python
from pyirt import irt

src_fp = open(file_path,'r')

# alternatively, pass in a list of tuples in the format [(user_id, item_id, ans_boolean)]
# where ans_boolean is 0/1.


# (1)Run by default
item_param, user_param = irt(src_fp)

# (2)Supply bounds
item_param, user_param = irt(src_fp, theta_bnds = [-5,5], alpha_bnds=[0.1,3], beta_bnds = [-3,3])

# (3)Supply guess parameter
guessParamDict = {1:{'c':0.0}, 2:{'c':0.25}}
item_param, user_param = irt(src_fp, in_guess_param = guessParamDict)
```


I.Model Specification
@@ -21,7 +44,7 @@ The current version supports the MMLE algorithm and the unidimensional two-parameter
IRT model. There is a backdoor method to specify the guess parameter but there
is no active estimation.

The prior distribution of theta is uniform rather than beta.
The prior distribution of theta is **uniform**.

There is no regularization in alpha and beta estimation. Therefore, the default
algorithm puts boundaries on the parameters to prevent over-fitting and to deal with
@@ -31,18 +54,23 @@ extreme cases where almost all responses to the item are right.
The package offers two methods to estimate theta, given item parameters: Bayesian and MLE. <br>
The estimation procedure is quite primitive. For examples, see the test case.
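
For concreteness, the item response function behind the two-parameter model (with the backdoor guess parameter c) can be sketched as follows. This is an illustration only; whether beta enters as an intercept (`alpha * theta + beta`) or as a difficulty shift (`alpha * (theta - beta)`) depends on the implementation in `pyirt/utl/clib`, so take the exact form here as an assumption:

```python
import numpy as np

def prob_right(theta, alpha, beta, c=0.0):
    # 2PL with an optional fixed guess floor c (c is supplied, not estimated):
    # P(right | theta) = c + (1 - c) * logistic(alpha * theta + beta)  # assumed form
    return c + (1.0 - c) / (1.0 + np.exp(-(alpha * theta + beta)))
```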

II.What's New
II.Sparse Data Structure
==========

The IRT model was developed for offline tests with little missing data. However,
when one tries to calibrate item parameters for an online testing bank, that
assumption breaks down and the algorithm runs into a sparse data problem, as
well as a severe missing data problem.
In non-test learning datasets, missing data are common: not all students
finish all the items. When the numbers of students and items are large, the
data can be extremely sparse.

The package deals with the sparse structure in two ways:
- Efficient memory storage. A collapsed list is used to index the data (see the
sketch at the end of this section). The memory usage is about 3 times the size
of the text data file: if the workstation has 6G of free memory, it can handle
a 2G data file. Most other IRT packages would simply break.

## Missing Data
- No joint estimation. Under IRT's conditional independence assumption,
estimating each item's parameters is consistent but inefficient. To avoid
inverting a giant Jacobian matrix, the item parameters are estimated separately.

For now, missing data are assumed to be ignorable.
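
One way to picture the "collapsed list" storage mentioned above (a hypothetical illustration, not the package's actual internals):

```python
# Responses are stored once in flat, parallel lists; per-item index lists
# point back into them, so memory stays close to the size of the raw data.
user_ids = [101, 101, 102, 103]   # who answered (made-up ids)
item_ids = [7, 8, 7, 8]           # which item
ans_tags = [1, 0, 1, 1]           # right (1) / wrong (0)

item2rows = {}                    # collapsed index: item -> response rows
for row, eid in enumerate(item_ids):
    item2rows.setdefault(eid, []).append(row)

# all answers to item 7: [(101, 1), (102, 1)]
print([(user_ids[r], ans_tags[r]) for r in item2rows[7]])
```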

III.Default Config
===========
@@ -70,46 +98,13 @@ The file is expected to be comma delimited.

The three columns are uid, eid, result-flag.

Currently the model only works well with 0/1 flag but will NOT raise error for
Currently the model only works well with 0/1 flag but will **NOT** raise error for
other types.
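
For instance, a few rows of a valid input file might look like this (made-up ids and flags):

```
1001,12,1
1001,13,0
1002,12,1
```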



V.Example
=========
```python
from pyirt import *

src_fp = open(file_path,'r')

# alternatively, pass in a list of tuples in the format [(uid, eid, atag),...]


# (1)Run by default
item_param, user_param = irt(src_fp)

# (2)Supply bnds
item_param, user_param = irt(src_fp, theta_bnds = [-5,5], beta_bnds = [-3,3])

# (3)Supply guess parameter
guessParamDict = {1:{'c':0.0}, 2:{'c':0.25}}

item_param, user_param = irt(src_fp, in_guess_param = guessParamDict)
```


VI.Performance
V.Note
=======

## Cython Optimization
The crucial function is the log likelihood evaluation, which is implemented in
Cython. At the scale of 1 million records, it halves the run time.
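
As a rough sketch of what that function computes — a pure-Python stand-in matching the call signature `clib.log_likelihood_2PL(y1, y0, theta, alpha, beta, c)` used in `pyirt/algo.py`; the internal form is assumed, not taken from the Cython source:

```python
import math

def log_likelihood_2PL(y1, y0, theta, alpha, beta, c=0.0):
    # y1/y0 weight the right/wrong branches; for one response, one of them is 1.0
    p = c + (1.0 - c) / (1.0 + math.exp(-(alpha * theta + beta)))  # assumed 2PL form
    return y1 * math.log(p) + y0 * math.log(1.0 - p)
```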

## Why no parallel
Multi-processing in Python does not accept class methods.

In addition, none of the calculations is particularly computation heavy. The
communication cost outweighs the parallel gain.

## Minimization solver
The scipy minimize is as good as cvxopt.cp and matlab fmincon on item parameter
@@ -120,28 +115,12 @@ However, the convergence is pretty slow. It requires about 10k observations per
item to recover the parameters to 0.01 precision.


VII.ToDos
===========

## Models
(1) The solver cannot handle polytomous answers.

(2) The solver cannot handle multi-dimensional data.

(3) The solver cannot handle group constraints.


## BIG DATA
dbm is a workaround for when the data are too large for memory. However, Berkeley DB
is quite hard to install across operating systems. Therefore, although the utl module
contains code snippets for the dbm trick, it is not shipped as standard.



VIII.Acknowledgement
VII.Acknowledgement
==============
The algorithm is described in detail by Bradley Hanson (2000); see the
literature section. I am grateful for Mr. Hanson's work.

The python implementation benefited greatly from the comments and suggestions of Chaoqun Fu and Dawei Chen.
[Chaoqun Fu](https://github.com/fuchaoqun)'s comment led to the (much better) API design.

[Dawei Chen](https://github.com/mvj3) and [Lei Wang](https://github.com/wlbksy) contributed to the code.

8 changes: 2 additions & 6 deletions pyirt/__init__.py
@@ -1,6 +1,2 @@
__all__ = ["irt", "model", "solver", "utl"]

from ._pyirt import irt, model

import solver
import utl
__all__ = ["_pyirt", "solver", "util"]
from ._pyirt import irt
30 changes: 19 additions & 11 deletions pyirt/_pyirt.py
@@ -1,28 +1,36 @@
# -*-coding:utf-8-*-

from .solver import model
from .dao import localDAO


def irt(src, theta_bnds=[-4, 4],
def irt(data_src,
        theta_bnds=[-4, 4], num_theta=11,
        alpha_bnds=[0.25, 2], beta_bnds=[-2, 2], in_guess_param='default',
        model_spec='2PL',
        mode='memory', is_mount=False, user_name=None):
        max_iter=10, tol=1e-3, nargout=2):


    # load data
    dao_instance = localDAO(data_src)

    if model_spec == '2PL':
        mod = model.IRT_MMLE_2PL()
        mod = model.IRT_MMLE_2PL(dao_instance)
    else:
        raise Exception('Unknown model specification.')

    # load
    mod.load_data(src, is_mount, user_name)
    mod.load_param(theta_bnds, alpha_bnds, beta_bnds)
    mod.load_guess_param(in_guess_param)
    # specify the irt parameters
    mod.set_options(theta_bnds, num_theta, alpha_bnds, beta_bnds, max_iter, tol)
    mod.set_guess_param(in_guess_param)

    # solve
    mod.solve_EM()

    # post
    item_param_dict = mod.get_item_param()
    user_param_dict = mod.get_user_param()

    return item_param_dict, user_param_dict
    if nargout == 1:
        return item_param_dict
    elif nargout == 2:
        user_param_dict = mod.get_user_param()
        return item_param_dict, user_param_dict
    else:
        raise Exception('Invalid number of arguments.')
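
Given the new `nargout` switch, a caller that only needs item parameters can skip the user-parameter lookup. A usage sketch based on the signature in this diff (the response tuples are made up):

```python
from pyirt import irt

# hypothetical (user_id, item_id, ans_boolean) records
data = [(1, 10, 1), (1, 11, 0), (2, 10, 1)]

item_param = irt(data, nargout=1)     # item parameters only
item_param, user_param = irt(data)    # default nargout=2: both
```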
36 changes: 36 additions & 0 deletions pyirt/algo.py
@@ -0,0 +1,36 @@
# -*- coding:utf-8 -*-
from .util import clib, tools
import numpy as np


def update_theta_distribution(data, num_theta, theta_prior_val, theta_density, item_param_dict):
    '''
    data = [(item_idx int, ans_tag binary)]
    '''

    '''
    Basic math:
    P_t(theta, data | q_param) = p(data | q_param, theta) * p_[t-1](theta)
    p_t(data | q_param) = sum over theta of P_t(theta, data | q_param)
    p_t(theta | data, q_param) = P_t(theta, data | q_param) / p_t(data | q_param)
    '''
    likelihood_vec = np.zeros(num_theta)

    for k in range(num_theta):
        theta = theta_prior_val[k]
        ell = 0.0
        for log in data:
            item_idx = log[0]
            ans_tag = log[1]
            alpha = item_param_dict[item_idx]['alpha']
            beta = item_param_dict[item_idx]['beta']
            c = item_param_dict[item_idx]['c']
            ell += clib.log_likelihood_2PL(0.0 + ans_tag, 1.0 - ans_tag, theta, alpha, beta, c)
        likelihood_vec[k] = ell

    # posterior, computed in log space to avoid underflow
    joint_llk_vec = likelihood_vec + np.log(theta_density)
    marginal = tools.logsum(joint_llk_vec)
    posterior = np.exp(joint_llk_vec - marginal)

    return posterior
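
A sketch of how this posterior update might be driven, with made-up parameter values (the real call sites are in the EM solver):

```python
import numpy as np

num_theta = 11
theta_prior_val = np.linspace(-4, 4, num_theta)    # grid over default theta_bnds
theta_density = np.ones(num_theta) / num_theta     # uniform prior, per the README

item_param_dict = {0: {'alpha': 1.0, 'beta': 0.0, 'c': 0.0}}
data = [(0, 1), (0, 0)]                            # (item_idx, ans_tag)

posterior = update_theta_distribution(
    data, num_theta, theta_prior_val, theta_density, item_param_dict)
assert abs(posterior.sum() - 1.0) < 1e-9           # a proper distribution
```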
