Merge pull request #4 from junchenfeng/master
Fix convergence bug. Add DAO module. Add CI.
junchenfeng committed Aug 8, 2017
2 parents 5cb9cdb + 3405833 commit c69faa1
Showing 25 changed files with 675 additions and 1,127 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -22,7 +22,6 @@ other
build
dist
*.egg-info

#git ls-files -ci --exclude-standard -z | xargs -0 git rm --cached


@@ -88,3 +87,4 @@ docs/_build/

# compiled by cpython, see it at setup.py
pyirt/utl/clib.c
sandbox/
15 changes: 9 additions & 6 deletions .travis.yml
@@ -1,29 +1,32 @@
# Copied from https://gist.github.com/dan-blanchard/7045057 ("Quicker Travis builds that rely on numpy and scipy using Miniconda")

language: python
python:
- 2.7
notifications:
email: false

# Setup anaconda
before_install:
- wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
- chmod +x miniconda.sh
- ./miniconda.sh -b
- export PATH=/home/travis/miniconda/bin:$PATH
- export PATH=/home/travis/miniconda2/bin:$PATH
- conda config --add channels dan_blanchard
- conda config --add channels desilinguist
- conda update --yes conda
# The next couple lines fix a crash with multiprocessing on Travis and are not specific to using Miniconda
- sudo rm -rf /dev/shm
- sudo ln -s /run/shm /dev/shm
# Install packages
install:
- conda install --yes python=$TRAVIS_PYTHON_VERSION atlas numpy scipy matplotlib nose dateutil pandas statsmodels
- conda install --yes python=$TRAVIS_PYTHON_VERSION atlas numpy scipy python-coveralls
# Coverage packages are on my binstar channel
- conda install --yes -c dan_blanchard python-coveralls nose-cov cython
- conda install --yes -c dan_blanchard cython
- pip install nose-cov
- python setup.py install

# Run test
script:
- nosetests
- nosetests

# Calculate coverage
after_success:
113 changes: 46 additions & 67 deletions README.md
@@ -1,16 +1,39 @@
pyirt
=====
[![Build Status](https://img.shields.io/travis/17zuoye/pyirt/master.svg?style=flat)](https://travis-ci.org/17zuoye/pyirt)
[![Coverage Status](https://coveralls.io/repos/17zuoye/pyirt/badge.svg)](https://coveralls.io/r/17zuoye/pyirt)
[![Health](https://landscape.io/github/17zuoye/pyirt/master/landscape.svg?style=flat)](https://landscape.io/github/17zuoye/pyirt/master)
[![Build Status](https://img.shields.io/travis/junchenfeng/pyirt/master.svg?style=flat)](https://travis-ci.org/junchenfeng/pyirt)
[![Coverage Status](https://coveralls.io/repos/github/junchenfeng/pyirt/badge.svg?branch=master)](https://coveralls.io/github/junchenfeng/pyirt?branch=master)
[![Code Health](https://landscape.io/github/junchenfeng/pyirt/master/landscape.svg?style=flat)](https://landscape.io/github/junchenfeng/pyirt/master)
[![Download](https://img.shields.io/pypi/dm/pyirt.svg?style=flat)](https://pypi.python.org/pypi/pyirt)
[![License](https://img.shields.io/pypi/l/pyirt.svg?style=flat)](https://pypi.python.org/pypi/pyirt)



A Python library of IRT algorithms designed to cope with sparse data structures.

- The current version is in an early development stage. Use at your own peril.
- Built and tested under py2.7. py3.3 compatibility is tested in my own
environment.


# Demo
```python
from pyirt import irt

src_fp = open(file_path,'r')

# alternatively, pass in a list of tuples in the format [(user_id, item_id, ans_boolean)]
# where ans_boolean is 0/1.


# (1)Run by default
item_param, user_param = irt(src_fp)

# (2)Supply bounds
item_param, user_param = irt(src_fp, theta_bnds = [-5,5], alpha_bnds=[0.1,3], beta_bnds = [-3,3])

# (3)Supply guess parameter
guessParamDict = {1:{'c':0.0}, 2:{'c':0.25}}
item_param, user_param = irt(src_fp, in_guess_param = guessParamDict)
```


I.Model Specification
@@ -21,7 +44,7 @@ The current version supports the MMLE algorithm and the unidimensional two-parameter
IRT model. There is a backdoor method to specify the guess parameter but there
is no active estimation.

The prior distribution of theta is uniform rather than beta.
The prior distribution of theta is **uniform**.

There is no regularization in alpha and beta estimation. Therefore, the default
algorithm puts boundaries on the parameters to prevent over-fitting and to deal with
@@ -31,18 +54,23 @@ extreme cases where almost all responses to the item are right.
The package offers two methods to estimate theta, given item parameters: Bayesian and MLE. <br>
The estimation procedure is quite primitive. For examples, see the test case.
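
For concreteness, the item response function behind the two-parameter model (with the backdoor guess parameter c) can be sketched as follows. This is an illustration only; whether beta enters as an intercept (`alpha * theta + beta`) or as a difficulty shift (`alpha * (theta - beta)`) depends on the implementation in `pyirt/utl/clib`, so take the exact form here as an assumption:

```python
import numpy as np

def prob_right(theta, alpha, beta, c=0.0):
    # 2PL with an optional fixed guess floor c (c is supplied, not estimated):
    # P(right | theta) = c + (1 - c) * logistic(alpha * theta + beta)  # assumed form
    return c + (1.0 - c) / (1.0 + np.exp(-(alpha * theta + beta)))
```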

II.What's New
II.Sparse Data Structure
==========

The IRT model was developed for offline tests with little missing data. However,
when one tries to calibrate item parameters for an online testing bank, that
assumption breaks down and the algorithm runs into a sparse data problem, as
well as a severe missing data problem.
In non-test learning datasets, missing data are common: not all students
finish all the items. When the numbers of students and items are large, the
data can be extremely sparse.

The package deals with the sparse structure in two ways:
- Efficient memory storage. A collapsed list is used to index the data (see the
sketch at the end of this section). The memory usage is about 3 times the size
of the text data file: if the workstation has 6G of free memory, it can handle
a 2G data file. Most other IRT packages would simply break.

## Missing Data
- No joint estimation. Under IRT's conditional independence assumption,
estimating each item's parameters is consistent but inefficient. To avoid
inverting a giant Jacobian matrix, the item parameters are estimated separately.

For now, missing data are assumed to be ignorable.
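
One way to picture the "collapsed list" storage mentioned above (a hypothetical illustration, not the package's actual internals):

```python
# Responses are stored once in flat, parallel lists; per-item index lists
# point back into them, so memory stays close to the size of the raw data.
user_ids = [101, 101, 102, 103]   # who answered (made-up ids)
item_ids = [7, 8, 7, 8]           # which item
ans_tags = [1, 0, 1, 1]           # right (1) / wrong (0)

item2rows = {}                    # collapsed index: item -> response rows
for row, eid in enumerate(item_ids):
    item2rows.setdefault(eid, []).append(row)

# all answers to item 7: [(101, 1), (102, 1)]
print([(user_ids[r], ans_tags[r]) for r in item2rows[7]])
```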

III.Default Config
===========
@@ -70,46 +98,13 @@ The file is expected to be comma delimited.

The three columns are uid, eid, result-flag.

Currently the model only works well with 0/1 flag but will NOT raise error for
Currently the model only works well with 0/1 flag but will **NOT** raise error for
other types.
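
For instance, a few rows of a valid input file might look like this (made-up ids and flags):

```
1001,12,1
1001,13,0
1002,12,1
```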



V.Example
=========
```python
from pyirt import *

src_fp = open(file_path,'r')

# alternatively, pass in a list of tuples in the format [(uid, eid, atag),...]


# (1)Run by default
item_param, user_param = irt(src_fp)

# (2)Supply bnds
item_param, user_param = irt(src_fp, theta_bnds = [-5,5], beta_bnds = [-3,3])

# (3)Supply guess parameter
guessParamDict = {1:{'c':0.0}, 2:{'c':0.25}}

item_param, user_param = irt(src_fp, in_guess_param = guessParamDict)
```


VI.Performance
V.Note
=======

## Cython Optimization
The crucial function is the log likelihood evaluation, which is implemented in
Cython. At the scale of 1 million records, it halves the run time.
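
As a rough sketch of what that function computes — a pure-Python stand-in matching the call signature `clib.log_likelihood_2PL(y1, y0, theta, alpha, beta, c)` used in `pyirt/algo.py`; the internal form is assumed, not taken from the Cython source:

```python
import math

def log_likelihood_2PL(y1, y0, theta, alpha, beta, c=0.0):
    # y1/y0 weight the right/wrong branches; for one response, one of them is 1.0
    p = c + (1.0 - c) / (1.0 + math.exp(-(alpha * theta + beta)))  # assumed 2PL form
    return y1 * math.log(p) + y0 * math.log(1.0 - p)
```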

## Why no parallel
Multi-processing in Python does not accept class methods.

In addition, none of the calculations is particularly computation heavy. The
communication cost outweighs the parallel gain.

## Minimization solver
The scipy minimize is as good as cvxopt.cp and matlab fmincon on item parameter
@@ -120,28 +115,12 @@ However, the convergence is pretty slow. It requires about 10k observations per
item to recover the parameters to 0.01 precision.


VII.ToDos
===========

## Models
(1) The solver cannot handle polytomous answers.

(2) The solver cannot handle multi-dimensional data.

(3) The solver cannot handle group constraints.


## BIG DATA
dbm is a workaround for when the data are too large for memory. However, Berkeley DB
is quite hard to install across operating systems. Therefore, although the utl module
contains code snippets for the dbm trick, it is not shipped as standard.



VIII.Acknowledgement
VII.Acknowledgement
==============
The algorithm is described in detail by Bradley Hanson (2000); see the
literature section. I am grateful for Mr. Hanson's work.

The python implementation benefited greatly from the comments and suggestions of Chaoqun Fu and Dawei Chen.
[Chaoqun Fu](https://github.com/fuchaoqun)'s comment led to the (much better) API design.

[Dawei Chen](https://github.com/mvj3) and [Lei Wang](https://github.com/wlbksy) contributed to the code.

8 changes: 2 additions & 6 deletions pyirt/__init__.py
@@ -1,6 +1,2 @@
__all__ = ["irt", "model", "solver", "utl"]

from ._pyirt import irt, model

import solver
import utl
__all__ = ["_pyirt", "solver", "util"]
from ._pyirt import irt
30 changes: 19 additions & 11 deletions pyirt/_pyirt.py
@@ -1,28 +1,36 @@
# -*-coding:utf-8-*-

from .solver import model
from .dao import localDAO


def irt(src, theta_bnds=[-4, 4],
def irt(data_src,
        theta_bnds=[-4, 4], num_theta=11,
        alpha_bnds=[0.25, 2], beta_bnds=[-2, 2], in_guess_param='default',
        model_spec='2PL',
        mode='memory', is_mount=False, user_name=None):
        max_iter=10, tol=1e-3, nargout=2):


    # load data
    dao_instance = localDAO(data_src)

    if model_spec == '2PL':
        mod = model.IRT_MMLE_2PL()
        mod = model.IRT_MMLE_2PL(dao_instance)
    else:
        raise Exception('Unknown model specification.')

    # load
    mod.load_data(src, is_mount, user_name)
    mod.load_param(theta_bnds, alpha_bnds, beta_bnds)
    mod.load_guess_param(in_guess_param)
    # specify the irt parameters
    mod.set_options(theta_bnds, num_theta, alpha_bnds, beta_bnds, max_iter, tol)
    mod.set_guess_param(in_guess_param)

    # solve
    mod.solve_EM()

    # post
    item_param_dict = mod.get_item_param()
    user_param_dict = mod.get_user_param()

    return item_param_dict, user_param_dict
    if nargout == 1:
        return item_param_dict
    elif nargout == 2:
        user_param_dict = mod.get_user_param()
        return item_param_dict, user_param_dict
    else:
        raise Exception('Invalid number of arguments.')
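
Given the new `nargout` switch, a caller that only needs item parameters can skip the user-parameter lookup. A usage sketch based on the signature in this diff (the response tuples are made up):

```python
from pyirt import irt

# hypothetical (user_id, item_id, ans_boolean) records
data = [(1, 10, 1), (1, 11, 0), (2, 10, 1)]

item_param = irt(data, nargout=1)     # item parameters only
item_param, user_param = irt(data)    # default nargout=2: both
```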
36 changes: 36 additions & 0 deletions pyirt/algo.py
@@ -0,0 +1,36 @@
# -*- coding:utf-8 -*-
from .util import clib, tools
import numpy as np


def update_theta_distribution(data, num_theta, theta_prior_val, theta_density, item_param_dict):
    '''
    data = [(item_idx int, ans_tag binary)]
    '''

    '''
    Basic math:
    P_t(theta, data | q_param) = p(data | q_param, theta) * p_[t-1](theta)
    p_t(data | q_param) = sum over theta of P_t(theta, data | q_param)
    p_t(theta | data, q_param) = P_t(theta, data | q_param) / p_t(data | q_param)
    '''
    likelihood_vec = np.zeros(num_theta)

    for k in range(num_theta):
        theta = theta_prior_val[k]
        ell = 0.0
        for log in data:
            item_idx = log[0]
            ans_tag = log[1]
            alpha = item_param_dict[item_idx]['alpha']
            beta = item_param_dict[item_idx]['beta']
            c = item_param_dict[item_idx]['c']
            ell += clib.log_likelihood_2PL(0.0 + ans_tag, 1.0 - ans_tag, theta, alpha, beta, c)
        likelihood_vec[k] = ell

    # posterior, computed in log space to avoid underflow
    joint_llk_vec = likelihood_vec + np.log(theta_density)
    marginal = tools.logsum(joint_llk_vec)
    posterior = np.exp(joint_llk_vec - marginal)

    return posterior
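
A sketch of how this posterior update might be driven, with made-up parameter values (the real call sites are in the EM solver):

```python
import numpy as np

num_theta = 11
theta_prior_val = np.linspace(-4, 4, num_theta)    # grid over default theta_bnds
theta_density = np.ones(num_theta) / num_theta     # uniform prior, per the README

item_param_dict = {0: {'alpha': 1.0, 'beta': 0.0, 'c': 0.0}}
data = [(0, 1), (0, 0)]                            # (item_idx, ans_tag)

posterior = update_theta_distribution(
    data, num_theta, theta_prior_val, theta_density, item_param_dict)
assert abs(posterior.sum() - 1.0) < 1e-9           # a proper distribution
```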
