Commit

Merge bd574eb into 86f3907

aloukina committed May 7, 2020
2 parents 86f3907 + bd574eb commit d51e7f6
Showing 1,107 changed files with 171,341 additions and 56,075 deletions.
53 changes: 43 additions & 10 deletions doc/contributing.rst
@@ -42,7 +42,7 @@ There are two kinds of existing tests in RSMTool:

1. The first type of tests are **unit tests**, i.e., very specific tests for which you have a single example (usually embedded in the test itself) and you compare the generated output with known or expected output. These tests should have a very narrow and well-defined scope. To see examples of such unit tests, see the test functions in the file `tests/test_utils.py`; a minimal sketch of such a test also appears after this list.

2. The second type of tests are **functional tests** which are generally written from the users' perspective to test that RSMTool is doing things that users would expect it to. In RSMTool, most (if not all) functional tests are written in the form of "experiment tests", i.e., we first define an experimental configuration using an RSMTool (or RSMEval/RSMPredict, RSMCompare, RSMSummarize) configuration file, then we run the experiment, and then compare the generated output files to expected output files to make sure that RSMTool components are operating as expected. To see examples of such tests, you can look at any of the ``tests/test_experiment_*.py`` files.
2. The second type of tests are **functional tests** which are generally written from the users' perspective to test that RSMTool is doing things that users would expect it to. In RSMTool, most (if not all) functional tests are written in the form of "experiment tests", i.e., we first define an experimental configuration using an ``rsmtool`` (or ``rsmeval``/``rsmpredict``/``rsmcompare``/``rsmsummarize``) configuration file, then we run the experiment, and then compare the generated output files to expected output files to make sure that RSMTool components are operating as expected. To see examples of such tests, you can look at any of the ``tests/test_experiment_*.py`` files.
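
The snippet below is a minimal sketch of the first kind of test, in the style of the unit tests in `tests/test_utils.py`. The function being tested and its expected value are hypothetical stand-ins used purely for illustration, not actual RSMTool code.

.. code-block:: python

    from nose.tools import eq_


    def parse_score_range(range_string):
        """Toy function under test (illustrative only): parse a string like '1-4'."""
        low, high = range_string.split('-')
        return int(low), int(high)


    def test_parse_score_range():
        # the single example is embedded in the test itself and the generated
        # output is compared against a known expected value
        eq_(parse_score_range('1-4'), (1, 4))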

.. note::

@@ -57,9 +57,9 @@ To write a new experiment test for RSMTool (or any of the other tools):

(a) Create a new directory under ``tests/data/experiments`` using a descriptive name.

(b) Create a JSON configuration file under that directory with the various fields appropriately set for what you want to test. Feel free to use multiple words separated by hyphens to come up with a name that describes the testing condition. The name of the configuration file should be the same as the value of the ``experiment_id`` field in your JSON file. By convention, that's usually the same as the name of the directory you created but with underscores instead of hyphens.
(b) Create a JSON configuration file under that directory with the various fields appropriately set for what you want to test. Feel free to use multiple words separated by hyphens to come up with a name that describes the testing condition. The name of the configuration file should be the same as the value of the ``experiment_id`` field in your JSON file. By convention, that's usually the same as the name of the directory you created but with underscores instead of hyphens. If you are creating a new test for ``rsmcompare`` or ``rsmsummarize``, copy over one or more of the existing ``rsmtool`` or ``rsmeval`` test experiments as input(s) and keep the same name. This will ensure that these inputs will be regularly updated and remain consistent with the current outputs generated by these tools. If you must create a test for a scenario not covered by a current tool, create a new ``rsmtool``/``rsmeval`` test first following the instructions on this page.

(c) Next, you need to add the test to the list of parameterized tests in the appropriate test file based on the tool for which you are adding the test, e.g., RSMEval tests should be added to ``tests/test_experiment_rsmeval.py``, RSMPredict tests to ``tests/test_experiment_rsmpredict.py``, and so on. RSMTool tests can be added to any of the four files. The arguments for the `param()` call can be found in the :ref:`Table 1 <param_table>` below.
(c) Next, you need to add the test to the list of parameterized tests in the appropriate test file based on the tool for which you are adding the test, e.g., ``rsmeval`` tests should be added to ``tests/test_experiment_rsmeval.py``, ``rsmpredict`` tests to ``tests/test_experiment_rsmpredict.py``, and so on. Tests for ``rsmtool`` can be added to any of the four files. The arguments for the `param()` call can be found in :ref:`Table 1 <param_table>` below; a sketch of one such entry appears after this list.

(d) In some rare cases, you might want to use a non-parameterized experiment test if you are doing something very different. These should be few and far between. Examples of these can also be seen in various ``tests/test_experiment_*.py`` files.
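
For step (c), here is a sketch of what a newly added parameterized entry might look like. It assumes the pattern used in the existing ``tests/test_experiment_*.py`` files, i.e., a ``parameterized.expand`` list of ``param()`` calls dispatched to a checking helper; the test directory names, experiment IDs, and the ``check_run_experiment`` helper shown here are illustrative assumptions, not verbatim excerpts.

.. code-block:: python

    # illustrative sketch; the directory names and experiment IDs are hypothetical
    from parameterized import param, parameterized

    from rsmtool.test_utils import check_run_experiment  # assumed helper name


    @parameterized.expand([
        # first positional argument: name of the test directory you created;
        # second positional argument: the experiment ID from the JSON file
        param('lr-with-tsv-output', 'lr_with_tsv_output', file_format='tsv'),
        param('lr-with-skll-model', 'lr_with_skll_model', skll=True),
    ])
    def test_run_experiment_parameterized(*args, **kwargs):
        check_run_experiment(*args, **kwargs)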

@@ -70,7 +70,7 @@ To write a new experiment test for RSMTool (or any of the other tools):
:widths: auto

+----------------------------------------------------------------------------+
| Writing test(s) for RSMTool |
| Writing test(s) for ``rsmtool`` |
| |
| * First positional argument is the name of the test directory you created. |
| |
@@ -88,12 +88,12 @@ To write a new experiment test for RSMTool (or any of the other tools):
| * Set ``file_format="tsv"`` (or ``"xlsx"``) if you specified the same |
| field in the configuration file. |
+----------------------------------------------------------------------------+
| Writing test(s) for RSMEval |
| Writing test(s) for ``rsmeval`` |
| |
| * Same arguments as RSMTool except the ``skll`` keyword argument is not |
| applicable. |
+----------------------------------------------------------------------------+
| Writing test(s) for RSMPredict |
| Writing test(s) for ``rsmpredict`` |
| |
| * The only positional argument is the name of the test directory you |
| created. |
@@ -104,14 +104,14 @@ To write a new experiment test for RSMTool (or any of the other tools):
| * Set ``file_format="tsv"`` (or ``"xlsx"``) if you specified the same |
| field in the configuration file. |
+----------------------------------------------------------------------------+
| Writing test(s) for RSMCompare |
| Writing test(s) for ``rsmcompare`` |
| |
| * First positional argument is the name of the test directory you created. |
| |
| * Second positional argument is the comparison ID from the JSON |
| configuration file. |
+----------------------------------------------------------------------------+
| Writing test(s) for RSMSummarize |
| Writing test(s) for ``rsmsummarize`` |
| |
| * The only positional argument is the name of the test directory you |
| created. |
@@ -129,9 +129,42 @@ To do this, you should now run the following:
python tests/update_files.py --tests tests --outputs test_outputs
This will copy over the generated outputs for the newly added tests and show you a report of the files that it added. If run correctly, the report should *only* refer to model files (``*.model``/``*.ols``) and the files affected by the functionality you implemented. If you run ``nosetests`` again, your newly added tests should now pass.
This will copy over the generated outputs for the newly added tests and show you a report of the files that it added. It will also update the input files for the ``rsmcompare`` and ``rsmsummarize`` tests. If run correctly, the report should *only* refer to the files affected by the functionality you implemented. If you run ``nosetests`` again, your newly added tests should now pass.

At this point, you should inspect all of the new test files added by the above command to make sure that the outputs are as expected. You can find these files under ``tests/data/experiments/<test>/output`` where ``<test>`` refers to the test(s) that you added.

However, if your changes resulted in updates to the inputs to ``rsmsummarize`` or ``rsmcompare`` tests, you will first need to re-run the tests for these two tools and then re-run ``update_files.py`` to update the outputs.

Once you are satisfied that the outputs are as expected, you can commit them.

The two examples below might help make this process easier to understand:

.. topic:: Example 1: You made a code change to better handle an edge case that only affects one test.

1. Run ``nosetests --nologcapture tests/*.py``. The affected test will fail.

2. Run ``python tests/update_files.py --tests tests --outputs test_outputs`` to update test outputs. You will see the total number of deleted, updated, and missing files. There should be no deleted files and no missing files; only the files for the affected test should be updated, and there should be no warnings in the output.

3. If this is the case, you are now ready to commit your change and the updated test outputs.

.. topic:: Example 2: You made a code change that changes the output of many tests. For example, you renamed one of the evaluation metrics.

1. Run ``nosetests --nologcapture tests/*.py``. Many tests will now fail since the output produced by the tool(s) has changed.

2. Run ``python tests/update_files.py --tests tests --outputs test_outputs`` to update test outputs. The files affected by your change will be shown as added/deleted. You will also see the following warning:

.. code-block::

    WARNING: X input files for rsmcompare/rsmsummarize tests have been updated. You need to re-run these tests and update test outputs

3. This means that the changes you made to the code changed the outputs for one or more ``rsmtool``/``rsmeval`` tests that served as inputs to one or more ``rsmcompare``/``rsmsummarize`` tests. Therefore, it is likely that the current test outputs no longer match the expected output and the tests for those two tools must be re-run.

4. Run ``nosetests --nologcapture tests/*rsmsummarize*.py`` and ``nosetests --nologcapture tests/*rsmcompare*.py``. If you see any failures, make sure they are related to the changes you made since those are expected.

5. Next, re-run ``python tests/update_files.py --tests tests --outputs test_outputs``, which should only update the outputs for the ``rsmcompare``/``rsmsummarize`` tests.

6. If this is the case, you are now ready to commit your changes.

At this point, you should inspect all of the new test files added by the above command to make sure that the outputs are as expected. You can find these files under ``tests/data/experiments/<test>/output`` where ``<test>`` refers to the test(s) that you added. Once you are satisfied that the outputs are as expected, you can commit all of them.

Advanced tips and tricks
------------------------
2 changes: 1 addition & 1 deletion rsmtool/notebooks/comparison/evaluation.ipynb
@@ -105,7 +105,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.7.6"
}
},
"nbformat": 4,
2 changes: 1 addition & 1 deletion rsmtool/notebooks/comparison/header.ipynb
@@ -434,7 +434,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.7.6"
}
},
"nbformat": 4,
36 changes: 12 additions & 24 deletions rsmtool/notebooks/comparison/true_score_evaluation.ipynb
@@ -14,8 +14,7 @@
"outputs": [],
"source": [
"if not out_dfs['true_score_evaluations'].empty:\n",
" variance_columns = ['N','N_single','N_double','h1_var_single','h1_var_double', 'h2_var_double','true_var']\n",
" prmse_columns = ['N','N_single', 'N_double','sys_var_single','sys_var_double','mse_true','prmse_true']\n",
"\n",
" markdown_strs = []\n",
" markdown_strs.append(\"The tables in this section show how well system scores can \"\n",
" \"predict *true* scores. According to Test theory, a *true* score \"\n",
@@ -25,33 +24,22 @@
" \"human scores when multiple human ratings are available for a subset of \"\n",
" \"responses. In this notebook these are estimated using human scores for \"\n",
" \"responses in the evaluation set.\")\n",
" markdown_strs.append(\"#### Variance of human scores\")\n",
" markdown_strs.append(\"The table below shows variance of both sets of human scores \"\n",
" \"for the whole evaluation set and for the subset of responses \"\n",
" \"that were double-scored. Large differences in variance between \"\n",
" \"the two human scores require further investigation. The last column \"\n",
" \"shows estimated true score variance. \")\n",
" display(Markdown('\\n'.join(markdown_strs)))\n",
" pd.options.display.width=10\n",
" df_human_variance = out_dfs['true_score_evaluations'][variance_columns].copy()\n",
" # replace nans with \"-\"\n",
" df_human_variance.replace({np.nan: '-'}, inplace=True)\n",
" display(HTML('<span style=\"font-size:95%\">'+ df_human_variance.to_html(classes=['sortable'], \n",
" escape=False,\n",
" float_format=float_format_func) + '</span>'))\n",
" \n",
" markdown_strs = [\"#### Proportional reduction in mean squared error (PRMSE)\"]\n",
" markdown_strs.append(\"The table shows the variance of system scores for single-scored \"\n",
" \"and double-scored responses, and mean squared error (MSE) and \"\n",
" \"proportional reduction in mean squared error (PRMSE) for \"\n",
" \"predicting a true score with system score.\")\n",
" \n",
" markdown_strs.append(\"The table shows variance of human rater errors, \"\n",
" \"true score variance, mean squared error (MSE) and \"\n",
" \"proportional reduction in mean squared error (PRMSE) for \"\n",
" \"predicting a true score with system score.\")\n",
" display(Markdown('\\n'.join(markdown_strs)))\n",
" pd.options.display.width=10\n",
" prmse_columns = ['version', 'N','N raters', 'N single', 'N multiple', \n",
" 'Variance of errors', 'True score var',\n",
" 'MSE true', 'PRMSE true']\n",
" df_prmse = out_dfs['true_score_evaluations'][prmse_columns].copy()\n",
" df_prmse.replace({np.nan: '-'}, inplace=True)\n",
" display(HTML('<span style=\"font-size:95%\">'+ df_prmse.to_html(classes=['sortable'], \n",
" escape=False,\n",
" float_format=float_format_func) + '</span>'))\n",
" escape=False, index=False,\n",
" float_format=float_format_func) + '</span>'))\n",
"else:\n",
" display(Markdown(no_info_str))"
]
@@ -73,7 +61,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.7.6"
}
},
"nbformat": 4,
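
For reference, the PRMSE reported in the true-score evaluation tables above (the "True score var", "MSE true", and "PRMSE true" columns) is conventionally defined as the proportional reduction in mean squared error relative to the true-score variance. The formula below is a sketch of that standard definition, not an excerpt from the RSMTool code:

    \[
    \mathrm{PRMSE} \;=\; 1 - \frac{\mathrm{MSE}(\text{system}, \text{true})}{\operatorname{Var}(\text{true})}
    \]

Here MSE(system, true) is the mean squared error of the system score as a predictor of the estimated true score, and Var(true) is the estimated true-score variance; values closer to 1 mean the system scores recover more of the true-score variance.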
30 changes: 8 additions & 22 deletions rsmtool/notebooks/summary/true_score_evaluation.ipynb
@@ -20,8 +20,9 @@
"metadata": {},
"outputs": [],
"source": [
"variance_columns = ['N','N_single','N_double','h1_var_single','h1_var_double', 'h2_var_double','true_var']\n",
"prmse_columns = ['N','N_single', 'N_double', 'system score type', 'sys_var_single','sys_var_double','mse_true','prmse_true']\n",
"prmse_columns = ['N','N raters', 'N single', 'N multiple', \n",
" 'Variance of errors', 'True score var',\n",
" 'MSE true', 'PRMSE true']\n",
"\n",
"def read_true_score_evals(model_list, file_format_summarize):\n",
" true_score_evals = []\n",
@@ -58,26 +59,11 @@
"outputs": [],
"source": [
"if not df_true_score_eval.empty:\n",
" markdown_strs = [\"#### Variance of human scores\"]\n",
" markdown_strs.append(\"The table below shows variance of both sets of human scores \"\n",
" \"for the whole evaluation set and for the subset of responses \"\n",
" \"that were double-scored. Large differences in variance between \"\n",
" \"the two human scores require further investigation. The last column \"\n",
" \"shows estimated true score variance. \")\n",
" display(Markdown('\\n'.join(markdown_strs)))\n",
" pd.options.display.width=10\n",
" df_human_variance = df_true_score_eval[variance_columns].copy()\n",
" # replace nans with \"-\"\n",
" df_human_variance.replace({np.nan: '-'}, inplace=True)\n",
" display(HTML('<span style=\"font-size:95%\">'+ df_human_variance.to_html(classes=['sortable'], \n",
" escape=False,\n",
" float_format=float_format_func) + '</span>'))\n",
" \n",
" markdown_strs = [\"#### Proportional reduction in mean squared error (PRMSE)\"]\n",
" markdown_strs.append(\"The table shows the variance of system scores for single-scored \"\n",
" \"and double-scored responses, and mean squared error (MSE) and \"\n",
" \"proportional reduction in mean squared error (PRMSE) for \"\n",
" \"predicting a true score with system score.\")\n",
" markdown_strs.append(\"The table shows variance of human rater errors, \"\n",
" \"true score variance, mean squared error (MSE) and \"\n",
" \"proportional reduction in mean squared error (PRMSE) for \"\n",
" \"predicting a true score with system score.\")\n",
" display(Markdown('\\n'.join(markdown_strs)))\n",
" pd.options.display.width=10\n",
" df_prmse = df_true_score_eval[prmse_columns].copy()\n",
@@ -107,7 +93,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
"version": "3.7.6"
}
},
"nbformat": 4,
