Merge pull request #70 from EducationalTestingService/feature/RSMR-301-allow-new-input-file-formats

Allow new input file formats
EducationalTestingService · Oct 24, 2016 · 0ad67d5 · 0ad67d5
2 parents d0aff68 + 403a621
commit 0ad67d5
Showing 143 changed files with 11,223 additions and 54 deletions.
diff --git a/conda-recipe/unix/rsmtool/meta.yaml b/conda-recipe/unix/rsmtool/meta.yaml
@@ -34,8 +34,11 @@ requirements:
     - seaborn 0.7.0
     - skll 1.2.1
     - statsmodels 0.6.1
-    - zeromq 4.1.3
+    - zeromq
     - setuptools
+    - openpyxl
+    - xlrd
+    - xlwt
     - nomkl
 
   run:
@@ -53,7 +56,10 @@ requirements:
     - seaborn 0.7.0
     - skll 1.2.1
     - statsmodels 0.6.1
-    - zeromq 4.1.3
+    - zeromq
+    - openpyxl
+    - xlrd
+    - xlwt
     - nomkl
 
 test:

diff --git a/conda-recipe/windows/rsmtool/meta.yaml b/conda-recipe/windows/rsmtool/meta.yaml
@@ -34,8 +34,11 @@ requirements:
     - seaborn 0.7.0
     - skll 1.2.1
     - statsmodels 0.6.1
-    - zeromq 4.1.3
+    - zeromq
     - setuptools
+    - openpyxl
+    - xlrd
+    - xlwt
 
   run:
     - python
@@ -52,7 +55,10 @@ requirements:
     - seaborn 0.7.0
     - skll 1.2.1
     - statsmodels 0.6.1
-    - zeromq 4.1.3
+    - zeromq
+    - openpyxl
+    - xlrd
+    - xlwt
 
 test:
   # Python imports

diff --git a/conda_requirements.txt b/conda_requirements.txt
@@ -12,7 +12,10 @@ scipy=0.17.0
 seaborn=0.7.0
 skll=1.2.1
 statsmodels=0.6.1
-zeromq=4.1.3
+zeromq
+openpyxl
+xlrd
+xlwt
 sphinx
 sphinx_rtd_theme
 coverage

diff --git a/conda_requirements_windows.txt b/conda_requirements_windows.txt
@@ -12,8 +12,11 @@ scipy=0.17.0
 seaborn=0.7.0
 skll=1.2.1
 statsmodels=0.6.1
-zeromq=4.1.3
+zeromq
 sphinx
 sphinx_rtd_theme
 setuptools
 coverage
+openpyxl
+xlrd
+xlwt
diff --git a/doc/column_selection_rsmtool.rst b/doc/column_selection_rsmtool.rst
@@ -3,7 +3,7 @@
 Selecting Feature Columns
 -------------------------
 
-By default, ``rsmtool`` will use all columns included in the training and evaluation ``.csv`` files as features. The only exceptions are any columns explicitly identified in the configuration file as containing non-feature information (e.g., :ref:`id_column <id_column_rsmtool>`, :ref:`train_label_column <train_label_column_rsmtool>`, :ref:`test_label_column <test_label_column_rsmtool>`, etc.)
+By default, ``rsmtool`` will use all columns included in the training and evaluation data files as features. The only exceptions are any columns explicitly identified in the configuration file as containing non-feature information (e.g., :ref:`id_column <id_column_rsmtool>`, :ref:`train_label_column <train_label_column_rsmtool>`, :ref:`test_label_column <test_label_column_rsmtool>`, etc.)
 
 However, there are certain scenarios in which it is useful to choose specific columns in the data to be used as features. For example, let's say that you have a large number of very different features and you want to use a different subset of features to score different types of questions on a test. In this case, the ability to easily choose the desired features for any ``rsmtool`` experiment becomes quite important. The alternative of manually pre-processing the data to remove the features you don't need is quite cumbersome.
 
@@ -52,7 +52,7 @@ There are three required fields.
 
 feature
 """""""
-The exact name of the column in the training and evaluation ``.csv`` files, including capitalization. Column names cannot contain hyphens. The following strings are reserved and cannot be used as feature column names: ``spkitemid``, ``spkitemlab``, ``itemType``, ``r1``, ``r2``, ``score``, ``sc``, ``sc1``, and ``adj``. In addition, any column names provided as values for ``id_column``, ``train_label_column``, ``test_label_column``, ``length_column``, ``candidate_column``, and ``subgroups`` may also not be used as feature column names.
+The exact name of the column in the training and evaluation data files, including capitalization. Column names cannot contain hyphens. The following strings are reserved and cannot be used as feature column names: ``spkitemid``, ``spkitemlab``, ``itemType``, ``r1``, ``r2``, ``score``, ``sc``, ``sc1``, and ``adj``. In addition, any column names provided as values for ``id_column``, ``train_label_column``, ``test_label_column``, ``length_column``, ``candidate_column``, and ``subgroups`` may also not be used as feature column names.
 
 .. _json_transformation:
 
@@ -67,7 +67,7 @@ A transformation that should be applied to the column values before using it in
     * ``addOneInv``: 1/(x+1)
     * ``addOneLn``: ln(x+1)
 
-Note that ``rsmtool`` will raise an exception if the values in the data do not allow the supplied transformation (for example, if ``inv`` is applied to a column which has 0 values). If you really want to use the transformation, you must pre-process your training and evaluation ``.csv`` files to remove the problematic cases.
+Note that ``rsmtool`` will raise an exception if the values in the data do not allow the supplied transformation (for example, if ``inv`` is applied to a column which has 0 values). If you really want to use the transformation, you must pre-process your training and evaluation data files to remove the problematic cases.
 
 sign
 """"
@@ -85,11 +85,11 @@ To ensure that this is working as expected, you can check the sign of correlatio
 
 Subset-based column selection
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-For more advanced users, ``rsmtool`` offers the ability to assign columns to named subsets in a ``.csv`` file and then select a set of columns by simply specifying the name of that pre-defined subset.
+For more advanced users, ``rsmtool`` offers the ability to assign columns to named subsets in a data file in one of the :ref:`supported formats <input_file_format>` and then select a set of columns by simply specifying the name of that pre-defined subset.
 
 If you want to run multiple ``rsmtool`` experiments, each choosing from a large number of features, generating a separate :ref:`json file <json_column_selection>` for each experiment listing columns to use can quickly become tedious.
 
-Instead you can define feature subsets by providing a subset definition ``.csv`` file which lists *all* feature names under a column named ``Feature``. Each subset is an additional column with a value of either ``0`` (denoting that the feature does *not* belong to the subset named by that column) or ``1`` (denoting that the feature does belong to the subset named by that column).
+Instead you can define feature subsets by providing a subset definition file in one of the :ref:`supported formats <input_file_format>` which lists *all* feature names under a column named ``Feature``. Each subset is an additional column with a value of either ``0`` (denoting that the feature does *not* belong to the subset named by that column) or ``1`` (denoting that the feature does belong to the subset named by that column).
 
 Here's an example of a subset definition file, say ``subset.csv``.
 
@@ -102,7 +102,7 @@ Here's an example of a subset definition file, say ``subset.csv``.
 
 In this example, ``feature2`` and ``feature3`` belong to a subset called "A" and ``feature1`` and ``feature2`` belong to a subset called "B".
 
-This ``.csv`` file can be provided to ``rsmtool`` using the :ref:`feature_subset_file <feature_subset_file>` field in the configuration file. Then, to select a particular pre-defined subset of features, you simply set the :ref:`feature_subset  <feature_subset>` field in the configuration file to the name of the subset that you wish to use.
+This feature subset file can be provided to ``rsmtool`` using the :ref:`feature_subset_file <feature_subset_file>` field in the configuration file. Then, to select a particular pre-defined subset of features, you simply set the :ref:`feature_subset  <feature_subset>` field in the configuration file to the name of the subset that you wish to use.
 
 Then, in order to use feature subset "A" (``feature2`` and ``feature3``) in an experiment, we need to set the following two fields in our experiment configuration file:
 
@@ -119,13 +119,13 @@ Then, in order to use feature subset "A" (``feature2`` and ``feature3``) in an e
 
 Transformations
 """""""""""""""
-Unlike in :ref:`fine-grained selection <json_column_selection>`, the subset ``.csv`` file does not list any transformations to be applied to the feature columns. However, you can automatically select a transformation for each feature *in the selected subset* by applying all possible transforms and identifying the one which gives the highest correlation with the human score. To use this functionality, set the :ref:`select_transformations <select_transformations_rsmtool>` field in the configuration file to ``true``.
+Unlike in :ref:`fine-grained selection <json_column_selection>`, the feature subset file does not list any transformations to be applied to the feature columns. However, you can automatically select a transformation for each feature *in the selected subset* by applying all possible transforms and identifying the one which gives the highest correlation with the human score. To use this functionality, set the :ref:`select_transformations <select_transformations_rsmtool>` field in the configuration file to ``true``.
 
 .. _subset_sign:
 
 Signs
 """""
-Some guidelines for building scoring models require all coefficients in the model to be positive and all features to have a positive correlation with human score. ``rsmtool`` can automatically flip the sign for any pre-defined feature subset. To use this functionality, the feature subset ``.csv`` file should provide the expected correlation sign between each feature and human score under a column called ``sign_<SUBSET>`` where ``<SUBSET>`` is the name of the feature subset. Then, to tell ``rsmtool`` to flip the sign for this subset, you need to set the :ref:`sign <sign>` field in the configuration file to ``<SUBSET>``.
+Some guidelines for building scoring models require all coefficients in the model to be positive and all features to have a positive correlation with human score. ``rsmtool`` can automatically flip the sign for any pre-defined feature subset. To use this functionality, the feature subset file should provide the expected correlation sign between each feature and human score under a column called ``sign_<SUBSET>`` where ``<SUBSET>`` is the name of the feature subset. Then, to tell ``rsmtool`` to flip the sign for this subset, you need to set the :ref:`sign <sign>` field in the configuration file to ``<SUBSET>``.
 
 To understand this, let's re-examine our earlier example of a subset definition file ``subset.csv``, but with an additional column.
 

diff --git a/doc/config_rsmeval.rst b/doc/config_rsmeval.rst
@@ -13,7 +13,7 @@ An identifier for the experiment that will be used to name the report and all :r
 
 predictions_file
 ~~~~~~~~~~~~~~~~
-The path to the ``.csv`` file with predictions to evaluate. Each row should correspond to a single response and contain the predicted and observed scores for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file.
+The path to the file with predictions to evaluate. The file should be in one of the :ref:`supported formats <input_file_format>`. Each row should correspond to a single response and contain the predicted and observed scores for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file.
 
 system_score_column
 ~~~~~~~~~~~~~~~~~~~
@@ -71,7 +71,7 @@ scale_with *(Optional)*
 ~~~~~~~~~~~~~~~~~~~~~~~
 In many scoring applications, system scores are :ref:`re-scaled <score_postprocessing>` so that their mean and standard deviation match those of the human scores for the training data.
 
-If you want ``rsmeval`` to re-scale the supplied predictions, you need to provide -- as the value for this field -- the path to a second ``.csv`` file containing the human scores and predictions of the same system on its training data. This file *must* have two columns: the human scores under the ``sc1`` column and the predicted score under the ``prediction`` column.
+If you want ``rsmeval`` to re-scale the supplied predictions, you need to provide -- as the value for this field -- the path to a second file in one of the :ref:`supported formats <input_file_format>` containing the human scores and predictions of the same system on its training data. This file *must* have two columns: the human scores under the ``sc1`` column and the predicted score under the ``prediction`` column.
 
 This field can also be set to ``"asis"`` if the scores are already scaled. In this case, no additional scaling will be performed by ``rsmeval`` but the report will refer to the scores as "scaled".
 

diff --git a/doc/config_rsmpredict.rst b/doc/config_rsmpredict.rst
@@ -16,7 +16,7 @@ The ``experiment_id`` used to create the ``rsmtool`` model files being used for
 
 input_feature_file
 ~~~~~~~~~~~~~~~~~~
-The path to the ``.csv`` file with the raw feature values that will be used for generating predictions. Each row should correspond to a single response and contain feature values for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of config file. Note that the feature names *must* be the same as used in the original ``rsmtool`` experiment.
+The path to the file with the raw feature values that will be used for generating predictions. The file should be in one of the :ref:`supported formats <input_file_format>`. Each row should correspond to a single response and contain feature values for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of config file. Note that the feature names *must* be the same as used in the original ``rsmtool`` experiment.
 
 
 .. note::

diff --git a/doc/config_rsmtool.rst b/doc/config_rsmtool.rst
@@ -17,11 +17,11 @@ The machine learner you want to use to build the scoring model. Possible values
 
 train_file
 """"""""""
-The path to the training data feature file in ``.csv`` format. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the location of config file.
+The path to the training data feature file in one of the :ref:`supported formats <input_file_format>`. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the location of config file.
 
 test_file
 """""""""
-The path to the evaluation data feature file in ``.csv`` format. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the location of config file.
+The path to the evaluation data feature file in one of the :ref:`supported formats <input_file_format>`. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the location of config file.
 
 description *(Optional)*
 """"""""""""""""""""""""
@@ -51,7 +51,7 @@ sign *(Optional)*
 """""""""""""""""
 See below.
 
-By default, ``rsmtool`` will use all of the columns present in the training and evaluation CSV files as features except for any columns explicitly identified in the configuration file (see below). These four fields are useful if you want to use only a specific set of columns as features. See :ref:`selecting feature columns <column_selection_rsmtool>` for more details.
+By default, ``rsmtool`` will use all of the columns present in the training and evaluation files as features except for any columns explicitly identified in the configuration file (see below). These four fields are useful if you want to use only a specific set of columns as features. See :ref:`selecting feature columns <column_selection_rsmtool>` for more details.
 
 .. _id_column_rsmtool:
 

diff --git a/doc/intermediate_files_rsmeval.rst b/doc/intermediate_files_rsmeval.rst
@@ -6,7 +6,7 @@ Although the primary output of ``rsmeval`` is an HTML report, we also want the u
 
 .. note::
 
-    The names of all files begin with the ``experiment_id`` provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original ``.csv`` files. This is because RSMEval standardizes these column names internally for convenience. These values are:
+    The names of all files begin with the ``experiment_id`` provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original data files. This is because RSMEval standardizes these column names internally for convenience. These values are:
 
     - ``spkitemid`` for the column containing response IDs.
     - ``sc1`` for the column containing the human scores used as observed scores

diff --git a/doc/intermediate_files_rsmtool.rst b/doc/intermediate_files_rsmtool.rst
@@ -7,7 +7,7 @@ Although the primary output of RSMTool is an HTML report, we also want the user
 
 .. note::
 
-    The names of all files begin with the ``experiment_id`` provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original ``.csv`` files. This is because RSMTool standardizes these column names internally for convenience. These values are:
+    The names of all files begin with the ``experiment_id`` provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original data files. This is because RSMTool standardizes these column names internally for convenience. These values are:
 
     - ``spkitemid`` for the column containing response IDs.
     - ``sc1`` for the column containing the human scores used as training labels.
@@ -38,7 +38,7 @@ These files contain all of the rows in the training and evaluation sets that wer
 
 .. note::
 
-    If the training/evaluation ``.csv`` files contained columns with internal names such as ``sc1`` or ``length`` but these columns were not actually used by ``rsmtool``, these columns will also be included in these files but their names will be changed to ``##name##`` (e.g. ``##sc1##``).
+    If the training/evaluation files contained columns with internal names such as ``sc1`` or ``length`` but these columns were not actually used by ``rsmtool``, these columns will also be included in these files but their names will be changed to ``##name##`` (e.g. ``##sc1##``).
 
 Excluded responses
 ^^^^^^^^^^^^^^^^^^
@@ -62,7 +62,7 @@ These files contain all of the the columns from the original features files that
 
 .. note::
 
-    If the training/evaluation ``.csv`` files contained columns with internal names such as ``sc1`` or ``length`` but these columns were not actually used by ``rsmtool``, these columns will also be included in these files but their names will be changed to ``##name##`` (e.g. ``##sc1##``).
+    If the training/evaluation files contained columns with internal names such as ``sc1`` or ``length`` but these columns were not actually used by ``rsmtool``, these columns will also be included in these files but their names will be changed to ``##name##`` (e.g. ``##sc1##``).
 
 Response length
 ^^^^^^^^^^^^^^^

diff --git a/doc/pipeline.png b/doc/pipeline.png