*_location ➡️ _directory & cv_folds_file
dan-blanchard committed Nov 21, 2014
1 parent de92452 commit 442a920
Showing 22 changed files with 162 additions and 146 deletions.
54 changes: 26 additions & 28 deletions doc/run_experiment.rst
@@ -36,10 +36,8 @@ with the following added restrictions:

* Only simple numeric, string, and nominal values are supported.
* Nominal values are converted to strings.
* There should be an attribute with the name specified by :ref:`id_col <id_col>` in
the :ref:`Input` section of the configuration file you create for your
experiment. This defaults to "id". If there is no such column, IDs will be
generated automatically.
* If the data has instance IDs, there should be an attribute with the name
specified by :ref:`id_col <id_col>` in the :ref:`Input` section of the configuration file you create for your experiment. This defaults to ``id``. If there is no such attribute, IDs will be generated automatically.
* If the data is labelled, there must be an attribute with the name specified
by :ref:`label_col <label_col>` in the :ref:`Input` section of the
configuration file you create for your experiment. This defaults to ``y``.
@@ -56,10 +54,8 @@ A simple comma or tab-delimited format with the following restrictions:
specified by :ref:`label_col <label_col>` in the :ref:`Input` section of the
configuration file you create for your experiment. This defaults to
``y``.
* There should be a column with the name specified by :ref:`id_col <id_col>` in
the :ref:`Input` section of the configuration file you create for your
experiment. This defaults to "id". If there is no such column, IDs will be
generated automatically.
* If the data has instance IDs, there should be a column with the name
specified by :ref:`id_col <id_col>` in the :ref:`Input` section of the configuration file you create for your experiment. This defaults to ``id``. If there is no such column, IDs will be generated automatically.
* All other columns contain feature values, and every feature value
must be specified (making this a poor choice for sparse data).

@@ -144,7 +140,7 @@ possible settings for each section is provided below, but to summarize:
cross-validation currently uses
`StratifiedKFold <http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html>`__.
You also can optionally use predetermined folds with the
:ref:`cv_folds_location <cv_folds_location>` setting.
:ref:`cv_folds_file <cv_folds_file>` setting.

.. _evaluate:

@@ -162,6 +158,8 @@ possible settings for each section is provided below, but to summarize:
* If you want to just **train a model**, specify a training location, and set
:ref:`task` to ``train``.

.. _learners_required:

* A :ref:`list of classifiers/regressors <learners>` to try on your feature
files is required.

@@ -199,7 +197,7 @@ Input

The Input section has only one required field, :ref:`learners`, but also must
contain either :ref:`train_file <train_file>` or
:ref:`train_location <train_location>`.
:ref:`train_directory <train_directory>`.

.. _learners:

@@ -252,44 +250,44 @@ Regressors:
.. _train_file:

train_file *(Optional)*
"""""""""""""""""""""""""""
"""""""""""""""""""""""

Path to a file containing the features to train on. Cannot be used in
combination with :ref:`featuresets <featuresets>`,
:ref:`train_location <train_location>`, or :ref:`test_location <test_location>`.
:ref:`train_directory <train_directory>`, or :ref:`test_directory <test_directory>`.

.. note::

If :ref:`train_file <train_file>` is not specified,
:ref:`train_location <train_location>` must be.
:ref:`train_directory <train_directory>` must be.

.. _train_location:
.. _train_directory:

train_location *(Optional)*
"""""""""""""""""""""""""""
train_directory *(Optional)*
""""""""""""""""""""""""""""

Path to directory containing training data files. There must be a file for each
featureset. Cannot be used in combination with :ref:`train_file <train_file>`
or :ref:`test_file <test_file>`.

.. note::

If :ref:`train_location <train_location>` is not specified,
If :ref:`train_directory <train_directory>` is not specified,
:ref:`train_file <train_file>` must be.

.. _test_file:

test_file *(Optional)*
"""""""""""""""""""""""""""
""""""""""""""""""""""

Path to a file containing the features to test on. Cannot be used in
combination with :ref:`featuresets <featuresets>`,
:ref:`train_location <train_location>`, or :ref:`test_location <test_location>`
:ref:`train_directory <train_directory>`, or :ref:`test_directory <test_directory>`

.. _test_location:
.. _test_directory:

test_location *(Optional)*
""""""""""""""""""""""""""
test_directory *(Optional)*
"""""""""""""""""""""""""""

Path to directory containing test data files. There must be a file
for each featureset. Cannot be used in combination with
@@ -307,8 +305,8 @@ if this is not the case. Cannot be used in combination with

.. note::

If specifying :ref:`train_location <train_location>` or
:ref:`test_location <test_location>`, :ref:`featuresets <featuresets>`
If specifying :ref:`train_directory <train_directory>` or
:ref:`test_directory <test_directory>`, :ref:`featuresets <featuresets>`
is required.

.. _suffix:
@@ -332,8 +330,8 @@ would like to combine.
id_col *(Optional)*
"""""""""""""""""""
If you're using :ref:`ARFF <arff>`, :ref:`CSV <csv>`, or :ref:`TSV <csv>`
files, the IDs for each instance are assumed to be in a column with this
name. If no column with this name is found, the IDs are generated
files, the IDs for each instance are assumed to be in a column with this
name. If no column with this name is found, the IDs are generated
automatically. Defaults to ``id``.

.. _label_col:
@@ -385,9 +383,9 @@ example, if you wanted to collapse the labels ``beagle`` and ``dachsund`` into a
Any labels not included in the dictionary will be left untouched.

.. _cv_folds_location:
.. _cv_folds_file:

cv_folds_location *(Optional)*
cv_folds_file *(Optional)*
""""""""""""""""""""""""""""""

Path to a csv file (with a header that is ignored) specifying folds for cross-
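The Input-section constraints documented in the hunks above can be sketched as a small checker. This is a minimal sketch under stated assumptions, not SKLL's own validation code: the function name `check_input_section` is hypothetical, it relies only on Python's standard `configparser`, and it encodes only the rules stated in this diff (`learners` is required, exactly one of `train_file` or `train_directory` must be given, file and directory options cannot be mixed, and `featuresets` is required with directories and disallowed with files).

```python
import configparser


def check_input_section(cfg_text):
    """Return a list of problems found in the [Input] section (hypothetical helper)."""
    # RawConfigParser avoids interpolation surprises for values containing '%'.
    parser = configparser.RawConfigParser()
    parser.read_string(cfg_text)
    if not parser.has_section("Input"):
        return ["missing [Input] section"]

    opts = parser["Input"]
    problems = []

    # learners is the only unconditionally required field.
    if "learners" not in opts:
        problems.append("learners is required")

    # Exactly one of train_file / train_directory must be present.
    if ("train_file" in opts) == ("train_directory" in opts):
        problems.append("specify exactly one of train_file or train_directory")

    file_opts = [o for o in ("train_file", "test_file") if o in opts]
    dir_opts = [o for o in ("train_directory", "test_directory") if o in opts]

    # *_file options cannot be combined with *_directory options.
    if file_opts and dir_opts:
        problems.append("cannot mix %s with %s" % (file_opts, dir_opts))

    # featuresets goes with directories, never with single files.
    if file_opts and "featuresets" in opts:
        problems.append("featuresets cannot be used with %s" % file_opts)
    if dir_opts and "featuresets" not in opts:
        problems.append("featuresets is required with %s" % dir_opts)

    return problems


example_cfg = """
[Input]
train_directory = train
test_directory = test
featuresets = [["example_iris_features"]]
learners = ["SVC"]
"""
print(check_input_section(example_cfg))
```

An empty list means none of the documented constraints were violated; it does not guarantee that SKLL itself will accept the file.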
27 changes: 18 additions & 9 deletions doc/tutorial.rst
@@ -4,10 +4,20 @@
Tutorial
========

For this tutorial, we're going to make use of the examples provided in the
Workflow
--------

In general, there are three steps to using SKLL:

1. Get some data in a :ref:`SKLL-compatible format <file_formats>`.
2. Create a small :ref:`configuration file <create_config>` describing the
machine learning experiment you would like to run.
3. Run that configuration file with :ref:`run_experiment <run_experiment>`.

For this tutorial, we're going to use some of the examples provided in the
`examples <https://github.com/EducationalTestingService/skll/blob/master/examples/>`__
directory in your copy of SKLL. Of course, the provided examples are already
perfect and ready to use. If this weren't the case, you would need to...
directory included with your copy of SKLL. Of course, the provided examples
are already perfect and ready to use. If this weren't the case, you would need to...

Get your data into the correct format
-------------------------------------
@@ -48,10 +58,9 @@ experiment, we can train and test several models, either simultaneously or
sequentially, depending on the availability of a grid engine. This will be
described in more detail later on, when we are ready to run our experiment.

You can consult
:ref:`the full list of learners currently available in SKLL <learners>` to get
an idea for the things you can do. As part of this tutorial, we will use the
following learners:
You can consult :ref:`the full list of learners currently available in SKLL <learners>`
to get an idea for the things you can do. As part of this tutorial, we will
use the following learners:

* Random Forest (``RandomForestClassifier``), C-Support Vector Classification
(``SVC``), Linear Support Vector Classification (``LinearSVC``), Logistic
@@ -77,8 +86,8 @@ are only going to train a model and evaluate its performance, because in the
:ref:`General` section, :ref:`task` is set to "evaluate". We will explore the
other options for :ref:`task` later.

In the :ref:`Input` section, you may want to adjust :ref:`train_location` and
:ref:`test_location` to point to the directories containing the Iris training
In the :ref:`Input` section, you may want to adjust :ref:`train_directory` and
:ref:`test_directory` to point to the directories containing the Iris training
and testing data (most likely ``skll/examples/iris/train`` and
``skll/examples/iris/test`` respectively, relative to your installation of
SKLL). :ref:`featuresets <featuresets>` indicates the name of both the
2 changes: 1 addition & 1 deletion examples/boston/cross_val.cfg
@@ -4,7 +4,7 @@ task = cross_validate

[Input]
# this could also be an absolute path instead (and must be if you're not running things in local mode)
train_location = boston/train
train_directory = boston/train
featuresets = [["example_boston_features"]]
# there is only one set of features to try with one feature file in it here.
featureset_names = ["example_boston"]
4 changes: 2 additions & 2 deletions examples/boston/evaluate.cfg
@@ -4,8 +4,8 @@ task = evaluate

[Input]
# this could also be an absolute path instead (and must be if you're not running things in local mode)
train_location = boston/train
test_location = boston/test
train_directory = boston/train
test_directory = boston/test
featuresets = [["example_boston_features"]]
# there is only one set of features to try with one feature file in it here.
featureset_names = ["example_boston"]
2 changes: 1 addition & 1 deletion examples/iris/cross_val.cfg
@@ -5,7 +5,7 @@ task = cross_validate
[Input]
# this could also be an absolute path instead (and must be if you're not
# running things in local mode)
train_location = train
train_directory = train
featuresets = [["example_iris_features"]]
# there is only one set of features to try with one feature file in it here.
featureset_names = ["example_iris"]
4 changes: 2 additions & 2 deletions examples/iris/evaluate.cfg
@@ -5,8 +5,8 @@ task = evaluate
[Input]
# this could also be an absolute path instead (and must be if you're not
# running things in local mode)
train_location = train
test_location = test
train_directory = train
test_directory = test
featuresets = [["example_iris_features"]]
# there is only one set of features to try with one feature file in it here.
featureset_names = ["example_iris"]
2 changes: 1 addition & 1 deletion examples/titanic/cross_validate.cfg
@@ -4,7 +4,7 @@ task = cross_validate

[Input]
# this could also be an absolute path instead (and must be if you're not running things in local mode)
train_location = train+dev
train_directory = train+dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
label_col = Survived
4 changes: 2 additions & 2 deletions examples/titanic/evaluate.cfg
@@ -4,8 +4,8 @@ task = evaluate

[Input]
# this could also be an absolute path instead (and must be if you're not running things in local mode)
train_location = train
test_location = dev
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
label_col = Survived
4 changes: 2 additions & 2 deletions examples/titanic/evaluate_tuned.cfg
@@ -4,8 +4,8 @@ task = evaluate

[Input]
# this could also be an absolute path instead (and must be if you're not running things in local mode)
train_location = train
test_location = dev
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
label_col = Survived
4 changes: 2 additions & 2 deletions examples/titanic/predict_train+dev.cfg
@@ -4,8 +4,8 @@ task = predict

[Input]
# this could also be an absolute path instead (and must be if you're not running things in local mode)
train_location = train+dev
test_location = test
train_directory = train+dev
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
# We know which learner is the best from previous experiments (using evaluate.cfg or cross_validate.cfg)
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
4 changes: 2 additions & 2 deletions examples/titanic/predict_train+dev_tuned.cfg
@@ -4,8 +4,8 @@ task = predict

[Input]
# this could also be an absolute path instead (and must be if you're not running things in local mode)
train_location = train+dev
test_location = test
train_directory = train+dev
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
# We know which learner is the best from previous experiments (using evaluate.cfg or cross_validate.cfg)
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
4 changes: 2 additions & 2 deletions examples/titanic/predict_train_only.cfg
@@ -4,8 +4,8 @@ task = predict

[Input]
# this could also be an absolute path instead (and must be if you're not running things in local mode)
train_location = train
test_location = test
train_directory = train
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
# We know which learner is the best from previous experiments (using evaluate.cfg or cross_validate.cfg)
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
4 changes: 2 additions & 2 deletions examples/titanic/predict_train_only_tuned.cfg
@@ -4,8 +4,8 @@ task = predict

[Input]
# this could also be an absolute path instead (and must be if you're not running things in local mode)
train_location = train
test_location = test
train_directory = train
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
# We know which learner is the best from previous experiments (using evaluate.cfg or cross_validate.cfg)
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
2 changes: 1 addition & 1 deletion examples/titanic/train.cfg
@@ -4,7 +4,7 @@ task = train

[Input]
# this could also be an absolute path instead (and must be if you're not running things in local mode)
train_location = train
train_directory = train
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
# We know which learner is the best from previous experiments (using evaluate.cfg or cross_validate.cfg)
learners = ["RandomForestClassifier"]
8 changes: 4 additions & 4 deletions skll/__init__.py
@@ -13,17 +13,17 @@

from sklearn.metrics import f1_score, make_scorer, SCORERS

from .data import Reader, Writer
from .data import FeatureSet, Reader, Writer
from .experiments import run_configuration
from .learner import Learner
from .metrics import (kappa, kendall_tau, spearman, pearson,
f1_score_least_frequent)
from .version import __version__, VERSION


__all__ = ['Learner', 'Reader', 'kappa', 'kendall_tau', 'spearman',
'pearson', 'f1_score_least_frequent', 'run_configuration',
'Writer']
__all__ = ['FeatureSet', 'Learner', 'Reader', 'kappa', 'kendall_tau',
'spearman', 'pearson', 'f1_score_least_frequent',
'run_configuration', 'Writer']

# Add our scorers to the sklearn dictionary here so that they will always be
# available if you import anything from skll
2 changes: 1 addition & 1 deletion skll/data/featureset.py
@@ -25,7 +25,7 @@ class FeatureSet(object):
Encapsulation of all of the features, values, and metadata about a given
set of data.
This replaces ExamplesTuple in older versions.
This replaces ``ExamplesTuple`` from older versions.
:param name: The name of this feature set.
:type name: str
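Existing configuration files written against the old option names would need the same renames this commit applies to the bundled examples. The following is a minimal sketch of such a migration, assuming only the three renames visible in this diff (`train_location`, `test_location`, `cv_folds_location`); the helper name `migrate_config_text` and the `RENAMED_OPTIONS` table are hypothetical and are not part of SKLL. Note that the standard `configparser` module drops comments when it rewrites a file, so this is not a lossless conversion.

```python
import configparser
import io

# Old option name -> new option name, as renamed in this commit.
RENAMED_OPTIONS = {
    "train_location": "train_directory",
    "test_location": "test_directory",
    "cv_folds_location": "cv_folds_file",
}


def migrate_config_text(cfg_text):
    """Return cfg_text with the old option names replaced by the new ones."""
    # RawConfigParser leaves values untouched (no '%' interpolation).
    parser = configparser.RawConfigParser()
    parser.read_string(cfg_text)

    for section in parser.sections():
        for old, new in RENAMED_OPTIONS.items():
            # Skip if the old name is absent or the new name is already set.
            if parser.has_option(section, old) and not parser.has_option(section, new):
                parser.set(section, new, parser.get(section, old))
                parser.remove_option(section, old)

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


old_cfg = """
[General]
task = evaluate

[Input]
train_location = train
test_location = dev
learners = ["RandomForestClassifier"]
"""
print(migrate_config_text(old_cfg))
```

Renamed options are appended to the end of their section, so the option order in the output may differ from the input.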
