SKLL can be used in two ways:

---
1. *Command Line*
    - Get the data into a [SKLL-compatible format](https://skll.readthedocs.io/en/latest/run_experiment.html#file-formats).
    - Create a [configuration file](https://skll.readthedocs.io/en/latest/run_experiment.html#create-config).
    - Run the experiment using the [run_experiment](https://skll.readthedocs.io/en/latest/run_experiment.html) command.
    - Examine the results using the [utility](https://skll.readthedocs.io/en/latest/utilities.html) commands provided.
---    
2. *Python API*

# Command Line

In [1]:
!pwd

/home/avijit/PycharmProjects/skll/examples


In [2]:
!ls

boston			     make_iris_example_data.py	   Tutorial.ipynb
iris			     make_titanic_example_data.py
make_boston_example_data.py  titanic


### Dataset Manipulation

We will use the Iris dataset for this tutorial. It is a simple 3-class classification problem with a single set of 4 features.

The utility Python script *make_iris_example_data.py* downloads the Iris dataset from scikit-learn and pre-processes it to create *train* and *test* sub-directories within the *iris* directory.

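Each of the generated sub-directories (*iris/train* and *iris/test*) contains a feature file in the SKLL-compatible *jsonlines* format: one JSON object per line, holding an example *id*, a label *y* and a dictionary of features *x*.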

In [3]:
!python3 make_iris_example_data.py

Retrieving iris data from servers...done
Writing training and testing files...done


In [4]:
import os

def list_files(startpath):
    """Print the directory tree rooted at startpath, indented by depth."""
    for root, dirs, files in os.walk(startpath):
        # Depth of the current directory relative to startpath
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * level
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

list_files('iris')

iris/
    evaluate.cfg
    cross_val.cfg
    test/
        example_iris_features.jsonlines
    train/
        example_iris_features.jsonlines
    .ipynb_checkpoints/


In [5]:
!head -5 iris/train/example_iris_features.jsonlines

{"id": "EXAMPLE_96", "y": "versicolor", "x": {"f0": 5.7, "f1": 2.9, "f2": 4.2, "f3": 1.3}}
{"id": "EXAMPLE_105", "y": "virginica", "x": {"f0": 7.6, "f1": 3.0, "f2": 6.6, "f3": 2.1}}
{"id": "EXAMPLE_66", "y": "versicolor", "x": {"f0": 5.6, "f1": 3.0, "f2": 4.5, "f3": 1.5}}
{"id": "EXAMPLE_0", "y": "setosa", "x": {"f0": 5.1, "f1": 3.5, "f2": 1.4, "f3": 0.2}}
{"id": "EXAMPLE_122", "y": "virginica", "x": {"f0": 7.7, "f1": 2.8, "f2": 6.7, "f3": 2.0}}
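Since every line of a *jsonlines* file is a standalone JSON object, the records shown above can be read with the standard library alone. The sketch below is only an illustration and not how SKLL reads the file; the helper name `parse_jsonlines` is made up for this example, and the two sample lines are copied from the output above.

```python
import json

# Two lines copied from the training file shown above.
sample_lines = [
    '{"id": "EXAMPLE_96", "y": "versicolor", "x": {"f0": 5.7, "f1": 2.9, "f2": 4.2, "f3": 1.3}}',
    '{"id": "EXAMPLE_0", "y": "setosa", "x": {"f0": 5.1, "f1": 3.5, "f2": 1.4, "f3": 0.2}}',
]


def parse_jsonlines(lines):
    """Parse an iterable of JSON strings into (id, label, features) tuples.

    Hypothetical helper for illustration only.
    """
    examples = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        examples.append((record["id"], record["y"], record["x"]))
    return examples


examples = parse_jsonlines(sample_lines)
for example_id, label, features in examples:
    # Print the ID, the label and the sorted feature names of each example
    print(example_id, label, sorted(features))
```

To read the real file, pass an open file handle instead of `sample_lines`, since iterating over a file yields one line at a time.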


The *[skll_convert](https://skll.readthedocs.io/en/latest/utilities.html#skll-convert)* command can be used to convert between [SKLL feature file formats](https://skll.readthedocs.io/en/latest/run_experiment.html#feature-file-formats). 

In [6]:
!skll_convert iris/train/example_iris_features.jsonlines iris/train/example_iris_features.csv 
print()
!ls iris/train
print()
!head -5 iris/train/example_iris_features.csv

Traceback (most recent call last):
  File "/usr/local/bin/skll_convert", line 11, in <module>
    load_entry_point('skll==1.5.3', 'console_scripts', 'skll_convert')()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2693, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2324, in load
    return self.resolve()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2330, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python2.7/dist-packages/skll-1.5.3-py2.7.egg/skll/__init__.py", line 15, in <module>
    from .data import FeatureSet, Reader, Writer
  File "/usr/local/lib/python2.7/dist-packages/skll-1.5.3-py2.7.egg/skll/data/__init__.py", line 14, in <module>
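Because the *skll_convert* call above ended in a traceback, no CSV file was produced in this run. To give an idea of what such a conversion involves, here is a minimal standard-library sketch that flattens jsonlines records into CSV rows. The column layout (*id*, then the feature columns, then *y*) and the helper name `jsonlines_to_csv` are assumptions made for this example; the file written by *skll_convert* may be laid out differently.

```python
import csv
import io
import json

sample_lines = [
    '{"id": "EXAMPLE_96", "y": "versicolor", "x": {"f0": 5.7, "f1": 2.9, "f2": 4.2, "f3": 1.3}}',
    '{"id": "EXAMPLE_0", "y": "setosa", "x": {"f0": 5.1, "f1": 3.5, "f2": 1.4, "f3": 0.2}}',
]


def jsonlines_to_csv(lines, out):
    """Write jsonlines records to `out` as CSV (assumed layout: id, features, y)."""
    records = [json.loads(line) for line in lines if line.strip()]
    # Collect every feature name that occurs, in a stable order
    feature_names = sorted({name for record in records for name in record["x"]})
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id"] + feature_names + ["y"])
    for record in records:
        # A feature missing from a record is written as an empty cell
        values = [record["x"].get(name, "") for name in feature_names]
        writer.writerow([record["id"]] + values + [record["y"]])


buffer = io.StringIO()
jsonlines_to_csv(sample_lines, buffer)
print(buffer.getvalue())
```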


### Configuration File

At the core of a SKLL experiment is the configuration file, which is passed to the *run_experiment* command.
SKLL configuration files are standard Python configuration files (similar in format to Windows INI files).

The 4 expected sections in a configuration file are:
1. [General](https://skll.readthedocs.io/en/latest/run_experiment.html#general)
    - Defines *experiment_name* and *task* (both compulsory fields)
    - 5 tasks are supported:
        1. cross_validate
        2. evaluate
        3. predict
        4. train
        5. learning_curve
2. [Input](https://skll.readthedocs.io/en/latest/run_experiment.html#input)
    - Defines the *learners* list (compulsory)
    - Additionally, either *train_directory* or *train_file* must be defined.
    - All other fields are optional.
3. [Tuning](https://skll.readthedocs.io/en/latest/run_experiment.html#tuning)
    - Contains fields related to tuning the models such as *objectives*, *grid_search* etc.
    - All the fields in this section are optional.
4. [Output](https://skll.readthedocs.io/en/latest/run_experiment.html#output)
    - Contains fields that control what is written out after model training, such as *probability*, *metrics*, *results* etc.
    - All the fields in this section are optional.
    
    
An example config file for the IRIS dataset is shown here.

In [7]:
with open('iris/cross_val.cfg', 'r') as config_file:
    print(config_file.read())

[General]
experiment_name = Iris_CV
task = cross_validate

[Input]
# this could also be an absolute path instead (and must be if you're not
# running things in local mode)
train_directory = train
featuresets = [["example_iris_features"]]
# there is only set of features to try with one feature file in it here.
featureset_names = ["example_iris"]
learners = ["RandomForestClassifier", "SVC", "LogisticRegression", "MultinomialNB"]
suffix = .jsonlines

[Tuning]
grid_search = true
objectives = ['f1_score_micro']

[Output]
# again, these can be absolute paths
results = output
log = output
predictions = output
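Since the file uses the standard INI-like format, it can also be inspected with Python's own `configparser` module. The sketch below checks the compulsory fields described earlier against a copy of the config printed above (comments omitted). The helper `check_config` and its messages are made up for this illustration; SKLL performs its own validation when *run_experiment* is called, and this is not that code.

```python
import configparser
import json

# Copy of iris/cross_val.cfg as printed above, without the comment lines.
CONFIG_TEXT = """
[General]
experiment_name = Iris_CV
task = cross_validate

[Input]
train_directory = train
featuresets = [["example_iris_features"]]
featureset_names = ["example_iris"]
learners = ["RandomForestClassifier", "SVC", "LogisticRegression", "MultinomialNB"]
suffix = .jsonlines

[Tuning]
grid_search = true
objectives = ['f1_score_micro']

[Output]
results = output
log = output
predictions = output
"""


def check_config(text):
    """Return a list of problems found in a SKLL-style config string.

    Hypothetical helper: it only covers the fields named in this tutorial.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    problems = []
    for section in ("General", "Input", "Tuning", "Output"):
        if not parser.has_section(section):
            problems.append("missing section [{}]".format(section))
    for section, field in (("General", "experiment_name"),
                           ("General", "task"),
                           ("Input", "learners")):
        if not parser.has_option(section, field):
            problems.append("missing field {}.{}".format(section, field))
    if not (parser.has_option("Input", "train_directory")
            or parser.has_option("Input", "train_file")):
        problems.append("missing Input.train_directory or Input.train_file")
    return problems


print(check_config(CONFIG_TEXT))

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
# The learners value happens to be valid JSON, so it can be decoded directly
learners = json.loads(parser.get("Input", "learners"))
print(parser.get("General", "task"), learners)
```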



### Running Experiment

After defining the configuration file, we can run it with the [run_experiment CONFIGURATION_FILE](https://skll.readthedocs.io/en/latest/run_experiment.html#using-run-experiment) command. Most parameters are defined in the config file, but some are passed as command-line arguments to *run_experiment* (e.g. --ablation, --local).

In [8]:
!run_experiment --local --verbose iris/cross_val.cfg

Traceback (most recent call last):
  File "/usr/local/bin/run_experiment", line 11, in <module>
    load_entry_point('skll==1.5.3', 'console_scripts', 'run_experiment')()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2693, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2324, in load
    return self.resolve()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2330, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python2.7/dist-packages/skll-1.5.3-py2.7.egg/skll/__init__.py", line 15, in <module>
    from .data import FeatureSet, Reader, Writer
  File "/usr/local/lib/python2.7/dist-packages/skll-1.5.3-py2.7.egg/skll/data/__init__.py", line
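The traceback above passes through *python2.7* paths, while the data script earlier was run with *python3*, so it can be worth checking which interpreter the installed command-line scripts point to. The sketch below is one way to do that, assuming the commands are installed as text scripts that start with a `#!` line; the helper name `entry_point_interpreter` is made up for this example.

```python
import shutil


def entry_point_interpreter(command):
    """Return the shebang line of a command found on PATH, or None.

    Hypothetical helper: returns None if the command is not found or if
    its first line is not a '#!' line (for example, a compiled binary).
    """
    path = shutil.which(command)
    if path is None:
        return None
    with open(path, "rb") as script:
        first_line = script.readline().decode("utf-8", errors="replace").strip()
    return first_line if first_line.startswith("#!") else None


for command in ("run_experiment", "skll_convert"):
    print(command, "->", entry_point_interpreter(command))
```

If the printed interpreter is not the Python used for the rest of the notebook, the commands belong to a different SKLL installation.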

### Analysing Output

In [9]:
list_files('iris')

iris/
    evaluate.cfg
    cross_val.cfg
    test/
        example_iris_features.jsonlines
    train/
        example_iris_features.jsonlines
    .ipynb_checkpoints/


In [10]:
!cat iris/output/Iris_Evaluate_example_iris_LogisticRegression.results

cat: iris/output/Iris_Evaluate_example_iris_LogisticRegression.results: No such file or directory


In [11]:
!head -5 iris/output/Iris_Evaluate_example_iris_LogisticRegression_predictions.tsv

head: cannot open 'iris/output/Iris_Evaluate_example_iris_LogisticRegression_predictions.tsv' for reading: No such file or directory


In [12]:
!cat iris/output/Iris_Evaluate_example_iris_LogisticRegression.log

cat: iris/output/Iris_Evaluate_example_iris_LogisticRegression.log: No such file or directory
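The file names used in these cells are built from the experiment name, the featureset name and the learner name. The cells use the prefix *Iris_Evaluate*, whereas *cross_val.cfg* sets *experiment_name = Iris_CV*. The sketch below builds the corresponding paths and reports which of them exist. It assumes that the cross-validation run follows the same naming pattern as the file names in this section, which is not confirmed here since the run failed; the helper `expected_outputs` is made up for this example.

```python
from pathlib import Path


def expected_outputs(output_dir, experiment_name, featureset_name, learner):
    """Build the output paths for one learner (assumed naming pattern)."""
    output_dir = Path(output_dir)
    prefix = "{}_{}_{}".format(experiment_name, featureset_name, learner)
    return {
        "results": output_dir / (prefix + ".results"),
        "predictions": output_dir / (prefix + "_predictions.tsv"),
        "log": output_dir / (prefix + ".log"),
        "summary": output_dir / (experiment_name + "_summary.tsv"),
    }


paths = expected_outputs("iris/output", "Iris_CV", "example_iris", "LogisticRegression")
for kind, path in paths.items():
    # Report whether each expected file is actually on disk
    status = "found" if path.exists() else "missing"
    print("{:12s}{}  [{}]".format(kind, path, status))
```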


In [13]:
import pandas as pd

summary_df = pd.read_csv('iris/output/Iris_Evaluate_summary.tsv', sep='\t')
print(summary_df.columns)

FileNotFoundError: File b'iris/output/Iris_Evaluate_summary.tsv' does not exist

In [14]:
print(summary_df[['learner_name', 'accuracy', 'score', 'experiment_name', 'featureset_name']])

NameError: name 'summary_df' is not defined