SKLL can be used in two ways:

---
1. *Command Line*
    - Get the data into a [SKLL-compatible format](https://skll.readthedocs.io/en/latest/run_experiment.html#file-formats).
    - Create a [Python configuration file](https://skll.readthedocs.io/en/latest/run_experiment.html#create-config).
    - Run the experiment using the [run_experiment](https://skll.readthedocs.io/en/latest/run_experiment.html) command.
    - Examine the results using the [utility](https://skll.readthedocs.io/en/latest/utilities.html) commands provided.
---    
2. *Python API*

# Command Line

In [1]:
!pwd

/home/jovyan/examples


In [2]:
!ls

boston			     make_iris_example_data.py	   Tutorial.ipynb
iris			     make_titanic_example_data.py
make_boston_example_data.py  titanic


### Dataset Manipulation

This tutorial uses the Iris dataset, a simple 3-class classification problem with a single set of 4 features.

The utility script *make_iris_example_data.py* retrieves the Iris dataset via scikit-learn and splits it into *train* and *test* sub-directories within the *iris* directory.

Each of the generated sub-directories (*iris/train* and *iris/test*) contains a feature file in SKLL compatible *jsonlines* format.

In [3]:
!python3 make_iris_example_data.py

Retrieving iris data from servers...done
Writing training and testing files...done


In [4]:
import os

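def list_files(startpath):
    """Print the directory tree rooted at startpath, indented by depth."""
    for root, dirs, files in os.walk(startpath):
        # Depth of the current directory relative to startpath
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(root)))
        # Files are printed one level deeper than their directory
        subindent = ' ' * 4 * (level + 1)
        for f in sorted(files):
            print('{}{}'.format(subindent, f))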

list_files('iris')

iris/
    cross_val.cfg
    evaluate.cfg
    train/
        example_iris_features.jsonlines
    .ipynb_checkpoints/
    test/
        example_iris_features.jsonlines


In [5]:
!head -5 iris/train/example_iris_features.jsonlines

{"id": "EXAMPLE_96", "y": "versicolor", "x": {"f0": 5.7, "f1": 2.9, "f2": 4.2, "f3": 1.3}}
{"id": "EXAMPLE_105", "y": "virginica", "x": {"f0": 7.6, "f1": 3.0, "f2": 6.6, "f3": 2.1}}
{"id": "EXAMPLE_66", "y": "versicolor", "x": {"f0": 5.6, "f1": 3.0, "f2": 4.5, "f3": 1.5}}
{"id": "EXAMPLE_0", "y": "setosa", "x": {"f0": 5.1, "f1": 3.5, "f2": 1.4, "f3": 0.2}}
{"id": "EXAMPLE_122", "y": "virginica", "x": {"f0": 7.7, "f1": 2.8, "f2": 6.7, "f3": 2.0}}
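
Each line of a jsonlines feature file is a standalone JSON object with an `id`, a label `y`, and a feature dictionary `x`. As a minimal sketch using only the standard library (this is not SKLL's own reader, and `parse_record` is a hypothetical helper name), a few of the records shown above can be parsed like this:

```python
import json
from collections import Counter

# Sample records copied from the `head` output above.
sample_lines = [
    '{"id": "EXAMPLE_96", "y": "versicolor", "x": {"f0": 5.7, "f1": 2.9, "f2": 4.2, "f3": 1.3}}',
    '{"id": "EXAMPLE_105", "y": "virginica", "x": {"f0": 7.6, "f1": 3.0, "f2": 6.6, "f3": 2.1}}',
    '{"id": "EXAMPLE_0", "y": "setosa", "x": {"f0": 5.1, "f1": 3.5, "f2": 1.4, "f3": 0.2}}',
]


def parse_record(line):
    """Split one jsonlines record into (id, label, feature dict)."""
    record = json.loads(line)
    return record["id"], record["y"], record["x"]


records = [parse_record(line) for line in sample_lines]

# Count how often each class label occurs in the sample.
label_counts = Counter(label for _, label, _ in records)

# Feature names are the keys of the "x" dictionary.
feature_names = sorted(records[0][2])

print(label_counts)
print(feature_names)
```

The same loop could be pointed at the lines of *iris/train/example_iris_features.jsonlines* to inspect the full training set.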


The *[skll_convert](https://skll.readthedocs.io/en/latest/utilities.html#skll-convert)* command can be used to convert between [SKLL feature file formats](https://skll.readthedocs.io/en/latest/run_experiment.html#feature-file-formats). 

In [6]:
!skll_convert iris/train/example_iris_features.jsonlines iris/train/example_iris_features.csv 
print()
!ls iris/train
print()
!head -5 iris/train/example_iris_features.csv

Loading iris/train/example_iris_features.jsonlines...           done
Writing iris/train/example_iris_features.csv...done           

example_iris_features.csv  example_iris_features.jsonlines

f0,f1,f2,f3,id,y
5.7,2.9,4.2,1.3,EXAMPLE_96,versicolor
7.6,3.0,6.6,2.1,EXAMPLE_105,virginica
5.6,3.0,4.5,1.5,EXAMPLE_66,versicolor
5.1,3.5,1.4,0.2,EXAMPLE_0,setosa
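
To make the relationship between the two formats concrete, here is a rough pure-Python sketch of the flattening that the CSV output reflects: one column per feature, followed by `id` and `y`. It is not how *skll_convert* is implemented; `jsonlines_to_csv` is a hypothetical helper, and it assumes every record carries the same feature names.

```python
import csv
import io
import json

# Two records copied from the jsonlines file shown earlier.
jsonlines_text = """\
{"id": "EXAMPLE_96", "y": "versicolor", "x": {"f0": 5.7, "f1": 2.9, "f2": 4.2, "f3": 1.3}}
{"id": "EXAMPLE_0", "y": "setosa", "x": {"f0": 5.1, "f1": 3.5, "f2": 1.4, "f3": 0.2}}
"""


def jsonlines_to_csv(text):
    """Flatten jsonlines records into CSV text: feature columns, then id and y."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Assumption: all records share the feature names of the first one.
    feature_names = sorted(records[0]["x"])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(feature_names + ["id", "y"])
    for rec in records:
        row = [rec["x"][name] for name in feature_names]
        writer.writerow(row + [rec["id"], rec["y"]])
    return buffer.getvalue()


csv_text = jsonlines_to_csv(jsonlines_text)
print(csv_text)
```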


### Configuration File

At the core of every SKLL experiment is a configuration file, which is passed to the *run_experiment* command.
SKLL configuration files are standard Python configuration files (similar in format to Windows INI files).

The 4 expected sections in a configuration file are:
1. [General](https://skll.readthedocs.io/en/latest/run_experiment.html#general)
    - Defines *experiment_name* and *task* (both compulsory fields)
    - 5 tasks are supported:
        1. cross_validate
        2. evaluate
        3. predict
        4. train
        5. learning_curve
2. [Input](https://skll.readthedocs.io/en/latest/run_experiment.html#input)
    - Defines the *learners* list (compulsory)
    - Additionally, one of the *train_directory* or *train_file* fields must be defined.
    - All other fields are optional.
3. [Tuning](https://skll.readthedocs.io/en/latest/run_experiment.html#tuning)
    - Contains fields related to tuning the models such as *objectives*, *grid_search* etc.
    - All the fields in this section are optional.
4. [Output](https://skll.readthedocs.io/en/latest/run_experiment.html#output)
    - Contains fields that control what is written out after model training, such as *probability*, *metrics*, *results*, etc.
    - All the fields in this section are optional.
    
    
An example config file for the IRIS dataset is shown here.

In [7]:
with open('iris/cross_val.cfg', 'r') as config_file:
    print(config_file.read())

[General]
experiment_name = Iris_CV
task = cross_validate

[Input]
# this could also be an absolute path instead (and must be if you're not
# running things in local mode)
train_directory = train
featuresets = [["example_iris_features"]]
# there is only set of features to try with one feature file in it here.
featureset_names = ["example_iris"]
learners = ["RandomForestClassifier", "SVC", "LogisticRegression", "MultinomialNB"]
suffix = .jsonlines

[Tuning]
grid_search = true
objectives = ['f1_score_micro']

[Output]
# again, these can be absolute paths
results = output
log = output
predictions = output
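
Because the file follows standard INI-style syntax, Python's `configparser` can read it. The following sketch checks the compulsory fields described earlier (*experiment_name*, *task*, *learners*, and one of *train_directory* or *train_file*). `missing_fields` is a hypothetical helper and the embedded text is an abbreviated copy of the config above; SKLL performs its own validation when the experiment is run.

```python
import configparser

# Abbreviated copy of iris/cross_val.cfg, embedded so the sketch is self-contained.
config_text = """
[General]
experiment_name = Iris_CV
task = cross_validate

[Input]
train_directory = train
featuresets = [["example_iris_features"]]
learners = ["RandomForestClassifier", "SVC", "LogisticRegression", "MultinomialNB"]

[Tuning]
grid_search = true

[Output]
results = output
"""


def missing_fields(parser):
    """Return the compulsory fields (as described in this tutorial) that are absent."""
    problems = []
    for field in ("experiment_name", "task"):
        if not parser.has_option("General", field):
            problems.append("General." + field)
    if not parser.has_option("Input", "learners"):
        problems.append("Input.learners")
    has_train_source = (parser.has_option("Input", "train_directory")
                        or parser.has_option("Input", "train_file"))
    if not has_train_source:
        problems.append("Input.train_directory or Input.train_file")
    return problems


parser = configparser.ConfigParser()
parser.read_string(config_text)
print(parser.sections())
print(missing_fields(parser))
```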



### run_experiment

After defining the configuration file, we can run it with the [run_experiment CONFIGURATION_FILE](https://skll.readthedocs.io/en/latest/run_experiment.html#using-run-experiment) command. Most parameters are defined in the config file, but some are passed as command-line arguments to *run_experiment* (e.g. --ablation, --local).
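
Outside a notebook, the same command line can be assembled in a script. This is only a sketch: `build_run_experiment_command` is a hypothetical helper, and it covers just the two flags used in this tutorial (`--local` and `--verbose`).

```python
import shlex


def build_run_experiment_command(config_path, local=True, verbose=True):
    """Assemble the argument list for the run_experiment command."""
    command = ["run_experiment"]
    if local:
        command.append("--local")
    if verbose:
        command.append("--verbose")
    command.append(config_path)
    return command


command = build_run_experiment_command("iris/cross_val.cfg")

# Show the command as it would be typed in a shell.
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would execute it, provided SKLL is installed and the config path is valid.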

Here we try out the cross-validation configuration shown earlier.

In [8]:
!run_experiment --local --verbose iris/cross_val.cfg

2019-10-07 15:59:16,430 - Iris_CV_example_iris_RandomForestClassifier - INFO - Task: cross_validate
2019-10-07 15:59:16,431 - Iris_CV_example_iris_RandomForestClassifier - INFO - Cross-validating (10 folds) on train, feature set ['example_iris_features'] ...
2019-10-07 15:59:16,431 - Iris_CV_example_iris_RandomForestClassifier - DEBUG - Path: /home/jovyan/examples/iris/train/example_iris_features.jsonlines
Loading /home/jovyan/examples/iris/train/example_iris_features.jsonlines...           done
2019-10-07 15:59:16,435 - Iris_CV_example_iris_RandomForestClassifier - INFO - Cross-validating
2019-10-07 16:01:21,081 - Iris_CV_example_iris_SVC - INFO - Task: cross_validate
2019-10-07 16:01:21,081 - Iris_CV_example_iris_SVC - INFO - Cross-validating (10 folds) on train, feature set ['example_iris_features'] ...
2019-10-07 16:01:21,082 - Iris_CV_example_iris_SVC - DEBUG - Path: /home/jovyan/examples/iris/train/example_iris_features.jsonlines
Loading /home/jovyan/examples/iris/train/example_i

### Analysing Output

In [9]:
list_files('iris')

iris/
    cross_val.cfg
    evaluate.cfg
    train/
        example_iris_features.csv
        example_iris_features.jsonlines
    .ipynb_checkpoints/
    output/
        Iris_CV.log
        Iris_CV_example_iris_LogisticRegression.log
        Iris_CV_example_iris_LogisticRegression.results
        Iris_CV_example_iris_LogisticRegression.results.json
        Iris_CV_example_iris_LogisticRegression_predictions.tsv
        Iris_CV_example_iris_MultinomialNB.log
        Iris_CV_example_iris_MultinomialNB.results
        Iris_CV_example_iris_MultinomialNB.results.json
        Iris_CV_example_iris_MultinomialNB_predictions.tsv
        Iris_CV_example_iris_RandomForestClassifier.log
        Iris_CV_example_iris_RandomForestClassifier.results
        Iris_CV_example_iris_RandomForestClassifier.results.json
        Iris_CV_example_iris_RandomForestClassifier_predictions.tsv
        Iris_CV_example_iris_SVC.log
        Iris_CV_example_iris_SVC.results
        Iris_CV_example_iris_SVC.results.json
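
The output file names appear to combine the experiment name, the featureset name, and the learner. The sketch below reproduces those prefixes from the values in *iris/cross_val.cfg*. The naming pattern is inferred from the listing and log lines above rather than taken from the SKLL documentation, and `ast.literal_eval` is used here only as a convenient way to read the list-valued fields.

```python
import ast
from itertools import product

# Values copied from iris/cross_val.cfg.
experiment_name = "Iris_CV"
featureset_names = ast.literal_eval('["example_iris"]')
learners = ast.literal_eval(
    '["RandomForestClassifier", "SVC", "LogisticRegression", "MultinomialNB"]'
)

# Assumed pattern: <experiment_name>_<featureset_name>_<learner>
job_names = [
    "_".join([experiment_name, featureset, learner])
    for featureset, learner in product(featureset_names, learners)
]

for name in job_names:
    print(name)
```

With one featureset and four learners this yields four prefixes, one per group of `.log`, `.results`, and `.results.json` files in the listing.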

In [10]:
!cat iris/output/Iris_CV_example_iris_LogisticRegression.results

Experiment Name: Iris_CV
SKLL Version: 1.5.3
Training Set: train
Training Set Size: 100
Test Set: cv
Test Set Size: n/a
Shuffle: False
Feature Set: ["example_iris_features"]
Learner: LogisticRegression
Task: cross_validate
Number of Folds: 10
Stratified Folds: True
Feature Scaling: none
Grid Search: True
Grid Search Folds: 3
Grid Objective Function: f1_score_micro
Scikit-learn Version: 0.20.1
Start Timestamp: 07 Oct 2019 16:01:26.539858
End Timestamp: 07 Oct 2019 16:01:27.338374
Total Time: 0:00:00.798516


Fold: 1
Model Parameters: {"C": 100.0, "class_weight": null, "dual": false, "fit_intercept": true, "intercept_scaling": 1, "max_iter": 1000, "multi_class": "auto", "n_jobs": null, "penalty": "l2", "random_state": 123456789, "solver": "liblinear", "tol": 0.0001, "verbose": 0, "warm_start": false}
Grid Objective Score (Train) = 0.9547892720306513
+------------+----------+--------------+-------------+-------------+----------+-------------+
|            |   set
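
The header of a `.results` file is a sequence of `Key: Value` lines, so it is easy to load into a dictionary. The sketch below uses a few header lines copied from the output above; `parse_header` is a hypothetical helper. For programmatic use, the accompanying `.results.json` file is probably the more convenient source.

```python
# Header lines copied from the .results output above.
results_header = """\
Experiment Name: Iris_CV
SKLL Version: 1.5.3
Training Set Size: 100
Learner: LogisticRegression
Number of Folds: 10
Grid Objective Function: f1_score_micro
Total Time: 0:00:00.798516
"""


def parse_header(text):
    """Turn 'Key: Value' lines into a dict, splitting on the first ': ' only."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.partition(": ")
        if separator:  # skip lines that are not key/value pairs
            fields[key] = value
    return fields


header = parse_header(results_header)
print(header["Learner"], header["Number of Folds"])
```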

In [11]:
!head -5 iris/output/Iris_CV_example_iris_LogisticRegression_predictions.tsv

id	prediction
EXAMPLE_125	virginica
EXAMPLE_130	virginica
EXAMPLE_2	setosa
EXAMPLE_123	virginica


In [12]:
!cat iris/output/Iris_CV_example_iris_LogisticRegression.log

2019-10-07 16:01:26,540 - INFO - Task: cross_validate
2019-10-07 16:01:26,540 - INFO - Cross-validating (10 folds) on train, feature set ['example_iris_features'] ...
2019-10-07 16:01:26,540 - DEBUG - Path: /home/jovyan/examples/iris/train/example_iris_features.jsonlines
2019-10-07 16:01:26,543 - INFO - Cross-validating


In [13]:
import pandas as pd

summary_df = pd.read_csv('iris/output/Iris_CV_summary.tsv', sep='\t')
print(summary_df.columns)

Index(['accuracy', 'additional_scores', 'cv_folds', 'end_timestamp',
       'experiment_name', 'feature_scaling', 'featureset', 'featureset_name',
       'fold', 'folds_file', 'grid_objective', 'grid_score', 'grid_search',
       'grid_search_cv_results', 'grid_search_folds', 'learner_name',
       'min_feature_count', 'model_params', 'pearson', 'save_cv_folds',
       'save_cv_models', 'scikit_learn_version', 'score', 'shuffle',
       'start_timestamp', 'stratified_folds', 'task', 'test_set_name',
       'test_set_size', 'total_time', 'train_set_name', 'train_set_size',
       'use_folds_file_for_grid_search', 'using_folds_file', 'version'],
      dtype='object')


In [14]:
print(summary_df[['learner_name', 'accuracy', 'score', 'fold', 'featureset_name']])

              learner_name  accuracy     score     fold featureset_name
0   RandomForestClassifier  0.916667  0.916667        1    example_iris
1   RandomForestClassifier  0.909091  0.909091        2    example_iris
2   RandomForestClassifier  1.000000  1.000000        3    example_iris
3   RandomForestClassifier  0.818182  0.818182        4    example_iris
4   RandomForestClassifier  0.800000  0.800000        5    example_iris
5   RandomForestClassifier  1.000000  1.000000        6    example_iris
6   RandomForestClassifier  0.888889  0.888889        7    example_iris
7   RandomForestClassifier  1.000000  1.000000        8    example_iris
8   RandomForestClassifier  0.777778  0.777778        9    example_iris
9   RandomForestClassifier  1.000000  1.000000       10    example_iris
10  RandomForestClassifier  0.911061  0.911061  average    example_iris
11                     SVC  1.000000  1.000000        1    example_iris
12                     SVC  0.909091  0.909091        2    examp
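
The row whose *fold* value is `average` summarises the ten per-fold rows for a learner. As a quick check, assuming it is the plain arithmetic mean, the RandomForestClassifier scores printed above can be averaged by hand:

```python
from statistics import mean

# Per-fold RandomForestClassifier scores copied from the table above.
fold_scores = [0.916667, 0.909091, 1.000000, 0.818182, 0.800000,
               1.000000, 0.888889, 1.000000, 0.777778, 1.000000]

# Value reported in the 'average' row of the same table.
reported_average = 0.911061

computed_average = mean(fold_scores)
print(round(computed_average, 6), reported_average)
```

The two values agree to the six decimal places shown in the table.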

Clean up the output folder created by the experiment.

In [15]:
!rm -rf iris/output/

### Saving Models

Here we modify the cross-validation configuration to run the *train* task and save the trained models.

In [16]:
import configparser

# Start from the cross-validation config and adapt it for training
template_config = configparser.ConfigParser()
template_config.read('iris/cross_val.cfg')
print(template_config.sections())
template_config.set('General', 'experiment_name', 'Iris_Train')
template_config.set('General', 'task', 'train')
# Save the trained models to the output directory and drop the predictions field
template_config.set('Output', 'models', 'output')
template_config.remove_option('Output', 'predictions')

# Write the modified configuration to a new file
with open('iris/train.cfg', 'w') as configfile:
    template_config.write(configfile)

['General', 'Input', 'Tuning', 'Output']


In [17]:
!run_experiment --local --verbose iris/train.cfg

2019-10-07 16:04:05,394 - Iris_Train_example_iris_RandomForestClassifier - INFO - Task: train
2019-10-07 16:04:05,394 - Iris_Train_example_iris_RandomForestClassifier - INFO - Training on train, feature set ['example_iris_features'] ...
2019-10-07 16:04:05,394 - Iris_Train_example_iris_RandomForestClassifier - DEBUG - Path: /home/jovyan/examples/iris/train/example_iris_features.jsonlines
Loading /home/jovyan/examples/iris/train/example_iris_features.jsonlines...           done
2019-10-07 16:04:05,400 - Iris_Train_example_iris_RandomForestClassifier - INFO - Featurizing and training new RandomForestClassifier model
2019-10-07 16:04:20,984 - Iris_Train_example_iris_RandomForestClassifier - INFO - Best f1_score_micro grid search score: 0.93
2019-10-07 16:04:20,984 - Iris_Train_example_iris_RandomForestClassifier - INFO - Hyperparameters: bootstrap: True, class_weight: None, criterion: gini, max_depth: 5, max_features: auto, max_leaf_nodes: None, min_impurity_decrease: 0.0, min_impurity_sp

In [19]:
list_files('iris')

iris/
    cross_val.cfg
    evaluate.cfg
    train.cfg
    train/
        example_iris_features.csv
        example_iris_features.jsonlines
    .ipynb_checkpoints/
    output/
        Iris_Train.log
        Iris_Train_example_iris_LogisticRegression.log
        Iris_Train_example_iris_LogisticRegression.model
        Iris_Train_example_iris_LogisticRegression.results.json
        Iris_Train_example_iris_MultinomialNB.log
        Iris_Train_example_iris_MultinomialNB.model
        Iris_Train_example_iris_MultinomialNB.results.json
        Iris_Train_example_iris_RandomForestClassifier.log
        Iris_Train_example_iris_RandomForestClassifier.model
        Iris_Train_example_iris_RandomForestClassifier.results.json
        Iris_Train_example_iris_SVC.log
        Iris_Train_example_iris_SVC.model
        Iris_Train_example_iris_SVC.results.json
    test/
        example_iris_features.jsonlines


SKLL provides several [utility scripts](https://skll.readthedocs.io/en/latest/utilities.html) that can be run from the command line. We have already seen the *skll_convert* command. Let us look at a few others.

In [20]:
!print_model_weights iris/output/Iris_Train_example_iris_LogisticRegression.model

== intercept values ==
 0.523847976639	setosa
 8.596424777145	versicolor
-10.780098918423	virginica

Number of nonzero features: 12
 10.256015994211	virginica	f3
-5.686157558959	virginica	f1
 5.112029479105	virginica	f2
-4.748723308900	setosa	f2
 3.033345904488	setosa	f1
-2.998959524533	versicolor	f1
-2.676118603072	versicolor	f3
-2.483651635462	virginica	f0
-2.207259783848	setosa	f3
 1.325443092977	versicolor	f2
 0.876178989967	setosa	f0
-0.391612756145	versicolor	f0
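
To see how these numbers relate to a prediction, the sketch below computes one linear score per class (intercept plus weighted sum of features) for EXAMPLE_0 from the training file and picks the class with the highest score. It assumes the printed values are per-class linear coefficients applied to unscaled feature values; `linear_scores` is a hypothetical helper, not part of SKLL.

```python
# Intercepts and weights copied from the print_model_weights output above.
intercepts = {
    "setosa": 0.523847976639,
    "versicolor": 8.596424777145,
    "virginica": -10.780098918423,
}
weights = {
    "setosa": {"f0": 0.876178989967, "f1": 3.033345904488,
               "f2": -4.748723308900, "f3": -2.207259783848},
    "versicolor": {"f0": -0.391612756145, "f1": -2.998959524533,
                   "f2": 1.325443092977, "f3": -2.676118603072},
    "virginica": {"f0": -2.483651635462, "f1": -5.686157558959,
                  "f2": 5.112029479105, "f3": 10.256015994211},
}

# Features of EXAMPLE_0 (labelled "setosa") from the training file.
example = {"f0": 5.1, "f1": 3.5, "f2": 1.4, "f3": 0.2}


def linear_scores(features):
    """Compute intercept + sum(weight * feature value) for every class."""
    return {
        label: intercepts[label]
        + sum(w * features[name] for name, w in weights[label].items())
        for label in weights
    }


scores = linear_scores(example)
predicted = max(scores, key=scores.get)
print(scores)
print(predicted)
```

Under these assumptions the highest-scoring class for EXAMPLE_0 is its labelled class, setosa.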


In [21]:
!generate_predictions iris/output/Iris_Train_example_iris_LogisticRegression.model iris/test/example_iris_features.jsonlines

Loading iris/test/example_iris_features.jsonlines...           done
id	prediction
EXAMPLE_73	versicolor
EXAMPLE_18	setosa
EXAMPLE_118	virginica
EXAMPLE_78	versicolor
EXAMPLE_76	versicolor
EXAMPLE_31	setosa
EXAMPLE_64	versicolor
EXAMPLE_141	virginica
EXAMPLE_68	versicolor
EXAMPLE_82	versicolor
EXAMPLE_110	virginica
EXAMPLE_12	setosa
EXAMPLE_36	setosa
EXAMPLE_9	setosa
EXAMPLE_19	setosa
EXAMPLE_56	versicolor
EXAMPLE_104	virginica
EXAMPLE_69	versicolor
EXAMPLE_55	versicolor
EXAMPLE_132	virginica
EXAMPLE_29	setosa
EXAMPLE_127	virginica
EXAMPLE_26	setosa
EXAMPLE_128	virginica
EXAMPLE_131	virginica
EXAMPLE_145	virginica
EXAMPLE_108	virginica
EXAMPLE_143	virginica
EXAMPLE_45	setosa
EXAMPLE_30	setosa
EXAMPLE_22

Cleaning up.

In [1]:
!rm -rf iris/output iris/train.cfg