# An Introduction to SKLL

SKLL (Scikit-Learn Laboratory) provides a number of command-line utilities and an API that make it simpler to run scikit-learn machine learning experiments with pre-generated features.
SKLL can be used in 2 ways:


1. **Via the Command Line**
2. **Via the API**

## 1. Using SKLL via the Command Line

In this example, we will use the following four steps to understand how to use SKLL via the command line.

- Get the data in a [SKLL compatible format](https://skll.readthedocs.io/en/latest/run_experiment.html#file-formats).
- Create a SKLL [experiment configuration file](https://skll.readthedocs.io/en/latest/run_experiment.html#create-config) for cross-validation.
- Run the machine learning experiment using the [*run_experiment*](https://skll.readthedocs.io/en/latest/run_experiment.html) command.
- Examine experiment results using the various SKLL [utility scripts](https://skll.readthedocs.io/en/latest/utilities.html).

### 1a. Data Pre-processing

We will use the [IRIS dataset](https://archive.ics.uci.edu/ml/datasets/Iris/) for this part. This dataset is for a simple 3-class classification task with 4 simple features. The SKLL utility script `make_iris_example_data.py` downloads the IRIS dataset using scikit-learn and pre-processes it to create the training and test splits that are contained within the `train` and `test`sub-directories under the `iris` directory. Each of the created sub-directories (`iris/train` and `iris/test`) contains the respective features in a [newline-delimited JSON](https://jsonlines.org) file format. SKLL supports this format natively and uses the `.jsonlines` extension for files in this format.

Let's first take a look at our current directory.

In [1]:
!ls

Tutorial.ipynb               [31mmake_california_example_data.py[m[m
__init__.py                  [31mmake_iris_example_data.py[m[m
[34mcalifornia[m[m                   [31mmake_titanic_example_data.py[m[m
[34miris[m[m                         [34mtitanic[m[m


Let's run the script to download and pre-process the IRIS data in a format SKLL can use. 

In [11]:
!python3 make_iris_example_data.py

Retrieving iris data from servers...done
Writing training and testing files...done


Let's look at the contents of the `iris` directory after that command.

In [4]:
!ls iris/**

iris/cross_val.cfg iris/evaluate.cfg

iris/test:
example_iris_features.jsonlines

iris/train:
example_iris_features.csv       example_iris_features.jsonlines


We see that two `.jsonlines` files have been created -- one each under `iris/train` and `iris/test` respectively. Let's see what these files look like.

In [4]:
!head -5 iris/train/example_iris_features.jsonlines

{"id": "EXAMPLE_96", "y": "versicolor", "x": {"f0": 5.7, "f1": 2.9, "f2": 4.2, "f3": 1.3}}
{"id": "EXAMPLE_105", "y": "virginica", "x": {"f0": 7.6, "f1": 3.0, "f2": 6.6, "f3": 2.1}}
{"id": "EXAMPLE_66", "y": "versicolor", "x": {"f0": 5.6, "f1": 3.0, "f2": 4.5, "f3": 1.5}}
{"id": "EXAMPLE_0", "y": "setosa", "x": {"f0": 5.1, "f1": 3.5, "f2": 1.4, "f3": 0.2}}
{"id": "EXAMPLE_122", "y": "virginica", "x": {"f0": 7.7, "f1": 2.8, "f2": 6.7, "f3": 2.0}}


As [documented here](https://skll.readthedocs.io/en/latest/run_experiment.html#jsonlines-ndj-recommended), each line of such a file contains a JSON object with the following keys: `y` (the class label), `x` (a dictionary of feature values), and `id` (an optional instance ID). The `.jsonlines` format is the preferred file format for SKLL as it can represent sparse featuresets much more easily.

SKLL accepts a variety of [file formats](https://skll.readthedocs.io/en/latest/run_experiment.html#feature-file-formats) and also provides the [*skll_convert*](https://skll.readthedocs.io/en/latest/utilities.html#skll-convert) utility to easily convert between them. For example, we could easily convert the `.jsonlines` file into `.csv` file like so:

In [5]:
!skll_convert iris/train/example_iris_features.jsonlines iris/train/example_iris_features.csv 
print()
!ls iris/train
print()
!head -5 iris/train/example_iris_features.csv

Loading iris/train/example_iris_features.jsonlines...           done
Writing iris/train/example_iris_features.csv...done           

example_iris_features.csv       example_iris_features.jsonlines

f0,f1,f2,f3,id,y
5.7,2.9,4.2,1.3,EXAMPLE_96,versicolor
7.6,3.0,6.6,2.1,EXAMPLE_105,virginica
5.6,3.0,4.5,1.5,EXAMPLE_66,versicolor
5.1,3.5,1.4,0.2,EXAMPLE_0,setosa


### 1b. Experiment configuration files

The most common method of using SKLL is via its experiment configuration file which can then be run using the *run_experiment* command-line script. SKLL configuration files are basic configuration files that are similar in format to Windows INI files. 

There are 4 sections in a configuration file.


1. [General](https://skll.readthedocs.io/en/latest/run_experiment.html#general)
    - Defines the `experiment_name` and the `task` (both required)
    - SKLL supports 5 types of machine learning [tasks](https://skll.readthedocs.io/en/latest/run_experiment.html#task).
        1. `cross_validate`
        2. `train`
        3. `evaluate`
        4. `predict`
        5. `learning_curve`
2. [Input](https://skll.readthedocs.io/en/latest/run_experiment.html#input)
    - Defines the list of machine learners we want to run for this experiment via the `learners` option (required).
    - Additionally, either the `train_directory` or `train_file` option must also be defined. 
    - Other fields may be required or optional depending on the task.
3. [Tuning](https://skll.readthedocs.io/en/latest/run_experiment.html#tuning)
    - Defines options related to tuning the model hyperparameters e.g., `grid_search`, `param_grids`, `objectives`, et cetera.
    - Other fields may be required or optional depending on the task. 
4. [Output](https://skll.readthedocs.io/en/latest/run_experiment.html#output)
    - Defines options related to the output produced by the machine learning task e.g.`probability`, `metrics`, `results`, et cetera.
    - Other fields may be required or optional depending on the task. 
    
Let's look at the SKLL configuration file for our IRIS example.

In [6]:
!cat iris/cross_val.cfg

[General]
experiment_name = Iris_CV
task = cross_validate

[Input]
# this could also be an absolute path instead (and must be if you're not
# running things in local mode)
train_directory = train
featuresets = [["example_iris_features"]]
# there is only set of features to try with one feature file in it here.
featureset_names = ["example_iris"]
learners = ["RandomForestClassifier", "SVC", "LogisticRegression", "MultinomialNB"]
suffix = .jsonlines

[Tuning]
grid_search = true
objectives = ['f1_score_micro']

[Output]
# again, these can be absolute paths
save_cv_models = true
models = output
results = output
logs = output
predictions = output


As you can see, SKLL configuration files also support Python-style comments. In this particular example, we are going to do cross-validation on the IRIS training set with 4 different learners. The hyperparameters for each learner are tuned by searching a [default grid of values](https://skll.readthedocs.io/en/latest/run_experiment.html#param-grids-optional) defined in SKLL (the default grid can be overridden in the configuration file via the `param_grids` option) and picking the values that yield the largest value for the micro-averaged F1 score metric. The grid search happens inside each outer cross-validation loop using a 3-fold cross-validation on the fold training data. The configuration file also says that the results, the predictions, the logs, and the trained models for each fold will be saved in the `output` directory under the current directory.

### 1c. Running the experiment

After creating the configuration file, we can now run the experiment using the `run_experiment` command. This scripts only takes a single required argument –– the path to the configuration file. However, it also [supports other optional arguments](https://skll.readthedocs.io/en/latest/run_experiment.html#using-run-experiment) for other use cases e.g. running an experiment on a DRMAA-compatible cluster rather than locally.

Let's run our IRIS cross-validation experiment.

In [7]:
!run_experiment iris/cross_val.cfg

2020-03-10 14:32:57,114 - Iris_CV_example_iris_RandomForestClassifier - INFO - Task: cross_validate
2020-03-10 14:32:57,115 - Iris_CV_example_iris_RandomForestClassifier - INFO - Cross-validating (10 folds) on train, feature set ['example_iris_features'] ...
Loading /Users/nmadnani/work/skll/examples/iris/train/example_iris_features.jsonlines...           done
2020-03-10 14:32:57,118 - Iris_CV_example_iris_RandomForestClassifier - INFO - Cross-validating
2020-03-10 14:33:47,710 - Iris_CV_example_iris_SVC - INFO - Task: cross_validate
2020-03-10 14:33:47,711 - Iris_CV_example_iris_SVC - INFO - Cross-validating (10 folds) on train, feature set ['example_iris_features'] ...
Loading /Users/nmadnani/work/skll/examples/iris/train/example_iris_features.jsonlines...           done
2020-03-10 14:33:47,716 - Iris_CV_example_iris_SVC - INFO - Cross-validating
2020-03-10 14:33:49,560 - Iris_CV_example_iris_LogisticRegression - INFO - Task: cross_validate
2020-03-10 14:33:49,561 - Iris_CV_example_i

The above log shows that SKLL reads in the training feature file for every learner. This is one of the many efficiency compromises that SKLL makes for the more convenient option of batched machine learning experiments using simple configuration files. However, this can be quite slow for experiments with very large feature files and multiple learners. For those more advanced cases, we recommend using the SKLL Python API directly (see next section below).

### 1d. Examining experiment output

Let's examine the `iris` directory again.

In [8]:
!ls iris/**

iris/cross_val.cfg iris/evaluate.cfg

iris/output:
Iris_CV.log
Iris_CV_example_iris_LogisticRegression.log
Iris_CV_example_iris_LogisticRegression.results
Iris_CV_example_iris_LogisticRegression.results.json
Iris_CV_example_iris_LogisticRegression_fold1.model
Iris_CV_example_iris_LogisticRegression_fold10.model
Iris_CV_example_iris_LogisticRegression_fold2.model
Iris_CV_example_iris_LogisticRegression_fold3.model
Iris_CV_example_iris_LogisticRegression_fold4.model
Iris_CV_example_iris_LogisticRegression_fold5.model
Iris_CV_example_iris_LogisticRegression_fold6.model
Iris_CV_example_iris_LogisticRegression_fold7.model
Iris_CV_example_iris_LogisticRegression_fold8.model
Iris_CV_example_iris_LogisticRegression_fold9.model
Iris_CV_example_iris_LogisticRegression_predictions.tsv
Iris_CV_example_iris_MultinomialNB.log
Iris_CV_example_iris_MultinomialNB.results
Iris_CV_example_iris_MultinomialNB.results.json
Iris_CV_example_iris_MultinomialNB_fold1.model
Iris_CV_example_iris_MultinomialNB_fol

We see that the subdirectory `iris/output` has been created and that it contains several new files:
- a plain text `.results` file that contains human-readable output of the results for each learner in our list
- a `.results.json` file that contains the same results but in JSON format for programmatic use
- a `.log` file containing log messages for that learner
- the cross-validation predictions in `.tsv` format from each learner
- 10 `.model` files for each learner - one per cross-validation fold
- a `summary.tsv` file that contains the results for the entire experiment across all learners

Let's first examine one of the `.results` files.

In [9]:
!cat iris/output/Iris_CV_example_iris_LogisticRegression.results

Experiment Name: Iris_CV
SKLL Version: 2.0
Training Set: train
Training Set Size: 100
Test Set: cv
Test Set Size: n/a
Shuffle: False
Feature Set: ["example_iris_features"]
Learner: LogisticRegression
Task: cross_validate
Number of Folds: 10
Stratified Folds: True
Feature Scaling: none
Grid Search: True
Grid Search Folds: 3
Grid Objective Function: f1_score_micro
Scikit-learn Version: 0.22.2.post1
Start Timestamp: 10 Mar 2020 14:33:49.560561
End Timestamp: 10 Mar 2020 14:33:50.005559
Total Time: 0:00:00.444998


Fold: 1
Model Parameters: {"C": 100.0, "class_weight": null, "dual": false, "fit_intercept": true, "intercept_scaling": 1, "l1_ratio": null, "max_iter": 1000, "multi_class": "auto", "n_jobs": null, "penalty": "l2", "random_state": 123456789, "solver": "liblinear", "tol": 0.0001, "verbose": 0, "warm_start": false}
Grid Objective Score (Train) = 0.9666666666666667
+------------+----------+--------------+-------------+-------------+----------+-------------+
|            |   setosa 

We see that it contains useful experiment metadata at the top and then it contains the results for each of the folds along with the final results that are the averages across the folds. Next let's look at the predictions file.

In [17]:
!head -5 iris/output/Iris_CV_example_iris_LogisticRegression_predictions.tsv

id	prediction
EXAMPLE_125	virginica
EXAMPLE_130	virginica
EXAMPLE_2	setosa
EXAMPLE_123	virginica


This file simply contains the most likely label from the learner for each of the instances in the test split of each of the 10 cross-validation folds. Next let's look at the log file.

In [18]:
!cat iris/output/Iris_CV_example_iris_LogisticRegression.log

2020-03-10 14:33:49,560 - INFO - Task: cross_validate
2020-03-10 14:33:49,561 - INFO - Cross-validating (10 folds) on train, feature set ['example_iris_features'] ...
2020-03-10 14:33:49,565 - INFO - Cross-validating


The log file contains a record of all useful informational messages and warnings that might have been generated from both SKLL (and `scikit-learn`) during the experiment. For more detailed log messages, use the `--verbose` option for `run_experiment`. Next, let's examine the summary file.

In [19]:
import pandas as pd
summary_df = pd.read_csv('iris/output/Iris_CV_summary.tsv', sep='\t')
print(summary_df.columns)

Index(['accuracy', 'additional_scores', 'cv_folds', 'end_timestamp',
       'experiment_name', 'feature_scaling', 'featureset', 'featureset_name',
       'fold', 'folds_file', 'grid_objective', 'grid_score', 'grid_search',
       'grid_search_cv_results', 'grid_search_folds', 'learner_name',
       'min_feature_count', 'model_params', 'pearson', 'save_cv_folds',
       'save_cv_models', 'scikit_learn_version', 'score', 'shuffle',
       'start_timestamp', 'stratified_folds', 'task', 'test_set_name',
       'test_set_size', 'total_time', 'train_set_name', 'train_set_size',
       'use_folds_file_for_grid_search', 'using_folds_file', 'version'],
      dtype='object')


This file has many useful columns but let's just look at a small subset for now.

In [20]:
summary_df[['learner_name', 'accuracy', 'score', 'fold', 'featureset_name']].head(22)

Unnamed: 0,learner_name,accuracy,score,fold,featureset_name
0,RandomForestClassifier,1.0,1.0,1,example_iris
1,RandomForestClassifier,0.8,0.8,2,example_iris
2,RandomForestClassifier,1.0,1.0,3,example_iris
3,RandomForestClassifier,1.0,1.0,4,example_iris
4,RandomForestClassifier,0.7,0.7,5,example_iris
5,RandomForestClassifier,1.0,1.0,6,example_iris
6,RandomForestClassifier,0.9,0.9,7,example_iris
7,RandomForestClassifier,1.0,1.0,8,example_iris
8,RandomForestClassifier,0.8,0.8,9,example_iris
9,RandomForestClassifier,1.0,1.0,10,example_iris


The summary file shows the micro-averaged F1 scores for each learner for each cross-validation fold and the averaged final results too.

As mentioned earlier, SKLL provides several [utility scripts](https://skll.readthedocs.io/en/latest/utilities.html) that can be run from the command line. We examined the `skll_convert` utility earlier. Next, let's examine the `print_model_weights` utility which can print out the model parameters (weights) for linear models on the command line which can be quite useful for debugging and interpretability.

In [21]:
!print_model_weights iris/output/Iris_CV_example_iris_LogisticRegression_fold1.model

== intercept values ==
 0.526788241270	setosa
 7.904183432646	versicolor
-10.106265873277	virginica

Number of nonzero features: 12
 9.740295928176	virginica	f3
-5.344607925255	virginica	f1
 5.101862309508	virginica	f2
-4.663236251043	setosa	f2
-3.243195675159	versicolor	f1
 2.960911138733	setosa	f1
-2.619269387406	virginica	f0
-2.269167648888	setosa	f3
-2.046442333447	versicolor	f3
 0.933654025132	versicolor	f2
 0.869165923294	setosa	f0
-0.010096757953	versicolor	f0


Next, we have the `generate_predictions` script which can be used to generate predictions on new test data from an already existing model on disk.

In [22]:
!generate_predictions iris/output/Iris_CV_example_iris_LogisticRegression_fold1.model iris/test/example_iris_features.jsonlines

Loading iris/test/example_iris_features.jsonlines...           done
id	prediction
EXAMPLE_73	versicolor
EXAMPLE_18	setosa
EXAMPLE_118	virginica
EXAMPLE_78	versicolor
EXAMPLE_76	versicolor
EXAMPLE_31	setosa
EXAMPLE_64	versicolor
EXAMPLE_141	virginica
EXAMPLE_68	versicolor
EXAMPLE_82	versicolor
EXAMPLE_110	virginica
EXAMPLE_12	setosa
EXAMPLE_36	setosa
EXAMPLE_9	setosa
EXAMPLE_19	setosa
EXAMPLE_56	versicolor
EXAMPLE_104	virginica
EXAMPLE_69	versicolor
EXAMPLE_55	versicolor
EXAMPLE_132	virginica
EXAMPLE_29	setosa
EXAMPLE_127	virginica
EXAMPLE_26	setosa
EXAMPLE_128	virginica
EXAMPLE_131	virginica
EXAMPLE_145	virginica
EXAMPLE_108	virginica
EXAMPLE_143	virginica
EXAMPLE_45	setosa
EXAMPLE_30	setosa
EXAMPLE_22	setosa
EXAMPLE_15	setosa
EXAMPLE_65	versicolor
EXAMPLE_11	setosa
EXAMPLE_42	setosa
EXAMPLE_146	virginica
EXAMPLE_51	versicolor
EXAMPLE_27	setosa
EXAMPLE_4	setosa
EXAMPLE_32	setosa
EXAMPLE_142	virginica
EXAMPLE_85	versicolor
EXAMPLE_86	versicolor
EXAMPLE_16	setosa
EXAMPLE_10	setosa
EXAMPL

# 2. Using SKLL via the API

While the command line tools are intended to be the primary method of using SKLL for normal users who want to run batched machine-learning experiments, the SKLL Python API can also be useful for building applications on top of SKLL and for other advanced use cases. 

The [*learner*](https://skll.readthedocs.io/en/latest/api/learner.html) and [*data*](https://skll.readthedocs.io/en/latest/api/data.html) modules can be used to broadly replicate the functionalities of command line tools. For example, the `skll.data.Reader` class can be used to programmatically read any SKLL-compatible feature file into a SKLL [*FeatureSet*](https://skll.readthedocs.io/en/latest/api/data.html#module-skll.data.featureset) object.

In [23]:
from skll.data import Reader

train_examples_reader = Reader.for_path('iris/train/example_iris_features.jsonlines')
test_examples_reader = Reader.for_path('iris/test/example_iris_features.jsonlines')

train_examples = train_examples_reader.read()
test_examples = test_examples_reader.read()

test_examples

{'name': 'iris/test/example_iris_features.jsonlines', 'ids': array(['EXAMPLE_73', 'EXAMPLE_18', 'EXAMPLE_118', 'EXAMPLE_78',
       'EXAMPLE_76', 'EXAMPLE_31', 'EXAMPLE_64', 'EXAMPLE_141',
       'EXAMPLE_68', 'EXAMPLE_82', 'EXAMPLE_110', 'EXAMPLE_12',
       'EXAMPLE_36', 'EXAMPLE_9', 'EXAMPLE_19', 'EXAMPLE_56',
       'EXAMPLE_104', 'EXAMPLE_69', 'EXAMPLE_55', 'EXAMPLE_132',
       'EXAMPLE_29', 'EXAMPLE_127', 'EXAMPLE_26', 'EXAMPLE_128',
       'EXAMPLE_131', 'EXAMPLE_145', 'EXAMPLE_108', 'EXAMPLE_143',
       'EXAMPLE_45', 'EXAMPLE_30', 'EXAMPLE_22', 'EXAMPLE_15',
       'EXAMPLE_65', 'EXAMPLE_11', 'EXAMPLE_42', 'EXAMPLE_146',
       'EXAMPLE_51', 'EXAMPLE_27', 'EXAMPLE_4', 'EXAMPLE_32',
       'EXAMPLE_142', 'EXAMPLE_85', 'EXAMPLE_86', 'EXAMPLE_16',
       'EXAMPLE_10', 'EXAMPLE_81', 'EXAMPLE_133', 'EXAMPLE_137',
       'EXAMPLE_75', 'EXAMPLE_109'], dtype='<U11'), 'labels': array(['versicolor', 'setosa', 'virginica', 'versicolor', 'versicolor',
       'setosa', 'versicolor', 'virg

The `Learner` class can be used to read in saved models from disk into memory and also to create new learners from scratch. First let's load in one of the models from our previous IRIS experiment and then use it to generate predictions for our test set.

In [24]:
from skll.learner import Learner

learner = Learner.from_file('iris/output/Iris_CV_example_iris_LogisticRegression_fold1.model')
learner.predict(test_examples, class_labels=True)

array(['versicolor', 'setosa', 'virginica', 'versicolor', 'versicolor',
       'setosa', 'versicolor', 'virginica', 'versicolor', 'versicolor',
       'virginica', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'virginica', 'setosa',
       'virginica', 'setosa', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'setosa', 'setosa', 'setosa', 'setosa',
       'versicolor', 'setosa', 'setosa', 'virginica', 'versicolor',
       'setosa', 'setosa', 'setosa', 'virginica', 'versicolor',
       'versicolor', 'setosa', 'setosa', 'versicolor', 'versicolor',
       'virginica', 'versicolor', 'virginica'], dtype='<U10')

Next, let's create a new learner, train and tune it on the training set via grid search using accuracy as our objective.

In [25]:
# create a new Linear SVM learner with custom hyper-parameters that will be passed to scikit-learn
new_learner = Learner('LinearSVC', model_kwargs = {'max_iter': 1000})
best_objective_value, grid_search_results = learner.train(train_examples, grid_objective='accuracy')
best_objective_value

Training data will be shuffled to randomize grid search folds.  Shuffling may yield different results compared to scikit-learn.


0.9494949494949495

The `Learner.train()` method also returns `grid_search_results` which is a dictionary containing useful intermediate results from the grid-search process e.g. the value of the objective for each of the parameter values in the grid and for each of the 3 cross-validation folds.

In [26]:
pd.DataFrame(grid_search_results)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,0.002472,0.000161,0.000644,5.7e-05,0.01,{'C': 0.01},0.647059,0.666667,0.636364,0.65003,0.012548,5
1,0.001733,7.3e-05,0.000389,5e-06,0.1,{'C': 0.1},0.794118,0.757576,0.727273,0.759655,0.027329,4
2,0.001783,0.000135,0.000417,6.5e-05,1.0,{'C': 1.0},1.0,0.818182,0.939394,0.919192,0.075589,3
3,0.001857,0.000186,0.000389,1.2e-05,10.0,{'C': 10.0},1.0,0.818182,1.0,0.939394,0.08571,2
4,0.002135,0.000184,0.000457,5.6e-05,100.0,{'C': 100.0},1.0,0.848485,1.0,0.949495,0.071425,1


Next, let's evaluate this new trained learner on the test set using both accuracy as well as balanced accuracy.

In [27]:
(conf_matrix,
 accuracy,
 prf_dict,
 model_params,
 obj_score,
 metric_scores) = learner.evaluate(test_examples, grid_objective='accuracy', output_metrics=['balanced_accuracy'])

print('Test Accuracy : {}'.format(obj_score))
print('Test Balanced Accuracy : {}'.format(metric_scores['balanced_accuracy']))
print('Test confusion matrix: {}'.format(conf_matrix))

Test Accuracy : 0.98
Test Balanced Accuracy : 0.9791666666666666
Test confusion matrix: [[19, 0, 0], [0, 15, 0], [0, 1, 15]]


This tutorial only covered some of the most basic functionality for SKLL. SKLL can do a whole lot more. Please read the [comprehensive documentation](https://skll.readthedocs.io) for more details. We also welcome any contributions you might want to make to SKLL. Please read the [contribution guidelines](https://skll.readthedocs.io/en/latest/contributing.html).