# Tutorial on building configuration files for Keras classification models

This tutorial aims to help users of the open-source library **DeepPavlov** understand the structure of configuration files for **classification** models implemented in DeepPavlov with **Keras** (TensorFlow backend).

Let's take a look at the "Detecting Insults in Social Commentary" Kaggle challenge (https://www.kaggle.com/c/detecting-insults-in-social-commentary).


## First, download the dataset, embedding file and pre-trained model for the considered task by running this command in a terminal:

```
python -m deeppavlov.deep download configs/sentiment/insults_kaggle.json
```

### Import necessary functions from DeepPavlov

In [46]:
import json

import pandas as pd

from deeppavlov.core.common.file import read_json, save_json

In [47]:
def print_json(data):
    print(json.dumps(data, indent=2))

### Read one of the configs for classification task.

In [48]:
config = read_json("../../configs/sentiment/insults_kaggle.json")

In [49]:
print_json(config)

{
  "dataset_reader": {
    "name": "basic_classification_reader",
    "x": "Comment",
    "y": "Class",
    "data_path": "insults_data"
  },
  "dataset_iterator": {
    "name": "basic_classification_iterator",
    "seed": 42
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "name": "default_vocab",
        "fit_on": [
          "y"
        ],
        "level": "token",
        "save_path": "vocabs/insults_kaggle_classes.dict",
        "load_path": "vocabs/insults_kaggle_classes.dict"
      },
      {
        "in": [
          "x"
        ],
        "out": [
          "x_prep"
        ],
        "name": "dirty_comments_preprocessor"
      },
      {
        "id": "my_embedder",
        "name": "fasttext",
        "save_path": "embeddings/wordpunct_tok_reddit_comments_2017_11_300.bin",
        "load_path": "embeddings/wordpunct_tok_reddit_comments_2017_11_300.bin",
        "dim": 300
      },
     

# dataset_reader

DatasetReader parameters are defined by the `config["dataset_reader"]` dictionary.

In [50]:
print_json(config["dataset_reader"])

{
  "name": "basic_classification_reader",
  "x": "Comment",
  "y": "Class",
  "data_path": "insults_data"
}


The `name` parameter is the registered name of one of the DatasetReaders from DeepPavlov.

One can either prepare the data in a particular format and use a ready-made DatasetReader, or write and register a custom DatasetReader for the dataset of interest.

For example, the considered dataset was converted to the following format and can now be used with `basic_classification_reader`:

In [51]:
pd.read_csv("../../../download/insults_data/train.csv").head()

Unnamed: 0,Comment,Class
0,"""You fuck your dad.""",Insult
1,"""i really don't understand your point.\xa0 It ...",Not Insult
2,"""A\\xc2\\xa0majority of Canadians can and has ...",Not Insult
3,"""listen if you dont wanna get married to a man...",Not Insult
4,"""C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd...",Not Insult


**Parameters for `basic_classification_reader`:**

* The goal of the DatasetReader is to read data from the given `data_path` folder.

* One can provide `train`, `valid` and/or `test` items to set the filenames of the train, validation and test sets (by default `train.csv`, `valid.csv` and `test.csv`).

* Datasets should be provided in `.csv` format. The additional parameters `sep`, `header` and `names` for `pandas.read_csv` can also be specified.

* Items `x` and `y` set the column names (by default `text` and `labels`).

* If the dataset contains multi-label data, the labels of each sample should be given in the single column `y`, separated by `class_sep` (by default `,`).

Most parameters for the considered dataset keep their default values, but let us specify them explicitly below to show what this part of the config can look like.

In [52]:
config["dataset_reader"]["train"] = "train.csv"
config["dataset_reader"]["valid"] = "valid.csv"
config["dataset_reader"]["test"] = "test.csv"
config["dataset_reader"]["sep"] = ","
config["dataset_reader"]["class_sep"] = ","

In [53]:
print_json(config["dataset_reader"])

{
  "name": "basic_classification_reader",
  "x": "Comment",
  "y": "Class",
  "data_path": "insults_data",
  "train": "train.csv",
  "valid": "valid.csv",
  "test": "test.csv",
  "sep": ",",
  "class_sep": ","
}
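To make the reader parameters more concrete, here is a minimal sketch of what a reader with the options `x`, `y`, `sep` and `class_sep` might do. It is not the DeepPavlov implementation: the helper `read_samples` is hypothetical, it uses the standard `csv` module instead of `pandas.read_csv`, and the toy rows are made up for illustration.

```python
import csv
import io


def read_samples(csv_file, x="text", y="labels", sep=",", class_sep=","):
    """Read (text, labels) pairs from an open CSV file.

    Hypothetical helper, not the DeepPavlov implementation.
    """
    samples = []
    for row in csv.DictReader(csv_file, delimiter=sep):
        # A multi-label cell such as "a,b" becomes ["a", "b"].
        labels = [label.strip() for label in row[y].split(class_sep)]
        samples.append((row[x], labels))
    return samples


# Toy stand-in for train.csv; the rows are made up for illustration.
toy_train = io.StringIO(
    "Comment,Class\n"
    '"first comment",Insult\n'
    '"second comment","Not Insult,Other"\n'
)
samples = read_samples(toy_train, x="Comment", y="Class")
print(samples)
```

Each sample comes out as a text paired with a list of labels, so single-label and multi-label rows share one structure.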


# dataset_iterator

The DatasetIterator takes the data dictionary from the DatasetReader and iterates over it batch-wise.

In [54]:
print_json(config["dataset_iterator"])

{
  "name": "basic_classification_iterator",
  "seed": 42
}


The `name` parameter is the registered name of one of the DatasetIterators from DeepPavlov.

**Parameters for `basic_classification_iterator`:**

* The boolean parameter `shuffle` (by default `True`) and the integer parameter `seed` determine whether to shuffle the train data and which seed to use.

* `basic_classification_iterator` can merge and/or split the given data (for example, when the validation set is not defined, or when it is too big and a smaller validation set can be separated).
    * `fields_to_merge` is a list of fields to be merged into the single field `merged_field` (each field name is one of `train`, `valid`, `test`).
    * `field_to_split` is the name of the field to be split into the fields named in the list `split_fields`, in the proportions given by `split_proportions` (a list of values from 0 to 1).

Most parameters for the considered dataset keep their default values, but let us specify them explicitly below to show what this part of the config can look like.

The example below shows the parameters needed to merge the samples from the train and validation files and then separate one tenth of them as the validation set.

In [55]:
config["dataset_iterator"]["shuffle"] = True
config["dataset_iterator"]["seed"] = 42
config["dataset_iterator"]["fields_to_merge"] = ["train", "valid"]
config["dataset_iterator"]["merged_field"] = "train"
config["dataset_iterator"]["field_to_split"] = "train"
config["dataset_iterator"]["split_proportions"] = [0.9, 0.1]

In [56]:
print_json(config["dataset_iterator"])

{
  "name": "basic_classification_iterator",
  "seed": 42,
  "shuffle": true,
  "fields_to_merge": [
    "train",
    "valid"
  ],
  "merged_field": "train",
  "field_to_split": "train",
  "split_proportions": [
    0.9,
    0.1
  ]
}
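The merge and split options can be illustrated with a small standalone sketch. It is an assumption-laden illustration rather than the DeepPavlov code: the function `merge_and_split` is hypothetical, the toy samples are plain integers, and `split_fields` is assumed to be `["train", "valid"]` because the cell above does not set it.

```python
import random


def merge_and_split(data, fields_to_merge, merged_field,
                    field_to_split, split_fields, split_proportions, seed=42):
    """Hypothetical illustration of the merge and split options."""
    data = {name: list(samples) for name, samples in data.items()}

    # Merge: concatenate the listed fields into `merged_field`.
    merged = [sample for name in fields_to_merge for sample in data[name]]
    for name in fields_to_merge:
        data[name] = []
    data[merged_field] = merged

    # Split: shuffle reproducibly, then cut the pool by proportion.
    pool = data[field_to_split]
    random.Random(seed).shuffle(pool)
    start = 0
    for i, (name, share) in enumerate(zip(split_fields, split_proportions)):
        is_last = i == len(split_fields) - 1
        # The last field takes the remainder so that no sample is lost.
        end = len(pool) if is_last else start + round(share * len(pool))
        data[name] = pool[start:end]
        start = end
    return data


toy = {"train": list(range(16)), "valid": list(range(16, 20)), "test": []}
result = merge_and_split(
    toy,
    fields_to_merge=["train", "valid"],
    merged_field="train",
    field_to_split="train",
    split_fields=["train", "valid"],  # assumed, not set in the cell above
    split_proportions=[0.9, 0.1],
)
print(len(result["train"]), len(result["valid"]))  # prints "18 2"
```

With 20 samples in total, the 0.9/0.1 proportions leave 18 samples for training and 2 for validation.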


# chainer

The chainer is the biggest part of the config; it determines the structure and parameters of the model pipeline.

In [57]:
print_json(config["chainer"])

{
  "in": [
    "x"
  ],
  "in_y": [
    "y"
  ],
  "pipe": [
    {
      "id": "classes_vocab",
      "name": "default_vocab",
      "fit_on": [
        "y"
      ],
      "level": "token",
      "save_path": "vocabs/insults_kaggle_classes.dict",
      "load_path": "vocabs/insults_kaggle_classes.dict"
    },
    {
      "in": [
        "x"
      ],
      "out": [
        "x_prep"
      ],
      "name": "dirty_comments_preprocessor"
    },
    {
      "id": "my_embedder",
      "name": "fasttext",
      "save_path": "embeddings/wordpunct_tok_reddit_comments_2017_11_300.bin",
      "load_path": "embeddings/wordpunct_tok_reddit_comments_2017_11_300.bin",
      "dim": 300
    },
    {
      "id": "my_tokenizer",
      "name": "nltk_tokenizer",
      "tokenizer": "wordpunct_tokenize"
    },
    {
      "in": [
        "x_prep"
      ],
      "in_y": [
        "y"
      ],
      "out": [
        "y_labels",
        "y_probas_dict"
      ],
      "main": true,
      "name": "intent_model",
 

`chainer` has four main parameters, `in`, `in_y`, `out` and `pipe`:

* `in`, `in_y` and `out` denote the names and structure of the data passed through the pipeline. The DatasetIterator `basic_classification_iterator` provides each data sample as a tuple of two elements `(x, y)`: the text and its labels.

* `pipe` is a list of pipeline elements: vocabularies, preprocessors, embedders, tokenizers and the model itself.

* Every element of the pipe must specify a `name`, which is a registered name in DeepPavlov.

* The `id` parameter can be specified so that an element can be referenced later. For example, a tokenizer must be given to `KerasModel` when the model is initialized. Therefore, one should place the tokenizer element before the model, specify `"id": "my_tokenizer"`, and then refer to it as `"tokenizer": "#my_tokenizer"` in the model parameters.

* If an element of the pipe processes data, its `in` and `out` determine the order of the data flow.

* The other parameters are specific to each element of the pipe.
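The following toy runner sketches how `in`, `out`, `id` and `#id` references could interact. It is only an illustration of the idea, not DeepPavlov's chainer: the keys `factory` and `params` are hypothetical stand-ins for the registry lookup by `name` and for the element's own parameters.

```python
def resolve(value, registry):
    """Replace a "#some_id" string with the component stored under that id."""
    if isinstance(value, str) and value.startswith("#"):
        return registry[value[1:]]
    return value


def run_pipe(pipe, batch):
    """Build every element in order and route data by `in` and `out` names."""
    registry = {}         # components addressable through "#id"
    memory = dict(batch)  # named data flowing through the pipe
    for element in pipe:
        params = {key: resolve(value, registry)
                  for key, value in element.get("params", {}).items()}
        component = element["factory"](**params)
        if "id" in element:
            registry[element["id"]] = component
        if "in" in element:  # only data-processing elements are called
            outputs = component(*[memory[name] for name in element["in"]])
            if len(element["out"]) == 1:
                outputs = (outputs,)
            memory.update(zip(element["out"], outputs))
    return memory


toy_pipe = [
    # Initialized only; it never touches the data directly.
    {"id": "my_tokenizer", "factory": lambda: str.split},
    # Preprocessor: x -> x_prep.
    {"in": ["x"], "out": ["x_prep"],
     "factory": lambda: (lambda texts: [text.lower() for text in texts])},
    # "Model" that receives the tokenizer through a "#id" reference.
    {"in": ["x_prep"], "out": ["n_tokens"],
     "params": {"tokenizer": "#my_tokenizer"},
     "factory": lambda tokenizer: (
         lambda texts: [len(tokenizer(text)) for text in texts])},
]
memory = run_pipe(toy_pipe, {"x": ["Hello World", "One two THREE"]})
print(memory["x_prep"], memory["n_tokens"])
```

The tokenizer element has no `in` or `out`, so it is only constructed and stored; the last element picks it up by its `id`, which is why it has to appear earlier in the list.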

### Vocab

The considered classification model needs only one vocabulary, which is used to collect all the classes present in the train set.

In [59]:
print_json(config["chainer"]["pipe"][0])

{
  "id": "classes_vocab",
  "name": "default_vocab",
  "fit_on": [
    "y"
  ],
  "level": "token",
  "save_path": "vocabs/insults_kaggle_classes.dict",
  "load_path": "vocabs/insults_kaggle_classes.dict"
}


* `id` is a user-defined name for further references in the config.
* `default_vocab` is the registered name of the vocabulary.
* The vocab is fitted on `y` (which denotes the labels in the considered task) at the token `level`.
* `save_path` and `load_path` denote where to save the fitted vocabulary of labels and where to load a pre-built one from.
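As a rough illustration of what fitting a vocabulary on `y` means, the sketch below collects the distinct labels of a toy train set. The class `LabelVocab` is hypothetical and far simpler than `default_vocab`; it only shows why a reference such as `#classes_vocab.keys()` can supply the list of classes to the model.

```python
class LabelVocab:
    """Hypothetical minimal label vocabulary (not DeepPavlov's default_vocab)."""

    def __init__(self):
        self._index = {}

    def fit(self, y_batch):
        # Each sample carries a list of labels; remember every new label once.
        for labels in y_batch:
            for label in labels:
                self._index.setdefault(label, len(self._index))
        return self

    def keys(self):
        return list(self._index)


train_labels = [["Insult"], ["Not Insult"], ["Not Insult"], ["Insult"]]
classes_vocab = LabelVocab().fit(train_labels)
print(classes_vocab.keys())
```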

### Preprocessors

One can use the preprocessors available in DeepPavlov or create a custom preprocessor.

In [60]:
print_json(config["chainer"]["pipe"][1])

{
  "in": [
    "x"
  ],
  "out": [
    "x_prep"
  ],
  "name": "dirty_comments_preprocessor"
}


A preprocessor is a part of the pipe that processes the given data, which means one has to define its input and output names and structure.
* `name` is the registered name of a preprocessor from DeepPavlov.
* The considered preprocessor takes texts from the `in` field of the pipe and writes the processed texts to `out`. In this case the preprocessor acts on `x`, which is the exact input of the chainer (the first element of the tuple generated by the DatasetIterator).

### Embedders and tokenizers

Embedders and tokenizers are not pipeline elements in the strict sense because they do not process data in the pipeline themselves.

However, the embedder and the tokenizer must be initialized before they are passed to the main model.
Therefore, they should be placed somewhere before the main model in the pipeline config.

In [61]:
print_json(config["chainer"]["pipe"][2])

{
  "id": "my_embedder",
  "name": "fasttext",
  "save_path": "embeddings/wordpunct_tok_reddit_comments_2017_11_300.bin",
  "load_path": "embeddings/wordpunct_tok_reddit_comments_2017_11_300.bin",
  "dim": 300
}


In [62]:
print_json(config["chainer"]["pipe"][3])

{
  "id": "my_tokenizer",
  "name": "nltk_tokenizer",
  "tokenizer": "wordpunct_tokenize"
}


* `id` is a user-defined name for further references in the config.
* `name` is the registered name of the embedder/tokenizer in DeepPavlov.
* `save_path` and `load_path` denote where to load a pre-trained embedder/tokenizer from and where to save a trained one.
* Other, component-specific parameters are also accepted. For example, `dim` defines the dimensionality of the embedding model trained with fastText, and `tokenizer` selects the NLTK tokenization function.
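To give a feeling for the `wordpunct_tokenize` mode, the sketch below imitates it with the standard `re` module. It assumes the pattern `\w+|[^\w\s]+` that NLTK documents for its word-and-punctuation tokenizer; the function here is a stand-in written for this tutorial, not a call into NLTK.

```python
import re

# Runs of word characters, or runs of punctuation; whitespace is dropped.
_WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def wordpunct_tokenize(text):
    """Stand-in for NLTK's wordpunct_tokenize, assuming the pattern above."""
    return _WORDPUNCT.findall(text)


tokens = wordpunct_tokenize("You're right, aren't you?")
print(tokens)
```

Note that contractions are split around the apostrophe, so the same tokenization has to be used for the texts and for the corpus the embeddings were trained on (the embedding file name above starts with `wordpunct_tok`).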

### Model

The main part of the classification pipeline is the Keras neural model itself.

One can either use the neural networks implemented in `KerasIntentModel` or implement a custom network as a method of the `KerasIntentModel` class.

**A description of the parameters of the currently available neural networks is given at the end of this tutorial.**

In [63]:
print_json(config["chainer"]["pipe"][4])

{
  "in": [
    "x_prep"
  ],
  "in_y": [
    "y"
  ],
  "out": [
    "y_labels",
    "y_probas_dict"
  ],
  "main": true,
  "name": "intent_model",
  "save_path": "sentiment/insults_kaggle_v0",
  "load_path": "sentiment/insults_kaggle_v0",
  "classes": "#classes_vocab.keys()",
  "kernel_sizes_cnn": [
    1,
    2,
    3
  ],
  "filters_cnn": 256,
  "confident_threshold": 0.5,
  "optimizer": "Adam",
  "lear_rate": 0.01,
  "lear_rate_decay": 0.1,
  "loss": "binary_crossentropy",
  "last_layer_activation": "softmax",
  "text_size": 100,
  "coef_reg_cnn": 0.001,
  "coef_reg_den": 0.01,
  "dropout_rate": 0.5,
  "dense_size": 100,
  "model_name": "cnn_model",
  "embedder": "#my_embedder",
  "tokenizer": "#my_tokenizer"
}


* `in`, `in_y` and `out` denote the names and structure of the data passed through the pipeline. The DatasetIterator `basic_classification_iterator` provides each data sample as a tuple of two elements (`x`, `y`): the text and its labels. The preprocessor then turns `x` into `x_prep`, and this `x_prep` is the input of the main model along with the `y` labels. For each sample the main model returns a tuple of two elements (`y_labels`, `y_probas_dict`), where `y_labels` is an array of the predicted classes (those the sample belongs to) and `y_probas_dict` is a dictionary like `{"class_i": probability_i}`.
* `name` is the registered name of the model in DeepPavlov.
* `save_path` and `load_path` denote where to load a pre-trained model from and where to save the trained model.
* `classes` contains the names of all classes present in the train dataset. In the considered case it is given as a reference to the `keys()` method of the vocabulary of labels (the vocabulary is referred to by its `id`).
* `model_name` is the name of a method of the `KerasIntentModel` class. **Currently available methods** are `cnn_model`, `dcnn_model`, `cnn_model_max_and_aver_pool`, `bilstm_model`, `bilstm_bilstm_model`, `bilstm_cnn_model`, `cnn_bilstm_model`, `bilstm_self_add_attention_model`, `bilstm_self_mult_attention_model`, `bigru_model`.
* `kernel_sizes_cnn`, `filters_cnn`, `dense_size`, `last_layer_activation`, `coef_reg_cnn`, `coef_reg_den`, `dropout_rate` are parameters specific to the `cnn_model` method of `KerasIntentModel`.
* `confident_threshold` is the probability threshold for converting probabilities to labels. Its value lies between 0 and 1. If all probabilities are lower than `confident_threshold`, the label with the highest probability is assigned.
* `optimizer` is the name of an optimizer from `keras.optimizers`.
* `lear_rate` and `lear_rate_decay` are the learning rate and the learning rate decay.
* `loss` is the name of a loss function from `keras.losses`.
* `text_size` determines the maximal length of a text in tokens (words); longer texts are cut and shorter ones are padded with zeros (pre-padding).
* `embedder` and `tokenizer` are given as references to pipeline elements via their `id`.
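Two of the parameters above, `text_size` and `confident_threshold`, describe small procedures that are easy to show in code. The sketch below is a plain-Python illustration under stated assumptions, not the model's actual code: both helper functions are hypothetical, cutting is assumed to keep the first `text_size` tokens, and a label is assumed to be assigned when its probability is strictly greater than the threshold.

```python
def pad_or_cut(vectors, text_size, dim):
    """Bring a list of token vectors to exactly `text_size` rows.

    Assumption: long texts keep their first `text_size` tokens.
    Short texts get zero vectors added in front (pre-padding).
    """
    cut = vectors[:text_size]
    padding = [[0.0] * dim for _ in range(text_size - len(cut))]
    return padding + cut


def probas_to_labels(probas, classes, confident_threshold=0.5):
    """Convert class probabilities to (y_labels, y_probas_dict)."""
    labels = [name for name, p in zip(classes, probas)
              if p > confident_threshold]
    if not labels:
        # No class is confident enough: fall back to the most probable one.
        best = max(range(len(classes)), key=lambda i: probas[i])
        labels = [classes[best]]
    return labels, dict(zip(classes, probas))


classes = ["Insult", "Not Insult"]
padded = pad_or_cut([[1.0, 1.0]], text_size=3, dim=2)
confident = probas_to_labels([0.8, 0.2], classes)
fallback = probas_to_labels([0.4, 0.3], classes)
print(padded)
print(confident)
print(fallback)
```

In the second call no probability exceeds 0.5, so the fallback branch returns the single most probable class.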

### Train parameters

Another essential part of a config file is the dictionary of train parameters.

In [65]:
print_json(config["train"])

{
  "epochs": 1000,
  "batch_size": 64,
  "metrics": [
    "classification_accuracy",
    "classification_f1",
    "classification_roc_auc"
  ],
  "validation_patience": 5,
  "val_every_n_epochs": 5,
  "log_every_n_epochs": 5,
  "show_examples": false,
  "validate_best": true,
  "test_best": true
}


* `epochs` is the number of training epochs (an upper bound, since early stopping may end training sooner).
* `batch_size` is the batch size used for training and evaluation.
* `metrics` is a list of names of registered metrics. For the examined task `classification_accuracy`, `classification_f1` and `classification_roc_auc` can be used because they accept the model's special output (a tuple of two elements).
* `metric_optimization` determines whether to minimize or maximize the main metric (`"minimize"` or `"maximize"`), by default `maximize`.
* `validation_patience` is the early stopping parameter: for how many epochs the training can continue without improvement of the metric value on the validation set.
* `val_every_n_epochs` is the frequency of validation during training (validate every n epochs).
* `val_every_n_batches` is the frequency of validation during training (validate every n batches).
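The interplay of `validation_patience` and `metric_optimization` can be sketched as a small early stopping loop. This is a hypothetical illustration, not the DeepPavlov training loop: the function name is made up, the scores are invented, and patience is counted in validation rounds, which coincides with epochs only when validation runs every epoch.

```python
def early_stopping(validation_scores, validation_patience,
                   metric_optimization="maximize"):
    """Return (best_score, rounds_run) for a sequence of validation scores."""
    if metric_optimization == "maximize":
        improved = lambda new, best: new > best
    else:
        improved = lambda new, best: new < best

    best, impatience = None, 0
    for round_number, score in enumerate(validation_scores, start=1):
        if best is None or improved(score, best):
            best, impatience = score, 0
        else:
            impatience += 1
            if impatience >= validation_patience:
                return best, round_number  # stop training here
    return best, len(validation_scores)


# Invented validation scores, one per validation round.
scores = [0.70, 0.75, 0.74, 0.73, 0.76, 0.75]
print(early_stopping(scores, validation_patience=2))  # prints "(0.75, 4)"
```

With a patience of 2 the loop stops after the fourth round and never sees the later score of 0.76, which is the usual trade-off of a small patience value.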

### Metadata

Additional information about the model.

In [66]:
print_json(config["metadata"])

{
  "labels": {
    "telegram_utils": "IntentModel",
    "server_utils": "KerasIntentModel"
  },
  "download": [
    "http://lnsigo.mipt.ru/export/deeppavlov_data/vocabs.tar.gz",
    "http://lnsigo.mipt.ru/export/deeppavlov_data/sentiment.tar.gz",
    "http://lnsigo.mipt.ru/export/datasets/insults_data.tar.gz",
    {
      "url": "http://lnsigo.mipt.ru/export/embeddings/reddit_fastText/wordpunct_tok_reddit_comments_2017_11_300.bin",
      "subdir": "embeddings"
    }
  ]
}


* `labels` defines labels or tags used to refer to this model.
* `download` contains links for downloading all the components required by the considered model.
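The `download` list mixes two entry shapes: plain URL strings and dictionaries with `url` and `subdir`. A consumer of this list might first bring both shapes to one form, as in the sketch below. The helper `normalize_downloads` is hypothetical, and treating a missing `subdir` as an empty string is an assumption, not documented DeepPavlov behaviour.

```python
def normalize_downloads(download):
    """Turn every entry into a dict with "url" and "subdir" keys."""
    plan = []
    for item in download:
        if isinstance(item, str):
            # Assumption: a bare URL goes to the root of the download folder.
            plan.append({"url": item, "subdir": ""})
        else:
            plan.append({"url": item["url"], "subdir": item.get("subdir", "")})
    return plan


download = [
    "http://lnsigo.mipt.ru/export/deeppavlov_data/vocabs.tar.gz",
    {
        "url": "http://lnsigo.mipt.ru/export/embeddings/reddit_fastText/"
               "wordpunct_tok_reddit_comments_2017_11_300.bin",
        "subdir": "embeddings",
    },
]
plan = normalize_downloads(download)
for entry in plan:
    print(entry["subdir"] or ".", entry["url"])
```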


# Model parameters

### cnn_model

Shallow and wide convolutional NN.

* `kernel_sizes_cnn` - list of kernel sizes of convolutions
* `filters_cnn` - number of filters for convolutions
* `coef_reg_cnn` - L2-regularization coefficient for convolutions
* `dropout_rate` - dropout rate to be used after convolutions and between dense layers
* `dense_size` - number of units for dense layer
* `coef_reg_den` - L2-regularization coefficient for dense layers
* `last_layer_activation` - activation type for the last classification layer (by default `sigmoid`)

### dcnn_model

Deep convolutional NN.

* `kernel_sizes_cnn` - list of kernel sizes of convolutions
* `filters_cnn` - list of numbers of filters for convolutions
* `coef_reg_cnn` - L2-regularization coefficient for convolutions
* `dropout_rate` - dropout rate to be used after convolutions and between dense layers
* `dense_size` - number of units for dense layer
* `coef_reg_den` - L2-regularization coefficient for dense layers
* `last_layer_activation` - activation type for the last classification layer (by default `sigmoid`)

### cnn_model_max_and_aver_pool

Shallow and wide convolutional NN with concatenation of max and average pooling after convolutions.

* `kernel_sizes_cnn` - list of kernel sizes of convolutions
* `filters_cnn` - number of filters for convolutions
* `coef_reg_cnn` - L2-regularization coefficient for convolutions
* `dropout_rate` - dropout rate to be used after convolutions and between dense layers
* `dense_size` - number of units for dense layer
* `coef_reg_den` - L2-regularization coefficient for dense layers
* `last_layer_activation` - activation type for the last classification layer (by default `sigmoid`)

### bilstm_model

Bidirectional Long short-term memory NN.

* `units_lstm` - number of units for LSTM
* `coef_reg_lstm` - L2-regularization coefficient for LSTM
* `rec_dropout_rate` - dropout rate for LSTM
* `dropout_rate` - dropout rate to be used after BiLSTM and between dense layers
* `dense_size` - number of units for dense layer
* `coef_reg_den` - L2-regularization coefficient for dense layers
* `last_layer_activation` - activation type for the last classification layer (by default `sigmoid`)

### bilstm_bilstm_model

Two-layer bidirectional Long short-term memory NN.

* `units_lstm_1` - number of units for the first LSTM layer
* `units_lstm_2` - number of units for the second LSTM layer
* `coef_reg_lstm` - L2-regularization coefficient for LSTM
* `rec_dropout_rate` - dropout rate for LSTM
* `dropout_rate` - dropout rate to be used between all BiLSTM and dense layers
* `dense_size` - number of units for dense layer
* `coef_reg_den` - L2-regularization coefficient for dense layers
* `last_layer_activation` - activation type for the last classification layer (by default `sigmoid`)

### bilstm_cnn_model

Bidirectional Long short-term memory NN followed by shallow and wide Convolutional NN.

* `units_lstm` - number of units for the LSTM layer
* `coef_reg_lstm` - L2-regularization coefficient for LSTM
* `rec_dropout_rate` - dropout rate for LSTM
* `kernel_sizes_cnn` - list of kernel sizes of convolutions
* `filters_cnn` - number of filters for convolutions
* `coef_reg_cnn` - L2-regularization coefficient for convolutions
* `dropout_rate` - dropout rate to be used between BiLSTM and CNN, after CNN and between dense layers
* `dense_size` - number of units for dense layer
* `coef_reg_den` - L2-regularization coefficient for dense layers
* `last_layer_activation` - activation type for the last classification layer (by default `sigmoid`)

### cnn_bilstm_model

Shallow-and-wide Convolutional NN followed by Bidirectional Long short-term memory NN.

* `kernel_sizes_cnn` - list of kernel sizes of convolutions
* `filters_cnn` - number of filters for convolutions
* `coef_reg_cnn` - L2-regularization coefficient for convolutions
* `units_lstm` - number of units for the LSTM layer
* `coef_reg_lstm` - L2-regularization coefficient for LSTM
* `rec_dropout_rate` - dropout rate for LSTM
* `dropout_rate` - dropout rate to be used between CNN and BiLSTM, after BiLSTM and between dense layers
* `dense_size` - number of units for dense layer
* `coef_reg_den` - L2-regularization coefficient for dense layers
* `last_layer_activation` - activation type for the last classification layer (by default `sigmoid`)

### bilstm_self_add_attention_model

Bidirectional Long short-term memory NN with additive self-attention.

* `units_lstm` - number of units for the LSTM layer
* `coef_reg_lstm` - L2-regularization coefficient for LSTM
* `rec_dropout_rate` - dropout rate for LSTM
* `self_att_hid` - number of hidden units for additive self-attention layer
* `self_att_out` - number of output units for additive self-attention layer
* `dropout_rate` - dropout rate to be used after self-attention layer and between dense layers
* `dense_size` - number of units for dense layer
* `coef_reg_den` - L2-regularization coefficient for dense layers
* `last_layer_activation` - activation type for the last classification layer (by default `sigmoid`)

### bilstm_self_mult_attention_model

Bidirectional Long short-term memory NN with multiplicative self-attention.

* `units_lstm` - number of units for the LSTM layer
* `coef_reg_lstm` - L2-regularization coefficient for LSTM
* `rec_dropout_rate` - dropout rate for LSTM
* `self_att_hid` - number of hidden units for multiplicative self-attention layer
* `self_att_out` - number of output units for multiplicative self-attention layer
* `dropout_rate` - dropout rate to be used after self-attention layer and between dense layers
* `dense_size` - number of units for dense layer
* `coef_reg_den` - L2-regularization coefficient for dense layers
* `last_layer_activation` - activation type for the last classification layer (by default `sigmoid`)

### bigru_model

Bidirectional Gated Recurrent Units NN.

* `units_lstm` - number of units for the GRU layer
* `coef_reg_lstm` - L2-regularization coefficient for GRU
* `rec_dropout_rate` - dropout rate for GRU
* `dropout_rate` - dropout rate to be used after BiGRU and between dense layers
* `dense_size` - number of units for dense layer
* `coef_reg_den` - L2-regularization coefficient for dense layers
* `last_layer_activation` - activation type for the last classification layer (by default `sigmoid`)
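Since every network is selected through `model_name`, switching architectures mostly means swapping the network-specific keys in the model element of the config. The sketch below shows one way this could be done for `bilstm_model`. The helper `switch_to_bilstm` is hypothetical, the dictionary is a shortened copy of the model element shown earlier, and the LSTM values are placeholders, not recommended settings.

```python
import copy

# Keys that only the CNN-based networks read, per the lists above.
CNN_ONLY_KEYS = ("kernel_sizes_cnn", "filters_cnn", "coef_reg_cnn")


def switch_to_bilstm(model_params, units_lstm, coef_reg_lstm, rec_dropout_rate):
    """Return a copy of the model element configured for bilstm_model."""
    params = copy.deepcopy(model_params)
    for key in CNN_ONLY_KEYS:
        params.pop(key, None)
    params.update(
        model_name="bilstm_model",
        units_lstm=units_lstm,
        coef_reg_lstm=coef_reg_lstm,
        rec_dropout_rate=rec_dropout_rate,
    )
    return params


# Shortened copy of the model element from the config printed earlier.
cnn_params = {
    "name": "intent_model",
    "model_name": "cnn_model",
    "kernel_sizes_cnn": [1, 2, 3],
    "filters_cnn": 256,
    "coef_reg_cnn": 0.001,
    "dropout_rate": 0.5,
    "dense_size": 100,
}
# Placeholder values chosen only for illustration.
bilstm_params = switch_to_bilstm(cnn_params, units_lstm=64,
                                 coef_reg_lstm=0.001, rec_dropout_rate=0.5)
print(bilstm_params)
```

In the notebook, the result could then be assigned back to `config["chainer"]["pipe"][4]` and written out with `save_json`.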