Merge pull request #1092 from RasaHQ/nlu_embed_update
Update tensorflow pipeline, add basic OOV handling
Ghostvv committed Jun 1, 2018
2 parents 9d881c4 + e475cc5 commit e0dd9fa
Showing 9 changed files with 444 additions and 201 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.rst
@@ -13,6 +13,12 @@ Added
-----
- doc link to a community contribution for Rasa NLU in Chinese
- support for the ``count_vectors_featurizer`` component to use the ``tokens`` feature provided by the tokenizer
- predict empty string instead of None for intent name
- update default parameters for tensorflow embedding classifier
- do not predict anything if feature vector contains only zeros in tensorflow embedding classifier
- change persistence keywords in tensorflow embedding classifier (makes previously trained models impossible to load)
- intent_featurizer_count_vectors adds features to text_features instead of overwriting them
- add basic OOV support to intent_featurizer_count_vectors (makes previously trained models impossible to load)

Changed
-------
60 changes: 51 additions & 9 deletions docs/pipeline.rst
@@ -270,12 +270,35 @@ intent_featurizer_count_vectors
Creates a bag-of-words representation of intent features using
`sklearn's CountVectorizer <http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_. All tokens which consist only of digits (e.g. 123 and 99, but not a123d) will be assigned to the same feature.

.. note:: If the words in the model language cannot be split by whitespace, a language-specific tokenizer is required in the pipeline before this component (e.g. using ``tokenizer_jieba`` for the Chinese language).

:Configuration:
See `sklearn's CountVectorizer docs <http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
for a detailed description of the configuration parameters.

Handling Out-Of-Vocabulary (OOV) words:

Since the training is performed on limited vocabulary data, it cannot be guaranteed that during prediction
the algorithm will not encounter an unknown word (a word that was not seen during training).
In order to teach the algorithm how to treat unknown words, some words in the training data can be substituted by the generic word ``OOV_token``.
In this case, during prediction all unknown words will be treated as this generic word ``OOV_token``.

For example, one might create a separate intent ``outofscope`` in the training data containing messages with different numbers of ``OOV_token`` occurrences and
perhaps some additional general words. The algorithm will then likely classify a message with unknown words as the intent ``outofscope``.

.. note::
    This featurizer creates a bag-of-words representation by **counting** words, so the number of ``OOV_token`` occurrences in a message might be important.

- ``OOV_token`` sets the keyword for unseen words; if the training data contains ``OOV_token`` as a word in some messages,
  then during prediction all words that were not seen during training will be substituted with the provided ``OOV_token``;
  if ``OOV_token=None`` (the default behaviour), words that were not seen during training will be ignored at prediction time;
- ``OOV_words`` sets a list of words to be treated as ``OOV_token`` during training; if a list of words that should be treated
  as Out-Of-Vocabulary is known, it can be set via ``OOV_words`` instead of manually changing it in the training data or using a custom preprocessor.

.. note::
    Providing ``OOV_words`` is optional; the training data can contain the ``OOV_token``, entered manually or by an additional custom preprocessor.
    Unseen words will be substituted with the ``OOV_token`` **only** if this token is present in the training data or an ``OOV_words`` list is provided.

.. code-block:: yaml
pipeline:
@@ -297,10 +320,16 @@ intent_featurizer_count_vectors
# integer - absolute counts
"max_df": 1.0 # float in range [0.0, 1.0] or int
# set ngram range
"min_ngram": 1
"max_ngram": 1
"min_ngram": 1 # int
"max_ngram": 1 # int
# limit vocabulary size
"max_features": None
"max_features": None # int or None
# whether to convert all characters to lowercase
"lowercase": true # bool
# handling Out-Of-Vocabulary (OOV) words
# will be converted to lowercase if lowercase is true
"OOV_token": None # string or None
"OOV_words": [] # list of strings
intent_classifier_keyword
~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -423,13 +452,17 @@ intent_classifier_tensorflow_embedding
It is recommended to use ``intent_featurizer_count_vectors``, which can optionally be preceded
by ``nlp_spacy`` and ``tokenizer_spacy``.

.. note:: If at prediction time a message contains **only** words unseen during training,
    and no Out-Of-Vocabulary preprocessor was used,
    an empty intent ``""`` is predicted with confidence ``0.0``.
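
For illustration, the intent part of the parse result then looks roughly like this (a sketch of the standard parse output, rendered here as YAML):

.. code-block:: yaml

    intent:
      # empty intent name, since every word in the message was unseen
      name: ""
      confidence: 0.0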

:Configuration:
If you want to split intents into multiple labels, e.g. for predicting multiple intents or for
modeling hierarchical intent structure, use these flags:

- tokenization of intent labels:
- ``intent_tokenization_flag`` if ``true``, the algorithm will split the intent labels into tokens and use bag-of-words representations for them; default ``false``;
- ``intent_split_symbol`` sets the delimiter string to split the intent labels; default ``_`` (see the sketch below).
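
For example, with the sketch below, a hierarchical label such as ``ask_restaurant`` would be split into the tokens ``ask`` and ``restaurant`` (the label itself is illustrative):

.. code-block:: yaml

    pipeline:
    - name: "intent_classifier_tensorflow_embedding"
      # split intent labels into tokens at the delimiter below
      "intent_tokenization_flag": true
      "intent_split_symbol": "_"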


The algorithm also has hyperparameters to control:
@@ -453,6 +486,10 @@ intent_classifier_tensorflow_embedding

.. note:: For ``cosine`` similarity ``mu_pos`` and ``mu_neg`` should be between ``-1`` and ``1``.

.. note:: There is an option to use a linearly increasing batch size; the idea comes from `<https://arxiv.org/abs/1711.00489>`_.
    To enable it, pass a list to ``batch_size``, e.g. ``"batch_size": [64, 256]`` (the default behaviour);
    the batch size then grows linearly from the first value to the second over the course of training.
    If a constant ``batch_size`` is required, pass an ``int``, e.g. ``"batch_size": 64``.

In the config, you can specify these parameters:

.. code-block:: yaml
@@ -464,14 +501,14 @@
"hidden_layer_size_a": [256, 128]
"num_hidden_layers_b": 0
"hidden_layer_size_b": []
"batch_size": 32
"batch_size": [64, 256]
"epochs": 300
# embedding parameters
"embed_dim": 10
"embed_dim": 20
"mu_pos": 0.8 # should be 0.0 < ... < 1.0 for 'cosine'
"mu_neg": -0.4 # should be -1.0 < ... < 1.0 for 'cosine'
"similarity_type": "cosine" # string 'cosine' or 'inner'
"num_neg": 10
"num_neg": 20
"use_max_sim_neg": true # flag which loss function to use
# regularization
"C2": 0.002
@@ -480,11 +517,15 @@
# whether to tokenize intents
"intent_tokenization_flag": false
"intent_split_symbol": "_"
# visualization of accuracy
"evaluate_every_num_epochs": 10 # small values may hurt performance
"evaluate_on_num_examples": 1000 # large values may hurt performance
.. note:: Parameter ``mu_neg`` is set to a negative value to mimic the original
starspace algorithm in the case ``mu_neg = mu_pos`` and ``use_max_sim_neg = False``.
See `starspace paper <https://arxiv.org/abs/1709.03856>`_ for details.
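
Putting it together, a minimal pipeline using this classifier might look like the following sketch (component names are those documented on this page; the choice of the whitespace tokenizer is illustrative):

.. code-block:: yaml

    language: "en"
    pipeline:
    - name: "tokenizer_whitespace"
    - name: "intent_featurizer_count_vectors"
    - name: "intent_classifier_tensorflow_embedding"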


intent_entity_featurizer_regex
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -496,6 +537,7 @@ intent_entity_featurizer_regex
extractor to simplify classification (assuming the classifier has learned during the training phase that this set
feature indicates a certain intent). Regex features for entity extraction are currently only supported by the
``ner_crf`` component!

.. note:: There needs to be a tokenizer before this featurizer in the pipeline!

tokenizer_whitespace
~~~~~~~~~~~~~~~~~~~~
