added description of config params for tf embed classifier #1012
Merged
6 commits:
- f3e61d0 (Ghostvv): added description of config params for tf embed classifier
- aa1921a (Ghostvv): change the phrases
- 36eefd5 (Ghostvv): change the phrases
- 3a5a6f8 (amn41): duplicate info on intent tokenization and make more prevalent
- 8b69ab6 (amn41): fix indent
- 17723d8 (Ghostvv): Merge branch 'master' into update_docs_embed
@@ -108,11 +108,17 @@ to use it as a template:

    pipeline: "tensorflow_embedding"

The tensorflow pipeline supports any language that can be tokenized. The
current tokenizer implementation relies on words being separated by spaces,
so any language that adheres to that can be trained with this pipeline.

If you want to split intents into multiple labels, e.g. for predicting multiple
intents or for modeling hierarchical intent structure, use these flags:

- ``intent_tokenization_flag``: if ``true``, the algorithm will split the intent
  labels into tokens and use bag-of-words representations for them;
- ``intent_split_symbol``: sets the delimiter string used to split the intent
  labels. Defaults to ``_``.

Here's an example configuration:

.. code-block:: yaml

@@ -121,6 +127,10 @@ To use the components and configure them separately:

    pipeline:
    - name: "intent_featurizer_count_vectors"
    - name: "intent_classifier_tensorflow_embedding"
      intent_tokenization_flag: true
      intent_split_symbol: "_"
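To make the effect of the two flags above concrete, here is a hypothetical Python sketch of how an intent label could be split into tokens for a bag-of-words representation. The function ``tokenize_intent`` and its signature are invented for illustration; this is not the actual Rasa NLU implementation.

```python
# Hypothetical sketch (not the actual Rasa NLU implementation) of what the
# intent_tokenization_flag / intent_split_symbol options do.
def tokenize_intent(label, intent_tokenization_flag=True, intent_split_symbol="_"):
    """Split an intent label into tokens for a bag-of-words representation."""
    if not intent_tokenization_flag:
        return [label]  # treat the whole label as a single token
    return label.split(intent_split_symbol)

# Hierarchical labels now share tokens, e.g. both of these contain "restaurant":
print(tokenize_intent("restaurant_search"))  # ['restaurant', 'search']
print(tokenize_intent("restaurant_book"))    # ['restaurant', 'book']
```

Because related labels share tokens, intents such as ``restaurant_search`` and ``restaurant_book`` end up with overlapping bag-of-words features, which is what makes modeling hierarchical intent structure possible.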
Custom pipelines
~~~~~~~~~~~~~~~~
@@ -412,19 +422,36 @@ intent_classifier_tensorflow_embedding

    by ``nlp_spacy`` and ``tokenizer_spacy``.

:Configuration:

    If you want to split intents into multiple labels, e.g. for predicting
    multiple intents or for modeling hierarchical intent structure, use these flags:

    - tokenization of intent labels:

      - ``intent_tokenization_flag``: if ``true``, the algorithm will split the intent labels into tokens and use bag-of-words representations for them;
      - ``intent_split_symbol``: sets the delimiter string used to split the intent labels. Defaults to ``_``.

    The algorithm also has hyperparameters to control:

    - neural network's architecture:

      - ``num_hidden_layers_a`` and ``hidden_layer_size_a`` set the number of hidden layers and their sizes before the embedding layer for user inputs;
      - ``num_hidden_layers_b`` and ``hidden_layer_size_b`` set the number of hidden layers and their sizes before the embedding layer for intent labels;

    - training:

      - ``batch_size`` sets the number of training examples in one forward/backward pass; the higher the batch size, the more memory you will need;
      - ``epochs`` sets the number of times the algorithm sees the training data, where one epoch equals one forward pass and one backward pass over all training examples;

    - embedding:

      - ``embed_dim`` sets the dimension of the embedding space;
      - ``mu_pos`` controls how similar the algorithm should try to make embedding vectors for correct intent labels;
      - ``mu_neg`` controls the maximum negative similarity for incorrect intents;
      - ``similarity_type`` sets the type of similarity; it should be either ``cosine`` or ``inner``;
      - ``num_neg`` sets the number of incorrect intent labels whose similarity to the user input the algorithm will minimize during training;
      - ``use_max_sim_neg``: if ``true``, the algorithm only minimizes the maximum similarity over incorrect intent labels;

    - regularization:

      - ``C2`` sets the scale of L2 regularization;
      - ``C_emb`` sets the scale of how important it is to minimize the maximum similarity between embeddings of different intent labels;
      - ``droprate`` sets the dropout rate; it should be between ``0`` and ``1``, e.g. ``droprate=0.1`` would drop out ``10%`` of input units.

    .. note:: For ``cosine`` similarity, ``mu_pos`` and ``mu_neg`` should be between ``-1`` and ``1``.

    In the config, you can specify these parameters:

    .. code-block:: yaml

@@ -452,6 +479,10 @@ intent_classifier_tensorflow_embedding

        "intent_tokenization_flag": false
        "intent_split_symbol": "_"

    .. note:: The parameter ``mu_neg`` is set to a negative value to mimic the original
       starspace algorithm in the case ``mu_neg = mu_pos`` and ``use_max_sim_neg = False``.
       See the `starspace paper <https://arxiv.org/abs/1709.03856>`_ for details.
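The interplay of ``mu_pos``, ``mu_neg``, and ``use_max_sim_neg`` described in the note above can be illustrated with a small numpy sketch of a StarSpace-style hinge loss, assuming cosine similarity. The function name ``embedding_loss`` and its arguments are invented for illustration; this is not the actual Rasa NLU TensorFlow code.

```python
import numpy as np

# Hedged sketch of a StarSpace-style loss, assuming cosine similarity.
# Names (embedding_loss, a, b_pos, b_negs) are invented for illustration.
def embedding_loss(a, b_pos, b_negs, mu_pos=0.8, mu_neg=-0.4, use_max_sim_neg=True):
    """a: user-input embedding; b_pos: embedding of the correct intent;
    b_negs: embeddings of sampled incorrect intents (num_neg of them)."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    sim_pos = cos(a, b_pos)
    sim_negs = [cos(a, b) for b in b_negs]

    # push similarity with the correct label above mu_pos
    loss = max(0.0, mu_pos - sim_pos)
    if use_max_sim_neg:
        # only penalize the hardest (most similar) incorrect label
        loss += max(0.0, mu_neg + max(sim_negs))
    else:
        # penalize every incorrect label whose similarity exceeds -mu_neg
        loss += sum(max(0.0, mu_neg + s) for s in sim_negs)
    return loss

# Perfect separation: correct intent identical, wrong intent orthogonal.
print(embedding_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                     [np.array([0.0, 1.0])]))  # 0.0
```

With ``mu_neg`` negative, an incorrect intent is only penalized once its similarity to the input rises above ``-mu_neg``, and with ``use_max_sim_neg = False`` all sampled incorrect intents contribute, which is the regime the note compares to the original starspace algorithm.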
intent_entity_featurizer_regex
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Review comment: needs more indentation (should start at the same level as ``name``).