Commit
Merge 898e30a into 59e120d
DominikRos committed Feb 12, 2019
2 parents 59e120d + 898e30a commit a2d4b42
Showing 21 changed files with 169 additions and 163 deletions.
30 changes: 15 additions & 15 deletions docs/choosing_pipeline.rst
@@ -1,4 +1,6 @@
:desc: Set up a pipeline of pre-trained word vectors from GloVe or fastText
      or fit them specifically on your dataset using the tensorflow pipeline
      for open source NLU.

.. _choosing_pipeline:

@@ -8,7 +10,7 @@ Choosing a Rasa NLU Pipeline
The Short Answer
----------------

If you have fewer than 1000 total training examples, and there is a spaCy model for your
language, use the ``spacy_sklearn`` pipeline:

.. literalinclude:: ../sample_configs/config_spacy.yml
@@ -38,39 +40,39 @@ doesn't use any pre-trained word vectors, but instead fits these specifically for
The advantage of the ``spacy_sklearn`` pipeline is that if you have a training example like:
"I want to buy apples", and Rasa is asked to predict the intent for "get pears", your model
already knows that the words "apples" and "pears" are very similar. This is especially useful
if you don't have very much training data.

The advantage of the ``tensorflow_embedding`` pipeline is that your word vectors will be customised
for your domain. For example, in general English, the word "balance" is closely related to "symmetry",
but very different to the word "cash". In a banking domain, "balance" and "cash" are closely related
and you'd like your model to capture that. This pipeline doesn't use a language-specific model,
so it will work with any language that you can tokenize (on whitespace or using a custom tokenizer).
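Putting this into a model configuration is a one-liner with the pre-configured template (a sketch; the shipped sample config files are the authoritative reference):

```yaml
# Minimal tensorflow_embedding configuration (illustrative)
language: "en"
pipeline: "tensorflow_embedding"
```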

You can read more about this topic `here <https://medium.com/rasa-blog/supervised-word-vectors-from-scratch-in-rasa-nlu-6daf794efcd8>`_.


There are also the ``mitie`` and ``mitie_sklearn`` pipelines, which use MITIE as a source of word vectors.
We do not recommend that you use these; they are likely to be deprecated in a future release.

.. note::

Intent classification is independent of entity extraction. So sometimes
NLU will get the intent right but entities wrong, or the other way around.
You need to provide enough data for both intents and entities.


Multiple Intents
----------------

If you want to split intents into multiple labels,
e.g. for predicting multiple intents or for modeling hierarchical intent structure,
you can only do this with the tensorflow pipeline.
To do this, use these flags:

- ``intent_tokenization_flag``: if ``true`` the algorithm will split the intent labels into tokens and use a bag-of-words representation for them;
- ``intent_split_symbol``: sets the delimiter string used to split the intent labels. Defaults to ``_``.
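For illustration, both flags might be combined in a model configuration like this (a sketch; the ``+`` split symbol is an arbitrary choice for this example):

```yaml
language: "en"
pipeline:
- name: "tokenizer_whitespace"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
  intent_tokenization_flag: true
  intent_split_symbol: "+"
```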

`Here <https://blog.rasa.com/how-to-handle-multiple-intents-per-input-using-rasa-nlu-tensorflow-pipeline/>`_ is a tutorial on how to use multiple intents in Rasa Core and NLU.

Here's an example configuration:

@@ -93,7 +95,7 @@ In Rasa NLU, incoming messages are processed by a sequence of components.
These components are executed one after another
in a so-called processing pipeline. There are components for entity extraction, for intent classification,
pre-processing, and others. If you want to add your own component, for example to run a spell-check or to
do sentiment analysis, check out :ref:`section_customcomponents`.
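The execute-one-after-another idea can be sketched in plain Python (illustrative only; this is not Rasa's actual ``Component`` API):

```python
# Illustrative sketch of a processing pipeline: each component reads the
# message and adds its own output for later components to use.
class WhitespaceTokenizer:
    def process(self, message):
        message["tokens"] = message["text"].split()

class IntentClassifier:
    def process(self, message):
        # A later component can rely on the "tokens" produced earlier.
        message["intent"] = "greet" if "hello" in message["tokens"] else "other"

def run_pipeline(components, text):
    message = {"text": text}
    for component in components:
        component.process(message)
    return message

result = run_pipeline([WhitespaceTokenizer(), IntentClassifier()], "hello there")
```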

Each component processes the input and creates an output. The output can be used by any component that comes after
this component in the pipeline. There are components which only produce information that is used by other components
@@ -180,7 +182,7 @@ exactly. Instead it will return the trained synonym.
Pre-configured Pipelines
------------------------

A template is just a shortcut for
a full list of components. For example, these two configurations are equivalent:

.. literalinclude:: ../sample_configs/config_spacy.yml
@@ -255,7 +257,7 @@ default is to use a simple whitespace tokenizer:
- name: "intent_classifier_tensorflow_embedding"
If you have a custom tokenizer for your language, you can replace the whitespace
tokenizer with something more accurate.

.. _section_mitie_pipeline:

@@ -320,5 +322,3 @@ If you want to use custom components in your pipeline, see :ref:`section_customcomponents`.


.. include:: feedback.inc


20 changes: 10 additions & 10 deletions docs/components.rst
@@ -1,10 +1,12 @@
:desc: Configure the custom components of your ML model to optimise the
      processes performed on the user input of your contextual assistant.

.. _section_pipeline:

Component Configuration
=======================

This is a reference of the configuration options for every built-in component in
Rasa NLU. If you want to build a custom component, check out :ref:`section_customcomponents`.

.. contents::
@@ -117,13 +119,13 @@ intent_featurizer_count_vectors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Short: Creates bag-of-words representation of intent features
:Outputs:
    nothing, used as an input to intent classifiers that
    need bag-of-words representation of intent features
    (e.g. ``intent_classifier_tensorflow_embedding``)
:Description:
Creates bag-of-words representation of intent features using
`sklearn's CountVectorizer <http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_.
All tokens which consist only of digits (e.g. 123 and 99 but not a123d) will be assigned to the same feature.
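The digit behaviour can be illustrated with a small sketch (the ``__NUMBER__`` placeholder is made up for this example; it is not the featurizer's internal name):

```python
import re

def normalize_tokens(tokens):
    # Tokens consisting only of digits collapse onto one shared feature,
    # mirroring the behaviour described above (123 and 99 match, a123d does not).
    return ["__NUMBER__" if re.fullmatch(r"\d+", t) else t for t in tokens]

print(normalize_tokens(["send", "123", "99", "a123d"]))
# → ['send', '__NUMBER__', '__NUMBER__', 'a123d']
```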

.. note::
@@ -426,7 +428,7 @@ tokenizer_whitespace
:Description:
Creates a token for every whitespace separated character sequence. Can be used to define tokens for the MITIE entity
extractor.
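A whitespace tokenizer is simple enough to sketch directly (illustrative; the real component also records token offsets, which is what makes the tokens usable by entity extractors):

```python
def whitespace_tokenize(text):
    # One token per whitespace-separated character sequence, with
    # start/end character offsets into the original text.
    tokens, offset = [], 0
    for word in text.split():
        start = text.index(word, offset)
        tokens.append((word, start, start + len(word)))
        offset = start + len(word)
    return tokens

print(whitespace_tokenize("show me chinese restaurants"))
# → [('show', 0, 4), ('me', 5, 7), ('chinese', 8, 15), ('restaurants', 16, 27)]
```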

tokenizer_jieba
~~~~~~~~~~~~~~~~~~~~

@@ -678,7 +680,5 @@ ner_duckling_http
# needed to calculate dates from relative expressions like "tomorrow"
timezone: "Europe/Berlin"
.. include:: feedback.inc
18 changes: 9 additions & 9 deletions docs/config.rst
@@ -1,4 +1,6 @@
:desc: Read more on configuring the open source library Rasa NLU to access
      machine learning based prediction of intents and entities as a server.

.. _section_configuration:

Server Configuration
@@ -16,7 +18,7 @@ Server Configuration
In older versions of Rasa NLU, the server and models were configured with a single file.
Now, the server only takes command line arguments (see :ref:`server_parameters`).
The configuration file only refers to the model that you want to train,
i.e. the pipeline and components.


Running the server
@@ -49,21 +51,21 @@ from the same process & avoid duplicating the memory load.

As stated previously, Rasa NLU naturally handles serving multiple apps. By default the server
will load all projects found under the directory specified with the ``--path`` option,
unless you provide the ``--pre_load`` option to load only a specific project.

.. code-block:: console

    $ # This will load all projects under the projects/ directory
    $ python -m rasa_nlu.server -c config.yaml --path projects/

.. code-block:: console

    $ # This will load only the hotels project under the projects/ directory
    $ python -m rasa_nlu.server -c config.yaml --pre_load hotels --path projects/

The file structure under the ``--path`` directory is as follows:
@@ -135,5 +137,3 @@ CORS
By default CORS (cross-origin resource sharing) calls are not allowed. If you want to call your Rasa NLU server from another domain (for example from a training web UI) then you can whitelist that domain by adding it to the config value ``cors_origin``.

.. include:: feedback.inc


8 changes: 4 additions & 4 deletions docs/customcomponents.rst
@@ -1,5 +1,7 @@
:desc: Create custom components to add features like sentiment analysis
      and integrate them with the open source bot framework Rasa Stack.

.. _section_customcomponents:

Custom Components
=================
@@ -52,5 +54,3 @@ Component


.. include:: feedback.inc


26 changes: 13 additions & 13 deletions docs/dataformat.rst
@@ -1,4 +1,6 @@
:desc: Read more about how to format training data with Rasa NLU for open
      source natural language processing.

.. _section_dataformat:

Training Data Format
@@ -9,13 +11,13 @@ Data Format
~~~~~~~~~~~

You can provide training data as markdown or as json, as a single file or as a directory containing multiple files.
Note that markdown is usually easier to work with.


Markdown Format
---------------

Markdown is the easiest Rasa NLU format for humans to read and write.
Examples are listed using the unordered
list syntax, e.g. minus ``-``, asterisk ``*``, or plus ``+``.
Examples are grouped by intent, and entities are annotated as markdown links.
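A small illustrative snippet (the intent and entity names are made up for this example):

```md
## intent:restaurant_search
- show me [chinese](cuisine) restaurants
- find a place to eat in [Berlin](location)
```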
@@ -47,10 +49,10 @@ Examples are grouped by intent, and entities are annotated as markdown links.
path/to/currencies.txt
The training data for Rasa NLU is structured into different parts:
examples, synonyms, regex features, and lookup tables.

Synonyms will map extracted entities to the same name, for example mapping "my savings account" to simply "savings".
However, this only happens *after* the entities have been extracted, so you need to provide examples with the synonyms present so that Rasa can learn to pick them up.
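In the markdown format a synonym block might look like the following sketch, where the listed phrasings map to ``savings`` after extraction:

```md
## synonym:savings
- my savings account
- savings account
```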

Lookup tables may be specified either directly as lists or as txt files containing newline-separated words or phrases. Upon loading the training data, these files are used to generate case-insensitive regex patterns that are added to the regex features. For example, in this case a list of currency names is supplied so that it is easier to pick out this entity.

@@ -73,7 +75,7 @@ The most important one is ``common_examples``.
}
The ``common_examples`` are used to train your model. You should put all of your training
examples in the ``common_examples`` array.
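A minimal sketch of that structure (the example text and entity are made up):

```json
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "show me chinese restaurants",
        "intent": "restaurant_search",
        "entities": [
          {"start": 8, "end": 15, "value": "chinese", "entity": "cuisine"}
        ]
      }
    ]
  }
}
```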
Regex features are a tool to help the classifier detect entities or intents and improve the performance.


@@ -86,7 +88,7 @@ and after training a model. Luckily, there's a
for creating training data in rasa's format.
- created by `@azazdeaz <https://github.com/azazdeaz>`_ -
and it's also extremely helpful for inspecting and modifying existing data.
`Rasa Platform <https://rasa.com/products/rasa-platform>`_ (Rasa's commercial product) also has
a full-featured UI for annotating data.


@@ -101,8 +103,8 @@ data in the GUI before training.
Generating More Entity Examples
-------------------------------

It is sometimes helpful to generate a bunch of entity examples, for
example if you have a database of restaurant names. There are a couple
of great tools built by the community to help with that.

You can use `Chatito <https://rodrigopivi.github.io/Chatito/>`__ , a tool for generating training datasets in rasa's format using a simple DSL or `Tracy <https://yuukanoo.github.io/tracy>`__, a simple GUI to create training datasets for rasa.
@@ -297,7 +299,7 @@ you could have a folder called ``nlu_data``:
nlu_data/
├── restaurants.md
├── smalltalk.md
To train a model with this data, pass the path to the directory to the train script:

@@ -316,6 +318,4 @@ To train a model with this data, pass the path to the directory to the train script:


.. include:: feedback.inc
3 changes: 2 additions & 1 deletion docs/docker.rst
@@ -1,4 +1,5 @@
:desc: Set up Rasa NLU with Docker in your own infrastructure for local
      intent recognition and entity recognition.

.. _section_docker:

7 changes: 5 additions & 2 deletions docs/endpoint_configuration.rst
@@ -1,4 +1,7 @@
:desc: Add new endpoints to the configuration file of Rasa NLU to connect
      your APIs to integrate with open source NLU.

.. _section_endpoint_configuration:

Endpoint Configuration
======================
@@ -33,4 +36,4 @@ To use models from a model server, add this to your endpoint configuration:
model:
url: <path to your model>
token: <authentication token> # [optional]
token_name: <name of the token>  # [optional] (default: token)
28 changes: 14 additions & 14 deletions docs/entities.rst
@@ -1,4 +1,6 @@
:desc: Use open source named entity recognition like spacy and duckling
      for building contextual AI Assistants.

.. _section_entities:

Entity Extraction
@@ -19,8 +21,8 @@ Custom Entities
^^^^^^^^^^^^^^^

Almost every chatbot and voice app will have some custom entities.
In a restaurant bot, ``chinese`` is a cuisine, but in a language-learning app it would mean something very different.
The ``ner_crf`` component can learn custom entities in any language.


Extracting Places, Dates, People, Organisations
@@ -38,7 +40,7 @@ Dates, Amounts of Money, Durations, Distances, Ordinals

The `duckling <https://duckling.wit.ai/>`_ library does a great job
of turning expressions like "next Thursday at 8pm" into actual datetime
objects that you can use, e.g.

.. code-block:: python
@@ -47,8 +49,8 @@ The list of supported languages is `here <https://github.com/facebook/duckling/tree/master/Duckling/Dimensions>`_.
The list of supported langauges is `here <https://github.com/facebook/duckling/tree/master/Duckling/Dimensions>`_.
Duckling can also handle durations like "two hours",
amounts of money, distances, and ordinals.
Fortunately, there is a duckling docker container ready to use,
that you just need to spin up and connect to Rasa NLU.
(see :ref:`ner_duckling_http`)
@@ -59,11 +61,11 @@ Regular Expressions (regex)

You can use regular expressions to help the CRF model learn to recognize entities.
In the :ref:`section_dataformat` you can provide a list of regular expressions, each of which provides
the ``ner_crf`` with an extra binary feature, which says if the regex was found (1) or not (0).

For example, the names of German streets often end in ``strasse``. By adding this as a regex,
we tell the model to pay attention to words ending this way, and it will quickly learn to
associate them with a location entity.
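In the markdown training data this could be declared like so (a sketch; the name ``location`` is illustrative):

```md
## regex:location
- \w+strasse
```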

If you just want to match regular expressions exactly, you can do this in your code,
as a postprocessing step after receiving the response from Rasa NLU.
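Such a postprocessing step might look like this (an illustrative sketch; the pattern and entity name are assumptions, not part of Rasa NLU):

```python
import re

def add_regex_entities(text, entities, pattern=r"\w+strasse\b", entity="location"):
    # Append exact regex matches to the entity list returned by Rasa NLU.
    out = list(entities)
    for match in re.finditer(pattern, text):
        out.append({
            "start": match.start(),
            "end": match.end(),
            "value": match.group(),
            "entity": entity,
        })
    return out

result = add_regex_entities("Ich wohne in der Schulstrasse", [])
```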
@@ -103,13 +105,13 @@ Some extractors, like ``duckling``, may include additional information. For example:

.. code-block:: json
{
  "additional_info":{
    "grain":"day",
    "type":"value",
    "value":"2018-06-21T00:00:00.000-07:00",
    "values":[
      {
        "grain":"day",
        "type":"value",
        "value":"2018-06-21T00:00:00.000-07:00"
@@ -134,5 +136,3 @@ Some extractors, like ``duckling``, may include additional information.


.. include:: feedback.inc

