Phrase matcher #822

Closed

wants to merge 28 commits into from

Commits
7570ce0
added entity_phrases support to markdown and json
twerkmeister Feb 5, 2018
6cb02c1
simplified extractors appending entities
twerkmeister Feb 5, 2018
faa6e5e
added entity_phrase extractor component
twerkmeister Feb 5, 2018
e370711
added test for phrase matcher
twerkmeister Feb 6, 2018
dc24813
removed non-ascii character from doc string
twerkmeister Feb 6, 2018
db3e146
Merge branch 'master' of github.com:RasaHQ/rasa_nlu into phrase_matcher
twerkmeister Feb 20, 2018
4b69dea
updated changelog
twerkmeister Feb 20, 2018
08c956f
case sensitivity
twerkmeister Feb 20, 2018
757fecf
docs
twerkmeister Feb 20, 2018
240b636
added missing sphinx req
twerkmeister Feb 20, 2018
e6baa7f
req fixed
twerkmeister Feb 20, 2018
5be2757
removed req again that was already there...
twerkmeister Feb 20, 2018
c8d0ee9
fixed default config, kwargs handling
twerkmeister Feb 25, 2018
35a7acd
added mode for tokenized processing
twerkmeister May 14, 2018
541b6c9
Merge branch 'master' of github.com:RasaHQ/rasa_nlu into phrase_matcher
twerkmeister May 14, 2018
34b7a4f
static method problem
twerkmeister May 14, 2018
51804ea
fixed default config
twerkmeister May 14, 2018
e4bed74
adjusting to new components class style
twerkmeister May 14, 2018
bc6c52e
pep8
twerkmeister May 14, 2018
d200ce6
persistence with default file name
twerkmeister May 15, 2018
7dbb1fc
added ner_phrase_matcher for test pipeline
twerkmeister May 15, 2018
2b3943e
Merge branch 'master' of github.com:RasaHQ/rasa_nlu into phrase_matcher
twerkmeister May 15, 2018
87d8bb8
giving training tests more time
twerkmeister May 20, 2018
69fc46d
removed travis_wait again
twerkmeister May 20, 2018
8741a6c
removed all components pipeline, removed or syntax for training data
twerkmeister May 23, 2018
e2e79a0
fixed line breaks
twerkmeister May 23, 2018
0d1dcf9
Merge branch 'master' of github.com:RasaHQ/rasa_nlu into phrase_matcher
twerkmeister May 23, 2018
46dd060
keeping entity phrases when splitting training sets in eval
twerkmeister Aug 3, 2018
2 changes: 1 addition & 1 deletion CHANGELOG.rst
@@ -11,8 +11,8 @@ This project adheres to `Semantic Versioning`_ starting with version 0.7.0.

Added
-----
- A new entity extraction component using predefined entity lists: ``ner_phrase_matcher``
- doc link to a community contribution for Rasa NLU in Chinese

Changed
-------

1 change: 1 addition & 0 deletions alt_requirements/requirements_full.txt
@@ -10,6 +10,7 @@
# MITIE Requirements
-r requirements_mitie.txt

pygtrie==2.2
duckling==1.8.0
Jpype1==0.6.2
jieba==0.39
6 changes: 6 additions & 0 deletions data/examples/rasa/demo-rasa.json
@@ -1,5 +1,11 @@
{
"rasa_nlu_data": {
"entity_phrases": [
{
"entity": "food",
"phrases": ["Mapo Tofu", "Tacos", "Chana Masala"]
}
],
"regex_features": [
{
"name": "zipcode",
5 changes: 5 additions & 0 deletions data/examples/rasa/demo-rasa.md
@@ -56,6 +56,11 @@
- vegg
- veggie

## entity_phrase:food
* Mapo Tofu
* Tacos
* Chana Masala

## regex:zipcode
- [0-9]{5}

6 changes: 6 additions & 0 deletions data/test/multiple_files_json/demo-rasa-affirm.json
@@ -1,5 +1,11 @@
{
"rasa_nlu_data": {
"entity_phrases": [
{
"entity": "food",
"phrases": ["Mapo Tofu"]
}
],
"common_examples": [
{
"text": "yes",
@@ -1,5 +1,11 @@
{
"rasa_nlu_data": {
"entity_phrases": [
{
"entity": "food",
"phrases": ["Chana Masala", "Tacos"]
}
],
"regex_features": [
{
"name": "greet",
3 changes: 3 additions & 0 deletions data/test/multiple_files_markdown/demo-rasa-affirm.md
@@ -10,3 +10,6 @@
- correct
- great choice
- sounds really good

## entity_phrase:food
- Mapo Tofu
@@ -25,3 +25,7 @@

## regex:zipcode
- [0-9]{5}

## entity_phrase:food
- Chana Masala
- Tacos
36 changes: 34 additions & 2 deletions docs/dataformat.rst
@@ -17,7 +17,8 @@ The most important one is ``common_examples``.
"rasa_nlu_data": {
"common_examples": [],
"regex_features" : [],
"entity_synonyms": []
"entity_synonyms": [],
"entity_phrases": [],
}
}

@@ -157,6 +158,33 @@ for these extractors. Currently, all intent classifiers make use of available re
recognize entities and related intents. Hence, you still need to provide intent & entity examples as part of your
training data!

Entity Phrases
--------------
Entity phrases are predefined lists of entities that the system should directly extract.
These lists are used by the ``ner_phrase_matcher`` component to search the text for the defined entities.

If you have a closed set of non-ambiguous entities, entity phrases can give you high-precision
and high-recall extraction. Keep in mind, however, that the ``ner_phrase_matcher`` does not take the
context into account. As a result, it might create false positives by extracting entities in the wrong
context, e.g. "white house" in both "a white house" and "the white house". Also, the phrase matcher
will not recognize any entities that are not explicitly defined in the entity phrases section.

.. code-block:: json

{
"rasa_nlu_data": {
"entity_phrases": [
{
"entity": "food",
"phrases": ["Mapo Tofu", "Tacos", "Chana Masala"]
}
]
}
}

In this example, ``food`` is the entity type and the phrases are what the
``ner_phrase_matcher`` component searches for in the message text.
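
For a quick check you can load such a file and inspect the phrases, as in the
minimal sketch below; it assumes that the ``entity_phrases`` attribute of the
resulting ``TrainingData`` object keeps the dict layout shown above.

.. code-block:: python

    from rasa_nlu.training_data import load_data

    # Minimal sketch: load the demo file and print its entity phrases.
    # Assumes each entry keeps the {"entity": ..., "phrases": [...]} layout.
    training_data = load_data("data/examples/rasa/demo-rasa.json")
    for entry in training_data.entity_phrases:
        print(entry["entity"], entry["phrases"])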

Markdown Format
---------------

@@ -177,10 +205,14 @@ list syntax, e.g. minus ``-``, asterisk ``*``, or plus ``+``:
## synonym:savings <!-- synonyms, method 2 -->
- pink pig


## regex:zipcode
- [0-9]{5}

## entity_phrase:food
- Mapo Tofu
- Tacos
- Chana Masala

Organization
------------

17 changes: 9 additions & 8 deletions docs/entities.rst
@@ -6,14 +6,15 @@ There are a number of different entity extraction components,
which can seem intimidating for new users.
Here we'll go through a few use cases and make recommendations of what to use.

================ ========== ======================== ===================================
Component Requires Model notes
================ ========== ======================== ===================================
``ner_mitie`` MITIE structured SVM good for training custom entities
``ner_crf`` crfsuite conditional random field good for training custom entities
``ner_spacy`` spaCy averaged perceptron provides pre-trained entities
``ner_duckling`` duckling context-free grammar provides pre-trained entities
================ ========== ======================== ===================================
====================== ========== ======================== ===================================
Component Requires Model notes
====================== ========== ======================== ===================================
``ner_mitie`` MITIE structured SVM good for training custom entities
``ner_crf`` crfsuite conditional random field good for training custom entities
``ner_spacy`` spaCy averaged perceptron provides pre-trained entities
``ner_duckling`` duckling context-free grammar provides pre-trained entities
``ner_phrase_matcher`` pygtrie direct match for predefined entity lists
====================== ========== ======================== ===================================

The exact required packages can be found in ``dev-requirements.txt``
and they should also be shown when they are missing
36 changes: 34 additions & 2 deletions docs/pipeline.rst
@@ -710,7 +710,7 @@ ner_duckling
}

:Description:
Duckling allows to recognize dates, numbers, distances and other structured entities
Duckling recognizes dates, numbers, distances and other structured entities
and normalizes them (for a reference of all available entities
see `the duckling documentation <https://duckling.wit.ai/#getting-started>`_).
Please be aware that duckling tries to extract as many entity types as possible without
@@ -722,7 +722,7 @@ ner_duckling
based system.

:Configuration:
Configure which dimensions, i.e. entity types, the :ref:`duckling component <section_pipeline_duckling>` to extract.
Configure which dimensions, i.e. entity types, duckling extracts.
A full list of available dimensions can be found in the `duckling documentation <https://duckling.wit.ai/>`_.

.. code-block:: yaml
@@ -734,6 +734,38 @@



ner_phrase_matcher
~~~~~~~~~~~~~~~~~~
:Short: entity extraction using predefined lists of entities
:Outputs: appends ``entities``
:Output-Example:

.. code-block:: json

{
"entities": [{"value":"New York City",
"start": 20,
"end": 33,
"entity": "city",
"extractor": "ner_phrase_matcher"}]
}

:Description:
The phrase matcher recognizes entities in the message text using the entity lists
provided in the ``entity_phrases`` section of the training data. By default,
the phrase matcher is case insensitive and only matches along token boundaries.

:Configuration:
.. code-block:: yaml

pipeline:
- name: "ner_phrase_matcher"
# whether or not to ignore case when matching phrases and text
ignore_case: True
# whether or not to match only on token boundaries. If False,
# phrases can be matched within a word.
use_tokens: True
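
The sketch below illustrates what these two options mean in practice. It is a
simplified stand-in (plain regular expressions and whitespace tokens), not the
component's actual pygtrie-based implementation, and the ``match_phrases``
helper is purely illustrative.

.. code-block:: python

    import re

    def match_phrases(text, entity_phrases, ignore_case=True, use_tokens=True):
        """Illustrate ``ignore_case`` and ``use_tokens`` with a naive scan.

        The real component uses a pygtrie-based lookup; this linear scan
        only mimics the matching behaviour described above.
        """
        haystack = text.lower() if ignore_case else text
        # naive whitespace tokenization; with use_tokens=True a match must
        # start and end exactly on these token boundaries
        starts = {m.start() for m in re.finditer(r"\S+", haystack)}
        ends = {m.end() for m in re.finditer(r"\S+", haystack)}

        entities = []
        for entity_type, phrases in entity_phrases.items():
            for phrase in phrases:
                needle = phrase.lower() if ignore_case else phrase
                for match in re.finditer(re.escape(needle), haystack):
                    if use_tokens and not (match.start() in starts
                                           and match.end() in ends):
                        continue
                    entities.append({"value": text[match.start():match.end()],
                                     "start": match.start(),
                                     "end": match.end(),
                                     "entity": entity_type})
        return entities

    # match_phrases("I love chana masala", {"food": ["Chana Masala"]})
    # -> [{"value": "chana masala", "start": 7, "end": 19, "entity": "food"}]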

Creating new Components
-----------------------
You can create a custom Component to perform a specific task which NLU doesn't currently offer (e.g. sentiment analysis).
9 changes: 6 additions & 3 deletions rasa_nlu/evaluate.py
@@ -155,7 +155,8 @@ def drop_intents_below_freq(td, cutoff=5):
for ex in td.intent_examples
if td.examples_per_intent[ex.get("intent")] >= cutoff]

return TrainingData(keep_examples, td.entity_synonyms, td.regex_features)
return TrainingData(keep_examples, td.entity_synonyms,
td.regex_features, td.entity_phrases)


def evaluate_intents(targets, predictions): # pragma: no cover
@@ -537,10 +538,12 @@ def generate_folds(n, td):
test = [x[i] for i in test_index]
yield (TrainingData(training_examples=train,
entity_synonyms=td.entity_synonyms,
regex_features=td.regex_features),
regex_features=td.regex_features,
entity_phrases=td.entity_phrases),
TrainingData(training_examples=test,
entity_synonyms=td.entity_synonyms,
regex_features=td.regex_features))
regex_features=td.regex_features,
entity_phrases=td.entity_phrases))


def combine_intent_result(results, interpreter, data):
5 changes: 5 additions & 0 deletions rasa_nlu/extractors/__init__.py
@@ -28,6 +28,11 @@ def add_processor_name(self, entity):

return entity

def append_entities(self, message, entities):
entities = self.add_extractor_name(entities)
message.set("entities", message.get("entities", []) + entities,
add_to_output=True)

@staticmethod
def find_entity(ent, text, tokens):
offsets = [token.offset for token in tokens]
5 changes: 2 additions & 3 deletions rasa_nlu/extractors/crf_entity_extractor.py
@@ -123,9 +123,8 @@ def _create_dataset(self, examples):
def process(self, message, **kwargs):
# type: (Message, **Any) -> None

extracted = self.add_extractor_name(self.extract_entities(message))
message.set("entities", message.get("entities", []) + extracted,
add_to_output=True)
extracted = self.extract_entities(message)
self.append_entities(message, extracted)

@staticmethod
def _convert_example(example):
5 changes: 1 addition & 4 deletions rasa_nlu/extractors/duckling_extractor.py
@@ -173,10 +173,7 @@ def process(self, message, **kwargs):

extracted = convert_duckling_format_to_rasa(relevant_matches)

extracted = self.add_extractor_name(extracted)

message.set("entities", message.get("entities", []) + extracted,
add_to_output=True)
self.append_entities(message, extracted)

@classmethod
def load(cls,
5 changes: 1 addition & 4 deletions rasa_nlu/extractors/duckling_http_extractor.py
@@ -114,10 +114,7 @@ def process(self, message, **kwargs):
"file nor is `RASA_DUCKLING_HTTP_URL` "
"set as an environment variable.")

extracted = self.add_extractor_name(extracted)
message.set("entities",
message.get("entities", []) + extracted,
add_to_output=True)
self.append_entities(message, extracted)

@classmethod
def load(cls,
4 changes: 1 addition & 3 deletions rasa_nlu/extractors/mitie_entity_extractor.py
@@ -134,9 +134,7 @@ def process(self, message, **kwargs):

ents = self.extract_entities(message.text, message.get("tokens"),
mitie_feature_extractor)
extracted = self.add_extractor_name(ents)
message.set("entities", message.get("entities", []) + extracted,
add_to_output=True)
self.append_entities(message, ents)

@classmethod
def load(cls,