<a href="https://colab.research.google.com/github/ELehmann91/FS1/blob/master/Copy_of_Minimal_chat_with_SpaCy_and_WordNet_handout.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building basic chatbots with rules, syntax and semantic nets

Companies increasingly want to automate internal or customer-facing tasks via a chat interface. Though there are mature frameworks (like [RASA](https://rasa.com/)) and services (like [Microsoft Bot Framework](https://dev.botframework.com/) or [Chatfuel](https://chatfuel.com/)), we will attempt to set up a basic analysis pipeline based on SpaCy and WordNet that can give us some coverage in a basic banking scenario.

We will use SpaCy for our basic analysis (including syntax), as well as a simple add-on that connects it to WordNet, unsurprisingly called [SpaCy-WordNet](https://spacy.io/universe/project/spacy-wordnet).

Let's take the following texts as a problem:

In [0]:
test_texts = [
    "I would like to deposit 5000 euros.",
    "I would like to put in 5000 euros.",
    "I would like to pay in 5000 euros.",
    "I would like to pay up 5000EUR.",
    "Can I pay in 5000 euros, please?",
    
    
    "I would like to deposit money.",
    

    "I am about to take out 5000 euros.",
    "I am about to get out 5000 euros.",
    "I am about to withdraw 5000 euros.",
    "I want to withdraw 5000 USD.",
    "Can I withdraw $5000.",

    
    "Can I check my account, please?",
    "May I see my balance, please?",
    "Could I query my account, please?",
    "I would like to see my account balance."
]

## Let's try some syntactic analysis!

The first goal is to see whether we can filter out the "main message" of the sentences above, that is, what people actually want to say, based on some common POS / dependency structure.

The expected output based on our analysis would be something like:

```
[deposit, 5000, euros]
[put, in, 5000, euros]
[pay, in, 5000, euros]
[pay, up, 5000EUR]
[pay, in, 5000, euros]
[deposit, money]
[take, out, 5000, euros]
----- No success in parsing. Original: I am about to get out 5000 euros.
[withdraw, 5000, euros]
[withdraw, 5000, USD]
[withdraw, $, 5000]
[check, my, account]
[see, my, balance]
[query, my, account]
[see, my, account, balance]
```

### Preliminaries: install SpaCy and initialize model

In [0]:
%%capture
!pip install spacy
!python -m spacy download en_core_web_sm

In [0]:
import spacy

nlp =  spacy.load("en_core_web_sm")

In [0]:
# We create one document out of the array of sentences for convenience.
long_text = " ".join(test_texts)

doc = nlp(long_text ) 

### Try some syntactic matching on the texts!

Let us use the syntactic analysis of SpaCy to get to the "core" of the sentences!

Let us assume that we are interested in **verbs** and their **minimal subtrees**!

Please

1. look for the verbs in the sentences, 
2. get their subtrees,
3. delete every token from the "left" of the verb
4. from the "right" subtree, filter interjections and punctuations,
5. keep the shortest such subtree from the sentence and print it out!
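
The five steps can be tried out on hand-tagged tokens before involving a parser. The following is a minimal sketch only: the `Token` tuple, the tags and the `core_message` helper are illustrative assumptions, and the "subtree" is approximated by the tokens to the right of the verb, which is not the same as a real dependency subtree.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed token: surface text plus coarse POS tag.
Token = namedtuple("Token", ["text", "pos"])

def core_message(tokens):
    """Return the shortest verb-initial span without punctuation/interjections."""
    candidates = []
    for i, token in enumerate(tokens):
        if token.pos == "VERB":
            # Drop everything left of the verb, then filter the right-hand side.
            span = [t.text for t in tokens[i:] if t.pos not in ("PUNCT", "INTJ")]
            candidates.append(span)
    # A later verb never yields a longer span; None signals a failed parse.
    return min(candidates, key=len) if candidates else None

# Hand-assigned tags, for illustration only; a real tagger may decide differently.
sentence = [
    Token("Can", "AUX"), Token("I", "PRON"), Token("pay", "VERB"),
    Token("in", "ADP"), Token("5000", "NUM"), Token("euros", "NOUN"),
    Token(",", "PUNCT"), Token("please", "INTJ"), Token("?", "PUNCT"),
]
print(core_message(sentence))
```

With a real `Doc`, the tags come from the model, so the result depends on how the tagger labels auxiliaries and particles.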

For the visualization of the sentence tree use [DisplaCy](https://spacy.io/usage/visualizers).

In [5]:
from spacy import displacy

doc = nlp("Can I pay in 5000 euros, please?")
displacy.render(doc, style="dep",jupyter=True)
#displacy.render(doc, style="ent",jupyter=True)

In [6]:
doc = nlp("I am about to withdraw 5000 euros , please")
displacy.render(doc, style="dep",jupyter=True)

In [7]:
doc[7].pos_

'PUNCT'

In [8]:
for sentence in doc.sents:
    # Start empty, so that a sentence without any verb is reported as a failure
    # instead of reusing the result of the previous sentence.
    sentence_subtree = []
    for i, token in enumerate(sentence):
        if token.pos_ == 'VERB':
            # Keep everything from the verb onwards, minus punctuation and interjections.
            # Each later verb overwrites the earlier result, so the last verb wins.
            sentence_subtree = [tok for tok in sentence[i:] if tok.pos_ not in ('PUNCT', 'INTJ')]

    if sentence_subtree:
        print(sentence_subtree)
    else:
        print("----- No success in parsing. Original:", sentence)


[withdraw, 5000, euros]


[deposit, 5000, euros]
[put, in, 5000, euros]
[pay, in, 5000, euros]
[pay, up, 5000EUR]
[pay, in, 5000, euros]
[deposit, money]
[take, out, 5000, euros]
----- No success in parsing. Original: I am about to get out 5000 euros.
[withdraw, 5000, euros]
[withdraw, 5000, USD]
[withdraw, $, 5000]
[check, my, account]
[see, my, balance]
[query, my, account]
[see, my, account, balance]

As we can see, even in this simple case some noise remains: our method fails on sentence 8. Please observe, and let's discuss why!

In [9]:
from spacy import displacy

doc=nlp(test_texts[7])

displacy.render(doc, style="dep")


[displaCy output: SVG dependency parse of "I am about to get out 5000 euros."; the markup begins with I (PRON), am (VERB), about]

It is worth noting that some add-on libraries, like [Textacy](https://spacy.io/universe/project/textacy), have built-in functions that can come in handy for these tasks.

For example:

`textacy.spacier.utils.get_main_verbs_of_sent(sent)`
Return the main (non-auxiliary) verbs in a sentence.

`textacy.spacier.utils.get_subjects_of_verb(verb)`
Return all subjects of a verb according to the dependency parse.

`textacy.spacier.utils.get_objects_of_verb(verb)`
Return all objects of a verb according to the dependency parse, including open clausal complements.

`textacy.spacier.utils.get_span_for_compound_noun(noun)`
Return document indexes spanning all (adjacent) tokens in a compound noun.

`textacy.spacier.utils.get_span_for_verb_auxiliaries(verb)`
Return document indexes spanning all (adjacent) tokens around a verb that are auxiliary verbs or negations.

Nonetheless, if we want to carry out definite actions for these sentences, we have to try another route.

## Second try: detecting "intents" and "entities" with the help of WordNet

In processing chat utterances, the two common tasks are to:

1. Detect the overall intent of the given utterance
2. Extract some key parameters needed for action.

The first is called **"intent detection"**, the second **"entity extraction"**.

More on this can be found in the Theory section on chatbots, discussed later.
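
To make the two tasks concrete, here is a deliberately naive sketch that uses only the standard library. The keyword table and the `analyse` helper are assumptions for illustration and are not part of the pipeline built in this notebook.

```python
import re

# Hypothetical keyword table: one trigger word per intent, for illustration only.
INTENT_KEYWORDS = {
    "deposit": "PAYIN_INTENT",
    "withdraw": "TAKEOUT_INTENT",
    "balance": "BALANCE_INTENT",
}

def analyse(utterance):
    """Return (intent, amount): intent detection plus entity extraction."""
    words = re.findall(r"[a-z]+", utterance.lower())
    # Intent detection: what does the user want to do?
    intent = next((INTENT_KEYWORDS[w] for w in words if w in INTENT_KEYWORDS), None)
    # Entity extraction: which parameter does the action need?
    match = re.search(r"\d+", utterance)
    amount = match.group() if match else None
    return intent, amount

print(analyse("I would like to deposit 5000 euros."))
print(analyse("May I see my balance, please?"))
print(analyse("I would like to put in 5000 euros."))
```

The last call finds the amount but no intent, because "put in" is not in the keyword table. This is exactly the coverage problem that the rest of the notebook tries to reduce.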

The standard practice for the first step is to build a sentence classifier, while the second is usually done with a token-level classifier or matching. Here we will instead use the rule-based matching mechanism of SpaCy, albeit with a twist.

One of the main problems, as we saw before, is the **variety of utterances**: people tend to formulate the same intent in myriad ways. We will try to mitigate this by **increasing coverage with WordNet synonyms**.

For this we need a connection between our analysis pipeline and WordNet. Luckily, we have it as an extension.


### Install extension and register it to the pipeline

In [10]:
!pip install spacy-wordnet
!python -m nltk.downloader wordnet

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

# Register the WordNet annotator right after the tagger.
# This is the spaCy 2.x style of `add_pipe`, which takes the component object itself.
nlp.add_pipe(WordnetAnnotator(nlp.lang), after='tagger')

### Setting up a custom detector for intents

As said, we will hijack the entity detection capability of SpaCy to classify intents.

For this, we need to define custom rules with `EntityRuler`, and some patterns that match our intents.

We have three intents in mind:

`INTENTS = ["TAKEOUT_INTENT","PAYIN_INTENT","BALANCE_INTENT"]`

First define **one pattern for each**, register them, run the pipeline, and look at the result.

Afterwards, you will have to **come back to this cell and iteratively refine the patterns** based on the results of the WordNet enrichment below.

First make it run through, then refine!
Seven patterns in total are enough to detect the three intents in all the forms seen here with the help of WordNet synsets.

#### Set up EntityRuler

In [0]:
# Construction from class
from spacy.pipeline import EntityRuler

INTENTS = ["TAKEOUT_INTENT","PAYIN_INTENT","BALANCE_INTENT"]

# Construct the ruler once, with default settings (existing entities are not overwritten).
ruler = EntityRuler(nlp)

patterns = [{"label": "PAYIN_INTENT", "pattern": "deposit"},
            {"label": "TAKEOUT_INTENT", "pattern": "withdraw"},
            {"label": "TAKEOUT_INTENT", "pattern":  "out"},
            {"label": "PAYIN_INTENT", "pattern":  "in"},
            {"label": "PAYIN_INTENT", "pattern":  "up"},
            {"label": "BALANCE_INTENT", "pattern": "check"},
            {"label": "BALANCE_INTENT", "pattern":  "my"}
            ]

# Add the patterns to the ruler
ruler.add_patterns(patterns)

# Add the ruler to the pipeline
nlp.add_pipe(ruler)
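
The single-word phrase patterns above are very permissive: `in` or `my` will also fire in many unrelated sentences. `EntityRuler` also accepts token patterns, that is, lists of per-token attribute dictionaries. The following is a sketch of narrower patterns in that format. The concrete choice of verbs and particles is an assumption, it presumes a spaCy version that supports the `IN` operator in token patterns, and these are not the patterns that the results in this notebook were produced with.

```python
# Sketch only: token patterns in the EntityRuler format (one dict per token).
# The selected lemmas and particles are assumptions, not tuned against the tests.
token_patterns = [
    {"label": "PAYIN_INTENT",
     "pattern": [{"LEMMA": {"IN": ["pay", "put"]}}, {"LOWER": {"IN": ["in", "up"]}}]},
    {"label": "TAKEOUT_INTENT",
     "pattern": [{"LEMMA": {"IN": ["take", "get"]}}, {"LOWER": "out"}]},
    {"label": "BALANCE_INTENT",
     "pattern": [{"LOWER": "my"}, {"LOWER": {"IN": ["account", "balance"]}}]},
]

# Every pattern names one of the three intents and spans two tokens.
for entry in token_patterns:
    print(entry["label"], len(entry["pattern"]))
```

Such a list can be passed to `ruler.add_patterns` in the same way as the phrase patterns. The trade-off is higher precision at the cost of more handwritten rules.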

#### Define a `detect_intent` function

The function takes an analysed sentence (`Doc`) and a list of intents (e.g. `INTENTS`) as input, and returns the matched intents as a list of `(text, label)` pairs, which is empty if nothing was found.

In [0]:
def detect_intent(analysed_sentence, intents):
    # In this case, we do not do proper intent detection,
    # which would be a whole-sentence classification task based on its semantics,
    # but we do an intelligent entity matching based on our rules,
    # where we treat intents as special entities.
    found = [(ent.text, ent.label_) for ent in analysed_sentence.ents if ent.label_ in intents]
    return found

In [14]:
doc = nlp("I am about to withdraw 5000 euros , please")
detect_intent(doc,INTENTS)

[('withdraw', 'TAKEOUT_INTENT')]

### Setting up a function for detecting "real" entities

In SpaCy's world, monetary units and numbers are considered to be entities by default, thus the built in Named Entity Recognizer (`ner` in the pipeline) detects and tags those.

In our case we are only interested in the monetary entities. **Please bear in mind that MULTIPLE entity categories can denote money: sometimes plain numbers, sometimes formal monetary amounts, etc. Use multiple numeric categories for detection!**

More on this [here](https://spacy.io/usage/linguistic-features/#named-entities) and [here](https://spacy.io/api/annotation#named-entities).

In [0]:
MONEY = ["MONEY", "CARDINAL", "QUANTITY"]

def detect_money(analysed_sentence, money):
    # Look only at entities whose label is in `money` and keep their
    # number-like tokens, so that "5000 euros" yields just "5000".
    only_numbers = [tok.text
                    for ent in analysed_sentence.ents if ent.label_ in money
                    for tok in ent if tok.like_num]

    # Return the first amount found, or None if the sentence contains none.
    return only_numbers[0] if only_numbers else None

In [31]:
doc = nlp("I am about to withdraw 5000 euros , please")
detect_money(doc, MONEY)

'5000'
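
The amount does not always appear as a bare number: the test sentences contain forms such as `$5000` or `5000EUR`, and users may also type `5,000`. As a complement to the NER-based lookup, a regular-expression fallback could normalise such strings. This is a sketch under the assumption that amounts are whole numbers with optional comma thousands separators; `extract_amount` is a hypothetical helper, not part of SpaCy.

```python
import re

# Digits with optional comma thousands separators, e.g. 5000 or 5,000.
AMOUNT_RE = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")

def extract_amount(text):
    """Return the first whole-number amount in text as a digit string, or None."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    # Strip the separators so that "5,000" and "5000" compare equal.
    return match.group().replace(",", "")

for sample in ["5000 euros", "$5000", "5000EUR", "5,000 euros", "money"]:
    print(repr(sample), "->", extract_amount(sample))
```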

### Enriching intent detection with WordNet

As we saw, manually setting up patterns that match all test cases is unsustainable for any corpus much bigger than this one, so we need some semantic help.

Let's define a super crude `enrich_sentence` function that generates sentence variants from the input. It takes an analysed sentence (`Doc`), a set of domains (in our case e.g. `ECONOMY_DOMAINS`), and **for each token in the sentence searches for synonyms inside our domains, replaces the token with each synonym, and appends the new sentence to a list.**

**Finally we expect to get back a set of sentence variants as texts in a list.**
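
The combinatorics of this step are easy to see with a handwritten synonym table in place of WordNet. The sketch below is an illustration only: the `SYNONYMS` dictionary is an assumption standing in for the in-domain synsets, and `sentence_variants` is a hypothetical helper.

```python
from itertools import product

# Hypothetical synonym table standing in for the in-domain WordNet synsets.
SYNONYMS = {
    "want": ["want", "need", "require"],
    "withdraw": ["withdraw", "draw", "take_out", "draw_off"],
}

def sentence_variants(tokens, synonyms):
    """Return every sentence obtained by swapping tokens for their synonyms."""
    # Tokens without an entry keep themselves as their only option.
    options = [synonyms.get(tok, [tok]) for tok in tokens]
    # Multi-word lemmas use "_" as separator; turn them back into plain text.
    return [" ".join(choice).replace("_", " ") for choice in product(*options)]

variants = sentence_variants("I want to withdraw 5000 euros".split(), SYNONYMS)
print(len(variants))
for text in variants[:4]:
    print(text)
```

The number of variants is the product of the option counts per token (here 3 × 4 = 12), so it grows quickly with longer sentences or richer synsets.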

In [0]:
sentence = nlp('I want to withdraw 5,000 euros')
# Print every token together with its index in the sentence.
for j, word in enumerate(sentence):
    print(word, j)

In [0]:
ECONOMY_DOMAINS = ['finance', 'banking']

def enrich_sentence(analysed_sentence, domains):
    # Start with the original sentence as the only variant.
    variants = [[tok.text for tok in analysed_sentence]]

    for i, tok in enumerate(analysed_sentence):
        synsets = tok._.wordnet.wordnet_synsets_for_domain(domains)
        if not synsets:
            continue
        # Only the first in-domain synset is used, to keep the number of variants small.
        lemmas = list(synsets[0].lemma_names())
        if tok.text not in lemmas:
            # Make sure the original wording survives next to its synonyms.
            lemmas.insert(0, tok.text)
        # Combine every synonym of this token with every variant built so far.
        variants = [v[:i] + [lemma] + v[i + 1:] for lemma in lemmas for v in variants]

    # Please bear in mind that WordNet lemmas can consist of multiple words,
    # thus containing a "_" (e.g. take_out) which we don't need.
    return variants
    


In [19]:
sentence = nlp('I want to withdraw 5,000 euros')
enrich_sentence(sentence,ECONOMY_DOMAINS)


[['I', 'want', 'to', 'withdraw', '5,000', 'euros'],
 ['I', 'need', 'to', 'withdraw', '5,000', 'euros'],
 ['I', 'require', 'to', 'withdraw', '5,000', 'euros'],
 ['I', 'want', 'to', 'draw', '5,000', 'euros'],
 ['I', 'need', 'to', 'draw', '5,000', 'euros'],
 ['I', 'require', 'to', 'draw', '5,000', 'euros'],
 ['I', 'want', 'to', 'take_out', '5,000', 'euros'],
 ['I', 'need', 'to', 'take_out', '5,000', 'euros'],
 ['I', 'require', 'to', 'take_out', '5,000', 'euros'],
 ['I', 'want', 'to', 'draw_off', '5,000', 'euros'],
 ['I', 'need', 'to', 'draw_off', '5,000', 'euros'],
 ['I', 'require', 'to', 'draw_off', '5,000', 'euros']]

#### Full search for intents

Based on the `detect_intent` and `enrich_sentence` functions we set up the full logic that searches for intents.

The function has to accept an analysed sentence (`Doc`), the list of intents and the list of domains as above, and then **try to find the intent in the default sentence. If not found, try to enrich the sentence, then search in the enriched ones. Return an intent if found.**
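
The control flow described here is independent of SpaCy and can be sketched with plain functions. In the sketch below, `detect` and `enrich` are hypothetical stand-ins for `detect_intent` and `enrich_sentence`, operating on strings instead of `Doc` objects.

```python
def first_intent(text, detect, enrich):
    """Try the original text first, then its enriched variants."""
    intent = detect(text)
    if intent is not None:
        return intent
    for variant in enrich(text):
        intent = detect(variant)
        if intent is not None:
            return intent
    # Nothing matched: report this explicitly instead of raising an error.
    return None

# Toy stand-ins, for illustration only.
def detect(text):
    return "TAKEOUT_INTENT" if "withdraw" in text else None

def enrich(text):
    return [text.replace("draw out", "withdraw")]

print(first_intent("I want to withdraw 5000 euros", detect, enrich))
print(first_intent("I want to draw out 5000 euros", detect, enrich))
print(first_intent("Hello there", detect, enrich))
```

Returning `None` for the no-match case lets the caller decide how to respond to an utterance it cannot handle.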

In [0]:
def search_for_intents(analysed_sentence, intents, domains):
    # First try the sentence as it is.
    found_intent = detect_intent(analysed_sentence, intents)
    if not found_intent:
        # Fall back to the WordNet-enriched variants and stop at the first hit.
        for sen in enrich_sentence(analysed_sentence, domains):
            found_intent = detect_intent(nlp(' '.join(sen)), intents)
            if found_intent:
                break

    # Return the label of the first match, or None when nothing matched.
    return found_intent[0][1] if found_intent else None

#### Let's try this out!

In [21]:
sentence = nlp("I would like to take_out 5000 euros.")

found_intent = search_for_intents(sentence,INTENTS,ECONOMY_DOMAINS)

print(found_intent)

TAKEOUT_INTENT


## Finally: parse the full query

Refine the original patterns and all the functions until the tests pass at the end of the notebook. Use **the least amount of handmade patterns possible!**

In [0]:
def parse_query(query, intents, domains, money):
    
    analysed_sentence = nlp(query)
    found_intent = search_for_intents(analysed_sentence,intents, domains)
    print(found_intent)
    if found_intent == intents[0] or found_intent == intents[1]:
        amount = detect_money(analysed_sentence,money)
        if amount:
            print("Executing",found_intent,"with",amount)
            return (found_intent, amount)
        else:
            print("No amount was given, please add one!")
            return (found_intent, None)
    elif found_intent == intents[-1]:
        print("Getting you your account balance, one moment...")
        return (found_intent, None)
    else:
        print("Can't parse what you are asking for, sorry!")
        return (None, None)

In [33]:
parse_query("I would like to withdraw 5000.",INTENTS, ECONOMY_DOMAINS, MONEY)

TAKEOUT_INTENT
Executing TAKEOUT_INTENT with 5000


('TAKEOUT_INTENT', '5000')

In [0]:
tests = [
    ("I would like to deposit 5000 euros.",("PAYIN_INTENT","5000")),
    ("I would like to put in 5000 euros.",("PAYIN_INTENT","5000")),
    ("I would like to pay in 5000 euros.",("PAYIN_INTENT","5000")),
    ("I would like to pay up 5000 EUR.",("PAYIN_INTENT","5000")),
    ("Can I pay in 5000 euros, please?",("PAYIN_INTENT","5000")),
    
    
    ("I would like to deposit money.",("PAYIN_INTENT",None)),
    

    ("I am about to take out 5000 euros.",("TAKEOUT_INTENT","5000")),
    ("I am about to get out 5000 euros.",("TAKEOUT_INTENT","5000")),
    ("I am about to withdraw 5000 euros.",("TAKEOUT_INTENT","5000")),
    ("I want to withdraw 5000 USD.",("TAKEOUT_INTENT","5000")),
    ("Can I withdraw $5000.",("TAKEOUT_INTENT","5000")),

    
    ("Can I check my account, please?",("BALANCE_INTENT",None)),
    ("May I see my balance, please?",("BALANCE_INTENT",None)),
    ("Could I query my account, please?",("BALANCE_INTENT",None)),
    ("I would like to see my account balance.",("BALANCE_INTENT",None)),

]

In [38]:
for test in tests:
    try:
        assert parse_query(test[0],INTENTS, ECONOMY_DOMAINS, MONEY) == test[1]
    except AssertionError:
        print(test[0])
        print("---ERROR: ",parse_query(test[0],INTENTS, ECONOMY_DOMAINS, MONEY),test[1])
        raise

PAYIN_INTENT
Executing PAYIN_INTENT with 5000
PAYIN_INTENT
Executing PAYIN_INTENT with 5000
PAYIN_INTENT
Executing PAYIN_INTENT with 5000
PAYIN_INTENT
Executing PAYIN_INTENT with 5000
PAYIN_INTENT
Executing PAYIN_INTENT with 5000
PAYIN_INTENT
No amount was given, please add one!
TAKEOUT_INTENT
Executing TAKEOUT_INTENT with 5000
TAKEOUT_INTENT
Executing TAKEOUT_INTENT with 5000
TAKEOUT_INTENT
Executing TAKEOUT_INTENT with 5000
TAKEOUT_INTENT
Executing TAKEOUT_INTENT with 5000
TAKEOUT_INTENT
Executing TAKEOUT_INTENT with 5000
BALANCE_INTENT
Getting you your account balance, one moment...
BALANCE_INTENT
Getting you your account balance, one moment...
BALANCE_INTENT
Getting you your account balance, one moment...
BALANCE_INTENT
Getting you your account balance, one moment...
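
The loop above stops at the first failing case. While refining patterns it can be more convenient to see all failures at once. A minimal sketch, assuming a parser with the same return shape as `parse_query`; `fake_parse` is a hypothetical stub so that the example runs on its own:

```python
def run_cases(parse, cases):
    """Run every (query, expected) pair and return the list of mismatches."""
    failures = []
    for query, expected in cases:
        actual = parse(query)
        if actual != expected:
            failures.append((query, expected, actual))
    return failures

# Hypothetical stub that only knows the word "deposit".
def fake_parse(query):
    return ("PAYIN_INTENT", "5000") if "deposit" in query else (None, None)

cases = [
    ("I would like to deposit 5000 euros.", ("PAYIN_INTENT", "5000")),
    ("I want to withdraw 5000 USD.", ("TAKEOUT_INTENT", "5000")),
]
for query, expected, actual in run_cases(fake_parse, cases):
    print("FAILED:", query, "expected", expected, "got", actual)
```

With the functions of this notebook, the call would look like `run_cases(lambda q: parse_query(q, INTENTS, ECONOMY_DOMAINS, MONEY), tests)`.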


A more elaborate and very nice example of the power of rule-based matching and its combination with machine learning models can be found [here](https://github.com/pmbaumgartner/binder-notebooks/blob/master/rule-based-matching-with-spacy-matcher.ipynb).