<a href="https://colab.research.google.com/github/RajanMehta/nlp-pocs/blob/master/entity_extract_and_map.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Problem:**

For different types of named-entities that share a semantic position, how can we identify their type?

Example:
- how much did i spend in `walmart`?
- how much did i spend in `new york`?
- how much did i spend in `food and drink`?

As the example shows, the same sentence structure is equally likely to contain a `merchant` (walmart), a `location` (new york), or a `transaction_category` (food and drink). Our goal is to tell these types apart and map each extracted named entity to the correct one.


**Why it is difficult to use pre-trained NER models:**

- They are case sensitive. Usually they don't work if the query is not properly cased. 
    - It's highly likely that `how much did i spend in new york?` will not extract `new york` as a `location` entity.
    - But if we case it correctly: `How much did i spend in New York?`, it might work just fine. Sadly, this isn't always the case with real-world language data.

- They are especially unreliable for the problem described above, where similar sentence structures can carry different entity types.

- No model will have seen the full vocabulary for a given entity type, so it may miss some person or organization entities.
    - This is why we use a generic model that extracts the entities first, followed by separate logic that assigns a type to each entity, either by validating it against real-world data (Wikidata) or by using contextual embeddings (transformers).

#### **Step 1**: Train a completely new entity type.

There are many ways to create a custom named-entity recognition model. Because Google Colab ships with spaCy, it was quick to train a very basic model as a proof of concept. [reference](https://spacy.io/usage/training#ner)

If you run the cell below, spaCy will train a model that recognizes a single new entity label, `ENTITY`. The code uses the spaCy v2 training API.
Note that the training data contains similar sentences with different types of entities.
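The character offsets in the training data are easy to get wrong when written by hand. A minimal sketch of a helper that computes them is shown here; `annotate` is a hypothetical name that is not part of spaCy, and it assumes the entity text occurs once in the sentence.

```python
def annotate(text, entity, label="ENTITY"):
    """Return a spaCy-style training tuple with character offsets for `entity`.

    Hypothetical helper: it assumes `entity` occurs in `text` and uses the
    first occurrence.
    """
    start = text.find(entity)
    if start == -1:
        raise ValueError(f"{entity!r} not found in {text!r}")
    end = start + len(entity)  # the end offset is exclusive
    return (text, {"entities": [(start, end, label)]})


example = annotate("show new york transactions", "new york")
print(example)
```

The result has the same shape as the hand-written entries in `TRAIN_DATA`, so a list of (sentence, entity) pairs could be converted in one pass.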

In [None]:
import datetime as dt
import random
import warnings

import spacy

warnings.filterwarnings('ignore')

TRAIN_DATA = [
   ("show new york transactions", {"entities": [(5, 13, "ENTITY")]}),
   ("show new jersey transactions", {"entities": [(5, 15, "ENTITY")]}),
   ("show walmart transactions", {"entities": [(5, 12, "ENTITY")]}),
   ("show grocery transactions", {"entities": [(5, 12, "ENTITY")]}),
   ("show entertainment transactions", {"entities": [(5, 18, "ENTITY")]}),
   ("show dallas transactions",{'entities': [(5, 11, 'ENTITY')]}),
   ("show costco transactions",{'entities': [(5, 11, 'ENTITY')]}),
   ("show pizza hut transactions",{'entities': [(5, 14, 'ENTITY')]}),
   ("show ann arbor transactions",{'entities': [(5, 14, 'ENTITY')]}) 
]

def create_blank_nlp(train_data):
    """Build a blank English pipeline whose NER component knows every label in train_data."""
    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
    # Register each entity label found in the annotations
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    return nlp

nlp = create_blank_nlp(TRAIN_DATA)
optimizer = nlp.begin_training()
for i in range(10):
    # Shuffle on every pass so the model does not depend on example order
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(f"Losses at iteration {i} - {dt.datetime.now()}", losses)

Losses at iteration 0 - 2020-08-29 03:11:37.468793 {'ner': 17.837710716063157}
Losses at iteration 1 - 2020-08-29 03:11:37.683789 {'ner': 7.221678547572083}
Losses at iteration 2 - 2020-08-29 03:11:37.900806 {'ner': 3.752802208555257}
Losses at iteration 3 - 2020-08-29 03:11:38.113894 {'ner': 10.005212334148023}
Losses at iteration 4 - 2020-08-29 03:11:38.331543 {'ner': 6.78672894530958}
Losses at iteration 5 - 2020-08-29 03:11:38.545253 {'ner': 0.3249405558149986}
Losses at iteration 6 - 2020-08-29 03:11:38.762234 {'ner': 0.0006758154624136789}
Losses at iteration 7 - 2020-08-29 03:11:38.977697 {'ner': 1.2032174909438047e-05}
Losses at iteration 8 - 2020-08-29 03:11:39.192019 {'ner': 9.265434348959295e-07}
Losses at iteration 9 - 2020-08-29 03:11:39.408050 {'ner': 1.3939076297464356e-07}


In [None]:
doc = nlp("show walmart transactions")
for ent in doc.ents:
  print(ent, ent.label_)

walmart ENTITY


#### **Step 2**: Install [txtai](https://github.com/neuml/txtai), an AI-powered search engine built on Transformers. 

- This will help us map the extracted entity to its type.
- I strongly recommend reading this [reference](https://towardsdatascience.com/introducing-txtai-an-ai-powered-search-engine-built-on-transformers-37674be252ec) first.

In [None]:
%%capture
!pip install git+https://github.com/neuml/txtai

In [None]:
%%capture

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

In [None]:
sections = [
  "place region location state city country geography",
  "business organization company enterprise inc institution bank office corporation service"
]

# Create an index for the list of sections
embeddings.index([(uid, text, None) for uid, text in enumerate(sections)])

In [None]:
import numpy as np 

embeddings.similarity("dallas", sections)

array([0.4549731, 0.3990316], dtype=float32)
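The scores above line up with the entries of `sections`, so the highest score points to the best-matching section. Below is a rough sketch of that decision rule in plain Python. `pick_section` and the `labels` list are hypothetical names introduced for illustration, and the 0.3 cutoff is the one used in Step 4, not a tuned value.

```python
def pick_section(scores, labels, threshold=0.3):
    """Return the label with the highest score, or None if no score exceeds the threshold."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] <= threshold:
        return None  # no section is a confident match
    return labels[best]


# Hypothetical label names, one per entry in `sections`
labels = ["location", "merchant"]
print(pick_section([0.4549731, 0.3990316], labels))
```

For the `dallas` scores this selects the first section. The gap between the two scores is small, which is one reason Step 3 adds extra context to the query.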

#### **Step 3**: Leveraging Wikidata

This step is optional if the embeddings already meet your expectations.

- Although embeddings are super powerful, I noticed that they sometimes map an entity incorrectly. This also depends on how well you define the `sections` above. For example, `disney` is a `merchant`, but the transformer thinks the "place-ness" of `disney` is higher than its "merchant-ness", so it assigns it the type `place`.
- The way I fix this is, instead of searching for `disney`, I search for `what I mean by disney`. By disney, I mean `['film production company', 'media conglomerate', 'commercial organization']`. So I search for these keywords and BAM! The transformer now identifies disney as a merchant.
- I get that list from Wikidata. Passing the extracted entity to Wikidata returns the classes that the entity is an instance of. It is easier to understand once you see it in action...

In [None]:
import requests

def get_item(item):
    """Return the top Wikidata search hit for `item`, or None if there is no match."""
    url = 'https://www.wikidata.org/w/api.php'
    r = requests.get(url, params={'format': 'json', 'search': item, 'action': 'wbsearchentities', 'type': 'item', 'language': 'en'})
    data = r.json()
    return data['search'][0] if data['search'] else None

def is_instance_of(item):
    """Return the labels of the classes `item` is an instance of (property P31), or None."""
    url = 'https://query.wikidata.org/sparql'
    item_obj = get_item(item)
    if not item_obj:
        return None

    query = """
        SELECT ?ans ?ansLabel
        WHERE
        {
          wd:%s wdt:P31 ?ans.
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
        }
        """ % item_obj["id"]

    r = requests.get(url, params={'format': 'json', 'query': query})
    bindings = r.json()['results']['bindings']
    if not bindings:
        return None
    return [obj['ansLabel']['value'] for obj in bindings]

In [None]:
is_instance_of("disney")

['film production company', 'media conglomerate', 'commercial organization']

In [None]:
is_instance_of("bofa")

['credit institution', 'business', 'enterprise']

In [None]:
is_instance_of("new york")

['city',
 'global city',
 'city of the United States',
 'big city',
 'city with millions of inhabitants',
 'port settlement',
 'largest city']

- [Wikidata Tutorial Reference](https://towardsdatascience.com/a-brief-introduction-to-wikidata-bb4e66395eb1)
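
The label extraction inside `is_instance_of` can be checked without a network call. The sketch below runs the same logic over a hand-written dictionary in the SPARQL JSON results layout. `extract_labels` is a hypothetical helper, and the sample values, including the placeholder URIs, are illustrative and not real API output.

```python
def extract_labels(response, var="ans"):
    """Collect the `<var>Label` values from a SPARQL JSON results document."""
    bindings = response.get("results", {}).get("bindings", [])
    key = f"{var}Label"
    return [b[key]["value"] for b in bindings if key in b]


# Hand-written sample in the SPARQL JSON results layout (not real API output).
sample = {
    "head": {"vars": ["ans", "ansLabel"]},
    "results": {
        "bindings": [
            {
                "ans": {"type": "uri", "value": "http://example.org/entity/1"},
                "ansLabel": {"type": "literal", "value": "city"},
            },
            {
                "ans": {"type": "uri", "value": "http://example.org/entity/2"},
                "ansLabel": {"type": "literal", "value": "business"},
            },
        ]
    },
}

print(extract_labels(sample))
print(extract_labels({}))  # an empty or malformed response yields an empty list
```

Unlike `is_instance_of`, this variant returns an empty list instead of `None` when nothing is found; both are falsy, so a check like `if entity_type:` behaves the same way.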

#### **Step 4**: Putting it all together

In [None]:
import numpy as np 

queries = [
  "show las vegas transactions",
  "show chicago transactions",
  "show pizza hut transactions",
  "show bofa transactions",
  "show grocery transactions",
  "show cinema transactions",
  "show disney transactions"
]

for query in queries:
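    doc = nlp(query)  # ignore the sentence if it doesn't make sense :)
    for ent in doc.ents:
      # only handle spans tagged with our custom label
      if ent.label_ != "ENTITY":
        continue
      # Wikidata "instance of" labels for the extracted entity
      entity_type = is_instance_of(ent.text)
      if entity_type:
        # Score the joined labels against every section; the best index is picked below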
        scores = embeddings.similarity(" ".join(entity_type), sections)
        if np.any(scores > 0.3):
          uid = np.argmax(scores)
        else:
          #transformer isn't sure
          uid = "something-else"
      else:
        #wikidata couldn't identify
        uid = "something-else"
      print("%-20s %s" % (ent.text, "location" if uid==0 else ("merchant" if uid==1 else "not sure")))

las vegas            location
chicago              location
pizza hut            merchant
bofa                 merchant
grocery              not sure
cinema               not sure
disney               merchant


#### **Future Work**:
- Use a local Wikidata dump instead of API calls to reduce latency. Interesting read: [Kensho](https://blog.kensho.com/announcing-the-kensho-derived-wikimedia-dataset-5d1197d72bcf)
- Prepare a test suite. Try different transformer models, or fastText + BM25 as described in the txtai reference above, then compare them and select the best one.
- Add more sections for even more granular categorization (groceries, restaurants, sports, hobbies, clothes, bills, etc.)
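
As a starting point for the test suite mentioned above, here is a minimal sketch of an evaluation loop. `evaluate` and `stub_mapper` are hypothetical names; the stub stands in for the real pipeline (spaCy, Wikidata, and txtai) so the example runs on its own, and the labeled examples are illustrative.

```python
def evaluate(mapper, labeled_examples):
    """Return (accuracy, misses) for a callable that maps an entity string to a type."""
    misses = []
    for entity, expected in labeled_examples:
        predicted = mapper(entity)
        if predicted != expected:
            misses.append((entity, expected, predicted))
    total = len(labeled_examples)
    accuracy = (total - len(misses)) / total if total else 0.0
    return accuracy, misses


def stub_mapper(entity):
    """Stand-in for the real entity-to-type pipeline (assumption for illustration)."""
    lookup = {"chicago": "location", "pizza hut": "merchant"}
    return lookup.get(entity, "not sure")


labeled = [
    ("chicago", "location"),
    ("pizza hut", "merchant"),
    ("disney", "merchant"),
]
accuracy, misses = evaluate(stub_mapper, labeled)
print(f"accuracy={accuracy:.2f}")
print(misses)
```

Swapping `stub_mapper` for functions built on different embedding models would allow a like-for-like comparison on the same labeled set.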