# The approach

The focus is to train a spacy model to recognize an entity which has size, item and quantity in it. This way we can extract multiple entities if the user request many items at once. 

Once that is done, we will be using spacy patterns to extract the size, item and quanitity seperately from each entity. 

# 1 . Loading Dataset

In [52]:
with open("new_entity_format.txt") as file:
    # if not line.isspace():
      lines = [line.lstrip().rstrip() for line in file if not line.isspace()]

In [53]:
lines[:5]

["I'd like a small coffee and a medium latte, please.",
 '[(a small coffee), (a medium latte)]',
 'Can I get a large smoothie and two small coffees to go?',
 '[(a large smoothie), (two small coffees)]',
 "I'll have a medium latte and a small smoothie."]

# 2 . Preprocessing dataset
Note : Get the above list to a Spacy friendly training dataset format

In [54]:
entities = []
sentences = []

for i,line in enumerate(lines):
  if i%2 == 0:
    sentences.append(line)
  else:
    entities.append(line)

In [55]:
for i,entity in enumerate(entities):
  entities[i] = entities[i].strip("[]").split(",")
  entities[i] = [item.strip("() ").lower() for item in entities[i]]

In [56]:
def generate_entity_output(sentence, index, entity_name):
    entity_list = []
    for j,entity in enumerate(entities[index]):
      start_index = sentence.index(entity)
      end_index = start_index + len(entity) - 1
      entity_list.append((start_index, end_index, entity_name))

    output = {
        "entities": entity_list
    }
    return (sentence, output)

In [57]:
train_set = []

for i,sentence in enumerate(sentences):
  train_set.append(generate_entity_output(sentence,i,"Entity_Item"))

In [58]:
train_set[0]

("I'd like a small coffee and a medium latte, please.",
 {'entities': [(9, 22, 'Entity_Item'), (28, 41, 'Entity_Item')]})

# 3 . Training Spacy

In [74]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
nlp=spacy.load('en_core_web_sm')

db = DocBin() # create a DocBin object
for text, annot in tqdm(train_set): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)
db.to_disk("./train.spacy") # save the docbin object

100%|██████████| 30/30 [00:00<00:00, 2383.22it/s]


In [75]:
!python -m spacy init fill-config base_config.cfg config.cfg

[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [79]:
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy

[38;5;4mℹ Saving to output directory: output[0m
[38;5;4mℹ Using CPU[0m
[1m
[2023-05-31 02:25:41,804] [INFO] Set up nlp object from config
[2023-05-31 02:25:41,829] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-05-31 02:25:41,836] [INFO] Created vocabulary
[2023-05-31 02:25:42,804] [INFO] Added vectors: en_core_web_sm
[2023-05-31 02:25:42,806] [INFO] Finished initializing nlp object
[2023-05-31 02:25:43,324] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     43.83    0.00    0.00    0.00    0.00
 45     200          0.82    468.42  100.00  100.00  100.00    1.00
101     400          0.00      0.00  100.00  100.00  100.00    1.00
167     600          0.00      0.00  100.00  100.00  100.00    

# 4 . Load Trained model


In [84]:
nlp1 = spacy.load(r"./output/model-best") #load the best model
doc = nlp1("I need two large lattes and a medium coffee.") # input sample text
doc.ents

(two large, a medium)