<a href="https://colab.research.google.com/github/ssanner/lge-foodkg/blob/main/baselines/newTasBtest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
pip install transformers



In [5]:
from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch
import json


In [26]:
# "h.json" is a placeholder; it will be changed to suit the data path later
with open("h.json", 'r', encoding='utf-8') as f:
  dataset = json.load(f)
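
The evaluation loop later in this notebook reads three keys from each record: `query`, `options`, and `correct_answer`. A minimal sketch of a check for that layout follows. The helper name `validate_records`, the option keys `"a"` and `"b"`, and the sample record are hypothetical, and the layout is inferred from how the code accesses the data rather than from a published schema.

```python
def validate_records(records):
  """Return the indices of records missing any key the evaluation loop reads."""
  required = ("query", "options", "correct_answer")
  bad = []
  for i, record in enumerate(records):
    if not all(key in record for key in required):
      bad.append(i)
  return bad

# Hypothetical record, shaped the way the evaluation code appears to expect.
sample = [{
  "query": "I want to make a warm dish containing oysters",
  "options": {
    "a": "Simple creamy oyster soup",
    "b": "Seasoned salted crackers shaped like oysters",
  },
  "correct_answer": {"a": "Simple creamy oyster soup"},
}]

print(validate_records(sample))
```

Running a check like this before scoring makes a malformed record show up as an index in the returned list instead of a `KeyError` partway through the loop.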


1. Define a **NeuralEmbedder** class to abstract away the embedding process for the retriever

In [18]:
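class NeuralEmbedder():
  def __init__(self, model_name, tokenizer_name):
    # load the tokenizer and the model from their own checkpoint names
    self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    self.bert_model = AutoModel.from_pretrained(model_name)

  def embed(self, text):
    # tokenize, run the model, and keep the first-token ([CLS]) vector of the last hidden state
    inputs = self.tokenizer(text, return_tensors="pt")
    with torch.no_grad():
      outputs = self.bert_model(**inputs)
    return outputs[0][:, 0, :].squeeze(0).numpy()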
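
The indexing in `embed` is easier to follow on a toy array. The model's first output has shape (batch, sequence length, hidden size); `[:, 0, :]` keeps the vector at the first token position (the `[CLS]` token for BERT-style tokenizers), and `squeeze(0)` drops the batch axis. The sketch below uses NumPy with made-up sizes in place of a real model output, so the numbers are illustrative only.

```python
import numpy as np

# Stand-in for a last-hidden-state tensor: 1 sequence, 4 tokens, hidden size 6.
hidden = np.arange(24, dtype=np.float32).reshape(1, 4, 6)

# Keep the first token's vector for every sequence in the batch.
first_token = hidden[:, 0, :]          # shape (1, 6)

# Drop the batch axis to get a single embedding vector.
embedding = first_token.squeeze(0)     # shape (6,)

print(first_token.shape, embedding.shape)
```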

2. Define **the search engine** class. The documents are embedded once and their representations are stored in a NumPy matrix, so they do not have to be recomputed for every search.

In [19]:
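class NeuralSearchEngine():

  def __init__(self, embedder):
    self.embedder = embedder

  def index(self, documents):
    # embed every document once and stack the vectors into one matrix
    self.documents = documents
    encoded_docs = []
    for d in documents:
      with torch.no_grad():
        d_encoded = self.embedder.embed(d)
      # 768 is the hidden size of the DistilBERT encoder used here
      encoded_docs.append(d_encoded.reshape(-1,768))
    # stored as doc_matrix so the attribute does not overwrite the index() method
    self.doc_matrix = np.concatenate(encoded_docs,axis=0)
  
  def search(self, query):
    with torch.no_grad():
      q_encoded = self.embedder.embed(query).reshape(-1,768)
    # dot-product score between the query and every indexed document
    scores = q_encoded.dot(self.doc_matrix.T)[0]

    # document indices sorted from highest to lowest score
    args = np.argsort(scores)[::-1]

    print("\nThe query:", query,"\nTop three:")

    # print up to three results; the top-scoring document is the prediction
    for i in range(min(3, len(args))):
      print((i+1),'-','Score:',scores[args[i]],'doc:',self.documents[args[i]])

    predicted = self.documents[args[0]] if len(args) > 0 else ""
    return predicted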
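
The ranking step in `search` can be sketched without the model: score every document row against the query with a dot product, then sort the scores in descending order. The function name `top_k` and the two-dimensional vectors below are made up for illustration; real embeddings from this model have 768 dimensions.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
  """Return (indices, scores) of the k highest dot-product scores, best first."""
  scores = doc_matrix @ query_vec            # one score per document row
  order = np.argsort(scores)[::-1][:k]       # descending order, truncated to k
  return order, scores[order]

# Toy 2-dimensional embeddings, made up for illustration.
docs = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
query = np.array([2.0, 1.0])

indices, scores = top_k(query, docs, k=2)
print(indices, scores)
```

Because the document matrix is built once, each additional query costs only one matrix-vector product and one sort.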

    

**Main code**

In [36]:
def tasb_score(dataset):
  # number of correct predictions
  correct = 0
  # h@1 evaluation metric
  total_hit_at_1 = 0
  # number of queries
  count = 0

  # create an embedder object from the model and tokenizer names
  embedder = NeuralEmbedder("sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco","sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco")

  # loop through each query
  for query in dataset:
    count +=1
    print(count)

    # collect the candidate descriptions for this query
    docs = list(query["options"].values())

    # create a search engine object for this query 
    engine = NeuralSearchEngine(embedder)
    # index the options into the search engine
    engine.index(docs)

    # check if model predicted the correct answer
    if engine.search(query["query"]) == list(query["correct_answer"].values())[0]:
      print(True, ": The correct description has the highest score.","\n")
      correct += 1
      total_hit_at_1 += 1
    else:
      print(False, ": The correct description is:", list(query["correct_answer"].values())[0],"\n")
      
  print("Total correct =", correct)
  # guard against an empty dataset to avoid dividing by zero
  print("average h@1", total_hit_at_1/count if count else 0.0)

tasb_score(dataset)

1

The query: I want to make a warm dish containing oysters 
Top three:
1 - Score: 101.60317 doc: Simple creamy oyster soup
2 - Score: 97.31483 doc: Seasoned salted crackers shaped like oysters
3 - Score: 91.15459 doc: Warm vegetable soup containing tomatoes, peas, corn, carrots, and potatoes
True : The correct description has the highest score. 

2

The query: Can I have a recipe for fish that's roasted? 
Top three:
1 - Score: 94.64951 doc: Salmon roasted with olive oil, chives, and tarragon leaves
2 - Score: 92.52005 doc: Pecan halves roasted with butter and salt
3 - Score: 92.15469 doc: Roasted cauliflower with olive oil and seasonings
True : The correct description has the highest score. 

3

The query: What are recipes for fish, but not baked in the oven? 
Top three:
1 - Score: 105.76575 doc: Fish fillets baked with breaded mixture
2 - Score: 99.0171 doc: Breaded fish fillets baked with parmesan cheese
3 - Score: 96.07117 doc: Baked halibut fish fillets with Worcestershire sauce
F
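
The h@1 figure reported by `tasb_score` is the fraction of queries whose correct description is ranked first. A minimal sketch of the more general hits@k computation follows; the function name `hits_at_k` and the rankings are hypothetical and are not taken from the run above.

```python
def hits_at_k(ranked_lists, correct_answers, k=1):
  """Fraction of queries whose correct answer is among the top k ranked documents."""
  if not ranked_lists:
    return 0.0
  hits = sum(
    1 for ranked, answer in zip(ranked_lists, correct_answers)
    if answer in ranked[:k]
  )
  return hits / len(ranked_lists)

# Made-up rankings for three queries (best document first).
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
answers = ["a", "a", "a"]

print(hits_at_k(ranked, answers, k=1))
print(hits_at_k(ranked, answers, k=2))
```

With `k=1` this matches the quantity computed in `tasb_score`; larger values of `k` show how often the correct description is close to the top even when it is not ranked first.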

In [None]:
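# scratch check: read the first option of each query
# (named sample_queries so the built-in dict is not shadowed)
sample_queries = [
  {
    "options":{
        "1":"one",
        "2":"two",
        "3":"three"
    }
  },
  {
    "options":{
        "4":"four",
        "5":"five",
        "6":"six"
    }
  },
  {
    "options":{
        "7":"seven",
        "8":"eight",
        "9":"nine"
    }
  }
]

for query in sample_queries:
  print(list(query["options"].values())[0])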

one
four
seven
