# CAsT Session

### Sources

#### reranking
* https://github.com/castorini/pygaggle/blob/master/docs/experiments-msmarco-passage-subset.md
* https://github.com/castorini/pygaggle

In [1]:
# T5 query expansion
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


In [2]:

# MonoT5 reranking
#import pygaggle.rerank.transformer.MonoT5
from pygaggle.rerank.base import Query, Text


In [12]:

from pygaggle.rerank.transformer import MonoT5



In [4]:
from elasticsearch import Elasticsearch
from typing import Dict, List, Optional
import json


In [5]:
INDEX_NAME = "cast_base"
es = Elasticsearch()

## T5 testing

In [6]:
tokenizer = AutoTokenizer.from_pretrained("castorini/t5-base-canard")
model = AutoModelForSeq2SeqLM.from_pretrained("castorini/t5-base-canard")

In [7]:
input_ids = tokenizer('Jafar is funny. <sep> Is he funny?', return_tensors='pt').input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Is Jafar funny?


## Framework

In [50]:
class CAsT():
    def __init__(self, context_queries: int = 0, context_responses: int = 0, reranking: bool = False) -> None:
        self.INDEX_NAME = "cast_base"
        self.es = Elasticsearch()
        self.queries = []
        self.responses = []
        self.context_queries = context_queries
        self.context_responses = context_responses

        self.reranking = reranking
        self.reranker = MonoT5() if reranking else None
        self.tokenizer = AutoTokenizer.from_pretrained(
            "castorini/t5-base-canard")
        self.model = AutoModelForSeq2SeqLM.from_pretrained(
            "castorini/t5-base-canard")

    def clear_context(self, clear_queries: bool = True, clear_responses: bool = True):
        if clear_queries:
            self.queries = []
        if clear_responses:
            self.responses = []

    def query(self, q: str) -> List[Dict]:
        """
            returns: ranked list of hits, each with "passage", "_id" and "_score"
            NOTE: for now the complete hit is returned instead of only the passage_id
        """
        sep = " <sep> "  # spaces on both sides, matching the T5 test input above
        qs = []
        if self.context_queries > 0 or self.context_responses > 0:
            for i in range(1, max(self.context_queries, self.context_responses)+1):
                if i <= self.context_queries:
                    if len(self.queries) >= i:
                        qs.insert(0, self.queries[-i])

                if i <= self.context_responses:
                    if len(self.responses) >= i:
                        qs.insert(0, self.responses[-i])
        qs.append(q)

        # Use the instance's own tokenizer, model and client, not the notebook globals
        input_ids = self.tokenizer(sep.join(qs), return_tensors='pt').input_ids
        outputs = self.model.generate(input_ids)

        query = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        self.queries.append(query)  # * Adding reformulated query to context

        hits = self.es.search(
            index=self.INDEX_NAME, q=query, _source=True, size=100
        ).get("hits", {}).get("hits")

        hits_cleaned = [{
            "passage": hit.get("_source", {}).get("passage"),
            "_id": "MARCO_" + hit.get("_id") if hit.get("_source").get(
                    "origin") == "msmarco" else "CAR_" + hit.get("_id"),
            "_score": hit.get("_score", "FAILED")} for hit in hits]

        if self.reranking:
            print("RERANKING")
            texts = [Text(hit.get("passage"), {
                '_id': hit.get("_id", "FAILED")}, 0) for hit in hits_cleaned]

            reranked = self.reranker.rerank(Query(query), texts)
            hits_cleaned = [{
                "passage": hit.text,
                "_id": hit.metadata["_id"],
                "_score": hit.score}
                for hit in reranked]

        if len(hits) > 0:
            self.responses.append(
                hits_cleaned[0].get("passage"))
            return hits_cleaned[:500]
        else:
            return []
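
The context-assembly loop at the top of `query` can be checked without Elasticsearch or the T5 model. The sketch below copies that loop into a standalone function. `build_rewriter_input` is a hypothetical name that does not exist in the notebook, and the `" <sep> "` separator (with a space on each side) is an assumption taken from the T5 test cell above.

```python
from typing import List


def build_rewriter_input(
    queries: List[str],
    responses: List[str],
    current: str,
    context_queries: int = 0,
    context_responses: int = 0,
    sep: str = " <sep> ",  # assumption: same separator as in the T5 test cell
) -> str:
    """Join recent queries/responses and the current utterance into one string."""
    parts: List[str] = []
    # Walk backwards through the history, one turn per step
    for i in range(1, max(context_queries, context_responses) + 1):
        if i <= context_queries and len(queries) >= i:
            parts.insert(0, queries[-i])
        if i <= context_responses and len(responses) >= i:
            parts.insert(0, responses[-i])
    parts.append(current)
    return sep.join(parts)


history_q = ["Tell me about Oslo?"]
history_r = ["Oslo is the capital of Norway."]
print(build_rewriter_input(history_q, history_r, "Where is it?", context_queries=1))
```

With both `context_queries` and `context_responses` set, each step inserts the response in front of the query from the same turn, so the response precedes its query in the joined string.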

### Framework tests

#### Query expansion

In [14]:
test = CAsT(context_queries=1)

In [10]:
test.query("Tell me about Oslo?")

TypeError: query() got an unexpected keyword argument 'context_queries'

In [None]:
test.query("Where is it?")

Entered 'i <= context_queries:'
Entered 'len(self.queries) >= i:'
Query: Where is Oslo located?


{'_index': 'cast_base',
 '_type': '_doc',
 '_id': '8841012',
 '_score': 4.786257,
 '_source': {'passage': 'Refraction of Sound. Refraction is the bending of waves when they enter a medium where their speed is different. Refraction is not so important a phenomenon with sound as it is with light where it is responsible for image formation by lenses, the eye, cameras, etc.But bending of sound waves does occur and is an interesting phenomena in sound.efraction of Sound. Refraction is the bending of waves when they enter a medium where their speed is different. Refraction is not so important a phenomenon with sound as it is with light where it is responsible for image formation by lenses, the eye, cameras, etc.'}}

#### Reranking

In [51]:
cast = CAsT(context_queries=3, reranking=True)

Some weights of the model checkpoint at castorini/monot5-base-msmarco were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [52]:
ret = cast.query("How do you know when your garage door opener is going bad?")

2021-11-18 12:28:08 [INFO] base: POST http://localhost:9200/cast_base/_search?_source=true&q=How+do+you+know+when+your+garage+door+opener+is+going+bad%3F&size=100 [status:200 request:0.220s]


RERANKING


In [59]:
ret

[{'passage': 'Frame vs. Frameless Shower Doors & Tub Enclosures. Frameless shower doors fit right in to modern bathrooms. Shower doors and tub enclosures play an important role in your bathroom. Not only do these structures keep water contained within the bath and shower where it belongs, but also they impact the overall style and decor of your space.',
  '_id': '8841100',
  '_score': -14.476116180419922},
 {'passage': 'The greatest advantage to using frameless shower doors lies in the variety of design options available. Frameless shower and tub enclosures may come in any size or style, allowing for optimal customization. The doors open in or out based on layout and design needs. With no frames to get in the way, frameless doors provide a more open, airy look, making it easy to show off beautiful tiles and other finishes. The lack of frames also makes for easier cleaning. Thanks to the heavier glass and smooth polished edges, frameless enclosures come at a premium price point compared

## Run Queries

In [186]:
path = "../eval/2020_automatic_evaluation_topics_v1.0.json"
key = "raw_utterance"
with open(path) as topics_file:
    queries = json.load(topics_file)

In [190]:
def run_queries(query_file: str, key: str, CAsT: object, run_id: str):
    """Run every turn of every topic through `CAsT` and write the hits to <run_id>.txt."""
    with open(query_file) as topics_file:
        queries = json.load(topics_file)
    # Fail early if the first turn of the first topic lacks the requested key
    if queries[0].get("turn", [{}])[0].get(key) is None:
        raise KeyError("Provided key: " + key +
                       " is not a valid key for queryfile")
    total_num = len(queries)

    with open(run_id + ".txt", "w") as f:
        for i, topic in enumerate(queries):
            print("Topic: {}/{}".format(i+1, total_num))
            CAsT.clear_context()  # each topic is a separate conversation
            topic_id = topic.get("number")
            for turn in topic.get("turn"):
                turn_id = turn.get("number")
                hits = CAsT.query(turn.get(key))
                for j, hit in enumerate(hits):
                    # Columns: query id, Q0, passage id, rank, score, run id
                    f.write(str(topic_id) + "_" + str(turn_id) + "\t" + "Q0" + "\t" + str(hit.get("_id")) +
                            "\t" + str(j) + "\t" + str(hit.get("_score")) + "\t" + str(run_id) + "\n")
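
Each line written by `run_queries` has six tab-separated columns: the query id (`<topic>_<turn>`), the literal `Q0`, the passage id, the rank, the score and the run id. The sketch below isolates that formatting so it can be checked on its own. `format_run_line` is a hypothetical helper, not part of the notebook, and the ids and score in the example call are made-up illustration values.

```python
def format_run_line(topic_id, turn_id, passage_id, rank, score, run_id) -> str:
    """Build one tab-separated run-file line in the same layout as run_queries."""
    query_id = "{}_{}".format(topic_id, turn_id)
    columns = [query_id, "Q0", str(passage_id), str(rank), str(score), str(run_id)]
    return "\t".join(columns) + "\n"


# Illustration values only; real ids and scores come from CAsT.query
line = format_run_line(81, 1, "MARCO_8841100", 0, 12.5, "Test02")
print(repr(line))
```

Keeping the rank as the zero-based `enumerate` index matches the loop above.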


In [188]:
test_obj = CAsT(context_queries=3, reranking=True)

In [189]:
run_queries(path, key=key, CAsT=test_obj, run_id="Test02")

Topic: 0/25
Topic: 1/25
Topic: 2/25
Topic: 3/25
Topic: 4/25
Topic: 5/25
Topic: 6/25
Topic: 7/25
Topic: 8/25
Topic: 9/25
Topic: 10/25
Topic: 11/25
Topic: 12/25
Topic: 13/25
Topic: 14/25
Topic: 15/25
Topic: 16/25
Topic: 17/25
Topic: 18/25
Topic: 19/25
Topic: 20/25
Topic: 21/25
Topic: 22/25
Topic: 23/25
Topic: 24/25


In [196]:
q = "How do you know when your garage door opener is going bad?"
query = {
  "query": {
    "match": { "passage": q }
  },
  "highlight": {
    "fields": {
      "passage": {}
    }
  }
}
es.search(index=INDEX_NAME, body=query)



{'took': 1544,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 360, 'relation': 'eq'},
  'max_score': 12.617959,
  'hits': [{'_index': 'cast_base',
    '_type': '_doc',
    '_id': '8841100',
    '_score': 12.617959,
    '_source': {'passage': 'Frame vs. Frameless Shower Doors & Tub Enclosures. Frameless shower doors fit right in to modern bathrooms. Shower doors and tub enclosures play an important role in your bathroom. Not only do these structures keep water contained within the bath and shower where it belongs, but also they impact the overall style and decor of your space.'},
    'highlight': {'passage': ['Frameless Shower <em>Doors</em> & Tub Enclosures. Frameless shower <em>doors</em> fit right in to modern bathrooms.',
      'Shower <em>doors</em> and tub enclosures play an important role in <em>your</em> bathroom.',
      'Not only <em>do</em> these structures keep water contained within the bath and shower

## Reranking