# CAsT Session

### Sources


#### docT5query
* https://github.com/castorini/docTTTTTquery#Predicting-Queries-from-Passages-T5-Inference-with-PyTorch


#### reranking
* https://github.com/castorini/pygaggle/blob/master/docs/experiments-msmarco-passage-subset.md
* https://github.com/castorini/pygaggle

### Imports

In [5]:
# T5 query expansion
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


In [6]:

# MonoT5 reranking
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5


In [7]:
from elasticsearch import Elasticsearch
from typing import Dict, List, Optional
import json
import logging


In [8]:
INDEX_NAME = "cast_base"
es = Elasticsearch()

## T5 testing

In [64]:
tokenizer = AutoTokenizer.from_pretrained("castorini/t5-base-canard")
model = AutoModelForSeq2SeqLM.from_pretrained("castorini/t5-base-canard")

In [65]:
input_ids = tokenizer('Jafar is funny. <sep> Is he funny?', return_tensors='pt').input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Is Jafar funny?


## Framework

In [12]:
class CAsT():
    def __init__(self, index_name: str = "cast_base", context_queries: int = 0, context_responses: int = 0, reranking: bool = False) -> None:
        self.INDEX_NAME = index_name
        self.es = Elasticsearch()
        es_logger = logging.getLogger('elasticsearch')
        es_logger.setLevel(logging.WARNING)
        self.queries = []
        self.responses = []
        self.context_queries = context_queries
        self.context_responses = context_responses

        self.reranking = reranking
        self.reranker = MonoT5() if reranking else None
        self.tokenizer = AutoTokenizer.from_pretrained(
            "castorini/t5-base-canard")
        self.model = AutoModelForSeq2SeqLM.from_pretrained(
            "castorini/t5-base-canard")

    def clear_context(self, clear_queries: bool = True, clear_responses: bool = True):
        if clear_queries:
            self.queries = []
        if clear_responses:
            self.responses = []

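    def query(self, q: str) -> List[Dict]:
        """
            Returns the ranked hits for `q`, each with "passage", "_id" and "_score".
            NOTE: for now the complete hits are returned rather than only the passage ID.
        """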
        sep = " <sep>"
        qs = []
        if self.context_queries > 0 or self.context_responses > 0:
            for i in range(1, max(self.context_queries, self.context_responses)+1):
                if i <= self.context_queries:
                    if len(self.queries) >= i:
                        qs.insert(0, self.queries[-i])

                if i <= self.context_responses:
                    if len(self.responses) >= i:
                        qs.insert(0, self.responses[-i])
        qs.append(q)

        
        input_ids = self.tokenizer(sep.join(qs), return_tensors='pt').input_ids
        outputs = self.model.generate(input_ids)

        query = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        self.queries.append(query)  # Add the reformulated query to the context

        # Use the instance's client rather than the notebook-level `es`
        hits = self.es.search(
            index=self.INDEX_NAME, q=query, _source=True, size=100
        ).get("hits", {}).get("hits")

        hits_cleaned = [{
            "passage": hit.get("_source", {}).get("passage"),
            "_id": "MARCO_" + hit.get("_id") if hit.get("_source").get(
                    "origin") == "msmarco" else "CAR_" + hit.get("_id"),
            "_score": hit.get("_score", "FAILED")} for hit in hits]

        if self.reranking:
            print("RERANKING")
            texts = [Text(hit.get("passage"), {
                '_id': hit.get("_id", "FAILED")}, 0) for hit in hits_cleaned]

            reranked = self.reranker.rerank(Query(query), texts)
            hits_cleaned = [{
                "passage": hit.text,
                "_id": hit.metadata["_id"],
                "_score": hit.score}
                for hit in reranked]

        if len(hits) > 0:
            self.responses.append(
                hits_cleaned[0].get("passage"))
            return hits_cleaned[:1000]
        else:
            return []
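
The context assembly inside `query` can be isolated as a pure function, which makes the ordering easy to inspect without loading the model. A minimal sketch, assuming a hypothetical helper name `build_context` (not part of the class above) and a separator with a trailing space, as in the `Jafar` example, rather than the `" <sep>"` used in the class:

```python
from typing import List


def build_context(queries: List[str], responses: List[str],
                  n_queries: int, n_responses: int,
                  current: str, sep: str = " <sep> ") -> str:
    """Join the most recent queries/responses and the current utterance."""
    parts: List[str] = []
    # Walk backwards through the history, one step at a time.
    for i in range(1, max(n_queries, n_responses) + 1):
        if i <= n_queries and len(queries) >= i:
            parts.insert(0, queries[-i])
        if i <= n_responses and len(responses) >= i:
            parts.insert(0, responses[-i])
    parts.append(current)
    return sep.join(parts)


print(build_context(["q1", "q2"], ["r1", "r2"], 1, 1, "q3"))
# r2 <sep> q2 <sep> q3
```

With both windows enabled, each response ends up before the query that produced it, because it is inserted at position 0 after that query.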

### Framework tests

#### Query expansion

In [13]:
test = CAsT(context_queries=1)

In [14]:
test.query("Tell me about Oslo?")



[{'passage': "Tell a friend about us, add a link to this page, or visit the webmaster's page for free fun content. Link to this page: <a href=http://acronyms.thefreedictionary.com/South+African+Board+for+Personnel+Practice>SABPP</a>. Facebook.",
  '_id': 'MARCO_8841272',
  '_score': 9.102427},
 {'passage': 'Tell Me Something: The Songs of Mose Allison. Singles from How Long Has This Been Going On. How Long Has This Been Going On is the twenty-fourth studio album by Northern Irish singer-songwriter Van Morrison, with Georgie Fame and Friends, released in December 1995 (see 1995 in music) in the UK. It charted at #1 on Top Jazz Albums.',
  '_id': 'MARCO_8841042',
  '_score': 8.981482},
 {'passage': "I've searched and searched but can't find a thread on it, so forgive me if its been discussed. I used to hear about opera singers who can shatter glass with their voice, but can't seem to find much info about if that is real or not.",
  '_id': 'MARCO_8841113',
  '_score': 8.887823},
 {'passag

In [15]:
test.query("Where is it?")

[{'passage': 'Refraction of Sound. Refraction is the bending of waves when they enter a medium where their speed is different. Refraction is not so important a phenomenon with sound as it is with light where it is responsible for image formation by lenses, the eye, cameras, etc.But bending of sound waves does occur and is an interesting phenomena in sound.efraction of Sound. Refraction is the bending of waves when they enter a medium where their speed is different. Refraction is not so important a phenomenon with sound as it is with light where it is responsible for image formation by lenses, the eye, cameras, etc.',
  '_id': 'MARCO_8841012',
  '_score': 4.786257},
 {'passage': "Hawaii Five-O star Alex O'Loughlin is off the market. The 37-year-old actor wed his girlfriend Malia Jones recently, according to People. The couple tied the knot in Hawaii, where they live and where he films his series on which he plays Lt. Commander Steve McGarrett. Scroll down for video. Newlyweds: Alex O'Lo

#### Reranking

In [16]:
cast = CAsT(context_queries=3, reranking=True)

Some weights of the model checkpoint at castorini/monot5-base-msmarco were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [17]:
ret = cast.query("How do you know when your garage door opener is going bad?")

RERANKING




In [18]:
ret[0]

{'passage': 'Frame vs. Frameless Shower Doors & Tub Enclosures. Frameless shower doors fit right in to modern bathrooms. Shower doors and tub enclosures play an important role in your bathroom. Not only do these structures keep water contained within the bath and shower where it belongs, but also they impact the overall style and decor of your space.',
 '_id': 'MARCO_8841100',
 '_score': -14.476116180419922}
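
The negative `_score` is consistent with MonoT5 returning a log-probability of relevance; this is an assumption about the reranker's output, not something the notebook confirms. Under that assumption, a score maps to a probability as in this sketch (`score_to_probability` is a hypothetical helper name):

```python
import math


def score_to_probability(log_score: float) -> float:
    # Assumes the score is a natural-log probability (see the caveat above).
    return math.exp(log_score)


p = score_to_probability(-14.476116180419922)
print(f"{p:.2e}")
```

A value this small would suggest that the reranker considers even the top passage a poor match for the garage-door query.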

## Run Queries

In [None]:
path = "../eval/2020_automatic_evaluation_topics_v1.0.json"
key = "raw_utterance"

In [20]:
def run_queries(query_file: str, key: str, CAsT: object, run_id: str):
    # Close the topic file as soon as it has been parsed
    with open(query_file) as query_fh:
        queries = json.load(query_fh)
    if queries[0].get("turn", {})[0].get(key) is None:
        raise KeyError("Provided key: " + key +
                       " is not a valid key for the query file")
    total_num = len(queries)
    f = open(run_id + ".trec", "w")

    for i, topic in enumerate(queries):
        print("Topic: {}/{}".format(i+1, total_num))
        CAsT.clear_context()
        topic_id = topic.get("number")
        for turn in topic.get("turn"):
            turn_id = turn.get("number")
            hits = CAsT.query(turn.get(key))
            for j, hit in enumerate(hits):
                f.write(str(topic_id) + "_" + str(turn_id) + "\t" + "Q0" + "\t" + str(hit.get("_id")) +
                        "\t" + str(j) + "\t" + str(hit.get("_score")) + "\t" + str(run_id) + "\n")
    f.close()
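
Each line written by `run_queries` has six tab-separated fields: the query ID (topic and turn numbers joined by `_`), the literal `Q0`, the passage ID, the rank, the score, and the run ID. A small sketch of that formatting, with `trec_line` as a hypothetical helper name and made-up example values:

```python
def trec_line(topic_id, turn_id, doc_id, rank, score, run_id) -> str:
    """Format one result as a tab-separated run-file line (no trailing newline)."""
    fields = [f"{topic_id}_{turn_id}", "Q0", str(doc_id),
              str(rank), str(score), str(run_id)]
    return "\t".join(fields)


# Example values are illustrative only.
print(trec_line(81, 1, "MARCO_8841100", 0, 12.5, "Test02"))
```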

In [21]:
# Base
## Reranking FALSE
qc0_qr0_rrF_base = CAsT(context_queries=0, context_responses=0, reranking=False, index_name="cast_base")
qc3_qr0_rrF_base = CAsT(context_queries=3, context_responses=0, reranking=False, index_name="cast_base")
qc0_qr3_rrF_base = CAsT(context_queries=0, context_responses=3, reranking=False, index_name="cast_base")
## Reranking TRUE
qc0_qr0_rrT_base = CAsT(context_queries=0, context_responses=0, reranking=True, index_name="cast_base")
qc3_qr0_rrT_base = CAsT(context_queries=3, context_responses=0, reranking=True, index_name="cast_base")
qc0_qr3_rrT_base = CAsT(context_queries=0, context_responses=3, reranking=True, index_name="cast_base")


# d2q
qc0_qr0_rrF_d2q = CAsT(context_queries=0, reranking=False, index_name="cast_d2q")
qc0_qr0_rrT_d2q = CAsT(context_queries=0, reranking=True, index_name="cast_d2q")
qc3_qr0_rrF_d2q = CAsT(context_queries=3, reranking=False, index_name="cast_d2q")
qc3_qr0_rrT_d2q = CAsT(context_queries=3, reranking=True, index_name="cast_d2q")
qc0_qr3_rrF_d2q = CAsT(context_responses=5, reranking=False, index_name="cast_d2q")
qc0_qr3_rrT_d2q = CAsT(context_responses=5, reranking=True, index_name="cast_d2q")

In [22]:
run_queries(path, key=key, CAsT=test_obj, run_id="Test02")

Topic: 1/25
Topic: 2/25
Topic: 3/25
Topic: 4/25
Topic: 5/25
Topic: 6/25
Topic: 7/25
Topic: 8/25
Topic: 9/25


## Future testing 

In [196]:
q = "How do you know when your garage door opener is going bad?"
query = {
  "query": {
    "match": { "passage": q }
  },
  "highlight": {
    "fields": {
      "passage": {}
    }
  }
}
es.search(index=INDEX_NAME, body=query)



{'took': 1544,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 360, 'relation': 'eq'},
  'max_score': 12.617959,
  'hits': [{'_index': 'cast_base',
    '_type': '_doc',
    '_id': '8841100',
    '_score': 12.617959,
    '_source': {'passage': 'Frame vs. Frameless Shower Doors & Tub Enclosures. Frameless shower doors fit right in to modern bathrooms. Shower doors and tub enclosures play an important role in your bathroom. Not only do these structures keep water contained within the bath and shower where it belongs, but also they impact the overall style and decor of your space.'},
    'highlight': {'passage': ['Frameless Shower <em>Doors</em> & Tub Enclosures. Frameless shower <em>doors</em> fit right in to modern bathrooms.',
      'Shower <em>doors</em> and tub enclosures play an important role in <em>your</em> bathroom.',
      'Not only <em>do</em> these structures keep water contained within the bath and shower
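
The request body used in this cell can be produced by a small helper so that the field and the text are not hard-coded. A sketch, where `match_with_highlight` is a hypothetical name and the structure simply mirrors the dictionary in the cell above:

```python
from typing import Any, Dict


def match_with_highlight(field: str, text: str) -> Dict[str, Any]:
    """Build a match query on `field` that also requests highlights for it."""
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


body = match_with_highlight(
    "passage", "How do you know when your garage door opener is going bad?")
print(body["query"])
```

The resulting `body` can be passed to `es.search(index=INDEX_NAME, body=body)` exactly as in the cell above.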

## Reranking