# EPIC-QA Project
## Overview:
### Pipeline for EPIC-QA System
 * Step 1: Passage Expansion
 * Step 2: Indexing & Retrieval
 * Step 3: Passage Re-Ranking
 * Step 4: Reading Comprehension
 
### Phases for EPIC-QA Project
 * Phase 1: (EASY) No labels, entirely based on pre-trained systems on other datasets
 * Phase 2: (MEDIUM) Passage-level labels from coarse document-level labels provided, maybe utilize Active Learning
 * Phase 3: (HARD) Sentence-level labels from passage-level labels, utilize Active Learning
 
### Issues / Unsolved Problems
 * Expert-level vs consumer-level 
 
### Citations
 * List of citations / prior work read

# Pipeline for EPIC-QA System
## Step 1: Passage Expansion
 * Perform passage expansion by predicting queries given a passage and appending those queries to the passage.
 * (Document Expansion by Query Prediction: https://arxiv.org/abs/1904.08375)
 * doc2query/docTTTTTquery: https://github.com/castorini/docTTTTTquery
 * TODO
     1. Phase 1: Utilize docTTTTTquery trained model on MSMARCO to expand EPIC-QA
     2. Phase 2: Fine-tune docTTTTTquery on queries from EPIC-QA to expand EPIC-QA

## Step 2: Indexing & Retrieval
 * Index expanded passages from Step 1 using classical index (Lucene, etc.)
 * (BM25: find citations)
 * Retrieval of top-k passages given query using classical ranking retrieval (BM25, etc.)
 * TODO
     1. Phase 1: Index expanded passages, configure BM25 for efficient top-k retrieval
     2. Phase 2: TODO

## Step 3: Passage Re-Ranking
 * Re-Rank passages from Step 2 using neural passage re-ranker 
 * (Passage Re-ranking with BERT: https://arxiv.org/abs/1901.04085)
 * MSMARCO or TREC-CAR trained re-ranking models: https://github.com/nyu-dl/dl4marco-bert
 * TODO
    1. Phase 1: Utilize MSMARCO or TREC-CAR trained re-ranking models on re-ranking EPIC-QA
    2. Phase 2: Fine-tune re-ranking model on queries from EPIC-QA for re-ranking EPIC-QA

## Step 4: Reading Comprehension
 * Re-Rank sentences within passages and select best contiguous subset. 
 * Utilize MSMARCO or TREC-CAR trained re-ranking models on re-ranking all sentence n-grams and selecting maximum.
 * TODO
    1. Phase 1: Utilize MSMARCO or TREC-CAR trained re-ranking models on re-ranking EPIC-QA sentence n-grams
    2. Phase 2: Fine-tune re-ranking model on queries from EPIC-QA for re-ranking EPIC-QA sentence n-grams
 
## Phases for EPIC-QA Project
### Phase 1: No Labels
 * No labels necessary. Use as baseline with pre-trained systems from other IR tasks and datasets. 
 
### Phase 2: Passage-level Labels
 * Multiple options for refining document-level query labels provided by 4th round of TREC-COVID
     * Manual refinement of document-level query labels
     * Automatic refinement of document-level query labels using passage re-ranking model
     * Fusion: Provide re-ranked passages as provided rank to labeler, have feedback loop akin to Active Learning

### Phase 3: Sentence-level Labels
 * Refine passage-level labels using sentence n-gram re-ranking. Similar options to Phase 2, may just become Phase 2.
 * Active Learning
 
 
# Issues / Unsolved Problems
 * How to differentiate between expert-level and consumer-level systems. 
     * Do we need to differentiate due to different underlying document collection and different queries?
     
# Citations:
 * JULIE Lab & Med Uni Graz @ TREC 2019 Precision Medicine Track
 * IDST at TREC 2019 Deep Learning Track: Deep Cascade Ranking with Generation-based Document Expansion and Pre-trained Language Modeling
 * Passage Re-Ranking with BERT
 * Document Expansion by Query Prediction
 * Multi-Stage Document Ranking with BERT
 * D-NET: A Simple Framework for Improving the Generalization of Machine Reading Comprehension
 * Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
 * docTTTTTquery
 * Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

In [3]:
!ls ~/data/corpora/epic_qa/expert/

epic_qa_cord_2020-06-19_v2    qrels-covid_d4_j3.5-4.txt
expert_questions_prelim.json


In [2]:
!cat ~/data/corpora/epic_qa/expert/expert_questions_prelim.json


[
    {
        "question_id": "EQ001",
        "question": "what is the origin of COVID-19",
        "query": "coronavirus origin",
        "background": "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"
    },
    {
        "question_id": "EQ002",
        "question": "how does the coronavirus respond to changes in the weather",
        "query": "coronavirus response to weather changes",
        "background": "seeking range of information about the SARS-CoV-2 virus viability in different weather/climate conditions as well as information related to transmission of the virus in different climate conditions"
    },
    {
        "question_id": "EQ003",
        "question": "will SARS-CoV2 infected people develop immunity? Is cross protection possible?",
        "query": "coronavirus immunity",
        "background": "seeking studies of immunity developed due to infection with SARS-

In [8]:
!cat ~/data/corpora/epic_qa/expert/epic_qa_cord_2020-06-19_v2/75rz1r0k.json

{
    "document_id": "75rz1r0k",
    "metadata": {
        "title": "La pandemia por el nuevo coronavirus covid-19./ [The novel coronavirus covid-19 pandemic]",
        "authors": "Eduardo, Cuestas",
        "urls": [],
        "full_text_path": "document_parses/pdf_json/b4de6df22c6eec09f7fb5b72b66df975beeecd77.json"
    },
    "contexts": [
        {
            "section": "INTRODUCTION",
            "context_id": "75rz1r0k-C000",
            "text": "Business Intelligence (BI) tools built on top of database systems provide excellent data analytics capabilities. E.g., trends of revenues of different companies over the past six months, or different departmental store products' sales comparison over the past three months etc. However, user needs a fair bit of knowledge of the underlying database schema and acquaintance with the BI dashboard. This can be particularly challenging for users and organizations that are not tech savvy or do not have time and resources to do so. A 

In [11]:
!ls ~/data/corpora/epic_qa/expert/epic_qa_cord_2020-06-19_v2/ | wc -l

157819


In [13]:
!cat ~/data/corpora/epic_qa/expert/qrels-covid_d4_j3.5-4.txt

1 4  00fmeepz 1
1 4  021q9884 1
1 3.5  047xpt2c 0
1 4  0chuwvg6 2
1 4  0iq9s94n 1
1 4  0m5mc320 0
1 3.5  0v5wo0ty 1
1 4  105q161g 2
1 4  10ecm4wi 1
1 4  1c47w4q5 2
1 3.5  1ldynibm 0
1 4  1pnc889f 0
1 4  1s25r3o0 0
1 4  1xf2sxtv 1
1 4  21fhsooy 0
1 4  22fc1qly 2
1 4  2hb28brw 0
1 3.5  2s7ki6g7 1
1 4  2sjsxf96 0
1 4  2t3evpxf 0
1 4  2y452utz 1
1 4  3fiz0tqy 0
1 4  3jireyep 0
1 4  3k1ks3wg 0
1 4  3okdfxzq 0
1 3.5  3otc2ac1 0
1 4  3pqtmhob 0
1 3.5  3uvuo4sf 2
1 3.5  3we40x62 0
1 3.5  3y4ulpkh 1
1 4  3zmq7nd5 0
1 3.5  43gik8e3 2
1 4  4almssg6 0
1 4  4hvv4sep 0
1 4  4sfgha4z 1
1 4  4ywt0yqn 0
1 4  4ze0mfxp 2
1 3.5  50xzptr1 1
1 3.5  52lcpf0x 0
1 4  59492sjb 2
1 3.5  5h7qyn1g 1
1 4  5hio4lgc 1
1 3.5  5oisrm5s 1
1 4  5opiip58 0
1 4  5uwzo304 0
1 3.5  5yk1j4ms 1
1 4  66xk0qqq 1
1 3.5  6cbnpqjj 0
1 3.5  6hyrcq7y 0
1 4  6k5ac3f2 1
1 4  6rpt47gm 0
1 4  6v7oru2l 0
1 3.5  6zfmjq9p 2
1 3.5  7hw23xae 1
1 3.5  7x3nq9cp 0
1 4  84hxim2n 0
1 4  8arwl

38 4  yze7t35v 1
38 4  z9uu4sj7 2
38 3.5  zb6cv8ik 1
38 4  zbdkmgvt 2
38 3.5  zcvk1paf 0
38 4  zm8hpuer 2
38 4  zsyi98t0 1
38 4  zwvutq57 1
38 4  zwy2wym7 2
38 4  zy9lb7d9 2
39 3.5  040w9ba1 1
39 4  0avmt789 2
39 4  0evt7ggx 2
39 4  0fgquau3 2
39 4  0gss1knb 2
39 3.5  0khg28ex 0
39 3.5  0pleiv0k 2
39 4  0sbaxwuf 1
39 4  0x08lgm2 1
39 3.5  11sxecb3 1
39 3.5  16rgt4ca 1
39 4  1dfzjwx0 2
39 4  1fdmmdll 0
39 4  1ru15s5a 2
39 4  1s6wtj25 2
39 4  1untezgg 2
39 4  1w7g6dkq 2
39 4  247fspnb 1
39 4  2b6l1c0n 2
39 3.5  2cqu1fos 1
39 4  2fr4kpp6 2
39 3.5  2m369sm5 2
39 4  2m9nchys 2
39 4  2oh9k0gs 1
39 3.5  2skjqwis 2
39 3.5  2w5ws14b 2
39 4  343y63e5 1
39 4  363ivs67 2
39 3.5  36hiiw91 1
39 3.5  37i62atc 2
39 4  3cnm7p5y 1
39 4  3fp46sov 2
39 4  3hwkr3a4 1
39 3.5  3n16xkvo 2
39 4  3su8pc3f 2
39 4  41fzp72v 2
39 4  41igg68l 0
39 3.5  41y1dr1n 2
39 4  4361psuq 1
39 3.5  493nholj 1
39 4  49g1s7dh 2
39 4  4ki9j4by 2
39 4  4lm663f1 2
39 4  4mjg7jul

In [14]:
!pwd ~/data/corpora/epic_qa/expert/

/users/max/code/epic_qa


In [1]:
import os
import json
from collections import defaultdict
import torch
import numpy as np

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [79]:
!ls /users/max/data/corpora/epic_qa/

consumer  expert


In [3]:
with open('/users/max/data/corpora/epic_qa/consumer/consumer_questions_prelim.json', 'r') as f:
  c_queries = json.load(f)

In [4]:
c_queries[0]

{'question_id': 'CQ001',
 'question': 'what is the origin of COVID-19',
 'query': 'coronavirus origin',
 'background': 'seeking information about whether the virus was designed in a lab or occured naturally in animals and how it got to humans'}

In [5]:
with open('/users/max/data/corpora/epic_qa/expert/expert_questions_prelim.json', 'r') as f:
  queries = json.load(f)

In [6]:
qrels = defaultdict(list)
with open('/users/max/data/corpora/epic_qa/expert/qrels-covid_d4_j3.5-4.txt', 'r') as f:
  for line in f:
    line = line.strip().split()
    if line:
      query_id, _, doc_id, dq_rank = line
      query_id = int(query_id)
      dq_rank = int(dq_rank)
      qrels[query_id].append((doc_id, dq_rank))

In [7]:
sorted_qrels = {}
for query_id, qrel in qrels.items():
  sorted_qrels[query_id-1] = sorted(qrel, key=lambda x: -x[1])

In [8]:
queries[0]

{'question_id': 'EQ001',
 'question': 'what is the origin of COVID-19',
 'query': 'coronavirus origin',
 'background': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}

In [9]:
sorted_qrels[0][:10]

[('0chuwvg6', 2),
 ('105q161g', 2),
 ('1c47w4q5', 2),
 ('22fc1qly', 2),
 ('3uvuo4sf', 2),
 ('43gik8e3', 2),
 ('4ze0mfxp', 2),
 ('59492sjb', 2),
 ('6zfmjq9p', 2),
 ('8arwlhf0', 2)]

In [10]:
rel_docs = []
for query_id, qrel in sorted_qrels.items():
  rel_docs.extend([x for x in qrel if x[1] > 0])
len(rel_docs)

5824

In [32]:
!cat ~/data/corpora/epic_qa/expert/epic_qa_cord_2020-06-19_v2/0chuwvg6.json

{
    "document_id": "0chuwvg6",
    "metadata": {
        "title": "Evidence of significant natural selection in the evolution of SARS-CoV-2 in bats, not humans",
        "authors": "MacLean, Oscar A.; Lytras, Spyros; Singer, Joshua B.; Weaver, Steven; Pond, Sergei L. Kosakovsky; Robertson, David L.",
        "urls": [
            "https://doi.org/10.1101/2020.05.28.122366"
        ],
        "full_text_path": "document_parses/pdf_json/63fe0a2cfc6d1add0e9e94436083aeca6fb9ede3.json"
    },
    "contexts": [
        {
            "section": "Abstract",
            "context_id": "0chuwvg6-C000",
            "text": "RNA viruses are proficient at switching to novel host species due to their fast mutation rates. Implicit in this assumption is the need to evolve adaptations in the new host species to exploit their cells efficiently. However, SARS-CoV-2 has required no significant adaptation to humans since the pandemic began, with no observed selective sweeps to date. Here we 

In [11]:
collection_path = '/users/max/data/corpora/epic_qa/expert/epic_qa_cord_2020-06-19_v2/'

In [12]:
def extract_passages(doc_name):
  with open(os.path.join(collection_path, doc_name + '.json'), 'r') as f:
    doc = json.load(f)
  passages = []
  for context in doc['contexts']:
    context_text = context['text']
    context_passages = []
    for sentence in context['sentences']:
      context_passages.append(context_text[sentence['start']:sentence['end']])
    additional_passages = []
    if len(context_passages) >= 2:
      bi_grams = list([' '.join(x) for x in zip(context_passages[:1], context_passages[1:])])
      additional_passages.extend(bi_grams)
    if len(context_passages) >= 3:
      tri_grams = list([' '.join(x) for x in zip(context_passages[:2], context_passages[1:1], context_passages[2:])])
      additional_passages.extend(tri_grams)
    context_passages = context_passages + additional_passages
    passages.extend(context_passages)
  return passages

In [14]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# TODO consider other reranking models
rerank_model_name = 'nboost/pt-bert-large-msmarco'
tokenizer = AutoTokenizer.from_pretrained(rerank_model_name)

model = AutoModelForSequenceClassification.from_pretrained(rerank_model_name)
model.to(device)
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1

In [31]:
def rerank(query, passages):
  batch_size = 8
  pairs = [(query, passage) for passage in passages]
  all_scores = []
  for b_idx in range(int(np.ceil(len(pairs)/batch_size))):
    batch = tokenizer.batch_encode_plus(
      # query, passage
      batch_text_or_text_pairs=pairs[b_idx * batch_size:(b_idx+1)*batch_size], 
      add_special_tokens=True,
      padding=True,
      return_tensors='pt'
    )

    scores = model(
      input_ids=batch['input_ids'].to(device),
      token_type_ids=batch['token_type_ids'].to(device),
      attention_mask=batch['attention_mask'].to(device)
    )[0][:, 0].data.cpu().numpy()
    all_scores.extend(scores)
  
  passages_sorted = list(sorted(zip(all_scores, range(len(passages))), key=lambda x: x[0]))
  return passages_sorted


In [38]:
query = c_queries[0]['question']
print('Query:')
print(query)
passages = extract_passages('105q161g')
reranks = rerank(query, passages)
print()
print('Re-Ranked Passages:')
for p_score, p_idx in reranks:
  print()
  print(f'Sentence: {p_idx}: {p_score:.2f}')
  print(f'{passages[p_idx]}')
  print()
torch.cuda.empty_cache()

Query:
what is the origin of COVID-19

Re-Ranked Passages:

Sentence: 11: -3.48
A number of virological, epidemiological and ethnographic arguments suggest that COVID-19 has a zoonotic origin. The pangolin, a species threatened with extinction due to poaching for both culinary purposes and traditional Chinese pharmacopoeia, is now suspected of being the “missing link” in the transmission to humans of a virus that probably originated in a species of bat.


Sentence: 0: -3.31
A number of virological, epidemiological and ethnographic arguments suggest that COVID-19 has a zoonotic origin.


Sentence: 100: -2.88
As far as COVID-19 is concerned, the pangolin, a species threatened with extinction due to poaching for both culinary purposes and traditional Chinese pharmacopoeia, is now suspected of being the "missing link" in the transmission to humans of a virus that probably originated in a species of bat.


Sentence: 102: -2.67
As far as COVID-19 is concerned, the pangolin, a species threate

In [13]:
from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
config = T5Config.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained(
    '/users/max/data/models/t5/docT5query_base/model.ckpt-1004000', from_tf=True, config=config)
model.to(device)
model.eval()

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseReluDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dr

In [14]:
def expand(passage_text, num_samples=3, max_length=64, top_k=10):

  input_ids = tokenizer.encode(
    f'{passage_text} </s>', 
    return_tensors='pt'
  ).to(device)

  outputs = model.generate(
      input_ids=input_ids,
      max_length=max_length,
      do_sample=True,
      top_k=top_k,
      num_return_sequences=num_samples
  )

  samples = []
  for i in range(num_samples):
    sample_txt = tokenizer.decode(outputs[i], skip_special_tokens=True)
    samples.append(sample_txt)
  return samples

In [15]:
def extract_contexts(doc_name):
  with open(os.path.join(collection_path, doc_name + '.json'), 'r') as f:
    doc = json.load(f)
  contexts = []
  for context in doc['contexts']:
    context_text = context['text']
    contexts.append(context_text)
  return contexts

In [16]:
contexts = extract_contexts('22fc1qly')
# expand_passages = [passages[0], passages[11]]

for passage in contexts:
  print()
  print('Passage:')
  print(passage)
  print('Expanded Queries:')
  expanded_queries = expand(passage)
  for query in expanded_queries:
    print(query)
  print()
torch.cuda.empty_cache()


Passage:
Coronaviruses are the well-known cause of severe respiratory, enteric and systemic infections in a wide range of hosts including man, mammals, fish, and avian. The scientific interest on coronaviruses increased after the emergence of Severe Acute Respiratory Syndrome coronavirus (SARS-CoV) outbreaks in 2002-2003 followed by Middle East Respiratory Syndrome CoV (MERS-CoV). This decade's first CoV, named 2019-nCoV, emerged from Wuhan, China, and declared as 'Public Health Emergency of International Concern' on January 30th, 2020 by the World Health Organization (WHO). As on February 4, 2020, 425 deaths reported in China only and one death outside China (Philippines). In a short span of time, the virus spread has been noted in 24 countries. The zoonotic transmission (animal-to-human) is suspected as the route of disease origin. The genetic analyses predict bats as the most probable source of 2019-nCoV though further investigations needed to confirm the origin of the novel virus.

what is narratives social media
what types of narratives are there
what types of narratives are in social media?


Passage:
Moreover, the events have been considered as the causes of online user activity that can be identified via activity fluctuations over time [3, 25] . Developing appropriate tools for social media narrative analysis can facilitate communicating the main ideas regarding the events in large data.
Expanded Queries:
social media narrative analysis
why use a narrative in social media
how to determine the cause of social media activity


Passage:
As social media activities generate abundant timestamped multimodal data, many studies such as [8] have presented algorithms to discover the topics and develop descriptive summaries over social media events. probabilistic models to discover word patterns that reflect the underlying topics in a set of document collections [1] . The most commonly used approach to topic modeling is Latent Dirichlet Allocation (LDA) [19] . LDA is a g

what is generative model in social networking
which feature of narrative activity is best described by gabbs sampling?
what is the role of categorical time


Passage:
I. For each topic z, draw T multinomials ϕ z from a Dirichlet prior β; II. For each document d, draw a multinomial θ d from a Dirichlet prior α; III. For each word w di in d:
Expanded Queries:
how to draw a multinomial
how to draw multinomials
which is a multinomial?


Passage:
(a) draw a topic z di from multinomial θ d ;
Expanded Queries:
what's the definition of di
definition of z di
what is a topic di


Passage:
where In this model, Gibbs sampling provides an approximate inference instead if exact inference. To calculate the probability of topic assignment to word w di , we first need to calculate the joint probability of the dataset as P(z d i , w d i , t d i |w −di , t −di , z −di , α, β,ψ ) and use chain rule to derive the probability of P(z d i |w, t, z −d i , α, β,ψ ) as below, where −di subscripts refers to all t

what is the average coherency score for a narratives
what is the difference between noc and tot
what is the coherence score of noc


Passage:
The topic attractiveness to social media users can be investigated as a measure of the length of conversation cascades, the number of initiated textual content, and the number of unique users performing an activity relative to the underlying topic. The user activity fluctuations for timestamped data may contain activity bursts that are illustrative of significant events. Similarly, the generation and propagation of textual content within an online platform can illustrate the narrative activity relative to the events over time, where a burst represents a significant narrative activity. Additionally, the recurrence of a topic can be considered as an attractiveness measure for the associated topic.
Expanded Queries:
what is topic attractiveness
what is topic attractiveness
why is the content attractive


Passage:
In this regard, we propose the signi

what is conjugate priors
what is an example of conjugate priors
how to conjugate priors


Passage:
P(w, t, z|α, β,ψ ) = P(w |z, β) p(t |ψ , z) P(z|α)
Expanded Queries:
what is p(t t
what is p(wt)t,z
what is the formula for p(w, t, z)


Passage:
where P and p refer to the probability mass function (PMF) and probability density function (PDF), respectively. The conditional probability P(z di |w, t, z −di , α, β,ψ ) can be found using the chain rule as:
Expanded Queries:
what is conditional probability
what is the p value for probability density
what is the relationship between probability density and mass function


Passage:
The probability of p(t di ∈ b k ) can be measured as follows:
Expanded Queries:
how can i find the probability of p(t di  b k)
what is p(t di b
what is the probability of p(t di)?


Passage:
where I(.) is equal to 1 when t z d i ∈ b k , and 0 otherwise. Remember first they said the video including the pics of the chlorine cylinder was fake. Whitehelmets One America N

In [None]:
# TODO write script to run document expander on every passage in corpus and append to json
# TODO then write lucene/BM25 index on expanded collection
# TODO then configure IR retrieval to run query, get top-k bm25 results, then re-rank n-grams of sentences for sentence retrieval
# TODO then we are done with Phase 1, move on to Phase 2 with labeling.
# TODO extract qrels and then configure for Phase 2 labeling