<a href="https://colab.research.google.com/github/DayalStrub/ecir2021tutorial/blob/main/other/2-simplest-thing-Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TODO

* Figure out whether a D or R dataframe can be passed to BM25 (etc.) to restrict ranking to a subset of the documents in the index
* Understand whether TF-IDF weights can be read easily/efficiently from the index rather than recomputed with scikit-learn
* Understand caching/compiling so that rm3_pipe is not run twice in the simplest-thing pipeline
* Try using an Anserini back end?

## Set up

In [1]:
!pip install -q jupyter-autotime

In [2]:
# !pip install -q python-terrier
!pip install -q git+https://github.com/terrier-org/pyterrier.git

  Building wheel for python-terrier (setup.py) ... done


In [3]:
import pandas as pd
from pandas.testing import assert_frame_equal

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

In [4]:
import pyterrier as pt
pt.init(boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"])

  from pandas import Panel


PyTerrier 0.5.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


In [5]:
%load_ext autotime

## Data & index

In [6]:
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')

In [7]:
next(iter(dataset.get_corpus_iter()))


OrderedDict([('title',
              'Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia'),
             ('doi', '10.1186/1471-2334-1-6'),
             ('date', '2001-07-04'),
             ('abstract',
              'OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common

In [8]:
N = 192509 # 10_000

pt_index_path = './index_cord19'

# create the index, using the IterDictIndexer indexer 
indexer = pt.index.IterDictIndexer(pt_index_path, overwrite=True) # , blocks=True

# we give the dataset get_corpus_iter() directly to the indexer
# while specifying the fields to index and the metadata to record
index_ref = indexer.index(
    (datum for i, datum in enumerate(dataset.get_corpus_iter()) if i < N),
    fields=('abstract',), # TODO should this be a list?
    # meta={'docno' : 26, 'text' : 2048, 'abstract' : 2048}
    meta=['docno','abstract'],
    )

index = pt.IndexFactory.of(index_ref)



17:32:35.729 [ForkJoinPool-1-worker-3] WARN  o.t.structures.indexing.Indexer - Indexed 54937 empty documents


In [9]:
print(index.getCollectionStatistics().toString())

Number of documents: 192509
Number of terms: 151235
Number of postings: 11554033
Number of fields: 1
Number of tokens: 17728468
Field names: [abstract]
Positions:   false



In [10]:
queries = dataset.get_topics(variant='title').head(3)
queries

Unnamed: 0,qid,query
0,1,coronavirus origin
1,2,coronavirus response to weather changes
2,3,coronavirus immunity


## Retrieval

### BM25

In [11]:
# NOTE: num_results is not working as expected, but can be worked around; see
# https://github.com/terrier-org/pyterrier/issues/140

M = 1002

bm25 = pt.BatchRetrieve(index, wmodel="BM25", num_results=M)

bm25.transform(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,175892,zy8qjaai,0,11.915479,coronavirus origin
1,1,82224,8ccl9aui,1,11.550953,coronavirus origin
2,1,135326,ne5r4d4b,2,11.268729,coronavirus origin
3,1,122806,4fb291hq,3,11.165944,coronavirus origin
4,1,122805,kn2z7lho,4,11.165944,coronavirus origin
...,...,...,...,...,...,...
2995,3,78633,8hrjgcas,995,7.195963,coronavirus immunity
2996,3,161122,bn9ny0k6,996,7.195963,coronavirus immunity
2997,3,91233,am67dg8a,997,7.195963,coronavirus immunity
2998,3,74904,intwv8g4,998,7.194310,coronavirus immunity


In [12]:
# TODO: can D be passed to bm25 to restrict ranking to a subset of the indexed docs?

### RM3

In [13]:
rm3 = pt.rewrite.RM3(index)

(bm25 >> rm3).transform(queries)

Unnamed: 0,qid,query_0,query
0,1,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
1,2,coronavirus response to weather changes,applypipeline:off respons^0.150000006 action^0...
2,3,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...


In [14]:
bm25_text = pt.BatchRetrieve(index, wmodel="BM25", metadata=["docno", "abstract"], num_results=M)

rm3_pipe = bm25 >> rm3 >> bm25_text >> pt.apply.rename({'abstract':'text'})

In [15]:
df_out = rm3_pipe.transform(queries)
df_out # .loc[df_out["qid"] == "1", :]

Unnamed: 0,qid,docid,docno,text,rank,score,query_0,query
0,1,135326,ne5r4d4b,Severe acute respiratory syndrome coronavirus ...,0,15.997875,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
1,1,175892,zy8qjaai,"In humans, infection with the coronavirus, esp...",1,14.555033,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
2,1,174028,lfo6otkc,"During the past two decades, three zoonotic co...",2,13.581315,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
3,1,166596,d2knbzhl,Bats have been recognized as the natural reser...,3,13.408224,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
4,1,89280,j318qn5p,The new decade of the 21st century (2020) star...,4,13.139674,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
...,...,...,...,...,...,...,...,...
2995,3,93774,6x1704l3,There is an urgent need to better understand t...,995,7.617904,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...
2996,3,32145,xeyggg1b,There is an urgent need to better understand t...,996,7.617904,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...
2997,3,183966,gts8jtae,Coronavirus disease 2019 (COVID19) is a respir...,997,7.616834,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...
2998,3,6428,lbe594up,Viral and microbial constituents contain speci...,998,7.615554,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...


In [16]:
# NOTE: RM3 returns (rewritten) queries, so rm3_pipe is NOT a re-ranking of the BM25 results, but rather a fresh ranking retrieved with the expanded query

In [17]:
df_out.loc[:, ["qid", "docid"]].groupby("qid").count()

Unnamed: 0_level_0,docid
qid,Unnamed: 1_level_1
1,1000
2,1000
3,1000


### Lin's simplest thing

In [18]:
def pr_linear_reranker(
    df,
    *,
    n_top = 10,
    n_bottom = 100, # TODO should actually get 0 score documents? how to do this in pyterrier?
    col_text = 'text'
):
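  df_tmp = df.copy()

  # NOTE: .loc slices are label-inclusive, so after reset_index these select
  # n_top + 1 and n_bottom + 1 rows respectively
  # pseudo-relevant documents: top of the ranking
  df_relevant = df_tmp.sort_values("rank").reset_index(drop=True).loc[0:n_top, ["docid", col_text]]
  df_relevant["label"] = 1

  # pseudo-non-relevant documents: bottom of the ranking
  df_not_relevant = df_tmp.sort_values("rank", ascending=False).reset_index(drop=True).loc[0:n_bottom, ["docid", col_text]]
  df_not_relevant["label"] = 0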

  df_train = pd.concat([df_relevant, df_not_relevant])

  text_transformer = Pipeline([
      ('tfidf', TfidfVectorizer()),
  ])

  preprocessor = ColumnTransformer(
      transformers=[
          # ('num', num_transformer, col_num),
          # ('cat', cat_transformer, col_cat),
          ('text', text_transformer, col_text)
      ]
  )

  clf = Pipeline(steps=[
      ('preprocessor', preprocessor),
      ('classifier', LogisticRegression(random_state=23))
  ])

  X_train = df_train.loc[:, ["docid", col_text]]
  y_train = df_train["label"]

  clf.fit(X_train, y_train)

  preds = clf.predict_proba(df_tmp.loc[:, ["docid", col_text]])

  # predict_proba columns follow the sorted class labels [0, 1], so column 1 is label 1
  df_tmp["score"] = preds[:, 1]
  df_tmp.sort_values("score", ascending=False, inplace=True)
  df_tmp["rank"] = [*range(df_tmp.shape[0])]
  
  return df_tmp
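
The idea behind the reranker can be tried on a toy ranking without PyTerrier. The sketch below is only an illustration under assumptions: the documents are made up, the `toy_` names are hypothetical, and it uses `head`/`tail` (which select exactly `n` rows) instead of label-based slicing. It also checks that column 1 of `predict_proba` corresponds to label 1.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy ranking for a single query (made-up data); rank 0 is the best hit.
toy = pd.DataFrame({
    "docid": [0, 1, 2, 3, 4, 5],
    "text": [
        "bat coronavirus origin",
        "coronavirus origin in animals",
        "zoonotic origin of virus",
        "hospital staffing survey",
        "economic impact of lockdown",
        "school closure policy",
    ],
    "rank": [0, 1, 2, 3, 4, 5],
})

n_top, n_bottom = 2, 2
ranked = toy.sort_values("rank")

# Top of the ranking is treated as relevant, bottom as not relevant.
toy_train = pd.concat([
    ranked.head(n_top).assign(label=1),
    ranked.tail(n_bottom).assign(label=0),
])

toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(random_state=23)),
])
toy_clf.fit(toy_train["text"], toy_train["label"])

# Class labels are stored sorted, so predict_proba[:, 1] is P(label == 1).
toy_classes = toy_clf.named_steps["classifier"].classes_.tolist()
toy_scores = toy_clf.predict_proba(toy["text"])[:, 1]

print(toy_classes)  # [0, 1]
print(toy.assign(score=toy_scores).sort_values("score", ascending=False))
```

With so few training documents the scores are only indicative; the point is the shape of the procedure, not the quality of the classifier.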

In [19]:
# df_eg = df_out.copy().loc[df_out["qid"] == "2", :].reset_index(drop=True)

# pr_linear_reranker(df_eg)

In [20]:
def normalise_score(df, norm="max"):
  normalized_data = normalize(df["score"].to_numpy().reshape(1, -1), norm=norm)[0]
  df["score"] = normalized_data
  return df
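
For non-negative scores, `normalize(..., norm="max")` on a single row amounts to dividing every score by the largest one. A small check on made-up scores (the variable names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up scores for one query.
toy_scores = np.array([2.0, 4.0, 1.0])

# normalize works row-wise, hence the reshape to a single row.
by_sklearn = normalize(toy_scores.reshape(1, -1), norm="max")[0]

# The same result by hand: divide by the largest score.
by_hand = toy_scores / toy_scores.max()

print(by_sklearn.tolist())  # [0.5, 1.0, 0.25]
print(np.allclose(by_sklearn, by_hand))  # True
```

Because the normalisation is per row, it has to be applied per query; applying it to the whole result frame at once would scale all queries by the single largest score.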

In [21]:
# df_eg["score"].to_numpy().reshape(1, -1)
# normalize(df_eg["score"].to_numpy().reshape(1, -1), norm="max")

# df_new = df_out.groupby("qid").apply(normalise_score)
# df_new.loc[df_new["score"] == 1, :]

In [22]:
def simple_reranker(df):
    df_score = pr_linear_reranker(df)
    return normalise_score(df_score)

In [23]:
class SimpleTransformer(pt.transformer.TransformerBase):
  def transform(self, df, **kwargs):
    return df.groupby("qid").apply(simple_reranker).droplevel(0)
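
What the per-query application does can be spelled out with an explicit loop over the query groups. This is a sketch only: `apply_by_query` and `max_normalise` are hypothetical helpers, not PyTerrier functions, and the run is made up.

```python
import pandas as pd

def apply_by_query(df, fn):
    """Apply fn to each query's rows separately and concatenate the results."""
    parts = [fn(group.copy()) for _, group in df.groupby("qid", sort=False)]
    return pd.concat(parts)

def max_normalise(group):
    """Scale the scores of one query so that the best score becomes 1."""
    group["score"] = group["score"] / group["score"].max()
    return group

# Made-up run with two queries.
toy_run = pd.DataFrame({
    "qid": ["1", "1", "2", "2"],
    "docno": ["a", "b", "c", "d"],
    "score": [10.0, 5.0, 2.0, 8.0],
})

normalised = apply_by_query(toy_run, max_normalise)
print(normalised["score"].tolist())  # [1.0, 0.5, 0.25, 1.0]
```

Each query is normalised against its own maximum, and the original row index is preserved because the groups are concatenated as they are.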

In [24]:
qid = "0"
docno = "r9scxa76"

In [25]:
# pr_pipe = rm3_pipe >> pt.apply.by_query(simple_reranker)
pr_pipe = rm3_pipe >> pt.apply.by_query(pr_linear_reranker) >> pt.apply.by_query(normalise_score)

df_out_st = pr_pipe.transform(queries) # .loc[:0, :])
df_out_st # .loc[(df_out_st["qid"] == qid) & (df_out_st["docno"] == docno), :]

Unnamed: 0,qid,docid,docno,text,score,query_0,query,rank
4,1,89280,j318qn5p,The new decade of the 21st century (2020) star...,1.000000,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...,0
5,1,89279,ab757i3f,The new decade of the 21st century (2020) star...,1.000000,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...,1
9,1,101278,apqizr54,OBJECTIVE: SARS-CoV-2 is responsible for the p...,0.992260,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...,2
10,1,101277,icwvm7jp,OBJECTIVE: SARS-CoV-2 is responsible for the p...,0.992260,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...,3
8,1,39220,deajwhx0,OBJECTIVE SARS-CoV-2 is responsible for the pr...,0.992260,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...,4
...,...,...,...,...,...,...,...,...
2096,3,85256,2ufbrfqj,Feline infectious peritonitis (FIP) is caused ...,0.212774,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...,995
2232,3,174717,qaf9esus,The resistance of immunized mice to challenge ...,0.212494,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...,996
2832,3,80716,wobsyenq,Tyrosinase-related proteins-1 and -2 (gp75/TRP...,0.212414,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...,997
2720,3,21863,wpc5dmcz,This chapter describes immune responses to the...,0.212234,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...,998


In [26]:
# len(df_out_st.index.unique())

In [27]:
df_out_st.shape

(3000, 8)

In [28]:
pr_pipe_tr = rm3_pipe >> SimpleTransformer()

df_out_str = pr_pipe_tr.transform(queries) # .loc[:0, :])
df_out_str # .loc[(df_out_str["qid"] == qid) & (df_out_str["docno"] == docno), :]

Unnamed: 0,qid,docid,docno,text,rank,score,query_0,query
4,1,89280,j318qn5p,The new decade of the 21st century (2020) star...,0,1.000000,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
5,1,89279,ab757i3f,The new decade of the 21st century (2020) star...,1,1.000000,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
9,1,101278,apqizr54,OBJECTIVE: SARS-CoV-2 is responsible for the p...,2,0.992260,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
10,1,101277,icwvm7jp,OBJECTIVE: SARS-CoV-2 is responsible for the p...,3,0.992260,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
8,1,39220,deajwhx0,OBJECTIVE SARS-CoV-2 is responsible for the pr...,4,0.992260,coronavirus origin,applypipeline:off coronavirus^0.035478260 anim...
...,...,...,...,...,...,...,...,...
2096,3,85256,2ufbrfqj,Feline infectious peritonitis (FIP) is caused ...,995,0.212774,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...
2232,3,174717,qaf9esus,The resistance of immunized mice to challenge ...,996,0.212494,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...
2832,3,80716,wobsyenq,Tyrosinase-related proteins-1 and -2 (gp75/TRP...,997,0.212414,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...
2720,3,21863,wpc5dmcz,This chapter describes immune responses to the...,998,0.212234,coronavirus immunity,applypipeline:off coronavirus^0.022062551 prot...


In [29]:
df_out_str.shape

(3000, 8)

In [30]:
assert_frame_equal(df_out_st, df_out_str.loc[:, df_out_st.columns]) # .query("qid=='1'")

In [31]:
# TODO how to run rm3_pipe only once? CHECK caching

alpha = 0.5

simplest_pipe = alpha * rm3_pipe + (1 - alpha) * pr_pipe
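
The linear combination can be pictured as a weighted sum of the two runs' scores per (qid, docno) pair. The pandas sketch below illustrates that idea only; `combine_linear` is a hypothetical helper, the runs are made up, and it assumes that a document missing from one run contributes a score of 0 for that run, which has not been checked against PyTerrier's `+` operator here.

```python
import pandas as pd

def combine_linear(run_a, run_b, alpha):
    """Weighted sum of two runs' scores, matched on (qid, docno)."""
    merged = run_a.merge(
        run_b, on=["qid", "docno"], how="outer", suffixes=("_a", "_b")
    )
    # Assumption: a document absent from one run scores 0 in that run.
    merged = merged.fillna({"score_a": 0.0, "score_b": 0.0})
    merged["score"] = alpha * merged["score_a"] + (1 - alpha) * merged["score_b"]
    return (
        merged[["qid", "docno", "score"]]
        .sort_values(["qid", "score"], ascending=[True, False])
        .reset_index(drop=True)
    )

# Made-up runs for a single query.
run_a = pd.DataFrame({"qid": ["1", "1"], "docno": ["a", "b"], "score": [1.0, 0.4]})
run_b = pd.DataFrame({"qid": ["1", "1"], "docno": ["b", "c"], "score": [1.0, 0.6]})

combined = combine_linear(run_a, run_b, alpha=0.5)
print(combined)
```

In the notebook the two inputs are on different scales: the rm3_pipe scores shown above are raw BM25 scores (roughly 7 to 16), while the pr_pipe scores are max-normalised to at most 1, so alpha does not weight the two systems symmetrically.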

## Experiment

In [32]:
# NOTE: pt.Experiment is a function, not a class
pt.Experiment(
    retr_systems=[bm25, rm3_pipe, pr_pipe, simplest_pipe],
    names=["BM25", "simplest_a_1", "simplest_a_0", "simplest_a_05"],
    topics=queries,
    qrels=dataset.get_qrels(), # TODO figure out cord19 qrels
    eval_metrics=["map"]
              )

Unnamed: 0,name,map
0,BM25,0.109513
1,simplest_a_1,0.145553
2,simplest_a_0,0.135164
3,simplest_a_05,0.147446
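
For reference, the `map` metric is the mean over queries of average precision. The sketch below follows the textbook definition with binary relevance on made-up rankings and judgements; the function names are hypothetical, and it is not the evaluation code that `pt.Experiment` runs.

```python
def average_precision(ranked_docnos, relevant):
    """Average of precision@k over the positions k where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / position
    # Relevant documents that were never retrieved contribute 0.
    return precision_sum / len(relevant)

def mean_average_precision(runs, qrels):
    """Mean of average precision over the queries in runs."""
    per_query = [average_precision(docs, qrels.get(qid, set())) for qid, docs in runs.items()]
    return sum(per_query) / len(per_query)

# Made-up rankings and relevance judgements.
toy_runs = {"1": ["a", "b", "c", "d"], "2": ["x", "y"]}
toy_qrels = {"1": {"a", "c"}, "2": {"y"}}

ap_q1 = average_precision(toy_runs["1"], toy_qrels["1"])  # (1/1 + 2/3) / 2
ap_q2 = average_precision(toy_runs["2"], toy_qrels["2"])  # (1/2) / 1
toy_map = mean_average_precision(toy_runs, toy_qrels)
print(ap_q1, ap_q2, toy_map)
```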
