<a href="https://colab.research.google.com/github/DayalStrub/ecir2021tutorial/blob/main/other/indexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Indexing: in memory vs. on file

In [1]:
!pip install -q python-terrier



In [2]:
import pyterrier as pt
if not pt.started():
  pt.init()



terrier-assemblies 5.4  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.5  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.5.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


In [3]:
import pandas as pd

pd.set_option('display.max_colwidth', 150)

docs_df = pd.DataFrame([
        ["d1", "this is the first document of many documents"],
        ["d2", "this is another document"],
        ["d3", "the topic of this document is unknown"]
    ], columns=["docno", "text"])

docs_df

Unnamed: 0,docno,text
0,d1,this is the first document of many documents
1,d2,this is another document
2,d3,the topic of this document is unknown


In [4]:
pt.index.IndexingType(3)

<IndexingType.MEMORY: 3>

In [6]:
# NOTE: index_path has to be passed explicitly even for an in-memory index; defaulting it to None would be nicer
indexer = pt.DFIndexer(index_path=None, type=pt.index.IndexingType.MEMORY)

index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index_ref.toString()  # Java method on the returned IndexRef

'MemoryIndex'

In [7]:
index = pt.IndexFactory.of(index_ref)

type(index)

jnius.reflect.org.terrier.structures.Index

In [8]:
indexer_file = pt.DFIndexer(index_path="./index_3docs", overwrite=True)

index_ref_2 = indexer_file.index(docs_df["text"], docs_df["docno"])
index_ref_2.toString()

'./index_3docs/data.properties'

In [9]:
index_2 = pt.IndexFactory.of(index_ref_2)

# same Java Index type as the in-memory index above
type(index_2)

jnius.reflect.org.terrier.structures.Index
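
As a rough analogy for the two cells above, here is a toy sketch of the same idea in plain Python: an inverted index held in memory, and the same structure written to a file and loaded back. This is not Terrier's on-disk format; the JSON layout, `build_postings`, and the `toy_index.json` file name are all assumptions made for illustration.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

docs = {
    "d1": "this is the first document of many documents",
    "d2": "this is another document",
    "d3": "the topic of this document is unknown",
}


def build_postings(docs):
    """Map each term to {docno: term frequency} (hypothetical helper)."""
    postings = defaultdict(dict)
    for docno, text in docs.items():
        for term in text.split():
            postings[term][docno] = postings[term].get(docno, 0) + 1
    return dict(postings)


# In memory: the structure only lives as long as this Python process.
memory_index = build_postings(docs)

# On file: persist the same structure, then load it back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    index_file = Path(tmp_dir) / "toy_index.json"  # hypothetical file name
    index_file.write_text(json.dumps(memory_index))
    file_index = json.loads(index_file.read_text())

# Both routes end up with the same postings, much as both Terrier
# indexes above come back as the same Index type.
print(file_index == memory_index)
print(memory_index["document"])
```

No tokenisation beyond whitespace splitting is done here, so "documents" and "document" stay separate terms in this toy version.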

In [14]:
# NOTE: the class is BatchRetrieve (a verb phrase), easily mistyped as "BatchRetriever"; it is also unclear why it is called "Batch"
br = pt.BatchRetrieve(index, wmodel="Tf")
br.search("document")
# NOTE: the docid column (Terrier's internal id) is not mentioned in the docs and arguably should not be exposed

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,0,d1,0,2.0,document
1,1,2,d3,1,1.0,document
2,1,1,d2,2,1.0,document
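
The scores in the table above can be reproduced by hand. The sketch below assumes that the `Tf` model simply counts occurrences of the query term, and that d1 reaches 2.0 because stemming maps "documents" to "document". The `naive_stem` helper is a crude, hypothetical stand-in for Terrier's stemmer, not the real thing.

```python
from collections import Counter

docs = {
    "d1": "this is the first document of many documents",
    "d2": "this is another document",
    "d3": "the topic of this document is unknown",
}


def naive_stem(token):
    """Strip one trailing 's'; a crude stand-in for a real stemmer (assumption)."""
    return token[:-1] if token.endswith("s") else token


def tf_score(query, text):
    """Sum the raw term frequencies of the stemmed query terms in the text."""
    term_counts = Counter(naive_stem(t) for t in text.lower().split())
    return float(sum(term_counts[naive_stem(q)] for q in query.lower().split()))


scores = {docno: tf_score("document", text) for docno, text in docs.items()}

# Highest score first; the order among tied documents is arbitrary here
# and need not match the order Terrier returns.
ranking = sorted(scores.items(), key=lambda item: -item[1])
print(ranking)
```

Under these assumptions d1 scores 2.0 and d2 and d3 score 1.0 each, matching the `score` column above.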


TextScorer

In [18]:
# NOTE taken from docs - and VERY broken!
df = pd.DataFrame(
    [
        ["q1", "chemical reactions", "d1", "professor protor poured the chemicals"],
        ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats"],
    ], columns=["qid", "query", "docno", "text"])
# NOTE: this df is inconsistent with the PyTerrier data model
textscorer = pt.batchretrieve.TextScorer(takes="docs", body_attr="text", wmodel="TF_IDF")
# NOTE: TextScorer and its transform are inconsistent with e.g. BatchRetrieve, whose constructor takes an index and whose transform takes queries
textscorer.transform(df)

Unnamed: 0,qid,docno,rank,score,query
0,q1,d1,0,0.545455,chemical reactions
1,q1,d2,1,0.545455,chemical reactions
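
One reading that reproduces the identical 0.545455 scores above is sketched below. It assumes Terrier's `TF_IDF` multiplies a Robertson-style tf component (with `k_1=1.2`, `b=0.75`) by a base-2 idf of `log2(N/df + 1)`. It also assumes that "chemical" matches once in each document after stemming, that "reactions" matches nothing, and that both documents have the same length after stopword removal. The length of 4 is a guess; any equal lengths give the same result.

```python
import math


def tf_idf(tf, doc_len, avg_doc_len, num_docs, doc_freq,
           k_1=1.2, b=0.75, key_freq=1.0):
    """Robertson tf times log2 idf (assumed form of Terrier's TF_IDF)."""
    robertson_tf = k_1 * tf / (tf + k_1 * (1 - b + b * doc_len / avg_doc_len))
    idf = math.log2(num_docs / doc_freq + 1)
    return key_freq * robertson_tf * idf


# "chemical": tf=1 in each of the 2 documents, so doc_freq=2 and idf=log2(2)=1.
# "reactions": occurs in neither document, so it contributes nothing.
score = tf_idf(tf=1, doc_len=4, avg_doc_len=4, num_docs=2, doc_freq=2)
print(round(score, 6))  # 0.545455
```

If this reading is right, the two documents tie because the only matching term has the same frequency in both and their lengths are equal, so the ranking between them carries no information.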
