In [4]:
from src.bm25 import *
import src.preprocessing as preprocessing
import src.utils as utils

%load_ext autoreload
%autoreload 2



### Build the searcher

In [5]:
bm25 = build_searcher('entry_data_token_by_section/')


  0%|          | 0/24517 [00:00<?, ?it/s]

### Use the searcher

Here is an example of how to use the searcher. Note that the search is case-insensitive.

In [6]:
# The query can be any string; here we read it from an example query file.
with open('toc.txt') as f:
    query = f.read()

# get scores for the query
ranking = bm25.get_scores(query)


`get_scores(query)` returns a list with one entry per section, each of the form `(section title, score, list of (token, token_score) pairs)`. The list is sorted by relevance score in descending order.


In [7]:
ranking[0]

['Reliabilist Epistemology || A Paradigm Shift in Analytic Epistemology',
 104.64468000171928,
 [('reliabilist', 10.682715970501583),
  ('gettier', 7.913633470759927),
  ('safety', 6.460693400026655),
  ('sensitivity', 5.790663622000863),
  ('justified', 5.188410321299965),
  ('justification', 4.705217440719186),
  ('epistemic', 4.476381109480742),
  ('modal', 3.9926725712569433),
  ('theory', 3.9575494581618247),
  ('first', 3.0909508783660558),
  ('alternative', 2.8901483048141245),
  ('belief', 2.7948843512884576),
  ('knowledge', 2.512723295519643),
  ('false', 2.5009147724367233),
  ('relevant', 2.5005336499409045),
  ('condition', 1.8708100531326974),
  ('truth', 1.7237379582424155),
  ('true', 1.6861872088247576),
  ('case', 0.27215427778001605)]]
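
As a minimal sketch, the entry shown above can be unpacked into its three parts. The values are copied from the output above, with the token list shortened. Splitting the title on `||` into an entry name and a section name is an assumption based on how the title is printed, not something this notebook states.

```python
# One ranking entry, copied from the output above (token list shortened).
entry = [
    'Reliabilist Epistemology || A Paradigm Shift in Analytic Epistemology',
    104.64468000171928,
    [('reliabilist', 10.682715970501583), ('gettier', 7.913633470759927)],
]

# Unpack the three parts: title, overall score, per-token scores.
title, score, token_scores = entry

# Assumption: the title has the form "entry name || section name".
entry_name, section_name = (part.strip() for part in title.split('||', 1))

# Turn the (token, token_score) pairs into a dict for easy lookup.
token_lookup = dict(token_scores)

print(entry_name)
print(section_name)
print(round(score, 2), token_lookup['gettier'])
```

Converting the token list to a dict is only a convenience for lookups; the list itself keeps the tokens ordered by their contribution.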

Optionally, we can exclude all sections under a specific entry by using `exclude_by_entry(ranking_list, entry)`. This is useful when the query text comes from a specific entry. 

For example, to exclude all sections under the entry "The Analysis of Knowledge", we can do:

In [11]:
r_e = exclude_by_entry(ranking, 'The Analysis of Knowledge')
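
To make the idea concrete, here is a rough sketch of what such a filter could look like. It is not the implementation in `src.bm25`: the function name `exclude_by_entry_sketch`, the `entry name || section name` title format, the case-insensitive comparison, and the toy ranking are all assumptions made for illustration.

```python
def exclude_by_entry_sketch(ranking_list, entry):
    """Hypothetical stand-in: drop sections that belong to `entry`.

    Assumes titles look like "entry name || section name"; the real
    exclude_by_entry in src.bm25 may work differently.
    """
    wanted = entry.strip().lower()
    kept = []
    for item in ranking_list:
        title = item[0]
        # Everything before the first "||" is treated as the entry name.
        entry_name = title.split('||', 1)[0].strip().lower()
        if entry_name != wanted:
            kept.append(item)
    return kept


# Made-up ranking used only for illustration.
toy_ranking = [
    ['The Analysis of Knowledge || Section A', 3.0, []],
    ['Reliabilist Epistemology || Section B', 2.0, []],
]

filtered = exclude_by_entry_sketch(toy_ranking, 'The Analysis of Knowledge')
print([item[0] for item in filtered])
```

Because the filter only removes items, the remaining sections keep their original order and scores.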

#### Reweighting terms
By default, every token is weighted by the BM25 score. We can also assign weights to important tokens by using `word_importance()`.

Here we make "Epistemic" and "justification" twice as important as other words, and "know" half as important. Each weight is multiplied by the token's BM25 score, which already weights tokens by inverse document frequency and term frequency. Importance keys are also case-insensitive.
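
For reference, the weighting can be written out as follows, assuming the standard Okapi BM25 form. The exact variant and the parameter values used by `src.bm25` are not shown in this notebook, so $k_1$, $b$, and the IDF definition are assumptions; $w_i$ is the importance weight, equal to 1 for tokens without an explicit entry.

$$
\text{score}(D, Q) = \sum_{i} w_i \cdot \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \, \frac{|D|}{\text{avgdl}}\right)}
$$

Here $f(q_i, D)$ is the frequency of query token $q_i$ in section $D$, $|D|$ is the section length in tokens, and $\text{avgdl}$ is the average section length in the collection.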

In [14]:

importance = {
    'Epistemic': 2,
    'know': 0.5,
    'justification': 2,
}

ranking = bm25.get_scores(query, importance)


{'epistemic': 2, 'know': 0.5, 'justification': 2}


In this example the top result is the same as with the default weights, but the scores differ: the scores for "justification" and "epistemic" are now higher and contribute more to the final score.

In [15]:
ranking[0]

['Reliabilist Epistemology || A Paradigm Shift in Analytic Epistemology',
 118.53149599263838,
 [('reliabilist', 10.682715970501583),
  ('justification', 9.410434881438372),
  ('epistemic', 8.952762218961483),
  ('gettier', 7.913633470759927),
  ('safety', 6.460693400026655),
  ('sensitivity', 5.790663622000863),
  ('justified', 5.188410321299965),
  ('modal', 3.9926725712569433),
  ('theory', 3.9575494581618247),
  ('first', 3.0909508783660558),
  ('alternative', 2.8901483048141245),
  ('belief', 2.7948843512884576),
  ('knowledge', 2.512723295519643),
  ('false', 2.5009147724367233),
  ('relevant', 2.5005336499409045),
  ('condition', 1.8708100531326974),
  ('truth', 1.7237379582424155),
  ('true', 1.6861872088247576),
  ('case', 0.27215427778001605)]]
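
As a quick sanity check, the per-token scores from the two outputs above can be compared directly. This sketch uses only numbers copied from those outputs; lowercasing the importance keys mirrors the dictionary printed after the reweighted call and is otherwise an assumption about how the library normalizes them.

```python
import math

# Per-token scores for the top result, copied from the two outputs above.
default_scores = {
    'justification': 4.705217440719186,
    'epistemic': 4.476381109480742,
    'reliabilist': 10.682715970501583,
    'knowledge': 2.512723295519643,
}
reweighted_scores = {
    'justification': 9.410434881438372,
    'epistemic': 8.952762218961483,
    'reliabilist': 10.682715970501583,
    'knowledge': 2.512723295519643,
}

importance = {'Epistemic': 2, 'know': 0.5, 'justification': 2}

# Assumption: keys are matched case-insensitively by lowercasing them.
normalized = {token.lower(): weight for token, weight in importance.items()}

for token, base in default_scores.items():
    # Tokens without an explicit weight keep a weight of 1.
    expected = base * normalized.get(token, 1.0)
    matches = math.isclose(expected, reweighted_scores[token])
    print(token, normalized.get(token, 1.0), matches)
```

In these outputs the score for "knowledge" is unchanged even though "know" has a weight of 0.5, which suggests that weights apply to exact tokens rather than to prefixes.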