# Scoring or Ranking Search Results

By this point you should be familiar with creating an index and writing search queries. In the following cells, we apply a scoring criterion when searching the index. 


Normally the list of result documents is sorted by score. 
The `whoosh.scoring` module contains implementations of various scoring algorithms. 
The default is [BM25F](https://en.wikipedia.org/wiki/Okapi_BM25). 
You can set the scoring object to use when you create the searcher using the `weighting` keyword argument: 

```python
from whoosh import scoring

with myindex.searcher(weighting=scoring.TF_IDF()) as s:
    ... 
```
    
    
A weighting model is a `WeightingModel` subclass with a `scorer()` method that produces a “scorer” instance. 
This instance has a method that takes the current matcher and returns a floating point score.



### TF-IDF

So why do we have to score the terms? 
Previously, we have simply used the number of times a token occurs in a document to classify the document. 
Even with the removal of stop words, however, 
this can still overemphasize tokens that might generally occur across many documents (e.g., names or general concepts). 
An alternative technique that often provides robust improvements in classification 
accuracy is to employ the frequency of token occurrence, 
normalized over the frequency with which the token occurs in all documents. 
In this manner, we give higher weight in the classification process to tokens 
that are more strongly tied to a particular label. 

Formally this concept is known as [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf–idf) (or tf-idf). 
We will use this scoring method and compare the search results with those from the default scoring method.
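
To make the idea concrete, here is a minimal sketch on a toy corpus. The three short "documents" are invented for illustration, and the idf formula is the textbook form `log(N / df)`; Whoosh's `TF_IDF` class may use a slightly different variant, so treat the numbers as indicative only.

```python
import math
from collections import Counter

# Toy corpus invented for illustration; each "document" is one short line.
docs = {
    "d1": "love is patient love is kind",
    "d2": "the lord is my shepherd",
    "d3": "god is love",
}

# Term counts per document (whitespace tokenization, no stemming).
term_counts = {name: Counter(text.split()) for name, text in docs.items()}


def idf(term):
    """Inverse document frequency, textbook form: log(N / df)."""
    df = sum(1 for counts in term_counts.values() if term in counts)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, name):
    """Raw term count in the document times the term's idf."""
    return term_counts[name][term] * idf(term)


for name in docs:
    print(name, round(tf_idf("love", name), 3), round(tf_idf("is", name), 3))
```

The word "is" occurs in every document, so its idf is zero and it contributes nothing to the score, while "love" scores highest in the document that repeats it.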

In the TF-IDF query cell later in this notebook, documents with a better TF-IDF score appear higher in the search results list. 
Compare those results with the results of the cell before it, which uses the default scoring method. 
Read the documents referenced below to understand what TF-IDF is and how it is applied in Whoosh. 
 

-----

Reference: 

- [Scoring and sorting](http://whoosh.readthedocs.io/en/latest/searching.html#scoring-and-sorting)
- [TF-IDF](http://www.tfidf.com/)


In the previous lab, we processed only 3 documents. Before experimenting with the scoring method, let's process all the documents.

### Create Schema

Earlier, the schema stored only the file id and content. This time, we aim to index at the line level. 


In [None]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                line_num=ID(stored=True),  # this is new; we want to show line number in search result
                content=TEXT(analyzer=StemmingAnalyzer(),stored=True)
               )
print(schema)

### Create Index

In [None]:
import os, os.path
from whoosh import index, scoring

# create an index dir

os.makedirs("book_index", exist_ok=True)  # create a directory for indexing

# Note, this clears the existing index in the directory
ix = index.create_in("book_index", schema)

# Get a writer for adding documents to the newly created index
writer = ix.writer()


In [None]:
# necessary functions for processing files

def loadFile(writer, fname):
    '''
    Read file contents, load into database.
    '''
    with open(fname, 'r', encoding="utf-8") as infile:
        # Since we want to show the line number in search results, we process
        # the document line by line; enumerate numbers the lines starting at 1.
        for line_no, line in enumerate(infile, start=1):
            line = line.rstrip('\n')
            writer.add_document(filename=fname, line_num=str(line_no), content=line)
    print("Indexed: ", fname)


def processFolder(writer,folder):
    '''
    Process a folder for files and subfolders
    '''
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        # add a new line to separate folders in the output
        print("\nroot = ", root)
        # Process Files
        for file in files:
            if file.endswith(".txt"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File:", file)


In [None]:

processFolder(writer,"/dsa/data/all_datasets/book")

writer.commit() # save changes

<span style="background:yellow; font-weight:bold">Comprehension Check:</span>  
- What does the loadFile function do? 
- What does the processFolder function do? 
    - Do you understand how each directory gets processed?
- Why are the two code lines at the bottom flush left? 
- Why is the line 'writer.commit' necessary?

### Query

We now execute a query on this index. For ranking the search results, we first use the default one (i.e., [BM25F](https://en.wikipedia.org/wiki/Okapi_BM25)), and then compare against the TF-IDF scoring system. 

In [None]:
from whoosh.qparser import QueryParser

ix = index.open_dir("book_index")   # open the index created earlier
qp = QueryParser("content", schema=ix.schema)


q = qp.parse("love")

with ix.searcher() as s:  # using default scoring method
    results = s.search(q)  
    for hit in results:
        filename = hit['filename']
        line_num = hit['line_num']
        
        print(f"filename: {filename:40s} line_num: {line_num:>4} score: {hit.score}")

Now change the scoring method to TF-IDF and see if there is any difference in the ranking. 

In [None]:
q = qp.parse("love")

with ix.searcher(weighting=scoring.TF_IDF()) as s:
    results = s.search(q)

    for hit in results:
        filename = hit["filename"]
        line_num = hit["line_num"]
        print(f"filename: {filename:40s} line_num: {line_num:>4} score: {hit.score}")


Observe that hits from the files **"1samuel.txt"** and **"hosea.txt"** have made it into the top 10, while the lines from **john.txt** that were at positions 2 and 6 have moved to positions 4 and 3 under the ranking based on TF-IDF scores. 

<span style="background:yellow; font-weight:bold">Comprehension Check:</span>  
- Do you understand why (or how) the lines from 1samuel.txt and hosea.txt moved into the top 10 while the lines from john.txt changed in the rankings?
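
One way to reason about this question is to compare how the two formulas treat line length. The sketch below uses the textbook Okapi BM25 term score next to a plain tf × idf product. The `k1` and `b` values are the commonly cited defaults, and the idf and length numbers are hypothetical; Whoosh's implementations may differ in detail.

```python
def tf_idf_score(tf, idf):
    """Raw term frequency times idf; document length plays no role."""
    return tf * idf


def bm25_score(tf, idf, doc_len, avg_len, k1=1.2, b=0.75):
    """Textbook Okapi BM25 term score with length normalization."""
    norm = 1 - b + b * (doc_len / avg_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)


idf = 2.0        # hypothetical idf for the query term
avg_len = 15.0   # hypothetical average line length in tokens

short_line = {"tf": 1, "doc_len": 5}
long_line = {"tf": 1, "doc_len": 25}

for label, line in (("short", short_line), ("long", long_line)):
    print(label,
          tf_idf_score(line["tf"], idf),
          round(bm25_score(line["tf"], idf, line["doc_len"], avg_len), 3))
```

Both lines contain the term once, so the tf × idf product treats them identically, while BM25 favors the shorter line. Differences of this kind are one plausible reason why lines change position when the weighting model changes.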


----

## Filtering results

You can use the `filter` keyword argument in `search()` to restrict the results to a permitted set of documents. 
The argument can be a `whoosh.query.Query` object, a `whoosh.searching.Results` object, 
or a set-like object containing document numbers. 
The searcher caches filters so if you use the same query filter with a searcher 
multiple times, the additional searches will be faster because the searcher will 
use the cache of results from previous runs of the query. 
You can also use the `mask` keyword argument to specify a set of documents that are not permitted in the results. 
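
Conceptually, the two arguments behave like set operations on document numbers. The following sketch uses plain Python sets with made-up document numbers; it models the semantics only and does not call Whoosh.

```python
# Hypothetical document numbers, invented for illustration.
matches = {1, 2, 3, 4, 5}   # documents matching the user's query
allowed = {2, 3, 4, 8}      # filter: only these may appear in the results
masked = {4}                # mask: these must not appear in the results

# Keep matches that are allowed, then drop anything that is masked.
results = (matches & allowed) - masked
print(sorted(results))
```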

Let's first look up documents in which `hate` appears.

----

In [None]:
from whoosh.qparser import QueryParser
from whoosh import scoring

qp = QueryParser("content", schema=ix.schema)
q = qp.parse("hate")

with ix.searcher(weighting=scoring.TF_IDF()) as s:
    results = s.search(q)
    for hit in results:
        print(hit["filename"])

In the code cell below, we use the `filter` argument to allow only lines from `john.txt` in the results, and the `mask` argument to exclude lines that contain the word `hate`. 
In the results, only lines from `john.txt` appear, and none of them contain `hate`.

In [None]:
from whoosh.query import Term

with ix.searcher(weighting=scoring.TF_IDF()) as s:
    qp = QueryParser("content", ix.schema)
    user_q = qp.parse("love")

    # Only show documents that come from john.txt
    allow_q = Term("filename", "/dsa/data/all_datasets/book/john.txt")
    
    # Don't show any documents where the "content" field contains "hate"
    restrict_q = Term("content", "hate")

    # filter keeps only the allowed documents; mask removes the restricted ones
    results = s.search(user_q, mask=restrict_q, filter=allow_q)
    for hit in results:
        print(hit["filename"], hit["content"], hit.score)

<span style="background:yellow; font-weight:bold">Comprehension Check:</span>  
- How are filtering and masking different?
- How could filtering and masking be useful?
-----



----

Let's put our results into a pandas dataframe.


In [None]:
from IPython.display import display
import pandas as pd

with ix.searcher(weighting=scoring.TF_IDF()) as s:
    qp = QueryParser("content", ix.schema)
    user_q = qp.parse("love")
    
    results = s.search(user_q)  # returns at most 10 hits by default
    print("Total no of matches: ", len(results))
    
    display(results[0])
    
    # Collect one row per hit; iterating over the results avoids an
    # IndexError when the query returns fewer than 10 hits.
    rows = []
    for hit in results:
        rows.append({'filename': hit['filename'],
                     'line_num': hit['line_num'],
                     'line': hit['content'],
                     'docnum': hit.docnum,
                     'score': hit.score,
                     'rank': hit.rank})
       
    df = pd.DataFrame(rows)
    display(df)


----

# Save your notebook, then `File > Close and Halt`