# Building and Loading Text Search in Python with Whoosh Using TF-IDF

For this practice, we will build full-text search in Python, as we did in the lab, using TF-IDF scoring.

This time, our data is in the folder **`/dsa/data/all_datasets/hp`**  - but no, 
this is not Hewlett Packard documentation. 
It is something much more enchanting!

**Please review some of the files in `/dsa/data/all_datasets/hp`.**



In [None]:
! ls /dsa/data/all_datasets/hp

To read the first 10 lines of a file, we can use the `head` command from bash.

In [None]:
! head -n 10 /dsa/data/all_datasets/hp/CHAPTER\ 1.txt
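
The same preview can be done from Python. The block below is a minimal sketch and not part of the original exercise: `preview_lines` is a hypothetical helper, and the demo writes a small temporary file with made-up content so that it runs anywhere.

```python
import itertools
import os
import tempfile


def preview_lines(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8") as infile:
        # islice stops after n lines, so large files are not read in full
        return [line.rstrip("\n") for line in itertools.islice(infile, n)]


# Demo on a throwaway file (hypothetical content, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as out:
        out.write("\n".join(f"line {i}" for i in range(1, 21)))
    first_three = preview_lines(sample_path, n=3)

print(first_three)
```

On the course server, calling `preview_lines("/dsa/data/all_datasets/hp/CHAPTER 1.txt")` should mirror the `head` call above.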

Throughout the practice, reflection questions are asked. 
Take the time to answer them - consult the documentation for libraries and functions if needed, 
experiment with the code, and ask your classmates.


## 1. Building the Whoosh Schema that supports searches at the line level

Import the necessary libraries and build a schema including filename, line_num and content.

In [None]:
# Add your code below 
# -----------------------

#TO DO: import the necessary libraries for this step
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

#TO DO: build the schema
schema = Schema(filename=ID(stored=True),
                line_num=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
               )

#### Reflection

 - Which libraries did you import and why?

 - Explain how you built the schema - did you use ID, TEXT, KEYWORD or STORED? 
 - If so, where and why? ([Documentation available here](http://whoosh.readthedocs.io/en/latest/schema.html))



----

## 2. Loading the Data

* In the first cell, import any libraries you need, create the index in the folder `hp_index` within the practices folder, and get a writer for the index.
* In the second cell, complete the `loadFile` function.
* In the third cell, process the folder and persist your changes.

In [None]:
# Add your code below 
# -----------------------

#TO DO: import the necessary libraries for this step
import os
from whoosh import index

#TO DO: Create the index

# create index dir
os.makedirs("hp_index", exist_ok=True)

# Note, this clears the existing index in the directory
ix = index.create_in("hp_index", schema)


#TO DO: Get a writer from the created index
writer = ix.writer()

In [None]:
# Complete code below 
# -----------------------

def loadFile(writer, fname):
    '''
    Read a file and add one document per line to the index.
    '''
    with open(fname, 'r', encoding="utf-8") as infile:
        # TODO: create indexes for each line in the input file
        # enumerate(..., start=1) numbers lines from 1, matching the file
        for line_no, line in enumerate(infile, start=1):
            line = line.rstrip('\n')
            writer.add_document(filename=fname, line_num=str(line_no), content=line)
        #-------------------------------------------------------
        print("Indexed: ", fname)


def processFolder(writer,folder):
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        # add a new line to separate folders in the output
        print("\nroot = ", root)
        # Process Files
        for file in files:
            if file.endswith(".txt"):
                filename = os.path.join(root, file)
                print('root:', root, '; file:', file, '; filename:', filename)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File")

In [None]:
# Add your code below 
# -----------------------

# TODO: process the folder and persist your changes 
# Functions defined,  get the party started:
processFolder(writer,"/dsa/data/all_datasets/hp")

writer.commit() # save changes
    
    

#### Reflection

 - Which libraries did you import and why?
 - In loadFile, how did you get the line number for each line?
 - In loadFile, which code line adds an index for the processed line?
 - In processFolder, what does the following line do? Give an example.
```
filename = os.path.join(root, file)
```
 - Which line of code makes sure the index gets persisted? (How is it saved so it can be used?)
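
For experimenting with these questions, a dry run of the folder traversal without Whoosh can help. The sketch below is an illustration rather than the required solution: `collect_line_records` is a hypothetical helper, and the folder and file contents are made up. It returns the `(filename, line_num, content)` tuples that would otherwise be passed to `writer.add_document`.

```python
import os
import tempfile


def collect_line_records(folder):
    """Walk folder and return (filename, line_num, content) for each .txt line."""
    records = []
    for root, dirs, files in os.walk(folder):
        for file in sorted(files):
            if not file.endswith(".txt"):
                continue
            # os.path.join combines the directory and the bare file name
            filename = os.path.join(root, file)
            with open(filename, "r", encoding="utf-8") as infile:
                # enumerate(..., start=1) numbers lines starting at 1
                for line_no, line in enumerate(infile, start=1):
                    records.append((filename, str(line_no), line.rstrip("\n")))
    return records


# Demo folder with one .txt file and one file that should be skipped
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "CHAPTER 1.txt"), "w", encoding="utf-8") as out:
        out.write("first line\nsecond line\n")
    with open(os.path.join(tmp, "notes.md"), "w", encoding="utf-8") as out:
        out.write("ignored\n")
    records = collect_line_records(tmp)
    expected_name = os.path.join(tmp, "CHAPTER 1.txt")

for filename, line_num, content in records:
    print(os.path.basename(filename), line_num, content)
```

Printing the records before indexing is a cheap way to confirm that line numbers start where you expect and that only `.txt` files are picked up.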

----

## 3. Executing Queries
* In the first cell, import any libraries you need, and find the indexes of lines where the string 'Harry' appears. Display the top 10 hits.
* In the second cell, import any additional libraries you need, and find the indexes of lines where the string 'Harry' appears using TF-IDF as the scoring mechanism. Display the top 10 hits.
* In the third cell, import any additional libraries you need, and use a filter to list the indexes in chapter 6 corresponding to the search string 'Harry' using TF-IDF as the scoring mechanism. Display the top 10 hits.
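
As background for the TF-IDF cells, the following sketch computes a textbook TF-IDF score by hand. It is an assumption-laden illustration: the three-line corpus is made up, tokenization is a plain `split()`, and the formula `tf * log(N / df)` is the common textbook form, which may differ in detail from what Whoosh's `scoring.TF_IDF` computes.

```python
import math
from collections import Counter

# Hypothetical three-line corpus; each line plays the role of one document
lines = [
    "harry looked at the owl",
    "the owl looked back",
    "harry harry harry",
]
tokenized = [line.split() for line in lines]
num_docs = len(tokenized)


def document_frequency(term):
    """Number of documents that contain term at least once."""
    return sum(1 for tokens in tokenized if term in tokens)


def tf_idf(term, tokens):
    """Textbook TF-IDF: raw term count times log(N / df)."""
    tf = Counter(tokens)[term]
    df = document_frequency(term)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(num_docs / df)


scores = [tf_idf("harry", tokens) for tokens in tokenized]
# Rank document positions by descending score, as a search engine would
ranking = sorted(range(num_docs), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

In this form the score grows linearly with the number of occurrences, so a line that repeats the term ranks above a line that mentions it once.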

In [None]:
# Add your code below 
# -----------------------

#TO DO: import the necessary libraries for this step
from whoosh.qparser import QueryParser

#TO DO: Find the indexes of lines where the string 'Harry' appears. 
qp = QueryParser("content", schema=ix.schema)
q = qp.parse("Harry")

#TO DO: display the top 10 hits
# NOTE: By default, search() returns at most the top 10 hits (limit=10),
# so an explicit limit is not necessary.
with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit['filename'], hit['line_num'], hit.score, hit.rank)


In [None]:
# Add your code below 
# -----------------------

#TO DO: import the necessary libraries for this step
from whoosh.qparser import QueryParser
from whoosh import scoring

#TO DO: Find the indexes of lines where the string 'Harry' appears using TF_IDF as the scoring mechanism. 
qp = QueryParser("content", schema=ix.schema)
q = qp.parse("Harry")

#TO DO: display the top 10 hits
# TF_IDF replaces Whoosh's default BM25F weighting for this search
with ix.searcher(weighting=scoring.TF_IDF()) as s:
    results = s.search(q)
    for hit in results:
        print(hit['filename'], hit['line_num'], hit.score, hit.rank)


In [None]:
# Add your code below 
# -----------------------

#TO DO: import the necessary libraries for this step
from whoosh.qparser import QueryParser
from whoosh import scoring
from whoosh.query import Term

#TO DO: Use a filter to list the indexes in chapter 6 corresponding to the search string 'Harry' 
# using TF_IDF as the scoring mechanism. 
with ix.searcher(weighting=scoring.TF_IDF()) as s:
    qp = QueryParser("content", ix.schema)
    user_q = qp.parse("Harry")

    # Only allow documents from chapter 6
    allow_q = Term("filename", "/dsa/data/all_datasets/hp/CHAPTER 6.txt")

    #TO DO: display the top 10 hits
    results = s.search(user_q, filter=allow_q)
    for hit in results:
        print(hit['filename'], hit['line_num'], hit.score, hit.rank)


#### Reflection

 - Which libraries did you import and why?
 - What differences do you see in the results of the first two cells?
 - What do those differences mean?
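
One thing to look for when comparing scores: Whoosh's default weighting is BM25F, which behaves differently from plain TF-IDF. The sketch below contrasts the two on a single term. It uses a common form of the BM25 term score; the `k1` and `b` values, the IDF value, and the line lengths are all illustrative assumptions, and both functions are hypothetical helpers rather than Whoosh APIs.

```python
import math


def bm25_term_score(tf, idf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Common BM25 term score with saturation (k1) and length normalization (b)."""
    norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)


def tf_idf_term_score(tf, idf):
    """Plain TF-IDF term score: grows linearly with the term count."""
    return tf * idf


idf = math.log(10)  # hypothetical: the term appears in 1 of 10 documents

# Same length, more occurrences: TF-IDF doubles, BM25 grows by less
tfidf_gain = tf_idf_term_score(2, idf) / tf_idf_term_score(1, idf)
bm25_gain = bm25_term_score(2, idf, 10, 10) / bm25_term_score(1, idf, 10, 10)

# Same count, different length: BM25 favors the shorter line
short_line = bm25_term_score(1, idf, doc_len=5, avg_doc_len=10)
long_line = bm25_term_score(1, idf, doc_len=20, avg_doc_len=10)

print(tfidf_gain, bm25_gain, short_line > long_line)
```

Under these assumptions, repeated occurrences add less and less to a BM25 score, and line length affects BM25 but not plain TF-IDF.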

----

# Save your notebook, then `File > Close and Halt`