# Building Text Search in Python

In this notebook, we will explore a search engine library, Whoosh (funny name!). We can use this library to  create a naive search engine for our application. 

In Module 5, we learned about the first two steps in creating a search engine: data acquisition and data transformation, such as stemming and lemmatization. The next two steps are index creation and query processing. 

<img src="../images/indexing_process.png" height=400 width=600 />

Fig. An overview of document indexing

<img src="../images/query_process.png" height=400 width=600 />

Fig. An overview of query processing


The Whoosh search engine library implements the following steps for us: 

1. Data transformation 
2. Document indexing
3. Processing queries
    * Parsing the query
    * Ranking the results

So all we need to do is collect the data; the library takes care of the rest. 

--- 
<a id='Whoosh_text' ></a>

## Whoosh

Whoosh started as a quick and dirty search server for the online documentation of the Houdini 3D animation software package. 
Side Effects Software, the company behind Houdini, generously allowed the code to be open sourced, in case it might be useful to anyone else who needs a very flexible or pure-Python search engine (or both!).

  * Whoosh is fast, but uses only pure Python, so it will run anywhere Python runs, without requiring a compiler.
  * By default, Whoosh uses the Okapi BM25F ranking function, but like most things the ranking function can be easily customized.
  * Whoosh creates fairly small indexes compared to many other search libraries.
  * All indexed text in Whoosh must be unicode.
  * Whoosh lets you store arbitrary Python objects with indexed documents.
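
To give a feel for what a BM25-style ranking function computes, here is a minimal sketch of plain BM25 (the single-field relative of BM25F) in pure Python. It is not Whoosh's implementation; the function name `bm25_scores`, the toy documents, and the `k1` and `b` values are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # a term that is absent contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency is dampened and normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up toy documents, already split into lowercase tokens
docs = [
    "the power of the spirit".split(),
    "he was judged by the law".split(),
    "grace and peace".split(),
]
scores = bm25_scores(["power", "judged"], docs)
print(scores)
```

Documents that contain a query term get a positive score, and the third document, which contains neither term, scores zero.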

### What is Whoosh?

Whoosh is a fast, pure Python search engine library.

The primary design impetus of Whoosh is that it is pure Python. 
You should be able to use Whoosh anywhere you can use Python, no compiler or Java required.

Like one of its ancestors, Lucene, Whoosh is not really a search engine; it's a programmer library for creating a search engine.

Practically no important behavior of Whoosh is hard-coded. 
Indexing of text, the level of information stored for each term in each field, parsing of search queries, 
the types of queries allowed, scoring algorithms, etc. are all customizable, replaceable, and extensible.

--- 
<a id='task' ></a>

## Task at hand

**For this lab, we are going to walk through the process of creating full text search capability within Python for integration into other analytical processes.**

You previously read about the _`book`_ data and you have seen the data used as a corpus in a PostgreSQL full text search.
Now, we are going to walk through a similar process to build the search engine in pure Python.
The process takes very little time, and the usefulness of full text search grows with the variety of heterogeneous data that can be integrated with it.

Throughout these steps, try to recognize the similarities to the PostgreSQL process.

--- 
<a id='build_it' ></a>

## Building our Whoosh Schema

In this module we will explore the text of the Bible. It is 4.6 megabytes of text and 31 thousand lines. These files are physically located here: `/dsa/data/all_datasets/book/`. 

**In whoosh, we structure the retrieval system by defining a storage schema, which defines the fields of documents in an index.**

Ref: https://whoosh.readthedocs.io/en/latest/schema.html

> Each document can have multiple fields, such as title, content, url, date, etc.

> Some fields can be indexed, and some fields can be stored with the document so the field value is available in search results. Some fields will be both indexed and stored.


Whoosh provides some useful predefined field types: TEXT, KEYWORD, ID, STORED, NUMERIC, DATETIME, BOOLEAN. We can use the appropriate types for defining the fields of a document. 

By default, fields of type TEXT and KEYWORD are indexed but not stored. Within the TEXT type, we can specify a text analyzer. There is a default analyzer, but we can also pass others, such as `StemmingAnalyzer`, `RegexAnalyzer`, or a `CompositeAnalyzer` (Ref: https://whoosh.readthedocs.io/en/latest/analysis.html) 


Let's create a schema for these book chapters. To make it simpler, we will use only two fields: the name of the file and its content. 


In [None]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(
    filename=ID(stored=True),  
    # by default ID is indexed, but not stored; You would want to store the value of a url field so you 
    # could provide links to the original in your search results.
    
    # Use ID for fields like url or path (the URL or file path of a document), date, category – 
    # fields where the value must be treated as a whole, and each document only has one value for the field.
    
    content=TEXT(analyzer=StemmingAnalyzer())
    
    # TEXT field indexes the text and stores term positions to allow phrase searching. 
    # For identifying terms, we can use various analyzers, and StemmingAnalyzer is one of them. 
    
    )

In [None]:
print(schema)

--- 
<a id='load_it' ></a>

## Loading Data and Creating Index

For this lab, we have created a small folder of a few books in the datasets folder. We will process all the documents in the next lab. 

```Bash
=> ls /dsa/data/all_datasets/book_lite
acts.txt  numbers.txt  romans.txt
```
We will create the _whoosh_ index object on the folder and then a writer object, then ingest the files to add them to the reverse index (more commonly called an inverted index). In _whoosh_, this is basically a table listing every word in the corpus and, for each word, the list of documents in which it appears. 
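
The idea of a reverse (inverted) index can be sketched in a few lines of plain Python. This is only a toy model of the concept, not how Whoosh stores its index; the function name `build_inverted_index` and the sample sentences are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the sorted list of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            # Strip simple punctuation so "power." and "power" are the same word
            index[word.strip(".,;:!?")].add(name)
    return {word: sorted(names) for word, names in index.items() if word}

# Made-up sample sentences standing in for file contents
docs = {
    "acts.txt": "They received power.",
    "romans.txt": "The power of God; the law is holy.",
}
toy_index = build_inverted_index(docs)
print(toy_index["power"])   # ['acts.txt', 'romans.txt']
print(toy_index["law"])     # ['romans.txt']
```

Answering a one-word query is then a single dictionary lookup, which is why this structure makes searching fast even for a large corpus.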

To load the data, we will execute a Python script that follows the basic crawling behavior:

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.
 
 We will create the index in the labs/book_lite_index folder.

In [None]:
import os, os.path
from whoosh import index

os.makedirs("book_lite_index", exist_ok=True) 

# First we need an index for our directory
# Note, this clears the existing index in the directory
ix = index.create_in("book_lite_index", schema)

# In whoosh, you need a writer object to write to an index
# So we get a writer for the created index ix 
writer = ix.writer()


We will define two functions: 

* loadFile: this function processes a single file 
* processFolder: this function crawls the folder tree and processes every file using the first function

In [None]:
def loadFile(writer, fname):   # process a single file 
    '''
    Read the file contents and add them to the index as one document.
    '''
    with open(fname, 'r') as infile:
        content = infile.read()
    writer.add_document(filename=fname, content=content)
    print("Indexed: ", fname)

In [None]:
def processFolder(writer, folder):  # process all the files in a folder and its subfolders 
    '''
    Process a folder for files and subfolders
    '''
    print('# Processing folder: ',folder)
    # os.walk already descends into every subfolder, so no explicit recursion is needed;
    # recursing again would index files twice and would pass bare folder names without their path
    for root, dirs, files in os.walk(folder):
        print("root = ", root)
        # Process Files
        for file in files:     # process all the files in the current dir
            if file.endswith(".txt"):
                filename = os.path.join(root, file)
                print('=> Processing File:',filename)
                loadFile(writer,filename)             # process a file by calling the above function
            else:
                print("Unhandled File")


In [None]:
# Functions defined,  get the party started:
processFolder(writer,"/dsa/data/all_datasets/book_lite")
writer.commit() # save changes

--- 
<a id='search_me' ></a>

## Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html

First, we construct a QueryParser object to use in our queries. It takes the name of the field we will be searching and the `whoosh.fields.Schema` object to use when parsing. We only have two fields, filename and content, and content is what we'll be searching!

```python
qp = QueryParser("content", schema=ix.schema)
```
Calling parse() on the query parser will parse the input string and return a whoosh.query.Query object/tree. 

```python
q = qp.parse("abode")
```
So q is our query in the form of a _whoosh_ query object. We want to search through the index and see what we find. We need to use a searcher object and use the search method on that object to look for our query. Then we can print out all the hits in our results!
```python
with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)
```

In [None]:
from whoosh.qparser import QueryParser


ix = index.open_dir("book_lite_index")   # open the index created earlier
qp = QueryParser("content", schema=ix.schema)

q = qp.parse("abode")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

Let's try some other queries...

In [None]:
q = qp.parse("judged OR power")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

In [None]:
q = qp.parse("wealth")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

In [None]:
q = qp.parse("mightest")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

In [None]:
q = qp.parse("weak AND powerless")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

In [None]:
q = qp.parse("strong AND powerful")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

### You can experiment with some queries of your own in the next cells.

In [None]:
# Write your own query here
# -------------------------





In [None]:
# Write your own query here
# -------------------------





# SAVE YOUR NOTEBOOK, then `File > Close and Halt`