# Exercise: Building and Loading Text Search in Python Whoosh

--- 
<a id='task' ></a>

## Task at hand


In this exercise, we walk through building a full-text search capability in Python that can be integrated into other analytical processes.

You previously worked with the _`book`_ data. In this exercise, we will work with some wiki data. 

--- 
<a id='build_it' ></a>

## Building our Whoosh Schema

Recall, the `book/` folder is composed of a collection of text files, each its own book chapter.

In whoosh, we structure the retrieval system by defining a storage schema.

From the lab with the text files:
```
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                )
```

This defines each record as a `(filename, content)` pair.

For this exercise, we will be using a few Wikipedia pages for our data source.

### 1) For this exercise, you should look at a few of these web pages:

  * https://en.wikipedia.org/wiki/Nyctimantis
  * https://en.wikipedia.org/wiki/Osteocephalus
  * https://en.wikipedia.org/wiki/Osteopilus
  
Specifically, inspect the HTML source and locate the infobox table:
```HTML
<table class="infobox biota" ...> ... </table>
```
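
To get a feel for how the rows of such a table map onto searchable fields, here is a rough sketch that turns label/value rows into a dictionary. It deliberately uses only the standard library's `html.parser` rather than BeautifulSoup, and the `SAMPLE` markup is a simplified, hypothetical stand-in for a real infobox (real pages carry many more rows and attributes). The class name `InfoboxRows` is likewise made up for this example.

```python
from html.parser import HTMLParser

# Assumption: a heavily simplified infobox, not copied from a real page
SAMPLE = """
<table class="infobox biota">
<tr><th colspan="2">Osteopilus</th></tr>
<tr><td>Kingdom:</td><td><a href="/wiki/Animal">Animalia</a></td></tr>
<tr><td>Family:</td><td><a href="/wiki/Hylidae">Hylidae</a></td></tr>
<tr><td>Genus:</td><td><i>Osteopilus</i></td></tr>
</table>
"""

class InfoboxRows(HTMLParser):
    """Collect the cell texts of every row inside the biota infobox."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside the infobox table
        self.in_cell = False  # True while inside a <td> or <th>
        self.rows = []        # one list of cell strings per <tr>

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if tag == "table" and (self.depth or {"infobox", "biota"} <= classes):
            self.depth += 1                 # entering the infobox (or a nested table)
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.depth and self.in_cell:
            self.rows[-1][-1] += data       # text of links and italics lands here too

parser = InfoboxRows()
parser.feed(SAMPLE)

# Keep only two-cell rows whose first cell looks like a "Rank:" label
taxonomy = {}
for cells in parser.rows:
    if len(cells) == 2 and cells[0].strip().endswith(":"):
        taxonomy[cells[0].strip().rstrip(":").lower()] = cells[1].strip()

print(taxonomy)
```

The header row has a single cell, so the two-cell check skips it; the same check is one way to ignore rows that do not describe a taxonomic rank.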



<img src="../images/table_inspect.png" height=400 width=600 />



**Task: You need to extend the above schema definition to collect this frog table data when available.**

* `content` will be all of the visible text on the HTML page
* Table information such as kingdom, phylum, class, order, family, subfamily, and genus should be searchable

In [None]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                # Extend the schema definition to capture relevant table data
                
               )

--- 
<a id='load_it' ></a>

## Loading Data

For this exercise, we have created a small folder of a few Wikipedia pages under the `en.wikipedia.org/wiki` folder in the common datasets folder:


In [None]:
! ls /dsa/data/all_datasets/en.wikipedia.org/wiki



You will create the _whoosh_ index files in the `modules/module6/exercises/wiki_index` folder, then ingest the files.

To load the data, write a Python script that follows this basic crawling behavior:

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.
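
The crawling steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function name `crawl` and the `handle_file` callback are hypothetical, and in the exercise the callback's role would be played by code that adds the file to the index. The demo builds a temporary folder tree with made-up file names so it can run anywhere.

```python
import os
import tempfile

def crawl(folder, handle_file):
    """Visit every entry under folder, recursing into subfolders."""
    for entry in sorted(os.scandir(folder), key=lambda e: e.name):
        if entry.is_dir():
            crawl(entry.path, handle_file)              # folder: recurse into it
        elif entry.is_file():
            with open(entry.path, "r", encoding="utf-8") as infile:
                handle_file(entry.path, infile.read())  # file: read and hand off

# Demo on a throwaway folder tree (hypothetical file names)
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "wiki", "sub"))
    for rel in ("wiki/a.html", "wiki/sub/b.html"):
        with open(os.path.join(top, *rel.split("/")), "w", encoding="utf-8") as f:
            f.write("<html>" + rel + "</html>")

    found = []
    crawl(top, lambda path, text: found.append(os.path.relpath(path, top)))

print(found)
```

The helper code later in this exercise uses `os.walk`, which performs the same recursion without an explicit recursive call.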
 
## Follow the lab for Python IR with whoosh to complete this exercise.

### 2) Create / Initialize the whoosh index and get the `writer` object.

In [None]:
import os, os.path
from whoosh import index

# Step 2 below this comment



### 3) Adapt the helper functions

Note the subtle changes from the lab.
Adapt the code below so that it recursively parses the HTML (`.html`) files and indexes them with the Whoosh API.
Trust no code; verify every code segment.


In [None]:
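from bs4 import BeautifulSoup
from bs4.element import Comment

def visible(element):  # return True for html elements that are visible as text
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']: # non-visible html tags
        return False
    elif isinstance(element, Comment): # html comments; str(element) omits the <!-- --> markers, so test the node type
        return False
    return True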

def pullBiota(soup):  
        
    data = {}   
   
    table = soup.find(<write your code>)
        
    for row in table.find_all(<write your code>):
        cells = row.find_all(<write your code>)
        # TODO: process the cells and populate the data variable
            
    return data


def loadFile(writer, fname):
    '''
    Read file contents, load into database.
    '''
    with open(fname, 'r') as infile:
        content=infile.read()   # read html content
        
        soup = BeautifulSoup(content, 'html.parser')  # parse the text read above
        texts = soup.find_all(text=True)
        
        # Process all the visible text
        visible_texts = filter(visible, texts)
        
        # TODO: Assemble all visible_texts into a content string
        # Hint: Iterate over visible_texts line by line; remove newlines; create a concatenated string
        

        # TODO: Process the <table class="infobox biota" ...> ... </table> data
        infotable = pullBiota(<write your code>)
        
        # Write to the index
        print("Indexed: ", fname)

def processFolder(writer,folder):
    '''
    Process a folder for files and subfolders
    '''
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        print("root = ", root)
        # Process Files
        for file in files:
            if file.endswith(".html"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File")



### 4) Parse with our defined functions in place.

In [None]:
# Start processing the folder and commit the work
# ---------------------------------------------------




--- 
<a id='search_me' ></a>

### 5) Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html
  
Previously, we hard-coded query strings into the code cells.

Now, use the `input()` function to collect a query string from the user. 
Then execute the search. For this task, focus only on the `content` field. 

In [None]:
from whoosh.qparser import QueryParser

# Write your code below this comment:
# --------------------------------------








### 6) Write two example queries to ensure you can search the index 

That is, make sure you can search on the fields you added to the index from the infobox biota table.

```HTML
<table class="infobox biota" ...> ... </table>
```
For this search, we will ignore the `content` field and search over the other fields. We can use `MultifieldParser` to specify the fields of interest.


In [None]:
# Write your code below this comment:
# --------------------------------------
from whoosh import qparser  # needed for qparser.OrGroup below
from whoosh.qparser import MultifieldParser


# OMIT CONTENT
qp = MultifieldParser(["Kingdom","Phylum","Class","Order","Family","Genus"], 
                      schema=ix.schema, group=qparser.OrGroup)  




In [None]:
# Write your code below this comment:
# --------------------------------------

# OMIT CONTENT
qp = MultifieldParser(["Kingdom","Phylum","Class","Order","Family","Genus"], 
                      schema=ix.schema, group=qparser.OrGroup)





# SAVE YOUR NOTEBOOK WITH ALL EXECUTED CELLS
# Then, `File > Close and Halt`