# Exercise: Building and Loading Text Search in Python Whoosh


## OUTLINE
 1. [Whoosh](#Whoosh_text)
 1. [Task at hand](#task)
 1. [Building our Whoosh Schema](#build_it)
 1. [Loading Data](#load_it)
 1. [Executing Queries, Google-lite...very very lite](#search_me) 



--- 
<a id='Whoosh_text' ></a>

## Whoosh

Whoosh was started as a quick and dirty search server for the online documentation of the Houdini 3D animation software package. 
Side Effects Software generously allowed the code to be open source, in case it might be useful to anyone else who needs a very flexible or pure-Python search engine (or both!).

  * Whoosh is fast, but uses only pure Python, so it will run anywhere Python runs, without requiring a compiler.
  * By default, Whoosh uses the Okapi BM25F ranking function, but like most things the ranking function can be easily customized.
  * Whoosh creates fairly small indexes compared to many other search libraries.
  * All indexed text in Whoosh must be unicode.
  * Whoosh lets you store arbitrary Python objects with indexed documents.
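
To give a feel for what a BM25-style ranking function computes, here is a minimal pure-Python sketch of plain single-field BM25. It is not Whoosh's BM25F code: the function name `bm25_scores`, the sample documents, the parameter values `k1=1.2` and `b=0.75`, and the particular IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are penalized via b
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

# Made-up, pre-tokenized documents
docs = [
    "tree frog genus in the family hylidae".split(),
    "the houdini animation package".split(),
    "frog frog frog".split(),
]
scores = bm25_scores(["frog"], docs)
```

Documents that never mention the query term score zero, and a short document that repeats the term scores higher than a longer one that mentions it once.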

### What is Whoosh?

Whoosh is a fast, pure Python search engine library.

The primary design impetus of Whoosh is that it is pure Python. 
You should be able to use Whoosh anywhere you can use Python, no compiler or Java required.

Like one of its ancestors, Lucene, Whoosh is not really a search engine; it is a programmer library for creating a search engine.

Practically no important behavior of Whoosh is hard-coded. 
Indexing of text, the level of information stored for each term in each field, parsing of search queries, the types of queries allowed, scoring algorithms, etc. are all customizable, replaceable, and extensible.

--- 
<a id='task' ></a>

## Task at hand

For this exercise, we are going to walk through the process of creating full text search capability within Python for integration into other analytical processes.

You previously read about the _`book`_ data and saw it used as a corpus for full text search in PostgreSQL, as well as with Whoosh in Python.

Now we are going to go through a similar process to build a search engine in pure Python for a different corpus.

The process takes very little time, and the usefulness of full text search grows with the variety of heterogeneous data that can be integrated with it.

--- 
<a id='build_it' ></a>

## Building our Whoosh Schema

Recall, the `book/` folder is composed of a collection of text files, each its own book chapter.

In Whoosh, we structure the retrieval system by defining a storage schema.

From the lab with the text files:
```
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                )
```

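This tells us that each record has two fields, `(filename, content)`: `filename` is stored with the document, while `content` is indexed through a stemming analyzer but not stored.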

For this exercise, we will be using a few Wikipedia pages for our data source.

### 1) For this exercise, you should look at a few of these web pages:

  * https://en.wikipedia.org/wiki/Nyctimantis
  * https://en.wikipedia.org/wiki/Osteocephalus
  * https://en.wikipedia.org/wiki/Osteopilus
  
Specifically, inspect the HTML source and the 
```HTML
<table class="infobox biota" ... </table>
```

You need to extend the schema definition to collect the table data when available.

In [3]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer()),
                # Extend the schema definition to capture relevant table data            
                Kingdom=TEXT(stored=True),
                Phylum=TEXT(stored=True),
                Class=TEXT(stored=True),
                Order=TEXT(stored=True),
                Family=TEXT(stored=True),
                Subfamily=TEXT(stored=True),
                Genus=TEXT(stored=True),
               )

--- 
<a id='load_it' ></a>

## Loading Data

For this exercise, we have created a small folder of a few Wikipedia pages under the `en.wikipedia.org/wiki` folder in the common datasets folder:

```Bash
sebcq5@jupyter-sebcq5:/dsa/data/all_datasets$ ls en.wikipedia.org/wiki
Acris.html           Charadrahyla.html   Hylidae.html      Megastomatohyla.html  Plectrohyla.html  Sphaenorhynchus.html
Anotheca.html        Corythomantis.html  Hylinae.html      Myersiohyla.html      Pseudacris.html   Tepuihyla.html
Aparasphenodon.html  Dendropsophus.html  Hyloscirtus.html  Nyctimantis.html      Pseudis.html      Tlalocohyla.html
Aplastodiscus.html   Duellmanohyla.html  Hypsiboas.html    Osteocephalus.html    Ptychohyla.html   Trachycephalus.html
Argenteohyla.html    Ecnomiohyla.html    Isthmohyla.html   Osteopilus.html       Scarthyla.html    Triprion.html
Bokermannohyla.html  Exerodonta.html     Itapotihyla.html  Phyllodytes.html      Scinax.html       Xenohyla.html
Bromeliohyla.html    Hyla.html           Lysapsus.html     Phytotriades.html     Smilisca.html

```
You will create the _whoosh_ index files in the `modules/module5/exercises/wiki_index` folder, then ingest the files.

To load the data, a Python script will follow the basic crawling behavior:

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.
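
The steps above can be sketched as an explicitly recursive function. This is only an illustration using the standard library: `collect_html_files` is a hypothetical helper that gathers paths instead of indexing them, and the directory tree is a throwaway one created for the demo.

```python
import os
import tempfile

def collect_html_files(folder):
    """Recursively return the paths of all .html files under `folder`."""
    found = []
    for entry in sorted(os.scandir(folder), key=lambda e: e.name):
        if entry.is_dir():
            # Folder: recurse and process its contents
            found.extend(collect_html_files(entry.path))
        elif entry.name.endswith(".html"):
            # File: a real loader would read and index the contents here
            found.append(entry.path)
    return found

# Usage on a temporary directory tree with made-up file names
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("Hyla.html", "notes.txt", os.path.join("sub", "Acris.html")):
        with open(os.path.join(root, rel), "w", encoding="utf-8") as f:
            f.write("<html></html>")
    names = [os.path.relpath(p, root) for p in collect_html_files(root)]
```

The non-HTML file is skipped and the file in the subfolder is found through the recursive call.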
 
## Follow the lab for Python IR with whoosh to complete this exercise.

### Step 2) Create / Initialize the whoosh index and get the `writer` object.

In [14]:
import os, os.path
from whoosh import index

# Step 2 below this comment
# create_in() expects the target folder to exist already
os.makedirs('wiki_index', exist_ok=True)
ix = index.create_in('wiki_index', schema)

writer = ix.writer()


### 3) Adapt the helper functions

Note the subtle changes.
Adapt the code below, which recursively parses the HTML (`.html`) files, so that it indexes them with the Whoosh API.
Trust no code; verify all code segments.

HINT: When parsing the `<table class="infobox biota" ... </table>` data, consider the difference between `.string` and `.get_text()`, and experiment to see whether it makes a difference.
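
To make the hint concrete, here is a small experiment on a hand-written table. The HTML is a simplified stand-in, not the actual Wikipedia markup, and the `Author, Year` text is a placeholder.

```python
from bs4 import BeautifulSoup

# Simplified, made-up infobox rows
html = """
<table class="infobox biota">
  <tr><td>Family:</td><td><a href="#">Hylidae</a></td></tr>
  <tr><td>Genus:</td><td><i><b>Plectrohyla</b></i><br/><small>Author, Year</small></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

family_cell = rows[0].find_all("td")[1]
genus_cell = rows[1].find_all("td")[1]

family_string = family_cell.string   # one nested child, so its text is returned
genus_string = genus_cell.string     # several children, so the result is None
genus_text = genus_cell.get_text()   # text of all descendants, concatenated
```

In this sketch `.string` works for the simple cell but returns `None` for the cell with several child tags, while `.get_text()` always returns a string that may need trimming afterwards.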

In [9]:
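from bs4 import BeautifulSoup
from bs4.element import Comment
import re

def visible(element):
    '''
    Return True if a text node would be shown to a reader of the page.
    '''
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # bs4 exposes HTML comments as Comment nodes whose str() omits the <!-- -->
    # markers, so test the node type rather than matching on the text
    elif isinstance(element, Comment):
        return False
    return True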


In [6]:
def pullBiota(content):
    '''
    Extract label/value pairs from the infobox table(s) of a page.
    '''
    data = {}

    # Process the <table class="infobox biota" ...> ... </table> data
    soup = BeautifulSoup(content, 'html.parser')

    # Match any table whose class attribute includes "infobox"
    tables = soup.findAll('table', attrs = {'class': re.compile(r'\binfobox\b')})

    for table in tables:
        for row in table.findAll('tr'):
            cells = row.findAll('td')
            if len(cells) == 2:
                # .string is None when a cell holds several child tags,
                # so use .get_text(), which always returns a str
                label = cells[0].get_text().strip().rstrip(':')
                # Keep only the first line of the value cell
                value = cells[1].get_text().split('\n', 1)[0]
                if value:
                    data[label] = value

    return data

In [7]:
def loadFile(writer, fname):
    '''
    Read file contents, load into database.
    '''
    with open(fname, 'r', encoding="utf-8") as infile:
        #visible_texts=infile.read()
        content=infile.read()
        
        
        soup = BeautifulSoup(content, 'html.parser')    

        texts = soup.findAll(text=True)
        
        # Process all the visible text
        visible_texts = filter(visible, texts)
        
        # Assemble all visible text into a single content string
        page_text = " ".join(line.strip("\n") for line in visible_texts)

        # Pull the <table class="infobox biota" ...> ... </table> data
        tableOut = pullBiota(content)

        # Write to the index
        writer.add_document(filename=fname,
                            content=page_text,
                            Kingdom=tableOut.get('Kingdom'),
                            Phylum=tableOut.get('Phylum'),
                            Class=tableOut.get('Class'),
                            Order=tableOut.get('Order'),
                            Family=tableOut.get('Family'),
                            Subfamily=tableOut.get('Subfamily'),
                            Genus=tableOut.get('Genus')
                           )
        print("Indexed: ",fname)
            

In [10]:
# Test 
# loadFile(writer,'./wiki_index/Plectrohyla.html')
loadFile(writer,'/dsa/data/all_datasets/en.wikipedia.org/wiki/Plectrohyla.html')
writer.commit() # save changes



Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Plectrohyla.html


In [11]:
from whoosh.qparser import QueryParser
qp = QueryParser("content", schema=ix.schema)
q = qp.parse(u"frog")                             

with ix.searcher() as s:
     results = s.search(q)
     for hit in results:
         print(hit)



<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Genus': 'Plectrohyla', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'Subfamily': 'Hylinae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Plectrohyla.html'}>


In [12]:
def processFolder(writer,folder):
    '''
    Process a folder for files and subfolders
    '''
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        print("root = ", root)
        # Process Files
        for file in files:
            if file.endswith(".html"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File: ",file)
            

### 4) Parse with our defined functions in place.

In [15]:
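# Start processing the folder and commit the work
# ---------------------------------------------------
import sys
sys.setrecursionlimit(1000)

# The test cell above committed (and thereby closed) the first writer,
# so re-create the index and get a fresh writer before the full load.
# Re-creating also drops the test document, so no page is indexed twice.
ix = index.create_in('wiki_index', schema)
writer = ix.writer()

# Functions defined, get the party started:
processFolder(writer, '/dsa/data/all_datasets/en.wikipedia.org/wiki')
writer.commit() # save changes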


Processing folder:  /dsa/data/all_datasets/en.wikipedia.org/wiki
root =  /dsa/data/all_datasets/en.wikipedia.org/wiki
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Bokermannohyla.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/w

--- 
<a id='search_me' ></a>

### 5) Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html
  
Previously, we hard-coded query strings into the code cells.

Now, use the `input()` function to collect a query string from the user.
Then execute the search.

In [20]:
from whoosh.qparser import QueryParser

# Write your code below this comment:
# --------------------------------------
val = input("What are you searching for? ")


What are you searching for? frog


In [22]:
# input() already returns a str in Python 3, so the value can be parsed as-is
qp = QueryParser("content",schema=ix.schema)
q = qp.parse(val)

with ix.searcher() as s:
    results = s.search(q)
    print("Found {}".format(len(results)))
    for hit in results:
        print(hit)

Found 41
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Genus': 'Dendropsophus', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'Subfamily': 'Hylinae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Dendropsophus.html'}>
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Genus': 'Hypsiboas', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'Subfamily': 'Hylinae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Hypsiboas.html'}>
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Genus': 'Hyla', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'Subfamily': 'Hylinae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Hyla.html'}>
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Hylidae.html'}>
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Genus': 'Plectrohyla', 'Kingdom': 'Animalia', 'Order': 'Anur

### 6) Write example queries to ensure you can search the index 

That is, make sure you can search on the fields you added to the index from the infobox biota table.

```HTML
<table class="infobox biota" ... </table>
```

In [24]:
from whoosh.qparser import QueryParser

# Write your code below this comment:
# --------------------------------------
# example search : Ptychohyla in any field, including content
phrase = input("search for: ")
phrase

search for: Ptychohyla


'Ptychohyla'

In [25]:
from whoosh import qparser

qp = qparser.MultifieldParser(["content","Kingdom","Phylum","Class","Order","Family","Genus"], 
                              schema=ix.schema,group=qparser.OrGroup)
q = qp.parse(phrase)

with ix.searcher() as s:
    results = s.search(q)
    print("Found {}".format(len(results)))
    for hit in results:
        print(hit)

Found 3
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Genus': 'Ptychohyla', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'Subfamily': 'Hylinae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Ptychohyla.html'}>
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'Subfamily': 'Hylinae', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Hylinae.html'}>
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Hylidae.html'}>


In [26]:
# Write your code below this comment:
# --------------------------------------
# example search for Anura, an order
phrase = input("search for: ")
phrase

# Omit "content" so that only the infobox fields are searched
qp = qparser.MultifieldParser(["Kingdom","Phylum","Class","Order","Family","Genus"],
                              schema=ix.schema,group=qparser.OrGroup)
q = qp.parse(phrase)

with ix.searcher() as s:
    results = s.search(q)
    print(results)
    for hit in results:
        print(hit)


search for: Anura
<Top 10 Results for Or([Term('Kingdom', 'anura'), Term('Phylum', 'anura'), Term('Class', 'anura'), Term('Order', 'anura'), Term('Family', 'anura'), Term('Genus', 'anura')]) runtime=0.0014877889771014452>
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Genus': 'Acris', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html'}>
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Genus': 'Anotheca', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html'}>
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Genus': 'Aparasphenodon', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'filename': '/dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html'}>
<Hit {'Class': 'Amphibia', 'Family': 'Hylidae', 'Genus': 'Aplastodiscus', 'Kingdom': 'Animalia', 'Order': 'Anura', 'Phylum': 'Chordata', 'filename': '/

# SAVE YOUR NOTEBOOK WITH ALL EXECUTED CELLS
# Then, `File > Close and Halt`