# Exercise: Building and Loading Text Search in Python Whoosh

--- 
<a id='task' ></a>

## Task at hand


For this exercise, we are going to walk through the process of creating full text search capability within Python for integration into other analytical processes.

You previously worked with the _`book`_ data. In this exercise, we will work with some wiki data. 

--- 
<a id='build_it' ></a>

## Buiding our Whoosh Schema

Recall, the `book/` folder is composed of a collection of text files, each its own book chapter.

In whoosh, we structure the retrieval system by defining a storage schema.

From the lab with the text files:
```
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                )
```

This tells us we are defining records to have a `(filename, content)`

For this exercise, we will be using a few Wikipedia pages for our data source.

### 1) For this exercise, you should look at a few of these web pages:

  * https://en.wikipedia.org/wiki/Nyctimantis
  * https://en.wikipedia.org/wiki/Osteocephalus
  * https://en.wikipedia.org/wiki/Osteopilus
  
Specifically, inspect the HTML source and the 
```HTML
<table class="infobox biota" ... </table>
```



<img src="../images/table_inspect.png" height=400 width=600 />



**Task: You need to extend the above schema definition to collect this frog table data when available.**

* Content will be the all visible text on the html page
* Table information such as kingdom, phylum, class, order, family, subfamily, genus should be searchable 

In [24]:
from whoosh.fields import Schema, TEXT, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(
    filename=ID(stored=True),  
    content=TEXT(analyzer=StemmingAnalyzer()),  
    kingdom=TEXT(stored=True), 
    phylum=TEXT(stored=True),   
    class_=TEXT(stored=True),   
    order=TEXT(stored=True),    
    family=TEXT(stored=True),   
    subfamily=TEXT(stored=True),
    genus=TEXT(stored=True)    
)



--- 
<a id='load_it' ></a>

## Loading Data

For this exercise, we have created a small folder of a few Wikipedia pages under the `en.wikipedia.org/wiki` folder in the common datasets folder:


In [25]:
! ls /dsa/data/all_datasets/en.wikipedia.org/wiki

Acris.html	     Hylidae.html	   Plectrohyla.html
Anotheca.html	     Hylinae.html	   Pseudacris.html
Aparasphenodon.html  Hyloscirtus.html	   Pseudis.html
Aplastodiscus.html   Hypsiboas.html	   Ptychohyla.html
Argenteohyla.html    Isthmohyla.html	   Scarthyla.html
Bokermannohyla.html  Itapotihyla.html	   Scinax.html
Bromeliohyla.html    Lysapsus.html	   Smilisca.html
Charadrahyla.html    Megastomatohyla.html  Sphaenorhynchus.html
Corythomantis.html   Myersiohyla.html	   Tepuihyla.html
Dendropsophus.html   Nyctimantis.html	   Tlalocohyla.html
Duellmanohyla.html   Osteocephalus.html    Trachycephalus.html
Ecnomiohyla.html     Osteopilus.html	   Triprion.html
Exerodonta.html      Phyllodytes.html	   Xenohyla.html
Hyla.html	     Phytotriades.html


In [26]:
cat /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Cricket frog - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Cricket_frog","wgTitle":"Cricket frog","wgCurRevisionId":730810557,"wgRevisionId":730810557,"wgArticleId":2953907,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with 'species' microformats","All articles with dead external links","Articles with dead external links from July 2016","Articles with permanently dead external links","Commons category with local link same as on Wikidata","All stub articles","Hylinae","Acris","Hylinae stubs"],"wgBreakFrames":false,"wgPageContentL



You will create the _whoosh_ index files in the `modules/module6/exercises/wiki_index` folder then ingest the files.

To load the data, write a python script that follow the basic crawling behavior

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.
 
## Follow the lab for Python IR with whoosh to complete this exercise.

### 2) Create / Initialize the whoosh index and get the `writer` object.

In [27]:
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import StemmingAnalyzer
from bs4 import BeautifulSoup
from whoosh import writing

schema = Schema(
    filename=ID(stored=True),
    content=TEXT(analyzer=StemmingAnalyzer()),
    kingdom=TEXT(stored=True),
    phylum=TEXT(stored=True),
    class_=TEXT(stored=True),
    order=TEXT(stored=True),
    family=TEXT(stored=True),
    subfamily=TEXT(stored=True),
    genus=TEXT(stored=True)
)

def parse_html(html):

    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    
    kingdom = "Animalia"
    phylum = "Chordata"
    class_ = "Amphibia"
    order = "Anura"
    family = "Hylidae"
    subfamily = "Hylinae"
    genus = "Hyla"

    for row in table.find_all('tr'):
        cells = row.find_all(['td'])
            
        label_found = False
        label = None
        for cell in cells:
            if not cell:
                continue
            if not label_found:
                label = cell.get_text()
                if label:
                    if "kingdom:" in label.lower():
                        label_found = True
                    if "phylum:" in label.lower():
                        label_found = True
                    if "class:" in label.lower():
                        label_found = True
                    if "order:" in label.lower():
                        label_found = True
                    if "family:" in label.lower():
                        label_found = True
                    if "subfamily:" in label.lower():
                        label_found = True
                    if "genus:" in label.lower():
                        label_found = True
                        continue
                
            if label_found:
                if cell.find('a') and cell.find('a').has_attr('title') and cell.find('a')['title']:
                    if "kingdom:" in label.lower():
                        kingdom = cell.find('a')['title']
                    if "phylum:" in label.lower():
                        phylum = cell.find('a')['title']
                    if "class:" in label.lower():
                        class_ = cell.find('a')['title']
                    if "order:" in label.lower():
                        order = cell.find('a')['title']
                    if "family:" in label.lower():
                        family = cell.find('a')['title']
                    if "subfamily:" in label.lower():
                        subfamily = cell.find('a')['title']
                    if "genus:" in label.lower():
                        genus = cell.find('a')['title']
    return(kingdom, phylum, class_, order, family, subfamily, genus)

index_dir = "modules/module6/exercises/wiki_index"
if not os.path.exists(index_dir):
    os.mkdir(index_dir)

ix = create_in(index_dir, schema)

def add_files_to_index(dirname):
    writer = ix.writer()
    try:
        for root, dirs, files in os.walk(dirname):
            for filename in files:
                if filename.endswith('.html'):
                    filepath = os.path.join(root, filename)
                    with open(filepath, 'r', encoding='utf-8') as file:
                        content = file.read()
                        kingdom, phylum, class_, order, family, subfamily, genus = parse_html(content)

                        soup = BeautifulSoup(content, 'html.parser')
                        text_content = soup.get_text()

                        writer.add_document(
                            filename=filename,
                            content=text_content,
                            kingdom=kingdom,
                            phylum=phylum,
                            class_=class_,
                            order=order,
                            family=family,
                            subfamily=subfamily,
                            genus=genus
                        )
        writer.commit()
    except LockError:

        ix.unlock() 
        writer.commit(mergetype=writing.CLEAR)  
        

### 3) Adapt the helper functions

Note the subtle changes.
Please adapt the code below such as provided recursive parsing of the HTML (.html) files, indexing with the Whoosh API.
Trust no code, verify all code segments.


In [28]:
import os
import re
from bs4 import BeautifulSoup
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import StemmingAnalyzer
from whoosh.index import LockError
from whoosh.index import open_dir, create_in

schema = Schema(
    filename=ID(stored=True),  
    content=TEXT(analyzer=StemmingAnalyzer())  
)

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)): 
        return False
    return True

def extract_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    texts = soup.find_all(text=True)
    visible_texts = filter(visible, texts)
    return " ".join(t.strip() for t in visible_texts if t.strip())

def load_file(writer, fname):
    with open(fname, 'r', encoding='utf-8') as infile:
        html = infile.read()

    content = extract_text(html)

    writer.add_document(
        filename=os.path.basename(fname),
        content=content
    )
    print(f"Indexed: {fname}")

def process_folder(writer, folder):
    print(f"Processing folder: {folder}")
    for root, dirs, files in os.walk(folder):
        print(f"root = {root}")
       
        for file in files:
            if file.endswith(".html"):
                filename = os.path.join(root, file)
                print(f"Processing File: {filename}")
                load_file(writer, filename)
            else:
                print(f"Unhandled File: {file}")

def create_index(index_dir, data_folder):
  
    lock_file_path = os.path.join(index_dir, "MAIN_WRITELOCK")
    if os.path.exists(lock_file_path):
        os.remove(lock_file_path)

    if not os.path.exists(index_dir):
        os.makedirs(index_dir)

    if index.exists_in(index_dir):
        ix = open_dir(index_dir)
    else:
        ix = create_in(index_dir, schema)

    writer = None
    try:
        writer = ix.writer()
        process_folder(writer, data_folder)
    except LockError:  
        print("Index is locked. Please try again after the lock is released.")
    finally:
        if writer is not None:
            writer.commit()
            print("Indexing complete.")

data_folder = '/dsa/data/all_datasets/en.wikipedia.org/wiki'  
index_dir = 'modules/module6/exercises/wiki_index'  

create_index(index_dir, data_folder)




 


Processing folder: /dsa/data/all_datasets/en.wikipedia.org/wiki
root = /dsa/data/all_datasets/en.wikipedia.org/wiki
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Bokermannohyla.html
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Boke

### 4) Parse with our defined functions in place.

In [40]:
# Start processing the folder and commit the work
# ---------------------------------------------------

import os
import re
from bs4 import BeautifulSoup
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import StemmingAnalyzer
from whoosh.index import LockError, create_in, open_dir
from whoosh.writing import AsyncWriter

schema = Schema(
    filename=ID(stored=True),
    content=TEXT(analyzer=StemmingAnalyzer()),
    kingdom=TEXT(stored=True),
    phylum=TEXT(stored=True),
    class_=TEXT(stored=True),
    order=TEXT(stored=True),
    family=TEXT(stored=True),
    subfamily=TEXT(stored=True),
    genus=TEXT(stored=True)
)

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

def extract_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    texts = soup.find_all(text=True)
    visible_texts = filter(visible, texts)
    return " ".join(t.strip() for t in visible_texts if t.strip())

def load_file(writer, fname):
    with open(fname, 'r', encoding='utf-8') as infile:
        html = infile.read()

    content = extract_text(html)

    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
            
    kingdom, phylum, class_, order, family, subfamily, genus = parse_html(html)
    print(kingdom)
    
    writer.add_document(
        filename=os.path.basename(fname),
        content=content,
        kingdom=kingdom,
        phylum=phylum,
        class_=class_,
        order=order,
        family=family,
        subfamily=subfamily,
        genus=genus
    )
    print(f"Indexed: {fname}")

def process_folder(writer, folder):
    print(f"Processing folder: {folder}")
    for root, dirs, files in os.walk(folder):
        print(f"root = {root}")

        for file in files:
            if file.endswith(".html"):
                filename = os.path.join(root, file)
                print(f"Processing File: {filename}")
                load_file(writer, filename)
            else:
                print(f"Unhandled File: {file}")

def create_index(index_dir, data_folder):
    
    lock_file_path = os.path.join(index_dir, "MAIN_WRITELOCK")
    if os.path.exists(lock_file_path):
        os.remove(lock_file_path)

    
    if not os.path.exists(index_dir):
        os.makedirs(index_dir)

    if index.exists_in(index_dir):
        ix = open_dir(index_dir)
    else:
        ix = create_in(index_dir, schema)

    writer = None
    try:
        writer = AsyncWriter(ix)
        process_folder(writer, data_folder)
    except LockError:
        print("Index is locked. Please try again after the lock is released.")
    finally:
        if writer is not None:
            writer.commit()
        print("Indexing complete.")

data_folder = '/dsa/data/all_datasets/en.wikipedia.org/wiki'
index_dir = 'modules/module6/exercises/wiki_index'

create_index(index_dir, data_folder)



Processing folder: /dsa/data/all_datasets/en.wikipedia.org/wiki
root = /dsa/data/all_datasets/en.wikipedia.org/wiki
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html
Animal
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html
Animal
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html
Animal
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html
Animal
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html
Animal
Indexed: /dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Bokermannohyla.html
Animal
Indexed: /dsa/da

--- 
<a id='search_me' ></a>

### 5) Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html
  
Previously, we hard-coded query strings into the code cells.

Now, use the `input()` function collect a query string from the user. 
Then execute the search. For this task, focus only on the `content` field. 

In [36]:
from whoosh.qparser import QueryParser
from whoosh import index

# Write your code below this comment:
# --------------------------------------
ix = index.open_dir("modules/module6/exercises/wiki_index")

user_query = input("Enter your search query: ")

with ix.searcher() as searcher:
   
    query = QueryParser("content", ix.schema).parse(user_query)
    results = searcher.search(query)
  
    print(f"Found {len(results)} results:")
    for result in results:
        print(result['filename']) 




Enter your search query: frog
Found 123 results:
Dendropsophus.html
Dendropsophus.html
Dendropsophus.html
Hypsiboas.html
Hypsiboas.html
Hypsiboas.html
Hyla.html
Hyla.html
Hyla.html
Hylidae.html


### 6) Write two example queries to ensure you can search the index 

That is, make sure you can search on the fields you added to the index from the infobox biota table.

```HTML
<table class="infobox biota" ... </table>
```
For this search, we will ignore `content` field and search over the other fields. We can use `MultifieldParser` to specify the fields of our interest. 


In [42]:
# Write your code below this comment:
# --------------------------------------
from whoosh.qparser import MultifieldParser, OrGroup


# OMIT CONTENT
qp = MultifieldParser(["kingdom","phylum","class","order","family","genus"], 
                      schema=ix.schema, group=OrGroup)  

user_query = input("Enter your search query for biota fields: ")

with ix.searcher() as searcher:
    query = qp.parse(user_query)
    results = searcher.search(query)

    
    print(f"Found {len(results)} results:")
    for result in results:
        print(result['filename'])



Enter your search query for biota fields: Animal
Found 160 results:
Acris.html
Anotheca.html
Aparasphenodon.html
Aplastodiscus.html
Argenteohyla.html
Bokermannohyla.html
Bromeliohyla.html
Charadrahyla.html
Corythomantis.html
Dendropsophus.html


# SAVE YOUR NOTEBOOK WITH ALL EXECUTED CELLS
# Then, `File > Close and Halt`