# Chapter 06

***start

## Indexing

## Searching

## Lucene

The leading library for full-text search is [Lucene](https://lucene.apache.org/). However, Lucene is a Java library, which makes it hard to set up (especially cross-platform, as would be needed in this course).

There is a Python extension for accessing Java Lucene, called [PyLucene](https://lucene.apache.org/pylucene/). Its goal is to allow you to use Lucene's text indexing and searching capabilities from Python. Still, PyLucene is not a Lucene **port** but a Python **wrapper** around Java Lucene. PyLucene embeds a Java VM with Lucene into a Python process. This means that you still need Java Lucene to run PyLucene, and some additional tools (GNU `Make`, a C++ compiler, etc.).



## Whoosh

Because text indexing and searching are bound to be slow in pure Python (so it makes good sense to stick to Java Lucene for serious work), there is no true pure-Python alternative to Lucene. However, there are some libraries that let you experiment with similar indexing and searching software.

One of these is [Whoosh](https://whoosh.readthedocs.io/en/latest/index.html), which is unfortunately no longer maintained. Even so, the latest version, 2.7.4, works fine with Python 3 and can easily be installed with `pip install Whoosh`.

In the [Whoosh introduction](https://whoosh.readthedocs.io/en/latest/intro.html) we read:

> ### About Whoosh
>- Whoosh is fast, but uses only pure Python, so it will run anywhere Python runs, without requiring a compiler.
>- By default, Whoosh uses the Okapi BM25F ranking function, but like most things the ranking function can be easily customized.
>- Whoosh creates fairly small indexes compared to many other search libraries.
>- All indexed text in Whoosh must be unicode.
>- Whoosh lets you store arbitrary Python objects with indexed documents.

> ### What is Whoosh

>Whoosh is a fast, pure Python search engine library.

>The primary design impetus of Whoosh is that it is pure Python. You should be able to use Whoosh anywhere you can use Python, no compiler or Java required.

>Like one of its ancestors, Lucene, Whoosh is not really a search engine, it’s a programmer library for creating a search engine.

>Practically no important behavior of Whoosh is hard-coded. Indexing of text, the level of information stored for each term in each field, parsing of search queries, the types of queries allowed, scoring algorithms, etc. are all customizable, replaceable, and extensible.

Indeed, Whoosh is quite similar to Lucene, including its query language. It lets you connect terms with `AND` or `OR`, eliminate terms with `NOT`, group terms together into clauses with parentheses, do range, prefix, and wildcard queries, and specify different fields to search. By default it joins clauses together with `AND` (so unless you say otherwise, all terms you specify must be in the document for the document to match).
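
To make the semantics of these operators concrete, here is a minimal sketch in plain Python (standard library only). It is not how Whoosh or Lucene implement queries; it only assumes a toy inverted index that maps each term to the set of documents containing it. The corpus `DOCS` and the helper names `build_index` and `term` are hypothetical and introduced purely for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text (not part of the chapter06 repo).
DOCS = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy brown dog",
    "doc3": "a quick red fox jumps",
}


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


INDEX = build_index(DOCS)


def term(word):
    """Return the set of documents containing a single term."""
    return INDEX.get(word.lower(), set())


# quick AND fox: both terms must occur (set intersection).
quick_and_fox = term("quick") & term("fox")

# brown OR red: at least one term must occur (set union).
brown_or_red = term("brown") | term("red")

# fox NOT brown: documents with "fox" but without "brown" (set difference).
fox_not_brown = term("fox") - term("brown")

# (quick OR lazy) AND brown: parentheses group a clause.
grouped = (term("quick") | term("lazy")) & term("brown")

# Prefix query qu*: union over every indexed term starting with "qu".
prefix = set().union(*(ids for t, ids in INDEX.items() if t.startswith("qu")))

for name, hits in [("quick AND fox", quick_and_fox),
                   ("brown OR red", brown_or_red),
                   ("fox NOT brown", fox_not_brown),
                   ("(quick OR lazy) AND brown", grouped),
                   ("qu*", prefix)]:
    print(name, "->", sorted(hits))
```

With the default `AND` joining described above, a bare query such as `quick fox` would behave like the first case: only documents containing both terms match.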

The following code shows you how to create and search a basic Whoosh index. For more information, see the [Whoosh quick start](https://whoosh.readthedocs.io/en/latest/quickstart.html) and documentation on the [query language](https://whoosh.readthedocs.io/en/latest/querylang.html).


In [None]:
"""
Whoosh quick start
Source: https://whoosh.readthedocs.io/en/latest/quickstart.html
"""

import os

from whoosh import highlight
from whoosh.index import open_dir, create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

# Create schema
"""
To begin using Whoosh, you need an index object. The first time you create
an index, you must define the index’s schema. The schema lists the fields in
the index. A field is a piece of information for each document in the index,
such as its title or text content. A field can be indexed (meaning it can be
searched) and/or stored (meaning the value that gets indexed is returned with
the results; this is useful for fields such as the title).
"""

schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True),
                path=ID(stored=True))

# Create index
"""
Once you have the schema, you can create an index.
At a low level, this creates a Storage object to contain the index.
A Storage object represents the medium in which the index will be stored.
Usually this will be FileStorage, which stores the index as a set of files
in a directory.
"""

# Make sure the directory exists; create_in() then starts a fresh index in it
os.makedirs("index", exist_ok=True)
ix = create_in("index", schema)

# Open index
"""
After you’ve created an index, you can open it.
"""
ix = open_dir("index")

# Add documents
"""
OK, so we’ve got an Index object, now we can start adding documents.
The writer() method of the Index object returns an IndexWriter object that
lets you add documents to the index. The IndexWriter’s add_document(**kwargs)
method accepts keyword arguments where the field name is mapped to a value.

The documents we add, a small corpus of British fiction, are part of
the chapter06 repo.
"""

writer = ix.writer()

for document in os.listdir("documents"):
    # Assumes the corpus files are UTF-8 encoded
    with open(os.path.join("documents", document), encoding="utf-8") as text:
        writer.add_document(title=document, content=text.read(),
                            path=document)
writer.commit()


# Parse a query string
"""
Whoosh's Searcher (cf. infra) takes a Query object. You can construct query
objects directly or use a query parser to parse a query string.
To parse a query string, you can use the default query parser in the qparser
module. The first argument to the QueryParser constructor is the default field
to search. This is usually the "body text" field. The second optional argument
is a schema to use to understand how to parse the fields.
"""
parser = QueryParser("content", ix.schema)
myquery = parser.parse('"My father was well known to the best circles"')

# Search documents
"""
Once you have a Searcher and a query object, you can use the Searcher's
search() method to run the query and get a Results object. You can use the
highlights() method on the whoosh.searching.Hit object to get highlighted
snippets from the document containing the search terms.

Whoosh's highlighting system works as a pipeline, with four component types.

- Fragmenters chop up the original text into fragments, based on the
locations of matched terms in the text.
- Scorers assign a score to each fragment, allowing the system to rank the
best fragments by whatever criterion.
- Order functions control in what order the top-scoring fragments are presented
to the user. For example, you can show the fragments in the order they appear
in the document (FIRST) or show higher-scoring fragments first (SCORE).
- Formatters turn the fragment objects into human-readable output,
such as an HTML string.
"""
with ix.searcher() as searcher:
    results = searcher.search(myquery)
    results.fragmenter = highlight.SentenceFragmenter()
    for item in results:
        print(item.highlights("content"))
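
The four pipeline stages described in the comments above (fragmenter, scorer, order function, formatter) can be sketched in plain Python to show how they fit together. This is only an illustration under simplifying assumptions, not Whoosh's implementation: fragments are whole sentences, the score is the number of distinct query terms in a fragment, and the formatter wraps matches in `<b>` tags. All function names and the `SAMPLE` text are hypothetical.

```python
import re


def fragment(text, terms):
    """Fragmenter: split text into sentences, keep those with a query term."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    fragments = []
    for position, sentence in enumerate(sentences):
        words = re.findall(r"\w+", sentence.lower())
        if any(word in terms for word in words):
            fragments.append({"position": position, "text": sentence})
    return fragments


def score(frag, terms):
    """Scorer: count the distinct query terms that occur in the fragment."""
    words = set(re.findall(r"\w+", frag["text"].lower()))
    return len(words & terms)


def format_fragment(frag, terms):
    """Formatter: wrap every matched word in <b> tags."""
    def mark(match):
        word = match.group(0)
        return f"<b>{word}</b>" if word.lower() in terms else word
    return re.sub(r"\w+", mark, frag["text"])


def highlight_text(text, query_terms, top=2, order="FIRST"):
    """Run the pipeline: fragment, score, order, format."""
    terms = {t.lower() for t in query_terms}
    fragments = fragment(text, terms)
    # Keep the `top` best fragments, highest score first (ties keep text order).
    best = sorted(fragments, key=lambda f: score(f, terms), reverse=True)[:top]
    # Order function: FIRST restores document order, SCORE keeps score order.
    if order == "FIRST":
        best.sort(key=lambda f: f["position"])
    return " ... ".join(format_fragment(f, terms) for f in best)


# Hypothetical sample text, invented for this sketch.
SAMPLE = ("My father was a merchant. He was well known to the best circles. "
          "The weather was fine. My father travelled often.")
snippet = highlight_text(SAMPLE, ["father", "circles"], top=2, order="FIRST")
print(snippet)
```

Swapping one stage for another (for example a different `score` function) leaves the rest of the pipeline untouched, which is the point of the component design.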

## Assignment: chunk hunter

Look for 3-, 4-, and 5-word chunks (n-grams) and sort them by frequency.
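
One possible starting point, offered as a rough sketch rather than a reference solution: slide a window of n tokens over the text and tally the resulting tuples with `collections.Counter`. The tokenization (lowercasing, `\w+` matching), the helper names `ngrams` and `count_chunks`, and the `SAMPLE` string are all assumptions made for illustration; for the assignment you would read the files in `documents` instead.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


def count_chunks(text, sizes=(3, 4, 5)):
    """Return {n: Counter of n-grams} for each requested chunk size."""
    tokens = re.findall(r"\w+", text.lower())
    return {n: Counter(ngrams(tokens, n)) for n in sizes}


# Hypothetical sample text; replace with the contents of a corpus file.
SAMPLE = "it was the best of times it was the worst of times"
counts = count_chunks(SAMPLE)

for n, counter in counts.items():
    # most_common() sorts the chunks by descending frequency.
    for chunk, freq in counter.most_common(3):
        print(n, " ".join(chunk), freq)
```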