# Word-based Information Retrieval with Apache Lucene

Created by Jonathan Diebel

This notebook demonstrates how to use Apache Lucene, an open-source search library written in Java, for word-based information retrieval. We use PyLucene, the Python extension for accessing Java Lucene.

https://downloads.apache.org/lucene/pylucene/

## 1 Installation of PyLucene

PyLucene is a Python extension that gives access to the Java-based Lucene library, which is developed by the Apache Software Foundation and used for full-text indexing and search. PyLucene relies on JCC, a C++ code generator, to make Java classes available in Python: JCC generates C++ wrappers that call into the JVM through the Java Native Interface (JNI).

**Prerequisites**:
- Java Development Kit (JDK): PyLucene requires a JDK to build and a Java Runtime Environment (JRE) to run. As of version 9.x, PyLucene requires Java 11 or higher.
- JCC: JCC must be installed before PyLucene. It is included in the PyLucene source distribution and can be built and installed with its `setup.py`.
- C/C++ compiler: A modern C/C++ compiler is required to compile the generated C++ wrappers.
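
Before building, it can help to confirm that the prerequisites listed above are visible from the Python process. The following is a minimal sketch: the helper name `check_prerequisites` and the list of compiler commands are illustrative assumptions, not part of PyLucene, and the check only reports what is found on `PATH` or importable.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report which PyLucene prerequisites are discoverable from this environment."""
    return {
        # Java runtime and compiler, looked up on PATH
        "java": shutil.which("java") is not None,
        "javac": shutil.which("javac") is not None,
        # Any one of these common C++ compiler commands is enough for this check
        "cxx_compiler": any(shutil.which(c) for c in ("g++", "clang++", "c++", "cl")),
        # JCC and PyLucene become importable only after they are built and installed
        "jcc": importlib.util.find_spec("jcc") is not None,
        "lucene": importlib.util.find_spec("lucene") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{name:>12}: {'found' if found else 'missing'}")
```

A `missing` entry for `jcc` or `lucene` is expected until both have been built and installed into the active Python environment. The check does not verify the Java version.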

In [1]:
# Initialize the JVM
import lucene
lucene.initVM()

ModuleNotFoundError: No module named 'lucene'
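
The `ModuleNotFoundError` above means that PyLucene is not installed in the active Python environment. A guarded start-up along the following lines gives a clearer message and avoids starting the JVM a second time when a cell is re-run. The helper name `init_lucene` is hypothetical, and the sketch assumes that `lucene.getVMEnv()` returns `None` until `initVM()` has been called.

```python
def init_lucene():
    """Start the JVM for PyLucene once; return True on success, False if PyLucene is missing."""
    try:
        import lucene
    except ImportError:
        print("PyLucene is not installed in this environment; "
              "build and install JCC and PyLucene first.")
        return False

    # Assumption: getVMEnv() returns None while no JVM is running in this process.
    if lucene.getVMEnv() is None:
        lucene.initVM()
    return True


jvm_ready = init_lucene()
```

If `jvm_ready` is `False`, the remaining cells cannot run, because the `org.apache.lucene` imports only resolve after PyLucene has been installed.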

In [None]:
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import DirectoryReader, IndexWriter, IndexWriterConfig
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import ByteBuffersDirectory

In [None]:
# Create an in-memory index
# (RAMDirectory was removed in Lucene 9; ByteBuffersDirectory is its replacement)
directory = ByteBuffersDirectory()
analyzer = StandardAnalyzer()
config = IndexWriterConfig(analyzer)
index_writer = IndexWriter(directory, config)

In [None]:
# Function to add documents to the index
def add_doc(writer, title, content):
    doc = Document()
    doc.add(TextField("title", title, Field.Store.YES))
    doc.add(TextField("content", content, Field.Store.YES))
    writer.addDocument(doc)

# Adding sample documents to the index
add_doc(index_writer, "Document 1", "Lucene is a powerful search library written in Java.")
add_doc(index_writer, "Document 2", "Information retrieval is the process of obtaining information from a large repository.")
add_doc(index_writer, "Document 3", "Lucene supports various types of queries including term, phrase, and wildcard queries.")
index_writer.close()
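
Conceptually, indexing a text field means tokenizing it and recording which documents contain each term. The following pure-Python sketch illustrates that idea on the three sample documents. It is a simplified stand-in, not Lucene's actual analyzer or index format, and the names `tokenize` and `build_inverted_index` are illustrative.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (a rough approximation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of titles of the documents that contain it."""
    postings = defaultdict(set)
    for title, content in docs.items():
        for term in tokenize(content):
            postings[term].add(title)
    return {term: sorted(titles) for term, titles in postings.items()}


docs = {
    "Document 1": "Lucene is a powerful search library written in Java.",
    "Document 2": "Information retrieval is the process of obtaining information from a large repository.",
    "Document 3": "Lucene supports various types of queries including term, phrase, and wildcard queries.",
}

index = build_inverted_index(docs)
print(index["lucene"])       # ['Document 1', 'Document 3']
print(index["information"])  # ['Document 2']
```

Looking up a term in such a structure is a dictionary access, which is why word-based retrieval stays fast as the collection grows.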

In [None]:
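# Function to search the index
def search_index(query_str, max_hits=10):
    """Parse query_str against the "content" field and return (title, content) pairs for the top hits."""
    reader = DirectoryReader.open(directory)
    try:
        searcher = IndexSearcher(reader)
        query = QueryParser("content", analyzer).parse(query_str)
        hits = searcher.search(query, max_hits).scoreDocs

        results = []
        for hit in hits:
            # hit.doc is the internal document ID; load the stored fields for it
            doc = searcher.doc(hit.doc)
            results.append((doc.get("title"), doc.get("content")))
        return results
    finally:
        # Release the reader even if parsing or searching fails
        reader.close()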

# Example search queries
query1 = "Lucene"
query2 = "information retrieval"
query3 = "supports queries"

In [None]:
# Perform searches and display results
for query in (query1, query2, query3):
    print(f"Results for query: '{query}'")
    for title, content in search_index(query):
        print(f"Title: {title}\nContent: {content}\n")
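
The classic `QueryParser` combines the terms of a multi-word query with OR by default, so a document matches if it contains at least one of the terms. The pure-Python sketch below mimics only that matching rule on the sample texts, to show which documents each example query should return. It ranks by the number of distinct matching terms, which is a simplification and not Lucene's scoring. The names `DOCS` and `or_search` are illustrative.

```python
import re

DOCS = {
    "Document 1": "Lucene is a powerful search library written in Java.",
    "Document 2": "Information retrieval is the process of obtaining information from a large repository.",
    "Document 3": "Lucene supports various types of queries including term, phrase, and wildcard queries.",
}


def or_search(query, docs=DOCS):
    """Return titles of documents containing at least one query term, best match first."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    scored = []
    for title, content in docs.items():
        doc_terms = set(re.findall(r"[a-z0-9]+", content.lower()))
        overlap = len(query_terms & doc_terms)
        if overlap:
            scored.append((overlap, title))
    # Sort by descending overlap, then by title for a stable order
    return [title for overlap, title in sorted(scored, key=lambda s: (-s[0], s[1]))]


for q in ("Lucene", "information retrieval", "supports queries"):
    print(q, "->", or_search(q))
```

Under this rule, "Lucene" matches Documents 1 and 3, "information retrieval" matches Document 2, and "supports queries" matches Document 3.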