# Word-based Information Retrieval with Apache Lucene

Created by Jonathan Diebel

[Apache Lucene](https://lucene.apache.org/) is a powerful, open-source search library written in Java.<br>
This notebook demonstrates how to use Apache Lucene for word-based information retrieval tasks. We will cover the basics of using Lucene, including:
- Creating an index
- Adding documents to the index
- Performing searches on the index
- Updating documents in the index
- Deleting a single document or all documents from the index

**Note:** This Jupyter Notebook must be run with a Java kernel. Make sure a Java kernel is installed and selected in your Jupyter environment before executing the code cells. If you don't have one, you can install a kernel such as [IJava](https://github.com/SpencerPark/IJava), which was used to create this notebook.

## 1 Setup and Dependencies

First, ensure you have the necessary Lucene libraries. Download Lucene from [the official website](https://lucene.apache.org/core/downloads.html).

Make sure you include the following jars in your classpath:
- lucene-core
- lucene-analysis-common (named lucene-analyzers-common before Lucene 9)
- lucene-queryparser

In [1]:
// Add the Lucene libraries to the classpath
%classpath add jar /lucene-9.10.0/lucene-core-9.10.0.jar
%classpath add jar /lucene-9.10.0/lucene-analysis-common-9.10.0.jar
%classpath add jar /lucene-9.10.0/lucene-queryparser-9.10.0.jar

In [2]:
// Import necessary Lucene classes
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

import java.nio.file.Paths;
import java.io.IOException;

## 2 Creating an Index

First, we need to set up the analyzer and the directory to store our index. We will use `MMapDirectory` for the index storage.

In [3]:
// Set up the standard analyzer and MMapDirectory for index storage
StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = new MMapDirectory(Paths.get("/usr/local/lucene-index"));

// Configuration for the IndexWriter
IndexWriterConfig config = new IndexWriterConfig(analyzer);

// Initialize the IndexWriter with the configuration
IndexWriter writer = new IndexWriter(index, config);
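
To give a feel for what the analyzer contributes, here is a minimal plain-Java sketch of an analysis step: it splits text on non-alphanumeric characters and lowercases each token. `ToyAnalyzer` is a hypothetical name used only for illustration. The real `StandardAnalyzer` follows Unicode text segmentation rules and handles many cases this toy version does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical toy analyzer: NOT Lucene's implementation, only an illustration.
final class ToyAnalyzer {
    // Split on anything that is not a letter or digit, then lowercase each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

Lowercasing at index time and at query time is what lets a query for `lucene` match the text `Lucene`.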

## 3 Adding Documents

Next, we will create a few sample documents and add them to our index. Each document represents a piece of text that we want to make searchable.

In [4]:
// Method to add a document to the index
public void addDoc(IndexWriter writer, String title, String content) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("title", title, Field.Store.YES));
    doc.add(new TextField("content", content, Field.Store.YES));
    writer.addDocument(doc);
}

// Adding sample documents
addDoc(writer, "Information retrieval", "Information retrieval is the process of obtaining information from a large repository.");
addDoc(writer, "Lucene Introduction", "Lucene is a powerful search engine library written in Java.");
addDoc(writer, "Lucene vs Elasticsearch", "Both Lucene and Elasticsearch are used for full-text search, but Elasticsearch is built on top of Lucene.");
addDoc(writer, "Getting Started with Lucene", "This is a beginner's guide to getting started with Apache Lucene.");
addDoc(writer, "Advanced Lucene Features", "This document explains advanced features of Lucene, including custom scoring and analysis.");
addDoc(writer, "Elasticsearch Scaling", "Elasticsearch provides powerful scaling capabilities on top of Lucene.");
writer.close();
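
Conceptually, indexing makes text searchable by building an inverted index: a map from each term to the documents that contain it. The following is a minimal plain-Java sketch of that idea, not Lucene's actual data structure. `ToyInvertedIndex` and its methods are hypothetical names, and the tokenizer is a simplification.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy inverted index: term -> sorted set of document ids.
final class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Tokenize the content (simplified) and record the new doc id under each term.
    int add(String content) {
        int docId = nextDocId++;
        for (String token : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Look up the ids of all documents containing the term (empty set if none).
    SortedSet<Integer> docsFor(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }
}
```

A term lookup is then a single map access instead of a scan over every document.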

## 4 Searching the Index

Now that we have indexed our documents, we can perform searches on the index. We will use the `IndexSearcher` and `QueryParser` classes to execute search queries.

In [5]:
// Define a method to perform a search and display the results.
// Note: the searcher argument is replaced by a searcher over a freshly opened reader.
public void search(String querystr, IndexSearcher searcher) throws Exception {
    // Open a new IndexReader so the search reflects the latest committed changes
    DirectoryReader reader = DirectoryReader.open(index);
    searcher = new IndexSearcher(reader);

    // Parse and execute the query
    QueryParser parser = new QueryParser("content", analyzer);
    Query query = parser.parse(querystr);
    ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;

    // Display search results
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title") + "\t\t" + d.get("content"));
    }

    // Close the IndexReader after searching
    reader.close();
}

// Create an IndexSearcher to pass to search(); search() opens its own reader on every call
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));

### Search for documents containing "Lucene" or "Elasticsearch"

In [6]:
search("Lucene OR Elasticsearch", searcher);

Found 5 hits.
1. Lucene vs Elasticsearch		Both Lucene and Elasticsearch are used for full-text search, but Elasticsearch is built on top of Lucene.
2. Elasticsearch Scaling		Elasticsearch provides powerful scaling capabilities on top of Lucene.
3. Lucene Introduction		Lucene is a powerful search engine library written in Java.
4. Getting Started with Lucene		This is a beginner's guide to getting started with Apache Lucene.
5. Advanced Lucene Features		This document explains advanced features of Lucene, including custom scoring and analysis.


### Search for documents containing "beginner's" and "guide"

In [7]:
search("beginner's AND guide", searcher);

Found 1 hits.
1. Getting Started with Lucene		This is a beginner's guide to getting started with Apache Lucene.
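
The `OR` and `AND` queries above can be pictured as set operations on the document ids that match each term: `OR` is a union, `AND` is an intersection. This is a plain-Java sketch of that idea with a hypothetical `PostingOps` class. It ignores scoring and is not how Lucene executes boolean queries internally.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: boolean operators as set operations on matching doc ids.
final class PostingOps {
    // OR: a document matches if it contains either term.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // AND: a document matches only if it contains both terms.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }
}
```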


### Phrase search for "search engine"

In [8]:
search("\"search engine\"", searcher);

Found 1 hits.
1. Lucene Introduction		Lucene is a powerful search engine library written in Java.
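
A phrase query differs from an `AND` query in that the terms must appear next to each other and in the given order. The sketch below illustrates this with a plain-Java check over token lists. `PhraseMatch` is a hypothetical name, and the approach (scanning the tokens) is a simplification of the position information a real index stores.

```java
import java.util.List;

// Hypothetical helper: checks whether phrase tokens occur consecutively in a document.
final class PhraseMatch {
    static boolean containsPhrase(List<String> docTokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return true; // an empty phrase trivially matches
        }
        // Try every start position where the whole phrase still fits.
        for (int start = 0; start + phrase.size() <= docTokens.size(); start++) {
            if (docTokens.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

This is why only the document containing the adjacent words "search engine" is returned, even though other documents contain the word "search".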


### Search with wildcard

In [9]:
search("in*", searcher);

Found 3 hits.
1. Information retrieval		Information retrieval is the process of obtaining information from a large repository.
2. Lucene Introduction		Lucene is a powerful search engine library written in Java.
3. Advanced Lucene Features		This document explains advanced features of Lucene, including custom scoring and analysis.
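
A trailing `*` behaves like a prefix match against the indexed terms. The three hits above are consistent with the terms "information", "in" (from "written in Java"), and "including". The following plain-Java sketch shows the expansion step under that assumption. `PrefixExpansion` is a hypothetical name, and a linear scan stands in for Lucene's term dictionary.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: expands a prefix into all indexed terms that start with it.
final class PrefixExpansion {
    static List<String> expand(String prefix, List<String> indexedTerms) {
        return indexedTerms.stream()
                .filter(term -> term.startsWith(prefix))
                .sorted()
                .collect(Collectors.toList());
    }
}
```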


### Fuzzy search for terms similar to "retrival"

In [10]:
search("retrival~", searcher);

Found 1 hits.
1. Information retrieval		Information retrieval is the process of obtaining information from a large repository.
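
The `~` operator matches terms within a small edit distance of the query term. As a rough illustration, the sketch below computes the plain Levenshtein distance between two strings. `EditDistance` is a hypothetical name. As far as I know, Lucene's fuzzy queries use a variant that also counts a transposition of adjacent characters as a single edit, which this sketch does not do.

```java
// Hypothetical helper: classic Levenshtein distance (insert, delete, substitute).
final class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The misspelled "retrival" is one insertion away from "retrieval", which is why the fuzzy query finds the document.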


### Boosting a term

Boosting increases the weight of a specific term or phrase when Lucene scores matching documents. The boost is written as `^` followed by a number after the term or phrase, and it scales that term's contribution to the score relative to the other terms in the query.

In the query below, the boost factor of 2 gives matches on "Lucene" twice the weight of matches on the unboosted term "Elasticsearch". Because no operator is given, the query parser combines the two terms with OR, its default.

In [11]:
search("Lucene^2 Elasticsearch", searcher);

Found 5 hits.
1. Lucene vs Elasticsearch		Both Lucene and Elasticsearch are used for full-text search, but Elasticsearch is built on top of Lucene.
2. Elasticsearch Scaling		Elasticsearch provides powerful scaling capabilities on top of Lucene.
3. Lucene Introduction		Lucene is a powerful search engine library written in Java.
4. Getting Started with Lucene		This is a beginner's guide to getting started with Apache Lucene.
5. Advanced Lucene Features		This document explains advanced features of Lucene, including custom scoring and analysis.
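
The effect of a boost can be pictured as a multiplier on a term's contribution to the document score. The sketch below is a deliberately simplified model: the per-term base scores are made-up inputs and `BoostedScore` is a hypothetical name. Lucene's real scoring (BM25 by default) is more involved, so this only illustrates the arithmetic of the boost.

```java
import java.util.Map;

// Hypothetical model: document score = sum over matched terms of (base score * boost).
final class BoostedScore {
    static double score(Map<String, Double> matchedTermScores, Map<String, Double> boosts) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : matchedTermScores.entrySet()) {
            // Terms without an explicit boost use the neutral factor 1.0.
            total += entry.getValue() * boosts.getOrDefault(entry.getKey(), 1.0);
        }
        return total;
    }
}
```

Whether a boost changes the ranking depends on how large the unboosted contributions are, so the order of results can stay the same even when a term is boosted.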


### Search for a term which is not included

In [12]:
search("pizza", searcher);

Found 0 hits.


## 5 Updating Documents in the Index

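Sometimes, we need to update existing documents in our index. Lucene does not modify a document in place: `updateDocument` deletes every document that contains the given term and then adds the new version. Because `title` is a `StringField`, the term must match the stored title exactly.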


In [13]:
// Create a fresh configuration; an IndexWriterConfig cannot be reused for a second IndexWriter
IndexWriterConfig config = new IndexWriterConfig(analyzer);

// Method to update a document in the index
public void updateDoc(IndexWriter writer, String oldTitle, String newTitle, String newContent) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("title", newTitle, Field.Store.YES));
    doc.add(new TextField("content", newContent, Field.Store.YES));
    writer.updateDocument(new Term("title", oldTitle), doc);
}

// Re-initialize the IndexWriter
writer = new IndexWriter(index, config);

// Update a document
updateDoc(writer, "Lucene Introduction", "Lucene Introduction", "Lucene is a search engine library written in Java. It is highly scalable.");
writer.close();

### Verifying the Update

Let's perform a search to verify that the document was updated successfully.

In [14]:
// Perform a search to verify the update
search("scalable", searcher);

Found 1 hits.
1. Lucene Introduction		Lucene is a search engine library written in Java. It is highly scalable.
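
The delete-then-add behaviour of an update can be sketched with a plain in-memory list. `ToyStore` and its methods are hypothetical names used only to illustrate the semantics. The sketch says nothing about how Lucene stores documents or handles segments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory store: each document is a {title, content} pair.
final class ToyStore {
    private final List<String[]> docs = new ArrayList<>();

    void add(String title, String content) {
        docs.add(new String[] {title, content});
    }

    // Remove every document whose title equals oldTitle exactly, then add the new version.
    void update(String oldTitle, String newTitle, String newContent) {
        docs.removeIf(doc -> doc[0].equals(oldTitle));
        add(newTitle, newContent);
    }

    int size() {
        return docs.size();
    }

    // Content of the first document with the given title, or null if there is none.
    String contentOf(String title) {
        for (String[] doc : docs) {
            if (doc[0].equals(title)) {
                return doc[1];
            }
        }
        return null;
    }
}
```

Note that when no document matches the old title, nothing is removed and the update simply adds a document.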


## 6 Deleting a Document from the Index

We can delete documents from the index by matching a term in a specific field, such as the title. The following method deletes every document whose `title` field is exactly equal to the given title.

In [15]:
// Create a fresh configuration; an IndexWriterConfig cannot be reused for a second IndexWriter
IndexWriterConfig config = new IndexWriterConfig(analyzer);

// Method to delete a document from the index
public void deleteDoc(IndexWriter writer, String title) throws IOException {
    writer.deleteDocuments(new Term("title", title));
}

// Re-initialize the IndexWriter
writer = new IndexWriter(index, config);

// Delete a document
deleteDoc(writer, "Information retrieval");
writer.close();

### Verifying the Deletion

Let's perform a search to verify that the document was deleted successfully.

In [16]:
// Perform a search to verify the deletion
search("Information retrieval", searcher);

Found 0 hits.
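
Deleting by title requires the exact title because `title` was added as a `StringField`, which is indexed as one unmodified term, whereas `content` was added as a `TextField`, which is split into analyzed terms. The sketch below contrasts the two in plain Java. `FieldKinds` and its methods are hypothetical names, and the tokenizer is a simplification of what an analyzer does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical illustration of keyword-style versus analyzed indexing.
final class FieldKinds {
    // StringField-like: the whole value becomes exactly one term, unchanged.
    static List<String> asKeyword(String value) {
        return List.of(value);
    }

    // TextField-like: the value is split into lowercased word terms (simplified).
    static List<String> asText(String value) {
        return Arrays.stream(value.split("[^\\p{L}\\p{N}]+"))
                .filter(token -> !token.isEmpty())
                .map(token -> token.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }
}
```

Under this model, a delete term such as "information retrieval" in lowercase would not match the keyword-style title "Information retrieval".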


## 7 Deleting All Documents from the Index

In some cases, you might need to clear the entire index by deleting all documents. This can be useful during testing or when you want to completely reset your index. The following method demonstrates how to delete all documents from the index.

In [17]:
// Create a fresh configuration; an IndexWriterConfig cannot be reused for a second IndexWriter
IndexWriterConfig config = new IndexWriterConfig(analyzer);

// Method to delete all documents from the index; it also closes the writer
public void deleteAllDocs(IndexWriter writer) throws IOException {
    // Mark all documents as deleted
    writer.deleteAll();
    // Commit so the deletion becomes visible to newly opened readers
    writer.commit();
    // Close the writer to release its resources and the index write lock
    writer.close();

    System.out.println("Deleted all documents from the index.");
}

// Re-initialize the IndexWriter
writer = new IndexWriter(index, config);

// Delete all documents
deleteAllDocs(writer);

Deleted all documents from the index.


### Verifying the Deletion

The query `*:*` is the query parser's match-all syntax: it matches every document in the index regardless of field or term. Because all documents have been deleted, it should now return no hits.

In [18]:
search("*:*", searcher);

Found 0 hits.
