# Word-based Information Retrieval with Apache Lucene

Created by Jonathan Diebel

[Apache Lucene](https://lucene.apache.org/) is a powerful, open-source search library written in Java.

This notebook demonstrates how to use Apache Lucene for word-based information retrieval tasks. We will cover the basics of using Lucene, including:
- Creating an index
- Adding documents to the index
- Updating documents in the index
- Performing searches on the index

## 1 Setup and Dependencies

First, ensure you have the necessary Lucene libraries. Download Lucene from [the official website](https://lucene.apache.org/core/downloads.html).

Make sure you include the following jars in your classpath:
- lucene-core
- lucene-analysis-common (named lucene-analyzers-common before Lucene 9.0)
- lucene-queryparser

In [1]:
// Add the Lucene libraries to the classpath
%classpath add jar /lucene-9.10.0/lucene-core-9.10.0.jar
%classpath add jar /lucene-9.10.0/lucene-analysis-common-9.10.0.jar
%classpath add jar /lucene-9.10.0/lucene-queryparser-9.10.0.jar

In [2]:
// Import necessary Lucene classes
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

import java.nio.file.Paths;
import java.io.IOException;

## 2 Creating an Index

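To create an index, we need an analyzer, which turns text into searchable terms, and a directory, which stores the index. We will use `MMapDirectory`, which keeps the index on disk at the given path and reads it through memory-mapped files.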

In [None]:
// Set up the standard analyzer and MMapDirectory for index storage
StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = new MMapDirectory(Paths.get("/usr/local/lucene-index"));

// Configuration for the IndexWriter
IndexWriterConfig config = new IndexWriterConfig(analyzer);

// Initialize the IndexWriter with the configuration
IndexWriter writer = new IndexWriter(index, config);
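
To make concrete what a word-based index stores, here is a minimal sketch of an inverted index in plain Java. It is only an illustration under simplifying assumptions (lowercasing and splitting on non-alphanumeric characters); the class `ToyInvertedIndex` is hypothetical and does not reflect how Lucene implements its index internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ToyInvertedIndex {
    // Maps each term to the sorted set of document ids that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Adds a document and returns its id (ids are assigned in insertion order)
    public int add(String text) {
        int docId = docs.size();
        docs.add(text);
        // Simplified analysis: lowercase, then split on anything that is not a letter or digit
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of all documents containing the given word (case-insensitive)
    public Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

Looking up a word is then a single map access, which is the reason term lookups stay fast even as the number of documents grows.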

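## 3 Adding Documents

Each document gets two fields. The `title` is a `StringField`, which is indexed as a single untokenized term and therefore only matches exactly. The `content` is a `TextField`, which is run through the analyzer so that individual words become searchable. `Field.Store.YES` keeps the original value in the index so it can be shown in search results. Closing the writer commits the added documents.
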
In [3]:
// Method to add a document to the index
public void addDoc(IndexWriter writer, String title, String content) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("title", title, Field.Store.YES));
    doc.add(new TextField("content", content, Field.Store.YES));
    writer.addDocument(doc);
}

// Adding sample documents
addDoc(writer, "Information retrieval", "Information retrieval is the process of obtaining information from a large repository.");
addDoc(writer, "Lucene Introduction", "Lucene is a powerful search engine library written in Java.");
addDoc(writer, "Lucene vs Elasticsearch", "Both Lucene and Elasticsearch are used for full-text search, but Elasticsearch is built on top of Lucene.");
addDoc(writer, "Getting Started with Lucene", "This is a beginner's guide to getting started with Apache Lucene.");
writer.close();
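
The introduction lists updating documents as one of the basics. A minimal sketch of how an update might look is shown here: `IndexWriter.updateDocument` deletes every document that matches a `Term` and adds the replacement in one step. Because the writer above is already closed, the sketch opens its own in-memory index with `ByteBuffersDirectory` (part of lucene-core). The class name `UpdateSketch`, its helper methods, and the "First draft." text are assumptions made for illustration.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

import java.io.IOException;

public class UpdateSketch {
    // Builds a document with the same two fields used in this notebook
    static Document makeDoc(String title, String content) {
        Document doc = new Document();
        doc.add(new StringField("title", title, Field.Store.YES));
        doc.add(new TextField("content", content, Field.Store.YES));
        return doc;
    }

    // Indexes one document, replaces it, and returns the number of live documents
    public static int liveDocsAfterUpdate() throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                writer.addDocument(makeDoc("Lucene Introduction", "First draft."));
                // The title is a StringField, so the term must match the full title exactly
                writer.updateDocument(
                        new Term("title", "Lucene Introduction"),
                        makeDoc("Lucene Introduction",
                                "Lucene is a powerful search engine library written in Java."));
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return reader.numDocs();
            }
        }
    }
}
```

Using the exact-match `title` field as the key works here because `StringField` values are not tokenized. In a larger index, a dedicated unique identifier field would be the more robust choice.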

## 4 Searching the Index

Now that we have indexed our documents, we can perform searches on the index. We will use the `IndexSearcher` and `QueryParser` classes to execute search queries.

In [5]:
// Initialize the IndexSearcher
DirectoryReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);

// Define a method to perform search and display results
public void search(String querystr) throws Exception {
    QueryParser parser = new QueryParser("content", analyzer);
    Query query = parser.parse(querystr);
    ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;

    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
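        int docId = hits[i].doc;
        // IndexSearcher.doc(int) is deprecated in recent 9.x releases; read stored fields via storedFields()
        Document d = searcher.storedFields().document(docId);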
        System.out.println((i + 1) + ". " + d.get("title") + "\t" + d.get("content"));
    }
}

// Performing a search
search("Java");

Found 1 hits.
1. Lucene Introduction	Lucene is a powerful search engine library written in Java.
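
The query `Java` matches the indexed text `Java.` because the same analyzer processes both the documents and the query string. The following sketch shows one way to inspect the terms that `StandardAnalyzer` produces for a piece of text. The class `AnalyzerSketch` and its `tokens` method are hypothetical helpers; the field name passed to `tokenStream` is only a label in this setup.

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class AnalyzerSketch {
    // Returns the terms the StandardAnalyzer emits for the given text
    public static List<String> tokens(String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                      // required before the first incrementToken()
            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end();                        // finalizes the stream state after the last token
        }
        return result;
    }
}
```

Calling `AnalyzerSketch.tokens("Lucene is a powerful search engine library written in Java.")` should return lowercase terms with the trailing period removed, so a search for `Java`, `java`, or `JAVA` resolves to the same indexed term.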
