# Word-based Information Retrieval with Elasticsearch

Created by Jonathan Diebel

This notebook demonstrates how to use Elasticsearch for word-based information retrieval.
Elasticsearch is a search and analytics engine built on Apache Lucene. We'll cover how to set up Elasticsearch, create an index, add documents, update documents, and perform searches. This guide assumes you have Elasticsearch running locally on your machine using the official Docker image.

https://www.elastic.co/de/elasticsearch
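
To make the idea of word-based retrieval concrete, here is a minimal pure-Python sketch of an inverted index, the kind of data structure that search engines such as Elasticsearch rely on: each word maps to the set of documents that contain it. The function names (`tokenize`, `build_inverted_index`) and the sample sentences are illustrative assumptions, and the tokenizer is far simpler than a real analyzer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# Hypothetical sample documents, keyed by id
docs = {
    1: "Elasticsearch is a search engine.",
    2: "A search engine retrieves documents by their words.",
    3: "Docker runs containers.",
}

index = build_inverted_index(docs)
print(sorted(index["search"]))  # ids of documents containing "search"
print(sorted(index["docker"]))  # ids of documents containing "docker"
```

Looking up a word is then a single dictionary access, which is why this structure scales to large collections much better than scanning every document per query.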

## 1 Setup of Elasticsearch

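Ensure you have Docker installed and running on your machine. You can start a single-node Elasticsearch container with the following Docker command. Elasticsearch 8.x enables security (TLS and authentication) by default, so plain `http://localhost:9200` requests fail unless security is turned off. The command below disables it, which is acceptable only for local experimentation:

```bash
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.13.4
```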

We'll use the elasticsearch Python client to interact with our Elasticsearch instance. Install it via pip if you haven't already:

```bash
pip install elasticsearch
```

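In [5]:
import requests

# Set the Elasticsearch URL
elasticsearch_url = 'http://localhost:9200'

try:
    # Send a GET request to the root endpoint, which returns basic node information
    response = requests.get(elasticsearch_url, timeout=5)
except requests.exceptions.ConnectionError as err:
    # Typically seen while the node is still starting, or when security (TLS) is
    # enabled and plain HTTP is dropped, e.g.:
    # ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
    print("Could not connect to Elasticsearch:", err)
    print("Please ensure that Elasticsearch is running and accessible.")
else:
    # Check the response status code
    if response.status_code == 200:
        print("Elasticsearch is running and accessible.")
    else:
        print("There was an error connecting to Elasticsearch. Status code:", response.status_code)
        print("Please ensure that Elasticsearch is running and accessible.")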

In [1]:
# Import the Elasticsearch library
from elasticsearch import Elasticsearch

# Connect to Elasticsearch running locally on the default port.
# The 8.x client needs an explicit host: calling Elasticsearch() without arguments
# raises "ValueError: Either 'hosts' or 'cloud_id' must be specified".
es = Elasticsearch("http://localhost:9200")

# Check if Elasticsearch is running
if es.ping():
    print("Connected to Elasticsearch!")
else:
    print("Could not connect to Elasticsearch. Make sure it's running.")
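
With a working connection, the remaining steps named in the introduction (create an index, add documents, update a document, search) could look roughly like the sketch below. It assumes the 8.x Python client, a node reachable at `http://localhost:9200` without authentication, and a hypothetical index name `ir-demo`; the helper names and sample documents are illustrative and not part of the original notebook. The request bodies are built as plain dictionaries so they can be inspected without a running cluster, and `run_demo()` is only defined here, not called.

```python
INDEX_NAME = "ir-demo"  # hypothetical index name

# Both fields are analyzed as full text, so they are searchable word by word
MAPPINGS = {"properties": {"title": {"type": "text"}, "content": {"type": "text"}}}

# Hypothetical sample documents
SAMPLE_DOCS = [
    {"title": "Document 1", "content": "Elasticsearch is a search and analytics engine."},
    {"title": "Document 2", "content": "Information retrieval finds relevant documents."},
]


def match_query(field, text):
    """Build a full-text `match` query for one field."""
    return {"match": {field: text}}


def run_demo(url="http://localhost:9200"):
    """Create the index, add and update documents, then search (needs a running node)."""
    # Imported lazily so the rest of the sketch works without the client installed
    from elasticsearch import Elasticsearch

    es = Elasticsearch(url)
    if not es.indices.exists(index=INDEX_NAME):
        es.indices.create(index=INDEX_NAME, mappings=MAPPINGS)

    # Add documents with explicit ids so they can be updated later
    for doc_id, doc in enumerate(SAMPLE_DOCS, start=1):
        es.index(index=INDEX_NAME, id=str(doc_id), document=doc)

    # Partially update document 1
    es.update(index=INDEX_NAME, id="1", doc={"title": "Document 1 (updated)"})

    # Make the changes visible to search immediately
    es.indices.refresh(index=INDEX_NAME)

    query = match_query("content", "information retrieval")
    response = es.search(index=INDEX_NAME, query=query)
    return [(hit["_score"], hit["_source"]["title"]) for hit in response["hits"]["hits"]]


print(match_query("content", "information retrieval"))
```

Calling `run_demo()` against the local container should return a list of `(score, title)` pairs ordered by relevance; the explicit refresh is only there so that freshly indexed documents are searchable right away in a demo.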

In [1]:
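# The cells below use PyLucene to work directly with Lucene, the library Elasticsearch is built on.
# PyLucene is a separate package and is not installed by `pip install elasticsearch`;
# a ModuleNotFoundError here means it still has to be installed.
# Initialize the JVM
import lucene
lucene.initVM()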

ModuleNotFoundError: No module named 'lucene'

In [None]:
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import DirectoryReader, IndexWriter, IndexWriterConfig
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import ByteBuffersDirectory  # RAMDirectory was removed in Lucene 9

In [None]:
# Create an in-memory index
directory = ByteBuffersDirectory()
analyzer = StandardAnalyzer()
config = IndexWriterConfig(analyzer)
index_writer = IndexWriter(directory, config)

In [None]:
# Function to add documents to the index
def add_doc(writer, title, content):
    doc = Document()
    doc.add(TextField("title", title, Field.Store.YES))
    doc.add(TextField("content", content, Field.Store.YES))
    writer.addDocument(doc)

# Adding sample documents to the index
add_doc(index_writer, "Document 1", "Lucene is a powerful search library written in Java.")
add_doc(index_writer, "Document 2", "Information retrieval is the process of obtaining information from a large repository.")
add_doc(index_writer, "Document 3", "Lucene supports various types of queries including term, phrase, and wildcard queries.")
index_writer.close()

In [None]:
# Function to search the index
def search_index(query_str, max_hits=10):
    """Return (title, content) pairs for the top hits of query_str in the "content" field."""
    reader = DirectoryReader.open(directory)
    try:
        searcher = IndexSearcher(reader)
        query = QueryParser("content", analyzer).parse(query_str)
        hits = searcher.search(query, max_hits).scoreDocs

        results = []
        for hit in hits:
            doc = searcher.doc(hit.doc)
            results.append((doc.get("title"), doc.get("content")))
        return results
    finally:
        # Release the reader even if parsing or searching fails
        reader.close()

# Example search queries
query1 = "Lucene"
query2 = "information retrieval"
query3 = "supports queries"
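
Lucene's classic `QueryParser` combines the terms of a multi-word query with OR unless told otherwise, so a document matches when it contains at least one of the terms. The pure-Python sketch below imitates that behavior on the three sample documents and contrasts it with AND semantics. `terms_of` and `match` are hypothetical helpers, the query `lucene queries` is an extra example, and the tokenizer only approximates `StandardAnalyzer`.

```python
import re

# The three sample documents from the indexing cell, keyed by title
documents = {
    "Document 1": "Lucene is a powerful search library written in Java.",
    "Document 2": "Information retrieval is the process of obtaining information from a large repository.",
    "Document 3": "Lucene supports various types of queries including term, phrase, and wildcard queries.",
}


def terms_of(text):
    """Lowercase the text and return its distinct word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match(query, require_all=False):
    """Return titles matching any query term (OR) or every query term (AND)."""
    query_terms = terms_of(query)
    combine = all if require_all else any
    return [
        title
        for title, content in documents.items()
        if combine(term in terms_of(content) for term in query_terms)
    ]


print(match("Lucene"))                            # OR over a single term
print(match("lucene queries"))                    # OR: either term is enough
print(match("lucene queries", require_all=True))  # AND: both terms required
```

This only decides which documents match; the order in which Lucene returns them depends on relevance scoring, which this sketch leaves out.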

In [None]:
# Perform searches and display results
for query in (query1, query2, query3):
    print(f"Results for query: '{query}'")
    for title, content in search_index(query):
        print(f"Title: {title}\nContent: {content}\n")
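
The hits returned by `searcher.search` are ordered by relevance score, and recent Lucene versions use BM25 as the default similarity. As a rough illustration of how such a score rewards rare terms and term frequency while normalizing for document length, here is a simplified BM25 in pure Python over the same three documents. The values `k1 = 1.2` and `b = 0.75` are commonly used defaults, the helper names are hypothetical, and Lucene's actual implementation differs in details, so the numbers are not expected to match Lucene's scores.

```python
import math
import re
from collections import Counter

# The three sample documents from the indexing cell, keyed by title
documents = {
    "Document 1": "Lucene is a powerful search library written in Java.",
    "Document 2": "Information retrieval is the process of obtaining information from a large repository.",
    "Document 3": "Lucene supports various types of queries including term, phrase, and wildcard queries.",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (duplicates kept)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return (title, score) pairs sorted by descending simplified BM25 score."""
    tokens = {title: tokenize(content) for title, content in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    scores = {}
    for title, doc_tokens in tokens.items():
        counts = Counter(doc_tokens)
        score = 0.0
        for term in set(tokenize(query)):
            tf = counts[term]
            if tf == 0:
                continue
            # Document frequency: number of documents containing the term
            df = sum(1 for t in tokens.values() if term in t)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer documents are penalized
            norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores[title] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for title, score in bm25_scores("information retrieval", documents):
    print(f"{title}: {score:.3f}")
```

For the query `Lucene`, both matching documents contain the term once, so the shorter "Document 1" scores higher than "Document 3" under this formula.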