# Enhancing Geospatial Data-intensive Knowledge Discovery with OpenSearch and LLM Search

**Author: Yunfan Kang (UIUC)**

**Description**:  
This tutorial demonstrates how the I-GUIDE Platform integrates OpenSearch, Large Language Models (LLMs), and spatial search to create intuitive, multimodal search experiences for researchers and educators. Participants will gain hands-on experience deploying OpenSearch on JetStream2; implementing keyword, semantic, and spatial search; hosting the LLM framework Ollama; and building Retrieval-Augmented Generation (RAG) pipelines with AI agents, using the I-GUIDE Platform as a case study. Attendees will leave with actionable insights into leveraging CI resources for scalable, privacy-aware AI workflows and knowledge discovery.

**Skills and Background**:  
Familiarity with basic Python programming, APIs, and geospatial data is recommended. No prior experience with LLMs or OpenSearch is required.

**Requirements**:  
Participants should bring their own laptops.

---

## 🚀 Getting Started

Welcome! This notebook will guide you step by step — feel free to run each cell as you go.  

You’ll learn how to set up and interact with OpenSearch and LLM-powered search using real-world examples from the I-GUIDE Platform.  

👉 If you encounter any issues or have questions along the way, don’t hesitate to ask a helper. We’re here to support you!

---



## ⚙️ Configure the VM and Run OpenSearch

Please copy and paste the **IP address** and **passphrase** provided to you into the environment variables below.  

If you don’t have access to them yet, please ask a helper — they’ll be happy to assist! 🙌


In [None]:
import os
import warnings
import urllib3

# Suppress InsecureRequestWarning (the VM uses self-signed certificates)
# and generic UserWarning noise from the client libraries
warnings.filterwarnings('ignore', category=urllib3.exceptions.InsecureRequestWarning)
warnings.filterwarnings('ignore', category=UserWarning)

# Copy and paste the IP address and the passphrase of your VM here:
os.environ["JS2_HOST"] = ""
os.environ["JS2_PWD"] = ""

We’re going to run SSH commands inside this Jupyter Notebook to remotely install **OpenSearch** and **Ollama** on your JetStream2 VM — all from here, no need to switch terminals!

In [None]:
!pip install fabric

In [None]:
from fabric import Connection

c = Connection(host=os.environ["JS2_HOST"], user='exouser', connect_kwargs={'password': os.environ["JS2_PWD"]})

c.run('mkdir -p OpenSearch')  # use -p to avoid error if it exists
# Upload the files to the correct subdir
c.put('docker-compose.yml', remote='OpenSearch/docker-compose.yml')
c.put('admin_password.txt', remote='OpenSearch/.env')

# Enter OpenSearch dir
with c.cd('OpenSearch'):
    # Start the OpenSearch nodes
    c.run('sudo sysctl -w vm.max_map_count=262144')
    c.run('docker compose up -d', hide=True)

### ⏳ Wait for the setup to finish

It may take a few minutes for the setup to complete — please be patient.
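
Instead of guessing when the services are up, you can poll for them. The sketch below is one possible approach, not part of the original notebook: `port_open` and `wait_until` are hypothetical helper names, and an open TCP port only indicates that something is listening, not that OpenSearch has finished initializing.

```python
import socket
import time


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(check, timeout=300.0, interval=5.0):
    """Call check() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_until(lambda: port_open(os.environ["JS2_HOST"], 5601))` would block until the Dashboard port accepts connections or five minutes pass.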

> 💡 **How Retrieval Works in RAG**  
> Before the LLM generates an answer, a retriever component searches an indexed knowledge base (using vector and/or keyword search), selects the most relevant document chunks, and feeds them into the prompt, grounding the response in actual data rather than relying solely on model memory.
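
The retrieval step described above can be illustrated without any services running. This is a toy sketch under stated assumptions: the three-dimensional vectors and the chunk titles are made up for illustration, and `cosine`, `retrieve`, and `build_prompt` are hypothetical names. In the real pipeline, OpenSearch does the ranking and the vectors come from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunks, top_k=2):
    """Rank (text, vector) pairs by similarity to the query and keep the best top_k texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(question, context_chunks):
    """Place the retrieved chunks ahead of the question so the model answers from them."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"


# Toy data: made-up titles with made-up 3-dimensional "embeddings".
chunks = [
    ("Chicago land-use dataset", [0.9, 0.1, 0.0]),
    ("Intro to machine learning notebook", [0.0, 0.2, 0.9]),
    ("Illinois flood maps", [0.7, 0.6, 0.1]),
]

top = retrieve([1.0, 0.2, 0.0], chunks, top_k=2)
prompt = build_prompt("Which datasets cover Chicago?", top)
print(prompt)
```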


> 💡 **Why OpenSearch?**  
> OpenSearch is a powerful, open-source search engine that supports fast keyword, semantic, and spatial search — making it ideal for building advanced knowledge discovery pipelines on geospatial and multimodal data.
  

Once it’s ready, you can access the **OpenSearch Dashboard** in your browser at:  
`http://[your-ip]:5601`

To log in, use the following credentials (set in your `admin_password.txt` file):

- **Username:** `admin`  
- **Password:** `Iguideforum2025!`

✅ Once you’re in, you should see the OpenSearch Dashboard home page.  
If you run into connection errors, give it another minute; sometimes the service takes a little extra time to start fully.

---


## 🔗 Connect to the OpenSearch Cluster

If you saw no errors above and were able to log in to the OpenSearch Dashboard — congratulations! 🎉  
You’ve successfully set up an **OpenSearch cluster** with 2 nodes running on a single host.

---

In this next part of the tutorial, we will:

✅ Connect to the OpenSearch cluster programmatically  
✅ Define the schema (index mapping)  
✅ Ingest real data from the I-GUIDE Platform  

The files we’ll be using:

- **Schema definition:** `index_schema.json`  
- **Data:** `i_guide_spatial_embedding_export.jsonl`

---

Once this is done, your cluster will be ready to power **keyword**, **spatial**, and **semantic search**, and soon we’ll combine that with LLM capabilities!
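
For orientation, an index body that supports all three search modes might look roughly like the sketch below. The contents of `index_schema.json` are not reproduced in this notebook, so every field type here is an assumption inferred from the field names the queries in this tutorial use. The one firm number is the vector dimension: `all-MiniLM-L6-v2`, the embedding model used for semantic search, produces 384-dimensional vectors.

```python
import json

# Hypothetical index body; the real index_schema.json may differ.
example_index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN search on this index
    "mappings": {
        "properties": {
            # Full-text fields used by keyword search
            "title": {"type": "text"},
            "contents": {"type": "text"},
            "tags": {"type": "text"},
            "authors": {"type": "text"},
            # Exact-match field (assumed type)
            "resource-type": {"type": "keyword"},
            # Dense vector used by semantic (k-NN) search
            "contents-embedding": {"type": "knn_vector", "dimension": 384},
            # GeoJSON geometry used by spatial search
            "spatial-bounding-box-geojson": {"type": "geo_shape"},
        }
    },
}

print(json.dumps(example_index_body, indent=2))
```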


In [None]:
# Install the opensearch python package
!pip install opensearch-py

In [None]:
from opensearchpy import OpenSearch
import json

# Define your opensearch client to connect to the cluster
os_client = OpenSearch(
    hosts=[{'host': os.environ["JS2_HOST"], 'port': 9200}],
    http_auth=('admin', 'Iguideforum2025!'), 
    use_ssl=True,
    verify_certs=False    # (We do not have SSL certificates for the VMs)
)


In [None]:
# Load your index schema from a JSON file
with open('index_schema.json', 'r') as f:
    index_body = json.load(f)

# Name of your index
index_name = 'iguide_platform'

# Create index
if not os_client.indices.exists(index=index_name):
    response = os_client.indices.create(index=index_name, body=index_body)
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")


In [None]:
# Verify the index schema
mapping = os_client.indices.get_mapping(index=index_name)
print(json.dumps(mapping, indent=2))

In [None]:
from opensearchpy import helpers


# The index you want to insert into
target_index_name = 'iguide_platform'

# Read from exported file and prepare bulk actions
actions = []
with open('i_guide_spatial_embedding_export.jsonl', 'r') as f:
    for line in f:
        doc = json.loads(line)
        # Prepare one bulk action
        action = {
            "_index": target_index_name,
            "_id": doc["_id"],  # Optional — if you want to preserve the same _id
            "_source": {k: v for k, v in doc.items() if k != "_id"}  # Remove _id from _source
        }
        actions.append(action)
# Bulk insert — done in batches of 500 by default
helpers.bulk(os_client, actions)

print(f"Bulk insert into index '{target_index_name}' completed.")
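
For larger exports, building the full `actions` list in memory may not be ideal. Since `helpers.bulk` accepts any iterable, a generator can stream actions one line at a time. The sketch below is an alternative under that assumption; `iter_bulk_actions` is a hypothetical helper name, and unlike the cell above it tolerates blank lines and documents without an `_id`.

```python
import json


def iter_bulk_actions(lines, index_name):
    """Yield one bulk action per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        doc_id = doc.pop("_id", None)  # keep _id out of _source
        action = {"_index": index_name, "_source": doc}
        if doc_id is not None:
            action["_id"] = doc_id  # preserve the exported document ID
        yield action


# Small in-memory sample standing in for the JSONL file
sample = ['{"_id": "a1", "title": "Example"}', "", '{"title": "No id"}']
actions = list(iter_bulk_actions(sample, "iguide_platform"))
print(actions)
```

With a real file, this would be used as `helpers.bulk(os_client, iter_bulk_actions(f, target_index_name))` inside the `with open(...) as f:` block.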


Next up: let’s try three powerful search methods—**keyword search**, **spatial search**, and **semantic search**—on the I‑GUIDE Knowledge Base! 🚀


In [None]:
keyword_query = "Chicago"
response = os_client.search(
    index='iguide_platform',
    body={
        "query": {
            "multi_match": {
                "query": keyword_query,
                "fields": ["title", "contents", "tags", "authors"]
            }
        }
    }
)

# Print out hits
for hit in response["hits"]["hits"]:
    print(f"Score: {hit['_score']}  ID: {hit['_id']}, Type: {hit['_source'].get('resource-type')},  Title: {hit['_source'].get('title', '')}")
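
The `multi_match` query above weights all four fields equally. If title matches should count more, fields can be boosted with the `^` suffix, and `fuzziness` can absorb small typos. The sketch below only builds the request body; `keyword_query_body` is a hypothetical helper, and the boost values are illustrative assumptions rather than tuned settings.

```python
def keyword_query_body(text, fields=None, size=10):
    """Build a multi_match request body with per-field boosts and typo tolerance."""
    if fields is None:
        # Assumed weights: title counts 3x, tags 2x
        fields = ["title^3", "tags^2", "contents", "authors"]
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "fuzziness": "AUTO",  # allow small edit distances, e.g. "Chicgo"
            }
        },
    }


body = keyword_query_body("Chicago")
print(body)
```

The result can be passed directly as `os_client.search(index='iguide_platform', body=body)`.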


In [None]:
# Bounding box coordinates for Chicago (lon/lat)
bbox_envelope = {
    "type": "envelope",
    "coordinates": [
        [-87.94011, 42.02304],   # top-left
        [-87.52398, 41.64454]    # bottom-right
    ]
}

# Geo_shape query body
query_body = {
    "size": 20,
    "query": {
        "geo_shape": {
            "spatial-bounding-box-geojson": {
                "shape": bbox_envelope,
                "relation": "intersects"
            }
        }
    }
}

# Run search
response = os_client.search(
    index='iguide_platform',
    body=query_body
)

# Print results
for hit in response["hits"]["hits"]:
    print(f"Score: {hit['_score']}  ID: {hit['_id']}  Title: {hit['_source'].get('title', '')}")
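
The `envelope` shape expects its corners in a specific order: top-left `[min_lon, max_lat]` first, then bottom-right `[max_lon, min_lat]`. Swapping them is an easy mistake, so a small helper that takes the four bounds by name can be useful. This is a sketch with a hypothetical function name; it deliberately rejects boxes that cross the antimeridian, which would need separate handling.

```python
def make_envelope(min_lon, min_lat, max_lon, max_lat):
    """Return an envelope shape with corners in [top-left, bottom-right] order."""
    if not (-180 <= min_lon <= max_lon <= 180):
        raise ValueError("longitudes must satisfy -180 <= min_lon <= max_lon <= 180")
    if not (-90 <= min_lat <= max_lat <= 90):
        raise ValueError("latitudes must satisfy -90 <= min_lat <= max_lat <= 90")
    return {
        "type": "envelope",
        "coordinates": [
            [min_lon, max_lat],  # top-left
            [max_lon, min_lat],  # bottom-right
        ],
    }


# Same Chicago bounding box as in the cell above
chicago = make_envelope(-87.94011, 41.64454, -87.52398, 42.02304)
print(chicago)
```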


In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load embedding model
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def compute_embedding(text):
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        max_length=512,
        truncation=True
    )
    with torch.no_grad():
        embeddings = model(**inputs).last_hidden_state.mean(dim=1)
    return embeddings[0].tolist()

# Compute embedding for your sentence
query_sentence = "What are the knowledge elements about Machine Learning?"
embedding_vector = compute_embedding(query_sentence)

# Run knn search
response = os_client.search(
    index='iguide_platform',
    body={
        "size": 12,
        "query": {
            "knn": {
                "contents-embedding": {
                    "vector": embedding_vector,
                    "k": 10
                }
            }
        }
    }
)

# Print out hits
for hit in response["hits"]["hits"]:
    print(f"Score: {hit['_score']}  ID: {hit['_id']}  Title: {hit['_source'].get('title', '')}")
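
Semantic and spatial search can also be combined in one request by wrapping the `knn` clause in a `bool` query with a `geo_shape` filter. The sketch below only builds the request body and assumes the same field names as the cells above; `knn_with_geo_filter` is a hypothetical helper. Note that a Boolean filter of this kind is applied after the nearest neighbors are found, so the query may return fewer than `k` hits.

```python
def knn_with_geo_filter(vector, envelope, k=10, size=10,
                        vector_field="contents-embedding",
                        geo_field="spatial-bounding-box-geojson"):
    """Build a request body: k-NN similarity restricted to shapes intersecting an envelope."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Similarity ranking
                "must": [
                    {"knn": {vector_field: {"vector": vector, "k": k}}}
                ],
                # Spatial restriction (does not affect the score)
                "filter": [
                    {"geo_shape": {geo_field: {"shape": envelope, "relation": "intersects"}}}
                ],
            }
        },
    }


# Placeholder inputs; in the notebook these would be embedding_vector and bbox_envelope
demo_body = knn_with_geo_filter(
    [0.0] * 384,
    {"type": "envelope", "coordinates": [[-87.94011, 42.02304], [-87.52398, 41.64454]]},
    k=5,
)
```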


## 🤖 Set Up a Self-Hosted LLM and Connect It to OpenSearch

In this section, we’ll deploy a self-hosted Large Language Model (LLM) using **Ollama** and connect it to our OpenSearch cluster.  

This will enable us to build a complete **Retrieval-Augmented Generation (RAG)** pipeline — combining keyword, semantic, and spatial search with LLM-powered reasoning and text generation.


> 💡 **Why use a self-hosted LLM?**  
> Running an LLM locally (with Ollama) ensures that your data stays private, avoids external API costs, and enables fully reproducible experiments — all critical for scalable and privacy-aware research workflows.


In [None]:
# Install Ollama on the JetStream2 VM
c.run('curl -fsSL https://ollama.com/install.sh | sh')

In [None]:
# Start the Ollama server
c.run('sudo pkill -f "ollama serve" || true', warn=True)
c.run('OLLAMA_HOST=0.0.0.0:11500 nohup ollama serve > server.log 2>&1 &', hide=True)
import time
time.sleep(5)
# Pull qwen2.5:3b-instruct model
c.run('OLLAMA_HOST=0.0.0.0:11500 ollama pull qwen2.5:3b-instruct')

In [None]:
# Verify the available models
import requests
url = f'http://{os.environ["JS2_HOST"]}:11500/api/tags'  # Ollama listens on port 11500 on the VM

response = requests.get(url)
print(response.json())
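
Rather than reading the printed JSON by eye, you can check for the model programmatically. The `/api/tags` response holds a `models` list whose entries carry a `name` field; the sketch below relies on only that much. `list_model_names` and `has_model` are hypothetical helpers, and the sample response is trimmed to the single field used.

```python
def list_model_names(tags_response):
    """Extract model names from an Ollama /api/tags response dict."""
    return [m.get("name", "") for m in tags_response.get("models", [])]


def has_model(tags_response, model_name):
    """Return True if the named model appears in the tags response."""
    return model_name in list_model_names(tags_response)


# Trimmed sample; a real response carries more fields per model
sample_tags = {"models": [{"name": "qwen2.5:3b-instruct"}]}
print(has_model(sample_tags, "qwen2.5:3b-instruct"))
```

In the notebook, `has_model(response.json(), "qwen2.5:3b-instruct")` would confirm that the pull finished.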

## Ollama is all set up — let’s start chatting with our LLM! 🚀


In [None]:
import requests
import os
import json
import sys
import time

def call_ollama(query_payload):
    ollama_host = f"http://{os.environ['JS2_HOST']}:11500/api/chat"  # use /api/chat for chat format
    controller = requests.Session()

    try:
        # Streamed POST request
        with controller.post(
            ollama_host,
            json=query_payload,
            stream=True,    # enable streaming
            timeout=60
        ) as response:

            response.raise_for_status()  # Raise exception for bad status codes

            # Stream the response content progressively
            for line in response.iter_lines():
                if line:
                    chunk = json.loads(line.decode('utf-8'))

                    # Extract message content
                    delta = chunk.get("message", {}).get("content", "")
                    print(delta, end='', flush=True)  # Progressive printing

            print()  # Final newline after stream ends

    except Exception as e:
        print("Error fetching from Ollama host:", e)

# Define the payload for the chat format
query_payload = {
    "model": "qwen2.5:3b-instruct",
    "messages": [
        {"role": "user", "content": "How do I get to UIUC from Chicago?"}
    ],
    "stream": True
}

# Call the Ollama API using the reusable function
call_ollama(query_payload)
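
`call_ollama` prints the reply as it streams but does not keep it. If you want the full text as a string, for logging or post-processing, the same per-line parsing can accumulate the pieces. A minimal sketch, assuming each streamed line is a JSON object with a `message.content` fragment and a `done` flag on the last one; `collect_stream` is a hypothetical helper and the sample lines are made up.

```python
import json


def collect_stream(lines):
    """Join the content fragments of a streamed chat response into one string."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip keep-alive blank lines
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk reached
    return "".join(parts)


# Made-up lines in the shape that response.iter_lines() yields
sample_lines = [
    b'{"message": {"role": "assistant", "content": "Take "}, "done": false}',
    b"",
    b'{"message": {"role": "assistant", "content": "Amtrak."}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]
text = collect_stream(sample_lines)
print(text)
```

Inside `call_ollama`, this would be called as `collect_stream(response.iter_lines())` in place of the printing loop.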



Final step: It’s time to bring it all together — let’s connect **OpenSearch semantic search** with our **LLM** to enable a full **RAG pipeline**! 🚀

In [None]:
def rag(user_query, k=3, size=10):
    # Step 1 — Compute embedding
    embedding_vector = compute_embedding(user_query)

    # Step 2 — Semantic search (KNN)
    response = os_client.search(
        index='iguide_platform',
        body={
            "size": size,
            "query": {
                "knn": {
                    "contents-embedding": {
                        "vector": embedding_vector,
                        "k": k
                    }
                }
            }
        }
    )

    # Step 3 — Build context from top hits
    retrieved_docs = response["hits"]["hits"]
    context_texts = []
    for hit in retrieved_docs:
        source = hit["_source"]
        
        id_ = hit["_id"]
        title = source.get("title", "")
        contents = source.get("contents", "")
        authors = source.get("authors", "")
        contributor = source.get("contributor", "")
        resource_type = source.get("resource-type", "")
        spatial_coverage = source.get("spatial-coverage", "")
        spatial_temporal_coverage = source.get("spatial-temporal-coverage", "")
        tags = source.get("tags", "")

        # Build one text block per document
        doc_block = f"""<doc>
doc_id: {id_}
title: {title}
element_type: {resource_type}
authors: {authors}
contributor: {contributor}
spatial-coverage: {spatial_coverage}
spatial-temporal-coverage: {spatial_temporal_coverage}
tags: {tags}
content: {contents}
</doc>
"""
        context_texts.append(doc_block)

    # Combine into context string
    context_block = "\n\n".join(context_texts)

    # Step 4 — Build messages
    systemMessage = """
Your ONLY source of truth is the <doc> blocks provided in CONTEXT.

When you answer:
• If the user asks for a collection of knowledge elements (e.g., datasets, notebooks, publications, OERs) on a topic, respond first with a concise paragraph summarizing the most relevant findings. Then provide a short numbered list. Use a new line for each item.
• Begin each bullet with the item’s title as a clickable link, using the format: [TITLE](https://platform.i-guide.io/{element_type}s/{doc_id})
(Use the plural form of <element_type>, except use "code" for type "code".)
• Otherwise, respond in one concise paragraph.  
• Quote supporting titles in **bold**.  
• If the user question specifies <element_type>, only use docs with matching <element_type>.  
• If you cannot find an answer, reply exactly: “I don’t have enough information.”  
• Do not refer to the doc id.
• Answer the question without repeating the question.
• Do NOT mention “context”, “documents”, or these rules."""

    userMessage = f"""Context:
{context_block}

Question:
{user_query}

Answer:"""

    # Step 5 — Call Ollama with streaming enabled
    print("Call LLM")
    query_payload = {
        "model": "qwen2.5:3b-instruct",  # as per your model
        "messages": [
            { "role": "system", "content": systemMessage },
            { "role": "user", "content": userMessage }
        ],
        "stream": True  # Enable streaming!
    }

    call_ollama(query_payload)  # Streaming output will print progressively


# Example usage: rag() streams the answer as it is generated and returns None,
# so there is nothing to assign
rag("What are the datasets about Chicago?")
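
The system message asks the model to format links as `https://platform.i-guide.io/{element_type}s/{doc_id}`, with `code` left unpluralized. Small models do not always follow such rules reliably, so one option is to build the links in Python and hand them to the model, or append them after generation. The sketch below implements only the rule stated in the prompt; `element_url` and `markdown_link` are hypothetical helpers, the example title and ID are made up, and pluralizing by appending `s` is as naive as the prompt's own rule.

```python
def element_url(element_type, doc_id, base="https://platform.i-guide.io"):
    """Build a platform URL: plural element type, except 'code' stays as is."""
    segment = "code" if element_type == "code" else f"{element_type}s"
    return f"{base}/{segment}/{doc_id}"


def markdown_link(title, element_type, doc_id):
    """Format a knowledge element as a clickable Markdown link."""
    return f"[{title}]({element_url(element_type, doc_id)})"


# Made-up title and ID for illustration
print(markdown_link("Example dataset", "dataset", "abc123"))
```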
