# Tutorial Overview: Building a Semantic Search System from JSON Data with jRAG

This tutorial walks through building a semantic search system that finds relevant JSON documents for a natural-language query using [jRAG](https://pypi.org/project/jrag/), a tool for generating embedding strings from JSON/dictionary fields.

This step is necessary because a JSON object cannot be embedded as-is; it first needs a string representation. `json.dumps()` can produce one, but we often want to embed only the relevant fields while still retrieving the entire JSON document. jRAG simplifies this.
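
To make the trade-off concrete, here is a minimal sketch in plain Python with no jRAG involved. The `doc` record and the `selected_text` helper are hypothetical and exist only for illustration; the string jRAG actually produces may be formatted differently.

```python
import json

# Hypothetical record, shaped like the documents used later in this tutorial.
doc = {
    "id": "report_tech_001",
    "title": "FAISS Library Analysis",
    "status": "published",
    "abstract": "An in-depth look at FAISS.",
    "metadata": {"confidence": 0.95, "tags": ["ai", "similarity search"]},
}

# Option 1: serialize everything, including fields that add noise to an embedding.
full_text = json.dumps(doc)


def selected_text(record):
    """Hand-rolled stand-in: join only the fields we want embedded."""
    parts = [
        f"Title: {record['title']}",
        f"Tags: {', '.join(record['metadata']['tags'])}",
        f"Abstract: {record['abstract']}",
    ]
    return "\n".join(parts)


# Option 2: embed a short, focused string but keep `doc` around for retrieval.
focused_text = selected_text(doc)
print(focused_text)
```

The focused string leaves out the `id`, `status`, and `confidence` values, yet the full `doc` is still available to return once a match is found.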

jRAG is built on [jsonpath-ng](https://pypi.org/project/jsonpath-ng/), so refer to its documentation for the expressions used in the configs (see the cell under "Combine fields from JSON data" for an example).

# Step by Step

Here's a breakdown of the steps demonstrated:

1. Data Preparation (JSON): The tutorial starts with a list of Python dictionaries (`json_lst`), simulating a collection of JSON documents. Each document contains structured information such as titles, authors, abstracts, and tags.

2. Text Conversion with jRAG:
    * Challenge: Embedding models (like Sentence Transformers) take text as input, not raw JSON structures.
    * Solution: The `jrag` library converts the structured JSON data into meaningful text strings suitable for embedding.
    * Configuration: A `jrag_config` dictionary maps descriptive labels (like 'Title', 'Abstract') to jsonpath-ng expressions. This tells jRAG exactly which fields to extract from the JSON and combine.
    * Execution: `jrag.tag_list(json_lst, jrag_config)` is called. It iterates through the list of dictionaries, applies the configured extraction rules to each, and adds a new key (defaulting to `'jrag_text'`) containing the resulting flattened text string to each dictionary.


3. Corpus Creation: The generated text strings (`jrag_text`) are extracted from the modified dictionaries to form `corpus_texts`. Crucially, a parallel list (`corpus_metadata`) keeps references (or the full original items) back to the original JSON data, so the full document can be retrieved later.

4. Embedding Generation:
    * A pre-trained SentenceTransformer model (all-MiniLM-L6-v2) is loaded.
    * This model is used to convert each text string in corpus_texts into a high-dimensional numerical vector (embedding). These embeddings capture the semantic meaning of the text.

5. Vector Indexing with FAISS:
    * Facebook AI Similarity Search (FAISS) is used to create an index (`IndexFlatL2`) over the generated embeddings. This index type is a simple baseline that performs an exhaustive search.
    * The corpus embeddings are added to this index. FAISS allows for very fast searching over large numbers of vectors.

6. Querying:
    * A sample text query ("Tell me about semantic search technologies.") is defined.
    * The same Sentence Transformer model is used to convert the query text into its own embedding vector.
    * The FAISS index's search method is used to find the embeddings in the index that are most similar (closest in vector space, using L2 distance here) to the query embedding.

7. Retrieval and Display:
    * The search returns the indices of the top N most similar items in the original corpus.
    * Using the corpus_metadata list created earlier, the indices are mapped back to the original JSON documents.
    * The tutorial then prints the details of these retrieved JSON documents, demonstrating that the system successfully found the entries semantically related to the query.

In essence, the tutorial shows how jRAG acts as a bridge, transforming structured JSON data into text that modern NLP tools can process for tasks like semantic search and retrieval. This forms the core "retrieval" part of a RAG system.

# Installs

In [None]:
!pip install sentence-transformers faiss-cpu numpy jrag --quiet

# Imports

In [None]:
import json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
import time
import jrag

# JSON data

In [None]:
json_lst = [
  {
    "id": "report_tech_001",
    "title": "FAISS Library Analysis",
    "author": "Alice Smith",
    "timestamp": "2025-04-10T10:00:00Z",
    "category": "tech",
    "status": "published",
    "abstract": "An in-depth look at FAISS, a library developed by Facebook AI for efficient similarity search and clustering of dense vectors. We explore its various index types and performance characteristics.",
    "metadata": {
      "source": "Internal Research",
      "confidence": 0.95,
      "word_count": 450,
      "tags": ["ai", "similarity search", "vector database", "facebook ai"]
    }
  },
  {
    "id": "news_astro_002",
    "title": "JWST Captures New Nebula",
    "author": "Bob Johnson",
    "timestamp": "2025-04-11T14:30:00Z",
    "category": "science",
    "status": "published",
    "abstract": "The James Webb Space Telescope has delivered breathtaking new imagery of the 'Cosmic Cliffs' region within the Carina Nebula, revealing previously hidden star formation.",
    "metadata": {
      "source": "NASA Press Release",
      "confidence": 0.99,
      "image_ref": "jwst_carina_01.jpg",
      "tags": ["astronomy", "jwst", "telescope", "space", "nebula"]
    }
  },
  {
    "id": "paper_nlp_003",
    "title": "Sentence Embedding Techniques",
    "author": "Carol Williams",
    "timestamp": "2025-04-12T09:15:00Z",
    "category": "tech",
    "status": "published",
    "abstract": "This paper reviews modern techniques for computing meaningful sentence and text embeddings, focusing on transformer-based models like those provided by the Sentence Transformers library.",
    "metadata": {
      "source": "AI Conference Proc.",
      "doi": "10.1234/aiconf.2025.5678",
      "word_count": 8500,
      "tags": ["nlp", "ai", "embeddings", "sentence transformers", "deep learning"]
    }
  },
  {
    "id": "brief_quantum_004",
    "title": "Quantum Computing Update",
    "author": "David Brown",
    "timestamp": "2025-04-12T16:00:00Z",
    "category": "news",
    "status": "draft",
    "abstract": "Recent advancements in qubit stability mark a significant step forward for practical quantum computing applications. Researchers highlight potential impacts on cryptography and materials science.",
    "metadata": {
      "source": "Tech Journal X",
      "confidence": 0.88,
      "tags": ["quantum computing", "technology", "research"]
    }
  },
  {
    "id": "howto_ai_005",
    "title": "Scalable Semantic Search Guide",
    "author": "Alice Smith",
    "timestamp": "2025-04-13T11:00:00Z",
    "category": "tech",
    "status": "published",
    "abstract": "A practical guide demonstrating how to combine Sentence Transformers for embedding generation and FAISS for indexing to build scalable semantic search systems capable of handling large datasets.",
    "metadata": {
      "source": "Tech Blog",
      "difficulty": "intermediate",
      "word_count": 1200,
      "tags": ["semantic search", "ai", "faiss", "embeddings", "tutorial"]
    }
  },
  {
    "id": "discovery_bio_006",
    "title": "New Deep-Sea Species",
    "author": "Eva Green",
    "timestamp": "2025-04-09T08:45:00Z",
    "category": "science",
    "status": "published",
    "abstract": "Marine biologists participating in the 'Ocean Depths' expedition have officially classified a new species of bioluminescent fish found near hydrothermal vents in the Pacific Ocean.",
    "metadata": {
      "source": "Journal of Marine Biology",
      "confidence": 0.97,
      "location": "Mariana Trench Region",
      "tags": ["biology", "marine life", "discovery", "oceanography"]
    }
  }
]

# Combine fields from JSON data

We select the fields most relevant to the use case and merge them into a single string, which is then used to create the embedding.

In [None]:
# Pick which fields to combine using jsonpath-ng expressions
jrag_config = {
    'Title': '$.title',
    'Author': '$.author',
    'Category': '$.category',
    'Tags': '$.metadata.tags[*]',  # Select every element of the tags list
    'Abstract': '$.abstract'
}

# tag_list adds the new combined field to the json
json_lst = jrag.tag_list(json_lst, jrag_config)

In [None]:
# Inspect first example
first_json = json_lst[0]
first_json['jrag_text']

# Build corpus

Here we create the embeddings and the vector store using SentenceTransformer and FAISS (everything is built locally, so no API key is needed).

In [None]:
# Extract the text content and keep track of original data reference
# We store the original index to map FAISS results back to our JSON objects
corpus_texts = []
corpus_metadata = [] # To store original dicts or just IDs

for i, item in enumerate(json_lst):
    jrag_text = item.get('jrag_text')
    if jrag_text and isinstance(jrag_text, str):
        corpus_texts.append(jrag_text)
        # Store the original item or just its ID for later retrieval
        # Storing the whole item is easier for this example
        corpus_metadata.append({"original_index": i, "data": item})
    else:
        print(f"Warning: Item at index {i} is missing the 'jrag_text' key or its value is not a string. Skipping.")

if not corpus_texts:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise ValueError("No valid text content found in the JSON data.")

print(f"Loaded {len(corpus_texts)} items with text content.")

# --- Configuration ---
MODEL_NAME = 'all-MiniLM-L6-v2' # A good & fast general-purpose model
NUM_NEIGHBORS = 3 # How many similar items to retrieve

# --- 2. Load Sentence Transformer Model ---
print(f"Loading Sentence Transformer model '{MODEL_NAME}'...")
start_time = time.time()
model = SentenceTransformer(MODEL_NAME)
end_time = time.time()
print(f"Model loaded in {end_time - start_time:.2f} seconds.")

# --- 3. Generate Embeddings ---
print("Generating embeddings for the corpus...")
start_time = time.time()
# Ensure convert_to_numpy=True for FAISS compatibility
corpus_embeddings = model.encode(corpus_texts, convert_to_numpy=True, show_progress_bar=True)
end_time = time.time()
print(f"Embeddings generated in {end_time - start_time:.2f} seconds.")

# FAISS requires float32 type
corpus_embeddings = corpus_embeddings.astype('float32')

# Get the dimensionality of embeddings (required by FAISS)
embedding_dim = corpus_embeddings.shape[1]
print(f"Embedding dimension: {embedding_dim}")

# --- 4. Build FAISS Index ---
# Using IndexFlatL2 - simple baseline, performs exhaustive search
# L2 distance = Euclidean distance
print("Building FAISS index (IndexFlatL2)...")
index = faiss.IndexFlatL2(embedding_dim)

# --- 5. Add Embeddings to Index ---
print(f"Adding {len(corpus_embeddings)} embeddings to the index...")
index.add(corpus_embeddings)
print(f"Index contains {index.ntotal} vectors.")
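
For intuition, the exhaustive search that a flat L2 index performs can be sketched in a few lines of plain Python. This is a simplified stand-in, not FAISS's implementation; `squared_l2`, `brute_force_l2`, and the toy vectors are hypothetical. FAISS's `IndexFlatL2` reports squared Euclidean distances, which the sketch mirrors by skipping the square root.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_l2(corpus, query, k):
    """Compare the query against every corpus vector and keep the k closest."""
    scored = [(squared_l2(vec, query), i) for i, vec in enumerate(corpus)]
    scored.sort()  # smallest distance first
    top = scored[:k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    return distances, indices


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
toy_corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_query = [0.9, 0.1]

toy_distances, toy_indices = brute_force_l2(toy_corpus, toy_query, k=2)
print(toy_indices)
```

Because every corpus vector is compared against the query, the cost grows linearly with the corpus size. That is fine for six documents, but it is the reason FAISS also offers approximate index types for large collections.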

# Query

We can now query `"Tell me about semantic search technologies."`

In [None]:
# --- 6. Prepare and Perform Search ---
query_text = "Tell me about semantic search technologies."
print(f"\nPerforming search for query: '{query_text}'")
print(f"Finding top {NUM_NEIGHBORS} similar items...")

# Generate embedding for the query
start_time = time.time()
query_embedding = model.encode([query_text], convert_to_numpy=True)
query_embedding = query_embedding.astype('float32')
end_time = time.time()
print(f"Query embedding generated in {end_time - start_time:.2f} seconds.")

# Perform the search
start_time = time.time()
# The search function returns distances and indices (IDs) of neighbors
# query_embedding needs to be 2D array (even if it's just one query)
distances, indices = index.search(query_embedding, NUM_NEIGHBORS)
end_time = time.time()
print(f"Search completed in {end_time - start_time:.4f} seconds.")

# --- 7. Display Results ---
print("\nSearch Results:")
print("--------------")

# indices[0] contains the results for the first (and only) query
# distances[0] contains the corresponding distances
if not indices[0].size:
    print("No results found.")
else:
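    for i, idx in enumerate(indices[0]):
        # FAISS pads results with -1 when fewer than NUM_NEIGHBORS vectors exist;
        # without this check, -1 would silently select the last corpus item
        if idx < 0:
            continue
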
        # Map the index `idx` from FAISS back to our original data
        # This works because we added embeddings in the same order as corpus_metadata
        original_item_info = corpus_metadata[idx]
        original_item = original_item_info['data']
        distance = distances[0][i]

        print(f"Rank {i+1}:")
        print(f"  Distance: {distance:.4f}")
        print(f"  ID: {original_item.get('id', 'N/A')}")
        print(f"  Category: {original_item.get('category', 'N/A')}")
        print(f"  Content: {original_item}")
        print("-" * 10)
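
The display loop above can also be folded into a small reusable helper. The sketch below is one possible shape, assuming the `corpus_metadata` layout used in this tutorial; the `collect_hits` name is hypothetical, and plain lists stand in for the arrays that `index.search` returns. It skips the `-1` placeholders that FAISS returns when fewer neighbors are available than were requested.

```python
def collect_hits(indices_row, distances_row, metadata):
    """Pair each valid neighbor index with its distance and original document."""
    hits = []
    for idx, dist in zip(indices_row, distances_row):
        idx = int(idx)
        if idx < 0 or idx >= len(metadata):
            continue  # -1 marks "no neighbor found"
        hits.append({"distance": float(dist), "data": metadata[idx]["data"]})
    return hits


# Toy stand-ins for corpus_metadata and one row of search output.
demo_metadata = [
    {"original_index": 0, "data": {"id": "doc_a"}},
    {"original_index": 1, "data": {"id": "doc_b"}},
]
demo_hits = collect_hits([1, 0, -1], [0.12, 0.57, 3.4e38], demo_metadata)
print([hit["data"]["id"] for hit in demo_hits])
```

With the objects from this notebook, the call would look like `collect_hits(indices[0], distances[0], corpus_metadata)`.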