# RAG Demo: Keywords Enhance Document Retrieval

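**Core Concept:** Storing YAKE keywords as metadata alongside each document makes RAG retrieval easier to explain and refine. Retrieval in this demo still runs on embedding similarity; the keywords show why each hit is relevant.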

1. Extract keywords from documents using YAKE
2. Store keywords as metadata in ChromaDB
3. Show keyword overlap to explain why documents are retrieved

In [1]:
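# Setup
from pathlib import Path

import chromadb
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from yake import KeywordExtractor

load_dotenv()  # read API keys and other settings from a local .env file
DOCS_DIR = Path("demo_documents")
DB_DIR = "./chroma_db"

def extract_keywords(text: str, top: int = 15) -> list[str]:
    """Return up to `top` YAKE keywords (1- and 2-word phrases), best first."""
    extractor = KeywordExtractor(n=2, top=top, dedup_threshold=0.6)
    # YAKE yields (keyword, score) pairs; lower scores mean more relevant keywords
    return [kw for kw, _score in extractor.extract_keywords(text)]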

## Step 1: Index Documents with Keywords

In [3]:
# Setup ChromaDB
client = chromadb.PersistentClient(path=DB_DIR)
try:
    client.delete_collection("demo")  # start from a clean collection on every run
except Exception:
    pass  # nothing to delete on the first run
collection = client.create_collection("demo")
embeddings = OpenAIEmbeddings()

# Index documents
for txt_file in sorted(DOCS_DIR.glob("*.txt")):
    text = txt_file.read_text(encoding="utf-8")
    keywords = extract_keywords(text, top=15)
    # embed_documents is the document-side counterpart of embed_query
    embedding = embeddings.embed_documents([text])[0]
    
    collection.add(
        ids=[txt_file.stem],
        embeddings=[embedding],
        documents=[text],
        metadatas=[{"keywords": ", ".join(keywords), "filename": txt_file.name}]
    )
    
    print(f"✓ {txt_file.name}")
    print(f"  Keywords: {', '.join(keywords[:5])}...\n")

✓ auth_guide.txt
  Keywords: Authentication Guide, Guide, Client Credentials, Flow Deprecated, Implicit Flow...

✓ ml_guide.txt
  Keywords: Learning Guide, Guide, Concepts Supervised, Neural Networks, Forests Ensemble...
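
The indexing loop stores each document's keywords as one comma-joined string, likely because ChromaDB metadata has traditionally accepted only scalar values (strings, numbers, booleans) rather than lists. A minimal sketch of that round trip follows. The helper names `encode_keywords` and `decode_keywords` are hypothetical and not part of the notebook; the sketch drops empty entries so that an empty keyword string decodes to an empty list instead of `[""]`.

```python
SEPARATOR = ", "  # same separator the indexing loop uses


def encode_keywords(keywords: list[str]) -> str:
    """Join keywords into a single metadata string, skipping blank entries."""
    return SEPARATOR.join(kw.strip() for kw in keywords if kw.strip())


def decode_keywords(value: str) -> list[str]:
    """Split a metadata string back into a keyword list, skipping blank entries."""
    return [kw.strip() for kw in value.split(",") if kw.strip()]


# Sample keywords taken from the indexing output above
stored = encode_keywords(["Authentication Guide", "Client Credentials", "Implicit Flow"])
print(stored)
print(decode_keywords(stored))
print(decode_keywords(""))
```

One caveat of this encoding: a keyword that itself contains a comma would be split in two on decoding, so a different separator may be safer for such data.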



## Step 2: Search with Keyword Overlap

In [None]:
def search_with_keywords(query: str):
    # Extract keywords from query
    query_keywords = extract_keywords(query, top=5)
    query_kw_lower = [kw.lower() for kw in query_keywords]
    print(f"Query: {query}")
    print(f"Keywords: {', '.join(query_keywords)}\n")

    # Search by embedding similarity
    query_embedding = embeddings.embed_query(query)
    results = collection.query(query_embeddings=[query_embedding], n_results=2)

    # Show results
    for i, metadata in enumerate(results['metadatas'][0], 1):
        # Skip empty entries: an empty string is a substring of every keyword
        doc_kw_lower = [kw.lower() for kw in metadata['keywords'].split(', ') if kw]

        # Find keyword overlap (case-insensitive substring match in either direction)
        overlap = [
            f"{qkw}↔{dkw}"
            for qkw in query_kw_lower
            for dkw in doc_kw_lower
            if qkw in dkw or dkw in qkw
        ]

        print(f"{i}. {metadata['filename']}")
        if overlap:
            print(f"   ✓ Overlap: {', '.join(overlap[:3])}")
        else:
            print("   ✗ No overlap")
    print()
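
The substring test in `search_with_keywords` (`qkw in dkw or dkw in qkw`) is simple, but it can pair unrelated words, for example `auth` inside `author`. One possible alternative, offered as an assumption rather than something this notebook does, is to compare whole lowercase words. The function name `token_overlap` and the sample keyword lists below are hypothetical.

```python
def token_overlap(query_keywords: list[str], doc_keywords: list[str]) -> list[tuple[str, str]]:
    """Return (query_kw, doc_kw) pairs that share at least one whole word."""
    pairs = []
    for qkw in query_keywords:
        q_tokens = set(qkw.lower().split())
        for dkw in doc_keywords:
            d_tokens = set(dkw.lower().split())
            if q_tokens & d_tokens:  # non-empty intersection means a shared word
                pairs.append((qkw.lower(), dkw.lower()))
    return pairs


# Hypothetical keyword lists for illustration
query_kws = ["OAuth", "auth"]
doc_kws = ["OAuth Industry", "Author Notes"]

for qkw, dkw in token_overlap(query_kws, doc_kws):
    print(f"{qkw}↔{dkw}")
```

With these inputs, the substring test would also pair `auth` with both document keywords, while the word-level comparison reports only the `OAuth` match. The trade-off is that word-level matching misses inflected forms such as `network` versus `networks`.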

### Try Different Queries

In [5]:
search_with_keywords("Tell me about OAuth and authentication methods")

Query: Tell me about OAuth and authentication methods
Keywords: authentication methods, OAuth, methods

1. auth_guide.txt
   ✓ Overlap: oauth↔oauth industry
2. ml_guide.txt
   ✗ No overlap



In [6]:
search_with_keywords("What is machine learning and neural networks")

Query: What is machine learning and neural networks
Keywords: machine learning, neural networks, learning, machine

1. ml_guide.txt
   ✓ Overlap: machine learning↔machine, neural networks↔neural networks, neural networks↔neural
2. auth_guide.txt
   ✗ No overlap



## Step 3: RAG Question Answering

In [7]:
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

question = "What are the OAuth 2.0 flows for authentication?"
print(f"Question: {question}\n")

# Retrieve and answer
query_embedding = embeddings.embed_query(question)
results = collection.query(query_embeddings=[query_embedding], n_results=1)
context = results['documents'][0][0]
source = results['metadatas'][0][0]['filename']

prompt = f"Based on this context, answer the question.\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:"
answer = llm.invoke(prompt).content

print(f"Answer: {answer}")
print(f"\nSource: {source}")

Question: What are the OAuth 2.0 flows for authentication?

Answer: The OAuth 2.0 flows for authentication include:

1. **Authorization Code Flow**: This is the most secure method for web applications, where an authorization code is exchanged for an access token after the user authenticates.

2. **Client Credentials Flow**: This flow is used for machine-to-machine authentication, where the client application authenticates itself to obtain an access token without user interaction.

3. **Implicit Flow**: This flow has been deprecated for security reasons and is no longer recommended for use. 

These flows are designed to facilitate secure authorization in various application scenarios.

Source: auth_guide.txt
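
Step 3 passes only the top-ranked document to the model. If more documents were retrieved (`n_results` greater than 1), the context could be assembled from every hit together with its filename and stored keywords, so the answer can name its sources. The sketch below assumes the nested-list result shape that `collection.query` returns, where `documents` and `metadatas` each hold one inner list per query. `build_prompt` and the sample data are hypothetical and not real notebook output.

```python
def build_prompt(question: str, results: dict) -> str:
    """Assemble a RAG prompt from a Chroma-style query result (first query only)."""
    sections = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        # Label each chunk with its source file and stored keywords
        sections.append(f"[{meta['filename']}] (keywords: {meta['keywords']})\n{doc}")
    context = "\n\n".join(sections)
    return (
        "Based on this context, answer the question and name the source file.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    )


# Hand-built stand-in for a query result
sample = {
    "documents": [["OAuth 2.0 defines several flows.", "Neural networks learn from data."]],
    "metadatas": [[
        {"filename": "auth_guide.txt", "keywords": "Client Credentials, Implicit Flow"},
        {"filename": "ml_guide.txt", "keywords": "Neural Networks"},
    ]],
}
print(build_prompt("What are the OAuth 2.0 flows?", sample))
```

The returned string could then be passed to `llm.invoke(...)` in the same way as the single-document prompt in Step 3.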


## Key Takeaway

**Keywords as metadata provide:**

Better precision (results can be filtered by specific terms)  
Explainability (keyword overlap shows why a document matched)  
Domain knowledge (extracted phrases capture technical terms)
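
The precision point can be illustrated with a post-filter: after the vector search returns candidates, keep only those whose stored keywords mention a required term. This is a sketch under assumptions, not something the notebook itself does. `filter_by_keyword` and the sample result are hypothetical, and the dict mimics the nested-list shape that `collection.query` returns.

```python
def filter_by_keyword(results: dict, term: str) -> list[str]:
    """Return filenames of retrieved documents whose keywords contain `term` as a whole word."""
    term = term.lower()
    kept = []
    for meta in results["metadatas"][0]:
        # Keywords are stored as one comma-separated string
        keywords = [kw.strip().lower() for kw in meta["keywords"].split(",")]
        if any(term in kw.split() for kw in keywords):
            kept.append(meta["filename"])
    return kept


# Hand-built stand-in for a query result
sample = {
    "metadatas": [[
        {"filename": "auth_guide.txt", "keywords": "OAuth Industry, Client Credentials"},
        {"filename": "ml_guide.txt", "keywords": "Neural Networks, Learning Guide"},
    ]],
}
print(filter_by_keyword(sample, "oauth"))
```

Filtering after the query keeps the example independent of any particular metadata-filter syntax; the cost is that the vector search may need to return more candidates than are finally kept.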

**This system's YAKE extractor becomes the metadata engine for RAG!**

## Try Your Own Query

In [None]:
# Modify and run
search_with_keywords("Your question here")