# Structuring Documents with Indexes
Indexes **structure documents** so that an **LLM** can find and use their content efficiently. In **AWS-based RAG (Retrieval-Augmented Generation) applications**, three key components handle document management:
- **Document Loaders** (ingest data from various sources)
- **Retrievers** (fetch relevant documents)
- **Vector Stores** (store and retrieve embeddings for search)

## **Document Loaders**
- **Purpose**: Load documents from different sources to prepare them for embedding and retrieval.
- **Supported Sources**: 
  - **Databases**
  - **Online storage** (for example, Amazon S3)
  - **Local file storage**
- **Supported File Formats**:
  - **HTML**
  - **PDF**
  - **Code files**
  - **Microsoft Office Documents**

```python
from langchain_community.document_loaders import S3FileLoader

# Load a document from S3 (bucket name first, then object key)
loader = S3FileLoader("mysource_bucket", "sample-file.docx")
data = loader.load()  # returns a list of Document objects
```
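
To make the loader's output concrete, here is a minimal sketch of the same idea for local text files, using only the standard library. `SimpleDocument` and `load_text_files` are hypothetical names invented for this illustration; they only mirror the `page_content` and `metadata` fields that LangChain documents carry, and the two sample files are made up.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the fields of a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_files(directory: str, pattern: str = "*.txt") -> list[SimpleDocument]:
    """Read every matching file under `directory` into a SimpleDocument."""
    docs = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        # Record where the text came from so answers can cite their source
        docs.append(SimpleDocument(page_content=text, metadata={"source": str(path)}))
    return docs


# Demo with two made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("First sample file.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Second sample file.", encoding="utf-8")
    sample_docs = load_text_files(tmp)

for doc in sample_docs:
    print(doc.metadata["source"], "->", doc.page_content)
```

Whatever the source, the result has the same shape: text plus metadata, ready to be embedded or indexed.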

## **Retrievers**
- **Purpose**: Fetch relevant documents from an indexed dataset.
- **How It Works**:
  - The user submits a query.
  - The retriever searches the document index.
  - The relevant documents are sent to the LLM for processing.
- **AWS Integration**:
  - Amazon Kendra provides semantic search with pre-built connectors for popular data sources:
    - Amazon S3
    - SharePoint
    - Confluence
    - Websites
  - Kendra supports formats such as HTML, Word, PowerPoint, PDF, Excel, and plain text files.
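
The query-to-documents flow under "How It Works" can be sketched without any AWS dependency. This toy example ranks documents by simple keyword overlap, which is far cruder than the semantic search Amazon Kendra performs; the corpus, `tokenize`, and `retrieve` are all hypothetical and exist only to show the shape of the flow.

```python
import re

# Hypothetical in-memory corpus standing in for an indexed dataset
CORPUS = {
    "s3-guide": "Amazon S3 stores objects in buckets.",
    "kendra-guide": "Amazon Kendra provides semantic search over enterprise documents.",
    "aurora-guide": "Amazon Aurora is a relational database compatible with PostgreSQL.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return up to `top_k` document IDs, ranked by words shared with the query."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(text)), doc_id) for doc_id, text in CORPUS.items()]
    # Drop documents with no shared words, then put the best match first
    ranked = sorted((pair for pair in scored if pair[0] > 0), key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in ranked[:top_k]]


hits = retrieve("How does Kendra search documents?")
print(hits)
```

In a real application, the returned documents (not just their IDs) would be passed to the LLM as context.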

```python
from langchain_aws.retrievers import AmazonKendraRetriever
from langchain.chains import ConversationalRetrievalChain
from langchain.prompts import PromptTemplate
from langchain_aws import ChatBedrock

# Define the LLM. Claude 3 models on Bedrock expect "max_tokens"
# rather than the older "max_tokens_to_sample" parameter.
llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    model_kwargs={"max_tokens": 300, "temperature": 1, "top_k": 250, "top_p": 0.999},
)

# Set up the retriever. kendra_index_id and region are assumed to be defined
# earlier (your Kendra index ID and the AWS Region that hosts it).
retriever = AmazonKendraRetriever(index_id=kendra_index_id, top_k=5, region_name=region)

# Define the prompt template; {context} and {question} are filled in by the chain
prompt_template = """
Human: This is a friendly conversation between a human and an AI.
The AI provides specific details from its context but limits it to 240 tokens.
If the AI does not know the answer, it truthfully says it does not know.

A: OK, got it, I'll be a truthful AI assistant.

H: Here are a few documents:
<documents>
{context}
</documents>
Based on the above documents, answer the following question: {question}
Answer 'do not know' if the document does not contain relevant information.

A:
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

# Create a conversational retrieval chain (this builds the chain; it does not run a query yet)
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    combine_docs_chain_kwargs={"prompt": PROMPT},
    verbose=True,
)
```
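
To show how the `{context}` and `{question}` placeholders get filled, the sketch below joins the retrieved document texts and substitutes them into a shortened template with plain string formatting. This is only a conceptual approximation of what the chain does internally; `build_prompt`, the shortened `TEMPLATE`, and the sample texts are hypothetical.

```python
def build_prompt(template: str, documents: list[str], question: str) -> str:
    """Join the document texts and substitute them into the template."""
    # Assumption: documents are separated by a blank line
    context = "\n\n".join(documents)
    return template.format(context=context, question=question)


# Shortened, hypothetical version of a RAG prompt template
TEMPLATE = (
    "Here are a few documents:\n"
    "<documents>\n{context}\n</documents>\n"
    "Based on the above documents, answer the following question: {question}"
)

filled = build_prompt(
    TEMPLATE,
    ["Doc one text.", "Doc two text."],
    "What is in doc one?",
)
print(filled)
```

Because every retrieved document is placed in the prompt, the retriever's `top_k` setting directly affects prompt length and token cost.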

## **Vector Stores**
- **Purpose**: Store and retrieve vector embeddings for efficient semantic search.
- **How It Works**:
  - Convert documents into embeddings (numerical representations).
  - Store the embeddings in a vector database.
  - Embed the user query and retrieve the documents whose embeddings are most similar to it.
  - Pass the retrieved documents to the LLM so it can generate an accurate response.
- **AWS Vector Store Integrations**:
  - Amazon OpenSearch Serverless
  - Amazon Aurora PostgreSQL-Compatible Edition (pgvector extension)
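
The embed, store, and retrieve steps under "How It Works" can be illustrated with a toy in-memory store. The "embedding" here is just word counts over a tiny fixed vocabulary, an assumption made purely for illustration; real embedding models produce dense vectors that capture meaning. `ToyVectorStore`, `embed`, and `VOCABULARY` are hypothetical names, not part of any library.

```python
import math
import re

# Hypothetical vocabulary; each position in a vector counts one of these words
VOCABULARY = ["bucket", "object", "search", "index", "database", "sql"]


def embed(text: str) -> list[float]:
    """Toy embedding: count how often each vocabulary word appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCABULARY]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (vector, text) pairs in memory and searches them by cosine similarity."""

    def __init__(self) -> None:
        self._rows: list[tuple[list[float], str]] = []

    def add_texts(self, texts: list[str]) -> None:
        for text in texts:
            self._rows.append((embed(text), text))

    def similarity_search(self, query: str, k: int = 1) -> list[str]:
        query_vector = embed(query)
        ranked = sorted(
            self._rows,
            key=lambda row: cosine_similarity(query_vector, row[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "Each object lives in a bucket.",
    "The search index ranks results.",
    "The database accepts SQL queries.",
])
top = store.similarity_search("Which index powers search?", k=1)
print(top)
```

A managed vector store replaces the linear scan with an approximate nearest-neighbor index, but the interface (add texts, search by similarity) stays much the same.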

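```python
import os

import boto3
from langchain_aws import BedrockEmbeddings
from langchain_community.vectorstores import OpenSearchVectorSearch
from opensearchpy import AWSV4SignerAuth, RequestsHttpConnection

# Get environment variables for index and endpoint
index_name = os.environ["AOSS_INDEX_NAME"]
endpoint = os.environ["AOSS_COLLECTION_ENDPOINT"]

# Create the Bedrock runtime client used to generate embeddings
session = boto3.Session()
bedrock_client = session.client("bedrock-runtime")

# Initialize Bedrock Embeddings
embeddings = BedrockEmbeddings(client=bedrock_client)

# Sign requests with SigV4; OpenSearch Serverless uses the "aoss" service name.
# Assumes the session has credentials and a default Region configured.
auth = AWSV4SignerAuth(session.get_credentials(), session.region_name, "aoss")

# Set up the vector store
vector_store = OpenSearchVectorSearch(
    index_name=index_name,
    embedding_function=embeddings,
    opensearch_url=endpoint,
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Convert the vector store into a retriever
retriever = vector_store.as_retriever()
```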

## **Key Takeaways**
- **Document Loaders**: Load documents from various sources (such as Amazon S3) and formats (HTML, PDF, and others).
- **Retrievers**: Find relevant documents in an indexed dataset.
  - Amazon Kendra enables semantic search and connects to multiple data sources.
- **Vector Stores**: Store and retrieve embeddings for efficient information retrieval.
  - Amazon OpenSearch Serverless and the pgvector extension for Amazon Aurora PostgreSQL-Compatible Edition are commonly used.