🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys.

# Document Q&A with LangChain 1.0

## Welcome! 📚

In this notebook, you'll learn how to extract information from documents using LangChain 1.0.

## What is Document Q&A?

**Document Q&A** allows you to:
- Load documents from various sources (PDFs, text files, web pages)
- Split them into manageable chunks
- Create embeddings for semantic search
- Ask questions and get answers based on document content

## What You'll Build

By the end of this notebook, you'll have a system that can:
1. Load and process documents
2. Create a vector store for efficient retrieval
3. Answer questions based on document content
4. Use Ollama for local LLM processing

## Prerequisites

Make sure you have:
- **Ollama installed**: Download from [ollama.com](https://ollama.com)
- **Mistral-Nemo model**: Run `ollama pull mistral-nemo:12b` in your terminal
- **nomic-embed-text model**: Run `ollama pull nomic-embed-text` for embeddings

Let's get started!

## Step 1: Import Required Libraries

We'll need a text splitter, embeddings, a vector store, a prompt template, and our LLM.

In [1]:
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate

print("✓ Libraries imported successfully!")

✓ Libraries imported successfully!


## Step 2: Initialize the Ollama Model

We'll use Mistral-Nemo for generating answers and nomic-embed-text for creating embeddings.

In [2]:
# Initialize the LLM
llm = ChatOllama(
    model="mistral-nemo:12b",
    temperature=0
)

# Initialize embeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text"
)

print("✓ LLM and embeddings initialized!")
print(f"  LLM Model: mistral-nemo:12b")
print(f"  Embeddings Model: nomic-embed-text")

✓ LLM and embeddings initialized!
  LLM Model: mistral-nemo:12b
  Embeddings Model: nomic-embed-text


## Step 3: Create Sample Documents

Let's create some sample text documents to work with. In practice, you'd load these from files or web pages.

In [3]:
from langchain_core.documents import Document

# Create sample documents about AI topics
documents = [
    Document(
        page_content="""
        Machine Learning is a subset of artificial intelligence that focuses on developing 
        systems that can learn from and make decisions based on data. It uses statistical 
        techniques to give computers the ability to learn without being explicitly programmed.
        Common types include supervised learning, unsupervised learning, and reinforcement learning.
        """,
        metadata={"source": "ai_basics.txt", "topic": "machine_learning"}
    ),
    Document(
        page_content="""
        Natural Language Processing (NLP) is a branch of AI that helps computers understand, 
        interpret, and generate human language. NLP combines computational linguistics with 
        machine learning and deep learning models. Applications include chatbots, translation 
        services, sentiment analysis, and text summarization.
        """,
        metadata={"source": "ai_basics.txt", "topic": "nlp"}
    ),
    Document(
        page_content="""
        Large Language Models (LLMs) are neural networks trained on vast amounts of text data. 
        They can generate human-like text, answer questions, write code, and perform various 
        language tasks. Examples include GPT-4, Claude, and Mistral. LLMs use transformer 
        architecture and attention mechanisms to understand context.
        """,
        metadata={"source": "ai_advanced.txt", "topic": "llm"}
    ),
    Document(
        page_content="""
        Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval 
        with language generation. It allows LLMs to access external knowledge bases, reducing 
        hallucinations and providing more accurate, up-to-date information. RAG systems typically 
        use vector databases to store and retrieve relevant document chunks.
        """,
        metadata={"source": "ai_advanced.txt", "topic": "rag"}
    )
]

print(f"✓ Created {len(documents)} sample documents")
print("\nTopics covered:")
for doc in documents:
    print(f"  - {doc.metadata['topic'].upper()}")

✓ Created 4 sample documents

Topics covered:
  - MACHINE_LEARNING
  - NLP
  - LLM
  - RAG
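
For real files, the same list could be built by reading text from disk. The sketch below is a dependency-free illustration, not LangChain's loader API: `SimpleDocument` and `load_text_files` are hypothetical names, and `SimpleDocument` only mirrors the `page_content` and `metadata` fields used above.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_files(folder: str) -> list[SimpleDocument]:
    """Read every .txt file in `folder` into one document per file."""
    docs = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Record the file name so answers can be traced back to their source.
        docs.append(SimpleDocument(page_content=text, metadata={"source": path.name}))
    return docs


# Demo with a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "ai_basics.txt").write_text("Machine Learning is a subset of AI.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("Not a .txt file, so it is skipped.", encoding="utf-8")
    loaded = load_text_files(tmp)

print(len(loaded), loaded[0].metadata)
```

In the notebook itself you would construct `Document` objects (as in the cell above) instead of the stand-in class.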


## Step 4: Split Documents into Chunks

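For better retrieval, we split long documents into smaller, overlapping chunks. Each sample document here is shorter than the 500-character chunk size, so the splitter returns one chunk per document.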

In [4]:
# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Size of each chunk in characters
    chunk_overlap=50,  # Overlap between chunks to maintain context
    separators=["\n\n", "\n", " ", ""]
)

# Split the documents
splits = text_splitter.split_documents(documents)

print(f"✓ Split {len(documents)} documents into {len(splits)} chunks")
print(f"  Chunk size: 500 characters")
print(f"  Chunk overlap: 50 characters")
print(f"\nExample chunk:")
print(f"{splits[0].page_content[:200]}...")

✓ Split 4 documents into 4 chunks
  Chunk size: 500 characters
  Chunk overlap: 50 characters

Example chunk:
Machine Learning is a subset of artificial intelligence that focuses on developing 
        systems that can learn from and make decisions based on data. It uses statistical 
        techniques to giv...
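
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch using a plain fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the listed separators first); `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']

# A text shorter than chunk_size comes back as a single chunk.
print(chunk_text("short text"))  # ['short text']
```

Each chunk repeats the last character of the previous one, which is the overlap that helps preserve context across chunk boundaries.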


## Step 5: Create a Vector Store

We'll use FAISS (Facebook AI Similarity Search) to create a vector database for efficient retrieval.

In [5]:
print("Creating vector store... (this may take a moment)")

# Create vector store from documents
vectorstore = FAISS.from_documents(
    documents=splits,
    embedding=embeddings
)

print("✓ Vector store created!")
print(f"  Total vectors: {len(splits)}")
print(f"  Ready for semantic search")

Creating vector store... (this may take a moment)
✓ Vector store created!
  Total vectors: 4
  Ready for semantic search


## Step 6: Test Similarity Search

Let's test the vector store by finding documents similar to a query.

In [6]:
# Test query
query = "What is machine learning?"

# Find similar documents
relevant_docs = vectorstore.similarity_search(query, k=2)

print(f"Query: {query}\n")
print(f"Found {len(relevant_docs)} relevant documents:\n")

for i, doc in enumerate(relevant_docs, 1):
    print(f"Document {i}:")
    print(f"Topic: {doc.metadata['topic']}")
    print(f"Content: {doc.page_content.strip()[:200]}...")
    print()

Query: What is machine learning?

Found 2 relevant documents:

Document 1:
Topic: machine_learning
Content: Machine Learning is a subset of artificial intelligence that focuses on developing 
        systems that can learn from and make decisions based on data. It uses statistical 
        techniques to giv...

Document 2:
Topic: nlp
Content: Natural Language Processing (NLP) is a branch of AI that helps computers understand, 
        interpret, and generate human language. NLP combines computational linguistics with 
        machine learn...
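
Conceptually, similarity search turns the query and every chunk into vectors and ranks the chunks by how close they are to the query. The sketch below illustrates that idea with toy word-count vectors and cosine similarity. These are assumptions made for the example: real embedding models such as nomic-embed-text produce dense learned vectors, and the distance metric FAISS uses depends on how the index is configured. `tokenize`, `toy_embed`, `cosine_similarity`, and `top_k` are hypothetical helpers.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy embedding: count how often each vocabulary word occurs in `text`."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, texts: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k texts most similar to the query, best match first."""
    vocabulary = sorted({word for text in texts for word in tokenize(text)})
    query_vector = toy_embed(query, vocabulary)
    scored = [(cosine_similarity(query_vector, toy_embed(t, vocabulary)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


texts = [
    "Machine learning learns from data",
    "Natural language processing understands human language",
    "Vector databases store document chunks",
]
results = top_k("What is machine learning?", texts, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The first text wins because it shares the words "machine" and "learning" with the query. Word counts only capture exact word overlap, whereas learned embeddings can also match text that uses different wording for the same idea.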



## Step 7: Create a Retrieval QA Chain

Now we'll wrap the vector store in a retriever. A retrieval QA chain pairs this retriever with our LLM to answer questions based on the documents.

In [8]:
# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}  # Retrieve top 3 most relevant chunks
)


retriever.invoke(input="types of reward hacking")


[Document(id='521aa57a-f6c9-46b7-8387-15d9afa45643', metadata={'source': 'ai_advanced.txt', 'topic': 'rag'}, page_content='Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval \n        with language generation. It allows LLMs to access external knowledge bases, reducing \n        hallucinations and providing more accurate, up-to-date information. RAG systems typically \n        use vector databases to store and retrieve relevant document chunks.'),
 Document(id='f0befea8-673b-43cb-a7bb-c7b4e28860bf', metadata={'source': 'ai_basics.txt', 'topic': 'machine_learning'}, page_content='Machine Learning is a subset of artificial intelligence that focuses on developing \n        systems that can learn from and make decisions based on data. It uses statistical \n        techniques to give computers the ability to learn without being explicitly programmed.\n        Common types include supervised learning, unsupervised learning, and reinforcement learning.'),


## Step 8: Ask Questions About Your Documents

Let's ask a question and look at the chunks the retriever returns. These chunks are the context an LLM would use to generate the answer.

In [16]:
# Ask a question
question = "What is machine learning and what are its types?"

print(f"Question: {question}\n")
print("Retrieving relevant information...\n")

# Retrieve the chunks most relevant to the question (the LLM is not called here)
result = retriever.invoke(input=question)

print("Answer:")
print(result)
print("\n" + "="*60 + "\n")

print("Source Documents:")
for i, doc in enumerate(result, 1):
    print(f"\n{i}. Source: {doc.metadata['source']} | Topic: {doc.metadata['topic']}")
    print(f"   Content: {doc.page_content.strip()[:150]}...")

Question: What is machine learning and what are its types?

Retrieving relevant information...

Answer:
[Document(id='f0befea8-673b-43cb-a7bb-c7b4e28860bf', metadata={'source': 'ai_basics.txt', 'topic': 'machine_learning'}, page_content='Machine Learning is a subset of artificial intelligence that focuses on developing \n        systems that can learn from and make decisions based on data. It uses statistical \n        techniques to give computers the ability to learn without being explicitly programmed.\n        Common types include supervised learning, unsupervised learning, and reinforcement learning.'), Document(id='cd675eb4-74e0-4b50-ad13-6ff306621e16', metadata={'source': 'ai_advanced.txt', 'topic': 'llm'}, page_content='Large Language Models (LLMs) are neural networks trained on vast amounts of text data. \n        They can generate human-like text, answer questions, write code, and perform various \n        language tasks. Examples include GPT-4, Claude, and Mistral. LLMs use t
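
The output above is the list of retrieved chunks rather than a generated answer. A minimal sketch of the generation step, assuming only that the retriever and the chat model both expose `.invoke()` and that the model's reply carries its text in `.content`, might look like this. `PROMPT_TEMPLATE`, `format_docs`, and `answer_question` are hypothetical names, and the stub classes exist only so the sketch runs without Ollama.

```python
from types import SimpleNamespace

# Hypothetical prompt wording; adjust it to your use case.
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say that you don't know.

Context:
{context}

Question: {question}
Answer:"""


def format_docs(docs) -> str:
    """Join the text of the retrieved documents into one context string."""
    return "\n\n".join(doc.page_content.strip() for doc in docs)


def answer_question(question: str, retriever, llm) -> str:
    """Retrieve relevant chunks, build a prompt from them, and ask the LLM."""
    docs = retriever.invoke(question)
    prompt = PROMPT_TEMPLATE.format(context=format_docs(docs), question=question)
    response = llm.invoke(prompt)
    # Chat models return a message object; its text is in `.content`.
    return response.content


# Stand-ins so the sketch runs without a model; replace with the real objects.
class StubRetriever:
    def invoke(self, query):
        return [SimpleNamespace(page_content="  Machine Learning is a subset of artificial intelligence.  ")]


class StubLLM:
    def __init__(self):
        self.last_prompt = None

    def invoke(self, prompt):
        self.last_prompt = prompt  # keep the prompt so it can be inspected
        return SimpleNamespace(content="stub answer")


stub_llm = StubLLM()
answer = answer_question("What is machine learning?", StubRetriever(), stub_llm)
print(answer)  # stub answer
```

In the notebook, calling `answer_question(question, retriever, llm)` with the `retriever` and `llm` objects created earlier should return the model's answer as a string. The `PromptTemplate` imported in Step 1 could replace the plain format string.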

## 🎉 Congratulations!

You've successfully built a Document Q&A system with LangChain 1.0!

### What You Learned

✅ **Document Loading**: How to create and load documents  
✅ **Text Splitting**: Breaking documents into manageable chunks  
✅ **Embeddings**: Creating vector representations of text  
✅ **Vector Stores**: Using FAISS for efficient similarity search  
✅ **Retrieval QA**: Combining retrieval with LLM for answers  
✅ **Local Processing**: Running everything with Ollama  

### How It Works

1. **Indexing**: Documents are split into chunks and embedded into vectors
2. **Retrieval**: When you ask a question, the system finds relevant chunks
3. **Generation**: The LLM uses retrieved chunks to generate an answer
4. **Sources**: You can trace answers back to source documents

### Key Components

- **Document Loaders**: Load text from various sources
- **Text Splitters**: Break long documents into chunks
- **Embeddings**: Convert text to numerical vectors
- **Vector Stores**: Store and search vectors efficiently
- **Retrievers**: Find relevant documents for a query
- **QA Chains**: Combine retrieval + LLM for answers