# Phase 2: Document Processing

In this notebook, you'll learn how to:
1. Load documents from files
2. Split documents into chunks
3. Generate embeddings for semantic search
4. Compare texts with cosine similarity
5. Run the same steps through the project's `src/` modules

These are the foundational steps for building a retrieval-augmented generation (RAG) system.

In [None]:
# Add the project root to the path so we can import our modules
import sys
sys.path.insert(0, '../..')

## 1. Loading Documents

LangChain provides various document loaders for different file types.

In [None]:
from langchain_community.document_loaders import TextLoader

# Load a text file
loader = TextLoader('../../data/documents/sample_article.txt', encoding='utf-8')
documents = loader.load()

print(f"Loaded {len(documents)} document(s)")
print("\nDocument content preview (first 500 chars):")
print(documents[0].page_content[:500])

In [None]:
# Examine the document metadata
print("Document metadata:")
print(documents[0].metadata)
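
To make the loader's output less opaque, here is a minimal standard-library sketch of what a text loader does conceptually: read a file and wrap its text together with a metadata dictionary. `SimpleDocument` and `load_text_file` are hypothetical names invented for this illustration, not part of LangChain, and the `source` key mirrors what the metadata printed above typically contains.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one text file and return it as a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Usage: write a throwaway file, then load it back
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("RAG combines retrieval with generation.", encoding="utf-8")
    docs = load_text_file(sample_path)

print(f"Loaded {len(docs)} document(s)")
print(docs[0].metadata)
```

Keeping the source path in the metadata is what later lets a RAG system cite where a retrieved chunk came from.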

## 2. Splitting Documents into Chunks

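Embedding models accept only a limited amount of input, and a single vector for a long document blurs its topics together. We therefore split documents into smaller chunks that can each be embedded and retrieved on their own.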

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # Maximum characters per chunk
    chunk_overlap=50,     # Overlap between chunks for context
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Priority order for splitting
)

# Split the documents
chunks = text_splitter.split_documents(documents)

print(f"Split into {len(chunks)} chunks")
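
To see what `chunk_size` and `chunk_overlap` mean mechanically, the following sketch implements a plain fixed-window splitter in pure Python. It is a simplification for illustration, not how `RecursiveCharacterTextSplitter` works internally; the function name `split_with_overlap` and the demo string are assumptions made for this example.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


demo = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = split_with_overlap(demo, chunk_size=10, chunk_overlap=3)
for piece in demo_chunks:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one chunk reappear at the start of the next. That shared text is what keeps a sentence cut at a boundary partly visible in both chunks.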

In [None]:
# Examine a few chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Chunk {i+1} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content[:200] + "...")
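
The `separators` list is tried in priority order: paragraphs first, then lines, then sentences, then words, then single characters. The sketch below is a deliberately simplified version of that idea, written in pure Python under the assumption that overlap and separator retention can be ignored; `recursive_split` is a hypothetical helper, and the real splitter handles more cases.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split on the coarsest separator available, recursing on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    pieces, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate      # piece still fits, keep merging
            continue
        if current:
            pieces.append(current)   # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: retry it with the finer separators
            pieces.extend(recursive_split(piece, chunk_size, finer))
    if current:
        pieces.append(current)
    return pieces


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
sample_chunks = recursive_split(sample, chunk_size=40)
for piece in sample_chunks:
    print(repr(piece))
```

Short paragraphs survive intact, and only the oversized middle paragraph falls back to splitting on spaces. This is why the recursive approach tends to produce more readable chunks than a fixed character window.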

## 3. Understanding Embeddings

Embeddings convert text into numerical vectors that capture semantic meaning.
Similar texts will have similar embedding vectors.
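
As a rough intuition for "text in, fixed-length vector out", the toy sketch below counts words from a tiny hand-picked vocabulary. This is not how a learned embedding model such as `nomic-embed-text` works: real models produce dense vectors and place synonyms near each other, which word counts cannot do. `VOCABULARY` and `toy_embed` are hypothetical names used only here.

```python
import re
from collections import Counter

# Hypothetical toy vocabulary; each position in the vector corresponds to one word
VOCABULARY = ["machine", "learning", "data", "weather", "sunny"]


def toy_embed(text, vocabulary=VOCABULARY):
    """Map text to a fixed-length vector of word counts (not a real embedding model)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]


print(toy_embed("Machine learning helps computers learn from data."))
print(toy_embed("The weather today is sunny and warm."))
```

The two sentences land on entirely different positions of the vector, which is the crudest possible version of "different topics, different vectors".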

In [None]:
from langchain_ollama import OllamaEmbeddings

# Initialize the embeddings model
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Embed a single text
test_text = "Machine learning is a type of artificial intelligence."
vector = embeddings.embed_query(test_text)

print(f"Text: {test_text}")
print(f"Embedding dimension: {len(vector)}")
print(f"First 10 values: {vector[:10]}")

## 4. Semantic Similarity

Let's see how embeddings capture semantic similarity.
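
For reference, the measure used in this section is the cosine of the angle between two vectors $\mathbf{a}$ and $\mathbf{b}$. The small worked example uses made-up three-dimensional vectors chosen only to keep the arithmetic easy.

$$
\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$

$$
\mathbf{a} = (1, 2, 2),\quad \mathbf{b} = (2, 1, 2):\qquad
\cos(\theta) = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{\sqrt{1 + 4 + 4}\,\sqrt{4 + 1 + 4}} = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889
$$

Because both norms appear in the denominator, the result depends only on direction, not on vector length.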

In [None]:
import numpy as np

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Test semantic similarity
sentences = [
    "Machine learning helps computers learn from data.",
    "AI systems can improve through experience and data.",  # Similar meaning
    "The weather today is sunny and warm."  # Different topic
]

# Embed all sentences
vectors = embeddings.embed_documents(sentences)

# Compare similarities to the first sentence
print(f"Base sentence: '{sentences[0]}'\n")
for i in range(1, len(sentences)):
    similarity = cosine_similarity(vectors[0], vectors[i])
    print(f"Similarity to '{sentences[i][:50]}...': {similarity:.4f}")
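
Semantic search is this same comparison applied to many candidates: score each one against the query vector and keep the best matches. The sketch below shows that ranking step in pure Python with hand-made two-dimensional vectors standing in for real embeddings; `rank_by_similarity` and the toy values are assumptions for illustration.

```python
import math


def cosine(vec1, vec2):
    """Cosine similarity using only the standard library."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)


def rank_by_similarity(query_vector, labelled_vectors, top_k=2):
    """Return the top_k (label, score) pairs, most similar first."""
    scored = [(label, cosine(query_vector, vec)) for label, vec in labelled_vectors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hand-made toy vectors, not output from a real embedding model
query = [1.0, 0.0]
candidates = [
    ("weather", [0.0, 1.0]),
    ("AI systems", [0.6, 0.8]),
    ("machine learning", [0.8, 0.6]),
]

top_matches = rank_by_similarity(query, candidates, top_k=2)
for label, score in top_matches:
    print(f"{label}: {score:.4f}")
```

A vector database performs essentially this ranking, with indexing structures that avoid scoring every stored vector.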

## 5. Using Our Project Modules

Now let's use the modules we created in the `src/` directory.

In [None]:
from src.document_loader import load_and_split
from src.embeddings import get_embeddings, embed_text

# Load and split all documents from the data directory
chunks = load_and_split()

print(f"\nTotal chunks ready for embedding: {len(chunks)}")

In [None]:
# Test embedding a chunk
if chunks:
    sample_chunk = chunks[0].page_content
    vector = embed_text(sample_chunk)
    print(f"Embedded chunk with {len(vector)} dimensions")

## Key Takeaways

1. **Document Loaders** read files into LangChain Document objects
2. **Text Splitters** break large documents into manageable chunks
3. **RecursiveCharacterTextSplitter** tries separators in priority order (paragraphs, then lines, then sentences, then words) so that related text stays together
4. **Embeddings** convert text to numerical vectors for similarity search
5. **Cosine Similarity** measures how closely two embedding vectors point in the same direction, regardless of their length

## Next Steps

In the next notebook, you'll learn how to:
- Store embeddings in a vector database (ChromaDB)
- Retrieve relevant documents based on a query
- Build a complete RAG chain