# Data Exploration Notebook

This notebook explores raw data, demonstrates preprocessing and chunking functions, and shows examples of chunks with metadata.

## Goals
- Explore raw document data
- Test preprocessing and chunking functions
- Visualize chunks and metadata
- Understand data characteristics


In [1]:
# Import required modules
import sys
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path("../src").resolve()))

from utils.io_utils import read_text, list_files
from utils.rag_preprocessing import clean_text, chunk_text, enrich_metadata
from utils.validation import validate_document

print("✅ Imports successful!")


✅ Imports successful!


## Step 1: Explore Raw Data

Let's first see what documents we have available.


In [2]:
# List available documents with extended file types
data_dir = Path("../data/raw")
file_types = ["*.txt", "*.pdf", "*.docx", "*.json", "*.csv", "*.md"]

if data_dir.exists():
    all_files = []
    for p in file_types:
        all_files += list_files(data_dir, pattern=p)

    print(f"Found {len(all_files)} files:")
    for f in all_files[:10]:
        print(f"  - {f.name}")

else:
    print("Data directory not found. Creating sample data...")
    data_dir.mkdir(parents=True, exist_ok=True)

    samples = {
        "sample.txt": "This is sample text about RAG and retrieval.",
        "sample.json": '{"topic":"RAG","desc":"Retrieval-Augmented Generation"}',
        "sample.md": "# RAG Example\nRAG retrieves context before generation."
    }

    for name, content in samples.items():
        with open(data_dir / name, "w", encoding="utf-8") as f:
            f.write(content)

    print("Created sample files:", ", ".join(samples.keys()))


Found 1 files:
  - Text Chunking.pdf
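
The cell above calls `list_files` once per glob pattern and concatenates the results. As a rough illustration of that idea using only `pathlib`, the sketch below gathers files for several patterns in one helper. The name `list_files_multi` is hypothetical and is not part of `utils.io_utils`; de-duplicating and sorting by name are assumptions added here so that overlapping patterns do not list a file twice.

```python
from pathlib import Path
import tempfile


def list_files_multi(directory: Path, patterns):
    """Collect files matching any glob pattern, de-duplicated and sorted by name."""
    found = set()
    for pattern in patterns:
        # Path.glob yields every entry matching the pattern; keep regular files only
        found.update(p for p in directory.glob(pattern) if p.is_file())
    return sorted(found, key=lambda p: p.name)


# Demo on throwaway files so the sketch runs without the project data
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("b.md", "a.txt", "c.bin"):
        (root / name).write_text("x", encoding="utf-8")
    # "a.*" overlaps with "*.txt"; a.txt is still listed once
    print([p.name for p in list_files_multi(root, ["*.txt", "*.md", "a.*"])])
```
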


## Step 2: Load and Clean Text

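Next, load the first available document and clean it. The only file found above is a PDF, and `read_text` returns its raw, still-compressed bytes rather than extracted text, so the output below is unreadable both before and after cleaning.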
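
To make the cleaning step concrete, here is a minimal sketch of what the two options passed to `clean_text` in the next cell might do. The function `clean_text_sketch` is a hypothetical stand-in, not the implementation in `utils.rag_preprocessing`; the tag-stripping regex and entity decoding are assumptions, and the real function may treat angle brackets differently.

```python
import html
import re


def clean_text_sketch(text: str, remove_html: bool = True,
                      normalize_whitespace: bool = True) -> str:
    """Hypothetical stand-in for clean_text (assumed behavior, not the real one)."""
    if remove_html:
        # Replace anything that looks like a tag, then decode entities such as &amp;
        text = re.sub(r"<[^>]+>", " ", text)
        text = html.unescape(text)
    if normalize_whitespace:
        # Collapse every run of spaces, tabs, and newlines into a single space
        text = re.sub(r"\s+", " ", text).strip()
    return text


sample = "<p>RAG   retrieves\n\ncontext &amp; then generates.</p>"
print(clean_text_sketch(sample))
```

Whitespace normalization flattens paragraph breaks as well, which matters later if a chunking strategy relies on blank lines to find boundaries.
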


In [4]:
# Load and clean text from any supported file
supported = ["*.txt", "*.pdf", "*.docx", "*.json", "*.md", "*.csv"]
files = []
for p in supported:
    files += list_files(data_dir, pattern=p)

if files:
    file = files[0]
    raw_text = read_text(file)
    print(f"Loaded: {file.name}\nRaw text:\n{raw_text[:200]}")
    print("\n" + "="*60 + "\n")

    cleaned = clean_text(
        raw_text,
        remove_html=True,
        normalize_whitespace=True,
    )
    print("Cleaned text:\n" + cleaned[:200])
else:
    print("No supported files found in data directory.")


Loaded: Text Chunking.pdf
Raw text:
%PDF-1.5
%
87 0 obj
<< /Filter /FlateDecode /Length 4363 >>
stream
[compressed binary stream data omitted]


Cleaned text:
%PDF-1.5 % 87 0 obj << /Filter /FlateDecode /Length 4363 >> stream xÚ:Ûã¸ïý~ h+¢¨ë¾uÙf0Év-fJ=°$ÚfJ<ÔÝ5ûó{Ï!E» `BwòÜoÊvÇ]¶ûéÃï üf;±Û·Ù®.Ë´Ìê]wþðôízøÛ.KeÛì¾¹eç]%rø v_>üãCÆüåñÃÌ]¶U


## Step 3: Chunk Text

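The cell below builds a longer sample passage, rather than reusing the cleaned text from Step 2, and chunks it with three strategies: fixed-size, recursive, and sentence-based.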


In [5]:
# Create a longer sample text for chunking
long_text = """
Retrieval-Augmented Generation (RAG) is a powerful technique that combines the strengths of information retrieval and language generation.

RAG works by first retrieving relevant documents from a knowledge base, then using those documents as context for generating answers.

The key advantage of RAG is that it can provide accurate, up-to-date information by grounding generation in retrieved documents.

RAG systems typically consist of three main components: a retriever, a generator, and a knowledge base.

The retriever finds relevant documents, the generator produces answers based on those documents, and the knowledge base stores the information.
""" * 3  # Repeat to make it longer

# Chunk using different strategies
chunks_fixed = chunk_text(long_text, strategy="fixed_size", chunk_size=200, chunk_overlap=20)
chunks_recursive = chunk_text(long_text, strategy="recursive", chunk_size=200, chunk_overlap=20)
chunks_sentence = chunk_text(long_text, strategy="sentence", chunk_size=200, chunk_overlap=20)

print(f"Fixed-size chunks: {len(chunks_fixed)}")
print(f"Recursive chunks: {len(chunks_recursive)}")
print(f"Sentence chunks: {len(chunks_sentence)}")

print("\nFirst chunk (fixed-size):")
print(chunks_fixed[0][:150] + "...")


Fixed-size chunks: 11
Recursive chunks: 15
Sentence chunks: 15

First chunk (fixed-size):

Retrieval-Augmented Generation (RAG) is a powerful technique that combines the strengths of information retrieval and language generation.

RAG works...
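
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of character-based fixed-size chunking. It is an assumption about how a `fixed_size` strategy could work, not the code behind `chunk_text`, and `fixed_size_chunks` is a hypothetical name. Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share `chunk_overlap` characters.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 20):
    """Split text into character windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, so no trailing chunk is pure overlap
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


text = "abcdefghij" * 5  # 50 characters
chunks = fixed_size_chunks(text, chunk_size=20, chunk_overlap=5)
print(len(chunks))
print(chunks[0][-5:] == chunks[1][:5])  # the tail of one chunk is the head of the next
```

Because the windows ignore sentence and paragraph boundaries, a chunk can end mid-sentence, as the first fixed-size chunk above does.
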


## Step 4: Enrich Chunks with Metadata

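Wrap each fixed-size chunk in a dictionary, attach metadata with `enrich_metadata`, and validate the first result. The output shows that a `timestamp` field is added alongside the fields passed in.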


In [None]:
# Create chunks with metadata
chunks = []
for i, chunk_text_content in enumerate(chunks_fixed):
    chunk = {
        "text": chunk_text_content,
        "metadata": {}
    }
    chunk = enrich_metadata(
        chunk,
        source="sample.txt",
        page=1,
        section="Introduction",
        chunk_index=i
    )
    chunks.append(chunk)

# Display first chunk with metadata
print("First chunk with metadata:")
print(f"Text: {chunks[0]['text'][:100]}...")
print("\nMetadata:")
for key, value in chunks[0]['metadata'].items():
    print(f"  {key}: {value}")

# Validate chunk
try:
    validate_document(chunks[0])
    print("\n✅ Chunk validation passed!")
except Exception as e:
    print(f"\n❌ Chunk validation failed: {e}")


First chunk with metadata:
Text: 
Retrieval-Augmented Generation (RAG) is a powerful technique that combines the strengths of informa...

Metadata:
  source: sample.txt
  page: 1
  section: Introduction
  timestamp: 2025-11-29T03:02:50.406629
  chunk_index: 0

✅ Chunk validation passed!
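
The metadata printed above contains the fields passed to `enrich_metadata` plus a `timestamp`. The sketch below shows one way such enrichment and a basic validity check could be written. Both functions are hypothetical stand-ins, not the implementations in `utils.rag_preprocessing` or `utils.validation`; in particular, the validation rules (non-empty text, a `source` key) are assumptions chosen for illustration.

```python
from datetime import datetime


def enrich_metadata_sketch(chunk: dict, **fields) -> dict:
    """Hypothetical stand-in: merge the given fields and add an ISO timestamp."""
    metadata = dict(chunk.get("metadata", {}))  # copy so the input is not mutated
    metadata.update(fields)
    metadata.setdefault("timestamp", datetime.now().isoformat())
    return {**chunk, "metadata": metadata}


def validate_chunk_sketch(chunk: dict) -> None:
    """Hypothetical check: non-empty text and a metadata dict with a source."""
    if not isinstance(chunk.get("text"), str) or not chunk["text"].strip():
        raise ValueError("chunk text must be a non-empty string")
    metadata = chunk.get("metadata")
    if not isinstance(metadata, dict) or "source" not in metadata:
        raise ValueError("chunk metadata must be a dict with a 'source' key")


chunk = enrich_metadata_sketch(
    {"text": "RAG retrieves context before generation.", "metadata": {}},
    source="sample.txt", page=1, section="Introduction", chunk_index=0,
)
validate_chunk_sketch(chunk)  # raises ValueError if the chunk is malformed
print(sorted(chunk["metadata"]))
```
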
