# Parenting & Child Psychology RAG Pipeline

## 🎯 Overview
This notebook implements a complete **Retrieval-Augmented Generation (RAG)** system for parenting and child psychology questions, combining:
- 🌐 **Automatic web scraping** from trusted sources (UNICEF, CDC)
- 📄 **PDF document processing** for local files
- 🧠 **Free AI models** (DeepSeek R1 + Embedding Gemma)

## 💰 Cost Breakdown
- **LLM (DeepSeek R1)**: $0.00 via OpenRouter free tier
- **Embeddings (Embedding Gemma)**: $0.00 (runs locally)
- **Total Cost**: $0.00 ✨

## 📋 Pipeline Steps
1. ⚙️ **Setup**: Install dependencies & configure API keys
2. 🌐 **Web Scraping**: Download content from UNICEF & CDC
3. 📄 **PDF Loading**: Extract text from local PDF files
4. 🧹 **Text Cleaning**: Remove artifacts & normalize formatting
5. ✂️ **Chunking**: Split into ~1000 token chunks with overlap
6. 🔢 **Embeddings**: Generate 768-dim vectors with Embedding Gemma
7. 💾 **Vector Storage**: Store in ChromaDB for fast retrieval
8. 🤖 **RAG System**: Retrieve + generate answers with DeepSeek R1
9. 💬 **Query Examples**: Test with parenting questions
10. 📊 **Source Citations**: Display retrieved sources

## 🔑 Prerequisites
- Python 3.8+
- OpenRouter API key (free tier)
- 4GB+ RAM for Embedding Gemma
- ~1.5GB disk space for model cache

Let's get started! 🚀

## 1️⃣ Setup & Install Dependencies

First, we'll install all required packages for our RAG pipeline.

In [1]:
# Install required dependencies
# Run this cell first to ensure all packages are available

!pip install -q pdfplumber langchain langchain-text-splitters langchain-core langchain-community chromadb tiktoken openai sentence-transformers

print("✅ All dependencies installed successfully!")

✅ All dependencies installed successfully!


In [4]:
# Import required libraries
import os
import re
import pdfplumber
from pathlib import Path
from typing import List, Dict, Any

# LangChain imports
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings

# Sentence Transformers for Embedding Gemma
from sentence_transformers import SentenceTransformer

# OpenAI for DeepSeek R1 via OpenRouter
import openai
from openai import OpenAI
import tiktoken

import sys  # os.sys only works because os happens to import sys; use sys directly

print("📦 All libraries imported successfully!")
print(f"Python version: {sys.version}")
print(f"Working directory: {os.getcwd()}")

📦 All libraries imported successfully!
Python version: 3.10.19 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 16:41:31) [MSC v.1929 64 bit (AMD64)]
Working directory: c:\projects\parenting-and-child


In [None]:
# Set up API key
# Only need OpenRouter for DeepSeek R1 - Embedding Gemma runs locally!
# On Windows PowerShell: 
#   $env:OPENROUTER_API_KEY="your-openrouter-key-here"
#   $env:HF_TOKEN="your-huggingface-token" (optional, for faster downloads)

openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
hf_token = os.getenv("HF_TOKEN")

if not openrouter_api_key:
    print("⚠️  WARNING: OPENROUTER_API_KEY not found in environment variables!")
    print("Please set it using: $env:OPENROUTER_API_KEY='your-api-key-here'")
    print("Get your key at: https://openrouter.ai/keys")
else:
    print("✅ OpenRouter API key configured successfully!")
    print(f"Key preview: {openrouter_api_key[:8]}...{openrouter_api_key[-4:]}")

if hf_token:
    print("\n✅ Hugging Face token found (for faster model downloads)")
else:
    print("\n💡 Optional: Set HF_TOKEN for faster model downloads")
    print("   Get token at: https://huggingface.co/settings/tokens")

print("\n🎉 Embedding Gemma will run completely locally - no API costs!")
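
The key preview in the cell above slices the first 8 and last 4 characters, which would expose most of a very short or malformed key. A minimal sketch of a more defensive preview follows. `mask_secret` is a hypothetical helper (not part of the notebook), and the key shown is a made-up placeholder.

```python
def mask_secret(secret: str, head: int = 8, tail: int = 4) -> str:
    """Return a preview of a secret that never reveals more than half of it."""
    if not secret:
        return "<not set>"
    # A head/tail preview would expose most of a short secret, so hide it fully.
    if len(secret) < 2 * (head + tail):
        return "*" * len(secret)
    return f"{secret[:head]}...{secret[-tail:]}"


# Made-up placeholder values, not real keys.
print(mask_secret("sk-or-v1-0123456789abcdef0123456789abcdef"))
print(mask_secret("short-key"))
```

The length check is a design choice: with the default `head` and `tail`, anything shorter than 24 characters is fully masked rather than partially shown.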

## 2️⃣ Web Scraping: Download Parenting Resources

Automatically scrape and download parenting guides from UNICEF and CDC websites.

In [6]:
# Install web scraping dependencies
!pip install -q beautifulsoup4 requests playwright
!playwright install chromium

print("✅ Web scraping dependencies installed!")


In [4]:
# Import web scraping libraries
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time
from datetime import datetime

print("✅ Web scraping libraries imported!")

✅ Web scraping libraries imported!


In [11]:
def scrape_unicef_parenting(max_articles: int = 10) -> List[Dict[str, str]]:
    """
    Scrape parenting articles from UNICEF website.
    
    Args:
        max_articles: Maximum number of articles to scrape
        
    Returns:
        List of dictionaries containing article data
    """
    base_url = "https://www.unicef.org/parenting"
    parenting_url = f"{base_url}/child-care"
    
    articles = []
    
    try:
        print(f"🌐 Scraping UNICEF parenting page: {parenting_url}")
        
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        
        response = requests.get(parenting_url, headers=headers, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find article links (adjust selectors based on actual page structure)
        article_links = soup.find_all('a', href=True, limit=max_articles * 2)
        
        print(f"   Found {len(article_links)} potential article links")
        
        for link in article_links[:max_articles]:
            href = link.get('href')
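            if href and '/parenting/' in href:
                full_url = urljoin(base_url, href)
                
                # Avoid duplicates (same check as in scrape_cdc_parenting)
                if any(article['url'] == full_url for article in articles):
                    continue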
                
                try:
                    # Get article content
                    article_response = requests.get(full_url, headers=headers, timeout=10)
                    article_response.raise_for_status()
                    
                    article_soup = BeautifulSoup(article_response.content, 'html.parser')
                    
                    # Extract title
                    title = article_soup.find('h1')
                    title_text = title.get_text(strip=True) if title else "Untitled"
                    
                    # Extract main content
                    content_div = article_soup.find('article') or article_soup.find('main')
                    if content_div:
                        # Remove script and style elements
                        for script in content_div(['script', 'style']):
                            script.decompose()
                        
                        content_text = content_div.get_text(separator='\n', strip=True)
                        
                        if len(content_text) > 500:  # Only save if substantial content
                            articles.append({
                                'title': title_text,
                                'content': content_text,
                                'url': full_url,
                                'source': 'UNICEF',
                                'scraped_at': datetime.now().isoformat()
                            })
                            print(f"   ✅ Scraped: {title_text[:50]}...")
                    
                    time.sleep(1)  # Be respectful to the server
                    
                except Exception as e:
                    print(f"   ⚠️  Error scraping {full_url}: {str(e)}")
                    continue
                
                if len(articles) >= max_articles:
                    break
        
        print(f"\n✅ Successfully scraped {len(articles)} UNICEF articles")
        return articles
        
    except Exception as e:
        print(f"❌ Error accessing UNICEF website: {str(e)}")
        return []

def scrape_cdc_parenting(max_pages: int = 5) -> List[Dict[str, str]]:
    """
    Scrape parenting tips from CDC website.
    
    Args:
        max_pages: Maximum number of pages to scrape
        
    Returns:
        List of dictionaries containing article data
    """
    base_url = "https://www.cdc.gov"
    parenting_url = f"{base_url}/parents/"
    
    articles = []
    
    try:
        print(f"\n🌐 Scraping CDC parenting page: {parenting_url}")
        
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        
        response = requests.get(parenting_url, headers=headers, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find article/tip links
        links = soup.find_all('a', href=True, limit=max_pages * 3)
        
        print(f"   Found {len(links)} potential links")
        
        for link in links[:max_pages]:
            href = link.get('href')
            if href and ('parents' in href or 'positive-parenting' in href):
                full_url = urljoin(base_url, href)
                
                # Avoid duplicates
                if any(article['url'] == full_url for article in articles):
                    continue
                
                try:
                    article_response = requests.get(full_url, headers=headers, timeout=10)
                    article_response.raise_for_status()
                    
                    article_soup = BeautifulSoup(article_response.content, 'html.parser')
                    
                    # Extract title
                    title = article_soup.find('h1')
                    title_text = title.get_text(strip=True) if title else "Untitled"
                    
                    # Extract content
                    content_div = article_soup.find('article') or article_soup.find('div', class_='content')
                    if content_div:
                        for script in content_div(['script', 'style', 'nav', 'footer']):
                            script.decompose()
                        
                        content_text = content_div.get_text(separator='\n', strip=True)
                        
                        if len(content_text) > 300:
                            articles.append({
                                'title': title_text,
                                'content': content_text,
                                'url': full_url,
                                'source': 'CDC',
                                'scraped_at': datetime.now().isoformat()
                            })
                            print(f"   ✅ Scraped: {title_text[:50]}...")
                    
                    time.sleep(1)
                    
                except Exception as e:
                    print(f"   ⚠️  Error scraping {full_url}: {str(e)}")
                    continue
                
                if len(articles) >= max_pages:
                    break
        
        print(f"\n✅ Successfully scraped {len(articles)} CDC articles")
        return articles
        
    except Exception as e:
        print(f"❌ Error accessing CDC website: {str(e)}")
        return []

print("✅ Web scraping functions defined!")
print("   Available functions:")
print("   - scrape_unicef_parenting(max_articles=10)")
print("   - scrape_cdc_parenting(max_pages=5)")

✅ Web scraping functions defined!
   Available functions:
   - scrape_unicef_parenting(max_articles=10)
   - scrape_cdc_parenting(max_pages=5)
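
Both scrapers slice the raw list of anchors before filtering it, so navigation and language-switcher links can use up the budget before any article is reached. The sketch below shows one way to resolve, filter, and de-duplicate links first, using only the standard library. `select_article_urls` and the example paths are hypothetical and are not taken from the notebook or from the real site structure.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin


def select_article_urls(hrefs: Iterable[str], page_url: str,
                        path_fragment: str, limit: int) -> List[str]:
    """Resolve, filter, and de-duplicate hrefs while preserving page order."""
    seen = {page_url.rstrip("/")}  # never re-fetch the listing page itself
    selected: List[str] = []
    for href in hrefs:
        if len(selected) >= limit:
            break
        # Resolve relative links and drop any "#fragment" suffix.
        url, _fragment = urldefrag(urljoin(page_url, href))
        key = url.rstrip("/")  # treat trailing-slash variants as the same page
        if path_fragment in url and key not in seen:
            seen.add(key)
            selected.append(url)
    return selected


# Hypothetical hrefs, in the order they might appear on a listing page.
hrefs = [
    "/parenting/child-care",        # the listing page itself
    "/parenting/child-care#main",   # same page with an anchor
    "/parenting/food-nutrition",
    "/about",                       # unrelated navigation link
    "/parenting/food-nutrition/",   # trailing-slash duplicate
    "/parenting/mental-health",
]
urls = select_article_urls(
    hrefs, "https://www.unicef.org/parenting/child-care", "/parenting/", limit=10
)
print(urls)
```

Applying the limit after filtering means `limit` counts usable article URLs, not raw anchors.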


In [15]:
# Scrape articles from both websites
print("🚀 Starting web scraping process...\n")

# Scrape UNICEF articles
unicef_articles = scrape_unicef_parenting(max_articles=10)

# Scrape CDC articles
cdc_articles = scrape_cdc_parenting(max_pages=5)

# Combine all scraped articles
scraped_articles = unicef_articles + cdc_articles

print(f"\n📊 Scraping Summary:")
print(f"   UNICEF articles: {len(unicef_articles)}")
print(f"   CDC articles: {len(cdc_articles)}")
print(f"   Total scraped: {len(scraped_articles)}")

if scraped_articles:
    print(f"\n   Total characters scraped: {sum(len(article['content']) for article in scraped_articles):,}")
    print(f"   Average article length: {sum(len(article['content']) for article in scraped_articles) // len(scraped_articles):,} characters")
else:
    print("\n⚠️  No articles were scraped. Check your internet connection or website availability.")

🚀 Starting web scraping process...

🌐 Scraping UNICEF parenting page: https://www.unicef.org/parenting/child-care
   Found 20 potential article links
   ✅ Scraped: UNICEF Parenting...
   ⚠️  Error scraping https://www.unicef.org/parenting/child-care: 403 Client Error: Forbidden for url: https://www.unicef.org/parenting/child-care
   ⚠️  Error scraping https://www.unicef.org/parenting/fr/soins-attentifs: 403 Client Error: Forbidden for url: https://www.unicef.org/parenting/fr/soins-attentifs
   ✅ Scraped: Cuidado infantil...
   ✅ Scraped: رعاي

In [19]:
def scraped_articles_to_documents(articles: List[Dict[str, str]]) -> List[Document]:
    """
    Convert scraped articles to LangChain Document format
    
    Args:
        articles: List of dictionaries with 'title', 'content', 'url', 'source', 'scraped_at'
    
    Returns:
        List of LangChain Document objects
    """
    documents = []
    
    for article in articles:
        # Combine title and content for better context
        full_text = f"{article['title']}\n\n{article['content']}"
        
        # Create metadata
        metadata = {
            'source': article['source'],
            'url': article['url'],
            'title': article['title'],
            'scraped_at': article['scraped_at'],
            'type': 'web_article'
        }
        
        # Create Document object
        doc = Document(page_content=full_text, metadata=metadata)
        documents.append(doc)
    
    return documents

# Convert scraped articles to documents
web_documents = scraped_articles_to_documents(scraped_articles)

print(f"✅ Converted {len(web_documents)} scraped articles to document format")
if web_documents:
    # Preview the third document when available; fall back to the last one
    # so that a short scrape does not raise IndexError.
    sample = web_documents[min(2, len(web_documents) - 1)]
    print(f"\n📄 Sample document metadata:")
    print(f"   Title: {sample.metadata['title'][:60]}...")
    print(f"   Source: {sample.metadata['source']}")
    print(f"   URL: {sample.metadata['url']}")

✅ Converted 5 scraped articles to document format

📄 Sample document metadata:
   Title: رعاية الطفل...
   Source: UNICEF
   URL: https://www.unicef.org/parenting/ar/%D8%B1%D8%B9%D8%A7%D9%8A%D8%A9-%D8%A7%D9%84%D8%B7%D9%81%D9%84


---

## 3️⃣ PDF Loading: Process Local Documents

You can also add your own PDF files to the `data/` directory. The pipeline will combine both web-scraped content and PDF documents.

In [6]:
# Create data directory if it doesn't exist
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

print(f"📁 Data directory: {data_dir.absolute()}")
print(f"Directory exists: {data_dir.exists()}")

# List all PDF files in the data directory
pdf_files = list(data_dir.glob("*.pdf"))

if pdf_files:
    print(f"\n✅ Found {len(pdf_files)} PDF file(s):")
    for pdf in pdf_files:
        print(f"  - {pdf.name} ({pdf.stat().st_size / 1024:.2f} KB)")
else:
    print("\n⚠️  No PDF files found in 'data/' folder.")
    print("Please add parenting guides or psychology PDFs to the 'data/' directory.")
    print("\nSuggested free resources:")
    print("  - UNICEF Parenting Guides: https://www.unicef.org/parenting")
    print("  - CDC Positive Parenting Tips: https://www.cdc.gov/parents/")
    print("  - Public domain psychology texts from Project Gutenberg")

📁 Data directory: c:\projects\parenting-and-child\data
Directory exists: True

✅ Found 4 PDF file(s):
  - 1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf (2735.76 KB)
  - genius.pdf (1179.39 KB)
  - Parenting With Love and Logic_ Teaching Children Responsibility ( PDFDrive.com ).pdf (1856.94 KB)
  - The Whole-Brain Child_ 12 Revolutionary Strategies to Nurture Your Child_s Developing Mind .pdf (3420.78 KB)


In [7]:
def extract_text_from_pdf(pdf_path: Path) -> Dict[str, Any]:
    """
    Extract text from a PDF file using pdfplumber.
    
    Args:
        pdf_path: Path to the PDF file
        
    Returns:
        Dictionary containing filename, total pages, and extracted text per page
    """
    extracted_data = {
        "filename": pdf_path.name,
        "filepath": str(pdf_path),
        "pages": [],
        "total_pages": 0,
        "full_text": ""
    }
    
    try:
        with pdfplumber.open(pdf_path) as pdf:
            extracted_data["total_pages"] = len(pdf.pages)
            
            for page_num, page in enumerate(pdf.pages, start=1):
                text = page.extract_text()
                if text:  # Only add pages with text
                    extracted_data["pages"].append({
                        "page_number": page_num,
                        "text": text
                    })
                    extracted_data["full_text"] += text + "\n\n"
        
        print(f"✅ Extracted {extracted_data['total_pages']} pages from '{pdf_path.name}'")
        return extracted_data
        
    except Exception as e:
        print(f"❌ Error extracting '{pdf_path.name}': {str(e)}")
        return extracted_data

# Extract text from all PDFs
pdf_documents = []

for pdf_file in pdf_files:
    doc_data = extract_text_from_pdf(pdf_file)
    if doc_data["full_text"]:
        pdf_documents.append(doc_data)

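# Combine web-scraped documents with PDF documents.
# Note: web_documents holds LangChain Document objects while pdf_documents holds
# plain dicts, so all_documents is a mixed-type list until the cleaning step.
all_documents = web_documents + pdf_documents if 'web_documents' in globals() else pdf_documents

print(f"\n📊 Combined Document Summary:")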
if 'web_documents' in globals():
    print(f"  Web articles: {len(web_documents)}")
print(f"  PDF documents: {len(pdf_documents)}")
print(f"  Total documents: {len(all_documents)}")
if pdf_documents:
    print(f"  Total PDF pages: {sum(doc['total_pages'] for doc in pdf_documents)}")
    print(f"  Total PDF characters: {sum(len(doc['full_text']) for doc in pdf_documents):,}")

✅ Extracted 282 pages from '1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf'
✅ Extracted 110 pages from 'genius.pdf'
✅ Extracted 305 pages from 'Parenting With Love and Logic_ Teaching Children Responsibility ( PDFDrive.com ).pdf'
✅ Extracted 225 pages from 'The Whole-Brain Child_ 12 Revolutionary Strategies to Nurture Your Child_s Developing Mind .pdf'

📊 Combined Document Summary:
  PDF documents: 4
  Total documents: 4
  Total PDF pages: 922
  Total PDF characters: 1,450,643
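
The combined list above mixes two shapes: LangChain `Document` objects for web articles and plain dicts for PDFs. One possible way to avoid index-based guessing later is to normalize both into a single record type as soon as they are loaded. The sketch below is illustrative only. `SourceRecord`, `from_pdf_dict`, and `from_web_article` are hypothetical names, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SourceRecord:
    """One uniform shape for both web articles and PDFs (hypothetical helper)."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def from_pdf_dict(pdf: Dict[str, Any]) -> SourceRecord:
    # Uses the keys produced by extract_text_from_pdf().
    return SourceRecord(
        text=pdf["full_text"],
        metadata={"source": pdf["filename"],
                  "total_pages": pdf["total_pages"],
                  "type": "pdf"},
    )


def from_web_article(article: Dict[str, str]) -> SourceRecord:
    # Uses the keys produced by the scraping functions.
    return SourceRecord(
        text=f"{article['title']}\n\n{article['content']}",
        metadata={"source": article["source"],
                  "url": article["url"],
                  "type": "web_article"},
    )


# Invented sample inputs, for illustration only.
records = [
    from_web_article({"title": "Sleep tips", "content": "Keep a routine.",
                      "url": "https://example.org/sleep", "source": "Example"}),
    from_pdf_dict({"filename": "guide.pdf", "full_text": "Chapter 1", "total_pages": 3}),
]
for record in records:
    print(record.metadata["type"], len(record.text))
```

With a uniform record, a sample can be selected by `metadata["type"]` instead of by a fixed list position.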


In [25]:
sample_doc = pdf_documents[0]  # first PDF dict; all_documents[5] only works when exactly 5 web articles were scraped
sample_doc

{'filename': '1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf',
 'filepath': 'data\\1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf',
 'pages': [{'page_number': 2,
   'text': '1-2-3 Magic\nA humorous look at parenting, a\nserious look at discipline.\nHere’s what people are saying:\nThank you for all you do!\n“I am a school social worker and I recommend 1-2-3 Magic to ALL parents\nwith whom I work. It is without doubt the very best in parenting strategiesI”\nThis book is like oxygen.\n“Neither my wife nor I knew how to discipline our two year old. A toddler\nwas running our house and our lives. Being out of ideas seemed like being out\nof oxygen and we were squirming—until 1-2-3 Magic was loaned to us.”\nMental health professional: Best discipline system, period.\n“As a mental health professional for over 16 years, I’ve found 1-2-3 Magic to\nbe the most powerful method of managing kids ages 2-1

In [27]:
# Display a sample of extracted text.
# Index into pdf_documents directly: all_documents mixes Document objects (web)
# with dicts (PDF), so a fixed position such as all_documents[5] is fragile.
if pdf_documents:
    sample_doc = pdf_documents[0]
    print(f"📄 Sample from '{sample_doc['filename']}':")
    print("=" * 80)
    print(sample_doc['full_text'][4000:4500] + "...")
    print("=" * 80)
else:
    print("⚠️  No documents to display. Please add PDF files to the 'data/' folder.")

📄 Sample from '1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf':
s book is not intended to replace appropriate diagnosis
and/or treatment, when indicated, by a qualified mental health professional or physician.
Illustrations by Dan Farrell
Graphic Design by Mary Navolio
Distributed by Independent Publishers Group
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
For more information, contact:
ParentMagic, Inc. 800
Roosevelt Road
Glen Ellyn, Illinois 60137
Publisher’s Cataloging-in-Publication
(Provided by Quality Books, Inc.)
Phelan, Thomas W., 194...


## 4️⃣ Text Cleaning & Preprocessing

Clean the extracted text by removing artifacts, excessive whitespace, and standardizing formatting.
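
As a small, self-contained illustration of these cleaning steps, the sketch below runs four regular-expression passes over an invented snippet. It is not the notebook's `clean_text`. In particular, the page-number pattern here matches only spaces and tabs around the digits, an assumption made so that paragraph breaks next to a page number survive.

```python
import re

# Invented sample: a word hyphenated across a line break, extra spaces,
# a stray page number on its own line, and a tab.
raw = "Good disci-\npline   takes time.\n\n\n\n12\n\nPraise effort,\tnot talent."

# Join words that were hyphenated across a line break ("disci-\npline").
text = re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", raw)
# Collapse runs of spaces and tabs into a single space.
text = re.sub(r"[ \t]+", " ", text)
# Drop a line holding only a page number, leaving the surrounding newlines.
text = re.sub(r"\n[ \t]*\d+[ \t]*(?=\n)", "", text)
# Keep at most one blank line between paragraphs.
text = re.sub(r"\n{3,}", "\n\n", text)

print(text)
```

Note that the de-hyphenation pass also merges genuinely hyphenated words that happen to break at a line end, which is a known trade-off of this heuristic.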

In [8]:
def clean_text(text: str) -> str:
    """
    Clean and preprocess extracted PDF text.
    
    Args:
        text: Raw text from PDF
        
    Returns:
        Cleaned text
    """
    # Remove excessive newlines (keep paragraph breaks)
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Remove page numbers (common patterns)
    text = re.sub(r'\n\s*\d+\s*\n', '\n', text)
    text = re.sub(r'Page \d+', '', text, flags=re.IGNORECASE)
    
    # Remove common header/footer artifacts
    text = re.sub(r'\n[A-Z\s]{20,}\n', '\n', text)  # All caps headers
    
    # Fix hyphenation at line breaks
    text = re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
    
    # Normalize whitespace
    text = re.sub(r'[ \t]+', ' ', text)  # Multiple spaces to single space
    text = re.sub(r'\n ', '\n', text)    # Remove leading spaces after newlines
    
    # Remove very short lines (likely artifacts)
    lines = text.split('\n')
    lines = [line for line in lines if len(line.strip()) > 2 or line.strip() == '']
    text = '\n'.join(lines)
    
    # Final cleanup
    text = text.strip()
    
    return text

# Clean all documents
cleaned_documents = []

# Process web documents (Document objects)
if 'web_documents' in globals():
    for doc in web_documents:
        original_length = len(doc.page_content)
        cleaned_content = clean_text(doc.page_content)
        cleaned_length = len(cleaned_content)
        
        # Create new Document with cleaned content
        cleaned_doc = Document(
            page_content=cleaned_content,
            metadata=doc.metadata
        )
        cleaned_documents.append(cleaned_doc)
        
        print(f"🧹 Cleaned web article '{doc.metadata['title'][:50]}...':")
        print(f"   Original: {original_length:,} characters")
        print(f"   Cleaned:  {cleaned_length:,} characters")
        print(f"   Reduced:  {original_length - cleaned_length:,} characters ({(1 - cleaned_length/original_length)*100:.1f}%)\n")

# Process PDF documents (dictionaries)
for doc in pdf_documents if 'pdf_documents' in globals() else []:
    original_length = len(doc['full_text'])
    doc['cleaned_text'] = clean_text(doc['full_text'])
    cleaned_length = len(doc['cleaned_text'])
    
    # Convert to Document format
    cleaned_doc = Document(
        page_content=doc['cleaned_text'],
        metadata={
            'source': doc['filename'],
            'filepath': doc['filepath'],
            'total_pages': doc['total_pages'],
            'type': 'pdf'
        }
    )
    cleaned_documents.append(cleaned_doc)
    
    print(f"🧹 Cleaned PDF '{doc['filename']}':")
    print(f"   Original: {original_length:,} characters")
    print(f"   Cleaned:  {cleaned_length:,} characters")
    print(f"   Reduced:  {original_length - cleaned_length:,} characters ({(1 - cleaned_length/original_length)*100:.1f}%)\n")

print(f"📊 Total cleaned documents: {len(cleaned_documents)}")

🧹 Cleaned PDF '1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf':
   Original: 406,253 characters
   Cleaned:  405,921 characters
   Reduced:  332 characters (0.1%)

🧹 Cleaned PDF 'genius.pdf':
   Original: 281,783 characters
   Cleaned:  281,739 characters
   Reduced:  44 characters (0.0%)

🧹 Cleaned PDF 'Parenting With Love and Logic_ Teaching Children Responsibility ( PDFDrive.com ).pdf':
   Original: 446,249 characters
   Cleaned:  446,020 characters
   Reduced:  229 characters (0.1%)

🧹 Cleaned PDF 'The Whole-Brain Child_ 12 Revolutionary Strategies to Nurture Your Child_s Developing Mind .pdf':
   Original: 316,358 characters
   Cleaned:  314,310 characters
   Reduced:  2,048 characters (0.6%)

📊 Total cleaned documents: 4

In [7]:
# Preview the first cleaned document
if cleaned_documents:
    sample_doc = cleaned_documents[0]
    
    print("📋 Sample Document:")
    print("-" * 80)
    print(f"Source: {sample_doc.metadata.get('source', sample_doc.metadata.get('title', 'Unknown'))}")
    print(f"Type: {sample_doc.metadata['type']}")
    print(f"\nContent preview (first 300 chars):")
    print(sample_doc.page_content[:300] + "...")
    print("-" * 80)

📋 Sample Document:
--------------------------------------------------------------------------------
Source: 1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf
Type: pdf

Content preview (first 300 chars):
1-2-3 Magic
A humorous look at parenting, a
serious look at discipline.
Here’s what people are saying:
Thank you for all you do!
“I am a school social worker and I recommend 1-2-3 Magic to ALL parents
with whom I work. It is without doubt the very best in parenting strategiesI”
This book is like oxy...
--------------------------------------------------------------------------------


## 5️⃣ Text Chunking with Metadata

Split the cleaned text into ~1000 token chunks using LangChain's RecursiveCharacterTextSplitter and add metadata.
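
To make the idea of overlap concrete, here is a deliberately simplified fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph breaks). `chunk_with_overlap` is a hypothetical helper, and the tiny sizes are chosen only so the overlap is visible.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Fixed-width window: each chunk starts with the last `overlap` chars of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(chunks)
# Each chunk begins with the last 3 characters of the one before it.
print(all(a[-3:] == b[:3] for a, b in zip(chunks, chunks[1:])))
```

The repeated characters at each boundary are what let a sentence that straddles two chunks remain retrievable from either one.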

In [9]:
# Initialize the text splitter
# Target ~1000 tokens per chunk (roughly 4 characters per token for English text, depending on content)
# We'll use 4000 characters as a rough estimate for ~1000 tokens

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,           # Characters per chunk (~1000 tokens)
    chunk_overlap=200,          # Overlap to maintain context between chunks
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Split on paragraphs first, then sentences
)

print("✅ Text splitter initialized")
print(f"   Chunk size: 4000 characters (~1000 tokens)")
print(f"   Chunk overlap: 200 characters")
print(f"   Separators: {text_splitter._separators}")

✅ Text splitter initialized
   Chunk size: 4000 characters (~1000 tokens)
   Chunk overlap: 200 characters
   Separators: ['\n\n', '\n', '. ', ' ', '']


In [10]:
# Create chunks with metadata from all cleaned documents
all_chunks = []

for doc in cleaned_documents:
    # Split the cleaned document into chunks
    chunks = text_splitter.split_documents([doc])
    
    # Add chunk index to metadata
    for i, chunk in enumerate(chunks):
        chunk.metadata['chunk_id'] = i
        chunk.metadata['total_chunks'] = len(chunks)
        all_chunks.append(chunk)
    
    source_name = doc.metadata.get('title', doc.metadata.get('source', 'Unknown'))[:50]
    print(f"📝 Created {len(chunks)} chunks from '{source_name}...'")

print(f"\n✅ Total chunks created: {len(all_chunks)}")
if all_chunks:
    print(f"   Average chunk size: {sum(len(chunk.page_content) for chunk in all_chunks) / len(all_chunks):.0f} characters")

📝 Created 126 chunks from '1-2-3-magic-effective-discipline-for-children-212-...'
📝 Created 98 chunks from 'genius.pdf...'
📝 Created 139 chunks from 'Parenting With Love and Logic_ Teaching Children R...'
📝 Created 100 chunks from 'The Whole-Brain Child_ 12 Revolutionary Strategies...'

✅ Total chunks created: 463
   Average chunk size: 3129 characters


In [11]:
# Display sample chunk with metadata
if all_chunks:
    sample_chunk = all_chunks[1]
    
    print("📄 Sample Chunk:")
    print("=" * 80)
    print(f"Content preview: {sample_chunk.page_content[:300]}...")
    print("=" * 80)
    print(f"\n📊 Metadata:")
    for key, value in sample_chunk.metadata.items():
        print(f"   {key}: {value}")
    print(f"\n   Chunk length: {len(sample_chunk.page_content)} characters")

📄 Sample Chunk:
Content preview: To order:
call 1-800-442-4453
or visit www.parentmagic.com

Copyright © 2010, ParentMagic, Inc. All rights reserved. No part of this book may be reproduced
or transmitted in any form or by any means, electronic or mechanical, including photocopying,
recording, or by any information storage and retri...

📊 Metadata:
   source: 1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf
   filepath: data\1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf
   total_pages: 282
   type: pdf
   chunk_id: 1
   total_chunks: 126

   Chunk length: 3504 characters


## 6️⃣ Create Embeddings with Embedding Gemma

Load Google's Embedding Gemma model - a 300M parameter model that runs locally for FREE!
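
Once text is embedded, retrieval comes down to comparing vectors. The sketch below ranks a few documents against a query by cosine similarity, one common measure, using invented 3-dimensional vectors in place of real 768-dimensional embeddings. The function names and all values are hypothetical, and the code makes no claim about which distance metric the vector store uses by default.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], docs: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[float, str]]:
    """Return the k most similar documents as (score, name), best first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in docs.items()]
    return sorted(scored, reverse=True)[:k]


# Invented toy vectors; real embeddings would come from the embedding model.
docs = {
    "tantrums": [0.9, 0.1, 0.0],
    "sleep routines": [0.1, 0.9, 0.1],
    "screen time": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

for score, name in top_k(query, docs, k=3):
    print(f"{name}: {score:.3f}")
```

Because cosine similarity ignores vector length, it compares direction only, which is why it is often used with text embeddings.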

In [12]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"   Using device: {device}\n")

   Using device: cuda



In [13]:
import torch
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("Device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")


2.6.0+cu126
CUDA available: True
CUDA version: 12.6
Device name: NVIDIA GeForce RTX 3050 Laptop GPU


In [15]:
# Initialize Embedding Gemma model
# This will download the model on first run (~1.2GB)
# Subsequent runs will use the cached model

print("🔧 Loading Embedding Gemma model...")
print("   This may take a few minutes on first run (downloading ~1.2GB)")
print("   Subsequent runs will be instant!\n")
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"   Using device: {device}\n")
try:
    embeddings_model = SentenceTransformer(
        "google/embeddinggemma-300m",
        token=hf_token if hf_token else None
    ).to(device=device)
    
    print("✅ Embedding Gemma model loaded successfully!")
    print(f"   Model: google/embeddinggemma-300m")
    print(f"   Parameters: 300M")
    print(f"   Embedding dimension: 768")
    print(f"   Max sequence length: 2048 tokens")
    print(f"   Languages: 100+")
    print(f"   Cost: FREE (runs locally!)")
    print(f"   Device: {embeddings_model.device}")
    
except Exception as e:
    print(f"❌ Error loading Embedding Gemma: {str(e)}")
    print("\n💡 You may need to accept the license at:")
    print("   https://huggingface.co/google/embeddinggemma-300m")
    print("   (Requires free Hugging Face account)")

🔧 Loading Embedding Gemma model...
   This may take a few minutes on first run (downloading ~1.2GB)
   Subsequent runs will be instant!

   Using device: cuda

✅ Embedding Gemma model loaded successfully!
   Model: google/embeddinggemma-300m
   Parameters: 300M
   Embedding dimension: 768
   Max sequence length: 2048 tokens
   Languages: 100+
   Cost: FREE (runs locally!)
   Device: cuda:0


In [16]:
# Test embedding generation with a sample chunk
if all_chunks:
    sample_text = all_chunks[0].page_content[:200]  # Use first 200 chars for testing
    
    print("🧪 Testing embedding generation...")
    print(f"Sample text: '{sample_text}...'\n")
    
    try:
        # Generate an embedding for the sample with Embedding Gemma.
        # encode() is called here without an explicit prompt name.
        sample_embedding = embeddings_model.encode(sample_text, convert_to_numpy=True)
        
        print(f"✅ Embedding generated successfully!")
        print(f"   Embedding dimension: {len(sample_embedding)}")
        print(f"   First 10 values: {sample_embedding[:10]}")
        print(f"   Embedding type: {type(sample_embedding)}")
        print(f"   Processing time: Instant (runs locally!)")
    except Exception as e:
        print(f"❌ Error generating embedding: {str(e)}")

🧪 Testing embedding generation...
Sample text: '1-2-3 Magic
A humorous look at parenting, a
serious look at discipline.
Here’s what people are saying:
Thank you for all you do!
“I am a school social worker and I recommend 1-2-3 Magic to ALL parents...'

✅ Embedding generated successfully!
   Embedding dimension: 768
   First 10 values: [-0.05879203 -0.00481089  0.01744458 -0.00423945  0.02426259  0.00779395
 -0.00952305  0.03587556  0.04948891 -0.04869011]
   Embedding type: <class 'numpy.ndarray'>
   Processing time: Instant (runs locally!)


In [17]:
# Create a LangChain-compatible embeddings wrapper for Embedding Gemma

class EmbeddingGemmaEmbeddings(Embeddings):
    """LangChain wrapper for Embedding Gemma model."""
    
    def __init__(self, model: SentenceTransformer):
        self.model = model
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents using Embedding Gemma."""
        # encode() is called without an explicit prompt name; all texts are embedded as one batch
        embeddings = self.model.encode(
            texts,
            convert_to_numpy=True,
            show_progress_bar=True
        )
        return embeddings.tolist()
    
    def embed_query(self, text: str) -> List[float]:
        """Embed a query using Embedding Gemma."""
        # Queries go through the same encode() call as documents (no query-specific prompt)
        embedding = self.model.encode(
            text,
            convert_to_numpy=True
        )
        return embedding.tolist()

# Wrap the Embedding Gemma model for use with LangChain
langchain_embeddings = EmbeddingGemmaEmbeddings(embeddings_model)

print("✅ LangChain wrapper created for Embedding Gemma")
print("   Ready to use with ChromaDB!")

✅ LangChain wrapper created for Embedding Gemma
   Ready to use with ChromaDB!
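
To make the retrieval idea concrete, here is a minimal pure-Python sketch of what "similar" means for embedding vectors: score each document vector against the query vector and keep the best matches. It uses cosine similarity on toy 3-dimensional vectors; the metric Chroma actually applies depends on how the collection is configured, and `cosine_similarity` and `top_k` are illustrative names rather than anything defined in this notebook.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k document vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for 768-dimensional embeddings.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query = [1.0, 0.2, 0.0]
print([i for i, _ in top_k(query, docs, k=2)])  # [0, 2]
```

A vector store does the same job with an index, so it does not have to scan every stored vector for each query.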


## 7️⃣ Store Embeddings in Chroma Vector Database

Create a Chroma vector store and populate it with our text chunks and their embeddings.

In [18]:
# Create Chroma vector store
# This will automatically generate embeddings for all chunks using Embedding Gemma

print("🔧 Creating Chroma vector database...")
print(f"   Processing {len(all_chunks)} chunks with Embedding Gemma...")
print("   This may take a few moments (running locally on your machine).\n")

try:
    # Create the vector store from documents
    # Chroma persists to ./chroma_db because persist_directory is set below
    vectorstore = Chroma.from_documents(
        documents=all_chunks,
        embedding=langchain_embeddings,
        persist_directory="./chroma_db",
        collection_name="parenting_knowledge"
    )
    
    print("✅ Vector database created successfully!")
    print(f"   Collection name: parenting_knowledge")
    print(f"   Persist directory: ./chroma_db")
    print(f"   Total documents stored: {vectorstore._collection.count()}")
    print(f"   💰 Cost: $0.00 (Embedding Gemma runs locally!)")
    
except Exception as e:
    print(f"❌ Error creating vector store: {str(e)}")

🔧 Creating Chroma vector database...
   Processing 463 chunks with Embedding Gemma...
   This may take a few moments (running locally on your machine).

✅ Vector database created successfully!
   Collection name: parenting_knowledge
   Persist directory: ./chroma_db
   Total documents stored: 463
   💰 Cost: $0.00 (Embedding Gemma runs locally!)
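
Because the collection is persisted to disk, re-running the cell above may add the same chunks a second time. One way to guard against that is to give every chunk a deterministic ID. The sketch below only builds such IDs from the `source` and `chunk_id` metadata shown earlier plus the chunk text. It uses plain dicts as stand-ins for LangChain documents, and `stable_chunk_id` and `build_ids` are hypothetical helpers. Passing the resulting list as an `ids` argument to `Chroma.from_documents` is an assumption to check against the installed LangChain version.

```python
import hashlib
from typing import Any, Dict, List


def stable_chunk_id(metadata: Dict[str, Any], text: str) -> str:
    """Derive a deterministic ID from source, chunk_id, and the chunk text."""
    key = f"{metadata.get('source', '')}|{metadata.get('chunk_id', '')}|{text}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def build_ids(chunks: List[Dict[str, Any]]) -> List[str]:
    """Build one ID per chunk; each chunk is a dict with 'metadata' and 'text' keys."""
    return [stable_chunk_id(c["metadata"], c["text"]) for c in chunks]


# Plain-dict stand-ins for the notebook's chunk objects (illustrative data only).
chunks = [
    {"metadata": {"source": "book.pdf", "chunk_id": 1}, "text": "first chunk"},
    {"metadata": {"source": "book.pdf", "chunk_id": 2}, "text": "second chunk"},
]
ids = build_ids(chunks)
print(len(ids), len(set(ids)))  # 2 2
```

The same input always yields the same ID, so a store that upserts by ID would overwrite existing entries rather than duplicate them.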


In [19]:
# Verify the vector store by performing a test similarity search
test_query = "child behavior"

print(f"🔍 Testing vector store with query: '{test_query}'")
print("   Retrieving top 3 similar chunks...\n")

try:
    results = vectorstore.similarity_search(test_query, k=3)
    
    print(f"✅ Retrieved {len(results)} results:")
    for i, doc in enumerate(results, 1):
        print(f"\n   Result {i}:")
        print(f"   Source: {doc.metadata.get('source', 'N/A')}")
        print(f"   Title: {doc.metadata.get('title', 'N/A')}")
        print(f"   Preview: {doc.page_content[:100]}...")
        
except Exception as e:
    print(f"❌ Error searching vector store: {str(e)}")

🔍 Testing vector store with query: 'child behavior'
   Retrieving top 3 similar chunks...

✅ Retrieved 3 results:

   Result 1:
   Source: 1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf
   Title: N/A
   Preview: routines, defining for positive behavior, 118–119, 135–136
running away, threat of, 86
Ssadness, dea...

   Result 2:
   Source: Parenting With Love and Logic_ Teaching Children Responsibility ( PDFDrive.com ).pdf
   Title: N/A
   Preview: Index
abandonment, 58
ability. See skills
acceptance, 29, 249
quest for, 39, 188
accomplishments, 44...

   Result 3:
   Source: 1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf
   Title: N/A
   Preview: martyrdom testing tactic, 83–84, 87–88
MBAs (Minor-But-Aggravating actions), 43–44
mealtimes
3-out-o...
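
All three previews above look like back-of-book index entries (a term followed by page numbers), which match short keyword queries well but carry little usable guidance. A hedged heuristic for flagging such chunks before they are embedded is sketched below. `looks_like_index` and its 50% threshold are assumptions for illustration and would need tuning on the real corpus.

```python
import re

# A line such as "acceptance, 29, 249" or "routines, 118–119": text, a comma, then page numbers.
_INDEX_ENTRY = re.compile(
    r"^.+?,\s*\d+(?:\s*[\u2013-]\s*\d+)?(?:,\s*\d+(?:\s*[\u2013-]\s*\d+)?)*\s*$"
)


def looks_like_index(text: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the non-empty lines look like index entries."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if _INDEX_ENTRY.match(ln))
    return hits / len(lines) >= threshold


index_like = "abandonment, 58\nability. See skills\nacceptance, 29, 249\nquest for, 39, 188"
prose = "Wise parents simply let tantrums happen.\nThey stay calm and wait."
print(looks_like_index(index_like))  # True
print(looks_like_index(prose))       # False
```

Chunks flagged this way could be dropped or down-weighted before building the vector store. Whether that improves answers for this corpus would have to be checked by re-running the test query.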


## 8️⃣ Implement RAG Query Function

Create a function that retrieves relevant context and generates answers using DeepSeek R1 via OpenRouter.

In [20]:
# Initialize DeepSeek R1 via OpenRouter
from openai import OpenAI

llm_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=openrouter_api_key,
)

print("✅ Language model initialized")
print(f"   Model: DeepSeek R1 (via OpenRouter)")
print(f"   Provider: tngtech/deepseek-r1t2-chimera:free")
print(f"   Purpose: Generate answers based on retrieved context")
print(f"   Note: Free tier with rate limits")

✅ Language model initialized
   Model: DeepSeek R1 (via OpenRouter)
   Provider: tngtech/deepseek-r1t2-chimera:free
   Purpose: Generate answers based on retrieved context
   Note: Free tier with rate limits


In [None]:
def query_rag_system(question: str, k: int = 3) -> Dict[str, Any]:
    """
    Query the RAG system with a question about parenting and child psychology.
    
    Args:
        question: The question to ask
        k: Number of relevant chunks to retrieve (default: 3)
        
    Returns:
        Dictionary containing the answer, sources, and retrieved chunks
    """
    print(f"🔍 Processing query: '{question}'")
    print(f"   Retrieving top {k} relevant chunks...\n")
    
    # Step 1: Retrieve relevant chunks from vector store
    retrieved_docs = vectorstore.similarity_search(question, k=k)
    
    # Step 2: Format the context from retrieved chunks
    context = "\n\n---\n\n".join([
        f"Source: {doc.metadata.get('title', 'Unknown')} ({doc.metadata.get('source', 'N/A')})\n"
        f"Content: {doc.page_content}"
        for doc in retrieved_docs
    ])
    
    # Step 3: Create the prompt for the LLM
    prompt = f"""You are a helpful parenting expert assistant. Use the following context from parenting guides and psychology resources to answer the question. 

If the context doesn't contain enough information to fully answer the question, say so and provide general guidance based on child psychology principles.

Context:
{context}

Question: {question}

Provide a clear, practical, and empathetic answer that helps parents understand and address the situation. Include specific techniques or strategies when available from the context."""

    # Step 4: Generate answer using DeepSeek R1
    print("🤖 Generating answer with DeepSeek R1...")
    
    response = llm_client.chat.completions.create(
        extra_headers={
            "HTTP-Referer": "https://parenting-assistant.local",
            "X-Title": "Parenting RAG Assistant",
        },
        model="tngtech/deepseek-r1t2-chimera:free",
        messages=[
            {
                "role": "user",
                "content": prompt
            }
        ]
    )
    answer = response.choices[0].message.content
    
    # Step 5: Prepare sources information
    sources = []
    for i, doc in enumerate(retrieved_docs, 1):
        sources.append({
            "number": i,
            "title": doc.metadata.get('title', 'Unknown'),
            "source": doc.metadata.get('source', 'N/A'),
            "chunk_id": doc.metadata.get('chunk_id', 'N/A'),
            "excerpt": doc.page_content[:200] + "..."
        })
    
    return {
        "question": question,
        "answer": answer,
        "sources": sources,
        "num_sources": len(sources)
    }

print("✅ RAG query function ready!")
print("   Use: query_rag_system('your question here')")

✅ RAG query function ready!
   Use: query_rag_system('your question here')


## 9️⃣ Example QA: Query the System

Test the RAG system with real parenting questions.

In [22]:
# Example Question 1: Dealing with tantrums
question1 = "How can I deal with my child's tantrums?"

result1 = query_rag_system(question1)

print("\n" + "="*80)
print("❓ QUESTION:")
print("="*80)
print(result1['question'])
print("\n" + "="*80)
print("💡 ANSWER:")
print("="*80)
print(result1['answer'])
print("\n" + "="*80)

🔍 Processing query: 'How can I deal with my child's tantrums?'
   Retrieving top 3 relevant chunks...

🤖 Generating answer with DeepSeek R1...

❓ QUESTION:
How can I deal with my child's tantrums?

💡 ANSWER:

Handling your child’s tantrums can be challenging, but the context provides several evidence-based strategies. Here’s a synthesized, empathetic guide to addressing them effectively:

### 1. **Identify the Type of Tantrum**  
   - **"Upstairs" (Strategic/Manipulative) Tantrums**: These occur when your child seeks to control a situation (e.g., demanding toys in a store). They can stop immediately if they get their way or face consequences.  
   - **Example**: Your child screams for a toy but quiets down if you threaten to cancel a playdate.  

### How to Respond:  
#### For "Upstairs" Tantrums:  
   - **Set firm boundaries with clear consequences**:  
     *"I see you want the slippers, but screaming isn’t okay. If you don’t stop now, we won’t get them, and y

In [25]:
# Example Question 2: Managing child anger
question2 = "What are positive ways to manage child anger according to psychology experts?"

result2 = query_rag_system(question2)

print("\n" + "="*80)
print("❓ QUESTION:")
print("="*80)
print(result2['question'])
print("\n" + "="*80)
print("💡 ANSWER:")
print("="*80)
print(result2['answer'])
print("\n" + "="*80)

🔍 Processing query: 'What are positive ways to manage child anger according to psychology experts?'
   Retrieving top 3 relevant chunks...

🤖 Generating answer with DeepSeek R1...

❓ QUESTION:
What are positive ways to manage child anger according to psychology experts?

💡 ANSWER:


Based on the provided context from parenting resources, there are no explicit strategies for managing *child* anger. However, the principles in these resources, along with general child psychology, suggest these **positive, evidence-based approaches** to help children navigate anger:

---

### **Key Strategies from Context + General Psychology Principles**
1. **Empathy First, Always**  
   When a child is angry (or acting out), start with validation: *"I see you're really upset. Big feelings are hard, but we can figure this out together."*  
   - *Why it works*: The "Love and Logic" approach emphasizes empathy as foundational. Children learn emotional regulation when they feel understood, not c

In [26]:
# Example Question 3: Your own custom question
# Change this to ask any parenting question you'd like!

question3 = "How can I encourage positive behavior in my child?"

result3 = query_rag_system(question3)

print("\n" + "="*80)
print("❓ QUESTION:")
print("="*80)
print(result3['question'])
print("\n" + "="*80)
print("💡 ANSWER:")
print("="*80)
print(result3['answer'])
print("\n" + "="*80)

🔍 Processing query: 'How can I encourage positive behavior in my child?'
   Retrieving top 3 relevant chunks...

🤖 Generating answer with DeepSeek R1...

❓ QUESTION:
How can I encourage positive behavior in my child?

💡 ANSWER:

Based on the parenting resources provided, here are practical, research-backed strategies to encourage positive behavior in your child:

### 1. **Praise Early and Often (Positive Reinforcement)**
   - **Why it works:** Children respond to encouragement like athletes respond to cheering. Praise motivates far more than criticism.
   - **How to do it:**
     - **Be specific:** "Thanks for clearing the table so quickly!" 
     - **Aim for a 3:1 ratio:** Give 3–4 positive comments for every 1 corrective remark.
     - **Notice small wins:** Reinforce cooperation, even during calm moments (e.g., "I love how you’re playing peacefully!").
   - *Context insight:* Parents often default to negative feedback; break this habit by intentionally praising posit

## 🔟 Display Retrieved Sources

Show which documents informed each answer for transparency and verification.

In [23]:
def display_sources(result: Dict[str, Any]) -> None:
    """
    Display the sources that were used to generate an answer.
    
    Args:
        result: Result dictionary from query_rag_system()
    """
    print(f"\n{'='*80}")
    print(f"📚 SOURCES FOR: '{result['question']}'")
    print(f"{'='*80}")
    print(f"Retrieved {result['num_sources']} relevant sources:\n")
    
    for source in result['sources']:
        print(f"  [{source['number']}] {source['title']}")
        print(f"      File: {source['source']}")
        print(f"      Chunk: {source['chunk_id']}")
        print(f"      Excerpt: {source['excerpt']}")
        print()
    
    print(f"{'='*80}\n")

# Display sources for all three questions.
# Run the three example cells in section 9 first; otherwise result2 and result3 are undefined.
display_sources(result1)
display_sources(result2)
display_sources(result3)


📚 SOURCES FOR: 'How can I deal with my child's tantrums?'
Retrieved 3 relevant sources:

  [1] Unknown
      File: The Whole-Brain Child_ 12 Revolutionary Strategies to Nurture Your Child_s Developing Mind .pdf
      Chunk: 30
      Excerpt: push buttons and terrorize you until she get what she wants.
Despite her dramatic and seemingly heartfelt pleas, she could
instantly stop the tantrum if she wanted to, for instance, if you
gave in to h...

  [2] Unknown
      File: 1-2-3-magic-effective-discipline-for-children-212-4nbsped-1889140589-9781889140582.pdf
      Chunk: 43
      Excerpt: So let’s say you chose to time-out your six-year-old son for tantruming.
He’s now in his room and he’s still having a fit. What if the time-out
period is up but the child’s not done with the tantrum? ...

  [3] Unknown
      File: Parenting With Love and Logic_ Teaching Children Responsibility ( PDFDrive.com ).pdf
      Chunk: 125
      Excerpt: Wise parents simply let tantrums happen. There‚

NameError: name 'result2' is not defined
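
The `NameError` above comes from execution order: the sources cell ran before the cells that define `result2` and `result3`. A small guard that displays only the results that already exist is sketched below. `display_available` is a hypothetical helper, and the stand-in `shown.append` plays the role of `display_sources` so the example runs on its own.

```python
from typing import Any, Callable, Dict, Iterable, List


def display_available(names: Iterable[str],
                      namespace: Dict[str, Any],
                      display: Callable[[Any], None]) -> List[str]:
    """Call display() for every name found in namespace; return the names that were missing."""
    skipped = []
    for name in names:
        if name in namespace:
            display(namespace[name])
        else:
            skipped.append(name)
    return skipped


# Stand-in namespace where only result1 has been computed so far.
shown: List[Any] = []
fake_namespace = {"result1": {"question": "How can I deal with my child's tantrums?"}}
missing = display_available(["result1", "result2", "result3"], fake_namespace, shown.append)
print(missing)  # ['result2', 'result3']
```

In the notebook, the call would look like `display_available(["result1", "result2", "result3"], globals(), display_sources)`.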

## 🎯 Interactive Query Cell

Use this cell to ask your own questions!

In [24]:
# 🎯 Ask your own question here!
# Simply change the question below and run this cell

your_question = "How do I help my child develop emotional intelligence?"

# Get the answer
result = query_rag_system(your_question)

# Display the answer
print("\n" + "="*80)
print("❓ YOUR QUESTION:")
print("="*80)
print(result['question'])
print("\n" + "="*80)
print("💡 ANSWER:")
print("="*80)
print(result['answer'])
print("\n" + "="*80)

# Display sources
display_sources(result)

🔍 Processing query: 'How do I help my child develop emotional intelligence?'
   Retrieving top 3 relevant chunks...

🤖 Generating answer with DeepSeek R1...

❓ YOUR QUESTION:
How do I help my child develop emotional intelligence?

💡 ANSWER:

Based on the principles from "The Whole-Brain Child," fostering emotional intelligence in children involves nurturing their ability to recognize, understand, and manage emotions, both their own and others'. Here are practical, evidence-based strategies grounded in the source material:

### **1. Teach Emotional Literacy ("Name It to Tame It")**  
- **Why it matters:** Labeling emotions helps children integrate their emotional (right brain) and logical (left brain) functions.
- **How to do it:**  
  - When your child experiences strong emotions, gently narrate their feelings: *"You’re feeling frustrated because the puzzle isn’t working, aren’t you?"*  
  - Use "feel" instead of "am" to emphasize temporariness: *"You* ***feel*** *ang

---

## 📝 Summary & Next Steps

### What We Built:
✅ A complete RAG pipeline for parenting and child psychology questions  
✅ PDF document processing and text extraction  
✅ **FREE local embeddings** with Google's Embedding Gemma (300M)  
✅ ChromaDB vector storage for efficient retrieval  
✅ DeepSeek R1 powered answer generation (via OpenRouter - FREE tier)  
✅ **100% FREE** - No API costs for embeddings or LLM!

### How to Use:
1. **Add PDFs**: Place parenting guides and psychology texts in the `data/` folder
2. **Accept Embedding Gemma License**: Visit https://huggingface.co/google/embeddinggemma-300m (one-time, free)
3. **Set API Key**: Set your OpenRouter key for DeepSeek R1, e.g. in PowerShell: `$env:OPENROUTER_API_KEY='your-key'`
4. **Run All Cells**: Execute the notebook from top to bottom
5. **Ask Questions**: Use the interactive query cell to ask your own questions
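
For step 3, a minimal sketch of reading the key from the environment and failing early with a clear message is shown below. `load_api_key` is a hypothetical helper; the notebook's own setup cell may read the key differently.

```python
import os
from typing import Mapping, Optional


def load_api_key(var_name: str = "OPENROUTER_API_KEY",
                 environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key stored in the given environment variable, or raise if it is unset."""
    env = os.environ if environ is None else environ
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Set it in your shell before starting the notebook."
        )
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(load_api_key(environ={"OPENROUTER_API_KEY": "  your-key  "}))  # your-key
```

Failing at setup time is usually easier to diagnose than an authentication error surfacing later, during the first query.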

### Suggested Free Resources:
- **UNICEF Parenting Guides**: https://www.unicef.org/parenting
- **CDC Positive Parenting Tips**: https://www.cdc.gov/parents/
- **WHO Parenting Resources**: https://www.who.int/teams/mental-health-and-substance-use/parenting-for-lifelong-health
- **Project Gutenberg Psychology**: https://www.gutenberg.org/ebooks/search/?query=child+psychology

### Future Enhancements:
- 🌐 **Web Interface**: Build a Streamlit or Gradio UI
- 🔌 **API Endpoint**: Create a FastAPI REST API
- 📊 **Analytics**: Track common questions and answer quality
- 🗣️ **Multi-language**: Already supports 100+ languages!
- 📱 **Mobile App**: Develop a mobile interface
- 🤝 **Conversation History**: Add chat history and follow-up questions
- 📈 **User Feedback**: Implement rating system for answers

### Cost Considerations:
- **Embeddings** (Embedding Gemma): **$0.00** - Runs locally!
- **DeepSeek R1** (OpenRouter free tier): **$0.00** - FREE with rate limits
- **Total cost per session**: **$0.00** 🎉
- **Comparison**: 100% savings vs OpenAI GPT-4 + embeddings (~$0.25/session)

### System Requirements:
- **RAM**: 4GB+ recommended for Embedding Gemma
- **Storage**: ~1.5GB for model cache (one-time download)
- **GPU**: Optional (CPU works fine, GPU makes it faster)

### Get Your API Keys:
- **OpenRouter**: https://openrouter.ai/keys (Free tier - no credit card!)
- **Hugging Face** (optional): https://huggingface.co/settings/tokens (for faster downloads)

---

**Made with ❤️ for parents seeking evidence-based guidance**  
**Now 100% FREE to run! 🎉**