# Astronomy AI Tutor: RAG-Based Chatbot for Course Materials

*Project 1 - Jacob Parzych & Jamison [Partner Name]*

## Project Overview

This project implements a Retrieval-Augmented Generation (RAG) system that serves as an AI tutor for astronomy course materials. The system processes lecture notes from the course, creates a searchable knowledge base using sentence transformers, and provides intelligent responses to student questions by combining relevant course content with AI-generated explanations.

### Key Features:
- **Document Processing**: Automatically loads and chunks all lecture markdown files
- **Semantic Search**: Uses sentence transformers to find relevant content
- **AI Integration**: Uses Anthropic's Claude API to generate responses
- **Interactive Queries**: Simple `tutor.ask(question)` interface for student questions
- **Object-Oriented Design**: Clean, modular architecture with multiple classes

### Technical Implementation:
The system uses three main classes:
1. **DocumentProcessor**: Handles loading and chunking of lecture files
2. **VectorStore**: Manages embeddings and similarity search
3. **AITutor**: Orchestrates the complete RAG pipeline

This demonstrates proficiency in Python programming concepts including OOP, file I/O, error handling, and integration with modern AI tools.

In [None]:
# Install required packages
!pip install sentence-transformers anthropic numpy matplotlib python-dotenv

In [None]:
# Import necessary libraries
import os
import glob
import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
import anthropic
from typing import List, Dict
from datetime import datetime
from collections import Counter
from dotenv import load_dotenv
import warnings

warnings.filterwarnings('ignore')

# Load environment variables from .env file
load_dotenv()

## Class 1: Document Processor

The `DocumentProcessor` class loads and processes the lecture markdown files. It splits each document at its `##` section headers so that every chunk covers one coherent topic, and it discards sections of 100 characters or fewer.
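
The section-based chunking can be tried on a small string before running it on real lecture files. The sketch below is a standalone version of the same idea; the `sample` text, the `min_chars` parameter name, and the `lecture01` label are made up for illustration.

```python
from typing import Dict, List


def chunk_by_sections(text: str, source: str, min_chars: int = 100) -> List[Dict]:
    """Split markdown on '## ' headers and drop very short sections."""
    chunks = []
    for i, section in enumerate(text.split("\n## ")):
        # str.split removes the delimiter, so restore the header marker
        chunk_text = section if i == 0 else "## " + section
        chunk_text = chunk_text.strip()
        if len(chunk_text) > min_chars:
            chunks.append({"text": chunk_text, "source_file": source, "chunk_id": i})
    return chunks


# Hypothetical lecture text: a short intro followed by two longer sections
sample = (
    "# Lecture 1\n\nShort intro.\n"
    "## Arrays\n" + "NumPy arrays store homogeneous numeric data. " * 4 + "\n"
    "## Plotting\n" + "Matplotlib draws figures from arrays. " * 4 + "\n"
)

sample_chunks = chunk_by_sections(sample, "lecture01")
for chunk in sample_chunks:
    # Print the chunk id and its first line (the section header)
    print(chunk["chunk_id"], chunk["text"].splitlines()[0])
```

The short introduction falls under the length cutoff and is dropped, so only the two `##` sections are kept. Note that `chunk_id` is the section's position in the file, not a running count of kept chunks.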

In [None]:
class DocumentProcessor:
    """
    A simplified class to process lecture documents for RAG implementation.
    """
    
    def __init__(self, lecture_directory: str):
        """Initialize the DocumentProcessor."""
        self.lecture_directory = lecture_directory
        self.chunks = []
        
    def load_and_process_documents(self) -> None:
        """Load all markdown files and process them into chunks."""
        try:
            # Find all markdown files (sorted so chunk order is reproducible)
            pattern = os.path.join(self.lecture_directory, "*.md")
            markdown_files = sorted(glob.glob(pattern))
            
            if not markdown_files:
                raise FileNotFoundError(f"No markdown files found in {self.lecture_directory}")
            
            print(f"Found {len(markdown_files)} lecture files")
            
            total_chunks = 0
            for file_path in markdown_files:
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                    
                # Strip only the file extension, not every '.md' in the name
                filename = os.path.splitext(os.path.basename(file_path))[0]
                doc_chunks = self.chunk_by_sections(content, filename)
                self.chunks.extend(doc_chunks)
                total_chunks += len(doc_chunks)
                print(f"  - {filename}: {len(doc_chunks)} chunks")
            
            print(f"Total chunks created: {total_chunks}")
                    
        except Exception as e:
            print(f"Error processing documents: {e}")
            raise
    
    def chunk_by_sections(self, text: str, source_filename: str) -> List[Dict]:
        """Split document into chunks based on ## section headers."""
        sections = text.split('\n## ')
        chunks = []
        
        for i, section in enumerate(sections):
            # Add back the '## ' that was removed during split
            if i == 0:
                chunk_text = section
            else:
                chunk_text = '## ' + section
            
            # Only keep chunks with substantial content
            if len(chunk_text.strip()) > 100:
                chunks.append({
                    'text': chunk_text.strip(),
                    'source_file': source_filename,
                    'chunk_id': i
                })
        
        return chunks
    
    def get_chunks(self) -> List[Dict]:
        """Return the processed chunks."""
        return self.chunks

## Class 2: Vector Store

The `VectorStore` class manages the embedding and retrieval system. It uses sentence transformers to create embeddings and implements cosine similarity search to find relevant content for user queries.
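
The ranking step can be illustrated without loading a model. The sketch below uses small hand-written vectors as stand-ins for sentence-transformer embeddings; `cosine_similarity`, `top_k_matches`, and the toy vectors are illustrative names and values, not part of the project code. It ranks vectors by cosine similarity and applies the same 0.1 minimum-score cutoff that `VectorStore.search` uses.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(
    query: List[float],
    vectors: List[List[float]],
    k: int = 3,
    min_score: float = 0.1,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k best matches above min_score."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, score) for i, score in scored[:k] if score > min_score]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions
toy_vectors = [
    [1.0, 0.0, 0.0],  # same direction as the query
    [0.8, 0.6, 0.0],  # partially aligned with the query
    [0.0, 0.0, 1.0],  # orthogonal to the query
]
toy_query = [1.0, 0.0, 0.0]

matches = top_k_matches(toy_query, toy_vectors, k=3)
for index, score in matches:
    print(f"vector {index}: similarity {score:.3f}")
```

The orthogonal vector scores 0 and is filtered out by the threshold, which is why a search can return fewer than `top_k` results.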

In [None]:
class VectorStore:
    """
    A simplified class to manage embeddings and similarity search.
    """
    
    def __init__(self):
        """Initialize the VectorStore."""
        print("Loading sentence transformer model...")
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.chunks = []
        self.embeddings = None
        
    def add_chunks(self, chunks: List[Dict]) -> None:
        """Add chunks and create embeddings."""
        print(f"Creating embeddings for {len(chunks)} chunks...")
        
        self.chunks = chunks
        chunk_texts = [chunk['text'] for chunk in chunks]
        
        self.embeddings = self.model.encode(chunk_texts, convert_to_tensor=True)
        print(f"Created embeddings with dimension: {self.embeddings.shape[1]}")
    
    def search(self, query: str, top_k: int = 3) -> List[Dict]:
        """Search for relevant chunks based on similarity."""
        if self.embeddings is None:
            return []
        
        # Get query embedding
        query_embedding = self.model.encode([query], convert_to_tensor=True)
        
        # Calculate similarities (using built-in utility function)
        from sentence_transformers.util import cos_sim
        similarities = cos_sim(query_embedding, self.embeddings)[0]
        
        # Get top results
        top_indices = similarities.argsort(descending=True)[:top_k]
        
        results = []
        for idx in top_indices:
            idx = int(idx)  # convert the 0-dim tensor to a plain list index
            score = float(similarities[idx])
            if score > 0.1:  # minimum similarity threshold
                result = self.chunks[idx].copy()
                result['similarity_score'] = score
                results.append(result)
        
        return results

## Class 3: AI Tutor

The `AITutor` class orchestrates the complete RAG pipeline. It combines the document processing, vector search, and AI generation to create an intelligent tutoring system that can answer questions about the course materials.
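
The "augmentation" step of the pipeline is plain string assembly: the retrieved chunks are truncated, joined, and placed in the prompt next to the question. A minimal sketch of that step is shown here as a standalone function; `build_prompt` and `max_chars` are hypothetical names, and the example chunks are invented.

```python
from typing import Dict, List


def build_prompt(question: str, chunks: List[Dict], max_chars: int = 400) -> str:
    """Combine a question with truncated chunk text into a single prompt."""
    # Truncate each chunk so the prompt stays short
    context = "\n\n".join(chunk["text"][:max_chars] for chunk in chunks)
    return (
        "Answer this student's question about astronomy programming "
        "based on the course materials:\n\n"
        f"Question: {question}\n\n"
        f"Course Materials:\n{context}\n\n"
        "Provide a clear, helpful answer."
    )


# Invented chunks standing in for search results
example_chunks = [
    {"text": "## Arrays\nNumPy arrays store numeric data in a fixed-type grid."},
    {"text": "## Plotting\nMatplotlib draws figures from array data."},
]
print(build_prompt("What is NumPy used for?", example_chunks))
```

Keeping prompt construction in a separate function makes it possible to inspect exactly what the model receives without making an API call.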

In [None]:
class AITutor:
    """
    A simple RAG-based AI tutor using Anthropic's Claude.
    """
    
    def __init__(self, lecture_directory: str):
        """Initialize the AI Tutor."""
        self.document_processor = DocumentProcessor(lecture_directory)
        self.vector_store = VectorStore()
        
        # Setup Anthropic client using API key from .env file
        api_key = os.getenv('ANTHROPIC_API_KEY')
        if api_key:
            try:
                self.client = anthropic.Anthropic(api_key=api_key)
                print("✅ Anthropic API key loaded successfully")
            except Exception as e:
                print(f"Warning: Anthropic API setup failed - {e}")
                self.client = None
        else:
            print("Warning: ANTHROPIC_API_KEY not found in .env file")
            self.client = None
    
    def initialize(self):
        """Load documents and create embeddings."""
        print("Initializing AI Tutor...")
        
        # Process documents
        self.document_processor.load_and_process_documents()
        chunks = self.document_processor.get_chunks()
        
        # Create embeddings
        self.vector_store.add_chunks(chunks)
        
        print("✅ AI Tutor Ready!")
    
    def ask(self, question: str) -> str:
        """Ask a question and get an AI response."""
        # Find relevant content using top 3 chunks
        relevant_chunks = self.vector_store.search(question, top_k=3)
        
        if not relevant_chunks:
            return "Sorry, I couldn't find relevant information for your question."
        
        if not self.client:
            return "AI responses not available. Please set ANTHROPIC_API_KEY environment variable."
        
        # Prepare context from retrieved chunks
        context = "\n\n".join([chunk['text'][:400] for chunk in relevant_chunks])
        
        # Simple prompt for Claude
        prompt = f"""Answer this student's question about astronomy programming based on the course materials:

    Question: {question}

    Course Materials:
    {context}

    Provide a clear, helpful answer."""
        
        try:
            response = self.client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=300,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.content[0].text
        except Exception as e:
            return f"Error getting AI response: {e}"

## System Initialization

Now let's initialize our AI tutor system. This will load all the lecture files, process them into chunks, and create embeddings for semantic search.

In [None]:
# Initialize the AI Tutor
# Note: Make sure you have ANTHROPIC_API_KEY in your .env file

lecture_directory = "Lecture"

# Check if initialization is already complete
def is_tutor_ready():
    return ('tutor' in globals() and 
            hasattr(tutor, 'vector_store') and 
            tutor.vector_store.embeddings is not None and
            len(tutor.document_processor.get_chunks()) > 0)

if is_tutor_ready():
    print("✅ AI Tutor already initialized and ready to use!")
    print(f"   - Total chunks: {len(tutor.document_processor.get_chunks())}")
    print(f"   - Embeddings shape: {tutor.vector_store.embeddings.shape}")
    print("   - Use 'tutor.ask(question)' to interact with the system")
else:
    print("🚀 Creating new AI Tutor instance...")
    
    # Delete any existing partial tutor to prevent conflicts
    if 'tutor' in globals():
        del tutor
    
    # Create and initialize fresh instance
    tutor = AITutor(lecture_directory)
    tutor.initialize()
    
    print("🎉 AI Tutor initialization complete!")

In [None]:
# Display system statistics
chunks = tutor.document_processor.get_chunks()
source_files = sorted(set(chunk['source_file'] for chunk in chunks))

print("=== System Statistics ===")
print(f"Total chunks: {len(chunks)}")
print("Source files loaded:")
for file in source_files:
    print(f"  - {file}")

## Data Analysis and Visualization

Let's analyze the document processing results and visualize some key metrics about our knowledge base.

In [None]:
# Analyze chunk distribution
chunks = tutor.document_processor.get_chunks()
chunk_lengths = [len(chunk['text']) for chunk in chunks]  # Calculate length from text
source_files = [chunk['source_file'] for chunk in chunks]

# Create visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# 1. Histogram of chunk lengths
ax1.hist(chunk_lengths, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
ax1.set_xlabel('Chunk Length (characters)')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Chunk Lengths')
ax1.grid(True, alpha=0.3)

# 2. Box plot of chunk lengths
ax2.boxplot(chunk_lengths)
ax2.set_ylabel('Chunk Length (characters)')
ax2.set_title('Chunk Length Distribution (Box Plot)')
ax2.grid(True, alpha=0.3)

# 3. Chunks per source file
file_counts = Counter(source_files)
files = list(file_counts.keys())
counts = list(file_counts.values())

ax3.bar(range(len(files)), counts, color='lightcoral', alpha=0.7)
ax3.set_xlabel('Source Files')
ax3.set_ylabel('Number of Chunks')
ax3.set_title('Chunks per Source File')
ax3.set_xticks(range(len(files)))
ax3.set_xticklabels([f.replace('Lecture', 'L') for f in files], rotation=45, ha='right')
ax3.grid(True, alpha=0.3)

# 4. Cumulative chunk length by file
file_total_lengths = {}
for chunk in chunks:
    file = chunk['source_file']
    if file not in file_total_lengths:
        file_total_lengths[file] = 0
    file_total_lengths[file] += len(chunk['text'])  # Use text length

files = list(file_total_lengths.keys())
lengths = list(file_total_lengths.values())

ax4.bar(range(len(files)), lengths, color='lightgreen', alpha=0.7)
ax4.set_xlabel('Source Files')
ax4.set_ylabel('Total Characters')
ax4.set_title('Total Content Length per File')
ax4.set_xticks(range(len(files)))
ax4.set_xticklabels([f.replace('Lecture', 'L') for f in files], rotation=45, ha='right')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Knowledge Base Analysis', fontsize=16, y=1.02)
plt.show()

# Print detailed statistics
print("=== Detailed Chunk Analysis ===")
print(f"Total chunks: {len(chunks)}")
print(f"Mean chunk length: {np.mean(chunk_lengths):.1f} characters")
print(f"Median chunk length: {np.median(chunk_lengths):.1f} characters")
print(f"Standard deviation: {np.std(chunk_lengths):.1f} characters")
print(f"Min chunk length: {min(chunk_lengths)} characters")
print(f"Max chunk length: {max(chunk_lengths)} characters")

## Testing the AI Tutor

Let's test our AI tutor with various types of questions to demonstrate its capabilities. These examples will show real AI responses using your Anthropic API key.

In [None]:
# Test the AI Tutor with sample questions

test_questions = [
    "What is the purpose of this course?",
    "How do I use Python for astronomy?",
    "What is NumPy used for?",
    "How do I create plots with matplotlib?"
]

print("=== Testing AI Tutor with Real Responses ===")
print("Using ANTHROPIC_API_KEY for full AI responses")
print()

# Test with actual AI responses
for i, question in enumerate(test_questions, 1):
    print(f"Question {i}: {question}")
    
    # Show what context gets retrieved
    relevant_chunks = tutor.vector_store.search(question, top_k=3)
    
    print(f"Found {len(relevant_chunks)} relevant chunks:")
    for j, chunk in enumerate(relevant_chunks, 1):
        print(f"  {j}. {chunk['source_file']} (similarity: {chunk['similarity_score']:.3f})")
    
    # Get actual AI response
    print("\n🤖 AI Response:")
    response = tutor.ask(question)
    print(response)
    
    print("\n" + "=" * 80)
    print()

In [None]:
# Live Interactive Demo with Real AI Responses
def demo_chat():
    """Demo of actual AI tutor interactions."""
    
    sample_questions = [
        "What programming concepts will I learn?",
        "How do I handle data in astronomy?", 
        "What visualization tools are covered?"
    ]
    
    print("=== Live AI Tutor Demo ===")
    print("Real conversations with the AI tutor:")
    print()
    
    for i, q in enumerate(sample_questions, 1):
        print(f"💬 Student Question {i}: {q}")
        
        # Show what gets found
        results = tutor.vector_store.search(q, top_k=1)
        if results:
            print(f"📚 Found relevant content from: {results[0]['source_file']} (similarity: {results[0]['similarity_score']:.3f})")
        
        # Get real AI response  
        print("\n🤖 AI Tutor Response:")
        response = tutor.ask(q)
        print(response)
        print("\n" + "-" * 60)
        print()
    
    print("✅ Demo complete! The AI tutor is ready for your questions.")
    print("Usage: response = tutor.ask('Your question here')")

# Flush any buffered output, then run the demo
import sys
sys.stdout.flush()
demo_chat()

In [None]:
# Additional Test Questions - Try Your Own!

print("=== Try Some Additional Questions ===")
print("Here are some more examples you can test:")
print()

additional_questions = [
    "How do I work with variables in Python?",
    "What is object-oriented programming?", 
    "How do I read files in Python?",
    "What are the main data visualization libraries?",
    "How do I use Git for version control?"
]

# Test one example
example_question = additional_questions[0]
print(f"Example: {example_question}")
print("\n🤖 AI Response:")
response = tutor.ask(example_question)
print(response)

print("\n\nOther questions you can try:")
for i, q in enumerate(additional_questions[1:], 2):
    print(f"{i}. {q}")

print("\nTo test any question, run: tutor.ask('your question')")

## Error Handling and Edge Cases

Let's demonstrate the robust error handling built into our system and test various edge cases.
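
One edge case worth guarding against before retrieval is a blank or very long question. The sketch below shows one possible guard; `safe_ask`, `EchoTutor`, and the 500-character limit are assumptions made for illustration and are not part of the classes above. `EchoTutor` stands in for `AITutor` so the example runs without a model or API key.

```python
class EchoTutor:
    """Stand-in for AITutor: echoes the question instead of calling an API."""

    def ask(self, question: str) -> str:
        return f"(answer to: {question})"


def safe_ask(tutor, question: str, max_chars: int = 500) -> str:
    """Validate a question before passing it to tutor.ask()."""
    cleaned = question.strip()
    if not cleaned:
        # Skip the embedding search entirely for blank input
        return "Please enter a question."
    if len(cleaned) > max_chars:
        # Truncate very long input rather than rejecting it
        cleaned = cleaned[:max_chars]
    return tutor.ask(cleaned)


stub_tutor = EchoTutor()
print(safe_ask(stub_tutor, "   "))
print(safe_ask(stub_tutor, "  What is NumPy used for?  "))
```

With the real system, the same wrapper could be called as `safe_ask(tutor, question)`, since it relies only on the `ask` method.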

In [None]:
# Test basic error handling

print("=== Error Handling Tests ===")

# Test 1: Query with no relevant results
print("Test 1: Irrelevant query")
results = tutor.vector_store.search("quantum physics of unicorns", top_k=3)
print(f"Results for irrelevant query: {len(results)}")

# Test 2: Empty query
print("\nTest 2: Empty query")
results = tutor.vector_store.search("", top_k=3)
print(f"Results for empty query: {len(results)}")

# Test 3: System status
print("\nTest 3: System status")
print(f"Has embeddings: {tutor.vector_store.embeddings is not None}")
print(f"Total chunks: {len(tutor.document_processor.get_chunks())}")
print(f"API client available: {tutor.client is not None}")

# Test 4: Ask method with no relevant results
print("\nTest 4: Ask method with irrelevant query")
response = tutor.ask("quantum physics of unicorns")
print(f"Response: {response}")

print("\n=== Error Handling Complete ===")

## Usage Instructions and Examples

Here's how to use the AI tutor system in practice, with examples of common use cases for astronomy students.

# Simplified AI Tutor Usage

## Setup

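1. **Install packages:** `pip install sentence-transformers anthropic numpy matplotlib python-dotenv`
2. **Set API key:** `export ANTHROPIC_API_KEY='your-key-here'`, or add the line `ANTHROPIC_API_KEY=your-key-here` to a `.env` file in the notebook's directory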
3. **Put lecture .md files in 'Lecture' directory**
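
A quick way to confirm that the key is visible to Python, without printing the key itself, is sketched below. The helper name `api_key_status` is hypothetical. If the key lives in a `.env` file, `load_dotenv()` needs to have been called first, as in the imports cell.

```python
import os


def api_key_status(var_name: str = "ANTHROPIC_API_KEY") -> str:
    """Report whether an environment variable is set, without revealing it."""
    value = os.getenv(var_name, "")
    return "set" if value.strip() else "missing"


print(f"ANTHROPIC_API_KEY is {api_key_status()}")
```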

## Basic Usage

```python
# Initialize
tutor = AITutor('Lecture')
tutor.initialize()

# Ask questions
response = tutor.ask('How do I use NumPy?')
print(response)
```

## Features

- **Automatic Processing**: Processes all lecture markdown files automatically
- **Section-based Chunking**: Splits content by section headers (##)
- **Semantic Search**: Uses sentence transformers for intelligent content retrieval
- **AI-Powered**: Uses Anthropic's Claude to generate natural-language responses
- **Clean Architecture**: Simple 3-class object-oriented design

## Example Questions

- "What is this course about?"
- "How do I create Python functions?" 
- "What is object-oriented programming?"
- "How do I make plots with matplotlib?"
- "What are NumPy arrays used for?"

**Ready to help with astronomy programming!** 🚀

## Project Summary and Reflection

### What We Built
This simplified RAG-based AI tutor demonstrates the core concepts of retrieval-augmented generation in a clean, understandable way. The system consists of three main components:

1. **DocumentProcessor**: Loads markdown lecture files and splits them into semantic chunks
2. **VectorStore**: Creates embeddings using sentence transformers and performs similarity search  
3. **AITutor**: Orchestrates the RAG pipeline using Anthropic's Claude for intelligent responses

### Key Programming Concepts Demonstrated
- **Object-Oriented Programming**: Three well-designed classes with clear responsibilities
- **File I/O**: Reading and processing multiple markdown files
- **Error Handling**: Try/except blocks throughout for robust operation
- **Data Structures**: Lists and dictionaries for chunks, tensors for embeddings, and NumPy for summary statistics
- **External APIs**: Integration with Anthropic's Claude API
- **Modern AI Tools**: Sentence transformers for semantic similarity

### Simplifications Made
Compared to complex RAG systems, we simplified by:
- Using basic section-based chunking instead of advanced text splitting
- Implementing straightforward cosine similarity search
- Using a single embedding model without fine-tuning
- Keeping prompt engineering for Claude to a single, short template
- Answering each question independently, with no conversation history

### Real-World Applications
This system could be extended for:
- Course Q&A systems for any subject area
- Internal company knowledge bases
- Research paper summarization tools
- Technical documentation assistants

The core RAG pattern demonstrated here scales to much larger document collections and more sophisticated retrieval strategies.