# 🤖 Course Notes Chatbot Demonstration

## Assignment 1: Build a Chatbot that Answers from Your Course Notes

This notebook demonstrates how to build a complete RAG (Retrieval-Augmented Generation) chatbot using:
- **LlamaIndex** for document processing and indexing
- **Faiss** for fast vector similarity search  
- **Hugging Face Transformers** for embeddings and language models

The chatbot will be able to answer questions based on your course notes by finding relevant content and generating contextual responses.
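
As a rough illustration of the retrieve-then-generate flow described above, here is a minimal sketch that uses only the standard library. Word counts stand in for real embeddings, and the names `bag_of_words`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of LlamaIndex or Faiss. The notebook itself uses a neural embedding model and a language model in place of these toys.

```python
import math
import re
from collections import Counter
from typing import List

def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, chunks: List[str], top_k: int = 1) -> List[str]:
    """Rank chunks by similarity to the question and keep the best top_k."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, context: List[str]) -> str:
    """Augment the question with retrieved context before generation."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"

notes = [
    "Overfitting: a model performs well on training data but poorly on new data.",
    "Gradient descent is an optimization algorithm used to minimize loss functions.",
]
question = "What is gradient descent?"
prompt = build_prompt(question, retrieve(question, notes))
print(prompt)
```

The prompt that reaches the language model contains only the chunk judged most relevant, which is what lets the answer draw on the notes rather than on the model's general knowledge alone.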

### 📋 Assignment Objectives
✅ Understand chatbot concepts and functionality  
✅ Learn to use LlamaIndex, Faiss, and language models  
✅ Build a working chatbot that answers from course notes  
✅ Test and evaluate the chatbot performance

---

## 1. Environment Setup and Library Installation

First, let's install all the required libraries for our chatbot. This includes LlamaIndex for document processing, Faiss for vector search, and Hugging Face transformers for embeddings and language models.

In [None]:
# Install required packages
# Run this cell if you haven't installed the requirements yet
import subprocess
import sys

def install_packages():
    """Install required packages for the chatbot"""
    packages = [
        "llama-index==0.10.62",
        "llama-index-llms-huggingface==0.2.4", 
        "llama-index-embeddings-huggingface==0.2.2",
        "llama-index-vector-stores-faiss==0.1.2",
        "faiss-cpu==1.7.4",
        "transformers==4.35.2",
        "torch",
        "sentence-transformers==2.2.2",
        "pypdf==3.17.0",
        "python-docx==1.1.0",
        "numpy",
        "pandas",
        "tqdm"
    ]
    
    for package in packages:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            print(f"✅ Installed {package}")
        except subprocess.CalledProcessError as e:
            print(f"❌ Failed to install {package}: {e}")

# Uncomment the line below to install packages
# install_packages()

print("📦 If you see import errors in the next cells, uncomment the line above and run this cell")

## 2. Import Required Libraries

Now let's import all the necessary libraries for building our chatbot.

In [None]:
# Core Python libraries
import os
import logging
from pathlib import Path
from typing import List, Dict, Optional
import warnings

# Data handling
import pandas as pd
import numpy as np
from tqdm import tqdm

# LlamaIndex core components
from llama_index.core import VectorStoreIndex, Document, ServiceContext, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PDFReader, DocxReader

# Hugging Face components
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# Vector store
from llama_index.vector_stores.faiss import FaissVectorStore
import faiss

# PyTorch
import torch

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')
logging.basicConfig(level=logging.INFO)

print("✅ All libraries imported successfully!")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"💻 CUDA available: {torch.cuda.is_available()}")
print(f"🧠 Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

## 3. Load and Prepare Course Notes

Let's create a function to load course notes from various file formats. For this demo, we'll use the sample notes provided, but you can replace these with your actual course materials.

In [None]:
# Set up paths
course_notes_dir = Path("../course_notes")
sample_notes_path = course_notes_dir / "sample_ml_notes.txt"

# Create directory if it doesn't exist
course_notes_dir.mkdir(exist_ok=True)

# Create sample course notes if they don't exist
if not sample_notes_path.exists():
    sample_content = """
# Machine Learning Fundamentals

## Introduction to Machine Learning
Machine Learning is a subset of artificial intelligence that focuses on algorithms and statistical models that enable computers to improve their performance on a task through experience.

### Types of Machine Learning
1. **Supervised Learning**: Learning with labeled data
   - Classification: Predicting categories
   - Regression: Predicting continuous values

2. **Unsupervised Learning**: Learning without labeled data
   - Clustering: Grouping similar data points
   - Dimensionality Reduction: Reducing feature space

3. **Reinforcement Learning**: Learning through interaction with environment

## Key Concepts

### Data Preprocessing
- **Data Cleaning**: Removing or correcting corrupted data
- **Feature Scaling**: Normalizing data ranges
- **Feature Selection**: Choosing relevant features

### Model Evaluation
- **Accuracy**: Percentage of correct predictions
- **Precision**: True positives / (True positives + False positives)
- **Recall**: True positives / (True positives + False negatives)
- **F1-Score**: Harmonic mean of precision and recall

### Popular Algorithms
1. **Linear Regression**: Simple predictive modeling
2. **Logistic Regression**: Classification algorithm
3. **Decision Trees**: Tree-like model for decisions
4. **Random Forest**: Ensemble of decision trees
5. **Support Vector Machines**: Finding optimal decision boundaries
6. **Neural Networks**: Inspired by biological neural networks

## Important Definitions

**Overfitting**: When a model performs well on training data but poorly on new data.

**Underfitting**: When a model is too simple to capture the underlying pattern.

**Bias-Variance Tradeoff**: Balance between model's ability to minimize bias and variance.

**Gradient Descent**: Optimization algorithm used to minimize loss functions.

## Practical Applications
- Image Recognition
- Natural Language Processing  
- Recommendation Systems
- Fraud Detection
- Medical Diagnosis
"""
    
    with open(sample_notes_path, 'w', encoding='utf-8') as f:
        f.write(sample_content)
    print(f"✅ Created sample course notes at {sample_notes_path}")

# Load documents function
def load_documents_from_directory(directory: Path) -> List[Document]:
    """Load all supported documents from a directory"""
    documents = []
    
    # Supported file types
    loaders = {
        '.txt': lambda p: [Document(text=p.read_text(encoding='utf-8'), 
                                   metadata={"filename": p.name, "file_type": "text"})],
        '.pdf': lambda p: PDFReader().load_data(str(p)),
        '.docx': lambda p: DocxReader().load_data(str(p))
    }
    
    for file_path in directory.iterdir():
        if file_path.is_file() and file_path.suffix.lower() in loaders:
            try:
                print(f"📄 Loading {file_path.name}...")
                docs = loaders[file_path.suffix.lower()](file_path)
                for doc in docs:
                    if not hasattr(doc, 'metadata'):
                        doc.metadata = {}
                    doc.metadata.update({
                        "filename": file_path.name,
                        "file_path": str(file_path)
                    })
                documents.extend(docs)
                print(f"✅ Loaded {len(docs)} chunks from {file_path.name}")
            except Exception as e:
                print(f"❌ Error loading {file_path.name}: {e}")
    
    return documents

# Load all course notes
print("📚 Loading course notes...")
documents = load_documents_from_directory(course_notes_dir)
print(f"✅ Loaded {len(documents)} document chunks total")

# Display first document preview
if documents:
    print(f"\n📖 Preview of first document:")
    print(f"Filename: {documents[0].metadata.get('filename', 'Unknown')}")
    print(f"Content preview: {documents[0].text[:300]}...")
else:
    print("⚠️ No documents loaded!")

## 4. Initialize LlamaIndex with Course Documents

Now we'll process our documents using LlamaIndex's text splitter to create optimal chunks for retrieval.

In [None]:
# Configure document chunking
CHUNK_SIZE = 512
CHUNK_OVERLAP = 50

# Create text splitter
text_splitter = SentenceSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

print(f"🔧 Configured text splitter:")
print(f"   - Chunk size: {CHUNK_SIZE} tokens")
print(f"   - Chunk overlap: {CHUNK_OVERLAP} tokens")

# Split documents into chunks
print(f"\n📝 Processing {len(documents)} documents into chunks...")
nodes = text_splitter.get_nodes_from_documents(documents, show_progress=True)

print(f"✅ Created {len(nodes)} text chunks")

# Display chunk information
if nodes:
    print(f"\n📊 Chunk Statistics:")
    chunk_lengths = [len(node.text) for node in nodes]
    print(f"   - Average chunk length: {np.mean(chunk_lengths):.0f} characters")
    print(f"   - Min chunk length: {min(chunk_lengths)} characters")
    print(f"   - Max chunk length: {max(chunk_lengths)} characters")
    
    print(f"\n📖 Sample chunk:")
    print(f"Text: {nodes[0].text[:200]}...")
    print(f"Metadata: {nodes[0].metadata}")
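
To make the roles of chunk size and overlap concrete, the sketch below splits a word list with a plain sliding window. This is a simplification: `SentenceSplitter` counts tokens and tries to respect sentence boundaries, whereas the hypothetical `sliding_window_chunks` helper here counts words and cuts wherever the window ends.

```python
from typing import List

def sliding_window_chunks(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split words into windows of chunk_size that share `overlap` words with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = sliding_window_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last word of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.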

## 5. Set Up Faiss Vector Store

We'll create a Faiss vector database to enable fast similarity search over our document embeddings.

In [None]:
# Set up embedding model
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
EMBEDDING_DIM = 384  # Dimension for all-MiniLM-L6-v2

print(f"🧠 Setting up embedding model: {EMBEDDING_MODEL}")

# Initialize embedding model
embed_model = HuggingFaceEmbedding(
    model_name=EMBEDDING_MODEL
)

print("✅ Embedding model loaded successfully!")

# Create Faiss index
print(f"\n🗄️ Creating Faiss vector store with dimension {EMBEDDING_DIM}")

# Create Faiss index (L2 distance)
faiss_index = faiss.IndexFlatL2(EMBEDDING_DIM)

print(f"✅ Faiss index created:")
print(f"   - Index type: L2 (Euclidean distance)")
print(f"   - Dimension: {EMBEDDING_DIM}")
print(f"   - Current size: {faiss_index.ntotal} vectors")

# Create LlamaIndex Faiss vector store
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

print("✅ Vector store and storage context ready!")
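
For intuition about what a flat L2 index does, the sketch below performs the same kind of exhaustive nearest-neighbour search in plain Python. Every stored vector is compared with the query and the `k` smallest squared Euclidean distances are returned. The helper names `squared_l2` and `flat_l2_search` are hypothetical, and the two-dimensional vectors are toy data. Faiss does the equivalent work in optimized native code over 384-dimensional embeddings.

```python
from typing import List, Sequence, Tuple

def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors: List[Sequence[float]], query: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Compare the query with every stored vector and return the k closest (index, distance) pairs."""
    scored = [(i, squared_l2(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means a closer match
    return scored[:k]

stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
results = flat_l2_search(stored, query=[2.0, 1.0], k=2)
print(results)
```

Because the search is exhaustive, results are exact. The cost grows linearly with the number of stored vectors, which is fine for a set of course notes.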

## 6. Configure Language Model

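Now we'll set up a language model for generating responses. This demo uses `microsoft/DialoGPT-medium`, a small conversational model that runs on modest hardware. It is not tuned for question answering, so expect rough answers; changing `LLM_MODEL` to an instruction-tuned model should improve quality.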

In [None]:
# Configure language model
LLM_MODEL = "microsoft/DialoGPT-medium"

print(f"🤖 Setting up language model: {LLM_MODEL}")

try:
    # Initialize language model
    llm = HuggingFaceLLM(
        model_name=LLM_MODEL,
        tokenizer_name=LLM_MODEL,
        context_window=1024,
        max_new_tokens=256,
        model_kwargs={"torch_dtype": torch.float16 if torch.cuda.is_available() else torch.float32},
        tokenizer_kwargs={},
        device_map="auto" if torch.cuda.is_available() else "cpu"
    )
    print("✅ Language model loaded successfully!")
    
except Exception as e:
    print(f"⚠️ Error loading language model: {e}")
    print("💡 Continuing with llm=None: LlamaIndex substitutes a mock LLM, so answers will be placeholders")
    llm = None

# Create service context
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=text_splitter
)

print("✅ Service context created with:")
print(f"   - LLM: {LLM_MODEL if llm else 'None (mock LLM)'}")
print(f"   - Embedding: {EMBEDDING_MODEL}")
print(f"   - Chunk size: {CHUNK_SIZE}")

## 7. Build the Query Engine

This is where the magic happens! We'll create a vector index from our documents and set up a query engine that can retrieve relevant information and generate answers.

In [None]:
# Create vector index from documents
print("🏗️ Building vector index from documents...")
print("This may take a few minutes as we embed all text chunks...")

try:
    # Create the vector index
    index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context,
        storage_context=storage_context,
        show_progress=True
    )
    
    print("✅ Vector index created successfully!")
    print(f"📊 Index contains {faiss_index.ntotal} vectors")
    
except Exception as e:
    print(f"❌ Error creating index: {e}")
    # Create a simpler index without custom components
    print("💡 Trying with default settings...")
    index = VectorStoreIndex.from_documents(documents, show_progress=True)
    print("✅ Fallback index created!")

# Create query engine
TOP_K = 5  # Number of most relevant chunks to retrieve

query_engine = index.as_query_engine(
    similarity_top_k=TOP_K,
    response_mode="compact"  # Combines retrieved chunks efficiently
)

print(f"✅ Query engine ready!")
print(f"🔍 Configuration:")
print(f"   - Retrieval: Top {TOP_K} most relevant chunks")
print(f"   - Response mode: Compact (efficient combination)")
print(f"   - Vector store: Faiss with {faiss_index.ntotal} vectors")
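
The idea behind the compact response mode is to concatenate retrieved chunks so that each LLM call carries as much context as fits, reducing the number of calls. The sketch below is a simplified illustration of that packing step, not LlamaIndex's implementation. The hypothetical `pack_chunks` helper budgets by word count rather than tokens and does not split a chunk that is larger than the budget.

```python
from typing import List

def pack_chunks(chunks: List[str], budget_words: int) -> List[str]:
    """Greedily merge consecutive chunks so each merged context stays within budget_words."""
    packed: List[str] = []
    current: List[str] = []
    used = 0
    for chunk in chunks:
        size = len(chunk.split())
        if current and used + size > budget_words:
            packed.append("\n\n".join(current))  # close the current context and start a new one
            current, used = [], 0
        current.append(chunk)
        used += size
    if current:
        packed.append("\n\n".join(current))
    return packed

retrieved = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota", "kappa"]
contexts = pack_chunks(retrieved, budget_words=5)
for i, context in enumerate(contexts, 1):
    print(f"LLM call {i}: {context!r}")
```

Here four retrieved chunks collapse into two contexts, so the model would be called twice instead of four times.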

## 8. Create Chatbot Interface

Now let's create an interactive chatbot function that can answer questions about your course notes!

In [None]:
def ask_chatbot(question: str, show_sources: bool = True) -> None:
    """
    Ask the chatbot a question and display the answer with optional source information
    
    Args:
        question: The question to ask
        show_sources: Whether to show source information
    """
    print(f"🤔 Question: {question}")
    print("🤖 Thinking...")
    
    try:
        # Get response from query engine
        response = query_engine.query(question)
        
        # Display answer
        print(f"\n💡 Answer:")
        print(f"{response}")
        
        # Show source information if requested
        if show_sources and hasattr(response, 'source_nodes'):
            print(f"\n📚 Sources ({len(response.source_nodes)} chunks used):")
            for i, node in enumerate(response.source_nodes[:3], 1):  # Show top 3 sources
                filename = node.metadata.get('filename', 'Unknown')
                score = getattr(node, 'score', 'N/A')
                # With IndexFlatL2 the score is a distance, so lower means a closer match
                print(f"   {i}. {filename} (L2 distance: {score})")
                print(f"      Preview: {node.text[:100]}...")
        
        print("\n" + "="*80)
        
    except Exception as e:
        print(f"❌ Error: {e}")

# Create a simple chatbot class for easier use
class CourseNotesChatbot:
    """Simple chatbot class for course notes Q&A"""
    
    def __init__(self, query_engine):
        self.query_engine = query_engine
        self.conversation_history = []
    
    def ask(self, question: str) -> str:
        """Ask a question and return the answer"""
        try:
            response = self.query_engine.query(question)
            answer = str(response)
            
            # Store in conversation history
            self.conversation_history.append({
                'question': question,
                'answer': answer,
                'timestamp': pd.Timestamp.now()
            })
            
            return answer
        except Exception as e:
            return f"Sorry, I encountered an error: {e}"
    
    def get_conversation_history(self) -> pd.DataFrame:
        """Get conversation history as DataFrame"""
        return pd.DataFrame(self.conversation_history)

# Initialize chatbot
chatbot = CourseNotesChatbot(query_engine)

print("🎉 Chatbot is ready!")
print("💬 You can now ask questions using: ask_chatbot('your question here')")
print("🔧 Or use the chatbot object: chatbot.ask('your question')")

## 9. Test the Chatbot with Sample Questions

Let's test our chatbot with various questions to see how well it can answer from the course notes!

In [None]:
# Test Questions - demonstrating different types of queries
test_questions = [
    "What is machine learning?",
    "What are the types of machine learning?", 
    "Explain overfitting and underfitting",
    "What is the bias-variance tradeoff?",
    "List the popular machine learning algorithms mentioned",
    "What are some practical applications of machine learning?",
    "Define precision and recall",
    "What is gradient descent?"
]

print("🧪 Testing chatbot with sample questions...")
print("="*80)

# Test each question
for i, question in enumerate(test_questions, 1):
    print(f"\n📝 Test {i}/{len(test_questions)}")
    ask_chatbot(question, show_sources=False)
    
print("\n🎯 All tests completed!")
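
Reading the answers by eye is a reasonable first check, but a small scoring helper makes the evaluation repeatable. The sketch below is one possible approach, assuming you write down a few keywords each answer should contain. The functions `keyword_coverage` and `evaluate` are hypothetical, and `fake_ask` stands in for `chatbot.ask` so the example runs on its own.

```python
from typing import Callable, Dict, List

def keyword_coverage(answer: str, expected_keywords: List[str]) -> float:
    """Fraction of expected keywords that appear in the answer (case-insensitive substring match)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords)

def evaluate(ask: Callable[[str], str], cases: Dict[str, List[str]]) -> Dict[str, float]:
    """Ask every question and score each answer against its expected keywords."""
    return {question: keyword_coverage(ask(question), keywords)
            for question, keywords in cases.items()}

def fake_ask(question: str) -> str:
    """Stand-in for chatbot.ask so this sketch runs without the models loaded."""
    return "Gradient descent is an optimization algorithm that minimizes a loss function."

cases = {"What is gradient descent?": ["optimization", "loss", "learning rate"]}
scores = evaluate(fake_ask, cases)
print(scores)
```

In the notebook you could pass `chatbot.ask` instead of `fake_ask`. Keyword coverage is a crude proxy for correctness, so low scores are best treated as a prompt to read the answer rather than as a verdict.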

### Interactive Testing

You can also ask your own questions by running the cells below!

In [None]:
# Try your own questions here!
# Modify the question below and run the cell

your_question = "What is the difference between supervised and unsupervised learning?"
ask_chatbot(your_question, show_sources=True)

## 10. Save and Export the Chatbot

Let's save our chatbot configuration and provide instructions for future use.

In [None]:
# Save the index for future use
vector_store_dir = Path("../vector_store")
vector_store_dir.mkdir(exist_ok=True)

try:
    # Save the index
    index.storage_context.persist(persist_dir=str(vector_store_dir))
    print("✅ Index saved successfully!")
    print(f"📁 Location: {vector_store_dir}")
except Exception as e:
    print(f"⚠️ Could not save index: {e}")

# Create a summary of the chatbot configuration
config_summary = {
    "embedding_model": EMBEDDING_MODEL,
    "llm_model": LLM_MODEL,
    "chunk_size": CHUNK_SIZE,
    "chunk_overlap": CHUNK_OVERLAP,
    "top_k_retrieval": TOP_K,
    "vector_dimension": EMBEDDING_DIM,
    "total_documents": len(documents),
    "total_chunks": len(nodes) if 'nodes' in locals() else 0,
    "vector_count": faiss_index.ntotal
}

print(f"\n📊 Chatbot Configuration Summary:")
for key, value in config_summary.items():
    print(f"   {key}: {value}")

# Save configuration to file
config_file = Path("../chatbot_config.json")
import json
with open(config_file, 'w') as f:
    json.dump(config_summary, f, indent=2)
print(f"\n💾 Configuration saved to: {config_file}")

# Display conversation history if any questions were asked
if hasattr(chatbot, 'conversation_history') and chatbot.conversation_history:
    print(f"\n💬 Conversation History:")
    history_df = chatbot.get_conversation_history()
    print(history_df.to_string(index=False))
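
Once the index is persisted, a later session can reload it instead of re-embedding every chunk. The sketch below follows the loading pattern from the LlamaIndex Faiss documentation. The wrapper `load_saved_index` is a hypothetical helper, and it assumes the same embedding model is supplied again and that the installed LlamaIndex version provides the global `Settings` object. The imports sit inside the function so the cell can be defined even before the libraries are installed.

```python
from pathlib import Path

def load_saved_index(persist_dir: Path, embed_model, llm=None):
    """Reload a persisted Faiss-backed index; embed_model must match the one used to build it."""
    from llama_index.core import Settings, StorageContext, load_index_from_storage
    from llama_index.vector_stores.faiss import FaissVectorStore

    # Queries must be embedded with the same model that produced the stored vectors.
    Settings.embed_model = embed_model
    Settings.llm = llm

    vector_store = FaissVectorStore.from_persist_dir(str(persist_dir))
    storage_context = StorageContext.from_defaults(
        vector_store=vector_store, persist_dir=str(persist_dir)
    )
    return load_index_from_storage(storage_context)

# Example usage in a new session, after recreating embed_model and llm:
# index = load_saved_index(Path("../vector_store"), embed_model, llm)
# query_engine = index.as_query_engine(similarity_top_k=5, response_mode="compact")
```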

## 🎉 Congratulations!

You have successfully built a complete RAG-based chatbot that can answer questions from your course notes!

### What You've Accomplished

✅ **Document Processing**: Loaded and processed course notes from multiple formats  
✅ **Text Chunking**: Split documents into optimal chunks for retrieval  
✅ **Vector Embeddings**: Created semantic embeddings using Hugging Face models  
✅ **Vector Database**: Set up Faiss for fast similarity search  
✅ **Language Model**: Integrated a language model for response generation  
✅ **RAG Pipeline**: Built a complete Retrieval-Augmented Generation system  
✅ **Interactive Interface**: Created functions to interact with the chatbot  
✅ **Testing**: Tested the chatbot with various types of questions  

### Key Technical Components

1. **LlamaIndex**: Document indexing and retrieval framework
2. **Faiss**: High-performance vector similarity search
3. **Hugging Face Transformers**: Pre-trained models for embeddings and text generation
4. **RAG Architecture**: Combines retrieval and generation for accurate answers

### Assignment Submission Checklist

- ✅ **Code Files**: Complete implementation with comments
- ✅ **Requirements**: All dependencies listed in requirements.txt  
- ✅ **Documentation**: Comprehensive README and code documentation
- ✅ **Demonstration**: This notebook serves as your demo
- ✅ **Testing**: Multiple test questions answered successfully

### Next Steps for Your Assignment

1. **Replace Sample Notes**: Add your actual course notes to the `course_notes/` folder
2. **Test Thoroughly**: Ask questions covering all your course topics
3. **Take Screenshots**: Capture the chatbot answering at least 5 different questions
4. **Write Report**: Document your approach, challenges, and learnings
5. **Create Submission**: Package everything according to submission guidelines

### How to Use Your Chatbot

```python
# In a new session, you can recreate the chatbot by running all cells above
# Or use the standalone files in the code/ directory:

# Command line:
# python code/chatbot.py

# Web interface:
# streamlit run code/streamlit_app.py
```

### Reflection

This assignment demonstrates the power of RAG systems for creating domain-specific chatbots. By combining document retrieval with language generation, we can create chatbots that provide accurate, contextual answers based on specific knowledge sources - in this case, your course notes!

**What makes this chatbot special:**
- It grounds its answers in passages retrieved from your course content
- It shows source information for transparency  
- It handles various document formats
- It combines sentence-embedding retrieval with a transformer language model
- It's fully customizable and extensible

Great job building your course notes chatbot! 🚀