# 💡 LLM + RAG Interview Guide

---

## 🔹 LLM Fundamentals

### ➤ What is tokenization, and how does it affect generation?
Tokenization is the process of breaking input text into smaller units (tokens), such as words, subwords, or characters. LLMs process and generate outputs token-by-token, so the type of tokenizer (like BPE or WordPiece) impacts:
- Model input length (affecting cost and speed)
- Output fluency and formatting
- Memory and computation needs
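
As a rough illustration of why the tokenizer changes input length, here is a minimal sketch of greedy longest-match subword splitting. This is a simplification in the spirit of WordPiece-style inference, not an implementation of BPE or of any real tokenizer; the `VOCAB` set and the `greedy_tokenize` function are made up for this example.

```python
def greedy_tokenize(text: str, vocab: set) -> list:
    """Split text into the longest known vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical subword vocabulary, for illustration only.
VOCAB = {"token", "ization", "un", "believ", "able"}

print(greedy_tokenize("tokenization", VOCAB))       # ['token', 'ization']
print(len(greedy_tokenize("tokenization", set())))  # 12 (one token per character)
```

The same word costs 2 tokens with a subword vocabulary and 12 with character-level fallback, which is the effect on input length, cost, and speed described above.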

---

### ➤ How do embeddings really work?
Embeddings are dense vector representations of tokens or text chunks. They capture **semantic relationships**:
- Similar meanings → closer vectors (commonly measured with cosine similarity)
- LLMs map each token ID to an embedding vector through a learned lookup table
- These vectors pass through the transformer layers, which turn them into context-aware representations
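
The "closer vectors" idea can be sketched with cosine similarity. The three-dimensional vectors below are hand-made for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (assumed values, not produced by any model).
EMB = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

sim_related = cosine_similarity(EMB["cat"], EMB["kitten"])
sim_unrelated = cosine_similarity(EMB["cat"], EMB["car"])
print(sim_related > sim_unrelated)  # True
```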

---

### ➤ What’s the role of attention and positional encoding?
- **Attention**: Lets the model decide "what to focus on" at each token by assigning higher weights to the more relevant tokens in the context.
- **Positional Encoding**: Transformers process all tokens in parallel, and self-attention by itself ignores token order. Positional encodings inject that order information.
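
Both ideas can be sketched in plain Python. The block below assumes single-query scaled dot-product attention and the sinusoidal encoding from the original Transformer design; learned or rotary position schemes work differently. Function names are chosen for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, out


def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]


weights, _ = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(weights[0] > weights[1])     # True: the matching key gets more weight
print(sinusoidal_position(0, 4))   # [0.0, 1.0, 0.0, 1.0]
```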

---

### ➤ What changes during fine-tuning? (optimizers, schedulers, layer freezing)
- **Optimizers** like AdamW update the model weights from the gradients.
- **Schedulers** adjust the learning rate over the course of training, typically a short warmup followed by decay.
- **Layer freezing** keeps earlier layers fixed and updates only the later ones, which reduces catastrophic forgetting and saves compute.
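
To make the scheduler point concrete, here is a minimal sketch of one common shape: linear warmup followed by cosine decay. The specific shape and the `lr_at_step` function are assumptions for illustration; the source does not prescribe a schedule.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Ramp up linearly so early, noisy gradients do not cause large updates.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, base_lr=1e-4, warmup_steps=100, total_steps=1000))
```

The rate rises during the first 100 steps, peaks at `base_lr`, and then decays smoothly toward zero by the final step.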

---

### ➤ LoRA vs QLoRA vs Full Fine-tune – Tradeoffs

| Method       | Pros                           | Cons                           |
|--------------|--------------------------------|--------------------------------|
| Full Fine-tune | Best performance             | Very compute and memory heavy |
| LoRA         | Lightweight, efficient         | Needs adapter injection        |
| QLoRA        | Most memory-efficient (4-bit)  | Slight tradeoff in accuracy    |
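
A back-of-the-envelope sketch of where the savings in the table come from, assuming a single square weight matrix of size 4096 and LoRA rank 8 (both values chosen only for illustration). LoRA trains two low-rank factors instead of the full matrix, and a 4-bit base model stores each weight in a quarter of the space of 16-bit. The byte count ignores the small overhead of quantization constants.

```python
def full_finetune_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def weight_bytes(n_params, bits_per_param):
    """Storage for the weights alone at a given precision."""
    return n_params * bits_per_param // 8


d = 4096
print(full_finetune_params(d, d))   # 16777216
print(lora_params(d, d, rank=8))    # 65536
print(weight_bytes(d * d, 16))      # 33554432 bytes at 16-bit
print(weight_bytes(d * d, 4))       # 8388608 bytes at 4-bit
```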

---

## 🔹 Prompting & Context Engineering

### ➤ Few-shot vs Zero-shot – Which works better where?
- **Zero-shot**: Best for general-purpose tasks with clear instructions (e.g., classification).
- **Few-shot**: Better for nuanced, domain-specific, or creative tasks where examples help steer behavior.

---

### ➤ How do you design system prompts that are robust across users?
- Use clear, concise instructions.
- Define expected format and tone.
- Include edge-case handling.
- Use consistent structure to reduce ambiguity.

---

### ➤ How do you make output deterministic?
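- Set **temperature = 0** (greedy decoding) and **top_p = 1.0**
- Use consistent prompts
- Fix random seeds where the API supports it
- Pin the model version, since identical settings can still give different outputs after a model update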
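
Why temperature and seeds matter can be shown with a toy sampler. This is a sketch of the general idea, not any provider's decoding code; `sample_token` and the logits are made up for the example.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token, no randomness.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0]  # made-up scores for three candidate tokens

greedy = [sample_token(logits, 0, random.Random()) for _ in range(5)]
print(greedy)  # [1, 1, 1, 1, 1]

# With temperature > 0 the output is random, but a fixed seed makes it repeatable.
run_a = [sample_token(logits, 1.0, rng) for rng in [random.Random(7)] for _ in range(10)]
run_b = [sample_token(logits, 1.0, rng) for rng in [random.Random(7)] for _ in range(10)]
print(run_a == run_b)  # True
```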

---

### ➤ How do you track, version, and backfill changing context?
- Use versioned templates or prompt IDs
- Store prompt history with timestamps
- Backfill by rerunning previous inputs with updated contexts
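
One possible shape for this, as a minimal in-memory sketch: templates are versioned by content hash, each version carries a timestamp, and backfilling reruns stored inputs through the latest template. `PromptRegistry` and the `run` callable (a stand-in for the model call) are hypothetical names for this example.

```python
import hashlib
from datetime import datetime, timezone


class PromptRegistry:
    """Keeps every version of each prompt template, identified by prompt ID."""

    def __init__(self):
        self._versions = {}  # prompt_id -> list of version records

    def register(self, prompt_id, template):
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        history = self._versions.setdefault(prompt_id, [])
        if history and history[-1]["hash"] == digest:
            return history[-1]  # unchanged template, no new version
        record = {
            "prompt_id": prompt_id,
            "version": len(history) + 1,
            "hash": digest,
            "template": template,
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
        history.append(record)
        return record

    def latest(self, prompt_id):
        return self._versions[prompt_id][-1]

    def backfill(self, prompt_id, past_inputs, run):
        """Rerun stored inputs through the latest template version."""
        current = self.latest(prompt_id)
        return [
            {"input": item, "version": current["version"],
             "output": run(current["template"].format(**item))}
            for item in past_inputs
        ]


registry = PromptRegistry()
registry.register("summarize", "Summarize: {text}")
registry.register("summarize", "Summarize briefly: {text}")
# str.upper stands in for a real model call.
print(registry.backfill("summarize", [{"text": "hi"}], run=str.upper))
```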

---

### ➤ How do you build/maintain memory?
- Use vector databases or key-value stores
- Index past interactions by session/user
- Retrieve relevant past interactions per new query
- Summarize or prune long-term memory

---

## 🔹 RAG Systems

### ➤ What’s your chunking strategy – by length, semantics, or structure?
- **Structure-first** (e.g., paragraphs or sections)
- **Length-bounded** (e.g., 500-800 tokens)
- **Overlap** between adjacent chunks preserves context across chunk boundaries

Use tools like `RecursiveCharacterTextSplitter` in LangChain for hybrid strategies.
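
For intuition, the three ideas can be combined in a few lines of plain Python. This sketch is not LangChain's algorithm: it splits on blank lines (structure), caps chunk size (length), and repeats a few trailing words at the start of the next chunk (overlap). It counts whitespace-separated words as a stand-in for real tokens, and the small defaults are arbitrary.

```python
def chunk_text(text, max_tokens=200, overlap=20):
    """Paragraph-first chunking with a size cap and overlap between chunks."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    chunks = []
    for paragraph in text.split("\n\n"):  # structure first: never cross paragraphs
        words = paragraph.split()         # crude token proxy: whitespace words
        start = 0
        while start < len(words):
            end = min(start + max_tokens, len(words))
            chunks.append(" ".join(words[start:end]))
            if end == len(words):
                break
            start = end - overlap         # step back so chunks share context
    return chunks


sample = "a b c d e f g h i j\n\nk l"
print(chunk_text(sample, max_tokens=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j', 'k l']
```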

---

### ➤ How do you choose a vector DB?
| DB       | When to Use                             |
|----------|------------------------------------------|
| Chroma   | Local dev, light RAG prototypes          |
| Pinecone | Production-scale vector retrieval        |
| OpenSearch | Combine keyword + vector search       |

---

### ➤ Can you update or backfill embeddings with zero downtime?
Yes:
- Shadow indexing
- Background jobs for re-embedding
- Dual index systems (old vs new) with hot swap
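
The dual-index hot swap can be sketched as an alias that readers query while a shadow index is built in the background. Everything here is a simplified in-memory stand-in: `IndexAlias`, `reembed`, and the dict-based "index" are hypothetical, and a real vector database would expose its own alias or collection-swap mechanism.

```python
import threading


class IndexAlias:
    """Readers go through the alias; the live index is swapped atomically."""

    def __init__(self, name, index):
        self._lock = threading.Lock()
        self._live_name = name
        self._live = index

    def search(self, query_fn):
        with self._lock:
            index = self._live  # grab a reference, then query outside the lock
        return query_fn(index)

    def swap(self, name, new_index):
        """Point the alias at the new index and return the old one for rollback."""
        with self._lock:
            old = (self._live_name, self._live)
            self._live_name, self._live = name, new_index
        return old


def reembed(docs, embed):
    """Background job: build a complete shadow index with the new embedder."""
    return {doc_id: embed(text) for doc_id, text in docs.items()}


docs = {"a": "hello", "b": "retrieval"}
alias = IndexAlias("v1", reembed(docs, embed=len))           # old "model"
shadow = reembed(docs, embed=lambda text: len(text) * 2)     # new "model"
old_name, old_index = alias.swap("v2", shadow)               # cut over
print(alias.search(lambda index: index["a"]), old_name)      # 10 v1
```

Keeping the old index around after the swap is what makes a fast rollback possible if the new embeddings underperform.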

---

### ➤ How do you evaluate retrieval quality?
- **Precision@k, Recall@k, MRR**
- **Manual eval** for semantic correctness
- **Reranking models** (cross-encoders) to score query-chunk relevance and compare against the retriever's ordering
- **Citations** to trace source chunks
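
The three ranking metrics are short enough to write out. This is a minimal sketch assuming binary relevance labels and document IDs as strings; the example IDs are invented.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant result over all queries."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=2))   # 0.5
print(recall_at_k(retrieved, relevant, k=2))      # 0.5
print(recall_at_k(retrieved, relevant, k=4))      # 1.0
print(mean_reciprocal_rank([(retrieved, relevant), (["d2"], {"d2"})]))  # 0.75
```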

---

## 🔹 MLOps & LLMOps

### ➤ Sketch a pipeline: raw data → model → serving → feedback
1. Ingest raw/unstructured data
2. Preprocess & embed
3. Fine-tune or prompt-template
4. Deploy (API or batch)
5. Log outputs & collect feedback
6. Monitor & retrain

---

### ➤ How would you monitor performance drift or hallucinations?
- Track similarity between response and ground truth
- Run named entity/fact checkers
- Score outputs for domain deviation
- Use feedback ratings

---

### ➤ How do you log prompts and outputs?
- Structured logging (JSON)
- Include: prompt ID, input, output, model, temperature
- Store in databases or logging platforms (e.g., ELK, LangSmith)
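
A structured log line with the fields listed above might look like the sketch below. The `request_id` and `timestamp` fields are additions assumed for this example, and the model name is a placeholder; shipping the line to ELK, LangSmith, or a database is left out.

```python
import json
import uuid
from datetime import datetime, timezone


def build_log_record(prompt_id, user_input, output, model, temperature):
    """One JSON-serializable record per model call."""
    return {
        "request_id": str(uuid.uuid4()),                      # assumed extra field
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed extra field
        "prompt_id": prompt_id,
        "input": user_input,
        "output": output,
        "model": model,
        "temperature": temperature,
    }


record = build_log_record(
    prompt_id="summarize-v2",
    user_input="Summarize the release notes.",
    output="Three bug fixes and one new endpoint.",
    model="example-model",  # placeholder name
    temperature=0.0,
)
print(json.dumps(record))  # one line per call, easy to index and query
```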

---

### ➤ CI/CD for LLM workflows – What’s different?
- Prompts are tested and versioned alongside code
- Output quality, bias, and hallucination must be evaluated, not just pass/fail unit tests
- Human-in-the-loop validation before release
- Rollbacks for prompt templates or chains, not only for code

---

## 🔹 Cost & Latency Tradeoffs

### ➤ How do you reduce token usage?
- Minimize prompt length
- Use embedding-based retrieval to pass only the most relevant chunks instead of whole documents
- Avoid over-engineering system prompts
- Use summarization for history
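
A simpler relative of history summarization is a sliding window that keeps only the most recent messages within a token budget. The sketch below shows that windowing step only; summarizing the dropped turns would need a model call and is not shown. Token counting is approximated by whitespace words, which is an assumption for the example.

```python
def trim_history(messages, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep the newest messages whose combined size fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))         # restore chronological order


history = ["a b c", "d e", "f"]
print(trim_history(history, max_tokens=3))  # ['d e', 'f']
```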

---

### ➤ When should you quantize a model?
- On edge devices (mobile, embedded systems)
- When cost, latency, or memory constraints apply
- For faster inference with minimal accuracy loss

---

### ➤ Batching & caching strategy?
- Batch requests to GPUs (e.g., using Hugging Face or Triton)
- Cache embeddings, frequent prompts/responses
- Use async APIs to run independent calls concurrently and hide per-request latency
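
Caching and batching can be sketched with the standard library alone. `slow_embed` is a placeholder for a real embedding call, and the call counter exists only to show the cache working; neither comes from a real library.

```python
from functools import lru_cache

CALLS = {"embed": 0}  # counts how often the expensive function actually runs


def slow_embed(text):
    """Placeholder for an expensive embedding call (hypothetical)."""
    CALLS["embed"] += 1
    return tuple(float(ord(ch)) for ch in text[:4])


@lru_cache(maxsize=10_000)
def cached_embed(text):
    """Identical inputs hit the in-process cache instead of the model."""
    return slow_embed(text)


def batched(items, batch_size):
    """Yield fixed-size groups so the backend can process them together."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


print(list(batched(["q1", "q2", "q3", "q4", "q5"], batch_size=2)))
# [['q1', 'q2'], ['q3', 'q4'], ['q5']]
```

An in-process `lru_cache` is lost on restart and not shared between workers, so a shared cache would be the next step in a multi-instance deployment.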

---

### ➤ Hosted APIs vs open-source models?
- **Hosted APIs**: Easier to adopt and scale, but can cost more in the long term at high volume
- **Open-source models**: More control, cost-effective, but higher DevOps effort

---

## 🔹 System Design Thinking

### ➤ How to make AI systems more deterministic?
- Fix temperature, top_p
- Use prompt chains with fallback logic
- Unit test with edge case prompts

---

### ➤ What fallback do you use if LLM fails?
- Rule-based system
- Predefined templates
- Human-in-the-loop escalation
- Cache valid responses
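
These fallbacks can be chained. The sketch below tries the model first, then a cache of earlier valid responses, then keyword rules with predefined templates, and finally escalates to a human. The ordering, the `answer_with_fallback` function, and the sample rule are all assumptions for illustration.

```python
def answer_with_fallback(query, llm_call, cache, rules):
    """Return (answer, source) using progressively simpler fallbacks."""
    try:
        response = llm_call(query)
        if response and response.strip():
            cache[query] = response  # remember valid responses for later
            return response, "llm"
    except Exception:
        pass  # any model failure drops through to the fallbacks
    if query in cache:
        return cache[query], "cache"
    for keyword, templated_answer in rules.items():
        if keyword in query.lower():
            return templated_answer, "rule"
    return "Your request has been passed to a human agent.", "human"


def broken_llm(query):
    raise TimeoutError("model unavailable")  # simulated outage


cache = {}
rules = {"refund": "Please see the refund policy page."}  # hypothetical rule

print(answer_with_fallback("How do I get a refund?", broken_llm, cache, rules))
# ('Please see the refund policy page.', 'rule')
```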

---

### ➤ Can you solve this without LLM or vector DB?
Yes:
- Use regex or rule-based NLU for structured tasks
- SQL or keyword search for fixed-structure data
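
For example, pulling an order number out of a support message is a structured task that a regular expression handles without any model. The message format and the `extract_order_id` helper are invented for this sketch.

```python
import re

# Matches "order 12345" or "order #12345" with at least four digits.
ORDER_RE = re.compile(r"\border\s+#?(\d{4,})\b", re.IGNORECASE)


def extract_order_id(message):
    """Return the order number as a string, or None if none is present."""
    match = ORDER_RE.search(message)
    return match.group(1) if match else None


print(extract_order_id("Where is my order #12345?"))  # 12345
print(extract_order_id("I have a general question"))  # None
```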

---

### ➤ Right database: SQL, NoSQL, or vector?
| Type      | Use Case                            |
|-----------|-------------------------------------|
| SQL       | Structured, relational data         |
| NoSQL     | Flexible, document-based storage    |
| Vector DB | Semantic similarity, embeddings, RAG|

---

## 🔹 Real-World Scenarios

### 1️⃣ What happens if your embedding model changes?
- Recompute embeddings for all docs, since vectors from different models are not comparable
- Use background jobs for re-indexing
- Run A/B tests with shadow index
- Cutover when stable

---

### 2️⃣ How would you fine-tune a model on user behavior?
- Log interactions
- Label data from feedback
- Train with supervised fine-tuning
- Validate on holdout set
- Deploy with version control

---

### 3️⃣ How to make the system cheaper?
- Use distilled/smaller models for simple tasks
- Cache frequent queries
- Use RAG instead of stuffing long context
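- Quantize the model for inference (int8 or 4-bit); QLoRA applies 4-bit quantization to cut fine-tuning cost rather than serving cost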

---

### 4️⃣ Debugging LLM outputs?
- Review prompt + output logs
- Check model version + config (temp, top_p)
- Reproduce in isolation
- Check for hallucinations, prompt injection
- Iterate on prompt design or retrieval context

---
