# 🚀 Gen AI / RAG Interview Cheat Sheet – Suraj Khodade

| **Category** | **Question** | **One-Line Answer** |
|--------------|--------------|----------------------|
| **Intro** | Tell me about yourself | 6+ yrs in IT (4+ in GenAI/NLP); built RAG (FAISS/Pinecone) and FastAPI apps; Payroll Assistant cut HR tickets by 55%. |
| **Gen AI Basics** | What is RAG? | LLM + VectorDB (retrieval+gen) → factual, grounded answers. |
| | Why RAG vs fine-tuning? | RAG = flexible, cheaper; FT = domain-locked, costly. |
| | How to reduce hallucinations? | Prompt grounding, low temp, reranker, citations, fallback FAQs. |
| | Which embeddings do you use? | Sentence-BERT, OpenAI Ada; choice depends on task/domain. |
| | How to handle embedding drift? | Version embeddings, re-embed, dual-write, A/B test the migration. |
| | Hybrid search? | BM25 (keywords) + embeddings → rerank for recall + precision. |
| **Vector DBs** | FAISS vs Pinecone vs Weaviate? | FAISS/Chroma: light, local; Pinecone: scalable SaaS; Weaviate: schema, hybrid. |
| | How do you choose a similarity metric? | Cosine (direction only); dot product (fast, equals cosine on normalized vectors); Euclidean (absolute distance). |
| | Why normalize embeddings? | Unit-length vectors make cosine and dot product comparable; avoids magnitude bias. |
| **System Design** | How to design a FastAPI RAG service? | Endpoints (/query, /ingest, /health); async I/O; Celery+Redis; Docker+K8s; cache. |
| | How to scale for 1k users? | Async workers, Redis cache, K8s HPA, fallback models, observability. |
| | Handle API rate limits? | Retry/backoff, batching, caching, API Gateway throttling. |
| **Quality & Eval** | Key eval metrics? | Precision@k, hallucination rate, latency, adoption %, workload reduction. |
| | How to measure success? | Tech (P@k ↑, hallucinations ↓) → User (40% auto-resolve) → Biz (55% HR tickets ↓). |
| | Sustain post-launch? | Auto re-embed, monitor drift, feedback loops, quarterly review. |
| **Optimization** | How to cut cost? | Route to smaller models, cache, LoRA, optimized chunking. |
| | GPT-4 too costly/slow? | Use GPT-3.5, Llama 2, Mistral; fallback routing. |
| **Security** | Sensitive data handling? | PII masking, RBAC, on-prem/VPC, encrypt at rest+transit, no raw logs. |
| **Debug & Test** | Debug inconsistent answers? | Deterministic retriever, low temp, enforce schema, reranker. |
| | How to test Gen AI? | Pytest APIs, golden QAs, LangSmith eval, A/B testing. |
| **Multimodal** | Extend RAG to images/audio? | Images: CLIP; Audio: Whisper; Video: STT+OCR; Models: GPT-4V, LLaVA. |
| | Example: Meeting summarizer? | Whisper STT → chunk/embed → RAG summary → enrich with slides/visuals. |
| **Fine-Tuning** | Fine-tuning vs LoRA vs Prompting? | FT: accurate but costly; LoRA: efficient mid-ground; Prompt: cheap, flexible. |
| | When to fine-tune? | When domain-specific, repetitive tasks are not solved by RAG or prompting alone. |
| **Projects** | Payroll Assistant impact? | 40% of payroll queries auto-resolved → HR tickets reduced by 55%. |
| | Recommender impact? | Cosine sim model → staffing efficiency +27%. |
| **Behavioral** | Stakeholder resistance? | HR resisted structured data → demoed hybrid solution → dev time ↓25%, efficiency ↑27%. |
| | Explain embeddings to non-tech? | "Like digital fingerprints of documents" → similarity = closeness. |
| **Learning** | How to stay updated? | arXiv, HuggingFace/LangChain, GitHub, LinkedIn, Slack, prototyping. |
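
The similarity-metric and normalization rows in the table can be illustrated with a small sketch. It uses only the standard library; the helper names (`l2_normalize`, `dot`, `cosine`, `euclidean`) and the toy vectors are assumptions made for illustration, not part of any vector DB API.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors: doc_long points the same way as doc_short but is 10x longer.
query = [3.0, 4.0]
doc_short = [1.0, 1.0]
doc_long = [10.0, 10.0]

# Raw dot product rewards magnitude; cosine ignores it.
raw_short = dot(query, doc_short)
raw_long = dot(query, doc_long)
cos_short = cosine(query, doc_short)
cos_long = cosine(query, doc_long)

# After L2 normalization, dot product equals cosine,
# and squared Euclidean distance equals 2 - 2 * cosine.
q, d = l2_normalize(query), l2_normalize(doc_long)
dot_normalized = dot(q, d)
dist_sq_normalized = euclidean(q, d) ** 2
```

On unit-length vectors all three metrics rank neighbours identically, which is why a normalized index can use the cheaper dot product.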

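Precision@k, listed under the eval metrics above, can be computed against a small golden QA set. The sketch below is a minimal version; the function name `precision_at_k`, the document IDs, and the `golden` structure are hypothetical and chosen only for illustration.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not len(top_k)) so short result lists are penalized.
    return hits / k

# Hypothetical golden set: query -> (ranked retriever output, relevant doc IDs)
golden = {
    "q1": (["d1", "d7", "d3"], {"d1", "d3"}),
    "q2": (["d9", "d2", "d4"], {"d2"}),
}

k = 3
scores = {
    query: precision_at_k(retrieved, relevant, k)
    for query, (retrieved, relevant) in golden.items()
}
mean_p_at_k = sum(scores.values()) / len(scores)
```

Tracking the mean over a fixed golden set after each re-embed or chunking change gives a simple regression signal for the retriever.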