## Introduction
Hi, I’m Suraj, an AI/ML Engineer with 6+ years of IT experience, including 4 years in Generative AI, NLP, and ML. I specialize in building RAG-based applications with LangChain, Hugging Face, and vector DBs like FAISS/Chroma, and deploying them via FastAPI/Docker. At Globant, I built a Payroll Gen AI Assistant that reduced HR query tickets by 55%, and at CitiusTech, I led AI projects like fraud detection and semantic search recommenders. I’m passionate about designing scalable and reliable Gen AI systems, and I’m excited about this opportunity to apply my expertise to create impactful solutions.

## Generative AI & RAG

### **Q:** Walk me through the architecture of a RAG pipeline you built.
### **A:**
*"In my Payroll Assistant project, I used OpenAI embeddings + Chroma DB to store vectorized payroll policies. On a query, a retriever fetched top-k documents, re-ranked them, and passed them into GPT-4 via LangChain. The response included citations for trust. I exposed the workflow as a FastAPI API and integrated LangSmith for tracing. This reduced hallucination rates and cut HR tickets by 55%."*
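The retrieve-then-prompt flow described above can be sketched in a few lines. This is only an illustration: a toy bag-of-words vector stands in for OpenAI embeddings, an in-memory dict stands in for Chroma, and the document IDs, texts, and function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini knowledge base; IDs and texts are made up for illustration.
DOCS = {
    "policy-1": "Salary is credited on the last working day of each month.",
    "policy-2": "Overtime is paid at 1.5 times the hourly rate.",
    "policy-3": "Tax declarations must be submitted by January 31.",
}

def embed(text: str) -> Counter:
    """Toy bag-of-words vector standing in for a real embedding model."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the IDs of the k documents most similar to the query."""
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)[:k]

def build_prompt(query: str, doc_ids: list[str]) -> str:
    """Assemble a grounded prompt whose context lines carry citation tags."""
    context = "\n".join(f"[{d}] {DOCS[d]}" for d in doc_ids)
    return (
        "Answer only from the context and cite the source IDs.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

top = retrieve("When is salary credited?")
prompt = build_prompt("When is salary credited?", top)
```

In a real pipeline the prompt would be sent to the LLM, and a re-ranking step could sit between `retrieve` and `build_prompt`.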


### **Q:** Even after correct retrieval, your LLM hallucinates. What would you do?

### **A:**
“In such cases, I approach the problem in layers.

* **First, retrieval validation:** I check the chunk size, the top-k value, and the similarity metric to ensure the right context is being passed.
* **Second, prompting:** I refine prompts with stricter instructions like *‘Only answer from the provided context’* and sometimes add few-shot role-based examples.
* **Third, model parameters:** I tune decoding parameters: lowering temperature and adjusting top-p to make responses more deterministic, and capping max tokens to keep them concise.
* **Finally, validation:** I enforce structured outputs such as JSON schemas or citations, and in some cases provide a fallback to FAQs when no confident answer exists.

For example, in my payroll assistant project, introducing citation-linked answers reduced hallucination rates by nearly 70%, which significantly boosted user trust and adoption.”
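The final validation layer could look roughly like the sketch below. It assumes the model was asked to reply with JSON containing `answer` and `citations` fields; that schema, the fallback message, and the function name are hypothetical choices for illustration.

```python
import json

FALLBACK = "I could not find a confident answer; please check the payroll FAQ."

def validate_answer(raw: str, retrieved_ids: set[str]) -> str:
    """Accept a model reply only if it is valid JSON and cites retrieved chunks."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return FALLBACK
    if not isinstance(payload, dict):
        return FALLBACK
    answer = payload.get("answer")
    citations = payload.get("citations")
    if not isinstance(answer, str) or not isinstance(citations, list) or not citations:
        return FALLBACK
    if not all(isinstance(c, str) for c in citations):
        return FALLBACK
    # Every citation must point at a chunk that was actually retrieved.
    if not set(citations) <= retrieved_ids:
        return FALLBACK
    return f"{answer} (sources: {', '.join(citations)})"

good = validate_answer(
    '{"answer": "Salary is credited on the last working day.", "citations": ["policy-1"]}',
    {"policy-1", "policy-2"},
)
```

Rejecting answers that cite chunks outside the retrieved set is a cheap guard against fabricated sources.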




### **Q:** You’ve used FAISS/Chroma. How would you compare them to Pinecone/Weaviate?

### **A:**
“I see FAISS and Chroma as excellent for prototyping because they’re lightweight and easy to set up. However, scaling them is largely manual, and with FAISS you also manage persistence yourself by saving and loading index files. Pinecone and Weaviate, on the other hand, are enterprise-ready.

* **Pinecone** is fully managed, cloud-native, and optimized for low-latency similarity search at scale, with strong metadata filtering.
* **Weaviate** goes further by offering schema support, hybrid search, and GraphQL APIs, making it ideal for use cases where structured + unstructured search needs to be combined.

So, I’d typically use FAISS/Chroma for quick experiments, Pinecone when I need plug-and-play scalability, and Weaviate when the solution requires schema-heavy or hybrid search capabilities.”


### **Q:** How do you handle embedding drift when the embedding model changes?

### **A:**
“Embedding drift is a real concern, so I handle it through versioning and controlled migration. I always tag embeddings with the model version in metadata. When upgrading, I re-embed documents with the new model into a separate index, and during the transition phase I dual-write new documents to both indexes and mirror queries against both. Then, I run A/B tests to compare results and only cut over traffic once I’m confident the new embeddings maintain or improve accuracy. This ensures minimal disruption to end users.”
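A minimal sketch of version tagging and a cutover check, assuming an in-memory list in place of a vector DB. The `StoredVector` type, the version strings, and the hit-rate criterion are hypothetical; a real migration would use the store's own metadata filters and a larger evaluation set.

```python
from dataclasses import dataclass

@dataclass
class StoredVector:
    doc_id: str
    vector: list[float]
    model_version: str  # tag written alongside every embedding

def search_space(index: list[StoredVector], version: str) -> list[StoredVector]:
    """Restrict search to vectors produced by one embedding model version."""
    return [v for v in index if v.model_version == version]

def ready_to_cut_over(old_hits: list[list[str]], new_hits: list[list[str]],
                      relevant: list[set[str]]) -> bool:
    """True if the new index finds a relevant doc at least as often as the old one."""
    def hit_rate(hits: list[list[str]]) -> float:
        return sum(bool(set(h) & r) for h, r in zip(hits, relevant)) / len(relevant)
    return hit_rate(new_hits) >= hit_rate(old_hits)

index = [
    StoredVector("d1", [0.1, 0.9], "embed-v1"),
    StoredVector("d1", [0.2, 0.8], "embed-v2"),
    StoredVector("d2", [0.7, 0.3], "embed-v2"),
]
v2_only = search_space(index, "embed-v2")
cutover = ready_to_cut_over(
    old_hits=[["d1"], ["d9"]],
    new_hits=[["d1"], ["d2"]],
    relevant=[{"d1"}, {"d2"}],
)
```

Filtering by version keeps vectors from different models out of the same similarity search, since their spaces are not comparable.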




### **Q:** How would you structure a FastAPI service for RAG?

### **A:**
“I’d structure it with clear endpoints — `/query` for answering user questions, `/ingest` for uploading new documents, and `/health` for monitoring. For retrieval and LLM calls, I’d use async functions to handle concurrency efficiently. For heavier tasks like bulk embedding generation, I’d offload them to background workers using Celery with Redis. The service would be containerized with Docker, deployed behind Uvicorn/Gunicorn workers, and scaled with a load balancer. To optimize performance, I’d also add Redis caching for frequently asked queries.”

### **Q:** If 1,000 users hit your API simultaneously, how do you scale?

### **A:**
“To handle that kind of load, I’d rely on async I/O to maximize concurrency and Redis caching to reduce repeated computation. Background tasks would be queued with Celery so they don’t block user queries. For scaling, I’d run the service in Kubernetes with horizontal pod auto-scaling based on CPU and memory metrics. If the LLM provider gets overloaded — for example, GPT-4 latency spikes — I’d implement a fallback to smaller, faster models for non-critical queries. Observability tools would monitor latency and throughput so we can proactively adjust scaling.”
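The combination of async I/O, caching, and model fallback might be sketched as follows. The model names, delays, timeout, and in-process dict cache are placeholders; a production service would call a real provider and use Redis rather than a module-level dict.

```python
import asyncio

CACHE: dict[str, str] = {}

async def call_model(name: str, query: str, delay: float) -> str:
    """Stand-in for a remote LLM call; `delay` simulates provider latency."""
    await asyncio.sleep(delay)
    return f"{name}: answer to {query!r}"

async def answer(query: str, limiter: asyncio.Semaphore, primary_delay: float = 0.01) -> str:
    if query in CACHE:                      # serve repeated questions from cache
        return CACHE[query]
    async with limiter:                     # cap concurrent upstream calls
        try:
            result = await asyncio.wait_for(
                call_model("primary", query, primary_delay), timeout=0.05)
        except asyncio.TimeoutError:
            # Primary model is too slow: fall back to a faster, smaller model.
            result = await call_model("fallback", query, 0.0)
    CACHE[query] = result
    return result

async def main() -> list[str]:
    limiter = asyncio.Semaphore(100)
    fast = await asyncio.gather(*(answer(f"q{i}", limiter) for i in range(1000)))
    slow = await answer("slow question", limiter, primary_delay=1.0)
    return list(fast) + [slow]

results = asyncio.run(main())
```

The semaphore bounds how many requests reach the provider at once, while the timeout decides when to switch to the fallback model.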




### **Q:** How do you measure quality and accuracy of your AI solutions?

### **A:**
“I look at it from three perspectives.

* **Technical metrics** like precision\@k for retrieval, hallucination rate, and response latency.
* **User-level metrics** such as resolution rate and feedback signals like thumbs-up or thumbs-down.
* **Business KPIs** that show real impact — for instance, in my payroll assistant project, HR workload was reduced by 38%, and in a recommender system, staffing time improved by 27%.

On the tooling side, I also integrate LangSmith to trace prompts and responses, and I run A/B tests to compare changes before rolling them out broadly.”
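For reference, precision@k can be computed as below. This is a generic sketch; the document IDs are made up.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

def mean_precision_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average precision@k over several (retrieved, relevant) query results."""
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

score = precision_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=3)
```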

### **Q:** How do you sustain efficiency gains post-launch?

### **A:**
“Sustaining gains requires both automation and governance. I automate re-embedding pipelines so the knowledge base stays fresh, and I monitor for model or embedding drift. I version prompts and models to track changes over time, and I incorporate user feedback loops into continuous improvement. I also schedule periodic reviews with stakeholders to validate business outcomes. For example, in my payroll assistant, we re-embedded policy documents quarterly and tracked query resolution, which helped maintain around 85% accuracy consistently post-launch.”



### **Q:** Give an example where stakeholders resisted process changes.

### **A:**
“In the Employee–Project Alignment Engine project, HR stakeholders initially resisted providing structured data because they were used to unstructured records. To address this, I collaborated with the business analyst to clearly explain the benefits and demonstrated the difference between structured vs. unstructured results.

As a compromise, I automated the preprocessing pipeline to handle most of the structuring, while HR only needed to tag skills and certifications. This reduced the manual effort on their side, cut our development time by about 25%, and improved staffing efficiency by 27%. Once they saw the tangible improvements, the stakeholders became strong advocates of the solution.”




### **Q:** How do you measure business impact, not just technical accuracy?
### **A:**
*"I always connect tech → user → business. For example, in the Payroll Assistant: tech metrics = precision\@k, latency; user = 40% queries auto-resolved; business = 55% HR ticket reduction. Similarly, in the Alignment Engine, staffing time reduced 27%, improving resource utilization."*



### **Q:** How would you extend your Payroll Assistant to handle scanned payslips?
### **A:**
*"I’d integrate OCR (Azure OCR/Tesseract) to extract text, store embeddings in Pinecone, and enrich with metadata like employee ID. For multimodal Q\&A, I’d consider CLIP/BLIP or LLaVA to align text-image content, then feed it into the RAG pipeline for grounded responses."*

### **Q:** How would you reduce inference cost when deploying GPT-4-based RAG in production?
### **A:**
*"Strategies:*

* Use smaller models (GPT-3.5, LLaMA2) for simpler queries, GPT-4 only for complex ones.
* Cache frequent queries/responses in Redis.
* Distill/fine-tune smaller models on domain data.
* Optimize retrieval to reduce context length (chunking, filtering).

*In one project, caching + hybrid retrieval cut API costs by ~30%."*
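The routing and caching ideas can be illustrated with a small sketch. The keyword heuristic, the word-count threshold, and the tier names are hypothetical, and `functools.lru_cache` stands in for Redis.

```python
from functools import lru_cache

CALLS = {"small": 0, "large": 0}
COMPLEX_HINTS = ("why", "compare", "explain", "calculate")

def pick_model(query: str, max_simple_words: int = 12) -> str:
    """Hypothetical heuristic: long or reasoning-style questions go to the large model."""
    words = query.lower().split()
    is_complex = len(words) > max_simple_words or any(
        w.strip("?,.") in COMPLEX_HINTS for w in words
    )
    return "large" if is_complex else "small"

@lru_cache(maxsize=1024)
def answer(query: str) -> str:
    model = pick_model(query)
    CALLS[model] += 1          # count billable calls per tier
    return f"[{model}] response to: {query}"

answer("When is payday?")
answer("When is payday?")      # served from cache, no second billable call
answer("Explain why my tax deduction changed this month")
```

A real router would more likely use a classifier or the retrieval score than a keyword list, but the cost accounting works the same way.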

### **Q:** How do you stay updated with AI advancements?
### **A:**
*"I track arXiv, NeurIPS/ACL papers, Hugging Face and LangChain releases, and test new tools via prototypes. I’m active on LinkedIn/GitHub and in LangChain/Pinecone Slack communities. I also upskill via certifications — e.g., Azure, Gen AI bootcamps, and prompt engineering courses."*


Here are the **latest releases / updates** for Hugging Face and LangChain (as of mid-September 2025), along with what they mean.

---

## 🔍 Hugging Face

1. **Transformers library version v4.56.1**

   * A preview version “Vault-Gemma” has been added on top of v4.56.1. ([GitHub][1])
   * Also “Embedding Gemma” (preview) was released on top of v4.56.0. ([GitHub][1])

2. **New dataset: FinePDFs**

   * Hugging Face released **FinePDFs**, the largest publicly available corpus built entirely from PDFs. ([InfoQ][2])
   * It spans \~475 million documents in \~1,733 languages (\~3 trillion tokens) and is \~3.65 TB. Useful for training large models, especially with document-based / domain-specific knowledge. ([InfoQ][2])

3. **New tool: AI Sheets**

   * A no-code/open-source tool for transforming/enriching datasets with AI, in a spreadsheet-like UI. ([InfoQ][3])
   * Enables users to experiment with many open models, for dataset tasks, without writing code. ([InfoQ][3])

---

## ⚙ LangChain

1. **LangChain stable version & pre-releases**

   * The most recent *stable* version is **`0.3.27`** (released July 24, 2025). ([PyPI][4])
   * There are also several *pre-release* / *alpha* / *beta* versions around `1.0.0aX`. For example, `1.0.0a1` was released as an alpha build. ([PyPI][4])

2. **Changelog / New Features via LangSmith & LangGraph**

   * Recent updates include LangSmith allowing **org-scoped API keys with granular permissions**. ([changelog.langchain.com][5])
   * LangGraph Platform now has revision queueing. ([changelog.langchain.com][5])
   * More observability: trace mode in LangGraph Studio, integrating deployment metrics. ([changelog.langchain.com][5])

---


[1]: https://github.com/huggingface/transformers/releases "Releases · huggingface/transformers"
[2]: https://www.infoq.com/news/2025/09/finepdfs/ "Hugging Face Releases FinePDFs: a 3-Trillion-Token ..."
[3]: https://www.infoq.com/news/2025/09/ai-sheets/ "Hugging Face Introduces AI Sheets, a No-Code Tool for ..."
[4]: https://pypi.org/project/langchain/ "langchain"
[5]: https://changelog.langchain.com/ "LangChain - Changelog"



### **Q:** What are the trade-offs between fine-tuning an LLM vs. using prompt engineering or adapters like LoRA/PEFT?

### **A:**
“Each approach has its own trade-offs.

* **Fine-tuning** gives the best performance for domain-specific tasks, but it’s expensive, needs large curated datasets, and increases deployment complexity.
* **Prompt engineering** is fast, cost-effective, and works well for many use cases, but its adaptability is limited.
* **LoRA or PEFT adapters** offer a middle ground — they enable lightweight, parameter-efficient fine-tuning with much lower infrastructure costs.

In enterprise projects, I usually start with prompt engineering combined with RAG to cover the majority of use cases. If precision on domain-specific queries is critical, I move toward LoRA-based fine-tuning.”
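To make the "parameter-efficient" point concrete, here is the arithmetic for one weight matrix. LoRA learns a low-rank update `B @ A` instead of updating the full matrix; the 4096 x 4096 size and rank 8 are illustrative values, not figures from any project above.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains B (d_out x r) and A (r x d_in) instead of the full d_out x d_in matrix."""
    return rank * (d_out + d_in)

full = 4096 * 4096                               # weights touched by full fine-tuning
lora = lora_trainable_params(4096, 4096, rank=8)
ratio = lora / full                              # fraction of weights that are trained
```

With these values the adapter trains 65,536 parameters against 16,777,216 for the full matrix, which is 1/256 of the weights.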

---

### **Q:** How would you handle sensitive data in a Generative AI pipeline?

### **A:**
“I treat sensitive data with a layered approach. Before ingestion, I anonymize or mask PII and avoid storing raw sensitive fields in vector databases. Access is controlled with role-based permissions. At inference time, I redact sensitive tokens and, for compliance-heavy clients, deploy pipelines in secure environments such as on-prem or within a VPC. Logging is limited to metadata, never raw content, and I rely on encryption both in transit and at rest. For example, with an Azure deployment, I implemented these controls to satisfy strict healthcare compliance requirements.”
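A minimal sketch of pre-ingestion masking, assuming regex-based detection of emails and US-style phone numbers only. The patterns and placeholder tokens are illustrative; real deployments typically rely on a dedicated PII detection service with much broader coverage.

```python
import re

# Simplified patterns for illustration; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens before embedding."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

masked = mask_pii("Contact jane.doe@example.com or 555-123-4567 about payslip 42.")
```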

---

### **Q:** How do you debug an LLM that gives inconsistent answers to the same query?

### **A:**
“I start by checking retrieval stability — making sure embeddings and retrievers are deterministic. Next, I adjust generation parameters; lowering temperature usually improves consistency. I also reinforce prompts with clear grounding instructions and, when needed, enforce schemas for structured responses. If inconsistency persists, I add a reranker or a post-validation layer to normalize outputs. In one payroll assistant project, simply reducing temperature from 0.7 to 0.1 improved answer consistency by about 80%.”



### **Q:** What similarity metrics have you used in vector search, and when do you choose one over another?
### **A:**
*"I’ve used cosine similarity, dot product, and Euclidean distance. Cosine is best for sentence embeddings where orientation matters more than magnitude. Dot product is cheaper to compute and, on unit-normalized vectors, produces the same ranking as cosine. Euclidean works well when absolute distances matter, e.g., clustering. In my Employee–Project recommender, cosine similarity gave the most semantically relevant matches."*
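The three metrics are easy to compare on a small example. The vectors below are arbitrary: `b` points in the same direction as `a` but has twice the magnitude.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def norm(a: list[float]) -> float:
    return math.sqrt(dot(a, a))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a: list[float]) -> list[float]:
    n = norm(a)
    return [x / n for x in a]

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, twice the magnitude

cos_ab = cosine(a, b)                            # ignores magnitude
dist_ab = euclidean(a, b)                        # sensitive to magnitude
dot_unit = dot(normalize(a), normalize(b))       # equals cosine on unit vectors
```

Cosine treats `a` and `b` as identical, while Euclidean distance does not. That difference is why cosine suits embeddings whose length carries no meaning.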

---

### **Q:** How would you design a hybrid search pipeline?
### **A:**
*"I’d combine dense embeddings (semantic) with sparse retrieval (BM25/TF-IDF). First, run BM25 for keyword relevance, then embeddings for semantic similarity, and fuse results via re-ranking. This balances lexical precision with semantic recall. Weaviate and Vespa natively support this hybrid approach."*
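One common way to fuse the two result lists is reciprocal rank fusion (RRF), sketched below. RRF is swapped in here as an example fusion step; the answer above does not say which method was used. The document IDs and the constant `k = 60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc by the sum of 1 / (k + rank) over every ranking it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]     # sparse (keyword) ranking
dense_hits = ["d3", "d1", "d4"]    # dense (embedding) ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF uses only ranks, it avoids having to calibrate BM25 scores against cosine similarities.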


### **Q:** How would you handle rate limiting when integrating with OpenAI APIs?
### **A:**
*"I’d implement retries with exponential backoff, batch requests when possible, and add a caching layer for repeated queries. On the backend, I’d enforce per-user rate limits using API Gateway or middleware. This ensures we respect API quotas and keep user experience stable."*
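Retries with exponential backoff and jitter can be sketched as follows. `RateLimitError` is a hypothetical stand-in for whatever exception the client library raises on HTTP 429, and the `sleep` parameter is injected so the demo runs without actually waiting.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn, retrying on RateLimitError with exponentially growing, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise
            # Delay doubles each attempt; jitter spreads out retries from many clients.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)

# Demo: a call that is rate limited twice and then succeeds.
attempts = {"n": 0}

def flaky_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "ok"

delays: list[float] = []
result = call_with_backoff(flaky_call, sleep=delays.append)  # record delays instead of sleeping
```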

---

### **Q:** How do you test Generative AI applications?
### **A:**
*"I use pytest for unit tests (retriever correctness, API endpoints). For LLMs, I apply golden test cases (predefined queries with expected answers) and run them regularly. I also set up evaluation pipelines with LangSmith to benchmark hallucination rate, grounding accuracy, and latency. A/B testing with real users validates improvements before production."*
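A golden-case check can be as simple as the sketch below. The cases, the substring-matching rule, and `fake_assistant` are hypothetical; a real suite would call the actual pipeline and would likely use a more tolerant comparison than substring matching.

```python
GOLDEN_CASES = [
    {"query": "When is salary credited?", "must_contain": ["last working day"]},
    {"query": "What is the overtime rate?", "must_contain": ["1.5"]},
]

def fake_assistant(query: str) -> str:
    """Placeholder for the real RAG pipeline under test."""
    canned = {
        "When is salary credited?": "Salary is credited on the last working day of the month.",
        "What is the overtime rate?": "Overtime is paid at 1.5 times the hourly rate.",
    }
    return canned.get(query, "I don't know.")

def run_golden_suite(assistant, cases: list[dict]) -> list[str]:
    """Return the queries whose answers miss a required phrase."""
    failures = []
    for case in cases:
        answer = assistant(case["query"]).lower()
        if not all(phrase.lower() in answer for phrase in case["must_contain"]):
            failures.append(case["query"])
    return failures

failures = run_golden_suite(fake_assistant, GOLDEN_CASES)
```

Wrapped in a pytest test, an empty `failures` list becomes the pass condition for each release.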



### **Q:** How do you monitor drift in embeddings or retrievers?
### **A:**
*"I log retrieval precision\@k over time and compare against baseline benchmarks. If relevance drops, it signals drift due to new data or embedding model updates. I then trigger a re-embedding pipeline and revalidate."*
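A simple form of this check compares a recent window of logged precision@k values against the baseline. The window size and tolerance below are arbitrary illustrative defaults.

```python
from statistics import mean

def drift_detected(history: list[float], baseline: float,
                   window: int = 7, tolerance: float = 0.05) -> bool:
    """Flag drift when the recent average precision@k falls below baseline - tolerance."""
    recent = history[-window:]
    return bool(recent) and mean(recent) < baseline - tolerance

stable = drift_detected([0.82, 0.81, 0.80], baseline=0.80)
degraded = drift_detected([0.70, 0.68, 0.72], baseline=0.80)
```

When the flag is raised, the re-embedding and revalidation pipeline would be triggered.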

---

### **Q:** What are the key KPIs you track for production Gen AI systems?
### **A:**

* **Tech:** Latency, precision\@k, token usage, cost/query
* **User:** Resolution rate, thumbs-up/down ratio, active user adoption
* **Business:** % workload reduction, time saved, financial savings



### **Q:** How would you extend an NLP RAG system to multimodal inputs like images or audio?
### **A:**
*"For images, I’d extract embeddings using CLIP or BLIP, store them alongside text embeddings in the vector DB, and retrieve multimodal chunks for context. For audio, I’d apply Whisper for speech-to-text and then feed text into the RAG pipeline. If multimodal reasoning is needed, models like LLaVA or GPT-4V can process text + vision inputs together."*

---

### **Q:** Example: You need to summarize a video meeting. How would you design the pipeline?
### **A:**
*"Step 1: Use Whisper or Azure Speech-to-Text for transcription.
Step 2: Chunk and embed transcript into a vector DB.
Step 3: Apply RAG summarization with GPT-4.
Step 4: Optionally extract visuals/slides using CLIP + OCR for context.
This gives a multimodal, context-rich meeting summary."*
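Step 2 (chunking the transcript) might look like the sketch below, assuming word-based chunks with overlap so that sentences cut at a boundary still appear whole in one chunk. The chunk size and overlap are illustrative defaults.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):   # last chunk already reaches the end
            break
    return chunks

transcript = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(transcript, size=4, overlap=1)
```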



### **Q:** How do you optimize cost in production LLM systems?
### **A:**
*"Strategies include: caching frequent queries, routing simple requests to smaller models, truncating unnecessary context, using retrieval filters to reduce token count, and experimenting with quantization/distilled models. In one project, caching + hybrid retrieval reduced token cost by \~30%."*

---

### **Q:** GPT-4 is too slow and expensive. What alternatives would you suggest?
### **A:**
*"For cost-sensitive tasks, I’d explore GPT-3.5, LLaMA-2, or Mistral. For domain-specific use cases, fine-tuning a smaller model with LoRA on enterprise data gives efficiency without recurring API costs. If high precision is required, I’d use GPT-4 selectively with fallback routing."*



### **Q:** Tell me about a time you disagreed with a stakeholder on AI design.
### **A:**
*"In the Alignment Engine, HR resisted structured data. I disagreed because unstructured input degraded accuracy. I built a demo comparing results, and after seeing the improvement, stakeholders agreed to a hybrid approach — we automated 70% of preprocessing, while HR provided critical metadata. This reduced dev time by 25% and improved adoption."*

---

### **Q:** How do you explain AI concepts to non-technical stakeholders?
### **A:**
*"I use analogies and focus on outcomes, not algorithms. For example, I explained embeddings to HR as ‘turning documents into fingerprints so the AI can find the closest match.’ I back it up with demos so they see the impact directly, which builds trust."*

---

### **Q:** What’s your strongest achievement in AI so far?
### **A:**
*"The Payroll Gen AI Assistant at Globant. It automated \~40% of payroll queries using a RAG pipeline (LangChain + Chroma + GPT-4). The project reduced HR ticket volume by 55% and saved significant time, showing my ability to deliver scalable, high-impact Gen AI solutions in production."*



## **Introduction**

### **Q1. Please introduce yourself.**
### **A1 (Sample):**
“Hi, I’m Suraj Khodade, an AI/ML Engineer with over 6 years of IT experience, including 4+ years in Artificial Intelligence, Generative AI, and NLP. I specialize in building Gen AI applications using LangChain, Hugging Face, and OpenAI, with strong expertise in Retrieval-Augmented Generation, embeddings, and backend integration using Python frameworks like FastAPI. I’ve deployed production-grade AI solutions across healthcare, enterprise, and HR domains, and I’m certified in Azure Cloud. Recently at Globant, I built Gen AI-powered payroll and HR assistants, invoice summarization chatbots, and multimodal AI pipelines. I’m passionate about delivering scalable, impactful AI applications.”

---

## **Technical Questions**

### **Generative AI & LLMs**

### **Q2. Can you explain what Retrieval-Augmented Generation (RAG) is and why it is important?**
### **A2:**
“RAG combines the generative capabilities of LLMs with external knowledge retrieval. Instead of relying only on model parameters, it retrieves relevant context from vector databases like FAISS, Chroma, or Pinecone, and feeds it into the LLM prompt. This improves factual accuracy, reduces hallucinations, and ensures domain-specific adaptability. For example, I built a payroll Gen AI assistant at Globant where RAG fetched compliance documents from Chroma DB to answer queries reliably.”

---

### **Q3. How do you optimize LLM performance for production use?**
### **A3:**

* Prompt engineering and prompt templates
* Caching frequent responses
* Tuning decoding parameters like temperature and max tokens
* Using embeddings for context compression
* Employing observability tools like LangSmith for tracing
* Optimizing retrieval pipelines with hybrid search (BM25 + embeddings)

---

### **Q4. You’ve worked with Chroma and FAISS. If tomorrow you migrate to Pinecone, what factors would you evaluate?**
### **A4:**

* **Scalability:** Pinecone’s managed infrastructure can scale horizontally
* **Persistence & durability:** Cloud-native persistence vs local FAISS
* **Metadata filtering:** Advanced filtering in Pinecone for domain-specific queries
* **Latency & performance benchmarks**
* **Cost considerations** compared to open-source FAISS/Chroma
* **Integration ease** with LangChain

---

### **Backend & Python**

### **Q5. How would you design a FastAPI service to expose your RAG pipeline?**
### **A5:**

* Define endpoints like `/query`, `/add_documents`, `/healthcheck`.
* Use async endpoints for concurrency (async/await).
* Integrate middleware for logging & auth.
* Backend flow: query → embedding generation → vector DB retrieval → LLM response → return JSON.
* Use Celery/Redis for background tasks (document ingestion).
* Containerize with Docker for deployment.

---

### **Q6. How do you handle concurrency in Python backend applications?**
### **A6:**

* Using `asyncio` and frameworks like FastAPI.
* For I/O-bound tasks, async coroutines ensure non-blocking execution.
* For CPU-heavy tasks, offload to multiprocessing or Celery workers.
* Example: In payroll chatbot ingestion pipeline, used async DB calls to handle parallel document inserts without blocking user queries.
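The async insert pattern from the example above can be sketched with `asyncio.gather`. The `insert_document` coroutine only simulates a non-blocking database write with a short sleep; it is a placeholder, not a real driver call.

```python
import asyncio
import time

async def insert_document(doc_id: int) -> int:
    """Simulated non-blocking DB write; a real version would await an async driver."""
    await asyncio.sleep(0.05)
    return doc_id

async def ingest(n: int) -> list[int]:
    """Run all inserts concurrently instead of one after another."""
    return list(await asyncio.gather(*(insert_document(i) for i in range(n))))

start = time.perf_counter()
inserted = asyncio.run(ingest(20))
elapsed = time.perf_counter() - start
```

Run sequentially, the 20 inserts would take about one second; run concurrently, they finish in roughly the time of a single insert.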

---

### **Q7. Can you walk me through one of your Gen AI projects and its impact?**
### **A7 (Payroll Query Assistant – Globant):**

* Built an LLM-powered chatbot using LangChain + OpenAI for payroll/HR queries.
* Integrated Chroma DB for RAG + LangSmith for observability.
* Reduced HR workload by **38%**, improved accuracy in compliance explanations.
* Stack: Python, FastAPI, LangChain, Hugging Face, Chroma DB, Docker.

---

### **Data & ML**

### **Q8. Tell me about a time you built a recommendation system.**
### **A8 (Employee-Project Alignment Engine – CitiusTech):**

* Used cosine similarity over TF-IDF vectors.
* Matched employee skills to projects for staffing.
* Result: Reduced staffing time by 27%, improved match score by 21%.
* Learned importance of semantic similarity over keyword search.
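A small standard-library sketch of TF-IDF plus cosine matching, assuming whitespace tokenization and a smoothed IDF. The skill profiles and names are invented for illustration and are not data from the project.

```python
import math
from collections import Counter

def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) with a smoothed IDF."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n = len(tokenized)
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[name] = {
            t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
    return vectors

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

profiles = {
    "project": "python fastapi nlp",
    "alice": "python nlp fastapi docker",
    "bob": "java spring sql",
}
vecs = tfidf_vectors(profiles)
best = max(("alice", "bob"), key=lambda p: cosine(vecs["project"], vecs[p]))
```

Because TF-IDF only matches exact terms, a profile listing "natural language processing" would score zero against "nlp", which is the limitation the last bullet points to.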

---

## **Behavioral Questions**

### **Q9. Tell me about a challenge you faced when stakeholders resisted AI adoption.**
### **A9:**
“At CitiusTech, some stakeholders were hesitant to adopt an AI-driven fraud detection system due to trust concerns. I addressed this by conducting explainability sessions, showing how SHAP values explained model predictions, and running a pilot with small datasets. Once they saw improved fraud detection precision, adoption followed smoothly.”

---

### **Q10. How do you stay updated with Generative AI advancements?**
### **A10:**

* Regularly follow Hugging Face, LangChain, and OpenAI release notes.
* Engage in bootcamps (recently completed a Gen AI Bootcamp).
* Explore arXiv research papers.
* Actively contribute to GitHub projects and prototypes.

---

### **Q11. If you join Coditas, how will you ensure your solutions are maintainable and scalable?**
### **A11:**

* Follow clean code and SOLID principles.
* Implement observability (logs, metrics, tracing).
* Ensure modular RAG pipelines with clear boundaries.
* Containerized deployments with CI/CD.
* Documentation and knowledge transfer sessions.



### 🔹 Conversational Simulation

### **AI Interviewer:** Hi Suraj, welcome to this virtual interview. Let’s start with a quick introduction. Could you tell me a bit about yourself?

### **You:**
“Sure. I’m Suraj Khodade, an AI/ML Engineer with 6+ years of IT experience, including 4+ years in Artificial Intelligence and Generative AI. My expertise lies in building Gen AI applications using LangChain, Hugging Face, and OpenAI, with strong focus on Retrieval-Augmented Generation pipelines, embeddings, and backend integration using Python frameworks like FastAPI. Recently, I’ve delivered production-grade solutions such as payroll query assistants, invoice summarization bots, and recommendation engines that improved efficiency and reduced manual workloads. I’m also certified in Azure Cloud and passionate about creating scalable and reliable AI systems.”

---

### **AI Interviewer:** Great. You mentioned Retrieval-Augmented Generation. Could you explain it in simple terms and share how you applied it in a project?

### **You:**
“Retrieval-Augmented Generation, or RAG, enhances LLM responses by combining them with external knowledge from a vector database. Instead of the model relying only on its training data, the pipeline embeds the user query, retrieves the most relevant context documents, and feeds them into the LLM prompt.
At Globant, I implemented a payroll Gen AI assistant where queries were answered using compliance and HR policy documents stored in Chroma DB. This approach reduced hallucinations and ensured context-aware, accurate answers.”

---

### **AI Interviewer:** Interesting. If you were asked to migrate from Chroma or FAISS to Pinecone, what would you evaluate before making that shift?

### **You:**
“I’d evaluate factors such as scalability, since Pinecone offers a managed cloud-native infrastructure; persistence and durability compared to FAISS; metadata filtering capabilities; latency and query performance; and overall cost of adoption. I’d also check how easily Pinecone integrates with LangChain for minimal disruption to the current RAG pipeline.”

---

### **AI Interviewer:** Let’s talk backend. Suppose you want to expose your RAG pipeline as an API. How would you structure it in FastAPI?

### **You:**
“I’d create a RESTful service with endpoints like `/query` for user input, `/add_documents` for ingestion, and `/healthcheck` for monitoring. Each endpoint would be asynchronous to handle concurrent requests efficiently. The flow would be: user query → embedding generation → vector DB retrieval → LLM response → structured JSON output. I’d also include middleware for logging, authentication, and error handling, and use Celery for background ingestion tasks. Finally, I’d containerize with Docker for deployment.”

---

### **AI Interviewer:** Good. Now, let’s move to a bit of coding mindset. In your experience, how do you handle concurrency in Python when building APIs?

### **You:**
“For I/O-bound tasks, I use async/await with frameworks like FastAPI. For CPU-heavy operations, I prefer multiprocessing or background workers using Celery. For example, in a document ingestion pipeline, I used async DB calls for parallel inserts, ensuring the chatbot could still serve queries without being blocked.”

---

### **AI Interviewer:** Let’s discuss a project. Tell me about the payroll assistant you built and its impact.

### **You:**
“I designed an LLM-powered chatbot for payroll/HR queries using LangChain and OpenAI. It leveraged Chroma DB for RAG and LangSmith for observability. The assistant automated compliance explanations and reduced HR workload by 38%. It was deployed using FastAPI and Docker, making it scalable and maintainable for enterprise use.”

---

### **AI Interviewer:** Excellent. Now let’s test your problem-solving. Imagine a stakeholder resists adopting an AI solution due to trust concerns. How would you handle it?

### **You:**
“I would first acknowledge their concerns and demonstrate explainability. For instance, in a fraud detection project, stakeholders were worried about black-box predictions. I introduced SHAP values to explain model outputs and ran a pilot with small datasets. This helped them build confidence in the system, and eventually, they agreed to scale it.”

---

### **AI Interviewer:** Good response. One last question: How do you ensure your AI solutions are both maintainable and scalable?

### **You:**
“I follow clean code and SOLID principles and keep the architecture modular. I also integrate observability for logs, metrics, and tracing. For scalability, I containerize solutions with Docker, manage CI/CD pipelines, and design RAG workflows that can handle larger datasets. Finally, I ensure documentation and knowledge sharing so the solution is sustainable beyond my individual contribution.”

---




### 🔹 Conversational Simulation – Round 2

### **AI Interviewer:** Hi Suraj, let’s continue. To begin, what excites you most about Generative AI, and why do you want to join Coditas?

### **You:**
“What excites me about Generative AI is its ability to transform unstructured data into actionable intelligence. I’ve seen its impact in domains like payroll and healthcare where automation reduces manual workload and improves accuracy. At Coditas, I see the opportunity to apply my skills in building scalable Gen AI products that combine strong backend engineering with cutting-edge AI research. Coditas’ culture of innovation and impact-driven projects aligns perfectly with my career goals.”

---

### **AI Interviewer:** Can you explain the difference between fine-tuning an LLM and using embeddings with RAG? When would you choose one over the other?

### **You:**
“Fine-tuning adapts the base LLM by updating weights with domain-specific data, making it naturally better at certain tasks. However, it requires compute resources and retraining when knowledge changes.
RAG, on the other hand, keeps the base LLM frozen but retrieves context from an external knowledge base at runtime. This is cost-effective and dynamic for rapidly changing data.
I’d use fine-tuning for highly repetitive, domain-specific tasks like classification or summarization. I’d prefer RAG for dynamic knowledge domains, such as HR policies or finance regulations, where content changes frequently.”

---

### **AI Interviewer:** Imagine your RAG pipeline starts giving irrelevant results. What are the top three steps you’d take to debug it?

### **You:**
“First, I’d check embedding quality – ensuring the model used is appropriate for the domain. Second, I’d validate vector database indexing and confirm the right similarity metric (cosine, dot product, etc.). Third, I’d review the prompt construction to ensure retrieved context is properly injected. Tools like LangSmith also help in tracing where the pipeline fails.”

---

### **AI Interviewer:** Tell me about how you’ve used observability in LLM applications.

### **You:**
“In my payroll assistant project, I integrated LangSmith for observability. It allowed me to trace each query — from embedding generation to document retrieval and LLM response. I monitored latency, token usage, and accuracy of responses. Observability helped detect when the model was hallucinating or when retrieval was weak, enabling continuous improvement of the pipeline.”

---

### **AI Interviewer:** Let’s shift gears. How do you ensure API performance and scalability in Python backends?

### **You:**
“I optimize database queries with indexing and caching, use async frameworks like FastAPI to handle concurrency, and apply pagination to manage large responses. For scalability, I containerize with Docker, use load balancers, and apply CI/CD for deployments. Profiling with cProfile and benchmarking with pytest-benchmark also help me find and tune bottlenecks.”

---

### **AI Interviewer:** How would you integrate multimodal data, say text and images, into a Gen AI workflow?

### **You:**
“I’d use a multimodal embedding model like CLIP to generate vector representations of both text and images in the same embedding space. These embeddings can then be stored in a vector DB like Weaviate. When a query is made, similarity search retrieves both text and image context. The retrieved context is then passed into the LLM, enabling cross-modal reasoning.”

---

### **AI Interviewer:** You have experience with Docker. Suppose you are deploying a FastAPI + LangChain service. How would you set it up for production?

### **You:**
“I’d create a lightweight Dockerfile using Python slim base image, add dependencies, and expose the FastAPI app via Uvicorn/Gunicorn for production-grade performance. I’d configure environment variables for API keys, use volume mounts for logs, and add a healthcheck endpoint. The container would be deployed on Azure Kubernetes Service for scaling, with CI/CD pipelines managing updates.”

---

### **AI Interviewer:** Let’s cover behavioral. Tell me about a time you had to quickly learn a new technology to deliver a project.

### **You:**
“At CitiusTech, I was primarily working in .NET when I was asked to lead a project involving Python-based ML workflows. I had limited exposure at the time, so I upskilled myself rapidly through hands-on prototyping, internal bootcamps, and weekend practice. Within a few months, I delivered a semantic search engine in Python that improved project-to-skill matching by 21%. That experience shaped my career transition into full-time AI/ML.”

---

### **AI Interviewer:** Finally, how do you measure the success of a Generative AI solution?

### **You:**
“I measure success using both technical and business metrics. On the technical side, I track response accuracy, latency, token usage, and user satisfaction through feedback loops. On the business side, I look at measurable impact such as reduction in manual workload, faster decision-making, or cost savings. For example, my payroll assistant reduced HR workload by 38% — that was a direct indicator of success.”

---



### **Q1. What was your role in the Payroll project?**
### **A:**
“In the Payroll Gen AI Assistant project, my role was to design and implement the core Generative AI pipeline. I built the Retrieval-Augmented Generation workflow using Chroma DB and LangChain, integrated OpenAI models for contextual responses, and added observability through LangSmith. I also developed the FastAPI backend to expose the assistant as a service. Beyond technical work, I collaborated with HR stakeholders to refine requirements and measured success by tracking the reduction in HR workload, which came to around 38%.”

---

### **Q2. How do you ensure smooth component integration within the internal team?**
### **A:**
“I ensure smooth integration by following three steps:

* **Clear interfaces:** defining well-documented APIs and contracts between components.
* **Continuous integration:** using Git, code reviews, and CI/CD pipelines so changes are tested early.
* **Collaboration:** frequent syncs with frontend, data, and DevOps colleagues to surface blockers early.

For example, in the payroll assistant, I aligned closely with the frontend team to standardize the response schema so integration was seamless.”

---

### **Q3. Do you have any feedback?**
### **A:**
“Yes — I found the interview experience very engaging and structured. The questions were relevant and allowed me to explain both my technical expertise and business impact. If anything, I appreciate how the process balances technical depth with soft skills, which reflects well on how the team values holistic contributions.”

---

### **Q4. Do you have any message for the team?**
### **A:**
“My message would be that I’m genuinely excited about the possibility of joining Coditas. I see strong alignment between my experience in building scalable Gen AI applications and the kind of impactful solutions your team is driving. I’d be glad to contribute not just technically but also by collaborating across functions to deliver real business value.”

