
## 🔹 **Foundations**

**Q1. What is tokenization and why is subword tokenization preferred today?**
**A:** Tokenization means breaking text into smaller units—like words, characters, or subwords—so a model can process them. Subword tokenization (like BPE or WordPiece) strikes a balance: it keeps vocabulary size manageable while still handling rare or unseen words gracefully. It also captures meaningful parts of words (like prefixes/suffixes), which helps models generalize better.

---

**Q2. Stemming vs Lemmatization—when to use which?**
**A:** Stemming is a fast heuristic that strips word endings (“running” → “run”), but it can produce non-words (the Porter stemmer maps “studies” to “studi”). Lemmatization uses a vocabulary and part-of-speech information to return the dictionary form (“better” → “good”). When speed matters more than precision, as in large search indexes, stemming is usually enough. For tasks that need linguistic accuracy, such as chatbots or information extraction, prefer lemmatization.

---

**Q3. Stopwords—should we remove them?**
**A:** For sparse representations like Bag-of-Words or TF-IDF, usually yes, since it reduces noise and dimensionality (though negation words are worth keeping for sentiment tasks). For transformer models like BERT, no, since stopwords can change meaning (for example, “not happy” ≠ “happy”).

---

**Q4. BoW vs TF-IDF vs Embeddings**
**A:**

* **BoW:** Simple word counts, no notion of importance.
* **TF-IDF:** Highlights rare but meaningful words by downweighting common ones.
* **Embeddings:** Capture deeper semantic meaning, allowing words like *“king”* and *“queen”* to be close in vector space. Contextual embeddings (like BERT) even adjust based on sentence meaning.
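
To make the BoW versus TF-IDF contrast concrete, here is a minimal standard-library sketch. It assumes whitespace tokenization and one common weighting variant, `tf * (log(N / df) + 1)`; the function names and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def bow(doc):
    """Bag-of-Words: raw term counts for one whitespace-tokenized document."""
    return Counter(doc.lower().split())

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * (log(N / df) + 1)."""
    counts = [bow(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for c in counts for term in c)
    return [
        {t: tf * (math.log(n_docs / df[t]) + 1.0) for t, tf in c.items()}
        for c in counts
    ]

docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tfidf(docs)
# In BoW every word of "the cat sat" has count 1; TF-IDF ranks the rarer
# word "sat" above "cat", and "cat" above the ubiquitous "the".
```

Library implementations differ in smoothing and normalization, so absolute values will not match this sketch exactly.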

---

## 🔹 **Classic Modeling**

**Q5. Why use n-grams? What’s the catch?**
**A:** N-grams capture short word sequences like “New York” or “machine learning.” The drawback? The number of possible n-grams grows exponentially with n, so counts become sparse and unseen sequences get zero probability under maximum-likelihood estimates. That’s why we use smoothing and later switched to neural or subword models.

---

**Q6. What is smoothing in language modeling?**
**A:** Smoothing redistributes probability to unseen n-grams so they aren’t zero. Techniques like Kneser-Ney or Good-Turing help models generalize and reduce perplexity.
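
As a rough illustration, the sketch below uses add-k (Laplace) smoothing on a bigram model. It is far simpler than Kneser-Ney or Good-Turing, but it shows the same effect of moving probability mass to unseen pairs. The toy sentences and function names are assumptions for the example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def prob(prev, word, unigrams, bigrams, k=1.0):
    """Add-k estimate of P(word | prev); k=0 gives the unsmoothed estimate."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

unigrams, bigrams = train_bigram(["the cat sat", "the dog sat"])
unsmoothed = prob("the", "sat", unigrams, bigrams, k=0.0)  # pair never seen
smoothed = prob("the", "sat", unigrams, bigrams, k=1.0)    # now non-zero
```

With `k=0` the unseen pair ("the", "sat") gets probability zero; with `k=1` it gets a small share while the distribution over the vocabulary still sums to one.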

---

**Q7. When are CRFs preferred?**
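**A:** In sequence labeling tasks like NER or POS tagging, where neighboring labels constrain each other (in BIO tagging of “New York City”, `I-LOC` should only follow `B-LOC` or `I-LOC`). A CRF scores the whole label sequence jointly, unlike a classifier that predicts each token's label independently.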

---

## 🔹 **Embeddings**

**Q8. Compare Word2Vec, GloVe, FastText, and contextual embeddings.**
**A:**

* **Word2Vec:** Learns from local context windows.
* **GloVe:** Uses global co-occurrence statistics.
* **FastText:** Breaks words into character n-grams, great for rare or morphologically rich languages.
* **Contextual (ELMo/BERT):** Word meaning depends on context—so “bank” in “river bank” vs “credit bank” differs.

---

**Q9. How do we handle Out-of-Vocabulary (OOV) words?**
**A:** Subword tokenization (like BPE or WordPiece) or models like FastText that build word vectors from character n-grams solve OOV problems effectively. Avoiding a single `<UNK>` token is key.

---

## 🔹 **Transformers and Attention**

**Q10. Explain self-attention and its complexity.**
**A:** Self-attention lets each token look at every other token to understand relationships. It’s powerful but computationally heavy—O(n²) with respect to sequence length—so researchers use optimized versions like sparse or linear attention.
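
A minimal pure-Python sketch of scaled dot-product attention makes the quadratic cost visible: the score matrix has one entry per pair of tokens. For brevity it assumes the queries, keys, and values are passed in directly, whereas a real layer derives them through learned projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(q, k, v):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(q[0])
    # n x n score matrix: this is where the O(n^2) cost comes from.
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        for qi in q
    ]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of the value vectors.
    out = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return out, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
out, weights = self_attention(x, x, x)
```

Doubling the number of tokens quadruples the size of `weights`, which is the cost that sparse and linear variants try to avoid.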

---

**Q11. Why did Transformers replace RNNs/LSTMs?**
**A:** Transformers process all tokens simultaneously (better parallelism), capture long-range dependencies, and scale beautifully with large datasets and compute. That’s why models like BERT and GPT became dominant.

---

**Q12. Encoder vs Decoder vs Encoder-Decoder?**
**A:**

* **Encoder:** Understands input (BERT).
* **Decoder:** Generates text (GPT).
* **Encoder-Decoder:** Translates or summarizes by mapping input to output (T5, BART).

---

## 🔹 **Pretraining and Adaptation**

**Q13. Common pretraining objectives?**
**A:**

* **BERT:** Masked Language Modeling (MLM)
* **GPT:** Next-token prediction (causal LM)
* **T5/BART:** Denoising seq-to-seq tasks
* **Contrastive:** Used for retrieval or alignment (e.g., CLIP)

---

**Q14. Full fine-tuning vs adapters vs prompt-tuning?**
**A:**

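* **Full fine-tuning:** Update all model parameters (best performance, high cost).
* **Adapters/prefix-tuning:** Freeze the base model and train small added modules or prefix vectors, which is much cheaper.
* **Prompt-tuning:** Freeze the model and learn only a few soft prompt embeddings prepended to the input.
* Use **PEFT** methods when you need multiple domain models or cost-efficient deployment.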

---

**Q15. In-context learning vs fine-tuning?**
**A:** In ICL, we guide the model with examples in the prompt—no retraining. Fine-tuning updates model weights. ICL is quick and flexible; fine-tuning is stable for long-term, domain-specific use.

---

## 🔹 **Tokenization Details**

**Q16. WordPiece vs BPE vs UnigramLM**
**A:**

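* **BPE:** Greedily merges the most frequent adjacent symbol pair.
* **WordPiece:** Picks the merge that most increases training-data likelihood rather than raw pair frequency.
* **UnigramLM:** Starts from a large candidate vocabulary and prunes the tokens that contribute least to corpus likelihood.

The choice affects sequence length and rare-word handling.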

---

**Q17. Normalization and special tokens?**
**A:** Normalize case/Unicode, handle tokens like `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`. Padding must be masked during attention to avoid biasing results.
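
The padding point can be sketched with a masked softmax: padded positions get a score of negative infinity, so they receive exactly zero attention weight. This is a simplified single-row illustration that assumes at least one real (unmasked) token per row.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores; mask[i] is 1 for real tokens and 0 for padding."""
    neg_inf = float("-inf")
    masked = [s if m else neg_inf for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# The third position is [PAD]; despite its high raw score it gets no weight.
attn = masked_softmax([2.0, 1.0, 5.0], [1, 1, 0])
```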

---

## 🔹 **Sequence Labeling and Parsing**

**Q18. How to build a strong NER today?**
**A:** Use a pretrained transformer encoder (like BERT) with a token classification head or CRF decoder. Add domain-specific dictionaries and span models for nested entities. Track recall through domain evaluation.

---

**Q19. POS/Dependency parsing best practices?**
**A:** Use contextual encoders with biaffine or graph-based decoders. For multilingual tasks, shared subword vocabularies and adapters work well.

---

## 🔹 **Classification and Retrieval**

**Q20. Robust text classification architecture?**
**A:** Encoder (like RoBERTa) → pooling layer (CLS or mean) → linear classifier. Handle imbalance with focal loss or weighted sampling. Evaluate per-class F1 for clarity.

---

**Q21. What is RAG and why use it?**
**A:** Retrieval-Augmented Generation (RAG) enhances LLMs with real documents to ground answers in facts. It reduces hallucinations and keeps outputs up-to-date by combining retrievers (BM25 or dual encoders) with generators.

---

## 🔹 **Evaluation**

**Q22. Perplexity, BLEU, ROUGE, METEOR, BERTScore—when to use?**
**A:**

* **Perplexity:** Language modeling quality.
* **BLEU:** Precision for machine translation.
* **ROUGE:** Recall for summarization.
* **METEOR:** Includes synonyms/stemming.
* **BERTScore:** Contextual similarity, often closer to human judgment.

---

**Q23. Macro vs Micro F1?**
**A:**

* **Macro F1:** Unweighted mean of per-class F1, so minority classes count as much as majority ones. Good for imbalance analysis.
* **Micro F1:** Pools TP/FP/FN over all predictions; in single-label classification it equals accuracy.

Always report both.
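
A small worked example, using made-up labels, shows how far the two can diverge when a classifier ignores the minority class:

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1, per_class_f1) for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    tot_tp = tot_fp = tot_fn = 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        per_class[c] = f1(tp, fp, fn)
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn
    macro = sum(per_class.values()) / len(labels)
    micro = f1(tot_tp, tot_fp, tot_fn)
    return macro, micro, per_class

# Toy data: 8 majority examples, 2 minority, and a model that always says "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
macro, micro, per_class = macro_micro_f1(y_true, y_pred)
```

Here micro F1 is 0.8 (the same as accuracy) while macro F1 is about 0.44, because class "b" scores zero.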

---

## 🔹 **Data Quality and Imbalance**

**Q24. How to handle noisy labels?**
**A:** Use noise-tolerant training (label smoothing or robust losses), confidence-based filtering of suspect examples, and a small clean validation set. Analyze errors by data slice and retrain iteratively.

---

**Q25. Handling class imbalance?**
**A:** Techniques include oversampling, class weighting, focal loss, or threshold tuning. Evaluate using PR-AUC or per-class F1 rather than overall accuracy.
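
For reference, binary focal loss can be sketched in a few lines. The defaults `gamma=2.0` and `alpha=0.25` are commonly used values, not settings taken from this document, and the function handles a single example for clarity.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example; p is the predicted P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of easy, well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)  # confident and correct: tiny loss
hard = focal_loss(0.1, 1)  # confident and wrong: dominates training
```

With `gamma=0` the modulating factor disappears and the expression reduces to class-weighted cross-entropy.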

---

## 🔹 **Production and Safety**

**Q26. How to reduce hallucinations in LLM apps?**
**A:** Add retrieval grounding, constrained decoding, validation checks, and feedback loops. Use instruction-tuned models and safe fallback responses.

---

**Q27. Defenses against prompt injection/jailbreaks?**
**A:** Sanitize inputs, restrict accessible tools, isolate retrieval contexts, and perform adversarial testing. Never execute model outputs directly.

---

**Q28. What should be monitored post-deployment?**
**A:** Track quality, drift (embedding stats), latency, cost, and safety (toxicity/PII). Implement feedback loops and versioned rollbacks.

---

## 🔹 **Ethics and Privacy**

**Q29. How to detect and mitigate bias?**
**A:** Measure subgroup metrics, test counterfactuals, balance training data, and use adversarial training or calibration. Publish model cards to ensure transparency.

---

**Q30. Privacy-preserving NLP?**
**A:** Use PII redaction, differential privacy, federated learning, and strict data retention policies. Keep sensitive information isolated from inference logs.

---

## 🔹 **Efficiency and Scaling**

**Q31. How to speed up Transformer inference?**
**A:** Quantize (INT8/INT4), prune redundant weights, distill to a smaller model, use KV caching during autoregressive decoding, and serve through optimized runtimes like ONNX Runtime or TensorRT.
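
To give a feel for what INT8 quantization does, here is a toy sketch of symmetric per-tensor quantization on a plain list of floats. Production toolchains add calibration, per-channel scales, and fused integer kernels, none of which is shown here.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight is within half a quantization step of the original.
```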

---

**Q32. When to distill a model?**
**A:** When latency or cost is critical. A smaller “student” is trained to mimic a larger “teacher” model's outputs, keeping most of the teacher's quality while cutting inference time.

---

## 🔹 **Multilingual NLP**

**Q33. How does zero-shot cross-lingual transfer work?**
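**A:** Multilingual models (like mBERT) are pretrained on many languages with a shared subword vocabulary, which yields partially aligned representations. Fine-tuning on labeled data in one language (often English) then transfers to other languages that have no task-specific labels.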

---

**Q34. Tokenization challenges for Indic/CJK scripts?**
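**A:** Chinese and Japanese are written without spaces between words, so segmentation must be learned. Indic scripts do use spaces but rely on combining marks and conjuncts that naive character or byte splitting can break apart. Careful segmentation and Unicode normalization are crucial to avoid token fragmentation and loss of meaning.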

---

## 🔹 **LLM Practicalities**

**Q35. Why does context window size matter?**
**A:** It limits how much text the model can “see” at once. Larger windows reduce truncation issues but cost more in compute. Retrieval or summarization can help manage length efficiently.

---

**Q36. Guardrails and tool-use orchestration?**
**A:** Use structured prompts, function calling with schema validation, and deterministic routing for critical outputs. Always enforce post-processing checks.

---

## 🔹 **Design / Case Scenarios**

**Q37. Design a production NER for noisy support tickets.**
**A:**

1. Define clear annotation guidelines.
2. Fine-tune a domain BERT + CRF.
3. Add custom dictionaries.
4. Handle imbalance via focal loss.
5. Deploy with confidence thresholds + human review.
6. Continuously monitor drift and re-train.

---

**Q38. Build a multilingual sentiment model on a budget.**
**A:** Use XLM-R with PEFT adapters, augment data via translation, and distill to a smaller model. Calibrate per language and monitor drift regularly.

---

## 🔹 **Coding-style Probes**

**Q39. Compute TF-IDF and top features per class.**
**A:** Use `TfidfVectorizer` + `LogisticRegression`. Then check model coefficients for top positive/negative n-grams per class. Ensure train/test split avoids leakage.

---

**Q40. Sketch a simple BPE tokenizer.**
**A:**

1. Start with a character-level vocab.
2. Merge most frequent pairs iteratively.
3. Encode by applying the learned merges in the order they were learned (greedy longest-match is the WordPiece approach).
4. Keep special tokens and handle unknowns gracefully.
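
The steps above can be sketched as follows. This is a toy version for whiteboard use: it assumes pre-split words, marks word ends with a hypothetical `</w>` symbol, and encodes by replaying the learned merges in training order, which is how standard BPE applies its merge table. Unseen characters simply stay as single-character symbols.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn an ordered list of merges, starting from single characters."""
    corpus = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Tokenize one word by replaying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, num_merges=10)
tokens = encode("lowest", merges)  # unseen word built from known pieces
```

Special tokens such as `[PAD]` would be added to the vocabulary separately and never split.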




## 🚑 **Project 1: US Healthcare Insurance Claim Fraud Analysis**

**Q1. What problem were you solving and why NLP?**
We aimed to detect fraudulent healthcare claims, where a significant portion of data lies in unstructured text — such as procedure notes and diagnosis descriptions. NLP helped extract linguistic and contextual patterns that aren’t visible in numeric features, thereby improving fraud detection accuracy and reducing manual investigator workload.

---

**Q2. What was your end-to-end pipeline?**
Data ingestion (Pandas) → text cleaning (normalizing ICD/CPT codes, lowercasing, punctuation removal) → feature engineering (TF-IDF n-grams, medical term flags, code co-occurrence stats) → model training (XGBoost/CatBoost) → PR-AUC-based threshold tuning → Streamlit dashboard for investigator insights.

---

**Q3. How did you address class imbalance?**
Fraud cases were under 1% of claims. We used **class weighting** and **PR-AUC optimization** to balance recall and precision. Thresholds were then chosen based on business capacity: precision was fixed at ~90% to match investigator bandwidth.
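
One way to pick such a threshold is sketched below: scan the scores from highest to lowest and keep the lowest cut-off that still meets the precision target, which maximizes recall under that constraint. The scores and labels are made up, and the function assumes at least one positive label.

```python
def threshold_for_precision(scores, labels, target=0.90):
    """Return (threshold, precision, recall) for the lowest threshold that
    meets the target precision, or None if no threshold does."""
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    total_pos = sum(labels)
    best, tp = None, 0
    for i, (score, label) in enumerate(ranked, start=1):
        tp += label
        # Only evaluate at boundaries between distinct score values.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        precision = tp / i
        if precision >= target:
            best = (score, precision, tp / total_pos)
    return best

scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40]  # toy model scores
labels = [1, 1, 1, 0, 1, 0]                    # 1 = confirmed fraud
strict = threshold_for_precision(scores, labels, target=0.90)
relaxed = threshold_for_precision(scores, labels, target=0.80)
```

Claims scoring at or above the returned threshold go to investigators; relaxing the target trades precision for recall.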

---

**Q4. Why XGBoost/CatBoost instead of deep models?**
Given sparse tabular + textual features and limited labeled data, gradient boosting models provided a strong balance of accuracy, interpretability, and speed. Deep models like CNN/RNN didn’t show meaningful uplift but required higher compute and tuning effort.

---

**Q5. Which features contributed the most?**
Key signals included rare CPT/ICD code combinations, phrase patterns in provider notes, and claim description rarity. SHAP analysis identified anomalous term clusters and provider-specific behavior as top drivers.

---

**Q6. How did you ensure generalization and avoid data leakage?**
Splitting was done by provider and time period to prevent leakage across claims. We also excluded post-adjudication fields and used temporal validation to simulate real-world deployment.
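
A minimal sketch of such a split is shown below. The field names `provider_id` and `date`, the ISO date strings, and the rule of dropping claims that fit neither side are all assumptions for illustration, not the project's actual schema.

```python
def split_claims(claims, cutoff, holdout_providers):
    """Return (train, test) with no provider or time overlap between them."""
    train, test = [], []
    for claim in claims:
        held_out = claim["provider_id"] in holdout_providers
        if claim["date"] < cutoff and not held_out:
            train.append(claim)
        elif claim["date"] >= cutoff and held_out:
            test.append(claim)
        # Claims matching neither rule are dropped to keep the split clean.
    return train, test

claims = [
    {"provider_id": "P1", "date": "2023-01-10"},
    {"provider_id": "P1", "date": "2023-07-02"},
    {"provider_id": "P2", "date": "2023-02-15"},
    {"provider_id": "P2", "date": "2023-08-20"},
]
train, test = split_claims(claims, cutoff="2023-06-01", holdout_providers={"P2"})
```

ISO-formatted date strings compare correctly as plain strings, which keeps the example dependency-free.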

---

**Q7. How did you evaluate success with business teams?**
We reported **cost savings** = (fraud amount recovered – investigation cost – false positive load).
Operational KPIs included *precision at review capacity* and *hours saved*. A Streamlit dashboard visualized prioritized claims and explainability layers.

---

**Q8. How did you implement explainability?**
We integrated SHAP at both global and claim levels, displaying top indicative phrases and anomalous code combinations. The dashboard showed textual highlights and links to similar confirmed fraud cases for investigator trust.

---

**Q9. What were the major risks and how did you mitigate them?**

* **Data drift:** monthly term drift monitoring and calibration.
* **Fairness:** provider-slice performance analysis.
* **False accusations:** conservative thresholds + mandatory human review.

---

**Q10. What would you add with more data or budget?**
Integrate **UMLS/SNOMED** for concept normalization, fine-tune **clinical transformer encoders** via PEFT, and incorporate **graph-based relationships** between providers, claims, and procedures.

---

## 💬 **Project 2: Employee Feedback Sentiment Analysis**

**Q1. What was the goal and constraints?**
To extract actionable insights from employee exit feedback, highlighting sentiment trends. Constraints included small labeled data, strong privacy requirements, and the need for interpretable dashboards.

---

**Q2. Baseline vs improved approach?**
Baseline: **TF-IDF + Logistic Regression**.
Enhanced version: added **bigrams for negation**, **domain lexicons**, **class-weighting**, and tuned thresholds using **macro-F1** for imbalanced labels.

---

**Q3. How did you handle sarcasm or negation?**
Handled negations via bigram features (“not satisfied”), and for ambiguous sarcasm, flagged low-confidence samples for human review instead of overfitting on rare patterns.

---

**Q4. What privacy measures were implemented?**
Used **spaCy + regex** for PII redaction (names, emails, IDs), aggregated outputs at department level, and applied role-based access control and retention limits.
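
The regex half of that redaction might look like the sketch below. The `EMP` plus digits ID format is hypothetical, and names are deliberately left out because they need an NER model rather than a pattern.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
EMP_ID = re.compile(r"\bEMP\d{4,}\b")  # hypothetical employee ID format

def redact(text):
    """Replace emails and employee IDs with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return EMP_ID.sub("[ID]", text)

clean = redact("Contact jane.doe@example.com or EMP12345.")
```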

---

**Q5. Why choose Logistic Regression?**
Given limited labeled data, LR was ideal for interpretability and fast iteration. The model’s coefficients also gave direct insight into drivers of sentiment, which was critical for HR adoption.

---

**Q6. How was evaluation and calibration handled?**
Used **stratified cross-validation**, tracked **macro/per-class F1**, and applied **Platt scaling** for calibrated probability outputs to support automated alerts.

---

**Q7. What were common model errors?**
Domain idioms (e.g., “bench time”) and mixed sentiments within feedback. We mitigated these using domain dictionaries, sentence-level polarity aggregation, and targeted augmentation.

---

**Q8. What was the business impact?**
Delivered **quarterly sentiment dashboards** by department, identified top linguistic drivers of dissatisfaction, and flagged potential retention risks—supporting data-driven HR actions.

---

**Q9. What would you do next with more time or data?**
Fine-tune lightweight **transformer adapters**, add **topic modeling** to uncover themes, and introduce **active learning** loops using HR feedback for incremental labeling.

---

## 👔 **Project 3: Employee–Project Alignment Engine (Recommender)**

**Q1. What did the system do and why cosine similarity?**
It matched employees to projects based on skill alignment. We used **TF-IDF + cosine similarity** since it handles sparse, high-dimensional text well and provides a transparent, scalable baseline.
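
A stripped-down sketch of the matching idea, using only the standard library, is shown below. The names, skill strings, and smoothed IDF variant are illustrative assumptions; the real system's weighting and boosts are not reproduced here.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Sparse {term: weight} vectors using a smoothed IDF variant."""
    counts = [Counter(t.lower().split()) for t in texts]
    n = len(texts)
    df = Counter(term for c in counts for term in c)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in c.items()}
        for c in counts
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0

texts = {  # hypothetical project description and employee profiles
    "project": "python nlp tfidf search",
    "asha": "python nlp spacy",
    "ben": "java spring sql",
}
names = list(texts)
vectors = dict(zip(names, tfidf_vectors([texts[n] for n in names])))
ranking = sorted(
    ((n, cosine(vectors["project"], vectors[n])) for n in names if n != "project"),
    key=lambda item: -item[1],
)
```

Because the score is just a sum over shared terms, each match can be explained by listing the overlapping skills, which is where the transparency comes from.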

---

**Q2. How were profiles and queries represented?**
Employee profiles included concatenated skills, roles, certifications, and recent project text. Projects were similarly vectorized. Synonym normalization and weighted boosts for certifications and recency were applied.

---

**Q3. How did you measure success?**
We ran an A/B study — **time-to-staff** reduced by 27%, and **match quality** (manager-rated relevance 0–3) improved by 21%. Offline, we tracked top-k hit rates and ranking correlation.

---

**Q4. What ranking adjustments mattered most?**
Boosting for certifications, synonym expansion, and decaying scores for outdated experience improved ranking quality. We also enforced diversity to avoid repeatedly surfacing the same candidates.

---

**Q5. How did you validate and iterate?**
We gathered qualitative feedback from hiring managers, analyzed misranked cases, tuned similarity weights, and updated domain dictionaries iteratively.

---

**Q6. Why not use neural embeddings?**
Given latency constraints and limited training data, TF-IDF was sufficient. Neural embeddings (e.g., Sentence-BERT) are on the roadmap to bridge semantic gaps once data volume justifies fine-tuning.

---

**Q7. How did you handle cold start and data quality?**
For new employees, relied on certifications and job titles. Standardized skill taxonomies and ran automated checks on parsed resumes to prevent noise.

---

**Q8. How was deployment handled?**
Implemented batch vectorization with cached embeddings, incremental updates for new profiles, and exposed via a lightweight API integrated with Streamlit dashboards.

---

## ⚙️ **Cross-Cutting Interview Probes**

**Q1. How did you ensure fairness and avoid bias?**
Tracked model performance across provider/department/location slices, conducted disparate impact analysis, and set human review thresholds for sensitive cases.

---

**Q2. How did you monitor these systems post-deployment?**
Monitored input drift (term distributions), output drift (prediction rates), calibration stability, and business KPIs. Automated alerts triggered retraining when drift exceeded thresholds.
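
One possible way to score term-distribution drift is Jensen-Shannon divergence between a baseline window and the current window, sketched below. The choice of metric, the toy token lists, and the alert threshold are assumptions for illustration rather than what was deployed.

```python
import math
from collections import Counter

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2, range 0..1) between two term counts."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    js = 0.0
    for term in set(p_counts) | set(q_counts):
        p = p_counts.get(term, 0) / p_total
        q = q_counts.get(term, 0) / q_total
        m = (p + q) / 2
        if p > 0:
            js += 0.5 * p * math.log2(p / m)
        if q > 0:
            js += 0.5 * q * math.log2(q / m)
    return js

baseline = Counter("claim denied claim paid".split())
current = Counter("claim denied appeal appeal".split())
drift = js_divergence(baseline, current)

ALERT_THRESHOLD = 0.1  # hypothetical value; tune on historical windows
needs_review = drift > ALERT_THRESHOLD
```

The score is 0 for identical distributions and 1 for distributions with no terms in common, which makes thresholds easier to reason about than unbounded measures.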

---

**Q3. How did you communicate results to non-technical stakeholders?**
Used **precision–recall trade-offs**, **cost–benefit visuals**, and **interpretable examples**. Dashboards emphasized business impact—hours saved, fraud captured, or sentiment shifts.

---

**Q4. How did you ensure data security and privacy?**
Implemented **PII redaction**, **role-based access**, **audit trails**, and **aggregate reporting** across all projects, aligning with organizational data governance policies.

---

**Q5. Biggest learning from these projects?**
That **alignment between technical metrics and operational goals** (like review capacity or staffing speed) matters more than model accuracy. And continuous error-slice analysis yields the highest ROI over time.

