# DATA + BUSINESS SCENARIO QUESTIONS

---

Your model accuracy is 95% in training, but only 63% in production. What is happening and how will you fix it?

A food delivery company sees that orders are dropping in one city. As a data scientist, how would you analyze the problem step-by-step?

Netflix changed the recommendation algorithm. Watch time dropped by 12%. What would you investigate?

Several features in your dataset are highly correlated. What problems can this cause and how will you handle it?

You're given a dataset with 40% missing values in one column. Will you drop or keep the column? Why?

Your company wants to increase ad clicks. What ML approach would you use and what metric will you optimize?

Suddenly there is a spike in model errors, but the data pipeline did not change. What might be the cause?

You deploy a churn prediction model. After 3 months, its performance drops. Why does this happen?

✅ MODELING DECISION QUESTIONS

When will you choose Logistic Regression over XGBoost, even if XGBoost performs better?

A dataset has 50,000 rows and 10 features vs 500 rows and 200 features. Which dataset is harder to model and why?

You are solving fraud detection. Accuracy is 99% but fraud detection rate is low. Is your model good?

For a recommendation system, which approach would you use: content-based, collaborative filtering, or hybrid, and why?

If your dataset is small, would you choose Deep Learning? Why or why not?

Why might Random Forest perform better than Decision Tree?

Why does Gradient Boosting often outperform Random Forest on structured data?

✅ EVALUATION + METRIC SCENARIOS

Facebook wants to detect fake accounts. Should you use accuracy? Why?

In medical cancer detection, what is more important: Precision or Recall?

Your AUC is high, but Recall is low. What does this mean?

What would be the worst metric to use for highly imbalanced data?

How will you decide the threshold in a classification model?

✅ EDA REAL LIFE SCENARIOS

How do you detect data leakage before training?

What if the target variable is extremely skewed?

You notice a feature that is almost perfectly correlated with the target. What will you do?

How will you deal with outliers in a housing price model?

How will you check if features are important or just noise?

✅ CLUSTERING + SEGMENTATION SCENARIOS

A retail company wants to segment users. Which algorithm and how will you choose K?

Data is not spherical shaped. Which clustering is better than K-means?

How do you measure success of a clustering model without labels?

✅ DEPLOYMENT + PRODUCTION SCENARIOS (VERY IMPRESSIVE IF YOU ANSWER WELL)

How do you detect concept drift in production?

How often should you retrain your model and why?

A model is performing well technically but business KPIs are not improving. Why?

Your model is too slow for real-time prediction. What can you do?

How do you make a machine learning solution scalable?

✅ ADVANCED / TOP-COMPANY THINKING

You have 300 features. How will you select the top 20?

How do you balance interpretability vs accuracy?

Why might a simpler model outperform a complicated one?

Would you prefer F1-score or ROC-AUC for email spam? Why?

How do you test whether a new ML model is truly better?

✅ BONUS: QUICK CHALLENGE (Asked in Interviews)

Predict whether a user will click on an ad. Design the full solution.

Design an ML system for recommending reels/shorts on Instagram.

How would you design price prediction for Uber/Ola?

How do you detect fake reviews on Amazon?

Design a content moderation ML system.

Build a resume screening system. What bias issues can occur?

Predict students who will drop out of a course.

✅ HARD INTELLIGENCE QUESTIONS

If you had unlimited data, would your model be perfect? Why / why not?

Why does adding more data sometimes make a model worse?

Can unsupervised learning be used for prediction? Explain.

How do you know if your dataset represents the real world?

What question do you ask before even building an ML model?


---

## 1. **Your model accuracy is 95% in training, but only 63% in production. What is happening & how do you fix it?**

This is a classic case of **overfitting + data mismatch**.

### ✅ What's likely happening

1. **Overfitting**

   * Model memorized training data instead of learning general patterns
   * Very high training accuracy but poor real-world performance

2. **Data shift / Data drift**

   * Production data distribution is different from training data
   * Examples:

     * New user behavior
     * Different customer segment
     * Seasonal patterns
     * Changed data pipeline logic

3. **Data leakage during training**

   * Model accidentally had access to future info or target-related features during training

4. **Incorrect preprocessing in production**

   * Missing scaling/encoding in live pipeline
   * Different handling of missing values or categories

---

### ✅ How I would fix it (Step-by-step in interview)

1. **First check for Overfitting**

   Compare:

   * Training accuracy: 95%
   * Validation accuracy: ?
   * Test accuracy: ?

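   If validation accuracy is also low → overfitting is confirmed. If validation is high but production is still low → suspect drift, leakage, or a pipeline mismatch.

   ✅ Fix: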

   * Use **regularization (L1/L2)**
   * Reduce model complexity
   * Add dropout (for deep models)
   * Use proper cross-validation

2. **Check for data drift**

   I would compare:

   * Feature distributions (train vs prod)

     * Use: KS-test / PSI / histograms

   ✅ Fix:

   * Retrain with recent data
   * Add drift detection monitoring
   * Use adaptive learning / scheduled retraining

3. **Verify pipeline consistency**

   Ensure:

   * Same scaler
   * Same encoders
   * Same feature engineering
   * Same feature order

   ✅ Fix:

   * Save and reuse the fitted preprocessing + model pipeline (e.g., scikit-learn `Pipeline`, tracked with MLflow)
   * Use versioning

4. **Re-evaluate metric**

   * Maybe accuracy is misleading because data is imbalanced
   * I would check: precision, recall, F1, ROC-AUC
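
The drift check in step 2 can be sketched with the Population Stability Index (PSI). This is a minimal, standard-library-only illustration: the function name `psi`, the bin count, the `eps` floor and the toy data are all assumptions made for the example, not part of any specific monitoring tool.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training and a production sample."""
    # Bin edges are taken from the quantiles of the training (expected) data.
    ordered = sorted(expected)
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = sum(v >= e for e in edges)  # how many edges lie at or below v
            counts[idx] += 1
        # eps avoids log(0) when a bin is empty
        return [max(c / len(values), eps) for c in counts]

    exp_share, act_share = bin_shares(expected), bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_share, act_share))

# Toy feature values (made up for illustration)
train = [i / 100 for i in range(1000)]
prod_same = [i / 100 + 0.001 for i in range(1000)]    # almost the same distribution
prod_shifted = [i / 100 + 4.0 for i in range(1000)]   # shifted upward

print("PSI, stable feature: ", round(psi(train, prod_same), 4))
print("PSI, shifted feature:", round(psi(train, prod_shifted), 4))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cutoffs are conventions, not guarantees.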

---

## 2. **Orders are dropping in one city – How will you analyze?**

A business + data breakdown question.

### ✅ Step-by-step thinking (interview-ready):

1. **Validate if the drop is real**

   * Compare WoW / MoM trends
   * Compare with nearby cities
   * Check data error possibility

2. **Segment the data**

   I would check:

   * New users vs returning users
   * Android vs iOS
   * Time of day pattern
   * Weekend vs weekday
   * Areas / zones in city

3. **Look for operational factors**

   * Delivery time ↑?
   * Restaurant partners ↓?
   * Stock availability ↓?
   * Rider availability ↓?
   * App issues / crashes?

4. **External factors**

   * Weather event?
   * Local festival or strikes?
   * Competitor offers?
   * Government restrictions?

5. **Cohort analysis**

   * Compare user retention
   * Are users ordering less or uninstalling?

6. **Run hypothesis testing**

   Example hypotheses:

   * H1: Increase in delivery time → fewer orders
   * H2: Increased price → reduced ordering

   Validate using correlation & A/B tests

---

### ✅ My final output as a data scientist

I would provide:

* Root cause(s)
* Key metric change (numbers & graphs)
* Recommendation:

  * Discount campaign
  * Increase riders
  * Restaurant onboarding

Interviewers love **actionable output**, not just analysis.

---

## 3. **Netflix algorithm change caused 12% drop in watch time – What to investigate?**

A frequent interview question.

### ✅ What I will check:

1. **A/B test results**

   * Was the algorithm properly tested?
   * How big was the sample size?
   * Was result statistically significant?

2. **User segments impact**

   * Did it affect:

     * New users more?
     * Old users?
     * Specific genres?
     * Some regions?

3. **Recommendation diversity**

   * Was it showing too narrow content?
   * Is the user bored?

4. **Cold-start problem**

   * New users not getting good recommendations?

5. **Latency & load times**

   * Recommendation generation slow = user leaves
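
To make the "was the result statistically significant?" check concrete, here is a minimal sketch of a two-sided permutation test on watch time. The watch-time numbers are invented for illustration, and `permutation_test` is a hypothetical helper, not a Netflix or library API.

```python
import random

def permutation_test(control, treatment, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = sum(treatment) / len(treatment) - sum(control) / len(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign users to the two groups
        perm_control, perm_treatment = pooled[:n_control], pooled[n_control:]
        diff = (sum(perm_treatment) / len(perm_treatment)
                - sum(perm_control) / len(perm_control))
        if abs(diff) >= abs(observed):
            extreme += 1
    # The +1 correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_permutations + 1)

# Hypothetical daily watch time in minutes per user (made-up numbers)
old_algo = [62, 58, 71, 65, 60, 68, 74, 59, 66, 63, 70, 61]
new_algo = [55, 49, 60, 57, 52, 58, 63, 50, 56, 54, 61, 53]

diff, p_value = permutation_test(old_algo, new_algo)
print(f"mean difference: {diff:.2f} minutes, p-value: {p_value:.4f}")
```

A small p-value only says the drop is unlikely to be random noise; the segment and diversity checks above are still needed to explain why it happened.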

---

### ✅ Possible fixes or experiments I'd suggest

* Hybrid model: Collaborative + Content-based
* Increase content diversity
* Re-run experiment with better personalization
* Add manual overrides for top content

**As a fresher**, saying *"I would propose another A/B test"* is important.

---

## 4. **Highly correlated features – What problems & how to solve?**

### ✅ Problems

This causes **MULTICOLLINEARITY** which leads to:

* Unstable model coefficients (inflated standard errors)
* Hard to interpret results
* Overfitting risk
* Poorer generalization, mainly in unregularized linear models

Especially affects: **Linear / Logistic Regression**

---

### ✅ How I handle it

1. **Correlation matrix + heatmap**

2. **Remove one of the correlated features**

3. **Use PCA** for dimensionality reduction

4. **Use models not affected much**

   * Random Forest
   * XGBoost

5. **Calculate VIF (Variance Inflation Factor)**

   If VIF > 5 (or > 10 with a looser cutoff) → remove or combine that feature
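
As a rough sketch of the VIF check, the snippet below uses the fact that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. It uses only the standard library; the helper names and the housing-style toy columns are assumptions for illustration. In practice a library routine such as the one in statsmodels would normally be used instead.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (fine for small matrices)."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF per feature = diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}

# Toy data: rooms is almost a linear function of area, age is mostly unrelated
area = [50 + 5 * i for i in range(20)]
rooms = [a / 25 + (0.1 if i % 2 else -0.1) for i, a in enumerate(area)]
age = [(7 * i) % 20 for i in range(20)]

vifs = vif({"area": area, "rooms": rooms, "age": age})
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")
```

Here `area` and `rooms` get very large VIFs while `age` stays close to 1, so one of the first two would be dropped or combined.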

✅ In interviews, mention: **VIF + PCA** → very good impression

---

## 5. **40% missing values – Drop or keep?**

### ❌ I won't directly drop.

### ✅ I decide based on:

1. **Feature importance**

   * Is that column important to business?
   * Domain knowledge

2. **Pattern of missing data**

   * Random (MCAR)? → safe to impute
   * Not random (MNAR)? → must be careful, since the missingness itself may carry signal

### ✅ What I can do:

| Method                       | When I use it                                |
| ---------------------------- | -------------------------------------------- |
| Mean/Median                  | Numerical columns                            |
| Mode                         | Categorical columns                          |
| Interpolation                | Time-series data                             |
| Model-based (Regression/KNN) | Other features can predict the missing value |

If:

* Highly important + can be filled → keep
* Not important + too noisy → drop

Showing this logical decision matters.
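
A minimal sketch of the "keep and fill" path for a numerical column: median imputation plus a was-missing flag, so the model can still use the fact that a value was absent. The column name and values are invented for illustration.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill None with the median of observed values and return a was-missing flag."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return filled, was_missing

# Hypothetical income column with 40% missing values
income = [42_000, None, 55_000, None, 61_000, 48_000, None, 39_000, None, 70_000]

filled, flag = impute_with_indicator(income)
print(filled)
print(flag)
```

In a real project the median should be computed on the training split only and then reused on validation and production data.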

---

## 6. **Want to increase ad clicks – ML approach & metric?**

### ✅ This is a classification + ranking problem

I would use:

* Logistic Regression
* XGBoost
* Deep Learning
* Recommendation system

### ✅ Optimize for:

NOT accuracy (because data is imbalanced)

I will use:

* **CTR (Click Through Rate)**
* **Precision / Recall**
* **ROC-AUC**
* **PR-AUC**
* **Log Loss**

If the business wants more clicks:

➡️ Optimize **Recall** (catch more possible clickers)

If they want better quality:

➡️ Optimize **Precision**

Also use:

* Context
* User behavior
* Time
* Location

---

## 7. **Spike in model errors, pipeline didn't change – Why?**

### ✅ Possible causes

1. **Data drift**
2. **Concept drift**
3. **Bad data input (nulls, wrong format)**
4. **Edge cases suddenly increased**
5. **User behavior change**
6. **External events (festival, trend)**

I would check:

* Input data distribution
* Feature statistics
* Production logs
* Data versioning

Use → Statistical tests + monitoring

---

## 8. **Churn model drops after 3 months – Why?**

This is because of:

> ✅ **Model decay + concept drift**

Reasons:

* Customer behavior changed
* Business strategy changed
* New competitors
* Training data has gone stale
* Model not retrained
* Seasonality

---

### ✅ Solution

* Retrain quarterly or monthly
* Use rolling window data
* Add new feature (usage trends)
* Monitor live performance
* Auto re-training pipelines

This shows **MLOps understanding**, very valuable.





---

## 1. When will you choose **Logistic Regression over XGBoost**, even if XGBoost performs better?

Even if XGBoost has slightly higher accuracy, I would choose **Logistic Regression** when:

**a) Interpretability is more important than performance**

* Logistic Regression gives clear coefficients → easy to explain to business/stakeholders
* Important in finance, healthcare, legal applications

**b) Dataset is small & simple**

* XGBoost may overfit
* Logistic is more stable

**c) Fast & low-cost deployment needed**

* Logistic is light, less memory, lower latency

**d) Need probability explanation**

* Logistic gives directly interpretable probabilities

✅ In short:

> I choose Logistic Regression when **explainability, simplicity and speed** matter more than tiny performance gain.

This shows mature decision-making.

---

## 2. 50,000 rows & 10 features **vs** 500 rows & 200 features – Which is harder to model and why?

✅ **500 rows & 200 features is much harder** to model.

Because:

1. **Curse of Dimensionality**

   * Too many features, too little data
   * Distances between points become less informative in high dimensions

2. **High risk of overfitting**

   * Model will memorize instead of generalizing

3. **Noise dominates the signal**

4. **More preprocessing needed**

   * Feature selection
   * PCA
   * Regularization

Whereas:
✔️ 50,000 rows & 10 features = more stable and learnable patterns

**In short:**

> More data with fewer features is much easier than fewer data with many features.

---

## 3. Fraud detection: 99% accuracy but low fraud detection rate – Good model?

❌ **NO, it's a BAD model.**

Because:

Fraud detection is an **imbalanced problem**.

Example:

* 99% transactions are normal
* Only 1% is fraud

If model predicts **everything as normal** → accuracy is still 99% but **zero fraud detection**

📉 Accuracy is misleading.

✅ I would focus on:

* Recall (fraud detection rate) – most important
* Precision
* F1-score
* ROC-AUC / PR-AUC

**In fraud detection: missing a fraud is usually more expensive than a false positive**

✅ I'd also use:

* SMOTE
* Class weights
* Threshold tuning
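
The "99% accuracy, zero fraud caught" case is easy to reproduce. The sketch below scores a dummy model that labels every transaction as normal; the data is synthetic and the `evaluate` helper is an assumption for the example.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = fraud)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    # Convention used here: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Synthetic data: 1% fraud, 99% normal
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # dummy model: predicts "normal" for everything

accuracy, precision, recall = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

The dummy model reaches 99% accuracy while its recall is zero, which is why recall and PR-AUC are the numbers to watch here.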

---

## 4. Recommendation system – Content / Collaborative / Hybrid? Why?

### ✅ Best answer: **Hybrid system**

Because:

| Method        | Problem                                  |
| ------------- | ---------------------------------------- |
| Content-based | Only recommends similar things -> boring |
| Collaborative | Cold start problem                       |
| Hybrid        | Solves both problems                     |

✅ Hybrid uses:

* User behavior (collaborative)
* Item attributes (content)
* Popularity & trends

That's why Netflix & Spotify use **Hybrid**.

🧠 Smart line for interview:

> Hybrid is more robust, scalable and personalized in real-world systems.

---

## 5. Small dataset – Would you choose Deep Learning?

❌ Generally **NO**.

Because:

1. Deep learning needs large data
2. High risk of overfitting
3. Needs more computing
4. Harder to tune and explain

✅ Instead I would try:

* Logistic / SVM
* Random Forest
* XGBoost
* KNN

UNLESS:

* I am using **transfer learning**
* Or pretrained models

**Smart answer:**

> Without transfer learning, deep learning is usually not suitable for small datasets.

---

## 6. Why might Random Forest perform better than a Decision Tree?

A single decision tree:

* Overfits easily
* Sensitive to noise

Random Forest:

* Combines many trees
* Uses random sampling
* Reduces variance
* More generalizable

✅ Key interview words:

* **Bagging**
* **Ensemble**
* **Reduces Overfitting**
* **Better Stability**

**Final line:**

> Random Forest reduces model variance by averaging multiple trees, leading to better generalization.
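
The variance-reduction claim can be made precise with a standard textbook result. Assuming $B$ trees whose predictions each have variance $\sigma^2$ and pairwise correlation $\rho$ (the symbols are introduced here for illustration), the variance of their average is:

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding trees shrinks the second term, and random feature sampling lowers $\rho$, which shrinks the first. A single tree corresponds to $B = 1$, where the variance is the full $\sigma^2$.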

---

## 7. Why does Gradient Boosting often outperform Random Forest on structured data?

Because:

Random Forest:

* Builds trees independently
* Focus: reduce variance

Gradient Boosting / XGBoost:

* Builds trees sequentially
* Each tree fixes previous tree's mistakes
* Mainly reduces **bias**; shrinkage and subsampling keep **variance** in check

On structured/tabular data:
✅ Boosting captures complex patterns more efficiently

✅ Also supports:

* Regularization
* Learning rate control
* Handling missing values

**Interview closing line:**

> Boosting learns from previous errors step-by-step, which makes it stronger on tabular data compared to independent trees in Random Forest.
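
The "each tree fixes the previous tree's mistakes" idea can be sketched in a few lines. This is a toy gradient boosting loop for squared error using one-split decision stumps on a single feature. It is an illustration only, not how XGBoost is implemented, and all function names are made up for the example.

```python
def mse(y, preds):
    return sum((a - b) ** 2 for a, b in zip(y, preds)) / len(y)

def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for split in sorted(set(x))[1:]:  # skip the minimum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi < split]
        right = [r for xi, r in zip(x, residuals) if xi >= split]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda xi: left_mean if xi < split else right_mean

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    preds = [sum(y) / len(y)] * len(y)  # start from the mean prediction
    history = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # current mistakes
        stump = fit_stump(x, residuals)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
        history.append(mse(y, preds))
    return history

x = list(range(20))
y = [(xi - 10) ** 2 for xi in x]  # a curved target that one stump cannot fit

history = boost(x, y)
print(f"MSE after round 1:  {history[0]:.2f}")
print(f"MSE after round 20: {history[-1]:.2f}")
```

Each round lowers the training error a little, and the learning rate controls how much of each correction is applied. A Random Forest, by contrast, would fit every tree to the original target independently.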

---






# ✅ **EVALUATION + METRIC SCENARIOS**

---

## **1. Facebook wants to detect fake accounts. Should you use accuracy? Why?**

❌ **No, accuracy is a bad metric.**

Because:

* Fake accounts = very rare → **highly imbalanced dataset**
* Model predicting "all accounts are real" may get **99% accuracy** but detect **0% fake accounts**

Facebook cares about:

* Catching fake accounts → **Recall**
* Avoiding banning real users → **Precision**

Better metrics:

* F1 score
* Precision-Recall AUC
* Recall@K
* ROC-AUC

**Perfect interview line:**

> "Accuracy hides failure in imbalanced problems. I would optimize Precision, Recall, and F1 instead."

---

## **2. Cancer detection – Precision or Recall?**

➡️ **Recall is more important.**

Reason:

* Missing a cancer patient (False Negative) is far more dangerous than a False Positive.
* High Recall = catch maximum cancer cases.

But also keep:

* Acceptable Precision to avoid too many false alarms.

**Simple interview answer:**

> "Recall is critical because we don't want to miss sick patients."

---

## **3. AUC is high but Recall is low – what does it mean?**

This means:

1. **The model separates classes well overall**,
   but…
2. **At the chosen threshold**, the model fails to catch positive cases.

Why?

* Threshold too high → model is too conservative
* Class is highly imbalanced
* Model favors precision over recall

Fix:

* Lower classification threshold
* Optimize for Recall or F1
* Use PR-AUC instead of ROC-AUC
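
A small made-up example shows how both can be true at once. The scores below rank every positive above every negative (perfect AUC), yet most positives fall under the default 0.5 cutoff. The helper names and numbers are assumptions for illustration.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(y_true, scores, threshold):
    """Share of actual positives whose score reaches the threshold."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)

# Made-up labels and model scores: ranking is perfect, but scores are low overall
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.62, 0.45, 0.41, 0.38, 0.30, 0.22, 0.18, 0.15, 0.10, 0.08, 0.05, 0.02]

print("ROC-AUC:          ", roc_auc(y_true, scores))
print("Recall at t=0.50: ", recall_at(y_true, scores, 0.50))
print("Recall at t=0.35: ", recall_at(y_true, scores, 0.35))
```

Lowering the threshold from 0.50 to 0.35 recovers all positives here without changing the model at all.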

---

## **4. Worst metric for imbalanced data?**

❌ **Accuracy** is the worst.

Why:

* Hides model failure
* Misleading when classes are skewed
* A dummy model can get high accuracy

Better:

* Precision, Recall
* F1
* PR-AUC
* Balanced Accuracy

---

## **5. How to decide threshold in classification?**

This is an important ML engineering question.

I would choose threshold based on:

### **1. Business Goal**

* Fraud detection → maximize Recall
* Spam detection → maximize Precision
* Churn → maximize F1

### **2. Use ROC curve**

* Choose the point closest to the top-left corner (FPR = 0, TPR = 1)

### **3. Use Precision-Recall curve**

* Pick best tradeoff for rare class

### **4. Cost-based thresholding**

* Assign cost to FN and FP
* Choose threshold minimizing total cost

### **5. Grid search**

* Evaluate thresholds 0.1 to 0.9
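
Cost-based thresholding and the grid search can be combined in one short sketch: score each candidate threshold by its total business cost and keep the cheapest. The labels, scores and cost values below are invented for illustration.

```python
def total_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Total cost of false negatives and false positives at a given threshold."""
    fn = sum(y == 1 and s < threshold for y, s in zip(y_true, scores))
    fp = sum(y == 0 and s >= threshold for y, s in zip(y_true, scores))
    return fn * cost_fn + fp * cost_fp

def best_threshold(y_true, scores, cost_fn, cost_fp):
    candidates = [t / 10 for t in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    return min(candidates,
               key=lambda t: total_cost(y_true, scores, t, cost_fn, cost_fp))

# Made-up validation labels and predicted probabilities
y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]

# Missing a positive is 10x as costly as a false alarm (fraud-like setting)
recall_oriented = best_threshold(y_true, scores, cost_fn=10, cost_fp=1)
# A false alarm is 10x as costly as a miss (spam-filter-like setting)
precision_oriented = best_threshold(y_true, scores, cost_fn=1, cost_fp=10)

print("threshold when misses are expensive:      ", recall_oriented)
print("threshold when false alarms are expensive:", precision_oriented)
```

The same model ends up with a low threshold when misses are expensive and a high one when false alarms are expensive.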

**Interview phrase:**

> "Threshold is not fixed at 0.5; I select it based on the business cost of false positives and false negatives."

---

# ✅ **EDA REAL-LIFE SCENARIOS**

---

## **1. How do you detect data leakage before training?**

Data leakage means the **model sees information it shouldn't**.

How I detect it:

### **1. Suspiciously high accuracy during validation**

* Extremely high performance → red flag

### **2. Features strongly correlated with target**

* Example: "total_bill_paid_after_purchase"

### **3. Time leakage**

* Using future information (e.g., using future sales)

### **4. Validate pipeline**

* Preprocessing must be inside cross-validation
  (no scaling before splitting)

### **5. Domain checks**

* Ask: "Does this feature exist at prediction time?"
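
The "no scaling before splitting" rule from point 4 can be illustrated with a hand-rolled standard scaler. The data and helper names are assumptions for the example. With scikit-learn, wrapping the scaler and model in a `Pipeline` and passing it to `cross_val_score` gives the same train-only fitting inside each fold.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn scaling statistics (mean, standard deviation) from the given rows."""
    return mean(values), pstdev(values)

def transform(values, params):
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Made-up feature; the last two rows are the held-out test split
values = [10, 12, 11, 13, 12, 14, 13, 15, 40, 45]
train, test = values[:8], values[8:]

leaky_params = fit_scaler(values)  # WRONG: statistics include the test rows
clean_params = fit_scaler(train)   # RIGHT: statistics come from training rows only

print("leaky mean/std:", leaky_params)
print("clean mean/std:", clean_params)
print("test rows scaled without leakage:", transform(test, clean_params))
```

The leaky statistics differ from the clean ones, which means test-set information has already reached the training features before any model was fit.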

---

## **2. Target variable extremely skewed – what to do?**

If target is heavily skewed (e.g., income, house price):

### **1. Apply log transformation**

* log(target + 1)
* Reduces skew → model learns better

### **2. Use QuantileTransformer / Box-Cox**

### **3. Use robust metrics**

* MAE instead of RMSE

### **4. Winsorization**

* Cap extreme values
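
A quick sketch of the log(target + 1) option: transform the target, check that skewness drops, and invert the transform to get predictions back on the original scale. The prices are invented and `skewness` is a plain moment-based helper written for this example.

```python
import math

def skewness(values):
    """Moment-based sample skewness (third central moment / variance^1.5)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical prices with one extreme value
prices = [120, 135, 150, 160, 175, 190, 210, 240, 300, 2500]

log_prices = [math.log1p(p) for p in prices]     # log(target + 1)
restored = [math.expm1(v) for v in log_prices]   # inverse transform after predicting

print(f"skewness before: {skewness(prices):.2f}")
print(f"skewness after:  {skewness(log_prices):.2f}")
```

The model is trained on `log_prices`, and its predictions are passed through `math.expm1` before being reported or scored in the original units.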

---

## **3. A feature is almost perfectly correlated with target ‚Äî what will you do?**

Possibilities:

### **1. First check if it is VALID**

* If it's a legitimate feature → keep it
  (e.g., previous purchase amount predicts next purchase)

### **2. If it's leakage**

* Example: "Final price after discount" when predicting "price"
* REMOVE it

### **3. If it makes model unstable**

* Drop it if too predictive & suspicious

### **4. If redundancy**

* Use PCA or drop one of duplicates.

Interview answer:

> "I check if correlation is due to leakage or genuine predictive power."

---

## **4. Outliers in housing price model – what to do?**

Outliers greatly impact regression.

### **Options:**

#### **1. Log-transform price**

* Stabilizes extreme values

#### **2. Cap outliers**

* Winsorize top 1% / bottom 1%

#### **3. Remove extreme luxury properties**

* If your target audience is normal houses

#### **4. Use robust models**

* Random Forest
* Huber regression or median (quantile) regression
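
The capping option can be sketched as a simple percentile clip. The helper below uses a floor-based percentile lookup, which is one of several possible conventions, and the prices are synthetic.

```python
def winsorize(values, lower_pct=0.01, upper_pct=0.99):
    """Clip values to the empirical lower and upper percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_pct * (n - 1))]   # floor-based percentile lookup
    high = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, low), high) for v in values]

# Synthetic prices: 198 ordinary homes plus two extreme luxury listings
prices = [100 + 2 * i for i in range(198)] + [5_000, 9_000]

capped = winsorize(prices)
print("max before:", max(prices), "max after:", max(capped))
print("min before:", min(prices), "min after:", min(capped))
```

The two luxury listings are pulled down to the 99th-percentile value instead of being deleted, so the row count stays the same.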

**Interview-ready answer:**

> "Outliers in price cause model instability, so I either transform, cap, or use tree-based models."

---

## **5. How to check if features are important or just noise?**

### **1. Feature importance scores**

* Random Forest / XGBoost importance
* Permutation importance

### **2. SHAP values**

* Best for interpretability

### **3. Statistical tests**

* ANOVA
* Chi-square
* Mutual information

### **4. Correlation tests**

* Pearson / Spearman

### **5. Drop-column importance**

* Drop feature → retrain → see performance drop

### **6. Regularization**

* Lasso removes noisy features automatically
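
Permutation importance is simple enough to sketch by hand: shuffle one column at a time and measure how much accuracy drops. The `model` function below is a stand-in for any trained classifier and the data is randomly generated, so both are assumptions for illustration. scikit-learn provides a ready-made version in `sklearn.inspection.permutation_importance`.

```python
import random

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features, seed=0):
    """Accuracy drop when each feature column is shuffled on its own."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled, labels))
    return importances

# Synthetic data: feature 0 drives the label, feature 1 is pure noise
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(500)]
labels = [int(r[0] > 0.5) for r in rows]

def model(row):
    """Stand-in for a trained classifier that has learned the true rule."""
    return int(row[0] > 0.5)

importances = permutation_importance(model, rows, labels, n_features=2)
print("importance of feature 0:", round(importances[0], 3))
print("importance of feature 1:", round(importances[1], 3))
```

Shuffling the informative feature costs a large share of accuracy, while shuffling the noise feature costs nothing, which is the signal-versus-noise check in practice.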

---

