### **Interview Questions: Evaluating ML Model Performance**  

These questions focus on **model evaluation, metrics, and debugging performance issues** in machine learning, particularly in TensorFlow. Each question includes a code-based solution.  

---

## **Q1: Compute Accuracy, Precision, Recall, and F1-Score for a Model**  
**Task:** After training a classification model, compute **accuracy, precision, recall, and F1-score** using TensorFlow/Keras.  

### **Solution:**  
```python
import tensorflow as tf
from tensorflow import keras
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load a dataset (MNIST for example)
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0  # Normalize

# Define a simple model
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

# Compile and train
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=3, validation_data=(X_test, y_test))

# Get predictions
y_pred = model.predict(X_test)
y_pred_classes = tf.argmax(y_pred, axis=1).numpy()  # Convert probabilities to class labels

# Compute evaluation metrics
accuracy = accuracy_score(y_test, y_pred_classes)
precision = precision_score(y_test, y_pred_classes, average='macro')
recall = recall_score(y_test, y_pred_classes, average='macro')
f1 = f1_score(y_test, y_pred_classes, average='macro')

print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1-score: {f1:.4f}")
```
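
To make the `average='macro'` option less of a black box, here is a minimal NumPy sketch of how macro-averaged precision, recall, and F1 can be derived from a confusion matrix. The helper name `macro_metrics` and the toy labels are illustrative assumptions, not part of the solution above.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Macro-averaged precision, recall, and F1 computed from a confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows = true class, columns = predicted class

    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # how often each class was predicted
    actual = cm.sum(axis=1)     # how often each class actually occurs

    # Guard against division by zero for classes never predicted or never present
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    # Macro averaging: unweighted mean over classes
    return float(precision.mean()), float(recall.mean()), float(f1.mean())

# Toy labels for illustration only (not MNIST output)
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0])
macro_p, macro_r, macro_f1 = macro_metrics(y_true_toy, y_pred_toy, n_classes=3)
print(f"Macro precision: {macro_p:.4f}, recall: {macro_r:.4f}, F1: {macro_f1:.4f}")
```

Because every class gets equal weight, macro averages can drop sharply when a rare class is predicted poorly, which is often the point of reporting them.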

---
## **Q2: Plot ROC Curve and Compute AUC Score**  
**Task:** Modify the model evaluation process to include the **ROC curve** and **AUC score** for binary classification.  

### **Solution:**  
```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Reuse y_pred from Q1 and treat the task as binary: digit 0 (positive) vs. every other digit
y_test_binary = (y_test == 0).astype(int)  # 1 where the true digit is 0, else 0
y_pred_prob = y_pred[:, 0]  # Predicted probability that the digit is 0

# Compute ROC curve and AUC
fpr, tpr, _ = roc_curve(y_test_binary, y_pred_prob)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')  # Random guess line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
```
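
If an interviewer asks what `roc_curve` and `auc` do internally, a rough sketch like the following may help. It sweeps the threshold over every distinct score and integrates with the trapezoid rule. The function names and toy scores are assumptions made for illustration; scikit-learn's implementation is more efficient.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays by sweeping the threshold over every distinct score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing is predicted positive
    for thr in np.sort(np.unique(scores))[::-1]:  # highest threshold first
        pred = scores >= thr
        tpr.append(np.sum(pred & (y_true == 1)) / n_pos)
        fpr.append(np.sum(pred & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy labels and scores for illustration only
y_toy = [1, 0, 1, 1, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr_toy, tpr_toy = roc_points(y_toy, s_toy)
auc_toy = trapezoid_auc(fpr_toy, tpr_toy)
print(f"AUC: {auc_toy:.4f}")
```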
---

## **Q3: Identify and Debug Model Overfitting**  
**Task:** Implement **early stopping** and **dropout** to prevent overfitting.  

### **Solution:**  
```python
from tensorflow.keras.callbacks import EarlyStopping

# Define model with dropout
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.3),  # Dropout added
    keras.layers.Dense(10, activation='softmax')
])

# Add EarlyStopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Compile and train
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Hold out part of the training data for validation so the test set stays untouched
model.fit(X_train, y_train, epochs=20, validation_split=0.2, callbacks=[early_stopping])
```
---


## **Q4: Handle Class Imbalance with Weighted Loss**  
**Task:** Modify the model to **handle class imbalance** by using **class weights**.  

### **Solution:**  
```python
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# Assume imbalanced classes in y_train
classes = np.unique(y_train)
class_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
# Map each class label (not its position in the array) to its weight
class_weight_dict = {int(c): float(w) for c, w in zip(classes, class_weights)}

# Train model with class weights
model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test), class_weight=class_weight_dict)
```
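
For intuition, the `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain NumPy on a made-up 90/10 split; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def balanced_class_weights(y):
    """weight[c] = n_samples / (n_classes * count[c])."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Made-up labels: 90 samples of class 0, 10 samples of class 1
y_imbalanced = np.array([0] * 90 + [1] * 10)
weights_demo = balanced_class_weights(y_imbalanced)
print(weights_demo)  # the rare class receives the larger weight
```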
---


## **Q5: Use K-Fold Cross-Validation for Model Evaluation**  
**Task:** Evaluate the model using **5-fold cross-validation** to get a more reliable estimate of performance.  

### **Solution:**  
```python
from sklearn.model_selection import KFold
import numpy as np

kf = KFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []

for train_index, val_index in kf.split(X_train):
    X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

    # Define and train model
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train_fold, y_train_fold, epochs=3, verbose=0)

    # Evaluate on validation set
    val_accuracy = model.evaluate(X_val_fold, y_val_fold, verbose=0)[1]
    accuracies.append(val_accuracy)

print(f"Mean cross-validation accuracy: {np.mean(accuracies):.4f}")
```
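
A mean alone hides how much the folds disagree. As a small sketch, the spread can be reported next to it. The per-fold accuracies below are hypothetical placeholders for the `accuracies` list, and `summarize_folds` is an illustrative helper.

```python
import numpy as np

def summarize_folds(scores):
    """Return the mean and sample standard deviation of per-fold scores."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std(ddof=1))

# Hypothetical per-fold accuracies, standing in for the `accuracies` list above
fold_accuracies = [0.971, 0.968, 0.974, 0.966, 0.972]
mean_acc, std_acc = summarize_folds(fold_accuracies)
print(f"Cross-validation accuracy: {mean_acc:.4f} ± {std_acc:.4f}")
```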
---

### **Key Takeaways**
✅ Use **precision, recall, and F1-score** for better insights in classification.  
✅ **ROC-AUC** helps evaluate binary classifiers beyond accuracy.  
✅ **EarlyStopping & Dropout** help reduce **overfitting**.  
✅ Use **class weights** when handling **imbalanced datasets**.  
✅ **K-Fold cross-validation** gives a more stable performance estimate than a single train/validation split.  


Here are some **interview questions focused on evaluating ML model performance**, including **code correction**-type problems.  

---

## **1️⃣ Fix Incorrect Accuracy Calculation**
**❓ Question:**  
You are given a model trained on a dataset. The following code tries to compute the accuracy but produces incorrect results. Can you **identify and correct the mistake**?  

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred_probs = np.array([0.8, 0.4, 0.7, 0.6, 0.3, 0.9])

accuracy = np.mean(y_pred_probs == y_true)
print("Accuracy:", accuracy)
```

### **✅ Solution (Fix the Comparison)**
The mistake is comparing **probabilities** (`y_pred_probs`) with **true labels** (`y_true`) directly. None of the probabilities equals exactly 0 or 1, so the reported accuracy is 0.0. We need to convert predictions to binary labels before computing accuracy:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred_probs = np.array([0.8, 0.4, 0.7, 0.6, 0.3, 0.9])

# Convert probabilities to binary predictions (threshold = 0.5)
y_pred = (y_pred_probs >= 0.5).astype(int)

accuracy = np.mean(y_pred == y_true)
print("Accuracy:", accuracy)  # ✅ Corrected version
```

---

## **2️⃣ Fix Incorrect AUC Score Calculation**
**❓ Question:**  
The following code is supposed to compute the **AUC-ROC score**. It runs without error, but the result is misleading. What’s wrong, and how would you fix it?

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]

auc_score = roc_auc_score(y_true, y_pred)
print("AUC-ROC Score:", auc_score)
```

### **✅ Solution (Use Probabilities Instead of Binary Predictions)**
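AUC-ROC is a ranking metric, so it should be computed from **continuous scores** such as predicted probabilities. `roc_auc_score` accepts discrete predictions (0 or 1) without complaint, but they collapse the ROC curve to a single operating point. The fix is to pass probability scores instead.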

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred_probs = [0.9, 0.2, 0.8, 0.7, 0.1, 0.85]  # ✅ Use probability scores

auc_score = roc_auc_score(y_true, y_pred_probs)
print("AUC-ROC Score:", auc_score)  # ✅ Corrected version
```
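
To see why hard labels distort the score, the sketch below computes AUC as the probability that a random positive is ranked above a random negative (ties count as half). The scores are made up for illustration and `pairwise_auc` is a hypothetical helper, not a scikit-learn API.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative); ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive compared with every negative
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Made-up labels and scores for illustration
y_true_demo = [1, 0, 1, 1, 0, 0]
scores_demo = [0.9, 0.6, 0.55, 0.7, 0.2, 0.52]
hard_demo = [int(s >= 0.5) for s in scores_demo]  # thresholded 0/1 predictions

auc_from_scores = pairwise_auc(y_true_demo, scores_demo)
auc_from_labels = pairwise_auc(y_true_demo, hard_demo)
print(f"AUC from scores: {auc_from_scores:.3f}")
print(f"AUC from hard labels: {auc_from_labels:.3f}")
```

Thresholding throws away the ordering among predictions on the same side of 0.5, so the two numbers differ even though they come from the same model output.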

---

## **3️⃣ Incorrect Precision & Recall Calculation**
**❓ Question:**  
The following code raises a `ValueError` when it computes precision and recall. Find the mistake and fix it.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred_probs = [0.8, 0.3, 0.7, 0.6, 0.2, 0.9]

precision = precision_score(y_true, y_pred_probs)
recall = recall_score(y_true, y_pred_probs)

print("Precision:", precision)
print("Recall:", recall)
```

### **✅ Solution (Convert Probabilities to Binary Labels)**
Precision and recall expect **binary predictions**, not probabilities. Convert them before passing them into `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred_probs = [0.8, 0.3, 0.7, 0.6, 0.2, 0.9]

# Convert probabilities to binary predictions (threshold = 0.5)
y_pred = (np.array(y_pred_probs) >= 0.5).astype(int)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print("Precision:", precision)  # ✅ Fixed
print("Recall:", recall)        # ✅ Fixed
```

---

## **4️⃣ Incorrect F1 Score Calculation**
**❓ Question:**  
The F1 score calculation below fails with a `ValueError` instead of returning a score. What’s wrong, and how would you correct it?

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred_probs = [0.7, 0.2, 0.6, 0.8, 0.3, 0.9]

f1 = f1_score(y_true, y_pred_probs)
print("F1 Score:", f1)
```

### **✅ Solution (Use Binary Predictions Instead of Probabilities)**
The issue is that **`f1_score` requires binary labels, not probabilities**. We must first threshold the probability predictions.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred_probs = [0.7, 0.2, 0.6, 0.8, 0.3, 0.9]

# Convert probabilities to binary predictions
y_pred = (np.array(y_pred_probs) >= 0.5).astype(int)

f1 = f1_score(y_true, y_pred)
print("F1 Score:", f1)  # ✅ Fixed
```

---

## **5️⃣ Fix Overfitting Detection with Validation Loss**
**❓ Question:**  
The following code tries to detect overfitting by comparing training and validation loss. However, it doesn’t correctly detect overfitting. What’s wrong?

```python
if train_loss < val_loss:
    print("Overfitting detected!")
```

### **✅ Solution (Fix the Condition)**
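- `train_loss < val_loss` holds for most healthy models, because training loss is usually a little lower than validation loss, so the original check fires almost every time.
- Overfitting shows up when **validation loss increases while training loss decreases**.
- A quick heuristic is to flag it only when **validation loss is higher than training loss by a significant margin**. The margin (`0.01` here) is arbitrary and should be scaled to the loss values.

```python
if val_loss > train_loss + 0.01:  # ✅ Require a margin before flagging overfitting
    print("Overfitting detected!")
```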
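
A single-epoch comparison cannot see the trend. As a sketch, assuming per-epoch loss lists are available (with Keras they would typically come from `history.history['loss']` and `history.history['val_loss']`), the divergence can be detected like this. The function name, the `patience` parameter, and the loss values are illustrative assumptions.

```python
def first_divergence(train_losses, val_losses, patience=2):
    """Return the index of the last epoch before validation loss started rising
    (for `patience` consecutive epochs) while training loss kept falling; else None."""
    streak = 0
    for i in range(1, len(val_losses)):
        val_up = val_losses[i] > val_losses[i - 1]
        train_down = train_losses[i] < train_losses[i - 1]
        if val_up and train_down:
            streak += 1
            if streak >= patience:
                return i - patience  # last epoch before the divergence began
        else:
            streak = 0
    return None

# Made-up loss curves for illustration
train_hist = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val_hist = [0.95, 0.70, 0.58, 0.60, 0.65, 0.72]

epoch = first_divergence(train_hist, val_hist)
if epoch is not None:
    print(f"Overfitting likely started after epoch index {epoch}")
```

This is close in spirit to what an early-stopping callback tracks, though it is only a rough check and noisy curves may need a larger `patience`.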

---

### **Summary of Common Mistakes in Model Evaluation**
✅ **Comparing raw probabilities to labels instead of thresholding them first**  
✅ **Passing hard labels instead of probability scores for AUC-ROC**  
✅ **Confusing `train_loss < val_loss` with actual overfitting detection**  
✅ **Incorrect metric usage (e.g., using precision/recall on probabilities)**  


Here are some **advanced debugging questions** related to **evaluating ML model performance**, including **error analysis, debugging loss functions, and incorrect metric calculations**.

---

## **1️⃣ Debugging a Model with High Accuracy but Poor Business Metrics**
**❓ Question:**  
You trained a TensorFlow classification model, and it achieves **99% accuracy** on the test set. However, business metrics (e.g., **conversion rate, revenue**) remain **low**. What could be wrong? How would you debug this issue?

### **✅ Solution**
Possible issues and debugging strategies:
1. **Class Imbalance** – If one class dominates the dataset, accuracy alone is misleading.  
   - ✅ Check class distribution:  
     ```python
     from collections import Counter
     print(Counter(y_test))  # Check class imbalance
     ```
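   - ✅ Use **precision, recall, and F1-score** instead of accuracy alone:  
     ```python
     from sklearn.metrics import classification_report
     # model.predict returns class probabilities, so take the argmax to get labels
     # (for a single sigmoid output, threshold at 0.5 instead)
     print(classification_report(y_test, model.predict(X_test).argmax(axis=1)))
     ```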

2. **Wrong Business Metric Optimization** – If the model **predicts the wrong target**, business impact is low.  
   - ✅ Check if the label correlates with the business outcome:  
     ```python
     import pandas as pd
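     # business_metric: per-row business outcome (e.g. converted or not), assumed to be available
     print(pd.crosstab(y_test, business_metric))  # Verify the label tracks the outcome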
     ```

3. **Poor Generalization to Real Data** – If the test set does not reflect real-world inputs, test accuracy overstates production performance.  
   - ✅ Check train/test distribution shift using KS test:  
     ```python
     from scipy.stats import ks_2samp
     for col in X_train.columns:
         print(f"{col}: {ks_2samp(X_train[col], X_test[col]).pvalue}")  # Low p-value means a shift
     ```
   - ✅ Use **out-of-sample** real-world data for validation.
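
The class-imbalance point is easy to demonstrate. In the sketch below, a "model" that always predicts the majority class reaches 99% accuracy on made-up labels with 1% positives, yet finds none of the cases the business cares about. All names and numbers are illustrative.

```python
import numpy as np

# Hypothetical test labels: 1% positives (e.g. customers who convert)
y_demo = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts the majority class
always_negative = np.zeros_like(y_demo)

accuracy_demo = float(np.mean(always_negative == y_demo))
true_positives = np.sum((always_negative == 1) & (y_demo == 1))
recall_demo = float(true_positives / np.sum(y_demo == 1))

print(f"Accuracy: {accuracy_demo:.2f}")               # 0.99
print(f"Recall (positive class): {recall_demo:.2f}")  # 0.00
```

Comparing a trained model against this kind of trivial baseline is a quick first sanity check before digging into the other causes.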

---

## **2️⃣ Fixing Incorrect Log Loss Calculation**
**❓ Question:**  
You implemented **log loss** manually, but the result **differs from scikit-learn’s log loss**. Find and fix the bug.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred_probs = np.array([0.9, 0.2, 0.7, 0.8, 0.3])

log_loss = -np.mean(y_true * np.log(y_pred_probs) + (1 - y_true) * np.log(1 - y_pred_probs))
print("Log Loss:", log_loss)
```

### **✅ Solution**
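🔴 **Bug:** The implementation does **not guard against probabilities of exactly 0 or 1**, where `np.log(0)` returns `-inf` and the loss becomes `inf` or `nan`. With the sample values above the manual result already matches scikit-learn; the discrepancy only appears at those extremes. Naming the result `log_loss` would also shadow scikit-learn's function once it is imported.  
✅ **Fix:** Clip the probabilities with `np.clip()` before taking logs: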

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_pred_probs = np.array([0.9, 0.2, 0.7, 0.8, 0.3])

# Fix: Avoid log(0) by clipping probabilities
eps = 1e-15
y_pred_probs = np.clip(y_pred_probs, eps, 1 - eps)

log_loss_fixed = -np.mean(y_true * np.log(y_pred_probs) + (1 - y_true) * np.log(1 - y_pred_probs))
print("Corrected Log Loss:", log_loss_fixed)

# Validate with sklearn
print("Sklearn Log Loss:", log_loss(y_true, y_pred_probs))
```
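
The sample probabilities above never hit 0 or 1, so they do not actually trigger the bug. A minimal sketch with edge-case inputs (made up for illustration) shows the failure and the effect of clipping. `binary_log_loss` is a hypothetical helper name.

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy with probabilities clipped away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# Edge case: the model is fully confident (and correct) on two samples
y_true_edge = np.array([1, 0, 1])
y_prob_edge = np.array([1.0, 0.0, 0.8])

# Without clipping, 0 * log(0) evaluates to 0 * -inf, which is nan
with np.errstate(divide="ignore", invalid="ignore"):
    unclipped = -np.mean(
        y_true_edge * np.log(y_prob_edge) + (1 - y_true_edge) * np.log(1 - y_prob_edge)
    )

clipped = binary_log_loss(y_true_edge, y_prob_edge)
print("Unclipped:", unclipped)
print("Clipped:", clipped)
```

Note that the unclipped version breaks even though the confident predictions are correct, which is why the guard is needed regardless of model quality.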

---

## **3️⃣ Debugging a Model with High Validation Loss**
**❓ Question:**  
Your model achieves **low training loss but high validation loss**. What could be wrong, and how would you fix it?

### **✅ Solution**
The issue suggests **overfitting**. Possible fixes:

1. **Increase Regularization (L2/L1 Penalty)**
   ```python
   model.add(keras.layers.Dense(128, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01)))
   ```

2. **Add Dropout**
   ```python
   model.add(keras.layers.Dropout(0.3))
   ```

3. **Use Early Stopping**
   ```python
   from tensorflow.keras.callbacks import EarlyStopping
   early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
   ```

4. **Ensure Proper Train-Test Splitting (Avoid Data Leakage)**
   ```python
   from sklearn.model_selection import train_test_split
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
   ```

---

## **4️⃣ Debugging Precision & Recall Mismatch**
**❓ Question:**  
Your model has **high precision but low recall**. What does this indicate, and how do you fix it?

### **✅ Solution**
- **High Precision, Low Recall** = **Model is conservative** (only predicts positive when very confident).  
- **Fix by adjusting the decision threshold**:
  ```python
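  import numpy as np
  from sklearn.metrics import precision_recall_curve

  # Keras binary model with a sigmoid output; for a scikit-learn classifier,
  # use model.predict_proba(X_test)[:, 1] instead
  y_pred_probs = model.predict(X_test).ravel()
  precision, recall, thresholds = precision_recall_curve(y_test, y_pred_probs)

  # recall[:-1] aligns with thresholds and falls as the threshold rises, so pick
  # the highest threshold that still reaches the recall target
  target_recall = 0.8
  best_threshold = thresholds[np.where(recall[:-1] >= target_recall)[0][-1]]
  y_pred_new = (y_pred_probs >= best_threshold).astype(int)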
  ```
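
To make the trade-off concrete, here is a small NumPy sketch that evaluates precision and recall at a few thresholds. The labels, scores, and the helper `precision_recall_at` are illustrative assumptions.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores, dtype=float) >= threshold
    tp = int(np.sum(pred & (y_true == 1)))
    fp = int(np.sum(pred & (y_true == 0)))
    fn = int(np.sum(~pred & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up labels and scores for illustration
y_true_thr = [1, 1, 1, 1, 0, 0, 0, 0]
scores_thr = [0.95, 0.70, 0.45, 0.30, 0.60, 0.40, 0.20, 0.10]

for thr in (0.8, 0.5, 0.25):
    prec, rec = precision_recall_at(y_true_thr, scores_thr, thr)
    print(f"threshold={thr:.2f}  precision={prec:.2f}  recall={rec:.2f}")
```

Lowering the threshold recovers the missed positives at the cost of more false positives, so the right operating point depends on which error is more expensive.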

---

## **5️⃣ Identifying Label Leakage in Feature Engineering**
**❓ Question:**  
Your model has **great validation performance** but fails **in production**. You suspect **data leakage**. How would you debug this?

### **✅ Solution**
🔍 **Check if Features Contain Target Information**  
- **Feature leakage happens when a feature encodes information about the label that would not be available at inference time** (for example, a value recorded only after the outcome is known).

🔹 **Diagnose:** Run feature importance analysis using SHAP:
```python
import shap

explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_train)

shap.summary_plot(shap_values, X_train)
```
🔹 **If a feature that is only known after the outcome dominates the importance ranking**, it likely contains **leaked information**.
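
Before reaching for SHAP, a cheaper first screen is to look for features that are almost perfectly correlated with the label. The sketch below is one way to do that under stated assumptions: numeric features, a binary label, synthetic data, and a hypothetical feature name (`refund_issued`). A high correlation is a prompt to investigate, not proof of leakage.

```python
import numpy as np

def suspicious_features(X, y, names, threshold=0.95):
    """Flag features whose absolute Pearson correlation with the label is >= threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    flagged = []
    for j, name in enumerate(names):
        col = X[:, j]
        if col.std() == 0:
            continue  # a constant column cannot be correlated with anything
        corr = np.corrcoef(col, y)[0, 1]
        if abs(corr) >= threshold:
            flagged.append((name, float(corr)))
    return flagged

# Synthetic data: one noise feature and one near-copy of the label
rng = np.random.default_rng(0)
y_leak = rng.integers(0, 2, size=200)
X_leak = np.column_stack([
    rng.normal(size=200),                       # unrelated noise
    y_leak + rng.normal(scale=0.01, size=200),  # leaks the label almost exactly
])

flagged_demo = suspicious_features(X_leak, y_leak, ["noise", "refund_issued"])
print(flagged_demo)
```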

---

## **6️⃣ Why is Model Accuracy 100%?**
**❓ Question:**  
You trained a model, but it achieves **100% accuracy** even on the validation set. What could be wrong?

### **✅ Solution**
Possible issues:
1. **Data Leakage** – Training and validation data overlap. Fix by checking for duplicate rows:
   ```python
   duplicates = X_train.merge(X_test, how='inner')
   print("Duplicates:", len(duplicates))
   ```

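2. **Overly Simple Dataset** – If the classes are trivially separable (for example, a single feature determines the label), perfect accuracy can be legitimate. Confirm on fresh held-out data before trusting it.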

3. **Training on the Wrong Data** – Check if `X_train` and `y_train` match in shape and indices:
   ```python
   print(X_train.shape, y_train.shape)
   print(X_train.head(), y_train.head())
   ```
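
The `merge` check above assumes pandas DataFrames. For NumPy arrays (such as image data), a rough equivalent is to compare rows by their raw bytes. This is a sketch that only catches exact duplicates; `count_overlap` and the demo arrays are illustrative.

```python
import numpy as np

def count_overlap(X_a, X_b):
    """Count samples in X_b that also appear, byte for byte, in X_a."""
    X_a = np.ascontiguousarray(X_a)
    X_b = np.ascontiguousarray(X_b, dtype=X_a.dtype)  # same dtype so byte patterns are comparable
    seen = {row.tobytes() for row in X_a.reshape(len(X_a), -1)}
    return sum(row.tobytes() in seen for row in X_b.reshape(len(X_b), -1))

# Tiny demo arrays: two of the three rows in X_b_demo also occur in X_a_demo
X_a_demo = np.arange(12).reshape(4, 3)
X_b_demo = np.array([[3, 4, 5], [100, 101, 102], [9, 10, 11]])

overlap_demo = count_overlap(X_a_demo, X_b_demo)
print("Overlapping rows:", overlap_demo)
```

Near-duplicates (for example, the same image re-encoded or slightly cropped) would slip past this check and need a similarity-based comparison instead.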

---

### **Summary of Advanced Debugging Techniques**
| Issue | Debugging Fix |
|--------|-------------|
| **Accuracy is high, but business metrics are low** | Check class imbalance, proper target variable |
| **Model overfits (low train loss, high val loss)** | Add dropout, L2 regularization, early stopping |
| **High precision but low recall** | Adjust decision threshold for recall |
| **AUC-ROC calculation is incorrect** | Ensure probability predictions instead of binary labels |
| **Model fails in production** | Check for data leakage (SHAP analysis) |
| **Model achieves 100% accuracy** | Check for duplicated train/test data |
