# Data Science & AI Engineer Interview Drills

Sharpen core skills across statistics, data wrangling, modeling, and deep learning with hands-on, interview-style exercises. The sections progress in difficulty (Easy → Medium → Expert).

## How to use this notebook
- Work through exercises in order; time-box yourself like an interview (10–20 min each).
- Avoid looking at hints immediately; try to verbalize your approach first.
- Write clean, tested code; add assertions/printouts to validate assumptions.
- Use common libraries (`numpy`, `pandas`, `scikit-learn`, `torch`) as needed, but implement the core logic yourself unless an exercise says otherwise.

## Easy: Foundations
Warm up with vectorized thinking, basic stats, and quick data checks.

### 1) Z-Score Normalization (NumPy)
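Given a 1D NumPy array `x`, return the normalized array `(x - mean) / std`, where `mean` and `std` are computed over `x` (population standard deviation, NumPy's default `ddof=0`).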

**Constraints:**
- Avoid Python loops; use vectorized operations.
- Handle the case `std == 0` by returning zeros.

**Test yourself:** Verify mean≈0 and std≈1 on random inputs.

In [None]:
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Return z-scored version of x. Handle zero std by returning zeros."""
    # TODO: implement
    raise NotImplementedError

# Quick checks (uncomment as you work)
# x = np.random.randn(1000)
# out = zscore(x)
# print(out.mean(), out.std())
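
One possible reference sketch, for checking your own attempt afterwards. It assumes the population standard deviation (`ddof=0`) and casts the input to float; both are choices of this sketch rather than requirements of the exercise.

```python
import numpy as np


def zscore(x: np.ndarray) -> np.ndarray:
    """Return z-scored version of x. Handle zero std by returning zeros."""
    x = np.asarray(x, dtype=float)  # assumption: integer input is promoted to float
    std = x.std()                   # population std (ddof=0)
    if std == 0:
        return np.zeros_like(x)     # constant input: every value equals the mean
    return (x - x.mean()) / std


rng = np.random.default_rng(0)
out = zscore(rng.normal(size=1000))
assert abs(out.mean()) < 1e-9 and abs(out.std() - 1.0) < 1e-9
assert not zscore(np.full(5, 3.0)).any()  # zero-variance input gives all zeros
```

An exact `std == 0` test is enough for truly constant input; a tolerance-based check may be preferable if near-constant arrays are expected.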

### 2) Missing-Value Imputer (Pandas)
Implement `impute_with_median(df, cols)` that replaces `NaN` in specified columns with the column median.

**Constraints:**
- Do not mutate the input DataFrame; return a copy.
- Columns may be non-numeric; raise a `ValueError` if median cannot be computed.
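- Preserve dtypes where possible (with default dtypes, an integer column that contains `NaN` is already stored as float).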

**Follow-up:** How would you handle grouped medians (e.g., per category)?

In [None]:
import pandas as pd

def impute_with_median(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Return a copy with NaNs in cols replaced by median values."""
    # TODO: implement
    raise NotImplementedError

# Example usage
# data = pd.DataFrame({"a": [1, 2, np.nan], "b": [3.0, np.nan, 5.0]})
# print(impute_with_median(data, ["a", "b"]))
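
A possible reference sketch for comparison after your own attempt. It treats any non-numeric dtype as "median cannot be computed", which is one interpretation of the constraint. The helper `impute_with_group_median` is a hypothetical name added here to illustrate the grouped-median follow-up.

```python
import numpy as np
import pandas as pd


def impute_with_median(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Return a copy with NaNs in cols replaced by median values."""
    out = df.copy()  # never touch the caller's frame
    for col in cols:
        if not pd.api.types.is_numeric_dtype(out[col]):
            raise ValueError(f"Column {col!r} is not numeric; cannot compute a median.")
        out[col] = out[col].fillna(out[col].median())
    return out


def impute_with_group_median(df: pd.DataFrame, col: str, by: str) -> pd.DataFrame:
    """Hypothetical follow-up helper: fill NaNs in col with the median of each `by` group."""
    out = df.copy()
    group_median = out.groupby(by)[col].transform("median")  # aligned to the original rows
    out[col] = out[col].fillna(group_median)
    return out


data = pd.DataFrame({"a": [1, 2, np.nan], "b": [3.0, np.nan, 5.0]})
filled = impute_with_median(data, ["a", "b"])
assert filled["a"].tolist() == [1.0, 2.0, 1.5]
assert data["a"].isna().sum() == 1  # the input is unchanged
```

The loop runs over column names only; the fill itself stays vectorized.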

## Medium: Classical ML & Evaluation
Practice feature engineering, model evaluation, and algorithmic thinking.

### 3) Sliding-Window Feature Extraction (NumPy)
Given a 1D array `x` and window size `k`, generate a 2D array of shape `(len(x) - k + 1, k)` where each row is a contiguous window.

**Constraints:**
- Use stride tricks (`np.lib.stride_tricks.as_strided`) or vectorization; avoid Python loops.
- Raise `ValueError` if `k` is invalid (for example, `k < 1` or `k > len(x)`).

**Test yourself:** Compare against a loop-based baseline for correctness.

In [None]:
def sliding_windows(x: np.ndarray, k: int) -> np.ndarray:
    """Return strided 2D view of sliding windows of length k."""
    # TODO: implement
    raise NotImplementedError

# x = np.arange(6)
# print(sliding_windows(x, 3))
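
A possible reference sketch using `as_strided`, for comparison after your own attempt. Returning a read-only view is a choice made here: the windows overlap in memory, so writing through the view would silently change several rows at once.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def sliding_windows(x: np.ndarray, k: int) -> np.ndarray:
    """Return strided 2D view of sliding windows of length k."""
    x = np.asarray(x)
    if x.ndim != 1:
        raise ValueError("x must be 1D")
    n = x.shape[0]
    if not isinstance(k, (int, np.integer)) or k < 1 or k > n:
        raise ValueError(f"k must be an integer in [1, {n}], got {k!r}")
    step = x.strides[0]  # bytes between consecutive elements of x
    # Moving one row down and moving one column right both advance by one element.
    return as_strided(x, shape=(n - k + 1, k), strides=(step, step), writeable=False)


x = np.arange(6)
baseline = np.array([x[i:i + 3] for i in range(len(x) - 3 + 1)])  # loop-based check
assert np.array_equal(sliding_windows(x, 3), baseline)
```

On NumPy 1.20 or newer, `np.lib.stride_tricks.sliding_window_view(x, k)` produces the same windows with bounds checking built in.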

### 4) Custom AUC (Scikit-Learn)
Implement `binary_auc(y_true, y_score)` without using `roc_auc_score`. You may use NumPy and `sklearn.metrics.roc_curve` to obtain the curve points, but implement the trapezoidal integration yourself.

**Constraints:**
- Handle ties in scores robustly.
- Validate input shapes and value ranges.

**Follow-up:** Discuss how class imbalance affects AUC interpretation.

In [None]:
from sklearn.metrics import roc_curve

def binary_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Compute AUC via ROC curve and trapezoidal rule, without roc_auc_score."""
    # TODO: implement
    raise NotImplementedError

# y_true = np.array([0, 0, 1, 1])
# y_score = np.array([0.1, 0.4, 0.35, 0.8])
# print(binary_auc(y_true, y_score))
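
A possible reference sketch, for comparison after your own attempt. It builds the ROC points with NumPy alone rather than calling `roc_curve`, so that the tie handling is visible: tied scores collapse into a single ROC point. The validation rules (labels in {0, 1}, both classes present, finite scores) are assumptions of this sketch.

```python
import numpy as np


def binary_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Compute AUC via ROC curve and trapezoidal rule, without roc_auc_score."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    if y_true.ndim != 1 or y_true.shape != y_score.shape:
        raise ValueError("y_true and y_score must be 1D arrays of equal length")
    if not np.isin(y_true, (0, 1)).all():
        raise ValueError("y_true must contain only 0 and 1")
    if np.unique(y_true).size != 2:
        raise ValueError("both classes must be present")
    if not np.isfinite(y_score).all():
        raise ValueError("y_score must be finite")

    order = np.argsort(-y_score, kind="mergesort")  # descending score
    y_sorted = y_true[order].astype(int)
    s_sorted = y_score[order]
    # Keep only the last index of each run of equal scores.
    idx = np.r_[np.flatnonzero(np.diff(s_sorted)), y_sorted.size - 1]
    tps = np.cumsum(y_sorted)[idx]  # true positives at each threshold
    fps = (idx + 1) - tps           # false positives at each threshold
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]
    # Trapezoidal rule: width times average height of each segment.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
assert abs(binary_auc(y_true, y_score) - 0.75) < 1e-12
```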

### 5) Logistic Regression from Scratch (Binary)
Implement logistic regression using batch gradient descent.

**Requirements:**
- Functions: `sigmoid(z)`, `predict_proba(X, w)`, `loss(X, y, w, lambda_)`, `fit_logreg(X, y, lr, epochs, lambda_, tol)` returning weights.
- Add L2 regularization (hyperparameter `lambda_`).
- Stop early when the loss plateaus (change between epochs smaller than `tol`).

**Test yourself:** Compare learned weights to `sklearn.linear_model.LogisticRegression` on a toy dataset (match the regularization strength and the `fit_intercept` setting for a fair comparison).

In [None]:
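def sigmoid(z: np.ndarray) -> np.ndarray:
    """Elementwise logistic function 1 / (1 + exp(-z))."""
    # TODO
    raise NotImplementedError

def predict_proba(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Return P(y = 1 | x) for each row of X."""
    # TODO
    raise NotImplementedError

def loss(X: np.ndarray, y: np.ndarray, w: np.ndarray, lambda_: float = 0.0) -> float:
    """Mean binary cross-entropy plus the L2 penalty weighted by lambda_."""
    # TODO
    raise NotImplementedError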

def fit_logreg(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 0.1,
    epochs: int = 1000,
    lambda_: float = 0.0,
    tol: float = 1e-6,
) -> np.ndarray:
    """Train logistic regression with L2 penalty and early stopping. Return weights."""
    # TODO
    raise NotImplementedError

# Example sanity check (uncomment to test)
# from sklearn.datasets import make_classification
# X, y = make_classification(n_samples=200, n_features=4, random_state=42)
# w = fit_logreg(X, y, lr=0.1, epochs=5000, lambda_=0.1)
# preds = predict_proba(X, w) >= 0.5
# print((preds == y).mean())
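
A possible reference sketch, for comparison after your own attempt. Several details are assumptions of this sketch rather than part of the exercise: there is no separate bias term (append a column of ones to `X` if you want one), the penalty is written as `0.5 * lambda_ * ||w||^2`, and "plateau" means the loss changed by less than `tol` between consecutive epochs.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    # Clip to keep exp() from overflowing for large negative z.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def predict_proba(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    return sigmoid(X @ w)


def loss(X: np.ndarray, y: np.ndarray, w: np.ndarray, lambda_: float = 0.0) -> float:
    p = np.clip(predict_proba(X, w), 1e-12, 1 - 1e-12)  # avoid log(0)
    nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(nll + 0.5 * lambda_ * np.sum(w ** 2))


def fit_logreg(X, y, lr=0.1, epochs=1000, lambda_=0.0, tol=1e-6) -> np.ndarray:
    """Batch gradient descent with L2 penalty and early stopping. Return weights."""
    w = np.zeros(X.shape[1])
    prev = loss(X, y, w, lambda_)
    for _ in range(epochs):
        # Gradient of the mean cross-entropy plus the gradient of the L2 term.
        grad = X.T @ (predict_proba(X, w) - y) / len(y) + lambda_ * w
        w -= lr * grad
        cur = loss(X, y, w, lambda_)
        if abs(prev - cur) < tol:  # loss has plateaued
            break
        prev = cur
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)
w = fit_logreg(X, y, lr=0.5, epochs=2000, lambda_=0.01)
print("train accuracy:", ((predict_proba(X, w) >= 0.5) == y).mean())
```

When comparing against scikit-learn, note that its `C` parameter is an inverse regularization strength applied to the summed loss, so the two penalties need rescaling before the weights can be expected to agree.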

## Expert: Deep Learning & ML Systems
Apply PyTorch, optimization, and production-aware thinking.

### 6) Minimal MLP in PyTorch (Binary Classification)
Build and train a two-layer MLP on synthetic data.

**Requirements:**
- Model: `Linear -> ReLU -> Dropout -> Linear`.
- Use BCEWithLogitsLoss; track training loss and accuracy per epoch.
- Implement the `train_epoch` and `evaluate` loops by hand, without high-level training helpers.
- Add deterministic seeding.

**Follow-up:** Explain when to prefer `nn.BCEWithLogitsLoss` vs `nn.BCELoss`.

In [None]:
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

class MLP(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 32, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def set_seed(seed: int = 0):
    torch.manual_seed(seed)
    np.random.seed(seed)

def train_epoch(model, loader, optimizer, criterion, device: str = "cpu") -> float:
    """One training epoch; return average loss."""
    # TODO: implement loop
    raise NotImplementedError

def evaluate(model, loader, criterion, device: str = "cpu") -> tuple[float, float]:
    """Evaluate; return (avg_loss, accuracy)."""
    # TODO: implement loop
    raise NotImplementedError

# Example synthetic run (uncomment to test)
# set_seed(42)
# X = torch.randn(1000, 10)
# y = (X.sum(dim=1) + 0.2 * torch.randn(1000) > 0).float()
# dataset = TensorDataset(X, y)
# loader = DataLoader(dataset, batch_size=32, shuffle=True)
# model = MLP(in_dim=10)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# criterion = nn.BCEWithLogitsLoss()
# for epoch in range(5):
#     train_loss = train_epoch(model, loader, optimizer, criterion)
#     val_loss, acc = evaluate(model, loader, criterion)  # same data as training here; use a held-out split for a real validation score
#     print(epoch, train_loss, val_loss, acc)
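
A possible reference sketch of the two loops, for comparison after your own attempt. It assumes the model returns logits of shape `(batch,)` and the targets are floats in {0, 1}. The helper `run_demo` and the small `nn.Sequential` model inside it are hypothetical stand-ins so the block runs on its own; they are not the `MLP` class above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_epoch(model, loader, optimizer, criterion, device: str = "cpu") -> float:
    """One training epoch; return the sample-weighted average loss."""
    model.train()  # enables dropout
    total_loss, n_seen = 0.0, 0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * xb.size(0)
        n_seen += xb.size(0)
    return total_loss / n_seen


@torch.no_grad()
def evaluate(model, loader, criterion, device: str = "cpu") -> tuple[float, float]:
    """Evaluate; return (avg_loss, accuracy)."""
    model.eval()  # disables dropout
    total_loss, correct, n_seen = 0.0, 0, 0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        logits = model(xb)
        total_loss += criterion(logits, yb).item() * xb.size(0)
        correct += ((logits > 0).float() == yb).sum().item()  # logit > 0 means p > 0.5
        n_seen += xb.size(0)
    return total_loss / n_seen, correct / n_seen


def run_demo(seed: int = 0) -> tuple[float, float, float]:
    """Hypothetical demo: train a small stand-in model with a held-out split."""
    torch.manual_seed(seed)
    X = torch.randn(1000, 10)
    y = (X.sum(dim=1) > 0).float()
    train_loader = DataLoader(TensorDataset(X[:800], y[:800]), batch_size=32, shuffle=True)
    val_loader = DataLoader(TensorDataset(X[800:], y[800:]), batch_size=64)
    # nn.Flatten(0) turns (batch, 1) logits into (batch,), like squeeze(-1) in MLP.
    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1), nn.Flatten(0))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(10):
        train_loss = train_epoch(model, train_loader, optimizer, criterion)
    val_loss, acc = evaluate(model, val_loader, criterion)
    return train_loss, val_loss, acc


if __name__ == "__main__":
    print(run_demo())
```

Weighting each batch loss by its size keeps the epoch average correct when the last batch is smaller than the rest.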


### 7) Transformer Block Forward Pass (PyTorch)
Implement a simplified Transformer encoder block forward pass with pre-layer normalization.

**Components:**
- Multi-head self-attention (`nn.MultiheadAttention` with `batch_first=True`).
- Feed-forward network: `Linear -> GELU -> Dropout -> Linear`.
- Residual connections & layer norms (pre-norm style).
- Dropout on attention output and FFN output.

**API:** `forward(x: Tensor) -> Tensor` where `x` has shape `(batch, seq, d_model)`.

**Follow-up:** Discuss why pre-norm can stabilize training for deep stacks.

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, nhead: int, dim_ff: int, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, dropout=p_drop, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ff),
            nn.GELU(),
            nn.Dropout(p_drop),
            nn.Linear(dim_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # TODO: implement pre-norm residual block
        raise NotImplementedError

# Sanity check (uncomment to test)
# blk = TransformerBlock(d_model=32, nhead=4, dim_ff=64)
# dummy = torch.randn(2, 5, 32)
# out = blk(dummy)
# print(out.shape)
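
A possible reference sketch of the pre-norm forward pass, for comparison after your own attempt. The constructor is repeated from the cell above so the block runs on its own. Sharing one `nn.Dropout` module between both residual branches follows the constructor as given; masks are not handled, which is a simplification of this sketch.

```python
import torch
from torch import nn


class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, nhead: int, dim_ff: int, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, dropout=p_drop, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ff),
            nn.GELU(),
            nn.Dropout(p_drop),
            nn.Linear(dim_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm: normalize inside each branch, then add to the untouched input.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # self-attention: q = k = v
        x = x + self.dropout(attn_out)
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x


blk = TransformerBlock(d_model=32, nhead=4, dim_ff=64)
out = blk(torch.randn(2, 5, 32))
assert out.shape == (2, 5, 32)  # the block preserves (batch, seq, d_model)
```

Because the layer norms sit inside the branches, the skip path from input to output is a plain sum, which is a useful starting point for the follow-up discussion.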

### 8) Offline Metrics vs. Online Metrics (Design)
You are asked to ship a recommendation model trained offline. Describe:
- Three offline metrics you would report and why.
- Two online metrics you would track in an A/B test, plus guardrail metrics.
- How you would design a canary rollout to mitigate risk.

Write concise bullets; aim for depth and clarity.
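
To make the offline side concrete, here is a minimal sketch of two ranking metrics that are commonly reported for recommenders, recall@k and NDCG@k with binary relevance. The function names and the toy item IDs are hypothetical, and these two metrics are examples, not the expected answer to the prompt.

```python
import numpy as np


def recall_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


def ndcg_at_k(relevant: set, ranked: list, k: int) -> float:
    """NDCG@k with binary gains: hits near the top count more than hits lower down."""
    gains = np.array([1.0 if item in relevant else 0.0 for item in ranked[:k]])
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))  # positions 1..k
    dcg = float(np.sum(gains * discounts))
    ideal_hits = min(len(relevant), k)  # best case: all relevant items ranked first
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg if idcg > 0 else 0.0


relevant = {"a", "c"}                # toy ground truth for one user
ranked = ["a", "b", "c", "d"]        # toy model output, best first
print(recall_at_k(relevant, ranked, k=3), ndcg_at_k(relevant, ranked, k=3))
```

In practice these would be averaged over users and compared against the online metrics to check that offline gains carry over.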

### 9) Feature Store Consistency (Design)
Explain how you would minimize **training/serving skew** when using a feature store.

Cover:
- Point-in-time correctness.
- Backfills and data versioning.
- Validation/monitoring signals to catch drift or schema changes.
- Operational playbooks when a breaking change occurs.

Write a short, structured proposal (bullets).
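
Point-in-time correctness can be illustrated with a small pandas sketch: each label row is joined to the latest feature value known at or before the label's timestamp, never a later one. The table layout, the column names (`user_id`, `event_time`, `feature_time`, `purchases_30d`), and the helper `point_in_time_join` are hypothetical; a real feature store would provide its own equivalent of this join.

```python
import pandas as pd


def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Attach to each label the most recent feature row with feature_time <= event_time."""
    return pd.merge_asof(
        labels.sort_values("event_time"),      # merge_asof requires sorted keys
        features.sort_values("feature_time"),
        left_on="event_time",
        right_on="feature_time",
        by="user_id",                          # match within the same entity only
        direction="backward",                  # never look into the future
    )


labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [0, 1, 1],
})
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-12"]),
    "purchases_30d": [3, 7, 2],
})

training = point_in_time_join(labels, features)
print(training[["user_id", "event_time", "feature_time", "purchases_30d", "label"]])
```

User 2's only feature row is dated after the label event, so that row gets a missing value instead of leaking future information; how to fill such gaps is a separate design decision.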

## Tips for Interview Readiness
- Articulate trade-offs (compute vs. latency, bias vs. variance, offline vs. online metrics).
- Narrate your debugging steps; interviewers value clarity.
- Add lightweight tests to prove correctness under edge cases.
- Keep code modular and readable; prefer pure functions where possible.