
# JobInterviewGuide_Workshop — Personalized Study Guide

**Focus areas (from your quiz results):**  
1) Train/Validation/Test Split & Data Leakage  
2) Linear Regression Residuals & Diagnostics  
3) R-squared (R²) Interpretation  
4) Gradient Descent & Learning Rate

This notebook mirrors the tone and structure of the workshop materials: short **concept primers**, followed by **guided practice** and **Try-It** scaffolding cells.


## 1) Train / Validation / Test Split & Data Leakage



**Why it matters:**  
We evaluate models on **unseen data** to estimate real-world performance. **Data leakage** happens when information from the test (or future) data leaks into model training or preprocessing steps, leading to **overly optimistic** metrics.

**Key rules:**  
- Split first, then fit every transform (scaling, imputation, encoding) on the **training set only** and apply the fitted transform to val/test.  
- Keep target leakage out of features (e.g., post-outcome variables).  
- Use a validation set (or cross-validation) for model selection; touch the test set **once**, for the final evaluation.
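
The rules above can be sketched with a `Pipeline` inside `cross_val_score`, so each fold refits the scaler on that fold's training portion only. This is a minimal illustration rather than the workshop's reference solution: the `Ridge` model, `alpha=1.0`, the 80/20 split, and `cv=5` are assumptions chosen for the example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Hold out the test set first; it is not touched during model selection.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler lives inside the Pipeline, so every CV fold fits it
# on that fold's training portion only (no leakage into the held-out fold).
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
cv_scores = cross_val_score(pipe, X_dev, y_dev, cv=5, scoring="r2")
print("CV R² per fold:", cv_scores.round(3))
print("CV R² mean:", round(cv_scores.mean(), 3))

# One-time final evaluation: refit on all development data, score on test.
pipe.fit(X_dev, y_dev)
test_r2 = pipe.score(X_test, y_test)
print("Test R²:", round(test_r2, 3))
```

Model selection (for example, comparing several `alpha` values) would use only `cv_scores`; the test score is reported once at the end.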


In [None]:

# Guided Practice: Correct vs. leaky preprocessing
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np

X, y = load_diabetes(return_X_y=True)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Correct pipeline: scaler fit on TRAIN only, applied to val/test via Pipeline
pipe_correct = Pipeline([('scaler', StandardScaler()), ('lr', LinearRegression())])
pipe_correct.fit(X_train, y_train)
r2_val_correct = pipe_correct.score(X_val, y_val)
r2_test_correct = pipe_correct.score(X_test, y_test)

# Leaky approach: scaler fit on ALL data (simulating leakage)
scaler_leaky = StandardScaler().fit(np.vstack([X_train, X_val, X_test]))
X_train_leaky = scaler_leaky.transform(X_train)
X_val_leaky   = scaler_leaky.transform(X_val)
X_test_leaky  = scaler_leaky.transform(X_test)

lr_leaky = LinearRegression().fit(X_train_leaky, y_train)
r2_val_leaky = lr_leaky.score(X_val_leaky, y_val)
r2_test_leaky = lr_leaky.score(X_test_leaky, y_test)

print("Correct   R² (val, test):", round(r2_val_correct, 3), round(r2_test_correct, 3))
print("**Leaky** R² (val, test):", round(r2_val_leaky, 3), round(r2_test_leaky, 3))
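print("\nNote: OLS predictions do not change under feature rescaling, so both rows should match here.")
print("Leakage inflates metrics when the leaked step affects the fit (regularization, feature selection, target encoding).")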



In [None]:

# Try-It: Replace LinearRegression with a regularized model (Ridge or Lasso) and compare again.
# from sklearn.linear_model import Ridge
# pipe_correct = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge(alpha=1.0))])
# pipe_correct.fit(X_train, y_train)
# print("Ridge R² (val, test):", round(pipe_correct.score(X_val, y_val),3), round(pipe_correct.score(X_test, y_test),3))


## 2) Linear Regression Residuals & Diagnostics



**Residual** = observed − predicted. Well-behaved residuals are **centered around 0**, show **no clear pattern** against the fitted values, and have roughly **constant variance**.

**Checks you should do in interviews:**  
- Plot residuals vs. predictions (look for randomness).  
- Plot histogram of residuals (approx. symmetric).  
- Investigate outliers/influential points.
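
One way to investigate outliers and influential points is to compute leverage and studentized residuals directly. The sketch below uses synthetic data with one injected outlier. The data, the variable names, and the |residual| > 3 cutoff are assumptions for illustration (the cutoff is a common rule of thumb, not a fixed standard).

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, size=x.shape[0])
y[-1] += 15.0  # inject one artificial outlier at the high-x end

# Design matrix with an intercept column; ordinary least squares fit.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

# Leverage = diagonal of the hat matrix H = A (A^T A)^{-1} A^T.
H = A @ np.linalg.inv(A.T @ A) @ A.T
leverage = np.diag(H)

# Internally studentized residuals: residual / (sigma * sqrt(1 - leverage)).
n, p = A.shape
sigma2 = residuals @ residuals / (n - p)
student = residuals / np.sqrt(sigma2 * (1.0 - leverage))

# Flag points whose studentized residual exceeds the assumed cutoff of 3.
flagged = np.where(np.abs(student) > 3.0)[0]
print("Flagged indices:", flagged)
print("Leverage of flagged points:", leverage[flagged].round(3))
```

A point with both a large studentized residual and high leverage deserves a closer look, because it can pull the fitted line toward itself.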


In [None]:

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Synthetic linear data with noise + slight heteroscedasticity
rng = np.random.RandomState(0)
X = np.linspace(0, 10, 200).reshape(-1,1)
y_true = 3.0 * X.squeeze() + 5.0
noise = rng.normal(0, 1 + 0.2*X.squeeze(), size=X.shape[0])  # variance grows with X
y = y_true + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
res = y_test - y_pred

print("Coefficients:", lr.coef_, "Intercept:", round(lr.intercept_,3))
print("Test R²:", round(lr.score(X_test, y_test), 3))

# Residuals vs predictions
plt.figure()
plt.scatter(y_pred, res, alpha=0.7)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted")
plt.ylabel("Residual (y - ŷ)")
plt.title("Residuals vs Predicted")
plt.show()

# Residual histogram
plt.figure()
plt.hist(res, bins=20)
plt.title("Residual Distribution")
plt.xlabel("Residual")
plt.ylabel("Count")
plt.show()


In [None]:

# Try-It: Add polynomial features and compare the residual plot with the linear fit above.
# Reuses X_train, X_test, y_train, y_test and LinearRegression from the previous cell.
# Polynomial terms can remove curvature in residuals; they do not fix non-constant variance.
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

poly_pipe = Pipeline([('poly', PolynomialFeatures(degree=2, include_bias=False)),
                      ('lr', LinearRegression())])
poly_pipe.fit(X_train, y_train)
y_pred2 = poly_pipe.predict(X_test)

res2 = y_test - y_pred2
print("Poly(deg=2) Test R²:", round(poly_pipe.score(X_test, y_test), 3))

plt.figure()
plt.scatter(y_pred2, res2, alpha=0.7)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted (poly)")
plt.ylabel("Residual")
plt.title("Residuals vs Predicted (Polynomial)")
plt.show()


## 3) R-squared (R²) Interpretation



**R²** measures the proportion of variance in the target explained by the model.  

**Interview-ready notes:**  
- $R^2 = 1 - \frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2}$  
- High R² ≠ good model: for least-squares fits, training R² never decreases when features are added, so complexity alone can inflate it.  
- Prefer **Adjusted R²** for multiple regression to penalize unnecessary features.  
- Always pair R² with residual diagnostics and hold-out performance.
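
The formula and the adjusted variant can be checked by hand on a small example. The helper names `r2_manual` and `adjusted_r2` and the toy arrays below are hypothetical, introduced only for this sketch; the adjusted formula is the standard 1 − (1 − R²)(n − 1)/(n − p − 1).

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, computed directly from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, n, p):
    """Adjusted R² for n samples and p features (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Toy values: SS_res = 1.75 and SS_tot = 40, so R² = 1 - 1.75/40.
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0, 11.5])

r2 = r2_manual(y_true, y_pred)
adj = adjusted_r2(r2, n=len(y_true), p=1)
print("Manual R²:  ", r2)
print("sklearn R²: ", r2_score(y_true, y_pred))
print("Adjusted R²:", adj)

# Baseline: always predicting the mean of y gives R² = 0.
baseline = r2_manual(y_true, np.full_like(y_true, y_true.mean()))
print("Mean-predictor R²:", baseline)
```

The baseline line is a useful interview talking point: R² measures improvement over predicting the mean, which is why it can go negative on hold-out data when a model does worse than that baseline.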


In [None]:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Compare simple vs. overfit model on the same data
rng = np.random.RandomState(7)
X = np.linspace(-3, 3, 120).reshape(-1,1)
y = 2*X.squeeze()**2 + rng.normal(0, 1.5, size=X.shape[0])  # quadratic relationship

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    "Linear (deg=1)": Pipeline([('poly', PolynomialFeatures(1, include_bias=False)), ('lr', LinearRegression())]),
    "Moderate (deg=2)": Pipeline([('poly', PolynomialFeatures(2, include_bias=False)), ('lr', LinearRegression())]),
    "High (deg=8)": Pipeline([('poly', PolynomialFeatures(8, include_bias=False)), ('lr', LinearRegression())]),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name:>16} | R²={r2_score(y_test, y_pred):.3f}  MSE={mean_squared_error(y_test, y_pred):.3f}")


In [None]:

# Try-It: Implement Adjusted R²
# adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
# Set n = number of samples in test, p = number of features after polynomial expansion (exclude bias).


## 4) Gradient Descent & Learning Rate



The **learning rate (η)** controls the **step size** of each update along the negative gradient of the loss: w ← w − η·∇L(w).

**Failure modes:**  
- Too large → divergence or oscillation.  
- Too small → very slow convergence.

**Interview tip:** Show learning curves under different η and explain the trade-off.
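
For a 1D quadratic loss such as L(w) = (w − 3)², the trade-off can be made exact: each step multiplies the distance to the minimum by (1 − 2η), so gradient descent converges only when |1 − 2η| < 1, that is, 0 < η < 1. The sketch below checks this closed form against the iterative update. The function names and the chosen η values are assumptions for illustration, and the result is specific to this quadratic.

```python
import math

def gd_final_w(eta, steps, w0=0.0, target=3.0):
    """Run plain gradient descent on L(w) = (w - target)^2."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * (w - target)  # gradient is 2 * (w - target)
    return w

def closed_form_w(eta, steps, w0=0.0, target=3.0):
    """Closed form: w_t = target + (1 - 2*eta)^t * (w0 - target)."""
    return target + (1.0 - 2.0 * eta) ** steps * (w0 - target)

results = {}
for eta in [0.05, 0.2, 0.8, 1.2]:
    factor = 1.0 - 2.0 * eta
    status = "converges" if abs(factor) < 1.0 else "diverges"
    w_iter = gd_final_w(eta, steps=10)
    w_exact = closed_form_w(eta, steps=10)
    results[eta] = (factor, status, w_iter, w_exact)
    print(f"eta={eta:<4} factor={factor:+.2f} ({status}) w_10={w_iter:.4f}")
```

A negative factor with magnitude below 1 (η = 0.8 here) means the iterate overshoots and oscillates around the minimum while still converging; a magnitude above 1 (η = 1.2) means each step lands farther away than the last.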


In [None]:

import numpy as np

# Simple 1D quadratic loss: L(w) = (w - 3)^2
def grad(w): 
    return 2*(w - 3)

def run_gd(eta, steps=20, w0=0.0):
    w = w0
    traj = [(0, w, (w-3)**2)]
    for t in range(1, steps+1):
        w = w - eta*grad(w)
        traj.append((t, w, (w-3)**2))
    return traj

# Loop variable is named eta so it does not shadow the `lr` model from earlier cells.
for eta in [0.05, 0.2, 0.8, 1.2]:
    traj = run_gd(eta, steps=10, w0=0.0)
    final = traj[-1]
    print(f"eta={eta:<4} -> final w={final[1]:.4f}, loss={final[2]:.6f}")


In [None]:

# Try-It: Plot trajectories for two learning rates to visualize convergence behavior.
import matplotlib.pyplot as plt
import numpy as np

def run_vals(eta, steps=20, w0=0.0):
    w = w0
    ws, ls = [], []
    for t in range(steps):
        ws.append(w); ls.append((w-3)**2)
        w = w - eta*2*(w - 3)
    return np.array(ws), np.array(ls)

w1, l1 = run_vals(eta=0.2, steps=30, w0=0.0)
w2, l2 = run_vals(eta=1.2, steps=30, w0=0.0)

plt.figure()
plt.plot(l1, label="eta=0.2")
plt.plot(l2, label="eta=1.2")
plt.xlabel("Step")
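plt.ylabel("Loss")
plt.yscale("log")  # the eta=1.2 loss grows by orders of magnitude; log scale keeps both curves visible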
plt.title("Learning Rate vs. Convergence")
plt.legend()
plt.show()


## Wrap-Up & Interview Prompts



- Explain a recent case where you **prevented data leakage**. What checks did you implement?  
- Show a residual plot and describe whether linear model assumptions hold.  
- Interpret an **R²** of 0.35 vs. 0.85 in context. What else would you report?  
- Describe how you would **choose a learning rate** and detect divergence in practice.
