
# JobInterviewGuide_Workshop — Junior ML Specialist Prep

This notebook targets the exact areas from your quiz that need reinforcement.  
Each section includes a **concise concept recap** (Markdown) and **scaffolded code cells** (with `TODO` prompts).

**Focus areas from your quiz:**
1. Logistic Regression probabilities (sigmoid) ✅ *what the output means*  
2. Cross-Entropy (Log-Loss) vs Regularization ✅ *don’t confuse the roles*  
3. Decision Trees — what **leaf nodes** represent  
4. Classification Metrics — know which **apply to classification** vs **regression**  
5. Parametric vs Non-Parametric models — definitions + tiny experiment  
6. Feature Engineering — the real goal (not random features)  
7. Train/Validation/Test — the **validation** set’s purpose  
8. Gradient Descent — optimization, not scaling

> Tip: Run each code cell, fill in the TODOs, and keep short notes below each section.



## 1) Logistic Regression & the Sigmoid Output (Probability)

**Key point:** The sigmoid function \( \sigma(z) = 1/(1+e^{-z}) \) maps any real number to a value in the open interval \((0,1)\), which logistic regression reads as a **probability**.  
In logistic regression, \(z = w^T x + b\). The model predicts **P(y=1|x)**.

**Why it matters:** Your answer mixed this with residuals. Residuals are a regression concept; **sigmoid outputs probabilities**.
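
To make the link between \(z = w^T x + b\) and **P(y=1|x)** concrete, here is a minimal sketch. The weights, bias, and input are made-up illustrative values, not fitted parameters. It also checks that taking the log-odds of the output recovers \(z\).

```python
import math

def sigmoid(z):
    """Map a real-valued score (log-odds) to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x, b):
    """Return P(y=1 | x) for a logistic regression with weights w and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # z = w^T x + b
    return sigmoid(z)

# Hypothetical parameters and input, chosen only for illustration.
w_demo = [1.5, -2.0]
b_demo = 0.25
x_demo = [1.0, 0.5]   # gives z = 1.5 - 1.0 + 0.25 = 0.75

p_demo = predict_proba(w_demo, x_demo, b_demo)
log_odds = math.log(p_demo / (1.0 - p_demo))  # logit: the inverse of the sigmoid

print("P(y=1|x):", round(p_demo, 4))
print("log-odds recovered from p:", round(log_odds, 4))
```

The output is a probability of the positive class, not a residual: there is no "observed minus predicted" quantity anywhere in this computation.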


In [None]:

# TODO: Compute sigmoid probabilities for a set of log-odds (z values)
import math

z_values = [-4, -1, 0, 1, 3]
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probs = [sigmoid(z) for z in z_values]
print("z:", z_values)
print("sigmoid(z):", [round(p, 4) for p in probs])

# TODO: Given a probability threshold=0.5, convert these to class predictions (0/1).
threshold = 0.5
preds = [1 if p >= threshold else 0 for p in probs]
print("class predictions @0.5:", preds)



## 2) Cross-Entropy (Log-Loss) vs Regularization

**Cross-Entropy (Log-Loss)** measures how far the predicted probabilities are from the true labels (lower is better):  
\[ \mathcal{L}_{CE} = -\frac{1}{N}\sum_i \big( y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i) \big) \]

**Regularization (L1/L2)** penalizes large weights to control complexity:  
\( \mathcal{L}_{total} = \mathcal{L}_{CE} + \lambda \cdot \text{Penalty}(w) \)

> You chose "penalize large coefficients" for cross-entropy. That’s **regularization**, not CE.
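
One way to keep the two roles apart is to look at what each term takes as input. In the sketch below, the cross-entropy term sees only labels and predicted probabilities, while the L2 penalty sees only weights. The weight vectors and the value of `lam` are arbitrary illustrative choices.

```python
import math

def log_loss_single(y, p):
    """Cross-entropy contribution of one example with label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(wi * wi for wi in weights)

# Cross-entropy reacts to how wrong the predicted probability is (true label = 1 here).
for p in [0.9, 0.6, 0.1]:
    print(f"p={p:.1f}  loss={log_loss_single(1, p):.4f}")

# The penalty reacts only to weight size; predictions and labels never enter it.
small_w = [0.1, -0.2]   # hypothetical weights
large_w = [3.0, -4.0]   # hypothetical weights
print("penalty(small weights):", round(l2_penalty(small_w, 0.1), 4))
print("penalty(large weights):", round(l2_penalty(large_w, 0.1), 4))
```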


In [None]:

# TODO: Implement binary cross-entropy for a tiny example
import math

y_true = [1, 0, 1, 1, 0]
y_hat =  [0.9, 0.2, 0.6, 0.8, 0.1]  # predicted probabilities

def binary_cross_entropy(y, p):
    eps = 1e-12
    loss = 0.0
    for yt, pt in zip(y, p):
        pt = min(max(pt, eps), 1.0 - eps)
        loss += -(yt*math.log(pt) + (1-yt)*math.log(1-pt))
    return loss/len(y)

ce = binary_cross_entropy(y_true, y_hat)
print("Cross-Entropy Loss:", round(ce, 4))

# TODO: Add simple L2 regularization term to the total loss for weights w
w = [0.8, -0.3, 0.5]
lam = 0.1
l2 = lam * sum(wi*wi for wi in w)
total = ce + l2
print("L2 penalty:", round(l2, 4), "  Total loss:", round(total, 4))



## 3) Decision Trees — Leaf Nodes = Final Predictions

A **leaf node** is where splitting stops; it holds the **final prediction**: a class label for classification, or the mean target value for regression.  
> You selected "a split point" — that’s an **internal node**, not a leaf.
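
The difference between an internal node and a leaf can be sketched with a hand-built tree. The nested-dict layout and the 2.5 threshold are illustrative assumptions; this is not how scikit-learn stores its trees.

```python
# Internal nodes hold a test (a feature threshold); leaves hold only a prediction.
toy_tree = {
    "threshold": 2.5,            # internal node: asks "x <= 2.5?"
    "left": {"prediction": 0},   # leaf: final answer when x <= 2.5
    "right": {"prediction": 1},  # leaf: final answer when x > 2.5
}

def is_leaf(node):
    """A node is a leaf when it carries a prediction instead of a split."""
    return "prediction" in node

def predict_one(node, x):
    """Follow splits from the root until a leaf is reached, then return its prediction."""
    while not is_leaf(node):
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["prediction"]

for x in [1.5, 3.2]:
    print(f"x={x} -> class {predict_one(toy_tree, x)}")
```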


In [None]:

# TODO: Train a tiny decision tree and inspect leaf predictions
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0],[1],[2],[3],[4],[5]]
y = [0,0,0,1,1,1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X,y)

print(export_text(clf, feature_names=["x"]))

# TODO: Predict for x = 1.5 and x = 3.2, and explain which leaf each ends in.
print("Pred(1.5) =", clf.predict([[1.5]])[0])
print("Pred(3.2) =", clf.predict([[3.2]])[0])



## 4) Classification Metrics vs Regression Metrics

**Classification:** accuracy, precision, recall, F1, confusion matrix.  
**Regression:** R², MSE, RMSE, MAE.  
> You picked F1 as "not used for classification" — the true mismatch is **R²** (regression only).
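
As a rough check on what each metric needs as input, the sketch below computes precision, recall, and F1 by hand from confusion counts, and R² from continuous targets. The helper names and the small regression example are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, defined for continuous targets."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Classification: discrete labels -> counts -> precision / recall / F1.
tp, fp, fn, tn = confusion_counts([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1])
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print("precision:", round(precision, 3), " recall:", round(recall, 3), " F1:", round(f1, 3))

# Regression: continuous targets -> squared errors -> R^2.
r2 = r_squared([3.0, 5.0, 7.0], [2.5, 5.0, 7.5])
print("R^2:", round(r2, 4))
```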


In [None]:

# TODO: Compute accuracy, precision, recall, F1 on a small example
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1,0,1,1,0,1]
y_pred = [1,0,0,1,0,1]

print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy:", round(accuracy_score(y_true,y_pred),3))
print("Precision:", round(precision_score(y_true,y_pred),3))
print("Recall:", round(recall_score(y_true,y_pred),3))
print("F1:", round(f1_score(y_true,y_pred),3))

# QUESTION: Which metric belongs to regression only?  -> R^2



## 5) Parametric vs Non-Parametric — Mini Experiment

- **Parametric**: a fixed number of parameters that does not depend on the amount of training data (e.g., Linear Regression).  
- **Non-Parametric**: model complexity can grow with the data (e.g., KNN, Decision Trees).  
> You selected that both require linear relationships — **not true**.
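
The definitions above are about how much the model has to store, not about linearity. A minimal pure-Python sketch, with hypothetical helper names `fit_line` and `knn_predict`, shows that a fitted line keeps two numbers regardless of dataset size, while KNN has to keep every training point.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w*x + b; returns the two learned parameters."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

def knn_predict(train_xs, train_ys, x, k=3):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(zip(train_xs, train_ys), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

for n in [10, 100]:
    xs = [i / n for i in range(n)]
    ys = [x * x for x in xs]          # a non-linear target
    params = fit_line(xs, ys)
    print(f"n={n}: the line keeps {len(params)} numbers, KNN keeps {len(xs)} training points")
    print("   KNN prediction at x=0.5:", round(knn_predict(xs, ys, 0.5), 4))
```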


In [None]:

# TODO: Compare Linear Regression (parametric) vs KNN Regressor (non-parametric) on a non-linear function
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 60).reshape(-1,1)
y = np.sin(X).ravel() + 0.1*rng.randn(60)

# Parametric
lin = LinearRegression().fit(X,y)
mse_lin = mean_squared_error(y, lin.predict(X))

# Non-Parametric
knn = KNeighborsRegressor(n_neighbors=5).fit(X,y)
mse_knn = mean_squared_error(y, knn.predict(X))

print("MSE Linear Regression (parametric):", round(mse_lin,4))
print("MSE KNN Regressor (non-parametric):", round(mse_knn,4))

# Note: both MSEs are measured on the training data, so they show fit, not generalization.
# QUESTION: Which handles the non-linear pattern better here? Why?



## 6) Feature Engineering — Purpose

**Goal:** Transform raw data into **meaningful** features that improve signal-to-noise and model performance.  
It is **not** about randomly adding variables.

Examples: scaling, encoding, interaction terms, domain-specific aggregates.
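
A small sketch of what "meaningful" can look like: when the label depends on how far `x` is from zero, no single cutoff on the raw value separates the classes, but a cutoff on `x**2` does. The data points and the helper `best_threshold_accuracy` are illustrative assumptions.

```python
xs_raw = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
labels = [1 if abs(x) > 1.0 else 0 for x in xs_raw]   # class 1 = far from zero

def best_threshold_accuracy(values, labels):
    """Best accuracy of any rule 'predict 1 if value > t' (or its flipped version)."""
    candidates = sorted(set(values))
    thresholds = [candidates[0] - 1.0] + candidates
    best = 0.0
    for t in thresholds:
        preds = [1 if v > t else 0 for v in values]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        best = max(best, acc, 1.0 - acc)   # 1 - acc is the accuracy of the flipped rule
    return best

xs_squared = [x * x for x in xs_raw]   # engineered feature

raw_acc = best_threshold_accuracy(xs_raw, labels)
sq_acc = best_threshold_accuracy(xs_squared, labels)
print("best single-threshold accuracy on x:   ", round(raw_acc, 3))
print("best single-threshold accuracy on x**2:", round(sq_acc, 3))
```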


In [None]:

# TODO: Demonstrate a simple feature transform that improves separability
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(42)
X = rng.uniform(-2,2,(300,1))
y = (X[:,0]**2 + rng.normal(0,0.3,300) > 1.0).astype(int)

# Baseline: raw feature
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
lr = LogisticRegression().fit(X_tr, y_tr)
base_acc = accuracy_score(y_te, lr.predict(X_te))

# Add polynomial feature
poly = PolynomialFeatures(degree=2, include_bias=False)
X2 = poly.fit_transform(X)
X2_tr, X2_te, y_tr, y_te = train_test_split(X2, y, test_size=0.3, random_state=0)
lr2 = LogisticRegression().fit(X2_tr, y_tr)
poly_acc = accuracy_score(y_te, lr2.predict(X2_te))

print("Baseline acc (raw):", round(base_acc,3))
print("With polynomial feature acc:", round(poly_acc,3))

# QUESTION: Which feature set performs better? Why?



## 7) Train / Validation / Test — Role of Validation

- **Train:** fit model parameters  
- **Validation:** **tune hyperparameters**, monitor overfitting, pick the best model  
- **Test:** final unbiased evaluation

> You selected "balance data" for validation — its role is **model selection & tuning**.
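
The three roles can be sketched without any library. The split fractions, the seed, and the validation scores below are made-up values for illustration; `three_way_split` and `pick_best` are hypothetical helper names.

```python
import random

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

def pick_best(candidates, validation_score):
    """Model selection: keep the candidate with the highest validation score."""
    return max(candidates, key=validation_score)

train_idx, val_idx, test_idx = three_way_split(100)
print("sizes:", len(train_idx), len(val_idx), len(test_idx))

# Hypothetical validation accuracies for three values of a hyperparameter k.
val_scores = {1: 0.81, 5: 0.88, 9: 0.85}
best_k = pick_best(val_scores, val_scores.get)
print("k chosen on the validation set:", best_k)
```

The test indices are never consulted while choosing `k`; they are reserved for a single final evaluation of the chosen model.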


In [None]:

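# TODO: Show how validation affects hyperparameter selection with a simple grid search
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=6, n_informative=4, random_state=0)

# Hold out a test set first; cross-validation below only ever sees the training portion.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

param_grid = {"n_neighbors":[1,3,5,7,9]}

grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5, scoring="accuracy")
grid.fit(X_tr, y_tr)

print("Best params via CV (validation):", grid.best_params_)
print("Best CV accuracy:", round(grid.best_score_,3))

# The test set is scored once, after the hyperparameters are fixed.
print("Held-out test accuracy:", round(grid.score(X_te, y_te),3))

# TODO: Explain why we don't use the test set during this process.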



## 8) Gradient Descent — Optimization

**Definition:** Iteratively updates parameters in the direction of **negative gradient** to **minimize loss**.  
Common for linear/logistic regression and neural nets.

> You chose a scaling method — but GD is an **optimizer**, not a preprocessor.
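
For reference, the update rule can be written out for the one-feature linear model \(\hat{y}_i = w x_i + b\) with mean squared error, which is the setting used by the code cell in this section. The learning-rate symbol \(\eta\) is notation introduced here (it corresponds to `lr` in the code).

\[ \mathcal{L}_{MSE}(w,b) = \frac{1}{N}\sum_i \big( \hat{y}_i - y_i \big)^2 \]

\[ \frac{\partial \mathcal{L}_{MSE}}{\partial w} = \frac{2}{N}\sum_i \big( \hat{y}_i - y_i \big)\, x_i, \qquad \frac{\partial \mathcal{L}_{MSE}}{\partial b} = \frac{2}{N}\sum_i \big( \hat{y}_i - y_i \big) \]

\[ w \leftarrow w - \eta\,\frac{\partial \mathcal{L}_{MSE}}{\partial w}, \qquad b \leftarrow b - \eta\,\frac{\partial \mathcal{L}_{MSE}}{\partial b} \]

Nothing in these updates rescales the inputs; they only move \(w\) and \(b\) toward lower loss.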


In [None]:

# TODO: Implement gradient descent to fit y = wx + b on synthetic data
import numpy as np

rng = np.random.RandomState(1)
X = rng.rand(200,1)
true_w, true_b = 2.0, -0.5
y = true_w*X[:,0] + true_b + rng.normal(0,0.05,200)

w, b = 0.0, 0.0
lr = 0.5
for step in range(200):
    y_hat = w*X[:,0] + b
    # gradients of MSE loss with respect to w and b
    dw = (2/len(X)) * np.sum((y_hat - y) * X[:,0])
    db = (2/len(X)) * np.sum((y_hat - y))
    w -= lr*dw
    b -= lr*db
    if step % 50 == 0:
        # recompute the loss with the updated parameters so it matches the printed w, b
        mse = np.mean((w*X[:,0] + b - y)**2)
        print(f"step={step:3d}  w={w:.3f}  b={b:.3f}  MSE={mse:.4f}")

print("Estimated w,b:", round(w,3), round(b,3), "  (true:", true_w, true_b, ")")



---

### ✅ How to Use This Notebook
- Work through sections in order.
- Replace `TODO` stubs with your own code/notes.
- If something feels shaky, revisit the earlier study materials (class notebooks) for that topic.

When you finish, **export to PDF** and keep it as your *Interview Readiness Workbook*.
