# Supervised Learning Assessment

> **"The goal of supervised learning is to learn a mapping from inputs to outputs based on example input-output pairs."** - Tom Mitchell

## Assessment Overview

This comprehensive assessment evaluates your mastery of supervised learning algorithms, their mathematical foundations, and practical applications. The questions test both theoretical understanding and practical implementation skills.

### Assessment Structure
- **15 Questions** covering Regression, Classification, Ensemble Methods, and Model Evaluation
- **Difficulty Levels**: Foundational (1-5), Intermediate (6-10), Advanced (11-15)
- **Time Limit**: 2.5 hours
- **Format**: Mix of theoretical, computational, and implementation problems

### Learning Objectives Tested
- Linear and non-linear regression techniques
- Classification algorithms and decision boundaries
- Ensemble methods and model combination
- Model evaluation and validation strategies
- Feature selection and dimensionality reduction
- Regularization and overfitting prevention

---

## Instructions

1. **Read each question carefully** - Many questions have multiple parts
2. **Show your work** - Partial credit will be given for correct methodology
3. **Explain your reasoning** - Understanding the "why" is as important as the "how"
4. **Use appropriate notation** - Mathematical rigor is expected
5. **Connect to real-world applications** - Where relevant, explain practical implications

**Good luck!**


## Question 1: Linear Regression Fundamentals (Foundational)

Consider a simple linear regression model: y = β₀ + β₁x + ε, where ε ~ N(0, σ²).

**a)** Derive the normal equations for estimating β₀ and β₁ using the method of least squares.

**b)** Show that the least squares estimator β̂₁ is unbiased, i.e., E[β̂₁] = β₁.

**c)** Calculate the variance of β̂₁ and explain what factors influence this variance.

**d)** In the context of machine learning, explain:
   - Why we use the sum of squared errors as the loss function
   - The assumptions required for linear regression to work well
   - How to detect and handle violations of these assumptions

---

## Question 2: Logistic Regression and Classification (Foundational)

For binary classification using logistic regression with sigmoid function σ(z) = 1/(1 + e^(-z)):

**a)** Derive the log-likelihood function for logistic regression.

**b)** Show that the gradient of the log-likelihood log L(w) with respect to the weights w is:
   ∇w log L = Xᵀ(y - ŷ), where ŷ = σ(Xw)

**c)** Explain why we cannot use the normal equations to solve logistic regression analytically.

**d)** Compare logistic regression with linear regression in terms of:
   - Output interpretation
   - Decision boundaries
   - Optimization methods
   - When to use each approach

---

## Question 3: Decision Trees and Information Theory (Foundational)

Consider a dataset with 100 samples, 60% class A and 40% class B.

**a)** Calculate the entropy of the class distribution.

**b)** A feature splits the data into two groups:
   - Group 1: 30 samples (20 class A, 10 class B)
   - Group 2: 70 samples (40 class A, 30 class B)
   
   Calculate the information gain of this split.

**c)** Explain how decision trees handle:
   - Categorical features
   - Missing values
   - Overfitting

**d)** Compare decision trees with linear models in terms of:
   - Interpretability
   - Handling non-linear relationships
   - Feature interactions
   - Computational complexity

---

## Question 4: Model Evaluation and Validation (Foundational)

You have trained a binary classifier and obtained the following confusion matrix:

|                | Predicted 0 | Predicted 1 |
|----------------|-------------|-------------|
| **Actual 0**   | 80          | 20          |
| **Actual 1**   | 15          | 85          |

**a)** Calculate accuracy, precision, recall, and F1-score.

**b)** Explain what additional information you would need to compute the ROC-AUC score, and why the confusion matrix alone is not enough.

**c)** Explain the trade-off between precision and recall, and when you might prioritize one over the other.

**d)** Design a cross-validation strategy for:
   - Small dataset (n=100)
   - Large dataset (n=1,000,000)
   - Time series data
   - Imbalanced dataset

---

## Question 5: Regularization and Overfitting (Foundational)

Consider linear regression with L2 regularization (Ridge): J(w) = ||y - Xw||² + λ||w||²

**a)** Derive the closed-form solution for Ridge regression.

**b)** Show that Ridge regression always has a unique solution, unlike ordinary least squares.

**c)** Explain how the regularization parameter λ affects:
   - Model complexity
   - Bias-variance trade-off
   - Feature selection

**d)** Compare L1 (Lasso) and L2 (Ridge) regularization:
   - Mathematical differences
   - Effect on feature selection
   - When to use each method
   - Computational considerations

---


## Question 6: Support Vector Machines (Intermediate)

Consider a linearly separable dataset with two classes. The SVM optimization problem is:

min (1/2)||w||² subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i

**a)** Derive the dual optimization problem using Lagrange multipliers.

**b)** Explain the significance of support vectors and how they define the decision boundary.

**c)** For the RBF kernel k(x,y) = exp(-γ||x-y||²), explain:
   - Why it's called the "Gaussian" kernel
   - How the parameter γ affects the decision boundary
   - The curse of dimensionality in kernel methods

**d)** Compare SVM with logistic regression:
   - Optimization objectives
   - Handling of outliers
   - Computational complexity
   - Interpretability

---

## Question 7: Ensemble Methods and Random Forests (Intermediate)

Consider a Random Forest with 100 decision trees, each trained on a bootstrap sample.

**a)** Explain how Random Forest reduces overfitting compared to a single decision tree.

**b)** Derive the formula for out-of-bag (OOB) error estimation.

**c)** Show that the bias of Random Forest is approximately equal to the bias of individual trees, but the variance is reduced.

**d)** Design an ensemble method that combines:
   - Linear regression models
   - Decision trees
   - Neural networks

   Explain your choice of combination strategy and how to prevent overfitting.

---

## Question 8: Feature Selection and Dimensionality Reduction (Intermediate)

Given a dataset with 1000 features and 100 samples, you want to reduce dimensionality.

**a)** Compare filter methods, wrapper methods, and embedded methods for feature selection.

**b)** For Principal Component Analysis (PCA):
   - Derive the first principal component
   - Explain why we center the data before applying PCA
   - Calculate the proportion of variance explained by the first k components

**c)** Design a feature selection pipeline that:
   - Handles missing values
   - Removes highly correlated features
   - Selects features based on mutual information
   - Validates the selection using cross-validation

**d)** Explain when you would use:
   - PCA vs. feature selection
   - Linear vs. non-linear dimensionality reduction
   - Supervised vs. unsupervised methods

---

## Question 9: Advanced Model Evaluation (Intermediate)

You're evaluating a medical diagnosis system with the following characteristics:
- Disease prevalence: 5%
- Test sensitivity: 90%
- Test specificity: 95%

**a)** Calculate precision, recall, and F1-score for this system.

**b)** Explain why accuracy might be misleading for this problem and suggest better metrics.

**c)** Design a comprehensive evaluation strategy that includes:
   - Cross-validation methodology
   - Statistical significance testing
   - Confidence intervals for performance metrics
   - Handling of class imbalance

**d)** Compare different evaluation approaches:
   - Hold-out validation vs. k-fold cross-validation
   - Stratified vs. random sampling
   - Bootstrap vs. cross-validation

---

## Question 10: Hyperparameter Tuning and Model Selection (Intermediate)

You need to tune hyperparameters for a machine learning pipeline with multiple components.

**a)** Compare grid search, random search, and Bayesian optimization for hyperparameter tuning.

**b)** Design a nested cross-validation strategy for:
   - Model selection (choosing between algorithms)
   - Hyperparameter tuning
   - Performance estimation

**c)** Explain the bias-variance trade-off in the context of:
   - Model complexity
   - Training set size
   - Regularization strength

**d)** Implement a robust model selection procedure that:
   - Handles multiple performance metrics
   - Accounts for computational constraints
   - Provides uncertainty estimates
   - Prevents data leakage

---


## Question 11: Advanced Ensemble Methods (Advanced)

Consider a stacking ensemble that combines multiple base learners using a meta-learner.

**a)** Derive the optimal weights for a linear combination of base learners that minimizes the mean squared error.

**b)** Explain how to prevent overfitting in stacking by using cross-validation to generate meta-features.

**c)** Design a hierarchical ensemble that:
   - Uses different algorithms for different data regions
   - Automatically determines the number of regions
   - Handles concept drift over time

**d)** Compare stacking with other ensemble methods:
   - Bagging (Random Forest)
   - Boosting (AdaBoost, Gradient Boosting)
   - Bayesian Model Averaging
   - When to use each approach

---

## Question 12: Advanced Regularization and Optimization (Advanced)

Consider the Elastic Net regularization: J(w) = ||y - Xw||² + λ₁||w||₁ + λ₂||w||²

**a)** Derive the coordinate descent update rule for Elastic Net.

**b)** Explain how Elastic Net combines the benefits of Lasso and Ridge regression.

**c)** Design an adaptive regularization scheme that:
   - Automatically adjusts regularization strength based on training progress
   - Uses different regularization for different feature groups
   - Handles non-convex regularization penalties

**d)** Compare different optimization algorithms for regularized regression:
   - Coordinate descent
   - Proximal gradient methods
   - Alternating direction method of multipliers (ADMM)
   - When to use each method

---

## Question 13: Advanced Model Interpretability (Advanced)

You need to explain the predictions of a complex ensemble model to domain experts.

**a)** Derive SHAP (SHapley Additive exPlanations) values for a linear model and explain their interpretation.

**b)** Design a method to explain individual predictions from a Random Forest model.

**c)** Compare different interpretability methods:
   - LIME (Local Interpretable Model-agnostic Explanations)
   - SHAP
   - Partial dependence plots
   - Permutation importance

**d)** Design a comprehensive model explanation framework that:
   - Provides both global and local explanations
   - Handles feature interactions
   - Quantifies explanation uncertainty
   - Is computationally efficient for large models

---

## Question 14: Advanced Model Validation and Testing (Advanced)

You're developing a machine learning system for autonomous vehicles that must meet strict safety requirements.

**a)** Design a validation strategy that:
   - Ensures statistical significance of performance improvements
   - Handles temporal dependencies in the data
   - Accounts for distribution shift between training and deployment
   - Provides confidence bounds on performance metrics

**b)** Implement a statistical testing framework for comparing multiple models that:
   - Controls the family-wise error rate
   - Handles multiple performance metrics
   - Accounts for multiple testing corrections
   - Provides effect size estimates

**c)** Design a continuous monitoring system that:
   - Detects model degradation in real-time
   - Triggers model retraining when needed
   - Handles concept drift and data drift
   - Maintains model performance over time

**d)** Explain how to validate model fairness and bias:
   - Define fairness metrics for different protected groups
   - Test for disparate impact and treatment
   - Implement bias mitigation strategies
   - Monitor fairness in production

---

## Question 15: Integration and Production Systems (Advanced)

Design a complete machine learning pipeline for a real-world application (e.g., fraud detection, recommendation system, or medical diagnosis).

**a)** Design the overall system architecture including:
   - Data ingestion and preprocessing
   - Feature engineering and selection
   - Model training and validation
   - Model deployment and serving
   - Monitoring and maintenance

**b)** Implement a robust model selection framework that:
   - Handles multiple algorithms and hyperparameters
   - Uses appropriate cross-validation strategies
   - Accounts for business constraints and requirements
   - Provides uncertainty quantification

**c)** Design a production monitoring system that:
   - Tracks model performance in real-time
   - Detects data drift and concept drift
   - Handles model versioning and rollback
   - Provides alerts and automated responses

**d)** Address ethical and practical considerations:
   - Data privacy and security
   - Model fairness and bias
   - Regulatory compliance
   - Scalability and performance
   - Cost optimization

---

## Answer Key and Solutions

### Question 1 Solutions

**a)** Normal equations derivation:
For J(β₀, β₁) = Σᵢ₌₁ⁿ(yᵢ - β₀ - β₁xᵢ)²

∂J/∂β₀ = -2Σᵢ₌₁ⁿ(yᵢ - β₀ - β₁xᵢ) = 0
∂J/∂β₁ = -2Σᵢ₌₁ⁿxᵢ(yᵢ - β₀ - β₁xᵢ) = 0

Solving: β̂₀ = ȳ - β̂₁x̄, β̂₁ = Σᵢ₌₁ⁿ(xᵢ-x̄)(yᵢ-ȳ)/Σᵢ₌₁ⁿ(xᵢ-x̄)²
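
As a quick numerical check of the closed-form estimators above, the following minimal sketch fits a line and then evaluates the left-hand sides of both normal equations, which should vanish at the solution. The data values and the function names `fit_simple_ols` and `normal_equation_residuals` are illustrative assumptions, not part of the assessment.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares estimates (beta0_hat, beta1_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def normal_equation_residuals(x, y, beta0, beta1):
    """Sum of residuals and sum of x-weighted residuals (both ~0 at the optimum)."""
    residuals = [yi - beta0 - beta1 * xi for xi, yi in zip(x, y)]
    return sum(residuals), sum(xi * ri for xi, ri in zip(x, residuals))


# Illustrative data (an assumption for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0_hat, beta1_hat = fit_simple_ols(x, y)
eq0, eq1 = normal_equation_residuals(x, y, beta0_hat, beta1_hat)
print(beta0_hat, beta1_hat, eq0, eq1)
```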

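**b)** Unbiasedness proof (treating the xᵢ as fixed):
Since Σᵢ₌₁ⁿ(xᵢ-x̄) = 0, the ȳ term drops out: β̂₁ = Σᵢ₌₁ⁿ(xᵢ-x̄)yᵢ/Σᵢ₌₁ⁿ(xᵢ-x̄)²
Substituting yᵢ = β₀+β₁xᵢ+εᵢ and using Σᵢ₌₁ⁿ(xᵢ-x̄)xᵢ = Σᵢ₌₁ⁿ(xᵢ-x̄)²:
β̂₁ = β₁ + Σᵢ₌₁ⁿ(xᵢ-x̄)εᵢ/Σᵢ₌₁ⁿ(xᵢ-x̄)²
E[β̂₁] = β₁ + Σᵢ₌₁ⁿ(xᵢ-x̄)E[εᵢ]/Σᵢ₌₁ⁿ(xᵢ-x̄)² = β₁, because E[εᵢ] = 0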

**c)** Variance calculation:
Var(β̂₁) = σ²/Σᵢ₌₁ⁿ(xᵢ-x̄)²

Factors affecting variance:
- Error variance σ² (higher error → higher variance)
- Sample size n (each extra observation adds a non-negative term to Σᵢ₌₁ⁿ(xᵢ-x̄)², so more data → lower variance)
- Feature spread Σᵢ₌₁ⁿ(xᵢ-x̄)² (more spread → lower variance)
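
The variance formula can be sanity-checked by simulation. The sketch below repeatedly draws noisy responses for a fixed design, refits the slope, and compares the empirical mean and variance of β̂₁ with β₁ and σ²/Σ(xᵢ-x̄)². The design points, true parameters, seed, and trial count are all assumptions chosen for illustration.

```python
import random


def slope_estimate(x, y):
    """Least squares slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


def simulate_slopes(x, beta0, beta1, sigma, trials, seed=0):
    """Return (mean, sample variance) of the slope estimate over many noise draws."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(trials):
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        slopes.append(slope_estimate(x, y))
    mean = sum(slopes) / trials
    var = sum((s - mean) ** 2 for s in slopes) / (trials - 1)
    return mean, var


# Assumed setup: fixed design 0..9, true line y = 1 + 1.5x, noise sd 2.
x = [float(i) for i in range(10)]
sigma = 2.0
x_bar = sum(x) / len(x)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
theoretical_var = sigma ** 2 / s_xx

mean_slope, empirical_var = simulate_slopes(x, 1.0, 1.5, sigma, trials=5000)
print(mean_slope, empirical_var, theoretical_var)
```

The empirical values should land close to the theoretical ones, but they are Monte Carlo estimates and will vary with the seed and the number of trials.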

**d)** ML context:
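- SSE loss: Differentiable and convex, penalizes large errors quadratically, and coincides with maximum likelihood when ε ~ N(0, σ²)
- Assumptions: Linearity, independent errors, homoscedasticity (constant error variance), normally distributed errors
- Detection: Residual plots, Q-Q plots, statistical tests (e.g., Breusch-Pagan for heteroscedasticity, Durbin-Watson for autocorrelation)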
- Handling: Transformations, robust methods, non-parametric approaches

---

### Question 2 Solutions

**a)** Log-likelihood derivation:
L(w) = ∏ᵢ₌₁ⁿ p(yᵢ|xᵢ,w) = ∏ᵢ₌₁ⁿ σ(wᵀxᵢ)^yᵢ(1-σ(wᵀxᵢ))^(1-yᵢ)
log L(w) = Σᵢ₌₁ⁿ[yᵢ log σ(wᵀxᵢ) + (1-yᵢ) log(1-σ(wᵀxᵢ))]

**b)** Gradient derivation:
∂log L/∂w = Σᵢ₌₁ⁿ[yᵢ(1-σ(wᵀxᵢ))xᵢ - (1-yᵢ)σ(wᵀxᵢ)xᵢ]
= Σᵢ₌₁ⁿ[yᵢ - σ(wᵀxᵢ)]xᵢ = Xᵀ(y - ŷ)
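
One way to gain confidence in this gradient is to compare it against central finite differences of the log-likelihood. The following is a minimal sketch; the small dataset, the test point `w`, and the helper names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, X, y):
    """log L(w) = sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def analytic_gradient(w, X, y):
    """X^T (y - y_hat), accumulated one row at a time."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j, xj in enumerate(xi):
            grad[j] += (yi - p) * xj
    return grad


def numeric_gradient(w, X, y, eps=1e-6):
    """Central finite differences of the log-likelihood."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        diff = log_likelihood(w_plus, X, y) - log_likelihood(w_minus, X, y)
        grad.append(diff / (2 * eps))
    return grad


# Assumed toy data: first column is the intercept feature.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]
w = [0.2, -0.4]
g_analytic = analytic_gradient(w, X, y)
g_numeric = numeric_gradient(w, X, y)
max_gap = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))
print(g_analytic, g_numeric, max_gap)
```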

**c)** Why no normal equations:
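Setting the gradient Xᵀ(y - σ(Xw)) to zero gives equations that are non-linear in w because of the sigmoid, so there is no closed-form solution. We maximize log L with iterative methods such as gradient ascent or Newton's method (IRLS) instead.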
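
Because no closed form exists, the weights are found iteratively. The sketch below uses plain gradient ascent on the log-likelihood as one possible choice; the data, learning rate, and step count are assumptions, and the classes are deliberately overlapping so that a finite maximizer exists.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def gradient(w, X, y):
    """Gradient of the log-likelihood: X^T (y - sigmoid(Xw))."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j, xj in enumerate(xi):
            grad[j] += (yi - p) * xj
    return grad


def fit_logistic(X, y, lr=0.1, steps=2000):
    """Gradient ascent starting from w = 0 with a fixed step size."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(w, X, y)
        w = [wj + lr * gj for wj, gj in zip(w, g)]
    return w


# Assumed toy data: intercept column plus one feature, classes overlap.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
X = [[1.0, v] for v in xs]
y = [0, 0, 1, 0, 1, 0, 1, 1]

w_hat = fit_logistic(X, y)
ll_start = log_likelihood([0.0, 0.0], X, y)
ll_end = log_likelihood(w_hat, X, y)
grad_at_optimum = gradient(w_hat, X, y)
print(w_hat, ll_start, ll_end, grad_at_optimum)
```

Newton's method typically needs far fewer iterations than fixed-step gradient ascent, at the cost of forming and solving with the Hessian at each step.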

**d)** Comparison:
- Output: Logistic gives probabilities, linear gives continuous values
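- Boundaries: Both give linear decision boundaries in feature space (logistic regression predicts class 1 when wᵀx ≥ 0, i.e., σ(wᵀx) ≥ 0.5); only the logistic output is a non-linear function of wᵀx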
- Optimization: Logistic needs iterative methods, linear has closed form
- Use cases: Logistic for classification, linear for regression

---

### Question 3 Solutions

**a)** Entropy calculation:
H(Y) = -0.6 log₂(0.6) - 0.4 log₂(0.4) = 0.971

**b)** Information gain:
H(Y|Group1) = -0.67 log₂(0.67) - 0.33 log₂(0.33) = 0.918
H(Y|Group2) = -0.57 log₂(0.57) - 0.43 log₂(0.43) = 0.985
H(Y|Split) = 0.3×0.918 + 0.7×0.985 = 0.965
IG = 0.971 - 0.965 = 0.006
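
The entropy and information gain arithmetic can be reproduced from the class counts given in the question. This is a minimal sketch; the function names `entropy` and `information_gain` are illustrative choices rather than a reference to any particular library.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a class-count vector."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted


# Counts from Question 3: [class A, class B].
parent = [60, 40]
children = [[20, 10], [40, 30]]

h_parent = entropy(parent)
gain = information_gain(parent, children)
print(round(h_parent, 3), round(gain, 3))
```

Working from raw counts avoids the rounding that creeps in when intermediate proportions such as 0.67 and 0.57 are truncated by hand.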

**c)** Decision tree handling:
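- Categorical: Split on the categories directly, either one branch per category (ID3/C4.5) or a binary partition of the categories (CART), choosing the split with the best impurity reduction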
- Missing: Use surrogate splits or imputation
- Overfitting: Pruning, early stopping, minimum samples per leaf

**d)** Comparison with linear models:
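- Interpretability: Shallow trees read as if-then rules; linear models are interpretable through their coefficients; deep trees lose this advantage
- Non-linear: Trees handle non-linear relationships without manual feature transformations
- Interactions: Trees capture feature interactions automatically through successive splits; linear models need explicit interaction terms
- Complexity: Training a tree costs roughly O(n·d·log n) with pre-sorted features, versus O(n·d² + d³) for the least-squares closed form; tree prediction costs O(depth)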

---

### Question 4 Solutions

**a)** Performance metrics:
Accuracy = (80+85)/(80+20+15+85) = 165/200 = 0.825
Precision = 85/(85+20) = 0.81
Recall = 85/(85+15) = 0.85
F1 = 2×0.81×0.85/(0.81+0.85) = 0.83
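
The same metrics can be computed directly from the four cells of the confusion matrix, which avoids rounding precision and recall before forming F1. A minimal sketch, with `binary_metrics` as an illustrative helper name:

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts from the Question 4 confusion matrix.
metrics = binary_metrics(tn=80, fp=20, fn=15, tp=85)
for name, value in metrics.items():
    print(name, round(value, 3))
```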

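**b)** ROC-AUC: The confusion matrix fixes a single operating point (FPR = 20/100 = 0.20, TPR = 85/100 = 0.85), so the full ROC curve and its AUC cannot be computed from it; that requires the predicted probabilities (or scores) for every sample. Interpolating linearly through this one point gives only a rough estimate: AUC ≈ (TPR + TNR)/2 = (0.85 + 0.80)/2 = 0.825.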
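
When per-sample scores are available, AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The sketch below implements that pairwise definition; the labels and scores are hypothetical values, not outputs of the classifier in the question.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four samples.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The double loop is quadratic in the number of samples; for large datasets a rank-based computation gives the same value more efficiently.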

**c)** Precision-Recall trade-off:
- High precision: Few false positives, may miss some true positives
- High recall: Catch most true positives, may have more false positives
- Choose based on business requirements

**d)** Cross-validation strategies:
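- Small dataset: Leave-one-out or repeated stratified k-fold (e.g., 5- or 10-fold) to reduce the variance of the estimate
- Large dataset: A single hold-out split, or 3- to 5-fold if compute allows, since each estimate is already low-variance
- Time series: Forward-chaining (expanding-window) splits that always train on the past and validate on the future
- Imbalanced: Stratified k-fold, so every fold keeps the overall class ratio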
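
Two of these strategies are easy to get wrong by hand, so here is a minimal sketch of how the index sets could be generated. Both helpers (`time_series_splits`, `stratified_folds`) are illustrative assumptions written for this example; they omit shuffling and simply drop any trailing samples that do not fill a complete time block.

```python
def time_series_splits(n_samples, n_splits):
    """Expanding-window splits: train on all earlier blocks, test on the next one."""
    block = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = list(range(0, k * block))
        test = list(range(k * block, (k + 1) * block))
        yield train, test


def stratified_folds(labels, k):
    """Assign indices to k folds round-robin within each class."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds


# Time series: 10 ordered observations, 4 splits.
ts_splits = list(time_series_splits(10, 4))
for train, test in ts_splits:
    print("train", train, "test", test)

# Imbalanced labels: 90 negatives, 10 positives, 5 folds.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, 5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)
```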

---

### Question 5 Solutions

**a)** Ridge solution:
∂J/∂w = -2Xᵀ(y-Xw) + 2λw = 0
XᵀXw + λIw = Xᵀy
w = (XᵀX + λI)⁻¹Xᵀy

**b)** Uniqueness proof:
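For λ > 0 and any v ≠ 0, vᵀ(XᵀX + λI)v = ||Xv||² + λ||v||² > 0, so XᵀX + λI is positive definite and therefore invertible. The solution w = (XᵀX + λI)⁻¹Xᵀy is thus unique, whereas XᵀX alone is singular when features are collinear or when there are more features than samples.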
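
A two-feature example with perfectly collinear columns makes the uniqueness argument concrete: XᵀX is singular, yet adding λI restores invertibility. The sketch below hard-codes the 2×2 inverse to stay dependency-free; the data and the helper names `gram_2d` and `ridge_2d` are assumptions for illustration. The exact `det == 0` test is reliable here only because the data are small integers.

```python
def gram_2d(X):
    """Entries (a, b, d) of the symmetric 2x2 matrix X^T X = [[a, b], [b, d]]."""
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    return a, b, d


def ridge_2d(X, y, lam):
    """Solve (X^T X + lam*I) w = X^T y for two features via the 2x2 inverse."""
    a, b, d = gram_2d(X)
    a, d = a + lam, d + lam
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X + lam*I is singular; no unique solution")
    c0 = sum(r[0] * yi for r, yi in zip(X, y))
    c1 = sum(r[1] * yi for r, yi in zip(X, y))
    return ((d * c0 - b * c1) / det, (a * c1 - b * c0) / det)


# Assumed data: the second feature is exactly twice the first.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
y = [1.0, 2.0, 3.0]

try:
    ridge_2d(X, y, lam=0.0)
    ols_unique = True
except ValueError:
    ols_unique = False

w_ridge = ridge_2d(X, y, lam=1.0)
print(ols_unique, w_ridge)
```

With λ = 0 every w satisfying w₁ + 2w₂ = 1 fits the data exactly, so least squares has infinitely many solutions; the ridge penalty picks out a single one.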

**c)** Effect of λ:
- High λ: Lower complexity, higher bias, lower variance
- Low λ: Higher complexity, lower bias, higher variance
- Feature selection: L2 doesn't perform feature selection; it shrinks weights toward zero but does not set them exactly to zero

**d)** L1 vs L2 comparison:
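- L1: Penalty λ||w||₁ = λΣ|wⱼ|; sparse solutions, feature selection, non-differentiable at zero
- L2: Penalty λ||w||² = λΣwⱼ²; smooth shrinkage, no feature selection, differentiable everywhere
- Use L1 when you expect few relevant features and want a sparse model; use L2 when many features contribute or features are correlated
- L2 has a closed-form solution; L1 has none and needs an iterative solver (e.g., coordinate descent or proximal gradient)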
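
The difference in sparsity is easiest to see in the special case of an orthonormal design (XᵀX = I), which is an assumption made purely for this sketch. There the objective separates per coordinate: with z the least squares coefficient, Lasso minimizes (w - z)² + λ|w| and Ridge minimizes (w - z)² + λw². The coefficient values below are hypothetical.

```python
import math


def lasso_orthonormal(z, lam):
    """Minimizer of (w - z)^2 + lam*|w|: soft-thresholding at lam/2."""
    return math.copysign(max(abs(z) - lam / 2.0, 0.0), z)


def ridge_orthonormal(z, lam):
    """Minimizer of (w - z)^2 + lam*w^2: uniform shrinkage by 1/(1 + lam)."""
    return z / (1.0 + lam)


# Hypothetical least squares coefficients under an orthonormal design.
ols_coefs = [3.0, 0.4, -1.5, -0.2]
lam = 1.0

lasso_coefs = [lasso_orthonormal(z, lam) for z in ols_coefs]
ridge_coefs = [ridge_orthonormal(z, lam) for z in ols_coefs]
print(lasso_coefs)
print(ridge_coefs)
```

Lasso sets the small coefficients exactly to zero and shifts the rest by a constant, while Ridge rescales every coefficient and leaves none at zero. For general designs the Lasso solution has no such closed form, which is why coordinate descent applies this thresholding step one coordinate at a time.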

---

## Grading Rubric

### Excellent (90-100%)
- Correct mathematical derivations and implementations
- Clear explanations of concepts and trade-offs
- Strong connections to real-world applications
- Demonstrates deep understanding of algorithms

### Good (80-89%)
- Mostly correct solutions with minor errors
- Good understanding of concepts
- Some connections to applications
- Minor gaps in implementation details

### Satisfactory (70-79%)
- Basic understanding shown
- Some correct solutions
- Limited connections to applications
- Several computational or conceptual errors

### Needs Improvement (60-69%)
- Limited understanding of concepts
- Many computational errors
- Weak connections to applications
- Incomplete solutions

### Unsatisfactory (<60%)
- Little understanding demonstrated
- Major errors throughout
- No connections to applications
- Incomplete or missing solutions

---

**Congratulations on completing the Supervised Learning Assessment!**

This assessment tests your mastery of supervised learning algorithms, from basic linear models to advanced ensemble methods. Success in this assessment demonstrates readiness to tackle real-world machine learning problems and develop production-ready ML systems.
