# Mathematical Foundations Assessment

> **"Mathematics is the language in which God has written the universe."** - Galileo Galilei

## Assessment Overview

This comprehensive assessment evaluates your mastery of mathematical foundations essential for machine learning. The questions are designed to test not just computational skills, but deep conceptual understanding and the ability to connect mathematical theory to practical applications.

### Assessment Structure
- **15 Questions** covering Linear Algebra, Probability, Statistics, and Optimization
- **Difficulty Levels**: Foundational (1-5), Intermediate (6-10), Advanced (11-15)
- **Time Limit**: 2 hours
- **Format**: Mix of theoretical, computational, and applied problems

### Learning Objectives Tested
- Vector operations and geometric interpretations
- Matrix decompositions and their applications
- Probability theory and Bayesian inference
- Statistical distributions and hypothesis testing
- Optimization theory and gradient methods
- Information theory and entropy

---

## Instructions

1. **Read each question carefully** - Many questions have multiple parts
2. **Show your work** - Partial credit will be given for correct methodology
3. **Explain your reasoning** - Understanding the "why" is as important as the "how"
4. **Use appropriate notation** - Mathematical rigor is expected
5. **Connect to ML applications** - Where relevant, explain how concepts apply to machine learning

**Good luck!**


## Question 1: Vector Geometry and Machine Learning (Foundational)

Consider the vectors **a** = [3, 4, 0] and **b** = [1, 2, 2] in ℝ³.

**a)** Calculate the dot product **a · b** and explain what this value represents geometrically.

**b)** Find the angle between vectors **a** and **b**.

**c)** In the context of machine learning, explain how the dot product is used in:
   - Linear regression
   - Neural network computations
   - Similarity measures between data points

**d)** Calculate the projection of vector **a** onto vector **b** and interpret this result.

---


## Question 2: Matrix Operations and Linear Transformations (Foundational)

Given the matrix **A** = [[2, 1], [1, 3]]:

**a)** Calculate the determinant of **A** and explain its geometric significance.

**b)** Find the eigenvalues and eigenvectors of **A**.

**c)** Perform the eigendecomposition **A = QΛQᵀ** and verify your result.

**d)** Explain how eigendecomposition is used in:
   - Principal Component Analysis (PCA)
   - Linear regression with multiple features
   - Dimensionality reduction techniques

---

## Question 3: Probability Theory and Bayesian Inference (Foundational)

A medical test for a disease has:
- Sensitivity (True Positive Rate): 95%
- Specificity (True Negative Rate): 90%
- Disease prevalence in population: 2%

**a)** Calculate the probability that a person has the disease given a positive test result.

**b)** Calculate the probability that a person does not have the disease given a negative test result.

**c)** Explain why the probability in part (a) is much lower than the test's sensitivity.

**d)** In machine learning, explain how this concept applies to:
   - Class imbalance problems
   - Precision and recall metrics
   - Calibrating classifier outputs

---

## Question 4: Statistical Distributions and Hypothesis Testing (Foundational)

A machine learning model is trained to predict house prices. The residuals (actual - predicted) are normally distributed with mean 0 and standard deviation $15,000.

**a)** What is the probability that the magnitude of a prediction error exceeds $30,000?

**b)** What is the probability that a prediction error is between -$10,000 and $20,000?

**c)** If we want 95% of prediction errors to fall within a symmetric range around zero, what should that range be?

**d)** Explain how the Central Limit Theorem applies to:
   - Cross-validation estimates
   - Bootstrap sampling
   - Confidence intervals for model performance

---

## Question 5: Optimization Theory and Gradient Descent (Foundational)

Consider the function f(x, y) = x² + 2y² - 4x - 8y + 20.

**a)** Find the gradient ∇f(x, y) and the critical point(s).

**b)** Determine whether the critical point is a minimum, maximum, or saddle point using the second derivative test.

**c)** Starting from point (0, 0), perform two iterations of gradient descent with learning rate α = 0.1.

**d)** Explain how gradient descent is used in:
   - Linear regression
   - Logistic regression
   - Neural network training

Then discuss the advantages and disadvantages of gradient descent compared to analytical solutions.

---


## Question 6: Linear Algebra in Dimensionality Reduction (Intermediate)

Consider a dataset X with n = 100 samples and p = 5 features, where each column of X is centered (mean = 0).

**a)** Derive the formula for the covariance matrix C = (1/n)XᵀX.

**b)** Explain why the covariance matrix is symmetric and positive semi-definite.

**c)** If the eigenvalues of C are [10, 5, 2, 0.5, 0.1], calculate the proportion of variance explained by the first two principal components.

**d)** In the context of PCA, explain:
   - Why we center the data before computing the covariance matrix
   - The relationship between eigenvalues and variance
   - How to choose the number of components to retain

---

## Question 7: Information Theory and Entropy (Intermediate)

Consider a binary classification problem with the following class distribution:
- Class 0: 70% of samples
- Class 1: 30% of samples

**a)** Calculate the entropy H(Y) of the class distribution.

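**b)** Suppose a binary feature X splits the data into two equally likely branches (P(X=0) = P(X=1) = 0.5, which is the only split consistent with the 70/30 class distribution above) such that: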
   - When X=0: 60% Class 0, 40% Class 1
   - When X=1: 80% Class 0, 20% Class 1
   
   Calculate the conditional entropy H(Y|X) and information gain IG(Y,X).

**c)** Explain how information gain is used in decision tree construction.

**d)** In machine learning, explain the relationship between:
   - Entropy and uncertainty
   - Information gain and feature selection
   - Mutual information and feature relevance

---

## Question 8: Advanced Probability and Maximum Likelihood (Intermediate)

A machine learning model assumes that data points x₁, x₂, ..., xₙ are independently drawn from a normal distribution N(μ, σ²).

**a)** Write the likelihood function L(μ, σ²) for the observed data.

**b)** Derive the maximum likelihood estimators for μ and σ².

**c)** Show that the MLE for μ is unbiased, but the MLE for σ² is biased.

**d)** Explain how maximum likelihood estimation is used in:
   - Logistic regression
   - Gaussian Naive Bayes
   - Neural network parameter estimation

---

## Question 9: Matrix Decompositions and Numerical Stability (Intermediate)

Consider the matrix **A** = [[4, 2], [2, 3]].

**a)** Perform Cholesky decomposition of **A**.

**b)** Perform QR decomposition of **A**.

**c)** Compare the computational complexity of solving **Ax = b** using:
   - Gaussian elimination
   - Cholesky decomposition
   - QR decomposition

**d)** In machine learning, explain when you would use each decomposition:
   - Cholesky: Linear regression with normal equations
   - QR: Least squares problems
   - SVD: Dimensionality reduction and regularization

---

## Question 10: Advanced Optimization and Convexity (Intermediate)

Consider the function f(x) = x₁² + x₂² + 2x₁x₂ + 4x₁ + 6x₂ + 10.

**a)** Write f(x) in the form f(x) = (1/2)xᵀQx + cᵀx + d.

**b)** Determine if f(x) is convex by examining the Hessian matrix.

**c)** Solve ∇f(x) = 0 to find the global minimum, or show that no minimum exists.

**d)** Explain the importance of convexity in machine learning:
   - Why do we prefer convex optimization problems?
   - What happens when the objective function is non-convex?
   - How does regularization affect convexity?

---


## Question 11: Advanced Linear Algebra and Kernel Methods (Advanced)

Given a dataset with n samples, consider the kernel matrix **K** where Kᵢⱼ = k(xᵢ, xⱼ) for some kernel function k.

**a)** Prove that **K** is positive semi-definite whenever k is a valid kernel, i.e., k(x, y) = ⟨φ(x), φ(y)⟩ for some feature map φ.

**b)** For the RBF kernel k(x, y) = exp(-γ||x-y||²), explain why it's called the "Gaussian" kernel.

**c)** In Support Vector Machines, explain how the kernel trick allows us to work in high-dimensional feature spaces without explicitly computing the features.

**d)** Derive the dual optimization problem for SVM with kernel k(x, y).

---

## Question 12: Advanced Probability and Bayesian Methods (Advanced)

In Bayesian linear regression, we assume:
- Likelihood: y|X, w, σ² ~ N(Xw, σ²I)
- Prior: w ~ N(0, α⁻¹I)

**a)** Derive the posterior distribution p(w|y, X, σ², α).

**b)** Show that the posterior mean is equivalent to Ridge regression with λ = σ²α (α is the prior precision, so a tighter prior means stronger regularization).

**c)** Derive the predictive distribution p(y*|x*, y, X, σ², α) for a new input x*.

**d)** Explain the advantages of Bayesian methods:
   - Uncertainty quantification
   - Automatic regularization
   - Model selection through marginal likelihood

---

## Question 13: Advanced Statistics and Model Selection (Advanced)

Consider model selection using AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion).

**a)** Derive the AIC formula: AIC = 2k - 2ln(L), where k is the number of estimated parameters and L is the maximized value of the likelihood.

**b)** Derive the BIC formula: BIC = k·ln(n) - 2ln(L), where n is the sample size and k and L are as in part (a).

**c)** Compare AIC and BIC in terms of:
   - Model complexity penalty
   - Asymptotic behavior
   - When to use each criterion

**d)** In the context of machine learning, explain how these criteria relate to:
   - Overfitting and underfitting
   - Cross-validation
   - Regularization methods

---

## Question 14: Advanced Optimization and Stochastic Methods (Advanced)

Consider stochastic gradient descent (SGD) for minimizing f(w) = (1/n)Σᵢ₌₁ⁿ fᵢ(w).

**a)** Derive the SGD update rule: wₜ₊₁ = wₜ - αₜ∇fᵢₜ(wₜ), where the index iₜ is sampled uniformly at random from {1, ..., n}.

**b)** Show that E[∇fᵢₜ(wₜ)] = ∇f(wₜ), i.e., the stochastic gradient is an unbiased estimator of the true gradient.

**c)** Explain the convergence conditions for SGD:
   - Learning rate schedule
   - Gradient variance bounds
   - Strong convexity requirements

**d)** Compare SGD with batch gradient descent in terms of:
   - Computational complexity per iteration
   - Convergence rate
   - Memory requirements
   - Practical considerations

---

## Question 15: Integration and Advanced Applications (Advanced)

Consider a machine learning pipeline that processes high-dimensional data through multiple stages:
1. Data preprocessing and normalization
2. Dimensionality reduction using PCA
3. Feature selection using mutual information
4. Model training with regularization
5. Model evaluation and validation

**a)** For each stage, identify the key mathematical concepts involved and explain their role.

**b)** Design a mathematical framework for:
   - Combining multiple models (ensemble methods)
   - Handling missing data probabilistically
   - Quantifying prediction uncertainty

**c)** Explain how the mathematical foundations you've learned enable:
   - Scalable algorithms for big data
   - Robust models that generalize well
   - Interpretable machine learning systems

**d)** Propose a research direction that builds upon these mathematical foundations to address current limitations in machine learning.

---


## Answer Key and Solutions

Worked solutions are provided below for the foundational questions (1-5).

### Question 1 Solutions

**a)** Dot product: **a · b** = 3×1 + 4×2 + 0×2 = 3 + 8 + 0 = 11

Geometrically, **a · b** = ||**a**|| ||**b**|| cos(θ): the product of the two magnitudes and the cosine of the angle between them. The positive value indicates that the angle is acute.

**b)** Angle calculation:
||**a**|| = √(3² + 4² + 0²) = 5
||**b**|| = √(1² + 2² + 2²) = 3
cos(θ) = (**a · b**)/(||**a**|| ||**b**||) = 11/(5×3) = 11/15
θ = arccos(11/15) ≈ 42.8°

**c)** ML applications:
- Linear regression: y = wᵀx + b (dot product between weights and features)
- Neural networks: Each neuron computes a weighted sum (dot product)
- Similarity: Cosine similarity uses normalized dot product

**d)** Projection: proj_b(**a**) = ((**a · b**)/||**b**||²)**b** = (11/9)[1, 2, 2] = [11/9, 22/9, 22/9]

This represents how much of vector **a** points in the direction of vector **b**.
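
The arithmetic in parts (a), (b), and (d) can be checked with a short script. This is a minimal sketch using only the Python standard library; the helper names `dot` and `norm` are illustrative and not part of the assessment.

```python
import math

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]


def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


# Part (a): 3*1 + 4*2 + 0*2
a_dot_b = dot(a, b)

# Part (b): angle from cos(theta) = (a . b) / (||a|| ||b||)
cos_theta = a_dot_b / (norm(a) * norm(b))
theta_deg = math.degrees(math.acos(cos_theta))

# Part (d): projection of a onto b is b scaled by (a . b) / ||b||^2
scale = a_dot_b / dot(b, b)
proj_a_on_b = [scale * bi for bi in b]

# What is left of a after removing the projection is orthogonal to b
residual = [ai - pi for ai, pi in zip(a, proj_a_on_b)]

print(a_dot_b, theta_deg, proj_a_on_b, dot(residual, b))
```

The last printed value, the dot product of the residual with **b**, should be zero up to floating-point error, which is a quick way to confirm a projection.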

---

### Question 2 Solutions

**a)** Determinant: det(**A**) = 2×3 - 1×1 = 6 - 1 = 5

Geometrically, this represents the area scaling factor of the linear transformation.

**b)** Eigenvalues and eigenvectors:
Characteristic equation: det(**A** - λ**I**) = 0
(2-λ)(3-λ) - 1 = 0
λ² - 5λ + 5 = 0
λ = (5 ± √5)/2

For λ₁ = (5 + √5)/2: eigenvector [1, (1+√5)/2]
For λ₂ = (5 - √5)/2: eigenvector [1, (1-√5)/2]

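**c)** Eigendecomposition verification:
Because **A** is symmetric, its eigenvectors are orthogonal, and **A = QΛQᵀ** holds once the columns of **Q** are scaled to unit length. The eigenvectors above, v₁ = [1, (1+√5)/2] and v₂ = [1, (1-√5)/2], have squared norms ||v₁||² = (5+√5)/2 and ||v₂||² = (5-√5)/2.
**Q** = [v₁/||v₁||, v₂/||v₂||] (normalized eigenvectors as columns)
**Λ** = [[(5+√5)/2, 0], [0, (5-√5)/2]]
Multiplying out **QΛQᵀ** gives [[2, 1], [1, 3]] = **A**. If the unnormalized eigenvectors are used as columns instead, the correct statement is **A = QΛQ⁻¹**.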

**d)** Applications:
- PCA: Eigenvectors of covariance matrix are principal components
- Linear regression: The eigenvalues of XᵀX show how well-conditioned the normal equations are; near-zero eigenvalues signal multicollinearity
- Dimensionality reduction: Project data onto subspace spanned by top eigenvectors
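
The eigenvalues and the reconstruction of **A** in parts (b) and (c) can be checked numerically. The sketch below uses only the standard library and assumes unit-length eigenvector columns, which the QΛQᵀ form requires; `unit_eigenvector` is an illustrative helper written for this 2×2 case, not a general routine.

```python
import math

A = [[2.0, 1.0], [1.0, 3.0]]

# Eigenvalues of a 2x2 matrix from lambda^2 - trace*lambda + det = 0
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
root = math.sqrt(trace ** 2 - 4 * det)
eigenvalues = [(trace + root) / 2, (trace - root) / 2]


def unit_eigenvector(lam):
    """Solve the first row of (A - lam*I)v = 0, then scale v to unit length.

    Assumes A[0][1] != 0, which holds for this matrix.
    """
    v = [A[0][1], lam - A[0][0]]
    length = math.hypot(v[0], v[1])
    return [v[0] / length, v[1] / length]


# Each entry is one column of Q
columns = [unit_eigenvector(lam) for lam in eigenvalues]

# Q diag(eigenvalues) Q^T, written as a sum of lam * q q^T terms
A_rebuilt = [
    [sum(lam * q[i] * q[j] for lam, q in zip(eigenvalues, columns)) for j in range(2)]
    for i in range(2)
]

print(eigenvalues)
print(A_rebuilt)
```

In practice a library routine for symmetric matrices would replace the hand-rolled 2×2 formulas; the point here is only that the reconstruction recovers **A** when the eigenvector columns are orthonormal.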

---

### Question 3 Solutions

**a)** Using Bayes' theorem:
P(Disease|Positive) = P(Positive|Disease) × P(Disease) / P(Positive)
P(Positive) = P(Positive|Disease) × P(Disease) + P(Positive|No Disease) × P(No Disease)
P(Positive) = 0.95 × 0.02 + 0.10 × 0.98 = 0.019 + 0.098 = 0.117
P(Disease|Positive) = 0.95 × 0.02 / 0.117 ≈ 0.162 (16.2%)

**b)** P(No Disease|Negative) = P(Negative|No Disease) × P(No Disease) / P(Negative)
P(Negative) = 0.90 × 0.98 + 0.05 × 0.02 = 0.882 + 0.001 = 0.883
P(No Disease|Negative) = 0.90 × 0.98 / 0.883 ≈ 0.999 (99.9%)

**c)** The probability is low because the disease is rare (2% prevalence). Even with a good test, most positive results are false positives due to the large number of healthy people.

**d)** ML applications:
- Class imbalance: Rare classes need special handling
- Precision/Recall: Trade-off between false positives and false negatives
- Calibration: Adjusting classifier outputs to reflect true probabilities
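
Parts (a) and (b) apply the same Bayes' theorem calculation with different inputs, which a small function makes explicit. This is a sketch; the function name `posterior` and its parameter names are illustrative assumptions.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """P(hypothesis | evidence) for a binary hypothesis via Bayes' theorem."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence


sensitivity = 0.95   # P(Positive | Disease)
specificity = 0.90   # P(Negative | No Disease)
prevalence = 0.02    # P(Disease)

# Part (a): P(Disease | Positive); the false positive rate is 1 - specificity
p_disease_given_positive = posterior(prevalence, sensitivity, 1 - specificity)

# Part (b): P(No Disease | Negative); the false negative rate is 1 - sensitivity
p_healthy_given_negative = posterior(1 - prevalence, specificity, 1 - sensitivity)

print(p_disease_given_positive, p_healthy_given_negative)
```

In classifier terms, the value from part (a) is the precision of the test and the sensitivity is its recall, which is why a high-recall classifier can still have low precision on a rare class.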

---

### Question 4 Solutions

**a)** P(|error| > 30,000) = 2 × P(error > 30,000) = 2 × P(Z > 2) = 2 × 0.0228 = 0.0456

**b)** P(-10,000 < error < 20,000) = P(-0.67 < Z < 1.33) = 0.9082 - 0.2514 = 0.6568

**c)** For 95% coverage: P(-1.96 < Z < 1.96) = 0.95
Range: ±1.96 × 15,000 = ±29,400, i.e., 95% of prediction errors fall within $29,400 of zero.

**d)** CLT applications:
- Cross-validation: Sample means of performance metrics are approximately normal
- Bootstrap: Distribution of bootstrap statistics approaches normal
- Confidence intervals: Based on normal approximation of sampling distributions
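
The probabilities in parts (a) to (c) can be reproduced without a z-table using `statistics.NormalDist` from the Python standard library (Python 3.8+). This is a sketch; because it uses exact z-scores rather than values rounded to two decimals, part (b) comes out slightly below the table-based 0.6568.

```python
from statistics import NormalDist

# Residuals modeled as Normal(mean 0, standard deviation 15,000)
residuals = NormalDist(mu=0, sigma=15_000)

# Part (a): two-sided tail, P(|error| > 30,000)
p_large_error = 2 * (1 - residuals.cdf(30_000))

# Part (b): P(-10,000 < error < 20,000)
p_between = residuals.cdf(20_000) - residuals.cdf(-10_000)

# Part (c): half-width of the symmetric range that covers 95% of errors
half_width = residuals.inv_cdf(0.975)

print(p_large_error, p_between, half_width)
```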

---

### Question 5 Solutions

**a)** ∇f = [2x - 4, 4y - 8]
Critical point: 2x - 4 = 0 → x = 2, 4y - 8 = 0 → y = 2
Critical point: (2, 2)

**b)** Hessian matrix: H = [[2, 0], [0, 4]]
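Eigenvalues: 2, 4 (both positive) → local minimum. Because the Hessian is constant and positive definite, f is strictly convex, so (2, 2) is also the global minimum, with f(2, 2) = 8.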

**c)** Gradient descent iterations:
Iteration 1: ∇f(0,0) = [-4, -8], x₁ = [0,0] - 0.1[-4,-8] = [0.4, 0.8]
Iteration 2: ∇f(0.4,0.8) = [-3.2, -4.8], x₂ = [0.4,0.8] - 0.1[-3.2,-4.8] = [0.72, 1.28]

**d)** Gradient descent applications:
- Linear regression: Minimize MSE
- Logistic regression: Minimize cross-entropy
- Neural networks: Backpropagation computes the gradients that gradient descent uses to update the weights
- Advantages: Scales to large datasets and applies when no closed-form solution exists
- Disadvantages: Requires tuning learning rate, may converge slowly
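
The two iterations in part (c) can be reproduced, and extended, with a few lines of code. This is a minimal sketch of fixed-step gradient descent for this specific f; the function names and the 200-step run are illustrative choices, not part of the question.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x^2 + 2y^2 - 4x - 8y + 20."""
    return (2 * x - 4, 4 * y - 8)


def gradient_descent(start, lr, steps):
    """Return the list of iterates, including the starting point."""
    x, y = start
    points = [(x, y)]
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        # Step against the gradient, scaled by the learning rate
        x, y = x - lr * gx, y - lr * gy
        points.append((x, y))
    return points


# Part (c): two iterations from (0, 0) with learning rate 0.1
path = gradient_descent((0.0, 0.0), lr=0.1, steps=2)

# Running longer shows the iterates approaching the critical point (2, 2)
long_run = gradient_descent((0.0, 0.0), lr=0.1, steps=200)

print(path)
print(long_run[-1])
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 per step in x and 0.6 per step in y, which is why the y coordinate converges faster.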

---

## Grading Rubric

### Excellent (90-100%)
- Correct mathematical derivations
- Clear explanations of concepts
- Strong connections to ML applications
- Demonstrates deep understanding

### Good (80-89%)
- Mostly correct solutions
- Good understanding of concepts
- Some connections to applications
- Minor errors in derivations

### Satisfactory (70-79%)
- Basic understanding shown
- Some correct solutions
- Limited connections to applications
- Several computational errors

### Needs Improvement (60-69%)
- Limited understanding
- Many computational errors
- Weak connections to applications
- Incomplete solutions

### Unsatisfactory (<60%)
- Little understanding demonstrated
- Major errors throughout
- No connections to applications
- Incomplete or missing solutions

---

**Congratulations on completing the Mathematical Foundations Assessment!**

This assessment tests the fundamental mathematical knowledge required for advanced machine learning. Mastery of these concepts will enable you to understand, implement, and innovate in machine learning algorithms and applications.
