# Problem 1 (10 pts): From Maximum Likelihood to Cross-Entropy Loss
Learning Objectives: Connect probability theory to loss functions and understand why cross-entropy emerges naturally.

Part A: Binary Classification Loss Derivation

Setup: We have $n$ data points $\{ (x_i, y_i) \}_{i=1}^n$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{ 0, 1 \}$. Assume your model outputs the probability of class 1 as $p_i = p(y_i = 1 \mid x_i) = \sigma(w^T x_i + b)$ where $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, and $\sigma(z)$ is the sigmoid function $\sigma(z) = 1 / (1 + e^{-z})$.

1. Derive from MLE:
    * Write the likelihood function for the dataset
    * Take the log-likelihood
    * Show that maximizing log-likelihood = minimizing binary cross-entropy
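    * Bonus (5 pts): Derive the gradient and show that, for the summed loss, it has the compact form $\nabla_w \text{BCE} = X^T(p - y)$, where $X \in \mathbb{R}^{n \times d}$ stacks the $x_i^T$ as rows and $p, y \in \mathbb{R}^n$ stack the $p_i$ and $y_i$ (an averaged loss adds a factor of $1/n$)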
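
One way to sanity-check a derived gradient before writing it up is to compare it against finite differences. The sketch below is not part of the provided starter code: the helper names (`bce_sum`, `bce_grad_w`, `numeric_grad_w`), the random data, and the choice of a summed rather than averaged loss are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_sum(w, b, X, y, eps=1e-12):
    # Summed (not averaged) binary cross-entropy; clipping avoids log(0)
    p = np.clip(sigmoid(X @ w + b), eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad_w(w, b, X, y):
    # Candidate closed form being checked: X^T (p - y)
    return X.T @ (sigmoid(X @ w + b) - y)

def numeric_grad_w(w, b, X, y, h=1e-6):
    # Central finite differences, one coordinate of w at a time
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (bce_sum(w + e, b, X, y) - bce_sum(w - e, b, X, y)) / (2 * h)
    return g

# Hypothetical random data, used only for the check
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)
b = 0.1

max_abs_diff = np.max(np.abs(bce_grad_w(w, b, X, y) - numeric_grad_w(w, b, X, y)))
```

If the closed form is right, `max_abs_diff` should be tiny relative to the gradient entries. A check like this does not replace the derivation; it only catches sign or transpose mistakes.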

Part B: Extension to Multi-class

1. Softmax derivation: Extend the derivation to $K$ classes using the softmax function
    * Write the likelihood
    * Take the log-likelihood
    * Show that maximizing log-likelihood = minimizing categorical cross-entropy
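
For orientation, the quantities in this derivation can be sketched numerically as follows. This is a minimal NumPy sketch under stated assumptions, not the required implementation: labels are assumed to be integer class indices, the loss is averaged over samples, and the function names and example logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row-wise max leaves the result unchanged but avoids overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_cross_entropy(probs, labels, eps=1e-12):
    # labels: integer class indices in {0, ..., K-1} (an assumption of this sketch)
    n = labels.shape[0]
    picked = np.clip(probs[np.arange(n), labels], eps, 1.0)
    # Mean negative log-probability assigned to the true class
    return -np.mean(np.log(picked))

# Hypothetical logits for 2 samples and K = 3 classes
logits = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
labels = np.array([0, 2])
probs = softmax(logits)
loss = categorical_cross_entropy(probs, labels)
```

A useful reference point: with all-zero logits every class gets probability $1/K$, so the loss equals $\log K$.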

Part C: Implement dependencies in `hw1_impl.py` for function `problem_1_part_c` in `hw1_script.py`. You will implement both binary and multi-class cross-entropy from scratch and compare your implementations with `sklearn.metrics.log_loss`.

# Problem 2 (10 pts): Normal Equations vs. Gradient Descent - A Computational Study
Learning Objectives: Understand trade-offs between analytical and iterative solutions.

Implement dependencies in `hw1_impl.py` for function `problem_2` in `hw1_script.py`.

Analysis Tasks:

1. Memory usage: When do the normal equations become impractical?
2. Conditioning: What happens when $X^TX$ is nearly singular? Add ridge regularization.
3. Report: When would you choose each method in practice?
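
As a rough illustration of the two approaches and of the conditioning question, consider the sketch below. Everything in it is an assumption for illustration (function names, synthetic data, the objective $\tfrac{1}{2}\|Xw - y\|^2 + \tfrac{\lambda}{2}\|w\|^2$, the step size $1/L$); it is not the interface expected by `hw1_script.py`. Forming $X^TX$ takes $O(nd^2)$ time and $O(d^2)$ memory, and solving the system takes $O(d^3)$ time, which is where the memory question comes from.

```python
import numpy as np

def ridge_normal_equation(X, y, lam=0.0):
    # Solve (X^T X + lam * I) w = X^T y without forming an explicit inverse
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_gradient_descent(X, y, lam=0.0, steps=5000):
    # Minimizes 0.5 * ||Xw - y||^2 + 0.5 * lam * ||w||^2 with step size 1/L,
    # where L = sigma_max(X)^2 + lam is the largest curvature of the quadratic
    lr = 1.0 / (np.linalg.norm(X, 2) ** 2 + lam)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * (X.T @ (X @ w - y) + lam * w)
    return w

# Hypothetical synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
lam = 0.1

w_ne = ridge_normal_equation(X, y, lam)
w_gd = ridge_gradient_descent(X, y, lam)

# Nearly duplicated column: X^T X becomes close to singular
X_bad = np.column_stack([X, X[:, 0] + 1e-8 * rng.normal(size=100)])
cond_plain = np.linalg.cond(X_bad.T @ X_bad)
cond_ridge = np.linalg.cond(X_bad.T @ X_bad + lam * np.eye(6))
```

On well-conditioned data the two solvers should agree closely. On the near-collinear design, adding $\lambda I$ lifts the smallest eigenvalue to roughly $\lambda$, so `cond_ridge` should be far smaller than `cond_plain`.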

# Problem 3 (10 pts): SGD Exploration - Escaping Local Minima (Extended)
Learning Objectives: Understand SGD's stochastic nature and hyperparameter effects.

For historical reasons, there is no Part A.

Parts B and C: Implement dependencies in `hw1_impl.py` for functions `problem_3_part_b` and `problem_3_part_c` in `hw1_script.py`.

Part D: Analysis Questions

1. What batch size gives the best exploration vs. exploitation trade-off?
2. How does the "escape probability" change with learning rate?
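
The assignment code defines the actual experiment; the snippet below is only a toy illustration with assumed ingredients. It uses a hypothetical 1D double-well objective $f(x) = x^4 - 3x^2 + x$ (shallower minimum near $x \approx 1.13$, deeper one near $x \approx -1.30$, barrier near $x \approx 0.17$), Gaussian gradient noise whose standard deviation shrinks as $1/\sqrt{\text{batch size}}$ to mimic mini-batch averaging, and "escape" defined as ending a run to the left of the barrier.

```python
import numpy as np

def grad(x):
    # Derivative of the toy objective f(x) = x**4 - 3*x**2 + x
    return 4 * x**3 - 6 * x + 1

def escape_fraction(lr, batch_size, noise_std=4.0, steps=500, runs=200, seed=0):
    # Fraction of independent runs that start in the shallower well
    # and finish on the other side of the barrier
    rng = np.random.default_rng(seed)
    x = np.full(runs, 1.13)
    for _ in range(steps):
        # Assumed noise model: std scales as 1 / sqrt(batch_size)
        noise = rng.normal(0.0, noise_std / np.sqrt(batch_size), size=runs)
        x = x - lr * (grad(x) + noise)
        x = np.clip(x, -3.0, 3.0)  # guard against divergence for large steps
    return float(np.mean(x < 0.17))

# Sweep a few hypothetical settings
results = {
    (lr, bs): escape_fraction(lr, bs)
    for lr in (0.01, 0.05)
    for bs in (1, 8, 64)
}
```

Sweeping `lr` and `batch_size` in a grid like this gives one concrete way to estimate an escape probability; the numbers depend entirely on the assumed objective and noise model.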

# Problem 4 (10 pts): The Perceptron Problem - Understanding Linear Separability Limitations

## What is a Perceptron?

Based on our lecture, a **perceptron** is a binary classifier that makes predictions using a linear decision boundary. It consists of:

- **Inputs**: A feature vector $x \in \mathbb{R}^d$
- **Weights**: A weight vector $w \in \mathbb{R}^d$
- **Bias**: A scalar bias term $b \in \mathbb{R}$
- **Activation**: A step function (threshold function)

The perceptron computes:
$$ f(x) = \text{step} (w^T x + b) $$

Where the step function outputs:
$$ \text{step}(w^T x + b) = \begin{cases}
1 & \text{if} \quad w^T x + b \geq 0 \\
0 & \text{if} \quad w^T x + b < 0
\end{cases} $$

The decision boundary is the hyperplane defined by $w^T x + b = 0$, which divides the input space into two regions.
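
The forward computation above can be sketched in a few lines of NumPy. The weights below are hand-picked, hypothetical values that happen to realize a logical AND; they are not learned and not part of the assignment code.

```python
import numpy as np

def perceptron_predict(X, w, b):
    # step(w^T x + b): 1 when the score is >= 0, else 0
    return (X @ w + b >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w = np.array([1.0, 1.0])  # hypothetical weights
b = -1.5                  # hypothetical bias

and_outputs = perceptron_predict(X, w, b)  # array([0, 0, 0, 1])
```

The scores are $-1.5, -0.5, -0.5, 0.5$, so only the point $(1, 1)$ lands on the non-negative side of the hyperplane.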

## The Fundamental Problem

The perceptron suffers from a **critical limitation**: it can only solve **linearly separable** problems. This means it can only correctly classify data where the two classes can be perfectly separated by a single straight line (in 2D) or hyperplane (in higher dimensions).

### The XOR Problem: A Classic Example

The most famous demonstration of this limitation is the **XOR (Exclusive OR) problem**:

| x₁ | x₂ | XOR Output |
|----|----|------------|
| 0  | 0  | 0          |
| 0  | 1  | 1          |
| 1  | 0  | 1          |
| 1  | 1  | 0          |

If you plot these four points:
- Points (0,1) and (1,0) should be classified as class 1 (XOR = 1)
- Points (0,0) and (1,1) should be classified as class 0 (XOR = 0)

**No single straight line can separate these classes!** The pattern requires a non-linear decision boundary.
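
A short argument for this, using the step convention defined earlier (class 1 when $w^T x + b \geq 0$): suppose some $w = (w_1, w_2)$ and $b$ classified all four points correctly. Then

$$ \begin{aligned}
(0,0) \mapsto 0 &: \quad b < 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b \geq 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b \geq 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b < 0
\end{aligned} $$

Adding the second and third inequalities gives $w_1 + w_2 + 2b \geq 0$, so $w_1 + w_2 + b \geq -b > 0$, which contradicts the fourth. Hence no such $w$ and $b$ exist.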

## Why This Matters

This limitation reveals why:

1. **Single perceptrons are insufficient** for many real-world problems
2. **We need non-linearity** in our models (like ReLU activation functions)
3. **Multiple layers are essential** to create complex, non-linear decision boundaries
4. **The XOR problem motivated** the development of multi-layer neural networks

As we learned in our previous lecture, when we combine multiple ReLU neurons and stack them in layers, we can create complex, bent decision boundaries that can solve non-linearly separable problems like XOR.
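
To make this concrete, here is one hand-constructed two-neuron ReLU network that reproduces XOR. The weights are chosen by hand for illustration (an assumption of this sketch, not something taken from the lecture or the starter code); many other choices work.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def xor_relu_net(X):
    # Hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
    W1 = np.array([[1.0, 1.0], [1.0, 1.0]])  # shape (inputs, hidden units)
    b1 = np.array([0.0, -1.0])
    h = relu(X @ W1 + b1)
    # Output layer: h1 - 2 * h2
    w2 = np.array([1.0, -2.0])
    return h @ w2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor_outputs = xor_relu_net(X)  # array([0., 1., 1., 0.])
```

The second hidden unit only switches on at $(1, 1)$, where it cancels the contribution of the first; that "bend" is what a single linear boundary cannot produce.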

This limitation was significant enough that it contributed to the "AI winter" of the 1970s. Interest revived in the 1980s, when multi-layer networks trained with backpropagation became widely used.

Learning Objectives:

1. Implement a perceptron from scratch to understand its mechanics
2. Demonstrate why linear models fail on non-linearly separable data
3. Visualize decision boundaries and their limitations
4. Show how adding non-linear features can solve the problem

Implement dependencies in `hw1_impl.py` for function `problem_4` in `hw1_script.py`.
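
As a rough picture of what objectives 1, 2, and 4 involve, the sketch below trains a perceptron with the classic mistake-driven update rule, first on raw XOR inputs and then with an added product feature $x_1 x_2$. The function names, the update rule details, and the choice of feature are assumptions for illustration; the signatures required by `hw1_script.py` may differ.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=1000):
    # Perceptron learning rule: update only on misclassified points
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b >= 0 else 0
            if pred != yi:
                w += lr * (yi - pred) * xi
                b += lr * (yi - pred)
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean((X @ w + b >= 0).astype(int) == y))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Raw inputs: not linearly separable, so training cannot reach 100% accuracy
w_raw, b_raw = train_perceptron(X, y)
acc_raw = accuracy(X, y, w_raw, b_raw)

# Add the non-linear feature x1 * x2; XOR = x1 + x2 - 2 * x1 * x2 is linear in it
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])
w_aug, b_aug = train_perceptron(X_aug, y)
acc_aug = accuracy(X_aug, y, w_aug, b_aug)
```

On the raw inputs the loop exhausts `max_epochs` without a mistake-free pass, while in the augmented feature space the data are separable and the perceptron convergence theorem guarantees the loop stops early.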