# Task 1

## 1. Derivation of the Logistic Regression Objective Function Using MLE (see reference 3)

We consider a binary classification problem with training data $ \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n $, where $ \mathbf{x}^{(i)} \in \mathbb{R}^d $ is the feature vector and $ y^{(i)} \in \{0, 1\} $ is the label for the $ i $-th data point.

We assume the following:
- The data points are independent and identically distributed (abbreviated i.i.d. from here on)
- The conditional probability of the label is as follows:

$$
P(y = 1 \mid \mathbf{x}; \boldsymbol{\theta}) = \sigma(\mathbf{x}^\top \boldsymbol{\theta}) = \frac{1}{1 + e^{-\mathbf{x}^\top \boldsymbol{\theta}}}
$$

$$
P(y = 0 \mid \mathbf{x}; \boldsymbol{\theta}) = 1 - \sigma(\mathbf{x}^\top \boldsymbol{\theta})
$$
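
As a quick illustration of the model above, the following is a minimal sketch in plain Python. The helper names `sigmoid` and `prob_signal` and the example values of `x` and `theta` are assumptions made for illustration; they are not part of the assignment.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def prob_signal(x, theta):
    """P(y = 1 | x; theta) = sigmoid(x^T theta) for plain Python sequences."""
    z = sum(xj * tj for xj, tj in zip(x, theta))
    return sigmoid(z)

# Hypothetical 2-feature example (values chosen for illustration only).
x = [1.0, 2.0]
theta = [0.5, -0.25]
p1 = prob_signal(x, theta)  # here x^T theta = 0, so p1 = 0.5
p0 = 1.0 - p1               # P(y = 0 | x; theta)
```

Splitting on the sign of `z` is a design choice to keep `math.exp` from overflowing for large negative inputs.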

---

### Combining the Conditional Probability Expressions

$$
P(y \mid \mathbf{x}; \boldsymbol{\theta}) =
\sigma(\mathbf{x}^\top \boldsymbol{\theta})^y \cdot (1 - \sigma(\mathbf{x}^\top \boldsymbol{\theta}))^{1 - y}
$$

This works because:
- If $ y = 1 $, then the expression becomes $ \sigma(\mathbf{x}^\top \boldsymbol{\theta}) $
- If $ y = 0 $, then it becomes $ 1 - \sigma(\mathbf{x}^\top \boldsymbol{\theta}) $
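
The two cases can also be checked numerically. This is a small sketch in which `bernoulli_pmf` is a hypothetical helper name and `p = 0.8` is an arbitrary stand-in for $ \sigma(\mathbf{x}^\top \boldsymbol{\theta}) $.

```python
def bernoulli_pmf(y: int, p: float) -> float:
    """Combined form p^y * (1 - p)^(1 - y), valid for y in {0, 1}."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return (p ** y) * ((1.0 - p) ** (1 - y))

p = 0.8  # hypothetical value of sigmoid(x^T theta)
prob_if_signal = bernoulli_pmf(1, p)      # the second factor becomes 1
prob_if_background = bernoulli_pmf(0, p)  # the first factor becomes 1
```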

---

### Likelihood of the Dataset

Because the data are assumed to be i.i.d., the likelihood of the full dataset is the product of the individual probabilities:

$$
L(\boldsymbol{\theta}) = \prod_{i=1}^n P(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}) =
\prod_{i=1}^n \left[
\sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta})^{y^{(i)}} \cdot (1 - \sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta}))^{1 - y^{(i)}}
\right]
$$
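
The product can be computed directly, as in the sketch below. The function name `likelihood` and the toy data are assumptions for illustration. The second call repeats the toy data 1000 times to show that a product of many probabilities underflows to zero in floating point, which is a practical reason to avoid working with the raw product.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def likelihood(X, y, theta):
    """Product over i of P(y_i | x_i; theta) under the i.i.d. assumption."""
    result = 1.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(xj * tj for xj, tj in zip(xi, theta)))
        result *= p if yi == 1 else 1.0 - p
    return result

# Toy data (illustrative only): 3 points, 2 features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, 0, 1]
theta = [0.0, 0.0]  # every factor equals 0.5

L_small = likelihood(X, y, theta)                # 0.5 ** 3
L_large = likelihood(X * 1000, y * 1000, theta)  # 0.5 ** 3000 underflows
```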

---

### Log Likelihood 

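To turn the product into a sum, we take the logarithm of the likelihood. Because the logarithm is strictly increasing, this does not change the maximizing $ \boldsymbol{\theta} $. Writing $ a_i $ for the $ i $-th factor of the product: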

$$
\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \log \prod_{i=1}^n a_i = \sum_{i=1}^n \log a_i
$$

We now substitute the expression for $ a_i $ from the likelihood:

$$
a_i = \sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta})^{y^{(i)}} \cdot (1 - \sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta}))^{1 - y^{(i)}}
$$

Using $ \log(ab) = \log a + \log b $ and $ \log(a^b) = b \log a $:

$$
\log a_i = y^{(i)} \log \sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta}) +
(1 - y^{(i)}) \log (1 - \sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta}))
$$

Summing over all data points gives:

$$
\ell(\boldsymbol{\theta}) = \sum_{i=1}^n \left[
y^{(i)} \log \sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta}) +
(1 - y^{(i)}) \log (1 - \sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta}))
\right]
$$
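
A minimal sketch of this sum in plain Python follows. It relies on the standard identities $ \log \sigma(z) = -\log(1 + e^{-z}) $ and $ \log(1 - \sigma(z)) = -\log(1 + e^{z}) $ so that no probability is ever passed to `log` directly. The names `softplus` and `log_likelihood` and the toy data are assumptions for illustration.

```python
import math

def softplus(z: float) -> float:
    """log(1 + exp(z)), computed without overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_likelihood(X, y, theta):
    """Sum over i of y_i * log(sigmoid(z_i)) + (1 - y_i) * log(1 - sigmoid(z_i))."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(xj * tj for xj, tj in zip(xi, theta))
        # log(sigmoid(z)) = -softplus(-z); log(1 - sigmoid(z)) = -softplus(z)
        total += -yi * softplus(-z) - (1 - yi) * softplus(z)
    return total

# Toy data (illustrative only): 3 points, 2 features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, 0, 1]

ll_zero = log_likelihood(X, y, [0.0, 0.0])  # each term is log(0.5)
ll_far = log_likelihood(X * 1000, y * 1000, [0.0, 0.0])  # stays finite
```

Unlike the raw product, the sum remains finite even for thousands of data points.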

---

### Objective Function

Our goal is to maximize the log likelihood. Equivalently, and as instructed in class, we minimize the negative log likelihood:

$$
W(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta}) =
- \sum_{i=1}^n \left[
y^{(i)} \log \sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta}) +
(1 - y^{(i)}) \log (1 - \sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta}))
\right]
$$
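
To show how this objective might be minimized in practice, here is a hedged sketch using plain gradient descent. The gradient $ \nabla W(\boldsymbol{\theta}) = \sum_{i=1}^n (\sigma(\mathbf{x}^{(i)\top} \boldsymbol{\theta}) - y^{(i)}) \, \mathbf{x}^{(i)} $ is a standard result that is not derived in this section. The function names, learning rate, step count, and toy data are all assumptions chosen for illustration.

```python
import math

def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def softplus(z: float) -> float:
    """log(1 + exp(z)), computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def nll(X, y, theta):
    """Negative log likelihood W(theta)."""
    return sum(yi * softplus(-dot(xi, theta)) + (1 - yi) * softplus(dot(xi, theta))
               for xi, yi in zip(X, y))

def nll_gradient(X, y, theta):
    """Gradient of W: sum over i of (sigmoid(x_i^T theta) - y_i) * x_i."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = sigmoid(dot(xi, theta)) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij
    return grad

def gradient_descent(X, y, theta, lr=0.1, steps=200):
    """Fixed-step gradient descent (lr and steps are illustrative choices)."""
    for _ in range(steps):
        g = nll_gradient(X, y, theta)
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta

# Toy data: first column is a constant 1 acting as an intercept feature.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.5], [1.0, 0.2], [1.0, -0.3]]
y = [1, 0, 1, 0, 0, 1]
theta_fit = gradient_descent(X, y, [0.0, 0.0])
```

For a dataset the size of HIGGS, a mini-batch variant or a library optimizer would be a more realistic choice than this full-batch loop.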

## 2. MAP vs MLE (see reference 4)

### Maximum Likelihood Estimation (MLE)

- MLE chooses the parameters $ \boldsymbol{\theta} $ that make the observed data most likely.
- MLE treats the parameters as fixed but unknown quantities.
- MLE does not include any prior beliefs about $ \boldsymbol{\theta} $; it only uses the observed data.

MLE can be defined as:

$$
\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \; P(\mathcal{D} \mid \boldsymbol{\theta})
$$

Where:
- $ \mathcal{D} $ is the dataset (with inputs $ \mathbf{X} $ and labels $ \mathbf{y} $),
- $ P(\mathcal{D} \mid \boldsymbol{\theta}) $ is the likelihood of the data given the parameters.

### Maximum A Posteriori Estimation (MAP)

MAP extends MLE by including a prior belief about the parameters. It uses Bayes’ theorem to compute a posterior distribution over parameters:

$$
P(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \boldsymbol{\theta}) \cdot P(\boldsymbol{\theta})}{P(\mathcal{D})}
$$

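MAP then chooses the parameters that maximize this posterior. Because the denominator $ P(\mathcal{D}) $ does not depend on $ \boldsymbol{\theta} $, it can be dropped from the maximization: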

$$
\hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} \; P(\boldsymbol{\theta} \mid \mathcal{D}) = \arg\max_{\boldsymbol{\theta}} \; P(\mathcal{D} \mid \boldsymbol{\theta}) \cdot P(\boldsymbol{\theta})
$$

Where:
- $ P(\boldsymbol{\theta}) $ is the prior distribution over parameters
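
As a worked example of how the prior changes the objective, take logarithms of the MAP criterion:

$$
\hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} \; \left[ \log P(\mathcal{D} \mid \boldsymbol{\theta}) + \log P(\boldsymbol{\theta}) \right]
$$

If we assume, purely for illustration, a zero-mean Gaussian prior $ P(\boldsymbol{\theta}) = \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I}) $ (the variance $ \tau^2 $ is a symbol introduced here, not part of the original problem), then $ \log P(\boldsymbol{\theta}) = -\frac{1}{2\tau^2} \lVert \boldsymbol{\theta} \rVert_2^2 + \text{const} $, and the MAP estimate becomes:

$$
\hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\min_{\boldsymbol{\theta}} \; \left[ -\log P(\mathcal{D} \mid \boldsymbol{\theta}) + \frac{1}{2\tau^2} \lVert \boldsymbol{\theta} \rVert_2^2 \right]
$$

Under this assumption, MAP is the MLE objective plus an L2 penalty, and a flat prior (the limit $ \tau^2 \to \infty $) recovers MLE.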

## 2. Machine Learning Problem and Model Justification (see reference 7)

### Problem Definition

I wish to use logistic regression to classify particle collision events as either signal (Higgs boson production) or background (standard processes).

We can formulate this as a binary classification task by encoding the labels as:
- `1` = signal event
- `0` = background event

### Why Should We Use Logistic Regression?

- Logistic regression is designed for binary classification problems such as signal vs. background
- Logistic regression outputs probabilities (for example, an 80% chance of signal), which helps physicists decide whether an event is likely to be Higgs boson production
- If signal and background are approximately linearly separable in the feature space, a linear model is sufficient; this is an assumption about the collision data rather than a guarantee
- As a simple linear model with few parameters, logistic regression is less prone to overfitting noisy collision data than more flexible models

### Comparison to Linear Support Vector Machine (Linear SVM)

#### Similarities:
- Both learn a linear decision boundary (a hyperplane) to separate signal from background events
- Both work well with large datasets and can handle many features
- Both use regularization to avoid overfitting

#### Differences:
- Logistic regression gives probabilities, whereas a linear SVM gives a hard yes/no classification
- Logistic regression maximizes the likelihood of the labels, which favors accurate probabilities, whereas a linear SVM maximizes the margin (the gap) between signal and background
- Logistic regression can be more forgiving of noisy data, whereas a linear SVM's boundary is determined by the points near the margin, so outliers there can shift it noticeably
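
The difference in what the two models optimize can be seen in their per-example losses. The sketch below assumes the common convention of labels in $ \{-1, +1\} $ and a margin $ m = y \cdot \mathbf{x}^\top \boldsymbol{\theta} $, which differs from the $ \{0, 1\} $ encoding used elsewhere in this report. The function names are hypothetical.

```python
import math

def logistic_loss(margin: float) -> float:
    """log(1 + exp(-margin)): the logistic regression loss in margin form."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def hinge_loss(margin: float) -> float:
    """max(0, 1 - margin): the loss minimized by a linear SVM."""
    return max(0.0, 1.0 - margin)

# Compare both losses at a few illustrative margins.
margins = [-2.0, 0.0, 1.0, 3.0]
table = [(m, logistic_loss(m), hinge_loss(m)) for m in margins]
for m, log_l, hinge_l in table:
    print(f"margin={m:+.1f}  logistic={log_l:.4f}  hinge={hinge_l:.4f}")
```

The hinge loss is exactly zero once the margin reaches 1, so confidently classified points stop influencing the SVM, while the logistic loss stays positive for every point.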


## 3. Dataset and Equation Correspondence

In the derivation of logistic regression, each training instance was defined as a pair $ (\mathbf{x}^{(i)}, y^{(i)}) $, where:

- $ \mathbf{x}^{(i)} \in \mathbb{R}^d $: a feature vector  
- $ y^{(i)} \in \{0, 1\} $: a binary label  
- $ \boldsymbol{\theta} \in \mathbb{R}^d $: model parameters

This maps directly to the HIGGS dataset as follows:

- Each row in the dataset is a collision event, so $ i = 1, \dots, n $ with $ n \approx 11{,}000{,}000 $
- The feature vector $ \mathbf{x}^{(i)} $ consists of 28 features per event, including momentum and transverse energy, so $ d = 28 $ before any intercept term is added
- The label $ y^{(i)} $ is a binary indicator: `1` for signal, `0` for background
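
The mapping above can be sketched as a small parser. It assumes the column layout described on the UCI page (class label in the first column, followed by the 28 features, with no header row), which is worth verifying against the downloaded file. The function name `parse_higgs_rows` is hypothetical, and the two sample rows are synthetic, not real HIGGS values.

```python
import csv
import io

N_FEATURES = 28  # d in the derivation

def parse_higgs_rows(lines):
    """Split CSV rows into feature vectors X and binary labels y.

    Assumes each row is: label, feature_1, ..., feature_28.
    """
    X, y = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row]
        if len(values) != N_FEATURES + 1:
            raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(values)}")
        y.append(int(values[0]))  # 1 = signal, 0 = background
        X.append(values[1:])      # x^(i) in R^28
    return X, y

# Two synthetic rows in the assumed layout (not real data).
sample = io.StringIO(
    "1.0," + ",".join(["0.5"] * N_FEATURES) + "\n"
    "0.0," + ",".join(["-0.5"] * N_FEATURES) + "\n"
)
X, y = parse_higgs_rows(sample)
```

For the real file, any iterable of text lines can be passed in, for example the handle returned by `gzip.open(path, "rt")`. With about 11 million rows, reading in chunks or using a dataframe library would likely be more practical than building Python lists.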

### Modeling Assumptions

- We assume the events are conditionally independent of one another, which is reasonable given the simulation-based nature of the data


# Task 2

## 1. Dataset Selection

For this task, I selected the **HIGGS dataset** from the UCI Machine Learning Repository.

- **Dataset link**: [https://archive.ics.uci.edu/dataset/280/higgs](https://archive.ics.uci.edu/dataset/280/higgs)
- **Instances**: 11,000,000
- **Features**: 28 per event
- **Target**: Binary classification
  - `1` = signal event 
  - `0` = background event

## References

1. Cornell University. (n.d.). *Logistic Regression*. CS 4780: Machine Learning.  
   Retrieved from https://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote06.html

2. Murphy, K. P. (2022). *Probabilistic Machine Learning: An Introduction*. MIT Press.  
   Retrieved from https://www.cs.ubc.ca/~murphyk/PMLbook/book1.html

3. NucleusBox. (n.d.). *Cost Function in Logistic Regression – Understanding the Theory Behind the Loss*.  
   Retrieved from https://www.nucleusbox.com/cost-function-in-logistic-regression/

4. Schmidt, M. (2017). *MLE and MAP Estimation* [Lecture slides]. CPSC 340, University of British Columbia.  
   Retrieved from https://www.cs.ubc.ca/~schmidtm/Courses/340-F17/L25.pdf

5. University of Pennsylvania. (n.d.). *Logistic Regression - CIS 520 Machine Learning*.  
   Retrieved from https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Logistic

6. Wu, S. (n.d.). *Lecture 5: Logistic Regression* [PDF]. CSCI 5525 - Machine Learning, University of Minnesota.  
   Retrieved from https://zstevenwu.com/courses/s20/csci5525/resources/slides/lecture05.pdf
   
7. Cortes, C., & Vapnik, V. (1995). *Support-vector networks*. Machine Learning, 20(3), 273–297.  
   Retrieved from https://doi.org/10.1007/BF00994018

8. QuickRef. (n.d.). *LaTeX Math Symbols Cheat Sheet*.  
   Retrieved from https://quickref.me/latex.html
