# BME I9400 Machine Learning Midterm Review Notebook

---
## 1. Bayes' Rule and Conditional Probability

### Concept Recap
Bayes' theorem connects prior and conditional probabilities:

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

Bayes' Rule allows us to update our beliefs about event A given evidence B.


- $A$: disease (1) or no disease (0)
- $B$: test positive (1) or negative (0)

### Example


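- **Sensitivity** is the true positive rate, $P(B=1 | A=1)$. It equals 1 minus the false negative rate.
- **Specificity** is the true negative rate, $P(B=0 | A=0)$. It equals 1 minus the false positive rate.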

Convert the given information into probabilities:
- Prevalence: suppose 1% of people have the disease --> $P(A=1) = 0.01$
- The test is 99% sensitive --> $P(B=1 | A=1) = 0.99$
- The test is 95% specific --> $P(B=0 | A=0) = 0.95$, so $P(B=1 | A=0) = 1 - 0.95 = 0.05$

### Questions
1. What is the probability of having the disease if you test positive?

We need to solve for $P(A=1 | B=1)$. Applying Bayes' rule to these events:

$$ P(A=1|B=1) = \frac{P(B=1|A=1)P(A=1)}{P(B=1)} $$

??? 
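
One way to work through question 1 is to expand the denominator with the law of total probability, $P(B=1) = P(B=1|A=1)P(A=1) + P(B=1|A=0)P(A=0)$, and then plug in the numbers from the example. The sketch below does that arithmetic; the variable names are illustrative choices, not part of the original notebook.

```python
# Quantities given in the example above
p_disease = 0.01            # prevalence, P(A=1)
sensitivity = 0.99          # P(B=1 | A=1)
false_positive_rate = 0.05  # P(B=1 | A=0) = 1 - specificity

# Law of total probability: sum over both values of A
p_positive = sensitivity * p_disease + false_positive_rate * (1 - p_disease)

# Bayes' rule: P(A=1 | B=1)
posterior = sensitivity * p_disease / p_positive

print(f"P(B=1)       = {p_positive:.4f}")
print(f"P(A=1 | B=1) = {posterior:.4f}")
```

With these numbers the posterior is about 0.167 (1 in 6), far below the 99% sensitivity, because the disease is rare and false positives from the healthy majority dominate.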





2. How would you compute $P(B|A)$ if you only have sample data?
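
For question 2, one common approach is to estimate the conditional probability as a relative frequency: among the samples with $A=1$, take the fraction that also have $B=1$. A minimal sketch, using small made-up arrays `disease` and `test_positive` (hypothetical data for illustration, not from the course):

```python
import numpy as np

# Hypothetical paired observations (made-up data for illustration only)
disease       = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])  # A
test_positive = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])  # B

# Estimate P(B=1 | A=1): keep only rows with A=1, then average B
sens_hat = test_positive[disease == 1].mean()

# Estimate P(B=1 | A=0) the same way on rows with A=0
fpr_hat = test_positive[disease == 0].mean()

print("Estimated P(B=1 | A=1):", sens_hat)
print("Estimated P(B=1 | A=0):", fpr_hat)
```

With so few samples these estimates are noisy; they approach the true conditional probabilities only as the number of samples in each group grows.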




---
## 2. Feature Matrices, Vectors, and Targets

### Concept Recap
- **Feature matrix** $X \in \mathbb{R}^{n \times d}$: $n$ examples, $d$ features
- **Target vector** $y \in \mathbb{R}^n$


```python
import numpy as np

# 3 examples (rows); 2 features each, plus a constant third column of ones
X = np.array([[1.0, 2.0, 1],
              [2.0, 0.5, 1],
              [3.0, 4.0, 1]])
y = np.array([1, 0, 1])

print("Feature matrix X:\n", X)
print("Target vector y:\n", y)
```

```
Feature matrix X:
 [[1.  2.  1. ]
 [2.  0.5 1. ]
 [3.  4.  1. ]]
Target vector y:
 [1 0 1]
```


### Practice Questions
1. How would you represent a bias term in this matrix form?
2. What happens to the shape of X if you standardize each feature?
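
To explore both questions, the sketch below standardizes the two non-constant features and then appends a column of ones. It assumes the population standard deviation (NumPy's default, `ddof=0`) and leaves the ones column unstandardized, since a constant column has zero standard deviation. The names `X_raw`, `X_std`, and `X_aug` are illustrative.

```python
import numpy as np

# Two raw features for three examples (the first two columns of X above)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 4.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
mu = X_raw.mean(axis=0)
sigma = X_raw.std(axis=0)
X_std = (X_raw - mu) / sigma

# Append a column of ones so the bias can be learned as one more weight
X_aug = np.column_stack([X_std, np.ones(len(X_std))])

print("Raw shape:         ", X_raw.shape)
print("Standardized shape:", X_std.shape)
print("With ones column:  ", X_aug.shape)
```

Standardization rescales values column by column, so only the contents change; adding the ones column is what changes the number of columns.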




---
## 3. Loss Functions (MSE, BCE)

### Concept Recap
- **Mean Squared Error (MSE):**
  $$ L_{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2 $$
- **Binary Cross Entropy (BCE):**
  $$ L_{BCE} = -\frac{1}{n}\sum_i [y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)] $$

### Example
```python
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.1, 0.9, 0.8, 0.3])

```
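
The two formulas in the recap translate almost directly into NumPy. The sketch below is one possible implementation; the helper names `mse` and `bce` and the clipping constant `eps` are additions for illustration (clipping keeps `log` away from 0 when a prediction is exactly 0 or 1).

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, averaged over examples."""
    return np.mean((y_true - y_pred) ** 2)

def bce(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy; eps is an added safeguard against log(0)."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.1, 0.9, 0.8, 0.3])

print(f"MSE: {mse(y_true, y_pred):.4f}")
print(f"BCE: {bce(y_true, y_pred):.4f}")
```

Try the practice questions by hand first, then run the block to check your answers.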

### Practice Questions
1. What is the MSE on this sample dataset?
2. What is the BCE on this sample dataset?



---
## 4. Linear Regression

### Concept Recap
$$ \hat{y} = Xw + b $$
The parameters $w$ and $b$ are chosen to minimize the MSE.

### Example
```python
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([2.2, 2.8, 4.5, 4.2])

model = LinearRegression().fit(X, y)
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)
print("Predictions:", model.predict(X))
```

### Practice Question
- Assuming no intercept term, and knowing that the least squares solution is $w^{\ast} = (X^{T} X)^{-1} X^{T} y$, what is the optimal regression coefficient $w$ for this toy dataset?
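
A minimal sketch for checking a hand calculation, assuming the same toy data as the example above. It solves the normal equations $(X^{T} X) w = X^{T} y$ with `np.linalg.solve` rather than forming the inverse explicitly, and cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.2, 2.8, 4.5, 4.2])

# Normal equations without an intercept: solve (X^T X) w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("w* from normal equations:", w_star)
print("w* from lstsq:           ", w_lstsq)
```

The result differs from `model.coef_` in the scikit-learn example because `LinearRegression` fits an intercept by default.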

---
## 5. Logistic Regression

### Concept Recap
$$ \hat{y} = \sigma(Xw + b) = \frac{1}{1 + e^{-(Xw+b)}} $$
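
The sketch below evaluates this formula for a single feature. The parameter values `w = 2.0` and `b = -3.0` are hypothetical and chosen only for illustration; they are not fitted to any dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for a single feature (not fitted to any data)
w, b = 2.0, -3.0

x = np.array([0.0, 1.0, 1.5, 2.0, 3.0])
probs = sigmoid(w * x + b)

# The predicted probability is exactly 0.5 where w*x + b = 0
boundary = -b / w

for xi, pi in zip(x, probs):
    print(f"x = {xi:.1f} -> y_hat = {pi:.3f}")
print("Decision boundary at x =", boundary)
```

The point where $\hat{y} = 0.5$ depends only on where the linear term $Xw + b$ crosses zero, which is why the model is described as a linear classifier.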

### Example
```python
from sklearn.linear_model import LogisticRegression

X = np.array([[0], [1], [2], [3]])
y = np.array([0, 0, 1, 1])

log_model = LogisticRegression().fit(X, y)
print("Weights:", log_model.coef_)
print("Intercept:", log_model.intercept_)
print("Predictions:", log_model.predict_proba(X))
```

### Practice Question
1. What’s the geometric interpretation of the decision boundary? (It helps to visualize the dataset)

---
## 6. Gradient Descent and Optimization

### Concept Recap
Gradient descent iteratively updates parameters:

$$ w \leftarrow w - \eta \nabla_w L(w) $$

### Example: Simple quadratic loss
```python
w = 5.0  # initial guess
eta = 0.1
for i in range(10):
    grad = 2 * (w - 3)  # gradient of L(w) = (w - 3)**2, minimized at w = 3
    w -= eta * grad
    print(f"Iter {i}: w = {w:.3f}")
```

### Discussion
1. Why is the update rule $w \leftarrow w - \eta \nabla_w L(w)$ and not $w \leftarrow w + \eta \nabla_w L(w)$?
2. What issues arise if the learning rate is too large or too small?
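
To experiment with question 2, the sketch below reruns the quadratic example with three learning rates. It assumes the loss is $L(w) = (w - 3)^2$, which is inferred from the gradient `2 * (w - 3)` in the example; the helper `run_gd` and the specific learning rates are illustrative choices.

```python
def run_gd(eta, w0=5.0, steps=10):
    """Gradient descent on L(w) = (w - 3)**2, whose gradient is 2 * (w - 3)."""
    w = w0
    for _ in range(steps):
        w -= eta * 2 * (w - 3)
    return w

for eta in (0.001, 0.1, 1.1):
    print(f"eta = {eta}: w after 10 steps = {run_gd(eta):.3f}")
```

For this loss each step multiplies the distance to the minimum, $w - 3$, by $(1 - 2\eta)$. A tiny $\eta$ keeps that factor near 1 (slow progress), while any $\eta > 1$ makes its magnitude exceed 1, so the iterates move away from the minimum.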

---
## 7. Overfitting, Regularization (L1/L2)

### Concept Recap
- **Overfitting:** Low training error, high test error.
- **Regularization:** Adds a penalty on the size of the weights to the loss, discouraging overly complex fits.

$$ L_{\mathrm{L2}} = \mathrm{MSE} + \lambda \|w\|_2^2 $$

$$ L_{\mathrm{L1}} = \mathrm{MSE} + \lambda \|w\|_1 $$
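
For the L2 case with no intercept, setting the gradient of $\frac{1}{n}\|y - Xw\|_2^2 + \lambda \|w\|_2^2$ to zero gives $(X^{T} X + n\lambda I)\,w = X^{T} y$. The sketch below solves that system for several values of $\lambda$ on synthetic data. The helper `ridge_closed_form`, the seed, and the $\lambda$ grid are illustrative assumptions. Note that scikit-learn's `Ridge` penalizes the sum of squared errors rather than the mean, so its `alpha` is not on the same scale as this $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 20
X = rng.standard_normal((n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.standard_normal(n)

def ridge_closed_form(X, y, lam):
    """Minimizer of mean((y - Xw)^2) + lam * ||w||_2^2, with no intercept."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

for lam in (0.0, 0.1, 1.0, 10.0):
    w = ridge_closed_form(X, y, lam)
    print(f"lambda = {lam:>4}: w = {np.round(w, 3)}, ||w||_2 = {np.linalg.norm(w):.3f}")
```

The norm of the solution decreases as $\lambda$ grows, while $\lambda = 0$ recovers ordinary least squares.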

### Example
```python
from sklearn.linear_model import Ridge, Lasso

X = np.random.randn(20, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + np.random.randn(20)*0.3

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
```

### Stretch Questions
1. Why does L1 encourage sparsity in coefficients?
2. Why is L2 regularization sometimes called "shrinkage"?
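
A one-dimensional sketch can make both questions concrete. Assume, for illustration only, a single weight whose unregularized optimum is $z$. Minimizing $\tfrac{1}{2}(w - z)^2 + t|w|$ gives the soft-thresholding rule $w = \mathrm{sign}(z)\max(|z| - t, 0)$, while minimizing $\tfrac{1}{2}(w - z)^2 + \tfrac{t}{2}w^2$ gives $w = z / (1 + t)$. The values in `z` and the penalty strength `t` are hypothetical.

```python
import numpy as np

def l1_solution(z, t):
    """argmin_w 0.5*(w - z)**2 + t*|w|  (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l2_solution(z, t):
    """argmin_w 0.5*(w - z)**2 + 0.5*t*w**2  (proportional shrinkage)."""
    return z / (1.0 + t)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])  # hypothetical unregularized weights
t = 0.5

print("unregularized:", z)
print("L1 penalty:   ", l1_solution(z, t))
print("L2 penalty:   ", l2_solution(z, t))
```

Under the L1 penalty every weight with $|z| \le t$ lands exactly on zero; under the L2 penalty every weight is scaled toward zero by the same factor but none reaches it.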

---
## 8. Cross Validation

### Concept Recap
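- **Cross-validation** repeatedly splits the data into training and validation folds, so that each example is used for validation exactly once, and averages the validation scores to estimate how well the model generalizes.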

### Example
```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

X = np.random.randn(50, 2)
y = X @ np.array([1.0, 2.0]) + np.random.randn(50)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')
print("Cross-validated R^2 scores:", scores)
print("Mean R^2:", np.mean(scores))
```

### Discussion
1. Why is k-fold CV preferred over a single train/test split?
2. How might you use CV to select hyperparameters such as $\lambda$ in regularization?
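
One way to approach question 2 is a grid search: compute the k-fold validation error for each candidate $\lambda$ and keep the one with the lowest average. The sketch below does this by hand with NumPy so the fold logic is visible. The helpers `ridge_fit` and `cv_mse`, the candidate grid, and the seeds are illustrative assumptions, and the ridge model has no intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(50)

def ridge_fit(X, y, lam):
    """No-intercept ridge: minimizes mean((y - Xw)^2) + lam * ||w||_2^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def cv_mse(X, y, lam, k=5, seed=42):
    """Average validation MSE over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)  # fit on k-1 folds
        errors.append(np.mean((y[val] - X[val] @ w) ** 2))  # score on held-out fold
    return np.mean(errors)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]  # hypothetical candidate grid
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)

for lam, s in scores.items():
    print(f"lambda = {lam:>5}: CV MSE = {s:.3f}")
print("Selected lambda:", best_lam)
```

Using the same seed for every $\lambda$ keeps the folds identical across candidates, so differences in score reflect the hyperparameter rather than the split. A final held-out test set is still needed for an unbiased estimate of the selected model's performance.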