# CSCI 5521 — HW1 (CODE)

**This notebook contains only the coding parts; submit a separate PDF for the other parts of the homework.** It mirrors the MATLAB file structure in Python, but instead of separate files, each function is implemented in its own cell of this notebook:
- `MLE_Learning` (learn Bernoulli parameters per class)
- `Bayes_Testing` (classify a dataset, compute error rate)
- `Bayes_Learning` (sweep priors on validation, pick the best)
- (Given) A driver cell that behaves like `HW1_script.m`, its MATLAB counterpart (Parts 1–3).

**INSTRUCTIONS**
- Make a copy of this template file and edit the copy.
- Download the 4 data files and upload them to the Colab environment. Do not modify the provided data files. Provided files:
  - toy_data.txt
  - training_data.txt
  - validation_data.txt
  - testing_data.txt
- Implement the `TODO` sections

**SUBMISSION**

If you choose to do your homework in Python, you can complete and submit this file alongside a PDF file with the rest of your answers.

## Setup (Imports, Data Paths and Data Loader)

In [1]:
# TODO: Edit if your files are elsewhere
TOY_PATH   = "dataset/toy_data.txt"
TRAIN_PATH = "dataset/training_data.txt"
VALID_PATH = "dataset/validation_data.txt"
TEST_PATH  = "dataset/testing_data.txt"

In [2]:
import numpy as np
import pandas as pd

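def load_txt_dataset(path: str):
    """Load a whitespace-delimited text file: binary feature columns, then a label column (1 or 2)."""
    arr = np.loadtxt(path)
    if arr.ndim == 1:
        arr = arr[None, :]  # a single-row file loads as 1-D; restore the row axis
    X = arr[:, :-1].astype(int)  # all columns except the last are features
    y = arr[:, -1].astype(int)   # the last column is the class label
    # Minimal checks
    labels = set(np.unique(y).tolist())
    if not labels.issubset({1, 2}):
        raise ValueError(f"Labels must be 1 or 2: found {sorted(labels)} in {path}")
    uniq = set(np.unique(X).tolist())
    if uniq - {0, 1}:
        raise ValueError(f"Features must be 0/1 only: found {sorted(uniq)} in {path}")
    return X, y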
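
The loader above assumes each row holds the binary features followed by the class label in the last column. As a minimal sketch of that layout, the text in `sample_text` below is made up for illustration (it is not one of the provided files) and is parsed from memory with `np.loadtxt`:

```python
import io
import numpy as np

# Hypothetical file contents: two binary features, then a label (1 or 2).
sample_text = """0 1 1
1 0 2
0 0 1
"""

# np.loadtxt accepts a file-like object, so no file on disk is needed here.
arr = np.loadtxt(io.StringIO(sample_text))
X = arr[:, :-1].astype(int)   # features: every column except the last
y = arr[:, -1].astype(int)    # labels: the last column

print(X.shape, y.tolist())
```

A real data file with this layout should pass the label and feature checks in `load_txt_dataset`.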

## Part A — `MLE_Learning(X, y)`

In [3]:
def MLE_Learning(training_data):
    """
    Return:
      p1: Bernoulli parameters for class 1 (P(x_j=0 | C1))
      p2: Bernoulli parameters for class 2 (P(x_j=0 | C2))
      pc1: prior for class 1
      pc2: prior for class 2
    """
    # - Separate features (X) from labels
    data = np.array(training_data)
    X = data[:, :-1]  
    y = data[:, -1]   
    N = X.shape[0]  
    D = X.shape[1]  
    
    X_C1 = X[y == 1]
    X_C2 = X[y == 2]

    # - Count samples per class
    N1 = len(X_C1) 
    N2 = len(X_C2) 

    # - Compute pc1 and pc2 (Class Priors)
    pc1 = N1 / N
    pc2 = N2 / N

    # - Compute p1 and p2 (Bernoulli Parameters P(x_j=0 | C_i))
    if N1 > 0:
        sum_x1_C1 = np.sum(X_C1, axis=0) 
        sum_x0_C1 = N1 - sum_x1_C1
        p1 = sum_x0_C1 / N1
    else:
        p1 = np.full(D, 0.5)  # No samples in C1: fall back to 0.5, the Laplace-smoothed estimate for zero counts.

    if N2 > 0:
        sum_x1_C2 = np.sum(X_C2, axis=0) 
        sum_x0_C2 = N2 - sum_x1_C2
        p2 = sum_x0_C2 / N2
    else:
        p2 = np.full(D, 0.5)

    return p1, p2, pc1, pc2

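
The quantities returned by `MLE_Learning` are per-class fractions, so they can also be written as column means. The following is a minimal sketch on a small made-up array (`toy` here is hypothetical and is not `toy_data.txt`), keeping the notebook's convention that each parameter is P(x_j = 0 | C_i):

```python
import numpy as np

# Hypothetical 4-sample dataset: two binary features, label in the last column.
toy = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 2],
    [1, 0, 2],
])
X, y = toy[:, :-1], toy[:, -1]

# P(x_j = 0 | C_i): fraction of class-i samples whose j-th feature equals 0.
p1 = (X[y == 1] == 0).mean(axis=0)
p2 = (X[y == 2] == 0).mean(axis=0)

# Empirical priors are the class frequencies.
pc1 = np.mean(y == 1)
pc2 = np.mean(y == 2)

print("p1:", p1, "p2:", p2, "pc1:", pc1, "pc2:", pc2)
```

Unlike `MLE_Learning`, this sketch has no fallback for an empty class; taking the mean over zero rows would produce NaN values.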


## Part B — `Bayes_Testing(data, p1, p2, pc1, pc2)`

In [4]:
def Bayes_Testing(data, p1, p2, pc1, pc2):
    """
    Return:
      test_error: fraction of misclassified samples using given parameters and priors
    """
    X = data[:, :-1]  
    y = data[:, -1]   
    N = X.shape[0]  

    errors = 0
    small_prob = 1e-10  # To avoid log(0)

    log_p1_0 = np.log(p1 + small_prob)
    log_p1_1 = np.log(1 - p1 + small_prob)
    log_p2_0 = np.log(p2 + small_prob)
    log_p2_1 = np.log(1 - p2 + small_prob)

    log_pc1 = np.log(pc1 + small_prob)
    log_pc2 = np.log(pc2 + small_prob)

    # - Implement classification loop
    for n in range(N):
        x = X[n, :] 
        true_label = y[n]

        # - For each sample, compute the log-likelihood under each class
        log_likelihood1 = np.sum(x * log_p1_1 + (1 - x) * log_p1_0)
        log_likelihood2 = np.sum(x * log_p2_1 + (1 - x) * log_p2_0)

        log_posterior1 = log_likelihood1 + log_pc1
        log_posterior2 = log_likelihood2 + log_pc2

        # - Compare in log space: equivalent (up to small_prob) to the sign of g_x = pc1 * likelihood1 - pc2 * likelihood2
        if log_posterior1 > log_posterior2:
            predicted_label = 1
        else:
            predicted_label = 2
            
        # - Count errors
        if predicted_label != true_label:
            errors += 1

    test_error = errors / N
    return test_error

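
The per-sample loop in `Bayes_Testing` can also be expressed with matrix products. The sketch below is one possible vectorized form, not part of the assignment template: the function name `bayes_error_vectorized` and the example arrays are hypothetical, and it assumes the same conventions as above (parameters are P(x_j = 0 | C_i), and ties go to class 2).

```python
import numpy as np

def bayes_error_vectorized(data, p1, p2, pc1, pc2, eps=1e-10):
    """Hypothetical vectorized counterpart of the per-sample loop."""
    X, y = data[:, :-1], data[:, -1]

    def log_posterior(p, prior):
        # p holds P(x_j = 0 | C), so a feature equal to 1 contributes log(1 - p_j).
        log_zero = np.log(p + eps)
        log_one = np.log(1 - p + eps)
        return X @ log_one + (1 - X) @ log_zero + np.log(prior + eps)

    # Predict class 1 only when its score is strictly larger (ties go to class 2).
    pred = np.where(log_posterior(p1, pc1) > log_posterior(p2, pc2), 1, 2)
    return np.mean(pred != y)

# Made-up data and parameters for illustration only.
data = np.array([[0, 0, 1], [0, 1, 1], [1, 1, 2], [1, 0, 2]], dtype=float)
p1 = np.array([1.0, 0.5])
p2 = np.array([0.0, 0.5])
err = bayes_error_vectorized(data, p1, p2, 0.5, 0.5)
print("error rate:", err)
```

In this made-up example the first feature separates the classes perfectly, so every sample is classified correctly.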


## Part C — `Bayes_Learning(X_tr, y_tr, X_val, y_val, priors)`

In [6]:
def Bayes_Learning(training_data, validation_data):
    """
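    Return:
      p1, p2: Bernoulli parameters per class, estimated on the training set
      pc1, pc2: best priors found on validation set
      validation_results_df: DataFrame of the validation error for each candidate prior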
    """
    # Define a list of candidate priors P(C1) to sweep (P(C2) = 1 - P(C1))
    prior_candidates_pc1 = np.linspace(0.01, 0.99, 20) 

    # - Implement MLE on training data
    p1_mle, p2_mle, _, _ = MLE_Learning(training_data)
    
    best_error = float('inf')
    pc1_best = None
    pc2_best = None
    
    results = [] 

    # - Sweep priors list
    for pc1_can in prior_candidates_pc1:
        pc2_can = 1.0 - pc1_can
        
        # - For each prior, compute validation error
        current_error = Bayes_Testing(
            data=validation_data, 
            p1=p1_mle, 
            p2=p2_mle, 
            pc1=pc1_can, 
            pc2=pc2_can
        )
        
        results.append({
            'P(C1) Prior': pc1_can, 
            'P(C2) Prior': pc2_can, 
            'Validation Error Rate': current_error
        })
        
        # - Pick best prior
        if current_error < best_error:
            best_error = current_error
            pc1_best = pc1_can
            pc2_best = pc2_can
            
    if pc1_best is None:
        pc1_best = 0.5
        pc2_best = 0.5

    validation_results_df = pd.DataFrame(results).round(4)
    
    return p1_mle, p2_mle, pc1_best, pc2_best, validation_results_df

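
Two details of the sweep in `Bayes_Learning` are worth illustrating: the candidate grid, and how a tie between candidates is resolved. The sketch below uses the same `np.linspace` grid as the function, while the `errors` array is made up for illustration. It also shows that the prior affects the decision only through the offset log(pc1) - log(pc2) added to the log-likelihood ratio.

```python
import numpy as np

# Same candidate grid as in Bayes_Learning: 20 evenly spaced values of P(C1).
candidates = np.linspace(0.01, 0.99, 20)
step = candidates[1] - candidates[0]

# The prior shifts the decision by this offset on the log-likelihood ratio.
offsets = np.log(candidates) - np.log(1.0 - candidates)

# Hypothetical validation errors with a tie for the minimum.
errors = np.array([0.3, 0.1, 0.1, 0.2])

# np.argmin returns the first minimum, matching the strict `<` test in the loop.
first_best = int(np.argmin(errors))

print("grid step:", round(step, 4))
print("offset range:", offsets[0], offsets[-1])
print("index of first minimum:", first_best)
```

Because the grid is symmetric around 0.5, the offsets at its two ends are equal in magnitude and opposite in sign.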


## Main HW1 Script

**Part 1 (Toy sanity):**  
- Load `toy_data.txt`, run `MLE_Learning`, and then compute training error using **pc1=pc2=0.5**.

**Part 2 (Train/Test):**  
- Fit on `training_data.txt` and test on `testing_data.txt` using **empirical priors** returned by `MLE_Learning`.

**Part 3 (Validation sweep):**  
- Fit on training, sweep priors on validation, choose the **best prior**, then test on `testing_data.txt`.
- Print a small table for the validation errors.

In [7]:
import numpy as np

# Load datasets from the paths defined in the Setup cell
training_data = np.loadtxt(TRAIN_PATH)
validation_data = np.loadtxt(VALID_PATH)
testing_data = np.loadtxt(TEST_PATH)

# -------------------------
# Part 1 — Toy sanity check
# -------------------------
toy_data = np.loadtxt(TOY_PATH)
p1, p2, pc1, pc2 = MLE_Learning(toy_data)
print("Toy MLE parameters:")
print("p1:", p1)
print("p2:", p2)
print("pc1:", pc1, "pc2:", pc2)

train_error = Bayes_Testing(toy_data, p1, p2, 0.5, 0.5)  # fixed priors for sanity check
print("Training error (toy, pc1=pc2=0.5):", train_error)

Toy MLE parameters:
p1: [0.8 1. ]
p2: [0.2 0.6]
pc1: 0.5 pc2: 0.5
Training error (toy, pc1=pc2=0.5): 0.2


In [8]:
# -------------------------
# Part 2 — Train/Test with empirical priors
# -------------------------
p1, p2, pc1, pc2 = MLE_Learning(training_data)
test_error = Bayes_Testing(testing_data, p1, p2, pc1, pc2)
print("Test error (empirical priors):", test_error)

Test error (empirical priors): 0.115


In [24]:
# -------------------------
# Part 3 — Validation sweep and best prior
# -------------------------
p1, p2, pc1_best, pc2_best, validation_df = Bayes_Learning(training_data, validation_data) 
test_error = Bayes_Testing(testing_data, p1, p2, pc1_best, pc2_best)

# Display the table of validation errors
print("\n--- Validation Error Rates for Prior Sweep ---")
min_error_row = validation_df['Validation Error Rate'].idxmin()

print(validation_df)
print(f"\nBest Priors Found: P(C1) = {pc1_best:.3f}, P(C2) = {pc2_best:.3f}")
print(f"Minimum Validation Error: {validation_df.loc[min_error_row, 'Validation Error Rate']:.4f}")

print("\n--- Final Test Error Rate ---")
final_results_df = pd.DataFrame({
    'Parameter': ['Best P(C1)', 'Best P(C2)', 'Test Error Rate'],
    'Value': [f"{pc1_best:.3f}", f"{pc2_best:.3f}", f"{test_error:.4f}"]
})
print(final_results_df.to_string(index=False))


--- Validation Error Rates for Prior Sweep ---
    P(C1) Prior  P(C2) Prior  Validation Error Rate
0        0.0100       0.9900                  0.155
1        0.0616       0.9384                  0.120
2        0.1132       0.8868                  0.100
3        0.1647       0.8353                  0.095
4        0.2163       0.7837                  0.100
5        0.2679       0.7321                  0.100
6        0.3195       0.6805                  0.110
7        0.3711       0.6289                  0.105
8        0.4226       0.5774                  0.105
9        0.4742       0.5258                  0.105
10       0.5258       0.4742                  0.125
11       0.5774       0.4226                  0.120
12       0.6289       0.3711                  0.140
13       0.6805       0.3195                  0.140
14       0.7321       0.2679                  0.145
15       0.7837       0.2163                  0.175
16       0.8353       0.1647                  0.200
17       0.8868 