# MIS 382N: Advanced Machine Learning Assignment 5

**Total points**: 75 pts

**Due**: 11:59 PM CST, Friday, November 21st, 2025.

**Submission**:
1. Submit your **Jupyter Notebook via Canvas**, AND
2. **Save your Jupyter Notebook to a PDF, and submit the PDF via Gradescope**.

You may work in groups of two if you wish. Only one student per team needs to submit the assignment on Canvas and Gradescope, but be sure to include the names and UT EIDs of both students.

Homework groups will be created and managed through Canvas, so please do not arbitrarily change your homework group. If you do change, let the TAs know.

For questions involving mathematical derivations, you can write your answer on paper and then upload an image. Also, please make sure your code runs and that the graphics (and any other output) are displayed in your notebook before submitting (use ```%matplotlib inline```).

**Name(s) and EID(s)**:

-------------------------

Bookmarks:

Q1. <a href=#Q1>Implementing Logistic Regression</a>

Q2. <a href=#Q2>Ensemble Methods for Classification</a>

Q3. <a href=#Q3>Ensemble Conceptual Questions</a>


-------------------------

**Q1. Logistic Regression** (30 pts) <a name='Q1'/>

In this question, you will implement Logistic Regression from scratch using numpy (not sklearn). You will train the Logistic Regression on a synthetic dataset generated from two Isotropic Gaussians. Then, you will visualize the decision boundary and examine the learned parameters.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Optional, Tuple
from sklearn.model_selection import train_test_split

np.random.seed(42)

In [None]:
rng = np.random.default_rng(42)
n_per_class = 500
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.0],
                  [0.0, 1.0]])

X0 = rng.multivariate_normal(mu0, Sigma, size=n_per_class)
X1 = rng.multivariate_normal(mu1, Sigma, size=n_per_class)
y0 = np.zeros(n_per_class, dtype=int)
y1 = np.ones(n_per_class, dtype=int)

X = np.vstack([X0, X1])          # shape (2*n_per_class, 2)
y = np.concatenate([y0, y1])     # shape (2*n_per_class,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

plt.figure()
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], s=10, label="class 0")
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], s=10, label="class 1")
plt.legend()
plt.title("Training data: Two Gaussians")
plt.xlabel("x1"); plt.ylabel("x2")
plt.show()
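
The two-stage split above first holds out 20% of the samples for testing and then 25% of the remainder for validation, which works out to a 60/20/20 train/validation/test split. A minimal sanity check on placeholder arrays (`X_demo` and `y_demo` are illustrative names, not part of the assignment):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the synthetic dataset (2 * 500).
X_demo = np.zeros((1000, 2))
y_demo = np.repeat([0, 1], 500)

# Stage 1: hold out 20% of all samples as the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
# Stage 2: hold out 25% of the remaining 80% (20% overall) as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.25, random_state=42)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)
```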

**Part 1.** (5 points)

We model the probability of a sample belonging to the positive class as $p_{\theta}(y = 1 | \mathbf{x}) = \sigma(z)$, with $z = \mathbf{w}^\top \mathbf{x} + b$ and sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $\mathbf{x} \in \mathbb{R}^d, \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R},$ and $z \in \mathbb{R}$. In this exercise, $d=2$.

An efficient way to implement Logistic Regression is to compute the probabilities of multiple samples simultaneously. Suppose there are $N$ training samples in total. We denote the label vector by $\mathbf{y} \in \{0,1\}^N$ and the predicted probability vector by $\mathbf{p} \in \mathbb{R}^N$. The $i$'th entries $\mathbf{y}_i$ and $\mathbf{p}_i$ correspond to the label and predicted probability of the $i$'th sample, respectively. Let the data matrix be

$$\mathbf{X} = \begin{bmatrix}
\mathbf{x}_1^\top \\[4pt]
\mathbf{x}_2^\top \\[4pt]
\vdots \\[4pt]
\mathbf{x}_N^\top
\end{bmatrix} \in \mathbb{R}^{N \times d}$$

Let the weight vector $\mathbf{w} \in \mathbb{R}^d$ be the same as before, and let the bias vector be $$\mathbf{b} = \begin{bmatrix} b \\ \vdots \\ b\end{bmatrix} \in \mathbb{R}^N$$.

Then, $\mathbf{p} = \sigma(\mathbf{X}\mathbf{w} + \mathbf{b}) \in \mathbb{R}^N$ would be the vector of predicted probabilities.
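
In NumPy, the bias vector $\mathbf{b}$ does not need to be built explicitly, because adding a scalar to an array broadcasts it across all $N$ entries. A small sketch of that equivalence for the linear term only, using made-up toy values (`X_toy`, `w_toy`, and `b_toy` are illustrative names):

```python
import numpy as np

X_toy = np.array([[1.0, 2.0],
                  [0.5, -1.0],
                  [3.0, 0.0]])      # shape (N, d) with N=3, d=2
w_toy = np.array([0.5, -0.25])      # shape (d,)
b_toy = 0.1                         # scalar bias

# Explicit bias vector, as written in the text above.
b_vec = np.full(X_toy.shape[0], b_toy)
z_explicit = X_toy @ w_toy + b_vec

# Scalar bias: NumPy broadcasts it across all N entries.
z_broadcast = X_toy @ w_toy + b_toy

print(z_explicit)
print(np.allclose(z_explicit, z_broadcast))
```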



The average logistic loss (negative log-likelihood) is:
$$L = -\frac{1}{N} \sum_{i=1}^N [\mathbf{y}_i \log \mathbf{p}_i + (1 - \mathbf{y}_i) \log (1 - \mathbf{p}_i)]$$

with an optional L2 regularization term: $\frac{\lambda}{2N} \|\mathbf{w}\|_2^2$.

Implement the following functions: ```sigmoid```, ```predict_proba```, and ```log_loss```.

**Answer:**

In [None]:
def sigmoid(z: np.ndarray) -> np.ndarray:
    """
    Compute the sigmoid function.
    Args:
        z: Input logits. np.ndarray, shape (N,)
    Returns:
        proba: Probabilities after sigmoid transform. np.ndarray, shape (N,)
    """
    # Implement sigmoid function
    ### START CODE ###
    proba = ...
    ### END CODE ###
    return proba

def predict_proba(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """
    Predict class probabilities for inputs X.
    Args:
        X: Data matrix. np.ndarray, shape (N, d)
        w: Weights. np.ndarray, shape (d,)
        b: Bias. float
    Returns:
        proba: Probabilities. np.ndarray, shape (N,)
    """
    # Predict class probabilities for inputs X
    ### START CODE ###
    proba = ...
    ### END CODE ###
    return proba

def log_loss(y: np.ndarray, p: np.ndarray, l2_lambda: float = 0.0, w: Optional[np.ndarray] = None) -> float:
    """
    Compute the average logistic loss, optionally with an L2 penalty on w.
    Args:
        y: True labels. np.ndarray, shape (N,)
        p: Predicted probabilities. np.ndarray, shape (N,)
        l2_lambda: L2 regularization strength. float
        w: Weights (for computing L2 regularization loss). np.ndarray, shape (d,)
    Returns:
        loss: Loss value. float
    """
    # Add a small epsilon to avoid log(0)
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    N = y.shape[0]
    # Calculate the log loss
    ### START CODE ###
    loss = ...
    ### END CODE ###

    if l2_lambda > 0 and w is not None:
        # Add L2 regularization if specified
        ### START CODE ###

        ### END CODE ###
    return loss
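
The clipping step in `log_loss` matters because a predicted probability of exactly 0 or 1 would make one of the logarithms infinite. A short illustration on a made-up probability vector (`p_raw` is an illustrative name), independent of the loss itself:

```python
import numpy as np

p_raw = np.array([0.0, 0.5, 1.0])   # predicted probabilities, including the extremes

# Without clipping, log(0) evaluates to -inf (NumPy would also emit a RuntimeWarning).
with np.errstate(divide="ignore"):
    unclipped = np.log(p_raw)

eps = 1e-12
p_safe = np.clip(p_raw, eps, 1 - eps)
clipped = np.log(p_safe)            # finite everywhere

print(unclipped)
print(clipped)
```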


**Part 2.** (5 points)
Given the data matrix $\mathbf{X}$, the weight vector $\mathbf{w}$, the bias $b$, the probability vector $\mathbf{p}$, and the label vector $\mathbf{y}$, the gradient of the loss is:

\begin{align}
\nabla_{\mathbf{w}} L &= \frac{1}{N} \mathbf{X}^\top (\mathbf{p} - \mathbf{y}) + \frac{\lambda}{N} \mathbf{w} \in \mathbb{R}^d \\
\nabla_{b} L &= \frac{1}{N} \sum_{i=1}^N (\mathbf{p}_i - \mathbf{y}_i) \in \mathbb{R}
\end{align}
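
For reference, these expressions follow from the chain rule. A brief sketch, assuming $L$ is the average negative log-likelihood plus the L2 term $\frac{\lambda}{2N} \|\mathbf{w}\|_2^2$, with $z_i = \mathbf{w}^\top \mathbf{x}_i + b$ and $\mathbf{p}_i = \sigma(z_i)$:

\begin{align}
\sigma'(z) &= \sigma(z)\,(1 - \sigma(z)) \\
\frac{\partial L}{\partial z_i} &= -\frac{1}{N} \left[ \frac{\mathbf{y}_i}{\mathbf{p}_i} - \frac{1 - \mathbf{y}_i}{1 - \mathbf{p}_i} \right] \mathbf{p}_i (1 - \mathbf{p}_i) = \frac{1}{N} (\mathbf{p}_i - \mathbf{y}_i) \\
\nabla_{\mathbf{w}} L &= \sum_{i=1}^N \frac{\partial L}{\partial z_i} \mathbf{x}_i + \frac{\lambda}{N} \mathbf{w} = \frac{1}{N} \mathbf{X}^\top (\mathbf{p} - \mathbf{y}) + \frac{\lambda}{N} \mathbf{w} \\
\nabla_{b} L &= \sum_{i=1}^N \frac{\partial L}{\partial z_i} = \frac{1}{N} \sum_{i=1}^N (\mathbf{p}_i - \mathbf{y}_i)
\end{align}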

Please implement the gradient below:

In [None]:
def gradients(
    X: np.ndarray,
    y: np.ndarray,
    w: np.ndarray,
    b: float,
    l2_lambda: float = 0.0
) -> Tuple[np.ndarray, float]:
    """
    Compute gradients of the logistic loss with respect to w and b.
    """
    N = X.shape[0]
    p = predict_proba(X, w, b)
    # Implement gradient of the logistic loss (w/o regularization component) for the weight vector
    ### START CODE ###
    grad_w = ...
    ### END CODE ###
    if l2_lambda > 0:
        # Add optional L2 regularization loss gradient
        ### START CODE ###

        ### END CODE ###
    # Implement gradient of the logistic loss for the bias term
    ### START CODE ###
    grad_b = ...
    ### END CODE ###
    return grad_w, grad_b
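
One common way to sanity-check an analytic gradient is to compare it against a finite-difference estimate. The sketch below is a generic checker demonstrated on a stand-in quadratic objective; `numerical_gradient`, `f_demo`, and `theta_demo` are hypothetical helpers for illustration, not part of the assignment's required interface.

```python
import numpy as np

def numerical_gradient(f, theta, h=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

# Stand-in objective with a known gradient: f(theta) = 0.5 * ||theta||^2, so grad f = theta.
f_demo = lambda theta: 0.5 * np.sum(theta ** 2)
theta_demo = np.array([1.0, -2.0, 0.5])

approx = numerical_gradient(f_demo, theta_demo)
max_abs_err = np.max(np.abs(approx - theta_demo))
print(max_abs_err)
```

To apply the same idea to the logistic loss, one option is to pack `w` and `b` into a single vector and wrap the loss computation in a function of that vector; a small maximum absolute difference between the two gradients suggests the analytic version is consistent with the loss.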

**Part 3.** (5 points) Implement the training function ```train_logreg```, which performs full-batch gradient descent and tracks the loss per epoch. Then, train a logistic regression model with ```lr=0.2```, ```epochs=1000```, and ```l2_lambda=0.5``` (the settings used in the provided call below) and visualize the training loss curve.



In [None]:
def train_logreg(X, y, lr=0.1, epochs=300, l2_lambda=0.0):
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    loss_history = []

    for e in range(epochs):
        # Get the predicted probabilities
        ### START CODE ###
        p = ...
        ### END CODE ###

        # Calculate the loss
        ### START CODE ###
        loss = ...
        ### END CODE ###
        loss_history.append(loss)

        # Calculate the gradients and update the parameters
        ### START CODE ###

        ### END CODE ###

    return w, b, np.array(loss_history)

In [None]:
w, b, loss_history = train_logreg(X_train, y_train, lr=0.2, epochs=1000, l2_lambda=0.5)

plt.figure()
plt.plot(loss_history)
plt.xlabel("epoch"); plt.ylabel("train loss")
plt.title("Training loss")
plt.show()

**Part 4.** (5 points) Implement the ```predict_label``` function and evaluate the training and testing accuracies.

In [None]:
def predict_label(X, w, b, threshold=0.5):
    # If the predicted probability >= threshold, predict 1, otherwise predict 0
    ### START CODE ###
    pred = ...
    ### END CODE ###
    return pred

# Evaluate the training and testing accuracies
y_pred_train = predict_label(X_train, w, b)
y_pred_test = predict_label(X_test, w, b)

train_acc = np.mean(y_pred_train == y_train)
test_acc = np.mean(y_pred_test == y_test)
print(f"Training accuracy: {train_acc:.3f}, Testing accuracy: {test_acc:.3f}")

Run the following code to visualize the decision boundary and print out the learned Logistic Regression parameters ```w``` and ```b```.

In [None]:
def plot_boundary(X, y, w, b, title="Decision boundary"):
    x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
    y_min, y_max = X[:,1].min()-1, X[:,1].max()+1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    pp = predict_proba(grid, w, b).reshape(xx.shape)

    plt.figure()
    plt.contour(xx, yy, pp, levels=[0.5])
    plt.scatter(X[y==0, 0], X[y==0, 1], s=10, label="class 0")
    plt.scatter(X[y==1, 0], X[y==1, 1], s=10, label="class 1")
    plt.legend()
    plt.title(title)
    plt.xlabel("x1"); plt.ylabel("x2")
    plt.show()

plot_boundary(X_train, y_train, w, b, title="Train set: learned boundary")
print(f"Learned parameters: w = {w}, b = {b}")

**Part 5.** (5 points) Is the relationship between the learned parameters ($\mathbf{w}$ and $b$) what you expected, given the synthetic data generation process (the Gaussian parameters)? Why?

**Answer:**

**Part 6a.** (2.5 points) Suppose one decides to reject if the higher posterior probability is less than 0.7 for a given $\mathbf{x}$. **Derive** the reject region in the input space by identifying the two boundaries of this region (provide two inequalities) and then complete the code below to plot the region.

**Answer:**

In [None]:
def plot_rejection_boundaries(X, y, w, b, p1, p2, title="Rejection boundary"):
    """
    The new args p1 and p2 are the probabilities that define the two rejection boundaries, with p1 <= p2.
    """
    assert p1 <= p2
    x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
    y_min, y_max = X[:,1].min()-1, X[:,1].max()+1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    pp = predict_proba(grid, w, b).reshape(xx.shape)

    plt.figure()
    # Plot the rejection boundaries and the shaded region in between
    ### START CODE ###
    plt.contour(xx, yy, pp, levels=...)
    plt.contour(xx, yy, pp, levels=...)
    plt.contourf(xx, yy, pp, levels=..., colors=['tab:blue'], alpha=0.2, zorder=0)
    ### END CODE ###
    plt.scatter(X[y==0, 0], X[y==0, 1], s=10, label="class 0")
    plt.scatter(X[y==1, 0], X[y==1, 1], s=10, label="class 1")
    plt.legend()
    plt.title(title)
    plt.xlabel("x1"); plt.ylabel("x2")
    plt.show()

plot_rejection_boundaries(X_train, y_train, w, b, 0.3, 0.7, title="Train set: rejection region")
print(f"Learned parameters: w = {w}, b = {b}")

**Part 6b.** (2.5 points) State in plain English the impact of independent variable $x_1$  on a "determination" for Class 1, by properly interpreting its coefficient returned by your logistic regression model.

**Answer:**

**Q2. Ensemble Methods for Classification** (35 pts) <a name='Q2'/>

In this question, we will compare the performances of [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and different ensemble methods for classification problems: [Bagging](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html), [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html), [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and [XGBoost](https://xgboost.readthedocs.io/en/stable/python/python_api.html) classifiers.

We will look at the [GiveMeSomeCredit](https://www.kaggle.com/c/GiveMeSomeCredit) dataset for this question. The full dataset is very large, so for this question we will only use a subset, which is provided along with the notebook for this assignment. The code below splits this subset into train and test sets.

The task is to predict the probability that someone will experience financial distress in the next two years.


In [None]:
# Only use this code block if you are using Google Colab.
# If you are using Jupyter Notebook, please ignore this code block. You can directly upload the file to your Jupyter Notebook file systems.
from google.colab import files

## It will prompt you to select a local file. Click on “Choose Files” then select and upload the file.
## Wait for the file to be 100% uploaded. You should see the name of the file once Colab has uploaded it.
uploaded = files.upload()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from time import time
import xgboost
from sklearn.model_selection import train_test_split, GridSearchCV, ParameterGrid
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline

In [None]:
data = pd.read_csv('hw5_data.csv')
data.drop(data.columns[data.columns.str.contains('unnamed',case = False)],axis=1, inplace=True)
data.head()

In [None]:
y = data['SeriousDlqin2yrs']
X = data.drop(['SeriousDlqin2yrs'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('train:',X_train.shape, y_train.shape)
print('test:',X_test.shape, y_test.shape)

In [None]:
columns_list = list(X.columns)

**Part 1.** (5 points) Complete the following functions that will be repeatedly used later: ```evaluate_classifier```, ```train_and_evaluate_classifier```, and ```grid_search_for_classifier```. Please implement according to the comments.

In [None]:
def evaluate_classifier(clf, X_eval, y_eval):
    # Perform prediction on X_eval and extract the probability score of the positive class
    ### START CODE ###
    y_pred = ...
    y_pred_proba = ...
    ### END CODE ###

    # Calculate accuracy and AU-ROC score
    ### START CODE ###
    acc_score = ...
    auc_score = ...
    ### END CODE ###

    return acc_score, auc_score

def train_and_evaluate_classifier(clf, X_train, y_train, X_eval, y_eval):
    # Fit your classifier on the training set
    ### START CODE ###

    ### END CODE ###

    acc_score, auc_score = evaluate_classifier(clf, X_eval, y_eval)
    return clf, acc_score, auc_score

def grid_search_for_classifier(clf, param_grid, X_train, y_train):
    # Initialize GridSearchCV. Use 5-fold cross validation and AU-ROC as the scoring metric.
    ### START CODE ###
    grid_search = ...
    ### END CODE ###

    # Perform grid search
    ### START CODE ###

    ### END CODE ###

    return grid_search.best_estimator_, grid_search.best_params_, grid_search.best_score_
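
AU-ROC measures how well a classifier ranks positives above negatives, so it is normally computed from positive-class scores rather than from thresholded labels. A toy illustration with made-up values (`y_true_demo`, `scores_demo`, and `labels_demo` are illustrative names):

```python
from sklearn.metrics import roc_auc_score

y_true_demo = [0, 0, 1, 1]
scores_demo = [0.2, 0.3, 0.4, 0.9]                    # positive-class probabilities
labels_demo = [int(s >= 0.5) for s in scores_demo]    # hard labels at threshold 0.5

# Scores preserve the full ranking; hard labels collapse it into two groups.
auc_from_scores = roc_auc_score(y_true_demo, scores_demo)
auc_from_labels = roc_auc_score(y_true_demo, labels_demo)

print(auc_from_scores, auc_from_labels)
```

Here every positive is ranked above every negative, so the score-based AU-ROC is perfect, while thresholding first discards part of that ranking and lowers the value.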

**Part 2.** (5 points) Fit a Decision Tree Classifier with ```random_state=42``` for this classification problem. Tune the following hyper-parameters: ```max_depth```, ```min_samples_split```, and ```min_samples_leaf```. Report the ```accuracy_score``` and ```roc_auc_score``` on the test set.

In [None]:
# Define your hyper-parameter grid for decision tree classifier
### START CODE ###
hparams_grid_dt = {
    ...
}
### END CODE ###
clf_dt = DecisionTreeClassifier(random_state=42)

# Perform grid search and get the trained classifier, best hyper-parameters, and best AU-ROC score
### START CODE ###
clf_dt, best_hparams_dt, best_score_dt = ...
### END CODE ###
print(f"Best Hyper-parameters: {best_hparams_dt}, AU-ROC: {best_score_dt:.3f}")

# Evaluate the decision tree classifier on the test set
### START CODE ###
test_acc_dt, test_auc_dt = ...
### END CODE ###
print(f" Decision Tree Test Accuracy: {test_acc_dt:.3f}, Test AU-ROC: {test_auc_dt:.3f}")
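
The cost of a grid search grows multiplicatively with the number of values per hyper-parameter. A rough way to estimate it before running is to count the candidates with `ParameterGrid`; the grid values below are placeholders for illustration only, not recommended settings.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative values only; these are not recommended settings for the assignment.
demo_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

n_candidates = len(ParameterGrid(demo_grid))   # 3 * 2 * 2 combinations
n_folds = 5
n_cv_fits = n_candidates * n_folds             # fits during cross-validation (excludes the final refit)

print(n_candidates, n_cv_fits)
```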

**Part 3.** (5 points) Create a [Bagging](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) ensemble of 10 classifiers (i.e., n_estimators=10) with ```random_state=42```. Please use a Decision Tree Classifier with ```random_state=42``` and the previously found best hyper-parameter combination as the base classifier. Report the ```accuracy_score``` and ```roc_auc_score``` on the test data for this ensemble classifier.

In [None]:
# Initialize your bagging classifier
### START CODE ###
clf_bag = ...
### END CODE ###

# Train and evaluate the bagging classifier
### START CODE ###
clf_bag, test_acc_bag, test_auc_bag = ...
### END CODE ###
print(f"Bagging of Decision Trees Test Accuracy: {test_acc_bag:.3f}, Test AU-ROC: {test_auc_bag:.3f}")
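
By default, scikit-learn's `BaggingClassifier` trains each base estimator on a bootstrap resample, that is, rows drawn with replacement from the training set. The sketch below simulates one such resample with NumPy alone to show how much of the original data a single base estimator typically sees; the sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap resample: draw n_samples row indices uniformly with replacement.
boot_idx = rng_demo.integers(0, n_samples, size=n_samples)

# Fraction of distinct original rows that appear at least once in the resample.
unique_fraction = np.unique(boot_idx).size / n_samples
print(f"{unique_fraction:.3f}")   # expected to be close to 1 - 1/e
```

Because each resample leaves out roughly a third of the rows, the base classifiers are trained on different data, which is the source of diversity that averaging exploits.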

**Part 4.** (5 points) In this question, you will fit a [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model on the training data for this classification task.

1. First, please set ```random_state=42```, find the best hyper-parameters, and refit your classifier with those hyper-parameters. Consider tuning the following hyper-parameters: ```n_estimators```, ```max_depth```, ```min_samples_leaf```, and ```bootstrap```.
2. Second, evaluate your fitted random forest classifier on the test set. Report the accuracy and AU-ROC.

In [None]:
# Define your hyper-parameter grid for random forest classifier
### START CODE ###
hparams_grid_rf = {
    ...
}
### END CODE ###

# Initialize your random forest classifier
### START CODE ###
clf_rf = ...
### END CODE ###

# Perform grid search and get the trained classifier, best hyper-parameters, and best AU-ROC score
### START CODE ###
clf_rf, best_hparams_rf, best_score_rf = ...
### END CODE ###
print(f"Best Hyper-parameters: {best_hparams_rf}, AU-ROC: {best_score_rf:.3f}")

# Evaluate the random forest classifier on the test set
### START CODE ###
test_acc_rf, test_auc_rf = ...
### END CODE ###
print(f"Random Forest Test Accuracy: {test_acc_rf:.3f}, Test AU-ROC: {test_auc_rf:.3f}")

**Part 5.** (10 points) This time, let us use [AdaBoost](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier) and [XGBoost](https://xgboost.readthedocs.io/en/stable/python/python_api.html) for the same task. For AdaBoost and XGBoost, please set ```random_state=42``` and find the best hyper-parameters such as ```n_estimators```, ```learning_rate```, ```max_depth```. Refit your model using the best hyper-parameters, and report the ```accuracy_score``` and ```roc_auc_score``` on test data.

In [None]:
# Define your hyper-parameter grid for AdaBoost classifier
### START CODE ###
hparams_grid_ab = {
    ...
}
### END CODE ###

# Initialize your AdaBoost classifier
### START CODE ###
clf_ab = ...
### END CODE ###

# Perform grid search and get the trained classifier, best hyper-parameters, and best AU-ROC score
### START CODE ###
clf_ab, best_hparams_ab, best_score_ab = ...
### END CODE ###
print(f"Best Hyper-parameters: {best_hparams_ab}, AU-ROC: {best_score_ab:.3f}")

# Evaluate the AdaBoost classifier on the test set
### START CODE ###
test_acc_ab, test_auc_ab = ...
### END CODE ###
print(f" AdaBoost Test Accuracy: {test_acc_ab:.3f}, Test AU-ROC: {test_auc_ab:.3f}")

In [None]:
# Define your hyper-parameter grid for XGBoost classifier
### START CODE ###
hparams_grid_xgb = {
    ...
}
### END CODE ###
# Initialize your XGBoost classifier
### START CODE ###
clf_xgb = ...
### END CODE ###

# Perform grid search and get the trained classifier, best hyper-parameters, and best AU-ROC score
### START CODE ###
clf_xgb, best_hparams_xgb, best_score_xgb = ...
### END CODE ###
print(f"Best Hyper-parameters: {best_hparams_xgb}, AU-ROC: {best_score_xgb:.3f}")

# Evaluate the XGBoost classifier on the test set
### START CODE ###
test_acc_xgb, test_auc_xgb = ...
### END CODE ###
print(f" XGBoost Test Accuracy: {test_acc_xgb:.3f}, Test AU-ROC: {test_auc_xgb:.3f}")

**Part 6.** (5 points) Compare the performance of decision tree with the ensemble methods that you tried. Please briefly describe the key concepts behind the ensemble method that performed the best for this dataset.

**Answer:**

**Q3. Ensemble Conceptual Questions** (10 pts) <a name='Q3'/>

**Part 1.** (5 points) Boosting combines a series of weak predictors into a strong predictor. However, it has the danger of being sensitive to outliers. Please use XGBoost as an example to briefly describe how outliers may undermine the algorithm.

**Answer:**

**Part 2.** (5 points) Where does the randomness of random forests come from?

**Answer:**

