**Problem 1: Supervised Classification Libraries: Regression, Decision Tree**

Six supervised training/testing runs: 3 datasets (MNIST, Spambase, 20NG) x 2 classification algorithms (L2-regularized Logistic Regression, Decision Trees). You may use a library for the classification algorithms, and any library or script to convert the data into an appropriate format.
You are required to explain and analyze each trained model in terms of its features: for each of the 6 runs, list the top F=30 features. For Logistic Regression these are the F coefficients with the highest absolute value; for Decision Trees they are the first F splits. For the Decision Tree on 20NG in particular, report performance for two tree sizes (measured by tree depth, number of leaves, or number of splits).

import numpy as np
from collections import deque
from sklearn.datasets import fetch_openml, fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

##############################################
# Helper functions for feature extraction
##############################################

def get_top_features_logistic(model, feature_names=None, F=30):
    """
    For logistic regression: Returns the top F features
    sorted by absolute value of the coefficients.
    If feature_names is provided, return names with coefficients.
    """
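    # coef_ has shape (1, n_features) for binary problems and
    # (n_classes, n_features) for multiclass ones (MNIST, 20NG). In the
    # multiclass case row 0 holds only the first class's one-vs-rest
    # coefficients, so the ranking below describes that class alone.
    coef = model.coef_[0]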
    abs_coef = np.abs(coef)
    top_indices = np.argsort(abs_coef)[-F:][::-1]
    if feature_names is not None:
        top_features = [(feature_names[i], coef[i]) for i in top_indices]
    else:
        top_features = [(i, coef[i]) for i in top_indices]
    return top_features

def get_top_splits(decision_tree, F=30):
    """
    For decision trees: Traverse the tree in breadth-first order
    and return the feature indices for the first F splits.
    Leaf nodes have feature index -2.
    """
    tree = decision_tree.tree_
    q = deque([0])  # Start from the root node (index 0)
    splits = []
    while q and len(splits) < F:
        node = q.popleft()
        # Check if current node is a split (non-leaf)
        if tree.feature[node] != -2:
            splits.append(tree.feature[node])
            # Enqueue children nodes (if they exist)
            left_child = tree.children_left[node]
            right_child = tree.children_right[node]
            if left_child != -1:
                q.append(left_child)
            if right_child != -1:
                q.append(right_child)
    return splits

##############################################
# General classification function for a dataset
##############################################

def run_classification(X, y, feature_names=None, dataset_name="Dataset"):
    print(f"\n=== {dataset_name} ===")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    ##############################################
    # Logistic Regression
    ##############################################
    print("\n-- Logistic Regression --")
    lr = LogisticRegression(penalty='l2', solver='liblinear', max_iter=1000)
    lr.fit(X_train, y_train)
    y_pred_lr = lr.predict(X_test)
    acc_lr = accuracy_score(y_test, y_pred_lr)
    print(f"Accuracy: {acc_lr:.4f}")
    top_lr_features = get_top_features_logistic(lr, feature_names)
    print("Top 30 features (by coefficient magnitude):")
    for feature, coef in top_lr_features:
        print(f"Feature: {feature}, Coefficient: {coef}")

    ##############################################
    # Decision Tree
    ##############################################
    print("\n-- Decision Tree --")
    dt = DecisionTreeClassifier(random_state=42)
    dt.fit(X_train, y_train)
    y_pred_dt = dt.predict(X_test)
    acc_dt = accuracy_score(y_test, y_pred_dt)
    print(f"Accuracy: {acc_dt:.4f}")
    top_dt_splits = get_top_splits(dt)
    print("Top 30 splits (in order of appearance):")
    for i, feat in enumerate(top_dt_splits):
        if feature_names is not None and feat < len(feature_names):
            feat_name = feature_names[feat]
        else:
            feat_name = feat
        print(f"Split {i+1}: Feature: {feat_name}")

##############################################
# MNIST: Load and run experiments
##############################################

def run_mnist():
    print("Loading MNIST dataset...")
    # MNIST from openml; scale pixel values to [0,1]
    mnist = fetch_openml('mnist_784', version=1)
    X = mnist.data.astype(np.float32) / 255.0
    y = mnist.target.astype(np.int64)
    # Create feature names for pixels
    feature_names = [f"pixel_{i}" for i in range(X.shape[1])]
    run_classification(X, y, feature_names, dataset_name="MNIST")

##############################################
# Spambase: Load and run experiments
##############################################

def run_spambase():
    print("Loading Spambase dataset...")
    # Assumes 'spambase.data' is in the current directory.
    # The dataset is comma-separated, with the last column as the target.
    data = np.loadtxt("spambase.data", delimiter=",")
    X = data[:, :-1]
    y = data[:, -1]
    # Create feature names for spambase features
    feature_names = [f"feature_{i}" for i in range(X.shape[1])]
    run_classification(X, y, feature_names, dataset_name="Spambase")

##############################################
# 20 Newsgroups: Load and run experiments
##############################################

def run_20newsgroups():
    print("Loading 20 Newsgroups dataset...")
    # Load train and test splits; remove headers/footers/quotes to focus on content.
    newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
    newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
    # Use TfidfVectorizer to convert text to feature vectors (limit features for speed)
    vectorizer = TfidfVectorizer(max_features=2000)
    X_train = vectorizer.fit_transform(newsgroups_train.data)
    X_test = vectorizer.transform(newsgroups_test.data)
    # Combine train and test to allow our own splitting
    from scipy.sparse import vstack
    X = vstack([X_train, X_test])
    y = np.concatenate([newsgroups_train.target, newsgroups_test.target])
    feature_names = vectorizer.get_feature_names_out()

    ##############################################
    # Logistic Regression on 20NG
    ##############################################
    print("\n=== 20 Newsgroups (Logistic Regression) ===")
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    lr = LogisticRegression(penalty='l2', solver='liblinear', max_iter=1000)
    lr.fit(X_tr, y_tr)
    y_pred_lr = lr.predict(X_te)
    acc_lr = accuracy_score(y_te, y_pred_lr)
    print(f"Accuracy: {acc_lr:.4f}")
    top_lr_features = get_top_features_logistic(lr, feature_names)
    print("Top 30 features (by coefficient magnitude):")
    for feature, coef in top_lr_features:
        print(f"Feature: {feature}, Coefficient: {coef}")

    ##############################################
    # Decision Tree on 20NG with two configurations
    ##############################################
    print("\n=== 20 Newsgroups (Decision Tree) ===")

    # Configuration 1: Shallow Tree (max_depth=5)
    print("\n-- Decision Tree (Shallow: max_depth=5) --")
    dt_shallow = DecisionTreeClassifier(max_depth=5, random_state=42)
    dt_shallow.fit(X_tr, y_tr)
    y_pred_shallow = dt_shallow.predict(X_te)
    acc_shallow = accuracy_score(y_te, y_pred_shallow)
    print(f"Accuracy: {acc_shallow:.4f}")
    top_splits_shallow = get_top_splits(dt_shallow)
    print("Top 30 splits (Shallow Tree):")
    for i, feat in enumerate(top_splits_shallow):
        feat_name = feature_names[feat] if feat < len(feature_names) else feat
        print(f"Split {i+1}: Feature: {feat_name}")

    # Configuration 2: Deep Tree (max_depth=15)
    print("\n-- Decision Tree (Deep: max_depth=15) --")
    dt_deep = DecisionTreeClassifier(max_depth=15, random_state=42)
    dt_deep.fit(X_tr, y_tr)
    y_pred_deep = dt_deep.predict(X_te)
    acc_deep = accuracy_score(y_te, y_pred_deep)
    print(f"Accuracy: {acc_deep:.4f}")
    top_splits_deep = get_top_splits(dt_deep)
    print("Top 30 splits (Deep Tree):")
    for i, feat in enumerate(top_splits_deep):
        feat_name = feature_names[feat] if feat < len(feature_names) else feat
        print(f"Split {i+1}: Feature: {feat_name}")

##############################################
# Main execution: run all experiments
##############################################

if __name__ == "__main__":
    # Run experiments on MNIST
    run_mnist()

    # Run experiments on Spambase
    # Ensure that "spambase.data" is uploaded to your Colab environment.
    run_spambase()

    # Run experiments on 20 Newsgroups
    run_20newsgroups()


Loading MNIST dataset...

=== MNIST ===

-- Logistic Regression --
Accuracy: 0.9155
Top 30 features (by coefficient magnitude):
Feature: pixel_379, Coefficient: -2.140631880246161
Feature: pixel_517, Coefficient: -1.724224949896848
Feature: pixel_710, Coefficient: -1.6411961929051553
Feature: pixel_305, Coefficient: -1.5838818330965132
Feature: pixel_461, Coefficient: -1.5161776873412551
Feature: pixel_712, Coefficient: -1.4760219309135363
Feature: pixel_715, Coefficient: -1.4701182900734062
Feature: pixel_222, Coefficient: -1.4373566837046465
Feature: pixel_718, Coefficient: -1.41116516174984
Feature: pixel_177, Coefficient: -1.3643682506328025
Feature: pixel_249, Coefficient: -1.3559286839744613
Feature: pixel_102, Coefficient: -1.346720288565536
Feature: pixel_489, Coefficient: -1.3217004157337955
Feature: pixel_192, Coefficient: -1.315472231315428
Feature: pixel_714, Coefficient: -1.2855499217760067
Feature: pixel_250, Coefficient: -1.2788168791521628
Feature: pixel_164, Coefficien

**Problem 2: PCA library on MNIST**

A) For the MNIST dataset, run a PCA library to reduce the data to D=5 features. Rerun the classification tasks from PB1 and compare the testing performance with the results from PB1. Then repeat this exercise for D=20.
B) Run a PCA library on Spambase and repeat one of the classification algorithms. What is the smallest D (number of PCA dimensions) you need to get a comparable test result?
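
One possible shape for part A, sketched with scikit-learn's `PCA`. The synthetic matrix below only stands in for MNIST, and the helper name `pca_then_classify`, the data sizes, and the seeds are illustrative assumptions rather than part of the assignment. PCA is fit on the training split only and then applied to the test split.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def pca_then_classify(X, y, n_components, seed=42):
    """Hypothetical helper: project onto n_components PCs, then fit logistic regression."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    pca = PCA(n_components=n_components, random_state=seed)
    Z_tr = pca.fit_transform(X_tr)   # fit the projection on training data only
    Z_te = pca.transform(X_te)       # reuse the training mean and components
    clf = LogisticRegression(max_iter=1000)  # L2-regularized by default
    clf.fit(Z_tr, y_tr)
    return accuracy_score(y_te, clf.predict(Z_te)), Z_tr.shape[1]


# Synthetic stand-in for MNIST (assumption): 600 samples, 50 features, 3 classes.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 3, size=600)
centers = rng.normal(scale=3.0, size=(3, 50))
X_demo = centers[y_demo] + rng.normal(size=(600, 50))

for D in (5, 20):
    acc, dim = pca_then_classify(X_demo, y_demo, D)
    print(f"D={D}: test accuracy {acc:.4f} on {dim} PCA features")
```

For part B, the same helper could be called in a loop over increasing D on the Spambase matrix until the accuracy is close to the PB1 result.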

**Problem 3: Implement PCA on MNIST**

Repeat the PB2 exercises on MNIST (D=5 and D=20) with your own PCA implementation. You can use any built-in library/package/API for matrix storage and multiplication, covariance computation, eigenvalue or SVD decomposition, etc. Matlab is probably the easiest language for implementing PCA due to its excellent linear algebra support.
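
A minimal NumPy sketch of one way to implement PCA through the covariance eigendecomposition. The function names `pca_fit` and `pca_transform` and the random demo matrix are assumptions made for illustration; an SVD of the centered data would be an equally valid route.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return (mean, components, eigenvalues) from the covariance eigendecomposition."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)      # (n_features, n_features)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest eigenvalues
    return mean, eigvecs[:, order], eigvals[order]


def pca_transform(X, mean, components):
    """Center with the stored mean and project onto the stored directions."""
    return (X - mean) @ components


# Random correlated data standing in for MNIST pixels (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 30)) @ rng.normal(size=(30, 30))

for D in (5, 20):
    mean, W, eigvals = pca_fit(X_demo, D)
    Z = pca_transform(X_demo, mean, W)
    print(D, Z.shape)
```

When used for classification, `pca_fit` would be called on the training split and `pca_transform` applied to both splits with the same `mean` and `components`.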


**Problem 4: PCA for clustering visualization**

A) Run KMeans on the MNIST data (or a sample of it).
B) Run PCA on the same data.
C) Plot the data in 3D using the PCA representation given by the t=3 top eigenvalues; use marker shapes to indicate the true digit label (circle, triangle, "+", star, etc.) and colors to indicate the cluster ID (red, blue, green, etc.).
D) Select 3 other eigenvalues at random from the top 20 and redo the plot; repeat several times.
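
The steps above could be sketched as follows. The synthetic data stands in for an MNIST sample, and the helper names `project_on_components` and `plot_clusters_3d`, the marker list, and the `tab10` colormap are all illustrative assumptions. The plotting helper imports matplotlib lazily and writes the figure to a file.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def project_on_components(X, component_ids, n_top=20, seed=0):
    """Fit PCA with n_top components and keep only the requested columns."""
    Z = PCA(n_components=n_top, random_state=seed).fit_transform(X)
    return Z[:, component_ids]


def plot_clusters_3d(Z3, true_labels, cluster_ids, path):
    """Marker shape encodes the true label; color encodes the KMeans cluster ID."""
    import matplotlib.pyplot as plt  # lazy import: the helpers above do not need it

    markers = ["o", "^", "+", "*", "s", "D", "x", "v", "<", ">"]
    cmap = plt.get_cmap("tab10")
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    for lab in np.unique(true_labels):
        for cid in np.unique(cluster_ids):
            mask = (true_labels == lab) & (cluster_ids == cid)
            if mask.any():
                ax.scatter(Z3[mask, 0], Z3[mask, 1], Z3[mask, 2],
                           marker=markers[int(lab) % len(markers)],
                           color=cmap(int(cid) % 10), s=12)
    fig.savefig(path)
    plt.close(fig)


# Synthetic stand-in for an MNIST sample (assumption): 300 points, 30 features, 4 labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=300)
X = rng.normal(scale=4.0, size=(4, 30))[labels] + rng.normal(size=(300, 30))
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

top3 = np.arange(3)                                        # part C: top 3 components
random3 = np.sort(rng.choice(20, size=3, replace=False))   # part D: 3 random from top 20
Z_top = project_on_components(X, top3)
Z_rand = project_on_components(X, random3)
print("random components:", random3)
```

Calling `plot_clusters_3d(Z_top, labels, cluster_ids, "pca_top3.png")` would write the part C figure; redrawing `random3` and plotting `Z_rand` repeats part D.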

**Problem 5: Implement Kernel PCA for Logistic Regression**

Dataset: 1000 2-dim datapoints TwoSpirals
Dataset: 1000 2-dim datapoints ThreeCircles

A) First, train a Linear/Logistic Regression (from a library; use logistic if the data labels are categories) and confirm that it doesn't work, i.e. it has a high classification error or a high Root Mean Squared Error.
B) Run KernelPCA with a Gaussian kernel to obtain a representation with T features. For reference, these are the steps we demoed in class (Matlab):
%get pairwise squared Euclidean distance
X2 = dot(X,X,2);
DIST_euclid = bsxfun(@plus, X2, X2') - 2 * X * X';
% get a kernel matrix NxN
sigma = 3;
K = exp(-DIST_euclid/sigma);
%normalize the Kernel to correspond to zero-mean
N = size(X,1);
U = ones(N)/ N ;
Kn = K - U*K -K*U + U*K*U ;
% obtain kernel eigenvalues, vectors; then sort them with largest eig first
[V,D] = eig(Kn,'vector') ;
[D,sorteig] = sort(D,'descend') ;
V = V(:, sorteig);
% get the projection matrix
XG = Kn*V;
%get first 3 dimensions
X3G = XG(:,1:3);
%get first 20 dimensions
X20G = XG(:,1:20);
%get first 100 dimensions
X100G = XG(:,1:100);

C) Retrain the regression algorithm on the kernelized (dual form) representation of the same data. How large does T need to be to get good performance?
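
A hedged NumPy port of the Matlab steps, followed by a sweep over T. The spiral generator `make_two_spirals` is a hypothetical stand-in (the course's TwoSpirals file may differ), as are the helper `holdout_accuracy`, the noise level, and the 70/30 split. Two deliberate deviations from the Matlab demo: `np.linalg.eigh` replaces `eig` because the centered kernel is symmetric, and the kernel is symmetrized first to remove round-off.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def make_two_spirals(n=1000, noise=0.2, seed=0):
    """Hypothetical generator: two interleaved spiral arms, labels 0 and 1."""
    rng = np.random.default_rng(seed)
    half = n // 2
    t = np.sqrt(rng.uniform(size=half)) * 3 * np.pi      # angle along the arm
    arm = np.column_stack([t * np.cos(t), t * np.sin(t)])
    X = np.vstack([arm, -arm]) + rng.normal(scale=noise, size=(2 * half, 2))
    y = np.concatenate([np.zeros(half, dtype=int), np.ones(half, dtype=int)])
    return X, y


def gaussian_kernel_pca(X, sigma=3.0):
    """Return XG = Kn @ V with columns sorted by decreasing eigenvalue, plus the eigenvalues."""
    N = X.shape[0]
    X2 = np.sum(X * X, axis=1)
    dist = X2[:, None] + X2[None, :] - 2.0 * X @ X.T     # pairwise squared distances
    K = np.exp(-dist / sigma)                            # Gaussian kernel, N x N
    U = np.full((N, N), 1.0 / N)
    Kn = K - U @ K - K @ U + U @ K @ U                   # center the kernel
    Kn = (Kn + Kn.T) / 2.0                               # remove round-off asymmetry
    eigvals, V = np.linalg.eigh(Kn)                      # ascending order
    order = np.argsort(eigvals)[::-1]
    return Kn @ V[:, order], eigvals[order]


def holdout_accuracy(features, y, seed=0):
    """Fit logistic regression on 70% of the rows and score on the remaining 30%."""
    F_tr, F_te, y_tr, y_te = train_test_split(features, y, test_size=0.3, random_state=seed)
    clf = LogisticRegression(max_iter=5000).fit(F_tr, y_tr)
    return clf.score(F_te, y_te)


X, y = make_two_spirals()
XG, eigvals = gaussian_kernel_pca(X, sigma=3.0)
print(f"raw 2-D inputs: {holdout_accuracy(X, y):.3f}")
for T in (3, 20, 100):
    print(f"T={T}: {holdout_accuracy(XG[:, :T], y):.3f}")
```

As in the Matlab demo, the kernel features are computed on all N points before splitting. The accuracies depend on `sigma` and on the scale of the data, so both are worth varying.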

**PROBLEM 6 - OPTIONAL (no credit) : Implement Kernel PCA on MNIST**

A) First, add Gaussian noise to MNIST images.
B) Then rerun PCA on the noisy images (D=5 and D=20) and visually inspect the images reconstructed from the PCA representation.
C) Run Kernel-PCA with the RBF kernel (D=5 and D=20) on the noisy images and observe that the resulting images look visually better.
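
Parts A and B could be sketched in NumPy as below. The noise standard deviation, the clipping to [0, 1], the helper names, and the random stand-in images are assumptions chosen for illustration. With real MNIST rows, each reconstructed row can be reshaped to 28x28 for visual inspection.

```python
import numpy as np


def add_gaussian_noise(images, std=0.25, seed=0):
    """Add zero-mean Gaussian noise to images scaled to [0, 1], then clip to that range."""
    rng = np.random.default_rng(seed)
    return np.clip(images + rng.normal(scale=std, size=images.shape), 0.0, 1.0)


def pca_reconstruct(X, n_components):
    """Project onto the top principal directions and map back to pixel space."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:n_components].T                  # (n_pixels, n_components)
    return (X - mean) @ W @ W.T + mean


# Random stand-in for flattened images (assumption): 200 images of 64 pixels.
rng = np.random.default_rng(1)
clean = rng.uniform(size=(200, 64))
noisy = add_gaussian_noise(clean)

for D in (5, 20):
    recon = pca_reconstruct(noisy, D)
    err = np.mean((recon - noisy) ** 2)
    print(f"D={D}: mean squared difference from the noisy input {err:.4f}")
```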