In [None]:
# Final Exam Spring 2025 - Adrian Halgas

**Problem 1 Clustering Noisy Images**

DATASET: 8358 sampled images with a non-uniform label distribution. Each row is an image: the first column is the label (digit), and the other 784 columns are pixel values. These images contain noise; to achieve a better result, you need to work on the features first.

Task: Run a clustering algorithm on the given data set, extract k=10 clusters, and report entropy statistics using the given evaluation function (or write your own). You will have to decide the data preprocessing (if any) and the clustering algorithm. You can use scientific computing libraries (e.g. NumPy / SciPy) for both processing (for example PCA) and clustering (for example KMeans), and you can use any functions you developed in your previous homeworks.
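
One possible shape for the preprocessing and clustering steps is sketched below, using NumPy only. It is an illustration under assumptions, not a reference solution: the helper names `pca_reduce` and `kmeans`, the number of components, and the synthetic blobs standing in for the image data are all hypothetical, and the labels are never touched.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns a cluster index for every row."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2
        d2 = ((X ** 2).sum(axis=1)[:, None]
              - 2.0 * X @ centers.T
              + (centers ** 2).sum(axis=1)[None, :])
        assign = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers

# Synthetic stand-in for the exam data: 3 noisy blobs in 50 dimensions
rng = np.random.default_rng(1)
true_centers = rng.normal(scale=5.0, size=(3, 50))
X_demo = np.vstack([c + rng.normal(size=(100, 50)) for c in true_centers])

Z = pca_reduce(X_demo, n_components=2)
assign, _ = kmeans(Z, k=3)
print(np.bincount(assign, minlength=3))  # cluster sizes; every row is assigned
```

Because every row receives an `argmin` assignment, no data point is left out of the clusters. Lloyd's algorithm depends on its initialization, so in practice several seeds are usually tried and the number of PCA components is treated as a tuning choice.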

Labels must not be used during preprocessing or clustering, only for evaluation: print a confusion matrix of counts, calculate the entropy of each row and column, and compute the count-weighted average entropy for rows (labels) and columns (clusters). Make sure every data point is assigned to one of the k=10 clusters.
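
To make the entropy statistics concrete, here is a small hand-checkable sketch on a toy 2x2 confusion matrix. The helper `row_entropies` and the toy counts are assumptions made for illustration only.

```python
import numpy as np

def row_entropies(counts):
    """Base-2 entropy of each row of a count matrix (0 log 0 is treated as 0)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    logp = np.log2(p, out=np.zeros_like(p), where=p > 0)
    return -(p * logp).sum(axis=1)

# Toy confusion matrix: rows = true labels, columns = clusters
CM_toy = np.array([[8, 0],
                   [2, 6]])
n = CM_toy.sum()

class_H = row_entropies(CM_toy)      # how each label spreads over clusters
cluster_H = row_entropies(CM_toy.T)  # how mixed each cluster's labels are

# Count-weighted averages: each row/column entropy weighted by its size
w_class = (class_H * CM_toy.sum(axis=1)).sum() / n
w_cluster = (cluster_H * CM_toy.sum(axis=0)).sum() / n
print(class_H, cluster_H, w_class, w_cluster)
```

A row or column that falls entirely in one cell has zero entropy, so lower weighted averages indicate that clusters and labels line up more cleanly.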

In [None]:
import numpy as np


def evaluate(true_labels: np.ndarray, pred_labels: np.ndarray) -> tuple:
  """Entropy-based evaluation of a label assignment.

  Parameters:
    true_labels: the ground-truth class labels on the input data.
    pred_labels: the predicted class labels on the input data.

  Returns:
    a tuple (CM, (w_cs_e, w_cr_e, we)) containing the confusion matrix `CM`, the count-weighted average
    class entropy `w_cs_e`, the count-weighted average cluster entropy `w_cr_e`, and the array `we` holding both.
  """
  from scipy.stats import entropy

  assert len(true_labels) == len(pred_labels), "Label predictions don't match"

  ## Map the labels to index set {0, 1, ..., k - 1 }
  t_classes, t_labels = np.unique(true_labels, return_inverse=True)
  p_classes, p_labels = np.unique(pred_labels, return_inverse=True)
  assert np.all(np.isin(p_classes, t_classes)), "Predicted class outside of labels given"

  ## Accumulate the counts
  n_classes = len(t_classes)
  CM = np.zeros(shape=(n_classes, n_classes), dtype=np.uint32)
  ind = np.ravel_multi_index([t_labels, p_labels], CM.shape)
  np.add.at(CM.ravel(), ind, 1)

  ## Compute the entropy of the empirical row/column distributions
  empirical_dist = lambda x: x / np.sum(x)
  cluster_entropy = np.apply_along_axis(lambda x: entropy(empirical_dist(x), base=2), 0, CM)
  class_entropy = np.apply_along_axis(lambda x: entropy(empirical_dist(x), base=2), 1, CM)

  ## Average w/ count weights
  w_cluster_entropy = np.sum(cluster_entropy * CM.sum(axis=0)) / len(true_labels)
  w_class_entropy = np.sum(class_entropy * CM.sum(axis=1)) / len(true_labels)
  w_entropies = np.array([w_class_entropy, w_cluster_entropy])

  with np.printoptions(precision=3):
    print(f"Class Entropies: {class_entropy}")
    print(f"Cluster Entropies: {cluster_entropy}")
    print(f"Weighted average entropies: {w_entropies}, (avg: {np.mean(w_entropies):.3f})")
  return CM, (w_class_entropy, w_cluster_entropy, w_entropies)

**Problem 2**

An auction house decides each morning, randomly based on internal rules, what class of products will be auctioned: Cars, Jewelry, Paintings, or Houses.
Each class has its own bidders, who are called to place bids on matching days. They are characterized by a bidding_rate parameter λ, and the intervals between bids are assumed to follow a negative exponential distribution; that is, the probability that no bid occurs decreases exponentially with the length of time. For each day we record the number of bids, which theory dictates must satisfy E[#bids] = λ and E[bidding_interval] = 1 / λ.

Part A (25 points). Given the exponential distribution assumption on bidding intervals, figure out the proper distribution for #bids/day, parametrized by λ. You can use online resources to do so.
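
A standard result is that when the gaps between events are independent and exponential with rate λ, the number of events in a unit interval is Poisson(λ). The simulation below is a rough sanity check of that relationship, not a proof; the rate `lam = 6.0` and the horizon `n_days` are arbitrary assumed values.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
lam = 6.0        # hypothetical bidding rate, in bids per day
n_days = 20000

# Draw exponential gaps (mean 1/lam) and accumulate them into bid times
gaps = rng.exponential(scale=1.0 / lam, size=int(1.2 * lam * n_days))
times = np.cumsum(gaps)
assert times[-1] >= n_days, "not enough gaps drawn to cover the horizon"

# Count how many bid times fall in each unit-length day
counts = np.bincount(times[times < n_days].astype(int), minlength=n_days)

# Compare the empirical distribution of daily counts with the Poisson pmf
kmax = int(counts.max())
empirical = np.bincount(counts, minlength=kmax + 1) / n_days
poisson = np.array([math.exp(-lam) * lam ** k / math.factorial(k)
                    for k in range(kmax + 1)])

print(f"mean = {counts.mean():.3f}, variance = {counts.var():.3f}")
print(f"max |empirical - Poisson pmf| = {np.abs(empirical - poisson).max():.4f}")
```

Both the sample mean and the sample variance of the daily counts should land near λ, which is the signature of a Poisson distribution and matches E[#bids] = λ.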

Part B (75 points). The file contains counts of auction bids for 10000 days, without specifying which class was auctioned on each day. Estimate the bidding_rate parameter for each class (4 λ values) and also estimate how many days each class was auctioned (4 counts out of 10000).
Hint: use EM on a mixture of 4 distributions found in Part A. You can use libraries for distribution computation (pdf), but the EM steps have to be your own implementation. Here is a possible result:
Estimated λ-s: [ 6.13 15.22 1.97 22.34]
Estimated #days : [3087 1272 1953 3687 ]
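
The hint can be sketched as follows, assuming the Part A distribution is Poisson. This is a minimal illustration on synthetic counts, not the exam file: the function names `poisson_logpmf` and `em_poisson_mixture`, the quantile-based initialization, and the "true" rates and day counts used to generate the data are all assumptions.

```python
import numpy as np

def poisson_logpmf(x, lam):
    """log P(X = x) under Poisson(lam) for int array x and rate array lam."""
    # log(x!) looked up from a cumulative sum of logs
    log_fact = np.concatenate([[0.0], np.cumsum(np.log(np.arange(1, x.max() + 1)))])
    return x[:, None] * np.log(lam)[None, :] - lam[None, :] - log_fact[x][:, None]

def em_poisson_mixture(x, k, n_iter=500, tol=1e-6):
    x = np.asarray(x, dtype=int)
    # Initialise rates at spread-out data quantiles, with equal mixing weights
    lam = np.quantile(x, (np.arange(k) + 0.5) / k) + 0.5
    pi = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(class j | x_i), in log space
        log_joint = np.log(pi)[None, :] + poisson_logpmf(x, lam)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        r = np.exp(log_joint - log_norm[:, None])
        # M-step: weighted maximum-likelihood updates
        nk = r.sum(axis=0)
        pi = nk / len(x)
        lam = np.maximum((r * x[:, None]).sum(axis=0) / nk, 1e-9)
        ll = log_norm.sum()  # log-likelihood under the previous parameters
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return lam, pi, r

# Synthetic stand-in for the bid counts (hypothetical rates and day counts)
rng = np.random.default_rng(0)
true_lam = np.array([2.0, 6.0, 15.0, 22.0])
true_days = np.array([2000, 3000, 1300, 3700])
x = np.concatenate([rng.poisson(l, size=d) for l, d in zip(true_lam, true_days)])

lam_hat, pi_hat, resp = em_poisson_mixture(x, k=4)
print("Estimated rates:", np.round(lam_hat, 2))
print("Estimated #days:", np.round(pi_hat * len(x)).astype(int))
```

Here the day counts are taken as the mixing weights times the number of days (soft counts); counting the `argmax` of the responsibilities per day is an alternative that can give slightly different numbers. The component order is arbitrary, and EM may converge to a local optimum, so trying a few initializations is reasonable.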