***

*Course:* [Math 535](https://people.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS)  
*Chapter:* 1-Introduction: a first data science problem   
*Author:* [Sebastien Roch](https://people.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison (with the help of ChatGPT)   
*Updated:* Sep 5, 2025   
*Copyright:* &copy; 2025 Sebastien Roch

***

## Auto-quizzes

This notebook generates automated quizzes along with their answers. Set `seed` to any integer to produce a different quiz.

In [None]:
# Python 3
import numpy as np
from numpy import linalg as LA
from numpy.random import default_rng

In [None]:
# Set the `seed` to any integer
seed = 535

In [None]:
rng = default_rng(seed)
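
As a brief illustration of why the seed matters, the sketch below builds two generators from the same seed and checks that they produce identical draws. The names `draws_a`, `draws_b`, and `same_seed_matches` are introduced here for illustration only. Note that only calls made through a seeded generator are reproducible in this way; calls to the global `np.random.*` functions are not controlled by it.

```python
import numpy as np
from numpy.random import default_rng

seed = 535  # same illustrative value as above; any integer works

# Two generators built from the same seed produce identical draws.
draws_a = default_rng(seed).choice([0, 1], 5)
draws_b = default_rng(seed).choice([0, 1], 5)

same_seed_matches = np.array_equal(draws_a, draws_b)
print('same draws for same seed:', same_seed_matches)
```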

**AQ1.1**  

***

*Use the following code to generate the quiz questions. You should be able to answer them by hand -- that is, without the help of numerical computation.*

***

This exercise concerns $k$-means clustering with $k = 2$ clusters. To be consistent with Python indexing, we will index the clusters and vectors starting at 0. Consider the following input vectors:

In [None]:
x0 = np.array([0., 0., 0.])
x1 = np.array([0., 1., 0.])
x2 = np.array([0., 0., 1.])
x3 = np.array([1., 1., 0.])
x4 = np.array([1., 0., 1.])
x5 = np.array([1., 1., 1.])
print('x0 =', x0)
print('x1 =', x1)
print('x2 =', x2)
print('x3 =', x3)
print('x4 =', x4)
print('x5 =', x5)

Consider the following representative vectors:

In [None]:
# Draw from the seeded generator `rng` so that the quiz depends on `seed`
mu0 = np.array([-1.])
random_part0 = rng.choice([0, 1], 2)
mu0 = np.concatenate([mu0, random_part0])
mu1 = np.array([2.])
random_part1 = rng.choice([0, 1], 2)
mu1 = np.concatenate([mu1, random_part1])
# Apply the same random permutation of coordinates to both representatives
permuted_indices = rng.permutation(len(mu0))
mu0 = mu0[permuted_indices]
mu1 = mu1[permuted_indices]
print('mu0 =', mu0)
print('mu1 =', mu1)

and the following clusters:

In [None]:
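# Draw from the seeded generator `rng` so that the quiz depends on `seed`
original_set = np.array([0, 1, 2, 3, 4, 5])
rng.shuffle(original_set)
# Split point between 1 and 5 (inclusive) so that neither cluster is empty
split_point = rng.integers(1, len(original_set))
C0 = original_set[:split_point]
C1 = original_set[split_point:]
print('C0 =', C0)
print('C1 =', C1)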

a) For the fixed clusters `C0` and `C1`, compute the optimal representatives `mu0_new` and `mu1_new`.

b) For the fixed representatives `mu0` and `mu1`, compute the optimal clustering `C0_new` and `C1_new`.

c) For the solution in b), compute the $k$-means objective function.

d) For the solution in b), write down the matrix form of the input and of the solution, that is, the matrices `X`, `Z`, `U` in the notes.

***

*Use the following code to generate the answers.*

***

In [None]:
# a)

k = 2
X = np.stack((x0,x1,x2,x3,x4,x5))

mu0_new = np.sum(X[C0,:],axis=0) / len(C0)
mu1_new = np.sum(X[C1,:],axis=0) / len(C1)

print('mu0_new =', mu0_new)
print('mu1_new =', mu1_new)
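
For reference, the computation in a) is the mean of the vectors in each cluster. Written with the indexing of this exercise (the symbol $\boldsymbol{\mu}_i^*$ is introduced here for illustration and may differ from the notation in the notes):

$$
\boldsymbol{\mu}_i^* = \frac{1}{|C_i|} \sum_{j \in C_i} \mathbf{x}_j, \qquad i = 0, 1.
$$

As a hypothetical example, if a cluster were $\{0, 3\}$, its optimal representative would be $\frac{1}{2}\left[(0,0,0) + (1,1,0)\right] = \left(\tfrac{1}{2}, \tfrac{1}{2}, 0\right)$.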

In [None]:
# b)

(n,d) = X.shape
U = np.stack((mu0,mu1))
dist = np.zeros(n)
C0_new = []
C1_new = []
for j in range(n):
    dist_to_i = np.array([LA.norm(X[j,:] - U[i,:]) for i in range(k)])
    if np.argmin(dist_to_i) == 0:
        C0_new.append(j)
        dist[j] = dist_to_i[0]
    else:
        C1_new.append(j)
        dist[j] = dist_to_i[1]    

print('C0_new =', C0_new)
print('C1_new =', C1_new)
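
The assignment step in b) can also be written without an explicit loop. The following is a minimal sketch using NumPy broadcasting; the representatives in `U_demo` are hypothetical values chosen only for illustration, not the ones generated above, and the names `X_demo`, `U_demo`, `sq_dist`, and `assign_demo` are introduced here.

```python
import numpy as np

# Same six input vectors as in the exercise.
X_demo = np.array([[0., 0., 0.],
                   [0., 1., 0.],
                   [0., 0., 1.],
                   [1., 1., 0.],
                   [1., 0., 1.],
                   [1., 1., 1.]])
# Hypothetical representatives, chosen only for illustration.
U_demo = np.array([[-1., 0., 1.],
                   [2., 1., 0.]])

# sq_dist[j, i] = squared Euclidean distance from X_demo[j] to U_demo[i]
diff = X_demo[:, np.newaxis, :] - U_demo[np.newaxis, :, :]
sq_dist = np.sum(diff ** 2, axis=2)

# Each point goes to its closest representative (ties go to the lower index).
assign_demo = np.argmin(sq_dist, axis=1)
print('assignments =', assign_demo)
```

Minimizing the squared distance gives the same assignment as minimizing the distance itself, so the square root can be skipped here.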

In [None]:
# c)

G = np.sum(dist ** 2)

print('G =', G)

In [None]:
# d)

Z = np.zeros((len(original_set), 2), dtype=int)

for i in C0_new:
    Z[i, 0] = 1
for i in C1_new:
    Z[i, 1] = 1

print('X =', X)
print('Z =', Z)
print('U =', U)
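
As a sanity check connecting c) and d), the sketch below compares the sum of squared distances with the squared Frobenius norm of `X - Z U`. It assumes the convention that row `j` of `Z` is the one-hot indicator of the cluster of point `j` and that row `i` of `U` is the representative of cluster `i`. The data, the assignment, and all `_demo` names are hypothetical and used only for illustration.

```python
import numpy as np

# Same six input vectors as in the exercise.
X_demo = np.array([[0., 0., 0.],
                   [0., 1., 0.],
                   [0., 0., 1.],
                   [1., 1., 0.],
                   [1., 0., 1.],
                   [1., 1., 1.]])
# Hypothetical representatives and assignment, chosen only for illustration.
U_demo = np.array([[-1., 0., 1.],
                   [2., 1., 0.]])
assign_demo = np.array([0, 0, 0, 1, 1, 1])

# One-hot encoding: Z_demo[j, i] = 1 exactly when point j is in cluster i.
Z_demo = np.eye(2, dtype=int)[assign_demo]

# Row j of Z_demo @ U_demo is the representative assigned to point j.
G_matrix = np.linalg.norm(X_demo - Z_demo @ U_demo, 'fro') ** 2

# Direct sum of squared distances, for comparison.
G_sum = sum(np.sum((X_demo[j] - U_demo[assign_demo[j]]) ** 2)
            for j in range(X_demo.shape[0]))

print('G_matrix =', G_matrix)
print('G_sum =', G_sum)
```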

$\lhd$