In [58]:
# Initialize Otter
import otter
grader = otter.Notebook("ps7.ipynb")

In [59]:
import numpy as np


## Question 1: Down with `for` loops
Each of the problems below contains a function that uses `for` loops to perform a certain operation on arrays. Your job is to rewrite these functions using only NumPy array manipulations and library functions (e.g. `np.xxx()`). Do not use any `for` or `while` loops, iterators, generators, or list comprehensions in your solutions.

**1(a)** (3 pts) Return all of the rows of the 2-D integer matrix `A` in which every entry is distinct:

In [60]:
def distinct_rows_py(A):
    'Return all rows of A that have completely distinct entries.'
    return np.array([a for a in A if len(set(a)) == len(a)])

In [61]:
A = np.eye(5)
distinct_rows_py(A)

array([], dtype=float64)

In [62]:
A = np.arange(9).reshape(3, 3)
distinct_rows_py(A)

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [63]:
A = np.array([
    [1, 2, 3],
    [4, 4, 4],
    [5, 6, 6]])
distinct_rows_py(A)

array([[1, 2, 3]])

In [64]:
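def distinct_rows_np(A):
    # After sorting each row, equal entries are adjacent, so a row has all-distinct
    # entries exactly when every consecutive difference is nonzero.
    srt = np.sort(A, axis=1)
    # Test each difference separately; a product of differences can overflow to 0.
    idx = np.all(np.diff(srt, axis=1) != 0, axis=1)
    return A[idx, :]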

In [65]:
grader.check("1a")
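
One caveat when detecting repeated entries from the gaps of a sorted row: a product of gaps is zero exactly when some gap is zero only in exact arithmetic. With fixed-width integers the product can wrap around. The sketch below uses a hand-picked row (an assumption for illustration, not part of the assignment's tests) to compare the two ways of building the mask.

```python
import numpy as np

# Three distinct entries whose sorted gaps are both 2**32.
row = np.array([[0, 2**32, 2**33]], dtype=np.int64)
gaps = np.diff(np.sort(row, axis=1), axis=1)

# The product 2**64 does not fit in int64 and wraps around to 0,
# so the row is wrongly flagged as containing a repeat.
mask_prod = np.prod(gaps, axis=1) != 0
print(mask_prod)   # [False]

# Testing every gap on its own involves no multiplication and no overflow.
mask_all = np.all(gaps != 0, axis=1)
print(mask_all)    # [ True]
```

For small entries the two masks agree; the difference only shows up when the gaps are large.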

**1(b)** (3 pts) Given a vector $v$ of length $n$ and an integer $0<k<n$, return the 2-D array whose rows are the length-$k$ sliding windows of $v$:
```
[[v[0], ..., v[k-1]],
 [v[1], ..., v[k]  ],
 [v[2], ..., v[k+1]],
 [ .     .       . ],
 [ .     .       . ],
 [ .     .       . ],
 [v[n-k], ..., v[n-1]]]
 ```

In [66]:
def sliding_stack_py(v, k):
    "Stack sliding windows of v of length k."
    rows = []
    for i in range(len(v) - k + 1):
        rows.append(v[i : (i + k)])
    return np.array(rows)

In [67]:
sliding_stack_py(np.array([1, 2, 3, 4, 5]), 3)

array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])

In [68]:
def sliding_stack_np(v, k):
    n = len(v)
    # Broadcasting: idx[i, j] = i + j, so row i of v[idx] is the window v[i : i + k].
    idx = np.arange(k)[np.newaxis, :] + np.arange(n - k + 1)[:, np.newaxis]
    # Alternative:
    #   from numpy.lib.stride_tricks import sliding_window_view
    #   return sliding_window_view(v.flatten(), window_shape=k)
    return v[idx]

In [69]:
grader.check("1b")
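
To make the index grid concrete, the following sketch prints it for the example vector above and checks that fancy indexing gives the same windows as `sliding_window_view`. It assumes NumPy 1.20 or later, where `sliding_window_view` is available.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

v = np.array([1, 2, 3, 4, 5])
k = 3
n = len(v)

# Column vector of window starts plus row vector of offsets: idx[i, j] = i + j.
idx = np.arange(n - k + 1)[:, np.newaxis] + np.arange(k)[np.newaxis, :]
print(idx)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]

windows = sliding_window_view(v, window_shape=k)  # read-only view, no copy
print(np.array_equal(v[idx], windows))            # True
```

The fancy-indexing version copies the data, while the view shares memory with `v`; which one is preferable depends on whether the result needs to be modified.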

**1(c)** (3 pts) Given a vector of non-negative integers `v`, with `max(v) = m`, return a vector `c` of length `m + 1` such that `c[i]` is the number of times that the integer `i` appears in `v`.

In [70]:
def digit_count_py(v):
    m = max(v)
    ret = np.zeros(m + 1, int)
    for vv in v:
        ret[vv] += 1
    return ret

In [71]:
v = np.array([0, 0, 1, 1, 2, 2, 2, 3, 5])
digit_count_py(v)

array([2, 2, 3, 1, 0, 1])

In [72]:
def digit_count_np(v):
    ret = np.zeros(v.max() + 1, int)
    idx, cnt = np.unique(v, return_counts=True)
    ret[idx] = cnt
    return ret

In [73]:
grader.check("1c")
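
NumPy also provides `np.bincount`, which computes this kind of count directly for non-negative integers. Whether the grader accepts it is not stated here, so treat the sketch below as a cross-check rather than as the intended solution.

```python
import numpy as np

v = np.array([0, 0, 1, 1, 2, 2, 2, 3, 5])

# For non-negative integer input, the result has length max(v) + 1.
counts = np.bincount(v)
print(counts)   # [2 2 3 1 0 1]
```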

**1(d)** (3 pts extra credit) Call a square $n\times n$ matrix $A$ *countersymmetric* if $A_{ij} = A_{n-1-j,\,n-1-i}$ for all $0 \le i, j \le n-1$ (indices starting at zero), that is, if $A$ is symmetric about its anti-diagonal. An example of such a matrix is:

$$
\begin{pmatrix}
4 & 3 & 2 & 1 & 0\\
8 & 7 & 6 & 5 & 1\\
11 & 10 & 9 & 6 & 2\\
13 & 12 & 10 & 7 & 3\\
14 & 13 & 11 & 8 & 4
\end{pmatrix}
$$

Write a function `is_countersym` that checks this property:

In [74]:
def is_countersym_py(A):
    "Returns True if A is countersymmetric"
    n = A.shape[0]
    for i in range(n):
        for j in range(n):
            if A[i, j] != A[n - j - 1, n - i - 1]:
                return False
    return True

In [75]:
cs_matrix = np.array([[ 4,  3,  2,  1,  0], [ 8,  7,  6,  5,  1], [11, 10,  9,  6,  2], [13, 12, 10,  7,  3], [14, 13, 11,  8,  4]])
is_countersym_py(cs_matrix)

True

In [76]:
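def is_countersym_np(A):
    # Step 1: reverse the column order, so that aa[i, j] = A[i, n - 1 - j].
    aa = A[:, ::-1]
    # Step 2: A is countersymmetric exactly when aa is symmetric in the usual sense.
    # Use an exact comparison, matching the != test in is_countersym_py.
    return np.array_equal(aa, aa.T)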

In [77]:
grader.check("1d")
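
An equivalent way to phrase the property is that reflecting the matrix across its anti-diagonal leaves it unchanged. The sketch below builds that reflection by reversing both axes and transposing, then tries it on the example matrix and on a deliberately broken copy (the name `broken` is only for illustration).

```python
import numpy as np

cs = np.array([[ 4,  3,  2,  1,  0],
               [ 8,  7,  6,  5,  1],
               [11, 10,  9,  6,  2],
               [13, 12, 10,  7,  3],
               [14, 13, 11,  8,  4]])

# cs[::-1, ::-1][i, j] = cs[n-1-i, n-1-j]; transposing swaps i and j,
# so reflected[i, j] = cs[n-1-j, n-1-i].
reflected = cs[::-1, ::-1].T
print(np.array_equal(cs, reflected))   # True

broken = cs.copy()
broken[0, 0] = 99                      # its mirror entry broken[4, 4] is still 4
print(np.array_equal(broken, broken[::-1, ::-1].T))   # False
```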

## Question 2: $k$-means clustering

$k$-means is a fundamental algorithm for clustering multivariate data. The inputs to the algorithm are:
- An $n\times p$ data matrix $X$ consisting of $n$ observations of a $p$-dimensional feature vector, and
- A $k\times p$ matrix $C$ containing initial guesses for each of the $k$ cluster centers.

The algorithm proceeds by iteratively a) assigning each point to the nearest cluster center, and b) recomputing the cluster centers as the mean of all of the currently assigned points. Here is a partial implementation:

In [78]:
def kmeans(X, C):
    """
    K-means algorithm.

    Args:
        - X: ndarray, shape (n, p), n observations of p-dimensional feature vector
        - C: ndarray, shape (k, p), k initial cluster centers

    Returns:
        Tuple of length two:
        The first entry is integer ndarray, shape (n), cluster assignments for each data point
        The second entry is ndarray, shape (k, p), centers of each cluster
    """
    assert X.shape[1] == C.shape[1]  # p should match
    old_assignments = None
    while True:
        assignments = nearest_cluster(X, C)
        if np.all(assignments == old_assignments):
            # converged
            return assignments, C
        old_assignments = assignments
        C = compute_centroids(X, assignments)

You will finish implementing this algorithm by completing the missing functions `nearest_cluster()` and `compute_centroids()` below. Note: as in Question 1, avoid using loops, iterators, or comprehensions. Solutions that use only broadcasting exist.

**2(a)** (3 pts) Implement the function `nearest_cluster`. It should take two array arguments, the data points `X` and the cluster centers `C`, and return an integer array giving the index in `C` which is nearest to each point in `X`.

In [79]:
def nearest_cluster(X, C):
    """
    For each point in X, find the nearest point in C.

    Args:
        X: ndarray, shape (n, p), n points of dimension p.
        C: ndarray, shape (k, p), k points of dimension p.

    Returns:
        Integer array of length n, [j[1], j[2], ..., j[n]], such that |X[i] - C[j[i]]| <= |X[i] - C[ell]| for 1 <= ell <= k.
    """
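    # Pairwise Euclidean distances via broadcasting.
    Xnew = X[:, np.newaxis]  # shape (n, 1, p)
    # Xnew - C has shape (n, k, p); taking the norm over axis 2 gives shape (n, k).
    dist = np.linalg.norm(Xnew - C, ord=2, axis=2)
    idx = np.argmin(dist, axis=1)  # shape (n,): index of the nearest center
    return idx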

In [80]:
grader.check("2a")
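
The shapes involved in the broadcast are easiest to follow on a tiny case. The sketch below uses made-up points and centers, and works with squared distances, which have the same minimizer as the distances themselves.

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])   # n = 3 points, p = 2
C = np.array([[0.0, 1.0], [10.0, 10.0]])             # k = 2 centers

# (n, 1, p) minus (1, k, p) broadcasts to (n, k, p).
diff = X[:, np.newaxis, :] - C[np.newaxis, :, :]
print(diff.shape)                      # (3, 2, 2)

sq_dist = np.sum(diff**2, axis=2)      # shape (n, k)
nearest = np.argmin(sq_dist, axis=1)   # shape (n,)
print(nearest)                         # [0 0 1]
```

Here the first two points are closest to center 0 and the last point is closest to center 1.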

**2(b)** (3 pts) Implement the function `compute_centroids`. It should take two array arguments, the data points `X` and the assignment array `a`, and return a $k \times p$ array whose $j$-th row is the centroid (average) of the points assigned to cluster $j$, for $j = 0, \dots, k-1$. (You may assume that every entry of $a$ is between $0$ and $k-1$, inclusive.)

In [81]:
def compute_centroids(X, a):
    '''
    X: (n, p) data matrix.
    a: (n,) assignment array; a[i] is the index of the cluster that point i belongs to.
    '''
    # Sort the points by cluster label so that each cluster occupies a contiguous block of rows.
    a_srt = np.argsort(a)

    # For the sorted labels, np.unique returns:
    #   the distinct cluster labels (unused here),
    #   ocr_1st: the index of the first occurrence of each label, i.e. where each block starts,
    #   cnt: the number of points in each cluster.
    # Sorting is needed because reduceat sums over consecutive slices given by increasing indices.
    _, ocr_1st, cnt = np.unique(a[a_srt], return_index=True, return_counts=True)

    # For each i, reduceat sums the rows ocr_1st[i]:ocr_1st[i+1] of the sorted data
    # (the last slice runs to the end), which gives the per-cluster sums.
    sum_gp = np.add.reduceat(X[a_srt], indices=ocr_1st, axis=0)
    return sum_gp / cnt[:, np.newaxis]

In [82]:
grader.check("2b")
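
A different route to the same centroids, shown here only as a sketch, is to build a one-hot membership matrix and use a matrix product for the per-cluster sums. The data are made up, and `k` is inferred from the largest label, which is an assumption that fails if the highest-numbered cluster is empty.

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 0.0], [4.0, 4.0]])
a = np.array([0, 0, 1, 0])
k = a.max() + 1                      # assumption: the largest label is k - 1

# one_hot[i, j] is 1.0 when point i belongs to cluster j, else 0.0.
one_hot = (a[:, np.newaxis] == np.arange(k)).astype(float)   # shape (n, k)
sums = one_hot.T @ X                 # shape (k, p): per-cluster sums
counts = one_hot.sum(axis=0)         # shape (k,): per-cluster sizes
centroids = sums / counts[:, np.newaxis]
print(centroids)                     # cluster 0 -> (2, 2), cluster 1 -> (10, 0)
```

This version uses O(nk) memory for the membership matrix, and an empty cluster would produce a division by zero, so it trades some efficiency for readability.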

**2(c)** (5 pts) The performance of the $k$-means algorithm is known to depend heavily on the starting point (the initial clusters `C` passed in as the second argument). In some cases, using a "good" starting point can dramatically improve the performance of the algorithm.

The $k$-means++ algorithm is designed to find such a good starting point. [According to Wikipedia](https://en.wikipedia.org/wiki/K-means%2B%2B), the steps of $k$-means++ are:

1. Choose one center uniformly at random among the data points.
2. For each data point $x$ not chosen yet, compute $D(x)$, the distance between $x$ and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point $x$ is chosen with probability proportional to $D(x)^2$.
4. Repeat Steps 2 and 3 until $k$ centers have been chosen.

Implement this algorithm using the skeleton provided below. As before, your implementation should only use NumPy functions, with no additional loops or comprehensions.

**Note**: To ensure reproducibility, the parts of the algorithm that rely on randomness are provided for you. Your job is to fill in the missing lines necessary to complete the algorithm.

In [83]:
def kmeanspp(X, k, rng):
    """
    k-means++ algorithm.

    Args:
        - X: ndarray, shape (n, p), as above.
        - k, the number of clusters.
        - rng: instance of np.random.Generator().

    Returns:
        ndarray, shape (k, p), cluster centers.
    """
    n, p = X.shape
    C = np.zeros((k, p))
    # Step 1: choose the first center uniformly at random among the n data points.
    j = rng.choice(n)
    C[0] = X[j]
    for i in range(1, k):
        # Step 2: match each point with its nearest center among the i chosen so far, C[0:i].
        ctr_mch = C[nearest_cluster(X, C[0:i, :])]
        # Squared distance D(x)^2 to that center; points already chosen as centers get weight 0.
        w = np.sum((X - ctr_mch)**2, axis=1)
        # Normalize the weights so that they sum to 1.
        w = w / np.sum(w)
        # Step 3: sample the next center with probability proportional to D(x)^2.
        j = rng.choice(n, p=w)
        C[i] = X[j]
    return C

In [84]:
grader.check("2c")
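
One round of Steps 2 and 3 can be traced by hand on three points. In the sketch below the data, the seed, and the choice of `X[0]` as the first center are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed for the sketch
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
centers = X[[0]]                        # pretend Step 1 picked X[0]; shape (1, p)

# Step 2: D(x)^2, the squared distance from each point to its nearest chosen center.
d2 = np.min(np.sum((X[:, np.newaxis, :] - centers)**2, axis=2), axis=1)
print(d2)                               # [ 0.  1. 25.]

# Step 3: sampling weights proportional to D(x)^2.
w = d2 / d2.sum()                       # weights 0, 1/26, 25/26
j = rng.choice(len(X), p=w)             # never 0, because w[0] == 0
```

The already-chosen point has weight zero, so it cannot be selected again, and the far-away point is 25 times more likely to be picked than the nearby one.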

**2(d)** (2 pts) In order to measure how good a clustering is, we can define the *within-class variance* 

$$ V(\mathbf{X}, \mathbf{a}, \mathbf{C}) = \sum_{i=1}^n \| \mathbf{x}_i - \mathbf{c}_{a_i} \|^2,$$

where the $i$-th element of $\mathbf{a}=\{a_1,\dots,a_n\}$ is the cluster assignment of observation $i$, and $\mathbf{C}=(\mathbf{c}_1,\dots,\mathbf{c}_k)$ are the centers of each cluster. Thus, $V(\mathbf{X}, \mathbf{a}, \mathbf{C})$ is the sum of the squared distances from each data point to the center of its assigned cluster.

Implement this function. (Again, no loops, just use Numpy functions.)

In [85]:
def V(X, a, C):
    # C[a] has shape (n, p): row i is the center assigned to observation i.
    ctr_mch = C[a]
    return np.sum((X - ctr_mch)**2)

In [86]:
grader.check("2d")
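
A small hand-checkable case helps confirm the indexing. The points, assignments, and centers below are made up for illustration.

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
a = np.array([0, 0, 1])
C = np.array([[1.0, 0.0], [10.0, 0.0]])

# C[a] repeats each center once per assigned point, giving shape (n, p).
residuals = X - C[a]
within = np.sum(residuals**2)
print(within)   # 2.0
```

The first two points each sit at distance 1 from center 0 and the third coincides with center 1, so the total is $1 + 1 + 0 = 2$.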

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Upload this .zip file to Gradescope for grading.

In [87]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)