# CS6140 Assignments

**Instructions**
1. In each assignment cell, look for the block:
 ```
  #BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
 ```
1. Replace this block with your solution.
1. Test your solution by running the cells following your block (indicated by ##TEST##)
1. Click the "Validate" button above to validate the work.

**Notes**
* You may add other cells and functions as needed
* Keep all code in the same notebook
* In order to receive credit, code must "Validate" on the JupyterHub server

---

# Assignment 2: Clustering

In [None]:
require './assignment_lib'

## Question 1.1: K-Means (25 points)

Generate a two-dimensional dataset with 5 clusters, where cluster $i$ has mean $\mu_i$ and standard deviation $\sigma_i$ (since $\sigma_i = 1$, the variance is also 1), defined as follows:
* $\mu_1 = (-4,4)$, $\sigma_1 = 1$
* $\mu_2 = (-4,-4)$, $\sigma_2 = 1$
* $\mu_3 = (4,4)$, $\sigma_3 = 1$
* $\mu_4 = (4,-4)$, $\sigma_4 = 1$
* $\mu_5 = (0,0)$, $\sigma_5 = 1$

Name your features ```x1``` and ```x2```.

Generate 100 points in each cluster and plot each cluster with a different color. Use the usual row format, with the cluster id in the ```label``` field, and also store the cluster in a field called ```cluster```.

To generate a value from a normal distribution, use the following:

```ruby
r = Distribution::Normal.rng(mean,stdev)
x = r.call()
```
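To see what a single draw from a normal distribution involves, here is a minimal sketch using only Ruby's standard library and the Box-Muller transform. The helper name `box_muller_sample` is hypothetical and not part of the assignment library; in the notebook itself, keep using `Distribution::Normal.rng` as shown above.

```ruby
# Hypothetical helper: draw one sample from a normal distribution with the
# given mean and standard deviation, using the Box-Muller transform.
def box_muller_sample(mean, stdev, rng = Random.new)
  u1 = 1.0 - rng.rand   # in (0, 1], so Math.log(u1) is finite
  u2 = rng.rand
  z = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math::PI * u2)
  mean + stdev * z
end

rng = Random.new(42)    # seeded so repeated runs give the same samples
samples = Array.new(10_000) { box_muller_sample(-4.0, 1.0, rng) }
sample_mean = samples.sum / samples.size
sample_var = samples.sum { |s| (s - sample_mean)**2 } / samples.size
puts "mean ~ #{sample_mean.round(2)}, variance ~ #{sample_var.round(2)}"
```

With this many samples, the printed mean and variance should land close to the requested values of -4 and 1.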

In [None]:
def random_point(cluster, sigma, label)
  x1, x2 = cluster
  point = Hash.new
  point["features"] = Hash.new
  #radius = Distribution::Normal.rng(0, sigma).call.abs
  #theta = 2 * Math::PI * rand()
  #point["features"]["x1"] = x1 + radius * Math.cos(theta)
  #point["features"]["x2"] = x2 + radius * Math.sin(theta)
  point["features"]["x1"] = x1 + Distribution::Normal.rng(0, sigma).call
  point["features"]["x2"] = x2 + Distribution::Normal.rng(0, sigma).call
  point["cluster"] = cluster
  point["label"] = label
  point
end

def create_cluster_dataset()
  n = 100
  clusters = [[-4,4], [-4,-4], [4,4], [4,-4], [0,0]]
  dataset = []
  #BEGIN YOUR CODE
  clusters.each_with_index do |cluster, index|
    (1..n).each do
      dataset << random_point(cluster, 1, index)
    end
  end
  #END YOUR CODE
  return dataset
end

In [None]:
def test_11_1()
  dataset = create_cluster_dataset()
  assert_false(dataset.empty?)
  assert_equal 500, dataset.size
  plot_clusters(dataset)
end
test_11_1()

In [None]:
def test_11_2()
  dataset = create_cluster_dataset()
  counts = dataset.group_by {|x| x["cluster"]}
  counts.each_key {|k| counts[k] = counts[k].size}
  assert_equal 5, counts.size
  assert counts.values.all? {|v| v == 100}
  
  counts = dataset.group_by {|x| x["label"]}
  counts.each_key {|k| counts[k] = counts[k].size}
  assert_equal 5, counts.size
  assert counts.values.all? {|v| v == 100}

end
test_11_2()

In [None]:
def test_11_3()
  dataset = create_cluster_dataset()
  clusters = dataset.group_by {|x| x["label"]}
  means = [[-4,4], [-4,-4], [4,4], [4,-4], [0,0]]
  5.times do |i|
    assert_not_nil clusters[i]
    assert_equal 100, clusters[i].size
    m1 = clusters[i].inject(0.0) {|u,r| u += r["features"]["x1"]} / 100.0
    m2 = clusters[i].inject(0.0) {|u,r| u += r["features"]["x2"]} / 100.0
    assert_in_delta m1, means[i][0], 0.5
    assert_in_delta m2, means[i][1], 0.5    
  end
end
test_11_3()

## Question 1.2 (25 points)

Generate $k = 5$ random means, each within the min and max of each feature value.
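As a rough illustration of drawing a value between a feature's minimum and maximum, the sketch below scales a uniform random number into that range. The toy rows and the seed are made up for the example; only the `features` layout follows the notebook.

```ruby
# Hypothetical toy rows in the notebook's "features" format.
rows = [
  { "features" => { "x1" => -5.0, "x2" => 1.0 } },
  { "features" => { "x1" => 3.0, "x2" => -2.0 } },
  { "features" => { "x1" => 0.5, "x2" => 4.0 } }
]

rng = Random.new(7)
lo, hi = rows.map { |r| r["features"]["x1"] }.minmax
# rng.rand is in [0, 1), so the scaled value lies between lo and hi.
candidate = lo + (hi - lo) * rng.rand
puts "x1 range: [#{lo}, #{hi}], candidate mean x1: #{candidate}"
```

The same step would be repeated for `x2`, and for each of the $k$ means.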

In [None]:
def init_cluster data, k
  means = Hash.new {|h,k| h[k] = Hash.new {|h,k| h[k] = 0.0}}
  x1min, x1max = data.map { |x| x["features"]["x1"] }.minmax
  x2min, x2max = data.map { |x| x["features"]["x2"] }.minmax
  k.times do |i|
    means[i]["x1"] = x1min + (x1max - x1min) * rand()
    means[i]["x2"] = x2min + (x2max - x2min) * rand()
  end
  means
end

In [None]:
def test_12_1()
  dataset = create_cluster_dataset()  
  means = init_cluster dataset, 5
  assert_equal 5, means.size
  assert_equal 2, means.first.size
  5.times {|i| assert means.has_key? i}
  assert_true(means.values.all? {|m| m["x1"].abs < 10}, "Expected mean of x1 in [-10, 10]")
  assert_true(means.values.all? {|m| m["x2"].abs < 10}, "Expected mean of x2 in [-10, 10]")
end

test_12_1()

## Question 2.1: (10 points)
We will now implement the k-means algorithm to recover the clusters above. Return the means and plot the obtained clustering. Compare your discovered means to the known means.

Starting from a randomly initialized set of clusters defined by a set of means, we will iteratively refine our estimate of the clustering. At each step, we assign each point to a cluster and then update the means from the assigned clustering. We denote candidate cluster membership by the binary indicator $z_{i,j}$, where $z_{i,j} = 1$ if example $i$ belongs to cluster $j$ and $0$ otherwise. Use an ```Array``` for a row of $z$.

Start by setting the $z$ vector: find the mean closest to each point in the dataset. Set the candidate cluster in the ```cluster``` field of each row.
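The assignment step for a single point can be sketched as follows. The two means and the point are hypothetical toy values; the sketch computes the squared Euclidean distance to each mean, picks the nearest, and builds the matching row of $z$.

```ruby
# Hypothetical toy values, using the notebook's hash layout for means.
means = {
  0 => { "x1" => -4.0, "x2" => 4.0 },
  1 => { "x1" => 4.0, "x2" => -4.0 }
}
point = { "features" => { "x1" => 3.0, "x2" => -3.5 } }

# Squared Euclidean distance from the point to each mean.
sq_dists = means.transform_values do |m|
  m.sum { |name, value| (value - point["features"][name])**2 }
end

# Index of the nearest mean, then a row of z with a single 1.0 entry.
nearest = sq_dists.min_by { |_, d| d }.first
z_row = Array.new(means.size, 0.0)
z_row[nearest] = 1.0

puts "squared distances: #{sq_dists}"
puts "z row: #{z_row}"
```

Here the point is 105.25 away (squared) from mean 0 and 1.25 from mean 1, so the row of $z$ is `[0.0, 1.0]`.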

In [None]:
def assign_cluster(data, means)
  indices = Array.new(means.size) {|i| i}
  z = Array.new
  data.each do |point|
    cluster_index = indices.min_by {|i| ((means[i]["x1"] - point["features"]["x1"]) ** 2 + (means[i]["x2"] - point["features"]["x2"]) ** 2) }
    point["cluster"] = cluster_index
    zi = Array.new(means.size, 0.0)
    zi[cluster_index] = 1.0
    z << zi
  end
  return z
end

In [None]:
def test_21_1()
  dataset = create_cluster_dataset()  
  means = init_cluster dataset, 5
  z = assign_cluster(dataset, means)
  assert_equal 500, z.size
  assert_true(z.all? {|zi| zi.size == 5}, "Must set a value for each cluster")
  assert_true(z.all? {|zi| g = zi.group_by {|zij| zij}; g[0.0].size == 4 and g[1.0].size == 1}, 
    "Must set only one cluster to 1.0 and all others to 0.0 (not an integer)")
  
  plot_clusters(dataset)
end
test_21_1()

## Question 2.2: (5 points)
Given the $z$ vector and the dataset, calculate the means for each cluster determined by $z$. 
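In other words, the mean of cluster $j$ is $\mu_j = \sum_i z_{i,j} x_i / \sum_i z_{i,j}$. A minimal sketch with hypothetical toy points (plain arrays rather than the notebook's row format) is shown below. Returning `nil` for an empty cluster is an assumption of this sketch; how to handle that case is a design choice.

```ruby
# Hypothetical toy data: three 2-D points and their one-hot rows in z.
points = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

k = z.first.size
means = Array.new(k) do |j|
  # Collect the points whose z row selects cluster j.
  members = points.each_index.select { |i| z[i][j] == 1.0 }.map { |i| points[i] }
  next nil if members.empty?   # assumption: an empty cluster gets no mean
  # Average each coordinate separately.
  members.transpose.map { |coords| coords.sum / coords.size }
end
p means
```

The first two points average to `[2.0, 3.0]`, and the third point is the only member of the second cluster.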


In [None]:
def calculate_means z, data
  means = Hash.new {|h,k| h[k] = Hash.new {|h,k| h[k] = 0.0}}
  cnt = Array.new(z[0].size, 0)
  data.size.times do |i|
    z[i].size.times do |j|
      if z[i][j] == 1.0
        cnt[j] += 1
        means[j]["x1"] += data[i]["features"]["x1"]
        means[j]["x2"] += data[i]["features"]["x2"]
      end
    end
  end
  cnt.size.times do |j|
    if cnt[j] > 0.0
      means[j]["x1"] /= cnt[j]
      means[j]["x2"] /= cnt[j]
    else
      means[j]["x1"] = 0.0
      means[j]["x2"] = 0.0
    end
  end
  means
end

In [None]:
def test_22_1()
  dataset = create_cluster_dataset()  
  means1 = init_cluster dataset, 5
  z = assign_cluster(dataset, means1)

  assert_equal 500, dataset.size
  assert_equal 500, z.size

  means2 = calculate_means(z, dataset)
  assert_equal 5, means2.size
  5.times {|i| assert means2.has_key? i}
  
  5.times do |i|
    assert_true((means2[i]["x1"] - means1[i]["x1"]).abs > 1e-5, "Expected means to change")
    assert_true((means2[i]["x2"] - means1[i]["x2"]).abs > 1e-5, "Expected means to change")
  end  

  assert_true(means2.values.all? {|fm| fm["x1"].abs < 10}, "Expected mean of x1 in [-10, 10]")
  assert_true(means2.values.all? {|fm| fm["x2"].abs < 10}, "Expected mean of x2 in [-10, 10]")
end

test_22_1()

## Question 2.3: (5 points)
As we iteratively refine the cluster means, we need a termination criterion. Stop when the average squared distance between successive means is at most $\tau = 0.001$, and plot the convergence. First, let's calculate the distance between two sets of $k$ means, using squared distance. Use the following formula for the termination criterion:

#  $\frac{1}{k} \sum_{j=1}^{k} \left\lVert \mu^{(j)}_t - \mu^{(j)}_{t - 1} \right\rVert^2 \le \tau$
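As a worked instance of this criterion, the sketch below uses two hypothetical sets of means in which each mean moves by exactly one unit along one axis, so the average squared shift is $(1 + 1) / 2 = 1$. The variable names are illustrative only.

```ruby
# Hypothetical toy means at iterations t-1 and t, one hash per cluster.
prev_means = { 0 => { "x1" => 0.0, "x2" => 0.0 }, 1 => { "x1" => 4.0, "x2" => 4.0 } }
curr_means = { 0 => { "x1" => 0.0, "x2" => 1.0 }, 1 => { "x1" => 3.0, "x2" => 4.0 } }
tau = 0.001

# Sum the squared movement of each mean, then average over the clusters.
total = prev_means.sum do |j, mu_prev|
  mu_prev.sum { |name, value| (curr_means[j][name] - value)**2 }
end
avg_sq_shift = total / prev_means.size

puts avg_sq_shift          # (1.0 + 1.0) / 2
puts avg_sq_shift <= tau   # the means moved too far, so keep iterating
```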


In [None]:
def cluster_dist(m0, m1)
  dist = 0.0
  m0.keys.each do |key|
    dist += m0[key].keys.inject(0.0) {|dist1, k| dist1 += (m0[key][k] - m1[key][k]) ** 2 }
  end
  dist / m0.size
end

In [None]:
def test_23_1()
  m0 = {0 => {"x" => 1.0, "y" => 2.0}, 1 => {"x" => 2.0, "y" => 3.0}}
  m1 = {0 => {"x" => 2.0, "y" => 2.0}, 1 => {"x" => 3.0, "y" => 4.0}}

  assert_equal 0.0, cluster_dist(m0, m0)
  assert_equal 0.0, cluster_dist(m1, m1)
  assert_equal 1.5, cluster_dist(m0, m1)
end

test_23_1()

## Question 2.4: (10 Points)

We are now ready to complete the $k$-means clustering algorithm. Given a dataset, the number of clusters desired, $k$, and the termination condition, $\tau$, initialize the means, update the clusters, and refine the estimates until converged. To ensure that the process terminates, stop after 100 iterations. 

When complete, return an array of distances from the current means to the previous means as well as the last vector of means.
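The overall loop can be sketched on one-dimensional data, where every step is easy to check by hand. This is a simplified illustration under stated assumptions, not the notebook's `k_means`: the helper `k_means_1d` is hypothetical, the starting means are passed in rather than drawn at random, and an empty cluster simply keeps its previous mean.

```ruby
# Minimal 1-D k-means sketch (hypothetical helper, not the notebook's k_means).
def k_means_1d(xs, init_means, tau = 0.001, max_iter = 100)
  means = init_means.dup
  dists = []
  max_iter.times do
    # Assignment step: group points by the index of the nearest mean.
    groups = xs.group_by { |x| means.each_index.min_by { |j| (x - means[j])**2 } }
    # Update step: average each group; keep the old mean if a group is empty.
    new_means = means.each_index.map do |j|
      groups[j] ? groups[j].sum / groups[j].size : means[j]
    end
    # Average squared shift of the means, as in the termination criterion.
    dist = means.zip(new_means).sum { |a, b| (a - b)**2 } / means.size
    dists << dist
    means = new_means
    break if dist <= tau
  end
  [dists, means]
end

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
dists, means = k_means_1d(xs, [0.0, 5.0])
p dists   # one distance per iteration
p means   # final means
```

On this toy input the means move from `[0.0, 5.0]` to `[1.5, 9.0]` and then to `[2.0, 11.0]`, after which they stop changing and the loop ends on its third iteration.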

In [None]:
def k_means(data, k, tau = 0.001)
  dists = []
  curr_means = init_cluster(data, k)
  100.times.each do
    prev_means = curr_means
    z = assign_cluster(data, prev_means)
    curr_means = calculate_means(z, data)
    dist = cluster_dist(prev_means, curr_means)
    dists << dist
    break if dist < tau
  end
  z = assign_cluster(data, curr_means)
  return [dists, curr_means, z]
end

=begin
dataset = create_cluster_dataset()
dists, means, z = k_means dataset, 3, 0.001
puts dists
plot_clusters(dataset)
=end

In [None]:
def test_24_1()
  dataset = create_cluster_dataset()    
  dists, means, z = k_means dataset, 3, 0.001
  
  assert_true(dists.size > 3)
  assert_true(dists[1] < dists[0])
  assert_true(dists[-1] < dists[1])
  assert_true(dists.last <= 0.001)
  iters = Array.new(dists.size) {|i| i }
  df = Daru::DataFrame.new({iters: iters, dists: dists})
  df.plot(type: :line, x: :iters, y: :dists) do |plot, diagram|
    plot.x_label "X"
    plot.y_label "Mean Dist"
    diagram.title "Cluster Convergence"
    plot.legend false
  end
end

test_24_1()

## Question 2.5 (15 points)
Plot the clusters that result for $k = 3, 5, 6, 10, 20$ clusters.

In [None]:
def test_25_1(k)
  dataset = create_cluster_dataset()    
  dists, means, z = k_means dataset, k, 0.001
  
  assert_equal(k, means.size)
  assert_true(dists.size > 3)
  assert_true(dists[1] < dists[0])
  assert_true(dists[-1] < dists[1])
  assert_true(dists.last <= 0.001)
  iters = Array.new(dists.size) {|i| i }
  plot_clusters(dataset)
end

test_25_1(3)

In [None]:
test_25_1(5)

In [None]:
test_25_1(6)

In [None]:
test_25_1(10)

In [None]:
test_25_1(20)

## Question 2.6 (5 points)

Answer some questions about clustering. 

1. Which value of $k$ fits the data best (visually)?
 * (A) $k = 3$
 * (B) $k = 5$
 * (C) $k = 6$
 * (D) $k = 10$
 * (E) $k = 20$
1. When you set $k$ to a higher value than the ideal number of clusters what happens?
 * (A) Algorithm terminates because ideal number of clusters is a required argument
 * (B) All clusters except the ideal number are set to zero
 * (C) Clusters are split into smaller pieces
 * (D) Clusters are regularized out
1. Go back and try running the clusterings again. You may see that the obtained clusters differ between runs. Why? Is $k$-means sensitive to the initialization?
 * (A) No, it is because the data is being randomly re-generated each time
 * (B) Yes, but only when $k$ is more than the ideal number of clusters
 * (C) No, $k$-means is a convex optimization problem and therefore insensitive to starting point
 * (D) Yes, it is sensitive to the initially guessed means
1. As discussed in the lecture, which of the following captures the biggest difference between $k$-means and the EM algorithm for Gaussian mixture models (with unit variance)?
 * (A) In the EM-algorithm, the $z$ vector is a probability and not just 0 or 1
 * (B) In the EM-algorithm, the means are selected from a normal distribution rather than a uniform distribution
 * (C) The K-means algorithm has no E step or M step, but EM has both 
 * (D) No difference, the algorithms are the same
1. Which datasets are well-suited to $k$-means clustering?
 * (A) All datasets, $k$-means is a low-bias algorithm
 * (B) High-dimensional, sparse vectors with multi-modal normal distributions
 * (C) Low-dimensional, dense vectors with multi-modal normal distributions
 * (D) Only toy datasets like those in this assignment
 
 
If your answers were all A, A, A, A, A, then write the following:

```ruby
def answer_26()
  answers = ["A", "A", "A", "A", "A"]
  return answers
end
```

In [None]:
def answer_26()
  ["B", "C", "D", "A", "C"]
end

In [None]:
t_answers = answer_26()

assert_not_nil t_answers, "1"
assert_true(t_answers.is_a?(Array))
assert_equal(5, t_answers.size)
assert_true(t_answers.all? {|a| a.size == 1 and a =~ /[A-Z]/})
