# CS6140 Assignments

**Instructions**
1. In each assignment cell, look for the block:
 ```
  #BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
 ```
1. Replace this block with your solution.
1. Test your solution by running the cells following your block (indicated by ##TEST##)
1. Click the "Validate" button above to validate the work.

**Notes**
* You may add other cells and functions as needed
* Keep all code in the same notebook
* In order to receive credit, code must "Validate" on the JupyterHub server

---

# Assignment 8: Clustering

In [118]:
require './assignment_lib'

false

## Question 1.1: K-Means (25 points)

Generate a two-dimensional dataset with 5 clusters, where cluster $j$ has mean $\mu_j$ and standard deviation $\sigma_j$ (equal to the variance here, since every $\sigma_j = 1$) defined as follows:
* $\mu_1 = (-4,4)$, $\sigma_1 = 1$
* $\mu_2 = (-4,-4)$, $\sigma_2 = 1$
* $\mu_3 = (4,4)$, $\sigma_3 = 1$
* $\mu_4 = (4,-4)$, $\sigma_4 = 1$
* $\mu_5 = (0,0)$, $\sigma_5 = 1$

Name your features ```x1``` and ```x2```.

Generate 100 points in each cluster and plot each cluster with a different color. Use our usual row format: store the cluster id in the ```label``` field and also record the cluster in a field called ```cluster```.

To generate a value from a normal distribution, use the following:

```ruby
r = Distribution::Normal.rng(mean,stdev)
x = r.call()
```
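For intuition about what the generator above returns, here is a minimal sketch that draws normal samples with the Box-Muller transform in plain Ruby. The helper `normal_sample` is a hypothetical stand-in and is not part of the `Distribution` gem or `assignment_lib`; the assignment itself should keep using `Distribution::Normal.rng`.

```ruby
# Hypothetical stand-in for Distribution::Normal.rng, using the Box-Muller transform.
def normal_sample(mean, stdev, rng = Random.new)
  u1 = 1.0 - rng.rand   # lies in (0, 1], so Math.log(u1) is finite
  u2 = rng.rand
  z = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math::PI * u2)
  mean + stdev * z      # shift and scale the standard normal draw
end

rng = Random.new(42)    # fixed seed so repeated runs give the same points
point = { "x1" => normal_sample(-4, 1, rng), "x2" => normal_sample(4, 1, rng) }
p point                 # one sample near the mean (-4, 4)
```

Each call produces one coordinate, so a two-dimensional point needs one draw per feature, as in the `point` hash above.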

In [165]:
def create_cluster_dataset()
  n = 100
  clusters = [[-4,4], [-4,-4], [4,4], [4,-4], [0,0]]
  dataset = []
  # BEGIN YOUR CODE
  clusters.each_with_index do |(mu1, mu2), cluster_label|
    n.times do
      dataset << {
        "cluster" => {"x1" => mu1, "x2" => mu2},  # the true mean identifies the cluster
        "features" => {
          "x1" => Distribution::Normal.rng(mu1, 1).call(),
          "x2" => Distribution::Normal.rng(mu2, 1).call()
        },
        "label" => cluster_label
      }
    end
  end
  #END YOUR CODE
  
  return dataset
end

:create_cluster_dataset

In [166]:
def test_11_1()
  dataset = create_cluster_dataset()
  assert_false(dataset.empty?)
  assert_equal 500, dataset.size
  plot_clusters(dataset)
end
test_11_1()

In [167]:
def test_11_2()
  dataset = create_cluster_dataset()
  counts = dataset.group_by {|x| x["cluster"]}
  counts.each_key {|k| counts[k] = counts[k].size}
  assert_equal 5, counts.size
  assert counts.values.all? {|v| v == 100}
  
  counts = dataset.group_by {|x| x["label"]}
  counts.each_key {|k| counts[k] = counts[k].size}
  assert_equal 5, counts.size
  assert counts.values.all? {|v| v == 100}

end
test_11_2()

In [168]:
def test_11_3()
  dataset = create_cluster_dataset()
  clusters = dataset.group_by {|x| x["label"]}
  means = [[-4,4], [-4,-4], [4,4], [4,-4], [0,0]]
  5.times do |i|
    assert_not_nil clusters[i]
    assert_equal 100, clusters[i].size
    m1 = clusters[i].inject(0.0) {|u,r| u += r["features"]["x1"]} / 100.0
    m2 = clusters[i].inject(0.0) {|u,r| u += r["features"]["x2"]} / 100.0
    assert_in_delta m1, means[i][0], 0.5
    assert_in_delta m2, means[i][1], 0.5    
  end
end
test_11_3()

5

## Question 1.2 (25 points)

Generate $k = 5$ random means, with each coordinate lying between the minimum and maximum observed value of the corresponding feature.
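One way to read this requirement is to sample each coordinate uniformly inside the bounding box of the data. The sketch below does that on a toy dataset; the helper name `random_means_in_range` and the toy rows are assumptions made for illustration. Picking $k$ existing data points at random is another initialization that also stays inside the box.

```ruby
# Hypothetical helper: draw k means uniformly inside the per-feature [min, max] box.
def random_means_in_range(data, k, rng = Random.new)
  features = data.first["features"].keys
  # Per-feature [min, max] over all rows.
  bounds = features.map do |f|
    [f, data.map { |row| row["features"][f] }.minmax]
  end.to_h
  (0...k).map do |i|
    mean = features.map do |f|
      lo, hi = bounds[f]
      [f, lo + rng.rand * (hi - lo)]   # uniform draw in [lo, hi)
    end.to_h
    [i, mean]
  end.to_h
end

toy = [
  { "features" => { "x1" => -1.0, "x2" => 2.0 } },
  { "features" => { "x1" => 3.0, "x2" => -5.0 } },
  { "features" => { "x1" => 0.5, "x2" => 0.0 } }
]
p random_means_in_range(toy, 5, Random.new(7))
```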

In [193]:
def init_cluster data, k
  means = Hash.new {|h,k| h[k] = Hash.new {|h,k| h[k] = 0.0}}
  # BEGIN YOUR CODE
  k.times do |i|
    mean = Hash.new
    r = rand(data.size)
    mean["x1"] = data[r]["features"]["x1"]
    mean["x2"] = data[r]["features"]["x2"]
    means[i] = mean
  end
  #END YOUR CODE
  means
end

:init_cluster

In [194]:
def test_12_1()
  dataset = create_cluster_dataset()  
  means = init_cluster dataset, 5
  assert_equal 5, means.size
  assert_equal 2, means.first.size
  5.times {|i| assert means.has_key? i}
  assert_true(means.values.all? {|m| m["x1"].abs < 10}, "Expected mean of x1 in [-10, 10]")
  assert_true(means.values.all? {|m| m["x2"].abs < 10}, "Expected mean of x2 in [-10, 10]")
end

test_12_1()

## Question 2.1: (10 points)
We will now implement the k-means algorithm to recover the clusters above. Return the means and plot the obtained clustering. Compare your discovered means to the known means.

Starting from a randomly initialized set of clusters defined by a set of means, we will iteratively refine our estimate of the clustering. At each step, we assign every point to its nearest mean and then recompute the means from the assigned clustering. We denote candidate cluster membership by a binary indicator: $z_{i,j} = 1$ if example $i$ belongs to cluster $j$, and $z_{i,j} = 0$ otherwise. Use an ```Array``` for each row of $z$.

Start by setting $z$: for each point in the dataset, find the closest mean. Also set the candidate cluster in the ```cluster``` field of each row.
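As a small worked example of a single row of $z$, the sketch below finds the nearest of three means for one point and builds its one-hot row. The helper names `euclidean` and `one_hot_nearest` are hypothetical, and the sketch assumes the means are keyed by the integers $0, \dots, k-1$.

```ruby
# Euclidean distance between two feature hashes that share the same keys.
def euclidean(a, b)
  Math.sqrt(a.keys.sum { |key| (a[key] - b[key])**2 })
end

# Hypothetical helper: the one-hot row of z for a single point.
def one_hot_nearest(point, means)
  nearest = means.keys.min_by { |j| euclidean(point, means[j]) }
  row = Array.new(means.size, 0.0)
  row[nearest] = 1.0   # floats, so the row holds 1.0 and 0.0 rather than integers
  row
end

means = {
  0 => { "x1" => -4.0, "x2" => 4.0 },
  1 => { "x1" => 0.0, "x2" => 0.0 },
  2 => { "x1" => 4.0, "x2" => -4.0 }
}
p one_hot_nearest({ "x1" => 0.5, "x2" => -0.3 }, means)  # prints [0.0, 1.0, 0.0]
```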

In [195]:
# BEGIN YOUR CODE
def dot x, w
  # Sum the products over the keys that x and w share.
  sum = 0.0
  x.each do |k, v|
    sum += v * w[k] if w.has_key?(k)
  end
  return sum
end

def norm w
  return Math.sqrt(dot(w, w))
end

def distance x1, x2
  diff = {}
  x1.each_key do |key|
    diff[key] = x1[key] - x2[key]
  end
  return norm(diff)
end
#END YOUR CODE

:distance

In [202]:
def assign_cluster(data, means)
  indices = Array.new(means.size) {|i| i}
  z = Array.new
  # BEGIN YOUR CODE
  data.each do |row|
    ztemp = Array.new(means.size, 0.0)
    # Pick the index of the mean closest to this row's features.
    min_index = indices.min_by {|i| distance(means[i], row["features"])}
    ztemp[min_index] = 1.0
    row["cluster"] = min_index
    z << ztemp
  end
  #END YOUR CODE
  
  return z
end


:assign_cluster

In [218]:
dataset = create_cluster_dataset()

[{"cluster"=>{"x1"=>-4, "x2"=>4}, "features"=>{"x1"=>-4.412073016850647, "x2"=>5.211339492227374}, "label"=>0}, {"cluster"=>{"x1"=>-4, "x2"=>4}, "features"=>{"x1"=>-3.1576167838711253, "x2"=>6.035831492456445}, "label"=>0}, {"cluster"=>{"x1"=>-4, "x2"=>4}, "features"=>{"x1"=>-4.509599658386966, "x2"=>4.717013857277409}, "label"=>0}, {"cluster"=>{"x1"=>-4, "x2"=>4}, "features"=>{"x1"=>-4.601735897566019, "x2"=>4.403719517778803}, "label"=>0}, {"cluster"=>{"x1"=>-4, "x2"=>4}, "features"=>{"x1"=>-4.899118755751846, "x2"=>4.4750530259392916}, "label"=>0}, {"cluster"=>{"x1"=>-4, "x2"=>4}, "features"=>{"x1"=>-6.057276067198501, "x2"=>4.245337375536641}, "label"=>0}, {"cluster"=>{"x1"=>-4, "x2"=>4}, "features"=>{"x1"=>-4.025793000029852, "x2"=>3.297912158963042}, "label"=>0}, {"cluster"=>{"x1"=>-4, "x2"=>4}, "features"=>{"x1"=>-4.706428558125531, "x2"=>3.152402694020036}, "label"=>0}, {"cluster"=>{"x1"=>-4, "x2"=>4}, "features"=>{"x1"=>-5.1558522693563065, "x2"=>4.155220272913846}, "label"=>0

In [203]:
def test_21_1()
  dataset = create_cluster_dataset()  
  means = init_cluster dataset, 5
  z = assign_cluster(dataset, means)
  assert_equal 500, z.size
  assert_true(z.all? {|zi| zi.size == 5}, "Must set a value for each cluster")
  assert_true(z.all? {|zi| g = zi.group_by {|zij| zij}; g[0.0].size == 4 and g[1.0].size == 1}, 
    "Must set only one cluster to 1.0 and all others to 0.0 (not an integer)")
  
  plot_clusters(dataset)
end
test_21_1()

## Question 2.2: (5 points)
Given the $z$ vector and the dataset, calculate the means for each cluster determined by $z$. 
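In other words, each mean is the average of the points assigned to it: $\mu^{(j)} = \sum_i z_{i,j} x_i / \sum_i z_{i,j}$. The sketch below works this out on three toy points; the helper name `means_from_assignments` is hypothetical, and leaving an empty cluster as an empty hash is a choice made for this sketch only.

```ruby
# Hypothetical helper: average the feature vectors assigned to each cluster.
def means_from_assignments(z, points)
  k = z.first.size
  sums = Array.new(k) { Hash.new(0.0) }
  counts = Array.new(k, 0)
  z.each_with_index do |row, i|
    j = row.index(1.0)   # the single cluster this point belongs to
    counts[j] += 1
    points[i].each { |key, value| sums[j][key] += value }
  end
  sums.each_with_index.map do |sum, j|
    # An empty cluster has nothing to average; this sketch leaves it as {}.
    [j, sum.transform_values { |v| v / counts[j] }]
  end.to_h
end

points = [
  { "x1" => 1.0, "x2" => 1.0 },
  { "x1" => 3.0, "x2" => 5.0 },
  { "x1" => -2.0, "x2" => 0.0 }
]
z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
p means_from_assignments(z, points)
```

Here the first two points share cluster 0, so its mean is their midpoint $(2, 3)$, while cluster 1 contains only the third point.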


In [204]:
def calculate_means z, data
  means = Hash.new {|h,k| h[k] = Hash.new {|h,k| h[k] = 0.0}}
  # BEGIN YOUR CODE
  sizes = Hash.new {|h,k| h[k] = 0}
  z[0].size.times do |i|
    sizes[i] = 0
    means[i]
  end
  
  n = z.size
  n.times do |row|
    z[row].size.times do |k1|
      if z[row][k1] == 1.0
        sizes[k1] += 1
        data[row]["features"].each_key do |k2|
          means[k1][k2] += data[row]["features"][k2]
        end
        break
      end
    end
  end

  means.each_key do |k1|
    means[k1].each_key do |k2|
      means[k1][k2] = sizes[k1] == 0.0 ? 0.0 : means[k1][k2] / sizes[k1]
    end
  end
  #END YOUR CODE
  means
end

:calculate_means

In [205]:
def test_22_1()
  dataset = create_cluster_dataset()  
  means1 = init_cluster dataset, 5
  z = assign_cluster(dataset, means1)

  assert_equal 500, dataset.size
  assert_equal 500, z.size

  means2 = calculate_means(z, dataset)
  assert_equal 5, means2.size
  5.times {|i| assert means2.has_key? i}
  
  5.times do |i|
    assert_true((means2[i]["x1"] - means1[i]["x1"]).abs > 1e-5, "Expected means to change")
    assert_true((means2[i]["x2"] - means1[i]["x2"]).abs > 1e-5, "Expected means to change")
  end  

  assert_true(means2.values.all? {|fm| fm["x1"].abs < 10}, "Expected mean of x1 in [-10, 10]")
  assert_true(means2.values.all? {|fm| fm["x2"].abs < 10}, "Expected mean of x2 in [-10, 10]")
end

test_22_1()

## Question 2.3: (5 points)
As we iteratively refine the cluster means, we need a termination criterion. Stop when the mean squared distance between successive sets of means is at most $\tau = 0.001$, and plot the convergence. First, let's calculate the distance between two sets of $k$ means, using squared distance. Use the following formula for the termination criterion:

#  $\frac{1}{k} \sum_{j=1}^{k} \left\lVert \mu^{(j)}_t - \mu^{(j)}_{t - 1} \right\rVert^2 \le \tau$

Note: You may copy **your** ```dot``` and / or ```norm``` from previous assignments.
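To make the criterion concrete, the sketch below evaluates it for two small sets of means. The helper name `mean_squared_shift` is hypothetical; it is one possible way to compute the quantity in the formula above.

```ruby
# Hypothetical helper: mean squared shift between two sets of means.
def mean_squared_shift(old_means, new_means)
  total = old_means.keys.sum do |j|
    old_means[j].keys.sum { |f| (new_means[j][f] - old_means[j][f])**2 }
  end
  total / old_means.size
end

before = { 0 => { "x" => 1.0, "y" => 2.0 }, 1 => { "x" => 2.0, "y" => 3.0 } }
after  = { 0 => { "x" => 2.0, "y" => 2.0 }, 1 => { "x" => 3.0, "y" => 4.0 } }

shift = mean_squared_shift(before, after)  # (1.0 + 2.0) / 2
puts shift                                 # prints 1.5
puts shift <= 0.001                        # prints false, so iteration would continue
```

Cluster 0 moves by a squared distance of 1 and cluster 1 by 2, so the average over the two clusters is 1.5.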

In [207]:
def cluster_dist(m0, m1)
  dist = 0.0
  # BEGIN YOUR CODE
  m0.each_key do |k1|
    diff = {}
    m0[k1].each_key do |k2|
      diff[k2] = m0[k1][k2] - m1[k1][k2]
    end
    dist += dot(diff, diff)
  end
  dist /= m0.size
  #END YOUR CODE
  dist
end

:cluster_dist

In [208]:
def test_23_1()
  m0 = {0 => {"x" => 1.0, "y" => 2.0}, 1 => {"x" => 2.0, "y" => 3.0}}
  m1 = {0 => {"x" => 2.0, "y" => 2.0}, 1 => {"x" => 3.0, "y" => 4.0}}

  assert_equal 0.0, cluster_dist(m0, m0)
  assert_equal 0.0, cluster_dist(m1, m1)
  assert_equal 1.5, cluster_dist(m0, m1)
end

test_23_1()

## Question 2.4: (10 Points)

We are now ready to complete the $k$-means clustering algorithm. Given a dataset, the number of clusters desired, $k$, and the termination condition, $\tau$, initialize the means, update the clusters, and refine the estimates until converged. To ensure that the process terminates, stop after 100 iterations. 

When complete, return the array of per-iteration distances between the current and previous means, along with the final set of means.
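The overall loop can be sketched on one-dimensional data, where a mean is just a number. This is an illustration under simplifying assumptions, not the assignment solution: the function name `k_means_1d` is hypothetical, the initial means are passed in rather than drawn at random, and a mean that receives no points keeps its old value.

```ruby
# Hypothetical 1-D version of the loop: assign, update, measure the shift, repeat.
def k_means_1d(points, means, tau = 0.001, max_iter = 100)
  dists = []
  max_iter.times do
    # Assignment step: group each point under the index of its nearest mean.
    groups = points.group_by { |x| means.each_index.min_by { |j| (x - means[j]).abs } }
    # Update step: average each group; a mean with no points keeps its old value.
    updated = means.each_index.map do |j|
      groups[j] ? groups[j].sum / groups[j].size : means[j]
    end
    # Mean squared shift between the previous and the updated means.
    dist = means.zip(updated).sum { |prev, curr| (curr - prev)**2 } / means.size
    dists << dist
    means = updated
    break if dist <= tau
  end
  [dists, means]
end

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
dists, means = k_means_1d(points, [0.0, 6.0])
p dists   # one distance per iteration, ending at or below tau
p means   # final means, close to 1.0 and 5.0
```

The iteration cap guarantees termination even if the shift never drops below $\tau$.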

In [209]:
def k_means(data, k, tau = 0.001)
  dists = []
  last_means = []
  
  # BEGIN YOUR CODE
  new_means = init_cluster data, k
  dist = 1.0
  iter = 0
  while dist > tau && iter < 100
    prev_means = new_means
    z = assign_cluster(data, prev_means)
    new_means = calculate_means(z, data)
    puts iter
    puts prev_means
    dist = cluster_dist prev_means, new_means
    iter += 1
    dists << dist
  end
  last_means = new_means
  #END YOUR CODE
  return [dists, last_means]
end

:k_means

In [210]:
def test_24_1()
  dataset = create_cluster_dataset()    
  dists, means, z = k_means dataset, 3, 0.001
  
  assert_true(dists.size > 3)
  assert_true(dists[1] < dists[0])
  assert_true(dists[-1] < dists[1])
  assert_true(dists.last <= 0.001)
  iters = Array.new(dists.size) {|i| i }
  df = Daru::DataFrame.new({iters: iters, dists: dists})
  df.plot(type: :line, x: :iters, y: :dists) do |plot, diagram|
    plot.x_label "X"
    plot.y_label "Mean Dist"
    diagram.title "Cluster Convergence"
    plot.legend false
  end
end

test_24_1()

0
{0=>{"x1"=>4.124900556758339, "x2"=>4.6764725743051345}, 1=>{"x1"=>-3.0425165334481936, "x2"=>3.2851779947949034}, 2=>{"x1"=>-4.06216890702722, "x2"=>4.510711322052257}}
1
{0=>{"x1"=>3.929523845115996, "x2"=>0.1816987769122627}, 1=>{"x1"=>-2.253145987435811, "x2"=>-1.4133710852543935}, 2=>{"x1"=>-4.249660501147336, "x2"=>4.488678212396841}}
2
{0=>{"x1"=>3.699091298402275, "x2"=>0.10460146833999008}, 1=>{"x1"=>-2.405908209284093, "x2"=>-2.494303931297648}, 2=>{"x1"=>-3.846805811085277, "x2"=>4.040145748178567}}
3
{0=>{"x1"=>3.430375784922024, "x2"=>0.15507497762648018}, 1=>{"x1"=>-2.7345324398564044, "x2"=>-2.9038811457802836}, 2=>{"x1"=>-3.846805811085277, "x2"=>4.040145748178567}}
4
{0=>{"x1"=>3.1613061843246086, "x2"=>0.1351728541021377}, 1=>{"x1"=>-3.203454067352366, "x2"=>-3.4262388094737743}, 2=>{"x1"=>-3.7997472024150736, "x2"=>3.98312091815771}}
5
{0=>{"x1"=>3.0524776116323364, "x2"=>0.07693507499713409}, 1=>{"x1"=>-3.5048156620625592, "x2"=>-3.628801651946653}, 2=>{"x1"=>-3.7

## Question 2.5 (15 points)
Plot the clusters that result for $k = 3, 5, 6, 10, 20$ clusters.

In [211]:
def test_25_1(k)
  dataset = create_cluster_dataset()    
  dists, means, z = k_means dataset, k, 0.001
  
  assert_equal(k, means.size)
  assert_true(dists.size > 3)
  assert_true(dists[1] < dists[0])
  assert_true(dists[-1] < dists[1])
  assert_true(dists.last <= 0.001)
  iters = Array.new(dists.size) {|i| i }
  plot_clusters(dataset)
end

test_25_1(3)

0
{0=>{"x1"=>-3.5766292691009998, "x2"=>4.5763667591006945}, 1=>{"x1"=>-5.052550715750846, "x2"=>3.2737663950270317}, 2=>{"x1"=>-2.225461638799869, "x2"=>3.0946877043664047}}
1
{0=>{"x1"=>-3.685049418730976, "x2"=>4.758460352736729}, 1=>{"x1"=>-4.886276008524548, "x2"=>-0.9342883650559446}, 2=>{"x1"=>1.5437408711517238, "x2"=>-0.3290571908926119}}
2
{0=>{"x1"=>-3.8007752217977933, "x2"=>4.082892044822337}, 1=>{"x1"=>-4.039241896635792, "x2"=>-3.8110582983399177}, 2=>{"x1"=>2.7050430648219312, "x2"=>0.11654486747336001}}
3
{0=>{"x1"=>-3.6880197029859514, "x2"=>3.7808866578316294}, 1=>{"x1"=>-3.908502583015562, "x2"=>-3.9671930978720202}, 2=>{"x1"=>2.865168556887235, "x2"=>0.16421903146372852}}
4
{0=>{"x1"=>-3.611484055017853, "x2"=>3.7174958106952265}, 1=>{"x1"=>-3.883571574469321, "x2"=>-3.9377852446086616}, 2=>{"x1"=>2.9194220446881936, "x2"=>0.15404730594746524}}
5
{0=>{"x1"=>-3.537704239176637, "x2"=>3.6512139346217904}, 1=>{"x1"=>-3.883571574469321, "x2"=>-3.9377852446086616}, 2=>{

In [212]:
test_25_1(5)

0
{0=>{"x1"=>-0.20917984055284652, "x2"=>0.3441526952759967}, 1=>{"x1"=>2.386631336965817, "x2"=>-2.2866441487189864}, 2=>{"x1"=>-4.144620920501949, "x2"=>-5.649293116823236}, 3=>{"x1"=>-5.370723924226901, "x2"=>-3.9564695736534405}, 4=>{"x1"=>5.546291386277738, "x2"=>-5.13438340075556}}
1
{0=>{"x1"=>-0.19796140704089374, "x2"=>2.6998148407605793}, 1=>{"x1"=>3.5493971907356747, "x2"=>-2.220527700390472}, 2=>{"x1"=>-3.456471337379625, "x2"=>-4.819923301465558}, 3=>{"x1"=>-4.415080901144033, "x2"=>-3.4802335230668664}, 4=>{"x1"=>4.657985489042814, "x2"=>-4.518203833525729}}
2
{0=>{"x1"=>-0.1648625509870402, "x2"=>2.959917009377616}, 1=>{"x1"=>2.710005590024175, "x2"=>-1.4736456665673785}, 2=>{"x1"=>-3.498535457395109, "x2"=>-4.8082289753530985}, 3=>{"x1"=>-4.238560591242112, "x2"=>-3.356997516332064}, 4=>{"x1"=>4.2679042389742365, "x2"=>-4.460957266959933}}
3
{0=>{"x1"=>-0.42071849153554963, "x2"=>3.317629310413128}, 1=>{"x1"=>1.9644747083971674, "x2"=>-0.16926936103816498}, 2=>{"x1"=>-3

In [213]:
test_25_1(6)

0
{0=>{"x1"=>-4.369001311778636, "x2"=>3.707203785055157}, 1=>{"x1"=>3.730817580521359, "x2"=>-3.4328531407733958}, 2=>{"x1"=>-2.3752234089145765, "x2"=>2.892439281658608}, 3=>{"x1"=>-1.8489922709633668, "x2"=>-1.6754049144195904}, 4=>{"x1"=>4.625889901266327, "x2"=>4.497127712022118}, 5=>{"x1"=>-4.89009367827307, "x2"=>-4.8362512288572}}
1
{0=>{"x1"=>-4.330467885171058, "x2"=>4.094636306540258}, 1=>{"x1"=>3.9162975968553195, "x2"=>-3.9926770200411936}, 2=>{"x1"=>-1.1792229978160083, "x2"=>2.180539758099589}, 3=>{"x1"=>-0.646938991262859, "x2"=>-0.8165597525765628}, 4=>{"x1"=>3.898424767101891, "x2"=>4.247545221930598}, 5=>{"x1"=>-4.069363200836577, "x2"=>-4.293270850471718}}
2
{0=>{"x1"=>-4.205865290232925, "x2"=>4.086338260440303}, 1=>{"x1"=>3.980271700525074, "x2"=>-4.071764344781989}, 2=>{"x1"=>-0.7849828339577833, "x2"=>1.6364646289723601}, 3=>{"x1"=>-0.04473870137171363, "x2"=>-0.3441480966771488}, 4=>{"x1"=>3.898424767101891, "x2"=>4.247545221930598}, 5=>{"x1"=>-3.85588298071703

In [214]:
test_25_1(10)

0
{0=>{"x1"=>3.4469132668408005, "x2"=>-2.479815090672111}, 1=>{"x1"=>-4.191798151387328, "x2"=>-4.629382321926476}, 2=>{"x1"=>-3.7113258655464727, "x2"=>4.505209687934864}, 3=>{"x1"=>2.5676663010481944, "x2"=>-3.749823062480103}, 4=>{"x1"=>-0.02275426953616432, "x2"=>-0.5678362000247236}, 5=>{"x1"=>-1.3678076011451794, "x2"=>0.603041831501569}, 6=>{"x1"=>-0.2190744743345376, "x2"=>2.1862424511815144}, 7=>{"x1"=>-4.383122895622853, "x2"=>4.585818351432744}, 8=>{"x1"=>-3.4871499167853255, "x2"=>-5.405267888215393}, 9=>{"x1"=>-3.3360989070703564, "x2"=>-5.111738933989188}}
1
{0=>{"x1"=>4.590371717834694, "x2"=>-2.868310817612231}, 1=>{"x1"=>-4.275555922238289, "x2"=>-3.5059090663522605}, 2=>{"x1"=>-3.3680200787640477, "x2"=>3.931026876701406}, 3=>{"x1"=>3.488282738291284, "x2"=>-4.385636710874018}, 4=>{"x1"=>0.3437848388700962, "x2"=>-0.4037727566985961}, 5=>{"x1"=>-1.4763416605788362, "x2"=>0.43502842641803735}, 6=>{"x1"=>3.4948786942861885, "x2"=>3.6985477391477266}, 7=>{"x1"=>-4.88717

In [215]:
test_25_1(20)

0
{0=>{"x1"=>-4.16537369604053, "x2"=>-3.2620322934381316}, 1=>{"x1"=>5.286765062793952, "x2"=>3.4857863319987152}, 2=>{"x1"=>-4.325805850793248, "x2"=>-4.387012543786328}, 3=>{"x1"=>-4.679024563193555, "x2"=>3.878291310900046}, 4=>{"x1"=>-2.974944478756817, "x2"=>5.19007054515002}, 5=>{"x1"=>5.28052427875174, "x2"=>-5.556677905264561}, 6=>{"x1"=>3.2896504071678265, "x2"=>1.3359340885302657}, 7=>{"x1"=>-3.008994133928869, "x2"=>-3.2282137280516716}, 8=>{"x1"=>-3.0251324067691447, "x2"=>3.380846715964555}, 9=>{"x1"=>3.924144397365165, "x2"=>2.4938169847895777}, 10=>{"x1"=>-3.74948784157067, "x2"=>-4.915286979624367}, 11=>{"x1"=>4.422917844485707, "x2"=>-4.008069140974779}, 12=>{"x1"=>-4.939251276299518, "x2"=>4.217085733773752}, 13=>{"x1"=>3.7944189821493537, "x2"=>-3.653460786520079}, 14=>{"x1"=>-4.97692289130424, "x2"=>-6.136766721606604}, 15=>{"x1"=>-1.9899098492636977, "x2"=>-0.9552925662740601}, 16=>{"x1"=>-4.091549903935637, "x2"=>3.6330930075143963}, 17=>{"x1"=>4.802820479768797,

8
{0=>{"x1"=>-4.665308710396735, "x2"=>-2.886121151401098}, 1=>{"x1"=>4.898444485805906, "x2"=>3.3939229166910847}, 2=>{"x1"=>-4.185000335851326, "x2"=>-4.0981528775783405}, 3=>{"x1"=>-4.888237482617204, "x2"=>3.733363186400388}, 4=>{"x1"=>-3.654035174703752, "x2"=>4.930227720688538}, 5=>{"x1"=>3.4042612844753837, "x2"=>-5.16602711840765}, 6=>{"x1"=>1.0230033553810325, "x2"=>0.040373204210337965}, 7=>{"x1"=>-2.646321007565658, "x2"=>-4.056787411179097}, 8=>{"x1"=>-2.7928048174011284, "x2"=>4.182023463373315}, 9=>{"x1"=>2.9642423926265566, "x2"=>3.659476676492356}, 10=>{"x1"=>-3.1183833536923986, "x2"=>-5.35443772558929}, 11=>{"x1"=>4.702648215201827, "x2"=>-3.3338737501185576}, 12=>{"x1"=>-5.518478673002411, "x2"=>4.837284448094886}, 13=>{"x1"=>3.074641169514652, "x2"=>-3.3524543222316043}, 14=>{"x1"=>-4.950772788801951, "x2"=>-5.371170039042984}, 15=>{"x1"=>-0.6160583491559707, "x2"=>-1.138187970506299}, 16=>{"x1"=>-3.9623575998158, "x2"=>3.0684918337524447}, 17=>{"x1"=>4.025622504490

## Question 2.6 (5 points)

Answer some questions about clustering. 

1. Which number of clusters fits the data best (visually)?
 * (A) $k = 3$
 * (B) $k = 5$
 * (C) $k = 6$
 * (D) $k = 10$
 * (E) $k = 20$
1. When you set $k$ to a higher value than the ideal number of clusters what happens?
 * (A) Algorithm terminates because ideal number of clusters is a required argument
 * (B) All clusters except the ideal number are set to zero
 * (C) Clusters are split into smaller pieces
 * (D) Clusters are regularized out
1. Go back and try running the clusterings again. You may see that the obtained clusters differ between runs. Why? Is $k$-means sensitive to the initialization?
 * (A) No, it is because the data is being randomly re-generated each time
 * (B) Yes, but only when $k$ is more than the ideal number of clusters
 * (C) No, $k$-means is a convex optimization problem and therefore insensitive to starting point
 * (D) Yes, it is sensitive to the initially guessed means
1. As discussed in the lecture, which of the following captures the biggest difference between $k$-means and the EM-algorithm for Gaussian mixture models (with unit variance)?
 * (A) No difference, the algorithms are the same
 * (B) In the EM-algorithm, the $z$ vector is a probability and not just 0 or 1
 * (C) In the EM-algorithm, the means are selected from a normal distribution rather than a uniform distribution
 * (D) The K-means algorithm has no E step or M step, but EM has both 
1. Which datasets are well-suited to $k$-means clustering?
 * (A) Low-dimensional, dense vectors with multi-modal normal distributions
 * (B) All datasets, $k$-means is a low-bias algorithm
 * (C) High-dimensional, sparse vectors with multi-modal normal distributions
 * (D) Only toy datasets like those in this assignment
 
 
If your answers were all A, A, A, A, A, then write the following:

```ruby
def answer_26()
  answers = ["A", "A", "A", "A", "A"]
  return answers
end
```

In [216]:
def answer_26()
  # BEGIN YOUR CODE
  answers = ["B", "C", "D", "B", "A"]
  #END YOUR CODE
  return answers
end

:answer_26

In [217]:
t_answers = answer_26()

assert_not_nil t_answers, "1"
assert_true(t_answers.is_a?(Array))
assert_equal(5, t_answers.size)
assert_true(t_answers.all? {|a| a.size == 1 and a =~ /[A-Z]/})
