# CS6140 Assignments

**Instructions**
1. In each assignment cell, look for the block:
 ```
  #BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
 ```
1. Replace this block with your solution.
1. Test your solution by running the cells following your block (indicated by `### TEST ###`).
1. Click the "Validate" button above to validate the work.

**Notes**
* You may add other cells and functions as needed
* Keep all code in the same notebook
* In order to receive credit, code must "Validate" on the JupyterHub server

---

# Assignment 5: Linear Models (2)


In this exercise, we will transform our gradient descent algorithm into stochastic gradient descent. We will then implement linear regression and logistic regression. Finally, we will run these models on a real-world dataset.

In [124]:
require './assignment_lib'

false

## Question 1.1 (10 points)

Transform your batch gradient descent into a stochastic gradient descent algorithm. Using a class here allows the SGD algorithm to maintain state, such as learning rate and weights. Test your algorithm on the coin dataset and Binomial model you coded in Assignment 3. Plot the likelihood measured for each batch of 100 examples on one pass of the dataset.

We will implement mini-batch SGD. Therefore you will not have to alter your Binomial Model. 

Stochastic gradient descent requires a learning rate that decreases with each iteration. Use the following learning rate:

## $\eta = \frac{\eta_{0}}{\sqrt{t}}$

where $\eta_{0}$ is the initial learning rate and $t$ is the number of mini-batch iterations.
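
As a minimal sketch of this schedule (the helper name `decayed_learning_rate` is hypothetical and not part of `assignment_lib`), the step size shrinks with the square root of the mini-batch iteration count, assuming $t$ starts at 1:

```ruby
# Hypothetical helper: learning rate for mini-batch iteration t (t starts at 1).
def decayed_learning_rate(eta0, t)
  eta0 / Math.sqrt(t)
end

# Print the step size at a few iteration counts to see how quickly it decays.
[1, 4, 100].each do |t|
  puts "t=#{t} eta=#{decayed_learning_rate(0.25, t)}"
end
```

Starting $t$ at 1 avoids a division by zero on the first update.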

In [125]:
class StochasticGradientDescent
  attr_reader :weights
  attr_reader :objective
  def initialize obj, w_0, lr = 0.01
    @objective = obj
    @weights = w_0
    @n = 1.0
    @lr = lr
  end
  def update x
    # BEGIN YOUR CODE
    # Gradient of the objective on this mini-batch
    g = @objective.grad(x, @weights)
    # Decaying step size: eta_0 / sqrt(t)
    learning_rate = @lr / Math.sqrt(@n)
    @weights.each_key do |k|
      @weights[k] -= g[k] * learning_rate
    end
    @n += 1.0
    #END YOUR CODE
  end
end

:update

In [126]:
### TEST ###
# Testing on a known objective
class ParabolaObjective
  def func x, w
    0.5 * ((w["0"] - 1) ** 2.0 + (w["1"] - 2) ** 2.0)
  end
  def grad x, w
    dw = {"0" => (w["0"] - 1), "1" => (w["1"] - 2)}
  end
  def adjust w
  end
end

t1_w = {"0" => 0.0, "1" => 0.0}
t1_obj = ParabolaObjective.new
t1_sgd = StochasticGradientDescent.new t1_obj, t1_w, 0.25
t1_lik = 1.0
1000.times do 
  t1_sgd.update([])
  t1_lik = t1_obj.func([], t1_sgd.weights)
end

assert_true(t1_lik < 0.1, "SGD converges with simple objective")
assert_in_delta 1.0, t1_sgd.weights["0"], 0.1, "Weight 0 expected to be 1.0"
assert_in_delta 2.0, t1_sgd.weights["1"], 0.1, "Weight 1 expected to be 2.0"

t1_w

{"0"=>0.9999998479990008, "1"=>1.9999996959980015}

## Begin Question 2.1 (10 points)

Implement linear regression as an objective function for use with stochastic gradient descent. First, we will implement the predict function. For a weight vector $w$ and a single ```Row``` with features $x$, implement:

### $f(w,x) = w^T x$

Note that you already did this in [Assignment 4](../assignment-4/assignment-4.ipynb).
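
Because rows and weights are stored as hashes keyed by feature name, $w^T x$ only needs to visit the features present in the row. The following is a minimal sketch, assuming that a feature missing from the weight hash contributes zero; the name `sparse_dot` is illustrative and not part of `assignment_lib`.

```ruby
# Illustrative sparse dot product over feature-name => value hashes.
# Features that have no entry in w are treated as having weight 0.0.
def sparse_dot(features, w)
  features.sum(0.0) { |name, value| value * w.fetch(name, 0.0) }
end

puts sparse_dot({"a" => 2.0}, {"a" => 3.0, "b" => 4.0})
```

Using `fetch` with a default reads the weight without inserting new keys into a hash that was built with a default block.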

In [127]:
class LinearRegressionModel  
  def predict row, w
    # BEGIN YOUR CODE
    sum = 0.0
    
    if !(row["features"].empty? or w.empty?)
      row["features"].each do |k, v|
          if w.has_key?(k)
              sum += v * w[k]
          end
      end
    end  
    
    return sum
    #END YOUR CODE
  end
end

:predict

In [128]:
### TEST ###
t2_lr = LinearRegressionModel.new

assert_in_delta 6.0, t2_lr.predict({"features" => {"a" => 2.0}}, {"a" => 3.0}), 1e-6
assert_in_delta 6.0, t2_lr.predict({"features" => {"a" => 2.0}}, {"a" => 3.0, "b" => 4.0}), 1e-6
assert_equal 0.0, t2_lr.predict({"features" => {}}, {})
assert_equal 0.0, t2_lr.predict({"features" => {"a" => 1.0}}, {"a" => 0.0, "b" => 1.0})

## Begin Question 2.2 (10 points)

Continuing the implementation, implement the $L_2$ loss, which applies to a mini-batch of $n$ points. Use the ```predict``` function you implemented earlier.

### $L(w,X) = \frac{1}{n} \sum_{i} \frac{1}{2} \left(f(w,x_i) - y_i\right) ^ 2$


In [129]:
class LinearRegressionModel
  def func data, w
    # BEGIN YOUR CODE
    l = 0.0
    for x in data
      l += 0.5 * (predict(x, w) - x["label"]) ** 2
    end
    
    return l / data.length
    #END YOUR CODE
  end
  ## Adjusts the parameter to be within the allowable range
  def adjust w
  end
end

:adjust

In [130]:
t22_data = coin_dataset(10)
t22_data

{"classes"=>{}, "features"=>["x"], "data"=>[{"features"=>{"bias"=>1.0}, "label"=>1.0}, {"features"=>{"bias"=>1.0}, "label"=>0.0}, {"features"=>{"bias"=>1.0}, "label"=>0.0}, {"features"=>{"bias"=>1.0}, "label"=>1.0}, {"features"=>{"bias"=>1.0}, "label"=>1.0}, {"features"=>{"bias"=>1.0}, "label"=>1.0}, {"features"=>{"bias"=>1.0}, "label"=>1.0}, {"features"=>{"bias"=>1.0}, "label"=>1.0}, {"features"=>{"bias"=>1.0}, "label"=>1.0}, {"features"=>{"bias"=>1.0}, "label"=>1.0}]}

In [131]:
### TEST ###
t22_data = coin_dataset(1000)

t22_model = LinearRegressionModel.new
t22_w = Hash.new
t22_w["bias"] = 0.1

t22_f = t22_model.func t22_data["data"], t22_w
assert_in_delta 0.300, t22_f, 0.050, "Expected loss within [0.25, 0.35]"

t22_w["bias"] = 0.77
t22_f = t22_model.func t22_data["data"], t22_w
assert_in_delta 0.090, t22_f, 0.050, "Expected loss for a closer guess to be within [0.04, 0.14]"

## Begin Question 2.3 (10 points)

Continuing the implementation, now implement the gradient function. This returns a gradient value for the mini-batch of $n$ points.
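
Differentiating the $L_2$ loss from Question 2.2 with respect to a single weight $w_j$, and writing $x_{ij}$ for feature $j$ of example $i$ (notation introduced here for illustration), gives one gradient entry per feature:

### $\frac{\partial L}{\partial w_j} = \frac{1}{n} \sum_{i} \left(f(w,x_i) - y_i\right) x_{ij}$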


In [132]:
class LinearRegressionModel
  def grad data, w
    # BEGIN YOUR CODE
    dw = Hash.new{|h, k| h[k] = 0.0}
    
    data[0]["features"].each do |k, v|
      sum = 0.0
      data.each do |x|
        if x["features"][k] != nil
          sum += (predict(x, w) - x["label"]) * x["features"][k]
        end
      end
      dw[k] = sum / data.size
    end
    return dw
    #END YOUR CODE
  end
end

:grad

In [133]:
### TEST ###

t23_data = coin_dataset(1000)
t23_model = LinearRegressionModel.new
t23_w = Hash.new
t23_w["bias"] = 0.1

t23_g = t23_model.grad t23_data["data"], t23_w
assert_in_delta -0.69, t23_g["bias"], 0.2, "Expected gradient within [-0.89, -0.49]"

t23_w["bias"] = 0.77
t23_g = t23_model.grad t23_data["data"], t23_w
assert_in_delta 0.0, t23_g["bias"], 0.1, "Expected gradient for a better guess to be within [-0.1, 0.1]"

## Question 2.4 (10 points)

Putting the previous steps together, use your ```StochasticGradientDescent``` to run linear regression for 10 passes (epochs) over the Coin Dataset, each pass with a mini-batch of size 20. Tune the learning rate, ```lr```, so that the model converges well. Assume that ```obj``` is an instance of ```LinearRegressionModel```, and ```w``` is an initial weight vector.

Track the number of batches in the ```iters``` array and the loss in the ```losses``` array.
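
One pass (epoch) visits every example once, split into mini-batches. The sketch below shows one possible way to iterate like this, assuming the data is a plain array of rows. `each_minibatch` is a hypothetical helper, not part of `assignment_lib`, and its `seed` parameter exists only to make the shuffle reproducible.

```ruby
# Hypothetical helper: yields shuffled mini-batches for a number of epochs
# and returns the total number of batches that were produced.
def each_minibatch(rows, batch_size, epochs, seed: 42)
  rng = Random.new(seed)
  batch_index = 0
  epochs.times do
    # Reshuffle on every epoch so batches differ between passes.
    rows.shuffle(random: rng).each_slice(batch_size) do |batch|
      yield batch, batch_index
      batch_index += 1
    end
  end
  batch_index
end

rows = (1..100).to_a
sizes = []
total = each_minibatch(rows, 20, 10) { |batch, _i| sizes << batch.size }
puts "#{total} updates, batch sizes: #{sizes.uniq}"
```

Inside the block, a call such as `sgd.update batch` followed by recording the loss would produce one entry in `iters` and `losses` per mini-batch.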

In [134]:
def train_coin_sgd(obj, w, dataset)
  i = 0
  iters = []
  losses = []
  
  #Define sgd = StochasticGradientDescent.new obj, w, lr
  # You set the learning rate, lr
  # BEGIN YOUR CODE
  sgd = StochasticGradientDescent.new obj, w, 0.4
  
  (0..9).each do |i|
    sgd.update dataset["data"].sample(20)
    iters << i
    losses << obj.func(dataset["data"], sgd.weights)
  end
  #END YOUR CODE
  return [sgd, iters, losses]
end

:train_coin_sgd

In [135]:
### TEST ###
t24_data = coin_dataset(1000)
t24_model = LinearRegressionModel.new
t24_w = Hash.new
t24_w["bias"] = 0.1

t24_trainer, t24_iter, t24_losses = train_coin_sgd t24_model, t24_w, t24_data

assert_true t24_w.has_key?("bias")
assert_in_delta 0.77, t24_w["bias"], 0.1, "Expected weight for 'bias'  [0.67, 0.87]"
t24_cum_loss = 0.0
t24_losses.each_index {|i| t24_cum_loss += t24_losses[i]; t24_losses[i] = t24_cum_loss / (t24_iter[i] + 1)}
Daru::DataFrame.new({x: t24_iter, y: t24_losses}).plot(type: :line, x: :x, y: :y) do |plot, diagram|
  plot.x_label "Batches"
  plot.y_label "Cumulative Loss"
end

## Question 3.1 (10 points)

Implement Logistic Regression, following much the same process as with linear regression. The prediction function returns a value:

### $f(x,w) = \frac {1}{1 + \exp \left( -w^T x \right) } $
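
As a quick sanity check of this formula, the standalone sketch below (the name `sigmoid` is illustrative) evaluates it on a few raw scores $w^T x$. A zero score maps to 0.5, and large positive or negative scores saturate towards 1 and 0.

```ruby
# Illustrative logistic (sigmoid) function applied to a raw score z = w^T x.
def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

[-100.0, -1.0, 0.0, 1.0, 100.0].each do |z|
  puts "sigmoid(#{z}) = #{sigmoid(z)}"
end
```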

In [65]:
class LogisticRegressionModel
  def predict row, w
    # BEGIN YOUR CODE
    sum = 0.0
    
    if !(row["features"].empty? or w.empty?)
      row["features"].each do |k, v|
          if w.has_key?(k)
              sum += v * w[k]
          end
      end
    end 
    
    return 1 / (1 + Math.exp(-sum))
    #END YOUR CODE
  end
  
  def adjust w
    w
  end
end

:adjust

In [66]:
### TEST ###
t31_model = LogisticRegressionModel.new
def t31_f(a: 0.0, b: 0.0)
  row = {"features" => {"a" => a, "b" => b}}
end
def t31_w(a: 0.0, b: 0.0)
  w = {"a" => a, "b" => b}
end
assert_in_delta 0.5, t31_model.predict(t31_f(), t31_w()), 1e-6
assert_in_delta 0.2689, t31_model.predict(t31_f(a:1), t31_w(a:-1)), 1e-3
assert_in_delta 1.0, t31_model.predict(t31_f(a:1, b:1000), t31_w(a:-1, b: 0.1)), 1e-3

## Question 3.2 (10 points)

Implement log loss assuming that the y label is defined as: $y \in \left\{-1, 1\right\}$. Remember that the mini-batch loss is the mean of the per-example loss over the $n$ examples in the mini-batch.
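
With labels in $\{-1, 1\}$, a common way to write the mini-batch log loss is $\frac{1}{n} \sum_{i} \log\left(1 + \exp\left(-y_i w^T x_i\right)\right)$. The sketch below evaluates that expression directly on raw scores $w^T x_i$. The function name `log_loss_pm1` and the plain score and label arrays are assumptions made for illustration, not the interface used in this assignment.

```ruby
# Illustrative mini-batch log loss for labels y in {-1, 1}.
# scores holds the raw values w^T x_i, labels the matching y_i.
def log_loss_pm1(scores, labels)
  total = scores.zip(labels).sum(0.0) do |score, y|
    Math.log(1.0 + Math.exp(-y * score))
  end
  total / scores.length
end

# A zero score is a coin flip for either label, so the loss is log(2).
puts log_loss_pm1([0.0, 0.0], [1, -1])
```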

In [121]:
class LogisticRegressionModel
  def func data, w
    # BEGIN YOUR CODE
    sum = 0.0
    for x in data
      # Map the stored {0, 1} label to y in {-1, 1}.
      y = x["label"] == 0 ? -1 : 1
      prob = predict(x, w)
      # Likelihood of the observed label: prob when y = 1, (1 - prob) when y = -1.
      sum += Math.log(y == 1 ? prob : 1.0 - prob)
    end
    
    return - sum / data.length
    #END YOUR CODE
  end
  
end

:func

In [122]:
### TEST ###
t32_data = coin_dataset(1000)
t32_model = LogisticRegressionModel.new
t32_w = Hash.new {|h,k| h[k] = 0.1}
assert_in_delta 0.66, t32_model.func(t32_data["data"], t32_w), 0.2, "Expected LR.func in [0.46, 0.86]"

## Question 3.3 (10 points)

Calculate the gradient of the mini-batch log loss for each parameter $w$. This time, assume that $y \in \left\{0, 1\right\}$. Hint: This assumption should simplify the calculation.
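
With $y \in \{0, 1\}$, the mini-batch log loss is the cross-entropy $-\frac{1}{n} \sum_{i} \left[ y_i \log f(x_i,w) + (1 - y_i) \log\left(1 - f(x_i,w)\right) \right]$. Because the derivative of the sigmoid is $f(1 - f)$, the gradient with respect to a single weight $w_j$ collapses to the same form as in linear regression (here $x_{ij}$ denotes feature $j$ of example $i$, notation introduced for illustration):

### $\frac{\partial L}{\partial w_j} = \frac{1}{n} \sum_{i} \left(f(x_i,w) - y_i\right) x_{ij}$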

In [82]:
class LogisticRegressionModel
  def grad data, w
    # BEGIN YOUR CODE
    dw = Hash.new {|h, k| h[k] = 0.0}
    data[0]["features"].each do |k, v|
      w[k]
      sum = 0.0
      data.each do |x|
        if x["features"][k] != nil
          sum += (predict(x, w) - x["label"]) * x["features"][k]
        end
      end
      dw[k] = sum / data.length
    end
    return dw    
    #END YOUR CODE
  end
end

:grad

In [83]:
### TEST ###
t32_data = coin_dataset(1000)
t32_model = LogisticRegressionModel.new
t32_w = Hash.new {|h,k| h[k] = 0.1}
assert_in_delta 0.66, t32_model.func(t32_data["data"], t32_w), 0.2, "Expected LR.func in [0.46, 0.86]"
t32_g = t32_model.grad t32_data["data"], t32_w
assert_in_delta -0.26, t32_g["bias"], 0.1, "Expected LR.grad in [-0.36, -0.16]"

t32_w = Hash.new {|h,k| h[k] = 0.778}
t32_g = t32_model.grad t32_data["data"], t32_w
assert_in_delta -0.1, t32_g["bias"], 0.1, "Expected LR.grad for a closer value to be in [-0.2, 0.0]"



## Question 4.1 (6 points)

Let's train our new models on a familiar dataset, spambase. Let's run gradient descent for a few steps on this dataset. Observe that the learned weights after just 2 gradient steps are very large.

In [73]:
### Example ###
#Preview 2 lines from the Spambase dataset
spambase = read_sparse_data_from_csv "spambase"
spambase["data"].each {|r| r["features"]["bias"] = 1.0}
puts spambase["data"][0,2]

q41_model = LinearRegressionModel.new
q41_w = Hash.new {|h,k| h[k] = 0.0}
q41_w["bias"] = 1
q41_sgd = StochasticGradientDescent.new q41_model, q41_w, 0.1
2.times do
  q41_batch = spambase["data"].sample(10)
  q41_sgd.update q41_batch
end
puts q41_w

[{"features"=>{"word_freq_our"=>0.27, "word_freq_mail"=>0.83, "word_freq_you"=>0.27, "word_freq_your"=>0.27, "word_freq_font"=>8.58, "char_freq_["=>0.092, "char_freq_$"=>0.185, "char_freq_#"=>0.232, "capital_run_length_average"=>7.313, "capital_run_length_longest"=>99.0, "capital_run_length_total"=>607.0, "bias"=>1.0}, "label"=>1.0}, {"features"=>{"word_freq_your"=>0.9, "word_freq_george"=>0.9, "word_freq_data"=>0.9, "char_freq_["=>0.14, "capital_run_length_average"=>3.472, "capital_run_length_longest"=>28.0, "capital_run_length_total"=>125.0, "bias"=>1.0}, "label"=>0.0}]
{"bias"=>0.9149999999999999}


## Question 4.1 (Continued) 
We can correct this by _normalizing_ the data. A popular normalization is the z-score. Create a new dataset, ```zspambase```, that is identical to ```spambase``` except that each feature other than bias has been normalized, considering only its non-zero values, as follows:

### $x_z = \frac{x - \mu}{\sigma}$

where $\mu$ is the mean of the feature's values and $\sigma$ is their standard deviation. Note that you have already seen an implementation of ```mean``` and ```stdev```, so find it and add it here.
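
A minimal sketch of the transformation on a plain array of values, assuming the sample standard deviation (dividing by $n - 1$). The helper name `zscores` is hypothetical, and the per-feature bookkeeping needed for the sparse dataset is left out.

```ruby
# Hypothetical helper: z-score normalization of a plain array of floats,
# using the sample standard deviation (n - 1 in the denominator).
def zscores(values)
  n = values.length
  return values.dup if n < 2 # standard deviation is undefined for one value
  mu = values.sum(0.0) / n
  sigma = Math.sqrt(values.sum(0.0) { |v| (v - mu)**2 } / (n - 1))
  return values.dup if sigma.zero? # constant feature: leave unchanged
  values.map { |v| (v - mu) / sigma }
end

p zscores([2.0, 4.0, 6.0])
```

Leaving constant values unchanged mirrors how a feature such as bias, whose standard deviation is zero, can be passed through without dividing by zero.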

In [74]:
## Add mean and stdev here

# BEGIN YOUR CODE
def mean x
  u = Hash.new {|h, k| h[k] = 0.0}
  scale = Hash.new {|h, k| h[k] = 0.0}
  
  x.each do |i|
    i["features"].each do |k, v|
      if i["features"].has_key? k
        u[k] += i["features"][k]
        scale[k] += 1
      end
    end
  end
  u.each do |k, v|
    u[k] /= scale[k]
  end
  return u
end

def stdev x
  u = mean x
  d = Hash.new {|h, k| h[k] = 0.0}
  scale = Hash.new {|h, k| h[k] = 0.0}
  x.each do |i|
    i["features"].each do |k, v|
      if i["features"].has_key? k
        d[k] += (i["features"][k] - u[k]) ** 2.0
        scale[k] += 1
      end
    end
  end
  d.each do |k, v|
    d[k] = Math.sqrt(d[k] / (scale[k] - 1))
  end
  return d
end
#END YOUR CODE

:stdev

In [75]:
def create_zspambase spambase
  zspambase = spambase.clone
  zspambase["data"] = spambase["data"].collect do |r|
    u = r.clone
    u["features"] = r["features"].clone
    u
  end

  
  # BEGIN YOUR CODE
  u = mean zspambase["data"]
  d = stdev zspambase["data"]
  zspambase["data"].each do |x|
    x["features"].each do |k, v|
      x["features"][k] = d[k] == 0.0 ? x["features"][k] : (x["features"][k] - u[k]) / d[k]
    end
  end
  
  #END YOUR CODE
  return zspambase
end

zspambase = create_zspambase spambase
zspambase["data"].first

{"features"=>{"word_freq_our"=>-0.628106690674003, "word_freq_mail"=>-0.0163998685249916, "word_freq_you"=>-1.2509960473524198, "word_freq_your"=>-0.9962817981732773, "word_freq_font"=>0.8660048920660688, "char_freq_["=>-0.19095268670254528, "char_freq_$"=>-0.162631899108401, "char_freq_#"=>-0.038069985077358884, "capital_run_length_average"=>0.06686163123580875, "capital_run_length_longest"=>0.24027347134946417, "capital_run_length_total"=>0.533869650358239, "bias"=>1.0}, "label"=>1.0}

In [76]:
### TEST ###
t41_zs = create_zspambase spambase

assert_in_delta 0.27, spambase["data"].first["features"]["word_freq_our"], 1e-5
assert_in_delta -0.628106690674003, t41_zs["data"].first["features"]["word_freq_our"], 1e-5

assert_in_delta 607.0, spambase["data"].first["features"]["capital_run_length_total"], 1e-5
assert_in_delta 0.53386, t41_zs["data"].first["features"]["capital_run_length_total"], 1e-5

## Question 4.2 (7 points)

Train Linear Regression for the ```zspambase``` dataset. Tune the learning rate as needed to train in one epoch. Hint: Learning rate may need to be very small. 

In [115]:
def train_zspambase_sgd(obj, w, dataset)
  i = 0
  iters = []
  losses = []
  
  #Define sgd = StochasticGradientDescent.new obj, w, lr
  # You set the learning rate, lr
  # BEGIN YOUR CODE
  sgd = StochasticGradientDescent.new obj, w, 0.5
  
  (0..9).each do |k|
    sgd.update dataset["data"].sample(20)
    iters << k
    losses << obj.func(dataset["data"], sgd.weights)
  end
  #END YOUR CODE
  return [sgd, iters, losses]
end

:train_zspambase_sgd

In [116]:
### TEST ###
t25_model = LinearRegressionModel.new
t25_w = Hash.new {|h,k| h[k] = 0.0}
t25_w["bias"] = 1

t25_trainer, t25_iter, t25_losses = train_zspambase_sgd t25_model, t25_w, zspambase
puts t25_w

t25_cum_loss = 0.0
t25_losses.each_index {|i| t25_cum_loss += t25_losses[i]; t25_losses[i] = t25_cum_loss / (t25_iter[i] + 1)}
assert_true (t25_losses.last < 0.15), "Expected last loss value less than target"
Daru::DataFrame.new({x: t25_iter, y: t25_losses}).plot(type: :line, x: :x, y: :y) do |plot, diagram|
  plot.x_label "Batches"
  plot.y_label "Cumulative Loss"
end

{"bias"=>0.3743909345473643}


## Question 4.3 (7 points)

Run logistic regression on the ```zspambase``` dataset, tuning the learning rate.

In [119]:
def train_zspambase_logistic_sgd(obj, w, dataset)
  i = 0
  iters = []
  losses = []
  
  #Define sgd = StochasticGradientDescent.new obj, w, lr
  # You set the learning rate, lr
  # BEGIN YOUR CODE
  sgd = StochasticGradientDescent.new obj, w, 0.5
  
  (0..99).each do |i|
    sgd.update dataset["data"].sample(50)
    iters << i
    losses << obj.func(dataset["data"], sgd.weights)
  end
  #END YOUR CODE
  return [sgd, iters, losses]
end

:train_zspambase_logistic_sgd

In [123]:
### TEST ###
t43_model = LogisticRegressionModel.new
t43_w = Hash.new {|h,k| h[k] = 0.0}
t43_w["bias"] = 1

t43_trainer, t43_iter, t43_losses = train_zspambase_logistic_sgd t43_model, t43_w, zspambase
t43_cum_loss = 0.0
t43_losses.each_index {|i| t43_cum_loss += t43_losses[i]; t43_losses[i] = t43_cum_loss / (t43_iter[i] + 1)}
puts t43_w

assert_true(t43_losses.last < 0.6, "Expected last loss value < 0.6")
Daru::DataFrame.new({x: t43_iter, y: t43_losses}).plot(type: :line, x: :x, y: :y) do |plot, diagram|
  plot.x_label "Batches"
  plot.y_label "Cumulative Loss"
end

{"bias"=>-0.27718350520703305, "word_freq_all"=>-0.00825717977690197, "word_freq_our"=>0.0038195363457372263, "word_freq_mail"=>-0.006752715852515531, "word_freq_free"=>0.029566266530159407, "word_freq_you"=>0.1969722523351173, "word_freq_your"=>0.15768691118613098, "word_freq_re"=>-0.07643760301093347, "char_freq_("=>-0.21292087482596736, "char_freq_!"=>0.121391909708252, "capital_run_length_average"=>0.23073763473073491, "capital_run_length_longest"=>0.5293089355522629, "capital_run_length_total"=>0.5039286631700391, "word_freq_make"=>0.004616897228902001, "word_freq_address"=>-0.014152160845365315, "word_freq_order"=>-0.00800524710507445, "word_freq_will"=>-0.14371136094800735, "word_freq_people"=>0.011647267885007653, "word_freq_credit"=>0.0031859462529840224, "word_freq_money"=>0.005639142473809912, "word_freq_lab"=>-0.003700330380636649, "word_freq_meeting"=>0.0013884275251976326, "char_freq_;"=>-0.020236139660108093, "char_freq_$"=>0.05919116379412423, "char_freq_#"=>0.007588846