# CS6140 Assignments

**Instructions**
1. In each assignment cell, look for the block:
 ```
  #BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
 ```
1. Replace this block with your solution.
1. Test your solution by running the cells following your block (indicated by ##TEST##)
1. Click the "Validate" button above to validate the work.

**Notes**
* You may add other cells and functions as needed
* Keep all code in the same notebook
* In order to receive credit, code must "Validate" on the JupyterHub server

---

# Final Project: Part 3 - Classifiers

In [Part 2](../part-2.ipynb) we implemented basic features and calculated their Information Gain. Now we will use these features to train some models. Your assignment will be graded based on performance on a **test** database with the same schema, which is not provided to you. The Validate step applies to the **training** and **dev** databases. 

As you discover new features, go back to [Part 2](../part-2.ipynb) and add them to demonstrate good information gain. Note the following requirements:

1. Focus on data preparation and on creating, normalizing, and understanding new features.
1. Use your own implementations of the models.
1. Avoid creating new models that we did not cover in the class. There is no credit for fancy models.
1. Do not use the target label, or anything that is derived based on the training label as features. 
1. You may talk to other students about your solution, but do not share code. 

Also, here are some hints:

* Use sampling when training on the **training** databases. There is no requirement to use everything.
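
As a rough illustration of the sampling hint, the sketch below draws a fixed-size random subset of rows before training. The `sample_rows` helper, the seed, and the toy rows are assumptions made for this example; only `Array#sample` with a `random:` generator is standard Ruby.

```ruby
# Hypothetical helper: draw up to n rows without replacement.
# A seeded generator keeps the subset reproducible between runs.
def sample_rows(dataset, n, seed = 42)
  dataset.sample([n, dataset.size].min, random: Random.new(seed))
end

# Toy rows in the same shape the notebook uses (id, label, features).
rows = (1..1000).map do |i|
  label = (i % 10 == 0) ? 1 : 0
  { "id" => i, "label" => label, "features" => { "x" => i.to_f } }
end

subset = sample_rows(rows, 100)
puts subset.size
```

Fixing the seed is a design choice: it makes a training run repeatable while you compare features, at the cost of always seeing the same subset.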

In [10]:
require './assignment_lib'
dir = "/home/dataset"

$train_db = SQLite3::Database.new "#{dir}/credit_risk_data_train.db", results_as_hash: true, readonly: true
$dev_db = SQLite3::Database.new "#{dir}/credit_risk_data_dev.db", results_as_hash: true, readonly: true

#<SQLite3::Database:0x00000000036a0ec0 @tracefunc=nil, @authorizer=nil, @encoding=nil, @busy_handler=nil, @collations={}, @functions={}, @results_as_hash=true, @type_translation=nil, @readonly=true>

## Question 1.1 (10 Points)

Implement a random classifier, one in which the score is a random number. Verify that the AUC is approximately 0.5 on the training, dev, and test sets.

Each training method should create features as needed from the provided database. The result must comply with the following format:

```ruby
predictions = Hash.new
predictions[12345] = score    
```

Note that the ID used as the key is an integer, so the output looks like this:

```ruby
predictions = {
    12345 => 0.9
}
```
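
The format above, together with the AUC check for a random scorer, can be sketched as follows. This is a minimal illustration and not the grader's `roc_curve`: the `rank_auc` helper, the synthetic IDs, and the 8% positive rate are assumptions, and the rank formula ignores tied scores.

```ruby
# Rank-based AUC (Mann-Whitney U statistic). Assumes no tied scores.
def rank_auc(predictions, labels)
  pos = labels.values.count(1)
  neg = labels.size - pos
  rank_sum = 0.0
  predictions.sort_by { |_id, score| score }.each_with_index do |(id, _score), i|
    rank_sum += i + 1 if labels[id] == 1
  end
  (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)
end

rng = Random.new(7)
ids = (100_001..120_000).to_a                     # integer IDs, as the format requires
labels = ids.map { |id| [id, rng.rand < 0.08 ? 1 : 0] }.to_h

predictions = Hash.new
ids.each { |id| predictions[id] = rng.rand }      # the score is just a random number

auc = rank_auc(predictions, labels)
puts auc.round(3)
```

With this many rows the printed value should land close to 0.5, which is why the notebook's test accepts anything between 0.45 and 0.55.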


In [11]:
def mean data
  res = Hash.new
  count = Hash.new
  
  data.each do |record|
    record["features"].each do |key, value|
      if !res.key?(key)
        res[key] = 0.0
      end
      if !count.key?(key)
        count[key] = 0.0
      end
      res[key] += value.to_f
      count[key] += 1
    end
  end
  res.each do |key, value|
    res[key] = value / count[key]
  end
end

:mean

In [12]:
def stdev data, mean
  res = Hash.new
  count = Hash.new
  
  data.each do |record|
    record["features"].each do |key, value|
      if !res.key?(key)
        res[key] = 0.0
      end
      if !count.key?(key)
        count[key] = 0.0
      end
      res[key] += (value - mean[key]) ** 2
      count[key] += 1
    end
  end
  res.each do |key, value|
    if count[key] > 1
      res[key] = Math.sqrt(value / (count[key] - 1))
    else
      res[key] = Math.sqrt(value)
    end
  end
    res
end

:stdev

In [13]:
def normalize dataset
  dataset_clone = dataset.collect do |r|
    u = r.clone
    u["id"] = r["id"].clone
    u["label"] = r["label"].clone
    u["features"] = r["features"].clone
    u
  end
  # BEGIN YOUR CODE
  zsp_data = dataset_clone
  zsp_mean = mean zsp_data
  zsp_stdev = stdev zsp_data, zsp_mean
  zsp_data.each do |record|
    record["features"].each do |key, value|
      if zsp_stdev[key] <= 0.0
        record["features"][key] = 0.0
      else
        record["features"][key] = (value - zsp_mean[key]) / zsp_stdev[key]
      end
    end
  end
  
  
  #END YOUR CODE
  return dataset_clone
end

:normalize
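
To make the z-score step in `normalize` concrete, here is a small sketch on a single feature column. The `zscore` helper is hypothetical and works on a plain array rather than on the notebook's record hashes; it uses the sample standard deviation (dividing by n - 1), as `stdev` does.

```ruby
# Z-score one feature column: (value - mean) / sample standard deviation.
def zscore(values)
  m = values.sum / values.size.to_f
  var = values.sum { |v| (v - m)**2 } / (values.size - 1)
  sd = Math.sqrt(var)
  # A constant column has zero spread; map it to 0.0 as normalize does.
  values.map { |v| sd > 0.0 ? (v - m) / sd : 0.0 }
end

p zscore([1.0, 2.0, 3.0])   # => [-1.0, 0.0, 1.0]
p zscore([5.0, 5.0, 5.0])   # => [0.0, 0.0, 0.0]
```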

In [14]:
def fill_mean dataset
  mean_hash = mean dataset
  dataset.each do |row|
    if row.has_key? ("features")
      row["features"].each_key do |key|
        if row["features"][key] == nil or row["features"][key] == ""
          row["features"][key] = mean_hash[key]
        end
      end
    end
  end
end

:fill_mean

In [15]:
def fill_zero dataset
  dataset.each do |row|
    if row.has_key? ("features")
      row["features"].each_key do |key|
        if row["features"][key] == nil or row["features"][key] == ""
          row["features"][key] = 0
        end
      end
    end
  end
end

:fill_zero

In [16]:
def cal_ig_conti dataset, fname
  dist = class_distribution dataset
  h0 = entropy dist
  res = find_split_point_numeric dataset, h0, fname
  ig = res[1]
  ig
end

:cal_ig_conti

In [17]:
def fill_zero_cate dataset
  dataset.each do |row|
    if row.has_key? ("features")
      row["features"].each_key do |key|
        if row["features"][key] == nil or row["features"][key] == ""
          row["features"][key] = ""
        end
      end
    end
  end
end

:fill_zero_cate

In [18]:
def create_dataset db, sql
  dataset = []
  db.execute(sql) do |row|
    record = Hash.new
    
    if row.key?("SK_ID_CURR")
      record["id"] = row["SK_ID_CURR"]
    end
    if row.key?("TARGET")
      record["label"] = row["TARGET"]
    end
    
    record["features"] = Hash.new
    
    row.keys.each do |key|
      if key != "SK_ID_CURR" && key != "TARGET" && !key.is_number?
        record["features"][key.downcase] = row[key]
      end
    end
    dataset << record
  end
 
  dataset
end

class Object
  def is_number?
    to_f.to_s == to_s || to_i.to_s == to_s
  end
end


:is_number?

In [19]:
def class_distribution dataset
  # BEGIN YOUR CODE
  class_group = dataset.group_by{|row| row["label"]}
  class_dist = Hash.new
  ## size of rows
  total_size = dataset.size
  class_num = class_group.size
  
  class_group.each_key do |key|
    class_dist[key] = (class_group[key].size.to_f / total_size.to_f).to_f
  end
  class_dist
  #END YOUR CODE
end

:class_distribution

In [20]:
def entropy dist
  # BEGIN YOUR CODE
  sum = dist.values.reduce(0.0, :+)
  entropy = 0.0
  dist.values.each do |d|
    if d != 0
      entropy -= (d / sum) * Math.log(d / sum)
    end
  end
  entropy
  #END YOUR CODE
end

:entropy

In [21]:
def information_gain h0, splits
  # BEGIN YOUR CODE
  information_gain = 0.0
  
  total_size = 0.0
  splits.each do |key, value|
    total_size += value.size
  end

  entropy_sum = 0.0
  splits.each do |key, value|
    class_dist = class_distribution value
    class_size = value.size
    class_entropy = entropy class_dist
    entropy_sum -= (class_size.to_f / total_size.to_f) * class_entropy.to_f
  end
  
  information_gain = h0 + entropy_sum
  information_gain
  
  #END YOUR CODE
end

:information_gain
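
The relationship between `entropy` and `information_gain` can be checked by hand on a four-row example. The sketch below is self-contained and uses hypothetical helpers (`label_entropy`, `info_gain`) that take plain label arrays instead of the notebook's datasets; like the notebook, it uses the natural logarithm.

```ruby
# Entropy (natural log) of a list of class labels.
def label_entropy(labels)
  n = labels.size.to_f
  labels.group_by(&:itself).values.sum do |group|
    frac = group.size / n
    -frac * Math.log(frac)
  end
end

# Information gain: parent entropy minus the size-weighted child entropies.
def info_gain(parent, children)
  n = parent.size.to_f
  label_entropy(parent) - children.sum { |c| (c.size / n) * label_entropy(c) }
end

parent = [0, 0, 1, 1]
puts info_gain(parent, [[0, 0], [1, 1]])   # perfect split: gain equals ln 2
puts info_gain(parent, [[0, 1], [0, 1]])   # uninformative split: gain is 0
```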

In [22]:
def find_split_point_numeric x, h0, fname
  # BEGIN YOUR CODE
  t_max = 0.0
  ig_max = 0.0
  
  split_l = Hash.new(0)
  split_r = Hash.new(0)
  
  sorted_x = x.sort_by{|row| row["features"].has_key?(fname) ? row["features"][fname] : 0}
  
  sorted_x.each do |row|
    split_r[row["label"]] += 1
  end
  size = sorted_x.size
  ig_max = 0
  sorted_x.each_with_index do |row, index|
    split_l[row["label"]] += 1
    split_r[row["label"]] -= 1
    # No threshold exists after the last row, so stop before indexing past the end.
    break if index + 1 >= size
    # Only consider a split where the feature value actually changes.
    next if row["features"][fname] == sorted_x[index + 1]["features"][fname]

    p1 = (index + 1.0) / size
    p2 = (size - index - 1.0) / size
    ig = h0 - p1 * entropy(split_l) - p2 * entropy(split_r)
    if ig > ig_max
      ig_max = ig
      t_max = sorted_x[index + 1]["features"][fname]
    end
  end
  return [t_max, ig_max]
  #END YOUR CODE
end

:find_split_point_numeric

In [49]:
######################################Extract 15 features here########################################
def extract_features db
  dataset = []
  
##########################################main join bureau#########################################
  sql_1 = "select A.SK_ID_CURR, target,
       A.EXT_SOURCE_1,
       A.EXT_SOURCE_2,
       A.EXT_SOURCE_3,
       A.AMT_GOODS_PRICE as good_pri,
       A.DAYS_EMPLOYED,
       A.DAYS_BIRTH,
       A.DAYS_ID_PUBLISH as skr, 
       A.DAYS_REGISTRATION as skr_2,
       A.AMT_CREDIT,
       A.AMT_ANNUITY,
       SUM(B.AMT_CREDIT_SUM) as cre_sum,
       AVG(B.AMT_CREDIT_SUM) as cre_avg,  
       AVG(B.AMT_CREDIT_SUM_DEBT) as debt_avg,
       SUM(B.AMT_CREDIT_SUM_DEBT) as debt_sum,
       AVG(B.AMT_ANNUITY) as ann,
       AVG(B.DAYS_CREDIT) as days_cre_avg
       from application_train A 
       left join bureau B on A.SK_ID_CURR = B.SK_ID_CURR 
       group by A.SK_ID_CURR"
  
  dataset_1 = create_dataset db, sql_1
  fill_mean dataset_1
  dataset_1 = normalize dataset_1
  dataset_1.each do |row|
    record = Hash.new
    record["id"] = row["id"].clone
    record["label"] = row["label"].clone
    record["features"] = Hash.new
    row["features"]["remix_1"] = (row["features"]["cre_sum"] / row["features"]["debt_sum"]) * row["features"]["ext_source_2"]
    row["features"]["remix_2"] = (row["features"]["cre_avg"] / row["features"]["ann"]) + 5 * row["features"]["ext_source_2"] ** 2
    
#     row["features"]["remix_3.5"] = row["features"]["days_cre_avg"]
    #0.00476
#     row["features"]["remix_3"] = row["features"]["days_cre_avg"] / (row["features"]["days_birth"] ** 2 *
#     row["features"]["days_employed"] ** 2)
    row["features"]["days_remix"] = row["features"]["days_birth"] ** 2 *
    row["features"]["days_employed"] ** 2 
    row["features"]["remix_3"] = row["features"]["days_cre_avg"] * row["features"]["ext_source_2"] ** 4  / row["features"]["days_remix"] ** 2
    row["features"]["remix_4"] = row["features"]["amt_credit"] * row["features"]["ext_source_3"] / row["features"]["good_pri"] 
    record["features"]["remix_1"] = row["features"]["remix_1"]
    record["features"]["remix_2"] = row["features"]["remix_2"]
    record["features"]["remix_3"] = row["features"]["remix_3"]
    record["features"]["remix_4"] = row["features"]["remix_4"]
    dataset << record   
  end

  
  #######################################extract from main table################################################
  sql_2 = "select target, sk_id_curr, 
  AMT_INCOME_TOTAL,
  AMT_CREDIT,
  AMT_ANNUITY,
  AMT_GOODS_PRICE,
  DAYS_BIRTH,
  DAYS_EMPLOYED,
  ext_source_3, 
  ext_source_2,
  ext_source_1 
  from application_train"
  
  dataset_2 = create_dataset db, sql_2
  fill_mean dataset_2
  dataset_2 = normalize dataset_2
  
  dataset_2.zip(dataset).each do |row, record|
    row["features"]["remix_5"] = 3 * row["features"]["ext_source_1"] + 4 * row["features"]["ext_source_2"] +  3 * row["features"]["ext_source_3"]
    row["features"]["remix_6"] = row["features"]["ext_source_2"] ** 8  / (row["features"]["ext_source_1"] * row["features"]["ext_source_3"])
    row["features"]["remix_7"] = row["features"]["amt_income_total"] * row["features"]["amt_credit"] - 3 * row["features"]["ext_source_3"] 
    row["features"]["remix_8"] = 2 * row["features"]["amt_goods_price"] - row["features"]["amt_credit"] + row["features"]["ext_source_3"]
    row["features"]["remix_9"] = row["features"]["days_birth"] * row["features"]["ext_source_2"] ** 2 / (-365) 
    record["features"]["remix_5"] = row["features"]["remix_5"]
    record["features"]["remix_6"] = row["features"]["remix_6"]
    record["features"]["remix_7"] = row["features"]["remix_7"]
    record["features"]["remix_8"] = row["features"]["remix_8"]
    record["features"]["remix_9"] = row["features"]["remix_9"]
  end
  
  #######################################Categorical Features###########################################
  
  sql_3 = "select target, sk_id_curr,
    name_education_type,
    code_gender,
    flag_own_car,
    flag_own_realty,
    name_contract_type,
    name_type_suite,
    name_income_type,
    name_family_status,
    name_housing_type,
    occupation_type,
    organization_type,
    WEEKDAY_APPR_PROCESS_START
    from application_train"
  
  map_1 = Hash.new
  map_2 = Hash.new
  map_3 = Hash.new
  map_4 = Hash.new
  map_5 = Hash.new
  map_6 = Hash.new
  
  count_1 = 0
  count_2 = 0
  count_3 = 0
  count_4 = 0
  count_5 = 0
  count_6 = 0
  
  dataset_3 = create_dataset db, sql_3
  fill_zero dataset_3
  dataset_3.zip(dataset).each do |row,record|
    row["features"]["remix_10"] = row["features"]["organization_type"].to_s + row["features"]["name_contract_type"].to_s
    if !map_1.key?(row["features"]["remix_10"])
      map_1[row["features"]["remix_10"]] = count_1
      count_1 += 1
    end
    row["features"]["remix_10"] = map_1[row["features"]["remix_10"]]
    
    row["features"]["remix_11"] = row["features"]["name_education_type"].to_s + row["features"]["organization_type"].to_s
    if !map_2.key?(row["features"]["remix_11"])
      map_2[row["features"]["remix_11"]] = count_2
      count_2 += 1
    end
    row["features"]["remix_11"] = map_2[row["features"]["remix_11"]]
    
    row["features"]["remix_12"] = row["features"]["name_family_status"].to_s + row["features"]["occupation_type"].to_s
    
    if !map_3.key?(row["features"]["remix_12"])
      map_3[row["features"]["remix_12"]] = count_3
      count_3 += 1
    end
    row["features"]["remix_12"] = map_3[row["features"]["remix_12"]]
    
    row["features"]["remix_13"] = row["features"]["name_housing_type"].to_s + row["features"]["occupation_type"].to_s
    if !map_4.key?(row["features"]["remix_13"])
      map_4[row["features"]["remix_13"]] = count_4
      count_4 += 1
    end
    row["features"]["remix_13"] = map_4[row["features"]["remix_13"]]  
    
    row["features"]["remix_14"] = row["features"]["flag_own_realty"].to_s + row["features"]["organization_type"].to_s
    if !map_5.key?(row["features"]["remix_14"])
      map_5[row["features"]["remix_14"]] = count_5
      count_5 += 1
    end
    row["features"]["remix_14"] = map_5[row["features"]["remix_14"]]  

    row["features"]["remix_15"] = row["features"]["code_gender"].to_s + row["features"]["organization_type"].to_s
    
    if !map_6.key?(row["features"]["remix_15"])
      map_6[row["features"]["remix_15"]] = count_6
      count_6 += 1
    end
    row["features"]["remix_15"] = map_6[row["features"]["remix_15"]]  
    record["features"]["remix_10"] = row["features"]["remix_10"]
    record["features"]["remix_11"] = row["features"]["remix_11"]
    record["features"]["remix_12"] = row["features"]["remix_12"]
    record["features"]["remix_13"] = row["features"]["remix_13"]
    record["features"]["remix_14"] = row["features"]["remix_14"]
    record["features"]["remix_15"] = row["features"]["remix_15"]
    
  end
  dataset = normalize dataset
  
  
#   dist = class_distribution dataset_3
#   h0 = entropy dist
#   split_1 = dataset_3.group_by {|row| row["features"]["remix_10"]}
#   split_2 = dataset_3.group_by {|row| row["features"]["remix_11"]}
#   split_3 = dataset_3.group_by {|row| row["features"]["remix_12"]}
#   split_4 = dataset_3.group_by {|row| row["features"]["remix_13"]}
#   split_5 = dataset_3.group_by {|row| row["features"]["remix_14"]}
#   split_6 = dataset_3.group_by {|row| row["features"]["remix_15"]}
  
#   ig_10 = information_gain h0, split_1
#   ig_11 = information_gain h0, split_2
#   ig_12 = information_gain h0, split_3
#   ig_13 = information_gain h0, split_4
#   ig_14 = information_gain h0, split_5
#   ig_15 = information_gain h0, split_6
  
#   ig_1 = cal_ig_conti dataset_1, "remix_1"
#   puts "remix_1 : sum of credit sum over sum of debt_sum remix ext_2 "
#   puts ig_1
  
#   ig_2 = cal_ig_conti dataset_1, "remix_2"
#   puts "remix_2 : avg of credit sum over avg of ann remix ext_2"
#   puts ig_2
  
#   ig_3 = cal_ig_conti dataset_1, "remix_3"
#   puts "remix_3 : days credit avg / power of days remix remix ext_2"
#   puts ig_3
  
#   ig_4 = cal_ig_conti dataset_1, "remix_4"
#   puts "remix_4 : amt credit over amt goods price remix ext_3"
#   puts ig_4
  
#   ig_5 = cal_ig_conti dataset_2, "remix_5"
#   puts "remix_5 : linear sum of all ext_source with weights"
#   puts ig_5
  
#   ig_6 = cal_ig_conti dataset_2, "remix_6"
#   puts "remix_6 : nonlinear remix of all ext_source with weights"
#   puts ig_6
#   ig_7 = cal_ig_conti dataset_2, "remix_7"
#   puts "remix_7 : mash up with income total credit and ext_source"
#   puts ig_7
  
#   ig_8 = cal_ig_conti dataset_2, "remix_8"
#   puts "remix_8 : creditdownpayment: AMTGOODPRICE - AMTCREDIT remix ext_source"
#   puts ig_8
  
#   ig_9 = cal_ig_conti dataset_2, "remix_9"
#   puts "remix_9: age int remix ext_source"
#   puts ig_9
  
#   puts "remix_10 : organization_type remix with name contract type"
#   puts ig_10
#   puts "remix_11 : name_education_type remix with organization_type"
#   puts ig_11
#   puts "remix_12 : name_family_status remix with occupation_type"
#   puts ig_12
#   puts "remix_13 : name_housing_type remix with occupation_type"
#   puts ig_13
#   puts "remix_14: flag_own_realty remix with organization_type"
#   puts ig_14
#   puts "remix_15 : code_gender remix with organization_type"
#   puts ig_15
  return dataset
end

:extract_features

In [24]:
def train_random_classifier train_db
  model = nil
  # BEGIN YOUR CODE
  #END YOUR CODE
  return model
end

:train_random_classifier

In [25]:
def eval_random_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  sql = "select SK_ID_CURR from application_train"
  dataset = create_dataset db, sql
  dataset.each do |row|
    predictions[row["id"]] = rand()
  end
  return predictions
end

:eval_random_classifier_on

In [26]:
def test_11_1
  model = train_random_classifier $train_db
  predictions = eval_random_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.45, "AUC #{auc} > 0.45")
  assert_true(auc < 0.55, "AUC #{auc} < 0.55")
  plot_roc_curve(fp, tp, auc).show()
end
test_11_1()

In [27]:
"Evaluation on test set after submission"

"Evaluation on test set after submission"

## Question 1.2 (10 Points)

Implement a "perfect" classifier, one in which the score is the class label. You should not use the class label as a feature, but if you do, then your performance will be too good to be true.
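
Why using the label as the score looks "too good to be true" can be seen on a toy example. The sketch below assumes a hypothetical `pairwise_auc` helper that counts tied scores as half a win; it is quadratic in the number of rows and only meant for small illustrations, not as a replacement for the course's `roc_curve`.

```ruby
# Pairwise AUC: fraction of (positive, negative) pairs ranked correctly,
# counting tied scores as half. Quadratic, so only for small examples.
def pairwise_auc(scores, labels)
  pos = scores.keys.select { |id| labels[id] == 1 }
  neg = scores.keys.select { |id| labels[id] == 0 }
  wins = 0.0
  pos.each do |pid|
    neg.each do |nid|
      if scores[pid] > scores[nid]
        wins += 1.0
      elsif scores[pid] == scores[nid]
        wins += 0.5
      end
    end
  end
  wins / (pos.size * neg.size)
end

labels = { 1 => 0, 2 => 1, 3 => 0, 4 => 1, 5 => 0 }
leaky  = labels.dup                                # score == label
flat   = labels.keys.map { |id| [id, 0.5] }.to_h   # constant score

puts pairwise_auc(leaky, labels)   # => 1.0
puts pairwise_auc(flat, labels)    # => 0.5
```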

In [28]:
def train_perfect_classifier train_db
  model = nil
  
  return model
end

:train_perfect_classifier

In [29]:
def eval_perfect_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  sql = "select SK_ID_CURR,TARGET from application_train"
  dataset = create_dataset db, sql
  dataset.each do |row|
    predictions[row["id"]] = row["label"]
  end
  return predictions
end

:eval_perfect_classifier_on

In [30]:
def test_12_1
  model = train_perfect_classifier $train_db
  predictions = eval_perfect_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.95, "AUC #{auc} > 0.95")
  assert_true(auc < 1.05, "AUC #{auc} < 1.05")
  plot_roc_curve(fp, tp, auc).show()
end
test_12_1()

In [31]:
"Evaluation on test set after submission"

"Evaluation on test set after submission"

## Question 2.1 (15 Points)

Implement a classifier that achieves an AUC, $a$, in the range $a\in (0.5, 0.6)$. This demonstrates that you have discovered some moderately useful features and can control how your model performs. You will not receive any extra points for a model that performs better than 0.6 in this question.

Note: **do not use the target label in your training or evaluation**

In [32]:
def dot x, w
  
  res = 0.0

  x.each do |k, v|
    if w[k] != nil
      res += v * w[k]
    end
  end
  return res  
end

def norm w
  return Math.sqrt(dot(w,w))
end

:norm

In [33]:
class StochasticGradientDescent
  attr_reader :weights
  attr_reader :objective
  def initialize obj, w_0, lr = 0.01
    @objective = obj
    @weights = w_0
    @n = 1.0
    @lr = lr
  end
  def update x
    # BEGIN YOUR CODE
    curr_lr = @lr / Math.sqrt(@n)
    @objective.func(x, @weights)
    grad = @objective.grad(x, @weights)
    @weights = update_weights(@weights, grad, curr_lr)
    @n += 1
    #END YOUR CODE
  end
  
  def update_weights(w, dw, lr)
    w_copy = w.clone()
    dw_copy = dw.clone()
  
    dw_copy.each do |k, v|
      dw_copy[k] *= lr
    end
  
    w_copy.each do |k, v|
      if dw_copy.key?(k)
        w_copy[k] -= dw_copy[k]
      end
    end
    w_copy
  end
end

:update_weights
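
The step-size schedule in `update` divides the base learning rate by the square root of the update count. The sketch below shows the same schedule on a one-dimensional quadratic; it is an assumption-laden toy (exact gradient instead of a sampled batch, a hypothetical `decayed_descent` helper, and a made-up target of 3.0), not the class above.

```ruby
# Minimize f(w) = (w - 3)^2 / 2 with a step size that decays as lr / sqrt(n).
def decayed_descent(w0, lr, steps)
  w = w0
  (1..steps).each do |n|
    grad = w - 3.0                      # derivative of f at w
    w -= (lr / Math.sqrt(n)) * grad     # same schedule as curr_lr in update
  end
  w
end

w = decayed_descent(0.0, 0.5, 100)
puts((w - 3.0).abs < 0.01)   # => true
```

Large early steps move the weight quickly toward the minimum, while the shrinking later steps damp the noise that batch sampling would otherwise add.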

In [34]:
class LinearRegressionModel  
  def predict row, w
    res = 0.0
    row["features"].each do |key, value|
      if(w.key? (key))
        res += value * w[key]
      end
    end
    res
  end
end

:predict

In [35]:
class LinearRegressionModel
  def func data, w
    # BEGIN YOUR CODE
    res = 0.0
    data.each do |record|
      update_value = ((predict record, w) - record["label"]) ** 2 / 2
      res += update_value
    end
    res = res / data.length
    res
    #END YOUR CODE
  end
  
  ## Adjusts the parameter to be within the allowable range
  def adjust w
  end
end

:adjust

In [36]:
class LinearRegressionModel
  def grad data, w
    # BEGIN YOUR CODE
    grad_res = Hash.new
    
    data.each do |record|
      record["features"].each do |key, value|
        if(!grad_res.key?(key))
          grad_res[key] = 0.0
        end
        grad_res[key] += value * (predict(record, w) - record["label"]) / data.length
      end
    end
    grad_res
    #END YOUR CODE
  end
end

:grad

In [37]:
class LogisticRegressionModel
  def predict row, w
    # BEGIN YOUR CODE
    res = 0.0
    row["features"].each do |key, value|
      if w[key] != nil
        res += w[key] * value
      end
    end
    
    prob = 1 / (1 + Math.exp(-res))
    #END YOUR CODE
    prob
  end
  
  def adjust w
    w
  end
  
   def func data, w
    # BEGIN YOUR CODE
    res = 0.0
    data_size = data.length
    data.each do |record|
      y = record["label"]
      p = predict record, w
      res -= (y * Math.log(p) + (1 - y) * Math.log(1 - p)) / data_size
    end
    res
    #END YOUR CODE
  end
  
  def grad data, w
    res = Hash.new
    count = Hash.new
    
    data.each do |record|
      record["features"].each do |key,value|
        if !res.key?(key)
          res[key] = 0.0
        end
        if !count.key?(key)
          count[key] = 0
        end
        p = predict record, w
        y = record["label"]
        res[key] += value * (p - y) 
        count[key] += 1
      end
    end
    res.each do |key, value|
      res[key] = value / count[key]
    end
    res
  end
end

:grad

In [1]:
class LogisticRegressionModelL2
  def initialize reg_param
    @reg_param = reg_param
  end

  def predict row, w
    x = row["features"]    
    1.0 / (1 + Math.exp(-dot(w, x)))
  end
  
  def adjust w
    w.each_key {|k| w[k] = 0.0 if w[k].nan? or w[k].infinite?}
    w.each_key {|k| w[k] = 0.0 if w[k].abs > 1e5 }
  end
  
  def func data, w
    # BEGIN YOUR CODE
#     raise NotImplementedError.new()
    update_value = 0.0
    data.each do |record|
      if record["label"] == -1
        record["label"] = 0
      end
      predict_value = predict(record, w)
      update_value -= record["label"] * Math.log(predict_value) + (1 - record["label"]) * Math.log(1 - predict_value) 
    end
    res = (@reg_param / 2) * norm(w) ** 2 + update_value / data.length
    #END YOUR CODE
  end
  def grad data, w
    # BEGIN YOUR CODE
    
    g = Hash.new
    count = Hash.new
    
    data.each do |record|
      record["features"].each do |key, value|
        if !g.key?(key)
          g[key] = 0.0
        end
        if !count.key?(key)
          count[key] = 0
        end
        if record["label"] == -1
          record["label"] = 0
        end
        predict_value = predict(record, w)
        update_value = value * (predict_value - record["label"])
        g[key] += update_value
        count[key] += 1
      end
    end
    
    g.each do |key, value|
      g[key] = @reg_param * w[key] + value / count[key]
    end
    #END YOUR CODE  
    return g
  end
end

:grad
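
One way to gain confidence in an L2-regularized logistic gradient is to compare it with a finite-difference estimate of the loss. The sketch below does this for a single-weight model; `sigmoid`, `l2_log_loss`, `l2_log_loss_grad`, the toy data, and the constants are all assumptions for illustration and operate on `[feature, label]` pairs rather than on the notebook's records.

```ruby
def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# L2-regularized log loss for a single-weight model p = sigmoid(w * x).
def l2_log_loss(data, w, reg)
  nll = data.sum do |x, y|
    prob = sigmoid(w * x)
    -(y * Math.log(prob) + (1 - y) * Math.log(1 - prob))
  end
  reg / 2.0 * w**2 + nll / data.size
end

# Analytic gradient: reg * w + mean(x * (p - y)).
def l2_log_loss_grad(data, w, reg)
  reg * w + data.sum { |x, y| x * (sigmoid(w * x) - y) } / data.size
end

data = [[0.5, 1], [-1.2, 0], [2.0, 1], [0.3, 0]]   # [feature, label] pairs
w, reg, eps = 0.4, 0.1, 1e-6

# Central difference approximation of d(loss)/dw.
numeric = (l2_log_loss(data, w + eps, reg) - l2_log_loss(data, w - eps, reg)) / (2 * eps)
puts((numeric - l2_log_loss_grad(data, w, reg)).abs < 1e-6)   # => true
```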

In [2]:
require 'set'  # find_best_split builds a Set of feature names

class DecisionTree
  attr_reader :tree, :h0
  
  def initialize splitters, min_size, max_depth
    @splitters = splitters
    @min_size = min_size
    @max_depth = max_depth
  end
  
  def init_dataset dataset
    @dataset = dataset
    @header = @dataset["features"]
    @c_dist = class_distribution @dataset["data"]
    @h0 = entropy @c_dist
    @tree = {n: @dataset["data"].size, entropy: @h0, dist: @c_dist, split: nil, children: {}}    
  end
  
  def find_best_split dataset, initial_entropy
    # BEGIN YOUR CODE
    ig_best = 0.0
    split_obj_best = nil
    
    fnames = Set.new
    dataset.each do |row|
      fnames = fnames | row["features"].keys.to_set
    end
    
    @splitters.each do |splitter|
      fnames.each do |fname|
        if splitter.matches? dataset, fname 
          num_split_obj, ig = splitter.new_split dataset, initial_entropy, fname
          if ig > ig_best
            split_obj_best = num_split_obj
            ig_best = ig
          end
        end
      end
    end
    return [split_obj_best, ig_best]
    #END YOUR CODE
  end
end

:find_best_split

In [4]:
class DecisionTree
  def train dataset
    init_dataset dataset
    build_tree @dataset["data"], @tree, @max_depth
  end

  def build_tree x, root, max_depth
    # BEGIN YOUR CODE
    if x == nil or max_depth == 1 or root[:n] < @min_size
      return
    end
    
    split_best, ig_best = find_best_split x, root[:entropy]
    if split_best == nil
      return 
    end
    
    split_x = split_best.split x 
    root[:split] = split_best
    
    keys = split_x.keys
    l_key = keys[0]
    l_data = split_x[l_key]
    l_size = l_data.size
    l_dist = class_distribution l_data
    l_entropy = entropy l_dist
    left_tree = {n: l_size, entropy: l_entropy, dist: l_dist, split: nil, children: {}}
    
    r_key = keys[1]
    r_data = split_x[r_key]
    r_size = r_data.size
    r_dist = class_distribution r_data
    r_entropy = entropy r_dist
    right_tree = {n: r_size, entropy: r_entropy, dist: r_dist, split: nil, children: {}}
    
    
    root[:children] = {l_key =>left_tree, r_key => right_tree}
    
    build_tree l_data, left_tree, max_depth - 1
    build_tree r_data, right_tree, max_depth - 1
    
    #END YOUR CODE
  end
end

:build_tree

In [5]:
class DecisionTree
  def predict x
    return eval_tree x, @tree
  end
  
  def eval_tree x, root
    # BEGIN YOUR CODE
    if root[:children].empty?
      return root[:dist].key(root[:dist].values.max)
    end
    path = root[:split].test x
    child = root[:children][path]
    return eval_tree x, child
    #END YOUR CODE
  end
end



:eval_tree

In [6]:
def confusion_matrix dataset, predictions
  # BEGIN YOUR CODE
  classes = dataset["classes"]
  class_size = classes.size
  conf_matrix = Array.new(class_size) { Array.new(class_size, 0) }
  
  data = dataset["data"]
  # Rows index the predicted class, columns the true label.
  data.each_with_index do |row, index|
    conf_matrix[predictions[index]][row["label"]] += 1
  end
  conf_matrix
  
  # END YOUR CODE
end


:confusion_matrix
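
Accuracy can be read directly off a confusion matrix as the diagonal divided by the total. The `accuracy` function used later comes from the course library and is not shown here, so the sketch below uses a hypothetical `accuracy_from` helper to illustrate the idea; it assumes class labels are the integer indices 0, 1, and so on.

```ruby
# Accuracy = correctly classified examples (the diagonal) / all examples.
def accuracy_from(matrix)
  correct = matrix.each_index.sum { |i| matrix[i][i] }
  total = matrix.flatten.sum
  correct.to_f / total
end

# Rows are predicted classes, columns are true classes.
matrix = [[5, 1],
          [2, 8]]
puts accuracy_from(matrix)   # => 0.8125
```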

In [7]:
def cross_validate data, folds, &block
  dataset = data["data"]
  fold_size = dataset.size / folds
  subsets = []
  dataset.shuffle.each_slice(fold_size) do |subset|
    subsets << subset
  end
  i_folds = Array.new(folds) {|i| i}
  
  i_folds.collect do |fold|
    test = subsets[fold]
    train = (i_folds - [fold]).flat_map {|t_fold| subsets[t_fold]}
    train_data = data.clone
    train_data["data"] = train
    
    test_data = data.clone
    test_data["data"] = test
    
    yield train_data, test_data, fold
  end
end

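# NOTE: these one-argument helpers reuse the names of the per-feature
# `mean` and `stdev` defined earlier; whichever definition is evaluated
# last is the one that `normalize` and `fill_mean` will call.
def mean x
  sum = x.inject(0.0) {|u,v| u + v}
  sum / x.size
end

def stdev x
  m = mean x
  sum = x.inject(0.0) {|u,v| u + (v - m) ** 2.0}
  Math.sqrt(sum / (x.size - 1))
end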

:stdev
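
In `cross_validate`, slicing by `dataset.size / folds` can leave a short extra slice when the row count is not a multiple of `folds`, and that slice is never used as a test or training fold. The sketch below shows one alternative, round-robin assignment, using a hypothetical `make_folds` helper on plain arrays; it is an illustration of the trade-off, not a required change.

```ruby
# Assign rows to folds round-robin so every row lands in exactly one fold,
# even when the row count is not a multiple of the fold count.
def make_folds(rows, folds, seed = 1)
  shuffled = rows.shuffle(random: Random.new(seed))
  buckets = Array.new(folds) { [] }
  shuffled.each_with_index { |row, i| buckets[i % folds] << row }
  buckets
end

buckets = make_folds((1..10).to_a, 3)
p buckets.map(&:size)   # => [4, 3, 3]

# For comparison, slicing by size / folds leaves a fourth, leftover slice.
slices = (1..10).each_slice(10 / 3).to_a
p slices.map(&:size)    # => [3, 3, 3, 1]
```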

In [8]:
def cross_validation_accuracy iris, folds = 10, min_size = 10, max_depth = 50
  acc_arr = Array.new
  cross_validate iris, folds do |train, test|
    # BEGIN YOUR CODE
    dec_tree = DecisionTree.new [NumericSplitter.new], min_size, max_depth
    dec_tree.train train
    pred_arr = Array.new
    data = test["data"]
    data.each do |row|
      pred_arr << dec_tree.predict(row)
    end
    mat = confusion_matrix test, pred_arr
    acc_arr << accuracy(mat)
    #END YOUR CODE
  end
  acc_arr
end


:cross_validation_accuracy

In [39]:
def train_one_classifier train_db
  model = nil
  sql = "select SK_ID_CURR,TARGET,
    ext_source_1
    from application_train"
  dataset = create_dataset train_db, sql
  fill_zero dataset
  obj = LogisticRegressionModelL2.new 0.1
  weights = Hash.new {|h,k| h[k] = 0.0}
  model = StochasticGradientDescent.new obj, weights, 0.01
  
  batch_size = 100
  total_iter = dataset.length / batch_size
  
  total_iter.times do
    batch_data = dataset.sample(batch_size)
    model.update(batch_data)
  end
  
  return model
end

:train_one_classifier

In [40]:
def eval_one_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  sql = "select SK_ID_CURR,TARGET,ext_source_1 from application_train"
  dataset = create_dataset db, sql
  fill_zero dataset
  dataset.each do |row|
    obj = model.objective
    predictions[row["id"]] = obj.predict(row, model.weights)
  end
  
  #END YOUR CODE
  return predictions
end

:eval_one_classifier_on

In [41]:
def test_21_1
  model = train_one_classifier $train_db
  predictions = eval_one_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.5, "AUC #{auc} > 0.5")
  assert_true(auc < 0.6, "AUC #{auc} < 0.6")
  plot_roc_curve(fp, tp, auc).show()
end
test_21_1()

In [42]:
"Evaluation on test set after submission"


"Evaluation on test set after submission"

## Question 3.1 (25 Points)

Implement a classifier that achieves an AUC, $a$, in the range $a \in (0.6, 0.7)$. This demonstrates that you have discovered some moderately useful features and can control how your model performs. You will not receive any extra points for a model that performs better than 0.7 in this question.

Note: **do not use the target label in your training or evaluation**
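
For intuition about the target range, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pairwise comparison on a toy list of `[score, label]` pairs. `pairwise_auc` is a hypothetical helper and the pair format is an assumption; it is not the course's `roc_curve`.

```ruby
# Hypothetical helper: AUC as the fraction of (positive, negative) pairs
# ranked correctly; ties count as half. O(P * N), so toy inputs only.
def pairwise_auc(scored)
  pos = scored.select { |_, label| label == 1 }.map(&:first)
  neg = scored.reject { |_, label| label == 1 }.map(&:first)
  return nil if pos.empty? || neg.empty?   # undefined without both classes

  wins = 0.0
  pos.each do |p|
    neg.each do |n|
      if p > n
        wins += 1.0
      elsif p == n
        wins += 0.5
      end
    end
  end
  wins / (pos.size * neg.size)
end

toy = [[0.9, 1], [0.8, 0], [0.7, 1], [0.4, 0], [0.3, 1], [0.2, 0]]
puts pairwise_auc(toy)
```

On this toy list, 6 of the 9 positive/negative pairs are ordered correctly, so the value is 2/3, which happens to fall inside the range this question asks for.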

In [43]:
def train_two_classifier train_db
  model = nil
  sql = "select SK_ID_CURR,TARGET,
    ext_source_1,
    ext_source_2,
    ext_source_3,
    (ext_source_1 + ext_source_2 + ext_source_3) as sum
    from application_train"
  dataset = create_dataset train_db, sql
  fill_zero dataset
  dataset = normalize dataset
  obj = LogisticRegressionModelL2.new 0.1
  weights = Hash.new {|h,k| h[k] = 0.0}
  model = StochasticGradientDescent.new obj, weights, 0.01
  
  batch_size = 100
  total_iter = dataset.length / batch_size
  
  total_iter.times do
    batch_data = dataset.sample(batch_size)
    model.update(batch_data)
  end
  
  return model
end

:train_two_classifier

In [44]:
def eval_two_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  sql = "select SK_ID_CURR,TARGET,
    ext_source_1,
    ext_source_2,
    ext_source_3,
    (ext_source_1 + ext_source_2 + ext_source_3) as sum
    from application_train"
  dataset = create_dataset db, sql
  fill_zero dataset
  dataset = normalize dataset
  dataset.each do |row|
    obj = model.objective
    predictions[row["id"]] = obj.predict(row, model.weights)
  end
  
  #END YOUR CODE
  return predictions
end

:eval_two_classifier_on

In [45]:
def test_31_1
  model = train_two_classifier $train_db
  predictions = eval_two_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.6, "AUC #{auc} > 0.6")
  assert_true(auc < 0.7, "AUC #{auc} < 0.7")
  plot_roc_curve(fp, tp, auc).show()
end
test_31_1()

In [43]:
"Evaluation on test set after submission"

"Evaluation on test set after submission"

## Question 4.1 (20 Points)

Implement a classifier that achieves an AUC, $a$, where $a > 0.7$. This demonstrates that you have discovered some interesting features and tuned some algorithms well.

Note: **do not use the target label in your training or evaluation**

In [53]:
def train_three_classifier train_db
  model = nil
#   dataset = create_dataset train_db, sql
  dataset = extract_features train_db
  
  fill_zero dataset
  dataset = normalize dataset
  
  obj = LogisticRegressionModelL2.new 0.1
  weights = Hash.new {|h,k| h[k] = (rand * 0.1) - 0.05}
  model = StochasticGradientDescent.new obj, weights, 0.1
  
  batch_size = 50
  total_iter = dataset.length / batch_size
  
  total_iter.times do
    batch_data = dataset.sample(batch_size)
    model.update(batch_data)
  end
  
  return model
end

:train_three_classifier

In [54]:
def eval_three_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
#   dataset = create_dataset db, sql
  dataset = extract_features db
  fill_zero dataset
  dataset = normalize dataset
  dataset.each do |row|
    obj = model.objective
    predictions[row["id"]] = obj.predict(row, model.weights)
  end
  
  #END YOUR CODE
  return predictions
end

:eval_three_classifier_on

In [55]:
def test_41_1
  model = train_three_classifier $train_db
  predictions = eval_three_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.7, "AUC #{auc} > 0.7")
  assert_true(auc < 1.0, "AUC #{auc} < 1.0")
  plot_roc_curve(fp, tp, auc).show()
end
test_41_1()

In [36]:
"Evaluation on test set after submission"

"Evaluation on test set after submission"

## Question 5.1 (15 Points)

Implement a classifier that achieves an AUC, $a$, where $a > 0.75$. Note that this is higher than the average Kaggle competition result, so get creative.

Note: **do not use the target label in your training or evaluation**

In [56]:
module AdaBoost

  class AdaBoost

    attr_reader :weak_classifiers, :y_index

    def initialize(number_of_classifiers, y_index)
      @weak_classifiers = []
      @weak_learner = WeakLearner.new(y_index)
      @number_of_classifiers = number_of_classifiers
      @weights = []
      @y_index = y_index
    end

    def train(samples)
      if Config::OVER_SAMPLING_TRAINING_SET 
        resampler = Resampler.new(@y_index)
        resampler.over_sample(samples)
      end
      initialize_weights(samples)
      0.upto(@number_of_classifiers - 1) do |i|
        weak_classifier = @weak_learner.generate_weak_classifier(samples, @weights)
        weak_classifier.compute_alpha
        update_weights(weak_classifier, samples)
        @weak_classifiers << weak_classifier
        yield i, weak_classifier if block_given? 
      end
    end

    def classify(sample)
      score = 0.0
      @weak_classifiers.each do |weak_classifier| 
        score += weak_classifier.classify_with_alpha(sample)
      end
      score
    end

    def self.build_from_model(model, y_index = 0)
      classifiers = model.weak_classifiers
      adaboost = AdaBoost.new(classifiers.size, y_index)
      classifiers.each do |classifier|
        adaboost.weak_classifiers << WeakClassifier.new(classifier.feature_number, classifier.split, classifier.alpha)
      end
      adaboost
    end

    private

    def initialize_weights(samples)
      samples_size = samples.size.to_f
      negative_weight = 1 / samples_size
      positive_weight = negative_weight
      if Config::INCORPORATE_COST_SENSITIVE_LEARNING
        analyzer = FeaturesAnalyzer.new(@y_index)
        distribution = analyzer.analyze(samples).distribution
        positive_rate = distribution.positive / samples_size
        negative_rate = distribution.negative / samples_size
        normalizing_constant = distribution.negative * positive_rate + distribution.positive * negative_rate
        positive_weight = positive_rate / normalizing_constant.to_f
        negative_weight = negative_rate / normalizing_constant.to_f
      end
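      samples.each_with_index do |sample, i|
        y = sample[@y_index]
        # The cross-assignment is intentional: each class is weighted by the
        # other class's rate, so both classes carry equal total weight.
        @weights[i] = (y == -1) ? positive_weight : negative_weight
      end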
    end

    def update_weights(weak_classifier, samples)
      sum = 0.0
      samples.each_with_index do |sample, i|
        y = sample[@y_index]
        @weights[i] *= Math.exp(-(weak_classifier.alpha) * weak_classifier.classify(sample) * y)
        sum += @weights[i]
      end
      @weights.each_with_index do |_, i|
        @weights[i] /= sum
      end
    end
  end
end

:update_weights
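
To make the reweighting in `update_weights` concrete, the sketch below runs one boosting round by hand on four samples with labels in {-1, +1}. The toy labels and stump predictions are made up for illustration, and `boosting_round` is a hypothetical helper; only the two formulas mirror `compute_alpha` and `update_weights` above.

```ruby
# One hand-rolled boosting round; returns [error, alpha, new_weights].
def boosting_round(labels, predictions, weights)
  # Weighted error: total weight of the misclassified samples.
  error = 0.0
  labels.each_index { |i| error += weights[i] if labels[i] != predictions[i] }
  alpha = 0.5 * Math.log((1.0 - error) / error)

  # Misclassified samples are scaled by e^alpha, correct ones by e^-alpha.
  scaled = weights.each_with_index.map do |w, i|
    w * Math.exp(-alpha * predictions[i] * labels[i])
  end
  total = scaled.sum
  [error, alpha, scaled.map { |w| w / total }]
end

labels      = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]   # the stump misclassifies the second sample
error, alpha, weights = boosting_round(labels, predictions, Array.new(4, 0.25))
puts "error=#{error} alpha=#{alpha.round(4)}"
p weights.map { |w| w.round(4) }
```

After normalization the single misclassified sample holds half of the total weight, so the next weak classifier is pushed toward getting it right.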

In [57]:
module AdaBoost

  module Config
    NUMBER_OF_RANDOM_CLASSIFIERS = 100
    INCORPORATE_COST_SENSITIVE_LEARNING = true
    OVER_SAMPLING_TRAINING_SET = false
    USE_RANDOM_WEAK_CLASSIFIERS = false
    USE_THRESHOLD_CLASSIFICATION = true
  end
end

true

In [58]:
module AdaBoost

  class ContingencyTable

    def initialize
      @table = [[0, 0], [0, 0]]
    end

    def true_positive
      @table[1][1]
    end

    def false_positive
      @table[0][1]
    end

    def true_negative
      @table[0][0]
    end

    def false_negative
      @table[1][0]
    end

    def add_prediction(y, h)
      @table[class_to_index(y)][class_to_index(h)] += 1
    end

    def outcome_positive
      true_positive + false_positive
    end

    def outcome_negative
      true_negative + false_negative
    end

    def total_population
      @table[0][0] + @table[0][1] + @table[1][0] + @table[1][1]
    end

    def predicted_condition_positive
      true_positive + false_positive
    end

    def predicted_condition_negative
      false_negative + true_negative
    end

    def condition_positive
      true_positive + false_negative
    end

    def condition_negative
      false_positive + true_negative
    end

    def prevalence
      condition_positive / total_population.to_f
    end

    def true_positive_rate
      true_positive / condition_positive.to_f
    end

    def recall
      true_positive_rate
    end

    def sensitivity
      true_positive_rate
    end

    def false_positive_rate
      false_positive / condition_negative.to_f
    end

    def fall_out
      false_positive_rate
    end

    def false_negative_rate
      false_negative / condition_positive.to_f
    end

    def true_negative_rate
      true_negative / condition_negative.to_f
    end

    def specificity
      true_negative_rate
    end

    def accuracy
      (true_positive + true_negative) / total_population.to_f
    end

    def positive_predictive_value
      true_positive / outcome_positive.to_f
    end

    def precision
      positive_predictive_value
    end

    def false_discovery_rate
      false_positive / outcome_positive.to_f
    end

    def false_omission_rate
      false_negative / outcome_negative.to_f
    end

    def negative_predictive_value
      true_negative / outcome_negative.to_f
    end

    def positive_likelihood_ratio
      true_positive_rate / false_positive_rate.to_f
    end

    def negative_likelihood_ratio
      false_negative_rate / true_negative_rate.to_f
    end

    def diagnostic_odds_ratio
      positive_likelihood_ratio / negative_likelihood_ratio.to_f
    end

    def to_s
      "\nTotal population: %d\t \
      \nCondition positive: %d\t \
      \nCondition negative: %d\t \
      \nPredicted Condition positive: %d\t \
      \nPredicted Condition negative: %d\t \
      \nTrue positive: %d\t \
      \nTrue negative: %d\t \
      \nFalse Negative: %d\t \
      \nFalse Positive: %d\t \
      \nPrevalence = Σ Condition positive / Σ Total population: %f\t \
      \nTrue positive rate (TPR) = Σ True positive / Σ Condition positive: %f\t \
      \nFalse positive rate (FPR) = Σ False positive / Σ Condition negative: %f\t \
      \nFalse negative rate (FNR) = Σ False negative / Σ Condition positive: %f\t \
      \nTrue negative rate (TNR) = Σ True negative / Σ Condition negative: %f\t \
      \nAccuracy (ACC) = (Σ True positive + Σ True negative) / Σ Total population: %f\t \
      \nPositive predictive value (PPV) = Σ True positive / Σ Test outcome positive: %f\t \
      \nFalse discovery rate (FDR) = Σ False positive / Σ Test outcome positive: %f\t \
      \nFalse omission rate (FOR) = Σ False negative / Σ Test outcome negative: %f\t \
      \nNegative predictive value (NPV) = Σ True negative / Σ Test outcome negative: %f\t \
      \nPositive likelihood ratio (LR+) = TPR / FPR: %f\t \
      \nNegative likelihood ratio (LR−) = FNR / TNR: %f\t \
      \nDiagnostic odds ratio (DOR) = LR+ / LR−: %f\t" %
      [
        total_population,
        condition_positive,
        condition_negative,
        predicted_condition_positive,
        predicted_condition_negative,
        true_positive,
        true_negative,
        false_negative,
        false_positive,
        prevalence,
        true_positive_rate,
        false_positive_rate,
        false_negative_rate,
        true_negative_rate,
        accuracy,
        positive_predictive_value,
        false_discovery_rate,
        false_omission_rate,
        negative_predictive_value,
        positive_likelihood_ratio,
        negative_likelihood_ratio,
        diagnostic_odds_ratio
      ]
    end

    def class_to_index(k)
      (k > 0) ? 1 : 0
    end
  end
end

:class_to_index

In [59]:
module AdaBoost

  class Evaluator

    def initialize(classifier)
      @classifier = classifier
      @threshold = Float::MAX
    end

    def evaluate(test_set)
      contingency_table = ContingencyTable.new
      test_set.each do |sample|
        y = sample[@classifier.y_index]
        h = if Config::USE_THRESHOLD_CLASSIFICATION
          classify_using_threshold(sample)
        else
          classify_normally(sample)
        end
        contingency_table.add_prediction(y, h)
      end
      contingency_table
    end

    def used_feature_numbers(unique = false)
      used_feature_numbers = []
      @classifier.weak_classifiers.each do |weak_classifier|
        used_feature_numbers << weak_classifier.feature_number
      end
      unique ? used_feature_numbers.uniq : used_feature_numbers
    end

    def feature_occurrences
      used_numbers = used_feature_numbers
      occurrences = {}
      used_numbers.each do |number|
        occurrences[number] = 0 if occurrences[number].nil?
        occurrences[number] += 1
      end
      occurrences
    end

    private

    def threshold
      if @threshold == Float::MAX
        @threshold = 0
        @classifier.weak_classifiers.each do |weak_classifier|
          @threshold += weak_classifier.alpha / 2.0
        end
      end
      @threshold
    end

    def classify_normally(sample)
      @classifier.classify(sample) > 0 ? 1 : -1
    end

    def classify_using_threshold(sample)
      score = 0.0
      @classifier.weak_classifiers.each do |weak_classifier|
        if sample[weak_classifier.feature_number] > weak_classifier.split
          score += weak_classifier.alpha
        end
      end
      score > threshold ? 1 : -1
    end
  end
end

:classify_using_threshold
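
The two classification paths in `Evaluator` should agree whenever there is no exact tie: with stump outputs in {-1, +1}, the signed vote equals twice the alpha mass voting +1 minus the total alpha mass, so it is positive exactly when the +1 mass exceeds half the total. The sketch below checks that on made-up numbers; `vote_by_sign` and `vote_by_threshold` are hypothetical helpers, not part of the module above.

```ruby
# Sign of the alpha-weighted vote, with stump outputs in {-1, +1}.
def vote_by_sign(alphas, votes)
  alphas.zip(votes).sum { |a, h| a * h } > 0 ? 1 : -1
end

# Compare the alpha mass voting +1 against half the total alpha mass.
def vote_by_threshold(alphas, votes)
  positive_mass = alphas.zip(votes).sum { |a, h| h > 0 ? a : 0.0 }
  positive_mass > alphas.sum / 2.0 ? 1 : -1
end

alphas = [0.8, 0.3, 0.5]   # made-up stump weights
votes  = [1, -1, 1]        # made-up stump outputs for one sample
puts vote_by_sign(alphas, votes) == vote_by_threshold(alphas, votes)
```

Exact ties are avoided in the example because floating-point rounding could resolve them differently in the two forms.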

In [60]:
module AdaBoost

    Analyze = Struct.new(:statistics, :distribution)
    Distribution = Struct.new(:negative, :positive)
    FeatureStatistic = Struct.new(:min, :max, :sum, :avg, :vrn, :std, :rng)
    VariableRelations = Struct.new(:x, :y, :cov, :cor)

  class FeaturesAnalyzer

    def initialize(y_index)
      @y_index = y_index
    end

    def analyze(samples)
      
      statistics = []
      distribution = Distribution.new(0, 0)
      number_of_samples = samples.size
      
      if number_of_samples < 1
        raise ArgumentError.new('At least one sample is needed to analyze.')
      end
      number_of_features = @y_index
      sample_size = samples[0].size
      if number_of_features < 1 or sample_size < 2 or sample_size <= @y_index
        raise ArgumentError.new('At least 1 feature is needed to analyze.')
      end
      0.upto(number_of_features - 1) do
        statistics << FeatureStatistic.new(Float::MAX, -Float::MAX, 0, 0, 0, 0)
      end
      samples.each do |sample|
        y = sample[@y_index]
        if y == -1
            distribution.negative += 1
        else
            distribution.positive += 1
        end
        0.upto(number_of_features - 1) do |i|
          statistic = statistics[i]
          feature_value = sample[i]
          if feature_value < statistic.min
            statistic.min = feature_value
          end
          if feature_value > statistic.max
            statistic.max = feature_value
          end
          statistic.sum += feature_value
        end
      end
      statistics.each do |statistic|
        statistic.avg = statistic.sum / number_of_samples.to_f
        statistic.rng = (statistic.max - statistic.min).abs
      end
      samples.each do |sample|
        statistics.each_with_index do |statistic, i|
          feature_value = sample[i]
          statistic.vrn += (statistic.avg - feature_value) ** 2
        end
      end
      statistics.each do |statistic|
        statistic.vrn /= (number_of_samples - 1).to_f
        statistic.std = Math.sqrt statistic.vrn
      end
      analyze = Analyze.new
      analyze.statistics = statistics
      analyze.distribution = distribution
      analyze
    end

    def relations(x, y, samples, statistics)
      sum = 0.0
      samples.each do |sample|
        x_value = sample[x].to_f
        y_value = sample[y].to_f
        sum += (x_value - statistics[x].avg) * (y_value - statistics[y].avg)
      end
      cov = sum / (samples.size - 1).to_f
      cor = cov / (statistics[x].std * statistics[y].std).to_f
      VariableRelations.new(x, y, cov, cor)
    end
  end
end

:relations

In [61]:
module AdaBoost

  class Resampler

    def initialize(y_index)
      @y_index = y_index
    end
    
    def over_sample(samples)
      distribution = distribution(samples)
      y0 = distribution.negative
      y1 = distribution.positive
      majority = y0 < y1 ? 1.0 : -1.0
      difference = (y0 - y1).abs
      samples.each do |sample|
        if difference <= 0
          break
        end
        if sample[@y_index] != majority
          samples << sample
          difference -= 1
        end
      end
    end

    private

    def distribution(instances)
      analyzer = FeaturesAnalyzer.new(@y_index)
      analyzer.analyze(instances).distribution
    end
  end
end

:distribution

In [62]:
module AdaBoost

  class WeakClassifier

    attr_accessor :error
    attr_reader :feature_number, :split, :alpha

    def initialize(feature_number, split, alpha = 0.0, error = 0.0)
      @feature_number = feature_number
      @split = split
      @error = error
      @alpha = alpha
    end

    def compute_alpha
      @alpha = 0.5 * Math.log((1.0 - @error) / @error)
    end

    def classify(sample)
      sample[@feature_number] > @split ? 1 : -1
    end

    def classify_with_alpha(sample)
      return classify(sample) * @alpha
    end

    def increase_error(amount)
      @error += amount
    end
  end
end

:increase_error

In [63]:
module AdaBoost

  class WeakLearner

    def initialize(y_index)
      @y_index = y_index
      @analyzer = FeaturesAnalyzer.new(y_index)
      @classifiers_cache = []
    end

    def features_satistics(samples)
       @analyzer.analyze(samples).statistics
    end

    def generate_weak_classifier(samples, weights)
      number_of_samples = samples.size
      if number_of_samples < 1
        raise ArgumentError.new('At least one sample is needed to generate.')
      end
      number_of_features = @y_index
      sample_size = samples[0].size
      if number_of_features < 1 or sample_size < 2 or sample_size <= @y_index
        raise ArgumentError.new('At least 1 feature is needed to generate.')
      end
      classifiers = []
      if Config::USE_RANDOM_WEAK_CLASSIFIERS
        classifiers = generate_random_classifiers(samples, number_of_features)
      else
        classifiers = generate_all_possible_classifiers(samples, number_of_features)
      end
      best_index = -1
      best_error = Float::MAX
      classifiers.each_with_index do |classifier, i|
        classifier.error = 0.0
        samples.each_with_index do |sample, j|
          y = sample[@y_index]
          if classifier.classify(sample).to_f != y
            classifier.increase_error(weights[j])
          end
        end
        if classifier.error < best_error
          best_error = classifier.error
          best_index = i
        end
      end
      best = classifiers[best_index]
      if !Config::USE_RANDOM_WEAK_CLASSIFIERS
        classifiers.delete_at(best_index)
      end
      best
    end

    private

    def generate_random_classifiers(samples, number_of_features)
      classifiers = []
      statistics = features_satistics(samples)
      0.upto(Config::NUMBER_OF_RANDOM_CLASSIFIERS - 1) do
        feature_number = rand(number_of_features)
        info = statistics[feature_number]
        split = rand * info.rng + info.min
        classifiers << WeakClassifier.new(feature_number, split)
      end
      classifiers
    end

    def generate_all_possible_classifiers(samples, number_of_features)
      if @classifiers_cache.size == 0
        matrix = []
        0.upto(number_of_features - 1) do
          matrix << []
        end
        samples.each do |sample|
          0.upto(number_of_features - 1) do |i|
            sample_value = sample[i]
            matrix[i] << sample_value
          end
        end
        matrix.each_with_index do |entry, i|
          entry = entry.uniq
          entry.each do |uniq_value|
            @classifiers_cache << WeakClassifier.new(i, uniq_value)
          end
        end
      end
      @classifiers_cache
    end
  end
end

:generate_all_possible_classifiers

In [64]:
###########################################################################

In [None]:
def train_four_classifier train_db
  model = nil
#   dataset = create_dataset train_db, sql
  dataset = extract_features train_db
  
  fill_zero dataset
  dataset = normalize dataset
  
  obj = LogisticRegressionModelL2.new 0.1
  weights = Hash.new {|h,k| h[k] = (rand * 0.1) - 0.05}
  model = StochasticGradientDescent.new obj, weights, 0.1
  
  batch_size = 50
  total_iter = dataset.length / batch_size
  
  total_iter.times do
    batch_data = dataset.sample(batch_size)
    model.update(batch_data)
  end
  return model
end

In [None]:
def eval_four_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  dataset = extract_features db
  fill_zero dataset
  dataset = normalize dataset
  dataset.each do |row|
    obj = model.objective
    predictions[row["id"]] = obj.predict(row, model.weights)
  end
  #END YOUR CODE
  return predictions
end

In [None]:
def test_51_1
  model = train_four_classifier $train_db
  predictions = eval_four_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.75, "AUC #{auc} > 0.75")
  assert_true(auc < 1.0, "AUC #{auc} < 1.0")
  plot_roc_curve(fp, tp, auc).show()
end
test_51_1()

In [None]:
"Evaluation on test set after submission"

## Question 6.1 (10 Points)

Implement a classifier that achieves an AUC, $a$, where $a > 0.8$. Note that this is as high as the winning Kaggle competitor's score.

Note: **do not use the target label in your training or evaluation**

In [None]:
def train_five_classifier train_db
  model = nil
  # BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
  return model
end

In [None]:
def eval_five_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
  return predictions
end

In [None]:
def test_61_1
  model = train_five_classifier $train_db
  predictions = eval_five_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.8, "AUC #{auc} > 0.8")
  assert_true(auc < 1.0, "AUC #{auc} < 1.0")
  plot_roc_curve(fp, tp, auc).show()
end
test_61_1()

In [None]:
"Evaluation on test set after submission"