# CS6140 Assignments

**Instructions**
1. In each assignment cell, look for the block:
 ```
  #BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
 ```
1. Replace this block with your solution.
1. Test your solution by running the cells following your block (indicated by ##TEST##)
1. Click the "Validate" button above to validate the work.

**Notes**
* You may add other cells and functions as needed
* Keep all code in the same notebook
* In order to receive credit, code must "Validate" on the JupyterHub server

---

# Final Project: Part 3 - Classifiers

In [Part 2](../part-2.ipynb) we implemented basic features and calculated their Information Gain. Now we will use these features to train some models. Your assignment will be graded based on performance on a **test** database with the same schema, which is not provided to you. The Validate step applies to the **training** and **dev** databases. 

As you discover new features, go back to [Part 2](../part-2.ipynb) and add them to demonstrate good information gain. Note the following requirements:

1. Focus on data preparation and creating, normalizing, understanding new features.
1. Use your implementation of models.
1. Avoid creating new models that we did not cover in the class. There is no credit for fancy models.
1. Do not use the target label, or anything that is derived based on the training label as features. 
1. You may talk to other students about your solution, but do not share code. 

Also, here are some hints:

* Use sampling when training on the **training** databases. There is no requirement to use everything.

In [1]:
require './assignment_lib'
dir = "/home/dataset"

$train_db = SQLite3::Database.new "#{dir}/credit_risk_data_train.db", results_as_hash: true, readonly: true
$dev_db = SQLite3::Database.new "#{dir}/credit_risk_data_dev.db", results_as_hash: true, readonly: true

"if(window['d3'] === undefined ||\n   window['Nyaplot'] === undefined){\n    var path = {\"d3\":\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\",\"downloadable\":\"http://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\"};\n\n\n\n    var shim = {\"d3\":{\"exports\":\"d3\"},\"downloadable\":{\"exports\":\"downloadable\"}};\n\n    require.config({paths: path, shim:shim});\n\n\nrequire(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\n\n\tvar script = d3.select(\"head\")\n\t    .append(\"script\")\n\t    .attr(\"src\", \"http://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\")\n\t    .attr(\"async\", true);\n\n\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\n\n\n\t    var event = document.createEvent(\"HTMLEvents\");\n\t    event.initEvent(\"load_nyaplot\",false,false);\n\t    win

#<SQLite3::Database:0x0000000002035078 @tracefunc=nil, @authorizer=nil, @encoding=nil, @busy_handler=nil, @collations={}, @functions={}, @results_as_hash=true, @type_translation=nil, @readonly=true>

## Question 1.1 (10 Points)

Implement a random classifier, one in which the score is a random number. Verify the AUC is 0.5 on training, dev, and test sets. 

Each training method should create features as needed from the provided database. The result must comply with the following format:

```ruby
predictions = Hash.new
predictions[12345] = score    
```

Note that predictions is an integer, so the output looks like this:

```ruby
predictions = {
    12345 => 0.9
}
```


In [251]:
def create_dataset db, sql
  start_time = Time.new
  puts "create_dataset Start Time : " + start_time.inspect
  dataset = []
  db.execute sql do |row|
    # BEGIN YOUR CODE
#     puts row
    data_point = Hash.new
    data_point["features"] = Hash.new
    row.keys.each do |key|
      if key.is_a? String
        if key == "SK_ID_CURR"
          data_point["id"] = row[key]
        elsif key == "TARGET"
          data_point["label"] = row[key]
        else
          data_point["features"][key.downcase] = row[key]
        end
      end
    end
    dataset << data_point
    #END YOUR CODE
  end
  end_time = Time.new
  puts "create_dataset End Time : " + end_time.inspect
#   puts dataset[0]
  return dataset
end

:create_dataset

In [328]:
def class_distribution dataset
  # BEGIN YOUR CODE
  res = Hash.new {|h,k| h[k] = 0}
#   puts "start grouping"
  group = dataset.group_by{|p| p["label"]}
#   puts "end grouping"
  group.keys.each do |k|
    res[k] = group[k].length * 1.0 / dataset.length
  end
  return res
  #END YOUR CODE
end

:class_distribution

In [253]:
def entropy dist
  # BEGIN YOUR CODE
  entropy = 0
  sum = 0
  dist.keys.each do |k|
    sum += dist[k]
  end
  dist.keys.each do |k|
    pro = dist[k] * 1.0 / sum
    if pro != 0
      entropy -= pro * Math.log(pro)
    end
  end
  return entropy
  #END YOUR CODE
end

:entropy

In [254]:
def find_split_point_numeric x, h0, fname
  # BEGIN YOUR CODE
  x.each do |p|
    if !p["features"].key?(fname) || p["features"][fname] == ""
      p["features"][fname] = 0
    end
  end
  sorted_x = x.sort_by{|p| p["features"][fname]}
  farray = sorted_x.map{|p| p["features"][fname]}.uniq
  farray << farray[farray.length - 1] + 1
  tindex = 0
  best_threshhood = farray[tindex]
  best_ig = 0
  index = 0
  while index < sorted_x.length do
    while (index < sorted_x.length &&sorted_x[index]["features"][fname] < farray[tindex]) do
      index += 1
    end
    tindex += 1
    threshhood = farray[tindex - 1]
    split = {"l" => sorted_x[0, index], "r" => sorted_x[index, sorted_x.length]}
    ig = information_gain(h0, split)
    if ig > best_ig
      best_ig = ig
      best_threshhood = threshhood
    end
  end
  return [best_threshhood, best_ig]
  #END YOUR CODE
end

:find_split_point_numeric

In [255]:
def information_gain h0, splits
  # BEGIN YOUR CODE
  ig = h0
  count = 0
  splits.keys.each do |k|
    count += splits[k].length
  end
  splits.keys.each do |k|
    subset_entropy = entropy(class_distribution(splits[k]))
    ig -= splits[k].length * subset_entropy / count
  end
  return ig
  #END YOUR CODE
end

:information_gain

In [336]:
def extract_features db
  dataset = []
  # BEGIN YOUR CODE
  sql0 =  "select * from application_train limit 1000"
  sql1 = "select TARGET, SK_ID_CURR, OWN_CAR_AGE, EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3, APARTMENTS_AVG, BASEMENTAREA_AVG, YEARS_BEGINEXPLUATATION_AVG, YEARS_BUILD_AVG, COMMONAREA_AVG, ELEVATORS_AVG, ENTRANCES_AVG, FLOORSMAX_AVG, FLOORSMIN_AVG, LANDAREA_AVG, LIVINGAPARTMENTS_AVG, LIVINGAREA_AVG, NONLIVINGAPARTMENTS_AVG, NONLIVINGAREA_AVG, APARTMENTS_MODE, BASEMENTAREA_MODE, YEARS_BEGINEXPLUATATION_MODE, YEARS_BUILD_MODE, COMMONAREA_MODE, ELEVATORS_MODE, ENTRANCES_MODE, FLOORSMAX_MODE, FLOORSMIN_MODE, LANDAREA_MODE, LIVINGAPARTMENTS_MODE, LIVINGAREA_MODE, NONLIVINGAPARTMENTS_MODE, NONLIVINGAREA_MODE, APARTMENTS_MEDI, BASEMENTAREA_MEDI, YEARS_BEGINEXPLUATATION_MEDI, YEARS_BUILD_MEDI, COMMONAREA_MEDI, ELEVATORS_MEDI, ENTRANCES_MEDI, FLOORSMAX_MEDI, FLOORSMIN_MEDI, LANDAREA_MEDI, LIVINGAPARTMENTS_MEDI, LIVINGAREA_MEDI, NONLIVINGAPARTMENTS_MEDI, NONLIVINGAREA_MEDI, FONDKAPREMONT_MODE, HOUSETYPE_MODE, TOTALAREA_MODE, WALLSMATERIAL_MODE, EMERGENCYSTATE_MODE, AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_MON, AMT_REQ_CREDIT_BUREAU_QRT, AMT_REQ_CREDIT_BUREAU_YEAR
            from application_train"
  sql2 = "select TARGET, SK_ID_CURR, OWN_CAR_AGE, EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3,NONLIVINGAPARTMENTS_MEDI
            from application_train"
  dataset = create_dataset(db, sql0)
  #END YOUR CODE
  return dataset
end

:extract_features

In [257]:
def train_random_classifier train_db
  model = nil
  # BEGIN YOUR CODE

  #END YOUR CODE
  return model
end

:train_random_classifier

In [258]:
def eval_random_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  dataset = extract_features db
  dataset.each do |p|
    predictions[p["id"]] = rand > 0.5 ? 1 : 0
  end
  #END YOUR CODE
  return predictions
end

:eval_random_classifier_on

In [259]:
def test_11_1
  model = train_random_classifier $train_db
  predictions = eval_random_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.45, "AUC #{auc} > 0.45")
  assert_true(auc < 0.55, "AUC #{auc} < 0.55")
  plot_roc_curve(fp, tp, auc).show()
end
test_11_1()

create_dataset Start Time : 2019-04-18 03:39:14 +0000
create_dataset End Time : 2019-04-18 03:39:15 +0000


Test::Unit::AssertionFailedError: <15334> expected but was
<1000>.

In [260]:
"Evaluation on test set after submission"

"Evaluation on test set after submission"

## Question 1.2 (10 Points)

Implement a "perfect" classifier, one in which the score is the class label. You should not use the class label as a feature, but if you do, then your performance will be too good to be true.

In [261]:
def train_perfect_classifier train_db
  model = nil
  # BEGIN YOUR CODE

  #END YOUR CODE
  return model
end

:train_perfect_classifier

In [262]:
def eval_perfect_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  dataset = extract_features db
  dataset.each do |p|
    predictions[p["id"]] = p["label"]
  end
  #END YOUR CODE
  return predictions
end

:eval_perfect_classifier_on

In [263]:
def test_12_1
  model = train_perfect_classifier $train_db
  predictions = eval_perfect_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.95, "AUC #{auc} > 0.95")
  assert_true(auc < 1.05, "AUC #{auc} < 1.05")
  plot_roc_curve(fp, tp, auc).show()
end
test_12_1()

create_dataset Start Time : 2019-04-18 03:39:18 +0000
create_dataset End Time : 2019-04-18 03:39:18 +0000


Test::Unit::AssertionFailedError: <15334> expected but was
<1000>.

In [264]:
"Evaluation on test set after submission"

"Evaluation on test set after submission"

## Question 2.1 (15 Points)

Implement a classifier that achieves an auc, $a$, in range $a\in (0.5, 0.6)$. This demonstrates that you have discovered some moderately useful features and can control how your model performs. You will not receive any extra points for a model which performs better than 0.6 in this question. 

Note: **do not use the target label in your training or evaluation**

In [265]:
class NumericSplit
  attr_reader :fname, :value
  def initialize fname, value
    @fname = fname
    @value = value
    @lpath = "#{@fname} < #{@value}"
    @rpath = "#{@fname} >= #{@value}"
  end
  
  def to_s
    "Numeric[#{@fname} <=> #{@value}]"
  end
  
  def split dataset
    # BEGIN YOUR CODE
    return dataset.select{|row| row["features"].key?(@fname)}.group_by {|row| row["features"][@fname] < @value ? @lpath : @rpath}
    #END YOUR CODE
  end
  
  def test x
    # BEGIN YOUR CODE
    test_value = x["features"][@fname]
    test_value < @value ? @lpath : @rpath
    #END YOUR CODE
  end
end

:test

In [266]:
class CategoricalSplit
  attr_reader :fname
  def initialize fname
    @fname = fname
  end
  
  def split dataset
    # BEGIN YOUR CODE
    return dataset.select{|row| row["features"].key?(@fname)}.group_by {|row| row["features"][@fname]}
    #END YOUR CODE
  end
  
#   def test x
#     # BEGIN YOUR CODE
#     test_value = x["features"][@fname]
#     test_value < @value ? @lpath : @rpath
#     #END YOUR CODE
#   end
end

:split

In [338]:
class NumericSplitter
  def new_split dataset, initial_entropy, fname
    # BEGIN YOUR CODE
    # check the feature is numeric or categorical
#     puts fname
    if matches?(dataset, fname)
#       puts "#{fname} is Numberic"
      dataset.each do |p|
        if !p["features"].key?(fname) || p["features"][fname] == ""
          p["features"][fname] = 0
        end
      end
#       puts dataset.map{|p| p["features"][fname]}
      sorted_dataset = dataset.sort_by{|p| p["features"][fname]}
      farray = sorted_dataset.map{|p| p["features"][fname]}.uniq
      farray << farray[farray.length - 1] + 1
     if fname = "ext_source_1" || fname = "ext_source_2"
       puts farray.length
     end
      tindex = 0
      best_threshhood = farray[tindex]
      best_ig = 0
      index = 0
      while index < sorted_dataset.length do
        while (index < sorted_dataset.length &&sorted_dataset[index]["features"][fname] < farray[tindex]) do
          index += 1
        end
        tindex += 1
        threshhood = farray[tindex - 1]
        split = {"l" => sorted_dataset[0, index], "r" => sorted_dataset[index, sorted_dataset.length]}
        ig = information_gain(initial_entropy, split)
        if ig > best_ig
          best_ig = ig
          best_threshhood = threshhood
        end
      end
      t_max = best_threshhood
      ig_max = best_ig
      return [NumericSplit.new(fname, t_max), ig_max]
    else
#       puts "#{fname} is String"
      dataset.each do |p|
        if p["features"][fname] == ""
          p["features"][fname] = "miss"
        end
      end
      split = dataset.group_by {|r| r["features"].fetch(fname)}
      ig_max = information_gain(initial_entropy,split)
      return [CategoricalSplit.new(fname), ig_max]
    end
    #END YOUR CODE
  end
  
  def matches? x, fname
    x.each do |p|
#       puts p
      if p["features"].key?(fname) && p["features"][fname] != ""
        return p["features"][fname].is_a?(Numeric)
      end
    end
#     x.all? {|r| r["features"].fetch(fname, 0.0).is_a?(Numeric)}
  end
end

:matches?

In [326]:
class DecisionTree
  attr_reader :tree, :h0
  
  def initialize splitters, min_size, max_depth
    @splitters = splitters
    @min_size = min_size
    @max_depth = max_depth
  end
  
  def init_dataset dataset
    @dataset = dataset
    @header = @dataset["features"]
    @c_dist = class_distribution @dataset["data"]
    @h0 = entropy @c_dist
    @tree = {n: @dataset["data"].size, entropy: @h0, dist: @c_dist, split: nil, children: {}}    
  end
  
  def find_best_split dataset, initial_entropy
    # BEGIN YOUR CODE
    splitter = NumericSplitter.new 
    best_split = [nil, 0]
    puts @header
    @header.each do |feature|
      puts feature
      new_split = splitter.new_split dataset, initial_entropy, feature
#       puts new_split[1], best_split[1]
      if new_split[1] > best_split[1]
        best_split = new_split
      end
    end
    return best_split
    #END YOUR CODE
  end
  
  def train dataset
    init_dataset dataset
    build_tree @dataset["data"], @tree, @max_depth
  end

  def build_tree x, root, max_depth
    # BEGIN YOUR CODE
    if root[:dist].keys.length <= 1
      return root
    end
    if x.length <= @min_size
      return root
    end
    if max_depth <= 1
      return root
    end
    best_split, best_ig = find_best_split x, root[:entropy]
    if best_split == nil
      return root
    end
    if best_split.is_a? NumericSplit
      puts "#{best_split.fname} is numberic"
      splited = best_split.split(x)
      root[:split] = best_split
      paths = []
      splited.keys.each do |path|
        paths << path
      end
#       puts paths
      subset0 = splited[paths[0]]
      subset1 = splited[paths[1]]
      node0 = {n: subset0.size, entropy: entropy(class_distribution(subset0)), dist: class_distribution(subset0), split: nil, children: {}} 
      node1 = {n: subset1.size, entropy: entropy(class_distribution(subset1)), dist: class_distribution(subset1), split: nil, children: {}}
      root[:children][paths[0]] = node0
      root[:children][paths[1]] = node1
      build_tree subset0, node0, max_depth - 1
      build_tree subset1, node1, max_depth - 1
    else
      puts "#{best_split.fname} is categorical"
      splited = best_split.split(x)
      root[:split] = best_split
#       paths = []
#       splited.keys.each do |path|
#         paths << path
#       end
#       puts paths
      splited.keys.each do |path|
        subset = splited[path]
        node = {n: subset.size, entropy: entropy(class_distribution(subset)), dist: class_distribution(subset), split: nil, children: {}} 
        root[:children][path] = node
        build_tree subset, node, max_depth - 1
      end
    end
    #END YOUR CODE
  end
  
  def predict x
    return eval_tree x, @tree
  end
  
  def eval_tree x, root
    # BEGIN YOUR CODE
    if root[:split] == nil
     return root[:dist][1]
    end
    fname = root[:split].fname
#     puts fname, x["features"][fname], x["features"][fname] == ""
    if !x["features"].key?(fname) || x["features"][fname] == ""
        x["features"][fname] = (root[:split].is_a? NumericSplit) ? 0 : "miss"
    end
    if root[:split].is_a? NumericSplit
      l_path = fname + " < " + root[:split].value.to_s
      r_path = fname + " >= " + root[:split].value.to_s
      if x["features"].key?(fname) && x["features"][fname] >= root[:split].value
        return eval_tree(x, root[:children][r_path])
      else
        return eval_tree(x, root[:children][l_path])
      end
    else
      path = x["features"][fname]
      if !root[:children].keys.include?(path)
         return root[:dist][0] >= root[:dist][1] ? 0 : 1
      end
      return eval_tree(x, root[:children][path])
    end
    #END YOUR CODE
  end
end

:eval_tree

In [309]:
# def numerize dataset
#   numerized_dataset = Hash.new
#   numerized_dataset["features"] = Array.new
#   numerized_dataset["data"] = Array.new
#   dataset.each do |p|
#     numerized_point = Hash.new
#     numerized_point["id"] = p["id"]
#     numerized_point["label"] = p["label"]
#     p["features"].keys.each do |k|
#       if 
#     end
#   end
#   return numerized_dataset
# end

In [310]:
  def matches? x, fname
    x.each do |p|
      if p["features"].key?(fname) && p["features"][fname] != ""
        return p["features"][fname].is_a?(Numeric)
      end
    end
#     x.all? {|r| r["features"].fetch(fname, 0.0).is_a?(Numeric)}
  end

:matches?

In [311]:
dataset_data = extract_features $train_db
dataset = Hash.new
dataset_data.first
dataset["features"] = dataset_data.first["features"]
matches?(dataset_data, "wallsmaterial_mode")

create_dataset Start Time : 2019-04-18 04:39:22 +0000
create_dataset End Time : 2019-04-18 04:39:22 +0000


false

In [332]:
def train_one_classifier train_db
  model = nil
  # BEGIN YOUR CODE
  model = DecisionTree.new [NumericSplitter.new], 100, 20
  dataset_data = extract_features train_db
  puts dataset_data.length
  dataset = Hash.new
  dataset["features"] = dataset_data.first["features"].keys
  dataset["data"] = dataset_data
  start_time = Time.new
  puts "Train Start Time : " + start_time.inspect
  model.train dataset
  end_time = Time.new
  puts "Train End Time : " + end_time.inspect
  #END YOUR CODE
  return model
end

:train_one_classifier

In [333]:
def eval_one_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  tree = model.tree
  dataset_data = extract_features db
  dataset_data.each do |p|
    id = p["id"]
    predictions[id] = model.predict(p)
  end
  #END YOUR CODE
  return predictions
end

:eval_one_classifier_on

In [339]:
def test_21_1
  model = train_one_classifier $train_db
#   puts "Multi-level Tree", JSON.pretty_generate(model.tree)
  predictions = eval_one_classifier_on $dev_db, model
#   assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  puts scores[0,10]
#   assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
#   assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.5, "AUC #{auc} > 0.5")
  assert_true(auc < 0.6, "AUC #{auc} < 0.6")
  plot_roc_curve(fp, tp, auc).show()
end
test_21_1()

create_dataset Start Time : 2019-04-18 05:40:06 +0000
create_dataset End Time : 2019-04-18 05:40:06 +0000
1000
Train Start Time : 2019-04-18 05:40:06 +0000
["name_contract_type", "code_gender", "flag_own_car", "flag_own_realty", "cnt_children", "amt_income_total", "amt_credit", "amt_annuity", "amt_goods_price", "name_type_suite", "name_income_type", "name_education_type", "name_family_status", "name_housing_type", "region_population_relative", "days_birth", "days_employed", "days_registration", "days_id_publish", "own_car_age", "flag_mobil", "flag_emp_phone", "flag_work_phone", "flag_cont_mobile", "flag_phone", "flag_email", "occupation_type", "cnt_fam_members", "region_rating_client", "region_rating_client_w_city", "weekday_appr_process_start", "hour_appr_process_start", "reg_region_not_live_region", "reg_region_not_work_region", "live_region_not_work_region", "reg_city_not_live_city", "reg_city_not_work_city", "live_city_not_work_city", "organization_type", "ext_source_1", "ext_sourc

ArgumentError: comparison of String with 1 failed

In [246]:
def test_21_1
  model = train_one_classifier $train_db
  predictions = eval_one_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.5, "AUC #{auc} > 0.5")
  assert_true(auc < 0.6, "AUC #{auc} < 0.6")
  plot_roc_curve(fp, tp, auc).show()
end
test_21_1()

create_dataset Start Time : 2019-04-18 03:31:04 +0000
create_dataset End Time : 2019-04-18 03:31:04 +0000
1000
Train Start Time : 2019-04-18 03:31:04 +0000
organization_type is categorical
ext_source_3 is numberic
["ext_source_3 < 0.4668640059537032", "ext_source_3 >= 0.4668640059537032"]
occupation_type is categorical
obs_30_cnt_social_circle is categorical
occupation_type is categorical
obs_30_cnt_social_circle is categorical
Train End Time : 2019-04-18 03:31:08 +0000
create_dataset Start Time : 2019-04-18 03:31:08 +0000
create_dataset End Time : 2019-04-18 03:31:08 +0000
1000
new eval
false
#<CategoricalSplit:0x0000000009c49b50>
continue eval
new eval
false
Numeric[ext_source_3 <=> 0.4668640059537032]
continue eval
new eval
false
#<CategoricalSplit:0x000000000a3c11a8>
continue eval
new eval
false

root[:split] == nil
{0=>0.7333333333333333, 1=>0.26666666666666666}
prediction finished
new eval
false
#<CategoricalSplit:0x0000000009c49b50>
continue eval
new eval
true


NoMethodError: undefined method `[]' for nil:NilClass

In [None]:
"Evaluation on test set after submission"


## Question 3.1 (25 Points)

Implement a classifier that achieves an auc, $a$, in range $a\in (0.6, 0.7)$. This demonstrates that you have discovered some moderately useful features and can control how your model performs. You will not receive any extra points for a model which performs better than 0.7 in this question. 

Note: **do not use the target label in your training or evaluation**

In [None]:
def train_two_classifier train_db
  model = nil
  # BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
  return model
end

In [None]:
def eval_two_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
  return predictions
end

In [None]:
def test_31_1
  model = train_two_classifier $train_db
  predictions = eval_two_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.6, "AUC #{auc} > 0.6")
  assert_true(auc < 0.7, "AUC #{auc} < 0.7")
  plot_roc_curve(fp, tp, auc).show()
end
test_31_1()

In [None]:
"Evaluation on test set after submission"

## Question 4.1 (20 Points)

Implement a classifier that achieves an auc, $a$, where $a > 0.7$. This demonstrates that you have discovered some interesting features and tuned some algorithms well.  

Note: **do not use the target label in your training or evaluation**

In [None]:
def train_three_classifier train_db
  model = nil
  # BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
  return model
end

In [None]:
def eval_three_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
  return predictions
end

In [None]:
def test_41_1
  model = train_three_classifier $train_db
  predictions = eval_three_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.7, "AUC #{auc} > 0.7")
  assert_true(auc < 1.0, "AUC #{auc} < 1.0")
  plot_roc_curve(fp, tp, auc).show()
end
test_41_1()

In [None]:
"Evaluation on test set after submission"

## Question 5.1 (15 Points)

Implement a classifier that achieves an auc, $a$, where $a > 0.75$. Note that this is higher than the average Kaggle competition result, so get creative.

Note: **do not use the target label in your training or evaluation**

In [None]:
def train_four_classifier train_db
  model = nil
  # BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
  return model
end

In [None]:
def eval_four_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
  return predictions
end

In [None]:
def test_51_1
  model = train_four_classifier $train_db
  predictions = eval_four_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.75, "AUC #{auc} > 0.7")
  assert_true(auc < 1.0, "AUC #{auc} < 1.0")
  plot_roc_curve(fp, tp, auc).show()
end
test_51_1()

In [None]:
"Evaluation on test set after submission"

## Question 6.1 (10 Points)

Implement a classifier that achieves an auc, $a$, where $a > 0.8$. Note that this as high as the winning Kaggle competitor.

Note: **do not use the target label in your training or evaluation**

In [None]:
def train_five_classifier train_db
  model = nil
  # BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
  return model
end

In [None]:
def eval_five_classifier_on db, model
  predictions = Hash.new
  # BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
  return predictions
end

In [None]:
def test_61_1
  model = train_five_classifier $train_db
  predictions = eval_five_classifier_on $dev_db, model
  assert_equal 15334, predictions.size
  scores = get_labels_for $dev_db, predictions
  assert_equal 15334, scores.size
  
  fp, tp, auc = roc_curve scores
  
  assert_equal(15334 + 1, fp.size, "Get all the points")
  assert_true(auc > 0.8, "AUC #{auc} > 0.8")
  assert_true(auc < 1.0, "AUC #{auc} < 1.0")
  plot_roc_curve(fp, tp, auc).show()
end
test_61_1()

In [None]:
"Evaluation on test set after submission"