# CS6140 Assignments

**Instructions**
1. In each assignment cell, look for the block:
 ```
  #BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
 ```
1. Replace this block with your solution.
1. Test your solution by running the cells following your block (indicated by ##TEST##)
1. Click the "Validate" button above to validate the work.

**Notes**
* You may add other cells and functions as needed
* Keep all code in the same notebook
* In order to receive credit, code must "Validate" on the JupyterHub server

---

# Final Project: Part 2 - Feature Extraction


In any practical machine learning problem, the data preparation and feature extraction stages are the most important and time-consuming. The final project exposes you to a real-world dataset. In this part of the final project, you are responsible for creating features that will be meaningful for prediction. Features are evaluated based on Information Gain, which you implemented in [Assignment 2](../assignment-2/assignment-2.ipynb).

Here is what will work well in this project:

* Extract some sample data, load it in [R](https://www.r-project.org), and do some initial analysis. Feel free to build models there to get a feel for the best features.
* Join the different tables--they are there for a reason. 
* Get creative.
* Read some of the Kaggle competition forums and kernels. 

Here is what will NOT work:

* Do not use only the features as provided in application_train.
* Do not try implementing a new learning algorithm in order to generate features. If you find something that works, investigate which features were helpful and add those features.
* Do not build lookup tables, "embeddings", or other things you might have read about that were not covered in class.
* Do not try to build a kernel matrix on all pairs. Re-evaluate the kernel instead.

In [9]:
require './assignment_lib'

true

In [10]:
dir = "/home/dataset"
$dev_db = SQLite3::Database.new "#{dir}/credit_risk_data_dev.db", results_as_hash: true, readonly: true

#<SQLite3::Database:0x0000000002c23650 @tracefunc=nil, @authorizer=nil, @encoding=nil, @busy_handler=nil, @collations={}, @functions={}, @results_as_hash=true, @type_translation=nil, @readonly=true>

## Question 1.1 (10 Points)

Implement ```create_dataset``` which runs an SQL query on a database and constructs a dataset like those we have used in this course. Add an ```id``` field for the ```SK_ID_CURR``` and store the ```TARGET``` in ```label```. 

If the query is:
```sql
select sk_id_curr, target, ext_source_1 from application_train  where ext_source_1 <> '' order by sk_id_curr;
```

then the result is:

```json
[
    {"label":1,"id":100002,"features":{"ext_source_1":0.08303696739132256}},
    {"label":0,"id":100015,"features":{"ext_source_1":0.7220444501416448}}
]
...
```
Note the features should not include the ID or Label. Feature keys should be lowercase and only contain keys for which ```key.is_a? String``` returns true.
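
The conversion described above can be sketched on a hand-written hash, without touching the database. This is a minimal illustration, assuming that a row returned with `results_as_hash: true` may carry integer index keys alongside the column names (some versions of the `sqlite3` gem behave this way). The helper name `row_to_record` and the sample row are hypothetical and not part of the assignment library.

```ruby
# Hypothetical helper: convert one result row into the course's record format.
def row_to_record(row)
  record = { "features" => {} }
  row.each do |key, value|
    next unless key.is_a?(String)   # skip positional (integer) keys
    case key.upcase
    when "SK_ID_CURR" then record["id"] = value
    when "TARGET"     then record["label"] = value
    else record["features"][key.downcase] = value
    end
  end
  record
end

# Hand-written stand-in for a row returned by the query above (assumed shape).
row = { "SK_ID_CURR" => 100002, "TARGET" => 1, "EXT_SOURCE_1" => 0.083,
        0 => 100002, 1 => 1, 2 => 0.083 }
record = row_to_record(row)
p record
```

Matching on `key.upcase` keeps the sketch independent of the column-name case that the driver reports.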



In [11]:
def create_dataset db, sql
  dataset = []
  db.execute(sql) do |row|
    record = Hash.new
    
    if row.key?("SK_ID_CURR")
      record["id"] = row["SK_ID_CURR"]
    end
    if row.key?("TARGET")
      record["label"] = row["TARGET"]
    end
    
    record["features"] = Hash.new
    
    row.keys.each do |key|
      if key != "SK_ID_CURR" && key != "TARGET" && !key.is_number?
        record["features"][key.downcase] = row[key]
      end
    end
    dataset << record
  end
 
  dataset
end

class Object
  def is_number?
    to_f.to_s == to_s || to_i.to_s == to_s
  end
end


:is_number?

In [12]:
def test_11()
  dataset = create_dataset $dev_db, "select sk_id_curr, target, ext_source_1 from application_train where ext_source_1 <> '' 
order by sk_id_curr limit 37"
  assert_equal 37, dataset.size
  assert_true(dataset[0]["features"].has_key? "ext_source_1")
  assert_equal(1, dataset[0]["features"].size)
  assert_equal(100002, dataset[0]["id"])  
  assert_in_delta(0.08303696, dataset[0]["features"]["ext_source_1"], 1e-4)
  assert_equal(1, dataset[0]["label"])    
end

test_11()

## Question 1.2 (20 points)

Copy and revise **your** information gain calculation for numeric and categorical features from [Assignment 2](../assignment-2/assignment-2.ipynb). Copy the following implementations:

* Class Distribution
* Entropy
* Information Gain after splitting
* Information gain for numerical features (fast version)
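
For reference, the quantities these functions compute can be written as follows. The notation is an assumption made for this summary: $p_c$ is the fraction of rows in $S$ with label $c$, $S_v$ is the subset of rows that fall into split $v$, and the logarithm is natural, which appears to match the expected values in the tests.

$$
H(S) = -\sum_{c} p_c \ln p_c
\qquad
IG(S) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)
$$

For a numeric feature $f$ with threshold $t$, the two splits are the rows with $f < t$ and the rows with $f \ge t$:

$$
IG(S, f, t) = H(S) - \frac{|S_{f<t}|}{|S|}\, H(S_{f<t}) - \frac{|S_{f \ge t}|}{|S|}\, H(S_{f \ge t})
$$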


In [13]:
def class_distribution dataset
  # BEGIN YOUR CODE
  class_group = dataset.group_by{|row| row["label"]}
  class_dist = Hash.new
  ## size of rows
  total_size = dataset.size
  class_num = class_group.size
  
  class_group.each_key do |key|
    class_dist[key] = (class_group[key].size.to_f / total_size.to_f).to_f
  end
  class_dist
  #END YOUR CODE
end

:class_distribution

In [14]:
def entropy dist
  # BEGIN YOUR CODE
  sum = dist.values.reduce(0.0, :+)
  entropy = 0.0
  dist.values.each do |d|
    if d != 0
      entropy -= (d / sum) * Math.log(d / sum)
    end
  end
  entropy
  #END YOUR CODE
end

:entropy

In [15]:
def test_12_1()
  # Check the entropy of the class distribution
  dataset = create_dataset $dev_db, "select target, sk_id_curr, ext_source_1, flag_own_car from application_train where ext_source_1 <> ''"
  dist = class_distribution dataset
  h0 = entropy dist
  assert_in_delta(0.2686201883261589, h0, 1e-3)
end

test_12_1()

In [16]:
def information_gain h0, splits
  # BEGIN YOUR CODE
  information_gain = 0.0
  
  total_size = 0.0
  splits.each do |key, value|
    total_size += value.size
  end

  entropy_sum = 0.0
  splits.each do |key, value|
    class_dist = class_distribution value
    class_size = value.size
    class_entropy = entropy class_dist
    entropy_sum -= (class_size.to_f / total_size.to_f) * class_entropy.to_f
  end
  
  information_gain = h0 + entropy_sum
  information_gain
  
  #END YOUR CODE
end

:information_gain

In [17]:
def test_12_2()
  # Check the information gain of a categorical split
  dataset = create_dataset $dev_db, "select target, sk_id_curr, ext_source_1, flag_own_car from application_train where ext_source_1 <> ''"
  dist = class_distribution dataset
  h0 = entropy dist
  
  splits = dataset.group_by {|row| row["features"]["flag_own_car"]}
  ig = information_gain h0, splits
  assert_in_delta(0.0002206258541794237, ig, 1e-4)
end

test_12_2()

In [18]:
def find_split_point_numeric x, h0, fname
  # BEGIN YOUR CODE
  t_max = 0.0
  ig_max = 0.0
  
  split_l = Hash.new(0)
  split_r = Hash.new(0)
  
  sorted_x = x.sort_by{|row| row["features"].has_key?(fname) ? row["features"][fname] : 0}
  
  sorted_x.each do |row|
    split_r[row["label"]] += 1
  end
  size = sorted_x.size
  sorted_x.each_with_index do |row, index|
    split_l[row["label"]] += 1
    split_r[row["label"]] -= 1
    # The last row has no right-hand neighbour, so no threshold can follow it.
    break if index + 1 >= size
    # Only split between two distinct feature values.
    next if row["features"][fname] == sorted_x[index + 1]["features"][fname]

    p1 = (index + 1.0) / size
    p2 = (size - index - 1.0) / size
    ig = h0 - p1 * entropy(split_l) - p2 * entropy(split_r)
    if ig > ig_max
      ig_max = ig
      t_max = sorted_x[index + 1]["features"][fname]
    end
  end
  return [t_max, ig_max]
  #END YOUR CODE
end

:find_split_point_numeric

In [19]:
def test_12_3()
  # Check the information gain of the best numeric split
  dataset = create_dataset $dev_db, "select target, sk_id_curr, ext_source_1, flag_own_car from application_train where ext_source_1 <> ''"
  dist = class_distribution dataset
  h0 = entropy dist
  
  t, ig = find_split_point_numeric dataset, h0, "ext_source_1"
  assert_in_delta(0.009751743140812785, ig, 1e-4)
end

test_12_3()

## Question 2.1 (70 Points)

Using whatever external software you want (hosted on your own devices), provide 15+ different features that each have information gain >= 0.005. You may need to implement several cells below, so please insert them above the test. 

Features must only be derived from the database but you are free to write whatever SQL queries you want. You may create temporary tables, but the database is read-only.

Pay close attention to the following aspects of feature design:

* Normalization: Z-score, L2, Min-Max, etc.
* Sparsity / missing values
* Frequency: Information gain is easily fooled by features with many distinct values.
* Joins: Some of the best features in this dataset combine two columns from different tables.
* Transformations: One-hot, Binning, Discretization, Non-linear transformation
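
Two of the points above, frequency and binning, can be illustrated together on a toy example. The sketch below is self-contained and makes several assumptions: the helpers `label_entropy`, `categorical_ig`, `min_max`, and `equal_width_bin` are hypothetical stand-ins (not the assignment's functions), and the eight rows are invented. It treats a numeric column whose values are all distinct as if it were categorical, then repeats the calculation after min-max scaling and equal-width binning.

```ruby
# Entropy (natural log) of an array of labels.
def label_entropy(labels)
  n = labels.size.to_f
  labels.group_by(&:itself).values.sum do |group|
    frac = group.size / n
    -frac * Math.log(frac)
  end
end

# Information gain of treating `values` as categories for predicting `labels`.
def categorical_ig(values, labels)
  h0 = label_entropy(labels)
  groups = values.zip(labels).group_by(&:first).values
  remainder = groups.sum do |pairs|
    (pairs.size.to_f / labels.size) * label_entropy(pairs.map(&:last))
  end
  h0 - remainder
end

# Min-max scaling to [0, 1]; a constant column maps to 0.0.
def min_max(values)
  lo, hi = values.minmax
  return values.map { 0.0 } if hi == lo
  values.map { |v| (v - lo) / (hi - lo).to_f }
end

# Equal-width binning of values already scaled to [0, 1].
def equal_width_bin(scaled, bins)
  scaled.map { |v| [(v * bins).floor, bins - 1].min }
end

# Invented toy data: every amount is distinct, labels are 0/1.
amounts = [1.0, 2.5, 3.1, 4.8, 5.2, 6.9, 7.3, 8.4]
labels  = [1,   0,   1,   0,   0,   0,   1,   0]

raw_ig    = categorical_ig(amounts, labels)
binned    = equal_width_bin(min_max(amounts), 2)
binned_ig = categorical_ig(binned, labels)

puts "raw IG:    #{raw_ig}"     # every group is pure, so this equals the label entropy
puts "binned IG: #{binned_ig}"  # smaller, because each bin mixes both labels
```

The raw column reaches the maximum possible gain only because each value identifies a single row; the binned version gives a more honest estimate of how much the feature separates the classes.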

In [33]:
def mean data
  res = Hash.new
  count = Hash.new
  
  data.each do |record|
    record["features"].each do |key, value|
      if !res.key?(key)
        res[key] = 0.0
      end
      if !count.key?(key)
        count[key] = 0.0
      end
      res[key] += value.to_f
      count[key] += 1
    end
  end
  res.each do |key, value|
    res[key] = value / count[key]
  end
  res
end

:mean

In [34]:
def stdev data, mean
  res = Hash.new
  count = Hash.new
  
  data.each do |record|
    record["features"].each do |key, value|
      if !res.key?(key)
        res[key] = 0.0
      end
      if !count.key?(key)
        count[key] = 0.0
      end
      res[key] += (value - mean[key]) ** 2
      count[key] += 1
    end
  end
  res.each do |key, value|
    if count[key] > 1
      res[key] = Math.sqrt(value / (count[key] - 1))
    else
      res[key] = Math.sqrt(value)
    end
  end
  res
end

:stdev

In [35]:
def print_ds dataset
  dataset.each do |row|
    puts row
  end
end


:print_ds

In [36]:
def normalize dataset
  dataset_clone = dataset.collect do |r|
    u = r.clone
    u["id"] = r["id"].clone
    u["label"] = r["label"].clone
    u["features"] = r["features"].clone
    u
  end
  # BEGIN YOUR CODE
  zsp_data = dataset_clone
  zsp_mean = mean zsp_data
  zsp_stdev = stdev zsp_data, zsp_mean
  zsp_data.each do |record|
    record["features"].each do |key, value|
      if zsp_stdev[key] <= 0.0
        record["features"][key] = 0.0
      else
        record["features"][key] = (value - zsp_mean[key]) / zsp_stdev[key]
      end
    end
  end
  
  
  #END YOUR CODE
  return dataset_clone
end

:normalize

In [37]:
def fill_mean dataset
  mean_hash = mean dataset
  dataset.each do |row|
    if row.has_key? ("features")
      row["features"].each_key do |key|
        if row["features"][key] == nil or row["features"][key] == ""
          row["features"][key] = mean_hash[key]
        end
      end
    end
  end
end

:fill_mean

In [38]:
def fill_zero dataset
  dataset.each do |row|
    if row.has_key? ("features")
      row["features"].each_key do |key|
        if row["features"][key] == nil or row["features"][key] == ""
          row["features"][key] = 0
        end
      end
    end
  end
end

:fill_zero

In [39]:
def cal_ig_conti dataset, fname
  dist = class_distribution dataset
  h0 = entropy dist
  res = find_split_point_numeric dataset, h0, fname
  ig = res[1]
  ig
end

:cal_ig_conti

In [40]:
################################component features with two categorical in application_train###################################

def information_gain_2_cate f_1, f_2
  remix = "remix_str"

  sql = "select target, sk_id_curr, "+ f_1 + ", " + f_2 + " from application_train"

  dataset = create_dataset $dev_db, sql
  fill_zero dataset

  dataset.each do |row|
    row["features"][remix] = row["features"][f_1].to_s + row["features"][f_2].to_s
  end

  dist = class_distribution dataset
  h0 = entropy dist
  split = dataset.group_by {|row| row["features"][remix]}

  ig = information_gain h0, split
  ig
end

fea_arr = ["name_education_type", "code_gender", "flag_own_car", 
  "flag_own_realty", "name_contract_type", "name_type_suite", "name_income_type", "name_family_status",
  "name_housing_type", "occupation_type", "organization_type", "WEEKDAY_APPR_PROCESS_START"] 

ig = information_gain_2_cate "occupation_type", "organization_type"
puts ig
ig2 = information_gain_2_cate "name_education_type","organization_type"
puts ig2
ig3 = information_gain_2_cate "name_family_status","occupation_type"
puts ig3
# fea_arr.each_with_index do |feature, index|
#   (index + 1).upto(fea_arr.size - 1) do |value|
#     ig = information_gain_2_cate feature, fea_arr[value]
#   end
# end

# name_education_type occupation_type
# 0.005157590588092231
# name_education_type organization_type
# 0.008093865339290574
# code_gender organization_type
# 0.006276768835009772
# flag_own_car organization_type
# 0.0063802284520556585
# flag_own_realty organization_type
# 0.006339866788409865
# name_contract_type organization_type
# 0.0064926022053908294
# name_type_suite occupation_type
# 0.005998938711793467
# name_type_suite organization_type
# 0.01030555386398474
# name_income_type organization_type
# 0.007258178505345825
# name_family_status occupation_type
# 0.005765562896686671
# name_family_status organization_type
# 0.009801178453388792
# name_housing_type occupation_type
# 0.006723792431264697
# name_housing_type organization_type
# 0.01033458778398838
# occupation_type organization_type
# 0.019953193002315517


0.019953193002315517
0.008093865339290574
0.005765562896686671
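
Joining two category strings with plain concatenation, as `information_gain_2_cate` does, can in principle map two different pairs to the same combined key. The sketch below shows the effect on invented values and one possible guard: an explicit separator. The helper `combine_categories` and the `"|"` separator are assumptions for illustration, and the separator must not occur inside the category values themselves.

```ruby
# Hypothetical helper: join two categorical values with an explicit separator.
def combine_categories(a, b, sep = "|")
  "#{a}#{sep}#{b}"
end

# Invented values: two different pairs that collide under plain concatenation.
pair_1 = ["Sales", " staff Trade"]
pair_2 = ["Sales staff", " Trade"]

plain_1 = pair_1[0].to_s + pair_1[1].to_s
plain_2 = pair_2[0].to_s + pair_2[1].to_s
puts plain_1 == plain_2   # plain concatenation cannot tell the pairs apart

sep_1 = combine_categories(*pair_1)
sep_2 = combine_categories(*pair_2)
puts sep_1 == sep_2       # with a separator the pairs stay distinct
```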


In [41]:
#single feature with weights

sql = "select target, sk_id_curr, (2 * ext_source_3 + 3 * ext_source_2 + ext_source_1) as remix from application_train"

dataset = create_dataset $dev_db, sql
fill_zero dataset
puts dataset.size
dist = class_distribution dataset
h0 = entropy dist
res = find_split_point_numeric dataset, h0, "remix"
ig = res[1]
ig

15334


0.011790544823178845

In [96]:
#####################component numeric feature from join main and previous#########################

# fea_p_arr = ["P.DAYS_FIRST_DRAWING", "P.DAYS_FIRST_DUE", "P.DAYS_LAST_DUE_1ST_VERSION",
#   "P.DAYS_LAST_DUE", "P.DAYS_TERMINATION", "P.NFLAG_INSURED_ON_APPROVAL", "P.AMT_ANNUITY", 
#   "P.AMT_APPLICATION", "P.AMT_CREDIT", "P.AMT_DOWN_PAYMENT"]
# fea_a_arr = ["A.REGION_POPULATION_RELATIVE", "A.REGION_POPULATION_RELATIVE", "A.DAYS_REGISTRATION",
#   "A.DAYS_ID_PUBLISH", "A.DAYS_EMPLOYED", "A.DAYS_BIRTH"]
# fea_p_arr.each do |fea_2|
#   fea_a_arr.each do |fea_1|
#     sql = "select A.SK_ID_CURR, target, (" + fea_1 + " * MAX(" + fea_2 +")) as skr from application_train A 
#        left join previous_application P on A.SK_ID_CURR = P.SK_ID_CURR 
#        group by A.SK_ID_CURR"
#     dataset = create_dataset $dev_db, sql
#     fill_zero dataset
#     dataset = normalize dataset

#     ig = cal_ig_conti dataset, "skr"
#     puts ig
#   end
# end
###########################################################################################################
fea_1 = "A.DAYS_EMPLOYED"
fea_2 = "P.DAYS_DECISION"
fea_p_arr = ["P.DAYS_FIRST_DRAWING", "P.DAYS_FIRST_DUE", "P.DAYS_LAST_DUE_1ST_VERSION",
  "P.DAYS_LAST_DUE", "P.DAYS_TERMINATION", "P.NFLAG_INSURED_ON_APPROVAL", "P.AMT_ANNUITY", 
  "P.AMT_APPLICATION", "P.AMT_CREDIT", "P.AMT_DOWN_PAYMENT"]
fea_a_arr = ["A.REGION_POPULATION_RELATIVE", "A.REGION_POPULATION_RELATIVE", "A.DAYS_REGISTRATION",
  "A.DAYS_ID_PUBLISH", "A.DAYS_EMPLOYED", "A.DAYS_BIRTH"]

sql = "select A.SK_ID_CURR, target, A.DAYS_BIRTH, A.DAYS_REGISTRATION, 
       AVG(P.AMT_ANNUITY) as skr_1,  
       AVG(P.AMT_APPLICATION) as skr_2
       from application_train A 
       left join previous_application P on A.SK_ID_CURR = P.SK_ID_CURR 
       group by A.SK_ID_CURR"

dataset = create_dataset $dev_db, sql
fill_zero dataset
dataset = normalize dataset
dataset.each do |row|
  row["features"]["remix_2"] =  12 * row["features"]["days_birth"] + row["features"]["days_registration"] +
  row["features"]["skr_1"] * row["features"]["skr_2"]
end
dataset = normalize dataset
# print_ds dataset

# # puts dataset

ig = cal_ig_conti dataset, "remix_2"
ig

0.0023160802036418238
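
With a left join, applicants that have no rows in `previous_application` get `NULL` aggregates, and `fill_zero` then replaces those with 0, which makes "no history" look the same as a true zero. One way to keep that information is a companion indicator feature. The sketch below is a hypothetical variant, not part of the notebook: the name `fill_zero_with_flag`, the `_missing` suffix, and the two sample rows are all assumptions.

```ruby
# Hypothetical variant of fill_zero: before replacing a missing value with 0,
# record the fact in a companion "<name>_missing" feature (1 or 0).
def fill_zero_with_flag(dataset)
  dataset.each do |row|
    flags = {}
    row["features"].each do |key, value|
      missing = value.nil? || value == ""
      flags["#{key}_missing"] = missing ? 1 : 0
      row["features"][key] = 0 if missing
    end
    # Add the indicator keys only after iteration has finished.
    row["features"].merge!(flags)
  end
end

# Invented rows in the course's record format.
rows = [
  { "id" => 1, "label" => 0, "features" => { "skr_1" => 1200.5 } },
  { "id" => 2, "label" => 1, "features" => { "skr_1" => nil } }
]
fill_zero_with_flag(rows)
rows.each { |r| p r["features"] }
```

The indicator can then be scored on its own as a categorical feature, independent of the filled numeric column.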

In [43]:
######################################MAIN TABLE JOIN BUREAU####################################

## sum AMT_CREDIT_SUM / sum AMT_CREDIT_SUM_DEBT  bureau
## amt_credit_sum / amt_annuity bureau
## !! row["features"]["remix_2"] = (row["features"]["cre_sum"] / row["features"]["debt"]) * row["features"]["ext_source_2"]

fea_1 = "A.DAYS_EMPLOYED"
fea_2 = "P.DAYS_DECISION"
fea_p_arr = ["P.DAYS_FIRST_DRAWING", "P.DAYS_FIRST_DUE", "P.DAYS_LAST_DUE_1ST_VERSION",
  "P.DAYS_LAST_DUE", "P.DAYS_TERMINATION", "P.NFLAG_INSURED_ON_APPROVAL", "P.AMT_ANNUITY", 
  "P.AMT_APPLICATION", "P.AMT_CREDIT", "P.AMT_DOWN_PAYMENT"]
fea_a_arr = ["A.REGION_POPULATION_RELATIVE", "A.REGION_POPULATION_RELATIVE", "A.DAYS_REGISTRATION",
  "A.DAYS_ID_PUBLISH", "A.DAYS_EMPLOYED", "A.DAYS_BIRTH","A.AMT_CREDIT"]
fea_b_arr = ["AMT_CREDIT_SUM", "AMT_CREDIT_SUM_DEBT","AMT_ANNUITY", "DAYS_CREDIT"]


sql = "select A.SK_ID_CURR, target,
       A.EXT_SOURCE_1,
       A.EXT_SOURCE_2,
       A.EXT_SOURCE_3,
       A.DAYS_EMPLOYED,
       A.DAYS_BIRTH,
       A.AMT_CREDIT as skr_1,
       A.AMT_ANNUITY,
       SUM(B.AMT_CREDIT_SUM) as cre_sum,
       AVG(B.AMT_CREDIT_SUM) as cre_avg,  
       SUM(B.AMT_CREDIT_SUM_DEBT) as debt,
       AVG(B.AMT_ANNUITY) as ann
       from application_train A 
       left join bureau B on A.SK_ID_CURR = B.SK_ID_CURR 
       group by A.SK_ID_CURR"

dataset = create_dataset $dev_db, sql
fill_zero dataset
dataset = normalize dataset
dataset.each do |row|
  row["features"]["remix_2"] = (row["features"]["cre_sum"] / row["features"]["debt"]) * row["features"]["ext_source_2"]
  
  row["features"]["remix_3"] = (row["features"]["cre_avg"] / row["features"]["ann"]) + 5 * row["features"]["ext_source_2"] ** 2
end
dataset = normalize dataset
# print_ds dataset

ig = cal_ig_conti dataset, "remix_3"
puts ig

0.005416923770485485
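
Ratio features such as `cre_sum / debt` divide one float by another. In Ruby, dividing a float by `0.0` does not raise; it yields `Infinity` or `NaN`, and such values do not compare meaningfully with ordinary numbers when a split point is searched. A small guard is sketched below. The helper `safe_ratio` and its `default` parameter are assumptions for illustration, and whether 0.0 is a sensible fallback depends on the feature.

```ruby
# Hypothetical helper: divide two numbers as floats, falling back to `default`
# when the result is not a finite number (Infinity or NaN).
def safe_ratio(numerator, denominator, default = 0.0)
  ratio = numerator.to_f / denominator.to_f
  ratio.finite? ? ratio : default
end

puts safe_ratio(3.0, 1.5)   # ordinary division
puts safe_ratio(1.0, 0.0)   # would be Infinity, replaced by the default
puts safe_ratio(0.0, 0.0)   # would be NaN, replaced by the default
```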


In [123]:
######################################Extract 15 features here########################################
def extract_features db
  dataset = []
  
##########################################main join bureau#########################################
  sql_1 = "select A.SK_ID_CURR, target,
       A.EXT_SOURCE_1,
       A.EXT_SOURCE_2,
       A.EXT_SOURCE_3,
       A.AMT_GOODS_PRICE as good_pri,
       A.DAYS_EMPLOYED,
       A.DAYS_BIRTH,
       A.DAYS_ID_PUBLISH as skr, 
       A.DAYS_REGISTRATION as skr_2,
       A.AMT_CREDIT,
       A.AMT_ANNUITY,
       SUM(B.AMT_CREDIT_SUM) as cre_sum,
       AVG(B.AMT_CREDIT_SUM) as cre_avg,  
       AVG(B.AMT_CREDIT_SUM_DEBT) as debt_avg,
       SUM(B.AMT_CREDIT_SUM_DEBT) as debt_sum,
       AVG(B.AMT_ANNUITY) as ann,
       AVG(B.DAYS_CREDIT) as days_cre_avg
       from application_train A 
       left join bureau B on A.SK_ID_CURR = B.SK_ID_CURR 
       group by A.SK_ID_CURR"
  
  dataset_1 = create_dataset db, sql_1
  fill_zero dataset_1
  dataset_1 = normalize dataset_1
  dataset_1.each do |row|
    record = Hash.new
    record["id"] = row["id"].clone
    record["label"] = row["label"].clone
    record["features"] = Hash.new
    row["features"]["remix_1"] = (row["features"]["cre_sum"] / row["features"]["debt_sum"]) * row["features"]["ext_source_2"]
    row["features"]["remix_2"] = (row["features"]["cre_avg"] / row["features"]["ann"]) + 5 * row["features"]["ext_source_2"] ** 2
    
#     row["features"]["remix_3.5"] = row["features"]["days_cre_avg"]
    #0.00476
#     row["features"]["remix_3"] = row["features"]["days_cre_avg"] / (row["features"]["days_birth"] ** 2 *
#     row["features"]["days_employed"] ** 2)
    row["features"]["days_remix"] = row["features"]["days_birth"] ** 2 *
    row["features"]["days_employed"] ** 2 
    row["features"]["remix_3"] = row["features"]["days_cre_avg"] * row["features"]["ext_source_2"] ** 4  / row["features"]["days_remix"] ** 2
    row["features"]["remix_4"] = row["features"]["amt_credit"] * row["features"]["ext_source_3"] / row["features"]["good_pri"] 
    record["features"]["remix_1"] = row["features"]["remix_1"]
    record["features"]["remix_2"] = row["features"]["remix_2"]
    record["features"]["remix_3"] = row["features"]["remix_3"]
    record["features"]["remix_4"] = row["features"]["remix_4"]
    dataset << record   
  end

  
  #######################################extract from main table################################################
  sql_2 = "select target, sk_id_curr, 
  AMT_INCOME_TOTAL,
  AMT_CREDIT,
  AMT_ANNUITY,
  AMT_GOODS_PRICE,
  DAYS_BIRTH,
  DAYS_EMPLOYED,
  ext_source_3, 
  ext_source_2,
  ext_source_1 
  from application_train"
  
  dataset_2 = create_dataset db, sql_2
  fill_mean dataset_2
  dataset_2 = normalize dataset_2
  
  dataset_2.zip(dataset).each do |row, record|
    row["features"]["remix_5"] = 3 * row["features"]["ext_source_1"] + 4 * row["features"]["ext_source_2"] +  3 * row["features"]["ext_source_3"]
    row["features"]["remix_6"] = row["features"]["ext_source_2"] ** 8  / (row["features"]["ext_source_1"] * row["features"]["ext_source_3"])
    row["features"]["remix_7"] = row["features"]["amt_income_total"] * row["features"]["amt_credit"] - 3 * row["features"]["ext_source_3"] 
    row["features"]["remix_8"] = 2 * row["features"]["amt_goods_price"] - row["features"]["amt_credit"] + row["features"]["ext_source_3"]
    row["features"]["remix_9"] = row["features"]["days_birth"] * row["features"]["ext_source_2"] ** 2 / (-365) 
    record["features"]["remix_5"] = row["features"]["remix_5"]
    record["features"]["remix_6"] = row["features"]["remix_6"]
    record["features"]["remix_7"] = row["features"]["remix_7"]
    record["features"]["remix_8"] = row["features"]["remix_8"]
    record["features"]["remix_9"] = row["features"]["remix_9"]
  end
  
  #######################################Categorical Features###########################################
  fea_arr = ["name_education_type", "code_gender", "flag_own_car", 
  "flag_own_realty", "name_contract_type", "name_type_suite", "name_income_type", "name_family_status",
  "name_housing_type", "occupation_type", "organization_type", "WEEKDAY_APPR_PROCESS_START"] 
  
  sql_3 = "select target, sk_id_curr,
    name_education_type,
    code_gender,
    flag_own_car,
    flag_own_realty,
    name_contract_type,
    name_type_suite,
    name_income_type,
    name_family_status,
    name_housing_type,
    occupation_type,
    organization_type,
    WEEKDAY_APPR_PROCESS_START
    from application_train"
  
  dataset_3 = create_dataset db, sql_3
  fill_zero dataset_3
  dataset_3.zip(dataset).each do |row,record|
    row["features"]["remix_10"] = row["features"]["organization_type"].to_s + row["features"]["name_contract_type"].to_s
    row["features"]["remix_11"] = row["features"]["name_education_type"].to_s + row["features"]["organization_type"].to_s
    row["features"]["remix_12"] = row["features"]["name_family_status"].to_s + row["features"]["occupation_type"].to_s
    row["features"]["remix_13"] = row["features"]["name_housing_type"].to_s + row["features"]["occupation_type"].to_s
    row["features"]["remix_14"] = row["features"]["flag_own_realty"].to_s + row["features"]["organization_type"].to_s
    row["features"]["remix_15"] = row["features"]["code_gender"].to_s + row["features"]["organization_type"].to_s
    record["features"]["remix_10"] = row["features"]["remix_10"]
    record["features"]["remix_11"] = row["features"]["remix_11"]
    record["features"]["remix_12"] = row["features"]["remix_12"]
    record["features"]["remix_13"] = row["features"]["remix_13"]
    record["features"]["remix_14"] = row["features"]["remix_14"]
    record["features"]["remix_15"] = row["features"]["remix_15"]
  end
#   ig_1 = cal_ig_conti dataset_1, "remix_1"
#   puts "remix_1 : sum of credit sum over sum of debt_sum remix ext_2 "
#   puts ig_1
  
#   ig_2 = cal_ig_conti dataset_1, "remix_2"
#   puts "remix_2 : avg of credit sum over avg of ann remix ext_2"
#   puts ig_2
  
#   ig_3 = cal_ig_conti dataset_1, "remix_3"
#   puts "remix_3 : days credit avg / power of days remix remix ext_2"
#   puts ig_3
  
#   ig_4 = cal_ig_conti dataset_1, "remix_4"
#   puts "remix_4 : amt credit over amt goods price remix ext_3"
#   puts ig_4
  
#   ig_5 = cal_ig_conti dataset_2, "remix_5"
#   puts "remix_5 : linear sum of all ext_source with weights"
#   puts ig_5
  
#   ig_6 = cal_ig_conti dataset_2, "remix_6"
#   puts "remix_6 : nonlinear remix of all ext_source with weights"
#   puts ig_6
#   ig_7 = cal_ig_conti dataset_2, "remix_7"
#   puts "remix_7 : mash up with income total credit and ext_source"
#   puts ig_7
  
#   ig_8 = cal_ig_conti dataset_2, "remix_8"
#   puts "remix_8 : creditdownpayment: AMTGOODPRICE - AMTCREDIT remix ext_source"
#   puts ig_8
  
#   ig_9 = cal_ig_conti dataset_2, "remix_9"
#   puts "remix_9: age int remix ext_source"
#   puts ig_9
  
#   dist = class_distribution dataset_3
#   h0 = entropy dist
#   split_1 = dataset_3.group_by {|row| row["features"]["remix_10"]}
#   split_2 = dataset_3.group_by {|row| row["features"]["remix_11"]}
#   split_3 = dataset_3.group_by {|row| row["features"]["remix_12"]}
#   split_4 = dataset_3.group_by {|row| row["features"]["remix_13"]}
#   split_5 = dataset_3.group_by {|row| row["features"]["remix_14"]}
#   split_6 = dataset_3.group_by {|row| row["features"]["remix_15"]}
  
#   ig_10 = information_gain h0, split_1
#   ig_11 = information_gain h0, split_2
#   ig_12 = information_gain h0, split_3
#   ig_13 = information_gain h0, split_4
#   ig_14 = information_gain h0, split_5
#   ig_15 = information_gain h0, split_6
  
#   puts "remix_10 : organization_type remix with name contract type"
#   puts ig_10
#   puts "remix_11 : name_education_type remix with organization_type"
#   puts ig_11
#   puts "remix_12 : name_family_status remix with occupation_type"
#   puts ig_12
#   puts "remix_13 : name_housing_type remix with occupation_type"
#   puts ig_13
#   puts "remix_14: flag_own_realty remix with organization_type"
#   puts ig_14
#   puts "remix_15 : code_gender remix with organization_type"
#   puts ig_15
  
  
#   puts dataset.size
#   dataset.take(50).each do |row|
#     puts row
#   end
  return dataset
end

15334
{"id"=>100002, "label"=>1, "features"=>{"remix_1"=>-0.5519410263577981, "remix_2"=>10.916204187844244, "remix_3"=>0.9274268511644237, "remix_4"=>-0.9486483779178423, "remix_5"=>-15.036848615763171, "remix_6"=>3.39673298368854, "remix_7"=>5.76536997590556, "remix_8"=>-2.5099798200110457, "remix_9"=>-0.007012007308497651, "remix_10"=>"Business Entity Type 3Cash loans", "remix_11"=>"Secondary / secondary specialBusiness Entity Type 3", "remix_12"=>"Single / not marriedLaborers", "remix_13"=>"House / apartmentLaborers", "remix_14"=>"YBusiness Entity Type 3", "remix_15"=>"MBusiness Entity Type 3"}}
{"id"=>100007, "label"=>0, "features"=>{"remix_1"=>-0.43325421098858174, "remix_2"=>6.916573176496392, "remix_3"=>-10.177136126341658, "remix_4"=>-4.860855223036282, "remix_5"=>-7.231019322305619, "remix_6"=>3.424535022157259, "remix_7"=>1.4845902810553708, "remix_8"=>-0.3874624373654716, "remix_9"=>0.0023866336658833983, "remix_10"=>"ReligionCash loans", "remix_11"=>"Secondary / secondary 

{"id"=>100260, "label"=>0, "features"=>{"remix_1"=>0.3528043166340067, "remix_2"=>7.043836111413981, "remix_3"=>-0.7260710816393411, "remix_4"=>0.27606864377664, "remix_5"=>1.9357283780161079, "remix_6"=>22.326202137561676, "remix_7"=>0.10200254940924668, "remix_8"=>-0.9684175902297005, "remix_9"=>-0.0037609740311441847, "remix_10"=>"GovernmentCash loans", "remix_11"=>"Higher educationGovernment", "remix_12"=>"Single / not marriedCore staff", "remix_13"=>"Co-op apartmentCore staff", "remix_14"=>"YGovernment", "remix_15"=>"MGovernment"}}
{"id"=>100268, "label"=>0, "features"=>{"remix_1"=>0.4244424132994396, "remix_2"=>4.4342692607948955, "remix_3"=>1.7006965408972987, "remix_4"=>-0.7519155013300369, "remix_5"=>-5.297544447803585, "remix_6"=>0.03052616916132642, "remix_7"=>5.251095095942939, "remix_8"=>-2.257254930989579, "remix_9"=>-0.0016441690837617922, "remix_10"=>"Business Entity Type 3Cash loans", "remix_11"=>"Secondary / secondary specialBusiness Entity Type 3", "remix_12"=>"Marri

{"id"=>100557, "label"=>0, "features"=>{"remix_1"=>-0.2322967779352874, "remix_2"=>5.697872940781939, "remix_3"=>-4.613415722124828, "remix_4"=>0.8535277435036392, "remix_5"=>3.072223108257627, "remix_6"=>0.27658846861036235, "remix_7"=>-1.3475203122583272, "remix_8"=>2.473348449898217, "remix_9"=>-0.0024035520015104578, "remix_10"=>"Business Entity Type 3Cash loans", "remix_11"=>"Incomplete higherBusiness Entity Type 3", "remix_12"=>"MarriedSales staff", "remix_13"=>"House / apartmentSales staff", "remix_14"=>"YBusiness Entity Type 3", "remix_15"=>"FBusiness Entity Type 3"}}
{"id"=>100576, "label"=>0, "features"=>{"remix_1"=>0.6296491809317776, "remix_2"=>12.378755673131728, "remix_3"=>303.3486971633805, "remix_4"=>-1.6732080609838047, "remix_5"=>1.9955794672140739, "remix_6"=>30.22884900937017, "remix_7"=>1.276301759344465, "remix_8"=>-1.2246509419449678, "remix_9"=>-0.003458345854403709, "remix_10"=>"OtherCash loans", "remix_11"=>"Secondary / secondary specialOther", "remix_12"=>"Ma

{"id"=>101001, "label"=>0, "features"=>{"remix_1"=>0.03466284376035897, "remix_2"=>7.229312494427405, "remix_3"=>4.634227189210138, "remix_4"=>1.1383115213667092, "remix_5"=>5.967183292410409, "remix_6"=>-3.3706828866522875, "remix_7"=>-3.4874714931663062, "remix_8"=>2.0160563279043577, "remix_9"=>-0.0028866680919522247, "remix_10"=>"Self-employedCash loans", "remix_11"=>"Secondary / secondary specialSelf-employed", "remix_12"=>"MarriedSales staff", "remix_13"=>"Municipal apartmentSales staff", "remix_14"=>"NSelf-employed", "remix_15"=>"FSelf-employed"}}
{"id"=>101011, "label"=>0, "features"=>{"remix_1"=>0.4783665809268612, "remix_2"=>6.771643057728856, "remix_3"=>13.902444005247654, "remix_4"=>-0.32305165431015753, "remix_5"=>-1.4707047920639118, "remix_6"=>0.7555122623402338, "remix_7"=>2.876496503733354, "remix_8"=>0.10116112143243439, "remix_9"=>0.0017348452864946096, "remix_10"=>"Self-employedCash loans", "remix_11"=>"Secondary / secondary specialSelf-employed", "remix_12"=>"Marri

[{"id"=>100002, "label"=>1, "features"=>{"remix_1"=>-0.5519410263577981, "remix_2"=>10.916204187844244, "remix_3"=>0.9274268511644237, "remix_4"=>-0.9486483779178423, "remix_5"=>-15.036848615763171, "remix_6"=>3.39673298368854, "remix_7"=>5.76536997590556, "remix_8"=>-2.5099798200110457, "remix_9"=>-0.007012007308497651, "remix_10"=>"Business Entity Type 3Cash loans", "remix_11"=>"Secondary / secondary specialBusiness Entity Type 3", "remix_12"=>"Single / not marriedLaborers", "remix_13"=>"House / apartmentLaborers", "remix_14"=>"YBusiness Entity Type 3", "remix_15"=>"MBusiness Entity Type 3"}}, {"id"=>100007, "label"=>0, "features"=>{"remix_1"=>-0.43325421098858174, "remix_2"=>6.916573176496392, "remix_3"=>-10.177136126341658, "remix_4"=>-4.860855223036282, "remix_5"=>-7.231019322305619, "remix_6"=>3.424535022157259, "remix_7"=>1.4845902810553708, "remix_8"=>-0.3874624373654716, "remix_9"=>0.0023866336658833983, "remix_10"=>"ReligionCash loans", "remix_11"=>"Secondary / secondary spec

In [124]:
extracted_dataset = extract_features($dev_db)
extracted_dataset[0]

15334
{"id"=>100002, "label"=>1, "features"=>{"remix_1"=>-0.5519410263577981, "remix_2"=>10.916204187844244, "remix_3"=>0.9274268511644237, "remix_4"=>-0.9486483779178423, "remix_5"=>-15.036848615763171, "remix_6"=>3.39673298368854, "remix_7"=>5.76536997590556, "remix_8"=>-2.5099798200110457, "remix_9"=>-0.007012007308497651, "remix_10"=>"Business Entity Type 3Cash loans", "remix_11"=>"Secondary / secondary specialBusiness Entity Type 3", "remix_12"=>"Single / not marriedLaborers", "remix_13"=>"House / apartmentLaborers", "remix_14"=>"YBusiness Entity Type 3", "remix_15"=>"MBusiness Entity Type 3"}}

In [125]:
assert_not_nil extracted_dataset
assert_equal 15334, extracted_dataset.size
assert_true(extracted_dataset.all? {|row| row["features"].size >= 8}, "At least 8 features per row")
assert_true(extracted_dataset.flat_map {|row| row["features"].keys}.uniq.size >= 15,  "At least 15 features")

In [126]:
assert_equal 15334, extracted_dataset.collect {|row| row["id"]}.uniq.size
assert_equal 2, extracted_dataset.collect {|row| row["label"]}.uniq.size

h0 = entropy(class_distribution(extracted_dataset))
assert_in_delta(0.2797684909805576, h0, 1e-3)
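
The check above compares `h0` with the entropy of the label distribution. As a minimal sketch of that computation (not the `assignment_lib` implementation), the helpers below are hypothetical names, and the natural logarithm is an assumption; the course library may use a different base.

```ruby
# Hypothetical helpers for illustration; assignment_lib's versions may differ.

# Maps each label to its relative frequency in the dataset.
def sketch_class_distribution(dataset)
  counts = Hash.new(0)
  dataset.each { |row| counts[row["label"]] += 1 }
  total = dataset.size.to_f
  counts.transform_values { |c| c / total }
end

# Shannon entropy of a probability distribution (natural log is an assumption).
def sketch_entropy(dist)
  probs = dist.values.reject(&:zero?)
  -1.0 * probs.sum(0.0) { |prob| prob * Math.log(prob) }
end

# Toy rows in the same shape as the extracted dataset.
toy = [
  {"id" => 1, "label" => 1, "features" => {}},
  {"id" => 2, "label" => 0, "features" => {}},
  {"id" => 3, "label" => 0, "features" => {}},
  {"id" => 4, "label" => 0, "features" => {}}
]

dist = sketch_class_distribution(toy)
h = sketch_entropy(dist)
puts h
```

A heavily imbalanced label column gives a low entropy, which is why the information gain thresholds in the following tests are small in absolute terms.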

In [127]:
features = extracted_dataset.flat_map {|row| row["features"].keys}.uniq
numeric_features = features.select {|k| extracted_dataset.reject {|row| row["features"][k] == ""}.all? {|row| row["features"].fetch(k, 0.0).is_a? Numeric}}

assert_true(numeric_features.size >= 4, "At least 4 numeric features")
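# Asserts that the best split on a numeric feature reaches the minimum information gain.
def test_ig_numeric extracted_dataset, h0, test_feature1
  # t (presumably the split threshold) is unused; only the information gain is checked
  t, ig = find_split_point_numeric extracted_dataset, h0, test_feature1
  assert_true(ig >= 0.005, "Expected information gain for '#{test_feature1}' >= 0.005")
  return test_feature1
end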

test_ig_numeric extracted_dataset, h0, numeric_features[0]

"remix_1"

In [128]:
test_ig_numeric extracted_dataset, h0, numeric_features[1]

"remix_2"

In [129]:
test_ig_numeric extracted_dataset, h0, numeric_features[2]

"remix_3"

In [130]:
3.upto(numeric_features.size - 1) do |i|
  test_ig_numeric extracted_dataset, h0, numeric_features[i]
end

3
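
The numeric tests above depend on finding a threshold that maximizes information gain. The following is a rough sketch of one common way to do that, trying midpoints between consecutive distinct values. The names `label_entropy` and `sketch_split_point` are hypothetical, the natural logarithm is an assumption, and the sketch assumes every row has a numeric value for the feature; it is not the `find_split_point_numeric` from the course library.

```ruby
# Entropy of the labels in a set of rows (natural log is an assumption).
def label_entropy(rows)
  return 0.0 if rows.empty?
  n = rows.size.to_f
  rows.group_by { |r| r["label"] }.values.sum(0.0) do |group|
    prob = group.size / n
    -prob * Math.log(prob)
  end
end

# Hypothetical threshold search: returns [threshold, information_gain].
# Assumes every row carries a numeric value for `feature`.
def sketch_split_point(rows, h0, feature)
  values = rows.map { |r| r["features"][feature] }.uniq.sort
  best_t = nil
  best_ig = 0.0
  values.each_cons(2) do |lo, hi|
    t = (lo + hi) / 2.0
    left, right = rows.partition { |r| r["features"][feature] <= t }
    # Weighted entropy remaining after the split
    remainder = [left, right].sum(0.0) do |part|
      part.size.to_f / rows.size * label_entropy(part)
    end
    ig = h0 - remainder
    if ig > best_ig
      best_t = t
      best_ig = ig
    end
  end
  [best_t, best_ig]
end

toy = [
  {"id" => 1, "label" => 0, "features" => {"x" => 1.0}},
  {"id" => 2, "label" => 0, "features" => {"x" => 2.0}},
  {"id" => 3, "label" => 1, "features" => {"x" => 3.0}},
  {"id" => 4, "label" => 1, "features" => {"x" => 4.0}}
]

h0 = label_entropy(toy)
t, ig = sketch_split_point(toy, h0, "x")
puts "threshold=#{t} ig=#{ig}"
```

In this toy data the labels separate cleanly at 2.5, so the gain equals the full starting entropy. Real features rarely split this cleanly, which is why the tests only require a gain of 0.005.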

In [131]:
categorical_features = features.select {|k| extracted_dataset.all? {|row| row["features"].fetch(k, "").is_a? String}}

assert_true(categorical_features.size >= 4, "At least 4 categorical features")

# Asserts that splitting on a categorical feature reaches the minimum information gain.
def test_ig_categorical extracted_dataset, h0, test_feature1
  # One partition per distinct value of the feature
  splits = extracted_dataset.group_by {|row| row["features"][test_feature1]}
  ig = information_gain h0, splits
  puts ig
  assert_true(ig >= 0.005, "Expected information gain for '#{test_feature1}' >= 0.005")
  return test_feature1
end

test_ig_categorical extracted_dataset, h0, categorical_features[0]

0.0064926022053908294


"remix_10"

In [132]:
test_ig_categorical extracted_dataset, h0, categorical_features[1]

0.008093865339290574


"remix_11"

In [133]:
test_ig_categorical extracted_dataset, h0, categorical_features[2]

0.005765562896686671


"remix_12"

In [134]:
3.upto(categorical_features.size - 1) do |i|
  test_ig_categorical extracted_dataset, h0, categorical_features[i]
end

0.006723792431264697
0.006339866788409865
0.006276768835009772


3
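
The categorical tests group rows by feature value and pass the groups to `information_gain`. As a hedged illustration of what that quantity measures, the sketch below subtracts the size-weighted entropy of the groups from the starting entropy. `label_entropy` and `sketch_information_gain` are hypothetical names, the feature name `contract` is made up for the toy data, and the natural logarithm is an assumption; the course library's version may differ.

```ruby
# Entropy of the labels in a set of rows (natural log is an assumption).
def label_entropy(rows)
  return 0.0 if rows.empty?
  n = rows.size.to_f
  rows.group_by { |r| r["label"] }.values.sum(0.0) do |group|
    prob = group.size / n
    -prob * Math.log(prob)
  end
end

# Hypothetical information gain: h0 minus the size-weighted entropy of each split.
# `splits` maps a feature value to the rows that have that value.
def sketch_information_gain(h0, splits)
  total = splits.values.sum(&:size).to_f
  remainder = splits.values.sum(0.0) do |part|
    part.size / total * label_entropy(part)
  end
  h0 - remainder
end

toy = [
  {"id" => 1, "label" => 0, "features" => {"contract" => "a"}},
  {"id" => 2, "label" => 0, "features" => {"contract" => "a"}},
  {"id" => 3, "label" => 1, "features" => {"contract" => "b"}},
  {"id" => 4, "label" => 0, "features" => {"contract" => "b"}}
]

h0 = label_entropy(toy)
splits = toy.group_by { |row| row["features"]["contract"] }
ig = sketch_information_gain(h0, splits)
puts ig
```

Concatenated categories such as those in `remix_10` create more, smaller groups. That can raise the measured gain, but very small groups may not generalize, so it is worth checking group sizes alongside the gain.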