A discretization algorithm based on information loss. Initially, every continuous value is placed in its own bin. The algorithm then repeatedly merges the two consecutive bins whose means differ the least, and stops once the information loss of a merge exceeds the average information loss per merge. To obtain that average, it first runs the whole merging process down to a single bin, recording the information loss at each step. It then repeats the merging from the start and stops at the first step whose information loss exceeds this average.
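The procedure can be sketched in Python as follows. This is a minimal sketch, not the original implementation: the function names (`entropy`, `merge_loss`, `closest_pair`, `merge_steps`, `discretize`) are hypothetical, and two behaviours are assumptions inferred from the example output: the pair to merge is the pair of consecutive bins with the smallest difference in means, and the merge whose loss first exceeds the average is still applied before stopping.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


def merge_loss(p_a, p_b):
    """Information lost by merging two bins with probabilities p_a and p_b."""
    total = p_a + p_b
    return total * entropy([p_a / total, p_b / total])


def closest_pair(bins):
    """Index i of the consecutive pair (i, i + 1) whose means differ the least."""
    means = [sum(b) / len(b) for b in bins]
    return min(range(len(bins) - 1), key=lambda i: means[i + 1] - means[i])


def merge_steps(values):
    """Yield (loss, bins) after each merge until a single bin remains."""
    n = len(values)
    bins = [[v] for v in sorted(values)]
    while len(bins) > 1:
        i = closest_pair(bins)
        loss = merge_loss(len(bins[i]) / n, len(bins[i + 1]) / n)
        # Build a new list so previously yielded states are never mutated.
        bins = bins[:i] + [bins[i] + bins[i + 1]] + bins[i + 2:]
        yield loss, bins


def discretize(values):
    """Return (bins, average_loss) using the average-loss stopping rule."""
    if len(values) < 2:
        return [list(values)], 0.0
    # First pass: merge down to one bin to get the average loss per merge.
    losses = [loss for loss, _ in merge_steps(values)]
    average = sum(losses) / len(losses)
    # Second pass: merge again, stopping after the first merge above average.
    bins = [[v] for v in sorted(values)]
    for loss, bins in merge_steps(values):
        if loss > average:
            break
    return bins, average


data = [0.5593630989, 0.9247589927, 0.9286284722, 0.7990522484,
        0.8204542483, 0.6647561547, 0.9765970904, 0.9745135246,
        0.8049680872, 0.7426005405, 0.6156845434, 0.9758901663,
        0.645977965, 0.9194581]
bins, average = discretize(data)
print(bins)
print(average)
```

Running both passes through the same generator keeps the merge order identical in each pass, at the cost of computing the merges twice.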
Output Example:
Data: [0.5593630989, 0.9247589927, 0.9286284722, 0.7990522484, 0.8204542483, 0.6647561547, 0.9765970904, 0.9745135246, 0.8049680872, 0.7426005405, 0.6156845434, 0.9758901663, 0.645977965, 0.9194581]
Initial Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246], [0.9758901663], [0.9765970904]]
Probability distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142]
___ Getting Average Information Loss to Determine Stopping Point ___
We first determine the average information loss by combining all bins into a single bin and tracking the information loss at each iteration.
This determines our stopping point: discretization stops when the info loss at some iteration exceeds the average.
Information Loss:
Sum of Bins 13 and 14 probabilities = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 13 and 14:
H([0.5, 0.5]) =
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities * H_2 = 0.14285714285714285
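In symbols, the quantity printed in each "Information Loss" block appears to be the following, where p_a and p_b are the probabilities of the two bins being merged (the symbol names are chosen here for illustration and do not come from the program):

```latex
\[
L(p_a, p_b) = (p_a + p_b)\, H_2\!\left(\frac{p_a}{p_a + p_b}, \frac{p_b}{p_a + p_b}\right),
\qquad
H_2(q, 1 - q) = -q \log_2 q - (1 - q) \log_2 (1 - q).
\]
```

For bins 13 and 14, p_a = p_b = 1/14, so L = (2/14) * 1.0 = 0.142857..., which is the value shown above.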
Information Loss:
Sum of Bins 12 and 13 probabilities = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 12 and 13:
H([0.3333333333333333, 0.6666666666666666]) =
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities * H_2 = 0.19677767872596202
Information Loss:
Sum of Bins 10 and 11 probabilities = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 10 and 11:
H([0.5, 0.5]) =
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities * H_2 = 0.14285714285714285
Information Loss:
Sum of Bins 6 and 7 probabilities = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 6 and 7:
H([0.5, 0.5]) =
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities * H_2 = 0.14285714285714285
Information Loss:
Sum of Bins 8 and 9 probabilities = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 8 and 9:
H([0.3333333333333333, 0.6666666666666666]) =
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities * H_2 = 0.19677767872596202
Information Loss:
Sum of Bins 6 and 7 probabilities = 0.14285714285714285 + 0.07142857142857142 = 0.21428571428571427
H_2 for bins 6 and 7:
H([0.6666666666666666, 0.3333333333333333]) =
-1 * (
0.6666666666666666 * log_2(0.6666666666666666) +
0.3333333333333333 * log_2(0.3333333333333333)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities * H_2 = 0.19677767872596202
Information Loss:
Sum of Bins 3 and 4 probabilities = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 3 and 4:
H([0.5, 0.5]) =
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities * H_2 = 0.14285714285714285
Information Loss:
Sum of Bins 2 and 3 probabilities = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 2 and 3:
H([0.3333333333333333, 0.6666666666666666]) =
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities * H_2 = 0.19677767872596202
Information Loss:
Sum of Bins 5 and 6 probabilities = 0.21428571428571427 + 0.21428571428571427 = 0.42857142857142855
H_2 for bins 5 and 6:
H([0.5, 0.5]) =
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities * H_2 = 0.42857142857142855
Information Loss:
Sum of Bins 3 and 4 probabilities = 0.07142857142857142 + 0.21428571428571427 = 0.2857142857142857
H_2 for bins 3 and 4:
H([0.25, 0.75]) =
-1 * (
0.25 * log_2(0.25) +
0.75 * log_2(0.75)
)
= 0.8112781244591328
Information loss = Sum of bins probabilities * H_2 = 0.23179374984546652
Information Loss:
Sum of Bins 1 and 2 probabilities = 0.07142857142857142 + 0.21428571428571427 = 0.2857142857142857
H_2 for bins 1 and 2:
H([0.25, 0.75]) =
-1 * (
0.25 * log_2(0.25) +
0.75 * log_2(0.75)
)
= 0.8112781244591328
Information loss = Sum of bins probabilities * H_2 = 0.23179374984546652
Information Loss:
Sum of Bins 2 and 3 probabilities = 0.2857142857142857 + 0.42857142857142855 = 0.7142857142857142
H_2 for bins 2 and 3:
H([0.4, 0.6000000000000001]) =
-1 * (
0.4 * log_2(0.4) +
0.6000000000000001 * log_2(0.6000000000000001)
)
= 0.9709505944546685
Information loss = Sum of bins probabilities * H_2 = 0.6935361388961917
Information Loss:
Sum of Bins 1 and 2 probabilities = 0.2857142857142857 + 0.7142857142857143 = 1.0
H_2 for bins 1 and 2:
H([0.2857142857142857, 0.7142857142857143]) =
-1 * (
0.2857142857142857 * log_2(0.2857142857142857) +
0.7142857142857143 * log_2(0.7142857142857143)
)
= 0.863120568566631
Information loss = Sum of bins probabilities * H_2 = 0.863120568566631
Average info loss is: 0.29287345554289257
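The average is the arithmetic mean of the 13 per-merge losses listed above. The short Python check below recomputes it from those logged values (the variable names are chosen here for illustration). It also shows a useful sanity check: the losses sum to the entropy of the initial 14 equally likely bins, log_2(14), because each merge lowers the entropy by exactly its information loss and a single bin has entropy 0.

```python
from math import log2

# Information loss of each of the 13 merges, copied from the log above.
losses = [
    0.14285714285714285, 0.19677767872596202, 0.14285714285714285,
    0.14285714285714285, 0.19677767872596202, 0.19677767872596202,
    0.14285714285714285, 0.19677767872596202, 0.42857142857142855,
    0.23179374984546652, 0.23179374984546652, 0.6935361388961917,
    0.863120568566631,
]

total = sum(losses)
average = total / len(losses)

print(average)          # mean loss per merge, used as the stopping threshold
print(total, log2(14))  # the two values agree up to floating-point rounding
```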
_______________________________
___ Discretization Process ___
Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142]) =
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142)
)
= 3.8073549220576055
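The entropy printed above is the Shannon entropy in bits. A minimal Python sketch of that calculation follows; the function name `entropy` is an assumption and is not taken from the original program. For n equally likely bins the result is log_2(n), so 14 singleton bins give log_2(14).

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


initial = [1 / 14] * 14          # one value per bin, as in the initial state
print(entropy(initial))          # entropy of the initial bins
print(log2(14))                  # closed form for 14 equally likely bins
```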
Information Loss:
Sum of Bins 13 and 14 probabilities = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 13 and 14:
H([0.5, 0.5]) =
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities * H_2 = 0.14285714285714285
----------------------------------
After 1 iteration
Bins 13 and 14 will be combined since they have the smallest difference in means
Previous Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246], [0.9758901663], [0.9765970904]]
Next Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246], [0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142]
Next Probability distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285]
----------------------------------
Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285]) =
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285)
)
= 3.6644977792004623
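The entropy above is lower than the initial entropy of 3.8073549220576055 by about 0.142857, which is the information loss reported for merging bins 13 and 14. A short derivation of why the two agree, writing p_a and p_b (names chosen here for illustration) for the probabilities of the two merged bins; every other term of the entropy sum is unchanged by the merge and cancels:

```latex
\begin{align*}
H_{\text{before}} - H_{\text{after}}
  &= -p_a \log_2 p_a - p_b \log_2 p_b + (p_a + p_b) \log_2 (p_a + p_b) \\
  &= -p_a \log_2 \frac{p_a}{p_a + p_b} - p_b \log_2 \frac{p_b}{p_a + p_b} \\
  &= (p_a + p_b)\, H_2\!\left(\frac{p_a}{p_a + p_b}, \frac{p_b}{p_a + p_b}\right).
\end{align*}
```

The last line is the sum of the two bin probabilities multiplied by H_2, which is how the log computes the information loss.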
Information Loss:
Sum of Bins 12 and 13 probabilities = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 12 and 13:
H([0.3333333333333333, 0.6666666666666666]) =
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities * H_2 = 0.19677767872596202
----------------------------------
After 2 iterations
Bins 12 and 13 will be combined since they have the smallest difference in means
Previous Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246], [0.9758901663, 0.9765970904]]
Next Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285]
Next Probability distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427]
----------------------------------
Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427]) =
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 3.4677201004744997
Information Loss:
Sum of Bins 10 and 11 probabilities = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 10 and 11:
H([0.5, 0.5]) =
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities * H_2 = 0.14285714285714285
----------------------------------
After 3 iterations
Bins 10 and 11 will be combined since they have the smallest difference in means
Previous Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427]
Next Probability distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]
----------------------------------
Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]) =
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 3.3248629576173565
Information Loss:
Sum of Bins 6 and 7 probabilities = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 6 and 7:
H([0.5, 0.5]) =
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities * H_2 = 0.14285714285714285
----------------------------------
After 4 iterations
Bins 6 and 7 will be combined since they have the smallest difference in means
Previous Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872], [0.8204542483], [0.9194581], [0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]
Next Probability distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]
----------------------------------
Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]) =
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 3.1820058147602133
Information Loss:
Sum of Bins 8 and 9 probabilities = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 8 and 9:
H([0.3333333333333333, 0.6666666666666666]) =
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities * H_2 = 0.19677767872596202
----------------------------------
After 5 iterations
Bins 8 and 9 will be combined since they have the smallest difference in means
Previous Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872], [0.8204542483], [0.9194581], [0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872], [0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]
Next Probability distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427]
----------------------------------
Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427]) =
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 2.9852281360342516
Information Loss:
Sum of Bins 6 and 7 probabilities = 0.14285714285714285 + 0.07142857142857142 = 0.21428571428571427
H_2 for bins 6 and 7:
H([0.6666666666666666, 0.3333333333333333]) =
-1 * (
0.6666666666666666 * log_2(0.6666666666666666) +
0.3333333333333333 * log_2(0.3333333333333333)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities * H_2 = 0.19677767872596202
----------------------------------
After 6 iterations
Bins 6 and 7 will be combined since they have the smallest difference in means
Previous Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872], [0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427]
Next Probability distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]
----------------------------------
Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]) =
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 2.788450457308289
Information Loss:
Sum of Bins 3 and 4 probabilities = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 3 and 4:
H([0.5, 0.5]) =
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities * H_2 = 0.14285714285714285
----------------------------------
After 7 iterations
Bins 3 and 4 will be combined since they have the smallest difference in means
Previous Bins State: [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State: [[0.5593630989], [0.6156845434], [0.645977965, 0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]
Next Probability distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]
----------------------------------
Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]) =
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 2.6455933144511468
Information Loss:
Sum of Bins 2 and 3 probabilities = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 2 and 3:
H([0.3333333333333333, 0.6666666666666666]) =
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities * H_2 = 0.19677767872596202
----------------------------------
After 8 iterations
Bins 2 and 3 will be combined since they have the smallest difference in means
Previous Bins State: [[0.5593630989], [0.6156845434], [0.645977965, 0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State: [[0.5593630989], [0.6156845434, 0.645977965, 0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins: [0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]
Next Probability distribution of bins: [0.07142857142857142, 0.21428571428571427, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]
----------------------------------
Entropy with current bins:
H([0.07142857142857142, 0.21428571428571427, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]) =
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 2.4488156357251847
Information Loss:
Sum of Bins 5 and 6 probabilities = 0.21428571428571427 + 0.21428571428571427 = 0.42857142857142855
H_2 for bins 5 and 6:
H([0.5, 0.5]) =
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities * H_2 = 0.42857142857142855
Info loss exceeds the average info loss, so we stop the iteration process after this merge
________________________________
After discretization:
Final Bins State: [[0.5593630989], [0.6156845434, 0.645977965, 0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722, 0.9745135246, 0.9758901663, 0.9765970904]]
Final Bin Probability Distribution: [0.07142857142857142, 0.21428571428571427, 0.07142857142857142, 0.21428571428571427, 0.42857142857142855]