PR0Grammar/discretization-info-loss

discretization-info-loss

A discretization algorithm based on information loss. Initially, each continuous value is placed in its own bin. The algorithm then repeatedly merges the two consecutive bins whose means differ the least, and stops when the information loss of the next merge exceeds the average information loss per step. To obtain that average, the algorithm first performs a dry run that merges all the data down to a single bin and records the information loss at each step. It then repeats the merging for real, stopping at the step whose information loss exceeds the average.
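
The two passes described above can be sketched in Python. This is a minimal illustration, not the repository's code: the function names (`pair_loss`, `closest_pair`, `merge`, `discretize`) and the `include_exceeding_merge` flag are assumptions introduced here. The flag exists because the description leaves open whether the merge that first exceeds the average is itself applied.

```python
from math import log2


def pair_loss(p_a, p_b):
    """Information lost by merging two bins with probabilities p_a and p_b."""
    total = p_a + p_b
    h2 = -sum(q * log2(q) for q in (p_a / total, p_b / total))
    return total * h2


def closest_pair(bins):
    """Index i of the adjacent pair (i, i + 1) whose means differ the least."""
    means = [sum(b) / len(b) for b in bins]
    return min(range(len(bins) - 1), key=lambda i: means[i + 1] - means[i])


def merge(bins, i):
    """Return a new bin list with bins i and i + 1 combined."""
    return bins[:i] + [bins[i] + bins[i + 1]] + bins[i + 2:]


def discretize(data, include_exceeding_merge=False):
    """Hypothetical driver: returns (final bins, average information loss)."""
    n = len(data)
    start = [[x] for x in sorted(data)]

    # Pass 1: merge down to a single bin, recording the loss of every step.
    bins, losses = start, []
    while len(bins) > 1:
        i = closest_pair(bins)
        losses.append(pair_loss(len(bins[i]) / n, len(bins[i + 1]) / n))
        bins = merge(bins, i)
    average = sum(losses) / len(losses) if losses else 0.0

    # Pass 2: repeat the merges, stopping once a step's loss exceeds the average.
    bins = start
    while len(bins) > 1:
        i = closest_pair(bins)
        loss = pair_loss(len(bins[i]) / n, len(bins[i + 1]) / n)
        if loss > average:
            if include_exceeding_merge:  # assumption: optionally apply the last merge
                bins = merge(bins, i)
            break
        bins = merge(bins, i)
    return bins, average
```

On the data from the output example below, `include_exceeding_merge=True` ends in the five-bin final state shown there, while `False` stops one merge earlier with six bins.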

Output Example:

Data:  [0.5593630989, 0.9247589927, 0.9286284722, 0.7990522484, 0.8204542483, 0.6647561547, 0.9765970904, 0.9745135246, 0.8049680872, 0.7426005405, 0.6156845434, 0.9758901663, 0.645977965, 0.9194581]
Initial Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246], [0.9758901663], [0.9765970904]]
Probability distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142]


___ Getting Average Information Loss to Determine Stopping Point ___
We first determine the average information loss by combining all bins into a single bin and tracking the information loss at each iteration
This determines our stopping point for discretization: we stop once the info loss at some step exceeds the average

Information Loss:
Sum of Bins 13 and 14 probabilites = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 13 and 14: 
H([0.5, 0.5]) = 
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities *  H_2 = 0.14285714285714285


Information Loss:
Sum of Bins 12 and 13 probabilites = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 12 and 13: 
H([0.3333333333333333, 0.6666666666666666]) = 
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities *  H_2 = 0.19677767872596202


Information Loss:
Sum of Bins 10 and 11 probabilites = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 10 and 11: 
H([0.5, 0.5]) = 
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities *  H_2 = 0.14285714285714285


Information Loss:
Sum of Bins 6 and 7 probabilites = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 6 and 7: 
H([0.5, 0.5]) = 
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities *  H_2 = 0.14285714285714285


Information Loss:
Sum of Bins 8 and 9 probabilites = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 8 and 9: 
H([0.3333333333333333, 0.6666666666666666]) = 
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities *  H_2 = 0.19677767872596202


Information Loss:
Sum of Bins 6 and 7 probabilites = 0.14285714285714285 + 0.07142857142857142 = 0.21428571428571427
H_2 for bins 6 and 7: 
H([0.6666666666666666, 0.3333333333333333]) = 
-1 * (
0.6666666666666666 * log_2(0.6666666666666666) +
0.3333333333333333 * log_2(0.3333333333333333)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities *  H_2 = 0.19677767872596202


Information Loss:
Sum of Bins 3 and 4 probabilites = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 3 and 4: 
H([0.5, 0.5]) = 
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities *  H_2 = 0.14285714285714285


Information Loss:
Sum of Bins 2 and 3 probabilites = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 2 and 3: 
H([0.3333333333333333, 0.6666666666666666]) = 
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities *  H_2 = 0.19677767872596202


Information Loss:
Sum of Bins 5 and 6 probabilites = 0.21428571428571427 + 0.21428571428571427 = 0.42857142857142855
H_2 for bins 5 and 6: 
H([0.5, 0.5]) = 
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities *  H_2 = 0.42857142857142855


Information Loss:
Sum of Bins 3 and 4 probabilites = 0.07142857142857142 + 0.21428571428571427 = 0.2857142857142857
H_2 for bins 3 and 4: 
H([0.25, 0.75]) = 
-1 * (
0.25 * log_2(0.25) +
0.75 * log_2(0.75)
)
= 0.8112781244591328
Information loss = Sum of bins probabilities *  H_2 = 0.23179374984546652


Information Loss:
Sum of Bins 1 and 2 probabilites = 0.07142857142857142 + 0.21428571428571427 = 0.2857142857142857
H_2 for bins 1 and 2: 
H([0.25, 0.75]) = 
-1 * (
0.25 * log_2(0.25) +
0.75 * log_2(0.75)
)
= 0.8112781244591328
Information loss = Sum of bins probabilities *  H_2 = 0.23179374984546652


Information Loss:
Sum of Bins 2 and 3 probabilites = 0.2857142857142857 + 0.42857142857142855 = 0.7142857142857142
H_2 for bins 2 and 3: 
H([0.4, 0.6000000000000001]) = 
-1 * (
0.4 * log_2(0.4) +
0.6000000000000001 * log_2(0.6000000000000001)
)
= 0.9709505944546685
Information loss = Sum of bins probabilities *  H_2 = 0.6935361388961917


Information Loss:
Sum of Bins 1 and 2 probabilites = 0.2857142857142857 + 0.7142857142857143 = 1.0
H_2 for bins 1 and 2: 
H([0.2857142857142857, 0.7142857142857143]) = 
-1 * (
0.2857142857142857 * log_2(0.2857142857142857) +
0.7142857142857143 * log_2(0.7142857142857143)
)
= 0.863120568566631
Information loss = Sum of bins probabilities *  H_2 = 0.863120568566631


Average info loss is: 0.29287345554289257
_______________________________
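
The average reported above can be cross-checked. Merging two bins with probabilities p_a and p_b lowers the Shannon entropy of the bin distribution by exactly (p_a + p_b) * H_2, so the 13 losses should add up to the initial entropy log_2(14), since a single bin has entropy 0. The snippet below is only a sanity check using values copied from the log; the variable names are illustrative.

```python
from math import log2

# Per-merge losses copied from the 13 steps listed above.
losses = [
    0.14285714285714285, 0.19677767872596202, 0.14285714285714285,
    0.14285714285714285, 0.19677767872596202, 0.19677767872596202,
    0.14285714285714285, 0.19677767872596202, 0.42857142857142855,
    0.23179374984546652, 0.23179374984546652, 0.6935361388961917,
    0.863120568566631,
]

n = 14                            # number of data points (initial singleton bins)
total = sum(losses)               # entropy removed by merging down to one bin
average = total / len(losses)     # average loss per merge
initial_entropy = log2(n)         # entropy of 14 equally likely bins
```

Under this identity, the average for n distinct values is simply log_2(n) / (n - 1).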

___ Discretization Process ___
Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142]) = 
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142)
)
= 3.8073549220576055
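
The entropy printed at each step is the Shannon entropy, in bits, of the current bin probabilities. A small helper could compute it as follows; the function name `entropy` is an assumption and is not taken from the repository.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)
```

For the 14 equally likely bins above this evaluates to log_2(14), which agrees with the printed value up to floating-point rounding.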


Information Loss:
Sum of Bins 13 and 14 probabilites = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 13 and 14: 
H([0.5, 0.5]) = 
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities *  H_2 = 0.14285714285714285


----------------------------------
After 1 iterations
Bins 13 and 14 will be combined since they have smallest difference in means
Previous Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246], [0.9758901663], [0.9765970904]]
Next Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246], [0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142]
Next Probability distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285]
----------------------------------
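
Each iteration replaces two neighbouring bins with their union and adds their probabilities, as in the Previous/Next states above. A minimal sketch, using a hypothetical `merge_adjacent` helper and 0-based indices (so "Bins 13 and 14" corresponds to `i = 12`):

```python
def merge_adjacent(bins, probs, i):
    """Combine bins i and i + 1 (0-based) and add their probabilities."""
    new_bins = bins[:i] + [bins[i] + bins[i + 1]] + bins[i + 2:]
    new_probs = probs[:i] + [probs[i] + probs[i + 1]] + probs[i + 2:]
    return new_bins, new_probs
```

The inputs are left unmodified, which makes it easy to print the previous and next states side by side.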

Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285]) = 
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285)
)
= 3.6644977792004623


Information Loss:
Sum of Bins 12 and 13 probabilites = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 12 and 13: 
H([0.3333333333333333, 0.6666666666666666]) = 
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities *  H_2 = 0.19677767872596202


----------------------------------
After 2 iterations
Bins 12 and 13 will be combined since they have smallest difference in means
Previous Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246], [0.9758901663, 0.9765970904]]
Next Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285]
Next Probability distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427]
----------------------------------

Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427]) = 
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 3.4677201004744997


Information Loss:
Sum of Bins 10 and 11 probabilites = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 10 and 11: 
H([0.5, 0.5]) = 
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities *  H_2 = 0.14285714285714285


----------------------------------
After 3 iterations
Bins 10 and 11 will be combined since they have smallest difference in means
Previous Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927], [0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427]
Next Probability distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]
----------------------------------

Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]) = 
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 3.3248629576173565


Information Loss:
Sum of Bins 6 and 7 probabilites = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 6 and 7: 
H([0.5, 0.5]) = 
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities *  H_2 = 0.14285714285714285


----------------------------------
After 4 iterations
Bins 6 and 7 will be combined since they have smallest difference in means
Previous Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484], [0.8049680872], [0.8204542483], [0.9194581], [0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872], [0.8204542483], [0.9194581], [0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]
Next Probability distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]
----------------------------------

Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]) = 
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 3.1820058147602133


Information Loss:
Sum of Bins 8 and 9 probabilites = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 8 and 9: 
H([0.3333333333333333, 0.6666666666666666]) = 
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities *  H_2 = 0.19677767872596202


----------------------------------
After 5 iterations
Bins 8 and 9 will be combined since they have smallest difference in means
Previous Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872], [0.8204542483], [0.9194581], [0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872], [0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.21428571428571427]
Next Probability distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427]
----------------------------------

Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427]) = 
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 2.9852281360342516


Information Loss:
Sum of Bins 6 and 7 probabilites = 0.14285714285714285 + 0.07142857142857142 = 0.21428571428571427
H_2 for bins 6 and 7: 
H([0.6666666666666666, 0.3333333333333333]) = 
-1 * (
0.6666666666666666 * log_2(0.6666666666666666) +
0.3333333333333333 * log_2(0.3333333333333333)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities *  H_2 = 0.19677767872596202


----------------------------------
After 6 iterations
Bins 6 and 7 will be combined since they have smallest difference in means
Previous Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872], [0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427]
Next Probability distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]
----------------------------------

Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]) = 
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 2.788450457308289


Information Loss:
Sum of Bins 3 and 4 probabilites = 0.07142857142857142 + 0.07142857142857142 = 0.14285714285714285
H_2 for bins 3 and 4: 
H([0.5, 0.5]) = 
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities *  H_2 = 0.14285714285714285


----------------------------------
After 7 iterations
Bins 3 and 4 will be combined since they have smallest difference in means
Previous Bins State:  [[0.5593630989], [0.6156845434], [0.645977965], [0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State:  [[0.5593630989], [0.6156845434], [0.645977965, 0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]
Next Probability distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]
----------------------------------

Entropy with current bins:
H([0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]) = 
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.14285714285714285 * log_2(0.14285714285714285) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 2.6455933144511468


Information Loss:
Sum of Bins 2 and 3 probabilites = 0.07142857142857142 + 0.14285714285714285 = 0.21428571428571427
H_2 for bins 2 and 3: 
H([0.3333333333333333, 0.6666666666666666]) = 
-1 * (
0.3333333333333333 * log_2(0.3333333333333333) +
0.6666666666666666 * log_2(0.6666666666666666)
)
= 0.9182958340544896
Information loss = Sum of bins probabilities *  H_2 = 0.19677767872596202


----------------------------------
After 8 iterations
Bins 2 and 3 will be combined since they have smallest difference in means
Previous Bins State:  [[0.5593630989], [0.6156845434], [0.645977965, 0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Next Bins State:  [[0.5593630989], [0.6156845434, 0.645977965, 0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722], [0.9745135246, 0.9758901663, 0.9765970904]]
Previous Probability Distribution of bins:  [0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]
Next Probability distribution of bins:  [0.07142857142857142, 0.21428571428571427, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]
----------------------------------

Entropy with current bins:
H([0.07142857142857142, 0.21428571428571427, 0.07142857142857142, 0.21428571428571427, 0.21428571428571427, 0.21428571428571427]) = 
-1 * (
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.07142857142857142 * log_2(0.07142857142857142) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427) +
0.21428571428571427 * log_2(0.21428571428571427)
)
= 2.4488156357251847


Information Loss:
Sum of Bins 5 and 6 probabilites = 0.21428571428571427 + 0.21428571428571427 = 0.42857142857142855
H_2 for bins 5 and 6: 
H([0.5, 0.5]) = 
-1 * (
0.5 * log_2(0.5) +
0.5 * log_2(0.5)
)
= 1.0
Information loss = Sum of bins probabilities *  H_2 = 0.42857142857142855


Info loss exceeds average info loss, so we stop the iteration process

________________________________
After discretization: 
Final Bins State:  [[0.5593630989], [0.6156845434, 0.645977965, 0.6647561547], [0.7426005405], [0.7990522484, 0.8049680872, 0.8204542483], [0.9194581, 0.9247589927, 0.9286284722, 0.9745135246, 0.9758901663, 0.9765970904]]
Final Bin Probability Distribution:  [0.07142857142857142, 0.21428571428571427, 0.07142857142857142, 0.21428571428571427, 0.42857142857142855]

