#################################
##-- Data Imputation Project --##
#################################
~ Comparison of GAIN, MisGAN and KNN ~
Step-by-step notes
Using Python 3.8.5
Numpy      => np.__version__      = 1.19.2
Pandas     => pd.__version__      = 1.1.3
Tensorflow => tf.__version__      = 2.4.1
Sklearn    => sklearn.__version__ = 0.23.2
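Quick sanity check that the environment matches the pinned versions (a minimal sketch):

    import sys
    import numpy as np
    import pandas as pd
    import tensorflow as tf
    import sklearn

    # Print the versions this project was run with (values pinned above).
    print(sys.version.split()[0])  # expect 3.8.5
    print(np.__version__)          # expect 1.19.2
    print(pd.__version__)          # expect 1.1.3
    print(tf.__version__)          # expect 2.4.1
    print(sklearn.__version__)     # expect 0.23.2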
* pipeline1:
----------
On mydata1 (multivariate Gaussian), MCAR with missing_rate=20%
Trying GAIN and MisGAN, trained from 1000 to 20000 epochs.
Trying KNN with uniform and distance weights, from 2 to 300 neighbours.
(A masking/KNN sketch follows this entry.)
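A minimal sketch of this setup; the helper name make_mcar_mask and the (1000, 5)
data shape are illustrative stand-ins, not the project's actual code:

    import numpy as np
    from sklearn.impute import KNNImputer

    def make_mcar_mask(shape, missing_rate=0.2, seed=0):
        # MCAR: each cell goes missing independently with the same probability.
        rng = np.random.RandomState(seed)
        return rng.uniform(size=shape) < missing_rate

    X = np.random.RandomState(1).normal(size=(1000, 5))  # stand-in for mydata1
    mask = make_mcar_mask(X.shape, missing_rate=0.2)
    X_miss = X.copy()
    X_miss[mask] = np.nan

    # KNN baseline: sweep the weighting scheme and the neighbourhood size.
    for weights in ("uniform", "distance"):
        for k in (2, 10, 50, 300):
            X_imp = KNNImputer(n_neighbors=k, weights=weights).fit_transform(X_miss)
            rmse = np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2))
            print(weights, k, round(rmse, 4))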
* pipeline2:
----------
Same setup with mydata2 (mixture of 3 Gaussians); a sampling sketch follows this entry.
These runs are the ones used in the paper to justify the hyperparameter choices.
Conclusion -> GAIN: 20000 epochs, MisGAN: 5000 epochs, KNNs: 50 neighbours
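For reference, mydata2-style data can be generated like this (the component
weights, means and dimension below are illustrative; the real mydata2 parameters
are defined elsewhere in the project):

    import numpy as np

    def sample_gmm3(n, d=5, seed=0):
        # Mixture of 3 Gaussians: pick a component per row, then sample from it.
        rng = np.random.RandomState(seed)
        means = np.array([[-4.0] * d, [0.0] * d, [5.0] * d])  # illustrative means
        comps = rng.choice(3, size=n, p=[0.3, 0.4, 0.3])      # illustrative weights
        return means[comps] + rng.normal(size=(n, d))

    mydata2_like = sample_gmm3(1000)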
* pipeline3:
----------
Extra analysis for GAIN on mydata2 (mixture of 3 Gaussians).
Train from 10000 to 100000 epochs (to see if overfitting happens).
Conclusion -> 20000 epochs for GAIN seems good!
* pipeline4:
----------
On mydata2 (mixture of 3 Gaussians)
Try missing_rate from 10% to 80% (MCAR)
Conclusion -> MisGAN performs poorly...
* pipeline5:
----------
MCAR missing_rate=20% on the 7 real datasets
Conclusion -> Dataset "news" gives really bad results (because of its outliers)
-> GAIN much better than MisGAN
-> KNN still better than GAIN
* pipeline6:
----------
Using mydata1 (multivar. Gaussian) and mydata2 (mixture of 3 Gaussians)
MCAR with a different missing_rate per variable (10%, 10%, 40%, 60%, 80%); sketch below.
Conclusion -> MisGAN poor, GAIN good
-> KNN still slightly better than GAIN
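A sketch of the per-variable MCAR masking (the helper name is illustrative):

    import numpy as np

    def make_mcar_mask_per_column(shape, rates, seed=0):
        # Still MCAR (independent of the data values), but each column
        # gets its own missing rate, e.g. (0.10, 0.10, 0.40, 0.60, 0.80).
        rng = np.random.RandomState(seed)
        rates = np.asarray(rates)
        return rng.uniform(size=shape) < rates[np.newaxis, :]

    mask = make_mcar_mask_per_column((1000, 5), (0.10, 0.10, 0.40, 0.60, 0.80))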
* pipeline7:
----------
Using mydata1 (multivar. Gaussian) and mydata2 (mixture of 3 Gaussians)
MAR based on the first column! (a mask sketch follows this entry)
Conclusion -> MisGAN bad... GAIN is better than KNN??
-> Yes, I ran it twice and the result is consistent!
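One way to build such a MAR mask; the notes only say the mechanism is driven by
the first column, so the rank-based scaling below is an assumption:

    import numpy as np

    def make_mar_mask(X, missing_rate=0.2, seed=0):
        # MAR: missingness in the other columns depends only on the (always
        # observed) first column; here the probability grows with its rank
        # and averages out to missing_rate.
        rng = np.random.RandomState(seed)
        n, d = X.shape
        ranks = np.argsort(np.argsort(X[:, 0])) / (n - 1)  # in [0, 1]
        p = 2.0 * missing_rate * ranks
        mask = rng.uniform(size=(n, d)) < p[:, np.newaxis]
        mask[:, 0] = False  # the driving column stays fully observed
        return mask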
* pipeline8:
----------
MAR on every dataset (one variable is selected to drive the MAR probabilities via its quantiles)
Average missing_rate is 20% (probabilities evenly scaled between 0% and 40%); sketch below.
Conclusion -> Dataset "news" is bad due to outliers
-> MisGAN extremely poor
-> KNN seems to perform better
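A sketch of the quantile-driven MAR mechanism described above; the bin count and
helper names are illustrative, while the even 0%-to-40% scaling is from the notes:

    import numpy as np

    def quantile_probs(v, max_rate=0.4, n_bins=10):
        # Map each value of the driving variable to a missingness probability:
        # the quantile bins get probabilities evenly spaced from 0 to max_rate,
        # so the average rate is max_rate / 2 (here 20%).
        edges = np.quantile(v, np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(v, edges)  # 0 .. n_bins-1
        return bins / (n_bins - 1) * max_rate

    def make_mar_quantile_mask(X, driver_col=0, max_rate=0.4, seed=0):
        rng = np.random.RandomState(seed)
        p = quantile_probs(X[:, driver_col], max_rate)
        mask = rng.uniform(size=X.shape) < p[:, np.newaxis]
        mask[:, driver_col] = False  # the driving variable stays observed
        return mask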
* pipeline9:
----------
MAR using quantiles on the same variable as above
Average missing_rate is 45% (evenly scaled between 0 and 90%)
Conclusion -> Dataset "news" definitely shity
-> MisGAN still poor
-> KNN still better
* pipeline10:
-----------
MNAR using the quantiles of each variable's own values (a mask sketch follows this entry)
Average missing_rate of 20% (evenly scaled between 0 and 40%)
Conclusion -> "news" is still a bad dataset
-> MisGAN poor
-> KNN and GAIN are comparable
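A sketch of the per-variable MNAR masking; empirical ranks stand in for the
quantiles, and the even 0%-to-40% scaling matches the notes:

    import numpy as np

    def make_mnar_quantile_mask(X, max_rate=0.4, seed=0):
        # MNAR: each cell's missingness probability depends on that cell's own
        # value -- its quantile (rank) within its column -- evenly scaled from
        # 0 to max_rate, so the average rate is max_rate / 2 (here 20%).
        rng = np.random.RandomState(seed)
        n, _ = X.shape
        ranks = np.argsort(np.argsort(X, axis=0), axis=0) / (n - 1)
        return rng.uniform(size=X.shape) < ranks * max_rate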
* pipeline11:
-----------
MNAR using quantiles for each variable
Average missing_rate of 45% (evenly scaled between 0 and 90%)
Conclusion -> "news" is bad, but now so is the "spam" dataset!! :o
-> MisGAN poor
-> Now GAIN does better