Presence of NaN in unrelated columns breaks DABEST #44

DizietAsahi · 2019-06-17T21:37:00Z

While working with a large dataframe containing several columns, only some of which I wanted to analyze with dabest, I realized that columns unrelated to the comparison (i.e. columns not passed as the x/y parameters) interfere with the results.

Demonstration:

import dabest
dabest.__version__
'0.2.4'

Create an example dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'groups': np.random.choice(['Group 1', 'Group 2', 'Group 3'], size=(100,)),
     'value': np.random.random(size=(100,))})
df['unrelated'] = np.nan
df.head()

groups	value	unrelated
Group 1	0.592223	NaN
Group 1	0.432398	NaN
Group 3	0.714241	NaN
Group 1	0.889762	NaN
Group 1	0.388109	NaN

Compare Group 1 vs Group 2:

test = dabest.load(data=df, x='groups', y='value', idx=['Group 1', 'Group 2'])
test.mean_diff

This generates a bunch of warnings:

.../numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
.../numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
.../dabest/_stats_tools/confint_2group_diff.py:157: RuntimeWarning: invalid value encountered in less
prop_less_than_es = sum(B < effsize) / len(B)
.../dabest/_classes.py:545: UserWarning: The lower limit of the BCa interval cannot be computed. It is set to the effect size itself. All bootstrap values were likely all the same.
stacklevel=0)
.../dabest/_classes.py:550: UserWarning: The upper limit of the BCa interval cannot be computed. It is set to the effect size itself. All bootstrap values were likely all the same.
stacklevel=0)
.../scipy/stats/stats.py:5001: RuntimeWarning: divide by zero encountered in double_scalars
z = (bigu - meanrank) / sd
.../numpy/core/fromnumeric.py:3367: RuntimeWarning: Degrees of freedom <= 0 for slice
**kwargs)
.../numpy/core/_methods.py:110: RuntimeWarning: invalid value encountered in true_divide
arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
.../numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)

and then the result is incorrect:

(...)
The unpaired mean difference between Group 1 and Group 2 is nan [95%CI nan, nan].
The two-sided p-value of the Mann-Whitney test is 0.0.
(...)
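The "Mean of empty slice" warnings suggest that no rows were left by the time the statistics were computed. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that rows are dropped whenever any column contains NaN. The pandas-only sketch below does not call dabest at all; it only illustrates how such a drop would empty this dataframe, while a drop restricted to the compared columns would not. The seeded generator is my own choice for repeatability and is not part of the original report.

```python
import numpy as np
import pandas as pd

# Rebuild a dataframe shaped like the one in the report (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "groups": rng.choice(["Group 1", "Group 2", "Group 3"], size=100),
    "value": rng.random(size=100),
})
df["unrelated"] = np.nan  # column that is not part of the comparison

# Dropping rows that have a NaN in *any* column removes every row here,
# because 'unrelated' is NaN everywhere.
dropped_all = df.dropna()

# Restricting the NaN check to the columns actually compared keeps the data.
dropped_subset = df.dropna(subset=["groups", "value"])

print(len(dropped_all), len(dropped_subset))

# The mean of the now-empty 'value' column is NaN, matching the 'nan' effect size.
empty_mean = dropped_all["value"].mean()
```

If this is what happens internally, it would be consistent with both the NaN effect size and the warnings about empty slices.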

Running the same analysis while keeping only the relevant columns generates the correct result:

test = dabest.load(data=df[['groups','value']], x='groups', y='value', idx=['Group 1', 'Group 2'])
test.mean_diff

(...)
The unpaired mean difference between Group 1 and Group 2 is -0.0708 [95%CI -0.202, 0.0631].
The two-sided p-value of the Mann-Whitney test is 0.268.
(...)

Alternatively, if the unrelated column(s) do not contain NaNs, everything works as expected:

df.unrelated = 0

test = dabest.load(data=df, x='groups', y='value', idx=['Group 1', 'Group 2'])
test.mean_diff

(...)
The unpaired mean difference between Group 1 and Group 2 is -0.0708 [95%CI -0.202, 0.0631].
The two-sided p-value of the Mann-Whitney test is 0.268.
(...)
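Until a fix is released, the column-subsetting workaround above can be wrapped in a small helper. This is only a sketch: `select_analysis_columns` is a hypothetical name, not part of dabest, and the example dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd


def select_analysis_columns(data, x, y):
    """Hypothetical helper: keep only the x and y columns and drop rows
    where either of them is missing. Other columns are discarded, so NaNs
    in them cannot affect the analysis."""
    return data[[x, y]].dropna().reset_index(drop=True)


# Small made-up example: one missing value in 'value', all-NaN 'unrelated'.
example = pd.DataFrame({
    "groups": ["Group 1", "Group 2", "Group 1", "Group 2"],
    "value": [0.1, 0.2, np.nan, 0.4],
    "unrelated": [np.nan] * 4,
})

clean = select_analysis_columns(example, x="groups", y="value")
print(clean)
```

The returned dataframe can then be passed as `data=` to `dabest.load` with the same `x`, `y` and `idx` arguments as in the report.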


josesho · 2019-06-18T01:45:12Z

Thanks for the excellent diagnosis of the problem, @DizietAsahi! This was very recently brought to my attention by a colleague as well. Expect a bugfix shortly. Thanks!

josesho self-assigned this Jun 18, 2019

josesho added the bug label Jun 18, 2019

josesho added this to the v0.2.5 milestone Jun 18, 2019

This was referenced Sep 3, 2019

v0.2.5 #63 (Closed)

v0.2.5 #64 (Merged)

josesho closed this as completed Sep 6, 2019
