Bug in encoding data with pyd4 when using sparse #54

Closed
mrvollger opened this issue Jun 30, 2022 · 2 comments

mrvollger (Contributor) commented Jun 30, 2022

Hello,

I have attached an example where pyd4 doesn't encode the underlying data correctly when using the sparse builder (I think):

import pyd4
import pandas as pd
import numpy as np

def chrom_bg(sts, ens, chrom_len):
    chrom = np.zeros(chrom_len, dtype=np.int32)
    to_add = np.int32(1)
    for st, en in zip(sts, ens):
        chrom[st:en] += to_add
    print(f"total_coverage = {chrom.sum()}")
    return chrom


df = pd.read_csv("1.bed.gz", sep="\t", header=None, comment="#")
writer = (
    pyd4.D4Builder("1.d4")
    .add_chroms([("chr11", 10_000_000)])
    .for_sparse_data()
    .get_writer()
)
data = chrom_bg(df[1].to_numpy(), df[2].to_numpy(), 10_000_000)
writer.write_np_array("chr11", 0, data)

writer.close()

d4_sum = pyd4.D4File("1.d4")["chr11"].sum()
df_sum = (df[2] - df[1]).sum()

assert d4_sum == df_sum, "{} != {}".format(d4_sum, df_sum)

Error:

total_coverage = 3821
Traceback (most recent call last):
  File "/Users/mrvollger/Desktop/repos/fibertools/fibertools/test_pyd4.py", line 28, in <module>
    assert d4_sum == df_sum, "{} != {}".format(d4_sum, df_sum)
AssertionError: 0 != 3821
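
Note that the printed total_coverage already matches the expected 3821, so the array handed to the writer looks correct and the loss appears to happen on the write/read path. As a sanity check of that identity (summed per-base coverage equals summed interval lengths), here is a minimal numpy-only sketch. The intervals and the helper name `coverage_array` are made up for illustration and are not taken from 1.bed.gz:

```python
import numpy as np


def coverage_array(starts, ends, chrom_len):
    """Build per-base coverage by adding 1 over each half-open [start, end)."""
    cov = np.zeros(chrom_len, dtype=np.int32)
    for st, en in zip(starts, ends):
        cov[st:en] += 1
    return cov


# Hypothetical intervals; two of them overlap on purpose.
starts = np.array([5, 10, 12, 90])
ends = np.array([8, 20, 15, 100])

cov = coverage_array(starts, ends, chrom_len=100)
total_from_array = int(cov.sum())
total_from_intervals = int((ends - starts).sum())

# Overlaps are counted once per interval, so the two totals agree
# as long as every interval lies inside the chromosome.
print(total_from_array, total_from_intervals)
```

The identity only holds when all intervals fall within the chromosome bounds; slices that run past the end are silently truncated by numpy.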

Data file:
1.bed.gz

However, if I comment out the .for_sparse_data() I get the correct results.

Am I somehow doing the sparse encoding wrong?

Thanks,
Mitchell

mrvollger added a commit to mrvollger/fibertools that referenced this issue Jun 30, 2022
38 (Owner) commented Jun 30, 2022

Hi Mitchell,
Thanks for reporting this bug. I confirmed it: pyd4 doesn't flush the last data chunk, and your input is small enough that all of the data is lost.
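
To illustrate this class of bug, here is a toy sketch of a buffered writer that only persists full chunks. It is a hypothetical stand-in and not pyd4's actual internals; the class `ToyChunkedWriter`, the `flush_on_close` flag, and the chunk sizes are all assumptions made for the example:

```python
class ToyChunkedWriter:
    """Hypothetical buffered writer; NOT pyd4's real implementation."""

    def __init__(self, chunk_size, flush_on_close):
        self.chunk_size = chunk_size
        self.flush_on_close = flush_on_close
        self.buffer = []   # values not yet persisted
        self.stored = []   # stands in for the file on disk

    def write(self, values):
        for v in values:
            self.buffer.append(v)
            # Only a completely filled chunk is persisted here.
            if len(self.buffer) == self.chunk_size:
                self._flush()

    def _flush(self):
        self.stored.extend(self.buffer)
        self.buffer = []

    def close(self):
        # The "fix": persist the trailing partial chunk on close.
        if self.flush_on_close and self.buffer:
            self._flush()


def stored_sum(values, chunk_size, flush_on_close):
    writer = ToyChunkedWriter(chunk_size, flush_on_close)
    writer.write(values)
    writer.close()
    return sum(writer.stored)


data = [1] * 10
# Input smaller than one chunk: without the final flush, nothing is stored.
buggy_sum = stored_sum(data, chunk_size=64, flush_on_close=False)
fixed_sum = stored_sum(data, chunk_size=64, flush_on_close=True)
print(buggy_sum, fixed_sum)
```

With a larger input, the buggy variant would lose only the trailing partial chunk, which is why a small test file makes the problem so visible.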

I've published a fixed version of pyd4 on PyPI. Would you mind confirming that my fix works on your side?

Thanks,
Hao

38 added a commit that referenced this issue Jun 30, 2022
mrvollger (Contributor, Author) commented

Hi Hao,

Thanks for the quick fix! It looks like everything is working on my end now.

Thanks again,
Mitchell
