Debugging and Continued Stability of RAREsim v2 #6

Open
JessMurphy opened this issue Nov 1, 2023 · 20 comments

Comments

@JessMurphy

Group

The Hendricks Research Group

Contact info

Audrey Hendricks, PI, audrey.hendricks@cuanschutz.edu
Jessica Murphy, Lead Contact, jessica.murphy@cuanschutz.edu
Ryan Barnard, Lead Developer, rbarnard1107@gmail.com
Megan Null, Senior Author, megan.null@ucdenver.edu

Links to code

https://github.com/RMBarnard/raresim/tree/main

Workflow

RAREsim is a Python interface for performing scalable rare variant simulations. It is essentially a few Python scripts backed by some C code that implements the data structures. Git and GitHub have been used to track the different versions of the package, and it is currently being used on a computing cluster in a Linux environment.

Work description

We need assistance debugging a few specific errors:

  1. Pruning error when the observed number of variants is less than the expected (see attached file for screenshot of error)
  2. Warning that the legend and haplotype files are different lengths (this is likely due to the sparse matrix data structure)
  3. Error with the -z flag

We would also like continued stability and maintenance of RAREsim.

All data we are using is publicly available so there is no PHI.

pruneError

Timeline

We are currently running simulations and writing 2-3 papers based on the use of this software. So, we hope the debugging can be done as soon as possible and an initial plan for long term maintenance can be developed within the next couple of months.

Funding

No response

@vincerubinetti
Contributor

Hi @JessMurphy ,

Sorry for the delay in response. Our workload is fairly full right now, so we may not be able to help in an ongoing manner at this time. However, we may be able to provide a few consultations. At the very least, we can sit down with you in an initial meeting to see exactly what your needs are, in detail.

Would you mind finding a time in our booking tool here where we can chat:

https://outlook.office365.com/owa/calendar/SoftwareEngineeringTeam@olucdenver.onmicrosoft.com/bookings/

@JessMurphy
Author

Attached is data and example code to generate the three errors discussed above.

example_code.zip

@vincerubinetti
Contributor

vincerubinetti commented Dec 7, 2023

Hi Jess, we have some insights on these problems.

A side note, I'd recommend formatting all the Python with Black and the C with something equivalent.

@falquaddoomi and @d33bs can feel free to comment with more insights.


Problem 1

Skipped for now since it seems you at least have a work-around in place. I'll continue to investigate, but this problem might require more knowledge about the science and thus require a sit-down. I'm not a biologist or data scientist really, so I don't know what this comment in the bash script means:

# produces the following error if the number of observed functional variants (0) is less than the number of expected functional variants (0.36) for the [201,400] MAC bin from the first pruning step above (may need to rerun the first pruning step to reproduce the error)

Problem 2

I don't think this is actually a problem, I think the warning is correct.

# produces WARNING: Lengths of legend 19029 and hap 19027 files do not match

Looking in the legend file, it is indeed 19029 long:

Screenshot 2023-12-07 at 5 19 41 PM

(minus the first header row)

Then let's try to trace back the matrix row count of 19027:

  • In sim.py ... verify_legend(legend, legend_header, M, func_split, args.prob) gets called.
  • In header.py ... M.num_rows() != len(legend) compares length of legend list (simple) to num_rows on the matrix. Note the adjacent comment # TODO: This check has a bug in it somewhere. Likely in the C code.
  • In lists.c ... uint32_t_sparse_martix_num_rows(struct uint32_t_sparse_matrix *m) simply returns m->rows.
  • Back in sim.py ... M.load(args.sparse_matrix) gets called.
  • In lists.c ... there are a few matrix load/read functions (read_matrix, read_uncompressed_matrix, etc.), but the one being called in this case is uint32_t_sparse_matrix_read. In it, the file gets read in chunks of 4 bytes (uint32_t) at a time. You'll notice m->rows and m->cols get assigned right here, after individual calls to fread.

TL;DR the number of rows/cols in the matrix is read directly from the first 8 bytes of the .sm file. And if we open that file in a hex viewer, we see that the first 4 bytes are indeed 19027.

Screenshot 2023-12-07 at 5 35 06 PM
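
The same check can be done without a hex viewer. Below is a minimal Python sketch that reads the two header values. It assumes the header is exactly two unsigned 32-bit integers (rows, then cols) stored little-endian, which matches the description above on typical x86/ARM machines but has not been confirmed for every platform. The function name read_sm_header is made up for illustration and is not part of the package.

```python
import struct


def read_sm_header(path):
    """Return (rows, cols) read from the first 8 bytes of a .sm file.

    Assumption: the file starts with two little-endian uint32 values,
    rows first and cols second.
    """
    with open(path, "rb") as fh:
        header = fh.read(8)
    if len(header) < 8:
        raise ValueError(f"{path} is too short to contain a rows/cols header")
    rows, cols = struct.unpack("<II", header)
    return rows, cols
```

Calling read_sm_header on the controls.haps.sm file and comparing the first value with the legend length (minus its header row) should show the same mismatch that the warning reports.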

So, no error in the C code. I think it is a problem with whatever generated the matrix file. Or perhaps the matrix really was 19027 rows and this is just a human mistake?

Side note

Here:

    if M.num_rows() != len(legend):
        # TODO: This check has a bug in it somewhere. Likely in the C code
        # raise DifferingLengths(f"Lengths of legend {len(legend)} and hap {M.num_rows()} files do not match")
        print(f"WARNING: Lengths of legend {len(legend)} and hap {M.num_rows()} files do not match")

You can uncomment that raise line and delete the print line, because in sim.py where this is called, it already tries to catch an exception and prints a warning there.
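
As a rough illustration of that raise-then-catch pattern, here is a self-contained Python sketch. The DifferingLengths class below is a stand-in for the package's exception of the same name, and verify_lengths and check_and_warn are hypothetical helpers, not the actual functions in header.py or sim.py.

```python
class DifferingLengths(Exception):
    """Stand-in for the package's exception of the same name."""


def verify_lengths(legend_len, matrix_rows):
    # Raise instead of printing, so the caller decides how to report it.
    if matrix_rows != legend_len:
        raise DifferingLengths(
            f"Lengths of legend {legend_len} and hap {matrix_rows} files do not match"
        )


def check_and_warn(legend_len, matrix_rows):
    # The caller catches the exception and turns it into a warning.
    try:
        verify_lengths(legend_len, matrix_rows)
    except DifferingLengths as exc:
        warning = f"WARNING: {exc}"
        print(warning)
        return warning
    return None
```

Keeping the message in one place (the exception) avoids having the check and the caller drift apart.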


Problem 3

This boils down to the line all_kept_rows = list(merge(all_kept_rows, sorted(R))) in header.py, where merge is heapq.merge. The call ends up comparing a number to a string, which raises a TypeError. (merge "assumes that each of the input streams is already sorted", so it never re-sorts its inputs, but it still has to compare elements across the inputs in order to interleave them.)

sorted(R) looks like ["fun", "syn"] and R looks like {"fun": [1,2,3], "syn": [4,5,6]}, because iterating over a dict yields its keys. So merge is trying to interleave two lists that look like [1,2,3,4,5] and ["fun", "syn"] into a single sorted list, and the first comparison between an integer and a string fails.
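
The failure can be reproduced in isolation. This is a small sketch using made-up values shaped like the ones described above; try_merge is a hypothetical helper that only exists to capture the exception.

```python
from heapq import merge

# Made-up values with the same shape as those described above.
all_kept_rows = [1, 2, 3, 4, 5]
R = {"fun": [1, 2, 3], "syn": [4, 5, 6]}


def try_merge(kept, groups):
    """Run the problematic call and return either the result or the error."""
    try:
        # sorted(groups) sorts the dict's keys, i.e. ["fun", "syn"].
        return list(merge(kept, sorted(groups)))
    except TypeError as exc:
        return exc


result = try_merge(all_kept_rows, R)
print(type(result).__name__, result)
```

The call returns a TypeError because merge has to compare an int from the first input with a str from the second.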

TL;DR I think the intent is to get all the lists (of matrix indices, presumably) in R, concat them together, sort them, then merge it with all_kept_rows, like this:

    if z:
        # requires "from itertools import chain" at the top of the file
        all_kept_rows = list(merge(all_kept_rows, sorted(chain(*R.values()))))
        # OR, without itertools (use one of the two, not both):
        # all_kept_rows = list(merge(all_kept_rows, sorted([item for sublist in R.values() for item in sublist])))

I believe the if keep_protected: block of code below it will also need to be updated in the same way.
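
For reference, here is a self-contained sketch of the proposed approach outside of header.py. The helper name merge_kept_rows and the sample values are hypothetical, and the sketch assumes all_kept_rows is already sorted, since heapq.merge relies on that.

```python
from heapq import merge
from itertools import chain


def merge_kept_rows(all_kept_rows, R):
    """Merge the sorted all_kept_rows with every index stored in R's lists."""
    # Flatten the dict's lists of indices into one sequence, then sort it.
    extra_rows = sorted(chain.from_iterable(R.values()))
    # Both inputs are now sorted, which is what heapq.merge expects.
    return list(merge(all_kept_rows, extra_rows))


merged = merge_kept_rows([0, 7, 9], {"fun": [3, 1, 2], "syn": [6, 4, 5]})
print(merged)  # → [0, 1, 2, 3, 4, 5, 6, 7, 9]
```

If the same index can appear in both inputs, it will appear twice in the result; whether that matters depends on how all_kept_rows is used later.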

Faisal suggested that maybe the R variable was originally just a list, and was later modified to be a dict of lists, but this line was forgotten to be changed.

Dave brings up that older versions of Python treated dicts differently, and maybe this code would've run as intended on an older version. We wanted to verify that the intended version is 3.7, which should probably be specified explicitly in an env file or somewhere.
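
Besides an env file, one lightweight option is a guard at the top of the entry-point script. The sketch below assumes 3.7 is the minimum supported version, which is exactly the point that still needs to be verified; python_is_supported and MIN_PYTHON are hypothetical names.

```python
import sys

MIN_PYTHON = (3, 7)  # assumed minimum; adjust once the intended version is confirmed


def python_is_supported(current=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if current is None:
        current = sys.version_info[:2]
    return tuple(current) >= tuple(minimum)


if not python_is_supported():
    raise SystemExit(
        f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
        f"found {sys.version_info.major}.{sys.version_info.minor}"
    )
```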

@JessMurphy
Author

Thanks, Vince! Yes, Problem 1 is not that urgent, but I would be more than happy to meet to try to further explain it. For Problem 2, both the legend file and initial haplotype file (.gz) have 19029 rows but when the initial haplotype file is converted into a sparse matrix (.sm) using convert.py it somehow only has 19027 rows (when it should have 19029). So, we think it is an issue with the sparse function called in the convert.py script. And I will look further into Problem 3.

@vincerubinetti
Contributor

Ah okay, I didn't notice that that .sm file was also being generated by this package.

So, at least in the specific case of the 19027 vs 19029 matrix, I was able to find a fix:

In uint32_t_sparse_matrix_add, comment out these lines that conditionally increment the row count:

    // if (m->rows < row + 1)
    //    m->rows = row + 1; 

and in add_buffer_to_matrix, just increment the row when we encounter a newline character:

        if ( (int)buffer[i] == NEWLINE ) {
            // ...
            M->rows += 1;
            continue;
        }

Please test this out with various different files to make sure it works reliably.
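
To illustrate why the two counting strategies can disagree, here is a Python sketch (not the C code itself). It assumes, without confirmation from the thread, that entries are only added to the sparse matrix for nonzero values, so rows at the end of the file that contain only zeros never bump the row count. Both function names and the sample data are made up.

```python
def rows_from_entries(text):
    """Mimic the old behaviour: the count grows only when a nonzero entry is added."""
    rows = 0
    for row, line in enumerate(text.splitlines()):
        for value in line.split():
            if value != "0":
                rows = max(rows, row + 1)
    return rows


def rows_from_newlines(text):
    """Mimic the fix: every newline character counts as one row."""
    return text.count("\n")


# Four rows, the last two of which contain only zeros.
sample = "0 1 0\n1 0 0\n0 0 0\n0 0 0\n"
print(rows_from_entries(sample))   # → 2
print(rows_from_newlines(sample))  # → 4
```

Note that counting newlines would itself undercount by one for a file whose last line has no trailing newline, which is one reason to test with several different inputs.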

I'm a little perplexed at why the row count was conditionally incremented the way it was, in that separate function. It seems like the parsing could be done in a simpler way with less code repetition.

@JessMurphy
Author

Thanks! So would you be able to update the code with the potential fixes for Problems 2 and 3 so we can test it out? No one on our team currently knows C and I have minimal experience with Python.

@JessMurphy
Author

JessMurphy commented Jan 10, 2024

So I tested out the recent changes and it seems the convert.py script still produces a sparse matrix of length 19027 instead of 19029 (for the chr19.block37.NFE.sim3.controls.haps.gz example) and I'm receiving a similar error as before for the z flag (see below).

image

@vincerubinetti
Contributor

vincerubinetti commented Jan 11, 2024

For the z flag, in the PR I forgot to do R.values() (as in my above comment). Instead I did just R.

For the sparse matrix length difference, are you still running that pruning_code.sh script? The first thing that script does is try to load the controls.haps.sm file, which simply has the number of rows and cols hard-coded in the first few bytes from a previous run of convert.py, as we discussed above.

If I first generate the .sm file like this, with the fix from the PR...

python3 ${WD}/raresim/convert.py \
    -i ${WD}/example_code/chr19.block37.NFE.sim3.controls.haps.gz \
    -o ${WD}/example_code/chr19.block37.NFE.sim3.controls.haps.sm

...it generates a .sm file with 19029 rows, the correct number. And then a subsequent call to sim.py doesn't show the mismatch warning.
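
To confirm the expected row count independently of the C code, the lines of the gzipped haplotype file can be counted directly. This is a minimal sketch; count_gz_lines is a hypothetical helper, and it assumes the .haps.gz file is plain text with one variant per line.

```python
import gzip


def count_gz_lines(path):
    """Count the lines in a gzip-compressed text file."""
    with gzip.open(path, "rt") as fh:
        return sum(1 for _ in fh)
```

The result for chr19.block37.NFE.sim3.controls.haps.gz should equal the row count stored in the regenerated .sm file.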

@JessMurphy
Author

JessMurphy commented Jan 11, 2024

Yes, I'm re-generating the sparse matrix with the code you cited above and still getting 19027 rows. Could it have something to do with my environment? I'm using a Singularity container on a computing server that has Python version 3.10.6. Also, when I run the setup.py script, I receive the attached output, and I'm not sure whether it indicates an issue.

setup py output 1 setup py output 2 setup py output 3

@vincerubinetti
Contributor

I'll probably need more details to be able to help; I didn't know this code was running in a Singularity container on a server. Is that the ultimate intended environment for this to run in? The readme doesn't mention it. Can you verify the server has the latest version of the C code, with the added M->rows += 1; line?
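
One way to check which build is actually being used inside the container is to print the interpreter version and the file the extension module is loaded from. The sketch below assumes the compiled module is importable as rareSim (guessed from rareSim.pyx in the build output); module_origin is a hypothetical helper.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print("Python:", sys.version.split()[0])
# "rareSim" is an assumed module name; adjust if the package installs under another name.
print("rareSim loaded from:", module_origin("rareSim"))
```

If the reported path points at an older install, or its modification time predates the fix, the container is likely still using a stale build.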

I'm running it locally with Python v3.11.2, and getting the following output:

Log
(base) Vincents-MacBook-Pro:raresim vincerubinetti$ python3 setup.py install
/Users/vincerubinetti/Desktop/raresim/raresim/setup.py:7: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` directly.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
        ********************************************************************************

!!
  self.initialize_options()
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` and ``easy_install``.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://github.com/pypa/setuptools/issues/917 for details.
        ********************************************************************************

!!
  self.initialize_options()
Compiling rareSim.pyx because it changed.
[1/1] Cythonizing rareSim.pyx
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/Cython/Compiler/Main.py:381: FutureWarning: Cython directive 'language_level' not set, using '3str' for now (Py3). This has changed from earlier releases! File: /Users/vincerubinetti/Desktop/raresim/raresim/rareSim.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
warning: rsdec.pxd:11:28: Non-trivial type declarators in shared declaration (e.g. mix of pointers and values). Each pointer declaration should be on its own line.
warning: rareSim.pyx:9:30: Unknown type declaration 'void' in annotation, ignoring
warning: rareSim.pyx:49:31: Unknown type declaration 'void' in annotation, ignoring
warning: rareSim.pyx:62:32: Unknown type declaration 'void' in annotation, ignoring
lib/raresim/src/lists.c:223:18: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
    for (i =0; i < rows; ++i) {
               ~ ^ ~~~~
lib/raresim/src/lists.c:238:19: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
    for (i = 0; i < (*m)->size; ++i) {
                ~ ^ ~~~~~~~~~~
lib/raresim/src/lists.c:276:14: warning: unused variable 'ret' [-Wunused-variable]
    uint32_t ret = uint32_t_array_add(m->data[row], val);
             ^
lib/raresim/src/lists.c:264:30: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
        for (i = old_size; i < m->size; ++i) {
                           ~ ^ ~~~~~~~
lib/raresim/src/lists.c:327:19: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
    for (i = 0; i < m->rows; ++i) {
                ~ ^ ~~~~~~~
lib/raresim/src/lists.c:338:19: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
    for (i = 0; i < m->rows; ++i) {
                ~ ^ ~~~~~~~
lib/raresim/src/lists.c:384:19: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
    for (i = 0; i < m->rows; ++i) {
                ~ ^ ~~~~~~~
lib/raresim/src/lists.c:513:11: warning: unused variable 'r' [-Wunused-variable]
    char *r = strcpy(last_3, file_name + strlen(file_name) - 3);
          ^
lib/raresim/src/lists.c:641:17: warning: unused variable 'r' [-Wunused-variable]
            int r = uint32_t_sparse_matrix_add(M, *row, *col);
                ^
lib/raresim/src/lists.c:659:14: warning: unused variable 'ret' [-Wunused-variable]
    uint32_t ret = uint32_t_sparse_matrix_write(m, fp);
             ^
10 warnings generated.
lib/raresim/src/lists.c:223:18: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
    for (i =0; i < rows; ++i) {
               ~ ^ ~~~~
lib/raresim/src/lists.c:238:19: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
    for (i = 0; i < (*m)->size; ++i) {
                ~ ^ ~~~~~~~~~~
lib/raresim/src/lists.c:276:14: warning: unused variable 'ret' [-Wunused-variable]
    uint32_t ret = uint32_t_array_add(m->data[row], val);
             ^
lib/raresim/src/lists.c:264:30: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
        for (i = old_size; i < m->size; ++i) {
                           ~ ^ ~~~~~~~
lib/raresim/src/lists.c:327:19: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
    for (i = 0; i < m->rows; ++i) {
                ~ ^ ~~~~~~~
lib/raresim/src/lists.c:338:19: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
    for (i = 0; i < m->rows; ++i) {
                ~ ^ ~~~~~~~
lib/raresim/src/lists.c:384:19: warning: comparison of integers of different signs: 'int' and 'uint32_t' (aka 'unsigned int') [-Wsign-compare]
    for (i = 0; i < m->rows; ++i) {
                ~ ^ ~~~~~~~
lib/raresim/src/lists.c:513:11: warning: unused variable 'r' [-Wunused-variable]
    char *r = strcpy(last_3, file_name + strlen(file_name) - 3);
          ^
lib/raresim/src/lists.c:641:17: warning: unused variable 'r' [-Wunused-variable]
            int r = uint32_t_sparse_matrix_add(M, *row, *col);
                ^
lib/raresim/src/lists.c:659:14: warning: unused variable 'ret' [-Wunused-variable]
    uint32_t ret = uint32_t_sparse_matrix_write(m, fp);
             ^
10 warnings generated.
lib/zlib-1.2.11/adler32.c:63:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT adler32_z(adler, buf, len)
              ^
lib/zlib-1.2.11/adler32.c:134:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT adler32(adler, buf, len)
              ^
lib/zlib-1.2.11/adler32.c:143:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local uLong adler32_combine_(adler1, adler2, len2)
            ^
lib/zlib-1.2.11/adler32.c:172:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT adler32_combine(adler1, adler2, len2)
              ^
lib/zlib-1.2.11/adler32.c:180:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT adler32_combine64(adler1, adler2, len2)
              ^
5 warnings generated.
lib/zlib-1.2.11/adler32.c:63:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT adler32_z(adler, buf, len)
              ^
lib/zlib-1.2.11/adler32.c:134:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT adler32(adler, buf, len)
              ^
lib/zlib-1.2.11/adler32.c:143:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local uLong adler32_combine_(adler1, adler2, len2)
            ^
lib/zlib-1.2.11/adler32.c:172:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT adler32_combine(adler1, adler2, len2)
              ^
lib/zlib-1.2.11/adler32.c:180:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT adler32_combine64(adler1, adler2, len2)
              ^
5 warnings generated.
lib/zlib-1.2.11/compress.c:22:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT compress2 (dest, destLen, source, sourceLen, level)
            ^
lib/zlib-1.2.11/compress.c:68:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT compress (dest, destLen, source, sourceLen)
            ^
lib/zlib-1.2.11/compress.c:81:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT compressBound (sourceLen)
              ^
3 warnings generated.
lib/zlib-1.2.11/compress.c:22:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT compress2 (dest, destLen, source, sourceLen, level)
            ^
lib/zlib-1.2.11/compress.c:68:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT compress (dest, destLen, source, sourceLen)
            ^
lib/zlib-1.2.11/compress.c:81:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT compressBound (sourceLen)
              ^
3 warnings generated.
lib/zlib-1.2.11/crc32.c:202:23: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
unsigned long ZEXPORT crc32_z(crc, buf, len)
                      ^
lib/zlib-1.2.11/crc32.c:237:23: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
unsigned long ZEXPORT crc32(crc, buf, len)
                      ^
lib/zlib-1.2.11/crc32.c:266:21: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local unsigned long crc32_little(crc, buf, len)
                    ^
lib/zlib-1.2.11/crc32.c:306:21: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local unsigned long crc32_big(crc, buf, len)
                    ^
lib/zlib-1.2.11/crc32.c:344:21: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local unsigned long gf2_matrix_times(mat, vec)
                    ^
lib/zlib-1.2.11/crc32.c:361:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void gf2_matrix_square(square, mat)
           ^
lib/zlib-1.2.11/crc32.c:372:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local uLong crc32_combine_(crc1, crc2, len2)
            ^
lib/zlib-1.2.11/crc32.c:428:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT crc32_combine(crc1, crc2, len2)
              ^
lib/zlib-1.2.11/crc32.c:436:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT crc32_combine64(crc1, crc2, len2)
              ^
9 warnings generated.
lib/zlib-1.2.11/crc32.c:202:23: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
unsigned long ZEXPORT crc32_z(crc, buf, len)
                      ^
lib/zlib-1.2.11/crc32.c:237:23: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
unsigned long ZEXPORT crc32(crc, buf, len)
                      ^
lib/zlib-1.2.11/crc32.c:266:21: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local unsigned long crc32_little(crc, buf, len)
                    ^
lib/zlib-1.2.11/crc32.c:306:21: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local unsigned long crc32_big(crc, buf, len)
                    ^
lib/zlib-1.2.11/crc32.c:344:21: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local unsigned long gf2_matrix_times(mat, vec)
                    ^
lib/zlib-1.2.11/crc32.c:361:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void gf2_matrix_square(square, mat)
           ^
lib/zlib-1.2.11/crc32.c:372:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local uLong crc32_combine_(crc1, crc2, len2)
            ^
lib/zlib-1.2.11/crc32.c:428:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT crc32_combine(crc1, crc2, len2)
              ^
lib/zlib-1.2.11/crc32.c:436:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT crc32_combine64(crc1, crc2, len2)
              ^
9 warnings generated.
lib/zlib-1.2.11/deflate.c:201:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void slide_hash(s)
           ^
lib/zlib-1.2.11/deflate.c:228:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateInit_(strm, level, version, stream_size)
            ^
lib/zlib-1.2.11/deflate.c:240:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateInit2_(strm, level, method, windowBits, memLevel, strategy,
            ^
lib/zlib-1.2.11/deflate.c:353:11: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local int deflateStateCheck (strm)
          ^
lib/zlib-1.2.11/deflate.c:376:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateSetDictionary (strm, dictionary, dictLength)
            ^
lib/zlib-1.2.11/deflate.c:445:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateGetDictionary (strm, dictionary, dictLength)
            ^
lib/zlib-1.2.11/deflate.c:467:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateResetKeep (strm)
            ^
lib/zlib-1.2.11/deflate.c:505:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateReset (strm)
            ^
lib/zlib-1.2.11/deflate.c:517:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateSetHeader (strm, head)
            ^
lib/zlib-1.2.11/deflate.c:528:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflatePending (strm, pending, bits)
            ^
lib/zlib-1.2.11/deflate.c:542:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflatePrime (strm, bits, value)
            ^
lib/zlib-1.2.11/deflate.c:568:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateParams(strm, level, strategy)
            ^
lib/zlib-1.2.11/deflate.c:617:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateTune(strm, good_length, max_lazy, nice_length, max_chain)
            ^
lib/zlib-1.2.11/deflate.c:652:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT deflateBound(strm, sourceLen)
              ^
lib/zlib-1.2.11/deflate.c:716:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void putShortMSB (s, b)
           ^
lib/zlib-1.2.11/deflate.c:730:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void flush_pending(strm)
           ^
lib/zlib-1.2.11/deflate.c:763:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflate (strm, flush)
            ^
lib/zlib-1.2.11/deflate.c:1076:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateEnd (strm)
            ^
lib/zlib-1.2.11/deflate.c:1102:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateCopy (dest, source)
            ^
lib/zlib-1.2.11/deflate.c:1164:16: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local unsigned read_buf(strm, buf, size)
               ^
lib/zlib-1.2.11/deflate.c:1194:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void lm_init (s)
           ^
lib/zlib-1.2.11/deflate.c:1236:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local uInt longest_match(s, cur_match)
           ^
lib/zlib-1.2.11/deflate.c:1482:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void fill_window(s)
           ^
lib/zlib-1.2.11/deflate.c:1643:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local block_state deflate_stored(s, flush)
                  ^
lib/zlib-1.2.11/deflate.c:1824:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local block_state deflate_fast(s, flush)
                  ^
lib/zlib-1.2.11/deflate.c:1926:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local block_state deflate_slow(s, flush)
                  ^
lib/zlib-1.2.11/deflate.c:2057:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local block_state deflate_rle(s, flush)
                  ^
lib/zlib-1.2.11/deflate.c:2130:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local block_state deflate_huff(s, flush)
                  ^
28 warnings generated.
lib/zlib-1.2.11/deflate.c:201:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void slide_hash(s)
           ^
lib/zlib-1.2.11/deflate.c:228:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateInit_(strm, level, version, stream_size)
            ^
lib/zlib-1.2.11/deflate.c:240:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateInit2_(strm, level, method, windowBits, memLevel, strategy,
            ^
lib/zlib-1.2.11/deflate.c:353:11: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local int deflateStateCheck (strm)
          ^
lib/zlib-1.2.11/deflate.c:376:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateSetDictionary (strm, dictionary, dictLength)
            ^
lib/zlib-1.2.11/deflate.c:445:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateGetDictionary (strm, dictionary, dictLength)
            ^
lib/zlib-1.2.11/deflate.c:467:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateResetKeep (strm)
            ^
lib/zlib-1.2.11/deflate.c:505:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateReset (strm)
            ^
lib/zlib-1.2.11/deflate.c:517:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateSetHeader (strm, head)
            ^
lib/zlib-1.2.11/deflate.c:528:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflatePending (strm, pending, bits)
            ^
lib/zlib-1.2.11/deflate.c:542:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflatePrime (strm, bits, value)
            ^
lib/zlib-1.2.11/deflate.c:568:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateParams(strm, level, strategy)
            ^
lib/zlib-1.2.11/deflate.c:617:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateTune(strm, good_length, max_lazy, nice_length, max_chain)
            ^
lib/zlib-1.2.11/deflate.c:652:15: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
uLong ZEXPORT deflateBound(strm, sourceLen)
              ^
lib/zlib-1.2.11/deflate.c:716:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void putShortMSB (s, b)
           ^
lib/zlib-1.2.11/deflate.c:730:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void flush_pending(strm)
           ^
lib/zlib-1.2.11/deflate.c:763:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflate (strm, flush)
            ^
lib/zlib-1.2.11/deflate.c:1076:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateEnd (strm)
            ^
lib/zlib-1.2.11/deflate.c:1102:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT deflateCopy (dest, source)
            ^
lib/zlib-1.2.11/deflate.c:1164:16: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local unsigned read_buf(strm, buf, size)
               ^
lib/zlib-1.2.11/deflate.c:1194:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void lm_init (s)
           ^
lib/zlib-1.2.11/deflate.c:1236:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local uInt longest_match(s, cur_match)
           ^
lib/zlib-1.2.11/deflate.c:1482:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void fill_window(s)
           ^
lib/zlib-1.2.11/deflate.c:1643:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local block_state deflate_stored(s, flush)
                  ^
lib/zlib-1.2.11/deflate.c:1824:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local block_state deflate_fast(s, flush)
                  ^
lib/zlib-1.2.11/deflate.c:1926:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local block_state deflate_slow(s, flush)
                  ^
lib/zlib-1.2.11/deflate.c:2057:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local block_state deflate_rle(s, flush)
                  ^
lib/zlib-1.2.11/deflate.c:2130:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local block_state deflate_huff(s, flush)
                  ^
28 warnings generated.
lib/zlib-1.2.11/gzclose.c:11:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT gzclose(file)
            ^
1 warning generated.
lib/zlib-1.2.11/gzclose.c:11:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT gzclose(file)
            ^
1 warning generated.
lib/zlib-1.2.11/gzlib.c:75:12: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local void gz_reset(state)
           ^
lib/zlib-1.2.11/gzlib.c:91:14: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
local gzFile gz_open(path, fd, mode)
             ^
lib/zlib-1.2.11/gzlib.c:252:9: error: call to undeclared function 'lseek'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
        LSEEK(state->fd, 0, SEEK_END);  /* so gzoffset() is correct */
        ^
lib/zlib-1.2.11/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#  define LSEEK lseek
                ^
lib/zlib-1.2.11/gzlib.c:252:9: note: did you mean 'fseek'?
lib/zlib-1.2.11/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#  define LSEEK lseek
                ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/stdio.h:154:6: note: 'fseek' declared here
int      fseek(FILE *, long, int);
         ^
lib/zlib-1.2.11/gzlib.c:258:24: error: call to undeclared function 'lseek'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
        state->start = LSEEK(state->fd, 0, SEEK_CUR);
                       ^
lib/zlib-1.2.11/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#  define LSEEK lseek
                ^
lib/zlib-1.2.11/gzlib.c:270:16: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
gzFile ZEXPORT gzopen(path, mode)
               ^
lib/zlib-1.2.11/gzlib.c:278:16: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
gzFile ZEXPORT gzopen64(path, mode)
               ^
lib/zlib-1.2.11/gzlib.c:286:16: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
gzFile ZEXPORT gzdopen(fd, mode)
               ^
lib/zlib-1.2.11/gzlib.c:316:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT gzbuffer(file, size)
            ^
lib/zlib-1.2.11/gzlib.c:343:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT gzrewind(file)
            ^
lib/zlib-1.2.11/gzlib.c:359:9: error: call to undeclared function 'lseek'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    if (LSEEK(state->fd, state->start, SEEK_SET) == -1)
        ^
lib/zlib-1.2.11/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#  define LSEEK lseek
                ^
lib/zlib-1.2.11/gzlib.c:366:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
z_off64_t ZEXPORT gzseek64(file, offset, whence)
                  ^
lib/zlib-1.2.11/gzlib.c:400:15: error: call to undeclared function 'lseek'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
        ret = LSEEK(state->fd, offset - state->x.have, SEEK_CUR);
              ^
lib/zlib-1.2.11/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#  define LSEEK lseek
                ^
lib/zlib-1.2.11/gzlib.c:443:17: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
z_off_t ZEXPORT gzseek(file, offset, whence)
                ^
lib/zlib-1.2.11/gzlib.c:455:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
z_off64_t ZEXPORT gztell64(file)
                  ^
lib/zlib-1.2.11/gzlib.c:472:17: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
z_off_t ZEXPORT gztell(file)
                ^
lib/zlib-1.2.11/gzlib.c:482:19: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
z_off64_t ZEXPORT gzoffset64(file)
                  ^
lib/zlib-1.2.11/gzlib.c:496:14: error: call to undeclared function 'lseek'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    offset = LSEEK(state->fd, 0, SEEK_CUR);
             ^
lib/zlib-1.2.11/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#  define LSEEK lseek
                ^
lib/zlib-1.2.11/gzlib.c:505:17: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
z_off_t ZEXPORT gzoffset(file)
                ^
lib/zlib-1.2.11/gzlib.c:515:13: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
int ZEXPORT gzeof(file)
            ^
lib/zlib-1.2.11/gzlib.c:532:22: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
const char * ZEXPORT gzerror(file, errnum)
                     ^
lib/zlib-1.2.11/gzlib.c:553:14: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
void ZEXPORT gzclearerr(file)
             ^
lib/zlib-1.2.11/gzlib.c:579:20: warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
void ZLIB_INTERNAL gz_error(state, err, msg)
                   ^
17 warnings and 5 errors generated.
error: command '/usr/bin/clang' failed with exit code 1
(base) Vincents-MacBook-Pro:raresim vincerubinetti$ 

Note that you're seeing the same lseek errors and warnings that I'm seeing, plus many more.

I'm guessing the issue is that the C code never finishes compiling because of those errors. If I add a test fprintf to add_buffer_to_matrix, it does not show up when running convert.py. The weird thing is that those errors were not there for me just a few weeks ago. I've freshly cloned the code and removed the previous files completely, but maybe a compiled version from a few weeks ago (the one that worked and had the rows += 1 fix) is still stored somewhere else on my computer, and that is what runs now even when the C compile fails. I don't know enough about how Cython works. Still investigating.
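One way to test the "stale compiled copy" theory is to ask Python where it would load a module from. A minimal sketch, assuming you know the import name of the compiled extension (I use `json` below only as a stand-in; the helper name `locate_module` is made up for illustration):

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would be loaded from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # not importable from the current environment
    return spec.origin


# Stand-in example: replace "json" with the import name of the compiled extension.
print(locate_module("json"))
```

If the printed path points into a site-packages folder rather than the freshly cloned directory, an older installed build is probably the one being used.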

@vincerubinetti
Contributor

vincerubinetti commented Jan 11, 2024

Well I don't know why this changed over the last few weeks. But I found a hack online, which is to just turn off that C error so the compile can succeed:

export CFLAGS='-Wno-implicit-function-declaration' && python3 setup.py install

This makes it work for me, locally. Probably won't fix your other errors.
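Note that the `export` line above replaces any CFLAGS already set in the shell. If that matters in your environment, here is a rough sketch of the same idea that appends the flag instead (the helper name `env_with_cflags` is hypothetical, not part of raresim):

```python
import os


def env_with_cflags(extra_flag, base_env=None):
    """Copy an environment and append extra_flag to CFLAGS, keeping existing flags."""
    env = dict(os.environ if base_env is None else base_env)
    flags = env.get("CFLAGS", "").split()
    if extra_flag not in flags:
        flags.append(extra_flag)
    env["CFLAGS"] = " ".join(flags)
    return env


# Possible usage from the raresim folder (not run here):
#   import subprocess, sys
#   subprocess.run([sys.executable, "setup.py", "install"],
#                  env=env_with_cflags("-Wno-implicit-function-declaration"),
#                  check=True)
```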

@JessMurphy
Author

JessMurphy commented Jan 12, 2024

So yes, raresim will probably be run on a computing server or cluster because genetic data is usually too large to be stored locally. But I'm currently just cloning the updated code into my home drive (and deleting the previous files) and it does have the latest version of the C code. However, I think a previous version of raresim was added to the singularity container so I wonder if that could be messing things up (though I'm specifically referencing the raresim folder in my home drive and just using the singularity to call python3).

I did run your hack, which got rid of warnings 14, 252, 89, and 661 from the setup.py output (everything else looked the exact same), but the convert.py function still produces a .sm file of 19027 rows. I could try to run it locally, but I haven't done that before, so it might take me a bit to figure it out.
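For checking the counts independently of raresim, something like the following could compare the legend against a .haps.gz file. This is only a sketch: it assumes the legend has one header line and that each haplotype row is one text line, and it does not read the binary .sm format. The function names are made up for illustration.

```python
import gzip


def count_lines(path):
    """Count lines in a plain or gzip-compressed text file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle)


def lengths_match(legend_path, haps_path, legend_header_lines=1):
    """Compare variant counts; legend_header_lines is an assumption about the legend format."""
    n_legend = count_lines(legend_path) - legend_header_lines
    n_haps = count_lines(haps_path)
    return n_legend == n_haps, n_legend, n_haps
```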

@vincerubinetti
Contributor

> I did run your hack, which got rid of warnings 14, 252, 89, and 661 from the setup.py output (everything else looked the exact same), but the convert.py function still produces a .sm file of 19027 rows.

If there are still other errors being shown, they might be blocking the build from finishing. Try adding a test print, printf("|||||||||||||||||||||||||||||");, (on the server) to the top of the function that contains the M->rows += 1; line, and see if it prints when running convert. That should be a surefire way to see whether the server is actually including the M->rows += 1; fix.

As far as the differences between running local and running on the server, we'd probably need to have another sit down with screen-sharing to troubleshoot this, and I'd probably need help from my colleagues. I think the code is fairly brittle and sensitive to version and environment differences. Python is not my main language but from what I've heard, that applies to a lot of Python code... and then adding compiled C (with included external libraries) into the mix makes it extra complicated.

Please schedule something with the booking page:
https://outlook.office365.com/owa/calendar/SoftwareEngineeringTeam@olucdenver.onmicrosoft.com/bookings/

Just FYI, it's possible that an actual robust fix (to make it reliably run on the server or any environment) could take a significant effort, and we have a limit on what we can do without making an MOU ("official" agreement for work/time commitment between our supervisor and your lab).

@JessMurphy
Author

JessMurphy commented Jan 12, 2024

Assuming I added the print statement in the right place (see below), it did not print when running convert.py. And I just booked an appointment for next Wednesday (1/17).

//{{{uint32_t add_buffer_to_matrix(char *buffer,
uint32_t add_buffer_to_matrix(char *buffer,
                              long length,
                              struct uint32_t_sparse_matrix *M,
                              uint32_t *row,
                              uint32_t *col)
{
    printf("|||||||||||||||||||||||||||||");
    uint32_t max_col = 0;
    long i;
    for (i = 0; i < length; i++) {

        if ( (int)buffer[i] == SPACE  )
            continue;

        if ( (int)buffer[i] == NEWLINE ) {
            *col = 0;
            max_col = MAX(*col, max_col);
            *row += 1;
            M->rows += 1;
            continue;
        }

        if ( (int)buffer[i] == ONE ) {
            int r = uint32_t_sparse_matrix_add(M, *row, *col);
        }

        *col += 1;
        max_col = MAX(*col, max_col);
    }
    return max_col;
}

//}}}
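For anyone following along who doesn't read C, here is a rough Python rendering of what that loop does, with a set standing in for the sparse matrix. It is an illustration only, not raresim's implementation, and the `entries` / `state` containers are my own invention:

```python
def add_buffer_to_matrix(buffer, entries, state):
    """Record (row, col) for each '1' in buffer and return the widest column count seen.

    entries: set collecting (row, col) pairs (stand-in for the sparse matrix)
    state: dict with 'row', 'col', 'rows', carried across successive buffers
    """
    max_col = 0
    for ch in buffer:
        if ch == " ":
            continue
        if ch == "\n":
            state["col"] = 0
            state["row"] += 1
            state["rows"] += 1  # the row-count update discussed in this thread
            continue
        if ch == "1":
            entries.add((state["row"], state["col"]))
        state["col"] += 1
        max_col = max(max_col, state["col"])
    return max_col
```

Because `row` and `col` live in `state`, a line that is split across two buffers still lands in the right row, which is why the C version takes them as pointers.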

@d33bs
Member

d33bs commented Feb 9, 2024

In thinking about this, discussing with the software team, and taking a brief look at the code I wanted to mention that using a packaging and environment management tool could possibly assist with reproducibility when it comes to how the work is deployed + tested. There's a helpful guide at pyOpenSci.org on Python Packaging Tools which might influence the decision-making in this space. Feel free to ignore this comment if it's way off-base or unhelpful.

@falquaddoomi

falquaddoomi commented Feb 9, 2024

Hey @JessMurphy, sorry for the delay. So, I got access to your compute cluster and was able to get your pruning_code.sh script to run with @vincerubinetti's modifications to raresim (which, as of today, have all been merged into main).

Changes to pruning_code.sh

I had to make some tweaks to the pruning_code.sh script; here's my new version of it, which you can copy over your existing one. I added comments where I'd changed things with the prefix # FA:

#!/bin/bash

# FA: the `set` command applies bash settings; the following settings make it easier to debug the script:
#  -e: abort on any program returning an error, i.e. a code != 0
#  -o pipefail: a pipeline fails if any command in it fails, not just the last one, e.g. `broken_program | working_program` counts as a failure because `broken_program` errored
#  -x: echo lines that are run with a `+` in front of them
set -e -o pipefail -x

pop=NFE
nsim=20000
pcase=100
pconf=90
rep=3

# change the file path to where the example_code folder is stored
# (and where the raresim folder will be stored)
# FA: changed this to taking the WD as the first parameter so i could test it in my home directory on clas-compute
# (it defaults to the value it was hardcoded to before if you don't specify the first parameter)
WD=${1:-/home/math/murphjes}
cd ${WD}

# clone raresim from Github
# FA: i first remove the raresim folder, if it exists, so we know we're starting with a fresh copy
rm -rf raresim || echo "No existing raresim folder found, continuing..."
git clone https://github.com/RMBarnard/raresim.git
cd raresim/
# FA: i added `--user` here to install the package into a folder that's writeable by a regular user
python3 setup.py install --user
cd ..


# prune functional and synonymous variants down to pcase %
python3 ${WD}/raresim/sim.py \
    -m ${WD}/example_code/chr19.block37.${pop}.sim${rep}.controls.haps.sm \
    --functional_bins ${WD}/example_code/MAC_bin_estimates_${nsim}_${pop}_fun_${pcase}.txt \
    --synonymous_bins ${WD}/example_code/MAC_bin_estimates_${nsim}_${pop}_syn_${pcase}.txt \
    -l ${WD}/example_code/chr19.block37.${pop}.sim${rep}.copy.legend \
    -L ${WD}/example_code/chr19.block37.${pop}.sim${rep}.${pcase}fun.${pcase}syn.legend \
    -H ${WD}/example_code/chr19.block37.${pop}.sim${rep}.controls.${pcase}fun.${pcase}syn.haps.gz

# produces WARNING: Lengths of legend 19029 and hap 19027 files do not match


# convert the resulting -H haplotype file to a sparse matrix
python3 ${WD}/raresim/convert.py \
    -i ${WD}/example_code/chr19.block37.${pop}.sim${rep}.controls.${pcase}fun.${pcase}syn.haps.gz \
    -o ${WD}/example_code/chr19.block37.${pop}.sim${rep}.controls.${pcase}fun.${pcase}syn.haps.sm


# prune functional and synonymous variants down again to pconf % (sometimes doesn't work because of the MAC bins)
python3 ${WD}/raresim/sim.py \
    -m ${WD}/example_code/chr19.block37.${pop}.sim${rep}.controls.${pcase}fun.${pcase}syn.haps.sm \
    --functional_bins ${WD}/example_code/MAC_bin_estimates_${nsim}_${pop}_fun_${pconf}.txt \
    --synonymous_bins ${WD}/example_code/MAC_bin_estimates_${nsim}_${pop}_syn_${pconf}.txt \
    -l ${WD}/example_code/chr19.block37.${pop}.sim${rep}.${pcase}fun.${pcase}syn.legend \
    -L ${WD}/example_code/chr19.block37.${pop}.sim${rep}.${pconf}fun.${pconf}syn.legend \
    -H ${WD}/example_code/chr19.block37.${pop}.sim${rep}.controls.${pconf}fun.${pconf}syn.haps.gz || \
    echo "* NOTE: Expected error (code: $?), continuing..."

# FA: since we know the above command is supposed to fail, i added `|| echo "* NOTE: ..."` so that the command as a whole succeeds and we can continue, despite `set -e -o pipefail` being enabled

# produces the following error if the number of observed functional variants (0) is less than the number
# of expected functional variants (0.36) for the [201,400] MAC bin from the first pruning step above
# (may need to rerun the first pruning step to reproduce the error)

#Traceback (most recent call last):
#  File "/home/math/murphjes/raresim/sim.py", line 100, in <module>
#    if __name__ == '__main__': main()
#  File "/home/math/murphjes/raresim/sim.py", line 69, in main
#    print_frequency_distribution(bins, bin_h, func_split, fun_only, syn_only)
#  File "/home/math/murphjes/raresim/header.py", line 304, in print_frequency_distribution
#    print_bin(bin_h['fun'], bins['fun'])
#  File "/home/math/murphjes/raresim/header.py", line 142, in print_bin
#    + str(len(bin_h[bin_id])))
#KeyError: 6

# we circumvented this error by combining the last two MAC bins into a [21,400] bin


# prune functional and synonymous variants down again to pconf % but don't remove the rows of zeros (doesn't work because of -z flag)
python3 ${WD}/raresim/sim.py \
    -m ${WD}/example_code/chr19.block37.${pop}.sim${rep}.controls.${pcase}fun.${pcase}syn.haps.sm \
    --functional_bins ${WD}/example_code/MAC_bin_estimates_${nsim}_${pop}_fun_${pconf}_6bins.txt \
    --synonymous_bins ${WD}/example_code/MAC_bin_estimates_${nsim}_${pop}_syn_${pconf}_6bins.txt \
    -l ${WD}/example_code/chr19.block37.${pop}.sim${rep}.${pcase}fun.${pcase}syn.legend \
    -L ${WD}/example_code/chr19.block37.${pop}.sim${rep}.${pconf}fun.${pconf}syn.legend \
    -H ${WD}/example_code/chr19.block37.${pop}.sim${rep}.controls.${pconf}fun.${pconf}syn.haps.gz \
    -z

# produces the following error

#Traceback (most recent call last):
#  File "/home/math/murphjes/raresim/sim.py", line 100, in <module>
#    if __name__ == '__main__': main()
#  File "/home/math/murphjes/raresim/sim.py", line 90, in main
#    all_kept_rows = get_all_kept_rows(bin_h, R, func_split, fun_only, syn_only, args.z, args.keep_protected, legend)
#  File "/home/math/murphjes/raresim/header.py", line 338, in get_all_kept_rows
#    all_kept_rows = list(merge(all_kept_rows, sorted(R)))
#  File "/usr/lib/python3.10/heapq.py", line 353, in merge
#    _heapify(h)
#TypeError: '<' not supported between instances of 'int' and 'str'

Opening a shell into the container

(The following should be done after SSHing into your cluster, i.e. on clas-compute or alderaan)

I find it a lot easier to debug things if I can get an interactive shell into the container. I'd start in your home directory, with the contents of the zip you sent, example_code.zip, extracted to the path /home/math/murphjes/example_code/.

I don't know if it's necessary, but you can get into your mixtures.sif container in temporary-write mode like so:

singularity shell --writable-tmpfs /storage/singularity/mixtures.sif

The --writable-tmpfs mode allows you to write to any location in the container's filesystem (assuming you have sufficient permissions), but the changes disappear when you exit the container. The exception is changes you make anywhere in or under your home directory, which will persist even after exiting; Singularity mounts your home directory into the container as read-writeable.

Once you run the above command (and after a short wait as the image is launched as a container), you'll be at a prompt that looks like:

Singularity>

but it's otherwise a normal bash shell.

From that shell, you can run ./example_code/pruning_code.sh at the prompt, and it should run to completion. I didn't see the last error, TypeError: '<' not supported between instances of 'int' and 'str', at least.
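For reference, that TypeError is what `heapq.merge` raises when one of its inputs holds ints and the other holds strings. A minimal reproduction follows, plus one possible way to avoid it by coercing ids to int first. This is a guess at the cause based only on the traceback; `merge_row_ids` is a hypothetical helper, not a raresim function.

```python
from heapq import merge


def merge_row_ids(kept, extra):
    """Merge two collections of row ids after coercing every id to int."""
    return list(merge(sorted(int(x) for x in kept), sorted(int(x) for x in extra)))


# Mixing ints and strings reproduces the error from the traceback.
try:
    list(merge([1, 5], ["2", "7"]))
except TypeError as err:
    print("TypeError:", err)

# Coercing first gives a clean merge.
print(merge_row_ids([1, 5], ["2", "7"]))
```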

Hope that helps, and let me know if you have issues or questions, and of course if it ends up working for you.

@JessMurphy
Author

Sorry for my delayed response, but thanks @falquaddoomi! The differing lengths warning is no longer produced and the sparse matrix is of the correct size. And yes, the last error isn't produced but it's not doing what it's supposed to be doing so I need to follow-up with @vincerubinetti and provide examples.

@JessMurphy
Author

Hey @vincerubinetti, attached is an updated example_code folder with included output. The current output when using the z flag is chr19.block37.NFE.sim3.all.90fun.90syn.zflag.haps.gz, chr19.block37.NFE.sim3.90fun.90syn.zflag.legend and chr19.block37.NFE.sim3.90fun.90syn.zflag.legend-pruned-variants. The .legend file is fine and the .legend-pruned-variants file is not needed (or should be the same as chr19.block37.NFE.sim3.90fun.90syn.legend-pruned-variants instead of blank). The .haps.gz file should look like chr19.block37.NFE.sim3.all.90fun.90syn.zflag.correct.haps.gz instead of an exact copy of chr19.block37.NFE.sim3.all.100fun.100syn.haps.gz. It should have the same number of rows as chr19.block37.NFE.sim3.all.100fun.100syn.haps.gz but have some rows of all zeros (the same number as rows in the chr19.block37.NFE.sim3.90fun.90syn.legend-pruned-variants file). An easy way to check if the z flag is working is summing up the rows of the .haps.gz files.
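The row-sum check could look something like this. It is a sketch that assumes each line of the .haps.gz file is one variant with whitespace-separated 0/1 values; the function names are made up for illustration.

```python
import gzip


def row_sums(haps_path):
    """Return the sum of each row of a gzip-compressed haplotype file of 0/1 values."""
    sums = []
    with gzip.open(haps_path, "rt") as handle:
        for line in handle:
            sums.append(sum(int(tok) for tok in line.split()))
    return sums


def zero_row_indices(haps_path):
    """0-based indices of rows whose entries sum to zero."""
    return [i for i, total in enumerate(row_sums(haps_path)) if total == 0]
```

With the z flag working, `len(row_sums(...))` for the pruned file should equal the row count of the input file, and `len(zero_row_indices(...))` should equal the number of lines in the .legend-pruned-variants file.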

Please let me know if you have any questions or if it would be easier to meet to further explain.

example_code.zip

@vincerubinetti
Contributor

vincerubinetti commented Mar 25, 2024

Hi Jessica,

I haven’t had a chance to look at this yet, no. I’m out sick today and most likely will be out several days this week. If you’d like to schedule a sit down, that would be helpful. It’d be good to have all three of the SET team members there so we can all help, so if you could schedule through the booking page that would be great.

Generally speaking, it’s probably going to be hard (for me at least) to find what needs fixing just by looking at what the expected output is. At some point someone has to understand what each line of code is supposed to be doing, both from a biological and programming standpoint.

It also seems like this would've been broken long before I fixed the error message for the z flag, unless there was some undocumented way of using the z flag (passing in differently formatted data) that I'm unaware of.

@JessMurphy
Author

No worries and I hope you feel better! Yes, the z flag definitely wasn't working properly even before you fixed the error message. I was able to get ahold of Ryan, the maintainer of raresim, and it sounds like he might have some time to look into it. But if he doesn't or he runs into issues, I'll reach back out and schedule a meeting. Thanks!
