Running diem on whole genome #5

Open
simonharnqvist opened this issue Apr 26, 2024 · 20 comments
Comments

@simonharnqvist

Hi @StuartJEBaird - taking you up on your offer to explain how to run diem on genome-scale data. Raising an issue here so it becomes part of the documentation, but feel free to direct me elsewhere.

First things first - the nCores parameter in diemR does not seem to be working on macOS; CPU load does not respond to the requested number of threads. I'm not an R user: is there something else I need to do to enable R to multiprocess? (Given that the files are per chromosome I can work around this easily, but I thought I should let you know.)

Other than multiprocessing, what are some tricks to make this run smoothly on a whole genome?

Thanks in advance
Simon

@StuartJEBaird
Owner

StuartJEBaird commented Apr 26, 2024 via email

@simonharnqvist
Author

Thanks @StuartJEBaird! I would definitely prefer using Python, but I couldn't find any documentation for the Python version.

@StuartJEBaird
Owner

StuartJEBaird commented Apr 30, 2024 via email

@simonharnqvist
Author

Hmm - the function signatures are very different:

diem.py

def diem(PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath, ChosenInds, diemMaxIterations, Epsilon, verbose, processes):

vs

diem.R

diem <- function(files, ploidy = list(2), markerPolarity = FALSE, ChosenInds,
                 epsilon = 0.99999, verbose = FALSE, nCores = parallel::detectCores() - 1,
                 maxIterations = 50, ...)

@StuartJEBaird
Owner

StuartJEBaird commented May 1, 2024 via email

@simonharnqvist
Author

Hi Stuart,

I was being very silly - of course it doesn't use more than one core when I've only provided one input file. As far as I can tell, it works on macOS - I will run a proper test overnight.
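
To make the point concrete, here is a minimal sketch of why a single input file keeps only one core busy. It assumes the work is split into one task per input file, which matches the observation above but is not confirmed by the diemR source in this thread; `effective_workers` is a hypothetical helper, not part of diemr or diempy.

```python
def effective_workers(n_files: int, n_cores: int) -> int:
    """Upper bound on simultaneously busy workers.

    Assumption: each input file is one indivisible task, so extra
    cores beyond the number of files have nothing to do.
    """
    if n_files < 0 or n_cores < 1:
        raise ValueError("n_files must be >= 0 and n_cores must be >= 1")
    return min(n_files, n_cores)


# One chromosome file with six requested cores: only one worker has a task.
single_file = effective_workers(n_files=1, n_cores=6)

# Fourteen per-chromosome files with six requested cores: all six can be busy.
per_chromosome = effective_workers(n_files=14, n_cores=6)
```

Under that assumption, splitting the genome into per-chromosome files is what lets a setting such as nCores = 6 have an effect.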

With diempy, I'm getting an error - I'll stick the traceback below for your information, but I'm happy enough with diemR now.

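from diem.diempy.diempy_core.diem import diem
import subprocess
import numpy as np

# Number of markers = number of lines in the chromosome 14 input file
chr14_len = int(subprocess.check_output(["wc", "-l", "diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt"]).split()[0])
# Random initial marker labels: one character ("1" or "2") per marker
rng = np.random.default_rng(seed=42)
phi_w = "".join(map(str, rng.integers(low=1, high=3, size=chr14_len)))

diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"],
     CompartmentPloidies=[2], datapath="diem_files/diem_input",
     outputPath="diempy_test", ChosenInds=list(range(13)))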

I get a NumPy array creation error (np 1.24):

Traceback:

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/s2341012/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/s2341012/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py", line 205, in GetI4ofOneCompartment
    def GetI4ofOneCompartment(args): return GetI4ofOneCompartment2args(*args)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py", line 202, in GetI4ofOneCompartment2args
    return np.array([csStateCount(Ind) for Ind in Inds], dtype=np.uint32)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (14,) + inhomogeneous part.
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[22], line 1
----> 1 diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"], 
      2      CompartmentPloidies=[2], datapath="diem_files/diem_input", 
      3      outputPath="diempy_test", ChosenInds=list(range(13)))

File ~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:459, in diem(PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath, ChosenInds, diemMaxIterations, Epsilon, verbose, processes)
    457 # ACTION : Measure initial state
    458     vprint("Starting big state counts..")
--> 459     I4compartments = AbsoluteTiming(GetI4ofCompartments, PhiW)
    460 #    I4compartments = AbsoluteTimingNoArg(GetI4ofCompartments);
    461     vprint("I4 : ", I4compartments[0], " seconds.")

File ~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:311, in AbsoluteTiming(f, arg)
    309 def AbsoluteTiming(f, arg):
    310     start_time = time.time()
--> 311     ans = f(arg)
    312     duration = time.time() - start_time
    313     return [duration, ans]

File ~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:388, in diem.<locals>.GetI4ofCompartments(markerLabelsForCompartments)
    384 def GetI4ofCompartments(markerLabelsForCompartments):
    385     #    def GetI4ofCompartments():
    386     compInputs = list(
    387         zip(CompartmentDatapaths, markerLabelsForCompartments))
--> 388     return np.array(ParallelMap(GetI4ofOneCompartment, compInputs))

File ~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:369, in diem.<locals>.ParallelMap(f, arglist)
    367 def ParallelMap(f, arglist):
    368     with multiprocessing.Pool(processes=processes) as pool:
--> 369         ans = pool.map(f, arglist)
    370         while len(pool._cache) > 0:
    371             sleep(0.1)

File ~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:367, in Pool.map(self, func, iterable, chunksize)
    362 def map(self, func, iterable, chunksize=None):
    363     '''
    364     Apply `func` to each element in `iterable`, collecting the results
    365     in a list that is returned.
    366     '''
--> 367     return self._map_async(func, iterable, mapstar, chunksize).get()

File ~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:774, in ApplyResult.get(self, timeout)
    772     return self._value
    773 else:
--> 774     raise self._value

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (14,) + inhomogeneous part.

No idea what the problem is, so I'll leave this issue open in case you want to investigate further.
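
For anyone investigating: this ValueError is what NumPy raises when np.array is given rows of unequal length together with a numeric dtype, so the per-individual state counts built in GetI4ofOneCompartment2args presumably did not all have the same length. Why that happens inside diem.py is not established in this thread. The sketch below is a stdlib-only diagnostic for spotting the odd rows before they reach np.array; both function names are hypothetical and not part of diempy.

```python
from collections import Counter


def row_length_summary(rows):
    """Count how many rows have each length, e.g. {4: 13, 3: 1}."""
    return Counter(len(row) for row in rows)


def ragged_row_indices(rows):
    """Return indices of rows whose length differs from the most common length.

    An empty list means the rows are rectangular and safe to pass to
    np.array with a numeric dtype.
    """
    counts = row_length_summary(rows)
    if len(counts) <= 1:
        return []
    expected_length = counts.most_common(1)[0][0]
    return [i for i, row in enumerate(rows) if len(row) != expected_length]


# Toy stand-in for per-individual state counts; the last row is one entry short.
state_counts = [[5, 1, 0, 2], [4, 2, 1, 1], [6, 0, 2]]
bad_rows = ragged_row_indices(state_counts)
```

Running a check like this on the list comprehension result at line 202 of diem.py would show which individuals produce the mismatched rows.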

Will let you know how the bigger test goes - thanks again!
Simon

@StuartJEBaird
Owner

StuartJEBaird commented May 2, 2024 via email

@simonharnqvist
Author

Hi

Can confirm: diemR now works as expected on the full genome. Regarding output: is there a better option than simply capturing stdout (which I may have forgotten to do...)? Unlike the Python API, I don't see an output path option.

Stuart: unfortunately I still get the same error. If it works for you, which version of NumPy are you using?

@StuartJEBaird
Owner

StuartJEBaird commented May 5, 2024 via email

@simonharnqvist
Author

Hi,

diemR: Hmm - I don't find any files with optimal polarities. Let's wait until Natalia gets back, there's no rush to get this working.

diempy: This is using diem.py from this repo with one modification at line 25 of that file, where I've commented out:

np.warnings.filterwarnings('error', category=np.VisibleDeprecationWarning)

I did this because the line throws an error for me, which is why I suspect I might be using an incompatible version of NumPy.
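
A note on that line: as far as I know, np.warnings was only an alias for Python's standard warnings module and was removed in NumPy 1.24, which would explain the error. The same filter can be set through the standard library directly. The sketch below shows the mechanism using a stand-in warning class so that it runs without NumPy; in diem.py the category would be NumPy's VisibleDeprecationWarning (assumption: in recent NumPy releases it lives under np.exceptions). `escalate` and `demo_escalation` are hypothetical names.

```python
import warnings


class StandInWarning(UserWarning):
    """Hypothetical stand-in for NumPy's VisibleDeprecationWarning."""


def escalate(category):
    """Turn warnings of `category` into exceptions via the stdlib module."""
    warnings.filterwarnings("error", category=category)


def demo_escalation():
    """Return True if a StandInWarning is raised as an exception."""
    # catch_warnings restores the previous filters when the block exits,
    # so this demo does not change global warning behaviour.
    with warnings.catch_warnings():
        escalate(StandInWarning)
        try:
            warnings.warn("ragged input", StandInWarning)
        except StandInWarning:
            return True
    return False


escalated = demo_escalation()
```

Keeping the filter active, rather than commenting it out, means deprecated ragged-array creation fails loudly at its source on NumPy versions that still only warn about it.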

Here's the script:

from diem.diempy.diempy_core.diem import diem
import subprocess
import numpy as np

chr14_len = int(subprocess.check_output(["wc", "-l", "diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt"]).split()[0])
rng = np.random.default_rng(seed=42)
phi_w = "".join(map(str, rng.integers(low=1, high=3, size=chr14_len)))

diem(PhiW=phi_w,
     CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"],
     CompartmentPloidies=[[2] * 13],
     datapath="diem_files/diem_input",
     outputPath="diempy_test",
     ChosenInds=list(range(13)))
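
The script above shells out to wc -l to count markers. A portable alternative is sketched here using only the standard library; `count_lines` and `random_polarity_string` are hypothetical helpers, and the random string will not match the one produced by NumPy's generator even with the same seed. The "1"/"2" alphabet simply mirrors the script above and is not taken from diempy documentation.

```python
import random


def count_lines(path):
    """Count lines in a file without calling an external program.

    Unlike `wc -l`, a final line without a trailing newline is counted too.
    """
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def random_polarity_string(n_markers, seed=42):
    """Build a string of n_markers characters, each "1" or "2"."""
    rng = random.Random(seed)  # local generator, so the seed is reproducible
    return "".join(rng.choice("12") for _ in range(n_markers))


# Usage with the path from the script above (left commented out here because
# the file is only available on the original poster's machine):
# n = count_lines("diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt")
# phi_w = random_polarity_string(n)
```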

@StuartJEBaird
Owner

StuartJEBaird commented May 21, 2024 via email

@simonharnqvist
Author

Thanks Stuart and Natalia - indeed, nothing in the working directory; I'm running this test locally on my Mac, so there's no scratch that might get wiped. (But it also means I have a second platform to try it on - I won't run the rest of my species locally anyway)

I'm running it again - maybe the process threw an error before finishing that I didn't notice. Will let you know once that finishes, including a full list of output files.

Simon

@simonharnqvist
Author

Hi again,

Reporting back from the re-run. As a reminder, this is the call:

library(diemr)

input <- Sys.glob("diem_files/diem_input/*.chromosome_*.diem.txt")

diem(
    files = input,
    ploidy = list(rep(2, 13)),
    markerPolarity = FALSE,
    ChosenInds = 1:13,
    nCores = 6
)

Diem wrote 12 directories to my workdir, listed below. I'm using invalid/pseudo-regex to represent multiple entries.


modelDiagnostics[1-11]
│   OriginalHI.txt
│   RescaledHItobetaOfRescaledHI.pdf
│   SortedRescaledHybridIndex.pdf
│   WarpSwitch.pdf
│   WarpSwitch.txt

and

likelihood
│   MarkerDiagnostics[0-9]-[0-9].txt
│   MarkersWithChangedPolarity[1-14]

The end of the standard output looks like this:

[99865] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[99877] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
[99889]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[99901]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
[99913] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
[99925] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
[99937]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
[99949] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99961] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99973] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[99985]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
[99997] FALSE FALSE FALSE
 [ reached getOption("max.print") -- omitted 947174 entries ]

I don't see any optimal polarities anywhere - but maybe it's still hiding somewhere?

Simon

@StuartJEBaird
Owner

StuartJEBaird commented May 23, 2024 via email

@StuartJEBaird
Owner

StuartJEBaird commented May 23, 2024 via email

@simonharnqvist
Author

Thanks both - this can absolutely wait until next week. I've corrected the ploidy specification and am running again overnight. I'll keep an eye out for a 'diagnostics' directory in the workdir.

Simon

@simonharnqvist
Author

Can confirm - the verbose=TRUE made the difference. I now have a 'diagnostics' folder - diemR has worked!

@StuartJEBaird - is there a guide/example for how to go from the list of indices to the pretty Diem plots?

@StuartJEBaird
Owner

StuartJEBaird commented May 24, 2024 via email

@StuartJEBaird
Owner

StuartJEBaird commented May 24, 2024 via email

@nmartinkova
Collaborator

Hi Simon,

Thank you for your patience. I have now set the outputs so that, regardless of the value of the verbose argument in diem, a file MarkerDiagnosticsWithOptimalPolarities.txt will always be created. The new vignette "Understanding genome polarisation output files" will be available from diemr 1.3.

Best,
Natalia
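
For readers checking whether a run produced the polarity file, a small sketch follows. The directory in which diemr writes MarkerDiagnosticsWithOptimalPolarities.txt is not stated in this thread (a 'diagnostics' folder is mentioned above), so the search is recursive from the working directory; `find_output_files` is a hypothetical helper.

```python
from pathlib import Path


def find_output_files(workdir, filename="MarkerDiagnosticsWithOptimalPolarities.txt"):
    """Return every file called `filename` at or below `workdir`, sorted by path."""
    return sorted(Path(workdir).rglob(filename))


# Usage: search from the directory in which diem was run.
# for hit in find_output_files("."):
#     print(hit)
```

An empty result after a completed run suggests either an older diemr version or that the run stopped before writing its outputs.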
