-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Running diem on whole genome #5
Comments
Hi Simon,
Regarding support: When it comes to R specifics, Natalia is our guru! (Not
me)
If you are not an R user, then why not try our Python implementation?
It goes faster than R....and (of course) accepts the same input, does the
same thing, gives the same output.
If you have your diem parameter set up for R it should not be too much
trouble to given them as arguments to the Python diem function.
Your data is large, so my tip would be: Divide it up into a multiple of
nCoresAvailable ~equal-sized compartments.
This optimisation can be fiddly if you are including non-conventional
ploidy Chromosomes...as each compartment has only one ploidy spec.
(A compartment ploidy spec is just a list of the ploidy of each individual
in the input - this is a function of gender for W,X,Y,Z)
If your chr sizes are of the same order of magnitude, then it's probably
easier to stick with one compartment per chr - it will just take a bit
longer.
Keep the questions coming,
Best
S
…On Fri, 26 Apr 2024 at 12:57, simonharnqvist ***@***.***> wrote:
Hi @StuartJEBaird <https://github.com/StuartJEBaird> - taking you up on
your offer on how to run diem on genome scale data. Raising an issue here
so it becomes part of the documentation, but feel free to direct me
elsewhere.
First things first - the nCores parameter in diemR does not seem to be
working on macOS; CPU load is not responsive to the requested number of
threads. I'm not an R user: is there's something else I need to do to
enable R to multiprocess? (Given files are per chromosome I can work around
this easily, but just thought I should let you know)
Other than multiprocessing, what are some tricks to make this run smoothly
on a whole genome?
Thanks in advance
Simon
—
Reply to this email directly, view it on GitHub
<#5>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV54ETQQE635OCHWU55LY7IXJ5AVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGI3DKNJTGY4TIMI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks @StuartJEBaird! I would definitely prefer using Python, but I couldn't find any documentation for the Python version |
You were able to download diempy?
The *diem* function is identical across all platforms, and so usage and
the arguments should be the same...
ie if you have the arguments for *diemr* then they should be the arguments
for *diempy*.
This identity is true for diempy/diemmca, but may not hold perfectly
between diemr/diempy.
(Natalia has been developing a wrapper round the core R function).
We can fill in this part of the support right now if you wish...
What are your arguments to the diemr version?
On my side, I will look out the python calls I used for the *diem*
publication.
(I use diemmca day to day).
Best
S
…On Tue, 30 Apr 2024 at 12:21, simonharnqvist ***@***.***> wrote:
Thanks @StuartJEBaird <https://github.com/StuartJEBaird>! I would
definitely prefer using Python, but I couldn't find any documentation for
the Python version
—
Reply to this email directly, view it on GitHub
<#5 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV5Z246L3GQJ6X6Z554DY75WELAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBUHEZDMMZUGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Hmm - the function signatures are very different: diem.py def diem(PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath, ChosenInds, diemMaxInterations, Epsilon, verbose, processes): vs diem.R diem <- function(files, ploidy = list(2), markerPolarity = FALSE, ChosenInds,
epsilon = 0.99999, verbose = FALSE, nCores = parallel::detectCores() - 1,
maxIterations = 50, ...) |
Hi Simon!
The final 5 arguments are identical:
ChosenInds, diemMaxInterations (maxIterations), Epsilon, verbose,
processes (nCores)
leaving:
PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath
*CompartmentNames is a list of strings*... *datapath* can end in "/",
though this is optional as the internal code is:
*CompartmentDatapaths = [ os.path.join(datapath, name) for name in
CompartmentNames ]*
*outputPath* is obvious.
The only non-obvs inputs are then 2: PhiW, CompartmentPloidies
*PhiW is the Null Polarisation you wish to use.*
To generate a Null Polarisation for total N[sites] NS, use a random number
generator to generate a list of NS "1" or "2" characters.
Subdivide this list and concatenate to form N[compartments] strings
"1211212211..."
The string lengths of these strings are the numbers of sites in each
compartment.
*Here is some pseudocode:*
CompartmentLengths = Map[Length,Compartments]
NS = Total[ CompartmentLengths ]
PhiW = RandomChoice[ {"1","2"}, NS ]
PhiW= TakeList[ PhiW, CompartmentLengths ]
PhiW = Map[StringJoin, PhiW]
Each time you generate a Null it is (very) likely to be unique, so it is
good practice to save it to file before passing it to *diem* as an argument.
(You can then guarantee replicability of any results, as *diem* is
deterministic).
*CompartmentPloidies details the ploidy for each individual, for each data
compartment*
It is simplest to set all compartment ploidies the same (normally 2... see
diemr default)...
but you may need to be more exact, and divide your sites into sets of
equal ploidy such as
Autosome, W, X, Y, Z, mt
For diploid autosomes and mt all individuals have the same ploidy:
{2,2,2,2,2...} and {1,1,1,1,1...} respectively.
For data from males=M, females=F, {MFMFMFMFMFMFMF....}
X and Z compartments have ploidy: {1,2,1,2,1...} and {2,1,2,1,2...}
respectively.
Y and W compartments have ploidy: {1,0,1,0,1...} and {0,1,0,1,0...}
respectively.
You have to determine your own {MFMMFFF...} sequence from the metadata of
your genomes (as ordered in your *diem* input sites).
This makes *CompartmentPloidies *is an N[compartments] x N[individuals]
matrix of whole numbers.
Is this sufficient info for you to get *diempy* running?
Checking my *diem* Manual: it is very much a work in progress, so would not
yet be useful for you,
but if the above is sufficient I will use it as a guide to the relevant
Manual section.
Bestest
Stuart
…On Tue, 30 Apr 2024 at 15:46, simonharnqvist ***@***.***> wrote:
Hmm - the function signatures are very different:
*diem.py*
def diem(PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath, ChosenInds, diemMaxInterations, Epsilon, verbose, processes):
vs
*diem.R*
diem <- function(files, ploidy = list(2), markerPolarity = FALSE, ChosenInds,
epsilon = 0.99999, verbose = FALSE, nCores = parallel::detectCores() - 1,
maxIterations = 50, ...)
—
Reply to this email directly, view it on GitHub
<#5 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV56NFQB4R4QOBEZCJE3Y76OETAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBVGM3TQNZZHE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Hi Stuart, I was being very silly - of course it doesn't use more than one core when I've only provided one input file. As far as I can tell, it works on MacOS - will run a proper test overnight. With diempy, I'm getting an error - I'll stick the traceback below for your information, but I'm happy enough with diemR now. chr14_len = int(subprocess.check_output(["wc", "-l", "diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt"]).split()[0])
rng = np.random.default_rng(seed=42)
phi_w = "".join(map(str, rng.integers(low=1, high=3, size=chr14_len)))
diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"],
CompartmentPloidies=[2], datapath="diem_files/diem_input",
outputPath="diempy_test", ChosenInds=list(range(13))) I get a NumPy array creation error (np 1.24): Traceback (ignore the VSCode links - I'm too lazy to crop them out):
No idea what the problem is, so I'll leave this issue open in case you want to investigate further. Will let you know how the bigger test goes - thanks again! |
Hi Simon!
On Thu, 2 May 2024 at 13:52, simonharnqvist ***@***.***> wrote:
Hi Stuart,
I was being very silly - of course it doesn't use more than one core when
I've only provided one input file. As far as I can tell, it works on MacOS
- will run a proper test overnight.
I will tell Natalia - she will be pleased.
With diempy, I'm getting an error - I'll stick the traceback below for
your information, but I'm happy enough with diemR now.
I would be happy if you try the following for *diempy*
It seems you called *diempy* like this:
diem(PhiW=phi_w,
CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"],
CompartmentPloidies=[2], datapath="diem_files/diem_input",
outputPath="diempy_test", ChosenInds=list(range(13)))
But CompartmentPloidies is "an N[compartments] x N[individuals] matrix"
(see passim), so
(as the compatement name list has only one entry) try calling it something
like this:
diem(PhiW=phi_w,
CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"],
CompartmentPloidies=[ [2 for number in range(13)] ],
datapath="diem_files/diem_input",
outputPath="diempy_test", ChosenInds=list(range(13)))
such that compartmentPloidies *also* only has one entry: a list of ploidies
for all 13 individuals (for the one compartment in the list).
At some point I will put a wrapper round the core function to catch stuff
like this...
Bestest
S
… chr14_len = int(subprocess.check_output(["wc", "-l", "diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt"]).split()[0])rng = np.random.default_rng(seed=42)phi_w = "".join(map(str, rng.integers(low=1, high=3, size=chr14_len)))
diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"],
CompartmentPloidies=[2], datapath="diem_files/diem_input",
outputPath="diempy_test", ChosenInds=list(range(13)))
I get a NumPy array creation error (np 1.24):
Traceback (ignore the VSCode links - I'm too lazy to crop them out):
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/s2341012/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/Users/s2341012/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
^^^^^^^^^^^^^^^^
File "/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py", line 205, in GetI4ofOneCompartment
def GetI4ofOneCompartment(args): return GetI4ofOneCompartment2args(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py", line 202, in GetI4ofOneCompartment2args
return np.array([csStateCount(Ind) for Ind in Inds], dtype=np.uint32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (14,) + inhomogeneous part.
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
Cell In[22], [line 1](vscode-notebook-cell:?execution_count=22&line=1)
----> [1](vscode-notebook-cell:?execution_count=22&line=1) diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"],
[2](vscode-notebook-cell:?execution_count=22&line=2) CompartmentPloidies=[2], datapath="diem_files/diem_input",
[3](vscode-notebook-cell:?execution_count=22&line=3) outputPath="diempy_test", ChosenInds=list(range(13)))
File [~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:459](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:459), in diem(PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath, ChosenInds, diemMaxIterations, Epsilon, verbose, processes)
[457](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:457) # ACTION : Measure initial state
[458](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:458) vprint("Starting big state counts..")
--> [459](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:459) I4compartments = AbsoluteTiming(GetI4ofCompartments, PhiW)
[460](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:460) # I4compartments = AbsoluteTimingNoArg(GetI4ofCompartments);
[461](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:461) vprint("I4 : ", I4compartments[0], " seconds.")
File [~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:311](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:311), in AbsoluteTiming(f, arg)
[309](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:309) def AbsoluteTiming(f, arg):
[310](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:310) start_time = time.time()
--> [311](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:311) ans = f(arg)
[312](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:312) duration = time.time() - start_time
[313](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:313) return [duration, ans]
File [~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:388](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:388), in diem.<locals>.GetI4ofCompartments(markerLabelsForCompartments)
[384](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:384) def GetI4ofCompartments(markerLabelsForCompartments):
[385](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:385) # def GetI4ofCompartments():
[386](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:386) compInputs = list(
[387](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:387) zip(CompartmentDatapaths, markerLabelsForCompartments))
--> [388](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:388) return np.array(ParallelMap(GetI4ofOneCompartment, compInputs))
File [~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:369](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:369), in diem.<locals>.ParallelMap(f, arglist)
[367](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:367) def ParallelMap(f, arglist):
[368](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:368) with multiprocessing.Pool(processes=processes) as pool:
--> [369](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:369) ans = pool.map(f, arglist)
[370](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:370) while len(pool._cache) > 0:
[371](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:371) sleep(0.1)
File [~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:367](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:367), in Pool.map(self, func, iterable, chunksize)
[362](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:362) def map(self, func, iterable, chunksize=None):
[363](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:363) '''
[364](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:364) Apply `func` to each element in `iterable`, collecting the results
[365](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:365) in a list that is returned.
[366](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:366) '''
--> [367](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:367) return self._map_async(func, iterable, mapstar, chunksize).get()
File [~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:774](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:774), in ApplyResult.get(self, timeout)
[772](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:772) return self._value
[773](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:773) else:
--> [774](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:774) raise self._value
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (14,) + inhomogeneous part.
No idea what the problem is, so I'll leave this issue open in case you
want to investigate further.
Will let you know how the bigger test goes - thanks again!
Simon
—
Reply to this email directly, view it on GitHub
<#5 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV57KNAOEXBN6FO4KDELZAISGFAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJQGMYDSNJXGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Hi Can confirm: diemR now works as expected on full genome. Regarding output: is there a better option than simply capturing stdout (which I may have forgotten to do...) - I don't see an output path option unlike in the Python API? Stuart: unfortunately I still get the same error. If that works for you: which version of NumPy are you using? |
Hi simon,
As Natalia is on holiday I will try to answer for both R and Python:
On Fri, 3 May 2024 at 15:33, simonharnqvist ***@***.***> wrote:
Hi
Can confirm: diemR now works as expected on full genome. Regarding output:
is there a better option than simply capturing stdout (which I may have
forgotten to do...) - I don't see an output path option unlike in the
Python API?
I believe diemr may have tried to create a new output directory in the
'working directory'.
Try searching for filenames like "MarkerDiagnosticsWithOptimalPolarities.txt
"
Stuart: unfortunately I still get the same error. If that works for you:
which version of NumPy are you using?
Precisely the same error... interesting.
Perhaps just send me your python script for calling diempy - I believe I
have the data for chr14 already?
Probably easier to work it out from this end.
Bestest
Stuart
… —
Reply to this email directly, view it on GitHub
<#5 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV5YSE6FVEG2ZWDKLQW3ZAOG3XAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJTGAZTGMBYGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Hi, diemR: Hmm - I don't find any files with optimal polarities. Let's wait until Natalia gets back, there's no rush to get this working. diempy: This is using diem.py from this repo with one modification of L25 - I've commented out: np.warnings.filterwarnings('error', category=np.VisibleDeprecationWarning) Because this throws an error for me - which is why I suspect I might be using an incompatible version of NumPy. Here's the script: from diem.diempy.diempy_core.diem import diem
import subprocess
import numpy as np
chr14_len = int(subprocess.check_output(["wc", "-l", "diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt"]).split()[0])
rng = np.random.default_rng(seed=42)
phi_w = "".join(map(str, rng.integers(low=1, high=3, size=chr14_len)))
diem(PhiW=phi_w,
CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"],
CompartmentPloidies=[ [2]*13 ],
datapath="diem_files/diem_input",
outputPath="diempy_test", ChosenInds=list(range(13))) |
Hi Simon!
Natalia is back - and straight into grant assessment. She did have a
suggestion however:
On Mon, 6 May 2024 at 11:15, simonharnqvist ***@***.***> wrote:
Hi,
diemR: Hmm - I don't find any files with optimal polarities. Let's wait
until Natalia gets back, there's no rush to get this working.
Natalia: "into the working directory. getwd()"
Personally, I think you might have thought of that... is it possible there
is an interaction term with running diemr on a cluster/cloud?
(eg auto-trash of working area once job is complete?).
diempy: This is using diem.py from this repo with one modification of L25
- I've commented out:
np.warnings.filterwarnings('error', category=np.VisibleDeprecationWarning)
Because this throws an error for me - which is why I suspect I might be
using an incompatible version of NumPy.
At that line I cast the version warning into an error
This gives me some look-ahead regarding version updates - instead of users
getting a warning they can ignore, they get an apparent error that they
report to me.
I can then tell them to comment out L25... and off they go!
Unfortunately this means the version issue is very likely ignorable in your
case, and the real error is elsewhere.
Will look into it, and thank you for the help tracing the issue!
S
… Here's the script:
from diem.diempy.diempy_core.diem import diemimport subprocessimport numpy as np
chr14_len = int(subprocess.check_output(["wc", "-l", "diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt"]).split()[0])rng = np.random.default_rng(seed=42)phi_w = "".join(map(str, rng.integers(low=1, high=3, size=chr14_len)))
diem(PhiW=phi_w,CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"],
CompartmentPloidies=[ [2]*13 ],datapath="diem_files/diem_input",
outputPath="diempy_test", ChosenInds=list(range(13)))
—
Reply to this email directly, view it on GitHub
<#5 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV55K5CQAW4F47N4345DZA5C3TAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJVGUZDQMRUGI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks Stuart and Natalia - indeed, nothing in the working directory; I'm running this test locally on my Mac, so there's no scratch that might get wiped. (But it also means I have a second platform to try it on - I won't run the rest of my species locally anyway) I'm running it again - maybe the process threw an error before finishing that I didn't notice. Will let you know once that finishes, including a full list of output files. Simon |
Hi again, Reporting back from the re-run. As a reminder, this is the call: library(diemr)
input <- Sys.glob("diem_files/diem_input/*.chromosome_*.diem.txt")
diem(
files = input,
ploidy = list(rep(2, 13)),
markerPolarity = FALSE,
ChosenInds = 1:13,
nCores = 6
) Diem wrote 12 directories to my workdir, listed below. I'm using invalid/pseudo-regex to represent multiple entries.
and
The end of the standard output looks like this:
I don't see any optimal polarities anywhere - but maybe it's still hiding somewhere? Simon |
Hi Simon,
Thanks for this very clear summary.
Have a look in the MarkerDiagnostics files:
If there are three columns, then they are the
{OptimalPolarity,DiagnosticIndex,Support} tuples for each of your sites.
I have no idea why R is trying to print a million booleans to stdout...
Natalia?
More comments/queries below: BOTH: Please correct anything I have wrong...
Best
Stuart
On Thu, 23 May 2024 at 11:40, simonharnqvist ***@***.***> wrote:
Hi again,
Reporting back from the re-run. As a reminder, this is the call:
library(diemr)
input <- Sys.glob("diem_files/diem_input/*.chromosome_*.diem.txt")
There are multiple compartments of diem-format data: One for each
chromosome.
(The number of compartments is unclear - but it's a butterfly so around 32)
diem(
files = input,
ploidy = list(rep(2, 13)),
markerPolarity = FALSE,
ChosenInds = 1:13,
nCores = 6
)
You have chosen to let diemr construct a null polarisation (which it will
save for you, Natalia?)
You have specified the ploidy for only one compartment of data: As with
diempy,diemmca, the natural ploidy spec is a list of lists, one for each
compartment.
(Perhaps diemr catches this type mis-match and constructs a list of lists
by assuming all compartments have the same ploidy spec... Natalia?)
Diem wrote 12 directories to my workdir, listed below. I'm using
invalid/pseudo-regex to represent multiple entries.
12 surprises me. I would expect 1 summary directory (diemr modelDiagnostics)
And one directory to gather outputs for each of the chromosomes (diemr
likelihood).
Check the time stamps of the summary directories... diemr may be avoiding
overwriting older directories,
which would suggest this is the 11th time you have tried to get this
working....
…
modelDiagnostics[1-11]
│ OriginalHI.txt
│ RescaledHItobetaOfRescaledHI.pdf
| SortedRescaledHybridIndex.pdf
│ WarpSwitch.pdf
| WarpSwitch.txt
and
likelihood
| MarkerDiagnostics[0-9]-[0-9].txt
| MarkersWithChangedPolarity[1-14]
The end of the standard output looks like this:
[99865] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[99877] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
[99889] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[99901] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
[99913] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
[99925] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
[99937] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[99949] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99961] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99973] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[99985] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[99997] FALSE FALSE FALSE
[ reached getOption("max.print") -- omitted 947174 entries ]
I don't see any optimal polarities anywhere - but maybe it's still hiding
somewhere?
Simon
—
Reply to this email directly, view it on GitHub
<#5 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV57QCVQ5ARLAZPXSZQTZDW2SBAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRWGY3TEMRXGE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Dear Simon,
Can you wait until Tuesday? I’m horribly overwhelmed with project evaluations after my holidays. So now, just a quick reply:
To make sure that I provide ploidies for all compartments, I use this code:
subory = dir("Data", full.names = TRUE) # input files in diem format
inds = c(1:13) # indices of individuals to use
ploidies = rep(list(rep(2, length(inds))), length(subory)) # diploid for all compartments. Must be modified if a compartment is a sex chromosome and individuals have known sex.
CheckDiemFormat(subory, inds, ploidies)
res = diem(files = subory,
ChosenInds = inds,
ploidy = ploidies,
verbose = TRUE)
The polarities are in res$newPolarity or saved to diagnostics/MarkersDiagnosticsWithOptimalPolarities in the working directory with the polarities in the first column as Stuart says.
# this code calculates how many markers there are per compartment.
compartment.sizes = readLines("diagnostics/NullMarkerPolarities.txt")[-1]
compartment.sizes = strsplit(compartment.sizes, split = " ")
compartment.sizes = lapply(compartment.sizes, as.logical)
compartment.sizes = unlist(lapply(compartment.sizes, length))
# import genotypes
# WILL FAIL DUE TO MEMORY FOR LARGE DATASETS!
gens = Reduce(cbind, lapply(1:length(subory), FUN = function(x) importPolarized(file = subory[x],
changePolarity = res$markerPolarity[(sum(c(1, compartment.sizes)[1:x])):(sum(compartment.sizes[1:x]))],
ChosenInds = inds)
))
The multiple directories are Stuart’s old requirement that we report the results of each iteration (also in res$changedPolarities).
I hope this helps until Tuesday.
Best,
Natalia
… On 23 May 2024, at 12:32, StuartJEBaird ***@***.***> wrote:
Hi Simon,
Thanks for this very clear summary.
Have a look in the MarkerDiagnostics files:
If there are three columns, then they are the {OptimalPolarity,DiagnosticIndex,Support} tuples for each of your sites.
I have no idea why R is trying to print a million booleans to stdout... Natalia?
More comments/queries below: BOTH: Please correct anything I have wrong...
Best
Stuart
On Thu, 23 May 2024 at 11:40, simonharnqvist ***@***.*** ***@***.***>> wrote:
Hi again,
Reporting back from the re-run. As a reminder, this is the call:
library(diemr)
input <- Sys.glob("diem_files/diem_input/*.chromosome_*.diem.txt")
There are multiple compartments of diem-format data: One for each chromosome.
(The number of compartments is unclear - but it's a butterfly so around 32)
diem(
files = input,
ploidy = list(rep(2, 13)),
markerPolarity = FALSE,
ChosenInds = 1:13,
nCores = 6
)
You have chosen to let diemr construct a null polarisation (which it will save for you, Natalia?)
You have specified the ploidy for only one compartment of data: As with diempy,diemmca, the natural ploidy spec is a list of lists, one for each compartment.
(Perhaps diemr catches this type mis-match and constructs a list of lists by assuming all compartments have the same ploidy spec... Natalia?)
Diem wrote 12 directories to my workdir, listed below. I'm using invalid/pseudo-regex to represent multiple entries.
12 surprises me. I would expect 1 summary directory (diemr modelDiagnostics)
And one directory to gather outputs for each of the chromosomes (diemr likelihood).
Check the time stamps of the summary directories... diemr may be avoiding overwriting older directories,
which would suggest this is the 11th time you have tried to get this working....
modelDiagnostics[1-11]
│ OriginalHI.txt
│ RescaledHItobetaOfRescaledHI.pdf
| SortedRescaledHybridIndex.pdf
│ WarpSwitch.pdf
| WarpSwitch.txt
and
likelihood
| MarkerDiagnostics[0-9]-[0-9].txt
| MarkersWithChangedPolarity[1-14]
The end of the standard output looks like this:
[99865] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[99877] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
[99889] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[99901] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
[99913] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
[99925] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
[99937] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[99949] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99961] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99973] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[99985] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[99997] FALSE FALSE FALSE
[ reached getOption("max.print") -- omitted 947174 entries ]
I don't see any optimal polarities anywhere - but maybe it's still hiding somewhere?
Simon
—
Reply to this email directly, view it on GitHub <#5 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABDCV57QCVQ5ARLAZPXSZQTZDW2SBAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRWGY3TEMRXGE>.
You are receiving this because you were mentioned.
|
Thanks both - this can absolutely wait until next week. I've corrected the ploidy specification and am running again overnight. I'll keep an eye out for a 'diagnostics' directory in the workdir. Simon |
Can confirm - the @StuartJEBaird - is there a guide/example for how go from the list of indices to the pretty Diem plots? |
Hi Simon,
(Konrad - if you want to play with this stuff in Mathematica just say and I
will send my input files for the cdf below)
Simon: I don't think it was the verbose=TRUE .... more likely the
incorrect ploidy spec (feel free to prove me wrong).
But in any case - diemR has worked.
You now have the polarisation stats for very many brenthis sites, *the
majority of which will be uninteresting*.
The interesting ones have high diagnostic index (DI).
In which environment will you be working now?
The first thing to do is examine the distribution of DIs over all sites
(for all chromosomes).
Much as one would only be interested in ~1% of the genome, if looking for
differences between humans and chimps,
one is only interested in a (small) percentage of the genome when looking
for differences between *brenthis*es.
But that percentage is a priori unknown - hence inspect the DI distribution.
Here is the distribution for brenthis Chr1:
[image: image.png]
When I worked up your chr1 data I arbitrarily decided to work with all
sites DI>-30...thats about 46% of sites (which is *way* more than most
systems).
(Not completely arbitrary - RH peak is sites with fixed differences, DI>-30
includes a smaller peak... ~sites with introgression).
Natalia has given you her R import code (see passim), which imports and
polarises all your original site data.
If you work in R, I suggest you add to that code a filter so that you only
import DI>SimonsChoice,
and then use the diemr visualiser. Be aware: You have way more highDI sites
than anyone else, so you may discover 'issues'.
I attach my Mathematica import and visualisation code in a cdf file that
can be opened using Wolfram's free MathPlayer.
I have added *dummyBrenthisNames* and a *dummyBrenthisBED* to show
where/how sampleIDs and a bed file for *diem*-encoded sites are used.
Not sure the BEDfile stuff is yet available in R...but it is important:
Check out 'chr'5 in the cdf visualisation - there is a proximal region with
a desert of diagnostic sites.
One explanation is a selective sweep across the brenthis barrier, such that
there is a large region now identical across the barrier, and therefore
unpolarisable.
I suggest that because of the perturbation in the remaining polarisable
sites around the desert. Looks to me like a burst DMI - something I have
been discussing with Pierre Nahoud in his data... but I digress.
If you want to code up this post-*diem* stuff in Python, that would be
great! We could add your code to the repository (along with Sam's vcf2diem)
Along those lines, we are just about to add the circular *diem* plot code
in Python.
I should say, regarding potential *brenthis*-specific issues: You have WAY
more highDI sites than anyone else...we may need to (lossless)
runlength compress/encode your data for decent visualisations...this is one
of two tricks we already use to plot circular whole genome diem results.
Best
Stuart
…On Fri, 24 May 2024 at 12:24, simonharnqvist ***@***.***> wrote:
Can confirm - the verbose=TRUE made the difference. I now have a
'diagnostics' folder - diemR has worked!
@StuartJEBaird <https://github.com/StuartJEBaird> - is there a
guide/example for how go from the list of indices to the pretty Diem plots?
—
Reply to this email directly, view it on GitHub
<#5 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABDCV56574MPXBUBJ2D4GS3ZD4INVAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRZGE4DMOBXHA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks for all your help Stuart!
The high density of highDI sites will be true for all butterfly species pairs so run-length compression will be helpful in general. Awesome that the circular plotting code will be available in python soon.
@SimonH: totally up to you whether you want to work in R, python or mathematica. The advantage of going for python is that it may the best environment for coding up some of the post-diem filtering stuff in a generally useful way. One thing that is not clear to me is how necessary/helpful kernel smoothing is. I assume that this is only possible in mathematica at the moment?
@ Stuart: I will try my best to explain your tract-length distribution analysis to folk here at the labmeeting next wednesday (12.30 your time). If you have time to join we could chat about strategies for Simon's post-diem analyses.
Best wishes,
Konrad
…________________________________
From: StuartJEBaird ***@***.***>
Sent: 24 May 2024 13:55
To: StuartJEBaird/diem ***@***.***>
Cc: Konrad Lohse ***@***.***>
Subject: Re: [StuartJEBaird/diem] Running diem on whole genome (Issue #5)
This email was sent to you by someone outside the University.
You should only click on links or attachments if you are certain that the email is genuine and the content is safe.
Hi Simon,
(Konrad - if you want to play with this stuff in Mathematica just say and I will send my input files for the cdf below)
Simon: I don't think it was the verbose=TRUE .... more likely the incorrect ploidy spec (feel free to prove me wrong).
But in any case - diemR has worked.
You now have the polarisation stats for very many brenthis sites, the majority of which will be uninteresting.
The interesting ones have high diagnostic index (DI).
In which environment will you be working now?
The first thing to do is examine the distribution of DIs over all sites (for all chromosomes).
Much as one would only be interested in ~1% of the genome, if looking for differences between humans and chimps,
one is only interested in a (small) percentage of the genome when looking for differences between brenthises.
But that percentage is a priori unknown - hence inspect the DI distribution.
Here is the distribution for brenthis Chr1:
[image.png]
When I worked up your chr1 data I arbitrarily decided to work with all sites DI>-30...thats about 46% of sites (which is way more than most systems).
(Not completely arbitrary - RH peak is sites with fixed differences, DI>-30 includes a smaller peak... ~sites with introgression).
Natalia has given you her R import code (see passim), which imports and polarises all your original site data.
If you work in R, I suggest you add to that code a filter so that you only import DI>SimonsChoice,
and then use the diemr visualiser. Be aware: You have way more highDI sites than anyone else, so you may discover 'issues'.
I attach my Mathematica import and visualisation code in a cdf file that can be opened using Wolfram's free MathPlayer.
I have added dummyBrenthisNames and a dummyBrenthisBED to show where/how sampleIDs and a bed file for diem-encoded sites are used.
Not sure the BEDfile stuff is yet available in R...but it is important: Check out 'chr'5 in the cdf visualisation - there is a proximal region with a desert of diagnostic sites.
One explanation is a selective sweep across the brenthis barrier, such that there is a large region now identical across the barrier, and therefore unpolarisable.
I suggest that because of the perturbation in the remaining polarisable sites around the desert. Looks to me like a burst DMI - something I have been discussing with Pierre Nahoud in his data... but I digress.
If you want to code up this post-diem stuff in Python, that would be great! We could add your code to the repository (along with Sam's vcf2diem)
Along those lines, we are just about to add the circular diem plot code in Python.
I should say, regarding potential brenthis-specific issues: You have WAY more highDI sites than anyone else...we may need to (lossless) runlength compress/encode your data for decent visualisations...this is one of two tricks we already use to plot circular whole genome diem results.
Best
Stuart
On Fri, 24 May 2024 at 12:24, simonharnqvist ***@***.******@***.***>> wrote:
Can confirm - the verbose=TRUE made the difference. I now have a 'diagnostics' folder - diemR has worked!
@StuartJEBaird<https://github.com/StuartJEBaird> - is there a guide/example for how go from the list of indices to the pretty Diem plots?
—
Reply to this email directly, view it on GitHub<#5 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ABDCV56574MPXBUBJ2D4GS3ZD4INVAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRZGE4DMOBXHA>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
|
Hi Simon, Thank you for your patience. I have now set the outputs in such a way that regardless of the value in the Best, |
Hi @StuartJEBaird - taking you up on your offer on how to run diem on genome scale data. Raising an issue here so it becomes part of the documentation, but feel free to direct me elsewhere.
First things first - the nCores parameter in diemR does not seem to be working on macOS; CPU load is not responsive to the requested number of threads. I'm not an R user: is there's something else I need to do to enable R to multiprocess? (Given files are per chromosome I can work around this easily, but just thought I should let you know)
Other than multiprocessing, what are some tricks to make this run smoothly on a whole genome?
Thanks in advance
Simon
The text was updated successfully, but these errors were encountered: