CPU usage and GPU usage too little #244

Open
xiaoyaojianghuzai opened this issue Apr 24, 2024 · 13 comments

@xiaoyaojianghuzai

Hi professor @mdbarnesUCSD,
I changed my code to use the given parameters, but the CPU usage is still very low.
When it reaches the "matrix generation for INDELs" step, the CPU usage is very low
and the step takes too long to finish.
I looked through sigpro.py and found that you use the multiprocessing package, but it doesn't seem to take effect.

I run the code on Ubuntu Linux 22.04, and the SigProfilerExtractor package is the latest version.

@xiaoyaojianghuzai
Author

Also, the VCF files are stored on a hard disk.

After searching for this problem on Google, I found that Python has the GIL (global interpreter lock). Does this prevent full usage of the CPU?
But I looked through the past issues of SigProfilerExtractor, and some people even had the opposite problem of CPU and GPU usage being too high.
I cannot find the real answer or solve the problem.
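
On the GIL question: the lock serializes Python bytecode among threads inside one process, while separate processes each have their own interpreter and their own lock. The following is a minimal standard-library sketch of that idea, not a description of how SigProfilerExtractor parallelizes its work; `busy_sum` and `run_in_processes` are illustrative names.

```python
import os
from multiprocessing import Pool


def busy_sum(n):
    """CPU-bound toy task: sum of squares below n, tagged with the worker PID."""
    return os.getpid(), sum(i * i for i in range(n))


def run_in_processes(tasks, workers):
    """Run busy_sum over tasks in separate processes (each has its own GIL)."""
    with Pool(processes=workers) as pool:
        return pool.map(busy_sum, tasks)


if __name__ == "__main__":
    results = run_in_processes([200_000] * 8, workers=min(4, os.cpu_count() or 1))
    pids = {pid for pid, _ in results}
    print(f"{len(results)} tasks ran in {len(pids)} worker process(es)")
```

If a tool shows low CPU usage despite using processes, the bottleneck is more likely a serial step or disk I/O than the GIL itself.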

So I used multiprocessing myself to process the different cancer types in parallel.
But within each cancer type, the problem remains.

from multiprocessing import Pool

from SigProfilerExtractor import sigpro as sig

# cancer_types: list of folder names under VCFinput, defined elsewhere.


def extract_signature_for_folder(folder):
    input_dir = f"/harddisk/sxt/VCFinput/{folder}"  # folder containing the VCF files
    output_path = f"/harddisk/sxt/output/{folder}"
    print(f"Signature extraction for {folder} started.")
    sig.sigProfilerExtractor("vcf", output_path, input_dir, "GRCh38")
    print(f"Signature for {folder} extracted.")


if __name__ == "__main__":
    with Pool(128) as p:
        p.map(extract_signature_for_folder, cancer_types)
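
One thing worth checking in a setup like this is the worker count: a pool cannot keep more workers busy than there are tasks, and the README describes a `cpu` parameter suggesting each extraction may start its own workers as well. A hedged sketch of sizing the outer pool follows; `choose_pool_size` is a hypothetical helper, not part of SigProfilerExtractor.

```python
import os


def choose_pool_size(n_tasks, available_cpus=None):
    """Return a worker count no larger than the task count or the CPU count."""
    if available_cpus is None:
        available_cpus = os.cpu_count() or 1
    return max(1, min(n_tasks, available_cpus))


if __name__ == "__main__":
    # Folder names taken from the paths in this thread, for illustration only.
    cancer_types = ["rectum", "gum"]
    print(choose_pool_size(len(cancer_types), available_cpus=128))
```

With two folders, only two workers can ever be active, regardless of how large the pool is.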

@xiaoyaojianghuzai
Author

I sincerely look forward to hearing from you.

@xiaoyaojianghuzai
Author

JOB_METADATA.txt

I ran a small example as a test.

@mdbarnesUCSD
Collaborator

Hi @xiaoyaojianghuzai,

Your input matrix has 96 rows and 2 columns, but your extraction is from signatures 1 to 25. This does not work and you need a larger input matrix (the max rank is 2 for a 96x2 input).
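
The rank argument can be sketched as a quick pre-flight check. This covers only the linear-algebra bound (the rank of a matrix cannot exceed its smaller dimension); the tool itself may enforce a stricter limit. `max_rank` and `range_is_feasible` are hypothetical helper names.

```python
def max_rank(n_mutation_types, n_samples):
    """Upper bound on the rank of an n_mutation_types x n_samples matrix."""
    return min(n_mutation_types, n_samples)


def range_is_feasible(n_mutation_types, n_samples, minimum, maximum):
    """True if the requested signature range fits within the rank bound."""
    return 1 <= minimum <= maximum <= max_rank(n_mutation_types, n_samples)


if __name__ == "__main__":
    print(range_is_feasible(96, 2, 1, 25))  # False: 25 exceeds the bound of 2
    print(range_is_feasible(96, 2, 1, 2))   # True
```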

Please review the README and run the example using the matrix file as input (code below):

from SigProfilerExtractor import sigpro as sig


def main_function():
    # Get input in table format (mutation catalog matrix).
    path_to_example_table = sig.importdata("matrix")
    # You can instead use the path to your own tab-delimited file containing the mutational catalog matrix/table.
    data = path_to_example_table
    sig.sigProfilerExtractor("matrix", "example_output", data, opportunity_genome="GRCh38", minimum_signatures=1, maximum_signatures=3)


if __name__ == "__main__":
    main_function()

@xiaoyaojianghuzai
Author

Hi professor @mdbarnesUCSD,
I use VCF files as input. Should I change the maximum number of signatures too?

@xiaoyaojianghuzai
Author

Also, can I extract signatures only for SBS and DINUC, excluding INDELs? When I change the context_type parameter, nothing changes: it still generates matrices for INDELs as usual. How can I do this?
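
For reference, the SigProfilerExtractor README describes `context_type` as a comma-separated string of context names, with "96" for SBS96, "DINUC" for dinucleotides, and "ID" for INDELs. The sketch below assumes that description applies to the installed version; whether it also skips INDEL matrix generation for VCF input is not confirmed here. `build_context_type` and `run_extraction` are hypothetical helpers.

```python
def build_context_type(contexts):
    """Join context names into the comma-separated string form."""
    return ",".join(contexts)


def run_extraction(input_dir, output_dir, contexts):
    """Call sigProfilerExtractor on VCF input with a restricted context list."""
    # Imported lazily so the helper above works without the package installed.
    from SigProfilerExtractor import sigpro as sig

    sig.sigProfilerExtractor(
        "vcf",
        output_dir,
        input_dir,
        reference_genome="GRCh38",
        context_type=build_context_type(contexts),  # assumed format, see above
        minimum_signatures=1,
        maximum_signatures=3,
    )


if __name__ == "__main__":
    print(build_context_type(["96", "DINUC"]))
```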

@xiaoyaojianghuzai
Author

sig.sigProfilerExtractor("vcf", "/harddisk/sxt/output/gum", "/harddisk/sxt/VCFinput/gum", "GRCh38", minimum_signatures=1,maximum_signatures=3)

(two images attached)

@xiaoyaojianghuzai
Author

How should I choose the maximum_signatures parameter when using VCF files as input?

@mdbarnesUCSD
Collaborator

Hi @xiaoyaojianghuzai,

The maximum_signatures needs to be a value less than the number of samples that you have.
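
For VCF input, this rule can be sketched as follows, assuming one VCF file per sample (an assumption about the input layout, not something stated in this thread). `count_vcf_samples` and `cap_maximum_signatures` are hypothetical helpers.

```python
from pathlib import Path


def count_vcf_samples(vcf_dir):
    """Count *.vcf files in vcf_dir, assuming one file per sample."""
    return sum(1 for p in Path(vcf_dir).iterdir() if p.suffix.lower() == ".vcf")


def cap_maximum_signatures(n_samples, requested):
    """Largest value <= requested that is still less than n_samples."""
    if n_samples < 2:
        raise ValueError("need at least 2 samples to extract any signatures")
    return min(requested, n_samples - 1)


if __name__ == "__main__":
    # With 20 samples, a request for 25 signatures is capped below the sample count.
    print(cap_maximum_signatures(20, 25))
```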

I would suggest starting with matrix inputs rather than VCFs. You can run SigProfilerMatrixGenerator to generate the matrices, which may help you identify any issues with your VCFs. You can then use the INDEL matrix created by SigProfilerMatrixGenerator as the input for SigProfilerExtractor.
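
A rough sketch of that two-step workflow is shown here. The output layout used by `matrix_path` (`<vcf_dir>/output/<type>/<project>.<context>.all`) is an assumption based on the SigProfilerMatrixGenerator README and should be verified against the files actually produced; `matrix_path` and `generate_then_extract` are hypothetical helpers.

```python
from pathlib import Path


def matrix_path(vcf_dir, project, context="ID83"):
    """Guess where the generator writes a matrix (assumed layout, verify locally)."""
    family = "".join(ch for ch in context if ch.isalpha())  # "ID83" -> "ID"
    return Path(vcf_dir) / "output" / family / f"{project}.{context}.all"


def generate_then_extract(vcf_dir, project, output_dir, context="ID83"):
    """Step 1: build matrices from VCFs. Step 2: extract from one matrix file."""
    # Imported lazily so matrix_path can be used without the packages installed.
    from SigProfilerMatrixGenerator.scripts import (
        SigProfilerMatrixGeneratorFunc as matGen,
    )
    from SigProfilerExtractor import sigpro as sig

    matGen.SigProfilerMatrixGeneratorFunc(project, "GRCh38", vcf_dir)
    sig.sigProfilerExtractor(
        "matrix",
        output_dir,
        str(matrix_path(vcf_dir, project, context)),
        reference_genome="GRCh38",
        minimum_signatures=1,
        maximum_signatures=3,
    )
```

Running the generator on its own also makes it easier to see which mutation type is slow, since each step can be timed separately.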

@xiaoyaojianghuzai
Author

Hello @mdbarnesUCSD,
I ran the code

sig.sigProfilerExtractor("vcf","/home/sxt/HDD/output/rectum", "/home/sxt/HDD/VCFinput/rectum", "GRCh38",minimum_signatures=1,maximum_signatures=3)

The terminal prints

(base) sxt@C233-Primary-Server:~$ /home/sxt/miniconda3/bin/conda run -p /home/sxt/miniconda3 --no-capture-output python /tmp/pycharm_project_825/rectum.py

************** Reported Current Memory Use: 0.5 GB *****************

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 337.43 seconds.
Starting matrix generation for INDELs...    

The SNV and DINUC step took about 5 minutes, but the INDEL step has been running for 10 days and still has not finished.
Does the matrix generation step use anything that can speed up this process, such as multiprocessing?

mdbarnesUCSD self-assigned this May 10, 2024
@mdbarnesUCSD
Collaborator

Please generate your matrices separately and provide those as inputs to SigProfilerExtractor. The matrix generation step should not take anywhere near 10 days. How many mutations are you working with? Are you running out of memory?
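
To answer the mutation-count question, a rough tally can be made directly from the VCF text. This sketch assumes the standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT) and treats any ALT allele whose length differs from REF as an indel, which is only an approximation (symbolic alleles such as `<DEL>` are counted the same way). `count_variants` is a hypothetical helper.

```python
def count_variants(vcf_lines):
    """Count ALT alleles and indel-like alleles in an iterable of VCF lines."""
    counts = {"total": 0, "indel": 0}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete record
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            counts["total"] += 1
            if len(ref) != len(alt):
                counts["indel"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\n",
        "1\t100\t.\tA\tT\n",
        "1\t200\t.\tAT\tA\n",
    ]
    print(count_variants(sample))
```

For a real file, the same function can be called on an open file handle, for example `count_variants(open(path))`.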

@xiaoyaojianghuzai
Author

There is plenty of memory left. The VCF files total about 5 GB. I am going to try generating the matrices separately.
