Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in performing process_atac #27

Closed
arnavgupta06754 opened this issue Nov 18, 2019 · 11 comments
Closed

Error in performing process_atac #27

arnavgupta06754 opened this issue Nov 18, 2019 · 11 comments

Comments

@arnavgupta06754
Copy link

Hi Margaret,
I am trying to process ATAC-seq data in the form of broadPeak files (generated from macs2) which are in bedGraph format for motif displacement. I have downloaded HOCOMOCO v11 hg38 motifs from your repository. The error I return is:

Traceback (most recent call last):
File "/projects/agupta06@xsede.org/applications/conda/env/bio-env/bin/process_atac", line 10, in
sys.exit(process_atac())
File "/projects/agupta06@xsede.org/applications/conda/env/bio-env/lib/python2.7/site-packages/DAStk/init.py", line 7, in process_atac
p.main()
File "/projects/agupta06@xsede.org/applications/conda/env/bio-env/lib/python2.7/site-packages/DAStk/process_atac.py", line 253, in main
[md_score, small_window, large_window, motif_site_count, heat] = get_md_score(filename, int(args.mp_threads), args.atac_peaks_filename, CHROMOSOMES, args.radius)
File "/projects/agupta06@xsede.org/applications/conda/env/bio-env/lib/python2.7/site-packages/DAStk/process_atac.py", line 133, in get_md_score
CHROMOSOMES)
File "/projects/agupta06@xsede.org/applications/conda/env/bio-env/lib/python2.7/multiprocessing/pool.py", line 253, in map
return self.map_async(func, iterable, chunksize).get()
File "/projects/agupta06@xsede.org/applications/conda/env/bio-env/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
pandas.errors.ParserError: Too many columns specified: expected 3 and found 1

My broadPeak files have 9 columns and appear similar to previous broadPeak files I have successfully analyzed with process_ATAC. Can you please advise as to where my error is?
Thanks

@ignaciot
Copy link
Collaborator

Hi there, sorry about the delay. It's hard to tell without looking at the actual file, but it sounds like the peaks file you are using may not be tab-separated (DAStk expects a tab as the delimiter). Could you check if that is the case?

@arnavgupta06754
Copy link
Author

I reformatted my .broadPeak file using the awk command to insert "\t" tab delimiters between columns. The error i know receive is:
File "/projects/agupta06@xsede.org/applications/conda/env/Arnav_env/lib/python2.7/site-packages/pybedtools/helpers.py", line 749, in get_chromsizes_from_ucsc
raise OSError(failures)
OSError: ["Can't find path to fetchChromsizes"]

I am either not making it to the previous error or have solved that problem.

@ignaciot
Copy link
Collaborator

Please take a closer look at the Readme document. If you have an internet connection, it should be reaching out to UCSC to get the chromosome sizes for the reference genome that you are using. Alternatively, you can also provide your own file with the chromosome sizes (we provide instructions on how to create it on the Readme, too).

@arnavgupta06754
Copy link
Author

I have created the genome.chrom.sizes file as mentioned in the readme and included the --chromosomes argument in the process_atac call but I am still returning the same error. I installed and reinstalled dastk using the pip install/uninstall command - is there an alternative installation that may help me?

@ignaciot
Copy link
Collaborator

ignaciot commented Dec 5, 2019

At this point the easiest way to debug this would be if you could send us the exact commands that you are running, and at least a sample of the peaks file. One last thing to check: Have you checked that all files involved (peaks, motif binding sites, etc) are sorted by the same criteria?

@arnavgupta06754
Copy link
Author

I double checked my sorting and they are all consistent. Here is what I am working with:

  1. Slurm file
    Header is the same as all other headers that I have used in the past
    I use a conda environment under which I have installed dastk, python etc. (I used pip install for dastk)

Other than setting directories, the only command I have is:
$process_atac
$ --threads 2
$ -c ${CHROMSIZES}\ #I have tried with including this and omitting this line
$ --genome ${GENOME}
$ --atac-peaks ${MACS2}/Sm25_ATACseq-Ctrl2_R1_peaks_clean_chr.broadPeak
$ --motif ${MOTIF}
$ --output ${DASTK}

$CHROMSIZES directs to a file generated by:
#samtools faidx /projects/agupta06@xsede.org/genome/grch38/genome.fa
#cut -f1,2 /projects/agupta06@xsede.org/genome/grch38/genome.fa.fai > /projects/agupta06@xsede.org/genome/grch38/genomehg38.chrom.sizes

$GENOME is at:
GENOME=/projects/agupta06@xsede.org/genome/grch38/genome

and $MOTIF was downloaded from your repository

My input file (Sm25_ATACseq-Ctrl2_R1_peaks_clean_chr.broadPeak) was generated from macs2 callPeak and is only altered by intersecting out the blacklist and the addition of "chr" to the beginning so the sample output looks like:
chr1 10090 10206 Sm25_ATACseq-Ctrl2_R1_peak_1 26 . 3.73679 5.15502 2.67571
chr1 13538 14233 Sm25_ATACseq-Ctrl2_R1_peak_2 18 . 3.06343 4.18317 1.83849
chr1 29148 29412 Sm25_ATACseq-Ctrl2_R1_peak_3 22 . 3.17598 4.63219 2.21539
chr1 183990 184536 Sm25_ATACseq-Ctrl2_R1_peak_4 19 . 3.03128 4.31061 1.93728
chr1 199461 199903 Sm25_ATACseq-Ctrl2_R1_peak_5 23 . 3.25293 4.80685 2.36102
chr1 203605 203711 Sm25_ATACseq-Ctrl2_R1_peak_6 25 . 3.45628 5.06671 2.58469
chr1 285305 285475 Sm25_ATACseq-Ctrl2_R1_peak_7 11 . 2.95828 3.40436 1.15748
chr1 586106 586221 Sm25_ATACseq-Ctrl2_R1_peak_8 31 . 4.02926 5.68439 3.11507
chr1 629094 631376 Sm25_ATACseq-Ctrl2_R1_peak_9 119 . 2.24206 15.51227 11.91780
chr1 631823 634913 Sm25_ATACseq-Ctrl2_R1_peak_10 77 . 1.98238 10.85735 7.75619

I am using the rmacc-summit computing cluster and I have allocated sufficient memory, nodes and time for the project (was successful in running dastk previously on other samples). However, when I submit each time now, I return the same error posted above:
Traceback (most recent call last):
File "/projects/agupta06@xsede.org/applications/conda/env/Arnav_env/bin/process_atac", line 8, in
sys.exit(process_atac())
File "/projects/agupta06@xsede.org/applications/conda/env/Arnav_env/lib/python2.7/site-packages/DAStk/init.py", line 7, in process_atac
p.main()
File "/projects/agupta06@xsede.org/applications/conda/env/Arnav_env/lib/python2.7/site-packages/DAStk/process_atac.py", line 208, in main
chr_size_file = pybedtools.chromsizes(args.genome)
File "/projects/agupta06@xsede.org/applications/conda/env/Arnav_env/lib/python2.7/site-packages/pybedtools/helpers.py", line 812, in chromsizes
return get_chromsizes_from_ucsc(genome)
File "/projects/agupta06@xsede.org/applications/conda/env/Arnav_env/lib/python2.7/site-packages/pybedtools/helpers.py", line 749, in get_chromsizes_from_ucsc
raise OSError(failures)
OSError: ["Can't find path to fetchChromsizes"]

I have tried the following so far:
include and omit chromosome sizes generated from my genome.fa file; also used -c and --chromsizes
running python version 2 and 3
updating pybedtools
rewriting my input file using the awk command and setting "\t" tab delimiters between columns
truncating my input file into a standard .bedgraph format (6 columns replacing column 6 with "+")

I am worried that the problem may lie in my conda environment and the versions of python and python libraries I am using currently (have experienced this with other programs such as igvtools where it runs with certain versions but not others)
here is what I have currently installed:
python 2.7.15 h5a48372_1009
pybedtools 0.8.0 py27he860b03_1
samtools 1.9 h10a08f8_12
dastk 0.3.2 pypi_0
among others, I can provide information on other packages as you request (didnt want to list them all on here)

Thanks again for your help.

@ignaciot
Copy link
Collaborator

Sorry about the delay, I was presenting at a conference and missed this notification. Something else I didn't point out: the --genome and --chromosome arguments are mutually exclusive. It looks like this is not clear in the help nor we are enforcing it, so thank you for pointing it out, I will push an update making this explicit. I think that because you are specifying the --genome argument, it's still trying to fetch the chromosome sizes (it requires fetchChromsizes to be installed, which I should mention in the README, too).

The bottom line: try the above without specifying the --genome option, since the only use of that argument is to fetch the proper chromosome sizes for that genome, and please let me know if that fixed it.

@arnavgupta06754
Copy link
Author

Ok, I am back to receiving my original error, seems like the chromosome sizes issue has been fixed (posted at the top of this thread - pandas.errors.ParserError: Too many columns specified: expected 3 and found 1)
I have tab delimited my broadPeak file and converted it to standard bed format:

chr1 10090 10206 Sm25_ATACseq-Ctrl2_R1_peak_1 26 +
chr1 13538 14233 Sm25_ATACseq-Ctrl2_R1_peak_2 18 +
chr1 29148 29412 Sm25_ATACseq-Ctrl2_R1_peak_3 22 +
chr1 183990 184536 Sm25_ATACseq-Ctrl2_R1_peak_4 19 +
chr1 199461 199903 Sm25_ATACseq-Ctrl2_R1_peak_5 23 +
chr1 203605 203711 Sm25_ATACseq-Ctrl2_R1_peak_6 25 +
chr1 285305 285475 Sm25_ATACseq-Ctrl2_R1_peak_7 11 +
chr1 586106 586221 Sm25_ATACseq-Ctrl2_R1_peak_8 31 +
chr1 629094 631376 Sm25_ATACseq-Ctrl2_R1_peak_9 119 +
chr1 631823 634913 Sm25_ATACseq-Ctrl2_R1_peak_10 77 +

@arnavgupta06754
Copy link
Author

I used the awk command
awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t+"}' input.broadPeak > output.bed
to tab delimit and trim my broadPeak file to generate the output file as above

@ignaciot
Copy link
Collaborator

That’s odd, the error really means the input is not tab-delimited. Usually the output from MACS is tab-delimited by default, why did you pass it thru awk? Maybe you need to explicitly specify the output field separator if you do need awk, by adding OFS="\t"...

At this point, the easiest way would be to try with the broadPeak file directly or just email me the broadPeak file to ignacio.tripodi (at) colorado.edu so we can test it because we can't reproduce the error.

@ignaciot
Copy link
Collaborator

Closing this one, after discussing this offline it looks like the problem was a zero-sized bed file in the motif sites directory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants