Code for concatenating CpGoe counts from individual samples #694
Is there a way to do this with alternate …?
You'll have to do one of the following:
OR
After installing Homebrew, you can install the two "brews" like so:
The only catch is that I don't think these will get added to your PATH. Of course, you can add the Homebrew programs to your system PATH.
Actually, I think they will get added to your PATH.
Another alternative, if you wanted to just switch to using …
OR
The former is recommended to avoid confusion with quoting and escaping of various special characters that might be required. Additionally, you could insert a …

EDITED: Fixed code block formatting.
Installing Homebrew worked! I can now run the script as written using …
@kubu4 I got a script running on Mox, but it errored out during the last loop. It was able to join the first twenty or so files super fast, then took increasingly longer to join each subsequent file. It completely failed ~3-4 hours in, after joining the 26th file, with this error message for each subsequent join attempt:

"Broken pipe join --nocheck-order ID_CpG_labelled_all ${file}ID_CpG_labelled"

The script is on Mox here:
The slurm file is here:
The incomplete output is here:

I'm thinking that there is something funny going on with either the filenames or how many files are in these new 2019-05-22-CAP-Array-Analysis directories (see /gscratch/srlab/strigg/data/Cvirg/FROGER_CAP_CpGoe/CAP_CpGoe/2019-05-22-CAP-Array-Analysis). @kubu4, do you think you could link your original analysis for comparison?
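The loop itself didn't survive the formatting here, but the quoted error message suggests its shape. A reconstructed sketch on invented demo data (directory names, seed step, and toy counts are all assumptions):

```shell
# Reconstructed sketch of the failing loop, based on the quoted error
# ("join --nocheck-order ID_CpG_labelled_all ${file}ID_CpG_labelled").
# Demo directories and counts below are invented for illustration.
mkdir -p demo/sample_A demo/sample_B
printf 'CpG_001\t5\nCpG_002\t7\n' > demo/sample_A/ID_CpG_labelled
printf 'CpG_001\t3\nCpG_002\t9\n' > demo/sample_B/ID_CpG_labelled
cd demo

# Seed the cumulative file with the CpG IDs, then repeatedly join on
# field 1 to bolt one count column per sample onto the right-hand side.
cut -f1 sample_A/ID_CpG_labelled > ID_CpG_labelled_all
for file in */; do
  join --nocheck-order ID_CpG_labelled_all "${file}ID_CpG_labelled" > tmp &&
    mv tmp ID_CpG_labelled_all
done
cat ID_CpG_labelled_all
```

Each iteration re-reads and rewrites the whole cumulative file, which is consistent with per-iteration time growing as columns accumulate.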
I'll glance at your Mox script later tonight (or early tomorrow) to see if something jumps out at me.
I don't have permission to view your data files (just your SBATCH script).
I tried to change the permissions (sorry about that). Let me know if you're able to view them.
Still can't view any of the contents in that folder. Maybe add execute permissions for all? I looked at other user folders in …
Something like this command should do that:
I'd probably do this directory, too:
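The two command blocks suggested here were lost in formatting. A hedged reconstruction of the kind of permissions fix being discussed (flags and demo paths are assumptions, not the original commands):

```shell
# Hedged reconstruction of the suggested permissions fix; the original
# command blocks did not survive, so paths/flags here are assumptions.
mkdir -p demo_perms/sub
printf 'data\n' > demo_perms/sub/ID_CpG
chmod -R go-rwx demo_perms   # simulate the locked-down starting state

# a+rX grants read to everyone, and execute only on directories (and
# files already executable), so other users can traverse and list
# without making every data file executable.
chmod -R a+rX demo_perms
ls -ld demo_perms demo_perms/sub/ID_CpG
```

The capital X is the useful part: it avoids the common mistake of `a+rx`, which would mark plain data files executable.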
This might be an issue (I've added the line number from your script at the beginning of each line):
Line number 36 shouldn't be writing to an output file. Here's my section of that code:
I tried to fix the permissions again; hopefully everything is executable now. I will try to remove the output file from the header part and see if that helps. Let me know if you see anything else. Thanks!
Permission changes worked. I'll check it out.
I can't really see anything that jumps out at me. Re-run with the change posted above and see how it goes.
Actually, that "fix" I mentioned above probably won't have any impact. I checked it on ShellCheck and it doesn't seem to have any issues with that line (however, I don't understand what it's supposed to be doing; I haven't tested to see what value is stored in …).

When you have a sec, can you please paste your code for how you prepped the FastA files? (I couldn't find it in your notebook.)

Also, I did notice that your input FastA files have each entry on a single line. The FastAs that were provided for the analysis I did were not like that; the sequences wrapped after 60 characters. Possibly an issue? If it is an issue, I'd guess that things get jacked up when using …
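The wrapped-vs-single-line difference can be neutralized up front. This awk idiom (an assumption on my part, not taken from either script in the thread) linearizes a wrapped FastA so both formats look identical downstream:

```shell
# Linearize a line-wrapped FastA so each record is exactly two lines
# (header, then full sequence). This idiom is illustrative, not from
# the original scripts; wrapped.fa is invented demo data.
printf '>seq1\nACGTAC\nGTACGT\n>seq2\nTTTTGG\n' > wrapped.fa
awk '/^>/  { if (seq != "") print seq; print; seq = "" }
     !/^>/ { seq = seq $0 }
     END   { if (seq != "") print seq }' wrapped.fa > linear.fa
cat linear.fa
```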
That code can be found here: https://gannet.fish.washington.edu/spartina/2019-05-21-FROGER/CAP_CpGoe/2019-05-22-CAP-Array-Script.sh |
Code fixed and QC'd. See post https://shellytrigg.github.io/100th-post/. @yaaminiv this code can be run on other files (CDS, exons, windows, etc.) and doesn't take as long as it did before.
Ooooh! I'm curious to see how you got it to run! One thing that jumped out at me:
Not a big deal, but I run all this stuff on Linux (which is what Mox runs on).
Oh, so then it's probably not a discrepancy between OSs, because I also used a Linux environment on Mox.

Does it make sense that the original join loop would fail because of a difference in analysis directory names? That's really the only difference I saw between the data. The original code led to redundant sample names ("HC_VA") because these had two underscores in their directory names (Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_VA_1_CAP_analysis). Maybe join couldn't handle that? Interestingly, the failure happened when attempting to join the third "HC" directory (Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_4_CAP_analysis), which I think would have been before even getting to the redundant sample names.

It's also still curious why the join loop takes increasingly longer on each iteration after the first 20 samples. Maybe join was doing some sorting after all, and that led to the lag and the failure with redundant names? The updated script, using paste and awk in a one-liner, processes faster and avoids whatever was happening during that join loop.
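The real one-liner is in the linked post; as a rough sketch of the paste-plus-awk approach on toy data (directory names, file names, and counts below are invented):

```shell
# Sketch of the paste/awk replacement for the join loop. All names and
# numbers here are invented demo data, not the actual analysis files.
mkdir -p cap/HC_1_CAP_analysis cap/HC_2_CAP_analysis
printf 'CpG_001\t5\nCpG_002\t7\n' > cap/HC_1_CAP_analysis/ID_CpG
printf 'CpG_001\t3\nCpG_002\t9\n' > cap/HC_2_CAP_analysis/ID_CpG
cd cap

# paste glues every ID_CpG side by side in a single pass (no repeated
# re-reads of a growing file), then awk keeps the first ID column plus
# each sample's count column, dropping the duplicated ID columns.
paste */ID_CpG |
  awk '{ printf "%s", $1
         for (i = 2; i <= NF; i += 2) printf "\t%s", $i
         print "" }' > all_CpG
cat all_CpG
```

Unlike the join loop, this reads each input file exactly once, which fits the observation that it "doesn't take as long as it did before."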
I haven't had a chance to look it over, but:
Additionally, your changes make the code compliant across Mac/Linux, which is very nice!
EDITED by kubu4: IGNORE THIS POST. SEE MY POST BELOW THIS ONE.

Well, this thread has been enlightening! Although I still haven't figured out why @shellytrigg got the broken pipe error (I did not encounter this when I used my script on the original sample sets), it did highlight the fact that my script also results in the redundant sample names (due to the "problem" files)! This doesn't affect the data output, but it would definitely impact downstream analysis that relies on column headers to process data! Doh!

However, @shellytrigg's updated script also produces the same duplicate sample names. As an example, we'll use one of the "problem" files. Snippet from my code:
@shellytrigg's updated snippet:
With that being the case, @shellytrigg, how does your script deal with this? I see you have the section to remove duplicate columns, but you shouldn't have duplicate columns, since all the sample names should be unique. Are you missing data for the …?

I'm going to look into tweaks for these scripts to handle the …
@shellytrigg Nevermind!!! The example "snippets" I posted above are no good. Yours works great with the proper input:
Updating my script and re-running my analysis from whenever I did this originally. |
@kubu4 The script is run from the -Array-Analysis directory, which contains all sample directories (example: Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_4_GENE_analysis). The script uses the sample directory names to create sample names: … produces HC_VA_4_CAP_analysis, and `rev | cut -f3- | rev` produces HC_VA_4.

Because the script uses the analysis directory names in the -Array-Analysis directory (not the FastA files in the previous directory), it does not produce duplicate sample names.
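The derivation described above can be checked on a toy example. Only the `rev | cut -f3- | rev` portion is quoted from the script; the prefix-stripping step here is an assumed stand-in:

```shell
# Worked example of the sample-name derivation. The ${dir##*.} step is
# an assumption standing in for however the script strips the dotted
# prefix; the rev|cut|rev step is as quoted in the thread.
dir="Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_VA_4_CAP_analysis"

# Drop everything through the last dot, leaving HC_VA_4_CAP_analysis.
sample="${dir##*.}"

# Reverse, keep from the 3rd underscore-field onward, reverse back:
# this strips the trailing _CAP_analysis regardless of how many
# underscores the sample name itself contains (HC_4 and HC_VA_4 alike).
name=$(echo "${sample}" | rev | cut -d '_' -f3- | rev)
echo "${name}"   # HC_VA_4
```

Counting fields from the reversed end is what makes two-underscore names like HC_VA safe, since the two discarded fields are always "analysis" and "CAP".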
Yep, see my post above yours. |
@kubu4 At FROGER, I'm working on replicating some of your code with CpGoe identification and counting on chromatin-associated protein sample files. I was able to work through the first 2 parts of this Markdown file. I'm now trying to append sample-specific headers to each ID_CpG file and join all ID_CpG files.
I'm working in this spartina directory with this specific script. I ran the script and encountered these errors:
- `sed` error: there is no -i argument, so I'm not sure how to interpret this error.
- `join` error: the only join option used is `--nocheck-order`, which is a (valid) general argument.

Where are the errors I'm missing?