Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code for concatenating CpGoe counts from individual samples #694

Closed
yaaminiv opened this issue May 22, 2019 · 31 comments
Closed

Code for concatenating CpGoe counts from individual samples #694

yaaminiv opened this issue May 22, 2019 · 31 comments
Labels

Comments

@yaaminiv
Copy link
Contributor

@kubu4 At FROGER, I'm working on replicating some of your code with CpGoe identification and counting on chromatin-associated protein sample files. I was able to work through the first 2 parts of this Markdown file. I'm now trying to append sample-specific headers to each ID_CpG file and join all ID_CpG files.

I'm working in this spartina directory with this specific script. I ran the script and encountered these errors:

  1. sed error. There is no -i argument so I'm not sure how to interpret this error...
sed: 1: "1iID\tUMFS_6": command i expects \ followed by text
  1. join error. The only join option used is `--nocheck-order, which is a (valid) general argument.
join: illegal option -- -
usage: join [-a fileno | -v fileno ] [-e string] [-1 field] [-2 field]
            [-o list] [-t char] file1 file2

Where are the errors I'm missing?

@kubu4
Copy link
Contributor

kubu4 commented May 22, 2019

  1. I'd guess your problem is related to quoting and the presence of the backslash. I think bash is interpreting this backslash as a line continuation instead of a tab character (that's what you're trying to get, right?).Try this:
  • sed '1iID\t${gene}' ${file}ID_CpG > ${file}ID_CpG_labelled
  1. Are you running this on a Mac? If yes, then that's the issue. That option doesn't exist on Macs.

@yaaminiv
Copy link
Contributor Author

  1. Yes, trying to get a tab character. Even though I updated my script with your suggestion, I'm still getting the same error:

sed: 1: "1iID\t${gene}": command i expects \ followed by text

  1. Yes, I am running this on a Mac! Is there a separate argument I should use instead/do I need to use an additional argument?

@kubu4
Copy link
Contributor

kubu4 commented May 22, 2019

  1. The error code you pasted still shows double quotes...

  2. Don't think so. Here's GitHub issue where Kaitlyn did this in R: Merge CpG O/E datasets #596

@yaaminiv
Copy link
Contributor Author

  1. I know!! BUT my script does have single quotes:

Screen Shot 2019-05-22 at 3 04 09 PM

  1. Kaitlyn's R script would be used instead of the bash script?

@kubu4
Copy link
Contributor

kubu4 commented May 22, 2019

  1. Turns out, that sed command won't run on Macs, either.

  2. Yes.

@yaaminiv
Copy link
Contributor Author

Is there a way to do this with alternate sed or join commands in bash? The rest of the pipeline is in bash so it would be annoying (although not impossible) to transition between languages.

@kubu4
Copy link
Contributor

kubu4 commented May 22, 2019

You'll have to do one of the following:

  • use a Linux computer (e.g. Emu, Roadrunner, Mox)

OR

  • Install (and use) different version of sed and join on your Mac. If this is the route you choose, I'd install using Homebrew.

After installing Homebrew, you can install the two "brews" like so:

brew install coreutils gnu-sed

The only catch is that I don't think these will get added to your $PATH, so you'll have to pay attention to where they're installed and refer to them specifically in order to utilize those programs instead of what's installed on your Mac by default.

Of course, you can add the Homebrew programs to your system $PATH if you'd like and then you can call the programs without the need for the full path.

@kubu4
Copy link
Contributor

kubu4 commented May 22, 2019

Actually, I think they will get added to your $PATH automatically! Those commands will just be prefixed with a "g" (e.g. gsed, gjoin).

@kubu4
Copy link
Contributor

kubu4 commented May 22, 2019

Another alternative, if you wanted to just switch to using R is to use the system() function to make calls to bash. E.g.:

bash_command <- 'cp ${array[0]}ID_CpG_labelled ID_CpG_labelled_all'
system(bash_command)

OR

system(cp ${array[0]}ID_CpG_labelled ID_CpG_labelled_all)

The former is recommended to avoid confusion with quoting and escaping of various special characters that might be required. Additionally, you could inset a cat(bash_command) before the system(bash_command) call to verify that your bash command is properly formatted before executing.

EDITED: Fixed code block formatting.

@yaaminiv
Copy link
Contributor Author

Installing Homebrew worked! Can now run the script as written using gsed and gjoin. Thanks!!

@shellywanamaker
Copy link
Contributor

@kubu4 I got a script running on mox but it errored out on the last loop. It was able to join the first twenty or so files super fast and then took increasingly longer time to join each subsequent file. It completely failed ~3-4 hours in after joining the 26th file with this error message for each subsequent join attempt:

"Broken pipe join --nocheck-order ID_CpG_labelled_all ${file}ID_CpG_labelled "

The script is on mox here:
/gscratch/srlab/strigg/jobs/20190523_CAP_File_Naming.sh

The slurm file is here:
/gscratch/srlab/strigg/data/Cvirg/FROGER_CAP_CpGoe/CAP_CpGoe/slurm-863764.out

The incomplete output is here:
/gscratch/srlab/strigg/data/Cvirg/FROGER_CAP_CpGoe/CAP_CpGoe/2019-05-22-CAP-Array-Analysis/ID_CpG_labelled_all

I'm thinking that there is something funny going on with either the filenames or how many files are in these new 2019-05-22-CAP-Array-Analysis directories (see /gscratch/srlab/strigg/data/Cvirg/FROGER_CAP_CpGoe/CAP_CpGoe/2019-05-22-CAP-Array-Analysis).

@kubu4 do you think you could link your original analysis for comparison?

@kubu4
Copy link
Contributor

kubu4 commented May 29, 2019

https://robertslab.github.io/sams-notebook/2019/02/26/Data-Wrangling-CpG-OE-Calculations-on-C.virginica-Genes.html

I'll glance at your Mox script later tonight (or, early tomorrow) to see if something jumps out at me.

@kubu4
Copy link
Contributor

kubu4 commented May 29, 2019

I don't have permission to view your data files (just your SBATCH script).

@shellywanamaker
Copy link
Contributor

I tried to change the permissions (sorry about that). Let me know if you're able to view

@kubu4
Copy link
Contributor

kubu4 commented May 29, 2019

Still can't view any of the contents in that folder. Maybe add execute permissions for all? I looked at other user folders in /gscratch/srlab and I'm able to navigate in all of them (except yours). The difference is that they all have execute permissions for all, whereas yours do not.

@kubu4
Copy link
Contributor

kubu4 commented May 29, 2019

Something like this command should do that:

chmod a+x --recursive /gscratch/srlab/strigg/data

I'd probably do this directory, too:

chmod a+x --recursive /gscratch/srlab/strigg/analyses

@kubu4
Copy link
Contributor

kubu4 commented May 29, 2019

This might be an issue (I've added the line number from your script at the beginning of each line):

32	## Create column headers for ID_CpG files using sample name from directory name.
33	
34	for file in ${array[@]}
35	do
36	  gene=$(echo ${file} | awk -F\[._] '{print $6"_"$7}') > ${file}ID_header
37	  sed "1iID\t${gene}" ${file}ID_CpG > ${file}ID_CpG_labelled
38	done

Line number 36 shouldn't be writing to an output file.


Here's my section of that code:

# Create column headers for ID_CpG files using sample name from directory name.
for file in ${array[@]}
do
  gene=$(echo ${file} | awk -F\[._] '{print $6"_"$7}')
  sed "1iID\t${gene}" ${file}ID_CpG > ${file}ID_CpG_labelled
done

@shellywanamaker
Copy link
Contributor

I tried to fix the permissions again, hopefully everything is executable now. I will try to remove the output file from the header part and see if that helps. Let me know if you see anything else. Thanks!

@kubu4
Copy link
Contributor

kubu4 commented May 29, 2019

Permission changes worked. I'll check it out.

@kubu4
Copy link
Contributor

kubu4 commented May 29, 2019

I can't really see anything that jumps out at me. Re-run with that change posted above and see how it goes.

@kubu4
Copy link
Contributor

kubu4 commented May 30, 2019

Actually, that "fix" I mentioned above probably won't have any impact. I checked it on ShellCheck and it doesn't seem to have any issues with that (however, I don't understand what it's supposed to be doing; haven't tested to see what value is stored in gene).

When you have a sec, can you please paste your code on how you prepped the FastA files (I couldn't find it in your notebook)?

Also, I did notice that your input FastA files have each entry on a single line. The FastAs that were provided for the analysis I did were not like that; the sequences wrapped after 60 characters. Possibly an issue?

If it is an issue, I'd guess that things get jacked up when using seqkit to convert the FastA to tab-delimited (I'm not sure what it does). Probably a good idea to run seqkit on one of your FastAs and one of Steven's FastAs that I used for the analysis and see how the outputs compare.

@shellywanamaker
Copy link
Contributor

@yaaminiv did that part. @yaaminiv can you comment on this?

@yaaminiv
Copy link
Contributor Author

When you have a sec, can you please paste your code on how you prepped the FastA files (I couldn't find it in your notebook)?

That code can be found here: https://gannet.fish.washington.edu/spartina/2019-05-21-FROGER/CAP_CpGoe/2019-05-22-CAP-Array-Script.sh

@shellywanamaker
Copy link
Contributor

code fixed and QC'd. See post https://shellytrigg.github.io/100th-post/. @yaaminiv this code can be run on other files (CDS, exons, windows,etc.) and doesn't take as long as it did before.

@kubu4
Copy link
Contributor

kubu4 commented Jun 3, 2019

Ooooh! I'm curious to see how you got it to run!

One thing that jumped out at me:

as they do on a PC (how Sam ran it)

Not a big deal, but I run all this stuff on Linux (which is what Mox runs on).

@shellywanamaker
Copy link
Contributor

Oh, so then it's probably not because of a discrepancy between OSs because I also used linux environment on Mox. Does it make sense that original join loop would fail because of a difference in analysis directory names? That's really the only difference I saw between the data.

The original code led to redundant sample names "HC_VA" because these had two underscores in their directory names (Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_VA_1_CAP_analysis). Maybe join couldn't handle that?

Interestingly, the failure happened when attempting to join the third "HC" directory (Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_4_CAP_analysis), which I think would have been before even getting to the redundant sample names.

It's also still curious why the join loop takes increasingly longer in each iteration after the first 20 samples. Maybe join was doing some sorting after all and that led to the lag and the failure with redundant names?

The updated script using paste and awk in a one-liner processes faster and avoids whatever was happening during that join loop

@kubu4
Copy link
Contributor

kubu4 commented Jun 4, 2019

I haven't had a chance to look it over, but:

The updated script using paste and awk in a one-liner processes faster and avoids whatever was happening during that join loop

Additionally, your changes make the code compliant across Mac/Linux, which is very nice!

@kubu4
Copy link
Contributor

kubu4 commented Jun 4, 2019

EDITED by kubu4: IGNORE THIS POST. SEE MY POST BELOW THIS ONE.

Well, this thread has been enlightening! Although I still haven't figured out why @shellytrigg got the broken pipe error (I did not encounter this when I used my script on the original sample sets), it did highlight the fact that my script also results in the redundant sample names (due to the "problem" files)! This doesn't affect the data output, but would definitely impact downstream analysis that relied on column headers to process data! Doh!

However, @shellytrigg's updated script also produces the same duplicate sample names.

As an example, we'll use one of the "problem" files.

Snippet from my code:

$ echo Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_VA_3_GENE.fa \
| awk -F\[._] '{print $6"_"$7}'

HC_VA

@shellytrigg's updated snippet:

$ echo Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_VA_3_GENE.fa \
| awk -F[.] '{print $6}' \
| rev \
| cut -d "_" -f3- \
| rev

HC_VA

With that being the case, @shellytrigg how does your script deal with this? I see you have the section to remove duplicate columns, but you shouldn't have duplicate columns since all the sample names should be unique. Are you missing data for the HC_VA_# samples after your script runs?

I'm going to look into tweaks for these scripts to handle the HC_VA_# sample names and update my notebook/data. I'll post here with any changes.

@kubu4 kubu4 reopened this Jun 4, 2019
@kubu4
Copy link
Contributor

kubu4 commented Jun 4, 2019

@shellytrigg Nevermind!!! The example "snippets" I posted above are no good. Yours works great with the proper input:

Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_VA_3_GENE_analysis

Updating my script and re-running my analysis from whenever I did this originally.

@kubu4 kubu4 closed this as completed Jun 4, 2019
@shellywanamaker
Copy link
Contributor

@kubu4 the script is run from the -Array-Analysis directory which contains all sample diectories (example: Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_4_GENE_analysis). The script uses the sample directory names to create sample names.
Example:
echo Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_VA_4_GENE_analysis | awk -F[.] '{print $6}' |\

produces HC_VA_4_CAP_analysis

rev| cut -f3- | rev

produces HC_VA_4

Because the script uses the analysis directory names in the Array-Analysis directory (not the fasta files in the previous directory), it does not produce duplicate sample names

@kubu4
Copy link
Contributor

kubu4 commented Jun 4, 2019

Yep, see my post above yours.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants