Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduced_genome.sh error on #get only sequences from FASTA file #31

Open
trotos opened this issue Apr 17, 2017 · 5 comments
Open

reduced_genome.sh error on #get only sequences from FASTA file #31

trotos opened this issue Apr 17, 2017 · 5 comments

Comments

@trotos
Copy link

trotos commented Apr 17, 2017

Hi,
there seem to be 2 issues regarding the following step:

#get only sequences from FASTA file
grep -v '^>' ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa | sort | uniq -i -u | grep -xF -f - -B 1 ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa | grep -v '^--' > ${genome}_${enzyme}_flanking_sequences_${fl}_unique.fa

1st:
while the removing of the ">" lines, sorting and getting the unique files works and produces something like the following: (grep -v '^>' ${genome}_${enzyme}flanking_sequences${fl}_unique_2.fa | sort | uniq -i -u)

ggccaaaacaaaggggttacagggcacgtgccctgtaacagaaagcagtc
ggccaaaacaaaggggttataggacccctgcaagtctgaaatccagaagg
ggccaaaacaaagtggctacaggccccatgtgagtccaaaatccagtagg
ggccaaaacaaatgggctattggccccatgtaagtctgaaacccaacagg
ggccaaaacaaattctttacttatgctaattttgccttagacttttattt
GGCCAAAACAAGGAAACACTGAGAAGAGACGCCTGTTTGCCAGGTGCTTG
ggccaaaacaagggggctacaggcccatataagtcagcaatccagcaggg
GGCCAAAACAAGTTCTTTTTCCATTATTGAATACTTTTTGAGCTATAAGT
GGCCAAAACAATTCTTAAATAATTACTGCATTTTTTTTCAAAATGTAAGA
ggccaaaacacctaacccctctgttctctctgcctcagttgactcatgaa
GGCCAAAACAGAAGAGTCTGGCTGTCCCATCTCTGTATCTTTCAGGATCC

the following command ( grep -xF -f - -B 1 ${genome}_${enzyme}flanking_sequences${fl}_unique_2.fa ) produces a zero bytes file.
Could you please see if there is an error?

2nd:
why you remove the duplicates? this is a PCR based technique and it is expected for the file to be full of PCR duplicates.

Thank you for your answer in advance

@rr1859
Copy link
Owner

rr1859 commented Apr 28, 2017

Hi,

1)You are missing an underscore here - grep -xF -f - -B 1 ${genome}${enzyme}flanking_sequences${fl}unique_2.fa. It should be : grep -xF -f - -B 1 ${genome}${enzyme}flanking_sequences${fl}_unique_2.fa. Please try that and let me now.

  1. the duplicates that are being removed here are the fragments that we are mapping to (short fragments adjacent to the RE sites. If there are duplicate sequences in the genome we cannot properly assign our interactions to these regions.

@mindfaraway
Copy link

mindfaraway commented May 3, 2017

I got the same problem when running reduced_genome.sh.
I only changed line 13 to fl=100, 15 to enzyme=dpnii. After the program #get only sequences from FASTA file, it created a 0 byte file mm10_dpnii_flanking_sequences_100_unique.fa. Then, creating a 0 byte file mm10_dpnii_flanking_sites_100_unique.bed. I ran the script with MacOS Sierra 10.12.3 by Terminal command-line and I have no idea why the error occurs.

Any suggestion?

@rr1859
Copy link
Owner

rr1859 commented May 4, 2017

Do you have oligoMatch and bedtools installed?

@mindfaraway
Copy link

mindfaraway commented Aug 8, 2017

Hi,
I am now using another MAC machine to run this program.
It return an Error message: "grep: -: No such file or directory"
and create a 0 byte file "mm10_dpnii_flanking_sequences_50_unique.fa"
The error might be caused by the dash symbol in the command "grep -xF -f - -B 1 ${genome}_${enzyme}flanking_sequences${fl}_unique_2.fa" in line 32.
I guess there may have some incompatible command format between linux and MAC terminal when processing the grep command.
How can I solve this problem?

@mindfaraway
Copy link

mindfaraway commented Aug 9, 2017

Okey...
I replaced line 32 with the code from the issues "reduced_genome.sh error - grep: memory exhausted"
Here's the code I use:
paste - - < ${genome}_${enzyme}flanking_sequences${fl}unique_2.fa | awk '{a[toupper($2)]++; b[$2]=$1}; END {for(n in a) if (a[n] == 1) print b[n]"\n"n}' | sort -k1,1 -k2,2n -k3,3n | awk '{print $1":"$2"-"$3"\n"$4}' > ${genome}${enzyme}flanking_sequences${fl}_unique.fa

Now it generated a final file which is about 143 MB.
Seems to be the right way.....

By the way, I also edit the code at line 40 and 41 due to that the "sed" code in my MAC machine requires a backup file name for creating backup file. I changed it into:

sed -i .bak 's/>//g' ${genome}_${enzyme}flanking_sites${fl}unique.bed
rm ${genome}
${enzyme}flanking_sites${fl}unique.bed.bak
sed -i .bak 's/:|-/\t/g' ${genome}
${enzyme}flanking_sites${fl}unique.bed
rm ${genome}
${enzyme}flanking_sites${fl}_unique.bed.bak

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants