Alu Retrotransposon Matches Seem Artefactual #380

Closed
DarioS opened this issue Aug 14, 2020 · 5 comments

@DarioS

DarioS commented Aug 14, 2020

I am interested in LINE1, Alu and SVA retrotransposons, and I was looking at Alu specifically by selecting records with INSRMRC=SINE/Alu. MELT, a popular tool in the retrotransposon research community, has a references folder containing a FASTA file named ALU.fa, which has

> ALU
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGA
TCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAA
AAATACAAAAAATTAGCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGG
CTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGC
CACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTC

as the reference sequence of Alu elements.

However, GRIDSS is labelling repeats of As or Ts as Alu, although they don't appear in the reference sequence.

chr2	179636891	gridss42f_35810b	T	TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.	1033.21	PASS	AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=2;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=348.43;BASRP=13;BASSR=4;BEALN=chr1:26810050|-|35M|0;BEID=asm8-264360,asm8-264365;BEIDH=-1,-1;BEIDL=0,0;BMQ=60.00;BMQN=60.00;BMQX=60.00;BQ=1033.21;BSC=4;BSCQ=75.15;BUM=29;BUMQ=609.62;BVF=31;CAS=0;CASQ=0.00;CQ=1033.21;EVENT=gridss42f_35810;IC=0;INSRMP=0.800;INSRMRC=SINE/Alu;INSRMRO=-;INSRMRT=AluSx4;IQ=0.00;LOCAL_LINKED_BY=.;PURPLE_AF=0.253;PURPLE_CN=3.78;PURPLE_CN_CHANGE=0.957;PURPLE_JCN=0.957;RAS=0;RASQ=0.00;REF=123;REFG=GTTTTTTTCTTATTGCATTGG;REFPAIR=147;REMOTE_LINKED_BY=.;RP=0;RPQ=0.00;SB=0.5;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;TAF=0.170;VF=0	GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF	.:0.00:0:0:0:0.00:0:0.00:0.00:0:0:0.00:0:0.00:0:0.00:0:0.00:0:0.00:0.00:0.00:56:63:0:0.00:0:0.00:0	.:0.00:0:0:0:0.00:0:0.00:348.43:13:4:1033.21:4:75.15:29:609.62:31:0.00:0:0.00:0.00:0.00:67:84:0:0.00:0:0.00:0
chr3	76896610	gridss56b_29622b	A	.AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	523.93	PASS	AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEALN=chr6:96655299|-|37M|0;BEID=asm11-549910;BEIDH=-1;BEIDL=427;BMQ=59.22;BMQN=39.00;BMQX=60.00;BQ=547.09;BSC=2;BSCQ=40.43;BUM=24;BUMQ=506.66;BVF=24;CAS=0;CASQ=0.00;CQ=547.09;EVENT=gridss56b_29622;IC=0;INSRMP=1.00;INSRMRC=SINE/Alu;INSRMRO=+;INSRMRT=AluY;IQ=0.00;LOCAL_LINKED_BY=dsb71;PURPLE_AF=0.280;PURPLE_CN=2.94;PURPLE_CN_CHANGE=0.939;PURPLE_JCN=0.823;RAS=0;RASQ=0.00;REF=121;REFG=CGGTTCAGTTAAAAAAGTTAG;REFPAIR=106;REMOTE_LINKED_BY=.;RP=0;RPQ=0.00;SB=0.0;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;TAF=0.172;VF=0	GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF	.:0.00:0:0:0:0.00:0:0.00:0.00:0:0:23.16:0:0.00:1:23.16:1:0.00:0:0.00:0.00:0.00:64:52:0:0.00:0:0.00:0	.:0.00:0:0:0:0.00:0:0.00:0.00:0:0:523.93:2:40.43:23:483.49:23:0.00:0:0.00:0.00:0.00:57:54:0:0.00:0:0.00:0
chr6	86583822	gridss114b_35984b	A	.AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	1400.08	PASS	AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=2;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=333.07;BASRP=9;BASSR=8;BEALN=chr19:11998315|-|36M|0;BEID=asm22-865165,asm22-865185;BEIDH=-1,-1;BEIDL=36,282;BMQ=59.59;BMQN=45.00;BMQX=60.00;BQ=1400.08;BSC=9;BSCQ=163.09;BUM=43;BUMQ=903.92;BVF=48;CAS=0;CASQ=0.00;CQ=1380.87;EVENT=gridss114b_35984;IC=0;INSRMP=1.00;INSRMRC=SINE/Alu;INSRMRO=+;INSRMRT=AluSp;IQ=0.00;LOCAL_LINKED_BY=dsb149;PURPLE_AF=0.408;PURPLE_CN=4.70;PURPLE_CN_CHANGE=1.83;PURPLE_JCN=1.92;RAS=0;RASQ=0.00;REF=98;REFG=TGATTAGCTTAATAATTAACC;REFPAIR=119;REMOTE_LINKED_BY=.;RP=0;RPQ=0.00;SB=0.5294118;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;TAF=0.293;VF=0	GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF	.:0.00:0:0:0:0.00:0:0.00:0.00:0:0:0.00:0:0.00:0:0.00:0:0.00:0:0.00:0.00:0.00:45:56:0:0.00:0:0.00:0	.:0.00:0:0:0:0.00:0:0.00:333.07:9:8:1400.08:9:163.09:43:903.92:48:0.00:0:0.00:0.00:0.00:53:63:0:0.00:0:0.00:0
chr13	96738568	gridss217f_27736b	T	TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.	118.95	PASS	AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=59.48;BASRP=0;BASSR=3;BEALN=chr7:145453112|+|41M|0;BEID=asm43-255562;BEIDH=-1;BEIDL=0;BMQ=60.00;BMQN=60.00;BMQX=60.00;BQ=118.95;BSC=3;BSCQ=59.48;BUM=0;BUMQ=0.00;BVF=3;CAS=0;CASQ=0.00;CQ=331.31;EVENT=gridss217f_27736;IC=0;INSRMP=1.00;INSRMRC=SINE/Alu;INSRMRO=-;INSRMRT=AluJo;IQ=0.00;LOCAL_LINKED_BY=dsb384;PURPLE_AF=0.028;PURPLE_CN=3.25;PURPLE_CN_CHANGE=0.229;PURPLE_JCN=0.092;RAS=0;RASQ=0.00;REF=148;REFG=AATGCCCTATTATTGTTTAAA;REFPAIR=131;REMOTE_LINKED_BY=.;RP=0;RPQ=0.00;SB=0.6666667;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;TAF=0.018;VF=0	GT:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF	.:0.00:0:0:0:0.00:0:0.00:0.00:0:0:0.00:0:0.00:0:0.00:0:0.00:0:0.00:0.00:0.00:59:57:0:0.00:0:0.00:0	.:0.00:0:0:0:0.00:0:0.00:59.48:0:3:118.95:3:59.48:0:0.00:3:0.00:0:0.00:0.00:0.00:89:74:0:0.00:0:0.00:0
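One way to list such records is a small script that flags single breakends whose inserted bases are almost entirely one nucleotide yet carry an Alu annotation. This is only a sketch: the function names and the 0.9 purity threshold are assumptions for illustration, not part of GRIDSS, and it only handles single-breakend ALT alleles written like the ones above (`TTTT.` or `.AAAA`).

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def inserted_sequence(ref, alt):
    """Return the non-reference bases of a single-breakend ALT allele."""
    if alt.endswith("."):
        # 'TTTT.' : anchoring reference base(s) first, then inserted bases
        return alt[:-1][len(ref):]
    if alt.startswith("."):
        # '.AAAA' : inserted bases first, anchoring reference base(s) last
        body = alt[1:]
        return body[:len(body) - len(ref)]
    return None  # not a single breakend


def is_homopolymer(seq, min_fraction=0.9):
    """True if one base makes up at least min_fraction of seq (assumed threshold)."""
    if not seq:
        return False
    top = max(seq.upper().count(base) for base in "ACGT")
    return top / len(seq) >= min_fraction


def flag_record(line):
    """Summarise one VCF data line and mark poly-A/T insertions labelled as Alu."""
    fields = line.rstrip("\n").split("\t")
    record_id, ref, alt = fields[2], fields[3], fields[4]
    info = parse_info(fields[7])
    seq = inserted_sequence(ref, alt)
    suspect = bool(
        info.get("INSRMRC") == "SINE/Alu"
        and seq is not None
        and is_homopolymer(seq)
    )
    return {
        "id": record_id,
        "repeat_class": info.get("INSRMRC"),
        "inserted": seq,
        "suspect": suspect,
    }
```

Applying `flag_record` to each non-header line of the VCF would mark all four records above as suspect.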
@d-cameron
Member

Those are the RepeatMasker annotations for the nominal BEALN locations.

However, GRIDSS is labelling repeats of As or Ts as Alu, although they don't appear in the reference sequence.

They do: in hg38 chr1:26810050-26810085 is a poly-A sequence which RepeatMasker annotates as part of a SINE element.

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A26810000%2D26810099&hgsid=880760495_AaWKkAA6yoOeeG3hF6zPnQAvtbco
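To look up such a location for other records, the BEALN value can be split into its parts. The sketch below assumes the layout seen in the records above (`chrom:start|strand|CIGAR|mapq`); `parse_bealn` is a hypothetical helper name, and the end coordinate is computed as a closed 1-based interval, which is an assumption about the coordinate convention.

```python
import re


def parse_bealn(value):
    """Parse one BEALN entry such as 'chr1:26810050|-|35M|0'.

    The field layout is inferred from the example records, not taken
    from the GRIDSS source.
    """
    chrom_pos, strand, cigar, mapq = value.split("|")
    # rpartition keeps contig names that themselves contain ':' intact
    chrom, _, start_text = chrom_pos.rpartition(":")
    start = int(start_text)
    # Only operations that consume reference bases extend the interval
    ref_len = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in "MDN=X"
    )
    return {
        "chrom": chrom,
        "start": start,
        "end": start + ref_len - 1,
        "strand": strand,
        "cigar": cigar,
        "mapq": int(mapq),
    }
```

The resulting chromosome, start and end can be pasted into a genome browser to see which RepeatMasker track entry overlaps the nominal alignment.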

That said, the above doesn't really help you much, as annotating poly-A sequence as SINE clearly isn't a very useful annotation. I'm aware of the problem and am in the process of adding support for actual RepeatMasker annotations, where the inserted sequence is sent off to RepeatMasker. The RepeatMasker output is a pain to parse, so I've been putting this bit off as I've had other tasks to do as well.

The overall pipeline is VCF -> (gridss.InsertedSequencesToFasta^) -> RepeatMasker -> (TODO annotate VCF)
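The first arrow of that pipeline can be illustrated with a few lines of Python. This is not the `gridss.InsertedSequencesToFasta` implementation, only a rough sketch of the idea under simple assumptions: it handles single-breakend ALT alleles only, uses the VCF ID as the FASTA header, and the function name is made up for this example.

```python
def inserted_sequences_to_fasta(vcf_lines, width=60):
    """Build FASTA text from the inserted bases of single-breakend records.

    Illustrative only: breakpoint (bracketed) ALT alleles are skipped.
    """
    out = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        record_id, ref, alt = fields[2], fields[3], fields[4]
        if alt.endswith("."):
            seq = alt[len(ref):-1]              # drop leading anchor and '.'
        elif alt.startswith("."):
            seq = alt[1:len(alt) - len(ref)]    # drop '.' and trailing anchor
        else:
            continue
        if not seq:
            continue
        out.append(">" + record_id)
        # Wrap the sequence at a fixed line width
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""
```

The returned text can be written to a file and passed to a local RepeatMasker run; the VCF ID in each header makes it possible to join the results back to the records afterwards.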

The other issue I've had has been the interpretation of multiple hits. What would you expect the correct annotation to be for sequences that have multiple (either overlapping or non-overlapping) RepeatMasker matches?

^ New in the 2.10.0 dev branch.
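As one possible answer to the multiple-hits question, the sketch below keeps the highest-scoring hit in each region and drops lower-scoring hits that overlap it, so that a sequence can carry several non-overlapping annotations. This is a suggestion, not what GRIDSS does; the `Hit` fields, the greedy policy and the example values in the usage note are all assumptions.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    start: int          # 1-based inclusive start within the inserted sequence
    end: int            # 1-based inclusive end
    score: int          # Smith-Waterman score reported for the match
    repeat_class: str   # e.g. 'SINE/Alu'
    repeat_type: str    # e.g. 'AluY'


def overlaps(a, b):
    """True if two closed intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def select_hits(hits):
    """Greedily keep the best-scoring hits that do not overlap each other."""
    chosen = []
    for hit in sorted(hits, key=lambda h: (-h.score, h.start)):
        if not any(overlaps(hit, kept) for kept in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)


def covered_fraction(non_overlapping_hits, seq_len):
    """Fraction of the inserted sequence covered by the selected hits."""
    covered = sum(h.end - h.start + 1 for h in non_overlapping_hits)
    return covered / seq_len
```

For instance, given an Alu hit on bases 1 to 30, a lower-scoring simple-repeat hit on bases 20 to 40 and an L1 hit on bases 35 to 60, `select_hits` would keep the Alu and L1 hits and discard the simple repeat. A covered fraction close to 1 for a single class would then be a more convincing annotation than a poly-A tail alone.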

@DarioS
Author

DarioS commented Aug 26, 2020

... the inserted sequence is sent off to RepeatMasker.

Does that mean GRIDSS will require internet connectivity to make web service queries? That will make it harder for people using HPCs, which often don't allow internet connections on the compute nodes. For example:

$ cat download.pbs
#!/bin/bash
#PBS -q normal
#PBS -l ncpus=1
#PBS -l mem=2gb
#PBS -l walltime=00:03:00

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gtf.gz

fails to establish a connection to the EBI FTP server:

$ cat download.pbs.e10468964
--2020-08-26 09:48:34--  ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gtf.gz
           => 'gencode.v35.annotation.gtf.gz'
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:21... failed: Connection timed out.
Retrying.

--2020-08-26 09:50:46--  ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gtf.gz
  (try: 2) => 'gencode.v35.annotation.gtf.gz'
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:21...

Is the functionality available yet in the development branch? How would I compile GRIDSS if I wanted to test a pre-release on an HPC? Or is the next release imminent, so that it would be simplest to just wait for it?

@d-cameron
Member

Is the functionality available yet in development branch?

Just finished it off today.

How would I compile GRIDSS if I wanted to test pre-release on a HPC

git clone --branch dev http://github.com/PapenfussLab/gridss
cd gridss
mvn package -DskipTests

Does that mean GRIDSS will require internet connectivity to make web service queries?

No, but it does require a local RepeatMasker installation. I just grabbed the bioconda version.

To test that your installation is working:

cd example
./annotate_repeatmasker.sh
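On a cluster where software is loaded per job, it can save a failed run to confirm first that the required programs are visible on `PATH`. The check below is a small sketch; the tool list simply mirrors the commands mentioned in this thread (`git`, `mvn`, `RepeatMasker`) and is an assumption about what a given setup needs.

```python
import shutil


def missing_tools(tools=("git", "mvn", "RepeatMasker")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

Running this inside a job script (rather than on the login node) checks the environment that the compute node actually sees.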

d-cameron pushed a commit that referenced this issue Sep 9, 2020
…maskerbed from gridss.sh

Writing SW alignment score and inferred edit distance to INSRM
@d-cameron
Member

scripts/gridss_annotate_vcf_repeatmasker.sh will be included in the next release.

@DarioS
Author

DarioS commented Sep 16, 2020

That is convenient, but the server I am using has neither RepeatMasker nor conda installed. I'll be busy with dependencies.
