SPiP, Splicing Pipeline Prediction


SPiP is a random forest model that runs a cascade of bioinformatics tools. Briefly, SPiP uses the SPiCE tool for the consensus splice sites (donor and acceptor sites), MES for the polypyrimidine tract between -13 and -20, BPP for the branch point area between -18 and -44, a homemade score to detect cryptic/de novo splice site activation, and ΔtESRseq for exonic splicing regulatory elements up to 120 nt into the exon.

SPiP is available for Windows OS at https://sourceforge.net/projects/splicing-prediction-pipeline/

Repository contents


  • SPiPv2.1_main.r: the SPiP script
  • testCrypt.txt: an example of input data in text format
  • testVar.vcf: an example of input data in vcf format
  • RefFiles: folder containing the reference files used by SPiP

Install SPiP


To get SPiP from this repository, enter the following in a Linux console:

git clone https://github.com/raphaelleman/SPiP
cd ./SPiP

SPiP also requires three R packages, which you can install from the R console:

install.packages("foreach")
install.packages("doParallel")
install.packages("randomForest")

Load the transcriptome files

You have to download the RData files containing the transcript sequences from SourceForge:

  • hg19 assembly: transcriptome_hg19.RData
  • hg38 assembly: transcriptome_hg38.RData

Put these files in /path/to/SPiP/RefFiles/, or specify their location manually with the --transcriptome option.

NB: commands to regenerate these files are available in getGenomeSequenceFromBSgenome.r
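
The lookup rule above (default RefFiles folder, otherwise --transcriptome) can be sketched as a small shell helper. This is only an illustration and not part of SPiP: the function name spip_transcriptome_args and the fallback directory argument are hypothetical, and the file names are assumed to be exactly those given above.

```shell
# Hypothetical helper (not part of SPiP): decide whether --transcriptome is needed.
# $1 = SPiP directory, $2 = assembly (hg19 or hg38), $3 = fallback directory
spip_transcriptome_args() {
    spip_dir=$1
    assembly=$2
    fallback_dir=$3
    file="transcriptome_${assembly}.RData"
    if [ -f "${spip_dir}/RefFiles/${file}" ]; then
        # File is in the default location: no extra option required.
        return 0
    elif [ -f "${fallback_dir}/${file}" ]; then
        # File lives elsewhere: print the option with its explicit path.
        printf '%s\n' "--transcriptome ${fallback_dir}/${file}"
        return 0
    else
        echo "Missing ${file}: download it first" >&2
        return 1
    fi
}
```

One possible use is to append the result to the SPiP command line, for example Rscript ./SPiPv2.1_main.r -I ./testCrypt.txt -O ./outputTest.txt $(spip_transcriptome_args . hg19 /data/spip), where /data/spip is a placeholder directory. The substitution is left unquoted so that the option and its path become two arguments, which assumes paths without spaces.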

Run SPiP


You can list the SPiP arguments with Rscript /path/to/SPiPv2.1_main.r --help

An example SPiP run with the test file testCrypt.txt:

cd /path/to/SPiP/
Rscript ./SPiPv2.1_main.r -I ./testCrypt.txt -O ./outputTest.txt

In this example, SPiP generates a text file "outputTest.txt" where the predictions are saved. The layout of this output is:

| Column names | Example | Description |
|---|---|---|
| varID | NM_007294:c.213-6T>G | The variant ID (Transcript:mutation) |
| Interpretation | Alter by SPiCE | The overall prediction |
| InterConfident | 92.9 % +/- 2.1 % | The risk that the variant impacts splicing, estimated from a collection of variants with in vitro RNA studies and frequent variants |
| chr | chr17 | Chromosome number |
| strand | - | Strand of the junction ('+': forward; '-': reverse) |
| varType | substitution | Type of variant |
| ntChange | T>G | Nucleotide variation |
| ExonInfo | Intron 4 (1499) | Number and size of the exon/intron |
| transcript | NM_007294 | Transcript (RefSeq) |
| gene | BRCA1 | Gene symbol (RefSeq) |
| gNomen | 41256979 | Genomic position of the variant |
| seqPhysio | ACGG...AGGA | (A, C, G, T)-sequence before the mutation |
| seqMutated | ACGG...AGGA | (A, C, G, T)-sequence after the mutation |
| NearestSS | acceptor | The natural splice site nearest to the variant |
| distSS | -6 | Distance between the nearest splice site and the mutation |
| RegType | IntronCons | The type of region where the variant is located |
| SPiCEproba | 1 | The SPiCE probability for a variant in a consensus splice site |
| SPiCEinter_2thr | high | The SPiCE class (high/medium/low) |
| deltaMES | 0 | MES variation for a variant in the polypyrimidine tract |
| mutInPBarea | No | Whether the mutation is located in a branch point predicted by the BPP tool |
| deltaESRscore | NA | ESR score variation for an exonic variant |
| posCryptMut | 41256978 | Genomic position of the cryptic splice site after the mutation |
| sstypeCryptMut | Acc | Splice type of the cryptic splice site after the mutation |
| probaCryptMut | 0.000710404942432828 | Score of the cryptic splice site after the mutation |
| classProbaCryptMut | No | Prediction for the cryptic splice site after the mutation (Yes: used, No: not used) |
| nearestSStoCrypt | Acc | Splice type of the nearest natural splice site |
| nearestPosSStoCrypt | 41256973 | Genomic position of the nearest natural splice site |
| nearestDistSStoCrypt | -5 | Distance between the cryptic site and the natural site |
| posCryptWT | 41256970 | Genomic position of the cryptic splice site before the mutation |
| probaCryptWT | 4.89918764104143e-07 | Score of the cryptic splice site before the mutation |
| classProbaCryptWT | No | Prediction for the cryptic splice site before the mutation (Yes: used, No: not used) |
| posSSPhysio | 41256973 | Genomic position of the natural splice site of the same type as the mutated cryptic site |
| probaSSPhysio | 0.00408919066993282 | Score of the natural splice site of the same type as the mutated cryptic site |
| classProbaSSPhysio | Yes | Prediction for the natural splice site of the same type as the mutated cryptic site (Yes: used, No: not used) |
| probaSSPhysioMut | 1.74991364794327e-06 | Score of the natural splice site of the same type as the mutated cryptic site, after the mutation |
| classProbaSSPhysioMut | No | Prediction for the natural splice site of the same type as the mutated cryptic site, after the mutation (Yes: used, No: not used) |
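
With this many columns, it can help to pull out only a few of them by name. The helper below is a minimal sketch rather than a SPiP feature: spip_select is a hypothetical name, and the code assumes that the output file is tab-delimited and that its first line holds the column names listed above.

```shell
# Hypothetical helper: print selected columns of a SPiP text output by header name.
# $1 = output file, $2 = comma-separated column names
spip_select() {
    awk -F '\t' -v OFS='\t' -v wanted="$2" '
        NR == 1 {
            # Map each header name to its column index, then validate the request.
            n = split(wanted, names, ",")
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    print "unknown column: " names[j] > "/dev/stderr"
                    err = 1
                    exit
                }
            }
        }
        {
            # Emit the requested columns, in the requested order (header included).
            line = $(pos[names[1]])
            for (j = 2; j <= n; j++) line = line OFS $(pos[names[j]])
            print line
        }
        END { exit (err ? 1 : 0) }
    ' "$1"
}
```

For instance, spip_select ./outputTest.txt varID,Interpretation,InterConfident would keep only the variant ID, the overall prediction and its confidence.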

SPiP options

-I, --input /path/to/inputFile

  • List of variants file (.txt or .vcf). SPiP supports VCF version 4.1 or later (see example testVar.vcf). The txt file must be tab-delimited, and the column containing the mutation, in Transcript:mutation format, must be named 'varID' (see example testCrypt.txt).
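
As an illustration of the text input format, the sketch below writes a one-column file with the 'varID' header and checks that every entry looks like Transcript:mutation. Both function names are hypothetical, and the check is deliberately loose (it only looks for a colon-separated pair); SPiP itself may accept or reject entries on other grounds.

```shell
# Hypothetical helper: write a one-column SPiP input file.
# $1 = output path; remaining arguments = variants in Transcript:mutation form
write_spip_input() {
    out=$1
    shift
    printf 'varID\n' > "$out"
    for variant in "$@"; do
        printf '%s\n' "$variant" >> "$out"
    done
}

# Hypothetical helper: verify the varID column exists and looks like Transcript:mutation.
# $1 = tab-delimited input file; returns non-zero and lists offending lines on failure
check_spip_input() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "varID") col = i
            if (!col) { print "no varID column"; bad = 1; exit }
            next
        }
        $col !~ /^[^:]+:.+/ { print "line " NR ": " $col; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

A call such as write_spip_input ./myVariants.txt 'NM_007294:c.213-6T>G' followed by check_spip_input ./myVariants.txt prepares a file that can then be passed to -I (myVariants.txt is a placeholder name).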

-O, --output /path/to/outputFile

  • Path and name of the output file (.txt, in text format)

-g, --GenomeAssenbly hg19

  • Genome assembly version (hg19 or hg38) [default= hg19]

-t, --threads N

  • Number of threads used for the calculation [default= 1]

-l, --maxLines N

  • Number of lines read at a time [default= 1000]

--verbose

  • Show the run progress, i.e. display a progress bar

--geneList /path/to/geneList.txt

  • Restrict the analysis to a list of genes; available only with VCF input

--transcriptList /path/to/transcriptList.txt

  • Restrict the analysis to a list of transcripts; available only with VCF input

--transcriptome /path/to/transcriptome_hgXX.RData

  • Path to the transcriptome_hgXX.RData file, if it is not in /path/to/SPiP/RefFiles/

--VCF

  • Get the SPiP output in VCF format (v4.0)
# dynamic line modified in script : paste0("##SPiP output v",version)
# dynamic line modified in script : paste0("##SPiPCommand=",CMD)
## SPiP=altUsed|varID|Interpretation|InterConfident|SPiPscore|strand|gNomen|varType|ntChange|ExonInfo|exonSize|transcript|gene|NearestSS|DistSS|RegType|SPiCEproba|SPiCEinter_2thr|deltaMES|BP|mutInPBarea|deltaESRscore|posCryptMut|sstypeCryptMut|probaCryptMut|classProbaCryptMut|nearestSStoCrypt|nearestPosSStoCrypt|nearestDistSStoCrypt|posCryptWT|probaCryptWT|classProbaCryptWT|posSSPhysio|probaSSPhysio|classProbaSSPhysio|probaSSPhysioMut|classProbaSSPhysioMut
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	15765825	NM_007272:g.15765825:G>A	G	A	.	.	SPiP=A|NTR|00.04 % [00.02 % ; 00.08%]|+|substitution|G>A|Intron 1 (1795)|NM_007272|CTRC|donor|825|DeepIntron|0|Outside SPiCE Interpretation|0|No|NA|15765816|Acc|0.00206159394907144|No|Don|15765000|816|15765816|0.00161527498798199|No|15766795|0.0775463330795674|Yes|0.0775463330795674|Yes
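
To read such a record, the pipe-separated SPiP annotation can be split into one field per line. The helper below is a sketch under assumptions, not a SPiP tool: spip_vcf_fields is a hypothetical name, and the code assumes that SPiP= is the last entry of the INFO column, as in the example above. It does not cut at ';' because the confidence interval in the example itself contains a semicolon. Fields are numbered rather than named; the field names are given in the '## SPiP=' header line.

```shell
# Hypothetical helper: list the pipe-separated SPiP annotation fields of each VCF record.
# $1 = VCF file produced with --VCF; output = ID, field number, field value (tab-separated)
spip_vcf_fields() {
    awk -F '\t' '
        /^#/ { next }                               # skip header and comment lines
        {
            info = $8                               # INFO is the 8th VCF column
            if (!sub(/^.*SPiP=/, "", info)) next    # ignore records without SPiP annotation
            n = split(info, field, "[|]")
            for (i = 1; i <= n; i++) print $3 "\t" i "\t" field[i]
        }
    ' "$1"
}
```

For example, spip_vcf_fields ./output.vcf prints one line per annotation field, where output.vcf is a placeholder for a file written with the --VCF option.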

Authors

Cite as: Leman, R., Parfait, B., Vidaud, D., Girodon, E., Pacot, L., Le Gac, G., Ka, C., Ferec, C., Fichou, Y., Quesnelle, C., Aucouturier, C., Muller, E., Vaur, D., Castera, L., Boulouard, F., Ricou, A., Tubeuf, H., Soukarieh, O., Gaildrat, P., Krieger, S. (2022). SPiP: Splicing Prediction Pipeline, a machine learning tool for massive detection of exonic and intronic variant effects on mRNA splicing. Human Mutation.

License

This project is licensed under the MIT License - see the LICENSE file for details