Tools for designing, quantifying and visualizing MPRNA datasets This suite of tools contains methods for analysis of MPRNA and MPRNA-RIP datasets desribed in:
- Lubelsky & Ulitsky Nature 2018
- Zuckerman et al. Mol Cell 2020
- Lubelsky et al. EMBO J 2021
- Ron et al. Nature Communications 2022
- von Kügelgen et al. Nature Neuroscience 2023
The counting is done using Java code, and downstream analysis using R. You will need to make sure Java is installed.
java -Xmx48000m -cp jar/compbioLib.jar:jar/compbio.jar:jar/picard.jar scripts.lincs.patch.AnalyzeConservedPatches design_patches OUT_FILE BED_FILE GENOME_2BIT_FILE DESCRIPTION_FILE TILE_LEN OFFSET_LEN_COVERED OFFSET_LEN_CONSERVED EXCLUDED_SEQS N_CONTROLS CUSTOM_SEQ_FILES CUSTOM_PRIMERS PATCH_FILE
Parameters:
OUT_FILE
Base name of the output fileBED_FILE
BED format file (can be gzipped) that contains transcript coorindates (useNONE
if not needed)GENOME_FILE
Genome sequence file in 2bit formatDESCRIPTION_FILE
Tab-delimited library description file (see below)TILE_LEN
Length of the designed library tilesOFFSET_LEN_COVERED
Length of offset for regular sequencesOFFSET_LEN_CONSERVED
Length of offset for conserved patches (provided in thePATCH_FILE
)EXCLUDED_SEQS
Comma-separated list of sequences that have to be avoided in tiles (e.g., restriction enzyme cut sites) (useNONE
if not needed)N_CONTROLS
Number of dinucleotide-shuffled controls to design for each sequenceCUSTOM_SEQ_FILES
FASTA file of custom sequences for which tiles will be designed (separated by colons, useNONE
if not needed)CUSTOM_PRIMERS
pairs of primers to use for the custom sequences, separated by:
(separated by colons, useNONE
if not needed)PATCH_FILE
BED file of conserved patches for design
The DESCRIPTION_FILE
is the key file as it contains information about the different subsets of the library.
It contains the following 6 columns:
- Subset type - as described below
- Basic information - e.g., file name, described below
- Subset name - offsets used for tiling this subset
- Offset length - offsets used for tiling this subset
- Prefix - prefix (forward primer) appended to this sequence
- Suffix - prefix (typically reverse complement of the reverse primer) appended to this sequence
The subset types categories are available:
BED
- a name of the BED file is expected in the Basic information; the exonic sequences of each BED element are loaded and tiled with the stops as specified in Offset lengthFile
- a name of the FASTA file is expected in the Basic information, which can be followed with:
and a name of a corresponding BED FILEFileCirc
- same asFile
, but the sequences are circular and so back-splicing junction will also be tiled.FileMutate
- a name of the FASTA file is expected in the Basic information, which can be followed with ":" and a name of a corresponding BED file (which will be used for creating the output BED file of the tile positions). In the Offset length column the expected format is OFFSET_LEN:START:END, and positions between START and END will be systematically mutatedBEDcirc
- same asBED
but for circular sequences (adding tiling also over the back-spliced junction)Fragment
- a specific genomic position (CHR:START-END
format) is expected in Basic information, and it is tiledRefSeq
- a specific transcript id is expected in Basic information, and it is extracted from theBED_FILE
and tiledRefSeqFind
- a specific pairTRANCRIPT_ID:SEQ
is expected in Basic information and a pair of numbersUPSTREAM:TOTAL
is expected in Offset length.TRANSCRIPT_ID
is extracted from theBED_FILE
, thenSEQ
sequence is located in it. ThenUPSTREAM
bases are appended to theSEQ
and additional downstream bases are added to meet the final length ofTOTAL
. These tiles are called Context tiles.RefSeqFindMutate
- a specific pairedTRANCRIPT_ID:SEQ
is expected in Basic information and a pair of numbersUPSTREAM:TOTAL:OLD:NEW
is expected in Offset length.TRANSCRIPT_ID
is extracted from theBED_FILE
, thenSEQ
sequence is located in it. ThenUPSTREAM
bases are appended to theSEQ
and additional bases are added to meet the final length ofTOTAL
. Then, all instances ofOLD
sequence are replaced withNEW
sequence.Repeat
- a FASTA file is expected in Basic information and START-END-TOTAL in Offset length; the part of the sequence between START and END will be repeated in the tilesDelete
- a FASTA file is expected in Basic information and START-END-FROM-TO in Offset length, the part of the sequence starting between START and END and in length between MIN_LEN and MAX_LEN will be deleted (and moved to the end of the title) in the tilesRepeatSpecific
- a kmer sequence is expected in Basic information and total length of the tile in Offset length, the kmer will be repeated to total length.MutateAndRepeatSpecific
- as forRepeatSpecific
but systematic mutations of the kmer will also be producedInsertMer:KMER
- a FASTA file is expected in Basic information and START-END-STEP in Offset length. TheKMER
will be inserted in each position between START and END, with the requested STEPs between themReplaceMer:KMER
- asInsertMer
but replacing instead of inserting theKMER
.Mutate
- a FASTA file is expected inBasicInformation
and START-END-STEP in Offset length. Positions between START and END in each sequence, with the requested STEPs, will be systematically mutatedMutatePairs
- as inMutate
, but Offset length has format START-END-EXCLUDE_START-EXCLUDE_END and all pairs of positions between START and END but exclusing EXCLUDE_START-EXCLUDE_END are mutatedMutateMer:KMER_LEN
- as inMutate
, but instead of point mutations, kmers of lengthKMER_LEN
are systematically mutated (A<->T, G<->C)MutateShuffle:N_SHUFFLES
- as inMutate
, but instead of point mutations, the region betweenSTART
andEND
is shuffled, andN_SHUFFLES
random sequences are introduced in the same place (resulting inN_SHUFFLES
tiles)
This script produces the following output files:
OUT_FILE.patches.bed
BED file with positions of the designed files (if available)OUT_FILE.fa
FASTA file with the tile sequences
You need to make sure that the FASTA file of your tiles without adapters is available in LIBRARY.fa
. You will need to make a list of your sample names in names.txt
. The commannds use LSF, but can be easily adapted to other computing setups.
Count reads using simple counting, splitting each FASTQ into 100 slices that are counted indepedently:
cat names.txt | while read p; do for i in {0..99}; do bsub -o $p.$i.log -R rusage[mem=5000] java -Xmx48000m -cp jar/compbioLib.jar:jar/compbio.jar:jar/picard.jar scripts.lincs.patch.AnalyzeConservedPatches count_reads LIBRARY.fa ${p}_1.fastq.gz -slice 100 $i $p.counts.$i.txt 1000000000; done; done
Additional parameters:
-fq2 FILE
- if sequencing was with paired-end, use this option to provide the name of the 2nd FASTQ file-adapter ADAPTER
and-adapter2 ADAPTER
- by default adapter1 is TTGATTCGATATCCGCATGCTAGC and adapter2 is CGGCTTGCGGCCGCACTAGT, adjust if needed (adapter2 is expected at the beginning of read2-min_read2_len NUMBER
- the minimal length of expected read2
This will generate XXX.counts.YYY.txt
files for each slice, and XXX.counts.YYY.txt.noMatch.fa.gz
files that contain unmapped sequences in a table format.
Build FASTA files that contain the UMI in the header, and that can be used for BLAST:
cat names.txt | while read p; do for i in {0..99}; do zcat $p.countsAB.$i.txt.noMatch.fa.gz | cut -f 1,2,7 | awk '{print ">" $1 $2 ":" $4 "\n" $3}' > $p.$i.not_matching.fa; done; done
makeblastdb -dbtype nucl -in NucLibAB_noMut.fa
cat names.txt | while read p; do for i in {0..99}; do bsub -R rusage[mem=5000] -o $p.$i.log blastn -task blastn -word_size 12 -query $p.$i.not_matching.fa -db NucLibAB_noMut.fa -strand plus -out $p.$i.blast.txt -outfmt 7 -evalue 1e-10; done; done
(wait for all jobs to finish)
for FILE in *not_matching.fa *.blast.txt; do bsub -o log.log gzip $FILE; done
cat names.txt | while read p; do for i in {0..99}; do bsub -o $p.$i.log java -Xmx48000m -cp jar/compbioLib.jar:jar/compbio.jar:jar/picard.jar scripts.lincs.FilterTilingBLASTMapping $p.$i.blast.txt.gz 30 $p.blast_counts.$i $p.$i.not_matching.fa.gz; done; done
(wait for all jobs to finish)
cat names.txt | while read p; do bsub -R rusage[mem=2000] java -Xmx48000m -cp jar/compbioLib.jar:jar/compbio.jar:jar/picard.jar scripts.lincs.patch.AnalyzeConservedPatches combine_count_files $p.counts,$p.blast_counts 0 99 $p.combined_blast.txt; done
(wait for all jobs to finish - optional BLAST step over)
cat names.txt | while read p; do java -Xmx48000m -cp jar/compbioLib.jar:jar/compbio.jar:jar/picard.jar scripts.lincs.patch.AnalyzeConservedPatches combine_count_files $p.counts 0 99 $p.combined.txt; done
This will generate XXX.combined.txt
files that contain the combined counts for each tile in each library
Uses the names.txt
file and R version 3.6.0
Rscript qcTwist.R ./ names.txt combined.txt OUT_FILE.pdf
where combined.txt
is the suffix of your files built by combining the counts
This will generate OUT_FILE.pdf
with some plots
The next script filters the data and computes ratios both old-fashioned/simple/pseudocout way and with DESeq2. It needs a configuration file config.txt
which has 2 columns (sample in samples/config.txt
), key and value:
rootDir
→ Directory where the files sitgroupsFile
→ name of a 4 column file: FILE (from names.txt) GROUP (e.g. “Nuc”) REPLICATE (e.g “1”) USE_IN_DESEQ (T or F). For groups use “WCE” for WCE samples and “Input” for plasmid samples (for later filtering)ratiosFile
→ name of a 3 column file : RATIO_NAME (name of the column in the output file) FILE1 FILE2 → this will be used to compute ‘simple’ ratios (e.g., per replicate). The ratios will be FILE1/FILE.groupRatiosFile
→ name of a 3 column file : COMPARISON_NAME GROUP1 GROUP2 (this will be used by DESeq2 to compare GROUP1 (case) to GROUP2 (control). Note that the ratios will be GROUP1/GROUP2.suffix
→ suffixes of the files, as in the QC scriptoutBase
→ base of the filename for which output files will be formedrepeatFile
→ used for annotating repeats (can be empty)featuresFile
→ used for annotating features (can be empty)minWCEReads
→ minimum average number of reads in all the samples in groups with “WCE” in their name. Filtering will only take place if there are such groups.minInputReads
→ minimum average number of reads in all the samples in groups with “Input” in their name. Filtering will only take place if there are such groups.minRatioReads
→ minimum number of reads in FILE1+FILE2 for the ratio to be not NApseudo
→ pseudocount used to compute ratiosuseUMIs
→ whether UMIs (T) or reads (F) should be useduseReps
→ whether replicate number should be used in DESeq (only use if there is >1!useDESeq2
→ whether DESeq2 should be used (default - T)
Run the script:
Rscript processTwist.R config.txt
The ratios can then be ploted accross the genes represented in the library. This script needs a configuration file that specifies the output of the previous step, the condition that needs to be plotted and the number of individual replicates. This file also has 2 columns (sample in samples/config.txt
), key and value:
rootDir
→ Directory where the files sitfinalTab
→ Table that contains the ratios (output from the previous step)condition
→ The ratio that will be plottednumOfReplicates
→ The number of replicates plottedoutFile
→ The name of the output file
Run with Rscript ContextPlot.R CONFIG_FILE
A sample configuration file is available at samples/contextPlot.configFile.txt