Random coded scripts for RNA-Seq Simulation and Analysis.
- To simulate RNA sequencing reads using Flux Simulator program, you need two files, annotation file and chromosomes file in your "Genome" folder.
- To download gene annotation file for human genome, Gencode, release 17 (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_17/gencode.v17.annotation.gtf.gz).
- To download chromosomes file for human genome, Genecode, release 17 (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_17/GRCh37.p11.genome.fa.gz).
- Unzip the files using:
gzip -d gencode.v17.annotation.gtf.gz
gzip -d GRCh37.p11.genome.fa.gz
- To extract one chromosome, i.e.
chr10
from genome fileGRCh37.p11.genome.fa
, use the following bash command, the resulting chromosome file ischr10.fa
:
perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(chr10)}print if $c' GRCh37.p11.genome.fa > chr10.fa
- To extract one chromosome annotation, i.e.
chr10
from genome annotation filegencode.v17.annotation.gtf
, use the following bash command, the resulting annotation file ischr10.gtf
:
grep chr10 gencode.v17.annotation.gtf > chr10.gtf
or
grep ^chr10 gencode.v17.annotation.gtf> chr10.gtf
- Download Flux Simulator
- Go to
bin
, create a file calledmyParameters.par
and copy the following lines:
REF_FILE_NAME /PATH/TO/Genome/chr10.gtf
GEN_DIR /PATH/TO/Genome
NB_MOLECULES 5000000
READ_NUMBER 5000000
POLYA_SCALE NaN
POLYA_SHAPE NaN
READ_LENGTH 100
PAIRED_END YES
# use default 76-bp error model
ERR_FILE 76
# create a fastq file
FASTA YES
-
Do not forget to change the path of
REF_FILE_NAME
andGEN_DIR
parameters inmyParameters.par
file to the folder that haschr10.fa
-
Use the following command to run
Flux
Simulator
bash flux-simulator -p myParameters.par
or
bash flux-simulator -x -l -s -p myParameters.par
- You will find three files in
bin
folder:RNA.bed
,RNA.lib
,RNA.fastq
-
Download a reference transcriptome file for human genome from Genecode release 17 (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_17/gencode.v17.pc_transcripts.fa.gz)
-
Unzip the files using:
gzip -d gencode.v17.pc_transcripts.fa.gz
- Run the following bash commands on the reference transcriptome file
gencode.v17.pc_transcripts.fa
to find the top 10 longest polyAs sequence, the result will be in fileSortedPolyAs.out
:
grep -Eo 'A+' gencode.v17.pc_transcripts.fa | awk '{print $1, length($1)}' > PolyAs.out
sort -k 2 -n -r PolyAs.out > SortedPolyAs.out
head -n 10 SortedPolyAs.out
- Run the following bash commands on the reference transcriptome file
gencode.v17.pc_transcripts.fa
to find the top 10 longest polyTs sequence, the result will be in fileSortedPolyTs.out
:
grep -Eo 'T+' gencode.v17.pc_transcripts.fa | awk '{print $1, length($1)}' > PolyTs.out
sort -k 2 -n -r PolyTs.out > SortedPolyTs.out
head -n 10 SortedPolyTs.out
- Download Jellyfish
- Unzip
jellyfish-2.2.7.tar.gz
usingtar
command and change the path to your desired extracted path:
tar zxvf jellyfish-2.2.7.tar.gz -C /Users/sarael-metwally
- Go to
bin
folder injellyfish-2.2.7
and copy the sequencing reads file calledRNA.fastq
there. - Run
jellyfish
using the following two commands:
jellyfish count -m 21 -s 100M -t 10 -C RNA.fastq
jellyfish dump mer_counts.jf > kmer_counts.fa
- The resulted kmers counting file is called
kmer_counts.fa
- To get most highly 200 abundant kmers, using the following commands in order:
jellyfish dump -c mer_counts.jf > kmer_counts.txt
sort -k 2 -n -r kmer_counts.txt > sorted_kmers
head -n 200 sorted_kmers > top_200_kmers
- The most top 200 kmers will be in file called
top_200_kmers
64-bit machine with g++ version 4.7 or higher, pthreads,and zlib libraries.
- Clone the GitHub repo, e.g. with
git clone https://github.com/SaraEl-Metwally/LightTrimmer.git
- Run
make
in the repo directory for k <= 31 ormake k=kmersize
for k > 31, e.g.make k=49
. - Copy the resulted Jellyfish kmers count file
kmer_counts.fa
in LightTrimmer directory or provide it's path to the program. - Copy the resulted Flux simulated sequencing reads
RNA.fastq
in LightTrimmer directory or provide it's path to the program. - Run the following command:
./LightTrimmer -k 21 -c kmer_counts.fa RNA.fastq --verbose
The output of LightTrimmer is the set of the following files:
kmers_prob.txt [Comma delimeted file for kmers probability].
kmers_count.txt [Comma delimeted file for kmers count].
kmers_correct.txt [Comma delimeted file for kmers position and correctness (position,1 or 0)].
kmers_info_all.txt [ Spaces delimited file for all information extracted from the reads.
The output file kmers_info_all.txt
has 7 columns, which are read ID (R)
, Kmer ID or position in the read (K)
, Actual kmer count (C)
, Median kmer (M)
, Kmer probability (P)
, Kmer Correct(1)/Incorrect(0)/N(-1) (Co)
, Probability Computed(1)/Deductive(0) (Ca)
i.e. computed based on median kmer coverage,Base pair Correct(1)/Incorrect(0)/N(-1) (Cbp)
.
R K C M P Co Ca Cbp
1 0 19 45 1.0237e-05 1 1 1
1 1 21 45 5.31115e-05 1 1 1
1 2 19 45 1.0237e-05 1 1 1
1 3 21 45 5.31115e-05 1 1 1
1 4 21 45 5.31115e-05 1 1 1
1 5 21 45 5.31115e-05 1 1 1
1 6 21 45 5.31115e-05 1 1 1
1 7 22 45 0.000112905 1 1 1
1 8 22 45 0.000112905 1 1 1
- Open R, and On R shell, write the following:
df <-read.csv("/Users/sarael-metwally/Documents/LightTrimmer/kmers_prob.txt",check.names=FALSE,header=FALSE)
y <-subset(df, select=V1:V80)
x <-(1:80)
jpeg("/Users/sarael-metwally/Documents/RNA-seq/images/first_read_kmers_prob");
plot(x,y[1,],type="l",xlab="kmers start positions",ylab="kmers probability",main="kmers probability for one read")
dev.off()
- The path provided to
read.csv
is the path to the comma delimited file forkmers_prob.txt
from LightTrimmer. y[1,]
means you are working on the kmers fromreadID 1
.- The path provided to
jpeg
is the path you wish to store your plots.
- Open R, and On R shell, write the following:
df <-read.csv("/Users/sarael-metwally/Documents/LightTrimmer/kmers_count.txt",check.names=FALSE,header=FALSE)
y <-subset(df, select=V1:V80)
x <-(1:80)
jpeg("/Users/sarael-metwally/Documents/RNA-seq/images/first_read_kmers_count");
plot(x,y[1,],type="l",xlab="kmers start positions",ylab="kmers counts",main="kmers counts for one read")
dev.off()
- The path provided to
read.csv
is the path to the comma delimited file forkmers_count.txt
from LightTrimmer. y[1,]
means you are working on the kmers fromreadID 1
.- The path provided to
jpeg
is the path you wish to store your plots. - You can change the
plot
by providing limits ony-axis
,i.e. the maximum value ony axis is 300
, using the following command:
plot(x,y[2,],ylim=c(1,300),type="l",xlab="kmers start positions",ylab="kmers counts",main="kmers counts for one read")
- Open R, and On R shell.
- For correct kmer file called
kmers_prob_C.txt
(Note: change the file paths to your LightTrimmer installation folder), write the following:
df <-read.csv("/Users/sarael-metwally/Documents/LightTrimmer/kmers_prob_C.txt",check.names=FALSE,header=FALSE)$V1
x <- df[ df != -1 ]
jpeg("/Users/sarael-metwally/Documents/LightTrimmer/images/correct_kmers_cutoff.jpg");
y <-hist(x,main="Histogram of kmers probabilities",xlab="kmers probablities",xlim=c(0.0,1.0), prob=TRUE)
dev.off()
jpeg("/Users/sarael-metwally/Documents/LightTrimmer/images/correct_kmers_cutoff_1.jpg");
plot(y$mids,y$counts,type="l",xlab="kmers probabilities",ylab="kmers probabilities counts",main="kmers probabilities histogram")
dev.off()
- You can change the scale of your plots by using the following R code:
y<-hist(x,axes=F);axis(2);axis(1,at=seq(0,1,by=0.1),labels=seq(0,1, by=0.1))
plot(y$mids,y$counts,type="l",xlab="kmers probabilities",ylab="kmers probabilities counts",main="kmers probabilities histogram");axis(2);axis(1,at=seq(0,1,by=0.1),labels=seq(0,1, by=0.1))