RandomScripts

Random coded scripts for RNA-Seq Simulation and Analysis.

RNA-Seq Data Simulation

Download Reference Data Set

To simulate RNA sequencing reads using Flux Simulator program, you need two files, annotation file and chromosomes file in your "Genome" folder.
To download gene annotation file for human genome, Gencode, release 17 (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_17/gencode.v17.annotation.gtf.gz).
To download chromosomes file for human genome, Genecode, release 17 (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_17/GRCh37.p11.genome.fa.gz).
Unzip the files using:

gzip -d gencode.v17.annotation.gtf.gz
gzip -d GRCh37.p11.genome.fa.gz

Preparing Data Set

To extract one chromosome, i.e. chr10 from genome file GRCh37.p11.genome.fa, use the following bash command, the resulting chromosome file is chr10.fa:

perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(chr10)}print if $c' GRCh37.p11.genome.fa > chr10.fa

To extract one chromosome annotation, i.e. chr10 from genome annotation file gencode.v17.annotation.gtf , use the following bash command, the resulting annotation file is chr10.gtf:

grep chr10 gencode.v17.annotation.gtf > chr10.gtf

or

grep ^chr10 gencode.v17.annotation.gtf> chr10.gtf

Flux Simulator

Download Flux Simulator
Go to bin, create a file called myParameters.par and copy the following lines:

REF_FILE_NAME   /PATH/TO/Genome/chr10.gtf
GEN_DIR         /PATH/TO/Genome

NB_MOLECULES    5000000
READ_NUMBER     5000000

POLYA_SCALE     NaN
POLYA_SHAPE     NaN

READ_LENGTH     100

PAIRED_END      YES
# use default 76-bp error model
ERR_FILE        76

# create a fastq file
FASTA           YES

Do not forget to change the path of REF_FILE_NAME and GEN_DIR parameters in myParameters.par file to the folder that has chr10.fa
Use the following command to run Flux Simulator

bash flux-simulator -p myParameters.par

or

bash flux-simulator -x -l -s -p myParameters.par

You will find three files in bin folder: RNA.bed, RNA.lib, RNA.fastq

Find the Longest PolyAs or PolyTs in a Reference Transcriptome

Download a reference transcriptome file for human genome from Genecode release 17 (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_17/gencode.v17.pc_transcripts.fa.gz)
Unzip the files using:

gzip -d gencode.v17.pc_transcripts.fa.gz

Run the following bash commands on the reference transcriptome file gencode.v17.pc_transcripts.fa to find the top 10 longest polyAs sequence, the result will be in file SortedPolyAs.out :

grep -Eo 'A+' gencode.v17.pc_transcripts.fa | awk '{print $1, length($1)}' > PolyAs.out
sort -k 2 -n  -r PolyAs.out > SortedPolyAs.out
head -n 10 SortedPolyAs.out

Run the following bash commands on the reference transcriptome file gencode.v17.pc_transcripts.fa to find the top 10 longest polyTs sequence, the result will be in file SortedPolyTs.out :

grep -Eo 'T+' gencode.v17.pc_transcripts.fa | awk '{print $1, length($1)}' > PolyTs.out
sort -k 2 -n  -r PolyTs.out > SortedPolyTs.out
head -n 10 SortedPolyTs.out

RNA-Seq kmers counting

kmers counting program Jellyfish

Download Jellyfish
Unzip jellyfish-2.2.7.tar.gz using tar command and change the path to your desired extracted path:

tar zxvf jellyfish-2.2.7.tar.gz -C /Users/sarael-metwally

Go to bin folder in jellyfish-2.2.7 and copy the sequencing reads file called RNA.fastq there.
Run jellyfish using the following two commands:

jellyfish count -m 21 -s 100M -t 10 -C RNA.fastq
jellyfish dump mer_counts.jf > kmer_counts.fa

The resulted kmers counting file is called kmer_counts.fa
To get most highly 200 abundant kmers, using the following commands in order:

jellyfish dump -c mer_counts.jf > kmer_counts.txt 
sort -k 2 -n  -r  kmer_counts.txt > sorted_kmers
head -n 200 sorted_kmers > top_200_kmers

The most top 200 kmers will be in file called top_200_kmers

RNA-Seq LightTrimmer

System requirements

64-bit machine with g++ version 4.7 or higher, pthreads,and zlib libraries.

Installation

Clone the GitHub repo, e.g. with git clone https://github.com/SaraEl-Metwally/LightTrimmer.git
Run make in the repo directory for k <= 31 or make k=kmersize for k > 31, e.g. make k=49.
Copy the resulted Jellyfish kmers count file kmer_counts.fa in LightTrimmer directory or provide it's path to the program.
Copy the resulted Flux simulated sequencing reads RNA.fastqin LightTrimmer directory or provide it's path to the program.
Run the following command:

./LightTrimmer -k 21 -c kmer_counts.fa RNA.fastq --verbose

Outputs

The output of LightTrimmer is the set of the following files:

kmers_prob.txt [Comma delimeted file for kmers probability].
kmers_count.txt [Comma delimeted file for kmers count].
kmers_correct.txt [Comma delimeted file for kmers position and correctness (position,1 or 0)].
kmers_info_all.txt [ Spaces delimited file for all information extracted from the reads.

The output file kmers_info_all.txt has 7 columns, which are read ID (R), Kmer ID or position in the read (K), Actual kmer count (C), Median kmer (M), Kmer probability (P), Kmer Correct(1)/Incorrect(0)/N(-1) (Co), Probability Computed(1)/Deductive(0) (Ca) i.e. computed based on median kmer coverage,Base pair Correct(1)/Incorrect(0)/N(-1) (Cbp).

R  K   C   M      P        Co  Ca Cbp
1  0  19  45  1.0237e-05   1   1  1
1  1  21  45  5.31115e-05  1   1  1
1  2  19  45  1.0237e-05   1   1  1
1  3  21  45  5.31115e-05  1   1  1
1  4  21  45  5.31115e-05  1   1  1
1  5  21  45  5.31115e-05  1   1  1
1  6  21  45  5.31115e-05  1   1  1
1  7  22  45  0.000112905  1   1  1
1  8  22  45  0.000112905  1   1  1

Plot kmers probability for one read

Open R, and On R shell, write the following:

df <-read.csv("/Users/sarael-metwally/Documents/LightTrimmer/kmers_prob.txt",check.names=FALSE,header=FALSE)
y <-subset(df, select=V1:V80)
x <-(1:80)
jpeg("/Users/sarael-metwally/Documents/RNA-seq/images/first_read_kmers_prob");
plot(x,y[1,],type="l",xlab="kmers start positions",ylab="kmers probability",main="kmers probability for one read")
dev.off()

The path provided to read.csv is the path to the comma delimited file for kmers_prob.txt from LightTrimmer.
y[1,] means you are working on the kmers from readID 1.
The path provided to jpeg is the path you wish to store your plots.

Plot kmers count for one read

Open R, and On R shell, write the following:

df <-read.csv("/Users/sarael-metwally/Documents/LightTrimmer/kmers_count.txt",check.names=FALSE,header=FALSE)
y <-subset(df, select=V1:V80)
x <-(1:80)
jpeg("/Users/sarael-metwally/Documents/RNA-seq/images/first_read_kmers_count");
plot(x,y[1,],type="l",xlab="kmers start positions",ylab="kmers counts",main="kmers counts for one read")
dev.off()

The path provided to read.csv is the path to the comma delimited file for kmers_count.txt from LightTrimmer.
y[1,] means you are working on the kmers from readID 1.
The path provided to jpeg is the path you wish to store your plots.
You can change the plot by providing limits on y-axis,i.e. the maximum value on y axis is 300, using the following command:

plot(x,y[2,],ylim=c(1,300),type="l",xlab="kmers start positions",ylab="kmers counts",main="kmers counts for one read")

Plot correct/incorrect kmers probabilities hitogram

Open R, and On R shell.
For correct kmer file called kmers_prob_C.txt (Note: change the file paths to your LightTrimmer installation folder), write the following:

df <-read.csv("/Users/sarael-metwally/Documents/LightTrimmer/kmers_prob_C.txt",check.names=FALSE,header=FALSE)$V1
x <- df[ df != -1 ]
jpeg("/Users/sarael-metwally/Documents/LightTrimmer/images/correct_kmers_cutoff.jpg"); 
y <-hist(x,main="Histogram of kmers probabilities",xlab="kmers probablities",xlim=c(0.0,1.0), prob=TRUE)
dev.off()
jpeg("/Users/sarael-metwally/Documents/LightTrimmer/images/correct_kmers_cutoff_1.jpg");
plot(y$mids,y$counts,type="l",xlab="kmers probabilities",ylab="kmers probabilities counts",main="kmers probabilities histogram")
dev.off()

You can change the scale of your plots by using the following R code:

y<-hist(x,axes=F);axis(2);axis(1,at=seq(0,1,by=0.1),labels=seq(0,1, by=0.1))

plot(y$mids,y$counts,type="l",xlab="kmers probabilities",ylab="kmers probabilities counts",main="kmers probabilities histogram");axis(2);axis(1,at=seq(0,1,by=0.1),labels=seq(0,1, by=0.1))

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
DUST.py		DUST.py
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RandomScripts

RNA-Seq Data Simulation

Download Reference Data Set

Preparing Data Set

Flux Simulator

Find the Longest PolyAs or PolyTs in a Reference Transcriptome

RNA-Seq kmers counting

kmers counting program Jellyfish

RNA-Seq LightTrimmer

System requirements

Installation

Outputs

Plot kmers probability for one read

Plot kmers count for one read

Plot correct/incorrect kmers probabilities hitogram

About

Releases

Packages

Contributors 2

Languages

SaraEl-Metwally/RandomScripts

Folders and files

Latest commit

History

Repository files navigation

RandomScripts

RNA-Seq Data Simulation

Download Reference Data Set

Preparing Data Set

Flux Simulator

Find the Longest PolyAs or PolyTs in a Reference Transcriptome

RNA-Seq kmers counting

kmers counting program Jellyfish

RNA-Seq LightTrimmer

System requirements

Installation

Outputs

Plot kmers probability for one read

Plot kmers count for one read

Plot correct/incorrect kmers probabilities hitogram

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages