## Processing FASTQ files to obtain BAM files

This step generates the BAM files that will then be used in the [BAM_to_FASTQ notebook](BAM_to_FASTQ.ipynb).

I am mainly reproducing the [pipeline](Hubert_Pausch_initial_code.ipynb) already in place and developed by Hubert.

As I have previous experience with yeast data, I am starting with this species. First, I identify the sequencing run I am interested in: [SRR10079472](https://www.ncbi.nlm.nih.gov/sra/SRX6812509[accn])

The [SRA toolkit](https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/) is then downloaded and installed on the cluster.
The sequences (FASTQ files), the reference genome and its annotation can be downloaded as follows:

In [None]:
cd /cluster/work/pausch/audald/sratoolkit.2.9.6-1-ubuntu64/bin #go to the bin folder of the SRA toolkit
./prefetch SRR10079472 #fetch the .sra package for the run; it is saved under ncbi/public/sra (the path used in the next command)
./fastq-dump --split-files /cluster/work/pausch/audald/ncbi/public/sra/SRR10079472.sra #extract the FASTQ files from the .sra package; --split-files writes one file per mate
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz #reference genome (FASTA)
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz #annotation (GFF)
gunzip GCF_000146045.2_R64_genomic*.gz

We can take a look at the line counts of the FASTQ files and check that the two files of the pair match:

In [None]:
cd /cluster/work/pausch/audald/data/yeast
wc -l *fastq
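
The same check can be expressed in reads rather than lines. The following is a minimal sketch, assuming the standard four-lines-per-record FASTQ layout. The `count_reads` helper and the toy files are hypothetical and only serve as an illustration; on the real data the two `SRR10079472_*.fastq` files would be passed instead.

```shell
# Count reads in a FASTQ file, assuming each record spans exactly four lines.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Hypothetical toy pair written to a temporary directory for illustration.
demo_dir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$demo_dir/toy_1.fastq"
printf '@r1/2\nACGT\n+\nIIII\n@r2/2\nTCAA\n+\nIIII\n' > "$demo_dir/toy_2.fastq"

n1=$(count_reads "$demo_dir/toy_1.fastq")
n2=$(count_reads "$demo_dir/toy_2.fastq")

# Both mates of a paired-end run should contain the same number of reads.
if [ "$n1" -eq "$n2" ]; then
    echo "OK: both files contain $n1 reads"
else
    echo "Mismatch: $n1 vs $n2 reads" >&2
fi
```

A mismatch between the two counts would suggest a truncated download or an incomplete `fastq-dump` run.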

FASTQ files are normally stored compressed, so I compress them:

In [None]:
bsub -o output_file gzip *fastq
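
Once the compression job has finished, it may be worth confirming that the archives are intact before passing them on. This is a minimal sketch on a hypothetical toy file; on the cluster the same two `gzip` calls would be applied to the real `*.fastq.gz` files.

```shell
# Hypothetical toy FASTQ file, compressed in a temporary directory.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$demo_dir/toy.fastq"
gzip "$demo_dir/toy.fastq"

# gzip -t tests the archive integrity without writing the decompressed data.
if gzip -t "$demo_dir/toy.fastq.gz"; then
    status="intact"
else
    status="corrupt"
fi
echo "toy.fastq.gz: $status"

# gzip -dc decompresses to stdout, so the line count can be compared
# with the count obtained before compression.
gz_lines=$(gzip -dc "$demo_dir/toy.fastq.gz" | wc -l)
echo "lines after decompression: $gz_lines"
```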

This generates compressed FASTQ files, ready to be used by fastp, which can be run as follows (as described in [Hubert's code](Hubert_Pausch_initial_code.ipynb)).

fastp produces report files (HTML and JSON) and the LSF job writes its own output file. These files are moved to the same folder as the resulting FASTQ files.

In [None]:
cd /cluster/work/pausch/audald/data/yeast
bsub -o output_fastp_yeast -R "rusage[mem=3500,scratch=1000]" -J "fastp_job" "/cluster/work/pausch/group_bin/fastp -i original_raw_data/SRR10079472_1.fastq.gz -o fastq_fastp/SRR10079472_1.fastq.gz -I original_raw_data/SRR10079472_2.fastq.gz -O fastq_fastp/SRR10079472_2.fastq.gz -q 15 -u 40 -g >/dev/null"
mv fastp* fastq_fastp
mv output_fastp_yeast fastq_fastp/
cd fastq_fastp
ls -lrth
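
To make the short flags easier to read, the fastp call can be written with long option names. This is only a sketch: it assumes that the installed fastp version accepts these long names, and it prints the command instead of running it. `-q 15` treats bases with a Phred quality of at least 15 as qualified, `-u 40` discards reads in which more than 40% of the bases are unqualified, and `-g` forces trimming of polyG tails. The `--json` and `--html` options are an addition that is not in the original command; they would write the reports directly into the output folder.

```shell
# Paths taken from the cell above; adjust them to your own layout.
fastp_bin="/cluster/work/pausch/group_bin/fastp"
in_dir="original_raw_data"
out_dir="fastq_fastp"
sample="SRR10079472"

# Same filtering settings as -q 15 -u 40 -g, spelled out with long option names.
# --json/--html are an assumption added here so that no reports need to be moved afterwards.
fastp_cmd="$fastp_bin \
--in1 $in_dir/${sample}_1.fastq.gz --out1 $out_dir/${sample}_1.fastq.gz \
--in2 $in_dir/${sample}_2.fastq.gz --out2 $out_dir/${sample}_2.fastq.gz \
--qualified_quality_phred 15 \
--unqualified_percent_limit 40 \
--trim_poly_g \
--json $out_dir/fastp.json --html $out_dir/fastp.html"

# Dry run: print the command so it can be inspected or wrapped in bsub.
echo "$fastp_cmd"
```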

The FASTQ files can then be passed through BWA in order to get the first BAM file:

In [None]:
module load bwa/0.7.12 #loading the module to run
module load samtools/1.6 #loading samtools
#Note: samblaster can be installed as described here: https://github.com/GregoryFaust/samblaster
bsub -o output_bwa -J "BWA" bwa index ../GCF_000146045.2_R64_genomic.fna #create index files for the FASTA reference (same path as in the bwa mem call); this job must finish before the next step
bsub -o bam_creation -R "rusage[mem=3500,scratch=1000]" -J "BAM_creation" "bwa mem -M -t 12 -R '@RG\tID:001\tPL:illumina\tPU:sample001\tSM:R_001' ../GCF_000146045.2_R64_genomic.fna SRR10079472_1.fastq.gz SRR10079472_2.fastq.gz | /cluster/work/pausch/audald/software/samblaster/samblaster -M | samtools view -Sb - > output.bam"
#run BWA and generate the first BAM. Extra memory is required. A read-group header line starting with @RG is needed; here I use an example found online
mkdir -p /cluster/work/pausch/audald/data/yeast/fastq_fastp/tmp_scratch #create the temporary directory used by sambamba sort
bsub -o sambamba_output -R "rusage[mem=75000]" -J "sambamba" "/cluster/work/pausch/group_bin/sambamba_v0.6.6 sort -m 6G --tmpdir /cluster/work/pausch/audald/data/yeast/fastq_fastp/tmp_scratch --out SRR10079472.bam --nthreads 10 output.bam"
#sambamba sort (https://lomereiter.github.io/sambamba/docs/sambamba-sort.html) sorts the BAM file and generates the BAI file
#A temporary directory and more memory are needed for this process.
bsub -o sambamba_report_output -R "rusage[mem=25000]" -J "sambamba_report" "/cluster/work/pausch/group_bin/sambamba_v0.6.6 flagstat SRR10079472.bam > SRR10079472.stats"
#this command generates statistics for the BAM file
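
Each of the four jobs in this cell needs the output of the previous one, so they must not start at the same time. One way to enforce the order is LSF's `-w "done(job_name)"` dependency option. The sketch below is an assumption about how the chain could be wired, not part of the original pipeline. The `submit` wrapper and the `DRY_RUN` variable are hypothetical, and the commands in angle brackets are placeholders for the full command lines used in the cell.

```shell
# DRY_RUN=1 (the default here) only prints the bsub calls;
# set DRY_RUN=0 on the cluster to actually submit them.
DRY_RUN=${DRY_RUN:-1}

submit() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "bsub $*"
    else
        bsub "$@"
    fi
}

# Each job waits for the successful completion of the job named in -w.
# The quoted commands are placeholders, not runnable command lines.
plan=$(
    submit -J "BWA" -o output_bwa "<bwa index>"
    submit -J "BAM_creation" -w "done(BWA)" -o bam_creation "<bwa mem | samblaster | samtools view>"
    submit -J "sambamba" -w "done(BAM_creation)" -o sambamba_output "<sambamba sort>"
    submit -J "sambamba_report" -w "done(sambamba)" -o sambamba_report_output "<sambamba flagstat>"
)
echo "$plan"
```

With this wiring all four jobs can be submitted at once, and the scheduler starts each one only after its predecessor has completed successfully.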

The BAM, BAI and stats files are now available. The BAM file is sorted and, provided the stats file looks OK, it can be processed further or stored as it is.
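
A quick way to judge whether the stats file "looks OK" is to extract the total number of reads and the mapped percentage. This is a minimal sketch, assuming that the flagstat output follows the samtools-style layout shown in the toy file. The toy numbers and the 90% threshold are hypothetical; on the real data `stats_file` would point to `SRR10079472.stats`.

```shell
# Hypothetical flagstat-style output written to a temporary file for illustration.
stats_file=$(mktemp)
cat > "$stats_file" <<'EOF'
2000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
100 + 0 duplicates
1950 + 0 mapped (97.50%:N/A)
2000 + 0 paired in sequencing
EOF

# The first field of the "in total" line is the number of QC-passed reads.
total_reads=$(awk '$4 == "in" && $5 == "total" { print $1 }' "$stats_file")

# The fifth field of the "mapped" line looks like "(97.50%:N/A)"; keep only the number.
mapped_pct=$(awk '$4 == "mapped" { gsub(/[(%]/, "", $5); split($5, parts, ":"); print parts[1] }' "$stats_file")

# Hypothetical acceptance threshold; awk handles the floating-point comparison.
min_pct=90
verdict=$(awk -v pct="$mapped_pct" -v min="$min_pct" 'BEGIN { if (pct + 0 >= min + 0) print "OK"; else print "LOW" }')

echo "total reads: $total_reads, mapped: $mapped_pct% ($verdict)"
```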