# WGS analysis
## Estimation of copy number variation
* [Amhed Missael Vargas Velazquez](https://www.researchgate.net/profile/Amhed-Vargas-Velazquez)
* Post-doctoral fellow, [SGB lab](https://syngenbio.kaust.edu.sa/), [KAUST](https://www.kaust.edu.sa/en)

## Brief description
This Jupyter notebook describes the processing pipeline used to estimate the abundance of piRNAi fragments in *C. elegans* strains.

### Methodology
Paired-end sequencing reads were mapped to an indexed reference FASTA file containing the *C. elegans* WBcel235/ce11 genome sequence (WormBase release WS275 in the commands below), synthetic piRNA fragment sequences, protein tags, and plasmid DNA sequences (bacterial backbone removed) with the bwa (v0.7.17-r1188) mem algorithm, using default parameters apart from the `-a` and `-M` flags passed in the commands below. Duplicate reads were removed from the alignments with Picard tools v2.23.6, and coverage analysis was performed with the samtools coverage function (samtools v1.9 for the other steps; the `coverage` step is run from a samtools 1.12 build in the commands below). The abundance of DNA sequences in extrachromosomal arrays was estimated by dividing their mean read depth by the average read depth across the *C. elegans* chromosomes.  
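
The depth ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: it assumes the standard tab-separated output of `samtools coverage`, chromosome names `I` to `V` and `X` as in the WormBase genomic FASTA, and a length-weighted mean as the "average read depth" of the chromosomes. The function name `relative_abundance` and the sequence name `pExample` are hypothetical, and the numbers are toy values.

```python
import csv
import io

# Assumption: nuclear chromosome names as they appear in the reference FASTA.
CHROMOSOMES = ("I", "II", "III", "IV", "V", "X")


def relative_abundance(coverage_tsv, chromosomes=CHROMOSOMES):
    """Return {sequence: mean depth / chromosome mean depth}.

    `coverage_tsv` is an open text handle on `samtools coverage` output.
    Every sequence that is not listed in `chromosomes` is reported.
    """
    depths, lengths = {}, {}
    for row in csv.DictReader(coverage_tsv, delimiter="\t"):
        name = row["#rname"]
        depths[name] = float(row["meandepth"])
        lengths[name] = int(row["endpos"]) - int(row["startpos"]) + 1

    present = [c for c in chromosomes if c in depths]
    if not present:
        raise ValueError("no chromosome rows found in the coverage table")

    # Assumption: the chromosome average is weighted by chromosome length.
    total_length = sum(lengths[c] for c in present)
    baseline = sum(depths[c] * lengths[c] for c in present) / total_length
    if baseline == 0:
        raise ValueError("chromosome mean depth is zero")

    return {
        name: depth / baseline
        for name, depth in depths.items()
        if name not in chromosomes
    }


# Toy table with two chromosomes and one hypothetical plasmid sequence.
toy = (
    "#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage\tmeandepth\tmeanbaseq\tmeanmapq\n"
    "I\t1\t1000\t300\t1000\t100\t30\t36\t60\n"
    "II\t1\t1000\t500\t1000\t100\t50\t36\t60\n"
    "pExample\t1\t500\t2000\t500\t100\t400\t36\t60\n"
)
abundance = relative_abundance(io.StringIO(toy))
print(abundance)  # {'pExample': 10.0}
```

With real data the handle would come from opening one of the `*.coverage.tsv` files written by the pipeline. Note that any sequence outside `chromosomes`, including the mitochondrial genome, is treated as a non-chromosomal sequence in this sketch.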

## Software requirements
The notebook itself runs in Python. However, the analyses are performed by several external programs (invoked through system calls), which must either be installed or, for portability, be present in the working directory.
* bwa
* samtools
* fastqc
* picard
* GATK (used for Set1 and Set2 only)
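
Before running anything, it can help to confirm that the external programs are reachable. The sketch below is one possible check and is not part of the original pipeline: it looks for each executable on `PATH` and then in the working directory. The function name `find_missing_tools` is hypothetical, and the executable names are assumptions (Picard, for instance, is run as a JAR through `java` in this notebook, so a plain `picard` executable may not exist on a given system).

```python
import os
import shutil


def find_missing_tools(tools):
    """Return the tools found neither on PATH nor in the working directory."""
    missing = []
    for tool in tools:
        on_path = shutil.which(tool)
        # shutil.which accepts an explicit search path; here the current directory.
        in_cwd = shutil.which(tool, path=os.getcwd())
        if not (on_path or in_cwd):
            missing.append(tool)
    return missing


# Assumed executable names; adjust them to the local installation.
missing = find_missing_tools(["samtools", "fastqc", "picard"])
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All tools found")
```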

## Results
Coverage analysis showed that plasmids linearized prior to injection are incorporated into extrachromosomal arrays in larger numbers than undigested plasmids.

# Setup

## Working directory and libraries import
We import all the required Python libraries and move to the main working directory.

In [1]:
## Load libraries
#os to move within directories
import os, sys
##Set main working directory
path = '/home/velazqam/Documents/Projects/piRNA_counts/June_2021/WGS'
##Move to path
os.chdir(path)

# Parameter definition
We set all relevant parameters for the notebook. The parameter names below use camelCase and are referenced under the same names by the shell commands in the Analysis section.

In [None]:
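minMaqQforBam=1    # minimum mapping quality kept when filtering BAM files (samtools view -q)
minMaqQforVcf=10   # minimum mapping quality for variant calling (GATK -mmq)
minBaseQforVcf=10  # minimum base quality for variant calling (GATK -mbq)
minVarCall=10      # minimum call confidence (GATK -stand_call_conf)
diffStep=10        # not referenced by the commands in this notebook
minDel=2           # not referenced by the commands in this notebook
maxSample=2        # not referenced by the commands in this notebook
Ncpu=28            # number of threads passed to fastqc, bwa, samtools and GATK
ramG=40            # Java heap size in gigabytes (-Xmx)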
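
The shell commands in the Analysis section refer to these values as `$Ncpu`, `${ramG}` and so on, which only works if the shell can see them. One possible way to bridge the two, shown here as a sketch rather than as the mechanism actually used, is to copy the Python values into the environment of the child process. The helper name `environment_with` is hypothetical, and the child process here is a second Python interpreter so that the example runs anywhere.

```python
import os
import subprocess
import sys


def environment_with(params, base=None):
    """Return a copy of the environment with `params` added as strings."""
    env = dict(os.environ if base is None else base)
    env.update({name: str(value) for name, value in params.items()})
    return env


# A subset of the notebook parameters, used for illustration only.
params = {"Ncpu": 28, "ramG": 40, "minMaqQforBam": 1}
env = environment_with(params)

# The child process reads Ncpu from its environment, as a shell would via $Ncpu.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['Ncpu'])"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())  # 28
```

Passing an explicit `env` leaves the notebook's own environment untouched, so parameters changed later in the notebook do not leak into commands that were already launched.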

# Analysis
DO NOT RUN

In [None]:
##Add reference genome
cd Set1
cat ../c_elegans.PRJNA13758.WS275.genomic_softmasked.fa Background.fasta Plasmids.fasta > Reference.fasta
cd ..

cd Set2
cat ../c_elegans.PRJNA13758.WS275.genomic_softmasked.fa Background.fasta Plasmids.fasta > Reference.fasta
cd ..

cd Set3
cat ../c_elegans.PRJNA13758.WS275.genomic_softmasked.fa Plasmids.fasta > Reference.fasta
cd ..

sami=/home/velazqam/Documents/Projects/piRNA_counts/June_2021/WGS/scratch/samtools-1.12/bin/bin/samtools

##Start with simplest one
cd Set3
for lib in `ls -d S*`; do
echo ${lib}
cd ${lib}
cp ../Reference.fasta .
samtools faidx Reference.fasta
bwa index Reference.fasta
java -jar /home/velazqam/Downloads/Sonia_WGS/Software/picard2.23.6/picard.jar CreateSequenceDictionary R=Reference.fasta O=Reference.dict

#Perform fastqc
fastqc -t $Ncpu ${lib}*.gz;

##Perform mapping
bwa mem -t $Ncpu -aM Reference.fasta ${lib}*.gz | samtools view -buS - |samtools sort - -o ${lib}_rawmap.bam

#Samtools sort
samtools sort -@ $Ncpu ${lib}_rawmap.bam > ${lib}.sort.bam
#rm ${lib}_rawmap.bam

#Samtools index
samtools index ${lib}.sort.bam

##Discard alignments below the minimum mapping quality
samtools view -@ $Ncpu -q $minMaqQforBam -bh ${lib}.sort.bam > ${lib}.UM.bam
#rm ${lib}.sort.bam ${lib}.sort.bam.bai

#Picard add replace groups ## Explain its importance for combining multiple libraries
java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/picard2.23.6/picard.jar AddOrReplaceReadGroups I=${lib}.UM.bam O=${lib}.RG.bam RGID=${lib} RGLB=LB RGPL=illumina RGPU=PU RGSM=${lib}
#rm ${lib}.UM.bam

#Picard remove duplicates
java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/picard2.23.6/picard.jar MarkDuplicates I=${lib}.RG.bam O=${lib}.deDup.bam M=${lib}.dedupMetrics REMOVE_DUPLICATES=true
#rm ${lib}.RG.bam ${lib}.dedupMetrics

##Samtools
samtools index ${lib}.deDup.bam

${sami} coverage ${lib}.deDup.bam > ${lib}.coverage.tsv

cd ..;

done

cd ..

##Set2: same steps as Set3, plus GATK indel realignment and variant calling
cd Set2
for lib in `ls -d S*`; do
echo ${lib}
cd ${lib}
cp ../Reference.fasta .
samtools faidx Reference.fasta
bwa index Reference.fasta
java -jar /home/velazqam/Downloads/Sonia_WGS/Software/picard2.23.6/picard.jar CreateSequenceDictionary R=Reference.fasta O=Reference.dict

#Perform fastqc
fastqc -t $Ncpu ${lib}*.gz;

##Perform mapping
bwa mem -t $Ncpu -aM Reference.fasta ${lib}*.gz | samtools view -buS - |samtools sort - -o ${lib}_rawmap.bam

#Samtools sort
samtools sort -@ $Ncpu ${lib}_rawmap.bam > ${lib}.sort.bam
#rm ${lib}_rawmap.bam

#Samtools index
samtools index ${lib}.sort.bam

##Discard alignments below the minimum mapping quality
samtools view -@ $Ncpu -q $minMaqQforBam -bh ${lib}.sort.bam > ${lib}.UM.bam
#rm ${lib}.sort.bam ${lib}.sort.bam.bai

#Picard add replace groups ## Explain its importance for combining multiple libraries
java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/picard2.23.6/picard.jar AddOrReplaceReadGroups I=${lib}.UM.bam O=${lib}.RG.bam RGID=${lib} RGLB=LB RGPL=illumina RGPU=PU RGSM=${lib}
#rm ${lib}.UM.bam

#Picard remove duplicates
java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/picard2.23.6/picard.jar MarkDuplicates I=${lib}.RG.bam O=${lib}.deDup.bam M=${lib}.dedupMetrics REMOVE_DUPLICATES=true
#rm ${lib}.RG.bam ${lib}.dedupMetrics

##Samtools
samtools index ${lib}.deDup.bam

${sami} coverage ${lib}.deDup.bam > ${lib}.coverage.tsv

##GATK Indel realigner
java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/gatk-3.8.1.0/GenomeAnalysisTK.jar -T RealignerTargetCreator -nt $Ncpu -R Reference.fasta -I ${lib}.deDup.bam -o ${lib}.forIndelRealigner.intervals

java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/gatk-3.8.1.0/GenomeAnalysisTK.jar -T IndelRealigner -R Reference.fasta -I ${lib}.deDup.bam -targetIntervals ${lib}.forIndelRealigner.intervals -o ${lib}.realigned.bam

#rm ${lib}.deDup.bam ${lib}.deDup.bam.bai ${lib}.forIndelRealigner.intervals

#Unified genotyper
java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/gatk-3.8.1.0/GenomeAnalysisTK.jar -T UnifiedGenotyper -R Reference.fasta -nt $Ncpu -l INFO -glm BOTH -I ${lib}.realigned.bam -o ${lib}.UG.vcf -mbq $minBaseQforVcf -stand_call_conf $minVarCall 

java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/gatk-3.8.1.0/GenomeAnalysisTK.jar -T HaplotypeCaller --emitRefConfidence GVCF -R Reference.fasta -l INFO -I ${lib}.realigned.bam -o ${lib}.g.vcf -mbq ${minBaseQforVcf} -mmq ${minMaqQforVcf} -stand_call_conf ${minVarCall}

java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/gatk-3.8.1.0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Reference.fasta -nt ${Ncpu} -l INFO -V ${lib}.g.vcf -dt none -o ${lib}.HC.vcf

cd ..;

done

cd ..


##Set1: same steps as Set2 (mapping, coverage, GATK indel realignment and variant calling)
cd Set1
for lib in `ls -d *_*`; do
echo ${lib}
cd ${lib}
cp ../Reference.fasta .
samtools faidx Reference.fasta
bwa index Reference.fasta
java -jar /home/velazqam/Downloads/Sonia_WGS/Software/picard2.23.6/picard.jar CreateSequenceDictionary R=Reference.fasta O=Reference.dict

#Perform fastqc
fastqc -t $Ncpu ${lib}*.gz;

##Perform mapping
bwa mem -t $Ncpu -aM Reference.fasta ${lib}*.gz | samtools view -buS - |samtools sort - -o ${lib}_rawmap.bam

#Samtools sort
samtools sort -@ $Ncpu ${lib}_rawmap.bam > ${lib}.sort.bam
#rm ${lib}_rawmap.bam

#Samtools index
samtools index ${lib}.sort.bam

##Discard alignments below the minimum mapping quality
samtools view -@ $Ncpu -q $minMaqQforBam -bh ${lib}.sort.bam > ${lib}.UM.bam
#rm ${lib}.sort.bam ${lib}.sort.bam.bai

#Picard add replace groups ## Explain its importance for combining multiple libraries
java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/picard2.23.6/picard.jar AddOrReplaceReadGroups I=${lib}.UM.bam O=${lib}.RG.bam RGID=${lib} RGLB=LB RGPL=illumina RGPU=PU RGSM=${lib}
#rm ${lib}.UM.bam

#Picard remove duplicates
java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/picard2.23.6/picard.jar MarkDuplicates I=${lib}.RG.bam O=${lib}.deDup.bam M=${lib}.dedupMetrics REMOVE_DUPLICATES=true
#rm ${lib}.RG.bam ${lib}.dedupMetrics

##Samtools
samtools index ${lib}.deDup.bam

${sami} coverage ${lib}.deDup.bam > ${lib}.coverage.tsv

##GATK Indel realigner
java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/gatk-3.8.1.0/GenomeAnalysisTK.jar -T RealignerTargetCreator -nt $Ncpu -R Reference.fasta -I ${lib}.deDup.bam -o ${lib}.forIndelRealigner.intervals

java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/gatk-3.8.1.0/GenomeAnalysisTK.jar -T IndelRealigner -R Reference.fasta -I ${lib}.deDup.bam -targetIntervals ${lib}.forIndelRealigner.intervals -o ${lib}.realigned.bam

#rm ${lib}.deDup.bam ${lib}.deDup.bam.bai ${lib}.forIndelRealigner.intervals

#Unified genotyper
java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/gatk-3.8.1.0/GenomeAnalysisTK.jar -T UnifiedGenotyper -R Reference.fasta -nt $Ncpu -l INFO -glm BOTH -I ${lib}.realigned.bam -o ${lib}.UG.vcf -mbq $minBaseQforVcf -stand_call_conf $minVarCall 

java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/gatk-3.8.1.0/GenomeAnalysisTK.jar -T HaplotypeCaller --emitRefConfidence GVCF -R Reference.fasta -l INFO -I ${lib}.realigned.bam -o ${lib}.g.vcf -mbq ${minBaseQforVcf} -mmq ${minMaqQforVcf} -stand_call_conf ${minVarCall}

java -Xmx${ramG}g -jar /home/velazqam/Downloads/Sonia_WGS/Software/gatk-3.8.1.0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Reference.fasta -nt ${Ncpu} -l INFO -V ${lib}.g.vcf -dt none -o ${lib}.HC.vcf

cd ..;

done

cd ..
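
The mapping, filtering, deduplication and coverage steps are identical in the three loops above. As a sketch of how they could be generated once per library instead of being repeated, the function below returns the shared commands as strings without executing anything. The function name `build_commands`, its default values, the `picard.jar` path and the library name `S1` are placeholders for illustration; the command strings themselves follow the ones used above.

```python
import shlex


def build_commands(lib, ncpu=28, ram_g=40, min_mapq=1,
                   picard_jar="picard.jar", samtools="samtools"):
    """Return the shared per-library commands, in order, as shell strings.

    Nothing is executed; the caller decides whether to print or run them.
    """
    lib = shlex.quote(lib)
    picard = f"java -Xmx{ram_g}g -jar {shlex.quote(picard_jar)}"
    return [
        # Map the reads, convert to BAM and sort
        f"bwa mem -t {ncpu} -aM Reference.fasta {lib}*.gz"
        f" | samtools view -buS - | samtools sort - -o {lib}_rawmap.bam",
        f"samtools sort -@ {ncpu} {lib}_rawmap.bam > {lib}.sort.bam",
        f"samtools index {lib}.sort.bam",
        # Keep alignments with mapping quality >= min_mapq
        f"samtools view -@ {ncpu} -q {min_mapq} -bh {lib}.sort.bam > {lib}.UM.bam",
        # Attach read-group tags, then remove duplicate reads
        f"{picard} AddOrReplaceReadGroups I={lib}.UM.bam O={lib}.RG.bam"
        f" RGID={lib} RGLB=LB RGPL=illumina RGPU=PU RGSM={lib}",
        f"{picard} MarkDuplicates I={lib}.RG.bam O={lib}.deDup.bam"
        f" M={lib}.dedupMetrics REMOVE_DUPLICATES=true",
        f"samtools index {lib}.deDup.bam",
        # Per-sequence coverage table used for the abundance estimate
        f"{samtools} coverage {lib}.deDup.bam > {lib}.coverage.tsv",
    ]


# Dry run for a hypothetical library directory named S1.
commands = build_commands("S1")
for command in commands:
    print(command)
```

Printing the commands first makes it easy to review them before handing them to a shell, and the GATK steps used for Set1 and Set2 could be appended to the returned list in the same way.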


# Comments

# References
* Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
* Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
* “Picard Tools.” Broad Institute, GitHub repository. http://broadinstitute.github.io/picard/.