# PCR duplication tests

Some of our HiC libraries have very high PCR duplication rates, presumably because low-complexity libraries were generated from very small amounts of intra-molecular ligated chromatin (IMLC).

We normally run a MiSeq on the libraries to check the quality of the HiC library, but this vastly underestimates the PCR duplication rate. It does give us a good idea of the distribution of read pairs, and it accurately estimates the percentage of read pairs that are valid (inter-contig or >10 kb apart).

Can we estimate PCR duplication rates from a small subsample of HiC reads (or reads from any other library prep)?

### Method

Two libraries will be used in this test: a good one with a low PCR duplication rate (kiwifruit) and a bad one with a high PCR duplication rate (raspberry, 92%).

The cleaned HiC reads will be mapped to the genome and the resulting SAM files will be subsampled. Each subsampled SAM file will then be fed into samblaster to flag PCR duplicates. After this, samtools flagstat will report the total and duplicate read counts, which give the PCR duplication rate.

Then plot these results to see whether there is a predictable relationship between the number of reads and the PCR duplication rate.


Number of reads in each dataset:
- Rasp: 826,802,162
- Kiwi: 1,944,174,986


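Subsample sizes. Apart from the Starting Reads row, the Kiwi and Rasp columns give the fraction of each dataset's starting reads needed to reach the target size (NumReads divided by the starting read count):

| FileName       	| NumPairs  	| NumReads  	| Kiwi fraction	| Rasp fraction	|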
|----------------	|-----------	|-----------	|-------------	|-------------	|
| Starting Reads 	|           	|           	| 1944174986  	| 826802162   	|
| 400M           	| 400000000 	| 800000000 	| 0.411485595 	| 0.967583343 	|
| 250M           	| 250000000 	| 500000000 	| 0.257178497 	| 0.604739589 	|
| 125M           	| 125000000 	| 250000000 	| 0.128589248 	| 0.302369795 	|
| 50M            	| 50000000  	| 100000000 	| 0.051435699 	| 0.120947918 	|
| 25M            	| 25000000  	| 50000000  	| 0.02571785  	| 0.060473959 	|
| 5M             	| 5000000   	| 10000000  	| 0.00514357  	| 0.012094792 	|
| 2.5M           	| 2500000   	| 5000000   	| 0.002571785 	| 0.006047396 	|
| 1.5M           	| 1500000   	| 3000000   	| 0.001543071 	| 0.003628438 	|
| 1M             	| 1000000   	| 2000000   	| 0.001028714 	| 0.002418958 	|
| 0.5M           	| 500000    	| 1000000   	| 0.000514357 	| 0.001209479 	|
| 0.25M          	| 250000    	| 500000    	| 0.000257178 	| 0.00060474  	|
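
The fraction columns can be reproduced by dividing each target read count by the dataset's starting read count. The following is a minimal sketch, assuming that is how the table was built; `subsample_fraction` is a hypothetical helper name and is not part of the original notebook.

```shell
# Hypothetical helper: fraction of a dataset needed to reach a target read count.
# Usage: subsample_fraction TARGET_READS STARTING_READS
subsample_fraction() {
    awk -v target="$1" -v total="$2" 'BEGIN { printf "%.9f\n", target / total }'
}

KIWI_TOTAL=1944174986   # starting reads, taken from the table above
RASP_TOTAL=826802162

# Print: target reads, Kiwi fraction, Rasp fraction
for TARGET in 800000000 500000000 250000000
do
    echo "$TARGET $(subsample_fraction "$TARGET" "$KIWI_TOTAL") $(subsample_fraction "$TARGET" "$RASP_TOTAL")"
done
```

Note that the 400M subsample already uses about 97% of the raspberry reads, so that row is close to the full dataset.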


In [5]:
# Set up

WKDIR=/workspace/hraijc/HiC_trials/duptest
KIWISAM=${WKDIR}/Kiwifruit_HiC.sam
RASSAM=${WKDIR}/Wakefield_HiC.sam



mkdir -p $WKDIR/log
cd $WKDIR



In [7]:
#Symlink the input files.
#ln -s /workspace/hraijc/Raspberry/Wakefield_genomeV2/V2.2/temp/Wakefield_flye20_pilon1_purged.sam ${RASSAM}
#ln -s /workspace/hraijc/Kiwi/Ck69_01_monoploid/HiC8_Novaseq/Scratch_Files/CK69_01_v2_contigs_HiC8_min1kb.sam ${KIWISAM}


In [54]:
#Flagstat starting files.
sbatch << EOF
#!/bin/bash
#SBATCH -J Flagstat
#SBATCH -o ${WKDIR}/log/%J.out
#SBATCH -e ${WKDIR}/log/%J.err
#SBATCH --cpus-per-task=12
#SBATCH --mem=1G
#SBATCH --time=5:00:00

module load samtools/1.16


samtools flagstat -@ 12 /workspace/hraijc/Kiwi/Ck69_01_monoploid/HiC8_Novaseq/CK69_01_v2_HiC8_dedup.bam > Kiwifruit_HiC_all_flagstat.txt
samtools flagstat -@ 12 /workspace/hraijc/Raspberry/Wakefield_genomeV2/V2.2/Wakefield_flye20_pilon1_purged_marked.bam > Wakefield_HiC_all_flagstat.txt

EOF

Submitted batch job 951229


/workspace/hraijc/HiC_trials/duptest/Kiwifruit_HiC.sam


In [6]:

module load samtools/1.16

shopt -s extglob   # Needed for the *( ) trailing-space pattern used below

for SAM in "$KIWISAM" "$RASSAM"     # Loop through input files
do
    PREFIX=$(basename "$SAM")
    APREFIX=${PREFIX%.sam}
    for (( i=1; i<=11; i++ ))      # One iteration per subsample row in the table above
    do
        RI=$(( ( RANDOM % 100 ) + 1 )) # Random integer between 1-100, used as the samtools subsampling seed
        PREFILENAME=$(head -n $i Filenames.txt | tail -n 1) # Get the filename from the filenames list
        FILENAME=${PREFILENAME//[$'\t\r\n']} && FILENAME=${FILENAME%%*( )} # Strip tabs, newlines and trailing spaces
        PRESUB=$(head -n $i ${APREFIX}_subsample_list.txt | tail -n 1) # Get the subsample fraction from the file named after the organism; it must hold only the digits after the decimal point
        SUB=${PRESUB//[$'\t\r\n']} && SUB=${SUB%%*( )} # Strip tabs, newlines and trailing spaces
        OPREFIX=${APREFIX}_${FILENAME} # Make outfile name prefix
        echo "samtools view -@ 12 -h -s ${RI}.${SUB} ${SAM} -o ${OPREFIX}.sam"
        echo "/workspace/hraijc/git_clones/samblaster/samblaster -i ${OPREFIX}.sam -o ${OPREFIX}_marked_byread.sam"
        echo "samtools flagstat -@ 12 ${OPREFIX}_marked_byread.sam > ${OPREFIX}_flagstat.txt"
    done
done | abatch -j ${WKDIR}/log/subtest -g 3 --time 20:10:00 --cpus-per-task=12 --mem 30G | sbatch


SBATCH_ARGS: --time 20:10:00 --cpus-per-task=12 --mem 30G
JOB_ARRAY_NAME: /workspace/hraijc/HiC_trials/duptest/log/subtest
GROUP_SIZE: 3
NUM_COMMANDS: 66
ARRAY_SIZE: 22
Submitted batch job 953271
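
`samtools view -s` takes a single number whose integer part is the random seed and whose fractional part is the fraction of reads to keep, which is why the loop joins `${RI}` and `${SUB}` with a dot. Below is a small sketch of that construction, assuming the fraction is supplied as a decimal such as `0.00514357`; `make_subsample_arg` is a hypothetical helper and is not part of samtools or the original notebook.

```shell
# Hypothetical helper: build the SEED.FRACTION argument for `samtools view -s`.
# Usage: make_subsample_arg SEED FRACTION   (FRACTION as a decimal below 1, e.g. 0.25)
make_subsample_arg() {
    local seed=$1 fraction=$2
    # Keep only the digits after the decimal point, preserving leading zeros
    printf '%s.%s\n' "$seed" "${fraction#*.}"
}

make_subsample_arg 42 0.00514357   # prints 42.00514357
```

Leading zeros matter here: `0.00514357` and `0.514357` must yield different arguments, so the subsample list files need to carry the fractional digits exactly as they appear after the decimal point.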


In [7]:
# Samtools flagstat reader: collect total, duplicate and mapped read counts into one CSV

echo "File,Reads,Duplicates,Mapped" > flag_results.csv

for FLAG in *flagstat.txt
do
    READS=$(grep "total" "$FLAG" | cut -d ' ' -f1)
    DUPS=$(grep "duplicates" "$FLAG" | head -n 1 | cut -d ' ' -f1)   # First match is the overall duplicates line
    MAPS=$(grep "mapped" "$FLAG" | head -n 1 | cut -d ' ' -f1)       # First match is the overall mapped line

    echo "$FLAG,$READS,$DUPS,$MAPS" >> flag_results.csv
done

Kiwifruit_HiC_100_flagstat.txt	test_flagstat.txt
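
flagstat reports counts rather than a rate, so a duplication rate still has to be derived for each subsample before plotting. The sketch below takes the rate as `Duplicates / Reads`, which is an assumption (dividing by `Mapped` would be an equally plausible choice). The helper name `add_dup_rate` and the demo values are made up for illustration and are not results from this experiment.

```shell
# Hypothetical helper: append a DupRate column (Duplicates / Reads) to a
# CSV with the header File,Reads,Duplicates,Mapped.
add_dup_rate() {
    awk -F, 'BEGIN { OFS = "," }
        NR == 1 { print $0, "DupRate"; next }    # extend the header row
        { print $0, ($2 > 0 ? sprintf("%.4f", $3 / $2) : "NA") }   # NA when Reads is zero or missing
    ' "$1"
}

# Demo on made-up numbers (NOT real results)
DEMO=$(mktemp)
printf '%s\n' "File,Reads,Duplicates,Mapped" "demo_flagstat.txt,1000,250,990" > "$DEMO"
add_dup_rate "$DEMO"
rm -f "$DEMO"
```

On the real data this would be run as `add_dup_rate flag_results.csv > flag_results_rate.csv`, where the output filename is again only a suggestion.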
