# Bwamin vs. BWA Benchmark

Note: `bwamin.py --mem --sw` will fail on large (and larger) input files. Do not run it on them!

## Table of Contents
Note: Click an entry to jump to its section.
1. [Installation Check](#installation)
2. [Time Section](#time)
    1. BWT
    2. SW
    3. BWA
    4. Graphs
3. [Memory Section](#memory)
    1. BWT
    2. SW
    3. BWA
    4. Graphs
4. [Conclusions](#conclusions)

## Installation Check <a name="installation"></a>

In [72]:
%%capture
# Run this cell to make the captured output small again
# (%%capture is a cell magic, so it must be the first line of the cell)
!echo hi


Let's check whether BWA and Bwamin are installed.

In [None]:
!bwa

In [None]:
!python bwamin.py -h

## Time Section <a name="time"></a>

We will time `bwamin.py --bwt`, `bwamin.py --sw`, and `bwa` on a custom
short file, a medium file, and a long file.

In [None]:
# Helper functions so we can graph later
import pandas as pd
import seaborn as sns

# Convert a `time` string such as '0m0.543s' to seconds
def timing(string):
    # Values that were already converted are read back as numbers
    if not isinstance(string, str):
        return float(string)
    elements = string.split('m')
    minutes = elements[0].strip()
    seconds = elements[1].replace('s', '').strip()

    try:
        return float(minutes) * 60 + float(seconds)
    except ValueError:
        # Report the problem, then let the error propagate
        print('ERROR')
        raise

# Convert the times in a single file to seconds, in place
def convertTime(filename):
    if not filename.endswith('.txt'):
        filename = filename + '.txt'
    x = pd.read_csv(filename, sep='\t', header=None)
    x[1] = x[1].apply(timing)
    x.to_csv(filename, header=False, index=False, sep='\t')

# Sum the index and mem times into one file
def addTime(filename):
    # Read files
    x = pd.read_csv(filename + "Index.txt", sep='\t', header=None)
    y = pd.read_csv(filename + "Mem.txt", sep='\t', header=None)

    # Convert and add times
    x[1] = x[1].apply(timing)
    y[1] = y[1].apply(timing)
    x[1] = x[1] + y[1]

    # Write to file
    x.to_csv(filename + '.txt', header=False, index=False, sep='\t')
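
As a quick check of the conversion above, the sketch below parses the kind of text that bash's `time` keyword writes to stderr (for example `real<TAB>0m0.543s`) into seconds. The name `parse_time_output` and the sample string are illustrative assumptions, not part of the notebook, and the parser assumes bash's default `XmY.YYYs` layout.

```python
def parse_time_output(text):
    """Parse bash `time` output into a dict of seconds (hypothetical helper)."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        label, value = parts
        minutes, _, seconds = value.partition('m')
        result[label] = float(minutes) * 60 + float(seconds.rstrip('s'))
    return result


# Sample text in bash's default format; the numbers are made up for illustration.
sample = "\nreal\t1m2.500s\nuser\t0m0.420s\nsys\t0m0.077s\n"
times = parse_time_output(sample)
print(times['real'])  # 62.5
```

Parsing the whole block at once keeps the `real`, `user`, and `sys` values together, which may be convenient if the results are later collected into a table.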

### Short

In [73]:
# Peek at short reference and read
print('short.fa')
!head benchmark/testfiles/short.fa
print('\n\nshort.fq')
!head benchmark/testfiles/short.fq

short.fa
>chr1
TGCAA

short.fq
@Sample1
TCAT
+
!1=D

In [74]:
%%capture
# Bwt
!{ time python bwamin.py --index --bwt --fa benchmark/testfiles/short.fa \
  2> sleep.stderr ;} 2> benchmark/testfiles/bwaminBwtShortTimeIndex.txt
!{ time python bwamin.py --mem --bwt --fa benchmark/testfiles/short.fa --fq benchmark/testfiles/short.fq \
  2> sleep.stderr ;} 2> benchmark/testfiles/bwaminBwtShortTimeMem.txt
addTime('benchmark/testfiles/bwaminBwtShortTime')

# Sw
!{ time python bwamin.py --mem --sw --fa benchmark/testfiles/short.fa --fq benchmark/testfiles/short.fq \
  2> sleep.stderr ;} 2> benchmark/testfiles/bwaminSwShortTime.txt
convertTime('benchmark/testfiles/bwaminSwShortTime.txt')

# Bwa
!{ time bwa index benchmark/testfiles/short.fa \
  2> sleep.stderr ;} 2> benchmark/testfiles/bwaShortTimeIndex.txt
!{ time bwa mem benchmark/testfiles/short.fa benchmark/testfiles/short.fq \
  2> sleep.stderr ;} 2> benchmark/testfiles/bwaShortTimeMem.txt
addTime('benchmark/testfiles/bwaShortTime')

In [75]:
# Print Short Results
print('Bwt')
!cat benchmark/testfiles/bwaminBwtShortTime.txt
print('\nSw')
!cat benchmark/testfiles/bwaminSwShortTime.txt
print('\nBwa')
!cat benchmark/testfiles/bwaShortTime.txt

Bwt
real	0.543
user	0.42
sys	0.077

Sw
real	0.341
user	0.217
sys	0.049

Bwa
real	0.022
user	0.003
sys	0.011


### Medium

In [76]:
# Peek at medium reference and read
print('testfastring.fa')
!head benchmark/testfiles/testfastring.fa
print('\n\ntestfqstring.fq')
!head benchmark/testfiles/testfqstring.fq

testfastring.fa
>chr1
ATCCTATATTACGACTTTGGCAGGGGGTTCGCAAGTCCCACCCCAAACGATGCTGAAGGCTCAGGTTACACAGGCACAAGTACTATATATACGAGTTCCCGCTCTTAACCTGGATCGAATGCAGAATCATGCATCGTACCACTGTGTTCGTGTCATCTAGGACGGGCGCAAAGGATATATAATTCAATTAAGAATACCTTATATTATTGTACACCTACCGGTCACCAGCCAACAATGTGCGGATGGCGTT
>chr2
ACGACTTACTGGGCCTGATCTCACCGCTTTAGATACCGCACACTGGGCAATACGAGGTAAAGCCAGTCACCCAGTGTCGATCAACAGCTAACGTAACGGTAAGAGGCTCACAAAATCGCACTGTCGGCGTCCCTTGGGTATTTTACGTTAGCATCAGGTGGACTAGCATGAATCTTTACTCCCAGGCGAAAACGGGTGCGTGGACAAGCGAGCAGCAAACGAAAATTCTTGGCCTGCTTGGTGTCTCGTA

testfqstring.fq
@SRR5077691.13
NTGAAAAGATGTCTCCTTCTGTAAGTCAGAACAAAAAACTTTAATTAACT
+
!1=DDFFFHHHHGJJJJJJJJIJJJIIIJIHGIJJJJJJJJJFHHIJIJI
@SRR5077691.79
GTTTCATTGTGTCTTTATTTCCTGTATTAATGAGATGGGATATGAAGTCT
+
JJJJJJJIJJIJJJJJJJJIHJJJJJJJJJJJJJJJJHHHHHFFFFF@CB
@SRR5077691.32
GCCGTGTGCCCCCTCTTGGGTGACACCCCACCCCACCCTTATTTGCATCN


In [77]:
%%capture
# Bwt
!{ time python bwamin.py --index --bwt\
  --fa benchmark/testfiles/testfastring.fa \
  2> sleep.stderr ;} \
  2> benchmark/testfiles/bwaminBwtMedTimeIndex.txt
!{ time python bwamin.py --mem --bwt\
  --fa benchmark/testfiles/testfastring.fa\
  --fq benchmark/testfiles/testfqstring.fq \
  2> sleep.stderr ;} \
  2> benchmark/testfiles/bwaminBwtMedTimeMem.txt
addTime('benchmark/testfiles/bwaminBwtMedTime')

# Sw
!{ time python bwamin.py --mem --sw\
  --fa benchmark/testfiles/testfastring.fa\
  --fq benchmark/testfiles/testfqstring.fq \
  2> sleep.stderr ;} 2> benchmark/testfiles/bwaminSwMedTime.txt
convertTime('benchmark/testfiles/bwaminSwMedTime.txt')

# Bwa
!{ time bwa index benchmark/testfiles/testfastring.fa \
  2> sleep.stderr ;} 2> benchmark/testfiles/bwaMedTimeIndex.txt
!{ time bwa mem benchmark/testfiles/testfastring.fa\
  benchmark/testfiles/testfqstring.fq \
  2> sleep.stderr ;} 2> benchmark/testfiles/bwaMedTimeMem.txt
addTime('benchmark/testfiles/bwaMedTime')

In [78]:
# Print Medium Results
print('Bwt')
!cat benchmark/testfiles/bwaminBwtMedTime.txt
print('\nSw')
!cat benchmark/testfiles/bwaminSwMedTime.txt
print('\nBwa')
!cat benchmark/testfiles/bwaMedTime.txt

Bwt
real	0.56
user	0.477
sys	0.075

Sw
real	0.526
user	0.474
sys	0.048

Bwa
real	0.024
user	0.003
sys	0.013000000000000001


### Long

This is an Ebola virus dataset (run SRR10769501) found using the SRA.

In [79]:
# Peek at long reference and read
print('SRR10769501.fasta.fixed')
!head benchmark/mydata/SRR10769501.fasta.fixed
print('\n\nSRR10769501.fastq')
!head benchmark/mydata/SRR10769501.fastq

SRR10769501.fasta.fixed
>@SRR10769501.1.1 M02486:32:000000000-BTHFB:1:1101:16201:1720 length=160
GCCGTAGCCCTGCTCGCCAGCGCGTAGCGGTGTCGTTTCCGTAGCGTCATCTTCGTCATCATTATTTCCAGTGGGTTCCTCGTTTTCACTCGCATTCGTGTCTTCGTCTTCCACCTTGCGAACAAAGTCTTTCTTCCCCCGGATCGCAAAGAGCTCCAGC
>@SRR10769501.2.1 M02486:32:000000000-BTHFB:1:1101:21365:1807 length=160
GGGCTTCACGGGCTTGCGGCGTTTCCACGCCGTGGTCAACGGCGTTGCGCAGCAGGTGGCCCAGCGGGGCATCGAGCATCTGTCTCTTATACACATCTCCGACCCCACTCTACAGGCACAAATCTCCTATTCCGTCTTCTTCTTTAAAACAAAAACCCCC
>@SRR10769501.3.1 M02486:32:000000000-BTHFB:1:1101:22516:1837 length=160
CTGCCGGCATTCAAAAAGGTCGGTGTTGCCGGTTGATAACGCTGTTCGATCATTTCCCTCTCCAGTCTTTTAGCTGTCTCTTATACACATCTCCGAGCCCACGCTACCGTCACACCTCTCGTATTCCTTCTTCTCCTTGAACTAAAAAACCTCCCCCCCC
>@SRR10769501.4.1 M02486:32:000000000-BTHFB:1:1101:17054:1951 length=160
ACTCCCCTTCTTGGTCCGGGAAGCCGACGCCGGTCATGTACGTCGTCAGGGCATCGACCCAGACATACATGACCTGTCTCTTATACACATCTCCGAGCCCACGAGACAGGCAGAAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACAAAAACACA
>@SRR10769501.5.1 M02486:32:000000000-BT

In [82]:
%%capture
# Bwt
!{ time python bwamin.py --index --bwt\
  --fa benchmark/mydata/SRR10769501.fasta.fixed \
  2> sleep.stderr ;} \
  2> benchmark/mydata/bwaminBwtLongTimeIndex.txt
!{ time python bwamin.py --mem --bwt\
  --fa benchmark/mydata/SRR10769501.fasta.fixed\
  --fq benchmark/mydata/SRR10769501.fastq \
  2> sleep.stderr ;} \
  2> benchmark/mydata/bwaminBwtLongTimeMem.txt
addTime('benchmark/mydata/bwaminBwtLongTime')



EmptyDataError: No columns to parse from file
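
pandas raises `EmptyDataError` when `read_csv` is handed a file with no content, so at least one of the two timing files read by `addTime` was most likely empty when this cell ran. Below is a minimal sketch of a guard that could run before parsing. `has_timing_data` is a hypothetical helper, and the temporary files only stand in for the real timing files.

```python
import os
import tempfile


def has_timing_data(path):
    """Return True if `path` exists and holds non-whitespace text (hypothetical helper)."""
    if not os.path.isfile(path):
        return False
    with open(path) as handle:
        return bool(handle.read().strip())


# Demonstrate on throwaway files rather than the real benchmark outputs.
with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, 'emptyTime.txt')
    filled = os.path.join(tmp, 'filledTime.txt')
    open(empty, 'w').close()
    with open(filled, 'w') as handle:
        handle.write('real\t0m0.543s\n')
    empty_ok = has_timing_data(empty)
    filled_ok = has_timing_data(filled)

print(empty_ok, filled_ok)  # False True
```

Checking both the `Index.txt` and `Mem.txt` files this way before calling `addTime` would show which step produced no timing output.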

In [None]:
%%capture
# NOTE: This process gets killed even with 48 GB of memory. Do not run it.
# Sw
!{ time python bwamin.py --mem --sw\
  --fa benchmark/mydata/SRR10769501.fasta.fixed\
  --fq benchmark/mydata/SRR10769501.fastq \
  2> sleep.stderr ;} 2> benchmark/mydata/bwaminSwLongTime.txt
convertTime('benchmark/mydata/bwaminSwLongTime.txt')

In [80]:
%%capture
# Bwa
!{ time bwa index benchmark/mydata/SRR10769501.fasta.fixed \
  2> sleep.stderr ;} 2> benchmark/mydata/bwaLongTimeIndex.txt
!{ time bwa mem benchmark/mydata/SRR10769501.fasta.fixed\
  benchmark/mydata/SRR10769501.fastq \
  2> sleep.stderr ;} 2> benchmark/mydata/bwaLongTimeMem.txt
addTime('benchmark/mydata/bwaLongTime')

In [81]:
# Print Long Results
print('Bwt')
!cat benchmark/mydata/bwaminBwtLongTime.txt
print('\nSw')
!cat benchmark/mydata/bwaminSwLongTime.txt
print('\nBwa')
!cat benchmark/mydata/bwaLongTime.txt

Bwt
real	0.586
user	0.487
sys	0.091

Sw

Bwa
real	10.033
user	6.7
sys	1.33
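
To put the three result sets side by side, the sketch below divides the recorded Bwt `real` time by the recorded Bwa `real` time for each input. The numbers are copied as printed in the outputs above; the dictionary layout and the name `real_seconds` are only for illustration. Sw is left out because it has no result for the long input.

```python
# Recorded `real` times in seconds, copied from the printed results above.
real_seconds = {
    'short':  {'bwt': 0.543, 'bwa': 0.022},
    'medium': {'bwt': 0.56,  'bwa': 0.024},
    'long':   {'bwt': 0.586, 'bwa': 10.033},
}

# A ratio above 1 means the recorded Bwt time is larger than the Bwa time.
ratios = {name: t['bwt'] / t['bwa'] for name, t in real_seconds.items()}
for name, ratio in ratios.items():
    print(f'{name}: {ratio:.2f}')
```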


In [None]:
### In-Class sample


### Graphs
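
A minimal sketch of how the timing files could be turned into a graph, assuming the `*Time.txt` files written by `addTime` and `convertTime` hold tab-separated `label<TAB>seconds` rows as printed above. It reads the `real` row from each file and draws a grouped bar chart with seaborn. The helper names `load_real_seconds` and `plot_real_times` are hypothetical, and the demo file is a throwaway copy of the short Bwt result.

```python
import csv
import os
import tempfile


def load_real_seconds(path):
    """Return the `real` time in seconds from a converted timing file (hypothetical helper)."""
    with open(path, newline='') as handle:
        for row in csv.reader(handle, delimiter='\t'):
            if len(row) == 2 and row[0] == 'real':
                return float(row[1])
    raise ValueError(f'no "real" row in {path}')


def plot_real_times(files):
    """Bar chart of real time per tool and input; `files` maps (tool, size) to a path."""
    import pandas as pd   # imported here so the loader works without plotting libraries
    import seaborn as sns

    rows = [{'tool': tool, 'input': size, 'seconds': load_real_seconds(path)}
            for (tool, size), path in files.items()]
    ax = sns.barplot(data=pd.DataFrame(rows), x='input', y='seconds', hue='tool')
    ax.set_yscale('log')  # the recorded times span several orders of magnitude
    return ax


# Demo of the loader on a throwaway file with the short Bwt rows shown earlier.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, 'demoTime.txt')
    with open(demo, 'w') as handle:
        handle.write('real\t0.543\nuser\t0.42\nsys\t0.077\n')
    demo_real = load_real_seconds(demo)

print(demo_real)  # 0.543
```

For example, `plot_real_times({('Bwt', 'short'): 'benchmark/testfiles/bwaminBwtShortTime.txt', ('Bwa', 'short'): 'benchmark/testfiles/bwaShortTime.txt'})` would plot the two short-input results once those files exist.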

## Memory Section <a name="memory"></a>