In [24]:
import pytest
import ipytest
ipytest.autoconfig()

## Why write tests
1. Save hours of detecting bugs or inconsistencies, especially if the software is continually developed
2. Smoother deployment, happier users, more citations?
3. Tests help document your software by defining its intended behavior

## What are tests

### Some *general* rules
1. A test should ideally focus on one tiny bit of functionality
2. Each test should be independent, capable of being run alone; i.e. each test gets a fresh dataset
3. Tests should ideally run fast
4. Run tests before and/or after a coding session to ensure nothing was broken, and certainly before pushing to a shared GitHub repository
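
As a minimal sketch of rules 1 and 2, the two tests below each check one behavior and each build their own copy of the data, so they pass in any order. The names `make_dataset`, `test_sort_in_place` and `test_original_order` are hypothetical, chosen only for illustration.

```python
def make_dataset():
    # hypothetical helper: returns a brand-new list on every call
    return [3, 1, 2]

def test_sort_in_place():
    data = make_dataset()
    data.sort()  # modifies only this test's copy
    assert data == [1, 2, 3]

def test_original_order():
    data = make_dataset()  # unaffected by the sort in the other test
    assert data[0] == 3
```

If both tests shared one module-level list instead, `test_original_order` would pass or fail depending on whether the sorting test had already run.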

### Kinds of tests relevant for academic software
- Unit testing: checks that small, individual parts of code work properly
- End-to-end testing: tests the full workflow
- Functional testing: checks that the program behaves as expected
    - do variables at any stage have unexpected/unrealistic values?
- Performance testing: checks speed, scalability, and resource usage


**Code coverage**: the fraction of source code executed during testing; higher is generally better
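
To make the idea concrete, here is a small sketch with a hypothetical function `classify` that has two branches. With only the first test, the `n < 0` branch is never executed, so coverage of `classify` is incomplete; adding the second test executes every line.

```python
def classify(n):
    # hypothetical function with two branches
    if n < 0:
        return "negative"
    return "non-negative"

def test_classify_non_negative():
    # on its own, this test never runs the `return "negative"` line
    assert classify(3) == "non-negative"

def test_classify_negative():
    # this test executes the remaining branch
    assert classify(-1) == "negative"
```

Coverage is usually measured by a separate tool, for example the `pytest-cov` plugin, rather than by pytest itself.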

## How to write tests

### [Pytest!](https://docs.pytest.org/en/6.2.x/contents.html)

pytest is a popular Python testing framework and is easy to use.

Typing `pytest` on the command line searches the current directory and its subdirectories for files named test_\*.py or \*_test.py and runs the functions in them whose names are prefixed with "test" (see [here](https://docs.pytest.org/en/6.2.x/goodpractices.html#test-discovery) for more info on test discovery).
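
The file-name rule can be illustrated with a few lines of plain Python. This is only a simplified sketch of the naming convention described above, not pytest's actual discovery code; `looks_like_test_file` and the example file names are hypothetical.

```python
from fnmatch import fnmatch

def looks_like_test_file(filename):
    # mimics the default naming rule: test_*.py or *_test.py
    return fnmatch(filename, "test_*.py") or fnmatch(filename, "*_test.py")

candidates = [
    "test_function.py",   # matches test_*.py
    "function_test.py",   # matches *_test.py
    "function.py",        # no match: not a test file
    "testing.py",         # no match: missing the underscore
    "test_function.txt",  # no match: not a .py file
]
collected = [name for name in candidates if looks_like_test_file(name)]
print(collected)  # ['test_function.py', 'function_test.py']
```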


In [11]:
# function.py
def binom_coeff(x):
    # number of ways to choose 2 items out of x, i.e. "x choose 2"
    return x*(x-1)/2

In [35]:
%%ipytest
# test_function.py

def test_binom_coeff():
    assert binom_coeff(0) == 0
    assert binom_coeff(1) == 0
    assert binom_coeff(2) == 1
    assert binom_coeff(6) == 15

[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.01s[0m[0m


### pytest parameterization

Combine multiple tests into one by giving the test function several sets of arguments. pytest runs and reports each set as a separate test.

In [34]:
%%ipytest
# test_function.py

@pytest.mark.parametrize('n,expected', [
    (0, 0),
    (1, 0),
    (2, 1),
    (6, 15),
])
def test_binom_coeff(n, expected):
    # `n` rather than `input`, to avoid shadowing the built-in input()
    assert binom_coeff(n) == expected

[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                                                                         [100%][0m
[32m[32m[1m4 passed[0m[32m in 0.02s[0m[0m


### pytest fixtures

Fixtures set up resources before a test runs, for instance by creating objects or staging files (shown later). We use a trivial example here to show the syntax first; with more complex functions to test, fixtures can do much more.

Tell pytest that a function is a fixture by decorating it with `@pytest.fixture`.

In [33]:
%%ipytest
# test_function.py

@pytest.fixture
def my_fixture():
    # in practice with more complex functions, fixtures do more
    return 6
    
def test_binom_coeff(my_fixture):
    assert binom_coeff(my_fixture) == 15

[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.01s[0m[0m


In [31]:
%%ipytest
# test_function.py

@pytest.fixture
def my_fixture1():
    return 6

@pytest.fixture
def my_fixture2():
    return 15    
    
def test_binom_coeff(my_fixture1, my_fixture2):
    assert binom_coeff(my_fixture1) == my_fixture2

[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.01s[0m[0m


### A real example

Here is an example of an end-to-end test of a program called "HATCHet" from the Raphael lab. This program takes DNA sequences (in BAM files) from a patient's normal tissue and compares them with sequences from tumor samples to detect regions of the genome that have been duplicated or deleted.

HATCHet takes in these BAM files, has multiple intermediate steps with output, and ultimately outputs a data table as a final result.

In the example below, we 
- use pytest **fixtures** to store the locations of the input BAM files and also the locations of subdirectories in which intermediate results are stored
- use pytest **mark** to skip the end-to-end test if a condition isn't satisfied (since this test takes some time)
- compare the output file of the test function to a file previously produced by a working version of the program.

While we provide the raw code below, the goal is only to show the general structure of the test, which is annotated in the comment blocks delimited by `#####`. You can use this structure to write your own tests!
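
The same three ingredients can be sketched in a few lines with a toy program. Everything here is hypothetical: `run_pipeline` stands in for a multi-step program, and `RUN_SLOW_TESTS` is an invented environment variable used as the skip condition. `tmp_path_factory` is a built-in pytest fixture for creating temporary directories from module-scoped fixtures. For brevity the expected result is written inline rather than read from a stored reference file.

```python
import os
import pytest

def run_pipeline(input_path, output_dir):
    # hypothetical stand-in for a long-running program: sums the input numbers
    with open(input_path) as f:
        values = [int(line) for line in f if line.strip()]
    result_path = os.path.join(output_dir, "result.txt")
    with open(result_path, "w") as f:
        f.write("total\t{}\n".format(sum(values)))
    return result_path

# fixture storing the location of the input file
@pytest.fixture(scope="module")
def input_file(tmp_path_factory):
    path = tmp_path_factory.mktemp("data") / "input.txt"
    path.write_text("1\n2\n3\n")
    return str(path)

# fixture storing the location of the output directory
@pytest.fixture(scope="module")
def output_folder(tmp_path_factory):
    return str(tmp_path_factory.mktemp("out"))

# the end-to-end test, skipped unless the (hypothetical) variable is set
@pytest.mark.skipif(not os.environ.get("RUN_SLOW_TESTS"),
                    reason="RUN_SLOW_TESTS not set")
def test_pipeline(input_file, output_folder):
    result_path = run_pipeline(input_file, output_folder)
    # compare the output with the known-good result
    with open(result_path) as f:
        assert f.read() == "total\t6\n"
```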

In [36]:
# test_script.py
import pytest
import sys
import os
import glob
from io import StringIO
from unittest.mock import patch  # standard library; no separate 'mock' package needed
import shutil
import pandas as pd
from pandas.testing import assert_frame_equal

import hatchet
from hatchet import config
from hatchet.utils.count_reads import main as count_reads
from hatchet.utils.genotype_snps import main as genotype_snps
from hatchet.utils.count_alleles import main as count_alleles
from hatchet.utils.combine_counts import main as combine_counts
from hatchet.utils.cluster_bins import main as cluster_bins
from hatchet.utils.plot_bins import main as plot_bins
from hatchet.bin.HATCHet import main as main
from hatchet.utils.plot_cn import main as plot_cn
from hatchet.utils.solve import solver_available

this_dir = os.path.dirname(__file__)
SOLVE = os.path.join(os.path.dirname(hatchet.__file__), 'solve')

#####
# create 2 objects, normal_bam and tumor_bams, 
# that store the file location of the inputs: 
# the normal tissue BAM and all tumor tissue BAMs
#####
@pytest.fixture(scope='module')
def bams():
    bam_directory = config.tests.bam_directory
    normal_bam = os.path.join(bam_directory, 'normal.bam')
    if not os.path.exists(normal_bam):
        pytest.skip('File not found: {}'.format(normal_bam))
    tumor_bams = sorted([f for f in glob.glob(bam_directory + '/*.bam') if os.path.basename(f) != 'normal.bam'])
    if not tumor_bams:
        pytest.skip('No tumor bams found in {}'.format(bam_directory))

    return normal_bam, tumor_bams

#####
# create subdirectories to contain the output of all the intermediate steps
#####
@pytest.fixture(scope='module')
def output_folder():
    out = os.path.join(this_dir, 'out')
    shutil.rmtree(out, ignore_errors=True)
    for sub_folder in ('bin', 'snps', 'baf', 'bb', 'bbc', 'plots', 'results', 'evaluation', 'analysis'):
        os.makedirs(os.path.join(out, sub_folder))
    return out

#####
# the end-to-end test, which is skipped if a variable is not defined
#####
@pytest.mark.skipif(not config.paths.reference, reason='paths.reference not set')
@patch('hatchet.utils.ArgParsing.extractChromosomes', return_value=['chr22'])
def test_script(_, bams, output_folder):
    normal_bam, tumor_bams = bams

    count_reads(
        args=[
            '-N', normal_bam,
            '-T'
        ] + tumor_bams + [
            '-b', '50kb',
            '-st', config.paths.samtools,
            '-S', 'Normal', 'Tumor1', 'Tumor2', 'Tumor3',
            '-g', config.paths.reference,
            '-j', '12',
            '-q', '11',
            '-O', os.path.join(output_folder, 'bin/normal.1bed'),
            '-o', os.path.join(output_folder, 'bin/bulk.1bed'),
            '-v'
        ]
    )

    genotype_snps(
        args=[
            '-N', normal_bam,
            '-r', config.paths.reference,
            '-c', '290',  # min reads
            '-C', '300',  # max reads
            '-R', '',
            '-o', os.path.join(output_folder, 'snps'),
            '-st', config.paths.samtools,
            '-bt', config.paths.bcftools,
            '-j', '1'
        ]
    )

    count_alleles(
        args=[
            '-bt', config.paths.bcftools,
            '-st', config.paths.samtools,
            '-N', normal_bam,
            '-T'
        ] + tumor_bams + [
            '-S', 'Normal', 'Tumor1', 'Tumor2', 'Tumor3',
            '-r', config.paths.reference,
            '-j', '12',
            '-q', '11',
            '-Q', '11',
            '-U', '11',
            '-c', '8',
            '-C', '300',
            '-O', os.path.join(output_folder, 'baf/normal.1bed'),
            '-o', os.path.join(output_folder, 'baf/bulk.1bed'),
            '-L', os.path.join(output_folder, 'snps', 'chr22.vcf.gz'),
            '-v'
        ]
    )

    _stdout = sys.stdout
    sys.stdout = StringIO()

    combine_counts(args=[
        '-c', os.path.join(output_folder, 'bin/normal.1bed'),
        '-C', os.path.join(output_folder, 'bin/bulk.1bed'),
        '-B', os.path.join(output_folder, 'baf/bulk.1bed'),
        '-e', '12'
    ])

    out = sys.stdout.getvalue()
    sys.stdout.close()
    sys.stdout = _stdout

    with open(os.path.join(output_folder, 'bb/bulk.bb'), 'w') as f:
        f.write(out)

    cluster_bins(args=[
        os.path.join(output_folder, 'bb/bulk.bb'),
        '-o', os.path.join(output_folder, 'bbc/bulk.seg'),
        '-O', os.path.join(output_folder, 'bbc/bulk.bbc'),
        '-e', '22171',  # random seed
        '-tB', '0.04',
        '-tR', '0.15',
        '-d', '0.4'
    ])

    df1 = pd.read_csv(os.path.join(output_folder, 'bbc/bulk.seg'), sep='\t')
    df2 = pd.read_csv(os.path.join(this_dir, 'data', 'bulk.seg'), sep='\t')
    assert_frame_equal(df1, df2)

    plot_bins(args=[
        os.path.join(output_folder, 'bbc/bulk.bbc'),
        '--rundir', os.path.join(output_folder, 'plots')
    ])

    if solver_available():
        main(args=[
            SOLVE,
            '-x', os.path.join(output_folder, 'results'),
            '-i', os.path.join(output_folder, 'bbc/bulk'),
            '-n2',
            '-p', '400',
            '-v', '3',
            '-u', '0.03',
            '-r', '6700',  # random seed
            '-j', '8',
            '-eD', '6',
            '-eT', '12',
            '-g', '0.35',
            '-l', '0.6'
        ])

        #####
        # compare the final results of this test to those 
        # previously generated using the same input
        # using assert_frame_equal
        #####
        df1 = pd.read_csv(os.path.join(output_folder, 'results/best.bbc.ucn'), sep='\t')
        df2 = pd.read_csv(os.path.join(this_dir, 'data', 'best.bbc.ucn'), sep='\t')
        assert_frame_equal(df1, df2)

        plot_cn(args=[
            os.path.join(output_folder, 'results/best.bbc.ucn'),
            '--rundir', os.path.join(output_folder, 'evaluation')
        ])

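
The test above also uses `patch` to replace `extractChromosomes` with a stand-in that returns `['chr22']`. The stand-in is passed to the test as its first argument, which is named `_` there because it is not used. A minimal sketch of the same mechanism, using a hypothetical function `list_inputs` and patching `os.listdir` so that no real directory is needed:

```python
import os
from unittest.mock import patch

def list_inputs(directory):
    # hypothetical function under test: sorted .bam files in a directory
    return sorted(name for name in os.listdir(directory) if name.endswith(".bam"))

# while the test runs, os.listdir is replaced by a mock that returns this list
@patch("os.listdir", return_value=["tumor2.bam", "normal.bam", "notes.txt"])
def test_list_inputs(mock_listdir):
    assert list_inputs("any/path") == ["normal.bam", "tumor2.bam"]
    # the mock records how it was called
    mock_listdir.assert_called_once_with("any/path")
```

The original `os.listdir` is restored automatically when the test function returns.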