# Runtime benchmarks

This notebook is designed to calculate benchmark runtimes on the ORIO website.

Note that benchmarks are highly dependent on the number of processors available; the feature-list count matrices are computed in parallel using a task manager, and are also cached so they don't need to be recalculated in the future.

This script is designed for re-runs.

## User inputs, modify environment:

Set environment variables as needed before running:

```bash
export "ORIO_BENCHMARK_EMAIL=foo@bar.com"
export "ORIO_BENCHMARK_FEATURELIST=/path/to/hg19_fake.filtered.bed"
```
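
The script reads both variables with `os.environ[...]`, so a missing one only shows up as a `KeyError` at startup. A minimal pre-flight sketch follows; `missing_vars` is a hypothetical helper and not part of the notebook, and it treats an empty value as missing, which is an assumption.

```python
import os

# Names taken from the export statements above.
REQUIRED_VARS = ("ORIO_BENCHMARK_EMAIL", "ORIO_BENCHMARK_FEATURELIST")


def missing_vars(environ=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling `missing_vars()` before the startup cell returns an empty list when both variables are set.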

To execute:

```bash
nohup python ./benchmarkReruns.py &> benchmarkReruns.out &
```

To get runtime results (dump django model):

```bash
python manage.py dumpdata analysis.Analysis -o ./analysis.json
```

## Startup

In [None]:
from io import BytesIO
import os 
import time

import numpy as np

import django
from django.core.files import File 

django.setup()
from analysis import models
from myuser.models import User

In [None]:
# setup user inputs
email = os.environ['ORIO_BENCHMARK_EMAIL']
bigFeatureList = os.environ['ORIO_BENCHMARK_FEATURELIST']
replicates = 3

# tuple of (features, datasets) for re-runs
pairs = [
    (500, 10),
    (100000, 10),
    (100000, 750)
]

In [None]:
featureNs = list(set([d[0] for d in pairs]))
datasetNs = list(set([d[1] for d in pairs]))

## Clear old benchmark results

In [None]:
user = User.objects.get(email=email)
models.FeatureList.objects\
    .filter(owner=user, name__icontains='benchmarking:')\
    .delete()

## Create feature lists

We take a list of over 130,000 features, and then randomly select a subset of features from this master set. Then, we create a list of FeatureLists, each with a different number of features.
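
The sampling step can be illustrated on toy data. This is a minimal sketch that swaps NumPy for the standard-library `random` module so it runs on its own; `sample_lines` and the three BED-like lines are made up for illustration.

```python
import random
from io import BytesIO


def sample_lines(lines, size, seed=None):
    """Pick `size` distinct lines and return them as an in-memory binary file."""
    rng = random.Random(seed)
    chosen = rng.sample(lines, size)  # sampling without replacement
    buf = BytesIO("".join(chosen).encode())
    buf.seek(0)  # make sure readers start at the beginning
    return buf


# Toy master list; the real file has over 130,000 lines.
master = ["chr1\t10\t20\n", "chr1\t30\t40\n", "chr2\t50\t60\n"]
subset = sample_lines(master, 2, seed=0).read().decode().splitlines(keepends=True)
```

Each sampled line appears at most once, and the returned buffer can be wrapped in a Django `File` in the same way as a file opened from disk.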

In [None]:
# load big feature-list file
with open(bigFeatureList, 'r') as f:
    fls = f.readlines()

fls = np.array(fls)
print('{:,} lines'.format(fls.size))
print('First line: %s ' % fls[0])
print('Last line: %s' % fls[-1])

In [None]:
def getFeatureList(features, size):
    # randomly sample `size` lines without replacement and return
    # them as an in-memory binary file, rewound to the start
    fl = features[np.random.choice(features.size, size, replace=False)]
    f = BytesIO()
    f.write(''.join(fl.tolist()).encode())
    f.seek(0)
    return f

In [None]:
# create feature-list objects in Django
featureLists = {}
for n in featureNs:
    name = "benchmarking: {} features".format(n)
    fl = models.FeatureList.objects.create(
        owner=user,
        name=name,
        stranded=True,
        genome_assembly_id=1,  # hg19
    )    
    fl.dataset.save(name+'.txt', File(getFeatureList(fls, n)))
    fl.save()
    fl.validate_and_save()
    assert fl.validated is True
    featureLists[n] = fl

In [None]:
# delete existing feature-list count matrices; required because
# by default a matrix is re-used after its initial execution,
# which would change the benchmarking behavior.
def deleteFlcm():
    fls = list(featureLists.values())
    models.FeatureListCountMatrix.objects\
        .filter(feature_list__in=fls)\
        .delete()

## Generate random dataset collections

We randomly select subsets of ENCODE datasets of varying sizes. To make the dataset collections more uniform for benchmarking, we first select the largest subset, and then iteratively select each smaller subset from the previous one. That way, the smallest subset is guaranteed to consist of datasets that were also run in every larger collection.

The end result is a list of datasets, going from smallest to largest.
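
The nesting idea can be sketched independently of Django. This is an illustrative version using the standard-library `random` module rather than NumPy; `nested_subsets` and the integer ids are hypothetical stand-ins for the real dataset records.

```python
import random


def nested_subsets(items, sizes, seed=None):
    """Map each size to a random subset, drawing every smaller subset from the next larger one."""
    rng = random.Random(seed)
    pool = list(items)
    subsets = {}
    for n in sorted(sizes, reverse=True):  # largest first
        pool = rng.sample(pool, n)  # sample from the previous (larger) subset
        subsets[n] = pool
    return subsets


# Made-up dataset ids standing in for ENCODE datasets.
subsets = nested_subsets(range(1000), [10, 750], seed=0)
```

Because each draw is taken from the previous result, every dataset in the 10-item collection is also present in the 750-item collection.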

In [None]:
# get available datasets
datasetLists = {}
datasets = list(models.EncodeDataset.objects\
    .filter(genome_assembly_id=1)\
    .values_list('id', 'name'))

# create nested subsets, largest first; each smaller subset
# is drawn from the previous (larger) one
for n in reversed(sorted(datasetNs)):
    subset_ids = np.random.choice(len(datasets), n, replace=False)
    subset = [datasets[i] for i in subset_ids] 
    datasetLists[n] = [
        dict(dataset=d[0], display_name=d[1]) 
        for d in subset
    ]
    datasets = subset

## Create analyses

We create and validate our analyses. There will be a total of $p * k$ analyses, where $p$ is the number of (features, datasets) pairs listed in `pairs` and $k$ is the number of replicates for each pair.
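
As a worked count, assuming the creation loop iterates over `pairs` once per replicate (as the cell below does), the number of analyses created can be smaller than the full grid of feature-list and dataset-list sizes. The values are copied from the user-input cell; the variable names are only for illustration.

```python
# Values copied from the user-input cell.
replicates = 3
pairs = [(500, 10), (100000, 10), (100000, 750)]

n_feature_lists = len({p[0] for p in pairs})  # distinct feature-list sizes
n_dataset_lists = len({p[1] for p in pairs})  # distinct dataset-list sizes

# Every size combination, repeated per replicate.
full_grid = n_feature_lists * n_dataset_lists * replicates
# Only the listed pairs, repeated per replicate.
created = len(pairs) * replicates
```

With these inputs, `full_grid` is 12 and `created` is 9, because the (500, 750) combination is not in `pairs`.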

In [None]:
# create analyses
analyses = []
for rep in range(replicates):
    for n_fl, n_ds in pairs:    
        a = models.Analysis.objects.create(
            owner=user,
            name="benchmarking: {} features, {} datasets".format(n_fl, n_ds),
            genome_assembly_id=1,  # hg19
            feature_list=featureLists[n_fl],
        )
        a.save()        
        objects = [
            models.AnalysisDatasets(
                analysis_id=a.id,
                dataset_id=d['dataset'],
                display_name=d['display_name'],
            ) for d in datasetLists[n_ds]
        ]
        models.AnalysisDatasets.objects.bulk_create(objects)
        a.validate_and_save()
        assert a.validated is True
        analyses.append(a)

## Execution

Now, we iteratively execute each analysis. We don't start the next analysis until the previous has finished.
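
The wait-until-finished step is a plain polling loop. A minimal sketch of the pattern with an optional timeout is shown here; `wait_until` is a hypothetical helper, and the timeout is an addition that the notebook's own loop does not have.

```python
import time


def wait_until(is_done, interval=3.0, timeout=None):
    """Poll `is_done()` every `interval` seconds; return False if `timeout` seconds pass first."""
    start = time.monotonic()
    while not is_done():
        if timeout is not None and time.monotonic() - start >= timeout:
            return False  # gave up before the condition became true
        time.sleep(interval)
    return True
```

In this notebook the predicate could be something like `lambda: models.Analysis.objects.get(id=analysis.id).is_complete`, which re-reads the analysis from the database on every poll.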

Results are saved, and then transformed into a pandas DataFrame, and exported.

In [None]:
# execute
results = []
for i, analysis in enumerate(analyses):
    print('Running {} of {}...'.format(i+1, len(analyses)))
    deleteFlcm()
    analysis.execute(silent=True)
    while True:
        time.sleep(3)
        a = models.Analysis.objects.get(id=analysis.id)
        if a.is_complete:
            break
            
print('complete!')