# Putting different metrics of sequence quality on comparable scales

For each generated sequence we want to compare multiple metrics so that sequences can be ranked and the best one selected.
However, each metric may have a different meaning: a high value may be desirable for one metric but not for another.

In each case there needs to be some kind of expectation. Currently, expectations are formed by generating random
sequences, measuring the metric of interest for each sequence, and using the mean value as the expectation. With
this approach, the degree to which a generated sequence differs from the expectation can be measured in
standard deviations by converting the value of the metric to a z score. This still leaves us with the problem
above: a z score alone does not say whether a high or a low value is desirable. It would be good to put all
metrics on a more absolute scale where a specific direction (high or low values) is always desirable, for
easier comparisons.
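
The expectation step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the metric values are drawn from a normal distribution as a stand-in for measurements on randomly generated sequences, and the names `random_metric_values`, `expectation`, `spread`, and `candidate_value` are hypothetical.

```python
import numpy as np

# Stand-in for a metric measured on randomly generated sequences
# (assumption: real values would come from the sequence generator).
rng = np.random.default_rng(0)
random_metric_values = rng.normal(loc=10.0, scale=2.0, size=1000)

# The mean of the random sample is the expectation; the sample standard
# deviation is the unit used to express distance from it.
expectation = random_metric_values.mean()
spread = random_metric_values.std(ddof=1)


def z_score(value, mean=expectation, sd=spread):
    """Distance of `value` from the expectation, in standard deviations."""
    return (value - mean) / sd


# A hypothetical metric value for one generated sequence.
candidate_value = 14.0
candidate_z = z_score(candidate_value)
print(candidate_z)
```

A positive result means the candidate lies above the expectation and a negative result means it lies below; whether either is good still depends on the metric.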

## Metric classes

The desired value for a given metric can be defined with two goals: you either want to maximize or minimize
the distance (in standard deviations) to the mean, and you either want that value to be positive or negative
(greater or less than the mean).

Going forward, we want to produce a scale where large numbers always represent more desirable values.





In [2]:
import numpy as np

example_mean = 10
example_samples = np.array([-2, -5, 4, -3, 1])  # distance to mean in sd


### Positive or negative

Multiply the metric value by `1` if the desired value is above the mean (+) or by `-1` if it is below (-).
In this example, let's say the desired result is to the right of the mean.

In [9]:
direction = 1
dir_samples = example_samples * direction
dir_samples

array([-2, -5,  4, -3,  1])

### Max or min

If the distance from the mean should be maximized, leave the values unchanged; if it should be minimized, take the reciprocal.

In this case we want to minimize the distance.

In [10]:
min_max = lambda x: 1 / x
dir_min_samples = np.vectorize(min_max)(dir_samples)
dir_min_samples


array([-0.5       , -0.2       ,  0.25      , -0.33333333,  1.        ])

Then we can just sort.

In [11]:
np.sort(dir_min_samples)

array([-0.5       , -0.33333333, -0.2       ,  0.25      ,  1.        ])

The largest value, 1, comes from the sample that is to the right of the mean and closest to it on that side.

## Debugging


In [1]:
mean = 10
sd = 1
z_score = lambda x: (x - mean) / sd

In [7]:
vals = 12, 3, 10.1, 7, 14, 8
raw_scores = {val: z_score(val) for val in vals}
raw_scores

{12: 2.0, 3: -7.0, 10.1: 0.09999999999999964, 7: -3.0, 14: 4.0, 8: -2.0}

### Want to max distance to the mean

So we want the largest z score (multiply by one).

### Min distance to mean



In [9]:
min_scores = {val: z ** -1 for val, z in raw_scores.items()}
min_scores


{12: 0.5,
 3: -0.14285714285714285,
 10.1: 10.000000000000036,
 7: -0.3333333333333333,
 14: 0.25,
 8: -0.5}

`1/x` will not work because it does not grow or shrink linearly. Instead, we can use a ranking approach and then only consider values that
fall on the desired half of the distribution.
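
To make the non-linearity concrete, here is a small sketch with made-up z scores (the values are illustrative, not taken from any real metric). It compares the gap between reciprocals for two z scores close to the mean with the gap for two z scores far from it.

```python
# Illustrative z scores on the desired side of the mean, ordered from close to far.
z_scores = [0.1, 0.5, 2.0, 4.0]
reciprocals = [1 / z for z in z_scores]

# The first pair is 0.4 sd apart, the second pair is 2 sd apart.
gap_near_mean = reciprocals[0] - reciprocals[1]
gap_far_from_mean = reciprocals[2] - reciprocals[3]
print(gap_near_mean, gap_far_from_mean)

# A z score just left of the mean gets a large negative reciprocal, so two
# nearly identical sequences can land at opposite ends of the scale.
print(1 / 0.1, 1 / -0.1)
```

The small 0.4 sd difference near the mean turns into a far larger reciprocal gap than the 2 sd difference further out, and a z score of exactly 0 cannot be inverted at all. Ranking by absolute value avoids both problems.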

In [None]:
def rank_max_z_score(scores):
    # rank scores in order to maximize distance from mean
    scores = sorted(scores, key=abs, reverse=True)  # sort by absolute value, largest first
    return scores

def rank_min_z_scores(scores):
    # rank to minimize absolute distance from mean
    scores = sorted(scores, key=abs)  # sort by absolute value, smallest first
    return scores

Just throw out values that are not in the desired direction.

In [None]:
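def rank_scores(scores, direction, divergence):
    # direction: 1 keeps scores above the mean, anything else keeps scores below it
    if direction == 1:
        temp_scores = [s for s in scores if s > 0]
    else:
        temp_scores = [s for s in scores if s < 0]

    # divergence: 1 ranks by maximum distance to the mean, anything else by minimum
    if divergence == 1:  # max distance to mean
        temp_scores = sorted(temp_scores, key=abs, reverse=True)
    else:
        temp_scores = sorted(temp_scores, key=abs)

    # collect all scores that were dropped
    dropped_scores = set(scores) - set(temp_scores)

    # assumption: return the ranked scores together with the dropped ones
    return temp_scores, dropped_scores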
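
One possible way to connect this ranking back to the goal of a scale where large numbers are always more desirable is to map each z score to a single sortable value. The sketch below is only an assumption about how that could look: `desirability` is a hypothetical helper, and scoring values on the wrong side of the mean as negative infinity is one choice among several. The input list is roughly the set of z scores from the Debugging section.

```python
import math


def desirability(z, direction, divergence):
    """Map a z score to a value where larger always means more desirable.

    direction:  1 if values above the mean are wanted, -1 if below.
    divergence: 1 to maximize distance from the mean, anything else to minimize it.
    """
    if z * direction <= 0:
        # Wrong side of the mean (or exactly on it): always ranked last.
        return -math.inf
    distance = abs(z)
    # Far is good when maximizing; near is good when minimizing.
    return distance if divergence == 1 else -distance


z_scores = [2.0, -7.0, 0.1, -3.0, 4.0, -2.0]

# Want values above the mean and as close to it as possible.
ranked = sorted(z_scores, key=lambda z: desirability(z, 1, 0), reverse=True)
print(ranked)
```

With this mapping the same descending sort works for all four combinations of direction and divergence, and the values outside the desired half simply sink to the bottom instead of being removed.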