# Ligation Bias Metric Investigation

Notebook to explore the development of a metric to summarise the degree of ligation bias in a sample.

These metrics are based on how often each dinucleotide occurs at the 5' and 3' ends of Ribo-Seq reads relative to its frequency across all positions in the reads.
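
As a rough illustration of the inputs these metrics work on, the sketch below tallies dinucleotides at read ends and across all read positions for a handful of made-up reads. The helper `dinucleotide_frequencies`, the `"three_prime"` key, and the toy reads are assumptions made for this example; the real `ligation_bias_distribution` module may structure its output differently.

```python
from collections import Counter


def dinucleotide_frequencies(reads):
    """Hypothetical helper: end and background dinucleotide proportions."""
    five_prime = Counter()
    three_prime = Counter()
    background = Counter()

    for read in reads:
        if len(read) < 2:
            continue  # too short to contain a dinucleotide
        five_prime[read[:2]] += 1
        three_prime[read[-2:]] += 1
        # Every overlapping dinucleotide contributes to the background
        for i in range(len(read) - 1):
            background[read[i:i + 2]] += 1

    def normalise(counts):
        total = sum(counts.values())
        if total == 0:
            return {}
        return {dinuc: n / total for dinuc, n in counts.items()}

    observed = {
        "five_prime": normalise(five_prime),
        "three_prime": normalise(three_prime),  # assumed key name
    }
    return observed, normalise(background)


# Toy reads, invented purely for illustration
toy_reads = ["ATGCA", "ATTGC", "GCATA"]
observed, expected = dinucleotide_frequencies(toy_reads)
```

Here `AT` starts two of the three reads but makes up only a quarter of all dinucleotides in them, which is the kind of discrepancy the metrics below are meant to summarise.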

In [None]:
import math

A simple metric for assessing unexpected proportions of a given dinucleotide at the ends of reads is the maximum absolute difference between observed and expected frequencies.

In [None]:
def ligation_bias_max_proportion_metric(
        observed_freq: dict,
        expected_freq: dict,
        prime: str = "five_prime",
        ) -> float:
    """
    Calculate the ligation bias metric from the output of
    the ligation_bias_distribution module.

    This metric is based on the maximum absolute difference between
    the observed and expected frequencies of dinucleotides.

    Inputs:
        observed_freq: Dictionary containing the output of the
                ligation_bias_distribution module
        expected_freq: Dictionary containing the expected frequencies
        prime: Key of observed_freq selecting which read end to score
                (default "five_prime")

    Outputs:
        Float equal to 1 minus the maximum absolute difference, so a
        value of 1 indicates no deviation from the expected frequencies
    """
    scores = {}
    for dinucleotide, observed_prob in observed_freq[prime].items():
        expected_prob = expected_freq[dinucleotide]
        scores[dinucleotide] = abs(
            observed_prob - expected_prob)

    return 1 - max(scores.values())
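
To make the behaviour of the max-difference metric concrete, here is a small self-contained sketch on invented frequencies. The function `max_proportion_score` is a compact stand-in written for this example (it also reports which dinucleotide deviates most, which the function above does not), and the four-dinucleotide alphabet is a simplification.

```python
def max_proportion_score(observed, expected):
    """Stand-in for the metric above: 1 - max |observed - expected|."""
    diffs = {
        dinuc: abs(obs - expected[dinuc])
        for dinuc, obs in observed.items()
    }
    worst = max(diffs, key=diffs.get)  # dinucleotide with the largest deviation
    return 1 - diffs[worst], worst


# Invented frequencies over a reduced alphabet, for illustration only
toy_observed = {"AA": 0.40, "AC": 0.30, "AG": 0.15, "AT": 0.15}
toy_expected = {"AA": 0.25, "AC": 0.25, "AG": 0.25, "AT": 0.25}

score, worst_dinucleotide = max_proportion_score(toy_observed, toy_expected)
print(f"score={score:.2f}, largest deviation at {worst_dinucleotide}")
```

The largest deviation is at `AA` (0.40 observed against 0.25 expected), so the score is roughly 0.85. Because only the single largest deviation is used, two samples with very different overall distributions can receive the same score.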

In [None]:
def ligation_bias_distribution_metric(
        observed_freq: dict,
        expected_freq: dict,
        ) -> float:
    """
    Calculate the ligation bias metric from the output of
    the ligation_bias_distribution module.

    This metric is based on the K-L divergence of the observed
    dinucleotide frequencies from the expected frequencies. The
    expected frequencies are calculated from the nucleotide composition
    of the genome.

    Inputs:
        observed_freq: Dictionary containing the output of the
                ligation_bias_distribution module
        expected_freq: Dictionary containing the expected frequencies

    Outputs:
        Float equal to 1 minus the K-L divergence in bits. The
        divergence is unbounded, so the result can be negative
    """
    # Needs possible rewrite using normalised ligation bias.
    # Current iteration only accounts for five_prime.
    # Division by zero still occurs if a dinucleotide has no background
    # (expected) frequency; only patterns that occur at least once are
    # used (needs to be changed in ligation bias).
    kl_divergence = 0.0

    for dinucleotide, observed_prob in observed_freq["five_prime"].items():
        if observed_prob == 0:
            # log2(0) is undefined; a zero-probability term contributes
            # nothing to the K-L divergence by convention
            continue
        expected_prob = expected_freq[dinucleotide]
        kl_divergence += observed_prob * math.log2(
            observed_prob / expected_prob
        )
    return 1 - kl_divergence
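
The following sketch works through the K-L based score on invented distributions to show its range. `kl_divergence_bits` is a stand-in written for this example; treating a zero observed probability as contributing nothing, and returning infinity when the expected probability is zero, are assumptions about how edge cases might be handled rather than the notebook's settled behaviour.

```python
import math


def kl_divergence_bits(observed, expected):
    """K-L divergence D(observed || expected) in bits, with edge cases."""
    total = 0.0
    for dinuc, obs in observed.items():
        if obs == 0:
            continue  # 0 * log(0) is taken as 0 by convention
        exp = expected.get(dinuc, 0.0)
        if exp == 0:
            return math.inf  # assumption: no background support means infinite divergence
        total += obs * math.log2(obs / exp)
    return total


# Invented distributions over a reduced alphabet, for illustration only
uniform = {"AA": 0.25, "AC": 0.25, "AG": 0.25, "AT": 0.25}
skewed = {"AA": 0.5, "AC": 0.5, "AG": 0.0, "AT": 0.0}
extreme = {"AA": 1.0, "AC": 0.0, "AG": 0.0, "AT": 0.0}

score_uniform = 1 - kl_divergence_bits(uniform, uniform)
score_skewed = 1 - kl_divergence_bits(skewed, uniform)
score_extreme = 1 - kl_divergence_bits(extreme, uniform)
```

With a uniform background, the matching distribution scores 1, the distribution concentrated on two dinucleotides scores 0, and the distribution concentrated on a single dinucleotide scores -1. This suggests that `1 - kl_divergence` is not confined to the range 0 to 1, which may matter if the score is meant to be compared with the max-difference metric.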
