# Sample to Sample Distance Ladder

This analysis shows the distance (amount of change) between pairs of samples.

Several distance functions are available for comparing a pair of samples. Choose the one that suits your objective through the `dist_func` argument.
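To give a rough sense of what the listed distance functions measure, the sketch below treats each sample as a dictionary mapping a CDR3 sequence to its frequency. That representation, the helper names, and the toy data are assumptions made for illustration; this is not the `sample_distances` implementation.

```python
def lp_distance(a, b, p):
    """L_p distance between two sparse frequency vectors stored as dicts."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) ** p for k in keys) ** (1.0 / p)


def linfty_distance(a, b):
    """Largest absolute frequency difference over all CDR3s."""
    keys = set(a) | set(b)
    return max((abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys), default=0.0)


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the two sets of CDR3s (frequencies ignored)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def weighted_jaccard_distance(a, b):
    """1 - sum(min) / sum(max) over the frequencies of all CDR3s."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    if denominator == 0:
        return 0.0
    return 1.0 - numerator / denominator


# Hypothetical toy samples: CDR3 sequence -> frequency.
s1 = {'CASSL': 0.5, 'CASSP': 0.5}
s2 = {'CASSL': 0.25, 'CASSQ': 0.75}

print('l1      ', lp_distance(s1, s2, 1))
print('l2      ', lp_distance(s1, s2, 2))
print('linfty  ', linfty_distance(s1, s2))
print('jaccard ', jaccard_distance(s1, s2))
print('weighted', weighted_jaccard_distance(s1, s2))
```

The plain Jaccard distance only asks whether a CDR3 is present, while the weighted variant and the `lp` family also respond to how much each frequency changed.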

In the output, the number immediately to the right of a sample is the distance between that sample and the one directly below it. The number two spots to the right is the distance between that sample and the one two spots below it, and so on, up to `ladder_width` columns.
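The ladder layout described above can be sketched as follows. This is a minimal illustration: the function `distance_ladder`, the toy vectors, and the printed layout are hypothetical, and the real `main_sample_to_sample` output may be formatted differently.

```python
def distance_ladder(samples, dist_func, ladder_width):
    """Row i holds the distances from sample i to samples i+1 .. i+ladder_width."""
    rows = []
    for i in range(len(samples)):
        last = min(i + 1 + ladder_width, len(samples))
        rows.append([dist_func(samples[i], samples[j]) for j in range(i + 1, last)])
    return rows


def l1(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical samples as small count vectors, listed in time order.
names = ['day 0', 'day 7', 'day 28', 'month 4']
samples = [[1, 0], [0, 1], [0, 3], [2, 3]]

ladder = distance_ladder(samples, l1, ladder_width=2)
for name, row in zip(names, ladder):
    print(name.ljust(8), *row)
```

Each row gets shorter toward the bottom because later samples have fewer samples below them to compare against.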

In [None]:
import main_sample_to_sample
import sample_distances

print('computation starting...\n\n')
main_sample_to_sample.calculate_combination(
    # distance function used to compute the distance between each pair of samples
    # valid inputs: l2 (Euclidean), lp(p) (input any integer >= 1 for p), linfty, jaccard, weighted_jaccard
    dist_func=sample_distances.lp(1),
    # list the samples you want to compare. they will be compared in the order you put them.
    file_names=[
        'cdr3.b.A_2017_2018_d_00_53535.ann',
        'cdr3.b.A_2017_2018_d_07_11143.ann',
        'cdr3.b.A_2017_2018_d_28_44887.ann',
        'cdr3.b.A_2017_2018_m_04_73516.ann',
        # 'cdr3.b.A_2019_2020_d_00_20857.ann',
    ],
    # the maximum number of columns in the distance ladder
    ladder_width=3,
)
print('\n\ncomputation complete.')


# CDR3s to Sample

Use the distances from a small number of CDR3s to each sample in order to guess which sample the CDR3s came from.

For example, we could have 5 people and 1 sample from each person.  We then remove a CDR3 at random from one of these samples (and to be fair, we remove the same CDR3 from all the other samples if it occurs).  We then look at the distance between this CDR3 and the remaining CDR3s in each sample, and we see which sample is closest.  We **guess** that this is the sample that the CDR3 was originally removed from.  Finally, we rerun the whole process `num_trials_per_sample` times for each sample and report the percentage of guesses that were correct (the **accuracy**).
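The procedure above can be sketched in a few lines. Everything here is an assumption made for illustration: the function names, the toy distance `mismatch_distance`, the fixed `min` aggregation, and the made-up sequences. It is not the `main_cdr3s_to_sample` implementation.

```python
import random


def mismatch_distance(a, b):
    """Toy distance: number of differing positions plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def guess_sample(cdr3, samples, cdr3_dist):
    """Return the label of the sample whose remaining CDR3s are closest to cdr3."""
    scores = {}
    for label, seqs in samples.items():
        # The held-out CDR3 is removed from every sample, not only its own.
        remaining = [s for s in seqs if s != cdr3]
        if remaining:
            scores[label] = min(cdr3_dist(cdr3, s) for s in remaining)
    return min(scores, key=scores.get)


def estimate_accuracy(samples, cdr3_dist, num_trials_per_sample, seed=0):
    """Fraction of held-out CDR3s that are assigned back to their own sample."""
    rng = random.Random(seed)
    correct = 0
    total = 0
    for true_label in sorted(samples):
        for _ in range(num_trials_per_sample):
            held_out = rng.choice(sorted(samples[true_label]))
            correct += guess_sample(held_out, samples, cdr3_dist) == true_label
            total += 1
    return correct / total


# Hypothetical samples: person label -> set of (made-up) sequences.
toy_samples = {
    'A': {'AAAA', 'AAAC', 'AAAG'},
    'B': {'TTTT', 'TTTC', 'TTTG'},
}
accuracy = estimate_accuracy(toy_samples, mismatch_distance, num_trials_per_sample=3)
print('accuracy:', accuracy)
```

The toy samples are deliberately well separated, so every guess lands on the right person. Real repertoires overlap far more, which is why the accuracy is worth measuring.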

In [None]:
import main_cdr3s_to_sample

print('computation starting...')
main_cdr3s_to_sample.calculate_combination(
    # the file names of the samples you want to guess among
    file_names={
        'A': ['cdr3.a.A_2000_2001_d_00_47407.ann'],
        'B': ['cdr3.a.B_2017_2018_d_00_32483.ann'],
        'C': ['cdr3.a.C_2017_2018_d_00_26898.ann'],
        # 'D': ['cdr3.a.D_2017_2018_d_00_45294.ann'],
        # 'E': ['cdr3.a.E_2017_2018_d_00_94077.ann'],
    },
    # the number of times to run the simulation per sample
    num_trials_per_sample=3,
    # the ngram 'n' to use on the CDR3 sequences before applying the distance function
    n_gram_len=1,
    # the distance function to use on pairs of CDR3s
    inner_dist_func_name='jaccard',
    # the aggregation function to use on cdr3 distances in order to obtain a single cdr3--sample distance
    dist_agg_func_name='min',
    # the number of CDR3s to use per guess. more CDR3s provide more information and should increase accuracy, but also increase computation time
    num_cdr3s=1,
)
print('computation complete.')
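To make the roles of `n_gram_len`, `inner_dist_func_name` and `dist_agg_func_name` more concrete, here is a hedged sketch of one way an n-gram Jaccard distance between two CDR3s could be aggregated into a single CDR3-to-sample distance. The helper names and the example sequences are hypothetical, and the library may define these steps differently.

```python
def ngrams(seq, n):
    """Set of all length-n substrings of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_jaccard(a, b, n):
    """Jaccard distance between the n-gram sets of two sequences."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    if not union:
        return 0.0
    return 1.0 - len(ga & gb) / len(union)


def cdr3_to_sample_distance(cdr3, sample_cdr3s, n, agg=min):
    """Aggregate the pairwise CDR3 distances into one CDR3-to-sample distance."""
    return agg(ngram_jaccard(cdr3, other, n) for other in sample_cdr3s)


# Hypothetical query CDR3 and sample contents.
query = 'CASSL'
sample = ['CASSP', 'CAWSV']

print(sorted(ngrams(query, 2)))
print(cdr3_to_sample_distance(query, sample, n=2, agg=min))
print(cdr3_to_sample_distance(query, sample, n=2, agg=max))
```

With `min` as the aggregation, a sample counts as close when it contains at least one similar CDR3. `max` would instead require every CDR3 in the sample to be similar.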


# CDR3 Lifespan

Run the following cell to plot how CDR3 frequencies change over time.

In [None]:
import data_utils
import main_cdr3_lifespan

print('computation starting...')
sample = data_utils.get_cdr3_counter_from_file('s', 'cdr3.b.A_2019_2020_d_00_20857.ann')
cdr3s = [sample.get_cdr3_by_rank(r) for r in range(1, 5)]  # <- top 4 cdr3 sequences occurring in cdr3.b.A_2019_2020_d_00_20857.ann
main_cdr3_lifespan.calculate_one(
    # list of cdr3 sequences you want to graph
    cdr3s=cdr3s,
    # Sample file names from which to grab cdr3 frequency data.
    # Typically all files are from the SAME person.
    # Each file will correspond to a single date on the x-axis of the graph.
    file_names=[
      'cdr3.b.A_2017_2018_d_00_53535.ann',
      'cdr3.b.A_2017_2018_d_07_11143.ann',
      'cdr3.b.A_2017_2018_d_28_44887.ann',
      'cdr3.b.A_2017_2018_m_04_73516.ann',
      'cdr3.b.A_2019_2020_d_00_20857.ann',
    ],
    # whether or not to display the cdr3 legend in the graph
    show_legend=True,
)
print('computation complete.')
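The data behind such a graph can be sketched as one frequency series per CDR3, with one value per sample file. This sketch assumes that "frequency" means a CDR3's count divided by the total count in that sample. The function `frequency_series` and the toy counts are hypothetical and do not reflect how `main_cdr3_lifespan` is implemented.

```python
from collections import Counter


def frequency_series(cdr3s, counters):
    """For each CDR3, its relative frequency in each sample, in the given order."""
    series = {}
    for cdr3 in cdr3s:
        series[cdr3] = [
            counter[cdr3] / sum(counter.values()) if counter else 0.0
            for counter in counters
        ]
    return series


# Hypothetical CDR3 counts for one person at three time points.
timepoints = ['d_00', 'd_07', 'd_28']
counters = [
    Counter({'CASSL': 6, 'CASSP': 4}),
    Counter({'CASSL': 3, 'CASSP': 1, 'CAWSV': 4}),
    Counter({'CASSP': 2, 'CAWSV': 8}),
]

series = frequency_series(['CASSL', 'CAWSV'], counters)
for cdr3, freqs in series.items():
    print(cdr3, dict(zip(timepoints, freqs)))
```

A CDR3 that is absent from a sample gets a frequency of 0 at that time point, so a clone that appears or disappears shows up as a line rising from or falling to zero.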
