# Shared clonotype frequency

For every groupwise combination of two or more years, compute the frequency of universally shared clonotypes (that is, clonotypes found in the repertoire of every year in the group).
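
As a rough illustration of what "every groupwise combination of two or more years" means, the sketch below enumerates the groups with `itertools.combinations`. It assumes the four sample labels that this notebook reads from `years.txt`; the variable name `groups` is ours and is not used elsewhere in the notebook.

```python
import itertools

# sample labels, as loaded from ./data/years.txt later in this notebook
years = ['327059-2016', '327059-2020', 'D103-2016', 'D103-2021']

# every combination of 2, 3, ..., len(years) labels
groups = [
    combo
    for size in range(2, len(years) + 1)
    for combo in itertools.combinations(years, size)
]

print(len(groups))  # 6 pairs + 4 triples + 1 quadruple
for group in groups[:3]:
    print(', '.join(group))
```

With four labels this yields 11 groups. A clonotype counts as universally shared for a group only if it appears in every member of that group.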

The following Python packages are required to run the code in this notebook:
  * numpy
  * pandas
  * [abutils](https://github.com/briney/abutils)

They can be installed by running `pip install numpy pandas abutils`.

*NOTE: this notebook requires the Unix command-line tool `wc`, so it needs a Unix-based operating system to run correctly (macOS and most flavors of Linux should be fine). Running this notebook on Windows 10 may be possible using the [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/about), but we have not tested this.*

In [2]:
from collections import Counter
from datetime import datetime
import itertools
import json
import multiprocessing as mp
import os
import subprocess as sp
import sys

import numpy as np
import pandas as pd

from abutils.utils.jobs import monitor_mp_jobs
from abutils.utils.pipeline import list_files, make_dir
from abutils.utils.progbar import progress_bar

### Years, files and directories

In [3]:
# files and directories
dedup_year_dir = './data/dedup_year_clonotype_pools/'
cross_year_occurance_files = list_files('./data/user-calculated_cross-year_clonotype_duplicate-counts/')

# years
with open('./data/years.txt') as f:
    years = sorted(f.read().split())

### Number of unique clonotypes per year

If you'd like to actually count the number of unique clonotypes per year, you can run the code in [**this**](LINK) notebook or download a dataset containing each year's unique clonotypes [**here**](LINK). Note that the decompressed unique clonotype dataset is fairly large (about 8 GB).

All we're doing is counting the number of lines in the unique clonotype file. If you'd rather not download and decompress the data just to count the lines, skip the next block of code.

In [4]:
years

['327059-2016', '327059-2020', 'D103-2016', 'D103-2021']

In [5]:
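# count the lines (one unique clonotype per line) in each year's deduplicated pool
year_sizes = {}
for year in years:
    print(year)
    dedup_file = os.path.join(dedup_year_dir, '{}_dedup_pool_vj-aa.txt'.format(year))
    # pass the arguments as a list so that no shell is needed to parse the path
    p = sp.Popen(['wc', '-l', dedup_file], stdout=sp.PIPE, stderr=sp.PIPE, encoding='utf8')
    stdout, stderr = p.communicate()
    if p.returncode != 0:
        raise RuntimeError('wc failed for {}: {}'.format(dedup_file, stderr.strip()))
    # wc prints "<line count> <file name>"; keep only the count
    year_sizes[year] = int(stdout.strip().split()[0])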

327059-2016
327059-2020
D103-2016
D103-2021


In [6]:
year_sizes

{'327059-2016': 9281663,
 '327059-2020': 3619333,
 'D103-2016': 2917577,
 'D103-2021': 1536004}
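
If `wc` is not available (for example, on Windows without WSL), the same line counts could in principle be obtained in pure Python. The following is a minimal sketch, not part of the original analysis: `count_lines` is a hypothetical helper, and the temporary file only stands in for a real `*_dedup_pool_vj-aa.txt` file.

```python
import os
import tempfile


def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters in a file, reading it in binary chunks."""
    n = 0
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps reading until f.read() returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            n += chunk.count(b'\n')
    return n


# small demonstration on a throwaway three-line file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('a\nb\nc\n')
    demo_path = tmp.name

demo_count = count_lines(demo_path)
os.remove(demo_path)
print(demo_count)
```

Like `wc -l`, this counts newline characters, so a final line without a trailing newline is not counted. Reading in chunks keeps memory use flat even for multi-gigabyte files.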

## Quantify shared clonotypes
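
The cell in this section recovers each group's year labels from the name of its duplicate-count file. Because every label itself contains a hyphen (for example `327059-2016`), the hyphen-separated tokens have to be re-joined in pairs. The sketch below isolates that step; `years_from_filename` and the example file name are hypothetical and only illustrate the naming pattern implied by the parsing code.

```python
import os


def years_from_filename(path, span=2):
    """Split '<label>-<label>-..._<suffix>' into labels of `span` tokens each."""
    tokens = os.path.basename(path).split('_')[0].split('-')
    return ['-'.join(tokens[i:i + span]) for i in range(0, len(tokens), span)]


# hypothetical file name following the assumed pattern
example = './data/327059-2016-D103-2021_occurrence-counts.txt'
parsed = years_from_filename(example)
print(parsed)
```

This relies on every label consisting of exactly two hyphen-separated tokens; labels with a different number of hyphens would be split incorrectly.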

In [25]:
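# frequencies keyed by group size (the number of years in the group)
shared_frequencies_by_group_size = {i + 1: [] for i in range(len(years))}
shared_frequencies_by_group = []

# make sure the output directory exists before writing results
make_dir('./data/shared_clonotypes/')

start_time = datetime.now()
progress_bar(0, len(cross_year_occurance_files), start_time=start_time)

# each year label consists of two hyphen-separated tokens (e.g. '327059-2016')
span = 2

for i, of in enumerate(cross_year_occurance_files):
    # recover the year labels in this group from the file name
    words = os.path.basename(of).split('_')[0].split('-')
    _years = ['-'.join(words[j:j + span]) for j in range(0, len(words), span)]
    # normalize by the smallest repertoire, which caps the number of shared clonotypes
    smallest = min(year_sizes[s] for s in _years)
    # a universally shared clonotype occurs in every year of the group
    min_freq = str(len(_years))
    # reset for each file so a missing row cannot reuse the previous file's count
    count = 0
    with open(of) as f:
        for line in f:
            fields = line.strip().split()
            if not fields:
                continue
            if fields[0] == min_freq:
                count = int(fields[1])
                break
    frequency = 1. * count / smallest
    shared_frequencies_by_group.append('{}: {}'.format(', '.join(_years), 100. * frequency))
    shared_frequencies_by_group_size[len(_years)].append(frequency)
    progress_bar(i + 1, len(cross_year_occurance_files), start_time=start_time)

with open('./data/shared_clonotypes/groupwise_shared_clonotype_frequencies.txt', 'w') as f:
    f.write('\n'.join(shared_frequencies_by_group))

# note: json.dump writes the integer group sizes as string keys ('1', '2', ...)
with open('./data/shared_clonotypes/groupwise_shared_clonotype_frequencies_by-size.json', 'w') as f:
    json.dump(shared_frequencies_by_group_size, f)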

(15/15) ||||||||||||||||||||||||||||||||||||||||||||||||||||  100%  (00:00)  
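
To make the calculation above concrete, here is a small worked sketch using toy numbers. The layout of the duplicate-count lines (`<number of years> <number of clonotypes>`) is inferred from the parsing code in this notebook; the labels, sizes, and counts below are invented purely for illustration, and `shared_frequency` is a hypothetical helper.

```python
def shared_frequency(lines, group_years, year_sizes):
    """Fraction of the smallest repertoire that is shared by every year in the group."""
    target = str(len(group_years))  # row for clonotypes seen in all years of the group
    count = 0
    for line in lines:
        fields = line.split()
        if fields and fields[0] == target:
            count = int(fields[1])
            break
    smallest = min(year_sizes[y] for y in group_years)
    return count / smallest


# toy inputs (not real data)
toy_sizes = {'A-2016': 1000, 'B-2016': 400}
toy_lines = ['1 1360', '', '2 20']  # 20 clonotypes occur in both toy repertoires

freq = shared_frequency(toy_lines, ['A-2016', 'B-2016'], toy_sizes)
print('{:.1%}'.format(freq))
```

Dividing by the smallest repertoire bounds the result between 0 and 1, since a group cannot share more clonotypes than its smallest member contains.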
