# Template switching analysis

This Jupyter notebook contains analyses of template switching rates in the viral diversity libraries from Shin & Urbanek. Analyses correspond to !!!!!!

This notebook measures how often a single UMI is shared between distinct barcode sequences, in order to identify high rates of template switching introduced during PCR. For our bit-based barcodes, there is potential for 20+ bp of overlap between the UMI and each bit, between shared bit 1s, and between shared bit 2s. In each case, we would expect template switching to produce an identical UMI split across multiple barcode sequences.

Input for this notebook:
1. The flat.txt input file for each barcode diversity library of interest

Output for this notebook includes:
1. Plots of unique barcodes detected per UMI by dataset

Modules and their versions used when generating figures for the paper can be found in 'requirements.txt', which is stored in our GitHub repository: !!!!!!

This code was last amended by Maddie Urbanek on !!!!!

## Notebook set-up

In [1]:
#Load in modules:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
from itertools import product
import seaborn as sns
from statsmodels.multivariate.manova import MANOVA

In [2]:
#Set working directory to point to barcode diversity libraries
import os
os.chdir('/Users/maddieurbanek/Desktop/revision_data/resubmission/data/barcode_diversity_libraries/')

## Import datasets

In [None]:
#SADB19 libraries
rep_1=pd.read_table('./sadb19/rep_1_flat.txt',delimiter='\t')
rep_2=pd.read_table('./sadb19/rep_2_flat.txt',delimiter='\t')
rep_3=pd.read_table('./sadb19/rep_3_flat.txt',delimiter='\t')

In [None]:
#CVS-N2c libraries
rep_4=pd.read_table('./cvs/rep_4_flat.txt',delimiter='\t')
rep_5=pd.read_table('./cvs/rep_5_flat.txt',delimiter='\t')
rep_6=pd.read_table('./cvs/rep_6_flat.txt',delimiter='\t')

In [None]:
#Randomer
randomer=pd.read_table('./cvs/randomer_flat.txt',delimiter=',')

## Calculate unique barcodes detected per UMI for each dataset

### Make function

In [None]:
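def calculate_barcodes_per_umi(dataset):
    """Count the unique barcode sequences detected for each UMI.

    Note: adds the parsed columns to `dataset` in place.
    """
    #Rename the two input columns, then parse the read ID and barcode fields
    dataset.columns=['id','barcode']
    #Each ID must contain exactly two '_' and each barcode exactly two '-', otherwise assignment fails
    dataset[['read', 'umi', 'cbc']] = dataset['id'].str.split('_', expand=True)
    dataset[['bit1', 'bit2', 'bit3']] = dataset['barcode'].str.split('-', expand=True)

    #Number of distinct barcodes observed per UMI
    barcodes_per_umi = dataset.groupby('umi')['barcode'].nunique().reset_index(name='unique_barcodes')
    print(barcodes_per_umi.head())

    return barcodes_per_umi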
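
As a quick sanity check of the counting logic, here is a minimal sketch on toy data. The read IDs, UMIs, and barcode labels below are hypothetical and are not taken from any library in this notebook. They only assume the layout the function parses: an ID of the form read_umi_cbc and a barcode of the form bit1-bit2-bit3.

```python
import pandas as pd

# Hypothetical toy table in the assumed two-column layout (id, barcode)
toy = pd.DataFrame({
    'id': ['r1_AAAA_c1', 'r2_AAAA_c1', 'r3_CCCC_c1', 'r4_GGGG_c2', 'r5_GGGG_c2'],
    'barcode': ['b01-b12-b23', 'b01-b15-b23', 'b02-b11-b21', 'b03-b13-b22', 'b03-b13-b22'],
})

# The UMI is the second '_'-separated field of the ID
toy['umi'] = toy['id'].str.split('_', expand=True)[1]

# Count distinct barcodes per UMI, mirroring the groupby in the function above
toy_counts = (
    toy.groupby('umi')['barcode']
    .nunique()
    .reset_index(name='unique_barcodes')
)
print(toy_counts)
```

In this toy table only UMI AAAA is paired with two distinct barcodes (they differ at bit 2), which is the pattern we would expect from a template-switched molecule. UMI GGGG appears on two reads but with the same barcode, so it still counts as one.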

### Apply function to replicates