# MAPseq collapsing pipeline for randomer

This Jupyter notebook collapses randomer barcode sequences so that sequencing errors and mutations are not called as independent barcodes.

This notebook is derived from the MAPseq pipeline; see its GitHub repository: https://github.com/ZadorLaboratory/mapseq-processing

This notebook requires the following input:
1) A flat file of all randomer barcode sequences and their UMIs, generated by 02_randomer_processing.sh
2) The default config file packaged with the MAPseq pipeline

This notebook produces the following output:
1) A collapsed flat file that can be passed to the final steps of 01_diversity_barcode_alignment.sh to generate a barcode counts matrix

Modules and their versions used when generating figures for the paper can be found in 'requirements.txt', which is stored in our GitHub repository: https://github.com/MEUrbanek/rabies_barcode_tech

This code was last amended by Maddie Urbanek on 12/16/2025

## Notebook set-up

In [1]:
# Built-in python libraries
import logging
import os
import sys
from configparser import ConfigParser

# Data science libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import dask
import dask.dataframe as dd
from dask.dataframe import from_pandas


# For handling barcode tags or other fields with letters and numbers. 
from natsort import natsorted

# Make our custom libraries importable from the local git checkout.
gitpath=os.path.expanduser("/home/maddie_urbanek/Desktop/mapseq/mapseq-processing")
sys.path.append(gitpath)
from mapseq.core import *
from mapseq.barcode import *
from mapseq.utils import *
from mapseq.bowtie import *
from mapseq.stats import *

gitpath=os.path.expanduser("/home/maddie_urbanek/Desktop/mapseq/mapseq-analysis")
sys.path.append(gitpath)
#from msanalysis.analysis import * 
print("Done")

Done


## align_collapse_pd()

This function performs an all-vs-all alignment of the barcode sequences with bowtie. The bowtie output is parsed into an edge graph, and Tarjan's algorithm is then used to find all components of that graph. Finally, each vbc_read sequence is collapsed to the most common sequence variant in its component.

This is the most memory- and CPU-intensive function in the pipeline and can take a long time on large datasets.
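
As a rough illustration of the graph step described above, the sketch below groups toy barcodes into connected components and rewrites each one to its most frequent member. It is not the mapseq implementation: edges come from a brute-force Hamming-distance check rather than bowtie, components are found with a small union-find rather than Tarjan's algorithm, and the names `hamming`, `collapse_to_most_common`, and `max_mismatch` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_to_most_common(reads, max_mismatch=1):
    """Map every read to the most frequent sequence in its component."""
    counts = Counter(reads)
    uniques = list(counts)
    parent = {s: s for s in uniques}

    def find(s):
        # Walk up to the component root, compressing the path as we go.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # All-vs-all comparison: link sequences within max_mismatch of each other.
    for a, b in combinations(uniques, 2):
        if hamming(a, b) <= max_mismatch:
            parent[find(a)] = find(b)

    # Choose the most frequent sequence of each component as its representative.
    best = {}
    for s in uniques:
        root = find(s)
        if root not in best or counts[s] > counts[best[root]]:
            best[root] = s
    return [best[find(s)] for s in reads]


# Made-up 4-mers standing in for 30-base randomer barcodes.
toy_reads = ["ACGT", "ACGT", "ACGT", "ACGA", "TTTT", "TTTA", "TTTT"]
collapsed = collapse_to_most_common(toy_reads)
print(collapsed)
```

Because components are transitive, two sequences that differ by more than `max_mismatch` can still end up in the same component if a chain of intermediate variants connects them.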

In [2]:
#cp = get_default_config()
configfile = os.path.expanduser('~/Desktop/mapseq/mapseq-processing/etc/mapseq.conf')
cp = ConfigParser()
cp.read(configfile)
#logging.getLogger().setLevel(logging.INFO)
#sampleinfo = os.path.expanduser('~/project/mapseq/M205/M205_sampleinfo.xlsx')
#bcfile = os.path.expanduser( cp.get('barcodes','ssifile') )
#project_id = cp.get('project','project_id')

#infilelist = [
#    os.path.expanduser('~/project/mapseq/M205/fastq/M205_HZ_S1_R1_001.fastq.gz'),
#    os.path.expanduser('~/project/mapseq/M205/fastq/M205_HZ_S1_R2_001.fastq.gz')
#          ]
#infiles = package_pairs(infilelist)
outdir = os.path.expanduser('~/Desktop/mapseq/tutorial')

In [4]:
#Read in randomer flat file
randomer=pd.read_table('/home/maddie_urbanek/Desktop/mapseq/tutorial/randomer_flat.txt',delimiter='\t',header=None)
randomer

Unnamed: 0,0,1
0,VH00874:177:AACN7GFHV:1:1113:55989:38144_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
1,VH00874:177:AACN7GFHV:1:1113:56027:38561_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2,VH00874:177:AACN7GFHV:1:1113:56121:38580_AACAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
3,VH00874:177:AACN7GFHV:1:1113:56216:38447_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
4,VH00874:177:AACN7GFHV:1:1113:56784:39280_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
...,...,...
142068196,VH00874:177:AACN7GFHV:2:2211:10714:16694_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
142068197,VH00874:177:AACN7GFHV:2:2211:10733:16713_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
142068198,VH00874:177:AACN7GFHV:2:2211:10771:16903_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
142068199,VH00874:177:AACN7GFHV:2:2211:10771:16978_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT


In [6]:
#Split columns by underscore
randomer[['read_id','UMI','cbc']]=randomer[0].str.split('_',expand=True)
randomer

Unnamed: 0,0,1,read_id,UMI,cbc
0,VH00874:177:AACN7GFHV:1:1113:55989:38144_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:55989:38144,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
1,VH00874:177:AACN7GFHV:1:1113:56027:38561_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56027:38561,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
2,VH00874:177:AACN7GFHV:1:1113:56121:38580_AACAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56121:38580,AACAAAAAAAAAAAA,AAAAAAAAAAAA
3,VH00874:177:AACN7GFHV:1:1113:56216:38447_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56216:38447,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
4,VH00874:177:AACN7GFHV:1:1113:56784:39280_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56784:39280,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
...,...,...,...,...,...
142068196,VH00874:177:AACN7GFHV:2:2211:10714:16694_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10714:16694,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
142068197,VH00874:177:AACN7GFHV:2:2211:10733:16713_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10733:16713,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
142068198,VH00874:177:AACN7GFHV:2:2211:10771:16903_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16903,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
142068199,VH00874:177:AACN7GFHV:2:2211:10771:16978_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16978,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
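
The split above assumes every entry in column 0 has exactly two underscores. A minimal sketch of checking that assumption on toy data before assigning the three columns (the sequences and the `toy` frame are made up):

```python
import pandas as pd

# Toy rows in the same "<read_id>_<UMI>_<cbc>" layout as the flat file.
toy = pd.DataFrame({0: ["read1_AAAC_GGTT", "read2_CCCA_GGTA"]})

parts = toy[0].str.split("_", expand=True)

# Assigning to three named columns only works if the split yields three fields.
assert parts.shape[1] == 3, "expected exactly two underscores per entry"
toy[["read_id", "UMI", "cbc"]] = parts
print(toy)
```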


In [7]:
#Rename the raw columns: 0 is the full read header, 1 is the barcode sequence
randomer=randomer.rename(columns={0:'full',1:'vbc_read'})
randomer

Unnamed: 0,full,vbc_read,read_id,UMI,cbc
0,VH00874:177:AACN7GFHV:1:1113:55989:38144_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:55989:38144,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
1,VH00874:177:AACN7GFHV:1:1113:56027:38561_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56027:38561,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
2,VH00874:177:AACN7GFHV:1:1113:56121:38580_AACAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56121:38580,AACAAAAAAAAAAAA,AAAAAAAAAAAA
3,VH00874:177:AACN7GFHV:1:1113:56216:38447_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56216:38447,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
4,VH00874:177:AACN7GFHV:1:1113:56784:39280_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56784:39280,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
...,...,...,...,...,...
142068196,VH00874:177:AACN7GFHV:2:2211:10714:16694_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10714:16694,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
142068197,VH00874:177:AACN7GFHV:2:2211:10733:16713_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10733:16713,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
142068198,VH00874:177:AACN7GFHV:2:2211:10771:16903_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16903,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
142068199,VH00874:177:AACN7GFHV:2:2211:10771:16978_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16978,TTTTTTTTTTTTTTT,AAAAAAAAAAAA


In [9]:
#Filter out any reads whose barcode contains an N base call
print('Number of barcode reads in randomer dataset:')
print(len(randomer))
discard = ["N"]
randomer_filtered = randomer[~randomer.vbc_read.str.contains('|'.join(discard))]
print('Number of barcode reads remaining after removing N-containing sequences:')
print(len(randomer_filtered))

Number of barcode reads in randomer dataset:
142068201
Number of barcode reads remaining after removing N-containing sequences:
137182967
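
The counts printed above are numbers of rows (reads), not numbers of distinct barcode sequences. A small sketch on made-up data showing the same N filter alongside both quantities; `toy` and `has_n` are hypothetical names:

```python
import pandas as pd

toy = pd.DataFrame({"vbc_read": ["ACGT", "ACNT", "NNNN", "TTGA", "ACGT"]})

# regex=False treats "N" as a literal character rather than a pattern.
has_n = toy["vbc_read"].str.contains("N", regex=False)
toy_filtered = toy[~has_n]

print("Reads before filtering:", len(toy))
print("Reads after removing N-containing barcodes:", len(toy_filtered))
print("Distinct barcodes remaining:", toy_filtered["vbc_read"].nunique())
```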


In [11]:
#Get read counts
counts=pd.DataFrame(randomer_filtered['vbc_read'].value_counts())
counts

vbc_read,count
ACATCTTCAGCTAAGCCTCACTTAACTCCC,32865
TACAAATACCGCACGCTCCCCAACCATAGC,26196
ACAGGACCTCGGGGACATTGACGGACCGCA,25637
CAGACCAAGCGAATTGGCGACAGCGTCACG,18572
AGCTCCCCGCCGCTTAGCGCATGAGAGTTG,17949
...,...
AAAAAAAAAAACCTAACGCTAACCTCTATA,1
AAAAAAAAAAACTAAATAAACAGGACGTGT,1
AAAAAAAAAAATTCTTTTGCGACATGCGGC,1
AAAAAAAAAACAAAAAAAAAAAAAAAAAAA,1


In [12]:
#Turn the barcode index back into a column so it can be merged on
temp=counts.reset_index()

In [14]:
#Attach each barcode's read count to every read carrying that barcode
merged=pd.merge(randomer_filtered,temp, on="vbc_read")
merged

Unnamed: 0,full,vbc_read,read_id,UMI,cbc,count
0,VH00874:177:AACN7GFHV:1:1113:55989:38144_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:55989:38144,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87
1,VH00874:177:AACN7GFHV:1:1113:56027:38561_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56027:38561,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87
2,VH00874:177:AACN7GFHV:1:1113:56121:38580_AACAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56121:38580,AACAAAAAAAAAAAA,AAAAAAAAAAAA,87
3,VH00874:177:AACN7GFHV:1:1113:56216:38447_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56216:38447,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87
4,VH00874:177:AACN7GFHV:1:1113:56784:39280_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56784:39280,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87
...,...,...,...,...,...,...
137182962,VH00874:177:AACN7GFHV:2:2211:10714:16694_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10714:16694,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51
137182963,VH00874:177:AACN7GFHV:2:2211:10733:16713_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10733:16713,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51
137182964,VH00874:177:AACN7GFHV:2:2211:10771:16903_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16903,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51
137182965,VH00874:177:AACN7GFHV:2:2211:10771:16978_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16978,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51
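
The count-then-merge steps above give every read the total number of reads sharing its barcode. A sketch on toy data of that route next to a one-step `groupby(...).transform("size")` alternative, which keeps the original row order and avoids the intermediate table. This is only an illustration, not what the notebook ran, and the toy values are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "vbc_read": ["ACGT", "TTGA", "ACGT", "ACGT"],
    "UMI": ["AAA", "CCC", "GGG", "AAA"],
})

# Route used above: tabulate counts, then merge them back onto each read.
counts = (
    toy["vbc_read"].value_counts()
    .rename_axis("vbc_read")
    .reset_index(name="read_count")
)
via_merge = pd.merge(toy, counts, on="vbc_read")

# Alternative: broadcast each group's size onto its own rows.
via_transform = toy.assign(
    read_count=toy.groupby("vbc_read")["vbc_read"].transform("size")
)
print(via_transform)
```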


In [15]:
#Prep for MAPseq collapsing
merged.rename(columns={'count': 'read_count'}, inplace=True)
merged

Unnamed: 0,full,vbc_read,read_id,UMI,cbc,read_count
0,VH00874:177:AACN7GFHV:1:1113:55989:38144_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:55989:38144,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87
1,VH00874:177:AACN7GFHV:1:1113:56027:38561_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56027:38561,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87
2,VH00874:177:AACN7GFHV:1:1113:56121:38580_AACAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56121:38580,AACAAAAAAAAAAAA,AAAAAAAAAAAA,87
3,VH00874:177:AACN7GFHV:1:1113:56216:38447_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56216:38447,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87
4,VH00874:177:AACN7GFHV:1:1113:56784:39280_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56784:39280,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87
...,...,...,...,...,...,...
137182962,VH00874:177:AACN7GFHV:2:2211:10714:16694_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10714:16694,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51
137182963,VH00874:177:AACN7GFHV:2:2211:10733:16713_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10733:16713,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51
137182964,VH00874:177:AACN7GFHV:2:2211:10771:16903_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16903,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51
137182965,VH00874:177:AACN7GFHV:2:2211:10771:16978_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16978,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51


In [17]:
#Subset to only necessary columns
randomer_input=merged[['vbc_read','read_count','UMI']]
randomer_input

Unnamed: 0,vbc_read,read_count,UMI
0,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,87,AAAAAAAAAAAAAAA
1,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,87,AAAAAAAAAAAAAAA
2,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,87,AACAAAAAAAAAAAA
3,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,87,AAAAAAAAAAAAAAA
4,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,87,AAAAAAAAAAAAAAA
...,...,...,...
137182962,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,51,TTTTTTTTTTTTTTT
137182963,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,51,TTTTTTTTTTTTTTT
137182964,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,51,TTTTTTTTTTTTTTT
137182965,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,51,TTTTTTTTTTTTTTT


In [20]:
#Save the collapsing input table
randomer_input.to_csv('/home/maddie_urbanek/Desktop/mapseq/tutorial/randomer_collapsing_input.csv')

In [21]:
#Collapse barcodes with the MAPseq function (slow on large datasets)
cdf = align_collapse_pd(randomer_input.copy(), 
                       outdir=outdir, 
                       cp=cp)
cdf


Unnamed: 0,read_count,UMI,vbc_read
0,87,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
1,87,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2,87,AACAAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
3,87,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
4,87,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
...,...,...,...
137182962,51,TTTTTTTTTTTTTTT,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182963,51,TTTTTTTTTTTTTTT,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182964,51,TTTTTTTTTTTTTTT,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182965,51,TTTTTTTTTTTTTTT,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA


In [25]:
#Reset to a 0..n-1 index so rows line up with cdf by position
randomer_filtered=randomer_filtered.reset_index()
randomer_filtered

Unnamed: 0,index,full,vbc_read,read_id,UMI,cbc
0,0,VH00874:177:AACN7GFHV:1:1113:55989:38144_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:55989:38144,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
1,1,VH00874:177:AACN7GFHV:1:1113:56027:38561_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56027:38561,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
2,2,VH00874:177:AACN7GFHV:1:1113:56121:38580_AACAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56121:38580,AACAAAAAAAAAAAA,AAAAAAAAAAAA
3,3,VH00874:177:AACN7GFHV:1:1113:56216:38447_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56216:38447,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
4,4,VH00874:177:AACN7GFHV:1:1113:56784:39280_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56784:39280,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
...,...,...,...,...,...,...
137182962,142068196,VH00874:177:AACN7GFHV:2:2211:10714:16694_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10714:16694,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
137182963,142068197,VH00874:177:AACN7GFHV:2:2211:10733:16713_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10733:16713,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
137182964,142068198,VH00874:177:AACN7GFHV:2:2211:10771:16903_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16903,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
137182965,142068199,VH00874:177:AACN7GFHV:2:2211:10771:16978_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16978,TTTTTTTTTTTTTTT,AAAAAAAAAAAA


In [38]:
#Rename the pre-collapse columns so they do not clash with those in cdf
randomer_filtered = randomer_filtered.rename(columns={'vbc_read': 'original_barcode','UMI':'output_UMI'})
randomer_filtered

Unnamed: 0,index,full,original_barcode,read_id,output_UMI,cbc
0,0,VH00874:177:AACN7GFHV:1:1113:55989:38144_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:55989:38144,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
1,1,VH00874:177:AACN7GFHV:1:1113:56027:38561_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56027:38561,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
2,2,VH00874:177:AACN7GFHV:1:1113:56121:38580_AACAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56121:38580,AACAAAAAAAAAAAA,AAAAAAAAAAAA
3,3,VH00874:177:AACN7GFHV:1:1113:56216:38447_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56216:38447,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
4,4,VH00874:177:AACN7GFHV:1:1113:56784:39280_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56784:39280,AAAAAAAAAAAAAAA,AAAAAAAAAAAA
...,...,...,...,...,...,...
137182962,142068196,VH00874:177:AACN7GFHV:2:2211:10714:16694_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10714:16694,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
137182963,142068197,VH00874:177:AACN7GFHV:2:2211:10733:16713_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10733:16713,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
137182964,142068198,VH00874:177:AACN7GFHV:2:2211:10771:16903_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16903,TTTTTTTTTTTTTTT,AAAAAAAAAAAA
137182965,142068199,VH00874:177:AACN7GFHV:2:2211:10771:16978_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16978,TTTTTTTTTTTTTTT,AAAAAAAAAAAA


In [39]:
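#Place original reads and collapsed barcodes side by side (relies on both tables sharing the same index and row order)
collapsed_flat = pd.concat([randomer_filtered, cdf], axis=1)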
collapsed_flat

Unnamed: 0,index,full,original_barcode,read_id,output_UMI,cbc,read_count,UMI,vbc_read
0,0,VH00874:177:AACN7GFHV:1:1113:55989:38144_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:55989:38144,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
1,1,VH00874:177:AACN7GFHV:1:1113:56027:38561_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56027:38561,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2,2,VH00874:177:AACN7GFHV:1:1113:56121:38580_AACAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56121:38580,AACAAAAAAAAAAAA,AAAAAAAAAAAA,87,AACAAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
3,3,VH00874:177:AACN7GFHV:1:1113:56216:38447_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56216:38447,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
4,4,VH00874:177:AACN7GFHV:1:1113:56784:39280_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56784:39280,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,87,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
...,...,...,...,...,...,...,...,...,...
137182962,142068196,VH00874:177:AACN7GFHV:2:2211:10714:16694_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10714:16694,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51,TTTTTTTTTTTTTTT,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182963,142068197,VH00874:177:AACN7GFHV:2:2211:10733:16713_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10733:16713,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51,TTTTTTTTTTTTTTT,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182964,142068198,VH00874:177:AACN7GFHV:2:2211:10771:16903_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16903,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51,TTTTTTTTTTTTTTT,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182965,142068199,VH00874:177:AACN7GFHV:2:2211:10771:16978_TTTTT...,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,VH00874:177:AACN7GFHV:2:2211:10771:16978,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,51,TTTTTTTTTTTTTTT,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
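
`pd.concat(..., axis=1)` pairs rows by index label, not by position, which is why the index of the filtered table has to be reset before this step. A minimal sketch of the difference on toy frames; `left`, `right`, and the values in them are made up:

```python
import pandas as pd

# Index with gaps, like a table after rows have been filtered out.
left = pd.DataFrame({"original_barcode": ["ACGT", "ACGA", "TTTT"]}, index=[0, 2, 5])
# Default 0..2 index, like a freshly built result table.
right = pd.DataFrame({"vbc_read": ["ACGT", "ACGT", "TTTT"]})

# Mismatched labels: concat takes the union of both indexes and pads with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first makes the pairing purely positional.
left_reset = left.reset_index(drop=True)
assert len(left_reset) == len(right), "row counts must match before pairing"
aligned = pd.concat([left_reset, right], axis=1)
print(aligned)
```

Checking that both tables have the same length, and that `aligned` contains no missing values, is a cheap guard against silently misaligned rows.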


In [40]:
#Keep the read metadata plus the collapsed barcode
input_flat=collapsed_flat[['read_id','UMI','cbc','vbc_read']]
input_flat

Unnamed: 0,read_id,UMI,cbc,vbc_read
0,VH00874:177:AACN7GFHV:1:1113:55989:38144,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
1,VH00874:177:AACN7GFHV:1:1113:56027:38561,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2,VH00874:177:AACN7GFHV:1:1113:56121:38580,AACAAAAAAAAAAAA,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
3,VH00874:177:AACN7GFHV:1:1113:56216:38447,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
4,VH00874:177:AACN7GFHV:1:1113:56784:39280,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
...,...,...,...,...
137182962,VH00874:177:AACN7GFHV:2:2211:10714:16694,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182963,VH00874:177:AACN7GFHV:2:2211:10733:16713,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182964,VH00874:177:AACN7GFHV:2:2211:10771:16903,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182965,VH00874:177:AACN7GFHV:2:2211:10771:16978,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA


In [41]:
#Rebuild the read_id_UMI_cbc header used in the input flat file
input_flat=input_flat.reset_index()
input_flat['col_1'] = input_flat[['read_id','UMI','cbc']].agg('_'.join, axis = 1 )
input_flat

Unnamed: 0,index,read_id,UMI,cbc,vbc_read,col_1
0,0,VH00874:177:AACN7GFHV:1:1113:55989:38144,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:55989:38144_AAAAA...
1,1,VH00874:177:AACN7GFHV:1:1113:56027:38561,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56027:38561_AAAAA...
2,2,VH00874:177:AACN7GFHV:1:1113:56121:38580,AACAAAAAAAAAAAA,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56121:38580_AACAA...
3,3,VH00874:177:AACN7GFHV:1:1113:56216:38447,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56216:38447_AAAAA...
4,4,VH00874:177:AACN7GFHV:1:1113:56784:39280,AAAAAAAAAAAAAAA,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:1:1113:56784:39280_AAAAA...
...,...,...,...,...,...,...
137182962,137182962,VH00874:177:AACN7GFHV:2:2211:10714:16694,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:2:2211:10714:16694_TTTTT...
137182963,137182963,VH00874:177:AACN7GFHV:2:2211:10733:16713,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:2:2211:10733:16713_TTTTT...
137182964,137182964,VH00874:177:AACN7GFHV:2:2211:10771:16903,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:2:2211:10771:16903_TTTTT...
137182965,137182965,VH00874:177:AACN7GFHV:2:2211:10771:16978,TTTTTTTTTTTTTTT,AAAAAAAAAAAA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,VH00874:177:AACN7GFHV:2:2211:10771:16978_TTTTT...
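
The row-wise `agg('_'.join, axis=1)` above calls Python once per row, which can be slow on a table of this size. A sketch of an equivalent column-wise concatenation with `Series.str.cat`, shown on made-up data and offered only as a possible alternative:

```python
import pandas as pd

toy = pd.DataFrame({
    "read_id": ["read1", "read2"],
    "UMI": ["AAAC", "CCCA"],
    "cbc": ["GGTT", "GGTA"],
})

# Row-wise join, as in the cell above.
joined_agg = toy[["read_id", "UMI", "cbc"]].agg("_".join, axis=1)

# Column-wise concatenation producing the same strings.
joined_cat = toy["read_id"].str.cat([toy["UMI"], toy["cbc"]], sep="_")

print(joined_cat.tolist())
```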


In [43]:
#Keep only the two columns of the flat file format
final_flat=input_flat[['col_1','vbc_read']]
final_flat

Unnamed: 0,col_1,vbc_read
0,VH00874:177:AACN7GFHV:1:1113:55989:38144_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
1,VH00874:177:AACN7GFHV:1:1113:56027:38561_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2,VH00874:177:AACN7GFHV:1:1113:56121:38580_AACAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
3,VH00874:177:AACN7GFHV:1:1113:56216:38447_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
4,VH00874:177:AACN7GFHV:1:1113:56784:39280_AAAAA...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
...,...,...
137182962,VH00874:177:AACN7GFHV:2:2211:10714:16694_TTTTT...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182963,VH00874:177:AACN7GFHV:2:2211:10733:16713_TTTTT...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182964,VH00874:177:AACN7GFHV:2:2211:10771:16903_TTTTT...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
137182965,VH00874:177:AACN7GFHV:2:2211:10771:16978_TTTTT...,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA


In [46]:
#Export collapsed flat file
final_flat.to_csv('/home/maddie_urbanek/Desktop/mapseq/tutorial/collapsed_randomer_flat.txt', sep="\t",index=False)
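
With `index=False` and no `header` argument, `to_csv` still writes a line of column names, whereas the input flat file was read with `header=None`. Whether 01_diversity_barcode_alignment.sh expects a header line is not stated here, so the sketch below only shows the two variants on toy data written to in-memory buffers:

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "col_1": ["read1_AAAC_GGTT", "read2_CCCA_GGTA"],
    "vbc_read": ["ACGT", "TTGA"],
})

# Same options as the export cell: tab-separated, no index, header row included.
with_header = io.StringIO()
toy.to_csv(with_header, sep="\t", index=False)

# header=False drops the column-name line, giving a headerless flat file.
without_header = io.StringIO()
toy.to_csv(without_header, sep="\t", index=False, header=False)

print(with_header.getvalue())
print(without_header.getvalue())
```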