# Randomer processing using the MAPseq pipeline

Jupyter notebook containing analyses for collapsing randomer barcodes in the randomer library from Shin & Urbanek. Analyses correspond to !!!!!!

The notebook runs the Zador lab's MAPseq collapsing pipeline to collapse any randomer barcodes that differ from one another by ≤3 bases. This reduces possible barcode inflation introduced by PCR and sequencing errors.
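
To make the idea concrete, here is a minimal sketch of distance-based collapsing on a toy set of barcodes. It is not the MAPseq implementation (the actual pipeline is linked further down in this notebook). The helper names `hamming` and `collapse_barcodes`, the use of Hamming distance, and the choice of the most abundant member as each group's representative are all assumptions made for illustration. The all-pairs comparison is quadratic, so it would not scale to the millions of barcodes processed here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def collapse_barcodes(counts, max_dist=3):
    """Merge barcodes linked by chains of <= max_dist mismatches.

    counts: dict mapping barcode sequence -> count.
    Returns a dict mapping each group's most abundant barcode -> summed count.
    (Hypothetical helper for illustration only.)
    """
    # Most abundant first, so the lowest index in a group is its representative.
    seqs = sorted(counts, key=lambda s: (-counts[s], s))
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison: O(n^2), fine only for toy inputs.
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming(seqs[i], seqs[j]) <= max_dist:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the lower index (more abundant barcode) as root.
                    parent[max(ri, rj)] = min(ri, rj)

    collapsed = {}
    for i, s in enumerate(seqs):
        rep = seqs[find(i)]
        collapsed[rep] = collapsed.get(rep, 0) + counts[s]
    return collapsed


toy_counts = {
    "AAAAAAAAAA": 10,
    "AAAAAAAAAC": 1,   # 1 mismatch from the first barcode
    "AAAAAAACCC": 2,   # 3 mismatches from the first barcode
    "TTTTTTTTTT": 5,   # far from everything else
}
print(collapse_barcodes(toy_counts))
# {'AAAAAAAAAA': 13, 'TTTTTTTTTT': 5}
```

In this toy case the three A-rich barcodes merge into one entry whose count is the sum of its members, while the unrelated barcode is left alone.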

Input for this notebook requires:
1) A barcode counts matrix with, at a minimum, a barcode column containing each identified barcode sequence and a ???

Output for this notebook includes:
1) A collapsed counts table (collapsed_randomer.csv) with revised UMI counts per collapsed randomer barcode sequence

Modules and their versions used when generating figures for the paper are listed in 'requirements.txt', which is stored in our GitHub repository: !!!!!!

In [None]:
This code was last amended by Maddie Urbanek on !!!!!

## Notebook set-up

In [1]:
#Load in modules:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
from itertools import product
import seaborn as sns
from statsmodels.multivariate.manova import MANOVA

In [2]:
#Set working directory to point to barcode diversity libraries
import os
os.chdir('/Users/maddieurbanek/Desktop/revision_data/resubmission/data/fastqs/randomer/')

## Import and reformat data

Randomer data should be the completecounts.tsv file output by ##_randomer_processing.sh, with, at a minimum, a column containing each randomer sequence and a column with its corresponding UMI counts.

In [3]:
#Import dataset
randomer=pd.read_table('./randomer_completecounts.tsv',delimiter='\t')

In [4]:
#Filter out any barcodes containing N (ambiguous) base calls
print('Number of unique barcodes identified in randomer dataset:')
print(len(randomer))
discard = ["N"]
randomer_filtered = randomer[~randomer.barcode.str.contains('|'.join(discard))]
print('Number of unique barcodes remaining after removing N sequences:')
print(len(randomer_filtered))

Number of unique barcodes identified in randomer dataset:
7525099
Number of unique barcodes remaining after removing N sequences:
5434075


In [5]:
#Prep for MAPseq collapsing: rename columns to the MAPseq naming (vbc_read, read_count).
#Reassigning the result (instead of inplace=True) avoids pandas' SettingWithCopyWarning on the filtered slice.
randomer_filtered = randomer_filtered.rename(columns={'UMI_Count': 'read_count', 'barcode': 'vbc_read'})
randomer_filtered

Unnamed: 0,CBC,vbc_read,read_count
0,b'AAAAAAAAAAAA',AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,3
1,b'AAAAAAAAAAAA',AAAAAAAAAAAAAAAAAAAAAAAAAAACAA,1
2,b'AAAAAAAAAAAA',AAAAAAAAAAAAAAAAAAAAAAAAAACCAA,1
3,b'AAAAAAAAAAAA',AAAAAAAAAAAAAAAAAAAAAAACAAAAAA,1
4,b'AAAAAAAAAAAA',AAAAAAAAAAAAAAAAAAAAACAAAAAAAA,1
...,...,...,...
7525094,b'AAAAAAAAAAAA',TTTTTTTTTTTGCGCTCCCAGCGCCCTCAA,1
7525095,b'AAAAAAAAAAAA',TTTTTTTTTTTTAATCTTTCACAGACGGTT,1
7525096,b'AAAAAAAAAAAA',TTTTTTTTTTTTTTTCATCTCGCAGGCTGC,1
7525097,b'AAAAAAAAAAAA',TTTTTTTTTTTTTTTGTTTTTTTTTTTTTT,2


## Run data through MAPseq processing pipeline

For complete pipeline see: https://github.com/ZadorLaboratory/mapseq-processing/tree/main

### Import libraries

In [10]:
# Built-in python libraries
import logging
import os
import sys
from configparser import ConfigParser

# Data science libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import dask
import dask.dataframe as dd
from dask.dataframe import from_pandas

# For handling barcode tags or other fields with letters and numbers. 
from natsort import natsorted

# Allow importing our custom libraries from the local git checkout.
gitpath=os.path.expanduser("../../../code/mapseq-processing")
sys.path.append(gitpath)
from mapseq.core import *
from mapseq.barcode import *
from mapseq.utils import *
from mapseq.bowtie import *
from mapseq.stats import *

gitpath=os.path.expanduser("../../../code/mapseq-analysis")
sys.path.append(gitpath)
#from msanalysis.analysis import * 
print("Done")

Done


### Run align_collapse_pd function on the randomer_filtered dataframe

In [11]:
randomer_filtered

Unnamed: 0,CBC,vbc_read,read_count
0,b'AAAAAAAAAAAA',AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,3
1,b'AAAAAAAAAAAA',AAAAAAAAAAAAAAAAAAAAAAAAAAACAA,1
2,b'AAAAAAAAAAAA',AAAAAAAAAAAAAAAAAAAAAAAAAACCAA,1
3,b'AAAAAAAAAAAA',AAAAAAAAAAAAAAAAAAAAAAACAAAAAA,1
4,b'AAAAAAAAAAAA',AAAAAAAAAAAAAAAAAAAAACAAAAAAAA,1
...,...,...,...
7525094,b'AAAAAAAAAAAA',TTTTTTTTTTTGCGCTCCCAGCGCCCTCAA,1
7525095,b'AAAAAAAAAAAA',TTTTTTTTTTTTAATCTTTCACAGACGGTT,1
7525096,b'AAAAAAAAAAAA',TTTTTTTTTTTTTTTCATCTCGCAGGCTGC,1
7525097,b'AAAAAAAAAAAA',TTTTTTTTTTTTTTTGTTTTTTTTTTTTTT,2


In [13]:
print(randomer_filtered['vbc_read'].nunique())
print(randomer_filtered['read_count'].sum())

5434075
135563481


In [17]:
#cp = get_default_config()
configfile = os.path.expanduser('/Users/maddieurbanek/Desktop/revision_data/resubmission/code/mapseq-processing/misc/htna2024')
cp = ConfigParser()
cp.read(configfile)
#logging.getLogger().setLevel(logging.INFO)
#sampleinfo = os.path.expanduser('~/project/mapseq/M205/M205_sampleinfo.xlsx')
#bcfile = os.path.expanduser( cp.get('barcodes','ssifile') )
#project_id = cp.get('project','project_id')

#infilelist = [
#    os.path.expanduser('~/project/mapseq/M205/fastq/M205_HZ_S1_R1_001.fastq.gz'),
#    os.path.expanduser('~/project/mapseq/M205/fastq/M205_HZ_S1_R2_001.fastq.gz')
#          ]
#infiles = package_pairs(infilelist)
#outdir = os.path.expanduser('~/project/mapseq/M205')
#print(f"For {project_id}:\nconfig={configfile}\nbcfile={bcfile}\ninfiles={infiles}\noutdir={outdir}\nDone")

[]

In [19]:
#cp = ConfigParser()
cdf = align_collapse_pd(randomer_filtered.copy(), 
                       outdir='./collapsed_randomer.csv', 
                       cp=cp)
cdf

NoSectionError: No section: 'collapse'
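
The empty list returned by `cp.read(configfile)` in the previous cell indicates that no file was parsed: `ConfigParser.read` silently skips paths it cannot open and returns only the names of the files it read successfully. That would leave `cp` without a `collapse` section and is a likely cause of the `NoSectionError` above. A minimal sketch of a stricter loader follows. `load_config` is a hypothetical helper, not part of mapseq-processing, and the demo file contents are made up for illustration.

```python
import os
import tempfile
from configparser import ConfigParser


def load_config(path, required_sections=("collapse",)):
    """Read an INI-style config and fail loudly if it is missing or incomplete.

    Hypothetical helper for illustration; not part of mapseq-processing.
    """
    cp = ConfigParser()
    # read() returns the list of files it parsed; unreadable paths are skipped.
    parsed = cp.read(os.path.expanduser(path))
    if not parsed:
        raise FileNotFoundError(f"Config file could not be read: {path}")
    missing = [s for s in required_sections if not cp.has_section(s)]
    if missing:
        raise KeyError(f"Config {path} lacks section(s): {missing}")
    return cp


# Demo with a throwaway file; the option name below is made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.conf")
    with open(demo_path, "w") as fh:
        fh.write("[collapse]\nexample_option = 3\n")
    demo_cp = load_config(demo_path)
    print(demo_cp.sections())
```

Calling a loader like this in place of the bare `cp.read(configfile)` would surface a wrong or incomplete config path immediately, rather than later inside `align_collapse_pd`.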

## Re-sum UMIs, accounting for collapsed barcodes

In [None]:
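#Sum read counts over all barcodes assigned to the same collapsed sequence
#(cdf is the dataframe returned by align_collapse_pd above)
final_collapsed = cdf.groupby('vbc_read_col')['read_count'].sum()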
final_collapsed = pd.DataFrame(final_collapsed)
final_collapsed = final_collapsed.reset_index()
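
As a small illustration of the re-summing step, the sketch below uses a made-up table in which two raw barcodes were assigned to the same collapsed sequence. The column names follow the cells above; the data and the variable names `toy` and `toy_collapsed` are invented for illustration. Assuming collapsing only reassigns reads to a representative sequence, the total read count should be unchanged, which makes a convenient sanity check.

```python
import pandas as pd

# Made-up example: the first two raw barcodes share one collapsed sequence.
toy = pd.DataFrame({
    "vbc_read": ["AAAAAAAAAA", "AAAAAAAAAC", "TTTTTTTTTT"],
    "vbc_read_col": ["AAAAAAAAAA", "AAAAAAAAAA", "TTTTTTTTTT"],
    "read_count": [10, 1, 5],
})

# Sum counts per collapsed barcode; as_index=False keeps a flat table.
toy_collapsed = toy.groupby("vbc_read_col", as_index=False)["read_count"].sum()
print(toy_collapsed)

# Collapsing should redistribute reads, not create or drop them.
assert toy_collapsed["read_count"].sum() == toy["read_count"].sum()
```

The same total-count comparison could be applied to the real data using the read count sum printed before collapsing.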

## Export collapsed randomer counts matrix to local machine

In [None]:
final_collapsed.to_csv('../barcode_diversity_libraries/cvs/collapsed_randomer.csv')