# imports

In [9]:
import pandas as pd
import numpy as np
import statsmodels.api as sm


In [2]:
# set some formatting preferences to make things nicer to read
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# load in and summarize

Count data from an scMPRA performed by the Shendure lab, downloaded from GEO (accession GSE217686).


In [3]:
counts = pd.read_table('../data/GSE217686_assigned_oBC_CRE_mBC_joined_counts_sc_rep_mEB_series.txt')

In [4]:
counts

Unnamed: 0,cellBC,rep_id,oBC,mBC,CRE_class,CRE_id,reads_oBC,UMIs_oBC,reads_mBC,UMIs_mBC
0,A1_GTTACCCAGTTGAAGT-1,A1,GAAAGTGTATTTGGGT,ACGTAACATTATAAT,devCRE,Txndc12_chr4_7978,499,415,0,0
1,A1_GTTACCCAGTTGAAGT-1,A1,CCGGGGTAGGCGAAGA,AACTCAGACTACCAC,devCRE,Col1a2_chr6_72,510,414,0,0
2,A1_GTTACCCAGTTGAAGT-1,A1,ACTGACGTCAATCAAT,TGTTTAAGTCAACAA,devCRE,Klf4_chr4_3952,850,678,0,0
3,A1_GTTACCCAGTTGAAGT-1,A1,GGAGGTGTGCGCGTGG,CAACAACACATTTTA,devCRE,Foxa2_chr2_13840,549,451,0,0
4,A1_GTTACCCAGTTGAAGT-1,A1,CGATACCTACTTAATA,TACCTAATGGGAAAG,promoters,minP,413,353,0,0
...,...,...,...,...,...,...,...,...,...,...
877059,2B2_TTCTGTAAGGCTCTCG-1,2B2,AGCTGGTACACCCGAG,GAACAATTAAACGAG,devCRE,Sparc_chr11_7198,55,50,0,0
877060,2B2_TTCTGTAAGGCTCTCG-1,2B2,ATTTAGACACCAATAC,CAGCCCGCCGCCGAG,promoters,ubcP,32,31,196,11
877061,2B2_TGGAACTCAATCCTAG-1,2B2,GTTAGATACGCTATGG,ATGGCACATATCTTA,devCRE,Gata4_chr14_5760,21,18,0,0
877062,2B2_TGGAACTCAATCCTAG-1,2B2,GATTAATTCGGTCTAC,CTCTATCGTCCTCGT,devCRE,Lamb1_chr12_2289,13,13,0,0


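- cellBC: cell barcode, one per cell. The same barcode repeats across rows (43,447 unique values in 877,064 rows).
- rep_id: replicate ID. These are biological replicates, though in practice they are somewhat both biological and technical.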
- oBC: Tornado barcodes, circular RNA barcodes to indicate successful transfection of a given CRE into a given cell
- mBC: mRNA barcode for quantifying CRE activity 
- CRE_class: type of cis-regulatory element
- CRE_id: CRE identifier
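- reads_oBC: number of sequencing reads supporting the oBC
- UMIs_oBC: number of unique molecular identifiers (UMIs) for the oBC, i.e. reads collapsed to distinct captured molecules
- reads_mBC: number of sequencing reads supporting the mBC
- UMIs_mBC: number of UMIs for the mBC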
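To make the reads and UMIs columns concrete, here is a minimal sketch using three rows copied from the table above. It assumes the usual convention that a UMI count is the number of distinct molecules left after collapsing duplicate reads, so reads should never be fewer than UMIs. The `reads_per_UMI_*` column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Three rows copied from the displayed table (indices 0, 4 and 877060).
example = pd.DataFrame({
    "CRE_id": ["Txndc12_chr4_7978", "minP", "ubcP"],
    "reads_oBC": [499, 413, 32],
    "UMIs_oBC": [415, 353, 31],
    "reads_mBC": [0, 0, 196],
    "UMIs_mBC": [0, 0, 11],
})

# Reads per UMI; rows with zero UMIs get NaN instead of a division error.
for bc in ["oBC", "mBC"]:
    umis = example[f"UMIs_{bc}"].replace(0, np.nan)
    example[f"reads_per_UMI_{bc}"] = example[f"reads_{bc}"] / umis

print(example[["CRE_id", "reads_per_UMI_oBC", "reads_per_UMI_mBC"]])
```

In these rows the oBC ratio stays close to 1, while the one nonzero mBC row has many more reads than UMIs.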

Questions/Notes to discuss with Mackenzie:
- Are the replicates biological or technical?
- Can a CRE-cell combination have more than one oBC-mBC pair?
- Can we talk through which barcodes can be used multiple times, which have more unique values than the things they tag, and which combinations we're looking for? I think I have it all straight in my head, but it would be good to go through it with someone more familiar with the experimental side of things.
- UMIs vs. reads for these barcodes: the difference is not clear to me.
- By "endogenous", do they mean without inserted CREs in this case?
- Some of Rohit's code gives me the sense that he's modeling the number of cells captured for each cell type. Is that the goal here?
- Never mind, I think cell type is a covariate for modeling counts per oBC.
- I can't track down the file where he got the cell type mappings from. Any idea where that comes from?
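The question about multiple oBC-mBC pairs per CRE-cell combination can probably be checked directly from the table. The sketch below uses a small toy frame with hypothetical values but the same column names as `counts`; swapping `toy` for `counts` should give the real answer, assuming each row is one oBC-mBC pair observed in one cell.

```python
import pandas as pd

# Toy rows (hypothetical values) with the same column names as `counts`.
toy = pd.DataFrame({
    "cellBC": ["c1", "c1", "c1", "c2"],
    "CRE_id": ["minP", "minP", "Klf4", "minP"],
    "oBC": ["AAAA", "CCCC", "GGGG", "AAAA"],
    "mBC": ["TTTA", "TTTC", "TTTG", "TTTA"],
})

# Number of distinct oBC-mBC pairs seen for each cell-CRE combination.
pairs_per_combo = (
    toy.drop_duplicates(["cellBC", "CRE_id", "oBC", "mBC"])
       .groupby(["cellBC", "CRE_id"])
       .size()
)

# Combinations tagged by more than one pair.
multi = pairs_per_combo[pairs_per_combo > 1]
print(multi)
```

If `multi` is empty on the real data, each CRE-cell combination has at most one pair.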

In [5]:
counts.describe()

Unnamed: 0,reads_oBC,UMIs_oBC,reads_mBC,UMIs_mBC
count,877064.0,877064.0,877064.0,877064.0
mean,168.055,146.082,108.523,5.708
std,169.74,140.72,843.699,43.481
min,11.0,11.0,0.0,0.0
25%,61.0,55.0,0.0,0.0
50%,119.0,107.0,0.0,0.0
75%,217.0,190.0,0.0,0.0
max,5876.0,4616.0,60250.0,2890.0
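
The quartiles above show that `reads_mBC` and `UMIs_mBC` are 0 up to at least the 75th percentile, so most rows have no detected mBC. A minimal sketch for quantifying that, on a toy stand-in for `counts` with hypothetical values:

```python
import pandas as pd

# Toy stand-in for `counts` (hypothetical values).
toy = pd.DataFrame({
    "CRE_class": ["devCRE", "devCRE", "devCRE", "promoters", "promoters"],
    "UMIs_mBC": [0, 0, 3, 0, 11],
})

# Fraction of rows with no mBC UMIs at all.
zero_frac = (toy["UMIs_mBC"] == 0).mean()

# Same fraction, split by CRE class.
zero_frac_by_class = toy["UMIs_mBC"].eq(0).groupby(toy["CRE_class"]).mean()

print(zero_frac)
print(zero_frac_by_class)
```

The per-class split is only one possible breakdown; grouping by `rep_id` or `CRE_id` works the same way.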


In [6]:
counts.dtypes

cellBC       object
rep_id       object
oBC          object
mBC          object
CRE_class    object
CRE_id       object
reads_oBC     int64
UMIs_oBC      int64
reads_mBC     int64
UMIs_mBC      int64
dtype: object

In [7]:
counts.nunique()

cellBC       43447
rep_id           6
oBC          28938
mBC          28938
CRE_class        2
CRE_id         212
reads_oBC     2040
UMIs_oBC      1732
reads_mBC     8591
UMIs_mBC      1045
dtype: int64
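
`oBC` and `mBC` both have 28,938 unique values, which suggests, but does not prove, that they are paired one-to-one. A minimal sketch of a direct check, shown on toy barcodes (hypothetical values) with the same column names as `counts`:

```python
import pandas as pd

# Toy barcode pairs (hypothetical values).
toy = pd.DataFrame({
    "oBC": ["AAAA", "AAAA", "CCCC", "GGGG"],
    "mBC": ["TTTA", "TTTA", "TTTC", "TTTG"],
})

# For each oBC, how many distinct mBCs does it appear with (and vice versa)?
mbc_per_obc = toy.groupby("oBC")["mBC"].nunique()
obc_per_mbc = toy.groupby("mBC")["oBC"].nunique()

# One-to-one means every barcode has exactly one partner in both directions.
one_to_one = bool((mbc_per_obc == 1).all() and (obc_per_mbc == 1).all())
print(one_to_one)
```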


- look into modeling/stats packages