# Experimental design

### Biological

- mice carrying the Sleeping Beauty (SB) transposon system
- harvest left/right tumor and spleen after transposition
- amplification at the IRR and IRL sites of the transposon
- Illumina paired-end reads

### Computational

- preprocess reads
- preprocess insertions
- get read counts grouped by insertion site (IS) per chromosome and sample
- pseudo common insertion sites (pCIS) are found using a graph-based proximity approach on the IS
  - each chromosome is the actual graph object
  - an IS is represented as a node in the graph
  - a pCIS is a distinct subgraph (connected component) within the chromosome graph
- pCIS -> CIS based on a significance statistic
  - currently using log fold change (LFC) or a binomial test for significance, not any graph-based form of significance
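
The graph-based grouping above can be sketched as follows. This is a minimal illustration, not the `netcis` implementation: `build_pcis` is a hypothetical helper, and it assumes two IS are linked when they lie within `threshold` base pairs of each other.

```python
import networkx as nx

def build_pcis(positions, threshold):
    """Group IS positions on one chromosome into pCIS (hypothetical helper).

    Assumption: an edge joins two IS that are at most `threshold` bp apart.
    """
    G = nx.Graph()
    G.add_nodes_from(positions)  # one node per insertion site
    ordered = sorted(set(positions))
    # Linking only neighbouring sorted positions gives the same connected
    # components as linking every pair within the threshold.
    for left, right in zip(ordered, ordered[1:]):
        if right - left <= threshold:
            G.add_edge(left, right)
    # each connected component of the chromosome graph is one pCIS
    return [sorted(component) for component in nx.connected_components(G)]

pcis = build_pcis([100, 150, 190, 5_000, 5_020, 90_000], threshold=100)
print(pcis)
```

An isolated IS still forms its own single-node pCIS, so filtering by significance is left to the later pCIS -> CIS step.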

# IS significance

### not read normalized - set a read threshold

- fewer reads than the threshold: no IS event (a count of 1 is implicitly used)
- more reads than the threshold: it is an IS event
- start with a threshold of 10 reads, but results will likely change with this choice
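
A small sketch of the thresholding idea, using made-up data. The helper `call_is_events` is hypothetical, and treating "more than" as strictly greater is an assumption, since the notes do not say how a count equal to the threshold is handled.

```python
import pandas as pd

# toy read counts per sample and position (made-up values)
reads = pd.DataFrame({
    "sampleID": ["s1", "s1", "s2", "s2"],
    "pos": [100, 250, 100, 900],
    "count": [3, 42, 10, 11],
})

def call_is_events(df, threshold=10):
    """Keep sites with more than `threshold` reads; each kept row is one IS event."""
    # assumption: a count equal to the threshold is NOT an event
    return df[df["count"] > threshold].assign(is_event=1)

events = call_is_events(reads, threshold=10)
print(events)

# rerunning with other thresholds shows how sensitive the calls are
for t in (2, 10, 20):
    print(t, len(call_is_events(reads, threshold=t)))
```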

### read normalized - use a penetrance(like) score

- determine the strength of the IS
- add 1 pseudo count divided by read depth or the reads per million mapped to a site
- read depth normalization is done per sample and per chromosome
- divide read counts at IS by the total reads on the chromosome
- look into outliers
- So for normalization, I can think of two ways: counts per million (CPM) and transposon copy number (TCN)
  - CPM: scale counts by library read counts (total reads in the IRR or the IRL library, separately, per sample)
  - TCN: scale counts by transposon read counts (total reads in the IRR and IRL libraries combined, per sample)
- TODO: make plots and slides to capture the above things
  - avg total read count per chromosome across all samples and controls
  - plot the normalized values and replot with the read sum. It should have approx the same read depth
  - read count normalized per site
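
The per-sample, per-chromosome depth normalization can be sketched like this. It is only an illustration on made-up data: the column names follow the ones used in the notebook cells (`sampleID`, `chr`, `pos`, `count`), and scaling to one million per group is an assumption about how CPM is applied here.

```python
import pandas as pd

# toy long-format read counts (made-up values)
reads = pd.DataFrame({
    "sampleID": ["s1", "s1", "s1", "s2", "s2"],
    "chr": ["chr1", "chr1", "chr2", "chr1", "chr1"],
    "pos": [100, 250, 40, 100, 900],
    "count": [30, 70, 50, 5, 15],
})

# total reads per sample and chromosome, aligned back onto every row
depth = reads.groupby(["sampleID", "chr"])["count"].transform("sum")

# divide each site's reads by its group depth and scale to one million
reads["CPM"] = reads["count"] / depth * 1e6

# sanity check from the TODO list: after normalization every
# sample/chromosome group should have the same total
print(reads.groupby(["sampleID", "chr"])["CPM"].sum())
```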

# Statistics

### hypergeometric/Fisher's exact test

- choose a threshold and ask whether each sample has more or fewer insertions than this
- do this per sample in a pCIS and count how many samples meet the threshold
- use total valid samples. How many insertions are needed for a sample to be valid? The default is 0, meaning the sample has any reads in a pCIS
- this gives a binary profile for cases and controls that is then used in Fisher's exact test
- it is then easy to test different thresholds
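
As a rough sketch of the binary-profile idea: the helper `pcis_fisher` and the per-sample counts below are hypothetical, and only `scipy.stats.fisher_exact` is a real library call.

```python
from scipy.stats import fisher_exact

def pcis_fisher(case_counts, control_counts, threshold=0):
    """Fisher's exact test on binary sample profiles for one pCIS.

    case_counts / control_counts: hypothetical per-sample insertion counts
    in the pCIS, one entry per valid sample (zeros included).
    """
    # binary profile: does the sample exceed the threshold in this pCIS?
    case_hits = [x > threshold for x in case_counts]
    control_hits = [x > threshold for x in control_counts]
    # rows: case, control; columns: in pCIS, not in pCIS
    table = [
        [sum(case_hits), len(case_hits) - sum(case_hits)],
        [sum(control_hits), len(control_hits) - sum(control_hits)],
    ]
    _, pvalue = fisher_exact(table)
    return table, pvalue

case = [3, 0, 5, 2, 1, 0]
control = [0, 0, 0, 1, 0, 0]
for t in (0, 2):
    table, p = pcis_fisher(case, control, threshold=t)
    print(t, table, p)
```

Because the threshold only changes the binary profile, trying several thresholds means rebuilding the 2x2 table and nothing else.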

### quantitative

- sum the read-depth-normalized counts per sample, then use a rank-sum test
- use a one-sided test (why this?)
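
A short sketch of the quantitative test on made-up per-sample sums. On the open question of sidedness: a one-sided test only makes sense if the hypothesis is directional (for example, more insertion signal in tumor than in spleen), and it will not flag pCIS that are depleted in the case group. The values below are invented for illustration.

```python
from scipy.stats import ranksums

# hypothetical summed, depth-normalized counts per sample within one pCIS
case_sums = [120.0, 340.5, 98.2, 410.0, 275.3]
control_sums = [15.0, 0.0, 42.7, 8.1, 60.4]

# one-sided: is the case distribution shifted above the control distribution?
stat, p_one_sided = ranksums(case_sums, control_sums, alternative="greater")

# two-sided version for comparison
_, p_two_sided = ranksums(case_sums, control_sums)

print(stat, p_one_sided, p_two_sided)
```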

### global

- how many CISs are in each group

### Misc

- could try testing right and left rank sums versus spleen
- use FDR correction because we are testing many pCISs at once
- can stick with the union of the two pCIS ranges. Then get into the interesting genomic features for the identified CISs
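
For the FDR point, one common choice is the Benjamini-Hochberg procedure. The notes do not say which correction is intended, so this is an assumed method, and `benjamini_hochberg` is a hypothetical helper written with plain NumPy.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # scale the i-th smallest p-value by n / i
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, working down from the largest p-value
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# one p-value per pCIS (made-up values)
qvals = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
print(qvals)
```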


# cis_networks.py

In [1]:
import sys
from pathlib import Path
from multiprocessing import Pool
import pickle

import numpy as np
import pandas as pd
from tqdm import tqdm
from IPython.display import display
import networkx as nx

module_path = "/home/fisch872/mat/projects/Laura-SB-Analysis/NetCIS/"
sys.path.append(module_path)
from netcis import cis_networks as cn

from importlib import reload


args = {
    # "output_prefix" : "/project/cs-myers/MathewF/projects/Laura-SB-Analysis/2023-SB-screen/output/ACF_SCF/GRCm39/results",
    "output_prefix": "/project/cs-myers/MathewF/projects/Laura-SB-Analysis/2020_SB-output/GRCm39/results",
    "verbose": 0,
    "threshold": 50000,
    "njobs": 1,
    }

args["insertion_dir"] = Path(args["output_prefix"] + "-insertions")
args["depth_dir"] = Path(args["output_prefix"] + "-insertions-depth")
args["output"] = Path(args["output_prefix"] + "-graphs")


In [2]:
reload(cn)

# new way of doing things
insertion_list = [ pd.read_csv(file, sep="\t") for file in args["depth_dir"].iterdir() ]
inserts_df = pd.concat(insertion_list, ignore_index=True)
inserts_df.insert(4, "counts_irr", np.where(inserts_df['library'] == 'IRR', 1, 0))
inserts_df.insert(5, "counts_irl", np.where(inserts_df['library'] == 'IRL', 1, 0))
# display(inserts_df)

def create_graph(iter_args):
    chrom_df, save_dir, threshold, verbose = iter_args
    
    G = nx.Graph()
    cols = ["CPM", "counts_irr", "counts_irl"]
    tmp_group = chrom_df.groupby(by=['chr', 'pos'], sort=False, as_index=False, dropna=False)
    insertion_nodes_df = tmp_group[cols].sum()
    insertion_nodes_df.insert(2, "counts", tmp_group['count'].count().pop('count'))

    # add in info about which samples are in each insertion site
    tmp_samples = chrom_df.groupby(by=['chr', 'pos'], sort=False, as_index=False, dropna=False)["sampleID"].apply(lambda x: x.unique())
    if tmp_samples.size == 0:
        insertion_nodes_df["n_samples"] = 0
        insertion_nodes_df["sample_IDs"] = []
    else:
        insertion_nodes_df.insert(6, "n_samples", tmp_samples["sampleID"].apply(lambda x: len(x)))
        insertion_nodes_df.insert(6, "sample_IDs", tmp_samples["sampleID"].apply(lambda x: list(x)).to_list())

    # add nodes and edges to graph
    G.add_nodes_from(cn.add_nodes(insertion_nodes_df))
    G.add_edges_from(cn.find_edges(G.nodes(), threshold))
    
    if verbose > 1:
        cn.graph_properties(G)

    # save the graph
    nx.write_gml(G, save_dir / "G.gml")
    
    # save subgraphs from graph
    subgraphs_by_nodes = sorted(nx.connected_components(G), key=len, reverse=True)
    subgraphs = [ G.subgraph(x) for x in subgraphs_by_nodes ]
    with open(save_dir / "subgraphs.pickle", "wb") as f:
        pickle.dump(subgraphs, f, pickle.HIGHEST_PROTOCOL)

chrom_list = np.unique(inserts_df["chr"].to_numpy())
treatment_list = inserts_df["treatment"].unique()

# total unique samples across all treatments
total_samples = inserts_df["sampleID"].unique().shape[0]
metadata = {"total": total_samples}

for treatment in treatment_list:
    print(treatment)
    # prepare output
    out_dir = args['output'] / treatment
    out_dir.mkdir(parents=True, exist_ok=True)
    
    treatment_df = inserts_df[inserts_df["treatment"] == treatment]
    metadata[treatment] = treatment_df["sampleID"].unique().shape[0]
    
    # don't allow more jobs than there are chromosomes
    jobs = args["njobs"]
    num_chr = len(chrom_list)
    if num_chr < jobs:
        print(f"Reducing number of jobs from {jobs} to {num_chr}, since there are only {num_chr} chromosomes present.")
        jobs = len(chrom_list)
        
    # construct CIS network per chromosome for treatment insertion
    iter_gen = cn.create_graph_generator(chrom_list, treatment_df, out_dir, args)
    iter_gen = tqdm(iter_gen)
    with Pool(jobs) as p:
        for _ in p.imap_unordered(create_graph, iter_gen):
            pass
        p.close()
        
# save sample numbers as meta data for network analysis
samples, counts = zip(*metadata.items())
meta_df = pd.DataFrame({"samples": samples, "counts": counts})
meta_df.to_csv(args['output'].parent / "samples_with_insertions.csv", index=False)

LT


22it [00:01, 12.49it/s]


RT


22it [00:02, 10.91it/s]


S


22it [00:04,  4.80it/s]


# network_analysis.py

In [87]:
import sys, os
import pickle
from pathlib import Path
from multiprocessing import Pool

import pandas as pd
import numpy as np 
import seaborn.objects as so
from seaborn import axes_style
import networkx as nx
from scipy.stats import binomtest, ranksums, fisher_exact, boschloo_exact
from tqdm import tqdm

from IPython.display import display
from importlib import reload

module_path = "/home/fisch872/mat/projects/Laura-SB-Analysis/NetCIS/"
sys.path.append(module_path)
from netcis import network_analysis as na
reload(na)

<module 'netcis.network_analysis' from '/home/fisch872/mat/projects/Laura-SB-Analysis/NetCIS/netcis/network_analysis.py'>

In [153]:
refdata = Path("/project/cs-myers/MathewF/projects/Laura-SB-Analysis/2023-SB-screen/ref_data/GRCm39")

args = {
    # "output_prefix": "/project/cs-myers/MathewF/projects/Laura-SB-Analysis/2023-SB-screen/output/GRCm39/results", 
    "output_prefix": "/project/cs-myers/MathewF/projects/Laura-SB-Analysis/2020_SB-output/GRCm39/results", 
    "ta_dir": refdata / "ta_files",
    "gene_annot": refdata / "MRK_List2.rpt",
    "ta_error": 5,
    "pval_threshold": 0.05,
    "verbose": 1,
    "case": "LT",  #    CAR     ACF    LT RT
    "control": "S",  # NoCAR   SCF    S 
    "njobs": 22,
}

graph_dir = Path(args["output_prefix"] + "-graphs/")
args["graph_dir"] = graph_dir
output = Path(args["output_prefix"] + "-analysis-new")
output.mkdir(exist_ok=True)

ta_dir = args["ta_dir"]
gene_annot = args["gene_annot"]
ta_error = args["ta_error"]
pval_threshold = args["pval_threshold"]
verbose = args["verbose"]
case = args["case"]
control = args["control"]
njobs = args["njobs"]

output_res = output / f"{case}-{control}"
output_res.mkdir(exist_ok=True)

In [93]:
annot_df = pd.read_csv(gene_annot, sep="\t")
annot_df = annot_df[pd.notna(annot_df["genome coordinate start"])].drop("Status", axis=1)
annot_df["chrom"] = annot_df["Chr"].apply(lambda x: f"chr{x}")
annot_df = annot_df.sort_values(["chrom"]).reset_index(drop=True)

bed_files = {file.name.split(".")[0]: file for file in args["ta_dir"].iterdir()}

chroms = sorted([ chrom.name for chrom in (graph_dir / case).iterdir() ])
print(chroms)
print(len(chroms))

# don't allow more jobs than there are chromosomes
jobs = args["njobs"]
num_chr = len(chroms)
if num_chr < jobs:
    print(f"Reducing number of jobs from {jobs} to {num_chr}, since there are only {num_chr} chromosomes present.")
    jobs = len(chroms)

['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chrM', 'chrX', 'chrY']
22


## test everything

In [160]:
# iter_args = tqdm([ (chrom, annot_df[annot_df["chrom"] == chrom], bed_files[chrom], args) for chrom in chroms ])
reload(na)
iter_args = [ (chrom, annot_df[annot_df["chrom"] == chrom], bed_files[chrom], args) for chrom in chroms ]
print("chrom\tCIS/pCIS")
with Pool(args["njobs"]) as p:
    res_dict_list = [ x for x in p.imap_unordered(na.chrom_analysis, iter_args) ]
    
# join chromosomes results together  
IS_list = []
pCIS_list = []
CIS_list = []
for res_dict in res_dict_list:
    IS_list.append(res_dict["is"])
    pCIS_list.append(res_dict["pcis"])
    CIS_list.append(res_dict["cis"])

IS_df = pd.concat(IS_list, ignore_index=True)
pCIS_df = pd.concat(pCIS_list, ignore_index=True)
CIS_df = pd.concat(CIS_list, ignore_index=True)

# save results
IS_df.to_csv(output_res / "IS.tsv", sep="\t", index=False)
pCIS_df.to_csv(output_res / "pCIS.tsv", sep="\t", index=False)
CIS_df.to_csv(output_res / "CIS.tsv", sep="\t", index=False)

chrom	CIS/pCIS
chrM	1/1
chr19	141/197
chrY	105/191
chr17	262/350
chr18	222/323
chr16	288/403
chr15	308/420
chr11	326/436
chr13	304/431
chr8	347/481
chr14	357/492
chr10	372/511
chr12	369/495
chr7	415/546
chr6	409/567
chr5	424/559
chr3	442/639
chr9	452/585
chrX	500/768
chr2	487/671
chr4	781/952
chr1	1018/1212


## test individual chrom, minimal functions

In [110]:
iter_args = [ (chrom, annot_df[annot_df["chrom"] == chrom], bed_files[chrom], args) for chrom in chroms ]
# with Pool(args["njobs"]) as p:
#     res_dict_list = [ x for x in p.imap_unordered(chrom_analysis, iter_args) ]
chrom, annot_chrom_df, chrom_bed_file, args = iter_args[0]
print(chrom)


graph_dir = args["graph_dir"]
case = args["case"]
control = args["control"]
ta_error = args["ta_error"]
pval_threshold = args["pval_threshold"]
verbose = args["verbose"]
gene_expander = 50000  # TODO: add to input args


# get bed chromosome file to find positions of TAs
bed_chrom_df = pd.read_csv(chrom_bed_file, sep="\t", header=None)

# get basic stats on case chromosome subgraphs 
with open(graph_dir / case / chrom / "subgraphs.pickle", 'rb') as f:
    case_chrom_subgraphs = pickle.load(f)
case_chrom_df = na.get_subgraph_stats(case_chrom_subgraphs, case, chrom, bed_chrom_df, ta_error)
    
# get basic stats on control chromosome subgraphs 
with open(graph_dir / control / chrom / "subgraphs.pickle", 'rb') as f:
    control_chrom_subgraphs = pickle.load(f)
control_chrom_df = na.get_subgraph_stats(control_chrom_subgraphs, control, chrom, bed_chrom_df, ta_error)


# get total samples for case and controls
# double list comprehension https://stackoverflow.com/questions/17657720/python-list-comprehension-double-for
# but using a set to remove duplicates
case_samples = { x for y in case_chrom_df["sample_IDs"] for x in y }
control_samples = { x for y in control_chrom_df["sample_IDs"] for x in y }
num_cases = len(case_samples)
num_controls = len(control_samples)


chr1


In [139]:
# find the overlapping pCIS between case and control subgraphs
reload(na)
overlap_df = na.pcis_overlaps(case_chrom_df, control_chrom_df)

In [152]:
# compare pcis
IS_df_list = []
pCIS_df_list = []

for overlap in overlap_df.itertuples():
    # get normalized read counts (CPM) for each pCIS
    case_ind = overlap.case
    control_ind = overlap.control
    
    # if just case 
    if control_ind is None or np.isnan(control_ind):
        case_G = case_chrom_subgraphs[case_ind]
        case_pos = [ case_G.nodes[node]['position'] for node in case_G.nodes ]
        tmp_case = pd.DataFrame([ {"case_count": case_G.nodes[node]['CPM']} for node in case_G.nodes ], index=case_pos)
        tmp_case["case_index"] = case_ind
        
        tmp = tmp_case
        tmp["control_count"] = 0.0
        tmp["control_index"] = np.nan
        
        num_case_samples = len({ x for y in [ case_G.nodes[node]["sample_IDs"] for node in case_G.nodes ] for x in y })
        num_control_samples = 0
        
        case_pos_min = min(case_pos)
        case_pos_max = max(case_pos)
        control_pos_min = None
        control_pos_max = None
        
        case_IS = len(tmp)
        control_IS = 0
        
    # if just control
    elif case_ind is None or np.isnan(case_ind):
        control_G = control_chrom_subgraphs[control_ind]
        control_pos = [ control_G.nodes[node]['position'] for node in control_G.nodes ]
        tmp_control = pd.DataFrame([ {"control_count": control_G.nodes[node]['CPM']} for node in control_G.nodes ], index=control_pos)
        tmp_control["control_index"] = control_ind
        
        tmp = tmp_control
        tmp["case_count"] = 0.0
        tmp["case_index"] = np.nan
        
        num_case_samples = 0
        num_control_samples = len({ x for y in [ control_G.nodes[node]["sample_IDs"] for node in control_G.nodes ] for x in y })
    
        case_pos_min = None
        case_pos_max = None
        control_pos_min = min(control_pos)
        control_pos_max = max(control_pos)

        case_IS = 0
        control_IS = len(tmp)
        
    # if one case and one control
    elif type(case_ind) is not list and type(control_ind) is not list:
        case_G = case_chrom_subgraphs[case_ind]
        case_pos = [ case_G.nodes[node]['position'] for node in case_G.nodes ]
        tmp_case = pd.DataFrame([ {"case_count": case_G.nodes[node]['CPM']} for node in case_G.nodes ], index=case_pos)
        # tag the case frame itself; `tmp` is only built by the join below
        tmp_case["case_index"] = case_ind
        
        control_G = control_chrom_subgraphs[control_ind]
        control_pos = [ control_G.nodes[node]['position'] for node in control_G.nodes ]
        tmp_control = pd.DataFrame([ {"control_count": control_G.nodes[node]['CPM']} for node in control_G.nodes ], index=control_pos)
        tmp_control["control_index"] = control_ind
        
        tmp = tmp_case.join(tmp_control, how="outer")
        
        num_case_samples = len({ x for y in [ case_G.nodes[node]["sample_IDs"] for node in case_G.nodes ] for x in y })
        num_control_samples = len({ x for y in [ control_G.nodes[node]["sample_IDs"] for node in control_G.nodes ] for x in y })
        
        case_pos_min = min(case_pos)
        case_pos_max = max(case_pos)
        control_pos_min = min(control_pos)
        control_pos_max = max(control_pos)
        
        case_IS = len(tmp_case)
        control_IS = len(tmp_control)
        
    # if multiple case 
    elif type(case_ind) is list:
        # get single control
        control_G = control_chrom_subgraphs[control_ind]
        control_pos = [ control_G.nodes[node]['position'] for node in control_G.nodes ]
        tmp_control = pd.DataFrame([ {"control_count": control_G.nodes[node]['CPM']} for node in control_G.nodes ], index=control_pos)
        tmp_control["control_index"] = control_ind
        
        case_samples = set()
        num_control_samples = len({ x for y in [ control_G.nodes[node]["sample_IDs"] for node in control_G.nodes ] for x in y })
        
        # get multiple cases
        tmp_case_list = []
        tmp_case_pos = []
        for case_index in case_ind:
            case_G = case_chrom_subgraphs[case_index]
            case_position = [ case_G.nodes[node]['position'] for node in case_G.nodes ]
            tmp_case_pos.extend(case_position)
            tmp_case = pd.DataFrame([ {"case_count": case_G.nodes[node]['CPM']} for node in case_G.nodes ], index=case_position)
            tmp_case["case_index"] = int(case_index)
            tmp_case_list.append(tmp_case)
            case_samples = case_samples.union({ x for y in [ case_G.nodes[node]["sample_IDs"] for node in case_G.nodes ] for x in y })
        num_case_samples = len(case_samples)
        tmp_cases = pd.concat(tmp_case_list, axis=0)
        tmp = tmp_control.join(tmp_cases, how="outer")
        
        case_pos_min = min(tmp_case_pos)
        case_pos_max = max(tmp_case_pos)
        control_pos_min = min(control_pos)
        control_pos_max = max(control_pos)

        case_IS = len(tmp_cases)
        control_IS = len(tmp_control)
        
    # if multiple control
    elif type(control_ind) is list:
        # get single case
        case_G = case_chrom_subgraphs[case_ind]
        case_pos = [ case_G.nodes[node]['position'] for node in case_G.nodes ]
        tmp_case = pd.DataFrame([ {"case_count": case_G.nodes[node]['CPM']} for node in case_G.nodes ], index=case_pos)
        tmp_case["case_index"] = case_ind

        num_case_samples = len({ x for y in [ case_G.nodes[node]["sample_IDs"] for node in case_G.nodes ] for x in y })
        control_samples = set()
        
        # get multiple controls
        tmp_control_list = []
        tmp_control_pos = []
        for control_index in control_ind:
            control_G = control_chrom_subgraphs[control_index]
            control_pos = [ control_G.nodes[node]['position'] for node in control_G.nodes ]
            tmp_control_pos.extend(control_pos)
            tmp_control = pd.DataFrame([ {"control_count": control_G.nodes[node]['CPM']} for node in control_G.nodes ], index=control_pos)
            tmp_control["control_index"] = int(control_index)
            tmp_control_list.append(tmp_control)
            control_samples = control_samples.union({ x for y in [ control_G.nodes[node]["sample_IDs"] for node in control_G.nodes ] for x in y })
        num_control_samples = len(control_samples)
        tmp_controls = pd.concat(tmp_control_list, axis=0)
        tmp = tmp_case.join(tmp_controls, how="outer")
        
        case_pos_min = min(case_pos)
        case_pos_max = max(case_pos)
        control_pos_min = min(tmp_control_pos)
        control_pos_max = max(tmp_control_pos)

        # single case frame and concatenated control frames in this branch
        case_IS = len(tmp_case)
        control_IS = len(tmp_controls)
        
    else:
        print(overlap)
        print("this shouldn't happen")
        # skip the stats below, which would otherwise reuse `tmp` from a previous overlap
        continue
    
    # only fillna in case_count and control_count
    tmp["case_count"] = tmp["case_count"].fillna(0.0)
    tmp["control_count"] = tmp["control_count"].fillna(0.0)
    tmp = tmp.reset_index(drop=False).rename(columns={"index": "pos"})


    # run stats for each pCIS
    # NOTE: binomtest takes only integers, so the normalized read counts are truncated to integers with int()
    # get stats per TA site (only count is used)
    # used pseudo count of 1 for log fold change, and so I wanted to show the difference in binomial test and significance with this
    tmp["target_binom_pval"] = tmp.apply(lambda x: binomtest( int(x["case_count"]) + 1, int(x["case_count"] + x["control_count"]) + 1 ).pvalue, axis=1)
    tmp["target_binom_sig"] = tmp["target_binom_pval"] < pval_threshold
    tmp["LFC"] = tmp.apply(lambda x: np.log2((x["case_count"] + 1) / (x["control_count"] + 1)), axis=1)

    rs = ranksums(tmp["case_count"], tmp["control_count"]).pvalue
    binom = binomtest(int(tmp["case_count"].sum()) + 1, int(tmp["case_count"].sum() + tmp["control_count"].sum()) + 1, 0.5).pvalue
    
    total_IS = len(tmp)
    sig_IS = tmp["target_binom_sig"].sum()
    
    # contingency table = [[a, b], [c, d]]
    #            in pCIS   not in pCIS
    # target        a           b
    # reference     c           d
    a = num_case_samples
    b = num_cases - num_case_samples
    c = num_control_samples
    d = num_controls - num_control_samples
    if a < 0 or b < 0 or c < 0 or d < 0:
        print(chrom)
        print(a, b, num_cases)
        print(c, d, num_controls)
    fi = fisher_exact([[a, b], [c, d]]).pvalue
    
    tmp2 = {
        "case_index": case_ind,
        "case_pos_min": case_pos_min,
        "case_pos_max": case_pos_max,
        "control_index": control_ind,
        "control_pos_min": control_pos_min,
        "control_pos_max": control_pos_max,
        
        "ranksums": rs,
        "binomial": binom,
        "fishers_exact": fi,
    
        "case_num_samples": num_case_samples,
        "control_num_samples": num_control_samples,
        
        "total_IS": total_IS,
        "case_IS": case_IS,
        "control_IS": control_IS,
        "case_total_read_count": tmp["case_count"].sum(),
        "control_total_read_count": tmp["control_count"].sum(),
        }
    
    IS_df_list.append(tmp)
    pCIS_df_list.append(tmp2)
    
IS_df = pd.concat(IS_df_list, ignore_index=True)
IS_df["case"] = case
IS_df["control"] = control
IS_df["chrom"] = chrom

pCIS_df = pd.DataFrame(pCIS_df_list)
pCIS_df["case"] = case
pCIS_df["control"] = control
pCIS_df["chrom"] = chrom

In [107]:
# get genes for each pCIS


# gene_expander = 50000
gene_expander = 0

# trim down annotation dataframe to just gene
annot_chrom_genes = annot_chrom_df[annot_chrom_df["Marker Type"] == "Gene"]
gene_names = annot_chrom_genes["Marker Symbol"].to_numpy()

pos_min = pCIS_df[["case_pos_min", "control_pos_min"]].min(axis=1).to_numpy().reshape(-1, 1)
pos_max = pCIS_df[["case_pos_max", "control_pos_max"]].max(axis=1).to_numpy().reshape(-1, 1)
gene_start = (annot_chrom_genes["genome coordinate start"] - gene_expander).to_numpy().reshape(1, -1)
gene_end = (annot_chrom_genes["genome coordinate end"] + gene_expander).to_numpy().reshape(1, -1)
tmp = (pos_min <= gene_end) & (pos_max >= gene_start)

pCIS_df["genes"] = [ list(gene_names[tmp[i]]) for i in range(tmp.shape[0]) ]


# search for a gene
gene_search = "Aak1"
pCIS_df[pCIS_df["genes"].apply(lambda x: gene_search in x)]

Unnamed: 0,case_index,case_pos_min,case_pos_max,control_index,control_pos_min,control_pos_max,ranksums,binomial,fishers_exact,case_num_samples,control_num_samples,total_IS,case_IS,control_IS,case_total_read_count,control_total_read_count,case,control,chrom,genes
5,14.0,86858017.0,86858024.0,,,,0.04953461,3.889385e-62,1.0,1,0,3,3,0,204.786894,0.0,LT,S,chr6,"[Aak1, wa1l, Sndy1]"
354,,,,211.0,86976665.0,86976665.0,0.3173105,7.801456e-131,0.412903,0,1,1,0,1,0.0,441.27086,LT,S,chr6,"[Aak1, wa1l, Sndy1]"
534,1.0,86910505.0,86911207.0,1.0,86886567.0,86910900.0,3.931046e-10,0.0,0.355551,64,50,108,101,77,151431.857664,23456.878311,LT,S,chr6,"[Aak1, wa1l, Sndy1]"


In [108]:
# filter out insignificant pCISes to get CISes
# a pCIS is kept if ANY of the three tests passes the p-value threshold
CIS_df = pCIS_df[
    (pCIS_df["fishers_exact"] <= pval_threshold)
    | (pCIS_df["ranksums"] <= pval_threshold)
    | (pCIS_df["binomial"] <= pval_threshold)
]
CIS_df


Unnamed: 0,case_index,case_pos_min,case_pos_max,control_index,control_pos_min,control_pos_max,ranksums,binomial,fishers_exact,case_num_samples,control_num_samples,total_IS,case_IS,control_IS,case_total_read_count,control_total_read_count,case,control,chrom,genes
0,6,133492835.0,133505740.0,,,,1.745119e-03,1.535690e-238,0.021223,8,0,7,7,0,790.045473,0.000000,LT,S,chr6,[Sndy1]
1,8,29174318.0,29174323.0,,,,2.092134e-02,2.980232e-08,1.000000,1,0,4,4,0,25.663399,0.000000,LT,S,chr6,"[Prrt4, tint]"
2,9,30608702.0,30608706.0,,,,2.092134e-02,0.000000e+00,1.000000,1,0,4,4,0,3059.975520,0.000000,LT,S,chr6,"[Gm31453, Hdp1, tint]"
3,10,16307995.0,16350128.0,,,,2.092134e-02,9.055679e-72,0.512023,2,0,4,4,0,236.923721,0.000000,LT,S,chr6,[]
4,11,90544038.0,90544042.0,,,,2.092134e-02,6.077163e-64,1.000000,1,0,4,4,0,210.737582,0.000000,LT,S,chr6,"[Aldh1l1, wa1l, Sndy1]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
559,125,66218819.0,66218819.0,94,66218819.0,66225910.0,4.385780e-01,7.366511e-06,0.306937,1,3,2,1,2,28.026906,75.798475,LT,S,chr6,"[Gm36408, Hdp1, Sndy1]"
563,183,103626139.0,103626139.0,3,103626021.0,103647633.0,1.333961e-11,3.789431e-279,0.000017,1,14,33,1,32,2.593428,950.402410,LT,S,chr6,"[Chl1, ssl, Sndy1]"
564,184,111438186.0,111438186.0,38,111349308.0,111438186.0,8.326452e-02,1.548895e-09,0.082194,1,5,4,1,4,10.606703,61.147122,LT,S,chr6,"[Grm7, ssl, Sndy1]"
565,"[3, 21]",133573805.0,133676524.0,5,133553711.0,133676524.0,8.419395e-02,0.000000e+00,0.120192,24,10,25,16,12,5597.302232,641.691134,LT,S,chr6,[Sndy1]


In [84]:
# search for a gene
gene_search = "Aak1"
CIS_df[CIS_df["genes"].apply(lambda x: gene_search in x)]

Unnamed: 0,case_index,case_pos_min,case_pos_max,control_index,control_pos_min,control_pos_max,ranksums,binomial,fishers_exact,case_num_samples,control_num_samples,total_IS,case_IS,control_IS,case_total_read_count,control_total_read_count,case,control,chrom,genes
5,14.0,86858017.0,86858024.0,,,,0.04953461,3.889385e-62,1.0,1,0,3,3,0,204.786894,0.0,LT,S,chr6,"[Aak1, wa1l, Sndy1]"
354,,,,211.0,86976665.0,86976665.0,0.3173105,7.801456e-131,0.412903,0,1,1,0,1,0.0,441.27086,LT,S,chr6,"[Aak1, wa1l, Sndy1]"
534,1.0,86910505.0,86911207.0,1.0,86886567.0,86910900.0,3.931046e-10,0.0,0.355551,64,50,108,101,77,151431.857664,23456.878311,LT,S,chr6,"[Aak1, wa1l, Sndy1]"
