# Experimental design

### Biological

- Mice carrying the Sleeping Beauty (SB) transposon system
- harvest left/right tumor and spleen after transposition
- amplification at the IRR and IRL sites of the transposon
- Illumina paired-end reads

### Computational

- preprocess reads
- preprocess insertions
- get read counts grouped by insertion site (IS), per chromosome and sample
- a pseudo common insertion site (pCIS) is found with a graph-based proximity approach over the IS
  - chromosome is the actual graph object
  - an IS is represented as a node in the graph
  - pCIS is a distinct subgraph within the chromosome graph
- a pCIS is promoted to a CIS based on a significance statistic
  - currently log fold change (LFC) or a binomial test; no graph-based measure of significance is used yet
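
The proximity step can be sketched without a graph library. Assuming edges join insertion sites that lie within a fixed distance of each other (the actual rule lives in `cn.find_edges` and is not shown in these notes), the connected components of a chromosome graph are exactly the runs of sorted positions whose consecutive gaps stay within that distance. The names `group_pcis` and `max_gap` are hypothetical.

```python
def group_pcis(positions, max_gap):
    """Group insertion-site positions on one chromosome into pCIS candidates.

    Equivalent to the connected components of a graph whose edges join
    sites at most `max_gap` bp apart (assumed edge rule).
    """
    groups = []
    for pos in sorted(set(positions)):
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)   # close enough: same component
        else:
            groups.append([pos])     # gap too large: start a new component
    # largest components first
    return sorted(groups, key=len, reverse=True)


sites = [100, 104, 109, 5000, 5003, 90000]   # made-up positions
for pcis in group_pcis(sites, max_gap=10):
    print(pcis[0], pcis[-1], len(pcis))
```

A site with no neighbour inside `max_gap` ends up as a single-node pCIS, which matches the many one-node subgraphs in the chr1 output further down.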

# IS significance

### Not read-normalized: set a read-count threshold

- below the threshold: no IS event (a count of 1 is implicitly used)
- above the threshold: it is an IS event
- start with a threshold of 10 reads; results will likely depend on this choice
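
A minimal sketch of how sensitive the event calls are to the threshold. The function names are hypothetical, the read counts are made up, and treating a count equal to the threshold as an event is an assumption, since the notes only cover "less than" and "more than".

```python
def is_event(read_count, threshold=10):
    """Call an IS event when the read count reaches the threshold.

    Whether a count exactly equal to the threshold qualifies is not
    stated in the notes; `>=` is an assumption.
    """
    return read_count >= threshold


def count_events(read_counts, thresholds=(1, 5, 10, 20)):
    """Number of IS events called at each candidate threshold."""
    return {t: sum(is_event(c, t) for c in read_counts) for t in thresholds}


counts = [1, 3, 9, 10, 11, 250]   # made-up read counts at six sites
print(count_events(counts))
```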

### Read-normalized: use a penetrance-like score

- determine the strength of the IS
- add a pseudocount of 1 and divide by read depth, or use the reads per million mapped to a site
- read depth normalization is done per sample and per chromosome
- divide read counts at IS by the total reads on the chromosome
- look into outliers
- two normalization options come to mind: counts per million (CPM) and transposon copy number (TCN)
  - CPM: scale counts by library read counts (total reads in IRR or IRL, per sample)
  - TCN: scale counts by transposon read counts (total reads in IRR and IRL combined, per sample)
- TODO: make plots and slides to capture the above things
  - avg total read count per chromosome across all samples and controls
  - plot the normalized values, then replot against the read sum; all samples should end up with approximately the same read depth
  - read count normalized per site
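
The CPM option and the read-depth sanity check from the TODO can be sketched as follows. The flat `(sample_id, position, read_count)` layout and the function name are assumptions for illustration, and the counts are made up. Normalizing per sample and per chromosome would only change the grouping key to `(sample_id, chrom)`.

```python
from collections import defaultdict


def cpm_normalize(records):
    """Scale raw read counts to counts per million within each sample.

    `records` is a list of (sample_id, position, read_count) tuples; this
    flat layout is an assumption made for the example.
    """
    totals = defaultdict(int)
    for sample_id, _, count in records:
        totals[sample_id] += count
    return [
        (sample_id, pos, count / totals[sample_id] * 1_000_000)
        for sample_id, pos, count in records
    ]


records = [("s1", 100, 30), ("s1", 250, 70), ("s2", 100, 5), ("s2", 900, 15)]
normalized = cpm_normalize(records)

# sanity check: after scaling, every sample sums to the same depth
depth = defaultdict(float)
for sample_id, _, cpm in normalized:
    depth[sample_id] += cpm
print(dict(depth))
```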

# Statistics

### Hypergeometric / Fisher's exact test

- choose a threshold: does a sample have more or fewer insertions than this?
- apply it per sample within a pCIS and count how many samples meet the threshold
- use the total number of valid samples. How many insertions are needed for a sample to be valid? The default is 0, i.e. whether a sample has any reads in the pCIS
- this gives a binary profile for cases and controls, which is then used in Fisher's exact test
- easy to then test different thresholds
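
A sketch of the binary-profile test under these notes, with made-up per-sample counts. The helper names are hypothetical. The exact test is written out with the standard library so the example is self-contained; `scipy.stats.fisher_exact` would be the usual choice in practice.

```python
from math import comb


def meets_threshold(sample_counts, threshold=0):
    """Number of samples whose insertion count in the pCIS exceeds `threshold`."""
    return sum(1 for c in sample_counts if c > threshold)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c

    def pmf(x):
        # hypergeometric probability of seeing x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = pmf(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # sum every table that is at most as likely as the observed one
    p = sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


case_counts = [12, 3, 0, 7, 9, 1]      # per-sample insertions in one pCIS (made up)
control_counts = [0, 0, 1, 0, 0, 0]

for threshold in (0, 2):
    hit_case = meets_threshold(case_counts, threshold)
    hit_ctrl = meets_threshold(control_counts, threshold)
    p = fisher_exact_two_sided(
        hit_case, len(case_counts) - hit_case,
        hit_ctrl, len(control_counts) - hit_ctrl,
    )
    print(threshold, hit_case, hit_ctrl, round(p, 4))
```

Looping over thresholds like this makes it cheap to see how much the result depends on the cutoff.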

### Quantitative

- sum the read-depth-normalized counts per sample, then use a rank-sum test
- use a one-sided test (why this?)
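
A one-sided alternative fits when the question is directional, for example whether case samples carry higher normalized counts than controls. Below is a minimal standard-library sketch of such a rank-sum test using the normal approximation without tie correction; the function names are hypothetical. Recent SciPy versions expose the same choice through the `alternative` argument of `scipy.stats.ranksums` and `scipy.stats.mannwhitneyu`.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def ranksum_greater(case, control):
    """One-sided rank-sum test: are `case` values shifted above `control`?

    Returns (z, p) using the normal approximation without tie correction.
    """
    n1, n2 = len(case), len(control)
    ranks = average_ranks(list(case) + list(control))
    w = sum(ranks[:n1])                          # rank sum of the case group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return z, 1 - NormalDist().cdf(z)


z, p = ranksum_greater([10.0, 11.0, 12.0], [1.0, 2.0, 3.0])   # made-up sums
print(round(z, 3), round(p, 4))
```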

### Global

- how many CISs are in each group

### Misc

- could try testing the right and left tumor rank sums against spleen
- use FDR correction because many pCISs are tested at once
- can stick with the union of the two pCIS ranges, then move on to the interesting genomic features for the identified CISs
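
The notes do not say which FDR procedure is meant. A minimal sketch assuming Benjamini-Hochberg, with one p-value per tested pCIS (the values below are made up):

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


pcis_pvalues = [0.01, 0.04, 0.03, 0.005]   # made-up p-values
print(benjamini_hochberg(pcis_pvalues))
```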


# cis_networks.py

In [1]:
import sys
from pathlib import Path
from multiprocessing import Pool
import pickle

import numpy as np
import pandas as pd
from tqdm import tqdm
from IPython.display import display
import networkx as nx

module_path = "/home/fisch872/mat/projects/Laura-SB-Analysis/NetCIS/"
sys.path.append(module_path)
from netcis import cis_networks as cn

from importlib import reload


args = {
    # "output_prefix" : "/project/cs-myers/MathewF/projects/Laura-SB-Analysis/2023-SB-screen/output/ACF_SCF/GRCm39/results",
    "output_prefix": "/project/cs-myers/MathewF/projects/Laura-SB-Analysis/2020_SB-output/GRCm39/results",
    "verbose": 0,
    "threshold": 50000,
    "njobs": 1,
    }

args["insertion_dir"] = Path(args["output_prefix"] + "-insertions")
args["depth_dir"] = Path(args["output_prefix"] + "-insertions-depth")
args["output"] = Path(args["output_prefix"] + "-graphs")


In [2]:
reload(cn)

# load every insertion-depth table and stack them into one frame
insertion_list = [ pd.read_csv(file, sep="\t") for file in args["depth_dir"].iterdir() ]
inserts_df = pd.concat(insertion_list, ignore_index=True)
# 0/1 flags per row so IRR and IRL rows can be summed per insertion site later
inserts_df.insert(4, "counts_irr", np.where(inserts_df['library'] == 'IRR', 1, 0))
inserts_df.insert(5, "counts_irl", np.where(inserts_df['library'] == 'IRL', 1, 0))
# display(inserts_df)

def create_graph(iter_args):
    chrom_df, save_dir, threshold, verbose = iter_args
    
    G = nx.Graph()
    cols = ["CPM", "counts_irr", "counts_irl"]
    tmp_group = chrom_df.groupby(by=['chr', 'pos'], sort=False, as_index=False, dropna=False)
    insertion_nodes_df = tmp_group[cols].sum()
    insertion_nodes_df.insert(2, "counts", tmp_group['count'].count().pop('count'))

    # add in info about which samples are in each insertion site
    tmp_samples = chrom_df.groupby(by=['chr', 'pos'], sort=False, as_index=False, dropna=False)["sampleID"].apply(lambda x: x.unique())
    if tmp_samples.size == 0:
        insertion_nodes_df["n_samples"] = 0
        insertion_nodes_df["sample_IDs"] = []
    else:
        insertion_nodes_df.insert(6, "n_samples", tmp_samples["sampleID"].apply(lambda x: len(x)))
        insertion_nodes_df.insert(6, "sample_IDs", tmp_samples["sampleID"].apply(lambda x: list(x)).to_list())

    # add nodes and edges to graph
    G.add_nodes_from(cn.add_nodes(insertion_nodes_df))
    G.add_edges_from(cn.find_edges(G.nodes(), threshold))
    
    if verbose > 1:
        cn.graph_properties(G)

    # save the graph
    nx.write_gml(G, save_dir / "G.gml")
    
    # save subgraphs from graph
    subgraphs_by_nodes = sorted(nx.connected_components(G), key=len, reverse=True)
    subgraphs = [ G.subgraph(x) for x in subgraphs_by_nodes ]
    with open(save_dir / "subgraphs.pickle", "wb") as f:
        pickle.dump(subgraphs, f, pickle.HIGHEST_PROTOCOL)

chrom_list = np.unique(inserts_df["chr"].to_numpy())
treatment_list = inserts_df["treatment"].unique()

# total unique samples across all treatments
total_samples = inserts_df["sampleID"].unique().shape[0]
metadata = {"total": total_samples}

for treatment in treatment_list:
    print(treatment)
    # prepare output
    out_dir = args['output'] / treatment
    out_dir.mkdir(parents=True, exist_ok=True)
    
    treatment_df = inserts_df[inserts_df["treatment"] == treatment]
    metadata[treatment] = treatment_df["sampleID"].unique().shape[0]
    
    # don't allow more jobs than there are chromosomes
    jobs = args["njobs"]
    num_chr = len(chrom_list)
    if num_chr < jobs:
        print(f"Reducing number of jobs from {jobs} to {num_chr}, since there are only {num_chr} chromosomes present.")
        jobs = len(chrom_list)
        
    # construct CIS network per chromosome for treatment insertion
    iter_gen = cn.create_graph_generator(chrom_list, treatment_df, out_dir, args)
    iter_gen = tqdm(iter_gen)
    with Pool(jobs) as p:
        for _ in p.imap_unordered(create_graph, iter_gen):
            pass
        p.close()
        
# save sample numbers as meta data for network analysis
samples, counts = zip(*metadata.items())
meta_df = pd.DataFrame({"samples": samples, "counts": counts})
meta_df.to_csv(args['output'].parent / "samples_with_insertions.csv", index=False)

LT


22it [00:01, 12.49it/s]


RT


22it [00:02, 10.91it/s]


S


22it [00:04,  4.80it/s]
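
For readers without pandas at hand, the per-site aggregation inside `create_graph` can be sketched with the standard library. This is a simplified stand-in, not the notebook's code. The column names (`chr`, `pos`, `sampleID`, `library`, `CPM`) come from the cell above, the rows are made up, and `counts` here is a plain row count, whereas the original counts non-null `count` values.

```python
from collections import defaultdict


def site_attributes(rows):
    """Collapse input rows into one attribute dict per (chr, pos)."""
    sites = defaultdict(lambda: {"CPM": 0.0, "counts": 0, "counts_irr": 0,
                                 "counts_irl": 0, "sample_IDs": []})
    for row in rows:
        node = sites[(row["chr"], row["pos"])]
        node["CPM"] += row["CPM"]
        node["counts"] += 1                               # rows seen at this site
        node["counts_irr"] += row["library"] == "IRR"
        node["counts_irl"] += row["library"] == "IRL"
        if row["sampleID"] not in node["sample_IDs"]:
            node["sample_IDs"].append(row["sampleID"])
    for node in sites.values():
        node["n_samples"] = len(node["sample_IDs"])
    return dict(sites)


rows = [  # made-up example rows
    {"chr": "chr1", "pos": 100, "sampleID": "a", "library": "IRR", "CPM": 10.0},
    {"chr": "chr1", "pos": 100, "sampleID": "a", "library": "IRL", "CPM": 5.0},
    {"chr": "chr1", "pos": 100, "sampleID": "b", "library": "IRR", "CPM": 2.5},
    {"chr": "chr1", "pos": 900, "sampleID": "b", "library": "IRL", "CPM": 1.0},
]
nodes = site_attributes(rows)
print(nodes[("chr1", 100)])
```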


# network_analysis.py

In [9]:
import sys, os
import pickle
from pathlib import Path
from multiprocessing import Pool

import pandas as pd
import numpy as np 
import seaborn.objects as so
from seaborn import axes_style
import networkx as nx
from scipy.stats import binomtest, ranksums, mannwhitneyu, wilcoxon, skewtest, kurtosistest
from tqdm import tqdm

from IPython.display import display
from importlib import reload

In [30]:
module_path = "/home/fisch872/mat/projects/Laura-SB-Analysis/NetCIS/"
sys.path.append(module_path)
from netcis import network_analysis_new as na
reload(na)

<module 'netcis.network_analysis_new' from '/home/fisch872/mat/projects/Laura-SB-Analysis/NetCIS/netcis/network_analysis_new.py'>

In [31]:
refdata = Path("/project/cs-myers/MathewF/projects/Laura-SB-Analysis/2023-SB-screen/ref_data/GRCm39")

args = {
    # "output_prefix": "/project/cs-myers/MathewF/projects/Laura-SB-Analysis/2023-SB-screen/output/GRCm39/results", 
    "output_prefix": "/project/cs-myers/MathewF/projects/Laura-SB-Analysis/2020_SB-output/GRCm39/results", 
    "ta_dir": refdata / "ta_files",
    "gene_annot": refdata / "MRK_List2.rpt",
    "ta_error": 5,
    "pval_threshold": 0.05,
    "verbose": 1,
    "case": "LT",  # CAR ACF LT RT
    "control": "S",  # NoCAR SCF S S
    "njobs": 21,
}

args["graph_dir"] = Path(args["output_prefix"] + "-graphs/")

output = Path(args["output_prefix"] + "-analysis-new")
output.mkdir(exist_ok=True)

ta_dir = args["ta_dir"]
gene_annot = args["gene_annot"]
ta_error = args["ta_error"]
pval_threshold = args["pval_threshold"]
verbose = args["verbose"]
case = args["case"]
control = args["control"]
njobs = args["njobs"]

output_res = output / f"{case}-{control}"
output_res.mkdir(exist_ok=True)

In [32]:
annot_df = pd.read_csv(gene_annot, sep="\t")
annot_df = annot_df[pd.notna(annot_df["genome coordinate start"])].drop("Status", axis=1)
annot_df["chrom"] = annot_df["Chr"].apply(lambda x: f"chr{x}")
annot_df = annot_df.sort_values(["chrom"]).reset_index(drop=True)

bed_files = {file.name.split(".")[0]: file for file in args["ta_dir"].iterdir()}

chroms = sorted([ chrom.name for chrom in (args["graph_dir"] / case).iterdir() ])
print(chroms)
print(len(chroms))

# don't allow more jobs than there are chromosomes
jobs = args["njobs"]
num_chr = len(chroms)
if num_chr < jobs:
    print(f"Reducing number of jobs from {jobs} to {num_chr}, since there are only {num_chr} chromosomes present.")
    jobs = len(chroms)

['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chrM', 'chrX', 'chrY']
22


In [33]:
# # iter_args = tqdm([ (chrom, annot_df[annot_df["chrom"] == chrom], bed_files[chrom], args) for chrom in chroms ])
iter_args = [ (chrom, annot_df[annot_df["chrom"] == chrom], bed_files[chrom], args) for chrom in chroms ]
# with Pool(args["njobs"]) as p:
#     res_dict_list = [ x for x in p.imap_unordered(chrom_analysis, iter_args) ]




chrom, annot_chrom_df, chrom_bed_file, args = iter_args[0]
graph_dir = args["graph_dir"]
case = args["case"]
control = args["control"]
ta_error = args["ta_error"]
pval_threshold = args["pval_threshold"]
verbose = args["verbose"]
gene_expander = 50000  # TODO: add to input args


bed_chrom_df = pd.read_csv(chrom_bed_file, sep="\t", header=None)

with open(graph_dir / case / chrom / "subgraphs.pickle", 'rb') as f:
    case_chrom_subgraphs = pickle.load(f)
case_chrom_df = na.get_subgraph_stats(case_chrom_subgraphs, case, chrom, bed_chrom_df, ta_error)

with open(graph_dir / control / chrom / "subgraphs.pickle", 'rb') as f:
    control_chrom_subgraphs = pickle.load(f)
control_chrom_df = na.get_subgraph_stats(control_chrom_subgraphs, control, chrom, bed_chrom_df, ta_error)


In [34]:
case_chrom_df

Unnamed: 0,type,chrom,subgraph,nodes,edges,norm_num_inserts,min_pos,max_pos,range,sample_IDs,num_unique_samples,num_insert_sites,num_ta_sites,num_ta_insert_sites
0,LT,chr1,0,32,496,3318.863118,169134075,169134182,107,"[18_2, 24_1, 25_3, 26_4, 27_3, 28_1, 28_2, 28_...",28,32,3,7
1,LT,chr1,1,7,21,285.447814,58591817,58591825,8,[29_2],1,7,1,6
2,LT,chr1,2,7,21,4109.589041,144854698,144854707,9,[521_2],1,7,1,4
3,LT,chr1,3,6,15,371.299548,104958006,104958011,5,[18_2],1,6,0,0
4,LT,chr1,4,5,10,551.065393,37209538,37209544,6,[29_2],1,5,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299,LT,chr1,299,1,0,4.469074,188115921,188115921,0,[172_2],1,1,0,0
300,LT,chr1,300,1,0,4.299226,116053978,116053978,0,[484_1],1,1,0,0
301,LT,chr1,301,1,0,17.094017,69087636,69087636,0,[521_1],1,1,0,0
302,LT,chr1,302,1,0,85.470085,124762243,124762243,0,[521_1],1,1,0,0


In [42]:
reload(na)

# cases as the target
case_overlaps = na.pcis_overlaps(case_chrom_df, control_chrom_df)
if not case_overlaps:  # if empty
    case_features, case_TA_df, case_overall_df, case_sig_df = None, None, None, None
else:
    case_TA_df, case_overall_df = na.compare_pcis(case_overlaps, case_chrom_subgraphs, control_chrom_subgraphs, case, control, chrom)
    case_sig_df = na.pcis_to_cis(case_overall_df, pval_threshold)
    if len(case_sig_df) != 0:
        case_features = na.cis_annotate(case_sig_df, annot_chrom_df, gene_expander)
    else:
        case_features = None


# controls as the target
control_overlaps = na.pcis_overlaps(control_chrom_df, case_chrom_df)
if not control_overlaps:  # if empty
    control_features, control_TA_df, control_overall_df, control_sig_df = None, None, None, None
else:
    control_TA_df, control_overall_df = na.compare_pcis(control_overlaps, control_chrom_subgraphs, case_chrom_subgraphs, control, case, chrom)
    control_sig_df = na.pcis_to_cis(control_overall_df, pval_threshold)
    if len(control_sig_df) != 0:
        control_features = na.cis_annotate(control_sig_df, annot_chrom_df, gene_expander)
    else:
        control_features = None
        
        
if case_features is not None or control_features is not None:
    genomic_features_df = pd.concat([case_features, control_features], ignore_index=True)
    if verbose:
        print(f"""{chrom}\tsig. genomic features: {genomic_features_df["marker_symbol"].unique().shape[0]}/{annot_chrom_df["Marker Symbol"].unique().shape[0]}""")
else:
    genomic_features_df = None
    if verbose:
        print(f"{chrom}\tno sig. genomic features found")

ta_df = pd.concat([case_TA_df, control_TA_df], ignore_index=True)
overall_df = pd.concat([case_overall_df, control_overall_df], ignore_index=True)
sig_df = pd.concat([case_sig_df, control_sig_df], ignore_index=True)
graph_chrom_df = pd.concat([case_chrom_df, control_chrom_df], ignore_index=True)

# return {"ta": ta_df, "overall": overall_df, "sig": sig_df, "genomic_features": genomic_features_df, "graph_stats": graph_chrom_df}

chr1	sig. genomic features: 19672/42676


In [44]:
ta_df

Unnamed: 0,pos,target_count,reference_count,reference_index,target_index,target_binom_pval,target_binom_sig,LFC,p_target_binom_pval,p_target_binom_sig,target,reference,chrom
0,169134075,12.000000,0.0,0.0,0,4.882812e-04,True,3.700440,1.831055e-03,True,LT,S,chr1
1,169134076,254.000000,0.0,0.0,0,6.908935e-77,True,7.994353,4.438991e-75,True,LT,S,chr1
2,169134077,139.000000,39.0,83.0,0,2.289560e-14,True,1.807355,3.316357e-14,True,LT,S,chr1
3,169134078,107.000000,0.0,0.0,0,1.232595e-32,True,6.754888,3.389637e-31,True,LT,S,chr1
4,169134079,71.000000,0.0,0.0,0,8.470329e-22,True,6.169925,1.567011e-20,True,LT,S,chr1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3530,108664096,6.414368,0.0,,965,3.125000e-02,True,2.890324,7.031250e-02,False,S,LT,chr1
3531,128570444,6.414368,0.0,,966,3.125000e-02,True,2.890324,7.031250e-02,False,S,LT,chr1
3532,167179188,3.669994,0.0,,967,2.500000e-01,False,2.223421,3.750000e-01,False,S,LT,chr1
3533,17245274,23.772733,0.0,,968,2.384186e-07,True,4.630681,1.549721e-06,True,S,LT,chr1
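
Several printed values are consistent with a simple per-site recipe: `LFC = log2((target + 1) / (reference + 1))` and a two-sided binomial test at p = 0.5 on counts truncated to integers. For 12 versus 0 reads this gives log2(13) ≈ 3.70044 and 2 × 0.5^12 ≈ 4.88e-04, and for 6.414368 versus 0 it gives 2 × 0.5^6 = 0.03125. This is inferred from the numbers in the table, not from the `netcis` source, so the sketch below is an assumption with hypothetical function names.

```python
from math import exp, lgamma, log, log2


def lfc(target_count, reference_count, pseudo=1.0):
    """log2 fold change of target over reference with a pseudocount."""
    return log2((target_count + pseudo) / (reference_count + pseudo))


def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials."""
    def pmf(i):
        # work in log space so large n does not overflow
        log_comb = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        return exp(log_comb + i * log(p) + (n - i) * log(1 - p))

    p_obs = pmf(k)
    # add up every outcome that is no more likely than the observed one
    total = sum(pmf(i) for i in range(n + 1) if pmf(i) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def site_test(target_count, reference_count):
    """Test whether the target holds more than half of the reads at a site."""
    k = int(target_count)                 # truncate normalized counts (assumed)
    n = k + int(reference_count)
    return binom_two_sided(k, n), lfc(target_count, reference_count)


print(site_test(12.0, 0.0))
print(site_test(6.414368, 0.0))
```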


In [45]:
overall_df

Unnamed: 0,target_index,reference_index,target_pos_min,target_pos_max,reference_pos_min,reference_pos_max,mannwhitneyu,ranksums,binomial,target_num_samples,reference_num_samples,total_IS,sig_IS,target_IS_count,reference_IS_count,sig_ratio,target,reference,chrom
0,0,0.0,169134075,169134182,169134077,169134162,3.500163e-10,2.330901e-09,0.000000e+00,28,8,35,33,3302.000000,161.0,0.942857,LT,S,chr1
1,1,,58591817,58591825,,,1.027148e-03,1.745119e-03,3.217223e-86,1,,7,5,285.447814,0.0,0.714286,LT,S,chr1
2,2,,144854698,144854707,,,1.042288e-03,1.745119e-03,0.000000e+00,1,,7,7,4109.589041,0.0,1.000000,LT,S,chr1
3,3,,104958006,104958011,,,2.778430e-03,3.947752e-03,4.158164e-112,1,,6,6,371.299548,0.0,1.000000,LT,S,chr1
4,4,,37209538,37209544,,,7.494958e-03,9.023439e-03,2.713329e-166,1,,5,5,551.065393,0.0,1.000000,LT,S,chr1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1269,965,,108664096,108664096,,,1.000000e+00,3.173105e-01,3.125000e-02,1,,1,1,6.414368,0.0,1.000000,S,LT,chr1
1270,966,,128570444,128570444,,,1.000000e+00,3.173105e-01,3.125000e-02,1,,1,1,6.414368,0.0,1.000000,S,LT,chr1
1271,967,,167179188,167179188,,,1.000000e+00,3.173105e-01,2.500000e-01,1,,1,0,3.669994,0.0,0.000000,S,LT,chr1
1272,968,,17245274,17245274,,,1.000000e+00,3.173105e-01,2.384186e-07,1,,1,1,23.772733,0.0,1.000000,S,LT,chr1
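
Row 0 pairs target pCIS 0 with reference pCIS 0, while the other rows shown have an empty `reference_index`. One plausible reading is an inclusive interval-overlap test on the pCIS position ranges. How `na.pcis_overlaps` actually decides is not shown in this notebook, so the sketch below is an assumption with a hypothetical function name; the ranges are copied from the chr1 tables.

```python
def find_overlaps(target, reference):
    """Map each target pCIS index to the reference pCIS indices it overlaps.

    Both arguments are lists of (min_pos, max_pos) tuples with inclusive ends.
    """
    overlaps = {}
    for t_idx, (t_min, t_max) in enumerate(target):
        overlaps[t_idx] = [
            r_idx
            for r_idx, (r_min, r_max) in enumerate(reference)
            if t_min <= r_max and r_min <= t_max
        ]
    return overlaps


# ranges from the chr1 output: LT subgraphs 0 and 1, and S subgraph 0
lt = [(169134075, 169134182), (58591817, 58591825)]
s = [(169134077, 169134162)]
print(find_overlaps(lt, s))
```

A target pCIS with an empty list here would correspond to a row whose reference columns are blank.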


In [46]:
sig_df

Unnamed: 0,target_index,reference_index,target_pos_min,target_pos_max,reference_pos_min,reference_pos_max,mannwhitneyu,ranksums,binomial,target_num_samples,reference_num_samples,total_IS,sig_IS,target_IS_count,reference_IS_count,sig_ratio,target,reference,chrom
0,0,0.0,169134075,169134182,169134077,169134162,3.500163e-10,2.330901e-09,0.000000e+00,28,8,35,33,3302.000000,161.0,0.942857,LT,S,chr1
1,1,,58591817,58591825,,,1.027148e-03,1.745119e-03,3.217223e-86,1,,7,5,285.447814,0.0,0.714286,LT,S,chr1
2,2,,144854698,144854707,,,1.042288e-03,1.745119e-03,0.000000e+00,1,,7,7,4109.589041,0.0,1.000000,LT,S,chr1
3,3,,104958006,104958011,,,2.778430e-03,3.947752e-03,4.158164e-112,1,,6,6,371.299548,0.0,1.000000,LT,S,chr1
4,4,,37209538,37209544,,,7.494958e-03,9.023439e-03,2.713329e-166,1,,5,5,551.065393,0.0,1.000000,LT,S,chr1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1145,964,,102345860,102345860,,,1.000000e+00,3.173105e-01,3.125000e-02,1,,1,1,6.414368,0.0,1.000000,S,LT,chr1
1146,965,,108664096,108664096,,,1.000000e+00,3.173105e-01,3.125000e-02,1,,1,1,6.414368,0.0,1.000000,S,LT,chr1
1147,966,,128570444,128570444,,,1.000000e+00,3.173105e-01,3.125000e-02,1,,1,1,6.414368,0.0,1.000000,S,LT,chr1
1148,968,,17245274,17245274,,,1.000000e+00,3.173105e-01,2.384186e-07,1,,1,1,23.772733,0.0,1.000000,S,LT,chr1


In [47]:
graph_chrom_df

Unnamed: 0,type,chrom,subgraph,nodes,edges,norm_num_inserts,min_pos,max_pos,range,sample_IDs,num_unique_samples,num_insert_sites,num_ta_sites,num_ta_insert_sites
0,LT,chr1,0,32,496,3318.863118,169134075,169134182,107,"[18_2, 24_1, 25_3, 26_4, 27_3, 28_1, 28_2, 28_...",28,32,3,7
1,LT,chr1,1,7,21,285.447814,58591817,58591825,8,[29_2],1,7,1,6
2,LT,chr1,2,7,21,4109.589041,144854698,144854707,9,[521_2],1,7,1,4
3,LT,chr1,3,6,15,371.299548,104958006,104958011,5,[18_2],1,6,0,0
4,LT,chr1,4,5,10,551.065393,37209538,37209544,6,[29_2],1,5,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1269,S,chr1,965,1,0,6.414368,108664096,108664096,0,[22_3],1,1,0,0
1270,S,chr1,966,1,0,6.414368,128570444,128570444,0,[22_3],1,1,0,0
1271,S,chr1,967,1,0,3.669994,167179188,167179188,0,[22_3],1,1,0,0
1272,S,chr1,968,1,0,23.772733,17245274,17245274,0,[485_2],1,1,0,0
