This notebook covers the regression model, 2.iv.

###  General Idea

> **“As a group, how much of a gene’s expression variation across cell types can be explained by the accessibility of its candidate CREs?”**

Concretely, for each gene you’ll:

1. **Gather predictors**  
   – All your ATAC-seq peaks (CREs) linked to that gene (e.g. ±20 kb around its TSS or however you defined your peak→gene map).  
   – For each of these peaks, you already have a vector of accessibility values across N cell types.

2. **Gather response**  
   – The same gene’s expression levels across those N cell types.

3. **Fit a linear model**  
   – $Y = X\beta + \varepsilon$, where  
     - **Y** is the N-vector of gene expression,  
     - **X** is the N×P matrix of peak accessibility (one column per CRE),  
     - **β** is the P-vector of coefficients,  
     - **ε** is the residual.  
   – Compute **R²** to see what fraction of the variance in Y is captured by X.

4. **Inspect coefficients**  
   – Which CREs get large positive (activating) or negative (repressing) **β**?
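
The four steps above can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual analysis: the matrix sizes, the "true" coefficients and the noise level are all made up, and `X`/`Y` stand in for the accessibility matrix and expression vector described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up sizes: N cell types, P candidate CREs for one gene
n_samples, n_peaks = 28, 3
X = rng.normal(size=(n_samples, n_peaks))       # synthetic accessibility matrix
true_beta = np.array([2.0, 0.0, -1.0])          # hypothetical effects per CRE
Y = X @ true_beta + rng.normal(scale=0.5, size=n_samples)  # synthetic expression

# Fit Y = X beta + intercept and report the in-sample R²
model = LinearRegression().fit(X, Y)
r2 = model.score(X, Y)

# The same R² computed by hand: 1 - SS_res / SS_tot
resid = Y - model.predict(X)
r2_manual = 1 - (resid ** 2).sum() / ((Y - Y.mean()) ** 2).sum()

print("coefficients:", model.coef_.round(2))
print(f"R² = {r2:.3f} (manual: {r2_manual:.3f})")
```

Note that this R² is computed on the same samples used for fitting, so it can only go up as more peaks are added.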


In [1]:
import pandas as pd

# 1) Load the new ATAC-seq matrix
#    Replace the file name below with the path to whatever your new ATAC CSV is called
atac = pd.read_csv("ATAC_high_var.csv", index_col=False)

# 2) Quickly check what sample/cell‐type columns you have
print("ATAC columns:", atac.columns.tolist())

# 3) Take a look at the first few rows to make sure everything loaded as expected
atac.head()


ATAC columns: ['version https://git-lfs.github.com/spec/v1']


Unnamed: 0,version https://git-lfs.github.com/spec/v1
0,oid sha256:a3ea0ff5fb2620df3333deca108ed47b5a3...
1,size 19095310
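
The single column name printed above, `version https://git-lfs.github.com/spec/v1`, is the first line of a Git LFS pointer file, which suggests that pandas read a pointer stub rather than the real CSV. A small check along these lines could be run before loading; the helper name `is_lfs_pointer` and the demo files are hypothetical and only serve the illustration.

```python
import os
import tempfile

# First line of every Git LFS pointer file
LFS_HEADER = "version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path):
    """Return True if the file at `path` starts with the Git LFS pointer header."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return fh.readline().startswith(LFS_HEADER)

# Demo on two throwaway files: one pointer stub, one ordinary CSV
with tempfile.TemporaryDirectory() as tmp:
    pointer = os.path.join(tmp, "pointer.csv")
    with open(pointer, "w", encoding="utf-8") as fh:
        fh.write(LFS_HEADER + "\noid sha256:0000\nsize 123\n")

    real = os.path.join(tmp, "real.csv")
    with open(real, "w", encoding="utf-8") as fh:
        fh.write("peak_id,sampleA\npeak_1,0.5\n")

    pointer_flag = is_lfs_pointer(pointer)
    real_flag = is_lfs_pointer(real)

print("pointer stub:", pointer_flag, "| ordinary CSV:", real_flag)
```

If the check is true for `ATAC_high_var.csv`, fetching the real file (for example with `git lfs pull` in the repository) should be done before any of the cells below can work.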


In [2]:
import pandas as pd

# 1) Load your new ATAC‐seq matrix
atac = pd.read_csv("ATAC_high_var.csv")

# 2) Rename the ImmGenATAC1219.peakID column to just "peak_id"
atac = atac.rename(columns={"ImmGenATAC1219.peakID": "peak_id"})

# 3) List exactly the metadata columns you want to toss
meta_cols = [
    "chrom",
    "Summit",                     # peak summit coordinate
    "mm10.60way.phastCons_scores", # conservation score
    "_-log10_bestPvalue",          # significance of peak call
    "Included.in.systematic.analysis",
    "TSS",                         # count of nearby TSSs
    "genes.within.100Kb"           # original mapped gene list
]

# 4) Drop them, keeping peak_id + all the per‐sample columns
atac_sub = atac.drop(columns=meta_cols, errors="ignore")

# 5) Confirm what’s left
print("Kept columns:", atac_sub.columns.tolist())
print("ATAC shape:", atac_sub.shape)
atac_sub.head()



Kept columns: ['version https://git-lfs.github.com/spec/v1']
ATAC shape: (2, 1)


Unnamed: 0,version https://git-lfs.github.com/spec/v1
0,oid sha256:a3ea0ff5fb2620df3333deca108ed47b5a3...
1,size 19095310


In [3]:
import pandas as pd

# 1) Load without setting an index
rna = pd.read_csv("RNA-seq/filtered_RNA_abT_Tact_Stem.csv")

# Rename and index
rna = (
    rna
    .rename(columns={"Unnamed: 0": "gene_symbol"})
    .set_index("gene_symbol")
)

# Confirm
print(rna.shape)
rna.head()


(17535, 29)


Unnamed: 0_level_0,preT.DN1.Th,preT.DN2a.Th,preT.DN2b.Th,preT.DN3.Th,T.DN4.Th,T.ISP.Th,T.DP.Th,T.4.Th,T.8.Th,T.4.Nve.Sp,...,T8.Tcm.LCMV.d180.Sp,T8.Tem.LCMV.d180.Sp,NKT.Sp,NKT.Sp.LPS.3hr,NKT.Sp.LPS.18hr,NKT.Sp.LPS.3d,LTHSC.34-.BM,LTHSC.34+.BM,STHSC.150-.BM,MPP4.135+.BM
gene_symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0610005C13Rik,1.022363,1.389747,1.024819,1.024482,1.02643,1.026217,3.01092,1.024462,1.024819,2.726341,...,1.025833,1.024819,1.385805,1.025833,1.575395,1.024819,1.096732,1.096732,1.02175,1.021812
0610007P14Rik,162.641117,206.945221,209.187788,198.421365,215.056475,225.56536,73.904647,138.841383,139.863904,168.924363,...,206.241084,205.309922,165.69072,133.23492,127.894194,195.147548,206.053987,246.105317,192.424636,204.298358
0610009B22Rik,68.070719,82.468806,89.769337,57.661619,76.399214,84.671456,32.828651,27.207241,36.169759,32.753248,...,36.1057,34.348965,25.168975,33.305724,29.284365,33.322384,78.272059,78.83703,68.844751,76.418169
0610009L18Rik,15.450717,13.573968,14.42762,8.249482,1.683173,4.001953,5.595954,6.367369,6.505833,8.262234,...,8.645607,7.268431,3.840215,1.025833,6.28354,12.791348,8.577159,16.791386,15.511549,16.947354
0610009O20Rik,160.246297,125.475307,155.928005,120.692893,118.433597,149.630866,92.040668,76.781112,87.529814,86.523573,...,87.608325,56.128251,109.175415,91.992319,102.035627,108.414405,168.645852,157.926022,155.941641,186.261464


In [4]:

# 3) Load your peak→gene annotation
annot = pd.read_csv("peak_to_gene_annotated.csv")

# Quick checks
print("ATAC shape:", atac_sub.shape)
print("RNA  shape:",  rna.shape)
print("Annot shape:", annot.shape)

ATAC shape: (2, 1)
RNA  shape: (17535, 29)
Annot shape: (2, 1)


The ATAC and RNA matrices do not have the same shape, so let's compare their sample columns and fix that.

In [5]:
# 1) Get the list of samples (columns) in each matrix:
atac_samples = list(atac_sub.columns)   # after dropping metadata, before transposing
rna_samples  = list(rna.columns)        # after you set gene_symbol as index

# 2) Turn them into Python sets:
set_atac = set(atac_samples)
set_rna  = set(rna_samples)

# 3) Compute intersections and differences:
common        = set_atac & set_rna
only_in_atac  = set_atac - set_rna
only_in_rna   = set_rna  - set_atac

# 4) Print a concise summary:
print(f"✅ # samples in ATAC: {len(set_atac)}")
print(f"✅ # samples in RNA : {len(set_rna)}")
print(f"🔗 # in both        : {len(common)}")
print(f"❌ only in ATAC    : {len(only_in_atac)} -> {sorted(only_in_atac)}")
print(f"❌ only in RNA     : {len(only_in_rna)} -> {sorted(only_in_rna)}")


✅ # samples in ATAC: 1
✅ # samples in RNA : 29
🔗 # in both        : 0
❌ only in ATAC    : 1 -> ['version https://git-lfs.github.com/spec/v1']
❌ only in RNA     : 29 -> ['LTHSC.34+.BM', 'LTHSC.34-.BM', 'MPP4.135+.BM', 'NKT.Sp', 'NKT.Sp.LPS.18hr', 'NKT.Sp.LPS.3d', 'NKT.Sp.LPS.3hr', 'STHSC.150-.BM', 'T.4.Nve.Fem.Sp', 'T.4.Nve.Sp', 'T.4.Sp.aCD3+CD40.18hr', 'T.4.Th', 'T.8.Nve.Sp', 'T.8.Th', 'T.DN4.Th', 'T.DP.Th', 'T.ISP.Th', 'T8.IEL.LCMV.d7.Gut', 'T8.MP.LCMV.d7.Sp', 'T8.TE.LCMV.d7.Sp', 'T8.TN.P14.Sp', 'T8.Tcm.LCMV.d180.Sp', 'T8.Tem.LCMV.d180.Sp', 'Treg.4.25hi.Sp', 'Treg.4.FP3+.Nrplo.Co', 'preT.DN1.Th', 'preT.DN2a.Th', 'preT.DN2b.Th', 'preT.DN3.Th']


In [6]:
# 1) Drop the variance column (we don’t use that in the model)
atac_sub = atac_sub.drop(columns=["variance"], errors="ignore")

# 2) Rename the mismatched sample so it matches RNA
atac_sub = atac_sub.rename(
    columns={"T8.IEL.LCMV.d7.SI": "T8.IEL.LCMV.d7.Gut"}
)

# 3) Set peak_id as the index (so it’s not treated like a sample)
atac_sub = atac_sub.set_index("peak_id")

# 4) Recompute the intersection of sample names
common = sorted(set(atac_sub.columns) & set(rna.columns))
print(f"✅ Now {len(common)} samples in common:", common)

# 5) Subset both matrices to just those samples, in the same order
atac_sub = atac_sub[common]
rna      = rna[common]

# 6) Quick sanity check
print("ATAC_sub shape:", atac_sub.shape)   # (n_peaks, n_samples)
print("RNA    shape:", rna.shape)          # (n_genes, n_samples)


KeyError: "None of ['peak_id'] are in the columns"

NOTE: everything after this point still has to be changed according to the new definition of promoter and enhancer.

This is the regression of gene expression on all linked peaks, i.e. on all CREs of a gene together.

In [None]:
from sklearn.linear_model import LinearRegression

# — 1) Pick your gene of interest (must be in rna.index)
gene = "0610005C13Rik"     # replace with any gene_symbol you like
assert gene in rna.index, f"{gene} not found in RNA data!"

# — 2) Build the response vector Y (shape: 29,)
Y = rna.loc[gene]

# — 3) Find all peaks linked to that gene
peaks = annot.loc[annot.gene_symbol == gene, "peak_id"].tolist()
print(f"{len(peaks)} peaks linked to {gene}")

# — 4) Build the predictor matrix X (rows=samples, cols=peaks)
#     atac_sub is indexed by peak_id, columns are samples
X = atac_sub.loc[peaks].T
print("X shape:", X.shape, " Y shape:", Y.shape)

# — 5) Fit the model and report R²
model = LinearRegression().fit(X, Y)
print(f"{gene}  R² = {model.score(X, Y):.3f}")

# — 6) Inspect the top 5 CREs by absolute coefficient
import pandas as pd
coefs = pd.Series(model.coef_, index=X.columns).abs().nlargest(5)
print("Top 5 CREs (by |β|):\n", coefs)


4 peaks linked to 0610005C13Rik
X shape: (29, 4)  Y shape: (29,)
0610005C13Rik  R² = 0.076
Top 5 CREs (by |β|):
 peak_id
ImmGenATAC1219.peak_427236    0.027864
ImmGenATAC1219.peak_427237    0.023046
ImmGenATAC1219.peak_427239    0.014774
ImmGenATAC1219.peak_427241    0.001844
dtype: float64


Promoters and enhancers of the same gene can differ greatly in accessibility, which may be why the regression model fits so poorly. We therefore separate the CREs by type and fit a regression model for each type.

In [None]:
# … earlier you did:
# atac_sub = atac_sub.set_index("peak_id")
# rna      = rna  # already indexed by gene_symbol

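def fit_and_report(peak_list, label):
    """Fit an OLS model of Y on the given peaks and print its in-sample R².

    Relies on globals from earlier cells: atac_sub (indexed by peak_id),
    Y (expression of the current gene) and LinearRegression.
    """
    if not peak_list:
        print(f"No {label} peaks to model – skipping.")
        return

    # <-- NO set_index here; peak_id is already index
    X = atac_sub.loc[peak_list].T
    print(f"\n{label} regression → X shape {X.shape}, Y shape {Y.shape}")

    model = LinearRegression().fit(X, Y)
    print(f" {label} R² = {model.score(X, Y):.3f}")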


In [None]:
import pandas as pd

# replace with the actual path to your correlations CSV
cor_df = pd.read_csv("all_peak_gene_correlations.csv")


In [None]:
print(cor_df.shape)
print(cor_df.columns.tolist())
cor_df.head()


(56448, 5)
['peak_id', 'gene_symbol', 'signed_distance_to_tss', 'spearman_rho', 'pval']


Unnamed: 0,peak_id,gene_symbol,signed_distance_to_tss,spearman_rho,pval
0,ImmGenATAC1219.peak_69,Sox17,28775,-0.168543,0.391255
1,ImmGenATAC1219.peak_77,Sox17,6702,-0.643429,0.000221
2,ImmGenATAC1219.peak_83,Sox17,875,-0.4532,0.015437
3,ImmGenATAC1219.peak_84,Sox17,616,-0.368105,0.053941
4,ImmGenATAC1219.peak_93,Sox17,-50220,0.05024,0.799583


In [None]:
def assign_region(dist):
    if -1000 <= dist <= 1000:
        return "Promoter"
    elif -10000 <= dist < -1000:
        return "Proximal Upstream"
    elif 1000 < dist <= 10000:
        return "Proximal Downstream"
    elif -20000 <= dist < -10000:
        return "Distal Upstream"
    elif 10000 < dist <= 20000:
        return "Distal Downstream"
    else:
        return None
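
For larger tables, the same binning can be applied to the whole column at once instead of row by row. The sketch below is one possible vectorized variant, assuming the thresholds of `assign_region` above; the name `assign_region_vectorized` is hypothetical and not part of the notebook.

```python
import pandas as pd

def assign_region_vectorized(dist):
    """Label a Series of signed TSS distances with the same bins as assign_region."""
    d = pd.Series(dist)
    # Boolean mask per region; the bins do not overlap
    masks = {
        "Promoter":            (d >= -1000) & (d <= 1000),
        "Proximal Upstream":   (d >= -10000) & (d < -1000),
        "Proximal Downstream": (d > 1000) & (d <= 10000),
        "Distal Upstream":     (d >= -20000) & (d < -10000),
        "Distal Downstream":   (d > 10000) & (d <= 20000),
    }
    # Start with None everywhere, then fill in each region
    out = pd.Series([None] * len(d), index=d.index, dtype=object)
    for label, mask in masks.items():
        out[mask] = label
    return out

# Boundary values are the cases most worth checking
demo = pd.Series([-20001, -20000, -10000, -1000, 0, 1000, 1001, 20000, 20001])
print(pd.DataFrame({"distance": demo, "Region": assign_region_vectorized(demo)}))
```

Distances beyond ±20 kb stay `None`, exactly as in the scalar function.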


In [None]:
cor_df["Region"] = cor_df["signed_distance_to_tss"].apply(assign_region)


In [None]:
cor_df.Region.value_counts(dropna=False)  # not displayed: a cell only shows its last expression
cor_df.head()


Unnamed: 0,peak_id,gene_symbol,signed_distance_to_tss,spearman_rho,pval,Region
0,ImmGenATAC1219.peak_69,Sox17,28775,-0.168543,0.391255,
1,ImmGenATAC1219.peak_77,Sox17,6702,-0.643429,0.000221,Proximal Downstream
2,ImmGenATAC1219.peak_83,Sox17,875,-0.4532,0.015437,Promoter
3,ImmGenATAC1219.peak_84,Sox17,616,-0.368105,0.053941,Promoter
4,ImmGenATAC1219.peak_93,Sox17,-50220,0.05024,0.799583,


In [None]:
import pandas as pd

# 1) Load ATAC, rename peakID, drop all metadata columns
atac = pd.read_csv("ATAC_high_var.csv")
atac = atac.rename(columns={"ImmGenATAC1219.peakID": "peak_id"})

meta_cols = [
    "chrom",
    "Summit",
    "mm10.60way.phastCons_scores",
    "_-log10_bestPvalue",
    "Included.in.systematic.analysis",
    "TSS",
    "genes.within.100Kb",
    "variance"  # if present
]

# keep only peak_id + sample columns, then index by peak_id
atac_sub = atac.drop(columns=meta_cols, errors="ignore").set_index("peak_id")

# 2) Load RNA, rename the first column to gene_symbol and index
rna = pd.read_csv("RNA-seq/filtered_RNA_abT_Tact_Stem.csv")
rna = (
    rna
    .rename(columns={"Unnamed: 0": "gene_symbol"})
    .set_index("gene_symbol")
)

# 3) Load your peak→gene annotation (used later for regression)
cor_df = pd.read_csv("peak_to_gene_annotated.csv")

# 4) Subset to the samples shared by both matrices
common = atac_sub.columns.intersection(rna.columns)
print(f"# samples in common: {len(common)} → {common.tolist()}")

atac_sub = atac_sub[common]
rna      = rna[common]

print("ATAC_sub shape:", atac_sub.shape)
print("RNA      shape:", rna.shape)


# samples in common: 28 → ['preT.DN1.Th', 'preT.DN2a.Th', 'preT.DN2b.Th', 'preT.DN3.Th', 'T.DN4.Th', 'T.ISP.Th', 'T.DP.Th', 'T.4.Th', 'T.8.Th', 'T.4.Nve.Sp', 'T.4.Nve.Fem.Sp', 'T.4.Sp.aCD3+CD40.18hr', 'T.8.Nve.Sp', 'Treg.4.25hi.Sp', 'Treg.4.FP3+.Nrplo.Co', 'T8.TN.P14.Sp', 'T8.TE.LCMV.d7.Sp', 'T8.MP.LCMV.d7.Sp', 'T8.Tcm.LCMV.d180.Sp', 'T8.Tem.LCMV.d180.Sp', 'NKT.Sp', 'NKT.Sp.LPS.3hr', 'NKT.Sp.LPS.18hr', 'NKT.Sp.LPS.3d', 'LTHSC.34-.BM', 'LTHSC.34+.BM', 'STHSC.150-.BM', 'MPP4.135+.BM']
ATAC_sub shape: (75857, 28)
RNA      shape: (17535, 28)


1. **Keep all of your peaks as separate predictors, but regularize the regression**  
   When the number of features (peaks) per gene can be 0, 1, or dozens, you need a modeling method that

   - automatically shrinks uninformative or collinear predictors toward zero, and
   - doesn’t blow up when p (features) is close to or exceeds n (samples).

   A classic choice is ridge regression (L2-penalized linear regression). Ridge lets you include all promoter peaks and all enhancer peaks at once and down-weights the ones that carry little signal, without you having to average them ahead of time.

2. **Filter genes by minimum peak count**  
   If you want to compare “promoter-only” vs. “enhancer-only” R², you should only do that on the subset of genes that have at least one promoter peak (for the promoter model) or at least one enhancer peak (for the enhancer model). Otherwise you’re just fitting an intercept and getting nonsense.
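
To make the p > n point concrete, here is a small synthetic sketch (made-up sizes, made-up coefficients, and an arbitrary penalty `alpha=10.0`, none of which come from the real data). With more peaks than samples, ordinary least squares fits the training data perfectly regardless of signal, while ridge keeps the coefficients small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Made-up setting: 28 samples but 40 peaks, only three of which carry signal
n_samples, n_peaks = 28, 40
X = rng.normal(size=(n_samples, n_peaks))
beta = np.zeros(n_peaks)
beta[:3] = [1.5, -1.0, 0.5]
y = X @ beta + rng.normal(scale=0.5, size=n_samples)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha chosen arbitrarily for the demo

# In-sample fit and coefficient size
print(f"OLS   in-sample R² = {ols.score(X, y):.3f}, ||beta|| = {np.linalg.norm(ols.coef_):.2f}")
print(f"Ridge in-sample R² = {ridge.score(X, y):.3f}, ||beta|| = {np.linalg.norm(ridge.coef_):.2f}")

# Held-out performance is the fairer comparison
cv = KFold(n_splits=5, shuffle=True, random_state=1)
ols_cv = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=cv).mean()
ridge_cv = cross_val_score(Ridge(alpha=10.0), X, y, scoring="r2", cv=cv).mean()
print(f"CV R²: OLS = {ols_cv:.3f}, Ridge = {ridge_cv:.3f}")
```

This is why the cells below report cross-validated R² rather than the in-sample R² used in the first regression.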

In [None]:
# 0) Bring in everything we need
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_score

# (Assuming you’ve already got these:)
#   atac_sub  : DataFrame indexed by peak_id, columns = the 28 matched cell-types
#   rna       : DataFrame indexed by gene_symbol, columns = the same 28 cell-types
#   cor_df    : your peak-to-gene table, with cor_df["Region"] already assigned

# 1) Pick your gene of interest
gene = "0610009B22Rik"

# 2) Gather its promoter / enhancer peaks
sub = cor_df[cor_df.gene_symbol == gene]
promoter_peaks = sub.loc[sub.Region == "Promoter", "peak_id"].tolist()
enhancer_peaks = sub.loc[
    sub.Region.isin(["Proximal Upstream","Proximal Downstream","Distal Upstream","Distal Downstream"]),
    "peak_id"
].tolist()

# 3) Set up your response vector and CV splitter
Y_gene = rna.loc[gene]                       # shape (28,)
cv     = KFold(n_splits=5, shuffle=True, random_state=1)

# 4) Define the ridge‐CV + outer CV wrapper
def fit_ridge(peaks, label, X_full, Y_full, cv):
    if not peaks:
        print(f"{gene} | no {label} peaks – skipping")
        return None

    X = X_full.loc[peaks].T                  # shape (28 × P)
    assert list(X.index) == list(Y_full.index), "sample mismatch!"

    # inner CV grid of penalties
    alphas = np.logspace(-3, 3, 31)
    model  = RidgeCV(alphas=alphas, scoring="r2", cv=cv)
    model.fit(X, Y_full)

    # outer CV R²
    scores = cross_val_score(model, X, Y_full, scoring="r2", cv=cv)
    print(f"{gene} | {label}: α* = {model.alpha_:.2g} → CV R² = {scores.mean():.3f} "
          f"(folds: {scores.round(3).tolist()})")
    return scores.mean()

# 5) Run promoter vs enhancer
fit_ridge(promoter_peaks, "Promoter",  atac_sub, Y_gene, cv)
fit_ridge(enhancer_peaks, "Enhancer",  atac_sub, Y_gene, cv)


0610009B22Rik | Promoter: α* = 1e+03 → CV R² = -0.054 (folds: [-1.367, 0.048, 0.33, 0.533, 0.183])
0610009B22Rik | no Enhancer peaks – skipping


In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_score
from tqdm import tqdm

# — 0) Prep: load your dataframes (if not already in memory)
# atac_sub should be indexed by peak_id, with columns matching RNA samples
# rna      should be indexed by gene_symbol, columns matching the same samples
# cor_df   should have columns: ["peak_id","gene_symbol","signed_distance_to_tss","spearman_rho",…, "Region"]

# 1) Compute per‐gene variance and pick top 200 most variable genes
variances = rna.var(axis=1).sort_values(ascending=False)
genes     = variances.index[:200].tolist()

# 2) Set up cross‐validation and alpha grid for Ridge
cv     = KFold(n_splits=3, shuffle=True, random_state=1)
alphas = np.logspace(-3, 3, 31)

def ridge_cv_score(peaks, X_full, Y_full, cv, alphas):
    """
    Return mean CV-R² of RidgeCV across 'peaks' or NaN if no peaks.
    X_full: DataFrame indexed by peak_id, columns are samples.
    Y_full: Series indexed by samples.
    """
    if not peaks:
        return np.nan
    X = X_full.loc[peaks].T
    # align rows (samples) exactly to Y_full
    X = X.reindex(index=Y_full.index, columns=X.columns)
    model = RidgeCV(alphas=alphas, scoring="r2", cv=cv)
    model.fit(X, Y_full)
    scores = cross_val_score(model, X, Y_full, scoring="r2", cv=cv)
    return scores.mean()

# 3) Loop through the selected genes and compute promoter vs enhancer CV-R²
out = []
for gene in tqdm(genes, desc="Genes processed"):
    Y = rna.loc[gene]                      # shape (n_samples,)
    sub = cor_df[cor_df.gene_symbol == gene]

    # promoter peaks
    prom_peaks = sub.loc[sub.Region == "Promoter", "peak_id"].tolist()

    # enhancer peaks = any of the four non-promoter CRE bins
    enh_peaks = sub.loc[
        sub.Region.isin([
            "Proximal Upstream","Proximal Downstream",
            "Distal Upstream","Distal Downstream"
        ]),
        "peak_id"
    ].tolist()

    r2_prom = ridge_cv_score(prom_peaks, atac_sub, Y, cv, alphas)
    r2_enh  = ridge_cv_score(enh_peaks,  atac_sub, Y, cv, alphas)

    out.append((gene,
                len(prom_peaks),
                len(enh_peaks),
                r2_prom,
                r2_enh))

# 4) Assemble results into a DataFrame
summary = pd.DataFrame(
    out,
    columns=[
        "gene",
        "n_promoter",
        "n_enhancer",
        "promoter_CV_R2",
        "enhancer_CV_R2"
    ]
)

# 5) View the first few rows
print(summary.head(10))


Genes processed: 100%|██████████| 200/200 [03:06<00:00,  1.07it/s]

       gene  n_promoter  n_enhancer  promoter_CV_R2  enhancer_CV_R2
0     H2-K1           3           3       -0.559274       -0.362003
1      Actb           3           6       -0.435391       -0.355914
2    Eef1a1           5           0        0.190472             NaN
3      Tcf7           0           6             NaN        0.439224
4     Actg1           0           0             NaN             NaN
5  Hsp90ab1           5           2        0.313858        0.190419
6       B2m           2           3       -0.189043        0.369620
7     Hspa8           3           4       -0.626394       -0.287602
8      Ets1           1           9        0.309980        0.400578
9    Malat1           2           3       -0.061596       -0.388207
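
Following the "filter genes by minimum peak count" idea, the promoter and enhancer scores are only comparable for genes that have both kinds of peaks. The sketch below retypes the ten rows printed above by hand (so it runs on its own) and applies a hypothetical threshold `min_peaks = 1`; on the real `summary` DataFrame the filtering lines would be the same.

```python
import numpy as np
import pandas as pd

# The ten rows shown above, retyped for illustration only
summary_demo = pd.DataFrame({
    "gene": ["H2-K1", "Actb", "Eef1a1", "Tcf7", "Actg1",
             "Hsp90ab1", "B2m", "Hspa8", "Ets1", "Malat1"],
    "n_promoter": [3, 3, 5, 0, 0, 5, 2, 3, 1, 2],
    "n_enhancer": [3, 6, 0, 6, 0, 2, 3, 4, 9, 3],
    "promoter_CV_R2": [-0.559274, -0.435391, 0.190472, np.nan, np.nan,
                       0.313858, -0.189043, -0.626394, 0.309980, -0.061596],
    "enhancer_CV_R2": [-0.362003, -0.355914, np.nan, 0.439224, np.nan,
                       0.190419, 0.369620, -0.287602, 0.400578, -0.388207],
})

min_peaks = 1  # hypothetical threshold; raise it to require more peaks per model

# Keep only genes that have enough peaks of both types
both = summary_demo[
    (summary_demo.n_promoter >= min_peaks) & (summary_demo.n_enhancer >= min_peaks)
]

# Per-gene difference: positive means the enhancer model scored higher
delta = both["enhancer_CV_R2"] - both["promoter_CV_R2"]

print(f"{len(both)} of {len(summary_demo)} genes have both peak types")
print("median promoter CV R²:", round(both.promoter_CV_R2.median(), 3))
print("median enhancer CV R²:", round(both.enhancer_CV_R2.median(), 3))
print("genes where the enhancer model scores higher:", int((delta > 0).sum()))
```

Ten genes are far too few to draw a conclusion; the point is only the filtering pattern.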





In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_score
from tqdm import tqdm  

# 0) set up CV and alphas
cv     = KFold(n_splits=3, shuffle=True, random_state=1)
alphas = np.logspace(-3, 3, 31)

# 1) helper to compute one CV-R²
def ridge_cv_score(peaks, X_full, Y_full, cv, alphas):
    """Return mean CV-R² of RidgeCV(X_full.loc[peaks].T, Y_full) or NaN if no peaks."""
    if not peaks:
        return np.nan
    X = X_full.loc[peaks].T
    # make sure samples align
    X = X.reindex(index=Y_full.index, columns=X.columns)
    model = RidgeCV(alphas=alphas, scoring="r2", cv=cv)
    model.fit(X, Y_full)
    scores = cross_val_score(model, X, Y_full, scoring="r2", cv=cv)
    return scores.mean()

# 2) prepare output
genes = sorted(cor_df['gene_symbol'].unique())
out   = []

# 3) loop over genes with a progress bar
for gene in tqdm(genes, desc="Genes processed"):
    Y     = rna.loc[gene]       # shape (28,)
    sub   = cor_df[cor_df.gene_symbol == gene]
    prom_peaks = sub.loc[sub.Region == "Promoter", "peak_id"].tolist()
    enh_peaks  = sub.loc[
        sub.Region.isin(["Proximal Upstream","Proximal Downstream","Distal Upstream","Distal Downstream"]),
        "peak_id"
    ].tolist()
    r2_prom = ridge_cv_score(prom_peaks, atac_sub, Y, cv, alphas)
    r2_enh  = ridge_cv_score(enh_peaks,  atac_sub, Y, cv, alphas)
    out.append((gene, len(prom_peaks), len(enh_peaks), r2_prom, r2_enh))

# 4) turn into DataFrame
summary = pd.DataFrame(out,
    columns=["gene","n_promoter","n_enhancer","promoter_CV_R2","enhancer_CV_R2"]
)
summary.head(10)



Genes processed:   2%|▏         | 299/13624 [03:44<2:46:29,  1.33it/s]


KeyboardInterrupt: 
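
The run above was interrupted with an estimated 2 h 46 min remaining. One way to reduce the number of model fits per gene, sketched here under stated assumptions, is to let `RidgeCV` pick the penalty with its default efficient leave-one-out scheme (`cv=None`) instead of an inner K-fold, and to group `cor_df` by gene once. This changes how α is selected, so scores may differ from the cells above, and no speed-up has been measured here. The function names and the tiny synthetic frames are hypothetical; genes missing from the RNA table are skipped, which is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_score

ENHANCER_BINS = ["Proximal Upstream", "Proximal Downstream",
                 "Distal Upstream", "Distal Downstream"]

def cv_r2(peaks, atac_sub, y, cv, alphas):
    """Mean outer-CV R² for one peak set, or NaN if the set is empty."""
    if len(peaks) == 0:
        return np.nan
    X = atac_sub.loc[peaks].T.reindex(y.index)   # rows = samples, aligned to y
    model = RidgeCV(alphas=alphas)               # cv=None: leave-one-out for alpha
    return cross_val_score(model, X, y, scoring="r2", cv=cv).mean()

def score_genes(atac_sub, rna, cor_df, cv, alphas):
    """Promoter vs enhancer CV R² for every gene present in both tables."""
    linked = cor_df[cor_df.gene_symbol.isin(rna.index)
                    & cor_df.peak_id.isin(atac_sub.index)]
    rows = []
    for gene, sub in linked.groupby("gene_symbol"):   # one pass over the table
        prom = sub.loc[sub.Region == "Promoter", "peak_id"].tolist()
        enh = sub.loc[sub.Region.isin(ENHANCER_BINS), "peak_id"].tolist()
        y = rna.loc[gene]
        rows.append((gene, len(prom), len(enh),
                     cv_r2(prom, atac_sub, y, cv, alphas),
                     cv_r2(enh, atac_sub, y, cv, alphas)))
    return pd.DataFrame(rows, columns=["gene", "n_promoter", "n_enhancer",
                                       "promoter_CV_R2", "enhancer_CV_R2"])

# Tiny synthetic example (random numbers, not real data)
rng = np.random.default_rng(0)
samples = [f"s{i}" for i in range(12)]
atac_demo = pd.DataFrame(rng.normal(size=(4, 12)),
                         index=["p1", "p2", "p3", "p4"], columns=samples)
rna_demo = pd.DataFrame(rng.normal(size=(2, 12)),
                        index=["geneA", "geneB"], columns=samples)
cor_demo = pd.DataFrame({
    "peak_id": ["p1", "p2", "p3", "p4"],
    "gene_symbol": ["geneA", "geneA", "geneB", "geneC"],  # geneC has no RNA row
    "Region": ["Promoter", "Distal Upstream", "Promoter", "Promoter"],
})

cv = KFold(n_splits=3, shuffle=True, random_state=1)
result = score_genes(atac_demo, rna_demo, cor_demo, cv, np.logspace(-3, 3, 31))
print(result)
```

Running on a subset first (as with the top 200 variable genes earlier) remains a sensible way to check timing before launching the full loop.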