# GEO2R DEG post-processing (Step 1): Preprocess full tables

This notebook preprocesses **GEO2R (limma) "Download full table"** outputs (TSV) for:
- **GSE43292** (Atheroma plaque vs. Macroscopically intact tissue)
- **GSE20950** (Insulin-sensitive vs. insulin-resistant)

## What this notebook does
1. Loads the downloaded GEO2R full tables (TSV).
2. Drops probes without a gene symbol and removes ambiguous probes (`Gene.symbol` containing `///`, i.e. mapped to multiple genes).
3. Applies consistent thresholds:
   - Adjusted p-value (Benjamini–Hochberg FDR): **< 0.01**
   - Absolute log2 fold change: **> 0.58** (roughly a 1.5-fold change)
4. Adds a `neglog10p` column (-log10 of the adjusted p-value) and saves the filtered tables for downstream visualization (volcano plots, Venn diagrams).
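
For orientation, the log2 fold-change cutoff can be converted back to a linear fold change. The snippet below is a minimal sketch; the names `fold_up` and `fold_down` are illustrative and do not appear elsewhere in the notebook.

```python
# Convert the |log2FC| cutoff into linear fold changes (illustrative only)
LOG2FC_THRESH = 0.58

fold_up = 2 ** LOG2FC_THRESH       # ratio a probe must exceed to pass as increased
fold_down = 2 ** (-LOG2FC_THRESH)  # ratio a probe must fall below to pass as decreased

print(f"increased: > {fold_up:.2f}x, decreased: < {fold_down:.2f}x")
```

The two values are reciprocals, so the cutoff is symmetric on the log2 scale.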

## Outputs
- `GSE43292_DEG_filtered_clean.tsv`
- `GSE20950_DEG_filtered_clean.tsv`


In [None]:
# Optional (Colab): uncomment if needed
# !pip -q install pandas numpy


In [1]:
import pandas as pd
import numpy as np

# ----------------------------
# Config
# ----------------------------
P_THRESH = 0.01
LOG2FC_THRESH = 0.58

# If running in Google Colab and you want to upload the TSV files via UI:
USE_UPLOAD = True

# If USE_UPLOAD = False, set local file paths here
GSE43292_TSV = "GSE43292.top.table.tsv"
GSE20950_TSV = "GSE20950.top.table.tsv"


In [3]:
# ----------------------------
# Load input tables
# ----------------------------
def load_tsv(path: str) -> pd.DataFrame:
    return pd.read_csv(path, sep="\t")

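if USE_UPLOAD:
    from google.colab import files
    uploaded = files.upload()
    print("Uploaded:", list(uploaded.keys()))
    # Colab renames re-uploaded files (e.g. "name (1).tsv"), so use the names
    # actually saved, matched by accession, instead of the defaults above.
    for name in uploaded:
        if "GSE43292" in name:
            GSE43292_TSV = name
        elif "GSE20950" in name:
            GSE20950_TSV = name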

deg_43292 = load_tsv(GSE43292_TSV)
deg_20950 = load_tsv(GSE20950_TSV)

print("GSE43292 raw shape:", deg_43292.shape)
print("GSE20950 raw shape:", deg_20950.shape)


Saving GSE43292.top.table.tsv to GSE43292.top.table.tsv
Saving GSE20950.top.table.tsv to GSE20950.top.table (1).tsv
Uploaded: ['GSE43292.top.table.tsv', 'GSE20950.top.table (1).tsv']
GSE43292 raw shape: (33297, 8)
GSE20950 raw shape: (54675, 8)
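
Before filtering, it may be worth confirming that each table carries the columns the preprocessing step reads. The sketch below assumes the notebook's default column names (`Gene.symbol`, `adj.P.Val`, `logFC`); `missing_columns` is a hypothetical helper, and the toy table uses made-up values.

```python
import pandas as pd

# Columns read by the preprocessing step (the notebook's default names)
REQUIRED_COLS = ["Gene.symbol", "adj.P.Val", "logFC"]

def missing_columns(df: pd.DataFrame, required=REQUIRED_COLS) -> list:
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [c for c in required if c not in df.columns]

# Toy table standing in for a GEO2R export (values are made up)
toy = pd.DataFrame({"ID": ["p1"], "adj.P.Val": [0.001], "logFC": [1.2]})
print(missing_columns(toy))  # lists "Gene.symbol", which the toy table lacks
```

An empty list means the table can be passed to the preprocessing function with its default arguments.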


In [4]:
# ----------------------------
# Preprocessing function
# ----------------------------
def preprocess_geo2r_table(df: pd.DataFrame,
                           gene_col: str = "Gene.symbol",
                           padj_col: str = "adj.P.Val",
                           logfc_col: str = "logFC",
                           p_thresh: float = P_THRESH,
                           log2fc_thresh: float = LOG2FC_THRESH) -> pd.DataFrame:
    out = df.copy()

    # Drop missing gene symbols
    out = out[out[gene_col].notna()].copy()

    # Remove ambiguous probes (multiple genes)
    out = out[~out[gene_col].astype(str).str.contains(r"///", na=False)].copy()

    # Coerce numeric
    out[padj_col] = pd.to_numeric(out[padj_col], errors="coerce")
    out[logfc_col] = pd.to_numeric(out[logfc_col], errors="coerce")

    # Apply thresholds
    out = out[(out[padj_col] < p_thresh) & (out[logfc_col].abs() > log2fc_thresh)].copy()

    # Helper column for volcano plots: -log10 of the adjusted p-value
    out["neglog10p"] = -np.log10(out[padj_col])

    return out
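
To make the filtering rules concrete, here is a small self-contained sketch that applies the same steps inline to a toy table. All values are made up, and the thresholds are hard-coded to the notebook's defaults.

```python
import pandas as pd

# Toy rows (made-up values), one per filtering case
toy = pd.DataFrame({
    "Gene.symbol": ["AAA", "BBB///CCC", None, "DDD", "EEE"],
    "adj.P.Val":   [0.001, 0.001, 0.001, 0.05, 0.002],
    "logFC":       [1.20, 2.00, 1.50, 1.00, -0.30],
})

keep = toy[toy["Gene.symbol"].notna()]                              # drop probes without a symbol
keep = keep[~keep["Gene.symbol"].str.contains("///", regex=False)]  # drop multi-gene probes
keep = keep[(keep["adj.P.Val"] < 0.01) & (keep["logFC"].abs() > 0.58)]  # apply both cutoffs

print(keep["Gene.symbol"].tolist())
```

Only `AAA` survives: `BBB///CCC` is ambiguous, the third row has no symbol, `DDD` fails the adjusted p-value cutoff, and `EEE` fails the fold-change cutoff.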


In [5]:
# ----------------------------
# Apply preprocessing & save
# ----------------------------
deg_43292_f = preprocess_geo2r_table(deg_43292)
deg_20950_f = preprocess_geo2r_table(deg_20950)

print("GSE43292 filtered shape:", deg_43292_f.shape)
print("GSE20950 filtered shape:", deg_20950_f.shape)

out_43292 = "GSE43292_DEG_filtered_clean.tsv"
out_20950 = "GSE20950_DEG_filtered_clean.tsv"

deg_43292_f.to_csv(out_43292, sep="\t", index=False)
deg_20950_f.to_csv(out_20950, sep="\t", index=False)

print("Saved:", out_43292, out_20950)

# Optional (Colab): download outputs
if USE_UPLOAD:
    from google.colab import files
    files.download(out_43292)
    files.download(out_20950)


GSE43292 filtered shape: (912, 9)
GSE20950 filtered shape: (2523, 9)
Saved: GSE43292_DEG_filtered_clean.tsv GSE20950_DEG_filtered_clean.tsv
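
As a quick sanity check before plotting, the filtered rows can be split by the sign of `logFC`. This is a sketch on toy data; `count_directions` is a hypothetical helper, and which group counts as "up" depends on the group order chosen in GEO2R.

```python
import pandas as pd

def count_directions(df: pd.DataFrame, logfc_col: str = "logFC") -> dict:
    """Count rows with positive and negative logFC (hypothetical helper)."""
    return {
        "up": int((df[logfc_col] > 0).sum()),
        "down": int((df[logfc_col] < 0).sum()),
    }

# Toy values only; in the notebook this would be applied to a filtered table
toy = pd.DataFrame({"logFC": [1.2, -0.9, 0.7, -2.1, 1.0]})
print(count_directions(toy))
```

Because filtering already requires `|logFC| > 0.58`, the two counts should sum to the number of rows in a filtered table.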

