**Abstract**

Diagnostic and prognostic prediction is crucial for clinical decision-making in cancer patients. To identify prognostic biomarkers of cancer, we built an automated pipeline that finds the intersection of three gene sets: essential genes, genes with significant expression changes, and genes that significantly affect survival in cancer patients.

Because the program is fast, we were able to repeat the process for more than 20 cancer types. For each of them, we found that overexpression of certain genes is significantly associated with worse overall survival for the relevant cancer.

With this code, our process is easily repeatable for any future data that may come to light.

**Method**

For each cancer type, our method finds the intersection of genes from three sources: 1) the collection of essential genes for that cancer type, 2) the collection of overexpressed genes for that cancer type, and 3) the collection of genes that significantly affect survival in that cancer type.
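The intersection step itself can be sketched in a few lines. This is a minimal illustration, assuming each of the three sources has already been reduced to a collection of gene symbols for one cancer type. The function name `candidate_genes` and the placeholder gene names are hypothetical and are not results from our data.

```python
def candidate_genes(essential, overexpressed, survival_associated):
    """Return the sorted gene symbols present in all three collections."""
    return sorted(set(essential) & set(overexpressed) & set(survival_associated))


# Placeholder gene symbols, purely for illustration
essential = ["GENE_A", "GENE_B", "GENE_C"]
overexpressed = ["GENE_A", "GENE_B", "GENE_D"]
survival_associated = ["GENE_A", "GENE_D"]

print(candidate_genes(essential, overexpressed, survival_associated))
```

Only genes that appear in every collection survive, so a gene missing from any one source is dropped.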

1. The collection of essential genes for that cancer type. These essential genes are pulled from Behan et al.'s "Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens", specifically table 2 [1].

    All cancer types are recategorized according to TCGA study abbreviations and filtered for each specific cancer type [2].

    Genes that are present in more than 50% of the cell lines for a cancer type are considered essential. This was computed in Excel, and the results are placed in the folder: Differential Expression

2. The collection of overexpressed genes for that cancer type.

    The overexpressed genes are downloaded from GEPIA2's differential genes analysis, run per cancer type [3].

    The q-value cutoff is set to < 0.01 and the log2 fold change cutoff to > 1. We are only interested in positive log fold changes because it is much easier to downregulate overexpressed genes via inhibition than to induce underexpressed genes. [citation needed]

    We access the data by querying GEPIA2's differential genes endpoint with the LIMMA method and parsing the HTML table embedded in its JSON response. From this table we obtain the list of genes expressed at higher levels in cancerous cells, which are promising therapeutic targets.
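For step 1, the 50% threshold was computed in Excel, but the same rule could be expressed in Python. The sketch below is an illustration only: it assumes the essentiality calls have been arranged as a mapping from cell line to the genes called essential in that line, which may not match the layout of the published table. The function name `essential_genes` and all cell line and gene names are hypothetical.

```python
from collections import Counter


def essential_genes(calls_by_cell_line, min_fraction=0.5):
    """Return genes called essential in more than `min_fraction` of cell lines."""
    n_lines = len(calls_by_cell_line)
    if n_lines == 0:
        return []
    # Count each gene at most once per cell line
    counts = Counter(
        gene
        for genes in calls_by_cell_line.values()
        for gene in set(genes)
    )
    # Strictly "more than" the threshold, matching the rule in step 1
    return sorted(g for g, c in counts.items() if c / n_lines > min_fraction)


# Placeholder data: GENE_A is in 3 of 4 lines, GENE_B in exactly 2 of 4
example = {
    "LINE_1": ["GENE_A", "GENE_B"],
    "LINE_2": ["GENE_A", "GENE_B"],
    "LINE_3": ["GENE_A", "GENE_C"],
    "LINE_4": [],
}
print(essential_genes(example))
```

Because the comparison is strict, a gene found in exactly half of the cell lines is not counted as essential.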



In [102]:
import gepia
import requests
from bs4 import BeautifulSoup
import pandas as pd

def gepia2_differential_genes(
    dataset: str,
    methodoption: str = "LIMMA",   # "ANOVA" or "LIMMA" (as in UI)
    fccutoff: float = 1.0,         # |log2FC| cutoff (set None to leave blank)
    qcutoff: float = 0.01,         # q-value cutoff (set None to leave blank)
    gene_or_isoform: str = "gene", # "gene" or "isoform"
    timeout: int = 60,
):
    url = "http://gepia2.cancer-pku.cn/assets/PHP2/differential_genes.php"
    params = {
        "methodoption": methodoption,
        "dataset": dataset,
        "fccutoff": "" if fccutoff is None else str(fccutoff),
        "type": gene_or_isoform,
        "qcutoff": "" if qcutoff is None else str(qcutoff),
    }

    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "http://gepia2.cancer-pku.cn/",
        "User-Agent": "Mozilla/5.0",
    }

    r = requests.get(url, params=params, headers=headers, timeout=timeout)
    r.raise_for_status()
    return r.json()

In [103]:
data = gepia2_differential_genes(dataset="BRCA", methodoption="LIMMA", fccutoff=1, qcutoff=0.01)
print(type(data))
print(data.keys() if isinstance(data, dict) else data[:2])

<class 'dict'>
dict_keys(['output'])


In [104]:
def gepia_html_to_df(gepia_json: dict) -> pd.DataFrame:
    """
    Convert GEPIA2 differential genes API response to a pandas DataFrame.

    Parameters
    ----------
    gepia_json : dict
        The JSON returned by the GEPIA2 differential_genes endpoint.
        Must contain the key "output" with an HTML table string.

    Returns
    -------
    pd.DataFrame
        Parsed table as dataframe.
    """

    if "output" not in gepia_json:
        raise ValueError("Input JSON does not contain 'output' field.")

    html = gepia_json["output"]

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")

    if table is None:
        raise ValueError("No table found in GEPIA HTML output.")

    # Extract headers
    headers = [th.get_text(strip=True) for th in table.find_all("th")]

    # Extract rows
    rows = []
    tbody = table.find("tbody")
    if tbody is None:
        raise ValueError("No table body found in GEPIA HTML output.")

    for tr in tbody.find_all("tr"):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

    df = pd.DataFrame(rows, columns=headers if headers else None)

    return df

In [105]:
import os

cancer_types = [
    "BRCA","CHOL","COAD","DLBC","ESCA","GBM","HNSC",
    "KICH","LAML","LGG","LUAD","LUSC","MESO","OV",
    "PAAD","PRAD","SKCM","STAD","UCEC"
]

# Folder to store outputs
OUTPUT_DIR = "gepia_DE_results"
os.makedirs(OUTPUT_DIR, exist_ok=True)

for cancer in cancer_types:
    try:
        data = gepia2_differential_genes(cancer, fccutoff=1, qcutoff=0.01)
        df = gepia_html_to_df(data)

        filename = f"{cancer}.csv"
        filepath = os.path.join(OUTPUT_DIR, filename)

        df.to_csv(filepath, index=False)
        print(f"{cancer} saved → {filepath}")

    except Exception as e:
        print(f"{cancer} unavailable ({e})")

BRCA saved → gepia_DE_results/BRCA.csv
CHOL unavailable (500 Server Error: Internal Server Error for url: http://gepia2.cancer-pku.cn/assets/PHP2/differential_genes.php?methodoption=LIMMA&dataset=CHOL&fccutoff=1&type=gene&qcutoff=0.01)
COAD saved → gepia_DE_results/COAD.csv
DLBC saved → gepia_DE_results/DLBC.csv
ESCA saved → gepia_DE_results/ESCA.csv
GBM saved → gepia_DE_results/GBM.csv
HNSC saved → gepia_DE_results/HNSC.csv
KICH saved → gepia_DE_results/KICH.csv
LAML saved → gepia_DE_results/LAML.csv
LGG saved → gepia_DE_results/LGG.csv
LUAD saved → gepia_DE_results/LUAD.csv
LUSC saved → gepia_DE_results/LUSC.csv
MESO unavailable (500 Server Error: Internal Server Error for url: http://gepia2.cancer-pku.cn/assets/PHP2/differential_genes.php?methodoption=LIMMA&dataset=MESO&fccutoff=1&type=gene&qcutoff=0.01)
OV saved → gepia_DE_results/OV.csv
PAAD saved → gepia_DE_results/PAAD.csv
PRAD saved → gepia_DE_results/PRAD.csv
SKCM saved → gepia_DE_results/SKCM.csv
STAD saved → gepia_DE_results

3. The collection of genes that significantly affect survival in that cancer type.

    We assume that a gene is critically relevant to a patient only if its abnormality leads to a significant decrease in lifespan. To test this, we use the gepia Python package to generate GEPIA2's survival plot for each overexpressed gene, comparing patients with high and low expression of that gene. We then parse the text of the resulting PDF to extract the logrank p value, with p < 0.05 chosen as significant.
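The text-matching part of this step can be tried without any PDF. The sketch below assumes the plot text contains a label of the form `Logrank p=0.0012`. The helper name `parse_logrank_p` and the sample strings are hypothetical, and unlike the notebook function it returns `None` rather than raising when no p value is found.

```python
import re

# Matches e.g. "Logrank p=0.0012" or "Logrank p = 3.6e-05" (case-insensitive)
LOGRANK_RE = re.compile(
    r"Logrank\s*p\s*=\s*([0-9]*\.?[0-9]+(?:e[-+]?[0-9]+)?)",
    flags=re.IGNORECASE,
)


def parse_logrank_p(text):
    """Return the logrank p value found in `text`, or None if absent."""
    # PDF text extraction can yield a unicode minus sign in exponents
    text = text.replace("\u2212", "-")
    match = LOGRANK_RE.search(text)
    return float(match.group(1)) if match else None


# Made-up sample strings, not real GEPIA2 output
samples = [
    "Overall Survival\nLogrank p=0.0012\nn(high)=100",
    "Logrank p = 3.6e\u221205",
    "no p value here",
]
for sample in samples:
    p = parse_logrank_p(sample)
    print(p, "significant" if p is not None and p < 0.05 else "not kept")
```

Returning `None` lets the caller decide how to handle a plot without a readable p value instead of stopping a long batch run.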

In [106]:
import re
from pypdf import PdfReader

def extract_logrank_p(pdf_path: str) -> float:
    reader = PdfReader(pdf_path)
    text = "\n".join((page.extract_text() or "") for page in reader.pages)
    text = text.replace("\u2212", "-")  # normalize unicode minus
    m = re.search(r"Logrank\s*p\s*=\s*([0-9]*\.?[0-9]+(?:e-?[0-9]+)?)", text, flags=re.I)
    if not m:
        raise ValueError("No Logrank p found in PDF text.")
    return float(m.group(1))

In [107]:
import itertools
import pandas as pd

df = pd.read_excel("Gene_data.xlsx")

genes = (
    df.iloc[5:, 1]     # gene column starts at row 5, column 1
    .dropna()
    .astype(str)
    .tolist()
)

print(len(genes))   # should print 7470
print(genes[:20])

# 2) Start simple: single genes only
combos = [(g,) for g in genes]


FileNotFoundError: [Errno 2] No such file or directory: 'Gene_data.xlsx'

In [None]:
import gepia

# 1) Make survival object
bp = gepia.survival()

# 2) Pick one gene to test
test_gene = "CCR7"   # or genes[0] once the gene list above has loaded

# 3) Run the query (this should create/return a PDF)
# The gene goes in "signature" and the cancer type in "dataset"
bp.setParam("signature", [test_gene])
bp.setParam("dataset", ["BLCA"])
out = bp.query()
print("bp.query output:", out)

# 4) Figure out the PDF path
pdf_path = None
if isinstance(out, str) and out.lower().endswith(".pdf"):
    pdf_path = out
elif isinstance(out, (list, tuple)):
    pdf_path = next((x for x in out if isinstance(x, str) and x.lower().endswith(".pdf")), None)
elif isinstance(out, dict):
    pdf_path = next((v for v in out.values() if isinstance(v, str) and v.lower().endswith(".pdf")), None)

print("pdf_path:", pdf_path)

# 5) Extract logrank p (function defined above)
if pdf_path is None:
    raise RuntimeError("GEPIA2 query did not return a PDF path.")
p = extract_logrank_p(pdf_path)
print("logrank p:", p)

./CCR7_survival_VbcXy.pdf
bp.query output: ./CCR7_survival_VbcXy.pdf
pdf_path: ./CCR7_survival_VbcXy.pdf
logrank p: 0.32


In [None]:
import os, glob, time, copy
import pandas as pd
import gepia

cancers = ["BLCA"]          # change later to loop cancers
out_dir = "./"
sleep_time = 0.6

bp = gepia.survival()
base_params = copy.deepcopy(bp.params)

results = []
N = 50   # test small first

for i, gene in enumerate(genes[:N], start=1):

    # ---------- RESET PARAMETERS EACH ITERATION ----------
    bp.setParams(copy.deepcopy(base_params))
    bp.setParam("signature", [gene])
    bp.setParam("dataset", cancers)

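    # ---------- RUN QUERY ----------
    # bp.query() returns the path of the PDF it wrote (see the test cell above)
    out = bp.query()

    # ---------- LOCATE THIS QUERY'S PDF ----------
    # Do not fall back to the newest PDF in out_dir: when a query fails,
    # that file belongs to an earlier gene and its p value would be
    # recorded under the wrong gene.
    if not (isinstance(out, str) and out.lower().endswith(".pdf") and os.path.exists(out)):
        print(f"[{i}/{N}] {gene}: no PDF produced")
        continue

    pdf_path = out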

    # ---------- EXTRACT LOGRANK P ----------
    try:
        p = extract_logrank_p(pdf_path)
    except Exception as e:
        print(f"[{i}/{N}] {gene}: failed to parse p ({e})")
        os.remove(pdf_path)
        continue

    # ---------- SAVE RESULT ----------
    results.append({
        "gene": gene,
        "cancers": ",".join(cancers),
        "logrank_p": p
    })

    print(f"[{i}/{N}] {gene}: p={p:g}")

    # ---------- DELETE PDF (IMPORTANT) ----------
    os.remove(pdf_path)

    # ---------- SAVE PROGRESS EVERY ITER ----------
    pd.DataFrame(results).to_csv("gepia_logrank_progress.csv", index=False)

    time.sleep(sleep_time)


./A1BG_survival_ImanB.pdf
[1/50] A1BG: p=0.74
Error! Please check your parameters.
axisunit: month
dataset: ['BLCA']
groupcutoff1: 50
groupcutoff2: 50
highcol: #ff0000
ifconf: conf
ifhr: hr
is_sub: false
lowcol: #0000ff
methodoption: os
signature: ['A1CF']
signature_norm: 
subtype: 
out_dir:./
[2/50] A1CF: p=0.3
./AAAS_survival_e5smG.pdf
[3/50] AAAS: p=0.98
./AACS_survival_OSqiz.pdf
[4/50] AACS: p=0.091
Error! Please check your parameters.
axisunit: month
dataset: ['BLCA']
groupcutoff1: 50
groupcutoff2: 50
highcol: #ff0000
ifconf: conf
ifhr: hr
is_sub: false
lowcol: #0000ff
methodoption: os
signature: ['AADACL2']
signature_norm: 
subtype: 
out_dir:./
[5/50] AADACL2: p=0.22
./AADAT_survival_GJOVP.pdf
[6/50] AADAT: p=0.086
./AAGAB_survival_ovFeD.pdf
[7/50] AAGAB: p=0.35
./AAMP_survival_5WqKw.pdf
[8/50] AAMP: p=0.26
./AAR2_survival_VoKjt.pdf
[9/50] AAR2: p=0.46
./AARD_survival_wFA98.pdf
[10/50] AARD: p=0.0012
./AARS_survival_QcqvJ.pdf
[11/50] AARS: p=0.066
./AARS2_survival_wwkps.pdf
[12/5

In [None]:
excel_folder = "Differential Expression"   # folder with .xlsx files
csv_folder   = "./"     # folder with .csv files

genes_by_cancer = {}

excel_files = glob.glob(os.path.join(excel_folder, "*.xlsx"))

for excel_path in excel_files:
    cancer = os.path.basename(excel_path).replace(".xlsx", "")
    csv_path = os.path.join(csv_folder, f"{cancer}.csv")

    if not os.path.exists(csv_path):
        print(f"{cancer}: no matching CSV found")
        continue

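    # Load gene symbols from the first column of the Excel file
    excel_genes = pd.read_excel(excel_path, usecols=[0]).iloc[:,0].dropna().astype(str)

    # GEPIA2 applies the cutoff to |log2FC|, so keep only over-expressed
    # genes (log2FC, 5th column, > 1) before intersecting
    csv_df = pd.read_csv(csv_path)
    log2fc = pd.to_numeric(csv_df.iloc[:, 4], errors="coerce")
    csv_genes = csv_df.loc[log2fc > 1].iloc[:, 0].dropna().astype(str)

    # Intersection
    common_genes = sorted(set(excel_genes) & set(csv_genes))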

    genes_by_cancer[cancer] = common_genes

# Result = dictionary of lists

cancer_types = [
    "BRCA","CHOL","COAD","DLBC","ESCA","GBM","HNSC",
    "KICH","LAML","LGG","LUAD","LUSC","MESO","OV",
    "PAAD","PRAD","SKCM","STAD","UCEC"
]

CHOL: no matching CSV found
MESO: no matching CSV found
MISC: no matching CSV found


In [None]:
import os, glob, time, copy
import pandas as pd
import gepia

# genes_by_cancer: dict[str, list[str]]
# Example:
# genes_by_cancer = {"BLCA": ["AAMP","AARS"], "BRCA": ["ERBB2","ESR1"]}

OUT_DIR = "./"
RESULTS_DIR = "gepia_survival_results"
SLEEP_TIME = 0
os.makedirs(RESULTS_DIR, exist_ok=True)

bp = gepia.survival()
base_params = copy.deepcopy(bp.params)

def newest_pdf(out_dir: str) -> str | None:
    pdfs = glob.glob(os.path.join(out_dir, "*.pdf"))
    if not pdfs:
        return None
    return max(pdfs, key=os.path.getmtime)

for cancer, gene_list in genes_by_cancer.items():
    # skip empty lists
    if not gene_list:
        print(f"[{cancer}] Skipping (no genes)")
        continue

    out_csv = os.path.join(RESULTS_DIR, f"{cancer}_logrank_p.csv")

    # Resume if file exists
    if os.path.exists(out_csv):
        done_df = pd.read_csv(out_csv)
        done_genes = set(done_df["gene"].astype(str))
        results = done_df.to_dict("records")
        print(f"[{cancer}] Resuming: {len(done_genes)} genes already done")
    else:
        done_genes = set()
        results = []
        print(f"[{cancer}] Starting fresh")

    total = len(gene_list)

    for i, gene in enumerate(gene_list, start=1):
        gene = str(gene).strip()
        if not gene or gene.lower() == "nan":
            continue
        if gene in done_genes:
            continue

        # Reset params each time
        bp.setParams(copy.deepcopy(base_params))
        bp.setParam("signature", [gene])
        bp.setParam("dataset", [cancer])

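        # Run query; bp.query() returns the path of the PDF it wrote
        try:
            out = bp.query()
        except Exception as e:
            print(f"[{cancer}] {gene}: query failed ({e})")
            time.sleep(SLEEP_TIME)
            continue

        # Use only the PDF from this query. Taking the newest PDF in OUT_DIR
        # would reuse a stale file from an earlier gene when a query fails.
        if isinstance(out, str) and out.lower().endswith(".pdf") and os.path.exists(out):
            pdf_path = out
        else:
            pdf_path = None
        if pdf_path is None:
            print(f"[{cancer}] {gene}: no PDF produced")
            time.sleep(SLEEP_TIME)
            continue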

        # Extract p
        try:
            p = extract_logrank_p(pdf_path)
        except Exception as e:
            print(f"[{cancer}] {gene}: failed to parse p ({e})")
            try:
                os.remove(pdf_path)
            except OSError:
                pass
            time.sleep(SLEEP_TIME)
            continue

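        # Save result only if significant (logrank p < 0.05).
        # Note: non-significant genes are not written to the CSV, so a
        # resumed run will query them again.
        if p < 0.05:
            results.append({
                "gene": gene,
                "cancer": cancer,
                "logrank_p": p
            })
            print(f"[{cancer}] {i}/{total} {gene}: p={p:g}  <-- kept")
        else:
            print(f"[{cancer}] {i}/{total} {gene}: p={p:g}  (not significant)")

        done_genes.add(gene)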

        # Delete PDF immediately
        try:
            os.remove(pdf_path)
        except OSError:
            pass

        # Write progress every gene (crash-safe)
        pd.DataFrame(results).to_csv(out_csv, index=False)

        print(f"[{cancer}] {i}/{total} {gene}: p={p:g}")
        #time.sleep(SLEEP_TIME)

    print(f"[{cancer}] Done. Saved to {out_csv}")

[PAAD] Resuming: 89 genes already done
./CDC5L_survival_gwdU9.pdf
[PAAD] 90/779 CDC5L: p=0.16  (not significant)
[PAAD] 90/779 CDC5L: p=0.16
./CDC6_survival_YKsor.pdf
[PAAD] 91/779 CDC6: p=0.008  <-- kept
[PAAD] 91/779 CDC6: p=0.008
./CDC73_survival_zU9fO.pdf
[PAAD] 92/779 CDC73: p=0.25  (not significant)
[PAAD] 92/779 CDC73: p=0.25
./CDCA8_survival_PaCBh.pdf
[PAAD] 93/779 CDCA8: p=0.046  <-- kept
[PAAD] 93/779 CDCA8: p=0.046
./CDK1_survival_hOEZG.pdf
[PAAD] 94/779 CDK1: p=0.0006  <-- kept
[PAAD] 94/779 CDK1: p=0.0006
./CDK7_survival_bjqBB.pdf
[PAAD] 95/779 CDK7: p=0.048  <-- kept
[PAAD] 95/779 CDK7: p=0.048
./CDT1_survival_CsSnp.pdf
[PAAD] 96/779 CDT1: p=0.4  (not significant)
[PAAD] 96/779 CDT1: p=0.4
./CEBPZ_survival_3Yd72.pdf
[PAAD] 97/779 CEBPZ: p=0.23  (not significant)
[PAAD] 97/779 CEBPZ: p=0.23
./CENPE_survival_ZRxW2.pdf
[PAAD] 98/779 CENPE: p=0.00036  <-- kept
[PAAD] 98/779 CENPE: p=0.00036
./CENPK_survival_dKuzY.pdf
[PAAD] 99/779 CENPK: p=0.022  <-- kept
[PAAD] 99/779 CENPK:

In [None]:
df[:50]

Unnamed: 0,gene,cancers,logrank_p,pdf
0,A1BG,BLCA,0.74,./A1BG_survival_OvLdW.pdf
1,A1CF,BLCA,0.74,./A1BG_survival_OvLdW.pdf
2,AAAS,BLCA,0.98,./AAAS_survival_hQkY3.pdf
3,AACS,BLCA,0.091,./AACS_survival_DpXZl.pdf
4,AADACL2,BLCA,0.091,./AACS_survival_DpXZl.pdf
5,AADAT,BLCA,0.086,./AADAT_survival_lLg4w.pdf
6,AAGAB,BLCA,0.35,./AAGAB_survival_Hf45X.pdf
7,AAMP,BLCA,0.26,./AAMP_survival_IHtS0.pdf
8,AAR2,BLCA,0.46,./AAR2_survival_zbtIB.pdf
9,AARD,BLCA,0.0012,./AARD_survival_iKBqF.pdf


In [None]:
cancer_types = [
    "ACC",
    "BLCA",
    "BRCA",
    "CESC",
    "CHOL",
    "COAD",
    "DLBC",
    "ESCA",
    "GBM",
    "HNSC",
    "KICH",
    "KIRC",
    "KIRP",
    "LAML",
    "LGG",
    "LIHC",
    "LUAD",
    "LUSC",
    "MESO",
    "OV",
    "PAAD",
    "PCPG",
    "PRAD",
    "READ",
    "SARC",
    "SKCM",
    "STAD",
    "TGCT",
    "THCA",
    "THYM",
    "UCEC",
    "UCS",
    "UVM"
]


for cancer in cancer_types:
    try:
        data = gepia2_differential_genes(cancer, fccutoff=1, qcutoff=0.01)

        df = gepia_html_to_df(data)
        print(df.head())
        name = cancer + ".csv"
        df.to_csv(name, index=False)
    except Exception as e:
        # Report which cancer type failed and why instead of hiding the error
        print(f"{cancer} unavailable ({e})")

                                                               
0  A4GALT  ENSG00000128274.15  4.660   10.640  -1.040   1.44e-5
1   AADAC  ENSG00000114771.13  0.960  192.584  -6.626  2.67e-60
2    AASS  ENSG00000008311.14  1.980    7.735  -1.551  4.39e-33
3   ABCA1  ENSG00000165029.15  3.240   31.830  -2.953  1.57e-49
4  ABCA10  ENSG00000154263.17  0.450    2.905  -1.429  3.10e-26
                                                               
0    A2M  ENSG00000175899.14  71.629  509.627  -2.814  1.19e-20
1   AARD   ENSG00000205002.3   0.060    2.413  -1.687  3.61e-34
2   AASS  ENSG00000008311.14   1.690    4.970  -1.150  1.46e-16
3  AATBC   ENSG00000215458.8   5.215    1.447   1.345   3.16e-4
4  ABCA1  ENSG00000165029.15   3.840   10.184  -1.208   5.70e-8
                                                                  
0     A2M  ENSG00000175899.14  150.957  329.263  -1.120   1.94e-60
1  A4GALT  ENSG00000128274.15   10.050   21.720  -1.040   2.65e-53
2   AADAC  ENSG00000114771.13  

In [None]:
import pandas as pd

# load the CSV you previously saved
df = pd.read_csv("BRCA_DEGs.csv")

# convert 5th column (index 4, log2FC) to numeric just in case
log2fc = pd.to_numeric(df.iloc[:, 4], errors="coerce")

# keep only rows where log2FC > 1
filtered_df = df[log2fc > 1]

# overwrite or save new file
filtered_df.to_csv("BRCA_DEGs_log2FC_gt1.csv", index=False)

print("Rows before:", len(df))
print("Rows after:", len(filtered_df))

Rows before: 3559
Rows after: 1424


**Citations**

[1] Behan, Fiona M., Francesco Iorio, Gabriele Picco, Emanuel Gonçalves, Charlotte M. Beaver, Giorgia Migliardi, Rita Santos, et al. “Prioritization of Cancer Therapeutic Targets Using CRISPR–Cas9 Screens.” Nature 568, no. 7753 (April 2019): 511–16. https://doi.org/10.1038/s41586-019-1103-9.

[2] National Cancer Institute. “TCGA Study Abbreviations | NCI Genomic Data Commons.” gdc.cancer.gov, n.d. https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations.

[3] Zhang's Lab. “GEPIA 2.” Cancer-pku.cn, 2026. http://gepia2.cancer-pku.cn/#degenes.