# Convert RefSeq IDs to Ensembl Gene IDs in a BED File

This notebook:
1. Loads a BED file whose 4th column contains RefSeq transcript IDs.  
2. Loads the NCBI `mouse_gene2ensembl` mapping file.  
3. Strips version suffixes (e.g. “.1”) from RefSeq IDs in both datasets.  
4. Builds a dictionary that maps RefSeq → Ensembl Gene ID.  
5. Replaces each RefSeq ID in the BED with its corresponding Ensembl Gene ID (IDs with no match are kept as the version-stripped RefSeq ID).  
6. Writes out a new BED file with Ensembl IDs in column 4.
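
As a quick illustration of steps 3 to 5, here is a minimal sketch on toy data. The accessions, Ensembl IDs, and the `strip_version` helper are made up for the example and do not come from the real files; only the regex and the dictionary lookup mirror the notebook code.

```python
import pandas as pd

# Toy stand-ins for the real inputs (all IDs below are invented for illustration).
bed = pd.DataFrame({
    0: ["chr1", "chr2", "chr3"],
    1: [100, 200, 300],
    2: [150, 250, 350],
    3: ["NM_000001.2", "NM_000002.1", "NR_000003.1"],
})
mapping = pd.DataFrame({
    "RefSeq": ["NM_000001.3", "NM_000002.1"],
    "EnsemblGene": ["ENSMUSG00000000001", "ENSMUSG00000000002"],
})

def strip_version(ids):
    """Remove a trailing ".N" version suffix from each accession (hypothetical helper)."""
    return ids.str.replace(r"\.\d+$", "", regex=True)

# Build the lookup on version-stripped RefSeq IDs, then apply it to column index 3.
refseq_to_ensg = dict(zip(strip_version(mapping["RefSeq"]), mapping["EnsemblGene"]))
bed[3] = strip_version(bed[3]).map(lambda x: refseq_to_ensg.get(x, x))

print(bed[3].tolist())
# ['ENSMUSG00000000001', 'ENSMUSG00000000002', 'NR_000003']
```

Note that `NM_000001.2` still matches `NM_000001.3` because versions are stripped on both sides, and that the unmatched `NR_000003.1` comes out without its version suffix.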


In [1]:
# ─── Cell 1: Imports ────────────────────────────────────────────────────────
import pandas as pd


# ─── Cell 2: Load the BED file with RefSeq IDs in column 4 ─────────────────
# The BED must have at least 4 columns; column index 3 (0-based) holds RefSeq IDs.
bed = pd.read_csv(
    "/projectnb/perissilab/Xinyu/GPS2_CHIPseq/mm39.bed",
    sep="\t",
    header=None
)

# ─── Cell 3: Load the mouse_gene2ensembl mapping TSV ───────────────────────
# Columns: tax_id, GeneID, EnsemblGene, RefSeq, EnsemblTranscript, RefSeqProtein, EnsemblProtein
mapping = pd.read_csv(
    "/projectnb/perissilab/Xinyu/GPS2_CHIPseq/Adapters_and_Annotations/mouse_gene2ensembl.tsv",
    sep="\t",
    header=None,
    comment="#",  # skip NCBI's "#tax_id ..." header line if the file still has one
    names=[
        "tax_id", "GeneID", "EnsemblGene", "RefSeq",
        "EnsemblTranscript", "RefSeqProtein", "EnsemblProtein"
    ]
)


# ─── Cell 4: Strip version suffixes (“.N”) to match IDs reliably ──────────
# e.g. convert “NM_001234.2” → “NM_001234”
mapping["RefSeq"] = mapping["RefSeq"].str.replace(r"\.\d+$", "", regex=True)
bed[3] = bed[3].str.replace(r"\.\d+$", "", regex=True)


# ─── Cell 5: Build a RefSeq→EnsemblGene dictionary and apply mapping ──────
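# If a RefSeq ID appears on several mapping rows, the last row's EnsemblGene wins.
refseq_to_ensg = dict(zip(mapping["RefSeq"], mapping["EnsemblGene"]))
# Map each RefSeq ID in column index 3 to its EnsemblGene
# (IDs not found in the dictionary keep their version-stripped RefSeq form)
bed[3] = bed[3].map(lambda x: refseq_to_ensg.get(x, x))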

# ─── Cell 6: Save the new BED file with Ensembl IDs in column 4 ──────────
bed.to_csv(
    "/projectnb/perissilab/Xinyu/GPS2_CHIPseq/mm39_ensembl.bed",
    sep="\t",
    header=False,
    index=False
)
print("Done! Saved as mm39_ensembl.bed")



Done! Saved as mm39_ensembl.bed
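
Because unmatched IDs fall back silently, it may be worth checking how many rows were actually converted. The sketch below is one possible way to do that, not part of the original workflow: `mapping_report` is a hypothetical helper and the IDs are toy values. On the real data it would have to run on the version-stripped `bed[3]` before the replacement step, using the `refseq_to_ensg` dictionary.

```python
import pandas as pd

def mapping_report(ids, lookup):
    """Return (mapped, unmapped) counts for a Series of version-stripped RefSeq IDs."""
    found = ids.isin(list(lookup))  # True where the ID is a key of the lookup dict
    return int(found.sum()), int((~found).sum())

# Toy inputs, invented for illustration only.
ids = pd.Series(["NM_000001", "NM_000002", "NR_000003"])
lookup = {
    "NM_000001": "ENSMUSG00000000001",
    "NM_000002": "ENSMUSG00000000002",
}

mapped, unmapped = mapping_report(ids, lookup)
print(f"mapped: {mapped}, unmapped: {unmapped}")
# mapped: 2, unmapped: 1
```

A large unmapped count would suggest that the BED uses accessions missing from the mapping file, for example predicted `XM_`/`XR_` models or IDs from a different annotation release.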
