# Dataset Creation

## Aim
Create a dataset for **future analysis**.

---

## Necessary Downloads
Before starting, be sure to have downloaded the following material.

### 1. MSigDB Gene Sets (Human)
Download the entire MSigDB for **Human** in **JSON** format (`Human Gene Set JSON file set (ZIPped)`).
- **Link**: [https://www.gsea-msigdb.org/gsea/downloads.jsp](https://www.gsea-msigdb.org/gsea/downloads.jsp)

### 2. UniProt Human Proteome
Download the list of all **HUMAN proteins** (UniProt human proteome - **UP000005640**).
- **Link (Website)**: [https://www.uniprot.org/proteomes/UP000005640](https://www.uniprot.org/proteomes/UP000005640)
- **Programmatic Download (Terminal)**:

```bash
wget -O human_proteome.tsv.gz "[https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession,reviewed,id,protein_name,gene_names,organism_name,sequence&format=tsv&query=(proteome:UP000005640](https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession,reviewed,id,protein_name,gene_names,organism_name,sequence&format=tsv&query=(proteome:UP000005640))"
gunzip human_proteome.tsv.gz
```

### 3. UniRef50 Clusters

#### A. Initial Programmatic Download (Partial)
This method retrieves all clusters with at least one human protein, but is a **subsample** of the full UniRef50.
- **Link (Website)**: [https://www.uniprot.org/uniref?query=%28identity%3A0.5%29+AND+%28taxonomy_id%3A9606%29](https://www.uniprot.org/uniref?query=%28identity%3A0.5%29+AND+%28taxonomy_id%3A9606%29)
- **Programmatic Download (Terminal)**:
    - **ATTENTION**: This call retrieves all clusters with at least 1 human protein, *not* just the human proteins themselves.

```bash
curl -o uniref50_human.tsv.gz "[https://rest.uniprot.org/uniref/stream?compressed=true&fields=id,name,organism,length,identity,count,members&format=tsv&query=((identity:0.5)+AND+(taxonomy_id:9606](https://rest.uniprot.org/uniref/stream?compressed=true&fields=id,name,organism,length,identity,count,members&format=tsv&query=((identity:0.5)+AND+(taxonomy_id:9606)))"
gunzip uniref50_human.tsv.gz
```

#### B. Full UniRef50 Download (Required for Complete Data)
Since the above options only download a subsample, the entire UniRef50 ($\sim$230GB) must be downloaded using the XML format.

- **Download Full XML UniRef50**:

```bash
wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.xml.gz
gunzip uniref50.xml.gz
```

### 4. Subsample Human Clusters from UniRef50 (Processing)
The full UniRef50 XML file is too large for routine handling. The next steps process the file to create a dataset containing **only clusters with human proteins** and **only the human proteins themselves**. This requires a dedicated C++ script.

- **Dependencies**:

```bash
## Download dependencies
sudo apt-get install libexpat1-dev
```

- **C++ Script and Compilation**:
  - The necessary C++ file (`uniref_extractor.cpp`) is assumed to be in the utils folder (e.g., `/home/gdallagl/myworkdir/ESMSec/utils/fast_uniref_extractor.cpp`).
  - **Compile file**:

```bash
g++ -O3 -march=native -o extract_human uniref_extractor.cpp -lexpat
```

- **Run Script**:
  - This script processes the full XML file to create a CSV with the desired human clusters/proteins.

```bash
./extract_human uniref50.xml fast_c++_human_clusters.csv
```

### 5. UniProt Gene-Protein mapping

[UniProt protein-gene Mapping](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping.dat.gz)

<!-- ### 5. BioMart Gene-Protein mapping

Download BioMart gne-protein mapping, besure to include **Gene Name** and **Transcript Name**.
[link](https://www.ensembl.org/biomart/martview/3aaaf734b93facdfad8207234204cc31) -->


## Hyperparameters

In [26]:
import json
import os
import re
import pandas as pd
from scipy.special import softmax
from sklearn.model_selection import train_test_split

import utils.dataset_functions as dataf

# Directory containing MSigDB JSON files
JSON_DIR = "/home/gdallagl/myworkdir/data/MSigDB/msigdb_v2025.1.Hs_json_files_to_download_locally"

# Garated genes list
GUARANTEED_GENES_PATH = "/home/gdallagl/myworkdir/data/MSigDB/julies_cycling_signatures_cancer.tsv"

# Updated keywords pattern with word boundaries to avoid false matches
KEYWORDS_PATTERN = r"(?:PROLIFERA|\bPROLIFER\b|_PROLIFER_|_CYCLING_|^CELL_CYCLE_|_CELL_CYCLE_|_CC_|_G1_|_S_PHASE_|_G2_|_M_PHASE_|\bMITOSIS\b|\bCYCLIN\b|\bCDK\b|\bCHECKPOINT\b|\bGS1\b|\bGS2\b)"

# Exclusion pattern
EXCLUSION_PATTERN = r"(?:MEIOSIS|FATTY_ACID_CYCLING_MODEL)"

# Human proteome path
HUMAN_PROTEOME_PATH = "/home/gdallagl/myworkdir/data/UniRef50/human_proteome.tsv"

# Uniref Apth
UNIREF_PATH = "/home/gdallagl/myworkdir/ESMSec/data/UniRef50/fast_c++_human_clusters.csv"

# Biomart apth
MAPPING_PATH = "/home/gdallagl/myworkdir/ESMSec/data/UniRef50/HUMAN_9606_idmapping.dat"

# Minimum frequency threshold for filtering ambiguous genes
MIN_FREQ_AMBIGOUS = 1

# min number of postive samples per positive cluster
MIN_SAMPLE_N_POSITIVE = 1

# how many mroe negativ class to sampel
NEGATIVE_CLASS_MULT = 2

# savifn csv datset
FINAL_DATASET_PATH = f"/home/gdallagl/myworkdir/ESMSec/data/cell_cycle_dataset_{MIN_SAMPLE_N_POSITIVE}:{NEGATIVE_CLASS_MULT}.csv"


# Autorelaod
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Select pathways related to cell cycle

Read the MSigDB and select all pathways with keyword related to field of interest.

In [2]:
### 1) Transform jsons into df
df_genesets = dataf.load_json_folder_to_df(JSON_DIR)
display(df_genesets.head(2)); print(df_genesets.shape)

### 2) Select only geneset related to interested fiedl
df_filtered = dataf.filter_gene_sets_by_keywords(df_genesets, KEYWORDS_PATTERN, EXCLUSION_PATTERN)
display(df_filtered.head(2)); print(df_filtered.shape)

list(df_filtered['set_name'])

Unnamed: 0,set_name,collection,systematicName,pmid,exactSource,externalDetailsURL,msigdbURL,geneSymbols,filteredBySimilarity,externalNamesForSimilarTerms,source_file
0,MIR153_5P,C3:MIR:MIRDB,M30412,31504780,,http://mirdb.org/cgi-bin/mature_mir.cgi?name=h...,https://www.gsea-msigdb.org/gsea/msigdb/human/...,"[A1CF, AAK1, AASDHPPT, ABCE1, ABHD2, ABI2, ACB...",[],[],c3.mir.mirdb.v2025.1.Hs.json
1,MIR8485,C3:MIR:MIRDB,M30413,31504780,,http://mirdb.org/cgi-bin/mature_mir.cgi?name=h...,https://www.gsea-msigdb.org/gsea/msigdb/human/...,"[AAK1, ABHD18, ABL2, ABLIM1, ACVR1, ACVR2B, AC...",[],[],c3.mir.mirdb.v2025.1.Hs.json


(122192, 11)


Unnamed: 0,set_name,collection,systematicName,pmid,exactSource,externalDetailsURL,msigdbURL,geneSymbols,filteredBySimilarity,externalNamesForSimilarTerms,source_file
12226,KEGG_MEDICUS_PATHOGEN_KSHV_VCYCLIN_TO_CELL_CYC...,C2:CP:KEGG_MEDICUS,M47461,,N00168,https://www.kegg.jp/entry/N00168,https://www.gsea-msigdb.org/gsea/msigdb/human/...,"[CDK4, CDK6, E2F1, E2F2, E2F3, RB1]",[],[],c2.cp.kegg_medicus.v2025.1.Hs.json
12229,KEGG_MEDICUS_PATHOGEN_HTLV_1_TAX_TO_P21_CELL_C...,C2:CP:KEGG_MEDICUS,M47585,,N00498,https://www.kegg.jp/entry/N00498,https://www.gsea-msigdb.org/gsea/msigdb/human/...,"[CCNE1, CCNE2, CDK2, CDKN1A, E2F1, E2F2, E2F3,...",[],[],c2.cp.kegg_medicus.v2025.1.Hs.json


(1343, 11)


['KEGG_MEDICUS_PATHOGEN_KSHV_VCYCLIN_TO_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_PATHOGEN_HTLV_1_TAX_TO_P21_CELL_CYCLE_G1_S_N00498',
 'KEGG_MEDICUS_REFERENCE_MDM2_P21_CELL_CYCLE_G1_S_N00536',
 'KEGG_MEDICUS_VARIANT_AMPLIFIED_MDM2_TO_P21_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_VARIANT_AMPLIFIED_MYC_TO_P27_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_VARIANT_AMPLIFIED_MYC_TO_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_REFERENCE_P300_P21_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_REFERENCE_CDC25_CELL_CYCLE_G2_M',
 'KEGG_MEDICUS_REFERENCE_ATR_P21_CELL_CYCLE_G2_M',
 'KEGG_MEDICUS_REFERENCE_WEE1_CELL_CYCLE_G2_M',
 'KEGG_MEDICUS_PATHOGEN_HPV_E7_TO_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_VARIANT_AMPLIFIED_CCND1_TO_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_PATHOGEN_EBV_EBNA3C_TO_CELL_CYCLE_G1_S_N00483',
 'KEGG_MEDICUS_REFERENCE_P27_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_PATHOGEN_EBV_EBNA3C_TO_P27_CELL_CYCLE_G1_S_N00264',
 'KEGG_MEDICUS_PATHOGEN_EBV_EBNA3C_TO_CELL_CYCLE_G1_S_N00484',
 'KEGG_MEDICUS_PATHOGEN_HTLV_1_TAX_TO_P21_CELL_CYCLE_G1_S_N00497',
 'KEGG_MEDICUS

## Count in how many genesets each gene is present

In [3]:
gene_counts_df = dataf.gene_set_counts(df_filtered)
display(gene_counts_df.head())

Unnamed: 0,gene,geneset_count
0,E2F1,75
1,CCNB1,70
2,CDK1,70
3,CDKN1A,69
4,TP53,64


## Filter out genes with too few GS
*) GS = genesets

In [4]:
# give a label to gene that overcome the thr
gene_counts_df["label"] = gene_counts_df.geneset_count.apply(lambda x: 'positive' if x > MIN_FREQ_AMBIGOUS else 'ambigous')

# create a label for later
gene_counts_df["is_guaranteed"] = False

display(gene_counts_df)

Unnamed: 0,gene,geneset_count,label,is_guaranteed
0,E2F1,75,positive,False
1,CCNB1,70,positive,False
2,CDK1,70,positive,False
3,CDKN1A,69,positive,False
4,TP53,64,positive,False
...,...,...,...,...
9091,ATG4B,1,ambigous,False
9092,ARVCF,1,ambigous,False
9093,ARRDC1,1,ambigous,False
9094,ARHGEF7,1,ambigous,False


In [5]:
# make postive gene-freq mapping
# Filter only positive genes
positive_genes_df = gene_counts_df[gene_counts_df["label"] == "positive"]
# Create a mapping: gene -> frequency
positive_gene_map = dict(zip(positive_genes_df["gene"], positive_genes_df["geneset_count"]))
positive_gene_map

{'E2F1': 75,
 'CCNB1': 70,
 'CDK1': 70,
 'CDKN1A': 69,
 'TP53': 64,
 'CCNE1': 56,
 'CDK2': 56,
 'RB1': 56,
 'CCND1': 55,
 'CCNA2': 50,
 'CDC6': 49,
 'CCNE2': 46,
 'TGFB1': 45,
 'CDK4': 45,
 'MIR222': 43,
 'PLK1': 43,
 'CDKN1B': 42,
 'MYC': 42,
 'CCNB2': 42,
 'CDC25C': 42,
 'FGF2': 41,
 'E2F3': 41,
 'NDC80': 41,
 'CDKN2A': 41,
 'BIRC5': 41,
 'CHEK1': 40,
 'SHH': 40,
 'FGF10': 39,
 'CDC20': 39,
 'MAD2L1': 38,
 'CTNNB1': 38,
 'IGF1': 38,
 'NF1': 38,
 'CDC25A': 38,
 'TFDP1': 38,
 'PKMYT1': 37,
 'CDC25B': 37,
 'AURKA': 36,
 'DTL': 36,
 'E2F2': 36,
 'FGFR2': 36,
 'WEE1': 36,
 'BMP4': 36,
 'RRM2': 35,
 'FBXO5': 35,
 'ZWINT': 35,
 'CENPF': 35,
 'AURKB': 35,
 'ATM': 35,
 'BRCA1': 35,
 'CDK6': 34,
 'EZH2': 34,
 'MIR21': 34,
 'RGCC': 33,
 'CDCA8': 33,
 'CDC7': 33,
 'MDM2': 33,
 'RRM1': 33,
 'TTK': 33,
 'WNT5A': 33,
 'BUB1': 33,
 'FZR1': 33,
 'CLSPN': 32,
 'CCND3': 32,
 'KIF14': 32,
 'MIR221': 32,
 'BRCA2': 31,
 'BARD1': 31,
 'FGFR1': 31,
 'UBE2C': 31,
 'CCND2': 30,
 'CHEK2': 30,
 'MIR503': 30,
 '

## Add Guaranteed genes

Add genes related to interesting field (i.e. that msut be present).

Read them from csv file.

Add them with max-freq.

In [6]:
### 1) Read spefic csv
garanted_genes_df = pd.read_csv(GUARANTEED_GENES_PATH, sep='\t')
display(garanted_genes_df.head(5))

### 2) Extarct the single gene names
all_values = garanted_genes_df.to_numpy().flatten().tolist()
all_values = [x for x in all_values if pd.notna(x)] # remove nan
all_values = list(set(all_values)) # remove duplicated
print(all_values); print(len(all_values))

### 3) Add them into previus df
# create a DataFrame for new genes
new_genes_df = pd.DataFrame({
    'gene': all_values,
    'geneset_count': max(gene_counts_df.geneset_count), # Use as Freq the max (as these genes are guaranted)
    'label': "positive",
    "is_guaranteed": True
})

# append to existing gene_frequency_df
gene_frequency_df = pd.concat([gene_counts_df, new_genes_df], ignore_index=True)

# Drop duplicates, keeping **the last occurrence** (i.e., from new_genes_df)
gene_frequency_df = gene_frequency_df.drop_duplicates(subset='gene', keep='last')

# Sort and reset index
gene_frequency_df.sort_values(by=['geneset_count', 'gene'], ascending=[False, True], inplace=True)
gene_frequency_df.reset_index(drop=True, inplace=True)

# add labels
ambiguos_genes = set(gene_frequency_df[gene_frequency_df.label == "ambigous"].gene)
positive_genes = set(gene_frequency_df[gene_frequency_df.label == "positive"].gene)

display(gene_frequency_df)

print(gene_frequency_df.is_guaranteed.value_counts())
print(gene_frequency_df.gene.nunique())

Unnamed: 0,GBM_G1S,GBM_G2M,H3_K27M_CC,IDH_O_G1S,IDH_O_G2M,Melanoma_G1S,Melanoma_G2M
0,RRM2,CCNB1,UBE2T,MCM5,HMGB2,MCM5,HMGB2
1,PCNA,CDC20,HMGB2,PCNA,CDK1,PCNA,CDK1
2,KIAA0101,CCNB2,TYMS,TYMS,NUSAP1,TYMS,NUSAP1
3,HIST1H4C,PLK1,MAD2L1,FEN1,UBE2C,FEN1,UBE2C
4,MLF1IP,CCNA2,CDK1,MCM2,BIRC5,MCM2,BIRC5


['OIP5', 'ATAD2', 'CDCA8', 'HN1', 'KIF20B', 'CKAP5', 'NEK2', 'USP1', 'PBK', 'KIF4A', 'KIF2C', 'CDC25B', 'CDC25C', 'KPNA2', 'CDC20', 'AURKB', 'TIPIN', 'CKAP2', 'FOXM1', 'HMGB2', 'ARHGAP11A', 'TUBA1B', 'NCAPD2', 'CHAF1B', 'RRM1', 'RRM2', 'CKS2', 'TMEM106C', 'GPSM2', 'CDCA2', 'TACC3', 'KIF22', 'KIF20A', 'CBX5', 'RNASEH2A', 'RACGAP1', 'RAD51', 'PCNA', 'CCNA2', 'GTSE1', 'GMNN', 'RFC2', 'CTCF', 'PLK1', 'CCNB1', 'KIAA0101', 'UNG', 'RAD51AP1', 'WDR76', 'SPAG5', 'MXD3', 'CKS1B', 'TMPO', 'CENPA', 'CENPK', 'CCNE2', 'NDC80', 'CLSPN', 'FEN1', 'E2F8', 'LBR', 'TK1', 'GINS2', 'NUF2', 'UHRF1', 'PTTG1', 'KIF11', 'CDC6', 'PRIM1', 'ASF1B', 'MKI67', 'TUBB4B', 'TOP2A', 'TUBA1C', 'FAM64A', 'SMC4', 'BRIP1', 'ARL6IP1', 'POLD3', 'DLGAP5', 'CDC45', 'BUB1', 'AURKA', 'RPA2', 'CENPM', 'MAD2L1', 'PKMYT1', 'SLBP', 'TYMS', 'G2E3', 'MCM3', 'CDCA7', 'FANCI', 'NUSAP1', 'TTK', 'ANLN', 'KIF23', 'UBE2T', 'MCM7', 'MZT1', 'TROAP', 'MELK', 'MCM5', 'ZWINT', 'HIST1H4C', 'MCM2', 'CENPE', 'RANGAP1', 'DSCC1', 'KIFC1', 'UBR7', 'HMGN

Unnamed: 0,gene,geneset_count,label,is_guaranteed
0,ANLN,75,positive,True
1,ANP32E,75,positive,True
2,ARHGAP11A,75,positive,True
3,ARL6IP1,75,positive,True
4,ASF1B,75,positive,True
...,...,...,...,...
9098,ZSCAN20,1,ambigous,False
9099,ZSCAN22,1,ambigous,False
9100,ZSCAN9,1,ambigous,False
9101,ZSWIM4,1,ambigous,False


is_guaranteed
False    8957
True      146
Name: count, dtype: int64
9103


## Read proteome df

In [7]:
# ### 1) Read proteiame df
# proteome_df = pd.read_csv(HUMAN_PROTEOME_PATH, sep='\t')
# display(proteome_df.head(5)); print(proteome_df.shape)

# # 2) Extarct
# all_human_proteins = set(proteome_df['Entry'].unique())
# print(len(all_human_proteins))

# all_human_genes = set()
# for names in proteome_df['Gene Names'].dropna(): # need this because same p
#     genes = names.split()  # split by spaces
#     all_human_genes.update(genes)
# print(len(all_human_genes)) # ATTENTION: One gene can produce multiple protein isoforms

# ### 3) Chekc how many genes intersect
# print(len(all_human_genes.intersection(positive_genes))) 
#     # ATTWNTION: they shoudl be ALL the positve genes, why npot?

# # add protein anme
# # gene_frequency_df["mapped_protein_name"] = gene_frequency_df['gene_name'].map(
# #     proteome_df.drop_duplicates(subset='Gene Names').set_index('Gene Names')['Protein names'] # ATTENTION DUPLICATED !!!
# # )
# # display(gene_frequency_df[gene_frequency_df.mapped_protein_name.notna()])

## Find Clusters in UniRef50

Find clusters un UniRef with at least one gene defined before.

Use file created from entire Uniref50 xml. 

In [8]:
# Read mapong df
df = pd.read_csv(MAPPING_PATH, sep='\t', header=None, names=['UniProtKB_Accession', 'ID_Type', 'External_ID'])

print("Unique ID types:", df['ID_Type'].unique())

# Sleect onyl gene names
gene_df = df[df['ID_Type'] == 'Gene_Name'].copy()
display(gene_df)

# make mapping dict
protein_to_gene_map = gene_df.set_index('UniProtKB_Accession')['External_ID'].dropna().to_dict()

# all tpriens weith UniProtKB_Accession
all_uniprot_prioteins_name = set(gene_df["UniProtKB_Accession"].unique())

#little cjeck
for i, (protein, gene) in enumerate(protein_to_gene_map.items()):
    if i >= 5:
        break
    print(protein, "->", gene)

protein_to_gene_map["Q8WZ42"] #TITIN, A0AAQ5BIC8,

Unique ID types: ['UniProtKB-ID' 'Gene_Name' 'GI' 'UniRef100' 'UniRef90' 'UniRef50'
 'UniParc' 'EMBL' 'EMBL-CDS' 'NCBI_TaxID' 'CCDS' 'RefSeq' 'RefSeq_NT'
 'PDB' 'EMDB' 'BioGRID' 'DIP' 'MINT' 'STRING' 'ChEMBL' 'DrugBank'
 'BioMuta' 'DMDM' 'CPTAC' 'ProteomicsDB' 'DNASU' 'Ensembl' 'Ensembl_TRS'
 'Ensembl_PRO' 'GeneID' 'KEGG' 'GeneCards' 'HGNC' 'MIM' 'neXtProt'
 'OpenTargets' 'PharmGKB' 'VEuPathDB' 'eggNOG' 'GeneTree' 'HOGENOM' 'OMA'
 'OrthoDB' 'TreeFam' 'Reactome' 'ChiTaRS' 'GeneWiki' 'GenomeRNAi' 'IDEAL'
 'CRC64' 'TCDB' 'UCSC' 'Orphanet' 'Gene_Synonym' 'ComplexPortal'
 'GeneReviews' 'BioCyc' 'SwissLipids' 'UniPathway' 'Gene_ORFName'
 'DisProt' 'GuidetoPHARMACOLOGY' 'GlyConnect' 'MEROPS' 'ESTHER'
 'Allergome' 'PeroxiBase' 'REBASE' 'PATRIC']


Unnamed: 0,UniProtKB_Accession,ID_Type,External_ID
1,P31946,Gene_Name,YWHAB
120,P62258,Gene_Name,YWHAE
251,Q04917,Gene_Name,YWHAH
342,P61981,Gene_Name,YWHAG
470,P31947,Gene_Name,SFN
...,...,...,...
5375347,Q8TB44,Gene_Name,MTSS1L
5375359,Q9UJU1,Gene_Name,VIL2
5375411,A5HC06,Gene_Name,KRAS
5375520,V5LL19,Gene_Name,HLA-C


P31946 -> YWHAB
P62258 -> YWHAE
Q04917 -> YWHAH
P61981 -> YWHAG
P31947 -> SFN


'TTN'

In [9]:
# Load
uniref_df = pd.read_csv(UNIREF_PATH) 

# Tranfor form string to lsit
uniref_df['proteins'] = uniref_df['Human_Proteins'].apply(lambda x: [item.strip().rstrip('.') for item in x.split(';')])
uniref_df['n_proteins'] = uniref_df['proteins'].apply(len)

# Format proteins
def clean_protein(protein):
    # 1) strip away isoform suffix: e.g., "protein-1" -> "protein"
    protein = protein.split('-')[0]
    # 2) strip away "_HUMAN": e.g., "protein_HUMAN" -> "protein"
    protein = protein.replace('_HUMAN', '')
    return protein
# Apply cleaning and remove duplicates in each group
uniref_df['proteins_cleaned'] = uniref_df['proteins'].apply(lambda lst: list(set(clean_protein(p) for p in lst)))

# remove protien wiht not UniProtKB_Accession iD
uniref_df["proteins_cleaned"] = uniref_df["proteins_cleaned"].apply(lambda x: [p for p in x if p in all_uniprot_prioteins_name])
 
uniref_df['n_proteins_cleaned'] = uniref_df['proteins_cleaned'].apply(len)
display(uniref_df.head(5))
display(uniref_df.shape)

# Map prot to genes
def map_proteins_to_genes(protein_list):
    # Map each protein in the list to a gene, skip if not found
    genes = [protein_to_gene_map[p] if p in protein_to_gene_map else None for p in protein_list]
    # Remove duplicates while preserving order return list(dict.fromkeys(genes))
    return genes  # keep duplicates
uniref_df['genes'] = uniref_df['proteins_cleaned'].apply(map_proteins_to_genes)
display(uniref_df.head(5))
display(uniref_df.shape)




# 1) Add the 'label' column: 'positive' if at least one protein maps to a positive gene, otherwise 'negative'
uniref_df["label"] = uniref_df["genes"].apply(
    lambda gene_list: "positive" 
    if any(g in positive_genes for g in gene_list)
    else "negative"
)
display(uniref_df.head(5))
display(uniref_df.shape)

# 2) Add lsist of  'positive_genes' and "postive_proteins" 
# that ONLY map to a positive gene.
uniref_df["positive_genes"] = uniref_df["genes"].apply(
    lambda gene_list: [
        g
        for g in gene_list 
        if g in positive_genes
    ]
)
def get_positive_proteins(protein_list):
    pos_proteins = [
        p for p in protein_list 
        if p in protein_to_gene_map and protein_to_gene_map[p] in positive_genes
    ]
    return pos_proteins

uniref_df["positive_proteins"] = uniref_df["proteins_cleaned"].apply(get_positive_proteins)
uniref_df["n_positive_proteins"] = uniref_df["proteins_cleaned"].apply(len)


display(uniref_df.head(5))
display(uniref_df.shape)


Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,n_proteins,proteins_cleaned,n_proteins_cleaned
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...",18,"[Q8WZ42, H0Y4J7, A0A0C4DG59, C0JYZ2, A0AAQ5BIC...",7
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]",4,"[Q8WZ42, A0A0A0MRA3]",2
2,UniRef50_Q8WXI7,Cluster: Mucin-16,MUC16_HUMAN; A0AAG2UXD3_HUMAN; A0AAG2UUZ0_HUMA...,"[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...",8,"[A0AAG2UXD3, A0AAG2UXK0, A0AA34QW05, A0AAG2UUZ...",7
3,UniRef50_A0AA34QVW0,"Cluster: Mucin 16, cell surface associated",A0AA34QVW0_HUMAN,[A0AA34QVW0_HUMAN],1,[A0AA34QVW0],1
4,UniRef50_Q9H195,Cluster: Mucin-3B,MUC3B_HUMAN; H9XFA8_HUMAN; I0CMK2_HUMAN; O4342...,"[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...",4,"[I0CMK2, O43420, H9XFA8]",3


(77325, 7)

Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,n_proteins,proteins_cleaned,n_proteins_cleaned,genes
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...",18,"[Q8WZ42, H0Y4J7, A0A0C4DG59, C0JYZ2, A0AAQ5BIC...",7,"[TTN, TTN, TTN, TTN, TTN, TTN, TTN]"
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]",4,"[Q8WZ42, A0A0A0MRA3]",2,"[TTN, TTN]"
2,UniRef50_Q8WXI7,Cluster: Mucin-16,MUC16_HUMAN; A0AAG2UXD3_HUMAN; A0AAG2UUZ0_HUMA...,"[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...",8,"[A0AAG2UXD3, A0AAG2UXK0, A0AA34QW05, A0AAG2UUZ...",7,"[MUC16, MUC16, MUC16, MUC16, MUC16, MUC16, MUC16]"
3,UniRef50_A0AA34QVW0,"Cluster: Mucin 16, cell surface associated",A0AA34QVW0_HUMAN,[A0AA34QVW0_HUMAN],1,[A0AA34QVW0],1,[MUC16]
4,UniRef50_Q9H195,Cluster: Mucin-3B,MUC3B_HUMAN; H9XFA8_HUMAN; I0CMK2_HUMAN; O4342...,"[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...",4,"[I0CMK2, O43420, H9XFA8]",3,"[MUC3B, MUC3, MUC3B]"


(77325, 8)

Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,n_proteins,proteins_cleaned,n_proteins_cleaned,genes,label
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...",18,"[Q8WZ42, H0Y4J7, A0A0C4DG59, C0JYZ2, A0AAQ5BIC...",7,"[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",positive
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]",4,"[Q8WZ42, A0A0A0MRA3]",2,"[TTN, TTN]",positive
2,UniRef50_Q8WXI7,Cluster: Mucin-16,MUC16_HUMAN; A0AAG2UXD3_HUMAN; A0AAG2UUZ0_HUMA...,"[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...",8,"[A0AAG2UXD3, A0AAG2UXK0, A0AA34QW05, A0AAG2UUZ...",7,"[MUC16, MUC16, MUC16, MUC16, MUC16, MUC16, MUC16]",negative
3,UniRef50_A0AA34QVW0,"Cluster: Mucin 16, cell surface associated",A0AA34QVW0_HUMAN,[A0AA34QVW0_HUMAN],1,[A0AA34QVW0],1,[MUC16],negative
4,UniRef50_Q9H195,Cluster: Mucin-3B,MUC3B_HUMAN; H9XFA8_HUMAN; I0CMK2_HUMAN; O4342...,"[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...",4,"[I0CMK2, O43420, H9XFA8]",3,"[MUC3B, MUC3, MUC3B]",negative


(77325, 9)

Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,n_proteins,proteins_cleaned,n_proteins_cleaned,genes,label,positive_genes,positive_proteins,n_positive_proteins
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...",18,"[Q8WZ42, H0Y4J7, A0A0C4DG59, C0JYZ2, A0AAQ5BIC...",7,"[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",positive,"[TTN, TTN, TTN, TTN, TTN, TTN, TTN]","[Q8WZ42, H0Y4J7, A0A0C4DG59, C0JYZ2, A0AAQ5BIC...",7
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]",4,"[Q8WZ42, A0A0A0MRA3]",2,"[TTN, TTN]",positive,"[TTN, TTN]","[Q8WZ42, A0A0A0MRA3]",2
2,UniRef50_Q8WXI7,Cluster: Mucin-16,MUC16_HUMAN; A0AAG2UXD3_HUMAN; A0AAG2UUZ0_HUMA...,"[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...",8,"[A0AAG2UXD3, A0AAG2UXK0, A0AA34QW05, A0AAG2UUZ...",7,"[MUC16, MUC16, MUC16, MUC16, MUC16, MUC16, MUC16]",negative,[],[],7
3,UniRef50_A0AA34QVW0,"Cluster: Mucin 16, cell surface associated",A0AA34QVW0_HUMAN,[A0AA34QVW0_HUMAN],1,[A0AA34QVW0],1,[MUC16],negative,[],[],1
4,UniRef50_Q9H195,Cluster: Mucin-3B,MUC3B_HUMAN; H9XFA8_HUMAN; I0CMK2_HUMAN; O4342...,"[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...",4,"[I0CMK2, O43420, H9XFA8]",3,"[MUC3B, MUC3, MUC3B]",negative,[],[],3


(77325, 12)

In [10]:
# give sampling prob
uniref_df["logits"] = uniref_df["positive_genes"].apply(
    lambda gene_list: [positive_gene_map.get(g, 0) for g in gene_list]
)
def safe_softmax(logits):
    if len(logits) == 0:
        return []  # return empty list if no logits
    return softmax(logits).tolist()  # convert numpy array to list
uniref_df["probs"] = uniref_df["logits"].apply(safe_softmax)

uniref_df

Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,n_proteins,proteins_cleaned,n_proteins_cleaned,genes,label,positive_genes,positive_proteins,n_positive_proteins,logits,probs
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...",18,"[Q8WZ42, H0Y4J7, A0A0C4DG59, C0JYZ2, A0AAQ5BIC...",7,"[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",positive,"[TTN, TTN, TTN, TTN, TTN, TTN, TTN]","[Q8WZ42, H0Y4J7, A0A0C4DG59, C0JYZ2, A0AAQ5BIC...",7,"[2, 2, 2, 2, 2, 2, 2]","[0.14285714285714285, 0.14285714285714285, 0.1..."
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]",4,"[Q8WZ42, A0A0A0MRA3]",2,"[TTN, TTN]",positive,"[TTN, TTN]","[Q8WZ42, A0A0A0MRA3]",2,"[2, 2]","[0.5, 0.5]"
2,UniRef50_Q8WXI7,Cluster: Mucin-16,MUC16_HUMAN; A0AAG2UXD3_HUMAN; A0AAG2UUZ0_HUMA...,"[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...",8,"[A0AAG2UXD3, A0AAG2UXK0, A0AA34QW05, A0AAG2UUZ...",7,"[MUC16, MUC16, MUC16, MUC16, MUC16, MUC16, MUC16]",negative,[],[],7,[],[]
3,UniRef50_A0AA34QVW0,"Cluster: Mucin 16, cell surface associated",A0AA34QVW0_HUMAN,[A0AA34QVW0_HUMAN],1,[A0AA34QVW0],1,[MUC16],negative,[],[],1,[],[]
4,UniRef50_Q9H195,Cluster: Mucin-3B,MUC3B_HUMAN; H9XFA8_HUMAN; I0CMK2_HUMAN; O4342...,"[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...",4,"[I0CMK2, O43420, H9XFA8]",3,"[MUC3B, MUC3, MUC3B]",negative,[],[],3,[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77320,UniRef50_B2MXE1,Cluster: Anaplastic lymphoma kinase (Fragment),B2MXE1_HUMAN,[B2MXE1_HUMAN],1,[B2MXE1],1,[ALK],positive,[ALK],[B2MXE1],1,[2],[1.0]
77321,UniRef50_Q16427,Cluster: Dystrophin protein (Fragment),Q16427_HUMAN,[Q16427_HUMAN],1,[Q16427],1,[dystrophin],negative,[],[],1,[],[]
77322,UniRef50_Q16175,Cluster: Low density lipoprotein receptor gene...,Q16175_HUMAN,[Q16175_HUMAN],1,[],0,[],negative,[],[],0,[],[]
77323,UniRef50_A6QL42,Cluster: Potassium voltage-gated channel long ...,A6QL42_HUMAN,[A6QL42_HUMAN],1,[A6QL42],1,[KCND3],negative,[],[],1,[],[]


In [11]:
# Filter only positive clusters
uniref_df_pos = uniref_df[uniref_df.label == "positive"]
print(uniref_df_pos.shape)

# Print value counts of the number of positive proteins
print(uniref_df_pos["n_positive_proteins"].value_counts().sort_index())

# Show clusters with more than 307 positive proteins
uniref_df_pos[uniref_df_pos.n_positive_proteins > 6712]

(19586, 14)
n_positive_proteins
1       13274
2        3211
3        1257
4         616
5         392
        ...  
2369        1
2960        1
3402        1
5156        1
6663        1
Name: count, Length: 98, dtype: int64


Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,n_proteins,proteins_cleaned,n_proteins_cleaned,genes,label,positive_genes,positive_proteins,n_positive_proteins,logits,probs


## Positive class sampling

For each postive class, take N posive proteins with prob defined above (ie prob based on MSigDB freq)

In [12]:
import pandas as pd
import numpy as np

# Assuming uniref_df is already defined
# uniref_df_pos = uniref_df[uniref_df.label == "positive"]

# Define a placeholder for the minimum sample size
# MIN_SAMPLE_N_POSITIVE = 2 # Example value

def sample_proteins_per_cluster(df, min_sample_n, specific_probs=True, gene_col_name="positive_genes", prot_col_name="positive_proteins"):
    """
    Sample proteins and their corresponding genes per cluster using 'probs' for weighted sampling.
    
    Assumes df has columns: 'Cluster_ID', 'Cluster_Name', 
    The order of proteins and genes must be identical.
    """
    sampled_rows_data = []

    # Iterate over the rows of the DataFrame
    for index, row in df.iterrows():
        proteins_to_sample = row[prot_col_name]
        genes_to_sample = row[gene_col_name] # New line: Get the list of genes
        if specific_probs:
            probabilities = row.probs
        else:
            probabilities = None #uniform

        # Determine how many to sample
        n_to_sample = min(len(proteins_to_sample), min_sample_n)

        if n_to_sample > 0:
            # 1. Get the indices of the items to sample using np.random.choice
            #    We sample the indices (0 to N-1) using the probabilities 'p'.
            sampled_indices = np.random.choice(
                a=len(proteins_to_sample),
                size=n_to_sample,
                replace=False,
                p=probabilities
            )

            # 2. Use the sampled indices to extract the proteins and genes
            sampled_proteins = np.array(proteins_to_sample)[sampled_indices]
            sampled_genes = np.array(genes_to_sample)[sampled_indices] # Use same indices for genes

            # Append the results to the list
            sampled_rows_data.append(
                {
                    "Cluster_ID": row.Cluster_ID,
                    "Cluster_Name": row.Cluster_Name,
                    "sampled_prot": sampled_proteins,
                    "sampled_gene": sampled_genes # Added the sampled genes
                }
            )

    # Convert the list of dictionaries into a DataFrame
    sampled_df = pd.DataFrame(sampled_rows_data)
    
    return sampled_df

# Example of how to apply the function (requires uniref_df_pos and MIN_SAMPLE_N_POSITIVE to be defined)
# Apply the sampling function
positive_sampled_df = sample_proteins_per_cluster(uniref_df_pos, MIN_SAMPLE_N_POSITIVE)
positive_sampled_df["n_proteins"] = positive_sampled_df.sampled_prot.apply(len)
positive_sampled_df["n_genes"] = positive_sampled_df.sampled_gene.apply(lambda x: len(set(x)))
print(positive_sampled_df["n_proteins"].value_counts().sort_index())
print(positive_sampled_df["n_proteins"].value_counts().sort_index())
# Display result
display(positive_sampled_df)
display(positive_sampled_df.sort_values(by="n_genes", ascending = False))

n_proteins
1    19586
Name: count, dtype: int64
n_proteins
1    19586
Name: count, dtype: int64


Unnamed: 0,Cluster_ID,Cluster_Name,sampled_prot,sampled_gene,n_proteins,n_genes
0,UniRef50_Q8WZ42,Cluster: Titin,[H7C0U7],[TTN],1,1
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,[A0A0A0MRA3],[TTN],1,1
2,UniRef50_A0A1W2PS37,Cluster: Spectrin repeat containing nuclear en...,[A0A1W2PS37],[SYNE2],1,1
3,UniRef50_Q03001,Cluster: Dystonin,[E7ERX3],[DST],1,1
4,UniRef50_A0AAV6QKA6,Cluster: Versican core protein,[H0YKM8],[KLF13],1,1
...,...,...,...,...,...,...
19581,UniRef50_Q32W77,Cluster: Deleted in lung and esophageal cancer...,[Q32W77],[DLEC1],1,1
19582,UniRef50_Q0E5H0,Cluster: Rhesus blood group little e antigen (...,[Q0E5H0],[RHCE],1,1
19583,UniRef50_A0A1J0F944,Cluster: Truncated neurofibromin 1 (Fragment),[A0A1J0F944],[NF1],1,1
19584,UniRef50_H0Y9R2,Cluster: Nicotinamide nucleotide transhydrogen...,[H0Y9R2],[NNT],1,1


Unnamed: 0,Cluster_ID,Cluster_Name,sampled_prot,sampled_gene,n_proteins,n_genes
19585,UniRef50_B2MXE1,Cluster: Anaplastic lymphoma kinase (Fragment),[B2MXE1],[ALK],1,1
0,UniRef50_Q8WZ42,Cluster: Titin,[H7C0U7],[TTN],1,1
19569,UniRef50_A8YPG6,Cluster: Glycophorin B(MNS bloodgroup) (Fragment),[A8YPG6],[GYPB],1,1
19568,UniRef50_B2MXE5,Cluster: Anaplastic lymphoma kinase (Fragment),[B2MXE5],[ALK],1,1
19567,UniRef50_Q14461,Cluster: Glycophorin B (Fragment),[Q14461],[GYPB],1,1
...,...,...,...,...,...,...
6,UniRef50_A0A8B9D461,Cluster: Dystonin,[F8W9J4],[DST],1,1
5,UniRef50_A0A9V1G0U9,Cluster: Dystonin,[A0A7P0T890],[DST],1,1
4,UniRef50_A0AAV6QKA6,Cluster: Versican core protein,[H0YKM8],[KLF13],1,1
3,UniRef50_Q03001,Cluster: Dystonin,[E7ERX3],[DST],1,1


## Negative class sampling

take 3N proteins form each cluster:
- sample unformly from each cluster
- avoid positive --> jusr remove them from df
- avoid ambigous --> just remove if present df

In [23]:
import numpy as np
import pandas as pd
# Assuming positive_genes, ambiguos_genes, etc., are defined.

# Sample randomly (No changes needed here)
negative_sampled_df = sample_proteins_per_cluster(
                                                uniref_df, # ATTENTION: use all df
                                                NEGATIVE_CLASS_MULT * MIN_SAMPLE_N_POSITIVE,
                                                specific_probs=False, 
                                                gene_col_name="genes", prot_col_name="proteins_cleaned"
                                                ) 

def filter_sampled_data(row, genes_to_remove):
    """Filters out proteins and their corresponding genes if the gene is in the exclusion set."""

    # Ensures data is treated as arrays
    proteins = np.array(row.sampled_prot)
    genes = np.array(row.sampled_gene)

    # Create a boolean mask: True if the gene is NOT in the set of genes to remove
    keep_mask = np.array([gene not in genes_to_remove for gene in genes])

    # Apply the mask
    filtered_proteins = proteins[keep_mask]
    filtered_genes = genes[keep_mask]
    
    # CRUCIAL CHANGE 1: Ensure the dictionary is properly closed and returned as a Series
    return pd.Series({
        "sampled_prot": filtered_proteins,
        "sampled_gene": filtered_genes
    })

# 1. Combine all genes to remove into one efficient set 🚀
genes_to_remove_combined = set(positive_genes).union(set(ambiguos_genes))

# 2. Apply the filter function and correctly assign the updated columns back ✨
# The result of apply(axis=1) is a temporary DataFrame containing the two filtered columns.
updated_cols = negative_sampled_df.apply(
    lambda x: filter_sampled_data(x, genes_to_remove_combined), 
    axis=1
)

# CRUCIAL CHANGE 2: Use multi-column assignment to update the original DataFrame
negative_sampled_df[['sampled_prot', 'sampled_gene']] = updated_cols[['sampled_prot', 'sampled_gene']]

# Recalculate counts (No changes needed here)
negative_sampled_df["n_proteins"] = negative_sampled_df.sampled_prot.apply(len)
negative_sampled_df["n_genes"] = negative_sampled_df.sampled_gene.apply(lambda x: len(set(x)))
print(negative_sampled_df["n_proteins"].value_counts().sort_index())
print(negative_sampled_df["n_genes"].value_counts().sort_index())

# remove empy clsuters afte filetring
negative_sampled_df = negative_sampled_df[(negative_sampled_df.n_proteins != 0)]

# Display result
display(negative_sampled_df)
display(negative_sampled_df.sort_values(by="n_genes", ascending = False))

n_proteins
0    30890
1    22725
2     8002
Name: count, dtype: int64
n_genes
0    30890
1    29529
2     1198
Name: count, dtype: int64


Unnamed: 0,Cluster_ID,Cluster_Name,sampled_prot,sampled_gene,n_proteins,n_genes
2,UniRef50_Q8WXI7,Cluster: Mucin-16,"[A0AA34QW05, A0AAA9YHI4]","[MUC16, MUC16]",2,1
3,UniRef50_A0AA34QVW0,"Cluster: Mucin 16, cell surface associated",[A0AA34QVW0],[MUC16],1,1
4,UniRef50_Q9H195,Cluster: Mucin-3B,"[H9XFA8, I0CMK2]","[MUC3B, MUC3B]",2,1
6,UniRef50_Q5VST9-7,Cluster: Isoform 7 of Obscurin,"[Q5VST9, A6NGQ3]","[OBSCN, OBSCN]",2,1
10,UniRef50_Q5VST9,Cluster: Obscurin,"[A0ABB0LN81, A0ABB0H0G2]","[OBSCN, OBSCN]",2,1
...,...,...,...,...,...,...
61611,UniRef50_Q16234,Cluster: HuD protein,[Q16234],[HuD],1,1
61612,UniRef50_V9H0A7,Cluster: Troponin T protein (Fragment),[V9H0A7],[troponin T],1,1
61614,UniRef50_Q16427,Cluster: Dystrophin protein (Fragment),[Q16427],[dystrophin],1,1
61615,UniRef50_A6QL42,Cluster: Potassium voltage-gated channel long ...,[A6QL42],[KCND3],1,1


Unnamed: 0,Cluster_ID,Cluster_Name,sampled_prot,sampled_gene,n_proteins,n_genes
20653,UniRef50_Q9UGM5,Cluster: Fetuin-B,"[Q5J875, Q9UGM5]","[GUGU, FETUB]",2,2
12445,UniRef50_A0A8C5VTV6,Cluster: GATOR complex protein NPRL3,"[B7Z6Q0, Q4TT55]","[NPRL3, C16orf35]",2,2
19190,UniRef50_A0AAN7P6Y4,Cluster: Ferric oxidoreductase domain-containi...,"[A4D137, C9JL51]","[MGC87042, STEAP1B]",2,2
10890,UniRef50_B9A6J9,Cluster: TBC1 domain family member 3L,"[A0A0G2JLM6, A0A0G2JPY0]","[TBC1D3B, TBC1D3K]",2,2
17539,UniRef50_Q7Z5V6,Cluster: Stabilizer of axonemal microtubules 4,"[Q7Z5V6, G3F4G3]","[SAXO4, C11orf66]",2,2
...,...,...,...,...,...,...
20589,UniRef50_P36382,Cluster: Gap junction alpha-5 protein,"[X5D2H9, A0A0B4J1Y3]","[GJA5, GJA5]",2,1
20586,UniRef50_O94800,Cluster: HRIHFB2060 protein (Fragment),[O94800],[HRIHFB2060],1,1
20585,UniRef50_Q9BYZ7,Cluster: Transporter,[Q9BYZ7],[SLC6A18],1,1
20582,UniRef50_P42331-2,Cluster: Isoform 2 of Rho GTPase-activating pr...,[P42331],[ARHGAP25],1,1


In [24]:
# lttle check
print("SYNE2" in genes_to_remove_combined)

uniref_df[uniref_df.Cluster_ID == "UniRef50_A0A1W2PS37"]

True


Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,n_proteins,proteins_cleaned,n_proteins_cleaned,genes,label,positive_genes,positive_proteins,n_positive_proteins,logits,probs
5,UniRef50_A0A1W2PS37,Cluster: Spectrin repeat containing nuclear en...,A0A1W2PS37_HUMAN,[A0A1W2PS37_HUMAN],1,[A0A1W2PS37],1,[SYNE2],positive,[SYNE2],[A0A1W2PS37],1,[4],[1.0]


## Make dataset

protein name | label | seq

In [25]:
positive_sampled_df["label"] = 1
negative_sampled_df["label"] = 0

# merge
dataset_df = pd.concat([positive_sampled_df, negative_sampled_df])
display(dataset_df)

# logn df
long_df = dataset_df.explode("sampled_prot")
long_df = long_df.explode("sampled_gene")
long_df = long_df[["sampled_prot", "sampled_gene", "label"]].rename(
    columns={"sampled_prot": "protein", "sampled_gene": "gene"}
)# Rename columns
long_df = long_df.rename(columns={"sampled_prot": "protein", "sampled_gene": "gene"})
long_df = long_df.reset_index(drop=True)

display(long_df)

print(long_df.shape)
print(long_df['label'].value_counts())

# STRATIFIED spit
# Stratified split: 80% train, 10% val, 10% test
# First split: 80% train, 20% temp
train_idx, temp_idx = train_test_split(
    long_df.index,
    test_size=0.2,
    stratify=long_df['label'],
    random_state=42
)

# Second split: temp into val and test (50%-50% of temp)
temp_labels = long_df.loc[temp_idx, 'label']
val_idx, test_idx = train_test_split(
    temp_idx,
    test_size=0.5,
    stratify=temp_labels,
    random_state=42
)

# Assign splits
long_df['set'] = ''
long_df.loc[train_idx, 'set'] = 'train'
long_df.loc[val_idx, 'set'] = 'val'
long_df.loc[test_idx, 'set'] = 'test'

print(long_df['set'].value_counts())
display(long_df)

Unnamed: 0,Cluster_ID,Cluster_Name,sampled_prot,sampled_gene,n_proteins,n_genes,label
0,UniRef50_Q8WZ42,Cluster: Titin,[H7C0U7],[TTN],1,1,1
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,[A0A0A0MRA3],[TTN],1,1,1
2,UniRef50_A0A1W2PS37,Cluster: Spectrin repeat containing nuclear en...,[A0A1W2PS37],[SYNE2],1,1,1
3,UniRef50_Q03001,Cluster: Dystonin,[E7ERX3],[DST],1,1,1
4,UniRef50_A0AAV6QKA6,Cluster: Versican core protein,[H0YKM8],[KLF13],1,1,1
...,...,...,...,...,...,...,...
61611,UniRef50_Q16234,Cluster: HuD protein,[Q16234],[HuD],1,1,0
61612,UniRef50_V9H0A7,Cluster: Troponin T protein (Fragment),[V9H0A7],[troponin T],1,1,0
61614,UniRef50_Q16427,Cluster: Dystrophin protein (Fragment),[Q16427],[dystrophin],1,1,0
61615,UniRef50_A6QL42,Cluster: Potassium voltage-gated channel long ...,[A6QL42],[KCND3],1,1,0


Unnamed: 0,protein,gene,label
0,H7C0U7,TTN,1
1,A0A0A0MRA3,TTN,1
2,A0A1W2PS37,SYNE2,1
3,E7ERX3,DST,1
4,H0YKM8,KLF13,1
...,...,...,...
74314,Q16234,HuD,0
74315,V9H0A7,troponin T,0
74316,Q16427,dystrophin,0
74317,A6QL42,KCND3,0


(74319, 3)
label
0    54733
1    19586
Name: count, dtype: int64
set
train    59455
test      7432
val       7432
Name: count, dtype: int64


Unnamed: 0,protein,gene,label,set
0,H7C0U7,TTN,1,train
1,A0A0A0MRA3,TTN,1,train
2,A0A1W2PS37,SYNE2,1,test
3,E7ERX3,DST,1,train
4,H0YKM8,KLF13,1,train
...,...,...,...,...
74314,Q16234,HuD,0,train
74315,V9H0A7,troponin T,0,val
74316,Q16427,dystrophin,0,val
74317,A6QL42,KCND3,0,train


## Add Seqeunce 

In [None]:
def fetch_sequence(uniprot_id, verbose=False):
    """Fetch protein sequence from UniProt in FASTA format."""
    try:
        if verbose:
            print(f"Fetching sequence for {uniprot_id}...")
        url = f"https://www.uniprot.org/uniprot/{uniprot_id}.fasta"
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        sequence = ''.join(response.text.splitlines()[1:])
        return sequence
    except RequestException as e:
        if verbose:
            print(f"Request failed for {uniprot_id}: {e}")
        return None
    


In [None]:
import requests
from concurrent.futures import ThreadPoolExecutor

result_df = add_sequences_to_dataframe(
    long_df=long_df, 
    id_column="protein", 
    max_workers=10
)

Starting concurrent sequence fetching for 55059 unique proteins using 10 threads...
Error fetching Q86T18: Request failed: 404 Client Error: Not Found for url: https://rest.uniprot.org/uniprotkb/accessions/Q86T18
Error fetching A0A1W2PS37: Request failed: 404 Client Error: Not Found for url: https://rest.uniprot.org/uniprotkb/accessions/A0A1W2PS37
Error fetching H0YKM8: Request failed: 404 Client Error: Not Found for url: https://rest.uniprot.org/uniprotkb/accessions/H0YKM8
Error fetching H7C0U7: Request failed: 404 Client Error: Not Found for url: https://rest.uniprot.org/uniprotkb/accessions/H7C0U7
Error fetching A0A7P0T890: Request failed: 404 Client Error: Not Found for url: https://rest.uniprot.org/uniprotkb/accessions/A0A7P0T890
Error fetching A0A0A0MRA3: Request failed: 404 Client Error: Not Found for url: https://rest.uniprot.org/uniprotkb/accessions/A0A0A0MRA3
Error fetching F8W9J4: Request failed: 404 Client Error: Not Found for url: https://rest.uniprot.org/uniprotkb/accessi

## Save

In [None]:
df_long.to_csv(FINAL_DATASET_PATH)