# Dataset Creation

## Aim
Create a dataset for **future analysis**.

---

## Necessary Downloads
Before starting, be sure to have downloaded the following material.

### 1. MSigDB Gene Sets (Human)
Download the entire MSigDB for **Human** in **JSON** format (`Human Gene Set JSON file set (ZIPped)`).
- **Link**: [https://www.gsea-msigdb.org/gsea/downloads.jsp](https://www.gsea-msigdb.org/gsea/downloads.jsp)

### 2. UniProt Human Proteome
Download the list of all **HUMAN proteins** (UniProt human proteome - **UP000005640**).
- **Link (Website)**: [https://www.uniprot.org/proteomes/UP000005640](https://www.uniprot.org/proteomes/UP000005640)
- **Programmatic Download (Terminal)**:

```bash
wget -O human_proteome.tsv.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession,reviewed,id,protein_name,gene_names,organism_name,sequence&format=tsv&query=(proteome:UP000005640)"
gunzip human_proteome.tsv.gz
```

### 3. UniRef50 Clusters

#### A. Initial Programmatic Download (Partial)
This method retrieves all clusters with at least one human protein, but is a **subsample** of the full UniRef50.
- **Link (Website)**: [https://www.uniprot.org/uniref?query=%28identity%3A0.5%29+AND+%28taxonomy_id%3A9606%29](https://www.uniprot.org/uniref?query=%28identity%3A0.5%29+AND+%28taxonomy_id%3A9606%29)
- **Programmatic Download (Terminal)**:
    - **ATTENTION**: This call retrieves all clusters with at least 1 human protein, *not* just the human proteins themselves.

```bash
curl -o uniref50_human.tsv.gz "https://rest.uniprot.org/uniref/stream?compressed=true&fields=id,name,organism,length,identity,count,members&format=tsv&query=((identity:0.5)+AND+(taxonomy_id:9606))"
gunzip uniref50_human.tsv.gz
```

#### B. Full UniRef50 Download (Required for Complete Data)
Since the above options only download a subsample, the entire UniRef50 ($\sim$230GB) must be downloaded using the XML format.

- **Download Full XML UniRef50**:

```bash
wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.xml.gz
gunzip uniref50.xml.gz
```

### 4. Subsample Human Clusters from UniRef50 (Processing)
The full UniRef50 XML file is too large for routine handling. The next steps process the file to create a dataset containing **only clusters with human proteins** and **only the human proteins themselves**. This requires a dedicated C++ script.

- **Dependencies**:

```bash
## Download dependencies
sudo apt-get install libexpat1-dev
```

- **C++ Script and Compilation**:
  - The necessary C++ file (`uniref_extractor.cpp`) is assumed to be in the utils folder (e.g., `/home/gdallagl/myworkdir/ESMSec/utils/fast_uniref_extractor.cpp`).
  - **Compile file**:

```bash
g++ -O3 -march=native -o extract_human uniref_extractor.cpp -lexpat
```

- **Run Script**:
  - This script processes the full XML file to create a CSV with the desired human clusters/proteins.

```bash
./extract_human uniref50.xml fast_c++_human_clusters.csv
```
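
The C++ source is not reproduced in this document. For small test files, or to sanity-check the extractor's output, the same filtering idea can be sketched in Python with a streaming parser. This is only an illustration under assumptions: the XML namespace `http://uniprot.org/uniref`, the `NCBI taxonomy` property on each `dbReference`, and the function name `extract_human_clusters` are not taken from the original script. The output columns simply mirror the ones read later in the notebook.

```python
import csv
import xml.etree.ElementTree as ET

NS = "{http://uniprot.org/uniref}"  # assumed UniRef XML namespace


def extract_human_clusters(xml_source, csv_out, taxon_id="9606"):
    """Stream UniRef entries and write one CSV row per cluster with human members.

    xml_source: path or binary file object; csv_out: text file object.
    Returns the number of clusters written.
    """
    writer = csv.writer(csv_out)
    writer.writerow(["Cluster_ID", "Cluster_Name", "Human_Proteins"])
    n_written = 0

    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element

    for event, elem in context:
        if event != "end" or elem.tag != NS + "entry":
            continue
        human = []
        # Representative and ordinary members both carry a dbReference child.
        for ref in elem.iter(NS + "dbReference"):
            props = {p.get("type"): p.get("value") for p in ref.findall(NS + "property")}
            if props.get("NCBI taxonomy") == taxon_id:
                human.append(ref.get("id"))
        if human:
            name = elem.findtext(NS + "name", default="")
            writer.writerow([elem.get("id"), name, "; ".join(human)])
            n_written += 1
        root.clear()  # drop finished entries so memory stays bounded
    return n_written
```

A pure-Python pass over the full file is likely to be much slower than the compiled extractor, so this sketch is better suited to spot checks on a truncated copy of the XML.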

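### 5. UniProt Gene-Protein Mapping

Download the UniProt ID mapping file for human and decompress it (the notebook reads the uncompressed `.dat` file): [UniProt protein-gene mapping](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping.dat.gz)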

<!-- ### 5. BioMart Gene-Protein mapping

Download the BioMart gene-protein mapping; be sure to include **Gene Name** and **Transcript Name**.
[link](https://www.ensembl.org/biomart/martview/3aaaf734b93facdfad8207234204cc31) -->


## Hyperparameters

In [1]:
import json
import os
import re
import pandas as pd
import numpy as np
from scipy.special import softmax
from sklearn.model_selection import train_test_split

import utils.dataset_functions as dataf

# Directory containing MSigDB JSON files
JSON_DIR = "/home/gdallagl/myworkdir/data/MSigDB/msigdb_v2025.1.Hs_json_files_to_download_locally"

# Guaranteed genes list
GUARANTEED_GENES_PATH = "/home/gdallagl/myworkdir/data/MSigDB/julies_cycling_signatures_cancer.tsv"

# Updated keywords pattern with word boundaries to avoid false matches
KEYWORDS_PATTERN = r"(?:PROLIFERA|\bPROLIFER\b|_PROLIFER_|_CYCLING_|^CELL_CYCLE_|_CELL_CYCLE_|_CC_|_G1_|_S_PHASE_|_G2_|_M_PHASE_|\bMITOSIS\b|\bCYCLIN\b|\bCDK\b|\bCHECKPOINT\b|\bGS1\b|\bGS2\b)"

# Exclusion pattern
EXCLUSION_PATTERN = r"(?:MEIOSIS|FATTY_ACID_CYCLING_MODEL)"

# Human proteome path
HUMAN_PROTEOME_PATH = "/home/gdallagl/myworkdir/data/UniRef50/human_proteome.tsv"

# UniRef path
UNIREF_PATH = "/home/gdallagl/myworkdir/ESMSec/data/UniRef50/fast_c++_human_clusters.csv"

# Protein-gene mapping path
MAPPING_PATH = "/home/gdallagl/myworkdir/ESMSec/data/UniRef50/HUMAN_9606_idmapping.dat"

# Minimum frequency threshold for filtering ambiguous genes
MIN_FREQ_AMBIGOUS = 1

# Number of positive samples to draw per positive cluster (fewer if the cluster is smaller)
MIN_SAMPLE_N_POSITIVE = 1

# Negative samples per cluster, as a multiple of MIN_SAMPLE_N_POSITIVE
NEGATIVE_CLASS_MULT = 2

# Path for saving the final dataset CSV
FINAL_DATASET_PATH = f"/home/gdallagl/myworkdir/ESMSec/data/cell_cycle/cell_cycle_dataset_{MIN_SAMPLE_N_POSITIVE}:{NEGATIVE_CLASS_MULT}.csv"


# Autoreload
%load_ext autoreload
%autoreload 2

## Select pathways related to cell cycle

Load the MSigDB gene sets and select all pathways that match keywords related to the field of interest (here, the cell cycle).
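
The helper `dataf.filter_gene_sets_by_keywords` lives in the project's utils and is not shown here. As a rough sketch of how such a filter could work, assuming it matches the regular expressions against the `set_name` column (an assumption, as are the function name `filter_sets_by_name` and the shortened patterns below):

```python
import pandas as pd


def filter_sets_by_name(df, include_pattern, exclude_pattern, name_col="set_name"):
    """Keep rows whose name matches include_pattern but not exclude_pattern."""
    names = df[name_col].astype(str)
    keep = names.str.contains(include_pattern, regex=True) & ~names.str.contains(
        exclude_pattern, regex=True
    )
    return df[keep].reset_index(drop=True)


# Shortened, illustrative patterns (the notebook's constants are longer).
include = r"(?:^CELL_CYCLE_|_CELL_CYCLE_|\bMITOSIS\b)"
exclude = r"(?:MEIOSIS)"

# Toy names; only the first one is taken from the output shown in this notebook.
toy = pd.DataFrame({"set_name": [
    "KEGG_MEDICUS_REFERENCE_WEE1_CELL_CYCLE_G2_M",  # kept: contains _CELL_CYCLE_
    "EXAMPLE_MEIOSIS_CELL_CYCLE_PROCESS",           # dropped: hits the exclusion pattern
    "EXAMPLE_MITOSIS",                              # dropped: "_" is a word character, so \b does not match
    "MITOSIS",                                      # kept: \b matches at the string edges
]})

kept = filter_sets_by_name(toy, include, exclude)["set_name"].tolist()
print(kept)
```

The toy data makes one regex detail visible: because `_` counts as a word character, terms written as `\bWORD\b` only match when the keyword is not attached to an underscore. Whether that is the intended behaviour for the `\b...\b` terms in `KEYWORDS_PATTERN` may be worth checking against the list of selected sets.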

In [2]:
### 1) Load the JSON files into a DataFrame
df_genesets = dataf.load_json_folder_to_df(JSON_DIR)
display(df_genesets.head(2)); print(df_genesets.shape)

### 2) Keep only the gene sets related to the field of interest
df_filtered = dataf.filter_gene_sets_by_keywords(df_genesets, KEYWORDS_PATTERN, EXCLUSION_PATTERN)
display(df_filtered.head(2)); print(df_filtered.shape)

list(df_filtered['set_name'])

Unnamed: 0,set_name,collection,systematicName,pmid,exactSource,externalDetailsURL,msigdbURL,geneSymbols,filteredBySimilarity,externalNamesForSimilarTerms,source_file
0,MIR153_5P,C3:MIR:MIRDB,M30412,31504780,,http://mirdb.org/cgi-bin/mature_mir.cgi?name=h...,https://www.gsea-msigdb.org/gsea/msigdb/human/...,"[A1CF, AAK1, AASDHPPT, ABCE1, ABHD2, ABI2, ACB...",[],[],c3.mir.mirdb.v2025.1.Hs.json
1,MIR8485,C3:MIR:MIRDB,M30413,31504780,,http://mirdb.org/cgi-bin/mature_mir.cgi?name=h...,https://www.gsea-msigdb.org/gsea/msigdb/human/...,"[AAK1, ABHD18, ABL2, ABLIM1, ACVR1, ACVR2B, AC...",[],[],c3.mir.mirdb.v2025.1.Hs.json


(122192, 11)


Unnamed: 0,set_name,collection,systematicName,pmid,exactSource,externalDetailsURL,msigdbURL,geneSymbols,filteredBySimilarity,externalNamesForSimilarTerms,source_file
12226,KEGG_MEDICUS_PATHOGEN_KSHV_VCYCLIN_TO_CELL_CYC...,C2:CP:KEGG_MEDICUS,M47461,,N00168,https://www.kegg.jp/entry/N00168,https://www.gsea-msigdb.org/gsea/msigdb/human/...,"[CDK4, CDK6, E2F1, E2F2, E2F3, RB1]",[],[],c2.cp.kegg_medicus.v2025.1.Hs.json
12229,KEGG_MEDICUS_PATHOGEN_HTLV_1_TAX_TO_P21_CELL_C...,C2:CP:KEGG_MEDICUS,M47585,,N00498,https://www.kegg.jp/entry/N00498,https://www.gsea-msigdb.org/gsea/msigdb/human/...,"[CCNE1, CCNE2, CDK2, CDKN1A, E2F1, E2F2, E2F3,...",[],[],c2.cp.kegg_medicus.v2025.1.Hs.json


(1343, 11)


['KEGG_MEDICUS_PATHOGEN_KSHV_VCYCLIN_TO_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_PATHOGEN_HTLV_1_TAX_TO_P21_CELL_CYCLE_G1_S_N00498',
 'KEGG_MEDICUS_REFERENCE_MDM2_P21_CELL_CYCLE_G1_S_N00536',
 'KEGG_MEDICUS_VARIANT_AMPLIFIED_MDM2_TO_P21_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_VARIANT_AMPLIFIED_MYC_TO_P27_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_VARIANT_AMPLIFIED_MYC_TO_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_REFERENCE_P300_P21_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_REFERENCE_CDC25_CELL_CYCLE_G2_M',
 'KEGG_MEDICUS_REFERENCE_ATR_P21_CELL_CYCLE_G2_M',
 'KEGG_MEDICUS_REFERENCE_WEE1_CELL_CYCLE_G2_M',
 'KEGG_MEDICUS_PATHOGEN_HPV_E7_TO_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_VARIANT_AMPLIFIED_CCND1_TO_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_PATHOGEN_EBV_EBNA3C_TO_CELL_CYCLE_G1_S_N00483',
 'KEGG_MEDICUS_REFERENCE_P27_CELL_CYCLE_G1_S',
 'KEGG_MEDICUS_PATHOGEN_EBV_EBNA3C_TO_P27_CELL_CYCLE_G1_S_N00264',
 'KEGG_MEDICUS_PATHOGEN_EBV_EBNA3C_TO_CELL_CYCLE_G1_S_N00484',
 'KEGG_MEDICUS_PATHOGEN_HTLV_1_TAX_TO_P21_CELL_CYCLE_G1_S_N00497',
 'KEGG_MEDICUS

### Count in how many genesets each gene is present

These counts are used later to compute sampling probabilities.

In [3]:
gene_counts_df = dataf.gene_set_counts(df_filtered)
display(gene_counts_df)

Unnamed: 0,gene,geneset_count
0,E2F1,75
1,CCNB1,70
2,CDK1,70
3,CDKN1A,69
4,TP53,64
...,...,...
9091,ATG4B,1
9092,ARVCF,1
9093,ARRDC1,1
9094,ARHGEF7,1
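
`dataf.gene_set_counts` is a project helper whose implementation is not shown. A minimal sketch of one way to obtain such a table, assuming `geneSymbols` holds a list of gene symbols per gene set (the function name `count_gene_sets` and the toy data are illustrative only):

```python
import pandas as pd


def count_gene_sets(df, genes_col="geneSymbols"):
    """Return a DataFrame with one row per gene and the number of sets containing it."""
    # De-duplicate within each set so a gene counts at most once per gene set.
    unique_per_set = df[genes_col].apply(lambda genes: list(dict.fromkeys(genes)))
    # One row per (set, gene) pair, then count how often each gene appears.
    counts = unique_per_set.explode().dropna().value_counts()
    return counts.rename_axis("gene").reset_index(name="geneset_count")


toy = pd.DataFrame({
    "geneSymbols": [["CDK4", "E2F1"], ["E2F1"], ["E2F1", "RB1", "RB1"]],
})
counts = count_gene_sets(toy)
print(counts)
```

In the toy example `E2F1` appears in three sets, while `CDK4` and `RB1` appear in one each; the repeated `RB1` inside the last set is counted once.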


### Label genes that appear in too few gene sets

In [4]:
# Label genes whose gene set count exceeds the threshold as 'positive'; all others as 'ambigous'
gene_counts_df["label"] = gene_counts_df.geneset_count.apply(lambda x: 'positive' if x > MIN_FREQ_AMBIGOUS else 'ambigous')

# Flag for guaranteed genes (set to True for those genes in a later step)
gene_counts_df["is_guaranteed"] = False

display(gene_counts_df)
print(gene_counts_df.label.value_counts())

Unnamed: 0,gene,geneset_count,label,is_guaranteed
0,E2F1,75,positive,False
1,CCNB1,70,positive,False
2,CDK1,70,positive,False
3,CDKN1A,69,positive,False
4,TP53,64,positive,False
...,...,...,...,...
9091,ATG4B,1,ambigous,False
9092,ARVCF,1,ambigous,False
9093,ARRDC1,1,ambigous,False
9094,ARHGEF7,1,ambigous,False


label
positive    5338
ambigous    3758
Name: count, dtype: int64


### Add Guaranteed genes

Add genes that must be present because they are known to be related to the field of interest.

They are read from a TSV file and added with the maximum gene set count.

In [5]:
### 1) Read the guaranteed-genes TSV file
garanted_genes_df = pd.read_csv(GUARANTEED_GENES_PATH, sep='\t')
display(garanted_genes_df.head(5))

### 2) Extract the individual gene names
all_values = garanted_genes_df.to_numpy().flatten().tolist()
all_values = [x for x in all_values if pd.notna(x)]  # remove NaN
all_values = list(set(all_values))  # remove duplicates
print("Number of guaranted genes: ", len(all_values))

### 3) Add them to the previous DataFrame
# create a DataFrame for new genes
new_genes_df = pd.DataFrame({
    'gene': all_values,
    'geneset_count': max(gene_counts_df.geneset_count),  # use the max count, since these genes are guaranteed
    'label': "positive",
    "is_guaranteed": True
})

# Append the guaranteed genes to gene_counts_df
gene_frequency_df = pd.concat([gene_counts_df, new_genes_df], ignore_index=True)

# ATTENTION: drop duplicates, keeping the last occurrence (i.e., the row from new_genes_df), so guaranteed genes get the max count
gene_frequency_df = gene_frequency_df.drop_duplicates(subset='gene', keep='last')

# Sort and reset index
gene_frequency_df.sort_values(by=['geneset_count', 'gene'], ascending=[False, True], inplace=True)
gene_frequency_df.reset_index(drop=True, inplace=True)

# Create sets of ambiguous and positive genes
ambiguos_genes = set(gene_frequency_df[gene_frequency_df.label == "ambigous"].gene)
positive_genes = set(gene_frequency_df[gene_frequency_df.label == "positive"].gene)

display(gene_frequency_df)

print(gene_frequency_df.is_guaranteed.value_counts())
print(gene_frequency_df.gene.nunique())

Unnamed: 0,GBM_G1S,GBM_G2M,H3_K27M_CC,IDH_O_G1S,IDH_O_G2M,Melanoma_G1S,Melanoma_G2M
0,RRM2,CCNB1,UBE2T,MCM5,HMGB2,MCM5,HMGB2
1,PCNA,CDC20,HMGB2,PCNA,CDK1,PCNA,CDK1
2,KIAA0101,CCNB2,TYMS,TYMS,NUSAP1,TYMS,NUSAP1
3,HIST1H4C,PLK1,MAD2L1,FEN1,UBE2C,FEN1,UBE2C
4,MLF1IP,CCNA2,CDK1,MCM2,BIRC5,MCM2,BIRC5


Number of guaranted genes:  146


Unnamed: 0,gene,geneset_count,label,is_guaranteed
0,ANLN,75,positive,True
1,ANP32E,75,positive,True
2,ARHGAP11A,75,positive,True
3,ARL6IP1,75,positive,True
4,ASF1B,75,positive,True
...,...,...,...,...
9098,ZSCAN20,1,ambigous,False
9099,ZSCAN22,1,ambigous,False
9100,ZSCAN9,1,ambigous,False
9101,ZSWIM4,1,ambigous,False


is_guaranteed
False    8957
True      146
Name: count, dtype: int64
9103


### Create the mapping from positive gene to gene set frequency

In [6]:
# Build the positive gene -> frequency mapping
# Filter only positive genes
positive_genes_df = gene_frequency_df[gene_frequency_df["label"] == "positive"]
# Create a mapping: gene -> frequency
positive_gene_freq_map = dict(zip(positive_genes_df["gene"], positive_genes_df["geneset_count"]))
positive_gene_freq_map

{'ANLN': 75,
 'ANP32E': 75,
 'ARHGAP11A': 75,
 'ARL6IP1': 75,
 'ASF1B': 75,
 'ATAD2': 75,
 'AURKA': 75,
 'AURKB': 75,
 'BIRC5': 75,
 'BLM': 75,
 'BRIP1': 75,
 'BUB1': 75,
 'BUB1B': 75,
 'CASP8AP2': 75,
 'CBX5': 75,
 'CCNA2': 75,
 'CCNB1': 75,
 'CCNB2': 75,
 'CCNE2': 75,
 'CDC20': 75,
 'CDC25B': 75,
 'CDC25C': 75,
 'CDC45': 75,
 'CDC6': 75,
 'CDCA2': 75,
 'CDCA3': 75,
 'CDCA5': 75,
 'CDCA7': 75,
 'CDCA8': 75,
 'CDK1': 75,
 'CENPA': 75,
 'CENPE': 75,
 'CENPF': 75,
 'CENPK': 75,
 'CENPM': 75,
 'CHAF1B': 75,
 'CKAP2': 75,
 'CKAP2L': 75,
 'CKAP5': 75,
 'CKS1B': 75,
 'CKS2': 75,
 'CLSPN': 75,
 'CTCF': 75,
 'DLGAP5': 75,
 'DSCC1': 75,
 'DTL': 75,
 'DUT': 75,
 'E2F1': 75,
 'E2F8': 75,
 'ECT2': 75,
 'EXO1': 75,
 'FAM64A': 75,
 'FANCI': 75,
 'FEN1': 75,
 'FOXM1': 75,
 'G2E3': 75,
 'GAS2L3': 75,
 'GINS2': 75,
 'GMNN': 75,
 'GPSM2': 75,
 'GTSE1': 75,
 'H2AFZ': 75,
 'HELLS': 75,
 'HIST1H4C': 75,
 'HJURP': 75,
 'HMGB2': 75,
 'HMGB3': 75,
 'HMGN2': 75,
 'HMMR': 75,
 'HN1': 75,
 'KIAA0101': 75,
 'KIF1

## Protein-Gene mapping

MSigDB lists genes, whereas UniRef50 lists proteins, so a protein-to-gene mapping is needed.

In [7]:
# Read the mapping file
df = pd.read_csv(MAPPING_PATH, sep='\t', header=None, names=['UniProtKB_Accession', 'ID_Type', 'External_ID'])
display(df)
print("Unique ID types:", df['ID_Type'].unique())

# Select only gene names
gene_df = df[df['ID_Type'] == 'Gene_Name'].copy()
display(gene_df)

# make mapping dict
protein_to_gene_map = gene_df.set_index('UniProtKB_Accession')['External_ID'].dropna().to_dict()

# All UniProtKB accessions that have a gene name
all_uniprot_prioteins_name = set(gene_df["UniProtKB_Accession"].unique())



# Quick check
for i, (protein, gene) in enumerate(protein_to_gene_map.items()):
    if i >= 5:
        break
    print(protein, "->", gene)

protein_to_gene_map["Q8WZ42"]  # TITIN (another accession to try: A0AAQ5BIC8)

Unnamed: 0,UniProtKB_Accession,ID_Type,External_ID
0,P31946,UniProtKB-ID,1433B_HUMAN
1,P31946,Gene_Name,YWHAB
2,P31946,GI,4507949
3,P31946,GI,377656702
4,P31946,GI,67464628
...,...,...,...
5375538,A0A411NQ82,EMBL,MK034717
5375539,A0A411NQ82,EMBL-CDS,QBF54154.1
5375540,A0A411NQ82,NCBI_TaxID,9606
5375541,A0A411NQ82,OrthoDB,9613897at2759


Unique ID types: ['UniProtKB-ID' 'Gene_Name' 'GI' 'UniRef100' 'UniRef90' 'UniRef50'
 'UniParc' 'EMBL' 'EMBL-CDS' 'NCBI_TaxID' 'CCDS' 'RefSeq' 'RefSeq_NT'
 'PDB' 'EMDB' 'BioGRID' 'DIP' 'MINT' 'STRING' 'ChEMBL' 'DrugBank'
 'BioMuta' 'DMDM' 'CPTAC' 'ProteomicsDB' 'DNASU' 'Ensembl' 'Ensembl_TRS'
 'Ensembl_PRO' 'GeneID' 'KEGG' 'GeneCards' 'HGNC' 'MIM' 'neXtProt'
 'OpenTargets' 'PharmGKB' 'VEuPathDB' 'eggNOG' 'GeneTree' 'HOGENOM' 'OMA'
 'OrthoDB' 'TreeFam' 'Reactome' 'ChiTaRS' 'GeneWiki' 'GenomeRNAi' 'IDEAL'
 'CRC64' 'TCDB' 'UCSC' 'Orphanet' 'Gene_Synonym' 'ComplexPortal'
 'GeneReviews' 'BioCyc' 'SwissLipids' 'UniPathway' 'Gene_ORFName'
 'DisProt' 'GuidetoPHARMACOLOGY' 'GlyConnect' 'MEROPS' 'ESTHER'
 'Allergome' 'PeroxiBase' 'REBASE' 'PATRIC']


Unnamed: 0,UniProtKB_Accession,ID_Type,External_ID
1,P31946,Gene_Name,YWHAB
120,P62258,Gene_Name,YWHAE
251,Q04917,Gene_Name,YWHAH
342,P61981,Gene_Name,YWHAG
470,P31947,Gene_Name,SFN
...,...,...,...
5375347,Q8TB44,Gene_Name,MTSS1L
5375359,Q9UJU1,Gene_Name,VIL2
5375411,A5HC06,Gene_Name,KRAS
5375520,V5LL19,Gene_Name,HLA-C


P31946 -> YWHAB
P62258 -> YWHAE
Q04917 -> YWHAH
P61981 -> YWHAG
P31947 -> SFN


'TTN'
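
One detail of the dictionary construction above may be worth verifying on the real file: when an accession has more than one `Gene_Name` row, `set_index(...).to_dict()` silently keeps only the last one. The sketch below shows the behaviour on toy data and one possible way to list affected accessions. The accessions and gene names are hypothetical, and whether `HUMAN_9606_idmapping.dat` actually contains such cases is not established here.

```python
import pandas as pd

# Toy data in the same three-column layout as the mapping file.
toy = pd.DataFrame({
    "UniProtKB_Accession": ["P00001", "P00001", "P00002"],  # hypothetical accessions
    "ID_Type": ["Gene_Name", "Gene_Name", "Gene_Name"],
    "External_ID": ["GENE_A", "GENE_B", "GENE_C"],          # hypothetical gene names
})

# Same construction as in the cell above: later rows overwrite earlier ones.
last_wins = toy.set_index("UniProtKB_Accession")["External_ID"].to_dict()

# Alternative: keep every gene name per accession, in file order.
all_names = toy.groupby("UniProtKB_Accession")["External_ID"].agg(list).to_dict()

# Accessions that carry more than one gene name.
multi = {acc: names for acc, names in all_names.items() if len(names) > 1}

print(last_wins)
print(multi)
```

If `multi` turns out to be non-empty on the real data, a cluster could be labelled from only one of a protein's gene names, so it may be worth deciding explicitly which name to keep.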

In [8]:
# Also create a mapping from UniProtKB-ID to UniProtKB_Accession

# Select only the UniProtKB-ID rows
gene_df = df[df['ID_Type'] == 'UniProtKB-ID'].copy()
display(gene_df)

# make mapping dict
# ATTENTION: the key is the UniProtKB-ID (External_ID); the value is the accession
protein_ID_to_accession = gene_df.set_index('External_ID')['UniProtKB_Accession'].dropna().to_dict()

protein_ID_to_accession

Unnamed: 0,UniProtKB_Accession,ID_Type,External_ID
0,P31946,UniProtKB-ID,1433B_HUMAN
119,P62258,UniProtKB-ID,1433E_HUMAN
250,Q04917,UniProtKB-ID,1433F_HUMAN
341,P61981,UniProtKB-ID,1433G_HUMAN
469,P31947,UniProtKB-ID,1433S_HUMAN
...,...,...,...
5375485,B4DKG6,UniProtKB-ID,B4DKG6_HUMAN
5375495,B4DSK2,UniProtKB-ID,B4DSK2_HUMAN
5375509,Q59EC5,UniProtKB-ID,Q59EC5_HUMAN
5375519,V5LL19,UniProtKB-ID,V5LL19_HUMAN


{'1433B_HUMAN': 'P31946',
 '1433E_HUMAN': 'P62258',
 '1433F_HUMAN': 'Q04917',
 '1433G_HUMAN': 'P61981',
 '1433S_HUMAN': 'P31947',
 '1433T_HUMAN': 'P27348',
 '1433Z_HUMAN': 'P63104',
 '1A1L1_HUMAN': 'Q96QU6',
 '1A1L2_HUMAN': 'Q4AC99',
 '2A5A_HUMAN': 'Q15172',
 '2A5B_HUMAN': 'Q15173',
 '2A5D_HUMAN': 'Q14738',
 '2A5E_HUMAN': 'Q16537',
 '2A5G_HUMAN': 'Q13362',
 '2AAA_HUMAN': 'P30153',
 '2AAB_HUMAN': 'P30154',
 '2ABA_HUMAN': 'P63151',
 '2ABB_HUMAN': 'Q00005',
 '2ABD_HUMAN': 'Q66LE6',
 '2ABG_HUMAN': 'Q9Y2T4',
 '3BHS1_HUMAN': 'P14060',
 '3BHS2_HUMAN': 'P26439',
 '3BHS7_HUMAN': 'Q9H2F3',
 '3BP1_HUMAN': 'Q9Y3L3',
 '3BP2_HUMAN': 'P78314',
 '3BP5L_HUMAN': 'Q7L8J4',
 '3BP5_HUMAN': 'O60239',
 '3HAO_HUMAN': 'P46952',
 '3HIDH_HUMAN': 'P31937',
 '3MG_HUMAN': 'P29372',
 '4EBP1_HUMAN': 'Q13541',
 '4EBP2_HUMAN': 'Q13542',
 '4EBP3_HUMAN': 'O60516',
 '4ET_HUMAN': 'Q9NRA8',
 '4F2_HUMAN': 'P08195',
 '5HT1A_HUMAN': 'P08908',
 '5HT1B_HUMAN': 'P28222',
 '5HT1D_HUMAN': 'P28221',
 '5HT1E_HUMAN': 'P28566',
 '5HT1F

## Find Clusters in UniRef50

We start from an already processed version of the full UniRef50 that contains:
- only clusters with at least one human protein
- only the human proteins of those clusters


In [9]:
# Load
uniref_df = pd.read_csv(UNIREF_PATH)

# Transform 'Human_Proteins' string into list
uniref_df["proteins"] = uniref_df["Human_Proteins"].apply(
    lambda x: [i.strip().rstrip('.') for i in str(x).split(';') if i.strip()]
)

# Remove isoforms (keep only before '-') and duplicates while preserving order
uniref_df["proteins_no_isoform"] = uniref_df["proteins"].apply(
    lambda lst: list(dict.fromkeys(p.split('-', 1)[0] for p in lst)) # deduplicated list that keeps the original order of first appearance.
)

# Map to accessions if present in mapping; keep original otherwise
uniref_df["proteins_only_accession"] = uniref_df["proteins_no_isoform"].apply(
    lambda lst: [protein_ID_to_accession[p] if p in protein_ID_to_accession else p for p in lst]
)

# Remove duplicates caused by mapping
uniref_df["proteins_only_accession_nodup"] = uniref_df["proteins_only_accession"].apply(
    lambda lst: list(dict.fromkeys(lst))
)

# Remove proteins not in UniProt list (preserve order)
uniref_df["proteins_cleaned"] = uniref_df["proteins_only_accession_nodup"].apply(
    lambda lst: [p for p in lst if p in all_uniprot_prioteins_name]
)

# Count proteins before/after cleaning
uniref_df["n_proteins"] = uniref_df["proteins"].apply(len)
uniref_df["n_proteins_cleaned"] = uniref_df["proteins_cleaned"].apply(len)

display(uniref_df.head())
print(uniref_df.shape)


Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,proteins_no_isoform,proteins_only_accession,proteins_only_accession_nodup,proteins_cleaned,n_proteins,n_proteins_cleaned
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...","[TITIN_HUMAN, Q8WZ42, C0JYZ2_HUMAN, H0Y4J7_HUM...","[Q8WZ42, Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U...","[Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U7, A0AAQ...","[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...",18,7
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]","[Q8WZ42, A0A0A0MRA3_HUMAN]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]",4,2
2,UniRef50_Q8WXI7,Cluster: Mucin-16,MUC16_HUMAN; A0AAG2UXD3_HUMAN; A0AAG2UUZ0_HUMA...,"[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...","[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...",8,8
3,UniRef50_A0AA34QVW0,"Cluster: Mucin 16, cell surface associated",A0AA34QVW0_HUMAN,[A0AA34QVW0_HUMAN],[A0AA34QVW0_HUMAN],[A0AA34QVW0],[A0AA34QVW0],[A0AA34QVW0],1,1
4,UniRef50_Q9H195,Cluster: Mucin-3B,MUC3B_HUMAN; H9XFA8_HUMAN; I0CMK2_HUMAN; O4342...,"[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...","[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...","[Q9H195, H9XFA8, I0CMK2, O43420]","[Q9H195, H9XFA8, I0CMK2, O43420]","[Q9H195, H9XFA8, I0CMK2, O43420]",4,4


(77325, 10)
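
To make the cleaning chain easier to follow, the same steps can be traced on a single member string. This is a plain-Python sketch: the helper name `clean_members`, the shortened member string, and the small lookup tables are illustrative stand-ins for `protein_ID_to_accession` and `all_uniprot_prioteins_name`.

```python
def clean_members(raw, id_to_accession, known_accessions):
    """Apply the notebook's cleaning steps to one 'Human_Proteins' string."""
    # 1) Split on ';' and strip whitespace and trailing dots.
    members = [m.strip().rstrip(".") for m in str(raw).split(";") if m.strip()]
    # 2) Drop isoform suffixes (text after '-') and de-duplicate, keeping order.
    no_isoform = list(dict.fromkeys(m.split("-", 1)[0] for m in members))
    # 3) Map entry names to accessions where possible, then de-duplicate again.
    accessions = list(dict.fromkeys(id_to_accession.get(m, m) for m in no_isoform))
    # 4) Keep only accessions known to the gene-name mapping.
    return [a for a in accessions if a in known_accessions]


raw = "TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; C0JYZ2_HUMAN; A2TKE3_HUMAN"
id_to_accession = {
    "TITIN_HUMAN": "Q8WZ42",
    "C0JYZ2_HUMAN": "C0JYZ2",
    "A2TKE3_HUMAN": "A2TKE3",
}
known = {"Q8WZ42", "C0JYZ2"}  # A2TKE3 is deliberately left out of the toy set

cleaned = clean_members(raw, id_to_accession, known)
print(cleaned)
```

The two isoforms collapse onto `Q8WZ42`, which is then merged with the accession obtained from `TITIN_HUMAN`, and the accession missing from the known set is dropped. The same pattern explains why `n_proteins_cleaned` can be much smaller than `n_proteins` in the table above.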


In [10]:
# Map each protein to its gene, preserving order (None if the protein has no mapping)
uniref_df["genes_cleaned"] = uniref_df["proteins_cleaned"].apply(
    lambda lst: [protein_to_gene_map.get(p, None) for p in lst]
)

# Add the 'label' column: 'positive' if at least one protein maps to a positive gene, otherwise 'negative'
uniref_df["label"] = uniref_df["genes_cleaned"].apply(
    lambda gene_list: "positive" 
    if any(g in positive_genes for g in gene_list)
    else "negative"
)

# Build proteins_cleaned_positive and genes_cleaned_positive in parallel,
# keeping only the entries whose gene is positive
uniref_df["proteins_cleaned_positive"], uniref_df["genes_cleaned_positive"] = zip(*uniref_df.apply(
    lambda row: (
        [p for p, g in zip(row["proteins_cleaned"], row["genes_cleaned"]) if g in positive_genes],
        [g for g in row["genes_cleaned"] if g in positive_genes]
    ),
    axis=1
))
uniref_df["n_genes_cleaned_positive"] = uniref_df["genes_cleaned_positive"].apply(len)


display(uniref_df.head(5))
display(uniref_df.shape)


Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,proteins_no_isoform,proteins_only_accession,proteins_only_accession_nodup,proteins_cleaned,n_proteins,n_proteins_cleaned,genes_cleaned,label,proteins_cleaned_positive,genes_cleaned_positive,n_genes_cleaned_positive
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...","[TITIN_HUMAN, Q8WZ42, C0JYZ2_HUMAN, H0Y4J7_HUM...","[Q8WZ42, Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U...","[Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U7, A0AAQ...","[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...",18,7,"[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",positive,"[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...","[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",7
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]","[Q8WZ42, A0A0A0MRA3_HUMAN]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]",4,2,"[TTN, TTN]",positive,"[Q8WZ42, A0A0A0MRA3]","[TTN, TTN]",2
2,UniRef50_Q8WXI7,Cluster: Mucin-16,MUC16_HUMAN; A0AAG2UXD3_HUMAN; A0AAG2UUZ0_HUMA...,"[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...","[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...",8,8,"[MUC16, MUC16, MUC16, MUC16, MUC16, MUC16, MUC...",negative,[],[],0
3,UniRef50_A0AA34QVW0,"Cluster: Mucin 16, cell surface associated",A0AA34QVW0_HUMAN,[A0AA34QVW0_HUMAN],[A0AA34QVW0_HUMAN],[A0AA34QVW0],[A0AA34QVW0],[A0AA34QVW0],1,1,[MUC16],negative,[],[],0
4,UniRef50_Q9H195,Cluster: Mucin-3B,MUC3B_HUMAN; H9XFA8_HUMAN; I0CMK2_HUMAN; O4342...,"[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...","[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...","[Q9H195, H9XFA8, I0CMK2, O43420]","[Q9H195, H9XFA8, I0CMK2, O43420]","[Q9H195, H9XFA8, I0CMK2, O43420]",4,4,"[MUC3B, MUC3B, MUC3B, MUC3]",negative,[],[],0


(77325, 15)

In [11]:
# Assign sampling probabilities: the gene set counts of the positive genes are used as logits for a softmax
uniref_df["logits"] = uniref_df["genes_cleaned_positive"].apply(
    lambda gene_list: [positive_gene_freq_map.get(g, 0) for g in gene_list]
)
def safe_softmax(logits):
    if len(logits) == 0:
        return []  # return empty list if no logits
    return softmax(logits).tolist()  # convert numpy array to list
uniref_df["probs"] = uniref_df["logits"].apply(safe_softmax)

uniref_df

Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,proteins_no_isoform,proteins_only_accession,proteins_only_accession_nodup,proteins_cleaned,n_proteins,n_proteins_cleaned,genes_cleaned,label,proteins_cleaned_positive,genes_cleaned_positive,n_genes_cleaned_positive,logits,probs
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...","[TITIN_HUMAN, Q8WZ42, C0JYZ2_HUMAN, H0Y4J7_HUM...","[Q8WZ42, Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U...","[Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U7, A0AAQ...","[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...",18,7,"[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",positive,"[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...","[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",7,"[2, 2, 2, 2, 2, 2, 2]","[0.14285714285714285, 0.14285714285714285, 0.1..."
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]","[Q8WZ42, A0A0A0MRA3_HUMAN]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]",4,2,"[TTN, TTN]",positive,"[Q8WZ42, A0A0A0MRA3]","[TTN, TTN]",2,"[2, 2]","[0.5, 0.5]"
2,UniRef50_Q8WXI7,Cluster: Mucin-16,MUC16_HUMAN; A0AAG2UXD3_HUMAN; A0AAG2UUZ0_HUMA...,"[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...","[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...",8,8,"[MUC16, MUC16, MUC16, MUC16, MUC16, MUC16, MUC...",negative,[],[],0,[],[]
3,UniRef50_A0AA34QVW0,"Cluster: Mucin 16, cell surface associated",A0AA34QVW0_HUMAN,[A0AA34QVW0_HUMAN],[A0AA34QVW0_HUMAN],[A0AA34QVW0],[A0AA34QVW0],[A0AA34QVW0],1,1,[MUC16],negative,[],[],0,[],[]
4,UniRef50_Q9H195,Cluster: Mucin-3B,MUC3B_HUMAN; H9XFA8_HUMAN; I0CMK2_HUMAN; O4342...,"[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...","[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...","[Q9H195, H9XFA8, I0CMK2, O43420]","[Q9H195, H9XFA8, I0CMK2, O43420]","[Q9H195, H9XFA8, I0CMK2, O43420]",4,4,"[MUC3B, MUC3B, MUC3B, MUC3]",negative,[],[],0,[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77320,UniRef50_B2MXE1,Cluster: Anaplastic lymphoma kinase (Fragment),B2MXE1_HUMAN,[B2MXE1_HUMAN],[B2MXE1_HUMAN],[B2MXE1],[B2MXE1],[B2MXE1],1,1,[ALK],positive,[B2MXE1],[ALK],1,[2],[1.0]
77321,UniRef50_Q16427,Cluster: Dystrophin protein (Fragment),Q16427_HUMAN,[Q16427_HUMAN],[Q16427_HUMAN],[Q16427],[Q16427],[Q16427],1,1,[dystrophin],negative,[],[],0,[],[]
77322,UniRef50_Q16175,Cluster: Low density lipoprotein receptor gene...,Q16175_HUMAN,[Q16175_HUMAN],[Q16175_HUMAN],[Q16175],[Q16175],[],1,0,[],negative,[],[],0,[],[]
77323,UniRef50_A6QL42,Cluster: Potassium voltage-gated channel long ...,A6QL42_HUMAN,[A6QL42_HUMAN],[A6QL42_HUMAN],[A6QL42],[A6QL42],[A6QL42],1,1,[KCND3],negative,[],[],0,[],[]
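
The probabilities above come from a softmax over raw gene set counts, $p_i = e^{c_i} / \sum_j e^{c_j}$. Because the counts enter the exponent directly, even moderate differences in count give very uneven probabilities. The sketch below illustrates this with a small NumPy softmax; the `temperature` parameter is a hypothetical addition for illustration and is not part of the notebook.

```python
import numpy as np


def softmax(x, temperature=1.0):
    """Numerically stable softmax; temperature > 1 flattens the distribution."""
    z = np.asarray(x, dtype=float) / temperature
    z = z - z.max()  # subtract the max to avoid overflow in exp
    e = np.exp(z)
    return e / e.sum()


equal = softmax([2, 2])                        # identical counts give a uniform distribution
skewed = softmax([75, 2])                      # raw counts put almost all mass on the first entry
tempered = softmax([75, 2], temperature=75.0)  # hypothetical temperature, roughly [0.73, 0.27]

print(equal, skewed, tempered)
```

In a cluster that mixes a guaranteed gene (count 75) with a gene seen in two gene sets, the second gene would in practice never be sampled. If a softer preference is wanted, scaling the counts first (as in `tempered`) or normalising them directly are possible alternatives; which one suits the analysis is a design decision left open here.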


## Positive class sampling

For each positive cluster, sample up to N positive proteins using the probabilities defined above (i.e., based on MSigDB gene set frequency).

In [12]:
# Filter only positive clusters
uniref_df_pos = uniref_df[uniref_df.label == "positive"].copy()
print(uniref_df_pos.shape)
print(uniref_df_pos["n_genes_cleaned_positive"].value_counts().sort_index())


(21024, 17)
n_genes_cleaned_positive
1       14355
2        3507
3        1332
4         657
5         369
        ...  
1910        1
2175        1
2321        1
2913        1
5690        1
Name: count, Length: 89, dtype: int64


In [13]:
def sample_row(row, min_sample_n, prot_col, gene_col, probs_col):
    N = min(len(row[prot_col]), min_sample_n)
    if probs_col is not None:
        probs = row[probs_col]
    else:
        probs = None  # uniform probability
    sampled_indices = np.random.choice(len(row[prot_col]), size=N, replace=False, p=probs)
    proteins_array = np.array(row[prot_col])
    genes_array = np.array(row[gene_col])
    return proteins_array[sampled_indices], genes_array[sampled_indices]

uniref_df_pos["proteins_sampled"], uniref_df_pos["genes_sampled"] = zip(
    *uniref_df_pos.apply(sample_row, axis=1, min_sample_n=MIN_SAMPLE_N_POSITIVE, gene_col="genes_cleaned_positive", prot_col="proteins_cleaned_positive", probs_col="probs")
)

uniref_df_pos

Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,proteins_no_isoform,proteins_only_accession,proteins_only_accession_nodup,proteins_cleaned,n_proteins,n_proteins_cleaned,genes_cleaned,label,proteins_cleaned_positive,genes_cleaned_positive,n_genes_cleaned_positive,logits,probs,proteins_sampled,genes_sampled
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...","[TITIN_HUMAN, Q8WZ42, C0JYZ2_HUMAN, H0Y4J7_HUM...","[Q8WZ42, Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U...","[Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U7, A0AAQ...","[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...",18,7,"[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",positive,"[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...","[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",7,"[2, 2, 2, 2, 2, 2, 2]","[0.14285714285714285, 0.14285714285714285, 0.1...",[A0A0C4DG59],[TTN]
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]","[Q8WZ42, A0A0A0MRA3_HUMAN]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]",4,2,"[TTN, TTN]",positive,"[Q8WZ42, A0A0A0MRA3]","[TTN, TTN]",2,"[2, 2]","[0.5, 0.5]",[A0A0A0MRA3],[TTN]
5,UniRef50_A0A1W2PS37,Cluster: Spectrin repeat containing nuclear en...,A0A1W2PS37_HUMAN,[A0A1W2PS37_HUMAN],[A0A1W2PS37_HUMAN],[A0A1W2PS37],[A0A1W2PS37],[A0A1W2PS37],1,1,[SYNE2],positive,[A0A1W2PS37],[SYNE2],1,[4],[1.0],[A0A1W2PS37],[SYNE2]
8,UniRef50_Q03001,Cluster: Dystonin,DYST_HUMAN; E7ERX3_HUMAN,"[DYST_HUMAN, E7ERX3_HUMAN]","[DYST_HUMAN, E7ERX3_HUMAN]","[Q03001, E7ERX3]","[Q03001, E7ERX3]","[Q03001, E7ERX3]",2,2,"[DST, DST]",positive,"[Q03001, E7ERX3]","[DST, DST]",2,"[2, 2]","[0.5, 0.5]",[E7ERX3],[DST]
9,UniRef50_A0AAV6QKA6,Cluster: Versican core protein,H0YKM8_HUMAN,[H0YKM8_HUMAN],[H0YKM8_HUMAN],[H0YKM8],[H0YKM8],[H0YKM8],1,1,[KLF13],positive,[H0YKM8],[KLF13],1,[4],[1.0],[H0YKM8],[KLF13]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77293,UniRef50_Q32W77,Cluster: Deleted in lung and esophageal cancer...,Q32W77_HUMAN,[Q32W77_HUMAN],[Q32W77_HUMAN],[Q32W77],[Q32W77],[Q32W77],1,1,[DLEC1],positive,[Q32W77],[DLEC1],1,[2],[1.0],[Q32W77],[DLEC1]
77295,UniRef50_Q0E5H0,Cluster: Rhesus blood group little e antigen (...,Q0E5H0_HUMAN,[Q0E5H0_HUMAN],[Q0E5H0_HUMAN],[Q0E5H0],[Q0E5H0],[Q0E5H0],1,1,[RHCE],positive,[Q0E5H0],[RHCE],1,[2],[1.0],[Q0E5H0],[RHCE]
77299,UniRef50_A0A1J0F944,Cluster: Truncated neurofibromin 1 (Fragment),A0A1J0F944_HUMAN,[A0A1J0F944_HUMAN],[A0A1J0F944_HUMAN],[A0A1J0F944],[A0A1J0F944],[A0A1J0F944],1,1,[NF1],positive,[A0A1J0F944],[NF1],1,[38],[1.0],[A0A1J0F944],[NF1]
77309,UniRef50_H0Y9R2,Cluster: Nicotinamide nucleotide transhydrogen...,H0Y9R2_HUMAN,[H0Y9R2_HUMAN],[H0Y9R2_HUMAN],[H0Y9R2],[H0Y9R2],[H0Y9R2],1,1,[NNT],positive,[H0Y9R2],[NNT],1,[2],[1.0],[H0Y9R2],[NNT]
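
The sampling above uses NumPy's global random state, so the selected proteins change from run to run unless a seed is set elsewhere. A minimal sketch of a reproducible variant, assuming an explicit `numpy.random.Generator` is passed in (the function name `sample_members` and the `rng` parameter are illustrative, not part of the notebook):

```python
import numpy as np


def sample_members(proteins, genes, n, probs=None, rng=None):
    """Sample up to n (protein, gene) pairs without replacement.

    probs=None means uniform sampling; rng is a numpy.random.Generator.
    """
    rng = np.random.default_rng() if rng is None else rng
    size = min(len(proteins), n)
    if size == 0:
        return [], []
    idx = rng.choice(len(proteins), size=size, replace=False, p=probs)
    # Index both lists with the same positions so pairs stay aligned.
    return [proteins[i] for i in idx], [genes[i] for i in idx]


proteins = ["Q8WZ42", "C0JYZ2", "H0Y4J7"]
genes = ["TTN", "TTN", "TTN"]
uniform = [1 / 3, 1 / 3, 1 / 3]

first = sample_members(proteins, genes, n=1, probs=uniform, rng=np.random.default_rng(0))
second = sample_members(proteins, genes, n=1, probs=uniform, rng=np.random.default_rng(0))
print(first, first == second)
```

With `DataFrame.apply`, a single generator created once (for example `rng = np.random.default_rng(42)`) could be passed as a keyword argument so that the whole dataset is reproducible from one seed.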


## Negative class sampling

Take `NEGATIVE_CLASS_MULT` × N proteins from each cluster:
- sample uniformly within each cluster
- exclude positive genes (they are removed from the candidates)
- exclude ambiguous genes (they are removed from the candidates if present)

In [14]:
uniref_df_neg = uniref_df.copy()

In [15]:
# First, remove the ambiguous and positive genes from the candidate pool

def filter_row(row, prot_col, gene_col, filter_out_genes):
    """
    Remove genes in filter_out_genes and their corresponding proteins.
    Returns filtered proteins and genes as lists.
    """
    proteins_array = np.array(row[prot_col])
    genes_array = np.array(row[gene_col])
    
    # Indices to keep (genes NOT in filter_out_genes)
    retain_indices = [i for i, g in enumerate(genes_array) if g not in filter_out_genes]
    
    if len(retain_indices) == 0:
        return [], []
    
    return proteins_array[retain_indices].tolist(), genes_array[retain_indices].tolist()

# list of genes to remove
genes_to_remove_combined = set(positive_genes).union(set(ambiguos_genes))

# Filter out genes/proteins that cannot serve as negatives
uniref_df_neg["proteins_allowed"], uniref_df_neg["genes_allowed"] = zip(
    *uniref_df_neg.apply(filter_row, axis=1, 
                         gene_col="genes_cleaned",  # start from all cleaned proteins
                         prot_col="proteins_cleaned", 
                         filter_out_genes=genes_to_remove_combined
                         )
)

# ATTENTION:
# - sample from all clusters (not just the positive ones)
# - take a multiple of N
# - sample only from the allowed proteins
uniref_df_neg["proteins_sampled"], uniref_df_neg["genes_sampled"] = zip(
    *uniref_df_neg.apply(sample_row, axis=1, 
                         min_sample_n= NEGATIVE_CLASS_MULT * MIN_SAMPLE_N_POSITIVE, 
                         gene_col="genes_allowed",  # sample from the allowed proteins only
                         prot_col="proteins_allowed", 
                         probs_col=None # uniform
                         )
)

display(uniref_df_neg.head(5))
# check: are there positive clusters that still contain a negative gene?
display(uniref_df_neg[(uniref_df_neg.label == "positive") & (uniref_df_neg.genes_allowed.apply(len) != 0)].head(2))


Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,proteins_no_isoform,proteins_only_accession,proteins_only_accession_nodup,proteins_cleaned,n_proteins,n_proteins_cleaned,...,label,proteins_cleaned_positive,genes_cleaned_positive,n_genes_cleaned_positive,logits,probs,proteins_allowed,genes_allowed,proteins_sampled,genes_sampled
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...","[TITIN_HUMAN, Q8WZ42, C0JYZ2_HUMAN, H0Y4J7_HUM...","[Q8WZ42, Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U...","[Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U7, A0AAQ...","[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...",18,7,...,positive,"[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...","[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",7,"[2, 2, 2, 2, 2, 2, 2]","[0.14285714285714285, 0.14285714285714285, 0.1...",[],[],[],[]
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]","[Q8WZ42, A0A0A0MRA3_HUMAN]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]",4,2,...,positive,"[Q8WZ42, A0A0A0MRA3]","[TTN, TTN]",2,"[2, 2]","[0.5, 0.5]",[],[],[],[]
2,UniRef50_Q8WXI7,Cluster: Mucin-16,MUC16_HUMAN; A0AAG2UXD3_HUMAN; A0AAG2UUZ0_HUMA...,"[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...","[MUC16_HUMAN, A0AAG2UXD3_HUMAN, A0AAG2UUZ0_HUM...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...","[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...",8,8,...,negative,[],[],0,[],[],"[Q8WXI7, A0AAG2UXD3, A0AAG2UUZ0, A0AA34QW05, A...","[MUC16, MUC16, MUC16, MUC16, MUC16, MUC16, MUC...","[A0AAG2UXK0, Q8WXI7]","[MUC16, MUC16]"
3,UniRef50_A0AA34QVW0,"Cluster: Mucin 16, cell surface associated",A0AA34QVW0_HUMAN,[A0AA34QVW0_HUMAN],[A0AA34QVW0_HUMAN],[A0AA34QVW0],[A0AA34QVW0],[A0AA34QVW0],1,1,...,negative,[],[],0,[],[],[A0AA34QVW0],[MUC16],[A0AA34QVW0],[MUC16]
4,UniRef50_Q9H195,Cluster: Mucin-3B,MUC3B_HUMAN; H9XFA8_HUMAN; I0CMK2_HUMAN; O4342...,"[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...","[MUC3B_HUMAN, H9XFA8_HUMAN, I0CMK2_HUMAN, O434...","[Q9H195, H9XFA8, I0CMK2, O43420]","[Q9H195, H9XFA8, I0CMK2, O43420]","[Q9H195, H9XFA8, I0CMK2, O43420]",4,4,...,negative,[],[],0,[],[],"[Q9H195, H9XFA8, I0CMK2, O43420]","[MUC3B, MUC3B, MUC3B, MUC3]","[H9XFA8, I0CMK2]","[MUC3B, MUC3B]"


Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,proteins_no_isoform,proteins_only_accession,proteins_only_accession_nodup,proteins_cleaned,n_proteins,n_proteins_cleaned,...,label,proteins_cleaned_positive,genes_cleaned_positive,n_genes_cleaned_positive,logits,probs,proteins_allowed,genes_allowed,proteins_sampled,genes_sampled
28,UniRef50_Q8WXH0,Cluster: Nesprin-2,SYNE2_HUMAN; Q8WXH0-3; Q8WXH0-4; A0A0C4DGK3_HU...,"[SYNE2_HUMAN, Q8WXH0-3, Q8WXH0-4, A0A0C4DGK3_H...","[SYNE2_HUMAN, Q8WXH0, A0A0C4DGK3_HUMAN, Q6MZP0...","[Q8WXH0, Q8WXH0, A0A0C4DGK3, Q6MZP0, A0A669KB6...","[Q8WXH0, A0A0C4DGK3, Q6MZP0, A0A669KB61, G3V2Q...","[Q8WXH0, A0A0C4DGK3, Q6MZP0, A0A669KB61, G3V2Q...",14,6,...,positive,"[Q8WXH0, A0A0C4DGK3, A0A669KB61, G3V2Q0, A0A1W...","[SYNE2, SYNE2, SYNE2, SYNE2, SYNE2]",5,"[4, 4, 4, 4, 4]","[0.2, 0.2, 0.2, 0.2, 0.2]",[Q6MZP0],[DKFZp686M09200],[Q6MZP0],[DKFZp686M09200]
30,UniRef50_A0A8C9AHQ9,Cluster: Spectrin repeat containing nuclear en...,D4YW74_HUMAN; G3V5X4_HUMAN,"[D4YW74_HUMAN, G3V5X4_HUMAN]","[D4YW74_HUMAN, G3V5X4_HUMAN]","[D4YW74, G3V5X4]","[D4YW74, G3V5X4]","[D4YW74, G3V5X4]",2,2,...,positive,[G3V5X4],[SYNE2],1,[4],[1.0],[D4YW74],[TROPH],[D4YW74],[TROPH]


## Make dataset

protein name | label | seq

In [16]:
# add labels
uniref_df_pos["label_single_prot"] = 1
uniref_df_neg["label_single_prot"] = 0

# merge --> each entry is a cluster
dataset_df = pd.concat([uniref_df_pos, uniref_df_neg])
display(dataset_df)


Unnamed: 0,Cluster_ID,Cluster_Name,Human_Proteins,proteins,proteins_no_isoform,proteins_only_accession,proteins_only_accession_nodup,proteins_cleaned,n_proteins,n_proteins_cleaned,...,proteins_cleaned_positive,genes_cleaned_positive,n_genes_cleaned_positive,logits,probs,proteins_sampled,genes_sampled,label_single_prot,proteins_allowed,genes_allowed
0,UniRef50_Q8WZ42,Cluster: Titin,TITIN_HUMAN; Q8WZ42-8; Q8WZ42-2; Q8WZ42-7; C0J...,"[TITIN_HUMAN, Q8WZ42-8, Q8WZ42-2, Q8WZ42-7, C0...","[TITIN_HUMAN, Q8WZ42, C0JYZ2_HUMAN, H0Y4J7_HUM...","[Q8WZ42, Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U...","[Q8WZ42, C0JYZ2, H0Y4J7, A2TKE3, H7C0U7, A0AAQ...","[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...",18,7,...,"[Q8WZ42, C0JYZ2, H0Y4J7, H7C0U7, A0AAQ5BIC8, A...","[TTN, TTN, TTN, TTN, TTN, TTN, TTN]",7,"[2, 2, 2, 2, 2, 2, 2]","[0.14285714285714285, 0.14285714285714285, 0.1...",[A0A0C4DG59],[TTN],1,,
1,UniRef50_Q8WZ42-9,Cluster: Isoform 9 of Titin,Q8WZ42-9; A0A0A0MRA3_HUMAN; Q8WZ42-3; Q8WZ42-10,"[Q8WZ42-9, A0A0A0MRA3_HUMAN, Q8WZ42-3, Q8WZ42-10]","[Q8WZ42, A0A0A0MRA3_HUMAN]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]","[Q8WZ42, A0A0A0MRA3]",4,2,...,"[Q8WZ42, A0A0A0MRA3]","[TTN, TTN]",2,"[2, 2]","[0.5, 0.5]",[A0A0A0MRA3],[TTN],1,,
5,UniRef50_A0A1W2PS37,Cluster: Spectrin repeat containing nuclear en...,A0A1W2PS37_HUMAN,[A0A1W2PS37_HUMAN],[A0A1W2PS37_HUMAN],[A0A1W2PS37],[A0A1W2PS37],[A0A1W2PS37],1,1,...,[A0A1W2PS37],[SYNE2],1,[4],[1.0],[A0A1W2PS37],[SYNE2],1,,
8,UniRef50_Q03001,Cluster: Dystonin,DYST_HUMAN; E7ERX3_HUMAN,"[DYST_HUMAN, E7ERX3_HUMAN]","[DYST_HUMAN, E7ERX3_HUMAN]","[Q03001, E7ERX3]","[Q03001, E7ERX3]","[Q03001, E7ERX3]",2,2,...,"[Q03001, E7ERX3]","[DST, DST]",2,"[2, 2]","[0.5, 0.5]",[E7ERX3],[DST],1,,
9,UniRef50_A0AAV6QKA6,Cluster: Versican core protein,H0YKM8_HUMAN,[H0YKM8_HUMAN],[H0YKM8_HUMAN],[H0YKM8],[H0YKM8],[H0YKM8],1,1,...,[H0YKM8],[KLF13],1,[4],[1.0],[H0YKM8],[KLF13],1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77320,UniRef50_B2MXE1,Cluster: Anaplastic lymphoma kinase (Fragment),B2MXE1_HUMAN,[B2MXE1_HUMAN],[B2MXE1_HUMAN],[B2MXE1],[B2MXE1],[B2MXE1],1,1,...,[B2MXE1],[ALK],1,[2],[1.0],[],[],0,[],[]
77321,UniRef50_Q16427,Cluster: Dystrophin protein (Fragment),Q16427_HUMAN,[Q16427_HUMAN],[Q16427_HUMAN],[Q16427],[Q16427],[Q16427],1,1,...,[],[],0,[],[],[Q16427],[dystrophin],0,[Q16427],[dystrophin]
77322,UniRef50_Q16175,Cluster: Low density lipoprotein receptor gene...,Q16175_HUMAN,[Q16175_HUMAN],[Q16175_HUMAN],[Q16175],[Q16175],[],1,0,...,[],[],0,[],[],[],[],0,[],[]
77323,UniRef50_A6QL42,Cluster: Potassium voltage-gated channel long ...,A6QL42_HUMAN,[A6QL42_HUMAN],[A6QL42_HUMAN],[A6QL42],[A6QL42],[A6QL42],1,1,...,[],[],0,[],[],[A6QL42],[KCND3],0,[A6QL42],[KCND3]


In [29]:
# smaller df
cols_to_keep = ['Cluster_ID', 'proteins_sampled',
                'genes_sampled', 'label_single_prot']
dataset_df_small = dataset_df[cols_to_keep].copy()

# Columns to explode
list_cols = ['proteins_sampled', 'genes_sampled']

# Keep the rest of the columns
other_cols = [c for c in dataset_df_small.columns if c not in list_cols]

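# Explode both list columns together (pandas >= 1.3) so that each protein stays
# paired with its own gene. Exploding them one after the other would pair every
# protein with every gene of its cluster.
long_df = dataset_df_small.explode(list_cols)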

# Now all other columns are repeated for each exploded row
long_df = long_df.reset_index(drop=True)

# Rename
long_df = long_df.rename(columns={"proteins_sampled": "protein", "genes_sampled": "gene", "label_single_prot": "label"})

# ATTENTION: remove duplicates
# A single protein can appear in different clusters if it matches multiple
# representative sequences in different UniRef50 clusters
# (e.g. Q8WZ42 above, once isoform suffixes are stripped).
long_df = long_df.drop_duplicates(subset=["protein"])

# reset index
long_df = long_df.reset_index(drop=True)

display(long_df)

print(long_df['label'].value_counts())

Unnamed: 0,Cluster_ID,protein,gene,label
0,UniRef50_Q8WZ42,A0A0C4DG59,TTN,1
1,UniRef50_Q8WZ42-9,A0A0A0MRA3,TTN,1
2,UniRef50_A0A1W2PS37,A0A1W2PS37,SYNE2,1
3,UniRef50_Q03001,E7ERX3,DST,1
4,UniRef50_A0AAV6QKA6,H0YKM8,KLF13,1
...,...,...,...,...
60743,UniRef50_Q16234,Q16234,HuD,0
60744,UniRef50_V9H0A7,V9H0A7,troponin T,0
60745,UniRef50_Q16427,Q16427,dystrophin,0
60746,UniRef50_A6QL42,A6QL42,KCND3,0


label
0    41489
1    19259
Name: count, dtype: int64
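
When two parallel list columns are turned into long format, the way they are exploded matters. The sketch below uses a toy frame with hypothetical values to compare exploding both columns jointly with exploding them one after the other. It assumes pandas 1.3 or newer, where `DataFrame.explode` accepts a list of columns.

```python
import pandas as pd

# Toy cluster with two sampled proteins and their genes (hypothetical values)
toy = pd.DataFrame({
    "Cluster_ID": ["c1"],
    "proteins_sampled": [["P1", "P2"]],
    "genes_sampled": [["G1", "G2"]],
})

# Joint explode: element i of one list stays with element i of the other
joint = toy.explode(["proteins_sampled", "genes_sampled"]).reset_index(drop=True)

# Sequential explode: the second explode repeats every gene for every protein
sequential = toy.copy()
for col in ["proteins_sampled", "genes_sampled"]:
    sequential = sequential.explode(col)
sequential = sequential.reset_index(drop=True)

joint_pairs = list(zip(joint["proteins_sampled"], joint["genes_sampled"]))
sequential_pairs = list(zip(sequential["proteins_sampled"], sequential["genes_sampled"]))
print(joint_pairs)
print(sequential_pairs)
```

With the sequential variant, a later `drop_duplicates(subset=["protein"])` keeps the first row per protein, so every protein of a cluster would end up with the first gene of that cluster's list.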


In [34]:
# check: entries where the same Cluster_ID has conflicting labels.
long_df.groupby('Cluster_ID').filter(lambda x: x['label'].nunique() > 1).sort_values(by="Cluster_ID")

Unnamed: 0,Cluster_ID,protein,gene,label,set
45840,UniRef50_A0A060Z6A6,H0YEJ4,CAPN8,0,train
11301,UniRef50_A0A060Z6A6,E9PL37,CAPN1,1,train
27531,UniRef50_A0A061IN33,A0A510GDE6,p38,0,train
3231,UniRef50_A0A061IN33,E7EX54,MAPK14,1,train
2568,UniRef50_A0A091CPA8,C9JIF6,AKT2,1,train
...,...,...,...,...,...
39811,UniRef50_W5PLM1,C5J3U3,HLA-Cw,0,train
16867,UniRef50_X2GFG1,X2GFG1,CD1D,1,train
56583,UniRef50_X2GFG1,X2F926,CD1d,0,train
38912,UniRef50_X5D3F6,X5CHT4,IRF-1,0,train


In [30]:
# Stratified split: 80% train, 10% val, 10% test
# First split: 80% train, 20% temp
train_idx, temp_idx = train_test_split(
    long_df.index,
    test_size=0.2,
    stratify=long_df['label'],
    random_state=42
)

# Second split: temp into val and test (50%-50% of temp)
temp_labels = long_df.loc[temp_idx, 'label']
val_idx, test_idx = train_test_split(
    temp_idx,
    test_size=0.5,
    stratify=temp_labels,
    random_state=42
)

# Assign splits
long_df['set'] = ''
long_df.loc[train_idx, 'set'] = 'train'
long_df.loc[val_idx, 'set'] = 'val'
long_df.loc[test_idx, 'set'] = 'test'

print(long_df['set'].value_counts())
display(long_df)

set
train    48598
val       6075
test      6075
Name: count, dtype: int64


Unnamed: 0,Cluster_ID,protein,gene,label,set
0,UniRef50_Q8WZ42,A0A0C4DG59,TTN,1,val
1,UniRef50_Q8WZ42-9,A0A0A0MRA3,TTN,1,train
2,UniRef50_A0A1W2PS37,A0A1W2PS37,SYNE2,1,test
3,UniRef50_Q03001,E7ERX3,DST,1,test
4,UniRef50_A0AAV6QKA6,H0YKM8,KLF13,1,train
...,...,...,...,...,...
60743,UniRef50_Q16234,Q16234,HuD,0,train
60744,UniRef50_V9H0A7,V9H0A7,troponin T,0,train
60745,UniRef50_Q16427,Q16427,dystrophin,0,val
60746,UniRef50_A6QL42,A6QL42,KCND3,0,train
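
The split above is stratified by label at the protein level. UniRef50 groups sequences at 50% identity, so proteins from the same cluster can land in different splits. Whether that matters depends on the downstream analysis. A minimal sketch of two sanity checks follows, using a toy frame with hypothetical values and the same column names as `long_df`.

```python
import pandas as pd

# Toy long-format table (hypothetical values)
toy = pd.DataFrame({
    "Cluster_ID": ["c1", "c1", "c2", "c3", "c3", "c4"],
    "protein": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "label": [1, 0, 1, 0, 0, 1],
    "set": ["train", "test", "train", "val", "val", "train"],
})

# Check 1: fraction of positives in each split (should be similar after stratification)
pos_rate = toy.groupby("set")["label"].mean()

# Check 2: clusters whose members ended up in more than one split
sets_per_cluster = toy.groupby("Cluster_ID")["set"].nunique()
leaky_clusters = sets_per_cluster[sets_per_cluster > 1].index.tolist()

print(pos_rate)
print(leaky_clusters)
```

On the real data, the same two `groupby` calls can be run on `long_df` directly.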


## Add Sequence

In [None]:
# 1) Read the proteome DataFrame
proteome_df = pd.read_csv(HUMAN_PROTEOME_PATH, sep='\t')
display(proteome_df.head(5)); print(proteome_df.shape)

proteome_df.drop(columns=['Entry Name', 'Protein names', 'Organism'], inplace=True)

# Sanity check --> ATTENTION: many data points are lost (sampled proteins with no proteome entry)
all_proteome_proteins = set(proteome_df.Entry.unique())
all_used_proteins = set(long_df.protein.unique())
print(len(all_used_proteins), len(all_used_proteins.intersection(all_proteome_proteins)))

Unnamed: 0,Entry,Reviewed,Entry Name,Protein names,Gene Names,Organism,Sequence
0,A0A087WZT3,unreviewed,A0A087WZT3_HUMAN,BOLA2-SMG1P6 readthrough,BOLA2-SMG1P6,Homo sapiens (Human),MELSAEYLREKLQRDLEAEHVLPSPGGVGQVRGETAASETQLGS
1,A0A087X1C5,reviewed,CP2D7_HUMAN,Cytochrome P450 2D7 (EC 1.14.14.1),CYP2D7,Homo sapiens (Human),MGLEALVPLAMIVAIFLLLVDLMHRHQRWAARYPPGPLPLPGLGNL...
2,A0A087X296,unreviewed,A0A087X296_HUMAN,Prostaglandin G/H synthase 1 (EC 1.14.99.1) (C...,PTGS1,Homo sapiens (Human),MSRSLLLWFLLFLLLLPPLPVLLADPGAPTPVNPCCYYPCQHQGIC...
3,A0A0A0MQV1,unreviewed,A0A0A0MQV1_HUMAN,11-beta-hydroxysteroid dehydrogenase 1 (EC 1.1...,HSD11B1,Homo sapiens (Human),MAFMKKYLLPILGLFMAYYYYSANEEFRPEMLQGKKVIVTGASKGI...
4,A0A0A0MRG2,unreviewed,A0A0A0MRG2_HUMAN,Amyloid-beta precursor protein (ABPP) (Alzheim...,APP,Homo sapiens (Human),MFCGRLNMHMNVQNGKWDSDPSGTKTCIDTKEGILQYCQEVYPELQ...


(83587, 7)
60748 46206
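
The two printed numbers mean that 60748 - 46206 = 14542 sampled proteins (about 24%) have no entry in the downloaded proteome table, so an inner join on the accession drops them. To see whether the loss hits one class more than the other, something along these lines could be run. This is a sketch on toy frames with hypothetical values. The column names `protein`, `label` and `Entry` follow the notebook.

```python
import pandas as pd

# Toy stand-ins for long_df and proteome_df (hypothetical values)
toy_long = pd.DataFrame({"protein": ["P1", "P2", "P3", "P4"], "label": [1, 0, 0, 1]})
toy_proteome = pd.DataFrame({"Entry": ["P1", "P3", "P9"], "Sequence": ["MKT", "MAA", "MGG"]})

# Flag proteins that have a sequence available in the proteome table
has_seq = toy_long["protein"].isin(toy_proteome["Entry"])
missing = toy_long[~has_seq]

# How many proteins of each label would be dropped by an inner join
lost_per_label = missing["label"].value_counts().to_dict()
print(missing)
print(lost_per_label)
```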


In [36]:
# the inner join keeps only the proteins that have a sequence
long_df_seq = pd.merge(how="inner", left=long_df, right=proteome_df, left_on="protein", right_on="Entry")
long_df_seq.rename(columns={
    "Sequence": "sequence"
}, inplace=True)
long_df_seq

Unnamed: 0,Cluster_ID,protein,gene,label,set,Entry,Reviewed,Gene Names,sequence
0,UniRef50_Q8WZ42,A0A0C4DG59,TTN,1,val,A0A0C4DG59,unreviewed,TTN,MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVI...
1,UniRef50_Q8WZ42-9,A0A0A0MRA3,TTN,1,train,A0A0A0MRA3,unreviewed,TTN,MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVI...
2,UniRef50_A0A1W2PS37,A0A1W2PS37,SYNE2,1,test,A0A1W2PS37,unreviewed,SYNE2,MSMERRMKIEETWRLW
3,UniRef50_Q03001,E7ERX3,DST,1,test,E7ERX3,unreviewed,DST,MMGQLMKNDQDWNAKLVGLMCCMDERDKVQKKTFTKWINQHLMKVR...
4,UniRef50_A0AAV6QKA6,H0YKM8,KLF13,1,train,H0YKM8,unreviewed,KLF13,MRSDHLTKHARRHA
...,...,...,...,...,...,...,...,...,...
46201,UniRef50_U3KPR3,U3KPR3,PDE4C,0,train,U3KPR3,unreviewed,PDE4C,XARQQCLGAAK
46202,UniRef50_H0Y3V6,H0Y3V6,FAM131B,0,train,H0Y3V6,unreviewed,FAM131B,XDFSWDGINAL
46203,UniRef50_A0A6Q8PGP0,A0A6Q8PGP0,KIF5A,0,train,A0A6Q8PGP0,unreviewed,KIF5A,XNATDINDNSF
46204,UniRef50_A0A0A0MTA1,A0A0A0MTA1,IGLJ4,0,test,A0A0A0MTA1,unreviewed,IGLJ4,VFGGGTQLIIL


## Save

In [37]:
long_df_seq.to_csv(FINAL_DATASET_PATH)
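
Because `to_csv` is called with its defaults, the row index is written as an unnamed first column. A small round-trip sketch follows, using a toy frame and a temporary file (both hypothetical), to show how the file can be read back without picking up an extra `Unnamed: 0` column.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for long_df_seq (hypothetical values)
toy = pd.DataFrame({
    "protein": ["P1", "P2"],
    "label": [1, 0],
    "sequence": ["MKT", "MAA"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.csv")
    toy.to_csv(path)                             # index is written as the first column
    reloaded = pd.read_csv(path, index_col=0)    # use that first column as the index again
    naive = pd.read_csv(path)                    # without index_col, it becomes "Unnamed: 0"

print(list(reloaded.columns))
print(list(naive.columns))
```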