# Aging Biomarker Identification - Dataset generation


## Overview of project

This project aims to identify novel genes that can serve as biomarkers of longevity and to select candidate therapeutic targets, with a rationale grounded in biology and biologically relevant data. To that end, we use the STRING database, which contains human protein-protein interaction data, and run a random walk with restart to quantify each gene's connectivity to a set of established (i.e., known and proven) longevity genes. These are referred to as seed genes (sGenes) and are extracted from HAGR's GenAge (genage_human.csv). We will build a classifier that separates genes into good and bad targets for longevity therapeutics: Longevity-Relevant Genes (LRGs) and Non-longevity-Relevant Genes (NRGs). LRGs and NRGs are obtained from the targets of compounds in the DrugAge database, with LRGs targeted by compounds that increase lifespan and NRGs by compounds that decrease it. GenDR is used to augment the selection of LRGs, CellAge to augment the sGenes, and "Homo_sapiens.tsv" to translate genes across species. Lastly, model evaluation and selection will yield an optimal model, which will be applied to non-training genes to identify novel LRGs.
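
As a rough illustration of the random walk with restart mentioned above, the following minimal sketch propagates probability mass from a seed node over a toy network. The adjacency matrix, the restart probability of 0.3, and the helper name `random_walk_with_restart` are illustrative assumptions, not the settings used in this project.

```python
import numpy as np


def random_walk_with_restart(adjacency, seed_indices, restart_prob=0.3,
                             tol=1e-8, max_iter=1000):
    """Return steady-state visiting probabilities for a walk restarting at seeds."""
    adjacency = np.asarray(adjacency, dtype=float)
    # Column-normalize so each column holds transition probabilities
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # avoid division by zero for isolated nodes
    transition = adjacency / col_sums

    # The walker restarts uniformly over the seed nodes
    restart = np.zeros(adjacency.shape[0])
    restart[list(seed_indices)] = 1.0 / len(seed_indices)

    scores = restart.copy()
    for _ in range(max_iter):
        updated = (1 - restart_prob) * transition @ scores + restart_prob * restart
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores


# Toy network: a chain 0 - 1 - 2 - 3, with node 0 as the only seed
toy_adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
rwr_scores = random_walk_with_restart(toy_adjacency, seed_indices=[0])
print(rwr_scores)
```

In this toy chain the score decreases with distance from the seed, which is the property that lets the scores act as a measure of connectivity to the seed genes.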

## Table of Contents

1. [Longevity Data](#longevity-data)
2. [Gene Expression Data](#gene-expression-data)
3. [Protein-Protein Interaction Data](#protein-protein-interaction-data)
4. [Data Integration](#data-integration)
5. [Data Preprocessing](#data-preprocessing)
6. [Exploratory Data Analysis](#exploratory-data-analysis)

In [1]:
# Below we import all necessary libraries for running this notebook
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
pwd = os.getcwd()
data_dir = os.path.join(pwd, '../data')

## Longevity data

We have collected data from CellAge and GenAge, two of the Human Ageing Genomic Resources (HAGR) databases, which provide information on human genes associated with cellular senescence and aging, respectively.

From HAGR we also extract DrugAge and GenDR, which contain information on drugs that have anti-aging properties and on genes affected by dietary restriction, respectively. This information is useful because it provides a foundational definition of therapy effect in the context of extending longevity.
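
One way such a therapy-effect definition could be derived is sketched below. It uses the `compound_name` and `avg_lifespan_change` column names from the DrugAge table, but the rows are made up, and the labeling rule (sign of the mean lifespan change per compound) is an assumption for illustration rather than the rule adopted in this project.

```python
import pandas as pd

# Toy rows using DrugAge column names; compounds and values are made up
toy_drugage = pd.DataFrame({
    "compound_name": ["compound_a", "compound_a", "compound_b", "compound_c"],
    "avg_lifespan_change": [12.0, 8.0, -5.0, 0.0],
})

# Average the reported lifespan change per compound
mean_change = toy_drugage.groupby("compound_name")["avg_lifespan_change"].mean()

# Hypothetical rule: positive mean -> "positive", negative mean -> "negative",
# exactly zero -> left unlabeled
compound_effect = pd.Series("unlabeled", index=mean_change.index)
compound_effect[mean_change > 0] = "positive"
compound_effect[mean_change < 0] = "negative"
print(compound_effect.to_dict())
```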

### Data Source

- GenAge: [https://genomics.senescence.info/genes/human.html](https://genomics.senescence.info/genes/human.html)
- CellAge: [https://genomics.senescence.info/download.html#cellage](https://genomics.senescence.info/download.html#cellage)
- DrugAge: [https://genomics.senescence.info/drugs/](https://genomics.senescence.info/drugs/)
- GenDR: [https://genomics.senescence.info/diet/](https://genomics.senescence.info/diet/)
- BioMART (Ensembl): [http://www.ensembl.org/biomart](http://www.ensembl.org/biomart)

### Dataset

- genage_human.csv: List of manually curated genes identified through meta-analysis of aging human microarray data. 
- CellAge_genes.csv: List of genes experimentally identified as being associated with cellular senescence, along with experimental evidence, Gene Ontology terms, and orthologous genes in model organisms.
- drugage.csv: List of compounds observed to have an impact on longevity in healthy organisms.
- gendr_manipulations.csv: List of genes affected by dietary restriction.
- Homo_sapiens.tsv: File containing human homologs for all genes in GenDR.
- mart_export.txt: Tab-separated file containing naming information for translating genes across datasets, plus Gene Ontology terms for each gene.


In [3]:
genage_df = pd.read_csv(os.path.join(data_dir, 'HAGR/GenAge/genage_human.csv'))
display(genage_df.head())
cellage_df = pd.read_csv(os.path.join(
    data_dir, 'HAGR/CellAge/cellage3.tsv'), sep='\t', header=0)
cellage_df.head()

Unnamed: 0,GenAge ID,symbol,name,entrez gene id,uniprot,why
0,1,GHR,growth hormone receptor,2690,GHR_HUMAN,mammal
1,2,GHRH,growth hormone releasing hormone,2691,SLIB_HUMAN,mammal
2,3,SHC1,SHC (Src homology 2 domain containing) transfo...,6464,SHC1_HUMAN,mammal
3,4,POU1F1,POU class 1 homeobox 1,5449,PIT1_HUMAN,mammal
4,5,PROP1,PROP paired-like homeobox 1,5626,PROP1_HUMAN,mammal


Unnamed: 0,Entrez ID,Gene symbol,Gene name,Cancer Cell,Type of senescence,Senescence Effect,Reference
0,22848,AAK1,AP2 associated kinase 1,No,Unclear,Induces,26583757
1,5243,ABCB1,ATP binding cassette subfamily B member 1,Yes,Stress-induced,Induces,10825123
2,368,ABCC6,ATP binding cassette subfamily C member 6,Yes,Unclear,Inhibits,28536638
3,51225,ABI3,ABI family member 3,Yes,Oncogene-induced,Induces,21223585
4,25890,ABI3BP,ABI family member 3 binding protein,Yes,Unclear,Induces,18559958


In [4]:
drugage_df = pd.read_csv(os.path.join(data_dir, 'HAGR/DrugAge/drugage.csv')
                         ).drop(columns=[f'Unnamed: {i}' for i in range(10, 12)])
display(drugage_df.head())
gendr_df = pd.read_csv(os.path.join(data_dir, 'HAGR/GenDR/gendr_manipulations.csv')
                       ).drop(columns=[f'Unnamed: {i}' for i in range(5, 286)])
display(gendr_df.head())
gendr_homologs_df = pd.read_csv(os.path.join(
    data_dir, 'HAGR/GenDR/Homo_sapiens.tsv'), sep='\t')
gendr_homologs_df.head()

Unnamed: 0,compound_name,cas_number,species,strain,dosage,avg_lifespan_change,max_lifespan_change,gender,significance,pubmed_id
0,Ethanol,64-17-5,Drosophila mojavensis,A350,2%,57.65,,FEMALE,O,13369
1,Ethanol,64-17-5,Drosophila mojavensis,A350,2%,59.55,,MALE,O,13369
2,Ethanol,64-17-5,Drosophila mojavensis,A350,4%,28.42,,FEMALE,O,13369
3,Ethanol,64-17-5,Drosophila mojavensis,A350,4%,21.75,,MALE,O,13369
4,Ethanol,64-17-5,Drosophila mojavensis,A420,2%,89.09,,FEMALE,O,13369


Unnamed: 0,GenDR ID,gene symbol,species,entrez gene id,gene name
0,1,SIR2,Saccharomyces cerevisiae,851520,Silent Information Regulator 2
1,2,CDC25,Saccharomyces cerevisiae,851019,Cell Division Cycle 25
2,3,HAP4,Saccharomyces cerevisiae,853751,Heme Activator Protein 4
3,4,PNC1,Saccharomyces cerevisiae,852846,Pyrazinamidase/NiCotinamidase 1
4,5,CYT1,Saccharomyces cerevisiae,854231,Cyt1p


Unnamed: 0,gene_symbol,species_name,entrez_id,homolog species,homolog entrez_id,homolog identifier,homolog symbol
wwp-1,Caenorhabditis elegans,171647,Homo sapiens,11060,232,WWP2,
wwp-1,Caenorhabditis elegans,171647,Homo sapiens,11059,232,WWP1,
wwp-1,Caenorhabditis elegans,171647,Homo sapiens,83737,232,ITCH,
let-363,Caenorhabditis elegans,172167,Homo sapiens,2475,43,MTOR,
vps-34,Caenorhabditis elegans,172280,Homo sapiens,5289,657,PIK3C3,


## Gene Expression Data

We have collected publicly available gene expression datasets from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). These datasets contain aging-related gene expression profiles for various tissue types and organisms.

### Data Sources

- GEO: [https://www.ncbi.nlm.nih.gov/geo/](https://www.ncbi.nlm.nih.gov/geo/)
- TCGA: [https://portal.gdc.cancer.gov/](https://portal.gdc.cancer.gov/)

### Datasets

- GSEXXXX: Description of the dataset, species, tissue, number of samples, etc.
- GSEYYYY: Description of the dataset, species, tissue, number of samples, etc.
- ...



## Protein-Protein Interaction Data

We have collected protein-protein interaction (PPI) data from STRING and BioGRID to incorporate network interactions as additional features for each gene.

### Data Sources

- STRING: [https://string-db.org/](https://string-db.org/)
- BioGRID: [https://thebiogrid.org/](https://thebiogrid.org/)

### Versions

- STRING: Version 11.5
- BioGRID: Version Y.Y




In [5]:
string_human_df = pd.read_csv(os.path.join(
    data_dir, 'STRING/9606.protein.links.full.v11.5.txt'), sep=' ')
display(string_human_df.head())
string_aliases = pd.read_csv(os.path.join(
    data_dir, 'STRING/9606.protein.aliases.v11.5.txt'), sep='\t')
display(string_aliases.head())

Unnamed: 0,protein1,protein2,neighborhood,neighborhood_transferred,fusion,cooccurence,homology,coexpression,coexpression_transferred,experiments,experiments_transferred,database,database_transferred,textmining,textmining_transferred,combined_score
0,9606.ENSP00000000233,9606.ENSP00000379496,0,0,0,0,0,0,54,0,0,0,0,103,85,155
1,9606.ENSP00000000233,9606.ENSP00000314067,0,0,0,0,0,0,0,0,180,0,0,0,61,197
2,9606.ENSP00000000233,9606.ENSP00000263116,0,0,0,0,0,0,62,0,152,0,0,0,101,222
3,9606.ENSP00000000233,9606.ENSP00000361263,0,0,0,0,0,0,0,0,161,0,0,47,58,181
4,9606.ENSP00000000233,9606.ENSP00000409666,0,0,0,0,0,60,63,0,213,0,0,0,72,270


Unnamed: 0,#string_protein_id,alias,source
0,9606.ENSP00000000233,2B6H,BLAST_UniProt_DR_PDB
1,9606.ENSP00000000233,2B6H,Ensembl_HGNC_UniProt_ID(supplied_by_UniProt)_D...
2,9606.ENSP00000000233,2B6H,Ensembl_PDB
3,9606.ENSP00000000233,2B6H,Ensembl_UniProt_DR_PDB
4,9606.ENSP00000000233,381,BLAST_KEGG_GENEID


## Data Integration

To create a more relevant network that can yield more valuable insights, we use additional information to describe the genes (nodes) and their relationships (edges). 
<br>
<br>

### Network building
#### Nodes
Nodes will be all genes present in the STRING dataframe. We therefore begin by standardizing their naming to the gene symbols used in the HAGR databases.
STRING protein identifiers are encoded in the format 9606.<protein ID> (e.g., 9606.ENSP00000000233), where 9606 is the taxonomy code for Homo sapiens. We translate the IDs through a dictionary, but this is not exhaustive, as not all entries in the alias dataframe have an EntrezGene entry. We therefore use a function that narrows the candidate names down to the most likely one. This process should suffice for our purpose; it could be made more precise with additional data, but we make use of the resources available.
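
The translation idea can be sketched on toy data as follows. The two lookup dictionaries and the gene ID `ENSG_EXAMPLE_1` are hypothetical placeholders; the real lookups would be built from an ID-conversion table. The second STRING ID is deliberately absent from the lookup to show how untranslatable proteins end up as missing values.

```python
import pandas as pd

# Hypothetical lookup tables (placeholder gene ID, for illustration only)
toy_protein_to_gene = {"ENSP00000000233": "ENSG_EXAMPLE_1"}
toy_gene_to_symbol = {"ENSG_EXAMPLE_1": "ARF5"}

string_ids = pd.Series(["9606.ENSP00000000233", "9606.ENSP99999999999"])

# Drop the taxonomy prefix, then chain the two lookups
protein_ids = string_ids.str.split(".").str[1]
symbols = protein_ids.map(toy_protein_to_gene).map(toy_gene_to_symbol)
print(symbols.tolist())
```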

In [6]:
# Guard so that re-running this cell won't undo previous progress:
# untranslated STRING IDs still carry the '9606.' taxonomy prefix
previously_run = '9606' not in str(string_human_df['protein1'].iloc[0])
if not previously_run:
    biomart_table = pd.read_csv(os.path.join(
        data_dir, 'misc/mart_export.txt'), sep='\t')

    # We create the dictionaries for translating protein -> gene -> symbol
    protein_gene_dict = dict(
        zip(biomart_table['Protein stable ID'], biomart_table['Gene stable ID']))
    gene_symbol_dict = dict(
        zip(biomart_table['Gene stable ID'], biomart_table['Gene name']))

    for column in ['protein1', 'protein2']:
        # First, we reformat STRING proteins to drop the taxonomy code
        protein_ids = string_human_df[column].map(
            lambda string: string.split('.')[1])
        # Next, we translate to gene code, then to symbol
        string_human_df[column] = protein_ids.map(
            protein_gene_dict).map(gene_symbol_dict)
string_human_df

Unnamed: 0,protein1,protein2,neighborhood,neighborhood_transferred,fusion,cooccurence,homology,coexpression,coexpression_transferred,experiments,experiments_transferred,database,database_transferred,textmining,textmining_transferred,combined_score
0,ARF5,PDE1C,0,0,0,0,0,0,54,0,0,0,0,103,85,155
1,ARF5,PAK2,0,0,0,0,0,0,0,0,180,0,0,0,61,197
2,ARF5,RAB36,0,0,0,0,0,0,62,0,152,0,0,0,101,222
3,ARF5,RAPGEF1,0,0,0,0,0,0,0,0,161,0,0,47,58,181
4,ARF5,SUMO3,0,0,0,0,0,60,63,0,213,0,0,0,72,270
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11938493,,,0,0,0,0,872,213,0,0,0,0,0,0,0,213
11938494,,,0,0,0,0,899,152,0,0,0,0,0,0,0,151
11938495,,KRTAP19-2,0,0,0,0,0,182,0,0,0,0,0,0,0,181
11938496,,OR4D6,0,0,0,0,843,155,0,0,0,0,0,0,0,154


In [7]:
# Overwrite the BioMart table to release the memory it occupies
biomart_table = 0

Next, we use [Gene2Vec](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-5370-x) embeddings to represent gene co-expression. The embeddings, which were trained on co-expression data from 984 GEO datasets, have been found to capture gene localization (tissues) and function (pathways).

Gene2Vec's [GitHub repository](https://github.com/jingcheng-du/Gene2vec) provides pre-trained gene embeddings, which we use to estimate gene relatedness as the cosine similarity between the embeddings of the two genes in each pair.
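
For intuition, cosine similarity and the corresponding cosine distance can be computed as in the minimal sketch below. The 3-dimensional vectors are made up and stand in for the much longer Gene2Vec embeddings; the helper name `toy_cosine_similarity` is hypothetical.

```python
import numpy as np


def toy_cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors (1 = same direction, -1 = opposite)."""
    vec_a = np.asarray(vec_a, dtype=float)
    vec_b = np.asarray(vec_b, dtype=float)
    return float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))


# Made-up vectors standing in for gene embeddings
embedding_a = [1.0, 0.0, 1.0]
embedding_b = [1.0, 0.0, 0.0]

similarity = toy_cosine_similarity(embedding_a, embedding_b)
distance = 1 - similarity  # cosine distance, between 0 and 2
print(round(similarity, 4), round(distance, 4))
```

Because the measure depends only on the angle between the vectors, scaling either embedding leaves the result unchanged.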

In [11]:
# We first load in the embedding data using Gensim's KeyedVectors and generate a function that will fetch each embedding
from gensim.models import KeyedVectors


def find_alternative_gene_name(gene, gene_symbol_dict_rev, protein_gene_dict_rev, aliases_by_protein):
    """
    This function aims to replace previously translated gene names with the names
    used by the gene2vec naming convention.

    Args:
        gene (str): gene symbol
        gene_symbol_dict_rev (dict): dictionary translating from gene symbol to Ensembl gene ID
        protein_gene_dict_rev (dict): dictionary translating from Ensembl gene ID to Ensembl protein ID
        aliases_by_protein (dict): dictionary mapping each protein ID to its list of STRING aliases

    Returns:
        str: gene name present in g2v embedding object or original name
    """
    # The symbol is already known to the embeddings: nothing to do
    if gene in gene_embeddings.key_to_index:
        return gene
    if gene in gene_symbol_dict_rev:
        gene = gene_symbol_dict_rev[gene]

    # Otherwise, try every STRING alias of the corresponding protein
    protein = protein_gene_dict_rev.get(gene, None)
    if protein:
        alias_genes = aliases_by_protein.get(protein, [])
        for alias in alias_genes:
            if alias in gene_embeddings.key_to_index:
                return alias

    return gene


def cosine_similarity(gene1, gene2):
    """
    Takes in two gene names, looks up their embeddings and calculates the
    cosine similarity between the vectors.
    ! This function is equivalent to gensim's .similarity !
    Args:
        gene1 (str): name of the first gene, as found in the embeddings
        gene2 (str): name of the second gene, as found in the embeddings

    Returns:
        float : cosine similarity between both vectors ranging from -1 to 1,
            or NaN if either gene has no embedding
    """
    try:
        vec1 = gene_embeddings[gene1]
        vec2 = gene_embeddings[gene2]
    except (KeyError, TypeError):
        # Gene missing from the embeddings (or not a valid key)
        return np.nan
    return vec1.dot(vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


def cosine_distance(gene1_in, gene2_in):
    """
    Takes in two gene symbols, resolves them to the names used by Gene2Vec and
    calculates the cosine similarity and cosine distance between their embeddings.
    ! The distance is equivalent to gensim's .distance !
    Args:
        gene1_in (str): symbol of the first gene
        gene2_in (str): symbol of the second gene

    Returns:
        pd.Series : [cosine similarity (-1 to 1), cosine distance (0 to 2)];
            both values are NaN if either gene has no embedding
    """
    # Fall back to the original symbol if no alternative name was recorded
    gene1 = alternative_gene_name_dict.get(gene1_in, gene1_in)
    gene2 = alternative_gene_name_dict.get(gene2_in, gene2_in)

    similarity = cosine_similarity(gene1, gene2)
    # NaN never compares equal to itself, so it must be tested with np.isnan
    if np.isnan(similarity):
        return pd.Series([np.nan, np.nan])
    return pd.Series([similarity, 1 - similarity])


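def process_batch(batch_df):
    """
    Adds the Gene2Vec cosine similarity ('g2v_cossim') and cosine distance
    ('g2v_cosdist') of each protein pair to a batch of STRING interactions.
    """
    batch_df[['g2v_cossim', 'g2v_cosdist']] = batch_df[['protein1', 'protein2']].apply(
        lambda row: cosine_distance(row['protein1'], row['protein2']), axis=1, result_type='expand')
    return batch_df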

In [12]:
# Load the pre-trained gene embeddings
gene_embeddings = KeyedVectors.load_word2vec_format(os.path.join(
    data_dir, 'G2V/gene2vec_dim_200_iter_9_w2v.txt'), binary=False)

# Translate the STRING gene names to those names which could be in the embedding data
gene_symbol_dict_rev = {value: key for key, value in gene_symbol_dict.items()}
string_aliases['sole_protein_id'] = string_aliases['#string_protein_id'].map(
    lambda value: value.split('.')[1])
protein_gene_dict_rev = {value: key for key,
                         value in protein_gene_dict.items()}

# Because iterating through all genes would be computationally redundant, we create a dictionary with corresponding gene names
unique_genes = set(string_human_df['protein1']).union(
    set(string_human_df['protein2']))
aliases_by_protein = string_aliases.groupby(
    'sole_protein_id')['alias'].apply(list).to_dict()

alternative_gene_name_dict = {}
for gene in unique_genes:
    alternative_gene_name_dict[gene] = find_alternative_gene_name(
        gene, gene_symbol_dict_rev, protein_gene_dict_rev, aliases_by_protein)

# Obtain the embeddings for each pair and get the cosine similarity
# Dropping NAs deletes about 900k pairs. We could recover them by integrating additional
# database ID conversion information
string_human_df = string_human_df.dropna()

# !!!! SET THE VARIABLE BELOW TO True IF ALREADY RUN; THE CODE BELOW MAY BE TIME CONSUMING !!!!
# You may also consider increasing the batch size if your computational resources allow it
previously_run = True
if not previously_run:
    # We process the data in batches due to memory restrictions

    batch_size = 200000  # Adjust this value based on available memory
    num_batches = int(np.ceil(len(string_human_df) / batch_size))

    processed_batches = []

    for i in tqdm(range(num_batches), desc="Processing batches", unit="batch"):
        start = i * batch_size
        end = (i + 1) * batch_size
        batch_df = string_human_df.iloc[start:end].copy()
        processed_batch = process_batch(batch_df)
        processed_batches.append(processed_batch)

    # Combine the processed batches
    string_human_df_processed = pd.concat(processed_batches, ignore_index=True)
    string_human_df_processed.to_csv(os.path.join(
        data_dir, 'processed_string_hDf.csv'), index=False)

else:
    string_human_df_processed = pd.read_csv(
        os.path.join(data_dir, 'processed_string_hDf.csv'))


string_human_df_processed.head()


In [13]:

string_human_df_processed.head()

Unnamed: 0,protein1,protein2,neighborhood,neighborhood_transferred,fusion,cooccurence,homology,coexpression,coexpression_transferred,experiments,experiments_transferred,database,database_transferred,textmining,textmining_transferred,combined_score,g2v_cossim,g2v_cosdist
0,ARF5,PDE1C,0,0,0,0,0,0,54,0,0,0,0,103,85,155,-0.066254,1.066254
1,ARF5,PAK2,0,0,0,0,0,0,0,0,180,0,0,0,61,197,0.148998,0.851002
2,ARF5,RAB36,0,0,0,0,0,0,62,0,152,0,0,0,101,222,0.098858,0.901142
3,ARF5,RAPGEF1,0,0,0,0,0,0,0,0,161,0,0,47,58,181,0.099199,0.900801
4,ARF5,SUMO3,0,0,0,0,0,60,63,0,213,0,0,0,72,270,0.198929,0.801071


### Debugging
The process above led to a discrepancy in the expected volume of data. As the first code cell below shows, the number of non-NA values should be much higher than what we obtained (roughly 10M rather than roughly 2M).

In [None]:
# Get the unique genes in gene_embeddings
genes_in_embeddings = set(gene_embeddings.key_to_index)

# Get the unique genes in string_human_df for protein1 and protein2
unique_genes_string_human_df = set(string_human_df['protein1']).union(
    set(string_human_df['protein2']))

# Calculate the intersection of genes in gene_embeddings and string_human_df
common_genes = genes_in_embeddings.intersection(unique_genes_string_human_df)

# Print the number of common genes
print(
    f"Number of common genes: {len(common_genes)} of a total {len(unique_genes_string_human_df)} string genes")

# Create a boolean mask to check if both genes in each pair are in common_genes
both_genes_common_mask = string_human_df['protein1'].isin(
    common_genes) & string_human_df['protein2'].isin(common_genes)

# Count the number of pairs with both genes in common_genes
both_genes_common_count = both_genes_common_mask.sum()

print(
    f"Number of pairs with both genes in common genes: {both_genes_common_count}")

Number of common genes: 16353 of a total 18323 string genes
Number of pairs with both genes in common genes: 9785708


The output above shows that more than 9M rows should have non-NA values for both cosine similarity and cosine distance.

In [None]:
# Create a boolean mask for rows where both genes are in common genes
both_genes_common_mask = string_human_df_processed.apply(
    lambda row: row['protein1'] in common_genes and row['protein2'] in common_genes, axis=1)

# Create a boolean mask for rows where the cosine distance is NaN
g2v_cosdist_nan_mask = string_human_df_processed['g2v_cosdist'].isna()

# Find the rows where both genes are in common genes, but the cosine distance is NaN
discrepancy_mask = both_genes_common_mask & g2v_cosdist_nan_mask
# Index the processed dataframe: the mask was built from it, so their indices align.
# Indexing string_human_df instead raises "IndexingError: Unalignable boolean Series
# provided as indexer", because it keeps its pre-dropna index
discrepancy_df = string_human_df_processed[discrepancy_mask]

print(f"Number of discrepancies: {len(discrepancy_df)}")
print("Examples of discrepancies:")
print(discrepancy_df.head())

In [None]:
# Sanity check: gensim's own distance for the first protein pair
first_pair = string_human_df[['protein1', 'protein2']].iloc[0]
gene_embeddings.distance(first_pair['protein1'], first_pair['protein2'])

1.0662540671033174

In [None]:
# Generate an object that contains node information
node_characteristics_df = pd.DataFrame({"gene": sorted(
    {i for i in gene_symbol_dict.values() if isinstance(i, str)})})

node_characteristics_df

Unnamed: 0,gene
0,A1BG
1,A1CF
2,A2M
3,A2ML1
4,A3GALT2
...,...
19835,ZYG11A
19836,ZYG11B
19837,ZYX
19838,ZZEF1


In [None]:
gene_symbol_dict

{'ENSG00000198888': 'MT-ND1',
 'ENSG00000198763': 'MT-ND2',
 'ENSG00000198804': 'MT-CO1',
 'ENSG00000198712': 'MT-CO2',
 'ENSG00000228253': 'MT-ATP8',
 'ENSG00000198899': 'MT-ATP6',
 'ENSG00000198938': 'MT-CO3',
 'ENSG00000198840': 'MT-ND3',
 'ENSG00000212907': 'MT-ND4L',
 'ENSG00000198886': 'MT-ND4',
 'ENSG00000198786': 'MT-ND5',
 'ENSG00000198695': 'MT-ND6',
 'ENSG00000198727': 'MT-CYB',
 'ENSG00000278704': nan,
 'ENSG00000271254': nan,
 'ENSG00000281486': 'SNTG2',
 'ENSG00000273735': 'KIR3DL2',
 'ENSG00000280663': 'PCMTD2',
 'ENSG00000262826': 'INTS3',
 'ENSG00000276379': 'KIR3DL1',
 'ENSG00000276433': 'KIR3DL3',
 'ENSG00000273947': 'KIR2DL3',
 'ENSG00000276518': 'KIR3DP1',
 'ENSG00000276595': 'PTCHD3',
 'ENSG00000278152': 'KIR2DS2',
 'ENSG00000277044': 'OPRL1',
 'ENSG00000277339': 'NPBWR2',
 'ENSG00000276044': 'KIR2DL4',
 'ENSG00000277554': 'KIR2DL3',
 'ENSG00000276779': 'KIR2DL4',
 'ENSG00000275974': 'PPIAL4F',
 'ENSG00000276885': 'KIR2DS4',
 'ENSG00000274714': 'KIR2DS4',
 'ENSG00

In [None]:
biomart_table

Unnamed: 0,Gene stable ID,Protein stable ID,Reactome ID,NCBI gene (formerly Entrezgene) ID,"BioGRID Interaction data, The General Repository for Interaction Datasets ID",Gene name
0,ENSG00000198888,ENSP00000354687,R-HSA-1430728,4535.0,110631.0,MT-ND1
1,ENSG00000198888,ENSP00000354687,R-HSA-1428517,4535.0,110631.0,MT-ND1
2,ENSG00000198888,ENSP00000354687,R-HSA-163200,4535.0,110631.0,MT-ND1
3,ENSG00000198888,ENSP00000354687,R-HSA-611105,4535.0,110631.0,MT-ND1
4,ENSG00000198888,ENSP00000354687,R-HSA-6799198,4535.0,110631.0,MT-ND1
...,...,...,...,...,...,...
519070,ENSG00000162437,ENSP00000397069,,55225.0,,RAVER2
519071,ENSG00000122432,ENSP00000514413,,100505741.0,,SPATA1
519072,ENSG00000122432,ENSP00000514414,,100505741.0,,SPATA1
519073,ENSG00000122432,ENSP00000514416,,100505741.0,,SPATA1


## Data Preprocessing

We have preprocessed the gene expression data by performing normalization, quality control, and mapping gene symbols to standardized identifiers. For PPI data, we filtered interactions based on a confidence score threshold and mapped protein identifiers to gene identifiers.
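
A confidence-score filter of the kind described above might look like the following sketch. It reuses STRING's `protein1`, `protein2` and `combined_score` column names, but the gene names and scores are made up, and the cutoff of 700 is an assumed value rather than one fixed by this notebook.

```python
import pandas as pd

# Toy edge list using STRING's column names; genes and scores are made up
toy_edges = pd.DataFrame({
    "protein1": ["GENE_A", "GENE_A", "GENE_B"],
    "protein2": ["GENE_B", "GENE_C", "GENE_C"],
    "combined_score": [950, 155, 700],
})

score_threshold = 700  # assumed cutoff, for illustration only
high_confidence_edges = toy_edges[toy_edges["combined_score"] >= score_threshold]
print(high_confidence_edges)
```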

### Normalization Methods

- Gene expression data: RPKM/FPKM, TPM, or other normalization methods depending on the data type (microarray or RNA-seq)
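
As an example of one of the normalization options listed above, the sketch below computes TPM from raw counts. The count matrix and gene lengths are made up; which method is appropriate depends on the data type, so this is an illustration rather than the method used here.

```python
import numpy as np

# Made-up read counts (genes x samples) and gene lengths in kilobases
toy_counts = np.array([[100.0, 200.0],
                       [300.0, 600.0],
                       [600.0, 200.0]])
toy_lengths_kb = np.array([1.0, 2.0, 3.0])

# Step 1: divide counts by gene length to get reads per kilobase
reads_per_kb = toy_counts / toy_lengths_kb[:, None]
# Step 2: scale each sample so that its values sum to one million
tpm = reads_per_kb / reads_per_kb.sum(axis=0) * 1e6
print(tpm.sum(axis=0))
```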

### Quality Control

- Removal of low-quality samples or genes
- Removal of batch effects using methods like ComBat

### Identifier Mapping

- Mapping gene symbols to Entrez Gene IDs or Ensembl Gene IDs



## Exploratory Data Analysis

In this section, we present a series of plots and visualizations that help us understand the distribution and properties of our data.

### Gene Expression Data

- Sample distribution across datasets, tissue types, and organisms
- Heatmap of gene expression values
- Principal Component Analysis (PCA) or t-SNE plots

### Protein-Protein Interaction Data

- Distribution of interaction confidence scores
- Degree distribution of the PPI network
- Distribution of network-based features (e.g., node degree, betweenness centrality, clustering coefficient)
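
The degree distribution listed above could be derived from an edge list as in this sketch. The edges are made up, and the code assumes each undirected interaction appears only once; if a file lists every pair in both directions, counting a single column would be enough.

```python
import pandas as pd

# Toy undirected edge list, each interaction listed once
toy_edges = pd.DataFrame({
    "protein1": ["GENE_A", "GENE_A", "GENE_A", "GENE_B"],
    "protein2": ["GENE_B", "GENE_C", "GENE_D", "GENE_C"],
})

# Degree of a node = number of edges in which it appears
node_degree = pd.concat(
    [toy_edges["protein1"], toy_edges["protein2"]]).value_counts()

# Degree distribution = number of nodes having each degree
degree_distribution = node_degree.value_counts().sort_index()
print(degree_distribution)
```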

This notebook provides an overview of the data used in our study. The next steps involve generating features, selecting relevant genes and network features, and training machine learning models for biomarker identification.