# Aging network exploration
This notebook explores the STRING database and the Gene2Vec embedding dataset. The primary aim is to find genes related to known longevity genes (LGs) using protein-protein interaction network information. The rationale is that gene networks describe how genes interact, which helps characterize gene function and identify intervention points by locating influential genes. For example, if gene A is known to play a key role in aging when underexpressed, and the network reveals that gene B activates gene A, then both genes become candidates for intervention.
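
The idea can be sketched on a toy network. The gene names, the edges, and the `known_lgs` set below are hypothetical placeholders, not STRING data; the sketch only shows how direct neighbors of known LGs become candidates for follow-up. An undirected graph like this one does not say which gene acts on which, so direction would have to come from other evidence.

```python
import networkx as nx

# Hypothetical toy network: gene names and edges are made up for illustration
toy = nx.Graph()
toy.add_edges_from([
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneC", "geneD"),
    ("geneE", "geneF"),
])

known_lgs = {"geneA"}  # assumed set of known longevity genes

# Candidate genes: direct neighbors of any known LG that are not LGs themselves
candidates = set()
for lg in known_lgs:
    candidates.update(toy.neighbors(lg))
candidates -= known_lgs

print(sorted(candidates))
```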

#### Data
**STRING** data used in this notebook comes from the STRING *Homo sapiens* dataset.
**Gene2Vec** data consists of the embeddings created by Du et al. (2019) from 984 GEO datasets containing gene co-expression information. These embeddings will be used to augment the interpretation of the network results.
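
One common way to compare embedding vectors is cosine similarity. The sketch below assumes that this is how the embeddings would be compared, which the notebook does not state; the three-dimensional vectors and gene names are hypothetical and far smaller than real Gene2Vec embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, for illustration only
embeddings = {
    "geneA": np.array([1.0, 0.0, 1.0]),
    "geneB": np.array([2.0, 0.0, 2.0]),  # same direction as geneA
    "geneC": np.array([0.0, 1.0, 0.0]),  # orthogonal to geneA
}

sim_ab = cosine_similarity(embeddings["geneA"], embeddings["geneB"])
sim_ac = cosine_similarity(embeddings["geneA"], embeddings["geneC"])
print(sim_ab, sim_ac)
```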

### Dataset import
We use the formatted STRING protein-protein interaction dataframe from dataset_selection.ipynb.

In [2]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
import networkx as nx
pwd = os.getcwd()
data_dir = os.path.join(pwd, '../data')

# Contains network info of protein-protein interactions from the preprocessed STRING df (from dataset_selection.ipynb)
string_human_df_processed = pd.read_csv(
        os.path.join(data_dir, 'processed_string_hDf.csv'))

## Protein interaction
Below we derive features from the protein-protein interaction data. To that end, we treat the data as a network and keep all edge attributes.
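
As a minimal sketch of what keeping all edge attributes means, the toy edge list below (hypothetical protein IDs and scores) is turned into a graph with `edge_attr=True`, so every remaining column is stored on the corresponding edge.

```python
import pandas as pd
import networkx as nx

# Hypothetical edge list with two evidence columns, for illustration only
edges = pd.DataFrame({
    "protein1": ["P1", "P1", "P2"],
    "protein2": ["P2", "P3", "P3"],
    "coexpression": [54, 0, 62],
    "combined_score": [155, 197, 222],
})

# edge_attr=True attaches every non-endpoint column to the edge
G_toy = nx.from_pandas_edgelist(edges, "protein1", "protein2", edge_attr=True)

# Edge attributes are available as a dict per edge
attrs = G_toy["P1"]["P2"]
print(dict(attrs))
```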

### Node importance feature generation
Below we calculate several measures of how important a given protein is in the network.
- **Degree** Centrality: The fraction of other nodes a node is directly connected to.
- **Closeness** Centrality: The reciprocal of a node's average shortest-path distance to all other reachable nodes.
- **Betweenness** Centrality: The fraction of shortest paths between other node pairs that pass through a node, i.e. its control over interactions between other nodes.
- **Eigenvector** Centrality: Assesses a node's influence based on how influential its neighbors are.
- **Clustering** Coefficient: The fraction of a node's neighbor pairs that are themselves connected.
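
To make these measures concrete, here is a small sketch on a hypothetical four-node graph: a triangle A-B-C with one extra node D attached to C. Node C is the hub, so it scores highest on degree, closeness and betweenness, while A and B have the highest clustering because their two neighbors are connected to each other.

```python
import networkx as nx

# Hypothetical toy graph: triangle A-B-C plus a pendant node D attached to C
toy = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])

degree = nx.degree_centrality(toy)
closeness = nx.closeness_centrality(toy)
betweenness = nx.betweenness_centrality(toy)
eigenvector = nx.eigenvector_centrality(toy)
clustering = nx.clustering(toy)

# Print one row per node for a side-by-side comparison
for node in sorted(toy.nodes):
    print(
        node,
        round(degree[node], 3),
        round(closeness[node], 3),
        round(betweenness[node], 3),
        round(eigenvector[node], 3),
        round(clustering[node], 3),
    )
```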


In [3]:
# Create network graph
G = nx.from_pandas_edgelist(string_human_df_processed, 'protein1', 'protein2', edge_attr=True)

# Calculate centrality measures (exact closeness and betweenness can be slow on large graphs)
degree_centrality = nx.degree_centrality(G)
closeness_centrality = nx.closeness_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
eigenvector_centrality = nx.eigenvector_centrality(G)

# Calculate clustering coefficient
clustering_coefficient = nx.clustering(G)

# Convert to dataframes
degree_df = pd.DataFrame(degree_centrality.items(), columns=['protein', 'degree_centrality'])
closeness_df = pd.DataFrame(closeness_centrality.items(), columns=['protein', 'closeness_centrality'])
betweenness_df = pd.DataFrame(betweenness_centrality.items(), columns=['protein', 'betweenness_centrality'])
eigenvector_df = pd.DataFrame(eigenvector_centrality.items(), columns=['protein', 'eigenvector_centrality'])
clustering_df = pd.DataFrame(clustering_coefficient.items(), columns=['protein', 'clustering_coefficient'])

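# Merge all dataframes column-wise; rows are matched by position, which assumes
# every measure lists the proteins in the same order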
feature_df = pd.concat([degree_df, closeness_df['closeness_centrality'], 
                        betweenness_df['betweenness_centrality'], eigenvector_df['eigenvector_centrality'],
                        clustering_df['clustering_coefficient']], axis=1)


19566
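
A positional concatenation only works if every measure lists the proteins in the same order. As a hedged alternative, the sketch below builds the feature table directly from the per-node dictionaries so that pandas aligns rows on the protein ID. The toy graph and the names `toy` and `toy_features` are hypothetical and not part of the notebook's pipeline.

```python
import pandas as pd
import networkx as nx

# Hypothetical three-protein chain, for illustration only
toy = nx.Graph([("P1", "P2"), ("P2", "P3")])

# Each measure is a dict keyed by node; pandas aligns them on that key
measures = {
    "degree_centrality": nx.degree_centrality(toy),
    "clustering_coefficient": nx.clustering(toy),
}

toy_features = (
    pd.DataFrame(measures)
    .rename_axis("protein")  # name the index so it becomes a 'protein' column
    .reset_index()
)
print(toy_features)
```

Because alignment happens on the node key, the result does not depend on the order in which each function returns its values.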

In [2]:
string_human_network

Unnamed: 0,protein1,protein2,neighborhood,neighborhood_transferred,fusion,cooccurence,homology,coexpression,coexpression_transferred,experiments,experiments_transferred,database,database_transferred,textmining,textmining_transferred,combined_score
0,9606.ENSP00000000233,9606.ENSP00000379496,0,0,0,0,0,0,54,0,0,0,0,103,85,155
1,9606.ENSP00000000233,9606.ENSP00000314067,0,0,0,0,0,0,0,0,180,0,0,0,61,197
2,9606.ENSP00000000233,9606.ENSP00000263116,0,0,0,0,0,0,62,0,152,0,0,0,101,222
3,9606.ENSP00000000233,9606.ENSP00000361263,0,0,0,0,0,0,0,0,161,0,0,47,58,181
4,9606.ENSP00000000233,9606.ENSP00000409666,0,0,0,0,0,60,63,0,213,0,0,0,72,270
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11938493,9606.ENSP00000485678,9606.ENSP00000354800,0,0,0,0,872,213,0,0,0,0,0,0,0,213
11938494,9606.ENSP00000485678,9606.ENSP00000308270,0,0,0,0,899,152,0,0,0,0,0,0,0,151
11938495,9606.ENSP00000485678,9606.ENSP00000335660,0,0,0,0,0,182,0,0,0,0,0,0,0,181
11938496,9606.ENSP00000485678,9606.ENSP00000300127,0,0,0,0,843,155,0,0,0,0,0,0,0,154
