# Initial notes from discussion with Georg
- Georg likes PCA idea
- We are no longer modelling whole protein complexes, because we remove nodes that may belong to them. This may influence our choice of community-finding algorithm: k-clique percolation, for example, allows overlapping communities, which better reflects how protein complexes behave. Is this even necessary? What is the biological connection between communities in our graph and the actual PPI network?
- Check the connectedness of the network that remains after removing non-NAFLD nodes: is it connected? How many isolated nodes are there?
- Plot centrality measures against degree. Consider their correlation and the intuitive meaning of each plot. For example, in a betweenness vs degree plot, a node with low degree but high betweenness may be a bottleneck. Nodes that buck the trend will be interesting to consider.
- Use community finding to identify clusters, then find the bottlenecks between them by inducing a graph (this may even surface functional information).
- Consider including nodes one degree of separation away from the yeast homolog network, so that we have a chance of finding significant proteins that do not appear in the literature.
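
The betweenness vs degree idea can be sketched on a toy graph. This is only an illustration under assumptions: the helper name `betweenness_vs_degree`, the barbell toy graph, and the ranking by betweenness divided by degree are all hypothetical choices, not something settled in the discussion.

```python
import networkx as nx


def betweenness_vs_degree(graph):
    """Return (node, degree, betweenness) rows, most bottleneck-like first.

    Assumption: "bottleneck-like" is scored as betweenness divided by degree,
    so nodes with few neighbours but many shortest paths through them rank high.
    """
    degree = dict(graph.degree())
    betweenness = nx.betweenness_centrality(graph)  # normalised by default
    rows = [(node, degree[node], betweenness[node]) for node in graph.nodes]
    # max(..., 1) avoids dividing by zero for isolated nodes
    return sorted(rows, key=lambda row: row[2] / max(row[1], 1), reverse=True)


# Toy graph: two 5-cliques (nodes 0-4 and 6-10) joined through bridge node 5
toy = nx.barbell_graph(5, 1)

ranking = betweenness_vs_degree(toy)
top_node, top_degree, top_betweenness = ranking[0]
print(f"Most bottleneck-like node: {top_node} "
      f"(degree {top_degree}, betweenness {top_betweenness:.3f})")
```

On the real network the same rows could be passed to a scatter plot, with the outliers inspected by hand.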

# Extra notes that I think of while writing code
- Handling essential nodes: do we want to include or exclude them?
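
One way to inform the include/exclude decision is to compare basic connectivity statistics with and without the nodes in question. The sketch below uses a toy graph; the helper names `connectivity_summary` and `without_nodes` are hypothetical, and node 0 merely stands in for an essential protein.

```python
import networkx as nx


def connectivity_summary(graph):
    """Return node count, edge count, giant component size and isolate count."""
    component_sizes = [len(c) for c in nx.connected_components(graph)]
    return {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "giant_component": max(component_sizes, default=0),
        "isolates": nx.number_of_isolates(graph),
    }


def without_nodes(graph, nodes_to_drop):
    """Return a copy of the graph with the given nodes removed."""
    pruned = graph.copy()
    pruned.remove_nodes_from(nodes_to_drop)  # nodes not in the graph are ignored
    return pruned


# Toy graph: hub 0 connected to leaves 1-4, plus one extra edge between 1 and 2
toy = nx.star_graph(4)
toy.add_edge(1, 2)

with_hub = connectivity_summary(toy)
without_hub = connectivity_summary(without_nodes(toy, {0}))
print("with hub:   ", with_hub)
print("without hub:", without_hub)
```

Running the same comparison on the real graph would show how much of the giant component depends on the essential nodes.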

In [1]:
import networkx as nx
import numpy as np
import scipy as sp
import pandas as pd

In [2]:
# Reading in graph 
G = nx.read_weighted_edgelist("yeast.txt",comments="#",nodetype=str)

In [3]:
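# Deleting edges that don't meet the threshold score (scores at or below it are removed)
threshold_score = 700
# Collect the edges first: removing edges while iterating over G.edges can raise a RuntimeError
low_score_edges = [(u, v) for u, v, weight in G.edges(data="weight") if weight <= threshold_score]
G.remove_edges_from(low_score_edges)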

In [4]:
# Removing essential nodes

df = pd.read_csv("essential_proteins.csv", header=None)
essential_proteins = df[1].values

for protein in essential_proteins:
  node_id = "4932." + protein  # named node_id to avoid shadowing the built-in str
  if node_id in G.nodes:
    G.remove_node(node_id)

In [5]:
# Getting list of all yeast homologs
df = pd.read_csv("human_to_yeast.csv")

# Note: some homologs in this list may also be present in the essential nodes list, so they won't be included in the subgraph.
homologs = list(set(df["homolog_systematic_name"]))


In [6]:
# Relabeling nodes to strip the "4932." prefix
H = nx.relabel_nodes(G, lambda x: x[5:])

In [7]:
# Getting subgraph induced by all yeast homologs
H0 = H.subgraph(homologs).copy()

In [8]:
# Getting subgraph induced by all nodes that are adjacent to a yeast homolog
# (a homolog itself is kept only if it appears in some homolog's neighbour list)
neighbours = set()

for node in H0.nodes:
  neighbours.update(H[node].keys())

H1 = H.subgraph(neighbours).copy()
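
The cell above collects the neighbours of every homolog, so a homolog with no homolog neighbours is left out of the result. If the intention is to keep the homologs themselves as well, a helper along these lines could make the choice explicit. The name `one_hop_nodes` and the `include_seeds` flag are hypothetical, and the toy graph is only for illustration.

```python
import networkx as nx


def one_hop_nodes(graph, seeds, include_seeds=True):
    """Return the nodes within one edge of any seed that is present in the graph.

    With include_seeds=False only the neighbours are returned, so a seed is
    kept only when it is itself a neighbour of another seed.
    """
    present_seeds = {seed for seed in seeds if seed in graph}
    neighbours = set()
    for seed in present_seeds:
        neighbours.update(graph.neighbors(seed))
    return neighbours | present_seeds if include_seeds else neighbours


# Toy graph: path 0-1-2-3-4 plus an isolated node 9
toy = nx.path_graph(5)
toy.add_node(9)

open_nodes = one_hop_nodes(toy, [0, 9], include_seeds=False)
closed_nodes = one_hop_nodes(toy, [0, 9])
print("neighbours only:   ", sorted(open_nodes))
print("neighbours + seeds:", sorted(closed_nodes))

# The induced subgraph is then built as in the notebook
closed_subgraph = toy.subgraph(closed_nodes).copy()
```

Which variant is appropriate depends on whether isolated homologs should count towards the network statistics.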

In [9]:
print(H) # Removing essential nodes only 
print(f"Size of giant component of H: {len(max(nx.connected_components(H), key=len))}")
print(f"Number of isolated nodes of H: {len(list(nx.isolates(H)))}")

print(H1) # Removing essential nodes + all nodes not adjacent to a yeast homolog
print(f"Size of giant component of H1: {len(max(nx.connected_components(H1), key=len))}")
print(f"Number of isolated nodes of H1: {len(list(nx.isolates(H1)))}")

print(H0) # Removing essential nodes + all nodes which aren't yeast homologs
print(f"Size of giant component of H0: {len(max(nx.connected_components(H0), key=len))}")
print(f"Number of isolated nodes of H0: {len(list(nx.isolates(H0)))}")

Graph with 5098 nodes and 53343 edges
Size of giant component of H: 4639
Number of isolated nodes of H: 428
Graph with 3264 nodes and 46662 edges
Size of giant component of H1: 3252
Number of isolated nodes of H1: 8
Graph with 602 nodes and 1968 edges
Size of giant component of H0: 527
Number of isolated nodes of H0: 58


We consider the giant components of networks H0 and H1.

In [10]:
# H0 and H1 are induced subgraphs of H, so inducing on them directly gives the same result
GC0 = H0.subgraph(max(nx.connected_components(H0), key=len)).copy()
GC1 = H1.subgraph(max(nx.connected_components(H1), key=len)).copy()
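
A possible next step on GC0 or GC1, picking up the community-finding question from the notes at the top: compare a partitioning method with k-clique percolation, which allows overlaps. The sketch below is an assumption-laden illustration on a toy graph; the choice of greedy modularity as the partitioning method, k = 3, and the helper `membership_counts` are all hypothetical and not decisions made in the notes.

```python
import networkx as nx
from networkx.algorithms.community import (
    greedy_modularity_communities,
    k_clique_communities,
)

# Toy graph: two 4-cliques, {0, 1, 2, 3} and {3, 4, 5, 6}, sharing node 3
toy = nx.Graph()
toy.add_edges_from(nx.complete_graph([0, 1, 2, 3]).edges)
toy.add_edges_from(nx.complete_graph([3, 4, 5, 6]).edges)

# Partitioning method: every node belongs to exactly one community
partition = [set(c) for c in greedy_modularity_communities(toy)]

# Overlapping method: k-clique percolation with k = 3
overlapping = [set(c) for c in k_clique_communities(toy, 3)]


def membership_counts(communities):
    """Count how many communities each node belongs to."""
    counts = {}
    for community in communities:
        for node in community:
            counts[node] = counts.get(node, 0) + 1
    return counts


# Nodes that sit in more than one k-clique community
shared_nodes = {n for n, count in membership_counts(overlapping).items() if count > 1}

# Edges running between different communities of the partition
community_of = {node: i for i, c in enumerate(partition) for node in c}
crossing_edges = [(u, v) for u, v in toy.edges if community_of[u] != community_of[v]]

print("partition:     ", partition)
print("overlapping:   ", overlapping)
print("shared nodes:  ", shared_nodes)
print("crossing edges:", crossing_edges)
```

The shared nodes and crossing edges are natural candidates when looking for bottlenecks between clusters.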