# Network Analysis with Profiling

In this example, we use three different methods to calculate the number of nodes, edges, and triangles, the average degree, and the density of a network, and to find the 20 most connected nodes. For each method we also track the CPU runtime and the RAM usage.

We implement a naïve loop-based method, a NetworkX-based method, and a SciPy sparse-matrix-based method to complete the above tasks, and compare their performance.

## Example from social circles data


### Data Description

https://snap.stanford.edu/data/ego-Facebook.html

This dataset (*facebook_combined.txt*) consists of circles from Facebook (4,039 nodes and 88,234 edges).
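
As a quick sanity check for the metrics computed later, the average degree and the density of an undirected graph follow directly from the node and edge counts quoted above. The following is a minimal sketch, assuming the graph has no self-loops or duplicate edges; the variable names are illustrative only.

```python
# Node and edge counts quoted in the dataset description
num_nodes = 4039
num_edges = 88234

# Each undirected edge contributes to the degree of two nodes
avg_degree = 2 * num_edges / num_nodes

# Density = actual edges / possible edges, where possible = n * (n - 1) / 2
density = 2 * num_edges / (num_nodes * (num_nodes - 1))

print(f"Average degree: {avg_degree:.2f}")
print(f"Density: {density:.6f}")
```

All three methods in this notebook should report values that agree with this calculation if the file is read correctly.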

### Profiling method for monitoring functions

Profiling by function gives us a high-level idea of how often functions are called and how long those calls last. One way to do this is to import the ```cProfile``` module and run a function using the ```cProfile.run()``` function, providing a string argument which is the command used to invoke the function. ```cProfile``` is part of the Python standard library and so is available without installing any additional packages. For example:

In [None]:
# Import the cProfile module
import cProfile

def is_even(value):
  if value%2 == 0:
    return True
  else:
    return False

def halve(value):
  return value / 2

def function_to_test(upper_value):
  result = 0
  for i in range(int(upper_value) + 1):
    if is_even(i):
      result = result + halve(i)

  print(result)

# Call the cProfile.run() method with a string argument that is the call to the function you want to profile
cProfile.run('function_to_test(1e7)')

The results show the total time spent running the code and the total number of function calls. Then, for each function, they show:
* ```ncalls```: the number of times the function was called.
* ```tottime```: the total time spent in the function, excluding time spent in functions called by the function.
* ```percall``` (first column): ```tottime``` divided by ```ncalls```, i.e. the time per call excluding time spent in functions called by the function.
* ```cumtime```: the total time spent in the function, including time spent in functions called by the function.
* ```percall``` (second column): ```cumtime``` divided by the number of primitive calls, i.e. the time per call including time spent in functions called by the function.
* ```filename:lineno(function)```: the filename, line number and function name.

The output will normally include a number of functions that you have not written or explicitly called. These are usually invoked as part of how Python internally executes your code. They normally contribute little to the run-time and can often be ignored.
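
When the report is long, it can help to sort it and show only the most expensive entries. The sketch below uses the standard library's ```pstats``` module for this; ```slow_sum``` is a hypothetical toy workload, not one of the notebook's functions.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Toy workload (hypothetical, for illustration only)."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# Collect profiling data only while the workload is running
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Sort by cumulative time and keep the five most expensive entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by ```"tottime"``` instead highlights the functions whose own code is slow, rather than those that merely call slow functions.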

### Exercise

Implement profiling for each of the network analysis functions, observe the output, and compare how the functions perform.

In [None]:
import networkx as nx
import scipy.sparse as sp
import psutil, time, os, gc, statistics, warnings
import numpy as np
import pandas as pd
# Import the cProfile module
import cProfile

warnings.filterwarnings('ignore')

### Functions

#### Tracking memory usage

In [None]:
# Function to get memory usage
def get_memory_usage():
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    return mem_info.rss / 1024 / 1024  # Memory in MB
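
The resident set size (RSS) used above reflects the whole process, so the difference between two readings can be small, zero, or even negative when memory that is already held gets reused. As a complementary view, the standard library's ```tracemalloc``` module tracks allocations made through Python's memory allocators. The sketch below is one possible way to use it; ```measure_peak``` is a hypothetical helper that is not part of this notebook, and memory allocated by native extension code may not be fully visible to it.

```python
import tracemalloc

def measure_peak(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MB). Hypothetical helper."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1024 / 1024

# Toy workload: build a list of 100,000 squares
squares, peak_mb = measure_peak(lambda: [i * i for i in range(100_000)])
print(f"Peak traced memory: {peak_mb:.2f} MB")
```

Tracing adds overhead of its own, so timing and memory measurements are best taken in separate runs.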


#### Three different methods

In [None]:

# Naive method using dictionary-based adjacency list
def analyze_naive(file_path):
    """
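    Naïve analysis of an undirected edge list using a dictionary of neighbor sets.

    Counts nodes, edges and triangles, computes the average degree and density,
    and finds the 20 most connected nodes.

    Args:
        file_path (str): Path to the input file.

    Returns:
        dict: Graph metrics, elapsed time, memory change, and the top 20 nodes
              (under the key "Top 20 Nodes").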
    """

    start_time = time.time()
    start_memory = get_memory_usage()
    
    # Initialize adjacency list as a dictionary
    graph = {}
    
    # Read edge list and build graph
    try:
        with open(file_path, 'r') as f:
            for line in f:
                if line.strip():
                    node1, node2 = map(int, line.strip().split())
                    # Add nodes and edges (undirected)
                    if node1 not in graph:
                        graph[node1] = set()
                    if node2 not in graph:
                        graph[node2] = set()
                    graph[node1].add(node2)
                    graph[node2].add(node1)
    except (ValueError, IOError) as e:
        print(f"Error reading file {file_path}: {e}")
        return {
            "Method": "Naive (Full)",
            "Nodes": 0,
            "Edges": 0,
            "Triangles": 0,
            "Avg Degree": 0,
            "Density": 0,
            "Time (s)": time.time() - start_time,
            "Memory (MB)": get_memory_usage() - start_memory,
            "Top 20 Nodes": []
        }
    
    num_nodes = len(graph)
    num_edges = sum(len(neighbors) for neighbors in graph.values()) // 2  # Divide by 2 for undirected
    
    # Count triangles, ensuring each triangle is counted once
    num_triangles = 0
    for node in graph:
        neighbors = sorted(graph[node])  # Sort neighbors for consistent ordering
        for i, n1 in enumerate(neighbors):
            if n1 < node:  # Skip smaller neighbors so that node is the smallest in the triangle
                continue
            for n2 in neighbors[i+1:]:
                if n2 < node:  # Ensure node is the smallest in the triangle
                    continue
                if n2 in graph[n1]:  # Check if n1 and n2 are connected
                    num_triangles += 1
    
    avg_degree = sum(len(neighbors) for neighbors in graph.values()) / num_nodes if num_nodes > 0 else 0
    density = (2 * num_edges) / (num_nodes * (num_nodes - 1)) if num_nodes > 1 else 0
    
    # Find top 20 nodes by degree
    top_nodes = sorted(graph.items(), key=lambda x: len(x[1]), reverse=True)[:20]
    top_nodes_dict = [{"Node": node, "Degree": len(neighbors)} for node, neighbors in top_nodes]
    
    exec_time = time.time() - start_time
    memory_used = get_memory_usage() - start_memory
    
    print(f"Nodes: {num_nodes}")
    print(f"Edges: {num_edges}")
    print(f"Triangles: {num_triangles}")
    print(f"Average Degree: {avg_degree:,.2f}")
    print(f"Density: {density:,.6f}")
    print(f"Execution Time: {exec_time:,.2f} seconds")
    print(f"Memory Used: {memory_used:,.2f} MB")
    print("Top 20 Most Connected Nodes:")
    for node in top_nodes_dict:
        print(f"  Node {node['Node']}: Degree {node['Degree']}")
    
    return {
        "Method": "Naive (Full)",
        "Nodes": num_nodes,
        "Edges": num_edges,
        "Triangles": num_triangles,
        "Avg Degree": avg_degree,
        "Density": density,
        "Time (s)": exec_time,
        "Memory (MB)": memory_used,
        "Top 20 Nodes": top_nodes_dict
    }




# Function to analyze full graph with NetworkX
def analyze_networkx(file_path):
    start_time = time.time() 
    start_memory = get_memory_usage()
    
    try:
        G = nx.read_edgelist(file_path, nodetype=int, create_using=nx.Graph())
    except (nx.NetworkXError, IOError) as e:
        print(f"Error reading file {file_path}: {e}")
        return {
            "Method": "NetworkX (Full)",
            "Nodes": 0,
            "Edges": 0,
            "Triangles": 0,
            "Avg Degree": 0,
            "Density": 0,
            "Time (s)": time.time() - start_time,
            "Memory (MB)": get_memory_usage() - start_memory,
            "Top 20 Nodes": []
        }
    
    num_nodes = G.number_of_nodes()
    num_edges = G.number_of_edges()
    num_triangles = sum(nx.triangles(G).values()) // 3  # Each triangle counted thrice
    avg_degree = sum(dict(G.degree()).values()) / num_nodes if num_nodes > 0 else 0
    density = (2 * num_edges) / (num_nodes * (num_nodes - 1)) if num_nodes > 1 else 0
    
    # Find top 20 nodes by degree
    degrees = G.degree()
    top_nodes = sorted(degrees, key=lambda x: x[1], reverse=True)[:20]
    top_nodes_dict = [{"Node": node, "Degree": degree} for node, degree in top_nodes]
    
    exec_time = time.time() - start_time
    memory_used = get_memory_usage() - start_memory
    
    print(f"Nodes: {num_nodes}")
    print(f"Edges: {num_edges}")
    print(f"Triangles: {num_triangles}")
    print(f"Average Degree: {avg_degree:,.2f}")
    print(f"Density: {density:,.6f}")
    print(f"Execution Time: {exec_time:,.2f} seconds")
    print(f"Memory Used: {memory_used:,.2f} MB")
    print("Top 20 Most Connected Nodes:")
    for node in top_nodes_dict:
        print(f"  Node {node['Node']}: Degree {node['Degree']}")
    
    return {
        "Method": "NetworkX (Full)",
        "Nodes": num_nodes,
        "Edges": num_edges,
        "Triangles": num_triangles,
        "Avg Degree": avg_degree,
        "Density": density,
        "Time (s)": exec_time,
        "Memory (MB)": memory_used,
        "Top 20 Nodes": top_nodes_dict
    }

# SciPy sparse matrix analysis
def analyze_scipy_sparse(file_path):
    start_time = time.time()   
    start_memory = get_memory_usage()
    
    try:
        # Load edge list with pandas for better performance
        edges_df = pd.read_csv(file_path, sep=r'\s+', header=None, dtype=np.int32, engine='c')
        edges = edges_df.to_numpy()
    except (pd.errors.EmptyDataError, IOError) as e:
        print(f"Error reading file {file_path}: {e}")
        return {
            "Method": "SciPy Sparse (Full)",
            "Nodes": 0,
            "Edges": 0,
            "Triangles": 0,
            "Avg Degree": 0,
            "Density": 0,
            "Time (s)": time.time() - start_time,
            "Memory (MB)": get_memory_usage() - start_memory,
            "Top 20 Nodes": []
        }
    
    # Vectorized node mapping
    nodes, inverse_indices = np.unique(edges, return_inverse=True)
    num_nodes = len(nodes)
    edge_indices = inverse_indices.reshape(edges.shape)  # Shape: (m, 2)
    
    # Create row and column arrays for symmetric adjacency matrix
    rows = np.concatenate([edge_indices[:, 0], edge_indices[:, 1]])
    cols = np.concatenate([edge_indices[:, 1], edge_indices[:, 0]])
    data = np.ones(len(rows), dtype=np.int32)
    
    adj_matrix = sp.csr_matrix((data, (rows, cols)), shape=(num_nodes, num_nodes))
    
    num_edges = adj_matrix.nnz // 2
    degrees = np.array(adj_matrix.sum(axis=1)).flatten()
    # Count triangles: trace(A^3)/6 for undirected graph
    adj_matrix_cube = adj_matrix @ adj_matrix @ adj_matrix
    num_triangles = int(adj_matrix_cube.diagonal().sum() / 6)
    avg_degree = degrees.mean() if num_nodes > 0 else 0
    density = (2 * num_edges) / (num_nodes * (num_nodes - 1)) if num_nodes > 1 else 0
    
    # Find top 20 nodes by degree
    top_indices = np.argpartition(degrees, -20)[-20:] if num_nodes >= 20 else np.arange(num_nodes)
    # Convert NumPy scalars to plain ints so the output matches the other methods
    top_nodes = [(int(nodes[i]), int(degrees[i])) for i in top_indices]
    top_nodes = sorted(top_nodes, key=lambda x: x[1], reverse=True)
    top_nodes_dict = [{"Node": node, "Degree": degree} for node, degree in top_nodes]
    
    exec_time = time.time() - start_time
    memory_used = get_memory_usage() - start_memory
    
    print(f"Nodes: {num_nodes}")
    print(f"Edges: {num_edges}")
    print(f"Triangles: {num_triangles}")
    print(f"Average Degree: {avg_degree:,.2f}")
    print(f"Density: {density:,.6f}")
    print(f"Execution Time: {exec_time:,.2f} seconds")
    print(f"Memory Used: {memory_used:,.2f} MB")
    print("Top 20 Most Connected Nodes:")
    for node in top_nodes_dict:
        print(f"  Node {node['Node']}: Degree {node['Degree']}")
    
    return {
        "Method": "SciPy Sparse (Full)",
        "Nodes": num_nodes,
        "Edges": num_edges,
        "Triangles": num_triangles,
        "Avg Degree": avg_degree,
        "Density": density,
        "Time (s)": exec_time,
        "Memory (MB)": memory_used,
        "Top 20 Nodes": top_nodes_dict
    }
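
The three methods count triangles differently: the naïve method counts each triangle once at its smallest node, NetworkX reports a per-node count that is divided by 3, and the sparse method divides the trace of A^3 by 6, because every triangle produces six closed walks of length 3 (three starting nodes times two directions). The sketch below checks the set-based and trace-based counts against each other on a small hypothetical graph, using plain Python lists instead of SciPy and assuming no self-loops or repeated edges.

```python
from itertools import combinations

# Toy undirected graph (hypothetical): a 4-clique on nodes 0-3 plus a pendant node 4
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
n = 5

# Adjacency sets, as in the dictionary-based method
adj = {v: set() for v in range(n)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Set-based count: each triangle is counted once, at its smallest node
set_count = sum(
    1
    for u in adj
    for v, w in combinations(sorted(x for x in adj[u] if x > u), 2)
    if w in adj[v]
)

# Matrix-based count: trace(A^3) / 6 using plain nested lists
A = [[1 if j in adj[i] else 0 for j in range(n)] for i in range(n)]

def matmul(X, Y):
    """Dense n x n matrix product on nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A3 = matmul(matmul(A, A), A)
trace_count = sum(A3[i][i] for i in range(n)) // 6

print(f"Set-based: {set_count}, trace-based: {trace_count}")
```

If the edge list contained repeated edges, the adjacency matrix built from it would hold entries greater than 1 and the trace formula would overcount, so such a check is worth running on any new dataset.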



#### Functions to run through the different methods one-by-one

In [None]:



def profile_wrapper_naive(file_path, num_runs):
    """
    Wrapper to profile analyze_naive over num_runs and aggregate results.
    
    Args:
        file_path (str): Path to the input file.
        num_runs (int): Number of runs for averaging metrics.
    
    Returns:
        tuple: (results, top_nodes_dfs) where results is a list of metric dicts,
               and top_nodes_dfs is a list of DataFrames.
    """
    results = []
    top_nodes_dfs = []
    
    # TODO: Run the analyze_naive function multiple times
    # - Loop num_runs times
    # - In each iteration, call analyze_naive(file_path) to get the result
    # - Extract metrics (all keys except "Top 20 Nodes") into a dictionary
    # - Convert the "Top 20 Nodes" list into a pandas DataFrame
    # - Add a 'Method' column to the DataFrame with value "Naive (Full)"
    # - Append the metrics dictionary to results and the DataFrame to top_nodes_dfs
    
    # TODO: Aggregate metrics across all runs
    # - Check if results is not empty
    # - Create a dictionary avg_result with:
    #   - "Method": "Naive (Full)"
    #   - Graph metrics ("Nodes", "Edges", "Triangles", "Avg Degree", "Density") from the first result
    #   - "Time (s)": Average of "Time (s)" across all runs (use np.mean)
    #   - "Memory (MB)": Average of "Memory (MB)" across all runs (use np.mean)
    # - Return a list containing only the avg_result dictionary
    # - If results is empty, return empty lists
    if results:
        avg_result = { # Fill this dictionary with metrics            
        }
        results = [avg_result]  
    
    return results, top_nodes_dfs

# TODO: Implement the profile_wrapper_networkx function
# - This function should profile analyze_networkx over num_runs and aggregate results.
# - Follow the same structure as profile_wrapper_naive:

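def profile_wrapper_networkx(file_path, num_runs):
    results = []
    top_nodes_dfs = []

    # TODO: Fill in the profiling loop and the aggregation, as in profile_wrapper_naive

    return results, top_nodes_dfs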

# TODO: Implement the profile_wrapper_scipy_sparse function
# - This function should profile analyze_scipy_sparse over num_runs and aggregate results.
# - Follow the same structure as profile_wrapper_naive:


### Test the profiling methods to observe the performance of functions

In [None]:
methods = [
        ("Naive", analyze_naive),
        ("NetworkX", analyze_networkx),
        ("SciPy Sparse", analyze_scipy_sparse)
    ]
file_path = "data/facebook_combined.txt"
num_runs = 1
# call naive method
method_name, method_func = methods[0]
print(f"Profiling {method_name} method with {num_runs} runs on {file_path}")
cProfile.run(f'profile_wrapper_naive("{file_path}", {num_runs})')

In [None]:
methods = [
        ("Naive", analyze_naive),
        ("NetworkX", analyze_networkx),
        ("SciPy Sparse", analyze_scipy_sparse)
    ]
file_path = "data/facebook_combined.txt"
num_runs = 1
# call NetworkX method
method_name, method_func = methods[1]
print(f"Profiling {method_name} method with {num_runs} runs on {file_path}")
cProfile.run(f'profile_wrapper_networkx("{file_path}", {num_runs})')
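
Note that ```cProfile.run()``` executes a command string and discards whatever the profiled call returns. If the return value is needed afterwards, one option is ```cProfile.Profile.runcall()```, which takes the function and its arguments directly. The sketch below shows the pattern; ```toy_wrapper``` and the path ```data/example.txt``` are hypothetical stand-ins for a ```profile_wrapper_*``` function and its input file.

```python
import cProfile

def toy_wrapper(file_path, num_runs):
    """Hypothetical stand-in for a profile_wrapper_* function."""
    results = [{"Run": i, "File": file_path} for i in range(num_runs)]
    top_nodes_dfs = []
    return results, top_nodes_dfs

profiler = cProfile.Profile()

# runcall() profiles the call and passes its return value through
results, top_nodes_dfs = profiler.runcall(toy_wrapper, "data/example.txt", 3)

# Print the profiling report, sorted by cumulative time
profiler.print_stats(sort="cumulative")
print(f"Collected {len(results)} result rows")
```

This also avoids building the command as an f-string, which can break if the file path contains quote characters.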
