In [1]:
import networkx as nx
import random
import pandas as pd

In [2]:
def simulate_information_diffusion(G, seed_set, activation_prob=0.1):
    """Run one independent-cascade simulation and return the number of active nodes."""
    active_nodes = set(seed_set)
    newly_active_nodes = set(seed_set)

    # Each round, only the nodes activated in the previous round try to spread.
    while newly_active_nodes:
        next_newly_active_nodes = set()
        for node in newly_active_nodes:
            neighbors = set(G.neighbors(node))
            for neighbor in neighbors - active_nodes:
                if random.random() < activation_prob:  # Probability of information passing
                    next_newly_active_nodes.add(neighbor)
        active_nodes.update(next_newly_active_nodes)
        newly_active_nodes = next_newly_active_nodes

    return len(active_nodes)

In [3]:
def estimate_information_diffusion(G, seed_set, num_simulations=10, activation_prob=0.1):
    total_spread = 0
    for _ in range(num_simulations):
        total_spread += simulate_information_diffusion(G, seed_set, activation_prob)
    return total_spread / num_simulations

In [4]:
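def greedy_information_diffusion(G, k, num_simulations=10, activation_prob=0.1):
    """Greedily pick k seed nodes, each time adding the node with the best estimated spread."""
    seed_set = set()
    for _ in range(k):
        best_node = None
        best_spread = 0  # Any non-empty seed set has spread >= 1, so 0 is a safe floor
        for node in G.nodes:
            if node not in seed_set:
                # Estimate the spread if this candidate joined the current seed set
                temp_set = seed_set | {node}
                spread = estimate_information_diffusion(G, temp_set, num_simulations, activation_prob)
                if spread > best_spread:
                    best_spread = spread
                    best_node = node
        if best_node is not None:
            seed_set.add(best_node)
    return seed_set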

In [5]:
# Load the edge list and build an undirected graph from it
df = pd.read_csv('../../facebook_clean_data/tvshow_edges.csv')
G = nx.from_pandas_edgelist(df, 'node_1', 'node_2')

# Run greedy information diffusion
k = 3  # Number of nodes to select
selected_nodes = greedy_information_diffusion(G, k)

print(f"Selected nodes for information diffusion: {selected_nodes}")

Selected nodes for information diffusion: {1840, 3251, 1623}


Functionality:

simulate_information_diffusion: This function models the spread of information through a network starting from a set of initial "seed" nodes, following the independent cascade model. The spread proceeds in rounds: each node activated in the previous round gets one chance to activate each of its still-inactive neighbors, succeeding with probability activation_prob. The simulation stops when a round activates no new nodes, and the function returns the total number of active nodes, seeds included.
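The round-by-round behavior can be checked on a toy graph where the answer is known in advance. The sketch below is a stand-in, not the notebook's function: it assumes the graph is given as a plain adjacency dict (for a networkx graph, {n: set(G.neighbors(n)) for n in G} would produce one), and cascade_size and the explicit rng argument are hypothetical names added so that the run is reproducible.

```python
import random


def cascade_size(adj, seed_set, activation_prob, rng):
    """One cascade on an adjacency dict; returns the number of active nodes."""
    active = set(seed_set)
    frontier = set(seed_set)
    while frontier:
        next_frontier = set()
        for node in frontier:
            for neighbor in adj[node]:
                # Each newly active node gets one try per inactive neighbor.
                if neighbor not in active and rng.random() < activation_prob:
                    next_frontier.add(neighbor)
        active |= next_frontier
        frontier = next_frontier
    return len(active)


# Toy graph: a path 0-1-2-3 plus a separate edge 4-5.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
rng = random.Random(0)

print(cascade_size(adj, {0}, 0.0, rng))  # 1: nothing spreads beyond the seed
print(cascade_size(adj, {0}, 1.0, rng))  # 4: the whole component of node 0
```

With activation_prob=0.0 only the seed stays active, and with activation_prob=1.0 the cascade fills the seed's connected component. Any value in between gives a random result between those two bounds.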

estimate_information_diffusion: This function estimates the average spread of information by running multiple simulations (controlled by num_simulations) of the diffusion process. It computes the average number of nodes activated across all simulations, giving a more robust estimate of the spread for a given set of seed nodes.
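How robust the average is depends on num_simulations. A minimal sketch of the same averaging that also reports a standard error is shown below; it again assumes an adjacency dict instead of a networkx graph, and cascade_size, estimate_with_error, and the seed parameter are hypothetical names that do not appear in the notebook.

```python
import random
import statistics


def cascade_size(adj, seed_set, activation_prob, rng):
    """One cascade on an adjacency dict; returns the number of active nodes."""
    active = set(seed_set)
    frontier = set(seed_set)
    while frontier:
        next_frontier = set()
        for node in frontier:
            for neighbor in adj[node]:
                if neighbor not in active and rng.random() < activation_prob:
                    next_frontier.add(neighbor)
        active |= next_frontier
        frontier = next_frontier
    return len(active)


def estimate_with_error(adj, seed_set, num_simulations=10, activation_prob=0.1, seed=0):
    """Return (mean spread, standard error of the mean) over repeated cascades."""
    rng = random.Random(seed)  # fixed seed so repeated calls give the same estimate
    sizes = [cascade_size(adj, seed_set, activation_prob, rng)
             for _ in range(num_simulations)]
    mean = statistics.mean(sizes)
    if num_simulations > 1:
        stderr = statistics.stdev(sizes) / num_simulations ** 0.5
    else:
        stderr = float("nan")  # a single run gives no spread information
    return mean, stderr


adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
mean, stderr = estimate_with_error(adj, {0}, num_simulations=200, activation_prob=0.5)
print(f"estimated spread: {mean:.2f} +/- {stderr:.2f}")
```

The standard error shrinks roughly with the square root of num_simulations, so quadrupling the number of runs about halves the noise in the estimate.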

greedy_information_diffusion: This is a greedy algorithm that iteratively selects nodes to maximize the spread of information. In each iteration, the algorithm evaluates all nodes not yet in the seed set and selects the node that, when added, maximizes the estimated spread of information. This process is repeated until k nodes are selected.
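The selection loop can be traced by hand when the spread function is deterministic. The sketch below replaces the random simulation with plain reachability, which is what the cascade would produce at activation_prob=1.0; the adjacency-dict graph and the names reachable_count and greedy_seeds are assumptions made for this illustration only.

```python
def reachable_count(adj, seeds):
    """Number of nodes reachable from the seeds (the spread at activation_prob=1.0)."""
    seen = set(seeds)
    stack = list(seeds)
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen)


def greedy_seeds(adj, k, spread_fn):
    """Pick k seeds one at a time, keeping the candidate with the largest spread."""
    seeds = []
    for _ in range(k):
        best_node, best_spread = None, -1
        for node in adj:
            if node in seeds:
                continue
            spread = spread_fn(adj, seeds + [node])
            if spread > best_spread:  # strict '>' keeps the first of any tied nodes
                best_node, best_spread = node, spread
        if best_node is None:  # fewer than k nodes available
            break
        seeds.append(best_node)
    return seeds


# Components: {0, 1, 2, 3}, {4, 5}, and the isolated node 6.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
print(greedy_seeds(adj, 2, reachable_count))  # [0, 4]
```

The first pick lands in the largest component. The second pick skips the remaining nodes of that component, since they add nothing, and takes a node from the next-largest one. With a random spread estimate the same logic applies, but ties and near-ties are resolved by noise.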

Performance:

Efficiency: Each of the k iterations evaluates every node not yet in the seed set, and each evaluation runs num_simulations diffusion simulations. On a graph with n nodes the algorithm therefore performs roughly k × n × num_simulations simulations. It usually finds good seed sets, but this cost makes it expensive on large graphs.
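The number of simulations the greedy loop runs can be counted directly: iteration i evaluates the n - i nodes not yet selected, and each evaluation costs num_simulations runs. The helper below is a hypothetical illustration of that arithmetic, and the value n = 1000 is an arbitrary example, not the size of the tvshow graph.

```python
def greedy_simulation_count(n, k, num_simulations):
    """Total diffusion simulations run by the greedy loop on an n-node graph."""
    # Iteration i (0-based) evaluates the n - i nodes not yet in the seed set.
    return sum((n - i) * num_simulations for i in range(k))


# Example: 1000 nodes, k = 3 seeds, 10 simulations per candidate.
print(greedy_simulation_count(1000, 3, 10))  # 29970
```

Each of those simulations can itself touch a large part of the graph, so the count is a lower bound on the work, not a running-time estimate.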

Scalability: The algorithm can handle moderately sized networks but may struggle with very large ones. The cost grows with the number of nodes (more candidates per iteration), the number of edges (each simulation can traverse more of the graph), the seed budget k, and the number of simulations (num_simulations).

Strengths:

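Effectiveness: Under the independent cascade model, the expected spread is a monotone, submodular function of the seed set, so greedy selection based on exact spread values is guaranteed to reach at least a (1 - 1/e) fraction, about 63%, of the optimal spread. Here the spread is only estimated from num_simulations random runs, so the guarantee holds approximately, and the selected nodes can vary between runs, especially at the default of 10 simulations.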

Customizability: The algorithm allows for customization through parameters like k (the number of nodes to select) and activation_prob (the probability of information spreading between connected nodes), making it adaptable to different types of networks and scenarios.

Limitations:

Computational Cost: Every candidate node is re-evaluated from scratch in every iteration, so running time grows quickly with network size, k, and num_simulations. Lowering num_simulations reduces the cost but makes the spread estimates noisier.

Simplistic Assumptions: The model assumes a uniform probability (activation_prob) for information spreading across all edges, which may not be realistic for all types of networks where edge weights or other factors could influence diffusion.
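One way to relax the uniform-probability assumption is to give every edge its own activation probability. The sketch below is an assumption-laden variant, not part of the notebook: it stores the graph as a nested dict mapping each node to {neighbor: probability}, and weighted_cascade_size is a hypothetical name. How those probabilities would be derived from real data is left open.

```python
import random


def weighted_cascade_size(edge_prob, seed_set, rng):
    """One cascade where each edge carries its own activation probability."""
    active = set(seed_set)
    frontier = set(seed_set)
    while frontier:
        next_frontier = set()
        for node in frontier:
            for neighbor, prob in edge_prob[node].items():
                # Use the per-edge probability instead of a global activation_prob.
                if neighbor not in active and rng.random() < prob:
                    next_frontier.add(neighbor)
        active |= next_frontier
        frontier = next_frontier
    return len(active)


# Edges a-b and b-d always transmit; edge a-c never does.
edge_prob = {
    "a": {"b": 1.0, "c": 0.0},
    "b": {"a": 1.0, "d": 1.0},
    "c": {"a": 0.0},
    "d": {"b": 1.0},
}
print(weighted_cascade_size(edge_prob, {"a"}, random.Random(0)))  # 3: a, b and d
```

Setting every probability to the same value recovers the uniform model used in the notebook.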

This code provides a simple baseline for understanding influence maximization and can serve as a starting point for more advanced network analysis tasks.