# Introduction to NetworkX: Graphs and Network Analysis

In lecture we covered the mathematical definitions of graphs and networks: vertices, edges, directed vs. undirected graphs, weighted graphs, and key structural concepts such as connectedness and adjacency matrices. Now we will put those ideas into code using **NetworkX**, a widely used Python library for creating, analyzing, and manipulating graphs and networks.

NetworkX will let you:
- Create and manipulate graph/network structures in code
- Attach attributes (data) to nodes and edges
- Compute network statistics and structural properties
- Export networks to other tools (e.g., Gephi for visualization)

> **A note on visualization:** NetworkX has basic built-in drawing tools (via matplotlib), but for publication-quality or interactive visualization, **Gephi** is a free, open-source stand-alone program that can open graphs exported from NetworkX (for example, as GraphML files).

In [None]:
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

## Creating Graphs

NetworkX provides four main graph classes:

| Class | Description |
|---|---|
| `nx.Graph()` | Undirected simple graph |
| `nx.DiGraph()` | Directed simple graph |
| `nx.MultiGraph()` | Undirected multigraph (multiple edges between same nodes allowed) |
| `nx.MultiDiGraph()` | Directed multigraph |

We'll start with the most common case: an undirected graph.

In [None]:
G = nx.Graph()  # Create an empty undirected graph
print(G)
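
To make the differences in the table concrete, the following sketch adds the same edge twice to a `Graph` and a `MultiGraph`, then checks edge direction in a `DiGraph`. The node labels `'a'` and `'b'` and the variable names are illustrative choices only.

```python
import networkx as nx

simple = nx.Graph()
multi = nx.MultiGraph()
for g in (simple, multi):
    g.add_edge('a', 'b')
    g.add_edge('a', 'b')  # same endpoints a second time

simple_edges = simple.number_of_edges()  # simple graph keeps a single edge
multi_edges = multi.number_of_edges()    # multigraph keeps both copies

directed = nx.DiGraph()
directed.add_edge('a', 'b')
forward = directed.has_edge('a', 'b')    # a -> b exists
backward = directed.has_edge('b', 'a')   # b -> a was never added

print(simple_edges, multi_edges, forward, backward)
```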

## Adding Nodes

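Nodes can be any **hashable** object except `None`. Immutable built-in types such as integers, strings, and tuples of these are hashable; mutable containers such as lists and dictionaries are not. In practice, nodes are almost always integers or strings.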

- `G.add_node(n)` — add a single node
- `G.add_nodes_from(iterable)` — add many nodes at once from any iterable

In [None]:
# Add nodes one at a time
G.add_node(1)
G.add_node(2)
G.add_node(3)

# Add several nodes at once from a list
G.add_nodes_from([4, 5, 6, 7])

# G.nodes is a read-only view of the node set
print('Nodes:', G.nodes)
print('Number of nodes:', G.number_of_nodes())
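
As a small illustration of the hashability requirement, the sketch below adds a few unusual but valid node types to a scratch graph (called `H` here so that `G` is left alone) and shows that a list is rejected. The specific node values are made up for the example.

```python
import networkx as nx

H = nx.Graph()
H.add_node((0, 1))             # a tuple is hashable
H.add_node('station')          # so is a string
H.add_node(frozenset({1, 2}))  # and a frozenset

try:
    H.add_node([0, 1])         # a list is unhashable
    list_accepted = True
except TypeError as err:
    list_accepted = False
    print('Rejected:', err)

print('Number of nodes:', H.number_of_nodes())
```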

## Adding Edges

- `G.add_edge(u, v)` — add a single edge between nodes `u` and `v`. If either node does not yet exist, it is **created automatically**.
- `G.add_edges_from(iterable)` — add many edges at once from an iterable of 2-tuples

In [None]:
# Add edges one at a time
G.add_edge(1, 2)
G.add_edge(1, 3)
G.add_edge(2, 4)
G.add_edge(3, 4)
G.add_edge(4, 5)

# Add several edges at once
G.add_edges_from([(5, 6), (5, 7), (6, 7)])

print('Edges:', G.edges)
print('Number of edges:', G.number_of_edges())
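
The automatic node creation described above can be checked on an empty scratch graph. This is a minimal sketch with made-up node labels; it also shows that in an undirected `Graph`, `(u, v)` and `(v, u)` refer to the same edge.

```python
import networkx as nx

H = nx.Graph()
H.add_edge('x', 'y')                 # neither node exists yet; both are created
H.add_edge('y', 'x')                 # same undirected edge, nothing new is added
H.add_edges_from([('y', 'z')])       # add_edges_from creates missing nodes too

node_list = sorted(H.nodes)
edge_count = H.number_of_edges()
print(node_list, edge_count, H.has_edge('z', 'y'))
```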

Let's draw this graph to see what we've built. NetworkX uses matplotlib for drawing:

In [None]:
plt.figure(figsize=(6, 4))
# spring_layout positions nodes using a force-directed algorithm
pos = nx.spring_layout(G, seed=42)  # seed for reproducibility
nx.draw(G, pos, with_labels=True, node_color='steelblue',
        node_size=600, font_color='white', font_weight='bold',
        edge_color='gray')
plt.title('Our first graph')
plt.show()

## Accessing Neighbors and Degree

Once a graph is built, you can inspect the structure:

- `G.adj[n]` or `G.neighbors(n)` — get the neighbors of node `n`
- `G.degree[n]` — the degree (number of neighbors) of node `n`
- `G.degree` — a view of `(node, degree)` pairs for all nodes

In [None]:
# Neighbors of node 5
print('Neighbors of node 5:', list(G.neighbors(5)))
# Equivalently:
print('Neighbors of node 5 (via adj):', list(G.adj[5]))

# Degree of a single node
print('Degree of node 5:', G.degree[5])

# Degree of all nodes
print('All degrees:', dict(G.degree))
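
A quick consistency check on these views: in an undirected graph every edge contributes to the degree of both of its endpoints, so the degrees sum to twice the number of edges. The sketch below rebuilds the same eight edges in a fresh graph (named `H` here so it does not overwrite `G`) and verifies this.

```python
import networkx as nx

H = nx.Graph()
H.add_edges_from([(1, 2), (1, 3), (2, 4), (3, 4), (4, 5),
                  (5, 6), (5, 7), (6, 7)])

degree_sum = sum(d for _, d in H.degree)
print('Sum of degrees:', degree_sum)
print('Twice the edge count:', 2 * H.number_of_edges())

# G.neighbors(n) and G.adj[n] list the same nodes
same_neighbors = set(H.neighbors(5)) == set(H.adj[5])
print('Same neighbor sets:', same_neighbors)
```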

## Removing Nodes and Edges

- `G.remove_node(n)` — removes node `n` and all edges incident to it
- `G.remove_edge(u, v)` — removes just the edge between `u` and `v`

In [None]:
# Demonstrate removal on a copy so we keep G intact
G_copy = G.copy()
print('Before removal:', G_copy.nodes(), G_copy.edges())

G_copy.remove_edge(5, 6)
G_copy.remove_node(7)

print('After removing edge (5,6) and node 7:')
print('  Nodes:', list(G_copy.nodes))
print('  Edges:', list(G_copy.edges))
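
One detail worth knowing when removing things: the single-item methods raise an error if the node or edge is absent, while the bulk methods `remove_nodes_from` and `remove_edges_from` skip missing items silently. A minimal sketch on a small path graph (the node number 99 is just a label that is not in the graph):

```python
import networkx as nx

H = nx.path_graph(4)  # nodes 0-1-2-3 joined in a line

try:
    H.remove_node(99)          # 99 is not in the graph
    raised = False
except nx.NetworkXError:
    raised = True

H.remove_nodes_from([3, 99])            # removes 3, ignores 99
H.remove_edges_from([(0, 1), (5, 6)])   # removes (0, 1), ignores (5, 6)

print(raised, sorted(H.nodes), list(H.edges))
```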

## Node and Edge Attributes

NetworkX graphs are *networks* in the full sense: nodes and edges can carry arbitrary attribute data. Internally, each node name is a key in a dictionary, and its value is a dictionary of attribute name-value pairs.

You can add attributes when creating a node or edge, or set them afterward:

In [None]:
# add_node accepts attributes as keyword arguments. Nodes 1 and 4 already
# exist in G, so these calls simply update their attributes.
G.add_node(1, label='hub', population=500)
G.add_node(4, label='hub', population=300)

# Set attributes on existing nodes via G.nodes[n]
G.nodes[2]['label'] = 'leaf'
G.nodes[2]['population'] = 80

# Inspect node attributes
print('Node 1 attributes:', G.nodes[1])
print('Node 2 attributes:', G.nodes[2])
print()
print('All node data:')
for node, data in G.nodes(data=True):
    print(f'  node {node}: {data}')

In [None]:
# Edge attributes work the same way.
# These edges already exist, so add_edge just attaches a 'weight' to them.
G.add_edge(1, 2, weight=3.5)
G.add_edge(1, 3, weight=1.2)
G.add_edge(4, 5, weight=2.0)

# Inspect edge attributes
print('Edge (1,2) attributes:', G.edges[1, 2])
print()
print('All edge data:')
for u, v, data in G.edges(data=True):
    print(f'  ({u}, {v}): {data}')
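
For setting or collecting one attribute across many nodes, NetworkX also provides the helper functions `nx.set_node_attributes` and `nx.get_node_attributes`. The sketch below uses a small scratch graph with made-up labels; nodes that lack the attribute are simply left out of the result.

```python
import networkx as nx

H = nx.Graph()
H.add_edges_from([(1, 2), (2, 3)])

# Assign one attribute to several nodes from a {node: value} dictionary
nx.set_node_attributes(H, {1: 'hub', 2: 'leaf'}, name='label')

# Collect that attribute back into a {node: value} dictionary
labels = nx.get_node_attributes(H, 'label')
print(labels)

# dict.get avoids a KeyError for nodes without the attribute
fallback = H.nodes[3].get('label', 'unlabeled')
print(fallback)
```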

**Exercise:** In the cell below, create a new `nx.Graph()` that represents a graph that you first drew on paper. Add at least two node attributes and at least one edge attribute of your choosing. Print the node list and edge list.

In [None]:
# Your code here


## Weighted Graphs and the Adjacency Matrix

As covered in lecture, any simple network has an **adjacency matrix** $A$ where:
- $A_{ij} = 1$ if there is an edge between node $i$ and node $j$ (unweighted)
- $A_{ij} = w_{ij}$ (the weight) for a weighted network, with $A_{ij} = 0$ meaning no edge

NetworkX can produce the adjacency matrix as a NumPy array via `nx.to_numpy_array`. By default, each entry is filled with the edge's `weight` attribute (an edge without one counts as 1); pass `weight=None` to get the unweighted (0/1) version instead:

In [None]:
# Build a small weighted graph for demonstration
W = nx.Graph()
W.add_edge('A', 'B', weight=4)
W.add_edge('A', 'C', weight=2)
W.add_edge('B', 'C', weight=1)
W.add_edge('B', 'D', weight=5)
W.add_edge('C', 'D', weight=8)
W.add_edge('C', 'E', weight=10)
W.add_edge('D', 'E', weight=2)

# The node ordering used for rows/columns
print('Node order:', list(W.nodes))
print()

# Unweighted adjacency matrix (just 0s and 1s): pass weight=None
A_unweighted = nx.to_numpy_array(W, weight=None)
print('Unweighted adjacency matrix:')
print(A_unweighted)

In [None]:
# Weighted adjacency matrix: default behavior uses the 'weight' edge attribute
A_weighted = nx.to_numpy_array(W)
print('Weighted adjacency matrix:')
print(A_weighted)
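
Two properties of these matrices are easy to verify in code: for an undirected graph the matrix is symmetric, and the row sums of the 0/1 matrix are the node degrees. The sketch below uses a three-node triangle and the `nodelist` argument to fix the row/column order explicitly; the chosen order is arbitrary.

```python
import networkx as nx
import numpy as np

T = nx.Graph()
T.add_weighted_edges_from([('A', 'B', 4), ('A', 'C', 2), ('B', 'C', 1)])

order = ['C', 'B', 'A']  # explicit row/column order
A01 = nx.to_numpy_array(T, nodelist=order, weight=None)
Aw = nx.to_numpy_array(T, nodelist=order)

symmetric = np.array_equal(Aw, Aw.T)
row_sums = A01.sum(axis=1)   # degree of C, B, A
strengths = Aw.sum(axis=1)   # sum of incident edge weights for C, B, A

print(symmetric, row_sums, strengths)
```

The weighted row sums match what `T.degree(weight='weight')` reports for each node.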

In [None]:
# Draw the weighted graph with edge labels showing weights
plt.figure(figsize=(6, 4))
pos_W = nx.spring_layout(W, seed=7)
edge_labels = nx.get_edge_attributes(W, 'weight')
nx.draw(W, pos_W, with_labels=True, node_color='coral',
        node_size=700, font_weight='bold', edge_color='gray')
nx.draw_networkx_edge_labels(W, pos_W, edge_labels=edge_labels)
plt.title('Weighted graph')
plt.show()

## A Brief Look at Directed Graphs

A `DiGraph` works almost identically to `Graph`, but edges have a direction. Adding edge `(u, v)` means $u \to v$, and `(v, u)` is a *different* edge going the other way. This matters for neighbors and degree:

- `G.successors(n)` — nodes that `n` points *to*
- `G.predecessors(n)` — nodes that point *to* `n`
- `G.out_degree[n]` — number of outgoing edges
- `G.in_degree[n]` — number of incoming edges

In [None]:
DG = nx.DiGraph()
DG.add_edges_from([(1, 2), (1, 3), (2, 4), (3, 4), (4, 1)])  # note the cycle 4->1

print('Successors of 1 (1 points to):', list(DG.successors(1)))
print('Predecessors of 1 (points to 1):', list(DG.predecessors(1)))
print('Out-degree of 1:', DG.out_degree[1])
print('In-degree of 1:', DG.in_degree[1])
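
A few sanity checks help fix the directed definitions in mind: every edge has exactly one tail and one head, so in-degrees and out-degrees both sum to the number of edges, and the adjacency matrix is in general no longer symmetric. This sketch rebuilds the same five edges in a separate graph named `D`.

```python
import networkx as nx

D = nx.DiGraph()
D.add_edges_from([(1, 2), (1, 3), (2, 4), (3, 4), (4, 1)])

total_in = sum(d for _, d in D.in_degree)
total_out = sum(d for _, d in D.out_degree)
print(total_in, total_out, D.number_of_edges())

# For a DiGraph, degree counts incoming plus outgoing edges
degree_splits = D.degree[1] == D.in_degree[1] + D.out_degree[1]

# Row i, column j is nonzero only for an edge i -> j
A = nx.to_numpy_array(D)
is_symmetric = bool((A == A.T).all())
print(degree_splits, is_symmetric)
```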

In [None]:
plt.figure(figsize=(5, 4))
pos_DG = nx.spring_layout(DG, seed=1)
nx.draw(DG, pos_DG, with_labels=True, node_color='mediumseagreen',
        node_size=600, font_color='white', font_weight='bold',
        edge_color='gray', arrows=True, arrowsize=20,
        connectionstyle='arc3,rad=0.1')  # slight curve so both directions are visible
plt.title('Directed graph (DiGraph)')
plt.show()

## Network Statistics and Structural Properties

A key goal of network analysis is to **quantify structural properties** of the network that are relevant to your application. Here we will look at a few of the most important ones, returning to our original graph `G`.

### Degree Distribution

The **degree** of a node is the number of edges it has (its number of neighbors). The **degree distribution** tells you how many nodes have each degree, and is one of the most informative summaries of a network's structure.

In [None]:
degrees = dict(G.degree)
print('Node degrees:', degrees)
print('Average degree:', np.mean(list(degrees.values())))

# Degree sequence (sorted)
degree_sequence = sorted(degrees.values(), reverse=True)
print('Degree sequence:', degree_sequence)

In [None]:
# Plot the degree distribution as a bar chart
degree_counts = {}
for d in degree_sequence:
    degree_counts[d] = degree_counts.get(d, 0) + 1

plt.figure(figsize=(5, 3))
plt.bar(degree_counts.keys(), degree_counts.values(), color='steelblue', edgecolor='black')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.title('Degree distribution of G')
plt.xticks(sorted(degree_counts.keys()))
plt.show()
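
The counting loop above can also be delegated to `nx.degree_histogram`, which returns a list whose entry at index `d` is the number of nodes of degree `d`. A minimal sketch on a fresh copy of the same eight edges (named `H` to leave `G` untouched):

```python
import networkx as nx

H = nx.Graph()
H.add_edges_from([(1, 2), (1, 3), (2, 4), (3, 4), (4, 5),
                  (5, 6), (5, 7), (6, 7)])

hist = nx.degree_histogram(H)
print(hist)  # index = degree, value = number of nodes with that degree

for degree, count in enumerate(hist):
    if count:
        print(f'degree {degree}: {count} node(s)')
```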

### Average Path Length

The **average path length** is the mean shortest-path distance, measured in edges (hops), between two distinct nodes, averaged over all pairs of nodes. It measures how "close together" nodes are in the network on average.

> Note: average path length is only defined for **connected** graphs (where a path exists between every pair of nodes). Let's check first.

In [None]:
print('Is G connected?', nx.is_connected(G))

if nx.is_connected(G):
    avg_path = nx.average_shortest_path_length(G)
    print(f'Average path length: {avg_path:.4f}')
    print(f'Diameter (longest shortest path): {nx.diameter(G)}')

In [None]:
# You can also look at shortest paths between specific nodes
path = nx.shortest_path(G, source=1, target=7)
print(f'Shortest path from node 1 to node 7: {path}')
print(f'Length: {len(path) - 1} hops')
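
To connect the definition to the library call, the sketch below computes the average by hand from all pairwise shortest-path lengths and compares it with `nx.average_shortest_path_length`. It rebuilds the same edges in a scratch graph `H`; the variable names are illustrative.

```python
import itertools
import networkx as nx

H = nx.Graph()
H.add_edges_from([(1, 2), (1, 3), (2, 4), (3, 4), (4, 5),
                  (5, 6), (5, 7), (6, 7)])

# dist[u][v] is the hop count of a shortest path from u to v
dist = dict(nx.shortest_path_length(H))

pair_lengths = [dist[u][v] for u, v in itertools.combinations(H.nodes, 2)]
manual_avg = sum(pair_lengths) / len(pair_lengths)

print(f'Manual average:  {manual_avg:.4f}')
print(f'NetworkX result: {nx.average_shortest_path_length(H):.4f}')
print('Longest shortest path:', max(pair_lengths))
```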

### Other Useful Structural Queries

NetworkX has a large library of built-in structural analysis functions. A few useful ones:

In [None]:
# Clustering coefficient: how "cliquey" are the neighbors of a node?
# (fraction of pairs of a node's neighbors that are themselves joined by an edge)
print('Clustering coefficients:')
for node, cc in nx.clustering(G).items():
    print(f'  node {node}: {cc:.3f}')

print(f'\nAverage clustering coefficient: {nx.average_clustering(G):.4f}')
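
The local clustering coefficient can be reproduced by hand: for a node with $k$ neighbors, count the edges among those neighbors and divide by the $k(k-1)/2$ possible pairs. The helper `local_clustering` below is a hypothetical function written for this sketch, not a NetworkX API, and it treats nodes with fewer than two neighbors as having coefficient 0 (which matches what `nx.clustering` returns).

```python
import itertools
import networkx as nx

H = nx.Graph()
H.add_edges_from([(1, 2), (1, 3), (2, 4), (3, 4), (4, 5),
                  (5, 6), (5, 7), (6, 7)])

def local_clustering(graph, node):
    """Fraction of neighbor pairs of `node` that are joined by an edge."""
    nbrs = list(graph.neighbors(node))
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in itertools.combinations(nbrs, 2)
                if graph.has_edge(u, v))
    return links / (k * (k - 1) / 2)

manual = {n: local_clustering(H, n) for n in H.nodes}
library = nx.clustering(H)
for n in H.nodes:
    print(f'node {n}: manual {manual[n]:.3f}, networkx {library[n]:.3f}')
```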

**Exercise:** Using the graph `G` from earlier, answer the following:

1. Which node or nodes have the highest degree? (Hint: `max()` with a `key` argument might help, but check for ties.)
2. What is the shortest path from node 2 to node 6, and how many hops does it take?
3. Which node or nodes have the highest clustering coefficient? What does that tell you about the neighborhood?

In [None]:
# Your code here


## Application: SI Contagion on a Random Graph

Now let's put all of this together in a meaningful application. We will simulate a simple **SI (Susceptible–Infected) contagion model** on a network:

- Each node is either **Susceptible** (S) or **Infected** (I)
- At each time step, each infected node independently attempts to infect each of its susceptible neighbors with probability $\beta$
- Once infected, a node stays infected (no recovery)

This is the simplest possible network contagion model, but it already produces rich dynamics depending on the network structure.
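
One consequence of the independent-attempts rule is worth spelling out: a susceptible node with $k$ infected neighbors escapes infection in a given step only if all $k$ attempts fail, so it becomes infected with probability $1 - (1 - \beta)^k$. The sketch below checks this formula against a small Monte Carlo estimate. The function name `infection_probability`, the choice $k = 2$, and the trial count are assumptions made for the illustration.

```python
import numpy as np

def infection_probability(beta, k):
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - beta) ** k

beta, k = 0.3, 2
exact = infection_probability(beta, k)

# Each row is one trial; each column is one infected neighbor's attempt
rng = np.random.default_rng(0)
attempts = rng.random((200_000, k)) < beta
estimate = attempts.any(axis=1).mean()

print(f'exact = {exact:.4f}, simulated = {estimate:.4f}')
```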

We will run this on an **Erdős–Rényi random graph** $G(n, p)$: a graph on $n$ nodes where each possible edge is included independently with probability $p$.

In [None]:
# Parameters
n_nodes = 100       # number of nodes
p_edge = 0.06       # probability of each edge existing
beta = 0.3          # transmission probability per contact per step
n_steps = 20        # number of time steps to simulate
seed = 17

# Create the random graph
rng = np.random.default_rng(seed)
G_er = nx.erdos_renyi_graph(n_nodes, p_edge, seed=seed)

print(f'Nodes: {G_er.number_of_nodes()}')
print(f'Edges: {G_er.number_of_edges()}')
print(f'Average degree: {np.mean([d for _, d in G_er.degree()]):.2f}')
print(f'Is connected: {nx.is_connected(G_er)}')
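
The printed statistics can be compared with what the $G(n, p)$ definition predicts: there are $n(n-1)/2$ possible edges, each present with probability $p$, so the expected edge count is $p\,n(n-1)/2$ and the expected degree is $p(n-1)$. The sketch below computes both and averages the edge count over 50 sampled graphs; the number of samples and the seeds are arbitrary choices.

```python
import networkx as nx
import numpy as np

n, p = 100, 0.06
expected_edges = p * n * (n - 1) / 2   # about 297
expected_degree = p * (n - 1)          # about 5.94

# Sample several graphs to see how close the average comes
edge_counts = [nx.erdos_renyi_graph(n, p, seed=s).number_of_edges()
               for s in range(50)]
mean_edges = np.mean(edge_counts)

print(f'Expected edges:  {expected_edges:.1f}')
print(f'Expected degree: {expected_degree:.2f}')
print(f'Mean edges over {len(edge_counts)} samples: {mean_edges:.1f}')
```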

Inside the simulation we track each node's state in a dictionary that maps the node to `'S'` or `'I'`. The infection starts from a single node: `run_si` picks one at random unless `seed_node` is given, and the run below starts from node 0.

In [None]:
def run_si(G, beta, n_steps, rng, seed_node=None):
    """Run an SI contagion on graph G.
    
    Parameters
    ----------
    G : nx.Graph
    beta : float
        Transmission probability per edge per step
    n_steps : int
    rng : numpy Generator
    seed_node : node label, optional
        Initial infected node. If None, a random node is chosen.
    
    Returns
    -------
    history : list of dicts
        history[t] maps node -> 'S' or 'I' at time step t
    """
    # Initialize all nodes as susceptible
    state = {node: 'S' for node in G.nodes}
    
    # Seed the infection
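    if seed_node is None:
        # Pick by index so the label keeps its original Python type
        # (rng.choice would convert labels such as tuples into arrays)
        nodes = list(G.nodes)
        seed_node = nodes[rng.integers(len(nodes))]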
    state[seed_node] = 'I'
    
    history = [state.copy()]
    
    for _ in range(n_steps):
        new_state = state.copy()
        for node in G.nodes:
            if state[node] == 'I':
                # Attempt to infect each susceptible neighbor
                for neighbor in G.neighbors(node):
                    if state[neighbor] == 'S':
                        if rng.random() < beta:
                            new_state[neighbor] = 'I'
        state = new_state
        history.append(state.copy())
    
    return history


history = run_si(G_er, beta, n_steps, rng, seed_node=0)

# Count infected at each step
infected_counts = [sum(1 for s in h.values() if s == 'I') for h in history]
print('Infected count over time:')
for t, count in enumerate(infected_counts):
    print(f'  t={t:2d}: {count:3d} infected')

### Visualizing the Spread

We'll plot the network state at three time points: the start, a middle step, and the end. Infected nodes are shown in red, susceptible nodes in light blue.

In [None]:
# Fix node positions for all plots so the layout doesn't change between snapshots
pos_er = nx.spring_layout(G_er, seed=seed)

def node_colors(state_dict):
    """Return a list of colors in node order."""
    return ['red' if state_dict[n] == 'I' else 'lightsteelblue'
            for n in G_er.nodes]

# Choose three time points to display
t_display = [0, n_steps // 2, n_steps]

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, t in zip(axes, t_display):
    colors = node_colors(history[t])
    n_infected = infected_counts[t]
    nx.draw(G_er, pos_er, ax=ax, node_color=colors,
            node_size=80, edge_color='lightgray', width=0.5)
    ax.set_title(f't = {t}\n{n_infected}/{n_nodes} infected', fontsize=12)

# Add a simple legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='red', label='Infected'),
                   Patch(facecolor='lightsteelblue', label='Susceptible')]
fig.legend(handles=legend_elements, loc='lower center', ncol=2,
           bbox_to_anchor=(0.5, -0.05), fontsize=11)

plt.suptitle('SI Contagion on Erdős–Rényi Random Graph', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Plot the epidemic curve
plt.figure(figsize=(7, 4))
plt.plot(range(n_steps + 1), infected_counts, 'r-o', markersize=4, label='Infected')
plt.plot(range(n_steps + 1),
         [n_nodes - c for c in infected_counts], 'b-o', markersize=4, label='Susceptible')
plt.xlabel('Time step')
plt.ylabel('Number of nodes')
plt.title('SI epidemic curve on Erdős–Rényi graph')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

**Exercise:** Experiment with the contagion model by modifying the parameters.

1. Try lowering `p_edge` to 0.02. Does the graph stay connected? What happens to the spread?
2. Try `beta = 0.05` (much lower transmission). How does the epidemic curve change?
3. The Erdős–Rényi model has a famous **threshold**: a giant connected component emerges around $p = 1/n$. For $n = 100$, this is $p = 0.01$. Try a few values of `p_edge` near this threshold and observe how it affects the spread of the contagion.

In [None]:
# Your code here


## Saving and Loading Networks

NetworkX can save graphs in several formats. The **GraphML** format is XML-based, human-readable, and directly openable in Gephi for visualization.

In [None]:
# Save the random graph to GraphML (can be opened in Gephi)
nx.write_graphml(G_er, 'er_graph.graphml')
print('Saved er_graph.graphml')

# Load it back (by default, read_graphml returns node labels as strings)
G_loaded = nx.read_graphml('er_graph.graphml')
print(f'Loaded graph: {G_loaded.number_of_nodes()} nodes, {G_loaded.number_of_edges()} edges')
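
A point that often causes confusion after a round trip: by default `nx.read_graphml` returns node labels as strings, so node `0` comes back as `'0'`. Passing `node_type=int` converts them back. The sketch below demonstrates this on a tiny graph written to a temporary directory; the file name `demo.graphml` and the attribute values are made up for the example.

```python
import os
import tempfile
import networkx as nx

H = nx.Graph()
H.add_edge(1, 2, weight=3.5)
H.add_node(1, label='hub')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.graphml')
    nx.write_graphml(H, path)
    as_str = nx.read_graphml(path)                 # labels read as strings
    as_int = nx.read_graphml(path, node_type=int)  # labels converted to int

print(sorted(as_str.nodes))
print(sorted(as_int.nodes))
print(as_int.nodes[1], as_int.edges[1, 2])
```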

NetworkX can also read/write edge lists and adjacency matrices:

```python
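# Edge list format (one edge per line: "u v {attributes}")
nx.write_edgelist(G, 'graph.edgelist')
# Node labels are read back as strings unless nodetype is given
G2 = nx.read_edgelist('graph.edgelist', nodetype=int)

# From/to a NumPy adjacency matrix
A = nx.to_numpy_array(G)
# Nodes of G3 are labeled 0, 1, ..., n-1 in the row order of A
G3 = nx.from_numpy_array(A)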
```

The `from_numpy_array` direction is especially useful when you have an adjacency matrix from a model or dataset and want to analyze it as a network.
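
As a minimal sketch of that workflow, suppose you are handed a small symmetric matrix together with a list of names for its rows. `nx.from_numpy_array` labels the nodes `0, 1, 2, ...` and stores each nonzero entry as the edge's `weight`; `nx.relabel_nodes` can then restore meaningful names. The matrix values and the names in `labels` are hypothetical.

```python
import networkx as nx
import numpy as np

A = np.array([[0, 2, 0],
              [2, 0, 5],
              [0, 5, 0]])
labels = ['a', 'b', 'c']  # hypothetical names for rows/columns 0, 1, 2

G_int = nx.from_numpy_array(A)  # nodes 0, 1, 2; entries become 'weight'
G_named = nx.relabel_nodes(G_int, dict(enumerate(labels)))

print(sorted(G_named.nodes))
print(list(G_named.edges(data='weight')))
```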