# Week 4: Network Models & Hubs — Assignment

**Learning objectives** — In this assignment you will:

- Implement preferential attachment from scratch
- Compare degree distributions visually
- Detect hub nodes using a threshold-based approach
- Quantify model fit using the Kolmogorov-Smirnov statistic
- Compute CCDF and estimate power-law exponents
- Implement random and targeted node removal
- Demonstrate the robustness paradox

## Grading

| Section | Function | Points |
|---------|----------|--------|
| 1 | `preferential_attachment(n, m, seed)` | 25 |
| 2 | `compare_degree_dists(G_real, G_model, title)` | 10 |
| 3 | `find_hubs(G, threshold_factor)` | 10 |
| 4 | `ks_degree_fit(G_real, G_model)` | 10 |
| 5 | `compute_ccdf` + `estimate_alpha` | 10 |
| 6 | `random_removal` + `targeted_removal` | 20 |
| — | Written questions | 15 |
| | **Total** | **100** |

## Before You Start

This assignment builds on the Week 4 lab. Make sure you are comfortable with:

- **Erdos-Renyi model** — random edges, Poisson degree distribution, no hubs (Week 3 Lab)
- **Barabasi-Albert model** — growth + preferential attachment, power-law degree distribution, hubs (Lab Section 5)
- **Log-log degree plots** — straight line = power law, downward curve = exponential tail (Lab Sections 3 & 5)
- **CCDF plots** — complementary CDF, no binning artifacts (Lab Section 9)
- **MLE fitting** — the statistically principled way to estimate the power-law exponent (Lab Section 10)
- **Model limitations** — BA captures hubs but misses clustering; no single model does everything (Lab Section 7)
- **Network robustness** — random vs targeted node removal and the robustness paradox (Week 4 Lab, Section 11)

Section 1 asks you to implement preferential attachment from scratch — this is the core algorithm that produces scale-free networks. Section 5 uses the CCDF and MLE methods from the lab.

In [None]:
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from netsci.loaders import load_graph
from netsci.utils import SEED
from netsci import models

In [None]:
G_fb = load_graph("facebook")
G_air = load_graph("airports")

---
## Section 1: Preferential Attachment from Scratch (25 pts)

Implement the Barabasi-Albert algorithm **manually** (do not call `nx.barabasi_albert_graph`):

1. Start with a small **complete graph** of `m + 1` nodes
2. For each new node (from `m + 1` to `n - 1`):
   - Connect it to `m` existing nodes, chosen with probability **proportional to their current degree**
   - Use `rng.choice(nodes, size=m, replace=False, p=probabilities)` for the selection
3. Return the final graph

Use the provided `seed` parameter to create `rng = np.random.default_rng(seed)`.
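
The selection step in item 2 can be tried in isolation before writing the full loop. The sketch below is only an illustration on a hypothetical five-node degree table (the values and the seed `42` are assumptions, not course data). It shows one way to turn degrees into probabilities and draw distinct targets.

```python
import numpy as np

# Hypothetical toy graph: node 0 is linked to 1, 2, 3, 4, plus one edge 1-2.
nodes = np.array([0, 1, 2, 3, 4])
degrees = np.array([4, 2, 2, 1, 1], dtype=float)

rng = np.random.default_rng(42)  # arbitrary seed for this sketch

# Attachment probability of node i is k_i / sum_j k_j.
probabilities = degrees / degrees.sum()

# Draw 2 distinct targets for one incoming node (m = 2 here).
targets = rng.choice(nodes, size=2, replace=False, p=probabilities)

print(probabilities)
print(targets)
```

In the full algorithm the degrees change every time a node is added, so the probabilities would need to be recomputed at each step.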

In [None]:
def preferential_attachment(n, m, seed=SEED):
    """Build a BA graph from scratch using preferential attachment.

    Parameters
    ----------
    n : int — total number of nodes
    m : int — edges per new node
    seed : int — random seed

    Returns
    -------
    nx.Graph
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# --- Validation ---
_g = preferential_attachment(500, 3, seed=SEED)
assert isinstance(_g, nx.Graph)
assert _g.number_of_nodes() == 500
# Should have m*(n-m-1) + m*(m+1)/2 edges (small tolerance allowed)
_expected_edges = 3 * (500 - 4) + 6  # m*(n-m-1) + C(m+1,2)
assert abs(_g.number_of_edges() - _expected_edges) < 5, (
    f"Expected {_expected_edges} edges, got {_g.number_of_edges()}"
)
# Should produce hubs (fat tail)
_max_deg = max(d for _, d in _g.degree())
assert _max_deg > 20, (
    f"Max degree {_max_deg} too low — hubs expected from preferential attachment"
)
# Degree distribution should roughly match nx.barabasi_albert_graph
_g_nx = nx.barabasi_albert_graph(500, 3, seed=SEED)
_degs_mine = sorted([d for _, d in _g.degree()], reverse=True)
_degs_nx = sorted([d for _, d in _g_nx.degree()], reverse=True)
# Top degrees should be in the same ballpark (both produce hubs)
assert _degs_mine[0] > 15 and _degs_nx[0] > 15, "Both should have hubs"
print(
    f"Your PA: {_g.number_of_nodes()} nodes, {_g.number_of_edges()} edges, max_deg={_max_deg}"
)
print(
    f"NX BA:   {_g_nx.number_of_nodes()} nodes, {_g_nx.number_of_edges()} edges, max_deg={_degs_nx[0]}"
)
print("Section 1 passed!")

---
## Section 2: Degree Distribution Comparison (10 pts)

Create a log-log scatter plot that overlays the degree distributions of a real graph and a model graph.
Use different markers for each. The function should call `plt.show()`.
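
Before plotting, each graph's degree sequence has to be turned into an empirical distribution. The following is a rough sketch of that step on a hypothetical degree list (not taken from either dataset); the plotting itself is left out.

```python
import numpy as np

# Hypothetical degree sequence, for illustration only.
degrees = np.array([1, 1, 1, 2, 2, 3, 8])

# Unique degree values k and the number of nodes with each value.
k_values, counts = np.unique(degrees, return_counts=True)

# Normalize the counts to an empirical probability P(k).
p_k = counts / counts.sum()

# Degree-0 nodes cannot appear on a log axis, so drop them if present.
mask = k_values > 0
k_values, p_k = k_values[mask], p_k[mask]

print(k_values)
print(p_k)
```

Arrays like these can be passed to `plt.scatter`, with `plt.xscale("log")` and `plt.yscale("log")` setting the axes.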

In [None]:
def compare_degree_dists(G_real, G_model, title="Degree Distribution Comparison"):
    """Plot overlaid log-log degree distributions.

    Parameters
    ----------
    G_real : nx.Graph
    G_model : nx.Graph
    title : str
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# --- Validation ---
# This should produce a plot without errors
_g_ba = preferential_attachment(G_fb.number_of_nodes(), 8, seed=SEED)
compare_degree_dists(G_fb, _g_ba, title="Facebook vs PA model")
print("Section 2 passed! (visual check: both distributions visible on log-log)")

---
## Section 3: Hub Detection (10 pts)

Find all nodes whose degree exceeds `threshold_factor` times the average degree.
Return a list of those node IDs.
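
As a sanity check of the threshold rule, here is a small worked example on a star graph, where the answer is obvious by inspection. It is only a sketch; the graph and the factor `3.0` are chosen for illustration.

```python
import networkx as nx

# Star graph: node 0 is the center, nodes 1..9 are leaves (10 nodes, 9 edges).
G = nx.star_graph(9)

# Average degree is 2E / N = 2 * 9 / 10 = 1.8.
avg_degree = 2 * G.number_of_edges() / G.number_of_nodes()
threshold = 3.0 * avg_degree  # 5.4

# Only the center (degree 9) exceeds the threshold; leaves have degree 1.
above = [node for node, deg in G.degree() if deg > threshold]

print(avg_degree, threshold, above)
```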

In [None]:
def find_hubs(G, threshold_factor=3.0):
    """Find hub nodes with degree > threshold_factor * avg_degree.

    Parameters
    ----------
    G : nx.Graph
    threshold_factor : float

    Returns
    -------
    list of nodes
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# --- Validation ---
_hubs = find_hubs(G_fb, threshold_factor=3.0)
assert isinstance(_hubs, list)
assert len(_hubs) > 0, "Facebook should have hubs"

_avg_deg = 2 * G_fb.number_of_edges() / G_fb.number_of_nodes()
_threshold = 3.0 * _avg_deg
for h in _hubs:
    assert G_fb.degree(h) > _threshold, (
        f"Node {h} degree {G_fb.degree(h)} below threshold {_threshold}"
    )

print(f"Found {len(_hubs)} hubs in Facebook (threshold = {_threshold:.1f})")
print(f"Hub degrees: {sorted([G_fb.degree(h) for h in _hubs], reverse=True)}")
print("Section 3 passed!")

---
## Section 4: KS Degree Fit (10 pts)

Use the two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) to quantify how well
a model's degree distribution matches a real network.

Return the KS statistic (lower = better fit).
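
To get a feel for what the statistic measures, the sketch below calls `ks_2samp` on synthetic samples rather than on graphs. The distributions, sample sizes, and seed are assumptions picked for illustration: two Poisson samples with the same mean should give a small statistic, while a Poisson sample and a geometric sample with the same mean should give a much larger one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

sample_a = rng.poisson(lam=10, size=2000)
sample_b = rng.poisson(lam=10, size=2000)    # same distribution as sample_a
sample_c = rng.geometric(p=0.1, size=2000)   # mean 10, but a much heavier tail

# ks_2samp returns a result object with .statistic and .pvalue attributes.
result_same = stats.ks_2samp(sample_a, sample_b)
result_diff = stats.ks_2samp(sample_a, sample_c)

print(f"same distribution:      KS = {result_same.statistic:.3f}")
print(f"different distribution: KS = {result_diff.statistic:.3f}")
```

The statistic is the largest vertical gap between the two empirical CDFs, so it always lies between 0 and 1.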

In [None]:
def ks_degree_fit(G_real, G_model):
    """Compute KS statistic between degree distributions.

    Parameters
    ----------
    G_real : nx.Graph
    G_model : nx.Graph

    Returns
    -------
    float  (KS statistic; lower = better fit)
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# --- Validation ---
n_fb = G_fb.number_of_nodes()
avg_deg_fb = 2 * G_fb.number_of_edges() / n_fb

_g_er = models.erdos_renyi(n_fb, avg_deg_fb)
_g_ba = preferential_attachment(n_fb, max(1, round(avg_deg_fb / 2)), seed=SEED)

_ks_er = ks_degree_fit(G_fb, _g_er)
_ks_ba = ks_degree_fit(G_fb, _g_ba)

assert isinstance(_ks_er, float)
assert 0 <= _ks_er <= 1
assert 0 <= _ks_ba <= 1

# BA should fit Facebook better than ER
assert _ks_ba < _ks_er, f"PA KS={_ks_ba:.3f} should be < ER KS={_ks_er:.3f}"
print(f"KS (ER vs Facebook): {_ks_er:.4f}")
print(f"KS (PA vs Facebook): {_ks_ba:.4f}")
print("PA fits better (lower KS)")
print("Section 4 passed!")

---
## Section 5: CCDF & Power-Law Exponent (10 pts)

Implement two functions:
1. `compute_ccdf(degrees)`: return arrays `(k, P(K >= k))`, sorted by `k` in descending order
2. `estimate_alpha(degrees, k_min)`: MLE power-law exponent, `alpha = 1 + n / sum(ln(k_i / k_min))`, where the sum runs over the `n` degrees with `k_i >= k_min`

The CCDF should return the x and y arrays suitable for a log-log plot.
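
One way to build confidence in the MLE formula is to apply it to synthetic data whose exponent is known. The sketch below is an illustration, not the graded function. It draws continuous power-law samples by inverse-transform sampling and then applies the formula; `alpha_true`, `x_min`, the sample size, and the seed are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for this sketch
alpha_true, x_min, n = 2.5, 1.0, 20_000

# Inverse-transform sampling from p(x) ~ x^(-alpha) for x >= x_min.
u = rng.random(n)                                   # uniform on [0, 1)
x = x_min * (1.0 - u) ** (-1.0 / (alpha_true - 1.0))

# MLE: alpha = 1 + n / sum(ln(x_i / x_min)).
alpha_hat = 1.0 + n / np.log(x / x_min).sum()

print(f"true alpha = {alpha_true}, estimated alpha = {alpha_hat:.3f}")
```

Degrees are integers, so applying this continuous-data formula to a degree sequence is an approximation. The estimate for a BA graph is expected to land near 3 rather than exactly on it.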

In [None]:
def compute_ccdf(degrees):
    """Compute the complementary CDF of a degree sequence.

    Parameters
    ----------
    degrees : array-like — degree values

    Returns
    -------
    (np.ndarray, np.ndarray) — (k_values, ccdf_values) sorted by k descending
    """
    # YOUR CODE HERE
    raise NotImplementedError()


def estimate_alpha(degrees, k_min=1):
    """Estimate power-law exponent using MLE.

    Parameters
    ----------
    degrees : array-like — degree values
    k_min : int — minimum degree threshold

    Returns
    -------
    float — estimated exponent alpha
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# --- Validation ---
# Test compute_ccdf
_degs = [1, 1, 2, 2, 3, 5, 10]
_k, _p = compute_ccdf(_degs)
assert len(_k) == len(_degs)
assert _p[0] < 0.5, "CCDF at max degree should be small (e.g. ~1/n)"
assert abs(_p[-1] - 1.0) < 1e-6, "CCDF of smallest degree should be ~1.0"
# k should be sorted descending
assert all(_k[i] >= _k[i + 1] for i in range(len(_k) - 1)), (
    "k values should be sorted descending"
)

# Test estimate_alpha on BA model
_g_ba = preferential_attachment(1000, 3, seed=SEED)
_ba_degs = [d for _, d in _g_ba.degree()]
_alpha = estimate_alpha(_ba_degs, k_min=3)
assert isinstance(_alpha, float)
assert 2.5 < _alpha < 4.0, f"BA exponent should be ~3.0, got {_alpha:.2f}"
print(f"BA(1000, 3) exponent: alpha = {_alpha:.2f} (theory: ~3.0)")
print("Section 5 passed!")

---
## Section 6: Network Robustness (20 pts)

Implement two attack strategies on a network. Both functions should:
1. Make a **copy** of the graph
2. Remove `int(fraction * N)` nodes, where `N` is the original number of nodes
3. Return `(G_remaining, gcc_fraction)`, where `gcc_fraction` is the number of nodes in the largest connected component of `G_remaining` divided by the **original** number of nodes

`random_removal(G, fraction)`: remove nodes uniformly at random. Create the RNG inside the function with `np.random.default_rng(SEED)`.

`targeted_removal(G, fraction)`: remove nodes in order of **decreasing degree**, recalculating degrees after each removal.
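
The giant-component measurement shared by both functions can be checked on a graph small enough to verify by hand. The sketch below uses a star graph and a hypothetical helper named `gcc_fraction` (not part of the course library) to compare removing one leaf with removing the hub.

```python
import networkx as nx

def gcc_fraction(H, n_ref):
    """Size of the largest connected component of H divided by n_ref.

    Hypothetical helper for this sketch only.
    """
    if H.number_of_nodes() == 0:
        return 0.0
    largest = max(nx.connected_components(H), key=len)
    return len(largest) / n_ref

G = nx.star_graph(9)               # 10 nodes: hub 0 plus leaves 1..9
n_original = G.number_of_nodes()

H_leaf = G.copy()                  # work on copies so G stays intact
H_leaf.remove_node(9)              # remove one leaf

H_hub = G.copy()
H_hub.remove_node(0)               # remove the hub

frac_leaf = gcc_fraction(H_leaf, n_original)  # 9 of 10 nodes still connected
frac_hub = gcc_fraction(H_hub, n_original)    # only isolated nodes remain

print(frac_leaf, frac_hub)
```

Removing a single leaf barely changes the giant component, while removing the hub leaves only isolated nodes. This is the contrast the two removal strategies are meant to expose on the airport network.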

In [None]:
def random_removal(G, fraction):
    """Remove a random fraction of nodes and measure giant component.

    Parameters
    ----------
    G : nx.Graph
    fraction : float (0 to 1)

    Returns
    -------
    (nx.Graph, float) — (remaining graph, gcc_fraction relative to original N)
    """
    # YOUR CODE HERE
    raise NotImplementedError()


def targeted_removal(G, fraction):
    """Remove highest-degree nodes one at a time (recalculating degrees).

    Parameters
    ----------
    G : nx.Graph
    fraction : float (0 to 1)

    Returns
    -------
    (nx.Graph, float) — (remaining graph, gcc_fraction relative to original N)
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# --- Validation ---
_n_air = G_air.number_of_nodes()  # record the size before any removal
_G_r, _gcc_r = random_removal(G_air, 0.2)
assert isinstance(_G_r, nx.Graph)
assert _G_r.number_of_nodes() == _n_air - int(_n_air * 0.2)
assert 0 <= _gcc_r <= 1.0
# Original unchanged
assert G_air.number_of_nodes() == _n_air, "Original graph should not be modified"

_G_t, _gcc_t = targeted_removal(G_air, 0.2)
assert isinstance(_G_t, nx.Graph)
assert _G_t.number_of_nodes() == G_air.number_of_nodes() - int(
    G_air.number_of_nodes() * 0.2
)
assert 0 <= _gcc_t <= 1.0

# Targeted should reduce giant component more than random
assert _gcc_t < _gcc_r, (
    f"Targeted ({_gcc_t:.3f}) should shrink GCC more than random ({_gcc_r:.3f})"
)
print(f"Random removal (20%): GCC = {_gcc_r:.3f}")
print(f"Targeted removal (20%): GCC = {_gcc_t:.3f}")
print("Targeted attack is more devastating — the robustness paradox!")
print("Section 6 passed!")

---
## Written Questions (15 pts)

### Question 1 (5 pts)

Why does the BA degree distribution look different from ER on a log-log plot?
What mechanism in the BA model causes the fat tail, and why can't ER produce one?

*Hints to guide your thinking:*
- *On a log-log plot, what shape does a Poisson distribution make? What about a power law?*
- *In ER, each edge exists independently with the same probability. What does this imply about the variance of degree?*
- *In BA, a node that arrives early accumulates connections over time. Why can't this "rich get richer" dynamic happen in ER, where all edges are assigned simultaneously?*

**Your Answer:**



### Question 2 (5 pts)

Consider a network with hubs (like the internet or airline routes).
- In a **random attack** (removing nodes at random), how resilient is the network?
- In a **targeted attack** (removing the highest-degree nodes first), how resilient is it?
- How would you design a network that is robust to both types of failure?

*Hints to guide your thinking:*
- *In a random attack, what fraction of removed nodes are likely to be hubs vs. low-degree nodes? (Think about the degree distribution.)*
- *In a targeted attack on a scale-free network, removing just the top 5-10% of nodes by degree can fragment the giant component. Why is this so effective?*
- *For robustness to both attack types, consider: what degree distribution avoids both the ER vulnerability (fragile to random failures at low density) and the BA vulnerability (fragile to targeted hub removal)?*

**Your Answer:**



### Question 3 (5 pts)

Why does targeted hub removal push the Molloy-Reed ratio (⟨k²⟩/⟨k⟩) below 2 faster than random removal?

*Hints to guide your thinking:*
- *The Molloy-Reed ratio depends heavily on ⟨k²⟩ — the average squared degree. Which nodes contribute most to this quantity?*
- *In a scale-free network, hubs have degree much larger than average. How does removing a node with degree 100 vs degree 5 affect ⟨k²⟩?*
- *Random removal mostly hits low-degree nodes (because there are many more of them). How much does removing a degree-5 node change ⟨k²⟩?*

**Your Answer:**

