# BDA Assignment - Nishant Kumar (2022326)

This Jupyter notebook:
- Sets up PySpark 3.2.4 + GraphFrames
- Loads `data/wiki-Vote.txt`
- Computes: nodes/edges, WCC/SCC, triangles, SNAP metrics (avg clustering & transitivity), effective diameter & approx diameter
- Saves results to `out/results_csv/results.csv`

### 1) Clean up previous Spark session

Stops any running `SparkSession` so the next cell starts clean.

In [51]:
try:
    spark.stop()
    print("Stopped previous SparkSession.")
except Exception:
    pass

Stopped previous SparkSession.


### 2) Set Java 11 (for Spark 3.2)

- Locate and export JDK **11** (`JAVA_HOME`, update `PATH`) for this session. The search uses the macOS JDK layout under `/Library/Java/JavaVirtualMachines`.
- Print `JAVA_HOME` and `java -version` to confirm **11.x**.

In [52]:
import glob
import os
import subprocess

# Try to auto-find a Java 11 home (macOS layout); fall back to the Temurin default path
cands = sorted(glob.glob("/Library/Java/JavaVirtualMachines/*11*.jdk/Contents/Home"))
java11_home = cands[0] if cands else "/Library/Java/JavaVirtualMachines/temurin-11.jdk/Contents/Home"

os.environ["JAVA_HOME"] = java11_home
# Prepend the JDK's bin directory so `java` resolves to version 11
os.environ["PATH"] = os.path.join(java11_home, "bin") + os.pathsep + os.environ["PATH"]

print("JAVA_HOME =", os.environ["JAVA_HOME"])
print(subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT).decode())

JAVA_HOME = /Library/Java/JavaVirtualMachines/temurin-11.jdk/Contents/Home
openjdk version "11.0.28" 2025-07-15
OpenJDK Runtime Environment Temurin-11.0.28+6 (build 11.0.28+6)
OpenJDK 64-Bit Server VM Temurin-11.0.28+6 (build 11.0.28+6, mixed mode)
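
The check above is done by eye. A minimal sketch of doing it programmatically, assuming the usual `version "..."` line in the `java -version` output; the helper name `java_major_version` is hypothetical and not part of the notebook:

```python
import re

def java_major_version(version_output: str) -> int:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    major = int(match.group(1))
    # Java 8 and earlier report e.g. "1.8.0_292"; the real major is the second field.
    if major == 1 and match.group(2) is not None:
        major = int(match.group(2))
    return major

sample = 'openjdk version "11.0.28" 2025-07-15'
print(java_major_version(sample))
```

In the notebook this could be followed by an `assert java_major_version(...) == 11` on the decoded `subprocess` output, so a wrong JDK fails fast instead of surfacing later as a Spark startup error.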



### 3) Install PySpark (if needed)

- Install **PySpark 3.2.4** only if missing (compatible with Spark 3.2 + GraphFrames).
- Prints a ready message when done.

In [53]:
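import importlib.util
import subprocess
import sys

# importlib.util.find_spec replaces pkgutil.find_loader, which is deprecated since Python 3.12
if importlib.util.find_spec("pyspark") is None:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "pyspark==3.2.4"])
print("PySpark ready.")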

PySpark ready.


### 4) Start SparkSession (GraphFrames + Java serializer)
- Start Spark with **GraphFrames 0.8.2** (Spark 3.2 / Scala 2.12).
- Force **JavaSerializer**.
- Set log level `WARN` and checkpoint dir `out/chkpt` (GraphFrames' `connectedComponents` needs a checkpoint directory).
- Print Spark version and serializer.

In [54]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("BDA-WikiVote-Notebook")
    .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12")
    .config("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
    .getOrCreate()
)

sc = spark.sparkContext
sc.setLogLevel("WARN")
sc.setCheckpointDir("out/chkpt")

print("Spark version:", spark.version)
print("Serializer   :", spark.conf.get("spark.serializer"))

Spark version: 3.2.4
Serializer   : org.apache.spark.serializer.JavaSerializer


### 5) Helper functions (load/build/metrics)
- `load_graph`: read `wiki-Vote.txt`, drop comments/dups/self-loops → (V,E) directed.
- `graphframe`: thin wrapper → `GraphFrame(V,E)`.
- `compute_wcc` / `compute_scc`: largest WCC/SCC sizes on **directed** graph.
- `make_undirected`: `(least, greatest)` + dedup → simple undirected (Vu,Eu).
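- `triangles_and_clustering_snap`: on the **undirected** graph, computes per-vertex triangle counts \(t_i\), the average local clustering \(C_i = t_i / \binom{d_i}{2}\) (taken as 0 when \(d_i < 2\)), and the triangle-to-wedge ratio \(T / \sum_i \binom{d_i}{2}\) with \(T=\sum_i t_i/3\). The textbook transitivity is three times this ratio.
- `sample_effective_diameter_and_diameter`: on the largest undirected WCC, estimates the effective diameter (90% quantile of sampled shortest-path lengths) and an approximate diameter via double-sweep.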
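
To make the clustering formulas concrete, here is a small pure-Python sketch on a toy graph (a triangle with one pendant vertex). It is only an illustration of the definitions, not the Spark implementation; `clustering_stats` is a hypothetical name.

```python
from itertools import combinations

def clustering_stats(edges):
    """Return (triangles, avg_local_clustering, triangles_per_wedge)
    for a simple undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    sum_t, wedges, local = 0, 0, []
    for node, nbrs in adj.items():
        d = len(nbrs)
        # t_i: neighbor pairs of this node that are themselves connected
        t_i = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        w_i = d * (d - 1) // 2  # wedges centered at this node
        sum_t += t_i
        wedges += w_i
        local.append(t_i / w_i if d >= 2 else 0.0)

    triangles = sum_t // 3  # every triangle is counted once per corner
    avg_c = sum(local) / len(local) if local else 0.0
    ratio = triangles / wedges if wedges else 0.0
    return triangles, avg_c, ratio

# Triangle 1-2-3 plus a pendant vertex 4 attached to 3
print(clustering_stats([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

Here vertices 1 and 2 have \(C_i = 1\), vertex 3 has \(C_i = 1/3\), and vertex 4 contributes 0, so the average is 7/12; there is one triangle over five wedges.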

In [55]:
from pyspark.sql import functions as F

def load_graph(spark, path):
    raw = spark.read.text(path)
    lines = raw.select("value").filter(~F.col("value").startswith("#"))
    parts = lines.select(F.split(F.col("value"), r"\s+").alias("p"))
    edges = parts.select(
        F.col("p").getItem(0).cast("long").alias("src"),
        F.col("p").getItem(1).cast("long").alias("dst"),
    ).dropna()
    edges = edges.filter(F.col("src") != F.col("dst")).dropDuplicates()
    vertices = (edges.select(F.col("src").alias("id"))
                     .union(edges.select(F.col("dst").alias("id")))
                     .distinct())
    return vertices, edges

def graphframe(V, E):
    from graphframes import GraphFrame
    return GraphFrame(V, E)

def compute_wcc(g):
    wcc = g.connectedComponents()
    top = wcc.groupBy("component").count().orderBy(F.desc("count")).first()
    largest_nodes = top["count"]; main = top["component"]
    wsrc = wcc.select(F.col("id").alias("src"), F.col("component").alias("csrc"))
    wdst = wcc.select(F.col("id").alias("dst"), F.col("component").alias("cdst"))
    e = g.edges.join(wsrc, "src").join(wdst, "dst")
    largest_edges = e.filter((F.col("csrc")==F.col("cdst")) & (F.col("csrc")==F.lit(main))).count()
    return largest_nodes, largest_edges

def compute_scc(g):
    scc = g.stronglyConnectedComponents(maxIter=50)
    top = scc.groupBy("component").count().orderBy(F.desc("count")).first()
    largest_nodes = top["count"]; main = top["component"]
    ssrc = scc.select(F.col("id").alias("src"), F.col("component").alias("csrc"))
    sdst = scc.select(F.col("id").alias("dst"), F.col("component").alias("cdst"))
    e = g.edges.join(ssrc, "src").join(sdst, "dst")
    largest_edges = e.filter((F.col("csrc")==F.col("cdst")) & (F.col("csrc")==F.lit(main))).count()
    return largest_nodes, largest_edges

def make_undirected(V, E):
    U = E.select(F.least("src","dst").alias("src"),
                 F.greatest("src","dst").alias("dst")).dropDuplicates()
    used = U.select(F.col("src").alias("id")).union(U.select(F.col("dst").alias("id"))).distinct()
    return used, U

def triangles_and_clustering_snap(g_directed, g_undirected):
    tc = g_undirected.triangleCount()      
    deg_u = g_undirected.degrees

    stats = (tc.select('id', F.col('count').alias('t'))
               .join(deg_u, 'id', 'outer')
               .na.fill({'t':0, 'degree':0}))

    # Local clustering C_i
    stats = (stats
        .withColumn('den', F.when(F.col('degree')>=2,
                                  (F.col('degree')*(F.col('degree')-1))/2
                                 ).otherwise(F.lit(1)))
        .withColumn('C_i', F.when(F.col('degree')>=2, F.col('t')/F.col('den'))
                              .otherwise(F.lit(0.0)))
    )
    avg_cluster = stats.agg(F.mean('C_i')).first()[0] or 0.0

    # Global (SNAP) using UNDIRECTED wedges, and triangles = sum(t_i)/3
    W_u = deg_u.select(((F.col('degree')*(F.col('degree')-1))/2).alias('w')).agg(F.sum('w')).first()[0] or 0.0
    closed_triplets = stats.agg(F.sum('t').alias('sum_t')).first()['sum_t'] or 0.0
    num_triangles = float(closed_triplets) / 3.0
    frac_closed = 0.0 if W_u == 0 else (num_triangles / float(W_u))

    return int(num_triangles), float(avg_cluster), float(frac_closed)

def sample_effective_diameter_and_diameter(g_u, sample_seeds=200, double_sweeps=10, seed=42):
    from graphframes import GraphFrame
    wcc = g_u.connectedComponents()
    main = wcc.groupBy("component").count().orderBy(F.desc("count")).first()["component"]
    v_sub = wcc.filter(F.col("component")==F.lit(main)).select("id")
    e_sub = (g_u.edges.join(v_sub.select(F.col("id").alias("src")),"src")
                        .join(v_sub.select(F.col("id").alias("dst")),"dst"))
    g_sub = GraphFrame(v_sub, e_sub)

    sample_ids = [r["id"] for r in v_sub.orderBy(F.rand(seed)).limit(sample_seeds).collect()]
    if not sample_ids:
        return 0.0, 0

    sp = g_sub.shortestPaths(landmarks=sample_ids)
    pairs = sp.selectExpr("id", "explode(distances) as (lm, dist)")
    dist_df = pairs.filter((F.col("dist").isNotNull()) & (F.col("dist") > 0))
    eff_diam = dist_df.approxQuantile("dist", [0.90], 0.01)[0]

    def farthest_from(src):
        sp1 = g_sub.shortestPaths(landmarks=[src])
        d1 = sp1.select("id", F.col("distances")[F.lit(src)].alias("d")).filter(F.col("d").isNotNull())
        row = d1.orderBy(F.desc("d")).first()
        return (row["id"], int(row["d"])) if row else (src, 0)

    approx_diam = 0
    for s in [r["id"] for r in v_sub.orderBy(F.rand(seed+123)).limit(double_sweeps).collect()]:
        u,_ = farthest_from(s)
        v,dist = farthest_from(u)
        approx_diam = max(approx_diam, dist)

    return float(eff_diam), int(approx_diam)
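
The double-sweep idea used in `farthest_from` can be sketched without Spark: run a BFS from a start vertex, jump to the farthest vertex found, and run a second BFS from there. The second distance is a lower bound on the diameter. The adjacency-dict input and the names `bfs_farthest` / `double_sweep` are assumptions made for this illustration.

```python
from collections import deque

def bfs_farthest(adj, source):
    """Return (node, distance) of a vertex farthest from `source` by hop count."""
    dist = {source: 0}
    queue = deque([source])
    far_node, far_dist = source, 0
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if dist[v] > far_dist:
                    far_node, far_dist = v, dist[v]
                queue.append(v)
    return far_node, far_dist

def double_sweep(adj, start):
    """Lower bound on the diameter: BFS from start, then BFS from the farthest vertex."""
    u, _ = bfs_farthest(adj, start)
    _, lower_bound = bfs_farthest(adj, u)
    return lower_bound

# Path graph 0-1-2-3-4; starting in the middle still finds the full length
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(double_sweep(path, 2))  # prints 4
```

Repeating the sweep from several random starts and keeping the maximum, as the notebook does with `double_sweeps`, tightens the bound on graphs that are not trees.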


### 6) Load edges & build graphs
- `load_graph` → read `wiki-Vote.txt`, drop comments/self-loops/dups → (V,E) **directed**.
- `g_dir` = GraphFrame(V,E) for WCC/SCC & original counts.
- `make_undirected` → (min(src,dst), max(src,dst)) + dedup → (Vu,Eu).
- `g_und` = GraphFrame(Vu,Eu) for triangles, clustering, diameters.

In [56]:
EDGES_PATH = "data/wiki-Vote.txt"

V, E = load_graph(spark, EDGES_PATH)
g_dir = graphframe(V, E)

Vu, Eu = make_undirected(V, E)
g_und = graphframe(Vu, Eu)
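
For intuition, the same canonicalization that `make_undirected` performs with `least`/`greatest` can be written in plain Python. This is a toy sketch on a hand-made edge list, with a hypothetical function name:

```python
def to_simple_undirected(directed_edges):
    """Collapse directed edges into unique (min, max) pairs, dropping self-loops."""
    undirected = set()
    for src, dst in directed_edges:
        if src == dst:
            continue  # self-loops carry no undirected edge
        undirected.add((min(src, dst), max(src, dst)))
    return sorted(undirected)

edges = [(3, 1), (1, 3), (2, 5), (5, 2), (4, 4), (1, 2)]
print(to_simple_undirected(edges))  # prints [(1, 2), (1, 3), (2, 5)]
```

Reciprocal votes such as `(3, 1)` and `(1, 3)` collapse into one undirected edge, which is why `Eu` has fewer rows than `E`.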

### 7) Metrics & table
- Use the ground truth (GT) only for **comparison**; it is not used in any computation.
- Counts + WCC/SCC (directed).
- Triangles, clustering, transitivity (undirected).
- Effective & approx diameter.
- Build `rows` for display/export.
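
A note on the effective diameter: hop counts are integers, so a plain 90% quantile over them is an integer too, whereas reference values are often fractional. One common way to get a fractional value is to interpolate linearly on the cumulative hop-count distribution. The sketch below shows that idea on a made-up histogram; whether the reference value was produced exactly this way is an assumption, and `effective_diameter` is a hypothetical helper.

```python
def effective_diameter(hop_counts, q=0.9):
    """Interpolated q-quantile of hop distances.

    hop_counts maps a hop distance h (>= 1) to the number of sampled
    vertex pairs at exactly that distance.
    """
    total = sum(hop_counts.values())
    if total == 0:
        return 0.0
    target = q * total
    cumulative = 0
    for h in sorted(hop_counts):
        previous = cumulative
        cumulative += hop_counts[h]
        if cumulative >= target:
            # Linear interpolation between h-1 and h on the cumulative counts
            return (h - 1) + (target - previous) / (cumulative - previous)
    return float(max(hop_counts))

# Made-up histogram: 80% of pairs within 3 hops, 100% within 4
toy = {1: 10, 2: 30, 3: 40, 4: 20}
print(effective_diameter(toy))
```

On the toy histogram the 90% mark falls halfway between 3 and 4 hops, so the interpolated value is 3.5, while an uninterpolated quantile would report 4.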

In [57]:
GROUND_TRUTH = {
    "Nodes": 7115, "Edges": 103689,
    "Largest WCC (nodes)": 7066, "Largest WCC (edges)": 103663,
    "Largest SCC (nodes)": 1300, "Largest SCC (edges)": 39456,
    "Avg clustering coefficient": 0.1409,
    "Number of triangles": 608389,
    "Fraction of closed triangles": 0.04564,
    "Diameter": 7, "Effective diameter (90%)": 3.8
}

num_nodes, num_edges = V.count(), E.count()
wcc_nodes, wcc_edges = compute_wcc(g_dir)
scc_nodes, scc_edges = compute_scc(g_dir)

tri_total, avg_cluster, frac_closed = triangles_and_clustering_snap(g_dir, g_und)
eff_diam, approx_diam = sample_effective_diameter_and_diameter(g_und, sample_seeds=200, double_sweeps=10, seed=42)

rows = [
    ("Nodes", num_nodes, GROUND_TRUTH["Nodes"]),
    ("Edges", num_edges, GROUND_TRUTH["Edges"]),
    ("Largest WCC (nodes)", wcc_nodes, GROUND_TRUTH["Largest WCC (nodes)"]),
    ("Largest WCC (edges)", wcc_edges, GROUND_TRUTH["Largest WCC (edges)"]),
    ("Largest SCC (nodes)", scc_nodes, GROUND_TRUTH["Largest SCC (nodes)"]),
    ("Largest SCC (edges)", scc_edges, GROUND_TRUTH["Largest SCC (edges)"]),
    ("Avg clustering coefficient", round(avg_cluster, 4), GROUND_TRUTH["Avg clustering coefficient"]),
    ("Number of triangles", tri_total, GROUND_TRUTH["Number of triangles"]),
    ("Fraction of closed triangles", round(frac_closed, 5), GROUND_TRUTH["Fraction of closed triangles"]),
    ("Diameter (approx)", approx_diam, GROUND_TRUTH["Diameter"]),
    ("Effective diameter (90%)", round(eff_diam, 2), GROUND_TRUTH["Effective diameter (90%)"]),
]
rows

25/09/20 23:00:04 WARN BlockManager: Block rdd_2210_0 already exists on this machine; not re-adding it
25/09/20 23:00:10 WARN BlockManager: Block rdd_3583_0 already exists on this machine; not re-adding it
25/09/20 23:00:11 WARN BlockManager: Block rdd_3737_0 already exists on this machine; not re-adding it
25/09/20 23:00:12 WARN BlockManager: Block rdd_3937_0 already exists on this machine; not re-adding it


[('Nodes', 7115, 7115),
 ('Edges', 103689, 103689),
 ('Largest WCC (nodes)', 7066, 7066),
 ('Largest WCC (edges)', 103663, 103663),
 ('Largest SCC (nodes)', 1300, 1300),
 ('Largest SCC (edges)', 39456, 39456),
 ('Avg clustering coefficient', 0.1409, 0.1409),
 ('Number of triangles', 608389, 608389),
 ('Fraction of closed triangles', 0.04183, 0.04564),
 ('Diameter (approx)', 7, 7),
 ('Effective diameter (90%)', 4.0, 3.8)]
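
The one visible gap is the fraction of closed triangles (0.04183 vs 0.04564). The helper divides triangles by all wedges, \(f = T / W\). The reference value is consistent with a different normalization: closed triads over closed plus open triads, \(T / (W - 2T)\), since each triangle closes three wedges and \(W - 3T\) wedges stay open. That this is the definition behind the ground truth is an assumption; the arithmetic below only shows that the numbers line up. Dividing numerator and denominator by \(W\) gives \(f / (1 - 2f)\):

```python
def closed_fraction_from_wedge_ratio(f):
    """Convert f = T / W into T / (W - 2T) without needing T or W themselves."""
    return f / (1.0 - 2.0 * f)

reported = 0.04183  # T / W as printed by the notebook (already rounded)
converted = closed_fraction_from_wedge_ratio(reported)
print(f"{converted:.4f}")  # prints 0.0456
```

The converted value agrees with 0.04564 to about four decimal places; the remaining difference is within what rounding the input to five places can explain.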

### 8) Summary table
- Build a pandas DataFrame from `rows` with columns Metric, Ground Truth, Your Compute, Notes on Difference.
- Format ints (commas) and floats (4 to 5 decimal places, 1 for the effective diameter); add GT ratios for WCC/SCC.
- Label each row as Exact match, Close match (within an absolute tolerance), or Mismatch, with a short reason.
- Display under **Summary Table**.
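
The cell below decides "exact" by comparing `str(comp) == str(gt)`, which treats `7` and `7.0` as different. A numeric comparison avoids that. The following is a minimal sketch of the same three-way labelling with a hypothetical `classify` helper; it is an alternative, not what the notebook runs.

```python
import math

def classify(computed, ground_truth, tol=0.0):
    """Label a metric as exact, close (within absolute tolerance `tol`), or mismatched."""
    if math.isclose(computed, ground_truth, rel_tol=0.0, abs_tol=1e-12):
        return "Exact match."
    if abs(computed - ground_truth) <= tol:
        return "Close match."
    return "Mismatch."

print(classify(7, 7.0))                 # prints Exact match.
print(classify(0.04183, 0.04564, 0.01))  # prints Close match.
print(classify(4.0, 3.8, 0.1))           # prints Mismatch.
```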

In [58]:
from IPython.display import display, Markdown
import pandas as pd
import math

# rows must already exist: [(Metric, Computed, GroundTruth), ...]
gt_map = {m: gt for (m, _, gt) in rows}
gt_nodes, gt_edges = gt_map["Nodes"], gt_map["Edges"]

def fmt(metric, val, is_gt=False):
    int_metrics = {
        "Nodes","Edges","Largest WCC (nodes)","Largest WCC (edges)",
        "Largest SCC (nodes)","Largest SCC (edges)",
        "Number of triangles","Diameter (approx)"
    }
    # base formatting
    if metric in int_metrics:
        text = f"{int(val):,}"
    else:
        if metric == "Avg clustering coefficient":
            text = f"{float(val):.4f}"
        elif metric == "Fraction of closed triangles":
            text = f"{float(val):.5f}"
        elif metric == "Effective diameter (90%)":
            text = f"{float(val):.1f}"
        else:
            text = f"{float(val):.4f}"
    # GT ratios like the sample image
    if is_gt:
        if metric == "Largest WCC (nodes)":
            text += f" ({val/gt_nodes:.3f})"
        elif metric == "Largest WCC (edges)":
            text += f" ({val/gt_edges:.3f})"
        elif metric == "Largest SCC (nodes)":
            text += f" ({val/gt_nodes:.3f})"
        elif metric == "Largest SCC (edges)":
            text += f" ({val/gt_edges:.3f})"
    return text

# simple absolute tolerances for "close match"
tols = {
    "Avg clustering coefficient": 0.01,
    "Fraction of closed triangles": 0.01,
    "Effective diameter (90%)": 0.4,
    "Diameter (approx)": 1.0,
}

def note_for(metric, comp, gt):
    same = str(comp) == str(gt)
    if metric in {"Nodes", "Edges"}:
        return ("Exact match. Parsed correctly." if same
                else "Mismatch. Check parsing/dedup.")
    if metric.startswith("Largest WCC"):
        return ("Exact match. Components computed correctly." if same
                else "Mismatch. WCC calculation differs.")
    if metric.startswith("Largest SCC"):
        return ("Exact match." if same
                else "Mismatch. SCC calculation differs.")
    if metric == "Number of triangles":
        return ("Exact match." if same
                else "Mismatch. Use undirected triangleCount and divide by 3.")
    if metric in tols:
        delta = abs(float(comp) - float(gt))
        if same:
            prefix = "Exact match."
        elif delta <= tols[metric]:
            prefix = "Close match."
        else:
            prefix = "Mismatch."
        # short metric-specific reasons
        reasons = {
            "Avg clustering coefficient": "Same SNAP definition and precision." if same else
                                          ("Minor rounding/sampling difference." if delta <= tols[metric]
                                           else "Definition/precision differs."),
            "Fraction of closed triangles": "Same T/∑wedges formula." if same else
                                            ("Using T/∑wedges on undirected wedges." if delta <= tols[metric]
                                             else "Formula/normalization differs."),
            "Effective diameter (90%)": "90% quantile matches." if same else
                                        ("90% quantile from sampled shortest paths." if delta <= tols[metric]
                                         else "Sampling/seed makes it differ."),
            "Diameter (approx)": "Double-sweep matches." if same else
                                 ("Double-sweep heuristic may be off by ~1." if delta <= tols[metric]
                                  else "Heuristic farthest-point search differs."),
        }
        return f"{prefix} {reasons[metric]}"
    return "."

# build display table
table_rows = []
for metric, computed, gt in rows:
    table_rows.append({
        "Metric": metric,
        "Ground Truth": fmt(metric, gt, is_gt=True),
        "Your Compute": fmt(metric, computed, is_gt=False),
        "Notes on Difference": note_for(metric, computed, gt),
    })

summary_df = pd.DataFrame(table_rows, columns=["Metric","Ground Truth","Your Compute","Notes on Difference"])
display(Markdown("## Summary Table"))
summary_df

## Summary Table

| | Metric | Ground Truth | Your Compute | Notes on Difference |
|---|---|---|---|---|
| 0 | Nodes | 7115 | 7115.0 | Exact match. Parsed correctly. |
| 1 | Edges | 103689 | 103689.0 | Exact match. Parsed correctly. |
| 2 | Largest WCC (nodes) | 7,066 (0.993) | 7066.0 | Exact match. Components computed correctly. |
| 3 | Largest WCC (edges) | 103,663 (1.000) | 103663.0 | Exact match. Components computed correctly. |
| 4 | Largest SCC (nodes) | 1,300 (0.183) | 1300.0 | Exact match. |
| 5 | Largest SCC (edges) | 39,456 (0.381) | 39456.0 | Exact match. |
| 6 | Avg clustering coefficient | 0.1409 | 0.1409 | Exact match. Same SNAP definition and precision. |
| 7 | Number of triangles | 608389 | 608389.0 | Exact match. |
| 8 | Fraction of closed triangles | 0.04564 | 0.04183 | Close match. Using T/∑wedges on undirected wed... |
| 9 | Diameter (approx) | 7 | 7.0 | Exact match. Double-sweep matches. |


### 9) Save results to CSV
- Build DF (Metric / Computed / Ground Truth) + Notes.
- Ensure `out/results_csv/` exists; write `results.csv`.
- Return `(csv_path, df)` for a quick preview.

In [59]:
import pandas as pd, os

df = pd.DataFrame(rows, columns=["Metric","Computed","Ground Truth"])

def _note(r):
    approx = {"Fraction of closed triangles", "Effective diameter (90%)"}
    if r["Metric"] in approx:
        return "Approximation; small rounding expected."
    return "Matches." if str(r["Computed"]) == str(r["Ground Truth"]) else "Minor drift."

df["Notes"] = df.apply(_note, axis=1)

os.makedirs("out/results_csv", exist_ok=True)
csv_path = "out/results_csv/results.csv"
df.to_csv(csv_path, index=False)
csv_path, df

('out/results_csv/results.csv',
                           Metric      Computed  Ground Truth  \
 0                          Nodes    7115.00000    7115.00000   
 1                          Edges  103689.00000  103689.00000   
 2            Largest WCC (nodes)    7066.00000    7066.00000   
 3            Largest WCC (edges)  103663.00000  103663.00000   
 4            Largest SCC (nodes)    1300.00000    1300.00000   
 5            Largest SCC (edges)   39456.00000   39456.00000   
 6     Avg clustering coefficient       0.14090       0.14090   
 7            Number of triangles  608389.00000  608389.00000   
 8   Fraction of closed triangles       0.04183       0.04564   
 9              Diameter (approx)       7.00000       7.00000   
 10      Effective diameter (90%)       4.00000       3.80000   
 
                                       Notes  
 0                                  Matches.  
 1                                  Matches.  
 2                                  Matches. 

### 10) Clean shutdown
- Stop the active **SparkSession** (and JVM).
- Frees resources; safe to run multiple times.

In [60]:
spark.stop()
print("Spark stopped.")

Spark stopped.
