# NetNeighbors: Domain Discovery Using CommonCrawl Webgraph

**High-Performance JVM Backend**

Discover related domains using link topology analysis from the CommonCrawl web graph.

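This notebook uses py4j to maintain a persistent JVM with the graph loaded in memory.
After the initial load (~5 seconds), queries are **nearly instant**: in the test runs below, validating 7 domains takes about 10 ms and a 6-seed discovery finishes in under a second.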

**Run the cells below in order to set up and use the discovery tool.**

In [1]:
# Step 1: Check RAM and set up the working directory
import psutil
import os

ram_gb = psutil.virtual_memory().total / (1024**3)
print(f"Available RAM: {ram_gb:.1f} GB")

if ram_gb < 20:
    print("\n⚠️ WARNING: You need Colab Pro for this notebook!")
    print("   Required: 20GB+ RAM")
    print(f"   You have: {ram_gb:.1f} GB")
    print("\n   Please enable High-RAM runtime:")
    print("   Runtime → Change runtime type → Runtime shape: High-RAM")
    raise Exception("Insufficient RAM. Please upgrade runtime.")
else:
    print("✅ Sufficient RAM available\n")

# Determine NetNeighbors location and set as working directory
if os.path.exists("/content"):
    # Colab environment
    if not os.path.exists("/content/NetNeighbors"):
        print("Cloning NetNeighbors repository...")
        !git clone --depth 1 https://github.com/PeterCarragher/NetNeighbors.git /content/NetNeighbors > /dev/null 2>&1
        print("✅ Repository cloned")
    else:
        print("✅ NetNeighbors repository already exists")
    os.chdir("/content/NetNeighbors")
else:
    # Local environment
    if os.path.exists("src/DiscoveryTool.java"):
        print("✅ Already in NetNeighbors directory")
    elif os.path.exists("NetNeighbors/src/DiscoveryTool.java"):
        os.chdir("NetNeighbors")
        print("✅ Changed to NetNeighbors submodule")
    else:
        raise Exception("Cannot find NetNeighbors directory.")

print(f"Working directory: {os.getcwd()}")

Available RAM: 31.0 GB
✅ Sufficient RAM available

✅ Changed to NetNeighbors submodule
Working directory: /home/peter/dev/apps/NetNeighborsColab/NetNeighbors


### Step 2: Run Environment Setup

Installs Java 17, Maven, py4j, and builds the cc-webgraph tools.

In [2]:
!bash scripts/setup.sh

# Install py4j for JVM bridge
!pip install -q py4j
print("\n✅ py4j installed")

           NetNeighbors Environment Setup
Base directory: /home/peter/dev/apps/NetNeighborsColab
NetNeighbors: /home/peter/dev/apps/NetNeighborsColab/NetNeighbors
Mode: local

1. Setting up Java 17 and Maven...
   ✅ Java and Maven already installed
openjdk version "17.0.15" 2025-04-15

2. Skipping gcsfuse (local mode, not needed)

3. Installing Python dependencies...
   ✅ Python dependencies installed

4. Setting up cc-webgraph...
   ✅ cc-webgraph already built

5. Setting up NetNeighbors...
   ✅ DiscoveryTool already compiled

                    Setup Complete!

Next steps:
  1. Download webgraph data (use utils.download_webgraph)
  2. Run verify.sh to confirm installation

✅ py4j installed


### Step 3: Configure Storage and Download Webgraph

Downloads pre-built graph files from CommonCrawl (~23GB total).

In [3]:
from utils import setup_storage, download_webgraph

# Webgraph version - see https://commoncrawl.org/web-graphs for available versions
VERSION = "cc-main-2024-feb-apr-may"

# GCS bucket name, e.g. "commoncrawl_webgraph"; leave as None for local storage
GCS_BUCKET = None
LOCAL_PATH = "/mnt/d/dev/data/cc/"

if GCS_BUCKET:
    # Colab only: authenticate so the bucket can be accessed
    from google.colab import auth
    auth.authenticate_user()
WEBGRAPH_DIR = setup_storage(bucket_name=GCS_BUCKET, webgraph_dir=LOCAL_PATH)

Using local storage: /mnt/d/dev/data/cc/


In [4]:
# Download webgraph files (skip if already downloaded)
download_webgraph(WEBGRAPH_DIR, VERSION)

Downloading CommonCrawl webgraph: cc-main-2024-feb-apr-may
Destination: /mnt/d/dev/data/cc/

Already exists: cc-main-2024-feb-apr-may-domain-vertices.txt.gz (889.2 MB)
Already exists: cc-main-2024-feb-apr-may-domain.graph (4298.4 MB)
Already exists: cc-main-2024-feb-apr-may-domain.properties (0.0 MB)
Already exists: cc-main-2024-feb-apr-may-domain-t.graph (4275.1 MB)
Already exists: cc-main-2024-feb-apr-may-domain-t.properties (0.0 MB)
Already exists: cc-main-2024-feb-apr-may-domain.stats (0.0 MB)

All graph files downloaded!

Building offset files (required for graph queries)...
Offsets already exist: cc-main-2024-feb-apr-may-domain.offsets
Offsets already exist: cc-main-2024-feb-apr-may-domain-t.offsets

Webgraph ready for use!
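
To confirm independently that the download step left everything in place, a small check like the one below can help. It is only a sketch: the file suffixes are copied from the download log above, and `EXPECTED_SUFFIXES` and `missing_graph_files` are hypothetical names that are not part of `utils`.

```python
from pathlib import Path

# Suffixes taken from the download log above (six downloaded files plus the
# two generated offset files). Hypothetical constant, not part of utils.
EXPECTED_SUFFIXES = [
    "-domain-vertices.txt.gz",
    "-domain.graph",
    "-domain.properties",
    "-domain-t.graph",
    "-domain-t.properties",
    "-domain.stats",
    "-domain.offsets",
    "-domain-t.offsets",
]


def missing_graph_files(webgraph_dir, version):
    """Return the expected file names that are not present in webgraph_dir."""
    base = Path(webgraph_dir)
    names = [f"{version}{suffix}" for suffix in EXPECTED_SUFFIXES]
    return [name for name in names if not (base / name).is_file()]
```

Calling `missing_graph_files(WEBGRAPH_DIR, VERSION)` should return an empty list once the download and offset build have finished.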


### Step 4: Initialize Graph Bridge (JVM Backend)

This starts a persistent JVM and loads the graph into memory.
**Takes ~5 seconds**, but then all queries are nearly instant!
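
For readers curious about what a persistent JVM bridge involves, here is a minimal py4j sketch. The internals of `graph_bridge` are not shown in this notebook, so this is not its implementation: `heap_flag`, `start_gateway`, and the 4 GB headroom are assumptions made for illustration. Only `launch_gateway`, `JavaGateway`, and `GatewayParameters` are real py4j names.

```python
def heap_flag(total_ram_gb: float, reserve_gb: float = 4.0) -> str:
    """Build a JVM -Xmx option that leaves some RAM for Python and the OS.

    The 4 GB reserve is an assumed default, not a value from this project.
    """
    usable = max(1, int(total_ram_gb - reserve_gb))
    return f"-Xmx{usable}g"


def start_gateway(jar_path: str, total_ram_gb: float):
    """Launch one JVM with the given JAR on its classpath and connect to it.

    py4j is imported lazily so this module loads even without py4j installed.
    """
    from py4j.java_gateway import GatewayParameters, JavaGateway, launch_gateway

    # launch_gateway starts a java subprocess and returns the port it listens on.
    port = launch_gateway(
        classpath=jar_path,
        javaopts=[heap_flag(total_ram_gb)],
        die_on_exit=True,  # stop the JVM when the Python process exits
    )
    return JavaGateway(gateway_parameters=GatewayParameters(port=port))
```

Because the JVM process stays alive between calls, anything loaded into it once (such as the graph) can be reused by every later query without paying the load cost again.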

In [5]:
from graph_bridge import GraphBridge

# Initialize and load graph (this is the ~5 second step)
bridge = GraphBridge(WEBGRAPH_DIR, VERSION)
bridge.load_graph()

print("\n" + "="*60)
print("🚀 Graph loaded! Queries are now instant.")
print("="*60)

Starting JVM with cc-webgraph...
JAR: /home/peter/dev/apps/NetNeighborsColab/cc-webgraph/target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar
Loading graph: /mnt/d/dev/data/cc/cc-main-2024-feb-apr-may-domain
This takes ~5 seconds...
✅ Graph loaded!
Subsequent queries will be nearly instant!

🚀 Graph loaded! Queries are now instant.


### Step 5: Quick Test

Let's verify the bridge is working with a quick query.

In [6]:
import time

# Test domain lookup (should be instant)
test_domains = ["cnn.com", "bbc.com", "foxnews.com", "100percentfedup.com", "nonexistentdomain.tld", "4chan.org", "911truth.org"]

def reverse_domain(domain: str) -> str:
    return '.'.join(reversed(domain.split('.')))

reversed_domains = [reverse_domain(d) for d in test_domains]

start = time.time()
found, not_found = bridge.validate_seeds(reversed_domains)
elapsed = time.time() - start

print(f"Validated {len(test_domains)} domains in {elapsed*1000:.1f}ms")
print(f"Found: {found}")
if not_found:
    print(f"Not found: {not_found}")

Validated 7 domains in 10.2ms
Found: ['com.cnn', 'com.bbc', 'com.foxnews', 'com.100percentfedup', 'org.4chan', 'org.911truth']
Not found: ['tld.nonexistentdomain']
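
As the output shows, the graph stores domains with their labels reversed (`cnn.com` becomes `com.cnn`), so seeds must be converted before lookup and results converted back for display. The round trip can be sketched as below. The helper names are hypothetical, and trimming whitespace and lowercasing the input is an assumption about how labels are stored, not something the notebook's `reverse_domain` does.

```python
def to_graph_label(domain: str) -> str:
    """Convert a normal domain to reversed notation, e.g. cnn.com -> com.cnn.

    Trimming and lowercasing are assumptions added for illustration.
    """
    return ".".join(reversed(domain.strip().lower().split(".")))


def from_graph_label(label: str) -> str:
    """Convert a reversed label back to a normal domain, e.g. com.cnn -> cnn.com."""
    return ".".join(reversed(label.split(".")))


# Reversing twice returns the original domain.
assert from_graph_label(to_graph_label("news.bbc.co.uk")) == "news.bbc.co.uk"
```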


In [7]:
# Example: Direct API usage
import time

# Run discovery
threshold = 6
start = time.time()
results = bridge.discover_backlinks(found, min_connections=threshold)
# results = bridge.shared_predecessors(found)
elapsed = time.time() - start

print(f"\nDiscovery completed in {elapsed:.2f} seconds")
print(f"Found {len(results)} domains with >= {threshold} connections")
# print("\nTop 10 results:")
# for r in results[:10]:
#     print(f"  {r['domain']}: {r['connections']} connections ({r['percentage']}%)")
results

Processing 6 seed domains...
Processed 6/6 seeds...
Found 569,203 unique neighbor domains
Found 46 domains with >= 6 connections

Discovery completed in 0.58 seconds
Found 46 domains with >= 6 connections


[{'domain': 'com.50webs', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.aifsy', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.amgreatness', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.angelfire', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.bitchute', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.blogspot', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.dailycaller', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.ericpetersautos', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.fc2', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.forumotion', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.globalseoarticles', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.henrymakow', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.hubpages', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.kingranks', 'connections': 6, 'percentage': 100.0},
 {'domain': 'com.pirdu', 'conn
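
The thresholding seen in this output can be illustrated on a toy graph. This is a sketch of the general idea only, not the Java `DiscoveryTool` code: `shared_neighbors` and the toy data are hypothetical, and the `percentage` formula (connections divided by the number of seeds) is inferred from the rows above, where 6 connections out of 6 seeds is reported as 100.0.

```python
from collections import Counter


def shared_neighbors(seed_links, min_connections):
    """Count how many seeds each neighbor is connected to and apply a threshold.

    seed_links maps each seed label to an iterable of neighbor labels.
    """
    counts = Counter()
    for neighbors in seed_links.values():
        # A neighbor counts once per seed, however many links it has to that seed.
        counts.update(set(neighbors))

    n_seeds = len(seed_links)
    rows = [
        {
            "domain": domain,
            "connections": n,
            "percentage": round(100.0 * n / n_seeds, 1),
        }
        for domain, n in counts.items()
        if n >= min_connections
    ]
    # Most-connected first, ties broken alphabetically.
    rows.sort(key=lambda row: (-row["connections"], row["domain"]))
    return rows


# Toy data: which domains link to each seed (hypothetical).
toy_backlinks = {
    "com.cnn": ["com.a", "com.b"],
    "com.bbc": ["com.a", "com.c"],
    "com.foxnews": ["com.a", "com.b", "com.b"],
}
print(shared_neighbors(toy_backlinks, min_connections=2))
```

Raising `min_connections` shrinks the result set quickly, which is why 569,203 unique neighbors reduce to 46 domains at a threshold of 6 in the run above.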

---

## Discovery Interface

Use the form below to discover related domains. Queries are **nearly instant** now that the graph is loaded!

In [11]:
!pip install -q gradio pandas
import gradio as gr
import pandas as pd
import time
import os
import traceback

def run_discovery(domains_text, min_connections, direction):
    # --- Validate input ---
    if not domains_text or not domains_text.strip():
        return "❌ Error: Please enter at least one domain", None

    seed_domains = [d.strip() for d in domains_text.splitlines() if d.strip()]
    seed_domains = [reverse_domain(d) for d in seed_domains]
    if len(seed_domains) == 0:
        return "❌ Error: Please enter at least one domain", None

    if len(seed_domains) > 10000:
        return "❌ Error: Maximum 10,000 domains allowed", None

    log_lines = []
    log_lines.append("Configuration:")
    log_lines.append(f"  Seed domains: {len(seed_domains)}")
    log_lines.append(f"  Direction: {direction}")
    log_lines.append(f"  Min connections: {min_connections}")
    log_lines.append("")

    try:
        # --- Run discovery ---
        start_time = time.time()
        results = bridge.discover(
            seed_domains=seed_domains,
            min_connections=min_connections,
            direction=direction
        )
        elapsed = time.time() - start_time

        log_lines.append(f"⏱️ Query completed in {elapsed:.2f} seconds")
        log_lines.append("")

        if len(results) == 0:
            log_lines.extend([
                "No Results Found",
                "",
                "Try:",
                "  - Lowering the minimum connections threshold",
                "  - Using different seed domains",
                "  - Switching between backlinks and outlinks"
            ])
            return "\n".join(log_lines), None

        # --- Convert to DataFrame ---
        df = pd.DataFrame(results)
        # Convert reversed graph labels (com.cnn) back to normal domains (cnn.com)
        df['domain'] = df['domain'].apply(reverse_domain)

        # --- Summary stats ---
        log_lines.append("=" * 60)
        log_lines.append("Summary Statistics:")
        log_lines.append(f"  Total discovered: {len(df):,} domains")
        log_lines.append(
            f"  Connections range: {df['connections'].min():.0f} - {df['connections'].max():.0f}"
        )
        log_lines.append(f"  Mean connections: {df['connections'].mean():.1f}")
        log_lines.append(f"  Median connections: {df['connections'].median():.0f}")
        log_lines.append("=" * 60)

        # --- Save CSV ---
        csv_path = "/content/results.csv" if os.path.exists("/content") else "results.csv"
        df.to_csv(csv_path, index=False)
        log_lines.append(f"\n💾 Results saved to {csv_path}")

        return "\n".join(log_lines), df

    except Exception as e:
        tb = traceback.format_exc()
        return f"❌ Error during discovery:\n{str(e)}\n\n{tb}", None


with gr.Blocks(title="News Source Discovery") as demo:
    gr.Markdown("## Seed Domains")
    gr.Markdown("Enter one domain per line:")

    domains_input = gr.Textbox(
        placeholder="cnn.com\nbbc.com\nfoxnews.com",
        lines=8,
        label=None
    )

    min_conn_slider = gr.Slider(
        minimum=1,
        maximum=100,
        value=5,
        step=1,
        label="Min Connections"
    )

    direction_radio = gr.Radio(
        choices=[
            ("Backlinks (who links TO the seeds)", "backlinks"),
            ("Outlinks (who the seeds link TO)", "outlinks"),
        ],
        value="backlinks",
        label="Direction"
    )

    run_button = gr.Button("🚀 Run Discovery")

    gr.Markdown("---")

    output_log = gr.Textbox(
        label="Run Log",
        lines=14,
        interactive=False
    )

    output_table = gr.Dataframe(
        label="Top Results (sortable)",
        interactive=False
    )

    run_button.click(
        fn=run_discovery,
        inputs=[domains_input, min_conn_slider, direction_radio],
        outputs=[output_log, output_table]
    )

demo.launch(inline=True, share=True)


* Running on local URL:  http://127.0.0.1:7862
* Running on public URL: https://74edbe394fe92582cb.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




---

## Cleanup

When done, you can shut down the JVM to free memory:

In [15]:
# Shut down the JVM and release the memory held by the graph
bridge.shutdown()
print("JVM shutdown complete")

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:39925)
Traceback (most recent call last):
  File "/home/peter/miniconda3/envs/net_neighbor/lib/python3.12/site-packages/py4j/java_gateway.py", line 982, in _get_connection
    connection = self.deque.pop()
                 ^^^^^^^^^^^^^^^^
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/peter/miniconda3/envs/net_neighbor/lib/python3.12/site-packages/py4j/java_gateway.py", line 1170, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused


JVM shutdown complete
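
In the run above, py4j logged a connection error during shutdown while the cell still went on to print its completion message. If you want cleanup to happen even when a query fails partway through, one option is to wrap the bridge in a context manager. The sketch below is an assumption-laden illustration: `shutting_down` is a hypothetical helper, and it relies only on the bridge object having the `shutdown()` method used in the cell above.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(bridge):
    """Yield the bridge and always call bridge.shutdown() afterwards."""
    try:
        yield bridge
    finally:
        try:
            bridge.shutdown()
        except Exception as exc:
            # The JVM may already be gone; report it instead of masking
            # whatever happened inside the with-block.
            print(f"Ignoring error during shutdown: {exc}")
```

Usage would look like `with shutting_down(bridge) as b: results = b.discover_backlinks(found, min_connections=6)`, after which the JVM is stopped whether or not the query raised.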
