# NetNeighbors: Domain Discovery Using CommonCrawl Webgraph

**High-Performance JVM Backend**

Discover related domains using link topology analysis from the CommonCrawl web graph.

This notebook uses py4j to maintain a persistent JVM with the graph loaded in memory.
After the initial load (~5 seconds), queries are **nearly instant**: the examples below complete in tens of milliseconds.

**Run the cells below in order to set up and use the discovery tool.**

In [1]:
# Step 1: Check RAM and setup working directory
import psutil
import os

ram_gb = psutil.virtual_memory().total / (1024**3)
print(f"Available RAM: {ram_gb:.1f} GB")

if ram_gb < 20:
    print("\n⚠️ WARNING: You need Colab Pro for this notebook!")
    print("   Required: 20GB+ RAM")
    print(f"   You have: {ram_gb:.1f} GB")
    print("\n   Please enable High-RAM runtime:")
    print("   Runtime → Change runtime type → Runtime shape: High-RAM")
    raise Exception("Insufficient RAM. Please upgrade runtime.")
else:
    print("✅ Sufficient RAM available\n")

# Determine NetNeighbors location and set as working directory
if os.path.exists("/content"):
    # Colab environment
    if not os.path.exists("/content/NetNeighbors"):
        print("Cloning NetNeighbors repository...")
        !git clone --depth 1 https://github.com/PeterCarragher/NetNeighbors.git /content/NetNeighbors > /dev/null 2>&1
        print("✅ Repository cloned")
    else:
        print("✅ NetNeighbors repository already exists")
    os.chdir("/content/NetNeighbors")
else:
    # Local environment
    if os.path.exists("src/DiscoveryTool.java"):
        print("✅ Already in NetNeighbors directory")
    elif os.path.exists("NetNeighbors/src/DiscoveryTool.java"):
        os.chdir("NetNeighbors")
        print("✅ Changed to NetNeighbors submodule")
    else:
        raise Exception("Cannot find NetNeighbors directory.")

print(f"Working directory: {os.getcwd()}")

Available RAM: 31.0 GB
✅ Sufficient RAM available

✅ Changed to NetNeighbors submodule
Working directory: /home/peter/dev/apps/NetNeighborsColab/NetNeighbors


### Step 2: Run Environment Setup

Installs Java 17, Maven, py4j, and builds the cc-webgraph tools.

In [2]:
!bash scripts/setup.sh

# Install py4j for JVM bridge
!pip install -q py4j
print("\n✅ py4j installed")

           NetNeighbors Environment Setup
Base directory: /home/peter/dev/apps/NetNeighborsColab
NetNeighbors: /home/peter/dev/apps/NetNeighborsColab/NetNeighbors
Mode: local

1. Setting up Java 17 and Maven...
   ✅ Java and Maven already installed
openjdk version "17.0.15" 2025-04-15

2. Skipping gcsfuse (local mode, not needed)

3. Installing Python dependencies...
   ✅ Python dependencies installed

4. Setting up cc-webgraph...
   ✅ cc-webgraph already built

5. Setting up NetNeighbors...
   ✅ DiscoveryTool already compiled

                    Setup Complete!

Next steps:
  1. Download webgraph data (use utils.download_webgraph)
  2. Run verify.sh to confirm installation

✅ py4j installed
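
If the setup script fails, a quick check that the Java and Maven executables are on `PATH` can narrow down the cause. The sketch below uses only the standard library. The helper name `find_tools` is hypothetical and not part of NetNeighbors, and `scripts/setup.sh` remains the authoritative installer.

```python
import shutil


def find_tools(names=("java", "mvn")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report which of the required tools could not be located.
missing = [name for name, path in find_tools().items() if path is None]
if missing:
    print(f"Missing from PATH: {', '.join(missing)}")
else:
    print("java and mvn found on PATH")
```

This only confirms that the executables can be found. It does not check that the Java version is 17.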


### Step 3: Configure Storage and Download Webgraph

Downloads pre-built graph files from CommonCrawl (~23GB total).

In [3]:
from utils import setup_storage, download_webgraph

# Webgraph version - see https://commoncrawl.org/web-graphs for available versions
VERSION = "cc-main-2024-feb-apr-may"

# Enter GCS bucket name (or leave as None for local storage)
GCS_BUCKET = None  # e.g., "commoncrawl_webgraph" or "my-webgraph-bucket"
LOCAL_PATH = "/mnt/d/dev/data/cc/"

if GCS_BUCKET:
    from google.colab import auth
    auth.authenticate_user()
WEBGRAPH_DIR = setup_storage(bucket_name=GCS_BUCKET, webgraph_dir=LOCAL_PATH)

Using local storage: /mnt/d/dev/data/cc/


In [4]:
# Download webgraph files (skip if already downloaded)
download_webgraph(WEBGRAPH_DIR, VERSION)

Downloading CommonCrawl webgraph: cc-main-2024-feb-apr-may
Destination: /mnt/d/dev/data/cc/

Already exists: cc-main-2024-feb-apr-may-domain-vertices.txt.gz (889.2 MB)
Already exists: cc-main-2024-feb-apr-may-domain.graph (4298.4 MB)
Already exists: cc-main-2024-feb-apr-may-domain.properties (0.0 MB)
Already exists: cc-main-2024-feb-apr-may-domain-t.graph (4275.1 MB)
Already exists: cc-main-2024-feb-apr-may-domain-t.properties (0.0 MB)
Already exists: cc-main-2024-feb-apr-may-domain.stats (0.0 MB)

All graph files downloaded!

Building offset files (required for graph queries)...
Offsets already exist: cc-main-2024-feb-apr-may-domain.offsets
Offsets already exist: cc-main-2024-feb-apr-may-domain-t.offsets

Webgraph ready for use!
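
The output above lists eight files per graph version: six downloads and two generated offset files. As a rough sanity check, the sketch below reports which of them are absent from a directory. The suffix list is copied from that output. The helper name `missing_webgraph_files` is hypothetical, and `download_webgraph` may track additional files that are not shown here.

```python
from pathlib import Path

# File suffixes taken from the download output above (an assumption that
# this list is complete for the domain-level graph).
SUFFIXES = [
    "domain-vertices.txt.gz",
    "domain.graph",
    "domain.properties",
    "domain-t.graph",
    "domain-t.properties",
    "domain.stats",
    "domain.offsets",
    "domain-t.offsets",
]


def missing_webgraph_files(webgraph_dir, version):
    """Return the expected file names that do not exist in webgraph_dir."""
    base = Path(webgraph_dir)
    names = [f"{version}-{suffix}" for suffix in SUFFIXES]
    return [name for name in names if not (base / name).is_file()]


missing = missing_webgraph_files("/mnt/d/dev/data/cc/", "cc-main-2024-feb-apr-may")
print(f"{len(missing)} of {len(SUFFIXES)} expected files are missing")
```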


### Step 4: Initialize Graph Bridge (JVM Backend)

This starts a persistent JVM and loads the graph into memory.
**Takes ~5 seconds**, but then all queries are nearly instant!

In [None]:
from graph_bridge import GraphBridge

# Initialize and load graph (this is the ~5 second step)
bridge = GraphBridge(WEBGRAPH_DIR, VERSION)
bridge.load_graph()

print("\n" + "="*60)
print("🚀 Graph loaded! Queries are now instant.")
print("="*60)

Starting JVM with cc-webgraph...
JAR: /home/peter/dev/apps/NetNeighborsColab/cc-webgraph/target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar
Loading graph: /mnt/d/dev/data/cc/cc-main-2024-feb-apr-may-domain
This takes ~5 seconds...
✅ Graph loaded!
Subsequent queries will be nearly instant!

🚀 Graph loaded! Queries are now instant.


### Step 5: Quick Test

Let's verify the bridge is working with a quick query.

In [17]:
import time

# Test domain lookup (should be instant)
test_domains = ["cnn.com", "bbc.com", "foxnews.com", "100percentfedup.com", "nonexistentdomain.tld", "4chan.org", "911truth.org"]

def reverse_domain(domain: str) -> str:
    return '.'.join(reversed(domain.split('.')))

reversed_domains = [reverse_domain(d) for d in test_domains]

start = time.time()
found, not_found = bridge.validate_seeds(reversed_domains)
elapsed = time.time() - start

print(f"Validated {len(test_domains)} domains in {elapsed*1000:.1f}ms")
print(f"Found: {found}")
if not_found:
    print(f"Not found: {not_found}")

Validated 7 domains in 10.0ms
Found: ['com.cnn', 'com.bbc', 'com.foxnews', 'com.100percentfedup', 'org.4chan', 'org.911truth']
Not found: ['tld.nonexistentdomain']
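
The bridge reports domains in reversed-label notation (`com.cnn` rather than `cnn.com`). Reversing the labels a second time restores the original order, so the same `reverse_domain` helper can turn results back into readable form. A minimal sketch, using a hard-coded sample list in place of a live `bridge` result:

```python
def reverse_domain(domain: str) -> str:
    """Reverse the dot-separated labels of a domain name."""
    return ".".join(reversed(domain.split(".")))


# Sample values in the reversed notation shown in the output above.
found = ["com.cnn", "com.bbc", "org.911truth"]

# Applying the same function again yields the conventional form.
readable = [reverse_domain(d) for d in found]
print(readable)
```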


In [20]:
# Example: Direct API usage
import time

# Run discovery
threshold = 6  # Note: not passed to shared_predecessors; only used in the summary line below
start = time.time()
# results = bridge.discover_backlinks(found, min_connections=threshold)
results = bridge.shared_predecessors(found)
elapsed = time.time() - start

print(f"\nDiscovery completed in {elapsed:.2f} seconds")
print(f"Found {len(results)} domains with >= {threshold} connections")
# print("\nTop 10 results:")
# for r in results[:10]:
#     print(f"  {r['domain']}: {r['connections']} connections ({r['percentage']}%)")
results


Discovery completed in 0.06 seconds
Found 46 domains with >= 6 connections


['com.50webs',
 'com.aifsy',
 'com.amgreatness',
 'com.angelfire',
 'com.bitchute',
 'com.blogspot',
 'com.dailycaller',
 'com.ericpetersautos',
 'com.fc2',
 'com.forumotion',
 'com.globalseoarticles',
 'com.henrymakow',
 'com.hubpages',
 'com.kingranks',
 'com.pirdu',
 'com.pklea',
 'com.ranksdirectory',
 'com.salon',
 'com.scienceblogs',
 'com.shtfplan',
 'com.substack',
 'com.topbilliondirectory',
 'com.typepad',
 'com.usawatchdog',
 'com.wayranks',
 'com.webranksite',
 'com.webseodirectory',
 'com.weebly',
 'com.wnd',
 'com.worldranksite',
 'net.eturkey',
 'net.gatesofvienna',
 'net.phibetaiota',
 'net.saidit',
 'online.99site',
 'online.allarticles',
 'online.waynews',
 'online.wayranks',
 'org.freedomclubusa',
 'org.horsesass',
 'org.rationalwiki',
 'org.republicbroadcasting',
 'org.softpanorama',
 'se.vaken',
 'tv.thepeoplesvoice',
 'us.thepiratescove']
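
To make the "connections" idea concrete, the sketch below counts, on a tiny hand-made graph, how many seed domains each candidate links to and keeps candidates at or above a threshold. It is a pure-Python illustration, not the Java implementation behind `bridge`. The function name `discover_backlinks_toy`, the exclusion of seeds from the results, and the `percentage` formula (connections divided by the number of seeds) are all assumptions.

```python
from collections import Counter


def discover_backlinks_toy(out_links, seeds, min_connections):
    """Count, for each non-seed domain, how many distinct seeds it links to."""
    seed_set = set(seeds)
    counts = Counter()
    for source, targets in out_links.items():
        if source in seed_set:
            continue  # assumption: seeds are not reported as results
        hits = len(seed_set & set(targets))
        if hits:
            counts[source] = hits
    rows = [
        {
            "domain": domain,
            "connections": count,
            "percentage": round(100 * count / len(seed_set), 2),
        }
        for domain, count in counts.items()
        if count >= min_connections
    ]
    # Strongest candidates first; ties broken alphabetically.
    return sorted(rows, key=lambda r: (-r["connections"], r["domain"]))


# Toy graph: each key links to the domains in its set.
toy_graph = {
    "com.aggregator": {"com.cnn", "com.bbc", "com.foxnews"},
    "org.blog": {"com.cnn", "com.bbc"},
    "net.other": {"com.cnn"},
    "com.cnn": {"com.bbc"},
}
seeds = ["com.cnn", "com.bbc", "com.foxnews"]

for row in discover_backlinks_toy(toy_graph, seeds, min_connections=2):
    print(row)
```

Setting `min_connections` to the number of seeds keeps only domains that link to every seed, which is the toy analogue of a shared-predecessor query.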

---

## Discovery Interface

Use the form below to discover related domains. Queries are **nearly instant** now that the graph is loaded!

In [None]:
import os  # used below to choose the CSV output path
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
import pandas as pd
import time

# Create input widgets
domains_input = widgets.Textarea(
    value='',
    placeholder='Enter seed domains, one per line:\ncnn.com\nbbc.com\nfoxnews.com',
    description='',
    layout=widgets.Layout(width='80%', height='200px'),
    style={'description_width': '0px'}
)

min_conn_slider = widgets.IntSlider(
    value=5,
    min=1,
    max=100,
    step=1,
    description='Min Connections:',
    style={'description_width': '150px'},
    layout=widgets.Layout(width='60%')
)

direction_radio = widgets.RadioButtons(
    options=[
        ('Backlinks (who links TO seeds)', 'backlinks'),
        ('Outlinks (who seeds link TO)', 'outlinks')
    ],
    value='backlinks',
    description='Direction:',
    style={'description_width': '150px'}
)

run_button = widgets.Button(
    description='🚀 Run Discovery',
    button_style='success',
    layout=widgets.Layout(width='200px', height='40px'),
    tooltip='Click to discover related domains (instant!)'
)

output_area = widgets.Output()

# Display form
display(HTML("<h3>Seed Domains</h3>"))
display(HTML("<p>Enter one domain per line:</p>"))
display(domains_input)
display(HTML("<br>"))
display(min_conn_slider)
display(HTML("<br>"))
display(direction_radio)
display(HTML("<br>"))
display(run_button)
display(HTML("<hr>"))
display(output_area)

# Button click handler
def on_run_click(b):
    output_area.clear_output()
    
    with output_area:
        # Validate input
        domains_text = domains_input.value.strip()
        if not domains_text:
            print("Error: Please enter at least one domain")
            return
        
        seed_domains = [d.strip() for d in domains_text.split('\n') if d.strip()]
        
        if len(seed_domains) == 0:
            print("Error: Please enter at least one domain")
            return
        
        if len(seed_domains) > 10000:
            print("Error: Maximum 10000 domains allowed")
            return
        
        print("Configuration:")
        print(f"  Seed domains: {len(seed_domains)}")
        print(f"  Direction: {direction_radio.value}")
        print(f"  Min connections: {min_conn_slider.value}")
        print()
        
        try:
            # Run discovery (should be fast!)
            start_time = time.time()
            results = bridge.discover(
                seed_domains=seed_domains,
                min_connections=min_conn_slider.value,
                direction=direction_radio.value
            )
            elapsed = time.time() - start_time
            
            print(f"\n⏱️ Query completed in {elapsed:.2f} seconds")
            print()
            
            # Display results
            if len(results) == 0:
                display(HTML("<h3>No Results Found</h3>"))
                print("No domains found matching the criteria.")
                print("\nTry:")
                print("  - Lowering the minimum connections threshold")
                print("  - Using different seed domains")
                print("  - Switching between backlinks and outlinks")
            else:
                display(HTML(f"<h3>Found {len(results):,} Domains</h3>"))
                
                # Convert to DataFrame
                results_df = pd.DataFrame(results)
                
                # Style and display
                display(HTML("<h4>Top Results:</h4>"))
                
                styled_df = results_df.head(100).style.format({
                    'connections': '{:,.0f}',
                    'percentage': '{:.2f}%'
                }).background_gradient(subset=['connections'], cmap='YlOrRd')
                
                display(styled_df)
                
                if len(results_df) > 100:
                    print(f"\n(Showing top 100 of {len(results_df):,} results)")
                
                # Summary statistics
                print("\n" + "="*60)
                print("Summary Statistics:")
                print(f"  Total discovered: {len(results_df):,} domains")
                print(f"  Connections range: {results_df['connections'].min():.0f} - {results_df['connections'].max():.0f}")
                print(f"  Mean connections: {results_df['connections'].mean():.1f}")
                print(f"  Median connections: {results_df['connections'].median():.0f}")
                print("="*60)
                
                # Save to CSV
                csv_path = '/content/results.csv' if os.path.exists('/content') else 'results.csv'
                results_df.to_csv(csv_path, index=False)
                print(f"\n💾 Results saved to {csv_path}")
                
        except Exception as e:
            display(HTML("<h3>Error During Discovery</h3>"))
            print(f"Error: {str(e)}")
            import traceback
            traceback.print_exc()

run_button.on_click(on_run_click)

print("💡 Tip: Queries are nearly instant now that the graph is loaded!")
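
The click handler splits the text area on newlines and drops blank lines. Pasted lists often also contain duplicates or mixed case, so a slightly stricter parser can help. The sketch below is one option. The helper name `parse_seed_text` is hypothetical, and whether `bridge.discover` already normalizes or de-duplicates its input is not shown in this notebook.

```python
def parse_seed_text(text: str) -> list[str]:
    """Split pasted text into unique, lower-cased domains, keeping input order."""
    seen = set()
    seeds = []
    for line in text.splitlines():
        domain = line.strip().lower()
        if domain and domain not in seen:
            seen.add(domain)
            seeds.append(domain)
    return seeds


print(parse_seed_text("CNN.com\n\n  bbc.com  \ncnn.com\nfoxnews.com"))
```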

---

## Programmatic API

You can also use the bridge directly for more control:

In [None]:
# Example: Get the direct predecessors (backlinks) of a single domain
domain = "cnn.com"

start = time.time()
backlinks = bridge.get_predecessors(domain)
elapsed = time.time() - start

print(f"Found {len(backlinks):,} domains linking to {domain}")
print(f"Query time: {elapsed*1000:.1f}ms")
print(f"\nFirst 10: {backlinks[:10]}")
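
The cells above repeat the same `start = time.time()` and `elapsed = ...` pattern around each query. A small context manager can wrap that bookkeeping. The sketch below uses `time.perf_counter`, which is better suited to short intervals than `time.time`. The name `timed` is hypothetical, and a stand-in computation replaces the `bridge` call so the block runs on its own.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Print the wall-clock time spent inside the with-block, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{label}: {elapsed_ms:.1f}ms")


# Stand-in workload; in the notebook this would be a bridge query.
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))
```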

---

## Cleanup

When done, you can shut down the JVM to free memory:

In [15]:
# Shut down the JVM to release the memory held by the graph
bridge.shutdown()
print("JVM shutdown complete")

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:39925)
Traceback (most recent call last):
  File "/home/peter/miniconda3/envs/net_neighbor/lib/python3.12/site-packages/py4j/java_gateway.py", line 982, in _get_connection
    connection = self.deque.pop()
                 ^^^^^^^^^^^^^^^^
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/peter/miniconda3/envs/net_neighbor/lib/python3.12/site-packages/py4j/java_gateway.py", line 1170, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused


JVM shutdown complete
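
The traceback above is emitted through the `py4j.java_gateway` logger, and the cell still goes on to print its completion message. Assuming the error is benign in this situation (py4j trying to reach a JVM that has already exited), the logger can be quietened for the duration of the shutdown call. The context manager `quiet_logger` below is a hypothetical helper built on the standard `logging` module, and a stand-in log call replaces `bridge.shutdown()` so the block runs on its own.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_logger(name: str, level: int = logging.CRITICAL):
    """Temporarily raise a logger's threshold, restoring it afterwards."""
    logger = logging.getLogger(name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        logger.setLevel(previous)


with quiet_logger("py4j.java_gateway"):
    # In the notebook, bridge.shutdown() would go here.
    logging.getLogger("py4j.java_gateway").error("suppressed inside the block")
```

Hiding log output can also hide genuine failures, so it is worth confirming separately that the JVM process has actually exited.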
