# 🔍 News Source Discovery Using CommonCrawl Webgraph

Discover related domains using link topology analysis from the CommonCrawl web graph.

**Based on:**
- Carragher, P., Williams, E. M., & Carley, K. M. (2024). *Detection and Discovery of Misinformation Sources using Attributed Webgraphs*. ICWSM 2024. [Paper](https://arxiv.org/abs/2401.02379)
- Carragher, P., Williams, E. M., Spezzano, F., & Carley, K. M. (2025). *Misinformation Resilient Search Rankings with Attributed Webgraphs*. ACM TIST.

**Dataset:**
- CommonCrawl webgraph (Nov-Dec 2024, Jan 2025)
- 93.9M domains, 1.6B edges
- Domain-level aggregation

**What this notebook does:**
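Given a list of seed domains, this notebook finds other domains that link to the seeds (backlinks) or that the seeds link to (outlinks) in the CommonCrawl web graph, keeping those connected to at least a chosen minimum number of seeds.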
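
As a rough illustration of that idea, here is a minimal pure-Python sketch on a toy graph. The domain names, the `transpose` helper, and this `discover` function are hypothetical stand-ins; the notebook itself queries the compressed graph through cc-webgraph rather than Python dictionaries.

```python
from collections import Counter

# Toy outlink graph (hypothetical domains): source -> domains it links to
outlinks = {
    "seed-a.example": {"hub.example", "seed-b.example"},
    "seed-b.example": {"hub.example", "other.example"},
    "fan.example": {"seed-a.example", "seed-b.example"},
    "blog.example": {"seed-a.example"},
}

def transpose(graph):
    """Reverse every edge so that lookups return backlinks instead of outlinks."""
    reversed_graph = {}
    for source, targets in graph.items():
        for target in targets:
            reversed_graph.setdefault(target, set()).add(source)
    return reversed_graph

def discover(graph, seeds, min_connections):
    """Count, for each non-seed domain, how many seeds it neighbors in `graph`."""
    seeds = set(seeds)
    counts = Counter()
    for seed in seeds:
        for neighbor in graph.get(seed, ()):
            if neighbor not in seeds:  # seeds themselves are not candidates
                counts[neighbor] += 1
    rows = [
        (domain, n, round(100.0 * n / len(seeds), 2))  # share of seeds connected
        for domain, n in counts.items()
        if n >= min_connections
    ]
    # Most-connected first; ties broken alphabetically for stable output
    return sorted(rows, key=lambda row: (-row[1], row[0]))

seeds = ["seed-a.example", "seed-b.example"]
backlink_hits = discover(transpose(outlinks), seeds, min_connections=2)
outlink_hits = discover(outlinks, seeds, min_connections=1)
print(backlink_hits)
print(outlink_hits)
```

Each row is `(domain, connections, percentage)`, mirroring the columns of the results CSV produced later in the notebook.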

---

## 📋 Setup Instructions

**⏱️ Time: ~15 minutes (first time only)**

### Step 1: Enable High-RAM Runtime (REQUIRED)

1. Click **Runtime** → **Change runtime type**
2. Set **Runtime shape** to **High-RAM** ⚠️
3. Optionally set **Hardware accelerator** to **GPU** (not required; the discovery code in this notebook runs on the CPU)
4. Click **Save**

*Why? The CommonCrawl webgraph is 22.5GB and requires >40GB RAM to process.*

### Step 2: (Optional) Mount Google Drive

**Recommended!** This caches the 22.5GB webgraph so you don't re-download it every session.

Run the "Mount Google Drive" cell below and follow the prompts.

### Step 3: Run Setup Cells (One-Time)

**▶️ Click Run on each setup cell in order** (Cells 3-8)

Progress bars will show download status. Wait for each cell to complete before running the next.

### Step 4: Use the Discovery Form

Run the **Section 2: Helper Functions** cell, then scroll down to **Section 3: Discovery Interface** and fill in the form.

---

## Section 1: Environment Setup (Run Once)

### Check Available RAM

In [None]:
import psutil

# Check available RAM
ram_gb = psutil.virtual_memory().total / (1024**3)
print(f"Available RAM: {ram_gb:.1f} GB")

if ram_gb < 20:
    print("\n⚠️ WARNING: You need Colab Pro for this notebook!")
    print("   Required: 20GB+ RAM")
    print(f"   You have: {ram_gb:.1f} GB")
    print("\n   Please enable High-RAM runtime:")
    print("   Runtime → Change runtime type → Runtime shape: High-RAM")
    raise Exception("Insufficient RAM. Please upgrade runtime.")
else:
    print("✅ Sufficient RAM available")
    print("\nYou can proceed with setup!")

### Mount Google Drive (Optional but Recommended)

In [None]:
from google.colab import drive
import os

# Ask user if they want to mount Drive
print("Mount Google Drive to cache webgraph between sessions?")
print("This saves ~15 minutes on future runs.")
print("")
mount_choice = input("Mount Google Drive? (yes/no): ").lower().strip()

if mount_choice in ['yes', 'y']:
    drive.mount('/content/drive')
    WEBGRAPH_DIR = '/content/drive/MyDrive/Colab_Data/webgraph'
    print(f"\n✅ Webgraph will be cached in: {WEBGRAPH_DIR}")
    print("This will persist across sessions!")
else:
    WEBGRAPH_DIR = '/content/webgraph'
    print(f"\n⚠️ Webgraph will be downloaded each session (~15 min)")
    print(f"Stored temporarily in: {WEBGRAPH_DIR}")

# Create directory
os.makedirs(WEBGRAPH_DIR, exist_ok=True)
print(f"\nDirectory created: {WEBGRAPH_DIR}")

### Install Java 17

In [None]:
%%bash
echo "Installing Java 17..."
apt-get update -qq > /dev/null 2>&1
apt-get install -y -qq openjdk-17-jdk-headless maven > /dev/null 2>&1

echo "✅ Java installation complete"
java -version

### Download cc-webgraph Tools

In [None]:
%%bash
# Clone and build cc-webgraph
if [ ! -d "cc-webgraph" ]; then
    echo "Cloning cc-webgraph repository..."
    git clone https://github.com/commoncrawl/cc-webgraph.git > /dev/null 2>&1
    
    echo "Building cc-webgraph (this may take 1-2 minutes)..."
    cd cc-webgraph
    mvn clean package -DskipTests -q
    
    echo "✅ cc-webgraph built successfully"
else
    echo "✅ cc-webgraph already exists"
fi

# Verify JAR file exists
if [ -f "cc-webgraph/target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar" ]; then
    echo "✅ JAR file found"
else
    echo "❌ JAR file not found. Build may have failed."
fi

### Download CommonCrawl Webgraph Data (~10 minutes)

In [None]:
import os
from tqdm.auto import tqdm
import urllib.request

VERSION = "cc-main-2025-26-nov-dec-jan"
BASE_URL = f"https://data.commoncrawl.org/projects/hyperlinkgraph/{VERSION}/domain"

files_to_download = [
    f"{VERSION}-domain-vertices.txt.gz",
    f"{VERSION}-domain-edges.txt.gz"
]

def download_with_progress(url, dest_path):
    """Download file with progress bar"""
    if os.path.exists(dest_path):
        print(f"✅ Already downloaded: {os.path.basename(dest_path)}")
        return
    
    print(f"Downloading: {os.path.basename(dest_path)}")
    
    def progress_hook(pbar):
        def update(block_num, block_size, total_size):
            if total_size > 0:
                pbar.total = total_size
                pbar.update(block_size)
        return update
    
    with tqdm(unit='B', unit_scale=True, unit_divisor=1024) as pbar:
        urllib.request.urlretrieve(url, dest_path, reporthook=progress_hook(pbar))
    
    print(f"✅ Downloaded: {os.path.basename(dest_path)}")

print("Downloading CommonCrawl webgraph (22.5GB total)...")
print(f"Destination: {WEBGRAPH_DIR}\n")

for filename in files_to_download:
    url = f"{BASE_URL}/{filename}"
    dest = os.path.join(WEBGRAPH_DIR, filename)
    download_with_progress(url, dest)

print("\n✅ All files downloaded successfully!")

### Build Graph Structures (~2 minutes)

In [None]:
%%bash -s "$WEBGRAPH_DIR"
WEBGRAPH_DIR=$1
VERSION="cc-main-2025-26-nov-dec-jan"

cd /content/cc-webgraph

VERTICES="${WEBGRAPH_DIR}/${VERSION}-domain-vertices.txt.gz"
EDGES="${WEBGRAPH_DIR}/${VERSION}-domain-edges.txt.gz"
OUTPUT="${WEBGRAPH_DIR}/${VERSION}-domain"

# Check if already built
if [ -f "${OUTPUT}.graph" ]; then
    echo "✅ Graph structures already built"
    exit 0
fi

echo "Building graph structures (this takes ~2 minutes)..."
echo "This converts the edge list into an efficient queryable format."
echo ""

./src/script/webgraph_ranking/process_webgraph.sh \
    preference_up \
    "$VERTICES" \
    "$EDGES" \
    "$OUTPUT"

echo ""
echo "✅ Graph structures built successfully"

### Verify Installation

In [None]:
import os
import subprocess
import gzip

print("Verifying installation...\n")
print("="*60)

# Check Java
try:
    result = subprocess.run(['java', '-version'], capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        version_line = result.stderr.split('\n')[0]
        print(f"✅ Java: {version_line}")
    else:
        print("❌ Java: Not working properly")
except Exception as e:
    print(f"❌ Java: Error - {e}")

# Check cc-webgraph JAR
jar_path = "/content/cc-webgraph/target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar"
if os.path.exists(jar_path):
    size_mb = os.path.getsize(jar_path) / (1024 * 1024)
    print(f"✅ cc-webgraph JAR: Found ({size_mb:.1f} MB)")
else:
    print(f"❌ cc-webgraph JAR: Not found at {jar_path}")

# Check webgraph data
VERSION = "cc-main-2025-26-nov-dec-jan"
vertices_file = os.path.join(WEBGRAPH_DIR, f"{VERSION}-domain-vertices.txt.gz")
edges_file = os.path.join(WEBGRAPH_DIR, f"{VERSION}-domain-edges.txt.gz")
graph_file = os.path.join(WEBGRAPH_DIR, f"{VERSION}-domain.graph")

if os.path.exists(vertices_file):
    size_mb = os.path.getsize(vertices_file) / (1024 * 1024)
    print(f"✅ Vertices file: Found ({size_mb:.1f} MB)")
    
    # Count domains
    print("   Counting domains...")
    try:
        with gzip.open(vertices_file, 'rt', encoding='utf-8') as f:
            num_domains = sum(1 for _ in f)
        print(f"   → {num_domains:,} domains in webgraph")
    except Exception as e:
        print(f"   → Could not count domains: {e}")
else:
    print(f"❌ Vertices file: Not found")

if os.path.exists(edges_file):
    size_mb = os.path.getsize(edges_file) / (1024 * 1024)
    print(f"✅ Edges file: Found ({size_mb:.1f} MB)")
else:
    print(f"❌ Edges file: Not found")

if os.path.exists(graph_file):
    size_mb = os.path.getsize(graph_file) / (1024 * 1024)
    print(f"✅ Graph structure: Built ({size_mb:.1f} MB)")
else:
    print(f"❌ Graph structure: Not built")

print("="*60)

# Final verdict
all_good = all([
    os.path.exists(jar_path),
    os.path.exists(vertices_file),
    os.path.exists(edges_file),
    os.path.exists(graph_file)
])

if all_good:
    print("\n🎉 Setup complete! Ready to discover domains.")
    print("\nRun the Section 2 cell, then scroll down to Section 3 to use the discovery interface.")
else:
    print("\n⚠️ Setup incomplete. Please re-run failed cells.")

---

## Section 2: Helper Functions

These cells define the discovery functionality. You don't need to modify them.
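
One detail worth knowing before reading the class: the vertices file stores each domain in reversed notation (`com.example` rather than `example.com`), so the helper code flips the labels back. A minimal standalone sketch of that conversion follows; the function names are hypothetical, and the line format (tab-separated, numeric ID first, reversed domain second) is assumed from how the notebook's own parser reads the file.

```python
from typing import Optional, Tuple

def to_normal_notation(reversed_domain: str) -> str:
    """Flip dot-separated labels: 'com.example' -> 'example.com'."""
    return ".".join(reversed(reversed_domain.split(".")))

def parse_vertex_line(line: str) -> Optional[Tuple[int, str]]:
    """Parse one vertices line into (node_id, domain), or None if malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2:
        return None  # skip lines without both an ID and a domain
    return int(parts[0]), to_normal_notation(parts[1])

print(to_normal_notation("uk.co.bbc"))
print(parse_vertex_line("42\tcom.example\n"))
```

Applying the conversion twice returns the original string, so the same function also turns a normal domain into the reversed form.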

In [None]:
import subprocess
import pandas as pd
import os
import gzip
from typing import List, Dict, Tuple

class WebgraphDiscovery:
    """
    Wrapper class for running webgraph discovery using cc-webgraph tools.
    """
    
    def __init__(self, webgraph_dir: str, version: str):
        self.webgraph_dir = webgraph_dir
        self.version = version
        self.jar_path = "/content/cc-webgraph/target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar"
        self.graph_base = os.path.join(webgraph_dir, f"{version}-domain")
        self.vertices_file = os.path.join(webgraph_dir, f"{version}-domain-vertices.txt.gz")
        
        # Load domain mapping (for validation)
        self._domain_set = None
        
    def _load_domain_set(self) -> set:
        """Load set of all domains in webgraph (for validation)"""
        if self._domain_set is not None:
            return self._domain_set
        
        print("Loading domain list (one-time, ~30 seconds)...")
        domains = set()
        with gzip.open(self.vertices_file, 'rt', encoding='utf-8') as f:
            for line in f:
                parts = line.strip().split('\t')
                if len(parts) >= 2:
                    reversed_domain = parts[1]
                    # Convert back to normal notation
                    domain = '.'.join(reversed(reversed_domain.split('.')))
                    domains.add(domain)
        
        self._domain_set = domains
        print(f"✅ Loaded {len(domains):,} domains")
        return domains
    
    def validate_seeds(self, seed_domains: List[str]) -> Tuple[List[str], List[str]]:
        """Validate which seed domains exist in webgraph"""
        domain_set = self._load_domain_set()
        
        found = []
        not_found = []
        
        for domain in seed_domains:
            domain_clean = domain.strip().lower()
            if domain_clean in domain_set:
                found.append(domain_clean)
            else:
                not_found.append(domain_clean)
        
        return found, not_found
    
    def discover(self, 
                 seed_domains: List[str], 
                 min_connections: int,
                 direction: str = 'backlinks') -> pd.DataFrame:
        """
        Run discovery algorithm using cc-webgraph tools.
        
        This creates a simple Java program inline and executes it.
        The program uses WebGraph's BVGraph to query neighbors efficiently.
        
        For backlinks: uses the transpose graph (-t.graph) where successors = predecessors
        For outlinks: uses the regular graph (.graph) where successors = outlinks
        """
        # Write seeds to file (in normal notation)
        seeds_file = '/content/seeds.txt'
        with open(seeds_file, 'w') as f:
            for domain in seed_domains:
                f.write(domain.strip().lower() + '\n')
        
        # Create a simple discovery Java program
        # This uses cc-webgraph's existing classes
        java_code = self._generate_discovery_java_code(
            seeds_file, min_connections, direction
        )
        
        # Write Java code
        java_file = '/content/DiscoveryRunner.java'
        with open(java_file, 'w') as f:
            f.write(java_code)
        
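        # Remove results left over from a previous run so they are never
        # mistaken for the output of this one
        stale_results = '/content/results.csv'
        if os.path.exists(stale_results):
            os.remove(stale_results)
        
        # Compile and run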
        print("\nRunning discovery algorithm...")
        print(f"Direction: {direction}")
        print(f"Min connections: {min_connections}")
        print(f"Seed domains: {len(seed_domains)}")
        print("\nThis may take 30 seconds to 2 minutes...\n")
        
        try:
            # Compile
            compile_cmd = [
                'javac',
                '-cp', self.jar_path,
                java_file
            ]
            subprocess.run(compile_cmd, check=True, capture_output=True)
            
            # Run
            run_cmd = [
                'java',
                '-Xmx48g',  # Use 48GB heap
                '-cp', f'{self.jar_path}:/content',
                'DiscoveryRunner'
            ]
            result = subprocess.run(run_cmd, capture_output=True, text=True, timeout=300)
            
            if result.returncode != 0:
                print("Error output:")
                print(result.stderr)
                raise Exception(f"Discovery failed with return code {result.returncode}")
            
            # Parse output
            print(result.stdout)
            
            # Read results CSV
            results_file = '/content/results.csv'
            if os.path.exists(results_file):
                df = pd.read_csv(results_file)
                return df
            else:
                print("No results file generated")
                return pd.DataFrame(columns=['domain', 'connections', 'percentage'])
                
        except subprocess.TimeoutExpired:
            raise Exception("Discovery timed out (>5 minutes). Try fewer seed domains.")
        except Exception as e:
            raise Exception(f"Discovery error: {str(e)}")
    
    def _generate_discovery_java_code(self, seeds_file: str, min_conn: int, direction: str) -> str:
        """
        Generate Java code that uses cc-webgraph to run discovery.
        
        For backlinks: Load the transpose graph (-t suffix). In the transpose graph,
        an edge A->B means B links to A in the original graph. So successors in the
        transpose graph gives us the predecessors (backlinks) in the original graph.
        
        For outlinks: Load the regular graph. Successors gives us outgoing links.
        """
        # Choose the appropriate graph file based on direction
        if direction == 'backlinks':
            # Use transpose graph: successors in transpose = predecessors in original
            graph_path = f"{self.graph_base}-t"
        else:
            # Use regular graph: successors = outlinks
            graph_path = self.graph_base
        
        return f'''import it.unimi.dsi.webgraph.*;
import it.unimi.dsi.fastutil.longs.*;
import java.io.*;
import java.util.*;
import java.util.zip.*;

public class DiscoveryRunner {{
    public static void main(String[] args) throws Exception {{
        // Load the appropriate graph
        // For backlinks: transpose graph (successors = predecessors)
        // For outlinks: regular graph (successors = outlinks)
        System.out.println("Loading graph from: {graph_path}");
        BVGraph graph = BVGraph.load("{graph_path}");
        System.out.println("Graph loaded: " + graph.numNodes() + " nodes");
        
        // Build domain <-> ID mappings
        System.out.println("Loading domain mappings...");
        Map<String, Integer> domainToId = new HashMap<>();
        Map<Integer, String> idToDomain = new HashMap<>();
        
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(
                    new GZIPInputStream(
                        new FileInputStream("{self.vertices_file}"))))) {{
            String line;
            while ((line = br.readLine()) != null) {{
                String[] parts = line.split("\\t");
                if (parts.length >= 2) {{
                    int id = Integer.parseInt(parts[0]);
                    String revDomain = parts[1];
                    
                    // Convert reversed domain (com.example) to normal (example.com)
                    String[] domainParts = revDomain.split("\\\\.");
                    StringBuilder sb = new StringBuilder();
                    for (int i = domainParts.length - 1; i >= 0; i--) {{
                        if (sb.length() > 0) sb.append(".");
                        sb.append(domainParts[i]);
                    }}
                    String domain = sb.toString();
                    
                    domainToId.put(domain, id);
                    idToDomain.put(id, domain);
                }}
            }}
        }}
        System.out.println("Loaded " + domainToId.size() + " domain mappings");
        
        // Load seed domains
        System.out.println("Loading seed domains...");
        Set<Integer> seedIds = new HashSet<>();
        try (BufferedReader br = new BufferedReader(new FileReader("{seeds_file}"))) {{
            String line;
            while ((line = br.readLine()) != null) {{
                String domain = line.trim().toLowerCase();
                Integer id = domainToId.get(domain);
                if (id != null) {{
                    seedIds.add(id);
                }}
            }}
        }}
        System.out.println("Found " + seedIds.size() + " seed domains in graph");
        
        if (seedIds.isEmpty()) {{
            System.out.println("No valid seed domains found!");
            return;
        }}
        
        // Run discovery: find all neighbors of seed nodes
        System.out.println("Running discovery ({direction})...");
        Map<Integer, Integer> candidateCounts = new HashMap<>();
        
        for (Integer seedId : seedIds) {{
            // Get neighbors using successors()
            // In transpose graph: successors = who links TO this node (backlinks)
            // In regular graph: successors = who this node links TO (outlinks)
            LazyIntIterator neighbors = graph.successors(seedId);
            int neighbor;
            while ((neighbor = neighbors.nextInt()) != -1) {{
                // Don't count seeds themselves
                if (!seedIds.contains(neighbor)) {{
                    candidateCounts.merge(neighbor, 1, Integer::sum);
                }}
            }}
        }}
        System.out.println("Found " + candidateCounts.size() + " unique candidate domains");
        
        // Filter by minimum connection threshold
        System.out.println("Filtering by threshold >= {min_conn}...");
        List<Map.Entry<Integer, Integer>> results = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : candidateCounts.entrySet()) {{
            if (entry.getValue() >= {min_conn}) {{
                results.add(entry);
            }}
        }}
        
        // Sort by connection count descending
        results.sort((a, b) -> b.getValue() - a.getValue());
        System.out.println("Found " + results.size() + " domains meeting threshold");
        
        // Write results to CSV
        try (PrintWriter pw = new PrintWriter(new FileWriter("/content/results.csv"))) {{
            pw.println("domain,connections,percentage");
            for (Map.Entry<Integer, Integer> entry : results) {{
                String domain = idToDomain.get(entry.getKey());
                if (domain != null) {{
                    int connections = entry.getValue();
                    double percentage = (connections * 100.0) / seedIds.size();
                    pw.printf("%s,%d,%.2f%n", domain, connections, percentage);
                }}
            }}
        }}
        
        System.out.println("✅ Discovery complete. Results written to /content/results.csv");
    }}
}}
'''

# Initialize discovery object
VERSION = "cc-main-2025-26-nov-dec-jan"
discovery = WebgraphDiscovery(WEBGRAPH_DIR, VERSION)

print("✅ Discovery tools initialized")
print(f"Graph location: {WEBGRAPH_DIR}")
print(f"Version: {VERSION}")

---

## Section 3: Discovery Interface 🎯

### Use this form to discover related domains!
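
The form expects bare domains, one per line, and the validation step only trims whitespace and lowercases each entry. If your seed list contains pasted URLs, something like the following sketch could clean it first. `normalize_seed` is a hypothetical helper, not part of the notebook, and it does not reduce subdomains to their registered domain, so an entry such as `news.example.com` may still fail to match the domain-level graph.

```python
from urllib.parse import urlsplit

def normalize_seed(entry: str) -> str:
    """Hypothetical helper: reduce a pasted URL or hostname to a bare lowercase host."""
    text = entry.strip().lower()
    if "//" not in text:
        # Prefix '//' so urlsplit treats the text as a network location
        text = "//" + text
    host = urlsplit(text).hostname or ""
    # Drop a leading 'www.' label, which the domain-level graph does not use
    return host[4:] if host.startswith("www.") else host

raw_entries = ["https://www.Example.com/news?id=1", " example.org ", "test.org:8080/path", ""]
cleaned = [d for d in (normalize_seed(e) for e in raw_entries) if d]
print("\n".join(cleaned))
```

The printed lines can be pasted straight into the seed box.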

In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML, FileLink, clear_output
import pandas as pd

# Create input widgets
domains_input = widgets.Textarea(
    value='',
    placeholder='Enter seed domains, one per line:\nexample.com\ntest.org\nsample.net',
    description='',
    layout=widgets.Layout(width='80%', height='200px'),
    style={'description_width': '0px'}
)

min_conn_slider = widgets.IntSlider(
    value=5,
    min=1,
    max=100,
    step=1,
    description='Min Connections:',
    style={'description_width': '150px'},
    layout=widgets.Layout(width='60%')
)

direction_radio = widgets.RadioButtons(
    options=[
        ('Backlinks (who links TO seeds)', 'backlinks'),
        ('Outlinks (who seeds link TO)', 'outlinks')
    ],
    value='backlinks',
    description='Direction:',
    style={'description_width': '150px'}
)

run_button = widgets.Button(
    description='üîç Run Discovery',
    button_style='success',
    layout=widgets.Layout(width='200px', height='40px'),
    tooltip='Click to discover related domains'
)

output_area = widgets.Output()

# Display form
display(HTML("<h2>📝 Discovery Configuration</h2>"))
display(HTML("<p><strong>Seed Domains</strong> (one per line):</p>"))
display(domains_input)
display(HTML("<br>"))
display(min_conn_slider)
display(HTML("<br>"))
display(direction_radio)
display(HTML("<br>"))
display(run_button)
display(HTML("<hr>"))
display(output_area)

# Button click handler
def on_run_click(b):
    output_area.clear_output()
    
    with output_area:
        display(HTML("<h3>⏳ Processing...</h3>"))
        
        # Validate input
        domains_text = domains_input.value.strip()
        if not domains_text:
            print("❌ Error: Please enter at least one domain")
            return
        
        seed_domains = [d.strip() for d in domains_text.split('\n') if d.strip()]
        
        if len(seed_domains) == 0:
            print("❌ Error: Please enter at least one domain")
            return
        
        if len(seed_domains) > 1000:
            print("❌ Error: Maximum 1000 domains allowed")
            print(f"You entered: {len(seed_domains)} domains")
            return
        
        # Validate seeds exist in webgraph
        print(f"Validating {len(seed_domains)} seed domains...")
        found, not_found = discovery.validate_seeds(seed_domains)
        
        if len(found) == 0:
            print("\n❌ Error: None of the seed domains were found in the webgraph")
            print("\nDomains not found:")
            for d in not_found[:10]:
                print(f"  • {d}")
            if len(not_found) > 10:
                print(f"  ... and {len(not_found)-10} more")
            return
        
        if len(not_found) > 0:
            print(f"\n⚠️ Warning: {len(not_found)} domains not found in webgraph:")
            for d in not_found[:5]:
                print(f"  • {d}")
            if len(not_found) > 5:
                print(f"  ... and {len(not_found)-5} more")
            print(f"\nProceeding with {len(found)} valid domains\n")
        else:
            print(f"✅ All {len(found)} seed domains found in webgraph\n")
        
        print("="*60)
        print(f"Configuration:")
        print(f"  • Direction: {direction_radio.value}")
        print(f"  • Minimum connections: {min_conn_slider.value}")
        print(f"  • Valid seed domains: {len(found)}")
        print("="*60)
        
        try:
            # Run discovery
            results_df = discovery.discover(
                seed_domains=found,
                min_connections=min_conn_slider.value,
                direction=direction_radio.value
            )
            
            # Clear processing message
            clear_output(wait=True)
            
            # Display results
            if len(results_df) == 0:
                display(HTML("<h3>❌ No Results Found</h3>"))
                print("No domains found matching the criteria.")
                print("\nTry:")
                print("  • Lowering the minimum connections threshold")
                print("  • Using different seed domains")
                print("  • Switching between backlinks and outlinks")
            else:
                display(HTML(f"<h3>✅ Found {len(results_df):,} Domains</h3>"))
                print(f"Discovered {len(results_df):,} domains with ≥{min_conn_slider.value} connections\n")
                
                # Style and display dataframe
                display(HTML("<h4>Top Results:</h4>"))
                
                styled_df = results_df.head(100).style.format({
                    'connections': '{:,.0f}',
                    'percentage': '{:.2f}%'
                }).background_gradient(subset=['connections'], cmap='YlOrRd')
                
                display(styled_df)
                
                if len(results_df) > 100:
                    print(f"\n(Showing top 100 of {len(results_df):,} results. Download CSV for full list.)")
                
                # Summary statistics
                print("\n" + "="*60)
                print("Summary Statistics:")
                print(f"  • Total discovered: {len(results_df):,} domains")
                print(f"  • Connections range: {results_df['connections'].min():.0f} - {results_df['connections'].max():.0f}")
                print(f"  • Mean connections: {results_df['connections'].mean():.1f}")
                print(f"  • Median connections: {results_df['connections'].median():.0f}")
                print("="*60)
                
                # Download link
                display(HTML("<br><h4>💾 Download Full Results</h4>"))
                display(FileLink('/content/results.csv', result_html_prefix="📥 Click to download: "))
                print(f"\nCSV contains all {len(results_df):,} discovered domains")
                
        except Exception as e:
            clear_output(wait=True)
            display(HTML("<h3>❌ Error During Discovery</h3>"))
            print(f"Error: {str(e)}")
            print("\n📝 Troubleshooting:")
            print("1. Check that all setup cells completed successfully")
            print("2. Verify you're using High-RAM runtime")
            print("3. Try restarting runtime: Runtime → Restart runtime")
            print("4. Try with fewer seed domains")

run_button.on_click(on_run_click)

print("\n💡 Tip: Start with 10-20 seed domains and min_connections=5 for fast results!")

---

## 📚 Citation & References

If you use this notebook in your research, please cite:

```bibtex
@article{carragher2024detection,
  title={Detection and Discovery of Misinformation Sources using Attributed Webgraphs},
  author={Carragher, Peter and Williams, Evan M and Carley, Kathleen M},
  journal={Proceedings of the International AAAI Conference on Web and Social Media},
  volume={18},
  pages={218--229},
  year={2024},
  url={https://arxiv.org/abs/2401.02379}
}

@article{carragher2025misinformation,
  title={Misinformation Resilient Search Rankings with Attributed Webgraphs},
  author={Carragher, Peter and Williams, Evan M and Spezzano, Francesca and Carley, Kathleen M},
  journal={ACM Transactions on Intelligent Systems and Technology},
  year={2025}
}
```

**Links:**
- Paper (ICWSM 2024): https://arxiv.org/abs/2401.02379
- GitHub Repository: https://github.com/CASOS-IDeaS-CMU/Detection-and-Discovery-of-Misinformation-Sources
- CommonCrawl Webgraphs: https://commoncrawl.org/web-graphs
- cc-webgraph Tools: https://github.com/commoncrawl/cc-webgraph

**Contact:**
- Peter Carragher: pcarragh@andrew.cmu.edu
- CASOS Lab: http://casos.cs.cmu.edu/

---

**License:** MIT

**Acknowledgments:** This notebook uses the CommonCrawl web graph dataset and the WebGraph framework developed by Sebastiano Vigna and Paolo Boldi.