# Week 2: Web Crawling, Link Analysis & PageRank

**Web and Social Network Analytics**

---

## Learning Objectives

By the end of this lab, you will be able to:

1. **Build** a web crawler that respects robots.txt conventions
2. **Extract** and analyze link structures using BeautifulSoup
3. **Calculate** document similarity using shingling and Jaccard similarity
4. **Construct** link graphs with NetworkX
5. **Compute** PageRank and HITS scores for webpage importance
6. **(Optional)** Use LLMs to summarize scraped content

---

**Disclaimer**: This educational content is provided for instructional purposes only. Always respect website terms of service, robots.txt files, and legal requirements when crawling. Be polite to servers by adding delays between requests.

## Setup

Run this cell first to import all required libraries.

In [None]:
# Standard libraries
import os
import time
import pathlib
from urllib.request import urlopen
from urllib.parse import urlparse
import pprint as pp

# Web scraping
from bs4 import BeautifulSoup
import requests

# Graph analysis
import networkx as nx
import matplotlib.pyplot as plt

# Data handling
import pandas as pd

print('All libraries imported successfully!')

---

# Part 1: Introduction to Web Crawling

---

## 1.1 Web Crawling vs Web Scraping

These terms are often confused, but they serve different purposes:

| Aspect | Web Crawling | Web Scraping |
|--------|--------------|---------------|
| **Purpose** | Discovering and indexing web content | Extracting specific data from pages |
| **Process** | Navigating links, mapping site structure | Parsing HTML for targeted information |
| **Usage** | Search engines, site mapping | Data analysis, research, monitoring |
| **Output** | Links, metadata, page relationships | Structured data (tables, text, prices) |
| **Focus** | Breadth (many pages) | Depth (specific content) |

**This week**: We focus on **web crawling** - discovering how pages link to each other and analyzing that structure.

## 1.2 The Web Crawler Process

A web crawler (also called a spider or bot) follows this process:

1. **Start** with a list of seed URLs (the "frontier")
2. **Fetch** a page from the frontier
3. **Parse** the page to extract all hyperlinks
4. **Add** new links to the frontier (if not already visited)
5. **Repeat** until frontier is empty or limit reached

### Key Concepts

- **Frontier**: Queue of URLs waiting to be crawled
- **Visited set**: URLs already processed (to avoid duplicates)
- **Politeness**: Delays between requests to respect servers
- **robots.txt**: File that tells crawlers what they can/cannot access
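
The steps and concepts above can be sketched on a toy example before touching real pages. This is a minimal illustration, not the crawler built later in this lab: `TOY_SITE`, `crawl`, and `max_pages` are hypothetical names, and "fetching" is simulated with a dictionary lookup.

```python
from collections import deque

# Hypothetical in-memory "website": each page maps to the pages it links to.
TOY_SITE = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["about", "archive"],
    "archive": [],
}

def crawl(seed, site, max_pages=100):
    """Breadth-first crawl of `site` from `seed`; returns pages in visit order."""
    frontier = deque([seed])  # URLs waiting to be crawled
    visited = set()           # URLs already processed
    order = []
    while frontier and len(order) < max_pages:
        page = frontier.popleft()        # "fetch" the next page
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for link in site.get(page, []):  # "parse" the page for links
            if link not in visited:
                frontier.append(link)    # grow the frontier
    return order

visit_order = crawl("home", TOY_SITE)
print(visit_order)
```

A crawler for live sites would also pause between fetches (politeness) and consult robots.txt before each request.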

### Checking robots.txt

Before crawling any website, you should check its `robots.txt` file:

In [None]:
# Example: Check a website's robots.txt
def check_robots_txt(domain):
    """Fetch and display a website's robots.txt file."""
    url = f"https://{domain}/robots.txt"
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            print(f"robots.txt for {domain}:")
            print("-" * 40)
            # Show first 500 characters
            print(response.text[:500])
            if len(response.text) > 500:
                print("... (truncated)")
        else:
            print(f"No robots.txt found (status: {response.status_code})")
    except Exception as e:
        print(f"Error: {e}")

# Try it with Google
check_robots_txt("google.com")
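
Reading robots.txt by eye is useful, but a crawler needs to apply the rules automatically. One possible approach is Python's standard-library `urllib.robotparser`. The sketch below parses a made-up robots.txt from a string, so it runs without network access; the rules and the user agent name `MyCourseBot` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory (no network needed).
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# "MyCourseBot" is a made-up user agent name.
allowed_public = parser.can_fetch("MyCourseBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("MyCourseBot", "https://example.com/private/data.html")
delay = parser.crawl_delay("MyCourseBot")  # seconds requested between fetches

print(f"Public page allowed:  {allowed_public}")
print(f"Private page allowed: {allowed_private}")
print(f"Requested crawl delay: {delay}")
```

For a live site, `parser.set_url("https://<domain>/robots.txt")` followed by `parser.read()` downloads the file instead of parsing a string.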

## 1.3 Hands-On: Manual Link Mapping Exercise

Before we automate crawling, let's understand it manually.

### Task

1. Open the `demowebsite` folder in your file browser
2. Open `home.html` in a web browser
3. Click through all the links and draw a map on paper showing:
   - Which pages link to which
   - Use arrows to show direction (A -> B means A links to B)

### Questions to Answer

1. Can you reach all HTML files in the folder by clicking from home.html?
2. Which pages are "dead ends" (no outgoing links)?
3. Which page is linked to by the most other pages?
4. Which page has the most outgoing links?

*Keep your diagram - we'll verify it with code later!*

---

# Part 2: Building a Simple Web Crawler

---

## 2.1 Helper Function for Local Files

Since we're practicing with local HTML files, we need a helper to create proper file URLs that work across operating systems.

In [None]:
def create_local_file_address(folder, file):
    """Create a file:// URL that works on any operating system."""
    file_address = os.path.join(os.getcwd(), folder, file)
    with_schema = pathlib.Path(file_address).as_uri()
    return with_schema

# Test it
test_url = create_local_file_address("demowebsite", "home.html")
print(f"Local URL: {test_url}")

## 2.2 Extracting Links from a Single Page

Let's write a function that visits a page and returns all the links it finds.

In [None]:
def visit_page_and_return_links(page_url):
    """Visit a local HTML page and return all links found."""
    # Create the full file path
    page_url_full = create_local_file_address("demowebsite", page_url)
    print(f"Looking for links in: {page_url}")
    
    # Open and parse the page
    html_of_website = urlopen(page_url_full)
    soup = BeautifulSoup(html_of_website, 'html.parser')
    
    # Find all anchor tags and extract href values
    links = soup.find_all('a')
    link_urls = []
    
    for link in links:
        href = link.get('href')  # Safer than link['href']
        if href:
            print(f"  Found link: {link.text} -> {href}")
            link_urls.append(href)
    
    return link_urls

# Test with home page
starting_website = "home.html"
found_links = visit_page_and_return_links(starting_website)
print(f"\nTotal links found: {len(found_links)}")
print(f"Links: {found_links}")
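
The demo site uses bare file names as `href` values. On a live site, links are often relative paths, page fragments, or non-HTTP schemes, so a crawler usually normalizes them before adding them to the frontier. The following is a sketch of one way to do that with the standard library; `normalize_links` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop #fragments, keep only http(s) links."""
    resolved = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        is_web_link = urlparse(absolute).scheme in ("http", "https")
        if is_web_link and absolute not in resolved:
            resolved.append(absolute)
    return resolved

# example.com and the hrefs below are placeholder values.
base = "https://example.com/site/home.html"
links = normalize_links(
    base, ["team.html", "../shop.html", "#top", "mailto:info@example.com"]
)
print(links)
```

Here `#top` resolves to the page itself and the `mailto:` link is dropped, so only crawlable page addresses remain.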

## 2.3 Returning Structured Data

For building a graph later, we need structured data. Let's modify our function to return a dictionary.

In [None]:
# Example of the structure we want:
demo_page = {
    "address": "home.html",
    "links_to": ["team.html", "news.html", "business_deals.html", "shop.html"]
}

pp.pprint(demo_page)  # Pretty print

In [None]:
def visit_page_and_return_dictionary(page_url):
    """Visit a local HTML page and return structured link data."""
    page_url_full = create_local_file_address("demowebsite", page_url)
    print(f"Visiting: {page_url}")
    
    html_of_website = urlopen(page_url_full)
    soup = BeautifulSoup(html_of_website, 'html.parser')
    
    links = soup.find_all('a')
    link_urls = []
    for link in links:
        href = link.get('href')
        if href:
            link_urls.append(href)
    
    # Return structured data
    return {
        'address': page_url,
        'links_to': link_urls
    }

# Test it
page_info = visit_page_and_return_dictionary("home.html")
print()
pp.pprint(page_info)

## 2.4 Complete Crawler Implementation

Now let's build the full crawler with frontier management.

In [None]:
# Initialize our data structures
starting_website = "home.html"

pages_we_visited = []           # Pages we've already processed
pages_to_visit = [starting_website]  # The frontier
pages_scraped_info = []         # Store structured data for each page

# Crawl until frontier is empty
while len(pages_to_visit) > 0:
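    # Get next page from frontier. pop() takes the most recently added URL,
    # so this crawl is depth-first; pop(0) would make it breadth-first.
    next_page = pages_to_visit.pop()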
    
    # Visit and get link info
    page_info = visit_page_and_return_dictionary(next_page)
    pages_scraped_info.append(page_info)
    
    # Mark as visited
    pages_we_visited.append(page_info['address'])
    
    # Add new links to frontier (if not visited)
    for link_url in page_info['links_to']:
        if link_url not in pages_we_visited and link_url not in pages_to_visit:
            pages_to_visit.append(link_url)

print("\n" + "=" * 50)
print("CRAWLING COMPLETE!")
print("=" * 50)
print(f"\nPages visited ({len(pages_we_visited)}): {pages_we_visited}")
print(f"\nRemaining in frontier: {pages_to_visit}")

In [None]:
# View the complete scraped data
print("Complete link structure:")
print("-" * 50)
pp.pprint(pages_scraped_info)

### Verify Your Manual Diagram

Compare the output above with the diagram you drew earlier. Does it match?
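
To check the answers from the manual exercise (dead ends, most linked-to page, most outgoing links), the scraped data can be summarized with a few counts. This is a sketch on made-up data with the same shape as `pages_scraped_info`; `example_scraped_info` and `summarize_links` are hypothetical names.

```python
from collections import Counter

# Hypothetical crawler output, same structure as pages_scraped_info.
example_scraped_info = [
    {"address": "home.html", "links_to": ["team.html", "news.html", "contact.html"]},
    {"address": "team.html", "links_to": ["home.html"]},
    {"address": "news.html", "links_to": ["team.html"]},
    {"address": "contact.html", "links_to": []},
]

def summarize_links(scraped_info):
    """Return (out_degree, in_degree, dead_ends) for crawler output."""
    # Outgoing links per page (duplicate links on one page count once)
    out_degree = {p["address"]: len(set(p["links_to"])) for p in scraped_info}
    # Incoming links per page
    in_degree = Counter()
    for page in scraped_info:
        for target in set(page["links_to"]):
            in_degree[target] += 1
    # Pages with no outgoing links
    dead_ends = sorted(p for p, n in out_degree.items() if n == 0)
    return out_degree, in_degree, dead_ends

out_degree, in_degree, dead_ends = summarize_links(example_scraped_info)
print("Most outgoing links:", max(out_degree, key=out_degree.get))
print("Most incoming links:", in_degree.most_common(1)[0])
print("Dead ends:", dead_ends)
```

Passing the real `pages_scraped_info` to `summarize_links` gives the same three answers for the demo website.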

---

# Part 3: Link Graphs and NetworkX

---

## 3.1 Introduction to Directed Graphs

A **directed graph (DiGraph)** is perfect for representing web links:
- **Nodes** = web pages
- **Edges** = hyperlinks (with direction: from -> to)

Let's start with a simple example:

In [None]:
# Create a simple directed graph
simple_graph = nx.DiGraph()

# Add some edges (automatically creates nodes)
simple_graph.add_edge('A', 'B')  # A links to B
simple_graph.add_edge('A', 'C')  # A links to C
simple_graph.add_edge('A', 'D')  # A links to D
simple_graph.add_edge('C', 'B')  # C links to B

# Calculate layout and draw
positions = nx.spring_layout(simple_graph, seed=42)  # seed for reproducibility
nx.draw(simple_graph, positions, with_labels=True, 
        node_color='lightblue', node_size=1000, 
        font_size=16, arrows=True, arrowsize=20)
plt.title("Simple Directed Graph")
plt.show()

print("Notice the arrows showing link direction!")

## 3.2 Converting Crawler Data to Graph

Now let's convert our `pages_scraped_info` into a NetworkX graph.

In [None]:
# Create graph from our crawler data
graph = nx.DiGraph()

# For each page we scraped
for page in pages_scraped_info:
    origin = page['address']  # The page we visited
    destinations = page['links_to']  # Pages it links to
    
    # Add an edge for each link
    for dest in destinations:
        graph.add_edge(origin, dest)

print(f"Graph has {graph.number_of_nodes()} nodes and {graph.number_of_edges()} edges")

## 3.3 Visualizing the Web Graph

In [None]:
# Create a larger figure for better visibility
plt.figure(figsize=(12, 8))

# Calculate layout
positions = nx.spring_layout(graph, seed=2026, k=2)  # k controls spacing

# Draw the graph
nx.draw(graph, positions, 
        with_labels=True,
        node_color='lightblue',
        node_size=2000,
        font_size=10,
        arrows=True,
        arrowsize=15,
        edge_color='gray')

plt.title("DemoWebsite Link Structure")
plt.show()

### Questions to Consider

Looking at the graph:
1. Which page is the "hub" (most outgoing links)?
2. Which page is the most linked-to (most incoming links)?
3. Are there any dead ends (nodes with no outgoing arrows)?
4. Can you reach every page from home.html?

---

# Part 4: Similarity Detection with Shingling

---

## 4.1 Why Detect Duplicates?

**By some commonly cited estimates, as much as 40% of web pages duplicate other pages.**

This includes:
- Copy-pasted articles across news sites
- Product descriptions on multiple e-commerce platforms
- Slightly modified spam content
- Mirror sites and archived versions

Search engines need efficient ways to detect duplicates and near-duplicates.

## 4.2 Shingling: Breaking Text into Pieces

**Shingling** (like roof shingles that overlap) breaks text into overlapping pieces called **k-shingles**.

**Example** with k=2 (word-level):

Text: `"The quick brown fox"`

2-shingles: `{"the quick", "quick brown", "brown fox"}`

In [None]:
def create_shingles(text, k=2):
    """Create k-word shingles from text.
    
    Args:
        text: Input string
        k: Number of words per shingle
    
    Returns:
        Set of shingles
    """
    # Normalize: lowercase and split into words
    words = text.lower().split()
    
    # Create shingles
    shingles = set()
    for i in range(len(words) - k + 1):
        shingle = ' '.join(words[i:i+k])
        shingles.add(shingle)
    
    return shingles

# Test it
text = "The quick brown fox"
shingles = create_shingles(text, k=2)
print(f"Text: '{text}'")
print(f"2-shingles: {shingles}")
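
Shingles can also be built from characters instead of words, which makes the comparison more tolerant of small spelling changes. The sketch below is one possible variant; `create_char_shingles` and the default `k=5` are assumptions, not part of the lab code.

```python
def create_char_shingles(text, k=5):
    """Return the set of k-character shingles of text.

    The text is lowercased and runs of whitespace are collapsed to one space.
    """
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}

char_shingles = create_char_shingles("The quick  brown fox", k=5)
print(f"{len(char_shingles)} shingles, for example: {sorted(char_shingles)[:3]}")
```

A text shorter than `k` characters produces an empty set, so very short documents need a smaller `k`.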

## 4.3 Jaccard Similarity

**Jaccard Similarity** measures how similar two sets are:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{Intersection}}{\text{Union}}$$

- **J = 1.0**: Identical sets
- **J = 0.0**: No overlap
- **J > 0.8**: Often treated as a near-duplicate (the exact threshold depends on the application)

In [None]:
def jaccard_similarity(set1, set2):
    """Calculate Jaccard similarity between two sets."""
    intersection = set1 & set2  # Elements in both
    union = set1 | set2         # Elements in either
    
    if len(union) == 0:
        return 0.0
    
    return len(intersection) / len(union)

# Example
set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6}

print(f"Set A: {set_a}")
print(f"Set B: {set_b}")
print(f"Intersection: {set_a & set_b}")
print(f"Union: {set_a | set_b}")
print(f"Jaccard Similarity: {jaccard_similarity(set_a, set_b):.3f}")

## 4.4 Worked Example: Comparing Two Documents

Let's calculate the Jaccard similarity between two similar texts using k=2 shingles.

This example is from the lecture slides.

In [None]:
# Documents from lecture example
doc1 = "I am Zexun"
doc2 = "Zexun I am"

# Create shingles
shingles1 = create_shingles(doc1, k=2)
shingles2 = create_shingles(doc2, k=2)

print(f"Document 1: '{doc1}'")
print(f"  Shingles: {shingles1}")
print()
print(f"Document 2: '{doc2}'")
print(f"  Shingles: {shingles2}")
print()
print(f"Intersection: {shingles1 & shingles2}")
print(f"Union: {shingles1 | shingles2}")
print()
similarity = jaccard_similarity(shingles1, shingles2)
print(f"Jaccard Similarity: {similarity:.3f} ({similarity*100:.1f}%)")
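
Working through the same arithmetic by hand (with lowercasing, as `create_shingles` does), the two shingle sets share exactly one element:

$$S_1 = \{\text{"i am"},\ \text{"am zexun"}\}, \quad S_2 = \{\text{"zexun i"},\ \text{"i am"}\}$$

$$J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{|\{\text{"i am"}\}|}{|\{\text{"i am"},\ \text{"am zexun"},\ \text{"zexun i"}\}|} = \frac{1}{3} \approx 0.333$$

So although both documents contain the same three words, reordering them leaves only a third of the 2-shingles in common.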

In [None]:
# Another example with more similar documents
doc_a = "The quick brown fox jumps over the lazy dog"
doc_b = "The quick brown fox leaps over the lazy cat"

shingles_a = create_shingles(doc_a, k=2)
shingles_b = create_shingles(doc_b, k=2)

print(f"Doc A shingles: {shingles_a}")
print(f"Doc B shingles: {shingles_b}")
print()
similarity = jaccard_similarity(shingles_a, shingles_b)
print(f"Jaccard Similarity: {similarity:.3f} ({similarity*100:.1f}%)")
print(f"\nAre they duplicates? {'Yes' if similarity > 0.8 else 'No'} (threshold: 80%)")

---

# Part 5: PageRank and HITS Algorithms

---

## 5.1 The Random Surfer Model

**Imagine**: You're bored and randomly clicking links on the web.

- You start on some page
- You click a random link to go to another page
- You repeat this for hours...

**Question**: Which pages would you visit most often?

**Answer**: Pages that many other pages link to, especially when the linking pages are themselves popular!

This is the core insight behind **PageRank** - the algorithm Google was founded on.

### Markov Chains

Mathematically, this random clicking is a **Markov Chain**:
- **States** = web pages
- **Transitions** = clicking links
- **Key property**: Where you go next depends ONLY on where you are now (memoryless)
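
The random surfer can be simulated directly. This is a rough sketch on a made-up three-page graph with no dead ends; `TOY_LINKS`, `simulate_surfer`, and the fixed `seed` are assumptions for illustration.

```python
import random
from collections import Counter

# Hypothetical link graph; every page has at least one outgoing link.
TOY_LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def simulate_surfer(links, start, steps, seed=0):
    """Count page visits for a surfer who always clicks a random link."""
    rng = random.Random(seed)  # seeded for reproducibility
    visits = Counter()
    page = start
    for _ in range(steps):
        visits[page] += 1
        # Memoryless step: the next page depends only on the current page
        page = rng.choice(links[page])
    return visits

STEPS = 10_000
visits = simulate_surfer(TOY_LINKS, start="A", steps=STEPS)
for page, count in sorted(visits.items()):
    print(f"{page}: visited {count / STEPS:.1%} of the time")
```

For this graph the visit fractions should settle near 0.4, 0.2, and 0.4 for A, B, and C, which is the chain's stationary distribution. B is visited least because only half of A's clicks lead to it.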

## 5.2 PageRank Calculation

**The Core Idea**: A page is important if important pages link to it.

NetworkX makes PageRank calculation easy:

In [None]:
# Calculate PageRank for our demowebsite graph
pageranks = nx.pagerank(graph)

print("PageRank scores:")
print("-" * 40)
# Sort by PageRank value (highest first)
sorted_pr = sorted(pageranks.items(), key=lambda x: x[1], reverse=True)
for page, score in sorted_pr:
    print(f"{page:30} {score:.4f}")

In [None]:
# Visualize with node sizes proportional to PageRank
plt.figure(figsize=(12, 8))

# Calculate layout
positions = nx.spring_layout(graph, seed=2026, k=2)

# Node sizes based on PageRank (scaled up for visibility)
sizes = [pageranks[node] * 10000 for node in graph.nodes()]

# Draw
nx.draw(graph, positions,
        with_labels=True,
        node_size=sizes,
        node_color='lightcoral',
        font_size=8,
        arrows=True,
        arrowsize=15)

plt.title("DemoWebsite - Node Size = PageRank")
plt.show()

## 5.3 The Dead End Problem and Teleportation

**Problem**: What happens when the random surfer reaches a page with no outgoing links?

They get **stuck**! This breaks our model.

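**Solution**: **Teleportation**, controlled by a **damping factor**

With probability **alpha** (the damping factor, typically 0.85), the surfer follows a random link on the current page. This is what the `alpha` parameter of `nx.pagerank` means. With the remaining probability **1 - alpha** (typically 0.15), the surfer:
- Ignores the links
- "Teleports" to a completely random page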

This is like opening a new browser tab and typing a random URL.
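
To make teleportation concrete, here is a minimal pure-Python power-iteration sketch. It assumes the NetworkX convention that `alpha` is the probability of following a link (so `1 - alpha` is the teleport probability) and that a surfer at a dead end jumps to a uniformly random page. `simple_pagerank` and `TOY_GRAPH` are hypothetical names, and this is not how `nx.pagerank` is implemented internally.

```python
def simple_pagerank(links, alpha=0.85, iterations=100):
    """Approximate PageRank by repeated redistribution of scores.

    links: dict mapping each page to the list of pages it links to.
    alpha: probability of following a link; 1 - alpha is the teleport probability.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives an equal share of the teleport probability
        new_rank = {page: (1.0 - alpha) / n for page in pages}
        for page in pages:
            targets = set(links.get(page, []))
            if targets:
                # Split this page's score evenly over its outgoing links
                share = alpha * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dead end: spread the score evenly over all pages
                share = alpha * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: A and B link to each other, C links to A.
TOY_GRAPH = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = simple_pagerank(TOY_GRAPH)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {score:.4f}")
```

Page C has no incoming links, so its score comes entirely from teleportation: `(1 - alpha) / n = 0.15 / 3 = 0.05`.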

In [None]:
# Compare different alpha values
print("Effect of damping factor (alpha):")
print("=" * 60)

for alpha in [0.85, 0.50, 0.15]:
    pr = nx.pagerank(graph, alpha=alpha)
    # Get top 3 pages
    top3 = sorted(pr.items(), key=lambda x: x[1], reverse=True)[:3]
    print(f"\nalpha = {alpha}:")
    for page, score in top3:
        print(f"  {page}: {score:.4f}")

## 5.4 HITS Algorithm: Hubs vs Authorities

**HITS** (Hyperlink-Induced Topic Search) recognizes two types of importance:

1. **Authorities**: Pages that are linked to by many (experts on a topic)
   - Example: Wikipedia article on "Machine Learning"

2. **Hubs**: Pages that link to many good authorities (curators/directories)
   - Example: "Best ML Resources" blog post with many links

**Key insight**: Good hubs point to good authorities, and good authorities are pointed to by good hubs!
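
The mutual reinforcement between hubs and authorities can be sketched as an iteration: authority scores are summed from the hubs pointing in, then hub scores are summed from the authorities pointed to. The code below is a simplified illustration, with scores rescaled to sum to 1 after each step; `simple_hits` and `TOY_WEB` are hypothetical names, not the NetworkX implementation.

```python
def simple_hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of pages linking here
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in set(targets):
                auth[target] += hub[source]
        total = sum(auth.values()) or 1.0
        auth = {page: score / total for page, score in auth.items()}
        # Hub update: add up the authority scores of pages linked to
        hub = {page: sum(auth[t] for t in set(links.get(page, []))) for page in pages}
        total = sum(hub.values()) or 1.0
        hub = {page: score / total for page, score in hub.items()}
    return hub, auth

# Hypothetical graph: "guide" and "blog" are link collections.
TOY_WEB = {"guide": ["wiki", "paper"], "blog": ["wiki"]}
hub_scores, auth_scores = simple_hits(TOY_WEB)
print("Hubs:       ", {p: round(s, 3) for p, s in hub_scores.items()})
print("Authorities:", {p: round(s, 3) for p, s in auth_scores.items()})
```

In this toy graph `guide` is the strongest hub because it links to both authorities, and `wiki` is the strongest authority because both hubs link to it.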

In [None]:
# Calculate HITS scores
hubs, authorities = nx.hits(graph)

print("Hub Scores (pages that link to many):")
print("-" * 40)
sorted_hubs = sorted(hubs.items(), key=lambda x: x[1], reverse=True)
for page, score in sorted_hubs:
    print(f"{page:30} {score:.4f}")

In [None]:
print("Authority Scores (pages linked to by many):")
print("-" * 40)
sorted_auth = sorted(authorities.items(), key=lambda x: x[1], reverse=True)
for page, score in sorted_auth:
    print(f"{page:30} {score:.4f}")

## 5.5 Comparison: PageRank vs HITS

| Aspect | PageRank | HITS |
|--------|----------|------|
| **Scores** | Single importance score | Two scores (hub + authority) |
| **Query-dependent** | No (computed once) | Yes (depends on search topic) |
| **Best for** | Global importance ranking | Topic-specific searches |
| **Dead end handling** | Teleportation | May have convergence issues |
| **Used by** | Google (originally) | Older search engines |

In [None]:
# Create comparison DataFrame
comparison_data = []
for page in graph.nodes():
    comparison_data.append({
        'Page': page,
        'PageRank': round(pageranks[page], 4),
        'Hub Score': round(hubs[page], 4),
        'Authority': round(authorities[page], 4)
    })

df = pd.DataFrame(comparison_data)
df = df.sort_values('PageRank', ascending=False)
print("Comparison of All Ranking Scores:")
print(df.to_string(index=False))

---

# Part 6: LLM-Assisted Content Summarization (OPTIONAL)

---

**Note**: This section requires an OpenAI API key. If you don't have one, you can skip this part or just read through the code.

Get your API key at: https://platform.openai.com/api-keys

Alternatively, the University of Edinburgh runs its own platform called ELM. ELM is the University's AI innovation platform, a central gateway providing safer access to Generative Artificial Intelligence (GAI) through Large Language Models (LLMs).

You can get a free ELM API key (with a fixed usage limit) to use OpenAI models. See https://elm.edina.ac.uk/saml2/authenticate/elm

## 6.1 When to Use LLMs for Web Data

LLMs are useful when you need to:
- Summarize large amounts of scraped text
- Extract structured information from unstructured content
- Categorize or classify web pages
- Generate descriptions from product data

In [None]:
# Install OpenAI library if needed (uncomment to run)
# !pip install openai

## 6.2 OpenAI API Setup

In [None]:
from openai import OpenAI

def summarize_with_openai(text, api_key):
    """
    Generate a summary of text using OpenAI's API.
    
    Args:
        text: The text to summarize
        api_key: Your OpenAI API key
        
    Returns:
        Summary string or None if error
    """
    try:
        client = OpenAI(api_key=api_key)
        
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that summarizes web content concisely."},
                {"role": "user", "content": f"Please summarize the following text in 2-3 sentences:\n\n{text[:4000]}"}
            ],
            max_tokens=150,
            temperature=0.5
        )
        
        return response.choices[0].message.content
    
    except Exception as e:
        print(f"Error with OpenAI API: {e}")
        return None

## 6.3 Scraping and Summarizing Web Pages

In [None]:
def scrape_content(url):
    """
    Scrape main content from a webpage.
    
    Args:
        url: The URL to scrape
        
    Returns:
        Text content or None if error
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Remove unwanted elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        
        # Extract main content
        content = soup.find('main') or soup.find('article') or soup.find('body')
        
        if content:
            return ' '.join(content.stripped_strings)
        return None
    
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

## 6.4 Processing Multiple URLs

In [None]:
def process_urls_with_summaries(urls, api_key):
    """
    Scrape multiple URLs and generate summaries.
    
    Args:
        urls: List of URLs to process
        api_key: OpenAI API key
        
    Returns:
        DataFrame with URLs and summaries
    """
    results = []
    
    for url in urls:
        print(f"Processing: {url}")
        
        # Scrape content
        content = scrape_content(url)
        
        if content:
            # Get summary
            summary = summarize_with_openai(content, api_key)
            
            results.append({
                'URL': url,
                'Content Length': len(content),
                'Summary': summary
            })
            
            # Be polite - add delay
            time.sleep(1)
        else:
            results.append({
                'URL': url,
                'Content Length': 0,
                'Summary': 'Failed to scrape'
            })
    
    return pd.DataFrame(results)

In [None]:
# Example usage (add your API key before running)

# IMPORTANT: Replace with your actual API key
# Never commit your API key to version control!
OPENAI_API_KEY = "YOUR_API_KEY_HERE"

# Better practice: Use environment variable
# import os
# OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')

urls_to_process = [
    "https://www.drps.ed.ac.uk/current/dpt/cxcmse11615.htm",
    "https://www.drps.ed.ac.uk/current/dpt/cxcmse11427.htm"
]

summary_table = process_urls_with_summaries(urls_to_process, OPENAI_API_KEY)
print(summary_table)

print(summary_table["Summary"][0], "\n",
      summary_table["Summary"][1])

---

# Summary

---

## Key Takeaways

1. **Web Crawling**: Automated link discovery using a frontier and visited set
2. **Shingling**: Breaking text into k-word pieces for comparison
3. **Jaccard Similarity**: Intersection/Union to measure document similarity
4. **PageRank**: Random surfer model with teleportation for dead ends
5. **HITS**: Hubs (link to many) vs Authorities (linked by many)

## Best Practices

- **Always check robots.txt** before crawling
- **Add delays** between requests (`time.sleep(1)`)
- **Limit crawl depth** for testing and politeness
- **Handle errors gracefully** (try/except for network issues)
- **Store results** to avoid re-crawling
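
Several of these practices (delays, error handling, stored results) can be combined in one small helper. The following is a sketch under stated assumptions: `crawl_with_cache` is a hypothetical function, the `fetch` argument stands in for whatever download function you use, and results are stored in a single JSON file.

```python
import json
import tempfile
import time
from pathlib import Path

def crawl_with_cache(urls, fetch, cache_path, delay_seconds=1.0):
    """Fetch each URL at most once, storing results in a JSON file."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    for url in urls:
        if url in cache:
            continue  # already stored: no need to re-crawl
        try:
            cache[url] = fetch(url)
        except Exception as error:  # handle failures gracefully
            print(f"Skipping {url}: {error}")
            continue
        cache_file.write_text(json.dumps(cache, indent=2))
        time.sleep(delay_seconds)   # be polite between requests
    return cache

# Usage with a stand-in fetch function (no network access).
calls = []

def fake_fetch(url):
    calls.append(url)
    return f"<html>content of {url}</html>"

with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "crawl_cache.json"
    first = crawl_with_cache(["a.html", "b.html"], fake_fetch, cache_path, delay_seconds=0)
    second = crawl_with_cache(["a.html", "b.html", "c.html"], fake_fetch, cache_path, delay_seconds=0)

print(f"Pages actually fetched: {calls}")
```

On the second call only `c.html` is fetched, because the first two pages are already in the cache file.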

## Decision Tree: Which Algorithm?

```
Need to rank pages?
    |
    v
Is ranking for a specific topic/query?
    |
  Yes     No
    |      |
    v      v
  HITS   PageRank
```

---

## Next Steps

1. Complete the **Week2-exercise.ipynb** to practice these concepts
2. Try the **APC2 challenge** to apply crawling to real websites
3. Explore the **NetworkX documentation** for more graph analysis options