# 🎓 Academic Web Crawler - Clean Pipeline

A modular web crawler for academic websites. Run the cells in order; later steps reuse variables defined in earlier ones.

## Pipeline Steps:
1. **Installation & Setup** - Install the Python packages, Ollama, and the Llama 3.2 model
2. **Initialize Ollama** - Start the Ollama server and create the LLM client
3. **Configuration** - Set your target URL and crawl parameters
4. **URL Discovery** - Crawl the website and collect its links
5. **Filter URLs** - Use AI to select relevant academic pages
6. **Download Content** - Get HTML content from the filtered URLs
7. **Extract Text** - Clean the HTML and save it to .txt files

---

## 📦 Step 1: Installation & Setup

Install all required packages and set up Ollama.

In [1]:
# Install Python packages
!pip install -q langchain-ollama beautifulsoup4 lxml requests tqdm

print("✅ Packages installed successfully!")

✅ Packages installed successfully!


In [7]:
!ollama serve

Error: listen tcp 127.0.0.1:11434: bind: address already in use


In [8]:
# Install Ollama
!sudo apt update > /dev/null 2>&1
!sudo apt install -y pciutils > /dev/null 2>&1
!curl -fsSL https://ollama.com/install.sh | sh

print("✅ Ollama installed successfully!")

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> NVIDIA GPU installed.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
✅ Ollama installed successfully!


In [9]:
# Download Llama 3.2 model
!ollama pull llama3.2

print("✅ Llama 3.2 model ready!")

## 🚀 Step 2: Initialize Ollama Server & LLM

Start the Ollama server and initialize the language model.
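The cell below pauses for a fixed five seconds after launching `ollama serve`. As a minimal sketch of an alternative, assuming the server listens on 127.0.0.1:11434 as the install log reports, the helpers here poll until the port accepts connections. The names `port_is_open` and `wait_until` are illustrative and are not part of Ollama or LangChain.

```python
import socket
import time
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(probe: Callable[[], bool], max_wait: float = 30.0,
               interval: float = 0.5) -> bool:
    """Call probe() repeatedly until it returns True or max_wait seconds pass."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A call such as `wait_until(lambda: port_is_open("127.0.0.1", 11434))` could stand in for `time.sleep(5)`: it returns as soon as the server is reachable and reports failure instead of continuing silently.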

In [10]:
import subprocess
import threading
import time
from langchain_ollama.llms import OllamaLLM

print("🚀 Starting Ollama server...")

# Kill any existing Ollama processes
subprocess.run(["pkill", "-9", "ollama"], stderr=subprocess.DEVNULL)
time.sleep(2)

# Start Ollama server in background
def run_ollama_serve():
    subprocess.Popen(["ollama", "serve"],
                     stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL)

thread = threading.Thread(target=run_ollama_serve, daemon=True)
thread.start()
time.sleep(5)

# Initialize LLM with retry logic
print("🤖 Initializing Llama 3.2...")
llm = None

for attempt in range(3):
    try:
        llm = OllamaLLM(model="llama3.2", temperature=0)
        test_response = llm.invoke("Say OK")
        print(f"✅ LLM initialized successfully! Test response: {test_response}")
        break
    except Exception as e:
        if attempt < 2:
            print(f"⚠️  Retry {attempt + 1}/3...")
            # Restart the server before the next attempt
            subprocess.run(["pkill", "-9", "ollama"], stderr=subprocess.DEVNULL)
            time.sleep(2)
            threading.Thread(target=run_ollama_serve, daemon=True).start()
            time.sleep(5)
        else:
            # Chain the original error so the traceback keeps the root cause
            raise RuntimeError(f"❌ Failed to initialize LLM: {e}") from e

🚀 Starting Ollama server...
🤖 Initializing Llama 3.2...
✅ LLM initialized successfully! Test response: OK


## ⚙️ Step 3: Configuration

Set your target URL and crawling parameters.
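The configuration cell derives the output folder from the host name of `START_URL`. The sketch below shows one way to do that derivation in a small function; `domain_slug` and `output_dir_for` are hypothetical names, and stripping only a leading `www.` is an assumption made here (the notebook's own expression removes `www.` wherever it occurs in the host).

```python
from urllib.parse import urlparse


def domain_slug(start_url: str) -> str:
    """Turn the host of start_url into a folder-friendly name."""
    host = urlparse(start_url).netloc.lower()
    if not host:
        raise ValueError(f"URL has no host: {start_url!r}")
    # Strip only a leading "www." so hosts that merely contain it are untouched
    if host.startswith("www."):
        host = host[len("www."):]
    return host.replace(".", "_").replace(":", "_")


def output_dir_for(start_url: str, base_dir: str) -> str:
    """Join base_dir and the slug derived from start_url."""
    return f"{base_dir.rstrip('/')}/{domain_slug(start_url)}"


print(output_dir_for("https://www.bms.ac.lk/",
                     "/content/drive/MyDrive/academic_content_output"))
```

Raising on a URL without a host surfaces a mistyped `START_URL` at configuration time rather than during the crawl.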

In [25]:
from google.colab import drive
drive.mount('/content/drive')
print("✅ Google Drive mounted.")

Mounted at /content/drive
✅ Google Drive mounted.


In [26]:
# ============================================================
# CONFIGURATION - MODIFY THESE VALUES
# ============================================================

# Target website to crawl
START_URL = "https://www.bms.ac.lk/"

# Maximum crawl depth (0 = only start page, 1 = start + linked pages, etc.)
MAX_DEPTH = 100

# Output directory for final text files
# Automatically generate OUTPUT_DIR based on START_URL's domain
from urllib.parse import urlparse  # needed here: this cell runs before the Step 4 imports

parsed_url = urlparse(START_URL)
domain = parsed_url.netloc.replace('www.', '').replace('.', '_')
OUTPUT_DIR = f"/content/drive/MyDrive/academic_content_output/{domain}"

# Parallel processing settings
MAX_DOWNLOAD_WORKERS = 32  # Concurrent downloads
AI_BATCH_SIZE = 20         # URLs per AI batch
AI_MAX_WORKERS = 10        # Concurrent AI requests

# Minimum text length to save (characters)
MIN_TEXT_LENGTH = 500

print("✅ Configuration set:")
print(f"   Start URL: {START_URL}")
print(f"   Max Depth: {MAX_DEPTH}")
print(f"   Output Dir: {OUTPUT_DIR}")

✅ Configuration set:
   Start URL: https://www.bms.ac.lk/
   Max Depth: 100
   Output Dir: /content/drive/MyDrive/academic_content_output/bms_ac_lk


## 🕷️ Step 4: URL Discovery

Crawl the website and discover all URLs.
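The crawler cell rejects a URL when any entry of `REJECT_KEYWORDS` occurs anywhere in it as a substring. Short entries such as `log` or `hr` also occur inside ordinary words, so a path like `/BSc-Biotechnology` would be dropped. The sketch below compares whole path tokens instead; `SAMPLE_REJECT`, `path_tokens`, and `has_rejected_token` are illustrative names, and the keyword set is a shortened stand-in for the full list.

```python
import re
from urllib.parse import urlparse

# Shortened, illustrative subset of REJECT_KEYWORDS (assumption for this sketch)
SAMPLE_REJECT = {"log", "login", "news", "hr", "page"}


def path_tokens(url: str) -> set:
    """Split the URL path into lowercase alphanumeric tokens."""
    path = urlparse(url).path.lower()
    return {tok for tok in re.split(r"[^a-z0-9]+", path) if tok}


def has_rejected_token(url: str, rejected=SAMPLE_REJECT) -> bool:
    """True if any whole path token equals a rejected keyword."""
    return not path_tokens(url).isdisjoint(rejected)


# Substring matching would reject this URL because "biotechnology" contains "log"
print(has_rejected_token("https://example.edu/BSc-Biotechnology"))
print(has_rejected_token("https://example.edu/student-login"))
```

Token matching has its own gaps: multi-token entries such as `wp-content` or `node/add`, and prefixes such as `utm_`, would need separate handling, so this is a starting point rather than a drop-in replacement.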

In [27]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, urldefrag
from typing import Set, List

# Keywords to reject during crawling
REJECT_KEYWORDS = [
    "assets", "attachments", "audio", "css", "downloads", "favicon", "fonts", "images", "img", "js", "media", "misc", "pdf", "photo", "pict", "png", "scripts", "static", "styles", "themes", "uploads", "video", "wp-content", "wp-includes", ".jpg", ".jpeg", ".png", ".gif", ".svg", ".mp4", ".mp3", ".zip", ".tar", ".gz",
    # authentication / portals
    "account", "auth", "authenticate", "authentication", "cas", "dashboard", "ezproxy", "forgot", "identity", "login", "logout", "mfa", "my-account", "my-profile", "netid", "password", "portal", "proxy", "register", "saml", "shibboleth", "signin", "signout", "signup", "sso", "user", "validate",
    # news / marketing / media
    "announcement", "archive", "blog", "calendar", "category", "event", "events", "feed", "gallery", "magazine", "news", "newsletter", "press", "rss", "schedule", "slideshow", "stories", "tags", "upcoming", "view-event",
    # careers / jobs
    "applicant", "benefits", "career", "careers", "compensation", "employment", "hiring", "hr", "human-resources", "internship", "job", "jobs", "onboarding", "opportunities", "payroll", "position", "recruitment", "staff-training", "vacancy", "vacancies",
    # legal / policy pages
    "accessibility", "ada", "compliance", "cookie", "cookies", "copyright", "disclaimer", "legal", "license", "maintainer", "maintenance", "policy", "privacy", "security", "terms", "terms-and-conditions",
    # social media / external platforms
    "facebook", "instagram", "linkedin", "pinterest", "share", "snapchat", "tiktok", "tumblr", "twitter", "vimeo", "whatsapp", "youtube",
    # tracking / analytics
    "analytics", "fbclid", "ga_", "gclid", "google-analytics", "log", "logs", "metrics", "pixel", "stats", "tracker", "tracking", "utm_",
    # search / filters / pagination
    "filter", "limit", "offset", "order", "page", "query", "results", "search", "sort", "view", "viewitems",
    # system / administrative
    "admin", "api", "backup", "bin", "cache", "cgi-bin", "config", "configuration", "cron", "devel", "dev", "etc", "install", "modules", "node/add", "php", "plugins", "server-status", "settings", "sql", "structure", "tmp", "update", "upgrade", "var", "wp-admin", "xmlrpc"
]

# File extensions to skip
REJECT_EXTENSIONS = [
    ".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp",
    ".css", ".js", ".map", ".pdf", ".zip", ".rar",
    ".mp4", ".mp3", ".doc", ".docx", ".xls", ".xlsx"
]

def normalize_url(url: str) -> str:
    """Clean and normalize URL."""
    url, _ = urldefrag(url)  # Remove fragments (#section)
    parsed = urlparse(url)

    if parsed.scheme not in ("http", "https"):
        return None

    # Remove query parameters
    clean = parsed._replace(query="").geturl()

    # Remove trailing slash
    return clean.rstrip("/")

def is_useful_url(url: str, base_domain: str) -> bool:
    """Check if URL is worth crawling."""
    url_lower = url.lower()

    # Must be on the same host (compare the parsed host, not a substring of the URL)
    if urlparse(url_lower).netloc != base_domain.lower():
        return False

    # Skip file extensions
    if any(url_lower.endswith(ext) for ext in REJECT_EXTENSIONS):
        return False

    # Skip rejected keywords
    if any(keyword in url_lower for keyword in REJECT_KEYWORDS):
        return False

    return True

def get_links_from_page(url: str, start_url: str, base_domain: str) -> Set[str]:
    """Extract all valid links from a page."""
    try:
        response = requests.get(url, timeout=10,
                               headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(response.text, 'lxml')

        links = set()
        for a_tag in soup.find_all('a', href=True):
            href = str(a_tag['href']).strip()

            # Resolve relative links against the page they appear on
            full_url = urljoin(url, href)

            # Normalize
            normalized = normalize_url(full_url)

            # Validate
            if (normalized and
                normalized.startswith(start_url) and
                is_useful_url(normalized, base_domain)):
                links.add(normalized)

        return links

    except Exception as e:
        print(f"⚠️  Error fetching {url}: {e}")
        return set()

def crawl_website(start_url: str, max_depth: int) -> List[str]:
    """Crawl website and return all discovered URLs."""
    print(f"🕷️  Starting crawl: {start_url}")
    print(f"   Max depth: {max_depth}\n")

    base_domain = urlparse(start_url).netloc
    visited = set()
    all_urls = set()

    def _crawl_recursive(current_url: str, depth: int):
        if depth > max_depth or current_url in visited:
            return

        visited.add(current_url)
        print(f"[Depth {depth}] Crawling: {current_url}")

        # Get links from current page
        links = get_links_from_page(current_url, start_url, base_domain)
        all_urls.update(links)

        print(f"           Found: {len(links)} links\n")

        # Recursively crawl discovered links
        for link in links:
            _crawl_recursive(link, depth + 1)

    # Start crawling
    _crawl_recursive(start_url, 0)

    result = sorted(list(all_urls))
    print(f"\n✅ Crawl complete!")
    print(f"   Total URLs discovered: {len(result)}")
    return result

# Run the crawler
discovered_urls = crawl_website(START_URL, MAX_DEPTH)

# Display first 10 URLs
print("\n📋 Sample URLs:")
for url in discovered_urls[:10]:
    print(f"   {url}")
if len(discovered_urls) > 10:
    print(f"   ... and {len(discovered_urls) - 10} more")

🕷️  Starting crawl: https://www.bms.ac.lk/
   Max depth: 100

[Depth 0] Crawling: https://www.bms.ac.lk/
           Found: 29 links

[Depth 1] Crawling: https://www.bms.ac.lk/Executive-Certificate-in-Management
           Found: 29 links

[Depth 2] Crawling: https://www.bms.ac.lk/MBA-Digital-Transformation
           Found: 29 links

[Depth 3] Crawling: https://www.bms.ac.lk/HD-Biomedical-Science
           Found: 29 links

[Depth 4] Crawling: https://www.bms.ac.lk/HD-Food-Science-and-Nutrition
           Found: 29 links

[Depth 5] Crawling: https://www.bms.ac.lk/About-BMS
           Found: 29 links

[Depth 6] Crawling: https://www.bms.ac.lk/EARBio/index.html
           Found: 0 links

[Depth 6] Crawling: https://www.bms.ac.lk/Higher-National-Diploma
           Found: 29 links

[Depth 7] Crawling: https://www.bms.ac.lk/MBA-Teesside
           Found: 29 links

[Depth 8] Crawling: https://www.bms.ac.lk/Board-of-Governors
           Found: 30 links

[Depth 9] Crawling: https://www.bm
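
Because `_crawl_recursive` follows each link as soon as it is found, the depth shown in the log is the recursion depth, not the number of link hops from the start page. With a small `MAX_DEPTH`, a page first reached through a long chain would be marked visited and its links never followed. The sketch below is a breadth-first variant in which depth does mean link distance. `crawl_bfs` and its `get_links` parameter are hypothetical; `get_links` stands in for a wrapper around `get_links_from_page`.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl_bfs(start_url: str, max_depth: int,
              get_links: Callable[[str], Iterable[str]]) -> Dict[str, int]:
    """Breadth-first crawl; maps each URL to its link distance from start_url."""
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        current = queue.popleft()
        depth = depths[current]
        if depth >= max_depth:
            continue  # URL is recorded, but its links are not followed
        for link in get_links(current):
            if link not in depths:
                depths[link] = depth + 1
                queue.append(link)
    return depths


# Toy link graph standing in for real pages
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": ["e"], "e": []}
print(crawl_bfs("a", 2, lambda u: graph.get(u, [])))
```

With this shape, `max_depth=0` yields only the start page and `max_depth=1` adds the pages it links to, which matches the comment on `MAX_DEPTH` in the configuration cell. The queue also avoids deep recursion on large sites.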

## 🔍 Step 5: Filter URLs with AI

Use the LLM to select academically relevant URLs. A keyword heuristic (`heuristic_filter`) is also defined in the cell, but this run does not apply it.
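
The filter depends on the model returning a JSON array, yet models often wrap the array in extra prose or change a URL slightly, for example by adding a trailing slash. The sketch below is one tolerant way to parse such a reply. `parse_url_array` is a hypothetical helper, and treating URLs as equal when they differ only by a trailing slash is an assumption of this example.

```python
import json
from typing import Iterable, List


def parse_url_array(response: str, allowed: Iterable[str]) -> List[str]:
    """Decode the first JSON array in response and keep only URLs from allowed."""
    # Map a slash-insensitive key back to the exact URL that was sent to the model
    lookup = {url.rstrip("/"): url for url in allowed}
    decoder = json.JSONDecoder()
    start = response.find("[")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            # Not valid JSON at this bracket; try the next one
            start = response.find("[", start + 1)
            continue
        kept = []
        for item in value:
            key = item.rstrip("/") if isinstance(item, str) else None
            if key in lookup and lookup[key] not in kept:
                kept.append(lookup[key])
        return kept
    return []


reply = 'Here you go:\n["https://example.edu/courses/", "https://other.example/x"]'
print(parse_url_array(reply, ["https://example.edu/courses"]))
```

Using `raw_decode` lets the JSON parser decide where the array ends, so brackets that appear later in the reply do not matter, and anything the model invents is discarded because only URLs from the submitted batch are returned.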

In [None]:
import json
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

# Academic keywords for heuristic filtering
ACADEMIC_KEYWORDS = [
    "program", "programme", "course", "degree",
    "undergraduate", "postgraduate", "graduate",
    "faculty", "department", "school",
    "admission", "eligibility", "apply",
    "curriculum", "syllabus",
    "diploma", "certificate", "bachelor", "master", "phd"
]

NON_ACADEMIC_KEYWORDS = [
    "news", "event", "staff", "research",
    "library", "login", "portal",
    "gallery", "download", "contact"
]

def heuristic_filter(url: str) -> bool:
    """Quick heuristic check for academic content."""
    url_lower = url.lower()

    # Must contain academic keywords
    has_academic = any(kw in url_lower for kw in ACADEMIC_KEYWORDS)

    # Must not contain non-academic keywords
    has_non_academic = any(kw in url_lower for kw in NON_ACADEMIC_KEYWORDS)

    return has_academic and not has_non_academic

def ai_filter_batch(urls_batch: List[str], llm) -> List[str]:
    """Use LLM to filter a batch of URLs."""
    urls_text = "\n".join(f"- {url}" for url in urls_batch)

    prompt = f"""You are filtering URLs for an academic web crawler.

Select URLs that likely contain:
- Academic programmes, courses, or degrees
- Admission or eligibility information
- Faculty or department information
- Curriculum or syllabus details

Return ONLY a JSON array of selected URLs.
If none are relevant, return [].

URLs to evaluate:
{urls_text}

JSON array:"""

    try:
        response = llm.invoke(prompt).strip()

        # Extract JSON array
        match = re.search(r'\[.*?\]', response, re.DOTALL)
        if match:
            selected = json.loads(match.group())
            return [url for url in selected if url in urls_batch]
    except Exception as e:
        print(f"⚠️  AI filter error: {e}")

    return []

def filter_urls_with_ai(urls: List[str], llm, batch_size: int, max_workers: int) -> List[str]:
    """Filter URLs using AI processing."""
    print(f"🔍 Filtering {len(urls)} URLs using AI...\n")
    
    # AI filtering in batches
    # We now use the original 'urls' list directly
    batches = [urls[i:i + batch_size] 
               for i in range(0, len(urls), batch_size)]
    
    ai_filtered = []
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(ai_filter_batch, batch, llm): i 
                  for i, batch in enumerate(batches)}
        
        for future in tqdm(as_completed(futures), 
                          total=len(futures), 
                          desc="   Processing batches"):
            try:
                result = future.result()
                if result:
                    ai_filtered.extend(result)
                time.sleep(0.5)  # Rate limiting
            except Exception as e:
                print(f"⚠️  Batch error: {e}")
    
    # Deduplicate and sort the results approved by the AI
    final = sorted(list(set(ai_filtered)))
    
    print(f"\n✅ AI Filtering complete!")
    print(f"   Final relevant URLs: {len(final)}")
    return final

# Run the filter
filtered_urls = filter_urls_with_ai(
    discovered_urls,
    llm,
    AI_BATCH_SIZE,
    AI_MAX_WORKERS
)

# Display first 10 filtered URLs
print("\n📋 Sample filtered URLs:")
for url in filtered_urls[:10]:
    print(f"   {url}")
if len(filtered_urls) > 10:
    print(f"   ... and {len(filtered_urls) - 10} more")

🔍 Filtering 34 URLs using AI...

   Processing batches: 100%|██████████| 2/2 [00:09<00:00,  4.75s/it]

✅ AI Filtering complete!
   Final relevant URLs: 18

📋 Sample filtered URLs:
   https://www.bms.ac.lk/BSc-(Hons)-Biomedical-Science
   https://www.bms.ac.lk/BSc-(Hons)-Food-Science-and-Nutrition
   https://www.bms.ac.lk/Bachelor-of-Business-Management(Hons)
   https://www.bms.ac.lk/Executive-Certificate-in-Management
   https://www.bms.ac.lk/Executive-Diploma-in-Management
   https://www.bms.ac.lk/Graduate-Diploma-In-Management
   https://www.bms.ac.lk/HD-Biomedical-Science
   https://www.bms.ac.lk/HD-Food-Science-and-Nutrition
   https://www.bms.ac.lk/Higher-National-Diploma
   https://www.bms.ac.lk/International-Foundation-Diploma-(Applied-Science)
   ... and 8 more





## 📥 Step 6: Download Content

Download HTML content from all filtered URLs.
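
`download_url` in the cell below gives up on a page after a single failed request. A minimal sketch of retrying with exponential backoff follows. `fetch_with_retries` is a hypothetical helper, and the `fetch` argument stands for any function that returns HTML or raises, such as one that calls `requests.get` followed by `raise_for_status()`.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(fetch: Callable[[str], str], url: str,
                       attempts: int = 3, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call fetch(url) up to `attempts` times, doubling the pause after each failure."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose, mirroring download_url
            if attempt == attempts - 1:
                print(f"Giving up on {url}: {exc}")
                return None
            sleep(base_delay * (2 ** attempt))
    return None
```

The `sleep` parameter exists only so the pauses can be replaced in tests. With 32 download workers, the backoff also reduces pressure on a server that is already struggling.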

In [29]:
from typing import Dict, Optional

def download_url(url: str) -> Optional[str]:
    """Download HTML content from URL."""
    try:
        response = requests.get(
            url,
            timeout=15,
            headers={"User-Agent": "Mozilla/5.0 (Academic Crawler)"}
        )
        response.raise_for_status()
        return response.text
    except Exception as e:
        print(f"⚠️  Failed to download {url}: {e}")
        return None

def download_all_content(urls: List[str], max_workers: int) -> Dict[str, str]:
    """Download HTML content from all URLs in parallel."""
    print(f"📥 Downloading content from {len(urls)} URLs...\n")

    content_map = {}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(download_url, url): url for url in urls}

        for future in tqdm(as_completed(futures),
                          total=len(futures),
                          desc="   Downloading"):
            url = futures[future]
            try:
                html = future.result()
                if html:
                    content_map[url] = html
            except Exception as e:
                print(f"⚠️  Error processing {url}: {e}")

    print(f"\n✅ Download complete!")
    print(f"   Successfully downloaded: {len(content_map)}/{len(urls)} pages")
    return content_map

# Download all content
downloaded_content = download_all_content(filtered_urls, MAX_DOWNLOAD_WORKERS)

print(f"\n📊 Content statistics:")
total_size = sum(len(html) for html in downloaded_content.values())
print(f"   Total HTML size: {total_size / 1024 / 1024:.2f} MB")
# Guard against division by zero when nothing was downloaded
if downloaded_content:
    print(f"   Average page size: {total_size / len(downloaded_content) / 1024:.2f} KB")

📥 Downloading content from 18 URLs...

   Downloading: 100%|██████████| 18/18 [00:00<00:00, 28.99it/s]

✅ Download complete!
   Successfully downloaded: 18/18 pages

📊 Content statistics:
   Total HTML size: 0.83 MB
   Average page size: 46.99 KB





## 📝 Step 7: Extract Text & Save Files

Extract clean text from HTML and save to .txt files.
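
`safe_filename` in the cell below truncates long names to 150 characters, so two URLs that share their first 150 characters would be written to the same file. One way around this, sketched here under the assumption that a short hash suffix is acceptable in the file names, is to append part of a digest of the full URL. `hashed_filename` is a hypothetical name.

```python
import hashlib
import re
from urllib.parse import urlparse


def hashed_filename(url: str, max_stem: int = 150) -> str:
    """Readable file name plus an 8-character hash of the full URL."""
    parsed = urlparse(url)
    # Collapse every run of characters outside [A-Za-z0-9-] into one underscore
    stem = re.sub(r"[^A-Za-z0-9-]+", "_", f"{parsed.netloc}{parsed.path}").strip("_")
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{stem[:max_stem]}_{digest}.txt"


print(hashed_filename("https://www.bms.ac.lk/BSc-(Hons)-Biomedical-Science"))
```

The hash is computed from the whole URL, so the same page always maps to the same file while different pages stay distinct even after truncation.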

In [30]:
import os
from pathlib import Path
from bs4 import Comment

def clean_html_to_text(html: str) -> str:
    """Extract clean text from HTML."""
    soup = BeautifulSoup(html, "lxml")

    # Remove unwanted elements
    for tag in soup(["script", "style", "nav", "footer",
                     "aside", "header", "iframe", "form"]):
        tag.decompose()

    # Remove comments (bs4 stores them as Comment nodes, without the <!-- --> markers)
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Extract text
    text = soup.get_text(separator="\n", strip=True)

    # Clean up whitespace
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    clean_text = "\n".join(lines)

    return clean_text

def safe_filename(url: str) -> str:
    """Generate safe filename from URL."""
    parsed = urlparse(url)
    path = parsed.path if parsed.path else "root"

    # Create readable filename
    filename = f"{parsed.netloc}{path}"
    filename = filename.replace("/", "_").replace("?", "_")
    filename = filename.replace("&", "_").replace(":", "_")
    filename = filename.replace("=", "_").replace(".", "_")

    # Limit length
    if len(filename) > 150:
        filename = filename[:150]

    return filename + ".txt"

def save_content_to_files(
    content_map: Dict[str, str],
    output_dir: str,
    min_length: int
) -> int:
    """Extract text and save to files."""
    print(f"📝 Extracting text and saving files...\n")

    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    saved_count = 0
    skipped_count = 0

    for url, html in tqdm(content_map.items(), desc="   Processing"):
        try:
            # Extract clean text
            clean_text = clean_html_to_text(html)

            # Skip if too short
            if len(clean_text) < min_length:
                skipped_count += 1
                continue

            # Generate filename
            filename = safe_filename(url)
            filepath = output_path / filename

            # Write to file
            with open(filepath, "w", encoding="utf-8") as f:
                f.write(f"URL: {url}\n")
                f.write("=" * 80 + "\n\n")
                f.write(clean_text)

            saved_count += 1

        except Exception as e:
            print(f"⚠️  Error processing {url}: {e}")

    print(f"\n✅ Text extraction complete!")
    print(f"   Files saved: {saved_count}")
    print(f"   Files skipped (too short): {skipped_count}")
    print(f"   Output directory: {output_dir}")

    return saved_count

# Extract and save all content
saved_files = save_content_to_files(
    downloaded_content,
    OUTPUT_DIR,
    MIN_TEXT_LENGTH
)

📝 Extracting text and saving files...

   Processing: 100%|██████████| 18/18 [00:00<00:00, 40.05it/s]

✅ Text extraction complete!
   Files saved: 18
   Files skipped (too short): 0
   Output directory: /content/drive/MyDrive/academic_content_output/bms_ac_lk



