# Batch Web Scraper & Page Scorer for Mission–Vision Discovery

This notebook automatically visits a list of company websites and extracts meaningful text pages such as Mission, Vision, Values, Strategy, Sustainability, or Annual Reports.
It converts each page into a Markdown file for easy reading and ranks the pages by how relevant they are to the company’s purpose and values.

1️⃣ **Batch Web Scraper**

This part of the code is designed to collect and save web pages from a list of URLs (or from a CSV file).
It focuses mainly on fetching and cleaning content, not ranking it yet.

What it does:

* Reads URLs either directly from the code or from a file called candidates.csv.

* Visits each website one by one while respecting the site’s robots.txt (the file that says what bots are allowed to access).

* Downloads each page’s HTML content and removes irrelevant parts such as navigation bars, headers, and cookie banners.

* Extracts only the main readable content: titles, headings, and paragraphs.

* Converts that clean text into a Markdown file (.md) and saves it locally in a folder (out_md/).

* Creates a manifest.csv file with basic information about each page, such as:


    *  URL
    *  Status (success, blocked, or error)
    *  Number of text blocks found
    *  A short text snippet as preview
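
A manifest with the fields listed above could be written with Python's standard `csv` module. This is a minimal sketch, not the notebook's own code: the `manifest_csv` helper and the sample row are hypothetical, and the output goes to an in-memory buffer rather than to `manifest.csv`.

```python
import csv
import io

# Column names mirror the manifest fields listed above.
FIELDS = ["url", "status", "blocks", "snippet"]

def manifest_csv(rows):
    """Serialize manifest rows to CSV text, quoting every field."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS,
                            quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical row, for illustration only.
rows = [{"url": "https://example.com/about", "status": "ok",
         "blocks": 12, "snippet": "About us"}]
text = manifest_csv(rows)
print(text)
```

Quoting every field keeps snippets that contain commas or quotation marks from breaking the column layout.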

👉 Purpose: This code prepares clean, structured text files from multiple websites so you can later analyze or process them (for example, extract “mission” or “vision” statements).
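
The robots.txt check from the list above can be tried offline with the standard library's `urllib.robotparser`. The sketch below parses an in-memory robots.txt instead of downloading one. The rules, the `demo-bot` user agent, and the example.com URLs are made up for illustration.

```python
from urllib import robotparser

def allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)          # parse rules from a list of lines (no network)
    return rp.can_fetch(user_agent, url)

# Hypothetical robots.txt content.
ROBOTS_TXT = [
    "User-agent: *",
    "Disallow: /private/",
]

ok_public = allowed(ROBOTS_TXT, "demo-bot", "https://example.com/about-us")
ok_private = allowed(ROBOTS_TXT, "demo-bot", "https://example.com/private/report")
print(ok_public, ok_private)
```

In the notebook itself the parser is pointed at the live `/robots.txt` with `set_url()` and `read()`, and any exception during that step is treated as "allowed".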

In [1]:
# @title 🔎 Discovery Only (Resilient++) - expand hub pages to find real .htm targets
!pip -q install duckduckgo_search==6.3.5 tldextract==5.1.2 beautifulsoup4==4.12.3 lxml==5.3.0 pandas==2.2.2

import re, time, csv, random
import pandas as pd
from urllib.parse import urlparse, urljoin, urlsplit, urlunsplit
from urllib import robotparser
import requests
from duckduckgo_search import DDGS
import tldextract
from bs4 import BeautifulSoup

# ========= ✏️ EDIT THESE =========
COMPANY_NAME = "ING Group"
OVERRIDE_HOMEPAGE = "https://www.ing.com"
MAX_CANDIDATES = 60
TOP_N_LIVE = 17
EXPAND_FROM_HUBS = True          # fetch hub pages (/about-us, /sustainability, etc.) and mine links
MAX_HUBS_TO_EXPAND = 5           # safety cap
MAX_LINKS_PER_HUB = 120          # parse limit per hub page
# ================================

UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/124.0 Safari/537.36 (+discovery-only)")
TIMEOUT = 20

KEYWORDS_URL = [
    "mission","vision","purpose","values","about","who-we-are","our-company",
    "strategy","culture","sustainability","corporate-governance","what-we-stand-for",
    "purpose-and-values","purpose-values","at-a-glance","principles","code-of-conduct"
]
CANDIDATE_PATHS = [
    "/", "/about", "/about-us", "/company", "/who-we-are", "/our-company",
    "/values", "/purpose", "/mission", "/vision", "/about/mission", "/about/vision",
    "/sustainability", "/culture", "/our-values", "/strategy", "/purpose-and-values"
]

def is_probably_official(domain: str, company_name: str) -> bool:
    bad = {"linkedin.com","facebook.com","instagram.com","x.com","twitter.com",
           "wikipedia.org","crunchbase.com","glassdoor.com","bloomberg.com",
           "reuters.com","yahoo.com","google.com","news.google.com","youtube.com"}
    ext = tldextract.extract(domain)
    root = f"{ext.domain}.{ext.suffix}" if ext.suffix else ext.domain
    if root in bad: return False
    tokens = re.findall(r"[a-z0-9]+", company_name.lower())
    hits = sum(tok in ext.domain.lower() for tok in tokens if len(tok) >= 3)
    return hits >= 1

def normalize_home(url: str) -> str:
    url = url.strip().rstrip("/")
    if not url.startswith("http"):
        url = "https://" + url
    return url

def guess_homepages(company_name: str):
    brand = re.sub(r"[^a-z0-9]+", "", company_name.lower())
    if brand in ("inggroup","ing","ingnv"):
        return ["https://www.ing.com","https://www.ing.nl","https://www.ing.com.tr"]
    bases = [
        "{b}.com","{b}group.com","{b}-group.com","{b}corp.com","{b}corporate.com",
        "{b}.co","{b}.io","{b}.net"
    ]
    return [f"https://{t.format(b=brand)}" for t in bases]

def ddg_with_backoff(query, max_results=10, attempts=3):
    delay = 1.5
    for _ in range(attempts):
        try:
            with DDGS(timeout=TIMEOUT) as ddgs:
                results = list(ddgs.text(query, max_results=max_results, region="wt-wt", safesearch="off"))
                if results:
                    return results
        except Exception:
            time.sleep(delay + random.random()); delay *= 2
    return []

def find_official_homepage(company_name: str) -> str|None:
    if OVERRIDE_HOMEPAGE.strip():
        return normalize_home(OVERRIDE_HOMEPAGE)
    for g in guess_homepages(company_name):
        try:
            r = requests.head(g, timeout=8, allow_redirects=True)
            if r.status_code < 400 and is_probably_official(r.url, company_name):
                return normalize_home(r.url)
        except requests.RequestException:
            pass  # this guess did not resolve; try the next one
    results = ddg_with_backoff(f"{company_name} official site", max_results=10, attempts=3)
    for r in results:
        url = r.get("href") or r.get("url")
        if not url: continue
        p = urlparse(url)
        if p.scheme.startswith("http"):
            base = f"{p.scheme}://{p.netloc}"
            if is_probably_official(base, company_name):
                return base.rstrip("/")
    return None

def allowed_by_robots(url: str) -> bool:
    try:
        p = urlparse(url); base = f"{p.scheme}://{p.netloc}"
        rp = robotparser.RobotFileParser()
        rp.set_url(urljoin(base, "/robots.txt")); rp.read()
        return rp.can_fetch(UA, url)
    except Exception:
        return True

def fetch(url: str):
    return requests.get(url, headers={"User-Agent": UA, "Accept-Language": "en"},
                        timeout=TIMEOUT, allow_redirects=True)

# ---------- Hygiene helpers ----------
def cleanup_url(u: str) -> str:
    """Normalize scheme & drop query/fragment, preserving the path exactly (incl. .htm)."""
    if not u: return u
    u = u.strip()
    if not u: return u
    if not u.startswith(("http://","https://")):
        u = "https://" + u
    p = urlsplit(u)
    return urlunsplit((p.scheme, p.netloc, p.path, "", ""))

def ensure_htm(url: str) -> str:
    """
    ING pages under /About-us/ often end with .htm. Only auto-add when:
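    - host is ing.com or one of its subdomains
    - path (case-insensitive) contains '/about-us/'
    - last segment looks like a file (has letters/dashes) and has no '.'
    """
    try:
        p = urlparse(url)
        host = (p.hostname or "").lower()
        path_lower = p.path.lower()
        # Exact-domain check: a bare endswith("ing.com") would also match hosts like boeing.com
        if (host == "ing.com" or host.endswith(".ing.com")) and "/about-us/" in path_lower:
            last = p.path.rstrip("/").split("/")[-1]
            if last and ("." not in last) and re.search(r"[a-zA-Z\-]", last):
                # drop any trailing slash so the result is ".../page.htm", not ".../page/.htm"
                return url.rstrip("/") + ".htm"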
    except Exception:
        pass
    return url

def preflight(url: str) -> tuple[bool,str,int]:
    """Validate a URL is fetchable. Try HEAD (redirects ok), fall back to GET on 403/405."""
    try:
        r = requests.head(url, headers={"User-Agent": UA}, allow_redirects=True, timeout=12)
        if 200 <= r.status_code < 300:
            return True, r.url, r.status_code
        if r.status_code in (403, 405):
            rg = requests.get(url, headers={"User-Agent": UA}, allow_redirects=True, timeout=12)
            return (200 <= rg.status_code < 300), rg.url, rg.status_code
        return False, getattr(r, "url", url), r.status_code
    except Exception:
        try:
            rg = requests.get(url, headers={"User-Agent": UA}, allow_redirects=True, timeout=12)
            return (200 <= rg.status_code < 300), rg.url, rg.status_code
        except Exception:
            return False, url, 0

# ---------- Discovery ----------
def discover_candidates(base: str, max_candidates=60):
    seen, cands = set(), []

    def add(u, reason, boost=0.0):
        u = u.split("#")[0].rstrip("/")
        if urlparse(u).netloc != urlparse(base).netloc: return
        if u in seen: return
        seen.add(u)
        cands.append({"url": u, "reason": reason, "boost": boost})

    # 1) Common paths (hubs)
    for p in CANDIDATE_PATHS:
        add(urljoin(base, p), "seed", 0.2 if p != "/" else 0.0)

    # 2) Sitemaps (index + children)
    for sm in ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml"]:
        sm_url = urljoin(base, sm)
        if not allowed_by_robots(sm_url): continue
        try:
            r = fetch(sm_url)
            if not (r.ok and "xml" in (r.headers.get("content-type",""))): continue
            soup = BeautifulSoup(r.text, "xml")
            locs = [loc.get_text(strip=True) for loc in soup.find_all("loc")]
            # follow a few child sitemaps
            child_maps = [u for u in locs if u.endswith(".xml")]
            for cm in child_maps[:10]:
                try:
                    rr = fetch(cm)
                    if rr.ok and "xml" in (rr.headers.get("content-type","")):
                        s2 = BeautifulSoup(rr.text, "xml")
                        locs.extend([loc.get_text(strip=True) for loc in s2.find_all("loc")])
                except: pass
            for u in locs:
                if any(k in u.lower() for k in KEYWORDS_URL):
                    add(u, "sitemap", 0.9)
        except:
            pass

    # 3) DuckDuckGo site search (best-effort)
    try:
        with DDGS(timeout=TIMEOUT) as ddgs:
            query = f"site:{urlparse(base).netloc} " + " ".join(KEYWORDS_URL[:6])
            for r in ddgs.text(query, max_results=25, region="wt-wt", safesearch="off"):
                u = r.get("href") or r.get("url")
                if u: add(u, "site_search", 1.0)
    except:
        pass

    return cands[:max_candidates]

# ---------- Expand hubs by parsing page links ----------
def is_hub_path(path: str) -> bool:
    pl = path.lower()
    return any(pl == x or pl.startswith(x + "/") for x in ["/about-us", "/sustainability", "/about", "/company", "/who-we-are", "/our-company"])

def extract_samehost_links(base_url: str, html: str, limit=120):
    base_host = urlparse(base_url).netloc
    soup = BeautifulSoup(html, "lxml")
    out = []
    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        if not href or href.startswith("#"):
            continue
        absu = urljoin(base_url, href)
        pu = urlparse(absu)
        if pu.netloc != base_host:
            continue
        # keep keywordy links
        low = absu.lower()
        if any(k in low for k in KEYWORDS_URL):
            out.append(absu.split("#")[0])
        if len(out) >= limit:
            break
    # de-dupe preserving order
    seen, ded = set(), []
    for u in out:
        if u not in seen:
            seen.add(u); ded.append(u)
    return ded

def score(url: str, reason: str, boost: float) -> float:
    path = urlparse(url).path
    low = path.lower()
    s = 0.0
    for k in KEYWORDS_URL:
        if k in low: s += 1.25
    s += boost
    # prefer .htm content pages
    if low.endswith(".htm"): s += 0.8
    # shorter-ish paths preferred a bit (but not too strong)
    s += max(0, 2.0 - 0.12*len(path))
    # shallow depth slight bonus
    if path.count("/") <= 2: s += 0.3
    # explicit sections bonus
    if any(x in low for x in ["/about","/purpose","/values","/mission","/vision","/strategy","/sustainability"]): s += 0.5
    if reason == "site_search": s += 0.3
    if reason == "page_links": s += 0.6   # links mined from relevant hubs are often good
    return round(s, 3)

# ---------- Run (discovery-only with expansion & validation) ----------
home = find_official_homepage(COMPANY_NAME)
if not home:
    raise SystemExit("Could not determine official homepage. Set OVERRIDE_HOMEPAGE explicitly (e.g., 'https://www.ing.com').")

print("🏠 Homepage:", home)
raw = discover_candidates(home, max_candidates=MAX_CANDIDATES)

# Expand a few hub pages to gather .htm content links
if EXPAND_FROM_HUBS:
    hub_candidates = [c for c in raw if is_hub_path(urlparse(c["url"]).path)]
    hub_candidates = hub_candidates[:MAX_HUBS_TO_EXPAND]
    expanded = []
    for hc in hub_candidates:
        hub_url = cleanup_url(hc["url"])
        if not allowed_by_robots(hub_url):
            continue
        try:
            r = fetch(hub_url)
            if r.ok and "html" in (r.headers.get("content-type","")):
                links = extract_samehost_links(r.url, r.text, limit=MAX_LINKS_PER_HUB)
                for u in links:
                    expanded.append({"url": u, "reason": "page_links", "boost": 1.0})
        except:
            pass
    raw.extend(expanded)

# score + normalize + validate
seen_urls = set()
rows = []
for c in raw:
    u0 = c["url"]
    # normalize & fix potential missing .htm only when appropriate
    u1 = ensure_htm(cleanup_url(u0))
    # de-dupe by normalized URL
    if u1 in seen_urls:
        continue
    seen_urls.add(u1)
    ok, final_u, status = preflight(u1)
    rows.append({
        "url": u0,
        "normalized_url": u1,
        "final_url": final_u,
        "reason": c["reason"],
        "score": score(u1, c["reason"], c["boost"]),
        "status_code": status,
        "is_live": bool(ok)
    })

# Sort: live first, then score
df = pd.DataFrame(rows)
df.sort_values(by=["is_live","score"], ascending=[False, False], inplace=True)
df.to_csv("candidates.csv", index=False, quoting=csv.QUOTE_ALL)

# Export top-N live URLs (canonical) for scraping later
top_live = df[df["is_live"]].head(TOP_N_LIVE)["final_url"].tolist()
with open("top17_live.txt", "w", encoding="utf-8") as f:
    for u in top_live:
        f.write(u + "\n")

print(f"✅ Discovery complete. {len(df)} candidates saved to candidates.csv")
print(f"🟢 Live URLs exported to top17_live.txt ({len(top_live)} URLs)\n")

print("Top 15 live candidates:")
preview = df[df["is_live"]].head(15)[["score","reason","final_url","status_code"]]
if preview.empty:
    print("(No live URLs found among candidates; try raising MAX_HUBS_TO_EXPAND or widening KEYWORDS_URL.)")
else:
    print(preview.to_string(index=False))


🏠 Homep

2️⃣ **Batch + Scoring Version**

This second cell extends the scraper with a scoring and labeling system.
It doesn’t just download and clean pages; it also identifies and ranks the ones most relevant to topics like Mission, Vision, Values, Strategy, or Sustainability.

What it adds:

*  **A keyword-based scoring system:**

    *  Each keyword (like “mission”, “vision”, “strategy”) has a weight that defines its importance.
    *  The program looks for these words in the page’s URL, title, and first H1 and H2 headings; matches in the URL and title count double.
    *  The more matches a page has, and the heavier their weights, the higher its score.

*  **A labeling system:**

    * Each page is automatically tagged with a single label (e.g., “Mission”, “Vision”, “Strategy”), taken from the first label rule whose keywords appear in its URL, title, or headings.

*  **Ranking and grouping:**

    *   It saves all pages and their scores in best_sites_raw.csv.
    *   It then selects the top pages per company (for example, the 5 most relevant pages) and saves them in best_sites_top.csv.
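
To make the weighting concrete, here is a small worked example of the scoring rule: keyword weights are summed per field by substring match, with the URL and title counted twice and the first H1 and H2 counted once. The weights are a subset of the notebook's `KEYWORD_WEIGHTS`; the page title and heading are assumed values, not scraped output.

```python
# Subset of the notebook's KEYWORD_WEIGHTS, for illustration.
WEIGHTS = {"mission": 10, "vision": 10, "purpose": 9, "values": 9,
           "strategy": 8, "about": 7, "purpose-and-values": 9}

def score_blob(text: str) -> int:
    """Sum the weight of every keyword that occurs as a substring of text."""
    low = (text or "").lower()
    return sum(w for kw, w in WEIGHTS.items() if kw in low)

def page_score(url: str, title: str, h1: str, h2: str) -> int:
    """URL and title count double; the first H1 and H2 count once."""
    return (2 * score_blob(url) + 2 * score_blob(title)
            + score_blob(h1) + score_blob(h2))

url = "https://www.ing.com/About-us/Purpose-and-values.htm"
title = "Purpose and values"   # assumed title
h1 = "Purpose and values"      # assumed first heading

total = page_score(url, title, h1, "")
print(score_blob(url), score_blob(title), total)
```

Because matching is by substring, the URL hits `purpose`, `values`, `about`, and `purpose-and-values` (34 points), while the title and heading hit only `purpose` and `values` (18 points each), giving 2·34 + 2·18 + 18 = 122.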


👉 Purpose: This version not only scrapes pages but also analyzes and ranks them to help you quickly find the best pages that contain a company’s mission, vision, or sustainability statements.
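
The labeling and top-pages selection can be sketched with the standard library alone. The notebook does this with pandas (`sort_values` followed by `groupby(...).head(...)`); the version below swaps in a plain sort and a per-company counter. The rule lists are shortened, and the company, URLs, and scores are hypothetical.

```python
import re
from collections import defaultdict

# Shortened label rules: the first matching pattern wins.
LABEL_RULES = [
    ("mission", "Mission"),
    ("vision", "Vision"),
    ("purpose", "Purpose"),
    ("values", "Values"),
    ("strategy", "Strategy"),
    ("about|company", "About/Company"),
]
LABEL_PRIORITY = ["Mission", "Vision", "Values", "Purpose",
                  "Strategy", "About/Company", "General"]
LABEL_RANK = {lab: i for i, lab in enumerate(LABEL_PRIORITY)}

def label_for(text: str) -> str:
    """Return the label of the first rule whose pattern occurs in text."""
    low = text.lower()
    for pattern, label in LABEL_RULES:
        if re.search(pattern, low):
            return label
    return "General"

def top_per_company(pages, k):
    """Sort by company, label priority, then score (descending); keep k per company."""
    ordered = sorted(
        pages,
        key=lambda p: (p["company"],
                       LABEL_RANK.get(p["label"], len(LABEL_RANK)),
                       -p["score"]),
    )
    kept, counts = [], defaultdict(int)
    for p in ordered:
        if counts[p["company"]] < k:
            kept.append(p)
            counts[p["company"]] += 1
    return kept

# Hypothetical pages and scores.
pages = [
    {"company": "Acme", "url": "https://example.com/about", "score": 40},
    {"company": "Acme", "url": "https://example.com/strategy", "score": 55},
    {"company": "Acme", "url": "https://example.com/our-mission", "score": 30},
]
for p in pages:
    p["label"] = label_for(p["url"])

best = top_per_company(pages, k=2)
print([(p["label"], p["score"]) for p in best])
```

Note that label priority is applied before the score, so the lower-scoring "Mission" page is kept ahead of the higher-scoring "Strategy" page.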

In [4]:
# @title 🗂️ Batch + Scoring: scrape up to 17 pages, save .md, and rank best pages
# @markdown Provide URLs directly in URLS or point to a candidates.csv; outputs out_md/, manifest.csv, best_sites_raw.csv, best_sites_top.csv

import os, re, csv, time, hashlib
import pandas as pd
from urllib.parse import urlparse, urljoin, urlunparse
import requests
from bs4 import BeautifulSoup
from urllib import robotparser

# ========= Choose input source =========
# Option A: paste your chosen URLs here (comment out Option B if you use this):
URLS = [
    # "https://www.ing.com/About-us/Purpose-and-values.htm",
    # "https://www.ing.com/About-us/Strategy.htm",
]

# Option B: read from candidates.csv (produced by discovery step)
READ_FROM_CSV = True            # set False if using Option A above
CANDIDATES_CSV = "candidates.csv"
CSV_URL_COLUMN = "url"          # change if your column is differently named
CSV_COMPANY_COLUMN = "company"  # optional; if missing, "(unknown)" is used
TOP_N = 17

# ========= Settings =========
DELAY_BETWEEN = 1.0             # polite delay between requests (seconds)
RETRY_ATTEMPTS = 3              # per-URL fetch attempts
BACKOFF_SECONDS = 1.5           # exponential backoff base
OUT_DIR = "out_md"
MANIFEST = "manifest.csv"
REQUEST_TIMEOUT = 15
TOP_PER_COMPANY = 5             # how many "best" URLs to keep per company (for best_sites_top.csv)

UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# ========= Scoring model =========
KEYWORD_WEIGHTS = {
    "mission": 10, "vision": 10, "purpose": 9, "values": 9,
    "strategy": 8, "sustainability": 10, "esg": 9, "csr": 8,
    "about": 7, "who-we-are": 7, "our-company": 7, "company": 5,
    "annual": 9, "report": 8, "integrated": 7, "governance": 6,
    "code-of-conduct": 6, "principles": 6, "culture": 6, "impact": 6,
    "responsibility": 6, "investor": 6, "purpose-and-values": 9,
}

LABEL_RULES = [
    ("mission", "Mission"),
    ("vision", "Vision"),
    ("purpose", "Purpose"),
    ("values", "Values"),
    ("strategy", "Strategy"),
    ("sustainability|esg|csr|responsibility|impact", "Sustainability"),
    (r"\bannual\b|\breport\b|\bintegrated\b|\binvestor\b", "Annual/Report"),
    ("about|who-we-are|our-story|our-company|company", "About/Company"),
    ("governance|code-of-conduct|principles", "Governance"),
]
LABEL_PRIORITY = [
    "Mission","Vision","Values","Purpose","Strategy",
    "Sustainability","Annual/Report","About/Company","Governance","General"
]
LABEL_RANK = {lab: i for i, lab in enumerate(LABEL_PRIORITY)}

# ========= Helpers =========

# robots.txt cache
class RobotsCache:
    def __init__(self):
        self._cache = {}
    def allowed(self, user_agent: str, url: str) -> bool:
        netloc = urlparse(url).netloc
        if not netloc:
            return False
        rp = self._cache.get(netloc)
        if rp is None:
            rp = robotparser.RobotFileParser()
            robots_url = f"https://{netloc}/robots.txt"
            try:
                rp.set_url(robots_url)
                rp.read()
            except Exception:
                rp = None
            self._cache[netloc] = rp
        if self._cache[netloc] is None:
            return True
        try:
            return self._cache[netloc].can_fetch(user_agent, url)
        except Exception:
            return True

ROBOTS = RobotsCache()
def allowed_by_robots(url: str) -> bool:
    return ROBOTS.allowed(UA, url)

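def normalize_url(url: str) -> str:
    """Normalize URL (drop fragments, ensure scheme)."""
    if not url:
        return url
    url = url.strip()
    # Add the scheme before parsing; otherwise urlparse() treats the host as part of the path.
    if not re.match(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://", url):
        url = "https://" + url.lstrip("/")
    p = urlparse(url)
    return urlunparse((p.scheme, p.netloc, p.path or "/", p.params, p.query, ""))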

def fetch(url: str):
    """HTTP GET with UA, timeout, and redirects allowed."""
    return requests.get(url, headers={"User-Agent": UA}, timeout=REQUEST_TIMEOUT, allow_redirects=True)

def _slug_from_url(url: str) -> str:
    """Stable, readable filename slug from URL."""
    p = urlparse(url)
    last = (p.path.rstrip("/").split("/")[-1] or "home").lower()
    last = re.sub(r"[^a-z0-9\-]+", "-", last).strip("-") or "page"
    h = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{last}-{h}"

def _retry_fetch(url: str, attempts=RETRY_ATTEMPTS, backoff=BACKOFF_SECONDS):
    """GET with exponential backoff; raises the last error once all attempts fail."""
    last_exc = None
    for i in range(attempts):
        try:
            r = fetch(url)
            if 200 <= r.status_code < 400:
                return r
            last_exc = RuntimeError(f"HTTP {r.status_code}")
        except Exception as e:
            last_exc = e
        if i < attempts - 1:  # no need to wait after the final attempt
            time.sleep(backoff * (2 ** i))
    raise last_exc if last_exc else RuntimeError("Unknown fetch error")

def _load_urls():
    if URLS:
        return [{"url": normalize_url(u), "company": "(unknown)"} for u in URLS[:TOP_N]]
    if READ_FROM_CSV and os.path.exists(CANDIDATES_CSV):
        df = pd.read_csv(CANDIDATES_CSV)
        url_col = CSV_URL_COLUMN if CSV_URL_COLUMN in df.columns else df.columns[0]
        comp_col = CSV_COMPANY_COLUMN if CSV_COMPANY_COLUMN in df.columns else None
        rows = []
        seen = set()
        for _, r in df.iterrows():
            u = normalize_url(str(r[url_col]).strip())
            if not u or u in seen:
                continue
            seen.add(u)
            company = str(r[comp_col]).strip() if comp_col else "(unknown)"
            rows.append({"url": u, "company": company})
            if len(rows) >= TOP_N:
                break
        return rows
    raise SystemExit("No URLs provided. Fill URLS list or supply candidates.csv.")

# ---- Content cleaning / block extraction ----
def strip_noncontent(soup: BeautifulSoup) -> None:
    """In-place removal of boilerplate to keep meaningful text."""
    for sel in [
        "nav", "header", "footer",
        "[role='banner']", "[role='navigation']", "[role='contentinfo']",
        ".cookie", "#cookie", "[id*='cookie']", "[class*='cookie']",
        ".consent", "[id*='consent']",
        ".newsletter", ".subscribe", ".social", ".share",
        "script", "style", "noscript", "svg"
    ]:
        for el in soup.select(sel):
            el.decompose()

def _clean_text(t: str) -> str:
    return re.sub(r"\s+", " ", t or "").strip()

def collect_blocks(soup: BeautifulSoup):
    """Collect headings/paragraphs in reading order."""
    blocks = []
    main = soup.select_one("main") or soup.body or soup
    for el in main.find_all(["h1","h2","h3","p"], recursive=True):
        txt = _clean_text(el.get_text(" ", strip=True))
        if not txt:
            continue
        blocks.append({"type": el.name.lower(), "text": txt})
    if not blocks:
        for el in soup.find_all(["article","p"]):
            txt = _clean_text(el.get_text(" ", strip=True))
            if txt:
                blocks.append({"type": el.name.lower(), "text": txt})
    return blocks

def to_md(blocks):
    lines = []
    for b in blocks:
        t, x = b["type"], b["text"]
        if t == "h1": lines.append(f"# {x}")
        elif t == "h2": lines.append(f"## {x}")
        elif t == "h3": lines.append(f"### {x}")
        else: lines.append(x)
        lines.append("")
    return "\n".join(lines).strip() + "\n"

# ========= Scoring helpers =========
def _score_blob(text: str) -> int:
    text = (text or "").lower()
    return sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in text)

def _label_for(url: str, title: str, h1: str, h2: str) -> str:
    blob = " ".join(x for x in [url, title, h1, h2] if x).lower()
    for pat, lab in LABEL_RULES:
        if re.search(pat, blob):
            return lab
    return "General"

def score_page(url: str, html: str):
    """Return dict with url/title/h1/h2/label/score."""
    soup = BeautifulSoup(html, "lxml")
    strip_noncontent(soup)
    title = soup.title.get_text(" ", strip=True) if soup.title else ""
    h1_el = soup.select_one("h1")
    h2_el = soup.select_one("h2")
    h1 = h1_el.get_text(" ", strip=True) if h1_el else ""
    h2 = h2_el.get_text(" ", strip=True) if h2_el else ""

    score = 0
    score += 2 * _score_blob(url)
    score += 2 * _score_blob(title)
    score += _score_blob(h1)
    score += _score_blob(h2)

    return {
        "url": url,
        "title": title,
        "h1": h1,
        "h2": h2,
        "label": _label_for(url, title, h1, h2),
        "score": score,
    }

# ---- run ----
os.makedirs(OUT_DIR, exist_ok=True)

targets = _load_urls()
print(f"Processing {len(targets)} URL(s). Output → {OUT_DIR}/ and {MANIFEST}")

rows_manifest = []
rows_scored = []

for idx, item in enumerate(targets, 1):
    url = item["url"]
    company = item["company"]
    print(f"\n[{idx}/{len(targets)}] {company} - {url}")
    try:
        if not allowed_by_robots(url):
            print("⛔ Blocked by robots.txt")
            rows_manifest.append({"company": company, "url": url, "status": "robots_blocked", "md_file": "", "blocks": 0, "snippet": ""})
            continue

        resp = _retry_fetch(url)
        ctype = resp.headers.get("content-type","").lower()
        if "html" not in ctype:
            print(f"ℹ️ Skipping non-HTML content: {ctype}")
            rows_manifest.append({"company": company, "url": url, "status": "non_html", "md_file": "", "blocks": 0, "snippet": ""})
            time.sleep(DELAY_BETWEEN)
            continue

        # score
        scored = score_page(resp.url, resp.text)
        scored["company"] = company
        rows_scored.append(scored)

        # collect blocks + write .md (same as before)
        soup = BeautifulSoup(resp.text, "lxml")
        strip_noncontent(soup)
        blocks = collect_blocks(soup)

        if not blocks:
            print("ℹ️ No meaningful blocks found.")
            rows_manifest.append({"company": company, "url": resp.url, "status": "no_blocks", "md_file": "", "blocks": 0, "snippet": ""})
            time.sleep(DELAY_BETWEEN)
            continue

        md = to_md(blocks)
        fname = f"{OUT_DIR}/content_{_slug_from_url(resp.url)}.md"
        with open(fname, "w", encoding="utf-8") as f:
            f.write(md)

        first = blocks[0].get("text", "")
        snippet = first[:220] + ("…" if len(first) > 220 else "")  # ellipsis only when truncated
        rows_manifest.append({"company": company, "url": resp.url, "status": "ok", "md_file": fname, "blocks": len(blocks), "snippet": snippet})
        print(f"✅ Saved: {fname} (blocks={len(blocks)})")

    except Exception as e:
        print(f"❌ Error: {e}")
        rows_manifest.append({"company": company, "url": url, "status": f"error: {type(e).__name__}", "md_file": "", "blocks": 0, "snippet": ""})

    time.sleep(DELAY_BETWEEN)

# Write manifest
pd.DataFrame(rows_manifest, columns=["company","url","status","md_file","blocks","snippet"]).to_csv(MANIFEST, index=False, quoting=csv.QUOTE_ALL)
print(f"\n📄 Manifest written: {MANIFEST}")

# Write scored raw
if rows_scored:
    raw_df = pd.DataFrame(rows_scored).drop_duplicates(["company","url"])
    raw_df.to_csv("best_sites_raw.csv", index=False)
    print("✓ Saved best_sites_raw.csv")

    # Rank: label priority first, then score (desc)
    raw_df["label_rank"] = raw_df["label"].map(LABEL_RANK).fillna(len(LABEL_RANK))
    raw_df = raw_df.sort_values(["company","label_rank","score"], ascending=[True, True, False])

    # Grouped top-K (if no company column, all items will be under '(unknown)')
    best = raw_df.groupby("company").head(TOP_PER_COMPANY)
    best.to_csv("best_sites_top.csv", index=False)
    print(f"✓ Saved best_sites_top.csv (top {TOP_PER_COMPANY} per company)")
else:
    print("No scored pages; best_sites_* files not written.")


Processing 17 URL(s). Output → out_md/ and manifest.csv

[1/17] (unknown) - https://www.ing.com/About-us/Purpose-and-values.htm
✅ Saved: out_md/content_purpose-and-values-htm-b75a9602.md (blocks=19)

[2/17] (unknown) - https://www.ing.com/About-us/ING-at-a-glance.htm
✅ Saved: out_md/content_ing-at-a-glance-htm-6fba7b37.md (blocks=13)

[3/17] (unknown) - https://www.ing.com/About-us/Strategy.htm
✅ Saved: out_md/content_strategy-htm-1713d8e3.md (blocks=8)

[4/17] (unknown) - https://www.ing.com/About-us/Corporate-governance.htm
✅ Saved: out_md/content_corporate-governance-htm-61ca237d.md (blocks=7)

[5/17] (unknown) - https://www.ing.com/About-us/Corporate-governance/Legal-structure-and-regulators.htm
✅ Saved: out_md/content_legal-structure-and-regulators-htm-2191c1b7.md (blocks=7)

[6/17] (unknown) - https://www.ing.com/About-us/Corporate-governance/Shareholder-influence.htm
✅ Saved: out_md/content_shareholder-influence-htm-95dd52b7.md (blocks=10)

[7/17] (unkn

In [None]:
#@title Convert ipynb to HTML in Colab
# Upload ipynb
from google.colab import files
f = files.upload()

# Convert ipynb to html
import subprocess
file0 = list(f.keys())[0]
_ = subprocess.run(["pip", "install", "nbconvert"])
_ = subprocess.run(["jupyter", "nbconvert", file0, "--to", "html"])

# download the html
files.download(file0[:-5]+"html")

Saving WebScrappingWorking_2.ipynb to WebScrappingWorking_2.ipynb

