Goal:
- Identify which sites offer subscription options
  - Retrieve their historical pricing pages from Jan 1, 2021 through Feb 1, 2026

Task:
- Find sites that have a subscription
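
The date window in the goal could be expressed as a query against the Wayback Machine's CDX API, which lists archived captures of a URL. The sketch below only builds the query URL and makes no network call; the helper name `build_cdx_query` and the choice of one capture per month are assumptions for illustration, not something decided in these notes.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(page_url, start="20210101", end="20260201"):
    """Build a CDX API URL listing captures of page_url in the date window."""
    params = {
        "url": page_url,
        "from": start,               # lower bound, YYYYMMDD
        "to": end,                   # upper bound, YYYYMMDD
        "output": "json",
        "filter": "statuscode:200",  # skip redirect and error captures
        "collapse": "timestamp:6",   # assumption: one capture per YYYYMM is enough
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

query = build_cdx_query("365scores.com/subscribe")
print(query)
```

Fetching `query` with `requests.get(...).json()` should return a header row followed by one row per capture; that step is left out here so the sketch stays offline.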


In [1]:
import pandas as pd

**Determine Whether the Site Offers a Subscription**

In [20]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

COMMON_PATHS = [
    "/subscribe",
    "/subscriptions",
    "/membership",
    "/join",
    "/pricing"
]

KEYWORDS = [
    "subscribe",
    "subscription",
    "member",
    "membership",
    "paywall"
]

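def check_domain(domain, timeout=10):
    """Classify a domain as 'subscription', 'no subscription', or 'inaccessible'.

    Returns a (status, evidence_url) tuple.
    """
    base_url = f"https://{domain}"

    try:
        r = requests.get(base_url, timeout=timeout)
        if r.status_code >= 400:
            return "inaccessible", ""

        # Lowercase once so the keyword match is case-insensitive
        soup = BeautifulSoup(r.text.lower(), "html.parser")
        text = soup.get_text(" ")

        # 1) Keyword evidence on the homepage (plain substring match)
        if any(kw in text for kw in KEYWORDS):
            return "subscription", base_url

        # 2) Otherwise probe common paths; any non-error response counts as a hit
        for path in COMMON_PATHS:
            try:
                url = base_url + path
                r2 = requests.get(url, timeout=timeout)
                if r2.status_code < 400:
                    return "subscription", url
            except requests.RequestException:
                # One unreachable path should not fail the whole domain
                continue

        return "no subscription", base_url

    except requests.RequestException:
        return "inaccessible", ""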


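# `df` is assumed to be loaded already and to have a "domain" column
results = df["domain"].apply(
    lambda d: pd.Series(check_domain(d), index=["subscription_status", "evidence_url"])
)

# Attach the two new columns and save the result
df = pd.concat([df, results], axis=1)
df.to_csv("domains_with_subscription_status.csv", index=False)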


Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(r.text.lower(), "html.parser")
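
The warning above appears when a response that looks like XML (a feed or sitemap, for example) is handed to the HTML parser. One way to handle it, sketched here under assumptions, is to choose the parser per response from the `Content-Type` header and the first bytes of the body. `pick_parser` is a hypothetical helper, and the `"xml"` feature needs `lxml` installed, as the warning itself notes.

```python
def pick_parser(content_type, body):
    """Return "xml" for XML-looking responses, otherwise "html.parser"."""
    ctype = (content_type or "").split(";")[0].strip().lower()

    # XHTML is still a web page, so test it before the generic "+xml" rule
    if ctype in ("text/html", "application/xhtml+xml"):
        return "html.parser"
    if ctype in ("text/xml", "application/xml") or ctype.endswith("+xml"):
        return "xml"

    # No usable header: fall back to sniffing for an XML declaration
    head = body.lstrip()[:1000].lower()
    if head.startswith("<?xml") and "<html" not in head:
        return "xml"
    return "html.parser"

# Intended use inside check_domain (not executed here):
#   parser = pick_parser(r.headers.get("Content-Type"), r.text)
#   soup = BeautifulSoup(r.text.lower(), parser)
print(pick_parser("application/rss+xml; charset=utf-8", "<rss/>"))
```

Since the homepage check only needs visible text, an XML response may also be a sign that the domain is serving a feed rather than a page, which might be worth recording separately.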


**Identify the Correct Subscription Pricing Page**

- Some "subscribe" pages may only offer an email newsletter sign-up rather than a paid subscription
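
A rough way to separate the two cases is to look for a price next to billing language, and to treat newsletter wording without any price as an email sign-up. The sketch below is a heuristic only: the term lists, the regular expression, and the name `classify_subscribe_page` are all assumptions and would need tuning against real pages.

```python
import re

# Currency symbol followed by an amount, or an amount followed by a currency code
PRICE_RE = re.compile(
    r"[$€£]\s?\d+(?:[.,]\d{2})?|\d+(?:[.,]\d{2})?\s?(?:usd|eur|gbp)\b",
    re.IGNORECASE,
)
BILLING_TERMS = ("per month", "/month", "per year", "/year", "monthly", "annual")
NEWSLETTER_TERMS = ("newsletter", "email address", "inbox")

def classify_subscribe_page(text):
    """Return 'paid', 'newsletter', or 'unclear' for the visible text of a page."""
    text = text.lower()
    has_price = bool(PRICE_RE.search(text))
    has_billing = any(term in text for term in BILLING_TERMS)
    has_newsletter = any(term in text for term in NEWSLETTER_TERMS)

    if has_price and has_billing:
        return "paid"
    if has_newsletter:
        return "newsletter"
    return "unclear"

print(classify_subscribe_page("Subscribe for $9.99 per month. Cancel anytime."))
```

Pages that render prices with JavaScript would fall into "unclear" here, so that bucket is a prompt for manual review rather than a negative result.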

In [32]:
df = pd.read_csv('./data/domains_with_pricing_page.csv')

In [36]:
has_sub = df[df['subscription_status']=='subscription']

In [37]:
has_sub.head()

Unnamed: 0,topics,domain_normalized,domain,subscription_status,evidence_url,pricing_url,pricing_url_method,pricing_page_ok,dynamic_components,popup_overlay,detected_prices,wayback_available,wayback_url,wayback_page_ok,notes
14,243,1news co nz,1news.co.nz,subscription,https://1news.co.nz/subscribe,https://www.1news.co.nz/newsletter-signup/,common_path,yes,yes,yes,,yes,http://web.archive.org/web/20250513030819/http...,no,no prices found in static HTML | likely JS-ren...
16,561,1prime ru,1prime.ru,subscription,https://1prime.ru/subscribe,https://1prime.ru/,common_path,yes,no,yes,,yes,http://web.archive.org/web/20260201131401/http...,no,possible modal or overlay UI
33,2452495611,24sedam rs,24sedam.rs,subscription,https://24sedam.rs/subscribe,https://24sedam.rs/,common_path,yes,yes,no,,yes,http://web.archive.org/web/20260128040331/http...,no,no prices found in static HTML | likely JS-ren...
37,243,24tv ua,24tv.ua,subscription,https://24tv.ua/subscriptions,https://24tv.ua/subscriptions,common_path,yes,yes,yes,,yes,http://web.archive.org/web/20251112134022/http...,no,likely JS-rendered content | possible modal or...
45,299243,365scores com,365scores.com,subscription,https://365scores.com/subscribe,https://www.365scores.com/subscribe,common_path,yes,yes,yes,,no,,no,likely JS-rendered content | possible modal or...
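
In the rows above, 1prime.ru and 24sedam.rs pair a `/subscribe` evidence URL with a pricing URL that is just the site root. That would be consistent with the probed path redirecting to the homepage, although the notes do not confirm how `pricing_url` was derived. A small check for that case, assuming the final URL after redirects is available (`requests` exposes it as `Response.url`), might look like this; `redirected_to_homepage` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def redirected_to_homepage(requested_url, final_url):
    """True when a non-root path was requested but the response ended at the site root."""
    requested_path = urlsplit(requested_url).path.strip("/")
    final_path = urlsplit(final_url).path.strip("/")
    return bool(requested_path) and not final_path

# URL pairs taken from the evidence_url / pricing_url columns above
pairs = [
    ("https://1prime.ru/subscribe", "https://1prime.ru/"),
    ("https://24tv.ua/subscriptions", "https://24tv.ua/subscriptions"),
    ("https://365scores.com/subscribe", "https://www.365scores.com/subscribe"),
]
flags = [redirected_to_homepage(requested, final) for requested, final in pairs]
print(flags)
```

A hit from this check suggests the common-path probe matched only because the server answered with the homepage, so the row is a weak signal of a real subscription page.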


In [39]:
# A Series has no .contains method; string methods live on the .str accessor
has_sub['domain'].str.contains('cnn', na=False)
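
Substring matching on a pandas column goes through the `.str` accessor rather than a method on the Series itself. A minimal sketch on a made-up toy frame (the domains here are illustrative, not rows from the dataset):

```python
import pandas as pd

# Toy data standing in for has_sub; None mimics a missing domain
toy = pd.DataFrame({"domain": ["cnn.com", "edition.cnn.com", "bbc.co.uk", None]})

# na=False turns missing values into False so the mask can be used for indexing;
# regex=False treats "cnn" as a literal substring
mask = toy["domain"].str.contains("cnn", na=False, regex=False)
matches = toy[mask]
print(matches)
```

The same pattern applied to the real frame would be `has_sub[has_sub["domain"].str.contains("cnn", na=False)]`.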