<!-- metadata: title -->
# Balancing between Paywalls and Search Engine Optimization

<!-- metadata: subtitle -->
> ### How Media Paywalls Should Work: A Research Case for Nation Media Group

<!-- metadata: keywords, is_array=true -->
**Keywords:**
  - nation-media-group
  - paywalls
  - search-engine-optimization
  - cloudflare

<!-- metadata: categories, is_array=true -->
**Categories:**
  - cyber-security

**Disclaimer:**
<!-- metadata: disclaimer, strip_markdown=false -->
Please note that this is meant for educational purposes only. You will be personally liable for any misuse of the information provided here. We [contacted Nation Media Group](https://www.nationmedia.com/contact/) on Jun 3, 2024, 2:30 PM East African Time, but they did not respond. ^[We [contacted Nation Media Group](https://www.nationmedia.com/contact/) through the following emails: <support@nation.africa>, <sales_inquiries@ke.nationmedia.com>, <newsdesk@ke.nationmedia.com>, <publiceditor@ke.nationmedia.com>, <mailbox@ke.nationmedia.com>, <epaper@ke.nationmedia.com>, <Customercare@ke.nationmedia.com>].

![Nation Media Group logo](nation-media-group.jpg)

## Introduction

To maintain independence in content creation, creators and authors need some form of monetization. For text content, one of the most common forms is advertising. Creators and authors can sell advertising directly or sign up for services such as Google AdSense. Another form of monetization is publishing premium content that is initially available only to paying users. This is especially common among news media outlets because the most recent content is usually the most relevant.

However, for users to discover your content, they need to find it in search engines such as Google, Bing or DuckDuckGo. For best results, it makes sense to let search engines see the whole premium article so that they can suggest it when users search for something similar. This is part of search engine optimization. You might have the best news content, but if no one can find it, then it's probably useless!

Allowing partial access to premium content while still keeping it away from non-paying users is a delicate balance that most media houses have to maintain. Users often find clever ways to circumvent paywalls and see the paid content for free. There are online forum threads dedicated to discovering these vulnerabilities, such as "[Bypassing Daily Nation Paywall](https://www.reddit.com/r/Kenya/comments/s96k01/bypassing_daily_nation_paywall/?rdt=48760)" and "[You can bypass most soft paywalls with a little CSS knowledge](https://www.reddit.com/r/educationalgifs/comments/lk1not/you_can_bypass_most_soft_paywalls_with_a_little/)". The raw ideas shared in these forums are often simple but require some basic programming skills to execute. In that sense, most people would prefer to just pay rather than learn how to execute the ideas.

However, some users also create browser plugins that do the heavy lifting, allowing anyone to view the premium content automatically and without any effort. Browser plugin store listings and code repositories that offer paywall bypasses are usually taken down, such as the famous <https://github.com/iamadamdev/bypass-paywalls-chrome>, but not before the code has found a new home, such as <https://github.com/nikolqyy/bypass-paywalls-chrome/releases/tag/most-recent> ^[https://news.ycombinator.com/item?id=41294166].

As you can imagine, chasing takedowns is very time consuming, because you have to find out which plugins are currently available to bypass your paywall. It is also not instant: it takes a while to execute a DMCA takedown notice. And even after it succeeds, someone who has a cloned repo will re-upload the code or the plugin, and the process continues. Another downside is that this only affects publicly available plugins and ideas. Also, the more you try to control it, the more word spreads that your website can be bypassed, which makes users who used to pay happily feel shortchanged and start looking for ways to bypass the paywall themselves. This strategy also tends to target only famous plugins and code repositories. The lesser-known repositories are left to grow because you don't know they exist (e.g. <https://github.com/nikolqyy/bypass-paywalls-chrome>), and it is very unlikely that users who are used to accessing content for free will pay for it even if you cut off the access. Finally, users who have already installed the plugins will continue to enjoy the premium content without paying.

## The Better Solution

After a DMCA takedown notice, the most logical next step is to change some aspects of your website, such as class names and the arrangement of the page contents, so that the old plugins stop working. But there is a slightly better solution, one that is scalable, cheap and doesn't compromise on search engine optimization.

The solution involves keeping a select list of search engines that are allowed to read all the premium content for search engine optimization. These may include `google`, `bing`, `duckduckgo`, `yandex`, `baidu`, `yahoo` and `ahrefs`. In your web servers, you would check the IP address of the calling client and do a reverse DNS lookup to find out whether the IP address is associated with one of the allowlisted search engines.
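The hostname returned by a reverse lookup still has to be compared against the allowlist. A minimal sketch of that comparison, assuming the match should only succeed on a whole-label boundary (the function name `hostname_matches` and the sample allowlist are illustrative, not part of any existing codebase):

```python
def hostname_matches(hostname: str, allowed_domains: list[str]) -> bool:
    """Return True if hostname equals, or is a subdomain of, an allowed domain."""
    # Normalize case and drop a trailing dot from fully qualified names
    host = hostname.casefold().rstrip(".")
    for domain in allowed_domains:
        domain = domain.casefold().strip(".")
        # Require an exact match or a "." right before the allowed domain
        if host == domain or host.endswith("." + domain):
            return True
    return False


# Hypothetical allowlist used only for this illustration
ALLOWED = ["googlebot.com", "search.msn.com"]

print(hostname_matches("crawl-66-249-66-1.googlebot.com", ALLOWED))  # True
print(hostname_matches("evilgooglebot.com", ALLOWED))  # False
```

A plain `endswith("googlebot.com")` check would also accept `evilgooglebot.com`, which anyone can register, so the leading dot matters.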

This is of course an expensive operation and should be optimized by caching the result for about a week. This means that an IP address that has been found to be associated with a search engine should not be re-evaluated for about a week. This keeps a good balance between functionality and performance.


## Nation Media Case Study

The logic described here affects <https://nation.africa/> and <https://www.businessdailyafrica.com/>.

```python
#| code-summary: "Show python imports"

import sys
import os
from pathlib import Path

# Add root directory as python path
root_dir = os.path.abspath(Path(sys.executable).parents[2])
sys.path.append(root_dir)

%reload_ext autoreload
%autoreload 2
```

### The old vulnerability

The old vulnerability only required changing CSS classes to bypass. One only needed to edit the DOM (Document Object Model), removing some classes and some elements, and the premium content would become visible. The JavaScript code below would fully display the premium content to any user:

```js
setTimeout(() => {
    // https://nation.africa/
    // Remove the paywall element
    document.querySelector('.wall-guard')?.remove();
    // Allow copying the text
    document.querySelectorAll('.blk-txt')?.forEach(
      i => i.classList.remove('blk-txt'));

    // https://www.businessdailyafrica.com/
    // Remove the paywall spinner
    document.querySelector('.spinner')?.remove();
    // Remove the paywall element
    document.querySelector('.paywall')?.remove();
    // Remove the call for action
    document.querySelector('.grid-container-medium')?.remove();

    // https://www.businessdailyafrica.com/ AND https://nation.africa/
    // Show the hidden content
    document.querySelectorAll('.paragraph-wrapper.nmgp')?.forEach(
      i => i.classList.remove('nmgp'));
}, 1)
```

***

After the issue was reported to them, they added a JavaScript layer to prevent easy access to the premium content. There is now JavaScript code that removes the actual content from the DOM, which means that CSS changes alone will not show the content.
However, there is a way to silently sidestep that JavaScript: fetch the HTML again and parse the text into a DOM without running any scripts. This essentially allows the old CSS method to continue working. See the JavaScript code below:

```js
setTimeout(async () => {
    // remove popup and make page scrollable
    const removePopup = (maxRetries, retries) => {
        setTimeout(() => {
            const popUp = document.querySelector('.fc-ab-root')
            if (popUp) {
                popUp?.remove()
                document.body.style = ""
            } else if (retries < maxRetries) {
                removePopup(maxRetries, retries + 1)
            }
        }, 300);
    };
    // re-fetch html from the current link
    const htmlString = await fetch(location.href).then(resp => resp.text())
    // parse the HTML without Javascript!
    const newHtmlDocument = new DOMParser().parseFromString(htmlString, 'text/html');
    // https://nation.africa/
    // Remove the paywall element
    newHtmlDocument.querySelector('.wall-guard')?.remove();
    // Allow copying the text
    newHtmlDocument.querySelectorAll('.blk-txt')?.forEach(
      i => i.classList.remove('blk-txt'));

    // https://www.businessdailyafrica.com/
    // Remove the paywall spinner
    newHtmlDocument.querySelector('.spinner')?.remove();
    // Remove the paywall element
    newHtmlDocument.querySelector('.paywall')?.remove();
    // Remove the call for action
    newHtmlDocument.querySelector('.grid-container-medium')?.remove();

    // https://www.businessdailyafrica.com/ AND https://nation.africa/
    // Show the hidden content
    newHtmlDocument.querySelectorAll('.paragraph-wrapper.nmgp')?.forEach(
      i => i.classList.remove('nmgp'));
    // Enable images
    newHtmlDocument.querySelectorAll('img.lazy-img').forEach(
      i => i.classList.remove('lazy-img'))
    newHtmlDocument.querySelectorAll('img[data-src]').forEach(img => {
        const { dataset } = img;
        img.src = dataset.src ?? img.src;
        img.srcset = dataset.srcset ?? img.srcset;
    });
    // Remove spinners
    newHtmlDocument.querySelectorAll('.spinner').forEach(i => i.remove());
    // Remove Cloudflare email protection label
    newHtmlDocument.querySelector('.__cf_email__')?.closest('.paragraph-wrapper')?.remove();

    document.body.outerHTML = newHtmlDocument.body.outerHTML;

    removePopup(50, 0)
}, 10)
```

Archived example article: <https://web.archive.org/web/20240601075749/https://nation.africa/kenya/business/inside-world-bank-tough-terms-sh158bn-loan-kenya-4642634>

## Appropriate Fix

My suggested fix involves using Cloudflare, which nation.africa is already using for DNS and CDN management. Create a Cloudflare Worker that checks the IP address of the client. If the IP address belongs to a search engine, return the extra paid content for SEO; otherwise redact the extra content. With this in place, it would still be possible to see the content by routing the request through a service such as <https://pagespeed.web.dev/>, presumably because such requests originate from search-engine infrastructure, but that is much harder than simple JavaScript and CSS!
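The redaction step can be sketched independently of where it runs. The snippet below is a minimal illustration, assuming the article body is already available as a list of paragraphs and that the crawler check has been done elsewhere; `redact_article` and `preview_count` are hypothetical names chosen for this example:

```python
def redact_article(
    paragraphs: list[str],
    is_verified_crawler: bool,
    preview_count: int = 2,
) -> list[str]:
    """Return the paragraphs that should be sent to this client."""
    if is_verified_crawler:
        # Verified search engines receive the full premium article for SEO
        return list(paragraphs)
    # Everyone else only receives the preview; the rest never leaves the server
    return list(paragraphs[:preview_count])


article = ["Intro", "Context", "Premium detail 1", "Premium detail 2"]
print(redact_article(article, is_verified_crawler=False))  # ['Intro', 'Context']
print(len(redact_article(article, is_verified_crawler=True)))  # 4
```

The point of the design is that the premium paragraphs are removed before the response is sent, so no client-side DOM or CSS edit can bring them back.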

The IP check involves a reverse DNS lookup on the client IP address. The allowlist of search engine domains and the Python lookup code follow; the code can be tested against the major search engines.

```json
{
  "AllowedSearchBots": [
    "googlebot.com",
    "google.com",
    "search.msn.com",
    "duckduckgo.com",
    "yandex.ru",
    "yandex.net",
    "yandex.com",
    "crawl.baidu.com",
    "crawl.baidu.jp",
    "crawl.yahoo.net",
    "ahrefs.com"
  ]
}
```

```python
#| code-fold: false

import socket
from ipaddress import ip_address as parse_ip_address


async def reverse_dns_lookup(ip_address: str, *host_names: str) -> bool:
    """
    Verify a client IP with a forward-confirmed reverse DNS lookup.
    Usage example:
        await reverse_dns_lookup("66.249.66.1", "googlebot.com", "google.com")

    Parameters
    ----------
    ip_address : str
        The IP address of the client that called the server,
        or the header value of "X-Forwarded-For" in case a
        proxy/CDN such as Cloudflare is used.
    *host_names : str
        Allowed search engine domains,
        e.g. "googlebot.com", "search.msn.com", "duckduckgo.com"

    More Information:
    Verifying Googlebot:
        https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot
    How to access the sitemap.xml file of stackoverflow.com
        https://meta.stackexchange.com/a/324471
    Reverse IP Domain Check?
        https://stackoverflow.com/a/716753/3563013
    """
    if not host_names:
        return False
    try:
        # Raises `ValueError` if ip_address is not a valid IPv4 or IPv6 address
        valid_ip_address = str(parse_ip_address(ip_address))
        # Reverse DNS lookup (blocking call): get the hostname from the IP,
        # e.g. ('crawl-66-249-66-1.googlebot.com', [], ['66.249.66.1'])
        hostname, aliases_1, _ = socket.gethostbyaddr(valid_ip_address)
        # Forward lookup: all IP addresses the hostname resolves to (IPv4 and IPv6)
        resolved_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
        # The forward lookup must lead back to the original IP address
        if valid_ip_address not in resolved_ips:
            return False
        # Collect the aliases reported by the forward lookup as well
        _, aliases_2, _ = socket.gethostbyname_ex(hostname)
        all_names = {n.casefold().rstrip(".") for n in (hostname, *aliases_1, *aliases_2)}
        allowed = [h.casefold().strip(".") for h in host_names]
        # Match only on a label boundary, so that "evilgooglebot.com"
        # is not accepted for "googlebot.com"
        return any(
            name == host or name.endswith("." + host)
            for name in all_names for host in allowed)
    except (ValueError, OSError):
        # Invalid IP address or failed DNS lookup: treat as not verified
        return False

await reverse_dns_lookup("66.249.66.1", "googlebot.com", "googleusercontent.com", "google.com")
```

As one can tell, doing this for every request is resource intensive, so it is best to cache the result for about 7 days: a verified IP address should be allowed to make requests for a week without further checks.
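One way to picture the caching is a small in-memory wrapper with a time-to-live. This is only a sketch under a few assumptions: the verifier is passed in as a plain synchronous callable, negative results are cached for the same period as positive ones, and the cache is never pruned. The class name `VerifiedIpCache` and its parameters are made up for this example:

```python
import time
from typing import Callable

SEVEN_DAYS = 7 * 24 * 60 * 60  # cache lifetime in seconds


class VerifiedIpCache:
    """Remember the verification result for each IP for a limited time."""

    def __init__(
        self,
        verify: Callable[[str], bool],
        ttl_seconds: float = SEVEN_DAYS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._verify = verify
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps IP address -> (result, expiry time)
        self._entries: dict[str, tuple[bool, float]] = {}

    def is_search_engine(self, ip: str) -> bool:
        now = self._clock()
        cached = self._entries.get(ip)
        if cached is not None and cached[1] > now:
            # Still fresh: skip the expensive DNS lookups
            return cached[0]
        result = self._verify(ip)
        self._entries[ip] = (result, now + self._ttl)
        return result


# Example with a stand-in verifier instead of real DNS lookups
cache = VerifiedIpCache(lambda ip: ip == "66.249.66.1")
print(cache.is_search_engine("66.249.66.1"))  # True
print(cache.is_search_engine("203.0.113.7"))  # False
```

In a multi-server deployment a shared store would likely replace the dictionary, and a size limit would be needed to keep memory bounded.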

An alternative to doing this on the origin server is to do it in a CDN such as Cloudflare, using Cloudflare Workers in this case. This saves server resources and, for a start, it is free. A Worker intercepts a request to the server and is able to modify both the request and the response.
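If the check stays on the origin server while a CDN sits in front of it, the socket peer is the CDN rather than the visitor, so the client IP has to come from a request header. A minimal sketch, assuming the headers arrive as a plain dictionary and that the origin only accepts traffic from the CDN (otherwise these headers can be spoofed); `client_ip_from_headers` is a hypothetical helper:

```python
from ipaddress import ip_address


def client_ip_from_headers(headers: dict[str, str], peer_ip: str) -> str:
    """Pick the visitor IP from proxy headers, falling back to the socket peer."""
    # Header names are case-insensitive, so normalize them first
    lowered = {key.casefold(): value for key, value in headers.items()}
    candidates = [
        # Header set by Cloudflare with the connecting client's address
        lowered.get("cf-connecting-ip", ""),
        # Comma-separated chain; the first entry is the original client
        lowered.get("x-forwarded-for", "").split(",")[0],
    ]
    for candidate in candidates:
        try:
            return str(ip_address(candidate.strip()))
        except ValueError:
            # Missing or malformed value: try the next candidate
            continue
    return peer_ip


print(client_ip_from_headers({"CF-Connecting-IP": "66.249.66.1"}, "10.0.0.1"))
print(client_ip_from_headers({}, "10.0.0.1"))
```

The returned address is what would then be passed to the reverse DNS check.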