# Seeking Alpha

JJ appears to scrape the following Seeking Alpha webpages, based on his `enter.js` file:

- https://seekingalpha.com/alpha-picks/picks/current
- https://seekingalpha.com/alpha-picks/picks/removed


In [12]:
from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
from dotenv import load_dotenv
import os
import re
from pprint import pformat

%load_ext autoreload
%autoreload 2

from src.utils.init_script_utils import append_mimetypes

load_dotenv()
cur_url = "https://seekingalpha.com/alpha-picks/picks/current"
rm_url = "https://seekingalpha.com/alpha-picks/picks/removed"

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
<div aria-labelledby="auth-modal-label" aria-modal="true"
    class="w-full KRkzG pointer-events-auto relative mx-auto my-0 flex flex-col bg-share-modal-container mpcDT w-full md:max-w-full md:w-auto px-18"
    data-test-id="modal-content">
    <div class="wZqjb mx-auto w-full flex-auto overflow-y-auto md:max-w-none md:overflow-y-hidden">
        <div class="relative px-18 pb-20 pt-0 text-center md:px-52 md:pb-28">
            <form novalidate="">
        </div><button
            class="relative inline-flex cursor-pointer select-none items-center whitespace-nowrap break-words transition-colors hover:no-underline motion-reduce:transition-none rounded-4 border-black bg-black text-white hover:bg-black-70 disabled:border-transparent disabled:bg-black-20 dark:border-black-10 dark:bg-black-10 dark:text-black dark:hover:border-black-10 dark:hover:bg-black-10 dark:disabled:bg-black-30 w-full md:min-h-36 md:justify-center md:border md:px-16 md:py-1 md:text-medium-b min-h-44 justify-center border px-18 py-1 text-medium-b"
            data-test-id="sign-in-button" type="submit"><span class="">Sign in</span></button>
        </form>
    </div>
</div>
</div>

# Scrapling

- `Fetcher` can perform GET requests for the different market news sections of seekingalpha.com.
- Unfortunately, Scrapling (i.e. `Fetcher`, `StealthyFetcher` and `PlayWrightFetcher`) cannot scrape `cur_url` and `rm_url` because login is required.
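
The fetch logs in this notebook show the picks URL answering with a 308 redirect to `/alpha-picks/subscribe` when no session is present. A minimal sketch of a check for that case follows; the helper name `was_redirected_to_login` and the constant `SUBSCRIBE_PATH` are hypothetical and not part of Scrapling.

```python
from urllib.parse import urlparse

# Assumption: unauthenticated requests end up on this path, as seen in the fetch logs.
SUBSCRIBE_PATH = "/alpha-picks/subscribe"


def was_redirected_to_login(requested_url: str, final_url: str) -> bool:
    """Return True when a request ended on the subscribe page instead of the requested page."""
    requested_path = urlparse(requested_url).path.rstrip("/")
    final_path = urlparse(final_url).path.rstrip("/")
    return requested_path != final_path and final_path == SUBSCRIBE_PATH


print(
    was_redirected_to_login(
        "https://seekingalpha.com/alpha-picks/picks/current",
        "https://seekingalpha.com/alpha-picks/subscribe",
    )
)
```

Comparing only the URL path keeps the check independent of query strings that the site may append during the redirect.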


## Fetcher


In [10]:
fetcher = Fetcher(auto_match=False)

page = fetcher.post(cur_url, email=os.getenv("COY_EMAIL"), password=os.getenv("COY_PASSWORD"))
print(page.get_all_text())

TypeError: Client.post() got an unexpected keyword argument 'email'

## StealthyFetcher


In [11]:
stealthyfetcher = StealthyFetcher()
page = await stealthyfetcher.async_fetch(
    cur_url,
    extra_headers={
        "email": os.getenv("COY_EMAIL"),
        "password": os.getenv("COY_PASSWORD"),
    },
)
print(page.get_all_text())

[2025-03-17 10:33:06] INFO: Fetched (308) <GET https://seekingalpha.com/alpha-picks/picks/current> (referer: https://www.google.com/search?q=seekingalpha)
[2025-03-17 10:33:06] INFO: Fetched (200) <GET https://seekingalpha.com/alpha-picks/subscribe> (referer: https://www.google.com/)


Alpha Picks by Seeking Alpha, Choose 2 Stocks Per Month | Seeking Alpha






## PlaywrightFetcher


In [12]:
playwrightfetcher = PlayWrightFetcher()
page = await playwrightfetcher.async_fetch(
    cur_url,
    extra_headers={
        "email": os.getenv("COY_EMAIL"),
        "password": os.getenv("COY_PASSWORD"),
    },
)
print(page.get_all_text())

[2025-03-17 10:35:01] INFO: Fetched (308) <GET https://seekingalpha.com/alpha-picks/picks/current> (referer: https://www.google.com/search?q=seekingalpha)
[2025-03-17 10:35:01] INFO: Fetched (200) <GET https://seekingalpha.com/alpha-picks/subscribe> (referer: https://www.google.com/)


Alpha Picks by Seeking Alpha, Choose 2 Stocks Per Month | Seeking Alpha






# Selenium

- https://stackoverflow.com/questions/53039551/selenium-webdriver-modifying-navigator-webdriver-flag-to-prevent-selenium-detec


# Playwright


| Issue | Description |
| --- | --- |
| Firefox triggers bot detection | &bull; A popup window (i.e. an iframe) requests login via Google, in both headed and headless mode. <br>&bull; Bot detection is then triggered, i.e. the user is asked to 'press and hold' a button. |
| Chromium triggers bot detection | &bull; Triggered when accessing the target page directly, bypassing the 'subscribe' page. <br>&bull; Occasionally triggered when the web scraper is launched repeatedly. |
| Chromium does not redirect | &bull; After filling in 'email' and 'password' and clicking the 'Sign in' button, the button is visibly clicked and the loading indicator (orange light) appears. <br>&bull; However, no redirect from "https://seekingalpha.com/alpha-picks/subscribe" to "https://seekingalpha.com/alpha-picks/picks/current" occurs. <br>&bull; Adding `page.wait_for_load_state("networkidle")` after the click triggers bot detection and a TimeoutError, which implies that the network was still active after 30 seconds. <br>&bull; Changing the channel from the default chromium to chrome doesn't resolve the issue. |
| Submitting the form via JavaScript closes the 'login' window, but the URL isn't redirected | &bull; The address "https://seekingalpha.com/alpha-picks/subscribe?email=...&password=..." suggests that a GET request is sent to the server, but it wasn't successful with or without networkidle. <br>&bull; Submitting the form via JavaScript didn't trigger bot detection, even when waiting for networkidle. |
| Manually navigating to the required URL triggers bot detection | &bull; networkidle + JavaScript + wait_for_url -> triggers bot detection. <br>&bull; networkidle + JavaScript + no wait_for_url -> TimeoutError or triggers bot detection. <br>&bull; No networkidle + JavaScript + wait_for_url -> Error: Page.goto: net::ERR_ABORTED at https://seekingalpha.com/alpha-picks/picks/current <br>&bull; No networkidle + JavaScript + no wait_for_url -> Error: Page.goto: net::ERR_ABORTED at https://seekingalpha.com/alpha-picks/picks/current <br>&bull; networkidle + click + wait_for_url -> triggers bot detection. <br>&bull; networkidle + click + no wait_for_url -> TimeoutError and triggers bot detection. <br>&bull; No networkidle + click + wait_for_url -> TimeoutError or triggers bot detection (sometimes no error but no redirect). <br>&bull; No networkidle + click + no wait_for_url -> triggers bot detection (sometimes no error but no redirect). |
| Playwright stealth doesn't work | &bull; The bot detection page is triggered automatically, without even accessing the 'subscribe' page. |
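
The eight combinations listed in the 'manual goto' row are the cross product of three switches. A small sketch, assuming the hypothetical label tuples below, enumerates them so that each run can be logged against one named combination rather than tracked by hand.

```python
from itertools import product

# Hypothetical labels for the three switches varied in the table above.
WAIT_STATES = ("networkidle", "no networkidle")
SUBMIT_METHODS = ("JavaScript", "click")
URL_WAITS = ("wait_for_url", "no wait_for_url")


def build_test_matrix() -> list[tuple[str, str, str]]:
    """Return every (wait state, submit method, URL wait) combination."""
    return list(product(WAIT_STATES, SUBMIT_METHODS, URL_WAITS))


for combination in build_test_matrix():
    print(" + ".join(combination))
```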


In [None]:
import os
import random
import time
from dotenv import load_dotenv
from playwright.sync_api import sync_playwright


def human_delay(min_ms=800, max_ms=2500):
    """Simulate human-like delays."""
    time.sleep(random.uniform(min_ms / 1000, max_ms / 1000))


def advanced_stealth(page):
    """Apply advanced stealth techniques to bypass bot detection."""
    # Override browser properties to mimic a real user
    page.add_init_script(
        """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined,
            configurable: true
        });
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3] // Fake plugins list
        });
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });
        // Real Chrome exposes this object as window.chrome, not navigator.chrome
        window.chrome = {
            app: { isInstalled: false },
            webstore: { onInstallStageChanged: {}, onDownloadProgress: {} },
            runtime: { platformArch: 'Linux x86_64', platformNaclArch: 'Linux x86_64' }
        };
    """
    )

    # Randomize viewport
    viewport_width = 1280 + random.randint(-50, 50)
    viewport_height = 720 + random.randint(-30, 30)
    page.set_viewport_size({"width": viewport_width, "height": viewport_height})

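    # Spoof screen properties. An init script is used because page.evaluate()
    # only affects the current document and is lost on the next navigation.
    page.add_init_script(
        """
        Object.defineProperty(screen, 'availWidth', { value: %d });
        Object.defineProperty(screen, 'availHeight', { value: %d });
    """
        % (viewport_width, viewport_height)
    )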


def behavioral_login(page, selector, text, delay_range=(50, 150)):
    """Simulate human-like typing behavior."""
    element = page.wait_for_selector(selector)
    human_delay(300, 800)  # Pause before interacting
    element.hover()
    for char in text:
        element.type(char, delay=random.randint(*delay_range))
        human_delay(50, 150)


def main():
    load_dotenv()  # Load environment variables from .env file

    with sync_playwright() as p:
        # Launch browser with stealth settings
        browser = p.chromium.launch(
            headless=False,
            args=["--disable-blink-features=AutomationControlled", "--start-maximized"],
        )

        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
            locale="en-US",
            color_scheme=random.choice(["light", "dark"]),  # Randomize light/dark mode
            viewport={"width": 1280, "height": 720},
        )

        page = context.new_page()

        # Apply advanced stealth techniques
        advanced_stealth(page)

        # Block known tracking and CAPTCHA scripts
        page.route("**/*captcha*", lambda route: route.abort())
        page.route("**/*google-analytics*", lambda route: route.abort())

        # Navigate to the login page
        login_url = "https://seekingalpha.com/alpha-picks/subscribe"
        cur_url = "https://seekingalpha.com/alpha-picks/picks/current"
        page.goto(login_url)
        page.wait_for_url(login_url)

        # Click the 'LOG IN' button
        page.locator("button:has-text('LOG IN')").click()

        # Wait for the login form to appear
        xpath = "//div[@data-test-id='modal-content']//button[@data-test-id='sign-in-button']"
        page.wait_for_selector(xpath)

        # Fill in email and password using human-like behavior
        behavioral_login(page, "[name='email']", os.getenv("ALPHA_EMAIL"))
        behavioral_login(page, "[name='password']", os.getenv("ALPHA_PASSWORD"))

        # Simulate human-like mouse movement and click the 'Sign In' button
        submit_button = page.locator(xpath)
        box = submit_button.bounding_box()

        if box is not None:  # bounding_box() returns None when the element is not visible
            for _ in range(3):  # Simulate natural mouse movements before clicking
                # Box values are floats, so use random.uniform rather than random.randint
                page.mouse.move(
                    box["x"] + random.uniform(10, box["width"] - 10),
                    box["y"] + random.uniform(10, box["height"] - 10),
                    steps=random.randint(5, 15),
                )
                human_delay(50, 200)

        submit_button.click()

        # Wait for redirection or CAPTCHA trigger
        try:
            page.wait_for_url(cur_url, timeout=15000)
            print("Login successful and redirected!")
            content = page.content()
            print(
                content[:500]
            )  # Print first 500 characters of the HTML for verification
        except Exception as e:
            print(f"Login failed or CAPTCHA triggered: {e}")
            page.screenshot(path="debug.png")  # Save screenshot for debugging

        browser.close()


if __name__ == "__main__":
    main()

| S/N | Request Header      | Ranking          | Description                                                                                                                                                                             |
| --- | ------------------- | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1.  | :authority:         | Critical         | &bull; Specifies the domain of the server being accessed (e.g., example.com). <br>&bull; This is critical for HTTP/2 requests.                                                          |
| 2.  | :method:            | Critical         | &bull; Indicates the HTTP method used (e.g., GET, POST). <br>&bull; Ensure it matches the operation you're performing.                                                                  |
| 3.  | :path:              | Critical         | &bull; The specific endpoint being requested (e.g., /alpha-picks/subscribe). <br>&bull; It must accurately reflect the URL path.                                                        |
| 4.  | :scheme:            | Critical         | &bull; Specifies the protocol (https or http). <br>&bull; Critical for HTTP/2 requests.                                                                                                 |
| 5.  | accept:             | Critical         | &bull; Defines acceptable response formats (e.g., text/html, application/json). <br>&bull; Must match what a browser typically sends.                                                   |
| 6.  | accept-encoding:    | Optional         | &bull; Specifies supported compression formats (e.g., gzip, br). <br>&bull; Typically included but not always critical.                                                                 |
| 7.  | accept-language:    | Critical         | &bull; Indicates preferred language(s) for content (e.g., en-US,en;q=0.9). <br>&bull; Important for localization.                                                                       |
| 8.  | content-length:     | Optional         | &bull; Length of the request body in bytes. <br>&bull; Required for POST requests but irrelevant for GET requests.                                                                      |
| 9.  | content-type:       | Optional         | &bull; Specifies the type of data being sent in a request body (e.g., application/json, application/x-www-form-urlencoded). <br>&bull; Relevant for POST requests but not GET requests. |
| 10. | cookie:             | Optional         | &bull; Required if authentication or session management depends on cookies. <br>&bull; If cookies are not necessary, this can be omitted.                                               |
| 11. | origin:             | Optional         | &bull; Indicates the origin of the request, often used in cross-origin requests (e.g., during form submissions). <br>&bull; Include only if relevant.                                   |
| 12. | priority:           | Context Specific | &bull; Used in HTTP/2 to specify request priority. <br>&bull; Typically handled automatically by browsers and may not need manual inclusion.                                            |
| 13. | referer:            | Critical         | &bull; Specifies the URL of the previous page that led to this request. <br>&bull; Often checked to ensure legitimate navigation paths.                                                 |
| 14. | sec-ch-ua:          | Critical         | &bull; Provides client hints about the browser and version (e.g., "Chromium";v="125", "Not.A/Brand";v="24"). <br>&bull; Important for mimicking browser behavior.                       |
| 15. | sec-ch-ua-mobile:   | Critical         | &bull; Indicates whether the client is on a mobile device (?0 for desktop, ?1 for mobile).                                                                                              |
| 16. | sec-ch-ua-platform: | Critical         | &bull; Specifies the operating system (e.g., Windows, macOS). <br>&bull; Helps with fingerprinting checks.                                                                              |
| 17. | sec-fetch-dest:     | Context Specific | &bull; Indicates the destination of the fetch request (e.g., document, script, image). <br>&bull; Useful for mimicking browser behavior but context-specific.                           |
| 18. | sec-fetch-mode:     | Context Specific | &bull; Specifies how resources are fetched (e.g., navigate, cors, same-origin). <br>&bull; Include if observed in real browser traffic for your target website.                         |
| 19. | sec-fetch-site:     | Context Specific | &bull; Describes the relationship between the origin of the page and requested resource (e.g., same-origin, cross-site). <br>&bull; Important for anti-bot checks.                      |
| 20. | user-agent:         | Critical         | &bull; Identifies the browser, OS, and device type (e.g., Mozilla/5.0 ...). <br>&bull; One of the most critical headers.                                                                |


| S/N | Request Header      | Value                                                                                                 |
| --- | ------------------- | ----------------------------------------------------------------------------------------------------- |
| 1.  | :authority:         | seekingalpha.com                                                                                      |
| 2.  | :method:            | POST                                                                                                  |
| 3.  | :path:              | /api/v3/login_tokens                                                                                  |
| 4.  | :scheme:            | https                                                                                                 |
| 5.  | accept:             | application/json                                                                                      |
| 6.  | accept-encoding:    | gzip, deflate, br, zstd                                                                               |
| 7.  | accept-language:    | en-US,en;q=0.9                                                                                        |
| 8.  | content-length:     | 120                                                                                                   |
| 9.  | content-type:       | application/json                                                                                      |
| 10. | cookie:             | machine_cookie=...                                                                                    |
| 11. | origin:             | https://seekingalpha.com                                                                              |
| 12. | priority:           | u=1, i                                                                                                |
| 13. | referer:            | https://seekingalpha.com/alpha-picks/subscribe                                                        |
| 14. | sec-ch-ua:          | "Chromium";v="134", "Not:A-Brand";v="24", "Google Chrome";v="134"                                     |
| 15. | sec-ch-ua-mobile:   | ?0                                                                                                    |
| 16. | sec-ch-ua-platform: | "Linux"                                                                                               |
| 17. | sec-fetch-dest:     | empty                                                                                                 |
| 18. | sec-fetch-mode:     | cors                                                                                                  |
| 19. | sec-fetch-site:     | same-origin                                                                                           |
| 20. | user-agent:         | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 |
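
The captured values can be assembled into a request object for inspection. The sketch below uses only the standard library and does not send anything. The JSON field names `email` and `password` are assumptions, since the request body was not captured in this notebook, and `build_login_request` is a hypothetical helper. The HTTP/2 pseudo-headers (`:authority`, `:method`, `:path`, `:scheme`) are left out because the client derives them from the URL and method, and `content-length` is computed from the body.

```python
import json
import urllib.request

LOGIN_URL = "https://seekingalpha.com/api/v3/login_tokens"

# Regular headers copied from the table above (cookie omitted).
BROWSER_HEADERS = {
    "accept": "application/json",
    "accept-language": "en-US,en;q=0.9",
    "content-type": "application/json",
    "origin": "https://seekingalpha.com",
    "referer": "https://seekingalpha.com/alpha-picks/subscribe",
    "sec-ch-ua": '"Chromium";v="134", "Not:A-Brand";v="24", "Google Chrome";v="134"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Linux"',
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
    ),
}


def build_login_request(email: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a POST request shaped like the captured one."""
    # Assumption: the payload field names are hypothetical.
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        LOGIN_URL, data=body, headers=BROWSER_HEADERS, method="POST"
    )


request = build_login_request("user@example.com", "not-a-real-password")
print(request.get_method(), request.full_url)
```

Matching headers alone may not be enough, given the bot detection described earlier; the sketch only makes the captured request easier to reason about.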


In [35]:
text

'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'

In [37]:
import platform
from pprint import pformat


def get_runtime(user_agent: str) -> dict[str, dict[str, str]]:
    """Map desktop user-agent OS tokens ('Linux', 'Macintosh', 'Windows' and
    'CrOS x86_64') to their runtime platform information.

    'user_agent' is not used yet, so the full mapping is returned.
    """

    # Get processor architecture of the local machine
    proc_arch = platform.machine()

    mapping = {
        "Linux": {"PlatformOs": "linux"},
        "Macintosh": {"PlatformOs": "mac"},
        "Windows": {"PlatformOs": "win"},
        "CrOS x86_64": {"PlatformOs": "cros"},
    }

    # Architecture entries shared by every platform
    arch_dict = {
        "PlatformArch": proc_arch,
        "PlatformNaclArch": proc_arch,
    }

    return {k: {**v, **arch_dict} for k, v in mapping.items()}


get_runtime(text)

{'Linux': {'PlatformOs': 'linux',
  'PlatformArch': 'x86_64',
  'PlatformNaclArch': 'x86_64'},
 'Macintosh': {'PlatformOs': 'mac',
  'PlatformArch': 'x86_64',
  'PlatformNaclArch': 'x86_64'},
 'Windows': {'PlatformOs': 'win',
  'PlatformArch': 'x86_64',
  'PlatformNaclArch': 'x86_64'},
 'CrOS x86_64': {'PlatformOs': 'cros',
  'PlatformArch': 'x86_64',
  'PlatformNaclArch': 'x86_64'}}
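
The output above contains an entry for every platform. A possible next step, sketched here under the assumption that a plain substring check on the user agent is sufficient for desktop browsers, is to return only the entry whose token appears in the user agent. `select_runtime` and `UA_TOKEN_TO_PLATFORM_OS` are hypothetical names.

```python
import platform

# Assumption: these tokens identify desktop platforms in a user-agent string.
# 'CrOS' is checked first; 'Linux' is checked last because it is the most generic token.
UA_TOKEN_TO_PLATFORM_OS = {
    "CrOS": "cros",
    "Windows": "win",
    "Macintosh": "mac",
    "Linux": "linux",
}


def select_runtime(user_agent: str) -> dict[str, str]:
    """Return the runtime entry whose OS token appears in the user agent."""
    proc_arch = platform.machine()
    for token, platform_os in UA_TOKEN_TO_PLATFORM_OS.items():
        if token in user_agent:
            return {
                "PlatformOs": platform_os,
                "PlatformArch": proc_arch,
                "PlatformNaclArch": proc_arch,
            }
    raise ValueError(f"No known desktop OS token in user agent: {user_agent!r}")


linux_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
)
print(select_runtime(linux_ua))
```

Note that the architecture still comes from the local machine via `platform.machine()`, so it may disagree with a spoofed user agent.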

In [None]:
import re

pattern = r"\d+(?=\b)"  # Match digits followed by a word boundary
text = "123abc1.456 def789"

matches = re.findall(pattern, text)
print(matches)  # Output: ['1', '456', '789'] ('123' is followed by 'a', so no word boundary)

['1', '456', '789']
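
The leading `'123'` is missing from the result because `'3'` and `'a'` are both word characters, so no word boundary follows any prefix of that digit run. The sketch below compares three patterns on the same text; `find_numbers` is a hypothetical helper used only for this comparison.

```python
import re

SAMPLE = "123abc1.456 def789"


def find_numbers(pattern: str, text: str = SAMPLE) -> list[str]:
    """Return all matches of the pattern in the text."""
    return re.findall(pattern, text)


# Digits that end at a word boundary.
print(find_numbers(r"\d+(?=\b)"))  # ['1', '456', '789']

# Every digit run, regardless of neighbours.
print(find_numbers(r"\d+"))  # ['123', '1', '456', '789']

# Digits with a word boundary on both sides.
print(find_numbers(r"\b\d+\b"))  # ['456']
```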


In [34]:
import platform

platform.machine()

'x86_64'

In [20]:
text = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"

re.findall(r"(?<=Chrome/)\d+", text)

['134']
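
The extracted major version can be used to keep the `sec-ch-ua` header consistent with the `user-agent` header. This is a minimal sketch: `build_sec_ch_ua` is a hypothetical helper, and the `Not:A-Brand` entry with version `24` is copied from the captured header and assumed to stay fixed, although Chrome varies that placeholder brand between versions.

```python
import re


def build_sec_ch_ua(
    user_agent: str,
    grease_brand: str = "Not:A-Brand",  # Assumption: copied from the captured header
    grease_version: str = "24",  # Assumption: copied from the captured header
) -> str:
    """Build a sec-ch-ua value whose Chrome version matches the user agent."""
    match = re.search(r"(?<=Chrome/)\d+", user_agent)
    if match is None:
        raise ValueError("User agent does not contain a Chrome version")
    major = match.group()
    return (
        f'"Chromium";v="{major}", '
        f'"{grease_brand}";v="{grease_version}", '
        f'"Google Chrome";v="{major}"'
    )


ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
)
print(build_sec_ch_ua(ua))
# "Chromium";v="134", "Not:A-Brand";v="24", "Google Chrome";v="134"
```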

In [12]:
import re

pattern = r"(?<=\$)\d+"  # Match digits preceded by a dollar sign
text = "The price is $50 for one item and $30 for another."

matches = re.findall(pattern, text)
print(matches)  # Output: ['50', '30']

['50', '30']




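async def randomize_webgl(page):
    """Spoof the WebGL renderer string reported to the page."""
    # The script must invoke itself; a bare arrow function expression
    # would be defined but never called.
    await page.add_init_script("""
    (() => {
        const gpus = [
            'ANGLE (NVIDIA, NVIDIA GeForce RTX 3080 Direct3D11 vs_5_0 ps_5_0)',
            'Intel Iris Xe Graphics',
            'AMD Radeon Pro 5500M OpenGL Engine'
        ];
        // Pick one GPU per page load so repeated queries stay consistent
        const gpu = gpus[Math.floor(Math.random() * gpus.length)];
        const getParameter = WebGLRenderingContext.prototype.getParameter;
        WebGLRenderingContext.prototype.getParameter = function(parameter) {
            // 37446 is UNMASKED_RENDERER_WEBGL (37445 is UNMASKED_VENDOR_WEBGL)
            if (parameter === 37446) return gpu;
            return getParameter.call(this, parameter);
        };
    })();
    """)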


In [15]:
plugins = [
    {
        "name": "Chromium PDF Viewer",
        "description": "Portable Document Format",
        "filename": "mhjfbmdgcfjbbpaeojofohoefgiehjai",
    },
    {
        "name": "Native Client",
        "description": "Native Client Execution",
        "filename": "internal-nacl-plugin",
    },
    {
        "name": "PDF Viewer",
        "description": "Portable Document Format",
        "filename": "oemmndcbldboiebfnladdacbdfmadadm",
    },
]

a = append_mimetypes(plugins)
print(pformat(a, sort_dicts=False))

[{'name': 'Chromium PDF Viewer',
  'description': 'Portable Document Format',
  'filename': 'mhjfbmdgcfjbbpaeojofohoefgiehjai',
  'mimeTypes': [{'type': 'application/pdf',
                 'suffixes': 'pdf',
                 'description': 'Portable Document Format'},
                {'type': 'text/pdf',
                 'suffixes': 'pdf',
                 'description': 'Portable Document Format'}]},
 {'name': 'Native Client',
  'description': 'native Client Execution',
  'filename': 'internal-nacl-plugin'},
 {'name': 'PDF Viewer',
  'description': 'Portable Document Format',
  'filename': 'oemmndcbldboiebfnladdacbdfmadadm',
  'mimeTypes': [{'type': 'application/pdf',
                 'suffixes': 'pdf',
                 'description': 'Portable Document Format'},
                {'type': 'text/pdf',
                 'suffixes': 'pdf',
                 'description': 'Portable Document Format'}]}]
