# Web Scraping for LLM: Extract symptoms and remedies from websites

First, let's collect some websites to extract data from.

In [1]:
home_remedy_sites = [
    {
        "name": "Allina Health - Natural Remedies for Everyday Illness",
        "url": "https://www.allinahealth.org/healthysetgo/heal/natural-remedies-for-everyday-illnesses"
    },
    {
        "name": "Healthline - 9 Home Remedies backed by Science",
        "url": "https://www.healthline.com/health/home-remedies"
    },
    {
        "name": "WebMD - Home Remedies, What Works?",
        "url": "https://www.webmd.com/balance/ss/slideshow-home-remedies"
    },
    {
        "name": "Piedmont - 9 natural cold and flu remedies",
        "url": "https://www.piedmont.org/living-real-change/9-natural-cold-and-flu-remedies"
    },
    {
        "name": "Prevention - 35 All-Time Favorite Natural Remedies",
        "url": "https://www.prevention.com/health/a20477585/35-all-time-favorite-natural-remedies/"
    },
    {
        "name": "Elliot - Cold and Flu: Home Remedies, Natural Treatments, & When to Seek Help from VirtualER",
        "url": "https://www.elliothospital.org/about-us/newsroom/news/cold-and-flu-home-remedies-natural-treatments-and-when-seek-help-virtualer"
    },
    {
        "name": "Stamford Health - 10 Natural Home Remedies for Type 2 Diabetes",
        "url": "https://www.stamfordhealth.org/healthflash-blog/diabetes-and-endocrine/type-2-diabetes-natural-remedies/"
    },
]

## Since I'll be testing multiple scraping methods, I want to see the scraping results in a neat table, so let's set up the code for that.

In [2]:
! pip install prettytable tqdm --quiet

```prettytable``` prints the results as a neatly aligned table in the terminal instead of ragged, hard-to-compare text, and ```tqdm``` adds a progress bar.

In [3]:
from typing import List, Callable, Dict
from prettytable import PrettyTable, ALL
from tqdm import tqdm

def view_scraped_content(scrape_url_functions: List[Dict[str, Callable[[str], str]]], sites_list: List[Dict[str, str]], characters_to_display: int = 500, table_max_width: int = 50) -> List[Dict[str, str]]:
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    scraped_data = []

    for site in sites_list:
        content_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                try:
                    content = scrape_function['function'](site['url'])
                    content_snippet = content[:characters_to_display]
                    content_row.append(content_snippet)

                    site_data["sites"].append({"name": function_name, "content": content})
                except Exception as e:
                    error_message = f"Error: {str(e)}"
                    content_row.append(error_message)

                    site_data["sites"].append({"name": function_name, "content": error_message})
                    continue
        
        content_table.add_row(content_row)
        scraped_data.append(site_data)
    
    content_table.max_width = table_max_width
    content_table.hrules = ALL

    print("Content Table:")
    print(content_table)

    return scraped_data



## Now let us set up all the scrapers we would like to use and compare.

### Beautiful Soup

In [4]:
! pip install requests beautifulsoup4 --quiet

In [5]:
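import requests
from bs4 import BeautifulSoup

def beautiful_soup_scrape_url(url: str) -> str:
    # Fetch the page; raise on HTTP errors so the caller's except block can report them
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Parse the HTML and return the whole document as a string
    soup = BeautifulSoup(response.content, 'html.parser')
    return str(soup)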
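
Note that `beautiful_soup_scrape_url` returns the full HTML source, tags included. If only the readable text is wanted, Beautiful Soup can strip the markup with `get_text()`. The following is a minimal sketch on an inline HTML string; the helper name `html_to_text` and the sample HTML are assumptions made for illustration and are not part of the comparison below.

```python
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document with blank lines removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are never shown to the reader.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

# Hypothetical sample page, used only to demonstrate the helper.
sample_html = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Ginger</h1><p>Helps with nausea.</p></body></html>"
)
extracted = html_to_text(sample_html)
print(extracted)
```

This removes the markup but still keeps navigation menus, footers and similar page chrome on real sites, so it is not a drop-in replacement for a reader-style extractor.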

### Reader API by Jina AI

In [6]:
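import requests

def scrape_jina_ai(url: str) -> str:
    # Prefixing a URL with https://r.jina.ai/ asks the Reader API for a cleaned-up text version of the page
    response = requests.get("https://r.jina.ai/" + url, timeout=60)
    response.raise_for_status()
    return response.text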

## Let's run all scrapers on all sites and compare

In [7]:
list_of_scraper_functions = [
    {"name": "Beautiful Soup", "function": beautiful_soup_scrape_url},
    {"name": "Jina AI", "function": scrape_jina_ai},
]

all_content = view_scraped_content(list_of_scraper_functions, home_remedy_sites, 700, 20)

Processing site Allina Health - Natural Remedies for Everyday Illness using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  2.26it/s]
Processing site Allina Health - Natural Remedies for Everyday Illness using Jina AI: 100%|██████████| 1/1 [00:07<00:00,  7.50s/it]
Processing site Healthline - 9 Home Remedies backed by Science using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  3.25it/s]
Processing site Healthline - 9 Home Remedies backed by Science using Jina AI: 100%|██████████| 1/1 [00:08<00:00,  8.38s/it]
Processing site WebMD - Home Remedies, What Works? using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00, 11.80it/s]
Processing site WebMD - Home Remedies, What Works? using Jina AI: 100%|██████████| 1/1 [00:14<00:00, 14.96s/it]
Processing site Piedmont - 9 natural cold and flu remedies using Beautiful Soup: 100%|██████████| 1/1 [00:01<00:00,  1.40s/it]
Processing site Piedmont - 9 natural cold and flu remedies using Jina AI: 100%|██████████| 1/1 [00:15<00:00, 15.10s/it

Content Table:
+----------------------+------------------------+----------------------+
|      Site Name       | Beautiful Soup content |   Jina AI content    |
+----------------------+------------------------+----------------------+
|   Allina Health -    |    <!DOCTYPE html>     |    Title: Natural    |
| Natural Remedies for |                        |     Remedies for     |
|   Everyday Illness   |    <html class="ahn    |  Everyday Illnesses  |
|                      |     ah_bootstrap"      |                      |
|                      | lang="en" xmlns="http: | URL Source: https:// |
|                      | //www.w3.org/1999/xhtm | www.allinahealth.org |
|                      |          l">           | /healthysetgo/heal/n |
|                      | <head runat="server">  | atural-remedies-for- |
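(output truncated: the Beautiful Soup column continues with the page's title element, "Natural Remedies for Illness | Natural Cures | Allina Health", and the Jina AI column with the rest of the source URL, "everyday-illnesses")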




**Observations:**
- Beautiful Soup is much faster: in the runs above it fetched each page in about 0.1 to 1.4 seconds, while Jina AI took roughly 7.5 to 15 seconds per page.
- Jina AI strips out the HTML markup and returns just the readable content, whereas Beautiful Soup (as used here) returns the full HTML source.

*Different use cases need different content, so neither one is better in general.*

For this use case, I'll stick with the Jina AI Reader API: each website keeps its relevant content in different HTML tags, so a single reusable BeautifulSoup extractor would be impractical.
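
The timing difference noted above can also be measured directly instead of being read off the progress bars. Here is a minimal sketch using `time.perf_counter`; the helper `time_scraper` and the stand-in `fake_scraper` are hypothetical names introduced for illustration, and in the notebook you would pass `beautiful_soup_scrape_url` or `scrape_jina_ai` instead.

```python
import time
from typing import Callable, Tuple

def time_scraper(scrape_function: Callable[[str], str], url: str) -> Tuple[str, float]:
    """Run one scraper on one URL and return (content, elapsed_seconds)."""
    start = time.perf_counter()
    content = scrape_function(url)
    elapsed = time.perf_counter() - start
    return content, elapsed

# Stand-in scraper so the sketch runs without network access (hypothetical).
def fake_scraper(url: str) -> str:
    time.sleep(0.05)
    return f"content of {url}"

content, seconds = time_scraper(fake_scraper, "https://example.com")
print(f"{len(content)} characters in {seconds:.2f}s")
```

A single request is a noisy measurement, so averaging over several runs per site would give a fairer comparison.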

## Let's now structure relevant data from all the sources

Hand-writing extraction code for this much data, which is diverse and inconsistent across sites, is tedious. Instead, we can use an LLM to turn the scraped text into structured records.

|Model|Output Format|Medical Understanding|JSON Consistency|Deployment|
|---|---|---|---|---|
|GPT-4/GPT-4o|🟢 Excellent|🟢 Excellent|🟢 Excellent|API|
|Claude 3|🟢 Excellent|🟢 Excellent|🟢 Excellent|API|
|Mistral 7B|🟡 Moderate|🔴 Basic (add RAG)|🟡 With good prompt|Local (Ollama)|
|GPT-3.5|🟢 Good|🟡 Decent|🟡 Okay|API|
|Med-Alpaca|🔴 Poor (JSON)|🟢 Great (bio NER)|🔴 Poor|Local / HF|


In [8]:
! pip install --upgrade openai --quiet

First, let's collect the scraped content from every site.

In [9]:
jina_results = []

for site in home_remedy_sites:
    try:
        content = scrape_jina_ai(site['url'])
        jina_results.append({
            "content": content,
            "url": site['url']
        })
        print(f"[✓] Scraped: {site}")
    except Exception as e:
        print(f"[✗] Failed: {site} → {e}")

[✓] Scraped: {'name': 'Allina Health - Natural Remedies for Everyday Illness', 'url': 'https://www.allinahealth.org/healthysetgo/heal/natural-remedies-for-everyday-illnesses'}
[✓] Scraped: {'name': 'Healthline - 9 Home Remedies backed by Science', 'url': 'https://www.healthline.com/health/home-remedies'}
[✓] Scraped: {'name': 'WebMD - Home Remedies, What Works?', 'url': 'https://www.webmd.com/balance/ss/slideshow-home-remedies'}
[✓] Scraped: {'name': 'Piedmont - 9 natural cold and flu remedies', 'url': 'https://www.piedmont.org/living-real-change/9-natural-cold-and-flu-remedies'}
[✓] Scraped: {'name': 'Prevention - 35 All-Time Favorite Natural Remedies', 'url': 'https://www.prevention.com/health/a20477585/35-all-time-favorite-natural-remedies/'}
[✓] Scraped: {'name': 'Elliot - Cold and Flu: Home Remedies, Natural Treatments, & When to Seek Help from VirtualER', 'url': 'https://www.elliothospital.org/about-us/newsroom/news/cold-and-flu-home-remedies-natural-treatments-and-when-seek-help

In [10]:
jina_results

  'url': 'https://www.allinahealth.org/healthysetgo/heal/natural-remedies-for-everyday-illnesses'},
  'url': 'https://www.healthline.com/health/home-remedies'},
  'url': 'https://www.webmd.com/balance/ss/slideshow-home-remedies'},
  'url': 'https://www.piedmont.org/living-real-change/9-natural-cold-and-flu-remedies'},
  'url': 'https://www.prevention.com/health/a20477585/35-all-time-favorite-natural-remedies/'},
  'url': 'https://www.elliothospital.org/about-us/newsroom/news/cold-and-flu-home-remedies-natural-treatments-and-when-seek-help-virtualer'},
  'url': 'https://www.stamfordhealth.org/healthflash-blog/diabetes-and-endocrine/type-2-diabetes-natural-remedies/'}]

In [11]:
! pip install python-dotenv --quiet

In [12]:
from dotenv import load_dotenv
import os

load_dotenv()  # looks for .env file
secrets_path = os.getenv("SECRETS_PATH")

In [13]:
import json
import openai

# Load API keys from the JSON file that SECRETS_PATH points to
with open(secrets_path, "r") as f:
    secrets = json.load(f)

openai.api_key = secrets["OPENAI_API_KEY"]

In [14]:
! pip install pydantic --quiet

In [15]:
from pydantic import BaseModel, ValidationError

class RemedyRecord(BaseModel):
    symptom: str
    remedy: str
    description: str
    warnings: str
    source_url: str

In [35]:
PROMPT_TEMPLATE = """
Given the following health-related text, extract all distinct symptom → remedy mappings mentioned.

Return the result as a list of JSON objects with the following fields:
- symptom: The name of the health condition or complaint
- remedy: The home remedy mentioned
- description: A brief explanation of how the remedy helps
- warnings: Any warnings or limitations mentioned
- source_url: Use this source_url: {source_url}

If not present then fill that field with 'None'

TEXT:
{text}

OUTPUT FORMAT:
[
  {{
    "symptom": "...",
    "remedy": "...",
    "description": "...",
    "warnings": "...",
    "source_url": "..."
  }},
  ...
]
"""


In [36]:
from openai import OpenAI

def extract_remedy_record(content: str, source_url: str) -> list[RemedyRecord] | None:
    prompt = PROMPT_TEMPLATE.format(text=content, source_url=source_url)

    try:
        client = OpenAI(api_key=secrets["OPENAI_API_KEY"])

        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        response_text = response.choices[0].message.content
        response_json = json.loads(response_text)
        validated_records = []
        for record in response_json:
            record["source_url"] = source_url  # override to ensure consistency
            validated = RemedyRecord(**record)
            validated_records.append(validated)

        return validated_records

    except (json.JSONDecodeError, ValidationError) as e:
        print("❌ Validation failed:", e)
        return None
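
One failure mode worth guarding against, as an assumption rather than something observed in this run: chat models sometimes wrap their JSON in a Markdown code fence, which makes `json.loads` raise `JSONDecodeError`. The sketch below shows one way to tolerate that; `parse_json_reply` is a hypothetical helper and is not used by `extract_remedy_record` above.

```python
import json
import re

# Three backticks, built programmatically so the literal fence does not appear in this cell.
FENCE = "`" * 3

def parse_json_reply(reply: str):
    """Parse a model reply as JSON, tolerating an optional Markdown code fence around it."""
    text = reply.strip()
    # Match an opening fence (optionally tagged "json"), the payload, and a closing fence.
    pattern = rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$"
    match = re.match(pattern, text, flags=re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

# Hypothetical replies, one fenced and one plain.
fenced_reply = f'{FENCE}json\n[{{"symptom": "nausea", "remedy": "Ginger"}}]\n{FENCE}'
plain_reply = '[{"symptom": "cough", "remedy": "Honey"}]'

fenced_parsed = parse_json_reply(fenced_reply)
plain_parsed = parse_json_reply(plain_reply)
print(fenced_parsed)
print(plain_parsed)
```

Replies that are not valid JSON even after unwrapping still raise `json.JSONDecodeError`, so the existing `except` clause would keep working.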

In [42]:
results = []

for item in jina_results:
    record = extract_remedy_record(item["content"], item["url"])
    if record:
        for r in record:
            data = r.model_dump()
            if not data.get("symptom") or data["symptom"].lower() == "none":
                continue
            if not data.get("remedy") or data["remedy"].lower() == "none":
                continue
            results.append(data)

In [43]:
results

[{'symptom': 'cough and sore throat',
  'remedy': 'Tea, Honey, Echinacea, Elderberry syrup, Pelargonium',
  'description': 'Throat-coating properties to reduce irritation, soothe sore throats, suppress coughs, reduce cold symptoms, and have antiviral properties',
  'source_url': 'https://www.allinahealth.org/healthysetgo/heal/natural-remedies-for-everyday-illnesses'},
 {'symptom': 'upset stomach, nausea, motion sickness',
  'remedy': 'Ginger',
  'description': 'Helpful for digestive issues',
  'source_url': 'https://www.allinahealth.org/healthysetgo/heal/natural-remedies-for-everyday-illnesses'},
 {'symptom': 'diarrhea',
  'remedy': 'Probiotics',
  'description': 'Good bacteria for the digestive system, helps with diarrhea',
  'source_url': 'https://www.allinahealth.org/healthysetgo/heal/natural-remedies-for-everyday-illnesses'},
 {'symptom': 'muscle aches, pains, bruising',
  'remedy': 'Arnica cream',
  'description': 'Soothes muscle discomfort',
  'source_url': 'https://www.allinahea

In [44]:
len(results)

88
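
A total count alone does not show whether every site contributed records. A quick sanity check is to count records per `source_url`. The sketch below uses `collections.Counter` on a hypothetical sample in the same shape as `results`; the sample values are made up for illustration and are not the extracted data.

```python
from collections import Counter

# Hypothetical sample in the same shape as `results`; not the real extracted data.
sample_results = [
    {"symptom": "nausea", "remedy": "Ginger", "source_url": "https://example.com/a"},
    {"symptom": "cough", "remedy": "Honey", "source_url": "https://example.com/a"},
    {"symptom": "sore throat", "remedy": "Salt water gargle", "source_url": "https://example.com/b"},
]

# Count how many records each source contributed.
records_per_source = Counter(record["source_url"] for record in sample_results)
for url, count in records_per_source.most_common():
    print(f"{count:3d}  {url}")
```

Running the same two lines on `results` would reveal any site that yielded no records, for example because its reply failed validation.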

### Basic Evaluation

In [45]:
import re

def basic_eval(record):
    issues = []

    # Check for required fields
    required_fields = ["symptom", "remedy", "description", "warnings", "source_url"]
    for field in required_fields:
        if not record.get(field):
            issues.append(f"Missing field: {field}")

    # Check if source_url is a valid URL
    if not re.match(r"^https?://", record.get("source_url", "")):
        issues.append("Invalid source URL")

    # Check for hallucinated safety risks in warnings
    hallucination_terms = [
        "cure cancer", "miracle", "always safe", "100% effective", "no side effects",
        "consume bleach", "drink turpentine", "inject", "toxic", "dangerous"
    ]
    warning_text = record.get("warnings", "").lower()
    if any(term in warning_text for term in hallucination_terms):
        issues.append("🚨 Potentially unsafe or hallucinated warning")

    return issues
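
The URL check in `basic_eval` only looks at the scheme prefix, so a string such as `https://` with no host would pass. If a stricter check is wanted, one option is the standard library's `urllib.parse`. This is a sketch of an alternative, with `is_valid_http_url` as a hypothetical helper name:

```python
from urllib.parse import urlparse

def is_valid_http_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a non-empty host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_valid_http_url("https://www.healthline.com/health/home-remedies"))  # True
print(is_valid_http_url("https://"))  # False, although it matches ^https?://
print(is_valid_http_url("None"))      # False
```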

In [46]:
for i, record in enumerate(results):
    issues = basic_eval(record)
    print(f"\n--- Record {i+1} ---")
    if not issues:
        print("✅ Passed basic checks")
    else:
        for issue in issues:
            print("❌", issue)


--- Record 1 ---
✅ Passed basic checks

--- Record 2 ---
✅ Passed basic checks

--- Record 3 ---
✅ Passed basic checks

--- Record 4 ---
✅ Passed basic checks

--- Record 5 ---
✅ Passed basic checks

--- Record 6 ---
✅ Passed basic checks

--- Record 7 ---
✅ Passed basic checks

--- Record 8 ---
✅ Passed basic checks

--- Record 9 ---
✅ Passed basic checks

--- Record 10 ---
✅ Passed basic checks

--- Record 11 ---
✅ Passed basic checks

--- Record 12 ---
✅ Passed basic checks

--- Record 13 ---
✅ Passed basic checks

--- Record 14 ---
✅ Passed basic checks

--- Record 15 ---
✅ Passed basic checks

--- Record 16 ---
✅ Passed basic checks

--- Record 17 ---
✅ Passed basic checks

--- Record 18 ---
✅ Passed basic checks

--- Record 19 ---
✅ Passed basic checks

--- Record 20 ---
✅ Passed basic checks

--- Record 21 ---
✅ Passed basic checks

--- Record 22 ---
✅ Passed basic checks

--- Record 23 ---
✅ Passed basic checks

--- Record 24 ---
✅ Passed basic checks

--- Record 25 ---
✅ Pass

### Store as JSONL

In [47]:
import json
# Save as JSONL for future vector DB + RAG
with open("symptom_remedy_data.jsonl", "w", encoding="utf-8") as f:
    for record in results:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
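
Reading the file back is the mirror image: each line is one JSON document. The sketch below round-trips a hypothetical record through a temporary file so that it runs on its own; `read_jsonl` is an illustrative helper, and in the notebook it would be pointed at `symptom_remedy_data.jsonl`.

```python
import json
import os
import tempfile

def read_jsonl(path: str) -> list:
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Round-trip a hypothetical record through a temporary file to show the format.
sample = [{"symptom": "nausea", "remedy": "Ginger"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    loaded = read_jsonl(path)

print(loaded)
```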