# Web Scraping for LLM: Extract symptoms and remedies from websites

First, let's pick some websites to extract data from.

In [1]:
home_remedy_sites = [
    {
        "name": "Allina Health - Natural Remedies for Everyday Illness",
        "url": "https://www.allinahealth.org/healthysetgo/heal/natural-remedies-for-everyday-illnesses"
    },
    {
        "name": "Healthline - 9 Home Remedies backed by Science",
        "url": "https://www.healthline.com/health/home-remedies"
    },
    {
        "name": "WebMD - Home Remedies, What Works?",
        "url": "https://www.webmd.com/balance/ss/slideshow-home-remedies"
    },
    {
        "name": "Piedmont - 9 natural cold and flu remedies",
        "url": "https://www.piedmont.org/living-real-change/9-natural-cold-and-flu-remedies"
    },
    {
        "name": "Prevention - 35 All-Time Favorite Natural Remedies",
        "url": "https://www.prevention.com/health/a20477585/35-all-time-favorite-natural-remedies/"
    },
    {
        "name": "Elliot - Cold and Flu: Home Remedies, Natural Treatments, & When to Seek Help from VirtualER",
        "url": "https://www.elliothospital.org/about-us/newsroom/news/cold-and-flu-home-remedies-natural-treatments-and-when-seek-help-virtualer"
    },
    {
        "name": "Stamford Health - 10 Natural Home Remedies for Type 2 Diabetes",
        "url": "https://www.stamfordhealth.org/healthflash-blog/diabetes-and-endocrine/type-2-diabetes-natural-remedies/"
    },
]

## Since I'll be testing multiple scraping methods, I want to see the results in a neat table, so let's set up the code for that.

In [2]:
! pip install prettytable tqdm --quiet

```prettytable``` renders the terminal output as a neat, aligned table, and ```tqdm``` shows a progress bar for each scrape.

In [3]:
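from typing import Any, Dict, List
from prettytable import PrettyTable, ALL
from tqdm import tqdm

def view_scraped_content(
    scrape_url_functions: List[Dict[str, Any]],
    sites_list: List[Dict[str, str]],
    characters_to_display: int = 500,
    table_max_width: int = 50,
) -> List[Dict[str, Any]]:
    """Run every scraper on every site, print a preview table, and return the full content.

    Each entry in scrape_url_functions is {"name": str, "function": Callable[[str], str]}.
    """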
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    scraped_data = []

    for site in sites_list:
        content_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                try:
                    content = scrape_function['function'](site['url'])
                    content_snippet = content[:characters_to_display]
                    content_row.append(content_snippet)

                    site_data["sites"].append({"name": function_name, "content": content})
                except Exception as e:
                    # Record the failure in the table and carry on with the next scraper.
                    error_message = f"Error: {e}"
                    content_row.append(error_message)

                    site_data["sites"].append({"name": function_name, "content": error_message})
        
        content_table.add_row(content_row)
        scraped_data.append(site_data)
    
    content_table.max_width = table_max_width
    content_table.hrules = ALL

    print("Content Table:")
    print(content_table)

    return scraped_data



## Now let us set up all the scrapers we would like to use and compare.

### Beautiful Soup

In [4]:
! pip install requests beautifulsoup4 --quiet

In [5]:
import requests
from bs4 import BeautifulSoup

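def beautiful_soup_scrape_url(url: str) -> str:
    # A timeout keeps one slow site from hanging the whole comparison.
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Return the parsed document as an HTML string (tags included).
    return str(soup)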

### Reader API by Jina AI

In [6]:
import requests

def scrape_jina_ai(url: str) -> str:
    # Prefixing the target URL with the Reader endpoint returns the page as plain text.
    response = requests.get("https://r.jina.ai/" + url)
    return response.text

## Let's run all scrapers on all sites and compare

In [7]:
list_of_scraper_functions = [
    {"name": "Beautiful Soup", "function": beautiful_soup_scrape_url},
    {"name": "Jina AI", "function": scrape_jina_ai},
]

all_content = view_scraped_content(list_of_scraper_functions, home_remedy_sites, 700, 20)

Processing site Allina Health - Natural Remedies for Everyday Illness using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  2.12it/s]
Processing site Allina Health - Natural Remedies for Everyday Illness using Jina AI: 100%|██████████| 1/1 [00:03<00:00,  3.57s/it]
Processing site Healthline - 9 Home Remedies backed by Science using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  1.38it/s]
Processing site Healthline - 9 Home Remedies backed by Science using Jina AI: 100%|██████████| 1/1 [00:31<00:00, 31.67s/it]
Processing site WebMD - Home Remedies, What Works? using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  7.67it/s]
Processing site WebMD - Home Remedies, What Works? using Jina AI: 100%|██████████| 1/1 [00:00<00:00,  6.73it/s]
Processing site Piedmont - 9 natural cold and flu remedies using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  1.28it/s]
Processing site Piedmont - 9 natural cold and flu remedies using Jina AI: 100%|██████████| 1/1 [00:11<00:00, 11.49s/it]

Content Table:
+----------------------+------------------------+----------------------+
|      Site Name       | Beautiful Soup content |   Jina AI content    |
+----------------------+------------------------+----------------------+
|   Allina Health -    |    <!DOCTYPE html>     |    Title: Natural    |
| Natural Remedies for |                        |     Remedies for     |
|   Everyday Illness   |    <html class="ahn    |  Everyday Illnesses  |
|                      |     ah_bootstrap"      |                      |
|                      | lang="en" xmlns="http: | URL Source: https:// |
|                      | //www.w3.org/1999/xhtm | www.allinahealth.org |
|                      |          l">           | /healthysetgo/heal/n |
|                      | <head runat="server">  | atural-remedies-for- |
|                      |        <title>         |  everyday-illnesses  |
|                      |  Natural Remedies for  |                      |
|                      |   Illness | Natural    |                      |
|                      | Cures | Allina Health  |                      |
|                      |         </tit          |                      |

**Observations:**
- Beautiful Soup is usually faster: in the run above it finishes every site in under a second, while Jina AI takes roughly 3.5 to 32 seconds on three of the four sites shown.
- Jina AI strips the HTML tags and returns only the page content.

*Different use cases may need different content so we cannot say one is better than the other.*
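
The timing observation above is read off the tqdm progress output. To compare scrapers more directly, each call can be timed with `time.perf_counter`. The sketch below is only an illustration: `time_scraper` is a hypothetical helper that is not part of this notebook, and `fake_scraper` is a stand-in so the example runs without network access. Swap in `beautiful_soup_scrape_url` or `scrape_jina_ai` to time real requests.

```python
import time
from typing import Callable, Tuple

def time_scraper(scrape_function: Callable[[str], str], url: str) -> Tuple[str, float]:
    """Run one scraper on one URL and return (content, elapsed seconds)."""
    start = time.perf_counter()
    content = scrape_function(url)
    elapsed = time.perf_counter() - start
    return content, elapsed

# Stand-in scraper (an assumption for illustration) so the example runs offline.
def fake_scraper(url: str) -> str:
    time.sleep(0.05)  # pretend the request takes about 50 ms
    return f"<html>content of {url}</html>"

content, seconds = time_scraper(fake_scraper, "https://example.com")
print(f"Fetched {len(content)} characters in {seconds:.2f}s")
```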

For this use case, I'll stick with the Jina AI Reader API: each website keeps its relevant content in different HTML tags, so a single BeautifulSoup extractor cannot be reused across sites.
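
To illustrate the reuse problem, the sketch below parses two made-up HTML snippets (not taken from the sites above) whose main text sits under different tags. The snippets, the CSS selectors, and the `extract_text` helper are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Two invented page layouts: the same kind of content under different tags.
site_a_html = '<html><body><div class="article"><p>Honey can soothe a cough.</p></div></body></html>'
site_b_html = '<html><body><section id="main"><li>Ginger may ease nausea.</li></section></body></html>'

def extract_text(html: str, css_selector: str) -> str:
    """Return the text of every element matching css_selector, joined by spaces."""
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(el.get_text(strip=True) for el in soup.select(css_selector))

# Each site needs its own selector.
print(extract_text(site_a_html, "div.article p"))
print(extract_text(site_b_html, "section#main li"))

# Reusing site A's selector on site B finds nothing.
print(repr(extract_text(site_b_html, "div.article p")))
```

With seven sites, that means seven selectors to write and maintain, which is the cost the Reader API avoids here.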

## Let's now structure relevant data from all the sources

Hand-writing extraction code for this much data, which is diverse and inconsistent across sites, is tedious. Instead, we can use an LLM to turn the free text into structured output.

|Model|Output Format|Medical Understanding|JSON Consistency|Deployment|
|---|---|---|---|---|
|GPT-4/GPT-4o|🟢 Excellent|🟢 Excellent|🟢 Excellent|API|
|Claude 3|🟢 Excellent|🟢 Excellent|🟢 Excellent|API|
|Mistral 7B|🟡 Moderate|🔴 Basic (add RAG)|🟡 With good prompt|Local (Ollama)|
|GPT-3.5|🟢 Good|🟡 Decent|🟡 Okay|API|
|Med-Alpaca|🔴 Poor (JSON)|🟢 Great (bio NER)|🔴 Poor|Local / HF|
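
Whichever model is used, the structuring step comes down to two pieces: a prompt that asks for a fixed JSON shape, and a check that the reply actually matches it. Here is a minimal sketch, assuming a hypothetical schema with `symptom` and `remedies` fields. The prompt wording and the names `build_extraction_prompt` and `parse_remedies` are illustrative, not part of any model's API, and the model call itself is left out.

```python
import json
from typing import Any, Dict, List

def build_extraction_prompt(page_text: str) -> str:
    """Build a prompt asking the model for a fixed JSON shape (hypothetical schema)."""
    return (
        "Extract every symptom and its home remedies from the text below.\n"
        "Reply with JSON only, as a list of objects shaped like "
        '{"symptom": "<string>", "remedies": ["<string>", ...]}.\n\n'
        f"Text:\n{page_text}"
    )

def parse_remedies(model_reply: str) -> List[Dict[str, Any]]:
    """Parse the model's reply and keep only well-formed entries."""
    try:
        data = json.loads(model_reply)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if not isinstance(data, list):
        return []
    valid = []
    for item in data:
        if (
            isinstance(item, dict)
            and isinstance(item.get("symptom"), str)
            and isinstance(item.get("remedies"), list)
            and all(isinstance(r, str) for r in item["remedies"])
        ):
            valid.append({"symptom": item["symptom"], "remedies": item["remedies"]})
    return valid

# Hand-written reply standing in for a real model response.
sample_reply = '[{"symptom": "cough", "remedies": ["honey"]}, {"symptom": 42}]'
print(parse_remedies(sample_reply))
```

Validating the reply matters most for the models rated lower on JSON consistency in the table, since malformed entries are dropped instead of breaking later steps.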
