<a href="https://colab.research.google.com/github/trancethehuman/ai-workshop-code/blob/main/Web_scraping_for_LLM_in_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Let's set up our test: Get my competitors' pricing from their websites

This is real. I'm not doing this just for shits and giggles.

I'm building an interactive learning platform where content is taught using AI. Seems like everyone else is focusing on augmenting the authoring process and not the learning experience, but whatever.

In [None]:
competitor_sites = [
    {
        "name": "Poshmark",
        "url": "https://poshmark.com/category/Women-Dresses"
    },
    {
        "name": "dpop",
        "url": "https://www.depop.com/category/womens/dresses/"
    },
    {
        "name": "vinted",
        "url": "https://www.vinted.com/catalog/10-dresses"
    },

]

### Let's set up cost calculations

We want to compare the scrapers side by side on cost.

We can estimate how much each one will cost by counting tokens with OpenAI's `tiktoken` library.

(Side note: as of today, OpenAI hasn't updated `tiktoken` with the actual tokenization used in `gpt-4o`, so we'll guesstimate using the `gpt-4` encoding, `cl100k_base`.)

In [None]:
pip install tiktoken --quiet


In [None]:
import tiktoken

def count_tokens(input_string: str) -> int:
    tokenizer = tiktoken.get_encoding("cl100k_base")

    tokens = tokenizer.encode(input_string)

    return len(tokens)

def calculate_cost(input_string: str, cost_per_million_tokens: float = 5) -> float:
    num_tokens = count_tokens(input_string)

    total_cost = (num_tokens / 1_000_000) * cost_per_million_tokens

    return total_cost

# Example usage:
input_string = "What's the difference between beer nuts and deer nuts? Beer nuts are about 5 dollars. Deer nuts are just under a buck."
cost = calculate_cost(input_string)
print(f"The total cost for using gpt-4o is: $US {cost:.6f}")

The total cost for using gpt-4o is: $US 0.000135
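
The arithmetic behind that figure is easy to check by hand: at the default of $5 per million tokens, $0.000135 corresponds to 27 tokens. Here is a minimal sketch of the same formula that starts from a token count, so it can be tried without installing `tiktoken`. The name `cost_from_token_count` is hypothetical and not part of the notebook.

```python
def cost_from_token_count(num_tokens: int, cost_per_million_tokens: float = 5.0) -> float:
    """Same formula as calculate_cost, but starting from a known token count."""
    return (num_tokens / 1_000_000) * cost_per_million_tokens

# 27 tokens at $5 per million tokens
print(f"{cost_from_token_count(27):.6f}")  # prints 0.000135
```

Passing a different `cost_per_million_tokens` is all it takes to compare against another model's pricing.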


### Additionally, I want to see the test results in a nice table, so let's set that up.

In [None]:
pip install prettytable tqdm --quiet

In [None]:
from typing import Any, Callable, Dict, List
from prettytable import PrettyTable, ALL
from tqdm import tqdm

def view_scraped_content(
    scrape_url_functions: List[Dict[str, Any]],
    sites_list: List[Dict[str, str]],
    characters_to_display: int = 500,
    table_max_width: int = 50,
) -> List[Dict[str, Any]]:
    # Each entry in scrape_url_functions is {"name": str, "function": Callable[[str], str]}.
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    cost_table_headers = ["Site Name"] + [f"{func['name']} cost" for func in scrape_url_functions]

    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    cost_table = PrettyTable()
    cost_table.field_names = cost_table_headers

    scraped_data = []

    for site in sites_list:
        content_row = [site['name']]
        cost_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                try:
                    content = scrape_function['function'](site['url'])
                    content_snippet = content[:characters_to_display]
                    content_row.append(content_snippet)

                    cost = calculate_cost(content)
                    cost_row.append(f"${cost:.6f}")

                    site_data["sites"].append({"name": function_name, "content": content})
                except Exception as e:
                    error_message = f"Error: {str(e)}"
                    content_row.append(error_message)
                    cost_row.append("Error")

                    site_data["sites"].append({"name": function_name, "content": error_message})
                    continue

        content_table.add_row(content_row)
        cost_table.add_row(cost_row)
        scraped_data.append(site_data)

    content_table.max_width = table_max_width
    content_table.hrules = ALL

    cost_table.max_width = table_max_width
    cost_table.hrules = ALL

    print("Content Table:")
    print(content_table)

    print("\nCost Table:\nThis is how much it would cost to use gpt-4o to parse this content for extraction.")
    print(cost_table)

    return scraped_data
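
`view_scraped_content` returns one dict per site, and each dict's `"sites"` list (despite the key name) holds one entry per scraper. That shape makes it straightforward to total the estimated cost per scraper. The sketch below assumes that shape. `total_cost_by_scraper` is a hypothetical helper, and skipping entries that start with `"Error: "` is a heuristic based on how the function above stores failures.

```python
from collections import defaultdict
from typing import Callable, Dict, List

def total_cost_by_scraper(
    scraped_data: List[dict],
    cost_fn: Callable[[str], float],
) -> Dict[str, float]:
    """Sum cost_fn(content) per scraper across all sites."""
    totals: Dict[str, float] = defaultdict(float)
    for site_data in scraped_data:
        for entry in site_data["sites"]:
            # Failed scrapes are stored as "Error: ..." strings; skip them so
            # error messages are not counted as scraped content.
            if entry["content"].startswith("Error: "):
                continue
            totals[entry["name"]] += cost_fn(entry["content"])
    return dict(totals)
```

With the notebook's own helpers, this would be called as `total_cost_by_scraper(all_content, calculate_cost)` once the scrapers have run.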



## Set up all the scrapers

Each scraper is a function that takes a URL and returns the page content as a string.

### Beautiful Soup

In [None]:
pip install requests beautifulsoup4 --quiet

In [None]:
# Beautiful Soup utility functions

import requests
from bs4 import BeautifulSoup

def beautiful_soup_scrape_url(url: str):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return str(soup)
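
`beautiful_soup_scrape_url` returns the full markup (`str(soup)`), so every tag and attribute gets counted as tokens in the cost table. As a rough point of comparison, here is a sketch that keeps only the visible text. It is written with the standard library's `html.parser` rather than Beautiful Soup so it runs without extra installs. `VisibleTextExtractor` and `visible_text` are hypothetical names, not part of the notebook.

```python
from html.parser import HTMLParser
from typing import List

class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped tag
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)

def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Dress</p><script>var x = 1;</script><p>$25</p></body></html>"
)
print(visible_text(sample))
```

On the sample above this prints `Dress` and `$25` on separate lines. Pages that render their listings with JavaScript would still come back mostly empty, since nothing here executes scripts.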


### Reader API by Jina AI

Let's set up Jina AI's scrape method. This one is dead easy.

In [None]:
import requests

def scrape_jina_ai(url: str) -> str:
  response = requests.get("https://r.jina.ai/" + url)
  return response.text
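
One thing `scrape_jina_ai` does not do is check whether the request succeeded: `requests.get` does not raise on an HTTP error status, so an error page comes back as ordinary text. In the run recorded further down, the Jina AI result for Poshmark is a "403 ERROR" page rather than listings. The sketch below is a simple heuristic for flagging such content before it is priced or sent to a model. `looks_blocked` and `BLOCK_MARKERS` are hypothetical names, and the markers are only the two strings visible in that output.

```python
from typing import Sequence

# Marker strings taken from the error page shown in the comparison table
BLOCK_MARKERS = ("403 ERROR", "The request could not be satisfied")

def looks_blocked(content: str, markers: Sequence[str] = BLOCK_MARKERS) -> bool:
    """Heuristic: True if the start of the content contains a known error marker."""
    head = content[:1000]  # error pages are short, so the start is enough
    return any(marker in head for marker in markers)

print(looks_blocked("Title: The request could not be satisfied\n\n403 ERROR"))  # prints True
```

A check on `response.status_code` inside the scraper would be more reliable where the service passes the status through. This string check is only a fallback for content that has already been fetched.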

### Firecrawl from Mendable

In [None]:
pip install firecrawl-py --quiet

In [None]:
import firecrawl
import getpass

FIRECRAWL_API_KEY = getpass.getpass("Mendable API Key: ")

def scrape_firecrawl(url: str):
    app = firecrawl.FirecrawlApp(api_key=FIRECRAWL_API_KEY)
    scraped_data = app.scrape_url(url)["markdown"]
    return scraped_data


## Let's run all the scrapers and display them in our comparison table

In [None]:
list_of_scraper_functions = [
    {"name": "Beautiful Soup", "function": beautiful_soup_scrape_url},
    {"name": "Jina AI", "function": scrape_jina_ai},
]

all_content = view_scraped_content(list_of_scraper_functions, competitor_sites, 700, 20)

Processing site Poshmark using Beautiful Soup: 100%|██████████| 1/1 [00:02<00:00,  2.36s/it]
Processing site Poshmark using Jina AI: 100%|██████████| 1/1 [00:00<00:00,  1.40it/s]
Processing site dpop using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  5.59it/s]
Processing site dpop using Jina AI: 100%|██████████| 1/1 [00:03<00:00,  3.60s/it]
Processing site vinted using Beautiful Soup: 100%|██████████| 1/1 [00:01<00:00,  1.91s/it]
Processing site vinted using Jina AI: 100%|██████████| 1/1 [00:04<00:00,  4.83s/it]

Content Table:
+-----------+------------------------+----------------------+
| Site Name | Beautiful Soup content |   Jina AI content    |
+-----------+------------------------+----------------------+
|  Poshmark |    <!DOCTYPE html>     |  Title: The request  |
|           |                        |     could not be     |
|           | <html data-vue-meta="% |      satisfied       |
|           | 7B%22lang%22:%7B%221%2 |                      |
|           | 2:%22en%22%7D,%22xml:l | URL Source: https:// |
|           | ang%22:%7B%221%22:%22e | poshmark.com/categor |
|           | n%22%7D,%22xmlns%22:%7 |   y/Women-Dresses    |
|           | B%221%22:%22http://www |                      |
|           | .w3.org/1999/xhtml%22% |  Markdown Content:   |
|           |  7D,%22data-vue-meta-  |      403 ERROR       |
|           | server-rendered%22:%7B |      ---------       |
|           |  %221%22:true%7D%7D"   |                      |
|           | data-vue-meta-server-  |        * * *    




## Now let's use OpenAI and extract just the information we need

Let's see how accurate the extraction is for each provider.

First, we create an extraction function that uses OpenAI's gpt-4o to pull only the pricing information out of each provider's scraped content for each site.

In [None]:
pip install openai --quiet


In [None]:
import getpass
from openai import OpenAI

OPENAI_API_KEY = getpass.getpass('Enter your OpenAI API key: ')

client = OpenAI(api_key=OPENAI_API_KEY)

def extract(user_input: str) -> str:
    # System prompt describing the JSON shape we want back
    entity_extraction_system_message = {"role": "system", "content": "Get me the three pricing tiers from this website's content, and return as a JSON with three keys: {cheapest: {name: str, price: float}, middle: {name: str, price: float}, most_expensive: {name: str, price: float}}"}

    messages = [
        entity_extraction_system_message,
        {"role": "user", "content": user_input},
    ]

    # JSON mode asks the model to reply with a JSON object
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=False,
        response_format={"type": "json_object"},
    )

    # The reply comes back as a JSON string
    return response.choices[0].message.content

Enter your OpenAI API key: ··········
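
`extract` returns the model's reply as a JSON string, so comparing providers means parsing it and checking that the three tiers are actually there. The sketch below does that with the standard library. `parse_pricing` and `EXPECTED_TIERS` are hypothetical names, and the sample reply is made up for illustration, not real model output.

```python
import json
from typing import Any, Dict

# Keys requested in the system prompt
EXPECTED_TIERS = ("cheapest", "middle", "most_expensive")

def parse_pricing(raw_json: str) -> Dict[str, Any]:
    """Parse the model's reply and check that each tier has a name and a price."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    for tier in EXPECTED_TIERS:
        entry = data.get(tier)
        if not isinstance(entry, dict) or "name" not in entry or "price" not in entry:
            raise ValueError(f"Missing or malformed tier: {tier}")
    return data

# Made-up reply in the shape the prompt asks for
sample_reply = (
    '{"cheapest": {"name": "Basic", "price": 0.0}, '
    '"middle": {"name": "Plus", "price": 10.0}, '
    '"most_expensive": {"name": "Pro", "price": 30.0}}'
)
print(parse_pricing(sample_reply)["middle"]["price"])  # prints 10.0
```

Raising on a missing tier makes a failed extraction visible in the comparison instead of letting it pass as a valid but empty result.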


### Then, we create a utility function to display that content in a table.

In [None]:
from typing import Any, Dict, List

def display_extracted_content(results: List[Dict[str, Any]], num_objects: int):
    table = PrettyTable()
    table.field_names = ["Site", "Provider Name", "Extracted Content"]

    # Ensure num_objects does not exceed the length of the results list
    num_objects = min(num_objects, len(results))

    # Process the specified number of items from the results list with a progress bar
    for result in tqdm(results[:num_objects], desc="Processing results"):
        provider_name = result["provider"]

        for site in result["sites"]:
            function_name = site["name"]
            content = site["content"]

            # Progress bar for each function
            for _ in tqdm(range(1), desc=f"Extracting content with {provider_name} for {function_name}"):
                extracted_content = extract(content)
                table.add_row([provider_name, function_name, extracted_content])

    table.max_width = 50  # Set the maximum width for better display
    table.hrules = ALL

    print("Extracted Content Table:")
    print(table)

In [None]:

display_extracted_content(all_content, num_objects=9)

Processing results:   0%|          | 0/3 [00:00<?, ?it/s]
Extracting content with Poshmark for Beautiful Soup:   0%|          | 0/1 [00:00<?, ?it/s]
Processing results:   0%|          | 0/3 [00:00<?, ?it/s]


NotFoundError: Error code: 404 - {'error': {'message': 'The model `gpt-4o` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

## Bonus: Scrapegraph-ai

Scrapegraph-ai takes care of the entire flow (from scrape to extraction). It's also interesting that it's node-based, and can run off of local models (Ollama supported). But I couldn't find a way to get cost estimates based on tokens used.

Demo link: https://scrapegraph-ai-demo.streamlit.app/
