
Improve web-crawling system #4

Closed · Torantulino opened this issue Mar 29, 2023 · 23 comments

Labels: help wanted, high priority

Comments

@Torantulino (Member)

Auto-GPT should be able to see links on the webpages it visits, in addition to the existing GPT-3-powered summary, so that it can browse further than one page away from Google.

@Torantulino added the help wanted label and removed the enhancement label Mar 29, 2023
@Torantulino (Member Author) commented Mar 30, 2023

Has anyone had any luck extracting key information from a website's text with GPT-3.5?

GPT4 does it easily, but it's too expensive for this task.

GPT-3.5:
[screenshot: GPT-3.5 output]

GPT-4:
[screenshot: GPT-4 output]

What we need is for the key information to be extracted from a webpage, rather than a description of its general purpose.

@Torantulino pinned this issue Apr 1, 2023
@Torantulino (Member Author)

This is currently a major limitation on Auto-GPT's capabilities.
It is essentially unable to browse the web reliably, and it frequently misses information even though it visits the correct URLs.

@0xcha05 (Contributor) commented Apr 1, 2023

Can you provide more context, please?
I've tried "extract key information from this webpage text" with Hacker News text, and it did a decent job.
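
For concreteness, a minimal sketch of that kind of extraction prompt, using the openai library's 2023-era ChatCompletion API (the prompt wording and the page_text variable are illustrative, not from the repo):

import openai

def extract_key_information(page_text):
    # Ask for specifics up front so the model doesn't fall back to describing the site
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Extract the key information (headlines, facts, figures) "
                       "from this webpage text; do not describe what the site is:\n\n"
                       + page_text,
        }],
    )
    return response["choices"][0]["message"]["content"]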

@Torantulino (Member Author) commented Apr 2, 2023

Current Implementation

Web browsing is currently handled in the following way:

import browse  # Auto-GPT's scraping and summarization helpers

def browse_website(url):
    summary = get_text_summary(url)
    links = get_hyperlinks(url)

    # Limit links to 5
    if len(links) > 5:
        links = links[:5]

    result = f"""Website Content Summary: {summary}\n\nLinks: {links}"""

    return result

def get_text_summary(url):
    text = browse.scrape_text(url)
    summary = browse.summarize_text(text)
    return """ "Result" : """ + summary

def get_hyperlinks(url):
    link_list = browse.scrape_links(url)
    return link_list

Where scrape_text uses BeautifulSoup to extract the text contained within a webpage.
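
(For context, a minimal sketch of what a BeautifulSoup-based scrape_text typically looks like; the actual implementation in the repo may differ:)

import requests
from bs4 import BeautifulSoup

def scrape_text(url):
    # Fetch the page and parse it
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.text, 'html.parser')

    # Drop non-content elements before extracting text
    for element in soup(['script', 'style']):
        element.decompose()

    # Collapse whitespace into readable plain text
    return ' '.join(soup.get_text().split())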

The Problem

The summarize_text function feeds GPT-3.5 this scraped text, and whilst GPT-3.5 does a great job of summarising it, it doesn't know what we're specifically after.

This leads to instances where Auto-GPT is looking for news on CNN, and instead receives a summary of what CNN is.

An Illustrated Example

[screenshot: Auto-GPT's browsing attempt]

  • Here, Auto-GPT is trying to browse a news site to find a technology-related news story.
  • It selects https://www.theverge.com/tech, a screenshot of which can be seen below:

[screenshot: The Verge's Tech subpage]

  • Even though there are clearly news stories contained within this page, the summary does not include any of them specifically; it only mentions that they are there.

Summary:

Website Content Summary: "Result" : The Tech subpage of The Verge provides news on hardware, apps, and technology from various companies including Google and Apple as well as smaller startups. It includes articles on various tech topics and links to the website's legal and privacy information.
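
A promising direction (a hedged sketch only; the question parameter, prompt wording, and create_chat_completion helper below are hypothetical, not the repo's API) would be to thread the agent's current goal into the summarizer, so it extracts what the agent is actually after instead of describing the site:

def browse_website(url, question):
    # Hypothetical signature: the agent passes its current goal along with the URL
    text = browse.scrape_text(url)
    summary = summarize_text_for_question(text, question)
    links = browse.scrape_links(url)[:5]
    return f"Website Content Summary: {summary}\n\nLinks: {links}"

def summarize_text_for_question(text, question):
    # Steer the model toward extraction rather than description (illustrative prompt)
    prompt = (
        f'Using the webpage text below, answer: "{question}". '
        f'Extract specific headlines, facts, and figures; do not describe what the site is.\n\n'
        f'{text}'
    )
    return browse.create_chat_completion(prompt)  # hypothetical helper

The agent already knows the task that led it to the page, so passing it through would cost nothing extra.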

@0xcha05 (Contributor) commented Apr 2, 2023

Okay, thanks for the context. I'll see what I can do.

@0xcha05 (Contributor) commented Apr 3, 2023

I tried to get it to work. I don't have access to the GPT-4 API, but with gpt-3.5-turbo it works for some pages and not for others, even with strict prompting. GPT-4 should be able to do it with strict prompting.

@djrba commented Apr 4, 2023

It might be easier to add third-party support, for example https://www.algolia.com/pricing/. If their API can do the heavy lifting, it will be easier; there are some OSS plugins as well.

@ryanpeach (Contributor)

#121

@ThatXliner

> It might be easier to add third-party support, for example https://www.algolia.com/pricing/. If their API can do the heavy lifting, it will be easier; there are some OSS plugins as well.

https://github.com/neuml/txtai

@JohnnyIrvin

@Torantulino I see that you're using BeautifulSoup for processing the content of the site. This won't handle data that has to be injected into a site via, say, JavaScript. I'm not sure exactly how, but some of the RESTful / GraphQL / etc. calls could be helpful for summarizing a page.

We could also consider pulling the metadata from the page and using it to determine how to prompt the summarization (see the sketch below). To be fair, I haven't looked at the prompting code yet and don't know if you're already doing this.
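
(A minimal sketch of the metadata idea, assuming BeautifulSoup; the tag names are just the common ones, not a confirmed list:)

from bs4 import BeautifulSoup

def extract_page_metadata(html):
    # Collect the title and common meta/OpenGraph tags to steer the summary prompt
    soup = BeautifulSoup(html, 'html.parser')
    metadata = {'title': soup.title.string if soup.title else None}
    for meta in soup.find_all('meta'):
        key = meta.get('name') or meta.get('property')
        if key in ('description', 'keywords', 'og:title', 'og:description', 'og:type'):
            metadata[key] = meta.get('content')
    return metadata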

@Nnnsightnnn commented Apr 7, 2023

@Torantulino

What about bs4 (BeautifulSoup4)?

"""
This module is designed to scrape text and hyperlinks from a given URL, summarize the text,
and return a limited number of hyperlinks. It uses the requests library for making HTTP requests,
BeautifulSoup for parsing HTML, and a custom 'browse' module for text summarization.
"""

import requests
from bs4 import BeautifulSoup
import browse

def browse_website(url, max_links=5, retries=3):
    # Use a session object for connection pooling and better performance
    with requests.Session() as session:
        scraped_data = scrape_data(url, session, retries)
        summary = get_text_summary(scraped_data['text'])
        links = get_hyperlinks(scraped_data['links'], max_links)

    result = f"""Website Content Summary: {summary}\n\nLinks: {links}"""
    return result

def get_text_summary(text):
    # Use the custom 'browse' module to summarize the text
    summary = browse.summarize_text(text)
    return f' "Result" : {summary}'

def get_hyperlinks(links, max_links):
    # Limit the number of hyperlinks returned to the specified maximum
    return links[:max_links]

def scrape_data(url, session, retries):
    # Make requests to the specified URL and parse the HTML content
    for i in range(retries):
        try:
            response = session.get(url, headers={'User-Agent': 'Mozilla/5.0'})
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            text = ' '.join([p.get_text() for p in soup.find_all('p')])
            links = [a.get('href') for a in soup.find_all('a')]
            return {'text': text, 'links': links}
        except requests.exceptions.RequestException as e:
            # Retry the request if an exception occurs, up to the specified retry limit
            if i == retries - 1:
                raise e

@JustinHinh

This is the single biggest issue I'm facing with GPT-3.5-turbo and Auto-GPT.

It is not reliably able to go to a website and pull out and summarize key information.

My use case is going to a job posting and pulling out a summary to compare to my resume. This would be a game changer.

@Nnnsightnnn commented Apr 8, 2023

I also modified the above code to handle JavaScript using Selenium.

pip install selenium


"""
This module is designed to scrape text and hyperlinks from a given URL, summarize the text,
and return a limited number of hyperlinks. It uses the requests and Selenium libraries for making
HTTP requests, BeautifulSoup for parsing HTML, and a custom 'browse' module for text summarization.
"""

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import browse

def browse_website(url, max_links=5, retries=3):
    # Use a session object for connection pooling and better performance
    with requests.Session() as session:
        scraped_data = scrape_data(url, session, retries)
        summary = get_text_summary(scraped_data['text'])
        links = get_hyperlinks(scraped_data['links'], max_links)

    result = f"""Website Content Summary: {summary}\n\nLinks: {links}"""
    return result

def get_text_summary(text):
    # Use the custom 'browse' module to summarize the text
    summary = browse.summarize_text(text)
    return f' "Result" : {summary}'

def get_hyperlinks(links, max_links):
    # Limit the number of hyperlinks returned to the specified maximum
    return links[:max_links]

def scrape_data(url, session, retries):
    # Make requests to the specified URL and parse the HTML content
    for i in range(retries):
        try:
            # Set up a headless Chrome browser
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            driver = webdriver.Chrome(options=chrome_options)

            # Navigate to the URL and render the JavaScript content
            driver.get(url)
            html = driver.page_source

            # Parse the rendered HTML with BeautifulSoup
            soup = BeautifulSoup(html, 'html.parser')
            text = ' '.join([p.get_text() for p in soup.find_all('p')])
            links = [a.get('href') for a in soup.find_all('a')]

            driver.quit()

            return {'text': text, 'links': links}
        except Exception as e:
            # Retry the request if an exception occurs, up to the specified retry limit
            if i == retries - 1:
                raise e

@gigabit-eth commented Apr 8, 2023

Looking for feedback on #507; it's my first time making an "official" PR, @Torantulino, but the project is beyond compelling and I had to get this out to the community. It's magical what happens when it's able to access information from as recent as today (April 8th, 2023) in its analysis, reasoning, and logic.

Edit: I made a 26-minute video showing what's possible with this PR. This isn't a minor incremental bug fix, it's a MAJOR unlock!
https://youtu.be/yM_yxVn4y2I

@IsleOf commented Apr 9, 2023

A few tips for scraping here: use Selenium with OpenCV, and make sure to force a scroll to the bottom of the page so everything loads (see the sketch below). Actually, the current ChatGPT sometimes gives correct scraping results instead of code to scrape, so it's able to do it, just jailed for some reason.
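
(A minimal sketch of the scroll-to-bottom trick with Selenium; driver is an already-configured WebDriver, and the pause length is an arbitrary choice:)

import time

def scroll_to_bottom(driver, pause=2.0):
    # Keep scrolling until the page height stops growing, so lazy content loads
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height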

@Nnnsightnnn

Here's a pull request for a fix. It uses Pyppeteer to navigate JavaScript sites. Lots of adaptability and flexibility.

#508
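
(For the general idea, a minimal Pyppeteer sketch, not the PR's actual code: launch a headless Chromium, render the page, and hand the final HTML to the existing parser:)

import asyncio
from pyppeteer import launch

async def fetch_rendered_html(url):
    # Launch headless Chromium, render the page, and return the final HTML
    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url, {'waitUntil': 'networkidle0'})
        return await page.content()
    finally:
        await browser.close()

# Usage: html = asyncio.run(fetch_rendered_html("https://example.com"))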

@IsleOf commented Apr 9, 2023

Pix2Struct to capture structure, plus plain OCR, could give acceptable results if everything else fails.

@Nnnsightnnn

Where are we at on logging into websites during web-crawling?

@IsleOf commented Apr 10, 2023

Selenium, Puppeteer, or something similar for web UI logins would be optimal, I guess, IMHO.

@jheathco

Mechanize would work well, although it isn't functional with JS.
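
(A minimal sketch of what that could look like with the mechanize library; note it fetches raw HTML only, with no JavaScript execution:)

import mechanize

def fetch_with_mechanize(url):
    # Stateful browser: handles cookies, redirects, and forms, but no JS
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Mozilla/5.0')]
    response = br.open(url)
    html = response.read()
    links = [link.absolute_url for link in br.links()]
    return html, links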

@myInstagramAlternative

+1 for a Selenium browser: it supports JS injection, and it might be better than Google Console if you are running it from a residential IP. Just add jitter and a wait time before reading, and you can easily save page_source as HTML or parse it later with bs4, as shown below.
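
(A minimal sketch of the jitter-and-save idea; driver is an already-configured Selenium WebDriver, and the wait bounds and output filename are arbitrary choices:)

import random
import time
from bs4 import BeautifulSoup

def polite_fetch(driver, url, min_wait=2.0, max_wait=6.0):
    driver.get(url)
    # Random jitter before reading, so requests look less bot-like
    time.sleep(random.uniform(min_wait, max_wait))
    html = driver.page_source

    # Save the raw HTML now; it can be re-parsed with bs4 at any point later
    with open('page.html', 'w', encoding='utf-8') as f:
        f.write(html)
    return BeautifulSoup(html, 'html.parser')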

OpenCV is too OP for this and resource-intensive; there's no point in using it, except for images, and as I said, no point.

I just found out about this project; I might look further into it over the weekend.

@Pwuts (Member) commented Apr 17, 2023

Selenium + Chromium + Firefox are now in master. Closing as completed! :)

@Pwuts closed this as completed Apr 17, 2023
@holyr00d

How much flexibility do we have to configure the default code to direct the AI where we want it to go and grab what we want along the way?
