<a href="https://colab.research.google.com/github/RazanHL/webscraping_for_ai_analysis/blob/main/webscraping_ai_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install -q transformers

In [1]:
# Importing required libraries
import requests
from bs4 import BeautifulSoup
import re
from transformers import pipeline

In [5]:
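# Scraping content from a website page
def scrape_website(url, timeout=10):
    """Return the whitespace-normalized text of all <p> tags on the page, or "" on failure."""
    try:
        # A timeout keeps the cell from hanging on an unresponsive server
        response = requests.get(url, timeout=timeout)
    except requests.RequestException as exc:
        print(f"Failed to fetch the webpage: {exc}")
        return ""
    if response.status_code != 200:
        print(f"Failed to fetch the webpage (HTTP {response.status_code}).")
        return ""
    soup = BeautifulSoup(response.content, "html.parser")
    # Extract text content from paragraphs
    paragraphs = soup.find_all("p")
    text_content = " ".join(p.get_text() for p in paragraphs)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text_content).strip()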
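
Because the scraper joins only `<p>` tags, section headings are dropped, and sentences such as "This is a dataclass that..." lose their subject. The sketch below shows one possible way to keep headings next to their paragraphs. It uses the standard library's `html.parser` so it runs without a network call or extra installs. The class name `HeadingAwareExtractor`, the helper `extract_with_headings`, and the sample HTML are all hypothetical and hand-written for illustration; they are not the real page markup. With BeautifulSoup, a similar effect should be achievable by passing a list of tag names to `find_all`.

```python
from html.parser import HTMLParser


class HeadingAwareExtractor(HTMLParser):
    """Collect text from headings and paragraphs, in document order."""

    KEEP = {"h1", "h2", "h3", "h4", "p"}

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks
        self._buffer = None   # pieces of the block being read, or None outside a block

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.KEEP and self._buffer is not None:
            # Normalize whitespace inside the block
            text = " ".join("".join(self._buffer).split())
            if text:
                # Headings become "Title:" so the next paragraph keeps its subject
                self.blocks.append(text if tag == "p" else text + ":")
            self._buffer = None


def extract_with_headings(html):
    parser = HeadingAwareExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.blocks)


# Hand-written sample markup (an assumption, not the real page)
sample_html = """
<h3>Intermediate Steps</h3>
<p>These represent previous agent actions and corresponding outputs
from this CURRENT agent run.</p>
<h3>AgentExecutor</h3>
<p>The agent executor is the runtime for an agent.</p>
"""
sample_text = extract_with_headings(sample_html)
print(sample_text)
```

Keeping the heading text in the context may give an extractive model a better chance of linking a term in the question to the sentence that defines it, though this is not guaranteed.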


In [6]:
# Scraping content from the LangChain documentation as an example
url = "https://python.langchain.com/v0.1/docs/modules/agents/concepts/"
website_content = scrape_website(url)

In [7]:
website_content

"The core idea of agents is to use a language model to choose a sequence of actions to take. In chains, a sequence of actions is hardcoded (in code). In agents, a language model is used as a reasoning engine to determine which actions to take and in which order. There are several key components here: LangChain has several abstractions to make working with agents easy. This is a dataclass that represents the action an agent should take. It has a tool property (which is the name of the tool that should be invoked) and a tool_input property (the input to that tool) This represents the final result from an agent, when it is ready to return to the user. It contains a return_values key-value mapping, which contains the final agent output. Usually, this contains an output key containing a string that is the agent's response. These represent previous agent actions and corresponding outputs from this CURRENT agent run. These are important to pass to future iteration so the agent knows what work

In [28]:
# Loading a pre-trained question-answering model
qa_model = pipeline(
    task="question-answering",
    model="distilbert/distilbert-base-cased-distilled-squad",
    clean_up_tokenization_spaces=True)

In [24]:
qa_model

<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7f469c5b6ef0>

In [9]:
# Use the scraped text as the context for extractive question answering
# (no fine-tuning is performed; the pre-trained model is used as-is)
context = website_content


In [25]:
question = "What services does the website offer?"
result = qa_model(question=question, context=context)
print(f"Question: {question}\nAnswer: {result['answer']}\n")

Question: What services does the website offer?
Answer: built-in agents see agent types



In [29]:
result

{'score': 6.753708021278726e-06,
 'start': 1490,
 'end': 1521,
 'answer': 'built-in agents see agent types'}
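
The score of about 6.8e-06 suggests the model found no convincing answer span for this question, which is plausible given that the page describes agent concepts rather than services. A minimal sketch of filtering such results follows. The helper name `answer_or_none` and the `min_score=0.1` cutoff are assumptions chosen for illustration, not values recommended by the library.

```python
def answer_or_none(result, min_score=0.1):
    """Return the answer string, or None when the score is below min_score."""
    if result["score"] < min_score:
        return None
    return result["answer"]


# Result dict copied from the cell output above
low_confidence = {
    "score": 6.753708021278726e-06,
    "start": 1490,
    "end": 1521,
    "answer": "built-in agents see agent types",
}
print(answer_or_none(low_confidence))
```

A suitable threshold depends on the model and the text, so it is worth checking scores on a few questions with known answers before settling on a cutoff.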

In [13]:
question = "What is Intermediate Steps?"
result = qa_model(question=question, context=context)
print(f"Question: {question}\nAnswer: {result['answer']}\n")

Question: What is Intermediate Steps?
Answer: one required key
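
An answer like "one required key" is hard to judge without seeing where in the context it came from. Since the pipeline result includes `start` and `end` character offsets, one way to check it is to print the surrounding text. The helper `show_span` and the toy context below are hypothetical and hand-written for illustration; in the notebook it would presumably be called as `show_span(context, result)`.

```python
def show_span(context, result, window=80):
    """Return the answer in brackets with up to `window` characters on each side."""
    start, end = result["start"], result["end"]
    left = context[max(0, start - window):start]
    right = context[end:end + window]
    return f"...{left}[{context[start:end]}]{right}..."


# Toy context and result (assumptions, not output from the real model)
toy_context = (
    "The agent executor is the runtime for an agent. "
    "It calls the agent and executes the chosen actions."
)
answer = "the runtime for an agent"
start = toy_context.index(answer)
toy_result = {"start": start, "end": start + len(answer), "answer": answer}
print(show_span(toy_context, toy_result, window=25))
```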



In [15]:
question = "What is AgentExecutor?"
result = qa_model(question=question, context=context)
print(f"Question: {question}\nAnswer: {result['answer']}\n")

Question: What is AgentExecutor?
Answer: the runtime for an agent
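
The three question cells repeat the same pattern, so they could be folded into a small loop that also reports the score next to each answer. This is a sketch only. `ask_all` is a hypothetical helper, and `fake_qa` is a stand-in so the example runs without downloading a model; in the notebook, `qa_model` and `website_content` would be passed instead.

```python
def ask_all(qa, questions, context):
    """Run each question against the same context; return (question, answer, score) rows."""
    rows = []
    for question in questions:
        result = qa(question=question, context=context)
        rows.append((question, result["answer"], result["score"]))
    return rows


# Stand-in for the real pipeline (an assumption for this sketch):
# it returns the first sentence of the context with a fixed score.
def fake_qa(question, context):
    return {"answer": context.split(".")[0], "score": 0.5}


rows = ask_all(fake_qa, ["What is AgentExecutor?"], "The runtime for an agent. More text.")
for question, answer, score in rows:
    print(f"{score:.3f}  {question} -> {answer}")
```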

