[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/13IVFrSBIxutja_kiQo4_ORWplWOlnsO3/view?usp=sharing)

# Web Scraping with AI


## 1. Traditional Web Scraping using Beautiful Soup


> Beautiful Soup (Python) parses HTML/XML, turning it into a navigable structure. This lets you easily search and extract data from websites, making it useful for web scraping tasks.




In [None]:
!pip install requests beautifulsoup4 tiktoken

In [5]:
import requests
from bs4 import BeautifulSoup

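def beautiful_soup_scrape_url(url):
    """Fetch a page and return its parsed HTML as a string."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.content, 'html.parser')
    return str(soup)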

In [6]:
url = "https://buildfastwithai.com/courses"

In [7]:
data = beautiful_soup_scrape_url(url)
print(data)

<!DOCTYPE html>

- Introduction to GPT-4o

- Gain a comprehensive understanding of GPT-4o's capabilities, key features, and the innovative technology behind its development.

- Utilizing GPT-4o with APIs

- Learn how to effectively integrate GPT-4o's features into your projects using APIs, opening up new possibilities for your applications.

- Exploring GPT-4o's Vision Capabilities

- Discover GPT-4o's impressive vision capabilities and how it seamlessly combines visual processing with natural language understanding.

- Building AI Applications with GPT-4o</p></div><article class="max-w-max mt-2 rounded-full p-[1px] dark:bg-gradient-to-r from-blue-700 to-violet-400"><div class="rounded-full px-3 py-1 bg-muted text-xs">Free</div></article></div></div></a></div><script>$RS("S:5","P:5")</script><div hidden="" id="S:6"><a class="hover:scale-105 duration-200" href="/courses/info/10x-developer-productivity-with-ai"><div class="mx-auto flex h-full max-w-sm flex-col overflow-hidden rounded-xl 

In [8]:
import requests
from bs4 import BeautifulSoup

def scrape_headings(url):
    """
    Scrape all headings (h1 to h6) from a webpage.

    :param url: The URL of the webpage to scrape
    :return: A list of dictionaries containing heading tag and text
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    headings = []
    # Collect every heading tag in document order
    for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
        headings.append({
            'tag': tag.name,
            'text': tag.get_text(strip=True)
        })

    return headings


url = "https://buildfastwithai.com/courses"
data = scrape_headings(url)
print(data)

[{'tag': 'h2', 'text': 'Free AI Resources'}, {'tag': 'h3', 'text': 'Support'}, {'tag': 'h3', 'text': 'Company'}, {'tag': 'h3', 'text': 'Legal'}, {'tag': 'h2', 'text': 'Free AI Resources'}]
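
None of the headings returned above is a course title. The raw HTML printed earlier shows each course card wrapped in an `<a>` element whose `href` starts with `/courses/info/`, so a CSS selector is one way to target those cards directly. The sketch below runs on a small inline snippet modeled on that output. The snippet, the link text in it, and the function name `extract_course_links` are illustrative assumptions, not the site's actual markup.

```python
from bs4 import BeautifulSoup

# Illustrative snippet modeled on the raw HTML printed earlier; the real page differs.
SAMPLE_HTML = """
<div>
  <a class="hover:scale-105 duration-200" href="/courses/info/10x-developer-productivity-with-ai">
    <p>10x Developer Productivity with AI</p>
  </a>
  <a class="hover:scale-105 duration-200" href="/courses/info/sample-course">
    <p>Sample Course</p>
  </a>
  <a href="/about">About</a>
</div>
"""

def extract_course_links(html):
    """Return the href and visible text of every link under /courses/info/."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Attribute-prefix selector: only anchors whose href starts with /courses/info/
    for anchor in soup.select('a[href^="/courses/info/"]'):
        links.append({
            "href": anchor["href"],
            "text": anchor.get_text(strip=True),
        })
    return links

course_links = extract_course_links(SAMPLE_HTML)
print(course_links)
```

Selecting by `href` prefix rather than by the long utility class names is a deliberate choice here: class names on styled pages tend to change more often than URL paths.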




---


## 2. Scraping with ScrapeGraphAI




> ScrapeGraphAI uses AI to simplify web scraping. Instead of writing extraction code by hand, you describe the data you want in a prompt, and it works out how to extract it. It works on live websites and on local files such as saved HTML.



In [10]:
%%capture
!pip install scrapegraphai --upgrade
!apt install chromium-chromedriver
!pip install nest_asyncio
!pip install playwright
!playwright install

In [11]:
import nest_asyncio
nest_asyncio.apply()

In [12]:
from google.colab import userdata

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

### 2.2.1 SmartScraperGraph

A single-page scraper that only needs a user prompt and an input source.



In [13]:
graph_config_openai = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
        "temperature":0,
    },
    "verbose":True,
}

In [14]:
from scrapegraphai.graphs import SmartScraperGraph


smart_scraper_graph = SmartScraperGraph(
    prompt="List all courses and their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://buildfastwithai.com/courses",
    config=graph_config_openai
)

result = smart_scraper_graph.run()

--- Executing Fetch Node ---
--- (Fetching HTML from: https://buildfastwithai.com/courses) ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|██████████| 1/1 [00:35<00:00, 35.63s/it]


In [16]:
import json

# Summarize https://www.buildfastwithai.com/genai-course

smart_scraper_graph = SmartScraperGraph(
    prompt="Give me a summary of this webpage",
    source="https://www.buildfastwithai.com/genai-course",
    config=graph_config_openai
)

result = smart_scraper_graph.run()
print(json.dumps(result,indent=2))

--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.buildfastwithai.com/genai-course) ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|██████████| 1/1 [00:03<00:00,  3.12s/it]

{
  "summary": {
    "title": "Build Fast with AI",
    "body": "This webpage offers a Crash Course on Generative AI starting from July 6th, with a 4-week cohort. The course is led by Satvik, an AI expert from IIT Delhi, and aims to help learners build innovative Gen AI applications. The course covers various topics such as building AI chatbots, fine-tuning models, AI agents, multimodal models, and image models. The course benefits include acquiring skills sought after in the tech industry, developing a portfolio, and gaining a solid foundation in Generative AI. The fee for the course is Rs 15,000 with a 25% discount for the cohort. The course is designed for aspiring AI developers, professionals, students, and hobbyists with no prior AI experience required. The course includes practical sessions and projects, and a certification is provided at the end. The course aims to equip learners with the knowledge and skills valued by employers in the AI domain."
  }
}





In [17]:
smart_scraper_graph = SmartScraperGraph(
    prompt="List of the products and their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://www.orae.in/",
    config=graph_config_openai
)

result = smart_scraper_graph.run()

print(json.dumps(result,indent=2))

--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.orae.in/) ---
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|██████████| 1/1 [00:06<00:00,  6.20s/it]

{
  "products": [
    {
      "name": "Sequence Cork Yoga Mat",
      "description": "Regular price Rs. 5,499.00, Sale price Rs. 5,499.00, Unit price / per"
    },
    {
      "name": "Rise Cork Yoga Mat",
      "description": "Regular price Rs. 5,499.00, Sale price Rs. 5,499.00, Unit price / per"
    },
    {
      "name": "Pose Cork Yoga Mat",
      "description": "Regular price Rs. 5,499.00, Sale price Rs. 5,499.00, Unit price / per"
    },
    {
      "name": "Cork Support Block",
      "description": "Regular price Rs. 999.00, Sale price Rs. 999.00, Unit price / per"
    },
    {
      "name": "Cork Yoga Roller",
      "description": "Regular price Rs. 1,199.00, Sale price Rs. 1,199.00, Unit price / per"
    },
    {
      "name": "Yoga Starter Kit",
      "description": "Regular price Rs. 6,999.00, Sale price Rs. 6,999.00, Unit price / per"
    },
    {
      "name": "Yoga Essentials Kit",
      "description": "Regular price Rs. 7,199.00, Sale price Rs. 7,199.00, Unit price / per




### 2.2.2 SpeechGraph
Scrapes a web page and turns the generated summary into an audio file.

In [None]:
from scrapegraphai.graphs import SpeechGraph

# Same LLM settings as before, plus a text-to-speech model and an output path.
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": OPENAI_API_KEY,
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "website_summary.mp3",
}

speech_graph = SpeechGraph(
    prompt="Make an Audio Summary on this blog",
    source="https://www.marktechpost.com/2024/06/18/meet-deepseek-coder-v2-by-deepseek-ai-the-first-open-source-ai-model-to-surpass-gpt4-turbo-in-coding-and-math-supporting-338-languages-and-128k-context-length/",
    config=graph_config
)

In [None]:
result = speech_graph.run()
answer = result.get("answer", "No answer found")

In [None]:
from IPython.display import Audio, display
wn = Audio("website_summary.mp3", autoplay=True)
display(wn)



---


## 3. Web Scraping using Jina



> Jina AI's Reader cleans web pages for LLMs. You prefix a URL with `https://r.jina.ai/`, and it strips the extra elements and returns the main content as Markdown.



### 3.1 Intro to Jina

In [18]:
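def scrape_jina_ai(url: str) -> str:
    """Fetch a page through the Jina Reader endpoint, which returns cleaned Markdown."""
    response = requests.get("https://r.jina.ai/" + url, timeout=60)
    response.raise_for_status()
    return response.text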

In [19]:
result = scrape_jina_ai("https://www.buildfastwithai.com/genai-course")
print(result)

Title: Build Fast with AI

URL Source: https://www.buildfastwithai.com/genai-course

Markdown Content:
[![Image 1](https://www.buildfastwithai.com/_next/image?url=%2Fbg_particles.png&w=3840&q=75)](https://www.buildfastwithai.com/genai-course/buy)

Crash Course on Generative AI
-----------------------------

Transform Your Ideas into Real-World Applications with Generative AI

Ready to elevate your dreams?

Starting from 6th July | 4 Week Cohort

Our Learners Work at World-Class Companies
------------------------------------------

About the course
----------------

Want to build innovative Gen AI applications but don't know where to start? I'm Satvik, an IIT Delhi alumnus and AI expert who has trained 3000+ people. Join our meticulously crafted 4-week course to go from zero to pro, learning by building real-world projects.

Ready to dive in? Let's build fast with AI, together!

About the  
Mentor
------------------

Satvik is the founder of Build Fast with AI and the lead mentor for th
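
The output above starts with a `Title:` line and a `URL Source:` line, followed by a `Markdown Content:` marker and the page body. A minimal sketch of splitting that text into fields is shown below. It assumes the layout seen in this one response, which may not hold for every page, and `parse_reader_output` is a hypothetical helper name.

```python
def parse_reader_output(text):
    """Split reader output into title, source URL, and Markdown body."""
    parsed = {"title": None, "url": None, "markdown": ""}
    # Everything before the marker is treated as the header block
    header, marker, body = text.partition("Markdown Content:")
    for line in header.splitlines():
        if line.startswith("Title:"):
            parsed["title"] = line[len("Title:"):].strip()
        elif line.startswith("URL Source:"):
            parsed["url"] = line[len("URL Source:"):].strip()
    # If the marker is missing, fall back to treating the whole text as the body
    parsed["markdown"] = body.strip() if marker else text.strip()
    return parsed

# Sample text copied from the first lines of the output above
sample = (
    "Title: Build Fast with AI\n\n"
    "URL Source: https://www.buildfastwithai.com/genai-course\n\n"
    "Markdown Content:\n"
    "Crash Course on Generative AI\n"
    "-----------------------------\n"
)
page = parse_reader_output(sample)
print(page["title"])
print(page["url"])
```

Keeping the Markdown body separate from the header makes it easier to pass only the page content to a model later on.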

### 3.2 Competitor analysis using Jina

In [None]:
# List of competitors

competitor_sites = [
    {
        "name": "Articulate 360",
        "url": "https://www.articulate.com/360/pricing/freelancers"
    },
    {
        "name": "7taps",
        "url": "https://www.7taps.com/pricing"
    },
    {
        "name": "Mindsmith AI",
        "url": "https://www.mindsmith.ai/pricing"
    },
    {
        "name": "Cards-microlearning",
        "url": "https://www.cards-microlearning.com/en/tarifs"
    },
]


In [None]:
!pip install prettytable tqdm --quiet

In [None]:
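from typing import List, Callable, Dict
from prettytable import PrettyTable, ALL
from tqdm import tqdm
import tiktoken

# Minimal stand-in for the cost helper used below; the price constant is an assumption.
# Assumed gpt-4o input price in USD per 1M tokens; check current pricing before relying on it.
GPT_4O_INPUT_USD_PER_MILLION_TOKENS = 5.00

def calculate_cost(content: str) -> float:
    """Estimate the cost of sending `content` to gpt-4o as input tokens."""
    encoding = tiktoken.get_encoding("o200k_base")  # tokenizer used by gpt-4o
    # disallowed_special=() treats special-token text found in scraped pages as plain text
    token_count = len(encoding.encode(content, disallowed_special=()))
    return token_count / 1_000_000 * GPT_4O_INPUT_USD_PER_MILLION_TOKENS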

def view_scraped_content(scrape_url_functions: List[Dict[str, Callable[[str], str]]], sites_list: List[Dict[str, str]], characters_to_display: int = 500, table_max_width: int = 50) -> List[Dict[str, str]]:
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    cost_table_headers = ["Site Name"] + [f"{func['name']} cost" for func in scrape_url_functions]

    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    cost_table = PrettyTable()
    cost_table.field_names = cost_table_headers

    scraped_data = []

    for site in sites_list:
        content_row = [site['name']]
        cost_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                try:
                    content = scrape_function['function'](site['url'])
                    content_snippet = content[:characters_to_display]
                    content_row.append(content_snippet)

                    cost = calculate_cost(content)
                    cost_row.append(f"${cost:.6f}")

                    site_data["sites"].append({"name": function_name, "content": content})
                except Exception as e:
                    error_message = f"Error: {str(e)}"
                    content_row.append(error_message)
                    cost_row.append("Error")

                    site_data["sites"].append({"name": function_name, "content": error_message})
                    continue

        content_table.add_row(content_row)
        cost_table.add_row(cost_row)
        scraped_data.append(site_data)

    content_table.max_width = table_max_width
    content_table.hrules = ALL

    cost_table.max_width = table_max_width
    cost_table.hrules = ALL

    print("Content Table:")
    print(content_table)

    print("\nCost Table:\nThis is how much it would cost to use gpt-4o to parse this content for extraction.")
    print(cost_table)

    return scraped_data

In [None]:
list_of_scraper_functions = [
      {"name": "Jina AI", "function": scrape_jina_ai}
      ]

In [None]:
all_content = view_scraped_content(list_of_scraper_functions, competitor_sites, 700, 20)

In [None]:
!pip install openai --quiet

In [None]:
from google.colab import userdata
from openai import OpenAI

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

client = OpenAI(api_key=OPENAI_API_KEY)

def extract(user_input: str):
  entity_extraction_system_message = {"role": "system", "content": "Get me the three pricing tiers from this website's content, and return as a JSON with three keys: {cheapest: {name: str, price: float}, middle: {name: str, price: float}, most_expensive: {name: str, price: float}}"}

  messages = [entity_extraction_system_message]
  messages.append({"role": "user", "content": user_input})

  response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=False,
        response_format={"type": "json_object"}
    )

  return response.choices[0].message.content
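
`extract` returns the model's reply as a JSON string, so it still has to be parsed and checked before the prices can be compared. The sketch below is one way to do that with the standard library, assuming the three keys requested in the system message. `parse_pricing` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json

# Keys requested in the system message
EXPECTED_TIERS = ("cheapest", "middle", "most_expensive")

def parse_pricing(raw_json):
    """Parse the model's JSON string and keep only well-formed tiers."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    tiers = {}
    for key in EXPECTED_TIERS:
        tier = data.get(key)
        if not isinstance(tier, dict):
            continue
        name, price = tier.get("name"), tier.get("price")
        # bool is a subclass of int, so exclude it explicitly
        is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
        if isinstance(name, str) and is_number:
            tiers[key] = {"name": name, "price": float(price)}
    return tiers

# Made-up reply: one valid tier, one tier with a non-numeric price
sample_response = '{"cheapest": {"name": "Basic", "price": 10}, "middle": {"name": "Pro", "price": "n/a"}}'
print(parse_pricing(sample_response))
```

Dropping malformed tiers instead of raising keeps a single bad reply from stopping a loop over several sites.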

In [None]:
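from typing import Any

def display_extracted_content(results: List[Dict[str, Any]], num_objects: int):
    table = PrettyTable()
    # Each row holds the site name, the scraper that produced the content, and the model's output
    table.field_names = ["Site", "Scraper", "Extracted Content"]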

    # Ensure num_objects does not exceed the length of the results list
    num_objects = min(num_objects, len(results))

    # Process the specified number of items from the results list with a progress bar
    for result in tqdm(results[:num_objects], desc="Processing results"):
        provider_name = result["provider"]

        for site in result["sites"]:
            function_name = site["name"]
            content = site["content"]

            # Progress bar for each function
            for _ in tqdm(range(1), desc=f"Extracting content with {provider_name} for {function_name}"):
                extracted_content = extract(content)
                table.add_row([provider_name, function_name, extracted_content])

    table.max_width = 50  # Set the maximum width for better display
    table.hrules = ALL

    print("Extracted Content Table:")
    print(table)


In [None]:
display_extracted_content(all_content, num_objects=9)



---


## 4. Streamlit App (bonus)

In [None]:
```python

# Import the required libraries
import streamlit as st
from scrapegraphai.graphs import SmartScraperGraph

# Set up the Streamlit app
st.title("Web Scraping AI Agent 🕵️‍♂️")
st.caption("This app allows you to scrape a website using OpenAI API")

# Get OpenAI API key from user
openai_access_token = st.text_input("OpenAI API Key", type="password")

if openai_access_token:
    model = st.radio(
        "Select the model",
        ["gpt-3.5-turbo", "gpt-4"],
        index=0,
    )
    graph_config = {
        "llm": {
            "api_key": openai_access_token,
            "model": model,
        },
    }
    # Get the URL of the website to scrape
    url = st.text_input("Enter the URL of the website you want to scrape")
    # Get the user prompt
    user_prompt = st.text_input("What do you want the AI agent to scrape from the website?")

    # Create a SmartScraperGraph object
    smart_scraper_graph = SmartScraperGraph(
        prompt=user_prompt,
        source=url,
        config=graph_config
    )
    # Scrape the website
    if st.button("Scrape"):
        result = smart_scraper_graph.run()
        st.write(result)


```