# Instant Gratification!
Let's build a useful LLM solution - in a matter of minutes.

Our goal is to code a new kind of Web Browser. Give it a URL, and it will respond with a summary. The Reader's Digest of the internet!!

Before starting, be sure to create your API key with OpenAI and add it to the .env file.

# Overall Flow of the Code

Load Environment Variables: The .env file is loaded; it contains the OpenAI API key.

Fetch Webpage Content: The requests library is used to make an HTTP request to the target webpage and retrieve its HTML content.

Parse HTML with BeautifulSoup: Once the HTML is fetched, BeautifulSoup is employed to extract the relevant textual content (removing script, style, img, and input elements, which carry no readable text).

Call GPT-4: After obtaining the clean webpage content, the OpenAI library is used to send the text to GPT-4 (the gpt-4o-mini variant in this notebook) and request a summary.

Display the Summary: The summary generated by GPT-4 is displayed in markdown format within the notebook using IPython.display.

In [1]:
# The os module is a standard Python library that provides utilities for interacting with the operating system, 
# including environment variable management and file system access.
# In this notebook, it is used to read and set environment variables (such as the API key).
import os

# The requests library is used to make HTTP requests to retrieve the contents of a web page. 
# In this project, it will be used to fetch the HTML of the page that needs to be summarized.
# This library simplifies making HTTP requests, such as GET and POST, handling network errors, 
# and providing access to response content in different formats (text, JSON, etc.).
# This step is essential to retrieve and prepare text data before summarization using GPT-4.
import requests

# The load_dotenv function is used to load environment variables from a .env file into the Python environment. 
# .env files contain environment-specific variables and are loaded by dotenv to prevent hard-coding sensitive information in the script.
# The OpenAI API key, which will be required to interact with GPT-4, is stored in this .env file.
from dotenv import load_dotenv

# BeautifulSoup is part of the bs4 library and is used to parse and extract data from HTML and XML documents. 
# After fetching the web page content using requests, BeautifulSoup helps extract clean, readable text from the raw HTML, 
# which can then be summarized by the GPT-4 model.
from bs4 import BeautifulSoup

# These functions are used to display rich output (like markdown-formatted text) in Jupyter notebooks. 
# display allows you to render different types of objects in an IPython (Jupyter) environment. 
# Markdown renders text formatted using markdown syntax (like headings, bold, italics, etc.)
# Once the webpage is summarized, the output can be shown in a readable, structured markdown format.
from IPython.display import Markdown, display

# This imports the OpenAI library, which allows interaction with OpenAI’s models (including GPT-4). 
# The script will use this library to send text data to the GPT-4 model and retrieve the summarized version.
# The OpenAI Python package provides an interface for making requests to OpenAI’s API. 
# This is where API keys and requests are configured to interact with GPT-4, specifying parameters like the model to use, prompt, and temperature.
from openai import OpenAI

# Key Concepts and Mechanism Behind the Class:
# Web Scraping:
This class effectively scrapes a webpage by making an HTTP request to the URL and parsing the HTML to extract meaningful data.
Tools like requests and BeautifulSoup are commonly used for web scraping because they simplify the process of fetching and parsing HTML.

# Data Cleaning:
Before sending text to an LLM like GPT-4, it’s important to clean it. This code removes noisy elements (like scripts, styles, etc.) that would confuse or overwhelm the summarization model.
The decompose() method ensures that these elements are completely removed from the parsed HTML.
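
To make the cleaning step concrete, here is a minimal sketch that runs on a small inline HTML string instead of a live page. The sample HTML is invented for illustration; the BeautifulSoup calls are the same ones the Website class uses.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this illustration.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Hello</h1>
    <script>console.log("tracking");</script>
    <style>h1 { color: red; }</style>
    <p>Some <b>useful</b> text.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Calling a tag like a function is shorthand for find_all().
for tag in soup.body(["script", "style"]):
    tag.decompose()  # remove the tag and everything inside it

# Each remaining text fragment is stripped and joined with newlines.
cleaned = soup.body.get_text(separator="\n", strip=True)
print(cleaned)
```

Note that inline tags such as `<b>` split the text into separate fragments, so "useful" lands on its own line. The same effect is visible in the real page output later in this notebook.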

# LLM Input Preparation:
Large language models like GPT-4 work best when provided with clean, structured input. This class cleans and prepares the text, ensuring that only relevant content is sent to the model.
The clean, structured content improves the model’s ability to generate accurate and concise summaries.
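
One further preparation step this notebook does not perform, but which may help on very long pages, is capping how much text goes into the prompt. The sketch below is an assumption rather than part of the original code: `prepare_text` and its `max_chars` default are hypothetical, and a character count is only a rough proxy for the model's token limit.

```python
def prepare_text(text: str, max_chars: int = 8000) -> str:
    """Drop blank lines and cap the length of text destined for a prompt.

    Hypothetical helper: the name and the 8000-character default are
    illustrative assumptions, not values taken from the notebook.
    """
    lines = [line.strip() for line in text.splitlines()]
    compact = "\n".join(line for line in lines if line)
    if len(compact) <= max_chars:
        return compact
    # Cut at the last newline before the limit so no line is split in half.
    cut = compact.rfind("\n", 0, max_chars)
    return compact[:cut] if cut > 0 else compact[:max_chars]


print(prepare_text("First line\n\n   Second line   \n"))
```

A helper like this could be applied to `website.text` before building the user prompt.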

In [2]:
# This class is designed to encapsulate the essential elements of a webpage (URL, title, and main text content). 
# Once instantiated with a URL, it automatically fetches and processes the webpage to extract these details.
# Object-oriented design is used here to wrap the webpage-related functionality into a single object, making it easier to manage and use. 
class Website:
    url: str  # The URL of the webpage to fetch and summarize, supplied by the caller.
    title: str  # The text of the page's <title> tag, or "No title found" if the page has none.
    text: str  # The text of the <body> tag after removing irrelevant elements (scripts, styles, images, and input fields).
    
    def __init__(self, url): # The constructor is responsible for initializing the Website object with the URL
        self.url = url # The provided URL is stored in the url attribute of the instance.
        
        response = requests.get(url) # This line makes an HTTP GET request to fetch the HTML content of the webpage using the requests library.
        
        soup = BeautifulSoup(response.content, 'html.parser') # The fetched HTML content is passed to BeautifulSoup for parsing
        
        # The title of the webpage is extracted from the <title> tag. If the tag is missing or empty, a fallback string "No title found" is used.
        self.title = soup.title.string if soup.title and soup.title.string else "No title found"
        
        # This loop removes certain irrelevant tags from the HTML, such as scripts, styles, images, and input fields. 
        # These tags don’t contain useful textual content for summarization and should be excluded.
        # Calling soup.body(["script", "style", "img", "input"]) is shorthand for soup.body.find_all(...): it returns every element in the body whose tag is script, style, img, or input.
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            #  Completely removes these elements from the DOM tree. This ensures they don’t interfere with the extracted text.
            irrelevant.decompose()

        # After cleaning the HTML, this line extracts the remaining text from the webpage’s <body> tag.
        # get_text(separator="\n", strip=True) is a BeautifulSoup method that retrieves all the text content from an HTML element (in this case, the body).
        # separator="\n" ensures that different sections of text are separated by a newline when extracted. This helps in preserving the structure of the text.
        # strip=True: removes leading and trailing whitespace from the text.
        self.text = soup.body.get_text(separator="\n", strip=True)
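
The class above assumes every page has a `<body>`. With `html.parser`, a document that lacks one leaves `soup.body` as `None`, and the cleaning loop would raise an error. The sketch below shows one defensive variant of the same parsing logic, working on an HTML string so it can be tried without a network request. The function name `extract_title_and_text` is hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup


def extract_title_and_text(html):
    """Return (title, text) for an HTML string, tolerating missing tags.

    Hypothetical helper, written for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")

    # soup.title is None without a <title> tag; .string is None for an empty one.
    title = soup.title.string if soup.title and soup.title.string else "No title found"

    # html.parser does not add a missing <body>, so fall back to the whole document.
    root = soup.body if soup.body else soup

    for irrelevant in root(["script", "style", "img", "input"]):
        irrelevant.decompose()

    return title, root.get_text(separator="\n", strip=True)


print(extract_title_and_text("<p>Just a fragment</p>"))
```

Falling back to the whole document is a design choice: it keeps whatever text exists rather than failing, at the cost of possibly including text from outside the body.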

# Connecting to OpenAI
The next cell is where we load in the environment variables in your .env file and connect to OpenAI.

Troubleshooting if you have problems:

A new OpenAI account can take a few minutes to register. If you receive an error about being over quota, wait a few minutes and try again.
As a fallback, replace the line openai = OpenAI() with openai = OpenAI(api_key="your-key-here"). Hard-coding keys in a notebook is not recommended, because you can then no longer share the notebook with others, but it works as a temporary workaround.

The cell below sets the API key in the environment and initializes the OpenAI client used to interact with OpenAI's GPT-4 API.

In [11]:
# Load the environment variables defined in the .env file into the current Python environment.
load_dotenv()

# Make sure the OpenAI API key is present in os.environ, the dictionary of environment variables
# available to the script at runtime. The OpenAI client reads OPENAI_API_KEY from there by default.
# os.getenv returns the value loaded from .env; if the variable is missing, it falls back to the
# placeholder 'your-key-if-not-using-env', which you would replace with your own key.
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')

# This initializes an instance of the OpenAI class, which will be used to interact with the OpenAI API (in this case, GPT-4).
# The OpenAI() class is part of the OpenAI Python package, which provides an interface for communicating with OpenAI’s API. 
# Once initialized, it enables you to call different models (like GPT-4) and send prompts for completion, summarization, and other NLP tasks.
openai = OpenAI()
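
If the connection fails, a quick check of the key itself can narrow down the problem. The following is a minimal sketch under stated assumptions: `check_api_key` is a hypothetical helper, and it inspects only what is in the environment. It does not contact OpenAI, so a key that passes may still be rejected by the API.

```python
import os


def check_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return a short status message describing the key's state.

    Hypothetical helper for troubleshooting; it never prints the key itself.
    """
    key = os.getenv(name)
    if not key:
        return "No API key found; check the .env file."
    if key == "your-key-if-not-using-env":
        return "Placeholder key in use; replace it with a real key."
    if key != key.strip():
        return "API key has leading or trailing whitespace."
    return "API key found."


print(check_api_key())
```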


In [4]:
# Let's try one out

# Instantiate a Website Object: 
ed = Website("https://edwarddonner.com") # provide a URL to the class constructor, which fetches and processes the webpage.

# Access Webpage Data: After initialization, you can access the url, title, and text attributes of the object.
print(ed.title) # Displays the webpage's title
print(ed.text) # Displays the cleaned webpage text

Home - Edward Donner
Home
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.
We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve
patented
our matching model, and our award-winning platform has happy customers and tons of press coverage.
Connect
with me for

# system_prompt
This string defines the system-level instructions given to GPT-4.
It sets the behavior and tone of the model by specifying that it should act as an assistant that provides a summary of a website while ignoring irrelevant content (like navigation links).
# System Role: 
GPT-4 can be guided to behave in specific ways based on the system prompt. Here, it's being instructed to provide a summary and to ignore text related to navigation, which is often not useful in a website summary.
# Markdown: 
The instruction to "Respond in markdown" ensures that the response is formatted in a way suitable for rendering within markdown viewers (like in Jupyter notebooks or Markdown editors).

In [5]:
system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

# user_prompt_for(website)
This function generates the user-specific prompt to send to the GPT-4 model. It describes the website's title and content, then asks the model to summarize it.

In [6]:
def user_prompt_for(website):
    # The title of the webpage (website.title) is dynamically inserted into the user prompt, which gives GPT-4 context about the content it's summarizing.
    user_prompt = f"You are looking at a website titled {website.title}.\n"
    user_prompt += "The contents of this website are as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    # The text content extracted from the webpage (website.text) is appended after the main prompt, asking GPT-4 to summarize it.
    user_prompt += website.text
    return user_prompt


# messages_for(website)
This function generates the entire message structure needed for a conversation with GPT-4. It includes both the system prompt and the user prompt (which was created in the user_prompt_for function). This structure is required when working with OpenAI’s chat-based models, such as GPT-4. The function returns a list of message objects that represent the conversation:

System Message: Defines the assistant's role and behavior.

User Message: Contains the dynamically generated user prompt, which gives context and instructions to GPT-4.

In [7]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt}, # The system message sets the model's role
        {"role": "user", "content": user_prompt_for(website)} # The user message describes the task at hand.
    ]
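
To see the message structure without fetching a page, the sketch below builds it from a stand-in object. Everything prefixed with `demo_` or `fake_` is hypothetical, and the prompts are shortened versions of the ones above. Only the list-of-dicts shape with `role` and `content` keys matches what the chat API expects.

```python
from types import SimpleNamespace

# Shortened, illustrative prompt; the notebook's real system_prompt is longer.
demo_system_prompt = "You are an assistant that summarizes websites. Respond in markdown."


def demo_user_prompt_for(website):
    return f"You are looking at a website titled {website.title}.\n\n{website.text}"


def demo_messages_for(website):
    return [
        {"role": "system", "content": demo_system_prompt},
        {"role": "user", "content": demo_user_prompt_for(website)},
    ]


# Hypothetical stand-in for a Website object; no HTTP request is made.
fake_site = SimpleNamespace(title="Demo", text="Hello world")

for message in demo_messages_for(fake_site):
    print(message["role"], "->", repr(message["content"]))
```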

# summarize(url)
This is the main function that orchestrates the webpage summarization process. It takes in a URL, fetches the webpage content, and then sends a prompt to GPT-4 to generate a summary.

Website Object: 
The function first creates a Website object for the given URL. This object (from the earlier code) extracts the webpage's title and main text content.

API Call to GPT-4: 
After generating the necessary messages (using messages_for), it makes a call to OpenAI's GPT-4 API using the openai.chat.completions.create() method. This sends the messages to the model and retrieves the generated summary.

Return Summary: 
The function returns the summary produced by GPT-4 from response.choices[0].message.content.

In [12]:
def summarize(url):
    website = Website(url)  # Initialize the Website object for the given URL
    
    response = openai.chat.completions.create(
        model="gpt-4o-mini",  # A smaller, lower-cost model in the GPT-4o family
        messages=messages_for(website)  # Provide the system and user messages
    )
    return response.choices[0].message.content  # Return the generated summary
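
Because summarize calls the API directly, it cannot be exercised without a key and a network connection. One possible variation, sketched here as an assumption rather than the notebook's approach, passes the client in as a parameter so that a stub can stand in for it. `summarize_text`, `FakeCompletions`, and `fake_client` are hypothetical names; the stub imitates only the attributes the function reads.

```python
from types import SimpleNamespace


def summarize_text(client, messages, model="gpt-4o-mini"):
    """Send messages to a chat model and return the reply text."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


class FakeCompletions:
    """Hypothetical stub that imitates the response shape, for offline testing."""

    def create(self, model, messages):
        reply = f"(stub reply from {model} to {len(messages)} messages)"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


# Mirrors the client.chat.completions.create(...) access path used above.
fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))

messages = [
    {"role": "system", "content": "You summarize websites."},
    {"role": "user", "content": "Summarize: Hello world"},
]
print(summarize_text(fake_client, messages))
```

With a real client, the call would be `summarize_text(openai, messages_for(website))`, using the `openai` object created earlier in this notebook.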

In [9]:
summarize("https://edwarddonner.com")

'# Website Summary - Edward Donner\n\nEdward Donner\'s website features a personal introduction and showcases his interests in coding, LLMs (Large Language Models), DJing, and electronic music production. He is the co-founder and CTO of Nebula.io, which utilizes AI to improve talent discovery and recruitment. Previously, he founded the AI startup untapt, acquired in 2021.\n\n## Recent Posts\n- **August 6, 2024**: Introduced "Outsmart LLM Arena," a platform for LLMs to engage in diplomacy and competitive scenarios.\n- **June 26, 2024**: Provided guidance on selecting the right LLM, including tools and resources.\n- **February 7, 2024**: Discussed methods for fine-tuning LLMs based on individual texts in a simulated manner.\n- **January 31, 2024**: Continued the discussion on fine-tuning LLMs, specifically focusing on QLoRA techniques.'

# display_summary(url)
This function ties everything together. It first generates a summary of the webpage provided by the url, then displays that summary in markdown format using IPython.display.

Call to summarize(url):
The function first calls summarize(url) to generate a summary of the given webpage. This function was defined earlier and is responsible for retrieving the webpage content, sending it to GPT-4, and getting the summarized text.

display(Markdown(summary)): The display() function from the IPython package is used to render the summary in a notebook.

Markdown(summary) is used to format the output as markdown. This ensures that any markdown syntax (like headers, lists, bold, or italic text) in the summary is rendered properly when displayed.

In [13]:
def display_summary(url):
    summary = summarize(url)  # Generate summary using the summarize function
    display(Markdown(summary))  # Display the summary in markdown format

In [14]:
# This line calls the display_summary function with the URL https://edwarddonner.com, initiating the entire process of webpage summarization and display.
# The URL https://edwarddonner.com is passed to summarize, which extracts the webpage content and generates a summary using GPT-4.
# The summary is then displayed in markdown format within the notebook environment
display_summary("https://edwarddonner.com")

# Summary of Edward Donner's Website

Edward Donner's website serves as a platform for his interests in coding, experimenting with large language models (LLMs), and electronic music production. As the co-founder and CTO of Nebula.io, he focuses on applying AI to improve talent discovery and management. His previous venture, untapt, was acquired in 2021, complementing his extensive experience in the AI field.

## Recent Posts
- **August 6, 2024**: **Outsmart LLM Arena** – A competitive environment designed for LLMs, emphasizing strategies involving diplomacy and cunning.
- **June 26, 2024**: **Choosing the Right LLM: Toolkit and Resources** – Guidance on selecting appropriate LLMs for various applications.
- **February 7, 2024**: **Fine-tuning an LLM on your texts: a simulation of you** – Insights into customizing LLMs using personal text data.
- **January 31, 2024**: **Fine-tuning an LLM on your texts: part 4 – QLoRA** – A continuation of the fine-tuning discussion, focusing on the QLoRA method.

In [15]:
display_summary("https://cnn.com")

# Summary of CNN Website

The CNN website provides comprehensive coverage of breaking news, latest updates, and in-depth articles across various categories including US news, world events, politics, business, health, entertainment, travel, sports, and science. The site features live broadcasts, audio and video content, and articles on trending topics and current events.

## Latest News Highlights
- **Hurricane Helene**: The hurricane has intensified to Category 4, with extreme winds and storm surge affecting the Florida coast.
- **NTSB Warning**: An urgent safety warning has been issued concerning certain Boeing 737 models.
- **New York City Mayor**: Eric Adams faces mounting pressure and legal scrutiny, including a five-count indictment.
- **Ukraine-Russia** and **Israel-Hamas Wars**: Ongoing coverage and updates on the conflicts.

## Recent Developments
- **Entertainment**: Hoda Kotb announces departure from the 'Today' show.
- **Politics**: Ongoing discussions and events related to the upcoming 2024 elections, including implications for Democratic candidates.

CNN also engages users through a feedback mechanism for ads and technical issues. The site is designed to keep readers informed and engaged with the latest developments in various domains.

In [16]:
display_summary("https://finance.yahoo.com")

# Yahoo Finance - Stock Market Live

Yahoo Finance provides comprehensive coverage of the stock market, finance, and business news. The homepage features an array of topics including live stock quotes, market data, and news highlights. Key areas include:

## News Updates
- **Economy**: The U.S. appears to be emerging from an economic slowdown with positive changes on the horizon. Recent reports indicate a 3% annualized growth rate and decreasing jobless claims.
- **Corporate Highlights**:
  - **Costco** reported Q4 earnings that beat expectations, however, revenue fell short.
  - **BlackBerry** announced a loss in its quarterly earnings report.
  - **Trump** has launched a new venture focused on luxury watches.
- **Market Performance**: Asian stocks show gains, influenced by optimistic trends in the China and U.S. markets.

## Market Data
- The site features **real-time stock data**, with highlights such as:
  - **Futures and Commodities**: S&P 500 futures down slightly, along with Dow and Nasdaq futures.
  - Notable movements include major stocks such as **NVIDIA**, **Micron**, and **Alibaba** showing positive trends.

## Sector Insights
- Investments are notably strong in technology and consumer sectors, bolstered by recent earnings reports and market forecasts.

## Personal Finance
- The site includes resources for personal finance management, such as mortgage rates, credit cards, and investment advice.

## Investment Ideas
- A collection of **stocks** and **ETFs** currently attracting attention, including growth stocks and top daily gainers.

Overall, Yahoo Finance serves as a robust platform for investors and financial enthusiasts, offering timely news, dynamic market insights, and a wide array of financial tools.