This notebook walks through building a webpage summarizer that calls the OpenAI API.

Import Statements

In [14]:
import os
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from openai import OpenAI

Connecting to OpenAI

In [15]:
# Load environment variables from a file called .env
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key (whitespace first, since a leading space would also fail the prefix check)
if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")


API key found and looks good so far!


Testing API call

In [16]:
openai = OpenAI()

message = "Hello, GPT! This is my first ever message to you! Hi!"
response = openai.chat.completions.create(model="gpt-4o-mini", messages=[{"role":"user", "content":message}])
print(response.choices[0].message.content)

Hello! Welcome! I'm glad you reached out. How can I assist you today?


Creating a Class to Represent a WebPage

In [17]:
# A class to represent a Webpage
# If you're not familiar with Classes, check out the "Intermediate Python" notebook

# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:

    def __init__(self, url):
        """
        Create this Website object from the given url using the BeautifulSoup library
        """
        self.url = url
        response = requests.get(url, headers=headers)

        # BeautifulSoup parses the HTML so we can pull out the title and the visible text
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        # Some pages have no <body>; guard against that before stripping tags
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
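
If you want to see the idea behind the text extraction without fetching a live page, here is a minimal sketch that uses Python's built-in `html.parser` instead of BeautifulSoup. It is not what the `Website` class does internally; it only mimics the same behaviour (keep text inside `<body>`, skip `<script>` and `<style>`). The names `VisibleTextParser`, `visible_text` and `sample_html` are made up for this illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text inside <body>, ignoring <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is inside <body> and outside skipped tags
        if text and self.in_body and self.skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible body text of an HTML string, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample page, used only for this demonstration
sample_html = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
)
print(visible_text(sample_html))
```

BeautifulSoup is far more forgiving with messy real-world HTML, which is why the notebook uses it; the sketch is just a way to experiment with the stripping logic offline.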

Testing Out Beautiful Soup

In [18]:
# Let's try one out. Change the website and add print statements to follow along.

ed = Website("https://edwarddonner.com")
print(ed.title)
print(ed.text)

Home - Edward Donner
Home
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.
We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve
patented
our matching model, and our award-winning platform has happy customers and tons of press coverage.
Connect
with me for

In [19]:
# The system prompt sets the context for the frontier model:
# it tells the model what kind of task it is performing and what tone to use

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [20]:
# A function that writes a user prompt asking for a summary of a website.
# The user prompt is the actual conversation itself.
# The conversation starts here: the LLM's role is to work out how to respond to the user prompt in the context of the system prompt.

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website is as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

In [21]:
print(user_prompt_for(ed))

You are looking at a website titled Home - Edward Donner
The contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.

Home
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.

***Messages***

The API from OpenAI expects to receive messages in a particular structure. Many of the other APIs share this structure:

```json
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```

In [25]:
# A conversation is described as a Python list of dictionaries
# Each element in the list is a dictionary with two keys: "role" and "content"

messages = [
    {"role": "system", "content": "You are a snarky assistant"},
    {"role": "user", "content": "What is 2 + 2?"}
]

# To give you a preview -- calling OpenAI with system and user messages:

response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)

Oh, you're really looking for a challenge, huh? Well, hold on to your hat: 2 + 2 equals 4. Mind-blowing, I know!
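
The same list-of-dictionaries format also carries multi-turn conversations: the model's earlier reply goes back into the list with the role `"assistant"`, followed by the next user message. Here is a small sketch of that bookkeeping. It makes no API call, and `extend_conversation`, `conversation` and `follow_up` are hypothetical names chosen for this example, not part of the OpenAI library.

```python
def extend_conversation(messages, assistant_reply, next_user_message):
    """Return a new message list with the assistant's reply and a follow-up user turn appended."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


conversation = [
    {"role": "system", "content": "You are a snarky assistant"},
    {"role": "user", "content": "What is 2 + 2?"},
]

# The assistant reply here is typed in by hand; in practice it would come from the previous response
follow_up = extend_conversation(conversation, "2 + 2 equals 4.", "And what is 4 + 4?")

for message in follow_up:
    print(message["role"], "->", message["content"])
```

Passing `follow_up` as the `messages` argument would give the model the whole exchange as context. The function returns a new list rather than mutating the original, so the earlier conversation stays intact.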


In [26]:
# See how this function creates exactly the format above

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

In [27]:
print(messages_for(ed))

[{'role': 'system', 'content': 'You are an assistant that analyzes the contents of a website and provides a short summary, ignoring text that might be navigation related. Respond in markdown.'}, {'role': 'user', 'content': 'You are looking at a website titled Home - Edward Donner\nThe contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\n\nHome\nOutsmart\nAn arena that pits LLMs against each other in a battle of diplomacy and deviousness\nAbout\nPosts\nWell, hi there.\nI’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (\nvery\namateur) and losing myself in\nHacker News\n, nodding my head sagely to things I only half understand.\nI’m the co-founder and CTO of\nNebula.io\n. We’re applying AI to a field where it can make a massive, positive 

Summarize function

In [28]:
# And now: call the OpenAI API. You will get very familiar with this!

def summarize(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model = "gpt-4o-mini",
        messages = messages_for(website)
    )
    return response.choices[0].message.content
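
To make the shape of the response easier to see (and to try the call path without spending API credits), here is a sketch that passes the client in as an argument and exercises it with a stand-in object. `summarize_text` and `make_fake_client` are hypothetical helpers written for this illustration; the fake only imitates the attribute path `client.chat.completions.create(...)` and the `response.choices[0].message.content` access used above, nothing else about the real client.

```python
from types import SimpleNamespace


def summarize_text(client, messages, model="gpt-4o-mini"):
    """Send messages to a chat-completions style client and return the reply text."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


def make_fake_client(reply):
    """Build a stand-in object that always answers with the given reply."""
    def create(model, messages):
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


fake = make_fake_client("# Fake summary")
print(summarize_text(fake, [{"role": "user", "content": "Summarize this page"}]))
```

With the real client you would call `summarize_text(openai, messages_for(website))` instead; keeping the client as a parameter is one possible design choice that makes the function easy to test offline.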

In [29]:
summarize("https://edwarddonner.com")

'# Summary of Edward Donner\'s Website\n\nEdward Donner\'s website features a personal introduction highlighting his interests in coding, experimenting with Large Language Models (LLMs), DJing, and electronic music production. He is the co-founder and CTO of Nebula.io, a company focused on leveraging AI to assist individuals in discovering their potential and career paths. Previously, he founded the AI startup untapt, which was acquired in 2021. The site also offers a section called "Outsmart," which is an arena for LLMs to compete in diplomacy and strategy.\n\n## Recent Posts\n- **December 21, 2024**: Welcome, SuperDataScientists!\n- **November 13, 2024**: Mastering AI and LLM Engineering – Resources\n- **October 16, 2024**: From Software Engineer to AI Data Scientist – Resources\n- **August 6, 2024**: Outsmart LLM Arena – a battle of diplomacy and deviousness\n\nThe website encourages visitors to connect and collaborate with Edward on topics related to AI and LLM technology.'

In [30]:
# A function to display this nicely in the Jupyter output, using markdown

def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [31]:
display_summary("https://edwarddonner.com")

# Summary of Edward Donner's Website

Edward Donner's website showcases his interests in coding, experimenting with large language models (LLMs), and electronic music. He is the co-founder and CTO of [Nebula.io](http://nebula.io), a company utilizing AI to assist in talent discovery and management. He was previously the founder and CEO of untapt, an AI startup acquired in 2021. The website features a section called "Outsmart," which presents an engaging platform for LLMs that compete in diplomatic and strategic scenarios.

## Key Announcements
- **December 21, 2024:** Welcoming SuperDataScientists to the community.
- **November 13, 2024:** Shared resources for mastering AI and LLM engineering.
- **October 16, 2024:** Offered resources for transitioning from software engineer to AI data scientist.
- **August 6, 2024:** Introduction to the Outsmart LLM Arena concept.

In [32]:
display_summary("https://cnn.com")

# Summary of CNN Website

CNN is a global news organization providing the latest updates on a wide array of topics including US and world news, politics, business, health, entertainment, science, climate, sports, and more. The site features live updates and breaking news, along with various CNN-exclusive content.

## Current Highlights:
- **Israel-Hamas Conflict**: A ceasefire has begun, allowing the release of hostages. Three female hostages have been reunited with their families, and the situation in Gaza continues to unfold, with many families returning to devastated homes.
- **US Politics**: Donald Trump's presidential transition is ongoing, with plans for significant executive actions on his first day. Current polling shows low ratings for the Democratic Party.
- **TikTok**: The app is reinstating its services following Trump's pledge to restore it amidst ongoing discussions about its future.

### Additional Topics:
- Environmental insights on LA wildfires and their impact on communities.
- Human interest stories including efforts after natural disasters and unique societal choices in Japan.
- Popular culture updates including arts, entertainment, and lifestyle trends.

CNN's website is structured to provide a comprehensive news experience with multimedia content and various sections tailored to specific interests.

In [33]:
display_summary("https://anthropic.com")

# Anthropic Website Summary

Anthropic is an AI safety and research company based in San Francisco, dedicated to creating reliable and beneficial AI systems. Their primary product is Claude, an AI model designed to prioritize safety in its operations.

## Notable Features:

- **Claude 3.5 Sonnet**: The latest version of their AI model, released on October 22, 2024. 
- **Claude 3.5 Haiku**: Another new model introduced alongside 3.5 Sonnet.
- **API Access**: Developers can build AI-powered applications and custom experiences using Claude through API integrations.
- **Enterprise Solutions**: Tailored AI solutions for organizational use.

## Recent Announcements:
- **October 22, 2024**: Introduction of Claude 3.5 Sonnet and slide update on computer use.
- **September 4, 2024**: Announcement regarding Claude for Enterprise.
- **Past Research**: Discussion on AI alignment and safety, including topics like "Constitutional AI" and core views on AI safety shared in earlier announcements.

Overall, Anthropic merges advanced AI technology with a commitment to establishing safety and ethical standards in AI development.