
In [1]:
!pip install openai



In [3]:
# imports

import os
import requests
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from openai import OpenAI
from google.colab import userdata

# If you get an error running this cell, then please head over to the troubleshooting notebook!




In [4]:
api_key = userdata.get('Open_AI_API')  # 'Open_AI_API' is the name of the Colab secret that stores your OpenAI API key

In [5]:
# Check the key

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif api_key.strip() != api_key:
    # Checked before the prefix test so that leading whitespace gets the right message
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start with sk-proj-; please check you're using the right key - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

API key found and looks good so far!


In [6]:
openai = OpenAI(api_key=api_key)

In [7]:
# To give you a preview -- calling OpenAI with these messages is this easy. Any problems, head over to the Troubleshooting notebook.

message = "Hello, GPT! This is my first ever message to you! Hi!"
response = openai.chat.completions.create(model="gpt-4o-mini", messages=[{"role":"user", "content":message}])
print(response.choices[0].message.content)

Hello! Welcome! I'm glad to hear from you. How can I assist you today?


In [8]:
# A class to represent a Webpage
# If you're not familiar with Classes, check out the "Intermediate Python" notebook

# Some websites need you to use proper headers when fetching them:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:

    def __init__(self, url):
        """
        Create this Website object from the given url using the BeautifulSoup library
        """
        self.url = url
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
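        if soup.body:
            # Drop elements that carry no readable text before extracting it
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            # Some responses (error pages, non-HTML content) have no <body>
            self.text = ""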
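
To see what the clean-up step in the class above does without fetching anything, here is a minimal offline sketch. It is an illustration only: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and `VisibleTextParser`, `visible_text` and `sample_html` are hypothetical names that do not appear in the notebook. Only `script` and `style` are skipped here, since `img` and `input` elements contain no text of their own.

```python
from html.parser import HTMLParser

# Tags whose contents should not reach the prompt (assumption: mirrors the class above)
SKIPPED_TAGS = {"script", "style"}


class VisibleTextParser(HTMLParser):
    """Hypothetical helper: collects text that sits outside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text of an HTML fragment, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample_html = (
    "<body><script>alert('x')</script><p>Hello</p>"
    "<style>p {color: red}</style><p>World</p></body>"
)
print(visible_text(sample_html))
```

The script and style contents are dropped and the remaining text is joined with newlines, which is the same shape as the `text` attribute that `Website` builds.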

In [9]:
# Let's try one out. Change the website and add print statements to follow along.

ed = Website("https://edwarddonner.com")
print(ed.title)
print(ed.text)

Home - Edward Donner
Home
Connect Four
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.
We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve
patented
our matching model, and our award-winning platform has happy customers and tons of press coverage.
Connec

In [10]:
## Types of prompts

# You may know this already - but if not, you will get very familiar with it!

# Models like GPT-4o have been trained to receive instructions in a particular way.

# They expect to receive:

# **A system prompt** that tells them what task they are performing and what tone they should use

# **A user prompt** -- the conversation starter that they should reply to


In [11]:
# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.'

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [12]:
# A function that writes a User Prompt that asks for summaries of websites:

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website is as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt
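
The prompt builder only reads the `title` and `text` attributes, so it can be tried without any network access. The sketch below repeats the function so that it runs on its own, and passes in a stand-in object; `fake_site` and its contents are made up for illustration.

```python
from types import SimpleNamespace


def user_prompt_for(website):
    # Same construction as the cell above, repeated so this sketch is self-contained
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website is as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt


# Hypothetical stand-in: any object with .title and .text attributes will do
fake_site = SimpleNamespace(title="Demo Page", text="Welcome\nLatest news: nothing yet")
prompt = user_prompt_for(fake_site)
print(prompt)
```

This makes it easy to check the wording of the prompt before spending any API calls on it.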

In [13]:
system_prompt

'You are an assistant that analyzes the contents of a website and provides a short summary, ignoring text that might be navigation related. Respond in markdown.'

In [14]:
print(user_prompt_for(ed))

You are looking at a website titled Home - Edward Donner
The contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.

Home
Connect Four
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acqui

In [15]:
## Messages

# The API from OpenAI expects to receive messages in a particular structure.
# Many of the other APIs share this structure:

# ```
# [
#     {"role": "system", "content": "system message goes here"},
#     {"role": "user", "content": "user message goes here"}
# ]
# ```

# To give you a preview, the next 2 cells make a rather simple call - we won't stretch the mighty GPT (yet!)

In [16]:
messages = [
    {"role": "system", "content": "You are a snarky assistant"},
    {"role": "user", "content": "What is 2 + 2?"}
]

In [17]:
# To give you a preview -- calling OpenAI with system and user messages:

response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)

Oh, let me think... I’m going to go out on a limb here and say it’s 4! Shocking, I know.


In [18]:
# See how this function creates exactly the format above

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]
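
A quick sanity check on the list before sending it can catch typos in the keys or roles. The helper below is a hypothetical sketch, not part of the OpenAI library. It is deliberately stricter than the API itself, which accepts further roles and fields; the set of roles and the string-only `content` rule are assumptions that fit the simple system + user exchange used in this notebook.

```python
# Assumption: only these roles are needed for the simple exchanges in this notebook
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages):
    """Hypothetical helper: raise ValueError if the list does not match the format above."""
    if not isinstance(messages, list) or not messages:
        raise ValueError("messages must be a non-empty list")
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            raise ValueError(f"message {index} must be a dict")
        if "role" not in message or "content" not in message:
            raise ValueError(f"message {index} needs both 'role' and 'content' keys")
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"message {index} has an unexpected role: {message['role']!r}")
        if not isinstance(message["content"], str):
            raise ValueError(f"message {index} content must be a string")
    return True


example = [
    {"role": "system", "content": "You are a snarky assistant"},
    {"role": "user", "content": "What is 2 + 2?"},
]
print(validate_messages(example))
```

Calling `validate_messages(messages_for(ed))` in the notebook would apply the same check to the real prompt.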

In [19]:
# Try this out, and then try for a few more websites

messages_for(ed)

[{'role': 'system',
  'content': 'You are an assistant that analyzes the contents of a website and provides a short summary, ignoring text that might be navigation related. Respond in markdown.'},
 {'role': 'user',
  'content': 'You are looking at a website titled Home - Edward Donner\nThe contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\n\nHome\nConnect Four\nOutsmart\nAn arena that pits LLMs against each other in a battle of diplomacy and deviousness\nAbout\nPosts\nWell, hi there.\nI’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (\nvery\namateur) and losing myself in\nHacker News\n, nodding my head sagely to things I only half understand.\nI’m the co-founder and CTO of\nNebula.io\n. We’re applying AI to a field where it can make a

In [20]:
# And now: call the OpenAI API. You will get very familiar with this!

def summarize(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages_for(website)
    )
    return response.choices[0].message.content
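
The call pattern can also be exercised offline, which is handy when experimenting without spending API credits. In the sketch below, `FakeCompletions`, `fake_client` and `summarize_messages` are hypothetical names: the fake mimics only the small part of the response shape that `summarize` reads (`choices[0].message.content`) and is not how the real client is implemented.

```python
from types import SimpleNamespace


class FakeCompletions:
    """Hypothetical stand-in that mimics only the response shape used above."""

    def create(self, model, messages):
        reply = f"[{model}] received {len(messages)} messages"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


# Nested namespaces so that fake_client.chat.completions.create(...) resolves
fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))


def summarize_messages(client, messages, model="gpt-4o-mini"):
    # Same call pattern as summarize(), with the client passed in explicitly
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


demo_messages = [
    {"role": "system", "content": "Summarize the page."},
    {"role": "user", "content": "Page text goes here."},
]
print(summarize_messages(fake_client, demo_messages))
```

Passing the client in as an argument is a design choice for this sketch; swapping `fake_client` for the real `openai` object would make the same function call the API.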

In [21]:
summarize("https://edwarddonner.com")

'# Summary of Edward Donner\'s Website\n\nEdward Donner’s website showcases his interests in coding, language models (LLMs), and electronic music. He describes himself as the co-founder and CTO of Nebula.io, a startup using AI to enhance talent discovery and engagement. Previously, he founded untapt, which was acquired in 2021.\n\n## Recent News and Announcements\n- **January 23, 2025**: Launched resources for "LLM Workshop – Hands-on with Agents".\n- **December 21, 2024**: Welcomed a new community, "SuperDataScientists".\n- **November 13, 2024**: Provided resources for "Mastering AI and LLM Engineering".\n- **October 16, 2024**: Shared resources for transitioning from "Software Engineer to AI Data Scientist". \n\nThe site serves as a connection point for individuals interested in LLMs and offers various resources in the AI field.'

In [22]:
# A function to display this nicely in the Jupyter output, using markdown

def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [23]:
display_summary("https://edwarddonner.com")

# Summary of the Edward Donner Website

The website for Edward Donner highlights his interests in coding, LLM experimentation, and electronic music. He is the co-founder and CTO of Nebula.io, which focuses on applying AI in the recruitment sector to assist in talent discovery and management. His previous venture, untapt, was an AI startup acquired in 2021. The site invites visitors to connect with him for collaboration and discussions.

## News and Announcements
- **January 23, 2025**: Resources for a workshop titled "LLM Workshop – Hands-on with Agents."
- **December 21, 2024**: A welcome message for a community called "SuperDataScientists."
- **November 13, 2024**: Resources shared for "Mastering AI and LLM Engineering."
- **October 16, 2024**: Resources for transitioning "From Software Engineer to AI Data Scientist."

In [24]:
# Let's try more websites:

# Note that this will only work on websites that can be scraped using this simplistic approach.

# Websites that are rendered with JavaScript, like React apps, won't show up. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!)

# Also, websites protected with CloudFront (and similar) may give 403 errors - many thanks Andy J for pointing this out.

# But many websites will work just fine!
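
The two failure modes mentioned above can be told apart with a small check on the HTTP status and on how much text was extracted. This is a rough sketch under stated assumptions: `diagnose_scrape` is a hypothetical helper, and the `min_chars` threshold of 200 is an arbitrary guess rather than a tested value.

```python
def diagnose_scrape(status_code, text, min_chars=200):
    """Hypothetical helper: give a rough reason why a scrape may be unusable."""
    if status_code == 403:
        return "blocked: the server refused the request (403)"
    if status_code >= 400:
        return f"failed: HTTP {status_code}"
    if len(text.strip()) < min_chars:
        # Assumption: very little text often means the page is built by JavaScript
        return "suspicious: very little text, the page may be rendered with JavaScript"
    return "ok"


print(diagnose_scrape(403, ""))
print(diagnose_scrape(200, "Loading..."))
print(diagnose_scrape(200, "A long article. " * 50))
```

To use it in this notebook, the `Website` class would need to keep `response.status_code`, which it does not currently store.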


In [25]:
display_summary("https://cnn.com")

# Summary of CNN Website

CNN provides a comprehensive range of news articles, videos, and analysis on a variety of topics including:

- **US News**: Coverage of major events and developments within the United States, including political updates and social issues.
- **World News**: International stories, highlighting events such as conflicts, elections, and major global incidents.
- **Politics**: In-depth analysis of political events, elections, and governmental decisions affecting both the US and around the world.
- **Business and Tech**: Updates on the economy, stock market trends, and innovations in technology, featuring insights from industry leaders.
- **Health**: Reports on health issues, trends, and advancements in medical sciences.
- **Entertainment and Culture**: News from the entertainment industry, celebrity updates, and cultural insights.
- **Sports**: Coverage of various sports events and issues affecting athletes and teams.
- **Climate and Science**: Articles on climate change, scientific discoveries, and environmental issues impacting the planet.

## Recent Highlights

- **US-Ukraine Relations**: An ongoing meeting to discuss the war in Ukraine, prompted by significant drone attacks on Russian territory.
- **Internal US Politics**: Democrats are split on an impending vote regarding government funding amid concerns of a potential shutdown.
- **Missing Students**: Coverage of a US college student currently missing in the Dominican Republic, with ongoing search efforts.

CNN also promotes its diverse array of shows and segments, including breaking news, analysis, and exclusive reports. The site emphasizes user engagement, inviting feedback on ads and site performance.

In [26]:
display_summary("https://anthropic.com")

# Summary of Anthropic Website

The Anthropic website showcases its AI research and products, emphasizing safety and reliability in artificial intelligence. The highlight is the introduction of **Claude 3.7 Sonnet**, described as the most intelligent AI model to date and the first hybrid reasoning model. Alongside it, **Claude Code** is launched, which is focused on coding tasks.

### Key Features
- **Claude 3.7 Sonnet**: The latest AI model featuring advanced reasoning capabilities.
- **Claude Code**: A tool designed for coding use cases.
- **API Access**: Developers can create custom AI-powered applications through the Claude API.

### Research Commitments
The company is actively involved in AI safety research, with notable announcements such as:
- **Core Views on AI Safety** - Discussing the principles and practices of ensuring AI technologies are safe for users (March 8, 2023).
- An earlier initiative titled **Constitutional AI: Harmlessness from AI Feedback** that highlights their approach to making AI safer (December 15, 2022).

### Company Overview
Anthropic is based in San Francisco and aims to develop beneficial AI systems through a collaborative approach that combines expertise from various fields, including machine learning and policy-making. The site also details job opportunities and invites individuals to join their mission.