## Instant Gratification Project: Website Text Summarization

First, check that the Ollama server is running by opening [http://localhost:11434/](http://localhost:11434/) in a browser.

You should see the message `Ollama is running`.  
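The same check can be done from Python. The following is a minimal sketch that uses only the standard library; the helper name `ollama_is_running` and the two-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
from urllib.request import urlopen
from urllib.error import URLError


def ollama_is_running(base_url="http://localhost:11434/", timeout=2.0):
    """Return True if the server at base_url answers with the expected banner."""
    try:
        with urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status
        return False
    return "Ollama is running" in body
```

Calling `ollama_is_running()` before the rest of the notebook gives a quick yes/no answer instead of a failure later on when the model is first used.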

In [2]:
# Importing essential libraries

import requests
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
import ollama

In [4]:
# Constants

#OLLAMA_API = "http://localhost:11434/api/chat" # The base URL for the Ollama server running locally
#HEADERS = {"Content-Type": "application/json"} # Tells the server that you're sending JSON-formatted data
MODEL = "llama3.2"
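The two commented-out constants refer to Ollama's REST endpoint, which this notebook does not call directly because it uses the `ollama` Python package instead. For reference, here is a hedged sketch of what a direct call might look like. It assumes the `/api/chat` endpoint returns a JSON object with a `message.content` field when `stream` is `False`; the helper names `build_payload` and `chat_via_http` are hypothetical, and the standard library's `urllib` is used rather than `requests` to keep the example dependency-free.

```python
import json
from urllib.request import Request, urlopen

OLLAMA_API = "http://localhost:11434/api/chat"
HEADERS = {"Content-Type": "application/json"}
MODEL = "llama3.2"


def build_payload(messages, model=MODEL):
    """Assemble the JSON body for a single, non-streamed chat reply."""
    return {"model": model, "messages": messages, "stream": False}


def chat_via_http(messages, api_url=OLLAMA_API, timeout=120):
    """POST the messages to the chat endpoint and return the reply text."""
    data = json.dumps(build_payload(messages)).encode("utf-8")
    request = Request(api_url, data=data, headers=HEADERS, method="POST")
    with urlopen(request, timeout=timeout) as resp:
        reply = json.loads(resp.read().decode("utf-8"))
    # Assumption: the reply carries the assistant text under message.content
    return reply["message"]["content"]
```

Here `messages` is a list of role/content dictionaries in the format described in the Messages section of this notebook.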

In [6]:
class Website:
    """A utility class representing a website that we have scraped."""
    url: str
    title: str
    text: str

    def __init__(self, url):
        """Create this Website object from the given URL using the BeautifulSoup library."""
        self.url = url
        response = requests.get(url)  # fetch the raw HTML content of the page
        soup = BeautifulSoup(response.content, "html.parser")
        self.title = soup.title.string if soup.title else "No title found!"

        # Remove tags that carry no readable text; decompose() must be called to take effect
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()

        # Collect all visible text from the cleaned HTML tree and assign it to self.text
        self.text = soup.body.get_text(separator="\n", strip=True)
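To see the tag removal and text extraction in isolation, here is a small sketch that runs the same two steps on an inline HTML string instead of a live page. The HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# A made-up page with a script and a style block mixed into the body
html = """
<html><head><title>Demo page</title></head>
<body>
  <h1>Hello</h1>
  <script>console.log("tracking");</script>
  <style>h1 { color: red; }</style>
  <p>Visible paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Calling soup.body([...]) is shorthand for soup.body.find_all([...])
for tag in soup.body(["script", "style", "img", "input"]):
    tag.decompose()  # removes the tag and its contents from the tree

text = soup.body.get_text(separator="\n", strip=True)
print(text)
```

This prints `Hello` and `Visible paragraph.` on separate lines, and the `script` and `style` tags are no longer present in `soup`.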

In [9]:
# Let's test this out
hg = Website("https://edwarddonner.com/")
print(hg.title)
print(hg.text)

Home - Edward Donner
Home
Connect Four
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.
We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve
patented
our matching model, and our award-winning platform has happy customers and tons of press coverage.
Connec

## Types of prompts

Models like GPT-4o have been trained to receive instructions in a particular way.

They expect to receive:

**A system prompt** that tells them what task they are performing and what tone they should use

**A user prompt** -- the conversation starter that they should reply to

In [11]:
# Defining our system prompt

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [12]:
# A function that writes a user prompt asking for a summary of a website

def user_prompt(website):
    # Use a local name that does not shadow the function itself
    prompt = f"You are looking at a website titled {website.title}.\n"
    prompt += "The contents of this website are as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    prompt += website.text
    return prompt

## Messages

The API from Ollama expects the same message format as OpenAI:

```
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```

In [14]:
# See how this function creates exactly the format above

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt(website)}
    ]

## Time to bring it together with Ollama

In [15]:
def summarize(url):
    website = Website(url)
    messages = messages_for(website)
    response = ollama.chat(model = MODEL, messages = messages)
    return response['message']['content']

In [16]:
# A function to display this nicely in the Jupyter output, using markdown

def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [17]:
display_summary("https://cnn.com")

**Summary of the Website**
====================================

The website is a news portal from CNN, providing up-to-date information on various topics such as:

* **News**: Breaking news headlines, including stories on Ukraine-Russia War, Israel-Hamas War, and more.
* **Politics**: Analysis and updates on US Politics, European politics, and international relations.
* **Business**: News and insights on markets, economy, and business trends.
* **Health**: Updates on various health topics, including wellness, fitness, and disease prevention.
* **Science**: Breaking news on scientific discoveries, breakthroughs, and space exploration.
* **Entertainment**: Latest news on movies, television shows, music, and celebrity gossip.

**Featured Stories**
-------------------

* Ukraine-Russia War: Reports of renewed air attacks in Kyiv, with at least 13 people killed in drone and ballistic missile strikes.
* US Politics: Trump announces 'partnership' between US Steel and Nippon Steel, amid controversy over his presidency.
* Climate Change: A fungi that can 'eat you from the inside out' could spread as the world heats up, according to a recent study.

**Other Notable Stories**
-------------------------

* SpaceX cleared to launch Starship test flight after two explosive failures.
* US Supreme Court pauses attempt by lower court to force DOGE to provide records.
* Flat shoes allowed at Cannes Film Festival for the first time.

We have successfully scraped the website and displayed its summary using Markdown formatting.

## Points to note

Note that this will only work on websites that can be scraped using this simplistic approach.

Websites rendered with JavaScript, such as React apps, return little or no text, because their content is built in the browser after the page loads. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!)

Also, websites protected with CloudFront (and similar services) may give 403 errors!
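One way to make such failures visible is to check the HTTP status before parsing, so that an error page is never summarized by mistake. The sketch below is an assumption-laden illustration rather than part of the original notebook: the helper name `fetch_html`, the browser-like `User-Agent` string and the ten-second timeout are all hypothetical choices, and sending a `User-Agent` does not guarantee that a protected site will let the request through.

```python
import requests

# Assumption: a browser-like User-Agent; some sites reject the library default
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or raise if the server refuses the request."""
    response = requests.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses, including 403
    response.raise_for_status()
    return response.content
```

The `Website` constructor could call a helper like this in place of the bare `requests.get(url)`, and the caller could catch `requests.HTTPError` to report which URL was blocked.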