## Exercise Objective

The goal of this exercise is to build a **simple but complete pipeline** that combines:
1. **Web scraping**, to collect textual content from a website
2. **Large Language Models**, to process and summarize the extracted content

More specifically, the workflow is:
- fetch the raw content of a web page,
- pass the content to an LLM through an API call,
- select a model explicitly,
- generate a concise and meaningful summary of the website.

This exercise is not focused on performance optimization, but on understanding how different components (scraping, prompting, model selection, and output visualization) interact in a real-world generative AI scenario.


In [10]:
import os
from dotenv import load_dotenv
from IPython.display import Markdown, display
from openai import OpenAI
from scraper import fetch_website_contents

In [3]:
# Load environment variables from a file called .env

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif api_key.strip() != api_key:
    # Checked before the prefix test, so leading whitespace is reported correctly
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start with sk-proj-; please check you're using the right key - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

API key found and looks good so far!


## A first call

In [5]:
# Write a message
message = "Hello, GPT! This is my first ever message to you! Hi!"

messages = [{"role": "user", "content": message}]

messages

# The first call
openai = OpenAI()

response = openai.chat.completions.create(model="gpt-5-nano", messages=messages)
response.choices[0].message.content

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
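
The error above carries a structured payload, and its `code` field distinguishes an exhausted quota from other failures. A minimal sketch of reading that payload follows; `describe_api_error` is a hypothetical helper, not part of the OpenAI library, and it assumes the payload has the same shape as the dictionary printed in the error message.

```python
def describe_api_error(payload):
    """Return a short hint for an error payload shaped like the one shown above."""
    error = payload.get("error", {})
    code = error.get("code")
    if code == "insufficient_quota":
        # Usually a billing issue rather than a transient limit
        return "Quota exhausted: check plan and billing details before retrying."
    return f"API error ({code}): {error.get('message', 'no message')}"


# Payload copied from the error output (message shortened)
payload = {
    "error": {
        "message": "You exceeded your current quota, please check your plan and billing details.",
        "type": "insufficient_quota",
        "param": None,
        "code": "insufficient_quota",
    }
}
print(describe_api_error(payload))
```

Since `insufficient_quota` generally points to billing rather than request frequency, re-running the cell is unlikely to help until the account has available credit.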

## Web Scraping: Collecting Website Content

The first step of the pipeline is to **retrieve textual data from a web page**.

Web scraping allows us to transform unstructured HTML content into raw text that can be processed by a language model. At this stage, the goal is not perfect text cleaning, but rather obtaining a representative snapshot of the page content.
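
To make the HTML-to-text step concrete, here is a minimal sketch using only the Python standard library. It is not the implementation of the `scraper` module used in this notebook, which is not shown here; `TextExtractor` and `html_to_text` are hypothetical names, and the sample HTML is invented for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Invented sample page: a title, a navigation link, a paragraph, a script
sample_html = """
<html><head><title>Example page</title><style>p {color: red}</style></head>
<body><nav>Home</nav><p>Hello <b>world</b>.</p><script>var x = 1;</script></body></html>
"""
print(html_to_text(sample_html))
```

Note that the navigation text (`Home`) is kept: this sketch makes no attempt to separate menus from body content, which is consistent with the goal of a representative snapshot rather than a clean extraction.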

This step highlights an important practical aspect:
> Large Language Models do not access the web directly; they operate only on the text that we explicitly provide as input.


In [35]:
# Let's try out this utility
website_path = "https://it.wikipedia.org/wiki/The_Walt_Disney_Company"
content = fetch_website_contents(website_path)
print(content)

The Walt Disney Company - Wikipedia

Vai al contenuto
Menu principale
Menu principale
sposta nella barra laterale
nascondi
Navigazione
Pagina principale
Ultime modifiche
Una voce a caso
Nelle vicinanze
Vetrina
Aiuto
Sportello informazioni
Pagine speciali
Comunità
Portale Comunità
Bar
Il Wikipediano
Contatti
Ricerca
Ricerca
Aspetto
Fai una donazione
registrati
entra
Strumenti personali
Fai una donazione
registrati
entra
Indice
sposta nella barra laterale
nascondi
Inizio
1
Storia
Attiva/disattiva la sottosezione Storia
1.1
1923-1966: Periodo Walt Disney
1.1.1
1923-1927: i primi cortometraggi
1.1.2
1928-1937: la creazione di Topolino e il successo
1.1.3
1937-1954: i primi lungometraggi e la crisi della Seconda guerra mondiale
1.1.4
1955-1966: Disneyland e morte di Walt Disney
1.2
1966-1983: l'era di Roy Disney e Card Walker
1.2.1
1966-1971: l'apertura di Walt Disney World Resort
1.2.2
1971-1983: morte di Roy Disney, passaggio a Card Walker e la creazione di Tokyo Disneyland
1.3
1984-2005:

## Prompt Design: System Prompt vs User Prompt

The interaction with the language model is driven by **two different types of prompts**:

### System Prompt
The system prompt defines **how the model should behave**.
It sets global instructions such as:
- the role of the model (e.g. assistant, analyst, summarizer),
- the expected style of the response,
- constraints on tone, length, or level of detail.

In other words, the system prompt provides the **behavioral context** for the model.

### User Prompt
The user prompt represents the **starting point of the conversation**.
It contains the actual request, such as:
- what task to perform,
- what content to analyze,
- what output is expected.

The user prompt is interpreted *within the boundaries established by the system prompt*.

### Why this distinction matters
Separating system and user prompts improves:
- clarity of intent,
- controllability of the model’s behavior,
- reproducibility of results.

This distinction is especially important when building structured or multi-step LLM applications.
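
As a small illustration of the separation, the sketch below pairs the same user request with two different system prompts. `build_messages` is a hypothetical helper and the prompt texts are invented; only the `role`/`content` message structure is taken from this notebook.

```python
def build_messages(system_prompt, user_prompt):
    """Pair one system prompt with one user prompt in chat-message form."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


request = "Summarize this page: <page text goes here>"
formal = build_messages("You are a concise, formal analyst.", request)
casual = build_messages("You are a friendly, informal assistant.", request)

# Only the system message differs; the user message is identical
print(formal[0]["content"])
print(casual[0]["content"])
print(formal[1] == casual[1])
```

Keeping the behavioral instructions in their own message means the task can stay fixed while the style is varied, or the other way around.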


In [14]:
# Define the system prompt

system_prompt = """
You are a snarky assistant that analyzes the contents of a website,
and provides a short, snarky, humorous summary, ignoring text that might be navigation related.
Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.
"""

# Define the user prompt

user_prompt_prefix = """
Here are the contents of a website.
Provide a short summary of this website.
If it includes news or announcements, then summarize these too.

"""

In [36]:
def messages_for(website):
    """Build the chat messages: system prompt, then user prompt with the page text appended."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_prefix + website}
    ]

In [37]:
# Preview the message structure only: the URL is passed here in place of the scraped text
messages_for(website_path)

[{'role': 'system',
  'content': '\nYou are a snarky assistant that analyzes the contents of a website,\nand provides a short, snarky, humorous summary, ignoring text that might be navigation related.\nRespond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n'},
 {'role': 'user',
  'content': '\nHere are the contents of a website.\nProvide a short summary of this website.\nIf it includes news or announcements, then summarize these too.\n\nhttps://it.wikipedia.org/wiki/The_Walt_Disney_Company'}]

## LLM Interaction and Model Selection

Once the content is collected and the prompts are defined, the next step is to **call the OpenAI API**.

In this exercise, the model is explicitly selected rather than relying on defaults. This makes the interaction:
- more transparent,
- easier to compare across different models,
- easier to reproduce, because the model name is recorded in the code.

The language model receives:
- the system prompt (behavioral instructions),
- the user prompt (task definition),
- the scraped website content, appended to the user prompt.

The output is a synthesized summary generated by the model based on these inputs.
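
One way to compare models is to send identical messages to each of them. The sketch below assumes a hypothetical `complete(model, messages)` callable that returns the response text; `fake_complete` is an offline stand-in so the example runs without an API key, and `compare_models` is an invented name.

```python
def compare_models(models, messages, complete):
    """Run the same messages through each model and collect the outputs.

    `complete` is a hypothetical callable (model, messages) -> str.
    """
    return {model: complete(model, messages) for model in models}


def fake_complete(model, messages):
    # Offline stand-in: no network call, just echoes what it received
    return f"[{model}] summary of {len(messages)} messages"


messages = [
    {"role": "system", "content": "You summarize websites."},
    {"role": "user", "content": "Summarize: example page text"},
]
results = compare_models(["gpt-4.1-mini", "gpt-5-nano"], messages, fake_complete)
for model, summary in results.items():
    print(model, "->", summary)
```

With a real client, `complete` could wrap the same call used in this notebook, for example `lambda model, messages: openai.chat.completions.create(model=model, messages=messages).choices[0].message.content`.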


In [38]:
url_to_scrape = website_path

In [39]:
# Call the OpenAI API

def summarize(url):
    """Scrape the page at `url` and return the model's summary as a string."""
    website = fetch_website_contents(url)
    response = openai.chat.completions.create(
        model="gpt-4.1-mini",  # model selected explicitly
        messages=messages_for(website),
    )
    return response.choices[0].message.content

## Visualizing the Result

The final output of the pipeline is the summary generated by the language model.

Instead of printing plain text, the result is displayed using **Markdown rendering**, which improves readability and allows structured formatting (headings, bullet points, emphasis).

In [None]:
def display_summary(url):
    """Summarize the page at `url` and render the result as Markdown."""
    summary = summarize(url)
    display(Markdown(summary))

In [40]:
display_summary(url_to_scrape)

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}