## Exercise Objective

The goal of this exercise is to build a **simple but complete pipeline** that combines:
1. **Web scraping**, to collect textual content from a website
2. **Large Language Models**, to process and summarize the extracted content

More specifically, the workflow is:
- fetch the raw content of a web page,
- pass the content to an LLM through an API call,
- select a model explicitly,
- generate a concise and meaningful summary of the website.

This exercise is not focused on performance optimization, but on understanding how different components (scraping, prompting, model selection, and output visualization) interact in a real-world generative AI scenario.


In [None]:
import os
from dotenv import load_dotenv
from IPython.display import Markdown, display
from openai import OpenAI
from week1.scraper import fetch_website_contents

In [None]:
# Load environment variables from a file called .env

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start with sk-proj-; please check you're using the right key - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

## A first call

In [None]:
# Write a message
message = "Hello, GPT! This is my first ever message to you! Hi!"

messages = [{"role": "user", "content": message}]

# The first call
openai = OpenAI()

response = openai.chat.completions.create(model="gpt-5-nano", messages=messages)
response.choices[0].message.content

## Web Scraping: Collecting Website Content

The first step of the pipeline is to **retrieve textual data from a web page**.

Web scraping allows us to transform unstructured HTML content into raw text that can be processed by a language model. At this stage, the goal is not perfect text cleaning, but rather obtaining a representative snapshot of the page content.
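
To make the HTML-to-text idea concrete, here is a minimal sketch that uses only Python's standard library. It is not the implementation of `fetch_website_contents` (that module is not shown here); `TextExtractor`, `html_to_text`, and the `max_chars` limit are hypothetical names chosen for illustration, and the sample HTML is made up so that the snippet runs without network access.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags whose content is never rendered."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html, max_chars=2000):
    """Hypothetical helper: reduce an HTML string to plain text, cut at max_chars."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()[:max_chars]


# Made-up page used only for this illustration
sample_html = """
<html>
  <head><title>Example page</title><style>body { color: red; }</style></head>
  <body>
    <h1>Welcome</h1>
    <script>console.log("ignored");</script>
    <p>This is the <b>main</b> text.</p>
  </body>
</html>
"""
print(html_to_text(sample_html))
```

The truncation is a deliberate simplification: it keeps the prompt small, in line with the goal of a representative snapshot rather than a perfectly cleaned text.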

This step highlights an important practical aspect:
> Large Language Models do not access the web directly; they operate only on the text that we explicitly provide as input.


In [None]:
# Let's try out this utility
website_path = "https://it.wikipedia.org/wiki/The_Walt_Disney_Company"
content = fetch_website_contents(website_path)
print(content)

## Prompt Design: System Prompt vs User Prompt

The interaction with the language model is driven by **two different types of prompts**:

### System Prompt
The system prompt defines **how the model should behave**.
It sets global instructions such as:
- the role of the model (e.g. assistant, analyst, summarizer),
- the expected style of the response,
- constraints on tone, length, or level of detail.

In other words, the system prompt provides the **behavioral context** for the model.

### User Prompt
The user prompt represents the **starting point of the conversation**.
It contains the actual request, such as:
- what task to perform,
- what content to analyze,
- what output is expected.

The user prompt is interpreted *within the boundaries established by the system prompt*.

### Why this distinction matters
Separating system and user prompts improves:
- clarity of intent,
- controllability of the model’s behavior,
- reproducibility of results.

This distinction is especially important when building structured or multi-step LLM applications.
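
As a small illustration of the separation, the sketch below builds two message lists that share the same user prompt but differ in the system prompt. The helper `build_messages` and both example prompts are hypothetical; no API call is made, so the snippet only shows how the request is structured.

```python
def build_messages(system_prompt, user_prompt):
    """Return a chat message list: system prompt first, then the user prompt."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


# Hypothetical prompts, used only to show the effect of swapping the system prompt
user_prompt = "Summarize this text: The quick brown fox jumps over the lazy dog."
formal = build_messages("You are a precise analyst. Answer in one sentence.", user_prompt)
playful = build_messages("You are a playful assistant. Answer with a joke.", user_prompt)

# The task (user message) is identical; only the behavioral context differs
print(formal[1] == playful[1])
print(formal[0]["content"])
print(playful[0]["content"])
```

Keeping the task fixed while varying only the system message makes it easier to attribute differences in the output to the behavioral instructions.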


In [None]:
# Define the system prompt

system_prompt = """
You are a snarky assistant that analyzes the contents of a website,
and provides a short, snarky, humorous summary, ignoring text that might be navigation related.
Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.
"""

# Define the user prompt

user_prompt_prefix = """
Here are the contents of a website.
Provide a short summary of this website.
If it includes news or announcements, then summarize these too.

"""

In [None]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_prefix + website}
    ]

In [None]:
# Pass the scraped text, not the URL: the model only sees what is in the prompt
messages_for(content)

## LLM Interaction and Model Selection

Once the content is collected and the prompts are defined, the next step is to **call the OpenAI API**.

In this exercise, the model is explicitly selected rather than relying on defaults. This makes the interaction:
- more transparent,
- easier to compare across different models,
- reproducible over time.

The language model receives:
- the system prompt (behavioral instructions),
- the user prompt (task definition),
- the scraped website content as input.

The output is a synthesized summary generated by the model based on these inputs.
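
One way to make model comparison explicit is to treat the model name as a parameter. The following is a sketch under stated assumptions: `summarize_with_model`, `compare_models`, and `make_stub_client` are hypothetical helpers, and the stub client only imitates the shape of `client.chat.completions.create(...)` so that the example runs offline without an API key.

```python
from types import SimpleNamespace


def summarize_with_model(client, model, system_prompt, user_prompt):
    """Send one system + user exchange to `model` and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


def compare_models(client, models, system_prompt, user_prompt):
    """Run the same prompts against each model name; return replies keyed by model."""
    return {
        model: summarize_with_model(client, model, system_prompt, user_prompt)
        for model in models
    }


def make_stub_client():
    """Offline stand-in (an assumption for this sketch) that echoes the model name."""
    def create(model, messages):
        reply = SimpleNamespace(content=f"[{model}] received {len(messages)} messages")
        return SimpleNamespace(choices=[SimpleNamespace(message=reply)])

    return SimpleNamespace(chat=SimpleNamespace(completions=SimpleNamespace(create=create)))


results = compare_models(
    make_stub_client(),
    ["model-a", "model-b"],  # placeholder model names
    "You summarize websites.",
    "Here are the contents of a website...",
)
for model, reply in results.items():
    print(model, "->", reply)
```

With a real key, the stub could be replaced by `OpenAI()` and the placeholder names by the models used in this notebook, such as `"gpt-4.1-mini"` and `"gpt-5-nano"`.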


In [None]:
url_to_scrape = website_path

In [None]:
# Call the OpenAI API

def summarize(url):
    website = fetch_website_contents(url)
    response = openai.chat.completions.create(
        model = "gpt-4.1-mini",
        messages = messages_for(website)
    )
    return response.choices[0].message.content

## Visualizing the Result

The final output of the pipeline is the summary generated by the language model.

Instead of printing plain text, the result is displayed using **Markdown rendering**, which improves readability and allows structured formatting (headings, bullet points, emphasis).

In [None]:
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [None]:
display_summary(url_to_scrape)