# 🧾 AI News Scraper

This project is a simple web application that scrapes a news website and generates a short briefing using OpenAI's GPT-4o family of models (the code below uses `gpt-4o-mini`).
It works by:

- Scraping the homepage of a given news site,
- Extracting article titles and links,
- Downloading full content of selected articles,
- Sending the text to GPT-4o with a custom prompt,
- Displaying the result in a readable format using Gradio.

The goal is to create a fast and simple way to get summarized, AI-generated news briefings — useful for analysts, researchers, or anyone who wants quick insight into current events.

⚙️ Technologies used:
- Python (requests, BeautifulSoup)
- OpenAI API (GPT-4o)
- Gradio (UI)
- dotenv (API key handling)

## 📦 1. Imports & Setup

In [None]:
import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

## 🧩 Loading API Keys

In [17]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')
if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key? Please visit the troubleshooting notebook!")
MODEL = 'gpt-4o-mini'
openai = OpenAI()

API key looks good so far


## 🧱 2. Website Scraper Class

This section defines the `Website` class, which is responsible for downloading and parsing the content of a given news website.


In [18]:
import requests
from bs4 import BeautifulSoup
from typing import List

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    url: str
    title: str
    description: str
    text: str
    body: str
    links: List[str]
    link_titles: List[tuple]

    def __init__(self, url):
        self.url = url
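        # A timeout keeps a slow or unresponsive site from hanging the notebook
        response = requests.get(url, headers=headers, timeout=10)
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        # soup.title.string is None when <title> is empty or contains nested tags
        self.title = soup.title.string.strip() if soup.title and soup.title.string else "No title found"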
        meta = soup.find("meta", attrs={"name": "description"})
        self.description = meta["content"].strip() if meta and meta.get("content") else ""
        self.text = self.extract_article_text(soup)
        self.links = []
        self.link_titles = []

        for link in soup.find_all("a", href=True):
            href = link["href"]
            title = link.get_text(strip=True)

            if href.startswith("http"):
                self.links.append(href)
                if title:
                    self.link_titles.append((title, href))

    def extract_article_text(self, soup):
        selectors = [
            "article p",
            "div.article-content p",
            "div.wp-content-text-raw p",
            "p"
        ]

        for selector in selectors:
            paragraphs = soup.select(selector)
            if paragraphs:
                break
        else:
            return ""

        text = "\n".join(p.get_text(strip=True) for p in paragraphs)
        return text.strip()

    def get_content(self):
        return (
            f"Webpage Title:\n{self.title}\n\n"
            f"Meta Description:\n{self.description}\n\n"
            f"Article Content:\n{self.text}\n"
        )
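
The constructor above keeps only `href` values that already start with `http`, so relative links such as `/news/article-1` are dropped. A minimal sketch of one way to resolve them first, using only the standard library; `normalize_links` is a hypothetical helper, not part of the notebook, and the URLs are placeholders:

```python
from urllib.parse import urljoin, urlparse

def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) URLs in order."""
    seen = set()
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href.strip())
        # Skip schemes that cannot be fetched as pages (mailto:, javascript:, tel:)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved

sample_hrefs = ["/a", "b.html", "https://other.org/x", "mailto:me@example.com", "/a"]
print(normalize_links("https://example.com/news/", sample_hrefs))
```

Inside `Website.__init__`, the same idea would amount to calling `urljoin(self.url, href)` before the `startswith("http")` check.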


## 🧠 3. System Prompt for Link Extraction

This prompt is sent to the GPT model as a **system message**, telling it how to behave.

In [19]:
link_system_prompt = """

You are to collect from the given informational website (its main page) the titles and links to articles. Format the collected data as a JSON structure like the example below:

{
  "links": [
    {"title": "Headline text", "url": "https://full.url/article1"},
    ...
  ]
}
"""


## ✏️ 4. User Prompt Generator

This function builds the **user message** (user prompt) that is sent to the GPT model, together with the system prompt.


In [20]:
def get_news_links_user_prompt(website):
    user_prompt = f"Here are the links found on the website: {website.url}\n"
    user_prompt += "Each line includes the article title and its full URL.\n"
    user_prompt += "Please extract only the most important news links based on the system instructions.\n\n"

    if hasattr(website, 'link_titles') and website.link_titles:
        for title, link in website.link_titles:
            if len(title) > 5:
                user_prompt += f"- {title} → {link}\n"
    else:
        for link in website.links:
            user_prompt += f"- {link}\n"

    return user_prompt
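
A homepage can carry hundreds of links, and every line added to the prompt costs tokens. The following is a rough sketch of how the link lines could be deduplicated and capped before they are appended. `format_link_lines`, `min_title_len`, and `max_links` are illustrative names and defaults, not part of the notebook:

```python
def format_link_lines(link_titles, min_title_len=6, max_links=50):
    """Turn (title, url) pairs into prompt lines, skipping short titles and repeated URLs."""
    seen_urls = set()
    lines = []
    for title, url in link_titles:
        if len(title) < min_title_len or url in seen_urls:
            continue
        seen_urls.add(url)
        lines.append(f"- {title} → {url}")
        if len(lines) >= max_links:
            break  # stop early so the prompt stays bounded
    return lines

pairs = [
    ("Home", "https://example.com/"),
    ("Markets rally after rate decision", "https://example.com/markets"),
    ("Markets rally after rate decision", "https://example.com/markets"),
    ("New budget proposal announced", "https://example.com/budget"),
]
print("\n".join(format_link_lines(pairs)))
```

The default `min_title_len=6` mirrors the `len(title) > 5` check in the function above.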


## 🔗 5. Fetching Clean Article Links via GPT

This function sends the parsed website data to the OpenAI GPT model and returns a clean list of important article links.


In [21]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model = MODEL,
        messages = [
            {"role":"system", "content":link_system_prompt},
            {"role":"user", "content":get_news_links_user_prompt(website)}
        ],
        response_format={"type":"json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)
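
Even with `response_format={"type":"json_object"}`, the model may omit the `links` key or return entries without a usable URL. A minimal sketch of a defensive check on the parsed payload follows; `parse_links_payload` is a hypothetical helper, and the expected shape is the one described in `link_system_prompt`:

```python
import json

def parse_links_payload(raw):
    """Return a list of {"title", "url"} dicts, or [] if the payload is unusable."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return []
    if not isinstance(raw, dict):
        return []
    links = raw.get("links", [])
    if not isinstance(links, list):
        return []
    cleaned = []
    for item in links:
        if not isinstance(item, dict):
            continue
        url = item.get("url")
        # Keep only entries whose URL is an absolute http(s) address
        if isinstance(url, str) and url.startswith(("http://", "https://")):
            cleaned.append({"title": str(item.get("title") or "No title"), "url": url})
    return cleaned

sample = '{"links": [{"title": "Headline", "url": "https://example.com/a"}, {"title": "Broken"}]}'
print(parse_links_payload(sample))
```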

## 🔍 6. Function: `get_all_details(url, max_articles=10)`

This function orchestrates the full scraping and summarization pipeline.

**Steps:**
1. Downloads and parses the landing page using the `Website` class,
2. Sends extracted links to GPT (via `get_links`) to identify key articles,
3. Iteratively fetches the full content of up to `max_articles` articles,
4. Combines all text into a structured markdown report.

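The output is truncated to 65,000 characters to keep the prompt within the model's context limits. Note that `get_brochure_user_prompt` later cuts the full prompt to 20,000 characters, so that lower limit is the one that effectively applies.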

In [36]:
import json

def get_all_details(url, max_articles=10):
    result = "Landing page:\n"

    try:
        result += Website(url).get_content()
    except Exception as e:
        result += f"\n Could not fetch landing page: {e}\n"
    # get_links() already returns a parsed dict; guard the call because it
    # re-fetches the landing page and queries the API, either of which can fail.
    try:
        links = get_links(url).get("links", [])
    except Exception as e:
        print(f" Could not extract links: {e}")
        links = []

    print(f"Parsed links count: {len(links)}")

    for link in links[:max_articles]:
        title = link.get("title", "No title")
        article_url = link.get("url")

        if not article_url:
            continue

        print(f"Scraping article: {title} → {article_url}")

        try:
            result += f"\n\n\n---\n\n\n"
            result += f"### {title}\n"
            result += Website(article_url).get_content()
            result += f"\n\n[📖 Read full article →]({article_url})\n"
        except Exception as e:
            result += f"{title} (Error: {e})\n"

    return result[:65000]
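
The hard slice `result[:65000]` can cut an article off mid-sentence. As a sketch of one alternative, assuming paragraphs are separated by blank lines, the text can be cut at the last paragraph break that fits the limit. `truncate_at_boundary` is a hypothetical helper, not part of the notebook:

```python
def truncate_at_boundary(text, limit, separator="\n\n"):
    """Cut text to at most `limit` characters, preferring the last separator before the limit."""
    if len(text) <= limit:
        return text
    cut = text.rfind(separator, 0, limit)
    if cut <= 0:
        # No usable break point: fall back to a plain slice
        return text[:limit]
    return text[:cut]

report = "First article.\n\nSecond article.\n\nThird article."
print(truncate_at_boundary(report, 35))
```

With this helper, the last line of `get_all_details` could become `return truncate_at_boundary(result, 65000)`.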


In [25]:

from IPython.display import Markdown, display

#display(Markdown(get_all_details("https://businessinsider.com.pl")))

## 🧠 7. System Prompt: News Analyst Assistant

This system prompt defines the role and tone of the GPT model when generating the final news summary.

### 📝 Purpose:
Instructs the model to act like a **personal news analyst** — not just summarizing facts, but explaining their importance and implications.


In [26]:
system_prompt = """You are my personal information assistant and news analyst. Your task is to review the scraped content from a news homepage and its linked articles.

Based on that material, produce a clear, insightful, and human-sounding summary of recent developments in Poland and globally. Write as if you’re speaking directly to me — explaining what’s happening, why it matters, and what possible consequences these events might have.

Please include:
- A summary of key **facts and developments** (from the articles),
- An analysis of their **significance and potential impact** (political, economic, social, etc.),
- Your own **interpretations or reflections**, when appropriate.

Avoid robotic summaries. Speak in your own words — like a smart, well-informed advisor helping me stay ahead of the curve.

Use **Markdown format** (headings, bullet points, paragraphs).

"""


## 📝 8. User Prompt for Final Briefing

This function generates the full **user message** for the GPT model, combining:

- High-level instructions for summarization and analysis,
- The full text scraped from the homepage and selected articles.

### 🎯 Purpose:
To give the model enough context to produce a **clear, structured, and insightful news briefing**.


In [27]:
def get_brochure_user_prompt(url):
    user_prompt = f"## Source: {url}\n\n"
    user_prompt += (
    f"You are reviewing the main content and linked articles from the website: **{url}**.\n\n"
    "Below is the full scraped text from the homepage, followed by selected full articles.\n"
    "Your goal is to produce an intelligent, insightful, and well-structured market briefing.\n\n"
    "Please:\n"
    "- Identify key developments and summarize them clearly,\n"
    "- Highlight relevant trends and explain their potential impact,\n"
    "- Reflect on possible political, economic, or societal consequences,\n"
    "- Use your own judgment and speak in your own words, like a smart advisor.\n\n"
    "Write in clear, professional English — as if advising an informed investor.\n"
    "Format the response in **Markdown** (with headings, bullets, etc.).\n\n"
    )
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:20_000]
    return user_prompt

## 📤 9. Streaming the AI-Generated Briefing

This function sends the full prompt to GPT and **streams the generated response** in real time.

### 🔄 What it does:
- Combines the `system_prompt` (defines the model’s role) and the `user_prompt` (full content + instructions),
- Sends both to GPT using the `stream=True` parameter,
- Receives the response in small chunks,
- Reconstructs the final output piece by piece,
- Cleans up unnecessary formatting (e.g. code blocks),
- Uses `yield` to display results live — ideal for web interfaces like Gradio.

This is the final step, where the AI-generated market briefing is produced.

In [28]:
def stream_news_summary(url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': get_brochure_user_prompt(url)}
        ],
        stream=True
    )

    response = ""

    for chunk in stream:
        content = chunk.choices[0].delta.content or ''
        response += content
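        # Strip only code-fence markers; a bare .replace("markdown", "") would also
        # delete the word "markdown" from the briefing text itself.
        cleaned = response.replace("```markdown", "").replace("```", "")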
        yield cleaned
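
The accumulate-and-yield pattern can be tried without calling the API. The sketch below fakes the stream with `SimpleNamespace` objects shaped like the chunks read above (`chunk.choices[0].delta.content`) and removes only a fence that wraps the whole answer. `fake_stream`, `strip_outer_fence`, and `accumulate` are illustrative names, not part of the notebook or the OpenAI library:

```python
from types import SimpleNamespace

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def fake_stream(pieces):
    """Stand-in for the API stream: yields objects shaped like response chunks."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

def strip_outer_fence(text):
    """Remove a leading fence (optionally tagged 'markdown') and a trailing fence only."""
    stripped = text.strip()
    for opener in (FENCE + "markdown", FENCE):
        if stripped.startswith(opener):
            stripped = stripped[len(opener):]
            break
    if stripped.endswith(FENCE):
        stripped = stripped[:-len(FENCE)]
    return stripped.strip()

def accumulate(stream):
    """Yield the cleaned text so far after every chunk, as a Gradio generator would."""
    response = ""
    for chunk in stream:
        response += chunk.choices[0].delta.content or ""  # content may be None
        yield strip_outer_fence(response)

pieces = [FENCE + "markdown\n", "# Brief", "ing\n", None, "Use markdown wisely.", "\n" + FENCE]
snapshots = list(accumulate(fake_stream(pieces)))
print(snapshots[-1])
```

Because each yielded value is the full text so far rather than the latest fragment, a UI component can simply replace its contents on every update.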


## 🖥️ 10. User Interface (Gradio App)

This section creates a simple web interface using **Gradio**, allowing users to generate AI-based news briefings in real time.


In [38]:
import gradio as gr
custom_css = """
#brief-box {
    background-color: #1f1f1f;
    color: white;
    padding: 20px;
    font-size: 18px;
    border-radius: 12px;
    min-height: 400px;
    max-height: 600px;
    overflow-y: auto;
}
"""
with gr.Blocks(css=custom_css, title="News Generator") as ui:
    gr.Markdown("NewsGrid News Generator")
    gr.Markdown("Enter a news URL and get a real-time, AI-generated news report based on its content.")

    with gr.Row():
        with gr.Column(scale=1):
            url_input = gr.Textbox(label="🔗 News Website URL", placeholder="https://example.com", value="https://businessinsider.com.pl")
            generate_btn = gr.Button("⚡ Generate Briefing")
            clear_btn = gr.Button("🧹 Clear")

        with gr.Column(scale=2):
            output_box = gr.Markdown("Waiting for URL...", elem_id="brief-box",)

    generate_btn.click(fn=stream_news_summary, inputs=[url_input], outputs=[output_box])
    clear_btn.click(fn=lambda: "Waiting for URL...", outputs=[output_box], queue=False)

ui.launch()


* Running on local URL:  http://127.0.0.1:7865
* To create a public link, set `share=True` in `launch()`.


