<a href="https://colab.research.google.com/github/Saisuman55/E_commerce_website_minor_project/blob/main/ai-systems-engineering-1/unit-1/01-ai-systems-engineering-1-unit1-web-scraping-and-summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<img src="https://poorit.in/image.png" alt="Poorit" width="40" style="vertical-align: middle;"> <b>AI SYSTEMS ENGINEERING 1</b>

## Unit 1: Web Scraping and Summarization with LLMs

**CV Raman Global University, Bhubaneswar**  
*AI Center of Excellence*

---

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Poorit-Technologies/cvraman-ai-notebooks/blob/main/ai-systems-engineering-1/unit-1/01-ai-systems-engineering-1-unit1-web-scraping-and-summarization.ipynb)

</div>

---

### What You'll Learn

In this notebook, you will:

1. **Make your first OpenAI API call** and understand the Chat Completions API
2. **Learn about system and user prompts** and how to structure messages
3. **Build a web scraper** using BeautifulSoup to extract website content
4. **Create a website summarizer** that generates concise summaries using an LLM

**Duration:** ~1.5 hours

---

## 1. Environment Setup

Let's install the required packages and configure our API key.

In [1]:
# Install required packages
!pip install -q openai requests beautifulsoup4

In [2]:
# Import required libraries
import os
from getpass import getpass
from openai import OpenAI
from bs4 import BeautifulSoup
import requests
from IPython.display import Markdown, display

In [9]:
# Configure API Key for Gemini
api_key = getpass("Enter your Google API Key: ")

if api_key and api_key.strip():
    # Optional here: the OpenAI client reads this variable only when api_key is not passed explicitly
    os.environ['OPENAI_API_KEY'] = api_key
    GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"
    client = OpenAI(base_url=GEMINI_BASE_URL, api_key=api_key)
    MODEL = "gemini-2.5-flash"
    print(f"OpenAI client configured for Gemini model: {MODEL}")
else:
    print("No API key provided. Please enter your key to continue.")

Enter your Google API Key: ··········
OpenAI client configured for Gemini model: gemini-2.5-flash
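
If no key is entered, the setup cell above only prints a message, and later cells fail because `client` was never created. A minimal sketch of a stricter loader follows. The helper name `load_api_key` and the `GOOGLE_API_KEY` environment variable are assumptions made for illustration, not names this notebook defines.

```python
import os
from getpass import getpass


def load_api_key(env_var="GOOGLE_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting only if it is absent.

    Both the helper and the GOOGLE_API_KEY variable name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to an interactive prompt (hidden input in Colab/Jupyter)
        key = prompt("Enter your Google API Key: ").strip()
    if not key:
        raise ValueError("No API key provided.")
    return key


# Usage (uncomment in the notebook):
# api_key = load_api_key()
```

Raising an error stops execution at the point where the key is missing, which is usually easier to debug than a `NameError` several cells later.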


---

## 2. Your First API Call

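Let's start with a simple API call to see how to communicate with a chat model. We are calling Gemini through its OpenAI-compatible endpoint, so the request has the same shape as a call to an OpenAI model.

The Chat Completions API expects messages in this structure: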

```python
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```
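
The same structure extends to multi-turn conversations: earlier model replies are sent back with the `assistant` role so the model can see the history. The sketch below assembles such a list without calling the API. The helper `build_messages` and its `history` parameter are hypothetical names used only for illustration.

```python
def build_messages(system_prompt, user_prompt, history=None):
    """Assemble a Chat Completions-style messages list.

    `history` is an optional list of (user_text, assistant_text) pairs
    from earlier turns. This helper is hypothetical, not part of any library.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in history or []:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The newest user message always goes last
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages(
    "You are a helpful assistant.",
    "And what is that result times 3?",
    history=[("What is 2 + 2?", "2 + 2 = 4")],
)
for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`.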

In [10]:
# Simple API call
messages = [{"role": "user", "content": "Hello! This is my first message to you."}]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages
)

print(response.choices[0].message.content)

Hello! Welcome! It's great to hear from you.

How can I help you today, or what would you like to talk about? I'm ready when you are!


In [11]:
# Using system and user prompts
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"}
]

response = client.chat.completions.create(model=MODEL, messages=messages)
print(response.choices[0].message.content)

2 + 2 = 4


---

## 3. Web Scraping with BeautifulSoup

Before we can summarize websites, we need to extract their content. We'll use BeautifulSoup to parse HTML and extract readable text.

In [12]:
# Web scraping utility functions

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

def fetch_website_contents(url, max_chars=2000):
    """
    Fetch and return the title and text content of a website.
    Removes scripts, styles, and other non-text elements.
    """
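    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # Fail early on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.content, "html.parser")

    # soup.title.string is None when <title> is empty or contains nested tags
    title = soup.title.string if soup.title and soup.title.string else "No title found"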

    if soup.body:
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()
        text = soup.body.get_text(separator="\n", strip=True)
    else:
        text = ""

    return (title + "\n\n" + text)[:max_chars]


def fetch_website_links(url):
    """
    Return all links found on a webpage.
    """
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # Fail early on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.content, "html.parser")
    links = [link.get("href") for link in soup.find_all("a")]
    return [link for link in links if link]
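
`fetch_website_links` returns `href` values exactly as they appear in the page, so the list can mix absolute URLs, relative paths, in-page anchors, and `mailto:` links. A possible post-processing step, sketched with the standard library only, is shown below. The helper name `absolutize_links` and the `example.com` URLs are illustrative assumptions.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def absolutize_links(base_url, links):
    """Resolve hrefs against base_url and keep unique http(s) URLs in order."""
    seen = set()
    resolved = []
    for href in links:
        # Resolve relative paths, then drop any #fragment
        absolute = urldefrag(urljoin(base_url, href.strip())).url
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # Skips mailto:, javascript:, tel:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


sample = [
    "/research",
    "news",
    "https://example.com/docs",
    "mailto:hi@example.com",
    "/research",
    "#top",
]
print(absolutize_links("https://example.com/company/", sample))
```

In the notebook this could be applied as `absolutize_links(url, fetch_website_links(url))`.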

In [13]:
# Test the scraper
website_content = fetch_website_contents("https://anthropic.com")
print(website_content[:500] + "...")

Home \ Anthropic

Skip to main content
Skip to footer
Research
Economic Futures
Commitments
Initiatives
Claude's Constitution
Transparency
Responsible Scaling Policy
Trust center
Security and compliance
Learn
Learn
Anthropic Academy
Tutorials
Use cases
Engineering at Anthropic
Developer docs
Company
About
Careers
Events
News
Try Claude
Try Claude
Try Claude
Learn more about Claude
Products
Claude
Claude Code
Claude Developer Platform
Pricing
Contact sales
Models
Opus
Sonnet
Haiku
Log in
Claude.a...
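
The output above contains back-to-back duplicates such as `Learn` and `Try Claude`, which use up part of the character budget without adding information. One possible cleanup, sketched here under the assumption that only consecutive repeats should be removed, is a small helper (hypothetical name `collapse_repeated_lines`) applied to the scraped text before it is sent to the model.

```python
def collapse_repeated_lines(text):
    """Remove any line that is identical to the line directly before it.

    Runs of blank lines collapse to a single blank line as a side effect.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if kept and stripped == kept[-1]:
            continue  # Same as the previous kept line, so skip it
        kept.append(stripped)
    return "\n".join(kept)


sample = (
    "Home\n\nLearn\nLearn\nAnthropic Academy\n"
    "Try Claude\nTry Claude\nTry Claude\nLearn more about Claude"
)
print(collapse_repeated_lines(sample))
```

Non-adjacent repeats are kept on purpose, since a phrase that recurs in different parts of a page may be meaningful content.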


---

## 4. Building the Website Summarizer

Now let's combine web scraping with LLM capabilities to create a website summarizer.

### Types of Prompts

- **System Prompt**: Tells the model what task to perform and what tone to use
- **User Prompt**: The actual content or question to respond to

In [14]:
# Define prompts for summarization

SYSTEM_PROMPT = """
You are an expert content analyst that analyzes website contents
and provides concise, informative summaries.
Ignore navigation-related text and focus on the main content.
Respond in markdown format.
"""

USER_PROMPT_PREFIX = """
Here are the contents of a website.
Provide a clear summary of this website.
If it includes news or announcements, summarize these too.

"""

In [15]:
def create_messages(website_content):
    """Create the message structure for the API call."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT_PREFIX + website_content}
    ]

In [16]:
def summarize_website(url):
    """Fetch a website and generate a summary using GPT."""
    website_content = fetch_website_contents(url)
    response = client.chat.completions.create(
        model=MODEL,
        messages=create_messages(website_content)
    )
    return response.choices[0].message.content


def display_summary(url):
    """Display a formatted summary of a website."""
    summary = summarize_website(url)
    display(Markdown(summary))

In [17]:
# Test the summarizer
display_summary("https://anthropic.com")

This website belongs to Anthropic, a public benefit corporation focused on AI research and products that prioritize safety. Their mission is to secure the benefits of AI while mitigating its risks.

The site provides information on their products, models, and research initiatives, along with resources for developers and users.

**Key Offerings:**
*   **Products:** Claude, Claude Code, Claude Developer Platform.
*   **AI Models:** Opus, Sonnet, and Haiku.
*   **Resources:** Anthropic Academy, Tutorials, Use cases, Developer docs.

**Research & Safety Focus:**
Anthropic highlights its commitment to responsible AI through initiatives like Economic Futures research, Claude's Constitution, a Responsible Scaling Policy, and a Trust Center emphasizing security and compliance.

**News and Announcements:**
*   **Featured Story:** "Four Hundred Meters on Mars" details the first AI-planned drive on another planet.
*   **Latest Releases:**
    *   **February 4, 2026:** An announcement about "Claude is a space to think," emphasizing a helpful conversation space free of ads or sponsored content.
    *   **February 5, 2026:** An announcement introducing "Claude Opus 4.6," described as the world’s most powerful model for coding, agents, and professional work.

In [21]:
# Try another website
display_summary("https://ollama.com/")

Ollama is a platform that enables users to easily run, launch, and integrate open-source AI models locally. It aims to empower developers and users to build with and leverage various open models directly from their machines.

**Key Offerings and Features:**
*   **Local Model Execution:** Users can download Ollama and run a variety of open models on their local systems, often managed through a command-line interface.
*   **Specialized AI Tools:**
    *   **OpenClaw:** An open-source AI assistant designed to automate work, answer questions, and handle tasks, powered by open models.
    *   **Claude Code:** A tool for coding with various open models, including Claude Code, Codex, and OpenCode.
*   **Extensive Integrations:** Ollama supports over 40,000 integrations, allowing users to connect open models to a wide range of applications and agents across categories such as Coding, Documents & RAG (e.g., LangChain, LlamaIndex), Automation (e.g., OpenClaw, n8n), and Chat (e.g., Open WebUI).
*   **Account Benefits:** Signing up for an account provides users with updates on new model releases, access to cloud hardware for running faster and larger models, and the ability to customize and share models with others.

**News and Announcements:**
*   The current version of the Ollama software is **0.16.1**.

---

## 5. Exercises

### Exercise 1: Create an Email Subject Generator

Create a function that takes email content and suggests an appropriate subject line.

In [23]:
# Exercise 1: Email Subject Generator
# Step 1: Create your prompts

# Define the system prompt
email_system_prompt = "You are a helpful assistant that suggests subject lines for the emails you are given."

email_content = """
Hi Team,

I wanted to follow up on our meeting yesterday about the Q4 targets.
We discussed increasing sales by 15% and expanding into two new markets.
Please review the attached proposal and share your feedback by Friday.

Best regards,
Manager
"""

# Step 2: Build the messages list and call the API
messages = [
    {"role": "system", "content": email_system_prompt},
    {"role": "user", "content": email_content}
]
response = client.chat.completions.create(model=MODEL, messages=messages)

# Step 3: Print the result
print(response.choices[0].message.content)

Here are a few subject line options based on your email:

**Concise & Direct:**

*   Q4 Targets Proposal Review
*   Follow-up: Q4 Targets Meeting & Proposal
*   Q4 Targets Proposal - Feedback Required

**Action-Oriented with Deadline:**

*   Action Required: Q4 Targets Proposal Review (Feedback by Friday)
*   Q4 Targets Proposal - Feedback Due Friday

**Content-Specific:**

*   Proposal for Q4 Sales Increase & Market Expansion


### Exercise 2: Summarize Multiple Websites

Create a function that takes a list of URLs and returns summaries for all of them.

In [20]:
# Exercise 2: Batch summarizer
def summarize_multiple(urls):
    """Summarize multiple websites."""
    # Your implementation here
    pass

# Test with:
# urls = ["https://anthropic.com", "https://openai.com"]
# summarize_multiple(urls)

---

## Key Takeaways

1. **OpenAI API** uses a simple message structure with roles: system, user, and assistant

2. **System prompts** define the behavior and tone of the model

3. **BeautifulSoup** is excellent for extracting text content from web pages

4. **Summarization** is a powerful use case - applicable to news, documents, emails, and more

### Limitations

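- JavaScript-rendered websites (for example, React apps) return little or no text to a basic scraper, because `requests` downloads the raw HTML and does not execute scripts
- Some websites block scrapers (for example, sites behind Cloudflare protection)
- Content is truncated to `max_chars` (2,000 characters by default) to fit within context limits, so long pages are only partly summarized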
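
On the truncation point: `fetch_website_contents` slices the text at a fixed character count, which can cut a line in half. A minimal sketch of a gentler cut, assuming it is acceptable to return slightly fewer characters, is shown below. The helper name `truncate_at_line` is hypothetical.

```python
def truncate_at_line(text, max_chars=2000):
    """Trim text to at most max_chars, cutting back to the last line break inside the limit."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    last_newline = clipped.rfind("\n")
    # Fall back to a hard cut when the clipped text has no usable line break
    return clipped[:last_newline] if last_newline > 0 else clipped


sample = "Title\n\nfirst line\nsecond line is long"
print(repr(truncate_at_line(sample, max_chars=25)))
```

This only changes where the cut happens. Pages longer than the limit are still summarized from their opening portion.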

### What's Next?

In the next notebook, we'll explore:
- Comparing different model providers (OpenAI, Gemini, Ollama)
- Understanding HTTP endpoints vs client libraries
- Running models locally

---

## Additional Resources

- [OpenAI Chat Completions API](https://platform.openai.com/docs/guides/text-generation)
- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

---

**Course Information:**
- **Institution:** CV Raman Global University, Bhubaneswar
- **Program:** AI Center of Excellence
- **Course:** AI Systems Engineering 1
- **Developed by:** [Poorit Technologies](https://poorit.in) - *Transform Graduates into Industry-Ready Professionals*

---