<a href="https://colab.research.google.com/github/RohanTheCoderX/Git-Lecture/blob/main/ai-systems-engineering-1/unit-1/01-ai-systems-engineering-1-unit1-web-scraping-and-summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<img src="https://poorit.in/image.png" alt="Poorit" width="40" style="vertical-align: middle;"> <b>AI SYSTEMS ENGINEERING 1</b>

## Unit 1: Web Scraping and Summarization with LLMs

**CV Raman Global University, Bhubaneswar**  
*AI Center of Excellence*

</div>

---

### What You'll Learn

In this notebook, you will:

1. **Make your first OpenAI API call** and understand the Chat Completions API
2. **Learn about system and user prompts** and how to structure messages
3. **Build a web scraper** using BeautifulSoup to extract website content
4. **Create a website summarizer** that generates concise summaries using GPT

**Duration:** ~1.5 hours

---

## 1. Environment Setup

Let's install the required packages and configure our API key.

In [None]:
# Install required packages
!pip install -q openai requests beautifulsoup4

In [None]:
# Import required libraries
import os
from getpass import getpass
from openai import OpenAI
from bs4 import BeautifulSoup
import requests
from IPython.display import Markdown, display

In [None]:
# Configure OpenAI API Key
# Strip whitespace that often sneaks in when a key is pasted
api_key = getpass("Enter your OpenAI API Key: ").strip()

if api_key:
    os.environ['OPENAI_API_KEY'] = api_key
    client = OpenAI(api_key=api_key)
    MODEL = "gpt-4o-mini"
    print(f"OpenAI configured with model: {MODEL}")
else:
    print("No API key provided. Re-run this cell and enter your key to continue.")

---

## 2. Your First API Call

Let's start with a simple API call to understand how to communicate with GPT models.

The Chat Completions API expects a list of messages, each a dictionary with a `role` and a `content` field:

```python
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```
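
The Chat Completions API is stateless, so a follow-up question only makes sense to the model if the earlier turns are sent again, with the model's previous replies under the `assistant` role. The following is a minimal sketch of assembling such a list; `build_messages` and its `history` parameter are hypothetical helpers for illustration, not part of the OpenAI library. It only builds the list and makes no API call.

```python
def build_messages(system_prompt, user_prompt, history=None):
    """Return a Chat Completions-style message list.

    `history` is an optional list of (user_text, assistant_text) pairs
    from earlier turns. It is a hypothetical convenience, not an API feature.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in history or []:
        # Earlier turns are replayed in order: user first, then the model's reply
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The newest user message always goes last
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages(
    "You are a helpful assistant.",
    "And what is that times 3?",
    history=[("What is 2 + 2?", "2 + 2 equals 4.")],
)
for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```

The resulting list could be passed as the `messages` argument of `client.chat.completions.create`, exactly like the hand-written lists in this section.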

In [None]:
# Simple API call
messages = [{"role": "user", "content": "Hello! This is my first message to you."}]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages
)

print(response.choices[0].message.content)

In [None]:
# Using system and user prompts
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"}
]

response = client.chat.completions.create(model=MODEL, messages=messages)
print(response.choices[0].message.content)

---

## 3. Web Scraping with BeautifulSoup

Before we can summarize websites, we need to extract their content. We'll use BeautifulSoup to parse HTML and extract readable text.

In [None]:
# Web scraping utility functions

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

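def fetch_website_contents(url, max_chars=2000):
    """
    Fetch and return the title and text content of a website.
    Removes scripts, styles, and other non-text elements.
    The combined result is truncated to max_chars characters.
    """
    # A timeout keeps the notebook from hanging on an unresponsive server
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # Fail on 4xx/5xx instead of summarizing an error page
    soup = BeautifulSoup(response.content, "html.parser")

    # soup.title.string is None when <title> is empty or contains nested tags
    title = soup.title.string if soup.title and soup.title.string else "No title found"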

    if soup.body:
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()
        text = soup.body.get_text(separator="\n", strip=True)
    else:
        text = ""

    return (title + "\n\n" + text)[:max_chars]


def fetch_website_links(url):
    """
    Return all links found on a webpage.
    """
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    # href=True skips <a> tags that have no href attribute
    return [link["href"] for link in soup.find_all("a", href=True) if link["href"]]

In [None]:
# Test the scraper
website_content = fetch_website_contents("https://anthropic.com")
print(website_content[:500] + "...")
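
`fetch_website_links` returns `href` values exactly as they appear in the HTML, so the list usually mixes absolute URLs with relative paths (`/about`), in-page fragments (`#top`), and non-web schemes such as `mailto:`. The following is a minimal sketch, using only the standard library's `urllib.parse`, of how those raw values could be resolved into absolute URLs. `normalize_links` is a hypothetical helper, and the URLs are made-up examples.

```python
from urllib.parse import urljoin, urldefrag


def normalize_links(base_url, hrefs):
    """Resolve raw href values against base_url and keep unique web links."""
    absolute = []
    for href in hrefs:
        # Skip empty values, in-page anchors, and non-web schemes
        if not href or href.startswith(("#", "mailto:", "javascript:", "tel:")):
            continue
        # urljoin resolves relative paths; urldefrag drops any "#fragment" part
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith(("http://", "https://")) and url not in absolute:
            absolute.append(url)
    return absolute


raw_hrefs = [
    "/about",
    "#top",
    "news/latest",
    "https://example.org/x",
    "mailto:hi@example.com",
    "/about#team",
]
print(normalize_links("https://example.com/blog/", raw_hrefs))
```

In practice the first argument would be the page URL that was scraped, for example `normalize_links(url, fetch_website_links(url))`.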

---

## 4. Building the Website Summarizer

Now let's combine web scraping with LLM capabilities to create a website summarizer.

### Types of Prompts

- **System Prompt**: Sets the model's role, the task to perform, and the tone and output format to use
- **User Prompt**: The specific content or question the model should respond to

In [None]:
# Define prompts for summarization

SYSTEM_PROMPT = """
You are an expert content analyst that analyzes website contents
and provides concise, informative summaries.
Ignore navigation-related text and focus on the main content.
Respond in markdown format.
"""

USER_PROMPT_PREFIX = """
Here are the contents of a website.
Provide a clear summary of this website.
If it includes news or announcements, summarize these too.

"""

In [None]:
def create_messages(website_content):
    """Create the message structure for the API call."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT_PREFIX + website_content}
    ]

In [None]:
def summarize_website(url):
    """Fetch a website and generate a summary using GPT."""
    website_content = fetch_website_contents(url)
    response = client.chat.completions.create(
        model=MODEL,
        messages=create_messages(website_content)
    )
    return response.choices[0].message.content


def display_summary(url):
    """Display a formatted summary of a website."""
    summary = summarize_website(url)
    display(Markdown(summary))

In [None]:
# Test the summarizer
display_summary("https://anthropic.com")

In [None]:
# Try another website
display_summary("https://cnn.com")

---

## 5. Exercises

### Exercise 1: Create an Email Subject Generator

Create a function that takes email content and suggests an appropriate subject line.

In [None]:
# Exercise 1: Email Subject Generator
# Step 1: Create your prompts

email_system_prompt = ""  # Define the system prompt

email_content = """
Hi Team,

I wanted to follow up on our meeting yesterday about the Q4 targets.
We discussed increasing sales by 15% and expanding into two new markets.
Please review the attached proposal and share your feedback by Friday.

Best regards,
Manager
"""

# Step 2: Make the messages list and call OpenAI
# messages = [...]
# response = client.chat.completions.create(...)

# Step 3: Print the result
# print(response.choices[0].message.content)

### Exercise 2: Summarize Multiple Websites

Create a function that takes a list of URLs and returns summaries for all of them.

In [None]:
# Exercise 2: Batch summarizer
def summarize_multiple(urls):
    """Summarize multiple websites."""
    # Your implementation here
    pass

# Test with:
# urls = ["https://anthropic.com", "https://openai.com"]
# summarize_multiple(urls)

---

## Key Takeaways

1. **The OpenAI Chat Completions API** uses a simple message structure with roles: system, user, and assistant

2. **System prompts** define the behavior and tone of the model

3. **BeautifulSoup** parses static HTML and makes it easy to strip non-text elements and extract readable text

4. **Summarization** is a widely applicable use case: news, documents, emails, and more

### Limitations

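- Pages rendered client-side with JavaScript (for example, many React apps) return little or no text to basic scraping, because `requests` downloads the HTML but never runs the scripts
- Some websites block automated requests (for example, via Cloudflare bot protection)
- Content is truncated to `max_chars` (2,000 characters by default) to fit within context limits, so the summary reflects only the beginning of the page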
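
On the truncation point: `fetch_website_contents` uses a plain character slice, which can cut the text in the middle of a word. The following is a minimal sketch of a gentler cutoff that backs up to the last whitespace and appends a marker so the model can tell the text was shortened. `truncate_at_boundary` and its `marker` parameter are hypothetical, not part of the notebook's utilities.

```python
def truncate_at_boundary(text, max_chars=2000, marker=" [truncated]"):
    """Shorten text to at most max_chars, cutting at whitespace when possible."""
    if len(text) <= max_chars:
        return text
    # Reserve room for the marker so the result never exceeds max_chars
    budget = max_chars - len(marker)
    if budget <= 0:
        return text[:max_chars]
    cut = text[:budget]
    # Back up to the last whitespace so a word is not split in half
    last_space = max(cut.rfind(" "), cut.rfind("\n"))
    if last_space > 0:
        cut = cut[:last_space]
    return cut.rstrip() + marker


sample = "The quick brown fox jumps over the lazy dog. " * 100
short = truncate_at_boundary(sample, max_chars=80)
print(len(short), repr(short))
```

A character budget is only a rough stand-in for the model's real limit, which is measured in tokens rather than characters.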

### What's Next?

In the next notebook, we'll explore:
- Comparing different model providers (OpenAI, Gemini, Ollama)
- Understanding HTTP endpoints vs client libraries
- Running models locally

---

## Additional Resources

- [OpenAI Chat Completions API](https://platform.openai.com/docs/guides/text-generation)
- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

---

**Course Information:**
- **Institution:** CV Raman Global University, Bhubaneswar
- **Program:** AI Center of Excellence
- **Course:** AI Systems Engineering 1
- **Developed by:** [Poorit Technologies](https://poorit.in) - *Transform Graduates into Industry-Ready Professionals*

---