
# 🔹 Interview-Style Q&A – Web Data Summarization Project

**Q1. Can you describe the Generative AI–Powered Web Data Summarization project?**
**A:**
“At Globant, I built a data pipeline that scraped energy proposal websites, extracted unstructured text, and applied Generative AI models to produce contextual summaries. The pipeline automated the extraction of key fields like proposal title, duration, budget, deadlines, and highlights. This transformed scattered website data into structured insights that business analysts could use directly, saving significant manual effort.”

---

**Q2. What was the architecture of this pipeline?**
**A:**
“The architecture was straightforward but modular:

1. **Web scraping**: requests and BeautifulSoup fetched and parsed proposal pages.
2. **Text preprocessing**: cleaning HTML, removing boilerplate, and normalizing the text.
3. **Chunking**: splitting content into pieces sized for LLM processing.
4. **Generative AI summarization**: OpenAI GPT took each chunk and generated structured output with fields like title, budget, and deadline.
5. **Storage**: structured data was stored in a database for querying and reporting.

This end-to-end pipeline turned noisy, unstructured web pages into consistent, structured data.”
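
The chunking step (step 3) could look something like the sketch below. It is illustrative only, not the project's actual code: the function name `chunk_text`, the 2,000-character default, and the choice to break on sentence boundaries are all assumptions.

```python
import re


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence breaks.

    Hypothetical helper: the size limit and splitting rule are assumptions.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at fixed offsets.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the open chunk
        else:
            chunks.append(current)  # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Breaking on sentence ends keeps each chunk readable on its own, which tends to matter when the model is asked to extract fields from one chunk at a time.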

---

**Q3. How did you ensure reliable extraction from noisy web pages?**
**A:**
“I used BeautifulSoup with custom parsing logic to handle different page layouts. I also implemented regex patterns for common fields like currency and dates. To deal with inconsistent formats, I combined deterministic rules with GPT-based summarization: where the rules failed, the model filled in the gaps. This hybrid approach improved robustness.”
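
As a rough illustration of the regex side of this approach, the sketch below pulls a currency amount and a date out of free text. The patterns, the function name `extract_with_rules`, and the assumption that the first date found is the deadline are all hypothetical; real proposal sites would need more formats.

```python
import re

# Assumed formats: "$1,500,000", "€750,000", "$2.5 million".
CURRENCY_RE = re.compile(
    r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion)\b)?",
    re.IGNORECASE,
)

# Assumed formats: ISO dates ("2024-11-30") and "March 15, 2025".
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_RE = re.compile(
    rf"\b(?:\d{{4}}-\d{{2}}-\d{{2}}|(?:{MONTHS})\s+\d{{1,2}},?\s+\d{{4}})\b"
)


def extract_with_rules(text: str) -> dict:
    """Return the first currency amount and first date found, or None for each."""
    budget = CURRENCY_RE.search(text)
    deadline = DATE_RE.search(text)  # naive: treats the first date as the deadline
    return {
        "budget": budget.group(0) if budget else None,
        "deadline": deadline.group(0) if deadline else None,
    }
```

A `None` value signals that the rule did not fire, which is the point at which a model-based fallback would take over.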

---

**Q4. How did you structure the prompts for GPT summarization?**
**A:**
“I designed prompts that explicitly asked the LLM to extract structured JSON fields. For example:
*‘Given the following proposal text, extract and return JSON with {title, budget, duration, deadline, highlights}.’*
Constraining the output to a fixed schema reduced hallucinated fields and made the summaries machine-readable for downstream analytics.”
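
One way to pair such a prompt with a schema check is sketched below. The helper names `build_prompt` and `parse_model_output`, and the instruction to use `null` for absent fields, are assumptions added for illustration; only the five field names come from the prompt above.

```python
import json

# Field names taken from the example prompt.
REQUIRED_FIELDS = ("title", "budget", "duration", "deadline", "highlights")


def build_prompt(proposal_text: str) -> str:
    """Build an extraction prompt that names the expected JSON keys."""
    fields = ", ".join(REQUIRED_FIELDS)
    return (
        "Given the following proposal text, extract and return a JSON object "
        f"with exactly these keys: {fields}. "
        # Assumption: asking for null discourages the model from guessing.
        "Use null for any field that is not stated in the text.\n\n"
        f"Proposal text:\n{proposal_text}"
    )


def parse_model_output(raw: str) -> dict:
    """Parse the model's reply and check it against the expected keys."""
    data = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Keep only the expected keys so unexpected ones never reach storage.
    return {f: data[f] for f in REQUIRED_FIELDS}
```

Rejecting replies that fail the check, rather than storing them, is what makes the output safe to feed into downstream analytics.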

---

**Q5. What challenges did you face, and how did you overcome them?**
**A:**
“One challenge was inconsistent website formatting: some sites had well-structured HTML, others were messy. I solved this with a hybrid approach, using regex and rule-based parsing for obvious fields and GPT summarization for ambiguous sections. Another challenge was cost optimization. I minimized token usage by preprocessing the text and chunking it carefully before sending it to GPT.”
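
The hybrid idea can be sketched as a merge step that trusts rule-based values first and calls the model only for fields the rules missed, which also limits how often tokens are spent. Everything here is hypothetical: `hybrid_extract` and its two callable parameters are illustrative names, and the model call is injected as a plain function so the sketch runs without an API key.

```python
from typing import Callable

FIELDS = ("title", "budget", "duration", "deadline", "highlights")


def hybrid_extract(
    text: str,
    rule_extractor: Callable[[str], dict],
    llm_extractor: Callable[[str, list], dict],
) -> dict:
    """Prefer rule-based values; ask the LLM only for fields the rules missed."""
    result = {field: None for field in FIELDS}

    # Accept only known fields that the rules actually filled in.
    for key, value in rule_extractor(text).items():
        if key in result and value is not None:
            result[key] = value

    missing = [field for field in FIELDS if result[field] is None]
    if missing:
        # One model call, restricted to the fields still unresolved.
        llm_values = llm_extractor(text, missing)
        for field in missing:
            result[field] = llm_values.get(field)
    return result
```

If the rules resolve every field, `llm_extractor` is never called, so well-structured pages cost nothing in tokens.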

---

**Q6. What business impact did this project deliver?**
**A:**
“This pipeline automated data extraction and summarization, so analysts no longer had to read through entire websites. Instead, they could work with structured summaries, which accelerated decision-making. This saved hours of manual effort and also improved coverage of the proposals being tracked.”

---

**Q7. If you were to extend this solution, what would you add?**
**A:**
“Future extensions could include:

* Integrating with vector databases to enable semantic search across proposals.
* Adding OCR for PDF-based proposals.
* Deploying as a FastAPI microservice for real-time data extraction.
* Using LangChain agents to orchestrate scraping, summarization, and validation workflows automatically.”



```python
# web_summarizer.py

import json
import os

import requests
from bs4 import BeautifulSoup
from openai import OpenAI

# Reads the API key from the environment (export OPENAI_API_KEY before running)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

MAX_CHARS = 2000  # truncate input to the first 2000 chars for this demo


def scrape_website(url: str) -> str:
    """Scrape a webpage and return its visible text with whitespace collapsed."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    soup = BeautifulSoup(response.text, "html.parser")

    # Remove scripts, styles, and page chrome (nav, footer, header)
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()

    # Join elements with a space so adjacent words do not run together
    return " ".join(soup.get_text(separator=" ").split())


def summarize_with_gpt(text: str) -> dict:
    """Summarize scraped content into structured JSON fields."""
    # Built from plain strings so no indentation or code comments leak into the prompt
    prompt = (
        "Extract the following fields from the proposal text and return as JSON:\n"
        "- Title\n- Budget\n- Duration\n- Deadline\n- Key Highlights\n\n"
        f"Text:\n{text[:MAX_CHARS]}"
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},  # ask for a valid JSON object
    )

    # Parse the JSON string so the return value matches the dict annotation
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    url = "https://example.com/proposal-page"  # Replace with an actual proposal site
    print(f"Scraping: {url}")

    raw_text = scrape_website(url)
    print("\n--- Raw Extract (truncated) ---")
    print(raw_text[:500], "...")

    summary = summarize_with_gpt(raw_text)
    print("\n--- Structured Summary ---")
    print(json.dumps(summary, indent=2))
```
