In [2]:
!pip install scraper

Collecting scraper
  Downloading scraper-0.1.0.tar.gz (2.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scraper
  Building wheel for scraper (setup.py) ... done
  Created wheel for scraper: filename=scraper-0.1.0-py3-none-any.whl size=3467 sha256=ba6d70c6d812391bf648dc28bc887df4c3ced0047a9347a78701429733b6f05e
  Stored in directory: /root/.cache/pip/wheels/1c/0b/5b/73e1b96a17d86b497ec51e061ce25285e21af2ae0fda386c72
Successfully built scraper
Installing collected packages: scraper
Successfully installed scraper-0.1.0


In [10]:
import os
import json
from dotenv import load_dotenv
from IPython.display import display, Markdown, update_display
from openai import OpenAI

In [13]:
# Example usage:
example_url = "https://www.google.com"

print(f"Fetching links from {example_url}...")
links = fetch_website_links(example_url)
print(f"Found {len(links)} links. Displaying first 5:\n{links[:5]}\n")

print(f"Fetching content from {example_url}...")
content = fetch_website_contents(example_url)
print(f"Content fetched (first 500 characters):\n{content[:500]}...")

Fetching links from https://www.google.com...
Found 17 links. Displaying first 5:
['https://www.google.com/intl/en/about.html', 'https://www.google.com/intl/en/ads/', 'https://www.google.com/intl/en/policies/privacy/', 'https://www.google.com/imghp?hl=en&tab=wi', 'http://www.google.com/history/optout?hl=en']

Fetching content from https://www.google.com...
Content fetched (first 500 characters):
GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in Advanced searchAdvertisingBusiness SolutionsAbout Google© 2026 - Privacy - Terms...


The example above relies on two helpers, `fetch_website_links` and `fetch_website_contents`, which must be defined before that cell can run.
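
The definitions of these two helpers do not appear in the cells shown here. As a rough stand-in, the sketch below implements them with only the Python standard library; the original helpers may well differ (the 403 error message later in the notebook suggests they use `requests`). The `HEADERS` value, the 10-second timeout, the 2,000-character cap, and the helper names `parse_html` and `_download` are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Assumption: a browser-like User-Agent; some sites reject the default one.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; brochure-notebook)"}


class _PageParser(HTMLParser):
    """Collects href values and visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_html(html, base_url):
    """Return (absolute links, visible text) for an HTML string."""
    parser = _PageParser()
    parser.feed(html)
    links = [urljoin(base_url, href) for href in parser.links]
    return links, "\n".join(parser.text_parts)


def _download(url):
    """Fetch a page and decode it as UTF-8 (hypothetical helper)."""
    with urlopen(Request(url, headers=HEADERS), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_website_links(url):
    """Return every link on the page as an absolute URL, or [] on failure."""
    try:
        links, _ = parse_html(_download(url), url)
        return links
    except Exception as exc:
        print(f"Error fetching links from {url}: {exc}")
        return []


def fetch_website_contents(url):
    """Return the page's visible text, capped at 2,000 characters."""
    try:
        _, text = parse_html(_download(url), url)
        return text[:2_000]
    except Exception as exc:
        print(f"Error fetching content from {url}: {exc}")
        return ""
```

Keeping the parsing in `parse_html` separate from the network call makes it possible to check the link and text extraction on a literal HTML string without touching the network.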

In [14]:
load_dotenv(override=True)
api_key = os.getenv("OPENAI_API_KEY")

if api_key is None:
    raise ValueError("OPENAI_API_KEY environment variable is not set.")

MODEL = 'gpt-4.1-mini'
openai = OpenAI(api_key=api_key)

In [16]:
linka = fetch_website_links('https://developers.openai.com/api/docs/')
linka = fetch_website_links('https://www.google.com')
linka = fetch_website_links('https://openai.com/about/')



Error fetching links from https://openai.com/about/: 403 Client Error: Forbidden for url: https://openai.com/about/


## First step: Have the LLM figure out which links are relevant

### Use a call to the model named in `MODEL` to read the links on a webpage, and respond in structured JSON.  
It should decide which links are relevant, and replace relative links such as "/about" with "https://company.com/about".  
We will use "one-shot prompting", in which we provide an example of how it should respond in the prompt.

This is an excellent use case for an LLM, because it requires nuanced understanding. Imagine trying to code this without LLMs by parsing and analyzing the webpage - it would be very hard!

Sidenote: there is a more advanced technique called "Structured Outputs" in which we require the model to respond according to a spec. We cover this technique in Week 8 during our autonomous Agentic AI project.

In [17]:
link_system_prompt = """
You are given a list of URLs extracted from a company website.

Your task is to identify ONLY the links that are highly relevant for inclusion in a company brochure.

Relevant links typically include:
- About / About Us pages
- Company Overview / Corporate Information pages
- Mission / Vision / Values pages
- Leadership / Team / Management pages
- Careers / Jobs / Work With Us pages
- Press / Media / News pages
- Investor Relations pages (if applicable)
- Contact pages (only if clearly corporate-level)

STRICT FILTERING RULES:
- Exclude product pages, blog posts, help centers, legal policies, privacy policies, terms of service, login/signup pages, dashboards, documentation, landing pages, promotional campaigns, and duplicate URLs.
- Exclude links that are not clearly related to company-level information.
- If multiple links point to the same category (e.g., two About pages), choose the most canonical/main version.
- Only include full absolute URLs (not relative paths).
- Do not guess URLs. Only select from the provided list.
- If no relevant links exist, return an empty list.

OUTPUT FORMAT:
Return ONLY valid JSON. No explanations. No extra text.

Format exactly as:

{
    "links": [
        {"type": "about page", "url": "https://example.com/about"},
        {"type": "careers page", "url": "https://example.com/careers"}
    ]
}

The "type" field must be one of:
- "about page"
- "company page"
- "mission/vision page"
- "leadership page"
- "careers page"
- "press/media page"
- "investor relations page"
- "contact page"

Return strictly valid JSON.
"""



In [18]:
def get_links_user_prompt(url):
    user_prompt = f"""
Here is the list of links on the website {url} -
Please decide which of these are relevant web links for a brochure about the company,
respond with the full https URL in JSON format.
Do not include Terms of Service, Privacy, email links.

Links (some might be relative links):

"""
    links = fetch_website_links(url)
    user_prompt += "\n".join(links)
    return user_prompt

In [19]:
print(get_links_user_prompt('https://www.google.com'))


Here is the list of links on the website https://www.google.com -
Please decide which of these are relevant web links for a brochure about the company, 
respond with the full https URL in JSON format.
Do not include Terms of Service, Privacy, email links.

Links (some might be relative links):

https://www.google.com/intl/en/about.html
https://www.google.com/intl/en/ads/
https://www.google.com/intl/en/policies/privacy/
https://www.google.com/imghp?hl=en&tab=wi
http://www.google.com/history/optout?hl=en
https://www.youtube.com/?tab=w1
https://www.google.com/advanced_search?hl=en&authuser=0
https://www.google.com/intl/en/policies/terms/
https://mail.google.com/mail/?tab=wm
https://play.google.com/?hl=en&tab=w8
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
https://news.google.com/?tab=wn
https://maps.google.com/maps?hl=en&tab=wl
https://www.google.com/services/
https://www.google.com/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&

In [26]:
print(get_links_user_prompt('https://www.geeksforgeeks.org/sql/sql-interview-questions/'))


Here is the list of links on the website https://www.geeksforgeeks.org/sql/sql-interview-questions/ -
Please decide which of these are relevant web links for a brochure about the company, 
respond with the full https URL in JSON format.
Do not include Terms of Service, Privacy, email links.

Links (some might be relative links):

https://twitter.com/geeksforgeeks
https://www.geeksforgeeks.org/sql/sql-intersect-clause/
https://in.linkedin.com/company/geeksforgeeks
https://www.geeksforgeeks.org/machine-learning/ai-ml-and-data-science-tutorial-learn-ai-ml-and-data-science/
https://www.geeksforgeeks.org/dbms/data-definition-and-control-ddl-dcl-tcl-interview-questions-sql/
https://www.geeksforgeeks.org/python/introduction-to-psycopg2-module-in-python/
https://www.geeksforgeeks.org/dbms/dbms/
https://www.geeksforgeeks.org/sql/sql-where-clause/
https://www.geeksforgeeks.org/courses/category/cloud-devops
https://www.geeksforgeeks.org/aptitude/puzzles/
https://geeksforgeeksapp.page.link/gfg-ap

In [27]:
def select_relevent_links(url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(url)}
        ],
        response_format={"type": "json_object"}
    )
    result_content = response.choices[0].message.content
    links = json.loads(result_content)
    return links

In [38]:
# MODEL = 'gpt-4o-mini'
select_relevent_links("https://www.google.com")

{'links': [{'type': 'about page',
   'url': 'https://www.google.com/intl/en/about.html'}]}
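
The model is asked for strictly valid JSON with absolute URLs, but nothing in the code enforces that. As a hedged sketch, the helper below shows one way the raw response could be checked before it is used: it tolerates malformed JSON, resolves relative paths against the site URL, drops non-HTTP links such as `mailto:`, and removes duplicates. The name `clean_link_response` and the `"unknown"` fallback type are assumptions, not part of the notebook.

```python
import json
from urllib.parse import urljoin, urlparse


def clean_link_response(raw_json, base_url):
    """Normalize the model's JSON reply into {"links": [{"type", "url"}, ...]}."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return {"links": []}

    items = data.get("links") if isinstance(data, dict) else None
    if not isinstance(items, list):
        items = []

    cleaned, seen = [], set()
    for item in items:
        if not isinstance(item, dict) or "url" not in item:
            continue
        # Resolve relative paths such as "/about" against the site URL.
        url = urljoin(base_url, str(item["url"]))
        # Keep only web links, and only the first occurrence of each.
        if urlparse(url).scheme not in ("http", "https") or url in seen:
            continue
        seen.add(url)
        cleaned.append({"type": item.get("type", "unknown"), "url": url})
    return {"links": cleaned}
```

It could be applied to `result_content` in place of the bare `json.loads` call, passing the page URL as `base_url`.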

## Second step: make the brochure!

Assemble all the details into another prompt, this time sent to `gpt-4.1-mini`.

In [41]:
def fetch_page_and_all_relevant_links(url):
    contents = fetch_website_contents(url)
    relevant_links = select_relevent_links(url)
    result = f"## Landing Page:\n\n{contents}\n## Relevant Links:\n"
    for link in relevant_links['links']:
        result += f"\n\n### Link: {link['type']}\n"
        result += fetch_website_contents(link["url"])
    return result

In [42]:
print(fetch_page_and_all_relevant_links("https://www.google.com"))

## Landing Page:

GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in Advanced searchAdvertisingBusiness SolutionsAbout Google© 2026 - Privacy - Terms
## Relevant Links:


### Link: about page
About Google: Our products, technology and company information - About Google
Jump to Content
About
Products
Company Info
News
How Google AI is helping U.S. Olympians find their edge
In partnership with Google DeepMind, Google Cloud has built an industry-first, AI-powered video analysis platform for athletes.
Hit the slopes
Link to Youtube Video (visible only when JS is disabled)
Explore our products and features across Search, Google Workspace and more
Learn all about our leading AI models — and discover their capabilities
See how we’re tackling some of the most challenging problems in computer science
How we're helping kids and teens learn and grow online
From easier parental controls to AI literacy resources, take a look at our latest updates to pr

In [43]:
brochure_system_prompt = """
You are a senior corporate communications strategist, market analyst, and brand positioning expert.

Your task is to analyze content collected from multiple relevant pages of a company’s website (e.g., About, Products/Services, Careers, Blog, Leadership, Investors, Press, Case Studies) and synthesize the information into a concise, high-impact brochure.

The brochure is intended for three key audiences:
- Prospective customers
- Investors and strategic partners
- Potential recruits

Core Instructions
- Write in professional Markdown format.
- Do NOT use code blocks.
- Maintain a polished, executive-level tone.
- Be concise but information-dense.
- Eliminate fluff, vague marketing language, and unsupported claims.
- Do not invent, assume, or extrapolate beyond the provided content.
- If information is not available, omit that section rather than speculate.
- Prioritize clarity, credibility, and strategic positioning.

Required Structure (Include Only If Supported by Source Material)

Company Overview
- Mission and vision
- Founding purpose or origin (if stated)
- Core value proposition
- Market positioning

Products & Services
- Primary offerings
- Key features or capabilities
- Problems solved and customer impact
- Technology or methodology (if relevant)

Customers & Market Focus
- Target industries or segments
- Notable customers or partnerships
- Geographic presence (if mentioned)

Competitive Differentiation
- Unique strengths
- Proprietary technology or intellectual property
- Strategic advantages
- Measurable results or performance indicators

Company Culture & Values
- Workplace philosophy
- Leadership principles
- Diversity, innovation, collaboration, or performance emphasis
- Employee experience indicators (if described)

Careers & Growth Opportunities
- Hiring focus or talent strategy
- Professional development opportunities
- Benefits, flexibility, or growth pathways (if provided)

Traction & Credibility
- Key metrics
- Funding status
- Awards, recognitions, press mentions
- Major milestones

Writing Standards
- Use strong, declarative language.
- Avoid exaggeration or promotional hype.
- Avoid generic phrases such as “world-class,” “cutting-edge,” or “industry-leading” unless explicitly supported.
- Ensure logical flow between sections.
- Keep the brochure compact but comprehensive.
- Aim for clarity and strategic narrative cohesion.

If tone adjustment is requested (e.g., humorous, witty, bold), adapt style while preserving factual accuracy and structure.

Your final output should read like a professionally crafted corporate brochure suitable for external distribution.
"""

In [44]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"""
You are looking at a company called: {company_name}
Here are the contents of its landing page and other relevant pages;
use this information to build a short brochure of the company in markdown without code blocks.\n\n
"""
    user_prompt += fetch_page_and_all_relevant_links(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
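
Slicing the whole prompt at 5,000 characters means that pages late in the list can be cut off entirely when earlier pages are long. One possible alternative, sketched below, gives each page an equal share of the budget instead. The function name `truncate_sections`, the `(title, text)` tuple shape, and the equal-split policy are assumptions for illustration, not part of the notebook; the budget applies to the page text only, so headings add a little on top.

```python
def truncate_sections(sections, total_budget=5_000):
    """Join (title, text) sections, giving each an equal share of the budget."""
    if not sections:
        return ""
    # Equal split of the character budget across all pages (at least 1 each).
    per_section = max(total_budget // len(sections), 1)
    parts = []
    for title, text in sections:
        parts.append(f"### {title}\n{text[:per_section]}")
    return "\n\n".join(parts)


# Two pages sharing an 8-character budget get 4 characters each.
print(truncate_sections([("Landing", "a" * 10), ("About", "b" * 10)], total_budget=8))
```

With this approach the landing page and every selected link contribute something to the prompt, at the cost of shorter excerpts from each.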

In [45]:
get_brochure_user_prompt("Google", "https://www.google.com")

"\nYou are looking at a company called: Google\nHere are the contents of its landing page and other relevant pages;\nuse this information to build a short brochure of the company in markdown without code blocks.\n\n\n## Landing Page:\n\nGoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\xa0Advanced searchAdvertisingBusiness SolutionsAbout Google© 2026 - Privacy - Terms\n## Relevant Links:\n\n\n### Link: about page\nAbout Google: Our products, technology and company information - About Google\nJump to Content\nAbout\nProducts\nCompany Info\nNews\nHow Google AI is helping U.S. Olympians find their edge\nIn partnership with Google DeepMind, Google Cloud has built an industry-first, AI-powered video analysis platform for athletes.\nHit the slopes\nLink to Youtube Video (visible only when JS is disabled)\nExplore our products and features across Search, Google Workspace and more\nLearn all about our leading AI models â\x80\x94 and discover their ca

In [46]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": brochure_system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [47]:
create_brochure("Google", "https://www.google.com")

# Google Corporate Brochure

## Company Overview

Google is a global technology leader specializing in internet-related products and services, including search, advertising, cloud computing, software, and hardware. The company emphasizes innovation in artificial intelligence (AI), cloud solutions, and online safety, aiming to solve complex problems in computer science and improve global access to information.

## Mission and Vision

Google's mission focuses on organizing the world's information and making it universally accessible and useful. The company promotes responsible technology development, particularly through AI, to enhance user experience and address societal challenges.

## Core Value Proposition

Google offers a comprehensive suite of products and features across Search, Google Workspace, AI models, and more, driven by advanced research and technology to support individuals, businesses, and organizations worldwide. Its solutions are designed to foster learning, safety, and sustainability.

## Products & Services

- **Google Search & Advertising**: Core platforms for information discovery and business marketing solutions.
- **Google Workspace**: Integrated productivity and collaboration tools.
- **Google Cloud & AI Platforms**: Industry-first AI-powered applications, including video analysis for athletes (in partnership with Google DeepMind) and genetic research involving endangered species.
- **Security & Online Safety Tools**: Enhanced parental controls and AI literacy resources aimed at safer digital experiences.
- **Developer and Research Initiatives**: Google AI, Google DeepMind, Google Research, and Google Labs provide advanced research capabilities and developer resources.

## Customers & Market Focus

Google serves a broad global market that includes individual users, businesses of all sizes, educational institutions, and public sector organizations. Its technology supports advancements in sports training, scientific research, education, and various commercial applications. Geographic presence is worldwide, with initiatives and outreach spanning multiple continents.

## Competitive Differentiation

Google’s strategic advantages stem from its integration of AI across platforms, extensive data resources, proprietary machine learning models, and its investment in collaborative research through entities like Google DeepMind. This enables high-impact innovation and scalability in service delivery.

## Company Culture & Values

Google upholds principles that emphasize innovation, transparency, responsibility, and inclusion. The company invests in initiatives supporting privacy, safety, and ethics in technology development. Its culture encourages collaboration and continuous learning to address evolving challenges in the digital landscape.

## Careers & Growth Opportunities

Google provides diverse career paths focused on technology, research, product development, and operational roles. The company offers professional development programs, supports workforce flexibility, and fosters a workplace environment dedicated to growth and meaningful impact.

## Traction & Credibility

As a subsidiary of Alphabet Inc., Google maintains a leading position in technology innovation with significant investments in AI research, global sustainability projects, and public policy engagement. Its reputation is supported by extensive media coverage and ongoing advancements across its product portfolio.

---

For further information, visit [google.com/about](https://www.google.com/about) and explore Google's commitments, products, and initiatives.

## Finally - a minor improvement

With a small adjustment, we can change this so that the results stream back from OpenAI
with the familiar typewriter animation.

In [48]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": brochure_system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True
    )
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        update_display(Markdown(response), display_id=display_handle.display_id)
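
The accumulation loop can be exercised without calling the API. The sketch below reproduces it as a generator and feeds it stand-in chunk objects built with `SimpleNamespace`; `accumulate_deltas` and `fake_chunk` are hypothetical names, and the fake chunks only mimic the `choices[0].delta.content` shape used above. It also skips chunks with an empty `choices` list, which can occur in some configurations (for example when usage reporting is enabled on the stream).

```python
from types import SimpleNamespace


def accumulate_deltas(chunks):
    """Yield the growing response text after each streamed chunk."""
    response = ""
    for chunk in chunks:
        if not chunk.choices:  # nothing to append in this chunk
            continue
        # delta.content may be None (e.g. on the final chunk), hence `or ""`.
        response += chunk.choices[0].delta.content or ""
        yield response


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chat completion chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


chunks = [fake_chunk("# Go"), SimpleNamespace(choices=[]), fake_chunk(None), fake_chunk("ogle")]
for partial in accumulate_deltas(chunks):
    print(partial)
```

Each yielded string is what would be passed to `update_display(Markdown(...))` in the notebook version.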

In [49]:
stream_brochure("Google", "https://www.google.com")

# Google Corporate Brochure

---

## Company Overview

Google is a global technology company dedicated to organizing the world’s information and making it universally accessible and useful. Established at the forefront of innovation, Google continuously develops products and technologies that impact individuals, businesses, and communities worldwide.

### Mission and Vision  
Google's mission centers on leveraging technology, particularly artificial intelligence (AI), to solve challenging problems across various domains—from enhancing everyday online experiences to advancing scientific research. Through this mission, Google aims to empower users, promote safety, and foster learning and growth globally.

### Core Value Proposition  
By integrating advanced AI models with a diversified product portfolio encompassing Search, Google Workspace, Google Cloud, and more, Google delivers scalable solutions that improve productivity, accessibility, and information utility across industries and user segments.

### Market Positioning  
Positioned as a leader in AI research and cloud computing services, Google serves millions globally, supporting key sectors including healthcare, education, sports, and conservation. Its collaborative efforts with organizations and governments amplify its impact and innovation reach.

---

## Products & Services

Google offers a broad range of technology products and platforms designed for individual users, enterprises, and developers:  

- **Search Engine**: Facilitates efficient access to vast information resources.
- **Google Workspace**: A suite of productivity tools streamlining communication and collaboration.
- **Google Cloud & DeepMind**: Delivers AI-driven cloud computing and advanced research capabilities.
- **AI-Powered Solutions**: Includes video analysis for athletic performance and genome sequencing for endangered species preservation.
- **Google for Developers & Labs**: Platforms supporting innovation and AI model experimentation.

### Customer Impact  
Google’s products empower users to enhance productivity, facilitate scientific breakthroughs, and promote digital safety for families and communities, contributing actively to societal development.

---

## Customers & Market Focus

Google engages a diverse global user base spanning:

- Individuals seeking information and productivity tools.
- Enterprises leveraging AI and cloud technology.
- Scientific and sports organizations utilizing AI for research and performance enhancement.
- Educational and governmental bodies advancing digital literacy and safety.

---

## Competitive Differentiation

- **AI Leadership**: Pioneering sophisticated AI models and applying them to real-world challenges.
- **Technology Integration**: Seamlessly combining search, cloud, and AI technologies to create comprehensive solutions.
- **Global Scale and Reach**: Operating services accessible worldwide with localized support and initiatives.
- **Research and Innovation**: Active investment in scientific research through Google AI, DeepMind, and Labs.

---

## Company Culture & Values

Google fosters a culture emphasizing innovation, collaboration, and continuous learning. It promotes online safety, transparency, and ethical use of technology as foundational principles. Through initiatives in digital literacy and accessibility, Google commits to inclusive growth.

---

## Careers & Growth Opportunities

Google encourages talent acquisition focused on expertise in technology, research, and innovation domains. It supports professional development through diverse programs and invests in creating flexible and inclusive work environments to cultivate long-term career growth.

---

## Traction & Credibility

- In collaboration with Google DeepMind, Google Cloud developed AI-powered platforms that support U.S. Olympians and conservation efforts.
- Recognized for contributions to AI literacy, online safety, and large-scale data analysis.
- Continuous global outreach through Google.org and sustainability programs.

---

## Contact & Further Information

For more details about Google’s products, innovations, or initiatives, please visit [about.google](https://about.google) or subscribe to their newsletter for updates and insights.

---

© 2026 Google Inc. All rights reserved.  
Privacy | Terms | Accessibility  

---