# Company Brochure Generator

This notebook creates an AI-powered system that automatically generates professional company brochures by scraping websites and using LLMs for content analysis and generation.

## Workflow

```mermaid
flowchart TD
    A[Company URL Input] --> B[Scrape Landing Page]
    B --> C[Extract All Links]
    C --> D[LLM Link Analysis]
    D --> E{Relevant Links?}
    E -->|Yes| F[About Pages<br/>Careers Pages<br/>Company Info]
    E -->|No| G[Skip Privacy/Terms<br/>Email Links]
    F --> H[Scrape Selected Pages]
    H --> I[Aggregate Content]
    I --> J[Content Optimization]
    J --> K[Token Limit Check]
    K --> L[LLM Brochure Generation]
    L --> M[Markdown Output]
    
    style A fill:#e1f5fe
    style D fill:#fff3e0
    style L fill:#f3e5f5
    style M fill:#e8f5e8
    style G fill:#ffebee,stroke:#f44336,stroke-dasharray: 5 5
```

## Process Steps

### Step 1: Web Scraping
- Extract content from company landing page
- Clean HTML (remove scripts, styles, images)
- Collect all hyperlinks for analysis
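To make the cleaning step concrete, here is a minimal sketch that uses only the standard library's `html.parser` instead of BeautifulSoup, which is what the notebook itself uses in its `Website` class. The class name `TextAndLinks` and the sample HTML are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect visible text and href values, ignoring <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # visible text fragments, in document order
        self.links = []        # raw href values (may be relative)

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:           # anchors without an href are ignored
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return "\n".join(self.chunks)


# Hypothetical page used only for this illustration
sample_html = """
<html><body>
  <script>var tracking = 1;</script>
  <style>p { color: red; }</style>
  <h1>Acme Corp</h1>
  <p>We build rockets.</p>
  <a href="/about">About</a>
  <a>No href here</a>
</body></html>
"""

parser = TextAndLinks()
parser.feed(sample_html)
parser.close()
print(parser.text())
print(parser.links)
```

Script and style content never reaches the collected text, and only anchors that actually carry an `href` contribute to the link list.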

### Step 2: Intelligent Link Filtering
- Use LLM to identify relevant pages (About, Careers, Company info)
- Convert relative links to absolute URLs
- Filter out irrelevant content (privacy policies, terms of service)
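Part of this step can also be done deterministically, before or after the LLM call. The following is a minimal sketch, assuming a hypothetical helper `normalize_links` that is not part of the notebook. It uses `urljoin` and `urldefrag` from the standard library to resolve relative links and drops empty, non-HTTP, and duplicate entries.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url; drop empty, non-HTTP, and duplicate links."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # anchors without an href yield None
        # Resolve relative paths, then strip any "#fragment" part
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, tel:, javascript: and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical input mixing relative, fragment-only, email, and duplicate links
links = normalize_links(
    "https://company.com/",
    ["/about", "#main", "mailto:hi@company.com", None, "careers", "/about"],
)
print(links)
```

Note that a fragment-only link such as `#main` collapses to the page's own URL. Deciding whether a page is relevant for a brochure still needs the LLM; this sketch only removes the mechanical noise.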

### Step 3: Content Aggregation
- Scrape content from filtered pages
- Combine all information while respecting API token limits
- Handle errors gracefully with fallback options
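The aggregation step can be sketched as below. This is an illustration under stated assumptions: `aggregate_pages`, `fake_fetch`, and the two limits are hypothetical names and values, and character counts stand in for tokens. The fetch function is passed in so that one failing page produces a placeholder instead of aborting the run.

```python
def aggregate_pages(pages, fetch, max_chars=5_000, per_page_chars=2_000):
    """Concatenate page contents, capping each page and the overall total.

    pages: list of {"type": ..., "url": ...} dicts
    fetch: callable that takes a URL and returns the page text (may raise)
    """
    parts = []
    used = 0
    for page in pages:
        remaining = max_chars - used
        if remaining <= 0:
            break  # overall budget exhausted
        try:
            text = fetch(page["url"])
        except Exception as exc:  # broad on purpose: one bad page should not stop the rest
            text = f"(could not fetch: {exc})"
        section = f"\n\n{page['type']}\n{text[:per_page_chars]}"
        parts.append(section[:remaining])
        used += len(parts[-1])
    return "".join(parts)


def fake_fetch(url):
    """Stand-in for a real scraper so the sketch runs offline."""
    if "broken" in url:
        raise ConnectionError("timeout")
    return "x" * 3_000


pages = [
    {"type": "about page", "url": "https://company.com/about"},
    {"type": "careers page", "url": "https://company.com/broken"},
]
combined = aggregate_pages(pages, fake_fetch)
print(len(combined))
```

Capping each page separately keeps one very long page from crowding out the others before the overall limit is applied.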

### Step 4: Brochure Generation
- Analyze aggregated content using LLM
- Generate structured markdown brochure
- Include company culture, customers, and career information

## Key Features
- Automatic content extraction and cleaning
- AI-powered link relevance detection
- Token management and content optimization
- Professional brochure formatting
- Error handling and recovery

## **BUSINESS CHALLENGE:**
Create a product that builds a brochure about a company for prospective clients, investors, and potential recruits.

We will be given a company name and the URL of its primary website.

In [15]:
import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI
from groq import Groq

In [16]:
load_dotenv(override=True)

python-dotenv could not parse statement starting at line 11


True

In [17]:
openai = OpenAI()
groq = Groq()

In [18]:
class Website:
    """
    A utility class to represent a Website that we have scraped.
    """

    url: str
    title: str
    text: str
    body: str
    links: List[str]

    def __init__(self, url):
        """
        Create this Website object from the given url using the BeautifulSoup library.
        """

        self.url = url
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        # <a> tags without an href yield None; drop them so the links can be joined into a prompt
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [19]:
ed = Website('https://edwarddonner.com')
print(ed.get_contents())

Webpage Title:
Home - Edward Donner
Webpage Contents:
Home
Connect Four
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.
We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve
patented
our matching model, and our award-winning platform has happy customers a

In [20]:
ed.links

['https://edwarddonner.com/',
 'https://edwarddonner.com/connect-four/',
 'https://edwarddonner.com/outsmart/',
 'https://edwarddonner.com/about-me-and-about-nebula/',
 'https://edwarddonner.com/posts/',
 'https://edwarddonner.com/',
 'https://news.ycombinator.com',
 'https://nebula.io/?utm_source=ed&utm_medium=referral',
 'https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html',
 'https://patents.google.com/patent/US20210049536A1/',
 'https://www.linkedin.com/in/eddonner/',
 'https://edwarddonner.com/2025/05/28/connecting-my-courses-become-an-llm-expert-and-leader/',
 'https://edwarddonner.com/2025/05/28/connecting-my-courses-become-an-llm-expert-and-leader/',
 'https://edwarddonner.com/2025/05/18/2025-ai-executive-briefing/',
 'https://edwarddonner.com/2025/05/18/2025-ai-executive-briefing/',
 'https://edwarddonner.com/2025/04/21/the-complete-agentic-ai-engineering-course/',
 'https://edwarddonner.com/2025/04/21/the-

### **First Step**: Have a small GPT model figure out which links are relevant  
#### Use a call to the model (`gpt-4.1-nano` in the code below) to read the links on a webpage and respond in structured JSON.  
It should decide which links are relevant and replace relative links such as "/about" with "https://company.com/about".
We will use "one-shot prompting", in which the prompt includes an example of how the model should respond.

This is an excellent use case for an LLM, because it requires nuanced understanding. Imagine trying to code this without LLMs by parsing and analyzing the webpage - it would be very hard!

Sidenote: there is a more advanced technique called "Structured Outputs" in which we require the model to respond according to a spec. We cover this technique in Week 8 during our autonomous Agentic AI project.
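Without Structured Outputs, the reply is only as well-formed as the model chooses to make it, so it can help to validate the parsed JSON before using it. A minimal sketch, assuming a hypothetical helper `parse_link_reply` (not part of the notebook) that keeps only entries with a full `https://` URL:

```python
import json


def parse_link_reply(raw):
    """Return a list of {"type", "url"} dicts, or [] if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    links = data.get("links", [])
    if not isinstance(links, list):
        return []
    return [
        {"type": str(item.get("type", "page")), "url": item["url"]}
        for item in links
        if isinstance(item, dict)
        and isinstance(item.get("url"), str)
        and item["url"].startswith("https://")
    ]


# Hypothetical model reply: the second entry ignores the "full URL" instruction
reply = """
{"links": [
    {"type": "about page", "url": "https://company.com/about"},
    {"type": "careers page", "url": "/careers"}
]}
"""
print(parse_link_reply(reply))
print(parse_link_reply("not json at all"))
```

Returning an empty list on failure lets the caller fall back to a brochure built from the landing page alone.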

In [21]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [22]:
def get_links_user_prompt(website):
    # The parameter is a Website instance; lowercase avoids shadowing the Website class
    user_prompt = f"Here is the list of links on the website of {website.url} -"
    user_prompt += "please decide which of these are relevant web links for the brochure about the company, respond with the full https URL. \
        DO NOT include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links).\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [23]:
print(get_links_user_prompt(ed))

Here is the list of links on the website of https://edwarddonner.com -please decide which of these are relevant web links for the brochure about the company, respond with the full https URL.         DO NOT include Terms of Service, Privacy, email links.
Links (some might be relative links).
https://edwarddonner.com/
https://edwarddonner.com/connect-four/
https://edwarddonner.com/outsmart/
https://edwarddonner.com/about-me-and-about-nebula/
https://edwarddonner.com/posts/
https://edwarddonner.com/
https://news.ycombinator.com
https://nebula.io/?utm_source=ed&utm_medium=referral
https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html
https://patents.google.com/patent/US20210049536A1/
https://www.linkedin.com/in/eddonner/
https://edwarddonner.com/2025/05/28/connecting-my-courses-become-an-llm-expert-and-leader/
https://edwarddonner.com/2025/05/28/connecting-my-courses-become-an-llm-expert-and-leader/
https://edwarddonner.c

In [57]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model='gpt-4.1-nano',
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)

#### Links the model selected as relevant

In [26]:
get_links("https://anthropic.com")

{'links': [{'type': 'about page', 'url': 'https://www.anthropic.com/company'},
  {'type': 'careers page', 'url': 'https://www.anthropic.com/careers'},
  {'type': 'team page', 'url': 'https://www.anthropic.com/team'},
  {'type': 'solutions page',
   'url': 'https://www.anthropic.com/solutions/agents'},
  {'type': 'research page', 'url': 'https://www.anthropic.com/research'},
  {'type': 'news page', 'url': 'https://www.anthropic.com/news'},
  {'type': 'press releases / announcements',
   'url': 'https://www.anthropic.com/news/claude-character'},
  {'type': 'partnerships / customers',
   'url': 'https://www.anthropic.com/customers'},
  {'type': 'transparency / responsible AI',
   'url': 'https://www.anthropic.com/transparency'}]}

#### How many links did we filter out?

In [28]:
anthropic = Website("https://anthropic.com")
anthropic.links

['#main',
 '#footer',
 'https://www.anthropic.com/',
 'https://www.anthropic.com/claude',
 'https://www.anthropic.com/max',
 'https://www.anthropic.com/team',
 'https://www.anthropic.com/enterprise',
 'https://www.anthropic.com/pricing',
 'https://claude.ai/download',
 'https://claude.ai/',
 'https://www.anthropic.com/news/claude-character',
 'https://www.anthropic.com/api',
 'https://docs.anthropic.com/',
 'https://www.anthropic.com/pricing#api',
 'https://console.anthropic.com/',
 'https://docs.anthropic.com/en/docs/welcome',
 'https://www.anthropic.com/solutions/agents',
 'https://www.anthropic.com/solutions/coding',
 'https://www.anthropic.com/solutions/customer-support',
 'https://www.anthropic.com/solutions/education',
 'https://www.anthropic.com/solutions/financial-services',
 'https://www.anthropic.com/customers',
 'https://www.anthropic.com/research',
 'https://www.anthropic.com/economic-index',
 'https://www.anthropic.com/claude/opus',
 'https://www.anthropic.com/claude/sonne

### **Second Step**: Make the brochure

In [31]:
# collecting info from the landing page as well as from all the important links selected above

def get_all_details(url):
    result = "Landing Page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found Links: ", links)

    for link in links["links"]:  
        result += f"\n\n{link['type']}\n"
        result += Website(link['url']).get_contents()
    return result

In [32]:
print(get_all_details("https://anthropic.com"))

Found Links:  {'links': [{'type': 'about page', 'url': 'https://www.anthropic.com/company'}, {'type': 'careers page', 'url': 'https://www.anthropic.com/careers'}, {'type': 'news', 'url': 'https://www.anthropic.com/news'}, {'type': 'research page', 'url': 'https://www.anthropic.com/research'}, {'type': 'solutions page', 'url': 'https://www.anthropic.com/solutions/coding'}, {'type': 'team page', 'url': 'https://www.anthropic.com/team'}, {'type': 'partnerships', 'url': 'https://www.anthropic.com/partners/mcp'}, {'type': 'company info', 'url': 'https://www.anthropic.com/company'}]}
Landing Page:
Webpage Title:
Home \ Anthropic
Webpage Contents:
Skip to main content
Skip to footer
Claude
Chat with Claude
Overview
Max plan
Team plan
Enterprise plan
Explore pricing
Download apps
Claude log in
News
Claude’s character
API
Build with Claude
API overview
Developer docs
Explore pricing
Console log in
News
Learn how to build with Claude
Solutions
Collaborate with Claude
AI agents
Coding
Customer su

In [33]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company's website \
    and creates a short brochure about the company for prospective customers, investors and recruiters. Respond in Markdown.  \
    Include details of the company culture, customers and careers/jobs if you have the information."

In [34]:
# Brochure user prompt

def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in Markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:24_000]  # truncate (assign, not append) so the prompt stays within the size budget
    return user_prompt
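The 24,000-character cut in `get_brochure_user_prompt` is a stand-in for a token limit. A minimal sketch of making that relationship explicit, assuming the common rule of thumb of roughly four characters per token for English text; the helper name `truncate_to_token_budget` and its defaults are hypothetical:

```python
def truncate_to_token_budget(text, max_tokens=6_000, chars_per_token=4):
    """Cut text to roughly max_tokens, assuming a fixed characters-per-token ratio."""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    return text[:max_chars]


prompt = "word " * 10_000  # 50,000 characters of filler
shortened = truncate_to_token_budget(prompt)
print(len(prompt), len(shortened))  # 50000 24000
```

With the assumed defaults, 6,000 tokens at four characters each gives the same 24,000-character ceiling the notebook uses. For an exact count, a real tokenizer would be needed.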

In [None]:
get_brochure_user_prompt("Anthropic", "https://anthropic.com")