# Overview of the "Company Brochure Generator" Code

This code scrapes a company's website, identifies relevant sub-pages (such as About or Careers), and compiles the scraped text into a concise *markdown* brochure. It uses a **Large Language Model (LLM)**, either GPT (`gpt-4o-mini`) or Claude (`claude-3-haiku-20240307`), to decide which links are relevant and to write the final brochure text.

## Key Components

1. **Imports and Configuration**  
   - Loads libraries like `requests` and `BeautifulSoup` for web scraping.  
   - Reads API keys for OpenAI, Anthropic, and Google from environment variables (via `dotenv`). The Google client is configured, but only GPT and Claude are used to build the brochure.  
   - Sets up basic request headers to resemble a standard web browser.

2. **`Website` Class**  
   - Represents a single webpage.  
   - Fetches the page, parses out the `<title>`, removes non-text elements (`script`, `style`, `img`, and `input` tags), and stores the cleaned-up text.  
   - Collects all hyperlinks (`<a>` tags) for later analysis.

3. **LLM Prompts and Link Handling**  
   - Includes predefined system instructions (`link_system_prompt`) that instruct the LLM to filter links relevant to a brochure.  
   - The `get_links_user_prompt` function prepares the user prompt listing all discovered links.  
   - `get_links(url, model)` then sends those links to either GPT or Claude and parses the JSON reply into a Python dict of relevant URLs (e.g., About pages, Careers pages).

4. **Gathering Additional Page Content**  
   - The `get_all_details(url, model)` function first retrieves textual content from the main (landing) page.  
   - It calls `get_links` to identify sub-pages (like About, Careers), then fetches each page’s text, accumulating everything into a single string.

5. **Creating the Brochure**  
   - A separate system prompt (`system_prompt`) tells the LLM how to structure the final brochure, focusing on company information, culture, customers, and career opportunities.  
   - The `get_brochure_user_prompt(url, model)` function compiles the landing page content and relevant sub-page content, truncating if it’s over 5,000 characters.  
   - Finally, `create_brochure(url, model)` sends that text, along with the system prompt, to GPT or Claude, receiving back a short, markdown-formatted brochure.

6. **Gradio Web Interface**  
   - The last part of the code defines a Gradio `Interface` with two inputs:
     1. A textbox for the **company website URL**.  
     2. A dropdown for selecting the **LLM model** (GPT or Claude).  
   - The **output** is a markdown text block that displays the generated brochure to the user.
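
The data flow described in the components above can be sketched offline. This is a minimal illustration only: `FAKE_SITE`, `fake_select_links`, `gather_text`, and `fake_write_brochure` are hypothetical stand-ins (not part of the notebook) that replace the HTTP requests and the LLM calls so the example runs without network access or API keys.

```python
from typing import Callable, Dict, List

# Hypothetical offline stand-ins: a dict of page text instead of HTTP requests,
# and plain functions instead of GPT or Claude calls.
FAKE_SITE: Dict[str, str] = {
    "https://example.com": "Welcome to Example Co.",
    "https://example.com/about": "We build examples.",
    "https://example.com/careers": "We are hiring.",
}


def fake_select_links(base_url: str) -> List[Dict[str, str]]:
    """Stand-in for the LLM link filter; returns entries shaped like the JSON example."""
    return [
        {"type": "about page", "url": f"{base_url}/about"},
        {"type": "careers page", "url": f"{base_url}/careers"},
    ]


def gather_text(base_url: str, fetch: Callable[[str], str], limit: int = 5_000) -> str:
    """Landing page first, then each selected sub-page, cut to `limit` characters."""
    parts = ["Landing page:\n" + fetch(base_url)]
    for link in fake_select_links(base_url):
        parts.append(f"{link['type']}\n{fetch(link['url'])}")
    return "\n\n".join(parts)[:limit]


def fake_write_brochure(text: str) -> str:
    """Stand-in for the brochure-writing LLM call; just wraps the input in markdown."""
    return "# Brochure\n\n" + text


brochure = fake_write_brochure(gather_text("https://example.com", FAKE_SITE.__getitem__))
print(brochure)
```

In the notebook itself, the same roles are played by `Website`, `get_links`, `get_all_details`, and `create_brochure`.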

## Why This Code Is Useful

- **Automates Website Content Gathering**  
  Instead of manually copying and pasting a company’s About/Careers sections, the script programmatically collects data from the main webpage and relevant linked pages.

- **Leverages Large Language Models**  
  By employing GPT or Claude, the code transforms raw text into a *coherent, formatted brochure*, potentially saving a lot of manual editing work.

- **Flexible and Configurable**  
  - It can be adapted to different websites or even different LLMs.  
  - The user can quickly switch between GPT and Claude to see which model yields better (or more cost-effective) results.

- **Easy-to-Use Web App**  
  Thanks to Gradio, anyone (technical or not) can enter a company site URL and get an instant summary/brochure, without having to run Python scripts in a CLI.

Overall, this code demonstrates how to combine **web scraping** with **text-generation AI** to produce marketing or informational material directly from a live website.



In [1]:
# imports

import os
import requests
from bs4 import BeautifulSoup
from typing import List
from dotenv import load_dotenv
from openai import OpenAI
import google.generativeai
import anthropic
import gradio as gr 
import json

In [2]:
# Load environment variables from a file called .env

load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')


In [3]:
# Connect to OpenAI, Anthropic and Google

openai = OpenAI()

claude = anthropic.Anthropic()

google.generativeai.configure()

In [4]:
# Define request headers to mimic a typical browser
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}


# Define a Website class for scraping and storing webpage data
class Website:
    """
    A utility class representing a scraped webpage: its title, visible text, and links
    """

    def __init__(self, url):
        self.url = url
        # A timeout keeps a slow or unresponsive site from hanging the app
        response = requests.get(url, headers=headers, timeout=10)
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [5]:
# Define system prompts and user prompts for link handling

link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page": "url": "https://another.full.url/careers"}
    ]
}
"""

In [6]:
def get_links_user_prompt(website):
    """
    Creates a user prompt string that lists all links for the given website.
    Asks the assistant to identify only those relevant for a company brochure.
    """
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt
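
The prompt above leaves it to the model to turn relative links into full URLs. One alternative is to resolve them before prompting, using the standard library. This is a sketch under that assumption; `absolutize_links` is a hypothetical helper, not something the notebook defines.

```python
from typing import List
from urllib.parse import urljoin, urlparse


def absolutize_links(base_url: str, links: List[str]) -> List[str]:
    """Resolve hrefs against base_url, keep only http(s) links, drop duplicates in order."""
    seen = set()
    result = []
    for href in links:
        absolute = urljoin(base_url, href.strip())
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, tel: and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = absolutize_links(
    "https://example.com/home",
    ["/about", "careers", "mailto:hi@example.com", "https://example.com/about"],
)
print(links)
```

Doing this up front also shortens the link list the model has to read, since email links and duplicates never reach the prompt.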

In [7]:
# Function to retrieve relevant links from a given URL using either GPT or Claude models

def get_links(url, model):
    """
    Fetches the webpage at 'url', scrapes the links, and uses a chosen LLM model
    to classify which links are relevant for a company brochure.

    :param url: Website URL to scrape
    :param model: 'GPT' or 'Claude' to indicate which LLM to use
    :return: JSON-like Python dict containing relevant links
    """
    website = Website(url)
    if model == 'GPT':
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": link_system_prompt},
                {"role": "user", "content": get_links_user_prompt(website)}
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(response.choices[0].message.content)
    else:
        result = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=1000,
            temperature=0.7,
            system=link_system_prompt + 'Please return the answer in valid JSON format only. Do not include any extra text or explanations outside the JSON.',
            messages=[
                {"role": "user", "content": get_links_user_prompt(website)},
            ]
        )
        return json.loads(result.content[0].text)
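
The Claude branch relies on the model returning nothing but JSON, and `json.loads` will raise if the reply includes any surrounding text. A more tolerant parse might look like the sketch below. `extract_json_object` is a hypothetical helper, and slicing from the first `{` to the last `}` is a simple heuristic rather than a full parser.

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' as JSON.

    Raises ValueError if no such span exists or if it is not valid JSON
    (json.JSONDecodeError is a subclass of ValueError).
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(text[start:end + 1])


# Illustrative reply with chatter before and after the JSON payload.
noisy_reply = (
    "Here are the relevant links:\n"
    '{"links": [{"type": "about page", "url": "https://example.com/about"}]}\n'
    "Let me know if you need anything else."
)
print(extract_json_object(noisy_reply))
```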

In [8]:
# Function to gather details from the landing page and its relevant sub-pages

def get_all_details(url, model):
    """
    Scrapes the main landing page (given by 'url') and also scrapes any relevant
    linked pages, as determined by the chosen LLM model. Returns a concatenated
    string of the textual content from all these pages.

    :param url: Website URL to scrape
    :param model: 'GPT' or 'Claude' for link classification
    :return: A formatted string with the contents of the landing page and relevant sub-pages
    """
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url, model)
    # print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result
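
As written, a single sub-page that fails to load stops the whole run. The sketch below shows one way to skip failed pages instead. It is an illustration under assumptions: `collect_pages` and the stub `fake_fetch` are hypothetical, and in the notebook the fetch step would be `Website(link["url"]).get_contents()`, where catching `requests.exceptions.RequestException` would be the narrower choice.

```python
from typing import Callable, Dict, List


def collect_pages(links: List[Dict[str, str]], fetch: Callable[[str], str]) -> str:
    """Concatenate page text per link; a page whose fetch raises is skipped."""
    result = ""
    for link in links:
        try:
            contents = fetch(link["url"])
        except Exception as exc:  # deliberately broad for this sketch
            print(f"Skipping {link['url']}: {exc}")
            continue
        result += f"\n\n{link['type']}\n{contents}"
    return result


def fake_fetch(url: str) -> str:
    """Stub fetcher: one URL fails, the other returns fixed text."""
    if url.endswith("/broken"):
        raise ConnectionError("simulated network failure")
    return "We build examples."


pages = collect_pages(
    [
        {"type": "about page", "url": "https://example.com/about"},
        {"type": "press page", "url": "https://example.com/broken"},
    ],
    fake_fetch,
)
print(pages)
```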

In [9]:
# Define a system prompt for generating the brochure text

system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."

# Or uncomment the lines below for a more humorous brochure - this demonstrates how easy it is to incorporate 'tone':

# system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
# and creates a short humorous, entertaining, jokey brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
# Include details of company culture, customers and careers/jobs if you have the information."

In [10]:
# Function to build the user prompt used in the final brochure creation step

def get_brochure_user_prompt(url, model):
    """
    Constructs a user prompt by pulling all relevant details from the website (landing + sub-pages)
    and instructing the model to create a short brochure in markdown format.

    :param url: Main website URL
    :param model: 'GPT' or 'Claude'
    :return: A truncated string (max 5000 chars) containing relevant textual content
    """
    user_prompt = f"You are looking at the website of a company: {url}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url, model)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
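
The slice above cuts at exactly 5,000 characters, which can end the prompt mid-word or mid-sentence. If that matters, one option is to back up to the last line break before the limit. This is only a sketch; `truncate_at_line` is a hypothetical helper and the limit of 20 in the example is chosen to keep the output short.

```python
def truncate_at_line(text: str, limit: int = 5_000) -> str:
    """Cut text to at most `limit` characters, backing up to the last newline if there is one."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    newline = cut.rfind("\n")
    # Fall back to the hard cut when the first `limit` characters contain no usable line break
    return cut[:newline] if newline > 0 else cut


sample = "line one\nline two\nline three"
print(repr(truncate_at_line(sample, limit=20)))
```

The trade-off is that the result can be noticeably shorter than the limit when the last line is long.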

In [11]:
# Main function to create the brochure text using either GPT or Claude

def create_brochure(url, model):
    """
    Uses the combined text from 'get_brochure_user_prompt' and a pre-defined 'system_prompt'
    to generate a short company brochure in markdown.

    :param url: Company website URL
    :param model: 'GPT' or 'Claude'
    :return: The generated markdown brochure
    """
    if model == 'GPT':
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": get_brochure_user_prompt(url, model)}
            ],
        )
        return response.choices[0].message.content
    else:
        result = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=2000,
            temperature=0.7,
            system=system_prompt,
            messages=[
                {"role": "user", "content": get_brochure_user_prompt(url, model)},
            ],
        )
        return result.content[0].text
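
Note that the `else` branch sends any value other than `'GPT'` to Claude, including typos. A lookup table makes the supported models explicit and rejects anything else. The sketch below uses hypothetical stub backends (`fake_gpt`, `fake_claude`) in place of the real API calls so it runs offline.

```python
from typing import Callable, Dict


def fake_gpt(system: str, user: str) -> str:
    """Stub standing in for the OpenAI call."""
    return f"[GPT] {user}"


def fake_claude(system: str, user: str) -> str:
    """Stub standing in for the Anthropic call."""
    return f"[Claude] {user}"


# Maps the dropdown label to the function that handles it
BACKENDS: Dict[str, Callable[[str, str], str]] = {
    "GPT": fake_gpt,
    "Claude": fake_claude,
}


def generate(model: str, system: str, user: str) -> str:
    """Dispatch to the chosen backend, failing loudly on an unknown model name."""
    try:
        backend = BACKENDS[model]
    except KeyError:
        raise ValueError(f"Unknown model {model!r}; expected one of {sorted(BACKENDS)}") from None
    return backend(system, user)


print(generate("Claude", "You write brochures.", "Example Co."))
```

Adding another model then means adding one entry to `BACKENDS` rather than another branch.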

In [12]:
# Gradio interface to launch the web UI
# This lets users input a URL and select a model (GPT or Claude)
# and displays the generated brochure in markdown.

view = gr.Interface(
    fn=create_brochure,
    inputs=[gr.Textbox(label="Company website:"), gr.Dropdown(["GPT", "Claude"], label="Select model", value="GPT")],
    outputs=[gr.Markdown(label="Brochure:")],
    flagging_mode="never"
)
view.launch()

* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.
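
The textbox passes whatever the user types straight to `create_brochure`, so an entry such as `example.com` (no scheme) would fail inside `requests.get`. A small check before the scrape could catch this. The sketch below is an assumption about how that might look; `normalize_url` is a hypothetical helper that could be called at the top of `create_brochure`.

```python
from urllib.parse import urlparse


def normalize_url(raw: str) -> str:
    """Trim whitespace, default to https:// when no scheme is given, and require a host."""
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a usable web address: {raw!r}")
    return candidate


print(normalize_url("  example.com "))
print(normalize_url("http://example.com/about"))
```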


