<a href="https://colab.research.google.com/github/Shriansh16/LLM_Engineering/blob/main/02_Web_Scrapping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A full business solution
Create a product that builds a brochure for a company, to be used for prospective clients, investors and potential recruits.

We will be given a company name and its primary website.

See the end of this notebook for examples of real-world business applications.

And remember: I'm always available if you have problems or ideas! Please do reach out.

In [4]:
import os
import requests
import json
from typing import List
#from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

In [10]:

os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', "")
MODEL = 'gpt-3.5-turbo'
openai = OpenAI()

In [8]:
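# A class to represent a webpage

class Website:
    """A scraped webpage: raw body, title, visible text and outgoing links."""
    url: str
    title: str
    body: bytes  # response.content is bytes, not str
    text: str
    links: List[str]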

    def __init__(self, url):
        self.url = url
        response = requests.get(url)
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [9]:
ed = Website("https://edwarddonner.com")
print(ed.get_contents())

Webpage Title:
Home - Edward Donner
Webpage Contents:
Home
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.
We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve
patented
our matching model, and our award-winning platform has happy customers and tons of pr

Use a call to the chat model (MODEL, set to gpt-3.5-turbo above; gpt-4o-mini also works) to read the links on a webpage and respond in structured JSON.
It should decide which links are relevant, and replace relative links such as "/about" with "https://company.com/about".
We will use "one-shot prompting", in which the prompt includes one example of how the model should respond.

Sidenote: there is a more advanced technique called "Structured Outputs" in which we require the model to respond according to a spec. We cover this technique in Week 8 during our autonomous Agentic AI project.
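
The prompt asks the model to expand relative links, but the same step can be done deterministically in code. Here is a minimal sketch using urljoin from the standard library. The helper name absolutize_links and the sample links are hypothetical and not part of this notebook:

```python
from urllib.parse import urljoin

def absolutize_links(base_url, links):
    """Resolve each link against base_url; absolute links pass through unchanged."""
    return [urljoin(base_url, link) for link in links]

# Hypothetical sample input: one root-relative, one relative, one absolute link
resolved = absolutize_links(
    "https://company.com",
    ["/about", "careers", "https://other.example/press"],
)
print(resolved)
```

Resolving links before the call would leave the model with only the relevance decision, which may make its output more predictable.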

In [11]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [12]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [13]:
print(get_links_user_prompt(ed))

Here is the list of links on the website of https://edwarddonner.com - please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
https://edwarddonner.com/
https://edwarddonner.com/outsmart/
https://edwarddonner.com/about-me-and-about-nebula/
https://edwarddonner.com/posts/
https://edwarddonner.com/
https://news.ycombinator.com
https://nebula.io/?utm_source=ed&utm_medium=referral
https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html
https://patents.google.com/patent/US20210049536A1/
https://www.linkedin.com/in/eddonner/
https://edwarddonner.com/2024/08/06/outsmart/
https://edwarddonner.com/2024/08/06/outsmart/
https://edwarddonner.com/2024/06/26/choosing-the-right-llm-resources/
https://edwarddonner.com/2024/06/26/choosing-the-right-llm-resources/
https
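
The list above contains repeated URLs, and the prompt has to ask the model to ignore email links. One option is to clean the list before building the prompt. This is a sketch only: clean_links is a hypothetical helper, the prefixes it skips are an assumption about what counts as a non-web link, and the mailto address is made up:

```python
def clean_links(links):
    """Drop duplicates (keeping first-seen order) and non-web links such as mailto:."""
    seen = set()
    cleaned = []
    for link in links:
        # Assumed list of prefixes that never point at a brochure-worthy page
        if link.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        if link not in seen:
            seen.add(link)
            cleaned.append(link)
    return cleaned

sample = [
    "https://edwarddonner.com/",
    "https://edwarddonner.com/posts/",
    "https://edwarddonner.com/",
    "mailto:someone@example.com",  # made-up address for illustration
]
print(clean_links(sample))
```

A shorter list also means fewer tokens sent to the model.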

In [14]:
def get_links(url):
    website = Website(url)
    completion = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = completion.choices[0].message.content
    return json.loads(result)

In [15]:
get_links("https://anthropic.com")

{'links': [{'type': 'company page', 'url': 'https://anthropic.com/company'},
  {'type': 'careers page', 'url': 'https://anthropic.com/careers'},
  {'type': 'research page', 'url': 'https://anthropic.com/research'}]}


Second step: make the brochure!
Assemble all the details into another prompt for the same model.

In [16]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [17]:
print(get_all_details("https://anthropic.com"))

Found links: {'links': [{'url': 'https://anthropic.com/company'}, {'url': 'https://anthropic.com/careers'}]}


KeyError: 'type'
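
The KeyError above occurs because the model's reply left out the 'type' key that get_all_details reads with link['type']. JSON mode guarantees valid JSON, not a particular shape, so it can help to normalize the reply before using it. A minimal sketch, where normalize_links and the fallback label "page" are assumptions rather than part of the notebook:

```python
def normalize_links(payload):
    """Return a list of {'type', 'url'} dicts, skipping entries without a usable URL."""
    normalized = []
    for entry in payload.get("links", []):
        if not isinstance(entry, dict):
            continue
        url = entry.get("url")
        if not url:
            continue
        # Fall back to a generic label when the model omits 'type'
        normalized.append({"type": entry.get("type", "page"), "url": url})
    return normalized

# A reply shaped like the failing one above, plus an entry with no URL
reply = {"links": [{"url": "https://anthropic.com/company"}, {"type": "careers page"}]}
print(normalize_links(reply))
```

With this in place, get_all_details could loop over normalize_links(links) instead of links["links"].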

In [18]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."

# Or uncomment the lines below for a more humorous brochure - this demonstrates how easy it is to incorporate 'tone':

# system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
# and creates a short humorous, entertaining, jokey brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
# Include details of company culture, customers and careers/jobs if you have the information."

In [19]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:20_000] # Truncate if more than 20,000 characters
    return user_prompt
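
The slice in get_brochure_user_prompt cuts at exactly 20,000 characters, which can end the prompt mid-word or mid-line. If that matters, one possible variation is to cut back to the last complete line. In this sketch truncate_at_line is a hypothetical helper, shown with a small limit so the effect is visible:

```python
def truncate_at_line(text, limit=20_000):
    """Cut text to at most `limit` characters, preferring to end on a full line."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    last_newline = cut.rfind("\n")
    # Fall back to a hard cut when the kept prefix has no usable newline
    return cut if last_newline <= 0 else cut[:last_newline]

sample = "first line\nsecond line\nthird line"
print(truncate_at_line(sample, limit=15))
```

Character counts are only a rough proxy for tokens, so the limit is still a heuristic.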

In [20]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [21]:
create_brochure("Anthropic", "https://anthropic.com")

Found links: {'links': [{'type': 'company page', 'url': 'https://anthropic.com/company'}, {'type': 'careers page', 'url': 'https://anthropic.com/careers'}]}


# Anthropic Company Brochure

## About Us

Anthropic is an AI safety and research company based in San Francisco. We specialize in developing reliable, interpretable, and steerable AI systems. Our team consists of experts in machine learning, physics, policy, and product development. At Anthropic, we are dedicated to building AI systems that are not only beneficial but also safe for society.

## Company Culture

- **Mission-Driven:** We exist to ensure transformative AI helps people and society flourish. Our team is committed to building frontier systems and responsibly deploying them.
  
- **High Trust Environment:** We prioritize honesty, emotional maturity, and intellectual openness, fostering a collaborative and trusting work culture.
  
- **One Big Team:** Collaboration is central to our values. We work as a cohesive unit towards our shared mission, with leadership setting the strategy and valuing input from everyone.
  
- **Simplicity and Pragmatism:** We believe in doing what works. Our approach is pragmatic, emphasizing open communication and empiricism in all aspects of our work.

## Our Work

- **Research:** We conduct frontier AI research across various modalities, focusing on safety research areas such as interpretability and policy impacts analysis.
  
- **Policy:** We engage with policymakers and civil society to communicate our findings and promote safe and reliable AI practices.
  
- **Product:** We translate our research into practical tools like Claude, benefiting businesses, nonprofits, and society globally.

## Careers

Join us in making AI safe at Anthropic. Our team spans various backgrounds, from physics and machine learning to public policy and business. We offer competitive compensation, generous benefits, and a supportive work environment.

### What We Offer

- **Health & Wellness:** Comprehensive health insurance, inclusive fertility benefits, generous parental leave, unlimited PTO, and more.
  
- **Compensation & Support:** Competitive salary, equity packages with optional donation matching, 401(k) plan, and additional benefits like wellness stipend and education allowance.
  
- **Interview Process:** Our thorough interview process ensures unbiased hiring decisions, tailored to the role and candidate, prioritizing work tests and soft skills assessment.

If you are passionate about AI safety and want to be part of a team dedicated to building a better future, explore open roles at Anthropic and join us on our mission.

---

**Contact Us:**

Website: [Anthropic](https://www.anthropic.com)  
Email: contact@anthropic.com  
Stay Connected: [Twitter](https://twitter.com/AnthropicAI) | [LinkedIn](https://www.linkedin.com/company/anthropic-company) | [YouTube](https://www.youtube.com/anthropic)

Finally - a minor improvement
With a small adjustment, we can change this so that the results stream back from OpenAI with the familiar typewriter animation.

In [22]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True
    )

    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        response = response.replace("```","").replace("markdown", "")
        update_display(Markdown(response), display_id=display_handle.display_id)
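
The cleanup line in stream_brochure removes every occurrence of the word "markdown", including ones in the brochure's own text. A narrower alternative is to strip only a wrapping code fence. The sketch below runs without an API call: fake_deltas is a made-up stand-in for the values of chunk.choices[0].delta.content, and strip_markdown_fence is a hypothetical helper:

```python
FENCE = "`" * 3  # three backticks, built programmatically to keep this listing simple

def strip_markdown_fence(text):
    """Remove a leading fence line (with or without 'markdown') and a trailing fence."""
    lines = text.split("\n")
    if lines and lines[0].strip() in (FENCE, FENCE + "markdown"):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)

# Simulated deltas; None mimics chunks that carry no text
fake_deltas = [FENCE + "markdown\n", "# Acme\n", "We write ", "markdown tools.\n", None, FENCE]

response = ""
for delta in fake_deltas:
    response += delta or ""
    # In the notebook, this cleaned text is what would go to update_display
    cleaned = strip_markdown_fence(response)

print(cleaned)
```

The word "markdown" inside the sentence survives, while the wrapping fence lines are dropped.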

In [23]:
stream_brochure("Anthropic", "https://anthropic.com")

Found links: {'links': [{'type': 'company page', 'url': 'https://anthropic.com/company'}, {'type': 'careers page', 'url': 'https://anthropic.com/careers'}]}


# Anthropic: Building Safe and Reliable AI Systems

## About Us

At Anthropic, we are at the forefront of AI safety and research. Our team in San Francisco comprises interdisciplinary experts in machine learning, physics, policy, and product development. We are dedicated to creating AI systems that are not only beneficial but also safe and reliable for society.

## Company Culture

- **Mission-Driven**: We exist to ensure that transformative AI benefits humanity. We actively collaborate with various stakeholders to fulfill our mission.
- **High Trust Environment**: We value open communication, honesty, and emotional maturity within our team.
- **Collaborative**: We work together as one big team, each contributing to the shared goal of advancing AI safety.
- **Practical and Empirical**: We prioritize pragmatic solutions and empirical approaches to our research and development.

## Our Work

- **Building Safer Systems**: We focus on developing AI systems that are not only advanced but also interpretable and steerable for users.
- **Interdisciplinary Research**: From AI research to policy implications, we explore various safety facets to drive responsible AI development.
- **Product Development**: Our research translates into tangible tools like Claude, benefiting businesses, nonprofits, and society as a whole.

## Careers at Anthropic

- **Diverse Team**: Our team members come from diverse backgrounds such as physics, machine learning, policy, and more.
- **Benefits**: We offer competitive compensation, health, wellness benefits, and opportunities to contribute to charitable causes through equity.
- **Hiring Process**: Our inclusive hiring process involves skill assessments tailored to the role, ensuring transparency and minimizing biases.

## Join Us

Are you passionate about building a safer future with AI? Explore our open roles at [Anthropic Careers](URL) and be part of a team dedicated to shaping the future of AI for the better.

---
**Claude | API | Team | Pricing | Research | Company | Customers | News | Careers**
*Follow us on Twitter, LinkedIn, and YouTube for updates and insights.*
**© 2024 Anthropic PBC**

In [24]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'url': 'https://huggingface.co/about', 'type': 'About page'}, {'url': 'https://huggingface.co/careers', 'type': 'Careers/Jobs page'}]}


# Hugging Face: Building the Future of AI

Welcome to **Hugging Face**, the AI community shaping the future of technology. At Hugging Face, we offer a collaborative platform for machine learning enthusiasts to work on cutting-edge models, datasets, and applications.

### Company Culture
Join a diverse community of over 50,000 organizations, ranging from non-profits to enterprise giants like Amazon Web Services and Google. Our foundation is built on open-source projects like Transformers, Diffusers, and Tokenizers, providing state-of-the-art tools for ML development.

### Our Customers
We cater to a wide range of customers utilizing AI technologies. Whether you are a data scientist, researcher, or business in need of AI solutions, Hugging Face has you covered. Our paid Compute and Enterprise solutions offer optimized inference endpoints, enterprise-grade security, and dedicated support.

### Careers and Jobs
Passionate about AI and ML? Consider joining our dynamic team at Hugging Face. As part of our growing organization, you will have the opportunity to work on groundbreaking projects, collaborate with industry experts, and contribute to the future of machine learning.

#### Discover More about Hugging Face
- **Website:** [Hugging Face](https://huggingface.co)
- **Documentation:** Find detailed information on our solutions.
- **Blog:** Stay updated on the latest developments in AI.
- **Social:** Connect with us on GitHub, Twitter, LinkedIn, and Discord.

Join us at **Hugging Face** as we innovate, collaborate, and create the AI-driven future together.