<a href="https://colab.research.google.com/github/1uch0/LLM_Udemy_course/blob/main/LLM_Brochure_Creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [26]:
# imports
# If these fail, please check you're running from an 'activated' environment with (llms) in the command prompt

import os
import requests
import json
from typing import List
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI
from google.colab import userdata




In [27]:
api_key = userdata.get('Open_AI_API')  # 'Open_AI_API' is the name under which the OpenAI API key is saved in Colab secrets
MODEL = 'gpt-4o-mini'

In [11]:
# Check the key

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

API key found and looks good so far!
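
The same checks can be wrapped in a small function, which makes them easy to test. This is only a sketch: `check_api_key` is a hypothetical helper, not part of the notebook, and the `sk-proj-` prefix is the notebook's own assumption about the key format. It tests for stray whitespace before the prefix, so a key with a leading space gets the more specific diagnosis.

```python
def check_api_key(key):
    """Return a short diagnostic label for an API key.

    Hypothetical helper for illustration; the 'sk-proj-' prefix mirrors
    the check used in the notebook cell above.
    """
    if not key:
        return "missing"
    if key.strip() != key:
        return "whitespace"
    if not key.startswith("sk-proj-"):
        return "unexpected prefix"
    return "ok"


print(check_api_key(" sk-proj-abc"))  # whitespace
```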


In [12]:
openai = OpenAI(api_key=api_key)

In [14]:
# A class to represent a Webpage

# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """

    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers)
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [15]:
ed = Website('https://edwarddonner.com')

In [18]:
ed = Website("https://edwarddonner.com")
#print(ed.get_contents())
#Now we print all the links in the webpage
ed.links

['https://edwarddonner.com/',
 'https://edwarddonner.com/connect-four/',
 'https://edwarddonner.com/outsmart/',
 'https://edwarddonner.com/about-me-and-about-nebula/',
 'https://edwarddonner.com/posts/',
 'https://edwarddonner.com/',
 'https://news.ycombinator.com',
 'https://nebula.io/?utm_source=ed&utm_medium=referral',
 'https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html',
 'https://patents.google.com/patent/US20210049536A1/',
 'https://www.linkedin.com/in/eddonner/',
 'https://edwarddonner.com/2025/01/23/llm-workshop-hands-on-with-agents-resources/',
 'https://edwarddonner.com/2025/01/23/llm-workshop-hands-on-with-agents-resources/',
 'https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/',
 'https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/',
 'https://edwarddonner.com/2024/11/13/llm-engineering-resources/',
 'https://edwarddonner.com/2024/11/13/llm-engineering-resources/',
 'ht

In [19]:
## First step: have GPT-4o-mini figure out which links are relevant
## Use a call to gpt-4o-mini to read the links on a webpage and respond in structured JSON.
## It should decide which links are relevant, and replace relative links such as "/about" with "https://company.com/about".
## We will use "one-shot prompting", in which the prompt includes one example of how the model should respond.

## This is an excellent use case for an LLM, because it requires nuanced understanding. Imagine trying to code this without LLMs by parsing and analyzing the webpage - it would be very hard!

## Sidenote: there is a more advanced technique called "Structured Outputs", in which we require the model to respond according to a spec. We cover this technique in Week 8 during our autonomous Agentic AI project.
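
Deciding which links are relevant needs the model, but turning a relative link into a full URL is mechanical. The sketch below shows how that part could be done with the standard library before or after the model call. It is not what the notebook does (the notebook asks the model to expand the links); `absolute_links` is a hypothetical helper and `https://company.com/` is a placeholder URL.

```python
from urllib.parse import urljoin, urldefrag


def absolute_links(base_url, links):
    """Resolve links against base_url, drop fragments and duplicates, keep order."""
    seen = set()
    result = []
    for link in links:
        if link.startswith(("mailto:", "tel:", "javascript:")):
            continue  # not a fetchable web page
        # urljoin leaves absolute URLs alone and expands relative ones
        url, _fragment = urldefrag(urljoin(base_url, link))
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


print(absolute_links("https://company.com/", ["/about", "/about", "https://other.org/x"]))
```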

In [20]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [21]:
link_system_prompt

'You are provided with a list of links found on a webpage. You are able to decide which of the links would be most relevant to include in a brochure about the company, such as links to an About page, or a Company page, or Careers/Jobs pages.\nYou should respond in JSON as in this example:\n{\n    "links": [\n        {"type": "about page", "url": "https://full.url/goes/here/about"},\n        {"type": "careers page", "url": "https://another.full.url/careers"}\n    ]\n}\n'
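
With one-shot prompting the model tends to imitate the example it is given, so it is worth checking that the example itself parses as JSON. A minimal sketch of such a check follows; `is_valid_json` and both sample strings are illustrative and not part of the notebook.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical example strings, for illustration only
good = '{"links": [{"type": "about page", "url": "https://full.url/goes/here/about"}]}'
bad = '{"links": [{"type": "about page": "url": "https://full.url/goes/here/about"}]}'

print(is_valid_json(good), is_valid_json(bad))  # True False
```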

In [22]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [23]:
print(get_links_user_prompt(ed))

Here is the list of links on the website of https://edwarddonner.com - please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
https://edwarddonner.com/
https://edwarddonner.com/connect-four/
https://edwarddonner.com/outsmart/
https://edwarddonner.com/about-me-and-about-nebula/
https://edwarddonner.com/posts/
https://edwarddonner.com/
https://news.ycombinator.com
https://nebula.io/?utm_source=ed&utm_medium=referral
https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html
https://patents.google.com/patent/US20210049536A1/
https://www.linkedin.com/in/eddonner/
https://edwarddonner.com/2025/01/23/llm-workshop-hands-on-with-agents-resources/
https://edwarddonner.com/2025/01/23/llm-workshop-hands-on-with-agents-resources/
https://edwarddonner.com/2024/12/21/

In [28]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}  # JSON mode: the reply is a single JSON object
    )
    result = response.choices[0].message.content
    return json.loads(result)  # Parse the JSON string into a Python dict
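
JSON mode yields a parseable object, but it does not guarantee the keys or the URL format requested in the prompt. The sketch below shows one way the reply could be checked before the links are fetched. `parse_links_response` is a hypothetical helper, and the expected shape is assumed from the system prompt's example.

```python
import json


def parse_links_response(raw):
    """Parse the model's reply and keep only well-formed link entries.

    Assumes the {"links": [{"type": ..., "url": ...}]} shape requested in
    the system prompt; anything that does not fit is dropped.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"links": []}
    if not isinstance(data, dict):
        return {"links": []}
    links = data.get("links", [])
    if not isinstance(links, list):
        return {"links": []}
    clean = [
        {"type": str(item.get("type", "page")), "url": item["url"]}
        for item in links
        if isinstance(item, dict)
        and isinstance(item.get("url"), str)
        and item["url"].startswith(("http://", "https://"))
    ]
    return {"links": clean}
```

Entries with a relative or missing URL are discarded here rather than repaired, which keeps the later page fetches from failing on malformed input.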

In [30]:
get_links("https://anthropic.com")

{'links': [{'type': 'about page', 'url': 'https://www.anthropic.com/company'},
  {'type': 'careers page', 'url': 'https://www.anthropic.com/careers'},
  {'type': 'team page', 'url': 'https://www.anthropic.com/team'},
  {'type': 'research page', 'url': 'https://www.anthropic.com/research'},
  {'type': 'news page', 'url': 'https://www.anthropic.com/news'}]}

In [31]:
anthropic = Website("https://anthropic.com")
anthropic.links

['#main-content',
 '#footer',
 '/',
 '/news',
 'https://claude.ai/',
 'https://www.anthropic.com/research#entry:8@1:url',
 'https://www.anthropic.com/claude',
 'https://claude.ai/',
 '/api',
 'https://www.anthropic.com/news/claude-3-7-sonnet',
 '/claude/sonnet',
 '/news/visible-extended-thinking',
 '/news/claude-for-enterprise',
 '/research/constitutional-ai-harmlessness-from-ai-feedback',
 '/news/core-views-on-ai-safety',
 '/jobs',
 '/',
 '/claude',
 '/team',
 '/enterprise',
 'https://claude.ai/download',
 '/pricing',
 'http://claude.ai/login',
 '/api',
 'https://docs.anthropic.com/',
 '/pricing#anthropic-api',
 'https://console.anthropic.com/',
 '/research',
 '/economic-index',
 '/claude/sonnet',
 '/claude/haiku',
 '/news/claude-3-family',
 '/transparency',
 '/responsible-scaling-policy',
 'https://trust.anthropic.com',
 '/solutions/coding',
 '/news',
 '/customers',
 '/engineering',
 '/company',
 '/careers',
 'https://status.anthropic.com/',
 '/supported-countries',
 'https://support

In [32]:
# Anthropic has made their site harder to scrape, so I'm using HuggingFace instead.

huggingface = Website("https://huggingface.co")
huggingface.links

['/',
 '/models',
 '/datasets',
 '/spaces',
 '/posts',
 '/docs',
 '/enterprise',
 '/pricing',
 '/login',
 '/join',
 '/spaces',
 '/models',
 '/Qwen/QwQ-32B',
 '/deepseek-ai/DeepSeek-R1',
 '/microsoft/Phi-4-multimodal-instruct',
 '/SparkAudio/Spark-TTS-0.5B',
 '/tencent/HunyuanVideo-I2V',
 '/models',
 '/spaces/ASLP-lab/DiffRhythm',
 '/spaces/Qwen/QwQ-32B-Demo',
 '/spaces/Wan-AI/Wan2.1',
 '/spaces/nanotron/ultrascale-playbook',
 '/spaces/black-forest-labs/FLUX.1-dev',
 '/spaces',
 '/datasets/facebook/natural_reasoning',
 '/datasets/FreedomIntelligence/medical-o1-reasoning-SFT',
 '/datasets/Congliu/Chinese-DeepSeek-R1-Distill-data-110k',
 '/datasets/KodCode/KodCode-V1',
 '/datasets/SynthLabsAI/Big-Math-RL-Verified',
 '/datasets',
 '/join',
 '/pricing#endpoints',
 '/pricing#spaces',
 '/pricing',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/allenai',
 '/facebook',
 '/amazon',
 '/google',
 '/Intel',
 '/microsoft',
 '/gramma

In [33]:
get_links("https://huggingface.co")

{'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'},
  {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'},
  {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'},
  {'type': 'blog page', 'url': 'https://huggingface.co/blog'},
  {'type': 'community page', 'url': 'https://discuss.huggingface.co'},
  {'type': 'GitHub page', 'url': 'https://github.com/huggingface'},
  {'type': 'LinkedIn page',
   'url': 'https://www.linkedin.com/company/huggingface/'},
  {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'}]}

In [34]:
#Second step: make the brochure!

In [35]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)  ##Get the links that matter the most
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result
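
One linked page that times out or blocks scraping would make the whole function fail. The sketch below shows a variant that skips such pages. It is illustrative only: `collect_details` is a hypothetical name, and the `fetch` argument stands in for `Website(url).get_contents()` so the example runs without network access.

```python
def collect_details(url, links, fetch):
    """Build the combined text, skipping linked pages whose fetch fails.

    `fetch` is any callable that takes a URL and returns page text.
    """
    parts = ["Landing page:\n" + fetch(url)]
    for link in links["links"]:
        try:
            parts.append(f"\n\n{link['type']}\n" + fetch(link["url"]))
        except Exception as error:  # broad on purpose: any failure skips the page
            print(f"Skipping {link['url']}: {error}")
    return "".join(parts)


def fake_fetch(page_url):
    """Stand-in fetcher: fails for any URL containing 'bad'."""
    if "bad" in page_url:
        raise ValueError("could not load page")
    return f"[{page_url}]"


sample_links = {"links": [
    {"type": "about page", "url": "https://x.com/about"},
    {"type": "careers page", "url": "https://x.com/bad"},
]}
print(collect_details("https://x.com", sample_links, fake_fetch))
```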

In [36]:
print(get_all_details("https://huggingface.co"))

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'company page', 'url': 'https://www.linkedin.com/company/huggingface/'}, {'type': 'discussion forum', 'url': 'https://discuss.huggingface.co'}]}
Landing page:
Webpage Title:
Hugging Face – The AI community building the future.
Webpage Contents:
Hugging Face
Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up
The AI community building the future.
The platform where the machine learning community collaborates on models, datasets, and applications.
Explore AI Apps
or
Browse 1M+ models
Trending on
this week
Models
Qwen/QwQ-32B
Updated
about 5 hours ago
•
169k
•
1.94k
deepseek-ai/DeepSeek-R1
Updated
16 days ago
•
3.21M
•
11.2k
microsoft/Phi-4-multimodal-instruct
Updated
4 days ago
•
366k
•
1.09k
SparkAudio/Spark-TTS-0.5B
Updated
4 days ago


In [37]:
# A neutral-tone system prompt. To use it, uncomment these lines and comment out the humorous version below:
#system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
#and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
#Include details of company culture, customers and careers/jobs if you have the information."

# The active prompt asks for a more humorous brochure - this demonstrates how easy it is to incorporate 'tone':

system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short humorous, entertaining, jokey brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."
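
Instead of commenting one prompt in and the other out, the tone can be passed as a parameter. The sketch below is one possible arrangement; `build_system_prompt` is a hypothetical helper. Joining the sentences with `" ".join` also avoids depending on a space before each backslash continuation, since a continuation inside a string literal joins the next line with no separator.

```python
def build_system_prompt(tone=""):
    """Assemble the brochure system prompt, optionally with a tone such as 'humorous'."""
    style = f"short {tone} brochure" if tone else "short brochure"
    sentences = [
        "You are an assistant that analyzes the contents of several relevant pages "
        f"from a company website and creates a {style} about the company for "
        "prospective customers, investors and recruits.",
        "Respond in markdown.",
        "Include details of company culture, customers and careers/jobs if you have the information.",
    ]
    return " ".join(sentences)


print(build_system_prompt("humorous"))
```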

In [38]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
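
A hard slice at 5,000 characters can cut the prompt in the middle of a word or line. The sketch below shows a variant that backs up to the last complete line. `truncate_at_line` is a hypothetical helper, and the 5,000-character default simply mirrors the limit used above.

```python
def truncate_at_line(text, limit=5_000):
    """Cut text to at most `limit` characters, ending on a line boundary when possible."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    last_newline = cut.rfind("\n")
    # Fall back to a hard cut if there is no earlier line break to stop at
    return cut[:last_newline] if last_newline > 0 else cut


print(repr(truncate_at_line("abc\ndef\nghi", limit=6)))  # 'abc'
```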

In [39]:
get_brochure_user_prompt("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'company page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'learning resources page', 'url': 'https://huggingface.co/learn'}, {'type': 'community page', 'url': 'https://discuss.huggingface.co'}]}


'You are looking at a company called: HuggingFace\nHere are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\nLanding page:\nWebpage Title:\nHugging Face – The AI community building the future.\nWebpage Contents:\nHugging Face\nModels\nDatasets\nSpaces\nPosts\nDocs\nEnterprise\nPricing\nLog In\nSign Up\nThe AI community building the future.\nThe platform where the machine learning community collaborates on models, datasets, and applications.\nExplore AI Apps\nor\nBrowse 1M+ models\nTrending on\nthis week\nModels\nQwen/QwQ-32B\nUpdated\nabout 5 hours ago\n•\n169k\n•\n1.94k\ndeepseek-ai/DeepSeek-R1\nUpdated\n16 days ago\n•\n3.21M\n•\n11.2k\nmicrosoft/Phi-4-multimodal-instruct\nUpdated\n4 days ago\n•\n366k\n•\n1.09k\nSparkAudio/Spark-TTS-0.5B\nUpdated\n4 days ago\n•\n5.61k\n•\n253\ntencent/HunyuanVideo-I2V\nUpdated\nabout 8 hours ago\n•\n1.63k\n•\n214\nBrowse 1M+ models\nSpaces\nRunning\non\nZero\n376\n37

In [40]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            # Building the user prompt scrapes the site and makes one gpt-4o-mini call to pick the relevant links
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [41]:
create_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'community page', 'url': 'https://discuss.huggingface.co'}, {'type': 'GitHub page', 'url': 'https://github.com/huggingface'}, {'type': 'LinkedIn page', 'url': 'https://www.linkedin.com/company/huggingface/'}, {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'}]}


# Welcome to Hugging Face!
> *The AI community building the future... one Hug at a time!*

---

## Who Are We?
At Hugging Face, we don’t just create AI; we create **FANTASTIC** AI models, datasets, and applications—like a buffet but for machine learning. You can indulge in our **1M+ models** and **250k+ datasets** without worrying about calories! 🍽️

---

## Our Customers
More than **50,000 organizations** are hugging us tightly! From tech giants like **Google** and **Amazon** to passionate non-profits like **AI2**, we cater to anyone needing a good AI cuddle. 💼

---

## A Day in the Life at Hugging Face
Ever wondered what it's like to work at a place where the coffee is always strong, the code is always open-source, and the only drama is deciding whether to use TensorFlow or PyTorch? Here’s a typical day:
1. **Morning:** Staff meeting – everyone hugs it out before diving into code.
2. **Afternoon:** Collaboration session fueled by snacks (who doesn't love gummy bears in algorithms?).
3. **Evening:** Group yoga to relax after optimizing some of the hottest models like **Qwen/QwQ-32B**.

---

## Job Opportunities
We're on the lookout for new team members to help shape the AI future:
- Want to be a model organizer? Join us!
- Backend wizard? Perfect, our models could use some magic!
- If you can make a mean cup of coffee, we might just hire you on the spot. ☕️✨

---

## Why Join Us?
- **Learn & Grow:** Hugging Face isn’t just a workplace, it’s your personal research lab! Experiment, collaborate, and build your ML portfolio.
- **Innovative Culture:** We believe in creating a relaxed, open atmosphere where ideas can flow as openly as the data.
- **Open Source Love:** Join a community that believes in sharing: we have over **140,983 transformers** (no, not the car kind).

---

## Join the Fun!
Curious about our models? Excited to hyperparameter-tune your life? Dive into our **Spaces** and check out trending AI Apps! We promise it’ll be a ride wilder than a rollercoaster built by AI itself! 🎢

**Explore Hugging Face: [Sign Up Here!](#)**

---
*Hugging Face – where every byte is a hug and every project is a warm embrace!* 🤗

---

In [42]:
#Finally - a minor improvement
#With a small adjustment, we can change this so that the results stream back from OpenAI, with the familiar typewriter animation

In [45]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True  # Return an iterator of chunks instead of waiting for the full response
    )

    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
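        response += chunk.choices[0].delta.content or ''
        # Strip a wrapping ```markdown fence for display only; the word "markdown" in the body is left alone
        cleaned = response.replace("```markdown", "").replace("```", "")
        update_display(Markdown(cleaned), display_id=display_handle.display_id)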
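
The accumulation loop can be tried without calling the API by feeding it objects shaped like streamed chunks. In the sketch below, `fake_stream` and `accumulate` are hypothetical helpers; the chunk shape (`choices[0].delta.content`, which may be `None`) is taken from the loop above. The raw text is kept intact and only the code fence is stripped for display.

```python
from types import SimpleNamespace


def fake_stream(pieces):
    """Yield objects shaped like streamed chat chunks (delta.content may be None)."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def accumulate(stream):
    """Yield the cleaned text after each chunk, as a display loop would render it."""
    response = ""
    for chunk in stream:
        if not chunk.choices:
            continue  # defensive: skip chunks that carry no choices
        response += chunk.choices[0].delta.content or ""
        yield response.replace("```markdown", "").replace("```", "")


frames = list(accumulate(fake_stream(["```", "markdown\n", "# Hi", None, "\n```"])))
print(repr(frames[-1]))  # '\n# Hi\n'
```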

In [46]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'discussion page', 'url': 'https://discuss.huggingface.co'}, {'type': 'GitHub page', 'url': 'https://github.com/huggingface'}, {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'}, {'type': 'LinkedIn page', 'url': 'https://www.linkedin.com/company/huggingface/'}]}


# 🤗 Hugging Face Brochure: Bringing AI to Life, One Hug at a Time! 

---

## Welcome to the Future

At **Hugging Face**, we’re not just building models; we’re building a community! A place where AI, machine learning, and a splash of fun come together. Join our quest and be a part of the AI revolution — no actual hugging required (unless you're into that)!

---

## Our Magic Ingredients 🪄

- **1M+ Models:** That’s more than the number of times you’ve tried to learn to ride a bike yet fell. Explore tiny models like *Qwen/QwQ-32B* and impressive giants like *deepseek-ai/DeepSeek-R1* — we promise they won’t need training wheels!
- **Datasets Galore:** You think your fridge is stocked? Try **250k+ datasets**! That’s enough data to make even your own pet cat question its existence. 
- **Spaces:** Whether you're a musician or a wizard of words, our **Spaces** let you create, discover, and collaborate. Explore applications like **Di♪♪Rhythm** for song generation — still waiting for our star, the new AI Mozart!

---

## Who's Hugging Us? 🤗💚

More than **50,000 organizations** are already on board! That includes the likes of **Amazon**, **Microsoft**, **Google**, and even **Grammarly**! You know you’ve made it when your name gets dropped in the same breath as those tech juggernauts.

---

## Careers at Hugging Face 💼

Looking to join a company that’s not just building the future but tickling it with delightful laughter? We’ve got jobs! From machine learning enthusiasts to snack specialists for our brainstorming sessions, we welcome folks from all walks of AI.

### Why Join Us? ⁉️

1. **Community-Centric Culture:** We value collaboration over competition. Think of it like a potluck—bring your skills, and we’ll bring the fun!
2. **Flexibility:** Work from anywhere! Whether you prefer your home office, a coffee shop, or under a tree in the park — as long as there’s Wi-Fi, we’re all good!
3. **Career Growth:** With our abundance of opportunities, you’ll progress faster than a cat chasing a laser pointer.

---

## Join the Hug Brigade! 🚀

- **For Customers:** Dive into our ocean of models and datasets, and build the AI of your dreams. Or just use our tools to make your day jobs a little less boring.
- **For Investors:** We’re creating the future of AI— and yes, we're also looking for partners who want to hug their way into profitability!
- **For Recruits:** If you love cutting-edge technology and have a penchant for being part of a warm culture, we’ve got a spot saved just for you here at Hugging Face.

---

## Let’s Keep in Touch!

Want to learn more? Visit us at [Hugging Face](https://huggingface.co) or connect with us on social media, where the fun never stops!

---

*Who knew building the future could be this much fun? We’re not just Hugging Face; we’re HUGGING THE WORLD! 🤗✨*