<a href="https://colab.research.google.com/github/MohHasan1/ScrapeSummarize/blob/main/Brochure_Genertator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# imports

import os
import requests
import json
from typing import List
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

In [7]:
OPENAPI_KEY = "Your api key"

# Check for stray whitespace before the prefix; otherwise a padded key is misreported as a format error.
if not OPENAPI_KEY:
    print("No API key found. Please add the OPENAPI_KEY above.")
elif OPENAPI_KEY.strip() != OPENAPI_KEY:
    print("API key has leading/trailing spaces. Remove them.")
elif not OPENAPI_KEY.startswith("sk-proj-"):
    print("Invalid API key format. Ensure it starts with 'sk-proj-'.")
else:
    print("API key is valid!")

API key is valid!
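Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the key is stored in an environment variable named `OPENAI_API_KEY` (an assumption, not something this notebook sets up), the same checks can be wrapped in a small helper. `check_api_key` is a hypothetical name used only for illustration.

```python
import os


def check_api_key(key):
    """Return a human-readable status message for an API key string."""
    if not key:
        return "No API key found."
    if key.strip() != key:
        return "API key has leading/trailing spaces. Remove them."
    if not key.startswith("sk-proj-"):
        return "Invalid API key format. Ensure it starts with 'sk-proj-'."
    return "API key looks valid."


# OPENAI_API_KEY is an assumed variable name; set it in your shell or notebook secrets.
key_from_env = os.environ.get("OPENAI_API_KEY", "")
print(check_api_key(key_from_env))
```

The helper only checks the shape of the string. Whether the key is actually accepted is only known after a real API call.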


In [9]:
# Connect to OpenAI

openai = OpenAI(api_key=OPENAPI_KEY)

# Test the connection.

response = openai.chat.completions.create(model="gpt-4o-mini", messages=[{"role":"user", "content":"Hello, GPT! How are you?"}])
print(response.choices[0].message.content)

Hello! I'm just a program, so I don't have feelings, but I'm here and ready to help you. How can I assist you today?


# **Website Class**

In [10]:
# A class to represent a Webpage

# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that has been scraped, with links.
    """

    def __init__(self, url):
        self.url = url

        # Fetch the HTML content (will not work for client-rendered web apps such as React)
        response = requests.get(url, headers=headers)
        self.body = response.content

        # Parse the html
        soup = BeautifulSoup(self.body, 'html.parser')

        # clean the html and set the contents
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""

        # Getting all the links
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [26]:
website = Website("http://hasan-swe.dev/")
website.links

['/home',
 'https://github.com/MohHasan1',
 'https://www.linkedin.com/in/hasan-in/',
 'https://mail.google.com/mail/?view=cm&fs=1&to=hasan.swe.dev@gmail.com',
 'https://algo-mazes.netlify.app/',
 'https://byte-link.netlify.app/',
 'https://ignore-git.netlify.app/',
 'https://goal-app-70cf4.firebaseapp.com/',
 '/portfolio/achievement',
 '/portfolio/user-profile',
 '/portfolio/project-explorer',
 '/portfolio/work-experience',
 '/portfolio/skill-set',
 '/portfolio/connect',
 '/resume']

# **First step: Have GPT-4o-mini figure out which links are relevant**

Use a call to gpt-4o-mini to read the links on a webpage, and respond in structured JSON.

It should decide which links are relevant, and replace relative links such as "/about" with "https://company.com/about".

We will use "one-shot prompting", in which the prompt includes one example of how the model should respond.

This is an excellent use case for an LLM, because it requires nuanced understanding. Imagine trying to code this without LLMs by parsing and analyzing the webpage - it would be very hard!
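Deciding which links are relevant is the part that needs the LLM. Resolving relative links, by contrast, is mechanical. As a rough sketch, assuming you wanted to resolve links yourself rather than ask the model to do it, the standard library's `urljoin` covers it. `absolutize_links` is a hypothetical helper and is not used elsewhere in this notebook.

```python
from urllib.parse import urljoin, urlparse


def absolutize_links(base_url, links):
    """Resolve relative links against base_url and drop non-web schemes."""
    resolved = []
    for link in links:
        full = urljoin(base_url, link)  # "/about" -> "https://company.com/about"
        if urlparse(full).scheme in ("http", "https"):
            resolved.append(full)
    return resolved


example = absolutize_links(
    "https://company.com/",
    ["/about", "careers", "https://github.com/company", "mailto:hi@company.com"],
)
print(example)
```

This keeps absolute URLs unchanged and filters out the `mailto:` entry, so only the relevance decision would be left for the model.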


Sidenote: there is a more advanced technique called "Structured Outputs" in which we require the model to respond according to a spec. We cover this technique in Week 8 during our autonomous Agentic AI project.
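Until then, a lighter safeguard is to check the shape of the JSON after parsing it. The sketch below is plain Python validation, not the Structured Outputs feature itself. `validate_links_payload` is a hypothetical helper, and it assumes the `{"links": [{"type": ..., "url": ...}]}` shape used in this notebook's prompts.

```python
import json


def validate_links_payload(raw):
    """Parse a JSON string and check it has the {"links": [{"type", "url"}]} shape."""
    data = json.loads(raw)
    links = data.get("links") if isinstance(data, dict) else None
    if not isinstance(links, list):
        raise ValueError('expected a top-level "links" list')
    for item in links:
        if not isinstance(item, dict):
            raise ValueError(f"link entry is not an object: {item!r}")
        for key in ("type", "url"):
            if not isinstance(item.get(key), str):
                raise ValueError(f'link entry missing string field "{key}": {item!r}')
    return data


sample = '{"links": [{"type": "about page", "url": "https://full.url/goes/here/about"}]}'
checked = validate_links_payload(sample)
print(checked["links"][0]["url"])
```

A failed check raises `ValueError`, which a caller could catch to retry the request or skip the page.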

# **Prompt Engineering**

In [21]:
# System (one-shot prompting)
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [22]:
# User
def get_links_user_prompt(website: Website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

# **Get all relevant links using GPT**

In [27]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)

In [30]:
get_links("http://hasan-swe.dev/")

{'links': [{'type': 'portfolio achievements',
   'url': 'http://hasan-swe.dev/portfolio/achievement'},
  {'type': 'portfolio user profile',
   'url': 'http://hasan-swe.dev/portfolio/user-profile'},
  {'type': 'portfolio project explorer',
   'url': 'http://hasan-swe.dev/portfolio/project-explorer'},
  {'type': 'portfolio work experience',
   'url': 'http://hasan-swe.dev/portfolio/work-experience'},
  {'type': 'portfolio skill set',
   'url': 'http://hasan-swe.dev/portfolio/skill-set'},
  {'type': 'portfolio connect',
   'url': 'http://hasan-swe.dev/portfolio/connect'},
  {'type': 'resume', 'url': 'http://hasan-swe.dev/resume'},
  {'type': 'GitHub profile', 'url': 'https://github.com/MohHasan1'},
  {'type': 'LinkedIn profile',
   'url': 'https://www.linkedin.com/in/hasan-in/'}]}

# **Second step: make the brochure!**
Assemble all the details into another prompt to gpt-4o-mini.

# **Scrape all the relevant links**

In [31]:
def get_all_details(url):
    # Home or landing page
    result = "Home/Landing page:\n"

    # Get the Home/landing page contents
    result += Website(url).get_contents()

    # Get all relevant links
    links = get_links(url)
    print("Found links:", links)

    # Get the contents of each relevant link
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [None]:
print(get_all_details("http://hasan-swe.dev/"))

# **Prompt engineering**

In [43]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."

In [41]:
# user
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a website called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
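A hard slice at 5,000 characters can cut a line in half. As a small sketch, assuming it is acceptable to drop the trailing partial line, the cut can back up to the last newline. `truncate_at_line` is a hypothetical helper and is not part of the notebook's code.

```python
def truncate_at_line(text, limit=5_000):
    """Cut text to at most `limit` characters, backing up to the last newline if any."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    last_newline = cut.rfind("\n")
    # Fall back to a hard cut when there is no newline to back up to.
    return cut[:last_newline] if last_newline > 0 else cut


sample = "line one\nline two\nline three"
print(truncate_at_line(sample, limit=12))  # prints "line one"
```

Character counts are only a rough proxy for tokens, so the limit is still a heuristic.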

In [None]:
get_brochure_user_prompt("HasanOS", "http://hasan-swe.dev/")

# **Create the brochure**

In [39]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [46]:
create_brochure("Github", "http://github.com")

Found links: {'links': [{'type': 'about page', 'url': 'https://github.com/about'}, {'type': 'careers page', 'url': 'https://github.careers'}, {'type': 'company blog', 'url': 'https://github.blog'}, {'type': 'customer stories', 'url': 'https://github.com/customer-stories'}, {'type': 'team page', 'url': 'https://github.com/team'}, {'type': 'newsroom', 'url': 'https://github.com/newsroom'}]}


# GitHub Brochure

---

## Welcome to GitHub

### Build and Ship Software on a Single, Collaborative Platform

Join the world’s most widely adopted AI-powered developer platform and discover how your organization can innovate and grow with GitHub.

### Our Vision

GitHub empowers developers worldwide by providing them with tools that streamline their workflows, simplify collaboration, and enhance productivity. Our mission is to create a platform where developers can connect, collaborate, and build remarkable software together.

---

## Key Features

- **GitHub Copilot**: Discover AI-driven code assistance that boosts productivity by 25% or more. Get code suggestions, refactor code, and even chat with AI—all integrated into your development pipeline.

- **GitHub Actions**: Automate your workflows effortlessly with a flexible CI/CD solution. Optimize processes and enhance deployment strategies to save time and resources.

- **Codespaces**: Start building instantly with a complete development environment in the cloud. Access your projects from anywhere, whether from the office, home, or on the go.

- **Security**: Protect your code with built-in security features. Identify and resolve vulnerabilities before they become a problem with advanced security alerts and robust compliance tools.

---

## Our Customers

GitHub is trusted by a diverse range of customers worldwide, including:

- **Enterprises**
- **Small and Medium Teams**
- **Startups**
- **Nonprofits**

### Industry Applications

We cater to various industries, including:

- Healthcare
- Financial Services
- Manufacturing
- Government

By focusing on sector-specific solutions, we ensure that our tools meet the distinct needs of every organization.

---

## Company Culture

At GitHub, we foster an environment of collaboration, innovation, and inclusivity. We believe in the power of community—our teams work in a culture that promotes open communication, continuous learning, and shared knowledge.

Our workforce reflects a range of backgrounds and experiences, creating a vibrant culture that encourages diversity and creativity. We value feedback and consider it a critical component of our growth.

---

## Careers at GitHub

GitHub is always on the lookout for passionate individuals to join our dynamic team. We offer various career paths across technology, product management, marketing, and customer support. 

### Why Work with Us?

- **Innovative Environment**: Work with cutting-edge technology and collaborate with brilliant minds.
- **Flexible Work Arrangement**: We promote a healthy work-life balance with remote work options.
- **Learning and Development**: Benefit from various learning pathways, workshops, and access to events.

Join us in shaping the future of software development!

---

## Get Started with GitHub

Whether you're looking to build software, collaborate with a team, or find a new career opportunity, GitHub is your go-to platform. 

**Try GitHub Copilot today!** 

For more information, visit our website or connect with us on social media.

---

*Discover the power of collaborative software development with GitHub—where innovation meets community.*

# **Streaming**

In [49]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True
    )

    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        # Strip a wrapping ```markdown fence for display only; the raw text is kept
        # intact so the word "markdown" in the brochure itself is not removed
        cleaned = response.replace("```markdown", "").replace("```", "")
        # Update the markdown
        update_display(Markdown(cleaned), display_id=display_handle.display_id)
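The accumulate-and-clean pattern can be tried without calling the API. The sketch below feeds a hand-written list of deltas, a hypothetical stand-in for `chunk.choices[0].delta.content`, through the same loop shape. It keeps the raw text separate from the cleaned text so that a fence split across two chunks is still matched, and it strips only the fence, not the word "markdown" in prose.

```python
def strip_markdown_fence(text):
    """Remove a wrapping ```markdown / ``` fence, leaving other text untouched."""
    return text.replace("```markdown", "").replace("```", "")


def accumulate(deltas):
    """Yield the cleaned running text after each delta; None deltas count as empty."""
    raw = ""
    for delta in deltas:
        raw += delta or ""
        yield strip_markdown_fence(raw)


# Hypothetical chunk contents; a real stream would supply these one at a time.
fake_deltas = ["```mark", "down\n# Acme\n", None, "Responds in markdown.\n", "```"]
snapshots = list(accumulate(fake_deltas))
print(snapshots[-1])
```

Each snapshot is what would be passed to `update_display` at that step. Intermediate snapshots can briefly show a partial fence (here the first one is `mark`) until the rest of it arrives.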

In [50]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog', 'url': 'https://huggingface.co/blog'}, {'type': 'community page', 'url': 'https://discuss.huggingface.co'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'GitHub page', 'url': 'https://github.com/huggingface'}, {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'}, {'type': 'LinkedIn page', 'url': 'https://www.linkedin.com/company/huggingface/'}]}


# 🎉 Welcome to Hugging Face: The AI Community Building the Future

### 🌟 About Us
Hugging Face is at the forefront of the artificial intelligence revolution. Our mission is to democratize machine learning by providing a platform that fosters collaboration among researchers, developers, and enthusiasts globally. With an emphasis on community-driven development, we offer innovative tools and resources for model creation, datasets, and applications.

### 🤝 Company Culture
At Hugging Face, we believe in the power of collaboration and inclusivity. Our culture encourages constant learning, sharing, and evolving together as a community. Whether you are developing advanced AI models or engaging with datasets, you are part of a vibrant ecosystem dedicated to pushing the boundaries of machine learning.

### 📊 Our Customers
We proudly serve over **50,000 organizations** including major enterprises such as:

- **Meta AI**
- **Amazon**
- **Google**
- **Intel**
- **Microsoft**
- **Grammarly**

Our clients benefit from advanced machine learning solutions tailored to their specific needs, which allows them to accelerate their projects effectively and responsibly.

### 👩‍💻 Careers at Hugging Face
Join us in shaping the future of AI! We are looking for passionate team members who are eager to innovate and contribute to our mission. At Hugging Face, you will collaborate with world-class professionals in an environment that promotes creativity and growth. Explore our **[Careers Page](https://huggingface.co/jobs)** to find your place within our team!

### 🔍 Explore Our Offerings
- **Models**: Access and collaborate on over **1 Million models**, including state-of-the-art solutions in text, image, video, audio, and more.
- **Datasets**: Discover and utilize more than **250,000 datasets** perfect for any machine learning task.
- **Spaces**: Engage with our dynamic applications and tools designed for real-time experimentation and collaboration.
- **Enterprise Solutions**: Tailored solutions designed for organizations, with benefits such as enhanced security, access controls, and dedicated support starting at **$20/user/month**.

#### 🚀 Get Started
Join the Hugging Face community today! Explore models, datasets, and applications that are transforming the AI landscape. **[Sign Up Now](https://huggingface.co/signup)** to start your journey!

### 💬 Connect With Us
Stay updated and be part of the conversation. Follow us on:

- [Twitter](https://twitter.com/huggingface)
- [LinkedIn](https://www.linkedin.com/company/huggingface/)
- [GitHub](https://github.com/huggingface)
- [Discord](https://discord.gg/huggingface)

---

Together, let's build a future powered by artificial intelligence!