# Text Summarization Using OpenAI's GPT-4

## Importing Libraries

In [None]:
import os
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display

## Connecting to OpenAI
load_dotenv(override=True):
With override=True, a value from the .env file replaces any environment variable of the same name that is already set in the system. Without this flag, existing environment variables remain unchanged.

os.getenv('OPENAI_API_KEY'):
The line api_key = os.getenv('OPENAI_API_KEY') fetches the value of the OPENAI_API_KEY environment variable (which should have been loaded from the .env file). If the variable is not set, os.getenv returns None.
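
The effect of the override flag can be illustrated with plain dictionaries. This is a minimal sketch of the merge rule only, not python-dotenv's actual implementation; the helper merge_env and the sample values are hypothetical.

```python
def merge_env(current, from_file, override):
    """Hypothetical helper: mimic how .env values are merged into the environment."""
    merged = dict(current)
    for name, value in from_file.items():
        # With override=False, a name that already exists keeps its old value.
        if override or name not in merged:
            merged[name] = value
    return merged


# Sample values, invented for illustration.
system_env = {"OPENAI_API_KEY": "sk-proj-old"}
dotenv_file = {"OPENAI_API_KEY": "sk-proj-new", "OTHER_SETTING": "1"}

kept = merge_env(system_env, dotenv_file, override=False)
replaced = merge_env(system_env, dotenv_file, override=True)

print(kept["OPENAI_API_KEY"])      # sk-proj-old
print(replaced["OPENAI_API_KEY"])  # sk-proj-new
```

In both cases a variable that exists only in the file (OTHER_SETTING here) is added; the flag only matters for names that are already set.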

In [None]:
# Load environment variables from a file called .env
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
    print("No API key was found")
elif api_key.strip() != api_key:
    # Check whitespace first: a leading space would also fail the prefix test below
    print("An API key was found, but it looks like it might have space or tab characters at the start or end")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start with sk-proj-")
else:
    print("API key found!")
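
The same checks can be wrapped in a function so they are easy to try against sample strings. This is a sketch under assumptions: the helper name describe_api_key and the sample keys are hypothetical, and the sk-proj- prefix is simply the convention this notebook checks for.

```python
def describe_api_key(api_key):
    """Hypothetical helper: return a status message for an API key string."""
    if not api_key:
        return "No API key was found"
    # Whitespace is checked before the prefix so that a key with a stray
    # leading space gets the more specific message.
    if api_key.strip() != api_key:
        return "An API key was found, but it has whitespace at the start or end"
    if not api_key.startswith("sk-proj-"):
        return "An API key was found, but it doesn't start with sk-proj-"
    return "API key found!"


print(describe_api_key(None))
print(describe_api_key(" sk-proj-abc123"))
print(describe_api_key("abc123"))
print(describe_api_key("sk-proj-abc123"))
```

Returning the message instead of printing it keeps the check separate from how the result is shown.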

## Creating an instance of the OpenAI class

In [None]:
from openai import OpenAI
openai = OpenAI()

## Fetching, Parsing, and Extracting Useful Information from a Webpage URL

headers = {"User-Agent": "..."}:

Setting this header makes the request look like it comes from a modern browser, which helps avoid being blocked or being served altered content.

def __init__(self, url):

The constructor (__init__) is responsible for initializing a new instance of the Website class. It takes one parameter, url, which is the address of the webpage to be processed.

NOTE: In Python, the constructor is a special method that initializes a new instance of a class. It is named __init__ and runs automatically when an object is created. A class that does not define its own __init__ inherits one from its parent class.


response = requests.get(url, headers=headers)

Explanation:

- The requests.get() function sends an HTTP GET request to the specified URL.
- The custom headers are passed along to mimic a real browser.
- The response from the website, which includes the HTML content, is stored in the variable response.
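
To see how custom headers attach to a GET request without touching the network, the request can be built but not sent. This sketch uses the standard library's urllib.request instead of requests, so it illustrates the idea and is not the notebook's own method; the URL and the shortened User-Agent string are placeholders.

```python
from urllib.request import Request

# Placeholder values, chosen for illustration only.
example_headers = {"User-Agent": "Mozilla/5.0 (example browser string)"}
request = Request("https://example.com", headers=example_headers)

# No data is attached, so the request method is GET.
print(request.get_method())

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Nothing is transmitted here; the object only describes what would be sent.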


soup = BeautifulSoup(response.content, 'html.parser')

Explanation:

What is parsing? Parsing converts raw text, here the page's HTML, into a structured representation that a program can navigate.

- The HTML content (response.content) is parsed using BeautifulSoup, a popular library for parsing HTML and XML documents.
- The 'html.parser' argument tells BeautifulSoup to use Python's built-in HTML parser.
- The resulting parsed document is stored in the variable soup.

self.title = soup.title.string if soup.title else "No title found"

Explanation:

- The code checks if the parsed HTML (soup) contains a <title> tag.
- If it exists, soup.title.string extracts the text within the <title> tag.
- If there is no <title> tag, it defaults to "No title found".
- The resulting title is stored in the instance attribute self.title.
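
The title lookup with its fallback can be tried on inline HTML strings. The sketch below uses Python's built-in html.parser module directly instead of BeautifulSoup, so it shows the parsing idea and is not the notebook's code; TitleParser and extract_title are hypothetical names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: records the text inside the first <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and self.title is None:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self.in_title:
            self.title = (self.title or "") + data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title if parser.title else "No title found"


print(extract_title("<html><head><title>Example Page</title></head></html>"))
print(extract_title("<html><body><p>No head here</p></body></html>"))
```

The parser reacts to tags as it meets them, which is the event-driven counterpart of navigating the tree that BeautifulSoup builds.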



In [None]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    def __init__(self, url):
        """
        Create this Website object from the given url using the BeautifulSoup library
        """
        self.url = url
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"

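        # Remove elements that carry no readable text
        # (this assumes the page has a <body>; soup.body is None otherwise)
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()

        # Extract the cleaned text, one line per text fragment
        self.text = soup.body.get_text(separator="\n", strip=True)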
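
The cleaning step, which drops script and style content and keeps the readable text, can be sketched on an inline HTML string. This version uses the standard library's html.parser rather than BeautifulSoup's decompose() and get_text(), so it is an approximation of the same idea; VisibleTextParser and visible_text are hypothetical names.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Hypothetical helper: collects text that is not inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty fragments that sit outside skipped elements.
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<body><h1>Title</h1><script>var x = 1;</script>"
    "<p>Hello <b>world</b></p><style>p { color: red; }</style></body>"
)
print(visible_text(sample))
```

Images and inputs need no special handling here because they contain no text of their own.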

## Defining a website

In [None]:
# Replace the placeholder with a full URL, including the https:// prefix
ed = Website("your_desired_website")
print(ed.title)
print(ed.text)

## Types of prompts

Models like GPT-4o have been trained to receive instructions in a particular way.

They expect to receive:

- A system prompt that tells them what task they are performing and what tone they should use
- A user prompt: the conversation starter that they should reply to

## Defining the system prompt

In [None]:
system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

## Defining the user prompt

In [None]:
def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled: {website.title}"
    user_prompt += "\nThe contents of this website are as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

In [None]:
print(user_prompt_for(ed))

## Messages
The API from OpenAI expects to receive messages in a particular structure. Many of the other APIs share this structure:

[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]

The next two cells build this structure for a website; the call to the API itself comes afterwards.

## Defining the messages for the model (GPT-4o-mini)

In [None]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

In [None]:
messages_for(ed)
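
To inspect the message list without fetching a real page, a stand-in object with title and text attributes is enough. The sketch below repeats the notebook's prompt functions so it runs on its own; fake_site and its contents are hypothetical.

```python
from types import SimpleNamespace

system_prompt = (
    "You are an assistant that analyzes the contents of a website "
    "and provides a short summary, ignoring text that might be navigation related. "
    "Respond in markdown."
)


def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled: {website.title}"
    user_prompt += (
        "\nThe contents of this website are as follows; "
        "please provide a short summary of this website in markdown. "
        "If it includes news or announcements, then summarize these too.\n\n"
    )
    user_prompt += website.text
    return user_prompt


def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)},
    ]


# Stand-in for a Website instance; only the two attributes used above are needed.
fake_site = SimpleNamespace(title="Example Domain", text="This page is a placeholder.")
messages = messages_for(fake_site)

for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Because the functions only read title and text, any object exposing those two attributes works in place of Website.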

## Calling the OpenAI API

In [None]:
def summarize(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages_for(website)
    )
    return response.choices[0].message.content
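
The path response.choices[0].message.content can be explored without an API key by passing in a stub client. This is a sketch under assumptions: make_stub_client and summarize_messages are hypothetical, and the stub mimics only the attributes that this notebook reads, not the full response object.

```python
from types import SimpleNamespace


def make_stub_client(reply_text):
    """Hypothetical stand-in for the OpenAI client; always returns a canned reply."""

    def create(model, messages):
        message = SimpleNamespace(role="assistant", content=reply_text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


def summarize_messages(client, messages, model="gpt-4o-mini"):
    # Same call shape as in summarize(), with the client passed in explicitly.
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


stub = make_stub_client("## Summary\nA placeholder summary.")
sample_messages = [
    {"role": "system", "content": "You summarize websites."},
    {"role": "user", "content": "Summarize this page."},
]
result = summarize_messages(stub, sample_messages)
print(result)
```

Passing the client as an argument makes it possible to swap the real OpenAI instance for the stub while testing the surrounding code.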

In [None]:
summarize("your_desired_website")

## Displaying the summary as Markdown

In [None]:
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [None]:
display_summary("your_desired_website")