# End of week 1 exercise

To demonstrate your familiarity with the OpenAI API and with Ollama, build a tool that takes a technical question
and responds with an explanation. This is a tool that you will be able to use yourself during the course!

In [21]:
# imports
import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI
import ollama

In [11]:
# constants

MODEL_GPT = 'gpt-4o-mini'
MODEL_LLAMA = 'llama3.2'

In [12]:
# set up environment
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key? Please visit the troubleshooting notebook!")

openai = OpenAI()

API key looks good so far


In [13]:
# System prompt with few-shot examples; the question itself is set in the `user_prompt` cell

# Initial system prompt
system_prompt = """You are provided with a code snippet and a question about this code. 
You are a tutor who explains the code and answers the question as prompted.
Respond clearly and concisely, using examples to enhance understanding.
"""

# Questions and answers as objects in an array
examples = [
    {
        "question": """
Please explain what this code does and why:
yield from {book.get("author") for book in books if book.get("author")}
""",
        "answer": """
This code snippet uses `yield from` with a set comprehension to yield the unique authors from a list of dictionaries called `books`. It must appear inside a generator function.
1. `{book.get("author") for book in books if book.get("author")}`: Builds a set of authors by iterating over `books` and taking the value of the "author" key, skipping books where it is missing or empty. Because the result is a set, duplicates are removed.
2. `yield from`: Yields the elements of that set one by one, in no guaranteed order.
"""
    },
    {
        "question": """
What does this code do, and why is it written this way?:
squared = {x: x**2 for x in range(5)}
""",
        "answer": """
This code creates a dictionary using a dictionary comprehension. The dictionary maps each number in the range `[0, 4]` to its square.
1. `range(5)`: Generates numbers from 0 to 4.
2. `{x: x**2}`: For each `x`, the dictionary key is `x` and the value is `x**2`.
The purpose is to efficiently compute and store the squares of numbers in a single step.
"""
    },
    {
        "question": """
What does this code do in the context of NLP?
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Hello, how are you?", return_tensors="pt")
""",
        "answer": """
This code uses the `transformers` library to tokenize a text input for NLP tasks:
1. `AutoTokenizer.from_pretrained("bert-base-uncased")`: Loads a pre-trained BERT tokenizer.
2. `tokenizer("Hello, how are you?", return_tensors="pt")`: Converts the input text into token IDs and returns the output as PyTorch tensors (`"pt"`).
The output is ready for further processing in a deep learning model.
"""
    },
    {
        "question": """
How does this bubble sort implementation work, and what are its limitations?
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n-i-1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
""",
        "answer": """
This code implements the bubble sort algorithm:
1. It iterates through the array multiple times (`for i in range(n)`) and compares adjacent elements.
2. If an element is greater than the next, they are swapped.
3. The largest elements "bubble" to the end of the array with each iteration.

**Limitations**:
- Time complexity: O(n^2) in the worst case, making it inefficient for large datasets.
- There is no early exit: it performs every pass even if the array is already sorted.
"""
    }
]

# Add examples to the system prompt
for i, example in enumerate(examples, start=1):
    system_prompt += f"\n### Example {i}:\n"
    system_prompt += f"Question:\n{example['question']}\n"
    system_prompt += f"Answer:\n{example['answer']}\n"
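
The loop above folds every example into the system prompt as text. A common alternative is to pass few-shot examples as alternating `user` and `assistant` turns. The sketch below shows one way this might look; `build_few_shot_messages` and the tiny `demo_examples` list are hypothetical names made up for illustration, not part of the exercise.

```python
from typing import Dict, List


def build_few_shot_messages(
    system_prompt: str, examples: List[Dict[str, str]], question: str
) -> List[Dict[str, str]]:
    """Return chat messages with each example as a user/assistant pair."""
    messages = [{"role": "system", "content": system_prompt.strip()}]
    for example in examples:
        messages.append({"role": "user", "content": example["question"].strip()})
        messages.append({"role": "assistant", "content": example["answer"].strip()})
    # The real question always goes last, as the final user turn
    messages.append({"role": "user", "content": question.strip()})
    return messages


# Hypothetical miniature data, only to show the resulting shape
demo_examples = [
    {"question": "What does `x = [1, 2][0]` do?", "answer": "It assigns 1 to `x`."},
]
demo_messages = build_few_shot_messages(
    "You are a tutor who explains code.",
    demo_examples,
    "What does `len('abc')` return?",
)
for message in demo_messages:
    print(message["role"], "->", message["content"])
```

Either layout can work. Separate turns keep the system prompt short and show the model the exact answer format in context, but it is worth trying both with your own questions.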

In [14]:
# Print the resulting system prompt
print(system_prompt)

You are provided with a code snippet and a question about this code. 
You are a tutor who explains the code and answers the question as prompted.
Respond clearly and concisely, using examples to enhance understanding.

### Example 1:
Question:

Please explain what this code does and why:
yield from {book.get("author") for book in books if book.get("author")}

Answer:

This code snippet uses `yield from` with a set comprehension to yield the unique authors from a list of dictionaries called `books`. It must appear inside a generator function.
1. `{book.get("author") for book in books if book.get("author")}`: Builds a set of authors by iterating over `books` and taking the value of the "author" key, skipping books where it is missing or empty. Because the result is a set, duplicates are removed.
2. `yield from`: Yields the elements of that set one by one, in no guaranteed order.


### Example 2:
Question:

What does this code do, and why is it written this way?:
squared = {x: x**2 for x in range(5)}

Answer:

This code creates a dictionary using a dictionary compr

In [15]:
# The question to ask; a raw string keeps escapes such as \n in the pasted code intact
user_prompt = r"""
Can you tell me what this code snippet does and why?

headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:

    def __init__(self, url):
        
        # Create this Website object from the given url using the BeautifulSoup library
        
        self.url = url
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()
        self.text = soup.body.get_text(separator="\n", strip=True)
"""


In [16]:
def get_messages_for():
    # Build the chat messages from the module-level system and user prompts
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
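
Because `user_prompt` holds pasted code inside a triple-quoted string, it helps to know how Python treats escape sequences there. In a regular string, `\n` becomes a real line break before the model ever sees it, while a raw string (prefix `r`) keeps the backslash. The check below uses a hypothetical one-line snippet to show the difference.

```python
# A hypothetical one-line snippet, pasted into a prompt two different ways
plain_prompt = """
text = soup.get_text(separator="\n")
"""
raw_prompt = r"""
text = soup.get_text(separator="\n")
"""

# The regular string has lost the backslash and gained an extra line break
print("backslash in plain prompt:", "\\" in plain_prompt)
print("backslash in raw prompt:  ", "\\" in raw_prompt)
print("line breaks (plain, raw): ", plain_prompt.count("\n"), raw_prompt.count("\n"))
```

If the code you paste contains backslashes, a raw string is usually the safer choice for the prompt.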

In [17]:
# Get gpt-4o-mini to answer, with streaming

def get_answer_with_stream():
    stream = openai.chat.completions.create(
        model=MODEL_GPT,
        messages=get_messages_for(),
        stream=True,
    )
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        # Each chunk carries a small piece of the answer; its content can be None
        response += chunk.choices[0].delta.content or ""
        # Note: this removes every code fence and every occurrence of the word "markdown"
        response = response.replace("```", "").replace("markdown", "")
        update_display(Markdown(response), display_id=display_handle.display_id)
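
The `replace` calls in the loop remove every code fence and every occurrence of the word "markdown", including ones that belong to the answer itself. If the goal is only to drop a markdown fence wrapped around the whole reply, a narrower helper may be enough. This is a minimal sketch under that assumption; `strip_markdown_wrapper`, `accumulate`, and the `demo_deltas` list are hypothetical names, and the deltas stand in for what a real stream would deliver.

```python
FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed
WRAPPER_OPENERS = (FENCE + "markdown", FENCE + "md")


def strip_markdown_wrapper(text: str) -> str:
    """Drop one wrapping markdown fence; leave inner code fences untouched."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() not in WRAPPER_OPENERS:
        return text
    body = lines[1:]
    # An odd number of fence lines in the body means the last one closes the wrapper
    fence_lines = sum(1 for line in body if line.lstrip().startswith(FENCE))
    if body and body[-1].strip() == FENCE and fence_lines % 2 == 1:
        body = body[:-1]
    return "\n".join(body)


def accumulate(deltas):
    """Yield the cleaned, cumulative text after each streamed delta (None allowed)."""
    response = ""
    for delta in deltas:
        response += delta or ""
        yield strip_markdown_wrapper(response)


# Stand-in for streamed deltas; a real stream would supply chunk.choices[0].delta.content
demo_deltas = [FENCE + "markdown\n", "Use a loop:\n", FENCE + "python\n", None,
               "for x in xs: print(x)\n", FENCE + "\n", FENCE]
demo_updates = list(accumulate(demo_deltas))
print(demo_updates[-1])
```

In the notebook, each value yielded by `accumulate` could be passed to `update_display(Markdown(...))` in place of the raw `response`, so code blocks inside the answer keep their fences.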

In [18]:
get_answer_with_stream()

This code snippet defines a class `Website` that retrieves and processes data from a specified URL using the BeautifulSoup library. Here’s a breakdown of what it does and why it’s written this way:

1. **Headers Dictionary**:
   python
   headers = {
       "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
   }
   
   - This dictionary simulates a web browser request by specifying a `User-Agent`. This helps prevent the request from being blocked by some websites that restrict web scrapers.

2. **Website Class Initialization**:
   python
   class Website:

       def __init__(self, url):
   
   - The `Website` class is initialized with a URL. When an instance of this class is created, it fetches the web page content of the given URL.

3. **Fetching URL Content**:
   python
   response = requests.get(url, headers=headers)
   
   - This line performs a GET request to the URL using the `requests` library, sending the custom headers defined earlier. It retrieves the page's content.

4. **Parsing HTML with BeautifulSoup**:
   python
   soup = BeautifulSoup(response.content, 'html.parser')
   
   - The HTML content of the page is parsed using BeautifulSoup, allowing for easy navigation and manipulation of the HTML structure.

5. **Extracting the Title**:
   python
   self.title = soup.title.string if soup.title else "No title found"
   
   - It attempts to retrieve the text of the page title by accessing `soup.title.string`. If the title is not present, it defaults to "No title found".

6. **Cleaning Up Irrelevant Tags**:
   python
   for irrelevant in soup.body(["script", "style", "img", "input"]):
       irrelevant.decompose()
   
   - This loop removes irrelevant tags (such as scripts, styles, images, and input elements) from the body of the document. `decompose()` effectively removes these elements from the parsed structure.

7. **Extracting Text Content**:
   python
   self.text = soup.body.get_text(separator="\n", strip=True)
   
   - Finally, it extracts the clean text content from the body, using a newline as a separator and stripping extra whitespace. The result is stored in the `self.text` attribute.

**Purpose**:
The main purpose of this code is to create a `Website` object that encapsulates the URL, its title, and the text content, while stripping out unnecessary HTML elements. This makes it suitable for web scraping tasks, text analysis, or any application that requires clean text from web pages.

In [22]:
# Get Llama 3.2 to answer

def get_answer_with_llama():
    response = ollama.chat(model=MODEL_LLAMA, messages=get_messages_for())
    return response['message']['content']
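
The function above waits for the complete reply. `ollama.chat` also accepts `stream=True`, in which case it yields chunks whose text sits under `chunk['message']['content']`. The sketch below builds on that; `stream_llama_answer`, its `chat_fn` parameter, and `fake_chat` are hypothetical additions so the example can run without an Ollama server.

```python
def stream_llama_answer(messages, model="llama3.2", chat_fn=None):
    """Yield the cumulative answer text after each streamed chunk."""
    if chat_fn is None:
        # Imported lazily so the sketch can be tried without Ollama installed
        import ollama
        chat_fn = ollama.chat
    response = ""
    for chunk in chat_fn(model=model, messages=messages, stream=True):
        response += chunk["message"]["content"] or ""
        yield response


def fake_chat(model, messages, stream):
    """Stand-in for ollama.chat that yields three fixed chunks."""
    for piece in ["Hel", "lo", "!"]:
        yield {"message": {"content": piece}}


demo_updates = list(
    stream_llama_answer([{"role": "user", "content": "Say hello"}], chat_fn=fake_chat)
)
print(demo_updates)
```

In the notebook you would leave out `chat_fn` and pass each yielded value to `update_display(Markdown(...))`, mirroring the streaming loop used for the OpenAI model.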

In [26]:
display(Markdown(get_answer_with_llama()))

**Code Explanation**

This code snippet is used to scrape information from a given website. Here's what it does:

1. **User Agent Header**: The `headers` dictionary sets the User-Agent header of the HTTP request made to the website, which simulates a browser making a request.

2. **Website Class**: A class named `Website` is defined with an initializer (`__init__`) method that takes a URL as input.

3. **Website Initialization**:
    - The URL is stored in the object's `url` attribute.
    - An HTTP GET request is sent to the website using the `requests.get()` function, passing the URL and the `headers` dictionary to set the User-Agent header.
    - The server response is parsed using the `BeautifulSoup` library, which creates a parse tree from the HTML content of the webpage.

4. **Parsing the Webpage**:
    - The title of the webpage is extracted and stored in the object's `title` attribute. If no title exists, it defaults to "No title found".
    - Irrelevant elements like scripts, styles, images, inputs, etc., are removed from the HTML content by calling their `decompose()` method.
    - The text content of the webpage is extracted and stored in the object's `text` attribute. The `get_text()` method is used with an optional separator (`" "`) to split the text into lines, and `strip=True` to remove leading/trailing whitespace.

**Purpose**

The purpose of this code snippet is to create a simple web scraper that can extract basic information from a webpage, such as its title and text content. The User-Agent header is set to simulate a browser making a request, which may help avoid certain types of anti-scraping measures some websites employ.

This code can be used in various applications like web data mining, social media monitoring, or even simple web testing tools. However, it's essential to note that web scraping should always be done responsibly and with the permission of the website owner, as many sites prohibit scraping in their terms of service.