In [1]:
import os
from typing import Optional

# 1. Welcome to Jupyter Notebooks!

This interface allows you to run Python code interactively and view the results immediately, along with any visualizations or text explanations. Each block of code or text you see is contained in what we call a "cell."

## Basic Operations

- **Running a Cell**: You can run the code or render the markdown in a cell by selecting it and pressing `Shift + Enter`, or by clicking the "Run" button in the toolbar.
- **Adding New Cells**: Add a new cell by clicking the "+" button in the toolbar.
- **Cell Types**: Cells can be code cells or markdown cells. Switch the type using the dropdown in the toolbar.


In [2]:
# Simple Python Example

# Printing a message
print("Hello, World!")

# Basic arithmetic
result = 7 * 6
print("7 multiplied by 6 is", result)

Hello, World!
7 multiplied by 6 is 42


In [3]:
# Using Variables

# Store a value in a variable
a = 10

# Use the variable in a calculation
b = a * 2

# Print the result
print("The result of a multiplied by 2 is", b)

The result of a multiplied by 2 is 20


In [4]:
# Basic Data Structures

# List: an ordered collection of items
fruits = ["apple", "banana", "cherry"]
print("Fruits List:", fruits)

# Dictionary: key-value pairs
prices = {"apple": 0.40, "banana": 0.50, "cherry": 0.30}
print("Fruit Prices:", prices)

Fruits List: ['apple', 'banana', 'cherry']
Fruit Prices: {'apple': 0.4, 'banana': 0.5, 'cherry': 0.3}


In [5]:
# Looping through a list
for fruit in fruits:
    print(fruit, "costs", prices[fruit], "each")

# Conditional: if statement
if "banana" in fruits:
    print("Yes, we have bananas!")

apple costs 0.4 each
banana costs 0.5 each
cherry costs 0.3 each
Yes, we have bananas!


### Introduction to Functions

Functions are a way to organize your code into blocks that can be called multiple times throughout your program. They allow you to write cleaner, more modular code and make your scripts easier to maintain and debug. Functions in Python are defined using the `def` keyword.


In [6]:
# Defining a Simple Function


def greet(name):
    """This function greets the person whose name is passed as a parameter"""
    return f"Hello, {name}! Welcome to our notebook."


# Calling the function
greeting = greet("Alice")
print(greeting)

Hello, Alice! Welcome to our notebook.


In [7]:
# Function with Parameters and Return Value


def calculate_area(length, width):
    """This function returns the area of a rectangle given its length and width."""
    area = length * width
    return area


# Using the function
rect_area = calculate_area(10, 5)
print("The area of the rectangle is:", rect_area)

The area of the rectangle is: 50


### Leveraging Jupyter-AI for Code Generation

Jupyter-AI is an extension for Jupyter that helps you write code more efficiently. It uses a language model to suggest code snippets, complete code blocks, and generate larger pieces of code.

#### How to Use Jupyter-AI to Write Code

1. **Initiating Code Suggestions**: Simply start typing your code or a description of the function you need in a code cell. Jupyter-AI will automatically suggest completions.
2. **Accepting Suggestions**: When a code suggestion appears, you can press `Tab` to accept it, instantly filling in the suggestion.
3. **Chat Interface**: You can also interact with Jupyter-AI using the chat interface on the left.


In [None]:
# Try using the autocomplete functionality to write a function that adds two numbers


def add_numbers(a: int, b: int) -> int:
    """Try having Jupyter-AI autocomplete this function."""
    pass


# Assert statements to check the correctness of the function
assert add_numbers(1, 2) == 3, "Function add_numbers does not work correctly!"
print("Function add_numbers works correctly!")

In [None]:
# We're not limited to simple functions. Here's a tricky function with a bug in it.
# Try pasting it into the chat bar on the left and asking the AI to fix it.


def factorial(n: int) -> int:
    """This function has a bug in it. Can you find and fix it with AI?"""
    if n == 0:
        return 1
    else:
        result = 1
        for i in range(n):
            result *= i
        return result


# Assert statements to check the correctness of the function
assert factorial(0) == 1, "The factorial of 0 should be 1"
assert factorial(1) == 1, "The factorial of 1 should be 1"
assert factorial(5) == 120, "The factorial of 5 should be 120"

### Let's get started with the case study!


# High Level Architecture

The architecture of the system is as follows:

1. We split the document into distinct “sections” and embed each section.
2. We embed the user query and find the section of the document most similar to it.
3. We feed the original question, along with the context we found, to the LLM and receive an answer.
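
To make the three steps above concrete, here is a minimal sketch of the whole flow. It is only an illustration: `toy_embed` is a made-up word-count stand-in for a real embedding model, the sample document text is invented, and all function names are hypothetical. The final step just builds the prompt rather than calling an LLM.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def toy_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def chunk_document(document: str) -> list[str]:
    """Step 1: split the document into sections (here, on blank lines)."""
    return [part.strip() for part in document.split("\n\n") if part.strip()]


def retrieve(query: str, sections: list[str]) -> str:
    """Step 2: embed the query and return the most similar section."""
    query_vector = toy_embed(query)
    return max(sections, key=lambda s: toy_similarity(query_vector, toy_embed(s)))


def build_prompt(question: str, context: str) -> str:
    """Step 3: combine the question and the retrieved context for the LLM."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Invented sample text, used only for illustration
document = (
    "The water heater should be drained once a year.\n\n"
    "Replace the furnace filter every three months.\n\n"
    "Check the roof sealant for cracks each spring."
)

sections = chunk_document(document)
question = "How often should I replace the furnace filter?"
best_section = retrieve(question, sections)
prompt = build_prompt(question, best_section)
print(prompt)
```

In a real system the word-count vectors would be replaced by model embeddings, and the prompt would be sent to the LLM.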


# 2. What exactly is an embedding?


In [32]:
from openai import OpenAI, NOT_GIVEN
import plotly.graph_objects as go

#########################
### UTILITY FUNCTIONS ###
#########################

# instantiating the OpenAI client
client = OpenAI(api_key=os.getenv("OPEN_AI_KEY"))


# wrapper function around openai to directly return embedding of text
def get_embedding(text: str, dimensions: int = NOT_GIVEN) -> list[float]:
    """Get the embedding of the input text."""
    response = client.embeddings.create(
        input=text, model="text-embedding-3-large", dimensions=dimensions
    )
    return response.data[0].embedding


# plotly requires vectors to be in a specific format to plot them
def _vector_to_plotly_format(
    vector: list[float], color: Optional[str] = "red", name: Optional[str] = None
) -> dict:
    """
    Convert a vector from array format to a dictionary format suitable for Plotly 3D plotting, including color and name customization.
    """
    assert len(vector) == 3, "Vector must only have 3 components."
    origin = [0, 0, 0]  # Starting point of the vector

    # Split the vector into its components
    x_component, y_component, z_component = vector

    # Create dictionary in the format expected by Plotly
    return {
        "x": [origin[0], x_component],
        "y": [origin[1], y_component],
        "z": [origin[2], z_component],
        "color": color,
        "name": name,
    }


# simple utility function to add a vector to a 3D plot
def add_vector_to_graph(
    fig: go.Figure, vector: list[float], color: str = "red", name: Optional[str] = None
) -> go.Figure:
    # Ensure vector has exactly three components
    assert len(vector) == 3, "Vector must have exactly 3 components."

    # Origin point
    origin = [0, 0, 0]

    # Components of the vector
    x_component, y_component, z_component = vector

    # Adding the line part of the vector
    fig.add_trace(
        go.Scatter3d(
            x=[origin[0], x_component],
            y=[origin[1], y_component],
            z=[origin[2], z_component],
            mode="lines",
            line=dict(color=color, width=5),
            name=name,
        )
    )

    # Adding the cone at the tip of the vector
    fig.add_trace(
        go.Cone(
            x=[x_component],
            y=[y_component],
            z=[z_component],
            u=[x_component],
            v=[y_component],
            w=[z_component],
            sizemode="scaled",
            sizeref=0.1,
            showscale=False,
            colorscale=[[0, color], [1, color]],
            hoverinfo="none",
        )
    )
    return fig


def create_new_graph() -> go.Figure:
    """Create a 3D plotly figure with a simple layout."""
    fig = go.Figure()

    # make sure the plot isn't rotated
    fig.update_layout(
        scene=dict(
            camera=dict(
                eye=dict(x=1.5, y=1.5, z=0.5),  # Adjust the camera position
                up=dict(x=0, y=0, z=1),  # Sets the z-axis as "up"
                center=dict(x=0, y=0, z=0),  # Focuses the camera on the origin
            ),
            aspectmode="cube",
        )
    )

    # Add a dot at the origin
    fig.add_trace(
        go.Scatter3d(
            x=[0],
            y=[0],
            z=[0],
            mode="markers",
            marker=dict(size=6, color="black", symbol="circle"),
            name="Origin",
        )
    )

    return fig

In [11]:
# Let's try using the get_embedding function
result = get_embedding("Hello, World!")
print(result)

[-0.0047318595461547375, -0.019121423363685608, 0.016391770914196968, 0.03161963075399399, -0.010759265162050724, -0.018650315701961517, -0.02487170696258545, 0.03658011555671692, -0.03450169786810875, -0.012685263529419899, -0.02100585401058197, -0.009740840643644333, -0.015047729015350342, -0.012221083976328373, 0.015657396987080574, 0.04871806129813194, -0.03782716393470764, 0.034252289682626724, -0.00617289450019598, 0.02326439879834652, 0.05631120502948761, 0.0038485329132527113, 0.013592838309705257, 0.0015951839741319418, 0.025287389755249023, 0.0024317463394254446, -0.03469568490982056, 0.03270040452480316, 0.015089296735823154, -0.050269946455955505, 0.04539259523153305, -0.031675051897764206, -0.016447195783257484, 0.009726985357701778, 0.0022983811795711517, 0.01057913526892662, -0.0001432158169336617, -0.009900186210870743, 0.018497899174690247, -0.027850769460201263, 0.028931545093655586, -0.012809967622160912, 0.01282382383942604, 0.017444834113121033, -0.0324509963393211

That's a lot of numbers! OpenAI's embedding models support built-in dimensionality reduction. Let's try using that and visualizing the result.
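
For intuition about what reducing the dimensions can look like, here is a rough sketch that shortens a vector by hand: keep the leading components, then rescale the result to unit length. `shorten_embedding` is a hypothetical helper, not part of the OpenAI SDK, and the input vector is made up. Whether the API's `dimensions` parameter produces exactly the same values is an assumption, so treat this as a mental model only.

```python
import math


def shorten_embedding(vector: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components and rescale to unit length."""
    truncated = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        # An all-zero vector cannot be normalized; return it unchanged
        return truncated
    return [x / norm for x in truncated]


# Made-up 5-dimensional vector, for illustration only
full_vector = [3.0, 4.0, 12.0, 0.5, -0.2]
short_vector = shorten_embedding(full_vector, dimensions=3)
print(short_vector)
```

The shortened vector has exactly three components and length 1, which is what lets us draw it in a 3D plot.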


In [33]:
graph = create_new_graph()

text = "Atlanta"

# Get the embedding of the text
vector = get_embedding(text, dimensions=3)

# Add the vector to the plot
add_vector_to_graph(graph, vector, name=text)

# Show the plot
graph.show()

Let's try plotting a couple of vectors at once to see if we can spot any patterns.

In [34]:
graph = create_new_graph()

text = "Atlanta"
atlanta_vector = get_embedding(text, dimensions=3)
add_vector_to_graph(graph, atlanta_vector, name=text, color="purple")

text = "Georgia, USA"
georgia_vector = get_embedding(text, dimensions=3)
add_vector_to_graph(graph, georgia_vector, name=text, color="blue")

text = "Skiing in japan"
ski_vector = get_embedding(text, dimensions=3)
add_vector_to_graph(graph, ski_vector, name=text, color="red")

# Show the plot
graph.show()

How can we quantify the similarity between two vectors? One common way is to use the cosine similarity. The cosine similarity between two vectors is the cosine of the angle between them. It ranges from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonality.

In [35]:
import numpy as np


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# We can use the cosine similarity to compare the similarity between two vectors
similarity = cosine_similarity(atlanta_vector, georgia_vector)
print(f"The similarity between 'Atlanta' and 'Georgia, USA' is {similarity:.2f}")

similarity = cosine_similarity(atlanta_vector, ski_vector)
print(f"The similarity between 'Atlanta' and 'Skiing in Japan' is {similarity:.2f}")

The similarity between 'Atlanta' and 'Georgia, USA' is -0.07
The similarity between 'Atlanta' and 'Skiing in Japan' is -0.82
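
To check the -1 to 1 range described above without calling the API, we can try a few hand-picked vectors where the answer is known in advance. This is a small sketch in plain Python; `cosine` is a separate, hypothetical helper so it does not overwrite the `cosine_similarity` function defined in the notebook.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


east = [1.0, 0.0, 0.0]
also_east = [2.0, 0.0, 0.0]  # same direction, different length
north = [0.0, 1.0, 0.0]  # at a right angle to east
west = [-1.0, 0.0, 0.0]  # opposite direction

print(cosine(east, also_east))  # 1.0
print(cosine(east, north))  # 0.0
print(cosine(east, west))  # -1.0
```

Note that `also_east` is twice as long as `east` but still scores 1.0: cosine similarity only looks at direction, not length.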


### Advanced Challenges (Optional)

#### 1. Sentence Embeddings

How does adding words to a sentence affect the embedding vector? Try creating a for loop that adds a word to the text and plots the resulting embedding vector.


#### 2. Embedding Dimensionality
Let's see how the cosine similarity changes as we change the number of dimensions. 

How does increasing the number of dimensions affect how well the model can capture relationships between complex concepts?

In [41]:
# Try it out (implemented solutions can be found in the solutions.ipynb notebook)

# 3. Parsing Documents

Large language models are currently optimized primarily for working with text. As a result, when dealing with documents like PDFs, we first need to convert them into a text format before we can feed them into the model.

We maintain a popular open source library for doing this called [openparse](https://github.com/Filimoa/open-parse/). It is simple and easy to use.


In [None]:
import openparse

basic_doc_path = "./sample-docs/mobile-home-manual.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

for node in parsed_basic_doc.nodes:
    display(node)