# Lab04: Tokens, Embeddings, and Distance

Start by copying this lab notebook into your notebook folder, and run it step by step from there.

What this lab covers:

1. **Imports and API Key Setup**: Imports the required libraries and loads the OpenAI API key from a `.env` file.
2. **Helper Functions**: 
   - `get_tokens(text)`: Encodes the input text into token ids with `tiktoken`.
   - `get_embeddings(text)`: Fetches the embedding vector for the input text.
   - `display_tokens_and_embeddings(text)`: Displays the tokens, the number of tokens, and a preview of the embedding values.
3. **Example Texts for LeBron James**: Breaks down and displays tokens and embeddings for multiple text examples related to LeBron James.
4. **Example Texts for Michael Jordan**: Does the same for Michael Jordan.
5. **Distance Metrics**: Provides a brief explanation of different distance metrics.
6. **Cosine Similarity Calculations**: 
   - Compares all LeBron James examples to each other.
   - Compares all Michael Jordan examples to each other.
   - Aggregates embeddings for LeBron James and Michael Jordan, then compares the aggregated vectors.

In [None]:
# Import necessary libraries
from dotenv import dotenv_values
import openai
from openai import OpenAI
import numpy as np
import re
import tiktoken # see 

# Set up your OpenAI API key
print(f'Using openai {openai.__version__}')
config = dotenv_values()
openai_api_key = config['OPENAI_API_KEY']

In [None]:
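# Helper functions to get tokens and embeddings
client = OpenAI(
    # Pass the key explicitly: dotenv_values() returns a dict and does not
    # export OPENAI_API_KEY to the environment, so it cannot be omitted here
    api_key=openai_api_key,
)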

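# encoder choices are ['gpt2', 'r50k_base', 'p50k_base', 'p50k_edit', 'cl100k_base', 'o200k_base']
# cl100k_base is the encoding used by text-embedding-ada-002, so the tokens shown
# match what the embedding model below actually receives
encoder = tiktoken.get_encoding("cl100k_base")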
def get_tokens(text):
    return encoder.encode(text)

def get_embeddings(text):
    response = client.embeddings.create(input=text, model="text-embedding-ada-002")
    return response.data[0].embedding

def display_tokens_and_embeddings(text):
    tokens = get_tokens(text)
    embedding = get_embeddings(text)
    print(f"\n--- Example for: '{text}' ---")
    print("Text:", text)
    print("\nTokens (id: decoded text):")
    for token in tokens:
        # Decode each token id back to the piece of text it represents
        print(f"{token}: {encoder.decode([token])!r}", end=" | ")

    print(f"\n\nNumber of Tokens: {len(tokens)}")
    # The API returns ONE embedding vector for the whole input text,
    # not one vector per token, so show only the start of that vector
    print("\nEmbedding (first 5 dimensions):")
    print(embedding[:5])

    print(f"\nEmbedding length: {len(embedding)}")
    return np.array(embedding)

## LeBron James Examples

This section uses text about LeBron James's career to illustrate tokens and embeddings.

### Tokens

**Definition:**
- Tokens are the smallest units of text that the model processes. They can be characters, words, or subwords, depending on the language model's tokenizer.

#### Key Points:

- **Role in Text Processing:** Tokens are the primary medium through which text is broken down and processed by language models. When you input text into an OpenAI model, the text is first split into tokens.
- **Context Window:** Each model has a fixed limit on the number of tokens it can handle in a single request, known as the context window.
- **Cost:** Usage of the API is typically measured in tokens. You are charged based on the number of tokens processed in both your input and the generated output.
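
The context-window and cost points above come down to simple arithmetic on token counts. The sketch below is illustrative only: the function name, the context window size, and the price are hypothetical, and a single flat price per 1,000 tokens is a simplification (look up the real limits and pricing for the model you use).

```python
def estimate_request(prompt_tokens, completion_tokens, context_window, price_per_1k_tokens):
    """Return (fits, cost) for one request.

    All four arguments are illustrative values supplied by the caller;
    none of them are looked up from a real model or price list.
    """
    total_tokens = prompt_tokens + completion_tokens
    fits = total_tokens <= context_window           # does the request fit in the window?
    cost = total_tokens / 1000 * price_per_1k_tokens
    return fits, cost


# Hypothetical numbers, chosen only to make the arithmetic easy to follow
fits, cost = estimate_request(
    prompt_tokens=1500,
    completion_tokens=500,
    context_window=8000,
    price_per_1k_tokens=0.5,
)
print(f"fits in context window: {fits}, estimated cost: {cost:.2f}")
```

In this lab, `len(get_tokens(text))` gives the token count you would plug in for an input text.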

### Embeddings

**Definition:**
- Embeddings are numerical representations of tokens, words, or longer pieces of text in a continuous vector space. These vectors capture semantic meanings and relationships in a high-dimensional space.

#### Key Points:

- **Role in Understanding Text:** While tokens are the basic units the model reads, embeddings are the way the model understands and represents these tokens internally. Each token is converted into an embedding, which is a fixed-size vector that the model uses for computations.
- **Semantic Information:** Embeddings capture the semantic meaning of the words or tokens. For example, the words "king" and "queen" will have embeddings that are close together in the vector space, indicating their related meanings.
- **Use Cases:** Embeddings are used in various downstream tasks like clustering, similarity searches, and as features in other machine learning models. OpenAI provides APIs specifically for generating embeddings, like the `text-embedding-ada-002` model.
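- **Dimensionality:** Embeddings are fixed-dimensional vectors irrespective of the length of the original text. For example, a classic word embedding might be a 300-dimensional vector, while `text-embedding-ada-002` returns a single 1536-dimensional vector for the whole input text, whether that text is two words or a full paragraph.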
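
To make "close together in the vector space" concrete, here is a toy sketch using the "king" and "queen" example. The three-dimensional vectors are made up for illustration and do not come from any model; real embeddings have hundreds or thousands of dimensions.

```python
import math

# Made-up 3-dimensional vectors, NOT real model output
toy_embeddings = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

# Every vector has the same number of dimensions, regardless of the word
dimensions = {len(vector) for vector in toy_embeddings.values()}

# math.dist computes the straight-line (Euclidean) distance between two points
king_queen = math.dist(toy_embeddings["king"], toy_embeddings["queen"])
king_banana = math.dist(toy_embeddings["king"], toy_embeddings["banana"])

print(f"dimensions: {dimensions}")
print(f"distance king-queen:  {king_queen:.3f}")
print(f"distance king-banana: {king_banana:.3f}")
```

The related words end up much closer to each other than the unrelated pair, which is the property the rest of this lab measures on real embeddings.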

In [None]:
text1 = "LeBron James"
text2 = "LeBron James is a professional basketball player."
lebron_summary = """
LeBron James, born on December 30, 1984, in Akron, Ohio, is an American professional basketball player. 
He is often considered one of the greatest basketball players in the history of the NBA. James was drafted as 
the first overall pick by the Cleveland Cavaliers in the 2003 NBA draft. Over his career, he has also played for 
the Miami Heat and the Los Angeles Lakers, and is best known for leading his teams to multiple NBA championships. 
LeBron has won four NBA MVP Awards, two Olympic gold medals, and has numerous records and accolades.
"""

lebron1_embeddings = display_tokens_and_embeddings(text1)

In [None]:
lebron2_embeddings = display_tokens_and_embeddings(text2)


In [None]:
lebron_summary_embeddings = display_tokens_and_embeddings(lebron_summary)



## Michael Jordan Examples

This section uses text about Michael Jordan's career to illustrate tokens and embeddings.

In [None]:
mj_text1 = "Michael Jordan"
mj_text2 = "Michael Jordan is a retired professional basketball player."
mj_summary = """
Michael Jordan, born on February 17, 1963, in Brooklyn, New York, is widely regarded as the greatest basketball player of all time. 
He played 15 seasons in the NBA, winning six championships with the Chicago Bulls. Jordan is a five-time NBA MVP and 
a 14-time NBA All-Star. He excelled in both offense and defense, becoming a global cultural icon and a symbol of excellence 
and determination in the sport.
"""

mj1_embeddings = display_tokens_and_embeddings(mj_text1)

In [None]:
mj2_embeddings = display_tokens_and_embeddings(mj_text2)

In [None]:
mj_summary_embeddings = display_tokens_and_embeddings(mj_summary)

## Understanding Distance Metrics

Distance metrics help us quantify how similar or different two embeddings are. Here are a few common ones:

### Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors. It ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction) and ignores the vectors' magnitudes.

### Euclidean Distance
Euclidean distance calculates the straight-line (geometric) distance between two points in a multidimensional space. Smaller values indicate more similarity.

### Manhattan Distance
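Manhattan distance sums the absolute differences of the corresponding components of the vectors. Because the differences are not squared, a large difference in a single dimension influences it less than it influences Euclidean distance. Smaller values indicate more similarity.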

### Dot Product
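The dot product multiplies corresponding components and sums the results, giving a single number. Higher values indicate more similarity, but the value also grows with the vectors' magnitudes; for unit-length vectors it equals the cosine similarity.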
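
As a rough illustration of how these metrics differ, the sketch below computes Euclidean distance, Manhattan distance, and the dot product for one pair of small, made-up vectors. It uses only the standard library; the `normalize` helper is a hypothetical name introduced here, not part of the lab code.

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

# Euclidean: square root of the summed squared differences
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
# Manhattan: sum of absolute differences
manhattan = sum(abs(x - y) for x, y in zip(a, b))
# Dot product: sum of component-wise products
dot = sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length (hypothetical helper for this example)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


# After normalizing, the dot product no longer depends on magnitude
unit_dot = sum(x * y for x, y in zip(normalize(a), normalize(b)))

print(f"euclidean = {euclidean:.4f}")
print(f"manhattan = {manhattan:.4f}")
print(f"dot       = {dot:.4f}")
print(f"unit dot  = {unit_dot:.4f}")
```

Although `a` and `b` point in exactly the same direction, both distances are non-zero and the raw dot product is large, because all three depend on magnitude. Only the dot product of the normalized vectors reflects direction alone.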

For this notebook, we will focus on Cosine Similarity.

## Cosine Similarity


Cosine similarity is a measure used to determine how similar two vectors are, irrespective of their magnitude.
It calculates the cosine of the angle between two vectors in an inner product space. The cosine similarity metric ranges from -1 to 1:
- **1** indicates that the vectors are identical in their direction.
- **0** indicates that the vectors are orthogonal, meaning they share no directional similarity.
- **-1** indicates that the vectors are diametrically opposed.
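
The three cases above can be checked on tiny hand-picked vectors. This is a minimal standard-library sketch; `cosine_similarity_plain` is a hypothetical name used here to keep it separate from the NumPy version defined in this notebook.

```python
import math


def cosine_similarity_plain(a, b):
    """Cosine of the angle between a and b, using only the standard library."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different lengths: magnitude does not matter
same_direction = cosine_similarity_plain([1.0, 0.0], [3.0, 0.0])
# Perpendicular vectors
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 2.0])
# Opposite directions
opposite = cosine_similarity_plain([1.0, 0.0], [-4.0, 0.0])

print(same_direction, orthogonal, opposite)
```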

The formula for cosine similarity for two vectors is described on [wikipedia.org/wiki/Cosine_similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
Cosine similarity provides a way to measure the similarity between high-dimensional vectors, making it widely used in text analysis and embeddings.
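
For reference, the standard definition for two $n$-dimensional vectors $A$ and $B$, where $\theta$ is the angle between them, is:

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \; \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

The numerator corresponds to `np.dot(A, B)` and each factor in the denominator to `np.linalg.norm` of one vector.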


In [None]:
def cosine_similarity(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))


## Compare all LeBron embeddings to each other

In [None]:
lebron_examples = [lebron1_embeddings, lebron2_embeddings, lebron_summary_embeddings]
print("### Cosine Similarity Among LeBron James' Embeddings")
for i in range(len(lebron_examples)):
    for j in range(i + 1, len(lebron_examples)):
        similarity = cosine_similarity(lebron_examples[i], lebron_examples[j])
        print(f"Similarity between LeBron example {i+1} and LeBron example {j+1}: {similarity:.4f}")

## Compare all Michael Jordan embeddings to each other

In [None]:
mj_examples = [mj1_embeddings, mj2_embeddings, mj_summary_embeddings]
print("### Cosine Similarity Among Michael Jordan's Embeddings")
for i in range(len(mj_examples)):
    for j in range(i + 1, len(mj_examples)):
        similarity = cosine_similarity(mj_examples[i], mj_examples[j])
        print(f"Similarity between MJ example {i+1} and MJ example {j+1}: {similarity:.4f}")

## Aggregated Embedding Comparison

Average each player's three embeddings element-wise to get a single overall vector per player.
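
Averaging element-wise means taking the mean of each dimension across the vectors, which is what `np.mean(..., axis=0)` does. A minimal sketch with made-up three-dimensional vectors, using only the standard library:

```python
# Three made-up vectors standing in for one player's three embeddings
vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 2.0],
    [2.0, 4.0, 2.0],
]

# zip(*vectors) groups the values by dimension, so each column holds
# the first (second, third, ...) component of every vector
aggregate = [sum(column) / len(vectors) for column in zip(*vectors)]

print(aggregate)
```

The result has the same number of dimensions as each input vector, so it can be compared with cosine similarity like any single embedding.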

In [None]:
lebron_aggregate = np.mean(np.array(lebron_examples), axis=0)
print(f"length = {len(lebron_aggregate)}, head = {lebron_aggregate[0:5]}")

In [None]:
mj_aggregate = np.mean(np.array(mj_examples), axis=0)
print(f"length = {len(mj_aggregate)}, head = {mj_aggregate[0:5]}")

In [None]:
# Compare the aggregated embeddings of LeBron and Michael
overall_similarity = cosine_similarity(lebron_aggregate, mj_aggregate)
print(f"Similarity between aggregated LeBron and aggregated Michael embeddings: {overall_similarity:.4f}")

## Wrap up

Now that you have completed the lab, take a moment to summarize your findings in Overleaf.