In [None]:
%%capture
!pip install llama-index llama-index-embeddings-openai

In [None]:
import os
from getpass import getpass

In [None]:
os.environ['OPENAI_API_KEY'] = getpass("Enter your OpenAI API key: ")

# 🗂️ Indexing

In `LlamaIndex`, an `Index` is a data structure that contains `Document` objects, allowing for efficient querying by an LLM. There are several types of indexes, but we'll focus on the most common, `VectorStoreIndex`.

- 🌐 **Vector Store Index:** Segments your `Documents` into `Nodes` and generates vector embeddings for each node's text, prepping them for retrieval.

- 🔄 **Vector Store Index Process:** Converts each node's text into an embedding by calling an embedding model's API, a step often called "embedding your text."
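
To make the two bullets above concrete, here is a minimal sketch of the split, embed, and store flow in plain Python. It is not how `VectorStoreIndex` is implemented: `split_into_nodes`, `toy_embed`, and `toy_index` are hypothetical names, and `toy_embed` is a hash-based stand-in for a real embedding API call, so its vectors carry no semantic meaning.

```python
import hashlib


def split_into_nodes(text: str, chunk_size: int = 40) -> list[str]:
    """Toy splitter: group whole words into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            # The chunk is full, so close it and start a new one with this word.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Stand-in for an embedding API call (assumption): a deterministic,
    fixed-length vector derived from a hash, with no semantic meaning."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dims]]


document_text = "The index splits each document into nodes and stores one vector per node."

# 1) Segment the document into nodes, 2) embed each node, 3) keep text and vector together.
nodes = split_into_nodes(document_text)
toy_index = [(node, toy_embed(node)) for node in nodes]

for node, vector in toy_index:
    print(len(vector), repr(node))
```

Every node ends up with a vector of the same length no matter how long its text is, which is the property the real embedding models below also have.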

### ⚙️ Embedding Text

First, let's see what an embedding is.


In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding

te_small = OpenAIEmbedding(model="text-embedding-3-small")

te_large = OpenAIEmbedding(model="text-embedding-3-large")

te_ada = OpenAIEmbedding(model="text-embedding-ada-002") 

You can also run embeddings locally with a model from Hugging Face.

```python
# Install the integration first (in a shell, or prefix with `!` in a notebook cell):
# pip install llama-index-embeddings-huggingface

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

hf_embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```

In [None]:
string = "A"

string_2 = "This is a complete sentence."

string_3 = """In the pursuit of a life well-lived, one must recognize the transient nature of the 
material world and the enduring value of virtue. The Sikh Gurus taught us that the Divine Light 
resides within all, and thus, we are united in our essence beyond the superficial distinctions of 
caste, creed, or status. Similarly, the Stoics emphasized the cultivation of inner virtues such as courage, 
temperance, and wisdom, understanding that true freedom lies in mastery over one's own perceptions and actions. 
As we navigate the vicissitudes of life, let us remember that our choices are our own, and in choosing virtue, 
we align ourselves with the cosmic order and the teachings of the Gurus. It is through selfless service, 
compassion, and the relentless pursuit of truth that we may attain a state of inner peace and contribute 
to the harmony of the world, embodying the principles of both Sikhism and Stoicism in our daily lives
"""

In [None]:
example_embedding = te_small.get_text_embedding(string)

In [None]:
len(example_embedding)

In [None]:
def get_embedding_dimensions(embed_model, list_of_strings):
    """Embed each string in one batch and return the length of every embedding."""
    embeddings = embed_model.get_text_embedding_batch(list_of_strings)
    return [len(embedding) for embedding in embeddings]

In [None]:
get_embedding_dimensions(te_small, [string, string_2, string_3])

In [None]:
get_embedding_dimensions(te_large, [string, string_2, string_3])

In [None]:
get_embedding_dimensions(te_ada, [string, string_2, string_3])

In [None]:
te_ada.similarity(
    te_ada.get_text_embedding("""In embracing both the wisdom of the Sikh Gurus and the Stoic philosophers, 
                              we find a path to tranquility by accepting what is beyond our control and focusing 
                              our efforts on living virtuously and with purpose."""), 
    te_ada.get_text_embedding(string_2),
    mode="cosine"
    )
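
For intuition about what `mode="cosine"` measures in the cell above, here is a minimal pure-Python sketch of cosine similarity. The `cosine_similarity` helper is written for illustration only and is not the library's implementation; the short vectors are made-up values, not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: similarity 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Opposite directions: similarity -1.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction and not on vector length, two texts of very different sizes can still score as similar when their embeddings point the same way.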

# Create an Index

First, let's get some data.

In [8]:
import requests

def load_text_from_url(url: str) -> str:
    """
    Fetches and returns the text content from the specified URL.

    Parameters:
    - url: The URL of the text file to fetch.

    Returns:
    - The text content of the file if the request is successful; otherwise, an error message.
    """
    try:
        response = requests.get(url, timeout=30)  # time out instead of hanging on a stalled connection
        response.raise_for_status()  # This will raise an HTTPError if the response was an error
        return response.text
    except requests.RequestException as e:
        return f"Failed to load content from {url}. Error: {e}"

url = "https://www.gutenberg.org/files/10763/10763.txt"

text_content = load_text_from_url(url)

⏳ Generating embeddings can be time-consuming, especially with large volumes of text, because of the many API calls required.

Now, create an index by passing a **list of Documents**. To save time and cost, we will index only a 10,000-character slice of the document.

In [15]:
from llama_index.core import Document, VectorStoreIndex

full_document = Document(text=text_content)

partial_document = Document(text=text_content[50000:60000])

In [16]:
index = VectorStoreIndex.from_documents(
    # remember, you must pass a list of documents!
    [partial_document], 
    embed_model=te_small,
    show_progress=True)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 83.77it/s]
Generating embeddings: 100%|██████████| 3/3 [00:00<00:00,  5.98it/s]


Note that you can also build an index over a **list of `Node` objects**.


In [20]:
from llama_index.core.node_parser import SentenceSplitter

# instantiate a node parser
splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=16,
    paragraph_separator="\n\n\n\n",
)

# pass a list of documents to the node parser
nodes = splitter.get_nodes_from_documents([partial_document])

# create the index from the nodes
index_from_nodes = VectorStoreIndex(
    nodes,
    embed_model=te_small,
    show_progress=True
    )

Generating embeddings: 100%|██████████| 6/6 [00:00<00:00, 15.39it/s]
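
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window sketch. It is an approximation under stated assumptions: `SentenceSplitter` measures chunks in tokens and tries to respect sentence boundaries, whereas the hypothetical `sliding_window_chunks` helper below slices a plain list of tokens and ignores sentences entirely.

```python
def sliding_window_chunks(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=1)

for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so text that straddles a boundary still appears intact in at least one chunk.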


# Retrieval from `VectorStoreIndex`

- 🔍 When searching, your query is also converted into a vector embedding.

- The `VectorStoreIndex` then ranks the stored embeddings by their semantic similarity to the query embedding, typically using cosine similarity.

- 🔝 Top-k semantic retrieval is the simplest way to query a vector index.
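
The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the library's retrieval code: `top_k`, `toy_vectors`, and the three-dimensional vectors are made up for the example, and a real query vector would come from the same embedding model used to build the index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector: list[float], stored: list[tuple[str, list[float]]], k: int = 2):
    """Score every stored vector against the query and keep the k best matches."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical node texts with hand-picked vectors (assumption: not real embeddings).
toy_vectors = [
    ("node about virtue", [0.9, 0.1, 0.0]),
    ("node about weather", [0.0, 0.2, 0.9]),
    ("node about wisdom", [0.8, 0.3, 0.1]),
]

query_vector = [1.0, 0.0, 0.0]  # stands in for the embedded query
results = top_k(query_vector, toy_vectors, k=2)

for score, text in results:
    print(f"{score:.3f}  {text}")
```

The weather node points in a different direction from the query, so it is the one left out when `k=2`.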

We'll discuss querying in its own section!