# llama-index - VectorStoreIndex and Query Engine

This notebook demonstrates how a VectorStoreIndex and a QueryEngine can be created from the text of an SEC filing.

In [1]:
import logging
import os
import shutil
import sys
from pathlib import Path

import html2text
from constants import FILINGS_DIR, LLM_MAX_TOKENS, LLM_PROVIDER, LLM_TEMPERATURE
from datamule import parse_textual_filing
from datamule.filing_viewer.filing_viewer import json_to_html
from dotenv import find_dotenv, load_dotenv
from edgar import Company, set_identity
from IPython.display import Markdown, display
from llama_index.core import Document, Response, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.groq import Groq

# basicConfig already attaches a stdout handler to the root logger;
# adding a second StreamHandler would print every log line twice.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

In [2]:
%%capture
load_dotenv(find_dotenv())
set_identity("John Doe john.doe@example.com")

Identity of the Edgar REST client set to [John Doe john.doe@example.com]


Reference pages in the llama-index docs:

- https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing/

The embedding model and the LLM need to be compatible and work well together. For example, OpenAI GPT models work well with `text-embedding-ada-002` ([Ref](https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing/)).

In [3]:
TICKER = "UBER"
FILING_TYPE = "10-K"
# LLM_MODEL = LLM_MODELS[LLM_PROVIDER][0]
LLM_MODEL = "llama3-groq-70b-8192-tool-use-preview"

INDEX_STORAGE_FOLDER = "filings_index"
EMBEDDING_MODEL_FOLDER = "embedding_model"
HF_EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"
PERSIST_VECTOR_STORE = True
CLEAR_VECTOR_STORE = True

In [4]:
display(
    Markdown(
        f"Ticker: {TICKER}\n"
        f"Configuration:\n\nLLM Provider: {LLM_PROVIDER} LLM Model: {LLM_MODEL}\n\n"
        f"Embedding model: {HF_EMBEDDING_MODEL}"
    )
)

Ticker: UBER
Configuration:

LLM Provider: groq LLM Model: llama3-groq-70b-8192-tool-use-preview

Embedding model: BAAI/bge-small-en-v1.5

In [5]:
llm = Groq(
    model=LLM_MODEL,
    api_key=os.environ["GROQ_API_KEY"],
    temperature=LLM_TEMPERATURE,
    max_tokens=LLM_MAX_TOKENS,
)

The `edgartools` project is used to get the latest SEC filing.

In [6]:
filing = Company(TICKER).get_filings(form=FILING_TYPE).latest(1)

In [7]:
# Using the datamule project to parse the filing into text
json_content = parse_textual_filing(filing.document.url, return_type="json")
html_content = json_to_html(json_content)

# Using html2text to convert to text
h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_tables = False

text_content = h.handle(html_content)

In [8]:
markdown_file = (Path(FILINGS_DIR) / TICKER / FILING_TYPE / filing.document.document).with_suffix(
    ".md"
)
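# Create the target folder if it does not exist yet, then write the filing as UTF-8 text
markdown_file.parent.mkdir(parents=True, exist_ok=True)
with open(markdown_file, "w", encoding="utf-8") as f:
    f.write(text_content)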

In [9]:
display(Markdown(f"The SEC filing is stored in Markdown to: {str(markdown_file)}"))

The SEC filing is stored in Markdown to: filings/uber-20231231.md

In [10]:
documents = [Document(text=text_content)]

In [11]:
%%capture
embed_model = HuggingFaceEmbedding(
    model_name=HF_EMBEDDING_MODEL,
    device="cpu",
    cache_folder=EMBEDDING_MODEL_FOLDER,
    trust_remote_code=True,
)

Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5
2 prompts are loaded, with the keys: ['query', 'text']


A VectorStoreIndex is created from the text content. The document is split into llama-index Nodes, which are similar to text chunks but carry additional metadata and, possibly, relationships to other nodes.

In [12]:
%%capture
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
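
To make the idea of a Node more concrete, here is a minimal pure-Python sketch of splitting text into overlapping chunks that carry metadata and previous/next links. It is only an illustration and not how llama-index splits documents: `SimpleNode`, `split_into_nodes`, the character-based windows and the `chunk_size` / `overlap` values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SimpleNode:
    """Hypothetical stand-in for a Node: a text chunk plus metadata and links."""

    node_id: str
    text: str
    metadata: dict = field(default_factory=dict)
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def split_into_nodes(text: str, chunk_size: int = 200, overlap: int = 20) -> list[SimpleNode]:
    """Split text into overlapping character windows and link neighbouring chunks."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    nodes = []
    for i, start in enumerate(range(0, len(text), step)):
        chunk = text[start : start + chunk_size]
        nodes.append(
            SimpleNode(
                node_id=f"node-{i}",
                text=chunk,
                # Metadata records where the chunk came from in the source text
                metadata={"start_char": start, "end_char": start + len(chunk)},
            )
        )
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    # Record the previous/next relationship between neighbouring chunks
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.next_id = nxt.node_id
        nxt.prev_id = prev.node_id
    return nodes


sample_nodes = split_into_nodes("Uber connects riders with drivers. " * 20)
for node in sample_nodes:
    print(node.node_id, node.metadata, node.prev_id, node.next_id)
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one of the neighbouring chunks, provided it is shorter than the overlap.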

In [13]:
if CLEAR_VECTOR_STORE:
    vector_store_folder = Path(INDEX_STORAGE_FOLDER)
    if vector_store_folder.exists():
        print(f"Clearing vector store folder: {vector_store_folder}")
        shutil.rmtree(vector_store_folder)

In [14]:
if PERSIST_VECTOR_STORE:
    print(f"Persisting vector store in {INDEX_STORAGE_FOLDER}")
    index.storage_context.persist(persist_dir=INDEX_STORAGE_FOLDER)

Persisting vector store in filings_index


In [15]:
display(f"The index is constructed with {len(index.index_struct.nodes_dict.keys())} nodes")

'The index is constructed with 149 nodes'

In llama-index, the default similarity top K for the retriever is 2; it can be increased, for example to 5.
This parameter sets how many nodes are sent to the LLM as context for the query. For example, with a value of 2 the retriever uses the embedding model to find the 2 nodes most similar to the query, which are the ones most likely to be relevant, and passes them to the LLM. The cell below uses a value of 3.

In [16]:
query_engine = index.as_query_engine(llm=llm, similarity_top_k=3)
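
The effect of the similarity top K parameter can be sketched without any library. The snippet below ranks a few toy vectors against a query vector and keeps the `k` best matches. It is a simplified illustration, not llama-index's retriever: the helper names, the 3-dimensional toy "embeddings" and the use of cosine similarity as the scoring function are assumptions made for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_nodes(
    query_vec: list[float], node_vecs: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Score every node against the query and return the k highest-scoring ones."""
    scored = [(node_id, cosine_similarity(query_vec, vec)) for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors invented for illustration; real embedding vectors are much longer.
toy_vectors = {
    "business_overview": [0.9, 0.1, 0.0],
    "risk_factors": [0.1, 0.9, 0.1],
    "legal_proceedings": [0.2, 0.8, 0.3],
    "exhibits": [0.0, 0.1, 0.9],
}
toy_query = [0.1, 1.0, 0.2]

top_hits = top_k_nodes(toy_query, toy_vectors, k=3)
for node_id, score in top_hits:
    print(f"{node_id}: {score:.3f}")
```

A larger `k` gives the LLM more context to draw on, at the cost of a longer prompt and a higher chance of including weakly related chunks.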

In [17]:
query = f"Extract key insights for {TICKER} from their latest 10-K SEC filing"

In [18]:
%%capture
response: Response = query_engine.query(query)

The `Response` object has the attributes `response`, `source_nodes` and `metadata`.
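
As a rough sketch of how these attributes can be inspected, the helper below builds one summary line per entry in `source_nodes`. The helper name `summarize_sources` is hypothetical, and the attribute names it relies on (`score`, `node`, `node_id`, `get_content()`) reflect my understanding of llama-index's `NodeWithScore` objects, so check them against the installed version. A stand-in object is used so that the snippet runs without an index or an LLM.

```python
from types import SimpleNamespace


def summarize_sources(response, preview_chars: int = 80) -> list[str]:
    """Return one line per source node: id, similarity score and a short text preview."""
    lines = []
    for item in response.source_nodes:
        # Flatten newlines so each source fits on a single line
        preview = item.node.get_content()[:preview_chars].replace("\n", " ")
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"{item.node.node_id} (score {score}): {preview}")
    return lines


# Stand-in objects invented for illustration; in the notebook, pass `response` instead.
fake_node = SimpleNamespace(
    node_id="node-0",
    get_content=lambda: "### ITEM 1. BUSINESS\nOverview ...",
)
fake_response = SimpleNamespace(
    response="(answer text)",
    source_nodes=[SimpleNamespace(node=fake_node, score=0.8123)],
    metadata={},
)

source_lines = summarize_sources(fake_response)
print("\n".join(source_lines))
```

Listing the scores next to the text previews makes it easier to judge whether the retrieved nodes were actually relevant to the query.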

The LLM generated this response:

In [19]:
display(Markdown(response.response))

1. Uber Technologies, Inc. is a technology platform that uses a massive network, leading technology, operational excellence, and product expertise to power movement from point A to point B.

2. Uber's business is divided into three operating and reportable segments: Mobility, Delivery, and Freight. Each segment addresses large, fragmented markets.

3. Mobility connects consumers with a wide range of transportation modalities, such as ridesharing, carsharing, micromobility, rentals, public transit, taxis, and more.

4. Delivery allows consumers to search for and discover local commerce, order a meal or other items, and either pick-up at the restaurant or have it delivered. This segment also includes Grocery & Retail categories.

5. Freight is revolutionizing the logistics industry by connecting shippers with carriers in the freight industry by providing carriers with the ability to book a shipment, transportation management, and other logistics services.

6. The classification of Drivers is currently being challenged in courts, by legislators and by government agencies in the United States and abroad. If Drivers were classified as employees, workers, or quasi-employees, Uber's business, financial condition, operating results, or prospects could be negatively impacted.

7. Uber Technologies, Inc. is involved in numerous legal proceedings globally, including putative class and collective class action lawsuits, demands for arbitration, charges and claims before administrative agencies, and investigations or audits by labor, social security, and tax authorities that claim that Drivers should be treated as employees (or as workers or quasi-employees where those statuses exist), rather than as independent contractors.

8. The company has incurred and expects to incur additional expenses, including expenses associated with a guaranteed minimum earnings floor for Drivers, insurance for injury protection, and subsidies for health care to comply with Proposition 22.

9. Uber Technologies, Inc. is developing technologies designed to provide new solutions to solve everyday problems.

10. The company's business, financial condition, operating results, or prospects could be negatively impacted by risks and uncertainties not currently known to them or that they currently do not believe are material.

The following SourceNodes were used by the LLM to generate the answer:

In [20]:
display(Markdown(response.get_formatted_sources(length=2000)))

> Source (Doc id: d803ba0e-c00d-4bb9-b30f-96c6f781850b): ### ITEM 6. [RESERVED]

### ITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND
RESULTS OF OPERATIONS

### The following discussion and analysis of our financial condition and
results of operations should be read in conjunction with our consolidated
financial statements and the related notes included in Part II, Item 8,
“Financial Statements and Supplementary Data,” of this Annual Report on Form
10-K

. We have elected to omit discussion on the earliest of the three years
covered by the consolidated financial statements presented. Refer to Item 7.
Management's Discussion and Analysis of Financial Condition and Results of
Operations located in our Annual Report on Form 10-K for the year ended
December 31, 2022, filed on February 21, 2023, for reference to discussion of
the fiscal year ended December 31, 2021, the earliest of the three fiscal
years presented.

### In addition to our historical consolidated financial information, the
following discussion contains forward-looking statements that reflect our
plans, estimates, and beliefs. Our actual results could differ materially from
those discussed in the forward-looking statements. You should review the
sections titled “Special Note Regarding Forward-Looking Statements” for a
discussion of forward-looking statements and in Part I, Item 1A, “Risk
Factors”, for a discussion of factors that could cause actual results to
differ materially from the results described in or implied by the forward-
looking statements contained in the following discussion and analysis and
elsewhere in this Annual Report on Form 10-K.

### Overview

We are a technology platform that uses a massive network, leading technology,
operational excellence, and product expertise to power movement from point A
to point B. We develop and operate proprietary technology applications
supporting a variety of offerings on our platform. We connect consumers with
providers of ride services, merchants as well as delivery service providers...

> Source (Doc id: db7564ec-f965-4fb7-a924-53321e616251): The information in this
report is not a part of this Form 10-K.

### Additional Information

We were founded in 2009 and incorporated as Ubercab, Inc., a Delaware
corporation, in July 2010. In February 2011, we changed our name to Uber
Technologies, Inc. Our principal executive offices are located at 1725 3rd
Street, San Francisco, California 94158, and our telephone number is (415)
612-8582. Our website address is www.uber.com and our investor relations
website is located at https://investor.uber.com. The information posted on our
website is not incorporated into this Annual Report on Form 10-K. The U.S.
Securities and Exchange Commission (“SEC”) maintains an Internet site that
contains reports, proxy and information statements, and other information
regarding issuers that file electronically with the SEC at www.sec.gov. Our
Annual Report on Form 10-K, Quarterly Reports on Form 10-Q, Current Reports on
Form 8-K and amendments to reports filed or furnished pursuant to Sections
13(a) and 15(d) of the Securities Exchange Act of 1934, as amended, (the
“Exchange Act”) are also available free of charge on our investor relations
website as soon as reasonably practicable after we electronically file such
material with, or furnish it to, the SEC. We webcast our earnings calls and
certain events we participate in or host with members of the investment
community on our investor relations website. Additionally, we provide
notifications of news or announcements regarding our financial performance,
including SEC filings, investor events, press and earnings releases, as part
of our investor relations website. The contents of these websites are not
intended to be incorporated by reference into this report or in any other
report or document we file.

### ITEM 1A. RISK FACTORS

### Certain factors may have a material adverse effect on our business,
financial condition, and results of operations. You should carefully consider
the following risks, together with all of the other inf...

> Source (Doc id: 760677c0-0f2f-4bc4-99a0-d2ab4e42a41b): ## Sections

# Filing Viewer

CIK: 1543151 | Accession Number: 000154315124000012

### PART I

### ITEM 1. BUSINESS

### Overview

Uber Technologies, Inc. (“Uber,” “we,” “our,” or “us”) is a technology
platform that uses a massive network, leading technology, operational
excellence and product expertise to power movement from point A to point B. We
develop and operate proprietary technology applications supporting a variety
of offerings on our platform (“platform(s)” or “Platform(s)”). We connect
consumers (“Rider(s)”) with independent providers of ride services (“Mobility
Driver(s)”) for ridesharing services, and connect Riders and other consumers
(“Eater(s)”) with restaurants, grocers and other stores (collectively,
“Merchants”) with delivery service providers (“Couriers”) for meal
preparation, grocery and other delivery services. Riders and Eaters are
collectively referred to as “end-user(s)” or “consumer(s).” Mobility Drivers
and Couriers are collectively referred to as “Driver(s).” We also connect
consumers with public transportation networks. We use this same network,
technology, operational excellence and product expertise to connect shippers
(“Shipper(s)”) with carriers (“Carrier(s)”) in the freight industry by
providing Carriers with the ability to book a shipment, transportation
management and other logistics services. Uber is also developing technologies
designed to provide new solutions to solve everyday problems. Our technology
is available in approximately 70 countries around the world, principally in
the United States (“U.S.”) and Canada, Latin America, Europe (excluding
Russia), the Middle East, Africa, and Asia (excluding China and Southeast
Asia).

### Our Segments

As of December 31, 2023, we had three operating and reportable segments:
Mobility, Delivery and Freight. Mobility, Delivery and Freight platform
offerings each address large, fragmented markets.

### Mobility

Our Mobility offering connects consumers with a wide range of transportati...