# LlamaIndex Model
<hr>

Download data

In [None]:
! mkdir -p 'data/10k/'
! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'

<hr>

### Ingestion

**Environment setting**

In [1]:
import os

from dotenv import load_dotenv
load_dotenv()

import nest_asyncio
nest_asyncio.apply()
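# load_dotenv() loads OPENAI_API_KEY from .env (if present) into os.environ;
# fail early with a clear message if the key is still missing.
if not os.getenv('OPENAI_API_KEY'):
    raise RuntimeError('OPENAI_API_KEY is not set; add it to your .env file.')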

**Get the documents**

In [5]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader('data/10k/').load_data()

In [9]:
print(len(documents))

545


**Ingestion pipeline**
- transformations
    - OpenAIEmbedding
        ```python
        embed_model = OpenAIEmbedding(
            model="text-embedding-3-large",
            dimensions=512,
        )
        ```

    - SentenceSplitter
        - chunk_size
        - chunk_overlap
        - tokenizer
        - paragraph_separator

    - TitleExtractor
        - LLM
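
The `chunk_size` and `chunk_overlap` settings listed above control how much text each node holds and how much adjacent nodes share. As a rough illustration only, the sketch below splits on whitespace tokens with a sliding window. `split_with_overlap` is a hypothetical helper; the real `SentenceSplitter` additionally respects sentence and paragraph boundaries and counts tokens with a tokenizer.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into whitespace-token windows that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in split_with_overlap(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each printed chunk starts with the last token of the previous one, which is the role `chunk_overlap` plays: context that straddles a boundary is still visible in both neighbouring chunks.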


In [11]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline, IngestionCache

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        OpenAIEmbedding(model='text-embedding-3-small'),
    ]
)

# run the pipeline
nodes = pipeline.run(documents=documents)

In [17]:
nodes[0]

TextNode(id_='5657f57a-ce46-4692-9584-1f9f1d607001', embedding=[0.03419557213783264, -0.013773573562502861, 0.04164283350110054, -0.008883890695869923, 0.05684659257531166, -0.00511194160208106, 0.010243210010230541, -0.0019310705829411745, 0.011982622556388378, 0.018270278349518776, -0.013773573562502861, -0.0052311234176158905, 0.00969561655074358, -0.003095510881394148, -0.010971186682581902, 0.019739115610718727, -0.029454059898853302, -0.04793049022555351, -0.020061230286955833, -0.050043556839227676, 0.012491562403738499, 0.044219743460416794, 0.0026783738285303116, -0.004909010138362646, 0.024016784504055977, -0.0014382367953658104, -0.03071674518287182, -0.07096804678440094, -0.0389886200428009, 0.00254630739800632, 0.017806435003876686, -0.023591594770550728, 0.008619757369160652, -0.01444356981664896, -0.030536361038684845, 0.01274925284087658, 0.011828008107841015, 0.023849284276366234, 0.028706757351756096, -0.03692709282040596, -0.04973432794213295, -0.03411826491355896, 0

### Indexing

**Indexing and Storing in chromaDB**

- embed_model
    - OpenAI
        - text-embedding-ada-002
        - text-search-babbage-doc-001 (legacy)
        - text-similarity-davinci-001 (legacy)
        - text-embedding-3-small
    - HuggingFace
        - sentence-transformers/all-MiniLM-L6-v2
        - sentence-transformers/all-mpnet-base-v2
        - bert-base-uncased

In [12]:
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("10k_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
embed_model = OpenAIEmbedding(model='text-embedding-3-small')

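# Note: from_documents chunks and embeds the raw documents itself;
# the `nodes` produced by the ingestion pipeline are not reused here.
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)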
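
Conceptually, querying the index embeds the query and returns the stored vectors closest to it. The sketch below shows that idea with cosine similarity over a tiny in-memory dictionary. The vectors, ids, and the `top_k` helper are made up for illustration; Chroma's actual search relies on its own index structures and configured distance metric.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float], store: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(node_id, cosine_similarity(query, vec)) for node_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
store = {
    "uber_p1": [1.0, 0.0, 0.0],
    "uber_p2": [0.8, 0.6, 0.0],
    "lyft_p1": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.1, 0.0], store, k=2)
print([node_id for node_id, _ in results])
```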

### Query Interface

**Tools**

*Vector Tool*

In [20]:
from typing import List
from llama_index.core.vector_stores import FilterCondition
from llama_index.core.tools import FunctionTool
from llama_index.core.vector_stores import MetadataFilters

def vector_query(
    query: str,
    page_numbers: List[str],
) -> str:
    """Performs a vector search over an index.

    query (str): the string query to be embedded.
    page_numbers (List[str]): Filter by a set of pages. Pass an empty list to search
    over all pages. Otherwise, filter by the set of specified pages.
    """

    metadata_dicts = [
        {"key": "page_label", "value": p}
        for p in page_numbers
    ]

    query_engine = index.as_query_engine(
        similarity_top_k=2,
        filters=MetadataFilters.from_dicts(
            metadata_dicts,
            condition=FilterCondition.OR
        )
    )
    response = query_engine.query(query)
    return response

vector_query_tool = FunctionTool.from_defaults(
    name="vector_query",
    fn=vector_query,
)
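
With `FilterCondition.OR`, a node is kept when its `page_label` matches any of the requested pages. Below is a minimal pure-Python sketch of that matching logic. It uses hypothetical node metadata dictionaries and a hypothetical `matches_any` helper rather than LlamaIndex's filter classes, and treating an empty filter list as "no filtering" is an assumption that mirrors the docstring's intent.

```python
from typing import Dict, List


def matches_any(metadata: Dict[str, str], filters: List[Dict[str, str]]) -> bool:
    """True if the metadata satisfies at least one key/value filter (OR semantics).

    An empty filter list is treated as "no filtering" (assumption for this sketch).
    """
    if not filters:
        return True
    return any(metadata.get(f["key"]) == f["value"] for f in filters)


# Hypothetical metadata, shaped like what a PDF reader might attach to each node.
nodes_meta = [
    {"file_name": "uber_2021.pdf", "page_label": "1"},
    {"file_name": "uber_2021.pdf", "page_label": "2"},
    {"file_name": "lyft_2021.pdf", "page_label": "7"},
]

page_numbers = ["2", "7"]
filters = [{"key": "page_label", "value": p} for p in page_numbers]

selected = [m for m in nodes_meta if matches_any(m, filters)]
print([m["page_label"] for m in selected])
```

Note that a page filter alone does not distinguish between files: page "2" would match in both PDFs if both had such a node.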

*Summary Tool*

In [21]:
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool

summary_index = SummaryIndex(nodes)
summary_query_engine = summary_index.as_query_engine(
    response_mode='tree_summarize',
    use_async=True,
)
summary_tool = QueryEngineTool.from_defaults(
    name="summary_tool",
    query_engine=summary_query_engine,
    description=(
        'Useful if you want to get a summary of the Uber and Lyft 10-K filings.'
    )
)
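
The `tree_summarize` response mode builds its answer bottom-up: chunks are summarized in groups, and the group summaries are summarized again until one remains. The sketch below shows only that recursion shape. `tree_summarize_sketch` and the stand-in `fake_summarize` (which keeps the first word of each input instead of calling an LLM) are hypothetical, and the library's own implementation packs chunks by token budget rather than by a fixed group size.

```python
from typing import Callable, List


def tree_summarize_sketch(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    fan_in: int = 2,
) -> str:
    """Repeatedly summarize groups of `fan_in` texts until a single summary remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    if not chunks:
        return ""
    level = list(chunks)
    while len(level) > 1:
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def fake_summarize(texts: List[str]) -> str:
    """Stand-in for an LLM call: keep the first word of each input."""
    return " ".join(t.split()[0] for t in texts if t.split())


chunks = ["revenue grew", "costs rose", "drivers returned", "demand recovered", "outlook stable"]
print(tree_summarize_sketch(chunks, fake_summarize))
```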

In [22]:
docs = [
    "data/10k/uber_2021.pdf",
    "data/10k/lyft_2021.pdf",
]

Map each document to its tools

In [23]:
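# Map each document path to its tools. Both documents currently share the
# same two tool objects, which were built over the combined data.
documents_to_tools_dict = {}
for doc_path in docs:  # avoid reusing the name `documents` from the ingestion step
    print(f"Getting tools for document: {doc_path}")
    documents_to_tools_dict[doc_path] = [vector_query_tool, summary_tool]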

Getting tools for document: data/10k/uber_2021.pdf
Getting tools for document: data/10k/lyft_2021.pdf


Put the tools into a flat list

In [24]:
initial_tools = [t for doc_path in docs for t in documents_to_tools_dict[doc_path]]

In [25]:
len(initial_tools)  # should be 4

4
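
Because both documents map to the same two tool objects, the flat list of 4 holds each tool twice. If that duplication is not intended, one option is to drop repeats by tool name while preserving order. The `ToolStub` class below is a hypothetical stand-in so the snippet runs without LlamaIndex; with real tools the name would typically come from `tool.metadata.name`.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ToolStub:
    """Hypothetical stand-in for a tool object; only the name matters here."""
    name: str


def dedupe_by_name(tools: Iterable[ToolStub]) -> List[ToolStub]:
    """Keep the first tool seen for each name, preserving the original order."""
    seen = set()
    unique = []
    for tool in tools:
        if tool.name not in seen:
            seen.add(tool.name)
            unique.append(tool)
    return unique


vector_stub = ToolStub("vector_query")
summary_stub = ToolStub("summary_tool")
tools_by_doc = {
    "data/10k/uber_2021.pdf": [vector_stub, summary_stub],
    "data/10k/lyft_2021.pdf": [vector_stub, summary_stub],
}

flat = [t for doc_tools in tools_by_doc.values() for t in doc_tools]
unique_tools = dedupe_by_name(flat)
print(len(flat), len(unique_tools))
```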

**Using agent worker**

Define the LLM; swap the model here if needed

In [31]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model='gpt-4o', temperature=0)

In [27]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    initial_tools,
    llm=llm,
    verbose=True,
)

agent = AgentRunner(agent_worker)
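
With `verbose=True`, the agent logs each tool call as a function name plus JSON arguments. The dispatch step behind such a call can be pictured as a lookup by name followed by a keyword call. This is a simplified sketch, not the `FunctionCallingAgentWorker` implementation; `dispatch_tool_call`, `fake_vector_query`, and the registry are hypothetical.

```python
import json
from typing import Any, Callable, Dict, List


def fake_vector_query(query: str, page_numbers: List[str]) -> str:
    """Stand-in for the real vector tool; reports what it was asked."""
    scope = "all pages" if not page_numbers else f"pages {', '.join(page_numbers)}"
    return f"searched {scope} for: {query}"


def dispatch_tool_call(
    registry: Dict[str, Callable[..., Any]],
    name: str,
    raw_args: str,
) -> Any:
    """Look up a tool by name and call it with JSON-decoded keyword arguments."""
    if name not in registry:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(raw_args)
    return registry[name](**kwargs)


registry = {"vector_query": fake_vector_query}
result = dispatch_tool_call(
    registry,
    "vector_query",
    '{"query": "impact of covid-19", "page_numbers": []}',
)
print(result)
```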

<hr>

### Test case

In [33]:
response = agent.query(
    "Tell me the difference between Uber and Lyft in "
    "the impact of covid-19.",
)

Added user message to memory: Tell me the difference between Uber and Lyft in the impact of covid-19.
=== Calling Function ===
Calling function: vector_query with args: {"query": "Difference between Uber and Lyft in the impact of covid-19", "page_numbers": []}
=== Function Output ===
Uber and Lyft both faced challenges due to the COVID-19 pandemic, impacting their operations and financial performance. However, Lyft specifically mentioned challenges related to delays in manufacturing assets, increased costs, and supply chain issues affecting the deployment of new vehicles and features on their network. Lyft also highlighted the adverse impact on demand for vehicles rented through their Express Drive program and Lyft Rentals, leading to operational changes and cost increases. On the other hand, Uber emphasized the global impact of COVID-19 on drivers, merchants, consumers, and business partners, leading to reduced demand for mobility rides and driver supply constraints. Uber responded by

In [38]:
response = agent.query(
    "Give me a summary of the document lyft. "
    "and what is the 10k analysis overall based on uber.",
)

print(str(response))

Added user message to memory: Give me a summary of the document lyft. and what is the 10k analysis overall based on uber.
=== Calling Function ===
Calling function: summary_tool with args: {"input": "lyft"}
=== Function Output ===
Lyft, Inc. is a transportation network company that operates a ridesharing marketplace connecting drivers with riders through its mobile application. The company offers various transportation options, including ridesharing, bike and scooter sharing, public transit integration, and car rentals. Lyft is committed to improving people's lives by providing affordable and reliable transportation options. The company has faced challenges due to the COVID-19 pandemic, impacting its operations, financial performance, and driver availability. Additionally, Lyft has focused on safety measures, diversity in its workforce, and environmental initiatives, such as transitioning to electric vehicles.
=== Calling Function ===
Calling function: summary_tool with args: {"input": 