# Librarian Playground

Testing the librarian and seeing if it actually works

## Initialize Librarian

A `Librarian` requires the following:
- A data source that it can fetch documents from
- A document store that holds the text of the fetched documents. **Note**: it needs to have a unique index
- A vectorstore index that stores the chunks of the fetched articles

In [1]:
from kruppe.algorithm.librarian import Librarian
from kruppe.data_source.news.nyt import NewYorkTimesData
from kruppe.functional.rag.index.vectorstore_index import VectorStoreIndex
from kruppe.functional.rag.vectorstore.chroma import ChromaVectorStore
from kruppe.functional.docstore.mongo_store import MongoDBStore
from kruppe.llm import OpenAILLM, OpenAIEmbeddingModel

reset_db = True

In [2]:
llm = OpenAILLM()
embedding_model = OpenAIEmbeddingModel()

# Create doc store
unique_indices = [['title', 'datasource']] # NOTE: this is important to avoid duplicates
docstore = await MongoDBStore.acreate_db(
    db_name="kruppe_librarian",
    collection_name="playground",
    unique_indices=unique_indices,
    reset_db=reset_db
)

# Create vectorstore index
vectorstore = ChromaVectorStore(
    embedding_model=embedding_model,
    collection_name="playground",
    persist_path='/Volumes/Lexar/Daniel Liu/vectorstores/kruppe_librarian'
)
if reset_db:
    vectorstore.clear()
    
index = VectorStoreIndex(llm=llm, vectorstore=vectorstore)

# Define news data source
news_source = NewYorkTimesData(headers_path = "/Users/danielliu/Workspace/fin-rag/.nyt-headers.json")

# Define librarian
librarian_llm = OpenAILLM(model="gpt-4o-mini")
librarian = Librarian(
    llm=librarian_llm,
    docstore=docstore,
    index=index,
    news_source=news_source
)

In [3]:
query = "What are the key developments and financial projections for Amazon's advertising business, and how is it positioning itself in the digital ad market?"

## Testing Individual Methods

### `retrieve_from_library`

The librarian takes the information request and turns it into function calls (defined by the data sources available to the librarian). The downloaded results are stored in the document store and the (vectorstore) index.

#### Helper Functions

Helper Function 1: `_choose_resource`

Given a request for information, `_choose_resource` returns the function calls that should be made

In [4]:
resource_requests = await librarian._choose_resource(information_desc=query)
resource_requests

[{'func_name': 'news_search',
  'parameters': {'query': 'Amazon advertising developments financial projections digital ad market',
   'sort': 'relevance'},
  'purpose': "Obtain specific articles regarding Amazon's advertising business and its positioning in the digital advertising market.",
  'rank': 1},
 {'func_name': 'news_recent',
  'parameters': {'days': 7,
   'filter': {'include': 'Amazon advertising', 'exclude': None}},
  'purpose': "Get a recent overview of news related to Amazon's advertising business.",
  'rank': 2},
 {'func_name': 'news_archive',
  'parameters': {'start_date': '2020-01-01',
   'end_date': '2023-10-23',
   'filter': {'include': 'Amazon advertising', 'exclude': None}},
  'purpose': "Access historical news regarding key developments in Amazon's advertising sector.",
  'rank': 3}]

Helper Function 2: `_retrieve_helper` executes a single function call taken from `resource_requests` and yields the resulting documents

In [5]:
rsc_request = resource_requests[0]
async for doc in librarian._retrieve_helper(
    resource_request=rsc_request, rank_threshold=2):
    print(doc)

<Document {"id": "7429d924-d274-4a5f-83b9-fa7ca9f8532f", "metadata": {"query": "Amazon advertising developments financial projections digital ad market", "datasource": "NewYorkTimesData", "url": "https://www.nytimes.com/2021/02/12/business/dealbook/shell-peak-oil.html", "title": "For Shell, Oil Is Past Its Peak (Published 2021)", "description": "Now comes the hard part.", "publication_time": 1613133690, "section": "", "document_type": ""}, "text": "Artificial Intelligence\nAdvertisement\nSupported by..."}>
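To make the request format concrete, here is a minimal sketch of how a single resource request could be dispatched. It is not the actual `_retrieve_helper` implementation: the `TOOLS` registry, the fake `news_search` generator, and the rule that a request is skipped when its `rank` exceeds `rank_threshold` are all assumptions made for illustration.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List


# Hypothetical stand-in for a data source method (the real one would hit the news API).
async def news_search(query: str, sort: str = "relevance") -> AsyncIterator[Dict[str, str]]:
    for i in range(2):
        yield {"title": f"result {i} for {query!r}", "datasource": "FakeNews"}


# Hypothetical registry mapping 'func_name' values to callables.
TOOLS = {"news_search": news_search}


async def retrieve_helper_sketch(
    resource_request: Dict[str, Any], rank_threshold: int
) -> AsyncIterator[Dict[str, str]]:
    # ASSUMPTION: requests ranked worse than the threshold are skipped entirely.
    if resource_request["rank"] > rank_threshold:
        return
    func = TOOLS.get(resource_request["func_name"])
    if func is None:
        return  # unknown function name: nothing to execute
    # Unpack the LLM-chosen parameters into the data source call.
    async for doc in func(**resource_request["parameters"]):
        yield doc


async def collect(resource_request: Dict[str, Any], rank_threshold: int) -> List[Dict[str, str]]:
    return [doc async for doc in retrieve_helper_sketch(resource_request, rank_threshold)]


if __name__ == "__main__":
    request = {
        "func_name": "news_search",
        "parameters": {"query": "Amazon advertising", "sort": "relevance"},
        "rank": 1,
    }
    print(asyncio.run(collect(request, rank_threshold=2)))
```

Inside a notebook, `await collect(request, rank_threshold=2)` would be used instead of `asyncio.run`, since the event loop is already running.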


Helper Function 3: `_save_to_docstore_and_index` saves a `Document` into the document store, splits it into `Chunks`, and stores those chunks in the (vectorstore) index
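Since this helper isn't demonstrated here, below is a rough in-memory sketch of the idea. Everything in it is a stand-in: `Doc`, `InMemoryDocstore`, `split_into_chunks`, and the fixed-size character splitter are hypothetical, and skipping the index when the docstore reports a duplicate is an assumption about the behavior, not something confirmed by the real code. The unique key mirrors the `['title', 'datasource']` index used above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Doc:
    text: str
    metadata: Dict[str, str]


class InMemoryDocstore:
    """Toy docstore that enforces a compound unique index over metadata fields."""

    def __init__(self, unique_fields: Sequence[str]):
        self.unique_fields = tuple(unique_fields)
        self._docs: Dict[Tuple, Doc] = {}

    def save(self, doc: Doc) -> bool:
        key = tuple(doc.metadata.get(f) for f in self.unique_fields)
        if key in self._docs:
            return False  # duplicate under the unique index
        self._docs[key] = doc
        return True


def split_into_chunks(text: str, chunk_size: int = 200) -> List[str]:
    # Naive fixed-size splitter; a real text splitter is likely smarter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def save_to_docstore_and_index(
    doc: Doc, docstore: InMemoryDocstore, index: List[str], chunk_size: int = 200
) -> int:
    """Save the document, then index its chunks. Returns the number of chunks added."""
    # ASSUMPTION: a document rejected as a duplicate is not chunked or indexed again.
    if not docstore.save(doc):
        return 0
    chunks = split_into_chunks(doc.text, chunk_size)
    index.extend(chunks)
    return len(chunks)
```

Under this assumption the docstore's unique index is also what keeps the vectorstore free of duplicate chunks, which is why it matters that the docstore has one.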

In [6]:
# i'm too lazy to show this here i'm pretty sure it works just take my word for it

#### Actual Function

Note: this actually returns a list of `Document`s, not `Chunks`. Until I implement working with financial documents (and even then), the returned output isn't actually used. In reality, `librarian` is really similar to an `Index`: it returns the chunks that are most relevant to the query. The main difference is that it can make function calls to find new sources.

In [7]:
ret_documents = await librarian.retrieve_from_library(
    information_desc=query,
    num_resources=2, 
    rank_threshold=2
)
ret_documents

[Document(text='Tariffs\nAdvertisement\nHyundai already makes cars in the United States, in Georgia and Alabama.\ntranscript\nToday, we’re delighted to report that Hyundai is announcing a major $5.8 billion investment in American manufacturing. In particular, Hyundai will be building a brand new steel plant in Louisiana, which will produce more than 2.7 million metric tons of steel, a year, creating more than 1,400 jobs for American steel workers. And then there’ll be major expansion after that. This investment is a clear demonstration that tariffs very strongly work. Hyundai will be producing steel in America and making its cars in America, and as a result, they’ll not have to pay any tariffs. There are no tariffs if you make your product in America.\nBy Jack Ewing\nHyundai Motor, a South Korean conglomerate known for its automobiles, will invest $21 billion to expand manufacturing in the United States in what President Trump said was proof that his tariff policies were creating jobs.

In [8]:
docstore_docs = await docstore.aget_all_documents()
len(ret_documents) == len(docstore_docs) # should be True

True

In [None]:
await index.async_query(query, top_k=7, filter=None) # should return a list of chunks

[Chunk(text='Google and Facebook transformed product marketing from largely an art to a sometimes creepy science, and Sandberg is among the architects of that change. She shares in the credit (or blame) for developing two of the most successful, and perhaps least defensible, business models in internet history.\nAll the anxiety today about apps snooping on people to glean every morsel of activity to better pitch us dishwashers — that’s partly Sandberg’s doing. So are Facebook and Google’s combined $325 billion in annual advertising sales and those of all other online companies that make money from ads.\nThe pattern of deny, deflect, defend.', id=UUID('f2421814-0c52-4e95-b05e-0a26bef7bcfd'), metadata={'datasource': 'NewYorkTimesData', 'description': 'Sandberg transformed digital advertising and was a voice on big issues, but she also denied problems and deflected blame.', 'document_type': 'article', 'publication_time': 1654186582, 'query': '', 'section': 'Technology', 'title': 'Sheryl S

### `retrieve_from_index`

`retrieve_from_index` retrieves chunks from the vectorstore index, which stores the chunked, scraped documents.

If there isn't anything in the index in the first place, the `execute` function (which we will see later) *should* first call `retrieve_from_library`.

Running the function w/o restricting the time

In [4]:
ret_chunks = await librarian.retrieve_from_index(
    information_desc=query,
    top_k = 10
)
ret_chunks

[Chunk(text='Google and Facebook transformed product marketing from largely an art to a sometimes creepy science, and Sandberg is among the architects of that change. She shares in the credit (or blame) for developing two of the most successful, and perhaps least defensible, business models in internet history.\nAll the anxiety today about apps snooping on people to glean every morsel of activity to better pitch us dishwashers — that’s partly Sandberg’s doing. So are Facebook and Google’s combined $325 billion in annual advertising sales and those of all other online companies that make money from ads.\nThe pattern of deny, deflect, defend.', id=UUID('f2421814-0c52-4e95-b05e-0a26bef7bcfd'), metadata={'datasource': 'NewYorkTimesData', 'description': 'Sandberg transformed digital advertising and was a voice on big issues, but she also denied problems and deflected blame.', 'document_type': 'article', 'publication_time': 1654186582, 'query': '', 'section': 'Technology', 'title': 'Sheryl S

Running the function with time restriction (manual)

In [5]:
ret_chunks = await librarian.retrieve_from_index(
    information_desc=query,
    top_k = 10,
    start_time = "2022-01-01",
    end_time = "2025-01-31"
)
ret_chunks

[Chunk(text='Google and Facebook transformed product marketing from largely an art to a sometimes creepy science, and Sandberg is among the architects of that change. She shares in the credit (or blame) for developing two of the most successful, and perhaps least defensible, business models in internet history.\nAll the anxiety today about apps snooping on people to glean every morsel of activity to better pitch us dishwashers — that’s partly Sandberg’s doing. So are Facebook and Google’s combined $325 billion in annual advertising sales and those of all other online companies that make money from ads.\nThe pattern of deny, deflect, defend.', id=UUID('f2421814-0c52-4e95-b05e-0a26bef7bcfd'), metadata={'datasource': 'NewYorkTimesData', 'description': 'Sandberg transformed digital advertising and was a voice on big issues, but she also denied problems and deflected blame.', 'document_type': 'article', 'publication_time': 1654186582, 'query': '', 'section': 'Technology', 'title': 'Sheryl S

Running the function with time restriction (llm/automatic)

TODO: need to check what time restriction the LLM has come up with... right now I'm just trusting whatever the heck the LLM came up with

NOTE: The LLM will *always* say that a time restriction is needed. If I want the LLM to be able to say "no" to adding a time restriction, use the prompt `LIBRARIAN_TIME_USER_2`

In [6]:
ret_chunks = await librarian.retrieve_from_index(
    information_desc=query,
    top_k = 10,
    llm_restrict_time = True
)
ret_chunks

[Chunk(text='Google and Facebook transformed product marketing from largely an art to a sometimes creepy science, and Sandberg is among the architects of that change. She shares in the credit (or blame) for developing two of the most successful, and perhaps least defensible, business models in internet history.\nAll the anxiety today about apps snooping on people to glean every morsel of activity to better pitch us dishwashers — that’s partly Sandberg’s doing. So are Facebook and Google’s combined $325 billion in annual advertising sales and those of all other online companies that make money from ads.\nThe pattern of deny, deflect, defend.', id=UUID('f2421814-0c52-4e95-b05e-0a26bef7bcfd'), metadata={'datasource': 'NewYorkTimesData', 'description': 'Sandberg transformed digital advertising and was a voice on big issues, but she also denied problems and deflected blame.', 'document_type': 'article', 'publication_time': 1654186582, 'query': '', 'section': 'Technology', 'title': 'Sheryl S
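The manual time restriction can be pictured with a small sketch. The chunk metadata above stores `publication_time` as a Unix timestamp, while `start_time` and `end_time` are passed as `YYYY-MM-DD` strings, so some conversion has to happen. The helper names `to_epoch` and `in_time_range`, and the choice of UTC midnight as the boundary, are assumptions for illustration; the real filter is presumably applied inside the vectorstore query rather than in Python.

```python
from datetime import datetime, timezone
from typing import Optional


def to_epoch(date_str: str) -> int:
    """Convert a 'YYYY-MM-DD' string to a Unix timestamp at 00:00 UTC (assumed timezone)."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def in_time_range(
    publication_time: int,
    start_time: Optional[str] = None,
    end_time: Optional[str] = None,
) -> bool:
    """Return True if the timestamp falls inside the (optional) bounds, inclusive."""
    if start_time is not None and publication_time < to_epoch(start_time):
        return False
    # Note: end_time is midnight at the *start* of that day in this sketch.
    if end_time is not None and publication_time > to_epoch(end_time):
        return False
    return True


# publication_time of the Sandberg article in the output above
print(in_time_range(1654186582, start_time="2022-01-01", end_time="2025-01-31"))
```

With no bounds given, every chunk passes, which matches the unrestricted call earlier in this section.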

## `execute`

Testing the big guy

This function will do the following:
1. retrieve chunks using `retrieve_from_index`
2. run the chunks through an LLM, which outputs a confidence/relevance score for how relevant the chunks are to the information request
3. if they are relevant enough, return the chunks
4. if they are *not* relevant enough, run `retrieve_from_library`, which collects new `Documents` and stores them into the docstore and (vectorstore) index
5. start from step 1 again. This function will retry this process up to `retries` times. If there are still no relevant contexts by then, it just returns an empty list and basically says "can't find any, good luck buddy"

Note: `execute` accepts `kwargs`, which are passed on as parameters to `retrieve_from_index` and `retrieve_from_library`
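The loop described in the steps above can be sketched roughly as follows. This is not the real `execute`: the three callables are injected so the sketch runs on its own, the `fake_*` functions are in-memory stand-ins, `kwargs` routing is left out, and treating `retries` as "one initial attempt plus up to `retries` refetches" is an assumption about the semantics.

```python
import asyncio
from typing import Awaitable, Callable, List

Chunks = List[str]


async def execute_sketch(
    information_desc: str,
    retrieve_from_index: Callable[[str], Awaitable[Chunks]],
    retrieve_from_library: Callable[[str], Awaitable[None]],
    score_relevance: Callable[[str, Chunks], Awaitable[int]],
    retries: int = 2,
    relevance_score_threshold: int = 2,
) -> Chunks:
    for attempt in range(retries + 1):
        # Steps 1-3: query the index and return if the chunks score high enough.
        chunks = await retrieve_from_index(information_desc)
        if chunks and await score_relevance(information_desc, chunks) >= relevance_score_threshold:
            return chunks
        # Step 4: otherwise pull new documents, unless no attempts remain.
        if attempt < retries:
            await retrieve_from_library(information_desc)
    return []  # step 5: nothing relevant found after all retries


# --- in-memory stand-ins, purely for demonstration ---
fake_index: Chunks = []


async def fake_retrieve_from_index(desc: str) -> Chunks:
    return list(fake_index)


async def fake_retrieve_from_library(desc: str) -> None:
    fake_index.append(f"chunk about {desc}")


async def fake_score(desc: str, chunks: Chunks) -> int:
    return 3 if chunks else 0


def run_demo() -> Chunks:
    fake_index.clear()
    return asyncio.run(
        execute_sketch("amazon ads", fake_retrieve_from_index, fake_retrieve_from_library, fake_score)
    )
```

In the demo the index starts empty, so the first pass falls through to the library fetch and the second pass returns the newly indexed chunk.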

In [4]:
await librarian.execute(
    information_desc=query,
    retries=2,
    relevance_score_threshold=2,
    # retrieve_from_library parameters
    num_resources=2,
    rank_threshold=2,
    # retrieve_from_index parameters
    top_k=10,
    start_time="2023-01-01"
)

[Chunk(text='On Wednesday, Meta said its increased revenue and profit for the fourth quarter were driven largely by advancements in its systems for advertisement targeting and suggesting relevant posts and videos to users. Those improvements came from its continued investments in artificial intelligence, the company said.\nRevenue for the fourth quarter was $48.4 billion, up from $40.1 billion a year earlier and above Wall Street estimates of $47 billion, according to data compiled by FactSet, a market analysis firm. Profit was $20.8 billion, up from $14 billion a year earlier.\nBut the Silicon Valley company also said it expected revenue in the current quarter to come in at $39.5 billion to $41.8 billion. The low end of the forecast was below analyst expectations of $41.7 billion.', id=UUID('81a5e6fa-c2b8-445d-9d10-53b457fa198f'), metadata={'datasource': 'NewYorkTimesData', 'description': 'President Trump had sued Meta and other tech firms in 2021, arguing that he had been wrongfully 