Kay.ai
=

> Data API built for RAG 🕵️ We are curating the world's largest datasets as high-quality embeddings so your AI agents can retrieve context on the fly. Latest models, fast retrieval, and zero infra.

This notebook shows how to do Retrieval-Augmented Generation (RAG) using [Kay.ai](https://kay.ai). SEC filings and press releases are currently available. Have specific data requests? Reach out to us at vishal@kay.ai or [tweet us](https://twitter.com/vishalrohra_).

Installation
=

First, install the `kay` package (for example, with `pip install kay`). You will also need an API key, which you can get for free at [https://kay.ai](https://kay.ai/). Once you have the key, set it as the environment variable `KAY_API_KEY`.

`KayAiRetriever` has a static `.create()` factory method that takes the following arguments:

* `dataset_id: string` required -- A Kay dataset id. This is a collection of data about a particular entity such as companies, people, or places. For example, try `"company"` 
* `data_types: List[string]` optional -- A list of data types for the corresponding dataset. A data type is a category of data based on its origin, such as ‘10-K’, ‘10-Q’, ‘PressRelease’, or ‘Reports’. For example, under the "company" dataset, the accepted data types are `["10-K", "10-Q", "PressRelease"]`. If left empty, Kay will retrieve the most relevant context across all data types.
* `num_contexts: int` optional, defaults to 6 -- The number of documents to retrieve on each call to `get_relevant_documents()`
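
The arguments above can be collected and sanity-checked before they are passed to `.create()`. The following is a minimal sketch only: `build_create_kwargs` is a hypothetical helper, not part of `kay` or LangChain, and the keyword name `data_types` is an assumption that should be checked against the signature of your installed version.

```python
from typing import Any, Dict, List, Optional


def build_create_kwargs(
    dataset_id: str,
    data_types: Optional[List[str]] = None,
    num_contexts: int = 6,
) -> Dict[str, Any]:
    """Hypothetical helper: validate the three arguments and return them as kwargs."""
    if not dataset_id:
        raise ValueError("dataset_id is required, for example 'company'")
    if num_contexts < 1:
        raise ValueError("num_contexts must be a positive integer")
    return {
        "dataset_id": dataset_id,
        # An empty list is treated as "search across everything in the dataset".
        "data_types": list(data_types or []),
        "num_contexts": num_contexts,
    }


kwargs = build_create_kwargs("company", ["10-K", "10-Q"], num_contexts=3)
# retriever = KayAiRetriever.create(**kwargs)  # needs the `kay` package and KAY_API_KEY
```

Validating up front keeps a typo in `dataset_id` or a non-positive `num_contexts` from surfacing later as a confusing API error.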

Examples
=

Basic Retriever Usage
-

In [81]:
from langchain.retrievers import KayAiRetriever

In [82]:
# Setup API key
from getpass import getpass
KAY_API_KEY = getpass()

 ········


In [83]:
import os
os.environ["KAY_API_KEY"] = KAY_API_KEY

retriever = KayAiRetriever.create(
    dataset_id="company",
    data_types=["10-K", "10-Q", "8-K", "6-K", "PressRelease"],
    num_contexts=3,
)
docs = retriever.get_relevant_documents("HDFC Bank CAGR growth since last two quarters?")

In [84]:
docs

[Document(page_content="Company Name: PDL COMMUNITY BANCORP \n Company Industry: SAVINGS INSTITUTION, FEDERALLY CHARTERED \n Form Title: 10-Q 2020-Q1 \n Form Section: Management's Discussion and Analysis of Financial Condition and Results of Operations \n Text: For the Three Months Ended March 31, 2020 2019 Average Average Outstanding Average Outstanding Average Balance Interest Yield/Rate (1) Balance Interest Yield/Rate (1) (Dollars in thousands)Interest earning assets: Loans (2) $ 975,499 $ 12,782 5.27 % $ 935,877 $ 12,095 5.24 % Available for sale securities 18,218 83 1.83 % 23,790 86 1.47 % Other (3) 38,220 165 1.73 % 33,714 201 2.42 % Total interest earning assets 1,031,937 13,030 5.07 % 993,381 12,382 5.06 % Non interest earning assets 37,467 34,441 Total assets $ 1,069,404 $ 1,027,822 Interest bearing liabilities: NOW/IOLA $ 29,026 $ 38 0.53 % $ 28,407 $ 26 0.37 % Money market 160,471 618 1.54 % 113,354 564 2.01 % Savings 113,710 35 0.12 % 122,559 40 0.13 % Certificates of depos
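
Each `page_content` string in the output above begins with labelled header fields (`Company Name`, `Company Industry`, `Form Title`, `Form Section`) followed by a `Text:` body. The sketch below splits such a string into a dictionary, assuming every chunk follows the layout seen in this one example. `parse_page_content` is a hypothetical helper, not part of `kay` or LangChain, and the sample string is shortened from the output above.

```python
from typing import Dict


def parse_page_content(page_content: str) -> Dict[str, str]:
    """Hypothetical helper: split 'Key: value' header lines and the Text body."""
    # Everything after the first "Text:" marker is treated as the body,
    # so colons or newlines inside the body are left untouched.
    header, marker, body = page_content.partition("Text:")
    fields: Dict[str, str] = {}
    for line in header.split("\n"):
        key, colon, value = line.partition(":")
        if colon:
            fields[key.strip()] = value.strip()
    if marker:
        fields["Text"] = body.strip()
    return fields


sample = (
    "Company Name: PDL COMMUNITY BANCORP \n"
    " Company Industry: SAVINGS INSTITUTION, FEDERALLY CHARTERED \n"
    " Form Title: 10-Q 2020-Q1 \n"
    " Form Section: Management's Discussion and Analysis of Financial Condition"
    " and Results of Operations \n"
    " Text: For the Three Months Ended March 31, 2020 2019"
)
fields = parse_page_content(sample)
print(fields["Company Name"], "|", fields["Form Title"])
```

With real results, the same function could be applied as `parse_page_content(doc.page_content)` for each item in `docs`, which makes it easy to check whether the retrieved company matches the one named in the query.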

Usage in a chain
-

In [85]:
OPENAI_API_KEY = getpass()

 ········


In [86]:
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [87]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name="gpt-3.5-turbo")
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

In [None]:
questions = [
    "HDFC Bank CAGR growth since last two quarters?",
    # "Is Johnson & Johnson increasing its marketing budget in 2022?",
    # "Where is Wex making the most money in 2023?",
    # "Who are Etsy's competitors?",
    # "What are some recent challenges faced by the renewable energy sector?",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
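
The loop above passes the growing `chat_history` into every call so that follow-up questions can rely on earlier turns. The sketch below shows that accumulation pattern in isolation, with no API keys or network access. `fake_qa` is a hypothetical stand-in for the chain and `run_conversation` is an illustrative helper; neither is part of LangChain.

```python
from typing import Any, Callable, Dict, List, Tuple

ChatHistory = List[Tuple[str, str]]


def fake_qa(inputs: Dict[str, Any]) -> Dict[str, str]:
    """Stand-in for the chain: reports how many earlier turns it received."""
    earlier_turns = len(inputs["chat_history"])
    return {"answer": f"(stub) saw {earlier_turns} earlier turn(s)"}


def run_conversation(
    qa: Callable[[Dict[str, Any]], Dict[str, str]],
    questions: List[str],
) -> ChatHistory:
    """Ask each question in order, feeding all previous turns back in."""
    chat_history: ChatHistory = []
    for question in questions:
        result = qa({"question": question, "chat_history": chat_history})
        # Store the turn as a (question, answer) pair, as in the loop above.
        chat_history.append((question, result["answer"]))
    return chat_history


history = run_conversation(fake_qa, ["First question?", "Follow-up question?"])
for question, answer in history:
    print(f"{question} -> {answer}")
```

Swapping `fake_qa` for the real `qa` chain keeps the loop unchanged, which makes the stub a convenient way to test the surrounding code before spending API calls.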