## RAG Question Answering over the 101 Alpha Paper

The goal is to generate structured output in response to a query about a specific alpha in the paper. Each answer should contain the alpha number, the alpha expression, and an explanation of the expression.

#### Interesting observations
- `claude-3-haiku` was unable to use the instruction and the given context to produce structured output; the call raised a `ValidationError`.
- `claude-3-opus`, however, was able to follow the instruction and the given context to produce structured output. It also drew on its internal knowledge to give a step-by-step explanation of the alpha expression from the paper.
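A `ValidationError` of this kind most likely means that the arguments returned by the model did not satisfy the target schema, for example because a required field was missing. The sketch below imitates that check with the standard library only. `REQUIRED_FIELDS` and `validate_alpha_payload` are hypothetical names; in the notebook the real check happens inside LangChain's structured-output parsing, and the notebook does not record which field actually failed for `claude-3-haiku`.

```python
# Hypothetical stand-in for schema validation; not the library's implementation.
REQUIRED_FIELDS = ("alpha_number", "alpha_expression", "alpha_explanation")


def validate_alpha_payload(payload: dict) -> dict:
    """Return the payload unchanged if every required field is present and is a string."""
    problems = []
    for name in REQUIRED_FIELDS:
        if name not in payload:
            problems.append(f"{name}: field required")
        elif not isinstance(payload[name], str):
            problems.append(f"{name}: expected a string")
    if problems:
        raise ValueError("; ".join(problems))
    return payload


complete = {
    "alpha_number": "Alpha#12",
    "alpha_expression": "(sign(delta(volume, 1)) * (-1 * delta(close, 1)))",
    "alpha_explanation": "Illustrative placeholder text.",
}
incomplete = {"alpha_number": "Alpha#12"}  # two required fields are missing

validate_alpha_payload(complete)  # passes silently
try:
    validate_alpha_payload(incomplete)
except ValueError as err:
    print(err)
```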

In [1]:
#!pip install gpt4all langchain langchain-core langchain-community langchain-openai faiss-cpu pypdf langchain-anthropic

In [2]:
import os

# Optional, add tracing in LangSmith
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Alpha agent project"

In [3]:
from getpass import getpass

def _set_if_undefined(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass(f"Pass in {var}")

_set_if_undefined("LANGCHAIN_API_KEY")
_set_if_undefined("ANTHROPIC_API_KEY")

Pass in LANGCHAIN_API_KEY··········
Pass in ANTHROPIC_API_KEY··········


In [5]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "101-Alpha-Formula.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

22


In [6]:
print(docs[1].page_content)

2 
 1.  Introduction 
There are two complementary – and in some sense eve n competing – trends in modern 
quantitative trading.  On the one hand, more and mo re market participants (e.g., quantitative 
traders, inter alia) employ sophisticated quantitat ive techniques to mine alphas. 4  This results in 
ever fainter and more ephemeral alphas.  On the oth er hand, technological advances allow to 
essentially automate (much of) the alpha harvesting  process.  This yields an ever increasing 
number of alphas, whose count can be in hundreds of  thousands and even millions, and with 
the exponentially increasing progress in this field  will likely be in billions before we know it… 
This proliferation of alphas – albeit mostly faint and ephemeral – allows combining them in 
a sophisticated fashion to arrive at a unified “meg a-alpha”.  It is then this “mega-alpha” that is 
actually traded – as opposed to trading individual alphas – with a bonus of automatic internal 
crossing of trades (and 

In [7]:
print(docs[8].page_content)

9 
 Alpha#8:  (-1 * rank(((sum(open, 5) * sum(returns, 5)) - del ay((sum(open, 5) * sum(returns, 5)), 
10)))) 
Alpha#9:  ((0 < ts_min(delta(close, 1), 5)) ? delta(close, 1 ) : ((ts_max(delta(close, 1), 5) < 0) ? 
delta(close, 1) : (-1 * delta(close, 1)))) 
Alpha#10:  rank(((0 < ts_min(delta(close, 1), 4)) ? delta(clo se, 1) : ((ts_max(delta(close, 1), 4) < 0) 
? delta(close, 1) : (-1 * delta(close, 1))))) 
Alpha#11:  ((rank(ts_max((vwap - close), 3)) + rank(ts_min((v wap - close), 3))) * 
rank(delta(volume, 3))) 
Alpha#12:  (sign(delta(volume, 1)) * (-1 * delta(close, 1))) 
Alpha#13:  (-1 * rank(covariance(rank(close), rank(volume), 5 ))) 
Alpha#14:  ((-1 * rank(delta(returns, 3))) * correlation(open , volume, 10)) 
Alpha#15:  (-1 * sum(rank(correlation(rank(high), rank(volume ), 3)), 3)) 
Alpha#16:  (-1 * rank(covariance(rank(high), rank(volume), 5) )) 
Alpha#17:  (((-1 * rank(ts_rank(close, 10))) * rank(delta(del ta(close, 1), 1))) * 
rank(ts_rank((volume / adv20), 5))) 
Alpha#18:  (

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import FAISS
import math

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
doc_splits = splitter.split_documents(docs)

model_name = "all-MiniLM-L6-v2.gguf2.f16.gguf"
gpt4all_kwargs = {'allow_download': True}  # boolean flag rather than the string 'True'
embeddings = GPT4AllEmbeddings(
    model_name=model_name,
    gpt4all_kwargs=gpt4all_kwargs
)

# Add to vectorDB
vectorstore = FAISS.from_documents(
    documents=doc_splits,
    embedding=embeddings,
)
# Retrieve the top 80% of chunks, so most of the paper is passed to the model as context
number_of_relevant_docs = math.ceil(0.8 * len(doc_splits))

retriever = vectorstore.as_retriever(search_kwargs={'k': number_of_relevant_docs})
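Because `k` is derived from the number of chunks, the size of the retrieved context grows with the document. A small sketch of the arithmetic follows; `retrieval_k` is a hypothetical helper and the chunk counts are made up, since the notebook does not print `len(doc_splits)`.

```python
import math


def retrieval_k(num_chunks: int, fraction: float = 0.8) -> int:
    """Mirror math.ceil(0.8 * len(doc_splits)) for an arbitrary chunk count."""
    return math.ceil(fraction * num_chunks)


# Hypothetical chunk counts, for illustration only.
for num_chunks in (10, 50, 101):
    print(num_chunks, retrieval_k(num_chunks))
```

A small fixed `k` is the more common setting; the large value used here trades a longer prompt for a higher chance that the chunk containing the requested alpha is included.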

In [17]:
models = ['claude-3-opus-20240229', 'claude-3-5-sonnet-20240620', 'claude-3-sonnet-20240229', 'claude-3-haiku-20240307']

In [23]:
llm_model = models[0]
llm_model

'claude-3-opus-20240229'

In [24]:
from langchain_core.prompts import PromptTemplate
from langchain_anthropic import ChatAnthropic
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.runnables import RunnablePassthrough

class AlphaInfo(BaseModel):
    """This contains essential alpha information in the retrieved documents"""
    alpha_number: str = Field(description="This is the label of alpha expression in the retrieved documents. For example: Alpha#1, Alpha#2, ...., Alpha#101. ")
    alpha_expression: str = Field(description="This contains alpha expression in retrieved documents. For example: `(sign(delta(volume, 1)) * (-1 * delta(close, 1)))`. ")
    alpha_explanation: str = Field(description="Explanation of the alpha expression retrieved. Use your internal knowledge")

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Here is the context: \n
{context}

Question: {question}
"""

llm = ChatAnthropic(model=llm_model, default_headers={"anthropic-beta": "tools-2024-04-04"})
prompt = PromptTemplate.from_template(template)
# Join the retrieved chunks into a single context string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm.with_structured_output(AlphaInfo)
)
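The chain can be read as three plain steps: retrieve chunks, join them into one context string, and fill the prompt template before calling the model. The sketch below reproduces the joining and templating steps without LangChain or an API call. `FakeDoc` is a hypothetical stand-in for LangChain's `Document`, `build_prompt` is a hypothetical helper, and the two chunk texts are adapted from the page 9 output shown earlier (stray spaces removed).

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str


TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Here is the context:
{context}

Question: {question}
"""


def format_docs(docs):
    # Same joining rule as in the notebook: chunks separated by a blank line.
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(docs, question):
    # Fill both template slots, as the dict-plus-prompt stage of the chain does.
    return TEMPLATE.format(context=format_docs(docs), question=question)


retrieved = [
    FakeDoc("Alpha#12:  (sign(delta(volume, 1)) * (-1 * delta(close, 1)))"),
    FakeDoc("Alpha#13:  (-1 * rank(covariance(rank(close), rank(volume), 5)))"),
]
prompt_text = build_prompt(retrieved, "What is Alpha#12")
print(prompt_text)
```

In the real chain the filled prompt is then sent to the model, whose reply is parsed into an `AlphaInfo` object.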

In [25]:
rag_chain.invoke("What is Alpha#1")

AlphaInfo(alpha_number='Alpha#1', alpha_expression='(rank(Ts_ArgMax(SignedPower(((returns < 0) ? stddev(returns, 20) : close), 2.), 5)) - 0.5)', alpha_explanation='This alpha expression does the following:\n1. It calculates the daily returns and checks if the return is negative. \n2. If the return is negative, it takes the 20-day rolling standard deviation of returns. Otherwise it uses the closing price.\n3. It then takes the signed power of this value, essentially squaring it but keeping the sign.\n4. It finds the index of the maximum value of this signed power over the past 5 days.\n5. It takes the rank of this max index.\n6. Finally, it subtracts 0.5 from this rank.\nSo in summary, it is a complex way of ranking stocks based on the maximum of either squared negative volatility or squared price, using the past 5 days of data. The subtraction of 0.5 at the end centers the ranks around 0.')
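Because the explanation draws on the model's internal knowledge, it can be worth checking that `alpha_expression` at least matches the PDF text. The following is a minimal sketch, assuming each alpha on a page starts with an `Alpha#N:` label as in the page 9 output shown earlier. `extract_alphas` and `normalize` are hypothetical helpers, and all whitespace is removed before comparing because the PDF extraction inserts stray spaces.

```python
import re

# Label, then everything up to the next label or the end of the page text.
ALPHA_PATTERN = re.compile(r"Alpha#(\d+):\s*(.*?)(?=Alpha#\d+:|\Z)", re.DOTALL)


def normalize(expression: str) -> str:
    """Remove all whitespace so extraction artifacts do not affect the comparison."""
    return re.sub(r"\s+", "", expression)


def extract_alphas(page_text: str) -> dict:
    """Map 'Alpha#N' labels to whitespace-free expressions found in the page text."""
    return {
        f"Alpha#{number}": normalize(body)
        for number, body in ALPHA_PATTERN.findall(page_text)
    }


# Excerpt taken from the page 9 output, including its stray space before ')))'.
page_text = (
    "Alpha#12:  (sign(delta(volume, 1)) * (-1 * delta(close, 1))) \n"
    "Alpha#13:  (-1 * rank(covariance(rank(close), rank(volume), 5 ))) \n"
)
found = extract_alphas(page_text)

# Hypothetical model answer for Alpha#12, to be compared against the source text.
answer = "(sign(delta(volume, 1)) * (-1 * delta(close, 1)))"
print(found["Alpha#12"] == normalize(answer))
```

An expression that is cut off at a page boundary, such as `Alpha#18` in the output above, would be extracted incomplete, so a mismatch is a prompt for manual review rather than proof of a wrong answer.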

## LangSmith trace

The public trace below lets anyone inspect every step taken to answer the question.

https://smith.langchain.com/public/80dfa37f-f3f1-4bb4-afa7-952a9e413c2b/r