# PROBLEM STATEMENT
The primary objective of this work is to create a RAG pipeline over the `101 alpha formulaic book` to extract each alpha's number, expression, and explanation. Ultimately, this will serve as the basis for an end-to-end alpha agent that discovers alpha signals over a knowledge base, creates their Python implementations, performs backtesting, and trades live if the expected performance passes a satisfactory threshold.

# TODO
- Use the __ChatGroq model__ to increase inference speed.
- Use a more capable model to generate alpha information.

# CHALLENGES
- Sometimes the model fails to generate the alpha information because of an `OutputParserException`.
- The alpha expression and the alpha number returned do not match what is in the document (hallucination). At this point, it is entirely possible that the returned alpha expressions are not contained in the document at all and are instead generated from the model's internal knowledge.
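
The hallucination described above can be checked without another LLM call. The following is a minimal sketch, not part of the pipeline below: it tests whether a generated expression appears verbatim in the retrieved text once whitespace is ignored. The function names and the sample chunk are hypothetical and exist only for illustration.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Drop all whitespace so PDF line breaks and spacing do not affect the match."""
    return re.sub(r"\s+", "", text)


def expression_in_documents(expression: str, doc_texts: List[str]) -> bool:
    """Return True if the expression appears verbatim in any retrieved chunk."""
    target = normalize(expression)
    return bool(target) and any(target in normalize(t) for t in doc_texts)


# Hypothetical chunk text, for illustration only (not copied from the PDF)
chunks = ["Alpha#X: (-1 * delta(close,\n 1))"]

found = expression_in_documents("(-1 * delta(close, 1))", chunks)
missing = expression_in_documents("rank(delta(close, 7))", chunks)
print(found, missing)
```

A check like this only confirms that the expression exists somewhere in the retrieved chunks. It does not confirm that the expression belongs to the alpha number reported next to it.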

# POSSIBLE SOLUTIONS TO CHALLENGES
- Use a more advanced LLM and embedding method, which may help address the hallucination issue. I am refraining from using an extraction approach because, in the long term, we expect to have a large, rich repository of papers and documents containing alpha information. Improved LLM understanding of advanced maths should also help considerably in this regard.

In [69]:
import os

# Optional, add tracing in LangSmith
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Alpha agent project"

In [70]:
from getpass import getpass

def _set_if_undefined(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass(f"Pass in {var}")

_set_if_undefined("LANGCHAIN_API_KEY")

## Ground an alpha retriever to document provided

The goal of this work is to generate alpha expression and explanation based on the document provided. If response not relevant to document, we return not

Based on the workflow above, the following steps are followed:
1. Given a `query`, we retrieve relevant documents.
2. We then use a `binary score` to check whether each retrieved document is relevant.
3. __If at least one document is relevant__, we generate an answer. Otherwise, we `re-write` the `query`.
4. For the answer generated in step 3, we check whether it is grounded in the documents (to reduce hallucination).
5. __If there is no hallucination__, we then check whether the `answer` is relevant to the question. Otherwise, `repeat step 3`.
6. __If the answer is relevant to the question__, return the answer. Otherwise, `re-write` the `query` and `repeat step 1`.
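
The steps above can be sketched as a plain Python loop, independent of any graph library. This is an illustration of the control flow only. Every callable is a stand-in supplied by the caller, and `max_rounds` is an assumption added here so that the loop always terminates. As a simplification, an ungrounded answer triggers a fresh round (retrieve, then generate) rather than regenerating in place.

```python
from typing import Callable, List, Optional


def run_workflow(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    generate: Callable[[str, List[str]], str],
    is_grounded: Callable[[str, List[str]], bool],
    is_useful: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    max_rounds: int = 3,  # hypothetical cap, not part of the notebook
) -> Optional[str]:
    for _ in range(max_rounds):
        # Steps 1-2: retrieve, then keep only documents graded as relevant
        docs = [d for d in retrieve(question) if is_relevant(question, d)]
        if not docs:
            question = rewrite(question)  # step 3: nothing relevant, rewrite
            continue
        answer = generate(question, docs)  # step 3: generate an answer
        if not is_grounded(answer, docs):  # steps 4-5: hallucination check
            continue
        if is_useful(answer, question):  # step 6: answer addresses question
            return answer
        question = rewrite(question)  # step 6: otherwise rewrite and retry
    return None


# Usage with trivial stand-ins for the LLM-backed pieces
answer = run_workflow(
    "flow of funds alpha",
    retrieve=lambda q: ["doc about volume", "unrelated"],
    is_relevant=lambda q, d: "volume" in d,
    generate=lambda q, docs: f"answer from {len(docs)} document(s)",
    is_grounded=lambda a, docs: True,
    is_useful=lambda a, q: True,
    rewrite=lambda q: q + " (rewritten)",
)
print(answer)
```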

In [71]:
# Install the required libraries (uncomment the line below)
#!pip install pypdf gpt4all langchain langchain-core langchain-community langgraph chromadb

In [72]:
# LLM
local_llm = 'llama3'

In [73]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import Chroma

loader = PyPDFLoader("101-Alpha-Formula.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
doc_splits = splitter.split_documents(documents)

# Add to vectorDB
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="alpha-doc-chroma",
    embedding=GPT4AllEmbeddings(),
)
retriever = vectorstore.as_retriever()
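
With `chunk_size=500` and `chunk_overlap=0`, a chunk boundary can fall between an alpha number and its expression, which may contribute to the mismatches listed under CHALLENGES. One option, sketched below, is to split the text at each alpha label so that every chunk holds one complete entry. This only changes chunk boundaries; retrieval and generation still go through the LLM. The sketch assumes the extracted PDF text labels entries as `Alpha#<n>`, as in the `list_of_alpha_numbers` examples used later. The function name and the sample text are hypothetical.

```python
import re
from typing import List


def split_by_alpha(text: str) -> List[str]:
    """Split text so that each chunk starts at an 'Alpha#<n>' label."""
    # The zero-width lookahead keeps the label at the start of each chunk
    parts = re.split(r"(?=Alpha#\d+)", text)
    return [p.strip() for p in parts if p.strip().startswith("Alpha#")]


# Hypothetical text, for illustration only (not copied from the PDF)
sample = "Intro text. Alpha#1: rank(close) Alpha#2: (-1 * delta(close, 1))"
alpha_chunks = split_by_alpha(sample)
print(alpha_chunks)
```

Text that mentions an alpha label in running prose would also be split at that point, so the output would need a manual check against the real document.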

## Create a graph state

Our state will be a `dict`. Any graph `node` can access its contents as `state["keys"]`.


In [74]:
from typing import Any, Dict, TypedDict

class AlphaRagState(TypedDict):
    """
    Represents the graph state of the alpha agent

    Attributes:
        keys: A dictionary where each key is a string.
    """
    keys: Dict[str, Any]
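
As a small illustration of this state shape, the sketch below shows a node that reads `state["keys"]` and returns a new state. The `State` class mirrors the one above, while `add_note` and the `note` key are hypothetical. Copying the existing entries forward with `**state_dict` is a design choice shown here, not what the nodes below do: they rebuild `keys` from an explicit list of entries, so anything left out of that list is not passed on.

```python
from typing import Any, Dict, TypedDict


class State(TypedDict):
    keys: Dict[str, Any]


def add_note(state: State) -> State:
    """Hypothetical node: copies the existing keys and adds one new key."""
    state_dict = state["keys"]
    return {"keys": {**state_dict, "note": "visited"}}


state: State = {"keys": {"question": "What is Alpha#1?"}}
new_state = add_note(state)
print(new_state)
```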

## Define nodes for our workflow
Create the following nodes:
1. retrieve
2. grade_documents
3. generate
4. transform_query
5. prepare_for_final_grade

In [75]:
# retrieve node gets relevant documents based on a user query
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_core.output_parsers.openai_tools import PydanticToolsParser
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.utils.function_calling import convert_to_openai_tool

def retrieve(state):
    """
    Retrieves relevant documents

    Args: 
        state (dict): current graph state

    Returns:
        state (dict): New key added to the state, documents, contains relevant documents to question 
    """
    print("-- RETRIEVING DOCUMENTS --")
    print("State: ", state)
    state_dict = state["keys"]
    question = state_dict["question"]
    relevant_docs = retriever.get_relevant_documents(question)

    return {"keys": { "documents": relevant_docs, "question": question}}
    
# grade_documents node
def grade_documents(state):
    """
    Determines whether the retrieved documents are relevant to the question.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Updates documents key with relevant documents
    """

    print("---CHECKING RELEVANCE---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]

    # LLM
    llm = ChatOllama(model=local_llm, format="json", temperature=0)

    # prompt
    prompt = PromptTemplate(
    template="""You are a grader assessing relevance 
    of a retrieved document to a user question. If the document contains keywords related to the user question, 
    grade it as relevant. It does not need to be a stringent test. The goal is to filter out erroneous retrievals. 
    
    Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the question.
    Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
     
    Here is the retrieved document: 
    {document}
    
    Here is the user question: 
    {question}
    """,
    input_variables=["question", "document"],
)

    # Chain
    chain = prompt | llm | JsonOutputParser()

    # Score
    filtered_docs = []
    for d in documents:
        binary_result = chain.invoke({"question": question, "document": d.page_content})
        print("Score result: ", binary_result)
        grade = binary_result["score"]

        print("Grade: ", grade)
        if grade == "yes":
            print("---GRADE: DOCUMENT RELEVANT---")
            filtered_docs.append(d)
        else:
            print("---GRADE: DOCUMENT NOT RELEVANT---")
            continue

    return {"keys": {"documents": filtered_docs, "question": question}}
            
# generate node for generating answer to questions
def generate(state):
    """
    Generate answer

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, generation, that contains LLM generation
    """
    print("---GENERATE---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]

    list_of_alpha_numbers = ['Alpha#1', 'Alpha#2', 'Alpha#3', 'Alpha#101']

    # Prompt
    prompt = PromptTemplate(
    template= """You are an assistant for question-answering tasks. \n
    Use the retrieved documents to answer the question. If you don't know the answer, just say that you don't know. \n

    Please provide an answer as a JSON with 3 keys: 'alpha_number', 'alpha_expression', and 'alpha_explanation'. \n
    Please ensure 'alpha_number' coincide with 'alpha_expression' as provided in the retrieved documents. \n
    Ensure that the 'alpha_explanation' is a description of the 'alpha_expression'. \n

    Here are examples of alpha numbers in the documents:
    {list_of_alpha_numbers} \n
    
    Here are 3 examples of alpha expression that can be extracted from documents:
    ((-1 * rank((delta(close, 7) * (1 - rank(decay_linear((volume / adv20), 9)))))) * (1 + rank(sum(returns, 250)))) \n
    rank((-1 * ((1 - (open / close))^1))) \n
    (-1 * delta((((close - low) - (high - close)) / (close - low)), 9)) \n
     
    Here are the retrieved documents: 
    {documents}
    
    Here is the user question: 
    {question}
    """,
    input_variables=["question", "documents", "list_of_alpha_numbers"],
    )

    # LLM
    llm = ChatOllama(model=local_llm, temperature=0)

    # Post-processing: join the page contents so the prompt receives plain text
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    # Chain
    rag_chain = prompt | llm | JsonOutputParser()

    # Run
    generation = rag_chain.invoke({"documents": format_docs(documents), "question": question, "list_of_alpha_numbers": list_of_alpha_numbers})
    
    print("Generated response to alpha query: ", generation)
    return {
        "keys": {"documents": documents, "question": question, "generation": generation}
    }

# transform_query node: re-phrases the question to improve retrieval
def transform_query(state):
    """
    Transform the query to produce a better question.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Updates question key with a re-phrased question
    """

    print("---TRANSFORMING QUERY---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]

    # Create a prompt template with format instructions and the query
    prompt = PromptTemplate(
        template="""You are generating an alpha trading idea question / query that is well optimized for retrieval. \n 
        Look at the input and try to reason about the underlying semantic intent / meaning. \n 
        Here is the initial question:
        \n ------- \n
        {question} 
        \n ------- \n
        Formulate an improved alpha trading idea question: """,
        input_variables=["question"],
    )

    # LLM
    llm = ChatOllama(model=local_llm, temperature=0)

    # Chain
    chain = prompt | llm | StrOutputParser()
    better_question = chain.invoke({"question": question})

    return {"keys": {"documents": documents, "question": better_question}}

# prepare_for_final_grade node: passes the state through to the final grade
def prepare_for_final_grade(state):
    """
    Passthrough state for final grade.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): The current graph state
    """

    print("---FINAL GRADE---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]
    generation = state_dict["generation"]

    return {
        "keys": {"documents": documents, "question": question, "generation": generation}
    }
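
One possible cause of the `OutputParserException` listed under CHALLENGES, not confirmed here, is a reply that wraps the JSON in extra prose; the `generate` node creates its LLM without `format="json"`, unlike the graders. The sketch below shows a fallback that pulls the first JSON object out of such a reply using only the standard library. The function name `first_json_object` and the sample reply are hypothetical.

```python
import json
from typing import Optional


def first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at index i
            obj, _ = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


# Hypothetical model reply, for illustration only
reply = 'Sure! {"alpha_number": "Alpha#1", "alpha_expression": "rank(close)"} Hope this helps.'
parsed = first_json_object(reply)
print(parsed)
```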


## Define edges for our workflow
Create the following edges:
1. decide_to_generate
2. decide_generation_is_grounded_in_documents
3. decide_generation_addresses_question

In [76]:
def decide_to_generate(state):
    """
    Determines whether to generate an answer, or re-generate a question.

    Args:
        state (dict): The current state of the agent, including all keys.

    Returns:
        str: Next node to call
    """

    print("---DECIDE TO GENERATE OR TRANSFORM QUERY---")
    state_dict = state["keys"]
    question = state_dict["question"]
    filtered_documents = state_dict["documents"]

    if not filtered_documents:
        # All documents were filtered out by the relevance check
        # We will re-generate a new query
        print("---DECISION: TRANSFORM QUERY---")
        return "transform_query"
    else:
        # We have relevant documents, so generate answer
        print("---DECISION: GENERATE---")
        return "generate"


def decide_generation_is_grounded_in_documents(state):
    """
    Determines whether the generation is grounded in the document.

    Args:
        state (dict): The current state of the agent, including all keys.

    Returns:
        str: Binary decision
    """

    print("---GRADE GENERATION based on DOCUMENTS---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]
    generation = state_dict["generation"]

    # LLM
    llm = ChatOllama(model=local_llm, format="json", temperature=0)

    # Prompt
    prompt = PromptTemplate(
    template="""You are a grader assessing whether 
    an answer is grounded in / supported by a set of facts. Give a binary score 'yes' or 'no' to indicate 
    whether the answer is grounded in / supported by a set of facts. Provide the binary score as a JSON with a 
    single key 'score' and no preamble or explanation.
    
    Here are the facts:
    {documents} 

    Here is the answer: 
    {generation}
    """,
    input_variables=["generation", "documents"],
    )

    # Chain
    chain = prompt | llm | JsonOutputParser()

    binary_result = chain.invoke({"generation": generation, "documents": documents})
    grade = binary_result["score"]

    if grade == "yes":
        print("---DECISION: SUPPORTED, MOVE TO FINAL GRADE---")
        return "grounded"
    else:
        print("---DECISION: NOT SUPPORTED, GENERATE AGAIN---")
        return "not grounded"


def decide_generation_addresses_question(state):
    """
    Determines whether the generation addresses the question.

    Args:
        state (dict): The current state of the agent, including all keys.

    Returns:
        str: Binary decision
    """

    print("---GRADE GENERATION vs QUESTION---")
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = state_dict["documents"]
    generation = state_dict["generation"]

    # LLM
    llm = ChatOllama(model=local_llm, format="json", temperature=0)

    # Prompt
    prompt = PromptTemplate(
    template="""You are a grader assessing whether an 
    answer is useful to resolve a question. Give a binary score 'yes' or 'no' to indicate whether the answer is 
    useful to resolve a question. Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
     
    Here is the answer:
    {generation} 

    Here is the question: {question}
    """,
    input_variables=["generation", "question"],
    )

    answer_grader = prompt | llm | JsonOutputParser()
    binary_result = answer_grader.invoke({"question": question,"generation": generation})
    grade = binary_result["score"]
    
    if grade == "yes":
        print("---DECISION: USEFUL---")
        return "useful"
    else:
        print("---DECISION: NOT USEFUL---")
        return "not useful"
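
Each grader above reads `binary_result["score"]` directly and compares it with the exact string `"yes"`. If the model returns a different key, a different capitalisation, or something that is not a dictionary, that lookup either raises or silently routes to the negative branch. A small helper could normalise the result first. This is a sketch, and `read_score` with its default of `"no"` is an assumption, not part of the notebook.

```python
from typing import Any


def read_score(result: Any, default: str = "no") -> str:
    """Return 'yes' or 'no' from a grader result, falling back to `default`."""
    if not isinstance(result, dict):
        return default
    score = str(result.get("score", default)).strip().lower()
    return score if score in {"yes", "no"} else default


print(read_score({"score": " Yes "}))
print(read_score({"grade": "yes"}))
```

Defaulting to `"no"` treats an unreadable grade as a failed check, which sends the graph back to regenerate or rewrite rather than accepting an unverified answer.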

## Build graph

In [77]:
import pprint

from langgraph.graph import END, StateGraph

workflow = StateGraph(AlphaRagState)

# Define the nodes
workflow.add_node("retrieve", retrieve)  # retrieve
workflow.add_node("grade_documents", grade_documents)  # grade documents
workflow.add_node("generate", generate)  # generate
workflow.add_node("transform_query", transform_query)  # transform_query
workflow.add_node("prepare_for_final_grade", prepare_for_final_grade)  # passthrough

# Build graph
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {
        "transform_query": "transform_query",
        "generate": "generate",
    },
)
workflow.add_edge("transform_query", "retrieve")
workflow.add_conditional_edges(
    "generate",
    decide_generation_is_grounded_in_documents,
    {
        "grounded": "prepare_for_final_grade",
        "not grounded": "generate",
    },
)
workflow.add_conditional_edges(
    "prepare_for_final_grade",
    decide_generation_addresses_question,
    {
        "useful": END,
        "not useful": "transform_query",
    },
)

# Compile
app = workflow.compile()
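
The `not grounded` edge sends the graph straight back to `generate` with the same documents and question, and the LLM runs at `temperature=0`, so a rejected answer may well be regenerated unchanged. LangGraph stops a run once its recursion limit is reached, but an explicit cap makes the exit deliberate. The sketch below shows the idea in plain Python: the node increments a counter stored in the state keys and the router gives up after a fixed number of attempts. `MAX_GENERATIONS`, the `attempts` key, and the `give up` label are all hypothetical; the label would have to be mapped to `END` in `add_conditional_edges`.

```python
MAX_GENERATIONS = 3  # hypothetical cap on attempts at the generate node


def count_generation(state_dict: dict) -> dict:
    """Return a copy of the keys with the generation counter incremented."""
    return {**state_dict, "attempts": state_dict.get("attempts", 0) + 1}


def route_after_grounding_check(state_dict: dict, grounded: bool) -> str:
    """Pick the next edge label, giving up once the cap is reached."""
    if grounded:
        return "grounded"
    if state_dict.get("attempts", 0) >= MAX_GENERATIONS:
        return "give up"
    return "not grounded"


# Simulate three rejected generations in a row
keys = {"question": "q"}
labels = []
for _ in range(MAX_GENERATIONS):
    keys = count_generation(keys)
    labels.append(route_after_grounding_check(keys, grounded=False))
print(labels)
```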

In [78]:
# Test
import time
from pprint import pprint

start = time.time()
inputs = {"keys": {"question": "Generate alpha expression and explanation based on flow of funds strategy"}}
for output in app.stream(inputs):
    for key, value in output.items():
        pprint(f"Finished running: {key}")

end = time.time()
print(f"Elapsed time: {end - start}s")

-- RETRIEVING DOCUMENTS --
State:  {'keys': {'question': 'Generate alpha expression and explanation based on flow of funds strategy'}}
'Finished running: retrieve'
---CHECKING RELEVANCE---
Score result:  {'score': 'yes'}
Grade:  yes
---GRADE: DOCUMENT RELEVANT---
Score result:  {'score': 'yes'}
Grade:  yes
---GRADE: DOCUMENT RELEVANT---
Score result:  {'score': 'yes'}
Grade:  yes
---GRADE: DOCUMENT RELEVANT---
Score result:  {'score': 'yes'}
Grade:  yes
---GRADE: DOCUMENT RELEVANT---
---DECIDE TO GENERATE OR TRANSFORM QUERY---
---DECISION: GENERATE---
'Finished running: grade_documents'
---GENERATE---
Generated response to alpha query:  {'alpha_number': 'Alpha#101', 'alpha_expression': '((-1 * rank((delta(close, 7) * (1 - rank(decay_linear(volume / adv20), 9))))) * (1 + rank(sum(returns, 250))))', 'alpha_explanation': 'This alpha expression is based on the flow of funds strategy, which measures the rate of change in trading volume and adjusts it for market conditions. The expression us

__Wrong example generated without few-shot examples__
1. Generated response to alpha query:  [{'alpha_expression': 'Flow of Funds Alpha', 'explanation': 'The Flow of Funds Alpha strategy involves analyzing the movement of funds between different asset classes, sectors, or industries to identify trends and patterns. This approach can help identify alpha opportunities by capturing changes in investor sentiment, market conditions, and economic indicators.'}].
   
__Example generated after passing few-shot examples__ (gave a generic explanation of the flow of funds strategy instead of `providing an explanation of the alpha expression`)
3. Generated response to alpha query:  [{'alpha_expression': '-1 * rank(decay_linear(volume / adv20), 9)) * (1 + rank(sum(returns, 250)))', 'explanation': 'The flow of funds strategy is based on the idea that the movement of money in and out of a portfolio can be a valuable indicator of future stock performance. By analyzing the volume data and smoothing it out using the `decay_linear` function, we can create a signal that captures the momentum of the market.'}]