# Valuation Actuary's Q&A Machine using Retrieval Augmented Generation (RAG)
This project builds a Retrieval-Augmented Generation (RAG) process that lets valuation actuaries ask questions about a set of documents. RAG uses a Large Language Model (LLM) to answer questions from the content of those specific documents.

However, RAG is not without challenges, such as hallucination and inaccuracy. This code supports verification by displaying the context the LLM used to arrive at each answer. Actuaries can check the answer against the source text before relying on it, which helps them make informed decisions. Combining the LLM with verifiable sources gives actuaries a practical way to use the technology effectively.

# 1. Initial Setup
This setup includes loading environment variables from a `.env` file, setting the required environment variables, and importing the necessary modules for further processing. It ensures that the code has access to the required APIs and functions for the subsequent tasks.


In [1]:
# Initial set up
from dotenv import load_dotenv
import os

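# Load the variables from the .env file (not part of the repo); alternatively, set the API key manually
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    # os.environ only accepts strings, so fail early with a clear message
    raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or set it manually")
os.environ["OPENAI_API_KEY"] = api_key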

# Import the necessary modules
from langchain import hub
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel # for RAG with source
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from IPython.display import display, Markdown, Latex
import glob
import chromadb
from semantic_text_splitter import CharacterTextSplitter

In [14]:
## Initial variable setup
embeddings_model = OpenAIEmbeddings()
db_directory = "./data/chroma"
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0) # context window size 16k for GPT 3.5 Turbo    

# 2. Load PDF Files and Convert to a Vector DB
1. Define `load_pdfs_from_folder()`, which takes a folder path as input and returns a list of text documents extracted from the PDF files in that folder.

2. The example loops over the names in `collection_list` and reads each collection from `./data/<collection_name>` (for example, `./data/ASOP_life`). Modify the list or the path to point to your own folders.

3. Calling `load_pdfs_from_folder()` with a folder path loads the PDF files, extracts the text with `PyPDFLoader`, and stores the extracted documents (one per page) in the `docs` list.

4. A `RecursiveCharacterTextSplitter` is then created with a chunk size of 1000 characters and an overlap of 200 characters. Its `split_documents()` method splits the documents into smaller chunks based on those parameters.

5. Finally, a Chroma vectorstore is created from the document splits for each collection. The vectorstore embeds the chunks with the embedding model defined above and is saved to the predefined directory (`db_directory`).
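
The effect of the chunk size and overlap in step 4 can be pictured with a simplified fixed-window splitter. This is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks vary in length. The function name `sliding_window_chunks` and the 2,600-character stand-in page are assumptions made for illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

# Stand-in for one page of extracted text (hypothetical length of 2,600 characters)
page_text = "".join(chr(97 + i % 26) for i in range(2600))
chunks = sliding_window_chunks(page_text)
print([len(c) for c in chunks])  # three windows of 1000 characters each
```

The last 200 characters of one chunk repeat as the first 200 characters of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.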

In [15]:
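# Return only the file name portion of a path (os.path.basename handles the platform's path separator)
def get_file_name(source_path):
    return os.path.basename(source_path)

# Define a function to load and extract text from PDFs in a folder
def load_pdfs_from_folder(folder_path):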
    # Get a list of PDF files in the specified folder
    pdf_files = glob.glob(f"{folder_path}/*.pdf")
    docs = []
    for pdf_file in pdf_files:
        file_name = get_file_name(pdf_file)
        
        # Load the PDF file using the PyPDFLoader
        loader = PyPDFLoader(pdf_file)
        loaded_docs = loader.load()
        
        for doc in loaded_docs:
            doc.metadata['source'] = file_name
        
        docs.extend(loaded_docs)
    return docs

In [16]:
collection_list=[
    "ASOP_life",
    "Bermuda",
    "CFT",
    "VM21",
    "VM22"
]

for collection_name in collection_list:
    # Folder that holds the PDF files for this collection
    folder_path = './data/' + collection_name

    # Call the function to load and extract text from PDFs in the specified folder
    docs = load_pdfs_from_folder(folder_path)
    # Create a text splitter object with specified parameters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,  # 1000 characters splits a page into roughly 3 chunks
        chunk_overlap=200,  # consecutive chunks share up to 200 characters
        length_function=len,
    )

    # Split the documents into chunks using the text splitter
    splits = text_splitter.split_documents(docs)


    # Create a Chroma vector database from the document splits, using the OpenAI embedding model
    vectorstore = Chroma.from_documents(documents=splits,
                                        embedding=embeddings_model,
                                        persist_directory=db_directory,
                                        collection_name=collection_name)
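
Because no IDs are passed, each run of the loop above stores the chunks under freshly generated IDs, so re-running the cell against the same directory adds the same chunks again. One possible way to make the load repeatable is to derive IDs from the chunk content, sketched below. The helper `stable_chunk_ids` and the example tuples are hypothetical. `Chroma.from_documents` accepts an `ids` argument, but whether re-adding an existing ID overwrites the entry depends on the installed versions, so treat that behavior as an assumption to verify.

```python
import hashlib
from collections import Counter

def stable_chunk_ids(chunks):
    """Build repeatable IDs from (source, page, text) tuples.

    Identical chunks get a numeric suffix so IDs stay unique within one batch.
    """
    seen = Counter()
    ids = []
    for source, page, text in chunks:
        digest = hashlib.sha256(f"{source}|{page}|{text}".encode("utf-8")).hexdigest()[:32]
        seen[digest] += 1
        ids.append(f"{digest}-{seen[digest]}")
    return ids

# Hypothetical chunks; in the notebook these would be built from
# (doc.metadata['source'], doc.metadata['page'], doc.page_content) for doc in splits
example_chunks = [
    ("a.pdf", 0, "first chunk"),
    ("a.pdf", 0, "second chunk"),
    ("a.pdf", 0, "first chunk"),  # exact duplicate of the first entry
]
example_ids = stable_chunk_ids(example_chunks)
print(example_ids)
```

The resulting list could then be passed as `ids=` alongside `documents=splits`, with one ID per chunk in the same order.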

# 3. Retrieve from the Vector DB 

In [17]:
# Get a Chroma vector database with specified parameters
vectorstore = Chroma(embedding_function=embeddings_model, 
                     persist_directory=db_directory,
                     collection_name="ASOP_life")
# To query a different document set, use any other collection name from collection_list

In [18]:
## Retrieve and RAG chain

# Create a retriever using the vector database as the search source
retriever = vectorstore.as_retriever(search_type="mmr",
                                     search_kwargs={'k': 6, 'lambda_mult': 0.5})
# MMR (Maximal Marginal Relevance) returns documents that are both similar to the query and diverse among themselves.
# k=6 raises the number of returned documents above the default of 4; lambda_mult controls diversity
# (0.5 is the default, 0 gives maximum diversity, 1 gives minimum diversity).

# Load the RAG (Retrieval-Augmented Generation) prompt
prompt = hub.pull("rlm/rag-prompt")

# Define a function to format the documents with their sources and pages
def format_docs_with_sources(docs):
    formatted_docs = "\n\n".join(doc.page_content for doc in docs)
    sources_pages = "\n".join(f"{doc.metadata['source']} (Page {doc.metadata['page'] + 1})" for doc in docs)
    # Added 1 to the page number assuming 'page' starts at 0 and we want to present it in a user-friendly way

    return f"Documents:\n{formatted_docs}\n\nSources and Pages:\n{sources_pages}"

# Create a RAG chain using the formatted documents as the context
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs_with_sources(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

# Create a parallel chain for retrieving and generating answers
rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)
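
The MMR selection used by the retriever above can be sketched in plain Python. This is a toy illustration with hand-picked 2-dimensional vectors, not the library's implementation: the names `cosine` and `mmr_select` and all vectors are assumptions, and a real retriever first fetches a larger candidate pool from the vector database before re-ranking it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already selected
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs = [
    [0.9, 0.10],   # most relevant
    [0.9, 0.12],   # near-duplicate of the first document
    [0.7, -0.70],  # less relevant, but different content
]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # relevance only: picks the near-duplicates
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # balanced: skips the near-duplicate
```

With `lambda_mult=1.0` the second pick is the near-duplicate, while `lambda_mult=0.5` replaces it with the less similar but non-redundant document, which is the behavior the retriever settings aim for.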

# 4. Generate Q&A Function

In [19]:
def generate_output():
    # Prompt the user for a question on ASOP
    usr_input = input("What is your question on ASOP?: ")

    # Invoke the RAG chain with the user input as the question
    output = rag_chain_with_source.invoke(usr_input)

    # Generate the Markdown output with the question, answer, and context
    markdown_output = "### Question\n{}\n\n### Answer\n{}\n\n### Context\n".format(output['question'], output['answer'])

    last_page_content = None  # Variable to store the last page content
    i = 1 # Source indicator

    # Iterate over the context documents to format and include them in the output
    for doc in output['context']:
        current_page_content = doc.page_content.replace('\n', '  \n')  # Get the current page content
        
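        # Check if the current content is different from the previous one (skips consecutive duplicates only)
        if current_page_content != last_page_content:
            # Note: 'page' is printed as stored, which is assumed to be a 0-based index
            # (see format_docs_with_sources), so the PDF page number is one higher
            markdown_output += "- **Source {}**: {}, page {}:\n\n{}\n".format(i, doc.metadata['source'], doc.metadata['page'], current_page_content)
            i = i + 1
        last_page_content = current_page_content  # Update the last page content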
    
    # Display the Markdown output
    display(Markdown(markdown_output))
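
A possible variant of the formatting step, sketched under assumptions: separating the Markdown construction from `input()` and `display()` makes it testable without a notebook or an API call. The names `FakeDoc` and `build_markdown` are hypothetical; `FakeDoc` only mimics the `page_content` and `metadata` attributes used above. Unlike `generate_output()`, this sketch skips repeated chunks anywhere in the context (not only consecutive ones) and prints 1-based page numbers.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def build_markdown(question, answer, context_docs):
    """Return the question, answer, and de-duplicated context as one Markdown string."""
    parts = [f"### Question\n{question}\n\n### Answer\n{answer}\n\n### Context\n"]
    seen = set()
    for doc in context_docs:
        if doc.page_content in seen:
            continue  # skip chunks that were already listed
        seen.add(doc.page_content)
        body = doc.page_content.replace("\n", "  \n")  # two spaces force a Markdown line break
        parts.append(
            f"- **Source {len(seen)}**: {doc.metadata['source']}, "
            f"page {doc.metadata['page'] + 1}:\n\n{body}\n"
        )
    return "".join(parts)

# Hypothetical retrieved context with one repeated chunk
context = [
    FakeDoc("chunk A\nline 2", {"source": "a.pdf", "page": 3}),
    FakeDoc("chunk B", {"source": "b.pdf", "page": 0}),
    FakeDoc("chunk A\nline 2", {"source": "a.pdf", "page": 3}),
]
markdown_text = build_markdown("example question", "example answer", context)
print(markdown_text)
```

In the notebook, the returned string could be passed to `display(Markdown(...))` using the `question`, `answer`, and `context` entries of the chain output.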

# Example questions related to ASOPs
- Explain ASOP No. 14
- How are expenses reflected in cash flow testing based on ASOP No. 22?
- What is catastrophe risk?
- When do I update assumptions?
- What should I do when I do not have credible data to develop non-economic assumptions?

In [20]:
generate_output()

### Question
Explain ASOP 22

### Answer
ASOP 22 covers opinions required under section 8 and was adopted in October 1993 to provide guidance on opinions required under section 7. It was revised in the late 1990s and early 2000s, incorporating portions of ASOP No. 14 and repealed it in 2001. The revisions made to ASOP 22 clarified the intent of the standard without resulting in substantive changes.

### Context
- **Source 1**: asop001_170.pdf, page 11:

ASOPs have been added and, where the te rms added also appear in the Code,  they have been   
made consistent. In addition, an effort has been made to replace undefined terms or phrases with   
phrases that include terms that are defi ned, discussed, or used in the Code.    
    
Role and Scope of ASOPs   
   
 The Introductory ASOP has been revised to clar ify the role and scope of ASOPs. While ASOPs   
are binding on actuaries rendering actuarial services in the U.S., the Introductory ASOP now   
more directly acknowledges that actuaries are subject to a range of requirements and   
considerations that may affect how they do th eir work. These include legal and regulatory   
requirements, their employer’s peer review or other quality assurance processes and policies,   
continuing education requirements, the Code, a nd the actuary’s own professional and ethical   
standards. Because the ASOPs are not overly pres criptive and allow for disclosed deviations, the
- **Source 2**: asop022_203_CFTstatement.pdf, page 3:

ASOP No. 22 to cover opinions required under only section 8 and adopted Actuarial Compliance   
Guideline (ACG) No. 4, Statutory Statements of Opinion Not Including an Asset Adequacy   
Analysis by Appoint ed Actuaries for Life or Health Insurers , in October 1993 to provide   
guidance on opinions required under section 7. At the time of this revision to ASOP No. 22,   
ACG No. 4 continues to be relevant for actuaries working for companies that receive an   
exemptio n from asset adequacy analysis.    
   
In the late 1990s and early 2000s, the ASB reviewed all standards of practice related to cash flow   
testing. Portions of ASOP No. 14, When to Do Cash Flow Testing for Life and Health Insurance   
Companies , were incorporated in to ASOP No. 7, Analysis of Life, Health, or Property/Casualty   
Insurer Cash Flows , and ASOP No. 22. In 2001, the ASB adopted the revised ASOP No. 7 and   
ASOP No. 22 and repealed ASOP No. 14.
- **Source 3**: asop052_189_PBR.pdf, page 4:

and therefore made updates. The task force al so made minor clarifications and provided   
additional guidance in a few sec tions of the exposure draft.   
 In March 2017, the ASB approved the exposure draft with a comment deadline of May 31,   
2017. Fourteen comment letters were received and considered in making changes that are   
reflected in this final ASOP. For a summary of  issues contained in these comment letters,   
please see appendix 2. In general, the revisions provided clarification of the intent of the   
standard and did not result in subs tantive change to the standard.    
 Because VM-20 is a new method for statut ory valuation, the ASB expects numerous   
amendments to the Valuation Manual  over the next few years. The following language has been   
included in section 1.2 of this ASOP to address this:  
 “In the event of a conflict between the   
provisions of the Valuation Manual  in effect at the time the actu arial services are provided and
- **Source 4**: asop022_203_CFTstatement.pdf, page 6:

Pract ice (ASOPs). These ASOPs describe the procedures an actuary should follow when   
performing actuarial services and identify what the actuary should disclose when   
communicating the results of those services.
- **Source 5**: asop002_204-2_nonguaranteedElement.pdf, page 5:

ASOP No. 24 was updated.   
   
8. In section 3.6, guidance for providing opinions and disclosures to meet regulatory   
requirements was added.   
   
9. In sections 3.7, 3.8, and 3.9, guidance for relying on data, projections, and supporting   
analysis supplied by others, relying on assumptions or methods selected by another party,   
and reliance on another actuary was added.   
   
10. In section 3.10, documentation requirements were added.   
   
11. In section 4, disclosure requirements were added, mostly to address expanded guidance   
throughout section 3.
- **Source 6**: asop052_189_PBR.pdf, page 3:

ASOP No. 52—September 2017    
   
   iv  September 2017   
 TO: Members of Actuarial Organizations Govern ed by the Standards of Practice of the   
Actuarial Standards Board and Other Pe rsons Interested in Principle-Based   
Reserves for Life Products   
 FROM: Actuarial Standards Board (ASB)   
   
SUBJ: Actuarial Standard of Practice (ASOP) No. 52    
 This document is the fina l version of ASOP No. 52, Principle-Based Reserves for Life Products   
under the NAIC Valuation Manual .    
 Background   
 The forces that led to the consideration of pr inciple-based approaches  to reserving for life   
insurance are discussed in appendi x 1 of this document. As change s to laws and regulations that   
would incorporate such approach es started to develop several years ago, the ASB decided to   
explore the need for a standard of practice and formed a task fo rce to produce a discussion draft   
of the standard. That task fo rce created a discussion draft containing actuarial guidance for


# 5. References
- https://www.actuarialstandardsboard.org/standards-of-practice/
- https://python.langchain.com/docs/use_cases/question_answering/quickstart
- https://python.langchain.com/docs/use_cases/question_answering/sources
- https://python.langchain.com/docs/integrations/text_embedding/
- https://python.langchain.com/docs/integrations/vectorstores/chroma
- https://docs.gpt4all.io/gpt4all_python_embedding.html#gpt4all.gpt4all.Embed4All
- https://chat.langchain.com/
- https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html

In [21]:
vectorstore._collection.name # name of the collection

'ASOP_life'

In [22]:
vectorstore._collection.count() # Number of documents 

1198

In [23]:
vectorstore._collection.peek(1) # See what the first component of the vectorstore looks like

{'ids': ['082857d2-d904-11ee-8af8-0e2e3f1e4b98'],
 'embeddings': [[0.016869571059942245,
   0.011291620321571827,
   0.0006650112918578088,
   -0.033250562846660614,
   0.0001435627054888755,
   0.01621813140809536,
   -0.028093334287405014,
   0.003925602417439222,
   -0.013449513353407383,
   -0.008794435299932957,
   0.026071157306432724,
   0.012010917998850346,
   -0.007416911423206329,
   0.016516707837581635,
   -0.019773906096816063,
   0.020411774516105652,
   0.02168751135468483,
   -0.01939390040934086,
   0.005774740595370531,
   -0.005357412155717611,
   -0.030536232516169548,
   -4.7924921091180295e-05,
   -0.03357628360390663,
   0.006962260697036982,
   0.0024344162084162235,
   0.0005992735386826098,
   0.01887817680835724,
   -0.037294916808605194,
   0.030156224966049194,
   -0.01066732406616211,
   0.023804688826203346,
   -0.0007578923250548542,
   -0.030264798551797867,
   -0.025080425664782524,
   0.007308338303118944,
   -0.0018949428340420127,
   -0.02801190316