# Actuarial Standards of Practice (ASOP) Q&A Machine Using Retrieval-Augmented Generation (RAG)
This project builds a Retrieval-Augmented Generation (RAG) process that lets actuaries ask questions about a set of Actuarial Standards of Practice (ASOP) documents. The process retrieves relevant passages from the ASOPs and passes them to a Large Language Model (LLM), which generates the answer.

However, RAG is not without challenges, such as hallucination and inaccuracy. To make the answers verifiable, this code displays the context it used to arrive at each answer, so actuaries can check the LLM's response against the source text before relying on it.

The current example uses either OpenAI's GPT-3.5 Turbo or a local LLM. Using a local LLM can address potential data privacy or security concerns.

# 1. Initial Setup
This setup includes loading environment variables from a `.env` file, setting the required environment variables, and importing the necessary modules for further processing. It ensures that the code has access to the required APIs and functions for the subsequent tasks.


In [1]:
# Initial set up
from dotenv import load_dotenv
import os

# Load the variables from .env file
load_dotenv()  # This loads the variables from .env (not part of repo)

# Set the environment variables
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.getenv('LANGCHAIN_API_KEY')

# Import the necessary modules
import bs4
from langchain import hub
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel # for RAG with source
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from IPython.display import display, Markdown, Latex
import glob
import chromadb
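
If a key is absent from `.env`, `os.getenv` returns `None`, and assigning `None` into `os.environ` raises a `TypeError` that does not name the missing variable. The following is a minimal sketch of an early check that reports which variables are missing. `require_env` is a hypothetical helper and is not part of the original notebook.

```python
import os

def require_env(names, environ=None):
    """Return {name: value} for every name; raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}

# Demonstration with a plain dict standing in for os.environ (placeholder values)
demo_env = {"OPENAI_API_KEY": "sk-placeholder", "LANGCHAIN_API_KEY": ""}
try:
    require_env(["OPENAI_API_KEY", "LANGCHAIN_API_KEY"], demo_env)
except RuntimeError as err:
    print(err)
```

In the notebook, such a check could run right after `load_dotenv()` and before any call that needs the keys.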

# 2. Load PDF Files and Convert to a Vector DB
1. Define a function, `load_pdfs_from_folder()`, that takes a folder path as input and returns a list of text documents extracted from the PDF files in that folder.

2. In the example, the folder path `../data/ASOP` is used, but you can modify it to point to your desired folder.

3. By calling the `load_pdfs_from_folder()` function with the folder path, the code loads the PDF files, extracts the text using the PyPDFLoader, and stores the extracted text documents in the `docs` list.

4. After loading and extracting the text, a `RecursiveCharacterTextSplitter` object is created with specific parameters for chunking the documents. The `split_documents()` method is then used to split the documents into smaller chunks based on the specified parameters.

5. Finally, a Chroma vectorstore is created from the document splits. The vectorstore uses the `OpenAIEmbeddings` for embedding the chunks and is persisted to the directory `../data/chroma_db1`.
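
To make the `chunk_size` and `chunk_overlap` parameters in step 4 concrete, here is a simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that class tries to break on separators such as paragraph and line breaks). `chunk_text` is a hypothetical helper that only illustrates how overlap repeats the tail of one chunk at the head of the next.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

# Toy example: 25 characters, windows of 10 that overlap by 4
sample = "".join(str(i % 10) for i in range(25))
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

In the toy run, every chunk after the first begins with the last four characters of the previous chunk. With the notebook's settings, the same idea means consecutive chunks can share up to 200 characters, which reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.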

In [2]:
'''# Uncomment when creating your own vector database for the first time
# Define a function to load and extract text from PDFs in a folder
def load_pdfs_from_folder(folder_path):
    # Get a list of PDF files in the specified folder
    pdf_files = glob.glob(f"{folder_path}/*.pdf")
    docs = []
    for pdf_file in pdf_files:
        # Load the PDF file using the PyPDFLoader
        loader = PyPDFLoader(pdf_file) 
        # Extract the text from the PDF and add it to the docs list
        docs.extend(loader.load())
    return docs

# Example folder path
folder_path = '../data/ASOP'

# Call the function to load and extract text from PDFs in the specified folder
docs = load_pdfs_from_folder(folder_path)

# Create a text splitter object with specified parameters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=200,
    length_function=len,)

# Split the documents into chunks using the text splitter
splits = text_splitter.split_documents(docs)

# Create a Chroma vector database from the document splits, using OpenAIEmbeddings for embedding
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings(), 
                                    persist_directory="../data/chroma_db1")
''' # Uncomment when creating your own vector database for the first time


# 3. Retrieve from the Vector DB
Once the vector database has been created and persisted, the code in Section 2 can stay commented out. The cell below reloads the database from `../data/chroma_db1`.

In [3]:
# Load the persisted Chroma vector database, using the same embedding function it was built with
vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), 
                     persist_directory="../data/chroma_db1")
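
As an alternative to commenting Section 2 in and out by hand, the notebook could test whether the persisted directory already has content. The sketch below is one way to do that. `vector_db_exists` is a hypothetical helper, and it only checks that something is on disk, not that the files form a valid Chroma database.

```python
import os
import tempfile

def vector_db_exists(persist_directory):
    """Return True if persist_directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and bool(os.listdir(persist_directory))

# Demonstration on a temporary directory rather than the real ../data/chroma_db1
with tempfile.TemporaryDirectory() as tmp:
    print(vector_db_exists(tmp))  # directory exists but is empty
    open(os.path.join(tmp, "placeholder.bin"), "w").close()
    print(vector_db_exists(tmp))  # directory now holds one file
```

A notebook cell could then build the database only when `vector_db_exists("../data/chroma_db1")` is false and load it otherwise.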

In [9]:
## Retrieve and RAG chain

# Create a retriever using the vector database as the search source
retriever = vectorstore.as_retriever(search_type="mmr", 
                                     search_kwargs={'k': 6, 'lambda_mult': 0.25}) 
# Use MMR (Maximal Marginal Relevance) to find a set of documents that are both similar to the input query and diverse among themselves
# k=6 increases the number of documents returned; lambda_mult=0.25 increases diversity (0.5 is the default, 0 is the most diverse, 1 the least)

# Load the RAG (Retrieval-Augmented Generation) prompt
prompt = hub.pull("rlm/rag-prompt")

# Create a ChatOpenAI language model for augmented generation
# llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", 
#                 temperature=0) # context window size 16k for GPT 3.5 Turbo

# Create a local large language model for augmented generation
llm = Ollama(model="solar:10.7b-instruct-v1-q5_K_M")

# Define a function to format the documents with their sources and pages
def format_docs_with_sources(docs):
    formatted_docs = "\n\n".join(doc.page_content for doc in docs)
    sources_pages = "\n".join(f"{doc.metadata['source']} (Page {doc.metadata['page'] + 1})" for doc in docs)
    # Added 1 to the page number assuming 'page' starts at 0 and we want to present it in a user-friendly way

    return f"Documents:\n{formatted_docs}\n\nSources and Pages:\n{sources_pages}"

# Create a RAG chain using the formatted documents as the context
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs_with_sources(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

# Create a parallel chain for retrieving and generating answers
rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)
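
The effect of `lambda_mult` in the retriever settings can be sketched in plain Python. This is an illustrative version of MMR on toy vectors, not the code that LangChain or Chroma actually runs. `cosine` and `mmr_select` are hypothetical names, and the two-dimensional vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # the first two are near-duplicates
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))   # relevance only
print(mmr_select(query, candidates, k=2, lambda_mult=0.25))  # favors diversity
```

With `lambda_mult=1.0` the two near-duplicate vectors are both chosen. With `lambda_mult=0.25` the second pick switches to the less similar third vector, which is the behavior the retriever relies on to avoid returning six almost identical ASOP passages.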

# 4. Generate Q&A

In [10]:
def generate_output():
    # Prompt the user for a question on ASOP
    usr_input = input("What is your question on ASOP?: ")

    # Invoke the RAG chain with the user input as the question
    output = rag_chain_with_source.invoke(usr_input)

    # Generate the Markdown output with the question, answer, and context
    markdown_output = "### Question\n{}\n\n### Answer\n{}\n\n### Context\n".format(output['question'], output['answer'])

    last_page_content = None  # Variable to store the last page content
    i = 1 # Source indicator

    # Iterate over the context documents to format and include them in the output
    for doc in output['context']:
        current_page_content = doc.page_content.replace('\n', '  \n')  # Get the current page content
        
        # Check if the current content is different from the last one
        if current_page_content != last_page_content:
            markdown_output += "- **Source {}**: {}, page {}:\n\n{}\n".format(i, doc.metadata['source'], doc.metadata['page'], current_page_content)
            i = i + 1
        last_page_content = current_page_content  # Update the last page content
    
    # Display the Markdown output
    display(Markdown(markdown_output))
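
The loop in `generate_output()` skips a chunk only when it matches the chunk immediately before it, and it prints the `page` metadata as is, whereas `format_docs_with_sources` adds 1 on the assumption that `page` starts at 0. A possible variation that skips any repeated chunk and prints 1-based pages is sketched below. `format_context` is a hypothetical helper, and `SimpleDoc` is a stand-in for a LangChain document that assumes only `page_content` and `metadata` are needed.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDoc:
    """Stand-in for a LangChain document (assumption: two attributes suffice)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_context(docs):
    """Build a Markdown list of unique chunks with 1-based page numbers."""
    seen = set()
    entries = []
    for doc in docs:
        if doc.page_content in seen:
            continue  # skip a repeated chunk even when it is not adjacent
        seen.add(doc.page_content)
        # Two trailing spaces before a newline force a Markdown line break
        body = doc.page_content.replace("\n", "  \n")
        entries.append(
            "- **Source {}**: {}, page {}:\n\n{}\n".format(
                len(entries) + 1,
                doc.metadata["source"],
                doc.metadata["page"] + 1,
                body,
            )
        )
    return "".join(entries)

# Toy documents with made-up file names; the third repeats the first
docs = [
    SimpleDoc("alpha\nbeta", {"source": "a.pdf", "page": 0}),
    SimpleDoc("gamma", {"source": "b.pdf", "page": 4}),
    SimpleDoc("alpha\nbeta", {"source": "a.pdf", "page": 0}),
]
print(format_context(docs))
```

The returned string could be appended to `markdown_output` in place of the loop, leaving the question and answer sections unchanged.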

### Example questions related to ASOPs
- Explain ASOP No. 14
- How are expenses reflected in cash flow testing based on ASOP No. 22?
- What is catastrophe risk?
- When do I update assumptions?
- What should I do when I do not have credible data to develop non-economic assumptions?

In [13]:
generate_output()

What is your question on ASOP?:  What is catastrophe risk?


### Question
What is catastrophe risk?

### Answer
Catastrophe risk refers to the potential for significant losses resulting from relatively infrequent and often extreme events, both natural phenomena like earthquakes or hurricanes, and manmade incidents like explosions causing toxic material release. These events can significantly disrupt historical experience in terms of insurance data and claims, necessitating adjustments when estimating catastrophe provisions in rates. Catastrophe modeling is commonly used to assess such risks and their impact on insurer finances and risk management practices.

### Context
- **Source 1**: ../data/ASOP/asop039_156.pdf, page 5:

b. Infrequent Occurrence—Some events that occur infrequently have the potential to  
produce losses that can significantly distort the historical experience. An example  
of such an event is an explosion that results in the release of toxic material. If the  
experience data contain such events, using this experience data without  
adjustment may overstate the catastrophe provision in the rates. If the experience  
data do not contain such events, using this experience data without adjustment may understate the catastrophe provision in the rates.  
3.2 Identification of Catastrophe Losses—The actuary should identify, where practicable, the  
catastrophe losses in the historical insurance data. In doing so, the actuary should  
consider how accurately the catastrophe losses can be identified, and the extent to which  
they may have a material impact on the results of the analysis.
- **Source 2**: ../data/ASOP/asop039_156.pdf, page 11:

damage ratio, i.e., ratio of losses to amount of insurance. These damage ratios are applied to the  
current or projected amounts of insurance and, when adjusted by the estimated frequencies of the  
specific catastrophes, produce the expected catastrophe losses.  
Since our knowledge of catastrophes is not complete and is still evolving, computer simulation  
models are also evolving. The expected catastrophe losses calculated from these models can be
- **Source 3**: ../data/ASOP/asop039_156.pdf, page 10:

Examples of such issues include coverage changes, such as the greater use of guaranteed  
replacement cost on homeowner policies or the use of separate wind deductibles; the emergence  
of state-run catastrophe funds; and the availability of catastrophe options.
- **Source 4**: ../data/ASOP/asop039_156.pdf, page 15:

ASOP No. 39—June 2000  

standard has been retitled to specify that it applies to property/casualty insurance  
ratemaking. The services referred to for risk financing systems, such as self-insurance and  
securitization products, are considered to be ratemaking when estimates for future costs  
are being determined.  

Section 2. Definitions  

Section 2.1, Catastrophe—One commentator believed that the definition of catastrophe should  
relate to how the event or phenomenon violated the general insurance ratemaking model  
assumption of independent events. The subcommittee believes that the use of a qualitative  
definition is more broadly applicable and useful in terms of current accepted practices.  

Another commentator believed that the phrase “or natural phenomenon” should be removed, as  
the phrase “relatively infrequent events” included natural and manmade phenomena. The  
subcommittee agreed and deleted the word “natural” from the definition.
- **Source 5**: ../data/ASOP/asop038_201.pdf, page 14:

models often entering financial statements directly.   
   
Lastly, due to the evolution of enterprise risk management (ERM) practices and regulations, there   
has been increased use of catastrophe modeling as part of insurer stress testing and risk   
management across all practice areas. This trend is likely to continue to evolve and heighten in   
light of the emergence of the novel coronavirus and the COVID-19 pandemic.
- **Source 6**: ../data/ASOP/asop038_201.pdf, page 9:

to the results of the actuarial analysis.   
   
3.2 Catastrophe Models Developed by Experts—When selecting, using, reviewing, or  
evaluating a catastrophe model developed by experts, the actuary should take into  
account the following:  

a. whether the individual or individuals who developed the catastrophe model are  
experts in the applicable field;  

b. the extent to which the catastrophe model has been reviewed or validated by  
experts in the applicable field, including known differences of opinion among


# 5. References
- https://www.actuarialstandardsboard.org/standards-of-practice/
- https://python.langchain.com/docs/use_cases/question_answering/quickstart
- https://python.langchain.com/docs/use_cases/question_answering/sources
- https://chat.langchain.com/