# Introduction
This notebook demonstrates a Retrieval-Augmented Generation (RAG) application, built on a large language model (LLM), for analyzing 10-K financial documents. The app combines a local LLM, document processing, a FAISS vector store, and a Streamlit interface into an interactive question-answering system about company financial information.
Key components of this project include:

1. Document processing of 10-K PDF files
2. Vector store creation for efficient information retrieval
3. Integration of a local LLM for question answering
4. Development of a conversational chain for context-aware responses
5. Streamlit web application for user interaction

Let's explore each component in detail.

## Import Libraries and Set Constants
In this section, we import all necessary libraries and define the constants used throughout the application. These include paths for data storage, model configurations, and other parameters that control the behavior of our app.

In [None]:
# 1. Import Libraries and Set Constants

import streamlit as st
from streamlit_chat import message
from PyPDF2 import PdfReader
import asyncio
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain_community.llms import LlamaCpp
import math 
import re

# Constants
PDF_DIRECTORY = "data/"
VECTOR_STORE_FILENAME = "faiss_index"
BATCH_SIZE = 100
MODEL_PATH = "openhermes-2.5-mistral-7b.Q6_K.gguf"
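
Because the rest of the app depends on these paths, it can help to check them before loading anything expensive. The following is a minimal sketch, not part of the original app: the helper name ```find_setup_problems``` is hypothetical, and the path values are copied from the constants above so the block runs on its own.

```python
from pathlib import Path


def find_setup_problems(pdf_dir, model_path):
    """Return a list of human-readable problems; an empty list means the paths look usable."""
    problems = []
    pdf_root = Path(pdf_dir)
    if not pdf_root.is_dir():
        problems.append(f"PDF directory not found: {pdf_root}")
    elif not any(p.suffix.lower() == ".pdf" for p in pdf_root.iterdir()):
        problems.append(f"No PDF files in: {pdf_root}")
    if not Path(model_path).is_file():
        problems.append(f"Model file not found: {model_path}")
    return problems


if __name__ == "__main__":
    # Same values as PDF_DIRECTORY and MODEL_PATH above, repeated for self-containment.
    for problem in find_setup_problems("data/", "openhermes-2.5-mistral-7b.Q6_K.gguf"):
        print(problem)
```

Returning a list instead of raising on the first failure lets a caller report every setup problem at once.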

## Initialize Embeddings and LLM
Here, we set up two crucial components of our RAG system:

1. *Embeddings*: We use HuggingFace's sentence transformers to create embeddings for our document chunks. These embeddings allow us to perform semantic similarity searches.
2. *Local LLM*: We initialize a local LLM using the LlamaCpp library. This model will be responsible for generating human-like responses based on the retrieved context.

In [None]:
# 2. Initialize Embeddings and LLM

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

llm = LlamaCpp(
    model_path=MODEL_PATH,
    n_ctx=4096,
    n_batch=512,
    n_gpu_layers=-1,
    temperature=0.1,
    max_tokens=512,
    verbose=True,
    use_mlock=True,
    use_mmap=True,
)
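
To make "semantic similarity" concrete, the sketch below compares toy vectors with cosine similarity, one common way to score how close two embeddings are. The three-dimensional vectors and their names are made up for illustration; real vectors from ```all-MiniLM-L6-v2``` have 384 dimensions and come from the model, not from hand-written lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" used only for illustration.
revenue_query = [0.8, 0.2, 0.1]
revenue_chunk = [0.9, 0.1, 0.0]
litigation_chunk = [0.0, 0.2, 0.9]

# The chunk pointing in nearly the same direction as the query scores higher.
print(cosine_similarity(revenue_query, revenue_chunk))
print(cosine_similarity(revenue_query, litigation_chunk))
```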

## Document Processing
Document processing is a critical step in our RAG pipeline. It involves two main functions:

```get_pdf_text()```: This function extracts text from PDF files.

```get_text_chunks()```: This function splits the extracted text into manageable chunks for processing.

These functions prepare our 10-K documents for embedding and storage in the vector database.

In [None]:
# 3. Document Processing Functions

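def get_pdf_text(pdf_path):
    """Extract and concatenate the text of every page in a PDF file."""
    text = ""
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PdfReader(pdf_file)
        for page in pdf_reader.pages:
            # Guard against pages with no extractable text (e.g. scanned images).
            text += page.extract_text() or ""
    return text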

def get_text_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = text_splitter.split_text(text)
    return chunks
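
The effect of ```chunk_size``` and ```chunk_overlap``` is easiest to see on a short string. The sketch below is a deliberately simplified fixed-window splitter, not the algorithm used by ```RecursiveCharacterTextSplitter```, which prefers to break on separators such as paragraph and line boundaries. The function name ```sliding_window_chunks``` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows whose neighbors share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4))
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of its predecessor. With the notebook's settings of 1000 and 200, this overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.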

## Vector Store Ingestion
Vector store ingestion is the process of creating and storing vector representations (embeddings) of our document chunks. This section includes:

1. ```embed_batch()```: A function to embed batches of text chunks.
2. ```create_and_save_vector_store()```: A function that processes all PDFs, creates embeddings, and stores them in a FAISS index.
3. ```load_vector_store()```: A function to load an existing vector store.

This process allows for efficient similarity search when answering user queries.

In [None]:
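# 4. Vector Store Ingestion

# Module-level handle to the FAISS index. It must exist before main() checks
# "vector_store is None"; otherwise that check raises a NameError.
vector_store = None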

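async def embed_batch(batch_chunks):
    # embed_documents() is a blocking call, so batches are embedded one after
    # another even though they are gathered as coroutines.
    return embeddings.embed_documents(batch_chunks)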

async def create_and_save_vector_store(pdf_dir):
    global vector_store
    all_chunks = []

    # Get all chunks from PDF files
    for filename in os.listdir(pdf_dir):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(pdf_dir, filename)
            text = get_pdf_text(pdf_path)
            chunks = get_text_chunks(text)
            all_chunks.extend(chunks)

    # Batch Embedding
    num_batches = math.ceil(len(all_chunks) / BATCH_SIZE)
    embedding_tasks = []
    for batch_num in range(num_batches):
        start_idx = batch_num * BATCH_SIZE
        end_idx = min((batch_num + 1) * BATCH_SIZE, len(all_chunks))
        batch_chunks = all_chunks[start_idx:end_idx]
        embedding_tasks.append(embed_batch(batch_chunks))

    # Gather Embedding Results
    embeddings_list = await asyncio.gather(*embedding_tasks)
    embeddings_list = [embedding for batch in embeddings_list for embedding in batch]

    # Create FAISS Index
    text_embeddings = list(zip(all_chunks, embeddings_list))
    vector_store = FAISS.from_embeddings(text_embeddings, embeddings)
    vector_store.save_local(VECTOR_STORE_FILENAME)

def load_vector_store():
    global vector_store
    vector_store = FAISS.load_local(VECTOR_STORE_FILENAME, embeddings, allow_dangerous_deserialization=True)
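
Two ideas in the cell above can be sketched without any external library: how ```BATCH_SIZE``` partitions the chunk list, and what a similarity search over stored vectors amounts to. The brute-force search below assumes Euclidean distance as the metric and uses made-up two-dimensional vectors. It is an illustration of the concept, not how FAISS is implemented, and the helper names are hypothetical.

```python
import math


def split_into_batches(items, batch_size):
    """Return consecutive slices of at most batch_size items."""
    num_batches = math.ceil(len(items) / batch_size)
    return [items[i * batch_size:(i + 1) * batch_size] for i in range(num_batches)]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vector, text_embeddings, k):
    """Brute-force search: rank (text, vector) pairs by distance to the query."""
    ranked = sorted(text_embeddings, key=lambda pair: squared_l2(query_vector, pair[1]))
    return [text for text, _ in ranked[:k]]


# 250 chunks with a batch size of 100 give batches of 100, 100 and 50.
print([len(batch) for batch in split_into_batches(list(range(250)), 100)])

# Hypothetical 2-dimensional vectors standing in for real embeddings.
store = [("revenue", [1.0, 0.0]), ("risk factors", [0.0, 1.0]), ("net income", [0.9, 0.1])]
print(top_k([1.0, 0.0], store, k=2))  # ['revenue', 'net income']
```

The ```(text, vector)``` pairs mirror the ```text_embeddings``` list that the notebook passes to ```FAISS.from_embeddings```.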


## Query Engine Development
The query engine is responsible for understanding user questions and generating appropriate responses. The key component here is the ```get_conversational_chain()``` function, which:

1. Defines a prompt template to guide the LLM's responses.
2. Creates a question-answering chain that combines the LLM with our retrieval system.

This setup allows for context-aware responses that draw information from the relevant parts of our 10-K documents.

In [None]:
# 5. Query Engine Development

def get_conversational_chain():
    prompt_template = """
    You are a helpful AI assistant designed to answer questions about financial documents of different companies. 
    You have been provided with information from Form 10-K filings of multiple companies.
    
    Use the provided context to answer the question as accurately as possible. 
    Whenever the user asks a question about Google, answer it using the context of the Alphabet document, as Google is a subsidiary of Alphabet; provide a disclaimer for this as well.
    If the question asks for a comparison, make sure to highlight the differences and similarities between the companies.
    If you cannot answer the question from the given context, say "I'm sorry, I don't have enough information to answer that."
    
    Always start your answer with the company name(s) relevant to the question.
    If asked about specific financial figures, provide the exact numbers from the context if available.
    
    Context:
    {context}

    Question:
    {question}

    Answer:
    """
    prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
    chain = load_qa_chain(llm, chain_type="stuff", prompt=prompt)
    return chain
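
Conceptually, a "stuff" chain concatenates every retrieved chunk into the ```{context}``` slot of the prompt and sends the result to the LLM in a single call. The sketch below imitates only that prompt-building step with plain string formatting. The shortened template, the function name ```build_stuff_prompt```, and the example chunks about a fictional "Example Corp" are all assumptions for illustration.

```python
TEMPLATE = """Context:
{context}

Question:
{question}

Answer:"""


def build_stuff_prompt(chunks, question, template=TEMPLATE):
    """Join all retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


# Made-up chunks about a fictional company, for illustration only.
retrieved = [
    "Example Corp reported total revenue of $10 million.",
    "Example Corp lists supply-chain disruption as a risk factor.",
]
print(build_stuff_prompt(retrieved, "What was Example Corp's revenue?"))
```

Because everything goes into one prompt, the stuffed context plus the generated answer has to fit within the model's context window (```n_ctx=4096``` tokens in this notebook).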

## Streamlit App Structure
While we can't run the Streamlit app directly in this notebook, this section outlines the structure of our web application. Key components include:

1. Setting up the Streamlit page and initializing session state.
2. Handling PDF processing and vector store creation/loading.
3. Implementing the chat interface for user interaction.
4. Processing user queries and displaying responses.

The Streamlit app provides an intuitive interface for users to interact with our RAG-based LLM system.

In [None]:
# 6. Streamlit App

def main():
    st.set_page_config(page_title="10-K Analyzer", page_icon="📈")
    st.title("🔍 Analyze Financial Documents")

    # Session State Initialization
    if 'buffer_memory' not in st.session_state:
        st.session_state.buffer_memory = ConversationBufferWindowMemory(k=3, return_messages=True)

    if "messages" not in st.session_state.keys():
        st.session_state.messages = [
            {"role": "assistant", "content": "Hi! I can help analyze Form 10-K documents. Ask me anything! 😊"}
        ]

    # PDF Processing & Vector Store
    if not os.path.exists(VECTOR_STORE_FILENAME):
        with st.spinner("Processing PDFs..."):
            progress_bar = st.progress(0, text="Starting...")
            asyncio.run(create_and_save_vector_store(PDF_DIRECTORY)) 
            progress_bar.progress(1.0, text="PDFs processed and vector store created!")
            st.success("PDFs processed and vector store created!")

    if os.path.exists(VECTOR_STORE_FILENAME) and vector_store is None:
        with st.spinner("Loading vector store..."):
            load_vector_store()

    # Chat Interaction
    # Use "msg" so the loop variable does not shadow streamlit_chat.message.
    for msg in st.session_state.messages:
        with st.chat_message(msg["role"]):
            st.write(msg["content"])

    if prompt := st.chat_input("Enter your question about the 10-K filings..."):
        st.session_state.messages.append({"role": "user", "content": prompt})

        with st.chat_message("user"):
            st.write(prompt)

        if st.session_state.messages[-1]["role"] != "assistant":
            with st.chat_message("assistant"):
                try:
                    docs = vector_store.similarity_search(prompt, k=5)
                    chain = get_conversational_chain()
                    response = chain.invoke({"input_documents": docs, "question": prompt}, return_only_outputs=True)
                    st.write(response["output_text"])
                    st.session_state.messages.append({"role": "assistant", "content": response["output_text"]})
                except Exception as e:
                    st.error(f"An error occurred while processing your request: {str(e)}")

if __name__ == "__main__":
    main()
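
The chat logic in ```main()``` can be exercised without Streamlit by isolating the message bookkeeping. The sketch below is a simplified stand-in under stated assumptions: ```handle_turn``` is a hypothetical helper, a plain list plays the role of ```st.session_state.messages```, and a stub lambda replaces retrieval plus the QA chain.

```python
def handle_turn(messages, question, answer_fn):
    """Record the question, then the reply if answer_fn succeeds; return the text to display."""
    messages.append({"role": "user", "content": question})
    try:
        reply = answer_fn(question)
    except Exception as exc:
        # As in the app, a failure is reported but not stored in the history.
        return f"An error occurred while processing your request: {exc}"
    messages.append({"role": "assistant", "content": reply})
    return reply


history = [{"role": "assistant", "content": "Hi! I can help analyze Form 10-K documents."}]
handle_turn(history, "What are the main risk factors?", lambda q: f"(stub answer to: {q})")
print([m["role"] for m in history])  # ['assistant', 'user', 'assistant']
```

Keeping this step free of UI calls makes it straightforward to unit-test the history handling separately from the interface.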

## Running the Application
To run the Streamlit application:

1. Ensure all required libraries are installed (```pip install -r requirements.txt```).
2. Place your 10-K PDF files in the ```data/``` directory.
3. Save the code from the cells above as ```app.py``` and run ```streamlit run app.py``` in your terminal.
4. Open the provided URL in your web browser to interact with the app.

*Note*: The first run may take some time as it processes the PDFs and creates the vector store.
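
Before launching, a quick check that the required packages are importable can save a failed start. The sketch below uses only the standard library. The list contains the top-level import names from the first cell, plus ```faiss``` and ```llama_cpp```, which are assumed here to be the backends behind the FAISS and LlamaCpp wrappers. The helper name ```missing_modules``` is hypothetical.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level import names from the first cell, plus faiss and llama_cpp
# (assumed backends for the FAISS and LlamaCpp wrappers).
REQUIRED = ["streamlit", "streamlit_chat", "PyPDF2", "langchain",
            "langchain_huggingface", "langchain_community", "faiss", "llama_cpp"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```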

## Conclusion and Future Improvements
This notebook demonstrates a functional RAG-based LLM system for analyzing 10-K financial documents. Some potential areas for future improvement include:

1. Implementing more advanced text chunking strategies for better context retrieval.
2. Exploring different embedding models to improve semantic search accuracy.
3. Fine-tuning the local LLM on financial domain data for more accurate responses.
4. Adding features like document comparison or time series analysis of financial metrics.
5. Implementing user authentication and document upload functionality in the Streamlit app.

By continually refining and expanding this system, we can create an increasingly powerful tool for financial document analysis and information retrieval.