# Contextual keyword generation for a financial report document
This example uses a financial report (a PDF file) taken from https://github.com/patronus-ai/financebench/tree/main/pdfs

**Steps**:
1) Parse the financial report (a PDF file) using LlamaParse.
2) Split the document into chunks. Here we use basic token-based chunking with a constant chunk_size, but any other chunking method works.
3) Generate contextual keywords for each chunk.
4) Generate questions for randomly selected chunks, since no predefined test set is available.
5) Evaluate the method: for each question, retrieve the five most relevant chunks by cosine similarity between chunk and question embeddings, and check whether the chunk the question was generated from is among them.
6) To compare results with raw content (without keywords), create a new index in the "Create index" section and replace Document(text='#'+x['keywords']+'\n'+x['content'], .. ) with Document(text=x['content'], .. ). Delete or rename the cached index directory first, otherwise the existing index is loaded instead of a new one being built.
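
Step 6 switches between two document texts. A minimal sketch of how that switch could be made explicit is shown below. The function names `build_text` and `index_dir`, and the directory suffixes, are hypothetical and not part of the notebook; only the two text formats come from step 6.

```python
def build_text(chunk, with_keywords=True):
    """Return the text to index for one chunk dict with 'keywords' and 'content'."""
    if with_keywords:
        # Same format as in the "Create index" section: keywords as a header line
        return "#" + chunk["keywords"] + "\n" + chunk["content"]
    # Raw baseline: the chunk content only
    return chunk["content"]


def index_dir(with_keywords=True):
    """Hypothetical naming scheme that keeps the two cached indexes apart."""
    suffix = "keywords" if with_keywords else "raw"
    return f"./temp/local_index_cache_{suffix}"


# Toy chunk, made up for illustration
chunk = {"idx": 0, "keywords": "Form 10-K, annual report", "content": "Some chunk text."}
print(build_text(chunk, with_keywords=True))
print(build_text(chunk, with_keywords=False))
print(index_dir(with_keywords=False))
```

Keeping a separate directory per variant avoids loading the keyword index by accident when the raw baseline is evaluated.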


In [1]:
# Install dependencies
!pip install tiktoken
!pip install openai
!pip install llama_parse
!pip install llama_index

import random
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader
from config import OPENAI_API_KEY, LLAMAPARSE_API_KEY

# Enable nested async loops
import nest_asyncio
nest_asyncio.apply()
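
The cell above imports both API keys from a local `config` module that is not shown. One possible way to write such a module, assuming the keys are supplied through environment variables, is sketched here. The helper name `require_env` is hypothetical.

```python
import os


def require_env(name):
    """Read a required setting from the environment and fail early if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook.")
    return value


# In config.py one might then write (assumption, not the notebook's actual file):
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
# LLAMAPARSE_API_KEY = require_env("LLAMAPARSE_API_KEY")
```

Reading the keys from the environment keeps them out of the notebook and out of version control.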


In [6]:
# Parse pdf file
parser = LlamaParse(
    api_key=LLAMAPARSE_API_KEY,
    result_type="markdown"  # "markdown" and "text" are available
)

# use SimpleDirectoryReader to parse our file
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(input_files=['./data/ADOBE_2015_10K.pdf'], file_extractor=file_extractor).load_data()
print("pdf file pages:", len(documents))

Started parsing the file under job_id eac9517e-8679-478f-82dc-3e03668c9180
.......pdf file pages: 116
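
Step 2 notes that any chunking method can be used. As one alternative to fixed, non-overlapping windows, the sketch below adds an overlap between neighboring chunks. It works on any token list, so it is shown with plain integers; with tiktoken one would pass `enc.encode(content)` and call `enc.decode` on each window. The function name and the `overlap` parameter are assumptions, not part of the notebook.

```python
def split_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(10))  # stand-in for real token ids
print(split_with_overlap(tokens, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With `overlap=0` this reduces to plain fixed-size windows.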


In [7]:
# Split into chunks (by tokens)
import os
import sys
import json
from helper import file_get_contents, file_put_contents, generate_contextual_keywords, get_llm_answer, generate_questions_bychunk
from llama_index.core.schema import Document
import tiktoken
enc = tiktoken.get_encoding("o200k_base")

def split_into_chunks(content, chunk_size):
    """Split text into consecutive chunks of at most chunk_size tokens."""
    tokens = enc.encode(content)
    left, chunks = 0, []
    while left < len(tokens):
        chunks.append(enc.decode(tokens[left : left + chunk_size]))
        left += chunk_size
    return chunks


def generate_chunked_content(chunks):
    """Concatenate chunks under numbered headers so each one can be referred to by number."""
    chunked_content = ""
    for idx, text in enumerate(chunks):
        chunked_content += f"### Chunk {idx+1} ###\n{text}\n\n"
    return chunked_content


# Generate contextual keywords
path = "./temp/chunks2.json"
if not os.path.exists(path):
    print("Generating keywords..")
    document_content, chunks, chunks2 = "", [], []
    for doc in documents:
        document_content += doc.text + "\n"
    chunks1 = split_into_chunks(document_content, 400)  # 400 -- default value
    for i, chunk in enumerate(chunks1):
        chunks.append(chunk)
        # Flush a batch once it holds 11 chunks, or at the end of the document
        # if the final batch holds at least 3 chunks (a shorter tail is skipped).
        is_last = i == len(chunks1) - 1
        if len(chunks) > 10 or (is_last and len(chunks) > 2):
            chunked_content = generate_chunked_content(chunks)
            keywords = generate_contextual_keywords(chunked_content)
            print("page_end:", i+1, keywords, len(keywords), len(chunks))
            assert len(keywords) >= len(chunks)
            for j in range(len(chunks)):
                # Use a document-wide index so chunk ids stay unique across batches
                chunks2.append({"idx": len(chunks2), "keywords": keywords[j], "content": chunks[j]})
            chunks = []
    file_put_contents(path, json.dumps(chunks2))
else:
    chunks2 = json.loads(file_get_contents(path))  # each item has content, keywords, idx


# Generate questions
path = "./temp/chunks3.json"
if not os.path.exists(path):
    print("Generating questions..")
    chunks3 = generate_questions_bychunk(chunks2) 
    file_put_contents(path, json.dumps(chunks3))
else:
    chunks3 = json.loads(file_get_contents(path)) #it has content, keywords, questions, idx now


Generating keywords..
Keywords_st:
 Here are the keywords required to fully understand each chunk:

**Chunk 1**: Adobe Systems Incorporated, Form 10-K, Securities and Exchange Commission (SEC), fiscal year 2015, annual report.

**Chunk 2**: Adobe Systems Incorporated, company information, address, phone number, securities registration, NASDAQ stock market.

**Chunk 3**: Securities Exchange Act of 1934, reporting requirements, Interactive Data File, Regulation S-T, disclosure of delinquent filers.

**Chunk 4**: Large accelerated filer, accelerated filer, non-accelerated filer, smaller reporting company, shell company, aggregate market value of common stock.

**Chunk 5**: Market value of common stock, number of shares outstanding, documents incorporated by reference, Proxy Statement.

**Chunk 6**: Table of contents, Form 10-K, Adobe Systems Incorporated, business overview, risk factors, financial statements.

**Chunk 7**: Management's discussion and analysis of financial condition and re
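
The output above shows the raw model response, with one `**Chunk N**:` line per chunk, while the loop expects `keywords[j]` to hold the keywords of the j-th chunk. The implementation of `generate_contextual_keywords` is not shown, so the following is only a sketch of how such a response could be turned into a list. The function name `parse_keywords` is hypothetical, and the sample text is abbreviated from the output above.

```python
import re

# Matches lines such as "**Chunk 3**: keyword a, keyword b."
CHUNK_LINE = re.compile(r"^\*\*Chunk (\d+)\*\*:\s*(.+)$", re.MULTILINE)


def parse_keywords(llm_output, expected_chunks):
    """Return one keyword string per chunk, ordered by chunk number."""
    found = {int(num): text.strip() for num, text in CHUNK_LINE.findall(llm_output)}
    missing = [n for n in range(1, expected_chunks + 1) if n not in found]
    if missing:
        # Fail loudly instead of silently shifting keywords onto the wrong chunk
        raise ValueError(f"No keywords returned for chunks: {missing}")
    return [found[n] for n in range(1, expected_chunks + 1)]


sample = """Here are the keywords required to fully understand each chunk:

**Chunk 1**: Adobe Systems Incorporated, Form 10-K, annual report.

**Chunk 2**: Adobe Systems Incorporated, company information, NASDAQ stock market.
"""
print(parse_keywords(sample, expected_chunks=2))
```

Keying the result by chunk number, rather than by order of appearance, guards against a response that skips or reorders chunks.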

In [None]:
# Create index
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
INDEX_DIR = "./temp/local_index_cache"
if not os.path.exists(INDEX_DIR):
    print("Creating new index ...")
    # Prepend the keywords as a header line; use Document(text=x['content'], ...) for the raw baseline
    documents2 = [Document(text='#'+x['keywords']+'\n'+x['content'], metadata={"id": str(x["idx"])}) for x in chunks3]
    index = VectorStoreIndex.from_documents(documents2)
    index.storage_context.persist(persist_dir=INDEX_DIR)
else:
    # Reuse the cached index; delete INDEX_DIR to force a rebuild
    storage_context = StorageContext.from_defaults(persist_dir=INDEX_DIR)
    index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(similarity_top_k=5)

# Run tests
count, correct = 0, 0
for test in chunks3:
    if "questions" not in test:
        continue
    idx = test["idx"]
    for question in test["questions"]:
        count += 1
        response = query_engine.query(question)
        print("\n\n--- Test:", question, "idx:", idx)
        for result in response.source_nodes:
            print(result.node.metadata)
            # Count a hit when the source chunk is among the retrieved nodes
            if result.node.metadata['id'] == str(idx):
                correct += 1

print("Test correct, all:", correct, count)
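
The test loop reports raw counts. To make the metric from step 5 concrete, here is a self-contained sketch of top-k retrieval by cosine similarity and the resulting hit rate, using only the standard library. The 2-D vectors are made up for illustration; in the notebook the embeddings and the ranking are handled by the index, so this is not the code path the notebook runs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(question_vec, chunk_vecs, k=5):
    """Return the ids of the k chunks most similar to the question."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(question_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]


def hit_rate(tests, chunk_vecs, k=5):
    """Fraction of (question_vec, expected_id) pairs whose chunk is in the top k."""
    if not tests:
        return 0.0
    hits = sum(expected in top_k_chunks(q, chunk_vecs, k) for q, expected in tests)
    return hits / len(tests)


# Toy 2-D "embeddings" (illustrative values only)
chunk_vecs = {"0": [1.0, 0.0], "1": [0.0, 1.0], "2": [0.7, 0.7]}
tests = [([0.9, 0.1], "0"), ([0.1, 0.9], "1")]
print(hit_rate(tests, chunk_vecs, k=1))
```

Dividing `correct` by `count` in the loop above gives the same kind of hit rate, which makes the keyword and raw-content runs directly comparable.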