# 📘 Company CSR Question Answer Retriever – V1

This notebook builds a question-answering system over a company's CSR (Corporate Social Responsibility) PDF document using LangChain and OpenAI.

### 🔧 Key Components:
- **Document Loader:** Loads CSR PDFs using `PyPDFLoader`.
- **Text Chunking:** Splits text into overlapping chunks using `RecursiveCharacterTextSplitter`.
- **Embeddings:** Generates embeddings using `OpenAIEmbeddings` (e.g., `text-embedding-ada-002`).
- **Vector Store:** Stores embeddings in a Chroma vector database.
- **LLM Response Generation:** Uses `gpt-4o-mini` via `ChatOpenAI` to answer questions based on retrieved chunks.
- **Custom Prompt:** Ensures grounded answers using a structured prompt template.

This lightweight semantic QA pipeline enables efficient querying of long unstructured CSR reports with accurate and context-aware answers.


In [1]:
import numpy as np
import pandas as pd
import os
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.load import loads,dumps
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough
from langchain_community.document_loaders import PyPDFLoader
import pymupdf

In [2]:
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'

In [3]:
load_dotenv(override = True)
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY','your-key-if-not-using-env')

In [4]:
db_name = 'vector_db'
MODEL = "gpt-4o-mini"

In [5]:
file_path = (
    "./Input Dataset/D4G_0.pdf"
)

In [6]:
loader = PyPDFLoader(file_path)
pages = []
async for page in loader.alazy_load():
    pages.append(page)

In [7]:
pages

[Document(metadata={'producer': 'Adobe PDF Library 16.0.7', 'creator': 'Adobe InDesign 17.2 (Macintosh)', 'creationdate': '2022-04-19T09:11:48-05:00', 'moddate': '2022-05-10T11:04:16-05:00', 'trapped': '/False', 'source': './Input Dataset/D4G_0.pdf', 'total_pages': 29, 'page': 0, 'page_label': '1'}, page_content='2021 CORPORATE SOCIAL  \nRESPONSIBILITY REPORT'),
 Document(metadata={'producer': 'Adobe PDF Library 16.0.7', 'creator': 'Adobe InDesign 17.2 (Macintosh)', 'creationdate': '2022-04-19T09:11:48-05:00', 'moddate': '2022-05-10T11:04:16-05:00', 'trapped': '/False', 'source': './Input Dataset/D4G_0.pdf', 'total_pages': 29, 'page': 1, 'page_label': '2'}, page_content='PAGE INTENTIONALLY\nLEFT BLANK'),
 Document(metadata={'producer': 'Adobe PDF Library 16.0.7', 'creator': 'Adobe InDesign 17.2 (Macintosh)', 'creationdate': '2022-04-19T09:11:48-05:00', 'moddate': '2022-05-10T11:04:16-05:00', 'trapped': '/False', 'source': './Input Dataset/D4G_0.pdf', 'total_pages': 29, 'page': 2, 'page

In [8]:
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size = 1000, chunk_overlap = 200)

In [9]:
chunks = splitter.split_documents(pages)

In [10]:
len(chunks)

32

In [11]:
embeddings = OpenAIEmbeddings()

In [12]:
if os.path.exists(db_name):
    Chroma(persist_directory = db_name, embedding_function = embeddings).delete_collection()

  Chroma(persist_directory = db_name, embedding_function = embeddings).delete_collection()


In [13]:
vector_db = Chroma.from_documents(documents = chunks, embedding = embeddings, persist_directory = db_name)

In [14]:
retriever = vector_db.as_retriever()

In [15]:
template = """You are a helpful assistant that generates multiple search queries based on a single input query. \n
Generate multiple search queries related to: {question} \n
Output (4 queries):"""

In [16]:
prompt_fus = ChatPromptTemplate.from_template(template)

In [17]:
generate_queries = (prompt_fus | ChatOpenAI(temperature = 0) | StrOutputParser() | (lambda x: x.split('\n')))

In [19]:
def reciprocal_rank_fusion(results: list[list], k = 60):
    fused_scores = {}
    for list_doc in results:
        for rank,doc in enumerate(list_doc):
            #print("Document\n",doc)
            doc_str = dumps(doc)
            #print("Document\n",doc_str)
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            previous_score = fused_scores[doc_str]
            fused_scores[doc_str] += 1/(rank+k)
            
    ranked_res = [(loads(doc),score) for doc,score in sorted(fused_scores.items(), key = lambda x: x[1], reverse = True)]
    
    return ranked_res

In [34]:
question = "What is the strategy and Target of the initiative?"

In [23]:
retrieval_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion
#docs = retrieval_fusion.invoke({'question':question})

  ranked_res = [(loads(doc),score) for doc,score in sorted(fused_scores.items(), key = lambda x: x[1], reverse = True)]


In [25]:
template = """Answer the following question based on this context:

{context}

Question: {question}
"""

In [27]:
prompt = ChatPromptTemplate.from_template(template)

In [29]:
llm = ChatOpenAI(temperature = 0)

In [35]:
final_chain = ({"context" : retrieval_fusion,
               "question" : itemgetter("question")}
               | prompt
               | llm
               | StrOutputParser()
              )

print(final_chain.invoke({"question":question}))

The strategy of the initiative is to focus on environmental, social, and governance (ESG) efforts by creating and updating enterprise-wide policies related to human rights, health and safety, labor management, diversity and inclusion, and environmental issues. The target of the initiative is to make a positive contribution in the communities where the company operates by dedicating time and effort to organizations and projects that strive to create a positive impact. The Board of Directors oversees the progress made toward ESG commitments and ensures diversity and inclusion in governance practices. The initiative also involves supporting charitable organizations, sponsoring local sports teams, and making donations to cancer research and other social causes.
