---
title: "Building an ESG Analyst using Llama"
#description: "blog post description (appears underneath the title in smaller text) which is included on the listing page"
author:
  - name: Aryamik Sharma
    url: https://aryamik.github.io

date: 12-29-2023
categories: [Data Analytics] # self-defined categories

draft: true # setting this to `true` will prevent your post from appearing on your listing page until you're ready!
---


Last year, I finally managed to cross an item off my bucket list:

-   Assemble my own Gaming PC ✅

Look at this beauty -

Upon setting it up, I did what any tech nerd would do.


```{=html}
<p align = 'center'><iframe width="560" height="315" src="https://www.youtube.com/embed/1DyD2jrWjFM?si=ap_D5Y-yvTsCNV2a" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe></p>
```


All this horsepower and no room to gallop

So eventually, I was like "you know what, now that I have all this horsepower, why don't I try my hand at Retrieval Augmented Generation (RAG), since it is the latest talk of the town?"

It just so happened that I stumbled upon work by Markus Leippold and his team at the University of Zurich on using Large Language Models (LLMs) for analysing sustainability reports. I also discovered that Tobias Schimanski has published some good tutorials on doing this.

As I was implementing the solution, I noticed that the code was pretty close to what I was looking for. However, as I went through it, I thought of a couple of areas that I could play around with:

-   For starters, Tobias uses OpenAI. While I could create my own OpenAI API keys, I wanted to use an open source model, something like Llama.

-   For parsing PDFs, the tutorial uses PyPDF. While it does an amazing job of parsing documents, more complex documents like sustainability reports are not standardized, so using LlamaParse should result in better outputs. I came across this idea in [this](https://www.youtube.com/watch?v=u5Vcrwpzoz8) tutorial by AI Jason.

The tutorial provides a good use case of using RAG to analyse sustainability reports, more specifically, extracting basic information about a company, its sector etc. I asked myself - "Why not take this one step further? Why not create an AI agent that helps me analyse a sustainability report and see whether it meets the principles outlined in Canada's Competition Act, specifically around environmental claims and greenwashing?"

Last year, the Competition Bureau released its draft principles on environmental claims. While they have not been finalized at the time of this writing, I thought they would be a good starting point for this use case.

From an analytics perspective, greenwashing as a topic has always fascinated me. Why, you may ask?

Simply put, it's because it is hard to 'quantify' greenwashing. Sure, it does involve numbers. Let's say I go out on a limb and say that my product is going to save 50% emissions compared to other products in the market. You would be like 'How did you come up with that number?' and I'm going to be like:

![](images/proxy-image.jpg){width="422"}

You would tell me to do a trend analysis of my historical emissions, or to start by sourcing emissions data for the industry and for other products. Maybe even do a life cycle assessment (LCA), and then I might get to a number. So yes, there are numbers involved. But it's not as if I can simply 'measure' greenwashing. Over the last few months, I was curious to see if there was any work done on this topic. I came across some good papers such as [this](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4582917) one by Lagasio (2023), who introduces a 'Greenwashing Severity Index' (GSI), or [this](https://doi.org/10.1007/s10551-012-1360-0) one by Chen and Chang (2013), who suggest measuring greenwashing at the product level, along with green consumer confusion, green perceived risk and green trust, based on survey responses.

However, [Dorfleitner and Utz (2023)](https://link.springer.com/article/10.1007/s11846-023-00718-w) sum it up best:

> So far, the literature does not present a widely accepted framework to
> measure greenwashing. Moreover, approaches based on surveys and case
> studies are not scalable on a broad sample of firms.

In short, there is no unified framework to measure greenwashing. So while this use case is not an attempt to quantify greenwashing, it's simply a glimpse into how we could leverage analytics to make our lives easier.


```python
# Allow nested event loops (LlamaParse runs async code, which clashes with notebooks otherwise)
import nest_asyncio
nest_asyncio.apply()

import json
import os
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.schema import Document
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# --- Prompt Template: GENERAL INFO ---
PROMPT_TEMPLATE_GENERAL = """
You are tasked with the role of a climate scientist, assigned to analyze a company's sustainability report. Based on the following extracted parts from the sustainability report, answer the given QUESTIONS.
If you don't know the answer, just say that you don't know by answering "NA". Don't try to make up an answer.

Given are the following sources:
--------------------- [BEGIN OF SOURCES]
{context}
--------------------- [END OF SOURCES]

QUESTIONS:
1. What is the company of the report?
2. What sector does the company belong to?
3. Where is the company located?

Format your answers in JSON format with the following keys: COMPANY_NAME, COMPANY_SECTOR, COMPANY_LOCATION.
Your FINAL_ANSWER in JSON (ensure there's no format error):
"""


# --- Load and parse document ---
# Never commit a real key; this placeholder must be replaced (or set in your shell instead)
os.environ["LLAMA_CLOUD_API_KEY"] = "<your-llama-cloud-api-key>"  # Replace with your actual key
parser = LlamaParse(result_type="markdown")
file_extractor = {".pdf": parser}

docs = SimpleDirectoryReader(
    input_files=[r"C:\Users\aryam\Downloads\Documents\427857-1-_7_Nike-2024-Combo_Form-10-K_WR.pdf"],
    file_extractor=file_extractor
).load_data()

# --- Split into chunks ---
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
document_splits = []

for doc in docs:
    chunks = text_splitter.split_text(doc.text)
    for chunk in chunks:
        # Keep the parent document id so each chunk can be traced back to its source
        document_splits.append(Document(page_content=chunk, metadata={"source": doc.get_doc_id()}))

# --- Embed and store ---
local_embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(documents=document_splits, embedding=local_embeddings)
retriever = vectorstore.as_retriever()

# --- Load LLM model ---
model = ChatOllama(model="gemma3:27b")  # Ensure this model is available in Ollama

# --- Build the RAG chain using the GENERAL template ---
general_prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE_GENERAL)

question_answer_chain = create_stuff_documents_chain(model, general_prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

# --- Run the pipeline ---
# The "input" text is what the retriever searches for; the questions live in the prompt template
results = rag_chain.invoke({"input": "Extract basic company information from the report."})

# --- Parse and pretty-print the answer ---
try:
    final_answer = json.loads(results["answer"])
    print("\n📄 Extracted Company Information:")
    print(f"🔹 Company Name   : {final_answer.get('COMPANY_NAME', 'NA')}")
    print(f"🔹 Sector         : {final_answer.get('COMPANY_SECTOR', 'NA')}")
    print(f"🔹 Location       : {final_answer.get('COMPANY_LOCATION', 'NA')}")
except json.JSONDecodeError:
    print("\n⚠️ Unable to parse model output as JSON:")
    print(results["answer"])
```
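One practical wrinkle with the parsing step: chat models often wrap their JSON in markdown code fences or add a sentence before it, in which case a plain `json.loads` call ends up in the `except` branch. Below is a minimal sketch of a more forgiving parser that uses only the standard library. The helper name `extract_json` is hypothetical (it is not part of LangChain or Ollama), and it assumes the reply contains at most one JSON object.

```python
import json
import re

# Three backticks, built programmatically so this snippet can sit inside a markdown fence
FENCE = "`" * 3


def extract_json(text: str) -> dict:
    """Return the JSON object embedded in an LLM reply, or {} if none can be parsed."""
    # Drop markdown fence markers (with an optional "json" language tag)
    cleaned = re.sub(re.escape(FENCE) + r"(?:json)?", "", text)
    # Keep everything from the first opening brace to the last closing brace
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return {}
    # Only accept a JSON object, since the prompt asks for keyed answers
    return parsed if isinstance(parsed, dict) else {}
```

With something like this in place, `extract_json(results["answer"])` could stand in for the `json.loads` call, and an empty dict would signal that the raw answer needs a manual look.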

So there you have it. A very simple use case of RAG to analyse greenwashing. Is it perfect? No, far from it. At the end of the day, it still requires human overlay because like I said earlier, greenwashing is a very nuanced topic. Nevertheless, it is still pretty cool to play around with it and get a sense of what is possible within the realm of ESG and AI.
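The pipeline above only pulls out company metadata. One way it might be pointed at environmental claims is to swap in a second prompt that carries one principle at a time. The sketch below is only that, a sketch: `PROMPT_TEMPLATE_CLAIMS`, `build_claims_template`, the `<<PRINCIPLE>>` marker and the JSON keys are all hypothetical names, and the principle text is a placeholder that you would fill in yourself from the Competition Bureau's document.

```python
# Hypothetical template, modelled on PROMPT_TEMPLATE_GENERAL; {context} is left for the chain to fill
PROMPT_TEMPLATE_CLAIMS = """
You are tasked with the role of a climate scientist, assigned to analyze a company's sustainability report. Based on the following extracted parts from the sustainability report, answer the given QUESTION.
If you don't know the answer, just say that you don't know by answering "NA". Don't try to make up an answer.

Given are the following sources:
--------------------- [BEGIN OF SOURCES]
{context}
--------------------- [END OF SOURCES]

PRINCIPLE:
<<PRINCIPLE>>

QUESTION:
Do the sources contain environmental claims that may be inconsistent with the PRINCIPLE?

Format your answer in JSON format with the following keys: VERDICT, EVIDENCE, EXPLANATION.
Your FINAL_ANSWER in JSON (ensure there's no format error):
"""


def build_claims_template(principle: str) -> str:
    """Insert one principle into the template while keeping {context} as a template variable."""
    # Double any braces in the principle so the prompt template treats them as literal text
    safe_principle = principle.replace("{", "{{").replace("}", "}}")
    return PROMPT_TEMPLATE_CLAIMS.replace("<<PRINCIPLE>>", safe_principle)


# Placeholder only: paste the wording of a single principle here
claims_template = build_claims_template("<text of one draft principle goes here>")
```

The resulting string could then be passed to `ChatPromptTemplate.from_template` in place of `PROMPT_TEMPLATE_GENERAL`, with the chain invoked once per principle. Whether a local model gives a sensible verdict is a separate question, so the human review mentioned above still applies.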

As far as next steps are concerned, I will try to tinker around with it.

Overall, this was a fun learning project for me. As a newbie to RAG, I have barely scratched the surface. There are so many rabbit holes I want to explore such as playing around with different models, refining the prompts and tinkering with other technical parameters to get better outputs.

Maybe also create a nice web application. Some of the ones that I found really interesting were:

-   [SEC Insights](https://github.com/run-llama/sec-insights?tab=readme-ov-file) - As I was exploring some of the different use cases of RAG, I found that a 10-K RAG agent was very popular. You have an agent that scours complex 10-K filings and answers some of the questions you might have. I really liked this one because it largely overlaps with my use case. Granted, 10-K filings have a structured format, but analysing ESG reports seems like a natural extension of these financial chatbots.

-   Llama Banker - Similar to the previous one, but I stumbled across this tutorial and really liked how Nick highlighted some of the challenges he encountered while building this application.

-   [Agentic Company Researcher](https://github.com/pogjester/company-research-agent) - This one is a little bit more advanced as it involves retrieving data from multiple sources such as company websites, news articles, financial reports, and industry analyses.

So why not expand it to ESG and sustainable finance?


# Competition Act

The basic

scan the documents, I will provide instructions. For reference,

The only thing is it was using OpenAI. Llama was the key, I wanted to build something that was open-source.

My first reaction was - Llama models.
