# Local Query System Using Gemini, RAG, and Vector Stores


Lead Data Scientist: Y.B.

## Project Overview
This notebook walks step by step through building a query system over local documents using the free Gemini model, Retrieval-Augmented Generation (RAG), and a vector store. It loads PDFs from a folder, splits them into chunks, embeds and indexes the chunks, and answers questions by retrieving the most relevant chunks and passing them to the model. It also covers environment setup, dependency configuration, and tips for getting API access working.

In [None]:
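# Embeddings, vector store, text splitter, QA chains, and folder loader
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA, VectorDBQA  # VectorDBQA is the older chain, kept for reference
from langchain_community.document_loaders import DirectoryLoader
import config  # local module that holds the API key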

In [3]:
from langchain_community.document_loaders import PyPDFLoader

In [4]:
#loader = PyPDFLoader('/Users/yvonne/LLMChatbox/Doc/FY2024-NVIDIA-Corporate-Sustainability-Report.pdf')
#documents = loader.load()

In [5]:
loader = DirectoryLoader(
    path="files_folder_path",
    glob="**/*.pdf",  # Match PDFs in the folder and its subfolders ("*.pdf" alone matches only the top level)
    loader_cls=PyPDFLoader
)

documents = loader.load()
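
Whether subfolders are searched depends on the glob pattern. The sketch below uses only `pathlib` on a throwaway folder tree, on the assumption that the loader resolves its `glob` argument the way `Path.glob` does; the helper name `find_pdfs` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_pdfs(root: str, pattern: str) -> list:
    """Return paths (relative to root) that match the glob pattern, sorted."""
    base = Path(root)
    return sorted(p.relative_to(base).as_posix() for p in base.glob(pattern))


# Build a temporary tree: one PDF at the top level, one inside a subfolder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "reports").mkdir()
    (Path(tmp) / "top.pdf").touch()
    (Path(tmp) / "reports" / "nested.pdf").touch()

    top_level_only = find_pdfs(tmp, "*.pdf")
    with_subfolders = find_pdfs(tmp, "**/*.pdf")

print(top_level_only)   # ['top.pdf']
print(with_subfolders)  # ['reports/nested.pdf', 'top.pdf']
```

If a report stored in a subfolder never shows up in the answers, the pattern is a reasonable first thing to check.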

In [6]:
from google import genai
client = genai.Client(api_key=config.Google_API_KEY50323)
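
The `config` module is not shown in this notebook. One possible way to write it, sketched here as an assumption rather than the author's actual file, is to read the key from an environment variable so that it never lands in the notebook or in version control. The variable name `GOOGLE_API_KEY` and the helper `load_api_key` are illustrative choices.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Read the API key from an environment variable and fail loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key
```

A hypothetical `config.py` could then contain the single line `Google_API_KEY50323 = load_api_key()`, which keeps the attribute name used in the cells above.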

### Test API connection

In [33]:
from google import genai
client = genai.Client(api_key=config.Google_API_KEY50323)

response = client.models.generate_content(
    model="gemini-2.0-flash", contents="Please explain ESG in 2 sentences"
)
print(response.text)

ESG stands for Environmental, Social, and Governance, representing a framework for evaluating a company's impact and sustainability across these three key areas. Investors and stakeholders use ESG factors to assess risks and opportunities beyond traditional financial metrics, informing investment decisions and promoting responsible business practices.



### Gemini API Installation tips:
1- Associate the API with your Google account project by running the following commands in a terminal:


In [None]:
gcloud init 
gcloud auth application-default login

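2- In the Google project management console, enable the relevant API and create a credential, then point the environment variable below at the downloaded JSON credentials file: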

In [7]:
import os 
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "json_credentials_file"
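
A wrong path here tends to surface later as an authentication error that is hard to trace. A small sketch that checks the file before exporting the variable; the helper name `set_credentials_path` is hypothetical, and it only verifies that the file exists and parses as JSON, not that the credentials are valid.

```python
import json
import os
from pathlib import Path


def set_credentials_path(json_path: str) -> str:
    """Validate a credentials file, then export its absolute path."""
    path = Path(json_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"No credentials file at {path}")
    with path.open(encoding="utf-8") as fh:
        json.load(fh)  # raises json.JSONDecodeError if the file is not valid JSON
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path)
    return str(path)
```

Calling `set_credentials_path("json_credentials_file")` with the real path replaces the direct assignment above.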

https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login
https://googleapis.dev/python/google-auth/2.6.6/user-guide.html

### Define the LLM to use:

In [None]:
from google import genai
from google.genai import types
from langchain_google_genai import ChatGoogleGenerativeAI


In [10]:
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.7, top_p=0.85)
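
To build intuition for `temperature` and `top_p`, here is a toy sketch of how these two sampling controls are commonly described: temperature rescales the token scores before the softmax, and top-p (nucleus) filtering keeps only the most likely tokens whose combined probability reaches the threshold. The logits are made up, and this is not a claim about how Gemini implements sampling internally.

```python
import math


def apply_temperature(logits: list, temperature: float) -> list:
    """Softmax over logits / temperature; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs: list, top_p: float) -> list:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
probs = apply_temperature(logits, temperature=0.7)
filtered = nucleus_filter(probs, top_p=0.85)
print([round(p, 3) for p in probs])
print([round(p, 3) for p in filtered])
```

With these made-up scores the two least likely tokens end up with zero probability after filtering, which is the sense in which a lower `top_p` makes answers more conservative.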

### Build the retrieval system

In [37]:
from langchain_community.vectorstores import FAISS
from langchain_core.vectorstores import VectorStoreRetriever

In [14]:
#retriever = VectorStoreRetriever(vectorstore=FAISS(...))

In [None]:
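# Split the loaded documents into chunks of about 1000 characters with no overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# Embed each chunk with the Gemini embedding model and index the vectors in Chroma
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004", api_key=config.Google_API_KEY50323)
docsearch = Chroma.from_documents(texts, embeddings)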
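
The effect of `chunk_size` and `chunk_overlap` is easier to see on a tiny string. The sketch below is a simplified fixed-window splitter written for illustration; the real `CharacterTextSplitter` splits on a separator and merges the pieces, so its boundaries will differ. The function name `split_fixed` is hypothetical.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghij"
no_overlap = split_fixed(sample, chunk_size=4)
with_overlap = split_fixed(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With zero overlap, a sentence or table row that straddles a boundary is split between two chunks; a small overlap trades extra storage for a better chance that one chunk contains the whole passage.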

In [11]:
retriever = docsearch.as_retriever(search_kwargs={"k": 3})
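
Conceptually, a retriever with `k=3` embeds the query, scores every stored chunk by similarity, and returns the three best matches. The toy sketch below shows that idea with word counts standing in for real embeddings and cosine similarity as the score; the chunks are made up, and the actual vector store may use a different distance metric and an approximate index.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list, k: int = 3) -> list:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [  # made-up chunks for illustration
    "scope 1 emissions from owned facilities",
    "board governance and ethics policy",
    "scope 2 emissions from purchased electricity",
    "employee volunteering programs",
]
hits = top_k("scope 1 emissions", chunks, k=2)
print(hits)
```

A larger `k` gives the model more context at the cost of a longer prompt and more room for irrelevant chunks.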

In [19]:
qa = RetrievalQA.from_chain_type(llm, retriever=retriever)
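
As far as I understand, when no `chain_type` is passed the chain uses the "stuff" strategy (the same one named in the commented-out `VectorDBQA` line): all retrieved chunks are pasted into a single prompt together with the question. A rough sketch of that idea follows; the prompt wording and the function name `build_stuff_prompt` are illustrative and are not LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list) -> str:
    """Concatenate the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    "What are the Scope 1 emissions?",
    ["Chunk A text.", "Chunk B text."],  # placeholders for retrieved chunks
)
print(prompt)
```

Since everything goes into one prompt, `k` times the chunk size has to fit within the model's context window.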

In [17]:
#qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch)

### Ask Questions based on the documents in the folder

In [47]:
query1 = "What's carbon emission for NVDA in 2023?"
qa.invoke(query1)

{'query': "What's carbon emission for NVDA in 2023?",
 'result': "Here's a breakdown of NVIDIA's carbon emissions (MT CO2e) for FY23, according to the provided document:\n\n*   **Scope 1:** 12,346\n*   **Scope 2 (market-based):** 60,671\n*   **Scope 1 and 2 (market-based):** 73,017\n*   **Scope 2 (location-based):** 142,909\n*   **Scope 3:** 3,514,000\n\n    *   Category 1: Purchased goods and services: 2,975,189\n    *   Category 2: Capital goods: 353,280\n    *   Category 3: Fuel-and energy-related activities: 67,805\n    *   Category 4: Upstream transportation and distribution: 60,572\n    *   Category 5: Waste generated in operations: 579\n    *   Category 6: Business travel: 8,633\n    *   Category 7: Employee commuting: 14,990\n    *   Category 8: Upstream leased assets: 32,952"}
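
The chain returns a dictionary with `query` and `result` keys, and `result` is markdown-style text. A quick way to sanity-check a generated answer is to pull out the numbers and test whether they add up. The sketch below does this for the Scope 1 and Scope 2 (market-based) lines of the answer above; the regex assumes the `**Label:** value` layout seen in this output, which the model is not guaranteed to repeat.

```python
import re


def extract_metrics(result: str) -> dict:
    """Pull '**Label:** 1,234' pairs out of a markdown-style answer."""
    pattern = re.compile(r"\*\*(.+?):\*\*\s*([\d,]+)")
    return {label: int(num.replace(",", "")) for label, num in pattern.findall(result)}


# The first three bullet lines of the answer shown above.
result = (
    "*   **Scope 1:** 12,346\n"
    "*   **Scope 2 (market-based):** 60,671\n"
    "*   **Scope 1 and 2 (market-based):** 73,017\n"
)
metrics = extract_metrics(result)
total_matches = (
    metrics["Scope 1"] + metrics["Scope 2 (market-based)"]
    == metrics["Scope 1 and 2 (market-based)"]
)
print(total_matches)  # True
```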

In [48]:
query2 = "What's carbon emission for boeing in 2023?"
qa.invoke(query2)

{'query': "What's carbon emission for boeing in 2023?",
 'result': 'Boeing achieved net-zero carbon emissions (Scope 1 and Scope 2) at manufacturing and other work sites, and in business travel (Scope 3, Category 6) in 2023 for the fourth consecutive year, by expanding conservation and renewable energy procurement while securing third-party-verified offsets for the remaining greenhouse gas (GHG) emissions.\n\nHere are the carbon emissions for Boeing in 2023:\n*   Scope 1 GHG: 536,000 metric tons CO2e\n*   Scope 2 GHG (location-based): 764,000 metric tons CO2e\n*   Scope 2 GHG (market-based): 380,000 metric tons CO2e\n*   Scope 3 GHG - business travel: 254,000 metric tons CO2e\n*   Scope 3 GHG - use of sold products (Commercial Airplanes): 427,000,000 metric tons CO2e\n*   Scope 3 GHG - use of sold products (Defense, Space & Security): 21,000,000 metric tons CO2e\n*   Total calculated GHG excluding sold products: 1,170,000 metric tons CO2e\n*   Core metrics sites GHG (location-based): 6

In [49]:
query3 = "please compare the carbon emission between boeing and NVDA in 2023? "
qa.invoke(query3)

{'query': 'please compare the carbon emission between boeing and NVDA in 2023? ',
 'result': "In 2023, Boeing's Scope 1 GHG emissions were 642,000 metric tons CO2e, Scope 2 (market-based) GHG emissions were 401,000 metric tons CO2e. NVIDIA's Scope 1 GHG emissions were 12,346 metric tons CO2e, and Scope 2 (market-based) GHG emissions were 60,671 metric tons CO2e."}
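
Because each answer is generated from whichever chunks were retrieved for that query, the same metric can come back with different values: Boeing's Scope 1 figure is 536,000 in the second answer and 642,000 in the third. A minimal sketch for flagging such mismatches is shown below; the helper `first_number_after` is hypothetical, and its regex simply takes the first number that follows a label, so it is only as reliable as the answer's wording.

```python
import re
from typing import Optional


def first_number_after(text: str, label: str) -> Optional[int]:
    """Return the first integer that follows label in text, or None if there is none."""
    match = re.search(re.escape(label) + r"\D*?(\d[\d,]*)", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Excerpts copied from the second and third answers above.
answer_q2 = "Scope 1 GHG: 536,000 metric tons CO2e"
answer_q3 = "Boeing's Scope 1 GHG emissions were 642,000 metric tons CO2e"

value_q2 = first_number_after(answer_q2, "Scope 1 GHG")
value_q3 = first_number_after(answer_q3, "Scope 1 GHG")
mismatch = value_q2 != value_q3
print(value_q2, value_q3, mismatch)  # 536000 642000 True
```

When a mismatch appears, the figure can be checked against the source PDF before it is reused.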

In [50]:
query4 = "Based on ESG fundamental, what is major risk for boeing?"
qa.invoke(query4)

{'query': 'Based on ESG fundamental, what is major risk for boeing?',
 'result': 'Based on the provided text, major risks for Boeing include:\n\n*   Safety and quality issues.\n*   Increase in supply chain risks from conflicts and geopolitical events.\n*   Effects of climate change and legal, regulatory or market responses to such change.\n*   Potential environmental liabilities.'}