<a href="https://colab.research.google.com/github/ShridharBagalkote/python/blob/main/RAG_Model_with_GeminiAI_ChromaDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
## All together
!pip install chromadb # is used to save vectors Database
!pip install langchain
!pip install langchain-community
!pip install langchain-google-genai

Collecting langchain-community
  Downloading langchain_community-0.3.16-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.16 (from langchain-community)
  Downloading langchain-0.3.17-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.32 (from langchain-community)
  Downloading langchain_core-0.3.33-py3-none-any.whl.metadata (6.3 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.7.1-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.0-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-

# Step-1: Import the packages

In [5]:
from langchain.prompts import PromptTemplate # Prompt template
from langchain.vectorstores import Chroma   # Store the vectors
from langchain.text_splitter import RecursiveCharacterTextSplitter # Chunks
from langchain.document_loaders import TextLoader  # Load the text
from langchain.chains import VectorDBQA,RetrievalQA, LLMChain # Chains and Retrival ans
from langchain.retrievers.multi_query import MultiQueryRetriever # Multiple Answers
from langchain_google_genai import ChatGoogleGenerativeAI # GenAI model to retrive
from langchain_google_genai import GoogleGenerativeAIEmbeddings # GenAI model to conver words

## Step-2: Load the data

In [6]:
# Load documents
loader = TextLoader('/content/State_union.txt')
documents = loader.load()

# Step-3: Divide into chunks

In [7]:
# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

## Step-4: Set up the models

- One is embedding model

- One is Chat model

In [8]:
# Set up embeddings
embeddings = GoogleGenerativeAIEmbeddings(
    model='models/embedding-001',
    google_api_key='AIzaSyAUH70gKFSmR52QAbZq4fJFM3WSbTYCHp8',
    task_type="retrieval_query"
)


In [9]:
from google.generativeai.types.safety_types import HarmBlockThreshold, HarmCategory

safety_settings = {
                    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
                    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
                    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
                    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
                    }

chat_model = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    google_api_key='AIzaSyAUH70gKFSmR52QAbZq4fJFM3WSbTYCHp8',
    temperature=0.3,
    safety_settings=safety_settings
)


# Step-5: Get the Embeddings store in VectorDB

In [14]:
# Create the vector store
vectordb = Chroma.from_documents(documents=texts, embedding=embeddings)

# Step-6: Make the Prompt Template

In [15]:
prompt_template = """
## Safety and Respect Come First!

You are programmed to be a helpful and harmless AI. You will not answer requests that promote:

* **Harassment or Bullying:** Targeting individuals or groups with hateful or hurtful language.
* **Hate Speech:**  Content that attacks or demeans others based on race, ethnicity, religion, gender, sexual orientation, disability, or other protected characteristics.
* **Violence or Harm:**  Promoting or glorifying violence, illegal activities, or dangerous behavior.
* **Misinformation and Falsehoods:**  Spreading demonstrably false or misleading information.

**How to Use You:**

1. **Provide Context:** Give me background information on a topic.
2. **Ask Your Question:** Clearly state your question related to the provided context.

**Please Note:** If the user request violates these guidelines, you will respond with:
"I'm here to assist with safe and respectful interactions. Your query goes against my guidelines. Let's try something different that promotes a positive and inclusive environment."

##  Answering User Question:

Answer the question as precisely as possible using the provided context. The context can be from different topics. Please make sure the context is highly related to the question. If the answer is not in the context, you only say "answer is not in the context".

Context: \n {context}
Question: \n {question}
Answer:
"""


prompt = PromptTemplate(template = prompt_template, input_variables=['context','question'])

In [16]:
print(prompt)

input_variables=['context', 'question'] input_types={} partial_variables={} template='\n## Safety and Respect Come First!\n\nYou are programmed to be a helpful and harmless AI. You will not answer requests that promote:\n\n* **Harassment or Bullying:** Targeting individuals or groups with hateful or hurtful language.\n* **Hate Speech:**  Content that attacks or demeans others based on race, ethnicity, religion, gender, sexual orientation, disability, or other protected characteristics.\n* **Violence or Harm:**  Promoting or glorifying violence, illegal activities, or dangerous behavior.\n* **Misinformation and Falsehoods:**  Spreading demonstrably false or misleading information.\n\n**How to Use You:**\n\n1. **Provide Context:** Give me background information on a topic.\n2. **Ask Your Question:** Clearly state your question related to the provided context.\n\n**Please Note:** If the user request violates these guidelines, you will respond with:\n"I\'m here to assist with safe and resp

# Step-7: Create tha QA chains

In [17]:
# Create the QA
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
                                                  llm=chat_model)

qa_chain = RetrievalQA.from_chain_type(llm=chat_model,
                                       retriever= retriever_from_llm,
                                       return_source_documents=True,
                                       chain_type="stuff",
                                       chain_type_kwargs={"prompt": prompt}
                                      )

In [18]:
# Run the query
response = qa_chain.invoke({"What did the president say about Ketanji Brown Jackson?"})
print(response)

{'query': {'What did the president say about Ketanji Brown Jackson?'}, 'result': "The president said he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson to the Supreme Court four days prior.  He called her one of the nation's top legal minds who will continue Justice Breyer's legacy of excellence.", 'source_documents': [Document(metadata={'source': '/content/State_union.txt'}, page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections.\n\nWe cannot let this happen.\n\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.\n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your s

In [19]:
response

{'query': {'What did the president say about Ketanji Brown Jackson?'},
 'result': "The president said he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson to the Supreme Court four days prior.  He called her one of the nation's top legal minds who will continue Justice Breyer's legacy of excellence.",
 'source_documents': [Document(metadata={'source': '/content/State_union.txt'}, page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections.\n\nWe cannot let this happen.\n\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.\n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your

In [20]:
response.keys()

dict_keys(['query', 'result', 'source_documents'])

In [22]:
from langchain.prompts import PromptTemplate # Prompt template
from langchain.vectorstores import Chroma   # Store the vectors
from langchain.text_splitter import RecursiveCharacterTextSplitter # Chunks
from langchain.document_loaders import TextLoader  # Load the text
from langchain.chains import VectorDBQA,RetrievalQA, LLMChain # Chains and Retrival ans
from langchain.retrievers.multi_query import MultiQueryRetriever # Multiple Answers
from langchain_google_genai import ChatGoogleGenerativeAI # GenAI model to retrive
from langchain_google_genai import GoogleGenerativeAIEmbeddings # GenAI model to conver words
from langchain.schema import Document
import pandas as pd

# Load documents
# Load Excel data
df = pd.read_excel('/content/extracted_job_d.xlsx')


# Convert the 'text' column to a list of LangChain Document objects
documents = [Document(page_content=text) for text in df['text'].tolist()]
# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Set up embeddings
embeddings = GoogleGenerativeAIEmbeddings(
    model='models/embedding-001',
    google_api_key='AIzaSyAUH70gKFSmR52QAbZq4fJFM3WSbTYCHp8',
    task_type="retrieval_query"
)



from google.generativeai.types.safety_types import HarmBlockThreshold, HarmCategory

safety_settings = {
HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
                    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
                    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
                    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,

}
chat_model = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    google_api_key='AIzaSyAUH70gKFSmR52QAbZq4fJFM3WSbTYCHp8',
    temperature=0.3,  # Adjust the temperature here
    safety_settings=safety_settings
)


# Create the vector store
vectordb = Chroma.from_documents(documents=texts, embedding=embeddings)

prompt_template = """
## Safety and Respect Come First!

You are programmed to be a helpful and harmless AI. You will not answer requests that promote:

* **Harassment or Bullying:** Targeting individuals or groups with hateful or hurtful language.
* **Hate Speech:**  Content that attacks or demeans others based on race, ethnicity, religion, gender, sexual orientation, disability, or other protected characteristics.
* **Violence or Harm:**  Promoting or glorifying violence, illegal activities, or dangerous behavior.
* **Misinformation and Falsehoods:**  Spreading demonstrably false or misleading information.

**How to Use You:**

1. **Provide Context:** Give me background information on a topic.
2. **Ask Your Question:** Clearly state your question related to the provided context.

**Please Note:** If the user request violates these guidelines, you will respond with:
"I'm here to assist with safe and respectful interactions. Your query goes against my guidelines. Let's try something different that promotes a positive and inclusive environment."

##  Answering User Question:

Answer the question as precisely as possible using the provided context. The context can be from different topics. Please make sure the context is highly related to the question. If the answer is not in the context, you only say "answer is not in the context".

Context: \n {context}
Question: \n {question}
Answer:
"""


prompt = PromptTemplate(template = prompt_template, input_variables=['context','question'])
# Create the QA
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
                                                  llm=chat_model)

qa_chain = RetrievalQA.from_chain_type(llm=chat_model,
                                       retriever= retriever_from_llm,
                                       return_source_documents=True,
                                       chain_type="stuff",
                                       chain_type_kwargs={"prompt": prompt}
                                      )

# Run the query
response = qa_chain.invoke({"What are general job roles"})

print(response)

{'query': {'What are general job roles'}, 'result': 'Senior level Full stack developer, Midlevel Full stack developer, API developer, Java Lead Software Engineer, Test Automation Engineer, Data architect, Java Developer, .NET Developer, Java full stack engineer, Product owner, UX designer, Senior data engineer, Salesforce tester, Data Engineer, Data Analyst.', 'source_documents': [Document(metadata={}, page_content='technologies encryption types and protocolsstandards RBAC based access for cluster namespaces Vulnerability and threat management Professional certifications CIMP CIAM CISSP'), Document(metadata={}, page_content='Hi Folks I hope you are doing well Please go through below requirements are W2 and relocation is fine and except CPT Visa any visa can work Job Role Senior level Full stack developer Location Charlotte NC Contract W2 only Minimum 0813 years of experience Mandatory skills Java Spring Boot Angular Node Kafka hashtagW2 only Please dont share resumes for C2C roles Its 