**langchain langchain-google-genai** This installs LangChain together with the langchain-google-genai package, the LangChain integration for Google's Generative AI models. It lets developers use Google's models inside their LangChain applications.

**pillow** This installs Pillow, the maintained fork of the Python Imaging Library (PIL). Pillow provides a simple interface for opening, manipulating, and saving images in various file formats.

In [1]:
!pip install langchain langchain-google-genai pillow



---
**import os** The cell below imports the os module from the Python standard library, which provides a platform-independent way to interact with the underlying operating system. Here it is used to set the GOOGLE_API_KEY environment variable when it is not already defined, so that the Google Generative AI classes can find the key.

In [2]:
import os

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = "Enter your google API key here"
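
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming an interactive session, is to prompt for the key only when the variable is missing. The helper name `ensure_api_key` and its `prompt_fn` and `env` parameters are hypothetical, introduced here for illustration.

```python
import os
from getpass import getpass


def ensure_api_key(name="GOOGLE_API_KEY", prompt_fn=getpass, env=os.environ):
    """Return the key stored under `name`, prompting for it only if it is missing."""
    if not env.get(name):
        # getpass hides the typed value, so the key does not show up in the cell output
        env[name] = prompt_fn(f"Enter {name}: ")
    return env[name]
```

Calling `ensure_api_key()` once before creating the embedding model or the chat model should be enough, since both read the key from the environment.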

In [3]:
!pip install langchain_community



---
**Necessary Libraries**

In [4]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain_core.prompts.prompt import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableLambda


---
**!pip install pypdf** installs the pypdf library, which PyPDFLoader relies on to read PDF files.

In [5]:
!pip install pypdf



---
The **PyPDFLoader** is the document-loading step of the RAG (Retrieval-Augmented Generation) pipeline: it reads a PDF file and returns a list of Document objects, one per page.

In [6]:
# Load PDF document using PyPDFLoader
loader = PyPDFLoader("path to pdf file")
pages = loader.load()

---
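The **CharacterTextSplitter** is the chunking step of the RAG (Retrieval-Augmented Generation) pipeline: it splits the text on a separator (here a newline) and merges the pieces into chunks of roughly chunk_size characters at most, with up to chunk_overlap characters shared between neighboring chunks so that context is not lost at chunk boundaries.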

In [7]:
# Split the loaded document into smaller chunks for processing
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=900,
    chunk_overlap=150,
    length_function=len,
)

splits = text_splitter.split_documents(pages)
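
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified plain-Python sketch of separator-based splitting with overlap. It is not LangChain's actual implementation; the function name `split_with_overlap` is hypothetical and the sample text is made up.

```python
def split_with_overlap(text, separator="\n", chunk_size=900, chunk_overlap=150):
    """Greedily merge separator-delimited pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        # Close the current chunk when adding the next piece would exceed chunk_size
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only as many trailing pieces as fit in the overlap budget
            while current and joined_len(current) > chunk_overlap:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Ten short lines, split with deliberately small limits to make the overlap visible
sample = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_with_overlap(sample, chunk_size=25, chunk_overlap=8)
```

In this sketch, the last line of each chunk reappears as the first line of the next one. A single piece longer than `chunk_size` is emitted as an oversized chunk rather than cut in the middle.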

---
This cell creates the embedding model. **Embeddings** represent the semantic meaning of text in a numerical format, which is essential for efficient retrieval and comparison of relevant information; the vectors themselves are computed later, when the chunks are added to the vector store.

In [8]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
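
As an illustration of how numerical vectors allow texts to be compared, the sketch below computes cosine similarity, one common measure (the vector store may be configured with a different distance metric). The three-dimensional vectors are made up for the example; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of a query and two chunks
query = [1.0, 0.0, 1.0]
related = [0.9, 0.1, 0.8]
unrelated = [0.0, 1.0, 0.0]

print(cosine_similarity(query, related) > cosine_similarity(query, unrelated))
```

A chunk whose vector points in nearly the same direction as the query vector scores close to 1, while an orthogonal vector scores 0.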

---

**ChromaDB** is a vector database that efficiently stores and queries high-dimensional embeddings.

In [9]:
!pip install chromadb



In [10]:
# Initializes a Chroma vector database with document embeddings and text chunks.
vectorDb = Chroma.from_documents(
    embedding=embeddings,
    documents=splits,
    persist_directory="/docs/chroma"
)

The **Retriever interface** in LangChain defines a standard way for retrieving relevant documents from a data source. By creating a Retriever from a VectorStore, you can use the Retriever interface to interact with the VectorStore in a consistent way, without having to worry about the underlying implementation details.

The **as_retriever()** method creates a VectorStoreRetriever object, which is a wrapper around the VectorStore object. This VectorStoreRetriever implements the Retriever interface and uses the search capabilities of the VectorStore to find relevant documents.

In [11]:
retriever = vectorDb.as_retriever()
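
The value of a common retriever interface is that calling code only needs an `invoke(query)` method that returns ranked documents. The class below is a hypothetical stand-in that ranks texts by shared words; the real VectorStoreRetriever ranks by embedding similarity instead, so this is only a sketch of the interface idea.

```python
class ToyRetriever:
    """Hypothetical stand-in: ranks texts by how many query words they share."""

    def __init__(self, texts, k=2):
        self.texts = texts
        self.k = k

    def invoke(self, query):
        query_words = set(query.lower().split())

        def overlap(text):
            return len(query_words & set(text.lower().split()))

        # Highest overlap first; keep the top k results
        ranked = sorted(self.texts, key=overlap, reverse=True)
        return ranked[: self.k]


toy = ToyRetriever(
    ["data cleaning removes duplicates", "logistic regression accuracy", "weather today"],
    k=1,
)
top = toy.invoke("how is data cleaning done")
```

Because the rest of the pipeline depends only on `invoke`, the backing store can be swapped without changing the code that consumes the results.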

---
This **prompt** provides clear instructions for an AI assistant to generate responses based on a provided document. It specifies the expected format for the response, including a direct answer, supporting information from the document, and an acknowledgment if the relevant information is not found in the document. This structure helps ensure the assistant's responses are accurate, detailed, and transparent about the limitations of the available information.

In [12]:
prompt = """You are an AI assistant designed to answer questions based on a provided document.
   Please use the information in the document to generate accurate and detailed responses.
   If the information is not found in the document, indicate that the document does not contain the relevant information.


   Use the following format for structuring response:

   Question: {question}
   Document: {document}

   Answer Format:
        Direct Answer: Provide a concise and direct answer to the question based on the document.
        Supporting Information: Include relevant details or excerpts from the document to support the answer. Use quotes or references to specific sections of the document. """

In [13]:
# Creates a PromptTemplate object that formats the provided prompt with the specified input variables (question and document).

prompt_template = PromptTemplate(template=prompt, input_variables=["question", "document"])
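
PromptTemplate uses f-string style placeholders by default, so for a simple template the filling step behaves like Python's `str.format`. A plain-Python sketch with a shortened, made-up template and made-up values:

```python
# Shortened stand-in for the full prompt; the placeholder names match the input variables
template = "Question: {question}\nDocument: {document}"

filled = template.format(
    question="What is data cleaning?",
    document="Data cleaning handles missing values and duplicates.",
)
print(filled)
```

One consequence of this style is that any literal curly braces in the prompt text need to be doubled (`{{` and `}}`) so they are not treated as placeholders.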

In [14]:
# Creates a ChatGoogleGenerativeAI chat model instance (using the gemini-pro model) and assigns it to the variable llm.

llm = ChatGoogleGenerativeAI(model="gemini-pro")

In [15]:
# Concatenates the content of multiple pages (docs) into a single string, separating each page's content with two newline characters.
def format_pages(pages):
    return "\n\n".join([page.page_content for page in pages])
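
A quick way to check the helper without loading a PDF is to call it on stand-in objects. `FakePage` below is hypothetical and models only the `page_content` attribute of a LangChain Document.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a Document; only page_content is modeled."""
    page_content: str


def format_pages(pages):
    # Same logic as the notebook helper: join page texts with a blank line between them
    return "\n\n".join(page.page_content for page in pages)


combined = format_pages([FakePage("First page text."), FakePage("Second page text.")])
print(combined)
```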

# Custom Chain

This RAG chain combines several components to generate a response to a user's question:
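1. The user's question is sent to the retriever; the retrieved pages are joined into a single string by the format_pages function (the trailing RunnableLambda only casts the result to a string) and become the "document" input.
2. The same question is passed through unchanged by the RunnablePassthrough component and becomes the "question" input.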
3. The processed "document" and "question" are then fed into the prompt_template.
4. The output from the prompt_template is passed to the language model (llm).
5. Finally, the StrOutputParser component processes the language model's output to produce the final response.

In [16]:
rag_chain = (
    {"document": retriever | format_pages | RunnableLambda(lambda x: str(x)), "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)
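
The `|` operator composes the steps from left to right, and the leading dictionary runs each of its branches on the same input. The data flow can be sketched in plain Python with stubs in place of the retriever and the model; every function name below is hypothetical and nothing here calls LangChain or Google's API.

```python
def fake_retriever(question):
    # Stub: a real retriever would return the chunks most similar to the question
    return ["Chunk A about the topic.", "Chunk B about the topic."]


def join_chunks(chunks):
    return "\n\n".join(chunks)


def fill_prompt(inputs):
    return f"Question: {inputs['question']}\nDocument: {inputs['document']}"


def fake_llm(prompt_text):
    # Stub: a real model would generate an answer; this only reports the prompt size
    return f"(answer based on a {len(prompt_text)}-character prompt)"


def run_chain(question):
    # Both branches receive the same question, mirroring the dictionary step
    inputs = {
        "document": join_chunks(fake_retriever(question)),
        "question": question,
    }
    return fake_llm(fill_prompt(inputs))


print(run_chain("What is X?"))
```

Stubbing the two external steps this way can also help when debugging the prompt, since the filled prompt can be inspected before any model call is made.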

# Question Answering

In [17]:
question = "Describe the data cleaning and pre-processing steps for sentiment analysis."
result = rag_chain.invoke(question)
print(result)

Question: Describe the data cleaning and pre-processing steps for sentiment analysis.
Direct Answer: The data cleaning process for sentiment analysis involves finding and handling missing values, detecting and removing duplicates to prepare a reliable dataset for subsequent analysis.
Supporting Information: "Data cleaning procedures are implemented to improve the data quality. It involves finding and handling missing values, detecting and removing duplicates to prepare a reliable data set for subsequent analysis."


In [20]:
question = "According to the results, which machine learning algorithm performed the best for sentiment analysis, and what might be the reason for its superior performance?"
result = rag_chain.invoke(question)
print(result)

Question: According to the results, which machine learning algorithm performed the best for sentiment analysis, and what might be the reason for its superior performance?
Direct Answer: Logistic Regression performed the best for sentiment analysis, achieving the highest accuracy of 96.1% compared to Naive Bayes and KNN.
Supporting Information: "The outcomes revealed that Logistic Regression outperforms across all metrics and achieved highest accuracy of (96.1%) as compared to other results by Naive Bayes and KNN."
