# **Project: LangChain RAG 2.0**

## **1. Install required libraries**
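These libraries cover text extraction (`pytesseract` for OCR on images, `pypdf2` for PDFs), Gemini embeddings and chat models (`langchain-google-genai`), and the Pinecone vector store (`langchain-pinecone`, `pinecone-client`). File uploads are handled by Colab's built-in `google.colab` module. Note that `pytesseract` is only a wrapper: the Tesseract OCR binary must also be available on the system.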

In [1]:
!pip install pytesseract pypdf2 langchain-pinecone langchain-google-genai pinecone-client

Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Collecting pypdf2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting langchain-pinecone
  Downloading langchain_pinecone-0.2.2-py3-none-any.whl.metadata (1.6 kB)
Collecting langchain-google-genai
  Downloading langchain_google_genai-2.0.9-py3-none-any.whl.metadata (3.6 kB)
Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting aiohttp<3.11,>=3.10 (from langchain-pinecone)
  Downloading aiohttp-3.10.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting langchain-tests<0.4.0,>=0.3.7 (from langchain-pinecone)
  Downloading langchain_tests-0.3.10-py3-none-any.whl.metadata (3.6 kB)
Collecting pinecone<6.0.0,>=5.4.0 (from langchain-pinecone)
  Downloading pinecone-5.4.2-py3-none-any.whl.metadata (19 kB)
Collecting filetype<2.0.0,>=1.2.0 (from langchain-google-genai)
  Downloading filetype-1.2.

## **2. Import Necessary Libraries**

In [2]:
import pytesseract
import PyPDF2
import io
import os
from google.colab import files
from PIL import Image
from google.colab import userdata
from langchain_pinecone import PineconeVectorStore
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_core.documents import Document
from pinecone import Pinecone, ServerlessSpec
from uuid import uuid4

## **3. Define functions to extract text from different file types**

In [3]:
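def extract_text_from_pdf(pdf_file):
    """Return the text of every page in the PDF, separated by newlines."""
    text = ""
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        # Fall back to "" so a page with no extractable text cannot break the concatenation
        text += (page.extract_text() or "") + "\n"
    return text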

def extract_text_from_image(image_file):
    image = Image.open(image_file)
    text = pytesseract.image_to_string(image)
    return text

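def process_uploaded_file():
    """Upload files in Colab and return their combined extracted text."""
    uploaded = files.upload()  # Colab also saves each upload to the working directory
    extracted_parts = []
    for file_name in uploaded.keys():
        lower_name = file_name.lower()  # Match extensions case-insensitively
        if lower_name.endswith('.pdf'):
            with open(file_name, "rb") as f:
                extracted_parts.append(extract_text_from_pdf(f))
        elif lower_name.endswith(('.png', '.jpg', '.jpeg')):
            extracted_parts.append(extract_text_from_image(file_name))
        else:
            # Treat anything else as a plain UTF-8 text file
            with open(file_name, "r", encoding="utf-8") as f:
                extracted_parts.append(f.read())

    # Combine the text of all uploads instead of keeping only the last file
    extracted_text = "\n".join(extracted_parts)
    print("Extracted Text:")
    print(extracted_text)
    return extracted_text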

# Process the uploaded file and split the text into one question per line
extracted_text = process_uploaded_file()
questions = [line.strip() for line in extracted_text.splitlines() if line.strip()]  # Skip blank lines

Saving rag questions.txt to rag questions.txt
Extracted Text:
who dreamed creation of pakistan and when?
how mohammad ali jinnah succeed?
why mahatma gandhi come in the politics?


## **4. Initialize the RAG (Retrieval-Augmented Generation) system**

* **Get embeddings ready:** Sets up Google's tools for creating embeddings.
* **Connect to Pinecone:** Links to the Pinecone database where information is stored.
* **Prepare the AI:** Gets the ChatGoogleGenerativeAI model ready to provide answers.

In [4]:
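import time

# Initialize RAG system components
def initialize_rag_system():
    # Set API keys using userdata.get (Colab secrets)
    os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY_2')

    # Initialize embeddings
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

    # Initialize Pinecone with the API key stored in Colab secrets
    pc = Pinecone(api_key=userdata.get("PINECONEKEY2"))
    index_name = "new-rag-index"

    # Create the index only if it does not exist yet, so the cell can be re-run safely
    existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]
    if index_name not in existing_indexes:
        pc.create_index(
            name=index_name,
            dimension=768,  # Must match the embedding model's output size
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )
        # Wait until the new index is ready to accept requests
        while not pc.describe_index(index_name).status["ready"]:
            time.sleep(1)
    vector_store = PineconeVectorStore(index=pc.Index(index_name), embedding=embeddings)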

    # Initialize LLM (Generative AI model)
    llm = ChatGoogleGenerativeAI(
        model="gemini-1.5-flash",
        temperature=0,
        max_tokens=None,
        timeout=None,
        max_retries=2,
    )
    return vector_store, llm
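
The cells shown here create the Pinecone index but do not add any documents to it, so a freshly created index would return no references at query time. The helper below is a minimal sketch of one way to populate the store, assuming the reference passages are available as a list of strings. The function name `index_reference_texts` and the `source` metadata label are hypothetical; the only library call it relies on is the `add_texts` method that LangChain vector stores such as `PineconeVectorStore` expose.

```python
from uuid import uuid4


def index_reference_texts(vector_store, texts, source="uploaded-notes"):
    """Add reference passages to a vector store and return the generated ids.

    `vector_store` is assumed to expose LangChain's `add_texts` method;
    `source` is a hypothetical metadata label used only for illustration.
    """
    # Drop empty passages so no blank entries are embedded and stored
    cleaned = [t.strip() for t in texts if t and t.strip()]
    if not cleaned:
        return []
    ids = [str(uuid4()) for _ in cleaned]  # One unique id per passage
    metadatas = [{"source": source} for _ in cleaned]
    vector_store.add_texts(texts=cleaned, metadatas=metadatas, ids=ids)
    return ids
```

It would be called once after `initialize_rag_system()`, for example `index_reference_texts(vector_store, reference_passages)`, where `reference_passages` is a hypothetical list of strings holding the source material.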

## **5. Process each question and generate an answer using AI**

The `process_questions` function handles each question in three steps:

*   **Find similar information:** For each question, Pinecone is searched for the 2 most similar pieces of information in the index (`k=2`).
*   **Get the AI's response:** The question and the retrieved information are passed to the Gemini model, which uses them to generate an answer.
*   **Show the results:** The original question and Gemini's answer are printed in the output.

The main workflow ties the pieces together:

*   **Get the questions:** The questions come from the uploaded ".txt" file processed in section 3.
*   **Set up the system:** The RAG system (vector store and LLM) is initialized.
*   **Process and show answers:** The system works through each question, generates an answer, and shows it in the Colab notebook's output area.
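
To make the retrieval step concrete, here is a minimal brute-force sketch of a top-2 cosine-similarity search over toy vectors. It only illustrates the idea behind `similarity_search(query, k=2)` on a cosine-metric index; it is not how Pinecone implements search, and the names `cosine_similarity`, `top_k`, and `toy_index` are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (text, score) pairs most similar to query_vec.

    `stored` is a list of (text, vector) pairs standing in for the index.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors; the index in this notebook uses 768 dimensions
toy_index = [
    ("passage A", [1.0, 0.0, 0.0]),
    ("passage B", [0.9, 0.1, 0.0]),
    ("passage C", [0.0, 0.0, 1.0]),
]
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print([text for text, _ in results])
```

In the real pipeline the query vector comes from the embedding model and the stored vectors live in Pinecone; only the ranking idea carries over.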

In [5]:
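# Process each question and print answers
def process_questions(questions, vector_store, llm):
    for query in questions:
        if not query.strip():
            continue  # Skip blank lines left over from splitting the file
        print(f"\n**Question**: {query}")
        # Perform vector search for the 2 most similar documents
        vector_results = vector_store.similarity_search(query, k=2)

        # Pass only the retrieved text to the LLM, not the raw Document objects
        context = "\n\n".join(doc.page_content for doc in vector_results)
        final_answer = llm.invoke(
            f"Answer the user query: {query}\n\nHere are some references:\n{context}"
        )
        print(f"**Answer**: {final_answer.content}")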

# Main workflow
if __name__ == "__main__":
    # Show the questions extracted from the uploaded file in section 3
    print("Questions loaded:", questions)

    # Initialize RAG system
    vector_store, llm = initialize_rag_system()

    # Process questions and display answers in Colab output
    process_questions(questions, vector_store, llm)

Questions loaded: ['who dreamed creation of pakistan and when?', 'how mohammad ali jinnah succeed?', 'why mahatma gandhi come in the politics?']

**Question**: who dreamed creation of pakistan and when?
**Answer**: The creation of Pakistan wasn't the dream of a single person, but rather a culmination of ideas and efforts from many individuals over a considerable period.  However, **Muhammad Ali Jinnah** is widely considered the most prominent figure in the movement for a separate Muslim state.  He articulated the vision and led the Muslim League's efforts to achieve it.

While the specific "dream" evolved over time, the idea of a separate Muslim homeland gained significant momentum in the early to mid-20th century, culminating in the **Pakistan Resolution (Lahore Resolution) passed in 1940**.  This resolution is generally considered the formal articulation of the demand for a separate Muslim state, marking a key moment in the dream's realization.  Therefore, while Jinnah is the most pr