In [None]:
!pip install langchain faiss-cpu openai transformers pdfplumber torch torchvision pillow sentence-transformers

This notebook walks through building a chatbot that answers questions about the content of a PDF (text, images, and tables) using largely open-source tools. The pipeline has four steps:


1. **Extract Text, Tables, and Images from PDF**: Use `pdfplumber` for text and tables, and `Pillow` to decode embedded images.
2. **Store the Data**: We’ll use `FAISS` to store text and image features.
3. **Use a Pre-trained Language Model (LLM)**: We will use OpenAI’s `gpt-3.5-turbo` (or an equivalent free model from HuggingFace) for answering questions.
4. **RAG Setup**: We will use `LangChain` to implement the RAG-based pipeline.


In [None]:
### 1. Extract Text, Images, and Tables from PDF

import pdfplumber

def extract_text_and_tables(pdf_path):
    text = ""
    tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text += (page.extract_text() or "") + "\n"  # extract_text() may return None for pages without text
            tables.extend(page.extract_tables())  # Extracting tables from the page

    return text, tables

In [None]:
# Test: Use it to extract content from a PDF
pdf_path = 'your_file.pdf'  # Replace with your PDF file path
text, tables = extract_text_and_tables(pdf_path)

print(text)  # Preview the extracted text
print(tables)  # Preview the extracted tables
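
The extracted tables are nested lists (one list of rows per table), so they need to be turned into text before they can be embedded alongside the page text. The following is a minimal sketch of one way to do that, assuming a pipe-delimited layout is acceptable; the helper names `table_to_text` and `tables_to_texts` and the sample data are hypothetical and not part of `pdfplumber`.

```python
def table_to_text(table):
    """Render one table (a list of rows) as pipe-delimited lines."""
    lines = []
    for row in table:
        # Empty cells come back as None; replace them with empty strings
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        lines.append(" | ".join(cells))
    return "\n".join(lines)


def tables_to_texts(tables):
    """Convert every extracted table into one text block, skipping empty ones."""
    return [table_to_text(table) for table in tables if table]


# Hypothetical sample in the shape returned by page.extract_tables()
sample_tables = [[["Year", "Revenue"], ["2022", "10"], ["2023", None]]]
print(tables_to_texts(sample_tables)[0])
```

The resulting strings can be appended to the list of texts that is embedded later, so that questions about tabular values have something to retrieve.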

**Extracting Images**:

In [None]:
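import io

import pdfplumber
from PIL import Image

def extract_images(pdf_path):
    images = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for img in page.images:
                # img['stream'] is a pdfminer PDFStream object, not raw bytes
                data = img['stream'].get_data()
                try:
                    images.append(Image.open(io.BytesIO(data)).convert("RGB"))
                except OSError:
                    # Streams holding raw pixel data (not JPEG/PNG) cannot be opened this way; skip them
                    continue
    return images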

In [None]:
# Example: Extract images
images = extract_images(pdf_path)

# Show the first extracted image, if the PDF contains any
if images:
    images[0].show()
else:
    print("No images found in the PDF.")

We can use `FAISS` to store text and image embeddings for efficient similarity search. First, we’ll compute text embeddings with a pretrained language model and image embeddings with a vision model.

**Storing Text & Image Embeddings**:

In [None]:
### 2. Storing Data Using FAISS for Efficient Search
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import CLIPProcessor, CLIPModel
import torch

# Function to get text embeddings
def get_text_embeddings(texts):
    model = HuggingFaceEmbeddings()
    return model.embed_documents(texts)

# Function to get image embeddings
def get_image_embeddings(images):
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

    image_inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        image_embeddings = model.get_image_features(**image_inputs)

    return image_embeddings


In [None]:
# Test: Create embeddings for text and images
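texts = [text]  # 'text' is the string extracted from the PDF, so the whole PDF is one entry
texts_embeddings = get_text_embeddings(texts)  # raw text vectors, useful for inspection
images_embeddings = get_image_embeddings(images)  # tensor of shape (num_images, 512)

# Store the text in a LangChain FAISS vector store.
# from_texts() accepts plain strings; from_documents() expects Document objects.
text_faiss = FAISS.from_texts(texts, HuggingFaceEmbeddings())

# LangChain's FAISS wrapper has no from_vectors() method, so index the CLIP
# vectors with the faiss library itself (installed as faiss-cpu).
import faiss

image_vectors = images_embeddings.cpu().numpy().astype("float32")
image_faiss = faiss.IndexFlatL2(image_vectors.shape[1])  # CLIP ViT-B/16 output is 512-d
image_faiss.add(image_vectors)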

# You can store both and search them separately based on the type of query.
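
Embedding the whole PDF as one string gives the retriever only a single item to return, so retrieval works better when the text is split into smaller overlapping chunks first. Below is a minimal sketch of a character-based splitter; the function name `chunk_text` and the default sizes are illustrative assumptions, not values taken from this notebook.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # ignore whitespace-only chunks
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last chunk already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

With this helper, `texts = chunk_text(text)` would produce a list of strings that can be passed to `FAISS.from_texts` in place of the single-element list.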

Now, let’s integrate this with a RAG approach using LangChain. This will allow the chatbot to fetch relevant information from both text and images to answer a query.


In [None]:
### 3. Implementing RAG with LangChain
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# ChatOpenAI requires the OPENAI_API_KEY environment variable to be set.
# A HuggingFace-hosted model can be substituted as a free alternative.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.0)

# Use the text FAISS store as the retriever (image retrieval is not wired in yet)
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=text_faiss.as_retriever())

def multimodal_rag(query):
    # Answer the query from the text chunks retrieved by the QA chain
    text_results = qa_chain.run(query)

    # Optionally add image search and retrieval logic here
    # For example, using the image_faiss search to match query related to images

    return text_results  # Returning text response
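
The `chain_type="stuff"` setting means the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question, and the LLM is called once. The sketch below illustrates that idea only; the function name `build_stuff_prompt` and the prompt wording are assumptions and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, standing in for what the retriever returns
example_chunks = ["Revenue grew in 2023.", "Costs were flat year over year."]
print(build_stuff_prompt("How did revenue change?", example_chunks))
```

Because every retrieved chunk goes into one prompt, this approach only works while the combined chunks fit in the model's context window.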

In [None]:
# Test with a query
query = "What are the key insights in the PDF?"
response = multimodal_rag(query)
print(response)
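
The image-search placeholder in `multimodal_rag` comes down to a nearest-neighbour lookup: embed the query in the same space as the images, then return the closest image vectors. The sketch below shows that lookup as a brute-force cosine-similarity search in plain Python. It is a stand-in for illustration, not FAISS itself, and the toy two-dimensional vectors are made up. In the real pipeline the query vector would presumably come from CLIP's text encoder (`CLIPModel.get_text_features`) and the image vectors from `get_image_embeddings`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k=3):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy vectors standing in for CLIP image embeddings and a CLIP text query
toy_image_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query = [0.9, 0.1]
print(top_k(toy_query, toy_image_vectors, k=2))
```

The returned indices identify which entries of `images` best match the query. A FAISS index performs the same kind of lookup far more efficiently on large collections.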


In [None]:
### 4. Bringing It All Together
# Full process to extract data from PDF, process it, and query the chatbot.
pdf_path = 'your_file.pdf'  # Replace with the path to your PDF

# Step 1: Extract text, tables, and images
text, tables = extract_text_and_tables(pdf_path)
images = extract_images(pdf_path)

# Step 2: Create embeddings for text and images
texts = [text]  # List of texts to pass for embedding
texts_embeddings = get_text_embeddings(texts)
images_embeddings = get_image_embeddings(images)

# Step 3: Store embeddings in FAISS
import faiss

text_faiss = FAISS.from_texts(texts, HuggingFaceEmbeddings())  # from_texts() takes plain strings
image_vectors = images_embeddings.cpu().numpy().astype("float32")
image_faiss = faiss.IndexFlatL2(image_vectors.shape[1])  # raw faiss index; 512-d for CLIP ViT-B/16
image_faiss.add(image_vectors)

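# Step 4: Rebuild the QA chain on the new text index, then run the chatbot
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=text_faiss.as_retriever())
query = "What are the key insights in the PDF?"
response = multimodal_rag(query)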

# Print the final response from the chatbot
print(response)

This setup extracts text, tables, and images from a PDF and indexes the text and image embeddings with `FAISS`. As written, only the text index feeds the chatbot's answers; the image index and the extracted tables are there so you can extend the project toward multimodal question answering.

Remember:

1. **Text Embedding**: We used a pre-trained HuggingFace model through LangChain's `HuggingFaceEmbeddings` (a `sentence-transformers` model by default).
2. **Image Embedding**: We used CLIP (a vision-language model from OpenAI) to get 512-dimensional image features.
3. **RAG Model**: We used LangChain’s `RetrievalQA` to implement a retrieval-augmented generation system that answers questions from the retrieved text; image retrieval is left as an extension.
