## Semi-Structured RAG For Private Data

In this homework assignment, you will work with **Retrieval Augmented Generation (RAG)**.

Your objective is to construct a system that leverages retrieval from a **private database** consisting of PDFs. These PDFs encapsulate a rich variety of content, including textual information, images, and tables.

The challenge lies in preserving all these components while efficiently extracting relevant data based on a user's input question.

- As a first step, develop a mechanism for extracting text from the PDFs, and compute text embeddings so they can later be compared with the user's input.
- Next, implement a process to identify and retrieve the information most relevant to a user's query.
- Because some extracted texts are too long, summarize them, and pass the summary of the most similar text to the LLM as input.
- Then, integrate the retrieved information with a Large Language Model (LLM) to generate comprehensive and contextually relevant responses to user queries.
- Finally, extend the mechanism to a multimodal setting: convert the PDF images to CLIP embeddings, and compare the CLIP text embedding of the input with the ground truth image embeddings to find the image most similar to the input text.
- Because we use unimodal LLMs, we cannot pass those images to the LLM directly. Instead, we use image captions in the LLM's input.

This approach aims to preserve the information in every modality, so that the system can answer by combining the knowledge embedded in the PDFs with the capabilities of the LLM.
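To make the overall flow concrete, here is a minimal, dependency-free sketch of the retrieve-then-prompt loop described above. The bag-of-words `embed` function is an assumption that stands in for a real embedding model, and the two facts are toy strings, not content extracted from the PDF.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, facts, k=1):
    """Return the k facts most similar to the question."""
    q = embed(question)
    ranked = sorted(facts, key=lambda f: cosine(q, embed(f)), reverse=True)
    return ranked[:k]

def build_prompt(question, facts):
    """Place the retrieved facts and the question into one prompt string."""
    return "FACTS:\n" + "\n".join(facts) + f"\nQUESTION: {question}\nANSWER:"

facts = [
    "CLIP learns joint representations of images and text.",
    "Diffusion models generate images by iterative denoising.",
]
question = "What does CLIP learn?"
prompt = build_prompt(question, retrieve(question, facts, k=1))
print(prompt)
```

In the notebook itself, the toy pieces are replaced by a sentence-embedding model, summaries of the PDF chunks, and an LLM call on the final prompt.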

Instruction:

<font color='77CC99'>Follow the green text and fill out the notebook.</font>


<img src='https://drive.google.com/uc?id=1kODk16WWrn9DqvaWoEAekHRXup1djGjl' width="75%">

## Packages

In [1]:
%%capture
# Restart the kernel after the first installation
# (%%capture must be the first line of the cell)
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr
!pip install pytesseract
# for image extraction from pdf
!pip install PyMuPDF
!pip install Pillow
# text embedding
!pip install -U sentence-transformers
# Quote the version specifier so the shell does not treat ">" as a redirect
!pip install transformers accelerate "bitsandbytes>=0.39.0" -q

# 0 - Loading Data

## 0.1 - Downloading the PDF

In [1]:
from pathlib import Path
import urllib.request

# Define the name of the PDF file and then download it
file_name = "Dall_E_paper"

url = "https://arxiv.org/pdf/2204.06125.pdf"
file_path = f"{file_name}.pdf"
urllib.request.urlretrieve(url, file_path)

('Dall_E_paper.pdf', <http.client.HTTPMessage at 0x79a9d4a36ce0>)
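Re-running the cell above downloads the paper again each time. A small sketch of a helper that skips the download when the file is already present; `download_if_missing` is a hypothetical name, not part of the original notebook.

```python
from pathlib import Path
import urllib.request

def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest

# Intended use with the values from the cell above (not run here):
# file_path = download_if_missing("https://arxiv.org/pdf/2204.06125.pdf", "Dall_E_paper.pdf")
```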

## 0.2 - Extract Images and Texts

Implement mechanisms to extract images and texts from the downloaded PDFs.

In [2]:
!which pdftotext

/usr/bin/pdftotext


In [3]:
import pytesseract
print(pytesseract.get_tesseract_version())

4.1.1


In [4]:
# Import required dependencies
import fitz
import os
from PIL import Image

### Step 0.2.1: Extract and Store Images

In [5]:
# Open PDF file
pdf_file = fitz.open(file_path)

# Calculate number of pages in PDF file
page_nums = len(pdf_file)

# Create empty list to store images information
images_list = []

# Extract all images information from each page
for page_num in range(page_nums):
    page_content = pdf_file[page_num]
    images_list.extend(page_content.get_images())

In [6]:
images_path = "./images/"
Path(images_path).mkdir(parents=True, exist_ok=True)

# Save all the extracted images
for i, image in enumerate(images_list, start=1):
    # Extract the image object number
    xref = image[0]
    # Extract image
    base_image = pdf_file.extract_image(xref)
    # Store image bytes
    image_bytes = base_image['image']
    # Store image extension
    image_ext = base_image['ext']
    # Generate image file name
    image_name = file_name + '_' + str(i) + '.' + image_ext
    # Save image (the with block closes the file automatically)
    with open(os.path.join(images_path, image_name), 'wb') as image_file:
        image_file.write(image_bytes)
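Because the image list is collected page by page, an image that is referenced on several pages may appear in it more than once. The sketch below keeps each xref only once, in first-seen order. `unique_xrefs` is a hypothetical helper, and the sample tuples are made up to mimic the leading fields of the entries used above.

```python
def unique_xrefs(images_list):
    """Return image xrefs in first-seen order, dropping repeats."""
    seen = set()
    ordered = []
    for image in images_list:
        xref = image[0]  # first tuple entry is the xref, as in the loop above
        if xref not in seen:
            seen.add(xref)
            ordered.append(xref)
    return ordered

# Hypothetical tuples shaped like the start of the extracted image entries
sample = [(12, 0, 640, 480), (15, 0, 320, 240), (12, 0, 640, 480)]
print(unique_xrefs(sample))
```

Whether this PDF actually contains repeated references is not checked here; the helper only shows how to guard against them.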

### Step 0.2.2: Extract and Store Texts From PDF Content

In [7]:
!pip install unstructured[all-docs]==0.11.2 -q

In [8]:
from lxml import html
from pydantic import BaseModel
from typing import Any, Optional
from unstructured.partition.pdf import partition_pdf

path='./'

# Directory where the extracted images are written (passed as image_output_dir_path below);
# despite its name, this variable is not the path to the poppler installation
poppler_path = './images/'

# Specify the path to the Tesseract OCR installation
tesseract_path = '/usr/bin/tesseract'  # Replace with the path obtained from the previous step


# Get elements
raw_pdf_elements = partition_pdf(
    filename= "./"+"Dall_E_paper.pdf",
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks:
    # hard maximum of 2000 chars per chunk,
    # attempt to start a new chunk after 1900 chars,
    # attempt to combine text blocks shorter than 1000 chars
    max_characters=2000,
    new_after_n_chars=1900,
    combine_text_under_n_chars=1000,
    image_output_dir_path=poppler_path,
    tesseract_path=tesseract_path,
)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
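The difference between the soft limit (`new_after_n_chars`) and the hard limit (`max_characters`) can be illustrated with a simplified greedy packer. This is only an illustration under stated assumptions: it is not the algorithm used by `unstructured`, it ignores titles and `combine_text_under_n_chars`, and `chunk_blocks` is a hypothetical name.

```python
def chunk_blocks(blocks, max_characters=2000, new_after_n_chars=1900):
    """Greedily pack text blocks into chunks.

    Simplified illustration only; not unstructured's actual algorithm.
    """
    chunks, current = [], ""
    for block in blocks:
        # Split any block that alone exceeds the hard maximum
        pieces = [block[i:i + max_characters] for i in range(0, len(block), max_characters)]
        for piece in pieces:
            candidate = f"{current}\n{piece}" if current else piece
            # Close the chunk once it passed the soft limit,
            # or when adding the piece would break the hard limit
            if current and (len(current) >= new_after_n_chars or len(candidate) > max_characters):
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks

print([len(c) for c in chunk_blocks(["a" * 1500, "b" * 600, "c" * 100])])
```

With these settings no chunk exceeds 2000 characters, and a chunk that has already reached 1900 characters is closed before the next block is added.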


In [9]:
# Text
text_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.CompositeElement" in str(type(element)):
        text_elements.append(str(element))

print(len(text_elements))

39


Because some texts are too long, we first summarize them.

In [10]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

summarized_text_elements = summarizer(text_elements , max_length=100, do_sample=False)

Your max_length is set to 100, but your input_length is only 37. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
Your max_length is set to 100, but your input_length is only 50. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)
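The warnings above suggest a `max_length` of roughly half the input length for short inputs. A minimal sketch of that rule follows; `summary_max_length`, `cap`, and `floor` are hypothetical names, and the cap of 100 simply mirrors the value used in the cell above.

```python
def summary_max_length(input_length, cap=100, floor=8):
    """Pick a max_length of about half the input length, clamped to [floor, cap]."""
    return max(floor, min(cap, input_length // 2))

print(summary_max_length(37))   # 18, matching the first warning
print(summary_max_length(50))   # 25, matching the second warning
print(summary_max_length(600))  # 100, the cap
```

Applying this would require summarizing each text with its own `max_length` instead of one batched call, and a very small value may interact with the summarization model's default minimum length, so treat it as a starting point rather than a drop-in fix.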


# 1 - Unimodal RAG

## 1.1 - Loading True Text Data as Embedded Vectors

In this section, we convert the text data into embedding vectors and store them. In the following step, given an input, we can then compare embeddings to find the fact most similar to that input.

We use this model for [Text-Embedding](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

In [11]:
from sentence_transformers import SentenceTransformer, util

text_emb_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [12]:
text_embeddings = text_emb_model.encode(text_elements, convert_to_tensor=True)

Now we have all the embeddings we need. For a new input, we compare the input's embedding with the entries of `text_embeddings` and find the closest one.

## 1.2 - Unimodal Semi-Structured RAG

### Step 1.2.1: Most Similar Ground Truth Text Extraction

As the first step, for any given input, we need a scoring function that finds the ground truth embedding vector closest to the input vector. We use cosine similarity for this operation.
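For reference, the cosine similarity between an input embedding $q$ and a ground truth embedding $t_i$ (symbols introduced here for illustration) is:

$$
\operatorname{sim}(q, t_i) = \frac{q \cdot t_i}{\lVert q \rVert \, \lVert t_i \rVert}
$$

If both vectors are first scaled to unit length, the denominator equals 1 and the similarity reduces to a plain dot product. The value lies in $[-1, 1]$, with larger values meaning the two vectors point in more similar directions.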

<font color='77CC99'>Write a function "text_embedding_similarity" that converts the input text to an embedding vector and returns the similarity between the input text and each of the ground truth texts.</font>


In [13]:
import numpy as np

def get_similarity(embeddings_1, embeddings_2):

  embeddings_1 = embeddings_1 / embeddings_1.norm(dim=-1, keepdim=True)
  embeddings_2 = embeddings_2 / embeddings_2.norm(dim=-1, keepdim=True)

  return embeddings_1.cpu().detach().numpy() @ embeddings_2.cpu().detach().numpy().T

In [14]:
def text_embedding_similarity(input_text, text_embeddings, text_emb_model):

    ### To Do ###

    input_text_emb = text_emb_model.encode(input_text, convert_to_tensor=True)

    ### End ###

    return get_similarity(text_embeddings, input_text_emb)

In [15]:
input_text = "is DALL-E2 uses a clip model inside?"
text_embedding_similarity(input_text, text_embeddings, text_emb_model)

array([0.18537423, 0.20288187, 0.21293396, 0.20692343, 0.2686563 ,
       0.31146652, 0.01873749, 0.21311942, 0.20436674, 0.25198764,
       0.26972088, 0.24603592, 0.2582291 , 0.2539547 , 0.23482765,
       0.23657551, 0.17059211, 0.15917057, 0.15459305, 0.14453974,
       0.11232636, 0.18208514, 0.12710598, 0.37942076, 0.05238129,
       0.3243861 , 0.30019337, 0.0696483 , 0.10325827, 0.13214463,
       0.02927516, 0.15812725, 0.10174509, 0.26262775, 0.15325922,
       0.00922206, 0.31082875, 0.04998646, 0.12929475], dtype=float32)

<font color='77CC99'> Now, write a function "text_retrival" that finds the summaries of the k ground truth texts most similar to the user's input.</font>

In [16]:
import heapq

def text_retrival(k, input_text, text_embeddings, text_elements, summarized_text_elements, text_emb_model):

    ### To Do ###
    scores = text_embedding_similarity(input_text, text_embeddings, text_emb_model)
    ind = np.argsort(scores)[::-1][:k]
    selected_text_elements = np.array(summarized_text_elements)[ind]
    ### End ###

    return {"selected_text_elements": selected_text_elements}
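The top-k selection can also be written with the standard library's `heapq`, which avoids sorting every score when only a few are needed. The scores and summaries below are toy placeholders, and `top_k_indices` is a hypothetical helper name.

```python
import heapq

def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

scores = [0.18, 0.31, 0.02, 0.38, 0.27]     # toy similarity scores
summaries = ["s0", "s1", "s2", "s3", "s4"]  # placeholder summaries
best = top_k_indices(scores, k=2)
print(best)
print([summaries[i] for i in best])
```

For a few dozen chunks the difference from a full sort is negligible; the choice matters mainly for large collections.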

### Step 1.2.2: Load the core LLM and Combine them all

We use a question-answering LLM as the core of our system. Given the input text and the ground truth facts closest to it, we pass both to the LLM to answer the question.

Here we load the core LLM for our Unimodal Semi-Structured RAG. [Model in HF](https://huggingface.co/samwit/koala-7b)

In [17]:
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline
import torch
import textwrap

model = LlamaForCausalLM.from_pretrained(
    "samwit/koala-7b",
    load_in_8bit=True,
    device_map='auto',
)

tokenizer = LlamaTokenizer.from_pretrained("samwit/koala-7b")

Loading checkpoint shards:   0%|          | 0/14 [00:00<?, ?it/s]

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [18]:
llm_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15)

tokenizer.pad_token_id = tokenizer.eos_token_id

What follows is our prompt. Based on it, do the task below.


<font color='77CC99'> Based on the prompt and what we have done before, write a function that answers the user's question by finding the most related ground truth text (fact) and passing it to the LLM through the prompt. Function "Unimodal_Question_Answering" </font>

In [19]:
# Prompt
prompt_text = """ANSWER the QUESTION in conformity to on FACTS. \n
FACTS: \n {text_facts}. \n
QUESTION: {user_question} \n
ANSWER:  """

In [20]:
def Unimodal_Question_Answering(input_text,k=1):

    ### To Do ###
    prompt_text = """ANSWER the QUESTION in conformity to on FACTS. \n
    FACTS: \n {text_facts}. \n
    QUESTION: {user_question} \n
    ANSWER:  """
    facts = ''
    summaries = text_retrival(k, input_text, text_embeddings, text_elements, summarized_text_elements, text_emb_model)['selected_text_elements'].tolist()
    for dictionary in summaries:
        facts = facts + dictionary['summary_text'] + '\n'
    prompt_text = prompt_text.format(
        text_facts=facts,
        user_question=input_text
    )
    response = llm_pipeline(prompt_text, do_sample=True)[0]
    ### End ###

    return response
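A triple-quoted template defined inside an indented function body carries that indentation into the prompt, which is why the generated text contains sequences such as `\n    FACTS`. One way to avoid this, sketched below, is to keep the template at module level and join the facts explicitly. The template wording and the names `PROMPT_TEMPLATE` and `build_prompt` are assumptions for illustration, not the prompt used in this notebook.

```python
PROMPT_TEMPLATE = (
    "ANSWER the QUESTION based on the FACTS.\n"
    "FACTS:\n{text_facts}\n"
    "QUESTION: {user_question}\n"
    "ANSWER: "
)

def build_prompt(facts, user_question):
    """Join the retrieved facts and fill the template."""
    text_facts = "\n".join(fact.strip() for fact in facts)
    return PROMPT_TEMPLATE.format(text_facts=text_facts, user_question=user_question)

prompt = build_prompt(
    ["CLIP has been used to steer generative image models towards text prompts."],
    "is DALL-E2 uses a clip model inside?",
)
print(prompt)
```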

In [21]:
input_text = "is DALL-E2 uses a clip model inside?"

response = Unimodal_Question_Answering(input_text,k=1)

In [22]:
response['generated_text']

'ANSWER the QUESTION in conformity to on FACTS. \n\n    FACTS: \n Since its release, CLIP has been used extensively to steer generative image models towards text prompts. Nichol et al. [35] showed classiﬁer-free guidance works more favorably than CLIP guidance for text conditional image generation. Zhou and Crowson [9] trained diffusion models conditioned on CLIP text embeddings, allowing for direct text-conditional imagegeneration.\n. \n\n    QUESTION: is DALL-E2 uses a clip model inside? \n\n    ANSWER:  \n    \n    Based on the information provided, it appears that DALL-E2 does not use a clip model as an internal component. Instead, DALL-E2 is designed as a general-purpose framework for training and evaluating GAN architectures using real-world data sets and tasks. The framework includes features such as transfer learning, multi-task learning, and fine-tuning, which can be applied to different types of datasets and tasks without modifying the underlying architecture. Therefore, whil
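The generated text echoes the whole prompt before the model's answer. A small sketch for keeping only the part after the `ANSWER:` marker follows; `extract_answer` is a hypothetical helper and the sample string is made up.

```python
def extract_answer(generated_text, marker="ANSWER:"):
    """Return the text after the first marker, or the whole text if the marker is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()

sample = "FACTS: some fact \nQUESTION: is it used? \nANSWER:  \n    Yes, it is."
print(extract_answer(sample))
```

The first line of the prompt starts with `ANSWER the QUESTION` without a colon, so it does not collide with the marker.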

# 2 - Multimodal RAG

In this section, we want to add another modality to our unimodal RAG. What happens if we can consider images as ground truth facts?

We have stored all ground truth images. In this step, we extract image embeddings so they can be compared with the embedding of the textual input.

## 2.1 - Loading CLIP Model for Extracting Embeddings

<font color='77CC99'> Load the CLIP model for extracting textual and visual embeddings, then convert all input images to their corresponding vectors.

[Huggingface Link](https://huggingface.co/docs/transformers/model_doc/clip) </font>


In [23]:
from PIL import Image
import requests
from transformers import AutoTokenizer, CLIPTextModelWithProjection
from transformers import AutoProcessor, CLIPVisionModelWithProjection

### To Do ###

textual_clip_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch16")
textual_clip_tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch16")
visual_clip_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch16")
visual_clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch16")

### End ###

In [24]:
import glob
images_path = glob.glob('./images/*')

### To Do ###

images_embeddings = []

for image_path in images_path:
    im = Image.open(image_path)
    inputs = visual_clip_processor(images=im, return_tensors='pt')
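    # Note: last_hidden_state holds the per-patch token states before CLIP's projection head;
    # the projected image vector is available as `.image_embeds`
    outputs = visual_clip_model(**inputs).last_hidden_state.detach().cpu().numpy()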
    images_embeddings.append(outputs)

### End ###

In [25]:
# Stack into one NumPy array first; building a tensor from a list of arrays is slow
images_embeddings = torch.tensor(np.array(images_embeddings))



In [26]:
images_embeddings = images_embeddings.squeeze()

In [27]:
images_embeddings = torch.mean(images_embeddings, dim=1)

In [28]:
images_embeddings.shape

torch.Size([218, 768])

Because we use a unimodal LLM, we need to make the images' information understandable to the LLM. Hence, we extract a textual description of each image as a caption and store the captions in the `caption_list` list.

<font color='77CC99'>Write the corresponding code.</font>


[Image Captioning HF Model](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning)

In [29]:
from transformers import pipeline
image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

### To Do ###

caption_list = []

for image_path in images_path:
    outputs = image_to_text(images=image_path)[0]['generated_text']
    caption_list.append(outputs)


### End ###

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Now that we have everything ready, we can code the Multimodal Semi-Structured RAG.

## 2.2 - Multimodal Semi-Structured RAG

### Step 2.2.1: Most Similar Ground Truth Image Extraction

As the first step, for any given input, we need a scoring function that finds the visual embedding vector closest to the input's textual vector. We use cosine similarity for this operation.

<font color='77CC99'>Write a function "visual_embedding_similarity" (implemented below as "get_embedding_similarity") that converts the input text to a CLIP embedding vector and returns the similarity between the input text and each of the ground truth images.</font>

Note: You can use the "get_similarity" function that you defined before.

In [30]:
def get_embedding_similarity(input_text, images_embeddings, textual_clip_tokenizer, textual_clip_model):

    ### To Do ###
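    inputs = textual_clip_tokenizer(input_text, return_tensors="pt")
    # Hidden state of the final (end-of-text) token, taken before CLIP's text projection;
    # the projected text vector is available as `.text_embeds`
    text_embeds = textual_clip_model(**inputs).last_hidden_state[0, -1].detach()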
    ### End ###

    return get_similarity(images_embeddings[:, :512], text_embeds)
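The cells above compare mean-pooled vision hidden states (768 wide, truncated to their first 512 entries) with a text hidden state. Those vectors come from before CLIP's projection heads, so they do not live in the shared image-text space that CLIP is trained to align. The `...WithProjection` model outputs expose the projected vectors as `text_embeds` and `image_embeds`; a sketch that uses them is given below. `clip_shared_embeddings` is a hypothetical helper, and it is an alternative under these assumptions, not the method that produced the outputs recorded in this notebook.

```python
def clip_shared_embeddings(text, images, tokenizer, text_model, processor, vision_model):
    """Embed one text and a list of images in CLIP's shared space.

    Returns (text_embeds, image_embeds) as produced by the projection heads.
    """
    text_inputs = tokenizer([text], padding=True, return_tensors="pt")
    image_inputs = processor(images=images, return_tensors="pt")
    text_embeds = text_model(**text_inputs).text_embeds        # shape (1, projection_dim)
    image_embeds = vision_model(**image_inputs).image_embeds   # shape (n_images, projection_dim)
    return text_embeds, image_embeds

# Intended use with the objects loaded earlier in this notebook (not run here):
# import torch
# with torch.no_grad():
#     text_embeds, image_embeds = clip_shared_embeddings(
#         input_text, [Image.open(p) for p in images_path],
#         textual_clip_tokenizer, textual_clip_model,
#         visual_clip_processor, visual_clip_model)
# scores = get_similarity(image_embeds, text_embeds[0])
```

With projected embeddings both sides have the same width, so no slicing is needed before calling `get_similarity`.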

<font color='77CC99'>Now, write a function that finds the k texts and the k images most similar to the user's input.</font>

In [31]:
import heapq

def multimodal_retrival(k,input_text,text_embeddings,text_elements,summarized_text_elements,
                        text_emb_model,images_embeddings,caption_list,textual_clip_tokenizer ,textual_clip_model):

    ### To Do ###
    image_scores = get_embedding_similarity(input_text, images_embeddings, textual_clip_tokenizer, textual_clip_model)
    ind = np.argsort(image_scores)[::-1][:k]
    selected_image_elements = np.array(caption_list)[ind]

    selected_text_elements = text_retrival(k, input_text, text_embeddings, text_elements, summarized_text_elements, text_emb_model)['selected_text_elements']
    return {"selected_image_elements": selected_image_elements,
          "selected_text_elements": selected_text_elements}

    ### End ###
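Top-k retrieval always returns k captions, even when none of the images is related to the question. One possible refinement, sketched here, is to drop candidates whose similarity falls below a threshold. `select_above_threshold` and `min_score` are hypothetical, the captions and scores are toy values, and a usable threshold would have to be tuned on real data.

```python
def select_above_threshold(items, scores, k, min_score):
    """Return up to k items with score >= min_score, best first."""
    ranked = sorted(zip(scores, items), key=lambda pair: pair[0], reverse=True)
    return [item for score, item in ranked[:k] if score >= min_score]

captions = ["a diagram of a model", "a bowl of fruit on a table"]  # toy captions
scores = [0.31, 0.08]                                              # toy similarities
print(select_above_threshold(captions, scores, k=2, min_score=0.2))
```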

### Step 2.2.2: Use the core LLM and Combine them all

In this section, building on the LLM loaded earlier and the retrieval function above, we write the Multimodal RAG. Proceed as follows.

<font color='77CC99'> Based on the new prompt, which contains both the textual ground truth facts and the captions of the visual ground truth images, write the "Multimodal_Question_Answering" function. This function should take the user's textual question as input, find the most relevant textual and visual ground truth, and pass them all to the LLM via the prompt.</font>

In [32]:
# Prompt
prompt_text = """ANSWER the QUESTION in conformity to on FACTS. \n
FACTS: \n {text_facts} \n {image_facts}. \n
QUESTION: {user_question} \n
ANSWER:  """

In [33]:
def Multimodal_Question_Answering(input_text,k=1):
    prompt_text = """ANSWER the QUESTION in conformity to on FACTS. \n
    FACTS: \n {text_facts} \n {image_facts}. \n
    QUESTION: {user_question} \n
    ANSWER:  """
    ### To Do ###
    summaries = multimodal_retrival(k,input_text,text_embeddings,text_elements,summarized_text_elements,
                        text_emb_model,images_embeddings,caption_list,textual_clip_tokenizer ,textual_clip_model)
    text_facts = ""
    image_facts = ""


    # Each selected image element is a caption string
    for caption in list(summaries['selected_image_elements']):
        image_facts = image_facts + caption + '\n'

    for dictionary in list(summaries['selected_text_elements']):
        text_facts = text_facts + dictionary['summary_text'] + '\n'

    prompt_text = prompt_text.format(
        text_facts=text_facts,
        image_facts=image_facts,
        user_question=input_text
    )
    response = llm_pipeline(prompt_text, do_sample=True)[0]
    ### End ###

    return response

In [37]:
input_text = "is DALL-E2 uses a clip model inside?"

response = Multimodal_Question_Answering(input_text,k=1)

In [38]:
response['generated_text']

'ANSWER the QUESTION in conformity to on FACTS. \n\n    FACTS: \n Since its release, CLIP has been used extensively to steer generative image models towards text prompts. Nichol et al. [35] showed classiﬁer-free guidance works more favorably than CLIP guidance for text conditional image generation. Zhou and Crowson [9] trained diffusion models conditioned on CLIP text embeddings, allowing for direct text-conditional imagegeneration.\n \n a bowl of fruit on a table \n. \n\n    QUESTION: is DALL-E2 uses a clip model inside? \n\n    ANSWER:  \n    \n    Yes, it does use a Clip model. According to the paper "DALL-E2: Text-conditioned Image Generation with Improved Quality" by Liu et al., they state that "To achieve robustness, we propose a new architecture called DALL-E2, which consists of two parts: a Clip generator that encodes context information into the image space using a language model, and a diffraction network that computes an embedding of the image." \n    \n    '

<font color='77CC99'>The Answer to the input question is "Yes" or "No". What are your Semi-structured models' answers? (Both Unimodal and Multimodal). Are they right or not?</font>

<font color='CC7799'>Your Answer:</font> Interestingly, the unimodal model's answer is incorrect, but the multimodal model gave the correct answer. This could be because some information is present in the images of the paper and not in the text.
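Since the expected answer is "Yes" or "No", the two generated answers can be reduced to a verdict with a crude keyword check. The sketch below is a heuristic under stated assumptions: `yes_no_verdict` is a hypothetical helper, it only looks at the first occurrence of "yes", "no", or "not", and it does not judge whether the model's justification is sound.

```python
import re

def yes_no_verdict(answer):
    """Return 'yes' or 'no' based on the first matching keyword, else None."""
    match = re.search(r"\b(yes|no|not)\b", answer.lower())
    if not match:
        return None
    return "yes" if match.group(1) == "yes" else "no"

print(yes_no_verdict("Yes, it does use a Clip model."))
print(yes_no_verdict("it appears that DALL-E2 does not use a clip model"))
```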

....