Author: Akshay Chougule

Date of first publication: Nov 5, 2023

Description: This is an attempt to query a multimodal PDF on a local system using LLaMA and LLaVA.

In [1]:
import os
import Constants

os.environ["OPENAI_API_KEY"] = Constants.OPENAI_API_KEY

In [2]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader, PyPDFDirectoryLoader

# First pass

## Loaders

This first part is largely based on:
* https://colab.research.google.com/drive/1gyGZn_LZNrYXYXa-pltFExbptIe7DAPe?usp=sharing#scrollTo=cYA-H59u0Skn
* https://www.youtube.com/watch?v=3yPBVii7Ct0

In [3]:
from langchain.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader('./cart_research_pdfs/',glob="./*.pdf")

In [6]:
docs = loader.load()
len(docs), docs[0]

(84,
 Document(page_content='Int. J. Biol. Sci. 2019, Vol. 15 \n \n \nhttp://www. ijbs.com  2548 \nInternational  Journal  of Biological Sciences  \n2019;  15(12):  2548-2560.  doi: 10.7150/ ijbs.34213 \nReview \nCurrent P rogress in  CAR- T Cell Therapy for  Solid  \nTumors  \nShuo  Ma1,*, Xinchun Li1,*, Xinyue  Wang1,*, Liang  Cheng1,2, Zhong  Li1, Changzheng  Zhang1, Zhenlong  \nYe1,3,4,\uf02a, Qijun  Qian1,3,4\uf02a \n1. Shanghai  Baize  Medic al Laboratory, Shanghai,  China  \n2. Department  of Pathology  and Laboratory  Medicine,  Indiana  University  School  of Medicine,  Indianapolis,  Indiana,  USA  \n3. Shanghai  Cell Therapy  Research  Institute,  Shanghai,  China  \n4. Shanghai  Engineering  Research  Center  for Cell Therapy,  Shanghai,  China  \n* These  authors  contributed  equally  to this work.  \n\uf02a Corresponding author s: Qijun  Qian , Shanghai  Baize  Medical  Laboratory , 75 Qianyang Road,  Shanghai  201805,  China . Email:  qian@shcell.org;  Zhenlong  Ye, \nS

In [7]:
docs = loader.load_and_split()
len(docs), docs[0]

(168,
 Document(page_content='Int. J. Biol. Sci. 2019, Vol. 15 \n \n \nhttp://www. ijbs.com  2548 \nInternational  Journal  of Biological Sciences  \n2019;  15(12):  2548-2560.  doi: 10.7150/ ijbs.34213 \nReview \nCurrent P rogress in  CAR- T Cell Therapy for  Solid  \nTumors  \nShuo  Ma1,*, Xinchun Li1,*, Xinyue  Wang1,*, Liang  Cheng1,2, Zhong  Li1, Changzheng  Zhang1, Zhenlong  \nYe1,3,4,\uf02a, Qijun  Qian1,3,4\uf02a \n1. Shanghai  Baize  Medic al Laboratory, Shanghai,  China  \n2. Department  of Pathology  and Laboratory  Medicine,  Indiana  University  School  of Medicine,  Indianapolis,  Indiana,  USA  \n3. Shanghai  Cell Therapy  Research  Institute,  Shanghai,  China  \n4. Shanghai  Engineering  Research  Center  for Cell Therapy,  Shanghai,  China  \n* These  authors  contributed  equally  to this work.  \n\uf02a Corresponding author s: Qijun  Qian , Shanghai  Baize  Medical  Laboratory , 75 Qianyang Road,  Shanghai  201805,  China . Email:  qian@shcell.org;  Zhenlong  Ye, \n

In [4]:
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=250)
texts = text_splitter.split_documents(documents)
len(texts), texts[0]

(422,
 Document(page_content='Gene, Cell, + RNA Therapy Landscape Report\nQ3 2023 Quarterly Data Report', metadata={'source': 'cart_research_pdfs/ASGCT-Citeline-Q3-2023-Report.pdf', 'page': 0}))
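
To make the `chunk_size=1500, chunk_overlap=250` settings above more concrete, here is a minimal sketch of overlapping chunking in plain Python. It is a simplification: it cuts fixed character windows, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and newlines. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1500, chunk_overlap: int = 250) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input: 4000 characters of repeating digits
sample = "".join(str(i % 10) for i in range(4000))
chunks = chunk_with_overlap(sample)
print([len(c) for c in chunks])
print(chunks[0][-250:] == chunks[1][:250])  # the shared overlap region
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.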

## Vector DB

In [5]:
# Embed and store the texts

# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

# Here we use OpenAI embeddings; in the future we will swap in local embeddings
embedding = OpenAIEmbeddings()

vectordb = Chroma.from_documents(documents=texts, 
                                 embedding=embedding,
                                 persist_directory=persist_directory)

In [6]:
vectordb

<langchain.vectorstores.chroma.Chroma at 0x7f2a43e11760>

In [17]:
# Persist the db to disk
vectordb.persist()
# Delete in memory instance
vectordb = None

In [18]:
vectordb

In [19]:
# Now we can load the persisted database from disk, and use it as normal. 
vectordb = Chroma(persist_directory=persist_directory, 
                  embedding_function=embedding)
vectordb

<langchain.vectorstores.chroma.Chroma at 0x7fcb768d7790>

## Retriever

In [7]:
retriever = vectordb.as_retriever()

In [8]:
docs = retriever.get_relevant_documents("What is cellular target of tisagenlecleucel?")

In [9]:
docs

[Document(page_content='Leukemia. NE n g lJM e d (2018) 378(5):439 –48. doi: 10.1056/\nNEJMoa1709866\n2. Schuster SJ, Bishop MR, Tam CS, Waller EK, Borchmann P, McGuirk JP,\net al. Tisagenlecleucel in Adult Relapsed or Refractory Diffuse Large B-Cell\nLymphoma. N Engl J Med (2019) 380:45 –56. doi: 10.1056/NEJMoa18049803. Fowler NH, Dickinson M, Dreyling M, Martinez-Lopez J, Kolstad A, Butler\nJ, et al. Tisagenlecleucel in Adult Relapsed or Refractory Follicular\nLymphoma: The Phase 2 ELARA Trial. Nat Med (2021) 28(2):325 –32.\ndoi: 10.1038/s41591-021-01622-0\n4. Locke FL, Ghobadi A, Jacobson CA, Miklos DB, Lekakis LJ, Oluwole OO,\net al. Long-Term Safety and Activity of Axicabtagene Ciloleucel in\nRefractory Large B-Cell Lymphoma (ZUMA-1): A Single-Arm,\nMulticentre, Phase 1-2 Trial. Lancet Oncol (2019) 20(1):31 –42.\ndoi: 10.1016/S1470-2045(18)30864-7TABLE 3 | Combinatorial strategies with CAR-T cell therapy reported in clinical studies.\nCombinatorial approach Disease Target Referenc
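
Note that the top hit above is a reference list that happens to mention the drug name. A similarity retriever ranks chunks by closeness to the query, not by whether they answer it. The sketch below shows the ranking idea with a toy word-count "embedding" and cosine similarity. This is an assumption-laden stand-in: the real pipeline uses OpenAI embeddings and Chroma's index, and the helper names `embed`, `cosine` and `top_k` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Chunk texts borrowed from the titles visible in the output above
chunks = [
    "Long-Term Safety and Activity of Axicabtagene Ciloleucel in Refractory Large B-Cell Lymphoma",
    "Tisagenlecleucel in Adult Relapsed or Refractory Diffuse Large B-Cell Lymphoma",
    "Pipeline of gene, cell, and RNA therapies",
]
best = top_k("tisagenlecleucel target", chunks, k=1)
print(best)
```

Here the chunk containing the drug name wins even though it says nothing about the target, which mirrors the behaviour seen in the real output.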

# Second pass

Now that we have seen the complete basic workflow, let's try to refine some of the steps.

This section is largely based on:
* https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb
* https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_and_multi_modal_RAG.ipynb
* https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_multi_modal_RAG_LLaMA2.ipynb

## Extracting tables and images (along with the text)

We will use the [Gene, Cell, + RNA Therapy Landscape: Q3 2023 Quarterly Data Report](https://asgct.org/global/documents/asgct-citeline-q3-2023-report.aspx) as the RAG source to test this experiment.

In [15]:
from lxml import html
from pydantic import BaseModel
from typing import Any, Optional
from unstructured.partition.pdf import partition_pdf

# Get elements
raw_pdf_elements = partition_pdf(
    filename="./cart_research_pdfs/ASGCT-Q32023.pdf",
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Attempt to create a new chunk 3800 chars
    # Attempt to keep chunks > 2000 chars
    # Hard max on chunks
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)


In [16]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# Unique_categories will have unique elements
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 25,
 "<class 'unstructured.documents.elements.Table'>": 15}

In [17]:
raw_pdf_elements[0]

<unstructured.documents.elements.CompositeElement at 0x7f28c16fef70>

In [18]:
class Element(BaseModel):
    type: str
    text: Any


# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))


15
25


## Multi-vector retriever

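We use LangChain's multi-vector retriever.

Summaries of each table and text chunk are embedded and searched; each matching summary is then used to retrieve the raw table and / or raw chunk of text it was generated from.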
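
The indirection can be sketched in a few lines of plain Python. This is only an illustration of the idea, not LangChain's implementation: the class `ToyMultiVectorRetriever` is hypothetical, and it scores summaries by shared words instead of embeddings.

```python
import uuid


class ToyMultiVectorRetriever:
    """Search over summaries, but return the raw parents they point to."""

    def __init__(self, id_key: str = "doc_id"):
        self.id_key = id_key
        self.index = []      # (summary_text, metadata) pairs: what gets searched
        self.docstore = {}   # doc_id -> raw table or text: what gets returned

    def add(self, summaries: list[str], raw_docs: list[str]) -> list[str]:
        """Store each summary with the id of the raw document it describes."""
        ids = [str(uuid.uuid4()) for _ in raw_docs]
        for doc_id, summary, raw in zip(ids, summaries, raw_docs):
            self.index.append((summary, {self.id_key: doc_id}))
            self.docstore[doc_id] = raw
        return ids

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        """Rank summaries by word overlap with the query, then look up the parents."""
        q = set(query.lower().split())
        ranked = sorted(
            self.index,
            key=lambda item: len(q & set(item[0].lower().split())),
            reverse=True,
        )
        return [self.docstore[meta[self.id_key]] for _, meta in ranked[:k]]


toy = ToyMultiVectorRetriever()
toy.add(
    summaries=["table of approved gene therapies by year and location",
               "text about startup financing rounds"],
    raw_docs=["<raw table text>", "<raw financing text>"],
)
hits = toy.retrieve("approved gene therapies")
print(hits)
```

The benefit is that a short, clean summary is usually easier to match against a question than a flattened table, while the LLM still receives the full raw content.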

### Text and Table summaries

In [20]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

In [21]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = ChatOpenAI(temperature=0, model="gpt-4")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [22]:
# Apply to text
texts = [i.text for i in text_elements]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})


In [23]:
# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
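
The `{"max_concurrency": 5}` argument caps how many summarization calls are in flight at once. A rough sketch of that behaviour with a thread pool is shown below. It assumes the cap works like a fixed pool of workers; `fake_summarize` is a hypothetical stand-in for one LLM call and `batch` is not the LangChain method.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_summarize(text: str) -> str:
    """Stand-in for one LLM call: returns the first five words of the input."""
    return " ".join(text.split()[:5])


def batch(items: list[str], fn, max_concurrency: int = 5) -> list[str]:
    """Apply fn to every item with at most max_concurrency calls running at once.

    pool.map yields results in input order, so summaries line up with inputs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fn, items))


inputs = [f"chunk {i} " + "word " * 10 for i in range(12)]
summaries = batch(inputs, fake_summarize)
print(summaries[3])
```

Keeping results in input order matters later, because summaries are paired with their raw chunks by position.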

In [24]:
table_summaries[0]

'The text provides information about various gene therapies, their generic names, the year they were first approved, the diseases they treat, the locations where they were approved, and the companies that originated them. For instance, Gendicine, a recombinant p53 gene therapy, was first approved in 2004 in China by Shenzhen SiBiono GeneTech for treating head and neck cancer. Another example is Strimvelis, an autologous CD34+ enriched cells therapy, approved in 2016 in the EU and UK by Orchard Therapeutics for treating adenosine deaminase deficiency. The most recent therapy mentioned is Zynteglo, a betibeglogene autotemcel therapy, approved in 2019 in the US by bluebird bio for treating transfusion-dependent beta thalassemia.'

### Images

We will implement the following steps:

* Use a multimodal LLM (LLaVA) to produce text summaries from images
* Embed and retrieve text
* Pass text chunks to an LLM for answer synthesis

Image summaries

We will use LLaVA, an open source multimodal model.

We will use llama.cpp to run LLaVA locally (e.g., on a Mac laptop):

1. Clone [llama.cpp](https://github.com/ggerganov/llama.cpp): ```git clone git@github.com:ggerganov/llama.cpp.git ```
2. Download the LLaVA model: mmproj-model-f16.gguf and one of ggml-model-[f16|q5_k|q4_k].gguf [from LLaVA 7b repo](https://huggingface.co/mys/ggml_llava-v1.5-7b/tree/main)
3. Build

Let's go into the details of all steps:

#### Step 1

On Linux, after cloning llama.cpp, go into the llama.cpp directory and run:
* mkdir build && cd build && cmake ..
* cmake --build . 

or

```
mkdir build
cd build
cmake ..
cmake --build .
```
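#### Step 2

Download `mmproj-model-f16.gguf` and one of `ggml-model-[f16|q5_k|q4_k].gguf` from the LLaVA 7b repo linked above, and place them where the inference command in Step 3 expects them (here, `../models/llava-7b/`).
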
#### Step 3

Run inference across images:

/Users/rlm/Desktop/Code/llama.cpp/bin/llava -m ../models/llava-7b/ggml-model-q5_k.gguf --mmproj ../models/llava-7b/mmproj-model-f16.gguf --temp 0.1 -p "Describe the image in detail. Be specific about graphs, such as bar plots." --image "$img" > "$output_file"



In [42]:
# %%bash

# # Define the directory containing the images
# IMG_DIR=~/Desktop/Papers/LLaVA/

# # Loop through each image in the directory
# for img in "${IMG_DIR}"*.jpg; do
#     # Extract the base name of the image without extension
#     base_name=$(basename "$img" .jpg)

#     # Define the output file name based on the image name
#     output_file="${IMG_DIR}${base_name}.txt"

#     # Execute the command and save the output to the defined output file
#     /Users/rlm/Desktop/Code/llama.cpp/bin/llava -m ../models/llava-7b/ggml-model-q5_k.gguf --mmproj ../models/llava-7b/mmproj-model-f16.gguf --temp 0.1 -p "Describe the image in detail. Be specific about graphs, such as bar plots." --image "$img" > "$output_file"

# done
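
The same loop can be written in Python, which is convenient when the rest of the pipeline already lives in a notebook. This is a sketch under assumptions: the flags are the ones used in the shell command above, while every path in the usage example is a hypothetical placeholder, and `summarize_images` is defined but not called here because it needs a built `llava` binary and the model files.

```python
import subprocess
from pathlib import Path

PROMPT = "Describe the image in detail. Be specific about graphs, such as bar plots."


def build_llava_command(image, llava_bin, model, mmproj, temp=0.1):
    """Build the argument list for one LLaVA run (same flags as the shell command above)."""
    return [
        str(llava_bin),
        "-m", str(model),
        "--mmproj", str(mmproj),
        "--temp", str(temp),
        "-p", PROMPT,
        "--image", str(image),
    ]


def summarize_images(img_dir, llava_bin, model, mmproj):
    """Run LLaVA on every .jpg in img_dir and write <name>.txt next to each image."""
    written = []
    for img in sorted(Path(img_dir).glob("*.jpg")):
        cmd = build_llava_command(img, llava_bin, model, mmproj)
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        out_path = img.with_suffix(".txt")
        out_path.write_text(result.stdout)
        written.append(out_path)
    return written


# Hypothetical paths, only to show the shape of the command
cmd = build_llava_command(
    image="figure-1.jpg",
    llava_bin="./bin/llava",
    model="../models/llava-7b/ggml-model-q5_k.gguf",
    mmproj="../models/llava-7b/mmproj-model-f16.gguf",
)
print(" ".join(cmd[:7]))
```

Passing the command as a list avoids shell quoting problems with the prompt text and with image names that contain spaces.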

## Add to vectorstore

In [25]:
import uuid
from langchain.vectorstores import Chroma
from langchain.storage import InMemoryStore
from langchain.schema.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))

In [26]:
tables[2]

'Year first approved Generic name Disease(s) Locations approved* Originator company mipomersen sodium 2013 Homozygous familial hypercholesterolemia US, Mexico, Argentina, South Korea Ionis Pharmaceuticals eteplirsen 2016 Dystrophy, Duchenne muscular US US, EU, UK, Canada, Japan, Brazil, Switzerland, Australia, South Korea, China, Argentina, Colombia, Taiwan, Turkey, Hong Kong, Israel Argentina EU, UK, Canada, US, Brazil US, EU, UK, Japan, Canada, Switzerland, Brazil, Taiwan, Israel, Turkey, Australia US Sarepta Therapeutics nusinersen 2016 Muscular atrophy, spinal Ionis Pharmaceuticals rintatolimod inotersen 2016 2018 Chronic fatigue syndrome Amyloidosis, transthyretin-related hereditary AIM ImmunoTech Ionis Pharmaceuticals patisiran 2018 Amyloidosis, transthyretin-related hereditary Alnylam golodirsen 2019 Dystrophy, Duchenne muscular Hypertriglyceridemia; Lipoprotein lipase deficiency Sarepta Therapeutics volanesorsen 2019 EU, UK, Brazil, Canada Ionis Pharmaceuticals Infection, coron

In [27]:
table_summaries[2]


'The table provides information about various drugs, the year they were first approved, the diseases they treat, the locations where they were approved, and the companies that originated them. For instance, Mipomersen sodium was first approved in 2013 by Ionis Pharmaceuticals for the treatment of Homozygous familial hypercholesterolemia in the US, Mexico, Argentina, and South Korea. Eteplirsen was approved in 2016 by Sarepta Therapeutics for Duchenne muscular dystrophy in multiple locations including the US, EU, UK, Canada, Japan, Brazil, Switzerland, Australia, South Korea, China, Argentina, Colombia, Taiwan, Turkey, Hong Kong, and Israel. The most recent drugs listed are Tozinameran and an unnamed drug by Moderna Therapeutics, both approved in 2020 for novel coronavirus prophylaxis.'

In [32]:
# We can retrieve this table
retriever.get_relevant_documents(
    #"Which RNA therapies were approved in 2023?"
    "Which Gene therapies are approved in China?"
)[1]


'Source: Biomedtracker | Citeline, October 2023 13 / Q3 2023\n\n\n\nPipeline overview\n\n3 2 0 2 3 Q\n\nAmerican Society of Gene + Cell Therapy\n\nPipeline of gene, cell, and RNA therapies\n\n3,866 therapies are in development, ranging from preclinical through pre-registration\n\n¢\n\n2,082 gene therapies (including\n\ngenetically modified cell therapies such as CAR T-cell therapies) are in development, accounting for 53% of gene, cell, and RNA therapies • 862 non-genetically modified cell therapies are in development, accounting for 22% of gene, cell, and RNA therapies\n\nSource: Pharmaprojects| Citeline, October 2023\n\n15 / Q3 2023\n\nPipeline therapies by category\n\nGene therapies RNA therapies Cell therapies (non-genetically modified)\n\nEs American Society of Gene + Cell Therapy\n\nGene therapy pipeline\n\nGene therapy and genetically modified cell therapies\n\n3 2 0 2 3 Q\n\n\n\nGene therapy pipeline: Quarterly comparison\n\n\n\ne\n\nFor the first time in the past year, the num

## RAG

In [29]:
from operator import itemgetter
from langchain.schema.runnable import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Option 1: LLM
model = ChatOpenAI(temperature=0, model="gpt-4")
# Option 2: Multi-modal LLM
# model = GPT4-V or LLaVA

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
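
The pipe syntax hides four simple steps. A plain-Python sketch of what happens on `chain.invoke(question)` is shown below. It is an approximation for intuition only: `run_chain`, the lambda retriever and `echo_model` are hypothetical stand-ins, and no LLM is called.

```python
TEMPLATE = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""


def run_chain(question, retriever, model):
    """Mimic the four stages of the RAG pipeline with ordinary function calls."""
    # 1. Fan out: the question goes to the retriever and is also passed through unchanged
    inputs = {"context": retriever(question), "question": question}
    # 2. Fill the prompt template (the list of retrieved chunks is stringified as-is)
    prompt = TEMPLATE.format(**inputs)
    # 3. Call the model with the filled prompt
    reply = model(prompt)
    # 4. Reduce the reply to a plain string
    return str(reply)


seen_prompts = []


def echo_model(prompt):
    """Stand-in model: records the prompt it receives and returns a fixed reply."""
    seen_prompts.append(prompt)
    return "stub answer"


answer = run_chain(
    "Which Gene therapies are approved in China?",
    retriever=lambda q: ["<raw table>", "<raw text chunk>"],
    model=echo_model,
)
print(seen_prompts[0])
```

Printing the filled prompt like this is a handy way to check what context the model actually sees when an answer looks wrong.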


In [33]:
chain.invoke(
    #"Which RNA therapies were approved in 2023?"
    "Which Gene therapies are approved in China?"
)


'The gene therapies approved in China are Gendicine (recombinant p53 gene) for head and neck cancer, Oncorine (E1B/E3 deficient adenovirus) for head and neck cancer and nasopharyngeal cancer, and Fucaso, an anti-BCMA-targeting CAR-T therapy for multiple myeloma.'

In [35]:
def rag_pdf_chat(question):
    """Answer a question from the PDF using the multi-vector retriever and GPT-4."""
    # No explicit retriever call is needed here: the chain below
    # invokes the retriever itself to fill in {context}.

    # Prompt template
    template = """Answer the question based only on the following context, which can include text and tables:
    {context}
    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)

    # Option 1: LLM
    model = ChatOpenAI(temperature=0, model="gpt-4")
    # Option 2: Multi-modal LLM
    # model = GPT4-V or LLaVA

    # RAG pipeline
    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | model
        | StrOutputParser()
    )

    return chain.invoke(question)

In [36]:
rag_pdf_chat("Which Gene therapies are approved in China?")

'The gene therapies approved in China are Gendicine (recombinant p53 gene) for head and neck cancer, Oncorine (E1B/E3 deficient adenovirus) for head and neck cancer and nasopharyngeal cancer, and Fucaso, an anti-BCMA-targeting CAR-T therapy for multiple myeloma.'

In [37]:
rag_pdf_chat("Which RNA therapies are approved in 2023?")

'The text does not provide specific information on which RNA therapies are approved in 2023.'

In [38]:
rag_pdf_chat("Which Gene therapy is approved for Critical limb ischemia")

'The gene therapy approved for Critical limb ischemia is Collategene (beperminogene perplasmid).'

In [41]:
rag_pdf_chat("How many Gene therapies are approved globally for clinical use?")

'The text does not provide specific information on the total number of gene therapies approved globally for clinical use.'

In [40]:
rag_pdf_chat("Which Blue Bird Bio therapied are approved?")

'The approved therapies by Bluebird Bio are Abecma (idecabtagene vicleucel) for multiple myeloma in the US, Canada, EU, UK, and Japan, Skysona (elivaldogene autotemcel) for early cerebral adrenoleukodystrophy (CALD) in the US, and Zynteglo (betibeglogene autotemcel) for transfusion-dependent beta thalassemia in the US.'

In [43]:
rag_pdf_chat("How many Gene therapies are in development?")

'There are 2,082 gene therapies in development.'

In [44]:
rag_pdf_chat("What is the most popular indication for CAR-T cell therapies?")

'The most popular indication for CAR-T cell therapies is cancer.'

In [45]:
rag_pdf_chat("What are the most common target for Gene therapy pipeline?")

'The most common targets for the gene therapies in preclinical trials through pre-registration are CD19, B-cell maturation antigen (BCMA), also known as TNF receptor superfamily member 17, and CD22 molecule. These are the top three most common targets for oncology indications. For non-oncology indications, the most common target is the CD19 molecule.'

In [46]:
rag_pdf_chat("What are the top modalities for research in RNA therapy pipeline?")

'The top modalities for research in the RNA therapy pipeline are messenger RNA (mRNA) and RNA interference (RNAi).'

In [47]:
rag_pdf_chat("What are most common diseases targets for RNA therapies?")

'The most common diseases targeted by RNA therapies are rare diseases, anti-infective diseases, and anticancer diseases. For rare diseases, the top specified oncology indications are pancreatic, liver, and ovarian cancer. For non-oncology rare diseases, Duchenne muscular dystrophy, amyotrophic lateral sclerosis, and Huntington’s disease are the most commonly targeted.'

In [48]:
rag_pdf_chat("Which startup raised most money in Q3 of 2023 for Cell and Gene therapies?")

'The startup that raised the most money in Q3 of 2023 for Cell and Gene therapies was Tenpoint Therapeutics, which launched with $70M Series A Financing.'

In [49]:
rag_pdf_chat("What are the top 3 startups that raised most money in Q3 of 2023 for Cell and Gene therapies?")

'The top 3 startups that raised the most money in Q3 of 2023 for Cell and Gene therapies are:\n\n1. AIRNA - $30M in Initial Financing\n2. CellFE - $22M Series A Financing\n3. Innovac Therapeutics and BlueWhale Bio - Both secured $18M (Series Pre-A Financing for Innovac Therapeutics and Seed Funding for BlueWhale Bio)'