<a href="https://colab.research.google.com/github/ErikLarssonDev/NLP/blob/main/llama2-multiple-pdf-chat/llama2-multiple-pdf-chat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chat with Multiple PDFs using Llama 2, Pinecone and LangChain

Llama 2 has a context window of 4,096 tokens, roughly 16,000 English characters, so longer input has to be split into multiple text chunks.
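
As a rough illustration of why chunking is needed, the sketch below estimates token counts from character counts and checks them against a 4,096-token window. The ratio of about 4 characters per token is a common rule of thumb for English text, not an exact property of the Llama 2 tokenizer, and the helper names are hypothetical.

```python
# Rough token-budget check.
# Assumption: ~4 English characters per token (a heuristic, not the exact
# behaviour of the Llama 2 tokenizer).
CONTEXT_WINDOW_TOKENS = 4096  # Llama 2 context length
CHARS_PER_TOKEN = 4           # assumed average


def estimate_tokens(text: str) -> int:
    """Estimate the number of tokens in `text` from its character count."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(text: str, reserved_for_answer: int = 512) -> bool:
    """Return True if the estimated prompt still leaves room for the answer."""
    return estimate_tokens(text) + reserved_for_answer <= CONTEXT_WINDOW_TOKENS


print(fits_in_context("a" * 2_000))    # a few retrieved chunks
print(fits_in_context("a" * 100_000))  # a whole paper at once
```

For real budgeting, counting tokens with the model's own tokenizer is more reliable than this estimate.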

In [1]:
%pip install langchain
%pip install pinecone-client
%pip install sentence_transformers
%pip install pdf2image
%pip install pypdf
%pip install xformers
%pip install bitsandbytes accelerate transformers

Collecting langchain
  Downloading langchain-0.1.0-py3-none-any.whl (797 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.9 (from langchain)
  Downloading langchain_community-0.0.12-py3-none-any.whl (1.6 MB)
Collecting langchain-core<0.2,>=0.1.7 (from langchain)
  Downloading

### Importing required libraries

In [3]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Pinecone
import pinecone
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline
import os
import sys
import torch

In [4]:
from langchain import HuggingFacePipeline, PromptTemplate
from langchain.chains import RetrievalQA

### Extracting the text from the PDFs

In [5]:
loader = PyPDFDirectoryLoader("pdfs")
data = loader.load()
data

[Document(page_content='YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object\ndetectors\nChien-Yao Wang1, Alexey Bochkovskiy, and Hong-Yuan Mark Liao1\n1Institute of Information Science, Academia Sinica, Taiwan\nkinyiu@iis.sinica.edu.tw, alexeyab84@gmail.com, and liao@iis.sinica.edu.tw\nAbstract\nYOLOv7 surpasses all known object detectors in both\nspeed and accuracy in the range from 5 FPS to 160 FPS\nand has the highest accuracy 56.8% AP among all known\nreal-time object detectors with 30 FPS or higher on GPU\nV100. YOLOv7-E6 object detector (56 FPS V100, 55.9%\nAP) outperforms both transformer-based detector SWIN-\nL Cascade-Mask R-CNN (9.2 FPS A100, 53.9% AP) by\n509% in speed and 2% in accuracy, and convolutional-\nbased detector ConvNeXt-XL Cascade-Mask R-CNN (8.6\nFPS A100, 55.2% AP) by 551% in speed and 0.7% AP\nin accuracy, as well as YOLOv7 outperforms: YOLOR,\nYOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable\nDETR, DINO-5scale-R50, ViT-Adapter-B and

### Splitting the extracted data into text chunks

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)

In [7]:
docs = text_splitter.split_documents(data)

len(docs)

152

In [8]:
docs[3]

Document(page_content='ing [94, 93], autonomous driving [40, 18], robotics [35, 58],\nmedical image analysis [34, 46], etc. The computing de-\nvices that execute real-time object detection is usually some\nmobile CPU or GPU, as well as various neural processing\nunits (NPU) developed by major manufacturers. For exam-\nple, the Apple neural engine (Apple), the neural compute\nstick (Intel), Jetson AI edge devices (Nvidia), the edge TPU\n(Google), the neural processing engine (Qualcomm), the AI', metadata={'source': 'pdfs/yolov7.pdf', 'page': 0})

In [9]:
docs[4]

Document(page_content='processing unit (MediaTek), and the AI SoCs (Kneron), are\nall NPUs. Some of the above mentioned edge devices focus\non speeding up different operations such as vanilla convolu-\ntion, depth-wise convolution, or MLP operations. In this pa-\nper, the real-time object detector we proposed mainly hopes\nthat it can support both mobile GPU and GPU devices from\nthe edge to the cloud.\nIn recent years, the real-time object detector is still de-\nveloped for different edge device. For example, the devel-', metadata={'source': 'pdfs/yolov7.pdf', 'page': 0})
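
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-size chunker. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and newlines; `chunk_text` is a hypothetical helper that only illustrates the two parameters.

```python
# Simplified sliding-window chunker (illustration only, not LangChain's
# recursive splitting algorithm).
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 20) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters, where each
    chunk starts `chunk_overlap` characters before the previous one ended."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early so the last chunk is never just a repeat of the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-20:] == chunks[1][:20])
```

The overlap gives a sentence that straddles a boundary a chance to appear intact in at least one chunk.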

### Downloading embeddings from Hugging Face

In [10]:
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [11]:
query_result = embeddings.embed_query("Hello World")
print("Length", len(query_result))

Length 384
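
Each chunk and each query is mapped to a 384-dimensional vector, and retrieval works by comparing those vectors. The sketch below shows cosine similarity on toy 3-dimensional vectors. The metric configured on the Pinecone index is not shown in this notebook, so cosine is an assumption here, and the example vectors are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors `a` and `b` (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real 384-dimensional embeddings.
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]
far_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```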


### Initializing Pinecone

When running locally, set up a `.env` file with the variables `PINECONE_API_KEY` and `PINECONE_API_ENV`. On Google Colab, store the same two values as Colab secrets instead.

In [12]:
# from dotenv import dotenv_values
# config = dotenv_values(".env")

# # initialize pinecone, if running locally with gpu
# pinecone.init(
#     api_key=config['PINECONE_API_KEY'],  # find at app.pinecone.io
#     environment=config['PINECONE_API_ENV']  # next to api key in console
# )

# initialize pinecone, if running on google colab
from google.colab import userdata
pinecone.init(
    api_key=userdata.get('PINECONE_API_KEY'),  # find at app.pinecone.io
    environment=userdata.get('PINECONE_API_ENV') # next to api key in console
)

index_name = "langchainpinecone" # put in the name of your pinecone index here
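
Whichever source the credentials come from, it can help to fail early when one of them is missing. The following is a minimal sketch; `load_pinecone_settings` is a hypothetical helper, and it accepts any mapping (for example `os.environ` or the dictionary returned by `dotenv_values`).

```python
import os

REQUIRED_KEYS = ("PINECONE_API_KEY", "PINECONE_API_ENV")


def load_pinecone_settings(source=None) -> dict:
    """Return the required settings from `source` (defaults to os.environ).

    Raises KeyError listing every missing or empty variable.
    """
    source = os.environ if source is None else source
    missing = [key for key in REQUIRED_KEYS if not source.get(key)]
    if missing:
        raise KeyError(f"Missing Pinecone settings: {', '.join(missing)}")
    return {key: source[key] for key in REQUIRED_KEYS}


# Dummy values for illustration only.
settings = load_pinecone_settings(
    {"PINECONE_API_KEY": "dummy-key", "PINECONE_API_ENV": "dummy-env"}
)
print(sorted(settings))
```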

### Creating embeddings for each of the text chunks

In [13]:
docsearch=Pinecone.from_texts([t.page_content for t in docs], embeddings, index_name=index_name)
# If you already have an index, you can load it like this
#docsearch = Pinecone.from_existing_index(index_name, embeddings)

### Similarity search

In [14]:
query = "YOLOv7 outperforms which models"
similar_docs = docsearch.similarity_search(query, k=4)  # separate name so the chunk list in `docs` is not overwritten
similar_docs

[Document(page_content='From the results we see that if compared with YOLOv4,\nYOLOv7 has 75% less parameters, 36% less computation,\nand brings 1.5% higher AP. If compared with state-of-the-\nart YOLOR-CSP, YOLOv7 has 43% fewer parameters, 15%\nless computation, and 0.4% higher AP. In the performance\nof tiny model, compared with YOLOv4-tiny-31, YOLOv7-\ntiny reduces the number of parameters by 39% and the\namount of computation by 49%, but maintains the same AP.\nOn the cloud GPU model, our model can still have a higher'),
 Document(page_content='From the results we see that if compared with YOLOv4,\nYOLOv7 has 75% less parameters, 36% less computation,\nand brings 1.5% higher AP. If compared with state-of-the-\nart YOLOR-CSP, YOLOv7 has 43% fewer parameters, 15%\nless computation, and 0.4% higher AP. In the performance\nof tiny model, compared with YOLOv4-tiny-31, YOLOv7-\ntiny reduces the number of parameters by 39% and the\namount of computation by 49%, but maintains the same AP.\
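
The first two results shown above begin with identical text. One possible cause, which is an assumption since the index contents are not shown, is that the same chunks were added to the index in more than one run. A minimal sketch for dropping exact duplicates before using the results follows; `dedupe_texts` is a hypothetical helper that works on plain strings such as each result's `page_content`.

```python
def dedupe_texts(texts: list[str]) -> list[str]:
    """Remove exact duplicates while keeping the original (ranked) order."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


# Made-up result texts standing in for retrieved chunks.
results = ["chunk about AP", "chunk about AP", "chunk about speed"]
print(dedupe_texts(results))
```

This only hides duplicates at query time; clearing or recreating the index before re-inserting chunks would avoid storing them in the first place.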

### Creating a Llama 2 model wrapper

In [15]:
from huggingface_hub import notebook_login
import torch

In [16]:
notebook_login()


In [18]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", use_auth_token=True)




In [19]:
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             device_map='auto',
                                             torch_dtype=torch.float16,
                                             use_auth_token=True,
                                             load_in_8bit=True
                                             )



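
Loading with `load_in_8bit=True` matters mostly for GPU memory. The back-of-the-envelope sketch below estimates the memory needed for the weights alone at different precisions. It assumes a nominal 7 billion parameters and ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 7e9  # nominal size of a "7b" model (assumed round number)

for label, bytes_per_param in (("float32", 4), ("float16", 2), ("int8", 1)):
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bytes_per_param):.0f} GB")
```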

In [20]:
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                torch_dtype=torch.float16,  # same dtype as used when loading the model
                device_map="auto",
                max_new_tokens=512,
                do_sample=True,
                top_k=30,
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id
                )

In [21]:
# Temperature scales the randomness of sampling: values close to 0 make the
# output nearly deterministic, higher values make it more varied.
# It is set low here because answers should stay close to the text retrieved from the PDFs.
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature': 0.1})
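
To show what a low temperature does, the sketch below applies a temperature-scaled softmax to a few made-up logits. It illustrates the general idea only and is not the sampling code used inside `transformers`.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lower temperature almost all probability mass moves to the highest-scoring token, which is why low values give more repeatable answers.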

### Creating a prompt template

In [22]:
SYSTEM_PROMPT = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer."""

In [23]:
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

In [24]:
SYSTEM_PROMPT = B_SYS + SYSTEM_PROMPT + E_SYS

In [25]:
instruction = """
{context}

Question: {question}
"""

In [26]:
template = B_INST + SYSTEM_PROMPT + instruction + E_INST

In [27]:
template

"[INST]<<SYS>>\nUse the following pieces of context to answer the question at the end.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n<</SYS>>\n\n\n{context}\n\nQuestion: {question}\n[/INST]"

In [28]:
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
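
The placeholders `{context}` and `{question}` are filled in when the chain runs. The sketch below reproduces that step with plain `str.format`, using the `<<SYS>>` and `<</SYS>>` markers of the Llama 2 chat format. The shortened system prompt, context, and question are made-up examples, and plain string formatting is used as a stand-in for `PromptTemplate`.

```python
# Llama 2 chat markers: the system prompt sits inside <<SYS>> ... <</SYS>>,
# and the whole turn is wrapped in [INST] ... [/INST].
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

system_prompt = "Use the following pieces of context to answer the question at the end."
instruction = "\n{context}\n\nQuestion: {question}\n"
template = B_INST + B_SYS + system_prompt + E_SYS + instruction + E_INST

# Fill the placeholders with example values.
filled = template.format(
    context="YOLOv7 is a real-time object detector.",
    question="What is YOLOv7?",
)
print(filled)
```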

In [29]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)
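
With `chain_type="stuff"`, the retrieved chunks are concatenated and placed into the prompt's `{context}` slot before a single call to the model. The sketch below is a simplified outline of that flow, not LangChain's implementation; `toy_retrieve` and `toy_generate` are hypothetical stand-ins for the Pinecone retriever and the Llama 2 pipeline.

```python
TEMPLATE = "{context}\n\nQuestion: {question}\n"


def stuff_documents(chunks, question, template=TEMPLATE, separator="\n\n"):
    """Join all retrieved chunks and insert them into one prompt."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


def toy_retrieve(question, k):
    """Stand-in retriever: rank a tiny corpus by shared lowercase words."""
    corpus = [
        "YOLOv7 is an object detector.",
        "Pinecone stores vectors.",
        "Llama 2 is a language model.",
    ]
    words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda text: len(words & set(text.lower().rstrip(".").split())),
        reverse=True,
    )
    return ranked[:k]


def toy_generate(prompt):
    """Stand-in for the language model call."""
    return f"(model output for a {len(prompt)}-character prompt)"


def answer(question, retrieve=toy_retrieve, generate=toy_generate, k=2):
    """Retrieve k chunks, stuff them into the prompt, and call the model once."""
    chunks = retrieve(question, k)
    prompt = stuff_documents(chunks, question)
    return {"query": question, "result": generate(prompt), "source_documents": chunks}


print(answer("What is YOLOv7?"))
```

Because every retrieved chunk goes into one prompt, `k` and the chunk size together have to stay within the model's context window.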

In [30]:
result = qa_chain.invoke({"query": "YOLOv7 is used for"})


In [31]:
result['result']

'  Based on the given context, YOLOv7 is used for object detection.'
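
Because the chain was created with `return_source_documents=True`, the result dictionary also carries the retrieved chunks under `source_documents`. The sketch below lists them in a compact form. `FakeDocument` is a hypothetical stand-in for LangChain's `Document` so the example runs without the chain, and the sample result is made up.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in with the two attributes used here: page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(result: dict, width: int = 60) -> list[str]:
    """Return one line per source document: index, source, and a short snippet."""
    lines = []
    for i, doc in enumerate(result.get("source_documents", []), start=1):
        source = doc.metadata.get("source", "unknown source")
        snippet = " ".join(doc.page_content.split())[:width]
        lines.append(f"[{i}] {source}: {snippet}")
    return lines


# Made-up result shaped like the chain output.
example_result = {
    "result": "YOLOv7 is used for object detection.",
    "source_documents": [
        FakeDocument("YOLOv7 surpasses all known\nobject detectors", {"source": "pdfs/yolov7.pdf"}),
        FakeDocument("A chunk stored without metadata"),
    ],
}
for line in summarize_sources(example_result):
    print(line)
```

Chunks stored as plain texts carry no metadata, in which case the source falls back to the placeholder string.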

In [32]:
while True:
    user_input = input("prompt:")
    if user_input == 'exit':
        print('Exiting')
        break  # leave the loop without raising SystemExit in the notebook
    if user_input == '':
        continue
    result = qa_chain.invoke({'query': user_input})
    print(f"Answer:{result['result']}")

prompt:Which models does YOLOv7 outperform?
Answer:  Based on the provided context, YOLOv7 outperforms the following models:

1. YOLOR-P6: YOLOv7-W6 has 8 fps faster inference speed and 1% higher detection rate than YOLOR-P6.
2. YOLOR-E6: YOLOv7-E6 has 0.8% higher AP than YOLOR-E6.
3. YOLOv5-X6 (r6.1): YOLOv7-E6 has 0.9% higher AP than YOLOv5-X6 (r6.1).

It's important to note that these comparisons are based on the provided context and may not be applicable to all scenarios.
prompt:What do I want for lunch?
Answer:  I don't know what you want for lunch. I'm just an AI and do not have access to your personal preferences or dietary needs. Additionally, I cannot read minds or predict your choices. If you want to know what I recommend for lunch, I can provide some general suggestions based on popular options. However, it's important to consult with a medical professional before making any significant changes to your diet.
prompt:Is YOLOv7 faster than the vision transformer ViT?
Answer:  B

KeyboardInterrupt: Interrupted by user