<a href="https://colab.research.google.com/github/Vibhuarvind/Content-Engine-RAG-for-PDF/blob/main/Content_Engine_PDF_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Installing Dependencies**

1. **LangChain**: LangChain is a framework for building applications on top of language models. It provides document loaders, text splitters, vector store integrations, and chains. In this notebook it loads the PDF documents, splits their text into chunks, and connects the retriever to the language model.

2. **Torch**: PyTorch is an open-source machine learning library for building, training, and running neural networks. This notebook does not train or fine-tune a model; PyTorch is the backend that Sentence Transformers uses to run the embedding model.

3. **Sentence Transformers**: Sentence Transformers is a Python library for computing sentence embeddings with pre-trained transformer models. An embedding is a dense vector that captures the meaning of a piece of text. Here it generates an embedding for each text chunk extracted from the PDFs so that the chunks can be indexed in a vector store and retrieved by similarity.

4. **Faiss-CPU**: Faiss is a library for efficient similarity search and clustering of dense vectors; `faiss-cpu` is its CPU-only build, suitable for smaller workloads and for machines without a GPU. Here it holds the index over the chunk embeddings and returns the chunks closest to each user query.

5. **PyPDF**: `pypdf` is a Python library for reading, writing, and manipulating PDF files. Here it extracts the text of each page of the PDF documents, which is then split into chunks and used to build the knowledge base.

6. **llama-cpp-python**: llama-cpp-python provides Python bindings for llama.cpp, a C/C++ inference engine that runs LLaMA-family and other models stored in the GGUF format. Here it runs a local, quantized Mistral 7B Instruct model, which generates answers from the retrieved chunks.

7. **Accelerate**: Accelerate is a Hugging Face library that lets the same PyTorch code run on CPU, a single GPU, multiple GPUs, or TPUs, with support for distributed and mixed-precision training. This notebook does not train a model and never imports Accelerate directly; it is installed as a supporting package for the Hugging Face stack.

8. **Transformers**: Transformers is a Hugging Face library of pre-trained models and tools for tasks such as text classification, named entity recognition, and question answering. Here it is used indirectly: Sentence Transformers builds on it to load and run the embedding model.

9. **Hugging Face Hub**: The Hugging Face Hub is a platform for hosting and sharing models and datasets; the `huggingface_hub` package is its Python client. Here it downloads the `sentence-transformers/all-MiniLM-L6-v2` embedding model the first time that model is loaded.
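
The `pip install` commands in this section are unpinned, so it can help to record which versions were actually resolved. The following is a minimal sketch using only the standard library; the helper name `report_versions` is made up for illustration, and the package names are the ones installed in this notebook.

```python
from importlib import metadata

# Distribution names as passed to pip in this notebook.
PACKAGES = [
    "langchain", "langchain-community", "torch", "sentence-transformers",
    "faiss-cpu", "pypdf", "llama-cpp-python", "accelerate",
    "transformers", "huggingface-hub",
]


def report_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in report_versions(PACKAGES).items():
    print(f"{name:24s} {version or 'not installed'}")
```

Copying the printed versions into a requirements file is one way to make a later run reproducible.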

In [None]:
!pip install langchain
!pip install torch
!pip install sentence_transformers
!pip install faiss-cpu
!pip install pypdf
!pip install llama-cpp-python
!pip -q install accelerate
!pip install transformers
!pip install huggingface_hub

Collecting langchain
  Downloading langchain-0.2.6-py3-none-any.whl (975 kB)
Collecting langchain-core<0.3.0,>=0.2.10 (from langchain)
  Downloading langchain_core-0.2.10-py3-none-any.whl (332 kB)
Collecting langch

In [None]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.2.6-py3-none-any.whl (2.2 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl (28 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.21.3-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Installing collected packages: mypy-extensi

# Mistral_7B_Instruct

In [5]:
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Integrations are provided by the langchain-community package installed above
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import FAISS

# Loading PDF Content

In [14]:
loader = PyPDFDirectoryLoader('documents/')
data = loader.load()

In [15]:
print(data)

[Document(metadata={'source': 'documents\\DM Final Project Report.pdf', 'page': 0, 'page_label': '1'}, page_content='D i s e a s e C l a s s i f i c a t i o n i n P a t i e n t X - R a y s u s i n g D e e p L e a r n i n g A p p r o a c h e s \n S r i n i v a s N a t a r a j a n SCAI Arizona State University T empe AZ US snatar28@asu.edu \nS a i t e j a A l a p a r t h i SCAI Arizona State University T empe AZ US salapart@asu.edu \nT a n u s h i A h u j a SCAI Arizona State University T empe AZ US tahuja4@asu.edu \nD a r s h a n G o v i n d a r a j SCAI Arizona State University T empe AZ US dgovind5@asu.edu \nA B S T R A C T The interpretation of X-rays is critical for identifying various medical conditions and their precursors. In this project, we explore the potential of using computer systems to aid in the diagnosis process using various deep learning approaches for ef ficient disease localization in patient X-rays. Specifically , we use standard boosting techniques, deep learning a

# Text Splitting and Chunk Retrieval

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=20)
text_chunks = text_splitter.split_documents(data)
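
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); the function name `chunk_text` and the toy input are assumptions for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

With the notebook's settings (`chunk_size=10000`, `chunk_overlap=20`), neighbouring chunks share at most 20 characters, so the overlap mainly protects a sentence that would otherwise be cut at a chunk boundary.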

In [17]:
text_chunks[-1]

Document(metadata={'source': 'documents\\Srinivas_Natarajan_Resume.pdf', 'page': 0, 'page_label': '1'}, page_content='Srinivas Natarajan Data Engineer\nTempe, Arizona 85282\n+1 602 814 9054 nsrinivas06@gmail.com linkedin.com/in/srinivas-natarajan github.com/Srinivas-Natarajan\nEducation\nArizona State University August 2022 – May 2024\nMaster of Science in Computer Science GPA:3.96/4.0\nRelevant Coursework: Cloud Computing, Statistical Machine Learning, Data Mining, Artificial Intelligence\nVellore Institute of Technology July 2018 – June 2022\nBachelor of Technology in Computer Science GPA: 9.38/10.0\nRelevant Coursework: Computer Architecture, Database Systems, Operating Systems, Natural Language Processing\nTechnical Skills\nLanguages/Tools: Python, C++, JavaScript, Java, SQL, PostgreSQL MongoDB, NoSQL, HTML/CSS, Bootstrap, Node.js,\nMicrosoft Power BI, Tableau, MATLAB, Git, GitHub, AWS, Oracle Cloud (OCI), Docker, Linux, LangChain, FastAPI,\nTensorFlow, PyTorch, Matplotlib, OpenCV,

# Generate Vectors: Use a local embedding model to create embeddings for document content.

In [13]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


  from .autonotebook import tqdm as notebook_tqdm
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
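
`all-MiniLM-L6-v2` maps each text to a 384-dimensional vector, and texts with similar meaning end up with vectors that point in similar directions. The sketch below shows cosine similarity, a common way to compare such vectors, on made-up 3-dimensional vectors; the vectors and the function name are illustrative assumptions, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
related = [2.0, 0.0, 2.0]    # same direction as the query
unrelated = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, related) > cosine_similarity(query, unrelated))
# True
```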


# Store in Vector Store: Utilize local persisting methods in the chosen vector store.

In [None]:
vector_store = FAISS.from_documents(text_chunks, embedding=embeddings)
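
Conceptually, a flat vector index stores one vector per chunk and, at query time, returns the chunks whose vectors are closest to the query vector. The brute-force sketch below illustrates that idea with Euclidean distance on toy 2-dimensional vectors. It is not Faiss code; `top_k`, the vectors, and the chunk labels are assumptions for illustration.

```python
import math


def top_k(query_vec, indexed, k):
    """Return the k (distance, label) pairs closest to query_vec.

    indexed is a list of (label, vector) pairs.
    """
    scored = [(math.dist(query_vec, vec), label) for label, vec in indexed]
    return sorted(scored)[:k]


# Toy index: one made-up vector per chunk.
index = [
    ("chunk about revenue", [0.9, 0.1]),
    ("chunk about vehicles", [0.1, 0.9]),
    ("chunk about advertising", [0.8, 0.3]),
]

for distance, label in top_k([1.0, 0.0], index, k=2):
    print(f"{distance:.3f}  {label}")
```

A brute-force scan like this compares the query against every stored vector, which is fine for a handful of PDFs; libraries such as Faiss exist to keep that lookup fast as the number of chunks grows.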

# Integrate LLM: Run a local instance of a Large Language Model for contextual insights.

In [None]:
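# Load the quantized Mistral 7B Instruct model from a local GGUF file
llm = LlamaCpp(
    model_path='/content/drive/MyDrive/Colab Notebooks/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
    temperature=0.75,
    top_p=1,
    n_ctx=4096,      # context window, in tokens
    streaming=True,
    verbose=True,    # print llama.cpp load and timing logs
)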

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /content/drive/MyDrive/Colab Notebooks/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loade
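
The `temperature` and `top_p` arguments shape how the next token is sampled. As a conceptual sketch (not llama.cpp's actual sampler), temperature rescales the logits before the softmax, and top-p keeps only the most likely tokens whose cumulative probability reaches `top_p`. The function names and toy logits are assumptions for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize.

    Returns a dict mapping token index to renormalized probability.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0]
probs = softmax_with_temperature(toy_logits, temperature=0.75)
print(probs)
print(top_p_filter(probs, top_p=1.0))  # keeps all three tokens
print(top_p_filter(probs, top_p=0.5))  # keeps only the most likely token
```

With `top_p=1`, as in the cell above, the top-p step keeps every token, while `temperature=0.75` makes the distribution somewhat sharper than it would be at 1.0.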

# Configure Query Engine: Set up retrieval tasks based on document embeddings.

In [None]:
# Retrieve the 2 most similar chunks for each query
retriever = vector_store.as_retriever(search_kwargs={"k": 2})
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
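
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows the idea in plain Python; the template wording, the helper names, and the rule of thumb of roughly 4 characters per token are assumptions for illustration, not LangChain's actual prompt or tokenizer.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks and the question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def rough_token_estimate(text, chars_per_token=4):
    """Very rough token count; real tokenizers will differ."""
    return -(-len(text) // chars_per_token)  # ceiling division


chunks = ["Chunk one text.", "Chunk two text."]
prompt = build_stuff_prompt("What does chunk two say?", chunks)
print(prompt)
print(rough_token_estimate(prompt), "tokens (rough estimate)")
```

Since every retrieved chunk goes into one prompt, `k` times the chunk size has to fit within the model's context window (`n_ctx=4096` in this notebook), along with the question and the generated answer.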

In [None]:
query = "What are the differences in the business of Tesla and Uber?"

In [None]:
qa.invoke({"query": query})["result"]

llama_print_timings:        load time =    3306.32 ms
llama_print_timings:      sample time =      72.77 ms /   131 runs   (    0.56 ms per token,  1800.22 tokens per second)
llama_print_timings: prompt eval time = 1154473.83 ms /  2475 tokens (  466.45 ms per token,     2.14 tokens per second)
llama_print_timings:        eval time =   93895.68 ms /   130 runs   (  722.27 ms per token,     1.38 tokens per second)
llama_print_timings:       total time = 1248775.63 ms /  2605 tokens
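
The timing log above can be read with some simple arithmetic: dividing elapsed milliseconds by the token count gives the per-token cost, and comparing the prompt evaluation time to the total shows where the time goes. The numbers below are copied from the log; the helper names are illustrative.

```python
def ms_per_token(elapsed_ms, tokens):
    """Average milliseconds spent per token."""
    return elapsed_ms / tokens


def tokens_per_second(elapsed_ms, tokens):
    """Average throughput in tokens per second."""
    return tokens / (elapsed_ms / 1000.0)


# Values copied from the llama_print_timings lines above.
PROMPT_EVAL_MS, PROMPT_TOKENS = 1154473.83, 2475
TOTAL_MS = 1248775.63

print(f"{ms_per_token(PROMPT_EVAL_MS, PROMPT_TOKENS):.2f} ms per prompt token")
print(f"{tokens_per_second(PROMPT_EVAL_MS, PROMPT_TOKENS):.2f} prompt tokens per second")
print(f"{PROMPT_EVAL_MS / TOTAL_MS:.0%} of total time spent on prompt evaluation")
print(f"{TOTAL_MS / 60000:.1f} minutes in total")
```

On this run, prompt evaluation accounts for most of the total time, so a shorter prompt (fewer or smaller retrieved chunks) would likely reduce latency when running on CPU.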


" Tesla and Uber are two very different companies with different business models. Tesla is primarily involved in the manufacturing and sale of electric vehicles, energy storage products, and solar panels. They also provide services such as maintenance and repair of vehicles, installation of energy storage products, and energy management for commercial and industrial customers. On the other hand, Uber's primary business is providing a technology platform that connects consumers with independent providers of ride services, meal preparation, grocery, and other delivery services. Uber also provides financial partnership products and advertising opportunities to consumers and merchants. The two companies operate in different industries and have different business models."

# Develop Chatbot Interface: Use Streamlit to facilitate user interaction and display comparative insights.

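App: https://github.com/Vibhuarvind/Content-Engine-RAG-for-PDF/blob/main/app.py

The cell below is a plain console loop for trying the chain from inside the notebook; type `exit` to stop.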

In [None]:
while True:
    user_input = input("Input Prompt : ").strip()
    if user_input == "exit":
        print("Exiting...")
        break  # leave the loop without raising SystemExit in the notebook
    if not user_input:
        continue
    result = qa.invoke({"query": user_input})
    print(f"Answer:{result['result']}")

Input Prompt : What is the total revenue for Google Search?


Llama.generate: prefix-match hit

llama_print_timings:        load time =    3306.32 ms
llama_print_timings:      sample time =      16.06 ms /    24 runs   (    0.67 ms per token,  1494.58 tokens per second)
llama_print_timings: prompt eval time =  858014.79 ms /  1845 tokens (  465.05 ms per token,     2.15 tokens per second)
llama_print_timings:        eval time =   16018.39 ms /    23 runs   (  696.45 ms per token,     1.44 tokens per second)
llama_print_timings:       total time =  874102.63 ms /  1868 tokens


Answer: The total revenue for Google Search in 2023 is $175,033 million.
Input Prompt : exit
Exiting...
