<a href="https://colab.research.google.com/github/Sweta-Das/LangChain-HuggingFace-LLM/blob/main/PDF_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# %%capture
%pip -q install pypdf
%pip -q install torch
%pip -q install langchain
%pip -q install faiss-cpu
%pip -q install accelerate
%pip -q install transformers
%pip -q install huggingface-hub
%pip -q install llama-cpp-python
%pip -q install sentence-transformers

In [None]:
%pip install -q PyPDF2

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader
from sentence_transformers import SentenceTransformer, util
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain import HuggingFaceHub
from langchain.llms import LlamaCpp
import numpy as np
import sys, random
import PyPDF2
import torch
import time
import os

**About Libraries**:<br>
- *RecursiveCharacterTextSplitter* : a LangChain class that splits text into smaller chunks using an ordered list of separators and a target chunk size. It tries the coarsest separator first and recursively falls back to finer ones until each chunk fits within the size limit.
- *PyPDFDirectoryLoader* : a LangChain class that extracts text from the PDF docs stored in a directory. It returns a list of `Document` objects (one per page), each holding the page text plus metadata such as the source path.
- *SentenceTransformer* : a class used for embedding sentences into numerical vectors for various NLP tasks
- *HuggingFaceEmbeddings* : a LangChain class that wraps Hugging Face embedding models so they can be used in a LangChain pipeline
- *FAISS* : Facebook AI Similarity Search, a library designed for efficient similarity search and clustering of dense vectors
- *RetrievalQA* : a LangChain chain that answers questions by retrieving relevant documents and passing them to an LLM
- *LlamaCpp* : a LangChain wrapper class around `llama-cpp-python` (Python bindings for llama.cpp, a C/C++ inference engine) that runs local models such as Llama or Mistral within LangChain
- *LLMChain* : a LangChain class that combines a prompt template with an LLM into a single chain
- *HuggingFaceHub* : a LangChain class for calling models hosted on the Hugging Face Hub
- *PyPDF2* : a library that works with PDF files in Python; *PdfReader* reads the PDF docs' content
- *pdfplumber* : a library for extracting text and tables from PDF docs; *pdfplumber.open* opens a PDF for reading
- *numba* : a library in the Python ecosystem used for high-performance numerical computing. It provides a **JIT (Just In Time)** compiler *(@jit)* that translates Python functions into optimized machine code at runtime. It also supports **CUDA** *(@cuda.jit)* to execute code on NVIDIA GPUs.
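
The recursive splitting idea described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` and its default separators are assumptions made for this sketch, separators are dropped at chunk boundaries, and chunk overlap is not handled.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    The coarsest separator is tried first; a finer one is used only
    for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with a finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee fff ggg", chunk_size=8))
```

Every returned chunk is at most `chunk_size` characters long, and words are kept whole as long as a whitespace separator can satisfy the limit.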



In [4]:
# Accessing through HuggingFace Access Token (replace the placeholder value with your own token)
os.environ['HUGGINGFACEHUB_API_TOKEN'] = 'HUGGINGFACEHUB_API_TOKEN'

In [5]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive/')
model = 'drive/MyDrive/LLM_Model/mistral-7b-instruct-v0.1.Q3_K_S.gguf'

Mounted at /content/drive/


In [6]:
# Reading PDF and extracting ToC
def extract_ToC(pdf_path, start_page, end_page):
  """Return the text lines of pages start_page..end_page (0-based, inclusive)."""
  toc_entries = []

  with open(pdf_path, 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)

    for page_number in range(start_page, end_page + 1):
      page = pdf_reader.pages[page_number]
      text = page.extract_text()
      # Strip the roman-numeral page numbers. "viii" must be removed before
      # "vii"; otherwise "viii" is reduced to a stray "i".
      text = text.replace("viii", "").replace("vii", "")

      toc_entries.extend(text.splitlines())

  return toc_entries

pdf_path = "/content/drive/MyDrive/LLM_Model/Yoga_Education_for_Children_Vol_1.pdf"
toc = extract_ToC(pdf_path, 7, 8)
toc

['Contents',
 'Introduction  1',
 'Yoga and Education  ',
 ' 1. The Need for a Y oga-Based Education System  13',
 ' 2. Yoga and Children’s Problems  22',
 ' 3. Yoga with Pre-School Children  25',
 ' 4. Yoga Lessons Begin at Age Eight  31',
 ' 5. Student Unr est and Its Remedy  34',
 ' 6. Yoga and the Youth Problem  39',
 ' 7. Better Ways of Educatio n 45',
 ' 8. Yoga at School  50',
 ' 9. Yoga and Education  57',
 '10. Questions and Answers  65',
 'Yoga as Therapy  ',
 '11. Yoga for Emotional Disturbances  77',
 '12. Yoga for the Disabled  83',
 '13. Yoga Benefits Juvenile Diabetes  87',
 'Practices  ',
 '14. Yoga Techniques for Pre-School Children  93',
 '15. Yoga Techniques for 7–14 Y ear-Olds  101',
 '16. Yoga Techniques for the Classroom  110',
 '17. Introduction to Asana  133',
 '18. Pawanmuktasana Series  139',
 'Pawanmuktasana 1: Anti-Rheumatic Asanas  141',
 'Pawanmuktasana 2: Anti-Gastric Asanas  156',
 'Pawanmuktasana 3: Energizing Asanas  165',
 '19.  Eye Exercises  171',
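
The extracted lines are raw strings with the page number fused onto the end. A minimal sketch of turning them into structured entries follows; the helper name `parse_toc` and the regular expression are assumptions for illustration, not part of the notebook. Section headings without a page number (such as 'Contents' or 'Practices') are skipped.

```python
import re

# Optional chapter number, then a title, then a trailing page number.
TOC_LINE = re.compile(r"^\s*(?:(\d+)\.\s+)?(.*?)\s*(\d+)\s*$")


def parse_toc(lines):
    """Return (chapter, title, page) tuples; chapter is None when absent."""
    entries = []
    for line in lines:
        match = TOC_LINE.match(line)
        if not match or not match.group(2):
            continue  # heading without a page number, or a bare number
        chapter, title, page = match.groups()
        entries.append((int(chapter) if chapter else None, title, int(page)))
    return entries


sample = [
    'Contents',
    'Introduction  1',
    ' 8. Yoga at School  50',
    'Pawanmuktasana 1: Anti-Rheumatic Asanas  141',
    '19.  Eye Exercises  171',
]
for entry in parse_toc(sample):
    print(entry)
```

The parser does not repair extraction artifacts inside titles (for example the stray space in 'Y oga-Based'); those would need a separate cleanup step.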
 

In [11]:
# Downloading embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [12]:
# Loading the PDF (one Document per page) and indexing the page embeddings with the FAISS class
from langchain.document_loaders import PyPDFLoader

documents = PyPDFLoader("/content/drive/MyDrive/LLM_Model/Yoga_Education_for_Children_Vol_1.pdf").load()
vector_store = FAISS.from_documents(documents, embedding = embeddings)

In [13]:
print(vector_store)

<langchain_community.vectorstores.faiss.FAISS object at 0x7967c8186b90>
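
Conceptually, the vector store keeps one embedding per document and answers a query by returning the documents whose embeddings are nearest to the query embedding. The sketch below shows that idea as a brute-force search in plain Python. It assumes Euclidean distance and uses made-up 3-dimensional vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions); FAISS does the same kind of lookup with optimized index structures.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, indexed, k=1):
    """Return the k (distance, text) pairs closest to query_vec."""
    scored = sorted((l2_distance(query_vec, vec), text) for text, vec in indexed)
    return scored[:k]


# Toy "embeddings", invented for illustration only.
indexed = [
    ("page about eye exercises", [0.9, 0.1, 0.0]),
    ("page about asana", [0.1, 0.9, 0.1]),
    ("page about diabetes", [0.0, 0.2, 0.9]),
]

query_vec = [0.8, 0.2, 0.1]  # pretend embedding of an eye-exercise question
for distance, text in nearest(query_vec, indexed, k=2):
    print(f"{distance:.3f}  {text}")
```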


In [14]:
# Loading LLM model
llm = LlamaCpp(
    streaming = True,
    model_path = "/content/drive/MyDrive/LLM_Model/mistral-7b-instruct-v0.1.Q3_K_S.gguf",
    temperature = 0.75,
    top_p = 1,
    verbose = True,
    n_ctx = 4096
)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /content/drive/MyDrive/LLM_Model/mistral-7b-instruct-v0.1.Q3_K_S.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - k

In [15]:
# Creating a question-answering system: retrieve the top-1 page and "stuff" it into the prompt
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type = "stuff",
                                 retriever=vector_store.as_retriever(search_kwargs={"k":1}))
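
The "stuff" chain type places all retrieved documents into a single prompt together with the question. A rough sketch of that step is below; the function `build_stuff_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved document into one context block."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = ["Exercise 1: Palming. Sit quietly with the eyes closed."]
print(build_stuff_prompt("What is palming?", retrieved))
```

Because everything goes into one prompt, the retrieved text plus the question must fit within the model's context window (`n_ctx = 4096` here), which is one reason to keep `k` small.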

In [16]:
query = "Meaning of pawanmuktasana"
qa.invoke(query)


llama_print_timings:        load time =    4798.54 ms
llama_print_timings:      sample time =      66.52 ms /   116 runs   (    0.57 ms per token,  1743.94 tokens per second)
llama_print_timings: prompt eval time =  207780.27 ms /   402 tokens (  516.87 ms per token,     1.93 tokens per second)
llama_print_timings:        eval time =   74837.75 ms /   115 runs   (  650.76 ms per token,     1.54 tokens per second)
llama_print_timings:       total time =  283226.78 ms /   517 tokens


{'query': 'Meaning of pawanmuktasana',
 'result': " Pawanmuktasana is a set of yoga poses that help release wind and gases from the body. It's aimed at regulating what are known as 'humours' in ancient Indian medical science called Ayurveda - phlegm or kapha, wind or vata and acid/bile, pitta. These three humours control all bodily functions and any irregularity can lead to disease. The series is also used for developing body awareness and facilitating equal development of the body and brain hemispheres."}
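
The per-token figures in the timing log follow directly from the totals. The short check below recomputes them for the prompt-evaluation line (207780.27 ms for 402 tokens); `throughput` is a helper name chosen for this sketch.

```python
def throughput(total_ms, tokens):
    """Return (ms per token, tokens per second) for one timing line."""
    return total_ms / tokens, tokens / (total_ms / 1000.0)


ms_per_token, tokens_per_s = throughput(207780.27, 402)
print(f"{ms_per_token:.2f} ms per token, {tokens_per_s:.2f} tokens per second")
# 516.87 ms per token, 1.93 tokens per second
```

Prompt evaluation accounts for roughly 208 s of the 283 s total in this run, so most of the latency comes from processing the retrieved context rather than from generating the answer.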

In [18]:
# Create an infinite loop to interact with the system
import sys

while True:
  user_input = input("Input Prompt: ")
  if user_input == 'exit':
    print('Exiting')
    # In a notebook, sys.exit() raises SystemExit, which shows up in the cell output
    sys.exit()
  if user_input == '':
    continue
  # pass the query to the system and print the response
  result = qa.invoke({'query': user_input})
  print(f"Answer: {result['result']}")

Llama.generate: prefix-match hit

llama_print_timings:        load time =    4798.54 ms
llama_print_timings:      sample time =      14.43 ms /    26 runs   (    0.55 ms per token,  1802.05 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   17404.39 ms /    26 runs   (  669.40 ms per token,     1.49 tokens per second)
llama_print_timings:       total time =   17514.20 ms /    27 tokens


Answer:  The eye exercises provided to children are Palming, Squinting, Focusing, Blinking, and Double Vision.


Llama.generate: prefix-match hit

llama_print_timings:        load time =    4798.54 ms
llama_print_timings:      sample time =      52.71 ms /    91 runs   (    0.58 ms per token,  1726.36 tokens per second)
llama_print_timings: prompt eval time =  201180.35 ms /   359 tokens (  560.39 ms per token,     1.78 tokens per second)
llama_print_timings:        eval time =   60034.68 ms /    90 runs   (  667.05 ms per token,     1.50 tokens per second)
llama_print_timings:       total time =  261707.69 ms /   449 tokens


Answer:  Pawanmuktasana is a group of exercises that release wind and gases from the body. It is used in ancient Indian medical science known as ayurveda to regulate the "humours," or bodily fluids, which are thought to control all the functions of the body. The exercises involve simple movements that help develop body awareness and promote equal development of the body. They should be practiced slowly to develop concentration.
Input Prompt: exit
Exiting


SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
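
The `SystemExit` message above comes from calling `sys.exit()` inside a notebook. One possible alternative, sketched here, is to leave the loop with `break`. The function `chat_loop` and its `ask`, `read`, and `write` parameters are assumptions introduced so the loop can be exercised without a model.

```python
def chat_loop(ask, read=input, write=print):
    """Prompt until the user types 'exit'; skip empty input."""
    while True:
        user_input = read("Input Prompt: ").strip()
        if user_input == "exit":
            write("Exiting")
            break  # ends the loop without raising SystemExit
        if not user_input:
            continue
        write(f"Answer: {ask(user_input)}")


# Dry run with scripted input and a stand-in for the QA chain.
scripted = iter(["", "hello", "exit"])
chat_loop(lambda q: q.upper(), read=lambda prompt: next(scripted))
```

In the notebook, the same loop could be started with `chat_loop(lambda q: qa.invoke({'query': q})['result'])`.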


Checking document sources

In [19]:
query = "What are some of the eye exercises for children?"

docs = vector_store.similarity_search(query)
print(f"Query = {query}")
print(f"Retrieved docs: {len(docs)}")

for doc in docs:
  # Each Document exposes its text and metadata directly
  print("Source: ", doc.metadata['source'])
  print("Text ", doc.page_content, "\n")

Query = What are some of the eye exercises for children?
Retrieved docs: 4
Source:  /content/drive/MyDrive/LLM_Model/Yoga_Education_for_Children_Vol_1.pdf
Text  171
BSY ©19
Eye Exercises
The following exercises can remove and prevent most eye 
diseases, both muscular and optical, if they are practised with patience and perseverance. Many people who have done 
these exercises over a long period of time have dis  carded 
their spectacles. Aldous Huxley was one such person.
 After each of the exercises, the eyes should be closed and 
rested for at least half a minute. The more often the exercises are done the better; however, if there is lack of time in the daily program then the whole series performed once in the morning and once in the evening will suffice. If this is the case, there is extra reason to do the exercises with maximum dedication and awareness.
Exercise 1: Palming
Sit quietly with the eyes 
closed and face the sun if possible.Rub the palms of the hands 
together vigor  ousl

[**Yoga_Education_for_Children_Vol_1**](https://www.amazon.com/Yoga-Education-Children-VOL-1/dp/8185787336)