<a href="https://colab.research.google.com/github/AdityaPrasad275/AdityaPrasad275/blob/main/PDF_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installations and Models

In [None]:
### Installations
!pip install langchain
!pip install chromadb
!pip install pdfplumber
!pip install tiktoken
!pip install lxml
!pip install torch
!pip install transformers
!pip install accelerate
!pip install sentence-transformers
!pip install einops
!pip install xformers
!pip install keras
!pip install safetensors
!pip install InstructorEmbedding

# Text embedding models
EMB_OPENAI_ADA = "text-embedding-ada-002"
EMB_INSTRUCTOR_XL = "hkunlp/instructor-xl"
EMB_SBERT_MPNET_BASE = "sentence-transformers/all-mpnet-base-v2"

# LLM models
LLM_OPENAI_GPT35 = "gpt-3.5-turbo"
LLM_FLAN_T5_XXL = "google/flan-t5-xxl"
LLM_FLAN_T5_XL = "google/flan-t5-xl"
LLM_FASTCHAT_T5_XL = "lmsys/fastchat-t5-3b-v1.0"
LLM_FLAN_T5_SMALL = "google/flan-t5-small"
LLM_FLAN_T5_BASE = "google/flan-t5-base"
LLM_FLAN_T5_LARGE = "google/flan-t5-large"
LLM_FALCON_SMALL = "tiiuae/falcon-7b-instruct"

# We use instructor-xl for text embeddings and flan-t5-base as the LLM.
# If a GPU becomes available, falcon-7b-instruct would be a better choice.

Collecting langchain
  Downloading langchain-0.0.321-py3-none-any.whl (1.9 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.1-py3-none-any.whl (27 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langsmith<0.1.0,>=0.0.43 (from langchain)
  Downloading langsmith-0.0.49-py3-none-any.whl (41 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain)
  Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langch

# Setting up the models

## Functions

In [None]:
from langchain.document_loaders import PDFPlumberLoader
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter
from transformers import pipeline
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings, HuggingFaceEmbeddings
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import HuggingFacePipeline, OpenAI
import torch
from transformers import AutoTokenizer
import re


In [None]:
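def create_SBERT_embedding_model():
    # Run on the GPU when one is available, otherwise fall back to the CPU.
    # (The original code overwrote this choice with "cpu" on the next line,
    # which made the check above it dead code.)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return HuggingFaceEmbeddings(model_name=EMB_SBERT_MPNET_BASE, model_kwargs={"device": device})

def create_instructor_xl_embedding_model():
    # Same device selection, for the instructor-xl embedding model.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return HuggingFaceInstructEmbeddings(model_name=EMB_INSTRUCTOR_XL, model_kwargs={"device": device})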


def create_flan_t5_base(load_in_8bit=False):
    # Build a text2text-generation pipeline around flan-t5-base.
    model = LLM_FLAN_T5_BASE
    tokenizer = AutoTokenizer.from_pretrained(model)
    return pipeline(
        task="text2text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=100,  # cap on the length of the generated answer
        model_kwargs={
            "device_map": "auto",
            "load_in_8bit": load_in_8bit,
            "max_length": 512,
            "temperature": 0.,
        },
    )

def create_falcon_instruct_small(load_in_8bit=False):
    # Build a text-generation pipeline around falcon-7b-instruct.
    model = LLM_FALCON_SMALL
    tokenizer = AutoTokenizer.from_pretrained(model)
    hf_pipeline = pipeline(
        task="text-generation",
        model=model,
        tokenizer=tokenizer,
        trust_remote_code=True,  # allow model code from the Hub repo to run
        max_new_tokens=100,  # cap on the length of the generated answer
        model_kwargs={
            "device_map": "auto",
            "load_in_8bit": load_in_8bit,
            "max_length": 512,
            "temperature": 0.01,
            "torch_dtype": torch.bfloat16,
        },
    )
    return hf_pipeline

## Instantiate the LLM and the embedding model

In [None]:
llm = create_flan_t5_base(load_in_8bit=False)
embedding = create_instructor_xl_embedding_model()


load INSTRUCTOR_Transformer
max_seq_length  512


# Turning PDFs to embeddings

## Loading pdf

In [None]:
!wget https://www.iitb.ac.in/newacadhome/ugrulebook.pdf -O rulebook.pdf

--2023-10-24 08:44:21--  https://www.iitb.ac.in/newacadhome/ugrulebook.pdf
Resolving www.iitb.ac.in (www.iitb.ac.in)... 103.21.124.10
Connecting to www.iitb.ac.in (www.iitb.ac.in)|103.21.124.10|:443... connected.
HTTP request sent, awaiting response... 200 
Length: 1999956 (1.9M) [application/pdf]
Saving to: ‘rulebook.pdf’


2023-10-24 08:44:24 (1.14 MB/s) - ‘rulebook.pdf’ saved [1999956/1999956]



In [None]:
pdf_path = "rulebook.pdf"
loader = PDFPlumberLoader(pdf_path)
documents = loader.load()

## Split text doc and create text snippets

In [None]:
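# First pass: split on paragraph breaks into small character-based pieces.
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# Second pass: cap every piece at 1000 tokens, with a 10-token overlap between neighbours.
text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=10)
texts = text_splitter.split_documents(texts)

# persist_directory=None keeps the Chroma index in memory only.
persist_directory = None
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)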
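
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of sliding-window chunking over a list of tokens. It is not how `TokenTextSplitter` is implemented internally: `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for real tokens.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Toy input: ten "tokens", windows of 4 with 1 token shared between neighbours.
words = "a b c d e f g h i j".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

A larger overlap repeats more context across neighbouring chunks, at the cost of storing and embedding more text.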


# Using LLMs

## Retrieving snippets
`RetrievalQA` finds the snippets most relevant to the question by comparing embeddings, then constructs a prompt from them and queries the LLM.


In [None]:
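hf_llm = HuggingFacePipeline(pipeline=llm)
# Retrieve the 4 most similar snippets for each question.
retriever = vectordb.as_retriever(search_kwargs={"k": 4})
# "stuff" puts all retrieved snippets into a single prompt.
qa = RetrievalQA.from_chain_type(llm=hf_llm, chain_type="stuff", retriever=retriever)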
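
As a rough picture of what the retrieval step does, the sketch below ranks snippets by cosine similarity to the question and keeps the top `k`. It is a toy under stated assumptions: bag-of-words counts stand in for the instructor-xl vectors, and `embed`, `cosine` and `retrieve` are hypothetical helpers, not LangChain or Chroma APIs.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lower-cased word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, snippets, k):
    """Return the k snippets most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]


# Sample snippets taken from the rulebook text.
snippets = [
    "The SPI is calculated to two decimal places.",
    "IDDDP is only for the movement of students from one academic unit to another.",
    "The final list will also be conveyed to Associate/ Dean, SA for adjustment in hostel accommodation.",
]
top = retrieve("What is IDDDP", snippets, k=2)
print(top[0])
```

A real embedding model captures meaning rather than exact word overlap, but the ranking logic has the same shape.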

## Default prompt

In [None]:
question_t5_template = """
    context: {context}
    question: {question}
    answer:
    """
QUESTION_T5_PROMPT = PromptTemplate(
    template=question_t5_template, input_variables=["context", "question"]
)
qa.combine_documents_chain.llm_chain.prompt = QUESTION_T5_PROMPT

qa.combine_documents_chain.verbose = True
qa.return_source_documents = True
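
To see what the model receives, the sketch below fills the same template by hand. It assumes the "stuff" chain joins the retrieved snippets with a blank line before substituting them for `{context}`; `fill_prompt` is a hypothetical helper, and plain `str.format` stands in for `PromptTemplate`.

```python
template = """
    context: {context}
    question: {question}
    answer:
    """


def fill_prompt(snippets, question):
    """Join the snippets into one context string and substitute both placeholders."""
    context = "\n\n".join(snippets)  # assumed separator between snippets
    return template.format(context=context, question=question)


prompt = fill_prompt(
    [
        "IDDDP is only for the movement of students from one academic unit to another.",
        "The SPI is calculated to two decimal places.",
    ],
    "What is the IDDDP program?",
)
print(prompt)
```

Because every retrieved snippet lands in one prompt, the prompt length grows with both `k` and the chunk size.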


# Query the LLM

In [None]:
def query_llm(question):
    # Run the RetrievalQA chain; the result holds the answer and its source documents.
    return qa({"query": question})


In [None]:
query_llm("What is the IDDDP program?")

Token indices sequence length is longer than the specified maximum sequence length for this model (2105 > 512). Running this sequence through the model will result in indexing errors

> Entering new StuffDocumentsChain chain...

> Finished chain.

{'query': 'What is the IDDDP program?',
 'result': 'IDDDP is only for the movement of students from one academic unit to another',
 'source_documents': [Document(page_content='e) The final list of selected candidates will be conveyed to the Convener, DUGC of the respective\nparent academic units and the Convener, DUGC / DPGC of the destination academic units. The\nfinal list will also be conveyed to Associate/ Dean, SA for adjustment in hostel accommodation.\nC. Rules & Regulations:\na) IDDDP is only for the movement of students from one academic unit to another.\nb) A DD specialization / M.Tech. program usually requires the completion of 8 to 9 courses of 6\ncredits and a DD/M.Tech. project (DDP/MTP) of 74 - 92 credits.\nc) IDDDP should also be treated as (b). However, considering (a), IDDDP also allows the comple-\ntion of only 4 PG level courses (as specified by the concerned academic unit) and the DDP/MTP\nproject to earn a "Dual Degree in xxx Specialization WITHOUT HONORS".\nd) An
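
The warning at the top of this output (2105 > 512) says the tokenized input was longer than the model's specified maximum sequence length. One possible way to handle that, sketched here under stated assumptions, is to keep retrieved snippets in ranked order only while they fit a token budget. `count_tokens` and `fit_to_budget` are hypothetical helpers; whitespace words stand in for real tokens, so a real setup would count with the model's own tokenizer.

```python
def count_tokens(text):
    """Stand-in token counter: whitespace words (a real tokenizer counts differently)."""
    return len(text.split())


def fit_to_budget(snippets, question, max_tokens):
    """Keep snippets in ranked order until the next one would exceed max_tokens."""
    budget = max_tokens - count_tokens(question)
    kept = []
    for snippet in snippets:
        cost = count_tokens(snippet)
        if cost > budget:
            break  # the next snippet no longer fits
        kept.append(snippet)
        budget -= cost
    return kept


# Toy data: three ranked snippets and a budget of 9 "tokens".
ranked = ["one two three", "four five", "six seven eight nine"]
kept = fit_to_budget(ranked, "what is this", max_tokens=9)
print(kept)
```

Smaller chunks or a lower `k` would shrink the prompt as well; this sketch only illustrates the budget check itself.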

In [None]:
query_llm("what is full form of CPI")

> Entering new StuffDocumentsChain chain...

> Finished chain.

{'query': 'what is full form of CPI',
 'result': 'Cumulative Performance Index',
 'source_documents': [Document(page_content='Seminar etc.) in a semester with credits C1, C2, C3, C4 and C5 and her/his grade points in these courses\nare g1, g2, g3, g4 and g5 respectively, then her/his SPI is equal to:\nC1g1 + C2 g2 + C3 g3 + C4 g4 + C5 g5\nSPI = ----------------------------------------------------------\nC1 + C2 + C3 + C4 + C5\nThe SPI is calculated to two decimal places. The SPI for any semester will take into consideration the FR\ngrades awarded in that semester. For example, if a student has failed in course 4, the SPI will then be\ncomputed as:\nC1g1 + C2 g2 + C3 g3 + C4 * ZERO+C5 g5\nSPI = -------------------------------------------------------------------\nC1 + C2 + C3 + C4 + C5\nThe courses which do not form the minimum requirement of the degrees will not be considered for cal-\nculation of the SPI. Such additional courses undertaken and the grades earned by the student will be\n

In [None]:
query_llm("can i retag minor into core")

> Entering new StuffDocumentsChain chain...

> Finished chain.

{'query': 'can i retag minor into core',
 'result': 'no',
 'source_documents': [Document(page_content='Each programme prescribes the minimum credits and courses that qualify a candidate for the award of\nthe Degree in a particular discipline. The total credits for the B.Tech. Programme, for example, vary be-\ntween 266-282 depending on the discipline, as mentioned earlier. This approximately converts itself\ninto about four theory courses and one or two laboratory courses or other activities like seminar, pro-\nject, etc., every semester.\nThe curriculum is designed to permit B.Tech., B.S. and B.Des. students, who are not identified as aca-\ndemically weak, to optionally take additional courses. The freedom to take about six credits every se-\nmester after the first year, permits a student to satisfy her/his interests / abilities and aspirations.\nIt is expected that all students with reasonably good academic standing, utilize this surplus time for en-\nhancing their academic learning 

In [None]:
query_llm("what's SLP-IDP?")

> Entering new StuffDocumentsChain chain...

> Finished chain.

{'query': "what's SLP-IDP?",
 'result': 'a mandatory requirement of the Dual Degree Programmes',
 'source_documents': [Document(page_content='e) The final list of selected candidates will be conveyed to the Convener, DUGC of the respective\nparent academic units and the Convener, DUGC / DPGC of the destination academic units. The\nfinal list will also be conveyed to Associate/ Dean, SA for adjustment in hostel accommodation.\nC. Rules & Regulations:\na) IDDDP is only for the movement of students from one academic unit to another.\nb) A DD specialization / M.Tech. program usually requires the completion of 8 to 9 courses of 6\ncredits and a DD/M.Tech. project (DDP/MTP) of 74 - 92 credits.\nc) IDDDP should also be treated as (b). However, considering (a), IDDDP also allows the comple-\ntion of only 4 PG level courses (as specified by the concerned academic unit) and the DDP/MTP\nproject to earn a "Dual Degree in xxx Specialization WITHOUT HONORS".\nd) An admitting academic unit can presc