<a href="https://colab.research.google.com/github/GauravS-30/LLM/blob/main/LLM_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Quote the version specifiers so the shell does not treat ">" as output redirection
!pip install accelerate "transformers>=4.32.0" "optimum>=1.12.0"
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/  # Use cu117 if on CUDA 11.7



Looking in indexes: https://pypi.org/simple, https://huggingface.github.io/autogptq-index/whl/cu117/


In [None]:
!pip -q install langchain

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-64g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)






In [None]:
from langchain.llms import HuggingFacePipeline

local_model = HuggingFacePipeline(pipeline = pipe)

Reading text from multiple PDFs and preprocessing it before vectorizing and storing it in ChromaDB

In [None]:
# The package name is "sentence-transformers" (plural); the singular form is not on PyPI
!pip install chromadb sentence-transformers pypdf

In [None]:
# Only the pieces used below; the OpenAI classes and TextLoader are not needed with a local model
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader


In [None]:
loader = DirectoryLoader('./drive/MyDrive/PDF_data/', glob="./*.pdf", loader_cls=PyPDFLoader)

documents = loader.load()

In [None]:
print(len(documents))

146


In [None]:
# Split the documents into chunks of up to 1000 characters, with a 200-character overlap, before creating embeddings.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

In [None]:
print(len(texts))

568
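To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how RecursiveCharacterTextSplitter works internally: that splitter also tries to break on separators such as paragraphs and newlines. The function name chunk_text and the toy input are assumptions made for illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share chunk_overlap characters.

    Illustrative only: a plain sliding window, not the separator-aware
    recursive algorithm used by RecursiveCharacterTextSplitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input (hypothetical): small sizes so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, which is what lets a sentence cut at a chunk boundary still appear whole in at least one chunk.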


Building the embedding model: I use Sentence Transformers to get a lightweight embedding model, create embeddings for the text chunks with it, and store them in ChromaDB.

In [None]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
     86.0/86.0 kB 1.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     1.3/1.3 MB 13.1 MB/s eta 0:00:00
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=c755be1774786b65bbe214f192bdb32e886ba5495050202a639a536b8b28308f
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence-tr

In [None]:
# Downloading the embedding model (a Sentence Transformers model, run on the GPU)
from langchain.embeddings import HuggingFaceEmbeddings

instructor_embeddings = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2",
                                              model_kwargs={"device": "cuda"})

In [None]:
# Creating a DB to store vector embeddings
persist_directory = 'db'

embedding = instructor_embeddings
# Creating embeddings and storing in db.
vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

In [None]:
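# Write the index to persist_directory, then drop the in-memory handle
# so the next cell demonstrates reloading the store from disk.
vectordb.persist()
vectordb = None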

In [None]:
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)

Retriever: The retriever returns the documents in the DB that are most relevant to a given query.

In [None]:
retriever = vectordb.as_retriever()

In [None]:
docs = retriever.get_relevant_documents("Who is eligible custodian?")

In [None]:
len(docs)

4

In [None]:
docs[0]

Document(page_content='and unaffiliated custodians (each, an “Eligible Custodian” ), we have offered to you, and you \nseek to engage us to provide, an informational and complimentary service by which we \nfacilitate the provision of certain data for your accounts and assets held by Eligible \nCustodians which includes account balances, transaction history and other transaction-level \ndata ( “Customer Data” ) by a third party service provider engaged by us ( “Aggregation \nVendor” ). We are willing to perform such services for you and to provide you with', metadata={'page': 33, 'source': 'drive/MyDrive/PDF_data/dsa_la.pdf'})
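The retrieval step can be pictured as a nearest-neighbour search over embedding vectors. The sketch below is a toy illustration under stated assumptions: it uses cosine similarity and hand-written 2-D vectors, whereas the actual vector store works on the model's high-dimensional embeddings and may use a different distance metric and an approximate index. The names cosine_similarity and top_k are hypothetical helpers, and k=4 is chosen to match the four documents returned above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings" (hypothetical values, for illustration only)
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
ranked = top_k([1.0, 0.0], toy_docs, k=4)
print(ranked)
```

The vector pointing in the opposite direction to the query is the one left out of the top four.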

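Make QA chain: Finally, build a retrieval QA chain with LangChain. For each query, the chain retrieves the most relevant chunks from the database and passes them to the model as context. Note that RetrievalQA on its own does not keep conversation memory, so each query is answered independently; continuing a conversation across multiple queries would need a conversational chain with memory.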

In [None]:
from langchain import PromptTemplate
template = """
<s>[INST] <<SYS>>
Act as a customer service expert for Chase Bank and answer question about bank terms and conditions.
<</SYS>>

{context}

{question} [/INST]
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm=local_model,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)
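With chain_type="stuff", the retrieved chunks are concatenated and placed into the {context} slot of the prompt, followed by the user's question. The sketch below imitates that step with plain string formatting so the final prompt can be inspected. It is an approximation, not LangChain's implementation: build_stuff_prompt is a hypothetical helper, and the blank-line separator between chunks is an assumption.

```python
# Same Llama-2 chat template as in the notebook cell above
TEMPLATE = """
<s>[INST] <<SYS>>
Act as a customer service expert for Chase Bank and answer question about bank terms and conditions.
<</SYS>>

{context}

{question} [/INST]
"""


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join retrieved chunks into one context string and fill the template.

    Hypothetical helper for illustration; the separator is an assumption.
    """
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunks standing in for retrieved page_content strings
example_prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "Who is an eligible custodian?",
)
print(example_prompt)
```

Because every retrieved chunk is pasted into a single prompt, the chunk size and the number of retrieved documents together determine how much of the model's context window is used.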

In [None]:
result = qa_chain(
    "what are general principles while reviewing Online Privacy Policy"
)

In [None]:
print(result["result"].strip())

As a customer service expert for Chase Bank, I can provide you with general principles to consider when reviewing an online privacy policy. Here are some key points to keep in mind:
1. Transparency: The policy should clearly state what personal information the bank collects, how it collects it, and why it needs that information.
2. Purpose limitation: The policy should explain the purposes for which the bank collects, uses, and shares personal information, and how those purposes relate to the customer's relationship with the bank.
3. Data minimization: The policy should explain the steps the bank takes to minimize the amount of personal information collected and used, and how it ensures that the information is relevant and necessary for the identified purpose.
4. Data protection measures: The policy should describe the technical and organizational measures the bank has implemented to protect personal information from unauthorized access, use, or disclosure.
5. Data sharing: The policy 
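Since the chain was created with return_source_documents=True, the result also carries the retrieved chunks under the "source_documents" key, which makes it possible to check where an answer came from. The following is a minimal sketch: format_sources is a hypothetical helper, and the stand-in objects only mimic the metadata fields ('source' and 'page') seen in the retriever output earlier.

```python
from types import SimpleNamespace


def format_sources(source_documents):
    """Return one 'file (page N)' line per distinct source/page pair."""
    lines = []
    seen = set()
    for doc in source_documents:
        key = (doc.metadata.get("source"), doc.metadata.get("page"))
        if key in seen:
            continue  # several chunks can come from the same page
        seen.add(key)
        lines.append(f"{key[0]} (page {key[1]})")
    return lines


# Stand-ins for LangChain Document objects (hypothetical file names)
fake_docs = [
    SimpleNamespace(page_content="...", metadata={"source": "example_a.pdf", "page": 3}),
    SimpleNamespace(page_content="...", metadata={"source": "example_a.pdf", "page": 3}),
    SimpleNamespace(page_content="...", metadata={"source": "example_b.pdf", "page": 0}),
]
source_lines = format_sources(fake_docs)
print("\n".join(source_lines))
```

In the notebook, the same helper could be applied as format_sources(result["source_documents"]).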

In [None]:
result = qa_chain(
    'Can you elaborate on Data protection measures'
)

In [None]:
print(result["result"].strip())

Of course! As a customer service expert for Chase Bank, I can provide information on the data protection measures we have in place to protect our customers' sensitive information.
At Chase, we take data privacy and security very seriously. We understand that our customers trust us with their personal and financial information, and it is our responsibility to protect this information from unauthorized access or misuse. To achieve this, we have implemented various data protection measures across our digital platforms and services. Here are some of the measures we have in place:
1. Encryption: We use encryption to protect sensitive information, such as account numbers, passwords, and personal identification numbers (PINs), when it is transmitted over the internet or stored in our systems. This ensures that even if an unauthorized party gains access to the information, they will not be able to read or use it without the proper decryption key.
2. Firewalls and intrusion detection: Our syste