<a href="https://colab.research.google.com/github/JamBelg/RAG/blob/main/Rag_pdf_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Libraries

In [None]:
!pip install langchain_openai langchain_core langchain_community langchain_pinecone langchain pypdf2 pypdf

Collecting langchain_openai
  Downloading langchain_openai-0.1.14-py3-none-any.whl (45 kB)
Collecting langchain_core
  Downloading langchain_core-0.2.11-py3-none-any.whl (337 kB)
Collecting langchain_community
  Downloading langchain_community-0.2.6-py3-none-any.whl (2.2 MB)
Collecting langchain_pinecone
  Downloading langchain_pinecone-0.1.1-py3-none-any.whl (8.4 kB)
Collecting langchain
  Downloading langchain-0.2.6-py3-none-any.whl (975 kB)
Collecting pypdf2
  Downloading pypdf2-3.0

# Read PDF

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

pdf_link = "https://arxiv.org/pdf/2402.16893"
loader = PyPDFLoader(file_path=pdf_link,
                     extract_images=False)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=50)
pages = loader.load_and_split(text_splitter)
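
As a rough illustration of what `chunk_size` and `chunk_overlap` control, the sketch below cuts a string into fixed-size windows that share a few characters with their neighbor. It is a simplified stand-in, not `RecursiveCharacterTextSplitter` itself, which also tries to break on separators such as paragraph and line breaks. `split_with_overlap` is a hypothetical helper introduced only for this example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; the real splitter also respects separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With the notebook's settings (4000 characters per chunk, 50 shared), the overlap is small relative to the chunk, so it mainly guards against a sentence being cut exactly at a boundary.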

In [None]:
pages[0]

Document(metadata={'source': 'https://arxiv.org/pdf/2402.16893', 'page': 0}, page_content='The Good and The Bad: Exploring Privacy Issues\nin Retrieval-Augmented Generation (RAG)\nShenglai Zeng1*†, Jiankun Zhang∗3,4,5, Pengfei He1, Yue Xing1, Yiding Liu2, Han Xu1\nJie Ren1, Shuaiqiang Wang2, Dawei Yin2, Yi Chang3,4,5, Jiliang Tang1\n1Michigan State University2Baidu, Inc.\n3School of Artificial Intelligence, Jilin University\n4International Center of Future Science, Jilin University\n5Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China\nAbstract\nRetrieval-augmented generation (RAG) is a\npowerful technique to facilitate language model\nwith proprietary and private data, where data\nprivacy is a pivotal concern. Whereas extensive\nresearch has demonstrated the privacy risks of\nlarge language models (LLMs), the RAG tech-\nnique could potentially reshape the inherent\nbehaviors of LLM generation, posing new pri-\nvacy issues that are currently under-exp

In [None]:
pages[0].page_content

'The Good and The Bad: Exploring Privacy Issues\nin Retrieval-Augmented Generation (RAG)\nShenglai Zeng1*†, Jiankun Zhang∗3,4,5, Pengfei He1, Yue Xing1, Yiding Liu2, Han Xu1\nJie Ren1, Shuaiqiang Wang2, Dawei Yin2, Yi Chang3,4,5, Jiliang Tang1\n1Michigan State University2Baidu, Inc.\n3School of Artificial Intelligence, Jilin University\n4International Center of Future Science, Jilin University\n5Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China\nAbstract\nRetrieval-augmented generation (RAG) is a\npowerful technique to facilitate language model\nwith proprietary and private data, where data\nprivacy is a pivotal concern. Whereas extensive\nresearch has demonstrated the privacy risks of\nlarge language models (LLMs), the RAG tech-\nnique could potentially reshape the inherent\nbehaviors of LLM generation, posing new pri-\nvacy issues that are currently under-explored.\nIn this work, we conduct extensive empiri-\ncal studies with novel attack methods,

In [None]:
len(pages)

24

# RAG model (GPT-3.5)

In [None]:
# OpenAI chat model (GPT-3.5)
from langchain_openai.chat_models import ChatOpenAI
from google.colab import userdata
OPENAI_API_KEY = userdata.get("open_ai")
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY,
                   model="gpt-3.5-turbo",
                   temperature=0.2)

In [None]:
# OpenAI embeddings
from langchain_openai.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [None]:
# Prompt template (context + question)
from langchain.prompts import ChatPromptTemplate
template = """
Answer the question based on the content below. If you can't answer the question, reply 'I don't know'.
Context: {context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
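
To see what the model eventually receives, the sketch below fills the same template with plain `str.format`. This only approximates what `ChatPromptTemplate` does (it additionally wraps the text in a chat message); the context and question values are stand-ins chosen for the example.

```python
template = """
Answer the question based on the content below. If you can't answer the question, reply 'I don't know'.
Context: {context}
Question: {question}
"""

# Stand-in values: in the notebook, the context comes from the retriever
# and the question from the user.
filled = template.format(
    context="Retrieval-augmented generation (RAG) is a powerful technique "
            "to facilitate language model with proprietary and private data",
    question="What is the definition of RAG?",
)
print(filled)
```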

In [None]:
# Parser
from langchain_core.output_parsers import StrOutputParser
parser = StrOutputParser()

In [None]:
# Pinecone vector store: embeds each chunk and stores it in the existing
# "pdf-files" index, so chunks similar to a question can be retrieved later
from langchain_pinecone import PineconeVectorStore
import os
index_name = "pdf-files"
os.environ["PINECONE_API_KEY"] = userdata.get("pinecone_api_key")

pinecone = PineconeVectorStore.from_documents(
    pages, embedding=embeddings, index_name=index_name
)
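
The retrieval step can be pictured as ranking chunk embeddings by their similarity to the question embedding. The sketch below does this with cosine similarity on toy three-dimensional vectors. It assumes the Pinecone index uses the cosine metric, which depends on how the index was created; the vectors, chunk names, and the `top_k` helper are all made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy embeddings (real OpenAI embeddings have far more dimensions).
toy_docs = {
    "chunk_privacy": [0.9, 0.1, 0.0],
    "chunk_references": [0.0, 0.2, 0.9],
    "chunk_attacks": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]
nearest = top_k(query, toy_docs, k=2)
print(nearest)  # ['chunk_privacy', 'chunk_attacks']
```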

In [None]:
from langchain_core.runnables import RunnablePassthrough

# The retriever fills "context"; the question itself is passed through unchanged.
chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
chain.invoke("What is the definition of RAG?")

'Retrieval-augmented generation (RAG) is an advanced natural language processing technique that enhances text generation by integrating information retrieved from a large corpus of documents.'
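
The data flow through the `|` pipeline can be mimicked with ordinary functions. The sketch below is not LangChain code: every `fake_*` function is a hypothetical stand-in for the corresponding component, and it only shows how the same question feeds both the retriever and the passthrough before the prompt, model, and parser run in sequence.

```python
def fake_retriever(question: str) -> list[str]:
    # Stand-in for pinecone.as_retriever(): always returns one fixed chunk.
    return ["RAG combines retrieval with generation."]


def passthrough(question: str) -> str:
    # Stand-in for RunnablePassthrough(): returns its input unchanged.
    return question


def fake_prompt(inputs: dict) -> str:
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}"


def fake_model(prompt_text: str) -> dict:
    # Stand-in for the chat model: wraps a reply in a message-like dict.
    return {"content": f"(answer based on {len(prompt_text)} prompt characters)"}


def fake_parser(message: dict) -> str:
    # Stand-in for StrOutputParser(): extracts the text of the reply.
    return message["content"]


def run_chain(question: str) -> str:
    # The dict step sends the same question to both branches.
    inputs = {"context": fake_retriever(question), "question": passthrough(question)}
    return fake_parser(fake_model(fake_prompt(inputs)))


answer = run_chain("What is the definition of RAG?")
print(answer)
```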

In [None]:
# Reuse the chain defined above; only the question changes.
chain.invoke("What are the different risks of using RAG?")

'The different risks of using RAG include the potential extraction of private data from the external retrieval database and the possibility of the LLMs outputting private information from their training/fine-tuning dataset.'

In [None]:
# Reuse the chain defined above; only the question changes.
chain.invoke("What is the best strategy to deal with private data?")

'Based on the content provided, the best strategy to deal with private data involves conducting targeted attacks to extract specific private information while also evaluating the memorization effects through prefix attacks. Additionally, exploring various defense techniques to mitigate privacy risks associated with retrieval-augmented generation (RAG) techniques is crucial.'

In [None]:
# Reuse the chain defined above; only the question changes.
chain.invoke("Who are the authors of this paper?")

"I don't know."

In [None]:
# Reuse the chain defined above; only the question changes.
chain.invoke("What are the references of this paper?")

'The references of this paper include:\n1. Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee. 2023. Do language models plagiarize? In Proceedings of the ACM Web Conference 2023, pages 3637–3647.\n2. Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499.\n3. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.\n4. Liu. 2023. Twitter post. https://twitter.com/kliu128/status/1623472922374574080.\n5. Jerry Liu. 2022. Llamaindex. 11 2022. https://github.com/jerryjliu/llama_index.'

# Summarize

In [None]:
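from langchain.chains.summarize import load_summarize_chain

# "stuff" places every chunk into a single prompt, so the whole document
# must fit in the model's context window.
summarize_chain = load_summarize_chain(model, chain_type="stuff")
result = summarize_chain.invoke({"input_documents": pages})

print(result["output_text"])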

The study explores privacy issues in Retrieval-Augmented Generation (RAG) systems, focusing on the vulnerability of leaking private retrieval database information and the impact of retrieval data on the memorization behavior of large language models (LLMs). The research reveals that RAG systems are susceptible to privacy risks but can also mitigate the leakage of training data from LLMs. Various attack methods and defense strategies are examined, showing that integrating retrieval data can reduce the risk of leaking training data. The study provides insights for privacy protection in RAG systems.
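
Conceptually, the "stuff" strategy joins all chunks into one prompt and asks the model for a summary in a single call. The sketch below shows that idea in plain Python. The prompt wording and the `build_stuff_prompt` helper are assumptions made for illustration, not the exact template LangChain uses.

```python
def build_stuff_prompt(chunks: list[str]) -> str:
    """Join all chunks into one summarization prompt (illustrative wording)."""
    joined = "\n\n".join(chunks)
    return f"Write a concise summary of the following:\n\n{joined}\n\nCONCISE SUMMARY:"


# Stand-in chunks; in the notebook these would be the page_content of each Document.
toy_chunks = [
    "RAG systems can leak data from the retrieval database.",
    "Retrieval data can reduce leakage of LLM training data.",
]
stuff_prompt = build_stuff_prompt(toy_chunks)
print(stuff_prompt)

# The prompt grows with the document: 24 chunks of up to 4000 characters
# give an upper bound of 96000 characters of chunk text.
max_chars = 24 * 4000
print(max_chars)  # 96000
```

Because the prompt grows with the document, this approach stops working once the combined chunks exceed the model's context window.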
