# Chat with your PDFs locally

### Imports

In [22]:
import os

from langchain.prompts import ChatPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

## Load the Documents

Load every PDF in the sample directory and collect the pages in a single list.

In [15]:
%%capture
# Suppress the output of this cell

# Get a list of all .pdf files in the directory
pdf_files = [f for f in os.listdir("../../sample_files") if f.endswith('.pdf')]

pages = []
for pdf_file in pdf_files:
    loader = PyPDFLoader(f"../../sample_files/{pdf_file}")
    pages.extend(loader.load_and_split())
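The file-discovery step can also be written with `pathlib`. The following is a minimal sketch, assuming the same flat directory layout; `list_pdf_files` is a hypothetical helper, not part of LangChain. It sorts the paths so the page order is reproducible and matches the extension case-insensitively.

```python
import tempfile
from pathlib import Path


def list_pdf_files(directory):
    """Return the PDF files directly inside `directory`, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Demo on a throwaway directory so the sketch runs without the sample files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_names = [path.name for path in list_pdf_files(tmp)]

print(demo_names)
```

Each returned path can then be passed to the loader as `PyPDFLoader(str(path))`.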

In [16]:
chunk1 = pages[2].page_content
print(chunk1[0:100])

Il dossier è stato curato dall’ U
FFICIO RAPPORTI CON L ’UNIONE EUROPEA  
( 066760.2145 -  cdrue@c


In [17]:
len(pages)

47

In [26]:
pages[26].page_content[0:100]

'6\nAdoptedOnce it is concluded that a controller or processor is established in the EU, an in concret'

### Vector Store

Split the documents into chunks

In [19]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(pages)
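To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of a fixed-size sliding window. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to cut on separators such as paragraph and line breaks, so its chunks are usually a little shorter than `chunk_size`. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into windows of `chunk_size` characters sharing `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(chunk) for chunk in chunks])
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.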

Create the embeddings. We will use Ollama embeddings with the `nomic-embed-text` model.

In [20]:
embeddings = OllamaEmbeddings(model="nomic-embed-text")

In [23]:
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=embeddings)

retriever = vectorstore.as_retriever()
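Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with cosine similarity on toy vectors. It is an assumption-laden illustration: the metric and index that Chroma actually uses may differ, and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=4):
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in indexed_chunks
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real embedding vectors are far longer.
toy_index = [
    ("chunk about GDPR scope", [0.9, 0.1, 0.0]),
    ("chunk about the AI Act", [0.1, 0.9, 0.0]),
    ("chunk about cookies", [0.5, 0.5, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print([text for text, _ in results])
```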

## Retrieval and Generation

### Examples of Prompts from LangChain

In [33]:
# RAG prompt
from langchain import hub
prompt_rag_lc = hub.pull("rlm/rag-prompt")
prompt_rag_lc.pretty_print()


You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:


In [38]:
from langchain import hub
legal_prompt = hub.pull("usctrojan/in-house-legal")
legal_prompt.pretty_print()


### **Prompt**
You are requested to embody the role of an in-house legal advisor. Within your capacity, you will have access to a repository of contracts for review, as well as queries from users seeking guidance. Your responses should be strictly confined to the content derived from these contracts, devoid of any additional explanations. In instances where the solution appears beyond immediate reach, refrain from advising the user to undertake actions outside their capability. Instead, exhaust all possible alternatives at your disposal first. When confronted with challenges, it's pivotal to approach them methodically, step by step, ensuring a composed and thorough analysis.
### **Guidelines**
1. **Contractual Fidelity:** Your advice should be exclusively rooted in the contract documents available to you. Avoid speculative or generic legal advice.
2. **Problem-Solving Approach:** Adopt a systematic approach to problem-solving. If an issue seems intractable, consider all possible inter

In [40]:
# Prompt
template_rag = """
System Message: You are a privacy consultant for a marketing company that operates internationally. Provide your legal advice based on the context.

Context: {context}

Question: {question}
"""

prompt_rag = ChatPromptTemplate.from_template(template_rag)

In [41]:
prompt_rag.pretty_print()



System Message: You are a privacy consultant for a marketing company that operates internationally. Provide your legal advice based on the context.

Context: {context}

Question: {question}
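The placeholders in curly braces are filled in when the chain runs. As a rough illustration, plain `str.format` performs the same substitution on a shortened, hypothetical version of the template. Note that `ChatPromptTemplate.from_template` wraps the whole string in a single human message, so the "System Message:" line is ordinary prompt text rather than a separate system role.

```python
# Shortened stand-in for the notebook's template, for illustration only.
template = "Context: {context}\n\nQuestion: {question}\n"

filled = template.format(
    context="(retrieved chunks go here)",
    question="What is the one-stop shop mechanism?",
)
print(filled)
```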


## Create the RAG Chain

### Model Selections

In [42]:
from langchain_community.llms import Ollama

llm = Ollama(model="phi3")

### Post-Processing

In [43]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
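A quick way to see what `format_docs` produces, without loading any PDF, is to call it on stand-in objects. `FakeDocument` is a hypothetical class that exposes only the `page_content` attribute the function reads.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing only the attribute format_docs reads."""
    page_content: str


def format_docs(docs):
    # Same logic as the notebook cell: join chunk texts with a blank line.
    return "\n\n".join(doc.page_content for doc in docs)


context = format_docs([FakeDocument("First chunk."), FakeDocument("Second chunk.")])
print(context)
```

The blank line between chunks gives the model a visible boundary between passages that may come from different pages or files.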

### Chain

In [44]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_rag
    | llm
    | StrOutputParser()
)
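The data flow of the chain can be traced with plain functions. This is only a sketch of the order of operations, assuming stand-ins for every component: `fake_retriever`, `build_prompt`, `fake_llm`, and `run_chain` are hypothetical and do not call LangChain or Ollama.

```python
def fake_retriever(question):
    """Stand-in for the retriever: always returns the same two chunks."""
    return ["chunk one", "chunk two"]


def join_chunks(chunks):
    """Stand-in for format_docs, working on plain strings."""
    return "\n\n".join(chunks)


def build_prompt(inputs):
    """Stand-in for the prompt template."""
    return f"Context: {inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    """Stand-in for the model: echoes the prompt it received."""
    return "ECHO: " + prompt_text


def run_chain(question):
    # Step 1: the question feeds both branches of the input dict.
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,  # passed through unchanged
    }
    # Steps 2 and 3: fill the prompt, then call the model.
    return fake_llm(build_prompt(inputs))


answer = run_chain("What is X?")
print(answer)
```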

### Retrieval

In [ ]:
user_question = "In the context of the material scope of application of GDPR, what is the edpb position in relation to the one-stop shop mechanism?"

In [62]:
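# Off-topic question in Italian, roughly: "Tonight I want to get drunk, what do you recommend I drink?"
user_question_2 = "Stasera mni voglio ubriacare, cosa mi consigli cosa bere?"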

Let's retrieve the 4 most relevant documents.

In [63]:
# The number of results is set through search_kwargs
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

docs = retriever.invoke(user_question_2)

In [68]:
len(docs)

4

In [77]:
len(docs[0].page_content)

973

In [78]:
docs[0].page_content

"• 24 mesi  dopo: tutte le regole della legge sull'IA diventano applicabili, \ncompresi gli obblighi per i sistemi ad alto rischio definiti nell'allegato III (elenco dei casi d'uso ad alto rischio);  \n• 36 mesi  dopo: si applicano gli obblighi per i sistemi ad alto rischio definiti \nnell'allegato II (elenco della normativa di armonizzazione dell'Unione).  \n \n \nLa posizione negoziale del Governo italiano \nNel corso del complesso negoziato in seno al Consiglio, il Governo italiano si \nè sempre dichiarato a favore dell’introduzione di un quadro comune di regole \nsull’intelligenza artificiale, sottolineando l’importanza che il nuovo regolamento \ntutelasse i  diritti fondamentali , imponesse obblighi e sanzioni commisurati al \nrischio  e allo stesso tempo permettesse di mantenere il passo tecnologico e lo \nslancio verso l'innovazione di altri competitor globali, come Stati Uniti e Cina.  \nLa posizione negoziale italiana , inoltre, si  è basata su una visione “umano-"

In [65]:
docs[1].page_content[0:400]

"4 \n • i sistemi di riconoscimento delle emozioni utiliz zati sul luogo di lavoro \ne negli istituti scolastici , eccetto per motivi medici o di sicurezza (ad \nesempio il monitoraggio dei livelli di stanchezza di un pilota);  \n• l’estrazione non mirata (scraping) di immagini facciali da internet o \ntelecamere a circuito chiuso per la creazione o l'espansione di banche dati;  \n• i sistemi che consent"

In [79]:
docs

[Document(page_content="• 24 mesi  dopo: tutte le regole della legge sull'IA diventano applicabili, \ncompresi gli obblighi per i sistemi ad alto rischio definiti nell'allegato III (elenco dei casi d'uso ad alto rischio);  \n• 36 mesi  dopo: si applicano gli obblighi per i sistemi ad alto rischio definiti \nnell'allegato II (elenco della normativa di armonizzazione dell'Unione).  \n \n \nLa posizione negoziale del Governo italiano \nNel corso del complesso negoziato in seno al Consiglio, il Governo italiano si \nè sempre dichiarato a favore dell’introduzione di un quadro comune di regole \nsull’intelligenza artificiale, sottolineando l’importanza che il nuovo regolamento \ntutelasse i  diritti fondamentali , imponesse obblighi e sanzioni commisurati al \nrischio  e allo stesso tempo permettesse di mantenere il passo tecnologico e lo \nslancio verso l'innovazione di altri competitor globali, come Stati Uniti e Cina.  \nLa posizione negoziale italiana , inoltre, si  è basata su una visio
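The retriever returns 4 chunks even for a question unrelated to the PDFs, because a top-k search always returns its k nearest neighbours. One possible mitigation is to drop results below a minimum similarity score. The sketch below shows the idea on hypothetical `(text, score)` pairs; the scores are invented for illustration and were not measured on this notebook's data. LangChain's `as_retriever` also documents a `similarity_score_threshold` search type for this purpose.

```python
def filter_by_score(scored_chunks, min_score):
    """Keep only (text, score) pairs whose score reaches min_score."""
    return [(text, score) for text, score in scored_chunks if score >= min_score]


# Hypothetical scores, higher meaning more similar.
on_topic = [("GDPR chunk", 0.82), ("EDPB chunk", 0.74)]
off_topic = [("AI Act chunk", 0.31), ("emotion recognition chunk", 0.28)]

kept_on_topic = filter_by_score(on_topic, min_score=0.5)
kept_off_topic = filter_by_score(off_topic, min_score=0.5)
print(len(kept_on_topic), len(kept_off_topic))
```

A suitable threshold depends on the embedding model and the distance metric, so it has to be tuned on real queries.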

## Run the RAG Chain

### Prompt With Context

**Examples of user prompts**
- 1 (bad): what is the edpb position in relation to the one-stop shop mechanism?
- 2 (good): What is the EDPB position in relation to the one-stop shop mechanism provided by article 52 of the GDPR in the context of the geographical scope of application of the GDPR (art. 3.2), so-called targeting?
     - Note: the prompt cites article 52, but article 56 is the correct one.

In [52]:
user_question = "In the context of the material scope of application of GDPR, what is the edpb position in relation to the one-stop shop mechanism?"

In [54]:
output_rag = rag_chain.invoke(user_question)

In [49]:
print("--- RAG ---\n")
print(output_rag)

--- RAG ---

As a privacy consultant advising your marketing company on international operations subject to GDPR, it's crucial to understand the EDPB’s stance regarding the "one-stop shop" (OSS) mechanism. The OSS simplifies compliance for EU data controllers and processors by allowing them to address their cross-border processing concerns with a single supervisory authority within the EU, usually the local one where they have an establishment.


The EDPB's position on the OSS is clear: it supports this mechanism as a means of ensuring consistent application of GDPR across member states and facilitating cooperation among supervisory authorities. The aim is to harmonize enforcement actions and ensure that data protection laws are applied consistently, avoiding fragmentation and duplication in the regulatory environment.


However, while operating under the OSS framework, your company must still comply with all GDPR obligations regardless of which member state's supervisory authority you

### Prompt Without the Context

In [80]:
llm = Ollama(model="phi3")

output_no_rag = llm.invoke(user_question)
print(output_no_rag)

The European Data Protection Board (EDPB) plays a pivotal role within the framework established by the General Data Protection Regulation (GDPR) for its application across all member states. Specifically concerning the "one-stop shop" (OSS) mechanism, which was introduced to streamline regulatory processes and reduce administrative burdens on businesses operating in multiple EU countries, the EDPB's position is supportive and facilitative.


The OSS allows lead supervisory authorities (LSA) of each member state where a company has its main establishment or single establishment processing activities to take primary responsibility for cross-border data protection issues involving that entity. This significantly simplifies the oversight process by having one LSA coordinate with other concerned DPAs and national regulators, fostering cooperation among them.


In relation to this mechanism, the EDPB's role is advisory and supervisory:

- The EDPB assists in ensuring consistent application o