# GDPR Compliance Assistant - RAG Agent Implementation

This notebook implements the QA agent for the GDPR Compliance Assistant using your existing Pinecone vector database.



## Setup and Imports

First, let's install required packages and import dependencies.

In [1]:
# First, make sure you have the latest LangChain
# pip install langchain-core langchain-openai

# Cell 1: Setup and Imports
import os
import sys
from dotenv import load_dotenv

# Add project root to Python path
sys.path.append(os.path.abspath('..'))

# LangChain components
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # OpenAI embeddings and chat model
from langchain_pinecone import PineconeVectorStore  # Pinecone integration
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

from pinecone import Pinecone, ServerlessSpec

import time


# Load environment variables
load_dotenv()

print("✅ All packages imported successfully!")

✅ All packages imported successfully!



For example, replace imports like `from langchain_core.pydantic_v1 import BaseModel` with `from pydantic import BaseModel`. If you are working in a code base that has not been fully upgraded to pydantic 2 yet, use the v1 compatibility namespace instead: `from pydantic.v1 import BaseModel`.

The vector store classes can also be imported from the submodule: `from langchain_pinecone.vectorstores import Pinecone, PineconeVectorStore`



## Configuration / Environment Setup

Set up your API keys and configuration. Replace with your actual values.

In [2]:
# Configure your API keys
import getpass  # needed for the interactive prompts below

def setup_environment():
    # Check if API keys are already in environment
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
    
    # If not set, prompt user
    if not OPENAI_API_KEY:
        OPENAI_API_KEY = getpass.getpass("Enter your OpenAI API key: ")
        os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
    
    if not PINECONE_API_KEY:
        PINECONE_API_KEY = getpass.getpass("Enter your Pinecone API key: ")
        os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY
    
    # Your Pinecone index name (replace with your actual index name)
    index_name = "gdpr-compliance-openai"  # Change this to your index name
    
    return index_name, OPENAI_API_KEY, PINECONE_API_KEY

index_name, OPENAI_API_KEY, PINECONE_API_KEY = setup_environment()
print("🔑 API keys configured")
print(f"📁 Using Pinecone index: {index_name}")

🔑 API keys configured
📁 Using Pinecone index: gdpr-compliance-openai
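
The `getpass` prompts above block when the notebook runs non-interactively. As a minimal sketch, assuming a hypothetical helper named `missing_env_vars` (not part of this notebook), the required keys could be checked up front so a scheduled run fails with a clear message instead of waiting for input:

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names from `names` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping instead of the real environment
example_env = {"OPENAI_API_KEY": "sk-example", "PINECONE_API_KEY": ""}
missing = missing_env_vars(["OPENAI_API_KEY", "PINECONE_API_KEY"], example_env)
print(missing)
```

Called without the second argument, the helper reads `os.environ`, so it can run right after `load_dotenv()`.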


In [3]:
# ---------------------------
# Pinecone Initialization (Current 2025 syntax)
# ---------------------------
def init_pinecone(api_key: str, index_name: str = "gdpr-compliance-openai"):
    """
    Connect to Pinecone and return the client and a handle to an existing index.

    Raises:
        ValueError: if the API key is missing or the index does not exist.
    """
    if not api_key:
        raise ValueError("PINECONE_API_KEY is missing!")

    # Initialize Pinecone (current API)
    print("🔌 Initializing Pinecone...")
    pc = Pinecone(api_key=api_key)
    print("✅ Pinecone initialized successfully")

    # Fail early with a clear message if the index is missing
    if index_name not in pc.list_indexes().names():
        raise ValueError(f"Index '{index_name}' not found.")
    print(f"✅ Index '{index_name}' exists")

    # Wait for the index to be ready
    while not pc.describe_index(index_name).status.ready:
        print("⏳ Waiting for index to be ready...")
        time.sleep(1)

    # Get the index object
    index = pc.Index(index_name)
    return pc, index

In [4]:
pc, index = init_pinecone(
    api_key=PINECONE_API_KEY,
    index_name=index_name,
)
print("✅ Pinecone setup completed!")


🔌 Initializing Pinecone...
✅ Pinecone initialized successfully
✅ Index 'gdpr-compliance-openai' exists
✅ Pinecone setup completed!
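
The readiness loop in `init_pinecone` polls without an upper bound, so it would spin forever if the index never became ready. A minimal sketch of a bounded polling helper follows; `wait_until` and its parameters are hypothetical names introduced here for illustration, and the readiness check is a stand-in rather than a real Pinecone call:

```python
import time

def wait_until(is_ready, timeout_s=60.0, interval_s=1.0):
    """Poll `is_ready()` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Stand-in readiness check that succeeds on the third call
calls = {"n": 0}

def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3

ok = wait_until(fake_ready, timeout_s=5.0, interval_s=0.01)
print(ok, calls["n"])
```

In this notebook the predicate could be `lambda: pc.describe_index(index_name).status.ready`, which is the same check the loop above already performs.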


## Initialize embeddings

In [5]:
# Initialize embeddings with the current syntax (no deprecation warning)
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=OPENAI_API_KEY
)
print("✅ Embeddings initialized successfully")

✅ Embeddings initialized successfully


## Initialize Vector Store Connection

In [6]:
index_name

'gdpr-compliance-openai'

In [7]:
vector_store = PineconeVectorStore(
    index=index,  # Use the index object from our initialization
    embedding=embeddings,
    text_key="text"  # This should match your upload metadata field name
)

print("✅ LangChain successfully connected to Pinecone index!")

✅ LangChain successfully connected to Pinecone index!
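
The `text_key` argument matters because the chunk text is stored alongside the other metadata fields of each vector. The sketch below is a simplified, hypothetical illustration of that separation using plain dictionaries; `split_match_metadata` is not a library function, and the record is made up to resemble this notebook's data:

```python
def split_match_metadata(metadata, text_key="text"):
    """Separate the chunk text from the remaining metadata fields.

    Returns (text, other_metadata); text is None if `text_key` is absent.
    """
    remaining = dict(metadata)  # copy so the caller's dict is not mutated
    text = remaining.pop(text_key, None)
    return text, remaining

# Hypothetical record shaped like the metadata stored with one vector
record = {"text": "Gesetzliche Löschfristen", "page": 34.0, "language": "german"}
text, other = split_match_metadata(record)
print(text)
print(other)
```

If the upload step had used a different field name, for example `content`, the lookup would return `None` here, which is why the key must match the name used at upload time.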


## Test the connection with current syntax


In [None]:
# # Test the connection with current syntax
# test_query = "Telephone number from a client"
# test_results = vector_store.similarity_search(test_query, k=2)  # pass the variable, not the string "test_query"
# print(f"📚 Test retrieval found {len(test_results)} documents")

# # Show metadata structure (useful for debugging)
# if test_results:
#     print(f"📋 Available metadata fields: {list(test_results[0].metadata.keys())}")
#     print(f"📄 Sample content: {test_results[0].page_content[:150]}...")

# # Alternative: Check what's in the vector store
# print(f"\n🔍 Vector store type: {type(vector_store)}")

In [26]:
# Test the connection with current syntax
test_results = vector_store.similarity_search("Datenschutz", k=2)
print(f"📚 Test retrieval found {len(test_results)} documents")

# Show metadata structure (useful for debugging)
if test_results:
    print(f"📋 Available metadata fields: {list(test_results[0].metadata.keys())}")
    print(f"📄 Sample content: {test_results[0].page_content[:150]}...")

# Alternative: Check what's in the vector store
print(f"\n🔍 Vector store type: {type(vector_store)}")

📚 Test retrieval found 2 documents
📋 Available metadata fields: ['author', 'chunk_id', 'chunk_size', 'content_category', 'content_length', 'creationdate', 'document_name', 'document_type', 'language', 'moddate', 'page', 'page_label', 'page_number', 'section_type', 'source', 'total_chunks', 'total_pages']
📄 Sample content: Leitfaden 
Datenschutzrecht 
Was Betriebe zu beachten haben 
 
 
Stand: November 2020 
 
Abteilung Organisation und Recht...

🔍 Vector store type: <class 'langchain_pinecone.vectorstores.PineconeVectorStore'>
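
The metadata fields listed above are enough to build a readable citation for each retrieved chunk. A minimal sketch follows, assuming a hypothetical helper `format_source`; the field names come from the output above, and the sample values mirror the document retrieved in this notebook:

```python
def format_source(metadata):
    """Build a short citation string from a chunk's metadata dict."""
    name = metadata.get("document_name", "unknown document")
    label = metadata.get("page_label")
    if label is None and "page_number" in metadata:
        # Numeric fields may come back as floats (e.g. 35.0), so convert to int
        label = str(int(metadata["page_number"]))
    return f"{name}, p. {label}" if label else name

sample = {
    "document_name": "ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf",
    "page": 34.0,
    "page_label": "35",
    "page_number": 35.0,
}
print(format_source(sample))
```

The helper prefers `page_label` over `page` because `page` is zero-based in this data (34.0 for the page labelled 35).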


## Verify Data and Create Retriever

In [19]:
retriever = vector_store.as_retriever()
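
Conceptually, a retriever ranks stored vectors by similarity to the query embedding and returns the top `k`. The toy sketch below assumes cosine similarity, which is one common metric; the metric configured on the actual index is not shown in this notebook, and `cosine_similarity` and `top_k` are illustrative names, not library calls:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k vectors most similar to query_vec."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

# Two-dimensional toy vectors standing in for real embeddings
toy_vectors = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(top_k([1.0, 0.1], toy_vectors, k=2))
```

The default of `k=4` in the sketch matches the four source documents that the default retriever returns in the test run at the end of this notebook.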

## Current LLM Setup

In [16]:
# Test with gpt-3.5-turbo
print("🚀 Testing with GPT-3.5-Turbo LLM...")

llm_3_turbo = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0,
    max_tokens=500,
)

🚀 Testing with GPT-3.5-Turbo LLM...


## Create QA Chain

In [27]:
query_test = "How long can i store my client's email?"

results_test = vector_store.similarity_search(
    query_test,  # our search query
    k=3  # return 3 most relevant docs
)

In [31]:
results_test

[Document(id='1c200008-177f-4b67-85c1-b9bcc5d22d58', metadata={'author': 'Kasper, Lisa', 'chunk_id': 99.0, 'chunk_size': 727.0, 'content_category': 'customer_data', 'content_length': 2176.0, 'creationdate': '2020-11-06T11:24:59+01:00', 'document_name': 'ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'document_type': 'zdh_gdpr_handbook', 'language': 'german', 'moddate': '2020-11-06T11:24:59+01:00', 'page': 34.0, 'page_label': '35', 'page_number': 35.0, 'section_type': 'content', 'source': '../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'total_chunks': 266.0, 'total_pages': 99.0}, page_content='Gesetzliche Löschfristen  \n \nIn vereinzelten Fällen schreiben gesetzliche Regelungen vor, wann bestimmte Daten zu lö-\nschen sind (für eine Ü bersicht gesetzlicher Löschfristen siehe die Anlage 17). Eine längere \nAufbewahrung solcher Daten ist unzulässig.  \n \nEtwas anderes gilt nur dann, wenn die Daten zu einem anderen Zweck als zu dem, zu dem \nsie ursprüngli

In [30]:
for i, doc in enumerate(results_test):
    print(f"Document {i+1} content:\n{doc.page_content}\n{'-'*60}")

Document 1 content:
Gesetzliche Löschfristen  
 
In vereinzelten Fällen schreiben gesetzliche Regelungen vor, wann bestimmte Daten zu lö-
schen sind (für eine Ü bersicht gesetzlicher Löschfristen siehe die Anlage 17). Eine längere 
Aufbewahrung solcher Daten ist unzulässig.  
 
Etwas anderes gilt nur dann, wenn die Daten zu einem anderen Zweck als zu dem, zu dem 
sie ursprünglich erhoben wurden, weiterhin benötigt wer den. Eine solche Zweckänderung 
oder Zweckerweiterung ist jedoch an gesetzliche Zulässigkeitsvoraussetzungen gebunden 
(Art. 6 Abs. 4 DSGVO).  
 
Beispiel: 
Kundendaten werden nach Ablauf der Gewährleistungsfristen und der steuerrechtlichen 
Aufbewahrungspflichten – d.h. nach zehn Jahren – nicht mehr zur Abwicklung des Ver-
------------------------------------------------------------
Document 2 content:
Anlage 17 
 
Aufbewahrungs- und Löschfristen 
 
Die Liste stellt eine Übersicht praxisrelevanter Verfahren dar und erhebt keinen An-
spruch auf Vollstä

In [13]:
# Baseline chain without a custom prompt (code adapted from the lesson)

qa_test = RetrievalQA.from_chain_type(
    llm=llm_3_turbo,
    chain_type="stuff",
    retriever=vector_store.as_retriever()
)

query_test = "How long can i store my client's email?"

print(qa_test.invoke(query_test))


{'query': "How long can i store my client's email?", 'result': "According to the information provided, there is no specific legal requirement for how long you can store your client's email. It is generally up to the discretion of the data controller, which in this case would be the business that collected the data. However, it is recommended to establish a data retention policy or a deletion concept to determine when to delete data, taking into account legal requirements and best practices."}


In [None]:
# Create prompt template and QA chain with current syntax
print("🔗 Creating QA chain...")

# Current prompt template
prompt_template_en = """You are a privacy assistant specialized in GDPR for small craft businesses. 
Explain in a clear, practical, and easy-to-understand way based on the following context. 
This is not legal advice. If the context does not contain the answer, say so openly.

Context:
{context}

Question:
{question}

Answer (short and practical):"""

PROMPT_en = PromptTemplate(
    template=prompt_template_en, 
    input_variables=["context", "question"]
)

# Create QA chain with current syntax
qa_chain_en = RetrievalQA.from_chain_type(
    llm=llm_3_turbo,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT_en},
    return_source_documents=True
)


🔗 Creating QA chain...
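
With `chain_type="stuff"`, all retrieved chunks are placed into the prompt's `{context}` slot in a single request. The sketch below approximates that step with plain string formatting; `build_stuffed_prompt`, the shortened template, and the abbreviated chunk texts are illustrative assumptions, not the library's internal implementation:

```python
# Shortened version of the template; the placeholders match the ones used above
PROMPT_TEMPLATE = """Context:
{context}

Question:
{question}

Answer (short and practical):"""

def build_stuffed_prompt(chunks, question, template=PROMPT_TEMPLATE, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(chunk.strip() for chunk in chunks)
    return template.format(context=context, question=question)

# Abbreviated stand-ins for retrieved chunk texts
chunks = ["Gesetzliche Löschfristen", "Anlage 17: Aufbewahrungs- und Löschfristen"]
prompt = build_stuffed_prompt(chunks, "How long can I store my client's email?")
print(prompt)
```

Because every chunk goes into one prompt, the combined length of the chunks has to fit within the model's context window, which is one reason to keep `k` small.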


## Helper Function to Test the RAG System and Display Results


In [None]:
def ask_gdpr_question_en(question, show_sources=True):
    """
    Ask a question to the GDPR assistant and display the response with sources.
    
    Args:
        question (str): The question to ask (in German or English)
        show_sources (bool): Whether to display source documents
    
    Returns:
        dict: Complete result with answer and source documents
    """
    print(f"❓ Question: {question}")
    print("⏳ Thinking...")

    # Get answer from QA chain
    result = qa_chain_en.invoke({"query": question})

    # Fall back to a placeholder if the chain returned an empty answer
    answer = result.get('result', '').strip() or "(no answer returned)"

    print(f"✅ Answer: {answer}")

    # Show source documents if requested
    sources = result.get('source_documents', [])
    if show_sources and sources:
        print(f"\n📚 Source ({len(sources)}):")
        for i, doc in enumerate(sources):
            source_text = doc.page_content.replace('\n', ' ').strip()
            print(f"   {i+1}. {source_text[:150]}...")

    print("―" * 80)
    return result


In [22]:

# Call the chain directly to inspect the raw result dictionary
print(qa_chain_en.invoke({"query": "How long can i store my client's email?"}))

{'query': "How long can i store my client's email?", 'result': "You can store your client's email for as long as it is necessary for the purpose for which it was originally collected. If the email is no longer needed for that purpose, it should be deleted according to legal requirements and best practices.", 'source_documents': [Document(id='1c200008-177f-4b67-85c1-b9bcc5d22d58', metadata={'author': 'Kasper, Lisa', 'chunk_id': 99.0, 'chunk_size': 727.0, 'content_category': 'customer_data', 'content_length': 2176.0, 'creationdate': '2020-11-06T11:24:59+01:00', 'document_name': 'ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'document_type': 'zdh_gdpr_handbook', 'language': 'german', 'moddate': '2020-11-06T11:24:59+01:00', 'page': 34.0, 'page_label': '35', 'page_number': 35.0, 'section_type': 'content', 'source': '../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'total_chunks': 266.0, 'total_pages': 99.0}, page_content='Gesetzliche Löschfristen  \n \nIn vereinzelte

## Test the RAG System

Now let's test the system with various GDPR questions.


In [24]:
# Test 1: Data retention periods (English question, English prompt template)
print("🧪 TEST 1 (gpt-3.5-turbo): Data retention periods")
result1 = ask_gdpr_question_en("How long can i keep a client's email stored?")

🧪 TEST 1 (gpt-3.5-turbo): Data retention periods
❓ Question: How long can i keep a client's email stored?
⏳ Thinking...
‚úÖ Answer: You can keep a client's email stored for as long as it is necessary for the purpose for which it was originally collected. After that, you should delete it unless there are legal requirements or other legitimate reasons for keeping it.

📚 Source (4):
   1. Gesetzliche Löschfristen     In vereinzelten Fällen schreiben gesetzliche Regelungen vor, wann bestimmte Daten zu lö- schen sind (für eine Ü bersicht ...
   2. Ob und wann die Aufbewahrung von Daten nicht mehr erforderlich ist, liegt grundsätzlich im  Ermessen des Dateninhabers, also des Handwerksbetriebs, de...
   3. benötigt, schreiben zahlreichliche gesetzliche Regelungen vor, dass bestimmte Daten min- destens für einen konkreten Zeitraum aufzubewahren sind. Solc...
   4. Aufbewahrungspflichten – d.h. nach zehn Jahren – nicht mehr zur Abwicklung des Ver- trags benötigt. Die Daten des Kunden könne
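
To cover several questions in one pass, the helper can be driven from a list. The sketch below is a hypothetical batch runner; `run_questions` is not part of this notebook, and `fake_ask` is a stub standing in for `ask_gdpr_question_en` so the example runs without API access:

```python
def run_questions(ask, questions):
    """Call `ask` for each question and collect (question, answer) pairs."""
    results = []
    for question in questions:
        result = ask(question)
        # Each result is expected to be a dict with a 'result' key, as above
        results.append((question, result.get("result", "").strip()))
    return results

def fake_ask(question):
    """Stub that mimics the shape of the QA chain's result dictionary."""
    return {"query": question, "result": f" stub answer for: {question} "}

pairs = run_questions(fake_ask, ["Q1", "Q2"])
for question, answer in pairs:
    print(question, "->", answer)
```

In the notebook itself, `ask_gdpr_question_en` could be passed as `ask`, since it returns the chain's result dictionary.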

# Draft