### **Research on Data Governance AI Assistant**

Data Governance (DG) refers to the framework of policies, standards, and practices that ensure the proper management, security, and quality of data within an organization. It plays a crucial role in:

* **Data Quality**: Ensuring the accuracy, consistency, and reliability of data.
* **Data Privacy and Compliance**: Ensuring that data handling practices adhere to relevant laws and regulations (e.g., GDPR).
* **Data Accessibility**: Ensuring that data is available and accessible to authorized users.
* **Data Stewardship**: Managing data resources and ensuring that data management responsibilities are defined clearly across the organization.

In the context of AI and machine learning, **Data Governance** becomes even more significant, as it ensures that AI systems work with trusted and high-quality data, reducing the risks of biased, erroneous, or illegal data usage.

#### **Overview of the Data Governance AI Assistant**

The **Data Governance AI Assistant** is designed to help organizations apply data governance practices and stay compliant with them. Powered by Natural Language Processing (NLP) and machine learning, it answers questions, guides users on best practices, and provides data governance recommendations. It draws on documents related to Data Governance, such as:

* **Legal frameworks**: e.g., GDPR, NDP-ACT-GAID-2025.
* **Internal policies**: Organizational guidelines for managing data.
* **Best practices**: Established standards in the industry for data management.

The **Data Governance AI Assistant** uses a combination of **Document Retrieval** techniques and **Text Generation** models to provide accurate answers and insights to users regarding data governance questions.

---

#### **Architecture of the Data Governance AI Assistant**

The architecture of the Data Governance AI Assistant can be broken down into several key components:

1. **Document Loading**:

   * **PyPDFLoader**: This component loads documents in PDF format, which are the primary sources of knowledge for the assistant (policies, guidelines, etc.).
   * **DirectoryLoader**: This component loads documents from a specific directory and can be used to load large sets of documents (e.g., internal policies stored in a folder).

In [2]:
%pwd

'/home/chukwuemeka-james/Documents/Data-Governance-AI-Assistant/research'

In [3]:
# Move to the parent directory to access the research resource: Knwoledge_Source

import os
os.chdir("../")

In [4]:
%pwd

'/home/chukwuemeka-james/Documents/Data-Governance-AI-Assistant'

In [5]:
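from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

def load_pdf_file(data):
    """Load every PDF directly inside the `data` directory, one Document per page."""
    # DirectoryLoader finds files matching the glob; PyPDFLoader parses each one.
    loader = DirectoryLoader(data, glob="*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents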

In [None]:
extracted_data=load_pdf_file(data='Knowledge_Source/')
extracted_data

[Document(metadata={'producer': 'PDFlib+PDI 8.0.5p2 (C++/Win64)', 'creator': 'Arbortext Advanced Print Publisher 11.0.3108/W Unicode-x64', 'creationdate': '2016-05-02T11:34:14+02:00', 'author': 'Publications Office', 'moddate': '2016-05-03T08:52:52+02:00', 'title': 'REGULATION  (EU)  2016/  679  OF  THE  EUROPEAN  PARLIAMENT  AND  OF  THE  COUNCIL  -  of  27  April  2016  -  on  the  protection  of  natural  persons  with  regard  to  the  processing  of  personal  data  and  on  the  free  movement  of  such  data,  and  repealing  Directive 95/  46/  EC  (General  Data  Protection  Regulation)', 'source': 'Knwoledge_Source/General Data Protection Regulation (GDPR).pdf', 'total_pages': 88, 'page': 0, 'page_label': '1'}, page_content='I \n(Legislativ e acts) \nREGUL A TIONS \nREGUL A TION (EU) 2016/679 OF THE EUR OPEAN P ARLIAMENT AND OF THE COUNCIL \nof 27 Apr il 2016 \non the protection of natural persons with regard to the processing of personal data and on the free \nmo v ement of 
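As a quick sanity check on the loader output, the sketch below counts how many pages were loaded from each PDF. It relies only on the `source` metadata key visible in the output above. The helper name `pages_per_source` and the stand-in objects are hypothetical; in the notebook you would pass `extracted_data` instead of `sample_docs`.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(documents):
    """Count loaded pages per source file, using each document's metadata."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in documents)


# Stand-in objects that mimic the `.metadata` attribute of loaded documents.
sample_docs = [
    SimpleNamespace(metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "a.pdf", "page": 1}),
    SimpleNamespace(metadata={"source": "b.pdf", "page": 0}),
]

page_counts = pages_per_source(sample_docs)
for source, n_pages in page_counts.items():
    print(f"{source}: {n_pages} page(s)")
```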

2. **Text Chunking**:

   * **RecursiveCharacterTextSplitter**: Once the documents are loaded, they are split into smaller, manageable text chunks. This is important for efficient processing, as large documents can exceed the token limits of AI models.

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def text_split(extracted_data):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
    text_chunks = text_splitter.split_documents(extracted_data)
    return text_chunks

In [8]:
text_chunks=text_split(extracted_data)
print("Length of Text Chunks", len(text_chunks))

Length of Text Chunks 1349


In [9]:
text_chunks

[Document(metadata={'producer': 'PDFlib+PDI 8.0.5p2 (C++/Win64)', 'creator': 'Arbortext Advanced Print Publisher 11.0.3108/W Unicode-x64', 'creationdate': '2016-05-02T11:34:14+02:00', 'author': 'Publications Office', 'moddate': '2016-05-03T08:52:52+02:00', 'title': 'REGULATION  (EU)  2016/  679  OF  THE  EUROPEAN  PARLIAMENT  AND  OF  THE  COUNCIL  -  of  27  April  2016  -  on  the  protection  of  natural  persons  with  regard  to  the  processing  of  personal  data  and  on  the  free  movement  of  such  data,  and  repealing  Directive 95/  46/  EC  (General  Data  Protection  Regulation)', 'source': 'Knwoledge_Source/General Data Protection Regulation (GDPR).pdf', 'total_pages': 88, 'page': 0, 'page_label': '1'}, page_content='I \n(Legislativ e acts) \nREGUL A TIONS \nREGUL A TION (EU) 2016/679 OF THE EUR OPEAN P ARLIAMENT AND OF THE COUNCIL \nof 27 Apr il 2016 \non the protection of natural persons with regard to the processing of personal data and on the free \nmo v ement of 

   * **Chunk Size**: `chunk_size` sets the maximum length of each chunk, measured in characters by default (500 here), and `chunk_overlap` sets how many characters consecutive chunks may share (20 here) so that context is preserved across chunk boundaries.
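To make the effect of the two parameters concrete, here is a minimal fixed-window sketch. It is a simplification, not the `RecursiveCharacterTextSplitter` algorithm (which prefers to break on separators such as paragraph and line breaks), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
```

With these toy values, the last two characters of each chunk reappear at the start of the next one, which is the role `chunk_overlap=20` plays in the notebook.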

3. **Embeddings**:

   * **HuggingFace Embeddings**: Text chunks are then embedded with a pre-trained sentence-transformers model from HuggingFace (`sentence-transformers/all-MiniLM-L6-v2`), which converts each chunk into a numerical vector that can be compared with other vectors during retrieval.

In [10]:
from langchain_huggingface import HuggingFaceEmbeddings

def download_hugging_face_embeddings():
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
    return embeddings

In [11]:
embeddings = download_hugging_face_embeddings()



   This embedding model is a lightweight transformer model suitable for producing dense vector representations of the documents, which are necessary for efficient searching and retrieval.

In [12]:
query_result = embeddings.embed_query("Hello world")
print("Length", len(query_result))

Length 384


In [13]:
query_result

[-0.034477315843105316,
 0.031023172661662102,
 0.006734910886734724,
 0.02610892429947853,
 -0.03936195746064186,
 -0.1603025197982788,
 0.06692396104335785,
 -0.006441440898925066,
 -0.04745054617524147,
 0.014758836477994919,
 0.07087532430887222,
 0.055527545511722565,
 0.01919332519173622,
 -0.026251299306750298,
 -0.01010951679199934,
 -0.026940451934933662,
 0.022307397797703743,
 -0.022226639091968536,
 -0.1496926248073578,
 -0.01749303936958313,
 0.007676327601075172,
 0.054352276027202606,
 0.0032544792629778385,
 0.03172592446208,
 -0.08462144434452057,
 -0.029405953362584114,
 0.05159562826156616,
 0.048124104738235474,
 -0.003314818488433957,
 -0.05827919766306877,
 0.04196928068995476,
 0.02221069671213627,
 0.12818878889083862,
 -0.02233896404504776,
 -0.011656257323920727,
 0.06292840093374252,
 -0.03287629410624504,
 -0.09122602641582489,
 -0.031175386160612106,
 0.05269954726099968,
 0.047034841030836105,
 -0.08420310169458389,
 -0.030056146904826164,
 -0.020744822919

4. **Pinecone for Vector Storage**:

   * **Pinecone**: Pinecone is used for storing the vector embeddings and enabling similarity search. The embeddings of the text chunks are stored in Pinecone, which allows the assistant to retrieve the most relevant documents quickly when queried.
   * To set up Pinecone, head over to [pinecone.io](https://www.pinecone.io/).

In [21]:
from dotenv import load_dotenv
load_dotenv()

PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
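Because API keys are secrets, it helps to confirm that they loaded without echoing them into the notebook output. A minimal sketch, with hypothetical helper names `require_env` and `mask_secret` and a dummy variable used only for demonstration:

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file")
    return value


def mask_secret(value, visible=4):
    """Show only the first few characters of a secret for display purposes."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


# Demonstration with a dummy value; a real key would come from the .env file.
os.environ["EXAMPLE_API_KEY"] = "dummy-key-123"
print(mask_secret(require_env("EXAMPLE_API_KEY")))
```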



In [23]:
from pinecone import ServerlessSpec
from pinecone.grpc import PineconeGRPC as Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)

index_name = "datagovbot"               # Note: Upper case is not allowed for index name in pinecone
pc.create_index(
    name=index_name,
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws", 
        region="us-east-1"
    )
)

{
    "name": "datagovbot",
    "metric": "cosine",
    "host": "datagovbot-97k8njw.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 384,
    "deletion_protection": "disabled",
    "tags": null
}
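Creating an index whose name is already taken is rejected, so when re-running the notebook it can help to guard the call. The sketch below assumes the client exposes `list_indexes().names()`, as recent Pinecone Python clients do. The helper `ensure_index` is hypothetical, and a tiny fake client stands in for `pc` so the example runs without credentials.

```python
def ensure_index(client, name, **create_kwargs):
    """Create the index only if no index with that name exists yet.

    Returns True if the index was created, False if it already existed.
    """
    if name in client.list_indexes().names():
        return False
    client.create_index(name=name, **create_kwargs)
    return True


class _FakeIndexList:
    """Mimics the object returned by list_indexes(), for illustration only."""

    def __init__(self, names):
        self._names = list(names)

    def names(self):
        return list(self._names)


class FakeClient:
    """In-memory stand-in for the Pinecone client, for illustration only."""

    def __init__(self):
        self.created = {}

    def list_indexes(self):
        return _FakeIndexList(self.created)

    def create_index(self, name, **kwargs):
        self.created[name] = kwargs


fake_pc = FakeClient()
first = ensure_index(fake_pc, "datagovbot", dimension=384, metric="cosine")
second = ensure_index(fake_pc, "datagovbot", dimension=384, metric="cosine")
print(first, second)
```

In the notebook, the equivalent call would pass the real client and the same arguments used above, for example `ensure_index(pc, index_name, dimension=384, metric="cosine", spec=ServerlessSpec(cloud="aws", region="us-east-1"))`.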

   **Pinecone Index**: The index is created with the necessary parameters, such as:

   * **dimension**: The dimensionality of the embedding vectors (384 in this case).
   * **metric**: The distance metric used for similarity search (e.g., cosine similarity).
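For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. A small pure-Python sketch follows; the function is illustrative only, since the vector database computes the metric internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors score close to 1 and perpendicular vectors score 0, regardless of their lengths.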

5. **Retrieval Mechanism**:

   * Once the documents are indexed, the assistant can retrieve relevant documents based on a given query. The `PineconeVectorStore` is used to map the text chunks to the Pinecone index and perform similarity searches.

##### **Let's now upsert (update/insert) the document chunks:**

- **If the item (vector/document) already exists → update it, else → insert it.**

- **Convert each text chunk into a vector embedding and store it in the specified Pinecone index**
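Upsert semantics can be illustrated with a plain dictionary keyed by record ID. This is only a conceptual sketch with hypothetical names and made-up vectors, not a description of how Pinecone stores data.

```python
def upsert(store, records):
    """Insert new records and overwrite existing ones that share an ID.

    Returns a tuple (inserted_count, updated_count).
    """
    inserted = updated = 0
    for record_id, vector in records:
        if record_id in store:
            updated += 1
        else:
            inserted += 1
        store[record_id] = vector
    return inserted, updated


toy_index = {}
first_pass = upsert(toy_index, [("chunk-1", [0.1, 0.2]), ("chunk-2", [0.3, 0.4])])
second_pass = upsert(toy_index, [("chunk-2", [0.5, 0.6]), ("chunk-3", [0.7, 0.8])])
print(first_pass, second_pass, len(toy_index))
```

An update only happens when a record arrives with an ID that is already present; records with fresh IDs are always inserted.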


In [None]:
# Convert each text chunk into a vector embedding and store it in the specified Pinecone index
from langchain_pinecone import PineconeVectorStore

docsearch = PineconeVectorStore.from_documents(
    documents=text_chunks,
    index_name=index_name,
    embedding=embeddings, 
)

##### **Load Existing index**

- **Let's connect to the existing Pinecone index and prepare it for similarity searches using the embedding model.**


In [25]:
docsearch = PineconeVectorStore.from_existing_index(
    index_name=index_name,
    embedding=embeddings
)

The above code does not re-upload documents but makes the index ready for queries.

In [26]:
docsearch

<langchain_pinecone.vectorstores.PineconeVectorStore at 0x7bdea73c12a0>

In [27]:
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":3})

   Here, `k=3` defines the number of documents to retrieve. The assistant retrieves the top 3 most relevant documents based on the user's query.
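Conceptually, a similarity retriever scores stored vectors against the query vector and keeps the `k` best. The toy sketch below does this by brute force over made-up 2-dimensional vectors with hypothetical IDs. The real embeddings have 384 dimensions, and a vector database would typically avoid scanning every vector.

```python
import heapq
import math


def top_k(query, vectors, k=3):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up vectors: the closer a vector points to the query, the higher it ranks.
toy_vectors = {
    "chunk-a": [0.9, 0.1],
    "chunk-b": [0.7, 0.7],
    "chunk-c": [0.1, 0.9],
    "chunk-d": [-0.5, 0.5],
}
results = top_k([1.0, 0.0], toy_vectors, k=3)
print([doc_id for doc_id, _ in results])
```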


In [28]:
retrieved_docs = retriever.invoke("What is GDPR?")

In [29]:
retrieved_docs

[Document(id='a82ae9ed-dc75-49f1-9b57-e6eed010f32a', metadata={'author': 'Babatunde Bamigboye', 'creationdate': '2025-03-20T16:50:54+01:00', 'creator': 'Microsoft® Word 2016', 'moddate': '2025-03-20T16:50:54+01:00', 'page': 75.0, 'page_label': '76', 'producer': 'Microsoft® Word 2016', 'source': 'Knwoledge_Source/NDP-ACT-GAID-2025-MARCH-20TH.pdf', 'total_pages': 117.0}, page_content='76 \n \n \n \nProtection Report \n(SAPR) as provided for \nunder the NDP ACT \nGAID? \n \n2. Please select from the \nlist “types of answers” \ncolumn, the type of \nfacts that describe the \norganisation’s practices \nin respect of \naccountability and \nrecord of processing \nactivities.   \nArt. 13   a. The (SAPR) is an accurate, \nevidence-based assessment of \nthe organisation ’s data security \nbased on Art.14 of the GAID.  \n \nb. The organisation processed \npersonal data of at least ----------'),
 Document(id='488df52f-24c4-4f66-8f39-6b8a358ffbf6', metadata={'author': 'Babatunde Bamigboye', 'creati

6. **Language Model for Answer Generation**:

   * **Groq-hosted chat model**: A chat model served through Groq (`deepseek-r1-distill-llama-70b` in this notebook; other chat models could be substituted) generates human-like responses from the retrieved context. The assistant uses a pre-defined **system prompt** that guides the model to answer based on data governance policies.

In [63]:
from langchain_groq import ChatGroq

groq_api_key=os.environ.get('groq_api_key')
llm=ChatGroq(groq_api_key=groq_api_key, model_name="deepseek-r1-distill-llama-70b")


   **System Prompt**: The system prompt defines the behavior of the AI assistant. For example, it ensures that answers are based on context (GDPR, internal policies, etc.).

In [65]:
from langchain_core.prompts import ChatPromptTemplate

# Adjacent string literals are concatenated, so each sentence ends with a space.
system_prompt = (
    "You are an assistant specialized in answering questions about Data Governance. "
    "Use the provided pieces of retrieved context from the NDP-ACT-GAID-2025 (Nigeria) and "
    "the General Data Protection Regulation (GDPR - Europe) to generate accurate and relevant answers. "
    "If the answer is not contained within the provided context, clearly state that you don't know. "
    "Always indicate whether your response is based on NDP-ACT-GAID-2025, GDPR, or both. "
    "Keep your answers clear and concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

7. **Retrieval-Augmented Generation (RAG)**:

   * **RAG Chain**: The final step in the process is the retrieval-augmented generation chain, which combines document retrieval and answer generation. This chain is responsible for pulling the relevant information and using the language model to create a coherent response.

In [64]:
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Create the chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
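Conceptually, the chain retrieves documents for the question, "stuffs" their text into the `{context}` slot of the prompt, and passes the result to the model. The stand-alone sketch below mimics that flow with stub functions. `simple_rag`, `fake_retriever`, and `fake_llm` are hypothetical stand-ins rather than LangChain APIs, and the real chain's internals may differ.

```python
def simple_rag(question, retriever, llm, system_template):
    """Retrieve context, build a prompt, and generate an answer."""
    docs = retriever(question)
    context = "\n\n".join(docs)  # "stuff" all chunks into one string
    prompt = system_template.format(context=context) + "\n\nQuestion: " + question
    return {"input": question, "context": docs, "answer": llm(prompt)}


def fake_retriever(question):
    # Stand-in for a vector-store lookup.
    return ["Chunk one about lawful basis.", "Chunk two about consent."]


def fake_llm(prompt):
    # Stand-in for a chat model: reports how much text it received.
    return f"(model saw {len(prompt)} characters)"


template = "Answer using only this context:\n\n{context}"
result = simple_rag("What is a lawful basis?", fake_retriever, fake_llm, template)
print(result["answer"])
```

The returned dictionary mirrors the shape the notebook relies on when it reads `response["answer"]`.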

In [62]:
from dotenv import load_dotenv
load_dotenv()

PINECONE_API_KEY=os.environ.get('PINECONE_API_KEY')
groq_api_key=os.environ.get('groq_api_key')



In [66]:
response = rag_chain.invoke({"input": "Data Governance?"})
print(response["answer"])

<think>
Okay, so I'm trying to understand Data Governance based on the provided context from NDP-ACT-GAID-2025 and GDPR. Let me break this down step by step.

First, I see that the context lists several factors that data controllers and processors must consider. These include risks to fundamental rights, security implications, public welfare, administration of justice, sustainable development, prior relationships, data sovereignty, data sensitivity, data-driven financial assets, and reliance on third-party services.

Looking at the NDP-ACT-GAID-2025, it seems to outline specific obligations for data handlers. They must abide by global best practices, considering factors like data sensitivity and the financial assets entrusted to them. They also have to think about their role as government establishments and whether they use third-party servers or cloud services for substantial data processing.

Now, the GDPR, which I'm somewhat familiar with, emphasizes transparency, purpose limitation

In [67]:
response = rag_chain.invoke({"input": "What is meant by “lawful basis for processing” under GDPR?"})
print(response["answer"])

<think>
Okay, so I need to figure out what "lawful basis for processing" means under the GDPR. I remember that GDPR has specific articles about how personal data can be legally processed. From the context given, it looks like Article 6 is key here. 

In the context, Article 6(1) lists several conditions. The first one is consent, where the data subject agrees to their data being processed for specific purposes. Then there are legal obligations, like when a company has to comply with a law that requires them to process data. Vital interest is next, which I think refers to situations where processing is necessary to protect someone's life or health, maybe in emergencies. 

There's also legitimate interest, which seems a bit broader. It's when the company has a genuine reason to process data, like for business needs, but they have to make sure it doesn't override the individual's rights. Public interest is another basis, which probably applies when processing is necessary for a task carri
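The printed answers begin with the model's `<think>` reasoning block. If only the final answer should be shown to users, that block can be stripped before display. A minimal sketch, assuming the reasoning is always wrapped in a closed `<think>...</think>` pair; `strip_think` is a hypothetical helper and the sample text is made up.

```python
import re

# Non-greedy match so that each <think>...</think> pair is removed separately.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text):
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


raw = "<think>\nLet me check Article 6.\n</think>\n\nConsent is one lawful basis."
final_answer = strip_think(raw)
print(final_answer)
```

In the notebook this could be applied as `print(strip_think(response["answer"]))`.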

#### **Use Cases for the Data Governance AI Assistant**

The Data Governance AI Assistant can be used in various scenarios to assist organizations in managing their data policies and compliance efforts, including:

* **Regulatory Compliance**: Helping organizations understand and comply with legal frameworks like GDPR, CCPA, and other data protection laws.
* **Data Quality Checks**: Providing guidance on maintaining data quality standards and offering recommendations based on historical data governance practices.
* **Audit Trails**: Assisting in understanding how data was used, who accessed it, and when, to support auditing and accountability efforts.
* **Training**: Offering personalized training sessions on data governance best practices and policies for employees.

#### **Conclusion**

The **Data Governance AI Assistant** shows how NLP, embeddings, and retrieval-augmented generation can be combined to support data governance work. The resulting system automates information retrieval and generates answers grounded in the indexed regulations and policies, helping organizations manage their data and keep up with evolving regulatory requirements.