## Chat with the MDGF document

This notebook loads the Modern Data Governance Framework (MDGF) document and uses a generative model to answer user questions about it. It also walks through creating a governance document from user input.

In [None]:
%%capture
# update or install the necessary libraries
!pip install --upgrade langchain-openai
!pip install --upgrade langchain
!pip install --upgrade python-dotenv

In [None]:
import os
from dotenv import load_dotenv
print(f".env file loaded correctly: {load_dotenv()}")

from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain.callbacks import get_openai_callback
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder
)
from langchain.schema import (
    SystemMessage,
)

# Load The MDGF Document

In [None]:
# read the document
with open("../data/mdgf_document.txt", "r") as file:
    mdgf_document = file.read()
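
Because the full document text is placed in the system prompt, its size directly affects the token count of every request. The sketch below gives a rough estimate before any API call is made. The helper name `estimate_tokens` and the four-characters-per-token ratio are assumptions for illustration: the ratio is a common rule of thumb for English text, not an exact tokenizer count.

```python
def estimate_tokens(text, chars_per_token=4):
    """Roughly estimate the token count of text.

    Hypothetical helper: the default ratio is a rule of thumb for English
    text, not the output of a real tokenizer.
    """
    return len(text) // chars_per_token


# Stand-in text; in the notebook this would be mdgf_document
sample_text = "Data governance requirement. " * 100
print(f"Characters: {len(sample_text)}")
print(f"Estimated tokens: {estimate_tokens(sample_text)}")
```

If the estimate approaches the model's context limit, the document would need to be shortened or split before being placed in the prompt.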

# Combining Prompt Patterns
- Persona prompt pattern
  - Asks the model to assume the persona of an expert in scientific data governance
- Recipe prompt pattern
  - Gives the model the steps to follow to generate the governance document
- Output Automator prompt pattern
  - Asks the model to provide output in a specific format that can be automated: in this case, the headings required to generate the governance document
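
The three patterns can be kept as separate pieces and joined into one system prompt. The following is a minimal sketch of that idea, not the prompt used in this notebook: the function `build_system_prompt` and all of the strings passed to it are hypothetical placeholders.

```python
def build_system_prompt(persona, recipe_steps, output_format, reference_text):
    """Join the three prompt patterns and the reference text into one prompt."""
    # Recipe pattern: number the steps so the model follows them in order
    steps = "\n".join(
        f"Step {i}: {step}" for i, step in enumerate(recipe_steps, start=1)
    )
    return (
        f"{persona}\n\n"
        f"Follow these steps:\n{steps}\n\n"
        f"{output_format}\n\n"
        f"Here is the reference document:\n{reference_text}"
    )


# All strings below are illustrative placeholders, not the notebook's prompt
example_prompt = build_system_prompt(
    persona="You are an expert in scientific data governance.",
    recipe_steps=[
        "Identify the entity in the user's project.",
        "Identify the governance activity.",
        "Identify the type of document needed.",
    ],
    output_format="Respond only with the matching headings.",
    reference_text="(reference text goes here)",
)
print(example_prompt)
```

Keeping the pieces separate makes it easier to change one pattern, for example the output format, without touching the others.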

In [None]:
import pprint
import json

MDGF_PROMPT = f"""
You are an expert in scientific data governance and management, and you will assist users by answering questions and creating documents. Use only the content in the Modern Data Governance Framework (MDGF) reference text after the delimiter for your answers. If a question falls outside the reference text, respond: “This is out of scope for me to answer”

You have two responsibilities:

First - Answering Questions:
You will be asked questions. Answer each question using only the reference text provided.
In addition to answering the question, cite the passages from the document used to answer it, prefixing each with citation.
For any requirement, also provide the corresponding procedure.
If you cannot find an answer in the reference text, respond: “I could not find the answer”

Second - Creating Documents:

When a user asks you to create either a requirements document or a procedure plan based on the reference text, assist the user by asking a series of questions to capture their project needs.

Step 1: Identify the entity in the user’s project. Respond with: “Sure, I will be happy to help. First, tell me the core entity or asset that you will be managing:

Data 
Metadata
Digital content 
Code
Software”

Step 2: Identify the governance activity in the user’s project. Respond with: “Tell me about the governance activity needed in your project:

Planning and Design
Monitoring
Generation/Curation
Sharing
Use/Reuse
Preservation”

Step 3: Identify the user's need for the Type of document. Respond with: “Are you seeking Requirements or Procedures for your project?

Requirements
Procedures”

Finally, respond with:
"Here are the headings for the Requirements document:
A.1.1.1, A.1.2.1, ..." 
Provide only the headings (A.1.1.1, A.1.2.1, ...) that appear in the DGF document. Never provide any additional information. Do NOT use placeholder text, ..., or anything similar in the response.


Here is the reference DGF document:
{mdgf_document} 
"""

llm = ChatOpenAI(
    temperature=0.5,
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model_name="gpt-4-turbo-preview",
)

prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content=MDGF_PROMPT,
        ),  # The persistent system prompt
        MessagesPlaceholder(
            variable_name="history"
        ),  # Where the conversation history from memory is inserted
        HumanMessagePromptTemplate.from_template(
            "{input}"
        ),  # Where the human input will be injected
    ]
)

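def ask(chain, query, track_token=True):
    """Send a query to the chain, print the reply, and return it."""
    # The callback records token usage for OpenAI calls made inside the block
    with get_openai_callback() as cb:
        result = chain.invoke({"input": query})
        if track_token:
            print(f'Total tokens: {cb.total_tokens}')
            print(f'Requests: {cb.successful_requests}')
    print(result['response'])
    return result['response']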

conversation = ConversationChain(
    prompt=prompt,
    llm=llm,
    verbose=False,
    memory=ConversationBufferMemory(ai_prefix="AI Assistant", memory_key="history", return_messages=True),
)

In [None]:
_ = ask(conversation, "what data file naming conventions should I use?")

In [None]:
_ = ask(conversation, "Can you create a requirements document for me?")

In [None]:
_ = ask(conversation, "Data, Metadata")


In [None]:
_ = ask(conversation, "Planning and Design")

In [None]:
model_response = ask(conversation, "Requirements")

# Generate MDGF document based on model output
- Uses a regular expression and string matching to build the governance document from the model output
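
To see what the heading pattern accepts, it helps to run it on sample text. The pattern expects a capital letter immediately followed by a digit, so headings written with a dot after the letter (as in the "A.1.1.1" example in the prompt) would not match. Whether that matters depends on how headings are written in the MDGF text, which is not shown here. Both response strings below are hypothetical.

```python
import re

# Same pattern as in the notebook: one capital letter, three dot-separated
# numbers, and an optional trailing lowercase letter
pattern = r'[A-Z]\d+\.\d+\.\d+[a-z]?'

# Hypothetical model responses, for illustration only
compact = "Here are the headings: A1.1.1, A1.2.1a, B2.3.4"
dotted = "Here are the headings: A.1.1.1, A.1.2.1"

print(re.findall(pattern, compact))  # ['A1.1.1', 'A1.2.1a', 'B2.3.4']
print(re.findall(pattern, dotted))   # []
```

If the model returns an empty list of matches, checking the heading format in its response against the pattern is a reasonable first step.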

In [None]:
import re

text = model_response

pattern = r'[A-Z]\d+\.\d+\.\d+[a-z]?'
headers = re.findall(pattern, text)
print('All matches:', headers)

In [None]:
import json

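def subset_data(headers, data):
    """Keep only the entries whose first 10 characters contain one of the headers.

    Expects data shaped as {section: {subsection: [[entry, ...], ...]}}.
    """
    # Initialize a dictionary to hold the subsetted data
    subsetted_data = {}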
    for top_key, top_value in data.items():
        if isinstance(top_value, dict):
            subsetted_section = {}
            
            for second_key, entries in top_value.items():
                subsetted_entries = []
                
                for entry_list in entries:
                    entry_item = []
                    for entry in entry_list:
                        if any(header in entry[:10] for header in headers):
                            entry_item.append(entry)
                    if entry_item:
                        subsetted_entries.append(entry_item)
                
                if subsetted_entries:
                    subsetted_section[second_key] = subsetted_entries
            if subsetted_section:
                subsetted_data[top_key] = subsetted_section
    
    return subsetted_data

# Load the framework JSON and keep only the entries for the extracted headings
with open("../data/dgf.json", "r") as f:
    data = json.load(f)
subset = subset_data(headers, data)

# Printing the subset to verify
print(json.dumps(subset, indent=4))
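
The substring check in `subset_data` also accepts longer IDs that begin with a requested heading: "A1.1.1" is a substring of "A1.1.10", for instance. If that matters for the data, one option is to extract the heading ID from each entry and compare it exactly. This is a minimal sketch, assuming each entry contains its own heading ID before any other ID; the helper `entry_matches` and the sample entries are hypothetical.

```python
import re

# Same heading pattern used to parse the model response
HEADING = re.compile(r'[A-Z]\d+\.\d+\.\d+[a-z]?')


def entry_matches(entry, headers):
    """Return True if the first heading ID found in entry is in headers."""
    match = HEADING.search(entry)
    return match is not None and match.group(0) in headers


# Hypothetical entries, for illustration only
entries = [
    "A1.1.1 Develop a data management plan",
    "A1.1.10 Review the plan annually",
]
wanted = ["A1.1.1"]

# The substring test keeps both entries; the exact comparison keeps one
print([e for e in entries if any(h in e[:10] for h in wanted)])
print([e for e in entries if entry_matches(e, wanted)])
```

Swapping this comparison into `subset_data` would replace the `any(header in entry[:10] ...)` condition with `entry_matches(entry, headers)`.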