In [11]:
%reload_ext autoreload
%autoreload 2

# Import essential modules

In [12]:
from __future__ import annotations

In [13]:
import nest_asyncio
nest_asyncio.apply()

In [14]:
import json
from pprint import pprint
import pickle

In [15]:
import pandas as pd

In [16]:
from typing import List, Optional, Dict, Set, Union, Tuple
from tqdm import tqdm
from loguru import logger
import functools

In [17]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.schema.language_model import BaseLanguageModel

In [18]:
from paperqa import Docs, PromptCollection

In [19]:
import openai

In [20]:
import re
from IPython.display import display, HTML, Markdown

# Policies and Requirements

### Data Governance Framework

> A Data Governance Framework serves as a template or starting point for organizations to define and enforce policies that govern the management of data throughout its lifecycle. Requirements and implementation procedures captured in the document should be included in an Open Science Management Plan for Data (i.e., OSDMP) and/or a Software Management Plan (i.e., SMP) or a Data Management Plan (DMP) when initiating a new IMPACT project.


### Policy

A policy document is a written statement of the rules and procedures that an organization follows. It guides decision-making and ensures that everyone is on the same page.

#### SPD-41a

> SPD-41a is a scientific information policy developed by NASA's Science Mission Directorate (SMD) to make SMD-funded research as open as possible, restricted as required, and always secure.


We load the policy document as JSON, for example:
```json
{
    "SPD41a": [
        "SPD-41a Policy Requirement",
        "Data shall be made publicly available, free in open, machine-readable formats [III.C.ii\u2013iv]",
        "SMD-funded data shall be reusable with a clear, open, and accessible data license [III.C.vii]. (If there are no other restrictions, SMD scientific data should be released with a Creative Commons Zero license.)",
        "Publicly available SMD-funded data collections shall be citable using a persistent identifier [III.C.viii]",
        "SMD-funded data shall include robust, standards-compliant metadata that clearly and explicitly describe the data [III.C.vi]",
        "SMD-funded data shall be findable, such that the data can be retrieved, downloaded, indexed, and searched [III.C.v]",
        "SMD-funded data collections shall be indexed as part of the NASA catalog of data [III.C.ix]",
        "SMD-funded data should follow the FAIR Guiding Principles [III.C.i]",
        "The SMD repository provides documentation on policies for data retention [Appendix D.11.]",
        "If there are no other restrictions, publicly available SMD-funded software should be released under a permissive license that has broad acceptance in the community [III.D.iii]",
        "Publicly available SMD-funded software shall be citable using a persistent identifier [III.D.vii]"
    ],
    "FAIR": [
        "FAIR Guiding Principle",
        "F1 - Persistent identifier (PID)",
        "F2 - Rich metadata",
        "F3 - Linked PID",
        "F4 - Searchable",
        "",
        "A1 - Retrievable",
        "A1.1 - Protocol",
        "A1.2 - Procedure (authentication and authorization)",
        "A2 - Permanent metadata record",
        "",
        "I1 - Language (knowledge representation)",
        "I2 - Vocabulary",
        "I3 - Reference",
        "",
        "R1 - Attributes",
        "R1.1 - License",
        "R1.2 - Provenance",
        "R1.3 - Community standards"
    ]
}
```

---

For this demo, we use **SPD41a** policies.

# Compliance Checker

The goal is to find the requirement statements/sections in the requirements document that support and comply with the given policies (e.g., **SPD41a**).

We use the **modern Digital Governance Framework** (*mdgf*) document to check compliance against the SPD41a policies.

## How to?

The main compliance check function (*check_compliance*) needs only:
- `requirements_doc`: the document from which requirements are loaded
- `policy_type`: the policy to check compliance against *(e.g., SPD41a)*

The compliance checker uses a **Large Language Model** (LLM) with various prompting strategies to make the checks as accurate as possible. The checker has two prompts that *can be tuned* as needed:

- System prompt (*SYSTEM_PROMPT*): tells the LLM to behave in a certain way *(here, as a compliance checker)*
- Checker prompt (*CHECK_PROMPT*): the part of the conversation where the user asks the LLM for a specific kind of response *(e.g., respond in bullet points, provide citations, etc.)*
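
As a rough illustration of tuning the checker prompt, the sketch below drafts a custom template and confirms that it exposes exactly one `{policy}` placeholder before it is handed to the checker. It uses plain `str.format` so that it runs without dependencies; in the notebook the same string would be wrapped with `PromptTemplate.from_template`. The helper name `placeholders` and the prompt wording are illustrative assumptions, not part of the notebook.

```python
from string import Formatter


def placeholders(template: str) -> set:
    """Return the set of named fields used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Hypothetical custom checker prompt; the only hard requirement assumed here
# is that it contains the `{policy}` placeholder the checker fills in.
custom_check_prompt = (
    "List the requirement sections that comply with the policy '{policy}'. "
    "Respond in bullet points and cite a reference for each."
)

# Fail early if the template would not accept a policy.
assert placeholders(custom_check_prompt) == {"policy"}

rendered = custom_check_prompt.format(
    policy="Publicly available SMD-funded software shall be citable "
    "using a persistent identifier [III.D.vii]"
)
print(rendered)
```

Checking the placeholders up front is cheap and avoids discovering a malformed template only after the first LLM call.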

In [21]:
def get_policies(path, offset: int = 1, policy_type="SPD41a"):
    policies = []
    with open(path) as f:
        policies = json.load(f).get(policy_type, [])
    policies = policies[offset:] if policies else policies
    return policies
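
A quick way to see what `get_policies` returns is to run it against a small temporary file. The sketch below repeats the function so that it is self-contained and uses a trimmed copy of the FAIR list from the JSON above; the temporary file name is an assumption for illustration.

```python
import json
import tempfile
from pathlib import Path


def get_policies(path, offset: int = 1, policy_type="SPD41a"):
    # Same logic as the notebook cell: read the list and skip the header entry.
    with open(path) as f:
        policies = json.load(f).get(policy_type, [])
    return policies[offset:] if policies else policies


# Trimmed sample in the same shape as the policy JSON shown earlier.
sample = {
    "FAIR": [
        "FAIR Guiding Principle",  # header entry, dropped by offset=1
        "F1 - Persistent identifier (PID)",
        "F2 - Rich metadata",
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "policies.json"  # hypothetical file name
    path.write_text(json.dumps(sample))
    fair = get_policies(path, policy_type="FAIR")
    missing = get_policies(path, policy_type="SPD41a")

print(fair)     # ['F1 - Persistent identifier (PID)', 'F2 - Rich metadata']
print(missing)  # []
```

The default `offset=1` drops the first entry, which is a column header rather than a policy. The full FAIR list also contains empty strings as group separators, so a caller would likely filter them, for example with `[p for p in policies if p]`.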

In [22]:
class ComplianceChecker:
    """
        Checks a requirements document for statements that comply with
        a given list of policies, using an LLM over an indexed document.
    """

    _CHECK_PROMPT_TEMPLATE = PromptTemplate.from_template("""
     Which requirements sections talk about the policy '{policy}'? List in bullet points and also cite references.
    """.strip())

    _SYSTEM_PROMPT = """You are a very precise compliant checker.
For the policy provided, find the statements that DIRECTLY support and compliant with that policy.
    e.g: if policy is "Data shall be made publicly available, free in open, machine-readable formats [III.C.ii–iv]"
    a compliant statement for the above policy is A1.1.3 Adhere to community accepted standard machine readable data file formats
    a non-compliant statement for the above policy is A1.1.6  Adhere to community standard variable names, types, and unit(s), keywords
    returns list of compliant_statements objects.

    List all the compliants in bullet points. Strictly adhere to original statement text.
    Do not rewrite or rephrase. Do not repeat the original question in the answer.
"""

    def __init__(
        self,
        requirements_doc: str,
        docs: Optional[Docs] = None,
        prompt: PromptTemplate = _CHECK_PROMPT_TEMPLATE,
        system_prompt: str = _SYSTEM_PROMPT,
        llm: Optional[BaseLanguageModel] = None,
        debug: bool = False,
        ) -> None:
        self.debug = debug
        self.requirements_doc = requirements_doc
        self.prompt = prompt or self._CHECK_PROMPT_TEMPLATE
        self.system_prompt = system_prompt or self._SYSTEM_PROMPT

        llm = llm or ChatOpenAI(model="gpt-3.5-turbo", temperature=0.0)
        self.docs = docs or Docs(llm=llm, prompts=PromptCollection(system=self.system_prompt))

    def _build_indices(self) -> ComplianceChecker:
        return self.add_to_index(self.requirements_doc)

    def add_to_index(self, path: str) -> ComplianceChecker:
        if path:
            self.docs.add(path)

        if self.docs.texts_index is None:
            self._build_text_indices()
            
        return self

    def _build_text_indices(self):
        # dummy
        _ = self.docs.query("")
        return self.docs.texts_index

    def check(self, policies: Union[List[str], Tuple[str]]) -> List[Dict[str, str]]:
        return self._check(tuple(policies))
        
    # @functools.lru_cache
    def _check(self, policies: Tuple[str]) -> List[Dict[str, str]]:
        res = []
        for policy in tqdm(policies):
            if self.debug:
                logger.debug(f"Checking for policy={policy}")
            answer = self.docs.query(self.prompt.format(policy=policy))
            res.append(dict(policy=policy, section=answer.answer))
        return res


    def save_index(self, path) -> ComplianceChecker:
        with open(path, "wb") as f:
            pickle.dump(self.docs, f)
        return self

    def load_index(self, path) -> ComplianceChecker:
        with open(path, "rb") as f:
            self.docs = pickle.load(f)
        return self

    # @staticmethod
    # def postprocess_result(df: List[Dict[str, str]]) -> pd.DataFrame:

    @staticmethod
    def beautify_result(df: Union[pd.DataFrame, List[Dict[str, str]]]) -> pd.io.formats.style.Styler:
        if not isinstance(df, pd.DataFrame):
            df = pd.DataFrame(df)

        def highlight_rows(row):
            return ['background-color: red' if row['section_text'] == 'N/A' else ''] * len(row)

        if "section" in df:
            # Literal match: with regex=True the trailing "." would match any character.
            df["section"] = df["section"].str.replace("I cannot answer.", "N/A", regex=False)
            df["section"] = df["section"].apply(ComplianceChecker._italicize_references)
            df.rename(columns={"section": "section_text"}, inplace=True)
        df["section_text"] = df["section_text"].fillna("N/A")
        df["sections"] = df["section_text"].apply(lambda x: x.split("\n-"))
        df["sections"] = df.sections.apply(lambda x: list(map(ComplianceChecker._extract_section, x)))
            
        df = df.style.apply(highlight_rows, axis=1)
        
        return df

    @staticmethod
    def _extract_section(text: str) -> str:
        match = re.search(r'[A-Z]\d+(?:\.\d+)+[a-z]?', text)
        return match.group(0) if match else text
        
    @staticmethod
    def _italicize_references(text):
        pattern = r'\(([^()]*)\)'
        def italicize(match):
            references = match.group(1)
            italicized_references = f'(*{references}*)'
            # italicized_references = f'(<em>{references}</em>)'
            return italicized_references
        return re.sub(pattern, italicize, text)
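
The two static helpers can be tried in isolation. The sketch below copies their regular expressions into standalone functions and runs them on fragments of the model answers. It also adds a hypothetical `extract_all_sections` helper, which is not part of the class: it collects every identifier with `findall`, which may help when the model answers with a numbered list instead of bullets and the `"\n-"` split leaves everything in a single item.

```python
import re
from typing import List

# Same pattern as ComplianceChecker._extract_section.
SECTION_ID = re.compile(r'[A-Z]\d+(?:\.\d+)+[a-z]?')


def extract_section(text: str) -> str:
    """Return the first section identifier, or the text itself if none is found."""
    match = SECTION_ID.search(text)
    return match.group(0) if match else text


def italicize_references(text: str) -> str:
    """Wrap the content of each parenthesized reference in markdown italics."""
    return re.sub(r'\(([^()]*)\)', lambda m: f'(*{m.group(1)}*)', text)


def extract_all_sections(text: str) -> List[str]:
    """Hypothetical alternative: every identifier in order, without duplicates."""
    found: List[str] = []
    for section_id in SECTION_ID.findall(text):
        if section_id not in found:
            found.append(section_id)
    return found


bullet = ("- A1.1.14: 'Identify the most appropriate data license "
          "for the data product' (Modern2023 pages 9-10)")
print(extract_section(bullet))       # A1.1.14
print(italicize_references(bullet))  # ... (*Modern2023 pages 9-10*)

numbered = ("7. Requirement section A1.4.1: Using a clear data versioning scheme "
            "(*Modern2023 pages 12-14*). 8. Requirement section A2.1.1: Adhering "
            "to a standard metadata schema for data product (*Modern2023 pages 12-14*).")
print(extract_all_sections(numbered))  # ['A1.4.1', 'A2.1.1']
```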

In [23]:
SYSTEM_PROMPT = """
You are a very precise compliant checker.

For the policy provided, find the statements that DIRECTLY support and compliant with that policy.
    e.g: if policy is "Data shall be made publicly available, free in open, machine-readable formats [III.C.ii–iv]"
    a compliant statement for the above policy is A1.1.3 Adhere to community accepted standard machine readable data file formats
    a non-compliant statement for the above policy is A1.1.6  Adhere to community standard variable names, types, and unit(s), keywords
    returns list of compliant_statements objects.

    List all the compliants in bullet points.
    Strictly adhere to original statement text for each compliant statement. Do not rewrite or rephrase.
    Do not repeat or add the original question in the answer.
"""

In [24]:
CHECK_PROMPT = PromptTemplate.from_template("""
List the requirement sections that comply with the policy '{policy}'? Cite references for each.

Do not repeat or add original question text in the answer. Just list as mentioned.

If each item in the list doesn't have any associated statement number, try to rematch the policy to get the number.
""")

In [25]:
def check_compliance(
    requirements_doc,
    policy_type="SPD41a",
    system_prompt=SYSTEM_PROMPT,
    check_prompt=CHECK_PROMPT,
    model="gpt-4",
):
    # point to the json file where policies are found
    policy_json = "../scripts/policies.json"
    policies = get_policies(policy_json, policy_type=policy_type)
    
    # Use the document passed by the caller; the index itself comes from a prebuilt pickle.
    checker = ComplianceChecker(
        requirements_doc,
        docs=None,
        prompt=check_prompt,
        system_prompt=system_prompt,
        llm=ChatOpenAI(model=model, temperature=0.0),
    ).load_index("../tmp/mdgf-gpt-4.pkl")

    # if there's already a dump from previous check, just load that
    result = "../tmp/spd41_mdgf_check_gpt_4.csv"
    # result = None
    if result and isinstance(result, str):
        result = pd.read_csv(result)
    else:
        result = checker.check(policies)

    df = checker.beautify_result(result)
    display(df)
    return df
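
The "load the previous dump if it exists, otherwise run the check" step in `check_compliance` can be factored into a small helper. The following is a minimal sketch under stated assumptions: `load_or_compute` and `fake_check` are hypothetical names, and it caches to JSON with the standard library rather than to CSV with pandas, so that it runs without dependencies.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Result = List[Dict[str, str]]


def load_or_compute(cache_path: Path, compute: Callable[[], Result]) -> Result:
    """Return cached results if the file exists; otherwise compute and cache them."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    result = compute()
    cache_path.write_text(json.dumps(result))
    return result


calls = []


def fake_check() -> Result:
    # Stand-in for an expensive checker.check(policies) call.
    calls.append(1)
    return [{"policy": "example policy", "section": "- A1.1.3: example statement"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "check.json"  # hypothetical cache location
    first = load_or_compute(cache, fake_check)
    second = load_or_compute(cache, fake_check)  # served from the cache

print(len(calls))       # 1
print(first == second)  # True
```

Unlike the hard-coded path in the cell above, this version falls back to recomputing when the cache file is missing instead of raising.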

In [29]:
_ = check_compliance(
    policy_type="SPD41a",
    requirements_doc="../tmp/mdgf.pdf",
)

Unnamed: 0,policy,section_text,sections
0,"Data shall be made publicly available, free in open, machine-readable formats [III.C.ii–iv]","- Requirement A3.3.1 and B3.3.1a: All information and documents should be openly accessible, with public accessibility ensured using the Zenodo best practices guide (*Modern2023 pages 28-30*). - Requirement A3.3.2 and B3.3.2: All web content to have semantic annotations using RDFa Lite (*Modern2023 pages 28-30*). - Requirement A3.3.3 and B3.3.3: All documents should be assigned a persistent identifier, with Zenodo used if not assigned by NASA (*Modern2023 pages 28-30*). - Requirement A3.3.4, B3.3.4a, B3.3.4b, and B3.3.4c: Documents are assigned an open license, with open access selected in Zenodo, the Creative Commons Attribution 4.0 International license leveraged, and documents added to the IMPACT community in Zenodo (*Modern2023 pages 28-30*). - Requirement A1.1.3: Adherence to community accepted standard machine-readable data file formats (*Modern2023 pages 5-7*).","['A3.3.1', 'A3.3.2', 'A3.3.3', 'A3.3.4', 'A1.1.3']"
1,"SMD-funded data shall be reusable with a clear, open, and accessible data license [III.C.vii]. (If there are no other restrictions, SMD scientific data should be released with a Creative Commons Zero license.)","- A1.1.14: 'Identify the most appropriate data license for the data product' (*Modern2023 pages 9-10*) - B1.1.14: 'If there are no other restrictions, SMD scientific data should be released with a Creative Commons Zero license' (*Modern2023 pages 9-10*) - A3.3.4: 'Ensuring documents are assigned an open license' (*Modern2023 pages 28-30*) - B3.3.4a: 'Selecting 'open access' for Access rights in Zenodo' (*Modern2023 pages 28-30*) - B3.3.4b: 'Leveraging the Creative Commons Attribution 4.0 International license in Zenodo for open access documentation' (*Modern2023 pages 28-30*)","['A1.1.14', 'B1.1.14', 'A3.3.4', 'B3.3.4a', 'B3.3.4b']"
2,Publicly available SMD-funded data collections shall be citable using a persistent identifier [III.C.viii],"- Requirement A3.3.3: All documents created for the information identified in B3.2.1 should be assigned a persistent identifier (*Modern2023 pages 28-30*). - Procedure B3.3.3: Suggests obtaining a Zenodo account and publishing documents to Zenodo to get a DOI if not assigned by NASA (*Modern2023 pages 28-30*). - Requirement A4.2.2: Code is citable (*Modern2023 pages 32-34*). - Requirement B4.2.2b: Details the creation of a citation file for all code (*Modern2023 pages 32-34*). - Requirement A4.3.2: Ensures that the code has a persistent identifier and is discoverable with the data (*Modern2023 pages 32-34*). - Procedures B4.3.2a, B4.3.2b, and B4.3.2c: Provide further details on assigning a registered persistent identifier, adding the code identifier to the data product metadata, and adding the DOI to the Github citation file respectively (*Modern2023 pages 32-34*). - Requirement A4.4.1: Ensures a recommended citation is provided for the code (*Modern2023 pages 32-34*). - Procedure B4.4.1: Details the upload of the citation file to the GitHub repository (*Modern2023 pages 32-34*).","['A3.3.3', 'B3.3.3', 'A4.2.2', 'B4.2.2b', 'A4.3.2', 'B4.3.2a', 'A4.4.1', 'B4.4.1']"
3,"SMD-funded data shall include robust, standards-compliant metadata that clearly and explicitly describe the data [III.C.vi]","- A1.4.1: Use a clear data versioning scheme to ensure that users are aware of significant changes to data (*Modern2023 pages 12-14*) - B1.4.1: Provide users with data product change logs using a template similar to this example (*Modern2023 pages 12-14*) - A2.1.1: Adhere to a standard metadata schema for data product (*collection*) and file (*granule*) level metadata (*Modern2023 pages 12-14*) - B2.1.1: Utilize the UMM or STAC schema (*Modern2023 pages 12-14*) - DS B2.5.2b: Ensure up-to-date metadata (*Modern2023 pages 26-28*) - DS A2.5.3: Update metadata fields for schema compliance (*Modern2023 pages 26-28*) - A2.6.1: Monitor and report metadata quality metrics (*Modern2023 pages 26-28*) - A2.6.2: Update the metadata schema (*Modern2023 pages 26-28*) - A2.2.1: Provide metadata specification information in the collection level metadata, with references to UMM and STAC standards (*Modern2023 pages 16-17*) - A2.2.2: Provide the data product title in the collection level metadata, again with references to UMM and STAC (*Modern2023 pages 16-17*) - A2.2.3: Provide the DOI information in the collection level metadata, with specific instructions for UMM and STAC (*Modern2023 pages 16-17*) - A2.2.4: Provide the abstract information in the collection level metadata, with a focus on UMM (*Modern2023 pages 16-17*)","['A1.4.1', 'B1.4.1', 'A2.1.1', 'B2.1.1', 'B2.5.2b', 'A2.5.3', 'A2.6.1', 'A2.6.2', 'A2.2.1', 'A2.2.2', 'A2.2.3', 'A2.2.4']"
4,"SMD-funded data shall be findable, such that the data can be retrieved, downloaded, indexed, and searched [III.C.v]","- Requirement A2.2.1: Providing metadata specification information in the collection level metadata (*Modern2023 pages 16-17*) - Requirement A2.2.2: Providing the data product title in the collection level metadata (*Modern2023 pages 16-17*) - Requirement A2.2.3: Providing the DOI information in the collection level metadata (*Modern2023 pages 16-17*) - Requirement A2.2.4: Providing the abstract information in the collection level metadata (*Modern2023 pages 16-17*) - Requirement B1.1.5: Defining and documenting file naming conventions (*Modern2023 pages 7-9*) - Requirement A1.1.6 and B1.1.6: Adhering to community standard variable names, types, and units (*Modern2023 pages 7-9*) - Requirement A1.1.7 and B1.1.7: Adhering to community standards for coordinate systems (*Modern2023 pages 7-9*) - Requirement A1.1.8 and B1.1.8: Adhering to community standards for map projections (*Modern2023 pages 7-9*) - Requirement A1.1.9 and B1.1.9: Adhering to community standards for date and time formats (*Modern2023 pages 7-9*) - Requirement A1.1.10 and B1.1.10: Defining a data product versioning scheme (*Modern2023 pages 7-9*) - Requirement A1.1.13 and B1.1.13: Defining metrics to be collected along various dimensions including data use, data quality, data/information profile, data processing, ingest, and data access APIs/services (*Modern2023 pages 7-9*) - Requirement A1.3.5: Ensuring checksum and manifest files are available to users (*Modern2023 pages 12-14*) - Requirement A1.4.1: Using a clear data versioning scheme (*Modern2023 pages 12-14*) - Requirement A1.6.1: Collecting and monitoring data usage metrics (*Modern2023 pages 12-14*) - Requirement A2.1.1: Adhering to a standard metadata schema (*Modern2023 pages 12-14*) - Requirement A1.1.1: Defining a data flow diagram (*Modern2023 pages 5-7*) - Requirement A1.1.3: Adhering to community accepted standard machine readable data file formats (*Modern2023 pages 5-7*) - Requirement A1.1.5: Adhering to community best practices on data file naming conventions (*Modern2023 pages 5-7*)","['A2.2.1', 'A2.2.2', 'A2.2.3', 'A2.2.4', 'B1.1.5', 'A1.1.6', 'A1.1.7', 'A1.1.8', 'A1.1.9', 'A1.1.10', 'A1.1.13', 'A1.3.5', 'A1.4.1', 'A1.6.1', 'A2.1.1', 'A1.1.1', 'A1.1.3', 'A1.1.5']"
5,SMD-funded data collections shall be indexed as part of the NASA catalog of data [III.C.ix],,['N/A']
6,SMD-funded data should follow the FAIR Guiding Principles [III.C.i],"1. Requirement section A2.2.1: Citation requirements, which include author(*s*) or project name(*s*), date published, title, release version, repository, DOI, access date, and resource type (*Modern2023 pages 16-17*). 2. Requirement section A2.2.2: Metadata specification information should be provided in the collection level metadata (*Modern2023 pages 16-17*). 3. Requirement section A2.2.3: The data product title should be provided in the collection level metadata (*Modern2023 pages 16-17*). 4. Requirement section A2.2.4: DOI information should be included in the collection level metadata (*Modern2023 pages 16-17*). 5. Requirement section A4.3.1: Ensuring the code is openly accessible (*Modern2023 pages 32-34*). 6. Requirement section A4.3.2: Ensuring the code has a persistent identifier and is discoverable with the data (*Modern2023 pages 32-34*). 7. Requirement section A1.4.1: Using a clear data versioning scheme (*Modern2023 pages 12-14*). 8. Requirement section A2.1.1: Adhering to a standard metadata schema for data product (*Modern2023 pages 12-14*). 9. Requirement section A3.3.1: Making all information and documents openly accessible (*Modern2023 pages 28-30*). 10. Requirement section A3.3.3: Assigning documents a persistent identifier (*Modern2023 pages 28-30*).",['A2.2.1']
7,The SMD repository provides documentation on policies for data retention [Appendix D.11.],"- A4.5.1: Requires the preservation of code associated with the data product (*Modern2023 pages 34-36*) - B4.5.1a and B4.5.1b: State that code for generating and reading data products should be included in the AIP (*Modern2023 pages 34-36*) - A5.1.5 and B5.1.5: Discuss defining a storage policy for retention, including rules for defining which storage class to use and when to change it (*Modern2023 pages 34-36*) - A5.1.6 and B5.1.6: Define a storage policy for retiring data, suggesting that retired datasets should be moved to cold storage or deleted (*Modern2023 pages 34-36*) - A3.1.2 and B3.1.2a: Discuss defining how information will be retained and creating a retention plan for documentation and digital content (*Modern2023 pages 28-30*) - B3.1.2b: Ensures that the scope for the Archived Info Package includes information and digital content about the data (*Modern2023 pages 28-30*) - A1.1.12 and B1.1.12: Discuss the development of a data retention plan, including a process for when and how data will be sunset, and the creation of a data retention plan that includes information about the end of project preservation plan and rolling archive plans (*Modern2023 pages 7-9*) - A5.2.1: Discusses creating the S3 Bucket structure (*Modern2023 pages 36-39*) - A5.2.2: Involves creating data storage environments as per the storage plan (*Modern2023 pages 36-39*) - A5.2.3: Is about creating a storage policy for retention rules (*Modern2023 pages 36-39*) - A5.5.1: Adheres to the storage policy for retiring data, suggesting that retired data should be moved to cold storage or deleted (*Modern2023 pages 36-39*)","['A4.5.1', 'B4.5.1a', 'A5.1.5', 'A5.1.6', 'A3.1.2', 'B3.1.2b', 'A1.1.12', 'A5.2.1', 'A5.2.2', 'A5.2.3', 'A5.5.1']"
8,"If there are no other restrictions, publicly available SMD-funded software should be released under a permissive license that has broad acceptance in the community [III.D.iii]","- The requirement section that mentions the selection of an open, permissive code license in line with open science requirements (*A4.1.3*) (*Modern2023 pages 30-32*). - The section that suggests using a permissive license with broad acceptance in the science community to share code, including the Apache License 2.0, the BSD 3-Clause “Revised” License, and the MIT License (*B4.1.3*) (*Modern2023 pages 30-32*). - The section that includes ensuring the code is openly accessible (*A4.3.1*) (*Modern2023 pages 32-34*). - The section that mentions setting the code repository to ‘public’ in GitHub (*B4.3.1*) (*Modern2023 pages 32-34*).","['A4.1.3', 'B4.1.3', 'A4.3.1', 'B4.3.1']"
9,Publicly available SMD-funded software shall be citable using a persistent identifier [III.D.vii],"- A4.2.2: Ensure that the code is citable (*Modern2023 pages 32-34*) - B4.2.2a: Create a clear, descriptive name for the code repo (*Modern2023 pages 32-34*) - B4.2.2b: Create a citation file for all code with information identified in B4.1.6 (*Modern2023 pages 32-34*) - A4.3.2: Ensure that the code has a persistent identifier and is discoverable with the data (*Modern2023 pages 32-34*) - B4.3.2a: Assign a registered persistent identifier to the code repository (*Modern2023 pages 32-34*) - B4.3.2b: Add the code identifier to the data product metadata (*Modern2023 pages 32-34*) - B4.3.2c: Add the DOI to the Github citation file (*Modern2023 pages 32-34*) - A4.4.1: Provide a recommended citation for the code (*Modern2023 pages 32-34*) - B4.4.1: Upload the citation file to the GitHub repository and ensure the 'Cite This Repository' link is functioning (*Modern2023 pages 32-34*)","['A4.2.2', 'B4.2.2a', 'B4.2.2b', 'A4.3.2', 'B4.3.2a', 'B4.3.2b', 'B4.3.2c', 'A4.4.1', 'B4.4.1']"
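
To get a quick overview of a table like the one above, one option is to parse the `sections` column and list the policies for which no requirement was matched. The sketch below is one possible approach, not part of the notebook: it copies three rows from the output by hand (policy text and `sections` value only) and assumes the column is stored as a Python-literal string, as it appears in the CSV.

```python
import ast

# (policy, sections) pairs copied from rows 5, 6 and 9 of the output above.
rows = [
    ("SMD-funded data collections shall be indexed as part of the NASA "
     "catalog of data [III.C.ix]",
     "['N/A']"),
    ("SMD-funded data should follow the FAIR Guiding Principles [III.C.i]",
     "['A2.2.1']"),
    ("Publicly available SMD-funded software shall be citable using a "
     "persistent identifier [III.D.vii]",
     "['A4.2.2', 'B4.2.2a', 'B4.2.2b', 'A4.3.2', 'B4.3.2a', 'B4.3.2b', "
     "'B4.3.2c', 'A4.4.1', 'B4.4.1']"),
]

# literal_eval safely turns the stored string back into a list of identifiers.
parsed = [(policy, ast.literal_eval(sections)) for policy, sections in rows]

unmatched = [policy for policy, sections in parsed if sections == ["N/A"]]
counts = {policy: len(sections) for policy, sections in parsed if sections != ["N/A"]}

print(unmatched)              # the NASA catalog indexing policy
print(list(counts.values()))  # [1, 9]
```

The count of 1 for the FAIR policy understates its answer: that response came back as a numbered list, so splitting on `"\n-"` left it as a single item.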
