# CrewAI sequential workflow of control analysis

Previously, we used a chat session (with context) to analyze a PDF and match it against a
control from the NIST SP 800-53 standard (specifically CA-5). In this notebook, we explore turning
that chat session into a CrewAI workflow.

In this particular experiment, we will pull in a specific EPA PDF, convert it to text (using
Python's PyMuPDF library) and use a single Agent with a sequential workflow modeled after the chat session.

### Install Notes

To run this you will need:

```pip install PyMuPDF openpyxl crewai openai pinecone pandas```

In [1]:
import openai
from pinecone import Pinecone
import pandas as pd
import fitz
import textwrap
import os 
import re
from crewai import Agent, Task, Crew, Process

### Define OpenAI Environment

In [2]:
os.environ["OPENAI_MODEL_NAME"] = "gpt-4o-mini"
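
The cells below need credentials as well as the model name: the Pinecone key is fetched with `os.getenv` further down, and the OpenAI calls need an API key. Checking up front avoids a failure partway through the loop. A minimal sketch, assuming the three variable names below are the ones this notebook relies on; `missing_env_vars` is a hypothetical helper, not part of any library:

```python
import os

# Variables this notebook is assumed to rely on (adjust to your setup)
REQUIRED_VARS = ["OPENAI_API_KEY", "OPENAI_MODEL_NAME", "PINECONE_API_KEY"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.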

## Create an Array of Controls to Analyze

Here we pull in the Excel spreadsheet of NIST SP 800-53 controls and extract the specific controls that we are interested in 
evaluating. Unfortunately, the Excel spreadsheet does not preserve the indentation of the Control Text, so we 
restore it with a few regular expressions.
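
To make the indentation step concrete, here is a small standalone sketch run on a made-up sample. It assumes the catalog marks nesting levels as `a.`, then `1.`, then `(a)`; the sample text and the name `indent_control_text` are illustrative only:

```python
import re

def indent_control_text(text):
    """Indent list markers by nesting level: 'a.' by 2, '1.' by 4, '(a)' by 6 spaces."""
    # The dot is escaped so that ordinary words starting a line are left alone
    text = re.sub(r"^([a-z]\.)", r"  \1", text, flags=re.MULTILINE)
    text = re.sub(r"^(\d+\.)", r"    \1", text, flags=re.MULTILINE)
    text = re.sub(r"^(\([a-z]\))", r"      \1", text, flags=re.MULTILINE)
    return text

# Made-up sample, not taken from the catalog
sample = (
    "Require the developer to:\n"
    "a. Perform configuration management;\n"
    "1. Review annually;\n"
    "(a) Record results;\n"
    "as needed."
)

formatted = indent_control_text(sample)
print(formatted)
```

The last sample line starts with a lowercase word rather than a list marker, so it stays flush left.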

In [3]:
# Loading the controls from an Excel file
controls_df = pd.read_excel(r"../resources/NIST_SP-800-53/sp800-53r5-control-catalog.xlsx" )

results = []  # To collect results for each control

# Properly indent control text by nesting level ("a." < "1." < "(a)")
def format_control_text(control_text):
    # Escape the dot so only literal list markers such as "a." or "1." match
    control_text = re.sub(r"^([a-z]\.)", r"  \1", control_text, flags=re.MULTILINE)
    control_text = re.sub(r"^(\d+\.)", r"    \1", control_text, flags=re.MULTILINE)
    control_text = re.sub(r"^(\([a-z]\))", r"      \1", control_text, flags=re.MULTILINE)
    return control_text

# Look up each requested control ID in the DataFrame and build its full text
def get_selected_controls(control_ids):
    selected_controls = []
    for control_id in control_ids:
        # Find the row in the DataFrame that matches the control ID
        row = controls_df.loc[controls_df['Control Identifier'] == control_id]
        if not row.empty:
            # Assuming each Control Identifier in the DataFrame is unique
            row = row.iloc[0]
            Control_Identifier = textwrap.dedent(row['Control Identifier'])
            Control_Name = textwrap.dedent(row['Control (or Control Enhancement) Name']).replace('| ', '')
            Control_Text = format_control_text(row['Control Text'])
            
            Full_Control = f"{Control_Identifier}\n\n{Control_Name.upper()} Control:\n\n{Control_Text}\n\n" 
            selected_controls.append(Full_Control)
        else:
            print(f"No Control found for ID: {control_id}")
    
    return selected_controls

# Control IDs selected by the end user
user_selected_ids = ['SA-10', 'CM-2', 'CP-7', 'IA-4', 'IR-6']  
selected_controls_text = get_selected_controls(user_selected_ids)

# Print the Selected Controls
print("\n---\n".join(selected_controls_text))

SA-10

DEVELOPER CONFIGURATION MANAGEMENT Control:

Require the developer of the system, system component, or system service to:
  a. Perform configuration management during system, component, or service [Selection (one or more): design; development; implementation; operation; disposal];
  b. Document, manage, and control the integrity of changes to [Assignment: organization-defined configuration items under configuration management];
  c. Implement only organization-approved changes to the system, component, or service;
  d. Document approved changes to the system, component, or service and the potential security and privacy impacts of such changes; and
  e. Track security flaws and flaw resolution within the system, component, or service and report findings to [Assignment: organization-defined personnel].


---
CM-2

BASELINE CONFIGURATION Control:

  a. Develop, document, and maintain under configuration control, a current baseline configuration of the system; and
  b. Review and up
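
The bracketed placeholders in the printed controls (`[Assignment: ...]`, `[Selection (one or more): ...]`) are the properties the agent is later asked to fill in. Listing them programmatically gives a checklist to compare against the agent's answer. A rough sketch, assuming placeholders are never nested inside one another; `extract_placeholders` is a hypothetical helper:

```python
import re

# Matches "[Assignment: ...]" or "[Selection ...: ...]" without nested brackets
PLACEHOLDER_RE = re.compile(r"\[(Assignment|Selection[^:\]]*):\s*([^\[\]]+)\]")

def extract_placeholders(control_text):
    """Return (kind, description) pairs for each bracketed placeholder."""
    return [
        (kind.strip(), body.strip())
        for kind, body in PLACEHOLDER_RE.findall(control_text)
    ]

# Two items copied from the SA-10 text printed above
sample = (
    "a. Perform configuration management during system, component, or service "
    "[Selection (one or more): design; development; implementation; operation; disposal];\n"
    "b. Document, manage, and control the integrity of changes to "
    "[Assignment: organization-defined configuration items under configuration management];"
)

placeholders = extract_placeholders(sample)
for kind, body in placeholders:
    print(f"{kind}: {body}")
```

Controls that nest an assignment inside a selection would need a proper parser rather than a single regular expression.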

In [4]:
# Configure the pinecone database that holds our policy documents
index_name = "dylantest"
embed_model="text-embedding-ada-002"
pinecone_api_key = os.getenv("PINECONE_API_KEY")
pinecone = Pinecone(api_key=pinecone_api_key)
index = pinecone.Index(index_name)



In [5]:
# Loop through each control in the List
for row in selected_controls_text:
    # Each entry is the fully formatted control text (identifier, name, and body)
    Control = row

    # TODO: Each control also has a "Discussion" that we can potentially use to improve
    # the segment search results and "decision making" by the LLM
    #Discussion = textwrap.dedent(row['Discussion'])

    # Using the control, search the pinecone database for document segments that pertain
    # to the control
    qe = openai.embeddings.create(input=Control, model=embed_model)
    res = index.query(vector=qe.data[0].embedding, top_k=5, include_metadata=True)
    policy_text = "\n\n---\n\n".join([f"Document: {r['metadata']['document']}\n\n{r['metadata']['text']}" for r in res['matches']])
    # print(policy_text)
    
    # Create the Investigator agent
    Investigator = Agent(
        role='Investigator',
        goal=textwrap.dedent("""
          We are conducting an examination of the IT process and policy documents of a system for adherence
          to NIST SP 800-53 security controls.
          
          We need to do two key things: 
            1. Gather evidence (if available) of the implementation of a security control. 
            2. Identify the values for properties (variables) in the control from the evidence.
               The following are segments of IT process and policy documents found from search
               results based upon the control we are examining.
        """),
        verbose=True,
        memory=True,
        backstory=textwrap.dedent("""
          You are a seasoned Investigator tasked with determining if the provided NIST 800-53 security 
          controls are present in the policy documentation. You are skilled at understanding and analyzing
          software policy. If something is unclear or if the evidence is "weak", you believe it is your duty
          to point out inconsistencies or incomplete policies.
        """),
        tools=[]
    )
    
    #
    # Task for the Investigator to search the policy documents for evidence of the implementation
    # of a particular control
    #
    Gather_Evidence = Task(
        description=textwrap.dedent(f"""
      The following are segments of IT process and policy documents found from search results based upon
      the control we are examining:
      
      ---
      
      {policy_text}
      
      ---
      
      The control we are assessing is: 
      
      {Control}
      
      Please find evidence (if present) from the IT security and policy documents that address
      this control. For the evidence you found, please provide:
        * The document name
        * A section number or identifier (if available)
        * A quote of the evidence from the document.
        """),
        expected_output="A list of evidence.",
        agent=Investigator,
    )
    
    #
    # Task for the Investigator to populate properties from the control that are fulfilled by the
    # policy evidence.
    #
    Evaluate_Control = Task(
        description=textwrap.dedent(f"""
      Here is the Control again:
      
      {Control}

      The bracketed sections (within [ and ] in the control) represent properties that
      should be specified in the evidence if the control is fully satisfied. Please do not
      create properties that are not specified within brackets. For each
      property in the control please:
        1. List the property from the control (using the name given by Assignment).
        2. A quote from the supporting evidence that provides the value of that property (if available).
        3. The value of the property extracted from the evidence (if available)
    
      Then please tell us yes or no if you feel that the control is addressed by the policy documents.
        """),
        expected_output='A list of properties, if the control is satisfied, and why.',
        agent=Investigator,
    )
    
    # 
    # Task for the Investigator to combine the output of the previous tasks and format it into an
    # XML report
    #
    Format_Evidence = Task(
        description=textwrap.dedent(f"""
      Here is the Control again:
      
      {Control}

      The bracketed sections (within [ and ] in the control) represent properties that
      should be specified in the evidence if the control is fully satisfied. Please do not
      create properties that are not specified within brackets.

      Please put the results of your examination into the following XML format:
      
      <CONTROL>
        <CONTROL_ID></CONTROL_ID>
        <CONTROL_TITLE></CONTROL_TITLE>
        <EVIDENCE>
          <DOC_TITLE></DOC_TITLE>
          <DOC_SECTION></DOC_SECTION>
          <DOC_QUOTE></DOC_QUOTE>
        </EVIDENCE>
        <PARAMETER>
          <NAME></NAME>
          <VALUE></VALUE>
        </PARAMETER>
        <CONTROL_IS_SATISFIED></CONTROL_IS_SATISFIED>
        <CONTROL_SATISFACTION_RATIONAL></CONTROL_SATISFACTION_RATIONAL>
      </CONTROL>

      Please output the XML by itself (do not provide any surrounding text nor code block quotes).
        """),
        expected_output='A properly formatted XML file with the evaluation results.',
        context=[Gather_Evidence, Evaluate_Control],
        agent=Investigator,
    )
    
    # Forming the crew and defining the process
    crew = Crew(
        agents=[Investigator],
        tasks=[Gather_Evidence, Evaluate_Control, Format_Evidence],
        process=Process.sequential
    )

    # Kick off the crew
    result = crew.kickoff()

    # Collect or aggregate results
    results.append(result)

# Print the result
for result in results:
    print(result)
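
Since the final task asks for bare XML, each result can be loaded with the standard library for downstream reporting. A minimal sketch, assuming the model returns well-formed XML matching the template in the `Format_Evidence` task; the sample string and the `parse_control_report` helper are hypothetical:

```python
import xml.etree.ElementTree as ET

def parse_control_report(xml_text):
    """Parse one <CONTROL> report into a plain dict; raises ET.ParseError on bad XML."""
    root = ET.fromstring(xml_text.strip())
    return {
        "control_id": root.findtext("CONTROL_ID", default="").strip(),
        "title": root.findtext("CONTROL_TITLE", default="").strip(),
        # findall tolerates the model repeating <EVIDENCE> and <PARAMETER> blocks
        "evidence": [
            {
                "document": ev.findtext("DOC_TITLE", default="").strip(),
                "section": ev.findtext("DOC_SECTION", default="").strip(),
                "quote": ev.findtext("DOC_QUOTE", default="").strip(),
            }
            for ev in root.findall("EVIDENCE")
        ],
        "parameters": {
            p.findtext("NAME", default="").strip(): p.findtext("VALUE", default="").strip()
            for p in root.findall("PARAMETER")
        },
        "satisfied": root.findtext("CONTROL_IS_SATISFIED", default="").strip(),
        "rationale": root.findtext("CONTROL_SATISFACTION_RATIONAL", default="").strip(),
    }

# Made-up sample in the shape requested by the Format_Evidence task
sample_xml = """
<CONTROL>
  <CONTROL_ID>CM-2</CONTROL_ID>
  <CONTROL_TITLE>Baseline Configuration</CONTROL_TITLE>
  <EVIDENCE>
    <DOC_TITLE>example_procedure.pdf</DOC_TITLE>
    <DOC_SECTION>CM-2</DOC_SECTION>
    <DOC_QUOTE>Review and update the baseline configuration annually.</DOC_QUOTE>
  </EVIDENCE>
  <PARAMETER>
    <NAME>frequency</NAME>
    <VALUE>Annually</VALUE>
  </PARAMETER>
  <CONTROL_IS_SATISFIED>Yes</CONTROL_IS_SATISFIED>
  <CONTROL_SATISFACTION_RATIONAL>Example rationale.</CONTROL_SATISFACTION_RATIONAL>
</CONTROL>
"""

report = parse_control_report(sample_xml)
print(report["control_id"], report["satisfied"], report["parameters"])
```

In the loop above this could be called on the text of each `result` (for example via `str(result)`, which is an assumption about the object `crew.kickoff()` returns). Catching `ET.ParseError` is a simple way to flag responses where the model wrapped the XML in extra text.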
    



> Entering new CrewAgentExecutor chain...
I now can give a great answer  
Final Answer: 

**Evidence for Control SA-10 – Developer Configuration Management**

1. **Document Name**: ../resources/EPA_Policy_Example/information_security_system_and_services_acquisition_procedure.pdf  
   **Section Number/Identifier**: SA-10 – Developer Configuration Management  
   **Quote**: "Require the developer of the system, system component, or system service to:  
   a) Perform configuration management during system, component, or service design, development, implementation, operation, and disposal;  
   b) Document, manage and control the integrity of changes to configuration items defined in the Configuration Management Plan referenced in CM-9;  
   c) Implement only organization-approved changes to the system, component or service;  
   d) Document approved changes to the system, component, or service and potential security and privacy impacts of such changes;  
   e) Track





> Entering new CrewAgentExecutor chain...
I now can give a great answer  
Final Answer: 

1. **Document Name:** ../resources/EPA_Policy_Example/information_security_configuration_management_procedure.pdf  
   **Section Number/Identifier:** CM-2 – Baseline Configuration  
   **Quote of Evidence:** "1) Develop, document, and maintain under configuration control, a current baseline configuration of the system; and 2) Review and update the baseline configuration of the system: a) Annually; b) When required due to significant changes to the system; and c) When system components are installed or upgraded."

2. **Document Name:** ../resources/EPA_Policy_Example/information_security_configuration_management_procedure.pdf  
   **Section Number/Identifier:** Purpose  
   **Quote of Evidence:** "The purpose of this procedure is to facilitate the implementation of Environmental Protection Agency (EPA) security control requirements for the Configuration Management (CM) contro





> Entering new CrewAgentExecutor chain...
I now can give a great answer.

Final Answer:
1. **Document Name**: ../resources/EPA_Policy_Example/information_security contingency planning _procedure.pdf  
   **Section Number/Identifier**: CP-7 – Alternate Processing Site  
   **Quote of Evidence**:  
   "1) Establish an alternate processing site, including necessary agreements to permit the transfer and resumption of information system operations necessary for essential mission and business functions within the Recovery Time Objective (RTOs) and Recovery Point Objectives (RPOs) established in the Contingency Plan and BIA when the primary processing capabilities are unavailable;  
   2) Make available at the alternate processing site, the equipment and supplies required to transfer and resume operations or put contracts in place to support delivery to the site within the organization-defined time period for transfer and resumption; and  
   3) Provide controls at the 





> Entering new CrewAgentExecutor chain...
I now can give a great answer  
Final Answer: 

1. **Document Name:** ../resources/EPA_Policy_Example/information_security_identification_and_authentication_procedure.pdf  
   **Section Number/Identifier:** IA-4 – Identifier Management  
   **Quote of the Evidence from the Document:**  
   "1) SOs, in coordination with ISOs, IMOs, IOs, ISSOs, CCPs, and SCAs, for EPA-operated systems shall; and SMs, in coordination with IOs, ISOs, IMOs, ISSOs, CCPs, and SCAs, for systems operated on behalf of the EPA, shall ensure service providers:  
   a) Receive authorization from a designated EPA official (e.g., system administrator, technical lead or system owner) to assign individual, group, role, or device identifiers.  
   b) Select and assign information system identifiers that uniquely identify an individual, group, role, or device.  
   i) Assignment of individual, group, role, or device identifiers shall ensure that no two user





> Entering new CrewAgentExecutor chain...
I now can give a great answer  
Final Answer: 

1. **Document Name:** ../resources/EPA_Policy_Example/incident_response_procedures_0.pdf  
   **Section Number/Identifier:** IR-6 – Incident Reporting  
   **Quote of Evidence:**  
   "1) Require personnel to report suspected incidents to the organizational incident response capability within one (1) hour of identification; and  
   2) Report incident information to the Computer Security Incident Response Capability (CSIRC) by contacting the Enterprise IT Service Desk (EISD) (sending email to EISD@epa.gov or calling 1-866-411-4-EPA) and the ISO at their respective sites."

2. **Document Name:** ../resources/EPA_Policy_Example/spillage_classified_info_unclassified_systems_procedure_20190824_508_vwn.pdf  
   **Section Number/Identifier:** Section 7.6 Incident Reporting  
   **Quote of Evidence:**  
   "1) Per EPA Incident Response (IR) procedures, security incident information