# Cyber Developer Day 2024

## Introduction

Generative AI (GenAI) and Large Language Models (LLMs) are becoming essential tools in cybersecurity in part due to their ability to enhance the efficiency of cyber threat detection and response by accelerating analyst workflows.

### Problem Statement:
Determining the impact of a documented Common Vulnerabilities and Exposures (CVE) on a specific project or container is a labor-intensive and manual task. This process involves the collection, comprehension, and synthesis of various pieces of information to ascertain whether immediate remediation, such as patching, is necessary upon the identification of a new CVE. Often, reasons cited for not updating a library affected by a CVE include the occurrence of scan false positives, the existence of mitigating factors, or the absence of necessary environments or dependencies for the exploit to be viable. Once an analyst has determined the library is not affected, a Vulnerability Exploitability eXchange (VEX) document must be created to standardize and distribute the results. The efficiency of this process can be significantly enhanced through the deployment of an event-driven LLM agent pipeline.

### Tutorial Goals:
Our team developed a cybersecurity vulnerability analysis tool to aid in assessing the exploitability of CVEs in specific projects and containers. The following tutorial shows, step by step, how to leverage LLMs, retrieval-augmented generation (RAG), and agents to create a toy version of that tool and a microservice that runs LLM-powered CVE exploitability analysis.
Students can experiment with the different modules to expand on this use case or leverage these modular pieces to construct a new pipeline of their own.

### Outline
#### Part 1 - Intro to Interacting with LLMs
    a. Python calls to LLM API
    b. Prompt engineering
    c. Prompt templating
    d. One-shot learning
    e. Multi-shot learning
    f. Fine-tuning LLMs
    
#### Part 2 - Prototyping
    a. Building a Vector Database
    b. Running a RAG pipeline with Morpheus
    c. Running the CVE Pipeline with Morpheus
    
#### Part 3 - Beyond Prototyping
    a. Improving the Model
    b. Batching Requests
    c. Creating a Microservice

<div class="alert alert-block alert-info">
<b>Note:</b> Please continue running the notebook up to Part 1 during the introduction presentation to ensure your environment is set up correctly.
</div>

### Environment Setup

The following code blocks set up the environment variables and imports used in the rest of the notebook.

In [None]:
%load_ext autoreload
%aimport -logging
%autoreload 2

# Ensure that the morpheus directory is in the python path. This may not need to be run depending on the environment setup
import sys
import os

if ("MORPHEUS_ROOT" not in os.environ):
    os.environ["MORPHEUS_ROOT"] = os.path.abspath("../../..")

llm_dir = os.path.abspath(os.path.join(os.getenv("MORPHEUS_ROOT", "../../.."), "examples", "cyber_dev_day"))

if (llm_dir not in sys.path):
    sys.path.append(llm_dir)

Ensure the necessary environment variables are set. As a last resort, try to load them from a `.env` file.


In [None]:
# Ensure that the current environment is set up with API keys
required_env_vars = [
    "MORPHEUS_ROOT",
    "NGC_API_KEY",
    "NVIDIA_API_KEY",
]

if (not all([var in os.environ for var in required_env_vars])):

    # Try loading an .env file if it exists
    from dotenv import load_dotenv

    load_dotenv()

    # Check again
    if (not all([var in os.environ for var in required_env_vars])):
        raise ValueError(f"Please set the following environment variables: {required_env_vars}")
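The check above reports every required variable when it fails, including the ones that are already set. A minimal sketch of a variant that reports only the missing names follows; `find_missing_env_vars` is a hypothetical helper written for this illustration and is not part of the `cyber_dev_day` module.

```python
import os


def find_missing_env_vars(required, environ=None):
    """Return the names in `required` that are absent from `environ`."""
    # Default to the real process environment; a plain dict can be passed for testing
    environ = os.environ if environ is None else environ
    return [name for name in required if name not in environ]


# Example with a stand-in environment (the values are placeholders, not real keys)
example_env = {"MORPHEUS_ROOT": "/workspace", "NGC_API_KEY": "placeholder"}
missing = find_missing_env_vars(["MORPHEUS_ROOT", "NGC_API_KEY", "NVIDIA_API_KEY"], example_env)
print(missing)
```

Passing the resulting list into the `ValueError` message would tell the reader exactly which keys still need to be exported.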


In [None]:
# Global imports
import os
import sys
import cudf
import pandas as pd
import ast

Configure logging to allow Morpheus messages to appear in the notebook.

In [None]:
# Configure logging
import logging
import cyber_dev_day

# Create a logger for this module. Use the cyber_dev_day module name because the notebook will just be __main__
logger = logging.getLogger(cyber_dev_day.__name__)

# Configure the root logger log level
logger.root.setLevel(logging.INFO)

In [None]:
# Test the logger out. You should see the log message in the output
logger.info("Successfully configured logging!")

<div class="alert alert-block alert-warning">
<b>Note:</b> Please wait here until instructed to continue with running Part 1 of the notebook.
</div>

## Part 1 - Intro to Interacting with LLMs

This section will go over how to integrate LLMs into code with Python based examples.

We will cover:

    a. Python calls to LLM API
    b. Prompt engineering
    c. Prompt templating
    d. One-shot learning
    e. Multi-shot learning
    f. Fine-tuning LLMs

In [None]:
from nemollm.api import NemoLLM
from pprint import pprint

### Part 1a - single call to LLM API
#### Use case - general cyber knowledge assistant, productivity tool to aid cyber analysts

`Query: In general, how can I determine if my specific environment is affected by a CVE?`

In [None]:
# NGC_ORG_ID is read here but is not in the required_env_vars check above, so make sure it is set
conn = NemoLLM(api_key=os.getenv("NGC_API_KEY"), org_id=os.getenv("NGC_ORG_ID"))

In [None]:
response = conn.generate(
  prompt="In general, how can I determine if my specific environment is affected by a CVE?",
  model="gpt-43b-002",
  stop=[],
  tokens_to_generate=128,
  temperature=0,
  top_k=1,
  top_p=0.9,
  random_seed=0,
  beam_search_diversity_rate=0.0,
  beam_width=1,
  repetition_penalty=1.0,
  length_penalty=1.0,
)
response["text"]

### *Explore on your own*

#### Try another model such as `gpt-8b-000`, `gpt20b`, or `llama2-70b-HF`

In [None]:
response = conn.generate(
  prompt="In general, how can I determine if my specific environment is affected by a CVE?",
  model="gpt-8b-000",
  stop=[],
  tokens_to_generate=128,
  temperature=0,
  top_k=1,
  top_p=0.9,
  random_seed=0,
  beam_search_diversity_rate=0.0,
  beam_width=1,
  repetition_penalty=1.0,
  length_penalty=1.0
)
response["text"]

### *Explore on your own*
##### How do the different models compare? Can you change the parameters (like `temperature` or `repetition_penalty`) to help the smaller models improve?
##### What are some other cybersecurity questions you could ask an LLM to upskill a junior analyst?
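To build intuition for the `temperature` parameter before experimenting, here is a small sketch of the textbook formulation: token scores are divided by the temperature before the softmax. How the hosted service implements sampling internally is not shown in this notebook, so treat this as an illustration only. The scores in `toy_logits` are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is usually handled as greedy decoding")
    scaled = [x / temperature for x in logits]
    # Subtract the max for numerical stability before exponentiating
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(toy_logits, 0.5)  # low temperature: mass concentrates on the top token
flat = softmax_with_temperature(toy_logits, 2.0)   # high temperature: mass spreads out
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

Note that the calls above use `top_k=1`, which keeps only the single most likely token, so changing `temperature` alone will typically have no visible effect; raise `top_k` as well when exploring.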

### Part 1b - prompt engineering using personas

Some tips for improving performance using prompt engineering can be found here https://www.promptingguide.ai/introduction/tips


`Persona: You are a helpful cybersecurity analyst with an IQ of 140.`

`Query: In general, how can I determine if my specific environment is vulnerable to a CVE?`

In [None]:
prompt_template = "{persona} {query}"

formatted_prompt = prompt_template.format(persona="You are a helpful cybersecurity expert with an IQ of 140.",
                                          query="In general, how can I determine if my specific environment is vulnerable to a CVE?")


response = conn.generate(
  prompt= formatted_prompt,
  model="gpt-43b-002",
  stop=[],
  tokens_to_generate=256,
  temperature=0,
  top_k=1,
  top_p=0.9,
  random_seed=0,
  beam_search_diversity_rate=0.0,
  beam_width=1,
  repetition_penalty=1.0,
  length_penalty=1.0
);

response["text"]

### *Explore on your own*
##### Does the persona improve performance?
##### What happens if you change the persona or attributes such as IQ?

### Part 1c - prompt template to include specific CVE context for the model

Can our LLM help with specific CVEs?

`Query: How can I determine if my specific environment is affected by CVE-2023-47248?`

In [None]:
prompt_template = "{persona} {query}"

formatted_prompt = prompt_template.format(persona="You are a helpful cybersecurity expert with an IQ of 140.",
                                          query="How can I determine if my specific environment is affected by CVE-2023-47248?")


response = conn.generate(
  prompt= formatted_prompt,
  model="gpt-43b-002",
  stop=[],
  tokens_to_generate=256,
  temperature=0,
  top_k=1,
  top_p=0.9,
  random_seed=0,
  beam_search_diversity_rate=0.0,
  beam_width=1,
  repetition_penalty=1.0,
  length_penalty=1.0
);

response["text"]

##### What are some methods to get up-to-date CVE knowledge to our model?

##### Can we add it to the prompt?



In [None]:
prompt_template = "Generate a checklist for a security analyst to use when assessing the exploitability of a specific CVE within a containerized environment. \
For each checklist item, start with an action verb, making it clear and actionable. Provide the checklist as a Python list of strings. \
Utilize the provided CVE details below to tailor the checklist items specifically for this CVE. \
CVE Details: \
- CVE ID: {cve} \
- Description: {cve_description} \
- Vulnerable Package Name: {vuln_package} \
- CVSS3 Vector String: {cvss3}"

formatted_prompt = prompt_template.format(cve = "CVE-2023-47248",
                                          cve_description= "Deserialization of untrusted data in IPC and Parquet readers in PyArrow versions 0.14.0 to 14.0.0 \
                                          allows arbitrary code execution. An application is vulnerable if it reads Arrow IPC, Feather or Parquet data from untrusted sources \
                                          (for example user-supplied input files). This vulnerability only affects PyArrow, not other Apache Arrow implementations or bindings. \
                                          It is recommended that users of PyArrow upgrade to 14.0.1. Similarly, it is recommended that downstream libraries upgrade their dependency \
                                          requirements to PyArrow 14.0.1 or later. PyPI packages are already available, and we hope that conda-forge packages will be available soon. \
                                          If it is not possible to upgrade, we provide a separate package `pyarrow-hotfix` that disables the vulnerability on older PyArrow versions. \
                                          See https://pypi.org/project/pyarrow-hotfix/ for instructions.",
                                          vuln_package="PyArrow",
                                          cvss3="CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")

response = conn.generate(
  prompt= formatted_prompt,
  model="gpt-43b-002",
  stop=[],
  tokens_to_generate=128,
  temperature=0,
  top_k=1,
  top_p=0.9,
  random_seed=0,
  beam_search_diversity_rate=0.0,
  beam_width=1,
  repetition_penalty=1.0,
  length_penalty=1.0
);

response["text"]
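One detail worth noticing in the cell above: the description is a regular string continued with backslashes, so the indentation at the start of each continued line becomes part of the prompt. A minimal sketch of collapsing that extra whitespace before formatting the template follows; `normalize_whitespace` is a hypothetical helper, and the description is shortened for illustration.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())


# A shortened stand-in for the description above, written with the same continuation style
raw_description = "Deserialization of untrusted data in IPC and Parquet readers \
                   allows arbitrary code execution."
clean_description = normalize_whitespace(raw_description)

# The raw string is longer because it carries the indentation spaces
print(len(raw_description), len(clean_description))
print(clean_description)
```

Whether the extra spaces change the model's answer is an empirical question, but removing them saves tokens and makes prompts easier to compare.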

##### Is this a useful planning step and thought process?

##### Did the model generate the checklist/task list as a python list of strings as we requested?

##### How can we measure accuracy?


In [None]:
# We can evaluate whether the checklist is properly formatted using this function

def is_properly_formatted_list(checklist):
    try:
        # Attempt to evaluate checklist as a Python literal
        evaluated_checklist = ast.literal_eval(checklist)
        
        # Check if the evaluated object is a list
        if isinstance(evaluated_checklist, list):
            print("Checklist is properly formatted.")
            return True
        else:
            print("Checklist is not a list.")
            return False
    except ValueError as e:
        # Handle the case where checklist cannot be evaluated as a Python literal
        print(f"Checklist is not properly formatted: {e}")
        return False
    except SyntaxError as e:
        # Handle syntax errors in the checklist string
        print(f"Checklist has a syntax error: {e}")
        return False

In [None]:
is_properly_formatted_list(response["text"])
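Models often wrap the requested list in extra text, which makes a strict `ast.literal_eval` on the whole response fail even when a usable list is present. The sketch below is one possible, more tolerant check: it parses the span between the first `[` and the last `]` and also verifies that every item is a string. `extract_checklist` and the sample response are hypothetical and written only for this illustration.

```python
import ast


def extract_checklist(text):
    """Parse the span from the first '[' to the last ']' as a list of strings, else return None."""
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end <= start:
        return None
    try:
        parsed = ast.literal_eval(text[start:end + 1])
    except (ValueError, SyntaxError):
        return None
    # Accept only a list whose items are all strings
    if isinstance(parsed, list) and all(isinstance(item, str) for item in parsed):
        return parsed
    return None


# Made-up model output with text before and after the list
sample_output = 'Here is the checklist: ["Check the PyArrow version", "Review data sources"] Done.'
checklist = extract_checklist(sample_output)
print(checklist)
```

Tracking how often a prompt yields a parseable list across many CVEs gives a simple, automatable format-accuracy metric; it says nothing about whether the checklist content is correct.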

How can we improve performance?

### Part 1d - including an example in the prompt (one-shot learning)

You are an expert security analyst. Your objective is to add a "Checklist" section containing steps to use when assessing the exploitability of a specific CVE within a containerized environment. \
For each checklist item, start with an action verb, making it clear and actionable.

**Context**:
Not all CVEs are exploitable in a given container. By making a checklist specific to the information available for a given CVE, analysts can execute the checklist to determine exploitability.

**Example Format**:
Below is a format for examples that illustrate transforming CVE information into an exploitability assessment checklist.

Example CVE Details:
- CVE ID: CVE-2022-2309 \
- Description: NULL Pointer Dereference allows attackers to cause a denial of service (or application crash). This only applies when lxml up to version 4.9.1 \
is used together with libxml2 2.9.10 through 2.9.14. libxml2 2.9.9 and earlier are not affected. It allows triggering crashes through forged input data, given a
vulnerable code sequence in the application. The vulnerability is caused by the iterwalk function (also used by the canonicalize function). Such code shouldn't be
in wide-spread use, given that parsing + iterwalk would usually be replaced with the more efficient iterparse function. However, an XML converter that serialises to \
C14N would also be vulnerable, for example, and there are legitimate use cases for this code sequence. If untrusted input is received (also remotely) and processed via \
iterwalk function, a crash can be triggered. 
- Vulnerable Package Name: lxml, libxml2 
- Vulnerable Package Version: lxml: up to 4.9.1, libxml2: 2.9.10 through 2.9.14
- CVSS3 Vector String: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

Example Exploitability Assessment Checklist: \
[
"Check for lxml: Verify if your project uses the lxml library, which is the affected package. If lxml is not a dependency in your project, then your code is not vulnerable to this CVE.",
"Review Affected Versions: If lxml is used, check the version that your project depends on. According to the vulnerability details, versions 4.9.0 and earlier are vulnerable.",
"Review Versions of Connected Dependencies: The package is only vulnerable if libxml2 2.9.10 through 2.9.14 is also present. Check the version of libxml2 in the project.",
"Check for use of vulnerable functions: The library is vulnerable through its `iterwalk` function, which is also utilized by the `canonicalize` function. Check if either of these functions are used in your code base."
]

**Criteria**:
- Exploitability assessment checklists must relate to the information in the specific CVE Details.

**Procedure**:
[
"Understand the CVE Details, description, and CVSS3 attack vector string.",
"Produce a CVE exploitability assessment checklist.",
"Format the checklist as a comma-separated list surrounded by square brackets.",
"Output the checklist."
]

**CVE Details:**
- CVE ID: {cve} \
- Description: {cve_description} \
- Vulnerable Package Name: {vuln_package} \
- CVSS3 Vector String: {cvss3}

**Checklist**: 

In [None]:
prompt_template = """You are an expert security analyst. Your objective is to add a "Checklist" section containing steps to use when assessing the exploitability of a specific CVE within a containerized environment. \
For each checklist item, start with an action verb, making it clear and actionable.

**Context**:
Not all CVEs are exploitable in a given container. By making a checklist specific to the information available for a given CVE, analysts can execute the checklist to determine exploitability.

**Example Format**:
Below is a format for examples that illustrate transforming CVE information into an exploitability assessment checklist.

Example CVE Details:
- CVE ID: CVE-2022-2309 \
- Description: NULL Pointer Dereference allows attackers to cause a denial of service (or application crash). This only applies when lxml up to version 4.9.1 \
is used together with libxml2 2.9.10 through 2.9.14. libxml2 2.9.9 and earlier are not affected. It allows triggering crashes through forged input data, given a
vulnerable code sequence in the application. The vulnerability is caused by the iterwalk function (also used by the canonicalize function). Such code shouldn't be
in wide-spread use, given that parsing + iterwalk would usually be replaced with the more efficient iterparse function. However, an XML converter that serialises to \
C14N would also be vulnerable, for example, and there are legitimate use cases for this code sequence. If untrusted input is received (also remotely) and processed via \
iterwalk function, a crash can be triggered. 
- Vulnerable Package Name: lxml, libxml2 
- Vulnerable Package Version: lxml: up to 4.9.1, libxml2: 2.9.10 through 2.9.14
- CVSS3 Vector String: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

Example Exploitability Assessment Checklist: \
[
"Check for lxml: Verify if your project uses the lxml library, which is the affected package. If lxml is not a dependency in your project, then your code is not vulnerable to this CVE.",
"Review Affected Versions: If lxml is used, check the version that your project depends on. According to the vulnerability details, versions 4.9.0 and earlier are vulnerable.",
"Review Versions of Connected Dependencies: The package is only vulnerable if libxml2 2.9.10 through 2.9.14 is also present. Check the version of libxml2 in the project.",
"Check for use of vulnerable functions: The library is vulnerable through its `iterwalk` function, which is also utilized by the `canonicalize` function. Check if either of these functions are used in your code base."
]

**Criteria**:
- Exploitability assessment checklists must relate to the information in the specific CVE Details.

**Procedure**:
[
"Understand the CVE Details, description, and CVSS3 attack vector string.",
"Produce a CVE exploitability assessment checklist.",
"Format the checklist as a comma-separated list surrounded by square brackets.",
"Output the checklist."
]

**CVE Details:**
- CVE ID: {cve}
- Description: {cve_description}
- Vulnerable Package Name: {vuln_package}
- CVSS3 Vector String: {cvss3}

**Checklist**: """

formatted_prompt = prompt_template.format(cve = "CVE-2023-47248",
                                          cve_description= "Deserialization of untrusted data in IPC and Parquet readers in PyArrow versions 0.14.0 to 14.0.0 \
                                          allows arbitrary code execution. An application is vulnerable if it reads Arrow IPC, Feather or Parquet data from untrusted sources \
                                          (for example user-supplied input files). This vulnerability only affects PyArrow, not other Apache Arrow implementations or bindings. \
                                          It is recommended that users of PyArrow upgrade to 14.0.1. Similarly, it is recommended that downstream libraries upgrade their dependency \
                                          requirements to PyArrow 14.0.1 or later. PyPI packages are already available, and we hope that conda-forge packages will be available soon. \
                                          If it is not possible to upgrade, we provide a separate package `pyarrow-hotfix` that disables the vulnerability on older PyArrow versions. \
                                          See https://pypi.org/project/pyarrow-hotfix/ for instructions.",
                                          vuln_package="PyArrow",
                                          cvss3="CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")

response = conn.generate(
  prompt= formatted_prompt,
  model="gpt-43b-002",
  stop=[],
  tokens_to_generate=256,
  temperature=0,
  top_k=1,
  top_p=0.9,
  random_seed=0,
  beam_search_diversity_rate=0.0,
  beam_width=1,
  repetition_penalty=1.0,
  length_penalty=1.0
);

response["text"]

In [None]:
is_properly_formatted_list(response["text"])

##### Do the examples and specific instructions help?

##### Is there anything you would add to the PyArrow checklist?

##### Would an example with a workaround or hotfix help?

### Part 1e - including multiple examples in the prompt (few-shot learning)

In [None]:
prompt_template = """You are an expert security analyst. Your objective is to add a "Checklist" section containing steps to use when assessing the exploitability of a specific CVE within a containerized environment. \
For each checklist item, start with an action verb, making it clear and actionable.

**Context**:
Not all CVEs are exploitable in a given container. By making a checklist specific to the information available for a given CVE, analysts can execute the checklist to determine exploitability.

**Example Format**:
Below is a format for examples that illustrate transforming CVE information into an exploitability assessment checklist.

Example 1 CVE Details:
- CVE ID: CVE-2022-2309
- Description: NULL Pointer Dereference allows attackers to cause a denial of service (or application crash). This only applies when lxml up to version 4.9.1 \
is used together with libxml2 2.9.10 through 2.9.14. libxml2 2.9.9 and earlier are not affected. It allows triggering crashes through forged input data, given a
vulnerable code sequence in the application. The vulnerability is caused by the iterwalk function (also used by the canonicalize function). Such code shouldn't be
in wide-spread use, given that parsing + iterwalk would usually be replaced with the more efficient iterparse function. However, an XML converter that serialises to \
C14N would also be vulnerable, for example, and there are legitimate use cases for this code sequence. If untrusted input is received (also remotely) and processed via \
iterwalk function, a crash can be triggered. 
- Vulnerable Package Name: lxml, libxml2 
- Vulnerable Package Version: lxml: up to 4.9.1, libxml2: 2.9.10 through 2.9.14
- CVSS3 Vector String: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

Example 1 Exploitability Assessment Checklist:
[
"Check for lxml: Verify if your project uses the lxml library, which is the affected package. If lxml is not a dependency in your project, then your code is not vulnerable to this CVE.",
"Review Affected Versions: If lxml is used, check the version that your project depends on. According to the vulnerability details, versions 4.9.0 and earlier are vulnerable.",
"Review Versions of Connected Dependencies: The package is only vulnerable if libxml2 2.9.10 through 2.9.14 is also present. Check the version of libxml2 in the project.",
"Check for use of vulnerable functions: The library is vulnerable through its `iterwalk` function, which is also utilized by the `canonicalize` function. Check if either of these functions are used in your code base."
]

Example 2 CVE Details:
- CVE ID: CVE-2024-23334
- Description: aiohttp is an asynchronous HTTP client/server framework for asyncio and Python. When using aiohttp as a web server and configuring static routes, \
it is necessary to specify the root path for static files. Additionally, the option 'follow_symlinks' can be used to determine whether to follow symbolic links \
outside the static root directory. When 'follow_symlinks' is set to True, there is no validation to check if reading a file is within the root directory. This can \
lead to directory traversal vulnerabilities, resulting in unauthorized access to arbitrary files on the system, even when symlinks are not present. \
Disabling `follow_symlinks` by setting `follow_symlinks = False` and using a reverse proxy are encouraged mitigations. Version 3.9.2 fixes this issue.
- Vulnerable Package Name: aiohttp
- Vulnerable Package Version: from 1.0.5 up to (excluding) 3.9.2
- CVSS3 Vector String: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N

Example 2 Exploitability Assessment Checklist:
[
    "Check for aiohttp: Verify if your project uses the aiohttp library, which is the affected package. If aiohttp is not a dependency in your project, then your code is not vulnerable to this CVE.",
    "Review Affected Versions: If aiohttp is used, check the version that your project depends on. According to the vulnerability details, versions from 1.0.5 up to (excluding) 3.9.2 are affected by this vulnerability.",
    "Review Code To Check for Vulnerability Mitigation: Check if the 'follow_symlinks' option is set to False to mitigate the risk of directory traversal vulnerabilities."
]

**Criteria**:
- Exploitability assessment checklists must relate to the information in the specific CVE Details.
- Exploitability assessment checklists must include checks for mitigating conditions when present in the CVE Details.

**Procedure**:
[
"Understand the CVE Details, description, and CVSS3 attack vector string.",
"Produce a CVE exploitability assessment checklist.",
"Format the checklist as a comma-separated list surrounded by square brackets.",
"Output the checklist."
]

**CVE Details:**
- CVE ID: {cve}
- Description: {cve_description}
- Vulnerable Package Name: {vuln_package}
- CVSS3 Vector String: {cvss3}

**Checklist**: """


formatted_prompt = prompt_template.format(cve = "CVE-2023-47248",
                                          cve_description= "Deserialization of untrusted data in IPC and Parquet readers in PyArrow versions 0.14.0 to 14.0.0 \
                                          allows arbitrary code execution. An application is vulnerable if it reads Arrow IPC, Feather or Parquet data from untrusted sources \
                                          (for example user-supplied input files). This vulnerability only affects PyArrow, not other Apache Arrow implementations or bindings. \
                                          It is recommended that users of PyArrow upgrade to 14.0.1. Similarly, it is recommended that downstream libraries upgrade their dependency \
                                          requirements to PyArrow 14.0.1 or later. PyPI packages are already available, and we hope that conda-forge packages will be available soon. \
                                          If it is not possible to upgrade, we provide a separate package `pyarrow-hotfix` that disables the vulnerability on older PyArrow versions. \
                                          See https://pypi.org/project/pyarrow-hotfix/ for instructions.",
                                          vuln_package="PyArrow",
                                          cvss3="CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")

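# Print the prompt length in characters; a bare expression in the middle of a cell displays nothing
print(len(formatted_prompt))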

response = conn.generate(
  prompt= formatted_prompt,
  model="gpt-43b-002",
  stop=[],
  tokens_to_generate=256,
  temperature=0,
  top_k=1,
  top_p=0.9,
  random_seed=0,
  beam_search_diversity_rate=0.0,
  beam_width=1,
  repetition_penalty=1.0,
  length_penalty=1.0
);

response["text"]

In [None]:
is_properly_formatted_list(response["text"])
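As the number of examples grows, editing one long template by hand becomes error-prone. The sketch below shows one way to generate the example section from structured records instead. The helper names and dictionary keys are hypothetical, and the descriptions and checklists are shortened from the examples above.

```python
def format_example(index, example):
    """Render one worked example in the same layout as the hand-written prompt."""
    # Items containing double quotes would need escaping; that is omitted in this sketch
    checklist_lines = ",\n".join(f'"{item}"' for item in example["checklist"])
    return (
        f"Example {index} CVE Details:\n"
        f"- CVE ID: {example['cve']}\n"
        f"- Description: {example['description']}\n"
        f"- Vulnerable Package Name: {example['package']}\n"
        f"- CVSS3 Vector String: {example['cvss3']}\n\n"
        f"Example {index} Exploitability Assessment Checklist:\n"
        f"[\n{checklist_lines}\n]"
    )


def build_few_shot_block(examples):
    """Join all examples, numbering them from 1."""
    return "\n\n".join(format_example(i, ex) for i, ex in enumerate(examples, start=1))


examples = [
    {
        "cve": "CVE-2022-2309",
        "description": "NULL Pointer Dereference in lxml (shortened for this sketch).",
        "package": "lxml, libxml2",
        "cvss3": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H",
        "checklist": ["Check for lxml: Verify if your project uses the lxml library."],
    },
    {
        "cve": "CVE-2024-23334",
        "description": "Directory traversal in aiohttp static routes (shortened for this sketch).",
        "package": "aiohttp",
        "cvss3": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N",
        "checklist": ["Check for aiohttp: Verify if your project uses the aiohttp library."],
    },
]

few_shot_block = build_few_shot_block(examples)
print(few_shot_block)
```

Keeping examples as data also makes it easy to test how many examples, and which ones, give the best checklists.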

### *Explore on your own*

#### What happens when you input a checklist item as a query to the model?

#### What are your concerns about the model output?

### Part 1f - LoRA fine-tuning

When you have more task-specific examples than will fit in your prompt, you should consider fine-tuning an LLM for your specific task. Datasets for fine-tuning consist of examples with just two fields: `Prompt` and `Completion`.

In [None]:
# TODO
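As a rough illustration of the two-field dataset described above, the sketch below writes prompt/completion pairs as JSON Lines, one record per line. The lower-case field names, the JSONL layout, and the record contents are assumptions made for this example; check the documentation of the fine-tuning service you use for the exact format it expects.

```python
import json

# Hypothetical training records; real ones would pair full CVE prompts with analyst-reviewed checklists
records = [
    {
        "prompt": "CVE ID: CVE-2022-2309\nVulnerable Package Name: lxml, libxml2\nChecklist:",
        "completion": json.dumps([
            "Check for lxml: Verify if your project uses the lxml library.",
            "Review Versions of Connected Dependencies: Check the version of libxml2 in the project.",
        ]),
    },
    {
        "prompt": "CVE ID: CVE-2024-23334\nVulnerable Package Name: aiohttp\nChecklist:",
        "completion": json.dumps([
            "Check for aiohttp: Verify if your project uses the aiohttp library.",
            "Review Code To Check for Vulnerability Mitigation: Check if 'follow_symlinks' is set to False.",
        ]),
    },
]


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


jsonl_text = to_jsonl(records)
print(jsonl_text)
```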

## Part 2 - Prototyping

Now that we have the task generation for this workflow ready, how can we automate getting the answers for our checklist items?

### Overview

It is possible to build a language model-based system that accesses external knowledge sources to complete tasks. In Part 1c above, we added CVE details to the prompt by hand. While this strategy can be effective for adding context for very specific items like CVE details, it requires a priori knowledge of which details to include (such as those from NVD). When you want to give your LLM additional context for its query in real time, you're ready for RAG (Retrieval-Augmented Generation).

When a query or checklist item is posed to an LLM equipped with RAG, the system first consults the vector database to find information relevant to the query. The retrieved data is then combined with the original question and fed into the LLM. With this enriched context, the LLM can generate a more accurate and informed response, potentially including evidence or reasoning based on the newly incorporated data. This approach not only improves the quality of the LLM's outputs but also, for our tool, gives it access to project- and container-specific information for determining CVE exploitability.
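The "combine" step described above can be sketched in a few lines. This is only an illustration of the idea, not the Morpheus implementation: `build_rag_prompt` is a hypothetical helper, and the retrieved context is made-up data standing in for what a vector database would return.

```python
def build_rag_prompt(question, contexts):
    """Combine retrieved snippets with the original question into one prompt."""
    sections = []
    for ctx in contexts:
        sections.append(f"Source File: {ctx['source']}\nSource Content:\n{ctx['content']}")
    background = "\n\n".join(sections)
    return (
        "You are a helpful assistant. Given the following background information:\n"
        f"{background}\n\n"
        "Please answer the following question:\n"
        f"{question}"
    )


# Made-up retrieval result; in a real pipeline this comes from the vector database
retrieved = [{"source": "requirements.txt", "content": "pyarrow==14.0.1"}]
question = "Which version of PyArrow does the project pin?"
rag_prompt = build_rag_prompt(question, retrieved)
print(rag_prompt)
```

The enriched prompt is then sent to the LLM in place of the bare question.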

### Building the Vector Database

In addition to a query and an LLM, RAG requires additional information to be stored in a vector database. One mechanism for finding the right information in the database is to first embed the query into the same vector space and retrieve the most similar items via a distance metric. The neighboring vectors in the database are said to be "semantically similar" to the query and are likely relevant. The retrieved information is then presented in the prompt of the LLM.
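A toy version of that lookup, using cosine similarity over hand-made three-dimensional vectors, looks like the sketch below. Real embeddings have hundreds of dimensions, and the index built in this notebook may use a different metric such as L2 distance; the file names and vectors here are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Invented "embeddings" for three source files
toy_index = {
    "parquet_reader.py": [0.9, 0.1, 0.0],
    "logging_setup.py": [0.0, 0.2, 0.9],
    "ipc_loader.py": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # stand-in for an embedded question about Parquet/IPC reading
nearest = top_k(query, toy_index, k=2)
print(nearest)
```

A vector database performs the same kind of nearest-neighbor search, but with indexing structures that keep it fast over millions of vectors.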

For demonstration purposes, we would like our LLM to be able to access the code repository of the project we're checking for exploitable CVEs. The first step is transforming the repo into a vector database.

In [None]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from cyber_dev_day.embeddings import create_code_embedding

# Create the embedding object that will be used to generate the embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", model_kwargs={"device":"cuda"}, show_progress=True)

# Create a vector database of the code using the supplied embedding function. The returned value will be a
# FaissVectorDatabase object.
# NOTE: This may take a few minutes to run.
faiss_vdb = create_code_embedding(code_dir=os.getenv("MORPHEUS_ROOT"), embedding=embeddings, include_notebooks=False, exclude=[
    ".cache/**/*.py",
    "build*/**/*.py"
])

# Save the vector database to disk
code_faiss_dir = os.path.join(os.getenv("MORPHEUS_ROOT"), ".tmp", "morpheus_code_faiss")

faiss_vdb.save_local(code_faiss_dir)


### Running a RAG Pipeline with Morpheus

How do we utilize the Vector Database to answer questions?

#### Morpheus Overview

Quick overview of what Morpheus is and how to use it.

In [None]:
from morpheus.config import Config, PipelineModes

# Create the pipeline config
pipeline_config = Config()
pipeline_config.mode = PipelineModes.OTHER

#### Building a Morpheus RAG Pipeline

Below, we will build a pipeline that uses Morpheus to answer questions about the code in the repository that we created a vector database for. This works by using the `LLMEngine` in Morpheus with a `RAGNode`.

In [None]:
from textwrap import dedent
import cudf
from cyber_dev_day.config import EngineConfig, LLMModelConfig, NVFoundationLLMModelConfig
from morpheus._lib.llm import LLMEngine
from morpheus.llm.nodes.extracter_node import ExtracterNode
from morpheus.llm.nodes.rag_node import RAGNode
from morpheus.llm.services.llm_service import LLMService
from morpheus.llm.task_handlers.simple_task_handler import SimpleTaskHandler
from morpheus.messages import ControlMessage

from morpheus.pipeline.linear_pipeline import LinearPipeline
from morpheus.service.vdb.faiss_vdb_service import FaissVectorDBService
from morpheus.stages.input.in_memory_source_stage import InMemorySourceStage
from morpheus.stages.llm.llm_engine_stage import LLMEngineStage
from morpheus.stages.output.in_memory_sink_stage import InMemorySinkStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
from morpheus.utils.concat_df import concat_dataframes

def _build_rag_llm_engine(model_config: LLMModelConfig):

    engine = LLMEngine()

    engine.add_node("extracter", node=ExtracterNode())

    prompt = dedent("""
    You are a helpful assistant. Given the following background information:
    {% for c in contexts -%}
    Source File: {{ c.metadata.source }}
    Source File Language: {{ c.metadata.language }}
    Source Content:
    ```
    {{ c.page_content }}
    ```
    {% endfor %}

    Please answer the following question:
    {{ query }}
    """).strip("\n")

    vector_service = FaissVectorDBService(code_faiss_dir, embeddings=embeddings)

    vdb_resource = vector_service.load_resource()

    llm_service = LLMService.create(model_config.service.type, **model_config.service.model_dump(exclude={"type"}))

    llm_client = llm_service.get_client(**model_config.model_dump(exclude={"service"}))

    # Async wrapper around embeddings
    async def calc_embeddings(texts: list[str]) -> list[list[float]]:
        return embeddings.embed_documents(texts)

    engine.add_node("rag",
                    inputs=["/extracter"],
                    node=RAGNode(prompt=prompt,
                                 vdb_service=vdb_resource,
                                 embedding=calc_embeddings,
                                 llm_client=llm_client))

    engine.add_task_handler(inputs=["/rag"], handler=SimpleTaskHandler())

    return engine

async def run_rag_pipeline(p_config: Config, model_config: LLMModelConfig, question: str):
    source_dfs = [
        cudf.DataFrame({"questions": [question]}),
    ]

    completion_task = {"task_type": "completion", "task_dict": {"input_keys": ["questions"], }}

    pipe = LinearPipeline(p_config)

    pipe.set_source(InMemorySourceStage(p_config, dataframes=source_dfs))

    pipe.add_stage(
        DeserializeStage(p_config, message_type=ControlMessage, task_type="llm_engine", task_payload=completion_task))

    pipe.add_stage(LLMEngineStage(p_config, engine=_build_rag_llm_engine(model_config)))

    sink = pipe.add_stage(InMemorySinkStage(p_config))

    await pipe.run_async()

    messages = sink.get_messages()
    responses = concat_dataframes(messages)

    # The responses are quite long, so disable the truncation that pandas and cudf normally
    # apply when displaying column values
    pd.set_option('display.max_colwidth', None)
    print("Response:\n%s" % (responses['response'].iloc[0], ))

In [None]:
model_config = NVFoundationLLMModelConfig.model_validate({
    "service": {
        "type": "nvfoundation", "api_key": None
    },
    "model_name": "mixtral_8x7b",
    "temperature": 0.0
})

# Run the Pipeline
await run_rag_pipeline(pipeline_config, model_config, "Does the code repo import the `pyarrow_hotfix` package from the `morpheus` root package?")


### RAG Limitations

Using the pipeline we built, we can now ask questions about the code in the repository, and the LLM will be able to use the vector database to answer them. However, what happens if we need to ask questions whose answers are not in the vector database? For example, what if we needed to ask about the dependencies that the code uses? Would the LLM be able to answer these questions? Let's try it out by re-running our RAG pipeline with a more complex question:



In [None]:
# Run the Pipeline
await run_rag_pipeline(pipeline_config, model_config, "Does the code repo use `langchain` functions which are deprecated?")

It's likely that the model was not able to determine the answer to this question because it would need additional information. Depending on the model used, you might see output similar to:
```
Based on the provided source code, I cannot determine if the `langchain` functions used in the code are deprecated or not.

To determine if `langchain` functions are deprecated, you would need to check the documentation for the specific functions being used in the code or consult the maintainers of the `langchain` package.
```

How would we go about solving this problem?

#### Answering Complex Questions with RAG + LLM Agents

This section introduces LLM agents and shows how our checklist + agents approach can help solve multi-step problems.
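
A minimal sketch of the checklist + agents idea, assuming a two-phase design: one step turns the CVE description into a list of questions (the checklist), and an agent then answers each question by choosing a tool. Everything here is hypothetical and stubbed out, including `make_checklist`, `run_agent`, the `TOOLS` table, and the canned answers. In a real pipeline, both phases would be LLM calls, and the tools would query sources such as the SBOM and the code vector database.

```python
from typing import Callable


# Hypothetical tools with canned answers; real tools would query an SBOM or a vector database.
def sbom_lookup(question: str) -> str:
    return "SBOM (canned answer): package not listed"


def code_search(question: str) -> str:
    return "Code search (canned answer): no matching call sites"


TOOLS: dict[str, Callable[[str], str]] = {"sbom": sbom_lookup, "code": code_search}


def make_checklist(cve_info: str) -> list[str]:
    """Stub for the checklist LLM call: turn a CVE description into questions."""
    return [
        "Is the affected package listed in the SBOM?",
        "Does the code call the vulnerable function?",
    ]


def run_agent(question: str) -> str:
    """Stub for the agent LLM call: pick a tool for the question and return its output."""
    tool_name = "sbom" if "SBOM" in question else "code"
    return TOOLS[tool_name](question)


def analyze(cve_info: str) -> list[tuple[str, str]]:
    """Answer every checklist item independently and collect (question, answer) pairs."""
    return [(item, run_agent(item)) for item in make_checklist(cve_info)]


for question, answer in analyze("use-after-free related to dvb_register_device"):
    print(f"- {question}\n  {answer}")
```

Splitting the problem this way keeps each agent call focused on one narrow question, which a single RAG lookup could not answer on its own.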

### Running the CVE Pipeline with Morpheus

#### The Engine Config

In [None]:
from cyber_dev_day.config import EngineConfig

# Create the engine configuration
engine_config = EngineConfig.model_validate({
    'checklist': {
        'model': {
            'service': {
                'type': 'nemo', 'api_key': None, 'org_id': None
            },
            'model_name': 'gpt-43b-002',
            'customization_id': None,
            'temperature': 0.0,
            'tokens_to_generate': 300
        }
    },
    'agent': {
        'model': {
            'service': {
                'type': 'nvfoundation', 'api_key': None
            }, 'model_name': 'mixtral_8x7b', 'temperature': 0.0
        },
        'sbom': {
            'data_file': ''
        },
        'code_repo': {
            'faiss_dir': code_faiss_dir,
            'embedding_model_name': "sentence-transformers/all-mpnet-base-v2"
        }
    }
})

In [None]:
# Print the current configuration object
print(engine_config.model_dump_json(indent=2))

#### The Pipeline Function

In [None]:
import time
import cudf
from cyber_dev_day.pipeline_utils import build_cve_llm_engine
from morpheus.messages import ControlMessage
from morpheus.pipeline.linear_pipeline import LinearPipeline
from morpheus.stages.input.in_memory_source_stage import InMemorySourceStage
from morpheus.stages.llm.llm_engine_stage import LLMEngineStage
from morpheus.stages.output.in_memory_sink_stage import InMemorySinkStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
from morpheus.utils.concat_df import concat_dataframes

async def run_cve_pipeline(p_config: Config, e_config: EngineConfig, input_cves: list[str]):
    source_dfs = [
        cudf.DataFrame({
            "cve_info": input_cves
        })
    ]

    completion_task = {"task_type": "completion", "task_dict": {"input_keys": ["cve_info"], }}

    pipe = LinearPipeline(p_config)

    pipe.set_source(InMemorySourceStage(p_config, dataframes=source_dfs))

    pipe.add_stage(DeserializeStage(p_config, message_type=ControlMessage, task_type="llm_engine", task_payload=completion_task))

    pipe.add_stage(LLMEngineStage(p_config, engine=build_cve_llm_engine(e_config)))

    sink = pipe.add_stage(InMemorySinkStage(p_config))

    await pipe.run_async()

    messages = sink.get_messages()
    responses = concat_dataframes(messages)

    logger.info("Pipeline complete")

    print("Pipeline complete. Received %s responses:\n%s" % (len(messages), responses['response']))


In [None]:
# Now run the pipeline with a specified CVE description
await run_cve_pipeline(pipeline_config, engine_config, [
    "An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops."
])

### Hitting the Limits of the LLMs

This section will go over some of the areas where LLMs may fail.

In [None]:
# Example to cause the LLM to fail (TBD)

## Part 3 - Beyond Prototyping

This section will focus on refining the prototype to fix any gaps, improve the accuracy, and create a production-ready pipeline.

### Improving the Model

In [None]:
# Code for improving the model
engine_config_custom_model = engine_config.model_copy(deep=True)

# Set the customization ID
# engine_config_custom_model.checklist.model.customization_id = "<CUSTOMIZATION_ID>"

# Run the pipeline
await run_cve_pipeline(pipeline_config, engine_config_custom_model, [
    "An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops."
])

### Batching Multiple Requests

This section will show how to run multiple requests simultaneously, improving the throughput.
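
One reason running requests together can improve throughput is that the time each request spends waiting on a remote LLM service can overlap. A minimal sketch of that effect using only `asyncio`, assuming each request is dominated by I/O wait; `fake_llm_call` and its 0.2-second delay are hypothetical stand-ins for a remote model call and are not part of Morpheus.

```python
import asyncio
import time


async def fake_llm_call(cve_info: str) -> str:
    """Hypothetical stand-in for a remote LLM request that takes about 0.2 s."""
    await asyncio.sleep(0.2)
    return f"analysis of: {cve_info}"


async def run_sequential(items: list[str]) -> list[str]:
    """Process requests one at a time; total time grows linearly with the batch size."""
    return [await fake_llm_call(item) for item in items]


async def run_concurrent(items: list[str]) -> list[str]:
    """Issue all requests at once; the waits overlap and results keep the input order."""
    return await asyncio.gather(*(fake_llm_call(item) for item in items))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


batch = ["CVE description"] * 5
_, t_seq = timed(run_sequential(batch))
_, t_con = timed(run_concurrent(batch))
print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
```

This sketch is written as a standalone script. Inside a notebook, where an event loop is already running, call `await run_concurrent(batch)` directly instead of going through `asyncio.run`.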

In [None]:
# Code for batching multiple requests
await run_cve_pipeline(pipeline_config, engine_config_custom_model, [
    "An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops."
] * 5)

### Creating a Microservice

This section will show how to convert the development pipeline into a microservice that is capable of handling requests.

In [None]:
# Code for creating a microservice
import time
import cudf
from cyber_dev_day.pipeline_utils import build_cve_llm_engine
from morpheus.messages import ControlMessage
from morpheus.messages import MessageMeta
from morpheus.pipeline.linear_pipeline import LinearPipeline
from morpheus.pipeline.stage_decorator import stage
from morpheus.stages.input.http_server_source_stage import HttpServerSourceStage
from morpheus.stages.input.in_memory_source_stage import InMemorySourceStage
from morpheus.stages.llm.llm_engine_stage import LLMEngineStage
from morpheus.stages.output.in_memory_sink_stage import InMemorySinkStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
from morpheus.utils.concat_df import concat_dataframes
from morpheus.utils.http_utils import HTTPMethod

async def run_cve_pipeline_microservice(p_config: Config, e_config: EngineConfig):

    completion_task = {"task_type": "completion", "task_dict": {"input_keys": ["cve_info"], }}

    pipe = LinearPipeline(p_config)

    # The expected payload is a JSON list of records:
    # [{"cve_info": <string>},
    #  {"cve_info": <string>}]
    pipe.set_source(
        HttpServerSourceStage(p_config,
                              bind_address="0.0.0.0",
                              port=26302,
                              endpoint="/scan",
                              method=HTTPMethod.POST))

    @stage
    def print_payload(payload: MessageMeta) -> MessageMeta:
        serialized_str = payload.df.to_json(orient='records', lines=True)

        logger.info("======= Got Request =======\n%s\n===========================", serialized_str)

        return payload

    pipe.add_stage(print_payload(config=p_config))

    pipe.add_stage(DeserializeStage(p_config, message_type=ControlMessage, task_type="llm_engine", task_payload=completion_task))

    pipe.add_stage(LLMEngineStage(p_config, engine=build_cve_llm_engine(e_config)))

    sink = pipe.add_stage(InMemorySinkStage(p_config))

    await pipe.run_async()

    messages = sink.get_messages()
    responses = concat_dataframes(messages)

    logger.info("Pipeline complete")

    print("Pipeline complete. Received %s responses:\n%s" % (len(messages), responses['response']))

In [None]:
await run_cve_pipeline_microservice(pipeline_config, engine_config_custom_model)

##### Triggering the Microservice

To trigger the microservice, we will send it a request using curl. Since the notebook cannot run commands while the microservice is running, we need to open a new terminal to send the request. To do that, follow the steps below:

1. In Jupyter Lab, press Ctrl + Shift + L (Shift + ⌘ + L on Mac) to open a new Launcher tab
2. In the Launcher tab, click on the Terminal icon to open a new terminal
3. In the terminal, run the following command to send a request to the microservice:
```bash
curl --request POST \
  --url http://localhost:26302/scan \
  --header 'Content-Type: application/json' \
  --data '[{
      "cve_info" : "An issue was discovered in the Linux kernel through 6.0.9. drivers/media/dvb-core/dvbdev.c has a use-after-free, related to dvb_register_device dynamically allocating fops."
   }]'
```
4. Once the request is sent, the microservice will process the request and return the results in the terminal
   1. To see the results, switch back to the Notebook tab. You should see that the microservice received your request and started processing it.
   ```
   I20240308 16:00:56.422039 3010283 http_server.cpp:129] Received request: POST : /scan
   ```
   2. It helps to have the terminal and the notebook side by side so you can see the results in the terminal as they come in. To do this, click on the terminal tab and drag it to the right side of the screen. You should then be able to see the terminal and the notebook side by side similar to the image below:
   ![Terminal and Notebook Side by Side](./images/side_by_side.png)
5. To stop the microservice, interrupt the kernel by pressing the stop button in the toolbar.
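
The same request can also be sent from Python instead of curl. A minimal sketch using only the standard library, assuming the microservice is listening on `localhost:26302` at `/scan` as configured above. `build_scan_request` and `send_scan_request` are hypothetical helper names, and the sketch only builds the request; sending it requires the microservice cell to be running.

```python
import json
import urllib.request


def build_scan_request(cve_descriptions: list[str],
                       host: str = "localhost",
                       port: int = 26302) -> urllib.request.Request:
    """Build a POST request whose body is a JSON list of {"cve_info": ...} records."""
    payload = [{"cve_info": description} for description in cve_descriptions]
    return urllib.request.Request(
        url=f"http://{host}:{port}/scan",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_scan_request(request: urllib.request.Request, timeout: float = 30.0) -> tuple[int, str]:
    """Send the request and return (status code, body); fails if the service is not running."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


request = build_scan_request([
    "An issue was discovered in the Linux kernel through 6.0.9. "
    "drivers/media/dvb-core/dvbdev.c has a use-after-free, "
    "related to dvb_register_device dynamically allocating fops."
])
print(request.get_method(), request.full_url)  # prints "POST http://localhost:26302/scan"
```

With the microservice running, calling `send_scan_request(request)` from a separate Python process (not the notebook kernel, which is busy running the pipeline) submits the CVE description for analysis.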