# Code Vulnerability Scanning & Automated Remediation using Vertex AI Gemini API (Gemini 1.5 Pro)

---



### Background

Software security is critical. As cyber threats grow more sophisticated, developers need to identify and address vulnerabilities in their code proactively, because unaddressed vulnerabilities can lead to data breaches, financial losses, and reputational damage. This notebook uses Gemini 1.5 Pro to detect vulnerabilities in code and suggest remediations, with the aim of producing more robust and secure software applications.

### Use Case

Advanced code scanning and vulnerability detection across 100+ Python files for enhanced software security with Gemini 1.5 Pro.

| | |
|-|-|
|Author(s) | [Souvik Mukherjee](https://github.com/talktosauvik)

### Objectives & Approach

This notebook is a step-by-step guide to building a code vulnerability scanning and remediation engine using Gemini 1.5 Pro. The process is:

1.   **Data ingestion:** Reading Python files from a GCS bucket and combining them into a single string.
2.   **Prompt engineering:** Crafting a clear and comprehensive prompt for Gemini 1.5 Pro, providing instructions for code analysis and output formatting.
3.   **LLM invocation:** Submitting the consolidated code string to Gemini 1.5 Pro for analysis.
4.   **Response parsing:** Extracting vulnerability information, recommendations, and code snippets from the model's response.
5.   **Output generation:** Creating CSV and JSON reports for further analysis and integration with security tools.


### Gemini

Gemini is a family of generative AI models developed by Google DeepMind and designed for multimodal use cases. The Gemini API gives you access to the Gemini 1.0 Pro Vision, Gemini 1.0 Pro, and Gemini 1.5 Pro models.

Its key strengths include:


1.   **Multimodality:** Gemini 1.5 Pro can understand and generate content across modalities, including text, code, and images, which makes it well suited to analyzing complex codebases.
2.   **Extended Context Window:** Gemini 1.5 Pro has an extended context window of 1M tokens, allowing it to process large amounts of code in a single call and making it efficient for large-scale code scanning.
3.   **Advanced Code Understanding:** Gemini 1.5 Pro has a deep understanding of programming languages and security best practices, enabling it to identify potential vulnerabilities and suggest effective remediation strategies.
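
Because the whole codebase is sent in a single call, it can help to check the prompt size against the context window before invoking the model. The sketch below assumes a model object whose `count_tokens` method returns a result with a `total_tokens` attribute, as the Vertex AI SDK's `GenerativeModel` does. The helper name `fits_in_context` and the 1,000,000-token default are illustrative assumptions, not part of the original notebook.

```python
def fits_in_context(model, prompt, limit=1_000_000):
    """Return (fits, total_tokens) for a prompt against a token budget.

    Assumption: `model` exposes count_tokens(prompt).total_tokens, as
    vertexai's GenerativeModel does. `limit` is an assumed budget; adjust
    it to the context window of the model you actually use.
    """
    total_tokens = model.count_tokens(prompt).total_tokens
    return total_tokens <= limit, total_tokens
```

Once `model` and `my_prompt` are defined later in the notebook, a call such as `fits_in_context(model, my_prompt)` could be used to decide whether the files need to be split into smaller batches.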



### Dataset

The sample data files for this notebook were taken from an openly accessible Git repo, [SecurityEval](https://github.com/s2e-lab/SecurityEval/tree/main). For your own purposes, you can use files from any similar or different repo.



For this specific use case, the dataset can be made available in a couple of different ways:

1. A dataset of individual .py/.java files uploaded to a GCS bucket
2. A dataset in a single JSONL file, where every line contains a JSON object with the following fields:

*   ID: unique identifier of the sample.
*   Prompt: Prompt for the code generation model.
*   Insecure_code: code of the vulnerability example

While the dataset can vary, the approach to handling the use case can remain the same.
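
For the JSONL variant, the records can be combined into the same kind of delimited string that the notebook builds for individual files. This is a minimal sketch that assumes the field names listed above (`ID` and `Insecure_code`); the function name `combine_jsonl_samples` and the local file path passed to it are hypothetical.

```python
import json


def combine_jsonl_samples(path):
    """Combine the Insecure_code field of every JSONL record into one string.

    Each sample is delimited by its ID so that findings can be traced
    back to the record they came from.
    """
    parts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines between records
                continue
            record = json.loads(line)
            parts.append(f"### File: {record['ID']} ###\n{record['Insecure_code']}\n")
    return "".join(parts)
```

The returned string could then be used as the prompt context in place of the string built from the GCS bucket.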


### Getting Started

### Install the Vertex AI SDK for Python and the Google Cloud Storage client library


In [None]:
! pip3 install  -q --upgrade --user google-cloud-aiplatform google-cloud-storage

### Restart the kernel

To use the newly installed packages, you must restart the current runtime.

In [None]:
# restart the current runtime to be able to access the downloaded packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

### Authenticate the notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).

In [None]:
from google.colab import auth

auth.authenticate_user()

### Define project information & initialize Vertex AI



In [None]:
# initialize variables
PROJECT_ID = "your-project-id"
REGION = "us-central1"
BUCKET_NAME = "your-bucket-name"
PREFIX = "folder-name/"

In [None]:
# import and initialize Vertex AI
import vertexai

vertexai.init(project=PROJECT_ID, location=REGION)

### Read the .py files from a GCS bucket and combine them into a single string


In [None]:
# Read the files from the GCS bucket, iterate through them, and combine them into a single string.
# Each file's content is demarcated using the actual file name as a separator.
# This example uses just 10 files, but the same process can scan hundreds of files
# by taking advantage of the long context window.
# The sample files were taken from an openly accessible Git repo:
# https://github.com/s2e-lab/SecurityEval/tree/main
# For your own purposes, you can use files from any similar or different repo.

from google.cloud import storage


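def process_py_files(bucket_name, prefix):
    """
    Reads .py files from a GCS bucket and combines them into a single string.
    Args:
      bucket_name: Name of the GCS bucket to read from.
      prefix: Object name prefix (folder) used to filter the files.
    Returns:
      A string containing the combined content of all .py files.
    """

    storage_client = storage.Client()
    # Use the function arguments rather than the module-level constants
    bucket = storage_client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=prefix)

    combined_text = ""
    for blob in blobs:
        if blob.name.endswith(".py"):
            # download_as_text replaces the deprecated download_as_string
            file_content = blob.download_as_text(encoding="utf-8")
            # Demarcate each file with its name so findings can be traced back
            combined_text += f"### File: {blob.name} ###\n{file_content}\n"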

    return combined_text


combined_string = process_py_files(BUCKET_NAME, PREFIX)
print(combined_string)

### Import Generative model library from Vertex AI

In [None]:
# Import the generative models module from the Vertex AI SDK, which provides access to the Gemini models

from IPython.display import Markdown, display
from vertexai.preview.generative_models import GenerationConfig, GenerativeModel

### Initialize Gemini 1.5 Pro

In [None]:
model = GenerativeModel("gemini-1.5-pro-001")

### Setting up model configuration & Prompt template

In [None]:
# Set up the model configuration and a prompt template with one-shot inference (a single sample response).
# Note: no safety_settings are passed to generate_content below, so the default safety filters apply.

# define the prompt template to be passed to gemini.
context = combined_string

my_prompt = f"""You are an expert code assistant. Review the following code for vulnerabilities and provide recommendations:
{context}

Please format your response in Markdown with the following structure:

file_name: Name of the code file
Vulnerability Name: Name of the identified vulnerability
Vulnerability: Description of the vulnerability and its potential impact.
Recommendations: List of actionable steps to mitigate the vulnerability.
Recommended code: Recommended code snippet to remove the suspected vulnerability

I am also providing a sample response output that you should follow-
-------------------------------------------------
**file_name: bulk/example_1.py**

**Vulnerability Name:** Information Exposure

**Vulnerability:** The `server_bad` function returns the entire traceback when an exception occurs. This can leak sensitive information about the application's internals, such as file paths, variable names, and even the type of exception raised. Attackers could exploit this information to gain a deeper understanding of the system and potentially launch further attacks.

**Recommendations:**

*Catch specific exceptions instead of using a broad `Exception` clause.
*Return a generic error message to the user without revealing internal details.
*Log the full traceback for debugging purposes, but do not expose it to the user.

**Recommended code:**

```python
sample code
```
"""  # try your own prompt


generation_config = GenerationConfig(
    temperature=0.5,
    top_p=0.4,
    top_k=24,
    candidate_count=1,
    max_output_tokens=8192,
)
responses = model.generate_content(
    contents=my_prompt,
    generation_config=generation_config,
    stream=True,
)

for res in responses:
    print(res.text)
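
Printing each streamed chunk separately inserts a line break at every chunk boundary, which can split words and lines of code in the output. The sketch below shows one way to collect the stream into a single string instead. It assumes each chunk exposes a `.text` attribute, as in the loop above; the helper name `collect_stream` is hypothetical. Because a streamed response is normally consumed only once, this would take the place of the printing loop rather than follow it.

```python
def collect_stream(responses):
    """Join the text of all streamed chunks into a single string."""
    return "".join(chunk.text for chunk in responses)
```

With this helper, `response_text = collect_stream(responses)` would capture the output directly, without pasting the printed text into a variable by hand.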

### Capture the model response into a single variable

In [None]:
# capture the model response into a variable.
response_text = """
 Security Analysis of Code Snippets

### file_name: Bulk/example_1.py

**Vulnerability Name:** Information Exposure Through Exception Handling

**Vulnerability:** The `server_bad` function returns the entire traceback when an exception occurs. This can leak sensitive information about the application's internals, such as file paths, variable names, and even the type of exception raised. Attackers could exploit this information to gain a deeper understanding of the system and potentially launch further attacks.

**Recommendations:**

*Catch specific exceptions instead of using a broad `Exception` clause.
*Return a generic error message to the user without revealing internal details.
*Log the full traceback for debugging purposes, but do not expose it to the user.

**Recommended code:**

```python
@app.route('/bad')
def server_bad():
    try:
        result = do_computation()
        return result
    except SpecificException as e:
        # Log the error for debugging
        app.logger.exception("An error occurred during computation")
        return "An error occurred. Please try again later."
```

### file_name: Bulk/example_10.py

**Vulnerability Name:** Insecure Temporary File Creation

**Vulnerability:** The `mktemp()` function creates temporary files with predictable names, which could allow attackers to guess the file names and access sensitive data. Additionally, the file permissions might be insecure, allowing unauthorized access.

**Recommendations:**

*Use the `tempfile.mkstemp()` function to create temporary files with more secure permissions and unpredictable names.
*Ensure that the temporary files are deleted after they are no longer needed.

**Recommended code:**

```python
import tempfile

def write_results(results):
    with tempfile.NamedTemporaryFile(mode="w+", delete=False) as f:
        filename = f.name
        f.write(results)
    print("Results written to", filename)
    # ... process the file ...
    os.remove(filename)  # Delete the file when done
```

### file_name: Bulk/example_100.py

**Vulnerability Name:** Weak Key Generation Algorithm

**Vulnerability:** The `DSA.generate(1024)` function generates a DSA key with a length of 1024 bits, which is considered weak and vulnerable to modern cryptanalysis techniques.


**Recommendations:**

*Use a stronger key generation algorithm, such as RSA or ECC, with a key length of at least 2048 bits.
*Consider using libraries that provide secure defaults for key generation.

**Recommended code:**

```python
from Crypto.PublicKey import RSA


def generate_private_key():
    key = RSA.generate(2048)
    return key.export_key('PEM')  # Export in a standard format
```

### file_name: Bulk/example_101.py

**Vulnerability Name:** Insecure Encryption Mode and Static IV

**Vulnerability:** The code uses AES in CBC mode with a static initialization vector (IV). Using a static IV weakens the encryption and makes it vulnerable to certain attacks, especially when encrypting multiple messages with the same key.

**Recommendations:**

*Use a random IV for each encryption operation.
*Consider using more secure modes of operation, such as GCM or CTR, which provide authenticated encryption.

**Recommended code:**

```python
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

def getEncryptionKey(data, key):
    iv = get_random_bytes(AES.block_size)
    cipher = AES.new(key, AES.MODE_GCM, iv)
    ciphertext, tag = cipher.encrypt_and_digest(data)
    return iv, ciphertext, tag
```

### file_name: Bulk/example_102.py

**Vulnerability Name:** Insecure FTP Usage and Hardcoded Credentials

**Vulnerability:** The code uses FTP, which transmits data in plaintext, including usernames and passwords. Additionally, the credentials are hardcoded in the script, making them easily discoverable.

**Recommendations:**


*Use a secure file transfer protocol, such as SFTP or FTPS, which encrypts the communication.
*Avoid hardcoding credentials and instead use environment variables or a secure credential storage mechanism.

**Recommended code:**

```python
import pysftp  # Assuming you have the pysftp library installed

def downloadFile(hostname, username, password, sensitive_file):
    with pysftp.Connection(hostname, username=username, password=password) as sftp:
        sftp.get(sensitive_file, preserve_mtime=True)
```

### file_name:Bulk/example_103.py

**Vulnerability Name:**  Potential Issues with Password Hashing

**Vulnerability:** While PBKDF2 is a good choice for password hashing, the code uses a static salt, which reduces its effectiveness against rainbow table attacks.

**Recommendations:**

*Use a unique salt for each password. This can be achieved by generating a random salt for each user and storing it securely alongside the hashed password.

**Recommended code:**

```python
import os

def getHash(password):
    salt = os.urandom(16)  # Generate a random salt
    hash = pbkdf2_hmac('sha256', password, salt, 100000)
    return salt, hash  # Store the salt with the hash
```

### file_name: Bulk/example_104.py

**Vulnerability Name:** Hardcoded and Weak Admin Password

**Vulnerability:** The code uses a hardcoded and easily guessable password for the admin user. This makes the system extremely vulnerable to unauthorized access.

**Recommendations:**

*Never hardcode passwords, especially for privileged accounts.
*Enforce strong password policies that require a minimum length, complexity, and regular password changes.
*Consider using multi-factor authentication for additional security.

**Recommended code:**

```python
# Implement a secure password storage and verification mechanism, such as bcrypt or Argon2
# and avoid hardcoding passwords.
```

### file_name: Bulk/example_105.py

**Vulnerability Name:** Open Redirect Vulnerability

**Vulnerability:** The `flask_redirect` function redirects the user to any URL provided in the `next` parameter without validation. This can be exploited by attackers to redirect users to malicious websites.

**Recommendations:**

*Validate the `next` parameter to ensure it points to a trusted domain or a specific set of allowed URLs.

**Recommended code:**

```python
from urllib.parse import urlparse

@app.route('flask_redirect')
def flask_redirect():
    url = request.args.get("next")
    parsed_url = urlparse(url)
    if parsed_url.netloc in allowed_domains:  # Replace with your allowed domains
        return redirect(url)
    else:
        return "Invalid redirect URL"
```

### file_name: Bulk/example_106.py

**Vulnerability Name:** Potential XXE (XML External Entity) Vulnerability

**Vulnerability:** The code parses XML data from the request without disabling external entity processing. This could allow attackers to inject malicious entities that can lead to information disclosure, denial of service, or server-side request forgery (SSRF) attacks.

**Recommendations:**

*   Disable external entity processing when parsing XML data.
*   Use a library that provides secure XML parsing options.

**Recommended code:**

```python
from defusedxml import lxml

@app.post("/upload")
def upload():
    xml_src = request.get_data()
    parser = lxml.XMLParser(resolve_entities=False)
    doc = lxml.fromstring(xml_src, parser=parser)
    return lxml.tostring(doc)

```

### file_name: Bulk/example_107.py

**Vulnerability Name:** Potential XXE and Insecure Password Handling

**Vulnerability:** Similar to example\_106.py, this code is vulnerable to XXE attacks due to insecure XML parsing. Additionally, the `checkPassword` function is not implemented, and it's unclear how passwords are stored and verified.

**Recommendations:**

*Address the XXE vulnerability as described in example\_106.py.
*Implement a secure password storage and verification mechanism, such as bcrypt or Argon2.
*Avoid storing passwords in plaintext or using weak hashing algorithms.

**Recommended code:**

```python
# Implement secure XML parsing as shown in example_106.py
# Implement a secure password storage and verification mechanism
```
"""

### Parse the response and export it into a csv for further analysis.

To parse the response, we use regular expression pattern matching, which is easy to configure and maintain.



In [None]:
import re
import pandas as pd


def extract_vulnerability_data(text):
    # Regular expressions for extracting data
    file_pattern = r"###\s*file_name:\s*(.*\.py)"  # Adjusted for "file_name"
    vulnerability_name_pattern = r"\*\*Vulnerability Name:\*\*\s*(.*)"
    vulnerability_pattern = r"\*\*Vulnerability:\*\*\s*(.*?)(?=\*\*Recommendations)"
    recommendation_pattern = r"\*\*Recommendations:\*\*\s*((?:\*.*\n)+)"
    code_pattern = r"```python(.*?)```"  # Added pattern for recommended code

    data = []

    # Iterate through each vulnerability report
    for match in re.finditer(r"###.*?(?=###|$)", text, re.DOTALL):
        report = match.group(0)

        file_name = re.search(file_pattern, report).group(1)
        vulnerability_name = re.search(vulnerability_name_pattern, report).group(1)
        vulnerability = (
            re.search(vulnerability_pattern, report, re.DOTALL).group(1).strip()
        )
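        # No re.DOTALL here: with it, ".*" would run past the bullet list
        # and capture the rest of the report, including the code block
        recommendations = (
            re.search(recommendation_pattern, report).group(1).strip()
        )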

        # Extract recommended code, handling potential absence
        code_match = re.search(code_pattern, report, re.DOTALL)
        recommended_code = code_match.group(1).strip() if code_match else "N/A"

        data.append(
            {
                "File Number": file_name.split("_")[-1].split(".")[0],
                "File Name": file_name,
                "Vulnerability Name": vulnerability_name,
                "Description": vulnerability,
                "Recommendations": recommendations,
                "Recommended Code": recommended_code,  # Include recommended code
            }
        )

    return pd.DataFrame(data)


# Extract data and create DataFrame
df = extract_vulnerability_data(response_text)

# Display DataFrame in Colab
display(df)

# Save DataFrame to CSV file
df.to_csv("vulnerability_report_BULK.csv", index=False)

print("CSV file 'vulnerability_report_BULK.csv' created successfully.")
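
The extraction above calls `.group(1)` directly on each `re.search` result, so a report that is missing a field raises an `AttributeError`. The following sketch shows one way to guard against that. The helper `search_field` and its `"N/A"` default are assumptions for illustration, mirroring the fallback the notebook already uses for the recommended code.

```python
import re


def search_field(pattern, report, flags=0, default="N/A"):
    """Return the first capture group of `pattern` in `report`, or `default`."""
    match = re.search(pattern, report, flags)
    return match.group(1).strip() if match else default


# A tiny report that has a file name but no recommended-code section
report = (
    "### file_name: bulk/example_1.py\n"
    "**Vulnerability Name:** Information Exposure\n"
)

print(search_field(r"###\s*file_name:\s*(.*\.py)", report))  # bulk/example_1.py
print(search_field(r"\*\*Recommended code:\*\*\s*(.*)", report))  # N/A
```

Replacing the direct `re.search(...).group(1)` calls with a helper like this would let one malformed report produce placeholder values instead of stopping the whole export.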

In [None]:
# Download the generated CSV file (Colab only; requires the 'google.colab' module)
from google.colab import files

files.download("vulnerability_report_BULK.csv")

### Run a similar operation with JSON output

In [None]:
import re
import json


def extract_vulnerability_data(text):
    # Regular expressions for extracting data
    file_pattern = r"###\s*file_name:\s*(.*\.py)"  # Adjusted for "file_name"
    vulnerability_name_pattern = r"\*\*Vulnerability Name:\*\*\s*(.*)"
    vulnerability_pattern = r"\*\*Vulnerability:\*\*\s*(.*?)(?=\*\*Recommendations)"
    recommendation_pattern = r"\*\*Recommendations:\*\*\s*((?:\*.*\n)+)"
    code_pattern = r"```python(.*?)```"  # Added pattern for recommended code

    data = []

    # Iterate through each vulnerability report
    for match in re.finditer(r"###.*?(?=###|$)", text, re.DOTALL):
        report = match.group(0)

        file_name = re.search(file_pattern, report).group(1)
        vulnerability_name = re.search(vulnerability_name_pattern, report).group(1)
        vulnerability = (
            re.search(vulnerability_pattern, report, re.DOTALL).group(1).strip()
        )
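        # No re.DOTALL here: with it, ".*" would run past the bullet list
        # and capture the rest of the report, including the code block
        recommendations = (
            re.search(recommendation_pattern, report).group(1).strip()
        )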

        # Extract recommended code, handling potential absence
        code_match = re.search(code_pattern, report, re.DOTALL)
        recommended_code = code_match.group(1).strip() if code_match else "N/A"

        data.append(
            {
                "File Number": file_name.split("_")[-1].split(".")[0],
                "File Name": file_name,
                "Vulnerability Name": vulnerability_name,
                "Description": vulnerability,
                "Recommendations": recommendations,
                "Recommended Code": recommended_code,  # Include recommended code
            }
        )

    return data


# Extract data and write to JSON file (assuming you have 'response_text' defined)
data = extract_vulnerability_data(response_text)

with open("vulnerabilities.json", "w") as f:
    json.dump(data, f, indent=4)

print("Vulnerability data extracted and saved to vulnerabilities.json")
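
As one example of further analysis, the JSON report could be loaded back and summarized per file. This is a rough sketch that assumes the report is a list of objects with a `"File Name"` key, as written by the cell above; the function name `count_findings_per_file` is hypothetical.

```python
import json
from collections import Counter


def count_findings_per_file(path):
    """Count how many findings the JSON report contains for each file."""
    with open(path, encoding="utf-8") as f:
        findings = json.load(f)
    return Counter(item["File Name"] for item in findings)
```

Calling `count_findings_per_file("vulnerabilities.json")` would return a `Counter` keyed by file name, which could feed a dashboard or a ticketing integration.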

In [None]:
from google.colab import files

files.download("vulnerabilities.json")

### **IMPORTANT NOTE**

This use case can have multiple variations, as listed below. However, the overall approach can remain the same as showcased in this notebook. The approach demonstrated here is just one way to solve this use case; other approaches may work as well.

**Scenario 1**- Run a sample piece of code standalone against Gemini to understand vulnerability and remediation

**Scenario 2**- Run multiple Python code files in a batch against Gemini 1.5 Pro to understand vulnerability and remediation

**Scenario 3**- Use Gemini 1.5 Pro against publicly available Git repos to identify vulnerabilities and propose remediation

Although Gemini 1.5 Pro showed impressive performance for scenarios 1 and 2, the third scenario could prove challenging to accomplish from within a notebook environment. Gemini Code Assist, with its built-in code scanning capabilities, may be a better way to handle it.
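
For the third scenario, one possible starting point is to clone the repository locally and gather its Python files using the same separator format as the GCS example. This is only a rough sketch: it assumes the repository has already been cloned to a local directory, and the function name `combine_local_py_files` and the `max_files` cap are hypothetical.

```python
from pathlib import Path


def combine_local_py_files(repo_dir, max_files=None):
    """Combine all .py files under repo_dir into one delimited string.

    Files are visited in sorted order; max_files optionally caps how many
    are included so the prompt stays within the model's context window.
    """
    root = Path(repo_dir)
    paths = sorted(root.rglob("*.py"))
    if max_files is not None:
        paths = paths[:max_files]

    parts = []
    for path in paths:
        # Replace undecodable bytes rather than failing on a single bad file
        content = path.read_text(encoding="utf-8", errors="replace")
        name = path.relative_to(root).as_posix()
        parts.append(f"### File: {name} ###\n{content}\n")
    return "".join(parts)
```

The resulting string could be passed as the prompt context in the same way as the string built from the GCS bucket, although large repositories may still need to be split into batches.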





## Conclusions

In this notebook, we used Gemini 1.5 Pro and its long context window to:
1. Analyze multiple Python files for potential vulnerabilities
2. Provide recommendations, both as text and as code
3. Export the details to CSV and JSON for further analysis

Although effective, this approach neither applies the recommended fixes automatically nor validates them against established benchmarks. The recommendations can, however, be implemented manually and tested against industry-standard benchmarks for deeper analysis.