# Code Vulnerability Scanning & Automated Remediation using Gemini API in Vertex AI (Gemini 1.5 Pro)

---


<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/code/code_scanning_and_vulnerability_detection.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fcode%2Fcode_scanning_and_vulnerability_detection.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/code/code_scanning_and_vulnerability_detection.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/code/code_scanning_and_vulnerability_detection.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

<b>Share to:</b>

<a href="https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/code/code_scanning_and_vulnerability_detection.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg" alt="LinkedIn logo">
</a>

<a href="https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/code/code_scanning_and_vulnerability_detection.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg" alt="Bluesky logo">
</a>

<a href="https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/code/code_scanning_and_vulnerability_detection.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/53/X_logo_2023_original.svg" alt="X logo">
</a>

<a href="https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/code/code_scanning_and_vulnerability_detection.ipynb" target="_blank">
  <img width="20px" src="https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" alt="Reddit logo">
</a>

<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/code/code_scanning_and_vulnerability_detection.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg" alt="Facebook logo">
</a>            


| | |
|-|-|
|Author(s) | [Souvik Mukherjee](https://github.com/talktosauvik) |

## Background

In today's digital landscape, software security is paramount. As cyber threats grow more sophisticated, developers need to proactively identify and address vulnerabilities in their code, since vulnerabilities can lead to data breaches, financial losses, and reputational damage. This notebook explores how Gemini 1.5 Pro can help with code vulnerability detection and remediation, and uses it to build a software vulnerability scanning mechanism.

## Overview

Gemini 1.5 Pro, a member of Google's Gemini family, is a generative AI model built for diverse multimodal applications. Its proficiency in understanding and generating content across text, code, and images makes it a powerful asset for intricate codebase analysis. With its 2M token context window, Gemini 1.5 Pro can process large volumes of code in a single call, streamlining large-scale code scanning. Its comprehension of programming languages and security best practices enables it to identify potential vulnerabilities and suggest helpful, contextual modifications. Learn more about [Gemini 1.5 Pro](https://deepmind.google/technologies/gemini/pro/).





This experimental approach aims to efficiently scan large codebases, analyze multiple files in a single call, and delve deeper into complex code relationships and patterns. The model's deep analysis of code can help ensure comprehensive vulnerability detection, going beyond surface-level flaws. The approach can accommodate code written in several programming languages. Additionally, the findings and recommendations can be generated as JSON or CSV reports, which could then be compared against established benchmarks and policy checks.

In this tutorial, you learn how to use the Gemini API in Vertex AI, the Google Cloud Storage API, and the Vertex AI SDK with the Gemini 1.5 Pro model to build a step-by-step code vulnerability scanning approach with the following steps:


*   Read Python files from a GCS bucket and combine them into a single string
*   Craft a clear and comprehensive prompt for Gemini 1.5 Pro, with instructions for code analysis and output formatting
*   Submit the consolidated code string to Gemini 1.5 Pro for analysis
*   Extract vulnerability information, recommendations, and code snippets from the model response
*   Generate CSV and JSON output reports for further analysis, benchmarking and integration with security tools


## Getting Started

### Install Vertex AI and Google Cloud Storage SDKs for Python


In [None]:
%pip install -q --upgrade --user google-cloud-aiplatform google-cloud-storage

### Restart the Kernel

To use the newly installed packages, you must restart the current runtime.

In [None]:
# restart the current runtime to be able to access the downloaded packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

### Authenticate the notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK


To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).


In [None]:
# initialize variables

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}
BUCKET_NAME = "your-bucket-name"  # @param {type:"string"}
PREFIX = "your prefix folder-name/"  # @param {type:"string"}

In [None]:
# import and initialize Vertex AI
import vertexai

vertexai.init(project=PROJECT_ID, location=REGION)

## Process Python files in batch

This block of code reads Python files from the GCS bucket, combines their content, and adds each file's name as a separator so that the LLM can better identify each file.


In [None]:
from google.cloud import storage


def process_py_files(bucket_name, prefix):
    """
    Reads .py files from a GCS bucket and combines them into a single string.
    Args:
      bucket_name: Name of the GCS bucket to read from.
      prefix: Folder prefix used to filter the objects in the bucket.
    Returns:
      A string containing the combined content of all .py files.
    """

    storage_client = storage.Client()
    # Use the function arguments rather than the notebook-level globals
    bucket = storage_client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=prefix)

    parts = []
    for blob in blobs:
        if blob.name.endswith(".py"):
            # download_as_bytes replaces the deprecated download_as_string
            file_content = blob.download_as_bytes().decode("utf-8")
            parts.append(f"### File: {blob.name} ###\n{file_content}\n")

    return "".join(parts)


combined_string = process_py_files(BUCKET_NAME, PREFIX)
print(combined_string)
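
The separator format used above can be tried without a bucket. The following is a minimal sketch, assuming the files are already in memory as a dictionary. `combine_sources`, `estimate_tokens`, and the sample file names are hypothetical, and the four-characters-per-token ratio is a rough assumption rather than the model's actual tokenizer.

```python
def combine_sources(files):
    """Joins a {file name: source text} mapping into one delimited string."""
    parts = []
    for name in sorted(files):  # sorted for a stable, reproducible order
        parts.append(f"### File: {name} ###\n{files[name]}\n")
    return "".join(parts)


def estimate_tokens(text, chars_per_token=4):
    """Rough size estimate; chars_per_token is an assumed average."""
    return len(text) // chars_per_token


# Hypothetical in-memory stand-ins for files read from the bucket
sample_files = {
    "bulk/b.py": "print('b')",
    "bulk/a.py": "import os",
}
combined = combine_sources(sample_files)
print(combined)
print("Approximate tokens:", estimate_tokens(combined))
```

For an exact figure, the SDK's `GenerativeModel.count_tokens` method can be called on the combined string instead of relying on this heuristic.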

### Import Generative model library from Vertex AI

In [None]:
from IPython.display import display
from vertexai.generative_models import GenerationConfig, GenerativeModel

### Initialize Gemini 1.5 Pro

In [None]:
model = GenerativeModel("gemini-1.5-pro")

### Setting up model configuration & Prompt template

This code sets up the model configuration and a prompt template for one-shot inference (the prompt includes a single sample response). This notebook does not import or configure any custom safety settings.


In [None]:
# Define the prompt template to pass to Gemini.
context = combined_string

my_prompt = f"""You are an expert code assistant. Review the following code for vulnerabilities and provide recommendations:
{context}

Please format your response using markdown and display it with the following structure:

file_name: Name of the code file
Vulnerability Name: Name of the identified vulnerability
Vulnerability: Description of the vulnerability and its potential impact.
Recommendations: List of actionable steps to mitigate the vulnerability.
Recommended code: Recommended code snippet to remove the suspected vulnerability

I am also providing a sample response output that you should follow-
-------------------------------------------------
**file_name: bulk/example_1.py**

**Vulnerability Name:** Information Exposure

**Vulnerability:** The `server_bad` function returns the entire traceback when an exception occurs. This can leak sensitive information about the application's internals, such as file paths, variable names, and even the type of exception raised. Attackers could exploit this information to gain a deeper understanding of the system and potentially launch further attacks.

**Recommendations:**

*Catch specific exceptions instead of using a broad `Exception` clause.
*Return a generic error message to the user without revealing internal details.
*Log the full traceback for debugging purposes, but do not expose it to the user.

**Recommended code:**

```python
sample code
```
"""  # try your own prompt


generation_config = GenerationConfig(
    temperature=0.5,
    top_p=0.4,
    top_k=24,
    candidate_count=1,
    max_output_tokens=8192,
)
responses = model.generate_content(
    contents=my_prompt,
    generation_config=generation_config,
    stream=True,
)

for res in responses:
    # Chunks can end mid-line, so print them without adding extra newlines
    print(res.text, end="")
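
Printing each chunk is convenient for reading, but the later cells need the whole response as one string. The following is a minimal sketch of joining streamed chunks programmatically. `FakeChunk` is a hypothetical stand-in for the streamed response objects, which expose a `.text` attribute as used in the loop above, and `collect_stream` is an illustrative helper. No model call is made here.

```python
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Hypothetical stand-in for one streamed response chunk."""

    text: str


def collect_stream(chunks):
    """Concatenates chunk texts without inserting extra newlines."""
    return "".join(chunk.text for chunk in chunks)


# A chunk boundary can fall in the middle of a line of code
stream = [FakeChunk("def f():\n    return"), FakeChunk(" result\n")]
captured = collect_stream(stream)
print(captured)
```

A stream can be iterated only once, so with a real response the chunks would be collected first and the joined text printed afterwards, rather than printing in one loop and collecting in another.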

## Capture the model response into a single variable


In [None]:
response_text = r"""
 Security Analysis of Code Snippets

### file_name: Bulk/example_1.py

**Vulnerability Name:** Information Exposure Through Exception Handling

**Vulnerability:** The `server_bad` function returns the entire traceback when an exception occurs. This can leak sensitive information about the application's internals, such as file paths, variable names, and even the type of exception raised. Attackers could exploit this information to gain a deeper understanding of the system and potentially launch further attacks.

**Recommendations:**

*Catch specific exceptions instead of using a broad `Exception` clause.
*Return a generic error message to the user without revealing internal details.
*Log the full traceback for debugging purposes, but do not expose it to the user.

**Recommended code:**

```python
@app.route('/bad')
def server_bad():
    try:
        result = do_computation()
        return result
    except SpecificException as e:
        # Log the error for debugging
        app.logger.exception("An error occurred during computation")
        return "An error occurred. Please try again later."
```

### file_name: Bulk/example_10.py

**Vulnerability Name:** Insecure Temporary File Creation

**Vulnerability:** The `mktemp()` function creates temporary files with predictable names, which could allow attackers to guess the file names and access sensitive data. Additionally, the file permissions might be insecure, allowing unauthorized access.

**Recommendations:**

*Use the `tempfile.mkstemp()` function to create temporary files with more secure permissions and unpredictable names.
*Ensure that the temporary files are deleted after they are no longer needed.

**Recommended code:**

```python
import tempfile

def write_results(results):
    with tempfile.NamedTemporaryFile(mode="w+", delete=False) as f:
        filename = f.name
        f.write(results)
    print("Results written to", filename)
    # ... process the file ...
    os.remove(filename)  # Delete the file when done
```

### file_name: Bulk/example_100.py

**Vulnerability Name:** Weak Key Generation Algorithm

**Vulnerability:** The `DSA.generate(1024)` function generates a DSA key with a length of 1024 bits, which is considered weak and vulnerable to modern cryptanalysis techniques.


**Recommendations:**

*Use a stronger key generation algorithm, such as RSA or ECC, with a key length of at least 2048 bits.
*Consider using libraries that provide secure defaults for key generation.

**Recommended code:**

```python
from Crypto.PublicKey import RSA


def generate_private_key():
    key = RSA.generate(2048)
    return key.export_key('PEM')  # Export in a standard format
```

### file_name: Bulk/example_101.py

**Vulnerability Name:** Insecure Encryption Mode and Static IV

**Vulnerability:** The code uses AES in CBC mode with a static initialization vector (IV). Using a static IV weakens the encryption and makes it vulnerable to certain attacks, especially when encrypting multiple messages with the same key.

**Recommendations:**

*Use a random IV for each encryption operation.
*Consider using more secure modes of operation, such as GCM or CTR, which provide authenticated encryption.

**Recommended code:**

```python
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

def getEncryptionKey(data, key):
    iv = get_random_bytes(AES.block_size)
    cipher = AES.new(key, AES.MODE_GCM, iv)
    ciphertext, tag = cipher.encrypt_and_digest(data)
    return iv, ciphertext, tag
```

### file_name: Bulk/example_102.py

**Vulnerability Name:** Insecure FTP Usage and Hardcoded Credentials

**Vulnerability:** The code uses FTP, which transmits data in plaintext, including usernames and passwords. Additionally, the credentials are hardcoded in the script, making them easily discoverable.

**Recommendations:**


*Use a secure file transfer protocol, such as SFTP or FTPS, which encrypts the communication.
*Avoid hardcoding credentials and instead use environment variables or a secure credential storage mechanism.

**Recommended code:**

```python
import pysftp  # Assuming you have the pysftp library installed

def downloadFile(hostname, username, password, sensitive_file):
    with pysftp.Connection(hostname, username=username, password=password) as sftp:
        sftp.get(sensitive_file, preserve_mtime=True)
```

### file_name:Bulk/example_103.py

**Vulnerability Name:**  Potential Issues with Password Hashing

**Vulnerability:** While PBKDF2 is a good choice for password hashing, the code uses a static salt, which reduces its effectiveness against rainbow table attacks.

**Recommendations:**

*Use a unique salt for each password. This can be achieved by generating a random salt for each user and storing it securely alongside the hashed password.

**Recommended code:**

```python
import os

def getHash(password):
    salt = os.urandom(16)  # Generate a random salt
    hash = pbkdf2_hmac('sha256', password, salt, 100000)
    return salt, hash  # Store the salt with the hash
```

### file_name: Bulk/example_104.py

**Vulnerability Name:** Hardcoded and Weak Admin Password

**Vulnerability:** The code uses a hardcoded and easily guessable password for the admin user. This makes the system extremely vulnerable to unauthorized access.

**Recommendations:**

*Never hardcode passwords, especially for privileged accounts.
*Enforce strong password policies that require a minimum length, complexity, and regular password changes.
*Consider using multi-factor authentication for additional security.

**Recommended code:**

```python
# Implement a secure password storage and verification mechanism, such as bcrypt or Argon2
# and avoid hardcoding passwords.
```

### file_name: Bulk/example_105.py

**Vulnerability Name:** Open Redirect Vulnerability

**Vulnerability:** The `flask_redirect` function redirects the user to any URL provided in the `next` parameter without validation. This can be exploited by attackers to redirect users to malicious websites.

**Recommendations:**

*Validate the `next` parameter to ensure it points to a trusted domain or a specific set of allowed URLs.

**Recommended code:**

```python
from urllib.parse import urlparse

@app.route('flask_redirect')
def flask_redirect():
    url = request.args.get("next")
    parsed_url = urlparse(url)
    if parsed_url.netloc in allowed_domains:  # Replace with your allowed domains
        return redirect(url)
    else:
        return "Invalid redirect URL"
```

### file_name: Bulk/example_106.py

**Vulnerability Name:** Potential XXE (XML External Entity) Vulnerability

**Vulnerability:** The code parses XML data from the request without disabling external entity processing. This could allow attackers to inject malicious entities that can lead to information disclosure, denial of service, or server-side request forgery (SSRF) attacks.

**Recommendations:**

*   Disable external entity processing when parsing XML data.
*   Use a library that provides secure XML parsing options.

**Recommended code:**

```python
from defusedxml import lxml

@app.post("/upload")
def upload():
    xml_src = request.get_data()
    parser = lxml.XMLParser(resolve_entities=False)
    doc = lxml.fromstring(xml_src, parser=parser)
    return lxml.tostring(doc)

```

### file_name: Bulk/example_107.py

**Vulnerability Name:** Potential XXE and Insecure Password Handling

**Vulnerability:** Similar to example_106.py, this code is vulnerable to XXE attacks due to insecure XML parsing. Additionally, the `checkPassword` function is not implemented, and it's unclear how passwords are stored and verified.

**Recommendations:**

*Address the XXE vulnerability as described in example_106.py.
*Implement a secure password storage and verification mechanism, such as bcrypt or Argon2.
*Avoid storing passwords in plaintext or using weak hashing algorithms.

**Recommended code:**

```python
# Implement secure XML parsing as shown in example_106.py
# Implement a secure password storage and verification mechanism
```
"""

## Parse the response and export it to CSV for further analysis

To parse the response, we use regular expression pattern matching, which is easy to configure and maintain.


In [None]:
import re

import pandas as pd


def extract_vulnerability_data(text):
    """Parses the model's markdown report into a DataFrame, one row per file."""
    # Regular expressions for extracting data
    file_pattern = r"###\s*file_name:\s*(.*\.py)"  # Adjusted for "file_name"
    vulnerability_name_pattern = r"\*\*Vulnerability Name:\*\*\s*(.*)"
    vulnerability_pattern = r"\*\*Vulnerability:\*\*\s*(.*?)(?=\*\*Recommendations)"
    recommendation_pattern = r"\*\*Recommendations:\*\*\s*((?:\*.*\n)+)"
    code_pattern = r"```python(.*?)```"  # Pattern for recommended code

    def first_group(pattern, report, flags=0):
        # Return the first captured group, or "N/A" if the field is absent
        found = re.search(pattern, report, flags)
        return found.group(1).strip() if found else "N/A"

    data = []

    # Iterate through each vulnerability report (one per "###" heading)
    for match in re.finditer(r"###.*?(?=###|$)", text, re.DOTALL):
        report = match.group(0)

        file_name = first_group(file_pattern, report)
        if file_name == "N/A":
            continue  # Not a file report, so skip it

        data.append(
            {
                "File Number": file_name.split("_")[-1].split(".")[0],
                "File Name": file_name,
                "Vulnerability Name": first_group(vulnerability_name_pattern, report),
                "Description": first_group(vulnerability_pattern, report, re.DOTALL),
                # No re.DOTALL here, so the match ends with the bullet list
                # instead of running on into the recommended code
                "Recommendations": first_group(recommendation_pattern, report),
                "Recommended Code": first_group(code_pattern, report, re.DOTALL),
            }
        )

    return pd.DataFrame(data)


# Extract data and create DataFrame
df = extract_vulnerability_data(response_text)

# Display DataFrame in Colab
display(df)

# Save DataFrame to CSV file
df.to_csv("vulnerability_report_BULK.csv", index=False)

print("CSV file 'vulnerability_report_BULK.csv' created successfully.")
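
To see how the report splitting behaves in isolation, the sketch below applies the same `###` lookahead pattern to a two-report sample using only the standard library. The sample text and the `split_reports` helper are made up for illustration.

```python
import re

# Hypothetical two-report sample in the same layout as the model response
SAMPLE = """Intro line
### file_name: demo/a.py
**Vulnerability Name:** Example A
### file_name: demo/b.py
**Vulnerability Name:** Example B
"""


def split_reports(text):
    """Returns one chunk per '###' heading, each running to the next heading."""
    return [m.group(0) for m in re.finditer(r"###.*?(?=###|$)", text, re.DOTALL)]


reports = split_reports(SAMPLE)
names = [re.search(r"file_name:\s*(.*\.py)", r).group(1) for r in reports]
print(names)
```

Text that appears before the first `###` heading is ignored, which is why an introductory title line in the response does not produce a row of its own.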

In [None]:
# Download the created CSV file (requires the 'google.colab' library, so this cell runs only in Colab)
from google.colab import files

files.download("vulnerability_report_BULK.csv")


## Parse the response and export it into JSON output.

NOTE: The following process is just one way to capture the response text and parse it with regular expressions. Gemini 1.5 can also be configured to return its response as JSON directly.
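
As a hedged illustration of the JSON route mentioned in the note: the Vertex AI SDK's `GenerationConfig` accepts a `response_mime_type` argument that can be set to `"application/json"`. The sketch below does not call the model. It assumes a hypothetical JSON payload shaped like the fields in this notebook's prompt, and shows how such a response could be validated with the standard library. `REQUIRED_KEYS`, `parse_findings`, and the sample payload are illustrative assumptions, not output from Gemini.

```python
import json

# Assumed field names, mirroring the structure requested in the prompt
REQUIRED_KEYS = {"file_name", "vulnerability_name", "vulnerability", "recommendations"}

# Hypothetical payload; the second entry is deliberately incomplete
sample_response = """[
  {"file_name": "bulk/example_1.py",
   "vulnerability_name": "Information Exposure",
   "vulnerability": "Traceback returned to the user.",
   "recommendations": ["Return a generic error message."]},
  {"file_name": "bulk/example_2.py"}
]"""


def parse_findings(raw):
    """Parses a JSON array and keeps only entries that have every required key."""
    findings = json.loads(raw)
    return [f for f in findings if REQUIRED_KEYS.issubset(f)]


valid = parse_findings(sample_response)
print(len(valid), "valid finding(s)")
```

Validating the keys before building a report keeps an incomplete entry from silently producing a row with missing columns.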

In [None]:
# Extract data and write it to a JSON file (requires `response_text` from above)
data = extract_vulnerability_data(response_text)

# orient="records" writes one JSON object per vulnerability
data.to_json("vulnerabilities.json", orient="records", indent=4)

print("Vulnerability data extracted and saved to vulnerabilities.json")

In [None]:
from google.colab import files

files.download("vulnerabilities.json")


## Conclusions

In this notebook, we leveraged Gemini 1.5 Pro's code scanning and code generation capabilities to:
1. Analyze multiple Python files for potential vulnerabilities
2. Provide recommendations, both in prose and as code
3. Export the details to CSV and JSON for further analysis

The scope of this experiment is limited to identifying issues and suggesting helpful, contextual modifications. Automating remediation or fitting the findings into a review workflow would belong in a more mature tool and has not been considered as part of this experiment. While Gemini 1.5 Pro demonstrates promising capabilities in code analysis, this approach is still experimental. We believe it is important to keep exploring the potential of this technology for vulnerability detection, and to continue development and validation efforts before it can be considered a robust security tool.