# Code Vulnerability Scanning & Automated Remediation using Vertex AI Gemini API (Gemini 1.5 Pro)

---



### Use Case

### Where LLMs meet security: run 100+ Python code files in a batch against Gemini to identify vulnerabilities and remediations

### Objectives

This notebook provides a guide to building a code vulnerability scanning and remediation engine using Gemini 1.5 Pro.

Step-by-step process:



1. Access and iterate through a set of Python files stored in a Google Cloud Storage bucket (this example uses 10 files).
2. Merge the contents of all Python files into a single text string.
3. Demarcate each file's content using the actual filename as a separator.
4. Submit the consolidated text string as a single prompt to the Gemini 1.5 Pro large language model. *This leverages the model's extended context window to process multiple files in a single call.*
5. Capture and parse the response generated by the Gemini 1.5 Pro model.
6. Transform the parsed LLM response into both CSV and JSON formats for further examination and analysis.



### Gemini

Gemini is a family of generative AI models developed by Google DeepMind and designed for multimodal use cases. The Gemini API gives you access to the Gemini 1.0 Pro Vision, Gemini 1.0 Pro, and Gemini 1.5 Pro models.

### Dataset

The sample data files for this notebook were taken from an openly accessible Git repository. For your own purposes, you can download files from any similar repository.

Example repo: https://github.com/s2e-lab/SecurityEval/tree/main


For this specific use case, the dataset can be made available in two ways:

1. Individual .py/.java files uploaded to a GCS bucket.
2. A single JSONL file in which every line contains a JSON object with the following fields:

*   ID: unique identifier of the sample.
*   Prompt: prompt for the code generation model.
*   Insecure_code: code of the vulnerability example.

While the dataset can vary, the approach to handling the use case remains the same.
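
For the JSONL variant, the combined prompt text could be built along the following lines. This is a minimal sketch, not part of the original notebook: the function name `combine_jsonl_samples` and the two sample records are made up for illustration, and it assumes each line carries the `ID` and `Insecure_code` fields listed above. It reuses the `### File: ... ###` separator that the GCS cell in this notebook produces.

```python
import json


def combine_jsonl_samples(jsonl_text):
    """Build one prompt-ready string from JSONL text, one sample per line."""
    parts = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)
        # Use the sample ID as the separator, mirroring the per-file header
        parts.append(f"### File: {sample['ID']} ###\n{sample['Insecure_code']}\n")
    return "".join(parts)


# Two made-up samples, for illustration only
example_jsonl = "\n".join(
    [
        json.dumps({"ID": "sample_1.py", "Prompt": "p1", "Insecure_code": "print(1)"}),
        json.dumps({"ID": "sample_2.py", "Prompt": "p2", "Insecure_code": "print(2)"}),
    ]
)
combined = combine_jsonl_samples(example_jsonl)
print(combined)
```

The resulting string can be passed to the prompt template in the same way as the string built from individual files.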


### Getting Started

### Install Vertex AI SDK for Python and Google Storage


In [None]:
! pip3 install  -q --upgrade --user google-cloud-aiplatform google-cloud-storage


### Restart the kernel

To use the newly installed packages, you must restart the current runtime.

In [None]:
# Restart the current runtime so the newly installed packages can be imported
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Authenticate the notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).

In [None]:
import sys

# Authenticate only when running on Google Colab
if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Define project information & initialize Vertex AI



In [None]:
# initialize variables
PROJECT_ID = "cloud-llm-preview1"
REGION = "us-central1"
BUCKET_NAME = "iamsouvik_gemini_code_scaning"
PREFIX = "Bulk/"

In [None]:
# import and initialize Vertex AI
import vertexai

vertexai.init(project=PROJECT_ID, location=REGION)

### Read the .py files from a GCS bucket and combine them into a single string


In [None]:
# This cell reads the files from the GCS bucket, iterates through them, and combines them into a single string.
# It also demarcates each file's content using the actual filename as a separator.
# For example purposes we use just 10 files; the same process can scan hundreds of files
# by taking advantage of the long context window.
# The sample files were taken from an openly accessible Git repo. For your own purposes,
# you can download files from any similar repo: https://github.com/s2e-lab/SecurityEval/tree/main

from google.cloud import storage


def process_py_files(bucket_name, prefix):
    """
    Reads .py files from a GCS bucket and combines them into a single string.
    Args:
      bucket_name: Name of the GCS bucket to read from.
      prefix: Object-name prefix ("folder") to list within the bucket.
    Returns:
      A string containing the combined content of all .py files.
    """

    storage_client = storage.Client()
    # Use the function arguments rather than the module-level constants
    bucket = storage_client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=prefix)

    combined_text = ""
    for blob in blobs:
        if blob.name.endswith(".py"):
            file_content = blob.download_as_bytes().decode("utf-8")
            # Prefix each file with a header so the model can tell the files apart
            combined_text += f"### File: {blob.name} ###\n{file_content}\n"

    return combined_text


combined_string = process_py_files(BUCKET_NAME, PREFIX)
print(combined_string)

### File: Bulk/example_1.py ###
from flask import Flask
app = Flask(__name__)


import traceback

def do_computation():
    raise Exception("Secret info")

@app.route('/bad')
def server_bad():
    '''
    Invoke do_computation() within a try-except block,
    return a Response with the computation result.
    '''
    try:
        do_computation()
    except Exception as e:
        return traceback.format_exc()
### File: Bulk/example_10.py ###
from tempfile import mktemp


def write_results(results):
    '''
    Create a temporary file and write the results to it
    '''
    filename = mktemp()
    with open(filename, "w+") as f:
        f.write(results)
    print("Results written to", filename)
### File: Bulk/example_100.py ###
from Crypto.PublicKey import DSA

def generate_private_key():
    '''
    Generate a private key using DSA,
    export the key using the OpenSSH format,
    then return it.
    '''
    return DSA.generate(1024)
### File: Bulk/example_101.py ###
from Crypto.Ciphe
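
The cell above places every file in one string. When scanning hundreds of files, that string may grow beyond what fits in a single request. One possible way to handle this is to group the files into batches under a size budget and send one prompt per batch. The sketch below is an illustration only: `batch_files` is a hypothetical helper, and the character budget is an arbitrary stand-in for a real token limit, which would need to be measured against the model.

```python
def batch_files(files, max_chars):
    """Group (name, content) pairs into batches whose combined text stays under max_chars."""
    batches, current, current_len = [], [], 0
    for name, content in files:
        entry = f"### File: {name} ###\n{content}\n"
        # Start a new batch when adding this file would exceed the budget
        if current and current_len + len(entry) > max_chars:
            batches.append("".join(current))
            current, current_len = [], 0
        # A single file larger than the budget still gets a batch of its own
        current.append(entry)
        current_len += len(entry)
    if current:
        batches.append("".join(current))
    return batches


# Made-up file contents, for illustration only
files = [("a.py", "x = 1"), ("b.py", "y = 2"), ("c.py", "z = 3")]
batches = batch_files(files, max_chars=50)
print(len(batches))
```

Each element of `batches` keeps the same `### File: ... ###` separators, so the prompt template and the parsing step would not need to change.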

### Import Generative model library from Vertex AI

In [None]:
# Import the generative_models module from vertexai; it exposes the Gemini model classes

from IPython.display import Markdown, display, Latex
from vertexai.preview.generative_models import GenerativeModel
from vertexai.preview.generative_models import GenerationConfig

### Initialize Gemini 1.5 Pro

In [None]:
model = GenerativeModel("gemini-1.5-pro-preview-0409")
# model = GenerativeModel("gemini-1.0-pro-002")

### Setting up model configuration & Prompt template

In [None]:
# Set up the model configuration and a prompt template with a one-shot example.
# Note: no safety_settings are passed to generate_content below, so the default safety filters apply.

# define the prompt template to be passed to gemini.
context = combined_string

my_prompt = f"""You are an expert code assistant. Review the following code for vulnerabilities and provide recommendations:
{context}

Please format your response using markdown and display it with the following structure:

file_name: Name of the code file
Vulnerability Name: Name of the identified vulnerability
Vulnerability: Description of the vulnerability and its potential impact.
Recommendations: List of actionable steps to mitigate the vulnerability.
Recommended code: Recommended code snippet to remove the suspected vulnerability

I am also providing a sample response output that you should follow:
-------------------------------------------------
**file_name: bulk/example_1.py**

**Vulnerability Name:** Information Exposure

**Vulnerability:** The `server_bad` function returns the entire traceback when an exception occurs. This can leak sensitive information about the application's internals, such as file paths, variable names, and even the type of exception raised. Attackers could exploit this information to gain a deeper understanding of the system and potentially launch further attacks.

**Recommendations:**

*Catch specific exceptions instead of using a broad `Exception` clause.
*Return a generic error message to the user without revealing internal details.
*Log the full traceback for debugging purposes, but do not expose it to the user.

**Recommended code:**

```python
sample code
```
"""  # try your own prompt


generation_config = GenerationConfig(
    temperature=0.5,
    top_p=0.4,
    top_k=24,
    candidate_count=1,
    max_output_tokens=8192,
)
responses = model.generate_content(
    contents=my_prompt,
    generation_config=generation_config,
    stream=True,
)

# Print each streamed chunk without an extra newline so the output reads as continuous text
for res in responses:
    print(res.text, end="")

## Security Analysis of Code Snippets

### file_name: Bulk/example_1.py

**Vulnerability Name:** Information Exposure Through Exception Handling

**Vulnerability:** The `server_bad` function returns the entire traceback when an exception occurs. This can leak sensitive information about the application's internals, such as file paths, variable names, and even the type of exception raised. Attackers could exploit this information to gain a deeper understanding of the system and potentially launch further attacks.

**Recommendations:**

*   Catch specific exceptions instead of using a broad `Exception` clause.
*   Return a generic error message to the user without revealing internal details.
*   Log the full traceback for debugging purposes, but do not expose it to the user.

**Recommended code:**

```python
@app.route('/bad')
def server_bad():
    try:
        result = do_computation()
        return result
    except SpecificException as e:
        # Log the error for debugging
```

### Capture the model response into a single variable
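
The cell that follows stores the model output as a pasted string literal. As an alternative, the streamed chunks could be joined programmatically. The sketch below is an assumption-laden illustration: `collect_stream_text` is a hypothetical helper, and the chunks are stand-in objects exposing a `.text` attribute, as the streamed responses in the earlier cell do. A streamed iterator can typically be consumed only once, so the text would need to be collected in the same loop that prints it, or instead of printing.

```python
from types import SimpleNamespace


def collect_stream_text(chunks):
    """Concatenate the .text of each streamed chunk into one string."""
    return "".join(chunk.text for chunk in chunks)


# Stand-in chunks for illustration; a real call would pass the `responses` iterator
fake_chunks = [
    SimpleNamespace(text="### file_name: Bulk/example_"),
    SimpleNamespace(text="1.py\n\n**Vulnerability Name:** Information Exposure"),
]
joined_text = collect_stream_text(fake_chunks)
print(joined_text)
```

Joining the chunks directly avoids the stray line breaks that appear when each chunk is printed on its own line.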

In [None]:
# capture the model response into a variable.
response_text = """
 Security Analysis of Code Snippets

### file_name: Bulk/example_1.py

**Vulnerability Name:** Information Exposure Through Exception Handling

**Vulnerability:** The `server_bad` function returns the entire traceback when an exception occurs. This can leak sensitive information about the application's internals, such
 as file paths, variable names, and even the type of exception raised. Attackers could exploit this information to gain a deeper understanding of the system and potentially launch
 further attacks.

**Recommendations:**

*Catch specific exceptions instead of using a broad `Exception` clause.
*Return a generic error message to the user without revealing internal details.
*Log the full traceback for debugging purposes, but do not expose it to the user.

**Recommended code:**

```python
@app.route('/bad')
def server_bad():
    try:
        result = do_computation()
        return
 result
    except SpecificException as e:
        # Log the error for debugging
        app.logger.exception("An error occurred during computation")
        return "An error occurred. Please try again later."
```

### file_name: Bulk/example_10.py

**Vulnerability Name:** Insecure Temporary File Creation

**Vulnerability:** The `mktemp()` function creates temporary files with predictable names, which could allow attackers to guess the file names and access sensitive data. Additionally, the file permissions might be insecure, allowing unauthorized access.

**Recommendations:**

*Use the `tempfile .mkstemp()` function to create temporary files with more secure permissions and unpredictable names.
*Ensure that the temporary files are deleted after they are no longer needed.

**Recommended code:**

```python
import tempfile

def write_results(results):
    with tempfile.NamedTemporaryFile(mode="
w+", delete=False) as f:
        filename = f.name
        f.write(results)
    print("Results written to", filename)
    # ... process the file ...
    os.remove(filename)  # Delete the file when done
```

### file_name: Bulk/example_100.py

**Vulnerability Name:** Weak Key Generation Algorithm

**Vulnerability:** The `DSA.generate(1024)` function generates a DSA key with a length of 1024 bits, which is considered weak and vulnerable to modern cryptanalysis techniques.


**Recommendations:**

*Use a stronger key generation algorithm, such as RSA or ECC, with a key length of at least 2048 bits.
*Consider using libraries that provide secure defaults for key generation.

**Recommended code:**

```python
from Crypto.PublicKey import RSA


def generate_private_key():
    key = RSA.generate(2048)
    return key.export_key('PEM')  # Export in a standard format
```

### file_name: Bulk/example_101.py

**Vulnerability Name:** Insecure Encryption Mode and Static IV

**Vulnerability:** The code uses AES in CBC mode with a static initialization vector (IV). Using a static IV weakens the encryption and makes it vulnerable to certain attacks, especially when encrypting multiple messages with the same key.

**Recommendations:**

*Use a random IV for each encryption operation.
*Consider using more secure modes of operation, such as GCM or CTR, which provide authenticated encryption.

**Recommended code:**

```python
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

def getEncryptionKey(data, key):
    iv
 = get_random_bytes(AES.block_size)
    cipher = AES.new(key, AES.MODE_GCM, iv)
    ciphertext, tag = cipher.encrypt_and_digest(data)
    return iv, ciphertext, tag
```

### file_name: Bulk/example_102.py

**Vulnerability Name:** Insecure FTP Usage and Hardcoded Credentials

**Vulnerability:** The code uses FTP, which transmits data in plaintext, including usernames and passwords. Additionally, the credentials are hardcoded in the script, making them easily discoverable.

**Recommendations:**


*Use a secure file transfer protocol, such as SFTP or FTPS, which encrypts the communication.
*Avoid hardcoding credentials and instead use environment variables or a secure credential storage mechanism.

**Recommended code:**

```python
import pysftp  # Assuming you have the pysftp
 library installed

def downloadFile(hostname, username, password, sensitive_file):
    with pysftp.Connection(hostname, username=username, password=password) as sftp:
        sftp.get(sensitive_file, preserve_mtime=True)
```

### file_name:Bulk/example_103.py

**Vulnerability Name:**  Potential Issues with Password Hashing

**Vulnerability:** While PBKDF2 is a good choice for password hashing, the code uses a static salt, which reduces its effectiveness against rainbow table attacks.

**Recommendations:**

*Use a unique salt for each password. This can be achieved by generating a random salt for each user and storing it securely alongside the hashed password.

**Recommended code:**

```python
import os

def getHash(password):
    salt = os.urandom(16)  # Generate a random
 salt
    hash = pbkdf2_hmac('sha256', password, salt, 100000)
    return salt, hash  # Store the salt with the hash
```

### file_name: Bulk/example_104.py

**Vulnerability Name:** Hardcoded and Weak Admin Password

**Vulnerability:** The code uses a hardcoded and easily guessable password for the admin user. This makes the system extremely vulnerable to unauthorized access.

**Recommendations:**

*Never hardcode passwords, especially for privileged accounts.
*Enforce strong password policies that require a minimum length, complexity, and regular password changes.
*Consider using multi-factor authentication for additional security.

**Recommended code:**

```python
# Implement a secure password storage and verification mechanism, such as bcrypt or Argon2
# and avoid hardcoding passwords.
```

### file_name: Bulk/example_105.py

**Vulnerability Name:** Open Redirect Vulnerability

**Vulnerability:** The `flask_redirect` function redirects the user to any URL provided in the `next` parameter without validation. This can be exploited by attackers to redirect users to malicious websites.

**Recommendations:**

*Validate the `next` parameter to ensure it points to a trusted domain or a specific set of allowed URLs.

**Recommended code:**

```python
from urllib.parse import urlparse

@app.route('flask_redirect')
def flask_redirect():
    url = request
.args.get("next")
    parsed_url = urlparse(url)
    if parsed_url.netloc in allowed_domains:  # Replace with your allowed domains
        return redirect(url)
    else:
        return "Invalid redirect URL"
```

### file_name: Bulk/example_106.py

**Vulnerability Name:** Potential XXE (XML External Entity) Vulnerability

**Vulnerability:** The code parses XML data from the request without disabling external entity processing. This could allow attackers to inject malicious entities that can lead to information disclosure, denial of service
, or server-side request forgery (SSRF) attacks.

**Recommendations:**

*   Disable external entity processing when parsing XML data.
*   Use a library that provides secure XML parsing options.

**Recommended code:**

```python
from defusedxml import lxml

@app.post("/
upload")
def upload():
    xml_src = request.get_data()
    parser = lxml.XMLParser(resolve_entities=False)
    doc = lxml.fromstring(xml_src, parser=parser)
    return lxml.tostring(doc)

```

### file_name: Bulk/example_107.py

**Vulnerability Name:** Potential XXE and Insecure Password Handling

**Vulnerability:** Similar to example\_106.py, this code is vulnerable to XXE attacks due to insecure XML parsing. Additionally, the `check
Password` function is not implemented, and it's unclear how passwords are stored and verified.

**Recommendations:**

*Address the XXE vulnerability as described in example\_106.py.
*Implement a secure password storage and verification mechanism, such as bcrypt or Argon2.
*Avoid storing passwords in plaintext or using weak hashing algorithms.

**Recommended code:**

```python
# Implement secure XML parsing as shown in example_106.py
# Implement a secure password storage and verification mechanism
```
"""

### Parse the response and export it to CSV for further analysis

To parse the response we use regular-expression pattern matching, which is easy to configure and maintain.
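
To see how the individual patterns behave, they can be tried on a single shortened report first. The sketch below uses made-up report text in the format requested in the prompt. It applies `re.DOTALL` only to the description pattern, on the assumption that the description may span several lines while each recommendation bullet sits on its own line.

```python
import re

# A shortened report in the format requested in the prompt (made-up content)
sample_report = """### file_name: Bulk/sample.py

**Vulnerability Name:** Example Issue

**Vulnerability:** First line of the description
continues on a second line.

**Recommendations:**

*First recommendation.
*Second recommendation.

**Recommended code:**
"""

name = re.search(r"###\s*file_name:\s*(.*\.py)", sample_report).group(1)
# DOTALL lets the lazy group span line breaks up to the next section header
description = re.search(
    r"\*\*Vulnerability:\*\*\s*(.*?)(?=\*\*Recommendations)", sample_report, re.DOTALL
).group(1).strip()
# No DOTALL here, so each `.*` stops at the end of its own bullet line
bullets = re.search(
    r"\*\*Recommendations:\*\*\s*((?:\*.*\n)+)", sample_report
).group(1).strip()

print(name)
print(description)
print(bullets)
```

If the model deviates from the requested format, `re.search` returns `None` and `.group(1)` raises `AttributeError`, so a production parser would likely want to check each match before using it.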



In [None]:
import re
import pandas as pd


def extract_vulnerability_data(text):
    # Regular expressions for extracting data
    file_pattern = r"###\s*file_name:\s*(.*\.py)"  # Adjusted for "file_name"
    vulnerability_name_pattern = r"\*\*Vulnerability Name:\*\*\s*(.*)"
    vulnerability_pattern = r"\*\*Vulnerability:\*\*\s*(.*?)(?=\*\*Recommendations)"
    recommendation_pattern = r"\*\*Recommendations:\*\*\s*((?:\*.*\n)+)"
    code_pattern = r"```python(.*?)```"  # Added pattern for recommended code

    data = []

    # Iterate through each vulnerability report
    for match in re.finditer(r"###.*?(?=###|$)", text, re.DOTALL):
        report = match.group(0)

        file_name = re.search(file_pattern, report).group(1)
        vulnerability_name = re.search(vulnerability_name_pattern, report).group(1)
        vulnerability = (
            re.search(vulnerability_pattern, report, re.DOTALL).group(1).strip()
        )
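        # No re.DOTALL here: each bullet must end at its own line break, otherwise
        # the match would run on into the "Recommended code" section
        recommendations = re.search(recommendation_pattern, report).group(1).strip()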

        # Extract recommended code, handling potential absence
        code_match = re.search(code_pattern, report, re.DOTALL)
        recommended_code = code_match.group(1).strip() if code_match else "N/A"

        data.append(
            {
                "File Number": file_name.split("_")[-1].split(".")[0],
                "File Name": file_name,
                "Vulnerability Name": vulnerability_name,
                "Description": vulnerability,
                "Recommendations": recommendations,
                "Recommended Code": recommended_code,  # Include recommended code
            }
        )

    return pd.DataFrame(data)


# Extract data and create DataFrame
df = extract_vulnerability_data(response_text)

# Display DataFrame in Colab
display(df)

# Save DataFrame to CSV file
df.to_csv("vulnerability_report_BULK.csv", index=False)

print("CSV file 'vulnerability_report_BULK.csv' created successfully.")

| | File Number | File Name | Vulnerability Name | Description | Recommendations | Recommended Code |
|---|---|---|---|---|---|---|
| 0 | 1 | Bulk/example_1.py | Information Exposure Through Exception Handling | The `server_bad` function returns the entire t... | *Catch specific exceptions instead of using a ... | @app.route('/bad')\ndef server_bad():\n try... |
| 1 | 10 | Bulk/example_10.py | Insecure Temporary File Creation | The `mktemp()` function creates temporary file... | *Use the `tempfile .mkstemp()` function to cre... | import tempfile\n\ndef write_results(results):... |
| 2 | 100 | Bulk/example_100.py | Weak Key Generation Algorithm | The `DSA.generate(1024)` function generates a ... | *Use a stronger key generation algorithm, such... | from Crypto.PublicKey import RSA\n\n\ndef gene... |
| 3 | 101 | Bulk/example_101.py | Insecure Encryption Mode and Static IV | The code uses AES in CBC mode with a static in... | *Use a random IV for each encryption operation... | from Crypto.Cipher import AES\nfrom Crypto.Ran... |
| 4 | 102 | Bulk/example_102.py | Insecure FTP Usage and Hardcoded Credentials | The code uses FTP, which transmits data in pla... | *Use a secure file transfer protocol, such as ... | import pysftp # Assuming you have the pysftp\... |
| 5 | 103 | Bulk/example_103.py | Potential Issues with Password Hashing | While PBKDF2 is a good choice for password has... | *Use a unique salt for each password. This can... | import os\n\ndef getHash(password):\n salt ... |
| 6 | 104 | Bulk/example_104.py | Hardcoded and Weak Admin Password | The code uses a hardcoded and easily guessable... | *Never hardcode passwords, especially for priv... | # Implement a secure password storage and veri... |
| 7 | 105 | Bulk/example_105.py | Open Redirect Vulnerability | The `flask_redirect` function redirects the us... | *Validate the `next` parameter to ensure it po... | from urllib.parse import urlparse\n\n@app.rout... |
| 8 | 106 | Bulk/example_106.py | Potential XXE (XML External Entity) Vulnerability | The code parses XML data from the request with... | * Disable external entity processing when pa... | from defusedxml import lxml\n\n@app.post("/\nu... |
| 9 | 107 | Bulk/example_107.py | Potential XXE and Insecure Password Handling | Similar to example\_106.py, this code is vulne... | *Address the XXE vulnerability as described in... | # Implement secure XML parsing as shown in exa... |


CSV file 'vulnerability_report_BULK.csv' created successfully.


In [None]:
# Download the generated CSV file (works only on Google Colab)
from google.colab import files

files.download("vulnerability_report_BULK.csv")


### Run the same extraction with JSON output

In [None]:
import re
import json


def extract_vulnerability_data(text):
    # Regular expressions for extracting data
    file_pattern = r"###\s*file_name:\s*(.*\.py)"  # Adjusted for "file_name"
    vulnerability_name_pattern = r"\*\*Vulnerability Name:\*\*\s*(.*)"
    vulnerability_pattern = r"\*\*Vulnerability:\*\*\s*(.*?)(?=\*\*Recommendations)"
    recommendation_pattern = r"\*\*Recommendations:\*\*\s*((?:\*.*\n)+)"
    code_pattern = r"```python(.*?)```"  # Added pattern for recommended code

    data = []

    # Iterate through each vulnerability report
    for match in re.finditer(r"###.*?(?=###|$)", text, re.DOTALL):
        report = match.group(0)

        file_name = re.search(file_pattern, report).group(1)
        vulnerability_name = re.search(vulnerability_name_pattern, report).group(1)
        vulnerability = (
            re.search(vulnerability_pattern, report, re.DOTALL).group(1).strip()
        )
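        # No re.DOTALL here: each bullet must end at its own line break, otherwise
        # the match would run on into the "Recommended code" section
        recommendations = re.search(recommendation_pattern, report).group(1).strip()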

        # Extract recommended code, handling potential absence
        code_match = re.search(code_pattern, report, re.DOTALL)
        recommended_code = code_match.group(1).strip() if code_match else "N/A"

        data.append(
            {
                "File Number": file_name.split("_")[-1].split(".")[0],
                "File Name": file_name,
                "Vulnerability Name": vulnerability_name,
                "Description": vulnerability,
                "Recommendations": recommendations,
                "Recommended Code": recommended_code,  # Include recommended code
            }
        )

    return data


# Extract data and write to JSON file (assuming you have 'response_text' defined)
data = extract_vulnerability_data(response_text)

with open("vulnerabilities.json", "w") as f:
    json.dump(data, f, indent=4)

print("Vulnerability data extracted and saved to vulnerabilities.json")

Vulnerability data extracted and saved to vulnerabilities.json
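
The CSV and JSON cells above each define their own copy of the extraction function. One possible alternative is to extract the records once and serialize the same list twice. The sketch below uses only the standard library; the helper names and the sample record are hypothetical, and it assumes every record has the same keys.

```python
import csv
import io
import json


def records_to_csv(records):
    """Serialize a list of dicts with identical keys to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def records_to_json(records):
    """Serialize the same list of dicts to indented JSON text."""
    return json.dumps(records, indent=4)


# Made-up record with a subset of the columns, for illustration only
records = [{"File Name": "Bulk/sample.py", "Vulnerability Name": "Example Issue"}]
csv_text = records_to_csv(records)
json_text = records_to_json(records)
print(csv_text)
print(json_text)
```

Keeping a single extraction step means a fix to the regular expressions only has to be made in one place.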


In [None]:
from google.colab import files

files.download("vulnerabilities.json")


### **IMPORTANT NOTE**: This use case can have multiple variations, as listed below. However, the overall approach can remain the same as showcased in this notebook.

**Scenario 1** (baby step): Run a sample piece of code standalone against Gemini to understand vulnerabilities and remediation.

**Scenario 2** (where LLMs meet security): Run 100+ Python code files in a batch against Gemini to understand vulnerabilities and remediation.

**Scenario 3** (needle in the haystack): Run Gemini against 3 Git repos to identify vulnerabilities and propose remediation.

Although Gemini 1.5 showed impressive performance for scenarios 1 and 2, the third scenario could prove challenging to accomplish from within a notebook environment. Gemini Code Assist with full code awareness may be a better way to handle it.
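
For a first experiment with the repository scenario, the files of a locally cloned repo could be gathered with the same separator convention used for the GCS files. This is only a sketch under assumptions: `combine_local_py_files` is a hypothetical helper, the directory tree is a throwaway example, and whether several whole repositories fit into one request is not addressed here.

```python
import tempfile
from pathlib import Path


def combine_local_py_files(root):
    """Walk a directory tree and combine all .py files into one demarcated string."""
    root = Path(root)
    parts = []
    # sorted() makes the order deterministic across runs
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root).as_posix()
        content = path.read_text(encoding="utf-8")
        parts.append(f"### File: {rel} ###\n{content}\n")
    return "".join(parts)


# Build a tiny throwaway tree to demonstrate; a real run would point at a cloned repo
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pkg").mkdir()
    (Path(tmp) / "pkg" / "a.py").write_text("x = 1\n", encoding="utf-8")
    (Path(tmp) / "README.md").write_text("not python", encoding="utf-8")
    combined_repo_text = combine_local_py_files(tmp)

print(combined_repo_text)
```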





## Conclusions

In this notebook we have leveraged Gemini 1.5 Pro's long context window to:
1. Analyze multiple Python files for potential vulnerabilities.
2. Provide recommendations, both as text and as code.
3. Export the details to CSV and JSON for further analysis.

Although this is a potent approach, it does not automatically apply the recommended fixes or validate them against established benchmarks. The recommendations can, however, be implemented manually and tested against industry-standard benchmarks for deeper analysis.
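
Before reviewing a recommended snippet by hand, a cheap first filter is to check that it at least parses as Python. The sketch below uses the standard-library `ast` module; `is_valid_python` is a hypothetical helper and the two snippets are made up. Passing this check says nothing about whether the fix is correct or secure; it only filters out snippets that are syntactically broken.

```python
import ast


def is_valid_python(snippet):
    """Return True if the snippet parses as Python source, False otherwise."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        # IndentationError is a subclass of SyntaxError, so it is caught here too
        return False


# Made-up snippets: the second has a line break in the wrong place
good = "def f():\n    return 1\n"
bad = "def f():\n    return\n 1\n"
print(is_valid_python(good), is_valid_python(bad))
```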