# BinT5
Notebook showing the uses and limitations of BinT5.

**Paper**: https://arxiv.org/pdf/2301.01701

## Install Dependencies

In [14]:
%pip install pandas scikit-learn transformers[torch]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.
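The `huggingface/tokenizers` warning in the output above asks for `TOKENIZERS_PARALLELISM` to be set explicitly. A minimal sketch, assuming it runs in a cell before any tokenizer is used and that disabling parallelism is acceptable for this notebook:

```python
import os

# Set the variable named in the warning before `tokenizers` is first used
# in this process; "false" disables tokenizer-level parallelism.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Choosing `"false"` trades some tokenization speed for a quieter, fork-safe run; `"true"` is the other value the warning accepts.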


## Imports and Constants

In [20]:
import json
import pandas as pd
import re
import warnings

from pathlib import Path
from sklearn.metrics import precision_score, recall_score, f1_score
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM, PreTrainedModel, \
    PreTrainedTokenizer, PreTrainedTokenizerFast
from typing import Final, Literal

warnings.filterwarnings('ignore')

SQLITE_DATASET: Final[Path] = Path(
    '/') / 'workspace' / 'storage' / 'SQLite' / 'vuln_sqlite_decomp_2_dedup.csv'
'''
Dataset of source code/non-stripped decompiled code pairs where the source code has been modified to contain simple
vulnerabilities (see ~/dataset_scripts/create_vulns.py).
'''


## Load SQLite Dataset

In [16]:
df = pd.read_csv(SQLITE_DATASET)
df

Unnamed: 0.1,Unnamed: 0,repo,benign_src_code,benign_decompiled_code,vuln_src_file,vuln_src_code,vuln_decompiled_code,cwes
0,0,sqlite,static int jsonCacheInsert(\n sqlite3_context...,\nundefined8 jsonCacheInsert(undefined8 param_...,/root/motivation_repos/vulnerable2/sqlite/src/...,static int jsonCacheInsert(\n sqlite3_context...,\nundefined8 jsonCacheInsert(undefined8 param_...,"['//VULNERABILITY: CWE-125, Buffer Over-read']"
1,0,sqlite,"static int jsonStringGrow(JsonString *p, u32 N...","\nundefined8 jsonStringGrow(long param_1,uint ...",/root/motivation_repos/vulnerable2/sqlite/src/...,"static int jsonStringGrow(JsonString *p, u32 N...","\nundefined8 jsonStringGrow(long param_1,uint ...",['//VULNERABILITY: CWE-120']
2,0,sqlite,static SQLITE_NOINLINE void jsonStringExpandAn...,"\nvoid jsonStringExpandAndAppend(long param_1,...",/root/motivation_repos/vulnerable2/sqlite/src/...,\n//VULNERABILITY: CWE-120\nstatic SQLITE_NOIN...,"\nvoid jsonStringExpandAndAppend(long param_1,...",['//VULNERABILITY: CWE-120']
3,0,sqlite,"static void jsonAppendRaw(JsonString *p, const...","\nvoid jsonAppendRaw(long param_1,undefined8 p...",/root/motivation_repos/vulnerable2/sqlite/src/...,//VULNERABILITY: CWE-125 (Buffer Over-read)\ns...,"\nvoid jsonAppendRaw(long param_1,undefined8 p...",['//VULNERABILITY: CWE-125 (Buffer Over-read)']
4,0,sqlite,"static void jsonAppendRawNZ(JsonString *p, con...","\nvoid jsonAppendRawNZ(long param_1,undefined8...",/root/motivation_repos/vulnerable2/sqlite/src/...,// Original code\nstatic void jsonAppendRawNZ(...,"\nvoid jsonAppendRawNZ(long param_1,undefined8...","['//VULNERABILITY: CWE-125, Buffer Overflow - ..."
5,0,sqlite,static SQLITE_NOINLINE void jsonAppendCharExpa...,"\nvoid jsonAppendCharExpand(long param_1,undef...",/root/motivation_repos/vulnerable2/sqlite/src/...,//VULNERABILITY: CWE-125 (Buffer Overflow)\nst...,"\nvoid jsonAppendCharExpand(long param_1,undef...",['//VULNERABILITY: CWE-125 (Buffer Overflow)']
6,0,sqlite,"static void jsonAppendChar(JsonString *p, char...","\nvoid jsonAppendChar(long param_1,char param_...",/root/motivation_repos/vulnerable2/sqlite/src/...,"static void jsonAppendChar(JsonString *p, char...","\nvoid jsonAppendChar(long param_1,char param_...",['//VULNERABILITY: CWE-125']
7,0,sqlite,"static void jsonAppendString(JsonString *p, co...","\nvoid jsonAppendString(long param_1,char *par...",/root/motivation_repos/vulnerable2/sqlite/src/...,"static void jsonAppendString(JsonString *p, co...","\nvoid jsonAppendString(long param_1,char *par...",['//VULNERABILITY: CWE-1201']
8,0,sqlite,static void jsonReturnString(\n JsonString *p...,"\nvoid jsonReturnString(undefined8 *param_1,lo...",/root/motivation_repos/vulnerable2/sqlite/src/...,static void jsonReturnString(\n JsonString *p...,"\nvoid jsonReturnString(undefined8 *param_1,lo...","['//VULNERABILITY: CWE-129, Buffer Overflow', ..."
9,0,sqlite,static void jsonParseReset(JsonParse *pParse){...,\nvoid jsonParseReset(undefined8 *param_1)\n\n...,/root/motivation_repos/vulnerable2/sqlite/src/...,\n//VULNERABILITY: CWE-125\nstatic void jsonPa...,\nvoid jsonParseReset(undefined8 *param_1)\n\n...,['//VULNERABILITY: CWE-125']
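Each `cwes` cell holds a stringified Python list of vulnerability comments. The sketch below shows one way to turn such a cell into a list of CWE identifiers; the helper name `extract_cwe_ids` is hypothetical, and the regular expression is the same `CWE-\d+` pattern used in the evaluation cell later in this notebook.

```python
import ast
import re


def extract_cwe_ids(cwes_cell: str) -> list[str]:
    """Return the unique CWE identifiers in one `cwes` cell, in order of appearance."""
    try:
        comments = ast.literal_eval(cwes_cell)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: fall back to scanning the raw text
        comments = [cwes_cell]
    if not isinstance(comments, (list, tuple)):
        comments = [comments]
    ids: list[str] = []
    for comment in comments:
        for cwe in re.findall(r'CWE-\d+', str(comment)):
            if cwe not in ids:
                ids.append(cwe)
    return ids


# Cell value copied from the first row of the dataset preview
print(extract_cwe_ids("['//VULNERABILITY: CWE-125, Buffer Over-read']"))
```

Note that some labels in the preview look inconsistent (for example `CWE-1201`, or `CWE-125` described as a buffer overflow); this helper extracts them as written and does not validate them.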


## Experiment
This experiment demonstrates that existing models cannot effectively summarize vulnerable decompiled code.

BinT5 has [five variants](https://huggingface.co/AISE-TUDelft/BinT5-Decom):
1. [BinT5-C](https://huggingface.co/AISE-TUDelft/BinT5-C): BinT5 trained on C source code.
2. [BinT5-Decom](https://huggingface.co/AISE-TUDelft/BinT5-Decom): BinT5 trained on non-stripped decompiled code.
3. [BinT5-Stripped](https://huggingface.co/AISE-TUDelft/BinT5-Stripped): BinT5 trained on stripped decompiled code.
4. [BinT5-Demi](https://huggingface.co/AISE-TUDelft/BinT5-Demi): BinT5 trained on demi-stripped decompiled code.
5. [BinT5-NoFunName](https://huggingface.co/AISE-TUDelft/BinT5-NoFunName): BinT5 trained on mostly non-stripped decompiled code, but with the function name stripped.
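For switching between variants in code, the list above can be captured as a small lookup table. This is only a sketch: the dictionary and the helper `bint5_checkpoint` are hypothetical names, and the hub IDs are read off the model links in the list.

```python
# Hub IDs taken from the model links listed above
BINT5_VARIANTS = {
    'C': 'AISE-TUDelft/BinT5-C',
    'Decom': 'AISE-TUDelft/BinT5-Decom',
    'Stripped': 'AISE-TUDelft/BinT5-Stripped',
    'Demi': 'AISE-TUDelft/BinT5-Demi',
    'NoFunName': 'AISE-TUDelft/BinT5-NoFunName',
}


def bint5_checkpoint(variant: str) -> str:
    """Map a short variant name to its Hugging Face hub ID."""
    try:
        return BINT5_VARIANTS[variant]
    except KeyError:
        raise ValueError(
            f'Unknown BinT5 variant {variant!r}; expected one of {sorted(BINT5_VARIANTS)}'
        ) from None


print(bint5_checkpoint('Stripped'))
```

The returned ID can be passed to `AutoModelForSeq2SeqLM.from_pretrained`, as the experiment cell does for the `Decom` and `C` variants.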

We load `BinT5`, `CodeT5`, and `CodeLLaMA`, then collect the summaries each model generates for the benign and vulnerable SQLite functions, in both source and decompiled form.

In [18]:
def run_experiment(model: PreTrainedModel,
                   tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast,
                   causal_lm: bool, num_samples: int) -> dict[Literal['benign', 'vulnerable'], list[dict[Literal['src_response', 'decompiled_response', 'vulnerability', 'src_code', 'decompiled_code'], str]]]:
    '''Summarize the first `num_samples` benign and vulnerable functions, as source and as decompiled code.'''
    out = {'benign': [], 'vulnerable': []}
    for src_column, decompiled_column in [('benign_src_code', 'benign_decompiled_code'),
                                          ('vuln_src_code', 'vuln_decompiled_code')]:
        for src_code, decompiled_code, vuln_strs in zip(df[src_column].head(n=num_samples),
                                                        df[decompiled_column].head(
                                                            n=num_samples),
                                                        df['cwes'].head(n=num_samples)):
            responses = []
            for code, is_decompiled in [(src_code, False), (decompiled_code, True)]:
                if not causal_lm:
                    # Seq2seq models receive the raw code as input
                    input_ids = tokenizer(
                        code, return_tensors="pt").input_ids
                else:
                    # Causal instruct models receive a short chat prompt
                    messages = [
                        {"role": "user", "content": f'Can you summarize this {"decompiled" if is_decompiled else "source"} code? ' +
                         'Also, if there is a vulnerability in this code, please describe the CWE.'},
                        {"role": "assistant",
                            "content": 'Of course, I am a cybersecurity expert.'},
                        {"role": "user",
                         "content": f'Great! Here is the code:\n```c{code}\n```'}
                    ]
                    input_ids = tokenizer.apply_chat_template(
                        messages, return_tensors="pt")
                # Generate the response on the same device as the model
                generated_ids = model.generate(
                    input_ids.to(model.device), max_new_tokens=300)  # type: ignore
                response = tokenizer.decode(
                    generated_ids[0], skip_special_tokens=True)
                if causal_lm:
                    # Keep only the text after the last instruction marker
                    response = response.split('[/INST]')[-1]
                responses.append(response)
            src_response, decompiled_response = responses
            if 'vuln' in decompiled_column:
                # Keep the CWE comments recorded for this vulnerable sample
                out['vulnerable'].append({
                    'src_response': src_response,
                    'decompiled_response': decompiled_response,
                    'vulnerability': vuln_strs,
                    'src_code': src_code,
                    'decompiled_code': decompiled_code
                })
            else:
                out['benign'].append({
                    'src_response': src_response,
                    'decompiled_response': decompiled_response,
                    'vulnerability': 'benign',
                    'src_code': src_code,
                    'decompiled_code': decompiled_code
                })
    return out  # type: ignore


num_samples = len(df)
metrics = {}
metrics['BinT5'] = run_experiment(
    AutoModelForSeq2SeqLM.from_pretrained('AISE-TUDelft/BinT5-Decom', device_map='auto'),
    AutoTokenizer.from_pretrained('Salesforce/codet5-large'),
    causal_lm=False, num_samples=num_samples)
# NOTE: this entry is labeled 'CodeT5' but loads the BinT5-C checkpoint (the C source code variant listed above)
metrics['CodeT5'] = run_experiment(
    AutoModelForSeq2SeqLM.from_pretrained('AISE-TUDelft/BinT5-C', device_map='auto'),
    AutoTokenizer.from_pretrained('Salesforce/codet5-large'),
    causal_lm=False, num_samples=num_samples)
metrics['CodeLLaMA'] = run_experiment(
    AutoModelForCausalLM.from_pretrained('codellama/CodeLlama-7b-Instruct-hf', device_map='auto'),
    AutoTokenizer.from_pretrained('codellama/CodeLlama-7b-Instruct-hf'),
    causal_lm=True, num_samples=num_samples)
# Export metrics
Path('/workspace/storage/motivation_results.json').write_text(json.dumps(metrics))

Token indices sequence length is longer than the specified maximum sequence length for this model (686 > 512). Running this sequence through the model will result in indexing errors
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.06s/it]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

1071886
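The log above reports two issues: inputs longer than the tokenizer's 512-token limit, and a missing attention mask. The sketch below shows one way to handle both for the seq2seq models, by truncating at tokenization time and passing the mask to `generate`. The helper name `generate_summary` and the `max_input_tokens` parameter are hypothetical, and truncating changes what the model sees, so results may differ from the ones reported in this notebook.

```python
def generate_summary(model, tokenizer, code: str,
                     max_input_tokens: int = 512,
                     max_new_tokens: int = 300) -> str:
    """Summarize `code`, truncating long inputs and passing an attention mask."""
    # truncation=True with max_length caps the input at the stated limit
    encoded = tokenizer(code, return_tensors="pt",
                        truncation=True, max_length=max_input_tokens)
    generated_ids = model.generate(
        input_ids=encoded["input_ids"].to(model.device),
        attention_mask=encoded["attention_mask"].to(model.device),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)
```

With the objects from the experiment cell, a call would look like `generate_summary(model, tokenizer, df['vuln_decompiled_code'][0])`.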

## Evaluation

We score the responses of `BinT5`, `CodeT5`, and `CodeLLaMA` as a binary vulnerability-detection task over the benign and vulnerable SQLite functions, separately for source code and decompiled code:

In [29]:
# Benign comes before vulnerable
# 0 - non-vulnerable; 1 - vulnerable
ground_truth = [0] * len(df)
ground_truth.extend([1] * len(df))

for llm in ['BinT5', 'CodeT5', 'CodeLLaMA']:
    # Collect CWEs for each vulnerable entry
    vuln_strs = [re.findall(r'CWE-\d+', metrics[llm]['vulnerable'][idx]['vulnerability'])
                 for idx in range(len(metrics[llm]['vulnerable']))]
    for response_key in ['src_response', 'decompiled_response']:
        benign_responses = [metrics[llm]['benign'][idx][response_key]
                            for idx in range(len(metrics[llm]['benign']))]
        # For the benign predictions, any mention of "cwe" or "vuln" in the response
        # counts as a vulnerable prediction
        benign_preds = [1 if any(i in r.lower() for i in ['cwe', 'vuln']) else 0
                        for r in benign_responses]
        vuln_responses = [metrics[llm]['vulnerable'][idx][response_key]
                          for idx in range(len(metrics[llm]['vulnerable']))]
        # For the vulnerable predictions, the response must mention at least one of
        # the expected CWEs to count as a vulnerable prediction
        vuln_preds = [1 if any(i in r.lower() for i in [c.lower() for c in vuln_strs[idx]]) else 0
                      for idx, r in enumerate(vuln_responses)]
        preds = benign_preds
        preds.extend(vuln_preds)
        print(
            f'{llm} ({"Source Code" if response_key == "src_response" else "Decompiled Code"}):')
        print(f'Precision: {precision_score(ground_truth, preds)}, ' +
              f'Recall: {recall_score(ground_truth, preds)} ' +
              f'F1: {f1_score(ground_truth, preds)}')

BinT5 (Source Code):
Precision: 0.0, Recall: 0.0 F1: 0.0
BinT5 (Decompiled Code):
Precision: 0.0, Recall: 0.0 F1: 0.0
CodeT5 (Source Code):
Precision: 0.0, Recall: 0.0 F1: 0.0
CodeT5 (Decompiled Code):
Precision: 0.0, Recall: 0.0 F1: 0.0
CodeLLaMA (Source Code):
Precision: 0.4583333333333333, Recall: 0.4583333333333333 F1: 0.4583333333333333
CodeLLaMA (Decompiled Code):
Precision: 0.0, Recall: 0.0 F1: 0.0
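The scoring rule behind these numbers can be illustrated on single responses. The sketch below restates the two rules from the evaluation cell as standalone functions; the function names and the `strict` flag are hypothetical. The `strict` variant is not part of the original evaluation: it only shows how a lookahead avoids counting `CWE-1201` as a match for `CWE-120`, which plain substring matching allows.

```python
import re


def predict_benign(response: str) -> int:
    """Return 1 if a response to benign code mentions "cwe" or "vuln" at all."""
    text = response.lower()
    return int(any(marker in text for marker in ('cwe', 'vuln')))


def predict_vulnerable(response: str, expected_cwes: list[str], strict: bool = False) -> int:
    """Return 1 if the response names at least one of the expected CWEs."""
    text = response.lower()
    for cwe in expected_cwes:
        if strict:
            # Assumed stricter variant: the ID must not be followed by another digit
            if re.search(re.escape(cwe.lower()) + r'(?!\d)', text):
                return 1
        elif cwe.lower() in text:
            # Substring rule used in the evaluation cell
            return 1
    return 0


print(predict_benign('Appends a character to the JSON string.'))
print(predict_benign('No vulnerability was found in this function.'))
print(predict_vulnerable('This looks like CWE-1201.', ['CWE-120']))
print(predict_vulnerable('This looks like CWE-1201.', ['CWE-120'], strict=True))
```

The second call shows a property of the benign rule worth keeping in mind when reading the scores: a response that explicitly says no vulnerability exists still counts as a vulnerable prediction, because it contains "vuln".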


## Summary
Although BinT5 can summarize what the SQLite functions do, it never mentions any of the introduced vulnerabilities, in either source or decompiled code. Under the scoring above, only CodeLLaMA identifies the injected CWEs, and only on source code (F1 ≈ 0.46); on decompiled code every model scores 0. Identifying potential security concerns in decompiled code is an important task in reverse engineering. Motivated by this limitation, we will fine-tune a similar model on vulnerable decompiled code so that it can describe the vulnerabilities that code contains.