# Use Case Description

YARA is a well-known tool for identifying and classifying malware samples by describing textual or binary patterns. <br>
But writing out a large number of patterns by hand is onerous. <br>
Let's use our models to create YARA patterns and generate test strings.

## Model used for this use case
The Instruct Model is well suited to generating snippets based on user instructions. <br>
Using a Reasoning Model for this task may be excessive and could result in performance degradation.

## Setup

The setup scripts below are essentially the same as those in the [Quickstart (Instruct Model)](https://github.com/RobustIntelligence/foundation-ai-cookbook/blob/main/1_quickstarts/Preview_Quickstart_instruct_model.ipynb).

In [1]:
import os
import transformers
import torch
from IPython.display import display, Markdown

HF_TOKEN = os.getenv("HF_TOKEN")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [2]:
MODEL_ID = "fdtn-ai/minerva-ift-security" # To be replaced with the final model name

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16, # this model's tensor_type is BF16
    token=HF_TOKEN,
)

In [3]:
generation_args = {
    "max_new_tokens": 1024,
    "temperature": None,
    "repetition_penalty": 1.2,
    "do_sample": False,
    "use_cache": True,
    "eos_token_id": tokenizer.eos_token_id,
    "pad_token_id": tokenizer.pad_token_id,
}

In [4]:
import re

def inference(prompt, system_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    
    inputs = tokenizer(inputs, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            **generation_args,
        )
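    response = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # Extract only the assistant's part of the decoded text.
    match = re.search(r"<\|assistant\|>(.*?)<\|end_of_text\|>", response, re.DOTALL)
    if match is None:
        # Happens, for example, when generation stops at max_new_tokens before the end marker.
        raise ValueError("Could not locate the assistant response in the model output.")
    return match.group(1).strip()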
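
The regex in `inference` needs both the `<|assistant|>` and `<|end_of_text|>` markers, so it finds no match when generation is cut off at `max_new_tokens`. The following is a minimal sketch of a variant that also accepts truncated output. It assumes the same two chat-template markers as the regex above; `extract_assistant_text` and the sample strings are hypothetical names and data used only for illustration.

```python
import re

# Markers copied from the regex in inference(); they are assumed to match
# the model's chat template.
ASSISTANT_RE = re.compile(
    r"<\|assistant\|>(.*?)(?:<\|end_of_text\|>|\Z)", re.DOTALL
)


def extract_assistant_text(decoded: str) -> str:
    """Return the assistant part of a decoded generation.

    Stops at the end marker when present, otherwise at the end of the string.
    """
    match = ASSISTANT_RE.search(decoded)
    if match is None:
        raise ValueError("no <|assistant|> marker found in decoded output")
    return match.group(1).strip()


# Hypothetical decoded outputs: one complete, one cut off mid-generation.
complete = "<|system|>sys<|user|>hi<|assistant|>\nrule a { condition: true }<|end_of_text|>"
truncated = "<|user|>hi<|assistant|>\nrule a { condition:"

print(extract_assistant_text(complete))   # rule a { condition: true }
print(extract_assistant_text(truncated))  # rule a { condition:
```

A truncated rule will most likely fail to compile later, so surfacing the partial text mainly helps with debugging the prompt or the `max_new_tokens` setting.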

## Generate YARA rules

Let’s first create a YARA rule to detect known malicious IPs or file hashes.

In [5]:
SYSTEM_PROMPT = "You are a security expert and your task is to write a draft YARA pattern to detect a simple malware sample."

user_prompt = '''Let's say 192.168.1.100 is a malicious IP and d41d8cd98f00b204e9800998ecf8427e is a file hash.
The YARA rule should flag a given log file if either string is matched exactly. No other conditions.'''

yara_pattern = inference(user_prompt, SYSTEM_PROMPT)
display(Markdown(yara_pattern))

```yara
rule SimpleMalwareDetector {
    meta:
        description = "Detects the presence of specific indicators associated with a known malicious IP or file hash."
        author = "YourName"
        date = "2023-04-01"

    strings:
        $malicious_ip = "192.168.1.100"
        $malicious_hash = "d41d8cd98f00b204e9800998ecf8427e"

    condition:
        ($malicious_ip) or ($malicious_hash)
}
```

This YARA rule named `SimpleMalwareDetector` checks for two distinct patterns:

- `$malicious_ip`: This matches any occurrence of the exact string `"192.168.1.100"` within a binary, process memory, or file content that you're scanning.

- `$malicious_hash`: This looks for the hexadecimal representation `"d41d8cd98f00b204e9800998ecf8427e"`, which represents an MD5 hash commonly used as part of a signature for detection purposes.

When this YARA rule scans through data (like files on disk, running processes' memory, etc.), it will flag anything containing these strings according to its configuration (`--string-match-mode`). 

Please note that depending upon where you want to use this rule (file system, network traffic, process memory), adjustments might be necessary such as using different context types like `fs.data` for file contents, `process.create` for newly created processes, etc., instead of just relying on generic strings without specifying contexts.

Also remember that real-world usage would likely involve more complex logic, including possibly checking against databases of known hashes and IPs, looking at additional attributes beyond mere existence (such as frequency of occurrences, surrounding bytes, etc.) and potentially combining multiple rules based on severity levels or confidence scores.

## Create test logs and check them with real YARA

First, ensure that the YARA pattern you just created conforms to the correct syntax.

In [6]:
match = re.search(r"```yara(.*?)```", yara_pattern, re.DOTALL)
if match is None:
    raise ValueError("No yara code block found in the model response.")
pattern = match.group(1).strip()
print(pattern)

rule SimpleMalwareDetector {
    meta:
        description = "Detects the presence of specific indicators associated with a known malicious IP or file hash."
        author = "YourName"
        date = "2023-04-01"

    strings:
        $malicious_ip = "192.168.1.100"
        $malicious_hash = "d41d8cd98f00b204e9800998ecf8427e"

    condition:
        ($malicious_ip) or ($malicious_hash)
}


In [7]:
import yara
# yara.compile raises yara.SyntaxError if the pattern is not syntactically valid.
rules = yara.compile(source=pattern)

Let’s also create some test strings to verify that the YARA patterns function as expected.

In [8]:
prompt = f'''Given the following YARA pattern, write two simple log files: the first one is flagged by the YARA rule, while the second one is not flagged.
Since you are writing logs for unit tests, don't include any YARA related descriptions.
Each log file should be separated by ---.

{pattern}
'''

In [9]:
NEW_SYSTEM_PROMPT = "You are a cybersecurity test engineer."

response = inference(prompt, system_prompt = NEW_SYSTEM_PROMPT)
print(response)

Log File 1 (Flagged by YARA):
```
[2023-04-01T10:30:45] System Log Entry
Network Connection Established to Malicious IP: 192.168.1.100 from User: admin on Host: Server001
```

---

Log File 2 (Not Flagged by YARA):
```
[2023-04-01T11:15:20] System Log Entry
User 'john' logged in successfully from IP: 192.167.0.50 on Host: Workstation002
```


In [10]:
log_to_be_flagged = response.split("---")[0]
log_not_to_be_flagged = response.split("---")[1]
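
Splitting on `---` keeps the model's labels (for example "Log File 1 (Flagged by YARA):") and the code fences inside each test string. That is harmless for this rule, but it could matter for rules that match on such words. The following is a minimal sketch of a stricter parser, assuming the response has the shape shown above: parts separated by `---`, each optionally wrapping the log in a fenced block. `split_logs` is a hypothetical helper, and it would mis-split a log that itself contains `---`.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCE_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def split_logs(response: str, separator: str = "---") -> list:
    """Split a model response into log bodies, dropping labels and fences."""
    logs = []
    for part in response.split(separator):
        part = part.strip()
        if not part:
            continue
        fenced = FENCE_RE.search(part)
        # Fall back to the whole part when the model did not use a fence.
        logs.append(fenced.group(1).strip() if fenced else part)
    return logs


# Sample response copied from the output above.
sample = "\n".join([
    "Log File 1 (Flagged by YARA):",
    FENCE,
    "[2023-04-01T10:30:45] System Log Entry",
    "Network Connection Established to Malicious IP: 192.168.1.100 from User: admin on Host: Server001",
    FENCE,
    "",
    "---",
    "",
    "Log File 2 (Not Flagged by YARA):",
    FENCE,
    "[2023-04-01T11:15:20] System Log Entry",
    "User 'john' logged in successfully from IP: 192.167.0.50 on Host: Workstation002",
    FENCE,
])

logs = split_logs(sample)
print(len(logs))  # 2
print(logs[0].splitlines()[0])  # [2023-04-01T10:30:45] System Log Entry
```

In the notebook, the two variables could then be assigned with `log_to_be_flagged, log_not_to_be_flagged = split_logs(response)`.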

The YARA pattern correctly flags the first log file and, as expected, does not flag the second one.

In [11]:
log_to_be_flagged_matches = rules.match(data=log_to_be_flagged)
print(log_to_be_flagged_matches)

[SimpleMalwareDetector]


In [12]:
log_not_to_be_flagged_matches = rules.match(data=log_not_to_be_flagged)
print(log_not_to_be_flagged_matches)

[]
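
The two manual checks above can be turned into assertions so that a regenerated rule or log fails loudly. The following is a minimal sketch under stated assumptions: `check_expectations` is a hypothetical helper that accepts any scan function, and `substring_scan` is a pure-Python stand-in that only approximates the two plain-text strings of the generated rule. It is not YARA and is there only to keep the example self-contained.

```python
from typing import Callable, List, Sequence

# Indicators copied from the generated rule above.
INDICATORS = ("192.168.1.100", "d41d8cd98f00b204e9800998ecf8427e")


def substring_scan(data: str) -> List[str]:
    """Stand-in scanner: list the indicators that occur anywhere in the data."""
    return [ioc for ioc in INDICATORS if ioc in data]


def check_expectations(
    scan: Callable[[str], Sequence], flagged: str, clean: str
) -> None:
    """Assert that the flagged sample matches and the clean sample does not."""
    assert scan(flagged), "expected the first log to be flagged"
    assert not scan(clean), "expected the second log not to be flagged"


# Log lines copied from the generated test logs above.
flagged_sample = "Network Connection Established to Malicious IP: 192.168.1.100 from User: admin on Host: Server001"
clean_sample = "User 'john' logged in successfully from IP: 192.167.0.50 on Host: Workstation002"

check_expectations(substring_scan, flagged_sample, clean_sample)

# Plain substring matching also flags a longer address that merely contains
# the indicator.
print(substring_scan("connection from 192.168.1.1005"))  # ['192.168.1.100']
```

In the notebook, the real rule can be checked by passing `lambda data: rules.match(data=data)` as the `scan` argument, since `rules.match` returns an empty list when nothing matches. The last line illustrates a property worth testing against the compiled rule as well: plain text strings match as substrings, and YARA's `fullword` string modifier may be worth considering if longer addresses should not be flagged.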
