# Creating Synthetic Data using Llama 3.1

In this session, we explore how to use Llama 3.1 to generate synthetic logs for testing and analyzing systems. We focus on Windows Event ID 7045 (a service was installed on the system) and generate both benign (normal) and malicious (suspicious) entries.

Here is what we will cover in this session:

- Use Llama 3.1 to generate synthetic logs.
- Define system and user content to shape model behavior.
- Create both benign and malicious logs for simulating real-world data.
- Parse unstructured log text into a structured format for easier processing.

Requirements:
    https://aimlapi.com/
    api_key is provided: ed4b5e9d497f4d8badf2ed3929bb0c2d

    !pip install openai
    !pip install pandas

Optional:
    All package installations can be handled through a requirements file.
    Add a requirements.txt file in the same directory.
    Then run the following command:
    !pip install -r requirements.txt



In [1]:
!pip install openai
!pip install pandas

Collecting openai
  Using cached openai-1.43.1-py3-none-any.whl.metadata (22 kB)
Collecting anyio<5,>=3.5.0 (from openai)
  Using cached anyio-4.4.0-py3-none-any.whl.metadata (4.6 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Using cached httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting pydantic<3,>=1.9.0 (from openai)
  Using cached pydantic-2.9.0-py3-none-any.whl.metadata (146 kB)
Collecting tqdm>4 (from openai)
  Using cached tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Collecting certifi (from httpx<1,>=0.23.0->openai)
  Using cached certifi-2024.8.30-py3-none-any.whl.metadata (2.2 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Using cached httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting annotated-types>=0.4.0 (from pydantic<3,>=1.9.0->openai)
  Using cached annotated_types-0.7.0-py3-none-any.whl.metadata (15 kB)
Collecting pydantic-core==2.23.2 (from pydantic<3,>=1.9.0->openai)
  Using cached pydantic_core-2.23.2-cp312-none-win_amd64.whl.me


[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting pandas
  Using cached pandas-2.2.2-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting python-dateutil>=2.8.2 (from pandas)
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Using cached pandas-2.2.2-cp312-cp312-win_amd64.whl (11.5 MB)
Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
Installing collected packages: python-dateutil, pandas
Successfully installed pandas-2.2.2 python-dateutil-2.9.0.post0





In [3]:
import sys
print(sys.path)  # show which Python installation the kernel is using

['c:\\Users\\John\\Desktop', 'C:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311\\python311.zip', 'C:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311\\Lib', 'C:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311\\DLLs', 'C:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311', '', 'C:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages', 'C:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\win32', 'C:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\win32\\lib', 'C:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\Pythonwin']


In [2]:
import openai
import pandas as pd
import os

ModuleNotFoundError: No module named 'openai'
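
The `ModuleNotFoundError` above is consistent with `pip` having installed into a different interpreter than the one running the notebook: the install log shows `cp312` wheels, while `sys.path` points at `Python311`. One common workaround, sketched here under that assumption, is to invoke pip through the kernel's own interpreter. The helper names `pip_install_command` and `install` are hypothetical.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Install a package into the current interpreter's environment."""
    subprocess.check_call(pip_install_command(package))


# Example (not executed here): install("openai"); install("pandas")
print(pip_install_command("openai"))
```

In a Jupyter notebook, the `%pip install openai` magic serves the same purpose, because it targets the environment of the running kernel.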

The system content is the backbone of a structured conversation with the model. It is the part of the prompt where we set the tone, establish boundaries, and provide the context that guides responses. The system role keeps the model within specific rules, whether that means staying on topic, maintaining a certain behavior, or following safety protocols. It works like a playbook that the model uses to shape its replies in line with your goals.

    We define the system content using this template:
        You are <an expert in cybersecurity>. Your task is <to generate synthetic Windows Event ID 7045 log entries for training purposes>.
        <Each entry> should include:
        - a
        - b
        - c

In [None]:
# Define the system content (raw string, so backslashes in Windows paths stay literal)

system_content = r"""
You are an expert in cybersecurity. Your task is to generate synthetic Windows Event ID 7045 log entries for training purposes.
Each entry should include:
- A label ("benign" or "malicious")
- Service Name
- Service File Name
- Service Type
- Service Start Type
- Account Name
- Data Service Name
- Timestamp
- ID

For the ID:
- End with "M" for malicious entries
- End with "B" for benign entries
For the Service File Name:
- Use a command line for the malicious entries
- Use a file path for the benign entries
Ensure the generated entries are varied and realistic.

An example of a benign entry might look like this:
- Label: benign
- Service Name: sysmon
- Service File Name: C:\WINDOWS\sysmon.exe
- Service Type: user mode service
- Service Start Type: auto start
- Account Name: LocalSystem
- Data Service Name: Windows11
- Timestamp: 2024-08-04T17:58:19Z
- ID: fc7deb2c-9f43-49de-aff0-xxxxxxxxxxxxB

An example of a malicious entry might look like this:
- Label: malicious
- Service Name: MTsMjDat
- Service File Name: "powershell -nop -w hidden -noni -c '=New-Object IO.MemoryStream(,[Convert]::FromBase64String(TgBrAHIAVABoAEEAWAAzAHYAbwBiAHMAUAAwAGEATgBXAHAAMwA5AHQAWABQADgAcABiAGUARgBBAG8AVgAwAHkAYgBPADkAUgBnAGIAMABKAFMANgByAGoATwB5ADAAZwBwAHYAaQAyADgANQBCAGEAdgA3AE4ARABMAGEAagBaAHMAQQBmADYAcABVAGEASwBzAEQAagAwADgARABFAFoATABvAEYAUgBiAGYARQBuADUAcABkAEkANAB2AFcAMQBRAFgAUABqAFEANQBlAGIAUQBmAFQAZwBjAFMAYwBqADMAawBxAFEAZwBmADIAYQBuAG4AWABjADEANgAyAFAANABKAEoAZwBUAEMANgBvAHYAWgBIAFAARQB4AEcAYwBYAEEAbAB4AEMAZwBOAE8AWQBPAFcAaQBQAEQAMQBWADUAdgBoADEASwBRAA==))IEX (New-Object IO.StreamReader(New-Object IO.Compression.GzipStream(,[IO.Compression.CompressionMode]::Decompress))).ReadToEnd();'"
- Service Type: user mode service
- Service Start Type: demand start
- Account Name: LocalSystem
- Data Service Name: Windows11
- Timestamp: 2024-08-04T19:18:42Z
- ID: 6c7fe4d5-31ed-4fbc-b3bb-xxxxxxxxxxxxM
"""


The user content refers to the input provided by the user during a conversation. It is essentially the message or prompt that the user sends to the model, asking for information, requesting actions, or guiding the conversation. This user input serves as the starting point for the model's response.




In [None]:
# Define the user content for generating log entries
user_content_benign = "Generate a random benign Windows Event ID 7045 log entry."
user_content_malicious = "Generate a random malicious Windows Event ID 7045 log entry."

Next, we create the client that our application uses to talk to the Llama model through the AI/ML API.
Once the API key is authenticated, the application passes user input to the model and manages the model's output.

In [None]:
client = openai.OpenAI(
    api_key="ed4b5e9d497f4d8badf2ed3929bb0c2d",
    base_url="https://api.aimlapi.com/"
)
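
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported in an environment variable: the variable name `AIML_API_KEY` and the helper `load_api_key` are hypothetical choices, not something the API prescribes.

```python
import os


def load_api_key(env_var="AIML_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is unset."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the client."
        )
    return api_key


# Usage with the client shown above (assumes the variable is set):
# client = openai.OpenAI(api_key=load_api_key(), base_url="https://api.aimlapi.com/")
```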

We’re working with Llama 3.1 to simulate two types of logs: benign logs (safe, expected events) and malicious logs (potentially harmful or suspicious activity). 
This distinction is important when we’re training models to identify anomalies or threats in systems.

We could modify these functions to fine-tune how Llama generates logs, whether we’re looking for specific types of events in the logs or want to test the system’s reaction to more varied types of activity. This is a powerful way to simulate real-world data for AI model training and validation.

In [None]:

# Generate benign log
def generate_benign_log():
    chat_completion_benign = client.chat.completions.create(
        #model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content_benign},
        ],
        temperature=0.7,
        max_tokens=256,
    )
    return chat_completion_benign

# Generate malicious log
def generate_malicious_log():
    chat_completion_malicious = client.chat.completions.create(
        #model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content_malicious},
        ],
        temperature=0.7,
        max_tokens=256,
    )
    return chat_completion_malicious
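
The two functions above differ only in the user prompt, so they could be folded into one. The sketch below is one possible refactoring, not the notebook's own code: `generate_log`, `MODEL_NAME`, and the stand-in `fake_client` are hypothetical names, and the fake client exists only so the example runs without network access.

```python
from types import SimpleNamespace

MODEL_NAME = "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo"


def generate_log(client, system_content, label, model=MODEL_NAME):
    """Request one synthetic log entry for the given label ("benign" or "malicious")."""
    if label not in ("benign", "malicious"):
        raise ValueError(f"Unknown label: {label!r}")
    user_content = f"Generate a random {label} Windows Event ID 7045 log entry."
    return client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
        ],
        temperature=0.7,
        max_tokens=256,
    )


# Stand-in client for a dry run: it echoes the user prompt instead of calling the API.
def _fake_create(**kwargs):
    reply = SimpleNamespace(content="echo: " + kwargs["messages"][1]["content"])
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
response = generate_log(fake_client, "system prompt here", "benign")
print(response.choices[0].message.content)
```

With the real client, the call would be `generate_log(client, system_content, "malicious")`. Entries that carry long encoded commands may be cut off at `max_tokens=256`; checking whether `choices[0].finish_reason` equals `"length"` is one way to detect that.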

We simply create an empty list called synthetic_logs.

In [None]:
synthetic_logs = []

We generate synthetic logs by calling generate_benign_log() and generate_malicious_log() and appending their outputs to the synthetic_logs list.
In this case, we generate each type of log 5 times.

Two loops produce a total of 10 synthetic logs: 5 benign and 5 malicious. For each response, we take the text of the first choice, strip leading and trailing whitespace, and store it in synthetic_logs. By the end of these loops, we have a collection of logs that we can use for testing or training purposes.

In [None]:
for i in range(5):
    synthetic_logs.append(generate_benign_log().choices[0].message.content.strip())
for i in range(5):
    synthetic_logs.append(generate_malicious_log().choices[0].message.content.strip())
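
The loops above keep only the generated text, and a single failed API call stops the whole batch. A minimal sketch of a more defensive variant follows, assuming we want to record which label was requested and to skip failed calls. `collect_logs` and the demo generators are hypothetical; the demo uses fixed strings so it runs offline.

```python
def collect_logs(generators, per_label=5):
    """Call each generator per_label times and return (records, failures)."""
    records, failures = [], []
    for label, generate in generators.items():
        for attempt in range(per_label):
            try:
                text = generate().strip()
            except Exception as exc:  # e.g. network or rate-limit errors
                failures.append((label, attempt, repr(exc)))
                continue
            records.append({"requested_label": label, "text": text})
    return records, failures


# Offline demo with fixed strings standing in for model output.
demo_generators = {
    "benign": lambda: " - Label: benign ",
    "malicious": lambda: " - Label: malicious ",
}
records, failures = collect_logs(demo_generators, per_label=2)
print(len(records), len(failures))
```

With the notebook's functions, each generator could be a lambda such as `lambda: generate_benign_log().choices[0].message.content`. Keeping `requested_label` next to the text gives a ground-truth label that does not depend on the model echoing it correctly.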

This code prints each synthetic log stored in synthetic_logs.

By iterating over synthetic_logs, we print each entry followed by extra blank lines, which makes it easier to inspect the logs visually. This is useful for validating the logs or simply observing how the Llama model generates benign and malicious events.

In [None]:
for log in synthetic_logs:
    print(log)
    print("\n\n")

We parse a single log entry, extracting key-value pairs from each line and returning them as a dictionary. 

This function helps us take a raw log entry, which is essentially a block of text with key-value pairs, and turn it into a structured dictionary. We loop through each line in the log, check for the presence of a key-value pattern (separated by a colon and space), and then store that information in a dictionary. This parsed format is easier to work with, especially when you need to extract specific details from the logs for further analysis or processing.

In [None]:
# Function to parse a log entry
def parse_log_entry(log_entry):
    lines = log_entry.split('\n')
    log_data = {}
    for line in lines:
        if ': ' in line:
            key, value = line.split(': ', 1)
            log_data[key.strip()] = value.strip()
    return log_data
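
If the model mirrors the bulleted layout of the examples in the system prompt, the keys produced by the function above keep the leading "- " (for example "- Label"). The sketch below shows one way to handle that, assuming bulleted output. `parse_log_entry_clean` is a hypothetical variant, and the sample entry is hand-written rather than real model output.

```python
def parse_log_entry_clean(log_entry):
    """Parse "Key: value" lines into a dict, dropping leading list bullets from keys."""
    log_data = {}
    for line in log_entry.splitlines():
        if ': ' in line:
            key, value = line.split(': ', 1)
            # Remove bullet characters and surrounding spaces from the key.
            log_data[key.strip().lstrip('-* ').strip()] = value.strip()
    return log_data


sample_entry = """- Label: benign
- Service Name: sysmon
- Service File Name: C:\\WINDOWS\\sysmon.exe
- Timestamp: 2024-08-04T17:58:19Z"""

parsed = parse_log_entry_clean(sample_entry)
print(parsed["Label"])      # benign
print(parsed["Timestamp"])  # 2024-08-04T17:58:19Z
```

Splitting on the first ": " only is what keeps the drive letter in `C:\WINDOWS\sysmon.exe` and the colons inside the timestamp intact.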

We build on the parse_log_entry function we discussed earlier.

We loop through the synthetic logs that we generated earlier. For each log, we call the parse_log_entry() function to convert the raw log data into a structured dictionary. Then, we store the parsed version of each log in the parsed_logs list. This gives us a clean, structured format for all the logs, which makes it easier to analyze, manipulate, or store in a database for further processing.

In [None]:
parsed_logs = []
for log in synthetic_logs:
    parsed_log = parse_log_entry(log)
    parsed_logs.append(parsed_log)

In [None]:
df = pd.DataFrame(parsed_logs)
df
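
Because the model returns free text, some parsed entries may lack fields, which appear as NaN cells in the DataFrame. A minimal sketch of a completeness check on the parsed dictionaries follows. It assumes the keys carry no bullet prefix, and `EXPECTED_FIELDS` mirrors part of the field list in the system prompt. `missing_fields` and `split_complete` are hypothetical helpers, and the demo entries are made up.

```python
EXPECTED_FIELDS = (
    "Label", "Service Name", "Service File Name", "Service Type",
    "Service Start Type", "Data Service Name", "Timestamp", "ID",
)


def missing_fields(entry, expected=EXPECTED_FIELDS):
    """Return the expected field names that are absent or empty in a parsed entry."""
    return [name for name in expected if not entry.get(name)]


def split_complete(entries, expected=EXPECTED_FIELDS):
    """Separate parsed entries into (complete, incomplete) lists."""
    complete, incomplete = [], []
    for entry in entries:
        if missing_fields(entry, expected):
            incomplete.append(entry)
        else:
            complete.append(entry)
    return complete, incomplete


# Made-up entries checked against a shortened field list.
demo_entries = [
    {"Label": "benign", "Service Name": "sysmon", "Timestamp": "2024-08-04T17:58:19Z", "ID": "a1"},
    {"Label": "malicious", "Service Name": "", "ID": "b2"},
]
fields = ("Label", "Service Name", "Timestamp", "ID")
print(missing_fields(demo_entries[1], fields))  # ['Service Name', 'Timestamp']
complete, incomplete = split_complete(demo_entries, fields)
print(len(complete), len(incomplete))  # 1 1
```

Running `split_complete(parsed_logs)` before building the DataFrame would make it possible to regenerate or discard incomplete entries.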

You have learned how to:

- Use Llama 3.1 to generate synthetic logs.
- Define system and user content to shape model behavior.
- Create both benign and malicious logs for simulating real-world data.
- Parse unstructured log text into a structured format for easier processing.

This synthetic data creation workflow provides a practical way to simulate data for security, monitoring, or testing systems, and lays a foundation for training anomaly detection AI models.