# LogPT Data Analysis

Analyze and parse batch API outputs to prepare training data for GPT-2 fine-tuning. This notebook explores the structure of:
- Summary queries (~1685 samples)
- Root cause analysis (~1685 samples)
- Action items (~1685 samples)

Understand data format, develop parsing strategy, and combine with original logs into training format.

In [3]:
import json
import pandas as pd
import logging
from pathlib import Path
from collections import defaultdict
import sys

In [11]:
DATA_DIR = Path('../data')
OUTPUT_DIR = Path('../data/training')
OUTPUT_DIR.mkdir(exist_ok=True)

In [13]:
#Load first 100 samples from each input.jsonl file to understand structure 

DATA_DIR = Path('../data')
output_files = list(DATA_DIR.glob('batch_input_*.jsonl'))

print(f'Found {len(output_files)} files')

for output_file in output_files: 
    print(f'\n{"="*60}')
    print(f'Loading: {output_file.name}')
    print(f'{"="*60}')
    
    samples = []
    with open(output_file, 'r') as f:
        for i, line in enumerate(f): 
            if i >= 100: 
                break 
            samples.append(json.loads(line))
    
    print(f'✓ Loaded {len(samples)} samples')
    print(f'\nSample structure:')
    print(json.dumps(samples[0], indent=2))

Found 3 files

Loading: batch_input_action_items.jsonl
✓ Loaded 100 samples

Sample structure:
{
  "custom_id": "openssh_chunk_0_action_items",
  "method": "POST",
  "url": "/v1/chat/completions",
  "body": {
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "system",
        "content": "You are an operations engineer providing actionable recommendations from log analysis."
      },
      {
        "role": "user",
        "content": "You are analyzing SSH authentication and connection logs.\n\n    List the top 3 action items that should be taken based on these logs.\n\n    Focus your response on: Security threats, attack patterns, authentication anomalies\n    Use the following context to inform your answer:\n    - Key fields to pay attention to are: ['timestamp', 'source_ip', 'username', 'action']   \n    Provide a response that is:\n    1. Accurate and relevant to the query\n    2. Informed by the specific context of these logs\n    3. Clear and concise, avoiding 

In [20]:
#Load the batch inputs and extract the relevant information for analysis
def load_batch_inputs(filepath): 
    records = {}
    with open(filepath, 'r') as f: 
        for line in f: 
            record = json.loads(line)
            custom_id = record['custom_id']
            messages = record['body']['messages']

            system_prompt = messages[0]['content']
            user_prompt = messages[1]['content']
            records[custom_id] = {
                'system_prompt': system_prompt,
                'user_prompt': user_prompt
            }
    return records

#Load all the input files 
input_files = list(DATA_DIR.glob('batch_input_*.jsonl'))

all_inputs = {}
for filepath in input_files:
    if filepath.exists():
        records = load_batch_inputs(filepath)
        all_inputs.update(records)
        print(f'✓ Loaded {len(records)} records from {filepath.name}')

print(f"\nTotal input records: {len(all_inputs)}")
print(f"Sample custom_ids: {list(all_inputs.keys())[:5]}")
#print first 5 entries of all_inputs 
for i, (custom_id, prompts) in enumerate(all_inputs.items()):
    if i >= 5:
        break
    print(f"\nCustom ID: {custom_id}")
    print(f"System Prompt:\n{prompts['system_prompt']}\n")
    print(f"User Prompt:\n{prompts['user_prompt']}\n")

            


✓ Loaded 1685 records from batch_input_action_items.jsonl
✓ Loaded 1685 records from batch_input_summary.jsonl
✓ Loaded 1685 records from batch_input_root_cause.jsonl

Total input records: 5055
Sample custom_ids: ['openssh_chunk_0_action_items', 'openssh_chunk_1_action_items', 'openssh_chunk_2_action_items', 'openssh_chunk_3_action_items', 'openssh_chunk_4_action_items']

Custom ID: openssh_chunk_0_action_items
System Prompt:
You are an operations engineer providing actionable recommendations from log analysis.

User Prompt:
You are analyzing SSH authentication and connection logs.

    List the top 3 action items that should be taken based on these logs.

    Focus your response on: Security threats, attack patterns, authentication anomalies
    Use the following context to inform your answer:
    - Key fields to pay attention to are: ['timestamp', 'source_ip', 'username', 'action']   
    Provide a response that is:
    1. Accurate and relevant to the query
    2. Informed by the spe

In [10]:
#Load first 100 samples from each output.jsonl file to understand structure 

DATA_DIR = Path('../data')
output_files = list(DATA_DIR.glob('batch_output_*.jsonl'))

print(f'Found {len(output_files)} files')

for output_file in output_files: 
    print(f'\n{"="*60}')
    print(f'Loading: {output_file.name}')
    print(f'{"="*60}')
    
    samples = []
    with open(output_file, 'r') as f:
        for i, line in enumerate(f): 
            if i >= 100: 
                break 
            samples.append(json.loads(line))
    
    print(f'✓ Loaded {len(samples)} samples')
    print(f'\nSample structure:')
    print(json.dumps(samples[0], indent=2))

Found 3 files

Loading: batch_output_action_items.jsonl
✓ Loaded 100 samples

Sample structure:
{
  "id": "batch_req_69857b3f7240819087c54736e02a9100",
  "custom_id": "openssh_chunk_0_action_items",
  "response": {
    "status_code": 200,
    "request_id": "55aa028e-7d47-4a46-9f0e-7a74f53d77e5",
    "body": {
      "id": "chatcmpl-D68a4mihlRTLgSquMbu7ERcdqZWtB",
      "object": "chat.completion",
      "created": 1770355324,
      "model": "gpt-4o-mini-2024-07-18",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Based on the analysis of the SSH authentication and connection logs, here are the top three actionable recommendations:\n\n1. **Implement IP Blocking for Suspicious Sources**: The logs indicate repeated failed authentication attempts from specific IP addresses, particularly 173.234.31.186 and 52.80.34.196, which are attempting to log in with invalid usernames. It's advisable to block these IP add

In [19]:
#Load the batch outputs
def load_batch_outputs(filepath): 
    records = {}
    with open(filepath, "r") as f: 
        for line in f: 
            record = json.loads(line)
            custom_id = record['custom_id']
            
            response_text = record['response']['body']['choices'][0]['message']['content']
            records[custom_id] = response_text

    return records

output_files = list(DATA_DIR.glob('batch_output_*.jsonl'))

all_outputs = {}
for output_file in output_files:
    if output_file.exists():
        records = load_batch_outputs(output_file)
        all_outputs.update(records)
        print(f'✓ Loaded {len(records)} records from {output_file.name}')

print(f"\nTotal output records: {len(all_outputs)}")
print(f"Sample custom_ids: {list(all_outputs.keys())[:5]}")
#print first 5 entries of all_outputs
for i, (custom_id, response) in enumerate(all_outputs.items()):
    if i >= 5:
        break
    print(f"\nCustom ID: {custom_id}")
    print(f"Model Response:\n{response}\n")

✓ Loaded 1685 records from batch_output_action_items.jsonl
✓ Loaded 1685 records from batch_output_summary.jsonl
✓ Loaded 1685 records from batch_output_root_cause.jsonl

Total output records: 5055
Sample custom_ids: ['openssh_chunk_0_action_items', 'openssh_chunk_1_action_items', 'openssh_chunk_2_action_items', 'openssh_chunk_3_action_items', 'openssh_chunk_4_action_items']

Custom ID: openssh_chunk_0_action_items
Model Response:
Based on the analysis of the SSH authentication and connection logs, here are the top three actionable recommendations:

1. **Implement IP Blocking for Suspicious Sources**: The logs indicate repeated failed authentication attempts from specific IP addresses, particularly 173.234.31.186 and 52.80.34.196, which are attempting to log in with invalid usernames. It's advisable to block these IP addresses temporarily or permanently to prevent further access attempts. Consider using a firewall or intrusion prevention system to automate this process.

2. **Enforce S

In [22]:
#Extract the embedded log data
def extract_log_data(user_prompt): 
    marker = "Log Data:\n"
    # idx will give the starting index of the log data in the user prompt
    idx = user_prompt.find(marker)
    if idx == -1:
        return None 
    
    log_text = user_prompt[idx + len(marker):]

    #remove any ending markers like "End of Log Data"
    end_markers = [
        "\n\nProvide a concise summary",
        "\n\nProvide a response that",  
    ]

    for end_marker in end_markers: 
        end_idx = log_text.find(end_marker)
        # trim the log to remove the end marker and anything after it
        if end_idx != -1: 
            log_text = log_text[:end_idx]

    return log_text.strip()


#Extract the query/instructions before the log data 
def extract_query(user_prompt): 
    marker = "Log Data:\n"
    idx  = user_prompt.find(marker)
    if idx == -1:
        return user_prompt.strip()
    query_text = user_prompt[:idx]

    #Remove any trailing instruction 
    log_text = user_prompt[idx + len(marker):]
    instruction_markers = [
        "\n\nProvide a concise summary",
        "Provide a response that",  
    ]
    for instruction_marker in instruction_markers: 
        instr_idx = query_text.find(instruction_marker)
        # If instruction marker is found, trim the query to remove the instructions and anything after it
        if instr_idx != -1: 
            query_text += "\n" + log_text[instr_idx:].strip()

    return query_text

#Test the extraction on a sample 
sample_id = list(all_inputs.keys())[0]
sample_input = all_inputs[sample_id]



print(f"Custom ID: {sample_id}")
print(f"\n{'='*60}")
print("EXTRACTED LOG DATA (first 500 chars):")
print(extract_log_data(sample_input['user_prompt'])[:500])
print(f"\n{'='*60}")
print("EXTRACTED QUERY (first 500 chars):")
print(extract_query(sample_input['user_prompt'])[:500])
    
    

Custom ID: openssh_chunk_0_action_items

EXTRACTED LOG DATA (first 500 chars):
Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!
Dec 10 06:55:46 LabSZ sshd[24200]: Invalid user webmaster from 173.234.31.186
Dec 10 06:55:46 LabSZ sshd[24200]: input_userauth_request: invalid user webmaster [preauth]
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): check pass; user unknown
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): authentication failure; logname= uid=0 eu

EXTRACTED QUERY (first 500 chars):
You are analyzing SSH authentication and connection logs.

    List the top 3 action items that should be taken based on these logs.

    Focus your response on: Security threats, attack patterns, authentication anomalies
    Use the following context to inform your answer:
    - Key fields to pay attention to are: ['timestamp', 'source_ip', 'username', 'action']   
    Provide a r

In [23]:
#parse custom id from each input key

def parse_custom_id(custom_id: str):
    parts = custom_id.split('_chunk_')
    log_type = parts[0]
    remainder = parts[1].split('_', 1)
    chunk_idx = int(remainder[0])
    query_type = remainder[1] if len(remainder) > 1 else 'summary'
    return log_type, chunk_idx, query_type

#verify the parsing on a sample custom id
for cid in list(all_inputs.keys())[:5]: 
    log_type, chunk_idx, query_type = parse_custom_id(cid)
    print(f"Custom ID: {cid}")
    print(f"Parsed Log Type: {log_type}, Chunk Index: {chunk_idx}, Query Type: {query_type}")
    print(f"{'-'*50}")

Custom ID: openssh_chunk_0_action_items
Parsed Log Type: openssh, Chunk Index: 0, Query Type: action_items
--------------------------------------------------
Custom ID: openssh_chunk_1_action_items
Parsed Log Type: openssh, Chunk Index: 1, Query Type: action_items
--------------------------------------------------
Custom ID: openssh_chunk_2_action_items
Parsed Log Type: openssh, Chunk Index: 2, Query Type: action_items
--------------------------------------------------
Custom ID: openssh_chunk_3_action_items
Parsed Log Type: openssh, Chunk Index: 3, Query Type: action_items
--------------------------------------------------
Custom ID: openssh_chunk_4_action_items
Parsed Log Type: openssh, Chunk Index: 4, Query Type: action_items
--------------------------------------------------


In [28]:
#Join inputs and outputs to build training records 

training_records = []


for custom_id, response_text in all_outputs.items(): 
    if custom_id not in all_inputs: 
        print(f" Warning: No input found for Custom ID: {custom_id}")
        continue
    input_data = all_inputs[custom_id]
    log_type, chunk_idx, query_type = parse_custom_id(custom_id)

    log_data = extract_log_data(input_data['user_prompt'])
    query = extract_query(input_data['user_prompt'])
    system_prompt = input_data['system_prompt']

    if log_data is None:
        print(f" Warning: No log data found for Custom ID: {custom_id}")
        continue

    training_records.append({
        'custom_id': custom_id,
        'log_type': log_type,
        'chunk_idx': chunk_idx,
        'query_type': query_type,
        'system_prompt': system_prompt,
        'query': query,
        'log_data': log_data,
        'response': response_text
    })

print(f"Total training records: {len(training_records)}")

df = pd.DataFrame(training_records)
print(f'\n Records by query type:\n{df["query_type"].value_counts()}')
print(f'\n Records by log type:\n{df["log_type"].value_counts()}')

Total training records: 5055

 Records by query type:
query_type
action_items    1685
summary         1685
root_cause      1685
Name: count, dtype: int64

 Records by log type:
log_type
openstack      999
bgl            483
hadoop         477
thunderbird    429
mac            366
zookeeper      351
hdfs           345
prox           306
openssh        261
linux          261
health         243
spark          216
apache         183
hpc            135
Name: count, dtype: int64


In [29]:
#Build training strings and special tokens 

SPECIAL_TOKENS = {
    'log_start': '<|log_start|>',
    'log_end': '<|log_end|>',
    'query_start': '<|query_start|>',
    'query_end': '<|query_end|>',
    'response_start': '<|response_start|>',
    'response_end': '<|response_end|>',
}

#Add these special tokens to the training records
def format_training_sample(record): 
    return (
        f"{SPECIAL_TOKENS['log_start']}\n"
        f"{record['log_data']}\n"
        f"{SPECIAL_TOKENS['log_end']}\n"
        f"{SPECIAL_TOKENS['query_start']}\n"
        f"{record['query']}\n"
        f"{SPECIAL_TOKENS['query_end']}\n"
        f"{SPECIAL_TOKENS['response_start']}\n"
        f"{record['response']}\n"
        f"{SPECIAL_TOKENS['response_end']}"
    )

#Test the formatting on a sample record
sample_record = training_records[0]
formatted_sample = format_training_sample(sample_record)
print(f"\nFormatted Training Sample (first 2000 chars):\n{formatted_sample[:2000]}")
print(f"\nFormatted Training Sample (last 500 chars):\n{formatted_sample[-500:]}")



Formatted Training Sample (first 2000 chars):
<|log_start|>
Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!
Dec 10 06:55:46 LabSZ sshd[24200]: Invalid user webmaster from 173.234.31.186
Dec 10 06:55:46 LabSZ sshd[24200]: input_userauth_request: invalid user webmaster [preauth]
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): check pass; user unknown
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=173.234.31.186 
Dec 10 06:55:48 LabSZ sshd[24200]: Failed password for invalid user webmaster from 173.234.31.186 port 38926 ssh2
Dec 10 06:55:48 LabSZ sshd[24200]: Connection closed by 173.234.31.186 [preauth]
Dec 10 07:02:47 LabSZ sshd[24203]: Connection closed by 212.47.254.145 [preauth]
Dec 10 07:07:38 LabSZ sshd[24206]: Invalid user test9 from 52.80.34.196
Dec 10 07:07:38 LabSZ sshd[24206]: input_userauth

In [30]:
# Check context length of the formatted data to ensure it stays within GPT-2 constraints

import tiktoken 

enc = tiktoken.get_encoding("gpt2")

lengths = [] 

for record in training_records:
    formatted = format_training_sample(record)
    tokens = enc.encode(formatted)
    num_tokens = len(tokens)
    
    lengths.append({
        'custom_id': record['custom_id'],
        'query_type': record['query_type'],
        'log_type': record['log_type'],
        'char_len': len(formatted),
        'actual_tokens': num_tokens,
        'log_chars': len(record['log_data']),
        'response_chars': len(record['response']),
    })

len_df = pd.DataFrame(lengths)
print(f"\nContext Length Analysis:")
print(len_df['actual_tokens'].describe())

print(f"\nSamples over 1024 tokens (GPT-2 max): {(len_df['actual_tokens'] > 1024).sum()} / {len(len_df)} ({100*(len_df['actual_tokens'] > 1024).sum()/len(len_df):.1f}%)")
print(f"Samples over 768 tokens: {(len_df['actual_tokens'] > 768).sum()} / {len(len_df)} ({100*(len_df['actual_tokens'] > 768).sum()/len(len_df):.1f}%)")
print(f"Samples over 512 tokens: {(len_df['actual_tokens'] > 512).sum()} / {len(len_df)} ({100*(len_df['actual_tokens'] > 512).sum()/len(len_df):.1f}%)")

print(f"\n{'='*80}")
print("By query type:")
print(f"{'='*80}")
query_stats = len_df.groupby('query_type')['actual_tokens'].describe()
print(query_stats)

print(f"\n{'='*80}")
print("By log type:")
print(f"{'='*80}")
log_stats = len_df.groupby('log_type')['actual_tokens'].describe()
print(log_stats)




Context Length Analysis:
count    5055.000000
mean     2063.101088
std       382.556436
min       489.000000
25%      1619.000000
50%      2260.000000
75%      2359.000000
max      2794.000000
Name: actual_tokens, dtype: float64

Samples over 1024 tokens (GPT-2 max): 5030 / 5055 (99.5%)
Samples over 768 tokens: 5039 / 5055 (99.7%)
Samples over 512 tokens: 5054 / 5055 (100.0%)

By query type:
               count         mean         std    min     25%     50%     75%  \
query_type                                                                     
action_items  1685.0  2259.938872  156.006385  489.0  2227.0  2286.0  2335.0   
root_cause    1685.0  2363.971513  165.577871  593.0  2314.0  2384.0  2446.0   
summary       1685.0  1565.392878  101.304590  589.0  1522.0  1572.0  1623.0   

                 max  
query_type            
action_items  2636.0  
root_cause    2794.0  
summary       1827.0  

By log type:
             count         mean         std     min      25%     50%      

In [35]:
#Analyze where the tokens are being spent - log data vs response vs query

sample_breakdown = [] 

for record in training_records[:100]:
    formatted = format_training_sample(record)
    total_tokens = len(enc.encode(formatted))
    log_tokens = len(enc.encode(record['log_data']))
    response_tokens = len(enc.encode(record['response']))
    query_tokens = len(enc.encode(record['query']))

    sample_breakdown.append({
        'custom_id': record['custom_id'],
        'query_type': record['query_type'],
        'log_type': record['log_type'],
        'total_tokens': total_tokens,
        'log_tokens': log_tokens,
        'response_tokens': response_tokens,
        'query_tokens': query_tokens,
        'log_token_pct': log_tokens / total_tokens * 100,
        'response_token_pct': response_tokens / total_tokens * 100,
        'query_token_pct': query_tokens / total_tokens * 100,
    })

#Get the breakdown of token distribution across log, query and response
sample_breakdown = pd.DataFrame(sample_breakdown)
print(f"\nToken Distribution Breakdown (100 samples):")
print(sample_breakdown[['query_type', 'log_type', 'total_tokens', 'log_tokens', 'response_tokens', 'query_tokens', 'log_token_pct', 'response_token_pct', 'query_token_pct']].head(20))




    


Token Distribution Breakdown (100 samples):
      query_type log_type  total_tokens  log_tokens  response_tokens  \
0   action_items  openssh          2287         966              301   
1   action_items  openssh          2287         976              288   
2   action_items  openssh          2303         976              298   
3   action_items  openssh          2230         948              287   
4   action_items  openssh          2277         982              268   
5   action_items  openssh          2301         985              284   
6   action_items  openssh          2342         953              370   
7   action_items  openssh          2218         944              270   
8   action_items  openssh          2236         968              252   
9   action_items  openssh          2273         969              273   
10  action_items  openssh          2321         979              312   
11  action_items  openssh          2376         988              343   
12  action_items  o

In [36]:
# Compact the queries, as the model learns more from the examples
# Only need a short signal telling it what to do 

COMPACT_QUERIES = {
    'summary': "Summarize these {log_type} logs", 
    'root_cause': "Identify the root cause of these issues in these {log_type} logs",
    'action_items': 'List the action items from these {log_type} logs'
}

#Check again now how much is saved 
sample = training_records[0]
old_tokens = len(enc.encode(sample['query']))
new_query = COMPACT_QUERIES[sample['query_type']].format(log_type=sample['log_type'])
new_tokens = len(enc.encode(new_query))
print(f"Old tokens: {old_tokens}, New tokens: {new_tokens}, Saved: {old_tokens - new_tokens}")

Old tokens: 970, New tokens: 9, Saved: 961


In [47]:
# Now check with out compact query + truncated logs to fit the 1024 tokens context window of GPT-2. 
# We will keep reducing the log data until we fit within the limit.


def format_training_sample_compact(record, max_tokens=1024): 
    query = COMPACT_QUERIES.get(
        record['query_type'], 
        f'Analyze these {record["log_type"]} logs'
    ).format(log_type=record['log_type'])

    log_data = record['log_data']
    response = record['response']

    #measure everything except the log data
    shell = (
        f'{SPECIAL_TOKENS["log_start"]}\n\n'
        f'\n{SPECIAL_TOKENS["log_end"]}\n'
        f'{SPECIAL_TOKENS["query_start"]}\n'
        f'{query}\n'
        f'\n{SPECIAL_TOKENS["query_end"]}\n'
        f'{SPECIAL_TOKENS["response_start"]}\n'
        f'{response}\n'
        f'\n{SPECIAL_TOKENS["response_end"]}'
    )
    shell_tokens  = len(enc.encode(shell))
    available_tokens = max_tokens - shell_tokens 
    log_tokens = enc.encode(log_data)
    if len(log_tokens) > available_tokens: 
        # Need to truncate the log data to fit within the available tokens
        log_data = enc.decode(log_tokens[:max(available_tokens, 0)])
    return (
        f"{SPECIAL_TOKENS['log_start']}\n"
        f"{log_data}\n"
        f"{SPECIAL_TOKENS['log_end']}\n"
        f"{SPECIAL_TOKENS['query_start']}\n"
        f"{query}\n"
        f"{SPECIAL_TOKENS['query_end']}\n"
        f"{SPECIAL_TOKENS['response_start']}\n"
        f"{response}\n"
        f"{SPECIAL_TOKENS['response_end']}"
    )

# Now check the token length of the compact + truncated format
comapct_stats = [] 
for record in training_records:
    formatted = format_training_sample_compact(record, max_tokens=1024)
    total_tokens = len(enc.encode(formatted))
    comapct_stats.append({
        'custom_id': record['custom_id'],
        'query_type': record['query_type'],
        'log_type': record['log_type'],
        'text': formatted,
        'total_tokens': total_tokens,
    })
compact_df = pd.DataFrame(comapct_stats)
print(f"\nCompact Format Context Length Analysis:")
print(compact_df['total_tokens'].describe())
#Give me the percentage of samples that are now under 1024 tokens after compaction and truncation
print(f"\nSamples under 1024 tokens after compaction and truncation: {(compact_df['total_tokens'] <= 1024).sum()} / {len(compact_df)} ({100*(compact_df['total_tokens'] <= 1024).sum()/len(compact_df):.1f}%)")


Compact Format Context Length Analysis:
count    5055.000000
mean     1019.293373
std        35.148174
min       352.000000
25%      1022.000000
50%      1022.000000
75%      1022.000000
max      1023.000000
Name: total_tokens, dtype: float64

Samples under 1024 tokens after compaction and truncation: 5055 / 5055 (100.0%)


In [49]:
#Now we can save these compact + truncated training samples for fine-tuning the model. We will save them in a jsonl format with the custom_id and the formatted training string.

output_file = OUTPUT_DIR / 'training_data.jsonl'

with open(output_file, 'w') as f: 
    for sample in comapct_stats: 
        f.write(json.dumps(sample) + '\n')

print(f'Saved training data to {output_file} with {len(comapct_stats)} samples')

#Verify the saved file 
with open(output_file, 'r') as f: 
    first_sample = json.loads(f.readline())
    print(f"\nFirst sample keys: {first_sample.keys()}")
    #Print the formatted text of the first sample
    print(f"\nFormatted training text of the first sample (first 2000 chars):\n{first_sample['text'][:2000]}")



Saved training data to ../data/training/training_data.jsonl with 5055 samples

First sample keys: dict_keys(['custom_id', 'query_type', 'log_type', 'text', 'total_tokens'])

Formatted training text of the first sample (first 2000 chars):
<|log_start|>
Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!
Dec 10 06:55:46 LabSZ sshd[24200]: Invalid user webmaster from 173.234.31.186
Dec 10 06:55:46 LabSZ sshd[24200]: input_userauth_request: invalid user webmaster [preauth]
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): check pass; user unknown
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=173.234.31.186 
Dec 10 06:55:48 LabSZ sshd[24200]: Failed password for invalid user webmaster from 173.234.31.186 port 38926 ssh2
Dec 10 06:55:48 LabSZ sshd[24200]: Connection closed by 173.234.31.186 [preauth]
Dec 10 07:0