## YARA Rule Generation Based on Pattern Engineering

This notebook implements a direct pattern-matching approach to generate YARA rules.
This strategy was chosen over a Machine Learning model due to the small size and imbalanced nature of the provided input files, which made a data-driven model ineffective.

The goal is to parse security indicators (such as file names and hashes) from the input files and convert them into functional YARA rules, demonstrating a practical and robust method for threat signature generation.

In [127]:
import pandas as pd
import csv
import os
import numpy as np
import yara
import re # is crucial for the new sanitization step to remove problematic characters.
from datetime import datetime

## 1. Data Loading and Preparation

Objective: Load all files from new_input_files and structure them so I can easily process the content.


In [128]:
# Try to import yara, handle if not available
try:
    import yara
    YARA_AVAILABLE = True
    print("yara-python available.")
except ImportError:
    YARA_AVAILABLE = False
    print("yara-python not available. Install with pip install yara-python")


def load_data_and_handle_encoding(data_dir):
    """
    Loads text content from files in a directory and handles UnicodeDecodeError.
    """
    data = {}
    
    for filename in os.listdir(data_dir):
        file_path = os.path.join(data_dir, filename)
        
        if os.path.isfile(file_path):
            label = os.path.splitext(filename)[0]
            print(f"Processing: {filename}")
            
            try:
                # Try UTF-8 first
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                print(f"Read with UTF-8")
            except UnicodeDecodeError:
                try:
                    # Try latin-1 if UTF-8 fails
                    with open(file_path, 'r', encoding='latin-1') as f:
                        content = f.read()
                    print(f"Read with latin-1")
                except Exception as e:
                    print(f"Error Read {filename}: {e}")
                    content = ""
            except Exception as e:
                print(f" Unespected error with {filename}: {e}")
                content = ""
            
            if content:
                data[label] = content
                print(f"Loaded content: {len(content)} characters")
            else:
                print(f"Empty or unreadable file")

    print(f"\n Total files uploaded: {len(data)}")
    return data


yara-python available.


## 2. YARA Rule Generation

Objective: Analyze the contents of each file and build the YARA rule syntax.

Considerations:

Rules for .txt: If text files (such as Adware.txt) contain lists of strings or file names, the code must read each line and convert it into a YARA string.

Rules for .csv: The CSV file likely has columns with hashes (MD5, SHA256) or signature names.

In [129]:
def generate_yara_rule_text(label, content):
    """
    Generates the text for a YARA rule.

    Args:
        label (str): The name of the rule, derived from the file name.
        content (str): The content of the file to be converted into YARA strings.

    Returns:
        str: A string containing the complete YARA rule syntax.
    """
    print(f"Generating rule for: {label}")

    # Create a valid YARA identifier
    yara_identifier = re.sub(r'[^a-zA-Z0-9_]', '_', label)
    yara_strings = []

    # More aggressive sanitization and consistent line splitting
    sanitized_content = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', content)
    
    # Using re.split to handle various line endings more robustly
    lines = re.split(r'[\r\n]+', sanitized_content.strip())
    
    if label.lower().endswith("signatures_list"): # Adjusted condition for flexibility
        # Special handling for CSV
        if not lines:
            print("Empty CSV file")
            return ""

        csv_reader = csv.reader(lines)
        
        # Skip header if it exists
        try:
            first_row = next(csv_reader)
            if first_row and any(word in first_row[0].lower() for word in ['name', 'signature', 'hash', 'file']):
                print(f"Header detected and skipped: {first_row[0]}")
            elif first_row and first_row[0].strip():
                # Process the first row if it's not a header
                sanitized_string = first_row[0].strip()
                if sanitized_string:
                    escaped_string = sanitized_string.replace('\\', '\\\\').replace('"', '\\"')
                    yara_strings.append(f'\t$s0 = "{escaped_string}"')
        except StopIteration:
            print("Empty CSV or no data")
            return ""

        string_counter = len(yara_strings)
        for row in csv_reader:
            if row and row[0]:
                sanitized_string = row[0].strip()
                if sanitized_string:
                    escaped_string = sanitized_string.replace('\\', '\\\\').replace('"', '\\"')
                    yara_strings.append(f'\t$s{string_counter} = "{escaped_string}"')
                    string_counter += 1

    else:
        # Handling for general text files
        string_counter = 0
        for line in lines:
            cleaned_line = line.strip()
            if cleaned_line and not cleaned_line.startswith(('#', '//')):
                escaped_line = cleaned_line.replace('\\', '\\\\').replace('"', '\\"')
                yara_strings.append(f'\t$s{string_counter} = "{escaped_line}"')
                string_counter += 1

    print(f"Valid strings found: {len(yara_strings)}")

    if not yara_strings:
        print(f"No valid strings found for {label}")
        return ""

    strings_section = "\n".join(yara_strings)
    rule = f'''rule {yara_identifier} : {label} {{
    strings:
{strings_section}
    condition:
        any of them
}}

'''
    return rule

## 3. Compilation and Storage

Objective: Compile the generated YARA rules and save them in a .yar file.

In [130]:
def compile_and_save_yara_rules(rules_dict, output_dir="../output", filename="compiled_rules.yar"):
    """
    Compiles and saves YARA rules - COMPLETELY CORRECTED VERSION
    """
    print(f"\n{'='*60}")
    print(f"COMPILING AND SAVING YARA RULES")
    print(f"{'='*60}")
    
    if not rules_dict:
        print("ERROR: No rules to compile")
        return False

    # Create output directory
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        print(f"Directory created: {output_dir}")

    # Filter empty rules and combine
    valid_rules = []
    for label, rule_text in rules_dict.items():
        if rule_text and rule_text.strip():
            valid_rules.append(rule_text)
            print(f"Valid rule included: {label}")
        else:
            print(f"Empty rule omitted: {label}")

    if not valid_rules:
        print("ERROR: No valid rules to save")
        return False

    # Combine all rules
    combined_rules = "\n".join(valid_rules)
    
    # Final sanitization
    combined_rules = re.sub(r'[\x00]', '', combined_rules)
    
    # Define output file
    output_file = os.path.join(output_dir, filename)
    
    print(f"\n📊 SUMMARY:")
    print(f"   Valid rules: {len(valid_rules)}")
    print(f"   Total size: {len(combined_rules):,} characters")
    print(f"   Destination file: {output_file}")

    try:
        # SAVE THE TEXT FILE FIRST
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(combined_rules)
        
        # Verify that it was created
        if os.path.exists(output_file):
            file_size = os.path.getsize(output_file)
            print(f"\n🎉 FILE SAVED SUCCESSFULLY!")
            print(f"   📍 Location: {os.path.abspath(output_file)}")
            print(f"   📏 Size: {file_size:,} bytes")
            print(f"   🕐 Created: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        else:
            print("ERROR: The file was not created")
            return False

        # ATTEMPT TO COMPILE IF YARA IS AVAILABLE
        if YARA_AVAILABLE:
            print(f"\n🔍 Validating YARA syntax...")
            try:
                compiled_rules = yara.compile(source=combined_rules)
                print("Valid YARA syntax - Rules compiled correctly")
                
                # Save compiled version as well
                compiled_file = output_file.replace('.yar', '_compiled.yar')
                compiled_rules.save(compiled_file)
                print(f"Compiled version saved: {compiled_file}")
                
            except yara.SyntaxError as e:
                print(f"YARA syntax error: {e}")
                print("Text file saved for manual review")
            except Exception as e:
                print(f"Compilation error: {e}")
                print("Text file saved correctly")
        else:
            print("yara-python not available - Only text file was saved")

        # Show a sample of the content
        print(f"\n📄 Sample of the generated file:")
        with open(output_file, 'r', encoding='utf-8') as f:
            lines = f.readlines()[:15]
            for i, line in enumerate(lines, 1):
                print(f"   {i:2d}: {line.rstrip()}")
            if len(lines) >= 15:
                print("   ...")

        return True
        
    except Exception as e:
        print(f"UNEXPECTED ERROR: {e}")
        import traceback
        traceback.print_exc()
        return False

## 4. Main function to execute all 

In [131]:
def main_execution(data_directory):
    """
    Main function that executes the entire process
    """
    print("STARTING FULL YARA GENERATION PROCESS")
    print("="*60)
    
    # 1. Load data
    print("\n 1. LOADING DATA...")
    raw_data = load_data_and_handle_encoding(data_directory)
    
    if not raw_data:
        print(" No data loaded. Process terminated.")
        return False
    
    print(f"Files loaded: {list(raw_data.keys())}")
    
    # 2. Generate YARA rules
    print("\n 2. GENERATING YARA RULES...")
    yara_rules_dict = {}
    
    for label, content in raw_data.items():
        rule_text = generate_yara_rule_text(label, content)
        if rule_text:
            yara_rules_dict[label] = rule_text
    
    print(f"\n Rules generated: {len(yara_rules_dict)}/{len(raw_data)}")
    
    if not yara_rules_dict:
        print(" No valid rules generated. Process terminated.")
        return False
    
    # 3. Compile and save
    print("\n 3. COMPILING AND SAVING...")
    success = compile_and_save_yara_rules(yara_rules_dict)
    
    if success:
        print("\n PROCESS COMPLETED SUCCESSFULLY!")
    else:
        print("\n Error in the compilation process")
    
    return success

## 5. Execute the whole process 

In [132]:
if __name__ == "__main__":
    # My path to the data directory
    data_directory = '/Users/dianaterraza/Desktop/data_scientist_yara_project/new_input_files'
    
    # Verify that the directory exists
    if not os.path.exists(data_directory):
        print(f" ERROR: Directory not found: {data_directory}")
        print("Please, update the 'data_directory' variable with the correct path.")
    else:
        main_execution(data_directory)

STARTING FULL YARA GENERATION PROCESS

 1. LOADING DATA...
Processing: Backdoor.txt
Read with latin-1
Loaded content: 1240170 characters
Processing: Dialer.txt
Read with latin-1
Loaded content: 8244 characters
Processing: Behavior.txt
Read with latin-1
Loaded content: 953526 characters
Processing: BrowserModifier.txt
Read with latin-1
Loaded content: 47888 characters
Processing: Adware.txt
Read with latin-1
Loaded content: 95012 characters
Processing: Microsoft_Defender_All_signatures_list.csv
Read with UTF-8
Loaded content: 11161500 characters
Processing: Constructor.txt
Read with latin-1
Loaded content: 35412 characters

 Total files uploaded: 7
Files loaded: ['Backdoor', 'Dialer', 'Behavior', 'BrowserModifier', 'Adware', 'Microsoft_Defender_All_signatures_list', 'Constructor']

 2. GENERATING YARA RULES...
Generating rule for: Backdoor
Valid strings found: 1
Generating rule for: Dialer
Valid strings found: 1
Generating rule for: Behavior
Valid strings found: 1
Generating rule for: B