<center><b><font size=6>Data Decoding and Sampling<b><center>

This notebook covers the first preprocessing steps for the SSH attack dataset: decoding Base64-encoded commands, cleaning and tokenizing each session, and saving the result for later analysis and modeling.

0. **Install Dependencies**
1. **Helper Functions**
2. **Loading and processing data**

<center><b><font size=5>Install Dependencies<b><center>

In [1]:
!python ../scripts/install_dependencies.py section0

Installing common packages: pandas, pyarrow
Successfully installed: pandas
Successfully installed: pyarrow
No dependencies found for section 'Section 0'.

In [2]:
import pandas as pd
import base64
import re

<center><b><font size=5>Helper Functions<b><center>

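**Functions:**

- `decode_base64`: decodes a Base64-encoded string into plaintext.
- `decode_session`: finds and decodes Base64 commands in an SSH session log.
- `clean_full_session`: splits a session into tokens with regular expressions and strips special characters, key values, and trailing digits.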

In [3]:
# Function to decode a Base64-encoded string into plaintext
def decode_base64(word):
    try:
        # Attempt to decode the Base64 string and return it as text
        return base64.b64decode(word).decode()
    except Exception as e:
        # Report the failure and return None (e.g., invalid Base64 input or non-UTF-8 bytes)
        print(f"Failed to decode Base64 string: {word}. Error: {e}")
        return None
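
As a minimal sketch of the decoding behavior (the name `safe_b64decode` and the sample strings are illustrative assumptions, not part of the notebook), the following shows a round trip plus two failure modes a decoder like this has to absorb: malformed Base64 and bytes that are not valid UTF-8. It passes `validate=True`, which the helper above does not, so that characters outside the Base64 alphabet are rejected rather than silently skipped.

```python
import base64
import binascii


def safe_b64decode(word):
    """Return the decoded text, or None when the input cannot be decoded."""
    try:
        return base64.b64decode(word, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError) as e:
        print(f"Failed to decode {word!r}: {e}")
        return None


encoded = base64.b64encode(b"ls -la /tmp").decode("ascii")
roundtrip = safe_b64decode(encoded)   # valid input, decodes back to the command
bad_padding = safe_b64decode("abc")   # length is not a multiple of 4
not_utf8 = safe_b64decode("//4=")     # decodes to bytes 0xFF 0xFE, not valid UTF-8
print(roundtrip, bad_padding, not_utf8)
```

Returning `None` on failure lets the caller fall back to the original text, which is how `decode_session` uses the helper.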

In [4]:
# Function to decode Base64 encoded parts of a session
def decode_session(full_session):
    new_full_session = [] # List to store the decoded session chunks
    
    for session_chunk in full_session.split(";"):  # Split the session into chunks by semicolons
        if "base64 --decode" in session_chunk and "echo" in session_chunk:  # Check for decoding pattern
            parts = session_chunk.split()  # Split chunk into parts
            base64_encoded = None
        
            # Find the Base64 encoded string after "echo"
            for i in range(len(parts) - 1):  # Stop one short so parts[i + 1] always exists
                if parts[i] == "echo":
                    base64_encoded = parts[i + 1].strip("\"")  # Extract the encoded part
                    break
            
            if base64_encoded:
                decoded = decode_base64(base64_encoded)  # Decode the Base64 string
                if decoded:
                    # print(f"Decoded Base64 string: {decoded}\n")  # Print the decoded string
                    words_decoded = decoded.split("\n")  # Split the decoded string into lines

                    # Filter out empty lines and join into a single chunk
                    new_full_session.append("; ".join(list(filter(None, words_decoded))).strip())
                    global base64_decoded_counter
                    base64_decoded_counter += 1  # Increment the global counter
                else:
                    # print(f"Failed to decode chunk: {session_chunk}")
                    new_full_session.append(session_chunk.strip())
            else:
                # print(f"No Base64 encoded string found in chunk: {session_chunk}")
                new_full_session.append(session_chunk.strip())
        else:
            new_full_session.append(session_chunk.strip()) # Add chunk as-is if no decoding needed
    
    return "; ".join(new_full_session) # Return the reconstructed session
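
To make the effect of session decoding concrete, here is a simplified, self-contained sketch. It is not the function above: it swaps the token scan for a regular expression, and the names `ECHO_B64` and `expand_base64_chunks` as well as the sample session are hypothetical. It assumes the payload directly follows `echo` and is piped to `base64 --decode` within the same semicolon-separated chunk.

```python
import base64
import re

# Assumed pattern: echo "<payload>" ... base64 --decode (payload right after echo)
ECHO_B64 = re.compile(r'echo\s+"?([A-Za-z0-9+/=]+)"?.*base64\s+--decode')


def expand_base64_chunks(session):
    """Replace each echo/base64 chunk with the commands hidden in its payload."""
    out = []
    for chunk in session.split(";"):
        lines = []
        match = ECHO_B64.search(chunk)
        if match:
            try:
                decoded = base64.b64decode(match.group(1), validate=True).decode()
                # Keep only the non-empty lines of the decoded script
                lines = [ln.strip() for ln in decoded.split("\n") if ln.strip()]
            except ValueError:
                lines = []  # invalid Base64 or non-UTF-8 bytes: keep the chunk as-is
        out.append("; ".join(lines) if lines else chunk.strip())
    return "; ".join(out)


payload = base64.b64encode(b"uname -a\nwhoami\n").decode("ascii")
session = f'cd /tmp ; echo "{payload}" | base64 --decode | sh ; ls'
expanded = expand_base64_chunks(session)
print(expanded)
```

Chunks that do not match the pattern, or whose payload fails to decode, pass through unchanged apart from whitespace trimming.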

In [5]:
# Function to clean a session and tokenize it into individual commands
def clean_full_session(session):
    # Split the session into tokens on semicolons or any whitespace (spaces, tabs, newlines)
    commands = re.split(r"[;\s]+", session)

    # Clean each command
    cleaned_commands = []
    for cmd in commands:
        
        if "#!/bin/bash" in cmd: # Preserve bash script declarations
            cleaned_commands.append(cmd)
            continue
            
        # Remove leading and trailing spaces
        cmd = cmd.strip()
        
        # Remove unnecessary special characters while keeping paths
        cmd = re.sub(r"[^a-zA-Z0-9\/=\-\.]+", "", cmd)
        
        # Remove values associated with keys (everything after the '=' symbol)
        if "=" in cmd:
            cmd = cmd.split("=")[0]
        
        # Strip trailing digits (this also empties tokens made up only of digits)
        cmd = re.sub(r"\d+$", "", cmd)
        
        # Add the command if it's not empty
        if cmd:
            cleaned_commands.append(cmd)

    # Return the list of cleaned commands
    return cleaned_commands
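
The per-token cleaning rules are easier to follow on individual inputs. The sketch below applies the same three regular-expression steps to single tokens; the helper name `clean_token` and the sample tokens are assumptions made for illustration.

```python
import re


def clean_token(token):
    """Apply the three cleaning steps to one token and return the result."""
    # 1. Drop every character except letters, digits, '/', '=', '-' and '.'
    token = re.sub(r"[^a-zA-Z0-9\/=\-\.]+", "", token)
    # 2. Keep only the key of a key=value pair
    token = token.split("=")[0]
    # 3. Strip trailing digits
    return re.sub(r"\d+$", "", token)


samples = ["/proc/mounts", "PATH=/usr/bin", "'wget'", "|", "8080", "md5sum", "sha256"]
cleaned = {s: clean_token(s) for s in samples}
for raw, result in cleaned.items():
    print(f"{raw!r:18} -> {result!r}")
```

Note that the last step removes any trailing digits, so a token such as `sha256` is shortened to `sha`, while `md5sum` is untouched because its digit is not at the end.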

<center><b><font size=5>Loading and processing data<b><center>

**Steps:**

- Load the raw dataset.
- Decode Base64-encoded commands in the SSH sessions.
- Convert timestamps to datetime format.
- Clean and tokenize the decoded sessions.
- Save the processed dataset for further use.

In [6]:
# Load the raw dataset from a Parquet file
df_original = pd.read_parquet('../data/raw/ssh_attacks.parquet')
print("Original dataset loaded.")

df_decoded = df_original.copy()  # Create a copy to preserve the original data

df_decoded

Original dataset loaded.


Unnamed: 0,session_id,full_session,first_timestamp,Set_Fingerprint
0,0,enable ; system ; shell ; sh ; cat /proc/mount...,2019-06-04 09:45:11.151186+00:00,"[Defense Evasion, Discovery]"
1,1,enable ; system ; shell ; sh ; cat /proc/mount...,2019-06-04 09:45:50.396610+00:00,"[Defense Evasion, Discovery]"
2,2,enable ; system ; shell ; sh ; cat /proc/mount...,2019-06-04 09:54:41.863315+00:00,"[Defense Evasion, Discovery]"
3,3,enable ; system ; shell ; sh ; cat /proc/mount...,2019-06-04 10:22:14.623875+00:00,"[Defense Evasion, Discovery]"
4,4,enable ; system ; shell ; sh ; cat /proc/mount...,2019-06-04 10:37:19.725874+00:00,"[Defense Evasion, Discovery]"
...,...,...,...,...
233030,233042,cat /proc/cpuinfo | grep name | wc -l ; echo -...,2020-02-29 23:47:28.217237+00:00,"[Discovery, Persistence]"
233031,233043,cat /proc/cpuinfo | grep name | wc -l ; echo -...,2020-02-29 23:49:01.009046+00:00,"[Discovery, Persistence]"
233032,233044,cat /proc/cpuinfo | grep name | wc -l ; echo -...,2020-02-29 23:56:18.827281+00:00,"[Discovery, Persistence]"
233033,233045,cat /proc/cpuinfo | grep name | wc -l ; echo -...,2020-02-29 23:56:56.263104+00:00,"[Discovery, Persistence]"


In [7]:
# Initialize the module-level counter that decode_session increments
base64_decoded_counter = 0

# Apply the session decoding function to the 'full_session' column
df_decoded["full_session"] = df_decoded["full_session"].apply(decode_session)
print(f"Number of Base64 strings decoded: {base64_decoded_counter}")

df_decoded

Number of Base64 strings decoded: 90026


Unnamed: 0,session_id,full_session,first_timestamp,Set_Fingerprint
0,0,enable; system; shell; sh; cat /proc/mounts; /...,2019-06-04 09:45:11.151186+00:00,"[Defense Evasion, Discovery]"
1,1,enable; system; shell; sh; cat /proc/mounts; /...,2019-06-04 09:45:50.396610+00:00,"[Defense Evasion, Discovery]"
2,2,enable; system; shell; sh; cat /proc/mounts; /...,2019-06-04 09:54:41.863315+00:00,"[Defense Evasion, Discovery]"
3,3,enable; system; shell; sh; cat /proc/mounts; /...,2019-06-04 10:22:14.623875+00:00,"[Defense Evasion, Discovery]"
4,4,enable; system; shell; sh; cat /proc/mounts; /...,2019-06-04 10:37:19.725874+00:00,"[Defense Evasion, Discovery]"
...,...,...,...,...
233030,233042,cat /proc/cpuinfo | grep name | wc -l; echo -e...,2020-02-29 23:47:28.217237+00:00,"[Discovery, Persistence]"
233031,233043,cat /proc/cpuinfo | grep name | wc -l; echo -e...,2020-02-29 23:49:01.009046+00:00,"[Discovery, Persistence]"
233032,233044,cat /proc/cpuinfo | grep name | wc -l; echo -e...,2020-02-29 23:56:18.827281+00:00,"[Discovery, Persistence]"
233033,233045,cat /proc/cpuinfo | grep name | wc -l; echo -e...,2020-02-29 23:56:56.263104+00:00,"[Discovery, Persistence]"
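
The count printed above relies on a module-level variable that `decode_session` mutates. One alternative, sketched here with hypothetical names (`try_decode`, `tokens`), is to have the decoder report success alongside its result so the caller can sum the count. This illustrates the design choice only and is not the notebook's implementation.

```python
import base64


def try_decode(token):
    """Return (text, True) for valid Base64-encoded UTF-8, else (token, False)."""
    try:
        return base64.b64decode(token, validate=True).decode("utf-8"), True
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return token, False


tokens = ["bHM=", "not base64!", "d2hvYW1p"]
results = [try_decode(t) for t in tokens]

texts = [text for text, _ in results]
decoded_count = sum(ok for _, ok in results)
print(texts, decoded_count)
```

Because no shared state is involved, the function can be re-run on the same data without resetting a counter first.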


In [8]:
# Convert the 'first_timestamp' column to a datetime format
df_decoded['first_timestamp'] = pd.to_datetime(df_decoded['first_timestamp'])
print("Converted 'first_timestamp' to datetime format.")

# Apply the cleaning function to the 'full_session' column
df_decoded["full_session"] = df_decoded["full_session"].apply(clean_full_session)
print("Cleaned and tokenized 'full_session' column.")

# Save the processed dataset to a Parquet file
df_decoded.to_parquet("../data/processed/ssh_attacks_decoded.parquet")
print("Processed dataset saved to ../data/processed/ssh_attacks_decoded.parquet.")

df_decoded

Converted 'first_timestamp' to datetime format.
Cleaned and tokenized 'full_session' column.
Processed dataset saved to ../data/processed/ssh_attacks_decoded.parquet.


Unnamed: 0,session_id,full_session,first_timestamp,Set_Fingerprint
0,0,"[enable, system, shell, sh, cat, /proc/mounts,...",2019-06-04 09:45:11.151186+00:00,"[Defense Evasion, Discovery]"
1,1,"[enable, system, shell, sh, cat, /proc/mounts,...",2019-06-04 09:45:50.396610+00:00,"[Defense Evasion, Discovery]"
2,2,"[enable, system, shell, sh, cat, /proc/mounts,...",2019-06-04 09:54:41.863315+00:00,"[Defense Evasion, Discovery]"
3,3,"[enable, system, shell, sh, cat, /proc/mounts,...",2019-06-04 10:22:14.623875+00:00,"[Defense Evasion, Discovery]"
4,4,"[enable, system, shell, sh, cat, /proc/mounts,...",2019-06-04 10:37:19.725874+00:00,"[Defense Evasion, Discovery]"
...,...,...,...,...
233030,233042,"[cat, /proc/cpuinfo, grep, name, wc, -l, echo,...",2020-02-29 23:47:28.217237+00:00,"[Discovery, Persistence]"
233031,233043,"[cat, /proc/cpuinfo, grep, name, wc, -l, echo,...",2020-02-29 23:49:01.009046+00:00,"[Discovery, Persistence]"
233032,233044,"[cat, /proc/cpuinfo, grep, name, wc, -l, echo,...",2020-02-29 23:56:18.827281+00:00,"[Discovery, Persistence]"
233033,233045,"[cat, /proc/cpuinfo, grep, name, wc, -l, echo,...",2020-02-29 23:56:56.263104+00:00,"[Discovery, Persistence]"
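
The `first_timestamp` values displayed above carry microseconds and a UTC offset. As a standard-library illustration of what parsing one such string yields (the notebook itself uses `pd.to_datetime`; `datetime.fromisoformat` is swapped in here only to keep the sketch dependency-free), consider:

```python
from datetime import datetime, timedelta, timezone

raw = "2019-06-04 09:45:11.151186+00:00"  # first row of the table above
ts = datetime.fromisoformat(raw)

# The parsed value is timezone-aware and keeps microsecond precision
print(ts.isoformat(), ts.utcoffset(), ts.microsecond)

# Aware timestamps can be compared and subtracted directly
later = datetime.fromisoformat("2019-06-04 09:45:50.396610+00:00")
gap = later - ts
print(gap.total_seconds())
```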
