<img style="max-width:20em; height:auto;" src="../graphics/A-Little-Book-on-Adversarial-AI-Cover.png"/>

Author: Nik Alleyne   
Author Blog: https://www.securitynik.com   
Author GitHub: github.com/securitynik   

Author Other Books: [   

            "https://www.amazon.ca/Learning-Practicing-Leveraging-Practical-Detection/dp/1731254458/",   
            
            "https://www.amazon.ca/Learning-Practicing-Mastering-Network-Forensics/dp/1775383024/"   
        ]   


This notebook ***(stego_basic.ipynb)*** is part of the series of notebooks from ***A Little Book on Adversarial AI***, a free ebook released by Nik Alleyne.

### Steganography Basics   

### Lab Objectives:  
- Get an introduction to some basic steganography  
- Perform the task of adding, reading and removing bytes  
- Execute Python code via the appended content   
- Learn how to use tools such as xxd to look at byte sequences  

### Step 1:   
Create a few basic models, so we have something to look at

In [1]:
# Import the needed libraries
import torch
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
import joblib
import os

In [2]:
# Get some sample data from a toy dataset
X, y = make_classification()
print(f'The shape of X is: {X.shape}')
print(f'The shape of y is: {y.shape}')

# Get the first 5 records
X[:5], y[:5]

The shape of X is: (100, 20)
The shape of y is: (100,)


(array([[ 0.89737271,  1.19883403, -0.57614418, -0.85941023, -1.08271572,
          1.19043099,  0.69201185,  0.7254497 , -0.96392806,  0.51763767,
         -0.8329954 ,  0.26745915, -0.21357159, -0.15951903, -1.30596704,
         -0.64671263, -0.83507562,  1.17507698,  1.56122934,  0.64540703],
        [-0.44280726, -0.28487123, -0.75101194, -1.52651891, -0.58925645,
          0.52610495,  0.25752   , -1.68152247,  0.59657607, -1.413296  ,
          0.32269801,  0.61772267, -0.97169635, -1.53766302,  0.49397657,
          1.38020093,  1.43151648,  2.43857923,  0.13773635,  0.3250355 ],
        [-0.3731393 ,  0.42547094, -0.62212121, -0.29606904,  1.99180325,
         -1.95303076, -1.04132407,  1.13082651,  0.05642607, -0.28032486,
         -0.35760038, -1.41068385,  1.00750231, -1.39252609, -0.01392689,
          1.61319001, -0.63282254,  1.11243071, -0.7898864 ,  0.68300386],
        [-0.55489001,  0.63906201, -0.68288286, -0.32922824, -0.83013118,
          0.75328348,  0.37464092, 

With the data in place, let us create three different models. The idea here is just to see if there are any similarities within the structure of the saved model files.   

The choice of Logistic Regression, Decision Tree and LinearSVC is arbitrary. There is nothing special about them.  

We also save these models to the file system, so we can analyze them.  

### Step 2:   

In [3]:
# Create three different models.
lr_clf = LogisticRegression().fit(X, y)
dt_clf = DecisionTreeClassifier().fit(X, y)
svc_clf = LinearSVC().fit(X, y)

In [4]:
# Save the models
joblib.dump(value=lr_clf, filename=r'/tmp/lr_clf.joblib')
joblib.dump(value=dt_clf, filename=r'/tmp/dt_clf.joblib')
joblib.dump(value=svc_clf, filename=r'/tmp/svc_clf.joblib')

['/tmp/svc_clf.joblib']

In [5]:
# Verify the files were saved to the file system
!ls /tmp/*_clf.joblib

/tmp/dt_clf.joblib  /tmp/lr_clf.joblib	/tmp/svc_clf.joblib


In [6]:
# As always, show that the model can make predictions after loading
loaded_model = joblib.load(filename=r'/tmp/lr_clf.joblib')
loaded_model.predict(X[:5])

array([0, 1, 0, 1, 0])

Looking at the three model files above with xxd, all three end with the same 30-byte sequence:   
$ xxd -s -30 /tmp/svc_clf.joblib
0000035d: 8c10 5f73 6b6c 6561 726e 5f76 6572 7369  .._sklearn_versi
0000036d: 6f6e 948c 0531 2e36 2e31 9475 622e       on...1.6.1.ub.

$ xxd -s -30 /tmp/dt_clf.joblib
000009ab: 8c10 5f73 6b6c 6561 726e 5f76 6572 7369  .._sklearn_versi
000009bb: 6f6e 948c 0531 2e36 2e31 9475 622e       on...1.6.1.ub.

$ xxd -s -30 /tmp/lr_clf.joblib
000003e1: 8c10 5f73 6b6c 6561 726e 5f76 6572 7369  .._sklearn_versi
000003f1: 6f6e 948c 0531 2e36 2e31 9475 622e       on...1.6.1.ub.

This suggests some structure that could be leveraged. Note that the ending embeds the sklearn version string: the output above came from sklearn 1.6.1, while the cell outputs below came from 1.7.0. Any code that relies on these exact bytes is therefore version specific and may need to be adjusted to suit other versions. 
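The comparison above was done by eye. As a minimal sketch, the shared ending can also be computed programmatically. The helper name `common_suffix` and the three stand-in byte strings are illustrative assumptions, not part of the lab; only the 30-byte tail is taken from the xxd output above.

```python
def common_suffix(blobs):
    """Return the longest byte string that every blob in `blobs` ends with."""
    if not blobs:
        return b''
    shortest = min(len(blob) for blob in blobs)
    n = 0
    # Walk backwards from the end while all blobs agree on the byte at that offset
    while n < shortest and len({blob[-1 - n] for blob in blobs}) == 1:
        n += 1
    return blobs[0][len(blobs[0]) - n:]


# The 30 trailing bytes shown by xxd above (sklearn 1.6.1)
tail = b'\x8c\x10_sklearn_version\x94\x8c\x051.6.1\x94ub.'

# Hypothetical stand-ins for the three model files: different bodies, same ending
model_a = b'AAAA' + tail
model_b = b'BB' + tail
model_c = b'CCCCCC' + tail

shared = common_suffix([model_a, model_b, model_c])
print(len(shared), shared)
```

With the real files, each blob would be the result of reading a `.joblib` file in binary mode.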

Rather than using the full 30 bytes, let's use 16 instead. Also, let's get the output as a C array by using the `-i` option.

$ xxd -i -s -16 /tmp/lr_clf.joblib
unsigned char _tmp_lr_clf_joblib[] = {
  0x73, 0x69, 0x6f, 0x6e, 0x94, 0x8c, 0x05, 0x31, 0x2e, 0x36, 0x2e, 0x31,
  0x94, 0x75, 0x62, 0x2e
};
unsigned int _tmp_lr_clf_joblib_len = 16;
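To make sense of these trailing bytes, it can help to read them as pickle opcodes, since joblib builds on Python's pickle format. The sketch below is only an illustration: it takes the last 12 bytes of the C array above (everything after the `sion` characters) and lets the standard library's `pickletools.genops` name each opcode. The variable names `tail` and `decoded` are assumptions made for this example.

```python
import pickletools

# Last 12 bytes of the xxd -i output above (sklearn 1.6.1),
# starting right after the "_sklearn_version" string
tail = bytes([0x94, 0x8c, 0x05, 0x31, 0x2e, 0x36, 0x2e, 0x31,
              0x94, 0x75, 0x62, 0x2e])

# genops only tokenizes opcodes; it does not need a complete pickle,
# as long as the bytes start on an opcode boundary
decoded = [(op.name, arg) for op, arg, _pos in pickletools.genops(tail)]

for name, arg in decoded:
    print(name, arg)
```

The listing should end with `BUILD` followed by `STOP` (the final `0x2e`, a literal `.`), preceded by the version string `1.6.1`. That hints at why all three files end the same way: each one finishes by storing the `_sklearn_version` entry of the estimator's state.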

References: 
https://www.stackzero.net/how-to-hide-messages-in-pictures-with-python-steganography/


### Step 3:  
Let us validate the above.  

In [7]:
# Reviewing the last 30 bytes of the svc_clf model
!xxd -s -30 /tmp/svc_clf.joblib

0000035d: 8c10 5f73 6b6c 6561 726e 5f76 6572 7369  .._sklearn_versi
0000036d: 6f6e 948c 0531 2e37 2e30 9475 622e       on...1.7.0.ub.


In [8]:
# Reviewing the last 30 bytes of the dt_clf model
!xxd -s -30 /tmp/dt_clf.joblib

000009ab: 8c10 5f73 6b6c 6561 726e 5f76 6572 7369  .._sklearn_versi
000009bb: 6f6e 948c 0531 2e37 2e30 9475 622e       on...1.7.0.ub.


In [9]:
# Reviewing the last 30 bytes of the lr_clf model
!xxd -s -30 /tmp/lr_clf.joblib

000003e1: 8c10 5f73 6b6c 6561 726e 5f76 6572 7369  .._sklearn_versi
000003f1: 6f6e 948c 0531 2e37 2e30 9475 622e       on...1.7.0.ub.


In [10]:
# Take a different view of the bytes
!xxd -i -s -16 /tmp/lr_clf.joblib

unsigned char _tmp_lr_clf_joblib[] = {
  0x73, 0x69, 0x6f, 0x6e, 0x94, 0x8c, 0x05, 0x31, 0x2e, 0x37, 0x2e, 0x30,
  0x94, 0x75, 0x62, 0x2e
};
unsigned int _tmp_lr_clf_joblib_len = 16;


Looks like everything is good now. Let us move forward with writing the bytes to the model.  

Create a function to read the bytes of the model file. Its optional **num_bytes** parameter controls how many trailing bytes are returned.

Just for the sake of it, let us work with the Logistic Regression classifier. The same approach can be used for any of the other models. Feel free to experiment. 

### Step 4:  

In [11]:
# Create a function to read the model file and return its last num_bytes bytes
def read_bytes(filename=r'/tmp/lr_clf.joblib', num_bytes=32):
    print(f'Reading the last {num_bytes} bytes from model file: {filename}')
    with open(file=filename, mode='rb') as fp:

        # Read the file and return the last num_bytes
        return fp.read()[-num_bytes:]

# Call the function with the default values
# Capture the bytes that get returned
# These will be used later to locate the end of the original byte stream
ending_bytes = read_bytes()
ending_bytes

Reading the last 32 bytes from model file: /tmp/lr_clf.joblib


b'\x00\x00\x8c\x10_sklearn_version\x94\x8c\x051.7.0\x94ub.'
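The function above reads the whole file and then slices off the tail. For larger model files, a seek-based variant avoids loading everything into memory. This is a minimal sketch: the name `read_tail` and the throwaway demo file are illustrative and not part of the lab.

```python
import os
import tempfile


def read_tail(filename, num_bytes=32):
    """Return the last `num_bytes` bytes of `filename` without reading it all."""
    with open(filename, mode='rb') as fp:
        # Seeking to the end returns the file size in bytes
        size = fp.seek(0, os.SEEK_END)
        # Clamp at 0 so files shorter than num_bytes are returned whole
        fp.seek(max(0, size - num_bytes))
        return fp.read()


# Demonstrate on a throwaway file holding the bytes 0..99
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(100)))

demo_tail = read_tail(tmp.name, num_bytes=4)
whole_file = read_tail(tmp.name, num_bytes=1000)
os.remove(tmp.name)

print(demo_tail)
```

For the small files in this lab the difference is negligible, so the simpler slice-based version is a reasonable choice.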

With the number of bytes read, we can now create another function to append our content of interest to these bytes that were read above.  

The parameter **secret** represents the content we would like to append. Because we ultimately want to execute arbitrary code, we will append a small shell command that runs a Python one-liner, which simply prints **Hello World!**.  

### Step 5:  

In [12]:
# Function to append the secret
def add_secret(filename=r'/tmp/lr_clf.joblib', secret="""python -c "print('Hello World!')" """):

    # Open the file in a mode that allows us to append binary data
    with open(file=filename, mode='ab') as fp:

        # With the file open, ensure the content is written as bytes
        # by forcing the encoding to utf-8
        print(f'Adding content: **{secret}** to model_file: {filename}')
        fp.write(bytes(secret, encoding='utf-8'))

# Call the function with the default values
add_secret()

Adding content: **python -c "print('Hello World!')" ** to model_file: /tmp/lr_clf.joblib


In [13]:
# Verify our secret is in the file
# by looking at the last 16 bytes
! xxd -s -16 /tmp/lr_clf.joblib


00000411: 4865 6c6c 6f20 576f 726c 6421 2729 2220  Hello World!')" 


Great, we were able to append content to the file. Can we read it back? We read the file's bytes above, so we should be able to do so again. 

The difference this time is that we do not only want to read the raw bytes; we also want to execute the code they contain. Let us create a function to achieve this objective.   

### Step 6:   

In [14]:
# Function to now read the content
def read_secret(filename=r'/tmp/lr_clf.joblib'):
    # Read the file in binary mode
    with open(file=filename, mode='rb') as fp:
        buf = fp.read()

        # This is where the ending bytes come into play
        # Find the index where the original ending bytes start
        idx = buf.index(ending_bytes)
        print(f'Index position is: {idx}')
        print(f'Len of ending_bytes is: {len(ending_bytes)}')
        print(f'Ending bytes span: {idx} to {idx + len(ending_bytes)}')

        # Skip past the ending bytes (idx plus their length)
        # and take everything from that position to the end of the byte stream
        # We then decode it as 'utf-8'
        return buf[idx+len(ending_bytes) : ].decode(encoding='utf-8')

In [15]:
# With all of this in place, call the function and execute the decoded content
_ = os.system(command=read_secret())

Index position is: 991
Len of ending_bytes is: 32
Ending bytes span: 991 to 1023
Hello World!


Awesome! 
- We read the bytes in a model file.   
- We then appended content to the end of the file. 
- We were then able to execute the Python code, which printed **Hello World!** 
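The read step relies on a global marker and passes whatever it finds straight to the shell. When the goal is only to inspect what was appended, a variant that takes the marker explicitly and returns the raw bytes without executing them may be easier to reason about. This is a sketch under assumptions: `extract_payload`, the stand-in model bytes and the sample payload are hypothetical names and values, not part of the lab.

```python
def extract_payload(buf, marker):
    """Return the bytes after the first occurrence of `marker`, or None if absent."""
    idx = buf.find(marker)
    if idx == -1:
        # Unlike bytes.index, find does not raise when the marker is missing
        return None
    return buf[idx + len(marker):]


# The marker is the 30-byte ending seen in the xxd output (sklearn 1.7.0)
marker = b'\x8c\x10_sklearn_version\x94\x8c\x051.7.0\x94ub.'

# Hypothetical stand-in for the original model bytes
model_bytes = b'...model data...' + marker
payload = b"python -c \"print('Hello World!')\" "

found = extract_payload(model_bytes + payload, marker)
nothing_appended = extract_payload(model_bytes, marker)
no_marker = extract_payload(b'unrelated bytes', marker)

print(found)
```

An empty result means the file ends exactly at the marker, while `None` means the marker was not found at all, for example because the file came from a different sklearn version.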

However, the question now is ... Can the model still make predictions? Let us find out.   

### Step 7:   


In [16]:
# Reload the model from the modified file and make a prediction
# To ensure the model still works.
loaded_model = joblib.load(filename=r'/tmp/lr_clf.joblib')
loaded_model.predict(X[:5])

array([0, 1, 0, 1, 0])

Looks like we are still good to go. The model still works as expected.   
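One likely reason appended bytes do not disturb loading is that a pickle stream ends with a STOP opcode (`.`), and the unpickler returns as soon as it reaches it. The sketch below shows this with a plain `pickle` stand-in rather than a real joblib model; the dictionary and payload are illustrative assumptions, and a real joblib file also carries NumPy array data.

```python
import io
import pickle

# Stand-in object, used here in place of a trained model
original = {'coef': [0.1, 0.2], 'version': '1.7.0'}
payload = b"python -c \"print('Hello World!')\" "

clean = pickle.dumps(original)
tampered = clean + payload

# The unpickler stops at the STOP opcode, so the trailing bytes are ignored
restored = pickle.loads(tampered)

# Reading from a file-like object shows what was left unread after loading
stream = io.BytesIO(tampered)
pickle.load(stream)
leftover = stream.read()

print(restored == original)
print(leftover)
```

The leftover bytes are exactly the appended payload, which is what makes this kind of tampering easy to miss when the only check is whether the model loads.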

Just to ensure we are being tidy, let us clean up the content we added to the model file. We do this by creating a function for this purpose.   

### Step 8:  

In [17]:
# Remove the secret
def remove_secret(filename=r'/tmp/lr_clf.joblib'):
    with open(file=filename, mode='rb+') as fp:
        buf = fp.read()

        # go to the index in the file where the ending_bytes start
        idx = buf.index(ending_bytes)

        # Truncate the file up to idx + length of ending_bytes
        fp.truncate(idx + len(ending_bytes))


# Call the function to remove the secret from the file
remove_secret()

In [18]:
# Verify the secret is no longer there
# by looking at the last 32 bytes
! xxd -s -32 /tmp/lr_clf.joblib


000003df: 0000 8c10 5f73 6b6c 6561 726e 5f76 6572  ...._sklearn_ver
000003ef: 7369 6f6e 948c 0531 2e37 2e30 9475 622e  sion...1.7.0.ub.


We did well.  
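Looking at the last bytes confirms the ending, but not that the rest of the file is untouched. One way to check the full round trip is to compare a hash of the file before appending and after truncating. This is a minimal sketch on a throwaway file; `sha256_of` and the stand-in contents are hypothetical and not part of the lab.

```python
import hashlib
import os
import tempfile


def sha256_of(filename):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(filename, mode='rb') as fp:
        return hashlib.sha256(fp.read()).hexdigest()


# Throwaway file standing in for the model file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'stand-in model bytes.')
path = tmp.name

digest_before = sha256_of(path)
size_before = os.path.getsize(path)

# Append content, as in Step 5
with open(path, mode='ab') as fp:
    fp.write(b'appended payload')
digest_tampered = sha256_of(path)

# Truncate back to the original size, as in Step 8
with open(path, mode='rb+') as fp:
    fp.truncate(size_before)
digest_after = sha256_of(path)

os.remove(path)
print(digest_before == digest_after)
```

The same comparison also works as a simple integrity check: if a digest is recorded when the model is saved, any appended bytes will change it.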

### Lab Takeaways:   
- We began the process of performing some basic steganography   
- We looked at three different model files and found that the last 30 bytes are the same in all of them  
- We added some Python code to the end of the file  
- We read back and executed the appended content  
- We removed the content we added   

Overall, we basically laid the foundation for something we can build on. 