# Extraction attacks via model clonning

<a href="https://colab.research.google.com/drive/115WTsmRYRUGBl3rHHLgAM8GI41JjXXpl" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

Adversarial machine learning is a specialized area within machine learning that aims to develop algorithms and techniques capable of withstanding and responding to adversarial attacks. These attacks involve malicious actors intentionally manipulating input data to machine learning models, resulting in incorrect or misleading outputs.

In essence, adversarial machine learning focuses on enhancing the robustness and security of machine learning models by identifying their vulnerabilities and creating effective countermeasures to mitigate the effects of such attacks.

In this notebook, we will investigate a specific type of adversarial attack known as model extraction, commonly called cloning. But what exactly is a model extraction attack?

![extraction](https://vitalab.github.io/article/images/stealml/fig1.jpeg)

[Source](https://vitalab.github.io/article/2019/11/21/stealml.html)


Model extraction is a type of cyber attack in which an attacker attempts to extract the details of a machine learning model trained by someone else. This can allow the attacker to create a copy of the model or use its insights to develop their own ML model.

These attacks typically involve reverse engineering the model, which can be achieved through techniques such as querying the model with specific inputs or observing its responses to a set of test data.

These attacks can be particularly damaging when the ML model is used to process sensitive or confidential data, such as personal information or financial data, as the attacker may be able to use the extracted model to gain unauthorized access to this data.

> **If you do not want to train the models, you can load the trained versions from the Hub. 🤗**



In [None]:
!pip install textattack -q
!pip install tensorflow==2.10.1 keras==2.10.0 -q
!pip install huggingface_hub -q

In [2]:
from huggingface_hub import hf_hub_download

# Download the model (this will be the target of our attack)
hf_hub_download(repo_id="AiresPucrs/model-api",
                filename="model_api.h5",
                local_dir="./",
                repo_type="model"
                )

# Download the tokenizer file
hf_hub_download(repo_id="AiresPucrs/model-api",
                filename="tokenizer_model_api.json",
                local_dir="./",
                repo_type="model"
                )

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


model_api.h5:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

tokenizer_model_api.json:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

'tokenizer_model_api.json'

Let us begin our exposition by faking an API around the model we downloaded.

In [5]:
import json
import tensorflow as tf
import keras

model_api = tf.keras.models.load_model('./model_api.h5')

with open('./tokenizer_model_api.json') as fp:
    data = json.load(fp)
    tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(data)
    fp.close()


def api_call(string):
    """
    Sends a POST request to a model API with the input string,
    and returns a dictionary of predicted scores for each class.

    Parameters:
    string (str): Input text to be sent to the model API.

    Returns:
    dict: A dictionary containing the HTTP method used, the request string,
    and a nested dictionary with keys for each sentiment class and their
    corresponding predicted score.

    """
    pediction = model_api.predict(
        tf.keras.preprocessing.sequence.pad_sequences(
            tokenizer.texts_to_sequences([string]),
            maxlen=250,
            truncating='post'
        ),
    verbose=0)
    return {
    'method' : 'POST',
    'request' : f'{string}',
    'response': {
        'negative_score': f'{pediction[0][0]}',
        'neutral_score': f'{pediction[0][1]}',
        'positive_score': f'{pediction[0][2]}',
        }
    }


Let us assume that our victim has a model we can call via an API. This particular model is a sentiment classifier that we would like to clone. As an attacker, we do not have a large budget (i.e., we must limit the number of calls we make to the model/API), and we do not have a database of hundreds of thousands of labeled examples (if we did, we probably wouldn't need to be cloning this model).

In [6]:
import pprint

# 'this explanation is really bad'
# 'i did not like this tutorial 2/10'
# 'this tutorial is garbage i wont my money back'
# 'is nice to see philosophers doing machine learning'
# 'this is a great and wonderful example of nlp'
# 'this tutorial is great one of the best tutorials ever made'

request = 'i did not like this tutorial 2/10'

api_response = api_call(request)

pprint.pprint(api_response)


{'method': 'POST',
 'request': 'i did not like this tutorial 2/10',
 'response': {'negative_score': '0.971930980682373',
              'neutral_score': '0.023702843114733696',
              'positive_score': '0.004366150125861168'}}


The model looks good, and we want to clone it.

How should an attacker proceed?

First, we need data to write only some of our initial samples by hand. This tutorial will start with an initial database containing 3000 unlabeled samples (not enough to train a good sentiment classifier).

This is our [**proto_dataset**](https://huggingface.co/datasets/AiresPucrs/proto-dataset).

We will be pretending to perform a black-box attack, which means we have no access to the model's parameters/gradient/architecture (to us, it is just something that produces outputs after receiving inputs). However, we can use these outputs to classify our proto_dataset. Thus, information about the target model will (indirectly) be passed to our samples. We are stealing this model's predictive power to replicate it later.


In [7]:
!pip install datasets -q

import numpy as np
import pandas as pd
from datasets import load_dataset


dataset = load_dataset('AiresPucrs/proto-dataset', split='train')
df = dataset.to_pandas()

def api_inference_call(string):
    """
    Calls an API to classify the sentiment of a given string,
    and returns the result as a numpy array.

    Parameters:
    -----------
    string : str
        The input string to be classified.

    Returns:
    --------
    numpy.ndarray
        An array of shape (3,) containing the scores for negative,
        neutral, and positive sentiment, in that order.
    """
    api_response = api_call(string)
    return np.array([float(item) for item in list(api_response['response'].values())])

df['proba'] = df.text.apply(api_inference_call)
df['sentiment'] = df.proba.apply(np.argmax)

display(df)

README.md:   0%|          | 0.00/1.37k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/246k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2999 [00:00<?, ? examples/s]

Unnamed: 0,text,proba,sentiment
0,when modi promised “minimum government maximum...,"[0.9994592070579529, 0.0004880620981566608, 5....",0
1,vote such party and leadershipwho can take fas...,"[0.02749091014266014, 0.1769004464149475, 0.79...",2
2,didn’ write chowkidar does mean ’ anti modi tr...,"[0.04485126957297325, 0.9484665393829346, 0.00...",1
3,with firm belief the leadership shri narendra ...,"[0.05731965973973274, 0.9352816939353943, 0.00...",1
4,sultanpur uttar pradesh loksabha candidate sel...,"[0.9963728189468384, 0.002949758665636182, 0.0...",0
...,...,...,...
2994,thats not true many are big supporters bjp and...,"[0.9744218587875366, 0.02039865218102932, 0.00...",0
2995,all the name advertising then telling voters w...,"[0.9355488419532776, 0.056234195828437805, 0.0...",0
2996,11was congress making defence strong ask urslf...,"[0.9995306730270386, 0.0002866442082449794, 0....",0
2997,when will show the real then can gain votes wo...,"[0.020042432472109795, 0.9782055020332336, 0.0...",1


We currently have a small dataset labeled by the target model. If this model is indeed effective—after all, why else would we want to clone it?—then our samples have been accurately classified.

Next, we need to "multiply our data." We are operating under the assumption that the attacker does not possess a large initial database and that classifying 30,000 samples using the target model's API is impractical due to either cost or other limitations.

Data augmentation is a machine learning technique designed to increase the size and diversity of a dataset by applying transformations or modifications to existing data samples. The primary goal of data augmentation is to enhance the robustness and generalization capabilities of machine learning models by expanding the amount of available training data.

One tool we can utilize for this purpose is [TextAttack](https://github.com/QData/TextAttack), a Python framework for adversarial attacks, training, and NLP data augmentation.

At this stage, we are particularly interested in the data augmentation features offered by TextAttack.

In [8]:
from textattack.augmentation import Augmenter
from textattack.transformations import CompositeTransformation, WordInsertionRandomSynonym, WordSwapContract
from textattack.constraints.pre_transformation import RepeatModification, StopwordModification

transformation = CompositeTransformation(
    [WordInsertionRandomSynonym(), WordSwapContract()])
constraints = [RepeatModification(), StopwordModification()]

aug = Augmenter(transformation=transformation,
                constraints=constraints,
                pct_words_to_swap=0.5,
                transformations_per_example=10)

request = df.text[1058]
aug_request = aug.augment(request)
for i, generated_data in enumerate(aug_request):
    print(f'Augmented Sample {i+1}: {generated_data}\n')


textattack: Updating TextAttack package dependencies.
textattack: Downloading NLTK required packages.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw to /root/nltk_data...
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

Augmented Sample 1: ingest modi teli obc because his upbringing and thither mindset can chowkidar outlook had there been his kids constitute they too consume would have become chowkidar also gatekeeper

Augmented Sample 2: jolly modi thither teli obc because his upbringing and birth mindset can chowkidar thither had there been his there kids Kid they too would have become chowkidar gatekeeper

Augmented Sample 3: kid modi ostiarius teli obc because his ostiary upbringing and equal mindset can chowkidar had there been his tin kids they porter too would have become chowkidar gatekeeper

Augmented Sample 4: modi canful teli obc because besides his get upbringing and stimulate mindset can chowkidar can had can there been his kids they too would have become chowkidar gatekeeper

Augmented Sample 5: modi minor teli obc because thither his upbringing fostering and mindset can chowkidar had there been porter his fosterage kids they too consume would have become chowkidar gatekeeper

Augmented 

Now, for each labeled sample in our initical **proto_dataset**, we will generate $10$ augmented copies.


In [None]:
labels = []
generated_sentences = []

for i in range(len(df)):
    if i % 250 == 0:
        print(f'{i} samples augmented ...')
    if i % len(df) == 0 and i != 0:
        print(f'{i} samples augmented. Augmentation Complete.')
    request = df.text[i]
    label = df.sentiment[i]
    aug_request = aug.augment(request)
    for generated_data in aug_request:
        generated_sentences.append(generated_data)
        labels.append(label)

data = {'text': generated_sentences,
        'sentiment': labels}

generated_data = pd.DataFrame(data)
generated_data.to_csv('final_dataset.csv', index=False)

0 samples augmented ...
250 samples augmented ...
500 samples augmented ...
750 samples augmented ...
1000 samples augmented ...
1250 samples augmented ...
1500 samples augmented ...
1750 samples augmented ...
2000 samples augmented ...
2250 samples augmented ...
2500 samples augmented ...
2750 samples augmented ...
3000 samples augmented ...
3250 samples augmented ...
3500 samples augmented ...
3750 samples augmented ...
4000 samples augmented ...
4250 samples augmented ...
4500 samples augmented ...
4750 samples augmented ...


We repeat this process twice, wherein the second time, we increase the percentage of words to be changed in each sentence (`pct_words_to_swap=0.8`), including the `WordSwapQWERTY` transformation, to simulate common typing errors. We eliminate duplicates by arriving at a `dataset_final` with $59258$ samples. Any imbalance in the distribution of samples across classes is just a _mirror image of the biases of the original model_ (e.g., most of the samples classified in the `proto_dataset` have the label _"negative sentiment"_). The total time for creating this dataset was $5$ hours.

Also, given the way that the API delivers model outputs (it gives us the _probability distribution_ of the victim's model `softmax` function), more information can be extracted. For example, we could [recover the model's logits from its probability predictions to approximate gradients](https://arxiv.org/abs/2011.14779). However, in this notebook/toy example, we will limit ourselves to the vanilla version of this attack.

In [None]:
df = pd.read_csv('final_dataset.csv')
display(df)

# Or, simply download the final dataset from the Hub
#from datasets import load_dataset
#dataset = load_dataset("AiresPucrs/final-text-dataset", split='train')
#dataset.to_pandas()

Unnamed: 0,text,sentiment
0,State when sodi Dromised “start minimur goveJn...,0
1,behave wBen need modo promised “whC minimum as...,0
2,tabernacle non when modi promised “minimum gov...,0
3,when mMdi promiseO “minimum gBvernment maximum...,0
4,when mbdi promised “minimum government maximum...,0
...,...,...
59253,kamre modi brand stiff remove marque first let...,0
59254,kamre modi stay brand remove corpse first miss...,0
59255,missive kamre modi brand remove offset first l...,0
59256,take kamre modi brand remove first absent lett...,0


Now we can train our own surrogate model.


In [None]:
!pip install keras-preprocessing -q

import io
import json
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from keras_preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer, tokenizer_from_json


df = pd.read_csv('./final_dataset.csv')

vocab_size = 5000
embed_size = 128
sequence_length = 250

# If we knew the original tokenizer, we could use it instead!
tokenizer = Tokenizer(num_words=vocab_size,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=" ",
                      oov_token="<OOV>")

tokenizer.fit_on_texts(df.text)
tokenizer_json = tokenizer.to_json()

with io.open('tokenizer_surrogate_model.json', 'w', encoding='utf-8') as fp:
    fp.write(json.dumps(tokenizer_json, ensure_ascii=False))
    fp.close()


x_train, x_test, y_train, y_test = train_test_split(
    df.text, df.sentiment, test_size=0.2, random_state=42)

x_train = pad_sequences(
    tokenizer.texts_to_sequences(x_train),
    maxlen=sequence_length,
    truncating='post')
x_test = pad_sequences(
    tokenizer.texts_to_sequences(x_test),
    maxlen=sequence_length,
    truncating='post')
y_train = np.array(y_train).astype(float)
y_test = np.array(y_test).astype(float)


inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=sequence_length)(inputs)
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)


model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()

callbacks = [tf.keras.callbacks.ModelCheckpoint("surrogate_model.h5",
                                                save_best_only=True),
            tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                            patience=3,
                                            verbose=1,
                                            mode="auto",
                                            baseline=None,
                                            restore_best_weights=True)]

model.fit(x_train,
          y_train,
          epochs=20,
          validation_split=0.2,
          callbacks=callbacks,
          verbose=1)

test_loss_score, test_acc_score = model.evaluate(x_test, y_test)

print(f'Final Loss: {round(test_loss_score, 2)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')


[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/42.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m30.7/42.6 kB[0m [31m779.4 kB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.6/42.6 kB[0m [31m700.6 kB/s[0m eta [36m0:00:00[0m
[?25hVersion:  2.15.0
Eager mode:  True
GPU is available
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 128)         640000    
                                                                 
 bidirectional (Bidirection  (None, None, 128)         98816     
 al)                                                             
                       

  saving_api.save_model(


Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 5: early stopping
Final Loss: 0.11.
Final Performance: 96.22 %.


In real-life situations, comparing the original model with our clone would be impossible. However, since this is just a toy example, we have the opportunity to do so! 🙃

For this comparison, we will use a test datset that both models have not encountered before. Our dataset can be downloaded directly from the Hub.

Additionally, you can download our trained model, `surrogate_model.h5`, along with the `tokenizer_surrogate_model.json` file from the Hub.

In [14]:
!pip install datasets -q

from datasets import load_dataset
from huggingface_hub import hf_hub_download
from keras_preprocessing.sequence import pad_sequences

# Download the surrogate model
hf_hub_download(repo_id="AiresPucrs/surrogate-model",
                filename="surrogate_model.h5",
                local_dir="./",
                repo_type="model"
                )

# Download the surrogate tokenizer file
hf_hub_download(repo_id="AiresPucrs/surrogate-model",
                filename="tokenizer_surrogate_model.json",
                local_dir="./",
                repo_type="model"
                )

# load the dataset from the hub
dataset = load_dataset("AiresPucrs/compare-models", split='train')
test_dataset = dataset.to_pandas()

model_api = tf.keras.models.load_model('./model_api.h5')


with open('./tokenizer_model_api.json') as fp:
    data = json.load(fp)
    tokenizer_api = tf.keras.preprocessing.text.tokenizer_from_json(data)
    fp.close()

surrogate_model = tf.keras.models.load_model('./surrogate_model.h5')

with open('./tokenizer_surrogate_model.json') as fp:
    data = json.load(fp)
    tokenizer_surrogate = tf.keras.preprocessing.text.tokenizer_from_json(data)
    word_index_surrogate = tokenizer_surrogate.word_index
    fp.close()

x = pad_sequences(
    tokenizer_api.texts_to_sequences(test_dataset.text),
    maxlen=250, truncating='post')
y = np.array(test_dataset.sentiment).astype(float)

_, test_acc_score = model_api.evaluate(x, y)

print(f'\nAccuracy of the API MODEL: {round(test_acc_score * 100, 2)} %.\n')

x = pad_sequences(
    tokenizer_surrogate.texts_to_sequences(test_dataset.text),
    maxlen=280, truncating='post')

_, test_acc_score = surrogate_model.evaluate(x, y)

print(
    f'\nAccuracy of the SURROGATE MODEL: {round(test_acc_score * 100, 2)} %.\n')


Accuracy of the API MODEL: 79.37 %.


Accuracy of the SURROGATE MODEL: 65.03 %.



~ $14\%$ is less accurate than the original, but still better than a random model! Now, let us put our `surrogate_model` into production.


In [15]:
import pprint

def surrogate_api_call(string):
    pediction = surrogate_model.predict(
        tf.keras.preprocessing.sequence.pad_sequences(
            tokenizer.texts_to_sequences([string]),
            maxlen=250,
            truncating='post'
        ),
    verbose=0)
    return {
    'method' : 'POST',
    'request' : f'{string}',
    'response': {
        'negative_score': f'{pediction[0][0]}',
        'neutral_score': f'{pediction[0][1]}',
        'positive_score': f'{pediction[0][2]}',
        }
    }

request = 'i did not like this tutorial 2/10'

api_response = api_call(request)

pprint.pprint(api_response)


{'method': 'POST',
 'request': 'i did not like this tutorial 2/10',
 'response': {'negative_score': '0.971930980682373',
              'neutral_score': '0.023702843114733696',
              'positive_score': '0.004366150125861168'}}


Model extraction attacks pose a threat to intellectual property and privacy. The availability of a model in the cloud, whether as a service or API, must be carefully architected by developers if they do not want to fall victim to this kind of attack. 🐱‍💻

To mitigate the risk of model extraction attacks, it is important to implement robust security measures, such as data encryption, access controls, and tamper-proofing techniques, to protect the confidentiality and integrity of ML models.

For more information on the subject, check the literature listed below:

- [A Framework for Understanding Model Extraction Attack and Defense](https://arxiv.org/abs/2206.11480).
- [Increasing the Cost of Model Extraction with Calibrated Proof of Work](https://arxiv.org/abs/2201.09243).
- [Data-Free Model Extraction](https://arxiv.org/abs/2011.14779).
- [MEGEX: Data-Free Model Extraction Attack against Gradient-Based Explainable AI](https://arxiv.org/abs/2107.08909).
- [DeepSteal: Advanced Model Extractions Leveraging Efficient Weight Stealing in Memories](https://arxiv.org/abs/2111.04625).
- [Model Extraction and Defenses on Generative Adversarial Networks](https://arxiv.org/abs/2101.02069).

---

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).