# _Cloning_ Language Models with Data Augmentation via Textattack

Return to the [castle](https://github.com/Nkluge-correa/TeenyTinyCastle).

Adversarial machine learning (`AML`) is a subfield of machine learning that focuses on developing algorithms and techniques that can withstand and respond to adversarial attacks. 

Adversarial attacks are a type of cyber attack where an attacker deliberately manipulates data inputs to ML models to cause them to produce incorrect outputs. 

`AML` aims to improve the robustness and security of ML models by identifying vulnerabilities and developing countermeasures to mitigate the impact of adversarial attacks. A range of techniques has been developed for `AML`, including `adversarial training` (_training models on adversarial examples_), and `defensive distillation` (_creating a distilled version of a model that is resistant to adversarial attacks_).

`AML` is an active area of research, as ML models continue to be deployed in a wide range of applications where they may be vulnerable to attack.

In this notebook, we will explore a type of attack called `model extraction` (_cloning_). But _what is model extraction?_

![extraction](https://vitalab.github.io/article/images/stealml/fig1.jpeg)

[Source](https://vitalab.github.io/article/2019/11/21/stealml.html)

`Model extraction` is a type of cyber attack involving an attacker attempting to extract the details of a machine learning model trained by someone else. This can allow the attacker to create a copy of the model or use its insights to develop their ML model. 

These attacks typically involve a process of reverse engineering the model, which can be achieved through techniques such as querying the model with specific inputs or observing its responses to a set of test data. 

`Model extraction` attacks can be particularly damaging when the ML model is used to process sensitive or confidential data, such as personal information or financial data, as the attacker may be able to use the extracted model to gain unauthorized access to this data. 

If you do not want to train the models, you can load the trained versions in the cell below. But first, you need to download them (instructions in the `models` folder.)

Let us begin our exposition with _an Unprotected Model ..._

In [2]:
import json
import tensorflow as tf
from keras.preprocessing.text import Tokenizer, tokenizer_from_json

model_api = tf.keras.models.load_model('models/model_api.h5')

with open('models/tokenizer_model_api.json') as fp:
    data = json.load(fp)
    tokenizer = tokenizer_from_json(data)
    fp.close()


def api_call(string):
    """
    Sends a POST request to a model API with the input string, 
    and returns a dictionary of predicted scores for each class.

    Parameters:
    string (str): Input text to be sent to the model API.

    Returns:
    dict: A dictionary containing the HTTP method used, the request string, 
    and a nested dictionary with keys for each sentiment class and their 
    corresponding predicted score.

    """
    pediction = model_api.predict(
        tf.keras.preprocessing.sequence.pad_sequences(
            tokenizer.texts_to_sequences([string]),
            maxlen=250,
            truncating='post'
        ),
    verbose=0)
    return {
    'method' : 'POST',
    'request' : f'{string}',
    'response': {
        'negative_score': f'{pediction[0][0]}',
        'neutral_score': f'{pediction[0][1]}',
        'positive_score': f'{pediction[0][2]}',
        }
    }


Let us assume that our victim has a model we can call via an API. This particular model is a _sentiment classifier_ (a.k.a. a language model) that we would like to clone. As an attacker, _we do not have a large budget_ (i.e., we must limit the number of calls we make to the model/API), and _we do not have a database of hundreds of thousands of labeled examples_ (if we did, we probably wouldn't need to be cloning this model).

In [3]:
import pprint

# 'this explanation is really bad'
# 'i did not like this tutorial 2/10'
# 'this tutorial is garbage i wont my money back'
# 'is nice to see philosophers doing machine learning'
# 'this is a great and wonderful example of nlp'
# 'this tutorial is great one of the best tutorials ever made'

request = 'i did not like this tutorial 2/10'

api_response = api_call(request)

pprint.pprint(api_response)


{'method': 'POST',
 'request': 'i did not like this tutorial 2/10',
 'response': {'negative_score': '0.971930980682373',
              'neutral_score': '0.0237028319388628',
              'positive_score': '0.004366150591522455'}}


The model looks good, and we want to clone it.

_How should an attacker proceed?_

First, we need some data if we don't want to write all our initial samples by hand. Via [web scrapping](https://github.com/Nkluge-correa/TeenyTinyCastle/blob/bbe9c0a77499fa68de7c6d53bf5ef7e0b43a25e0/ML-Explainability/NLP%20(en)/scrape_en.ipynb) or through public data repositories (e.g., [Kaggle](https://www.kaggle.com/)) we were able to assemble an initial database containing 3000 unlabeled samples (_not enough to train a good sentiment classifier_).

This is our `proto_dataset.csv`.

This is a _black-box attack_, which means that we have no access to the model's _parameters/gradient/architecture_ (to us, it is just something that produces outputs after receiving inputs). However, we can use these outputs to classify our `proto_dataset`. Thus, information about the target model will (indirectly) be passed to our samples. We are stealing this model's predictive power, to try to replicate later.


In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/proto_dataset.csv')

def api_inference_call(string):
    """
    Calls an API to classify the sentiment of a given string, 
    and returns the result as a numpy array.

    Parameters:
    -----------
    string : str
        The input string to be classified.

    Returns:
    --------
    numpy.ndarray
        An array of shape (3,) containing the scores for negative, 
        neutral, and positive sentiment, in that order.
    """
    api_response = api_call(string)
    return np.array([float(item) for item in list(api_response['response'].values())]) 

df['proba'] = df.text.apply(api_inference_call)
df['sentiment'] = df.proba.apply(np.argmax)

display(df)

Unnamed: 0,text,proba,sentiment
0,when modi promised “minimum government maximum...,"[0.9994592070579529, 0.0004880616324953735, 5....",0
1,vote such party and leadershipwho can take fas...,"[0.027490884065628052, 0.17690037190914154, 0....",2
2,didn’ write chowkidar does mean ’ anti modi tr...,"[0.044851239770650864, 0.9484665393829346, 0.0...",1
3,with firm belief the leadership shri narendra ...,"[0.05731959268450737, 0.9352818131446838, 0.00...",1
4,sultanpur uttar pradesh loksabha candidate sel...,"[0.9963728189468384, 0.0029497554060071707, 0....",0
...,...,...,...
2994,thats not true many are big supporters bjp and...,"[0.9744218587875366, 0.02039860561490059, 0.00...",0
2995,all the name advertising then telling voters w...,"[0.9355488419532776, 0.05623412877321243, 0.00...",0
2996,11was congress making defence strong ask urslf...,"[0.9995306730270386, 0.0002866439172066748, 0....",0
2997,when will show the real then can gain votes wo...,"[0.020042387768626213, 0.9782055020332336, 0.0...",1


Classifying our `proto_dataset` will vary according to the constraints imposed by our victim API (e.g., _cost per call, the limit of calls per minute, etc._). We now have a (_small_) dataset labeled by the target model. And if this model is indeed good (_why else would we want to clone it_), our samples have been accurately classified.

Now we need to "multiply our data". We are assuming that the attacker does not have a large initial database, and it is not feasible to classify 30000 samples using the API of the target model (either by price or other restrictions).

`Data augmentation` is a machine learning technique to increase a dataset's size and diversity by applying transformations or modifications to existing data samples. Data augmentation aims to improve the robustness and generalization ability of ML models by increasing the amount of training data available to them.

This methodology can be particularly useful in applications where data is limited or expensive to collect, such as computer vision and natural language processing. Examples of data augmentation techniques include image cropping, rotation, and flipping in computer vision, text paraphrasing, word substitution, and spelling correction in natural language processing.

[TextAttack](https://github.com/QData/TextAttack) is a Python framework for adversarial attacks, training, and NLP data augmentation.

The part of TextAttack that interests us right now is its _data augmentation part_. Below, we list some of the many ready-made augmentation classes from this library:

> Transforation tools $\rightarrow$ text transformations implemented (e.g., _swaping words, like names and places_) used to create an `Augmenter` object.

- `CompositeTransformation`: used to combine multiple transformations.
- `WordInsertionRandomSynonym`: inserts synonyms of words already in the sequence.
- `WordInsertionMaskedLM`: generate potential insertion for a word using a masked language model.
- `WordSwapHowNet`: transforms an input by replacing its words with synonyms in the stored synonyms bank generated by the OpenHowNet (needs a python version > 3.8.1).
- `WordSwapEmbedding`: transforms an input by replacing its words with synonyms in the word embedding space.
- `WordSwapHomoglyphSwap`: transforms an input by replacing its words with visually similar words using homoglyph swaps.
- `WordSwapQWERTY`: common misspellings related to the QWERTY keyboard style.
- `WordSwapContract`: transforms an input by performing contraction on recognized combinations.
- `WordSwapChangeLocation`: changes a location described in the text (e.g., Brazil -> Argentina).
- `WordSwapChangeNumber`: changes a number mentioned in the text (e.g., 7 -> 13).
- `WordSwapChangeName`: changes a name mentioned in the text (e.g., Alice -> Bob).
- `WordSwapInflections`: transforms an input by replacing its words with their inflections.
- `WordSwapMaskedLM` generates potential replacements for a word using a masked language model.
- `WordSwapRandomCharacterDeletion`: transforms an input by deleting its characters (`random_one=True, skip_first_char=True, skip_last_char=True` works well!).
- `WordSwapRandomCharacterInsertion`: transforms an input by inserting a random character (`random_one=True, skip_first_char=True, skip_last_char=True` works well!).
- `WordSwapRandomCharacterSubstitution` transforms an input by replacing one character in a word with a random new character.

> Constraints $\rightarrow$ constraints determine whether or not a given augmentation is valid, consequently enhancing the quality of the augmentations.

- `RepeatModification`: a constraint disallowing the modification of previously modified words.
- `StopwordModification`: a constraint disallowing the modification of stopwords.

> Augmentation parameters $\rightarrow$ control parameters of the augmenting object.

- `pct_words_to_swap`: percentage of words to swap per augmented example. The default is set to 0.1 (10%).
- `transformations_per_example`: maximum number of augmentations per input. The default is set to 1 (one augmented sentence given one original input)

> Ready Recipes $\rightarrow$ in addition to creating your own augmenter, you could also use pre-built augmentation recipes. These [recipes are implemented from published papers](https://textattack.readthedocs.io/en/latest/3recipes/augmenter_recipes.html) and are very convenient to use.

- `CheckListAugmenter`: augments words by using the transformation methods provided by CheckList INV testing, which combines Name Replacement, Location Replacement, Number Alteration, and Contraction/Extension.
- `WordNetAugmenter`: another pre-made augmentation recipe (`high_yield=True, enable_advanced_metrics=True` works well!).

In [5]:
from textattack.augmentation import Augmenter
from textattack.transformations import CompositeTransformation, WordInsertionRandomSynonym, WordSwapContract
from textattack.constraints.pre_transformation import RepeatModification, StopwordModification

transformation = CompositeTransformation(
    [WordInsertionRandomSynonym(), WordSwapContract()])
constraints = [RepeatModification(), StopwordModification()]

aug = Augmenter(transformation=transformation,
                constraints=constraints,
                pct_words_to_swap=0.5,
                transformations_per_example=10)

request = df.text[1058]
aug_request = aug.augment(request)
for i, generated_data in enumerate(aug_request):
    print(f'Augmented Sample {i+1}: {generated_data}\n')


Augmented Sample 1: also modi teli obc because his toilet upbringing and have mindset brain can chowkidar had there been his mentality kids besides they too would have become chowkidar gatekeeper

Augmented Sample 2: modi equal teli obc fry because his upbringing lav and mindset can chowkidar there had there been his kids they too would have become minor chowkidar thither gatekeeper

Augmented Sample 3: modi josh teli obc because nipper his upbringing and mindset can chowkidar thither had fry there been his kids they constitute too would bear have become chowkidar gatekeeper

Augmented Sample 4: modi pot teli thither obc raising because his upbringing and mindset can chowkidar dope had there been his also kids they too would have there become chowkidar gatekeeper

Augmented Sample 5: modi teli Kid obc because minor his upbringing and mindset can chowkidar had there been his kids they Kyd too would have suffer become go chowkidar canful gatekeeper

Augmented Sample 6: modi teli nurture 

For each labeled sample in our `proto_dataset`, we will generate $10$ augmented copies.


In [10]:
labels = []
generated_sentences = []

for i in range(len(df)):
    if i % 250 == 0:
        print(f'{i} samples augmented ...')
    if i % len(df) == 0 and i != 0:
        print(f'{i} samples augmented. Augmentation Complete.')
    request = df.text[i]
    label = df.sentiment[i]
    aug_request = aug.augment(request)
    for generated_data in aug_request:
        generated_sentences.append(generated_data)
        labels.append(label)

data = {'text': generated_sentences,
        'sentiment': labels}

generated_data = pd.DataFrame(data)
generated_data.to_csv('data/final_dataset.csv', idenx=False)


0 samples augmented ...
250 samples augmented ...
500 samples augmented ...
750 samples augmented ...
1000 samples augmented ...
1250 samples augmented ...
1500 samples augmented ...
1750 samples augmented ...
2000 samples augmented ...
2250 samples augmented ...
2500 samples augmented ...
2750 samples augmented ...



We repeat this process twice, wherein the second time, we increase the percentage of words to be changed in each sentence (`pct_words_to_swap=0.8`), including the `WordSwapQWERTY` transformation, to simulate common typing errors. We eliminate duplicates by arriving at a `dataset_final` with $59258$ samples. Any imbalance in the distribution of samples across classes is just a _mirror image of the biases of the original model_ (e.g., most of the samples classified in the `proto_dataset` have the label _"negative sentiment"_). The total time for creating this dataset was $5$ hours.

Also, given the way that the API delivers model outputs (it gives us the _probability distribution_ of the victim's model `softmax` function), more information can be extracted. For example, we could [recover the model's logits from its probability predictions to approximate gradients](https://arxiv.org/abs/2011.14779). However, in this notebook/toy example, we will limit ourselves to the vanilla version of this attack.

In [11]:
df = pd.read_csv('data/final_dataset.csv')
display(df)


Unnamed: 0,text,sentiment
0,State when sodi Dromised “start minimur goveJn...,0
1,behave wBen need modo promised “whC minimum as...,0
2,tabernacle non when modi promised “minimum gov...,0
3,when mMdi promiseO “minimum gBvernment maximum...,0
4,when mbdi promised “minimum government maximum...,0
...,...,...
59253,kamre modi brand stiff remove marque first let...,0
59254,kamre modi stay brand remove corpse first miss...,0
59255,missive kamre modi brand remove offset first l...,0
59256,take kamre modi brand remove first absent lett...,0


Now we can train our surrogate model the _old fashion_.

- Load & Split the `dataset`;
- Build & Save the `tokenizer`;
- Train the `surrogate_model`.


In [13]:
import io
import json
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from keras_preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer, tokenizer_from_json


df = pd.read_csv('data/final_dataset.csv')

vocab_size = 5000
embed_size = 128
sequence_length = 250

tokenizer = Tokenizer(num_words=vocab_size,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=" ",
                      oov_token="<OOV>")

tokenizer.fit_on_texts(df.text)
tokenizer_json = tokenizer.to_json()

with io.open('models/tokenizer_surrogate_model.json', 'w', encoding='utf-8') as fp:
    fp.write(json.dumps(tokenizer_json, ensure_ascii=False))
    fp.close()


x_train, x_test, y_train, y_test = train_test_split(
    df.text, df.sentiment, test_size=0.2, random_state=42)

x_train = pad_sequences(
    tokenizer.texts_to_sequences(x_train), 
    maxlen=sequence_length, 
    truncating='post')
x_test = pad_sequences(
    tokenizer.texts_to_sequences(x_test), 
    maxlen=sequence_length, 
    truncating='post')
y_train = np.array(y_train).astype(float)
y_test = np.array(y_test).astype(float)


inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=sequence_length)(inputs)
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)


model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()

callbacks = [tf.keras.callbacks.ModelCheckpoint("models/surrogate_model.h5",
                                                save_best_only=True),
            tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                            patience=3,
                                            verbose=1,
                                            mode="auto",
                                            baseline=None,
                                            restore_best_weights=True)]

model.fit(x_train,
          y_train,
          epochs=20,
          validation_split=0.2,
          callbacks=callbacks,
          verbose=1)

test_loss_score, test_acc_score = model.evaluate(x_test, y_test)

print(f'Final Loss: {round(test_loss_score, 2)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')


Version:  2.10.1
Eager mode:  True
GPU is available
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 128)         640000    
                                                                 
 bidirectional (Bidirectiona  (None, None, 128)        98816     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 3)                 387       
                                                                 
Total par

In real-life situations, we could not do a comparison test between the original model and our clone. But since this is just a toy example, we can! 🙃

For this, we are using a test database not seen by both models.


In [18]:

test_dataset = pd.read_csv('data/compare_models_dataset.csv')

model_api = tf.keras.models.load_model('models/model_api.h5')

with open('models/tokenizer_model_api.json') as fp:
    data = json.load(fp)
    tokenizer_api = tokenizer_from_json(data)
    fp.close()

surrogate_model = tf.keras.models.load_model('models/surrogate_model.h5')

with open('models/tokenizer_surrogate_model.json') as fp:
    data = json.load(fp)
    tokenizer_surrogate = tokenizer_from_json(data)
    word_index_surrogate = tokenizer_surrogate.word_index
    fp.close()

x = pad_sequences(
    tokenizer_api.texts_to_sequences(test_dataset.text), 
    maxlen=250, truncating='post')
y = np.array(test_dataset.sentiment).astype(float)

_, test_acc_score = model_api.evaluate(x, y)

print(f'\nAccuracy of the API MODEL: {round(test_acc_score * 100, 2)} %.\n')

x = pad_sequences(
    tokenizer_surrogate.texts_to_sequences(test_dataset.text), 
    maxlen=280, truncating='post')

_, test_acc_score = surrogate_model.evaluate(x, y)

print(
    f'\nAccuracy of the SURROGATE MODEL: {round(test_acc_score * 100, 2)} %.\n')



Accuracy of the API MODEL: 79.37 %.


Accuracy of the SURROGATE MODEL: 65.03 %.



~ $14\%$ is less accurate than this benchmark's original model but still valid. Architecture changes and database augmentation can improve the performance of our `surrogate_model`. We now have our own language model for sentiment classification, and we have spent not even $10%$ of what was invested in creating the original model (_supposedly_).

Now, let us put our `surrogate_model` into production.


In [19]:
import pprint

def surrogate_api_call(string):
    pediction = surrogate_model.predict(
        tf.keras.preprocessing.sequence.pad_sequences(
            tokenizer.texts_to_sequences([string]),
            maxlen=250,
            truncating='post'
        ),
    verbose=0)
    return {
    'method' : 'POST',
    'request' : f'{string}',
    'response': {
        'negative_score': f'{pediction[0][0]}',
        'neutral_score': f'{pediction[0][1]}',
        'positive_score': f'{pediction[0][2]}',
        }
    }

request = 'i did not like this tutorial 2/10'

api_response = api_call(request)

pprint.pprint(api_response)



{'method': 'POST',
 'request': 'i did not like this tutorial 2/10',
 'response': {'negative_score': '0.8380430340766907',
              'neutral_score': '0.12354790419340134',
              'positive_score': '0.03840905427932739'}}


Model extraction attacks pose a threat to intellectual property and privacy. The availability of a model in the cloud, whether as a service or API, must be carefully architected by developers if they do not want to fall victim to this kind of attack. 🐱‍💻

To mitigate the risk of model extraction attacks, it is important to implement robust security measures, such as data encryption, access controls, and tamper-proofing techniques, to protect the confidentiality and integrity of ML models.

For more information on the subject, check the literature listed below:

- [A Framework for Understanding Model Extraction Attack and Defense](https://arxiv.org/abs/2206.11480).
- [Increasing the Cost of Model Extraction with Calibrated Proof of Work](https://arxiv.org/abs/2201.09243).
- [Data-Free Model Extraction](https://arxiv.org/abs/2011.14779).
- [MEGEX: Data-Free Model Extraction Attack against Gradient-Based Explainable AI](https://arxiv.org/abs/2107.08909).
- [DeepSteal: Advanced Model Extractions Leveraging Efficient Weight Stealing in Memories](https://arxiv.org/abs/2111.04625).
- [Model Extraction and Defenses on Generative Adversarial Networks](https://arxiv.org/abs/2101.02069).

---

Return to the [castle](https://github.com/Nkluge-correa/TeenyTinyCastle).