# Demo of the RuTransform library

RuTransform is a Python framework for adversarial attacks and text data augmentation for Russian. It supports a wide range of text transformations.


This notebook shows how to create adversarial data using RuTransform, as well as how to customize the framework for your needs.

### See how to:
- set up the framework
- use RuTransform on TAPE data
- use RuTransform on your own data
- write your own constraints
- create your own transformations


## Set up RuTransform

In [2]:
!git clone https://github.com/RussianNLP/rutransform
%cd rutransform
!pip install .

/home/jovyan/katya/RSG/rutransform


## Example 1: Transforming a TAPE-type dataset

The **RuTransform** framework works with [TAPE](https://github.com/RussianNLP/TAPE)-like datasets out of the box. The module automatically applies the required transformation constraints and techniques; the user only needs to pass the correct `task_type` argument.

### Load modules

In [4]:
import pandas as pd

from rutransform.transformations import DatasetTransformer
from rutransform.utils.args import TransformArguments



### Load data

First, let's load the data to work with. Here we use sample data from the TAPE benchmark, namely the **WorldTree** dataset for multiple-choice question answering.

In [5]:
dataset = pd.read_json("test_data/worldtree.json", lines=True)
dataset

Unnamed: 0,question,answer,exam_name,school_grade,knowledge_type
0,Когда мороженое не кладут в морозильную камеру...,C,Virginia Standards of Learning - Science,5,"CAUSAL,EXAMPLE"
1,За сколько времени Земля совершит семь оборото...,B,NYSEDREGENTS,4,"MODEL,QUANT"
2,Студент толкает красную игрушечную машинку по ...,C,Alaska Dept. of Education & Early Development,4,MODEL
3,"Животные используют ресурсы окружающей среды, ...",B,Maryland School Assessment - Science,4,PROCESS
4,Чем похожи испарение и конденсация? (A) Оба вы...,D,North Carolina READY End-of-Grade Assessment,5,CAUSAL


### Transform dataset

The full list of transformations is given in the [rutransform repo](https://github.com/RussianNLP/rutransform/README.md/#supported-ransformations).

Alternatively, you can view the supported transformations like this:

In [6]:
from rutransform.transformations import load_transformers

transformations_list = list(load_transformers().keys())
print("\n".join(transformations_list))

bae
addsent
eda
paraphraser
style_transfer
back_translation
butter_fingers
case
emojify


Use the `DatasetTransformer` module to transform the dataset. To do this, we need to specify the transformation arguments, most importantly the `transformation` and its `probability`:

In [7]:
args = TransformArguments(transformation="butter_fingers", probability=0.3)
args

TransformArguments(transformation='butter_fingers', max_outputs=1, probability=0.3, same_prob=True, del_prob=0.05, similarity_threshold=0.8, bae_model='bert-base-multilingual-cased', segment_length=3, bin_p=1.0, generator='gpt3', prompt_text=' Парафраза:', prompt=False, num_beams=None, early_stopping=False, no_repeat_ngram_size=None, do_sample=False, temperature=None, top_k=None, top_p=None, repetition_penalty=None, threshold=None, max_length=50)
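
To give a feel for what the `probability` argument controls, here is a minimal, self-contained sketch of a butter-fingers-style perturbation. It is not RuTransform's implementation: the `NEIGHBOURS` map, the `butter_fingers_sketch` function, and the per-character interpretation of `probability` are assumptions made purely for illustration.

```python
import random

# Hypothetical, heavily truncated map of neighbouring keys on a Russian keyboard.
NEIGHBOURS = {
    "а": "впм",
    "о": "рлт",
    "м": "аси",
    "т": "иьо",
    "е": "кнп",
}


def butter_fingers_sketch(text: str, probability: float, seed: int = 42) -> str:
    """Replace each mapped character with a neighbouring key with the given probability."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    out = []
    for char in text:
        candidates = NEIGHBOURS.get(char.lower())
        if candidates and rng.random() < probability:
            out.append(rng.choice(candidates))  # simulate hitting an adjacent key
        else:
            out.append(char)  # unmapped characters are always kept
    return "".join(out)


unchanged = butter_fingers_sketch("мама мыла раму", probability=0.0)
noisy = butter_fingers_sketch("мама мыла раму", probability=0.3)
print(unchanged)  # identical to the input: nothing is perturbed at probability 0
print(noisy)  # mapped characters may be replaced by neighbouring keys
```

Under this assumed reading, a higher `probability` means more perturbed characters and, in turn, a lower similarity to the original text.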

Pass the transformation arguments to the `DatasetTransformer`, along with the data to be transformed (`dataset`), the name of the column with the text data (`text_col`), and the task type (`task_type`, e.g. *winograd, multichoice_qa, ethics, multihop, jeopardy*).

In [8]:
# uncomment to see the full list of parameters
# print(DatasetTransformer.__init__.__doc__)

In [9]:
tr = DatasetTransformer(
    dataset=dataset,
    text_col="question",
    task_type="multichoice_qa",
    args=args,
    return_type="pd",  # format of the output (default is 'hf')
)

Now we can transform the data:

In [10]:
output = tr.transform()





The `transform` method returns a named tuple with four fields (`transformed_dataset`, `score`, `scores`, and `std`):

In [11]:
# transformed data
output.transformed_dataset

Unnamed: 0,question,answer,exam_name,school_grade,knowledge_type
0,Когда мороженое не кладут в морозильную камеру...,C,Virginia Standards of Learning - Science,5,"CAUSAL,EXAMPLE"
1,За сколько времени Земля совершит семь оборото...,B,NYSEDREGENTS,4,"MODEL,QUANT"
2,Соудент оолкыет кррскпю игрудечнаю машинку по ...,C,Alaska Dept. of Education & Early Development,4,MODEL
3,"Животные используют ресурсы окружающей среды, ...",B,Maryland School Assessment - Science,4,PROCESS
4,Чеч пщхожи испарение и конденсация ? (A) Оба в...,D,North Carolina READY End-of-Grade Assessment,5,CAUSAL


In [12]:
# mean similarity score of the transformed data
output.score

0.9146944761276246

In [13]:
# sentence similarity scores
output.scores

array([0.93971652, 0.94295949, 0.8272841 , 0.98828816, 0.87522411])

In [14]:
# std of scores
output.std

0.05663837594035781
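
The relationship between `scores`, `score`, and `std` can be checked by hand. The sketch below uses only the standard library; `TransformResult` is a hypothetical stand-in for the named tuple returned by `transform`, with field names taken from the cells above. Using the population standard deviation is an assumption, chosen because it matches the values printed above.

```python
from statistics import fmean, pstdev
from typing import Any, List, NamedTuple


class TransformResult(NamedTuple):
    """Hypothetical stand-in for the object returned by `transform`."""

    transformed_dataset: Any
    scores: List[float]
    score: float
    std: float


# Per-sentence similarity scores copied from the output above.
scores = [0.93971652, 0.94295949, 0.8272841, 0.98828816, 0.87522411]

result = TransformResult(
    transformed_dataset=None,  # placeholder: the real field holds the transformed data
    scores=scores,
    score=fmean(scores),  # mean similarity over all sentences
    std=pstdev(scores),  # population standard deviation of the scores
)

print(result.score, result.std)
```

Up to rounding of the copied inputs, this reproduces the `output.score` and `output.std` values shown above.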

In [15]:
# @title Playable Demo
# @markdown Here you can play with different parameters to transform the dataset

# @markdown ## Load Data
# @markdown Choose a TAPE dataset to work with (one of `winograd`, `worldtree`, `openbook`, `per_ethics`, `sit_ethics`, `multiq`, `chegeka`). For data descriptions, refer to the [TAPE repo](https://github.com/RussianNLP/TAPE/README.md/#datasets).
dataset = "winograd"  # @param {type:"string"}
data = pd.read_json(f"test_data/{dataset}.json", lines=True)

# @markdown ## Transformation parameters
# @markdown Main:
transformation = "butter_fingers"  # @param {type:"string"}
probability = 0.3  # @param {type:"raw"}
similarity_threshold = 0.8  # @param {type:"raw"}

# @markdown Advanced parameters:
max_outputs = 1  # @param {type:"raw"}
same_prob = True  # @param {type:"boolean"}
del_prob = 0.05  # @param {type:"raw"}
bae_model = "bert-base-multilingual-cased"  # @param {type:"string"}
segment_length = 3  # @param {type:"raw"}
bin_p = 1.0  # @param {type:"raw"}
generator = "gpt3"  # @param {type:"string"}
prompt = False  # @param {type:"boolean"}
prompt_text = " Парафраза:"  # @param {type:"string"}
num_beams = None  # @param {type:"raw"}
early_stopping = False  # @param {type:"boolean"}
no_repeat_ngram_size = None  # @param {type:"raw"}
do_sample = False  # @param {type:"boolean"}
temperature = None  # @param {type:"raw"}
top_k = None  # @param {type:"raw"}
top_p = None  # @param {type:"raw"}
repetition_penalty = None  # @param {type:"raw"}
threshold = None  # @param {type:"raw"}
max_length = 50  # @param {type:"raw"}


args = TransformArguments(
    transformation=transformation,
    probability=probability,
    similarity_threshold=similarity_threshold,
    max_outputs=max_outputs,
    same_prob=same_prob,
    del_prob=del_prob,
    bae_model=bae_model,
    segment_length=segment_length,
    bin_p=bin_p,
    generator=generator,
    prompt=prompt,
    prompt_text=prompt_text,
    num_beams=num_beams,
    early_stopping=early_stopping,
    no_repeat_ngram_size=no_repeat_ngram_size,
    do_sample=do_sample,
    temperature=temperature,
    top_k=top_k,
    top_p=top_p,
    repetition_penalty=repetition_penalty,
    threshold=threshold,
    max_length=max_length,
)

# @markdown ## Transform data

return_type = "pd"  # @param {type:"string"}

TASK_TO_KEYS = {
    "per_ethics": ["text", "labels", "ethics"],
    "sit_ethics": ["text", "labels", "ethics"],
    "winograd": ["text", "label", "winograd"],
    "chegeka": ["question", "answer", "jeopardy"],
    "openbook": ["question", "answer", "multichoice_qa"],
    "worldtree": ["question", "answer", "multichoice_qa"],
    "multiq": ["main_text", "main_answers", "multihop"],
}

tr = DatasetTransformer(
    dataset=data,
    text_col=TASK_TO_KEYS[dataset][0],
    label_col=TASK_TO_KEYS[dataset][1],
    task_type=TASK_TO_KEYS[dataset][2],
    args=args,
    return_type=return_type,
)

output = tr.transform()
print("Dataset similarity score:", output.score)
print("STD:", output.std)
output.transformed_dataset



Dataset similarity score: 0.8299961090087891
STD: 0.017096589221670816


Unnamed: 0,text,answer,label,options,reference,homonymia_type
0,""" А зля госркгивтрации понадобится еолько декл...",постройке,0,"[госрегистрации, декларация, постройке, админи...",которая,1.2
1,Нм втором сесре оказалась 16-летняя атина из р...,румынии,0,"[алина, румынии, тысячи]",которая,1.1
2,""" Чрго стоиоа , нкпример , мёртвенно - блетная...",Хранительница,0,"[Морена, Хранительница, Смерти, челюстями]",которая,1.4
3,""" Ммшв , вооя пашьчиком по его лаву , стала го...",личности,1,"[Маша, одушевленности, личности]",которая,1.2
4,""" Мрфологим , скпеплявшей нацию , нужеа был на...",правда,1,"[Мифологии, нацию, легенд, правда, войне, осно...",которая,1.2


## Example 2: Transform your own data

RuTransform can easily be adapted to other tasks. To use the framework on your own data, specify the text (`text_col`) and/or target (`label_col`) column names and pass suitable constraints via the `custom_constraints` argument.

For example, let's transform the [DaNetQA](https://russiansuperglue.com/tasks/task_info/DaNetQA) data [(Shavrina et al., 2020)](https://aclanthology.org/2020.emnlp-main.381/).

### Load missing modules

In [16]:
from rutransform.constraints import NamedEntities

### Load data

In [17]:
dataset = pd.read_json("test_data/danet_qa.json", lines=True)
dataset

Unnamed: 0,question,passage,label,idx
0,Был ли у обломова сын?,"В браке с Пшеницыной у Обломова родился сын, н...",True,680
1,Должен ли цвет чехла соответствовать цвету цер...,Начиная с XV—XVI веков престолы делают либо в ...,False,1643
2,Состоит ли албания в евросоюзе?,Вступление Албании в Европейский союз — процед...,False,729
3,Был ли автомобиль принцессы дианы в дтп?,Несмотря на продолжительные реанимационные поп...,True,7
4,"Обязательно ли содержание послания, которое не...",Средство коммуникации. В своей простейшей форм...,False,1252


### Transform data

Initialize the transformation arguments:

In [18]:
transformation = "back_translation"
probability = 0.5

args = TransformArguments(transformation=transformation, probability=probability)
args

TransformArguments(transformation='back_translation', max_outputs=1, probability=0.5, same_prob=True, del_prob=0.05, similarity_threshold=0.8, bae_model='bert-base-multilingual-cased', segment_length=3, bin_p=1.0, generator='gpt3', prompt_text=' Парафраза:', prompt=False, num_beams=None, early_stopping=False, no_repeat_ngram_size=None, do_sample=False, temperature=None, top_k=None, top_p=None, repetition_penalty=None, threshold=None, max_length=50)

For this dataset, we choose to perturb the `passage` text and use the `NamedEntities` constraint to preserve proper nouns:

In [19]:
tr = DatasetTransformer(
    dataset=dataset,
    text_col="passage",
    args=args,
    return_type="pd",
    custom_constraints=[NamedEntities()],
    #     device='cuda:0'
)

In [20]:
output = tr.transform()





In [21]:
output.transformed_dataset

Unnamed: 0,question,passage,label,idx
0,Был ли у обломова сын?,"В браке с Пшеницыной у Обломова родился сын, н...",True,680
1,Должен ли цвет чехла соответствовать цвету цер...,С XV по XVI века троны изготавливаются либо в ...,False,1643
2,Состоит ли албания в евросоюзе?,Вступление Албании в Европейский союз — процед...,False,729
3,Был ли автомобиль принцессы дианы в дтп?,Несмотря на продолжительные реанимационные поп...,True,7
4,"Обязательно ли содержание послания, которое не...",Средние средства связи. В своей простой форме ...,False,1252


## Example 3: Writing your own constraints

If the provided constraints are not enough, you can create your own by subclassing `Constraint`.

For example, let's run a transformation on the [RWSD](https://russiansuperglue.com/tasks/task_info/RWSD) dataset [(Shavrina et al., 2020)](https://aclanthology.org/2020.emnlp-main.381/).

### Load data

In [22]:
dataset = pd.read_json("test_data/rwsd.json", lines=True)
dataset

Unnamed: 0,idx,target,label,text
0,253,"{'span1_text': 'статью', 'span2_text': 'читает...",False,"Сара взяла в библиотеке книгу, чтобы написать ..."
1,326,"{'span1_text': 'Фред', 'span2_text': 'он верну...",False,"Фред смотрел телевизор, пока Джордж выходил ку..."
2,377,"{'span1_text': 'печенья с шоколадной крошкой',...",False,"Всем понравились овсяные печенья, и только нек..."
3,8,"{'span1_text': 'Женя', 'span2_text': 'она полу...",True,"Женя поблагодарила Сашу за помощь, которую она..."
4,475,"{'span1_text': 'Донной', 'span2_text': 'ее сос...",True,"Лили заговорила с Донной, нарушив ее сосредото..."


### Creating a constraint

Although the RWSD task is similar to TAPE's Winograd task, its annotation differs, so we need a new constraint that preserves the antecedents and anaphoric pronouns:

In [23]:
from rutransform.constraints import Constraint
from rutransform.constraints.utils import parse_reference
from typing import List, Optional
from spacy.language import Language


class RWSDConstraint(Constraint):
    def __init__(self, target_col_name: str, reference_key: str, noun_key: str) -> None:
        super().__init__(name="rwsd_constraint")
        self.target_col_name = target_col_name
        self.reference_key = reference_key
        self.noun_key = noun_key

    def patterns(
        self, text: Optional[dict], spacy_model: Optional[Language]
    ) -> List[List[dict]]:
        morph = parse_reference(text[self.target_col_name][self.noun_key], spacy_model)
        antecedent_feats = list(morph.values())
        patterns = [
            [
                {
                    "TEXT": {
                        "IN": text[self.target_col_name][self.reference_key].split()
                        + text[self.target_col_name][self.noun_key].split()
                    }
                }
            ],
            [
                {
                    "POS": {"IN": ["NOUN", "PROPN"]},
                    "MORPH": {"IS_SUPERSET": antecedent_feats},
                }
            ],
        ]
        return patterns
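
To make the first pattern concrete, the sketch below rebuilds it in plain Python, without spaCy. The `row` dictionary is a hypothetical example in the RWSD layout, and `first_pattern` is a helper introduced here for illustration only. It mirrors just the token-list part of `patterns` and leaves out the morphological pattern, which needs `parse_reference` and a spaCy model.

```python
from typing import Dict, List


def first_pattern(
    row: dict, target_col_name: str, reference_key: str, noun_key: str
) -> List[Dict[str, dict]]:
    """Build the token-level pattern covering the pronoun span and the antecedent."""
    target = row[target_col_name]
    # Same concatenation order as in `RWSDConstraint.patterns`: reference first, noun second.
    protected_tokens = target[reference_key].split() + target[noun_key].split()
    return [{"TEXT": {"IN": protected_tokens}}]


# Hypothetical row in the RWSD format (not taken from the dataset above).
row = {
    "text": "Маша поблагодарила Дашу за помощь, которую она получила.",
    "target": {"span1_text": "Маша", "span2_text": "она получила"},
}

pattern = first_pattern(
    row, target_col_name="target", reference_key="span2_text", noun_key="span1_text"
)
print(pattern)
```

Any token whose text appears in the `IN` list matches this pattern, which is how the constraint marks the antecedent and the pronoun span as tokens to preserve.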

To use custom constraints during the transformation, pass them into the `custom_constraints` argument:

In [24]:
# load arguments
transformation = "eda"
probability = 0.3
args = TransformArguments(transformation=transformation, probability=probability)

In [25]:
# init dataset transformer
tr = DatasetTransformer(
    dataset=dataset,
    text_col="text",
    args=args,
    custom_constraints=[
        RWSDConstraint(
            target_col_name="target", reference_key="span2_text", noun_key="span1_text"
        )
    ],
    return_type="pd",
)

In [26]:
output = tr.transform()





In [27]:
output.transformed_dataset

Unnamed: 0,idx,target,label,text
0,253,"{'span1_index': 7, 'span1_text': 'статью', 'sp...",False,"Сара книгу , чтобы написать статью Она читает ..."
1,326,"{'span1_index': 0, 'span1_text': 'Фред', 'span...",False,"Фред телевизор смотрел , пока Джордж выходил к..."
2,377,"{'span1_index': 3, 'span1_text': 'печенья с шо...",False,Всем печенья и только некоторым печенья с шоко...
3,8,"{'span1_index': 0, 'span1_text': 'Женя', 'span...",True,"поблагодарила Женя Сашу за помощь , которую он..."
4,475,"{'span1_index': 3, 'span1_text': 'Донной', 'sp...",True,"Лили , нарушив ее ."


In [28]:
output.score

0.8719747543334961

## Example 4:  Transforming Sentences

All of the transformations supported by the framework can be applied not only to whole datasets but also to individual sentences.

Here is an example:

In [29]:
from rutransform.transformations import (
    SentenceAdditions,
    ButterFingersTransformation,
    EmojifyTransformation,
    ChangeCharCase,
    BackTranslationNER,
    Paraphraser,
    RandomEDA,
    BAE,
)

As before, initialize the transformation arguments; this time the `transformation` argument can be omitted:

In [30]:
args = TransformArguments(probability=0.5)

No transformation was passed.


Transform the sentence:

In [31]:
tr = SentenceAdditions(args=args)
tr.generate("мама мыла раму")



['мама мыла раму, Мама мыла раму,']

In [32]:
tr = ButterFingersTransformation(
    args=args,
)
tr.generate("мама мыла раму")

['ммаа мырв ламу']

In [41]:
tr = EmojifyTransformation(args=args)
tr.generate("мама мыла окно")

['мама мыла \U0001fa9f']

In [34]:
tr = RandomEDA(args=args)
tr.generate("мама мыла раму")

['мама']

In [35]:
tr = ChangeCharCase(args=args)
tr.generate("мама мыла раму")

['мАМА мылА РАМу']

In [36]:
tr = BackTranslationNER(args=args)
tr.generate("мама мыла раму")

['Мама мыла раму.']

In [37]:
tr = Paraphraser(args=args)
tr.generate("мама мыла раму")

['мама мыла раму в рассказе "мама мыла раму"']

In [38]:
tr = BAE(args=args)
tr.generate("мама мыла раму")

['мама р раму']

## Example 5: Custom Transformation

RuTransform lets you create your own custom transformations. Here is an example of a simple transformation that randomises word order.


First, you need to define a `SentenceOperation` subclass for the transformation, with `__init__` and `generate` methods.

Note that the method arguments must stay unchanged to remain compatible with the framework. We also define a separate function for the transformation itself to keep the code more readable.

In [1]:
import random
import spacy


def random_word_order(sentence, spacy_model, seed, max_outputs):

    """
    Randomise word order
    """

    random.seed(seed)

    if not spacy_model:
        spacy_model = spacy.load("ru_core_news_sm")

    tokens = [token.text for token in spacy_model(sentence)]

    return [" ".join(random.sample(tokens, k=len(tokens))) for _ in range(max_outputs)]



In [2]:
from rutransform.transformations.utils import SentenceOperation
from typing import Optional, List, Union, Dict


class RandomWordOrder(SentenceOperation):
    def __init__(
        self,
        args,
        seed=42,
        max_outputs=1,
        device="cpu",
        spacy_model=None,
    ):
        super().__init__(
            args=args,
            seed=seed,
            max_outputs=max_outputs,
            device=device,
            spacy_model=spacy_model,
        )

    def generate(
        self,
        sentence: str,
        stop_words: Optional[List[Union[int, str]]] = None,
        prob: Optional[float] = None,
    ) -> List[str]:

        transformed = random_word_order(
            sentence=sentence,
            seed=self.seed,
            spacy_model=self.spacy_model,
            max_outputs=self.max_outputs,
        )

        return transformed

Now the transformation is ready to use on the sentence level:

In [3]:
from rutransform.utils.args import TransformArguments

args = TransformArguments()
tr = RandomWordOrder(args=args, max_outputs=5)
tr.generate("мама мыла раму")

No transformation was passed.


['раму мама мыла',
 'раму мыла мама',
 'мама раму мыла',
 'раму мама мыла',
 'мама раму мыла']
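
One consequence of calling `random.seed(seed)` inside `random_word_order` is that repeated calls with the same seed return the same list. The standard-library sketch below illustrates this; `shuffle_words` is a hypothetical stand-in that splits on whitespace instead of using a spaCy tokenizer.

```python
import random
from typing import List


def shuffle_words(sentence: str, seed: int, max_outputs: int) -> List[str]:
    """Whitespace-based stand-in for `random_word_order` (no spaCy required)."""
    random.seed(seed)  # re-seeding on every call makes the output reproducible
    tokens = sentence.split()
    return [" ".join(random.sample(tokens, k=len(tokens))) for _ in range(max_outputs)]


first = shuffle_words("мама мыла раму", seed=42, max_outputs=5)
second = shuffle_words("мама мыла раму", seed=42, max_outputs=5)

print(first == second)  # True: same seed, same sequence of permutations
```

If you need different permutations on each call, one option is to vary the seed between calls.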

After creating the transformation, you can add it to an existing Transformer by inheriting from its class and overriding the `transform_info` method:

In [4]:
from rutransform.transformations import EDATransformer, RandomEDA


class EDATransformer(EDATransformer):
    def __init__(
        self,
        transformations: List[str],
        task_type: str,
        args: TransformArguments,
        text_col: Optional[str] = "text",
        label_col: Optional[str] = "label",
        seed: int = 42,
        device: str = "cpu",
        constraints=None,
    ) -> None:

        super().__init__(
            transformations=transformations,
            task_type=task_type,
            args=args,
            text_col=text_col,
            label_col=label_col,
            seed=seed,
            device=device,
            constraints=constraints,
        )

    @staticmethod
    def transform_info() -> Dict[str, Optional[SentenceOperation]]:

        info = {"eda": RandomEDA, "word_order": RandomWordOrder}

        return info

...or create a Transformer from scratch by inheriting the `Transformer` class and defining several functions:

- `transform_info`: a staticmethod, must return a dictionary {transformation name: corresponding SentenceOperation class}. It is used to load the list of all the available transformations
- `_apply_transformation`: a function that applies the transformations to text until the transformed text passes the similarity threshold and returns a list of transformed texts and their similarity scores
- `transform` (optional): a function that takes a sentence as input and transforms it
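
The contract described for `_apply_transformation` can be sketched independently of the framework. The example below is only an illustration under stated assumptions: `apply_until_similar`, its parameters, the retry limit, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical and are not taken from RuTransform, which computes its own similarity scores.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def apply_until_similar(
    text: str,
    transform: Callable[[str, random.Random], str],
    similarity_threshold: float = 0.8,
    max_tries: int = 10,  # hypothetical safeguard against endless retries
    seed: int = 42,
) -> Tuple[List[str], List[float]]:
    """Retry `transform` until the result is similar enough to the original text."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = transform(text, rng)
        score = SequenceMatcher(None, text, candidate).ratio()  # stand-in similarity
        if score >= similarity_threshold:
            return [candidate], [score]
    # Fall back to the untouched text if no candidate passes the threshold.
    return [text], [1.0]


def swap_two_chars(text: str, rng: random.Random) -> str:
    """Toy transformation: swap two neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


texts, scores = apply_until_similar("мама мыла раму", swap_two_chars)
print(texts, scores)
```

The function returns parallel lists of transformed texts and similarity scores, matching the return shape described in the list above.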

For more information on the `Transformer` class and its structure see [here](../rutransform/transformations/transformer.py).


Once you have created the Transformer, add it to the [rutransform/transformations/transformers](../rutransform/transformations/transformers) folder and edit the [`__init__.py`](../rutransform/transformations/__init__.py) file.

Now your transformation is ready to use!