<a href ="https://colab.research.google.com/github/GEM-benchmark/NL-Augmenter/blob/main/notebooks/NL_Augmenter_Write_a_sample_transformation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.


See the License for the specific language governing permissions and
limitations under the License.

# NL-Augmenter Colab example 

  * Play with an existing **transformation** 
    * Write your own **transformation** 
  * Play with an existing **filter**  
    * Write your own **filter**         

Total running time: ~10 min

## Install NL-Augmenter from GitHub



In [10]:
!git clone https://www.github.com/GEM-benchmark/NL-Augmenter

Cloning into 'NL-Augmenter'...
remote: Enumerating objects: 1195, done.
remote: Counting objects: 100% (136/136), done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 1195 (delta 70), reused 65 (delta 55), pack-reused 1059
Receiving objects: 100% (1195/1195), 225.09 KiB | 4.50 MiB/s, done.
Resolving deltas: 100% (716/716), done.


In [11]:
cd NL-Augmenter

/content/NL-Augmenter


In [12]:
!pip install -r requirements.txt --quiet

## Load modules

In [13]:
from transformations.butter_fingers_perturbation.transformation import ButterFingersPerturbation
from transformations.change_person_named_entities.transformation import ChangePersonNamedEntities
from transformations.replace_numerical_values.transformation import ReplaceNumericalValues
from interfaces.SentenceOperation import SentenceOperation
from interfaces.QuestionAnswerOperation import QuestionAnswerOperation
from evaluation.evaluation_engine import evaluate, execute_model
from tasks.TaskTypes import TaskType

## Play with some existing transformations

In [14]:
t1 = ButterFingersPerturbation()
t1.generate("Jason wants to move back to India by the end of next year.")

'Jasln wants to move back to India by the end od next year.'
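To give a feel for what a keyboard-typo perturbation of this kind does, here is a minimal, self-contained sketch. It is not the `ButterFingersPerturbation` implementation: the partial `NEIGHBORS` map, the `prob` and `seed` parameters, and the function name `butter_fingers_sketch` are all assumptions made for illustration.

```python
import random

# Hypothetical, partial map of QWERTY neighbours (illustration only).
NEIGHBORS = {
    "a": "qwsz", "e": "wrds", "i": "ujko", "o": "iklp",
    "n": "bhjm", "s": "awedxz", "t": "rfgy", "f": "drtgvc",
}

def butter_fingers_sketch(text, prob=0.1, seed=0):
    """Replace some characters with a neighbouring key, keeping their case."""
    rng = random.Random(seed)  # local RNG so the result is reproducible
    out = []
    for ch in text:
        candidates = NEIGHBORS.get(ch.lower())
        if candidates and rng.random() < prob:
            new = rng.choice(candidates)
            out.append(new.upper() if ch.isupper() else new)
        else:
            out.append(ch)  # characters outside the map are left alone
    return "".join(out)

print(butter_fingers_sketch("Jason wants to move back to India by the end of next year."))
```

Raising `prob` makes the text noisier, and fixing `seed` keeps the perturbed set identical between runs.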

In [15]:
t2 = ChangePersonNamedEntities()
t2.generate("Jason wants to move back to India by the end of next year.")

'Austin wants to move back to India by the end of next year.'

In [16]:
t3 = ReplaceNumericalValues()
t3.generate("Jason's 3 sisters want to move back to India")

"Jason's 8 sisters want to move back to India"

## Define a simple transformation
Let's define a very basic transformation which just uppercases the sentence. 

This transformation could be used for many [tasks](https://github.com/GEM-benchmark/NL-Augmenter/blob/add_filters_for_contrast_sets/tasks/TaskTypes.py), including text classification and generation, so we set the `tasks` variable to `[TaskType.TEXT_CLASSIFICATION, TaskType.TEXT_TO_TEXT_GENERATION]`. That's it!

In [17]:
class MySimpleTransformation(SentenceOperation):
  tasks = [TaskType.TEXT_CLASSIFICATION, TaskType.TEXT_TO_TEXT_GENERATION]
  locales = ["en"]
  
  def generate(self, sentence):
    return sentence.upper()

In [18]:
my_transformation = MySimpleTransformation() 

In [19]:
my_transformation.generate("John was n't the person I had n't imagined.")

"JOHN WAS N'T THE PERSON I HAD N'T IMAGINED."


Admittedly, this can barely be called a transformation. What could it really achieve?
Let's find out by comparing the performance of a trained text classifier on a common test set and on the same test set with MySimpleTransformation applied (also called a perturbed set), using the one line of code below. You will need to hold your breath for around 5 minutes!

In [None]:
execute_model(MySimpleTransformation, "TEXT_CLASSIFICATION", percentage_of_examples=1)

### 🕺 Voila! The accuracy on the perturbed set has fallen by 6% with this simple transformation!

So what happened internally? Based on the transformation type ([SentenceOperation](https://github.com/GEM-benchmark/NL-Augmenter/blob/main/interfaces/SentenceOperation.py)) and the task you provided (TEXT_CLASSIFICATION), `execute_model` evaluated a pre-trained HuggingFace model. In this case, a sentiment analysis model, [aychang/roberta-base-imdb](https://huggingface.co/aychang/roberta-base-imdb), was chosen and evaluated on 1% of the [IMDB dataset](https://huggingface.co/datasets/imdb) with and without the transformation to check whether the sentiment is predicted correctly.
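Conceptually, the comparison boils down to scoring the same model on the original inputs and on the transformed inputs. The sketch below illustrates the idea with a deliberately naive, case-sensitive keyword classifier. `toy_classifier`, `accuracy`, and the four example reviews are made up for illustration and are unrelated to the actual `evaluation_engine` code or to the RoBERTa model.

```python
def toy_classifier(text):
    """Hypothetical stand-in model: case-sensitive keyword matching."""
    return "pos" if "great" in text or "loved" in text else "neg"

def accuracy(model, examples, transform=None):
    """Fraction of (text, label) pairs predicted correctly, optionally after a transformation."""
    correct = 0
    for text, label in examples:
        if transform is not None:
            text = transform(text)
        correct += model(text) == label
    return correct / len(examples)

examples = [
    ("A great movie, I loved it.", "pos"),
    ("What a great cast.", "pos"),
    ("Dull and far too long.", "neg"),
    ("I want my money back.", "neg"),
]

original = accuracy(toy_classifier, examples)
perturbed = accuracy(toy_classifier, examples, transform=str.upper)
print(f"original={original:.2f} perturbed={perturbed:.2f}")
```

The toy model loses its positive predictions once the keywords are uppercased. This is the kind of brittleness that comparing an original and a perturbed set is meant to expose.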

If you want to evaluate this on your own model and dataset, you can pass them as parameters to `execute_model`, as shown below. Not every model and dataset type is supported, so some may require changes to the `evaluation_engine` module on your side, and we are happy to help. 😊

In [67]:
# Here are the different parameters which are used as defaults!
# execute_model(MySimpleTransformation, "TEXT_CLASSIFICATION", "en", model_name = "aychang/roberta-base-imdb", dataset="imdb", percentage_of_examples=1)

## A Model-Based Transformation
We don't want to restrict ourselves to string-level changes! We want to do more, don't we? So, let's use a pre-trained paraphrase generator to transform question answering examples. There is an existing interface, [QuestionAnswerOperation](https://github.com/GEM-benchmark/NL-Augmenter/blob/main/interfaces/QuestionAnswerOperation.py), which takes the context, the question and the answers as inputs. Let's use that to augment our training data for question answering!

In [None]:
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

class MySecondTransformation(QuestionAnswerOperation):
  tasks = [TaskType.QUESTION_ANSWERING, TaskType.QUESTION_GENERATION]
  locales = ["en"]

  def __init__(self):
    super().__init__()
    model_name="prithivida/parrot_paraphraser_on_T5"
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)  
    self.model = T5ForConditionalGeneration.from_pretrained(model_name)

  def generate(self, context, question, answers): # Note that the choice of inputs for 'generate' is consistent with those in QuestionAnswerOperation
    
    # Let's call the HF model to generate a paraphrase for the question
    paraphrase_input = question
    batch = self.tokenizer([paraphrase_input],truncation=True,padding='longest',max_length=60, return_tensors="pt")
    translated = self.model.generate(**batch,max_length=60,num_beams=10, num_return_sequences=1, temperature=1.5)
    # batch_decode returns a list of strings, one per returned sequence (here a single-item list)
    paraphrased_question = self.tokenizer.batch_decode(translated, skip_special_tokens=True) 

    # context = "Apply your own logic here"
    # answers = "And here too :)"

    # return the new question-answering example
    return context, paraphrased_question, answers

In [None]:
t4 = MySecondTransformation()

In [None]:
t4.generate(context="Mumbai, Bengaluru, New Delhi are among the many famous places in India.", 
            question="What are the famous places we should not miss in India?", 
            answers=["Mumbai", "Bengaluru", "Delhi", "New Delhi"])

('Mumbai, Bengaluru, New Delhi are among the many famous places in India.',
 ['recommend some of the best places to visit in India?'],
 ['Mumbai', 'Bengaluru', 'Delhi', 'New Delhi'])

Voila! Seems like you have created a new training example now for question-answering and question-generation! 🎉 🎊 🎉 
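Since the paraphrase model is a large download, it can help to check the `(context, question, answers)` contract with a lightweight stand-in first. The sketch below assumes a pluggable paraphraser. `ParaphraseQuestionSketch` and `toy_paraphrase` are hypothetical names, not part of NL-Augmenter, and unlike the cell above the sketch returns the question as a plain string rather than a list.

```python
from typing import Callable, List, Tuple

class ParaphraseQuestionSketch:
    """Follows the (context, question, answers) contract with a pluggable paraphraser."""

    def __init__(self, paraphrase: Callable[[str], str]):
        self.paraphrase = paraphrase

    def generate(self, context: str, question: str, answers: List[str]) -> Tuple[str, str, List[str]]:
        # Only the question changes; context and answers pass through untouched.
        return context, self.paraphrase(question), answers

def toy_paraphrase(question: str) -> str:
    """Hypothetical rule-based stand-in for the T5 paraphraser."""
    return question.replace("What are", "Which are")

sketch = ParaphraseQuestionSketch(toy_paraphrase)
result = sketch.generate(
    "Mumbai, Bengaluru, New Delhi are among the many famous places in India.",
    "What are the famous places we should not miss in India?",
    ["Mumbai", "Bengaluru", "Delhi", "New Delhi"],
)
print(result)
```

Swapping `toy_paraphrase` for a function that wraps the real model would leave the rest of the class unchanged.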

# Now you are all set to contribute a transformation to [NL-Augmenter 🦎 → 🐍](https://github.com/GEM-benchmark/NL-Augmenter)! 

## What's the deal with filters?
Just as transformations turn one text example into another, filters check whether an example matches some pattern! The only difference is that a transformation returns another example in the same format as its input, while a filter returns True or False!

sentence --> SentenceOperation.**generate**(sentence) --> another-sentence

sentence --> SentenceOperation.**filter**(sentence)  --> True/False
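The two contracts can be sketched with plain Python classes. `UppercaseSketch` and `IsQuestionSketch` are hypothetical stand-ins that do not inherit from the real `SentenceOperation`. They only show that `generate` maps a sentence to a sentence, while `filter` maps a sentence to a boolean.

```python
class UppercaseSketch:
    """Transformation-style operation: sentence in, sentence out."""

    def generate(self, sentence: str) -> str:
        return sentence.upper()

class IsQuestionSketch:
    """Filter-style operation: sentence in, True/False out."""

    def filter(self, sentence: str) -> bool:
        return sentence.strip().endswith("?")

sentences = ["Where is Jason moving?", "Jason is moving to India."]

# A transformation produces one new example per input example.
transformed = [UppercaseSketch().generate(s) for s in sentences]

# A filter selects a subset of the inputs and leaves them unchanged.
kept = [s for s in sentences if IsQuestionSketch().filter(s)]

print(transformed)
print(kept)
```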

# So, let's play with some existing filters! 


In [25]:
from filters.keywords import TextContainsKeywordsFilter
from filters.length import TextLengthFilter, SentenceAndTargetLengthFilter

The `TextLengthFilter` accepts an input sentence if the length of the input sentence is within the initialised range. Let's initialise this filter to accept all sentences with length greater than 10 tokens!

In [22]:
f1 = TextLengthFilter(">", 10)

In [23]:
f1.filter("This sentence is long enough to pass while you think of implementing your own filter!")

True

In [24]:
f1.filter("This one's too short!")

False
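A length filter of this kind can be approximated in a few lines. The sketch below is a hypothetical re-implementation, not the library's `TextLengthFilter`. It assumes tokens are whitespace-separated words, and the names `LengthFilterSketch` and `OPS` are made up for illustration.

```python
import operator

# Map comparison strings to the matching standard-library functions.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq}

class LengthFilterSketch:
    """Accepts a sentence when its token count satisfies `op threshold`."""

    def __init__(self, op, threshold):
        self.compare = OPS[op]
        self.threshold = threshold

    def filter(self, sentence):
        # Assumption: a token is a whitespace-separated chunk.
        return self.compare(len(sentence.split()), self.threshold)

length_sketch = LengthFilterSketch(">", 10)
print(length_sketch.filter("This sentence is long enough to pass while you think of implementing your own filter!"))
print(length_sketch.filter("This one's too short!"))
```

A real tokenizer may count tokens differently (for example by splitting off punctuation), so borderline sentences can fall on either side of the threshold.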

Let's say you have a lot of paraphrasing data and you intend to train a paraphrase generator to convert longer sentences to shorter ones! Check how the `SentenceAndTargetLengthFilter` can be used for this!


In [60]:
f2 = SentenceAndTargetLengthFilter([">", "<"], [10,8])

In [58]:
f2.filter("That show is going to take place in front of immensely massive crowds.", 
          "Large crowds would attend the show.")

True

In [59]:
f2.filter("The film was nominated for the Academy Award for Best Art Direction.", 
          "The movie was a nominee for the Academy Award for Best Art Direction.")

False
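Applied to a whole paraphrase corpus, a source-and-target length check like this selects only the pairs that shorten the sentence. The sketch below assumes whitespace tokenization and uses a hypothetical helper `keep_pair`. It mirrors the idea of `SentenceAndTargetLengthFilter` without claiming to match its implementation.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt}

def keep_pair(source, target, ops=(">", "<"), thresholds=(10, 8)):
    """True when the source and target token counts each satisfy their own condition."""
    lengths = (len(source.split()), len(target.split()))
    return all(OPS[op](n, t) for op, n, t in zip(ops, lengths, thresholds))

pairs = [
    ("That show is going to take place in front of immensely massive crowds.",
     "Large crowds would attend the show."),
    ("The film was nominated for the Academy Award for Best Art Direction.",
     "The movie was a nominee for the Academy Award for Best Art Direction."),
]

# Keep only pairs with a long source (> 10 tokens) and a short target (< 8 tokens).
shortening_pairs = [pair for pair in pairs if keep_pair(*pair)]
print(len(shortening_pairs))
```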

Okay, now that you've said to yourself that these filters are too basic, let's try to make a simple and interesting one! 

Let's define a filter which selects question-answer pairs whose question and context share a low lexical overlap!

In [63]:
import spacy

class LowLexicalOverlapFilter(QuestionAnswerOperation):
  tasks = [TaskType.QUESTION_ANSWERING, TaskType.QUESTION_GENERATION]
  locales = ["en"]
  
  def __init__(self, threshold=3):
    super().__init__()
    self.nlp = spacy.load("en_core_web_sm")
    self.threshold = threshold

  def filter(self, context, question, answers): 
    # Note that the only difference between a filter and a transformation is this method! 
    # The inputs remain the same!
    
    question_tokenized = self.nlp(question, disable=["parser", "tagger", "ner"])
    context_tokenized = self.nlp(context, disable=["parser", "tagger", "ner"])
    
    q_tokens = set([t.text for t in question_tokenized])
    c_tokens = set([t.text for t in context_tokenized])
    
    # Keep the example only if the question and context share at most `threshold` tokens
    low_lexical_overlap = len(q_tokens.intersection(c_tokens)) <= self.threshold
    return low_lexical_overlap

In [64]:
f3 = LowLexicalOverlapFilter()

In [65]:
f3.filter("New York, is the most populous city in the United States.",
          "Which is the most populous city of the United States?",
          ["New York"])

False

In [66]:
f3.filter("New York, is the most populous city in the United States.",
          "Which city has the largest population in the US?",
          ["New York"])

True
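To see why the two questions are treated differently, it helps to look at the overlap counts themselves. The sketch below uses a regular expression as a rough stand-in for spaCy tokenization (words and single punctuation marks). The helper names `tokens` and `overlap` are assumptions for illustration.

```python
import re

def tokens(text):
    """Rough stand-in for spaCy tokenization: words and single punctuation marks."""
    return set(re.findall(r"\w+|[^\w\s]", text))

def overlap(context, question):
    """Number of distinct tokens shared by the context and the question."""
    return len(tokens(context) & tokens(question))

context = "New York, is the most populous city in the United States."
high = overlap(context, "Which is the most populous city of the United States?")
low = overlap(context, "Which city has the largest population in the US?")
print(high, low)
```

The first question shares 7 distinct tokens with the context and the second shares 3, which is why a threshold of 3 separates them.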

That's it! You have created a new filter which can separate the hard examples from the easy ones! 🎉 🎊 🎉 

# Now go ahead and contribute a nice filter to [NL-Augmenter 🦎 → 🐍](https://github.com/GEM-benchmark/NL-Augmenter)! 