# Approach 4

This notebook uses Approach 4 to generate templates, focusing on positive and negative templates. One possible application is testing the "Vocabulary" linguistic capability with an MFT (Minimum Functionality Test).

The steps of this approach are:

1. Classify the instances using one or more models
2. Keep only the instances classified unanimously
3. Split each instance into sentences
4. Classify the sentences using one or more models to help label them
5. Keep only the sentences classified unanimously
6. Keep only the sentences with high-confidence predictions
7. Rank the words of each sentence
8. Keep only the sentences that contain relevant words
9. Replace the relevant words with masks
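
The two filtering criteria in steps 2, 5 and 6 can be sketched with plain NumPy. This is a minimal illustration, not the code used by the template generator: the function name `filter_unanimous_confident`, the array layout, and the choice to require every model (rather than, say, the average) to exceed the threshold are all assumptions.

```python
import numpy as np

def filter_unanimous_confident(preds, probs, min_conf=0.9):
    """Return indices of texts on which every model agrees and is confident.

    preds: int array, shape (n_models, n_texts), predicted labels.
    probs: float array, shape (n_models, n_texts, n_classes), class probabilities.
    """
    preds = np.asarray(preds)
    probs = np.asarray(probs)
    # Steps 2 and 5: every model predicts the same label as the first model.
    unanimous = (preds == preds[0]).all(axis=0)
    # Step 6 (assumed rule): every model's top probability is greater than min_conf.
    confident = (probs.max(axis=2) > min_conf).all(axis=0)
    return np.flatnonzero(unanimous & confident)

# Two hypothetical models scoring three texts.
preds = [[1, 0, 1], [1, 1, 1]]
probs = [
    [[0.02, 0.98], [0.70, 0.30], [0.45, 0.55]],
    [[0.05, 0.95], [0.40, 0.60], [0.08, 0.92]],
]
kept = filter_unanimous_confident(preds, probs)
print(kept)  # only text 0 is both unanimous and confident
```

Text 1 is dropped because the models disagree, and text 2 is dropped because one model is unsure even though the labels match.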

In [1]:
%config Completer.use_jedi = False
import sys
sys.path.append('../')

## Loading the dataset, the target model, and the auxiliary models

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

imdb_df = pd.read_csv('./data/imdb_sampled/data-100samples.csv')
imdb_df.head(5)

Unnamed: 0,label,text,words
0,1,"Christian Duguay directed this tidy little espionage thriller early in his career. It plays on TV pretty regularly, albeit with some terrific scenes of violence and sex unfortunately trimmed. I finally got around to seeing the theatrical version on a $3 tape from the local video store. Naval officer Aidan Quinn is recruited to impersonate the notorious Carlos the Jackal, and gets a little too caught up in the role. Donald Sutherland Ben Kingsley play Quinn's superiors, with Sutherland a true zealot and Kingsley as the more level-headed one. The first half of this fun flick shows Quinn being trained and indoctrinated. The second half has him out in the field, making love to the Jackal's woman and shooting it out with sundry enemies. The idea is to make the Jackal look like a turncoat to the Russians, and let them take care of the world's most notorious assassin. Things don't exactly play out as planned. At times, I almost expected the cast to break out laughing at some of the corny dialogue, but they all play it very straight. In the end, this is one terrific little thriller that deserves your attention. The Jackal's former mistress teaching the highly proper and very married Quinn to rough her up, lick blood from her face, and then go down on her, alone is worth the price of admission.",227
1,1,"New Yorkers contemporaneous with this film will recall how reflective of its time it is and how well cast and crew captured America, New York City of that era.<br /><br />Norman Wexler's script delineates the different worlds the various sub groupings live in and Avildsen's direction brings out phenomenal performances all around. Peter Boyle's prodigious talent is on display as never before nor since. Clearly it is the best character portrayal the always likable Dennis Patrick ever accomplished.<br /><br />What I will always remember about JOE is the feeling of having been in a virtual state of shock coming out of the theater. Knowing that what the screen portrayed was seething under the surface in neighborhoods throughout the five boroughs of the City of New York.<br /><br />This film needs to be remembered.",133
2,0,"I love oddball animation, I love a lot of Asian films, but I didn't love this particular product of Japan. The Fuccons are supposedly an American family (they're all mannequins) who have moved to Japan, and they're somewhat a 50's sitcom type family, with slightly more modern sensibilities at times. The DVD features several very short episodes (like less than 5 minutes each?) and I did not find it to be either funny or entertaining, not even in a weird way. I'm not sure what the appeal is of this. I did pick up on some satire here and there, gosh, who wouldn't, but satire is usually somewhat humorous, isn't it? And nothing I saw or heard rated even a little smirk. I picked this up used and it certainly SOUNDED appealing, but I guess either I'm missing the point or it's just plain LAME. The box even says it's Fuccon hilarious, right there on the front, but I beg to differ. 2 out of 10.",166
3,1,"I have seen this film probably a dozen times since it was originally released theatrically. Anyone who calls this movie trash or horrible just doesn't understand action films or recognize a good one. Perhaps to some the incidents and outcomes may seem far fetched, but in my opinion screenwriter Shane Black ( Lethal Weapon/ Kiss Kiss Bang Bang) crafted one of the most well thought out action adventures you will ever come across. Over the top or not this film flows like clockwork and the action just keeps coming. The final action sequence is one of the best I have ever seen in any film. The cast in this film crackles. Genna Davis gave a tremendous performance and its a damn shame there was never a ""LKG"" sequel. Samuel L. Jackson is hilarious as her sidekick Mitch a down on his luck private eye trying to help her discover her lost past and make a few bucks. If Baffles me how anyone could not like this film. It packs so many thrills and its so funny. The wisecracks in this film still make me laugh just as hard 10 years later. In my mind the first Matrix film and the Long Kiss Goodnight were easily 2 of the best and most original action flicks of the 90's. Incidentally Shane Black made a fortune when he sold this script. At the time it was the highest selling screenplay and its worth every penny. It's so sad that audiences never gave this movie a chance, cause they would have witnessed Renny Harlins best film and Genna Davis like you have never seen her before. Long live ""The Long Kiss Goodnight""!!",278
4,1,"Don't mind what this socially retarded person above says, this show is hilarious. It shows how a lot of single men are in a bar atmosphere, and also shows that women are not as gullible as men think they are. <br /><br />The contest aspect of the how is really cool and original. Its not the standard reality show that we are all used to now a days.<br /><br />Give it a chance everyone, we are only one episode in, we finally have some Canadian programming that isn't absolute crap. As Canadians what do we normally get, Bon Cop, Bad Cop, or Corner Gas. Come on people show that we are all not as prudish as the previous reviewer.<br /><br />Way to go Comedy Network, giving a new show a chance. The panel is funny and the contestants so far are pretty good.",143


In [3]:
import re
import numpy as np
from torch.nn.functional import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def pre_proccess(text):
    text = text.lower()
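    # Replace quotes, digits and most ASCII punctuation with spaces
    # (the ranges !-. and :-@ already cover parentheses and symbols).
    text = re.sub(r'["\',!-.:-@0-9/]', ' ', text)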
    return text

# Wrapper to adapt output format
class SentimentAnalisysModelWrapper:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        
    def __predict(self, text_input):
        text_preprocessed = pre_proccess(text_input)
        tokenized = self.tokenizer(text_preprocessed, padding=True, truncation=True, max_length=512, 
                                    add_special_tokens = True, return_tensors="pt")
        
        tensor_logits = self.model(**tokenized)
        # Pass dim explicitly: the implicit dimension is deprecated and triggers a warning.
        prob = softmax(tensor_logits[0], dim=-1).detach().numpy()
        pred = np.argmax(prob)
        
        return pred, prob
    
    def predict_label(self, text_inputs):
        return self.predict(text_inputs)[0]
        
    def predict_proba(self, text_inputs):
        return self.predict(text_inputs)[1]
        
    def predict(self, text_inputs):
        if isinstance(text_inputs, str):
            text_inputs = [text_inputs]
        
        preds = []
        probs = []

        for text_input in text_inputs:
            pred, prob = self.__predict(text_input)
            preds.append(pred)
            probs.append(prob[0])

        return np.array(preds), np.array(probs) # ([0, 1], [[0.99, 0.01], [0.03, 0.97]])

# Helper function to load and wrap a model from Hugging Face
def load_model(model_name):
    print(f'Loading model {model_name}...')
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    return SentimentAnalisysModelWrapper(model, tokenizer)

# Hugging Face hosted model names 
imdb_models = {
    'bert': 'textattack/bert-base-uncased-imdb', 
    'albert': 'textattack/albert-base-v2-imdb', 
    'distilbert': 'textattack/distilbert-base-uncased-imdb', 
    'roberta': 'textattack/roberta-base-imdb', 
    'xlnet': 'textattack/xlnet-base-cased-imdb'
}

In [4]:
m1 = load_model(imdb_models['albert'])
m2 = load_model(imdb_models['distilbert'])
m3 = load_model(imdb_models['roberta'])
m4 = load_model(imdb_models['xlnet'])

# Models to be used as oracle
models = [m1, m2, m3, m4]
# Target model
model = load_model(imdb_models['bert'])

Loading model textattack/albert-base-v2-imdb...
Loading model textattack/distilbert-base-uncased-imdb...
Loading model textattack/roberta-base-imdb...


Some weights of the model checkpoint at textattack/roberta-base-imdb were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Loading model textattack/xlnet-base-cased-imdb...
Loading model textattack/bert-base-uncased-imdb...


# Generating the templates
The word-ranking method used in PosNegTemplateGenerator is the Replace-1 Score.
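
The idea behind a replace-one ranking can be sketched as follows. This is an assumption about what the Replace-1 Score computes, not the generator's actual implementation: each word is swapped for a placeholder in turn, and the word's score is the drop in probability of the originally predicted class. The names `replace1_scores`, `toy_predict_proba` and the `[UNK]` placeholder are hypothetical, and `predict_proba` here takes one text and returns one probability vector.

```python
import numpy as np

def replace1_scores(words, predict_proba, placeholder="[UNK]"):
    """Score each word by the probability drop observed when it is replaced."""
    base = np.asarray(predict_proba(" ".join(words)))
    label = int(base.argmax())  # class predicted for the unmodified sentence
    scores = []
    for i in range(len(words)):
        perturbed = words[:i] + [placeholder] + words[i + 1:]
        prob = np.asarray(predict_proba(" ".join(perturbed)))
        scores.append(float(base[label] - prob[label]))
    return scores

def toy_predict_proba(text):
    # Hypothetical stand-in for a real classifier: counts cue words.
    tokens = text.lower().split()
    pos = sum(t in {"great", "wonderful"} for t in tokens)
    neg = sum(t in {"bad", "boring"} for t in tokens)
    p_pos = (1 + pos) / (2 + pos + neg)
    return [1 - p_pos, p_pos]

words = "this is a great movie".split()
scores = replace1_scores(words, toy_predict_proba)
ranked = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])  # the word whose removal hurts the prediction most
```

With a real model, one forward pass per word is needed, which is consistent with word ranking being an expensive stage of the pipeline.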

In [5]:
from template_generator.tasks.sentiment_analisys import PosNegTemplateGeneratorApp4

tg = PosNegTemplateGeneratorApp4(model, models)

### Initial number of instances: 5

In [6]:
# Sampling instances
np.random.seed(220)
n_instances = 5
df_sampled = imdb_df.sample(n_instances)

instances = [x for x in df_sampled['text'].values]

In [7]:
templates = tg.generate_templates(instances)

Predicting inputs...


:: Instance predictions done.
Filtering instances classified unanimously...
:: 5 instances remaining.
Converting texts to sentences...
:: 37 sentences were generated.
Predicting inputs...
:: Sentence predictions done.
Filtering instances classified unanimously...
:: 28 sentences remaining.
Filtering instances by classification score greater than 0.9
:: 23 sentences remaining.
Ranking words using Replace-1 Score...
:: Word ranking done.
Filtering instances by relevant words...
:: 4 sentences remaining.


#### Execution time for 5 instances: 1m 19.2s

In [8]:
df = tg.to_dataframe()
df

Unnamed: 0,original_text,masked_text,template_text
0,"Anyhow, this is a great study of a fascinating musician, woefully underknown, full of great stories, greater music, and it could have been 3 hours longer and I'd have loved it even more.","Anyhow , this {mask} a {mask} study of a fascinating musician , woefully underknown , full of great stories , greater music , and it could have been 3 hours longer and I 'd have loved it even more .","Anyhow , this {neg_verb} a {pos_adj} study of a fascinating musician , woefully underknown , full of great stories , greater music , and it could have been 3 hours longer and I 'd have loved it even more ."
1,"Saw it at the American Cinemateque Mods & Rockers Festival at the Aero Theatre in Santa Monica, where it played to a packed house.","Saw it at the {mask} Cinemateque Mods & Rockers Festival at the Aero Theatre in Santa Monica , where it played to a {mask} house .","Saw it at the {pos_adj} Cinemateque Mods & Rockers Festival at the Aero Theatre in Santa Monica , where it played to a {pos_verb} house ."
2,"Just forget the itty-bitty disappointments, like the fact that there were only adults in this movie based on a pup's point of view, because that's just 0.5% or less of the movie's wonderful effect on the viewer.","Just {mask} the itty-bitty disappointments , like the fact that there were only adults in this movie based on a pup 's point of view , because that 's just 0.5 % or less of the movie 's {mask} effect on the viewer .","Just {neg_verb} the itty-bitty disappointments , like the fact that there were only adults in this movie based on a pup 's point of view , because that 's just 0.5 % or less of the movie 's {pos_adj} effect on the viewer ."
3,Samuel L. Jackson is hilarious as her sidekick Mitch a down on his luck private eye trying to help her discover her lost past and make a few bucks.,Samuel L. Jackson {mask} {mask} as her sidekick Mitch a down on his luck private eye trying to help her discover her lost past and make a few bucks .,Samuel L. Jackson {neg_verb} {pos_adj} as her sidekick Mitch a down on his luck private eye trying to help her discover her lost past and make a few bucks .
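
The relationship between the `masked_text` and `template_text` columns can be illustrated with a small sketch. It assumes the relevant words have already been selected and tagged; the function `mask_words` and the index-to-tag mapping are hypothetical and only mirror the output format shown above.

```python
def mask_words(tokens, relevant):
    """Build the masked and the tagged version of a tokenized sentence.

    tokens: list of word tokens.
    relevant: dict mapping token index -> placeholder name such as "pos_adj".
    """
    masked = ["{mask}" if i in relevant else tok for i, tok in enumerate(tokens)]
    template = ["{" + relevant[i] + "}" if i in relevant else tok
                for i, tok in enumerate(tokens)]
    return " ".join(masked), " ".join(template)

tokens = "Samuel L. Jackson is hilarious as her sidekick".split()
masked_text, template_text = mask_words(tokens, {3: "neg_verb", 4: "pos_adj"})
print(masked_text)    # Samuel L. Jackson {mask} {mask} as her sidekick
print(template_text)  # Samuel L. Jackson {neg_verb} {pos_adj} as her sidekick
```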


In [9]:
tg.lexicons

{'pos_verb': ['packed'],
 'neg_verb': ['is', 'forget'],
 'pos_adj': ['great', 'wonderful', 'American', 'hilarious'],
 'neg_adj': []}
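
One way these lexicons could be combined with a template is sketched below. It is an illustration under assumptions, not a method of the generator: the helper `fill_template` is hypothetical, it treats every `{name}` placeholder as an independent slot, and it assumes the sentence contains no other curly braces.

```python
import re
from itertools import product

def fill_template(template, lexicons):
    """Yield every sentence obtained by filling the placeholders with lexicon words."""
    slots = re.findall(r"\{(\w+)\}", template)        # placeholder names, in order
    pattern = re.sub(r"\{\w+\}", "{}", template)      # positional version of the template
    for combo in product(*(lexicons[name] for name in slots)):
        yield pattern.format(*combo)

lexicons = {
    "pos_verb": ["packed"],
    "neg_verb": ["is", "forget"],
    "pos_adj": ["great", "wonderful", "American", "hilarious"],
    "neg_adj": [],
}
template = "Samuel L. Jackson {neg_verb} {pos_adj} as her sidekick ."
sentences = list(fill_template(template, lexicons))
print(len(sentences))  # 2 verbs x 4 adjectives
print(sentences[0])
```

A template that uses a placeholder whose lexicon is empty (here `neg_adj`) produces no sentences, so small runs like this one yield few usable combinations.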

### Initial number of instances: 100

In [10]:
# Using all 100 instances
instances = [x for x in imdb_df['text'].values]

In [11]:
templates = tg.generate_templates(instances)

Predicting inputs...


:: Instance predictions done.
Filtering instances classified unanimously...
:: 92 instances remaining.
Converting texts to sentences...
:: 742 sentences were generated.
Predicting inputs...
:: Sentence predictions done.
Filtering instances classified unanimously...
:: 473 sentences remaining.
Filtering instances by classification score greater than 0.9
:: 341 sentences remaining.
Ranking words using Replace-1 Score...
:: Word ranking done.
Filtering instances by relevant words...
:: 38 sentences remaining.


#### Execution time for 100 instances: 22m 8.3s

In [12]:
df = tg.to_dataframe()
df

Unnamed: 0,original_text,masked_text,template_text
0,"New Yorkers contemporaneous with this film will recall how reflective of its time it is and how well cast and crew captured America, New York City of that era.<br /><br />Norman Wexler's script delineates the different worlds the various sub groupings live in and Avildsen's direction brings out phenomenal performances all around.","New Yorkers contemporaneous with this film will recall how reflective of its time it is and how well cast and crew captured America , New York City of that era. < br / > < br / > Norman Wexler 's script delineates the different {mask} the various sub groupings live in and Avildsen 's direction brings out {mask} performances all around .","New Yorkers contemporaneous with this film will recall how reflective of its time it is and how well cast and crew captured America , New York City of that era. < br / > < br / > Norman Wexler 's script delineates the different {pos_verb} the various sub groupings live in and Avildsen 's direction brings out {pos_adj} performances all around ."
1,Knowing that what the screen portrayed was seething under the surface in neighborhoods throughout the five boroughs of the City of New York.<br /><br />This film needs to be remembered.,Knowing that what the screen portrayed was seething under the surface in neighborhoods throughout the five boroughs of the City of New York. < br / > < br / > This film needs to {mask} {mask} .,Knowing that what the screen portrayed was seething under the surface in neighborhoods throughout the five boroughs of the City of New York. < br / > < br / > This film needs to {neg_verb} {pos_verb} .
2,Samuel L. Jackson is hilarious as her sidekick Mitch a down on his luck private eye trying to help her discover her lost past and make a few bucks.,Samuel L. Jackson {mask} {mask} as her sidekick Mitch a down on his luck private eye trying to help her discover her lost past and make a few bucks .,Samuel L. Jackson {neg_verb} {pos_adj} as her sidekick Mitch a down on his luck private eye trying to help her discover her lost past and make a few bucks .
3,"The writer played by effects man Mark Sawicki wears thin quickly.<br /><br />It begins in a comfortably predictable enough way, with a nighttime set piece in which two victims are claimed to get things off to an acceptable start.","The writer played by effects man Mark Sawicki wears thin quickly. < br / > < br / > It {mask} in a comfortably {mask} enough way , with a nighttime set piece in which two victims are claimed to get things off to an acceptable start .","The writer played by effects man Mark Sawicki wears thin quickly. < br / > < br / > It {neg_verb} in a comfortably {neg_adj} enough way , with a nighttime set piece in which two victims are claimed to get things off to an acceptable start ."
4,"Just forget the itty-bitty disappointments, like the fact that there were only adults in this movie based on a pup's point of view, because that's just 0.5% or less of the movie's wonderful effect on the viewer.","Just {mask} the itty-bitty disappointments , like the fact that there were only adults in this movie based on a pup 's point of view , because that 's just 0.5 % or less of the movie 's {mask} effect on the viewer .","Just {neg_verb} the itty-bitty disappointments , like the fact that there were only adults in this movie based on a pup 's point of view , because that 's just 0.5 % or less of the movie 's {pos_adj} effect on the viewer ."
5,"This results is jerky motion that doesn't look very attractive, and yet this was an excusable solution given the limitations of optical printing technology at the time, it's just not excusable that the current DVD version is unrestored, the films look dirty as they did in 1959 and are still stretch printed.","This results is {mask} motion that does n't {mask} very attractive , and yet this was an excusable solution given the limitations of optical printing technology at the time , it 's just not excusable that the current DVD version is unrestored , the films look dirty as they did in 1959 and are still stretch printed .","This results is {neg_adj} motion that does n't {neg_verb} very attractive , and yet this was an excusable solution given the limitations of optical printing technology at the time , it 's just not excusable that the current DVD version is unrestored , the films look dirty as they did in 1959 and are still stretch printed ."
6,At least not much good.,At {mask} not much {mask} .,At {neg_adj} not much {pos_adj} .
7,"They are so stereotypic that we wonder, are they meant to be Everyman?","They are so {mask} that we {mask} , are they meant to be Everyman ?","They are so {neg_adj} that we {neg_verb} , are they meant to be Everyman ?"
8,You & I... sung by Ms.Clark & later recorded by many others including T.Bennett/S.,You & I ... sung by Ms.Clark & later {mask} by many others {mask} T.Bennett/S .,You & I ... sung by Ms.Clark & later {pos_verb} by many others {neg_verb} T.Bennett/S .
9,It just would have worked much better if the students had been attractive and actually had some talent.,It just {mask} {mask} worked much better if the students had been attractive and actually had some talent .,It just {neg_verb} {neg_verb} worked much better if the students had been attractive and actually had some talent .


In [13]:
tg.lexicons

{'pos_verb': ['remembered',
  'packed',
  'worlds',
  'watch',
  'finds',
  'Talk',
  'laughing',
  'recorded'],
 'neg_verb': ['must',
  'complaining',
  'be',
  'see',
  'have',
  'managed',
  'saying',
  'seen',
  'look',
  'would',
  'saw',
  'thought',
  'shambolic',
  'abandoned',
  'wastes',
  'starts',
  'wonder',
  'becomes',
  'forget',
  'begins',
  'consists',
  'was',
  'had',
  'acting',
  'is',
  'including',
  'does'],
 'pos_adj': ['sharp',
  'hard',
  'hilarious',
  'documentary',
  'tale',
  'affable',
  'good',
  'brilliant',
  'true',
  'great',
  'wonderful',
  'talented',
  'American',
  'phenomenal',
  'goofy',
  'such',
  'magnificent',
  'beautiful'],
 'neg_adj': ['tiresome',
  'bad',
  'stupid',
  'psychic',
  'jerky',
  'worse',
  'least',
  'predictable',
  'stereotypic',
  'unfunny',
  'incoherent']}
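
Once templates are filled with lexicon words, an MFT reduces to checking how often the target model misses the expected label. The sketch below assumes the expected label is supplied by the caller; how the generator assigns labels to templates is not shown in this notebook. The names `run_mft` and `toy_predict_label` are hypothetical, and the toy classifier stands in for the real target model.

```python
def run_mft(sentences, expected_label, predict_label):
    """Return the failure rate and the failing sentences for one expected label."""
    failures = [s for s in sentences if predict_label(s) != expected_label]
    rate = len(failures) / len(sentences) if sentences else 0.0
    return rate, failures

def toy_predict_label(text):
    # Hypothetical classifier: positive (1) only if a known cue word appears.
    cues = {"great", "wonderful", "hilarious"}
    return int(any(word in cues for word in text.lower().split()))

sentences = [
    "The acting is great .",
    "The acting is phenomenal .",
    "The acting is wonderful .",
    "The acting is affable .",
]
rate, failures = run_mft(sentences, expected_label=1, predict_label=toy_predict_label)
print(rate)      # fraction of sentences the toy model gets wrong
print(failures)  # sentences containing words outside the toy model's vocabulary
```

The failing sentences point at vocabulary the model does not handle, which is the kind of gap a "Vocabulary" capability test is meant to expose.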