
# FitBert

[FitBert](https://github.com/Qordobacode/fitbert) ((F)ill (i)n (t)he blanks, (BERT)) is a library for using BERT to fill in the blank(s) in a section of text from a list of options.

It's easy to use. Install it with pip:

In [1]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [2]:
!pip install fitbert



Then import and use it. Note that this downloads a pretrained BERT model and loads it into memory, which takes a minute or two:

In [3]:
from fitbert import FitBert


# in theory you can pass a model_name and tokenizer, but currently only
# bert-large-uncased and BertTokenizer are available
# this takes a while and loads a whole big BERT into memory
fb = FitBert()

masked_string = "Why Bert, you're looking ***mask*** today!"
options = ['buff', 'handsome', 'strong']

ranked_options = fb.rank(masked_string, options=options)
ranked_options

using model: bert-large-uncased
device: cuda


['handsome', 'strong', 'buff']

In [4]:
filled_in = fb.fitb(masked_string, options=options)
filled_in

"Why Bert, you're looking handsome today!"

There's a convenience method for masking a span (and filling in the suggestion, or not):

In [5]:
unmasked_string = "Why Bert, you're looks handsome today!"
span_to_mask = (17, 22)

filled_in = fb.mask_fitb(unmasked_string, span_to_mask)
filled_in

"Why Bert, you're  looking  handsome today!"

In [6]:
masked_string, masked = fb.mask(unmasked_string, span_to_mask)
print(masked_string, masked)

Why Bert, you're  ***mask***  handsome today! looks
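The span passed to `mask_fitb` and `mask` above is a pair of character offsets, and `(17, 22)` matches the start and exclusive end of "looks" in the example string. Counting offsets by hand is error-prone, so here is a minimal sketch of a helper that derives the span with `str.find`. `span_of` is a hypothetical name, not part of FitBert, and the half-open `(start, end)` convention is inferred from the example above.

```python
def span_of(text: str, word: str, start: int = 0) -> tuple:
    """Return (begin, end) character offsets of the first occurrence of `word`.

    Hypothetical helper, not part of FitBert. `end` is exclusive, which
    matches the (17, 22) span used for "looks" in the example above.
    """
    begin = text.find(word, start)
    if begin == -1:
        raise ValueError(f"{word!r} not found in {text!r}")
    return (begin, begin + len(word))


unmasked_string = "Why Bert, you're looks handsome today!"
span_to_mask = span_of(unmasked_string, "looks")
print(span_to_mask)  # (17, 22)
```

The resulting tuple can be passed as `span_to_mask` in the cells above.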


## From the "Introducing FitBERT" blog post

### SWE section

In [7]:
masked_string = "Your 6 ***mask*** sodas are on their way !"
options = ['hot', 'cold', 'sweet', 'delicious', 'artisanal']
fb.fitb(masked_string, options=options)

'Your 6 cold sodas are on their way !'

In [8]:
masked_string = "Your 17 ***mask*** burritos are on their way !"
options = ['hot', 'cold', 'sweet', 'delicious', 'artisanal']
fb.fitb(masked_string, options=options)

'Your 17 delicious burritos are on their way !'

### Researcher section

One use case for FitBERT is evaluating the syntactic capabilities of any model available through the [Transformers library](https://medium.com/r/?url=https%3A%2F%2Fgithub.com%2Fhuggingface%2Ftransformers), which includes BERT, RoBERTa, GPT2, and DistilBERT.

This is very similar to Yoav Goldberg's [Assessing BERT's Syntactic Abilities](https://arxiv.org/abs/1901.05287). As far as I know, this experiment hasn't been repeated with RoBERTa or DistilBERT, but it would be interesting to try.

In [9]:
# example from "Targeted Syntactic Evaluation of Language Models"
# https://arxiv.org/abs/1808.09031

masked_string = "the author that the guard likes ***mask***"
options = ['laugh', 'laughs']
fb.rank_with_prob(masked_string, options)

(['laughs', 'laugh'], [4.141843985838722e-12, 3.374725358103875e-13])
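The raw scores returned here are very small, which makes them hard to compare at a glance. One option, shown as a sketch rather than anything FitBert does itself, is to renormalize the scores over the candidate set so they sum to 1. `normalize` is a hypothetical helper, and the pairing below assumes the scores line up with the ranked options.

```python
def normalize(scores):
    """Rescale non-negative scores so they sum to 1 (hypothetical helper)."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


# values copied from the output above
ranked = ['laughs', 'laugh']
scores = [4.141843985838722e-12, 3.374725358103875e-13]

for option, share in zip(ranked, normalize(scores)):
    print(f"{option}: {share:.3f}")
```

The normalized values only say how the candidates compare with each other. They say nothing about how likely either one is compared with the rest of the vocabulary.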

In [10]:
# example from "Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies"
# https://transacl.org/ojs/index.php/tacl/article/view/972

masked_string = "accusations of abusive sockpuppetry from a trusted source ***mask*** a serious chilling effect ."
options = ["have", "has"]
fb.rank_with_prob(masked_string, options)

(['have', 'has'], [0.8899506330490112, 0.004103281069546938])
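To turn single probes like the two above into an evaluation, the same check can be run over a list of cases and summarized as an accuracy. The sketch below is only an illustration of that idea: `agreement_accuracy`, the `Case` tuple shape, and `stub_rank` are all hypothetical, and the stub stands in for a real ranker so the example runs without loading BERT.

```python
from typing import Callable, List, Sequence, Tuple

# (masked sentence, candidate verbs, expected verb): a hypothetical test-case shape
Case = Tuple[str, List[str], str]


def agreement_accuracy(rank: Callable[[str, Sequence[str]], List[str]],
                       cases: Sequence[Case]) -> float:
    """Fraction of cases where the top-ranked option is the expected one."""
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(1 for masked, options, expected in cases
               if rank(masked, options)[0] == expected)
    return hits / len(cases)


cases = [
    ("the author that the guard likes ***mask***", ["laugh", "laughs"], "laughs"),
    ("accusations of abusive sockpuppetry from a trusted source ***mask*** "
     "a serious chilling effect .", ["have", "has"], "have"),
]


def stub_rank(masked, options):
    """Stand-in ranker that returns the options unchanged; swap in a real one."""
    return list(options)


print(agreement_accuracy(stub_rank, cases))  # 0.5 for this stub
```

With a loaded model, passing `fb.rank` in place of `stub_rank` should apply the same check to FitBert's ranking, since `rank` is called with the masked string and options as positional arguments elsewhere in this notebook.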

### Using FitBERT with a spell corrector

Example of refining the output of a [word-vector-based spell checker](https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26) with BERT. This would also work with something like Hunspell.

In [11]:
original = "We predict the following issues will ocur."
masked_string = "We predict the following issues will ***mask*** ."
# misspelling vector subtraction gives the following options for "ocur"
options = ['ocur', 'occur', 'arise', 'happen', 'reliably']
fb.fitb(masked_string, options=options)

'We predict the following issues will arise .'

In [12]:
original = "We predict the following issues will ocur."
masked_string = "We predict the following issues will ***mask*** ."
# misspelling vector subtraction gives the following options for "ocur", filtered by a Levenshtein distance threshold:
options = ['ocur', 'occur']
fb.fitb(masked_string, options=options)

'We predict the following issues will occur .'
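The second cell above narrows the candidates with a Levenshtein distance threshold but doesn't show the filtering step. Here is a minimal sketch of one way to do it in plain Python. `levenshtein` and `filter_by_distance` are hypothetical helpers, and the threshold of 1 is an assumption chosen because it reproduces the two options used above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def filter_by_distance(word, candidates, max_distance=1):
    """Keep candidates within `max_distance` edits of `word` (threshold is an assumption)."""
    return [c for c in candidates if levenshtein(word, c) <= max_distance]


candidates = ['ocur', 'occur', 'arise', 'happen', 'reliably']
options = filter_by_distance('ocur', candidates)
print(options)  # ['ocur', 'occur']
```

The filter drops semantically plausible but orthographically distant words such as "arise", leaving BERT to choose only among likely intended spellings.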

### (Work in Progress) Using FitBERT for truecasing

An efficient implementation will require:

1. Fixing the bug where probabilities returned by `fb.rank(with_prob=True)` aren't in the same order as the tokens returned
2. Tensorizing the handling of multi-token masks

In [0]:
from transformers import BertForMaskedLM, BertTokenizer

In [14]:
new_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
new_bert = BertForMaskedLM.from_pretrained('bert-base-cased')

fb2 = FitBert(model=new_bert, tokenizer=new_tokenizer, disable_gpu=True)

using model: bert-large-uncased
device: cpu


In [0]:
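def change_case(word: str):
    """Flip the case of a word's first letter, or return False if that isn't meaningful."""
    if not word.isalpha():
        # skip punctuation, numbers, and tokens like "Panic!"
        return False

    if word.lower() == word:
        return word.capitalize()
    elif word.capitalize() == word:
        return word.lower()
    else:
        # camelCase, ALLUPPER, sPoNGeBoB, etc
        return False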

In [16]:
# Naive implementation handles some cases
masked_string = f'{fb2.mask_token} more than 2000 minerals are known, nearly all rocks are formed from seven mineral groups.'
# Multiple-choice options:
# (A) Although
# (B) However
# (C) Despite
# (D) Since
# masked_string = f"These {fb2.mask_token} some common grammatical mistakes ."
options = ["Although", "However", "Despite", "Since"]
fb.rank_with_prob(masked_string, options=options)


(['Although', 'Since', 'Despite', 'However'],
 [0.7068175673484802,
  0.007502323482185602,
  0.0011061042314395308,
  0.0002121678408002481])

In [17]:
masked_string = f"Our friends are expected to assume the burden of their own defense,{fb2.mask_token} they are competent to do."
options = ["which we are certain", "that we are certain of ", "of which we are sure", "for which we are sure" ]
fb.rank_with_prob(masked_string, options=options)

(['which we are certain',
  'of which we are sure',
  'for which we are sure',
  'that we are certain of'],
 [7.986267282007622e-14,
  8.050572642896491e-15,
  3.708180255270946e-15,
  5.097586574393146e-21])

In [18]:
# but not others
masked_string = f"These are some Common {fb2.mask_token} mistakes ."
options = ["grammatical", "Grammatical"]
fb2.fitb(masked_string, options=options)


'These are some Common Grammatical mistakes .'

In [19]:
# we can loop through and look at each pairwise comparison to see what's going on
# except we can't trust the probabilities order, because of a bug in fitbert
orig = "These are some Common Grammatical mistakes ."

tokens = orig.split(" ")
for i, token in enumerate(tokens):
    masked_string = " ".join(tokens[:i]) + fb2.mask_token + " ".join(tokens[i:])
    changed = change_case(token)
    if changed:
        options = [changed, token]
        ranked, probs = fb2.rank(masked_string, options, with_prob=True)
        print(ranked, probs)

['These', 'these'] [0.00013668823521584272, 1.906783836602699e-05]
['are', 'Are'] [0.0012963397894054651, 1.410887534802896e-06]
['some', 'Some'] [0.0023924994748085737, 9.767485607881099e-06]
['common', 'Common'] [0.05046175420284271, 0.026039525866508484]
['Grammatical', 'grammatical'] [0.0001627308374736458, 3.978714630648028e-06]
['mistakes', 'Mistakes'] [0.00013252785720396787, 2.397542289145349e-07]


In [20]:
# Using greedy decoding works pretty well

orig = "These are some Common Grammatical mistakes ."

tokens = orig.split(" ")
for i, token in enumerate(tokens):
    if token.isalpha():
        masked_string = " ".join(tokens[0:i]) + " " + fb2.mask_token + " " + " ".join(tokens[i + 1:])
        print("the masked string is: ", masked_string)
        changed = change_case(token)
        if changed:
            options = [changed, token]
            ranked, probs = fb2.rank(masked_string, options, with_prob=True)
            print(ranked, probs)
            if ranked[0] == changed:
                # should use probs, but there is a bug where it is sorted before being returned :facepalm:
                tokens[i] = changed
                print("the string is now:", " ".join(tokens))
print("final version is")
print(" ".join(tokens))

the masked string is:   ***mask*** are some Common Grammatical mistakes .
['These', 'these'] [0.052241019904613495, 0.0011475221253931522]
the masked string is:  These ***mask*** some Common Grammatical mistakes .
['are', 'Are'] [0.49185436964035034, 0.0001234041847055778]
the masked string is:  These are ***mask*** Common Grammatical mistakes .
['some', 'Some'] [0.006553380750119686, 0.000497280212584883]
the masked string is:  These are some ***mask*** Grammatical mistakes .
['common', 'Common'] [0.2282126545906067, 0.011387933045625687]
the string is now: These are some common Grammatical mistakes .
the masked string is:  These are some common ***mask*** mistakes .
['grammatical', 'Grammatical'] [1.6586891433689743e-05, 1.957106633199146e-06]
the string is now: These are some common grammatical mistakes .
the masked string is:  These are some common grammatical ***mask*** .
['mistakes', 'Mistakes'] [0.0025533498264849186, 5.693058025002529e-09]
final version is
These are some common grammatical mistakes .
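The greedy loop above can be wrapped in a function so that the ranking step is pluggable. This is a sketch under stated assumptions, not part of FitBert: `greedy_truecase` and `stub_rank` are hypothetical names, the default mask token is the `***mask***` string seen in the outputs above, and the toy ranker stands in for `fb2.rank` so the example runs without a model.

```python
def change_case(word):
    """Flip lower <-> Capitalized; return False for anything else."""
    if not word.isalpha():
        return False
    if word.lower() == word:
        return word.capitalize()
    if word.capitalize() == word:
        return word.lower()
    return False


def greedy_truecase(sentence, rank, mask_token="***mask***"):
    """Left-to-right pass: mask each token and keep whichever casing ranks first."""
    tokens = sentence.split(" ")
    for i, token in enumerate(tokens):
        changed = change_case(token)
        if not changed:
            continue
        masked = " ".join(tokens[:i] + [mask_token] + tokens[i + 1:])
        if rank(masked, [changed, token])[0] == changed:
            tokens[i] = changed  # later tokens see this correction
    return " ".join(tokens)


def stub_rank(masked, options):
    """Toy ranker: prefer a capital only at the start of the sentence."""
    at_start = masked.startswith("***mask***")
    return sorted(options, key=lambda o: o[0].isupper() != at_start)


print(greedy_truecase("these are some Common Grammatical mistakes .", stub_rank))
```

Because each accepted correction is written back into `tokens` before the next token is masked, later decisions are conditioned on earlier ones, which is what makes the pass greedy.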

In [21]:
# I can't tell if this behaviour is ok
# Wikipedia says that this correction is good
# I thought this would be hard for the model...

orig = "I 'm really feeling Panic! At The Disco ."

tokens = orig.split(" ")
for i, token in enumerate(tokens):
    if i>0:
        if token.isalpha():
            masked_string = " ".join(tokens[0:i]) + " " + fb2.mask_token + " " + " ".join(tokens[i + 1:])
            changed = change_case(token)
            if changed:
                options = [changed, token]
                ranked, probs = fb2.rank(masked_string, options, with_prob=True)
                if ranked[0] == changed:
                    # should use probs, but there is a bug where it is sorted before being returned :facepalm:
                    tokens[i] = changed
print("final version is")
print(" ".join(tokens))

final version is
I 'm really feeling Panic! at the Disco .


In [22]:
# Truecasing is nearly impossible if a product name is also a common noun

orig = "Create Styleguides to standardize a writing style across all your content — or to manage distinct styles for different audiences ."

tokens = orig.split(" ")
for i, token in enumerate(tokens):
    if token.isalpha() and i>0:
        masked_string = " ".join(tokens[0:i]) + " " + fb2.mask_token + " " + " ".join(tokens[i + 1:])
        changed = change_case(token)
        if changed:
            options = [changed, token]
            ranked, probs = fb2.rank(masked_string, options, with_prob=True)
            if ranked[0] == changed:
                # should use probs, but there is a bug where it is sorted before being returned :facepalm:
                tokens[i] = changed
print("final version is")
print(" ".join(tokens))

final version is
Create styleguides to standardize a writing style across all your content — or to manage distinct styles for different audiences .
