<a href="https://colab.research.google.com/github/Onamihoang/NLP-IELTS/blob/master/Try_FitBert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FitBert

[FitBert](https://github.com/Qordobacode/fitbert) ((F)ill (i)n (t)he blanks, (BERT)) is a library for using BERT to fill in the blank(s) in a section of text from a list of options.

It's easy to use. Install it with pip:

In [1]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [2]:
!pip install fitbert

Collecting fitbert
  Downloading https://files.pythonhosted.org/packages/e3/07/7ac2579504308a7fb9cfe3ec9b6d92249ab2684658121b742cf76011e37c/fitbert-0.7.0.tar.gz (216kB)
Collecting transformers>=2.1.1
  Downloading https://files.pythonhosted.org/packages/a3/78/92cedda05552398352ed9784908b834ee32a0bd071a9b32de287327370b7/transformers-2.8.0-py3-none-any.whl (563kB)
Collecting PyFunctional==1.2.0
  Downloading https://files.pythonhosted.org/packages/96/80/8edc965035d787105a7c85f4f9c490aea000e004062205699c3b39feb7dc/PyFunctional-1.2.0-py3-none-any.whl (44kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/98/2c/8df20f3ac6c22ac224fff307ebc102818206c53fc454ecd37d8ac2060df5/sentencepiece-0.1.86-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)

Then import and use it (note - this requires downloading and loading into memory a pretrained BERT model and takes a minute or two):

In [3]:
from fitbert import FitBert


# in theory you can pass a model_name and tokenizer, but currently only
# bert-large-uncased and BertTokenizer are available
# this takes a while and loads a whole big BERT into memory
fb = FitBert()

masked_string = "Why Bert, you're looking ***mask*** today!"
options = ['buff', 'handsome', 'strong']

ranked_options = fb.rank(masked_string, options=options)
ranked_options

using model: bert-large-uncased
device: cpu


['handsome', 'strong', 'buff']

In [4]:
filled_in = fb.fitb(masked_string, options=options)
filled_in

"Why Bert, you're looking handsome today!"

There's a convenience method for masking a span (and filling in the suggestion, or not):

In [5]:
unmasked_string = "Why Bert, you're looks handsome today!"
span_to_mask = (17, 22)

filled_in = fb.mask_fitb(unmasked_string, span_to_mask)
filled_in

"Why Bert, you're  looking  handsome today!"

In [6]:
masked_string, masked = fb.mask(unmasked_string, span_to_mask)
print(masked_string, masked)

Why Bert, you're  ***mask***  handsome today! looks
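Judging by the output above, `span_to_mask` is a pair of character offsets with an exclusive end, the same convention as Python slicing, so `unmasked_string[17:22]` is `looks`. A minimal sketch of computing such a span from a target word, assuming the first occurrence is the one to mask; `find_span` is a hypothetical helper and not part of FitBert:

```python
from typing import Optional, Tuple


def find_span(text: str, word: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first occurrence of word.

    The end offset is exclusive, like a Python slice. Returns None if the
    word does not occur. (Hypothetical helper, not a FitBert function.)
    """
    start = text.find(word)
    if start == -1:
        return None
    return (start, start + len(word))


unmasked_string = "Why Bert, you're looks handsome today!"
span = find_span(unmasked_string, "looks")
print(span)
print(unmasked_string[span[0]:span[1]])
```

The resulting tuple could then be passed wherever `span_to_mask` is used above.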


## From the "Introducing FitBERT" blog post

### SWE section

In [7]:
masked_string = "Your 6 ***mask*** sodas are on their way !"
options = ['hot', 'cold', 'sweet', 'delicious', 'artisanal']
fb.fitb(masked_string, options=options)

'Your 6 cold sodas are on their way !'

In [8]:
masked_string = "Your 17 ***mask*** burritos are on their way !"
options = ['hot', 'cold', 'sweet', 'delicious', 'artisanal']
fb.fitb(masked_string, options=options)

'Your 17 delicious burritos are on their way !'

### Researcher section

One use case for FitBERT is easily evaluating the syntactic capabilities of any model available through the [Transformers library](https://medium.com/r/?url=https%3A%2F%2Fgithub.com%2Fhuggingface%2Ftransformers), which includes BERT, RoBERTa, GPT-2, and DistilBERT.

This is very similar to Yoav Goldberg's [Assessing BERT's Syntactic Abilities](https://arxiv.org/abs/1901.05287). As far as I know, this experiment hasn't been repeated with RoBERTa or DistilBERT, but doing so would be interesting.

In [9]:
# example from "Targeted Syntactic Evaluation of Language Models"
# https://arxiv.org/abs/1808.09031

masked_string = "the author that the guard likes ***mask***"
options = ['laugh', 'laughs']
fb.rank_with_prob(masked_string, options)

(['laughs', 'laugh'], [4.141863501477827e-12, 3.3747397237826604e-13])

In [10]:
# example from "Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies"
# https://transacl.org/ojs/index.php/tacl/article/view/972

masked_string = "accusations of abusive sockpuppetry from a trusted source ***mask*** a serious chilling effect ."
options = ["have", "has"]
fb.rank_with_prob(masked_string, options)

(['have', 'has'], [0.8899551630020142, 0.004103293642401695])
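To turn single examples like these into an evaluation, one option is to count how often the grammatical form is ranked first over a list of items. The sketch below is only an illustration under assumptions: `agreement_accuracy` and `stub_rank` are hypothetical names, and the stub ranker stands in for a real model so the block runs without downloading anything.

```python
from typing import Callable, List, Sequence, Tuple

# (masked sentence, grammatical option, ungrammatical option)
AgreementItem = Tuple[str, str, str]


def agreement_accuracy(
    items: Sequence[AgreementItem],
    rank_fn: Callable[[str, List[str]], List[str]],
) -> float:
    """Fraction of items where rank_fn puts the grammatical option first."""
    if not items:
        return 0.0
    correct = 0
    for masked, good, bad in items:
        ranked = rank_fn(masked, [bad, good])
        if ranked[0] == good:
            correct += 1
    return correct / len(items)


def stub_rank(masked: str, options: List[str]) -> List[str]:
    # Stand-in for a real ranker, for illustration only: prefers longer options.
    return sorted(options, key=len, reverse=True)


items = [
    ("the author that the guard likes ***mask***", "laughs", "laugh"),
    ("accusations of abusive sockpuppetry from a trusted source ***mask*** "
     "a serious chilling effect .", "have", "has"),
]
print(agreement_accuracy(items, stub_rank))
```

With a loaded model, `fb.rank` could presumably be passed in place of `stub_rank`, since it takes a masked string and a list of options and returns the options in ranked order.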

### Using FitBERT with a spell corrector

Example of refining the output of a [word-vector-based spell checker](https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26) with BERT. This would also work with something like Hunspell.

In [11]:
input = "We predict the following issues will ocur."
masked_string = "We predict the following issues will ***mask*** ."
# misspelling vector subtraction gives the following options for "ocur"
options = ['ocur', 'occur', 'arise', 'happen', 'reliably']
fb.fitb(masked_string, options=options)

'We predict the following issues will arise .'

In [12]:
input = "We predict the following issues will ocur."
masked_string = "We predict the following issues will ***mask*** ."
# misspelling vector subtraction gives the following options for "ocur", but filter them through a Levenshtein distance threshold:
options = ['ocur', 'occur']
fb.fitb(masked_string, options=options)

'We predict the following issues will occur .'
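The Levenshtein filtering step is not shown in the cell above. A minimal sketch of one way to do it in pure Python follows; the threshold of 1 and the helper names `levenshtein` and `filter_by_distance` are assumptions made for illustration, not something taken from the spell checker post.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def filter_by_distance(word: str, candidates: List[str], max_distance: int = 1) -> List[str]:
    """Keep only candidates within max_distance edits of the misspelled word."""
    return [c for c in candidates if levenshtein(word, c) <= max_distance]


options = ['ocur', 'occur', 'arise', 'happen', 'reliably']
print(filter_by_distance('ocur', options, max_distance=1))
# → ['ocur', 'occur']
```

Filtering first keeps semantically plausible but unrelated words such as `arise` out of the options, so the model only chooses among spellings close to what was typed.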

### (Work in Progress) Using FitBERT for truecasing

An efficient implementation will require:

1. Fixing the bug where probabilities returned by `fb.rank(with_prob=True)` aren't in the same order as the tokens returned
2. Tensorizing the handling of multi-token masks
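For the first item, the usual way to keep tokens and probabilities aligned is to sort them as pairs rather than separately. The sketch below shows that idea only; `rank_with_aligned_probs` is a hypothetical helper, the numbers are made up, and this is not FitBert's actual code or fix.

```python
from typing import List, Sequence, Tuple


def rank_with_aligned_probs(
    options: Sequence[str], probs: Sequence[float]
) -> Tuple[List[str], List[float]]:
    """Sort options by probability, descending, keeping each probability
    attached to its option."""
    pairs = sorted(zip(options, probs), key=lambda pair: pair[1], reverse=True)
    ranked = [option for option, _ in pairs]
    ranked_probs = [prob for _, prob in pairs]
    return ranked, ranked_probs


# Illustrative numbers only, not model output.
ranked, ranked_probs = rank_with_aligned_probs(
    ["buff", "handsome", "strong"], [0.1, 0.7, 0.2]
)
print(ranked, ranked_probs)
# → ['handsome', 'strong', 'buff'] [0.7, 0.2, 0.1]
```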

In [0]:
# import only the classes used below instead of a wildcard import
from transformers import AlbertForMaskedLM, AlbertTokenizer

In [14]:
new_tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v2')
new_bert = AlbertForMaskedLM.from_pretrained('albert-xxlarge-v2')

fb2 = FitBert(model=new_bert, tokenizer=new_tokenizer, disable_gpu=False)


using model: bert-large-uncased
device: cpu


In [0]:
def change_case(word: str):
    if not word.isalpha():
        return False

    if word.lower() == word:
        return word.capitalize()
    elif word.capitalize() == word:
        return word.lower()
    else:
        # camelCase, ALLUPPER, sPoNGeBoB, etc
        return False

In [16]:
# Naive implementation handles some cases
# Multiple-choice question; the leading '#' in the option list appears to mark
# the answer key, as in chuan.txt further down.
# Note: the mask token comes from fb2, but the ranking call below uses fb.
masked_string = f'{fb2.mask_token} more than 2000 minerals are known, nearly all rocks are formed from seven mineral groups.'
# #(A) Although
# (B) However
# (C) Despite
# (D) Since
#masked_string = f"These {fb2.mask_token} some common grammatical mistakes ."
options = ["Since", "Although", "However", "Despite"]
fb.rank_with_prob(masked_string, options=options)


(['Although', 'Since', 'Despite', 'However'],
 [0.7068212032318115,
  0.007502361666411161,
  0.0011061098193749785,
  0.00021216872846707702])

In [17]:
masked_string = f"Our friends are expected to assume the burden of their own defense,{fb2.mask_token} they are competent to do."
options = ["which we are certain", "that we are certain of ", "of which we are sure", "for which we are sure" ]
fb.rank_with_prob(masked_string, options=options)

(['which we are certain',
  'of which we are sure',
  'for which we are sure',
  'that we are certain of'],
 [7.986510338541171e-14,
  8.051006581587426e-15,
  3.708401732536897e-15,
  5.097826745128086e-21])

In [18]:
# but not others
masked_string = f"These are some Common {fb2.mask_token} mistakes ."
options = ["grammatical", "Grammatical"]
fb2.fitb(masked_string, options=options)


'These are some Common grammatical mistakes .'

In [19]:
'''# but not others
He ---- a gift out of his suitcase and handed it to his son.
A demonstrated
B embraced
C produced
D exhibited

Having messed around for a lengthy period of time, he eventually made up his mind to put his ___ to the wheel.
A hand
B shoulder
C knee
D foot
'''

masked_string = "Having messed around for a lengthy period of time, he eventually made up his mind to put his ***mask*** to the wheel."
options = ["hand", "shoulder", "knee", "foot"]
fb2.fitb(masked_string, options=options)
fb.rank_with_prob(masked_string, options=options)

(['hand', 'shoulder', 'foot', 'knee'],
 [0.24507983028888702,
  0.038622066378593445,
  0.03832578286528587,
  0.0006917749415151775])

In [20]:
import re

# chuan.txt layout, as implied by the patterns below: "##" marks a question,
# "(a)".."(d)" mark its options, and a leading "#" on an option marks the answer
qus, A, B, C, D, ans = [], [], [], [], [], []
option_prefix = r'#\([a-z]\) |\([a-z]\) '
with open('chuan.txt', 'r') as ques:
    for line in ques:
        line = line.replace('\n', '')
        if '##' in line:
            qus.append(line.replace('##', ''))
        elif re.search(r'#\(', line):
            ans.append(re.sub(r'#\([a-z]\) ', '', line))

        if re.search(r'\(a\)', line):
            A.append(re.sub(option_prefix, '', line))
        elif re.search(r'\(b\)', line):
            B.append(re.sub(option_prefix, '', line))
        elif re.search(r'\(c\)', line):
            C.append(re.sub(option_prefix, '', line))
        elif re.search(r'\(d\)', line):
            D.append(re.sub(option_prefix, '', line))
print(len(qus))
print(len(A))

# ketqua ("results"): filled-in sentences; diem ("scores"): ranked options with probabilities
ketqua, diem = [], []
for dem, qua in enumerate(qus):
    opt = [A[dem], B[dem], C[dem], D[dem]]
    ketqua.append(fb2.fitb(qua, options=opt))
    # NOTE: this ranks the options against `masked_string` left over from the
    # previous cell rather than against `qua`, which is probably unintended;
    # the scores printed below reflect that
    diem.append(fb.rank_with_prob(masked_string, options=opt))

print(ketqua)
print(diem)

4
4
[' All living things consist of one of more units of living substance called protoplasm.', " A newspaper's political cartoons serve as capsule versions of editorial opinion.", ' Tornadoes almost never occur west of the Rocky Mountains.', ' In the last one hundred years, the advent of the telephone, radio, and television has made rapid long-distance communication possible.']
[(['All living things consisting of', 'All living things consist of', 'In all living things consisting of', 'Although all living things that consist of'], [6.989771905956681e-29, 2.8477447440115484e-31, 7.828787043153058e-33, 1.2733703356993763e-34]), (['serve', 'serve as', 'in serving', 'be served'], [6.19283670095504e-25, 4.80921007559468e-28, 1.4242598049912585e-28, 3.643840714136998e-35]), (['Tornadoes almost never occur', 'Tornadoes never almost occur', 'Never tornadoes almost occur', 'Tornadoes almost occur never'], [3.657903748751566e-36, 8.434503693107926e-40, 1.3707972661510393e-40, 1.2971797328457096e-

In [21]:
for a in ketqua:
    print(a)
for a in diem:
    for b in a:
        print(b)

 All living things consist of one of more units of living substance called protoplasm.
 A newspaper's political cartoons serve as capsule versions of editorial opinion.
 Tornadoes almost never occur west of the Rocky Mountains.
 In the last one hundred years, the advent of the telephone, radio, and television has made rapid long-distance communication possible.
['All living things consisting of', 'All living things consist of', 'In all living things consisting of', 'Although all living things that consist of']
[6.989771905956681e-29, 2.8477447440115484e-31, 7.828787043153058e-33, 1.2733703356993763e-34]
['serve', 'serve as', 'in serving', 'be served']
[6.19283670095504e-25, 4.80921007559468e-28, 1.4242598049912585e-28, 3.643840714136998e-35]
['Tornadoes almost never occur', 'Tornadoes never almost occur', 'Never tornadoes almost occur', 'Tornadoes almost occur never']
[3.657903748751566e-36, 8.434503693107926e-40, 1.3707972661510393e-40, 1.2971797328457096e-40]
['one hundred years late
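The contents of `chuan.txt` are not shown, so the layout has to be inferred from the regular expressions in the parsing cell. The sketch below parses an assumed layout into one record per question, which also keeps the answer key next to its options. The `Question` class, `parse_questions`, the anchored pattern, and the sample text (including which letter carries the `#` marker) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Question:
    text: str
    options: List[str] = field(default_factory=list)
    answer: Optional[str] = None


# Optional leading "#" (answer marker), then "(a)".."(d)", then the option text.
OPTION_RE = re.compile(r'^(#?)\(([a-d])\)\s*(.*)$')


def parse_questions(lines: Iterable[str]) -> List[Question]:
    """Group '##' question lines with the option lines that follow them."""
    questions: List[Question] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith('##'):
            questions.append(Question(text=line[2:].strip()))
            continue
        match = OPTION_RE.match(line)
        if match and questions:
            is_answer, _letter, option = match.groups()
            questions[-1].options.append(option)
            if is_answer:
                questions[-1].answer = option
    return questions


# Assumed sample in the inferred layout, not the real file.
sample = """\
## ***mask*** west of the Rocky Mountains.
#(a) Tornadoes almost never occur
(b) Tornadoes never almost occur
(c) Never tornadoes almost occur
(d) Tornadoes almost occur never
""".splitlines()

parsed = parse_questions(sample)
print(parsed[0].text)
print(parsed[0].options)
print(parsed[0].answer)
```

Keeping each question, its options, and its answer in one record avoids indexing parallel lists by position, and makes it straightforward to compare a model's top choice with `answer`.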

In [22]:
# we can loop through and look at each pairwise comparison to see what's going on
# except we can't trust the order of the probabilities, because of a bug in fitbert
orig = "These are some Common Grammatical mistakes ."

tokens = orig.split(" ")
for i, token in enumerate(tokens):
    # NOTE: this joins the pieces without spaces around the mask and keeps token i
    # in the right-hand context; the next cell builds the masked string properly
    masked_string = " ".join(tokens[:i]) + fb2.mask_token + " ".join(tokens[i:])
    changed = change_case(token)
    if changed:
        options = [changed, token]
        ranked, probs = fb2.rank(masked_string, options, with_prob=True)
        print(ranked, probs)

['These', 'these'] [0.0006648972048424184, 0.0006648972048424184]
['are', 'Are'] [0.0010659873951226473, 0.0010659873951226473]
['Some', 'some'] [0.0003039452130906284, 0.0003039452130906284]
['Common', 'common'] [0.005290379747748375, 0.005290379747748375]
['grammatical', 'Grammatical'] [0.09503238648176193, 0.09503238648176193]
['mistakes', 'Mistakes'] [0.012089312076568604, 0.012089312076568604]
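The loop above builds its masked string by concatenating `tokens[:i]`, the mask token, and `tokens[i:]` directly, so the mask is glued to its neighbours and the token being tested stays in the sentence. A small sketch of the difference, using the literal `***mask***` seen in the outputs above as the mask token; `mask_token_at` is a hypothetical helper:

```python
from typing import List

MASK = "***mask***"  # mask token as it appears in the outputs above


def mask_token_at(tokens: List[str], i: int, mask: str = MASK) -> str:
    """Replace the token at position i with the mask, space-separated."""
    return " ".join(tokens[:i] + [mask] + tokens[i + 1:])


tokens = "These are some Common Grammatical mistakes .".split(" ")

# Construction used in the loop above, for position 3 ("Common"):
concatenated = " ".join(tokens[:3]) + MASK + " ".join(tokens[3:])
print(concatenated)
# → These are some***mask***Common Grammatical mistakes .

# Replacing the token instead:
print(mask_token_at(tokens, 3))
# → These are some ***mask*** Grammatical mistakes .
```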


In [23]:
# Using greedy decoding works pretty well

orig = "These are some Common Grammatical mistakes ."

tokens = orig.split(" ")
for i, token in enumerate(tokens):
    if token.isalpha():
        masked_string = " ".join(tokens[0:i]) + " " + fb2.mask_token + " " + " ".join(tokens[i + 1:])
        print("the masked string is: ", masked_string)
        changed = change_case(token)
        if changed:
            options = [changed, token]
            ranked, probs = fb2.rank(masked_string, options, with_prob=True)
            print(ranked, probs)
            if ranked[0] == changed:
                # should use probs, but there is a bug where it is sorted before being returned :facepalm:
                tokens[i] = changed
                print("the string is now:", " ".join(tokens))
print("final version is")
print(" ".join(tokens))

the masked string is:   ***mask*** are some Common Grammatical mistakes .
['These', 'these'] [0.1498614400625229, 0.1498614400625229]
the masked string is:  These ***mask*** some Common Grammatical mistakes .
['are', 'Are'] [0.8028624057769775, 0.8028624057769775]
the masked string is:  These are ***mask*** Common Grammatical mistakes .
['Some', 'some'] [0.14430972933769226, 0.14430972933769226]
the string is now: These are Some Common Grammatical mistakes .
the masked string is:  These are Some ***mask*** Grammatical mistakes .
['Common', 'common'] [0.3928169906139374, 0.3928169906139374]
the masked string is:  These are Some Common ***mask*** mistakes .
['grammatical', 'Grammatical'] [0.036092061549425125, 0.036092061549425125]
the string is now: These are Some Common grammatical mistakes .
the masked string is:  These are Some Common grammatical ***mask*** .
['mistakes', 'Mistakes'] [0.19619633257389069, 0.19619633257389069]
final version is
These are Some Common grammatical mistake

In [24]:
# I can't tell if this behaviour is ok
# Wikipedia says that this correction is good
# I thought this would be hard for the model...

orig = "I 'm really feeling Panic! At The Disco ."

tokens = orig.split(" ")
for i, token in enumerate(tokens):
    if i>0:
        if token.isalpha():
            masked_string = " ".join(tokens[0:i]) + " " + fb2.mask_token + " " + " ".join(tokens[i + 1:])
            changed = change_case(token)
            if changed:
                options = [changed, token]
                ranked, probs = fb2.rank(masked_string, options, with_prob=True)
                if ranked[0] == changed:
                    # should use probs, but there is a bug where it is sorted before being returned :facepalm:
                    tokens[i] = changed
print("final version is")
print(" ".join(tokens))

final version is
I 'm Really feeling Panic! at the Disco .


In [25]:
# Truecasing is nearly impossible if a product name is also a common noun

orig = "Create Styleguides to standardize a writing style across all your content — or to manage distinct styles for different audiences ."

tokens = orig.split(" ")
for i, token in enumerate(tokens):
    if token.isalpha() and i>0:
        masked_string = " ".join(tokens[0:i]) + " " + fb2.mask_token + " " + " ".join(tokens[i + 1:])
        changed = change_case(token)
        if changed:
            options = [changed, token]
            ranked, probs = fb2.rank(masked_string, options, with_prob=True)
            if ranked[0] == changed:
                # should use probs, but there is a bug where it is sorted before being returned :facepalm:
                tokens[i] = changed
print("final version is")
print(" ".join(tokens))

final version is
Create styleguides To standardize A Writing style across All Your content — or To Manage distinct styles for Different Audiences .
