# Masked Language Modeling

Because BERT (Bidirectional Encoder Representations from Transformers) models are bidirectional, a word in the middle of a sentence can be "masked" and the model can then propose replacements for it, taking the whole sentence into account. In a sense this is similar to an image denoising autoencoder, where we encode and then decode the input and try to reconstruct the original signal without the noise. Here we add "noise" by removing information from the text, replacing a word with a special token (in this case `<mask>`), and then "denoise" the sentence by guessing the most likely word for that position.
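As a rough illustration of the "add noise" step, the sketch below hides one word of a sentence and keeps the hidden word as the reconstruction target. The helper name `corrupt`, the example sentence, and the fixed seed are assumptions made for this sketch; it only shows how a (masked input, target) pair could be formed, not how any particular model is trained.

```python
import random

MASK = "<mask>"

def corrupt(sentence: str, rng: random.Random) -> tuple[str, str]:
    """Replace one randomly chosen whitespace-separated word with MASK.

    Returns the corrupted sentence and the word that was hidden.
    """
    words = sentence.split()
    position = rng.randrange(len(words))
    target = words[position]
    words[position] = MASK
    return " ".join(words), target

rng = random.Random(0)  # local generator, so the sketch is reproducible
original = "The cat sat on the mat"
masked, target = corrupt(original, rng)
print(masked)   # the "noisy" input
print(target)   # the word a model would be asked to recover

# Putting the hidden word back reconstructs the original sentence.
restored = masked.replace(MASK, target, 1)
print(restored == original)
```

A masked language model plays the role of the "denoiser": given only `masked`, it ranks candidate words for the `<mask>` position.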

In [1]:
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

--2024-03-12 14:07:14--  https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 172.67.213.166, 104.21.23.210, 2606:4700:3030::ac43:d5a6, ...
Connecting to lazyprogrammer.me (lazyprogrammer.me)|172.67.213.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5085081 (4,8M) [text/csv]
Saving to: ‘bbc_text_cls.csv’


2024-03-12 14:07:15 (20,5 MB/s) - ‘bbc_text_cls.csv’ saved [5085081/5085081]



In [2]:
from transformers import pipeline
import torch
import numpy as np
import pandas as pd
import textwrap
from pprint import pprint
import re
import random

In [3]:
if torch.cuda.is_available():
    mlm = pipeline("fill-mask", device=torch.cuda.current_device())
else:
    mlm = pipeline("fill-mask")

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
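For a single `<mask>`, a fill-mask pipeline returns a list of candidate dictionaries with the keys `score`, `token`, `token_str` and `sequence`, usually ordered from most to least likely. The sketch below works on a hand-written stand-in for such a result, so it runs without downloading a model; the sentences, scores and token ids are invented for illustration, and `top_candidates` is a hypothetical helper.

```python
# Made-up stand-in for the result of a fill-mask call on
# "Paris is the <mask> of France." (all values are invented).
predictions = [
    {"score": 0.90, "token": 101, "token_str": " capital",
     "sequence": "Paris is the capital of France."},
    {"score": 0.05, "token": 102, "token_str": " heart",
     "sequence": "Paris is the heart of France."},
    {"score": 0.01, "token": 103, "token_str": " centre",
     "sequence": "Paris is the centre of France."},
]

def top_candidates(predictions, k=2):
    """Return the k highest-scoring (word, score) pairs."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    # token_str may carry a leading space, depending on the tokenizer.
    return [(p["token_str"].strip(), p["score"]) for p in ranked[:k]]

for word, score in top_candidates(predictions):
    print(f"{word}: {score:.2f}")

# The filled-in sentence of the best candidate:
print(predictions[0]["sequence"])
```

Looking at several candidates rather than only the first can be useful when the top suggestion is plausible but not the original word.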


In [4]:
# Split sentences, source: https://stackoverflow.com/questions/4576077/how-can-i-split-a-text-into-sentences
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead 
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    #text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences
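The function above relies on a placeholder technique: periods that do not end a sentence are temporarily replaced with `<prd>`, real sentence ends are tagged with `<stop>`, and the placeholders are restored before splitting. A stripped-down sketch of that idea, handling only a few titles and decimal numbers, might look as follows; `split_simple` and the example text are assumptions for illustration and are far less thorough than the full function.

```python
import re

def split_simple(text: str) -> list[str]:
    """Minimal placeholder-based sentence splitter (illustrative only)."""
    # 1. Protect periods that do not end a sentence.
    text = re.sub(r"\b(Mr|Mrs|Ms|Dr)[.]", r"\1<prd>", text)
    text = re.sub(r"([0-9])[.]([0-9])", r"\1<prd>\2", text)
    # 2. Tag every remaining terminator as a sentence boundary.
    text = re.sub(r"([.?!])", r"\1<stop>", text)
    # 3. Restore the protected periods and split on the boundary marker.
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]

sentences = split_simple("Dr. Smith paid 3.5 dollars. Was it worth it? He thinks so!")
print(sentences)
# ['Dr. Smith paid 3.5 dollars.', 'Was it worth it?', 'He thinks so!']
```

As with the full version, input that already contains the literal strings `<prd>` or `<stop>` would be split incorrectly.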

In [5]:
df = pd.read_csv('bbc_text_cls.csv')
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [6]:
# Check which categories are present in the dataset
labels = set(df['labels'])
print(f"Unique set of labels: {labels}")

Unique set of labels: {'sport', 'tech', 'entertainment', 'business', 'politics'}


In [7]:
# Extract only texts related to the 'business' category
label = 'business'
texts = df[df['labels'] == label]['text']
texts.head()

0    Ad sales boost Time Warner profit\n\nQuarterly...
1    Dollar gains on Greenspan speech\n\nThe dollar...
2    Yukos unit buyer faces loan claim\n\nThe owner...
3    High fuel prices hit BA's profits\n\nBritish A...
4    Pernod takeover talk lifts Domecq\n\nShares in...
Name: text, dtype: object

In [8]:
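np.random.seed(1234)
# The masking loop below picks words with the standard-library `random` module, so seed it as well
random.seed(1234)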

In [9]:
def replace_target_with_mask(text, target):
    """
    Replaces a given word in the text with <mask>. Only full matches are considered, i.e.
    text="cat" would not match target "at".

    Parameters
    ----------
    text : str
        A string where the target word is replaced.
    target : str
        A string defining what is going to be replaced.

    Returns
    -------
    str
        A string where the target word has been replaced with <mask>
    """
    escaped_target = re.escape(target)
    # Negative lookbehind/lookahead ensure the target is neither preceded nor followed by a word character,
    # so it matches only as a whole word (including at the very start or end of the text).
    pattern = rf"(?<!\w){escaped_target}(?!\w)"

    # Performing the substitution
    return re.sub(pattern, "<mask>", text, count=1)
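A few quick checks may help make the matching rules concrete. The snippet repeats the function in compact form so it runs on its own; the example sentences are made up for this sketch.

```python
import re

def replace_target_with_mask(text, target):
    # Compact copy of the function above: whole-word match, first occurrence only.
    pattern = rf"(?<!\w){re.escape(target)}(?!\w)"
    return re.sub(pattern, "<mask>", text, count=1)

# "at" inside "cat", "sat" and "gate" is skipped; only the standalone word matches.
print(replace_target_with_mask("The cat sat at the gate", "at"))
# The cat sat <mask> the gate

# No standalone occurrence: the text comes back unchanged.
print(replace_target_with_mask("no cat here", "at"))
# no cat here

# Only the first occurrence is replaced (count=1).
print(replace_target_with_mask("the cat and the dog", "the"))
# <mask> cat and the dog

# A target that includes punctuation takes the punctuation with it.
print(replace_target_with_mask("The plans were announced.", "announced."))
# The plans were <mask>
```

The last case matters when the target comes from `str.split()`, because a word at the end of a sentence then still carries its trailing period.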

In [10]:
# Choose a random article, split it into title and body, and in each sentence replace
# one word with <mask>

# Choose a random article from the body of texts and split it into sentences
index = np.random.choice(texts.shape[0])
text = texts.iloc[index]
title, body = text.split('\n', 1)
sentences = split_into_sentences(textwrap.fill(body, replace_whitespace=True, fix_sentence_endings=True))

# Iterate over the sentences and in each replace a random word with "<mask>"
# Original text
original_text = title + "\n\n"
# Masked text
masked_text = title + "\n\n"
# Generated text
generated_text = title + "\n\n"

# Loop over the sentences
for sentence in sentences:
    sentence_words = sentence.split()
    replace_me = random.choice(sentence_words)
    masked_sentence = replace_target_with_mask(sentence, replace_me)
    generated_text += mlm(masked_sentence)[0]['sequence'] + "\n"
    masked_text += masked_sentence + "\n"
    original_text += sentence + "\n"
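Because the loop picks the word from `sentence.split()`, a word at the end of a sentence includes its punctuation, and the mask then replaces the period too. One possible alternative, sketched below under stated assumptions, is to choose among regex matches of word characters and mask the chosen span directly. The helper `mask_random_word`, the word pattern (letters, digits and underscores, optionally joined by hyphens or apostrophes), and the seed are choices made for this sketch, not part of the notebook above.

```python
import random
import re

# Assumed definition of a "word": runs of word characters, optionally joined
# by hyphens or apostrophes (e.g. "plane-making", "Tellier's").
WORD_PATTERN = re.compile(r"\w+(?:[-']\w+)*")

def mask_random_word(sentence: str, rng: random.Random) -> tuple[str, str]:
    """Mask one randomly chosen word, leaving surrounding punctuation intact.

    Returns the masked sentence and the word that was hidden.
    """
    matches = list(WORD_PATTERN.finditer(sentence))
    if not matches:
        # Nothing to mask (empty or punctuation-only sentence).
        return sentence, ""
    chosen = rng.choice(matches)
    masked = sentence[:chosen.start()] + "<mask>" + sentence[chosen.end():]
    return masked, chosen.group(0)

rng = random.Random(1234)
sentence = "Plans to cut the workforce were announced."
masked, hidden = mask_random_word(sentence, rng)
print(masked)
print(hidden)
```

Unlike `replace_target_with_mask`, this masks the exact position that was drawn rather than the first occurrence of the drawn word, and it returns empty sentences unchanged instead of raising an error.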



In [11]:
print("-----------------")
print("- ORIGINAL TEXT -")
print("-----------------")
print(original_text)

-----------------
- ORIGINAL TEXT -
-----------------
Bombardier chief to leave company

Shares in train and plane-making giant Bombardier have fallen to a 10-year low following the departure of its chief executive and two members of the board.
Paul Tellier, who was also Bombardier's president, left the company amid an ongoing restructuring.
Laurent Beaudoin, part of the family that controls the Montreal-based firm, will take on the role of CEO under a newly created management structure.
Analysts said the resignations seem to have stemmed from a boardroom dispute.
Under Mr Tellier's tenure at the company, which began in January 2003, plans to cut the worldwide workforce of 75,000 by almost a third by 2006 were announced.
The firm's snowmobile division and defence services unit were also sold and Bombardier started the development of a new aircraft seating 110 to 135 passengers.
Mr Tellier had indicated he wanted to stay at the world's top train maker and third largest manufacturer of c

In [12]:
print("---------------")
print("- MASKED TEXT -")
print("---------------")
print(masked_text)

---------------
- MASKED TEXT -
---------------
Bombardier chief to leave company

Shares in train <mask> plane-making giant Bombardier have fallen to a 10-year low following the departure of its chief executive and two members of the board.
Paul Tellier, who was also Bombardier's president, left the <mask> amid an ongoing restructuring.
Laurent Beaudoin, part of the family that <mask> the Montreal-based firm, will take on the role of CEO under a newly created management structure.
Analysts said the resignations seem to have stemmed from a <mask> dispute.
Under Mr Tellier's tenure at the company, which began in January 2003, plans to cut the worldwide workforce of 75,000 by almost a third by 2006 were <mask>
The <mask> snowmobile division and defence services unit were also sold and Bombardier started the development of a new aircraft seating 110 to 135 passengers.
Mr Tellier had indicated <mask> wanted to stay at the world's top train maker and third largest manufacturer of civil airc

In [13]:
print("------------------")
print("- GENERATED TEXT -")
print("------------------")
print(generated_text)

------------------
- GENERATED TEXT -
------------------
Bombardier chief to leave company

Shares in train passenger plane-making giant Bombardier have fallen to a 10-year low following the departure of its chief executive and two members of the board.
Paul Tellier, who was also Bombardier's president, left the company amid an ongoing restructuring.
Laurent Beaudoin, part of the family that owns the Montreal-based firm, will take on the role of CEO under a newly created management structure.
Analysts said the resignations seem to have stemmed from a personal dispute.
Under Mr Tellier's tenure at the company, which began in January 2003, plans to cut the worldwide workforce of 75,000 by almost a third by 2006 were announced
The Airbus snowmobile division and defence services unit were also sold and Bombardier started the development of a new aircraft seating 110 to 135 passengers.
Mr Tellier had indicated he wanted to stay at the world's top train maker and third largest manufacturer o