<a href="https://colab.research.google.com/github/AllejandroSousa/AllejandroSousa/blob/main/examples/nlp/ipynb/masked_language_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# End-to-end Masked Language Modeling with BERT

**Team:** [Allejandro Sousa](https://github.com/AllejandroSousa), [José Samuel](https://github.com/Samuel-IA7), [Vinícius Germano]()<br>
**Date created:** 2025/04/03<br>
**Last modified:** 2025/04/03<br>
**Description:** "Implement a Masked Language Model (MLM) using BERT and fine-tune it on the Stack Overflow Questions/Answers dataset. Additionally, implement a Markov chain for the same problem and compare the results."

## Introduction

Masked Language Modeling is a fill-in-the-blank task,
where a model uses the context words surrounding a mask token to try to predict what the
masked word should be.

For an input that contains one or more mask tokens,
the model will generate the most likely substitution for each.

Example:

- Input: "SyntaxError: Unexpected [mask]."
- Output: "SyntaxError: Unexpected token."

Masked language modeling is a great way to train a language
model in a self-supervised setting (without human-annotated labels).
Such a model can then be fine-tuned to accomplish various supervised
NLP tasks.

This example shows how to build a BERT model from scratch,
train it with the masked language modeling task,
and then fine-tune it on a binary classification task
(labelled "sentiment" in the code, where the label is 1 when a question's score is at least 1).
We also implement a Markov chain for the same masked language modeling and classification tasks,
so that the results of the two methods can be compared.

We will use the Keras `TextVectorization` and `MultiHeadAttention` layers
to create a BERT Transformer-Encoder network architecture.
The Markov chain model instead uses transition probabilities between consecutive words
to predict masked tokens and to generate features for classification.

In [None]:
import os

os.environ["KERAS_BACKEND"] = "torch"  # or jax, or tensorflow

import keras_hub

import keras
from keras import layers
from keras.layers import TextVectorization

from dataclasses import dataclass
import pandas as pd
import numpy as np
import glob
import re
from pprint import pprint

## Set-up Configuration

In [None]:

@dataclass
class Config:
    MAX_LEN = 256
    BATCH_SIZE = 32
    LR = 0.001
    VOCAB_SIZE = 30000
    EMBED_DIM = 128
    NUM_HEAD = 8  # used in bert model
    FF_DIM = 128  # used in bert model
    NUM_LAYERS = 1


config = Config()

## Load the Data

First, we will download the Stack Overflow dataset and load it into a Pandas DataFrame.

### Important: Kaggle API Setup

If you haven't uploaded the `kaggle.json` file to your Colab environment, follow these steps:

1. Go to [Kaggle](https://www.kaggle.com).
2. Log in to your account (or create one if you don’t have one).
3. Navigate to **Settings**.
4. Locate the **API** section and click on **Create New Token**.
5. A `kaggle.json` file will be downloaded.
6. Upload this file to your Colab environment before running the code below.

Once the `kaggle.json` file is uploaded, proceed with the code execution.
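
Before running the shell commands, it can help to confirm that the uploaded token is in place. The sketch below is a minimal check, assuming the token is a JSON object with `username` and `key` fields; the helper name `check_kaggle_token` is hypothetical and not part of the Kaggle tooling.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return (ok, message) describing whether the token file looks usable.

    Hypothetical helper for illustration; it only inspects the local file
    and never contacts Kaggle.
    """
    token_path = Path(path)
    if not token_path.is_file():
        return False, f"{token_path} not found; upload it first"
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return False, f"{token_path} is not valid JSON"
    if not isinstance(token, dict):
        return False, f"{token_path} should contain a JSON object"
    missing = {"username", "key"} - set(token)
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    return True, "token looks usable"


ok, message = check_kaggle_token()
print(message)
```

If the check fails, upload the file again before moving it into `~/.kaggle`.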


In [None]:
!pip install -q kaggle

# Set up the kaggle.json file (after manual upload)
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the dataset
!kaggle datasets download -d stackoverflow/stacksample

# Unzip
!unzip stacksample.zip

In [None]:
from sklearn.model_selection import train_test_split

# Load the dataset (assuming 'Questions.csv' was downloaded from Kaggle)
df = pd.read_csv("Questions.csv", encoding='latin-1')

# Sample 100,000 examples and create binary labels
df = df.sample(n=100_000, random_state=42)
df["sentiment"] = df["Score"].apply(lambda x: 1 if x >= 1 else 0)  # 1 = positive, 0 = negative
df["review"] = df["Title"] + " " + df["Body"]  # Combine title and body as the main text

# Split into train and test sets (50/50, as with IMDB)
train_df, test_df = train_test_split(df, test_size=0.5, random_state=42)
all_data = pd.concat([train_df, test_df], ignore_index=True)

# Check the sizes
print(f"Train: {len(train_df)}, Test: {len(test_df)}")

Train: 50000, Test: 50000


## Dataset preparation

We will use the `TextVectorization` layer to vectorize the text into integer token ids.
It transforms a batch of strings into either
a sequence of token indices (one sample = 1D array of integer token indices, in order)
or a dense representation (one sample = 1D array of float values encoding an unordered set of tokens).

Below, we define 3 preprocessing functions.

1.  The `get_vectorize_layer` function builds the `TextVectorization` layer.
2.  The `encode` function encodes raw text into integer token ids.
3.  The `get_masked_input_and_labels` function will mask input token ids.
It masks 15% of all input tokens in each sequence at random.
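
To see how the 15% selection breaks down, the sketch below works through the proportions used by `get_masked_input_and_labels`: 90% of the selected tokens are first replaced with `[mask]`, and 1/9 of those are then overwritten with a random token. The function name `masking_shares` and its parameters are illustrative only.

```python
from fractions import Fraction


def masking_shares(select=Fraction(15, 100), to_mask=Fraction(9, 10),
                   to_random=Fraction(1, 9)):
    """Expected share of all tokens that ends up in each masking outcome."""
    replaced = select * to_mask           # first overwritten with [mask]
    random_token = replaced * to_random   # then overwritten with a random id
    mask_token = replaced - random_token  # still [mask] after both steps
    unchanged = select - replaced         # prediction target, input left as-is
    return {"mask": mask_token, "random": random_token, "unchanged": unchanged}


shares = masking_shares()
for name, value in shares.items():
    print(f"{name}: {float(value):.3f}")
# mask: 0.120
# random: 0.015
# unchanged: 0.015
```

In other words, of the selected tokens, 80% become `[mask]`, 10% become a random token, and 10% are left unchanged.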

In [None]:
# For data pre-processing and tf.data.Dataset
import tensorflow as tf


def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(r"!#$%&'()*+,-./:;<=>?@\^_`{|}~"), ""
    )


def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=["[MASK]"]):
    """Build Text vectorization layer

    Args:
      texts (list): List of string i.e input texts
      vocab_size (int): vocab size
      max_seq (int): Maximum sequence length.
      special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]'].

    Returns:
        layers.Layer: Return TextVectorization Keras Layer
    """
    vectorize_layer = TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        standardize=custom_standardization,
        output_sequence_length=max_seq,
    )
    vectorize_layer.adapt(texts)

    # Insert mask token in vocabulary
    vocab = vectorize_layer.get_vocabulary()
    vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[mask]"]
    vectorize_layer.set_vocabulary(vocab)
    return vectorize_layer


vectorize_layer = get_vectorize_layer(
    all_data.review.values.tolist(),
    config.VOCAB_SIZE,
    config.MAX_LEN,
    special_tokens=["[mask]"],
)

# Get mask token id for masked language model
mask_token_id = vectorize_layer(["[mask]"]).cpu().numpy()[0][0]


def encode(texts):
    encoded_texts = vectorize_layer(texts)
    return encoded_texts.cpu().numpy()


def get_masked_input_and_labels(encoded_texts):
    # 15% BERT masking
    inp_mask = np.random.rand(*encoded_texts.shape) < 0.15
    # Do not mask special tokens
    inp_mask[encoded_texts <= 2] = False
    # Set targets to -1 by default, it means ignore
    labels = -1 * np.ones(encoded_texts.shape, dtype=int)
    # Set labels for masked tokens
    labels[inp_mask] = encoded_texts[inp_mask]

    # Prepare input
    encoded_texts_masked = np.copy(encoded_texts)
    # Set input to [MASK] which is the last token for the 90% of tokens
    # This means leaving 10% unchanged
    inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)
    encoded_texts_masked[inp_mask_2mask] = (
        mask_token_id  # mask token is the last in the dict
    )

    # Set 10% to a random token
    inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)
    encoded_texts_masked[inp_mask_2random] = np.random.randint(
        3, mask_token_id, inp_mask_2random.sum()
    )

    # Prepare sample_weights to pass to .fit() method
    sample_weights = np.ones(labels.shape)
    sample_weights[labels == -1] = 0

    # y_labels would be same as encoded_texts i.e input tokens
    y_labels = np.copy(encoded_texts)

    return encoded_texts_masked, y_labels, sample_weights


# We have 50000 examples for training
x_train = encode(train_df.review.values)  # encode reviews with vectorizer
y_train = train_df.sentiment.values
train_classifier_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(1000)
    .batch(config.BATCH_SIZE)
)

# We have 50000 examples for testing
x_test = encode(test_df.review.values)
y_test = test_df.sentiment.values
test_classifier_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(
    config.BATCH_SIZE
)

# Dataset for end to end model input (will be used at the end)
test_raw_classifier_ds = test_df

# Prepare data for masked language model
x_all_review = encode(all_data.review.values)
x_masked_train, y_masked_labels, sample_weights = get_masked_input_and_labels(
    x_all_review
)

mlm_ds = tf.data.Dataset.from_tensor_slices(
    (x_masked_train, y_masked_labels, sample_weights)
)
mlm_ds = mlm_ds.shuffle(1000).batch(config.BATCH_SIZE)

## Create BERT model (Pretraining Model) for masked language modeling

We will create a BERT-like pretraining model architecture
using the `MultiHeadAttention` layer.
It will take token ids as inputs (including masked tokens)
and it will predict the correct ids for the masked input tokens.

In [None]:

def bert_module(query, key, value, i):
    # Multi headed self-attention
    attention_output = layers.MultiHeadAttention(
        num_heads=config.NUM_HEAD,
        key_dim=config.EMBED_DIM // config.NUM_HEAD,
        name="encoder_{}_multiheadattention".format(i),
    )(query, key, value)
    attention_output = layers.Dropout(0.1, name="encoder_{}_att_dropout".format(i))(
        attention_output
    )
    attention_output = layers.LayerNormalization(
        epsilon=1e-6, name="encoder_{}_att_layernormalization".format(i)
    )(query + attention_output)

    # Feed-forward layer
    ffn = keras.Sequential(
        [
            layers.Dense(config.FF_DIM, activation="relu"),
            layers.Dense(config.EMBED_DIM),
        ],
        name="encoder_{}_ffn".format(i),
    )
    ffn_output = ffn(attention_output)
    ffn_output = layers.Dropout(0.1, name="encoder_{}_ffn_dropout".format(i))(
        ffn_output
    )
    sequence_output = layers.LayerNormalization(
        epsilon=1e-6, name="encoder_{}_ffn_layernormalization".format(i)
    )(attention_output + ffn_output)
    return sequence_output


loss_fn = keras.losses.SparseCategoricalCrossentropy(reduction=None)
loss_tracker = keras.metrics.Mean(name="loss")


class MaskedLanguageModel(keras.Model):

    def compute_loss(self, x=None, y=None, y_pred=None, sample_weight=None):

        loss = loss_fn(y, y_pred, sample_weight)
        loss_tracker.update_state(loss, sample_weight=sample_weight)
        return keras.ops.sum(loss)

    def compute_metrics(self, x, y, y_pred, sample_weight):

        # Return a dict mapping metric names to current value
        return {"loss": loss_tracker.result()}

    @property
    def metrics(self):
        # We list our `Metric` objects here so that `reset_states()` can be
        # called automatically at the start of each epoch
        # or at the start of `evaluate()`.
        # If you don't implement this property, you have to call
        # `reset_states()` yourself at the time of your choosing.
        return [loss_tracker]


def create_masked_language_bert_model():
    inputs = layers.Input((config.MAX_LEN,), dtype="int64")

    word_embeddings = layers.Embedding(
        config.VOCAB_SIZE, config.EMBED_DIM, name="word_embedding"
    )(inputs)
    position_embeddings = keras_hub.layers.PositionEmbedding(
        sequence_length=config.MAX_LEN
    )(word_embeddings)
    embeddings = word_embeddings + position_embeddings

    encoder_output = embeddings
    for i in range(config.NUM_LAYERS):
        encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i)

    mlm_output = layers.Dense(config.VOCAB_SIZE, name="mlm_cls", activation="softmax")(
        encoder_output
    )
    mlm_model = MaskedLanguageModel(inputs, mlm_output, name="masked_bert_model")

    optimizer = keras.optimizers.Adam(learning_rate=config.LR)
    mlm_model.compile(optimizer=optimizer)
    return mlm_model


id2token = dict(enumerate(vectorize_layer.get_vocabulary()))
token2id = {y: x for x, y in id2token.items()}


class MaskedTextGenerator(keras.callbacks.Callback):
    def __init__(self, sample_tokens, top_k=5):
        self.sample_tokens = sample_tokens
        self.k = top_k

    def decode(self, tokens):
        return " ".join([id2token[t] for t in tokens if t != 0])

    def convert_ids_to_tokens(self, id):
        return id2token[id]

    def on_epoch_end(self, epoch, logs=None):
        prediction = self.model.predict(self.sample_tokens)

        masked_index = np.where(self.sample_tokens == mask_token_id)
        masked_index = masked_index[1]
        mask_prediction = prediction[0][masked_index]

        top_indices = mask_prediction[0].argsort()[-self.k :][::-1]
        values = mask_prediction[0][top_indices]

        for i in range(len(top_indices)):
            p = top_indices[i]
            v = values[i]
            tokens = np.copy(self.sample_tokens[0])
            tokens[masked_index[0]] = p
            result = {
                "input_text": self.decode(self.sample_tokens[0]),
                "prediction": self.decode(tokens),
                "probability": v,
                "predicted mask token": self.convert_ids_to_tokens(p),
            }
            pprint(result)


sample_tokens = vectorize_layer(["SyntaxError: Unexpected [mask]"])
generator_callback = MaskedTextGenerator(sample_tokens.cpu().numpy())

bert_masked_model = create_masked_language_bert_model()
bert_masked_model.summary()

## Train and Save

In [None]:
bert_masked_model.fit(mlm_ds, epochs=5, callbacks=[generator_callback])
bert_masked_model.save("bert_mlm_stackoverflow.keras")

Epoch 1/5
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
{'input_text': 'syntaxerror unexpected [mask]',
 'predicted mask token': np.str_('in'),
 'prediction': 'syntaxerror unexpected in',
 'probability': np.float32(0.047566507)}
{'input_text': 'syntaxerror unexpected [mask]',
 'predicted mask token': np.str_('not'),
 'prediction': 'syntaxerror unexpected not',
 'probability': np.float32(0.034303784)}
{'input_text': 'syntaxerror unexpected [mask]',
 'predicted mask token': np.str_('error'),
 'prediction': 'syntaxerror unexpected error',
 'probability': np.float32(0.028641336)}
{'input_text': 'syntaxerror unexpected [mask]',
 'predicted mask token': np.str_('to'),
 'prediction': 'syntaxerror unexpected to',
 'probability': np.float32(0.025755713)}
{'input_text': 'syntaxerror unexpected [mask]',
 'predicted mask token': np.str_('i'),
 'prediction': 'syntaxerror unexpected i',
 'probability': np.float32(0.021608267)}
3125/3125 ━━━━━━━━━━━━━━━━━━━━

## Fine-tune a sentiment classification model

We will fine-tune our self-supervised model on a downstream task of sentiment classification.
To do this, let's create a classifier by adding a pooling layer and a `Dense` layer on top of the
pretrained BERT features.

In [None]:
# Load pretrained bert model
mlm_model = keras.models.load_model(
    "bert_mlm_stackoverflow.keras", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)
pretrained_bert_model = keras.Model(
    mlm_model.input, mlm_model.get_layer("encoder_0_ffn_layernormalization").output
)

# Freeze it
pretrained_bert_model.trainable = False


def create_classifier_bert_model():
    inputs = layers.Input((config.MAX_LEN,), dtype="int64")
    sequence_output = pretrained_bert_model(inputs)
    pooled_output = layers.GlobalMaxPooling1D()(sequence_output)
    hidden_layer = layers.Dense(64, activation="relu")(pooled_output)
    outputs = layers.Dense(1, activation="sigmoid")(hidden_layer)
    classifer_model = keras.Model(inputs, outputs, name="classification")
    optimizer = keras.optimizers.Adam()
    classifer_model.compile(
        optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
    )
    return classifer_model


classifer_model = create_classifier_bert_model()
classifer_model.summary()

# Train the classifier with frozen BERT stage
classifer_model.fit(
    train_classifier_ds,
    epochs=5,
    validation_data=test_classifier_ds,
)

# Unfreeze the BERT model for fine-tuning
pretrained_bert_model.trainable = True
optimizer = keras.optimizers.Adam()
classifer_model.compile(
    optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
)
classifer_model.fit(
    train_classifier_ds,
    epochs=5,
    validation_data=test_classifier_ds,
)

Epoch 1/5
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 35s 22ms/step - accuracy: 0.5367 - loss: 0.7642 - val_accuracy: 0.5566 - val_loss: 0.6965
Epoch 2/5
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 41s 26ms/step - accuracy: 0.5638 - loss: 0.6868 - val_accuracy: 0.5719 - val_loss: 0.6767
Epoch 3/5
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 34s 22ms/step - accuracy: 0.5685 - loss: 0.6834 - val_accuracy: 0.5659 - val_loss: 0.6782
Epoch 4/5
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 56s 36ms/step - accuracy: 0.5736 - loss: 0.6779 - val_accuracy: 0.5401 - val_loss: 0.6887
Epoch 5/5
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 57s 37ms/step - accuracy: 0.5759 - loss: 0.6767 - val_accuracy: 0.5582 - val_loss: 0.6869
Epoch 1/5
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 56s 36ms/step - accuracy: 0.5757 - loss: 0.6780 - val_accuracy: 0.5938 - val_loss: 0.6678
Epoch 2/5


<keras.src.callbacks.history.History at 0x78991caf9d10>

## Create an end-to-end model and evaluate it

When you want to deploy a model, it's best if it already includes its preprocessing
pipeline, so that you don't have to reimplement the preprocessing logic in your
production environment. Let's create an end-to-end model that runs
the `TextVectorization` layer inside its `evaluate` method, and then evaluate it. We will pass raw text as input.

In [None]:

# We create a custom Model that overrides the evaluate method so
# that it first preprocesses the text data
class ModelEndtoEnd(keras.Model):

    def evaluate(self, inputs):
        features = encode(inputs.review.values)
        labels = inputs.sentiment.values
        test_classifier_ds = (
            tf.data.Dataset.from_tensor_slices((features, labels))
            .shuffle(1000)
            .batch(config.BATCH_SIZE)
        )
        return super().evaluate(test_classifier_ds)

    # Build the model
    def build(self, input_shape):
        self.built = True


def get_end_to_end(model):
    inputs = model.inputs[0]
    outputs = model.outputs
    end_to_end_model = ModelEndtoEnd(inputs, outputs, name="end_to_end_model")
    optimizer = keras.optimizers.Adam(learning_rate=config.LR)
    end_to_end_model.compile(
        optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
    )
    return end_to_end_model


end_to_end_classification_model = get_end_to_end(classifer_model)
# Pass raw text dataframe to the model
end_to_end_classification_model.evaluate(test_raw_classifier_ds)

1563/1563 ━━━━━━━━━━━━━━━━━━━━ 16s 10ms/step - accuracy: 0.5435 - loss: 1.5129


[1.5215799808502197, 0.5413200259208679]

## Building the Markov Chain for MLM

### Description

This cell defines the `build_markov_chain` function, which constructs a Markov Chain based on bigrams (default `n_gram=2`) for the Masked Language Modeling (MLM) task.

### Input:
- A list of texts (e.g., Stack Overflow questions).

### Process:
1. Applies `custom_standardization` (defined in the BERT code) to clean the text (e.g., remove HTML and punctuation).
2. Splits the text into words and updates the vocabulary (`vocab`).
3. For each sequence of words:
   - Records the transition frequency from the current state (previous word) to the next state (following word).
4. Normalizes the frequencies into transition probabilities.

### Output:
- A dictionary `markov_prob` containing transition probabilities.
- A list `vocab` with the complete vocabulary.

This function is the core of the Markov model, enabling masked word prediction based on local context.
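
As a small illustration of the counting and normalization steps, here is a pure-Python sketch on a three-sentence toy corpus. The function name `bigram_probabilities` and the corpus are made up for the example; it skips the `custom_standardization` step and simply lowercases and splits on whitespace.

```python
from collections import Counter, defaultdict


def bigram_probabilities(sentences):
    """Count word-to-next-word transitions and normalize each row to sum to 1."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous_word, next_word in zip(words, words[1:]):
            counts[previous_word][next_word] += 1
    return {
        previous_word: {
            word: count / sum(next_counts.values())
            for word, count in next_counts.items()
        }
        for previous_word, next_counts in counts.items()
    }


toy_corpus = [
    "unexpected token in json",
    "unexpected token at position",
    "unexpected end of input",
]
toy_probs = bigram_probabilities(toy_corpus)
print(toy_probs["unexpected"])  # "token" gets 2/3 of the mass, "end" gets 1/3
```

Each row of the resulting dictionary is a probability distribution over the words that followed a given word in the corpus.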

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from collections import defaultdict

# Build a Markov chain for MLM (bigrams)
def build_markov_chain(texts, n_gram=2):
    markov_chain = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for text in texts:
        # custom_standardization returns a tf.Tensor, so convert it to a Python str
        cleaned_text = custom_standardization(tf.constant(text)).numpy().decode("utf-8")
        words = cleaned_text.split()
        vocab.update(words)
        if len(words) < n_gram:
            continue
        for i in range(len(words) - n_gram + 1):
            current_state = " ".join(words[i:i+n_gram-1])
            next_state = words[i+n_gram-1]
            markov_chain[current_state][next_state] += 1
    # Normalize counts into probabilities
    markov_prob = {}
    for current_state, transitions in markov_chain.items():
        total = sum(transitions.values())
        markov_prob[current_state] = {word: count/total for word, count in transitions.items()}
    return markov_prob, list(vocab)

## Masked Word Prediction with Markov

### Description

This cell implements and tests masked word prediction using the Markov Chain.

### `predict_masked_word` Function:

#### **Input:**
- A text containing `[mask]`
- The transition probability dictionary `markov_prob`
- The position of `[mask]` in the text
- The n-gram size (default: `2`)

#### **Process:**
1. Cleans the text.
2. Identifies the current state (words before `[mask]`).
3. Returns the most probable word and its probability based on `markov_prob`.

#### **Output:**
- A tuple containing the predicted word and its probability (e.g., `("error", 0.35)`).

### **Testing:**
- Builds the Markov Chain using all available data (`all_data["review"]`).
- Tests with the example `"SyntaxError unexpected [mask]"`, simulating BERT's callback.
- Displays the prediction for qualitative analysis.

This cell replicates the MLM task of BERT in a simplified manner, allowing direct comparison with BERT’s predictions.
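
The lookup itself is small enough to sketch on a hand-written probability table. The names `predict_from_previous_word` and `toy_probs`, and the probabilities in the table, are invented for illustration and are not results from the dataset.

```python
def predict_from_previous_word(words, mask_index, probs, mask_token="[mask]"):
    """Return the most probable word after the word preceding the mask."""
    if mask_index == 0 or words[mask_index] != mask_token:
        return None, 0.0  # no left context, or the position is not a mask
    candidates = probs.get(words[mask_index - 1], {})
    if not candidates:
        return None, 0.0  # the previous word was never seen as a state
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


# Hypothetical transition table, not learned from real data
toy_probs = {"unexpected": {"token": 0.6, "end": 0.3, "error": 0.1}}
toy_words = "syntaxerror unexpected [mask]".split()
word, probability = predict_from_previous_word(
    toy_words, toy_words.index("[mask]"), toy_probs
)
print(word, probability)  # token 0.6
```

Note that a bigram chain only looks at the word to the left of the mask, whereas the BERT model attends to context on both sides.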

In [None]:
# Predict a masked word with the Markov chain
def predict_masked_word(text, markov_prob, mask_position, n_gram=2):
    # custom_standardization returns a tf.Tensor, so convert it to a Python str
    cleaned_text = custom_standardization(tf.constant(text)).numpy().decode("utf-8")
    words = cleaned_text.split()
    if len(words) < n_gram or mask_position < n_gram-1 or mask_position >= len(words):
        return "unknown", 0.0  # Default when there is not enough context
    current_state = " ".join(words[mask_position-n_gram+1:mask_position])
    if current_state in markov_prob:
        next_words = markov_prob[current_state]
        if next_words:
            predicted_word = max(next_words.items(), key=lambda x: x[1])[0]
            probability = next_words[predicted_word]
            return predicted_word, probability
    return "unknown", 0.0

# Test MLM with the Markov chain
markov_prob, vocab = build_markov_chain(all_data["review"].values)
sample_text = "SyntaxError unexpected [mask]"
words = custom_standardization(tf.constant(sample_text)).numpy().decode("utf-8").split()
mask_position = words.index("[mask]")
predicted_word, prob = predict_masked_word(sample_text, markov_prob, mask_position)
print(f"Markov Prediction for '{sample_text}':")
print(f"Predicted word: '{predicted_word}', Probability: {prob:.4f}")

## Feature Extraction for Classification

### Description

This cell prepares data for a binary classification task using features derived from the Markov Chain.

### `extract_markov_features` Function:

#### **Input:**
- A list of texts
- The transition probability dictionary `markov_prob`

#### **Process:**
1. Computes the average transition probabilities of bigrams in each text.
2. Generates a single numerical feature per text.

#### **Output:**
- A NumPy array with shape `(n_samples, 1)`, ready for use in a classifier.

### **Preparation:**
- Generates features for the training set (`X_train_markov`) and test set (`X_test_markov`).
- Defines labels `y_train` and `y_test` based on sentiment (`Score >= 1` or `< 1`).
- Keeps `test_raw_classifier_ds` as the test DataFrame for compatibility with BERT.

This approach transforms Markov probabilities into a simple representation for classification, contrasting with BERT’s rich embeddings.
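
To make the feature concrete, the sketch below averages the transition probabilities of the bigrams found in one short text. The helper name `mean_transition_probability` and the table values are hypothetical; as in the notebook's function, bigrams that are missing from the table are skipped rather than counted as zero.

```python
def mean_transition_probability(words, probs):
    """Average probability over the bigrams in `words` that appear in `probs`."""
    seen = [
        probs[previous_word][next_word]
        for previous_word, next_word in zip(words, words[1:])
        if next_word in probs.get(previous_word, {})
    ]
    return sum(seen) / len(seen) if seen else 0.0


# Hypothetical transition table, not learned from real data
toy_probs = {
    "unexpected": {"token": 0.5, "end": 0.5},
    "token": {"in": 0.25, "at": 0.75},
}
feature = mean_transition_probability("unexpected token in json".split(), toy_probs)
print(feature)  # (0.5 + 0.25) / 2 = 0.375; the unseen bigram "in json" is skipped
```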

In [None]:

# Extract features for classification (mean of the transition probabilities)
def extract_markov_features(texts, markov_prob, n_gram=2):
    features = []
    for text in texts:
        # custom_standardization returns a tf.Tensor, so convert it to a Python str
        cleaned_text = custom_standardization(tf.constant(text)).numpy().decode("utf-8")
        words = cleaned_text.split()
        if len(words) < n_gram:
            features.append(0.0)
            continue
        prob_sum = 0
        count = 0
        for i in range(len(words) - n_gram + 1):
            current_state = " ".join(words[i:i+n_gram-1])
            next_word = words[i+n_gram-1]
            if current_state in markov_prob and next_word in markov_prob[current_state]:
                prob_sum += markov_prob[current_state][next_word]
                count += 1
        features.append(prob_sum / count if count > 0 else 0.0)
    return np.array(features).reshape(-1, 1)

# Prepare data for classification
X_train_markov = extract_markov_features(train_df["review"].values, markov_prob)
X_test_markov = extract_markov_features(test_df["review"].values, markov_prob)
y_train = train_df["sentiment"].values
y_test = test_df["sentiment"].values

test_raw_classifier_ds = test_df

## Training and Evaluation of the Markov Classifier

### Description

This cell trains and evaluates a classifier based on features extracted from the Markov model.

### **Training:**
- Uses `LogisticRegression` with the extracted features (`X_train_markov`) and labels (`y_train`).

### **Evaluation:**
- Computes predictions for the test set (`X_test_markov`).
- Measures accuracy using `accuracy_score`.

### **Output:**
- Prints the accuracy of the Markov model, enabling a quantitative comparison with BERT.

Logistic regression is a simple yet effective choice for classification using a single feature, reflecting the minimalistic approach of the Markov model.
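
With a single input feature, a fitted logistic regression amounts to a threshold on that feature: the predicted probability crosses 0.5 where `weight * x + bias` equals zero. The sketch below shows this with made-up coefficients; `weight` and `bias` are hypothetical values, not the ones learned in this notebook.

```python
import math


def predict_proba(x, weight, bias):
    """Probability of class 1 under a one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def decision_threshold(weight, bias):
    """Feature value at which the predicted probability equals 0.5."""
    return -bias / weight


weight, bias = 4.0, -1.0  # hypothetical fitted coefficients
threshold = decision_threshold(weight, bias)
print(threshold)                             # 0.25
print(predict_proba(threshold, weight, bias))  # 0.5
```

For a positive weight, every text whose feature exceeds the threshold is assigned class 1, so the classifier cannot separate texts beyond what this one number captures.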

In [None]:
# Train the classifier on the Markov features
markov_classifier = LogisticRegression(max_iter=1000)
markov_classifier.fit(X_train_markov, y_train)

# Evaluate the Markov model
y_pred_markov = markov_classifier.predict(X_test_markov)
markov_accuracy = accuracy_score(y_test, y_pred_markov)
print(f"\nMarkov model accuracy (classification): {markov_accuracy:.4f}")

# Compare with BERT (assuming the BERT code above has already been run)
bert_accuracy = end_to_end_classification_model.evaluate(test_raw_classifier_ds)[1]  # Take the accuracy
print(f"BERT model accuracy (classification): {bert_accuracy:.4f}")

# Comparison
print("\nComparison of Results (Binary Classification):")
print(f"Markov model - Accuracy: {markov_accuracy:.4f}")
print(f"BERT model - Accuracy: {bert_accuracy:.4f}")
print(f"Difference (BERT - Markov): {bert_accuracy - markov_accuracy:.4f}")

# MLM comparison (qualitative)
print("\nMLM comparison (qualitative example):")
print(f"BERT Prediction (Epoch 5): 'error' with probability 0.4701")
print(f"Markov Prediction: '{predicted_word}' with probability {prob:.4f}")