# Implementation from TensorFlow Hub

In [1]:
!pip install tensorflow_text

In [2]:
import tensorflow_hub as hub
import tensorflow_text as text  # imported for its side effect: registers the ops the preprocessor needs
from sklearn.utils import resample

In [3]:
preprocessor = hub.KerasLayer("https://kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-preprocess/3")
model = hub.KerasLayer("https://www.kaggle.com/models/tensorflow/bert/TensorFlow2/bert-en-uncased-l-6-h-768-a-12/2", trainable=True)


# What the preprocessor does

The preprocessor prepares raw text so that it can be fed into the BERT model.

Think of it as a text-to-tensors converter.
It handles all the text preprocessing that’s required for BERT to understand the input.

| Step                         | Description                                                                                                        |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Tokenization**             | Splits text into word pieces (subwords) using the *WordPiece* tokenizer. Illustrative example: “playing” → [“play”, “##ing”]; very frequent words may be kept as a single token in the actual vocabulary. |
| **Lowercasing**              | Because it’s an “uncased” model, it converts text to lowercase.                                                    |
| **Adding special tokens**    | Adds `[CLS]` (classification token) at the start and `[SEP]` (separator token) between or at the end of sentences. |
| **Creating attention masks** | Marks which tokens are real (1) vs. padding (0) so the model ignores padding.                                      |
| **Creating segment IDs**     | If you have two sentences, marks which tokens belong to sentence A and which to sentence B.                        |
| **Output formatting**        | Produces a dictionary of three tensors: `input_word_ids`, `input_mask`, and `input_type_ids`.                      |

Output format:

{
    "input_word_ids": ...,
    "input_mask": ...,
    "input_type_ids": ...
}

# input_word_ids

→ What it is:
A sequence of token IDs (integers) that represent your text after tokenization.

→ How it’s made:

Your sentence is split into subwords using the WordPiece tokenizer.

Each subword (like "play", "##ing") is mapped to an integer ID from the BERT vocabulary.

Special tokens like [CLS] and [SEP] are added.

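# Example
text = "Playing football"

tokens = ["[CLS]", "play", "##ing", "football", "[SEP]"]

input_word_ids = [101, 2655, 2075, 2374, 102]

(The subword split and the middle IDs are illustrative. 101 and 102 are the IDs of [CLS] and [SEP] in the uncased BERT vocabulary.)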
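The tokenization steps above can be sketched in plain Python. This is a minimal illustration of greedy longest-match-first WordPiece splitting, not the preprocessor's actual implementation. The vocabulary `TOY_VOCAB` and all of its IDs are hypothetical, and the helper names `wordpiece` and `encode` are made up for this example.

```python
# Hypothetical toy vocabulary; the real BERT vocabulary has about 30,000 entries
# and different IDs.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "play": 4, "##ing": 5, "football": 6}


def wordpiece(word, vocab):
    """Split one lowercase word into subwords by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, apply WordPiece, add [CLS] and [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, input_word_ids = encode("Playing football", TOY_VOCAB)
print(tokens)          # ['[CLS]', 'play', '##ing', 'football', '[SEP]']
print(input_word_ids)  # [2, 4, 5, 6, 3]
```

Because the toy vocabulary has no entry for "playing", the word is split into "play" and "##ing". With a vocabulary that contains the whole word, the same function would return it unsplit.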

# input_mask

→ What it is:
A binary mask indicating which tokens are real words and which are padding.

→ Why it’s needed:
When BERT processes sentences of varying lengths, shorter ones are padded to a fixed length (say 128 tokens).
We don’t want the model to attend to padding tokens, so we use a mask.

Text: "Playing football"
Tokenized length: 5
Padded length: 8

input_mask = [1, 1, 1, 1, 1, 0, 0, 0]
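As a rough sketch of how such a mask can be built and used: the first helper pads a list of token IDs and produces the matching mask, and the second shows the idea of a masked softmax, where padded positions receive zero attention weight. The function names and the three middle token IDs are hypothetical. Real implementations typically add a large negative bias to the masked scores instead of using negative infinity.

```python
import math


def pad_and_mask(token_ids, seq_length, pad_id=0):
    """Pad token_ids to seq_length and return (padded_ids, mask).

    Truncation here is naive (it simply cuts the list); a real preprocessor
    would keep the final [SEP] token.
    """
    ids = token_ids[:seq_length]
    n_pad = seq_length - len(ids)
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad


def masked_softmax(scores, mask):
    """Softmax over scores in which positions with mask == 0 get zero weight."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# 101 and 102 stand for [CLS] and [SEP]; 7, 8, 9 are made-up token IDs.
ids, input_mask = pad_and_mask([101, 7, 8, 9, 102], seq_length=8)
print(ids)         # [101, 7, 8, 9, 102, 0, 0, 0]
print(input_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

# Equal raw scores everywhere: only the five real tokens share the weight.
weights = masked_softmax([0.5] * 8, input_mask)
print(weights)     # [0.2, 0.2, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0]
```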


# input_type_ids

→ What it is:
Indicates which sentence a token belongs to.
This is useful when BERT is trained on sentence pairs, such as question-answer or next sentence prediction tasks.

["This is nice.", "Yes it is."]

[CLS] this is nice . [SEP] yes it is . [SEP]

input_type_ids = [0,0,0,0,0,0, 1,1,1,1,1]
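The segment assignment can be sketched as follows, assuming the two sentences are already tokenized. The helper name `build_pair` is hypothetical. Sentence A, together with [CLS] and the first [SEP], gets type 0, and sentence B with its closing [SEP] gets type 1.

```python
def build_pair(tokens_a, tokens_b):
    """Join two token lists BERT-style and return (tokens, type_ids)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # +2 covers [CLS] and the first [SEP]; +1 covers the final [SEP].
    type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, type_ids


pair_tokens, input_type_ids = build_pair(
    ["this", "is", "nice", "."],
    ["yes", "it", "is", "."],
)
print(pair_tokens)
print(input_type_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Note that the punctuation marks are tokens of their own, so each one gets a type ID as well.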



The BERT preprocessor does not perform lemmatization or stemming.
Those are linguistic preprocessing steps used in classical NLP pipelines, not in transformer-based models.
BERT is pretrained on large text corpora and handles word variants through subword tokens and context, so it doesn’t need lemmatization.

# What the model does

The model (the second hub.KerasLayer) is the actual BERT neural network.
It takes the tensors produced by the preprocessor and generates contextual embeddings: one vector per token, plus one vector for the whole input.





In [4]:
text = ["I love playing cricket","Cricket is a great game","It is very popular worldwide"]

In [5]:
preprocessed_text = preprocessor(text)
preprocessed_text.keys()

dict_keys(['input_type_ids', 'input_mask', 'input_word_ids'])

In [None]:
preprocessed_text['input_mask']

In [None]:
preprocessed_text['input_word_ids']

In [9]:
bert_result = model(preprocessed_text)
bert_result.keys()

dict_keys(['pooled_output', 'sequence_output', 'encoder_outputs', 'default'])

# pooled_output
A single vector (embedding) that represents the entire sentence.

→ What it’s used for:
This is mainly used for sentence-level tasks, such as:

Sentiment classification

Topic detection

Next sentence prediction

Basically, if you want to classify the whole input text, use pooled_output.
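For intuition, here is a minimal sketch of how a pooled vector is derived in the original BERT design: the final hidden state of the [CLS] token (the first position) is passed through a dense layer followed by tanh. The tiny hidden size, the identity weight matrix, and the function name `pooler` are assumptions chosen for illustration. The real layer uses learned 768×768 weights.

```python
import math


def pooler(sequence_output, weight, bias):
    """Compute tanh(W @ h_cls + b) for one sentence.

    sequence_output: list of per-token vectors; position 0 is [CLS].
    weight: square matrix as a list of rows; bias: list of floats.
    """
    h_cls = sequence_output[0]
    return [
        math.tanh(sum(w * h for w, h in zip(row, h_cls)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy input: 2 tokens, hidden size 3 (instead of 768).
toy_sequence_output = [[0.5, -1.0, 2.0], [0.1, 0.2, 0.3]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0, 0.0]

pooled = pooler(toy_sequence_output, identity, zero_bias)
print(len(pooled))  # 3: one value per hidden dimension
print(pooled)
```

Because of the tanh, every component of the pooled vector lies strictly between -1 and 1.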

In [14]:
bert_result['pooled_output'][0].shape  # embedding of the first sentence


# pooled_output has shape [batch_size, 768]; 768 is the size of the embedding vector

TensorShape([768])

# sequence_output

→ What it is:

This is the hidden representation of every token in the input sequence after all the transformer layers.

Shape  [batch_size, sequence_length, hidden_size]

[1,128,768]

1==> Batch size

128 ==> Maximum number of tokens per sentence (sequence length). The 128 includes the [CLS] and [SEP] tokens.

768 ==> Embedding dimension for each token

Note: these values vary with the version of BERT used.
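Since the sequence output holds one vector per position, padding included, one common way to turn it into a single sentence vector is to average only the real tokens, using the input mask. The sketch below assumes plain Python lists in place of tensors and a tiny hidden size. The function name `masked_mean` is hypothetical.

```python
def masked_mean(token_vectors, mask):
    """Average the token vectors whose mask entry is 1, dimension by dimension."""
    hidden_size = len(token_vectors[0])
    n_real = sum(mask)
    return [
        sum(vec[d] for vec, m in zip(token_vectors, mask) if m == 1) / n_real
        for d in range(hidden_size)
    ]


# One sentence: 3 positions, hidden size 2; the last position is padding.
toy_token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

sentence_vector = masked_mean(toy_token_vectors, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padded position contributes nothing to the average, which is the same reason the model needs the mask internally.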

In [16]:
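bert_result['sequence_output'].shape, bert_result['pooled_output'].shape

# sequence_output: [batch_size, seq_length, 768], one 768-dimensional vector per token
# pooled_output: [batch_size, 768], one vector per sentence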

(TensorShape([3, 128, 768]), TensorShape([3, 768]))

In [None]:
len(bert_result['encoder_outputs'])  # 6 = number of encoder (transformer) layers in this model

6

In [None]:
bert_result['encoder_outputs'][0]

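# Each item is the output of one transformer layer, with the same shape as sequence_output; the last item equals sequence_output. This is helpful for layer-wise feature extraction. Note that these are hidden states, not attention weights.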
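As one possible form of layer-wise feature extraction, the outputs of the last few layers can be averaged element-wise. The sketch below uses nested Python lists in place of tensors, for a single sentence. The function name `average_layers` and the toy numbers are made up for illustration.

```python
def average_layers(encoder_outputs, last_n):
    """Element-wise mean of the last `last_n` layer outputs for one sentence.

    encoder_outputs: list of layers, each a list of per-token vectors.
    """
    chosen = encoder_outputs[-last_n:]
    seq_len, hidden_size = len(chosen[0]), len(chosen[0][0])
    return [
        [sum(layer[t][d] for layer in chosen) / last_n for d in range(hidden_size)]
        for t in range(seq_len)
    ]


# 3 toy layers, 2 tokens, hidden size 2 (the notebook's model has 6 layers of size 768).
toy_encoder_outputs = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]

features = average_layers(toy_encoder_outputs, last_n=2)
print(features)  # [[2.0, 3.0], [4.0, 5.0]]
```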

# Implementation from Hugging Face Transformers

In [None]:
!pip install transformers

In [25]:
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

text = "ChatGPT is a language model developed by OpenAI."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)

print(outputs.keys())  # odict_keys(['last_hidden_state', 'pooler_output'])


odict_keys(['last_hidden_state', 'pooler_output'])


In [28]:
outputs['pooler_output'].shape

torch.Size([1, 768])
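The two libraries name the same inputs and outputs differently. The sketch below records that correspondence and renames the keys of a dictionary. It only renames keys and does not convert between TensorFlow and PyTorch tensors, and the helper `rename_keys` is a hypothetical name. Keep in mind that the two sections also load different checkpoints (a 6-layer model above, `bert-base-uncased` here), so the actual values will not match.

```python
# Key correspondence between the TF Hub layers and Hugging Face Transformers.
TF_HUB_TO_HF_INPUTS = {
    "input_word_ids": "input_ids",
    "input_mask": "attention_mask",
    "input_type_ids": "token_type_ids",
}
TF_HUB_TO_HF_OUTPUTS = {
    "sequence_output": "last_hidden_state",
    "pooled_output": "pooler_output",
}


def rename_keys(batch, mapping):
    """Return a copy of `batch` with keys renamed according to `mapping`."""
    return {mapping.get(key, key): value for key, value in batch.items()}


# Plain lists stand in for tensors; 101 and 102 are [CLS] and [SEP].
tf_style = {
    "input_word_ids": [[101, 102]],
    "input_mask": [[1, 1]],
    "input_type_ids": [[0, 0]],
}
hf_style = rename_keys(tf_style, TF_HUB_TO_HF_INPUTS)
print(sorted(hf_style))  # ['attention_mask', 'input_ids', 'token_type_ids']
```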