# BERT

### Varakala Raja Sree - 220150028

#### Motivation

I chose BERT because it's a foundational model in modern NLP that introduced deep bidirectional context understanding using the Transformer architecture. It significantly improved performance across a wide range of language tasks and also serves as a backbone for many multimodal applications I’ve studied, such as visual question answering and image captioning. Understanding BERT gives me insight into how language models work internally and connects directly with both my NLP and multimodal learning background.

#### Connecting with past and current work done

In the past, when working with multimodal learning—like combining images with text for tasks such as image captioning or visual question answering—I noticed that text understanding often relied on shallow models or basic statistical methods like TF-IDF. These didn’t capture the full meaning or context of words, especially in complex scenarios.

Then came models like Word2Vec and GloVe, which helped by giving words vector representations based on their usage. But still, they had limitations—like not understanding word meaning in different contexts. For example, the word ‘bank’ in ‘river bank’ vs. ‘savings bank’.

BERT changed that. It introduced a way to read and understand text from both directions—so it gets the full picture of what each word means depending on its surroundings. This deep understanding is now at the heart of many multimodal systems. Models like VisualBERT and ViLBERT actually build on BERT by combining this rich text representation with visual data.

So BERT doesn’t just improve NLP—it directly powers a lot of the vision + language systems I’ve worked on. That’s why understanding BERT felt like a natural and necessary next step in my learning journey.

##### My Learning from BERT (Bidirectional Encoder Representations from Transformers)

##### Why BERT and not LSTMs?
LSTMs (even bidirectional ones) process text sequentially and struggle with long-range dependencies.
A bidirectional LSTM runs separate left-to-right and right-to-left passes and only combines their outputs afterwards, so neither pass sees both sides of a word at once.
Transformers, on the other hand, process all words in parallel and use self-attention to relate every word to every other word in the sentence.
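
To make the attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The toy sizes and random weight matrices (`W_q`, `W_k`, `W_v`) are assumptions for illustration only; a real Transformer layer adds multiple heads, residual connections, and layer normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X has shape (seq_len, d_model); every position attends to every other one.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)     # each row is a distribution over positions
    return weights @ V, weights

# Toy, randomly initialised example (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

context, weights = self_attention(X, W_q, W_k, W_v)
print(context.shape)         # (5, 4)
print(weights.sum(axis=1))   # every row sums to 1 (up to rounding)
```

Because the score matrix covers all pairs of positions, each word's output vector can draw on words to its left and to its right in a single step.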

##### BERT = Just the Encoder from Transformer
While the original Transformer has encoder + decoder (used in machine translation), BERT keeps only the encoder.
This encoder learns contextual word embeddings by attending to all words in both directions at once (bidirectional).
Result: deep understanding of word meaning in context.

##### Two Phases of BERT Training
###### Pre-training
Done using unsupervised (self-supervised) learning on huge unlabeled corpora (like BooksCorpus + Wikipedia).
###### Two tasks:

Masked Language Modeling (MLM): Predict the masked word in a sentence using context from both sides.

Next Sentence Prediction (NSP): Predict whether the second sentence actually follows the first one in the original text.
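
The MLM input corruption can be sketched in plain Python as below. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token come from the BERT paper rather than from the notes above; the function name `mask_tokens`, the toy sentence pair, and the tiny vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted inputs, labels); a label is None where no loss is computed."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_prob:
            labels.append(tok)                    # the model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)                   # not selected: ignored by the loss
    return inputs, labels

# A toy sentence pair in BERT's [CLS] A [SEP] B [SEP] layout.
tokens = ["[CLS]", "the", "man", "went", "to", "the", "store", "[SEP]",
          "he", "bought", "milk", "[SEP]"]
vocab = sorted(set(tokens) - SPECIAL)

inputs, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=1)
print(inputs)
print(labels)
```

The same `[CLS] A [SEP] B [SEP]` layout serves NSP: the label is simply whether sentence B really followed sentence A in the corpus.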

Inputs include:
Token embeddings (from WordPiece vocab),
Segment embeddings (sentence IDs),
Position embeddings (word positions).
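
As a rough sketch of how these three inputs are combined, the snippet below looks up one vector per token from each table and adds them element-wise. The table sizes and token IDs are toy assumptions (BERT-base uses a hidden size of 768), and the layer normalization and dropout that follow in the real model are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 100, 16, 8   # toy sizes, not BERT's real ones

# Three lookup tables; learned in the real model, random here.
token_table = rng.normal(size=(vocab_size, hidden))
segment_table = rng.normal(size=(2, hidden))       # sentence A = 0, sentence B = 1
position_table = rng.normal(size=(max_len, hidden))

# Hypothetical IDs for a two-sentence input; ID 2 plays the role of [SEP].
token_ids = np.array([1, 17, 42, 2, 58, 9, 2])
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])
position_ids = np.arange(len(token_ids))

# The encoder input is the element-wise sum of the three embeddings.
embeddings = (token_table[token_ids]
              + segment_table[segment_ids]
              + position_table[position_ids])
print(embeddings.shape)   # (7, 8)
```

The same token ID at two different positions (ID 2 here) gets two different input vectors, because its segment and position embeddings differ.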

Outputs:
A binary prediction for NSP,
A distribution over the vocabulary for MLM (the loss is computed only on the masked positions).
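
The "loss only on masked positions" point can be illustrated with a small NumPy sketch. The function name `masked_lm_loss`, the random logits, and the target IDs are hypothetical; this is an illustration of the idea, not BERT's actual training code.

```python
import numpy as np

def masked_lm_loss(logits, target_ids, is_masked):
    """Mean cross-entropy over the masked positions only.

    logits: (seq_len, vocab_size), target_ids: (seq_len,), is_masked: (seq_len,) bool.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return token_nll[is_masked].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))                  # toy model outputs
target_ids = np.array([3, 1, 4, 1, 5, 9])          # original token IDs
is_masked = np.array([False, True, False, False, True, False])

loss = masked_lm_loss(logits, target_ids, is_masked)

# Changing the prediction at an unmasked position leaves the loss untouched.
perturbed = logits.copy()
perturbed[0] = rng.normal(size=10)
loss_after = masked_lm_loss(perturbed, target_ids, is_masked)
print(loss, loss_after)
```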

###### Fine-tuning
For specific NLP tasks like:
Question Answering (e.g., SQuAD)
Named Entity Recognition
Sentiment Analysis

You just replace the output layer, feed in task-specific data, and train for a short time.
The rest of the model already carries what it learned during pre-training; you're just adapting it to your task.

##### How it connects to my Multimodal Learning
In multimodal tasks (like VQA, image captioning, audio-text fusion), understanding language context is crucial.

BERT gives rich, bidirectional embeddings of text that can be combined with image/video/audio features.

Models like VisualBERT, UNITER, ViLBERT directly use BERT-style architectures as text encoders and fuse them with visual data.

So, understanding BERT is like unlocking the text half of multimodal learning.

##### A Small Experiment

##### Classify spam vs. non-spam emails using BERT

In [1]:
!pip install tensorflow-hub


In [3]:
!pip install tensorflow-text


In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

In [3]:
import pandas as pd

df = pd.read_csv("spam.csv")
df.head(5)

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [5]:
import numpy as np  # needed for np.__version__ below

print("TensorFlow version:", tf.__version__)
print("TensorFlow Hub version:", hub.__version__)
print("TensorFlow Text version:", text.__version__)
print("NumPy version:", np.__version__)

TensorFlow version: 2.10.1
TensorFlow Hub version: 0.16.1
TensorFlow Text version: 2.10.0
NumPy version: 1.23.5


In [7]:

df.groupby('Category').describe()

| Category | Message count | Message unique | Message top | Message freq |
|---|---|---|---|---|
| ham | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam | 747 | 641 | Please call our customer service representativ... | 4 |


In [16]:
df['spam']=df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

| | Category | Message | spam |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


#### Split the data into training and test sets

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['Message'],df['spam'], stratify=df['spam'])

In [20]:
X_train.head(4)

3877                               did u get that message
1724    Hi Jon, Pete here, Ive bin 2 Spain recently & ...
4784    Especially since i talk about boston all up in...
4985    goldviking (29/M) is inviting you to be his fr...
Name: Message, dtype: object

##### Now let's import the BERT model and get embedding vectors for a few sample sentences

In [22]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

In [25]:
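def get_sentence_embeding(sentences):
    # Tokenize raw strings into the dict of input_word_ids / input_type_ids / input_mask.
    preprocessed_text = bert_preprocess(sentences)
    # pooled_output is one [CLS]-based vector per sentence, shape (batch_size, 768).
    return bert_encoder(preprocessed_text)['pooled_output']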

get_sentence_embeding([
    "500$ discount. hurry up", 
    "Bhavin, are you up for a volleybal game tomorrow?"]
)

<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.843517  , -0.51327264, -0.8884572 , ..., -0.74748856,
        -0.75314724,  0.91964495],
       [-0.8720835 , -0.50543964, -0.94446677, ..., -0.85847515,
        -0.71745354,  0.8808298 ]], dtype=float32)>

In [28]:
e = get_sentence_embeding([
    "banana", 
    "grapes",
    "mango",
    "jeff bezos",
    "elon musk",
    "bill gates"
]
)

In [30]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([e[0]],[e[1]])

array([[0.9911088]], dtype=float32)

In [32]:
cosine_similarity([e[0]],[e[3]])

array([[0.84703845]], dtype=float32)

In [34]:
cosine_similarity([e[3]],[e[4]])

array([[0.9872036]], dtype=float32)
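
For reference, the quantity that `cosine_similarity` reports can be written out by hand. The sketch below is a plain-Python version with toy 2-D vectors; the helper name `cosine` is hypothetical and it is not how scikit-learn implements it internally.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [0, 1]))    # orthogonal vectors: 0
print(cosine([1, 2], [2, 4]))    # parallel vectors: 1 (up to rounding)
print(cosine([1, 0], [-1, 0]))   # opposite vectors: -1
```

In the outputs above, even the unrelated pair (banana vs. jeff bezos) scores about 0.85, so with `pooled_output` vectors the relative ordering of similarities seems more informative than their absolute values.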

In [36]:
# Bert layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

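# Neural network layers (hub.KerasLayer defaults to trainable=False, so BERT stays frozen
# and only the Dense layer's weights are trained)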
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)

# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs = [l])

In [38]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_word_ids':   0           ['text[0][0]']                   
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128)}                                                      

In [40]:
len(X_train)

4179

In [42]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
model.fit(X_train, y_train, epochs=5)

In [None]:
model.evaluate(X_test, y_test)
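
Since the classes are imbalanced (4825 ham vs. 747 spam in the `describe()` output earlier), accuracy alone can look better than it is. The sketch below computes the accuracy of a trivial "always ham" baseline from those counts and shows a hand-written precision/recall helper; the `y_true` and `y_pred` lists are hypothetical and are not results from this model.

```python
ham, spam = 4825, 747   # class counts from the groupby output

# A classifier that always answers "ham" is already right this often.
baseline_accuracy = ham / (ham + spam)
print(f"always-ham accuracy: {baseline_accuracy:.3f}")

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (spam = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and predictions, only to exercise the helper.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Any accuracy reported by `model.evaluate` is worth reading against that baseline of roughly 0.866.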

##### Inference

In [None]:
reviews = [
    'Reply to win £100 weekly! Where will the 2006 FIFA World Cup be held? Send STOP to 87239 to end service',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your ."
]
model.predict(reviews)
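
Because the output layer is a single sigmoid unit, `model.predict` returns one probability per message. A minimal sketch of turning those into labels is shown below; the helper name `to_labels`, the 0.5 threshold, and the probability values are all assumptions, not outputs of the model above.

```python
def to_labels(predictions, threshold=0.5):
    """Map sigmoid outputs of shape (n, 1) to 'spam' / 'ham' strings."""
    return ["spam" if row[0] > threshold else "ham" for row in predictions]

# Hypothetical probabilities in the (n, 1) layout that predict() returns.
hypothetical_predictions = [[0.93], [0.88], [0.71], [0.04], [0.10]]
labels = to_labels(hypothetical_predictions)
print(labels)   # ['spam', 'spam', 'spam', 'ham', 'ham']
```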

##### Things that surprised me!

Both encoder and decoder understand language: It was surprising to realize that the encoder and the decoder each develop their own understanding of language. That is what allows them to be used separately; a stack of encoders alone forms BERT, which is used for tasks like question answering and sentiment analysis.

BERT doesn’t use decoders at all: It was surprising to learn that BERT only uses the encoder part of the original transformer architecture, yet is so powerful for a wide range of NLP tasks.

Pre-training with unsupervised tasks: BERT learns language understanding through masked language modeling (fill-in-the-blanks) and next sentence prediction without any human-provided labels, which I found both clever and unexpected.
